QAT (Quantization Aware Training) Implementation Deep Dive - Accuracy Preservation Strategies for Model Compression¶
This is a follow-up to this morning's article
Morning article: AI Daily News - September 12, 2025 (archived)
Goals¶
- Master implementation patterns to avoid PTQ (Post-Training Quantization) accuracy degradation
- Specific methods to keep accuracy loss under 1% during INT8/INT4 conversion
- Establish criteria for choosing between QAT/QAD (Quantization Aware Distillation)
Architecture Overview¶
QAT is a technique that maintains high accuracy even in low-precision environments by making the model aware of quantization during training.
# QAT Pipeline Flow
Model Training → Insert Fake Quantization →
Fine-tuning with Quantization → Export INT8 Model
Implementation Steps¶
Step 1: QAT Setup with PyTorch¶
import torch
import torch.quantization as quant
# Insert fake quantization nodes into model
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)
model.train()
# Retrain for 3-5 epochs with normal training loop
for epoch in range(3):
for batch in dataloader:
outputs = model(batch['input'])
loss = criterion(outputs, batch['target'])
optimizer.zero_grad()
loss.backward()
optimizer.step()
Step 2: Convert to Quantized Model¶
# Switch to eval mode for calibration
model.eval()
model_int8 = quant.convert(model, inplace=False)
# Size comparison
original_size = os.path.getsize('model_fp32.pth')
quantized_size = os.path.getsize('model_int8.pth')
print(f"Compression ratio: {original_size/quantized_size:.2f}x")
Step 3: TensorRT Optimization¶
import tensorrt as trt
# Convert QAT model to TensorRT engine
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = calibrator
engine = builder.build_engine(network, config)
# Inference speed: 2-4x speedup (GPU dependent)
Benchmark Results¶
| Method | Model Size | Accuracy Drop | Inference Speedup | Implementation Difficulty |
|---|---|---|---|---|
| PTQ (INT8) | 25% | 2-5% | 2.5x | Low |
| QAT (INT8) | 25% | <1% | 2.5x | Medium |
| QAD (INT8) | 25% | <0.5% | 2.5x | High |
| Dynamic Quant | 40% | 1-2% | 1.8x | Low |
Failure Patterns and Mitigation¶
| Symptom | Cause | Mitigation |
|---|---|---|
| Training doesn't converge | Learning rate too high | Adjust LR to 1/10 for QAT |
| Sudden accuracy drop | BatchNorm handling error | Execute BN fusion before prepare_qat |
| INT4 accuracy collapse | Excessive compression | Apply per-layer quantization settings |
| Runtime errors | Insufficient calibration | Run 100+ batches with representative data |
Implementation Selection Criteria¶
# Decision flowchart
if accuracy_drop_tolerance < 1.0: # Accuracy focused
if has_training_data:
use_QAT() # 3-5 epochs retraining
else:
use_QAD() # Distillation from teacher model
else: # Speed focused
use_PTQ() # Calibration only
Next Steps¶
Build upon this QAT implementation to achieve further model optimization and production deployment improvements.