Need lower latency on H100/Blackwell without the accuracy hit of aggressive INT quantization.
→Use FP8 (E4M3) quantization via TensorRT-LLM; Hopper and Blackwell have native FP8 Tensor Cores.
Why: FP8 preserves dynamic range better than INT8 and runs at full hardware speed on Hopper+, giving near-FP16 quality at INT8-class throughput.
Reference↗
Model barely fits in GPU memory and throughput is memory-bandwidth-bound.
→Apply INT4 weight-only quantization (AWQ or GPTQ); keep activations in FP16/FP8.
Why: Weight-only INT4 roughly halves memory versus INT8 and relieves bandwidth pressure; activation precision stays high so accuracy loss is small.
Deciding between post-training quantization and quantization-aware training.
→Start with PTQ (calibrate on a representative sample); fall back to QAT only if PTQ accuracy loss exceeds the budget.
Why: PTQ is fast and needs no retraining; QAT recovers accuracy but costs a training run, so reserve it for precision-critical models.
Long-context serving where KV cache dominates memory and limits batch size.
→Enable FP8 or INT8 KV-cache quantization in TensorRT-LLM.
Why: KV cache grows with sequence length × batch; quantizing it frees memory for larger batches and longer contexts with minimal quality impact.
Mixed request lengths cause GPU idle time with static batching.
→Use in-flight (continuous) batching in TensorRT-LLM so finished sequences are evicted and new ones join mid-flight.
Why: Continuous batching keeps the GPU saturated and raises throughput far above static batching for heterogeneous request streams.
Reference↗
A large teacher model meets quality but misses the latency and cost target.
→Distill into a smaller student model, then quantize the student for inference.
Why: Distillation transfers capability to a cheaper architecture; combined with quantization it compounds the cost/latency savings.
Single-stream latency is too high for an interactive use case.
→Apply speculative decoding with a small draft model verified by the target model.
Why: The draft proposes multiple tokens that the large model verifies in one pass, cutting wall-clock latency without changing output distribution.
Quantizing everything to INT4 tanks accuracy on a few sensitive layers.
→Use mixed-precision: keep sensitive layers (e.g. final projection, attention) higher precision and quantize the rest.
Why: Per-layer sensitivity varies; selective precision protects accuracy where it matters while still shrinking the bulk of the weights.
PTQ accuracy is poor despite a reasonable quantization scheme.
→Recalibrate with an in-distribution sample (hundreds of representative prompts) matching production traffic.
Why: Calibration sets activation ranges; an unrepresentative sample produces bad scales and avoidable accuracy loss.