Playbook — NCP-GENL NVIDIA-Certified Professional: Generative AI LLMs

Last reviewed: June 2026

A scannable reference of architectural patterns the NCP-GENL exam tests. Read top-to-bottom, or jump to a section.

Model Optimization

Need lower latency on H100/Blackwell without the accuracy hit of aggressive INT quantization.

Use FP8 (E4M3) quantization via TensorRT-LLM; Hopper and Blackwell have native FP8 Tensor Cores.

Why: FP8 preserves dynamic range better than INT8 and runs at full hardware speed on Hopper+, giving near-FP16 quality at INT8-class throughput.

Reference

Model barely fits in GPU memory and throughput is memory-bandwidth-bound.

Apply INT4 weight-only quantization (AWQ or GPTQ); keep activations in FP16/FP8.

Why: Weight-only INT4 roughly halves memory versus INT8 and relieves bandwidth pressure; activation precision stays high so accuracy loss is small.

Deciding between post-training quantization and quantization-aware training.

Start with PTQ (calibrate on a representative sample); fall back to QAT only if PTQ accuracy loss exceeds the budget.

Why: PTQ is fast and needs no retraining; QAT recovers accuracy but costs a training run, so reserve it for precision-critical models.

Long-context serving where KV cache dominates memory and limits batch size.

Enable FP8 or INT8 KV-cache quantization in TensorRT-LLM.

Why: KV cache grows with sequence length × batch; quantizing it frees memory for larger batches and longer contexts with minimal quality impact.

Mixed request lengths cause GPU idle time with static batching.

Use in-flight (continuous) batching in TensorRT-LLM so finished sequences are evicted and new ones join mid-flight.

Why: Continuous batching keeps the GPU saturated and raises throughput far above static batching for heterogeneous request streams.

Reference

A large teacher model meets quality but misses the latency and cost target.

Distill into a smaller student model, then quantize the student for inference.

Why: Distillation transfers capability to a cheaper architecture; combined with quantization it compounds the cost/latency savings.

Single-stream latency is too high for an interactive use case.

Apply speculative decoding with a small draft model verified by the target model.

Why: The draft proposes multiple tokens that the large model verifies in one pass, cutting wall-clock latency without changing output distribution.

Quantizing everything to INT4 tanks accuracy on a few sensitive layers.

Use mixed-precision: keep sensitive layers (e.g. final projection, attention) higher precision and quantize the rest.

Why: Per-layer sensitivity varies; selective precision protects accuracy where it matters while still shrinking the bulk of the weights.

PTQ accuracy is poor despite a reasonable quantization scheme.

Recalibrate with an in-distribution sample (hundreds of representative prompts) matching production traffic.

Why: Calibration sets activation ranges; an unrepresentative sample produces bad scales and avoidable accuracy loss.

GPU Acceleration and Optimization

Model weights exceed a single GPU but fit within one NVLink-connected node.

Use tensor parallelism across the GPUs in the node.

Why: Tensor parallelism shards each layer and exchanges activations every step, so it needs the high intra-node bandwidth of NVLink/NVSwitch.

Model is too large for one node and must span nodes over InfiniBand.

Add pipeline parallelism across nodes, keeping tensor parallelism within each node.

Why: Pipeline parallelism communicates only at stage boundaries, tolerating slower inter-node links; reserve bandwidth-hungry tensor parallel for NVLink.

Scaling to more GPUs yields diminishing throughput gains.

Profile with Nsight Systems to classify the bottleneck; if collectives dominate, reduce parallel degree or improve topology.

Why: Beyond a point, all-reduce/all-gather overhead outweighs added compute; diagnosing communication-bound vs compute-bound guides the fix.

Reference

Per-step kernel launch overhead inflates decode latency at small batch sizes.

Enable CUDA Graphs to capture and replay the decode loop.

Why: CUDA Graphs collapse many small launches into one replay, removing CPU-side launch overhead that dominates at low batch sizes.

Tensor-parallel ranks placed across a slow link cause stalls.

Pin tensor-parallel ranks to GPUs sharing NVLink/NVSwitch; place pipeline stages across nodes.

Why: Mismatched placement routes high-frequency collectives over PCIe or InfiniBand, throttling the whole pipeline.

Attention is memory-bound and limits achievable context length.

Use FlashAttention (fused, IO-aware attention kernels) as provided by the TensorRT-LLM/NeMo stack.

Why: FlashAttention avoids materializing the full attention matrix, cutting memory traffic and enabling longer sequences at higher speed.

Several small models underutilize full H100 GPUs.

Partition GPUs with MIG (Multi-Instance GPU) to isolate each model on a slice.

Why: MIG gives hardware-isolated partitions, raising utilization and providing predictable QoS for co-located small workloads.

Prompt Engineering

Downstream service requires strictly valid JSON every time.

Use guided/constrained decoding (grammar or JSON schema) in the serving runtime rather than relying on prompt wording alone.

Why: Constrained decoding masks invalid tokens at generation time, guaranteeing schema-valid output where prompting only reduces the failure rate.

Task needs a consistent format the base model handles inconsistently.

Try few-shot exemplars first; move to fine-tuning only if prompt-based steering plateaus or token cost is excessive.

Why: Few-shot is zero-training and instantly editable; fine-tuning wins only when patterns are stable and prompt overhead hurts.

Multi-step reasoning task gives wrong final answers.

Elicit chain-of-thought ('think step by step') or use a structured reasoning template before the final answer.

Why: Exposing intermediate steps improves multi-hop accuracy and makes errors auditable, at the cost of extra tokens.

A prompt tweak silently regressed production quality.

Version system prompts as code, gate changes behind eval, and roll out via the same CI as model artifacts.

Why: Prompts are part of the model contract; unversioned edits cause untracked regressions and unreproducible behavior.

Model hallucinates facts outside its training data.

Retrieve relevant context and inject it into the prompt with an instruction to answer only from provided context.

Why: Grounding on retrieved passages constrains the model to source material and cuts hallucination on knowledge-intensive queries.

Latency and cost are high because prompts are bloated.

Trim and compress the prompt: dedupe instructions, summarize retrieved context, and cap exemplars to the minimum that holds quality.

Why: Prefill scales with input tokens; lean prompts cut both latency and per-request cost without measurable quality loss.

User-supplied text can override the system instruction.

Separate trusted instructions from untrusted input with clear delimiters and treat retrieved/user content as data, not commands.

Why: Concatenating untrusted text into the instruction channel invites prompt injection; explicit boundaries reduce the attack surface.

Fine-Tuning

Adapting a large base model to a domain on a limited GPU budget.

Use LoRA: train low-rank adapters and freeze the base weights.

Why: LoRA trains a tiny fraction of parameters, slashing memory and compute while matching full fine-tuning on most narrow tasks.

Reference

Even LoRA training of a 70B model won't fit available memory.

Use QLoRA: quantize the frozen base to 4-bit (NF4) and train LoRA adapters on top.

Why: Holding the base in 4-bit while updating only adapters lets large models be fine-tuned on a single GPU with minimal accuracy loss.

Choosing LoRA rank for a new fine-tuning task.

Start with a modest rank (e.g. 8-16); raise it only if the task is complex and validation loss is still improving.

Why: Higher rank adds capacity and cost; over-ranking risks overfitting on small datasets while under-ranking caps achievable quality.

Model follows instructions but its outputs don't match human preference.

Do supervised fine-tuning first, then preference alignment with RLHF or DPO.

Why: SFT teaches the format and task; preference optimization shapes which valid answers humans actually prefer.

RLHF with PPO is unstable and operationally heavy.

Use DPO (Direct Preference Optimization) on a preference dataset instead of a reward model + PPO loop.

Why: DPO optimizes preferences directly without a separate reward model or RL rollout, simplifying the pipeline and improving stability.

LoRA adapter adds per-request overhead at serving time.

Merge the adapter weights into the base for deployment when only one adapter is served.

Why: A merged model has no adapter branch at inference; keep adapters separate only when hot-swapping multiple tasks on one base.

Fine-tuning on a narrow task degrades general capabilities.

Mix in a slice of general/instruction data, lower the learning rate, and prefer PEFT over full fine-tuning.

Why: Replaying general data and limiting weight movement preserves broad skills while still learning the new task.

Data Preparation

Pretraining/fine-tuning data contains heavy near-duplicates.

Run fuzzy deduplication (e.g. MinHash/LSH) before training.

Why: Duplicates waste compute, bias the model toward repeated content, and can cause memorization; dedup improves generalization per token.

Suspiciously high benchmark scores after training.

Decontaminate the training set against benchmark/eval data via n-gram overlap filtering.

Why: Leakage of test items inflates metrics and hides real quality; decontamination keeps evaluation honest.

Corpus may contain personal data subject to governance rules.

Add a PII detection-and-redaction stage to the data pipeline before training.

Why: Training on raw PII risks regurgitation and compliance violations; scrubbing upfront is far cheaper than fixing a leaky model.

Raw web-scraped data is noisy and lowers model quality.

Apply quality filters (heuristics plus a classifier) to drop low-quality, boilerplate, and spam documents.

Why: Data quality outweighs raw quantity past a threshold; filtering yields better models from the same training budget.

Fine-tuning data must feed cleanly into the NeMo training pipeline.

Convert to the expected NeMo format (e.g. JSONL with prompt/response fields) and tokenize with the model's tokenizer.

Why: Format and tokenizer mismatches cause silent truncation or label errors; conforming to NeMo's schema keeps training reproducible.

Reference

Model Deployment

Standing up a production LLM endpoint quickly with an OpenAI-compatible API.

Deploy with an NVIDIA NIM microservice; build a custom Triton ensemble only for non-standard pre/post-processing needs.

Why: NIM ships optimized engines and a standard API out of the box; custom Triton is worth the effort only when you need bespoke pipeline control.

Reference

Independent requests arrive faster than single-request serving can handle.

Enable Triton dynamic batching to coalesce concurrent requests into GPU batches.

Why: Batching amortizes kernel overhead across requests, raising throughput at a small, bounded latency cost.

Reference

A single model instance leaves GPU compute underutilized.

Configure multiple model instances per GPU in Triton to overlap execution.

Why: Concurrent instances fill compute gaps left by memory stalls, improving utilization when memory allows.

Traffic is spiky and fixed replicas either waste GPUs or drop SLOs.

Autoscale replicas on queue depth / GPU utilization with a warm pool to absorb cold starts.

Why: LLM cold starts (engine load) are slow; scaling on a leading signal with warm capacity protects latency during spikes.

Existing clients expect the OpenAI chat-completions API.

Expose the model through NIM's OpenAI-compatible endpoint so clients integrate without rewrites.

Why: A drop-in compatible API minimizes client migration work and lets you swap backends transparently.

Evaluation

A model or prompt change must not silently regress quality.

Run a curated golden eval set in CI and block deploys that drop below a quality threshold.

Why: Automated regression gates catch quality drops before they reach users, the same way unit tests gate code.

Open-ended outputs have no single reference answer to score against.

Use an LLM-as-judge with a rubric, calibrated against human ratings on a sample.

Why: A rubric-driven judge scales subjective evaluation; human calibration guards against the judge's own bias.

High MMLU score but users complain about the production task.

Evaluate on task-specific metrics tied to business outcomes, not just generic benchmarks.

Why: Generic benchmarks correlate weakly with narrow deployed tasks; the right metric reflects what users actually need.

Offline evals look good but real-world impact is uncertain.

Run an online A/B test routing a fraction of traffic to the new version and compare outcome metrics.

Why: Live A/B captures distribution shift and user behavior that offline sets miss, confirming real improvement.

Production Monitoring and Reliability

Need visibility into GPU health and utilization across a serving fleet.

Export DCGM metrics (utilization, memory, ECC, temperature) into Prometheus and alert on them.

Why: DCGM is the standard NVIDIA telemetry source; without it, GPU-level saturation and faults go undetected.

Reference

Users intermittently see slow responses but average latency looks fine.

Track p95/p99 time-to-first-token and inter-token latency, and alert on percentile SLO breaches.

Why: Averages hide tail latency; LLM UX is governed by p95/p99, so percentile SLIs are the right alerting signal.

Deploying a new model version to a high-traffic endpoint.

Roll out via canary (small traffic slice) with automated rollback on SLO or quality regression.

Why: Canarying limits blast radius and lets metrics confirm safety before full rollout, unlike a big-bang deploy.

Throughput collapses under load with no obvious GPU compute spike.

Monitor KV-cache and batch-slot utilization; scale out or shorten max context when the cache saturates.

Why: KV-cache exhaustion caps concurrency before compute does; watching it explains throughput cliffs that GPU-util alone misses.

LLM Architecture

KV cache is too large for the target batch and context.

Prefer an architecture using Grouped-Query Attention (GQA) or Multi-Query Attention (MQA).

Why: GQA/MQA share key/value heads, shrinking KV-cache memory and raising attainable batch size with little quality loss.

Need to extend a model's usable context beyond its trained length.

Use RoPE scaling (e.g. NTK-aware / YaRN) plus light long-context fine-tuning.

Why: RoPE interpolation stretches positional encodings; a short fine-tune adapts the model to the longer range without full retraining.

Want more capacity without proportional inference cost.

Consider a Mixture-of-Experts model that activates only top-k experts per token.

Why: MoE scales parameters while keeping per-token FLOPs low, but adds routing complexity and uneven expert load to manage.

Safety, Ethics, and Compliance

A deployed model needs topic, safety, and format boundaries.

Wrap the model with NeMo Guardrails to enforce input and output rails (topical, moderation, jailbreak).

Why: Programmable rails add a controllable safety layer around the model without retraining it.

Reference

Model occasionally produces toxic or unsafe content.

Add an output moderation classifier and block/regenerate responses that exceed a risk threshold.

Why: A separate moderation pass catches unsafe generations that prompt-level instructions alone don't reliably prevent.

Stakeholders require evidence the model meets responsible-AI standards.

Run bias and toxicity benchmarks, document results, and track them across versions in a model card.

Why: Documented, repeatable safety evaluation supports compliance and surfaces regressions before they reach production.