Playbook — NCA-GENL NVIDIA-Certified Associate: Generative AI LLMs

Last reviewed: June 2026

A scannable reference of architectural patterns the NCA-GENL exam tests. Read top-to-bottom, or jump to a section.

Core Machine Learning and AI Knowledge

Explain what lets a transformer weigh distant tokens when generating the next one.

Self-attention. Each token attends to every other token via query/key/value projections, producing context-weighted representations.

Why: Attention, not recurrence, is what gives transformers long-range context and parallelizable training.
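
As an illustration, here is a minimal single-head sketch of scaled dot-product attention in plain Python. The toy query/key/value vectors and the function names are assumptions made for this example, not part of any library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, including distant positions.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted mix of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: 3 tokens with 2-dimensional projections (illustrative numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = attention(Q, K, V)
```

Each output row depends on all three tokens at once, which is why the computation parallelizes across positions instead of stepping through them one at a time.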

Pick how to inject new knowledge or behavior into an LLM.

New facts that change often → RAG. New task behavior/style → fine-tune. New base capability/vocabulary at scale → continued pre-training.

Why: RAG keeps data external and updatable; fine-tuning bakes behavior into weights; pre-training is the most expensive lever.

Define what makes a model a foundation model.

A large model pre-trained on broad, mostly unlabeled data that is adaptable to many downstream tasks via prompting, RAG, or fine-tuning.

Estimate how text maps to model input units and what drives cost.

Text is split into sub-word tokens by a tokenizer (e.g. BPE). Cost and context limits are measured in tokens, not characters or words.

Why: Rare or non-English words split into more tokens, inflating context use and inference cost.

A long document does not fit in a single prompt.

The input exceeds the model's context window (max tokens for input + output). Chunk the document for RAG or choose a longer-context model.

Why: The context window is a hard limit; input beyond it is either rejected by the endpoint or truncated, and truncated content is silently lost.

Power semantic search or RAG retrieval over text.

Use an embedding model to convert text into dense vectors, then retrieve by cosine/dot-product similarity from a vector store.

Why: Embeddings place semantically similar text near each other, enabling meaning-based rather than keyword retrieval.
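
A small sketch of similarity-based retrieval, assuming hand-written 3-dimensional vectors in place of real embedding-model output. The `store` list and `retrieve` helper are hypothetical stand-ins for a vector store and its query API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=2):
    """Return the top_k (text, score) pairs ranked by cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical embeddings; real models emit hundreds or thousands of dimensions.
store = [
    ("GPU memory tuning", [0.9, 0.1, 0.0]),
    ("Quarterly sales report", [0.0, 0.2, 0.9]),
    ("CUDA kernel optimization", [0.8, 0.3, 0.1]),
]
results = retrieve([1.0, 0.2, 0.0], store, top_k=2)
```

With these toy vectors the two GPU-related texts rank above the sales report even though the query shares no keywords with them.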

Choose output behavior: deterministic vs. creative.

Low temperature (~0.0-0.3) → focused, repeatable. High temperature (~0.7-1.0) → diverse, creative. Use near-0 for classification or extraction.

Why: Temperature divides the logits before the softmax; lower values concentrate probability mass on the top tokens, and higher values flatten the distribution.
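
A minimal sketch of temperature scaling, assuming three made-up logits. The function name is illustrative; serving stacks apply the same idea internally before sampling.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # sharply peaked
hot = softmax_with_temperature(logits, 1.0)   # softer distribution
```

At temperature 0.2 almost all probability lands on the first token, which is the repeatable behavior wanted for classification or extraction.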

Constrain the candidate token pool beyond temperature.

Top-k keeps the k most-likely tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches p.

Why: Top-p adapts the candidate set to the distribution shape; top-k is fixed-width regardless of confidence.
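
The difference can be sketched with two toy distributions. The helper names and probabilities below are assumptions for illustration only.

```python
def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

confident = [0.90, 0.05, 0.03, 0.02]  # peaked distribution
uncertain = [0.30, 0.28, 0.22, 0.20]  # flat distribution

narrow = top_p_filter(confident, 0.85)  # one token already covers 0.85
wide = top_p_filter(uncertain, 0.85)    # needs all four tokens
```

Top-k with k=2 would keep exactly two candidates in both cases, while top-p keeps one candidate for the peaked distribution and four for the flat one.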

Identify how LLMs learn from unlabeled text.

Self-supervised learning — next-token (causal) or masked-token prediction creates labels from the text itself, no human annotation.

Why: It is what lets LLMs train on internet-scale corpora without manual labeling.
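
A short sketch of how both objectives derive labels from raw text. Whitespace splitting stands in for a real tokenizer here, and the helper names are hypothetical.

```python
def next_token_pairs(tokens):
    """Causal LM: predict token i+1 from the tokens up to and including i."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

def masked_token_example(tokens, mask_index, mask_token="[MASK]"):
    """Masked LM: hide one token and use the original token as the label."""
    masked = list(tokens)
    label = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, label

# Whitespace split is an assumption standing in for a sub-word tokenizer.
tokens = "the model predicts the next token".split()
causal = next_token_pairs(tokens)
masked, label = masked_token_example(tokens, 2)
```

Every training pair comes from the sentence itself, so no annotator is involved at any point.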

Match architecture to task family.

Generation → decoder-only (GPT-style). Understanding/classification → encoder-only (BERT-style). Seq-to-seq translation/summarization → encoder-decoder (T5-style).

Why: Decoder-only models predict left-to-right; encoders see bidirectional context, better for representation tasks.

Make a base model follow instructions and prefer helpful, safe answers.

Instruction tuning followed by alignment such as RLHF — reinforcement learning from human preference rankings.

Why: A raw pre-trained model predicts text; alignment steers it toward intended assistant behavior.

The model states confident but fabricated facts.

Hallucination. Mitigate by grounding with RAG, lowering temperature, requiring source citations, and adding guardrails plus human review for high-stakes outputs.

Why: LLMs predict plausible tokens, not verified facts; grounding supplies the missing evidence.

Distinguish model size from training data size.

Parameters = learned weights (model capacity). Tokens = volume of training text. Both scale capability under scaling laws.

Why: A bigger model under-trained on too few tokens underperforms a smaller, well-trained one (Chinchilla insight).
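
A back-of-the-envelope sketch of the trade-off, assuming the commonly quoted approximations of roughly 20 training tokens per parameter and about 6 FLOPs per parameter per token. Both constants are rules of thumb, not exact laws, and the 7B model size is an arbitrary example.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb for compute-optimal training

def compute_optimal_tokens(params):
    """Rough number of training tokens suggested for a given parameter count."""
    return params * TOKENS_PER_PARAM

def training_flops(params, tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

params_7b = 7_000_000_000
tokens_needed = compute_optimal_tokens(params_7b)      # about 140 billion tokens
flops_needed = training_flops(params_7b, tokens_needed)
```

Under these assumptions, doubling the parameter count without doubling the token count leaves the larger model under-trained for its size.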

Separate the two GPU-heavy phases of an LLM lifecycle.

Training updates weights from data (one-time, batch). Inference runs the frozen model to generate outputs (ongoing, latency-sensitive).

Why: Optimization tools differ: training uses parallelism frameworks; inference uses TensorRT-LLM and Triton.

A fine-tuned model memorizes training examples and fails on new inputs.

Overfitting. Mitigate with more/diverse data, early stopping, lower learning rate, fewer epochs, or regularization like dropout.

Why: A large train-vs-validation gap means the model fit noise instead of generalizable patterns.
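
Early stopping can be sketched as follows. The validation losses are made-up numbers and `patience` is a hypothetical setting; training frameworks expose their own equivalents.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Illustrative curve: validation loss improves, then rises as overfitting sets in.
val_losses = [0.90, 0.70, 0.62, 0.65, 0.71, 0.80]
best_epoch = early_stopping_epoch(val_losses)
```

The checkpoint from the returned epoch is the one to keep; later epochs only widen the train-vs-validation gap.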

Software Development

Deploy an optimized LLM as a production microservice with an OpenAI-compatible API quickly.

Use an NVIDIA NIM microservice — a prebuilt, containerized, TensorRT-LLM-optimized model endpoint.

Why: NIM packages the model, runtime, and optimized engine so you skip manual TensorRT-LLM and Triton wiring.

Serve multiple models with batching, concurrency, and multiple backends behind one inference server.

NVIDIA Triton Inference Server. Supports dynamic batching, model ensembles, and TensorRT/PyTorch/ONNX backends.

Why: Triton maximizes GPU utilization via concurrent model execution and dynamic batching.

Cut LLM inference latency on NVIDIA GPUs before serving.

Compile the model with TensorRT-LLM — kernel fusion, quantization, in-flight batching, and KV-cache optimization.

Why: TensorRT-LLM produces an optimized engine that typically delivers lower latency and higher throughput than running the raw framework model.

Train, customize, or fine-tune LLMs at scale on NVIDIA GPUs.

NVIDIA NeMo framework — end-to-end toolkit for building, customizing, and deploying generative AI models.

Why: NeMo covers data curation, training, PEFT, and alignment in one stack designed for multi-GPU scaling.

Build an app that answers from private documents the base model never saw.

RAG pipeline: chunk + embed documents into a vector store, retrieve top-k by similarity at query time, and inject them into the prompt.

Why: Retrieval grounds answers in current, owned data without retraining the model.
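
The final step of the pipeline, injecting retrieved passages into the prompt, might look like the sketch below. The template wording and the `build_rag_prompt` helper are assumptions; the retrieved chunks are hard-coded in place of a real vector-store query.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Place numbered retrieved passages ahead of the user question."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in for the top-k results returned by the retriever.
chunks = ["Policy X allows 20 vacation days.", "Requests need manager approval."]
prompt = build_rag_prompt("How many vacation days does Policy X allow?", chunks)
```

Numbering the passages makes it easy to ask the model to cite which passage supports its answer.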

Constrain assistant tone, role, and rules across a whole conversation.

Set a system prompt/message defining role, constraints, and format before user turns.

Why: The system message persists across turns and steers behavior more reliably than per-turn instructions.

Improve accuracy on a structured task without any training.

Few-shot prompting — embed 2-5 input/output examples in the prompt before the real input.

Why: In-context learning lets the model pattern-match against examples with no weight updates.
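
A minimal sketch of assembling a few-shot prompt. The sentiment task, the example texts, and the `Input:`/`Output:` labels are illustrative choices, not a required format.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Lay out the instruction, worked examples, then the real input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {new_input}", "Output:"]
    return "\n".join(lines)

examples = [
    ("The GPU arrived early and works great.", "positive"),
    ("Driver install failed twice.", "negative"),
    ("Support resolved my ticket in minutes.", "positive"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    examples,
    "The fan is loud and rattles.",
)
```

Ending the prompt on an open `Output:` line invites the model to continue the pattern with a single label.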

The model gets multi-step reasoning or math problems wrong.

Chain-of-thought prompting — instruct it to reason step by step before giving the final answer.

Why: Eliciting intermediate steps improves reasoning accuracy on compositional tasks.

Let the LLM trigger external APIs, databases, or tools reliably.

Use function/tool calling — define tool schemas; the model emits structured arguments your code executes.

Why: Structured tool calls beat parsing free-text, and they ground the model in live systems for agentic flows.
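
A sketch of the application side of tool calling, assuming the model returns a JSON object with `name` and `arguments` fields. The exact wire format varies by API, and `get_weather` is a hypothetical tool.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}

# Registry of callables the model is allowed to trigger.
TOOLS = {"get_weather": get_weather}

# Schema advertised to the model (JSON-Schema-style parameter description).
TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Look up the forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def execute_tool_call(raw_call):
    """Parse the model's structured call and dispatch to a registered tool."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Assumed model output for illustration.
model_output = '{"name": "get_weather", "arguments": {"city": "Santa Clara"}}'
result = execute_tool_call(model_output)
```

Dispatching only through an explicit registry means the model cannot invoke anything the application has not deliberately exposed.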

Downstream code needs strict JSON from the model.

Request a JSON schema in the prompt and use constrained/guided decoding; validate the output before use.

Why: Schema-guided decoding prevents malformed JSON that would break parsing.
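
The "validate before use" step can be sketched with the standard library alone. The two-field schema below is hypothetical; a production system might use a full JSON Schema validator instead.

```python
import json

# Hypothetical schema: field name mapped to the expected Python type.
REQUIRED_FIELDS = {"name": str, "priority": int}

def parse_model_json(raw):
    """Return the parsed dict, or raise ValueError if it does not match."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return data

ticket = parse_model_json('{"name": "reset password", "priority": 2}')
```

Raising a single exception type for every failure mode lets the caller retry the generation or fall back without inspecting the raw text.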

A chat UI must show tokens as they are produced rather than after completion.

Use streaming (token-by-token) inference from the serving endpoint.

Why: Streaming lowers perceived latency; NIM and Triton both support streamed responses.

Compose retrieval, prompting, and tool steps into one application pipeline.

Use an orchestration framework such as LangChain or LlamaIndex to chain retrievers, prompts, models, and tools.

Why: These frameworks provide reusable RAG and agent abstractions that can sit on top of model endpoints such as NIM.

Decide between a packaged microservice and a hand-built serving stack.

Fast standardized deployment → NIM. Deep custom backend/model logic → Triton + TensorRT-LLM directly.

Why: NIM trades configurability for speed; raw Triton gives full control of the serving graph.

Experimentation

Fine-tune a large model on limited GPU memory without touching all weights.

LoRA / PEFT — train small low-rank adapter matrices while freezing the base weights.

Why: LoRA cuts trainable parameters by orders of magnitude, so fine-tuning fits on modest GPUs.
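
The parameter savings follow from simple arithmetic, sketched here for one square weight matrix. The hidden size of 4096 and rank of 8 are assumed values chosen for illustration.

```python
def full_params(d_out, d_in):
    """Trainable weights when the whole matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in); the base matrix is frozen."""
    return rank * (d_out + d_in)

d = 4096  # assumed hidden size
r = 8     # assumed LoRA rank

full = full_params(d, d)        # 16,777,216 weights
adapter = lora_params(d, d, r)  # 65,536 weights
reduction = full / adapter      # 256x fewer trainable weights for this matrix
```

The saving grows with the hidden size and shrinks as the rank increases, so rank is the main knob trading adapter capacity against memory.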

Fine-tune a very large model with the tightest possible memory budget.

QLoRA — quantize the frozen base model to 4-bit and train LoRA adapters on top.

Why: Quantizing the base shrinks memory further than LoRA alone, enabling larger models on one GPU.
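
A rough estimate of weight memory at different precisions, assuming a 70B-parameter model. This counts weights only; activations, adapter weights, optimizer state, and quantization metadata add more in practice.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 70e9  # assumed 70B-parameter base model

fp16_gb = weight_memory_gb(params, 16)  # about 140 GB
int4_gb = weight_memory_gb(params, 4)   # about 35 GB
```

Under these assumptions the 4-bit base needs a quarter of the 16-bit weight memory, which is the headroom QLoRA uses to fit larger models on a single GPU.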

Pick the cheapest customization that meets the quality bar.

Escalate in order: prompt engineering → few-shot → RAG → LoRA fine-tuning → full fine-tuning.

Why: Cost and effort rise at each step; stop at the first one that hits the target.

Supervised fine-tuning needs the right training data shape.

Provide instruction/response (prompt-completion) pairs, typically in JSONL.

Why: SFT teaches the model to map inputs to desired outputs; the pairs define that mapping.
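
A sketch of the JSONL shape, one JSON object per line. The `prompt` and `completion` field names are an assumption; individual training frameworks may expect different keys.

```python
import json

# Illustrative instruction/response pairs.
pairs = [
    {"prompt": "Summarize: GPUs accelerate matrix math.",
     "completion": "GPUs speed up matrix operations."},
    {"prompt": "Translate to French: hello",
     "completion": "bonjour"},
]

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_jsonl(pairs)
```

Because each line is independent, large datasets can be streamed, shuffled, and sharded without parsing the whole file.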

Fine-tuning loss diverges or the model forgets prior capabilities.

Lower the learning rate and/or reduce epochs; watch validation loss for divergence, and re-check general-capability benchmarks to catch catastrophic forgetting.

Why: Too-high LR destabilizes training and overwrites pre-trained knowledge.

Measure whether a fine-tune or prompt change actually helped.

Hold out a validation/test set the model never trained on and compare metrics before vs. after.

Why: Evaluating on training data overstates quality; only held-out data reflects generalization.
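
A minimal sketch of a seeded hold-out split and an accuracy metric. The 80/20 ratio, the seed, and the integer stand-ins for examples are illustrative assumptions.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then carve off a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

examples = list(range(100))  # stand-in for 100 labeled examples
train, test = train_test_split(examples)
```

Fixing the seed keeps the test set identical across runs, so a before-vs-after comparison measures the change and not a different sample.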

Compare many fine-tuning runs with different hyperparameters and data.

Log runs, configs, and metrics with an experiment tracker (e.g. MLflow, Weights & Biases, TensorBoard).

Why: Reproducibility requires recording which config produced which result; relying on recall or ad hoc notes does not scale past a handful of runs.

Score generated-text quality automatically.

Summarization → ROUGE. Translation → BLEU. Semantic match → BERTScore. Open-ended quality → LLM-as-judge or human eval.

Why: Lexical-overlap metrics miss meaning; for nuanced quality, human or model-judge evaluation is needed.
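
To see why overlap metrics miss meaning, here is a simplified unigram-overlap score in the spirit of ROUGE-1. It is a sketch that assumes whitespace tokenization and no stemming, not a replacement for a standard ROUGE implementation.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    precision = overlap / max(1, sum(cand.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
literal = rouge1("the cat sat on the mat", reference)
paraphrase = rouge1("a feline rested on the rug", reference)
```

The paraphrase carries nearly the same meaning as the reference but shares only two words with it, so its score drops sharply. That gap is what BERTScore, model-judge, or human evaluation is meant to close.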

RAG retrieves irrelevant or too little context.

Tune chunk size/overlap, top-k, embedding model, and add re-ranking; verify retrieval quality separately from generation.

Why: Most RAG failures are retrieval failures; fix retrieval before blaming the generator.

Decide which of two prompt variants performs better.

Run both against a fixed evaluation set and compare metrics; iterate on data and prompt, not just the model.

Why: Controlled comparison on the same inputs isolates the effect of the prompt change.

After fine-tuning on a narrow task the model loses general ability.

Catastrophic forgetting. Mitigate with PEFT/LoRA, lower LR, fewer epochs, or mixing general data into the fine-tune set.

Why: Adapter-based tuning preserves base weights, limiting drift from the original capabilities.

Data Analysis

Curate a large web/text corpus for LLM training at GPU scale.

NVIDIA NeMo Curator — GPU-accelerated cleaning, dedup, quality filtering, and PII handling for training data.

Why: Data quality drives model quality; Curator scales curation work that would be slow or impractical on CPU alone.

Training corpus contains many near-duplicate documents.

Deduplicate (exact and fuzzy/near-dup) before training.

Why: Duplicates waste compute, bias the model toward repeated content, and risk memorization/leakage.
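
A small sketch combining exact deduplication (hash of normalized text) with near-duplicate detection (Jaccard similarity over word 3-grams). The 0.6 threshold and the sample documents are assumptions, and the pairwise comparison is quadratic; large pipelines typically rely on MinHash with locality-sensitive hashing instead.

```python
import hashlib

def exact_key(text):
    """Hash of lowercased, whitespace-normalized text for exact matching."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def shingles(text, n=3):
    """Set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Shingle-set overlap between two documents, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def deduplicate(docs, threshold=0.6):
    """Keep the first copy of each document; drop exact and near duplicates."""
    kept, seen = [], set()
    for doc in docs:
        key = exact_key(doc)
        if key in seen or any(jaccard(doc, k) >= threshold for k in kept):
            continue
        seen.add(key)
        kept.append(doc)
    return kept

docs = [
    "NVIDIA Triton serves models with dynamic batching on GPUs",
    "nvidia triton serves   models with dynamic batching on GPUs",
    "NVIDIA Triton serves models with dynamic batching on GPUs today",
    "LoRA trains small adapter matrices",
]
unique = deduplicate(docs)
```

The second document is caught by the hash after normalization, and the third by shingle overlap, leaving two distinct documents.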

Split documents for RAG retrieval.

Chunk into semantically coherent passages with modest overlap; size to the embedding model and context budget.

Why: Oversized chunks dilute relevance; tiny chunks lose context. Overlap preserves boundary meaning.
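
A sketch of fixed-size chunking with overlap, counting words as a stand-in for tokens. The sizes are illustrative, and a production splitter would usually respect sentence or section boundaries as well.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows that share `overlap` words with the next window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 120-word document: w0 w1 ... w119.
text = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(text, chunk_size=50, overlap=10)
```

Each chunk repeats the last ten words of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.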

Raw scraped text is noisy, with boilerplate, toxic, or low-quality content.

Apply quality and toxicity filters, language ID, and heuristics to drop low-value documents.

Why: Garbage in degrades the model; filtering improves downstream quality more than adding raw volume.

Prepare a document collection for semantic retrieval.

Generate embeddings for each chunk with a consistent embedding model and store them in a vector index.

Why: Query and document embeddings must come from the same model to be comparable.

Check whether a training set under-represents groups or topics.

Analyze distribution across classes, sources, and demographics; rebalance or augment gaps before training.

Why: Skewed training data produces skewed model behavior; the fix belongs at the data layer.

Training or RAG data may contain personal information.

Detect and redact/mask PII during data preparation before it reaches model weights or the index.

Why: Knowledge baked into weights cannot be reliably masked at inference; remove PII upstream.
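
A deliberately simple sketch of regex-based redaction for email addresses and US-style phone numbers. The patterns are assumptions that will miss many real-world formats; production pipelines generally combine broader pattern sets with NER-based detection.

```python
import re

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

clean = redact_pii("Contact jane.doe@example.com or 555-123-4567 for access.")
```

Running redaction before chunking and embedding keeps the raw values out of both the training set and the retrieval index.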

Trustworthy AI

Keep an LLM app on-topic, block unsafe content, and prevent jailbreaks.

NVIDIA NeMo Guardrails — programmable rails for topic control, safety filtering, and dialog flow.

Why: Guardrails enforce policy on inputs and outputs independent of the underlying model.

Reduce confident-but-wrong answers in a deployed assistant.

Ground responses with RAG, require citations, add fact-checking rails, and keep humans in the loop for high-stakes outputs.

Why: Grounding supplies verifiable evidence the model would otherwise invent.

User input tries to override the system prompt or exfiltrate data.

Defense in depth: guardrails, input/output filtering, instruction isolation, and least-privilege tool permissions for agents.

Why: No single control stops injection; combine filtering with limited capabilities.

A deployed model produces skewed or unfair outputs for certain groups.

Audit outputs for bias, rebalance/augment training data, and add fairness checks to evaluation.

Why: Bias usually originates in data; measure and correct it before and after deployment.

Prompts and responses must not leave the organization's control.

Self-host with NIM/Triton on owned infrastructure, encrypt data, and avoid sending sensitive content to third-party APIs.

Why: On-prem or VPC deployment keeps confidential data inside the trust boundary.