Playbook — NCP-AAI NVIDIA-Certified Professional: Agentic AI

Last reviewed: June 2026

A scannable reference of architectural patterns the NCP-AAI exam tests. Read top-to-bottom, or jump to a section.

Agent Architecture and Design

Choosing between one agent and a multi-agent system for a complex workflow.

Default to a single agent with tools. Split into multiple agents only when task boundaries are distinct, context overflows, or different model tiers suit different sub-tasks.

Why: Each added agent multiplies latency, error surface, and orchestration cost; most workloads succeed with one well-tooled agent.

Orchestrator must dispatch heterogeneous sub-tasks to specialists.

Use a supervisor agent that decomposes the goal, routes to worker agents with their own prompts and tools, and aggregates results.

Why: Centralized control keeps state coherent and makes the decision boundary auditable versus a free-for-all swarm.

Agent flow has conditional branches, loops, and parallel fan-out.

Model the workflow as an explicit graph of nodes and edges rather than a free-form loop, so control flow is deterministic and resumable.

Why: A graph makes branches testable and lets you checkpoint and replay from any node after a failure.

Incoming requests vary widely in type and cost.

Front the system with a lightweight router agent that classifies intent and dispatches to the cheapest capable downstream agent or tool.

Why: Routing avoids paying frontier-model cost for trivial requests and isolates concerns per path.

Multiple agents must read and write common workflow state.

Externalize state to a shared store (key-value or document) keyed by session, rather than passing the full transcript between agents.

Why: A shared store bounds context growth and prevents divergent copies of state across agents.

Designing agents for horizontal scale-out.

Keep agent compute stateless; persist conversation and memory externally so any replica can pick up any request.

Why: Stateless nodes autoscale cleanly and survive pod restarts without losing in-flight work.

A sub-agent or tool fails mid-workflow.

Design idempotent steps with retry/backoff, compensating actions for side effects, and a fallback path or human escalation when retries exhaust.

Why: Agentic systems fail partially; recovery must be a first-class design concern, not an afterthought.

Sub-agents are developed by separate teams.

Define each agent's input/output contract as a typed schema and treat agents as services behind stable interfaces.

Why: Explicit contracts let agents evolve independently and be unit-tested in isolation.

Agent output quality is inconsistent on hard tasks.

Add a critic/reflection step that reviews the draft against criteria and triggers a bounded retry before returning.

Why: Self-critique catches errors cheaply, but cap iterations to avoid runaway loops and cost.

Agent Development

Agent must interact with external APIs, databases, or files.

Expose capabilities as typed function/tool definitions; the model emits a tool call, your code executes it and returns the result, then the loop continues.

Why: Structured tool calling is more reliable and auditable than parsing free-text instructions.

Agent must reason about observations before acting again.

Implement a ReAct loop: the model produces a thought, selects a tool, receives the observation, and repeats until a stop condition is met.

Why: Interleaving reasoning and action exposes the chain for debugging and improves multi-step accuracy.

The model misuses or hallucinates tool arguments.

Write precise tool descriptions, constrain argument types and enums, and provide one or two usage examples per tool.

Why: Most tool-call errors trace back to vague schemas; the description is the prompt for the tool.

Downstream code needs reliable JSON from the agent.

Constrain generation to a JSON schema (structured output) rather than parsing free text, and validate before use.

Why: Schema-constrained decoding eliminates brittle regex parsing and silent format drift.

Building a production agent on the NVIDIA stack.

Use the NeMo Agent Toolkit to compose agents, tools, and workflows, wiring model calls to NIM-served backends.

Why: The toolkit standardizes agent plumbing and integrates natively with NVIDIA serving.

Reference

A tool returns an error or times out.

Return the error back to the model as a tool result so it can retry, adjust arguments, or choose an alternative path.

Why: Surfacing failures to the agent enables recovery; swallowing them leaves the agent blind.

Several independent tool calls are needed in one step.

Issue tool calls in parallel when the model supports it and the calls have no ordering dependency, then merge results.

Why: Parallel execution cuts wall-clock latency for fan-out work like multi-source lookups.

A specialist capability should be reusable across workflows.

Wrap a sub-agent behind a single tool interface so the parent invokes it like any other tool.

Why: Treating sub-agents as tools keeps composition uniform and hides internal complexity.

Agent drifts off-task or ignores constraints.

Pin role, allowed tools, output format, and hard constraints in a concise system prompt; restate critical rules near the end.

Why: A tight system prompt is the cheapest, highest-leverage control on agent behavior.

Evaluation and Tuning

Measuring whether an agent solved a multi-step task correctly.

Evaluate both the final answer and the trajectory — tool-call accuracy, step order, and unnecessary actions — against a labeled set.

Why: A correct answer from a broken trajectory is fragile; trajectory scoring catches latent failures.

No ground-truth labels exist for open-ended agent outputs.

Use an LLM-as-judge with a rubric to score outputs, calibrated against a small human-labeled sample.

Why: Judge models scale evaluation, but must be calibrated or they encode their own bias.

You need to catch regressions before each release.

Build an offline eval harness with a fixed scenario suite that runs on every change and gates deploys on a pass threshold.

Why: Agentic behavior shifts subtly with prompt or model changes; a regression suite is the safety net.

Agent picks the wrong tool or wrong arguments.

Track tool-selection precision/recall and argument validity as standalone metrics, not just end-task success.

Why: Isolating the tool-call layer pinpoints whether failures come from selection or from the schema.

Eval pass rate dropped after a change.

Inspect full trajectories of failing cases, cluster failure modes, and fix the dominant cluster first.

Why: Aggregate scores hide root cause; per-trace clustering reveals the actual defect.

Agent underperforms and you must improve it.

Iterate prompts and tool descriptions first; only escalate to a larger model or fine-tuning when prompt changes plateau.

Why: Prompt iteration is fast and cheap; model swaps add cost and should be evidence-driven.

Comparing two agent designs that both pass accuracy targets.

Add cost-per-task and p95 latency to the evaluation so the cheaper, faster design wins ties.

Why: Production viability is accuracy plus cost plus latency, not accuracy alone.

Deployment and Scaling

Serving model inference for agents in production.

Deploy models as NIM microservices, giving agents a standardized, GPU-accelerated inference endpoint with built-in batching.

Why: NIM packages optimized inference behind a stable API so agents need not manage serving internals.

Reference

Agent traffic is spiky and unpredictable.

Containerize agents and serving, run on Kubernetes, and autoscale on concurrency or GPU utilization with sensible min/max bounds.

Why: Autoscaling absorbs spikes while min replicas avoid cold-start latency on the critical path.

GPU inference cost is too high under load.

Enable dynamic/continuous batching at the NIM layer to raise tokens-per-GPU-second before adding hardware.

Why: Batching dramatically improves GPU utilization; scaling nodes first wastes capacity.

Agents launch unbounded parallel tool and model calls.

Apply per-agent and global concurrency limits with a queue so the system degrades gracefully under load.

Why: Unbounded fan-out exhausts GPU and downstream quotas, cascading into failures.

Choosing GPU hardware for an agent inference workload.

Size to model footprint and latency targets — H100 for established large models, Blackwell where memory bandwidth and reasoning throughput dominate.

Why: Matching hardware to the model avoids both under-provisioning and paying for idle capacity.

Shipping a new agent or model version safely.

Roll out via canary to a small traffic slice, compare live metrics against baseline, then progress or roll back.

Why: Agent behavior changes are hard to fully predict offline; canary limits blast radius.

Long agent chains risk hanging requests.

Set per-step and end-to-end timeout budgets; cancel and fall back when exceeded.

Why: Without budgets a single slow tool can pin a GPU slot and starve other requests.

Cognition, Planning, and Memory

Task requires many interdependent steps.

Use a plan-and-execute pattern: generate an explicit plan first, then execute steps, replanning when an assumption breaks.

Why: Upfront planning reduces wandering and gives a checkpoint to validate before spending tool calls.

Decomposition quality is the bottleneck.

Route the planning step to a Nemotron reasoning model while using cheaper models for execution.

Why: Spend reasoning-grade compute where it matters — the plan — not on every routine sub-step.

Agent must remember facts across a long session.

Keep recent turns in working context; persist durable facts to a long-term memory store retrieved on demand.

Why: Stuffing everything into context inflates cost and latency and eventually overflows the window.

Choosing how to store agent memory.

Store episodic interaction history separately from semantic facts; retrieve semantic memory by similarity, episodic by recency/session.

Why: Different access patterns demand different stores; one bucket retrieves poorly for both.

A long-running conversation approaches the context limit.

Summarize older turns into a compact running summary and drop raw history, keeping only recent verbatim turns.

Why: Rolling summarization preserves continuity while bounding token cost and avoiding truncation errors.

Knowledge Integration and Data Handling

Agent must ground answers in private enterprise data.

Give the agent a retrieval tool over a vector store so it decides when and what to retrieve, rather than always prepending context.

Why: Agentic retrieval fetches only when needed, cutting tokens and irrelevant context.

Building a high-quality retrieval pipeline on NVIDIA.

Use NeMo Retriever embedding and reranking NIM microservices for accelerated, production-grade RAG.

Why: NeMo Retriever provides tuned embedding/rerank models served efficiently on GPU.

Reference

Pure vector search misses exact-match and keyword queries.

Combine dense vector search with sparse/keyword retrieval and rerank the merged candidates.

Why: Hybrid retrieval recovers precise terms (IDs, codes) that embeddings blur.

Retrieved chunks are too coarse or too fragmented.

Chunk on semantic boundaries with modest overlap and attach metadata; tune size to the embedding model and query type.

Why: Chunk granularity directly drives retrieval relevance; both extremes degrade grounding.

Agent returns stale information from the index.

Pipeline incremental re-indexing on source changes and stamp documents with timestamps for recency-aware ranking.

Why: Without freshness handling, RAG confidently grounds answers in outdated data.

NVIDIA Platform Implementation

Choosing a model backend for agent reasoning.

Select a Nemotron model sized to the reasoning load and serve it through NIM for a standardized endpoint.

Why: Nemotron reasoning variants are tuned for agentic planning and tool use; NIM standardizes serving.

Reference

Mapping an agentic need to the right NVIDIA component.

Use NeMo Agent Toolkit for orchestration, NIM for serving, NeMo Retriever for RAG, NeMo Guardrails for safety, and Nemotron for reasoning.

Why: Knowing which component owns which concern is a recurring exam and design decision.

Assembling an end-to-end agentic application on NVIDIA.

Compose discrete NIM microservices (LLM, embedding, rerank, guardrails) behind the agent layer, scaling each independently.

Why: Microservice decomposition lets each capability scale and version on its own.

Data residency rules forbid sending data to external APIs.

Self-host NIM microservices on owned GPU infrastructure so models and data stay within the boundary.

Why: NIM's portable packaging supports on-prem deployment that meets residency requirements.

Run, Monitor, and Maintain

A production agent misbehaves and you must diagnose it.

Emit distributed traces capturing each model call, tool call, and decision, then inspect the failing trajectory end to end.

Why: Agent failures are multi-step; without full traces you cannot locate where reasoning went wrong.

Agent token spend and latency creep up over time.

Track tokens, cost, and p95 latency per agent and per tool, with alerts on threshold breaches.

Why: Cost and latency drift silently as prompts and traffic evolve; metrics catch it early.

Quality degrades gradually without code changes.

Run the eval suite continuously against production samples and alert on metric drift from baseline.

Why: Data and upstream-model drift erode quality invisibly between releases.

Safety, Ethics, and Compliance

Agent must stay on-topic and refuse unsafe requests.

Apply NeMo Guardrails with input, output, topical, and dialog rails around the agent.

Why: Programmable rails enforce policy independent of, and as a backstop to, the model's own behavior.

Reference

Untrusted content could hijack the agent via retrieved or tool data.

Treat all external content as untrusted, isolate it from instructions, and constrain tool authority so injected commands cannot escalate.

Why: Injection exploits the agent's power; defense is least-privilege plus instruction/data separation.

Agent handles regulated or personal data.

Redact or tokenize PII before model calls and write tamper-evident audit logs of agent actions and tool invocations.

Why: Compliance demands both minimizing exposure and proving what the agent did.

Human-AI Interaction and Oversight

Agent can take high-risk actions like payments or deletions.

Insert a human approval gate before irreversible or high-impact tool calls, pausing the workflow until confirmed.

Why: Autonomy is fine for reversible steps; consequential actions need a human in the loop.

Agent is uncertain or repeatedly fails a task.

Define a confidence/failure threshold that escalates to a human with full context rather than guessing.

Why: Graceful handoff beats a confident wrong answer on high-stakes work.

Stakeholders distrust the agent's outputs.

Surface the agent's reasoning summary, sources, and tools used so humans can review and override decisions.

Why: Explainability builds trust and is often required for oversight and audit.