Playbook — NCA-AIIO NVIDIA-Certified Associate: AI Infrastructure and Operations

Last reviewed: June 2026

A scannable reference of architectural patterns the NCA-AIIO exam tests. Read top-to-bottom, or jump to a section.

Sections

AI Infrastructure (19 entries)
Essential AI Knowledge (18 entries)
AI Operations (11 entries)

AI Infrastructure

Decide whether a workload belongs on GPUs or CPUs.

Massively parallel math (deep-learning training/inference, matrix ops, simulation) → GPU. Serial, branch-heavy control logic, OS tasks, light I/O → CPU.

Why: GPUs have thousands of cores optimized for throughput on parallel SIMT work; CPUs win on latency-sensitive serial logic. Most AI systems pair both.

Pick the NVIDIA building block: a complete appliance vs. a board for OEM systems.

Turnkey integrated AI server (GPUs + CPUs + NVLink + networking + software) → DGX. GPU baseboard that OEMs/cloud providers build servers around → HGX.

Why: DGX is NVIDIA's ready-to-run reference system; HGX is the multi-GPU board hyperscalers integrate themselves.

GPUs in one server need faster GPU-to-GPU bandwidth than the PCIe bus provides.

Use NVLink (and NVSwitch for all-to-all) for high-bandwidth intra-node GPU interconnect; PCIe is the fallback when NVLink is unavailable.

Why: NVLink delivers far higher GPU-to-GPU bandwidth and lower latency than PCIe — critical for model-parallel and large-batch training inside a node.
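
As a rough illustration of that gap, the sketch below divides a payload by a link's bandwidth to get an idealized transfer time. The bandwidth constants are assumed nominal peak figures (about 900 GB/s aggregate for fourth-generation NVLink on an H100, about 128 GB/s bidirectional for PCIe Gen5 x16), not measurements, and the 10 GB payload is hypothetical.

```python
def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: payload divided by sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


# Assumed nominal peak figures, not measured values.
NVLINK_GB_PER_S = 900.0         # 4th-gen NVLink, aggregate per GPU (assumption)
PCIE_GEN5_X16_GB_PER_S = 128.0  # PCIe Gen5 x16, bidirectional (assumption)

payload_gb = 10.0  # hypothetical tensor exchange between two GPUs
nvlink_s = transfer_seconds(payload_gb, NVLINK_GB_PER_S)
pcie_s = transfer_seconds(payload_gb, PCIE_GEN5_X16_GB_PER_S)
speedup = pcie_s / nvlink_s

print(f"NVLink: {nvlink_s * 1000:.1f} ms, PCIe: {pcie_s * 1000:.1f} ms, ratio {speedup:.1f}x")
```

Real transfers reach only a fraction of peak bandwidth, so treat the ratio as an upper-level intuition rather than a benchmark.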

All 8 GPUs in a node must talk to each other at full NVLink bandwidth simultaneously.

NVSwitch — a non-blocking switch fabric that connects every GPU to every other GPU at full NVLink speed.

Why: Point-to-point NVLink alone does not give all-to-all bandwidth; NVSwitch provides the crossbar for full-mesh GPU communication.

Distinguish scale-up (inside a server) from scale-out (across servers) interconnect.

Scale-up GPU interconnect within a node → NVLink/NVSwitch. Scale-out across nodes in a cluster → InfiniBand (or RoCE Ethernet).

Why: NVLink is intra-node; InfiniBand connects nodes into a cluster for multi-node distributed training.

Pick the cluster fabric for large-scale distributed training where collective-op latency matters most.

Lowest latency, in-network compute (SHARP), RDMA-native → InfiniBand. Familiar, lower-cost, broad ecosystem → RoCE on Spectrum-X Ethernet.

Why: InfiniBand with SHARP offloads all-reduce into the switch, cutting collective latency; Spectrum-X is NVIDIA's Ethernet answer for AI fabrics.

Offload networking, storage, and security processing off the CPU so cores are freed for AI compute.

NVIDIA BlueField DPU — programmable data processing unit that offloads and isolates infrastructure services from the host CPU/GPU.

Why: DPUs accelerate east-west networking, NVMe-oF storage, and zero-trust security, raising effective GPU/CPU utilization and tenant isolation.

Need a high-speed RDMA NIC for GPU nodes without full DPU offload.

NVIDIA ConnectX SmartNIC — high-throughput InfiniBand/Ethernet adapter with RDMA and GPUDirect support.

Why: ConnectX gives line-rate RDMA; BlueField adds a programmable Arm subsystem on top for full infrastructure offload.

Cut latency by moving data into GPU memory without staging through the CPU/host memory.

GPUDirect RDMA — NICs read/write GPU memory directly; GPUDirect Storage does the same for NVMe storage.

Why: Bypassing the CPU bounce buffer removes copies and latency on the data path, vital for multi-node training throughput.

Pick a current-gen data-center GPU architecture for large-model training.

Hopper (H100/H200) is the established generation with Transformer Engine + FP8; Blackwell (B200/GB200) is the newer generation with higher throughput and FP4 for the largest models.

Why: Both target transformer workloads; Blackwell pushes scale and lower-precision (FP4) inference further. Match to budget and model size.

Identify the hardware that accelerates deep-learning matrix math.

Tensor Cores — specialized units that perform fused matrix-multiply-accumulate at mixed precision (FP16/BF16/FP8/FP4).

Why: They deliver order-of-magnitude higher throughput on GEMM/convolution than standard CUDA cores, which drives DL performance.

A large model does not fit in one GPU's memory, or memory capacity and bandwidth, not compute, are the bottleneck.

Choose GPUs with more and faster HBM (e.g. H200/B200 with HBM3e); use multi-GPU model parallelism when one GPU's memory is insufficient.

Why: Training/inference of large models is often memory-capacity and bandwidth bound; HBM provides the high bandwidth GPUs need.
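
A quick way to see when one GPU's memory is insufficient is to multiply parameter count by bytes per parameter. The sketch below does only that: it counts weights and ignores activations, optimizer state, and KV cache, so real requirements are higher. The 70-billion-parameter model and the 80 GB and 141 GB capacities are illustrative assumptions.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params: float, precision: str, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weights_gb(num_params, precision) / gpu_memory_gb)


params = 70e9  # hypothetical 70B-parameter model
print(weights_gb(params, "fp16"))                   # weights in GB at FP16
print(min_gpus_for_weights(params, "fp16", 80.0))   # assumed 80 GB GPU
print(min_gpus_for_weights(params, "fp16", 141.0))  # assumed 141 GB GPU
```

When the answer is more than one GPU, the model has to be split with tensor or pipeline parallelism, which is where intra-node interconnect bandwidth starts to matter.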

Stand up a turnkey, validated multi-rack AI supercomputer for enterprise training.

NVIDIA DGX SuperPOD — reference architecture of DGX nodes, InfiniBand fabric, storage, and Base Command software.

Why: SuperPOD is the pre-validated full-stack design; it removes the guesswork of wiring fabric, storage, and orchestration at scale.

Get DGX-class training capacity without owning the hardware.

NVIDIA DGX Cloud — managed AI training infrastructure hosted on major cloud providers, accessed as a service.

Why: This is an OpEx vs. CapEx trade-off. DGX Cloud suits bursty or short-term training; on-prem DGX/SuperPOD suits sustained high utilization and data-gravity constraints.

Choose on-prem GPU cluster vs. cloud GPUs for AI workloads.

Sustained high utilization, data sovereignty, predictable spend → on-prem DGX/SuperPOD. Variable/bursty demand, fast start, no data-center footprint → cloud or DGX Cloud.

Why: Owned GPUs amortize well only at high steady utilization; idle owned hardware is pure cost.
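
The utilization argument can be put into numbers with a simple break-even calculation. This is a minimal sketch: the annual owned cost per GPU and the cloud hourly price are hypothetical placeholders, and the model ignores discounts, staffing, and data-transfer costs.

```python
HOURS_PER_YEAR = 8760


def owned_cost_per_used_hour(annual_cost_per_gpu: float, utilization: float) -> float:
    """Effective cost of each GPU-hour actually used on owned hardware."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return annual_cost_per_gpu / (HOURS_PER_YEAR * utilization)


def break_even_utilization(annual_cost_per_gpu: float, cloud_price_per_hour: float) -> float:
    """Utilization above which owning is cheaper than renting."""
    return annual_cost_per_gpu / (HOURS_PER_YEAR * cloud_price_per_hour)


# Hypothetical inputs: amortized hardware + power + hosting per GPU per year,
# and an on-demand cloud price per GPU-hour.
annual_cost = 20_000.0
cloud_price = 4.0

threshold = break_even_utilization(annual_cost, cloud_price)
print(f"Owning wins above {threshold:.0%} utilization")
print(f"At 30% utilization an owned GPU-hour costs ${owned_cost_per_used_hour(annual_cost, 0.30):.2f}")
```

With these placeholder numbers, a cluster that sits at 30% utilization pays more per useful GPU-hour than the cloud price, which is the "idle owned hardware is pure cost" point in practice.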

A new GPU cluster exceeds the rack power and cooling budget of an existing data center.

Plan for high-density power (tens of kW/rack) and liquid cooling for the newest GPUs; size PDUs, busways, and thermal capacity before install.

Why: Modern GPU nodes (and GB200 racks) draw far more power and heat than legacy servers; air cooling and standard PDUs often cannot keep up.
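
The rack-budget check is simple arithmetic, sketched below. The per-node draw (roughly 10.2 kW for an 8-GPU node) and both rack budgets are assumptions chosen for illustration; use the vendor's rated maximum and your facility's real limits.

```python
def nodes_per_rack(rack_budget_kw: float, node_draw_kw: float) -> int:
    """How many nodes fit under a rack's power budget (whole nodes only)."""
    if node_draw_kw <= 0:
        raise ValueError("node draw must be positive")
    return int(rack_budget_kw // node_draw_kw)


NODE_DRAW_KW = 10.2  # assumed maximum draw of one 8-GPU node

legacy_rack = nodes_per_rack(15.0, NODE_DRAW_KW)   # hypothetical legacy rack budget
dense_rack = nodes_per_rack(40.0, NODE_DRAW_KW)    # hypothetical high-density rack budget
print(legacy_rack, dense_rack)
```

The same calculation applies to cooling: every kilowatt drawn becomes heat that the rack's thermal design has to remove.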

Training stalls because the data pipeline cannot feed GPUs fast enough.

Use high-throughput parallel/NVMe storage with GPUDirect Storage; size for sustained read bandwidth to keep GPUs saturated.

Why: Underprovisioned storage I/O leaves expensive GPUs idle waiting on data; the storage tier must match aggregate GPU read demand.
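
Sizing for sustained read bandwidth comes down to multiplying the per-GPU consumption rate by the GPU count. The sketch below assumes each GPU consumes a fixed number of samples per second and that every sample is read from storage once; the sample rate and sample size are hypothetical, and caching or compression would change the result.

```python
def required_read_gb_per_s(num_gpus: int, samples_per_s_per_gpu: float, bytes_per_sample: float) -> float:
    """Aggregate storage read bandwidth needed to keep all GPUs fed, in GB/s."""
    return num_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9


# Hypothetical workload: 64 GPUs, 2000 samples/s each, 150 KB per sample.
demand = required_read_gb_per_s(64, 2000, 150_000)
print(f"Storage must sustain about {demand:.1f} GB/s")
```

If the storage tier delivers less than this figure, the shortfall shows up directly as idle GPU time.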

A model is too large to train on a single node within an acceptable time.

Scale out to multiple nodes over InfiniBand using data/tensor/pipeline parallelism; NCCL handles the GPU collective communication.

Why: Multi-node scaling needs a low-latency fabric and an optimized collectives library (NCCL); a slow fabric kills scaling efficiency.
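
To see why the fabric dominates scaling efficiency, the sketch below uses the textbook bandwidth term of a ring all-reduce, in which each GPU sends 2(N-1)/N times the message size. It ignores latency and overlap with compute, and it is not a description of how NCCL picks or implements its algorithms. The gradient size and link speeds are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(num_gpus: int, message_bytes: float) -> float:
    """Bytes each GPU sends in a bandwidth-optimal ring all-reduce."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def allreduce_seconds(num_gpus: int, message_bytes: float, link_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate; ignores per-step latency."""
    return ring_allreduce_bytes_per_gpu(num_gpus, message_bytes) / link_bytes_per_s


grad_bytes = 1e9  # hypothetical 1 GB of gradients per step
fast = allreduce_seconds(8, grad_bytes, 50e9)  # assumed 50 GB/s per-GPU link
slow = allreduce_seconds(8, grad_bytes, 5e9)   # assumed 5 GB/s per-GPU link
print(f"fast fabric: {fast * 1000:.0f} ms, slow fabric: {slow * 1000:.0f} ms per all-reduce")
```

Because this cost is paid on every training step, a tenfold slower link adds the same tenfold communication time to each step unless it can be hidden behind compute.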

A single A100/H100 is overkill for small inference jobs; you want hardware-isolated slices.

Multi-Instance GPU (MIG) — partition one GPU into up to 7 isolated instances, each with dedicated compute and memory.

Why: MIG gives true hardware isolation and predictable QoS for multi-tenant inference, unlike soft time-slicing.

Essential AI Knowledge

Distinguish AI vs. machine learning vs. deep learning.

AI is the broad goal; ML is a subset that learns from data; DL is a subset of ML using multi-layer neural networks.

Why: They nest: DL ⊂ ML ⊂ AI. DL drives modern GPU demand because neural networks are massively parallel.

Distinguish the compute profile of training vs. inference.

Training = compute- and memory-heavy, long-running, batch, many GPUs. Inference = latency-sensitive, lighter, often single/partial GPU, runs continuously in production.

Why: They have different hardware and scaling needs; sizing a cluster requires separating the two workloads.

Pick a learning paradigm: labeled data, unlabeled data, or reward-driven trial and error.

Labeled → supervised. Unlabeled clustering/structure → unsupervised. Agent learns from reward → reinforcement learning.

Why: The data you have (and the goal) dictates the paradigm; RLHF is reinforcement learning steered by human feedback to align LLMs.

Explain why neural networks map well to GPUs.

They are layers of weighted matrix multiplications and nonlinear activations — dense parallel linear algebra that GPUs execute efficiently.

Why: Forward/backward passes are GEMM-heavy; Tensor Cores accelerate exactly this, which is why DL runs on GPUs.
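
The "weighted matrix multiplication plus nonlinear activation" structure can be shown with one dense layer in plain Python. This is a toy sketch with made-up weights; real frameworks run the same arithmetic as batched GEMMs on the GPU.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(values):
    """Elementwise nonlinear activation: max(0, v)."""
    return [max(0.0, v) for v in values]


def dense_layer(weights, bias, x):
    """One layer: activation(W @ x + b)."""
    pre_activation = [z + b for z, b in zip(matvec(weights, x), bias)]
    return relu(pre_activation)


# Made-up 2x2 layer and input for illustration.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, 1.0]
x = [2.0, 3.0]
print(dense_layer(W, b, x))
```

Every output element is an independent dot product, which is why the work spreads naturally across thousands of GPU threads.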

Identify the architecture behind modern LLMs and generative AI.

The transformer — attention-based architecture that scales with data and parameters; foundation models and LLMs are built on it.

Why: Transformers are highly parallelizable, which is why they drive demand for large GPU clusters and Transformer Engine hardware.

Speed up training and cut memory use without materially hurting accuracy.

Use mixed precision — FP16/BF16 (and FP8 on Hopper/Blackwell) for math, FP32 for accumulation; Tensor Cores accelerate the lower-precision ops.

Why: Dropping from FP32 to FP16/BF16 halves the memory per value and raises Tensor Core throughput; loss scaling (for FP16) or BF16's wider exponent range preserves numerical stability.
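
The reason FP16 needs loss scaling can be demonstrated with the standard library alone: Python's `struct` module can round a value to IEEE 754 half precision. In this sketch the gradient value and the scale factor of 65536 are hypothetical; frameworks typically choose and adjust the scale automatically.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8       # hypothetical gradient, below FP16's smallest subnormal
LOSS_SCALE = 65536.0   # hypothetical static scale (a power of two)

unscaled = to_fp16(tiny_grad)              # underflows to zero in FP16
scaled = to_fp16(tiny_grad * LOSS_SCALE)   # representable after scaling
recovered = scaled / LOSS_SCALE            # unscale in higher precision

print(unscaled, recovered)
```

Multiplying the loss by the scale shifts small gradients into FP16's representable range, and dividing afterwards in higher precision restores their magnitude. BF16 keeps FP32's exponent range, so it usually avoids this underflow without scaling.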

Name the foundation that lets software run on NVIDIA GPUs.

CUDA — NVIDIA's parallel-computing platform and programming model; CUDA-X is the library layer (cuDNN, cuBLAS, NCCL, RAPIDS, etc.).

Why: Frameworks like PyTorch/TensorFlow call CUDA-X libraries under the hood; CUDA is the moat that ties AI software to NVIDIA GPUs.

Accelerate deep-learning primitives (convolutions, attention) inside a framework.

cuDNN provides GPU-optimized DL primitives; cuBLAS handles dense linear algebra; both sit under PyTorch/TensorFlow.

Why: These libraries are why frameworks get GPU speed without you writing CUDA kernels.

Get NVIDIA-optimized, GPU-ready containers, models, and Helm charts.

NGC (NVIDIA GPU Cloud) catalog — curated registry of optimized containers (frameworks, NIM, Triton), pretrained models, and SDKs.

Why: NGC containers come tuned and tested for NVIDIA GPUs, removing dependency and driver-compatibility guesswork.

Serve many models from multiple frameworks behind one standardized, GPU-efficient endpoint.

NVIDIA Triton Inference Server — multi-framework model serving with dynamic batching, concurrent model execution, and GPU sharing.

Why: Triton maximizes GPU utilization for inference via batching and model concurrency instead of one process per model.
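
The idea behind dynamic batching can be sketched in a few lines: requests that arrive close together are grouped so the GPU processes them in one pass. This is a toy model and not Triton's scheduler; the function name, the arrival times, and both limits are hypothetical.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_delay_ms):
    """Group sorted arrival times into batches.

    A batch closes when it is full, or when the next request arrives
    more than max_delay_ms after the first request in the batch.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batches, current = [], []
    for t in arrivals_ms:
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_delay_ms
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 3, 10, 11, 30]  # hypothetical request arrival times in ms
print(dynamic_batches(arrivals, max_batch_size=4, max_delay_ms=5))
```

The delay limit is the trade-off knob: a longer wait builds bigger batches and higher throughput, at the cost of added latency for the first request in each batch.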

Deploy a foundation model as a production-ready, optimized inference microservice fast.

NVIDIA NIM — prebuilt, containerized inference microservices with optimized engines and standard APIs for popular models.

Why: NIM packages model + optimized runtime (TensorRT-LLM/Triton) + API into one deployable unit, cutting time-to-production.

Reduce inference latency and increase throughput for a trained model.

Compile the model with TensorRT (or TensorRT-LLM for LLMs) — layer fusion, precision calibration (INT8/FP8), and kernel auto-tuning.

Why: TensorRT produces an optimized inference engine for the target GPU, often multiplying throughput vs. the raw framework.

Accelerate pandas/scikit-learn-style data prep and classical ML on GPUs.

NVIDIA RAPIDS — cuDF (DataFrames), cuML (ML), cuGraph (graphs) run the data-science workflow on GPUs.

Why: RAPIDS keeps tabular ETL and classical ML on the GPU, avoiding CPU bottlenecks in the pipeline.

Manage AI workloads, jobs, and users across a DGX/SuperPOD cluster.

NVIDIA Base Command — job scheduling, cluster management, and workload orchestration for DGX infrastructure.

Why: Base Command is the operations control plane for DGX systems; it handles multi-user job submission and resource tracking.

Need supported, secure, production-grade AI software with enterprise SLAs.

NVIDIA AI Enterprise — the supported software suite (frameworks, NIM, Triton, RAPIDS, GPU Operator) with security patches and enterprise support.

Why: It bundles the validated stack with support and lifecycle guarantees, which regulated/production environments require.

Define a foundation model and how teams adapt it.

Large model pretrained on broad data, adaptable to many tasks via prompting, RAG, or fine-tuning rather than training from scratch.

Why: Adaptation (prompt/RAG/fine-tune) is far cheaper than pretraining; most enterprises consume foundation models, not build them.

Add private/current knowledge to an LLM-backed app.

Frequently changing facts → RAG (retrieve from a vector store at inference). Teach new behavior/style/domain skill → fine-tuning.

Why: RAG keeps data external and updatable without retraining; fine-tuning bakes behavior into weights and is costlier to refresh.
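
The retrieval step of RAG can be sketched as a nearest-neighbor lookup followed by prompt assembly. Everything here is a stand-in: the three-dimensional vectors are hand-written rather than produced by an embedding model, the in-memory list plays the role of a vector store, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, store, k=1):
    """Return the text of the k passages most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]


def build_prompt(question, passages):
    """Place retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy store with hand-made vectors (a real system would use an embedding model).
store = [
    {"text": "GPU quota is 8 per team.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Office closes on public holidays.", "vec": [0.0, 0.2, 0.9]},
    {"text": "Cluster maintenance is on Sundays.", "vec": [0.3, 0.9, 0.1]},
]
top = retrieve([1.0, 0.2, 0.0], store, k=2)
print(build_prompt("How many GPUs can my team use?", top))
```

Updating the knowledge means editing the store, not the model, which is why RAG suits frequently changing facts.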

Judge whether expensive GPUs are being used efficiently.

Track GPU utilization, memory usage, and SM/Tensor-Core activity; low utilization signals data-pipeline, batch-size, or scheduling bottlenecks.

Why: A GPU can report high utilization while doing little useful compute, because the utilization gauge only shows that some kernel was running; check SM and Tensor Core activity, not just the utilization gauge.

AI Operations

Monitor GPU health, utilization, temperature, power, and errors across a cluster.

NVIDIA DCGM (Data Center GPU Manager) — telemetry, health checks, and diagnostics; export metrics to Prometheus/Grafana.

Why: DCGM is the standard GPU telemetry source; the DCGM Exporter feeds Prometheus for cluster-wide dashboards and alerts.

Provision GPU drivers, the container toolkit, and monitoring on a Kubernetes cluster without per-node manual setup.

NVIDIA GPU Operator — automates driver, container runtime, device plugin, DCGM, and MIG configuration on Kubernetes.

Why: It manages the full GPU software lifecycle declaratively, removing fragile node-by-node driver installs.

Pick an orchestrator for GPU workloads.

Microservices/inference, cloud-native, mixed workloads → Kubernetes. Batch HPC-style training jobs, gang scheduling, traditional clusters → Slurm.

Why: Kubernetes excels at long-running services and elasticity; Slurm excels at queued batch jobs with MPI-style scheduling.

Kubernetes pods need to request and be scheduled onto GPUs.

The NVIDIA device plugin advertises GPUs as schedulable resources; pods request `nvidia.com/gpu` and the scheduler places them.

Why: Without the device plugin, Kubernetes cannot see or allocate GPUs; it is what makes GPUs a first-class resource.
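
A pod asks for GPUs by listing `nvidia.com/gpu` under its container resource limits. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The pod name and image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are an extended resource and are set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-test", "example.com/my-trainer:latest", 1)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied to a cluster where the device plugin is running; on a cluster without it, the pod would stay pending because no node advertises the resource.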

Many small jobs/users must share GPUs to raise utilization.

Hardware isolation → MIG. Soft sharing of one GPU → time-slicing or MPS. Combine with namespace quotas for fairness.

Why: MIG gives QoS guarantees; time-slicing/MPS oversubscribe a GPU without isolation. Pick per the isolation requirement.

High-priority training must preempt low-priority experiments on a shared cluster.

Use priority/preemption and queues in the scheduler (Slurm partitions or Kubernetes PriorityClasses with quota); gang-schedule multi-GPU jobs.

Why: Gang scheduling starts a multi-GPU job only when all of its workers can be placed, which prevents partial-allocation deadlocks; priority classes enforce business order on contended GPUs.
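
The all-or-nothing rule of gang scheduling can be sketched with a first-fit placement that either places every worker or places none. This is a toy model, not the algorithm of Slurm or any Kubernetes scheduler; the node names and GPU counts are hypothetical.

```python
def gang_schedule(free_gpus_by_node, gpus_per_worker, num_workers):
    """Place all workers or none.

    Returns a list of node names (one per worker), or None if the whole
    gang cannot be placed. The caller's free-GPU map is not modified.
    """
    remaining = dict(free_gpus_by_node)  # work on a copy
    placement = []
    for _ in range(num_workers):
        node = next((n for n, free in remaining.items() if free >= gpus_per_worker), None)
        if node is None:
            return None  # refuse a partial allocation
        remaining[node] -= gpus_per_worker
        placement.append(node)
    return placement


cluster = {"node-a": 8, "node-b": 4}  # hypothetical free GPUs per node
print(gang_schedule(cluster, gpus_per_worker=4, num_workers=3))
print(gang_schedule(cluster, gpus_per_worker=4, num_workers=4))
```

Returning nothing instead of a partial placement is the point: two half-placed jobs that each hold GPUs while waiting for more would block each other indefinitely.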

Keep GPU drivers, CUDA, and container toolkit versions consistent and compatible across nodes.

Standardize via the GPU Operator (Kubernetes) or NGC containers; match driver to the CUDA versions your frameworks need and roll updates in maintenance windows.

Why: Driver/CUDA/framework mismatches are a top cause of cluster failures; pinning the CUDA toolkit inside the container decouples the application from the host, provided the host driver is new enough for that CUDA version.

Size a GPU cluster for forecasted training and inference demand.

Separate training (peak, batch) from inference (sustained, latency-bound); plan power/cooling/fabric headroom and target high steady utilization.

Why: Oversizing wastes CapEx on idle GPUs; undersizing throttles delivery. Plan to the workload mix, not a single peak.
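
Sizing the two workloads separately can be sketched as two small formulas: training from monthly GPU-hours at a target utilization, inference from peak request rate with headroom. All inputs below are hypothetical, and the headroom and utilization targets are assumptions to tune for your own environment.

```python
import math


def training_gpus(gpu_hours_per_month: float, hours_per_month: float = 720.0,
                  target_utilization: float = 0.8) -> int:
    """GPUs needed to deliver the forecast GPU-hours at the target utilization."""
    return math.ceil(gpu_hours_per_month / (hours_per_month * target_utilization))


def inference_gpus(peak_qps: float, qps_per_gpu: float, headroom: float = 0.7) -> int:
    """GPUs needed so peak load uses only `headroom` of each GPU's capacity."""
    return math.ceil(peak_qps / (qps_per_gpu * headroom))


# Hypothetical forecast.
train = training_gpus(50_000)                         # 50k GPU-hours per month
serve = inference_gpus(peak_qps=1200, qps_per_gpu=150)
print(f"training: {train} GPUs, inference: {serve} GPUs, total: {train + serve}")
```

Keeping the two numbers separate makes it visible which workload drives the purchase and which pool could borrow idle capacity from the other.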

GPUs throttle or fail under sustained heavy load.

Monitor temperature and power via DCGM; ensure adequate cooling (liquid for dense racks), set sane power limits, and alert on thermal thresholds.

Why: Thermal throttling silently cuts throughput; proactive telemetry and cooling design protect both performance and hardware lifespan.
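
Alerting on a thermal threshold can be sketched as a scan over exported metrics. The sample below is hand-written in Prometheus text format using the `DCGM_FI_DEV_GPU_TEMP` metric; the label set, host names, readings, and the 85 °C threshold are assumptions for illustration, and a production setup would normally express this as a Prometheus alert rule instead.

```python
import re

# Hand-written sample in Prometheus text format (illustrative values only).
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 88
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-b"} 71
"""

LINE = re.compile(r'^DCGM_FI_DEV_GPU_TEMP\{(?P<labels>[^}]*)\}\s+(?P<value>[0-9.]+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def hot_gpus(metrics_text: str, threshold_c: float):
    """Return (host, gpu, temperature) for every GPU at or above the threshold."""
    hot = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip comments and other metrics
        labels = dict(LABEL.findall(match.group("labels")))
        temp = float(match.group("value"))
        if temp >= threshold_c:
            hot.append((labels.get("Hostname"), labels.get("gpu"), temp))
    return hot


print(hot_gpus(SAMPLE, threshold_c=85.0))  # hypothetical alert threshold
```

The same pattern extends to power draw or error counters by matching a different metric name.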

Deliver GPU acceleration to multiple VMs or VDI users from shared hardware.

NVIDIA vGPU software partitions a physical GPU across VMs with scheduling and isolation; MIG can back vGPU profiles for hard partitioning.

Why: vGPU enables virtualized, multi-tenant GPU access (VDI, cloud); plain PCIe passthrough dedicates a whole GPU to a single VM and cannot share it.

A node returns Xid errors or failed jobs; you must isolate bad GPUs before they corrupt more runs.

Run DCGM diagnostics and active health checks; cordon/drain the node, replace or reset the GPU, and only then return it to the pool.

Why: Xid errors and ECC faults flag failing GPUs; automated health gating keeps a sick GPU from poisoning the scheduling pool.