Playbook — NCA-ADS NVIDIA-Certified Associate: Accelerated Data Science

Last reviewed: June 2026

A scannable reference of architectural patterns the NCA-ADS exam tests. Read top-to-bottom, or jump to a section.

Data Manipulation and Preparation

Existing pandas pipeline on a 40 GB CSV is too slow on CPU.

Swap pandas for cuDF; most read/filter/groupby/join calls keep the same API and run on the GPU.

Why: cuDF mirrors the pandas API by design, so migration is mostly an import change rather than a rewrite.

Team wants GPU speedups without touching existing pandas code.

Load the cudf.pandas accelerator (%load_ext cudf.pandas in a notebook, or python -m cudf.pandas script.py on the command line); it runs supported ops on the GPU and falls back to CPU pandas automatically.

Why: Zero-code-change acceleration with transparent CPU fallback keeps unsupported ops working.
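
The same accelerator can also be switched on from inside a script. A minimal sketch, assuming RAPIDS may or may not be installed on the machine; the helper name enable_gpu_pandas is hypothetical, while cudf.pandas.install() is the documented programmatic entry point.

```python
def enable_gpu_pandas() -> bool:
    """Try to switch pandas to the cudf.pandas accelerator; report whether it worked."""
    try:
        import cudf.pandas  # only importable where RAPIDS is installed
    except ImportError:
        return False
    cudf.pandas.install()  # must run before pandas is first imported
    return True


accelerated = enable_gpu_pandas()
print("cudf.pandas active:", accelerated)
# Existing code continues unchanged from here: `import pandas as pd`, and so on.
```

On a machine without RAPIDS the helper returns False and the script keeps running on plain pandas.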

Need the fastest columnar load of a large analytics dataset on GPU.

Store as Parquet and read with cudf.read_parquet; column pruning and predicate pushdown minimize device transfer.

Why: Columnar Parquet maps cleanly to Arrow-backed cuDF and reads far faster than row-oriented CSV.

cuDF is slower than pandas on a 50 MB file.

Keep small data on CPU; host-to-device transfer and kernel-launch overhead dominate below ~1–2 GB.

Why: GPU acceleration pays off at scale; for tiny data the copy cost exceeds the compute win.
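
A back-of-envelope model of that trade-off. Every constant here is an assumed, illustrative value rather than a measurement, and the crossover point moves with them; the sketch only shows why a fixed overhead plus a copy cost can outweigh faster compute on small inputs.

```python
# All four constants are assumptions chosen for illustration only.
PCIE_BYTES_PER_S = 16e9   # assumed effective host-to-device bandwidth
FIXED_OVERHEAD_S = 2.0    # assumed one-off cost: context setup, library load, launches
CPU_BYTES_PER_S = 0.5e9   # assumed pandas processing rate
GPU_BYTES_PER_S = 20e9    # assumed cuDF processing rate once data is on the device


def cpu_seconds(n_bytes: float) -> float:
    return n_bytes / CPU_BYTES_PER_S


def gpu_seconds(n_bytes: float) -> float:
    transfer = n_bytes / PCIE_BYTES_PER_S
    compute = n_bytes / GPU_BYTES_PER_S
    return FIXED_OVERHEAD_S + transfer + compute


for label, size in [("50 MB", 50e6), ("40 GB", 40e9)]:
    winner = "GPU" if gpu_seconds(size) < cpu_seconds(size) else "CPU"
    print(f"{label}: cpu={cpu_seconds(size):.2f}s gpu={gpu_seconds(size):.2f}s -> {winner}")
```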

Aggregate billions of rows by key with multiple statistics.

Use df.groupby(key).agg({...}) in cuDF; aggregations run as parallel GPU kernels.

Clean and normalize a high-cardinality text column at GPU scale.

Use cuDF's .str accessor (lower, strip, replace, contains, split); string ops are GPU-accelerated via libcudf.

Why: cuDF has a dedicated GPU string layer, so text cleaning need not fall back to CPU.

Join two large device DataFrames on a shared key.

Use cudf.merge / df.merge with the join key; hash joins execute on the GPU.

Why: Both frames must already be on the device to avoid a round-trip; mixing pandas and cuDF forces a host copy.

Dataset has missing values that break downstream cuML training.

Use cuDF fillna/dropna and explicit dtype casts before fitting; cuML expects clean numeric device arrays.

Mixed/object dtypes cause errors or memory bloat in cuDF.

Cast to compact numeric or categorical dtypes (int32/float32, category) early to shrink GPU memory footprint.

Why: Downcasting reduces device-memory pressure, the most common bottleneck on a single GPU.
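
A quick estimate of what downcasting saves. This is a sketch under stated assumptions: the schema and row count are hypothetical, null masks and string columns are ignored, and narrowing is only safe when the values actually fit the smaller type.

```python
# Bytes per value for fixed-width dtypes.
ITEMSIZE = {"int64": 8, "int32": 4, "int8": 1, "float64": 8, "float32": 4}


def column_bytes(n_rows: int, schema: dict[str, str]) -> int:
    """Estimate raw data bytes for fixed-width columns (ignores null masks)."""
    return sum(n_rows * ITEMSIZE[dtype] for dtype in schema.values())


n_rows = 1_000_000_000
# Hypothetical schema before and after downcasting.
wide = {"user_id": "int64", "amount": "float64", "flag": "int64"}
narrow = {"user_id": "int32", "amount": "float32", "flag": "int8"}

print(column_bytes(n_rows, wide) / 1e9, "GB before")
print(column_bytes(n_rows, narrow) / 1e9, "GB after")
```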

Need label/one-hot encoding for categorical features before training.

Use cuDF categorical dtype with .cat.codes or cuML preprocessing encoders to keep data on-device.

Need raw numeric array math not exposed by the cuDF DataFrame API.

Convert via df.values or to_cupy() and operate with CuPy (NumPy-compatible GPU arrays), then bring results back.

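Why: cuDF columns expose device memory through the __cuda_array_interface__, so CuPy can usually wrap them without copying; a multi-column to_cupy() call or null handling may allocate a new device array, but the data never leaves the GPU.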

Machine Learning With RAPIDS

Port a scikit-learn training script to GPU.

Use cuML estimators (LinearRegression, LogisticRegression, KMeans, RandomForest); fit/predict mirror the sklearn API.

Why: cuML targets sklearn API compatibility, so swapping the import is usually enough.

Gradient-boosted trees on a large tabular dataset, training too slow on CPU.

Train XGBoost with device="cuda" (tree_method="hist"); it consumes cuDF/CuPy data directly.

Why: XGBoost's native GPU histogram method gives large speedups and integrates tightly with RAPIDS.
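
A minimal sketch of that configuration, assuming XGBoost 2.0 or later (where the device parameter is available). Only device and tree_method come from the entry above; the other hyperparameters and the function name are illustrative.

```python
# device and tree_method come from the entry above; the rest are illustrative values.
GPU_PARAMS = {"device": "cuda", "tree_method": "hist", "max_depth": 6, "n_estimators": 200}


def train_gpu_classifier(X, y):
    """Fit an XGBoost classifier on the GPU; X and y may be cuDF or CuPy objects."""
    import xgboost as xgb  # imported lazily so this module loads without XGBoost

    model = xgb.XGBClassifier(**GPU_PARAMS)
    model.fit(X, y)
    return model
```

Passing device-resident X and y avoids an extra host-to-device copy at fit time.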

Cluster millions of points fast for segmentation.

Use cuML KMeans (or DBSCAN for density-based); both run fully on the GPU.

Reduce high-dimensional data to 2D for visualization at scale.

Use cuML UMAP or t-SNE; GPU implementations handle datasets that are impractical on CPU.

Why: UMAP/t-SNE are compute-heavy; the GPU versions make interactive-scale embeddings feasible.

Need an accurate ensemble classifier with feature importances.

Use cuML RandomForestClassifier; train on device arrays and export to FIL for fast inference.

Deploy a tree model for high-throughput batch scoring.

Load the model into the Forest Inference Library (FIL) to run GPU-accelerated predictions on large batches.

Why: FIL accelerates inference for XGBoost/LightGBM/cuML forests far beyond per-tree CPU scoring.

An algorithm you need has no cuML GPU implementation.

Confirm coverage in the cuML docs; if absent, keep that step on scikit-learn and accelerate the rest.

Why: Not every estimator is GPU-backed — know the supported set rather than assuming full parity.

Avoid silent host copies during cuML training.

Pass cuDF/CuPy device data directly to fit(); mixing in NumPy/pandas triggers a host-to-device transfer.

Data Science Pipelines and Workflow Automation

Dataset is larger than a single GPU's memory.

Use dask-cuDF to partition the data across multiple GPUs/nodes and process partitions in parallel.

Why: Dask handles out-of-core and multi-GPU distribution that a single cuDF frame cannot.

Want to use all GPUs on one multi-GPU box.

Start a LocalCUDACluster from dask-cuda and connect a Client; one worker is pinned per GPU.

Why: LocalCUDACluster wires each Dask worker to a distinct GPU so the scheduler can balance work.

Building a multi-step Dask pipeline that recomputes too often.

Compose lazily and call .compute() once at the end; use persist() to cache reused intermediates in GPU memory.

Why: Dask is lazy — triggering compute too early or repeatedly redoes work.
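
One way this pattern might look with dask-cuDF. A sketch only: the column names amount and day and the function name are hypothetical, and a Dask cluster with GPU workers is assumed to be running already.

```python
def daily_totals(parquet_path: str):
    """Filter once, reuse the filtered frame twice, and compute both results together."""
    import dask
    import dask_cudf  # imported lazily; needs a RAPIDS environment

    ddf = dask_cudf.read_parquet(parquet_path)   # lazy: nothing is read yet
    clean = ddf[ddf["amount"] > 0].persist()     # cache the reused intermediate
    totals = clean.groupby("day")["amount"].sum()
    counts = clean.groupby("day")["amount"].count()
    # A single compute call evaluates both graphs and shares the cached input.
    return dask.compute(totals, counts)
```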

Skewed partitions cause some GPU workers to lag.

Repartition to balanced sizes and align partition keys with downstream joins/groupbys.

Why: Uneven partitions create stragglers that bottleneck the whole job.
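
A simple skew check to run before deciding to repartition. The sketch assumes the per-partition row counts have already been collected (for a Dask DataFrame, ddf.map_partitions(len).compute() is one way to get them); the ratio itself is an illustrative heuristic, not a Dask metric.

```python
from statistics import mean


def skew_ratio(partition_rows: list[int]) -> float:
    """Largest partition divided by the mean; 1.0 means perfectly balanced."""
    return max(partition_rows) / mean(partition_rows)


balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

print(skew_ratio(balanced))
print(skew_ratio(skewed))
```

A ratio well above 1 means one worker holds far more rows than the rest and will finish last.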

Keep an ETL → train → score workflow entirely on GPU.

Chain cuDF prep into cuML/XGBoost without converting to pandas in between, keeping data resident on the device.

Why: Every CPU round-trip adds transfer cost; staying on-device preserves the speedup end to end.

Need a workflow that reruns identically for review.

Pin RAPIDS/CUDA versions, set random seeds, and parameterize inputs so the pipeline is deterministic and re-executable.

Descriptive Analysis and Visualization

Compute summary statistics across a billion-row table.

Use cuDF describe/mean/std/quantile and corr; aggregations run as GPU kernels.

Scatter plot of 100M points overplots and is unreadable.

Render with Datashader, which rasterizes the points into a density image instead of drawing each marker; given a cuDF DataFrame, it can run the aggregation on the GPU.

Why: Datashader aggregates into pixels, so plot cost is bounded by image size, not point count.
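
The idea of aggregating into pixels can be shown in plain Python. This is an illustration of the concept only, not Datashader's implementation or API; the function name and sample points are made up.

```python
def rasterize(points, width, height, x_range, y_range):
    """Count points per pixel; returns a height x width grid of ints."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # outside the viewport
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.95, 0.8), (0.5, 2.0)]
grid = rasterize(points, width=2, height=2, x_range=(0, 1), y_range=(0, 1))
print(grid)
```

However many points go in, the output is always width x height counts, which is what the renderer then colors.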

Need an interactive cross-filtering dashboard over a huge GPU DataFrame.

Use cuxfilter to link charts with GPU-accelerated cross-filtering on cuDF data.

Why: cuxfilter keeps the data on-device so brushing/filtering stays interactive at scale.

Visualize the distribution of a large numeric column.

Bin with cuDF/CuPy on GPU, then plot the small aggregated result with Plotly or Matplotlib.

Why: Aggregate first on GPU; only the tiny summary needs to reach the plotting library.

Assess feature relationships before modeling.

Compute df.corr() in cuDF on GPU, then render the small matrix as a heatmap.

Want declarative interactive charts backed by GPU data.

Pair HoloViews/hvPlot with Datashader and cuDF for high-volume, interactive visualizations.

Foundations of Accelerated Data Science

Justify GPU acceleration for a data workload.

Use GPUs for massively data-parallel, throughput-bound ops over large datasets; keep small, branchy, or latency-sensitive work on CPU.

Why: GPUs win on SIMT parallelism across many elements; they lose on small or control-heavy tasks.

Explain how RAPIDS shares data across cuDF, CuPy, and ML libs without copies.

RAPIDS is built on the Apache Arrow columnar memory format, enabling zero-copy interchange between GPU libraries.

Why: A shared on-device columnar layout lets components hand off data without serialization.

A pipeline is GPU-accelerated but barely faster.

Profile data movement; repeated host↔device copies often dominate. Keep data resident on the GPU between steps.

Why: PCIe transfer is the hidden tax — minimizing copies is usually the biggest single win.

Understand what executes work on the GPU.

CUDA launches kernels across thousands of threads grouped into blocks/grids under the SIMT model; RAPIDS libraries wrap these so you rarely write kernels yourself.

Workload errors out with out-of-memory on a single GPU.

Reduce dtype sizes, process in chunks, or scale out with Dask; GPU VRAM is far smaller than host RAM.

Why: Device memory is the first constraint in GPU data science — design around it.
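
A rough way to size chunks against device memory. The usable_fraction default is an assumption (headroom for intermediates such as join and groupby buffers), and the function name is hypothetical.

```python
import math


def plan_chunks(dataset_bytes: int, vram_bytes: int, usable_fraction: float = 0.5) -> int:
    """Number of equal chunks so each fits in the usable share of device memory."""
    budget = vram_bytes * usable_fraction
    return max(1, math.ceil(dataset_bytes / budget))


GB = 10**9
print(plan_chunks(40 * GB, 16 * GB))
```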

Map a CPU data-science task to the right RAPIDS library.

cuDF for DataFrames, cuML for ML, cuGraph for graphs, cuSpatial for geospatial, Dask for scale-out.

Introductory MLOps Practices

Need to compare many training runs and their metrics.

Log params, metrics, and artifacts to MLflow Tracking; query and compare runs from the UI.

Why: Centralized experiment tracking makes results reproducible and comparable across runs.

Want live dashboards and team-shared experiment logs.

Use Weights & Biases (wandb.init/log) to stream metrics and share visual experiment dashboards.

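Track which trained model is staging vs production.

Register each model version in a model registry (e.g., the MLflow Model Registry) and record there which version is in staging and which is in production.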

Why: A registry gives a single source of truth for model lineage and promotion.

A model can't be reproduced months later.

Version data, code, environment, and seeds together; log the full config with each run.

Why: Reproducibility requires capturing all four — code alone is not enough.
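
One lightweight way to tie the four together is to log a fingerprint of the full run config. A sketch with hypothetical field names and values; any tracking tool could store the resulting hash alongside the run.

```python
import hashlib
import json


def run_fingerprint(config: dict) -> str:
    """Stable SHA-256 over a run config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config = {
    "data_version": "2026-05-01",      # hypothetical dataset snapshot tag
    "code_commit": "abc1234",          # hypothetical git commit
    "environment": "rapids-env.lock",  # hypothetical lock-file name
    "seed": 42,
}
print(run_fingerprint(config)[:12])
```

Two runs with the same fingerprint used the same data, code, environment, and seed; a changed seed or dataset tag yields a different hash.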

Move a trained model toward serving.

Package the model and dependencies (e.g., container image), then expose batch or REST inference; use FIL for fast GPU tree scoring.

Advanced Data Structures

Rank nodes by influence in a large graph.

Build a cuGraph Graph from an edge list and run cugraph.pagerank on the GPU.

Why: cuGraph runs PageRank, BFS, and centrality on graphs too large for CPU libraries.
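
For intuition about what PageRank computes, here is a small CPU-only power iteration. It is an illustrative sketch, not cuGraph's implementation; the damping factor and iteration count are conventional defaults chosen for the example, and the edge list is made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
scores = pagerank(edges)
print(max(scores, key=scores.get))
```

The scores sum to 1, and the node with the most incoming rank comes out on top.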

Find clusters/communities in a network dataset.

Use cuGraph connected-components or Louvain; ingest edges from a cuDF DataFrame.

Data is high-dimensional and mostly zeros.

Use GPU sparse formats (CSR/COO via CuPy sparse) instead of dense arrays to fit memory and speed compute.

Why: Sparse storage avoids wasting VRAM and kernels on zero entries.
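
The CSR layout itself is easy to see in plain Python. This sketch only illustrates the storage format; on the GPU the equivalent container would come from CuPy's sparse module (cupyx.scipy.sparse) rather than from this hypothetical helper.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # nonzero values, row by row
                indices.append(col)   # column of each stored value
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
```

Only the three nonzero values are stored, plus their column indices and one row pointer per row.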

Software and Environment Management

Set up a working RAPIDS environment.

Install via conda, pip, or Docker using the RAPIDS Release Selector to match your CUDA/Python versions.

Why: The selector pins compatible package builds; version mismatches are the most common source of install failures.

RAPIDS import fails or sees no GPU after install.

Verify the NVIDIA driver and CUDA toolkit versions satisfy the RAPIDS build requirements; run nvidia-smi to confirm the GPU.

Why: Driver/CUDA mismatch is the top cause of "no CUDA device" errors.

Want a reproducible, preconfigured RAPIDS environment.

Pull the RAPIDS container from NVIDIA NGC; it ships a matched CUDA runtime and RAPIDS libraries. The host still needs a compatible NVIDIA driver and the NVIDIA Container Toolkit.

Why: NGC images remove version-matching guesswork and standardize the environment across machines.
