Existing pandas pipeline on a 40 GB CSV is too slow on CPU.
→Swap pandas for cuDF; most read/filter/groupby/join calls keep the same API and run on the GPU.
Why: cuDF mirrors the pandas API by design, so migration is mostly an import change rather than a rewrite.
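A minimal sketch of the import swap, assuming a hypothetical sales CSV with region and amount columns. The pandas fallback, the alias xdf, and the helper to_host are additions for this example so it also runs on a machine without a GPU; they are not part of cuDF.

```python
import os
import tempfile

try:
    import cudf as xdf      # GPU DataFrame library
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Hypothetical sample data standing in for the large CSV.
csv_text = "region,amount\neast,10\nwest,25\neast,5\nwest,40\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sales.csv")
    with open(path, "w") as fh:
        fh.write(csv_text)
    df = xdf.read_csv(path)             # same call as pandas.read_csv

big = df[df["amount"] > 8]              # boolean-mask filter
totals = big.groupby("region", sort=True)["amount"].sum()
print(to_host(totals))
```

Only the import line differs between the CPU and GPU versions; the read, filter, and groupby calls are identical.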
Team wants GPU speedups without touching existing pandas code.
→Load the cudf.pandas accelerator (%load_ext cudf.pandas in a notebook, or python -m cudf.pandas when launching a script); it runs supported ops on the GPU and falls back to pandas on the CPU for the rest.
Why: Zero-code-change acceleration with transparent CPU fallback keeps unsupported ops working.
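A sketch of what "zero code change" looks like in practice: the script below is ordinary pandas and contains no cuDF imports. The file name pipeline.py and the sample data are hypothetical.

```python
# pipeline.py: plain pandas code, unchanged for GPU use.
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
means = df.groupby("key")["value"].mean()
print(means)

# Run on CPU as usual:       python pipeline.py
# Run with the accelerator:  python -m cudf.pandas pipeline.py
# In a notebook, run `%load_ext cudf.pandas` before `import pandas`.
```

The accelerator is chosen at launch time, so the same file serves both CPU-only and GPU machines.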
Need the fastest columnar load of a large analytics dataset on GPU.
→Store as Parquet and read with cudf.read_parquet; column pruning and predicate pushdown minimize device transfer.
Why: Columnar Parquet maps cleanly to Arrow-backed cuDF and reads far faster than row-oriented CSV.
cuDF is slower than pandas on a 50 MB file.
→Keep small data on CPU; below roughly 1–2 GB (a rule of thumb that varies with hardware and workload), host-to-device transfer and kernel-launch overhead tend to dominate.
Why: GPU acceleration pays off at scale; for tiny data the copy cost exceeds the compute win.
Aggregate billions of rows by key with multiple statistics.
→Use df.groupby(key).agg({...}) in cuDF; aggregations run as parallel GPU kernels.
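A small sketch of a multi-statistic aggregation, assuming hypothetical key, amount, and qty columns. The pandas fallback and the names xdf and to_host are additions for this example. sort=True is passed explicitly because cuDF, unlike pandas, does not sort group keys by default.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


df = xdf.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0],
    "qty": [1, 1, 2, 2, 3],
})

# One pass over the groups, several statistics per column.
stats = df.groupby("key", sort=True).agg(
    {"amount": ["sum", "mean"], "qty": "max"}
)
print(to_host(stats))
```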
Clean and normalize a high-cardinality text column at GPU scale.
→Use cuDF's .str accessor (lower, strip, replace, contains, split); string ops are GPU-accelerated via libcudf.
Why: cuDF has a dedicated GPU string layer, so text cleaning need not fall back to CPU.
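A short sketch of chained .str cleaning, assuming a hypothetical column of company names. The pandas fallback and the names xdf and to_host are additions for this example. The regex flag is set explicitly on each call so the behavior does not depend on library defaults.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


names = xdf.Series(["  Acme   Inc ", "ACME inc", "Globex  Corp"])

clean = (
    names.str.strip()                           # trim outer whitespace
         .str.lower()                           # normalize case
         .str.replace(r"\s+", " ", regex=True)  # collapse inner whitespace
)
is_inc = clean.str.contains("inc", regex=False)  # literal substring match
print(to_host(clean))
```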
Join two large device DataFrames on a shared key.
→Use cudf.merge / df.merge with the join key; hash joins execute on the GPU.
Why: Both frames must already be on the device to avoid a round-trip; a pandas frame has to be converted first (cudf.from_pandas), which costs a host-to-device copy.
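A minimal sketch of a key join, assuming hypothetical orders and customers tables. The pandas fallback and the names xdf and to_host are additions for this example. The result is sorted explicitly because row order after a GPU join should not be relied on.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Both frames are built with the same library, so under cuDF both live
# on the device. An existing pandas frame would need cudf.from_pandas.
orders = xdf.DataFrame({"customer_id": [1, 2, 2, 3],
                        "amount": [10, 20, 30, 40]})
customers = xdf.DataFrame({"customer_id": [1, 2, 4],
                           "name": ["ann", "bob", "dee"]})

joined = orders.merge(customers, on="customer_id", how="inner")
joined = joined.sort_values(["customer_id", "amount"]).reset_index(drop=True)
print(to_host(joined))
```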
Dataset has missing values that break downstream cuML training.
→Use cuDF fillna/dropna and explicit dtype casts before fitting; cuML expects clean numeric device arrays.
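A sketch of the cleanup step before model fitting, assuming a hypothetical table with age, income, and label columns. The imputation choices (mean, median) are illustrative, not a recommendation. The pandas fallback and the names xdf and to_host are additions for this example; the cuML fit itself is left out.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


df = xdf.DataFrame({
    "age": [31.0, None, 45.0, 28.0],
    "income": [50000.0, 62000.0, None, 41000.0],
    "label": [0, 1, 1, None],
})

df = df.dropna(subset=["label"])        # rows without a target are unusable
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Explicit casts so the estimator receives compact numeric dtypes.
X = df[["age", "income"]].astype("float32")
y = df["label"].astype("int32")
print(to_host(X))
```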
Mixed/object dtypes cause errors or memory bloat in cuDF.
→Cast to compact numeric or categorical dtypes (int32/float32, category) early to shrink GPU memory footprint.
Why: Downcasting reduces device-memory pressure, the most common bottleneck on a single GPU.
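A small sketch that measures the effect of downcasting, assuming hypothetical user_id, score, and country columns whose values fit the narrower types. The pandas fallback and the alias xdf are additions for this example; actual savings depend on the data.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf

n = 10_000
df = xdf.DataFrame({
    "user_id": list(range(n)),                      # int64 by default
    "score": [float(i % 100) for i in range(n)],    # float64 by default
    "country": ["us", "de", "jp", "br"] * (n // 4),  # repeated strings
})

before = int(df.memory_usage(deep=True).sum())

df["user_id"] = df["user_id"].astype("int32")
df["score"] = df["score"].astype("float32")
df["country"] = df["country"].astype("category")    # codes plus a small lookup

after = int(df.memory_usage(deep=True).sum())
print(f"bytes before: {before}, after: {after}")
```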
Need label/one-hot encoding for categorical features before training.
→Use cuDF categorical dtype with .cat.codes or cuML preprocessing encoders to keep data on-device.
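A minimal sketch of label encoding through .cat.codes, assuming a hypothetical color column. The one-hot step uses get_dummies, which the card does not name; it is shown here as one DataFrame-level option alongside the cuML encoders. The pandas fallback and the names xdf and to_host are additions for this example.

```python
try:
    import cudf as xdf
except ImportError:         # fallback added for this sketch only
    import pandas as xdf


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


df = xdf.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Label encoding: one integer code per distinct category.
codes = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct category.
one_hot = xdf.get_dummies(df, columns=["color"])

print(to_host(codes))
print(to_host(one_hot))
```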
Need raw numeric array math not exposed by the cuDF DataFrame API.
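→Convert via df.values or to_cupy() and operate with CuPy (NumPy-compatible GPU arrays), then assign the result back to a cuDF column.
Why: cuDF and CuPy exchange device memory through __cuda_array_interface__, so data stays on the GPU; a single null-free numeric column can be shared without a copy, while a multi-column frame is typically copied on-device into one contiguous array.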
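A short sketch of dropping to array math and back, assuming hypothetical x and y columns and a row-wise Euclidean norm as the operation. The pandas/NumPy fallback and the names xdf, xp, and to_host are additions for this example so it also runs without a GPU.

```python
try:
    import cudf as xdf
    import cupy as xp
except ImportError:         # fallback added for this sketch only
    import pandas as xdf
    import numpy as xp


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


df = xdf.DataFrame({"x": [3.0, 6.0], "y": [4.0, 8.0]})

# Under cuDF this is a CuPy array on the device; under the fallback, NumPy.
arr = df[["x", "y"]].values

# Array math that has no direct DataFrame method: row-wise Euclidean norm.
norms = xp.sqrt((arr ** 2).sum(axis=1))

# Bring the result back as a new column.
df["norm"] = norms
print(to_host(df))
```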