Pick a visual data-prep tool.
→ML-focused, integrated with SageMaker Studio, flow exportable to a Processing job, Pipeline, or notebook → SageMaker Data Wrangler. Generic data cleaning with reusable recipes and profiling, no SageMaker dependency → AWS Glue DataBrew. 50 TB+ of Spark with custom code → Amazon EMR.
Why: Data Wrangler is the SageMaker-native option (300+ transforms, datetime extraction, exports to Pipeline/Processing). DataBrew is recipe-based and source-agnostic. EMR handles scale and arbitrary Spark.
Catalog data across S3, RDS, DynamoDB so analysts and SageMaker can discover datasets.
→AWS Glue Crawlers populate the AWS Glue Data Catalog with schemas + metadata. Athena, Redshift Spectrum, and SageMaker all consume it.
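A minimal boto3 sketch, assuming a Glue service role already exists; the crawler name, role ARN, database, and S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and register the inferred schema in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional nightly re-crawl
)
glue.start_crawler(Name="sales-crawler")
```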
Need column- and row-level access control on the data lake with audit logging.
→AWS Lake Formation. IAM and S3 bucket policies do not provide column-level granularity on structured data.
Why: Lake Formation centralizes governance for the Glue Data Catalog and integrates with CloudTrail for audit.
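A hedged boto3 sketch of a column-level grant (row-level filtering uses Lake Formation data cell filters); the principal ARN, database, table, and column names are placeholders:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on only two columns of a catalog table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "analytics",
            "Name": "customers",
            "ColumnNames": ["customer_id", "segment"],
        }
    },
    Permissions=["SELECT"],
)
```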
Run ad-hoc SQL on S3 data without provisioning anything.
→Amazon Athena. Serverless, pay-per-TB-scanned. Partition data and use Parquet to cut cost and time.
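Sketch with boto3; the table, database, partition column, and results bucket are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Partition filter + Parquet keep the bytes scanned (and the bill) low.
resp = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM sales
        WHERE dt = '2024-06-01'   -- partition pruning
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])
```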
50 TB of feature engineering with existing PySpark code, must finish in 4 hours.
→Amazon EMR with Spark. Tunable cluster size, Spot support, runs the existing code unchanged.
Why: Glue ETL also runs Spark but EMR gives more control over cluster shape; SageMaker Processing is for smaller-scale single-container jobs.
Run a custom scikit-learn / pandas preprocessing script before training. Ephemeral compute, no idle cost.
→SageMaker Processing job with the SKLearn (or PySpark) container. Provisions, runs, terminates.
Why: Better than running on a notebook (stays up, costs money) or Lambda (15-min limit, memory caps).
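Sketch with the SageMaker Python SDK; the role ARN, S3 paths, framework version, and preprocess.py script are illustrative:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Instances are provisioned for the job, preprocess.py runs, then everything terminates.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://example-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://example-bucket/processed/")],
)
```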
Label 100,000 images cost-efficiently — want human + automated labeling.
→Amazon SageMaker Ground Truth with automated data labeling enabled. After an initial human-labeled subset, Ground Truth trains a model and auto-labels high-confidence samples.
Why: Automated labeling (active learning) can cut labeling cost by up to 70%. A2I is for human review of model predictions, not bulk labeling.
Multiple annotators disagree; need a senior reviewer to verify a sample of labels.
→Ground Truth label verification (audit) workflow. A subset of labels is routed to a review workforce that approves, rejects, or adjusts. Combine with annotation consolidation for multi-worker majority voting.
Same engineered features needed at training (batch) and inference (sub-10ms).
→Amazon SageMaker Feature Store with both online + offline stores enabled on the feature group. Online store backs real-time GetRecord; offline store (Parquet in S3) backs training.
Why: Eliminates train/serve skew without a custom DynamoDB ↔ S3 sync.
Defining a feature group — what is mandatory.
→Record identifier name (unique key per record) and event time feature name (timestamp for point-in-time queries).
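A sketch covering the two entries above (online + offline stores plus the mandatory record identifier and event time); the bucket, role ARN, and feature names are placeholders:

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

df = pd.DataFrame({
    "customer_id": ["c1", "c2"],                  # record identifier
    "event_time": [1718000000.0, 1718000060.0],   # event time (epoch seconds)
    "avg_basket_value": [42.5, 17.0],
})
df["customer_id"] = df["customer_id"].astype("string")

fg = FeatureGroup(name="customer-features", sagemaker_session=sagemaker.Session())
fg.load_feature_definitions(data_frame=df)        # infer feature types from the frame

fg.create(
    s3_uri="s3://example-bucket/offline-store",    # offline store (Parquet in S3)
    record_identifier_name="customer_id",          # mandatory
    event_time_feature_name="event_time",          # mandatory
    role_arn="arn:aws:iam::123456789012:role/SageMakerFeatureStoreRole",
    enable_online_store=True,                      # low-latency GetRecord
)

# Creation is asynchronous; wait for status "Created", then ingest writes
# records to both the online and offline stores.
fg.ingest(data_frame=df, wait=True)
```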
Join two feature groups for training without leaking future feature values.
→Point-in-time join against the offline store using the event-time column. Each training row sees only feature values that existed at its event timestamp.
Why: Plain JOIN on latest values causes data leakage by exposing post-event feature drift to the model.
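The offline store is typically queried through Athena; as a hedged illustration of the rule itself, a pandas merge_asof on made-up data:

```python
import pandas as pd

labels = pd.DataFrame({
    "customer_id": ["c1", "c1"],
    "event_time": pd.to_datetime(["2024-06-01", "2024-06-15"]),
    "churned": [0, 1],
})
features = pd.DataFrame({
    "customer_id": ["c1", "c1"],
    "event_time": pd.to_datetime(["2024-05-20", "2024-06-10"]),
    "avg_basket_value": [40.0, 55.0],
})

# For each label row, take the latest feature value at or before its event time,
# never a later one: the point-in-time rule.
train = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("event_time"),
    on="event_time",
    by="customer_id",
    direction="backward",
)
```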
Pick a SageMaker training data input mode for a 500 GB dataset.
→File mode → entire dataset downloaded first (slow start, EBS cost). Pipe mode → streams from S3, low startup, low storage. FastFile mode → lazy file-level streaming. Use Pipe (or FastFile) for large datasets to avoid the download.
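Sketch with the SageMaker Python SDK; the image URI, role ARN, and S3 path are placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.4xlarge",
    instance_count=1,
    input_mode="Pipe",   # stream from S3 instead of downloading 500 GB first
)

# The mode can also be set per channel, e.g. FastFile for lazy file-level access.
train = TrainingInput(s3_data="s3://example-bucket/train/", input_mode="FastFile")
estimator.fit({"train": train})
```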
Millions of small files (each ~50 KB) — Pipe mode throughput is poor.
→Bundle into Amazon RecordIO (protobuf) and stream via Pipe mode. Sequential records eliminate per-file S3 GET overhead.
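Sketch using the SDK's recordio-protobuf writer; the bucket, key, and synthetic arrays are placeholders:

```python
import io
import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Pack many small examples into one recordio-protobuf object so Pipe mode
# streams sequential records instead of issuing one S3 GET per tiny file.
X = np.random.rand(100_000, 64).astype("float32")
y = np.random.randint(0, 2, size=100_000).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
boto3.client("s3").upload_fileobj(buf, "example-bucket", "train/data.rec")
```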
Pick a storage format and layout for ML data lake on S3 with frequent column-subset reads + partition filters.
→Parquet (columnar, compressed) partitioned by the most-filtered column (e.g. date or region). Drives column pruning + partition pruning in Athena and SageMaker.
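Sketch with pandas/pyarrow (writing straight to s3:// assumes s3fs is installed); paths and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_date"])
df["dt"] = df["event_date"].dt.strftime("%Y-%m-%d")

# Columnar Parquet + Hive-style partitions (dt=YYYY-MM-DD/) enable both
# column pruning and partition pruning in Athena, Spark, and SageMaker.
df.to_parquet(
    "s3://example-bucket/events/",
    engine="pyarrow",
    partition_cols=["dt"],
    index=False,
)
```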
Glue ETL re-processes already-handled files on every run.
→Enable Glue job bookmarks so each run processes only new data. The pause option (--job-bookmark-option job-bookmark-pause) processes data added since the last bookmark without advancing it, useful for reruns; reset the bookmark only when a full re-process is intended.
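Sketch with boto3; the job name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# job-bookmark-enable: process only new data, advance the bookmark on success.
# job-bookmark-pause: process data since the last bookmark without advancing it.
glue.start_job_run(
    JobName="nightly-etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Only for an intentional full re-process:
# glue.reset_job_bookmark(JobName="nightly-etl")
```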
Validate schema, types, value ranges, and null constraints inside the Glue ETL pipeline.
→AWS Glue Data Quality with DQDL rules. Halts the pipeline when checks fail.
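A hedged sketch: a DQDL ruleset registered against a catalog table via boto3 (the same rules can back a data-quality evaluation step inside the Glue job); database and table names are placeholders:

```python
import boto3

# DQDL: schema, completeness, and value-range checks.
ruleset = """
Rules = [
    ColumnExists "customer_id",
    IsComplete "customer_id",
    ColumnValues "age" between 0 and 120,
    RowCount > 0
]
"""

boto3.client("glue").create_data_quality_ruleset(
    Name="orders-dq",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "analytics", "TableName": "orders"},
)
```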
Encode categorical features. Some are ordered (Basic/Standard/Premium), some are not (US states).
→Ordered → ordinal encoding (preserves rank). Unordered → one-hot encoding (avoids fake ordinality). Avoid label encoding on unordered features. Target encoding requires careful CV to avoid leakage.
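Sketch with scikit-learn (sparse_output assumes scikit-learn ≥ 1.2); category values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "plan": ["Basic", "Premium", "Standard"],   # ordered
    "state": ["CA", "TX", "NY"],                # unordered
})

# Ordered: ordinal encoding with an explicit rank.
ord_enc = OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]])
df["plan_encoded"] = ord_enc.fit_transform(df[["plan"]])

# Unordered: one-hot, no fake ordering.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
state_ohe = pd.DataFrame(
    ohe.fit_transform(df[["state"]]),
    columns=ohe.get_feature_names_out(),
    index=df.index,
)
df = pd.concat([df, state_ohe], axis=1)
```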
Numerical column has missing values that correlate with another feature (e.g. income missing depends on employment type).
→Group-based median imputation (median per employment type). Preserves the relationship; mean is sensitive to outliers; dropping loses data; zero adds bias.
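Sketch with pandas; column names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "employment_type": ["salaried", "salaried", "self_employed", "self_employed"],
    "income": [52_000, None, 75_000, None],
})

# Fill each missing income with the median of its employment_type group,
# preserving the relationship instead of using one global value.
df["income"] = df["income"].fillna(
    df.groupby("employment_type")["income"].transform("median")
)
```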
Binary classification with 0.3% positive class.
→SMOTE oversampling on the training fold only (after split). Combine with PR-curve / F1 evaluation, not accuracy.
Why: Apply oversampling AFTER splitting to avoid leakage. Accuracy is misleading on imbalanced data.
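Sketch with imbalanced-learn and scikit-learn on synthetic data (~0.3% positives); the model choice is illustrative:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.997], random_state=0)

# Split first, then oversample only the training data; the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = model.predict_proba(X_test)[:, 1]
print("PR AUC:", average_precision_score(y_test, proba))  # not accuracy
```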
Right-skewed numeric feature (e.g. income) hurts linear-model performance.
→Log transform. Compresses the right tail and produces a more symmetric distribution. Standardization/min-max change scale, not shape.
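Sketch; log1p is used so zero values stay defined:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [0, 25_000, 40_000, 60_000, 95_000, 1_200_000]})

# log1p = log(1 + x): compresses the long right tail toward symmetry.
df["income_log"] = np.log1p(df["income"])
```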
50 highly-correlated features; want lower dimensionality preserving variance.
→PCA. Transforms correlated features into uncorrelated principal components ranked by variance.
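Sketch with scikit-learn on synthetic correlated features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10)) @ rng.normal(size=(10, 50))  # 50 correlated features

# Standardize first so high-variance columns do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "uncorrelated components retained")
```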
Pick a train/val/test split.
→Imbalanced classification → stratified split (preserves class ratio). Time-series → chronological split (train on early period, test on latest); never random-shuffle. IID tabular → random.
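Sketch of the two non-trivial cases on synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced classification: stratify preserves the class ratio in every split.
X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Time series: chronological split; train on the early period, test on the latest.
ts = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=1_000, freq="h"),
    "value": range(1_000),
}).sort_values("event_time")
cutoff = int(len(ts) * 0.8)
train_ts, test_ts = ts.iloc[:cutoff], ts.iloc[cutoff:]
```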