Pick a visual data-prep tool.
→ML-focused, integrated with SageMaker Studio, flow exportable to a Processing job, Pipeline, or notebook → SageMaker Data Wrangler. Generic data cleaning with reusable recipes and profiling, no SageMaker dependency → AWS Glue DataBrew. 50 TB+ of Spark with custom code → Amazon EMR.
Why: Data Wrangler is the SageMaker-native option (300+ transforms, datetime extraction, exports to Pipeline/Processing). DataBrew is recipe-based and source-agnostic. EMR handles scale and arbitrary Spark.
Catalog data across S3, RDS, DynamoDB so analysts and SageMaker can discover datasets.
→AWS Glue Crawlers populate the AWS Glue Data Catalog with schemas + metadata. Athena, Redshift Spectrum, and SageMaker all consume it.
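A minimal boto3 sketch, assuming a Glue service role already exists; the crawler name, role ARN, database, and S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and register the inferred schema in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional nightly re-crawl
)
glue.start_crawler(Name="sales-crawler")
```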
Need column- and row-level access control on the data lake with audit logging.
→AWS Lake Formation. IAM and S3 bucket policies do not provide column-level granularity on structured data.
Why: Lake Formation centralizes governance for the Glue Data Catalog and integrates with CloudTrail for audit.
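A hedged boto3 sketch of a column-level grant (row-level filtering uses Lake Formation data cell filters); the principal ARN, database, table, and column names are placeholders:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on only two columns of a catalog table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "analytics",
            "Name": "customers",
            "ColumnNames": ["customer_id", "segment"],
        }
    },
    Permissions=["SELECT"],
)
```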
Run ad-hoc SQL on S3 data without provisioning anything.
→Amazon Athena. Serverless, pay-per-TB-scanned. Partition data and use Parquet to cut cost and time.
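Sketch with boto3; the table, database, partition column, and results bucket are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Partition filter + Parquet keep the bytes scanned (and the bill) low.
resp = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM sales
        WHERE dt = '2024-06-01'   -- partition pruning
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])
```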
50 TB of feature engineering with existing PySpark code, must finish in 4 hours.
→Amazon EMR with Spark. Tunable cluster size, Spot support, runs the existing code unchanged.
Why: Glue ETL also runs Spark but EMR gives more control over cluster shape; SageMaker Processing is for smaller-scale single-container jobs.
Run a custom scikit-learn / pandas preprocessing script before training. Ephemeral compute, no idle cost.
→SageMaker Processing job with the SKLearn (or PySpark) container. Provisions, runs, terminates.
Why: Better than running on a notebook (stays up, costs money) or Lambda (15-min limit, memory caps).
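Sketch with the SageMaker Python SDK; the role ARN, S3 paths, framework version, and preprocess.py script are illustrative:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Instances are provisioned for the job, preprocess.py runs, then everything terminates.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://example-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://example-bucket/processed/")],
)
```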
Label 100,000 images cost-efficiently — want human + automated labeling.
→Amazon SageMaker Ground Truth with automated data labeling enabled. After an initial human-labeled subset, Ground Truth trains a model and auto-labels high-confidence samples.
Why: Automated labeling (active learning) can cut labeling cost by up to 70%. A2I is for human review of model predictions, not bulk labeling.
Multiple annotators disagree; need a senior reviewer to verify a sample of labels.
→Ground Truth label verification (audit) workflow. A subset of labels is routed to a review workforce that approves, rejects, or adjusts. Combine with annotation consolidation for multi-worker majority voting.
Same engineered features needed at training (batch) and inference (sub-10ms).
→Amazon SageMaker Feature Store with both online + offline stores enabled on the feature group. Online store backs real-time GetRecord; offline store (Parquet in S3) backs training.
Why: Eliminates train/serve skew without a custom DynamoDB ↔ S3 sync.
Defining a feature group — what is mandatory.
→Record identifier name (unique key per record) and event time feature name (timestamp for point-in-time queries).
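A sketch covering the two entries above (online + offline stores plus the mandatory record identifier and event time); the bucket, role ARN, and feature names are placeholders:

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

df = pd.DataFrame({
    "customer_id": ["c1", "c2"],                  # record identifier
    "event_time": [1718000000.0, 1718000060.0],   # event time (epoch seconds)
    "avg_basket_value": [42.5, 17.0],
})
df["customer_id"] = df["customer_id"].astype("string")

fg = FeatureGroup(name="customer-features", sagemaker_session=sagemaker.Session())
fg.load_feature_definitions(data_frame=df)        # infer feature types from the frame

fg.create(
    s3_uri="s3://example-bucket/offline-store",    # offline store (Parquet in S3)
    record_identifier_name="customer_id",          # mandatory
    event_time_feature_name="event_time",          # mandatory
    role_arn="arn:aws:iam::123456789012:role/SageMakerFeatureStoreRole",
    enable_online_store=True,                      # low-latency GetRecord
)

# Creation is asynchronous; wait for status "Created", then ingest writes
# records to both the online and offline stores.
fg.ingest(data_frame=df, wait=True)
```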
Join two feature groups for training without leaking future feature values.
→Point-in-time join against the offline store using the event-time column. Each training row sees only feature values that existed at its event timestamp.
Why: Plain JOIN on latest values causes data leakage by exposing post-event feature drift to the model.
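The offline store is typically queried through Athena; as a hedged illustration of the rule itself, a pandas merge_asof on made-up data:

```python
import pandas as pd

labels = pd.DataFrame({
    "customer_id": ["c1", "c1"],
    "event_time": pd.to_datetime(["2024-06-01", "2024-06-15"]),
    "churned": [0, 1],
})
features = pd.DataFrame({
    "customer_id": ["c1", "c1"],
    "event_time": pd.to_datetime(["2024-05-20", "2024-06-10"]),
    "avg_basket_value": [40.0, 55.0],
})

# For each label row, take the latest feature value at or before its event time,
# never a later one: the point-in-time rule.
train = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("event_time"),
    on="event_time",
    by="customer_id",
    direction="backward",
)
```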
Pick a SageMaker training data input mode for a 500 GB dataset.
→File mode → entire dataset downloaded first (slow start, EBS cost). Pipe mode → streams from S3, low startup, low storage. FastFile mode → lazy file-level streaming. Use Pipe (or FastFile) for large datasets to avoid the download.
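Sketch with the SageMaker Python SDK; the image URI, role ARN, and S3 path are placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.4xlarge",
    instance_count=1,
    input_mode="Pipe",   # stream from S3 instead of downloading 500 GB first
)

# The mode can also be set per channel, e.g. FastFile for lazy file-level access.
train = TrainingInput(s3_data="s3://example-bucket/train/", input_mode="FastFile")
estimator.fit({"train": train})
```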
Millions of small files (each ~50 KB) — Pipe mode throughput is poor.
→Bundle into Amazon RecordIO (protobuf) and stream via Pipe mode. Sequential records eliminate per-file S3 GET overhead.
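Sketch using the SDK's recordio-protobuf writer; the bucket, key, and synthetic arrays are placeholders:

```python
import io
import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Pack many small examples into one recordio-protobuf object so Pipe mode
# streams sequential records instead of issuing one S3 GET per tiny file.
X = np.random.rand(100_000, 64).astype("float32")
y = np.random.randint(0, 2, size=100_000).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
boto3.client("s3").upload_fileobj(buf, "example-bucket", "train/data.rec")
```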
Pick a storage format and layout for ML data lake on S3 with frequent column-subset reads + partition filters.
→Parquet (columnar, compressed) partitioned by the most-filtered column (e.g. date or region). Drives column pruning + partition pruning in Athena and SageMaker.
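Sketch with pandas/pyarrow (writing straight to s3:// assumes s3fs is installed); paths and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_date"])
df["dt"] = df["event_date"].dt.strftime("%Y-%m-%d")

# Columnar Parquet + Hive-style partitions (dt=YYYY-MM-DD/) enable both
# column pruning and partition pruning in Athena, Spark, and SageMaker.
df.to_parquet(
    "s3://example-bucket/events/",
    engine="pyarrow",
    partition_cols=["dt"],
    index=False,
)
```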
Glue ETL re-processes already-handled files on every run.
→Enable Glue job bookmarks so each run processes only new data. The pause option (--job-bookmark-option job-bookmark-pause) processes data added since the last bookmark without advancing it, useful for reruns; reset the bookmark only when a full re-process is intended.
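Sketch with boto3; the job name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# job-bookmark-enable: process only new data, advance the bookmark on success.
# job-bookmark-pause: process data since the last bookmark without advancing it.
glue.start_job_run(
    JobName="nightly-etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Only for an intentional full re-process:
# glue.reset_job_bookmark(JobName="nightly-etl")
```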
Validate schema, types, value ranges, and null constraints inside the Glue ETL pipeline.
→AWS Glue Data Quality with DQDL rules. Halts the pipeline when checks fail.
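A hedged sketch: a DQDL ruleset registered against a catalog table via boto3 (the same rules can back a data-quality evaluation step inside the Glue job); database and table names are placeholders:

```python
import boto3

# DQDL: schema, completeness, and value-range checks.
ruleset = """
Rules = [
    ColumnExists "customer_id",
    IsComplete "customer_id",
    ColumnValues "age" between 0 and 120,
    RowCount > 0
]
"""

boto3.client("glue").create_data_quality_ruleset(
    Name="orders-dq",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "analytics", "TableName": "orders"},
)
```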
Encode categorical features. Some are ordered (Basic/Standard/Premium), some are not (US states).
→Ordered → ordinal encoding (preserves rank). Unordered → one-hot encoding (avoids fake ordinality). Avoid label encoding on unordered features. Target encoding requires careful CV to avoid leakage.
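Sketch with scikit-learn (sparse_output assumes scikit-learn ≥ 1.2); category values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "plan": ["Basic", "Premium", "Standard"],   # ordered
    "state": ["CA", "TX", "NY"],                # unordered
})

# Ordered: ordinal encoding with an explicit rank.
ord_enc = OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]])
df["plan_encoded"] = ord_enc.fit_transform(df[["plan"]])

# Unordered: one-hot, no fake ordering.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
state_ohe = pd.DataFrame(
    ohe.fit_transform(df[["state"]]),
    columns=ohe.get_feature_names_out(),
    index=df.index,
)
df = pd.concat([df, state_ohe], axis=1)
```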
Numerical column has missing values that correlate with another feature (e.g. income missing depends on employment type).
→Group-based median imputation (median per employment type). Preserves the relationship; mean is sensitive to outliers; dropping loses data; zero adds bias.
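Sketch with pandas; column names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "employment_type": ["salaried", "salaried", "self_employed", "self_employed"],
    "income": [52_000, None, 75_000, None],
})

# Fill each missing income with the median of its employment_type group,
# preserving the relationship instead of using one global value.
df["income"] = df["income"].fillna(
    df.groupby("employment_type")["income"].transform("median")
)
```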
Binary classification with 0.3% positive class.
→SMOTE oversampling on the training fold only (after split). Combine with PR-curve / F1 evaluation, not accuracy.
Why: Apply oversampling AFTER splitting to avoid leakage. Accuracy is misleading on imbalanced data.
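Sketch with imbalanced-learn and scikit-learn on synthetic data (~0.3% positives); the model choice is illustrative:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.997], random_state=0)

# Split first, then oversample only the training data; the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = model.predict_proba(X_test)[:, 1]
print("PR AUC:", average_precision_score(y_test, proba))  # not accuracy
```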
Right-skewed numeric feature (e.g. income) hurts linear-model performance.
→Log transform. Compresses the right tail and produces a more symmetric distribution. Standardization/min-max change scale, not shape.
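Sketch; log1p is used so zero values stay defined:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [0, 25_000, 40_000, 60_000, 95_000, 1_200_000]})

# log1p = log(1 + x): compresses the long right tail toward symmetry.
df["income_log"] = np.log1p(df["income"])
```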
50 highly-correlated features; want lower dimensionality preserving variance.
→PCA. Transforms correlated features into uncorrelated principal components ranked by variance.
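Sketch with scikit-learn on synthetic correlated features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10)) @ rng.normal(size=(10, 50))  # 50 correlated features

# Standardize first so high-variance columns do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "uncorrelated components retained")
```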
Pick a train/val/test split.
→Imbalanced classification → stratified split (preserves class ratio). Time-series → chronological split (train on early period, test on latest); never random-shuffle. IID tabular → random.
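Sketch of the two non-trivial cases on synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced classification: stratify preserves the class ratio in every split.
X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Time series: chronological split; train on the early period, test on the latest.
ts = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=1_000, freq="h"),
    "value": range(1_000),
}).sort_values("event_time")
cutoff = int(len(ts) * 0.8)
train_ts, test_ts = ts.iloc[:cutoff], ts.iloc[cutoff:]
```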