Pick a Kinesis service for streaming ingestion.
→Sub-second consumer-controlled processing → Kinesis Data Streams. Fully managed delivery to S3/Redshift/OpenSearch with optional format conversion → Kinesis Data Firehose.
Why: KDS retains records (24h–365d) and supports replay by multiple consumers. Firehose has no replay; it trades that for zero-ops delivery.
Stream hits ProvisionedThroughputExceeded errors during peak.
→Reshard (split the hot shards). Each shard supports 1 MB/s or 1,000 records/s ingest and 2 MB/s egress. Use well-distributed partition keys; enable Enhanced Fan-Out when consumers together need more than the shared 2 MB/s per shard.
Why: Hot partition keys concentrate traffic on one shard. Random or hash-based keys spread load.
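A minimal producer-side sketch of the hash-based keys mentioned above; the stream name and `user_id` field are hypothetical:

```python
import hashlib
import json

import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict) -> None:
    # Hashing a high-cardinality field spreads records evenly across
    # shards instead of concentrating traffic on one hot partition key.
    partition_key = hashlib.md5(event["user_id"].encode()).hexdigest()
    kinesis.put_record(
        StreamName="clickstream",           # hypothetical stream name
        Data=json.dumps(event).encode(),
        PartitionKey=partition_key,
    )
```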
Streaming workload is spiky and unpredictable; manual resharding is operational pain.
→Kinesis Data Streams in on-demand capacity mode. Auto-scales to 200 MB/s by default; pay per data volume.
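A minimal boto3 sketch (names are placeholders); `update_stream_mode` switches an existing provisioned stream in place:

```python
import boto3

kinesis = boto3.client("kinesis")

# New stream in on-demand mode: no ShardCount to plan or manage.
kinesis.create_stream(
    StreamName="spiky-events",              # hypothetical name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Or switch an existing provisioned stream without downtime.
kinesis.update_stream_mode(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/spiky-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```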
Multiple consumers reading the same stream hit the 2 MB/s/shard read limit.
→Enhanced Fan-Out. Each consumer gets dedicated 2 MB/s/shard via push-based HTTP/2 SubscribeToShard.
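Registering an Enhanced Fan-Out consumer via boto3 (ARN and consumer name are placeholders); the returned consumer ARN is what `SubscribeToShard` takes:

```python
import boto3

kinesis = boto3.client("kinesis")

# Each registered consumer gets its own dedicated 2 MB/s per shard,
# independent of every other consumer on the same stream.
resp = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    ConsumerName="analytics-consumer",      # hypothetical consumer name
)
print(resp["Consumer"]["ConsumerARN"])      # pass to SubscribeToShard
```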
Maximize ingest throughput from producer-side application.
→Kinesis Producer Library (KPL) with aggregation + collection. Aggregation packs many user records into one Kinesis record (up to 1 MB); collection batches multiple records into a single PutRecords call. Both cut PUT cost.
Why: One PutRecord call per event is rate-limited and expensive at 50k events/s. KPL batches client-side.
Land JSON clickstream into S3 as Parquet, partitioned by event time.
→Firehose with record format conversion (JSON → Parquet) using Glue Data Catalog table + dynamic partitioning on event timestamp.
Why: Parquet + partitioning cuts Athena scan cost dramatically. Dynamic partitioning avoids a separate ETL step.
Some records fail Firehose transformation or delivery; need to capture them for replay.
→Configure S3 backup with `AllData` or `FailedDataOnly`. Failed records land at the configured prefix with error metadata.
Ensure no data loss in MSK if a broker AZ fails.
→Replication factor ≥ 3 across 3 AZs and `min.insync.replicas=2` with producer `acks=all`. MSK spreads brokers across the selected AZs, so with RF 3 the cluster tolerates the loss of an entire AZ without data loss.
Stream from MSK to S3, OpenSearch, or RDS without managing Kafka Connect cluster.
→MSK Connect with a managed connector (Confluent S3 Sink, Debezium for CDC). Workers auto-scale; capacity is billed per MCU (MSK Connect Unit).
Topic stores latest version of a record per key; old versions can be discarded.
→Set topic `cleanup.policy=compact`. Kafka retains the most recent value for each key; older same-key records are eligible for compaction.
Recurring weekly transfer of 10 TB from on-prem NFS to S3 over Direct Connect.
→AWS DataSync with on-prem agent + scheduled task. Incremental, parallel transfers with built-in integrity verification.
Why: DataSync is faster than `aws s3 sync` and handles bandwidth throttling, retries, and verification natively.
Pull data from SaaS APIs (Salesforce, ServiceNow, Zendesk) into S3 on a schedule.
→AWS AppFlow. Managed connectors, OAuth handled, scheduled or event-triggered, writes Parquet to S3.
Replicate ongoing changes from on-prem SQL Server to Aurora MySQL with minimal downtime.
→AWS DMS with full-load + CDC task. Use Schema Conversion Tool (SCT) for heterogeneous schema/code conversion before DMS.
DMS replication instance fails; replication is interrupted.
→Enable Multi-AZ on the replication instance. Synchronous standby in another AZ; automatic failover.
Need near-real-time analytics on OLTP Aurora data without ETL pipeline.
→Aurora zero-ETL integration to Redshift. Continuous replication of Aurora data to Redshift; queries see new data within seconds.
Why: Eliminates DMS / Glue / custom CDC pipelines for the OLTP-to-warehouse use case.
Move 100 TB of historical archive from on-prem to S3; bandwidth limited.
→AWS Snowball Edge Storage Optimized. Physical device shipped to site; copy data; ship back.
Source JSON has nested arrays; downstream relational analysis needs flattened rows.
→Glue PySpark `Relationalize` transform (or Spark's `explode()` on a DataFrame) flattens nested arrays into separate rows/tables.
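A minimal Glue sketch; database, table, and staging path are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events")

# Relationalize flattens nested structs into columns and pivots each
# nested array into its own table, linked to "root" by generated keys.
frames = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root")

for name in frames.keys():
    print(name)   # "root", plus one child table per nested array
```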
Glue Crawler infers ambiguous types (`choice<int,string>`) from messy CSV data.
→Apply the `ResolveChoice` transform: `cast` to a specific type, `project` to one of the choice types, or `make_struct`/`make_cols` to keep both. Or fix at source by enforcing a schema.
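A minimal sketch, assuming a hypothetical `price` column the crawler typed as `choice<int,string>`:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="messy_csv")

# Force every value of the ambiguous column to one type; values that
# cannot be cast (e.g. "N/A") become null.
resolved = dyf.resolveChoice(specs=[("price", "cast:long")])
```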
Glue ETL job runs hourly on growing S3 data; need to process only new files.
→Enable Glue job bookmarks. Glue tracks processed files/partitions and skips them on reruns.
Why: Avoids reprocessing entire dataset. Required for incremental ETL pipelines.
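Bookmarks only track reads that carry a `transformation_ctx`, and only advance on `job.commit()`; a minimal job skeleton with hypothetical catalog names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)   # bookmark state is keyed to the job

# transformation_ctx names the node whose progress the bookmark tracks;
# files already processed under "src" are skipped on the next run.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events",
    transformation_ctx="src")

# ... transforms and writes ...

job.commit()   # persists the bookmark; skip this and nothing is recorded
```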
Glue Spark job fails with OutOfMemoryError on driver during large aggregations.
→Switch to G.2X or G.4X workers (the driver runs on the same worker size, so it gets more memory too), or cut the data read up front with a `push_down_predicate` on partitioned catalog tables so less reaches the shuffle (sketch below).
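Pruning partitions at read time, assuming the table is partitioned on a hypothetical `event_date` column:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Only matching partitions are listed and read, so far less data
# reaches the shuffle (and the driver) in the first place.
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events",
    push_down_predicate="event_date >= '2024-01-01'")
```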
Run continuous Spark Structured Streaming against a Kinesis source with managed infra.
→AWS Glue streaming ETL job. Spark Structured Streaming under the hood; checkpointing to S3.
Business analyst needs to clean and transform data without writing code.
→AWS Glue DataBrew. Visual recipe-based transforms (250+), profiling, lineage. Output to S3, Redshift, RDS.
Run Glue ETL job only after Crawler successfully updates the Data Catalog.
→Glue Workflow with conditional triggers. Crawler success → trigger ETL job. Failure → skip / alarm.
Crawler infers all CSV columns as `string` — needs date and number types.
→Add a custom Glue classifier before crawling, e.g. a Grok pattern with typed fields (`%{NUMBER:amount:double}`). Or hand-edit the table schema and set the crawler's schema change policy to leave manual changes in place on later runs.
Multiple producers/consumers on Kafka need schema evolution without breaking each other.
→AWS Glue Schema Registry with compatibility rules (BACKWARD/FORWARD/FULL). Producers register schema; consumers fetch + validate.
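A boto3 sketch of registering a schema with a compatibility rule; registry name, schema name, and the Avro definition are illustrative:

```python
import boto3

glue = boto3.client("glue")

# New versions must be BACKWARD-compatible or registration fails,
# which is what stops producers from breaking existing consumers.
glue.create_schema(
    RegistryId={"RegistryName": "events"},   # hypothetical registry
    SchemaName="click-event",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=(
        '{"type":"record","name":"Click",'
        '"fields":[{"name":"user_id","type":"string"}]}'
    ),
)
```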
Pick between EMR and Glue for Spark ETL.
→Long-running custom Spark with deep tuning, multiple frameworks (Hive, Presto, Flink) → EMR. Serverless pay-per-job ETL with Glue Data Catalog integration → Glue. Spiky/unpredictable Spark → EMR Serverless.
Intermittent Spark/Hive jobs; want zero cluster ops and no idle compute.
→EMR Serverless. Pre-initialized capacity pools for low-latency starts; scales per-job; pay per vCPU-hour.
Mix on-demand core + spot task nodes for cost-optimized EMR.
→Instance Fleets with target capacity per type. Core fleet on-demand for HDFS stability; task fleet spot with diversified instance types.
Standardize on Kubernetes; want EMR Spark jobs to share cluster with other workloads.
→EMR on EKS. Spark runs as pods on existing EKS cluster; share infra and IAM roles via IRSA.
Stateful streaming with windowed aggregations and exactly-once semantics.
→Kinesis Data Analytics for Apache Flink (since renamed Amazon Managed Service for Apache Flink). Managed Flink runtime; checkpoints to S3; auto-scales.
Lightweight per-record transform on a Kinesis stream (<1 ms each).
→Lambda with Event Source Mapping on KDS. Tune `BatchSize`, `MaximumBatchingWindowInSeconds`, and `ParallelizationFactor`.
Why: Lambda is cheaper than KCL/Glue Streaming for small per-record work.
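Wiring the event source mapping with those knobs via boto3; ARNs and values are illustrative:

```python
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    FunctionName="per-record-transform",   # hypothetical function
    StartingPosition="LATEST",
    BatchSize=500,                         # records per invocation
    MaximumBatchingWindowInSeconds=5,      # wait to fill larger batches
    ParallelizationFactor=4,               # concurrent batches per shard (1-10)
)
```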
Step Functions step occasionally fails on transient throttling; retry then alert.
→Add `Retry` block with `ErrorEquals: ["Lambda.TooManyRequestsException", "States.TaskFailed"]` (the former is Lambda's throttle error as surfaced to Step Functions), `IntervalSeconds`, `MaxAttempts`, `BackoffRate=2`. Plus `Catch` to a notification state (sketch below).
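The retry/catch shape as an ASL definition, here built as a Python dict for `create_state_machine`; function, topic, and role ARNs are placeholders:

```python
import json

import boto3

definition = {
    "StartAt": "Transform",
    "States": {
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "per-record-transform", "Payload.$": "$"},
            "Retry": [{
                # Retry transient throttling with exponential backoff.
                "ErrorEquals": ["Lambda.TooManyRequestsException"],
                "IntervalSeconds": 2,
                "MaxAttempts": 5,
                "BackoffRate": 2,
            }],
            # Anything still failing falls through to the alert state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message.$": "$.Cause",   # Catch replaces input with {Error, Cause}
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="transform-with-retry",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec",   # placeholder role
)
```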
Process 500,000 JSON files in parallel through Lambda transform.
→Step Functions distributed Map state with `MaxConcurrency` and ItemReader from S3. Fan-out across thousands of parallel Lambda invocations.
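One way the distributed Map can look, again as a Python dict of ASL; bucket, prefix, and function name are placeholders:

```python
map_state = {
    "Type": "Map",
    "MaxConcurrency": 1000,        # parallel child executions
    "ItemReader": {                # enumerates objects server-side, so no
        "Resource": "arn:aws:states:::s3:listObjectsV2",  # payload limit
        "Parameters": {"Bucket": "raw-json", "Prefix": "events/"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "TransformOne",
        "States": {
            "TransformOne": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                # Each S3 object descriptor becomes one Lambda payload.
                "Parameters": {"FunctionName": "per-record-transform",
                               "Payload.$": "$"},
                "End": True,
            },
        },
    },
    "End": True,
}
```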
Complex DAG with cross-service dependencies (Glue + Redshift COPY + Lambda + email) and lineage requirements.
→Amazon MWAA (Managed Workflows for Apache Airflow). Native Airflow operators for AWS services; Git-driven DAG sync.
Need to roll back DAG changes if a deploy causes failures.
→Keep the MWAA DAG bucket versioned and restore the previous object versions to roll back a bad deploy. Or maintain the DAG repo in Git with environment-per-branch, and have CI redeploy a known-good commit to S3.