Pick a Kinesis service for streaming ingestion.
→Sub-second consumer-controlled processing → Kinesis Data Streams. Fully managed delivery to S3/Redshift/OpenSearch with optional format conversion → Kinesis Data Firehose.
Why: KDS retains records (24h–365d) and supports replay by multiple consumers. Firehose has no replay; it trades that for zero-ops delivery.
Stream hits ProvisionedThroughputExceeded errors during peak.
→Reshard (split the hot shards). Each shard supports 1 MB/s or 1,000 records/s ingest and 2 MB/s egress. Use well-distributed partition keys; enable Enhanced Fan-Out when consumers together need more than the shared 2 MB/s per shard.
Why: Hot partition keys concentrate traffic on one shard. Random or hash-based keys spread load.
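A minimal producer-side sketch of the hash-based keys mentioned above; the stream name and `user_id` field are hypothetical:

```python
import hashlib
import json

import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict) -> None:
    # Hashing a high-cardinality field spreads records evenly across
    # shards instead of concentrating traffic on one hot partition key.
    partition_key = hashlib.md5(event["user_id"].encode()).hexdigest()
    kinesis.put_record(
        StreamName="clickstream",           # hypothetical stream name
        Data=json.dumps(event).encode(),
        PartitionKey=partition_key,
    )
```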
Streaming workload is spiky and unpredictable; manual resharding is operational pain.
→Kinesis Data Streams in on-demand capacity mode. Auto-scales to 200 MB/s by default; pay per data volume.
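A minimal boto3 sketch (names are placeholders); `update_stream_mode` switches an existing provisioned stream in place:

```python
import boto3

kinesis = boto3.client("kinesis")

# New stream in on-demand mode: no ShardCount to plan or manage.
kinesis.create_stream(
    StreamName="spiky-events",              # hypothetical name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Or switch an existing provisioned stream without downtime.
kinesis.update_stream_mode(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/spiky-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```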
Multiple consumers reading the same stream hit the 2 MB/s/shard read limit.
→Enhanced Fan-Out. Each consumer gets dedicated 2 MB/s/shard via push-based HTTP/2 SubscribeToShard.
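Registering an Enhanced Fan-Out consumer via boto3 (ARN and consumer name are placeholders); the returned consumer ARN is what `SubscribeToShard` takes:

```python
import boto3

kinesis = boto3.client("kinesis")

# Each registered consumer gets its own dedicated 2 MB/s per shard,
# independent of every other consumer on the same stream.
resp = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    ConsumerName="analytics-consumer",      # hypothetical consumer name
)
print(resp["Consumer"]["ConsumerARN"])      # pass to SubscribeToShard
```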
Maximize ingest throughput from producer-side application.
→Kinesis Producer Library (KPL) with aggregation + collection. Aggregation packs many user records into one Kinesis record (up to 1 MB); collection batches multiple records into a single PutRecords call. Both cut PUT cost.
Why: One PutRecord call per event is rate-limited and expensive at 50k events/s. KPL batches client-side.
Land JSON clickstream into S3 as Parquet, partitioned by event time.
→Firehose with record format conversion (JSON → Parquet) using Glue Data Catalog table + dynamic partitioning on event timestamp.
Why: Parquet + partitioning cuts Athena scan cost dramatically. Dynamic partitioning avoids a separate ETL step.
Some records fail Firehose transformation or delivery; need to capture them for replay.
→Configure S3 backup with `AllData` or `FailedDataOnly`. Failed records land at the configured prefix with error metadata.
Ensure no data loss in MSK if a broker AZ fails.
→Replication factor ≥ 3 across 3 AZs and `min.insync.replicas=2` with producer `acks=all`. MSK spreads brokers across the selected AZs, so with RF 3 the cluster tolerates the loss of an entire AZ without data loss.
Stream from MSK to S3, OpenSearch, or RDS without managing Kafka Connect cluster.
→MSK Connect with a managed connector (Confluent S3 Sink, Debezium for CDC). Workers auto-scale; capacity is billed per MCU (MSK Connect Unit).
Topic stores latest version of a record per key; old versions can be discarded.
→Set topic `cleanup.policy=compact`. Kafka retains the most recent value for each key; older same-key records are eligible for compaction.
Recurring weekly transfer of 10 TB from on-prem NFS to S3 over Direct Connect.
→AWS DataSync with on-prem agent + scheduled task. Incremental, parallel transfers with built-in integrity verification.
Why: DataSync is faster than `aws s3 sync` and handles bandwidth throttling, retries, and verification natively.
Pull data from SaaS APIs (Salesforce, ServiceNow, Zendesk) into S3 on a schedule.
→AWS AppFlow. Managed connectors, OAuth handled, scheduled or event-triggered, writes Parquet to S3.
Replicate ongoing changes from on-prem SQL Server to Aurora MySQL with minimal downtime.
→AWS DMS with full-load + CDC task. Use Schema Conversion Tool (SCT) for heterogeneous schema/code conversion before DMS.
DMS replication instance fails; replication is interrupted.
→Enable Multi-AZ on the replication instance. Synchronous standby in another AZ; automatic failover.
Need near-real-time analytics on OLTP Aurora data without ETL pipeline.
→Aurora zero-ETL integration to Redshift. Continuous replication of Aurora data to Redshift; queries see new data within seconds.
Why: Eliminates DMS / Glue / custom CDC pipelines for the OLTP-to-warehouse use case.
Move 100 TB of historical archive from on-prem to S3; bandwidth limited.
→AWS Snowball Edge Storage Optimized. Physical device shipped to site; copy data; ship back.
Source JSON has nested arrays; downstream relational analysis needs flattened rows.
→Glue PySpark `Relationalize` transform (or Spark's `explode()` on a DataFrame) flattens nested arrays into separate rows/tables.
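A minimal Glue sketch; database, table, and staging path are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events")

# Relationalize flattens nested structs into columns and pivots each
# nested array into its own table, linked to "root" by generated keys.
frames = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root")

for name in frames.keys():
    print(name)   # "root", plus one child table per nested array
```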
Glue Crawler infers ambiguous types (`choice<int,string>`) from messy CSV data.
→Apply the `ResolveChoice` transform: `cast` to a specific type, `project` to one of the choice types, or `make_struct`/`make_cols` to keep both. Or fix at source by enforcing a schema.
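A minimal sketch, assuming a hypothetical `price` column the crawler typed as `choice<int,string>`:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="messy_csv")

# Force every value of the ambiguous column to one type; values that
# cannot be cast (e.g. "N/A") become null.
resolved = dyf.resolveChoice(specs=[("price", "cast:long")])
```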
Glue ETL job runs hourly on growing S3 data; need to process only new files.
→Enable Glue job bookmarks. Glue tracks processed files/partitions and skips them on reruns.
Why: Avoids reprocessing entire dataset. Required for incremental ETL pipelines.
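Bookmarks only track reads that carry a `transformation_ctx`, and only advance on `job.commit()`; a minimal job skeleton with hypothetical catalog names:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)   # bookmark state is keyed to the job

# transformation_ctx names the node whose progress the bookmark tracks;
# files already processed under "src" are skipped on the next run.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events",
    transformation_ctx="src")

# ... transforms and writes ...

job.commit()   # persists the bookmark; skip this and nothing is recorded
```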
Glue Spark job fails with OutOfMemoryError on driver during large aggregations.
→Switch to G.2X or G.4X workers (the driver runs on the same worker size, so it gets more memory too), or cut the data read up front with a `push_down_predicate` on partitioned catalog tables so less reaches the shuffle (sketch below).
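Pruning partitions at read time, assuming the table is partitioned on a hypothetical `event_date` column:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Only matching partitions are listed and read, so far less data
# reaches the shuffle (and the driver) in the first place.
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events",
    push_down_predicate="event_date >= '2024-01-01'")
```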
Run continuous Spark Structured Streaming against a Kinesis source with managed infra.
→AWS Glue streaming ETL job. Spark Structured Streaming under the hood; checkpointing to S3.
Business analyst needs to clean and transform data without writing code.
→AWS Glue DataBrew. Visual recipe-based transforms (250+), profiling, lineage. Output to S3, Redshift, RDS.
Run Glue ETL job only after Crawler successfully updates the Data Catalog.
→Glue Workflow with conditional triggers. Crawler success → trigger ETL job. Failure → skip / alarm.
Crawler infers all CSV columns as `string` — needs date and number types.
→Add a custom Glue classifier before crawling, e.g. a Grok pattern with typed fields (`%{NUMBER:amount:double}`). Or hand-edit the table schema and set the crawler's schema change policy to leave manual changes in place on later runs.
Multiple producers/consumers on Kafka need schema evolution without breaking each other.
→AWS Glue Schema Registry with compatibility rules (BACKWARD/FORWARD/FULL). Producers register schema; consumers fetch + validate.
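A boto3 sketch of registering a schema with a compatibility rule; registry name, schema name, and the Avro definition are illustrative:

```python
import boto3

glue = boto3.client("glue")

# New versions must be BACKWARD-compatible or registration fails,
# which is what stops producers from breaking existing consumers.
glue.create_schema(
    RegistryId={"RegistryName": "events"},   # hypothetical registry
    SchemaName="click-event",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=(
        '{"type":"record","name":"Click",'
        '"fields":[{"name":"user_id","type":"string"}]}'
    ),
)
```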
Pick between EMR and Glue for Spark ETL.
→Long-running custom Spark with deep tuning, multiple frameworks (Hive, Presto, Flink) → EMR. Serverless pay-per-job ETL with Glue Data Catalog integration → Glue. Spiky/unpredictable Spark → EMR Serverless.
Intermittent Spark/Hive jobs; want zero cluster ops and no idle compute.
→EMR Serverless. Pre-initialized capacity pools for low-latency starts; scales per-job; pay per vCPU-hour.
Mix on-demand core + spot task nodes for cost-optimized EMR.
→Instance Fleets with target capacity per type. Core fleet on-demand for HDFS stability; task fleet spot with diversified instance types.
Standardize on Kubernetes; want EMR Spark jobs to share cluster with other workloads.
→EMR on EKS. Spark runs as pods on existing EKS cluster; share infra and IAM roles via IRSA.
Stateful streaming with windowed aggregations and exactly-once semantics.
→Kinesis Data Analytics for Apache Flink (since renamed Amazon Managed Service for Apache Flink). Managed Flink runtime; checkpoints to S3; auto-scales.
Lightweight per-record transform on a Kinesis stream (<1 ms each).
→Lambda with Event Source Mapping on KDS. Tune `BatchSize`, `MaximumBatchingWindowInSeconds`, and `ParallelizationFactor`.
Why: Lambda is cheaper than KCL/Glue Streaming for small per-record work.
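Wiring the event source mapping with those knobs via boto3; ARNs and values are illustrative:

```python
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    FunctionName="per-record-transform",   # hypothetical function
    StartingPosition="LATEST",
    BatchSize=500,                         # records per invocation
    MaximumBatchingWindowInSeconds=5,      # wait to fill larger batches
    ParallelizationFactor=4,               # concurrent batches per shard (1-10)
)
```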
Step Functions step occasionally fails on transient throttling; retry then alert.
→Add `Retry` block with `ErrorEquals: ["Lambda.TooManyRequestsException", "States.TaskFailed"]` (the former is Lambda's throttle error as surfaced to Step Functions), `IntervalSeconds`, `MaxAttempts`, `BackoffRate=2`. Plus `Catch` to a notification state (sketch below).
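The retry/catch shape as an ASL definition, here built as a Python dict for `create_state_machine`; function, topic, and role ARNs are placeholders:

```python
import json

import boto3

definition = {
    "StartAt": "Transform",
    "States": {
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "per-record-transform", "Payload.$": "$"},
            "Retry": [{
                # Retry transient throttling with exponential backoff.
                "ErrorEquals": ["Lambda.TooManyRequestsException"],
                "IntervalSeconds": 2,
                "MaxAttempts": 5,
                "BackoffRate": 2,
            }],
            # Anything still failing falls through to the alert state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message.$": "$.Cause",   # Catch replaces input with {Error, Cause}
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="transform-with-retry",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec",   # placeholder role
)
```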
Process 500,000 JSON files in parallel through Lambda transform.
→Step Functions distributed Map state with `MaxConcurrency` and ItemReader from S3. Fan-out across thousands of parallel Lambda invocations.
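One way the distributed Map can look, again as a Python dict of ASL; bucket, prefix, and function name are placeholders:

```python
map_state = {
    "Type": "Map",
    "MaxConcurrency": 1000,        # parallel child executions
    "ItemReader": {                # enumerates objects server-side, so no
        "Resource": "arn:aws:states:::s3:listObjectsV2",  # payload limit
        "Parameters": {"Bucket": "raw-json", "Prefix": "events/"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "TransformOne",
        "States": {
            "TransformOne": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                # Each S3 object descriptor becomes one Lambda payload.
                "Parameters": {"FunctionName": "per-record-transform",
                               "Payload.$": "$"},
                "End": True,
            },
        },
    },
    "End": True,
}
```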
Complex DAG with cross-service dependencies (Glue + Redshift COPY + Lambda + email) and lineage requirements.
→Amazon MWAA (Managed Workflows for Apache Airflow). Native Airflow operators for AWS services; Git-driven DAG sync.
Need to roll back DAG changes if a deploy causes failures.
→Keep the MWAA DAG bucket versioned and restore the previous object versions to roll back a bad deploy. Or maintain the DAG repo in Git with environment-per-branch, and have CI redeploy a known-good commit to S3.