Variable, bursty database workload — capacity needs swing 10x within minutes.
→Aurora Serverless v2. Set min/max ACU; Aurora scales in seconds without connection drops.
Why: v2 scales by adding capacity to the existing instance — no failover. Provisioned Aurora cannot scale this fast; Serverless v1 scales slower and pauses connections.
Reference↗
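A minimal sketch of setting the ACU bounds; the cluster name and the 0.5–64 range are assumptions:

```python
# Sketch: configure Aurora Serverless v2 scaling bounds.
# Cluster identifier "app-cluster" and the ACU range are assumptions.
scaling_params = {
    "DBClusterIdentifier": "app-cluster",
    "ServerlessV2ScalingConfiguration": {
        "MinCapacity": 0.5,   # floor: keeps the cluster warm between bursts
        "MaxCapacity": 64.0,  # ceiling: caps cost during spikes
    },
    "ApplyImmediately": True,
}

# With credentials configured, apply it like this:
# import boto3
# boto3.client("rds").modify_db_cluster(**scaling_params)
```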
Global app with <1s RPO and <1min RTO for cross-region DB failover.
→Aurora Global Database. Storage-based replication, typical replication lag <1s. Promote secondary in seconds.
Why: Global DB ships pages, not transactions — sub-second cross-region. Cross-region read replicas via logical replication can't match this.
Reference↗
Reproduce a production database for testing without paying for a full copy.
→Aurora cloning. Copy-on-write — initial clone is free; only changed pages billed.
Why: Clones are point-in-time, instant, isolated. Snapshot+restore takes hours and bills full storage immediately.
Reference↗
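A clone is a point-in-time restore with `RestoreType="copy-on-write"`; a hedged sketch with assumed cluster names:

```python
# Sketch: clone a production Aurora cluster copy-on-write.
# Source and target identifiers are assumptions.
clone_params = {
    "SourceDBClusterIdentifier": "prod-cluster",
    "DBClusterIdentifier": "prod-clone-for-testing",
    "RestoreType": "copy-on-write",     # this is what makes it a clone, not a restore
    "UseLatestRestorableTime": True,    # clone as of "now"
}

# boto3.client("rds").restore_db_cluster_to_point_in_time(**clone_params)
# Note: you still add DB instances to the cloned cluster afterwards.
```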
Recover from a logical error (DROP TABLE in prod) in minutes, not hours.
→Aurora MySQL Backtrack. Rewinds the cluster in place to a prior point in time without restoring from backup.
Why: Backtrack is in-place and fast. PITR restores create a new cluster — slower and requires app cutover.
Reference↗
Route reporting queries to specific reader instances with larger memory.
→Aurora custom endpoints. Define endpoint pointing to a subset of readers (the larger ones).
Why: Default reader endpoint round-robins all readers. Custom endpoints partition the cluster by workload type.
Reference↗
DynamoDB table experiences hot partition spikes throttling some reads/writes.
→Provisioned with auto-scaling + adaptive capacity (automatic). Redesign partition key if a single key is the hotspot.
Why: Adaptive capacity reallocates throughput across partitions without action. But if one key is hot, only schema redesign (composite key, write sharding) helps.
Reference↗
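One common redesign is write sharding; a sketch where the shard count and key shape are assumptions:

```python
import hashlib

NUM_SHARDS = 10  # assumption: size to spread the hot key's throughput

def sharded_pk(hot_key: str, sort_value: str) -> str:
    """Append a deterministic shard suffix so writes for one logical key
    land on NUM_SHARDS physical partitions instead of one."""
    shard = int(hashlib.md5(sort_value.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{hot_key}#{shard}"

# Readers query all suffixes ("hot-item#0" .. "hot-item#9") and merge results.
```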
Side-effect on every DynamoDB write — push to OpenSearch for search indexing.
→DynamoDB Streams + Lambda trigger. Lambda batches stream records and writes to OpenSearch.
Why: Streams capture item-level changes for 24h with a native Lambda trigger. Use Kinesis Data Streams for DynamoDB when you need longer retention or analytics fan-out.
Reference↗
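A sketch of the Lambda side, turning stream records into OpenSearch `_bulk` actions; the index name, key attribute, and the HTTP shipping step are assumptions:

```python
import json

def handler(event, context=None):
    """Sketch: map DynamoDB Stream records to OpenSearch bulk action lines.
    Index "items" and a string partition key "pk" are assumptions."""
    lines = []
    for rec in event["Records"]:
        key = rec["dynamodb"]["Keys"]["pk"]["S"]
        if rec["eventName"] == "REMOVE":
            lines.append(json.dumps({"delete": {"_index": "items", "_id": key}}))
        else:  # INSERT / MODIFY carry a NewImage
            doc = {k: list(v.values())[0] for k, v in rec["dynamodb"]["NewImage"].items()}
            lines.append(json.dumps({"index": {"_index": "items", "_id": key}}))
            lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # POST this body to the OpenSearch /_bulk endpoint
```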
Two-phase write across multiple DynamoDB items must be atomic.
→TransactWriteItems / TransactGetItems. ACID semantics across up to 100 items.
Why: Native transactions avoid the distributed-saga complexity. Cost is 2x normal capacity per item — use only when atomicity is required.
Reference↗
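A classic use is a balance transfer; a sketch of the request payload, with table and key names as assumptions:

```python
# Sketch: atomically debit one account and credit another.
# Table "accounts" and key shape are assumptions.
transfer = {
    "TransactItems": [
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"pk": {"S": "acct#A"}},
                "UpdateExpression": "SET balance = balance - :amt",
                "ConditionExpression": "balance >= :amt",  # overdraft fails the whole txn
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"pk": {"S": "acct#B"}},
                "UpdateExpression": "SET balance = balance + :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
    ]
}
# boto3.client("dynamodb").transact_write_items(**transfer)
```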
Migrate a self-hosted MongoDB cluster to a managed service preserving the API.
→Amazon DocumentDB. MongoDB-compatible API. Use mongodump/mongorestore or DMS for migration.
Why: DocumentDB is API-compatible with MongoDB 4.0/5.0 (most operators, not all). Verify driver/feature compatibility before committing.
Reference↗
Recommendation engine needs to traverse a social graph of 100M nodes.
→Amazon Neptune. Property graph (Gremlin) or RDF (SPARQL).
Why: Purpose-built graph DB. Modeling relationships in DynamoDB or RDS is possible but query performance degrades with hop depth.
Reference↗
IoT fleet emits 10M timeseries datapoints/sec with mixed-frequency retention.
→Amazon Timestream. Memory store (recent), magnetic store (historical) — automatic tiering.
Why: Purpose-built timeseries — DynamoDB/RDS scaling cost prohibitive at this rate. Built-in retention tiering reduces storage cost.
Reference↗
Banking ledger needs cryptographic verification of every record change.
→Amazon QLDB. Immutable, cryptographically verifiable journal. Use SHA-256 digest export for proofs.
Why: QLDB is a purpose-built ledger, but AWS has announced its end of support; verify availability before adopting. DynamoDB Streams give change history but no built-in cryptographic chaining.
Log analytics workload with unpredictable peaks and hands-off operations.
→Amazon OpenSearch Serverless. Decoupled compute/storage; auto-scales OCUs.
Why: No cluster sizing or shard management. For predictable, sustained workloads, provisioned domains are cheaper.
Reference↗
Petabyte-scale analytics with elastic compute and data sharing across teams.
→Redshift RA3 nodes with managed storage. Cross-cluster data sharing (no copy).
Why: RA3 separates compute from storage — scale each independently. Data sharing eliminates ETL between teams' clusters.
Reference↗
Existing Redshift cluster + S3 data lake — query S3 from Redshift, or use Athena?
→Redshift Spectrum when joins between cluster tables and S3 data are needed. Athena when fully serverless ad-hoc on S3 only.
Why: Spectrum runs S3 scans through your cluster's Spectrum fleet (a per-TB scan charge on top of cluster cost); Athena is fully serverless and also billed per TB scanned. Pick by where the dominant data lives.
Reference↗
Different teams need different row/column visibility on the same Glue Catalog tables.
→AWS Lake Formation with row-level + column-level + cell-level filters. Grant via LF tags.
Why: IAM/S3 policies cannot do row-level. Lake Formation enforces fine-grained access via Glue Catalog metadata + Athena/Redshift Spectrum/EMR consumers.
Reference↗
Daily Glue job processes incremental data; must not reprocess yesterday's files.
→Glue job bookmarks. Track processed S3 keys / DB rows; resume from last successful checkpoint.
Why: Bookmarks avoid duplicate processing without manual state tracking. Disable for full-reprocess runs.
Reference↗
Pick managed Kafka vs Kinesis Data Streams for event streaming.
→MSK when existing Kafka clients/ecosystem. Kinesis for tight AWS integration (Lambda triggers, Firehose, KCL) and serverless option.
Why: Both stream durably with replay. MSK preserves Kafka API and ecosystem; Kinesis costs less for small streams and integrates natively.
Reference↗
Variable Kafka throughput; want hands-off cluster management.
→MSK Serverless. Auto-scales partitions and throughput; pay per partition + data.
Why: No broker sizing. For sustained high throughput, provisioned MSK is cheaper.
Reference↗
Wire SQS → filter → Step Functions without writing glue Lambda.
→EventBridge Pipes. Source → optional filter → optional enrichment → target.
Why: Replaces a typical Lambda-as-glue. Reduces code, cost, and operational surface.
Reference↗
Replay last week's events through new consumer without re-emitting from source.
→EventBridge archive + replay. Archive captures matched events; replay them to a target later.
Why: Built-in replay avoids needing a separate event store. Useful for incident recovery and onboarding new consumers.
Reference↗
Hundreds of producers emit events; consumers need typed bindings.
→EventBridge Schema Registry with auto-discovery. Generate strongly-typed code bindings (Java, Python, TypeScript).
Why: Discovery learns schemas from observed events. Bindings give compile-time safety.
Reference↗
Sub-second-billed orchestration of high-volume short workflows (>100k/sec).
→Step Functions Express workflows. Per-execution-ms billing; 5-minute max.
Why: Standard workflows are durable + history-tracked, billed per state transition. Express trades audit trail for cost on short-lived flows.
Reference↗
Process 10M S3 objects in parallel through a Step Function.
→Distributed Map state. Up to 10,000 concurrent child executions; reads its input directly from S3.
Why: Inline Map caps at 40 parallel. Distributed Map scales to S3-bucket-size jobs without hitting service quotas.
Reference↗
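A sketch of the Amazon States Language for a Distributed Map state, as a Python dict; the bucket, prefix, and placeholder child state are assumptions:

```python
# Sketch: Distributed Map over an S3 listing (ASL expressed as a dict).
# Bucket "my-data-bucket", prefix, and the Pass placeholder are assumptions.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "my-data-bucket", "Prefix": "input/"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessObject",
        "States": {"ProcessObject": {"Type": "Pass", "End": True}},  # real work goes here
    },
    "MaxConcurrency": 1000,
    "End": True,
}
```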
FIFO queue requires >300 messages/sec.
→SQS FIFO with high-throughput mode enabled. Up to 70k transactions/sec per API action in supported regions; partition by `MessageGroupId`.
Why: Standard FIFO caps at 300 msg/sec without batching. High-throughput mode partitions ordering by group ID.
Reference↗
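High throughput depends on spreading messages across many group IDs; a sketch where the entity ID (an order) is the ordering scope and dedup is content-hashed, both assumptions:

```python
import hashlib, json

def fifo_entry(order_id: str, body: dict) -> dict:
    """Sketch: build a FIFO send_message payload. Ordering is per
    MessageGroupId, so using the order ID spreads load across partitions
    while preserving per-order ordering."""
    return {
        "MessageBody": json.dumps(body),
        "MessageGroupId": order_id,  # ordering scope, and the throughput partition
        "MessageDeduplicationId": hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest(),
    }

# boto3.client("sqs").send_message(QueueUrl=queue_url, **fifo_entry("order-42", {"sku": "X"}))
```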
Multiple consumers each need full read throughput on the same Kinesis stream.
→Enhanced Fan-Out (EFO). Each consumer gets a dedicated 2 MB/s/shard pipe via HTTP/2 push.
Why: Default polling shares the 2 MB/s/shard limit across consumers. EFO eliminates the contention at higher cost.
Reference↗
Firehose to S3; data lake queries scan too much because partitioning is by ingestion time, not event time.
→Firehose dynamic partitioning. Extract event-time / tenant ID from JSON; write to S3 prefix `year=YYYY/month=MM/tenant=X/`.
Why: Athena/Spectrum partition pruning on event time slashes scan cost and latency.
Reference↗
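A sketch of the prefix Firehose computes; the payload field names `event_ts` and `tenant_id` are assumptions (in Firehose itself this is configured declaratively with a metadata-extraction query plus a prefix template):

```python
import json
from datetime import datetime, timezone

def s3_prefix(record: bytes) -> str:
    """Sketch: derive the event-time/tenant S3 prefix from a JSON payload.
    Field names "event_ts" (epoch seconds) and "tenant_id" are assumptions."""
    evt = json.loads(record)
    ts = datetime.fromtimestamp(evt["event_ts"], tz=timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/tenant={evt['tenant_id']}/"
```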
Mobile/web client needs real-time updates and selective field fetching.
→AWS AppSync (GraphQL) with subscriptions. WebSocket-backed.
Why: GraphQL clients fetch only requested fields and subscribe to deltas. REST/HTTP API Gateway forces over-fetch and polling.
Reference↗
Internal API must not be reachable from public internet.
→API Gateway private endpoint via interface VPC endpoint. Resource policy restricts to specific VPCs.
Why: Private APIs are reachable only from VPC + connected networks. Public APIs require WAF + auth to be safe.
Reference↗
Lock down S3 origin so only CloudFront can read it.
→Origin Access Control (OAC). Replaces legacy OAI; supports SSE-KMS and all S3 features.
Why: OAI doesn't support SSE-KMS objects. AWS recommends OAC for all new distributions.
Reference↗
Time-limit access to specific paid videos in S3.
→CloudFront signed URLs (per-URL) or signed cookies (multiple URLs). Trusted key group signs requests.
Why: Pre-signed S3 URLs bypass CloudFront caching. CloudFront signed URLs cache at edge AND restrict access.
Reference↗
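A sketch of the custom policy document a trusted key group's private key signs; the URL and TTL are assumptions:

```python
import json, time

def custom_policy(url: str, ttl_seconds: int = 3600) -> str:
    """Sketch: CloudFront custom policy limiting access to one resource
    until an expiry time. The URL and one-hour TTL are assumptions."""
    return json.dumps({
        "Statement": [{
            "Resource": url,
            "Condition": {
                "DateLessThan": {"AWS:EpochTime": int(time.time()) + ttl_seconds}
            },
        }]
    }, separators=(",", ":"))

# Sign the policy with the registered key pair (e.g. via
# botocore.signers.CloudFrontSigner) and attach signature, policy, and
# key-pair ID as query parameters (signed URL) or cookies (signed cookies).
```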
Lightweight viewer-request transformation: header rewrite, redirect, A/B routing.
→CloudFront Functions. JS, sub-millisecond, all edge POPs.
Why: Lambda@Edge runs full Node.js or Python at regional edge caches, which is heavier and pricier. Functions are far cheaper for simple request manipulation.
Reference↗
Run untrusted multi-tenant workloads in EKS with strong isolation.
→EKS Fargate per-pod isolation. Each pod runs in a dedicated micro-VM.
Why: Managed node groups share kernel — privilege escalation crosses tenants. Fargate kernel isolation is the strongest in EKS.
Reference↗
EKS cluster autoscaling latency too slow; node groups' instance type sprawl.
→Karpenter. Provisioner picks instance types just-in-time based on pending pod requirements.
Why: Cluster Autoscaler scales pre-defined ASGs, slow and limited. Karpenter scales arbitrary EC2 in seconds with diversification.
Reference↗
EKS pod needs least-privilege IAM (avoid node-instance role sharing).
→IAM Roles for Service Accounts (IRSA) via OIDC provider. Annotate ServiceAccount with role ARN.
Why: IRSA scopes credentials to a ServiceAccount instead of the shared node role. EKS Pod Identity is the newer, simpler alternative; IRSA is mature and works across regions.
Reference↗
ECS-on-EC2 task starts take 5–7 minutes during scale-out — need <60s.
→ECS Capacity Provider with managed scaling target ~80% on `CapacityProviderReservation`. Maintain idle buffer.
Why: Reserved buffer means new tasks land on existing capacity instantly while ASG launches replacements.
Reference↗
Lambda triggered by SQS but only 5% of messages match — wasted invocations.
→Event source mapping with filter criteria. Lambda only invoked for matched messages.
Why: Pre-Lambda filter avoids per-invocation cost on irrelevant messages. Filtering supported on SQS, Kinesis, DynamoDB, MQ, Kafka.
Reference↗
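A sketch of the filter criteria shape for an SQS-backed mapping; the `type` field and its value are assumptions (for SQS, patterns match against the message `body`):

```python
import json

# Sketch: invoke the function only when the JSON message body has
# type == "order_created". Field name and value are assumptions.
filter_criteria = {
    "Filters": [
        {"Pattern": json.dumps({"body": {"type": ["order_created"]}})}
    ]
}

# boto3.client("lambda").update_event_source_mapping(
#     UUID="<mapping-uuid>", FilterCriteria=filter_criteria)
```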
Production app needs an LLM endpoint with low operational overhead.
→Amazon Bedrock for managed foundation models (Claude, Llama, Titan). SageMaker only when you need to host custom models or fine-tuned open weights.
Why: Bedrock is API-only — no infra. SageMaker is full ML platform — choose when you own training/fine-tuning lifecycle.
Reference↗
Pick managed AI for vision / NLP without training a model.
→Rekognition (image/video labels, faces, content moderation). Comprehend (sentiment, entities, languages, PII detection). Translate. Polly. Transcribe.
Why: Pre-trained AWS AI services skip the entire ML lifecycle for common tasks. Use SageMaker only when off-the-shelf doesn't fit.
Reference↗
Web app supports email/password + Google + Apple + SAML enterprise SSO.
→Cognito User Pool with hosted UI. Configure OIDC + SAML IdPs. App receives Cognito JWT.
Why: User Pool aggregates IdPs into one token. Identity Pool only swaps tokens for AWS creds — for AWS API access, not auth.
Reference↗
DynamoDB Global Tables with simultaneous writes to same key in two regions.
→Last-writer-wins by timestamp. Design writes to be idempotent, or partition write ownership by region.
Why: GT replication is async multi-master. Conflict resolution is timestamp-based — apps must tolerate eventual consistency.
Reference↗