Variable, bursty database workload — capacity needs swing 10x within minutes.
→Aurora Serverless v2. Set min/max ACU; Aurora scales in seconds without connection drops.
Why: v2 scales by adding capacity to the existing instance — no failover. Provisioned Aurora cannot scale this fast; Serverless v1 scales slower and pauses connections.
Reference↗
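A minimal sketch of setting the ACU bounds; the cluster name and the 0.5–64 range are assumptions:

```python
# Sketch: configure Aurora Serverless v2 scaling bounds.
# Cluster identifier "app-cluster" and the ACU range are assumptions.
scaling_params = {
    "DBClusterIdentifier": "app-cluster",
    "ServerlessV2ScalingConfiguration": {
        "MinCapacity": 0.5,   # floor: keeps the cluster warm between bursts
        "MaxCapacity": 64.0,  # ceiling: caps cost during spikes
    },
    "ApplyImmediately": True,
}

# With credentials configured, apply it like this:
# import boto3
# boto3.client("rds").modify_db_cluster(**scaling_params)
```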
Global app with <1s RPO and <1min RTO for cross-region DB failover.
→Aurora Global Database. Storage-based replication, typical replication lag <1s. Promote secondary in seconds.
Why: Global DB ships pages, not transactions — sub-second cross-region. Cross-region read replicas via logical replication can't match this.
Reference↗
Reproduce a production database for testing without paying for a full copy.
→Aurora cloning. Copy-on-write — initial clone is free; only changed pages billed.
Why: Clones are point-in-time, instant, isolated. Snapshot+restore takes hours and bills full storage immediately.
Reference↗
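A clone is a point-in-time restore with `RestoreType="copy-on-write"`; a hedged sketch with assumed cluster names:

```python
# Sketch: clone a production Aurora cluster copy-on-write.
# Source and target identifiers are assumptions.
clone_params = {
    "SourceDBClusterIdentifier": "prod-cluster",
    "DBClusterIdentifier": "prod-clone-for-testing",
    "RestoreType": "copy-on-write",     # this is what makes it a clone, not a restore
    "UseLatestRestorableTime": True,    # clone as of "now"
}

# boto3.client("rds").restore_db_cluster_to_point_in_time(**clone_params)
# Note: you still add DB instances to the cloned cluster afterwards.
```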
Recover from a logical error (DROP TABLE in prod) in minutes, not hours.
→Aurora MySQL Backtrack. Rewinds the cluster in place to a prior point in time without restoring from backup.
Why: Backtrack is in-place and fast. PITR restores create a new cluster — slower and requires app cutover.
Reference↗
Route reporting queries to specific reader instances with larger memory.
→Aurora custom endpoints. Define endpoint pointing to a subset of readers (the larger ones).
Why: Default reader endpoint round-robins all readers. Custom endpoints partition the cluster by workload type.
Reference↗
DynamoDB table experiences hot partition spikes throttling some reads/writes.
→Provisioned with auto-scaling + adaptive capacity (automatic). Redesign partition key if a single key is the hotspot.
Why: Adaptive capacity reallocates throughput across partitions without action. But if one key is hot, only schema redesign (composite key, write sharding) helps.
Reference↗
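One common redesign is write sharding; a sketch where the shard count and key shape are assumptions:

```python
import hashlib

NUM_SHARDS = 10  # assumption: size to spread the hot key's throughput

def sharded_pk(hot_key: str, sort_value: str) -> str:
    """Append a deterministic shard suffix so writes for one logical key
    land on NUM_SHARDS physical partitions instead of one."""
    shard = int(hashlib.md5(sort_value.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{hot_key}#{shard}"

# Readers query all suffixes ("hot-item#0" .. "hot-item#9") and merge results.
```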
Side-effect on every DynamoDB write — push to OpenSearch for search indexing.
→DynamoDB Streams + Lambda trigger. Lambda batches stream records and writes to OpenSearch.
Why: Streams capture item-level changes for 24h with a native Lambda trigger. Use Kinesis Data Streams for DynamoDB when you need longer retention or analytics fan-out.
Reference↗
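A sketch of the Lambda side, turning stream records into OpenSearch `_bulk` actions; the index name, key attribute, and the HTTP shipping step are assumptions:

```python
import json

def handler(event, context=None):
    """Sketch: map DynamoDB Stream records to OpenSearch bulk action lines.
    Index "items" and a string partition key "pk" are assumptions."""
    lines = []
    for rec in event["Records"]:
        key = rec["dynamodb"]["Keys"]["pk"]["S"]
        if rec["eventName"] == "REMOVE":
            lines.append(json.dumps({"delete": {"_index": "items", "_id": key}}))
        else:  # INSERT / MODIFY carry a NewImage
            doc = {k: list(v.values())[0] for k, v in rec["dynamodb"]["NewImage"].items()}
            lines.append(json.dumps({"index": {"_index": "items", "_id": key}}))
            lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # POST this body to the OpenSearch /_bulk endpoint
```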
Two-phase write across multiple DynamoDB items must be atomic.
→TransactWriteItems / TransactGetItems. ACID semantics across up to 100 items.
Why: Native transactions avoid the distributed-saga complexity. Cost is 2x normal capacity per item — use only when atomicity is required.
Reference↗
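A classic use is a balance transfer; a sketch of the request payload, with table and key names as assumptions:

```python
# Sketch: atomically debit one account and credit another.
# Table "accounts" and key shape are assumptions.
transfer = {
    "TransactItems": [
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"pk": {"S": "acct#A"}},
                "UpdateExpression": "SET balance = balance - :amt",
                "ConditionExpression": "balance >= :amt",  # overdraft fails the whole txn
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"pk": {"S": "acct#B"}},
                "UpdateExpression": "SET balance = balance + :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
    ]
}
# boto3.client("dynamodb").transact_write_items(**transfer)
```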
Migrate a self-hosted MongoDB cluster to a managed service preserving the API.
→Amazon DocumentDB. MongoDB-compatible API. Use mongodump/mongorestore or DMS for migration.
Why: DocumentDB is API-compatible with MongoDB 4.0/5.0 (most operators, not all). Verify driver/feature compatibility before committing.
Reference↗
Recommendation engine needs to traverse a social graph of 100M nodes.
→Amazon Neptune. Property graph (Gremlin) or RDF (SPARQL).
Why: Purpose-built graph DB. Modeling relationships in DynamoDB or RDS is possible but query performance degrades with hop depth.
Reference↗
IoT fleet emits 10M timeseries datapoints/sec with mixed-frequency retention.
→Amazon Timestream. Memory store (recent), magnetic store (historical) — automatic tiering.
Why: Purpose-built timeseries — DynamoDB/RDS scaling cost prohibitive at this rate. Built-in retention tiering reduces storage cost.
Reference↗
Banking ledger needs cryptographic verification of every record change.
→Amazon QLDB. Immutable, cryptographically verifiable journal. Use SHA-256 digest export for proofs.
Why: QLDB is a purpose-built ledger, but AWS has announced its end of support; verify availability before adopting. DynamoDB Streams give change history but no built-in cryptographic chaining.
Log analytics workload with unpredictable peaks and hands-off operations.
→Amazon OpenSearch Serverless. Decoupled compute/storage; auto-scales OCUs.
Why: No cluster sizing or shard management. For predictable, sustained workloads, provisioned domains are cheaper.
Reference↗
Petabyte-scale analytics with elastic compute and data sharing across teams.
→Redshift RA3 nodes with managed storage. Cross-cluster data sharing (no copy).
Why: RA3 separates compute from storage — scale each independently. Data sharing eliminates ETL between teams' clusters.
Reference↗
Existing Redshift cluster + S3 data lake — query S3 from Redshift, or use Athena?
→Redshift Spectrum when joins between cluster tables and S3 data are needed. Athena when fully serverless ad-hoc on S3 only.
Why: Spectrum runs S3 scans through your cluster's Spectrum fleet (a per-TB scan charge on top of cluster cost); Athena is fully serverless and also billed per TB scanned. Pick by where the dominant data lives.
Reference↗
Different teams need different row/column visibility on the same Glue Catalog tables.
→AWS Lake Formation with row-level + column-level + cell-level filters. Grant via LF tags.
Why: IAM/S3 policies cannot do row-level. Lake Formation enforces fine-grained access via Glue Catalog metadata + Athena/Redshift Spectrum/EMR consumers.
Reference↗
Daily Glue job processes incremental data; must not reprocess yesterday's files.
→Glue job bookmarks. Track processed S3 keys / DB rows; resume from last successful checkpoint.
Why: Bookmarks avoid duplicate processing without manual state tracking. Disable for full-reprocess runs.
Reference↗
Pick managed Kafka vs Kinesis Data Streams for event streaming.
→MSK when existing Kafka clients/ecosystem. Kinesis for tight AWS integration (Lambda triggers, Firehose, KCL) and serverless option.
Why: Both stream durably with replay. MSK preserves Kafka API and ecosystem; Kinesis costs less for small streams and integrates natively.
Reference↗
Variable Kafka throughput; want hands-off cluster management.
→MSK Serverless. Auto-scales partitions and throughput; pay per partition + data.
Why: No broker sizing. For sustained high throughput, provisioned MSK is cheaper.
Reference↗
Wire SQS → filter → Step Functions without writing glue Lambda.
→EventBridge Pipes. Source → optional filter → optional enrichment → target.
Why: Replaces a typical Lambda-as-glue. Reduces code, cost, and operational surface.
Reference↗
Replay last week's events through new consumer without re-emitting from source.
→EventBridge archive + replay. Archive captures matched events; replay them to a target later.
Why: Built-in replay avoids needing a separate event store. Useful for incident recovery and onboarding new consumers.
Reference↗
Hundreds of producers emit events; consumers need typed bindings.
→EventBridge Schema Registry with auto-discovery. Generate strongly-typed code bindings (Java, Python, TypeScript).
Why: Discovery learns schemas from observed events. Bindings give compile-time safety.
Reference↗
Sub-second-billed orchestration of high-volume short workflows (>100k/sec).
→Step Functions Express workflows. Per-execution-ms billing; 5-minute max.
Why: Standard workflows are durable + history-tracked, billed per state transition. Express trades audit trail for cost on short-lived flows.
Reference↗
Process 10M S3 objects in parallel through a Step Function.
→Distributed Map state. Up to 10,000 concurrent child executions; reads its input directly from S3.
Why: Inline Map caps at 40 parallel. Distributed Map scales to S3-bucket-size jobs without hitting service quotas.
Reference↗
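A sketch of the Amazon States Language for a Distributed Map state, as a Python dict; the bucket, prefix, and placeholder child state are assumptions:

```python
# Sketch: Distributed Map over an S3 listing (ASL expressed as a dict).
# Bucket "my-data-bucket", prefix, and the Pass placeholder are assumptions.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "my-data-bucket", "Prefix": "input/"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessObject",
        "States": {"ProcessObject": {"Type": "Pass", "End": True}},  # real work goes here
    },
    "MaxConcurrency": 1000,
    "End": True,
}
```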
FIFO queue requires >300 messages/sec.
→SQS FIFO with high-throughput mode enabled. Up to 70k transactions/sec per API action in supported regions; partition by `MessageGroupId`.
Why: Standard FIFO caps at 300 msg/sec without batching. High-throughput mode partitions ordering by group ID.
Reference↗
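High throughput depends on spreading messages across many group IDs; a sketch where the entity ID (an order) is the ordering scope and dedup is content-hashed, both assumptions:

```python
import hashlib, json

def fifo_entry(order_id: str, body: dict) -> dict:
    """Sketch: build a FIFO send_message payload. Ordering is per
    MessageGroupId, so using the order ID spreads load across partitions
    while preserving per-order ordering."""
    return {
        "MessageBody": json.dumps(body),
        "MessageGroupId": order_id,  # ordering scope, and the throughput partition
        "MessageDeduplicationId": hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest(),
    }

# boto3.client("sqs").send_message(QueueUrl=queue_url, **fifo_entry("order-42", {"sku": "X"}))
```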
Multiple consumers each need full read throughput on the same Kinesis stream.
→Enhanced Fan-Out (EFO). Each consumer gets a dedicated 2 MB/s/shard pipe via HTTP/2 push.
Why: Default polling shares the 2 MB/s/shard limit across consumers. EFO eliminates the contention at higher cost.
Reference↗
Firehose to S3; data lake queries scan too much because partitioning is by ingestion time, not event time.
→Firehose dynamic partitioning. Extract event-time / tenant ID from JSON; write to S3 prefix `year=YYYY/month=MM/tenant=X/`.
Why: Athena/Spectrum partition pruning on event time slashes scan cost and latency.
Reference↗
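A sketch of the prefix Firehose computes; the payload field names `event_ts` and `tenant_id` are assumptions (in Firehose itself this is configured declaratively with a metadata-extraction query plus a prefix template):

```python
import json
from datetime import datetime, timezone

def s3_prefix(record: bytes) -> str:
    """Sketch: derive the event-time/tenant S3 prefix from a JSON payload.
    Field names "event_ts" (epoch seconds) and "tenant_id" are assumptions."""
    evt = json.loads(record)
    ts = datetime.fromtimestamp(evt["event_ts"], tz=timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/tenant={evt['tenant_id']}/"
```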
Mobile/web client needs real-time updates and selective field fetching.
→AWS AppSync (GraphQL) with subscriptions. WebSocket-backed.
Why: GraphQL clients fetch only requested fields and subscribe to deltas. REST/HTTP API Gateway forces over-fetch and polling.
Reference↗
Internal API must not be reachable from public internet.
→API Gateway private endpoint via interface VPC endpoint. Resource policy restricts to specific VPCs.
Why: Private APIs are reachable only from VPC + connected networks. Public APIs require WAF + auth to be safe.
Reference↗
Lock down S3 origin so only CloudFront can read it.
→Origin Access Control (OAC). Replaces legacy OAI; supports SSE-KMS and all S3 features.
Why: OAI doesn't support SSE-KMS objects. AWS recommends OAC for all new distributions.
Reference↗
Time-limit access to specific paid videos in S3.
→CloudFront signed URLs (per-URL) or signed cookies (multiple URLs). Trusted key group signs requests.
Why: Pre-signed S3 URLs bypass CloudFront caching. CloudFront signed URLs cache at edge AND restrict access.
Reference↗
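A sketch of the custom policy document a trusted key group's private key signs; the URL and TTL are assumptions:

```python
import json, time

def custom_policy(url: str, ttl_seconds: int = 3600) -> str:
    """Sketch: CloudFront custom policy limiting access to one resource
    until an expiry time. The URL and one-hour TTL are assumptions."""
    return json.dumps({
        "Statement": [{
            "Resource": url,
            "Condition": {
                "DateLessThan": {"AWS:EpochTime": int(time.time()) + ttl_seconds}
            },
        }]
    }, separators=(",", ":"))

# Sign the policy with the registered key pair (e.g. via
# botocore.signers.CloudFrontSigner) and attach signature, policy, and
# key-pair ID as query parameters (signed URL) or cookies (signed cookies).
```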
Lightweight viewer-request transformation: header rewrite, redirect, A/B routing.
→CloudFront Functions. JS, sub-millisecond, all edge POPs.
Why: Lambda@Edge runs full Node.js or Python at regional edge caches, which is heavier and pricier. Functions are far cheaper for simple request manipulation.
Reference↗
Run untrusted multi-tenant workloads in EKS with strong isolation.
→EKS Fargate per-pod isolation. Each pod runs in a dedicated micro-VM.
Why: Managed node groups share kernel — privilege escalation crosses tenants. Fargate kernel isolation is the strongest in EKS.
Reference↗
EKS cluster autoscaling latency too slow; node groups' instance type sprawl.
→Karpenter. Provisioner picks instance types just-in-time based on pending pod requirements.
Why: Cluster Autoscaler scales pre-defined ASGs, slow and limited. Karpenter scales arbitrary EC2 in seconds with diversification.
Reference↗
EKS pod needs least-privilege IAM (avoid node-instance role sharing).
→IAM Roles for Service Accounts (IRSA) via OIDC provider. Annotate ServiceAccount with role ARN.
Why: IRSA scopes credentials to a ServiceAccount instead of the shared node role. EKS Pod Identity is the newer, simpler alternative; IRSA is mature and works across regions.
Reference↗
ECS-on-EC2 task starts take 5–7 minutes during scale-out — need <60s.
→ECS Capacity Provider with managed scaling target ~80% on `CapacityProviderReservation`. Maintain idle buffer.
Why: Reserved buffer means new tasks land on existing capacity instantly while ASG launches replacements.
Reference↗
Lambda triggered by SQS but only 5% of messages match — wasted invocations.
→Event source mapping with filter criteria. Lambda only invoked for matched messages.
Why: Pre-Lambda filter avoids per-invocation cost on irrelevant messages. Filtering supported on SQS, Kinesis, DynamoDB, MQ, Kafka.
Reference↗
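A sketch of the filter criteria shape for an SQS-backed mapping; the `type` field and its value are assumptions (for SQS, patterns match against the message `body`):

```python
import json

# Sketch: invoke the function only when the JSON message body has
# type == "order_created". Field name and value are assumptions.
filter_criteria = {
    "Filters": [
        {"Pattern": json.dumps({"body": {"type": ["order_created"]}})}
    ]
}

# boto3.client("lambda").update_event_source_mapping(
#     UUID="<mapping-uuid>", FilterCriteria=filter_criteria)
```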
Production app needs an LLM endpoint with low operational overhead.
→Amazon Bedrock for managed foundation models (Claude, Llama, Titan). SageMaker only when you need to host custom models or fine-tuned open weights.
Why: Bedrock is API-only — no infra. SageMaker is full ML platform — choose when you own training/fine-tuning lifecycle.
Reference↗
Pick managed AI for vision / NLP without training a model.
→Rekognition (image/video labels, faces, content moderation). Comprehend (sentiment, entities, languages, PII detection). Translate. Polly. Transcribe.
Why: Pre-trained AWS AI services skip the entire ML lifecycle for common tasks. Use SageMaker only when off-the-shelf doesn't fit.
Reference↗
Web app supports email/password + Google + Apple + SAML enterprise SSO.
→Cognito User Pool with hosted UI. Configure OIDC + SAML IdPs. App receives Cognito JWT.
Why: User Pool aggregates IdPs into one token. Identity Pool only swaps tokens for AWS creds — for AWS API access, not auth.
Reference↗
DynamoDB Global Tables with simultaneous writes to same key in two regions.
→Last-writer-wins by timestamp. Design writes to be idempotent, or partition write ownership by region.
Why: GT replication is async multi-master. Conflict resolution is timestamp-based — apps must tolerate eventual consistency.
Reference↗