Last reviewed: May 2026
Build the AWS services on the SOA-C03 exam with plain Terraform — one block at a time, each tied back to an exam domain. The same code works on OpenTofu.
By the end of this lab you'll have provisioned, with plain Terraform, a complete monitor-and-auto-remediate loop — a CloudWatch log group with a metric filter, an SNS topic that pages a human, a Systems Manager runbook that does the automated fix, and an EventBridge rule that wires high-severity events to the runbook so common incidents resolve themselves before anyone wakes up.
Every resource is plain Terraform — the same code works without modification on OpenTofu. No variables, no modules. Drop the snippets into a single main.tf, run terraform init, then terraform apply step-by-step.
Prerequisites: Terraform >= 1.5 or OpenTofu >= 1.6. All resources are created in us-east-1.
Everything in this lab costs nothing while idle.
If the auto-remediation actually fires (Step 5 triggers an SSM document), that costs nothing extra — Systems Manager Automation is free for the actions used here.
Standard opener. default_tags apply across the whole stack so the operations team can later filter Cost Explorer, AWS Config, and Tag Editor by Project = certlabpro-soa-c03 to see everything this lab created. SOA-C03's Reliability and Business Continuity domain explicitly tests this — tagging is the foundation of every cross-cutting operational query.
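To see how default_tags combine with per-resource tags, a throwaway resource like the one below can help. This is a minimal sketch: the tag_demo log group, its name, and the Component tag are hypothetical and not part of the lab, and it relies on the provider block from this step being in the same main.tf.

```hcl
# Hypothetical demo resource: not part of the lab's remediation loop.
resource "aws_cloudwatch_log_group" "tag_demo" {
  name              = "/certlabpro/soa-c03/tag-demo"
  retention_in_days = 1

  # Resource-level tags are merged with the provider's default_tags.
  tags = {
    Component = "demo"
  }
}

# tags_all should show Project, ManagedBy, and Component together.
output "tag_demo_tags_all" {
  value = aws_cloudwatch_log_group.tag_demo.tags_all
}
```

After terraform apply, the output should list all three tags, which is the merged set that Cost Explorer and Tag Editor filter on.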
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.60"
    }
  }
}
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      Project   = "certlabpro-soa-c03"
      ManagedBy = "terraform"
    }
  }
}

Every operational story on AWS starts in CloudWatch Logs. We create a log group with explicit 30-day retention (the default of never expire is the SOA-C03 cost-anti-pattern that comes up in every cost-optimization question) and a metric filter that watches the log stream for the word ERROR and publishes a count to CloudWatch Metrics.
Metric filters turn unstructured log data into actionable metrics. That's the SOA-C03 mental model for monitoring: logs → filter → metric → alarm → SNS → human (or automation). We're building the chain piece-by-piece from this step forward.
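The filter in this step matches the plain term ERROR anywhere in a log line. If the application wrote structured JSON logs instead, the same logs → filter → metric step could be sketched with a JSON filter pattern. The level field, the resource name, and the metric name below are assumptions for illustration, not part of the lab.

```hcl
# Hypothetical variant: assumes each log event is a JSON object
# with a "level" field, e.g. {"level": "ERROR", "msg": "..."}.
resource "aws_cloudwatch_log_metric_filter" "app_errors_json" {
  name           = "certlabpro-soa-c03-app-errors-json"
  log_group_name = aws_cloudwatch_log_group.app.name

  # JSON filter pattern: match events whose level field equals ERROR.
  pattern = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name          = "AppErrorCountJson"
    namespace     = "CertLabPro/SOA-C03"
    value         = "1"
    default_value = "0"
  }
}
```

A field-based pattern avoids counting lines that merely mention the word ERROR in a message body.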
resource "aws_cloudwatch_log_group" "app" {
  name              = "/certlabpro/soa-c03/app"
  retention_in_days = 30
}
resource "aws_cloudwatch_log_metric_filter" "app_errors" {
  name           = "certlabpro-soa-c03-app-errors"
  log_group_name = aws_cloudwatch_log_group.app.name
  pattern        = "ERROR"
  metric_transformation {
    name          = "AppErrorCount"
    namespace     = "CertLabPro/SOA-C03"
    value         = "1"
    default_value = "0"
  }
}

Now we connect the metric from Step 2 to a human. We create an SNS topic, subscribe an email address to it, and set up a CloudWatch alarm that fires when the error count crosses our threshold. SOA-C03 tests this exact chain (metric → alarm → SNS → email) under the Monitoring, Logging, and Remediation domain (~20% of the exam).
After terraform apply, AWS sends a confirmation email to the address in endpoint — click Confirm subscription once, and the alarm will then actually reach you when it trips.
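One way to check whether the confirmation link has been clicked is to surface the subscription's pending_confirmation attribute as an output. This is a small sketch; the output name is arbitrary, and the value only updates when Terraform refreshes state.

```hcl
# true until the recipient clicks "Confirm subscription" in the email;
# re-run `terraform apply` or `terraform refresh` to pick up the change.
output "ops_email_pending_confirmation" {
  value = aws_sns_topic_subscription.ops_email.pending_confirmation
}
```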
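treat_missing_data = "notBreaching" is a small but exam-relevant detail. The default is missing, which leaves missing data points out of the evaluation, so an alarm on a metric that has not reported yet sits in INSUFFICIENT_DATA rather than OK. Setting it to notBreaching treats periods with no data as within the threshold, which keeps the alarm in OK during quiet periods and matches the SOA-C03 convention for low-volume metrics.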
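To illustrate how notBreaching interacts with multi-period evaluation, here is a hedged sketch of a second alarm that only fires on a sustained spike. The app_errors_sustained name and the 2-of-3 setting are assumptions for illustration; the metric, namespace, and topic are the ones this lab defines.

```hcl
# Hypothetical variant: alarm when 2 of the last 3 five-minute periods
# each report more than 10 errors.
resource "aws_cloudwatch_metric_alarm" "app_errors_sustained" {
  alarm_name          = "certlabpro-soa-c03-app-errors-sustained"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  metric_name         = "AppErrorCount"
  namespace           = "CertLabPro/SOA-C03"
  period              = 300
  statistic           = "Sum"
  threshold           = 10

  # Periods with no data count as within the threshold, so only
  # periods that actually report errors move the alarm toward ALARM.
  treat_missing_data = "notBreaching"

  alarm_actions = [aws_sns_topic.ops_alerts.arn]
  # Also notify on recovery so the pager gets an all-clear.
  ok_actions = [aws_sns_topic.ops_alerts.arn]
}
```

Requiring 2 of 3 periods trades a slower page for fewer false alarms from one-off bursts.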
resource "aws_sns_topic" "ops_alerts" {
  name = "certlabpro-soa-c03-ops-alerts"
}
resource "aws_sns_topic_subscription" "ops_email" {
  topic_arn = aws_sns_topic.ops_alerts.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # replace with your real email
}
resource "aws_cloudwatch_metric_alarm" "app_errors_spike" {
  alarm_name          = "certlabpro-soa-c03-app-errors-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "AppErrorCount"
  namespace           = "CertLabPro/SOA-C03"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "More than 10 ERROR log lines in 5 minutes."
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  treat_missing_data  = "notBreaching"
}

Paging a human is fine for novel incidents but expensive for known-recurring ones. SOA-C03 leans hard on AWS Systems Manager Automation as the answer to "how do I auto-fix the things I already know how to fix?", for example restarting an unhealthy service, rotating a credential, or cleaning up disk space.
We author a minimal SSM document that runs an aws:sleep step (one of the AWS-managed step types) — in production this would be aws:executeAutomation against a known-recovery runbook, or aws:runCommand against a fleet of instances. The shape is the same: declare a sequence of steps, give the document an execution role, register it as a reusable automation.
The IAM role we attach has a trust policy that lets the Systems Manager service assume it, plus the AmazonSSMAutomationRole managed policy for the actions inside the document. SOA-C03's Reliability and Business Continuity domain tests this exact pattern: a named, version-controlled runbook is auditable; a Slack message saying "hey can you restart that thing" is not.
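For comparison, a production-shaped runbook built on aws:runCommand might look roughly like the sketch below. Everything specific here is an assumption for illustration: the restart_app_service name, the InstanceId parameter, and the myapp systemd unit are hypothetical, and the target instance would need the SSM Agent and an instance profile that allows Run Command.

```hcl
# Hypothetical runbook: restarts a service on one instance via Run Command.
resource "aws_ssm_document" "restart_app_service" {
  name            = "certlabpro-soa-c03-restart-app-service"
  document_type   = "Automation"
  document_format = "YAML"
  content         = <<-EOT
    schemaVersion: "0.3"
    description: "Sketch: restart a service on a single instance."
    assumeRole: "${aws_iam_role.ssm_automation.arn}"
    parameters:
      InstanceId:
        type: String
        description: "Instance to run the restart on."
    mainSteps:
      - name: restartService
        action: aws:runCommand
        inputs:
          DocumentName: AWS-RunShellScript
          InstanceIds:
            - "{{ InstanceId }}"
          Parameters:
            commands:
              - "sudo systemctl restart myapp"
  EOT
}
```

The shape matches the lab's document (schemaVersion, assumeRole, mainSteps); only the step's action and inputs change. Because this variant takes a parameter, whatever starts the execution has to supply InstanceId.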
resource "aws_iam_role" "ssm_automation" {
  name = "certlabpro-soa-c03-ssm-automation"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ssm.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
resource "aws_iam_role_policy_attachment" "ssm_automation" {
  role       = aws_iam_role.ssm_automation.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole"
}
resource "aws_ssm_document" "remediate_app_errors" {
  name            = "certlabpro-soa-c03-remediate-app-errors"
  document_type   = "Automation"
  document_format = "YAML"
  content         = <<-EOT
    schemaVersion: "0.3"
    description: "Lab-only runbook: auto-acknowledges app-error spikes."
    assumeRole: "${aws_iam_role.ssm_automation.arn}"
    mainSteps:
      - name: pause
        action: aws:sleep
        inputs:
          Duration: PT5S
  EOT
}

The final piece of the loop. CloudWatch alarms emit events to the EventBridge default bus when they change state. We filter for the alarm we created in Step 3 transitioning to ALARM and target the SSM document from Step 4 as the response.
The EventBridge rule needs its own IAM role to call SSM Automation on our behalf — that's a subtle but recurring SOA-C03 detail. The exam tests whether you remember that EventBridge invoking a target on your behalf is a service-to-service action that needs a dedicated execution role, distinct from the SSM document's own assume-role.
The full chain is now: log line containing ERROR → metric filter publishes to CloudWatch Metrics → alarm trips when count > 10 in 5 minutes → alarm publishes state-change event to EventBridge AND emails the ops team via SNS → EventBridge rule matches the state change → SSM Automation runs the remediation runbook. The pager fires and the fix kicks off in parallel. That's the SOA-C03 operational ideal.
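The inline policy in this step allows ssm:StartAutomationExecution on every resource. A tighter variant might look like the sketch below. It rests on two assumptions worth verifying against the IAM documentation: that the action is authorized against an automation-definition ARN (derived here from the document's own ARN, with a wildcard version), and that starting a runbook which declares an assumeRole also requires iam:PassRole on that role. The eventbridge_invoke_ssm_scoped name is hypothetical.

```hcl
# Hypothetical least-privilege policy for the EventBridge execution role.
data "aws_iam_policy_document" "eventbridge_invoke_ssm_scoped" {
  # Allow starting only this lab's runbook, any version.
  statement {
    effect  = "Allow"
    actions = ["ssm:StartAutomationExecution"]
    resources = [
      "${replace(aws_ssm_document.remediate_app_errors.arn, "document/", "automation-definition/")}:*"
    ]
  }

  # Allow handing the runbook's execution role to Systems Manager only.
  statement {
    effect    = "Allow"
    actions   = ["iam:PassRole"]
    resources = [aws_iam_role.ssm_automation.arn]
    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["ssm.amazonaws.com"]
    }
  }
}
```

To try it, set the policy argument of aws_iam_role_policy.eventbridge_invoke_ssm to data.aws_iam_policy_document.eventbridge_invoke_ssm_scoped.json in place of the jsonencode block.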
resource "aws_iam_role" "eventbridge_invoke_ssm" {
  name = "certlabpro-soa-c03-eventbridge-invoke-ssm"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
resource "aws_iam_role_policy" "eventbridge_invoke_ssm" {
  name = "start-automation"
  role = aws_iam_role.eventbridge_invoke_ssm.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "ssm:StartAutomationExecution"
      Resource = "*"
    }]
  })
}
resource "aws_cloudwatch_event_rule" "app_errors_alarm" {
  name        = "certlabpro-soa-c03-app-errors-alarm-fired"
  description = "Fires the auto-remediation runbook when the app-errors alarm trips."
  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    detail = {
      alarmName = [aws_cloudwatch_metric_alarm.app_errors_spike.alarm_name]
      state     = { value = ["ALARM"] }
    }
  })
}
resource "aws_cloudwatch_event_target" "run_ssm_doc" {
  rule     = aws_cloudwatch_event_rule.app_errors_alarm.name
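  # A custom runbook's target ARN needs the account ID, so derive it from the
  # document's own ARN and pin the default version.
  arn      = "${replace(aws_ssm_document.remediate_app_errors.arn, "document/", "automation-definition/")}:$DEFAULT"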
  role_arn = aws_iam_role.eventbridge_invoke_ssm.arn
}

terraform destroy tears down everything in this lab. One caveat: the SNS email subscription stays in your account history after destroy (AWS keeps the unsubscribe record for compliance). No charges, just a paper trail. Everything else terminates cleanly within a minute.
SOA-C03 covers operational ground this lab can't fit — AWS Config rules and conformance packs for compliance drift, CloudTrail for API audit, Trusted Advisor checks, CloudFormation drift detection and StackSets for multi-account ops, AWS Backup, AWS Health events, Resource Explorer, License Manager, and Service Quotas.
We stick to the alarm-to-auto-remediate loop because it's the single most-tested operational pattern on the exam and the one that ties together the four highest-frequency services (CloudWatch, SNS, SSM, EventBridge). The other operational tools are conceptual coverage — see the Browse and Editorial sections of this cert page.