Hands-on Lab — SOA-C03 AWS Certified CloudOps Engineer Associate

Last reviewed: May 2026

Build the AWS services on the SOA-C03 exam with plain Terraform — one block at a time, each tied back to an exam domain. The same code works on OpenTofu.

Overview

By the end of this lab you'll have provisioned, with plain Terraform, a complete monitor-and-auto-remediate loop: a CloudWatch log group with a metric filter, an alarm and an SNS topic that page a human, a Systems Manager runbook that performs the automated fix, and an EventBridge rule that routes the alarm's state change to the runbook so common incidents resolve themselves before anyone wakes up.

The same code works without modification on OpenTofu. There are no variables and no modules: drop the snippets into a single main.tf, run terraform init, then run terraform apply after adding each step.

Prerequisites

- Terraform >= 1.5 or OpenTofu >= 1.6.
- An AWS account with permissions to create CloudWatch, SNS, IAM, SSM, and EventBridge resources.
- AWS credentials that Terraform can pick up, for example from an authenticated AWS CLI profile. The lab deploys to us-east-1.
- An email address you can confirm a subscription on (Step 3 sends alerts there).
- This lab uses Systems Manager Automation (Step 4), which needs no separate opt-in in a standard AWS account.

Cost note

Everything in this lab should cost nothing, or very close to it, while idle:

- CloudWatch Logs: the free tier includes 5 GB of log data per month; this lab generates kilobytes.
- CloudWatch alarms: 10 alarms free, then $0.10 per standard-resolution alarm per month.
- SNS: 1,000 email deliveries free per month.
- SSM Automation: the free tier covers 100,000 steps per account per month; the lab runbook uses one step per run.
- EventBridge: events that AWS services publish to the default bus, including alarm state changes, are free; custom events cost $1 per million.

If the auto-remediation actually fires (Step 5 starts the SSM runbook), it costs nothing extra as long as you stay inside the Systems Manager Automation free tier.

Steps

1. Pick our Terraform version and AWS region
Standard opener. default_tags apply across the whole stack so the operations team can later filter Cost Explorer, AWS Config, and Tag Editor by Project = certlabpro-soa-c03 to see everything this lab created. (Cost Explorer only offers the tag as a filter once Project has been activated as a cost allocation tag in the Billing console.) SOA-C03's Reliability and Business Continuity domain explicitly tests this: tagging is the foundation of every cross-cutting operational query.
```
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.60"
    }
  }
}

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Project   = "certlabpro-soa-c03"
      ManagedBy = "terraform"
    }
  }
}
```
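To illustrate how default_tags combine with tags set on an individual resource, here is a minimal sketch. The resource tag_demo and the Owner value are hypothetical and not part of the lab. In AWS provider v5, resource-level tags are merged with the provider's default_tags, and the tags_all attribute exposes the merged result.

```hcl
# Hypothetical resource, not part of the lab: it only demonstrates tag merging.
resource "aws_sns_topic" "tag_demo" {
  name = "certlabpro-soa-c03-tag-demo"

  # Merged with default_tags from the provider block. If a key here collided
  # with a default tag key, the resource-level value would take precedence.
  tags = {
    Owner = "ops-team"
  }
}

# After apply, tags_all should contain Project, ManagedBy, and Owner.
output "tag_demo_tags_all" {
  value = aws_sns_topic.tag_demo.tags_all
}
```

Remove the demo resource again before moving on so the lab's resource list stays as described.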
2. Set up a log group with retention and a metric filter for the thing we care about
Provisions:
- Amazon CloudWatch Logs
Every operational story on AWS starts in CloudWatch Logs. We create a log group with explicit 30-day retention (the default, never expire, is a cost anti-pattern that SOA-C03 cost-optimization questions return to often) and a metric filter that watches the log group for the term ERROR and publishes a count to CloudWatch Metrics. The match is case-sensitive, so a lowercase error does not count.

Metric filters turn unstructured log data into actionable metrics. That's the SOA-C03 mental model for monitoring: logs → filter → metric → alarm → SNS → human (or automation). We're building the chain piece-by-piece from this step forward.
```
resource "aws_cloudwatch_log_group" "app" {
  name              = "/certlabpro/soa-c03/app"
  retention_in_days = 30
}

resource "aws_cloudwatch_log_metric_filter" "app_errors" {
  name           = "certlabpro-soa-c03-app-errors"
  log_group_name = aws_cloudwatch_log_group.app.name
  pattern        = "ERROR"

  metric_transformation {
    name          = "AppErrorCount"
    namespace     = "CertLabPro/SOA-C03"
    value         = "1"
    default_value = "0"
  }
}
```
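If the application emits structured logs, the same chain can start from a JSON filter pattern instead of a plain term. The sketch below assumes JSON log lines with a level field; that field name, the filter name, and the metric name AppErrorCountJson are illustrative assumptions, not something the lab defines.

```hcl
# Sketch only: assumes the application writes JSON log lines with a "level" field.
resource "aws_cloudwatch_log_metric_filter" "app_errors_json" {
  name           = "certlabpro-soa-c03-app-errors-json"
  log_group_name = aws_cloudwatch_log_group.app.name

  # JSON filter pattern: matches events whose level property equals "ERROR".
  pattern = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name          = "AppErrorCountJson" # hypothetical metric name
    namespace     = "CertLabPro/SOA-C03"
    value         = "1"
    default_value = "0"
  }
}
```

A property-based pattern avoids counting lines that merely mention ERROR inside a message body.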
3. Alarm on the metric and route notifications through SNS to a human
Provisions:
- Amazon CloudWatch
- Amazon SNS
Now we connect the metric from Step 2 to a human. We create an SNS topic, subscribe an email address to it, and set up a CloudWatch alarm that fires when the error count crosses our threshold. SOA-C03 tests this exact chain — metric → alarm → SNS → email — under the Monitoring, Logging, and Remediation domain (~20% of the exam).

After terraform apply, AWS sends a confirmation email to the address in endpoint — click Confirm subscription once, and the alarm will then actually reach you when it trips.

treat_missing_data = "notBreaching" is a small but exam-relevant detail. The default is missing, which means periods with no data are ignored and an alarm that has never received data sits in INSUFFICIENT_DATA rather than OK. Setting notBreaching treats quiet periods as healthy, so the alarm rests in OK and returns to OK once the errors stop, which suits a low-volume metric like this one.
```
resource "aws_sns_topic" "ops_alerts" {
  name = "certlabpro-soa-c03-ops-alerts"
}

resource "aws_sns_topic_subscription" "ops_email" {
  topic_arn = aws_sns_topic.ops_alerts.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # replace with your real email
}

resource "aws_cloudwatch_metric_alarm" "app_errors_spike" {
  alarm_name          = "certlabpro-soa-c03-app-errors-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "AppErrorCount"
  namespace           = "CertLabPro/SOA-C03"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "More than 10 ERROR log lines in 5 minutes."
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  treat_missing_data  = "notBreaching"
}
```
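A single noisy period can trip the alarm above. One common way to dampen that is an "M out of N" evaluation, sketched here as a variant alarm. The alarm name and the 2-of-3 choice are assumptions for illustration; evaluation_periods, datapoints_to_alarm, and ok_actions are standard arguments of aws_cloudwatch_metric_alarm.

```hcl
# Sketch only: a variant of the lab alarm that needs a sustained breach.
resource "aws_cloudwatch_metric_alarm" "app_errors_sustained" {
  alarm_name          = "certlabpro-soa-c03-app-errors-sustained"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3 # look at the last three 5-minute periods
  datapoints_to_alarm = 2 # alarm only if at least two of them breach
  metric_name         = "AppErrorCount"
  namespace           = "CertLabPro/SOA-C03"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "More than 10 ERROR log lines in 2 of the last 3 periods."
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  ok_actions          = [aws_sns_topic.ops_alerts.arn] # also notify on recovery
  treat_missing_data  = "notBreaching"
}
```

The trade-off is latency: this variant can take up to ten minutes longer to page than the single-period alarm.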
4. Write a Systems Manager runbook for the automated fix
Provisions:
- AWS Systems Manager
Paging a human is fine for novel incidents but expensive for known-recurring ones. SOA-C03 leans hard on AWS Systems Manager Automation as the answer to "how do I auto-fix the things I already know how to fix?" — restart an unhealthy service, rotate a credential, clean up disk space.

We author a minimal SSM document that runs an aws:sleep step (one of the AWS-managed step types) — in production this would be aws:executeAutomation against a known-recovery runbook, or aws:runCommand against a fleet of instances. The shape is the same: declare a sequence of steps, give the document an execution role, register it as a reusable automation.

The IAM role we attach trusts ssm.amazonaws.com, so Systems Manager Automation can assume it and call the actions inside the document. SOA-C03's Reliability and Business Continuity domain tests this exact pattern: a named, version-controlled runbook is auditable; a Slack message saying "hey can you restart that thing" is not.
```
resource "aws_iam_role" "ssm_automation" {
  name = "certlabpro-soa-c03-ssm-automation"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ssm.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ssm_automation" {
  role       = aws_iam_role.ssm_automation.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole"
}

resource "aws_ssm_document" "remediate_app_errors" {
  name            = "certlabpro-soa-c03-remediate-app-errors"
  document_type   = "Automation"
  document_format = "YAML"

  content = <<-EOT
    schemaVersion: "0.3"
    description: "Lab-only runbook — auto-acknowledges app-error spikes."
    assumeRole: "${aws_iam_role.ssm_automation.arn}"
    mainSteps:
      - name: pause
        action: aws:sleep
        inputs:
          Duration: PT5S
  EOT
}
```
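For a sense of what a production-shaped runbook might look like, here is a sketch that swaps aws:sleep for aws:runCommand. The document name, the InstanceId parameter, and the myapp service are hypothetical. It also assumes the target instance is managed by Systems Manager and reuses the execution role defined above.

```hcl
# Sketch only: InstanceId and the "myapp" service name are hypothetical.
resource "aws_ssm_document" "restart_service" {
  name            = "certlabpro-soa-c03-restart-service"
  document_type   = "Automation"
  document_format = "YAML"

  content = <<-EOT
    schemaVersion: "0.3"
    description: "Sketch: restart a service on one instance via Run Command."
    assumeRole: "${aws_iam_role.ssm_automation.arn}"
    parameters:
      InstanceId:
        type: String
        description: "Instance to remediate."
    mainSteps:
      - name: restartService
        action: aws:runCommand
        inputs:
          DocumentName: AWS-RunShellScript
          InstanceIds:
            - "{{ InstanceId }}"
          Parameters:
            commands:
              - "sudo systemctl restart myapp"
  EOT
}
```

The {{ InstanceId }} placeholder is resolved by Automation at run time, while the assumeRole line is interpolated by Terraform at plan time.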

5. Wire EventBridge so the alarm's ALARM state triggers the runbook
Provisions:
- Amazon EventBridge
- AWS IAM
The final piece of the loop. CloudWatch alarms emit events to the EventBridge default bus when they change state — we filter for the alarm we created in Step 3 transitioning to ALARM, and target the SSM document from Step 4 as the response.

The EventBridge rule needs its own IAM role to call SSM Automation on our behalf — that's a subtle but recurring SOA-C03 detail. The exam tests whether you remember that EventBridge invoking a target on your behalf is a service-to-service action that needs a dedicated execution role, distinct from the SSM document's own assume-role.

The full chain is now: log line containing ERROR → metric filter publishes to CloudWatch Metrics → alarm trips when count > 10 in 5 minutes → alarm publishes state-change event to EventBridge AND emails the ops team via SNS → EventBridge rule matches the state change → SSM Automation runs the remediation runbook. The pager fires and the fix kicks off in parallel. That's the SOA-C03 operational ideal.

```
resource "aws_iam_role" "eventbridge_invoke_ssm" {
  name = "certlabpro-soa-c03-eventbridge-invoke-ssm"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "eventbridge_invoke_ssm" {
  name = "start-automation"
  role = aws_iam_role.eventbridge_invoke_ssm.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Lets EventBridge start the runbook. "*" keeps the lab simple; scope it in production.
        Effect   = "Allow"
        Action   = "ssm:StartAutomationExecution"
        Resource = "*"
      },
      {
        # The runbook declares assumeRole, so the caller must be allowed to pass that role to SSM.
        Effect    = "Allow"
        Action    = "iam:PassRole"
        Resource  = aws_iam_role.ssm_automation.arn
        Condition = { StringLikeIfExists = { "iam:PassedToService" = "ssm.amazonaws.com" } }
      }
    ]
  })
}

resource "aws_cloudwatch_event_rule" "app_errors_alarm" {
  name        = "certlabpro-soa-c03-app-errors-alarm-fired"
  description = "Fires the auto-remediation runbook when the app-errors alarm trips."

  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    detail = {
      alarmName = [aws_cloudwatch_metric_alarm.app_errors_spike.alarm_name]
      state     = { value = ["ALARM"] }
    }
  })
}

resource "aws_cloudwatch_event_target" "run_ssm_doc" {
  rule = aws_cloudwatch_event_rule.app_errors_alarm.name
  # Automation targets use an "automation-definition" ARN that carries the account ID
  # of the document owner, so derive it from the document's own ARN.
  arn      = "${replace(aws_ssm_document.remediate_app_errors.arn, "document/", "automation-definition/")}:$DEFAULT"
  role_arn = aws_iam_role.eventbridge_invoke_ssm.arn
}
```
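The inline policy in this step uses Resource = "*" for brevity. A tighter alternative, sketched below, limits EventBridge to starting this one runbook. The local value and the policy name are hypothetical, and the sketch assumes that the automation-definition ARN can be derived from the document ARN by swapping the resource prefix. Treat it as a replacement for the wildcard statement rather than an addition to it.

```hcl
# Sketch only: an alternative to the "*" resource, not an addition to it.
locals {
  # Automation definitions use the document ARN with a different resource prefix.
  remediate_definition_arn = replace(
    aws_ssm_document.remediate_app_errors.arn,
    "document/",
    "automation-definition/"
  )
}

resource "aws_iam_role_policy" "eventbridge_invoke_ssm_scoped" {
  name = "start-automation-scoped" # hypothetical policy name
  role = aws_iam_role.eventbridge_invoke_ssm.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "ssm:StartAutomationExecution"
      # ":*" covers every version of this one runbook.
      Resource = "${local.remediate_definition_arn}:*"
    }]
  })
}
```

Scoping by document keeps a misconfigured rule from starting any other automation in the account.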

Cleanup

terraform destroy tears down everything in this lab, typically within a minute. One caveat: if you never confirmed the email subscription, Terraform cannot delete it while it is still pending; it only drops it from state, and Amazon SNS clears unconfirmed subscriptions on its own within a few days. No charges, just a short-lived leftover.

What this lab doesn't cover

SOA-C03 covers operational ground this lab can't fit — AWS Config rules and conformance packs for compliance drift, CloudTrail for API audit, Trusted Advisor checks, CloudFormation drift detection and StackSets for multi-account ops, AWS Backup, AWS Health events, Resource Explorer, License Manager, and Service Quotas.

We stick to the alarm-to-auto-remediate loop because it's the single most-tested operational pattern on the exam and the one that ties together the four highest-frequency services (CloudWatch, SNS, SSM, EventBridge). The other operational tools are conceptual coverage — see the Browse and Editorial sections of this cert page.
