Hands-on Lab — DEA-C01 AWS Certified Data Engineer Associate

Last reviewed: May 2026

Build the AWS services on the DEA-C01 exam with plain Terraform — one block at a time, each tied back to an exam domain. The same code works on OpenTofu.

Overview

By the end of this lab you'll have provisioned, with plain Terraform, the foundation every AWS data lake shares — an S3 bucket with a tiered lifecycle policy, a Glue Data Catalog database, a Glue crawler that discovers schema from objects landing in S3, and an Athena workgroup that lets you query the lake without provisioning servers. This is the architecture DEA-C01 calls data-lake-on-S3, and it shows up in roughly a quarter of the exam questions.

Drop the snippets into a single main.tf, run terraform init once, then run terraform apply after each step. Every resource is plain Terraform, so the same code works without modification on OpenTofu.

Prerequisites

- Terraform >= 1.5 or OpenTofu >= 1.6.
- An AWS account with permissions to create S3, Glue, Athena, and IAM resources.
- The AWS CLI authenticated against that account (any Region works; the lab defaults to us-east-1).
- A working idea of what a partition is in the columnar-data sense. DEA-C01 assumes it, and the Athena workgroup we configure in Step 5 is more useful if you understand why partitioning matters.

Cost note

All resources here idle at $0:

- S3: 5 GB free for new accounts; this lab puts kilobytes in.
- Glue Data Catalog: first 1M objects free; the lab catalog has 2 tables.
- Glue crawler: $0.44 per DPU-hour; a one-shot lab crawler costs ~$0.02 to run once.
- Athena: $5 per TB scanned; a lab query that scans a 10 KB sample is fractions of a cent.

The one thing to watch on the bill is a Glue crawler left on a schedule. If you set schedule to a cron expression in Step 4 and forget to destroy, the crawler keeps running on that schedule indefinitely. Each run still costs pennies, but a daily run adds up over a year. Destroy when done.

Steps

1. Pick our Terraform version and AWS region

Standard opener. Glue and Athena are regional services — pick the region your raw data already lives in, because cross-region data transfer charges add up fast at petabyte scale. We default to us-east-1.

```
terraform {
  # OpenTofu >= 1.6 also satisfies this constraint.
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.60"
    }
  }
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource the provider creates.
  default_tags {
    tags = {
      Project   = "certlabpro-dea-c01"
      ManagedBy = "terraform"
    }
  }
}
```
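Since any Region works, one way to avoid editing the provider block each time is a variable. This is a minimal sketch, not part of the original lab: the variable name aws_region is an assumption, and the validation is only a loose shape check.

```hcl
# Hypothetical variable (not in the original lab): lets the Region be chosen
# at plan time instead of being hard-coded.
variable "aws_region" {
  description = "AWS Region to deploy the lab into."
  type        = string
  default     = "us-east-1"

  validation {
    # Shape check only (for example us-east-1 or ap-southeast-2). It does not
    # confirm that the Region exists or is enabled for the account.
    condition     = can(regex("^[a-z]{2}(-[a-z]+)+-[0-9]$", var.aws_region))
    error_message = "aws_region must look like a Region code such as us-east-1."
  }
}

# The provider block would then read the variable instead of a literal:
#   region = var.aws_region
```

With that in place, terraform apply -var="aws_region=eu-west-1" would target a different Region without touching main.tf.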

2. Build a data lake bucket with a cost-aware lifecycle policy

Provisions:

- Amazon S3

The S3 bucket is the entire substrate of a data lake. DEA-C01 specifically tests the storage-class lifecycle: you can save 80%+ on cold data by transitioning it through Standard → Standard-IA → Glacier Flexible Retrieval → Glacier Deep Archive as it ages. We turn on default encryption, block public access, and set a lifecycle rule with three transitions that mirrors the most frequently tested DEA-C01 cost pattern.

The transitions are 30 days → Standard-IA, 90 days → Glacier Flexible Retrieval, 180 days → Glacier Deep Archive. Only the first number is a hard floor: S3 won't transition an object to Standard-IA until it has spent 30 days in Standard, and DEA-C01 tests that limit. The 90- and 180-day marks are conventional choices rather than S3 minimums. The rule applies only to objects under the raw/ prefix. For a data lake that mixes recent and archival data, this lifecycle saves 60–90% on storage with zero application change.

```
# bucket_prefix lets Terraform append a unique suffix, since bucket names are global.
resource "aws_s3_bucket" "lake" {
  bucket_prefix = "certlabpro-dea-c01-"
}

# Block every form of public access to the lake.
resource "aws_s3_bucket_public_access_block" "lake" {
  bucket = aws_s3_bucket.lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Default encryption at rest with S3-managed keys (SSE-S3).
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id

  rule {
    id     = "tier-cold-data"
    status = "Enabled"

    # Only objects under raw/ are tiered.
    filter { prefix = "raw/" }

    # 30 days is the earliest S3 allows a move to Standard-IA.
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # GLACIER is the API name for Glacier Flexible Retrieval.
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
```
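The 30-day floor for Standard-IA can also be encoded so that a bad value fails at plan time rather than at the S3 API. The sketch below is an assumption layered on the lab, not part of it; the variable name ia_transition_days is hypothetical.

```hcl
# Hypothetical variable (not in the original lab) for the first transition.
variable "ia_transition_days" {
  description = "Days after creation before objects under raw/ move to Standard-IA."
  type        = number
  default     = 30

  validation {
    # Mirrors the S3 rule that an object must spend 30 days in Standard first.
    condition     = var.ia_transition_days >= 30
    error_message = "S3 does not allow a transition to Standard-IA before 30 days."
  }
}

# In the lifecycle rule, the first transition would then use:
#   days = var.ia_transition_days
```

Running terraform plan -var="ia_transition_days=7" should then stop with the error message above before any API call is made.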

3. Create a Glue Data Catalog database to hold our schemas

Provisions:

- AWS Glue

The Glue Data Catalog is the central metadata store that other AWS analytics services read from: Athena, EMR, Redshift Spectrum, Lake Formation, and SageMaker Feature Store all share this one catalog. DEA-C01 tests this central-catalog model relentlessly: you provision the catalog once, and every analytics service in the same account and Region picks it up automatically.

A database in Glue is a namespace (think schema in PostgreSQL); a table is the metadata describing how to read objects in S3 as structured data. We create the database here; the crawler in Step 4 will populate tables underneath it.

```
resource "aws_glue_catalog_database" "main" {
  name        = "certlabpro_dea_c01"
  description = "Glue Catalog database for the certlabpro DEA-C01 lab."

  location_uri = "s3://${aws_s3_bucket.lake.bucket}/raw/"
}
```
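To make "a table is metadata" concrete, here is roughly what a hand-written table for newline-delimited JSON could look like. It is an illustration only and is not needed for the lab, since the crawler in Step 4 writes equivalent metadata for you. The table name, folder, column names, and partition key are all hypothetical.

```hcl
# Illustration only: a manually declared table. Every name below
# (events_example, event_id, event_time, dt) is made up for this sketch.
resource "aws_glue_catalog_table" "events_example" {
  name          = "events_example"
  database_name = aws_glue_catalog_database.main.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "json"
  }

  storage_descriptor {
    # Where the objects live; the table itself holds no data.
    location      = "s3://${aws_s3_bucket.lake.bucket}/raw/events_example/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    # How each line is parsed into columns.
    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    columns {
      name = "event_id"
      type = "string"
    }

    columns {
      name = "event_time"
      type = "timestamp"
    }
  }

  # Partition columns come from the folder layout, not from the file contents.
  partition_keys {
    name = "dt"
    type = "string"
  }
}
```

Writing this by hand for every dataset is the manual DDL work that a crawler is meant to replace.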

4. Add a Glue crawler that discovers schema from objects landing in S3

Provisions:

- AWS Glue
- AWS IAM

A Glue crawler walks an S3 path, infers column names and types from the file contents, and writes the schema into the catalog database from Step 3. Whenever new files land, you re-run the crawler and the table picks up new partitions or schema changes automatically. DEA-C01 tests this discovery pattern under the Data Store Management domain. It's the difference between a data engineer manually writing DDL and a data engineer letting Glue do it.

The IAM role we attach lets the crawler read from the Step 2 bucket and write into the Step 3 catalog. The AWS managed policy AWSGlueServiceRole covers the Glue side, but its S3 object permissions are scoped to aws-glue-* names, so we attach it and add an inline policy granting read access to our specific bucket. We deliberately don't set a schedule here; add schedule = "cron(0 5 * * ? *)" to the crawler later if you want a daily catalog refresh at 05:00 UTC.

```
# Role the Glue service assumes while the crawler runs.
resource "aws_iam_role" "glue_crawler" {
  name = "certlabpro-dea-c01-glue-crawler"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "glue.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS managed policy covering the Glue-side permissions.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_crawler.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Inline policy: read-only access to the lake bucket and its objects.
resource "aws_iam_role_policy" "glue_lake_read" {
  name = "read-lake-bucket"
  role = aws_iam_role.glue_crawler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:ListBucket"]
      Resource = [aws_s3_bucket.lake.arn, "${aws_s3_bucket.lake.arn}/*"]
    }]
  })
}

resource "aws_glue_crawler" "raw_data" {
  name          = "certlabpro-dea-c01-raw-data"
  database_name = aws_glue_catalog_database.main.name
  role          = aws_iam_role.glue_crawler.arn

  s3_target {
    path = "s3://${aws_s3_bucket.lake.bucket}/raw/"
  }

  # Update tables when the schema changes; only log when source objects disappear.
  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }
}
```
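The crawler needs something under raw/ to discover. One option, sketched here as an assumption rather than as part of the lab, is to let Terraform upload a tiny newline-delimited JSON file. The field names, the events folder, and the dt value are all hypothetical.

```hcl
locals {
  # Hypothetical sample rows; the field names are made up for illustration.
  sample_events = [
    { event_id = "e-001", event_type = "click", amount = 1.5 },
    { event_id = "e-002", event_type = "view", amount = 0 },
  ]
}

resource "aws_s3_object" "sample_events" {
  bucket = aws_s3_bucket.lake.id

  # Hive-style dt=... folder so the crawler can treat it as a partition.
  key = "raw/events/dt=2026-05-01/sample.json"

  # One JSON document per line, which is the layout JSON classifiers expect.
  content      = join("\n", [for e in local.sample_events : jsonencode(e)])
  content_type = "application/json"
}
```

After applying, aws glue start-crawler --name certlabpro-dea-c01-raw-data runs the crawler once. The name of the table it creates depends on how it groups the folders under raw/, so check the Glue console rather than assuming one. Because Terraform manages this object, terraform destroy removes it along with everything else.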

5. Add an Athena workgroup with bounded query cost and a dedicated result bucket

Provisions:

- Amazon Athena

Athena queries the catalog from Steps 3–4 with SQL: no servers, and you pay per TB scanned. The DEA-C01 Data Operations and Support domain tests two specific Athena attributes hard. The first is the per-query data-scanned limit, capped here at 1 GB so a runaway query is cancelled long before it can scan 100 TB. The second is the separate results bucket: Athena writes query output back to S3, and mixing results into the source bucket is the recurring exam anti-pattern.

The workgroup is the unit of governance: you can have a production workgroup with strict scan limits and an analytics-power-users workgroup with higher limits, then grant IAM principals access to whichever fits. With this final piece in place, the data-lake foundation is complete: data lands in s3://<bucket>/raw/, the crawler from Step 4 catalogs it, and Athena queries it within the cost guardrails this workgroup sets.

```
resource "aws_s3_bucket" "athena_results" {
  bucket_prefix = "certlabpro-dea-c01-athena-results-"
}

resource "aws_s3_bucket_public_access_block" "athena_results" {
  bucket = aws_s3_bucket.athena_results.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_athena_workgroup" "main" {
  name  = "certlabpro-dea-c01"
  state = "ENABLED"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true
    bytes_scanned_cutoff_per_query     = 1073741824 # 1 GB per query — runaway-query guardrail

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
}
```
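A saved query is a convenient way to exercise the workgroup and to see why partitioning matters for the scan limit. The sketch below is an assumption: the table name events and the columns dt and event_type are hypothetical, so substitute whatever your crawler actually produced.

```hcl
# Hypothetical saved query. The table (events) and columns (dt, event_type)
# are assumed names, not something the lab guarantees.
resource "aws_athena_named_query" "events_by_type" {
  name      = "certlabpro-dea-c01-events-by-type"
  workgroup = aws_athena_workgroup.main.id
  database  = aws_glue_catalog_database.main.name

  # Filtering on the partition column limits the scan to one dt=... folder,
  # which keeps the query well under the workgroup's per-query cutoff.
  query = <<-SQL
    SELECT event_type, COUNT(*) AS events
    FROM events
    WHERE dt = '2026-05-01'
    GROUP BY event_type
  SQL
}
```

The query appears under saved queries for this workgroup in the Athena console. Dropping the WHERE clause makes Athena read every partition, which is the pattern the scan cutoff exists to contain.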

Cleanup

terraform destroy tears down everything in this lab. Two notes:

- Both S3 buckets keep force_destroy = false (the safe default), so destroy fails if either still contains objects. The lake bucket from Step 2 collects raw files, and Athena writes results into the Step 5 bucket. Empty both via the console or with aws s3 rm s3://<bucket> --recursive before destroying.
- The Glue crawler is deleted immediately on destroy. The catalog database is dropped too, including any tables the crawler created underneath it. If you've added tables manually that you want to keep, export them via the Glue API first.
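Because bucket_prefix generates the bucket names, it helps to expose them before emptying anything. The outputs below are a small sketch; the output names are assumptions, not part of the original lab.

```hcl
# Hypothetical outputs (not in the original lab) exposing the generated names.
output "lake_bucket_name" {
  description = "Generated name of the data lake bucket from Step 2."
  value       = aws_s3_bucket.lake.bucket
}

output "athena_results_bucket_name" {
  description = "Generated name of the Athena results bucket from Step 5."
  value       = aws_s3_bucket.athena_results.bucket
}
```

After one more terraform apply, a command such as aws s3 rm "s3://$(terraform output -raw athena_results_bucket_name)" --recursive empties a bucket without looking its name up in the console.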

What this lab doesn't cover

DEA-C01 covers more analytics ground than this lab can show in five plain-Terraform steps — Kinesis Data Streams + Kinesis Data Firehose for streaming ingestion, Amazon EMR for distributed Spark, AWS Lambda for serverless transformation, Step Functions for pipeline orchestration, Redshift for data warehousing, MSK for managed Kafka, OpenSearch for log analytics, QuickSight for BI dashboards, AWS DMS for database migration, and Lake Formation for fine-grained data lake permissions.

We stick to the single most-tested foundation (S3 + Glue Catalog + Glue Crawler + Athena) because it's the substrate every other DEA-C01 pattern builds on. Kinesis Firehose writes to this S3 bucket; EMR reads from this Glue Catalog; Lake Formation governs access to the tables this Athena workgroup queries. Once you can build this foundation cleanly, the rest is bolt-ons.

A second hands-on lab covering Kinesis Firehose → S3 → Glue → Athena (the streaming variant of the same chain) would be a natural follow-up. Conceptual coverage of the rest lives in the Browse, Playbook, and Editorial sections of this cert page.
