Last reviewed: May 2026
Build the AWS services on the DEA-C01 exam with plain Terraform — one block at a time, each tied back to an exam domain. The same code works on OpenTofu.
By the end of this lab you'll have provisioned, with plain Terraform, the foundation every AWS data lake shares — an S3 bucket with a tiered lifecycle policy, a Glue Data Catalog database, a Glue crawler that discovers schema from objects landing in S3, and an Athena workgroup that lets you query the lake without provisioning servers. This is the architecture DEA-C01 calls data-lake-on-S3, and it shows up in roughly a quarter of the exam questions.
Every resource is plain Terraform, so the same code works without modification on OpenTofu. Drop the snippets into a single main.tf, run terraform init, then run terraform apply after adding each step.
Terraform >= 1.5 or OpenTofu >= 1.6.
us-east-1 (any region works; we default to us-east-1).
All resources here idle at $0:
The one bill watch-out is leaving the Glue crawler on a schedule. If you set schedule to a cron expression in Step 4 and forget to destroy, the crawler runs forever — still pennies per run, but it adds up if it's daily for a year. Destroy when done.
Standard opener. Glue and Athena are regional services — pick the region your raw data already lives in, because cross-region data transfer charges add up fast at petabyte scale. We default to us-east-1.
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.60"
    }
  }
}

provider "aws" {
  region = "us-east-1"

  # Tags applied to every resource this provider creates.
  default_tags {
    tags = {
      Project   = "certlabpro-dea-c01"
      ManagedBy = "terraform"
    }
  }
}

The S3 bucket is the entire substrate of a data lake. DEA-C01 specifically tests the storage-class lifecycle: you can save 80%+ on cold data by transitioning it through Standard → Standard-IA → Glacier Flexible → Glacier Deep Archive as it ages. We turn on encryption, lock down public access, and set a three-tier lifecycle rule that mirrors the most-frequently-tested DEA-C01 cost pattern.
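The transitions are 30 days → IA, 90 days → Glacier Flexible Retrieval, 180 days → Deep Archive. The 30-day figure is an S3 hard limit that DEA-C01 tests: a lifecycle rule can't transition objects to Standard-IA any earlier. The 90- and 180-day steps are this lab's choice; they line up with the minimum storage durations S3 bills for Glacier Flexible Retrieval (90 days) and Deep Archive (180 days). For a data lake that mixes recent and archival data, this lifecycle can save 60–90% on storage with zero application change.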
resource "aws_s3_bucket" "lake" {
  bucket_prefix = "certlabpro-dea-c01-"
}

# Block every form of public access to the lake bucket.
resource "aws_s3_bucket_public_access_block" "lake" {
  bucket                  = aws_s3_bucket.lake.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Default encryption at rest with S3-managed keys (SSE-S3).
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Three-tier lifecycle for objects under raw/.
resource "aws_s3_bucket_lifecycle_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id

  rule {
    id     = "tier-cold-data"
    status = "Enabled"

    filter {
      prefix = "raw/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER" # Glacier Flexible Retrieval
    }

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"
    }
  }
}

The Glue Data Catalog is the central metadata store every other AWS analytics service reads from: Athena, EMR, Redshift Spectrum, Lake Formation, and SageMaker Feature Store all share this one catalog. DEA-C01 tests this central-catalog model relentlessly: you provision the catalog once, and every analytics surface in your account picks it up automatically.
A database in Glue is a namespace (think schema in PostgreSQL); a table is the metadata describing how to read objects in S3 as structured data. We create the database here; the crawler in Step 4 will populate tables underneath it.
resource "aws_glue_catalog_database" "main" {
  name         = "certlabpro_dea_c01"
  description  = "Glue Catalog database for the certlabpro DEA-C01 lab."
  location_uri = "s3://${aws_s3_bucket.lake.bucket}/raw/"
}

A Glue crawler walks an S3 path, infers column names and types from the file contents, and writes the schema into the catalog database from Step 3. Whenever new files land, you re-run the crawler and the table picks up new partitions or schema evolution automatically. DEA-C01 tests this discovery pattern under the Data Store Management domain. It's the difference between a data engineer manually writing DDL and a data engineer letting Glue do it.
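For contrast, here is roughly what the manual route looks like: a hand-declared catalog table doing the job the crawler would otherwise do. This is a sketch, not part of the lab. The table name, the curated/events/ prefix, and the three columns are all hypothetical, and it assumes comma-delimited text files; a crawler would infer every one of these details from the objects themselves.

```hcl
# Hypothetical hand-written table; the lab itself lets the crawler create tables.
resource "aws_glue_catalog_table" "events_manual" {
  name          = "events_manual" # hypothetical table name
  database_name = aws_glue_catalog_database.main.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "csv"
  }

  storage_descriptor {
    # Hypothetical prefix, kept apart from raw/ so it does not overlap the crawler's target.
    location      = "s3://${aws_s3_bucket.lake.bucket}/curated/events/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }

    # Every column is spelled out by hand: this is the DDL the crawler saves you from writing.
    columns {
      name = "event_id"
      type = "string"
    }
    columns {
      name = "event_time"
      type = "timestamp"
    }
    columns {
      name = "amount"
      type = "double"
    }
  }
}
```

Each new column or format change means editing and re-applying this block, which is the maintenance cost the crawler removes.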
The IAM role we attach gives the crawler permission to read from the bucket from Step 2 and write into the catalog from Step 3. AWS publishes a managed policy AWSGlueServiceRole that covers most of this; we attach it and add inline S3 read access to our specific bucket. We deliberately don't set a schedule here — drop in schedule = "cron(0 5 * * ? *)" later if you want daily catalog refresh.
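If you do want the scheduled refresh later, one way to keep it switchable is to drive the schedule argument from a variable. The sketch below is a variant of the crawler resource that follows, meant to replace it rather than sit beside it. The variable name crawler_schedule is our own invention, not something the lab defines; leaving it at null keeps the crawler on-demand.

```hcl
# Hypothetical variable: null means on-demand only, so there is no schedule to forget.
variable "crawler_schedule" {
  type        = string
  default     = null
  description = "Optional cron expression for the crawler, e.g. cron(0 5 * * ? *)."
}

# Variant of the lab's crawler; use it instead of, not in addition to, the unscheduled one.
resource "aws_glue_crawler" "raw_data" {
  name          = "certlabpro-dea-c01-raw-data"
  database_name = aws_glue_catalog_database.main.name
  role          = aws_iam_role.glue_crawler.arn
  schedule      = var.crawler_schedule

  s3_target {
    path = "s3://${aws_s3_bucket.lake.bucket}/raw/"
  }

  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }
}
```

With this shape, terraform apply -var 'crawler_schedule=cron(0 5 * * ? *)' turns the daily run on, and a plain terraform apply turns it back off.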
# Role the Glue service assumes when the crawler runs.
resource "aws_iam_role" "glue_crawler" {
  name = "certlabpro-dea-c01-glue-crawler"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "glue.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS-managed baseline permissions for Glue (catalog writes, logging).
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_crawler.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Inline read access scoped to the lake bucket only.
resource "aws_iam_role_policy" "glue_lake_read" {
  name = "read-lake-bucket"
  role = aws_iam_role.glue_crawler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:ListBucket"]
      Resource = [aws_s3_bucket.lake.arn, "${aws_s3_bucket.lake.arn}/*"]
    }]
  })
}

resource "aws_glue_crawler" "raw_data" {
  name          = "certlabpro-dea-c01-raw-data"
  database_name = aws_glue_catalog_database.main.name
  role          = aws_iam_role.glue_crawler.arn

  s3_target {
    path = "s3://${aws_s3_bucket.lake.bucket}/raw/"
  }

  # Apply schema changes to the catalog; only log when source objects disappear.
  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }
}

Athena queries the catalog from Steps 3–4 with SQL: no servers, pay-per-TB-scanned. The DEA-C01 Data Operations and Support domain tests two specific Athena attributes hard: the per-query data-scanned limit (capped here at 1 GB so a runaway query can't accidentally scan 100 TB), and the separate results bucket (Athena writes query output back to S3; mixing results into the source bucket is the recurring exam anti-pattern).
The workgroup is the unit of governance: you can have a production workgroup with strict scan limits and an analytics-power-users workgroup with higher limits, then attach IAM principals to whichever fits. With this final piece in place, the data-lake foundation is complete: data lands in s3://<bucket>/raw/, the crawler from Step 4 catalogs it, and Athena queries it within the cost guardrails this workgroup sets.
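To make the governance idea concrete, here is one way the two-workgroup split could be expressed, as a sketch under stated assumptions. The workgroup names come from the paragraph above, but the scan limits (1 GB and 100 GB) and the local name athena_scan_limits are illustrative choices of ours. It reuses the results bucket defined in this step, with one output prefix per workgroup.

```hcl
locals {
  # Hypothetical per-workgroup scan limits, in bytes.
  athena_scan_limits = {
    "production"            = 1073741824   # 1 GB: strict guardrail
    "analytics-power-users" = 107374182400 # 100 GB: looser limit for exploratory work
  }
}

# One workgroup per map entry; each.key is the name, each.value the byte limit.
resource "aws_athena_workgroup" "tiered" {
  for_each = local.athena_scan_limits

  name  = each.key
  state = "ENABLED"

  configuration {
    enforce_workgroup_configuration = true
    bytes_scanned_cutoff_per_query  = each.value

    result_configuration {
      # Separate prefix per workgroup inside the shared results bucket.
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/${each.key}/"
    }
  }
}
```

Who may use which workgroup is then an IAM question: a policy that allows athena:StartQueryExecution only on one workgroup's ARN is one way to pin a principal to it.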
# Dedicated bucket for query output, kept separate from the source data.
resource "aws_s3_bucket" "athena_results" {
  bucket_prefix = "certlabpro-dea-c01-athena-results-"
}

resource "aws_s3_bucket_public_access_block" "athena_results" {
  bucket                  = aws_s3_bucket.athena_results.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_athena_workgroup" "main" {
  name  = "certlabpro-dea-c01"
  state = "ENABLED"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true
    bytes_scanned_cutoff_per_query     = 1073741824 # 1 GB per query: runaway-query guardrail

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
}

terraform destroy tears down everything in this lab. Two notes:
Both buckets keep force_destroy = false (the safe default), so destroy will fail if either contains objects (the lake bucket from Step 2 will collect raw files; Athena writes results into Step 5's bucket). Empty both via the console (or aws s3 rm s3://<bucket> --recursive) before destroying.

DEA-C01 covers more analytics ground than this lab can show in five plain-Terraform steps: Kinesis Data Streams + Kinesis Data Firehose for streaming ingestion, Amazon EMR for distributed Spark, AWS Lambda for serverless transformation, Step Functions for pipeline orchestration, Redshift for data warehousing, MSK for managed Kafka, OpenSearch for log analytics, QuickSight for BI dashboards, AWS DMS for database migration, and Lake Formation for fine-grained data lake permissions.
We stick to the single most-tested foundation (S3 + Glue Catalog + Glue Crawler + Athena) because it's the substrate every other DEA-C01 pattern builds on. Kinesis Firehose writes to this S3 bucket; EMR reads from this Glue Catalog; Lake Formation gates access to the tables this Athena workgroup queries. Once you can build this foundation cleanly, the rest is bolt-ons.
A second hands-on lab covering Kinesis Firehose → S3 → Glue → Athena (the streaming variant of the same chain) would be a natural follow-up. Conceptual coverage of the rest lives in the Browse, Playbook, and Editorial sections of this cert page.