Last reviewed: May 2026
Build the Google Cloud services on the PDE exam with plain Terraform, one block at a time, each tied back to an exam domain. The same code works on OpenTofu.
By the end of this lab you'll have provisioned, with plain Terraform, the canonical PDE streaming pipeline: a Cloud Storage ingest bucket, a Pub/Sub topic as the event ingress, a BigQuery dataset and table partitioned and clustered for query-cost control, and a Dataflow Flex Template job streaming Pub/Sub → BigQuery. Five blocks in all (provider and API setup plus the four resource groups above), covering the Pub/Sub → Dataflow → BigQuery pattern that PDE scenarios return to repeatedly.
Drop the snippets into a single main.tf, run terraform init, then terraform apply step-by-step.
Prerequisites: Terraform >= 1.5 or OpenTofu >= 1.6. Replace your-project-id in the provider block with your own project ID.
Dataflow is the line item to watch: the job (n1-standard-1 worker) costs ~$50/month while running. Destroy promptly after each lab session, or stop the job via gcloud dataflow jobs cancel if you don't intend to leave it running.
Enable Cloud Storage, Pub/Sub, BigQuery, and Dataflow APIs.
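terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 6.0" }
    # The Dataflow job resource sets provider = google-beta, so declare it here.
    google-beta = { source = "hashicorp/google-beta", version = "~> 6.0" }
    # The random_id resource used for the bucket name comes from this provider.
    random = { source = "hashicorp/random", version = "~> 3.0" }
  }
}
provider "google" {
  project = "your-project-id" # REPLACE
  region  = "us-central1"
}
provider "google-beta" {
  project = "your-project-id" # REPLACE (same project as above)
  region  = "us-central1"
}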
locals {
  labels = {
    project    = "certlabpro-pde"
    managed_by = "terraform"
  }
}
resource "google_project_service" "storage" {
  service            = "storage.googleapis.com"
  disable_on_destroy = false
}
resource "google_project_service" "pubsub" {
  service            = "pubsub.googleapis.com"
  disable_on_destroy = false
}
resource "google_project_service" "bigquery" {
  service            = "bigquery.googleapis.com"
  disable_on_destroy = false
}
resource "google_project_service" "dataflow" {
  service            = "dataflow.googleapis.com"
  disable_on_destroy = false
}
Dataflow jobs need a GCS staging bucket for temp files (Python wheels, JAR uploads, intermediate state). PDE-recommended pattern: one bucket per data domain, with subfolders for staging/, temp/, and templates/ to keep operational state separate from real data.
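As a minimal sketch of that subfolder convention, the three operational prefixes could be defined once as locals and reused wherever a path is needed. The names ingest_bucket_url and gcs_paths are hypothetical, introduced only for illustration; the sketch assumes the google_storage_bucket.ingest resource defined in this step.

```hcl
# Hypothetical helper: one place to define the operational prefixes.
# GCS has no real folders; a prefix appears once an object is written under it.
locals {
  ingest_bucket_url = "gs://${google_storage_bucket.ingest.name}"

  gcs_paths = {
    staging   = "${local.ingest_bucket_url}/staging"
    temp      = "${local.ingest_bucket_url}/temp"
    templates = "${local.ingest_bucket_url}/templates"
  }
}
```

A resource that needs a temp path could then reference local.gcs_paths.temp instead of repeating the string, which keeps the prefixes consistent if the bucket is ever renamed.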
resource "random_id" "suffix" {
  byte_length = 4
}
resource "google_storage_bucket" "ingest" {
  name                        = "certlabpro-pde-ingest-${random_id.suffix.hex}"
  location                    = "US"
  uniform_bucket_level_access = true
  force_destroy               = true # lab-only
  labels                      = local.labels
  depends_on                  = [google_project_service.storage]
}
PDE-canonical streaming ingress: publishers (clickstream, IoT, CDC) push events to a Pub/Sub topic; Dataflow subscribes and writes to BigQuery. Pub/Sub gives you durable, at-least-once delivery and the decoupling layer that lets you change consumers without touching producers.
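We create the topic events and a Dataflow-owned subscription events-to-bq. The subscription's ack_deadline_seconds = 60 is the PDE-recommended Dataflow setting: it is well above the 10-second default and longer than the typical Dataflow window-emission cadence, so messages are less likely to be redelivered before Dataflow acknowledges them.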
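To make the decoupling point concrete, here is a hedged sketch of a second consumer attached to the same topic. The subscription name events-archive and the archival consumer it implies are hypothetical and not part of this lab; the only thing assumed from the lab is the google_pubsub_topic.events resource defined in this step.

```hcl
# Hypothetical second consumer: not part of the lab pipeline.
# Each subscription receives its own copy of every message published to the
# topic, so adding this one changes nothing for publishers or for Dataflow.
resource "google_pubsub_subscription" "events_archive" {
  name                 = "events-archive"
  topic                = google_pubsub_topic.events.id
  ack_deadline_seconds = 60
  labels               = local.labels
}
```

If you try this, keep in mind that a subscription with no consumer attached accumulates unacknowledged messages until its retention period expires, so remove it when you are done experimenting.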
resource "google_pubsub_topic" "events" {
  name       = "events"
  labels     = local.labels
  depends_on = [google_project_service.pubsub]
}
resource "google_pubsub_subscription" "events_to_bq" {
  name                       = "events-to-bq"
  topic                      = google_pubsub_topic.events.id
  ack_deadline_seconds       = 60
  message_retention_duration = "604800s" # 7 days
  labels                     = local.labels
}
The PDE exam tests this partition-plus-cluster choice repeatedly. We create the events table partitioned by event_time (DAY granularity, the recurring PDE-canonical pick) and clustered by event_type: every query filtering on event_time skips irrelevant partitions, and every query filtering on event_type skips irrelevant blocks within each partition.
require_partition_filter = true forces queries to include a WHERE event_time >= ... clause — the PDE-recommended guardrail against accidentally-expensive full-table scans.
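A short sketch of a query shape that benefits from both choices, expressed as a BigQuery view so it stays in Terraform. The view name recent_purchases and the event_type value 'purchase' are hypothetical; the sketch assumes the analytics dataset and events table that this step defines.

```hcl
# Hypothetical view: illustrates a query that satisfies the partition filter
# (event_time predicate) and narrows within partitions via the clustering
# column (event_type predicate).
resource "google_bigquery_table" "recent_purchases" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "recent_purchases"
  deletion_protection = false # lab-only

  view {
    use_legacy_sql = false
    # Interpolating the table's attributes gives Terraform an implicit
    # dependency, so the view is created only after the events table exists.
    query = <<-SQL
      SELECT event_id, event_time, payload
      FROM `${google_bigquery_table.events.project}.${google_bigquery_table.events.dataset_id}.${google_bigquery_table.events.table_id}`
      WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
        AND event_type = 'purchase'
    SQL
  }
}
```

Dropping the event_time predicate from a query against the table should cause BigQuery to reject it while require_partition_filter is enabled, which is the guardrail described above.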
resource "google_bigquery_dataset" "analytics" {
  dataset_id                 = "analytics"
  location                   = "US"
  delete_contents_on_destroy = true
  labels                     = local.labels
  depends_on                 = [google_project_service.bigquery]
}
resource "google_bigquery_table" "events" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "events"
  deletion_protection = false
  # Set at the table level; the copy of this flag inside time_partitioning
  # is deprecated in recent provider versions.
  require_partition_filter = true
  time_partitioning {
    type  = "DAY"
    field = "event_time"
  }
  clustering = ["event_type"]
  schema = jsonencode([
    { name = "event_time", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "event_type", type = "STRING", mode = "REQUIRED" },
    { name = "event_id", type = "STRING", mode = "REQUIRED" },
    { name = "payload", type = "JSON", mode = "NULLABLE" },
  ])
  labels = local.labels
}
Dataflow Flex Templates are the PDE-canonical job-as-resource shape: Google ships pre-built templates for common patterns (Pub/Sub → BigQuery, Pub/Sub → GCS, JDBC → BigQuery, etc.) and you launch them with parameters.
We launch the Google-provided PubSub_Subscription_to_BigQuery template against the subscription from Step 3 and the table from Step 4. The job starts immediately on terraform apply and shows up under Dataflow → Jobs. Cancel it when done via gcloud dataflow jobs cancel <job-id> --region us-central1 to stop the ~$50/month worker billing.
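Since the cancel command needs the job ID, one option is to expose it as a Terraform output rather than looking it up in the console. This is a sketch that assumes the job_id and state attributes exported by the google_dataflow_flex_template_job resource defined in this step; the output names are illustrative.

```hcl
# Illustrative outputs: surface the values needed to cancel or inspect the job.
output "dataflow_job_id" {
  description = "ID to pass to gcloud dataflow jobs cancel"
  value       = google_dataflow_flex_template_job.pubsub_to_bq.job_id
}

output "dataflow_job_state" {
  description = "Job state as last seen by Terraform"
  value       = google_dataflow_flex_template_job.pubsub_to_bq.state
}
```

After an apply, terraform output -raw dataflow_job_id prints the ID, which can be supplied in place of <job-id> in the cancel command. The state value reflects the last refresh, not necessarily the live job status.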
data "google_project" "current" {}
resource "google_dataflow_flex_template_job" "pubsub_to_bq" {
  provider                = google-beta
  name                    = "certlabpro-pde-pubsub-to-bq"
  container_spec_gcs_path = "gs://dataflow-templates-us-central1/latest/flex/PubSub_Subscription_to_BigQuery"
  region                  = "us-central1"
  parameters = {
    inputSubscription = google_pubsub_subscription.events_to_bq.id
    outputTableSpec   = "${data.google_project.current.project_id}:${google_bigquery_dataset.analytics.dataset_id}.${google_bigquery_table.events.table_id}"
  }
  temp_location    = "gs://${google_storage_bucket.ingest.name}/temp"
  staging_location = "gs://${google_storage_bucket.ingest.name}/staging"
  on_delete        = "cancel"
  depends_on       = [google_project_service.dataflow]
}
terraform destroy tears down everything. The Dataflow job is cancelled (on_delete = "cancel") and worker billing stops within a few minutes. Pub/Sub + BigQuery + GCS resources destroy cleanly. The BigQuery table holding ingested events is deleted along with the dataset (delete_contents_on_destroy).
PDE covers many GCP data surfaces this lab can't fit — Dataproc (managed Hadoop / Spark), Cloud Composer (managed Airflow for batch orchestration), Cloud Data Fusion (visual ETL), Database Migration Service (DMS), Datastream (CDC from Oracle / MySQL / Postgres → BigQuery), Cloud Storage Transfer Service, BigQuery Omni / BigLake (multi-cloud / external tables), BigQuery ML (in-database ML), BigQuery BI Engine (cached query layer for Looker), Looker / Looker Studio, Cloud Pub/Sub Lite (cheaper but more-limited), Spanner / Bigtable / Firestore for application-tier data (covered in [[gcp-pcdoe]]), Vertex AI Pipelines / Feature Store / Workbench (covered in [[gcp-pmle]]), Cloud DLP / Sensitive Data Protection for PII redaction.
We stick to the GCS + Pub/Sub + Dataflow + BigQuery primitives because they're the PDE-canonical streaming pipeline that every exam scenario builds on. Composer orchestrates the same shapes in batch. Dataproc is the alternative compute engine for Spark workloads writing to the same BigQuery. Datastream is a managed CDC variant of the same Pub/Sub → Dataflow → BigQuery shape. Master the canonical pipeline; the alternatives slot in.
For service-by-service conceptual coverage, see the Browse, Playbook, and Editorial sections of this cert page.