Hands-on Lab — PDE Google Cloud Professional Data Engineer

Last reviewed: May 2026

Build the Google Cloud services on the PDE exam with plain Terraform, one block at a time, each tied back to an exam domain. The same code works on OpenTofu.

Overview

By the end of this lab you'll have provisioned, with plain Terraform, the canonical PDE streaming pipeline: a Cloud Storage ingest bucket, a Pub/Sub topic as the event ingress, a BigQuery dataset and table partitioned and clustered for query-cost control, and a Dataflow Flex Template job streaming Pub/Sub → BigQuery. The lab has five steps, and together they build the Pub/Sub → Dataflow → BigQuery pattern that recurs across PDE exam scenarios.

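Drop the snippets into a single main.tf and run terraform init (rerun it if a later step introduces a new provider). Then work through the steps in order, running terraform apply after you add each one.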

Prerequisites

Terraform >= 1.5 or OpenTofu >= 1.6.
A GCP project you own (with billing attached).
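gcloud CLI installed, with Application Default Credentials (ADC) set up via gcloud auth application-default login.
Replace your-project-id with your own project ID wherever it appears in the provider configuration.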

Cost note

Dataflow is the line item to watch:

Dataflow streaming job: roughly $50/month while running, assuming a single n1-standard-1 worker. The job in Step 5 does not pin a machine type, so the actual rate depends on the worker Dataflow selects. Destroy promptly after each lab session.
Pub/Sub: first 10 GB of messages per month free.
BigQuery storage: first 10 GB free.
BigQuery queries: first 1 TB per month free.
GCS: 5 GB of Standard storage free.

If you don't intend to leave the job running, stop it with gcloud dataflow jobs cancel.

Steps

1. Provider, project services, naming

Enable Cloud Storage, Pub/Sub, BigQuery, and Dataflow APIs.

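terraform {
  required_version = ">= 1.5"

  required_providers {
    google      = { source = "hashicorp/google", version = "~> 6.0" }
    google-beta = { source = "hashicorp/google-beta", version = "~> 6.0" } # used by the Dataflow job in Step 5
    random      = { source = "hashicorp/random", version = "~> 3.0" }      # used for the bucket suffix in Step 2
  }
}

provider "google" {
  project = "your-project-id" # REPLACE
  region  = "us-central1"
}

# Step 5 sets provider = google-beta, so that provider needs the same project and region.
provider "google-beta" {
  project = "your-project-id" # REPLACE
  region  = "us-central1"
}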

locals {
  labels = {
    project    = "certlabpro-pde"
    managed_by = "terraform"
  }
}

resource "google_project_service" "storage" {
  service            = "storage.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "pubsub" {
  service            = "pubsub.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "bigquery" {
  service            = "bigquery.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "dataflow" {
  service            = "dataflow.googleapis.com"
  disable_on_destroy = false
}
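
Dataflow runs its workers as Compute Engine VMs, so the job in Step 5 may fail to start in a project where the Compute Engine API has never been enabled. The following is a minimal sketch that follows the same google_project_service pattern as the blocks above. It is an addition to the lab, and the resource name compute is an illustrative choice that nothing else in this lab references.

```hcl
# Assumption: not part of the original lab. Enables the Compute Engine API,
# which Dataflow worker VMs depend on.
resource "google_project_service" "compute" {
  service            = "compute.googleapis.com"
  disable_on_destroy = false
}
```

If you add it, you could also list google_project_service.compute in the Dataflow job's depends_on so the API is enabled before the job launches.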

2. Provision a Cloud Storage ingest + temp bucket for Dataflow

Provisions:

Cloud Storage

Dataflow jobs need a GCS staging bucket for temp files (Python wheels, JAR uploads, intermediate state). PDE-recommended pattern: one bucket per data domain, with subfolders for staging/, temp/, and templates/ to keep operational state separate from real data.

resource "random_id" "suffix" {
  byte_length = 4
}

resource "google_storage_bucket" "ingest" {
  name                        = "certlabpro-pde-ingest-${random_id.suffix.hex}"
  location                    = "US"
  uniform_bucket_level_access = true
  force_destroy               = true # lab-only

  labels = local.labels

  depends_on = [google_project_service.storage]
}
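
The staging/, temp/, and templates/ layout described above can be sketched as a locals block that keeps the paths in one place. Cloud Storage has no real folders, so these prefixes only come into existence when the first object is written under them. The names dataflow_bucket_uri and dataflow_paths are hypothetical helpers, not something the rest of the lab depends on.

```hcl
# Hypothetical helper locals; Terraform allows more than one locals block per file.
locals {
  dataflow_bucket_uri = "gs://${google_storage_bucket.ingest.name}"

  # One prefix per kind of operational state, kept apart from ingested data.
  dataflow_paths = {
    staging   = "${local.dataflow_bucket_uri}/staging"
    temp      = "${local.dataflow_bucket_uri}/temp"
    templates = "${local.dataflow_bucket_uri}/templates"
  }
}
```

With this in place, a job could reference local.dataflow_paths.temp instead of repeating the string interpolation.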

3. Wire the Pub/Sub topic (the streaming ingress)

Provisions:

Pub/Sub

PDE-canonical streaming ingress: publishers (clickstream, IoT, CDC) push events to a Pub/Sub topic; Dataflow subscribes and writes to BigQuery. Pub/Sub gives you durable, at-least-once delivery and the decoupling layer that lets you change consumers without touching producers.

We create the topic events plus a subscription, events-to-bq, reserved for the Dataflow job. The subscription sets ack_deadline_seconds = 60, which gives the pipeline more time than the 10-second default to acknowledge a message before Pub/Sub redelivers it.

```
resource "google_pubsub_topic" "events" {
  name = "events"

  labels = local.labels

  depends_on = [google_project_service.pubsub]
}

resource "google_pubsub_subscription" "events_to_bq" {
  name  = "events-to-bq"
  topic = google_pubsub_topic.events.id

  ack_deadline_seconds       = 60
  message_retention_duration = "604800s" # 7 days

  labels = local.labels
}
```
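
To make the decoupling point concrete, here is a sketch of a second consumer attached to the same topic. Publishers and the events-to-bq subscription are untouched, and each subscription receives its own copy of messages published after it is created. The resource name events_debug, the subscription name, and the retention value are illustrative assumptions.

```hcl
# Hypothetical second consumer: a pull subscription for inspecting raw events.
# Adding it requires no change to the topic or to the Dataflow subscription.
resource "google_pubsub_subscription" "events_debug" {
  name  = "events-debug"
  topic = google_pubsub_topic.events.id

  ack_deadline_seconds       = 10
  message_retention_duration = "3600s" # 1 hour; illustrative value

  labels = local.labels
}
```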

4. Create the BigQuery target: partitioned + clustered table

Provisions:

BigQuery

The PDE exam returns to this partition + cluster choice again and again. We create the events table partitioned by event_time (DAY granularity, the recurring PDE-canonical pick) and clustered by event_type. Queries that filter on event_time prune irrelevant partitions, and queries that also filter on event_type skip irrelevant blocks within the remaining partitions.

require_partition_filter = true makes BigQuery reject any query that does not filter on the partitioning column, for example with WHERE event_time >= ... This is the PDE-recommended guardrail against accidentally expensive full-table scans.

resource "google_bigquery_dataset" "analytics" {
  dataset_id                 = "analytics"
  location                   = "US"
  delete_contents_on_destroy = true

  labels = local.labels

  depends_on = [google_project_service.bigquery]
}

resource "google_bigquery_table" "events" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "events"
  deletion_protection = false

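  time_partitioning {
    type  = "DAY"
    field = "event_time"
  }

  # Set at table level; the copy nested inside time_partitioning is deprecated in recent provider versions.
  require_partition_filter = true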

  clustering = ["event_type"]

  schema = jsonencode([
    { name = "event_time",  type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "event_type",  type = "STRING",    mode = "REQUIRED" },
    { name = "event_id",    type = "STRING",    mode = "REQUIRED" },
    { name = "payload",     type = "JSON",      mode = "NULLABLE" },
  ])

  labels = local.labels
}
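
As an illustration of a query that satisfies the partition-filter guardrail, the sketch below defines a view over the events table whose WHERE clause restricts event_time to the last seven days. The view name events_last_7d and the seven-day window are hypothetical choices made for this example.

```hcl
# Hypothetical view. The filter on event_time satisfies require_partition_filter
# and limits the scan to recent daily partitions.
resource "google_bigquery_table" "events_last_7d" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "events_last_7d"
  deletion_protection = false

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT event_time, event_type, event_id, payload
      FROM `${google_bigquery_table.events.project}.${google_bigquery_table.events.dataset_id}.${google_bigquery_table.events.table_id}`
      WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    SQL
  }
}
```

Adding a further predicate on event_type when querying the view lets BigQuery use the clustering as well.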

5. Launch a Dataflow Flex Template streaming Pub/Sub → BigQuery

Provisions:

Dataflow

Dataflow Flex Templates are the PDE-canonical way to treat a job as a resource: Google ships pre-built templates for common patterns (Pub/Sub → BigQuery, Pub/Sub → GCS, JDBC → BigQuery, etc.) and you launch them with parameters.

We launch the Google-provided PubSub_Subscription_to_BigQuery template against the subscription from Step 3 and the table from Step 4. The job starts immediately on terraform apply and shows up under Dataflow → Jobs. When you are done, cancel it with gcloud dataflow jobs cancel <job-id> --region us-central1 to stop the worker billing (~$50/month).

data "google_project" "current" {}

resource "google_dataflow_flex_template_job" "pubsub_to_bq" {
  provider = google-beta
  name     = "certlabpro-pde-pubsub-to-bq"
  region   = "us-central1"

  # Check that this object exists before applying (for example, list the flex/ prefix
  # with gcloud storage ls); template names can differ between classic and Flex variants.
  container_spec_gcs_path = "gs://dataflow-templates-us-central1/latest/flex/PubSub_Subscription_to_BigQuery"

  parameters = {
    inputSubscription = google_pubsub_subscription.events_to_bq.id
    # BigQuery table spec in project:dataset.table form
    outputTableSpec = "${data.google_project.current.project_id}:${google_bigquery_dataset.analytics.dataset_id}.${google_bigquery_table.events.table_id}"
  }

  temp_location    = "gs://${google_storage_bucket.ingest.name}/temp"
  staging_location = "gs://${google_storage_bucket.ingest.name}/staging"

  # Cancel (rather than drain) the streaming job on terraform destroy.
  on_delete = "cancel"

  depends_on = [google_project_service.dataflow]
}
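
The cancel command above needs the job ID. One way to surface it, sketched here with hypothetical output names, is a pair of Terraform outputs. This assumes the job_id attribute exported by google_dataflow_flex_template_job.

```hcl
# Hypothetical outputs; read them with `terraform output` after apply.
output "dataflow_job_id" {
  description = "Job ID to pass to gcloud dataflow jobs cancel"
  value       = google_dataflow_flex_template_job.pubsub_to_bq.job_id
}

output "pubsub_topic" {
  description = "Full resource ID of the ingress topic"
  value       = google_pubsub_topic.events.id
}
```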

Cleanup

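terraform destroy tears down every resource the lab created. The Dataflow job is cancelled (on_delete = "cancel") and worker billing stops within a few minutes. The Pub/Sub, BigQuery, and GCS resources destroy cleanly: the bucket is emptied first (force_destroy), and the BigQuery table holding ingested events is deleted along with the dataset (delete_contents_on_destroy). The project APIs stay enabled because each google_project_service sets disable_on_destroy = false.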

What this lab doesn't cover

PDE covers many GCP data surfaces this lab can't fit:

Dataproc (managed Hadoop / Spark)
Cloud Composer (managed Airflow for batch orchestration)
Cloud Data Fusion (visual ETL)
Database Migration Service (DMS)
Datastream (CDC from Oracle / MySQL / Postgres → BigQuery)
Cloud Storage Transfer Service
BigQuery Omni / BigLake (multi-cloud / external tables)
BigQuery ML (in-database ML)
BigQuery BI Engine (cached query layer for Looker)
Looker / Looker Studio
Cloud Pub/Sub Lite (cheaper but more limited)
Spanner / Bigtable / Firestore for application-tier data (covered in [[gcp-pcdoe]])
Vertex AI Pipelines / Feature Store / Workbench (covered in [[gcp-pmle]])
Cloud DLP / Sensitive Data Protection for PII redaction

We stick to the GCS + Pub/Sub + Dataflow + BigQuery primitives because they're the PDE-canonical streaming pipeline that every exam scenario builds on. Composer orchestrates the same shapes in batch. Dataproc is the alternative compute engine for Spark workloads writing to the same BigQuery. Datastream is a managed CDC variant of the same Pub/Sub → Dataflow → BigQuery shape. Master the canonical pipeline; the alternatives slot in.

For service-by-service conceptual coverage, see the Browse, Playbook, and Editorial sections of this cert page.
