Hands-On Lab: PDE Google Cloud Professional Data Engineer

Last reviewed: May 2026

Build the GCP services covered by the PDE exam with native Terraform, one block at a time, each tied to an exam domain. The same code runs on OpenTofu.

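Overview

In this lab you provision the canonical PDE streaming pipeline in plain Terraform: a Cloud Storage ingest bucket, a Pub/Sub topic as the event ingress, a BigQuery dataset and table that are partitioned and clustered to control query cost, and a Dataflow Flex Template job that streams from Pub/Sub into BigQuery. There are five blocks in total; together they form the Pub/Sub → Dataflow → BigQuery pattern that shows up throughout the PDE exam.

Put these snippets into a single main.tf file, run terraform init, then run terraform apply step by step.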

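Prerequisites

Terraform >= 1.5 or OpenTofu >= 1.6.
A GCP project you own, with billing enabled.
The gcloud CLI, authenticated with Application Default Credentials (ADC).
Replace every occurrence of your-project-id in the provider configuration.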

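Cost notes

Watch the Dataflow cost:

Dataflow streaming job (1 n1-standard-1 worker): about $50/month while running. Destroy it right after every lab session.
Pub/Sub: 10 GB of messages free per month.
BigQuery storage: 10 GB free.
BigQuery queries: 1 TB free per month.
GCS: 5 GB of Standard storage free.

If you do not plan to leave the Dataflow job running, stop it with gcloud dataflow jobs cancel.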

Steps

1. Provider, project services, naming

Enable the Cloud Storage, Pub/Sub, BigQuery, and Dataflow APIs.

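```
terraform {
  required_version = ">= 1.5"

  required_providers {
    google = { source = "hashicorp/google", version = "~> 6.0" }
    # google-beta is referenced by the Dataflow Flex Template job in step 5.
    google-beta = { source = "hashicorp/google-beta", version = "~> 6.0" }
    # random supplies the bucket-name suffix in step 2.
    random = { source = "hashicorp/random", version = "~> 3.0" }
  }
}

provider "google" {
  project = "your-project-id" # REPLACE
  region  = "us-central1"
}

provider "google-beta" {
  project = "your-project-id" # REPLACE (same project as above)
  region  = "us-central1"
}

# Labels shared by every resource in this lab.
locals {
  labels = {
    project    = "certlabpro-pde"
    managed_by = "terraform"
  }
}

# disable_on_destroy = false leaves the APIs enabled after terraform destroy.
resource "google_project_service" "storage" {
  service            = "storage.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "pubsub" {
  service            = "pubsub.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "bigquery" {
  service            = "bigquery.googleapis.com"
  disable_on_destroy = false
}

resource "google_project_service" "dataflow" {
  service            = "dataflow.googleapis.com"
  disable_on_destroy = false
}
```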

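2. Provision a Cloud Storage ingest and temp bucket for Dataflow
Services configured:
- Cloud Storage
A Dataflow job needs a GCS staging bucket for its temporary files (Python wheels, JAR uploads, intermediate state). The pattern PDE recommends is one bucket per data domain, with staging/, temp/, and templates/ subfolders that keep operational state apart from the actual data.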
```
resource "random_id" "suffix" {
  byte_length = 4
}

resource "google_storage_bucket" "ingest" {
  name                        = "certlabpro-pde-ingest-${random_id.suffix.hex}"
  location                    = "US"
  uniform_bucket_level_access = true
  force_destroy               = true # lab-only

  labels = local.labels

  depends_on = [google_project_service.storage]
}
```
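
The staging/, temp/, and templates/ layout described above does not have to be created by hand, because Cloud Storage treats prefixes as implicit folders. If you would like the folders to be visible before the first job runs, a minimal sketch could create placeholder objects. The resource name folders and the for_each set are assumptions made for illustration; they are not part of the original lab.

```hcl
# Optional sketch: placeholder objects so the three prefixes show up in the
# console before Dataflow writes anything. Names here are illustrative.
resource "google_storage_bucket_object" "folders" {
  for_each = toset(["staging/", "temp/", "templates/"])

  bucket  = google_storage_bucket.ingest.name
  name    = each.value # a trailing slash makes the object act as a folder
  content = " "        # the provider needs content or source to be set
}
```

Because these objects are managed by Terraform, terraform destroy removes them along with the bucket.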
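
3. Wire up the Pub/Sub topic (streaming ingress)
Services configured:
- Pub/Sub
The canonical PDE streaming ingress: publishers (clickstream, IoT, CDC) push events to a Pub/Sub topic; Dataflow subscribes and writes to BigQuery. Pub/Sub provides durable, at-least-once delivery plus a decoupling layer that lets you change consumers without touching producers.

We create the topic events and a Dataflow-owned subscription events-to-bq. The subscription sets ack_deadline_seconds = 60, which this lab treats as the PDE-recommended setting for Dataflow: it is longer than Pub/Sub's 10-second default and longer than a typical Dataflow window emission period.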
```
resource "google_pubsub_topic" "events" {
  name = "events"

  labels = local.labels

  depends_on = [google_project_service.pubsub]
}

resource "google_pubsub_subscription" "events_to_bq" {
  name  = "events-to-bq"
  topic = google_pubsub_topic.events.id

  ack_deadline_seconds       = 60
  message_retention_duration = "604800s" # 7 days

  labels = local.labels
}
```
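
To make the decoupling point concrete: a second consumer attaches to the same topic through its own subscription, and neither the publishers nor the events-to-bq subscription have to change. The following is a sketch under assumed names; events_audit and events-audit are hypothetical and are not used anywhere else in this lab.

```hcl
# Hypothetical second consumer on the same topic. Each subscription receives
# its own copy of every message published after the subscription exists.
resource "google_pubsub_subscription" "events_audit" {
  name  = "events-audit" # assumed name, not part of the original lab
  topic = google_pubsub_topic.events.id

  ack_deadline_seconds       = 60
  message_retention_duration = "86400s" # 1 day; adjust to the consumer's needs

  labels = local.labels
}
```

A subscription that nothing reads from accumulates a backlog up to its retention window, so remove it once the experiment is over.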

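4. Create the BigQuery target: a partitioned and clustered table

Services configured:

- BigQuery

The PDE exam tests this partition-plus-cluster choice relentlessly. The events table we create is partitioned by event_time (DAY granularity, the common canonical PDE choice) and clustered by event_type: any query that filters on event_time skips irrelevant partitions, and any query that filters on event_type skips irrelevant blocks within a partition.

require_partition_filter = true forces every query to filter on the partitioning column, for example with a WHERE event_time >= ... clause. It is the PDE-recommended guardrail against accidentally expensive full-table scans.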

```
resource "google_bigquery_dataset" "analytics" {
  dataset_id                 = "analytics"
  location                   = "US"
  delete_contents_on_destroy = true # lab-only: drops tables with the dataset

  labels = local.labels

  depends_on = [google_project_service.bigquery]
}

resource "google_bigquery_table" "events" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "events"
  deletion_protection = false # lab-only

  time_partitioning {
    type  = "DAY"
    field = "event_time"
  }

  # Set at the table level; the copy of this argument nested inside
  # time_partitioning is deprecated in recent provider releases.
  require_partition_filter = true

  clustering = ["event_type"]

  schema = jsonencode([
    { name = "event_time", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "event_type", type = "STRING", mode = "REQUIRED" },
    { name = "event_id", type = "STRING", mode = "REQUIRED" },
    { name = "payload", type = "JSON", mode = "NULLABLE" },
  ])

  labels = local.labels
}
```
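
To see how a query takes advantage of this layout, one option is to define a view in the same dataset whose SQL filters on both columns. This is only a sketch: the view name recent_purchases, the 7-day window, and the event_type value 'purchase' are assumptions made for illustration.

```hcl
# Illustrative view: the event_time predicate satisfies require_partition_filter
# and prunes partitions; the event_type predicate benefits from clustering.
resource "google_bigquery_table" "recent_purchases" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "recent_purchases" # hypothetical name
  deletion_protection = false              # lab-only

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT event_id, event_time, payload
      FROM `${google_bigquery_table.events.project}.${google_bigquery_table.events.dataset_id}.${google_bigquery_table.events.table_id}`
      WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        AND event_type = 'purchase' -- assumed value
    SQL
  }
}
```

Dropping the event_time predicate from a query against the events table should cause BigQuery to reject it, which is the guardrail doing its job.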

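5. Launch the Dataflow Flex Template streaming job: Pub/Sub → BigQuery

Services configured:

- Dataflow

Dataflow Flex Templates are the canonical PDE "job as a resource" shape: Google ships prebuilt templates for common patterns (Pub/Sub → BigQuery, Pub/Sub → GCS, JDBC → BigQuery, and so on) that you launch with parameters.

We launch the Google-provided PubSub_Subscription_to_BigQuery template against the subscription from step 3 and the table from step 4. The job starts as soon as terraform apply completes and shows up under Dataflow → Jobs. When you are done, cancel it with gcloud dataflow jobs cancel <job-id> --region us-central1 to stop the roughly $50/month worker billing.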

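```
# Looks up the project configured on the provider, for the table spec below.
data "google_project" "current" {}

resource "google_dataflow_flex_template_job" "pubsub_to_bq" {
  provider = google-beta
  name     = "certlabpro-pde-pubsub-to-bq"
  region   = "us-central1"

  # Path as given in this lab; confirm it against Google's published
  # template list for your region before applying.
  container_spec_gcs_path = "gs://dataflow-templates-us-central1/latest/flex/PubSub_Subscription_to_BigQuery"

  parameters = {
    # Full subscription path: projects/<project>/subscriptions/events-to-bq
    inputSubscription = google_pubsub_subscription.events_to_bq.id
    # Table spec in <project>:<dataset>.<table> form
    outputTableSpec = "${data.google_project.current.project_id}:${google_bigquery_dataset.analytics.dataset_id}.${google_bigquery_table.events.table_id}"
  }

  temp_location    = "gs://${google_storage_bucket.ingest.name}/temp"
  staging_location = "gs://${google_storage_bucket.ingest.name}/staging"

  # Cancel the job (rather than drain it) when the resource is destroyed.
  on_delete = "cancel"

  depends_on = [google_project_service.dataflow]
}
```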
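
The cancel command mentioned in this step needs the job ID. One way to surface it, sketched here with assumed output names, is a pair of Terraform outputs that read the job_id and state attributes exported by the job resource.

```hcl
# Assumed output names; they expose attributes of the job created in step 5.
output "dataflow_job_id" {
  description = "Pass this to: gcloud dataflow jobs cancel <job-id> --region us-central1"
  value       = google_dataflow_flex_template_job.pubsub_to_bq.job_id
}

output "dataflow_job_state" {
  description = "Job state as of the last terraform apply or refresh"
  value       = google_dataflow_flex_template_job.pubsub_to_bq.state
}
```

After an apply, terraform output -raw dataflow_job_id prints the ID in a form that can be passed straight to gcloud.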

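Cleanup

terraform destroy tears everything down. The Dataflow job is cancelled (on_delete = "cancel"), and worker billing stops within a few minutes. The Pub/Sub, BigQuery, and GCS resources are destroyed cleanly. The BigQuery table that holds the ingested events is deleted together with the dataset (delete_contents_on_destroy).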

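What this lab does not cover

PDE covers many GCP data services that do not fit into this lab: Dataproc (managed Hadoop / Spark), Cloud Composer (managed Airflow for batch orchestration), Cloud Data Fusion (visual ETL), Database Migration Service (DMS), Datastream (CDC from Oracle / MySQL / Postgres → BigQuery), Cloud Storage Transfer Service, BigQuery Omni / BigLake (multi-cloud / external tables), BigQuery ML (in-database machine learning), BigQuery BI Engine (the cached query layer for Looker), Looker / Looker Studio, Cloud Pub/Sub Lite (cheaper but feature-limited), Spanner / Bigtable / Firestore for application-tier data (covered in [[gcp-pcdoe]]), Vertex AI Pipelines / Feature Store / Workbench (covered in [[gcp-pmle]]), and Cloud DLP / Sensitive Data Protection for PII de-identification.

We stick to the GCS + Pub/Sub + Dataflow + BigQuery primitives because they make up the canonical PDE streaming pipeline that the exam scenarios build on. Composer orchestrates the same shape in batch. Dataproc is the alternative compute engine for Spark workloads that write to the same BigQuery. Datastream is the managed CDC variant of the same Pub/Sub → Dataflow → BigQuery shape. Master the canonical pipeline; the alternatives follow from it.

For concept-level coverage by service, see the Browse, Handbook, and Editorial sections of this certification's page.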
