Continuously replicate changes from an OLTP database (e.g., Oracle, PostgreSQL, MySQL) to BigQuery with low latency.
→Use Datastream to perform Change Data Capture (CDC). Configure it to stream changes directly to BigQuery, where inserts, updates, and deletes are merged into the destination tables automatically, with no custom merge jobs to maintain.
Why: Datastream is a managed, serverless CDC service that simplifies real-time database replication without requiring custom pipelines or significant source database load.
A Dataflow streaming pipeline must produce accurate event-time windowed results despite some events arriving hours late.
→Configure event-time windows with `allowedLateness` large enough to cover the expected delay. Use triggers with early firings for preliminary results and late firings for elements that arrive after the watermark, in accumulating mode so each late pane refines the earlier result.
Why: Dataflow's model of watermarks, triggers, and allowed lateness provides a robust framework for balancing completeness and latency when dealing with out-of-order data.
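A minimal Beam Java sketch of this configuration, assuming a keyed, timestamped `PCollection<KV<String, Long>> events`; the five-minute window, one-minute early-firing delay, and six-hour lateness bound are illustrative values:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<KV<String, Long>> counts =
    events
        .apply("WindowWithLateness",
            Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        // Early firings emit speculative results every minute.
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(1)))
                        // Late firings re-emit whenever a late element arrives.
                        .withLateFirings(AfterPane.elementCountAtLeast(1)))
                // Keep window state so elements up to 6 hours late are still counted.
                .withAllowedLateness(Duration.standardHours(6))
                // Each firing includes everything seen so far, so late panes refine earlier ones.
                .accumulatingFiredPanes())
        .apply("SumPerKey", Sum.longsPerKey());
```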
A Dataflow pipeline writing to BigQuery experiences duplicates after restarts or transient failures.
→Use the BigQuery Storage Write API sink in `BigQueryIO`, with the write method set to `STORAGE_WRITE_API` for exactly-once delivery (or `STORAGE_API_AT_LEAST_ONCE` when occasional duplicates are tolerable and lower cost/latency matters). Avoid the legacy `STREAMING_INSERTS` method, which offers only best-effort deduplication.
Why: The Storage Write API's exactly-once mode commits records through stream offsets, so appends retried after a restart or transient failure are not written twice, eliminating the need for custom deduplication logic.
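A minimal `BigQueryIO` sketch, assuming a `PCollection<TableRow> rows`; the table name, triggering frequency, and stream count are placeholder/tuning values:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:analytics.events")                        // placeholder table
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)    // exactly-once semantics
        .withTriggeringFrequency(Duration.standardSeconds(10))    // how often buffered rows are committed
        .withNumStorageWriteApiStreams(3)                         // parallel write streams
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

Switching the method to `STORAGE_API_AT_LEAST_ONCE` removes the offset tracking (and the need for a triggering frequency) at the cost of possible duplicates.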
Ingest data from a paginated, rate-limited REST API using Dataflow.
→Use a `SplittableDoFn` to process the paginated source in parallel. Implement rate-limiting logic (e.g., using a Guava RateLimiter) and exponential backoff for retries within the DoFn.
Why: A `SplittableDoFn` allows for dynamic work rebalancing. Combining it with rate-limiting and retry logic creates a resilient and efficient pattern for handling external APIs.
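A sketch of such a `SplittableDoFn` over page numbers, assuming the API exposes a total page count; `fetchTotalPages()` and `fetchPage()` are hypothetical helpers (the latter standing in for an HTTP GET with exponential-backoff retries), and the rate limit and split sizes are illustrative:

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

@DoFn.BoundedPerElement
class FetchPagesFn extends DoFn<String, String> {
  private transient RateLimiter rateLimiter;

  @Setup
  public void setup() {
    rateLimiter = RateLimiter.create(5.0); // at most 5 requests/sec per DoFn instance
  }

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String baseUrl) {
    // The restriction is the range of page numbers to fetch for this endpoint.
    return new OffsetRange(0, fetchTotalPages(baseUrl));
  }

  @SplitRestriction
  public void splitRestriction(@Restriction OffsetRange range, OutputReceiver<OffsetRange> out) {
    // Pre-split into chunks of ~50 pages so the runner can distribute them across workers.
    for (OffsetRange chunk : range.split(50, 10)) {
      out.output(chunk);
    }
  }

  @ProcessElement
  public void processElement(
      @Element String baseUrl,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    for (long page = tracker.currentRestriction().getFrom(); tracker.tryClaim(page); page++) {
      rateLimiter.acquire();                 // block until the rate limit allows another call
      out.output(fetchPage(baseUrl, page));  // hypothetical: GET with exponential-backoff retries
    }
  }

  // Hypothetical helpers; a real implementation would use an HTTP client plus a backoff utility.
  private long fetchTotalPages(String baseUrl) { return 0; }
  private String fetchPage(String baseUrl, long page) { return ""; }
}
```

It would be applied as `urls.apply(ParDo.of(new FetchPagesFn()))` to a `PCollection<String>` of base endpoint URLs.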
A single data stream needs to be written to multiple destinations (e.g., BigQuery, Bigtable, Cloud Storage).
→In a single Dataflow pipeline, after the shared processing steps, apply each destination's write transform to the same processed `PCollection` (fan-out).
Why: The fan-out pattern is highly efficient as the data is processed only once. It avoids the cost and complexity of running multiple separate pipelines reading from the same source.
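A minimal fan-out sketch, assuming a processed `PCollection<TableRow> processed`; table, bucket, and topic names are placeholders, and `TableRow::toString` stands in for a real serializer (a Bigtable sink would be added the same way via `BigtableIO.write()`):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Sink 1: BigQuery.
processed.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:analytics.events")
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

// Convert once, then reuse for the remaining sinks.
PCollection<String> asText =
    processed.apply("ToText",
        MapElements.into(TypeDescriptors.strings()).via(TableRow::toString)); // stand-in serializer

// Sink 2: Cloud Storage (windowed text files).
asText.apply("WriteToGcs",
    TextIO.write().to("gs://my-bucket/events/part").withWindowedWrites().withNumShards(10));

// Sink 3: Pub/Sub for downstream consumers.
asText.apply("WriteToPubSub", PubsubIO.writeStrings().to("projects/my-project/topics/events"));
```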
A high-volume stream must be enriched by joining with a slowly changing dimension table (e.g., user profiles) that updates periodically.
→Use the side input pattern in Dataflow. Load the dimension table as a `PCollectionView` and refresh it on a schedule via the slowly-updating side input pattern (a periodic tick plus a repeated trigger in the global window), so profile updates are picked up without restarting the pipeline.
Why: Side inputs broadcast the dimension data to all workers for fast in-memory lookups, avoiding per-element API/DB calls. Periodic refresh handles updates efficiently.
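A sketch of the slowly-updating side input pattern, assuming a streaming `PCollection<String> events`; `loadUserProfiles()` and `extractUserId()` are hypothetical helpers, and the ten-minute refresh interval is illustrative:

```java
import java.util.Map;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Rebuild the side input every 10 minutes from a periodic "tick".
PCollectionView<Map<String, String>> profilesView =
    p.apply("RefreshTick", GenerateSequence.from(0).withRate(1, Duration.standardMinutes(10)))
        .apply("LoadProfiles", ParDo.of(new DoFn<Long, Map<String, String>>() {
            @ProcessElement
            public void process(ProcessContext c) {
              c.output(loadUserProfiles()); // hypothetical: re-read the dimension table
            }
          }))
        .apply("AlwaysLatest", Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply("AsSingleton", View.asSingleton());

// Enrich the main stream with in-memory lookups against the side input.
PCollection<String> enriched =
    events.apply("Enrich", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void process(ProcessContext c) {
          Map<String, String> profiles = c.sideInput(profilesView);
          // extractUserId(...) is a hypothetical parser for the event payload.
          c.output(c.element() + "," + profiles.getOrDefault(extractUserId(c.element()), ""));
        }
      }).withSideInputs(profilesView));
```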
Dataproc cluster workloads vary significantly, leading to either over-provisioning or under-performance.
→Create a Dataproc cluster with an autoscaling policy attached. Define min/max counts for primary and secondary workers; the policy scales the cluster based on YARN memory metrics (pending vs. available memory).
Why: Autoscaling optimizes costs by matching cluster resources to job demand, scaling up for heavy loads and down during idle periods.
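A sketch using the Dataproc Java client library (`google-cloud-dataproc`); the project, region, worker bounds, and tuning factors are placeholders, and the created policy still has to be attached to a cluster (e.g., `gcloud dataproc clusters create ... --autoscaling-policy=spark-batch-policy`):

```java
import com.google.cloud.dataproc.v1.AutoscalingPolicy;
import com.google.cloud.dataproc.v1.AutoscalingPolicyServiceClient;
import com.google.cloud.dataproc.v1.AutoscalingPolicyServiceSettings;
import com.google.cloud.dataproc.v1.BasicAutoscalingAlgorithm;
import com.google.cloud.dataproc.v1.BasicYarnAutoscalingConfig;
import com.google.cloud.dataproc.v1.InstanceGroupAutoscalingPolicyConfig;
import com.google.protobuf.Duration;

public class CreateAutoscalingPolicy {
  public static void main(String[] args) throws Exception {
    AutoscalingPolicyServiceSettings settings =
        AutoscalingPolicyServiceSettings.newBuilder()
            .setEndpoint("us-central1-dataproc.googleapis.com:443") // regional endpoint
            .build();

    AutoscalingPolicy policy =
        AutoscalingPolicy.newBuilder()
            .setId("spark-batch-policy")
            .setWorkerConfig(InstanceGroupAutoscalingPolicyConfig.newBuilder()
                .setMinInstances(2).setMaxInstances(20))
            .setSecondaryWorkerConfig(InstanceGroupAutoscalingPolicyConfig.newBuilder()
                .setMinInstances(0).setMaxInstances(50))            // secondary (preemptible/spot) workers
            .setBasicAlgorithm(BasicAutoscalingAlgorithm.newBuilder()
                .setCooldownPeriod(Duration.newBuilder().setSeconds(120))
                .setYarnConfig(BasicYarnAutoscalingConfig.newBuilder()
                    .setScaleUpFactor(0.5)    // claim half of pending YARN memory per evaluation
                    .setScaleDownFactor(1.0)  // release all idle capacity when scaling down
                    .setGracefulDecommissionTimeout(Duration.newBuilder().setSeconds(3600))))
            .build();

    try (AutoscalingPolicyServiceClient client = AutoscalingPolicyServiceClient.create(settings)) {
      client.createAutoscalingPolicy("projects/my-project/regions/us-central1", policy);
    }
  }
}
```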
A Dataflow pipeline requires custom binaries, proprietary libraries, or specific versions not in standard worker images, and must run in a VPC with no internet.
→Build a custom container image with all dependencies pre-installed. Push the image to Artifact Registry. Deploy the pipeline using a Flex Template that references the custom container.
Why: Flex Templates with custom containers provide complete control over the runtime environment and dependencies, crucial for offline or specialized environments.
A Dataflow or Spark job performing a `GroupByKey` is slow because some keys have disproportionately many values (a "hot key").
→Implement a two-stage aggregation (key salting). First, append a random suffix to the key to split the hot key across multiple workers. Aggregate partially. Second, remove the suffix and aggregate the partial results.
Why: This fanout technique manually breaks up the work for the hot key, allowing it to be processed in parallel and overcoming the bottleneck.
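A minimal two-stage (salted) aggregation sketch in Beam Java, assuming a `PCollection<KV<String, Long>> input` summed per key; the fan-out of 10 salts is illustrative. For combiner-based aggregations, `Combine.perKey(fn).withHotKeyFanout(n)` applies the same idea automatically.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Stage 1: salt the key so a hot key is spread across up to 10 parallel groups.
PCollection<KV<String, Long>> partialSums =
    input
        .apply("SaltKey", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, Long> kv) ->
                KV.of(kv.getKey() + "#" + ThreadLocalRandom.current().nextInt(10), kv.getValue())))
        .apply("PartialSum", Sum.longsPerKey());

// Stage 2: strip the salt and combine the (much smaller) partial results.
PCollection<KV<String, Long>> totals =
    partialSums
        .apply("StripSalt", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, Long> kv) ->
                KV.of(kv.getKey().substring(0, kv.getKey().lastIndexOf('#')), kv.getValue())))
        .apply("TotalSum", Sum.longsPerKey());
```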
A streaming pipeline must not fail due to malformed records. Invalid records must be isolated for analysis without halting processing.
→In a `DoFn`, use a try-catch block for parsing. Use a multi-output DoFn with `TupleTag` to route valid records to the main output and invalid records (with error context) to a separate error output. Sink the error PCollection to a dead-letter destination like a Pub/Sub topic or BigQuery table.
Why: This pattern provides resiliency by isolating bad data, preventing pipeline failures, and ensuring failed records are captured for debugging and reprocessing.
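A minimal dead-letter sketch, assuming a `PCollection<String> rawMessages` of JSON payloads; `parseToTableRow()` is a hypothetical parser and the table/topic names are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<TableRow> validTag = new TupleTag<TableRow>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple parsed =
    rawMessages.apply("Parse", ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void process(@Element String json, MultiOutputReceiver out) {
          try {
            out.get(validTag).output(parseToTableRow(json)); // hypothetical parser
          } catch (Exception e) {
            // Keep the original payload plus error context for later analysis.
            out.get(deadLetterTag).output(json + " | error=" + e.getMessage());
          }
        }
      }).withOutputTags(validTag, TupleTagList.of(deadLetterTag)));

parsed.get(validTag).apply("WriteValid",
    BigQueryIO.writeTableRows().to("my-project:analytics.events"));
parsed.get(deadLetterTag).apply("WriteDeadLetter",
    PubsubIO.writeStrings().to("projects/my-project/topics/dead-letter"));
```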