Optimize a large BigQuery table for query cost and performance.
→Partition the table by a frequently filtered time-unit column (e.g., transaction date). Cluster the table by other high-cardinality, frequently filtered columns (e.g., `customer_id`).
Why: Partitioning is the most effective way to reduce cost and latency: queries that filter on the partitioning column scan only the matching partitions. Clustering further improves performance by sorting data within each partition, so filters on the clustered columns read fewer blocks.
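The pattern above can be sketched in DDL (table and column names are illustrative):

```sql
-- Partition on the date of a frequently filtered timestamp column,
-- and cluster on a high-cardinality filter column.
CREATE TABLE `my-project.sales.transactions`
PARTITION BY DATE(transaction_ts)
CLUSTER BY customer_id AS
SELECT * FROM `my-project.sales.transactions_raw`;

-- This query prunes to a single partition, so only one day of data
-- is scanned; clustering narrows the scan further within it.
SELECT SUM(amount)
FROM `my-project.sales.transactions`
WHERE DATE(transaction_ts) = '2024-01-15'
  AND customer_id = 'C-1001';
```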
Prevent data from a sensitive BigQuery dataset from being copied to an unauthorized destination (e.g., a public GCS bucket), even by a user with valid credentials.
→Use VPC Service Controls to create a service perimeter around the project containing the BigQuery dataset.
Why: VPC Service Controls act as a "virtual firewall" for GCP services, preventing data from leaving the perimeter. This is a critical defense-in-depth control against data exfiltration.
Restrict access to sensitive columns (e.g., PII) in a BigQuery table to authorized groups, while allowing others to query the remaining columns.
→Use Data Catalog to create a taxonomy and policy tags. Apply policy tags to sensitive columns and grant the "Fine-Grained Reader" role to authorized groups.
Why: This is the native, scalable method for column-level security in BigQuery. It provides centralized governance without needing to create and manage separate views.
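A hedged DDL sketch of attaching a policy tag at table creation. The taxonomy resource name is a placeholder: the taxonomy and policy tag must already exist in Data Catalog, and the Fine-Grained Reader role is granted via IAM on the tag itself.

```sql
CREATE TABLE `my-project.hr.employees` (
  employee_id STRING,
  department  STRING,
  -- Only principals with Fine-Grained Reader on this policy tag
  -- can SELECT the ssn column; others can still query the rest.
  ssn STRING OPTIONS (
    policy_tags = ('projects/my-project/locations/us/taxonomies/123/policyTags/456')
  )
);
```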
Filter a table so that users can only see rows that pertain to them (e.g., sales managers see only their own region's data).
→Create a row access policy (`CREATE ROW ACCESS POLICY`) on the table that filters rows based on `SESSION_USER()`.
Why: Provides dynamic, predicate-based filtering at query time. This is more secure and manageable than creating an authorized view for each user or role.
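A minimal sketch, assuming the table carries a `sales_manager_email` column identifying the row's owner (names are illustrative):

```sql
-- Members of the group see only rows where the filter evaluates TRUE;
-- users matched by no policy see no rows at all.
CREATE ROW ACCESS POLICY region_access
ON `my-project.sales.orders`
GRANT TO ('group:sales-managers@example.com')
FILTER USING (sales_manager_email = SESSION_USER());
```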
Automatically delete data from a BigQuery table after a specified retention period to comply with regulations (e.g., delete data older than 7 years).
→For time-series data, set a partition expiration on the time-partitioned table so partitions are deleted as they age out. For short-lived whole tables, set a table expiration (or a dataset-level default); note that table expiration deletes the entire table, not individual rows.
Why: This is a built-in, "set-and-forget" feature that ensures compliance without manual cleanup scripts or external orchestration.
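Both settings can be applied with DDL (names and the roughly-7-years figure are illustrative):

```sql
-- Drop each partition about 7 years after its partition date.
ALTER TABLE `my-project.sales.transactions`
SET OPTIONS (partition_expiration_days = 2557);

-- Default expiration for tables subsequently created in the dataset.
ALTER SCHEMA `my-project.sales`
SET OPTIONS (default_table_expiration_days = 2557);
```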
A BigQuery table was accidentally modified or deleted.
→Use BigQuery Time Travel to query the table as it existed at a point in time before the incident, using `FOR SYSTEM_TIME AS OF`.
Why: BigQuery retains table history for up to seven days (seven by default; configurable per dataset down to two). This allows instant recovery within the time travel window without needing to restore from backups.
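For a modified (still-existing) table, the recovery pattern looks like this (table names are illustrative; a fully deleted table is instead recovered with a snapshot-decorator copy, e.g. via the `bq cp` command):

```sql
-- Inspect the table as it was one hour ago.
SELECT *
FROM `my-project.sales.transactions`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Materialize that snapshot to restore the pre-incident state.
CREATE TABLE `my-project.sales.transactions_restored` AS
SELECT *
FROM `my-project.sales.transactions`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```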
Discover, manage, secure, and monitor data assets (BigQuery, GCS) across an entire organization.
→Use Dataplex.
Why: Dataplex acts as an intelligent data fabric, providing a unified pane for data governance, quality, lineage, discovery, and lifecycle management across disparate data silos.
Understand and visualize how data flows from source systems, through transformation jobs, to final reporting tables.
→Use Dataplex Data Lineage.
Why: Automatically captures lineage information from BigQuery, Data Fusion, and Composer logs to provide an interactive, graph-based view of data dependencies for impact analysis and auditing.
Ensure predictable query performance and cost for critical workloads, avoiding "slot contention" from other users.
→Purchase BigQuery Editions (capacity-based pricing). Create reservations to dedicate a pool of slots to specific projects or folders.
Why: Switches from a shared, on-demand pool to a dedicated compute capacity, guaranteeing resources for critical jobs and providing predictable billing.
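Reservations can be managed with DDL from an administration project; the names, region, and capacity below are placeholders:

```sql
-- Carve out a dedicated pool of slots.
CREATE RESERVATION `admin-project.region-us.prod-reservation`
OPTIONS (edition = 'ENTERPRISE', slot_capacity = 100);

-- Route a critical project's query jobs to that pool.
CREATE ASSIGNMENT `admin-project.region-us.prod-reservation.prod-assignment`
OPTIONS (assignee = 'projects/critical-project', job_type = 'QUERY');
```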
Scan all data assets in BigQuery and Cloud Storage to automatically identify and classify PII and other sensitive data.
→Configure a discovery scan in Sensitive Data Protection (formerly Cloud Data Loss Prevention, DLP).
Why: Cloud DLP uses hundreds of predefined detectors to find sensitive data at scale. It can integrate with Data Catalog to automatically apply policy tags for governance.
A containerized application (on GKE or Cloud Run) needs to securely authenticate to BigQuery without managing service account keys.
→Use Workload Identity Federation for GKE; on Cloud Run, attach a dedicated service account to the service.
Why: The recommended best practice for keyless service-to-service authentication. On GKE it maps a Kubernetes service account to an IAM service account; in both cases the workload obtains short-lived, automatically rotated tokens instead of long-lived downloadable keys.
For compliance, generate a report of all users who have queried a sensitive BigQuery table in the last 90 days.
→Enable and query the BigQuery Data Access audit logs, which can be routed to a BigQuery dataset for analysis.
Why: Data Access logs provide an immutable record of who accessed what data and when. They are essential for security and compliance audits but must be explicitly enabled.
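A sketch of the analysis query, assuming the logs are routed to a dataset via a log sink. The table name follows the sink's default naming and the field paths follow the BigQuery audit-log export schema; the `LIKE` filter on the raw metadata is a crude but serviceable way to match the sensitive table.

```sql
SELECT DISTINCT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email
FROM `my-project.audit_logs.cloudaudit_googleapis_com_data_access`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
  -- Crude match on the sensitive table's name in the job metadata.
  AND protopayload_auditlog.metadataJson LIKE '%sensitive_table%';
```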
Identify which users or queries are responsible for high BigQuery costs.
→Query the `INFORMATION_SCHEMA.JOBS` view.
Why: This metadata view contains detailed information for every query run, including the user, bytes billed, and slots consumed, enabling precise cost attribution and analysis.
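For example, a per-user cost rollup over the last 30 days (the region qualifier must match where the jobs ran):

```sql
SELECT
  user_email,
  COUNT(*) AS query_count,
  SUM(total_bytes_billed) / POW(10, 12) AS tb_billed,
  SUM(total_slot_ms) / (1000 * 3600)    AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  -- Filtering on creation_time also prunes this partitioned view.
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY tb_billed DESC;
```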