Continuously replicate changes from an OLTP database (e.g., Oracle, PostgreSQL, MySQL) to BigQuery with low latency.
→Use Datastream to perform Change Data Capture (CDC). Configure it to stream changes directly to BigQuery, where inserts, updates, and deletes are merged into the destination tables automatically, with no custom merge jobs to maintain.
Why: Datastream is a managed, serverless CDC service that simplifies real-time database replication without requiring custom pipelines or significant source database load.
A Dataflow streaming pipeline must produce accurate event-time windowed results despite some events arriving hours late.
→Configure event-time windows with `allowedLateness` large enough to cover the expected delay. Use triggers with early firings for preliminary results and late firings for elements that arrive after the watermark, in accumulating mode so each late pane refines the earlier result.
Why: Dataflow's model of watermarks, triggers, and allowed lateness provides a robust framework for balancing completeness and latency when dealing with out-of-order data.
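A minimal Beam Java sketch of this configuration, assuming a keyed, timestamped `PCollection<KV<String, Long>> events`; the five-minute window, one-minute early-firing delay, and six-hour lateness bound are illustrative values:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<KV<String, Long>> counts =
    events
        .apply("WindowWithLateness",
            Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        // Early firings emit speculative results every minute.
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(1)))
                        // Late firings re-emit whenever a late element arrives.
                        .withLateFirings(AfterPane.elementCountAtLeast(1)))
                // Keep window state so elements up to 6 hours late are still counted.
                .withAllowedLateness(Duration.standardHours(6))
                // Each firing includes everything seen so far, so late panes refine earlier ones.
                .accumulatingFiredPanes())
        .apply("SumPerKey", Sum.longsPerKey());
```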
A Dataflow pipeline writing to BigQuery experiences duplicates after restarts or transient failures.
→Use the BigQuery Storage Write API sink in `BigQueryIO`, with the write method set to `STORAGE_WRITE_API` for exactly-once delivery (or `STORAGE_API_AT_LEAST_ONCE` when occasional duplicates are tolerable and lower cost/latency matters). Avoid the legacy `STREAMING_INSERTS` method, which offers only best-effort deduplication.
Why: The Storage Write API's exactly-once mode commits records through stream offsets, so appends retried after a restart or transient failure are not written twice, eliminating the need for custom deduplication logic.
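A minimal `BigQueryIO` sketch, assuming a `PCollection<TableRow> rows`; the table name, triggering frequency, and stream count are placeholder/tuning values:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:analytics.events")                        // placeholder table
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)    // exactly-once semantics
        .withTriggeringFrequency(Duration.standardSeconds(10))    // how often buffered rows are committed
        .withNumStorageWriteApiStreams(3)                         // parallel write streams
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

Switching the method to `STORAGE_API_AT_LEAST_ONCE` removes the offset tracking (and the need for a triggering frequency) at the cost of possible duplicates.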
Ingest data from a paginated, rate-limited REST API using Dataflow.
→Use a `SplittableDoFn` to process the paginated source in parallel. Implement rate-limiting logic (e.g., using a Guava RateLimiter) and exponential backoff for retries within the DoFn.
Why: A `SplittableDoFn` allows for dynamic work rebalancing. Combining it with rate-limiting and retry logic creates a resilient and efficient pattern for handling external APIs.
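A sketch of such a `SplittableDoFn` over page numbers, assuming the API exposes a total page count; `fetchTotalPages()` and `fetchPage()` are hypothetical helpers (the latter standing in for an HTTP GET with exponential-backoff retries), and the rate limit and split sizes are illustrative:

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

@DoFn.BoundedPerElement
class FetchPagesFn extends DoFn<String, String> {
  private transient RateLimiter rateLimiter;

  @Setup
  public void setup() {
    rateLimiter = RateLimiter.create(5.0); // at most 5 requests/sec per DoFn instance
  }

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String baseUrl) {
    // The restriction is the range of page numbers to fetch for this endpoint.
    return new OffsetRange(0, fetchTotalPages(baseUrl));
  }

  @SplitRestriction
  public void splitRestriction(@Restriction OffsetRange range, OutputReceiver<OffsetRange> out) {
    // Pre-split into chunks of ~50 pages so the runner can distribute them across workers.
    for (OffsetRange chunk : range.split(50, 10)) {
      out.output(chunk);
    }
  }

  @ProcessElement
  public void processElement(
      @Element String baseUrl,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    for (long page = tracker.currentRestriction().getFrom(); tracker.tryClaim(page); page++) {
      rateLimiter.acquire();                 // block until the rate limit allows another call
      out.output(fetchPage(baseUrl, page));  // hypothetical: GET with exponential-backoff retries
    }
  }

  // Hypothetical helpers; a real implementation would use an HTTP client plus a backoff utility.
  private long fetchTotalPages(String baseUrl) { return 0; }
  private String fetchPage(String baseUrl, long page) { return ""; }
}
```

It would be applied as `urls.apply(ParDo.of(new FetchPagesFn()))` to a `PCollection<String>` of base endpoint URLs.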
A single data stream needs to be written to multiple destinations (e.g., BigQuery, Bigtable, Cloud Storage).
→In a single Dataflow pipeline, after the shared processing steps, apply each destination's write transform to the same processed `PCollection` (fan-out).
Why: The fan-out pattern is highly efficient as the data is processed only once. It avoids the cost and complexity of running multiple separate pipelines reading from the same source.
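A minimal fan-out sketch, assuming a processed `PCollection<TableRow> processed`; table, bucket, and topic names are placeholders, and `TableRow::toString` stands in for a real serializer (a Bigtable sink would be added the same way via `BigtableIO.write()`):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Sink 1: BigQuery.
processed.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:analytics.events")
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

// Convert once, then reuse for the remaining sinks.
PCollection<String> asText =
    processed.apply("ToText",
        MapElements.into(TypeDescriptors.strings()).via(TableRow::toString)); // stand-in serializer

// Sink 2: Cloud Storage (windowed text files).
asText.apply("WriteToGcs",
    TextIO.write().to("gs://my-bucket/events/part").withWindowedWrites().withNumShards(10));

// Sink 3: Pub/Sub for downstream consumers.
asText.apply("WriteToPubSub", PubsubIO.writeStrings().to("projects/my-project/topics/events"));
```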
A high-volume stream must be enriched by joining with a slowly changing dimension table (e.g., user profiles) that updates periodically.
→Use the side input pattern in Dataflow. Load the dimension table as a `PCollectionView` and refresh it on a schedule via the slowly-updating side input pattern (a periodic tick plus a repeated trigger in the global window), so profile updates are picked up without restarting the pipeline.
Why: Side inputs broadcast the dimension data to all workers for fast in-memory lookups, avoiding per-element API/DB calls. Periodic refresh handles updates efficiently.
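A sketch of the slowly-updating side input pattern, assuming a streaming `PCollection<String> events`; `loadUserProfiles()` and `extractUserId()` are hypothetical helpers, and the ten-minute refresh interval is illustrative:

```java
import java.util.Map;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Rebuild the side input every 10 minutes from a periodic "tick".
PCollectionView<Map<String, String>> profilesView =
    p.apply("RefreshTick", GenerateSequence.from(0).withRate(1, Duration.standardMinutes(10)))
        .apply("LoadProfiles", ParDo.of(new DoFn<Long, Map<String, String>>() {
            @ProcessElement
            public void process(ProcessContext c) {
              c.output(loadUserProfiles()); // hypothetical: re-read the dimension table
            }
          }))
        .apply("AlwaysLatest", Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply("AsSingleton", View.asSingleton());

// Enrich the main stream with in-memory lookups against the side input.
PCollection<String> enriched =
    events.apply("Enrich", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void process(ProcessContext c) {
          Map<String, String> profiles = c.sideInput(profilesView);
          // extractUserId(...) is a hypothetical parser for the event payload.
          c.output(c.element() + "," + profiles.getOrDefault(extractUserId(c.element()), ""));
        }
      }).withSideInputs(profilesView));
```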
Dataproc cluster workloads vary significantly, leading to either over-provisioning or under-performance.
→Create a Dataproc cluster with an autoscaling policy attached. Define min/max counts for primary and secondary workers; the policy scales the cluster based on YARN memory metrics (pending vs. available memory).
Why: Autoscaling optimizes costs by matching cluster resources to job demand, scaling up for heavy loads and down during idle periods.
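A sketch using the Dataproc Java client library (`google-cloud-dataproc`); the project, region, worker bounds, and tuning factors are placeholders, and the created policy still has to be attached to a cluster (e.g., `gcloud dataproc clusters create ... --autoscaling-policy=spark-batch-policy`):

```java
import com.google.cloud.dataproc.v1.AutoscalingPolicy;
import com.google.cloud.dataproc.v1.AutoscalingPolicyServiceClient;
import com.google.cloud.dataproc.v1.AutoscalingPolicyServiceSettings;
import com.google.cloud.dataproc.v1.BasicAutoscalingAlgorithm;
import com.google.cloud.dataproc.v1.BasicYarnAutoscalingConfig;
import com.google.cloud.dataproc.v1.InstanceGroupAutoscalingPolicyConfig;
import com.google.protobuf.Duration;

public class CreateAutoscalingPolicy {
  public static void main(String[] args) throws Exception {
    AutoscalingPolicyServiceSettings settings =
        AutoscalingPolicyServiceSettings.newBuilder()
            .setEndpoint("us-central1-dataproc.googleapis.com:443") // regional endpoint
            .build();

    AutoscalingPolicy policy =
        AutoscalingPolicy.newBuilder()
            .setId("spark-batch-policy")
            .setWorkerConfig(InstanceGroupAutoscalingPolicyConfig.newBuilder()
                .setMinInstances(2).setMaxInstances(20))
            .setSecondaryWorkerConfig(InstanceGroupAutoscalingPolicyConfig.newBuilder()
                .setMinInstances(0).setMaxInstances(50))            // secondary (preemptible/spot) workers
            .setBasicAlgorithm(BasicAutoscalingAlgorithm.newBuilder()
                .setCooldownPeriod(Duration.newBuilder().setSeconds(120))
                .setYarnConfig(BasicYarnAutoscalingConfig.newBuilder()
                    .setScaleUpFactor(0.5)    // claim half of pending YARN memory per evaluation
                    .setScaleDownFactor(1.0)  // release all idle capacity when scaling down
                    .setGracefulDecommissionTimeout(Duration.newBuilder().setSeconds(3600))))
            .build();

    try (AutoscalingPolicyServiceClient client = AutoscalingPolicyServiceClient.create(settings)) {
      client.createAutoscalingPolicy("projects/my-project/regions/us-central1", policy);
    }
  }
}
```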
A Dataflow pipeline requires custom binaries, proprietary libraries, or specific versions not in standard worker images, and must run in a VPC with no internet.
→Build a custom container image with all dependencies pre-installed. Push the image to Artifact Registry. Deploy the pipeline using a Flex Template that references the custom container.
Why: Flex Templates with custom containers provide complete control over the runtime environment and dependencies, crucial for offline or specialized environments.
A Dataflow or Spark job performing a `GroupByKey` is slow because some keys have disproportionately many values (a "hot key").
→Implement a two-stage aggregation (key salting). First, append a random suffix to the key to split the hot key across multiple workers. Aggregate partially. Second, remove the suffix and aggregate the partial results.
Why: This fanout technique manually breaks up the work for the hot key, allowing it to be processed in parallel and overcoming the bottleneck.
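A minimal two-stage (salted) aggregation sketch in Beam Java, assuming a `PCollection<KV<String, Long>> input` summed per key; the fan-out of 10 salts is illustrative. For combiner-based aggregations, `Combine.perKey(fn).withHotKeyFanout(n)` applies the same idea automatically.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Stage 1: salt the key so a hot key is spread across up to 10 parallel groups.
PCollection<KV<String, Long>> partialSums =
    input
        .apply("SaltKey", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, Long> kv) ->
                KV.of(kv.getKey() + "#" + ThreadLocalRandom.current().nextInt(10), kv.getValue())))
        .apply("PartialSum", Sum.longsPerKey());

// Stage 2: strip the salt and combine the (much smaller) partial results.
PCollection<KV<String, Long>> totals =
    partialSums
        .apply("StripSalt", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, Long> kv) ->
                KV.of(kv.getKey().substring(0, kv.getKey().lastIndexOf('#')), kv.getValue())))
        .apply("TotalSum", Sum.longsPerKey());
```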
A streaming pipeline must not fail due to malformed records. Invalid records must be isolated for analysis without halting processing.
→In a `DoFn`, use a try-catch block for parsing. Use a multi-output DoFn with `TupleTag` to route valid records to the main output and invalid records (with error context) to a separate error output. Sink the error PCollection to a dead-letter destination like a Pub/Sub topic or BigQuery table.
Why: This pattern provides resiliency by isolating bad data, preventing pipeline failures, and ensuring failed records are captured for debugging and reprocessing.
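A minimal dead-letter sketch, assuming a `PCollection<String> rawMessages` of JSON payloads; `parseToTableRow()` is a hypothetical parser and the table/topic names are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<TableRow> validTag = new TupleTag<TableRow>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple parsed =
    rawMessages.apply("Parse", ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void process(@Element String json, MultiOutputReceiver out) {
          try {
            out.get(validTag).output(parseToTableRow(json)); // hypothetical parser
          } catch (Exception e) {
            // Keep the original payload plus error context for later analysis.
            out.get(deadLetterTag).output(json + " | error=" + e.getMessage());
          }
        }
      }).withOutputTags(validTag, TupleTagList.of(deadLetterTag)));

parsed.get(validTag).apply("WriteValid",
    BigQueryIO.writeTableRows().to("my-project:analytics.events"));
parsed.get(deadLetterTag).apply("WriteDeadLetter",
    PubsubIO.writeStrings().to("projects/my-project/topics/dead-letter"));
```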