Create a near real-time, read-only replica of an Azure SQL Database in Fabric without impacting the source.
→Use Fabric Mirroring for Azure SQL Database.
Why: Mirroring provides low-latency, continuous replication of data into OneLake as Delta tables, ideal for real-time analytics with no ETL development.
Share a dataset with another workspace or access external data without creating a copy.
→Create a Shortcut pointing to the source lakehouse table or external data location.
Why: Shortcuts act as symbolic links, providing a unified view of data in OneLake while avoiding data duplication, storage costs, and sync issues.
Combine high-velocity streaming data with historical batch data for unified analytics.
→Use Eventstream for real-time ingestion and a Lakehouse with Delta Lake tables for unified storage.
Why: Eventstream handles the streaming path, while Delta Lake's ACID properties allow it to serve as a target for both streaming appends and batch updates.
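A minimal PySpark sketch of that claim, assuming the Eventstream lands raw events in a Lakehouse table (all table, column, and checkpoint names here are illustrative): because every Delta write is an ACID transaction, the same table can receive streaming appends and batch corrections side by side.

```python
# Streaming path: continuously append events to the unified Delta table.
# (Table names and the checkpoint path are assumptions for illustration.)
events = spark.readStream.table("bronze_events")   # landing table fed by the Eventstream

(events.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "Files/checkpoints/unified_events")
       .toTable("unified_events"))

# Batch path: the same table also accepts batch updates alongside the stream.
spark.sql("""
    UPDATE unified_events
    SET status = 'corrected'
    WHERE event_date = '2024-01-01' AND status = 'error'
""")
```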
Enable both T-SQL-based analysis and Python-based data science on the same lakehouse data.
→Leverage the automatically generated SQL analytics endpoint for the Lakehouse.
Why: Fabric provides dual-engine access to the same Delta tables: a read-only SQL analytics endpoint for T-SQL queries and the Spark engine for notebooks, without data duplication.
Ingest data from an on-premises data source (e.g., Oracle, SQL Server) into Fabric.
→Install and configure an on-premises data gateway.
Why: The gateway acts as a secure bridge, relaying data between the on-premises network and the Fabric cloud service without exposing the source to the internet.
Automatically process new files as soon as they arrive in Azure Blob Storage.
→Use a Storage Event trigger for the data pipeline, configured to fire on blob creation events.
Why: Event-driven triggers react within moments of a file landing and are more efficient than scheduled polling, which either adds latency between runs or runs when there is nothing to process.
Implement Slowly Changing Dimension Type 2 logic or process Change Data Capture (CDC) streams.
→Use the Delta Lake MERGE operation with `WHEN MATCHED` and `WHEN NOT MATCHED` clauses.
Why: MERGE provides atomic update, insert, and delete logic in a single statement, which is the foundational operation for expiring current rows and adding new versions in SCD2 patterns.
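As a hedged sketch of that MERGE pattern (the table and column names `dim_customer`, `staging_customer_changes`, `hash_diff`, and `is_current` are invented for illustration), a PySpark notebook might expire changed current rows and insert brand-new keys like this:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp, lit

target  = DeltaTable.forName(spark, "dim_customer")
changes = spark.read.table("staging_customer_changes")   # assumed CDC/staging input

(target.alias("t")
  .merge(changes.alias("s"),
         "t.customer_id = s.customer_id AND t.is_current = true")
  # Expire the current row only when the attributes actually changed.
  .whenMatchedUpdate(
      condition="t.hash_diff <> s.hash_diff",
      set={"is_current": lit(False), "end_date": current_timestamp()})
  # Insert keys never seen before as the first current version.
  .whenNotMatchedInsert(values={
      "customer_id": "s.customer_id",
      "name":        "s.name",
      "hash_diff":   "s.hash_diff",
      "is_current":  lit(True),
      "start_date":  current_timestamp(),
      "end_date":    lit(None)})
  .execute())
```

A second append of the fresh versions for the changed keys (or the common trick of unioning the change set with itself before the MERGE) completes the Type 2 load.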
Transform a DataFrame column containing nested arrays of objects into separate rows.
→Apply the `explode()` function to the array column in a PySpark notebook.
Why: `explode()` is the standard Spark function for un-nesting arrays, creating a new row for each element in the array.
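For instance (schema and names invented for illustration), a column of order line items can be flattened into one row per item:

```python
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("o1", [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}])],
    "order_id string, items array<struct<sku:string, qty:int>>")

# explode() emits one row per array element; struct fields are then flattened with select.
lines = (orders
         .withColumn("item", F.explode("items"))
         .select("order_id", "item.sku", "item.qty"))

lines.show()
# +--------+---+---+
# |order_id|sku|qty|
# +--------+---+---+
# |      o1|  A|  2|
# |      o1|  B|  1|
# +--------+---+---+
```

`explode_outer()` is the variant to use if rows with empty or null arrays must still appear in the output.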
Handle late-arriving data in a stateful streaming aggregation (e.g., windowed counts).
→Configure a watermark on the event-time column in the Spark Structured Streaming query.
Why: Watermarking defines how long the engine waits for late data before finalizing a window, preventing state from growing indefinitely while still counting events that arrive within the threshold.
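A hedged sketch of a watermarked, windowed count (the source table, column names, and the 10-minute threshold are illustrative):

```python
from pyspark.sql import functions as F

events = spark.readStream.table("bronze_events")      # assumed streaming Delta source

windowed_counts = (events
    # Wait up to 10 minutes for late events relative to the max event_time seen so far;
    # anything later is dropped and the corresponding window state can be released.
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .count())

(windowed_counts.writeStream
    .format("delta")
    .outputMode("append")          # append mode on an aggregation requires a watermark
    .option("checkpointLocation", "Files/checkpoints/device_counts")
    .toTable("device_counts_5min"))
```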
Perform an incremental data load from a source system that has a timestamp column but no CDC.
→Implement a high-watermark pattern. Store the max timestamp from the last run and use it to filter the source in the next run.
Why: This is an efficient and common pattern to extract only new or updated records without the overhead of full table scans or the requirement of formal CDC.
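A hedged PySpark sketch of the high-watermark pattern (the control table, source, and column names are assumptions for illustration):

```python
from pyspark.sql import functions as F

# 1. Read the watermark recorded by the previous run from a small control table.
last_value = (spark.read.table("etl_watermarks")
              .filter("table_name = 'sales_orders'")
              .select("last_value")
              .collect()[0][0])

# 2. Pull only rows modified since that watermark.
increment = (spark.read.table("source_sales_orders")
             .filter(F.col("modified_at") > F.lit(last_value)))

# 3. Load the increment, then advance the watermark to the new maximum.
increment.write.mode("append").saveAsTable("silver_sales_orders")

new_max = increment.agg(F.max("modified_at")).collect()[0][0]
if new_max is not None:            # nothing new this run -> keep the old watermark
    spark.sql(f"""
        UPDATE etl_watermarks
        SET last_value = '{new_max}'
        WHERE table_name = 'sales_orders'
    """)
```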
A pipeline activity fails intermittently due to transient network issues or source system load.
→Configure the activity's retry policy with a specified count and exponential backoff interval.
Why: Builds resilience into the pipeline by automatically retrying failed operations, often resolving transient issues without manual intervention.
Ingest and query high-volume, low-latency telemetry or log data for real-time exploratory analysis.
→Ingest data into an Eventhouse and query it using Kusto Query Language (KQL).
Why: Eventhouse (built on Azure Data Explorer) and KQL are purpose-built for high-performance time-series and log analytics.
Create a single, reusable pipeline to load dozens of tables that share the same transformation logic.
→Use a metadata-driven approach. Store source/destination info in a control table and use a ForEach activity to iterate and pass parameters to a generic child pipeline.
Why: This pattern is highly scalable and maintainable, avoiding the duplication and management overhead of creating separate pipelines for each table.
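The real implementation is pipeline configuration (Lookup → ForEach → parameterized child pipeline) rather than code, but the metadata-driven idea can be sketched in notebook form (control table and column names are invented for illustration):

```python
# Each row of the control table describes one table to load; the loop body plays the
# role of the generic, parameterized child pipeline.
control_rows = spark.read.table("etl_control").collect()   # assumed: source_table, target_table

for row in control_rows:
    df = spark.read.table(row["source_table"])
    # Shared transformation logic, identical for every table, would go here.
    df.write.mode("overwrite").saveAsTable(row["target_table"])
```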
Optimize the performance of a Dataflow Gen2 that sources data from a relational database like SQL Server.
→Design transformations that can be folded. Verify query folding status in the Power Query editor.
Why: Query folding pushes transformation logic down to the source database engine, which is significantly more performant than pulling all the data into the Dataflow's Power Query (mashup) engine for transformation.
Query a table as it existed at a specific point in the past for an audit or to recover from an accidental update.
→Use Delta Lake's time travel feature with `VERSION AS OF` or `TIMESTAMP AS OF` in the query.
Why: Delta Lake natively versions every transaction, allowing for point-in-time queries without requiring manual snapshots or backups.
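A minimal sketch (table name, version number, and timestamp are illustrative):

```python
# Query a previous version by transaction number.
df_v12 = spark.sql("SELECT * FROM silver_sales_orders VERSION AS OF 12")

# Query the table as it existed at a point in time.
df_before = spark.sql(
    "SELECT * FROM silver_sales_orders TIMESTAMP AS OF '2024-06-01 00:00:00'")

# Equivalent DataFrame reader form.
df_v12_alt = (spark.read
              .option("versionAsOf", 12)
              .table("silver_sales_orders"))
```

To roll the table back rather than just query it, Delta's `RESTORE TABLE ... TO VERSION AS OF` recovers the earlier state in place.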