Implementing a medallion architecture (Bronze, Silver, Gold) and needing to access data across layers without physical data duplication.
→Use OneLake shortcuts to reference data in other lakehouses or layers.
Why: Shortcuts are symbolic links in OneLake. They provide a unified namespace and allow access to data without copying, which is ideal for a logical data mesh or medallion architecture.
Migrating an existing T-SQL-heavy analytics workload from Azure Synapse to Fabric.
→Use a Fabric Data Warehouse.
Why: The Fabric Warehouse offers full T-SQL compatibility, making it the ideal target for migrating existing SQL scripts, stored procedures, and analyst queries with minimal changes. By contrast, the Lakehouse SQL analytics endpoint is read-only for T-SQL; writes to a Lakehouse must go through Spark.
Ingesting and querying high-volume, high-velocity streaming data (e.g., IoT telemetry) with sub-second latency.
→Use Fabric Eventstream for ingestion and a KQL Database for storage and analysis.
Why: This is the purpose-built streaming analytics stack in Fabric. KQL (Kusto Query Language) is optimized for time-series analysis on streaming data, offering much lower latency than batch-oriented lakehouses or warehouses.
Implementing Slowly Changing Dimension (SCD) Type 2 to maintain a full history of dimension changes in a lakehouse.
→Use a `MERGE INTO` statement in a Spark notebook or pipeline. Match on the business key; `WHEN MATCHED` (and the tracked attributes have changed) expire the old record (set `IsCurrent` to false and `EndDate` to now); `WHEN NOT MATCHED` insert the new record. Because a matched key cannot both update and insert in a single clause, the new version of a changed record is typically added in a follow-up insert or by staging the changed rows with a null merge key.
Why: Delta Lake's `MERGE` operation provides atomic upsert capabilities, making it the standard and most efficient way to implement SCD logic in a Fabric lakehouse.
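A minimal PySpark sketch of the two-step pattern, assuming a `dim_customer` table keyed on `customer_id` with `IsCurrent`/`StartDate`/`EndDate` columns; all names are illustrative and surrogate-key generation is omitted:

```python
from pyspark.sql import functions as F

# Incoming snapshot of dimension records (illustrative staging path).
updates = spark.read.format("delta").load("Files/staging/customers")
updates.createOrReplaceTempView("updates")

# Step 1: expire the current row of any record whose tracked attributes changed.
spark.sql("""
    MERGE INTO dim_customer AS tgt
    USING updates AS src
      ON tgt.customer_id = src.customer_id AND tgt.IsCurrent = true
    WHEN MATCHED AND tgt.customer_name <> src.customer_name THEN
      UPDATE SET tgt.IsCurrent = false, tgt.EndDate = current_timestamp()
""")

# Step 2: insert a new current row for changed records and brand-new keys,
# both of which now have no open row in the dimension. Assumes the staged
# columns match the dimension's business attributes.
open_rows = spark.table("dim_customer").filter("IsCurrent = true")
new_rows = (
    updates.join(open_rows, "customer_id", "left_anti")
    .withColumn("IsCurrent", F.lit(True))
    .withColumn("StartDate", F.current_timestamp())
    .withColumn("EndDate", F.lit(None).cast("timestamp"))
)
new_rows.write.format("delta").mode("append").saveAsTable("dim_customer")
```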
Replicating data in near real-time from an operational database (e.g., Azure SQL DB) to a Fabric lakehouse for analytics.
→Use Fabric Mirroring.
Why: Mirroring is a low-latency, low-impact change data capture (CDC) solution built into Fabric. It automatically replicates data and schema changes to OneLake as Delta tables, eliminating the need for complex ETL pipelines.
Ingesting and transforming complex, nested JSON data from an API into a flattened, structured Delta table.
→Use a PySpark notebook. Use `from_json` with a defined schema to parse the JSON strings, and `explode` to flatten arrays into rows.
Why: PySpark provides the most powerful and flexible tools for handling complex and evolving JSON structures programmatically, far beyond the capabilities of a standard copy activity.
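A sketch of the approach, assuming newline-delimited JSON order documents with a nested `customer` struct and an `items` array; the schema, paths, and table names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Illustrative schema: an order document with a nested customer struct
# and an array of line items.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("id", StringType()),
        StructField("country", StringType()),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", StringType()),
    ]))),
])

# One JSON document per line, landed by the API extract (path is illustrative).
raw = spark.read.text("Files/raw/orders")

flat = (
    raw.select(F.from_json("value", schema).alias("o"))
    .select("o.*")
    .withColumn("item", F.explode("items"))  # one output row per line item
    .select(
        "order_id",
        F.col("customer.id").alias("customer_id"),
        F.col("customer.country").alias("country"),
        F.col("item.sku").alias("sku"),
        F.col("item.qty").alias("qty"),
    )
)
flat.write.format("delta").mode("overwrite").saveAsTable("orders_flat")
```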
Ingesting data into Fabric from an on-premises SQL Server database that is behind a corporate firewall.
→Install and configure an on-premises data gateway on a server within the local network. Add the gateway as a data source in Fabric.
Why: The gateway acts as a secure bridge, relaying queries and data between Fabric cloud services and on-premises data sources without requiring inbound firewall ports to be opened.
Query performance on a large, frequently updated Delta table has degraded due to an accumulation of many small data files.
→Run the `OPTIMIZE` command to compact small files into larger ones. Optionally use `ZORDER BY` on frequently filtered columns to co-locate related data.
Why: Fewer, larger files are significantly more efficient for Spark to read. Z-ordering improves data skipping, allowing queries to read even less data. This is a critical maintenance task for Delta tables.
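A short sketch run from a notebook, assuming a table named `orders` that is filtered most often on `customer_id` and `order_date` (both names illustrative):

```python
# Check how fragmented the table currently is: Delta's DESCRIBE DETAIL
# reports the file count and total size.
spark.sql("DESCRIBE DETAIL orders").select("numFiles", "sizeInBytes").show()

# Compact small files and co-locate rows by the columns queries filter on most.
spark.sql("OPTIMIZE orders ZORDER BY (customer_id, order_date)")
```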
Aggregating streaming time-series data into fixed, non-overlapping time intervals (e.g., average temperature per sensor every 5 minutes).
→Use a KQL query with the `summarize` operator and the `bin()` function. Example: `SensorData | summarize avg(temperature) by sensor_id, bin(timestamp, 5m)`.
Why: The `bin()` function is the standard, highly optimized way in KQL to group events into fixed time buckets (tumbling windows) for aggregation.
A Dataflow Gen2 refresh is slow. The data source is a relational database like Azure SQL.
→Review the transformation steps in the Power Query editor to ensure query folding is active. Reorder or modify steps to maximize folding.
Why: Query folding pushes transformation logic back to the source database to be executed as a single native query. This is vastly more efficient than pulling all raw data into the dataflow engine and transforming it in memory.
A Spark notebook is performing a slow join between a very large fact table (billions of rows) and a small dimension table (thousands of rows).
→Use a broadcast join, either by applying the `pyspark.sql.functions.broadcast` hint to the small table or by letting the optimizer broadcast it automatically when it falls under the broadcast size threshold.
Why: Broadcasting sends the entire small table to every executor node. This avoids a costly "shuffle" operation where the large table's data must be repartitioned and sent across the network, dramatically improving performance.
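A minimal sketch, assuming illustrative table names `fact_sales` and `dim_product` joined on `product_id`:

```python
from pyspark.sql.functions import broadcast

fact = spark.table("fact_sales")    # billions of rows
dim = spark.table("dim_product")    # thousands of rows

# Hint that the small dimension should be shipped to every executor,
# so the large fact table is joined in place without a shuffle.
enriched = fact.join(broadcast(dim), on="product_id", how="left")
enriched.write.format("delta").mode("overwrite").saveAsTable("fact_sales_enriched")
```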
A data pipeline orchestrates multiple activities. One activity might fail, but subsequent, independent activities should still run, and the overall failure should be logged.
→Configure activity dependencies: give activities that must run regardless of the prior outcome an "On completion" dependency on the previous activity.
Why: This allows for building robust, parallel execution paths. You can create separate branches for "Succeeded" and "Failed" conditions to implement custom logging or notification logic.
A pipeline to incrementally load data from a source with a `last_modified` timestamp.
→Implement a watermark pattern. Store the `max(last_modified)` from the last successful run. In the next run, query the source for records where `last_modified` is greater than the stored watermark.
Why: This is the most efficient pattern for incremental loads from sources that provide a modification timestamp, ensuring only new or updated data is processed, minimizing data transfer and compute.
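A hedged PySpark sketch of the pattern, assuming the watermark lives in a small Delta table `etl_watermarks` and the source is reachable over JDBC; all table and column names and the placeholder connection string are assumptions:

```python
from pyspark.sql import functions as F

# Read the watermark recorded by the last successful run
# (one row per source table in a small Delta control table).
wm = (
    spark.table("etl_watermarks")
    .filter("source_table = 'sales.orders'")
    .collect()[0]["last_value"]
)

# Pull only rows modified since that watermark (placeholder connection string).
incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>;databaseName=<db>")
    .option("query", f"SELECT * FROM sales.orders WHERE last_modified > '{wm}'")
    .load()
)
incremental.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Advance the watermark only if new rows were actually loaded.
new_wm = incremental.agg(F.max("last_modified")).first()[0]
if new_wm is not None:
    spark.sql(
        f"UPDATE etl_watermarks SET last_value = '{new_wm}' "
        "WHERE source_table = 'sales.orders'"
    )
```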
Analyze a real-time stream of IoT data to detect unusual spikes or dips in sensor readings.
→Use the `series_decompose_anomalies()` function in a KQL query within an Eventhouse/KQL Database.
Why: This built-in KQL function is specifically designed for time-series anomaly detection. It automatically decomposes the series into seasonal, trend, and residual components to identify statistically significant outliers, requiring minimal manual configuration.
Need to join data from a Warehouse, a Lakehouse, and a mirrored Azure SQL Database in a single T-SQL query without moving data.
→Use three-part naming conventions (`database.schema.table`) in a query run from the Warehouse or Lakehouse SQL endpoint. Use shortcuts to reference the mirrored database.
Why: Fabric provides a unified query engine that can access data across different Fabric items within the same workspace using a single SQL statement, enabling data virtualization.
A dataflow needs to process a file where some rows may be invalid. The entire flow should not fail; valid rows should be loaded, and invalid rows should be logged.
→In Power Query, add a step to validate rows and create an "IsValid" column. Then create two reference queries from that point: one that filters for `IsValid = true` to load to the destination, and another that filters for `IsValid = false` to load to an error log.
Why: This pattern provides robust error handling by splitting the data stream. It prevents a few bad rows from halting the entire process and provides a clear mechanism for auditing data quality issues.