Create a near real-time, read-only replica of an Azure SQL Database in Fabric without impacting the source.
→Use Fabric Mirroring for Azure SQL Database.
Why: Mirroring provides low-latency, continuous replication of data into OneLake as Delta tables, ideal for real-time analytics with no ETL development.
Share a dataset with another workspace or access external data without creating a copy.
→Create a Shortcut pointing to the source lakehouse table or external data location.
Why: Shortcuts act as symbolic links, providing a unified view of data in OneLake while avoiding data duplication, storage costs, and sync issues.
Combine high-velocity streaming data with historical batch data for unified analytics.
→Use Eventstream for real-time ingestion and a Lakehouse with Delta Lake tables for unified storage.
Why: Eventstream handles the streaming path, while Delta Lake's ACID properties allow it to serve as a target for both streaming appends and batch updates.
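A minimal PySpark sketch of that claim, assuming the Eventstream lands raw events in a Lakehouse table (all table, column, and checkpoint names here are illustrative): because every Delta write is an ACID transaction, the same table can receive streaming appends and batch corrections side by side.

```python
# Streaming path: continuously append events to the unified Delta table.
# (Table names and the checkpoint path are assumptions for illustration.)
events = spark.readStream.table("bronze_events")   # landing table fed by the Eventstream

(events.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "Files/checkpoints/unified_events")
       .toTable("unified_events"))

# Batch path: the same table also accepts batch updates alongside the stream.
spark.sql("""
    UPDATE unified_events
    SET status = 'corrected'
    WHERE event_date = '2024-01-01' AND status = 'error'
""")
```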
Enable both T-SQL-based analysis and Python-based data science on the same lakehouse data.
→Leverage the automatically generated SQL analytics endpoint for the Lakehouse.
Why: Fabric provides dual-engine access to the same Delta tables: a read-only SQL analytics endpoint for T-SQL queries and the Spark engine for notebooks, without data duplication.
Ingest data from an on-premises data source (e.g., Oracle, SQL Server) into Fabric.
→Install and configure an on-premises data gateway.
Why: The gateway acts as a secure bridge, relaying data between the on-premises network and the Fabric cloud service without exposing the source to the internet.
Automatically process new files as soon as they arrive in Azure Blob Storage.
→Use a Storage Event trigger for the data pipeline, configured to fire on blob creation events.
Why: Event-driven triggers react within moments of a file landing and are more efficient than scheduled polling, which either adds latency between runs or runs when there is nothing to process.
Implement Slowly Changing Dimension Type 2 logic or process Change Data Capture (CDC) streams.
→Use the Delta Lake MERGE operation with `WHEN MATCHED` and `WHEN NOT MATCHED` clauses.
Why: MERGE provides atomic update, insert, and delete logic in a single statement, which is the foundational operation for expiring current rows and adding new versions in SCD2 patterns.
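As a hedged sketch of that MERGE pattern (the table and column names `dim_customer`, `staging_customer_changes`, `hash_diff`, and `is_current` are invented for illustration), a PySpark notebook might expire changed current rows and insert brand-new keys like this:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp, lit

target  = DeltaTable.forName(spark, "dim_customer")
changes = spark.read.table("staging_customer_changes")   # assumed CDC/staging input

(target.alias("t")
  .merge(changes.alias("s"),
         "t.customer_id = s.customer_id AND t.is_current = true")
  # Expire the current row only when the attributes actually changed.
  .whenMatchedUpdate(
      condition="t.hash_diff <> s.hash_diff",
      set={"is_current": lit(False), "end_date": current_timestamp()})
  # Insert keys never seen before as the first current version.
  .whenNotMatchedInsert(values={
      "customer_id": "s.customer_id",
      "name":        "s.name",
      "hash_diff":   "s.hash_diff",
      "is_current":  lit(True),
      "start_date":  current_timestamp(),
      "end_date":    lit(None)})
  .execute())
```

A second append of the fresh versions for the changed keys (or the common trick of unioning the change set with itself before the MERGE) completes the Type 2 load.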
Transform a DataFrame column containing nested arrays of objects into separate rows.
→Apply the `explode()` function to the array column in a PySpark notebook.
Why: `explode()` is the standard Spark function for un-nesting arrays, creating a new row for each element in the array.
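For instance (schema and names invented for illustration), a column of order line items can be flattened into one row per item:

```python
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("o1", [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}])],
    "order_id string, items array<struct<sku:string, qty:int>>")

# explode() emits one row per array element; struct fields are then flattened with select.
lines = (orders
         .withColumn("item", F.explode("items"))
         .select("order_id", "item.sku", "item.qty"))

lines.show()
# +--------+---+---+
# |order_id|sku|qty|
# +--------+---+---+
# |      o1|  A|  2|
# |      o1|  B|  1|
# +--------+---+---+
```

`explode_outer()` is the variant to use if rows with empty or null arrays must still appear in the output.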
Handle late-arriving data in a stateful streaming aggregation (e.g., windowed counts).
→Configure a watermark on the event-time column in the Spark Structured Streaming query.
Why: Watermarking defines how long the engine waits for late data before finalizing a window, preventing state from growing indefinitely while still counting events that arrive within the threshold.
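A hedged sketch of a watermarked, windowed count (the source table, column names, and the 10-minute threshold are illustrative):

```python
from pyspark.sql import functions as F

events = spark.readStream.table("bronze_events")      # assumed streaming Delta source

windowed_counts = (events
    # Wait up to 10 minutes for late events relative to the max event_time seen so far;
    # anything later is dropped and the corresponding window state can be released.
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .count())

(windowed_counts.writeStream
    .format("delta")
    .outputMode("append")          # append mode on an aggregation requires a watermark
    .option("checkpointLocation", "Files/checkpoints/device_counts")
    .toTable("device_counts_5min"))
```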
Perform an incremental data load from a source system that has a timestamp column but no CDC.
→Implement a high-watermark pattern. Store the max timestamp from the last run and use it to filter the source in the next run.
Why: This is an efficient and common pattern to extract only new or updated records without the overhead of full table scans or the requirement of formal CDC.
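A hedged PySpark sketch of the high-watermark pattern (the control table, source, and column names are assumptions for illustration):

```python
from pyspark.sql import functions as F

# 1. Read the watermark recorded by the previous run from a small control table.
last_value = (spark.read.table("etl_watermarks")
              .filter("table_name = 'sales_orders'")
              .select("last_value")
              .collect()[0][0])

# 2. Pull only rows modified since that watermark.
increment = (spark.read.table("source_sales_orders")
             .filter(F.col("modified_at") > F.lit(last_value)))

# 3. Load the increment, then advance the watermark to the new maximum.
increment.write.mode("append").saveAsTable("silver_sales_orders")

new_max = increment.agg(F.max("modified_at")).collect()[0][0]
if new_max is not None:            # nothing new this run -> keep the old watermark
    spark.sql(f"""
        UPDATE etl_watermarks
        SET last_value = '{new_max}'
        WHERE table_name = 'sales_orders'
    """)
```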
A pipeline activity fails intermittently due to transient network issues or source system load.
→Configure the activity's retry policy with a specified count and exponential backoff interval.
Why: Builds resilience into the pipeline by automatically retrying failed operations, often resolving transient issues without manual intervention.
Ingest and query high-volume, low-latency telemetry or log data for real-time exploratory analysis.
→Ingest data into an Eventhouse and query it using Kusto Query Language (KQL).
Why: Eventhouse (built on Azure Data Explorer) and KQL are purpose-built for high-performance time-series and log analytics.
Create a single, reusable pipeline to load dozens of tables that share the same transformation logic.
→Use a metadata-driven approach. Store source/destination info in a control table and use a ForEach activity to iterate and pass parameters to a generic child pipeline.
Why: This pattern is highly scalable and maintainable, avoiding the duplication and management overhead of creating separate pipelines for each table.
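The real implementation is pipeline configuration (Lookup → ForEach → parameterized child pipeline) rather than code, but the metadata-driven idea can be sketched in notebook form (control table and column names are invented for illustration):

```python
# Each row of the control table describes one table to load; the loop body plays the
# role of the generic, parameterized child pipeline.
control_rows = spark.read.table("etl_control").collect()   # assumed: source_table, target_table

for row in control_rows:
    df = spark.read.table(row["source_table"])
    # Shared transformation logic, identical for every table, would go here.
    df.write.mode("overwrite").saveAsTable(row["target_table"])
```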
Optimize the performance of a Dataflow Gen2 that sources data from a relational database like SQL Server.
→Design transformations that can be folded. Verify query folding status in the Power Query editor.
Why: Query folding pushes transformation logic down to the source database engine, which is significantly more performant than pulling all the data into the Dataflow's Power Query (mashup) engine for transformation.
Query a table as it existed at a specific point in the past for an audit or to recover from an accidental update.
→Use Delta Lake's time travel feature with `VERSION AS OF` or `TIMESTAMP AS OF` in the query.
Why: Delta Lake natively versions every transaction, allowing for point-in-time queries without requiring manual snapshots or backups.
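A minimal sketch (table name, version number, and timestamp are illustrative):

```python
# Query a previous version by transaction number.
df_v12 = spark.sql("SELECT * FROM silver_sales_orders VERSION AS OF 12")

# Query the table as it existed at a point in time.
df_before = spark.sql(
    "SELECT * FROM silver_sales_orders TIMESTAMP AS OF '2024-06-01 00:00:00'")

# Equivalent DataFrame reader form.
df_v12_alt = (spark.read
              .option("versionAsOf", 12)
              .table("silver_sales_orders"))
```

To roll the table back rather than just query it, Delta's `RESTORE TABLE ... TO VERSION AS OF` recovers the earlier state in place.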