Deploy a model for real-time, low-latency (<100ms) predictions with high availability.
→Deploy the model to an Azure ML Managed Online Endpoint.
Why: Managed online endpoints are a fully managed service optimized for real-time inference, providing auto-scaling, load balancing, blue-green deployments, and built-in monitoring.
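A minimal sketch with the Python SDK v2 (azure-ai-ml); the endpoint name, model reference, and instance size are placeholder values, and a non-MLflow model would also need an `environment` and a `code_configuration`:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

# The endpoint provides the stable scoring URI and auth; deployments sit behind it.
endpoint = ManagedOnlineEndpoint(name="credit-scoring", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="credit-scoring",
    model="azureml:credit-model:1",   # registered model; MLflow models need no scoring script
    instance_type="Standard_DS3_v2",
    instance_count=2,                 # >1 instance for high availability
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```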
Score a large volume of data (millions of records) asynchronously, with cost efficiency being a priority.
→Deploy the model to an Azure ML Batch Endpoint.
Why: Batch endpoints are designed for high-throughput, asynchronous scoring of large datasets. They can use scalable compute clusters that spin down to zero when idle, optimizing costs.
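A sketch of the same pattern for batch scoring (SDK v2); the cluster and model names are placeholders, and the compute is assumed to be an AmlCompute cluster created with min_instances=0 so it scales to zero between jobs:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import BatchEndpoint, BatchDeployment, BatchRetrySettings

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

endpoint = BatchEndpoint(name="nightly-scoring")
ml_client.batch_endpoints.begin_create_or_update(endpoint).result()

deployment = BatchDeployment(
    name="default",
    endpoint_name="nightly-scoring",
    model="azureml:credit-model:1",
    compute="cpu-cluster",              # AmlCompute cluster; min_instances=0 keeps idle cost at zero
    instance_count=4,                   # nodes used per scoring job
    max_concurrency_per_instance=2,     # parallel scoring processes per node
    mini_batch_size=64,                 # files handed to each scoring call
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
# Scoring jobs are then started with ml_client.batch_endpoints.invoke(...),
# pointing at the dataset or folder to score.
```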
Deploy a new model version while minimizing risk. Need to gradually shift traffic to the new version and allow for easy rollback.
→Use a single managed online endpoint with two deployments (e.g., "blue" for the old model, "green" for the new). Use traffic splitting to control the percentage of requests going to each deployment.
Why: This blue-green deployment pattern allows for safe, zero-downtime rollouts. You can validate the new model on a small portion of live traffic before committing to a full switch.
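Continuing the sketch above, the traffic split is a property of the endpoint; the 90/10 split and deployment names are illustrative:

```python
# Assumes "blue" (current) and "green" (new) deployments both exist on the endpoint.
endpoint = ml_client.online_endpoints.get("credit-scoring")

# Route a small slice of live traffic to the new model for validation...
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# ...then promote it once its metrics look healthy.
endpoint.traffic = {"blue": 0, "green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```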
Package a model with its dependencies and artifacts in a standardized, framework-agnostic way for deployment.
→Use the MLflow model format. Log the model with MLflow (or register it as an mlflow_model asset) so that the conda.yaml/requirements.txt environment files and any required code artifacts are bundled alongside the model.
Why: MLflow provides a standard model packaging convention that Azure ML understands natively. This simplifies deployment as Azure ML can automatically build the required environment.
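A minimal logging sketch (the scikit-learn flavor and model name are only examples); when the MLflow tracking URI points at the workspace, registered_model_name registers the model directly in the Azure ML registry:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Logging in MLflow format bundles the model weights, the MLmodel metadata file,
# and the generated conda.yaml/requirements.txt into one self-describing folder.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris-classifier",
    )
```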
A deployed model has high latency because it loads large auxiliary files (e.g., a large featurizer) on every prediction request.
→Move the file loading logic from the `run()` function to the `init()` function in the scoring script.
Why: The `init()` function runs only once when the container starts. Loading assets here makes them globally available to all `run()` calls, avoiding redundant loading on every request.
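A sketch of the corrected scoring script; the file names and the layout under AZUREML_MODEL_DIR are assumptions about the registered model's contents:

```python
import json
import os

import joblib

model = None
featurizer = None

def init():
    # Runs once per container start: load heavy assets into module-level globals.
    global model, featurizer
    model_dir = os.environ["AZUREML_MODEL_DIR"]  # set by Azure ML in the container
    model = joblib.load(os.path.join(model_dir, "model.pkl"))
    featurizer = joblib.load(os.path.join(model_dir, "featurizer.pkl"))  # the large auxiliary file

def run(raw_data):
    # Runs on every request: no file I/O here, only cheap per-request work.
    data = json.loads(raw_data)["data"]
    return model.predict(featurizer.transform(data)).tolist()
```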
A real-time endpoint experiences variable traffic (high peaks, low troughs). Need to maintain performance cost-effectively.
→Configure auto-scaling on the managed online endpoint deployment. Set a minimum and maximum number of instances and define a scaling rule based on CPU utilization or request latency.
Why: Auto-scaling automatically adjusts the number of compute instances to match the traffic load, ensuring performance during peaks and saving costs during lulls.
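Autoscale rules for managed online deployments are Azure Monitor settings; a sketch using azure-mgmt-monitor, where the resource IDs, thresholds, and instance bounds are all placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    AutoscaleProfile, AutoscaleSettingResource, MetricTrigger,
    ScaleAction, ScaleCapacity, ScaleRule,
)

monitor = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The scale target is the deployment's ARM resource ID.
deployment_id = (
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/"
    "Microsoft.MachineLearningServices/workspaces/<ws>/"
    "onlineEndpoints/credit-scoring/deployments/blue"
)

scale_out = ScaleRule(
    metric_trigger=MetricTrigger(
        metric_name="CpuUtilizationPercentage",
        metric_resource_uri=deployment_id,
        time_grain="PT1M", statistic="Average",
        time_window="PT5M", time_aggregation="Average",
        operator="GreaterThan", threshold=70,
    ),
    scale_action=ScaleAction(direction="Increase", type="ChangeCount",
                             value="1", cooldown="PT5M"),
)

monitor.autoscale_settings.create_or_update(
    "<resource-group>", "autoscale-credit-scoring",
    AutoscaleSettingResource(
        location="<region>",
        target_resource_uri=deployment_id,
        enabled=True,
        profiles=[AutoscaleProfile(
            name="default",
            capacity=ScaleCapacity(minimum="2", maximum="5", default="2"),
            rules=[scale_out],
        )],
    ),
)
```

A matching scale-in rule (direction="Decrease", triggered on low utilization) would normally be added to the same profile so instances are released after the peak.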
A model deployment requires specific system libraries, custom CUDA versions, or a custom inference server not present in the default Azure ML images.
→Create a custom Dockerfile that extends an Azure ML base inference image, add the required dependencies, build it, and push it to Azure Container Registry. Reference this image in the deployment environment.
Why: Extending a base image provides full control over the runtime environment while maintaining compatibility with Azure ML's serving infrastructure.
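A sketch of registering such an environment with the SDK v2, reusing the `ml_client` from the earlier examples. As an alternative to a manual build-and-push, Azure ML can build the image from a Docker build context, shown here; the base image tag and packages in the comment are illustrative:

```python
from azure.ai.ml.entities import BuildContext, Environment

# docker-context/Dockerfile extends an Azure ML inference base image, e.g.:
#   FROM mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest
#   RUN apt-get update && apt-get install -y <system-libs>
#   RUN pip install <custom-packages>
env = Environment(
    name="custom-inference",
    build=BuildContext(path="docker-context"),  # folder containing the Dockerfile
)
ml_client.environments.create_or_update(env)
```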
Automate the end-to-end ML lifecycle, including retraining, evaluation, and deployment, triggered by code or data changes.
→Use Azure DevOps or GitHub Actions integrated with the Azure ML CLI v2 to create a CI/CD pipeline. The pipeline should include a quality gate that compares the new model to a baseline before deploying.
Why: This MLOps pattern automates the ML workflow, ensuring consistency, quality, and rapid iteration. The quality gate prevents model performance regressions.
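One way to implement the quality gate is a short script the pipeline runs before the deploy step; the model names and metric are hypothetical, and a non-zero exit code fails the CI job:

```python
"""Quality gate: block deployment if the candidate model underperforms the baseline."""
import sys

from mlflow.tracking import MlflowClient

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the workspace

def latest_metric(model_name: str, metric: str) -> float:
    """Fetch a metric from the training run behind the newest model version."""
    version = client.get_latest_versions(model_name)[0]
    return client.get_run(version.run_id).data.metrics[metric]

candidate = latest_metric("credit-model-candidate", "auc")
baseline = latest_metric("credit-model", "auc")

if candidate < baseline:
    print(f"Gate failed: candidate AUC {candidate:.3f} < baseline {baseline:.3f}")
    sys.exit(1)
print(f"Gate passed: candidate AUC {candidate:.3f} >= baseline {baseline:.3f}")
```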
A production model's performance is degrading due to changes in the input data distribution. The model needs to be retrained automatically when significant drift is detected.
→Configure an Azure ML data drift monitor on the endpoint. Set up an alert that triggers an Azure Logic App or Azure Function, which in turn starts the retraining pipeline.
Why: This creates a closed-loop MLOps system that automatically maintains model relevance in response to changing data patterns, without manual intervention.
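The trigger itself can be small; a sketch of the handler body an alert or Logic App would call, assuming the retraining pipeline is defined at the hypothetical path pipelines/retrain.yml:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, load_job

def start_retraining() -> str:
    """Submit the retraining pipeline; called from an Azure Function on a drift alert."""
    ml_client = MLClient(DefaultAzureCredential(),
                         "<subscription-id>", "<resource-group>", "<workspace>")
    pipeline_job = load_job("pipelines/retrain.yml")  # pipeline definition in the repo
    submitted = ml_client.jobs.create_or_update(pipeline_job)
    return submitted.name  # job name, useful for correlating alert and run
```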
A newly deployed model version is found to be faulty in production. Need to quickly revert to the previous stable version.
→If using a blue-green deployment, shift 100% of traffic back to the stable deployment. Alternatively, update the endpoint to redeploy the previous model version from the model registry.
Why: Traffic shifting provides an instantaneous rollback. Redeploying a version from the registry is also a fast and reliable way to restore a known-good state.
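With the blue-green setup from the earlier sketch, rollback is a one-line traffic change:

```python
# Instant rollback: send all traffic back to the stable "blue" deployment.
endpoint = ml_client.online_endpoints.get("credit-scoring")
endpoint.traffic = {"blue": 100, "green": 0}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```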
Need to monitor both the operational health (latency, errors) and the predictive quality (data drift, accuracy) of a deployed model.
→Enable Application Insights integration on the endpoint for operational metrics. Configure Azure ML data collection and data drift monitoring for model quality metrics.
Why: This two-pronged approach provides a complete view of model health. App Insights tracks system performance, while data collection/drift monitoring tracks the model's predictive performance.
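Both pieces can be switched on at deployment time; a sketch reusing the names from the earlier examples:

```python
from azure.ai.ml.entities import (
    DataCollector, DeploymentCollection, ManagedOnlineDeployment,
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="credit-scoring",
    model="azureml:credit-model:1",
    instance_type="Standard_DS3_v2",
    instance_count=2,
    app_insights_enabled=True,        # operational metrics: latency, errors, traffic
    data_collector=DataCollector(     # logs request/response payloads for drift analysis
        collections={
            "model_inputs": DeploymentCollection(enabled="true"),
            "model_outputs": DeploymentCollection(enabled="true"),
        }
    ),
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```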
The model endpoint is failing due to malformed or unexpected input data from clients.
→Implement input validation logic within the `run()` function of the scoring script. Check data types, ranges, and structures, and return a meaningful error (e.g., HTTP 400) for invalid requests.
Why: Server-side validation protects the model from crashing and provides clear, immediate feedback to API consumers, making the service more robust.
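A validation sketch extending the scoring-script pattern shown earlier; the expected schema (a "data" list of 4-value numeric rows) is made up for illustration, and since how an explicit HTTP 400 is surfaced depends on the inference server in use, the sketch returns a structured error payload instead:

```python
import json

def run(raw_data):
    # Validate before touching the model so bad input cannot crash scoring.
    try:
        payload = json.loads(raw_data)
    except (TypeError, json.JSONDecodeError):
        return {"error": "request body must be valid JSON"}

    data = payload.get("data")
    if not isinstance(data, list) or not data:
        return {"error": "'data' must be a non-empty list of records"}
    for i, row in enumerate(data):
        if not isinstance(row, list) or len(row) != 4 \
                or not all(isinstance(v, (int, float)) for v in row):
            return {"error": f"record {i} must be a list of exactly 4 numbers"}

    return model.predict(data).tolist()  # model loaded once in init(), as above
```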