A one-to-few relationship where the related data is bounded, small, and frequently read together.
→Embed the related data as a nested object or array within the main document.
Why: Optimizes read performance by retrieving all necessary data in a single point read, minimizing RU cost and latency. Avoids client-side joins.
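A minimal sketch of the embedded shape, assuming a hypothetical customer document (`cust-001` and the `customerId` partition key are illustrative, not from the source):

```python
# Hypothetical customer document with a small, bounded set of addresses
# embedded directly, so one point read returns everything.
customer = {
    "id": "cust-001",
    "customerId": "cust-001",   # assumed partition key
    "name": "Ada Lovelace",
    "addresses": [              # one-to-few: bounded, always read with the customer
        {"type": "home", "city": "London"},
        {"type": "work", "city": "Cambridge"},
    ],
}
```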
A one-to-many relationship where the "many" side grows unboundedly or is updated independently of the "one" side.
→Store related items as separate documents and use the parent document's ID as a reference.
Why: Prevents documents from exceeding the 2 MB size limit and avoids high RU costs for updates on large embedded arrays.
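A sketch of the referenced shape, with a client-side stand-in for the query (the order/customer names are hypothetical):

```python
# Hypothetical order documents stored separately and linked back to the
# customer by ID, so the order list can grow without bloating the customer doc.
orders = [
    {"id": "order-1", "customerId": "cust-001", "total": 19.99},
    {"id": "order-2", "customerId": "cust-001", "total": 5.49},
]

def orders_for(customer_id, docs):
    """Client-side stand-in for: SELECT * FROM c WHERE c.customerId = @id"""
    return [d for d in docs if d["customerId"] == customer_id]
```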
A document contains an array that can grow unboundedly over time, risking the 2 MB document size limit (e.g., event logs, comments).
→Split the array across multiple "bucket" documents. When a bucket reaches a size/item threshold, create a new one.
Why: Keeps individual document sizes manageable while maintaining the logical grouping of related data.
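A minimal sketch of the bucket pattern, assuming a hypothetical comments feed (the threshold here counts items for brevity; in practice you would size buckets well under the 2 MB limit):

```python
BUCKET_LIMIT = 3  # illustrative item threshold per bucket document

def add_comment(buckets, post_id, comment):
    """Append to the latest bucket, starting a new bucket at the threshold."""
    if not buckets or len(buckets[-1]["comments"]) >= BUCKET_LIMIT:
        buckets.append({
            "id": f"{post_id}-bucket-{len(buckets) + 1}",
            "postId": post_id,  # shared key ties buckets back to the post
            "comments": [],
        })
    buckets[-1]["comments"].append(comment)
    return buckets
```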
Modeling a many-to-many relationship, such as students and courses, or articles and tags.
→For bounded relationships, duplicate relationship data on both sides (e.g., embed course IDs in student doc, student IDs in course doc). For unbounded, use a separate "join" or "edge" document container.
Why: Denormalization optimizes both query directions (students in a course, courses for a student) without requiring joins; a join container keeps writes bounded when either side can grow without limit.
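A sketch of the bounded case, duplicating the relationship on both sides (student/course IDs are hypothetical; note that writes must now update both documents):

```python
# Bounded many-to-many: each side embeds the other's IDs, so each
# query direction is a single read at the cost of dual writes.
student = {"id": "stu-1", "name": "Kim", "courseIds": ["math-101", "cs-101"]}
course = {"id": "cs-101", "title": "Intro CS", "studentIds": ["stu-1", "stu-2"]}
```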
Modeling hierarchical data (e.g., organizational chart, product categories) and needing to query for all descendants of a node.
→Store an array of all ancestor IDs or names (the path) in each document.
Why: Enables efficient subtree queries with a single `ARRAY_CONTAINS` filter, avoiding costly recursive lookups.
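A sketch of the ancestor-path shape for a hypothetical category tree, with a client-side stand-in for the `ARRAY_CONTAINS` filter:

```python
# Each category stores its full ancestor path, so "all descendants of X"
# is a flat filter rather than a recursive walk.
categories = [
    {"id": "electronics", "path": []},
    {"id": "computers",   "path": ["electronics"]},
    {"id": "laptops",     "path": ["electronics", "computers"]},
]

def descendants_of(node_id, docs):
    """Client-side stand-in for:
    SELECT * FROM c WHERE ARRAY_CONTAINS(c.path, @nodeId)"""
    return [d for d in docs if node_id in d["path"]]
```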
A document has an unbounded array (e.g., blog comments), but the most common query only needs the most recent N items.
→Embed a subset of recent items in the main document and store all items as separate referenced documents.
Why: Optimizes the primary read path for performance and cost, while still allowing access to the full dataset when needed.
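A sketch of the hybrid pattern, assuming a hypothetical blog post (`RECENT_LIMIT` and the field names are illustrative):

```python
RECENT_LIMIT = 2  # how many comments stay embedded on the post

def post_comment(post, all_comments, comment):
    """Write the full comment as its own document, and keep only the
    most recent few embedded on the post for the hot read path."""
    all_comments.append({**comment, "postId": post["id"]})
    post["recentComments"] = (post["recentComments"] + [comment])[-RECENT_LIMIT:]
```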
Storing a sequence of immutable events for an entity and needing to query for current state or analytical aggregates.
→Store events in a single container partitioned by the entity ID. Use Change Feed or Synapse Link to compute and store materialized views or aggregates.
Why: Provides a complete audit trail and decouples the write model from various read models, offering high flexibility.
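A sketch of deriving current state from an immutable event stream, assuming a hypothetical account entity (this fold is what a Change Feed consumer would materialize into a read-side view):

```python
# Immutable events partitioned by entity ID; current state is a fold
# over the stream, never an in-place update.
events = [
    {"accountId": "acct-1", "type": "deposit",  "amount": 100},
    {"accountId": "acct-1", "type": "withdraw", "amount": 30},
    {"accountId": "acct-1", "type": "deposit",  "amount": 5},
]

def current_balance(stream):
    signs = {"deposit": 1, "withdraw": -1}
    return sum(signs[e["type"]] * e["amount"] for e in stream)
```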
Need to preserve the state of related data at a specific point in time (e.g., a customer's address on an order).
→Embed a copy (snapshot) of the related data in the document, rather than referencing it.
Why: Ensures historical accuracy by decoupling the document from future changes to the referenced data.
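A sketch of the snapshot pattern for a hypothetical order, showing that a later change to the source data does not rewrite history:

```python
import copy

customer_address = {"street": "1 Main St", "city": "Springfield"}

# Snapshot: copy the address into the order instead of referencing it.
order = {"id": "order-9", "shippingAddress": copy.deepcopy(customer_address)}

# Later the customer moves; the historical order is unaffected.
customer_address["city"] = "Shelbyville"
```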
Ingesting high-frequency time-series data (e.g., IoT sensor readings) and querying by device over time ranges.
→Use the device ID as the partition key. Aggregate readings into time-bucketed documents (e.g., one document per device per hour or per minute) instead of one document per reading.
Why: Drastically reduces document count and write RUs, while co-locating data for efficient time-range queries within a partition.
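A sketch of hourly bucketing for hypothetical IoT readings (the ID scheme and field names are assumptions):

```python
from datetime import datetime, timezone

def bucket_id(device_id, ts):
    """Hourly bucket ID: one document per device per hour."""
    return f"{device_id}:{ts.strftime('%Y-%m-%dT%H')}"

def record(buckets, device_id, ts, value):
    key = bucket_id(device_id, ts)
    doc = buckets.setdefault(key, {
        "id": key,
        "deviceId": device_id,  # assumed partition key
        "readings": [],
    })
    doc["readings"].append({"ts": ts.isoformat(), "value": value})
```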
Need to perform multiple create, update, or delete operations as a single atomic transaction.
→Use the SDK's TransactionalBatch feature. All operations must target the same logical partition key.
Why: Provides ACID guarantees for up to 100 operations within a single partition, ensuring that either all operations succeed or all fail together.
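An illustrative pre-check, not the SDK itself: a helper that enforces the two constraints above (single logical partition key, at most 100 operations) before a batch would be submitted:

```python
MAX_BATCH_OPS = 100  # Cosmos DB transactional batch operation limit

def validate_batch(operations):
    """Illustrative client-side check mirroring the batch constraints:
    all operations must share one partition key, and count <= 100."""
    if len(operations) > MAX_BATCH_OPS:
        raise ValueError("batch exceeds 100 operations")
    keys = {op["partitionKey"] for op in operations}
    if len(keys) != 1:
        raise ValueError("all operations must share one partition key")
    return True
```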
Documents should be automatically deleted from a container after a specific period (e.g., 30 days).
→Enable Time to Live (TTL) on the container and set the default `ttl` value in seconds (e.g., 2592000 for 30 days). A `ttl` of -1 on an individual document overrides the default and prevents expiration.
Why: TTL deletions run as a background task that consumes leftover RUs, so expiry doesn't compete with your provisioned throughput, providing an efficient, hands-off way to manage data lifecycle.
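A small sketch of the arithmetic and the per-document override (the `audit_record` document is hypothetical):

```python
DAY = 24 * 60 * 60
default_ttl = 30 * DAY  # container-level default: expire after 30 days

# Hypothetical document that opts out of expiry despite the container default.
audit_record = {"id": "audit-1", "ttl": -1}
```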
Need to store large binary objects (images, videos, documents > 2 MB) associated with Cosmos DB metadata.
→Store the binary object in Azure Blob Storage. Store the URI to the blob in the Cosmos DB document along with the metadata.
Why: Cosmos DB is optimized for structured metadata and has a 2 MB document limit. Blob Storage is a cost-effective and scalable service for large object storage.
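A sketch of the metadata document, with a placeholder Blob Storage URI (the storage account and container names are illustrative):

```python
# Metadata document in Cosmos DB; the large binary lives in Blob Storage
# and is linked by URI rather than stored inline.
video_doc = {
    "id": "video-42",
    "title": "Launch recording",
    "contentType": "video/mp4",
    "sizeBytes": 734_003_200,  # ~700 MB, far over the 2 MB document limit
    "blobUri": "https://examplestorage.blob.core.windows.net/videos/video-42.mp4",
}
```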