Playbook — AI-103 Microsoft Azure AI Apps and Agents Developer Associate

Last reviewed: June 2026

A scannable reference of architectural patterns the AI-103 exam tests. Read top-to-bottom, or jump to a section.

Plan and manage an Azure AI solution

A chat feature runs at high volume with short, simple turns and a tight latency and cost budget.

Deploy a small language model (SLM) such as Phi from the Foundry model catalog instead of a frontier LLM.

Why: SLMs cut cost and latency for narrow tasks; reserve large LLMs for complex reasoning. Match model size to task, not to brand.

Reference

A single agent must reason over user-uploaded images and text in one request.

Choose a multimodal model (e.g. GPT-4o family) in the Foundry catalog rather than chaining a vision model into a text-only LLM.

Why: Native multimodal models accept image and text in one prompt; a text-only model forces a lossy caption hand-off that drops visual detail.

Answers must be grounded in a private corporate knowledge base, not the model's pretraining.

Build a retrieval layer: index the corpus in Azure AI Search with vector embeddings and ground the model via RAG over that index.

Why: Grounding injects retrieved, citable context at inference; fine-tuning bakes knowledge in statically and cannot cite or update cheaply.

Reference

An agent needs to call internal REST APIs and also retrieve from an indexed document store.

Register the APIs as agent tools (function/OpenAPI) and attach the AI Search index as a knowledge source on the Foundry agent.

Why: Tools give the agent action capability; knowledge sources give grounded retrieval. They are distinct integration surfaces, not the same connector.

Several teams need isolated agent configs, connections, and deployments under shared governance.

Use a Foundry hub with per-team Foundry projects; each project scopes its own connections, deployments, and access.

Why: The hub centralizes networking, policy, and shared resources; the project is the workspace unit for an app or team. Don't share one project across teams.

A production app needs predictable data residency and reserved throughput for a model deployment.

Use a Standard (regional) or Provisioned Throughput (PTU) deployment rather than a Global deployment for residency-sensitive, high-throughput workloads.

Why: Global deployments route to any region for capacity; Standard pins the region, and PTU reserves capacity for stable latency. Pick by residency and SLA needs.

Reference

Prompt and agent definitions must move from dev to prod with review and rollback.

Store prompt flow / agent definitions as code in a repo and promote them through environments with Azure DevOps or GitHub Actions pipelines.

Why: Treat prompts and agent config as versioned artifacts; manual portal edits in prod have no audit trail or rollback path.

A burst of traffic triggers 429 errors against a model deployment.

Raise the deployment's TPM/RPM quota where available, add client-side retry with exponential backoff, and consider a PTU deployment for guaranteed capacity.

Why: Quota is the tokens-per-minute ceiling; backoff smooths transient throttling. Spinning up duplicate resources without quota planning just moves the bottleneck.

Reference

Spend is unpredictable and dominated by long RAG prompts.

Cap max output tokens, trim retrieved context to top-k, cache reusable system context, and track token usage per deployment in Azure Monitor.

Why: Cost scales with input plus output tokens; shrinking context and outputs is the direct lever. Switching region or SKU rarely changes per-token price meaningfully.

Over weeks, answer quality and grounding fidelity appear to degrade in production.

Run continuous online evaluations in Foundry for groundedness, relevance, and coherence on sampled live traffic and alert on score drops.

Why: Scheduled evaluators detect drift you can't see in raw latency metrics; CPU/latency dashboards alone never reveal a grounding regression.

Reference

RAG answers go stale because new documents are not being retrieved.

Monitor the AI Search indexer run history and document counts; schedule incremental indexing and alert on failed indexer runs.

Why: Retrieval quality silently breaks when the indexer fails or lags; model-side metrics look fine because the gap is in the data pipeline.

An app must call a Foundry model deployment with no secrets in config.

Enable a managed identity on the app and grant it the "Cognitive Services OpenAI User" role; authenticate with Entra ID tokens, not API keys.

Why: Keyless Entra auth removes leakable secrets and centralizes RBAC; storing API keys, even in Key Vault, still leaves a key to rotate and protect.

Reference

Foundry traffic must never traverse the public internet.

Place the Foundry resource and dependencies behind private endpoints, disable public network access, and resolve via private DNS zones.

Why: Private endpoints pin traffic to the VNet; firewall IP allow-lists still route over public endpoints and are weaker isolation.

Generated responses occasionally include hateful or violent content.

Apply an Azure AI Content Safety filter at the deployment with appropriate severity thresholds for hate, sexual, violence, and self-harm categories.

Why: Content filters screen prompts and completions server-side; relying only on a system-prompt instruction is easily bypassed by jailbreaks.

Reference

An autonomous agent can execute irreversible actions such as issuing refunds.

Configure a human-in-the-loop approval gate for high-impact tools and constrain the agent to an allow-listed set of actions.

Why: Approval modes and tool-access constraints bound autonomy; an unconstrained autonomous agent has no brake on a destructive tool call.

Auditors need to see which sources and tool calls produced a given answer.

Enable tracing in Foundry (OpenTelemetry) to capture prompts, retrieved citations, tool invocations, and outputs per request.

Why: End-to-end traces give provenance and reproducibility; aggregate token metrics alone can't reconstruct a single answer's reasoning chain.

Reference

Implement generative AI and agentic solutions

A backend service must call models and agents defined in a Foundry project.

Use the Azure AI Foundry SDK (AIProjectClient) with the project connection string and a DefaultAzureCredential to get model and agent clients.

Why: The project client resolves connections and deployments centrally; hardcoding per-model endpoints and keys bypasses project governance.

Reference

Build a Q&A app grounded in policy documents.

Embed and index the docs, retrieve top-k chunks per query, and pass them as context into the chat completion with a cite-your-sources instruction.

Why: RAG keeps knowledge current and citable without retraining; passing the full corpus into the prompt blows the context window and cost.

The model must look up live order status during a conversation.

Define a tool with a JSON schema, let the model emit a tool call, execute it server-side, and return the result for the model to summarize.

Why: Function/tool calling lets the model invoke real systems deterministically; asking it to "guess" the status produces fabrications.

Reference

A task needs several dependent tool calls before a final answer.

Run a tool-use loop: feed each tool result back to the model and iterate until it returns a final message, with a max-iteration cap.

Why: Iterative tool loops support multistep reasoning; a single round trip can't chain dependent lookups, and an uncapped loop can run away.

Before shipping, quantify how often a RAG app hallucinates or drifts off-topic.

Run Foundry evaluators for groundedness, relevance, and coherence over a labeled test set and gate release on threshold scores.

Why: Built-in evaluators give measurable quality and safety signals; eyeballing a few samples doesn't catch systematic fabrication.

Reference

Define a support agent with a clear persona, goals, and boundaries.

Set the agent's system instructions (role, goals, refusal rules) and attach only the tools it needs for its scope.

Why: Tight instructions plus least-tool access keep the agent on-task; broad instructions and every tool invite scope creep and unsafe actions.

An agent must remember context across turns within a session.

Use Foundry Agent Service threads, which persist the message history per conversation so each run sees prior turns.

Why: Threads provide managed conversation memory; re-sending the whole transcript manually each call is brittle and easy to truncate wrong.

Reference

An agent needs web grounding and code execution without custom plumbing.

Attach built-in Foundry agent tools such as Grounding with Bing Search and the Code Interpreter rather than hand-rolling integrations.

Why: Managed tools are governed and supported out of the box; custom reimplementations add maintenance and skip platform safety controls.

A primary agent should delegate billing questions to a specialized billing agent.

Use connected agents: expose the billing agent as a tool the main agent can call, so it routes sub-tasks to specialists.

Why: Connected agents enable hierarchical delegation; cramming every domain into one mega-agent bloats instructions and degrades accuracy.

Reference

A workflow needs a planner, a researcher, and a writer collaborating with shared state.

Orchestrate them with a multi-agent framework (Semantic Kernel / AutoGen on Foundry) using a defined orchestration pattern and shared context.

Why: Frameworks manage turn-taking, state, and termination; ad-hoc string passing between agents has no coordination or stop condition.

An agent runs unattended overnight and must not take risky actions alone.

Bound it with allow-listed tools, per-action budgets, content filters, and a checkpoint that escalates high-impact steps for approval.

Why: Layered safeguards keep autonomy safe; an autonomous loop with full tool access and no approval gate can cause irreversible damage.

An agent intermittently fails mid-task and you must find the failing step.

Inspect the run's traced steps and tool-call inputs/outputs in Foundry to locate the failing tool or malformed argument.

Why: Step-level traces pinpoint where a run broke; a single final error message hides which tool call or reasoning step actually failed.

Outputs are inconsistent and ignore formatting instructions.

Use a clear system message, few-shot examples, and explicit output constraints; for strict shape, enable structured outputs / JSON schema.

Why: Structured prompting and schema-enforced outputs make results reliable; raising temperature or retrying blindly doesn't fix instruction-following.

Reference

A creative copy task feels too repetitive; a data-extraction task is too random.

Raise temperature/top-p for the creative task and lower them toward 0 for extraction to make it deterministic.

Why: Sampling params trade diversity against determinism; switching models is overkill when the parameter setting is the real cause.

A reasoning agent makes avoidable logic errors on hard tasks.

Add a reflection / self-critique step where the agent reviews and revises its draft, or use a reasoning model for the step.

Why: Chain-of-thought and self-critique improve hard-task accuracy; a single forward pass has no chance to catch its own mistake.

Operations needs token spend, latency, and safety signals per request in production.

Emit OpenTelemetry traces and metrics from the app to Azure Monitor / Application Insights, capturing tokens, latency, and content-safety flags.

Why: Unified observability ties cost, performance, and safety together; scraping logs by hand can't correlate a slow turn with its token usage.

Reference

One app mixes cheap classification with occasional complex reasoning.

Orchestrate multiple deployments: route simple turns to an SLM and escalate hard turns to a frontier LLM behind one app layer.

Why: Model routing optimizes cost and quality per turn; using one premium model for everything overpays for the easy majority.

Implement computer vision solutions

A marketing app must generate original images from text prompts.

Deploy an image-generation model (e.g. DALL-E / GPT-image in the Foundry catalog) and call it with the text prompt and size parameters.

Why: Generative image models synthesize new visuals; the Image Analysis (vision) API only describes existing images, it cannot create them.

Reference

Replace just the background of an existing product photo, keeping the product intact.

Use the image edit (inpainting) endpoint with the source image plus a mask that marks only the editable region.

Why: A mask scopes edits to the painted area; a plain text-to-image call regenerates the whole frame and loses the original product.

Produce short generated video clips from a text description.

Use a text-to-video model such as Sora in the Foundry catalog with prompt, duration, and resolution parameters.

Why: Video generation is a distinct model family; image models output single frames and cannot produce temporal motion.

Users ask free-form questions about an uploaded chart image.

Send the image plus the question to a multimodal LLM (GPT-4o) for visual question answering and a natural-language answer.

Why: Multimodal chat handles open visual QA; fixed-taxonomy image tagging returns labels, not answers to arbitrary questions.

Auto-generate descriptive alt text for thousands of images for accessibility.

Use the Image Analysis caption / dense-captions capability to produce human-readable descriptions at scale.

Why: Captioning yields concise alt text directly; object detection returns bounding boxes that still need to be turned into prose.

Reference

Extract structured fields and segment-level insights from long recorded videos.

Use Azure AI Content Understanding with a video analyzer to get structured, schema-defined output across the timeline.

Why: Content Understanding produces grounded structured output across modalities; frame-by-frame image calls don't give timeline-aware structure.

Reference

A multimodal agent reads user images that may contain hidden instruction text.

Enable prompt shields / indirect-injection detection and treat text inside images as untrusted data, not as instructions.

Why: Embedded image text is a classic indirect prompt-injection vector; passing OCR'd text straight into the system prompt lets attackers hijack the agent.

Reference

Implement text analysis solutions

Pull names, dates, and amounts from emails into a typed JSON record.

Prompt an LLM with a target JSON schema and enable structured outputs so every field is returned in a fixed shape.

Why: Schema-constrained LLM extraction handles open formats and guarantees parseable JSON; brittle regex breaks on natural-language variety.

Produce a concise, rewritten summary of long support transcripts.

Use an LLM for abstractive summarization with a length and focus instruction, or the Language service summarization skill.

Why: Abstractive summaries paraphrase the gist; extractive sentence-picking just copies sentences and can miss the overall point.

Reference

Classify customer messages by sentiment and flag aggressive tone.

Use an LLM (or the Language sentiment API) to label polarity and detect tone, returning a category and confidence.

Why: Sentiment/tone analysis is a classification task with defined labels; free-text generation without a label schema is hard to route on downstream.

Translate high volumes of UI strings accurately and cheaply across 30 languages.

Use Azure AI Translator for bulk, deterministic translation; reserve an LLM for nuanced, context-heavy passages.

Why: Translator is purpose-built, cheaper, and consistent at scale; an LLM per string costs more and can drift in tone across runs.

Reference

A voice agent must transcribe caller audio in real time.

Use the Speech service real-time speech-to-text (or fast transcription) to feed text into the agent pipeline.

Why: Streaming STT gives low-latency partial transcripts for live conversation; batch transcription is for offline files, not live turns.

Reference

Transcription mishears product names and medical jargon.

Train a Custom Speech model with domain audio and phrase lists to boost recognition of specialized vocabulary.

Why: Custom Speech adapts the acoustic/language model to your terms; the base model has no exposure to your private jargon.

Reference

The agent must reply with natural-sounding spoken audio.

Use neural Text to Speech with an appropriate voice and SSML to control prosody, pauses, and pronunciation.

Why: Neural TTS plus SSML yields lifelike, controllable speech; plain text without SSML gives flat phrasing on numbers and names.

Reference

Implement information extraction solutions

Vector-only retrieval misses exact keyword and code-identifier matches.

Use hybrid search in Azure AI Search (vector plus keyword) with semantic ranking to reorder the merged results.

Why: Hybrid plus semantic reranking beats either signal alone; pure vector search can miss literal terms, pure keyword misses paraphrase.

Reference

The corpus includes scanned PDFs whose text isn't selectable.

Add an OCR cognitive skill (Document Intelligence / Vision) to the indexing skillset so scanned text is extracted before chunking and embedding.

Why: OCR enrichment surfaces text from images for retrieval; indexing the raw scanned PDF yields nothing searchable.

Reference

During ingestion you need OCR, key-phrase extraction, and translation applied per document.

Define an AI Search skillset chaining the needed cognitive skills, projecting outputs into index fields the indexer populates.

Why: A skillset declaratively orchestrates enrichment at index time; doing it in app code per query repeats work and breaks reuse.

You want chunking and embedding handled inside the index pipeline, not in app code.

Use AI Search integrated vectorization to split documents and call an embedding model during indexing and at query time.

Why: Integrated vectorization keeps chunking/embedding consistent between ingest and query; custom client-side embedding risks model mismatch.

Reference

Extract structured fields from invoices with varied layouts.

Use a Document Intelligence prebuilt invoice model, or train a custom model, to return typed fields with confidence and bounding regions.

Why: Document Intelligence understands layout and returns typed fields; an OCR-only dump gives raw text with no field semantics.

Reference

You need a clean, grounded markdown representation of mixed documents for RAG.

Use Content Understanding analyzers to produce structured / markdown output that preserves headings, tables, and field grounding.

Why: Grounded markdown keeps structure and citations for retrieval; flattened plain text loses tables and section context the model needs.

Reference

A Foundry agent must retrieve from your enriched search index at run time.

Add the AI Search index as a knowledge source / tool on the agent so each run grounds answers in retrieved, cited results.

Why: Wiring the index as an agent tool gives live grounded retrieval; pasting static snippets into instructions can't stay current with the corpus.