An Azure OpenAI chatbot needs to provide consistent, focused, and non-creative responses for a customer service scenario.
→Set the `temperature` parameter to a low value, such as 0.1 or 0.2. Avoid setting it to exactly 0 for most models.
Why: Temperature controls the randomness of the output. Lowering it makes the model more deterministic and likely to choose the highest-probability tokens.
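A minimal sketch of request parameters tuned this way, assuming the `openai` Python SDK's Chat Completions shape; the deployment name is a placeholder:

```python
# Sketch: Chat Completions arguments for consistent, low-randomness replies.
# The deployment name and system message are illustrative assumptions.
def build_chat_request(user_message: str) -> dict:
    """Build request parameters tuned for deterministic customer service answers."""
    return {
        "model": "my-gpt4o-deployment",   # your Azure OpenAI deployment name
        "temperature": 0.1,               # low value -> focused, repeatable output
        "messages": [
            {"role": "system", "content": "You are a concise customer service assistant."},
            {"role": "user", "content": user_message},
        ],
    }

params = build_chat_request("Where is my order?")
# Pass to client.chat.completions.create(**params) with an AzureOpenAI client.
```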
In a RAG solution, ensure the generative model only synthesizes answers from documents the specific user is permitted to access.
→Implement security trimming at the retrieval stage. In Azure AI Search, apply security filters to the search query based on the user's Microsoft Entra ID (formerly Azure AD) identity and group memberships.
Why: Access control must be enforced before the LLM sees the data. Filtering at the search (retrieval) layer is the only secure way to implement this.
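One way to sketch the trimming filter, assuming each indexed document carries a filterable `group_ids` collection field (the field name is an assumption) and using Azure AI Search's OData `search.in` function:

```python
# Sketch: build an OData security filter for Azure AI Search.
# The index field name "group_ids" is an assumption; use whichever filterable
# collection field stores the group IDs allowed to see each document.
def build_security_filter(user_group_ids: list[str]) -> str:
    """Restrict results to documents tagged with at least one of the user's groups."""
    groups = ",".join(user_group_ids)
    return f"group_ids/any(g: search.in(g, '{groups}'))"

flt = build_security_filter(["grp-finance", "grp-all-staff"])
# Pass as: search_client.search(query_text, filter=flt)
```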
Consistently extract structured data from unstructured text into a valid JSON object using Azure OpenAI.
→Use a prompt that includes: 1) A clear role. 2) Explicit instruction to return ONLY JSON. 3) The desired JSON schema with field names and types. 4) Few-shot examples if possible.
Why: Highly structured and explicit prompts significantly increase the reliability of getting well-formed, structured output from LLMs.
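A sketch combining such a prompt with defensive parsing; the schema and field names are illustrative assumptions:

```python
import json

# Sketch: an explicit extraction prompt plus defensive parsing of the reply.
# The schema fields (name, email, issue) are illustrative assumptions.
SYSTEM_PROMPT = """You are a data extraction engine.
Return ONLY a JSON object matching this schema, with no extra text:
{"name": string, "email": string, "issue": string}"""

def parse_model_reply(reply: str) -> dict:
    """Parse the model's reply, stripping accidental markdown code fences."""
    cleaned = reply.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)
    for field in ("name", "email", "issue"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    return data

# Example reply a well-prompted model might return:
record = parse_model_reply('{"name": "Ana", "email": "ana@example.com", "issue": "refund"}')
```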
A mission-critical application requires guaranteed, consistent throughput from Azure OpenAI, with no throttling during peak load.
→Purchase and deploy the model using Provisioned Throughput Units (PTUs).
Why: PTUs provide dedicated, reserved model processing capacity, unlike standard pay-as-you-go deployments which operate on a shared capacity model and are subject to throttling.
Maintain context in a long-running chatbot conversation without exceeding the model's token limit.
→Implement a conversation summarization strategy. Periodically use a separate LLM call to summarize older parts of the conversation, and include this summary plus the most recent turns in the prompt.
Why: This "summarize and slide" pattern preserves long-term context much more effectively and economically than simple truncation or sending the entire (and eventually too long) history.
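A minimal sketch of the pattern, where `summarize` stands in for the separate LLM call (an assumption; here it can be any callable):

```python
# Sketch of the "summarize and slide" pattern. `summarize` stands in for a
# separate LLM summarization call; the keep_last cutoff is an assumption.
def compact_history(messages: list[dict], summarize, keep_last: int = 4) -> list[dict]:
    """Fold older turns into one summary message, keep recent turns verbatim."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(older)
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent

# Usage with a stub summarizer:
fake_summarize = lambda msgs: f"{len(msgs)} earlier turns about billing"
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history, fake_summarize)
```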
Enable an Azure OpenAI model to call an external API to get current weather information.
→Define the API as a tool for the model using a precise JSON Schema format. Include a clear function `description` and a detailed description for each property in the `parameters` schema so the model knows when and how to use it.
Why: The model relies entirely on the schema and descriptions to make an informed decision to call a function. A well-described function is critical for reliability.
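A sketch of such a definition in the Chat Completions `tools` format; the function name and parameters are illustrative assumptions:

```python
# Sketch: a weather tool definition in the Chat Completions `tools` format.
# The function name and parameter fields are illustrative assumptions.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city. Use whenever the "
                       "user asks about present weather conditions.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"],
                         "description": "Temperature unit to report"},
            },
            "required": ["city"],
        },
    },
}
# Pass as: client.chat.completions.create(..., tools=[weather_tool])
```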
Use Azure OpenAI to summarize a document that is much longer than the model's context window.
→Implement a "map-reduce" or "refine" strategy. Chunk the document, generate a summary for each chunk (map), and then generate a final summary from the collection of chunk summaries (reduce).
Why: This is the standard pattern for applying fixed-context models to arbitrarily long inputs, ensuring the entire document content is considered.
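The map-reduce variant can be sketched as follows; `summarize` stands in for an LLM call (an assumption), and character-based chunking is a simplification (real implementations usually chunk by tokens):

```python
# Sketch of map-reduce summarization. `summarize` stands in for an LLM call;
# the character-based chunk size is a simplifying assumption.
def map_reduce_summary(text: str, summarize, chunk_size: int = 2000) -> str:
    # Map: summarize each fixed-size chunk independently.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = [summarize(c) for c in chunks]
    # Reduce: produce a final summary from the concatenated partial summaries.
    return summarize("\n".join(partials))

# Usage with a stub summarizer that keeps the first 20 characters:
stub = lambda t: t[:20]
final = map_reduce_summary("lorem " * 1000, stub)
```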
Improve the perceived responsiveness of a chat application by displaying the AI's response as it is being generated.
→When calling the Chat Completions API, set the `stream` parameter to `true`. Process the server-sent events as they arrive to build the response token by token.
Why: Streaming provides a much better user experience for real-time applications than waiting for the full response to be generated, which can take several seconds.
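A sketch of the consuming side: with the `openai` Python SDK, `stream=True` makes the call return an iterable of chunks whose deltas carry content fragments. Here plain dicts stand in for those chunk objects so the loop logic is self-contained:

```python
# Sketch: accumulate streamed Chat Completions deltas into the full reply.
# Plain dicts stand in for the SDK's chunk objects (an assumption).
def consume_stream(chunks, on_token=print) -> str:
    """Emit each content fragment as it arrives and return the joined reply."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            on_token(delta)   # e.g. append to the UI as it arrives
            parts.append(delta)
    return "".join(parts)

fake_stream = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo!"}}]},
    {"choices": [{"delta": {}}]},  # final chunks often carry an empty delta
]
reply = consume_stream(fake_stream, on_token=lambda t: None)
```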
An AI agent must dynamically decide which of several tools (e.g., database query, web search, email sender) to use to fulfill a user request.
→Use a framework like Semantic Kernel or Azure AI Agent Service. Define each capability as a distinct tool/plugin and let the agent's planner or ReAct loop orchestrate the tool calls.
Why: Agentic frameworks provide the orchestration layer (planner/reasoning loop) that enables an LLM to move beyond simple Q&A to become an autonomous actor that uses tools.
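The dispatch step every such loop performs can be sketched without any framework: map a model-requested tool call onto a registry of local functions. The tool names and stub implementations below are assumptions:

```python
import json

# Sketch of the dispatch step inside an agent loop: route one tool call
# (in the shape returned by Chat Completions) to a registered function.
# Tool names and stub bodies are illustrative assumptions.
TOOLS = {
    "query_database": lambda sql: f"rows for: {sql}",
    "web_search": lambda query: f"results for: {query}",
}

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute one requested tool call and return its result for the model."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

result = dispatch_tool_call(
    {"function": {"name": "web_search", "arguments": '{"query": "PTU pricing"}'}}
)
```

In a real loop, the result is appended to the conversation as a `tool` message and the model is called again.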
Prevent an autonomous AI agent from performing high-risk actions (e.g., deleting data, spending money) without oversight.
→Implement a human-in-the-loop pattern. When the agent plans a high-risk action, the system must pause and require explicit confirmation from a human operator before executing.
Why: This is a critical responsible AI pattern for agentic systems, balancing autonomy with safety by gating irreversible or high-impact actions.
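A minimal sketch of the gate, where `confirm` stands in for whatever channel collects operator approval and the risk list is an illustrative assumption:

```python
# Sketch of a human-in-the-loop gate. `confirm` stands in for the UI or
# workflow that collects operator approval; the risk list is an assumption.
HIGH_RISK = {"delete_records", "send_payment"}

def execute_action(name: str, action, confirm) -> str:
    """Run an agent action, pausing for explicit approval if it is high-risk."""
    if name in HIGH_RISK and not confirm(name):
        return "blocked: operator declined"
    return action()

# Usage with stubs: the operator declines the risky action.
outcome = execute_action("delete_records", lambda: "deleted", confirm=lambda n: False)
```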