Pick a Bedrock foundation model for a use case.
→Long-context reasoning + tool use → Claude (Sonnet/Opus). Cost-optimized chat → Claude Haiku or Titan Text Lite. Code → Claude or Llama. Embeddings → Titan Embeddings V2 or Cohere Embed. Image generation → Titan Image, Stable Diffusion, or Nova Canvas. Open-weights with self-host control → Llama, Mistral, or Custom Model Import.
Why: No single model is best across cost, latency, capability, and license terms. Match model class to the bottleneck.
Reference↗
KB source is short, self-contained FAQs or product blurbs (~100–500 words each).
→Fixed-size chunking with the default chunk size (300 tokens) and overlap (20%).
Why: Self-contained units don't benefit from boundary-aware chunking. Fixed-size is simplest and cheapest.
Reference↗
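A minimal boto3 sketch of the fixed-size default above; the KB ID, data-source name, and bucket ARN are placeholders:

```python
import boto3

agent = boto3.client("bedrock-agent")

agent.create_data_source(
    knowledgeBaseId="KB12345678",                      # placeholder
    name="product-faqs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-faq-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,          # default chunk size
                "overlapPercentage": 20,   # default overlap
            },
        }
    },
)
```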
Documents have natural topic shifts within paragraphs; fixed-size splits break sentences mid-thought.
→Semantic chunking. Bedrock Knowledge Bases groups consecutive sentences whose embeddings are close and splits at meaning boundaries.
Why: Preserves coherent ideas inside a chunk → cleaner retrieval, higher answer quality.
Reference↗
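The semantic variant slots into the same `vectorIngestionConfiguration` as the fixed-size sketch above; the buffer size and breakpoint threshold values are illustrative starting points:

```python
# Slots into vectorIngestionConfiguration in the create_data_source call above.
semantic_chunking = {
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "maxTokens": 300,                     # hard ceiling per chunk
        "bufferSize": 1,                      # sentences considered around each boundary
        "breakpointPercentileThreshold": 95,  # higher = fewer, larger chunks
    },
}
```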
Long technical manuals with cross-references between sections; questions need synthesis across a document.
→Hierarchical chunking. Bedrock builds parent (large) + child (small) chunks; retrieves on child embeddings, returns parent context.
Why: Small chunks give precise retrieval; parent context preserves cross-references and surrounding detail.
Reference↗
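The hierarchical variant for the same call; parent/child token sizes below are illustrative, not prescribed defaults:

```python
# Slots into vectorIngestionConfiguration in the create_data_source call above.
hierarchical_chunking = {
    "chunkingStrategy": "HIERARCHICAL",
    "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [
            {"maxTokens": 1500},   # parent chunks returned as context
            {"maxTokens": 300},    # child chunks used for retrieval
        ],
        "overlapTokens": 60,
    },
}
```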
Source files are pre-chunked or each file is intentionally one logical unit.
→No chunking strategy. Each file becomes one chunk in the KB.
Reference↗
PDF source contains text + diagrams; users ask questions that require understanding the diagrams.
→Enable Bedrock KB advanced parsing with a foundation model (Claude/Nova) as the parser. Diagrams and tables are described via vision, then embedded.
Why: Default parsing is text-only. Multimodal parsing converts visual content to descriptive text before embedding.
Reference↗
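A sketch of the advanced-parsing block inside `vectorIngestionConfiguration`, assuming the foundation-model parser fields; the model ARN and parsing prompt are placeholders:

```python
# Sits alongside chunkingConfiguration in vectorIngestionConfiguration.
parsing_config = {
    "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
    "bedrockFoundationModelConfiguration": {
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "parsingPrompt": {
            "parsingPromptText": "Describe every diagram and table in plain text."
        },
    },
}
```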
Pick Titan Embeddings G1 vs V2.
→V2 supports configurable dimensions (256/512/1024) and outperforms G1 on multilingual benchmarks. G1 is fixed at 1536. Pick V2 for storage-constrained or non-English use cases; G1 only for legacy compatibility.
Reference↗
500K product catalog: short titles (50 words) + long specs (500 words). Optimize search quality + cost.
→Embed each item once (combined or separate fields). Use Titan Embeddings V2 with reduced dimensions (256 or 512) for cost; embed query and document with the same model.
Why: Mixing embedding models or skipping normalization breaks similarity search. Lower dimensions cut storage and query cost with marginal quality loss.
Reference↗
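A minimal embedding sketch using Titan V2's `dimensions` and `normalize` request fields; the 512-dimension choice is the assumption here, not a requirement:

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime")

def embed(text: str, dims: int = 512) -> list[float]:
    # Use the same model and dimensions for both documents and queries.
    resp = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({
            "inputText": text,
            "dimensions": dims,   # 256 / 512 / 1024
            "normalize": True,    # unit-length vectors for cosine similarity
        }),
    )
    return json.loads(resp["body"].read())["embedding"]
```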
Pick a vector store for Bedrock Knowledge Bases.
→Default / fastest setup → Amazon OpenSearch Serverless (auto-managed). Low-latency relational joins + frequent schema/metadata updates → Aurora PostgreSQL with pgvector. Existing Pinecone / MongoDB Atlas / Redis customer → keep it. Tiny KB (<10K docs) cost-optimized → Aurora pgvector or Neptune Analytics.
Why: OpenSearch Serverless is the path-of-least-resistance default. Aurora pgvector wins when you need transactions or joins on metadata.
Reference↗
KB returns semantically relevant docs, but they're from outdated/wrong-region versions.
→Add metadata to source files (`version`, `region`, `effective_date`) and apply metadata filters at query time via `retrievalConfiguration.vectorSearchConfiguration.filter`.
Why: Pure vector similarity ignores recency and authority. Metadata filtering narrows the candidate pool before ranking.
Reference↗
RAG misses queries that contain exact identifiers (SKUs, error codes, regulation numbers) because semantic search overweights similar-meaning text.
→Enable hybrid search on the KB (semantic + keyword/BM25). Combines vector similarity with lexical match for IDs, codes, and proper nouns.
Reference↗
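A sketch combining the two retrieval controls above (metadata filter + hybrid override) in a single Retrieve call; the KB ID, filter keys, and values are placeholders:

```python
import boto3

kb_runtime = boto3.client("bedrock-agent-runtime")

resp = kb_runtime.retrieve(
    knowledgeBaseId="KB12345678",            # placeholder
    retrievalQuery={"text": "return policy for SKU-10442 in eu-west-1"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",  # semantic + keyword, where the store supports it
            "filter": {
                "andAll": [
                    {"equals": {"key": "region", "value": "eu-west-1"}},
                    {"greaterThanOrEquals": {"key": "version", "value": 3}},
                ]
            },
        }
    },
)
for hit in resp["retrievalResults"]:
    print(hit["score"], hit["content"]["text"][:80])
```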
Top-k=5 retrieves 5 chunks but the most relevant one is often ranked 3rd or 4th.
→Increase `numberOfResults` to 20, then enable a reranking model (Cohere Rerank or Amazon Rerank) to re-order the candidates by relevance to the original query.
Why: Embedding similarity ≠ task relevance. Cross-encoder rerankers see query + chunk together and score precisely.
Reference↗
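A hedged sketch of over-fetch + rerank on Retrieve, assuming the `rerankingConfiguration` fields introduced with the Bedrock rerank models; the rerank model ARN is a placeholder:

```python
import boto3

kb_runtime = boto3.client("bedrock-agent-runtime")

resp = kb_runtime.retrieve(
    knowledgeBaseId="KB12345678",
    retrievalQuery={"text": "Which error code means the battery is over-temperature?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 20,                 # over-fetch candidates
            "rerankingConfiguration": {            # then re-order them
                "type": "BEDROCK_RERANKING_MODEL",
                "bedrockRerankingConfiguration": {
                    "modelConfiguration": {
                        "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
                    },
                    "numberOfRerankedResults": 5,  # keep the best 5 after reranking
                },
            },
        }
    },
)
```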
User questions are conversational, multi-part, or contain pronouns/follow-ups; KB retrieval quality drops.
→Enable Bedrock KB query reformulation. The model rewrites complex queries into multiple focused sub-queries before retrieval.
Reference↗
S3 source documents are updated frequently; KB must always reflect the latest versions without manual sync.
→Configure the KB data source for automated sync via S3 event notifications → EventBridge → StartIngestionJob, or use the KB scheduled sync. Avoid relying on the manual console "Sync" button.
Reference↗
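A minimal Lambda sketch for the event-driven path; the KB and data-source IDs are placeholders, and the event shape assumes an EventBridge S3 "Object Created" notification:

```python
import boto3

agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Triggered by an EventBridge rule on S3 object-level events.
    key = event.get("detail", {}).get("object", {}).get("key", "unknown")
    agent.start_ingestion_job(
        knowledgeBaseId="KB12345678",   # placeholder
        dataSourceId="DS12345678",      # placeholder
        description=f"Auto-sync triggered by {key}",
    )
```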
Long-doc QA model hallucinates on questions whose answers are in the middle of the document.
→Don't pass full documents in the prompt — chunk + retrieve via RAG so only the relevant chunks reach the model. If full-doc is mandatory, use a model with strong long-context recall (Claude Sonnet 200K) and place the question after the document.
Why: Most LLMs exhibit "lost in the middle" recall degradation. RAG sidesteps it; placement helps when RAG isn't available.
Pick the cheapest customization that meets the quality bar.
→Try in order: (1) prompt engineering, (2) RAG with KB, (3) fine-tuning, (4) continued pre-training, (5) Custom Model Import. Stop at the first one that meets the bar.
Why: Effort and ongoing cost grow at each step. Fine-tuning + Provisioned Throughput is much pricier than RAG.
Reference↗
Fine-tune a Bedrock model with labeled task examples.
→JSONL file in S3 with one example per line: `{"prompt": "...", "completion": "..."}` (or chat-format equivalent for the model family).
Why: Each model family (Titan, Claude, Llama) has a specific schema; check the model's fine-tuning doc before formatting.
Reference↗
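A sketch of the JSONL shape plus a `create_model_customization_job` call; names, role ARN, and hyperparameters are placeholders, and the exact record schema depends on the model family:

```python
import json
import boto3

bedrock = boto3.client("bedrock")

# One example per line; chat-format model families use a different record schema.
example = {"prompt": "Classify the ticket: 'App crashes on login'", "completion": "bug"}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

bedrock.create_model_customization_job(
    jobName="ticket-classifier-ft",
    customModelName="ticket-classifier-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",   # CONTINUED_PRE_TRAINING for unlabeled domain corpora
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-training-bucket/output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
```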
Adapt a foundation model to specialized vocabulary (legal, medical, scientific) using lots of unlabeled domain text.
→Continued pre-training on the unlabeled domain corpus. Different from instruction fine-tuning (which needs prompt-completion pairs).
Why: Continued pre-training updates language understanding; instruction fine-tuning teaches task behavior. Different data shape, different goal.
Reference↗
Customer interaction data for fine-tuning contains names, emails, phone numbers.
→Scrub or tokenize PII before uploading the training dataset to S3. Once weights absorb PII, output filtering cannot reliably mask it.
Why: Fine-tuned model may regurgitate training-data fragments. Scrubbing at the data layer is the only durable mitigation.
Reference↗
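One way to scrub at the data layer, sketched with Amazon Comprehend PII detection; replacing spans with the entity type is an assumption, not the only redaction style:

```python
import boto3

comprehend = boto3.client("comprehend")

def scrub_pii(text: str) -> str:
    """Replace detected PII spans with their entity type before the record is written to S3."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Redact from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text
```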
Bring a self-fine-tuned Llama or Mistral model and serve it through Bedrock's unified API.
→Custom Model Import. Upload weights to S3, register with Bedrock, invoke via the Bedrock runtime with unified IAM and logging.
Why: Lets you reuse Bedrock Guardrails, KBs, and Agents on bring-your-own weights without standing up SageMaker endpoints.
Reference↗
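A hedged sketch of the import job, assuming the `create_model_import_job` shape; job/model names, role, and S3 URI are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")

bedrock.create_model_import_job(
    jobName="llama-domain-import",
    importedModelName="llama-3-8b-domain",
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://my-weights-bucket/llama-3-8b-domain/"}
    },
)
```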
Serve a fine-tuned Bedrock model in production.
→Buy Provisioned Throughput. Fine-tuned and continued-pre-trained custom models cannot be invoked on-demand.
Reference↗
High-traffic Claude application hits per-region quotas during peaks; need higher throughput without buying Provisioned Throughput.
→Cross-region inference profiles. Bedrock routes invocations across multiple regions transparently to lift effective TPM/RPM quotas.
Why: Single-region on-demand quotas cap during spikes; cross-region profiles roughly multiply quotas with no app code changes beyond using the inference-profile ARN.
Reference↗
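A minimal Converse sketch that passes a cross-region inference profile ID where a plain model ID would go; the profile ID shown is illustrative, pick the one listed for your model and geography:

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = runtime.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # inference profile ID, not a model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our returns policy."}]}],
    inferenceConfig={"maxTokens": 512},
)
print(resp["output"]["message"]["content"][0]["text"])
```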
APAC users see significantly higher latency than US/EU users on a Bedrock app deployed in us-east-1.
→Deploy regional Bedrock endpoints in ap-northeast-1 / ap-southeast-1 / ap-south-1 (where the model is GA). Route users via Route 53 latency or geolocation policy.
Why: LLM round-trip dominates for long contexts; cross-Pacific RTT alone is 150–250 ms.
Reference↗
HIPAA-regulated app needs to summarize PHI with Bedrock.
→Use only HIPAA-eligible foundation models (per the HIPAA Eligible Services list). Sign a BAA with AWS. Encrypt prompts/responses with customer-managed KMS keys. Disable model invocation logging or scope it to a private S3 bucket with restricted access.
Reference↗
Decide what data may flow to Bedrock based on sensitivity (public / confidential / restricted).
→Public → unrestricted. Confidential → only via VPC endpoints + CMK + invocation logging in private buckets. Restricted (trade secrets, regulated PHI/PCI) → block from Bedrock entirely, or redact before invoke and stay within an eligible compliance regime (e.g., HIPAA-eligible models under a BAA).
Multi-account org wants Account A to share a custom Bedrock model with Account B without copying weights.
→Custom model sharing via AWS RAM. Owner shares the custom model ARN; consumer accounts invoke it through the standard Bedrock runtime with cross-account IAM principals on the resource policy.
Why: Avoids redundant fine-tuning costs and centralizes model lifecycle. RAM controls who can consume the shared resource.
Reference↗
Need a niche third-party model (e.g. healthcare-specialized LLM) not in the standard Bedrock catalog.
→Amazon Bedrock Marketplace. Subscribe to the model from the Marketplace catalog, deploy to a Bedrock endpoint, invoke via the standard runtime API.
Why: Unifies third-party billing, IAM, KMS, and observability with first-party Bedrock models.
Reference↗
High-volume search app re-embeds the same documents on every query refresh; embedding cost dominates.
→Pre-compute embeddings on document ingest, store the vector in DynamoDB or OpenSearch keyed by document id + content hash. Re-embed only when the content hash changes.
Why: Embedding the same text repeatedly is the most common avoidable cost. Hash-keyed cache is an O(1) skip.
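A sketch of the hash-keyed cache, assuming a DynamoDB table named `embedding-cache` with `doc_id` as the partition key:

```python
import hashlib
import json
import boto3

runtime = boto3.client("bedrock-runtime")
table = boto3.resource("dynamodb").Table("embedding-cache")  # placeholder table name

def get_embedding(doc_id: str, text: str) -> list[float]:
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    cached = table.get_item(Key={"doc_id": doc_id}).get("Item")
    if cached and cached["content_hash"] == content_hash:
        return json.loads(cached["embedding"])   # unchanged content: skip the model call
    resp = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 512, "normalize": True}),
    )
    vector = json.loads(resp["body"].read())["embedding"]
    table.put_item(Item={"doc_id": doc_id, "content_hash": content_hash,
                         "embedding": json.dumps(vector)})
    return vector
```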
GDPR right-to-be-forgotten on a fine-tuned model: user requests deletion of their PII from training data.
→Delete records from the training corpus, then fine-tune a fresh base model from scratch. Cannot reliably scrub data from existing weights — output filtering is not sufficient.
Why: Once weights absorb training data, masking at inference is unreliable. The defensible path is full retraining without the affected records.
Shared KB serves multiple teams; each team must only see its own documents.
→Tag every chunk with `tenant_id` / `team_id` / `clearance` metadata at ingest. At query time set `retrievalConfiguration.vectorSearchConfiguration.filter` to the caller's allowed values from the IAM session or app context.
Why: Vector similarity ignores access control; metadata filtering is the only durable per-tenant isolation in a shared KB.
Reference↗
EU customer requires that prompts and KB embeddings never leave eu-west-1.
→Deploy Bedrock + KB + S3 source bucket in eu-west-1. Pin invocations via inference profile ARN scoped to eu-west-1; SCP `aws:RequestedRegion` deny on other regions for `bedrock:*`.
Reference↗