Pick a Bedrock foundation model for a use case.
→Long-context reasoning + tool use → Claude (Sonnet/Opus). Cost-optimized chat → Claude Haiku or Titan Text Lite. Code → Claude or Llama. Embeddings → Titan Embeddings V2 or Cohere Embed. Image generation → Titan Image, Stable Diffusion, or Nova Canvas. Open-weights with self-host control → Llama, Mistral, or Custom Model Import.
Why: No single model is best across cost, latency, capability, and license terms. Match model class to the bottleneck.
Reference↗
KB source is short, self-contained FAQs or product blurbs (~100–500 words each).
→Fixed-size chunking with the default chunk size (300 tokens) and overlap (20%).
Why: Self-contained units don't benefit from boundary-aware chunking. Fixed-size is simplest and cheapest.
Reference↗
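A minimal boto3 sketch of the fixed-size default above; the KB ID, data-source name, and bucket ARN are placeholders:

```python
import boto3

agent = boto3.client("bedrock-agent")

agent.create_data_source(
    knowledgeBaseId="KB12345678",                      # placeholder
    name="product-faqs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-faq-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,          # default chunk size
                "overlapPercentage": 20,   # default overlap
            },
        }
    },
)
```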
Documents have natural topic shifts within paragraphs; fixed-size splits break sentences mid-thought.
→Semantic chunking. Bedrock Knowledge Bases groups consecutive sentences whose embeddings are close and splits at meaning boundaries.
Why: Preserves coherent ideas inside a chunk → cleaner retrieval, higher answer quality.
Reference↗
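The semantic variant slots into the same `vectorIngestionConfiguration` as the fixed-size sketch above; the buffer size and breakpoint threshold values are illustrative starting points:

```python
# Slots into vectorIngestionConfiguration in the create_data_source call above.
semantic_chunking = {
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "maxTokens": 300,                     # hard ceiling per chunk
        "bufferSize": 1,                      # sentences considered around each boundary
        "breakpointPercentileThreshold": 95,  # higher = fewer, larger chunks
    },
}
```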
Long technical manuals with cross-references between sections; questions need synthesis across a document.
→Hierarchical chunking. Bedrock builds parent (large) + child (small) chunks; retrieves on child embeddings, returns parent context.
Why: Small chunks give precise retrieval; parent context preserves cross-references and surrounding detail.
Reference↗
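The hierarchical variant for the same call; parent/child token sizes below are illustrative, not prescribed defaults:

```python
# Slots into vectorIngestionConfiguration in the create_data_source call above.
hierarchical_chunking = {
    "chunkingStrategy": "HIERARCHICAL",
    "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [
            {"maxTokens": 1500},   # parent chunks returned as context
            {"maxTokens": 300},    # child chunks used for retrieval
        ],
        "overlapTokens": 60,
    },
}
```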
Source files are pre-chunked or each file is intentionally one logical unit.
→No chunking strategy. Each file becomes one chunk in the KB.
Reference↗
PDF source contains text + diagrams; users ask questions that require understanding the diagrams.
→Enable Bedrock KB advanced parsing with a foundation model (Claude/Nova) as the parser. Diagrams and tables are described via vision, then embedded.
Why: Default parsing is text-only. Multimodal parsing converts visual content to descriptive text before embedding.
Reference↗
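A sketch of the advanced-parsing block inside `vectorIngestionConfiguration`, assuming the foundation-model parser fields; the model ARN and parsing prompt are placeholders:

```python
# Sits alongside chunkingConfiguration in vectorIngestionConfiguration.
parsing_config = {
    "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
    "bedrockFoundationModelConfiguration": {
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "parsingPrompt": {
            "parsingPromptText": "Describe every diagram and table in plain text."
        },
    },
}
```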
Pick Titan Embeddings G1 vs V2.
→V2 supports configurable dimensions (256/512/1024) and outperforms G1 on multilingual benchmarks. G1 is fixed at 1536. Pick V2 for storage-constrained or non-English use cases; G1 only for legacy compatibility.
Reference↗
500K product catalog: short titles (50 words) + long specs (500 words). Optimize search quality + cost.
→Embed each item once (combined or separate fields). Use Titan Embeddings V2 with reduced dimensions (256 or 512) for cost; embed query and document with the same model.
Why: Mixing embedding models or skipping normalization breaks similarity search. Lower dimensions cut storage and query cost with marginal quality loss.
Reference↗
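A minimal embedding sketch using Titan V2's `dimensions` and `normalize` request fields; the 512-dimension choice is the assumption here, not a requirement:

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime")

def embed(text: str, dims: int = 512) -> list[float]:
    # Use the same model and dimensions for both documents and queries.
    resp = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({
            "inputText": text,
            "dimensions": dims,   # 256 / 512 / 1024
            "normalize": True,    # unit-length vectors for cosine similarity
        }),
    )
    return json.loads(resp["body"].read())["embedding"]
```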
Pick a vector store for Bedrock Knowledge Bases.
→Default / fastest setup → Amazon OpenSearch Serverless (auto-managed). Low-latency relational joins + frequent schema/metadata updates → Aurora PostgreSQL with pgvector. Existing Pinecone / MongoDB Atlas / Redis customer → keep it. Tiny KB (<10K docs) cost-optimized → Aurora pgvector or Neptune Analytics.
Why: OpenSearch Serverless is the path-of-least-resistance default. Aurora pgvector wins when you need transactions or joins on metadata.
Reference↗
KB returns semantically relevant docs, but they're from outdated/wrong-region versions.
→Add metadata to source files (`version`, `region`, `effective_date`) and apply metadata filters at query time via `retrievalConfiguration.vectorSearchConfiguration.filter`.
Why: Pure vector similarity ignores recency and authority. Metadata filtering narrows the candidate pool before ranking.
Reference↗
RAG misses queries that contain exact identifiers (SKUs, error codes, regulation numbers) because semantic search overweights similar-meaning text.
→Enable hybrid search on the KB (semantic + keyword/BM25). Combines vector similarity with lexical match for IDs, codes, and proper nouns.
Reference↗
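A sketch combining the two retrieval controls above (metadata filter + hybrid override) in a single Retrieve call; the KB ID, filter keys, and values are placeholders:

```python
import boto3

kb_runtime = boto3.client("bedrock-agent-runtime")

resp = kb_runtime.retrieve(
    knowledgeBaseId="KB12345678",            # placeholder
    retrievalQuery={"text": "return policy for SKU-10442 in eu-west-1"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",  # semantic + keyword, where the store supports it
            "filter": {
                "andAll": [
                    {"equals": {"key": "region", "value": "eu-west-1"}},
                    {"greaterThanOrEquals": {"key": "version", "value": 3}},
                ]
            },
        }
    },
)
for hit in resp["retrievalResults"]:
    print(hit["score"], hit["content"]["text"][:80])
```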
Top-k=5 retrieves 5 chunks but the most relevant one is often ranked 3rd or 4th.
→Increase `numberOfResults` to 20, then enable a reranking model (Cohere Rerank or Amazon Rerank) to re-order the candidates by relevance to the original query.
Why: Embedding similarity ≠ task relevance. Cross-encoder rerankers see query + chunk together and score precisely.
Reference↗
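A hedged sketch of over-fetch + rerank on Retrieve, assuming the `rerankingConfiguration` fields introduced with the Bedrock rerank models; the rerank model ARN is a placeholder:

```python
import boto3

kb_runtime = boto3.client("bedrock-agent-runtime")

resp = kb_runtime.retrieve(
    knowledgeBaseId="KB12345678",
    retrievalQuery={"text": "Which error code means the battery is over-temperature?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 20,                 # over-fetch candidates
            "rerankingConfiguration": {            # then re-order them
                "type": "BEDROCK_RERANKING_MODEL",
                "bedrockRerankingConfiguration": {
                    "modelConfiguration": {
                        "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
                    },
                    "numberOfRerankedResults": 5,  # keep the best 5 after reranking
                },
            },
        }
    },
)
```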
User questions are conversational, multi-part, or contain pronouns/follow-ups; KB retrieval quality drops.
→Enable Bedrock KB query reformulation. The model rewrites complex queries into multiple focused sub-queries before retrieval.
Reference↗
S3 source documents are updated frequently; KB must always reflect the latest versions without manual sync.
→Configure the KB data source for automated sync via S3 event notifications → EventBridge → StartIngestionJob, or use the KB scheduled sync. Avoid relying on the manual console "Sync" button.
Reference↗
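A minimal Lambda sketch for the event-driven path; the KB and data-source IDs are placeholders, and the event shape assumes an EventBridge S3 "Object Created" notification:

```python
import boto3

agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Triggered by an EventBridge rule on S3 object-level events.
    key = event.get("detail", {}).get("object", {}).get("key", "unknown")
    agent.start_ingestion_job(
        knowledgeBaseId="KB12345678",   # placeholder
        dataSourceId="DS12345678",      # placeholder
        description=f"Auto-sync triggered by {key}",
    )
```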
Long-doc QA model hallucinates on questions whose answers are in the middle of the document.
→Don't pass full documents in the prompt — chunk + retrieve via RAG so only the relevant chunks reach the model. If full-doc is mandatory, use a model with strong long-context recall (Claude Sonnet 200K) and place the question after the document.
Why: Most LLMs exhibit "lost in the middle" recall degradation. RAG sidesteps it; placement helps when RAG isn't available.
Pick the cheapest customization that meets the quality bar.
→Try in order: (1) prompt engineering, (2) RAG with KB, (3) fine-tuning, (4) continued pre-training, (5) Custom Model Import. Stop at the first one that meets the bar.
Why: Effort and ongoing cost grow at each step. Fine-tuning + Provisioned Throughput is much pricier than RAG.
Reference↗
Fine-tune a Bedrock model with labeled task examples.
→JSONL file in S3 with one example per line: `{"prompt": "...", "completion": "..."}` (or chat-format equivalent for the model family).
Why: Each model family (Titan, Claude, Llama) has a specific schema; check the model's fine-tuning doc before formatting.
Reference↗
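A sketch of the JSONL shape plus a `create_model_customization_job` call; names, role ARN, and hyperparameters are placeholders, and the exact record schema depends on the model family:

```python
import json
import boto3

bedrock = boto3.client("bedrock")

# One example per line; chat-format model families use a different record schema.
example = {"prompt": "Classify the ticket: 'App crashes on login'", "completion": "bug"}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

bedrock.create_model_customization_job(
    jobName="ticket-classifier-ft",
    customModelName="ticket-classifier-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",   # CONTINUED_PRE_TRAINING for unlabeled domain corpora
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-training-bucket/output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
```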
Adapt a foundation model to specialized vocabulary (legal, medical, scientific) using lots of unlabeled domain text.
→Continued pre-training on the unlabeled domain corpus. Different from instruction fine-tuning (which needs prompt-completion pairs).
Why: Continued pre-training updates language understanding; instruction fine-tuning teaches task behavior. Different data shape, different goal.
Reference↗
Customer interaction data for fine-tuning contains names, emails, phone numbers.
→Scrub or tokenize PII before uploading the training dataset to S3. Once weights absorb PII, output filtering cannot reliably mask it.
Why: Fine-tuned model may regurgitate training-data fragments. Scrubbing at the data layer is the only durable mitigation.
Reference↗
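One way to scrub at the data layer, sketched with Amazon Comprehend PII detection; replacing spans with the entity type is an assumption, not the only redaction style:

```python
import boto3

comprehend = boto3.client("comprehend")

def scrub_pii(text: str) -> str:
    """Replace detected PII spans with their entity type before the record is written to S3."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Redact from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text
```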
Bring a self-fine-tuned Llama or Mistral model and serve it through Bedrock's unified API.
→Custom Model Import. Upload weights to S3, register with Bedrock, invoke via the Bedrock runtime with unified IAM and logging.
Why: Lets you reuse Bedrock Guardrails, KBs, and Agents on bring-your-own weights without standing up SageMaker endpoints.
Reference↗
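A hedged sketch of the import job, assuming the `create_model_import_job` shape; job/model names, role, and S3 URI are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")

bedrock.create_model_import_job(
    jobName="llama-domain-import",
    importedModelName="llama-3-8b-domain",
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://my-weights-bucket/llama-3-8b-domain/"}
    },
)
```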
Serve a fine-tuned Bedrock model in production.
→Buy Provisioned Throughput. Fine-tuned and continued-pre-trained custom models cannot be invoked on-demand.
Reference↗
High-traffic Claude application hits per-region quotas during peaks; need higher throughput without buying Provisioned Throughput.
→Cross-region inference profiles. Bedrock routes invocations across multiple regions transparently to lift effective TPM/RPM quotas.
Why: Single-region on-demand quotas cap during spikes; cross-region profiles roughly multiply quotas with no app code changes beyond using the inference-profile ARN.
Reference↗
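A minimal Converse sketch that passes a cross-region inference profile ID where a plain model ID would go; the profile ID shown is illustrative, pick the one listed for your model and geography:

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = runtime.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # inference profile ID, not a model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our returns policy."}]}],
    inferenceConfig={"maxTokens": 512},
)
print(resp["output"]["message"]["content"][0]["text"])
```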
APAC users see significantly higher latency than US/EU users on a Bedrock app deployed in us-east-1.
→Deploy regional Bedrock endpoints in ap-northeast-1 / ap-southeast-1 / ap-south-1 (where the model is GA). Route users via Route 53 latency or geolocation policy.
Why: LLM round-trip dominates for long contexts; cross-Pacific RTT alone is 150–250 ms.
Reference↗
HIPAA-regulated app needs to summarize PHI with Bedrock.
→Use only HIPAA-eligible foundation models (per the HIPAA Eligible Services list). Sign a BAA with AWS. Encrypt prompts/responses with customer-managed KMS keys. Disable model invocation logging or scope it to a private S3 bucket with restricted access.
Reference↗
Decide what data may flow to Bedrock based on sensitivity (public / confidential / restricted).
→Public → unrestricted. Confidential → only via VPC endpoints + CMK + invocation logging in private buckets. Restricted (trade secrets, regulated PHI/PCI) → block from Bedrock entirely, or redact before invoke and stay within an eligible compliance regime (e.g., HIPAA-eligible models under a BAA).
Multi-account org wants Account A to share a custom Bedrock model with Account B without copying weights.
→Custom model sharing via AWS RAM. Owner shares the custom model ARN; consumer accounts invoke it through the standard Bedrock runtime with cross-account IAM principals on the resource policy.
Why: Avoids redundant fine-tuning costs and centralizes model lifecycle. RAM controls who can consume the shared resource.
Reference↗
Need a niche third-party model (e.g. healthcare-specialized LLM) not in the standard Bedrock catalog.
→Amazon Bedrock Marketplace. Subscribe to the model from the Marketplace catalog, deploy to a Bedrock endpoint, invoke via the standard runtime API.
Why: Unifies third-party billing, IAM, KMS, and observability with first-party Bedrock models.
Reference↗
High-volume search app re-embeds the same documents on every query refresh; embedding cost dominates.
→Pre-compute embeddings on document ingest, store the vector in DynamoDB or OpenSearch keyed by document id + content hash. Re-embed only when the content hash changes.
Why: Embedding the same text repeatedly is the most common avoidable cost. Hash-keyed cache is an O(1) skip.
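A sketch of the hash-keyed cache, assuming a DynamoDB table named `embedding-cache` with `doc_id` as the partition key:

```python
import hashlib
import json
import boto3

runtime = boto3.client("bedrock-runtime")
table = boto3.resource("dynamodb").Table("embedding-cache")  # placeholder table name

def get_embedding(doc_id: str, text: str) -> list[float]:
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    cached = table.get_item(Key={"doc_id": doc_id}).get("Item")
    if cached and cached["content_hash"] == content_hash:
        return json.loads(cached["embedding"])   # unchanged content: skip the model call
    resp = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 512, "normalize": True}),
    )
    vector = json.loads(resp["body"].read())["embedding"]
    table.put_item(Item={"doc_id": doc_id, "content_hash": content_hash,
                         "embedding": json.dumps(vector)})
    return vector
```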
GDPR right-to-be-forgotten on a fine-tuned model: user requests deletion of their PII from training data.
→Delete records from the training corpus, then fine-tune a fresh base model from scratch. Cannot reliably scrub data from existing weights — output filtering is not sufficient.
Why: Once weights absorb training data, masking at inference is unreliable. The defensible path is full retraining without the affected records.
Shared KB serves multiple teams; each team must only see its own documents.
→Tag every chunk with `tenant_id` / `team_id` / `clearance` metadata at ingest. At query time set `retrievalConfiguration.vectorSearchConfiguration.filter` to the caller's allowed values from the IAM session or app context.
Why: Vector similarity ignores access control; metadata filtering is the only durable per-tenant isolation in a shared KB.
Reference↗
EU customer requires that prompts and KB embeddings never leave eu-west-1.
→Deploy Bedrock + KB + S3 source bucket in eu-west-1. Pin invocations via inference profile ARN scoped to eu-west-1; SCP `aws:RequestedRegion` deny on other regions for `bedrock:*`.
Reference↗