Augment a foundation model with private company data (PDFs, docs, S3 content) without fine-tuning.
→Create an Amazon Bedrock Knowledge Base. Bedrock handles ingestion, chunking, and embedding at sync time, then retrieval (RAG) at inference time.
Why: Cheaper and faster to update than fine-tuning. Source data changes → re-sync the KB; no retraining.
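A minimal boto3 sketch of querying a Knowledge Base at inference time; the KB ID, model ARN, and question are placeholders:
```python
import boto3

# Assumed placeholders: knowledge base ID, model ARN, and the question text.
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is our parental-leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder KB ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])  # answer grounded in the retrieved chunks
```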
Data changes frequently (inventory, pricing, news) and the model must reflect current state.
→RAG with a knowledge base. Avoid fine-tuning — retraining cycles can't keep up.
Why: RAG separates the model from the data; the KB updates independently of the model.
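Re-syncing is just an ingestion job against the existing KB; a sketch with placeholder IDs:
```python
import boto3

# Assumed placeholders: knowledge base ID and data source ID.
# Re-syncing re-ingests changed source documents; the model is untouched.
agent = boto3.client("bedrock-agent", region_name="us-east-1")

job = agent.start_ingestion_job(
    knowledgeBaseId="KB1234567890",  # placeholder
    dataSourceId="DS1234567890",     # placeholder
)
print(job["ingestionJob"]["status"])  # e.g. STARTING
```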
Fine-tune a foundation model with labeled examples for a specific task.
→Provide prompt-completion (instruction-response) pairs. JSONL format is standard.
Why: Instruction fine-tuning teaches the model to map user inputs to desired outputs in the target task.
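A sketch of the prompt-completion JSONL shape; the example records are made up:
```python
import json

# Made-up labeled examples in the prompt/completion shape Bedrock fine-tuning expects.
examples = [
    {"prompt": "Classify the sentiment: 'The checkout flow is painless.'",
     "completion": "positive"},
    {"prompt": "Classify the sentiment: 'Support never answered my ticket.'",
     "completion": "negative"},
]

with open("train.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```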
Teach a foundation model specialized vocabulary (medical, legal, scientific) using lots of unlabeled domain text.
→Continued pre-training on the unlabeled domain corpus.
Why: Continued pre-training updates the model's understanding of vocabulary and concepts; instruction fine-tuning teaches task behavior. Different goal, different data shape.
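A sketch of kicking off a continued pre-training job; the names, ARNs, S3 URIs, and hyperparameter values are placeholders, and hyperparameter names vary by base model:
```python
import boto3

# Assumed placeholders: job/model names, IAM role ARN, and S3 URIs.
# Training data for continued pre-training is unlabeled text, one {"input": "<text>"} per JSONL line.
bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="domain-cpt-job",
    customModelName="titan-medical-cpt",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",  # not prompt/completion pairs
    trainingDataConfig={"s3Uri": "s3://my-bucket/domain-corpus/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/cpt-output/"},
    hyperParameters={"epochCount": "1", "batchSize": "1", "learningRate": "0.00001"},  # illustrative values
)
```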
Multi-step workflow that combines LLM reasoning with calls to external APIs, databases, or AWS services.
→Amazon Bedrock Agents — orchestrates LLM reasoning, tool/API invocation, and result synthesis in a single managed runtime.
Why: Agents plan steps, call tools, and stitch results back into a final response without you writing the orchestration loop.
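A sketch of calling an existing agent; the agent ID, alias ID, and session ID are placeholders:
```python
import boto3

# Assumed placeholders: agent ID, alias ID, session ID, and the request text.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.invoke_agent(
    agentId="AGENT12345",
    agentAliasId="ALIAS12345",
    sessionId="demo-session-1",
    inputText="Check stock for SKU 4711 and draft a reorder request.",
)

# The completion streams back in chunks; concatenate the text.
answer = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        answer += chunk["bytes"].decode("utf-8")
print(answer)
```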
Pick a vector database for embeddings.
→Managed RAG → Bedrock Knowledge Bases (handles vector store automatically). Custom vector DB → OpenSearch Service (k-NN), Aurora PostgreSQL with pgvector, Neptune Analytics, or RDS for PostgreSQL with pgvector.
Why: OpenSearch is the default for high-scale k-NN; pgvector reuses an existing relational DB.
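A toy pgvector similarity query via psycopg2; the table, column, and connection details are placeholders, and a real embedding has hundreds of dimensions:
```python
import psycopg2

# Assumes the pgvector extension is installed and `embedding` is a vector column.
conn = psycopg2.connect("dbname=app user=app host=aurora-cluster.example.com")

query_embedding = [0.12, -0.03, 0.44]  # toy 3-dim vector; would come from an embedding model
vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT doc_id, content
        FROM documents
        ORDER BY embedding <=> %s::vector  -- cosine distance operator
        LIMIT 5
        """,
        (vector_literal,),
    )
    for doc_id, content in cur.fetchall():
        print(doc_id, content[:80])
```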
Deploy a fine-tuned model from Bedrock for production serving.
→Buy Provisioned Throughput for the custom Bedrock model. Custom models cannot be invoked via on-demand pricing.
Why: Custom-model capacity is dedicated, billed in model units, and required for invocation.
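A sketch of purchasing capacity and invoking through it; the model ARN, name, and unit count are placeholders:
```python
import boto3

# Assumed placeholders: custom model ARN, provisioned model name, model unit count.
bedrock = boto3.client("bedrock", region_name="us-east-1")

pt = bedrock.create_provisioned_model_throughput(
    provisionedModelName="my-custom-model-pt",
    modelId="arn:aws:bedrock:us-east-1:111122223333:custom-model/my-custom-model",
    modelUnits=1,
)

# Invoke the custom model through its provisioned ARN, not on-demand.
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = runtime.invoke_model(
    modelId=pt["provisionedModelArn"],
    body='{"inputText": "Summarize our Q3 results."}',  # body shape depends on the base model
)
```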
Estimate or reduce Bedrock inference cost.
→Cost ≈ tokens processed × per-token rate. Reduce by shortening prompts, trimming few-shot examples, picking smaller models, or using prompt caching where supported.
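A back-of-the-envelope estimator; the per-token rates are illustrative placeholders, not current Bedrock pricing:
```python
# Illustrative rates only; substitute the actual rates for your model and region.
INPUT_RATE_PER_1K = 0.003    # USD per 1,000 input tokens
OUTPUT_RATE_PER_1K = 0.015   # USD per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost ≈ tokens processed × per-token rate, split by input and output."""
    return (input_tokens / 1000) * INPUT_RATE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_RATE_PER_1K

# 10,000 requests with ~800 prompt tokens and ~200 completion tokens each
print(f"${estimate_cost(10_000 * 800, 10_000 * 200):,.2f}")  # ≈ $54.00 at these rates
```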
Generate high-accuracy labeled data with human-in-the-loop review (e.g. specialized images, medical records).
→Amazon SageMaker Ground Truth Plus — managed HITL labeling workforce.
Why: Ground Truth Plus supplies a managed expert workforce and workflow for high-accuracy labels. For ongoing review of low-confidence model predictions, pair with Amazon A2I (Augmented AI).
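A sketch of the A2I side, routing a low-confidence prediction to human review; the flow definition ARN, loop name, and confidence threshold are placeholders and assume a review workflow already exists:
```python
import json
import boto3

# Assumed placeholders: flow definition ARN, human loop name, 0.80 threshold.
a2i = boto3.client("sagemaker-a2i-runtime", region_name="us-east-1")

prediction = {"label": "malignant", "confidence": 0.62}

if prediction["confidence"] < 0.80:  # route low-confidence results to reviewers
    a2i.start_human_loop(
        HumanLoopName="review-case-0042",
        FlowDefinitionArn="arn:aws:sagemaker:us-east-1:111122223333:flow-definition/medical-review",
        HumanLoopInput={"InputContent": json.dumps(prediction)},
    )
```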
Speech recognition mishears domain-specific terms (medical, legal, brand names).
→Amazon Transcribe with a custom language model or custom vocabulary trained on domain text.
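A sketch of a transcription job that applies a custom vocabulary; the job name, bucket, and vocabulary name are placeholders, and the vocabulary is assumed to exist already:
```python
import boto3

# Assumed placeholders: job name, S3 URI, and custom vocabulary name.
transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="clinic-notes-001",
    LanguageCode="en-US",
    Media={"MediaFileUri": "s3://my-bucket/audio/clinic-notes-001.wav"},
    Settings={"VocabularyName": "medical-terms"},  # domain terms the model tends to mishear
)
```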
Model performs well on training but poorly in production (overfit) — increase generalization without changing architecture.
→Increase the volume and diversity of the training data. Don't shrink the dataset or rely on hyperparameter tuning alone.
Why: More representative data is the highest-leverage fix; regularization and early stopping help but data dominates.
Evaluate generative output quality.
→Translation quality → BLEU. Summarization quality → ROUGE. Semantic similarity to reference → BERTScore. Stylistic preference → human evaluation with custom prompt sets.
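A toy sketch of BLEU and ROUGE with nltk and rouge-score; the reference/candidate strings are made up, and real evaluation uses a held-out reference set:
```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# Made-up strings for illustration only.
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU: n-gram precision against the reference (bigram weights for this toy example).
bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))

# ROUGE: recall-oriented overlap, typical for summarization.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```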
Pick a Bedrock foundation model for a use case where output style matters.
→Run human evaluation on a custom prompt dataset across candidate models. Don't rely on public leaderboards or latency metrics alone.
Why: Style/tone fit is subjective; benchmarks miss it.
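A sketch of collecting candidate outputs on a custom prompt set for human raters; the model IDs and prompts are placeholders:
```python
import boto3

# Assumed placeholders: candidate model IDs and the custom prompt set.
# Outputs go to human raters; no automated score is computed here.
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

candidates = ["anthropic.claude-3-haiku-20240307-v1:0", "amazon.titan-text-express-v1"]
prompts = ["Write a friendly 2-sentence reply to a late-delivery complaint."]

for model_id in candidates:
    for prompt in prompts:
        resp = runtime.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        print(model_id, "→", resp["output"]["message"]["content"][0]["text"])
```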
Generate charts and dashboards from natural-language questions over business data.
→Amazon Q in QuickSight — natural-language BI over QuickSight datasets.