Playbook: CNCF Certified Cloud Native Platform Engineering Associate (CNPA)

Last reviewed: May 2026

A scannable reference of architectural patterns the CNPA exam tests. Read top-to-bottom, or jump to a section.

Platform Engineering Core Fundamentals

Establish the core principle for a platform team to ensure adoption and reduce developer friction.

Treat the internal platform as a product. Treat internal developers as customers, conduct user research, gather feedback, and iterate on features to reduce their cognitive load.

Why: This mindset shifts the focus from building infrastructure to delivering value, ensuring the platform solves real developer problems and isn't bypassed in favor of "shadow IT".

Establish a single source of truth for the desired state of all infrastructure and applications.

Use Git repositories as the single source of truth. Deploy an in-cluster agent (ArgoCD, Flux) that runs a continuous reconciliation loop to compare the cluster state against Git.

Why: This provides a complete audit trail, enables easy rollbacks, and prevents configuration drift by automatically reverting out-of-band changes.
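
The reconciliation loop can be pictured as a diff between two states followed by corrective actions. The sketch below is a simplified model, not ArgoCD or Flux code: the `plan` and `reconcile` helpers and the resource names are hypothetical, and it assumes automatic pruning and self-healing are enabled (in real tools these are typically configurable).

```python
from typing import Dict

State = Dict[str, dict]  # resource name -> manifest


def plan(desired: State, actual: State) -> Dict[str, list]:
    """Compute the actions needed to make `actual` match `desired`."""
    return {
        "create": sorted(n for n in desired if n not in actual),
        "update": sorted(n for n in desired if n in actual and desired[n] != actual[n]),
        "delete": sorted(n for n in actual if n not in desired),
    }


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan; Git (desired) always wins over out-of-band changes."""
    actions = plan(desired, actual)
    result = dict(actual)
    for name in actions["create"] + actions["update"]:
        result[name] = desired[name]
    for name in actions["delete"]:
        del result[name]
    return result


# Desired state as committed to Git (hypothetical resources).
git_state = {"web": {"replicas": 3}, "api": {"replicas": 2}}
# Live state after someone scaled `web` by hand and left a debug pod behind.
drifted_state = {"web": {"replicas": 5}, "debug-pod": {"replicas": 1}}

drift = plan(git_state, drifted_state)
cluster_state = reconcile(git_state, drifted_state)
```

Running the loop repeatedly is safe: once the cluster matches Git, the plan is empty and nothing changes.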

Prevent configuration drift and ensure consistency of deployed artifacts across all environments.

Treat infrastructure as immutable. Never modify running resources. Instead, create new, versioned artifacts (container images, VM images) and replace the old ones. Enforce this with read-only container filesystems (`readOnlyRootFilesystem: true`).

Why: Immutability eliminates configuration drift and makes deployments predictable and repeatable. "Replace, don't repair."

Choose a secure GitOps deployment model, especially in multi-cluster or restricted network environments.

Implement a pull-based model. An agent (ArgoCD, Flux) running inside the cluster pulls manifests from Git. Avoid push-based models where an external CI system pushes to the Kubernetes API.

Why: Pull-based models are more secure as they don't require exposing the Kubernetes API server externally or managing credentials for multiple clusters in CI.

Accelerate development and ensure best practices without overly restricting experienced teams.

Define "golden paths" (or paved roads): pre-configured, well-supported templates and workflows for common tasks (e.g., creating a new microservice).

Why: Golden paths reduce cognitive load and decision fatigue for the 80% case, but should still allow "escape hatches" for expert teams with unique requirements.

Provide multi-tenancy in a shared Kubernetes platform with appropriate isolation levels.

For the strongest isolation, use separate clusters. For a balance of strong isolation and efficiency, use virtual clusters (for example, vCluster). For basic, soft multi-tenancy, use namespace-level isolation with RBAC, NetworkPolicies, and ResourceQuotas.

Why: The choice depends on the security and "noisy neighbor" risk. Virtual clusters provide control plane isolation without the cost of full physical clusters.

Define the primary interaction mode between the platform team and stream-aligned (product) teams.

The platform team should primarily operate in an "X-as-a-Service" mode, providing self-service tools, APIs, and documentation.

Why: At scale, a platform team cannot use a high-touch collaboration model with every team. The as-a-service model enables scaling and developer autonomy.

Platform Observability, Security, and Conformance

Implement a comprehensive observability strategy for a distributed system.

Collect and correlate the three pillars: Metrics (numeric time-series data via Prometheus), Logs (structured events via Fluent Bit), and Traces (request flows via OpenTelemetry).

Why: No single pillar is sufficient. Correlating them (e.g., embedding trace IDs in logs) is essential for quickly diagnosing issues in complex microservice architectures.
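
One way to make logs joinable with traces is to stamp every structured log line with the active trace ID. The sketch below uses only the Python standard library; `current_trace_id`, `TraceJsonFormatter`, and `format_line` are hypothetical names, and the random ID stands in for the ID an OpenTelemetry SDK would normally supply.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Hypothetical holder for the current trace ID; a real tracer would supply it.
current_trace_id = ContextVar("current_trace_id", default="")


class TraceJsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        })


def format_line(message: str) -> str:
    """Format one INFO-level message the way a configured handler would."""
    record = logging.LogRecord("app", logging.INFO, "app.py", 0, message, None, None)
    return TraceJsonFormatter().format(record)


# Simulate the start of a request: a new trace ID becomes the current one.
current_trace_id.set(uuid.uuid4().hex)
line = format_line("payment authorized")
print(line)
```

With the ID present in every log line, a log backend can be queried by `trace_id` to pull up exactly the lines that belong to one slow or failing trace.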

Enforce security and organizational policies across all Kubernetes clusters automatically.

Use a policy engine like OPA/Gatekeeper or Kyverno, integrated as a validating/mutating admission controller. Store policies in Git and sync them via GitOps.

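Why: This provides automated, preventative guardrails. Non-compliant resources are rejected at admission time with an immediate error message, and the same policies can also be evaluated in the CI/CD pipeline for even earlier feedback, replacing slow, manual review gates.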
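
Conceptually, a validating admission policy is a function from an incoming manifest to a list of violations. The sketch below is plain Python rather than Rego or Kyverno YAML; the required labels and the "no `latest` tag" rule are hypothetical organizational policies chosen for illustration.

```python
from typing import List

REQUIRED_LABELS = ("team", "cost-center")  # hypothetical organizational policy


def validate_pod(pod: dict) -> List[str]:
    """Return a list of violations; an empty list means the pod is admitted."""
    violations = []
    labels = pod.get("metadata", {}).get("labels", {})
    for key in REQUIRED_LABELS:
        if key not in labels:
            violations.append(f"missing required label '{key}'")
    for container in pod.get("spec", {}).get("containers", []):
        image = container.get("image", "")
        # Look for a tag only after the last '/', so a registry port is not
        # mistaken for a tag. No tag at all means the implicit 'latest'.
        name_part = image.rsplit("/", 1)[-1]
        tag = name_part.split(":", 1)[1] if ":" in name_part else "latest"
        # Images pinned by digest ('@sha256:...') are immutable, so they pass.
        if "@" not in image and tag == "latest":
            violations.append(
                f"container '{container.get('name')}' uses a mutable 'latest' tag"
            )
    return violations


bad_pod = {
    "metadata": {"labels": {"team": "payments"}},
    "spec": {"containers": [{"name": "web", "image": "nginx"}]},
}
good_pod = {
    "metadata": {"labels": {"team": "payments", "cost-center": "cc-42"}},
    "spec": {"containers": [{"name": "web", "image": "registry.example.com:5000/web:1.4.0"}]},
}
```

Because the policy is just code stored in Git, the same check can run in a pipeline step and in the cluster, so developers see the same message in both places.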

Select a policy engine for Kubernetes based on team skillsets and policy complexity.

Use Kyverno for policies that can be expressed in familiar Kubernetes-style YAML. Use OPA/Gatekeeper for complex policies requiring a more powerful, purpose-built language (Rego) and external data integration.

Why: Kyverno has a lower learning curve for Kubernetes practitioners. OPA/Rego is more powerful but requires learning a new language.

Ensure the integrity and authenticity of container images deployed to production.

Implement image signing in the CI pipeline using Sigstore/Cosign. Use a policy controller (Kyverno, Gatekeeper) to create an admission policy that verifies image signatures before allowing a pod to be created.

Why: This ensures that only images built by trusted CI pipelines and which have not been tampered with can run in the cluster, preventing unauthorized code execution.

Secure all service-to-service communication within the cluster with a zero-trust approach.

Deploy a service mesh (e.g., Istio, Linkerd) and enable strict mutual TLS (mTLS) for all in-mesh traffic.

Why: mTLS provides both encryption in transit and strong, cryptographically verifiable identity for both client and server, preventing spoofing and man-in-the-middle attacks inside the cluster.

Enforce security best practices for all workloads running in the cluster.

Enable the built-in Pod Security Admission controller. Configure namespaces to enforce the `restricted` profile for workloads and `baseline` for platform components.

Why: The `restricted` profile enforces critical security hardening (e.g., run as non-root, drop all capabilities, disallow privilege escalation) and is a foundational security measure.
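
Pod Security Admission is configured per namespace through labels such as `pod-security.kubernetes.io/enforce: restricted`. To make the hardening concrete, the sketch below checks only the three settings named above; the real `restricted` profile covers more (for example seccomp and volume types), and `restricted_violations` is a hypothetical helper, not part of Kubernetes.

```python
def restricted_violations(pod_spec: dict) -> list:
    """Check a subset of the `restricted` profile for every container."""
    pod_sc = pod_spec.get("securityContext", {})
    problems = []
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name", "?")
        # runAsNonRoot may be set on the container or inherited from the pod.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False))
        if run_as_non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        # Must be set explicitly to false; leaving it unset is not enough.
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
    return problems


hardened = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
defaults_only = {"containers": [{"name": "app"}]}
```

A pod that relies on defaults fails all three checks, which is why golden-path templates usually ship these fields pre-filled.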

Detect anomalous or malicious behavior inside running containers at the OS level.

Deploy a runtime security tool that uses eBPF, such as Falco or Tetragon. Define rules to detect suspicious system calls, file access, and process execution.

Why: Host- and network-based security tools often have limited visibility into activity inside containers. eBPF provides deep, low-overhead visibility into kernel-level events, enabling detection of threats that other tools miss.

Build a scalable and resilient observability data pipeline.

Use the OpenTelemetry (OTel) Collector. Chain processors to transform data (e.g., `attributes` processor to remove PII, `batch` processor for efficiency). Use the `memory_limiter` processor early in the pipeline to prevent OOMs.

Why: The Collector decouples instrumentation from backends and provides a flexible, vendor-neutral way to process, filter, and route telemetry data before export.
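
The processor chain behaves like function composition over batches of telemetry. The sketch below models that ordering in plain Python; it is not the Collector's configuration format, and the item-count budget is a simplified stand-in for the real `memory_limiter`, which watches process memory. All function names and the `user.email` attribute are hypothetical.

```python
from typing import Callable, Iterable, List

Span = dict
Processor = Callable[[list], list]


def memory_limiter(max_items: int) -> Processor:
    """Stand-in for memory_limiter: refuses data beyond a fixed item budget."""
    def process(spans: List[Span]) -> List[Span]:
        if len(spans) > max_items:
            raise MemoryError(f"refusing {len(spans)} spans, limit is {max_items}")
        return spans
    return process


def delete_attributes(keys: Iterable[str]) -> Processor:
    """Stand-in for the attributes processor with delete actions (PII removal)."""
    blocked = set(keys)
    def process(spans: List[Span]) -> List[Span]:
        return [{k: v for k, v in s.items() if k not in blocked} for s in spans]
    return process


def batch(size: int) -> Processor:
    """Stand-in for the batch processor: group spans into export-sized chunks."""
    def process(spans: List[Span]) -> List[List[Span]]:
        return [spans[i:i + size] for i in range(0, len(spans), size)]
    return process


def run_pipeline(spans: List[Span], processors: List[Processor]) -> list:
    data = spans
    for processor in processors:
        data = processor(data)
    return data


spans = [{"name": f"op{i}", "user.email": "someone@example.com"} for i in range(5)]
# The limiter comes first so overload is rejected before any work is done.
batches = run_pipeline(
    spans,
    [memory_limiter(100), delete_attributes(["user.email"]), batch(2)],
)
```

Placing the limiter first means overload is rejected before the pipeline spends effort transforming data it would have to drop anyway.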

Continuous Delivery & Platform Engineering

Deploy new application versions to production while minimizing risk and blast radius.

Implement automated canary deployments using a tool like Flagger or Argo Rollouts. Gradually shift traffic to the new version while automatically analyzing key metrics (success rate, latency). Roll back automatically on SLO violation.

Why: Automated canary analysis validates new versions with real production traffic, providing a much higher degree of safety than simple rolling updates.
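
The control flow of automated canary analysis is a loop over traffic weights with a metric check at each step. The sketch below is a minimal model, not the Flagger or Argo Rollouts API; the weights, the 99% threshold, and the `success_rate_at` callback (which would query a metrics backend in practice) are assumptions.

```python
from typing import Callable, List, Tuple


def run_canary(
    weights: List[int],
    success_rate_at: Callable[[int], float],
    min_success_rate: float = 0.99,
) -> Tuple[str, int]:
    """Step through traffic weights; roll back on the first failed check.

    Returns ("promoted", 100) or ("rolled_back", weight_at_failure).
    """
    for weight in weights:
        if success_rate_at(weight) < min_success_rate:
            return ("rolled_back", weight)
    return ("promoted", 100)


def healthy(weight: int) -> float:
    """Hypothetical metric source: the new version behaves well at any load."""
    return 0.999


def degrades_under_load(weight: int) -> float:
    """Hypothetical metric source: errors appear once traffic reaches 50%."""
    return 0.999 if weight < 50 else 0.95


steps = [10, 25, 50, 100]
good_result = run_canary(steps, healthy)
bad_result = run_canary(steps, degrades_under_load)
```

In the failing case only part of the traffic ever reached the bad version, which is the blast-radius reduction the pattern is meant to provide.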

Deploy a new version of an application with the ability to perform an instant rollback.

Maintain two identical production environments ("blue" and "green"). Deploy the new version to the inactive (green) environment. After validation, switch the load balancer to route all traffic to green. Keep blue idle for instant rollback.

Why: This pattern provides zero-downtime deployments and the fastest possible rollback, but typically requires double the infrastructure resources.

Manage secrets declaratively in a GitOps workflow without storing plaintext credentials in Git.

Use a dedicated secrets tool. Either encrypt secrets before committing (Bitnami Sealed Secrets, SOPS) or reference secrets from an external vault (External Secrets Operator).

Why: This keeps sensitive data out of Git while allowing secrets to be managed declaratively alongside the application configuration, maintaining the GitOps workflow.

Manage application configurations across multiple environments (dev, staging, prod) without duplication.

Use a tool like Kustomize with a base-and-overlay structure, or Helm with environment-specific values files. Promote changes by updating image tags or configuration in the target environment's overlay/values file, typically via a pull request.

Why: This "Don't Repeat Yourself" (DRY) approach prevents configuration drift between environments and makes differences explicit and auditable.
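
The base-and-overlay idea reduces to a recursive merge in which the environment-specific layer wins. The sketch below is closest to how Helm layers values files; Kustomize patching is richer (strategic merge and JSON patches), so treat `apply_overlay` and the sample values as an illustrative simplification.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into a copy of base; overlay values win."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared defaults (hypothetical values).
base = {
    "image": {"repository": "registry.example.com/shop", "tag": "1.4.0"},
    "replicas": 1,
    "resources": {"cpu": "100m"},
}
# Production only states what differs: an older, already-validated tag and more replicas.
prod_overlay = {"image": {"tag": "1.3.2"}, "replicas": 4}

prod = apply_overlay(base, prod_overlay)
```

Promoting a release to production is then a one-line pull request that changes `tag` in the production overlay.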

Manage deployments of the same application across a large, dynamic fleet of clusters.

Use ArgoCD ApplicationSets with a cluster generator. The generator dynamically discovers clusters based on labels and uses a template to generate an Application resource for each matching cluster.

Why: This automates application bootstrapping for new clusters and manages configuration at scale, avoiding the need to manually create hundreds of Application resources.
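
The cluster generator amounts to "filter by label selector, then render one template per match". The sketch below models that in Python; it does not reproduce the ApplicationSet schema, and the cluster inventory, repository URL, and field names are hypothetical.

```python
from typing import Dict, List


def matches(labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """True when every selector key/value pair is present in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def generate_applications(
    clusters: List[dict],
    selector: Dict[str, str],
    app_name: str,
    repo_url: str,
    path: str,
) -> List[dict]:
    """Render one Application-like record per cluster that matches the selector."""
    apps = []
    for cluster in clusters:
        if not matches(cluster["labels"], selector):
            continue
        apps.append({
            "name": f"{cluster['name']}-{app_name}",
            "source": {"repoURL": repo_url, "path": path},
            "destination": {"server": cluster["server"]},
        })
    return apps


clusters = [
    {"name": "eu-prod", "server": "https://eu.example.com", "labels": {"env": "prod"}},
    {"name": "us-prod", "server": "https://us.example.com", "labels": {"env": "prod"}},
    {"name": "dev", "server": "https://dev.example.com", "labels": {"env": "dev"}},
]
apps = generate_applications(
    clusters, {"env": "prod"}, "ingress",
    "https://git.example.com/platform/addons.git", "ingress",
)
```

Registering a new cluster with the label `env: prod` is enough for it to receive the application on the next evaluation; no per-cluster resource is written by hand.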

Enable continuous deployment to production while controlling the release of new features to users.

Integrate a feature flagging system. Deploy new code to production behind a disabled feature flag. Release the feature by enabling the flag for specific user segments, decoupling deployment from release.

Why: This separates technical risk (deployment) from business risk (release), enabling high-velocity deployments, A/B testing, and "kill switch" capabilities.
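
A flag evaluation typically combines a global kill switch, explicit segments, and a stable percentage rollout. The sketch below is a minimal in-process model, not any vendor's SDK; the flag structure, the `beta` segment, and the hashing scheme are assumptions.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: dict, user: dict) -> bool:
    """Decide whether `user` sees the feature guarded by `flag`."""
    if not flag["enabled"]:  # kill switch: overrides everything else
        return False
    if user.get("segment") in flag.get("segments", []):
        return True
    # Hashing keeps each user's answer stable across requests and restarts.
    return bucket(flag["name"], user["id"]) < flag.get("rollout_percent", 0)


new_checkout = {
    "name": "new-checkout",
    "enabled": True,
    "segments": ["beta"],
    "rollout_percent": 0,
}
beta_user = {"id": "u-1", "segment": "beta"}
regular_user = {"id": "u-2", "segment": "standard"}
```

Raising `rollout_percent` releases the feature gradually, and setting `enabled` to false turns it off for everyone without a new deployment.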

Automatically deploy new container images as soon as they are pushed to a registry.

Use FluxCD's Image Automation components. The `ImageRepository` scans the registry, the `ImagePolicy` selects the new tag (e.g., based on semver), and the `ImageUpdateAutomation` commits the tag change back to the Git repository.

Why: This closes the loop from CI (image push) to CD (deployment) for a fully automated GitOps workflow, without the CI system needing access to the cluster.
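
The tag-selection step is where naive string sorting goes wrong, since `1.9.0` sorts after `1.10.2` alphabetically. The sketch below shows numeric semver selection within one major version, similar in spirit to a `1.x` range; it is not Flux code, and `parse` and `select_latest` are hypothetical helpers that ignore pre-release tags.

```python
import re
from typing import List, Optional, Tuple

# Stable versions only: MAJOR.MINOR.PATCH with an optional leading 'v'.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) or None when the tag is not stable semver."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def select_latest(tags: List[str], major: int) -> Optional[str]:
    """Pick the highest stable tag within one major version."""
    candidates = []
    for tag in tags:
        version = parse(tag)
        if version is not None and version[0] == major:
            candidates.append((version, tag))
    # Tuples compare numerically field by field, so (1, 10, 2) > (1, 9, 0).
    return max(candidates)[1] if candidates else None


registry_tags = ["1.9.0", "1.10.2", "latest", "2.0.0", "1.10.2-rc1", "main-abc123"]
selected = select_latest(registry_tags, major=1)
```

The selected tag is what the automation would write back to Git; the in-cluster agent then deploys it through the normal reconciliation path.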

Platform APIs and Provisioning Infrastructure

Provide a unified, declarative API for developers to self-service provision both Kubernetes and cloud infrastructure resources (e.g., databases, message queues).

Use Crossplane. Install cloud providers and define high-level CompositeResourceDefinitions (XRDs) for developers (e.g., `kind: PostgreSQLInstance`). Map these to underlying cloud resources using Compositions.

Why: This extends the Kubernetes control plane to manage external resources, allowing developers to use familiar `kubectl` and GitOps workflows for all their application dependencies, governed by platform-defined patterns.

Automate complex, stateful application lifecycle management (e.g., installation, upgrades, backups, failure recovery) in a Kubernetes-native way.

Build a Kubernetes Operator. Define a Custom Resource Definition (CRD) for your application and implement a custom controller that runs a reconciliation loop to manage the application's state.

Why: Operators encode human operational knowledge into software, enabling robust automation and treating complex applications as first-class Kubernetes resources.

Ensure an operator can perform cleanup of external resources (e.g., a cloud load balancer) before its associated Custom Resource is deleted from Kubernetes.

Add a finalizer to the Custom Resource metadata. When a user deletes the CR, the API server sets `metadata.deletionTimestamp` instead of removing the object (`kubectl` shows it as `Terminating`). The operator's reconciliation logic detects the timestamp, performs cleanup, and then removes the finalizer, allowing the API server to complete the deletion.

Why: Without a finalizer, the CR could be deleted before the operator has time to clean up external resources, leading to orphaned, costly infrastructure.
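
The finalizer flow fits in a single reconcile function with two branches, one for live objects and one for objects marked for deletion. The sketch below models the object as a dict and the cloud as a set; the finalizer name, `reconcile_cr`, and the `lb-` naming are hypothetical, and a real operator would use a client library instead of mutating a dict.

```python
FINALIZER = "example.com/cleanup-load-balancer"  # hypothetical finalizer name


def reconcile_cr(cr: dict, external: set) -> str:
    """One reconcile pass for a custom resource that owns a cloud load balancer."""
    meta = cr["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    lb_id = f"lb-{meta['name']}"

    if meta.get("deletionTimestamp"):
        if FINALIZER in finalizers:
            external.discard(lb_id)       # clean up the external resource first
            finalizers.remove(FINALIZER)  # then let the API server finish deleting
            return "cleaned-up"
        return "nothing-to-do"

    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)      # register before creating anything external
        return "finalizer-added"

    external.add(lb_id)                   # idempotent: safe to repeat every pass
    return "in-sync"


cloud = set()
shop = {"metadata": {"name": "shop"}}
history = [reconcile_cr(shop, cloud), reconcile_cr(shop, cloud)]
lb_existed = "lb-shop" in cloud

# A user deletes the object: the API server only sets the timestamp.
shop["metadata"]["deletionTimestamp"] = "2026-05-01T09:00:00Z"
history.append(reconcile_cr(shop, cloud))
```

Adding the finalizer before creating the load balancer closes the window in which the object could be deleted while the external resource already exists.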

Manage the lifecycle of a fleet of Kubernetes clusters themselves using declarative, GitOps-friendly tooling.

Use Cluster API (CAPI). A management cluster runs CAPI controllers that reconcile `Cluster` and `Machine` resources to provision and configure workload clusters across various cloud providers.

Why: CAPI turns cluster management into a declarative Kubernetes workflow, enabling consistent, automated, and version-controlled provisioning and upgrades of entire clusters.

Evolve platform APIs (defined as CRDs) without breaking existing users or requiring a "big bang" migration.

Support multiple versions in the CRD definition (e.g., v1beta1, v1). Implement a conversion webhook to translate between versions, allowing new clients to use v1 while old clients continue to use v1beta1 against the same stored object.

Why: Conversion webhooks are the native Kubernetes mechanism for enabling non-disruptive API evolution, which is critical for a stable platform product.
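
The heart of a conversion webhook is a pure, lossless function between versions. The sketch below assumes a hypothetical schema change in which v1beta1's `spec.size` moved to `spec.storage.size` in v1; the API group and field names are invented, and the HTTP handling of the `ConversionReview` request and response is omitted.

```python
import copy

GROUP = "platform.example.com"  # hypothetical API group
V1BETA1 = f"{GROUP}/v1beta1"
V1 = f"{GROUP}/v1"


def convert(obj: dict, desired_api_version: str) -> dict:
    """Convert between v1beta1 (spec.size) and v1 (spec.storage.size)."""
    out = copy.deepcopy(obj)
    current = obj["apiVersion"]
    if current == desired_api_version:
        return out
    spec = out["spec"]
    if (current, desired_api_version) == (V1BETA1, V1):
        spec["storage"] = {"size": spec.pop("size")}
    elif (current, desired_api_version) == (V1, V1BETA1):
        spec["size"] = spec.pop("storage")["size"]
    else:
        raise ValueError(f"unsupported conversion {current} -> {desired_api_version}")
    out["apiVersion"] = desired_api_version
    return out


old_object = {
    "apiVersion": V1BETA1,
    "kind": "Database",
    "metadata": {"name": "orders"},
    "spec": {"size": "20Gi", "engine": "postgres"},
}
new_object = convert(old_object, V1)
round_trip = convert(new_object, V1BETA1)
```

The round trip returning the original object is the property to test for: if any field is lost in either direction, clients on the older version will silently lose data.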

IDPs and Developer Experience

Reduce developer cognitive load and improve discoverability by centralizing tools, documentation, and software assets.

Implement an Internal Developer Portal (IDP) using a framework like CNCF Backstage. Populate its Software Catalog, provide Software Templates for scaffolding new services, and integrate TechDocs for "docs-as-code".

Why: An IDP acts as a "single pane of glass" for developers, providing golden paths and self-service capabilities that abstract platform complexity and accelerate onboarding and development.

Provide a single, reliable inventory of all software in the organization, including ownership, dependencies, and operational status.

Implement a software catalog (e.g., Backstage Software Catalog) populated via `catalog-info.yaml` files in Git repositories. This creates a central, searchable registry of services, libraries, APIs, etc.

Why: A catalog solves discoverability ("what services exist?") and ownership ("who do I talk to about this service?"), which is critical for scaling microservice architectures.

Enable developers to create new, production-ready services that adhere to organizational standards in minutes.

Use a scaffolding tool like Backstage Software Templates. Define templates that generate a new Git repo with standard project structure, CI/CD pipeline configuration, observability dashboards, and `catalog-info.yaml`.

Why: Templates codify best practices and provide a "paved road" for developers, drastically reducing the time-to-first-commit and ensuring new services are created with security, observability, and compliance built-in.

Ensure technical documentation is up-to-date, versioned, and co-located with the software it describes.

Adopt a "docs-as-code" approach. Store documentation in Markdown files within the service's Git repository. Use a tool like Backstage TechDocs to automatically build and render this documentation in the IDP.

Why: This model treats documentation like code: it is reviewed in pull requests and versioned alongside the feature it describes, which helps prevent stale docs.

Measuring Your Platform

Measure the effectiveness of the platform and its impact on software delivery performance.

Track the four DORA metrics: Deployment Frequency (velocity), Lead Time for Changes (velocity), Change Failure Rate (stability), and Time to Restore Service (MTTR, stability).

Why: DORA metrics are industry-standard, outcome-oriented measures that are proven to correlate with organizational performance. They provide a balanced view of both speed and stability.
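
All four metrics can be derived from a simple log of deployments. The sketch below assumes each record carries a commit time, a deploy time, a failure flag, and a restore time; the `Deployment` record and the choice of medians are illustrative assumptions, not an official DORA calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Summarize the four DORA metrics for one reporting period."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "median_time_to_restore": median(restores) if restores else None,
    }


start = datetime(2026, 5, 1, 9, 0)
log = [
    Deployment(start, start + timedelta(hours=1)),
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=3), failed=True,
               restored_at=start + timedelta(hours=3, minutes=30)),
    Deployment(start, start + timedelta(hours=4)),
]
metrics = dora_metrics(log, period_days=2)
```

Reporting the two velocity metrics next to the two stability metrics makes it visible when a team speeds up by shipping more failures.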

Provide accurate, granular cost visibility to teams using a shared Kubernetes platform.

Deploy a FinOps tool like OpenCost or Kubecost. Attribute costs to workloads based on their actual resource consumption over time. Allocate shared cluster costs (e.g., system components, node overhead) proportionally.

Why: Accurate chargeback/showback drives accountability and encourages teams to optimize resource usage. Without it, shared platform costs are opaque and difficult to manage.
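
Proportional allocation of shared costs is simple arithmetic: each team pays its direct cost plus a share of the shared cost equal to its share of direct usage. The sketch below is not the OpenCost or Kubecost model; `allocate_costs`, the team names, and the figures are hypothetical.

```python
from typing import Dict


def allocate_costs(direct: Dict[str, float], shared_cost: float) -> Dict[str, float]:
    """Add each team's proportional share of the shared cost to its direct cost."""
    total_direct = sum(direct.values())
    if total_direct == 0:
        # Nobody used anything measurable: split the shared cost evenly.
        even = shared_cost / len(direct) if direct else 0.0
        return {team: even for team in direct}
    return {
        team: cost + shared_cost * cost / total_direct
        for team, cost in direct.items()
    }


# Monthly direct workload cost per team, plus shared system overhead.
direct_costs = {"payments": 600.0, "search": 300.0, "web": 100.0}
bill = allocate_costs(direct_costs, shared_cost=200.0)
```

Here `payments` carries 60% of direct usage and therefore 60% of the shared overhead, and the allocated totals still sum to the full cluster cost.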

Measure whether the platform is actually providing value and being used by development teams.

Track the adoption rate of key platform features, especially golden path templates and shared CI/CD pipelines. Supplement with developer satisfaction surveys (NPS-style).

Why: High adoption of optional, opinionated platform features is a strong signal that the platform is solving real problems. Low adoption indicates a mismatch with developer needs.

Assess the current state of the platform and create a roadmap for improvement.

Use a Platform Maturity Model to evaluate capabilities across multiple dimensions: e.g., Self-Service, Observability, Security, Reliability, and Governance. Define levels from ad-hoc/manual to fully automated and optimized.

Why: A maturity model provides a structured framework for self-assessment, helps identify weak spots, and aligns the team on a strategic vision for the platform's evolution.