A pod is stuck in the `Pending` state and is not being scheduled.
→Run `kubectl describe pod <pod-name>`. Check the `Events` section for messages from the scheduler.
Why: The `describe` command is the primary tool for this. It will show reasons like "Insufficient cpu/memory", "node(s) had taints the pod didn't tolerate", or "didn't match node selector".
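A minimal sketch of the two ways to surface scheduler events (pod name is a placeholder):

```shell
# Full description; scheduling failures appear under Events at the bottom
kubectl describe pod <pod-name>

# Or pull only the events for this pod, most recent last
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
```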
A pod is repeatedly starting and failing, with a `CrashLoopBackOff` status.
→1. `kubectl logs <pod-name> --previous` to see the logs from the crashed container. 2. `kubectl describe pod <pod-name>` to check the exit code and reason.
Why: `CrashLoopBackOff` means the application inside the container is exiting. The logs from the previous instance (`--previous`) are crucial, as the current container might not have logged anything useful yet. The exit code can also indicate the type of error.
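The two commands above can be sketched like this; the jsonpath assumes a single-container pod (index 0), adjust for sidecars:

```shell
# Logs from the container instance that just crashed
kubectl logs <pod-name> --previous

# Exit code of the last terminated container
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

As a rule of thumb, exit code 137 (128 + SIGKILL) often means the container was OOM-killed, while 1 is a generic application error.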
A pod fails to start with `ImagePullBackOff` or `ErrImagePull` status.
→`kubectl describe pod <pod-name>` to see the event message. Verify the image name and tag are correct. For private registries, ensure an `imagePullSecrets` entry is configured in the pod spec and that the referenced secret is valid.
Why: This is a registry or image name issue, not an application issue. Common causes are typos, incorrect tags, or authentication failure with a private registry.
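For a private registry, the wiring looks roughly like the fragment below. The secret is typically created with `kubectl create secret docker-registry regcred --docker-server=... --docker-username=... --docker-password=...`; the registry URL, secret name `regcred`, and image are placeholders:

```yaml
# Pod spec fragment: reference the pull secret by name
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3
```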
A node has a `NotReady` status.
→SSH into the affected node. Check the kubelet service status with `systemctl status kubelet`. View its logs with `journalctl -u kubelet`.
Why: The `kubelet` is the agent responsible for node health reporting. If it's down or can't communicate with the API server, the node will be marked NotReady. Its logs are the first place to look.
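A typical first pass, split between the control plane side and the node itself (node name is a placeholder):

```shell
# From a machine with cluster access: check the node's reported conditions
kubectl describe node <node-name>

# On the node itself, over SSH:
systemctl status kubelet
journalctl -u kubelet --no-pager -n 100   # last 100 kubelet log lines
```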
A service exists, but traffic is not reaching the backend pods.
→1. `kubectl describe svc <service-name>` and verify the `Selector` matches pod labels. 2. `kubectl get endpoints <service-name>` and ensure it lists the correct pod IPs. If not, the labels are mismatched.
Why: The link between a Service and its Pods is the label selector. If the selector is wrong or the pods don't have the right labels, the Endpoints object will be empty, and the service will have nowhere to route traffic.
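The check can be sketched as follows; `app=my-app` stands in for whatever the service's `Selector` line actually shows:

```shell
kubectl describe svc <service-name>    # note the Selector line
kubectl get endpoints <service-name>   # should list the backend pod IPs

# List the pods that actually match the selector, with their labels
kubectl get pods -l app=my-app --show-labels
```

If the endpoints list is empty but the pods are running, the selector and the pod labels disagree.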
Pods are unable to resolve service names or external hostnames.
→1. Check if CoreDNS pods are running in `kube-system`. 2. Check CoreDNS logs. 3. Run a debug pod (e.g., `busybox`) and use `nslookup` to test resolution from within the cluster.
Why: DNS is a critical cluster dependency. Failures usually trace back to the CoreDNS deployment itself, its configuration (in a ConfigMap), or network policies blocking DNS traffic on UDP/TCP port 53.
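A quick smoke test, assuming the standard `k8s-app=kube-dns` label that CoreDNS deployments carry (the busybox tag is illustrative):

```shell
# Is CoreDNS up, and is it logging errors?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# One-off debug pod; busybox's nslookup is limited but fine for a smoke test
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```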
A node must be taken offline for maintenance.
→First, `kubectl cordon <node-name>` to mark it unschedulable. Then, `kubectl drain <node-name> --ignore-daemonsets` to safely evict all user pods.
Why: `cordon` prevents new pods from being scheduled. `drain` respects PodDisruptionBudgets and evicts pods gracefully. `--ignore-daemonsets` is needed because evicted DaemonSet pods would be recreated immediately by their controller, so `drain` refuses to proceed without it.
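The full maintenance cycle looks roughly like this; `--delete-emptydir-data` is only needed if pods use `emptyDir` volumes (older kubectl versions called the flag `--delete-local-data`):

```shell
kubectl cordon <node-name>      # new pods will not be scheduled here
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ... perform maintenance, then make the node schedulable again ...
kubectl uncordon <node-name>
```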
Identify which pods or nodes are consuming the most CPU or memory.
→Use `kubectl top pods` and `kubectl top nodes`. This requires the `metrics-server` to be deployed in the cluster.
Why: `kubectl top` provides a quick, real-time view of resource consumption, essential for identifying resource-hungry applications or node resource pressure.
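With `metrics-server` installed, the heaviest consumers can be sorted directly:

```shell
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory   # or --sort-by=cpu
```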
A pod has been in the `Terminating` state for a long time and is not being removed.
→Force delete the pod with `kubectl delete pod <pod-name> --grace-period=0 --force`.
Why: This can happen if a finalizer is stuck or the kubelet cannot clean up resources. Force deletion removes the pod from the API server immediately, but should be used as a last resort as it may leave orphaned resources on the node.
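Since a stuck finalizer is the common culprit, it is worth checking for one before forcing:

```shell
# Check whether a finalizer is holding the pod
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

# Last resort: remove the pod from the API server immediately
kubectl delete pod <pod-name> --grace-period=0 --force
```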