A pod is stuck in the `Pending` state and is not being scheduled.
→Run `kubectl describe pod <pod-name>`. Check the `Events` section for messages from the scheduler.
Why: The `describe` command is the primary tool for this. It will show reasons like "Insufficient cpu/memory", "node(s) had taints the pod didn't tolerate", or "didn't match node selector".
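A minimal sketch of the two ways to surface scheduler events (pod name is a placeholder):

```shell
# Full description; scheduling failures appear under Events at the bottom
kubectl describe pod <pod-name>

# Or pull only the events for this pod, most recent last
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
```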
A pod is repeatedly starting and failing, with a `CrashLoopBackOff` status.
→1. `kubectl logs <pod-name> --previous` to see the logs from the crashed container. 2. `kubectl describe pod <pod-name>` to check the exit code and reason.
Why: `CrashLoopBackOff` means the application inside the container is exiting. The logs from the previous instance (`--previous`) are crucial, as the current container might not have logged anything useful yet. The exit code can also indicate the type of error.
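The two commands above can be sketched like this; the jsonpath assumes a single-container pod (index 0), adjust for sidecars:

```shell
# Logs from the container instance that just crashed
kubectl logs <pod-name> --previous

# Exit code of the last terminated container
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

As a rule of thumb, exit code 137 (128 + SIGKILL) often means the container was OOM-killed, while 1 is a generic application error.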
A pod fails to start with `ImagePullBackOff` or `ErrImagePull` status.
→`kubectl describe pod <pod-name>` to see the event message. Verify the image name and tag are correct. For private registries, ensure an `imagePullSecrets` entry is configured in the pod spec and that the referenced secret is valid.
Why: This is a registry or image name issue, not an application issue. Common causes are typos, incorrect tags, or authentication failure with a private registry.
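For a private registry, the wiring looks roughly like the fragment below. The secret is typically created with `kubectl create secret docker-registry regcred --docker-server=... --docker-username=... --docker-password=...`; the registry URL, secret name `regcred`, and image are placeholders:

```yaml
# Pod spec fragment: reference the pull secret by name
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3
```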
A node has a `NotReady` status.
→SSH into the affected node. Check the kubelet service status with `systemctl status kubelet`. View its logs with `journalctl -u kubelet`.
Why: The `kubelet` is the agent responsible for node health reporting. If it's down or can't communicate with the API server, the node will be marked NotReady. Its logs are the first place to look.
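A typical first pass, split between the control plane side and the node itself (node name is a placeholder):

```shell
# From a machine with cluster access: check the node's reported conditions
kubectl describe node <node-name>

# On the node itself, over SSH:
systemctl status kubelet
journalctl -u kubelet --no-pager -n 100   # last 100 kubelet log lines
```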
A service exists, but traffic is not reaching the backend pods.
→1. `kubectl describe svc <service-name>` and verify the `Selector` matches pod labels. 2. `kubectl get endpoints <service-name>` and ensure it lists the correct pod IPs. If not, the labels are mismatched.
Why: The link between a Service and its Pods is the label selector. If the selector is wrong or the pods don't have the right labels, the Endpoints object will be empty, and the service will have nowhere to route traffic.
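The check can be sketched as follows; `app=my-app` stands in for whatever the service's `Selector` line actually shows:

```shell
kubectl describe svc <service-name>    # note the Selector line
kubectl get endpoints <service-name>   # should list the backend pod IPs

# List the pods that actually match the selector, with their labels
kubectl get pods -l app=my-app --show-labels
```

If the endpoints list is empty but the pods are running, the selector and the pod labels disagree.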
Pods are unable to resolve service names or external hostnames.
→1. Check if CoreDNS pods are running in `kube-system`. 2. Check CoreDNS logs. 3. Run a debug pod (e.g., `busybox`) and use `nslookup` to test resolution from within the cluster.
Why: DNS is a critical cluster dependency. Failures usually trace back to the CoreDNS deployment itself, its configuration (in a ConfigMap), or network policies blocking DNS traffic on UDP/TCP port 53.
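A quick smoke test, assuming the standard `k8s-app=kube-dns` label that CoreDNS deployments carry (the busybox tag is illustrative):

```shell
# Is CoreDNS up, and is it logging errors?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# One-off debug pod; busybox's nslookup is limited but fine for a smoke test
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```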
A node must be taken offline for maintenance.
→First, `kubectl cordon <node-name>` to mark it unschedulable. Then, `kubectl drain <node-name> --ignore-daemonsets` to safely evict all user pods.
Why: `cordon` prevents new pods from being scheduled. `drain` respects PodDisruptionBudgets and evicts pods gracefully. `--ignore-daemonsets` is needed because evicted DaemonSet pods would be recreated immediately by their controller, so `drain` refuses to proceed without it.
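The full maintenance cycle looks roughly like this; `--delete-emptydir-data` is only needed if pods use `emptyDir` volumes (older kubectl versions called the flag `--delete-local-data`):

```shell
kubectl cordon <node-name>      # new pods will not be scheduled here
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ... perform maintenance, then make the node schedulable again ...
kubectl uncordon <node-name>
```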
Identify which pods or nodes are consuming the most CPU or memory.
→Use `kubectl top pods` and `kubectl top nodes`. This requires the `metrics-server` to be deployed in the cluster.
Why: `kubectl top` provides a quick, real-time view of resource consumption, essential for identifying resource-hungry applications or node resource pressure.
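With `metrics-server` installed, the heaviest consumers can be sorted directly:

```shell
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory   # or --sort-by=cpu
```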
A pod has been in the `Terminating` state for a long time and is not being removed.
→Force delete the pod with `kubectl delete pod <pod-name> --grace-period=0 --force`.
Why: This can happen if a finalizer is stuck or the kubelet cannot clean up resources. Force deletion removes the pod from the API server immediately, but should be used as a last resort as it may leave orphaned resources on the node.
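Since a stuck finalizer is the common culprit, it is worth checking for one before forcing:

```shell
# Check whether a finalizer is holding the pod
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

# Last resort: remove the pod from the API server immediately
kubectl delete pod <pod-name> --grace-period=0 --force
```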