Balance the need for service reliability with the need to release new features.
→Define a Service Level Objective (SLO) (e.g., 99.9% availability). The remaining 0.1% is the error budget. If the budget is mostly intact, ship features. If the budget is depleted, halt feature releases and focus on reliability improvements.
Why: The error budget provides a data-driven framework for making risk decisions, aligning engineering, product, and business teams on a common goal.
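A minimal sketch of the error-budget arithmetic, assuming a request-based availability SLO over a 30-day window; the request and failure counts are hypothetical:

```python
# Minimal sketch of error-budget math for a request-based availability SLO.
# The 30-day window, SLO target, and request counts are illustrative assumptions.

SLO_TARGET = 0.999            # 99.9% availability objective
WINDOW_REQUESTS = 10_000_000  # total requests in the 30-day window (hypothetical)
FAILED_REQUESTS = 4_200       # failed requests observed so far (hypothetical)

error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS  # allowed failures: 10,000
budget_consumed = FAILED_REQUESTS / error_budget   # fraction of budget spent

if budget_consumed < 0.8:
    print(f"{budget_consumed:.0%} of error budget spent -- keep shipping features")
else:
    print(f"{budget_consumed:.0%} of error budget spent -- freeze releases, focus on reliability")
```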
Learn from incidents to prevent them from recurring, while fostering a culture of psychological safety.
→Conduct blameless postmortems after incidents. Focus the investigation on systemic factors, process gaps, and tooling failures, not on attributing blame to individuals. The output should be a list of actionable improvement items.
Why: A blameless culture encourages honest and open communication, leading to a more accurate understanding of an incident's root causes and more effective preventative actions.
Coordinate the response to a major incident effectively, avoiding confusion and duplicated effort.
→Implement an Incident Command System (ICS) with clearly defined roles: Incident Commander (overall coordination), Operations Lead (technical investigation/fix), and Communications Lead (stakeholder updates).
Why: ICS provides a standardized, scalable structure for incident response, ensuring clear lines of authority and communication, which is crucial for resolving complex issues quickly.
Measure the performance of a software delivery organization.
→Track the four key DORA metrics: Deployment Frequency (how often), Lead Time for Changes (how fast from commit to deploy), Change Failure Rate (what percentage of deployments cause failure), and Time to Restore Service (MTTR).
Why: These four metrics provide a balanced view of both development velocity and operational stability, and the DORA research shows they correlate strongly with overall organizational performance.
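A rough sketch of computing the four metrics from deployment records; the record fields (`committed_at`, `deployed_at`, `failed`, `restored_at`) are an assumed in-house schema, not a standard API:

```python
# Rough sketch of the four DORA metrics computed from deployment records.
# The record structure is an assumed in-house schema, not a standard API.
from datetime import datetime, timedelta
from statistics import median

deployments = [
    {"committed_at": datetime(2024, 5, 1, 9), "deployed_at": datetime(2024, 5, 1, 15),
     "failed": False, "restored_at": None},
    {"committed_at": datetime(2024, 5, 2, 10), "deployed_at": datetime(2024, 5, 3, 11),
     "failed": True, "restored_at": datetime(2024, 5, 3, 12, 30)},
]
window_days = 30

deployment_frequency = len(deployments) / window_days                       # deploys per day
lead_time_for_changes = median(d["deployed_at"] - d["committed_at"] for d in deployments)
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)                      # fraction of bad deploys
restore_times = [d["restored_at"] - d["deployed_at"] for d in failures]
time_to_restore = median(restore_times) if restore_times else timedelta(0)  # MTTR

print(deployment_frequency, lead_time_for_changes, change_failure_rate, time_to_restore)
```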
An SRE team is spending too much time on manual, repetitive operational tasks (toil), leaving no time for engineering projects.
→Identify and quantify the most time-consuming toil. Prioritize and automate these tasks (e.g., implementing autoscaling instead of manual scaling, auto-remediation for common alerts). Cap toil at < 50% of engineer time.
Why: Toil is a drag on productivity and morale. Systematically reducing it through automation frees up engineers to work on long-term reliability improvements.
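A sketch of how toil might be quantified and ranked for automation; the tasks, hours, and team size are hypothetical:

```python
# Sketch of quantifying and prioritizing toil; the tasks, hours, and automation
# estimates are hypothetical numbers used only to illustrate the ranking.

toil_tasks = [
    # (task, engineer-hours per month, estimated one-off hours to automate)
    ("manual scaling during traffic spikes", 40, 60),
    ("acking and triaging a noisy disk alert", 25, 16),
    ("rotating service credentials by hand", 10, 24),
]

total_engineer_hours = 4 * 160  # 4 engineers * ~160 hours/month (assumed team size)
toil_hours = sum(hours for _, hours, _ in toil_tasks)
print(f"Toil is {toil_hours / total_engineer_hours:.0%} of team time (cap: < 50%)")

# Rank by monthly hours returned per hour invested in automation.
for task, monthly, build_cost in sorted(toil_tasks, key=lambda t: t[1] / t[2], reverse=True):
    print(f"{task}: saves {monthly} h/month for a one-off {build_cost} h build")
```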
Attribute cloud costs accurately to different teams, services, or environments in a shared infrastructure.
→Implement a consistent labeling/tagging strategy. Use these labels to filter in Cloud Billing reports. For GKE, enable GKE cost allocation to break down costs by namespace or workload.
Why: Accurate cost allocation provides visibility, which drives accountability. Teams that can see their spending are empowered to optimize it.
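A sketch of attributing cost line items by a `team` label, assuming the items have already been pulled from the Cloud Billing export and flattened into dicts; the row shape and label keys are assumptions:

```python
# Sketch of attributing costs by label; the row shape and the "team"/"env"
# label keys are assumptions about an already-flattened billing export.
from collections import defaultdict

line_items = [
    {"service": "Compute Engine", "cost": 412.50, "labels": {"team": "payments", "env": "prod"}},
    {"service": "Cloud Storage",  "cost": 88.10,  "labels": {"team": "payments", "env": "dev"}},
    {"service": "GKE",            "cost": 260.00, "labels": {"team": "search",   "env": "prod"}},
    {"service": "Compute Engine", "cost": 35.40,  "labels": {}},  # untagged -> needs follow-up
]

costs_by_team = defaultdict(float)
for item in line_items:
    team = item["labels"].get("team", "untagged")
    costs_by_team[team] += item["cost"]

for team, cost in sorted(costs_by_team.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: ${cost:,.2f}")
```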
Optimize compute costs for a diverse set of workloads (stable, interruptible, dev/test).
→Match the workload to the pricing model. Use Committed Use Discounts (CUDs) for stable, 24/7 workloads. Use Spot VMs for fault-tolerant, interruptible jobs (e.g., batch processing). Schedule dev/test environments to shut down outside of business hours.
Why: A one-size-fits-all approach to compute pricing is inefficient. Matching the pricing model to each workload can yield significant savings (Spot VMs, for example, are discounted 60-91% off on-demand rates) without impacting performance.
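Illustrative arithmetic for comparing the pricing models; the hourly rate and discount percentages are placeholders, not published prices:

```python
# Illustrative arithmetic for matching workloads to pricing models.
# The hourly rate and discount percentages are placeholder assumptions,
# not published Google Cloud prices.

on_demand_rate = 0.10          # $/vCPU-hour, hypothetical
hours_per_month = 730

always_on     = on_demand_rate * hours_per_month                 # baseline: on-demand, 24/7
stable_cud    = on_demand_rate * hours_per_month * (1 - 0.55)    # assumed ~55% CUD discount
batch_on_spot = on_demand_rate * hours_per_month * (1 - 0.75)    # assumed ~75% Spot discount
dev_scheduled = on_demand_rate * (12 * 22)                       # 12 h/day, 22 weekdays, no discount

print(f"24/7 on-demand:      ${always_on:.2f}/month per vCPU")
print(f"24/7 with CUD:       ${stable_cud:.2f}/month per vCPU")
print(f"Batch on Spot VMs:   ${batch_on_spot:.2f}/month per vCPU")
print(f"Dev/test, scheduled: ${dev_scheduled:.2f}/month per vCPU")
```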
Optimize GKE costs and performance by ensuring pods are requesting appropriate amounts of CPU and memory.
→Deploy the Vertical Pod Autoscaler (VPA) in recommendation-only mode (`updateMode: "Off"`). Analyze its suggestions to adjust pod resource `requests`. Once confident, switch to `updateMode: "Auto"` for continuous right-sizing.
Why: Over-provisioning pods wastes money, while under-provisioning causes performance issues (throttling, OOMKilled). VPA uses actual usage data to make accurate sizing recommendations, improving both efficiency and stability.
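A toy illustration of usage-based right-sizing (VPA's actual algorithm is more sophisticated); the usage samples and headroom factor are assumptions:

```python
# Toy illustration of usage-based right-sizing; VPA's real recommender is more
# sophisticated. The usage samples and headroom factor are assumptions.
cpu_usage_millicores = [120, 95, 140, 210, 180, 160, 130, 250, 190, 175]  # sampled usage
current_request = 1000          # what the pod currently requests (over-provisioned)

p95 = sorted(cpu_usage_millicores)[int(0.95 * len(cpu_usage_millicores)) - 1]
recommended = int(p95 * 1.15)   # p95 plus ~15% headroom (assumed safety margin)

print(f"current request: {current_request}m, observed p95: {p95}m, "
      f"suggested request: {recommended}m "
      f"({(1 - recommended / current_request):.0%} reduction)")
```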
Reduce latency caused by cold starts for a Cloud Run service.
→Configure a `min-instances` value to keep a number of instances warm. Additionally, optimize the container image (smaller base image, fewer layers) and application startup code (lazy initialization).
Why: `min-instances` is the most direct way to reduce cold starts, but it has a cost. Combining it with container and code optimization provides a balanced approach to performance and cost.
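A sketch of lazy initialization in a Python Cloud Run service (minimum instances themselves are set at deploy time, e.g., with the `--min-instances` flag); the Flask app and Firestore client are illustrative choices, not the only option:

```python
# Sketch of lazy initialization to shrink Cloud Run cold-start time; the Flask
# app and the Firestore client are illustrative, not a recommended stack.
from flask import Flask

app = Flask(__name__)
_db = None  # defer expensive setup out of the import path

def get_db():
    """Create the expensive client on first use instead of at container startup."""
    global _db
    if _db is None:
        from google.cloud import firestore  # import lazily too; assumed dependency
        _db = firestore.Client()
    return _db

@app.route("/orders/<order_id>")
def get_order(order_id):
    doc = get_db().collection("orders").document(order_id).get()
    return (doc.to_dict(), 200) if doc.exists else ("not found", 404)
```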
Optimize costs for a large-scale BigQuery analytics workload with variable query patterns.
→Switch from on-demand pricing to BigQuery Editions (slots). Purchase a baseline slot commitment for predictable load and enable autoscaling for peaks. Additionally, optimize queries by using partitioned/clustered tables and avoiding `SELECT *`.
Why: For consistent workloads, slot-based pricing is more cost-effective than on-demand. Autoscaling provides flexibility for bursts while controlling costs. Query and table optimization reduces the amount of data processed, directly lowering costs.
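A sketch of using a dry run to estimate how many bytes a query would scan before running it; the project, dataset, and partition column names are hypothetical:

```python
# Sketch of a BigQuery dry run to estimate bytes scanned before paying for the
# query; the project, dataset, and partition column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT user_id, event_name
    FROM `my-project.analytics.events`                         -- assumed partitioned table
    WHERE event_date BETWEEN '2024-05-01' AND '2024-05-07'     -- prune partitions, no SELECT *
"""

job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"Query would scan {job.total_bytes_processed / 1e9:.2f} GB")
```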
Reduce high network egress costs for a globally distributed application.
→Use Cloud CDN to cache static content at the edge, closer to users. For dynamic traffic, choose the appropriate Network Service Tier (Premium for performance, Standard for cost-savings). Process data regionally to minimize cross-region traffic.
Why: Egress is a major cost driver. CDN offloads traffic from the origin, directly reducing egress. Thoughtful use of network tiers and regional data processing can significantly lower costs.
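Back-of-envelope arithmetic for estimating CDN offload savings; the traffic volume, cache-hit ratio, and per-GB rates are placeholder assumptions, not published prices:

```python
# Back-of-envelope estimate of egress savings from CDN caching; the traffic
# volume, cache-hit ratio, and per-GB rates are placeholder assumptions.

monthly_egress_gb = 50_000
origin_egress_rate = 0.12   # $/GB served from the origin, hypothetical
cdn_egress_rate = 0.08      # $/GB served from the CDN edge, hypothetical
cache_hit_ratio = 0.85      # fraction of traffic served from cache (assumed)

without_cdn = monthly_egress_gb * origin_egress_rate
with_cdn = (monthly_egress_gb * cache_hit_ratio * cdn_egress_rate
            + monthly_egress_gb * (1 - cache_hit_ratio) * origin_egress_rate)

print(f"Origin-only egress: ${without_cdn:,.0f}/month")
print(f"With CDN caching:   ${with_cdn:,.0f}/month "
      f"({(1 - with_cdn / without_cdn):.0%} reduction)")
```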