
Google Cloud Deep Dive Portal

A practical reference for Cloud Architects, DBAs, Data Engineers, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Google Cloud environments - not a marketing overview.

19 deep sections | Architecture patterns | Troubleshooting runbooks | gcloud & Terraform | Architecture Framework
Last reviewed: July 2026. Google Cloud changes frequently - verify with current Google Cloud documentation before production use.
WHO THIS IS FOR

Cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, DevOps, security, data, and AI engineers - and anyone moving from traditional infrastructure or another cloud into Google Cloud. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Google Cloud and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, machine types, quotas, model availability, service names), a Verify with current Google Cloud documentation flag.

Learn
Foundations first

Sections 1-2 establish the mental model: the resource hierarchy (org / folder / project), regions/zones, and the IAM allow/deny policy model that everything else depends on.

Build
Service deep dives

Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data analytics, and AI - with diagrams, tables, and gotchas.

Operate
Run and govern

Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Architecture Framework, and a structured learning path.

Reading the callouts

Several note types recur. They flag the perspective that matters most for a point.

Architect note
Design-time decisions, trade-offs, and things to settle before production.
DBA note
Database-specific behavior - what Google manages vs. what you manage, patching, backups, connectivity.
Security note
Exposure, least privilege, encryption, and audit considerations.
Cost note
Where money is spent and commonly wasted.
Operations note
Day-2 behavior: patching, scaling, maintenance, and reliability.
Data engineering note
BigQuery, pipelines, lake/warehouse design, and query-cost behavior.
AI note
Vertex AI, Gemini, vector search, and governed GenAI patterns.
Common mistake
A specific error teams repeatedly make, and how to avoid it.

The Google Cloud shared responsibility model (orientation)

Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Google already does.

| Layer | Compute Engine (IaaS) | GKE Standard | Cloud SQL / managed DB | BigQuery / Cloud Run (serverless) |
|---|---|---|---|---|
| Physical / hypervisor | Google | Google | Google | Google |
| OS patching | You | You (nodes) / Google (control plane) | Google | Google |
| Runtime / engine patching | You | Shared | Google (in window) | Google |
| Backup config | You | You | Managed, you configure | Managed / you export |
| Scaling / HA | You build it (MIG) | You configure | You enable HA | Automatic |
| Data, schema, access, IAM | You | You | You | You |
The rule that never moves
Google secures the cloud. You secure what you put in it: identities, IAM, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.

Suggested reading order

Accuracy & independence
This is an independent educational resource, not official Google material and not a sales tool. Service names, machine types, quotas, regional availability, and pricing change often. Treat every concrete number, machine type, and limit here as a starting point and confirm it in the Google Cloud Console for your project and the official Google Cloud documentation before any design, sizing, or purchasing decision.

1. Google Cloud Fundamentals

The global infrastructure and the resource hierarchy (organization, folders, projects) that every Google Cloud deployment is built on - plus the mental model that makes the rest of the platform predictable.

Last reviewed: July 2026. Verify region list, quotas, and service availability in the Console.
TL;DR

Google Cloud is a set of regions (each with multiple zones) sitting on Google's global network. Resources have a scope - global, regional, or zonal - and that scope drives HA design. Everything lives in a hierarchy: Organization > Folders > Projects > resources. The project is the fundamental unit of deployment, billing, quota, and isolation. IAM grants access down the hierarchy; Organization Policies restrict what is allowed. Get the hierarchy and a landing zone right before production - restructuring later is painful.

What Google Cloud is

Google Cloud Platform (GCP) is Google's public cloud: on-demand compute, storage, networking, databases, data analytics, and AI/ML services delivered from Google-operated regions, consumed over Google's private global network, and billed by usage. Its distinctive strengths for enterprises are its global network (a software-defined, private backbone that makes a VPC a global object and enables global load balancing from a single anycast IP), its data and analytics stack (BigQuery, Dataflow, Pub/Sub, Dataplex), and its AI/ML platform (Vertex AI, Gemini). If you come from traditional infrastructure, the biggest early surprises are that the network is global, the project is the main boundary, and much of the platform is API-first and serverless.

Google Cloud global infrastructure

[Diagram: Google global network (private backbone, edge PoPs, subsea cables). One VPC is a global object on this network; global load balancing uses a single anycast IP.
  • Region A (e.g. us-central1): zones -a, -b, and -c, each running VMs. Regional resources span zones; zonal resources live in one zone.
  • Region B (e.g. europe-west1): for DR / residency.
  • Multi-region resources (e.g. a multi-region Cloud Storage bucket, BigQuery US/EU) span regions.
  • Global resources (VPC, global LB, IAM, images) are not tied to any one region.]
Region > zones on Google's global network; resources are global, multi-region, regional, or zonal

Resource scope: global, multi-region, regional, zonal

This is the single most important idea for HA design on GCP - a resource's scope determines what failure it survives and where it can be reached.

| Scope | Spans | Examples | Fails if |
|---|---|---|---|
| Global | All regions | VPC network, firewall rules, routes, global external Application LB, images, IAM, HTTP(S) load balancer IP | Essentially never region-bound; a global control-plane issue only |
| Multi-region | A set of regions (e.g. "US", "EU") | Cloud Storage multi-region buckets, BigQuery multi-region datasets | Survives a region loss within the multi-region |
| Regional | All zones in one region | Regional MIG, regional persistent disk, regional GKE control plane, subnets, Cloud SQL (HA) | The whole region is lost |
| Zonal | One zone | A single VM, zonal persistent disk, zonal GKE cluster | That one zone is lost |
Architect note - design by scope
Availability is a function of scope. A single (zonal) VM has no zone-failure protection. To survive a zone loss, use a regional managed instance group across zones and regional disks. To survive a region loss, deploy to a second region with global load balancing and cross-region data replication. Draw the scope of every resource before you draw the diagram.
Common mistake
Treating a subnet like it is global because "the VPC is global." The VPC is global; subnets are regional. A resource in a subnet lives in that region. You do not need one VPC per region (you need one global VPC with a subnet per region), but you do place workloads region by region.
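
The scope table can be read as a small decision rule. The sketch below is a simplified model, not a Google API: the `Scope` enum, the `"zone"` / `"region"` failure labels, and the sample resource names are all hypothetical, and it ignores global control-plane issues.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Resource scopes, ordered from narrowest to widest."""
    ZONAL = 1
    REGIONAL = 2
    MULTI_REGION = 3
    GLOBAL = 4


def survives(scope: Scope, failure: str) -> bool:
    """Return True if a resource of this scope rides out the failure.

    failure is "zone" or "region" (labels invented for this sketch).
    """
    if failure == "zone":
        return scope >= Scope.REGIONAL
    if failure == "region":
        return scope >= Scope.MULTI_REGION
    raise ValueError(f"unknown failure type: {failure}")


def design_survives(resources: dict, failure: str) -> bool:
    """A design is only as available as its narrowest-scoped resource."""
    return all(survives(scope, failure) for scope in resources.values())


design = {
    "web-mig": Scope.REGIONAL,   # regional MIG across zones
    "data-disk": Scope.ZONAL,    # zonal persistent disk
    "vpc": Scope.GLOBAL,
}
print(design_survives(design, "zone"))  # False: the zonal disk is the weak link
```

Listing every resource with its scope, as the architect note suggests, makes the weakest link visible before the diagram is drawn.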

The resource hierarchy

[Diagram: Organization (root) with folders Shared / Platform, Business Unit A, and Business Unit B; projects net-hub-prod, logging, and security; sub-folders BU-A / prod and BU-A / nonprod; projects app-a-dev and app-a-test. IAM grants flow DOWN (inherited). Org Policies restrict DOWN. Project = billing + quota + API-enablement + isolation unit.]
Organization > Folders > Projects > resources. IAM and Org Policies inherit downward.
  • Organization - the root node, tied to a Cloud Identity / Google Workspace domain. The top of IAM and policy inheritance.
  • Folders - grouping nodes (by department, environment, or BU) for delegated administration and policy inheritance. Can nest.
  • Projects - the fundamental unit. Every resource belongs to exactly one project. A project is the boundary for billing, quotas, API enablement, IAM, and isolation. It has a mutable name, an immutable project ID (globally unique), and a project number.
  • Resources - VMs, buckets, datasets, etc., inside a project.
  • Billing account - a separate object linked to projects; it is where charges accrue and can span many projects. Org and billing are managed independently.
  • Resource Manager - the API/service that manages this hierarchy; Cloud Asset Inventory gives you a searchable, historical inventory of all resources and IAM across the org.
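
One way to picture the hierarchy is as a tree in which every node knows its parent. The sketch below is illustrative only: the `Node` class, the `ancestry` helper, and the node names are hypothetical and do not come from the Resource Manager API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    kind: str                        # "organization", "folder", or "project"
    parent: Optional["Node"] = None  # None only for the organization root


def ancestry(node: Optional[Node]) -> list:
    """Walk from a node up to the root, nearest ancestor first."""
    chain = []
    while node is not None:
        chain.append(f"{node.kind}:{node.name}")
        node = node.parent
    return chain


# Hypothetical hierarchy: org > folder > sub-folder > project.
org = Node("example.com", "organization")
bu_a = Node("business-unit-a", "folder", org)
nonprod = Node("bu-a-nonprod", "folder", bu_a)
app_dev = Node("app-a-dev", "project", nonprod)

print(ancestry(app_dev))
```

Anything granted or constrained on a node in that chain is a candidate to affect the project, which is why placement in the tree matters as much as the grant itself.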

The Google Cloud mental model

| Concept | Is the boundary for | Think of it as |
|---|---|---|
| Organization | Everything; identity domain | The enterprise root |
| Folder | Delegated admin & policy grouping | A governance grouping (dept / env / BU) |
| Project | Billing, quota, API, IAM, isolation | The main deployment and billing boundary |
| IAM allow/deny policy | Who can do what | The access-control boundary |
| Organization Policy | What is allowed at all | The governance / restriction boundary |
| VPC network | Private networking | A global network object (subnets are regional) |
| Region / zone | Physical placement | The workload placement boundary (drives HA) |
Two different questions
IAM answers "can this principal do this action on this resource?" Organization Policy answers "is this action allowed to exist here at all?" (e.g. "no external IPs in this folder," "only these regions"). They are separate engines - you often need both: IAM to grant, Org Policy to constrain.

Organization Policies

Organization Policy Service lets you set constraints that apply to a node (org, folder, or project) and inherit downward - guardrails that IAM cannot express. Common ones:

  • constraints/compute.vmExternalIpAccess - block external IPs on VMs (huge for reducing exposure).
  • constraints/gcp.resourceLocations - restrict which regions resources can be created in (data residency).
  • constraints/iam.disableServiceAccountKeyCreation - stop long-lived SA keys.
  • constraints/iam.allowedPolicyMemberDomains - domain restricted sharing: only allow IAM grants to your org's identities.
  • constraints/compute.requireOsLogin - enforce OS Login instead of metadata SSH keys.
Architect note
Set a baseline of preventive Org Policies at the organization node in your landing zone: disable SA key creation, enforce domain-restricted sharing, restrict resource locations, and block external IPs by default (exempt specific folders that genuinely need them). These prevent whole classes of mistakes rather than detecting them after the fact.
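
The "set at the org, exempt specific folders" pattern can be sketched as a lookup that walks up the hierarchy and stops at the first node with a policy set. This is a deliberately simplified model: the real Organization Policy Service also supports merging a child policy with its parent's, and the node names and the `"DENY_ALL"` / `"ALLOW_ALL"` values are hypothetical shorthand.

```python
# Hypothetical hierarchy: child -> parent (None marks the organization root).
PARENT = {
    "proj-db": "folder-prod",
    "proj-web": "folder-dmz",
    "folder-prod": "org",
    "folder-dmz": "org",
    "org": None,
}

# Policies set directly on a node, keyed by constraint name.
POLICIES = {
    "org": {
        "constraints/compute.vmExternalIpAccess": "DENY_ALL",
        "constraints/iam.disableServiceAccountKeyCreation": "ENFORCED",
    },
    # The one folder that genuinely needs external IPs is exempted.
    "folder-dmz": {"constraints/compute.vmExternalIpAccess": "ALLOW_ALL"},
}


def effective_policy(node, constraint):
    """Return the nearest policy for a constraint, walking up the tree.

    Simplification: the nearest node with the constraint set wins outright.
    """
    while node is not None:
        value = POLICIES.get(node, {}).get(constraint)
        if value is not None:
            return value
        node = PARENT[node]
    return None  # not set anywhere: the constraint's default behavior applies


print(effective_policy("proj-db", "constraints/compute.vmExternalIpAccess"))
print(effective_policy("proj-web", "constraints/compute.vmExternalIpAccess"))
```

The first call resolves to the org-level baseline and the second to the folder exemption, which is the shape the architect note recommends.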

Labels vs tags

| | Labels | Tags |
|---|---|---|
| What | Key/value metadata on resources | Key/value objects defined at org/project, bound to resources |
| Main use | Cost attribution, billing export grouping, organization | Conditional IAM and Org Policy / firewall targeting (governance) |
| Governed? | Free-form (anyone with edit can set) | Yes - tag keys/values are IAM-controlled resources |
Cost note
Define a small, enforced set of labels (cost-center, environment, owner, app) from day one and turn on billing export to BigQuery. Chargeback and cost analysis are only as good as your labeling, and retrofitting labels across thousands of resources is slow and never complete. Use tags (not labels) when you need the value to drive IAM conditions or Org Policy.
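
A label standard is only useful if something checks it. As a minimal sketch, the function below reports which of the four labels named above are missing or empty on a resource; the function name and the sample label values are hypothetical, and in practice the input would come from an inventory or billing export rather than a literal dict.

```python
# The four label keys suggested in the cost note.
REQUIRED_LABELS = {"cost-center", "environment", "owner", "app"}


def missing_labels(labels: dict) -> set:
    """Return the required label keys that are absent or empty."""
    return {key for key in REQUIRED_LABELS if not labels.get(key)}


vm_labels = {"cost-center": "cc-1042", "environment": "prod", "owner": ""}
print(sorted(missing_labels(vm_labels)))  # ['app', 'owner']
```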

Resource names, project IDs, project numbers

  • Project ID - globally unique, human-chosen, immutable (e.g. acme-app-prod-01). Used in most CLI/API calls. Choose a naming convention up front.
  • Project number - an auto-assigned numeric ID; some APIs and service-agent identities use it.
  • Project name - a mutable display name.
  • Full resource names are hierarchical, e.g. //compute.googleapis.com/projects/<id>/zones/<zone>/instances/<name>.
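
Because full resource names are hierarchical, they can be split into a service plus collection/ID pairs. The sketch below assumes the path strictly alternates collection and ID, as in the Compute Engine example above; that does not hold for every resource type (object names containing slashes, for instance), and the helper name and sample instance name are hypothetical.

```python
def parse_full_resource_name(name: str) -> dict:
    """Split '//service/collection/id/...' into a dict of its parts."""
    if not name.startswith("//"):
        raise ValueError("full resource names start with '//'")
    service, _, path = name[2:].partition("/")
    parts = path.split("/")
    if not path or len(parts) % 2 != 0:
        raise ValueError("expected alternating collection/id segments")
    return {"service": service, **dict(zip(parts[0::2], parts[1::2]))}


parsed = parse_full_resource_name(
    "//compute.googleapis.com/projects/acme-app-prod-01"
    "/zones/us-central1-a/instances/web-1"
)
print(parsed["projects"], parsed["zones"], parsed["instances"])
```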

Ways to work with Google Cloud

Cloud Console

The web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.

gcloud CLI (Cloud SDK)

The primary command line. Config profiles, project/account selection, service-account impersonation. See section 17.

Cloud Shell

Browser terminal, pre-authenticated as your Console identity, with gcloud/Terraform/kubectl installed. The VM itself is ephemeral, but your home directory persists across sessions (5 GB by default).

Client libraries & REST APIs

Idiomatic libraries (Python, Go, Java, Node, ...) and REST/gRPC APIs for building applications and tooling.

Terraform (recommended IaC)

The Google provider is the standard way to build infrastructure declaratively. Deployment Manager still exists but is legacy; new work uses Terraform (or Infrastructure Manager, managed Terraform).

Cloud Asset Inventory

Search, export, and monitor all resources and IAM across the org, with history and feeds. Essential for governance and audits.

Recommended posture
Console to learn and inspect. gcloud for glue and ad-hoc ops. Terraform for anything that reaches production. Manual Console clicks in prod are the root cause of most "why is DR different?" incidents.

Designing the hierarchy & a landing zone

How to structure organization, folders, and projects - Design
  • Common pattern: Org > Folders by environment or business unit > Projects per app-environment. A frequent shape is a top-level split of Shared/Platform, Security, and per-BU folders, each with prod/nonprod sub-folders.
  • One app + one environment per project is the norm - it gives clean IAM, quota, billing, and blast-radius isolation. Resist the urge to pile many apps into one project.
  • Shared projects hold cross-cutting infrastructure: the host project for Shared VPC (networking), centralized logging, centralized security tooling (SCC), and a monitoring project.
  • Never mix with production: sandbox/experimentation, personal projects, and unreviewed workloads must live in separate folders with their own guardrails and budgets - never in the prod project.
Separating dev / test / stage / prod / shared / security / networking / logging - Design
  • Separate projects per environment for independent IAM, quotas, budgets, and cost reporting.
  • A dedicated networking (Shared VPC host) project, a logging project (aggregated log sink target), a security project (SCC, org-level tooling), and a monitoring project.
  • Use folders to apply environment-wide IAM and Org Policies once and inherit them.
  • Keep prod under stricter Org Policies (no external IPs, restricted locations, no SA keys) than nonprod.
What a Google Cloud landing zone includes - Design

A landing zone is a codified, repeatable baseline (Terraform) deployed before workloads:

  • Resource hierarchy (org, folders, projects) and naming standards.
  • Identity: Cloud Identity/Workspace, groups, break-glass, federation.
  • Baseline IAM (groups, not users) and preventive Org Policies.
  • Networking: Shared VPC host project, subnets, hierarchical firewall, Cloud NAT, DNS, hybrid connectivity.
  • Security: SCC, VPC Service Controls perimeters, KMS, org audit-log sink to a logging project (and optionally BigQuery/SIEM).
  • Guardrails: budgets and alerts, quotas, labels/tags, billing export to BigQuery.
  • All as code, reviewed and version-controlled. Google's Cloud Foundation / landing-zone blueprints are a starting point.
Common mistakes in hierarchy design
  • Running everything under one project (or worse, one shared "default" project) so least privilege and cost attribution become impossible.
  • Skipping the landing zone and retrofitting Org Policies, Shared VPC, and log centralization after workloads exist.
  • Granting IAM at the org level for convenience, so every project inherits broad access.
  • No naming standard for projects/folders, breaking automation and reporting.
  • Mixing sandbox and production in the same folder with the same guardrails.

2. Identity and Access Management

Who can do what, on which resource, in which project - and the guardrails (deny policies, Org Policies, VPC Service Controls) around it. IAM is where most Google Cloud access issues and security incidents originate, so this section goes deep on principals, roles, inheritance, and troubleshooting.

Last reviewed: July 2026. Verify role names and IAM Conditions/deny features in current docs.
TL;DR

IAM binds a principal (user, group, service account, or federated identity) to a role (a bundle of permissions) on a resource, in an allow policy. Policies inherit down the hierarchy (org → folder → project → resource), and grants are additive. Deny policies and Org Policies can block regardless of allows. Use groups (not users) and predefined/custom roles (not basic Owner/Editor/Viewer), prefer impersonation over service-account keys, and wrap sensitive data in VPC Service Controls.

The IAM model

An IAM allow policy is attached to a resource (or a hierarchy node) and contains bindings: each binding maps a role to one or more members (principals), optionally with a condition. When a principal calls an API, Google evaluates the effective policy (the union of allow policies inherited from the resource up through project, folder, and org), checks for any applicable deny policy, and checks Org Policy and VPC Service Controls. Access requires an allow, no matching deny, and no blocking Org Policy/perimeter.
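
The evaluation order described above can be sketched as a small function. This is a simplified model under stated assumptions, not Google's implementation: roles are flattened to (principal, permission) pairs, conditions are ignored, Org Policy and VPC Service Controls are reduced to boolean inputs, and every node, group, and function name is hypothetical.

```python
# Hypothetical hierarchy: child -> parent (None marks the organization root).
PARENT = {"bucket-a": "proj-1", "proj-1": "folder-x", "folder-x": "org", "org": None}

# Allow and deny policies attached to nodes, flattened to (principal, permission).
ALLOW = {
    "org": {("group:sec@example.com", "storage.objects.get")},
    "proj-1": {("group:app@example.com", "storage.objects.delete")},
}
DENY = {
    "folder-x": {("group:app@example.com", "storage.objects.delete")},
}


def chain(resource):
    """Yield the resource and each ancestor up to the organization."""
    while resource is not None:
        yield resource
        resource = PARENT[resource]


def is_allowed(principal, permission, resource,
               org_policy_blocks=False, perimeter_blocks=False):
    """Access needs an allow, no matching deny, and no blocking guardrail."""
    key = (principal, permission)
    nodes = list(chain(resource))
    if any(key in DENY.get(node, ()) for node in nodes):
        return False  # a matching deny wins over any allow
    if org_policy_blocks or perimeter_blocks:
        return False
    # Allows are additive: one grant anywhere up the chain is enough.
    return any(key in ALLOW.get(node, ()) for node in nodes)


print(is_allowed("group:sec@example.com", "storage.objects.get", "bucket-a"))
print(is_allowed("group:app@example.com", "storage.objects.delete", "bucket-a"))
```

The first request succeeds through the inherited org-level grant; the second is refused because the folder-level deny matches even though the project allows it.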

Principals (members)

| Principal | What it is | Use for |
|---|---|---|
| Google account (user) | A human identity in Cloud Identity / Workspace / consumer | People - but grant via groups, not directly |
| Google group | A collection of users/service accounts | All human access management (add/remove members, not IAM bindings) |
| Service account (SA) | A non-human identity for workloads (an app, VM, function) | Machine-to-service auth; the workload's identity |
| Workload identity (federated) | An external workload identity (GKE, other clouds, CI) mapped to a GCP identity | Letting workloads authenticate without SA keys |
| Workforce identity (federated) | External human users from your IdP (Okta, Entra, etc.) | Console/gcloud access for a federated workforce |
| allAuthenticatedUsers / allUsers | Any Google identity / anyone on the internet | Almost never - public exposure; block with Org Policy |
Common mistake - confusing service accounts with users
A service account is a workload identity, not a person. Do not share SA credentials among people, do not use a personal account as an app identity, and do not grant humans broad rights "because the app has them." Humans authenticate as users (via groups); workloads authenticate as service accounts (ideally via impersonation or workload identity, not keys).

Roles: basic, predefined, custom

| Role type | What | Guidance |
|---|---|---|
| Basic (primitive) | Owner, Editor, Viewer - broad, legacy roles spanning almost all services | Avoid in production. Owner/Editor are enormous grants. Use only in throwaway sandboxes. |
| Predefined | Curated, service-specific roles (e.g. roles/compute.instanceAdmin.v1, roles/bigquery.dataViewer) | The default. Pick the narrowest predefined role that fits the job. |
| Custom | You compose a role from specific permissions | When no predefined role fits without over-granting. Maintain them (permissions change). |
Common mistake - basic roles
Granting roles/editor or roles/owner "to move fast" gives near-total control over the project (create/delete almost anything, and Owner can change IAM). It defeats least privilege and makes audits meaningless. Use predefined roles scoped to the job; reserve Owner for a tiny, monitored break-glass group.

Allow policies and deny policies; conditional IAM

  • Allow policy - grants roles to principals. Additive across the hierarchy; there is no "subtract."
  • Deny policy - explicitly denies specified permissions to specified principals, and is evaluated before allows - a matching deny wins. Use it to carve exceptions ("nobody except break-glass may delete buckets in this folder").
  • Conditional IAM - attach a condition (CEL expression) to a binding: by resource name/type, by request time, by tag. E.g. grant access only to resources tagged env=dev, or only during business hours.
# Conditional binding: grant only on buckets whose name starts with "app-"
gcloud projects add-iam-policy-binding my-proj \
  --member="group:app-team@example.com" \
  --role="roles/storage.objectAdmin" \
  --condition='expression=resource.name.startsWith("projects/_/buckets/app-"),title=app-buckets-only'
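
To see which resources a prefix condition like the one above covers, it helps to test a few names by hand. The sketch below only mimics the string-prefix check with Python's `str.startswith`; it is not a CEL evaluator, and the bucket and object names are made up.

```python
# Same prefix as the condition expression in the gcloud example.
PREFIX = "projects/_/buckets/app-"


def condition_matches(resource_name: str) -> bool:
    """Mimic resource.name.startsWith(PREFIX) for a single resource name."""
    return resource_name.startswith(PREFIX)


samples = [
    "projects/_/buckets/app-logs",                     # the bucket itself
    "projects/_/buckets/app-logs/objects/2026/a.txt",  # an object inside it
    "projects/_/buckets/data-app-1",                   # "app-" is not a prefix
]
for name in samples:
    print(condition_matches(name), name)
```

Objects inside a matching bucket also match because their names begin with the bucket's name, while a bucket that merely contains "app-" does not.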

Inheritance and scope

Grant at the lowest node that works. A role granted at the org applies to every folder, project, and resource beneath it; a role granted on a single bucket applies only there.

| Scope | Grant here when | Risk |
|---|---|---|
| Organization | Truly org-wide roles (org admins, security auditors) | Highest - inherited everywhere |
| Folder | A whole environment/BU needs the same access | Medium |
| Project | Most workload access | Contained to one project |
| Resource (bucket, dataset, SA) | Fine-grained, single-resource access | Lowest |
Common mistake - org-level grants
Granting a role at the organization "so it works everywhere" is the IAM equivalent of a firewall any-any rule. It inherits into every project including prod and security. Grant at the project (or resource) level; reserve org/folder grants for genuinely cross-cutting roles and review them regularly with Policy Analyzer.

Service accounts: keys vs impersonation vs workload identity

| Mechanism | How the workload authenticates | Risk |
|---|---|---|
| Attached SA (metadata) | A VM/GKE/Run resource runs as an SA; credentials come from the metadata server | Low - no keys on disk |
| Workload Identity Federation | External workload (other cloud, CI, on-prem) exchanges its native token for short-lived GCP creds | Low - no keys |
| SA impersonation | A principal with iam.serviceAccountTokenCreator mints short-lived tokens for an SA | Low - short-lived, auditable |
| SA key (JSON) | A long-lived downloaded private key | High - leaks, gets committed to git, outlives its owner |
Security note - kill long-lived keys
Prefer, in order: attached service accounts (workloads in GCP), Workload Identity Federation (workloads outside GCP), and impersonation (humans/CI acting as an SA). Reserve SA keys for the rare case with no alternative, and block their creation org-wide with iam.disableServiceAccountKeyCreation. A leaked JSON key is a standing, long-lived credential - the most common serious GCP incident.

Cloud Identity, Workspace, and federation

  • Cloud Identity / Google Workspace - where your human identities and groups live; the org is tied to a domain. Manage joiners/movers/leavers here.
  • Workforce Identity Federation - let human users from an external IdP (Okta, Entra ID, Ping) access Google Cloud without provisioning Google accounts.
  • Workload Identity Federation - let external workloads (GitHub Actions, AWS, on-prem) get short-lived GCP credentials by trust, no keys.
Architect note - groups + federation
Federate human identities to your corporate IdP and drive all access through Google Groups mapped from IdP groups. IAM bindings target groups, never individuals. This makes access reviews, joiners/leavers, and audits tractable; per-user bindings become unmanageable at scale.

IAP, Access Context Manager, VPC Service Controls

Identity-Aware Proxy (IAP)

Context-aware access to apps and VMs (SSH/RDP/TCP) without a VPN or public IPs - authenticate the user and check policy at the proxy. The modern replacement for bastion + public IP.

Access Context Manager

Defines access levels (by IP range, device posture, identity) used by IAP and VPC-SC to make access conditional on context.

VPC Service Controls (VPC-SC)

Draws a service perimeter around projects so data in managed services (Cloud Storage, BigQuery, etc.) cannot be exfiltrated to projects outside the perimeter - even by a valid identity with a valid key. The key control for sensitive-data isolation.

IAM Recommender / Policy Analyzer / Troubleshooter

Recommender flags over-granted roles; Policy Analyzer answers "who can access what"; Troubleshooter explains why a specific request was allowed or denied.

Security note - VPC-SC for sensitive data
IAM controls who can call an API; it does not stop a compromised-but-authorized identity from copying data to an attacker's project. VPC Service Controls adds an egress boundary around your data services so BigQuery/Cloud Storage data cannot leave the perimeter. For regulated or high-value data, VPC-SC is not optional.

Real IAM scenarios

Read-only security auditor across the org - Low risk

Who: the security team. Scope: organization (auditors legitimately need breadth). Role: roles/iam.securityReviewer + roles/logging.viewer (and SCC roles), granted to a group. Risk: low - read-only. Safer alternative: scope to specific folders if their remit is narrower. Common misuse: giving auditors Viewer at org (broader than needed and includes data read on many services).

App team deploys to their own project only - Medium risk

Who: the app team group. Scope: their project (not folder/org). Role: specific predefined roles (e.g. roles/run.developer, roles/artifactregistry.writer, roles/logging.viewer), not Editor. Risk: medium - contained to one project. Safer alternative: deploy via a pipeline SA and give humans only view + trigger. Common misuse: roles/editor on the project "to unblock them."

Workload reads a bucket, no keys - Low risk

Who: a VM/Cloud Run service. Scope: a single bucket. Role: roles/storage.objectViewer granted to the workload's attached service account on that bucket. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a downloaded SA key baked into the image + project-level Storage Admin.

CI/CD pipeline outside GCP deploys in - Medium risk

Who: GitHub Actions / external CI. Scope: the target project. Role: a deployer SA the pipeline impersonates via Workload Identity Federation - no key. Risk: medium, but no standing credential. Common misuse: storing a long-lived SA JSON key as a CI secret.

Common Google Cloud IAM mistakes

  • Basic roles (Owner/Editor/Viewer) too broadly - use predefined roles scoped to the task.
  • Granting at org level unnecessarily - inherits everywhere; grant at project/resource.
  • Long-lived service account keys - the top serious incident; use impersonation / workload identity and disable key creation.
  • Not using service-account impersonation - humans and CI should mint short-lived tokens, not hold keys.
  • Confusing service accounts with human users - different lifecycles, different controls.
  • Not using groups - per-user bindings are unmanageable and invisible in reviews.
  • Not understanding inheritance - a broad grant high in the hierarchy silently reaches prod.
  • Not using conditional IAM where a resource/time/tag condition would tighten a grant.
  • Ignoring VPC Service Controls for sensitive data - IAM alone does not stop exfiltration.
  • Not reviewing audit logs - Admin Activity and Data Access logs are your evidence and detection.

Google Cloud access troubleshooting mental model

When a request is denied (or unexpectedly allowed), walk the layers in order. Most "permission denied" tickets are one of these:

"Permission denied" - the checklist
  1. Which org / folder / project is the resource in? (Wrong project selected is the #1 cause.)
  2. Which principal is making the request - user, group, service account, or federated identity? (For workloads, which SA is actually attached?)
  3. What role is assigned, and does it contain the required permission?
  4. At what scope is it granted (resource / project / folder / org)? Does inheritance reach this resource?
  5. Is there an IAM deny policy matching this principal + permission?
  6. Is an Organization Policy blocking the action (e.g. location, external IP, SA key)?
  7. Is VPC Service Controls blocking it (cross-perimeter access to a managed service)?
  8. Is the API enabled in the project?
  9. For workloads acting on other resources: is the service account permitted to act on that resource, and does the caller have actAs / tokenCreator on the SA?
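
The checklist is ordered, so it can be walked mechanically: stop at the first layer that fails or has not been verified yet. The sketch below is a triage aid under that assumption; the check keys, the rephrased questions, and the `first_failure` helper are hypothetical, and the answers are supplied by a human or by other tooling.

```python
from typing import Optional

# The nine checklist questions, in order, phrased so that True means "fine".
CHECKS = [
    ("project", "Is the resource in the org / folder / project you expect?"),
    ("principal", "Is the request made by the principal you expect?"),
    ("role", "Does the assigned role contain the required permission?"),
    ("scope", "Does the grant's scope reach this resource by inheritance?"),
    ("deny_policy", "Is the request free of a matching IAM deny policy?"),
    ("org_policy", "Is the action free of a blocking Organization Policy?"),
    ("vpc_sc", "Is the request free of a VPC Service Controls block?"),
    ("api_enabled", "Is the API enabled in the project?"),
    ("act_as", "Does the caller have actAs / tokenCreator where needed?"),
]


def first_failure(results: dict) -> Optional[tuple]:
    """Return (step, question) for the first failed or unverified check."""
    for step, (key, question) in enumerate(CHECKS, start=1):
        if not results.get(key, False):
            return step, question
    return None


ticket = {key: True for key, _ in CHECKS}
ticket["org_policy"] = False
print(first_failure(ticket))
```

Treating a missing answer as a failure keeps the walk honest: a layer counts as cleared only once someone has actually checked it.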

Tools

Policy Troubleshooter (explains allow/deny for a specific principal+resource+permission), Policy Analyzer (who-can-do-what), IAM Recommender (over-grants), and Cloud Audit Logs (the denied request with the reason).

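# Show the allow policy set directly on the project (inherited grants are not listed)
gcloud projects get-iam-policy PROJECT_ID
# Policy Analyzer: what can this identity access across the org?
gcloud asset analyze-iam-policy --organization=ORG_ID \
  --identity="user:jane@example.com"
# Is the API enabled?
gcloud services list --enabled --project PROJECT_ID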

3. Networking Deep Dive

The global VPC, regional subnets, firewall model, private access to Google APIs, Shared VPC, and hybrid connectivity - plus the traffic-flow reasoning you need to design and debug real Google Cloud networks.

Last reviewed: July 2026. Verify quotas, firewall behavior, and connectivity limits in current docs.
TL;DR

A VPC is a global object; its subnets are regional with a CIDR you choose (use custom mode, never auto mode in production). Firewall rules are stateful, have a priority and direction, and there is an implied deny-ingress / allow-egress at the bottom. Private workloads reach Google APIs via Private Google Access or Private Service Connect, and reach the internet outbound-only via Cloud NAT. Shared VPC centralizes networking in a host project; VPC peering is not transitive. Plan non-overlapping CIDRs before anything else.

VPC networks and CIDR planning

A VPC network is a global, software-defined private network. Unlike most clouds, one VPC spans every region; you add a subnet per region as you expand, and resources in different regions on the same VPC route to each other over Google's backbone with no peering.

  • Custom mode vs auto mode: auto mode auto-creates a subnet in every region from a fixed 10.128.0.0/9 range - convenient but it hands you overlapping, non-planned CIDRs. Always use custom mode in production and assign subnets deliberately.
  • Subnets are regional; a subnet has a primary CIDR and optional secondary ranges (used for GKE pods/services alias IPs).
  • CIDRs must not overlap with each other, with peered/shared VPCs, or with on-premises. Overlap is the number-one cause of hybrid that "connects but won't route."
  • You can expand a subnet's primary range but not shrink it; plan generously so you rarely need to.
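
Overlap is easy to check before anything is connected. The sketch below uses Python's standard `ipaddress` module; the range names and every CIDR except the auto-mode 10.128.0.0/9 block are hypothetical examples.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges: dict) -> list:
    """Return every pair of named CIDR ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


plan = {
    "auto-mode-vpc": "10.128.0.0/9",       # the fixed auto-mode range
    "on-prem-dc": "10.200.0.0/16",         # hypothetical on-premises range
    "custom-us-central1": "10.10.0.0/20",  # hypothetical planned subnet
    "gke-pods-us-central1": "10.64.0.0/14",
}
print(find_overlaps(plan))  # [('auto-mode-vpc', 'on-prem-dc')]
```

Running a check like this against the IPAM export on every change catches the "connects but won't route" case at review time rather than after the VPN is up.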
Architect note - a workable IP plan
Reserve a large private supernet for GCP, allocate a block per environment, and a subnet per region per tier, leaving room for GKE secondary ranges (pods need a lot of IPs). Keep a documented IPAM. Because the VPC is global, you do not make one VPC per region - you make one (or a few) VPCs with a regional subnet where you deploy. Overlap is a re-IP project later; a too-large plan costs nothing.
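
An IP plan of this shape can be generated rather than hand-typed. The sketch below carves a supernet into one block per environment and then one subnet per region and tier, using the standard `ipaddress` module; the 10.0.0.0/12 supernet, the prefix lengths, and the environment and subnet names are assumptions chosen for illustration, not recommendations.

```python
import ipaddress


def allocate(block, names, new_prefix):
    """Hand out consecutive sub-blocks of the given prefix length, in order."""
    return dict(zip(names, block.subnets(new_prefix=new_prefix)))


# Hypothetical supernet reserved for GCP, split into a /14 per environment.
supernet = ipaddress.ip_network("10.0.0.0/12")
env_blocks = allocate(supernet, ["prod", "nonprod", "shared"], 14)

# Within prod, a /20 per region per tier; the rest stays free for GKE ranges.
prod_subnets = allocate(
    env_blocks["prod"],
    ["us-central1-web", "us-central1-db", "europe-west1-web"],
    20,
)

for name, net in prod_subnets.items():
    print(f"{name:18} {net}")
```

Because the allocation is sequential and deterministic, the same script doubles as documentation for the IPAM record.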
Common mistake
Using auto-mode VPCs (or the default network) in production. They pre-create subnets in every region with fixed ranges you did not plan, which collide with on-prem and other VPCs the moment you connect them. Delete/avoid the default network; build a custom-mode VPC against your enterprise IP plan.

Firewall rules and hierarchical firewall policies

GCP firewalls are stateful and evaluated per VM by priority (lower number wins). Every network has two implied rules at priority 65535: deny all ingress and allow all egress. You open what you need above that.

| Concept | Detail |
|---|---|
| Direction | Ingress (to targets) or egress (from targets). Rules are directional - a common source of confusion. |
| Priority | 0-65535, lower = higher priority; first match wins. Implied deny-ingress/allow-egress sit at 65535. |
| Targets | All instances, by network tag, or by service account (prefer SA targeting - it can't be self-assigned like tags). |
| Source/dest | CIDR ranges, source tags/SAs, or (for some) source service. |
| Hierarchical firewall policies | Rules set at org/folder that apply to all VPCs beneath - central guardrails (e.g. allow IAP range, deny risky ports) evaluated before VPC-level rules. |
| Firewall Rules Logging | Log matched connections per rule - use it to see what is actually being allowed/denied. |
Security note - target by service account, not tag
Network tags are just strings any instance-admin can add to a VM, so a "db-allowed" tag rule can be self-granted. Prefer targeting firewall rules by service account: an attacker can't attach an SA to a VM without actAs permission. Use hierarchical policies for org-wide guardrails (allow the IAP range 35.235.240.0/20 for SSH, deny egress to known-bad ranges) so individual teams can't undo them.
Common mistake
Forgetting the implied deny-ingress and expecting traffic to flow, or forgetting that egress is implicitly allowed and leaving VMs able to reach anywhere outbound. Also: writing a broad 0.0.0.0/0 allow-ingress on port 22 instead of allowing only the IAP range and using IAP for SSH.
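
The priority and implied-rule behaviour described above can be modelled in a few lines. This is a deliberately simplified sketch, not how Google Cloud implements it: it ignores targets, protocols, statefulness, and hierarchical policies, and the Rule class and rule names are made up for the example. It does encode lower-number-wins, deny-beats-allow on a priority tie, and the two implied rules at 65535.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    direction: str     # "INGRESS" or "EGRESS"
    priority: int      # 0-65535, lower number wins
    action: str        # "allow" or "deny"
    ranges: tuple      # source CIDRs for ingress, destination CIDRs for egress
    ports: tuple = ()  # empty tuple means all ports


# The two implied rules every VPC network has.
IMPLIED = (
    Rule("implied-deny-ingress", "INGRESS", 65535, "deny", ("0.0.0.0/0",)),
    Rule("implied-allow-egress", "EGRESS", 65535, "allow", ("0.0.0.0/0",)),
)


def evaluate(rules, direction, peer_ip, port):
    """Return the first matching rule by priority; deny wins a priority tie."""
    ip = ipaddress.ip_address(peer_ip)
    ordered = sorted(
        list(rules) + list(IMPLIED),
        key=lambda r: (r.priority, r.action != "deny"),
    )
    for rule in ordered:
        if rule.direction != direction:
            continue
        if rule.ports and port not in rule.ports:
            continue
        if any(ip in ipaddress.ip_network(cidr) for cidr in rule.ranges):
            return rule
    raise RuntimeError("unreachable for IPv4: the implied rules match everything")


# Hypothetical rule set: SSH allowed only from the IAP range.
rules = [Rule("allow-iap-ssh", "INGRESS", 1000, "allow", ("35.235.240.0/20",), (22,))]

print(evaluate(rules, "INGRESS", "35.235.240.10", 22).name)
print(evaluate(rules, "INGRESS", "203.0.113.5", 22).name)
print(evaluate(rules, "EGRESS", "8.8.8.8", 443).name)
```

The second and third calls fall through to the implied rules, which is exactly the behaviour the common mistake above overlooks.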

Routes, Cloud Router, and Cloud NAT

  • Routes - system-generated (subnet routes, default internet route) plus custom static routes; dynamic routes are learned via Cloud Router (BGP) from VPN/Interconnect.
  • Cloud Router - the BGP speaker for hybrid connectivity and for regional dynamic routing; advertises your subnets to on-prem and learns on-prem routes.
  • Cloud NAT - managed, outbound-only NAT so instances with no external IP can reach the internet (patches, external APIs). It is regional and requires a Cloud Router. It does not allow inbound.
  • External vs internal IPs - internal (RFC1918) always; external only when you truly need internet-facing exposure. Block external IPs by Org Policy where possible.
Common mistake
Expecting Cloud NAT to make a service reachable from the internet - it only handles outbound. Inbound comes from a load balancer or an external IP. Also: assuming a private VM can reach the internet with no NAT and no external IP - it can't; add Cloud NAT (or use Private Google Access for Google APIs).

Private Google Access, Private Service Connect, private services access

Three different "private" mechanisms are constantly confused; Serverless VPC Access, the last row of the table, is often mixed in with them:

Mechanism | What it does | Use for
Private Google Access (PGA) | Lets VMs with only internal IPs reach Google APIs/services (storage.googleapis.com, etc.) without an external IP | Private VMs calling Google APIs (Cloud Storage, BigQuery, Artifact Registry)
Private Service Connect (PSC) | A private endpoint (internal IP) in your VPC that maps to a Google API bundle or a published service | Private, controlled access to Google APIs or to a service in another VPC/producer
Private services access (PSA) | A VPC peering to a Google-managed producer VPC for services like Cloud SQL private IP, Memorystore | Giving managed services a private IP reachable from your VPC
Serverless VPC Access | A connector that lets serverless (Cloud Run/Functions/App Engine) reach VPC internal IPs | Serverless calling private resources (a private Cloud SQL, an internal service)
Architect note - PGA vs PSC
Enable Private Google Access on a subnet so private VMs can reach Google APIs over internal IPs (pair with a route/DNS to private.googleapis.com). Use Private Service Connect when you want a specific internal endpoint IP (for tighter control, VPC-SC alignment, or to consume a published producer service). For managed databases' private IP, you need private services access (a reserved range + peering). These are not interchangeable - pick by the resource you are reaching.

Shared VPC and VPC peering

  • Shared VPC - a host project owns the VPC and subnets; service projects attach and deploy resources into shared subnets. Networking is centralized (one team owns IP space, firewall, connectivity) while app teams keep their own projects for IAM/billing. The enterprise default for multi-project networking.
  • VPC Network Peering - connects two VPCs privately. Crucially, peering is not transitive: if A peers B and B peers C, A cannot reach C. And you cannot peer overlapping ranges.
  • Network Connectivity Center (NCC) - a hub-and-spoke model to interconnect many VPCs and hybrid links through a central hub, addressing peering's non-transitivity at scale.
Common mistake - assuming peering is transitive
Teams build A↔B and B↔C peerings and expect A to reach C through B. It does not work - VPC peering is non-transitive, and it does not forward to on-prem via a peered VPC either. Use Shared VPC for centralized networking, or Network Connectivity Center for a transitive hub, rather than chains of peerings.
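
The non-transitivity rule above is easy to check mechanically before anyone builds a peering chain. The sketch below treats peerings as undirected edges and counts only direct neighbours as reachable; the VPC names are hypothetical.

```python
def peered_reachability(peerings):
    """Map each VPC to the VPCs it can reach: direct peers only, never peers-of-peers."""
    reach = {}
    for a, b in peerings:
        reach.setdefault(a, set()).add(b)
        reach.setdefault(b, set()).add(a)
    return reach


def missing_paths(peerings, required):
    """Return the required (source, destination) pairs that have no direct peering."""
    reach = peered_reachability(peerings)
    return [(src, dst) for src, dst in required if dst not in reach.get(src, set())]


# Hypothetical topology: A peers B, B peers C.
peerings = [("vpc-a", "vpc-b"), ("vpc-b", "vpc-c")]
required = [("vpc-a", "vpc-b"), ("vpc-a", "vpc-c")]

print(missing_paths(peerings, required))
```

Any pair this reports needs its own peering, a Shared VPC, or a Network Connectivity Center hub rather than a longer chain.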

Hybrid connectivity: Cloud VPN and Interconnect

Aspect | HA VPN | Dedicated / Partner Interconnect
Path | Over the internet, IPSec-encrypted | Private physical connection (direct or via partner)
Bandwidth | Per-tunnel (Gbps-class aggregate with multiple tunnels) | 10/100 Gbps (Dedicated); flexible sizes (Partner)
SLA / latency | Best-effort internet; HA VPN offers an SLA with the right topology | Consistent, low latency; higher SLA
Setup | Minutes | Days-weeks (provisioning)
Use as | Quick start / backup / lower bandwidth | Primary enterprise link, large data, low latency

Both use Cloud Router for BGP. HA VPN uses two interfaces for a 99.99% topology. Common pattern: Interconnect primary + HA VPN backup, with BGP preferring Interconnect.

Cloud DNS

  • Public zones for internet-facing names; private zones for internal resolution within (and across, via peering) your VPCs.
  • DNS peering and forwarding integrate with on-prem DNS (inbound/outbound server policies) for hybrid name resolution.
  • Managed private zones for *.googleapis.com (e.g. private.googleapis.com / restricted.googleapis.com) route Google-API traffic privately for PGA/VPC-SC.

How traffic flows in Google Cloud

  1. Destination inside the same VPC (any region)? Routes locally over the backbone - only firewall rules apply.
  2. Outside the VPC? The route table (subnet/static/dynamic routes) picks the next hop: default internet route (needs external IP or Cloud NAT), a VPN/Interconnect route (via Cloud Router), or a peering route.
  3. Firewall (hierarchical policies, then VPC rules, by priority) must allow it - remember implied deny-ingress / allow-egress.
  4. For Google APIs from private VMs: PGA/PSC + DNS to the private API endpoint.

Debugging is almost always: is there a route to the right next hop? does firewall allow it (both directions/priority)? is external IP / Cloud NAT / PGA in place for the destination type?

Reference diagrams

Three-tier with global external Application Load Balancer

[Diagram: Internet → global external Application LB + Cloud Armor → web/app tier in a global VPC with a subnet per region (regional MIG across zone-a and zone-b, no external IP, tag/SA: app) → data tier (Cloud SQL HA on private IP via PSA, or an internal service, SA: db); Cloud NAT and PGA/PSC sit alongside for egress.]
Global LB + Cloud Armor front a regional MIG (no external IPs); Cloud SQL on private IP; egress via Cloud NAT; Google APIs via PGA/PSC.

Shared VPC (centralized networking)

[Diagram: host project (net team) owning the Shared VPC, subnets, firewall, Cloud NAT, DNS, and hybrid links; service projects app-a (VMs/GKE), app-b (Run/GKE), and data (Cloud SQL private IP) deploying into the shared subnets.]
One host project owns the network; service projects deploy into shared subnets - central IP/firewall control, per-project IAM and billing.

Private Google Access & Cloud NAT egress

[Diagram: VM in a private subnet (PGA on, no external IPs) reaching private.googleapis.com (Google APIs) via PGA and the internet via Cloud NAT egress.]
Private Google Access sends Google-API traffic privately; Cloud NAT handles outbound internet - no external IPs on the VMs.

Network Intelligence Center

Tool | What it gives you
Connectivity Tests | Static reachability analysis A→B: tells you the exact firewall rule / route / config blocking a path. First stop for "cannot reach."
Network Analyzer | Automatic detection of misconfigurations (shadowed routes, unused rules, IP exhaustion, sub-optimal config).
VPC Flow Logs | Sampled connection records for monitoring, forensics, and "is my rule dropping this?"
Firewall Rules Logging / Packet Mirroring | Per-rule connection logs; mirror traffic to an IDS/collector for deep inspection.
Start with Connectivity Tests
Before hand-checking rules, run a Connectivity Test for the source, destination, protocol, and port. It evaluates routes, firewall rules (incl. hierarchical), and config and names the first blocker - turning a long hunt into a quick answer.

Networking troubleshooting

⚑ VM cannot reach the internet / cannot download patches

Likely causes & checks

  • No external IP and no Cloud NAT for the subnet's region - private VMs need Cloud NAT for outbound.
  • Egress firewall (or a hierarchical policy) denies the destination, or a deny rule at higher priority matches.
  • Default internet route removed/overridden by a custom route.
  • OS firewall on the VM.

Fix / prevention

Add Cloud NAT (+ Cloud Router) for the region; for OS/package repos, Google mirrors are reachable via PGA. Standardize NAT + PGA in the subnet module.

gcloud compute routers nats describe NAT --router=RT --region=REGION
gcloud compute firewall-rules list --filter="direction=EGRESS"
⚑ VM cannot reach Google APIs privately

Causes: Private Google Access not enabled on the subnet; DNS not resolving *.googleapis.com to the private VIP; no route to private.googleapis.com (199.36.153.8/30); firewall egress blocking 443 to that range; or VPC-SC perimeter blocking. Fix: enable PGA on the subnet, add the private-API DNS zone + route, allow egress 443 to the restricted/private VIP range. Use a Connectivity Test to confirm.
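
One quick way to test the DNS cause above is to resolve an API hostname on the VM and see which range the answer falls in. The sketch below is an illustration under assumptions: the private range is the one quoted above, the restricted.googleapis.com range (199.36.153.4/30) is added from Google's published values, and the function names are made up for the example.

```python
import ipaddress
import socket

PRIVATE_VIP = ipaddress.ip_network("199.36.153.8/30")     # private.googleapis.com
RESTRICTED_VIP = ipaddress.ip_network("199.36.153.4/30")  # restricted.googleapis.com


def classify_api_address(ip):
    """Say whether an address is the private VIP, the restricted VIP, or anything else."""
    addr = ipaddress.ip_address(ip)
    if addr in PRIVATE_VIP:
        return "private"
    if addr in RESTRICTED_VIP:
        return "restricted"
    return "public"


def check_host(hostname="storage.googleapis.com"):
    """Resolve a hostname on this machine and classify each IPv4 answer."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return {ip: classify_api_address(ip) for ip in addresses}


print(classify_api_address("199.36.153.10"))
print(classify_api_address("199.36.153.5"))
```

Calling check_host() on the affected VM and getting "public" answers points at the DNS zone rather than at routes or firewall rules.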

⚑ Application cannot connect across VPCs / Shared VPC issue

Causes: relying on transitive peering (not supported); overlapping CIDRs; missing firewall allowing the peer range; for Shared VPC, the service project's SA lacks compute.networkUser on the subnet, or resources were created in the wrong (local) network. Fix: use Shared VPC or NCC instead of peering chains; grant networkUser; ensure resources deploy into the shared subnet; open firewall for the source range.

⚑ On-premises cannot reach Google Cloud

Causes: CIDR overlap; Cloud Router not advertising the subnet, or on-prem not advertising its routes; VPN tunnel down (IKE mismatch) or Interconnect/BGP down; firewall not allowing the on-prem range. Fix: resolve overlap, verify BGP advertisements both ways, check tunnel/attachment state, open firewall. Console: Hybrid Connectivity > VPN / Interconnect; Cloud Routers.

⚑ Load balancer backend unhealthy

Causes: firewall not allowing the health-check ranges (35.191.0.0/16 and 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app not listening or bound to localhost; wrong backend-service protocol. Fix: allow the health-check ranges to the backend SA/tag on the port; align the health check; bind to 0.0.0.0. Full flow in section 7.
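
The first cause above can be checked against an exported list of a rule's source ranges. This minimal sketch uses the two health-check ranges quoted above; the function name is hypothetical, and it only recognises a probe range that sits entirely inside a single allowed CIDR, so a range covered by several smaller CIDRs would be reported as missing.

```python
import ipaddress

HEALTH_CHECK_RANGES = ("35.191.0.0/16", "130.211.0.0/22")


def uncovered_health_check_ranges(allowed_source_ranges):
    """Return the health-check ranges not fully contained in any allowed source CIDR."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowed_source_ranges]
    missing = []
    for cidr in HEALTH_CHECK_RANGES:
        probe = ipaddress.ip_network(cidr)
        if not any(probe.subnet_of(a) for a in allowed):
            missing.append(cidr)
    return missing


# Hypothetical ingress rule that forgot one of the two ranges.
print(uncovered_health_check_ranges(["35.191.0.0/16", "10.0.0.0/8"]))
```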

⚑ DNS / firewall / route / Cloud NAT / PSC issue

Method: run a Connectivity Test (names the blocking rule/route), then Network Analyzer for config issues and VPC Flow Logs / Firewall Rules Logging to see drops. For Cloud NAT, check port-allocation exhaustion (increase min-ports or enable dynamic port allocation). For PSC, verify the endpoint, the DNS mapping, and that the producer accepts the connection.
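
For the Cloud NAT port-exhaustion check mentioned above, the arithmetic is worth writing down. The sketch assumes each NAT IP address offers 64,512 usable source ports per protocol (65,536 minus the 1,024 well-known ports) and that ports are handed out per VM with static allocation; verify both against current documentation, and treat the result as a rough sizing estimate.

```python
import math

PORTS_PER_NAT_IP = 64512  # assumed: 65,536 minus the 1,024 well-known ports


def vms_per_nat_ip(min_ports_per_vm):
    """How many VMs one NAT IP can serve at a given static per-VM port allocation."""
    return PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count, min_ports_per_vm):
    """Conservative NAT IP count for a fleet, rounding up."""
    return math.ceil(vm_count / vms_per_nat_ip(min_ports_per_vm))


print(vms_per_nat_ip(64))         # small per-VM allocation
print(vms_per_nat_ip(1024))       # chatty workloads need far more ports per VM
print(nat_ips_needed(200, 1024))
```

Raising min-ports fixes exhaustion for busy VMs but shrinks the number of VMs each NAT IP can serve, which is the trade-off dynamic port allocation is meant to soften.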

Google Cloud networking gotchas

  • VPC is global, subnets are regional - don't build one VPC per region, and don't treat a subnet as global.
  • PGA vs PSC vs private services access are different - pick by what you're reaching (Google APIs vs published service vs managed-DB private IP).
  • Overlapping CIDRs break peering and hybrid - plan IP space early, avoid auto mode and the default network.
  • Poor Shared VPC design - decide host/service split and who owns firewall/IP before workloads land.
  • Databases/internal services on external IPs - use private IP + private access; block external IPs by Org Policy.
  • Firewall direction & priority - rules are directional and first-match-by-priority; the implied deny-ingress/allow-egress is easy to forget.
  • Peering is not transitive - use Shared VPC or NCC for hub-and-spoke.
  • Cloud NAT is outbound only - inbound needs an LB or external IP.
  • APIs not enabled - many "network" failures are actually a disabled API (compute, dns, servicenetworking).
  • Egress & inter-region charges - internet egress and cross-region traffic are metered; keep chatty services co-located and use private access.

4. Compute Deep Dive

Compute Engine machine families, managed instance groups, Spot VMs, and the serverless options (Cloud Run, Functions, App Engine) - how to choose, place, scale, and operate compute on Google Cloud.

Last reviewed: July 2026 Machine families/types and pricing change - verify current shapes in the Console.
TL;DR

Compute Engine VMs come in machine families (general/compute/memory/accelerator-optimized) plus custom types. Use regional managed instance groups + autoscaling for HA and elasticity, instance templates as the blueprint, and Spot VMs for fault-tolerant batch. Prefer OS Login over metadata SSH keys and Shielded VMs by default. For new apps, consider Cloud Run (serverless containers) before managing VMs. Committed use discounts and right-sizing are the main cost levers.

Machine families

Family | Series (examples) | Best for
General purpose | E2 (cost), N2/N2D/N4, C3/C3D, Tau T2D/T2A (scale-out; T2A is Arm) | Web, app, microservices, most workloads - the default
Compute optimized | C2/C2D, H3 | High per-core performance: gaming, HPC, CPU-bound apps
Memory optimized | M1/M2/M3 | Large in-memory: SAP HANA, large databases, in-memory analytics
Accelerator optimized | A2/A3 (NVIDIA GPUs), G2; TPUs separately | AI/ML training & inference, GPU/TPU workloads
Custom machine types | N-series custom vCPU/memory | Right-sizing when predefined shapes waste vCPU or memory (great for licensing)
DBA note - memory-optimized for big databases and SAP
For SAP HANA and large database VMs, the M-series (memory-optimized) shapes provide the certified high memory-to-core ratios. Custom machine types let you tune vCPU/memory to a licensing sweet spot (fewer cores, more RAM) when a per-core-licensed engine is involved. Confirm certification (SAP, Oracle) for the exact shape and OS before committing.

Custom types, Spot VMs, sole-tenant, GPUs/TPUs

Option | What it does | Use when
Spot VMs | Deeply discounted VMs Google can preempt anytime (successor to preemptible) | Fault-tolerant, stateless, restartable batch/CI/render - never stateful prod
Sole-tenant nodes | Physical host dedicated to your project | Compliance/isolation, or per-core licensing that needs host affinity
GPUs | Attach NVIDIA GPUs to VMs (or use A3/G2) | ML training/inference, rendering, HPC
TPUs | Google's custom ML accelerators (Cloud TPU) | Large-scale training/inference on supported frameworks
Reservations | Reserve capacity of a machine type in a zone | Guaranteeing capacity for scale-out or DR failover
Committed use discounts (CUDs) | 1-year or 3-year commitment for a big discount | Steady-state baseline compute (see section 14)
Cost note
Cover steady-state compute with committed use discounts (and benefit automatically from sustained-use discounts on some families), burst on on-demand, and run fault-tolerant batch on Spot VMs (often 60-90% cheaper). Custom machine types stop you paying for vCPU or memory you don't use. These three levers (committed use discounts, Spot VMs, custom machine types) usually beat any re-architecture.
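
A back-of-the-envelope model makes the split above concrete. Every number in this sketch is hypothetical (the hourly rate, both discount percentages, and the workload sizes); it only shows how the three buckets add up, so substitute figures from your own billing data.

```python
def monthly_cost(baseline_vcpus, burst_vcpu_hours, batch_vcpu_hours,
                 on_demand_rate, cud_discount, spot_discount, hours_per_month=730):
    """Split a month's compute bill into committed, on-demand burst, and Spot batch."""
    committed = baseline_vcpus * hours_per_month * on_demand_rate * (1 - cud_discount)
    burst = burst_vcpu_hours * on_demand_rate
    batch = batch_vcpu_hours * on_demand_rate * (1 - spot_discount)
    return {
        "committed": round(committed, 2),
        "burst": round(burst, 2),
        "batch": round(batch, 2),
        "total": round(committed + burst + batch, 2),
    }


def all_on_demand(baseline_vcpus, burst_vcpu_hours, batch_vcpu_hours,
                  on_demand_rate, hours_per_month=730):
    """The same workload priced entirely at the on-demand rate, for comparison."""
    vcpu_hours = baseline_vcpus * hours_per_month + burst_vcpu_hours + batch_vcpu_hours
    return round(vcpu_hours * on_demand_rate, 2)


# Hypothetical workload: 100 steady vCPUs, 1,000 burst vCPU-hours, 10,000 batch vCPU-hours.
mixed = monthly_cost(100, 1000, 10000, on_demand_rate=0.04, cud_discount=0.5, spot_discount=0.75)
baseline = all_on_demand(100, 1000, 10000, on_demand_rate=0.04)
print(mixed)
print(baseline)
```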

Images, machine images, templates

  • Public images (Debian, Ubuntu, RHEL, Windows Server, etc.) and custom images (your golden image).
  • Machine images capture a full VM (disks + metadata) for cloning/backup; images capture a boot disk.
  • Instance templates define shape, image, disks, network, metadata/startup script - the blueprint for MIGs.
  • Startup scripts (and cloud-init on supported images) bootstrap a VM on boot; the metadata server (169.254.169.254) exposes metadata and workload credentials.
Architect note - golden image + startup script
Bake slow-changing config (hardening, agents, base packages) into a custom image; use a startup script for fast-changing wiring (app version, config). This keeps MIG scale-ups fast and identical. Automate image builds (e.g. with a pipeline) so images are reproducible and patched.

Managed instance groups & autoscaling

Building block | Role
Instance template | Immutable blueprint for the VMs.
Managed Instance Group (MIG) | Creates/maintains identical VMs from a template. Regional MIGs spread across zones for HA; zonal MIGs don't.
Autoscaling | Scales the MIG by CPU, LB utilization, custom metrics, or schedule.
Autohealing | Recreates VMs failing a health check.
Rolling updates / canary | Update the template and roll instances gradually (surge/max-unavailable).
Common mistake
Running a single zonal VM (or a zonal MIG) for a "production" service - a zone maintenance event or failure takes it down. Use a regional MIG across zones with autohealing behind a load balancer, and regional persistent disks for stateful cases.
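
To build intuition for CPU-based autoscaling, the sketch below uses a generic target-tracking calculation: scale the group so that the observed load would land at the target utilization, then clamp to the configured bounds. It is an approximation for reasoning about sizing, not the autoscaler's actual algorithm (which also applies stabilization and initialization periods), and the function name and inputs are hypothetical.

```python
def recommended_size(current_size, observed_cpu_pct, target_cpu_pct, min_size, max_size):
    """Instances needed to bring average CPU to the target, clamped to the MIG bounds."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-(current_size * observed_cpu_pct) // target_cpu_pct)
    return max(min_size, min(max_size, needed))


print(recommended_size(4, 90, 60, min_size=2, max_size=10))  # overloaded: scale out
print(recommended_size(4, 20, 60, min_size=2, max_size=10))  # idle: scale in to the floor
print(recommended_size(4, 95, 60, min_size=2, max_size=5))   # capped by max_size
```

For a regional MIG, keep min_size at two or more so that losing one zone never leaves the service with zero instances.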

Shielded VMs, Confidential VMs, OS Login

  • Shielded VM - secure boot, vTPM, integrity monitoring; enable by default (some Org Policies require it).
  • Confidential VM - memory encrypted in use (AMD SEV / Intel TDX) for sensitive workloads.
  • OS Login - manage SSH access via IAM (and 2FA) instead of project/instance metadata SSH keys. Enforce it with constraints/compute.requireOsLogin.
  • IAP for SSH/RDP - reach VMs with no external IP through Identity-Aware Proxy, gated by IAM.
Security note - OS Login + IAP, no external IPs
The modern secure pattern: VMs have no external IP, admins connect via IAP (IAM-gated, no bastion), and SSH access is governed by OS Login (IAM roles + optional 2FA) rather than shared metadata keys. This removes public SSH exposure and ties every login to an IAM identity you can audit and revoke.

Serverless & managed compute

Service | What it is | Use for
Cloud Run | Serverless containers, scale-to-zero, request- or job-based | Most new stateless services/APIs and batch jobs - the default serverless choice
Cloud Functions | Event-driven functions (now Cloud Run functions) | Small event handlers, glue, automation
App Engine | PaaS for web apps (standard/flex) | Legacy/existing App Engine apps; new work usually goes to Cloud Run
Batch | Managed batch job scheduling on Compute Engine | Large batch/HPC jobs without managing a scheduler
Architect note - reach for Cloud Run first
For a new stateless service, start with Cloud Run: no VMs to patch, scale-to-zero, per-request billing, and a container you can run anywhere. Drop to Compute Engine/GKE only when you need persistent state, specialized kernels/GPUs at fine control, long-lived connections, or an orchestration ecosystem. This inverts the old "spin up a VM" reflex and cuts a lot of ops.

Choosing compute by workload

Workload | Starting point
Web / API (stateless) | Cloud Run; or regional MIG (E2/N2/T2D) behind a global LB
Middleware | Regional MIG on N-series, memory-leaning
Databases (self-managed) | N2/C3 or M-series (large), regional PD/Hyperdisk; prefer managed DB (section 6)
Oracle workloads | Compute Engine VM (self-managed Oracle) or the Oracle Database@Google service; sole-tenant for licensing
SAP | Memory-optimized M-series (certified), reservations
Batch / CI / render | Spot VMs in a MIG, or Batch service
Memory-heavy | M-series or custom high-memory
CPU-heavy | C2/C3/H3 compute-optimized
GPU / AI training | A3/A2/G2 with GPUs, or TPUs; consider Vertex AI (section 12)
Cost-sensitive / spiky | E2 or T2D + autoscaling + CUDs; Spot for fault-tolerant parts
Event-driven / bursty | Cloud Run / Cloud Functions (scale to zero)

Operational guidance

Resize a Compute Engine VM (Ops)
  • Stop the VM, change the machine type (or custom vCPU/memory), start it - brief downtime. In a MIG, update the template and roll.
  • Changing architecture (x86 ↔ Arm/T2A) is a rebuild, not a resize - watch for arch-specific binaries.
Patch VMs safely (Ops)
  • Use VM Manager (OS patch management) for scheduled, reported patching across a fleet; combine with OS inventory.
  • For MIGs, prefer replacing instances from a new patched image (immutable) over in-place patching.
Troubleshoot boot / SSH / high CPU / memory / disk (Ops)
  • Boot/SSH: use the serial console to read boot output; verify OS Login/IAM roles and the IAP firewall range; check the VM isn't stopped.
  • High CPU/memory: Cloud Monitoring (install the Ops Agent for memory metrics, which aren't collected by default); right-size or autoscale.
  • Disk full: resize the persistent disk online, then grow the filesystem; alert at 85%.
  • Disk attach: confirm the disk is attached and in the same zone; format/mount and add to fstab by UUID.
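
The 85% disk alert above is normally configured in Cloud Monitoring, but the same threshold can be checked locally with the standard library, for example from a cron job or a startup check. A minimal sketch; the function names and the default mount point are assumptions.

```python
import shutil


def is_over_threshold(used_bytes, total_bytes, threshold_pct=85):
    """True when used space is at or above the threshold percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100 >= threshold_pct * total_bytes


def disk_alert(path="/", threshold_pct=85):
    """Check a mounted filesystem and return (alert, used percentage)."""
    usage = shutil.disk_usage(path)
    used_pct = round(usage.used / usage.total * 100, 1)
    return is_over_threshold(usage.used, usage.total, threshold_pct), used_pct


if __name__ == "__main__":
    alert, used_pct = disk_alert("/")
    print(f"/ is {used_pct}% full; alert={alert}")
```

When the alert fires, the fix is the one described above: resize the persistent disk online, then grow the filesystem.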
Design compute for production HA (Design)
  • Regional MIG across ≥2 zones + autohealing behind a load balancer.
  • Instance template + autoscaling for elasticity and reproducibility.
  • No external IPs; OS Login + IAP for access; Shielded VMs.
  • Ops Agent for metrics/logs; image pipeline for patched golden images; regional disks or managed data services for state.
  • Reservations/CUDs for capacity + cost; a second region for DR.
Operations note - live migration
Compute Engine live-migrates most VMs during host maintenance with no reboot (host maintenance policy = migrate), so infrastructure maintenance is largely transparent. GPU and some specialized VMs are set to terminate instead - design those workloads to tolerate a maintenance restart, and use MIG autohealing.

5. Storage Deep Dive

Block (Persistent Disk, Hyperdisk, Local SSD), file (Filestore), and object (Cloud Storage) storage - their scope, performance, durability, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes.

Last reviewed: July 2026 Verify disk types, storage classes, and retrieval behavior in current docs.
TL;DR

Persistent Disk / Hyperdisk = network block storage for VMs (zonal or regional). Local SSD = ultra-fast but ephemeral (data lost on stop). Filestore = managed NFS for shared filesystems. Cloud Storage = object storage (buckets/objects) for backups, data lakes, static content, and archives - not a filesystem. Choose block for boot/DB, Filestore for shared POSIX, Cloud Storage for objects. Lock down buckets with uniform bucket-level access + public access prevention.

Block storage: Persistent Disk, Hyperdisk, Local SSD

Type | Scope | Notes
Persistent Disk (pd-balanced/ssd/standard/extreme) | Zonal or regional (synchronously replicated across 2 zones) | Network block storage; resize online; snapshots. Regional PD is a key HA building block for stateful VMs.
Hyperdisk | Zonal (some regional) | Next-gen block storage with independently tunable IOPS/throughput (Balanced/Throughput/Extreme/ML) - decouple performance from capacity.
Local SSD | Zonal, attached to the VM's host | Ephemeral - data is lost when the VM is stopped or deleted, or when its host fails; it survives a guest reboot and live migration. Highest IOPS; only for scratch/cache/temp.
DBA note - size performance, and never put a DB on Local SSD alone
Database performance on VMs is usually a disk IOPS/throughput ceiling, not CPU. Use Hyperdisk (or pd-ssd/pd-extreme) sized for the I/O profile; performance scales with provisioned IOPS/throughput and, for PD, with size. Local SSD is ephemeral - only use it for temp/redo-scratch that you can rebuild, never as the sole home of datafiles. For stateful HA, use regional PD so the disk survives a zone loss.

Filestore

Filestore is managed NFS (v3) for shared POSIX filesystems mounted by many VMs/GKE pods. Tiers (Basic, Zonal, Regional, Enterprise) trade performance, capacity, and availability. Use it for shared application state, home directories, media/render scratch, and lift-and-shift apps that expect a filesystem.

Security note
Restrict Filestore access with the VPC firewall (NFS ports) to the client subnet/SA, keep it on private IPs, and use IAM for management. Treat an over-open NFS share the same as any other data exposure.

Cloud Storage

  • Buckets & objects - a bucket has a globally unique name, a location (region, dual-region, or multi-region), and a default storage class.
  • Storage classes: Standard (hot), Nearline (~30-day), Coldline (~90-day), Archive (~365-day). Colder = cheaper storage, higher retrieval cost / minimum-storage-duration. Autoclass auto-moves objects between classes by access.
  • Lifecycle management - rules to transition class or delete objects by age/version.
  • Versioning keeps prior object versions; Object holds and retention policies + Bucket Lock give WORM/compliance immutability (a locked retention policy cannot be shortened or removed).
  • Access: Uniform bucket-level access (UBLA) (IAM only, no per-object ACLs) + Public access prevention should be the default. Signed URLs grant time-boxed access without IAM.
  • Cloud Storage FUSE mounts a bucket as a filesystem (with caveats - it is still object storage underneath).
  • Transfer: Storage Transfer Service (online, from other clouds/on-prem/HTTP) and Transfer Appliance (physical, for very large datasets).
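
The lifecycle and minimum-duration points above can be sketched as follows. The builder emits the JSON shape that Cloud Storage lifecycle configurations use (a "rule" list of action/condition pairs); the helper names, the example ages, and the validation rule are assumptions for illustration. The billable_days function applies the minimum storage durations listed above to an object created directly in a class.

```python
import json

COLDNESS = {"STANDARD": 0, "NEARLINE": 1, "COLDLINE": 2, "ARCHIVE": 3}
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def build_lifecycle(transitions, delete_after_days=None):
    """Build a lifecycle config from [(age_days, storage_class), ...] plus an optional delete."""
    rules = []
    last_age, last_rank = -1, COLDNESS["STANDARD"]
    for age, storage_class in transitions:
        rank = COLDNESS[storage_class]
        if age <= last_age or rank <= last_rank:
            raise ValueError("each transition must be older and colder than the previous one")
        rules.append({
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age},
        })
        last_age, last_rank = age, rank
    if delete_after_days is not None:
        if delete_after_days <= last_age:
            raise ValueError("delete age must come after the last transition")
        rules.append({"action": {"type": "Delete"}, "condition": {"age": delete_after_days}})
    return {"rule": rules}


def billable_days(storage_class, days_stored):
    """Days charged for an object created in a class and deleted after days_stored."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


# Hypothetical backup policy: Nearline at 30 days, Coldline at 90, delete at 365.
config = build_lifecycle([(30, "NEARLINE"), (90, "COLDLINE")], delete_after_days=365)
print(json.dumps(config, indent=2))
print(billable_days("COLDLINE", 10))
```

The last line is the reason short-lived data does not belong in a cold class: an object deleted after 10 days in Coldline is still billed for the full minimum duration.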
Common mistake - Cloud Storage is not a filesystem
Objects are immutable blobs - no in-place random writes, no POSIX locking, and the "/" in names is just a convention. Don't run a database or a lock-dependent app on a FUSE-mounted bucket. Use block or Filestore for filesystem semantics; use Cloud Storage for whole-object put/get (backups, media, lake data, static sites).
Security note - lock buckets down by default
Turn on uniform bucket-level access and public access prevention at creation (enforce via Org Policy storage.publicAccessPrevention and storage.uniformBucketLevelAccess). Legacy per-object ACLs and allUsers grants are how buckets get accidentally exposed. Use signed URLs for controlled external sharing, keep them short-lived, and inventory them. For backup/compliance buckets, add versioning + a locked retention policy so data can't be deleted early (ransomware/accident protection).

Encryption

  • All data is encrypted at rest by default with Google-managed keys.
  • CMEK (customer-managed encryption keys) via Cloud KMS - you control rotation and can disable a key to render data unreadable. Use for sensitive/regulated data.
  • CSEK (customer-supplied keys) - you provide the raw key (niche; you manage all key handling).
  • Encryption in transit uses TLS across Google's network.
Security note - CMEK for sensitive data
Use CMEK for disks, buckets, and databases holding sensitive data so key control (and the emergency "disable the key" switch) is yours. Keep KMS keys in a locked-down security project, grant only the service agents that need encrypt/decrypt, and rotate on a schedule.

When to use which

Need | Use
VM boot / DB datafiles | Persistent Disk or Hyperdisk (regional PD for zone-HA)
Ultra-fast scratch/cache | Local SSD (ephemeral - rebuildable data only)
Shared POSIX filesystem for many VMs/pods | Filestore
Backups (DB/app) | Cloud Storage (Nearline/Coldline) + lifecycle + retention
Log / long-term archive | Cloud Storage Archive + lifecycle + Bucket Lock
Data lake | Cloud Storage (Standard) - queried by BigQuery/BigLake
Static website / media | Cloud Storage + Cloud CDN
Bulk data into GCP | Storage Transfer Service (online) / Transfer Appliance (physical)
App/VM backup & DR | Backup and DR service; PD snapshots (scheduled)

Practical examples

Database backups to Cloud Storage (DBA)

Managed DBs (Cloud SQL/AlloyDB) back up automatically; for self-managed DBs on VMs, dump/backup to a Cloud Storage bucket over Private Google Access (no internet). Lifecycle to Coldline/Archive; versioning + locked retention for immutability; enable a second-region copy (dual-region bucket or Transfer) for DR.

Data lake on Cloud Storage (Data)

Raw / curated / consumption prefixes in Standard buckets; BigQuery external tables / BigLake read them; Dataplex governs. Lifecycle cold raw data to Nearline/Coldline. See section 11.

Shared filesystem for an app cluster (Apps)

Filestore instance mounted on all app VMs/GKE pods; firewall to the app subnet; Enterprise/Regional tier for HA; snapshots for recovery.

Storage gotchas

  • Cloud Storage is object storage, not a filesystem - no random writes/locks.
  • Archive/Coldline have retrieval and minimum-duration costs - don't put frequently-read or short-lived data there.
  • Persistent Disk is zonal or regional - a zonal disk dies with its zone; use regional PD for HA.
  • Local SSD is ephemeral - never the only copy of anything.
  • Public-bucket mistakes - enforce UBLA + public access prevention by Org Policy.
  • Signed URL risk - they're bearer tokens; keep them short and tracked.
  • Locked retention policy - once Bucket Lock is applied you cannot shorten/delete it (that's the point) - set the duration carefully.
  • Snapshot cost growth - scheduled snapshots accumulate; set retention.
  • Cross-region replication / transfer cost - dual/multi-region and egress cost money and take time; a lagging copy isn't DR.
  • Wrong storage class - Standard for hot data you keep accessing; colder classes only for genuinely cold data.

6. Database Services Deep Dive

Google Cloud's database portfolio - Cloud SQL, AlloyDB, Spanner, Firestore, Bigtable, Memorystore - what each manages for you, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from Oracle.

Last reviewed: July 2026 DB features, versions, and limits change - verify in current docs.
TL;DR

For relational OLTP, start with Cloud SQL (managed PostgreSQL/MySQL/SQL Server); step up to AlloyDB (PostgreSQL-compatible, higher performance + analytics + vector) when you outgrow it, or Spanner for globally-distributed, horizontally-scalable strong consistency. For NoSQL, Firestore (document, app backends) and Bigtable (wide-column, huge scale, time-series). Memorystore for Redis/Valkey/Memcached caching. Choose the least you must manage that meets the workload; managed services own patching/backup/HA, you own schema, queries, and access.

The portfolio at a glance

Service | Model | Sweet spot | You manage | Google manages
Cloud SQL | Managed PostgreSQL / MySQL / SQL Server | Standard relational OLTP, lift-and-shift | Schema, queries, flags, access | Provisioning, patching (in window), backups, HA, replicas
AlloyDB | PostgreSQL-compatible, Google-enhanced | Demanding PostgreSQL, HTAP, PostgreSQL + analytics + vector | Schema, queries, access | Patching, backups, HA, autoscaling read pools
Spanner | Globally-distributed, horizontally scalable, strongly consistent relational | Global OLTP, unlimited scale, five-nines | Schema (different mindset), queries, access | Almost everything - sharding, replication, HA
Firestore | Serverless document NoSQL | Web/mobile app backends, real-time sync | Data model, security rules, indexes | Scaling, replication, HA
Bigtable | Wide-column NoSQL, petabyte-scale, low latency | Time-series, IoT, adtech, huge key-value/analytics | Row-key design (critical), schema | Scaling, replication
Memorystore | Managed Redis / Valkey / Memcached | Cache, session store, leaderboards | Keys/TTL, client | Provisioning, patching, HA

Service deep dives


Cloud SQL (PostgreSQL, MySQL, SQL Server)

  • HA - regional, synchronous standby in another zone with automatic failover (enable HA; it is not on by default).
  • Read replicas - in-region and cross-region replicas for read scaling and DR; a cross-region replica can be promoted for regional DR.
  • Backups - automated daily backups + point-in-time recovery (binary/WAL logging); on-demand backups; you set retention.
  • Patching - Google patches during your maintenance window; you choose timing and get notifications, but you don't control every patch.
  • Connectivity - private IP (via private services access), the Cloud SQL Auth Proxy (IAM-authenticated, encrypted), authorized networks (public IP), and IAM database authentication (Postgres/MySQL).
  • SQL Server - license is included in the price (no BYOL for Cloud SQL SQL Server); watch edition/feature limits.
DBA note
Cloud SQL removes provisioning/patching/backup toil but is not full DBA control: no OS access, limited superuser, controlled flags, and maintenance you schedule but Google performs. Enable HA explicitly, test PITR, and use the Auth Proxy or private IP - never expose a public DB endpoint with broad authorized networks.

AlloyDB for PostgreSQL

PostgreSQL-compatible, Google-built engine aimed at demanding transactional and mixed (HTAP) workloads: a columnar accelerator for analytics, autoscaling read pools, and strong price/performance vs. self-managed Postgres. Supports vector search (pgvector + Google enhancements) for AI/RAG on operational data.

  • HA - regional with automatic failover; read pools scale reads.
  • Backups / PITR - continuous backup with point-in-time recovery.
  • Use it when Cloud SQL Postgres runs out of headroom, when you want analytics on operational data without a separate warehouse, or for Postgres-native vector search at scale. AlloyDB Omni runs the engine on-prem/other clouds.
DBA note
AlloyDB is PostgreSQL-compatible, not Oracle-compatible. Migrating Oracle to AlloyDB is a Postgres migration (schema and PL/SQL conversion via Database Migration Service + the Oracle-to-Postgres tooling), not a lift-and-shift. Plan for datatype, PL/SQL, and feature differences.

Spanner

Google's globally-distributed relational database: horizontal scale to virtually unlimited throughput, external (strong) consistency across regions, and up to 99.999% availability - no manual sharding. It exposes a SQL interface in two dialects, GoogleSQL and PostgreSQL.

  • Scaling - add compute (nodes/processing units); storage and throughput scale with it. No failover to manage.
  • HA/DR - multi-region configurations replicate synchronously across regions; regional configs across zones.
  • Mindset shift - schema and primary-key design must avoid hotspots (no monotonically increasing keys); interleaving models parent-child locality. Not a drop-in for a single-node RDBMS.
Architect note - when Spanner is right
Choose Spanner when you genuinely need global scale + strong consistency + high availability beyond what a single primary can give (global user base, unbounded write scale, five-nines). For a normal regional app database, Spanner is over-powered and pricier than Cloud SQL/AlloyDB - and it demands a different data model. Don't reach for it by default.
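
The hotspot warning in the key-design bullet can be made concrete. The following is a minimal sketch, assuming a sequential integer counter is the starting point for a surrogate key: reversing its bit order scatters consecutive values across the key space instead of appending them to one end. The function name `bit_reverse_63` is illustrative, and this is not presented as Spanner's own algorithm. Spanner offers built-in options such as bit-reversed sequences and UUID keys, so check the current documentation before choosing.

```python
def bit_reverse_63(n: int) -> int:
    """Reverse the low 63 bits of a non-negative integer.

    63 bits (not 64) keeps the result non-negative, so it still fits
    in a signed 64-bit integer column.
    """
    if not 0 <= n < 2**63:
        raise ValueError("expected 0 <= n < 2**63")
    # Render as 63 binary digits, reverse the string, parse it back.
    return int(f"{n:063b}"[::-1], 2)


# Sequential counters 1, 2, 3 land far apart once reversed, so inserts
# are not all written to the same end of the key range.
keys = [bit_reverse_63(i) for i in (1, 2, 3)]
```

Reversal is its own inverse, so the original counter can be recovered with a second call if the application ever needs insertion order.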

Firestore, Bigtable, Memorystore

  • Firestore - serverless document database with real-time listeners and offline sync; great for web/mobile backends. Security Rules control client access. Not relational - model for your queries, mind index and hotspot limits.
  • Bigtable - wide-column, low-latency, petabyte-scale for time-series, IoT, adtech, and analytics feeding. Row-key design is everything - a bad key hotspots a node. HBase-compatible API.
  • Memorystore - managed Redis/Valkey/Memcached for caching, sessions, rate limiting; HA tiers with replicas/failover.
DBA note - these are not relational
Firestore and Bigtable are NoSQL: no joins, no ad-hoc SQL, no relational integrity. You design around access patterns and denormalize. Forcing a relational schema onto them (or expecting SQL tuning to apply) leads to hotspots and cost surprises. Pick them when the access pattern (document, wide-column key-value, real-time) genuinely fits.

Database service decision table

Workload | Recommended | Reason | HA | DR | Ops responsibility | Cost lever
PostgreSQL app DB | Cloud SQL PostgreSQL | Managed, standard OLTP | Regional HA (enable it) | Cross-region replica | Schema/queries | Right-size + CUD; auto-storage
MySQL web app | Cloud SQL MySQL | Managed, common | Regional HA | Cross-region replica | Schema/queries | Right-size; read replicas
SQL Server workload | Cloud SQL SQL Server | Managed, license included | Regional HA | Cross-region replica | Schema/queries | Edition/size choice
Demanding Postgres / HTAP / vector | AlloyDB | Performance + analytics + pgvector | Regional + read pools | Cross-region (config) | Schema/queries | Scale read pools
Global transactional | Spanner | Global scale + strong consistency | Built-in (multi-region) | Built-in | Schema/key design | Right-size compute units
Web/mobile app backend | Firestore | Serverless doc, real-time | Built-in | Multi-region option | Data model + rules | Query/index efficiency
High-scale NoSQL / time-series / IoT | Bigtable | Petabyte scale, low latency | Built-in (replication) | Multi-cluster | Row-key design | Node count; storage type
Cache / session | Memorystore | Managed Redis/Valkey | HA tier (replicas) | Rebuild from source | Keys/TTL | Right-size tier
Data warehouse | BigQuery (section 11) | Serverless analytics, not OLTP | Built-in | Multi-region dataset | Schema/queries | Slot/query cost control
Unsupported engine / full control (e.g. Oracle) | Compute Engine (self-managed) or Oracle DB@Google | Engine/version not offered managed | You build it | You build it | Everything | Sole-tenant/licensing

Connectivity & observability

  • Private IP (via private services access) is the production default - no public endpoint. Serverless VPC Access lets Cloud Run/Functions reach a private-IP DB.
  • Cloud SQL Auth Proxy / connectors - IAM-authenticated, encrypted connections without managing SSL certs or IP allowlists.
  • Authorized networks (public IP) - avoid; if used, restrict tightly and require SSL.
  • Query Insights (Cloud SQL/AlloyDB) - query-level performance analysis; plus Cloud Monitoring metrics for CPU, connections, storage, replication lag.
Common mistake
Giving a database a public IP with broad authorized networks "to connect quickly." That exposes it to the internet. Use private IP + the Auth Proxy / Serverless VPC Access, and plan the private-services-access range and DNS up front - retrofitting private IP later means a connectivity migration.

How HA, DR, backup, and patching differ

Service | HA | DR | Backup | Patching
Cloud SQL | Regional standby (opt-in), auto failover | Cross-region read replica → promote | Automated + PITR, you set retention | Google, in your maintenance window
AlloyDB | Regional, auto failover; read pools | Cross-region config | Continuous + PITR | Google, in window
Spanner | Built-in (multi-zone/region) | Multi-region config | Backups + PITR | Fully managed, transparent
Firestore / Bigtable | Built-in replication | Multi-region / multi-cluster | Managed backup/export | Fully managed
Operations note - test restores
"Managed backups" does not equal "proven recoverability." Periodically restore a backup / do a PITR to a fresh instance and validate. Also verify your cross-region DR: replica lag within RPO, promotion procedure rehearsed, and app connection strings/DNS ready to repoint.

Google Cloud database gotchas for Oracle DBAs

For DBAs coming from Oracle
  • Cloud SQL is managed, not self-managed - no OS/SYSDBA-level control, controlled flags, Google-run patching. Your runbooks change.
  • AlloyDB is PostgreSQL-compatible, not Oracle-compatible - Oracle-to-AlloyDB is a full Postgres migration (schema + PL/SQL conversion), not lift-and-shift.
  • Spanner needs a different data-modeling mindset - key design to avoid hotspots, interleaving for locality; no single-node RDBMS assumptions.
  • Firestore and Bigtable are not relational - no joins/SQL; design for access patterns.
  • Patching control differs by service - you schedule windows, Google patches; not your opatch/datapatch world.
  • Backup access differs - backups are service-managed artifacts (+ PITR), not RMAN backup pieces or files you copy; export for portability.
  • Private IP and DNS must be planned - reserve the private-services-access range; plan resolution before you build.
  • Performance troubleshooting differs - Query Insights / Cloud Monitoring instead of AWR/ASH; different wait/metric vocabulary.
  • Oracle itself: for Oracle Database you either self-manage on Compute Engine (you own everything, sole-tenant for licensing) or use the Oracle Database@Google Cloud partnership (Exadata/Autonomous run by Oracle inside Google Cloud) - there is no native "managed Oracle" like Cloud SQL.

Enterprise examples

PostgreSQL application database [OLTP]

Cloud SQL PostgreSQL with HA enabled, private IP, Auth Proxy from the app, automated backups + PITR, a cross-region read replica for DR, Query Insights on. Move to AlloyDB if you need more performance or in-DB analytics/vector.

Globally distributed transactional workload [Global]

Spanner multi-region config; schema designed for even key distribution; app uses the client library with strong reads where needed and stale reads for scale.

IoT / time-series at scale [NoSQL]

Bigtable with a row key that spreads writes (e.g. reversed/hashed device ID + timestamp), multi-cluster replication for HA, feeding BigQuery/Dataflow for analytics.
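
One way to read "hashed device ID + timestamp" is sketched below. It is an illustration under stated assumptions, not a prescribed schema: the `#` separator, the 4-character hash prefix, the 13-digit millisecond timestamp bound, and the name `row_key` are all hypothetical choices, and device IDs are assumed not to contain `#`.

```python
import hashlib

MAX_TS_MS = 10**13 - 1  # assumed upper bound: 13-digit millisecond timestamps


def row_key(device_id: str, ts_ms: int) -> str:
    """Build '<hash prefix>#<device id>#<reversed timestamp>'."""
    if not 0 <= ts_ms <= MAX_TS_MS:
        raise ValueError("timestamp outside the assumed 13-digit range")
    # A short hash prefix spreads devices across the key space so that
    # sequentially named devices do not pile onto one node.
    prefix = hashlib.sha256(device_id.encode()).hexdigest()[:4]
    # Subtracting from the maximum makes the newest reading sort first
    # within a device's contiguous block of rows.
    reversed_ts = MAX_TS_MS - ts_ms
    return f"{prefix}#{device_id}#{reversed_ts:013d}"
```

Because the prefix is derived from the device ID alone, all rows for one device stay adjacent (cheap range scans per device), while different devices land in different parts of the table.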

Self-managed Oracle on Compute Engine [Oracle]

When a managed option can't run the engine/version: Oracle on a memory-optimized VM, regional PD/Hyperdisk sized for IOPS, Data Guard you configure to a second-region VM, sole-tenant nodes for licensing, backups to Cloud Storage. Consider Oracle Database@Google Cloud for a managed alternative.

7. Load Balancing and Traffic Management

Google Cloud Load Balancing - the global and regional Application, Network, and proxy load balancers, their components (forwarding rule, target proxy, URL map, backend service, health check), and how to choose and debug them, with Cloud CDN and Cloud Armor.

Last reviewed: July 2026 LB names/tiers evolve - verify current LB types and features in docs.
TL;DR

Google Cloud Load Balancing is a family. The global external Application Load Balancer gives you one anycast IP serving users worldwide (L7, HTTP/S, with Cloud CDN + Cloud Armor). There are also regional external/internal Application LBs, passthrough Network LBs (L4, external/internal), and proxy Network LBs (L4 proxy). An Application LB is assembled from a forwarding rule → target proxy → URL map → backend service → backends, with health checks; proxy Network LBs have no URL map, and passthrough Network LBs point the forwarding rule straight at a backend service (or legacy target pool). The #1 failure is a firewall not allowing the health-check ranges.

The load balancer family

Load balancer | Layer / scope | Use for
Global external Application LB | L7, global, single anycast IP | Internet-facing web/APIs served worldwide; CDN + Cloud Armor; cross-region failover
Regional external Application LB | L7, regional | Regional internet-facing L7 (data-residency, regional-only)
Internal Application LB | L7, regional/cross-region internal | Internal microservice HTTP routing
External passthrough Network LB | L4, regional, preserves client IP | Non-HTTP TCP/UDP internet-facing; source-IP-sensitive
Internal passthrough Network LB | L4, regional internal | Internal TCP/UDP (e.g. internal service VIP, HA databases)
Proxy Network LB (external/internal) | L4 proxy | TCP with TLS offload / where a proxy is wanted (no client-IP preservation)

Anatomy of a load balancer

[Diagram: Clients → Load balancer: forwarding rule (IP:port) → target proxy + SSL cert → URL map (host/path) → backend service + policy → backends (MIG / NEG, Cloud Run NEG, backend bucket for static content); health checks]
Forwarding rule → target proxy (+ cert) → URL map → backend service → backends (MIGs, NEGs, buckets); health checks probe backends.
  • Forwarding rule - the frontend IP + port. Target proxy terminates and (for HTTPS) holds the certificate. URL map does host/path routing. Backend service defines the balancing policy, session affinity, timeouts, and health check. Backends are MIGs, NEGs (network endpoint groups - incl. serverless NEGs for Cloud Run/Functions/App Engine), or backend buckets (static content).
  • SSL certificates - Google-managed certs (auto-provision/renew) or self-managed, via Certificate Manager for scale.
  • Session affinity - client IP / cookie based, when needed.

Cloud CDN and Cloud Armor

  • Cloud CDN - cache cacheable responses at Google's edge; enable on a backend service/bucket to cut latency and egress.
  • Cloud Armor - edge WAF/DDoS for the global external Application LB: OWASP rules, rate limiting, geo/IP allow-deny, bot management, and adaptive protection. Attach a security policy to the backend service.
Security note
Front public HTTP(S) with the global external Application LB + Cloud Armor: it absorbs DDoS at Google's edge, enforces WAF/rate-limit rules, and lets your backends live on private IPs with no direct internet exposure. Terminate TLS at the LB with Google-managed certs; keep backends in private subnets.

When to use which

Global internet web/API, want CDN + WAF + one IP → Global external Application LB + Cloud CDN + Cloud Armor
Regional-only L7 (residency) → Regional external Application LB
Internal microservice HTTP routing → Internal Application LB
Non-HTTP, need real client IP, high throughput → External passthrough Network LB
Internal L4 VIP (e.g. HA DB, internal service) → Internal passthrough Network LB
Serverless (Cloud Run) behind a global IP + WAF → Global external App LB with a serverless NEG

Load balancer troubleshooting

⚑ Backend unhealthy / 502

Likely causes (in order)

  1. Firewall doesn't allow the health-check ranges 35.191.0.0/16 and 130.211.0.0/22 to the backend port - the #1 cause.
  2. Health check port/path/protocol wrong vs. what the app serves.
  3. App not listening / bound to localhost instead of 0.0.0.0.
  4. Wrong backend service protocol (HTTP vs HTTPS vs HTTP/2) or named port mismatch on the MIG.
  5. Cloud Armor rule or URL map misrouting; OS firewall on the VM.

Checks

gcloud compute backend-services get-health BACKEND --global
gcloud compute firewall-rules list --filter="sourceRanges~35.191 OR sourceRanges~130.211"
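
The same firewall check can be run offline against an exported list of allowed source ranges. A minimal sketch, assuming the two probe ranges quoted in the causes list and a plain list of CIDR strings as input (the function name is illustrative). It only detects a probe range that sits entirely inside a single allowed range, so a probe range covered by several smaller rules combined would be reported as missing.

```python
import ipaddress

# Probe source ranges quoted in the troubleshooting list.
HEALTH_CHECK_RANGES = ("35.191.0.0/16", "130.211.0.0/22")


def uncovered_health_check_ranges(allowed_source_ranges):
    """Return the health-check ranges not contained in any allowed range."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    missing = []
    for cidr in HEALTH_CHECK_RANGES:
        probe = ipaddress.ip_network(cidr)
        covered = any(
            # Compare like with like: skip IPv6 ranges for IPv4 probes.
            net.version == probe.version and probe.subnet_of(net)
            for net in allowed
        )
        if not covered:
            missing.append(cidr)
    return missing
```

An empty result means both probe ranges are admitted by source range; it says nothing about whether the rule targets the right port, tag, or service account.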

Fix / prevention

Allow the health-check ranges to the backend SA/tag on the port; align the health check; fix the named port/protocol; bind to 0.0.0.0. Template the LB + firewall together in Terraform.
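
The "bind to 0.0.0.0" part of the fix is easy to get wrong in application code. A minimal sketch using only the Python standard library, assuming a hypothetical `/healthz` path that the health check is configured to probe:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # hypothetical health-check path
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of stderr in this sketch


def make_server(port: int) -> HTTPServer:
    # "0.0.0.0" listens on every interface, so probes arriving on the
    # VM's internal IP are answered. "127.0.0.1" would accept only local
    # connections and the backend would be marked unhealthy.
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server(8080).serve_forever()` starts the listener; the port must match the health check and the MIG's named port.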

⚑ SSL certificate / cert issue

Causes: Google-managed cert stuck in PROVISIONING (the domain must resolve to the LB IP and DNS must be correct before it validates); wrong/missing domain on the cert; expired self-managed cert; HTTP→HTTPS redirect missing. Fix: point DNS at the LB IP first, then wait for provisioning; include all SANs; use Certificate Manager at scale; add a redirect URL map.

⚑ Wrong forwarding rule / URL map / Cloud Armor blocking valid traffic

Causes: forwarding rule on the wrong IP/port/protocol; URL map path/host rule not matching (default backend catching everything); Cloud Armor rule denying legitimate clients (over-broad geo/IP or a WAF rule false positive). Fix: verify the forwarding-rule frontend; test URL map routing (path matchers, order); review Cloud Armor logs and preview mode before enforcing; tune the offending rule.

8. Security Deep Dive

Defense in depth on Google Cloud: identity, governance (Org Policy, VPC-SC), network, data, and detective controls (SCC, audit logs) - plus concrete guidance for securing projects, storage, compute, and databases, ending in a production checklist.

Last reviewed: July 2026 Security service capabilities evolve - verify SCC tiers and features in docs.
TL;DR

Layer your controls: IAM (least privilege, groups, no basic roles, no SA keys), governance (Organization Policies to forbid the risky thing; VPC Service Controls to stop data exfiltration), network (private IPs, firewall, no public exposure, IAP), data (CMEK, Secret Manager, DLP/Sensitive Data Protection), and detection (Security Command Center, centralized Cloud Audit Logs). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive guardrails over after-the-fact detection.

Google Cloud shared responsibility model

Google secures the infrastructure (physical, hardware, host, network fabric, and managed-service internals). You are responsible for: IAM and identity, data classification and access, network exposure and firewall, key management choices, workload/OS security (for IaaS/GKE nodes), secure configuration, and monitoring/response. The higher up the managed-service stack you go (VM → GKE → Cloud SQL → BigQuery/Cloud Run), the more Google handles - but data, access, and configuration always remain yours.

The control layers

Layer | Controls | Key services
Identity & access | Who can do what | Cloud IAM, groups, deny policies, conditional IAM, Workload/Workforce Identity Federation
Governance | What is allowed at all; data can't leave | Organization Policies, VPC Service Controls, Access Context Manager
Network | What can reach what | Firewall + hierarchical policies, private IPs, Cloud Armor, IAP, Private Google Access/PSC
Data | Protect data at rest/in transit | Cloud KMS/HSM (CMEK), Secret Manager, Sensitive Data Protection (DLP), CA Service
Workload / supply chain | Trusted, hardened workloads | Shielded/Confidential VMs, Binary Authorization, Artifact Analysis, Web Security Scanner
Detective / posture | Find misconfig & threats | Security Command Center, Cloud Audit Logs, Cloud Logging, Chronicle

Cloud KMS, HSM, and Secret Manager

  • Cloud KMS - manage encryption keys for CMEK across storage, disks, databases, and app-level crypto. Cloud HSM gives FIPS 140-2 Level 3 hardware-backed keys; Cloud EKM lets keys live in an external KMS.
  • Secret Manager - store API keys, DB passwords, certs as versioned secrets; workloads read them via IAM (no secrets in code, images, or env files).
  • Certificate Authority Service - private CA for issuing internal certificates at scale.
Security note - Secret Manager + CMEK in a locked project
Put KMS keys and secrets in a dedicated security project where only a small key-admin group has admin, and grant workloads only use (encrypt/decrypt / secret accessor) via their service accounts. Never commit secrets or SA keys to source. Disabling a CMEK key is your emergency "make this data unreadable" control - hold it tightly and audit it.

Security Command Center and detection

Security Command Center (SCC)

Central security & risk platform: asset inventory, misconfiguration findings (Security Health Analytics), threat detection (Event Threat Detection), attack-path/risk analysis, and posture management across the org. Turn it on org-wide.

Cloud Audit Logs

Admin Activity (always on), Data Access, System Event, and Policy Denied logs - your immutable evidence trail. Enable Data Access logs where needed and centralize with a sink.

Cloud Logging + Log Router

Route org audit logs to a central logging project (log bucket + BigQuery + SIEM) via an aggregated sink for cross-project visibility and retention.

Chronicle / Sensitive Data Protection

Google Security Operations (formerly Chronicle) for SIEM-scale threat analytics; Sensitive Data Protection (formerly Cloud DLP) to discover/classify/mask PII in storage and BigQuery.

Architect note - centralize audit logs on day one
Create an organization-level aggregated log sink to a dedicated logging project (and onward to BigQuery/SIEM) as part of the landing zone. Retrofitting centralized audit logging after an incident - when you find the Data Access logs were never enabled - is the classic post-mortem finding. Enable SCC org-wide alongside it.

Perimeter and supply chain

  • VPC Service Controls - a data-exfiltration perimeter around managed services so a valid identity can't copy BigQuery/Cloud Storage data to an outside project. The key control for sensitive data (see section 2).
  • Identity-Aware Proxy (IAP) - context-aware access to apps/VMs without VPN or public IPs.
  • Binary Authorization - only allow signed/attested container images to deploy (GKE/Cloud Run).
  • Artifact Analysis - scan images in Artifact Registry for vulnerabilities; Web Security Scanner for app scanning.
  • Confidential Computing - encrypt data in use (Confidential VMs/GKE).

How to secure specific things

Secure a production project (and multi-project env) [Foundation]
  • Federate identities; enforce 2FA; drive access through groups; no basic roles; least-privilege predefined roles at project/resource scope.
  • Preventive Org Policies at the org/folder: disable SA key creation, domain-restricted sharing, restrict resource locations, block external IPs, require OS Login/Shielded VM.
  • Shared VPC for centralized network control; VPC-SC perimeter around data projects.
  • SCC org-wide; aggregated audit-log sink to a logging project; budgets + quotas.
  • CMEK + Secret Manager in a security project; break-glass Owner group, monitored.
Secure Cloud Storage [Storage]
  • Uniform bucket-level access + public access prevention (enforced by Org Policy); no allUsers.
  • CMEK for sensitive buckets; versioning + locked retention for backup/compliance.
  • Signed URLs short-lived and inventoried; access via IAM + service accounts, not keys.
Secure Compute Engine [Compute]
  • No external IPs; access via IAP; OS Login (+2FA) instead of metadata SSH keys; Shielded VMs.
  • Target firewall rules by service account; allow only the IAP range for SSH.
  • VM Manager for patch compliance; Ops Agent for logs/metrics; CMEK on disks for sensitive data.
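
The "allow only the IAP range for SSH" rule lends itself to an automated audit. A minimal sketch, assuming 35.235.240.0/20 is the IAP TCP forwarding source range (verify against current documentation) and a simplified, hypothetical rule shape rather than the real firewall API schema. Port ranges such as `20-25` are not expanded here.

```python
import ipaddress

# Assumed IAP TCP forwarding source range; confirm in current docs.
IAP_TCP_RANGE = ipaddress.ip_network("35.235.240.0/20")


def risky_ssh_rules(rules):
    """Return names of ingress rules that open tcp:22 beyond the IAP range.

    `rules` uses a simplified, hypothetical shape:
    {"name": str, "ports": ["22", ...], "source_ranges": ["cidr", ...]}
    """
    flagged = []
    for rule in rules:
        if "22" not in rule["ports"]:
            continue
        sources = [ipaddress.ip_network(r) for r in rule["source_ranges"]]
        # Any source that is not inside the IAP range makes the rule risky.
        if any(not (s.version == 4 and s.subnet_of(IAP_TCP_RANGE)) for s in sources):
            flagged.append(rule["name"])
    return flagged
```

Fed with an export of the project's ingress rules, the result is a short list to review or to fail a pipeline on.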
Secure databases & public load balancers [Data / Edge]
  • DBs on private IP, Auth Proxy/Serverless VPC Access, no public endpoint; CMEK; IAM DB auth where supported.
  • Public HTTP behind the global external App LB + Cloud Armor (WAF, rate limiting, geo/IP); backends private.
  • Reduce public exposure everywhere: block external IPs by Org Policy, prefer IAP + private access.

Production Google Cloud security checklist

  • Human access federated; 2FA enforced; access granted via groups, never individuals.
  • No basic roles (Owner/Editor/Viewer) in production; least-privilege predefined roles at project/resource scope.
  • Service account key creation disabled org-wide; workloads use attached SAs / Workload Identity Federation / impersonation.
  • Preventive Org Policies: block external IPs, restrict resource locations, domain-restricted sharing, require OS Login + Shielded VM.
  • Security Command Center enabled org-wide with findings triaged.
  • VPC Service Controls perimeter around projects holding sensitive data.
  • Shared VPC / hierarchical firewall centrally managed; no broad 0.0.0.0/0 ingress; SSH via IAP range only.
  • Databases and internal services on private IP; no public database endpoints.
  • Public HTTP behind global external App LB + Cloud Armor; backends private.
  • All sensitive data encrypted with CMEK; keys in a locked security project; rotation on.
  • Secrets in Secret Manager; nothing sensitive in code, images, or metadata.
  • Cloud Storage buckets: UBLA + public access prevention; versioning + retention on backups.
  • Org-level aggregated audit-log sink to a central logging project (+ BigQuery/SIEM); Data Access logs on where needed.
  • Alerts on IAM/policy changes, new SA keys, public exposure, and anomalous access.
  • Budgets + quotas as guardrails; consistent labels/tags for attribution.
  • DR and backups tested (restores verified), including CMEK key availability in the DR region.

Common security mistakes

Common Google Cloud security mistakes
  • Granting Owner/Editor too broadly, or roles at org level.
  • Long-lived service account keys (a leading cause of serious incidents).
  • Public Cloud Storage exposure (allUsers, legacy ACLs).
  • Over-permissive firewall rules; broad public SSH instead of IAP.
  • Public database endpoints with wide authorized networks.
  • Not enabling / not centralizing audit logs.
  • Storing secrets in code instead of Secret Manager.
  • Not using impersonation; not using VPC Service Controls for sensitive data.
  • Not enforcing Organization Policies (leaving the guardrails off).

9. Observability, Monitoring, and Operations

Cloud Monitoring, Cloud Logging, and the Cloud Operations suite - what to monitor per service, how to build useful alerts without noise, how to centralize logs across projects, and Active Assist for optimization.

Last reviewed: July 2026 Verify metric names and Ops suite features in current docs.
TL;DR

Cloud Monitoring holds metrics, uptime checks, dashboards, and alerting policies that fire to notification channels. Cloud Logging collects logs; the Log Router sends them to log buckets, BigQuery, Pub/Sub, or Cloud Storage via sinks. Install the Ops Agent on VMs for memory/disk/process metrics and logs (not collected by default). Use Managed Service for Prometheus for container metrics. Alert on user-visible symptoms, route by severity, and centralize logs across projects with an aggregated sink.

The observability stack

Service | Role
Cloud Monitoring | Metrics, uptime checks, dashboards, alerting policies, SLOs.
Alerting policies + notification channels | Threshold/absence conditions → email, PagerDuty, Slack, SMS, Pub/Sub, webhook.
Cloud Logging + Log Router + sinks | Collect, route, and retain logs; export to log buckets / BigQuery / Cloud Storage / Pub/Sub.
Cloud Audit Logs | Admin Activity, Data Access, System Event, Policy Denied - the control-plane record.
Error Reporting / Trace / Profiler | Aggregate errors; distributed latency traces; continuous CPU/heap profiling.
Managed Service for Prometheus | Prometheus-compatible metrics at scale (GKE and beyond).
Ops Agent | On-VM agent for host metrics (memory, disk, swap), process metrics, and logs.
Network Intelligence Center / Flow Logs | Network observability (see section 3).
Cloud Asset Inventory / Recommender / Active Assist | Inventory, recommendations, and automated insights (cost, security, reliability).
Operations note - install the Ops Agent
Compute Engine reports CPU, disk I/O, and network by default but not memory, disk space utilization, or process metrics. Install the Ops Agent (via VM Manager or startup) to get memory pressure, disk usage, and application logs. Many "we couldn't see the memory leak" incidents trace back to the agent never being installed.

What to monitor per area

Compute Engine

CPU, memory (agent), disk usage & IOPS/throughput, instance up/health, MIG size vs. target, autohealing events.

Persistent Disk

Throughput/IOPS vs. provisioned limits, disk usage %, latency. Hitting the disk ceiling is a common hidden bottleneck.

Cloud Storage

Request/error rates, object counts, and unusual access patterns (via Data Access logs).

Cloud SQL / databases

CPU, memory, connections vs. limit, storage used %, replication lag, backup success, Query Insights.

Load balancers

Backend health, request count, 5xx rate, latency, backend utilization.

VPC / security

VPN/Interconnect status, flow-log anomalies, firewall drops; SCC findings, audit-log anomalies (policy/key/public-exposure changes).

Building useful alerts

  • Alert on symptoms users feel (5xx rate, unhealthy backends, DB down, high latency, SLO burn), not just causes.
  • Use appropriate aligners/reducers and a duration to avoid flapping (e.g. mean over 5 min, not a single spike).
  • Use absence conditions for signals that should always report (heartbeat, backup completion).
  • Route by severity: critical → page; warning → ticket/Slack; info → dashboard.
  • Adopt SLOs and alert on error-budget burn rather than raw thresholds where you can.
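
The error-budget bullet rests on one piece of arithmetic: burn rate is the observed error ratio divided by the error ratio the SLO allows. A minimal sketch, with the function name and event counts as illustrative assumptions; the burn rate at which to page is a policy choice not shown here.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events == 0:
        return 0.0  # no traffic in the window, so no budget was consumed
    error_ratio = bad_events / total_events
    budget = 1 - slo_target
    return error_ratio / budget
```

A burn rate of 1 means the budget would be used up exactly at the end of the SLO period; with a 99.9% target, 50 failures in 10,000 requests is a burn rate of about 5, i.e. the budget is being spent five times too fast.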

Example alerts to implement

Alert | Condition | Severity
VM CPU high | CPU > 85% mean for 5-10 min | Warning → Critical
VM unavailable | Instance up-check fails / metric absent | Critical
Memory pressure | Memory (agent) > 90% | Warning
Disk usage / IOPS | Disk > 85% used; throughput near provisioned limit | Warning
LB unhealthy backend | Healthy backend count < desired | Critical
Cloud SQL CPU / storage / connections | CPU > 90%; storage > 85%; connections near max | Warning → Critical
Failed backups | Backup failure / success signal absent | Critical
VPN tunnel down / Interconnect issue | Tunnel/attachment status != up | Critical
Cloud Storage unusual access | Spike / unexpected public access (Data Access logs) | Security review
Cloud Run / Functions errors | 5xx rate or error count over threshold | Warning → Critical
Pub/Sub backlog high | Oldest unacked message age / undelivered count high | Warning
GKE pod crash loops | Container restart count rising | Warning
Common mistake - alert fatigue
Paging on every transient spike trains people to ignore alerts. Use longer windows, appropriate reducers, duration conditions, severity routing (only real user-impact pages), maintenance suppression, and prune alerts nobody acts on. An alert that never leads to action should be a dashboard metric, not a page.
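
The effect of a longer window plus a duration condition can be seen in a few lines. This sketch only illustrates the idea and is not how Cloud Monitoring evaluates policies internally; it assumes one sample per minute, and the name `should_fire` and its parameters are hypothetical.

```python
def should_fire(samples, threshold, window, duration):
    """Fire only if the rolling mean stays above `threshold` for
    `duration` consecutive evaluation points.

    samples: one value per minute; window and duration are sample counts.
    """
    if window < 1 or duration < 1:
        raise ValueError("window and duration must be at least 1")
    breaches = 0
    for end in range(window, len(samples) + 1):
        mean = sum(samples[end - window:end]) / window
        # Any evaluation back under the threshold resets the streak.
        breaches = breaches + 1 if mean > threshold else 0
        if breaches >= duration:
            return True
    return False
```

With a 5-minute mean held for 3 minutes, a single one-minute spike to 99% never fires, while ten minutes at 95% does. A raw per-sample threshold would have paged on both.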

Centralizing logs across projects

Create an aggregated sink at the org or folder level that routes all projects' logs (especially audit logs) to a central logging project - a log bucket for retention, BigQuery for analysis, and/or Pub/Sub to a SIEM. This gives cross-project security visibility and satisfies retention/compliance without per-project setup.

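# Org-level aggregated sink of all audit logs to a central BigQuery dataset
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/central-logging/datasets/org_audit \
  --organization=ORG_ID --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'
# The sink writes as its own service account (the writer identity). Look it up,
# then grant it BigQuery Data Editor on the dataset, or nothing is exported.
gcloud logging sinks describe org-audit-sink --organization=ORG_ID \
  --format='value(writerIdentity)'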

Active Assist & recommendations

Recommender / Active Assist surface actionable insights: IAM over-grants, idle VMs/disks/IPs, right-sizing, commitment recommendations, and reliability/security findings. Review them monthly (they feed the cost checklist in section 14). Service Health shows Google-side incidents and maintenance events affecting your resources.

10. Containers, Kubernetes, and Cloud Native

GKE (Autopilot and Standard), Cloud Run, and the serverless / event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and event-driven systems.

Last reviewed: July 2026 Verify GKE modes, Gateway API, and Cloud Run features in current docs.
TL;DR

GKE (managed Kubernetes) for orchestrated, long-running microservices - Autopilot when you want Google to manage nodes, Standard when you need node-level control. Cloud Run for serverless containers that scale to zero (the default for most new stateless services). Cloud Functions for small event handlers. Around them: Artifact Registry, Cloud Build/Deploy, Pub/Sub, Eventarc, Workflows, Cloud Tasks/Scheduler, and Apigee/API Gateway. GKE workloads use Workload Identity (not SA keys) to call Google APIs.

The cloud-native services

Service | What it is | Use for
GKE (Autopilot / Standard) | Managed Kubernetes control plane + nodes (or fully-managed nodes in Autopilot) | Orchestrated microservices, platform teams, portable K8s workloads
Cloud Run | Serverless containers (services & jobs), scale-to-zero | Most new stateless services/APIs and batch jobs
Cloud Functions | Event-driven functions (on Cloud Run) | Small event handlers, glue
App Engine | PaaS for web apps | Existing App Engine apps
Artifact Registry | Managed registry for images and packages | Store/scan images (Container Registry is legacy)
Cloud Build / Cloud Deploy | CI (build) and CD (progressive delivery) | Container build + deploy pipelines
Pub/Sub | Global messaging / event bus | Decoupling, streaming ingestion, fan-out
Eventarc | Event routing (from Google services / Pub/Sub) to Run/GKE/Workflows | Event-driven triggers
Workflows / Cloud Tasks / Cloud Scheduler | Orchestration / task queues / cron | Serverless orchestration and scheduling
Apigee / API Gateway | Full API management / lightweight API gateway | Publishing, securing, and managing APIs

GKE deep dive

  • Autopilot vs Standard: Autopilot manages nodes, scaling, and much security for you and bills per-pod - less to run, fewer knobs. Standard gives you node pools and full control (custom machine types, GPUs, DaemonSets that need node access) at more operational cost.
  • Node pools (Standard) - groups of nodes with a machine type/image; scale and upgrade per pool; use Spot node pools for fault-tolerant workloads.
  • Regional vs zonal clusters - regional replicates the control plane and spreads nodes across zones (HA); zonal is single-zone.
  • Networking - VPC-native clusters use alias IP ranges (secondary subnet ranges) for pods and services; plan those ranges (pods need many IPs).
  • Ingress / Gateway API - a Kubernetes Ingress or the newer Gateway API provisions a Google Cloud load balancer for HTTP routing; a Service type=LoadBalancer provisions an L4 LB.
  • Workload Identity - map a Kubernetes service account to a Google service account so pods call Google APIs with short-lived credentials, no keys.
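# Bind a Kubernetes SA to a Google SA (Workload Identity)
gcloud iam service-accounts add-iam-policy-binding GSA@PROJECT.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"
# Annotate the Kubernetes SA so GKE knows which Google SA it maps to
kubectl annotate serviceaccount KSA_NAME --namespace NAMESPACE \
  iam.gke.io/gcp-service-account=GSA@PROJECT.iam.gserviceaccount.com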
Architect note - Autopilot by default, plan pod IPs
Start with GKE Autopilot unless you need node-level features (specific GPUs/DaemonSets, custom kernels) - it removes node ops and improves security posture. Either way, size the pod alias range for peak pods across the cluster; running out of pod IPs stalls scheduling in ways that look like mysterious Pending pods. Use Workload Identity for all pod-to-Google-API access - never mount SA keys.
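
Sizing the pod range is simple arithmetic once the per-node block is known. A minimal sketch, assuming the documented GKE behaviour that each node reserves the smallest power-of-two block holding at least twice its maximum pods (for example a /24 per node at 110 pods per node). Verify that rule and your cluster's max-pods setting in current docs; the function names and example ranges are illustrative.

```python
import ipaddress


def per_node_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the pod block reserved for each node."""
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    needed = 2 * max_pods_per_node
    block = 1
    while block < needed:  # round up to the next power of two
        block *= 2
    return 32 - (block.bit_length() - 1)


def max_nodes(pod_range: str, max_pods_per_node: int) -> int:
    """How many nodes a pod secondary range can serve before it runs out."""
    net = ipaddress.ip_network(pod_range)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < net.prefixlen:
        return 0  # the range is smaller than a single node's block
    return 2 ** (node_prefix - net.prefixlen)
```

Under these assumptions a /14 pod range at 110 pods per node supports 1,024 nodes, while a /21 supports only 8, which is the point at which new pods start sitting in Pending.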

GKE vs Cloud Run vs Cloud Functions vs Compute Engine

Many orchestrated microservices, need the K8s ecosystem | GKE (Autopilot unless you need node control)
Stateless service/API that should scale to zero | Cloud Run (the default for new services)
Small event handler / glue | Cloud Functions (+ Eventarc)
Persistent state, special kernels, full control | Compute Engine / GKE Standard
Batch job | Cloud Run jobs or Batch service
Cost note
Don't run a GKE cluster for one or two containers - the control plane and node baseline cost more than Cloud Run, which bills per request and scales to zero. Reserve GKE for genuine orchestration needs (fleets, service mesh, complex scheduling). Use Spot node pools for fault-tolerant GKE workloads.

Networking, IAM, and security for containers

  • Networking - VPC-native GKE in a (shared) VPC subnet with pod/service alias ranges; private clusters keep the control-plane endpoint private; Cloud Run connects to VPC via Serverless VPC Access or Direct VPC egress.
  • IAM - cluster access via IAM + Kubernetes RBAC; Workload Identity for pod-to-Google auth; Cloud Run/Functions run as a service account you set.
  • Supply chain - scan images in Artifact Registry (Artifact Analysis); enforce Binary Authorization so only attested images deploy.
  • Runtime - network policies, pod security, secrets from Secret Manager, and least-privilege service accounts.
  • Monitoring - GKE integrates with Cloud Monitoring/Logging and Managed Service for Prometheus; Cloud Run emits request metrics and logs automatically.

CI/CD for containers

Cloud Build builds and tests images (triggered from a repo), pushes to Artifact Registry (scanned), and Cloud Deploy promotes them through environments (dev → staging → prod) with approvals and rollback. Binary Authorization gates what can deploy.

Architecture patterns

[Diagram: Cloud Storage → Eventarc (object.finalized) → Cloud Run (process) → BigQuery / DB, or Pub/Sub → downstream]
Event-driven: an object upload triggers Eventarc → Cloud Run processes it and writes to BigQuery/DB or publishes to Pub/Sub for fan-out.
  • Microservices on GKE - deployments behind Gateway API/Ingress LB, HPA autoscaling, optional service mesh (Cloud Service Mesh) for mTLS/traffic control, Workload Identity, Cloud Deploy pipelines.
  • Microservices on Cloud Run - each service a container, private ingress + internal LB for service-to-service, Pub/Sub/Eventarc for async - minimal ops.
  • Serverless function on a Cloud Storage event - as diagrammed; image/ETL/validation triggers.
  • Event-driven architecture - Eventarc + Pub/Sub + Workflows + Cloud Run/Functions + Cloud Tasks for decoupled, resilient pipelines.
  • Private container platform - private GKE cluster in a Shared VPC, internal LBs, Binary Authorization, no public endpoints.

Troubleshooting

⚑ GKE pod not starting (Pending / ImagePullBackOff / CrashLoopBackOff)

Causes:
  • Pending - no schedulable capacity, pod IP exhaustion (alias range too small), or resource requests too large.
  • ImagePullBackOff - bad image path, missing Artifact Registry read permission on the node service account, or no route to the registry (private cluster without PGA/Artifact Registry access).
  • CrashLoopBackOff - app failing on start (config/secret missing, bad liveness probe).
Checks: kubectl describe pod, kubectl logs --previous, node capacity, alias-range free IPs.
Fix: scale the node pool or extend the alias range; grant roles/artifactregistry.reader; fix probes/config; enable PGA for private clusters.

⚑ GKE Ingress issue

Causes: health checks failing (firewall not allowing 35.191.0.0/16 and 130.211.0.0/22, or wrong readiness probe); managed certificate missing or not yet provisioned (DNS must point at the LB IP first); wrong BackendConfig/NEG; Ingress class/annotations misconfigured. Fix: allow the health-check ranges, align the readiness probe with the LB health check, point DNS at the LB IP and then wait for the cert, verify BackendConfig.

⚑ Cloud Run revision / Cloud Functions timeout / Pub/Sub backlog

Cloud Run: new revision serving 100% but failing - check that the container starts and listens on $PORT, startup CPU boost, min instances, and the runtime SA's permissions; roll back to the previous revision.
Functions timeout: raise the timeout/memory, make work idempotent, offload long work to a Cloud Run job.
Pub/Sub backlog: slow/failing subscriber - check ack deadline, subscriber errors, and scale consumers; use a dead-letter topic for poison messages.

11. Analytics, Data, and Integration

Data is Google Cloud's strongest area. This section covers BigQuery, the lake/lakehouse stack (Cloud Storage, BigLake, Dataplex), the pipeline tools (Dataflow, Dataproc, Data Fusion, Dataform), streaming (Pub/Sub), CDC (Datastream), and BI (Looker) - along with the BigQuery mental model that trips up newcomers.

Last reviewed: July 2026 Verify BigQuery editions/pricing models and service availability in docs.
TL;DR

Land data in Cloud Storage (the lake) and analyze it in BigQuery (serverless warehouse - storage and compute are separate). Transform with Dataflow (streaming/batch Beam), Dataproc (managed Spark/Hadoop), Data Fusion (visual ETL), or Dataform (SQL transformations in BigQuery). Ingest streams with Pub/Sub, CDC with Datastream, govern with Dataplex, and visualize with Looker / Looker Studio. BigQuery cost is driven by bytes scanned (on-demand) or slots (capacity) - query design directly affects the bill.

The services

Service | Role
BigQuery | Serverless data warehouse - SQL analytics at petabyte scale; separated storage/compute; ML (BigQuery ML) and vector search built in.
BigLake | Unify BigQuery-managed and open-format (Parquet/Iceberg) data in Cloud Storage under one governed table interface.
Dataplex | Data governance, cataloging, lineage, quality, and organization across the lake/warehouse (Data Catalog is part of it).
Cloud Storage | The data lake landing/curation/consumption zones.
Dataflow | Fully-managed Apache Beam - unified streaming & batch pipelines.
Dataproc | Managed Spark/Hadoop clusters (and serverless Spark).
Data Fusion | Visual, code-free ETL/ELT (CDAP-based).
Dataform | SQL-based transformation/versioning inside BigQuery (ELT, tests, docs).
Pub/Sub | Global messaging for streaming ingestion and event distribution.
Datastream | Serverless CDC from operational DBs (Oracle, MySQL, PostgreSQL) into BigQuery/Cloud Storage.
Cloud Composer | Managed Apache Airflow for orchestration.
Analytics Hub | Securely share/exchange datasets across projects/orgs.
Looker / Looker Studio | Governed BI/semantic modeling (Looker) and self-serve dashboards (Looker Studio).
Data engineering note - Datastream + BigQuery for CDC
To get near-real-time operational data into the warehouse, Datastream streams change data (including from Oracle) into BigQuery with low source impact - a common alternative to heavyweight ETL. Pair with BigQuery's CDC/merge handling. Mind supplemental logging and schema-drift handling on the source, and validate throughput for high-change tables.

The BigQuery mental model

BigQuery is not a traditional database
  • Storage and compute are separated. Data sits in columnar storage; queries spin up compute (slots) on demand. You are not managing a server with a fixed size.
  • It is analytical (OLAP), not transactional. Great for scans/aggregations over huge tables; wrong for high-rate single-row OLTP (use Cloud SQL/Spanner for that).
  • SQL, but different operations. No indexes in the OLTP sense; performance comes from partitioning (by date/ingestion time) and clustering (by high-cardinality filter columns) to prune data.
  • Query design drives cost. On-demand pricing charges by bytes scanned - SELECT * and unpartitioned full scans are expensive. Select only needed columns; filter on partition/cluster keys.
  • Slots are the unit of compute. On-demand allocates slots automatically; capacity pricing (editions) reserves slots for predictable cost/performance and lets you autoscale.
  • Loading options differ: batch loads (no charge for the load itself on the shared slot pool), streaming inserts / Storage Write API (real-time, priced), and external/BigLake tables (query data in Cloud Storage in place) each have different cost/latency trade-offs.
Common mistake
Treating BigQuery like Postgres: SELECT * across an unpartitioned multi-TB table, per-row updates, or expecting OLTP latency. That scans (and bills) enormous data and performs poorly. Partition + cluster tables, select only needed columns, and keep transactional workloads in Cloud SQL/Spanner.
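
A rough way to see why column selection and partition filters matter is to model bytes scanned directly. The sketch below is an approximation under stated assumptions: evenly sized partitions, a fixed average width per column, and no clustering. The table shape, column widths, and the price constant are hypothetical placeholders, not current rates; use a dry run for real estimates.

```python
from dataclasses import dataclass

TIB = 1024 ** 4


@dataclass
class Table:
    rows: int
    column_bytes: dict  # column name -> assumed average bytes per value
    partitions: int     # number of evenly sized partitions (e.g. daily)


def estimated_scan_bytes(table: Table, columns: list, partitions_read: int) -> int:
    """Approximate bytes scanned: only selected columns, only partitions read."""
    per_row = sum(table.column_bytes[c] for c in columns)
    read = min(partitions_read, table.partitions)
    return table.rows * per_row * read // table.partitions


def on_demand_cost(scan_bytes: int, price_per_tib: float) -> float:
    """On-demand cost for a scan at a caller-supplied price per TiB."""
    return scan_bytes / TIB * price_per_tib


if __name__ == "__main__":
    PRICE_PER_TIB = 6.25  # hypothetical placeholder; check the pricing page
    events = Table(
        rows=10_000_000_000,
        column_bytes={"event_id": 8, "user_id": 8, "ts": 8, "payload": 1000},
        partitions=365,
    )
    full = estimated_scan_bytes(events, list(events.column_bytes), 365)
    narrow = estimated_scan_bytes(events, ["user_id", "ts"], 7)
    for label, scanned in [("SELECT * over all partitions", full),
                           ("two columns over 7 partitions", narrow)]:
        print(f"{label}: {scanned / TIB:.4f} TiB, "
              f"about {on_demand_cost(scanned, PRICE_PER_TIB):.2f} at the assumed price")
```

The two levers multiply: dropping the wide column shrinks bytes per row, and the partition filter shrinks the number of rows touched.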

Common data patterns

Pattern | Built from
Data lake | Cloud Storage (raw/curated/consumption) + Dataplex + BigLake
Data warehouse | BigQuery + Dataform/Dataflow loads + Looker
Lakehouse | BigLake over open formats (Iceberg/Parquet in GCS) + Dataplex governance, queried by BigQuery
ETL / ELT | Dataflow / Dataproc / Data Fusion (E/T) + Dataform (in-warehouse T)
Streaming ingestion | Pub/Sub → Dataflow → BigQuery (or Storage Write API direct)
CDC | Datastream → BigQuery / Cloud Storage
Reporting / BI | BigQuery + Looker (governed) / Looker Studio (self-serve)
AI-ready data | Curated lake + BigQuery ML / vector search + Vertex AI (section 12)
Cross-org data sharing | Analytics Hub (publish/subscribe datasets) with governance

Data governance with Dataplex

Dataplex gives you a data catalog, business/technical metadata, data lineage, data quality, and policy-based access across lakes and BigQuery - so a growing lake stays a governed asset instead of a "data swamp." Combine with VPC Service Controls (perimeter around BigQuery/Storage) and column/row-level security in BigQuery for sensitive data.

Security note
Wrap analytics projects (BigQuery + the lake buckets) in a VPC Service Controls perimeter so data cannot be exported to outside projects, use column-level security and data masking for sensitive fields, discover PII with Sensitive Data Protection (DLP), and control sharing through Analytics Hub rather than ad-hoc dataset grants.

Reference architecture: lakehouse + BI

[Diagram: Source DBs (OLTP) → Datastream (CDC) and Events / files → Pub/Sub, both into the Cloud Storage lake (raw / curated / consumption) → Dataflow → BigQuery → Looker, with Dataplex governing]
CDC/streaming into a Cloud Storage lake, Dataflow transforms, BigQuery serves analytics (BigLake over open formats), Dataplex governs, Looker visualizes.

BigQuery cost control

Cost note - control the scan
  • Partition and cluster big tables so queries prune to the relevant data.
  • Select only needed columns; avoid SELECT *. Preview cost with the dry-run estimator before running.
  • Set custom quotas / maximum bytes billed per query to cap runaway scans.
  • Choose the right pricing: on-demand (per-byte) for spiky/low volume, capacity (editions with slot reservations + autoscaling) for predictable heavy workloads.
  • Use materialized views and BI Engine for repeated aggregations; expire temp datasets.
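
The on-demand versus capacity choice in the list above comes down to a break-even volume. The sketch below compares the two for a month using caller-supplied prices; it ignores autoscaling, commitments, and free tiers, and every number in the example is a hypothetical placeholder rather than a published rate.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_on_demand(tib_scanned: float, price_per_tib: float) -> float:
    """Monthly cost when paying per TiB scanned."""
    return tib_scanned * price_per_tib


def monthly_capacity(baseline_slots: int, slot_hour_price: float) -> float:
    """Monthly cost of a fixed slot baseline running all month."""
    return baseline_slots * slot_hour_price * HOURS_PER_MONTH


def breakeven_tib(baseline_slots: int, slot_hour_price: float, price_per_tib: float) -> float:
    """TiB scanned per month above which the slot baseline is cheaper."""
    return monthly_capacity(baseline_slots, slot_hour_price) / price_per_tib


def cheaper_model(tib_scanned: float, baseline_slots: int,
                  slot_hour_price: float, price_per_tib: float) -> str:
    on_demand = monthly_on_demand(tib_scanned, price_per_tib)
    capacity = monthly_capacity(baseline_slots, slot_hour_price)
    return "on-demand" if on_demand <= capacity else "capacity"


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    slots, slot_price, tib_price = 100, 0.04, 5.0
    print(f"break-even: {breakeven_tib(slots, slot_price, tib_price):.0f} TiB/month")
    for volume in (100, 1000):
        print(volume, "TiB/month ->", cheaper_model(volume, slots, slot_price, tib_price))
```

This only answers the cost question; whether a given slot baseline actually delivers acceptable query latency has to be measured separately.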

12. AI, ML, and Generative AI on Google Cloud

Vertex AI as the unified ML platform, Gemini models and Agent Builder, vector search across AlloyDB / Cloud SQL / BigQuery, and the pretrained AI APIs - plus the enterprise RAG patterns and governance guardrails that separate a demo from something you can run on real data.

Last reviewed: July 2026 Model names, regions, quotas & pricing change fast - verify Gemini/Vertex availability in the Console.
TL;DR

Vertex AI is the unified platform for building, tuning, deploying, and operating models - Model Garden (incl. Gemini), Vertex AI Studio, custom/AutoML training, endpoints, pipelines, feature store, model registry, and Agent Builder / Vertex AI Search for grounded assistants. For RAG, generate embeddings and store vectors in AlloyDB, Cloud SQL (pgvector), BigQuery, or Vertex Vector Search. BigQuery ML runs ML in SQL. The hard part is not the model - it is governing what the model can reach.

Vertex AI

Capability | What it does
Model Garden | Catalog of Google (Gemini), open, and third-party models to deploy/tune.
Vertex AI Studio | Prompt, test, and tune generative models; multimodal.
Gemini models | Google's frontier multimodal LLMs (text, image, audio, video, code) served via Vertex AI.
Agent Builder / Vertex AI Search | Build grounded search and agents over your data with managed retrieval (less RAG plumbing).
AutoML / custom training | Train without/with your own code; distributed training on GPUs/TPUs.
Endpoints (online / batch prediction) | Serve models for low-latency online or high-throughput batch inference.
Pipelines / Feature Store / Experiments / Model Registry | MLOps: reproducible pipelines, feature serving, experiment tracking, versioned model governance.
Model Monitoring | Drift/skew detection and quality monitoring for deployed models.

Pretrained AI APIs & BigQuery ML

Document AI

Parse and extract structured data (text, tables, entities) from documents - invoices, forms, contracts.

Vision / Speech / Translation / Natural Language

Pretrained APIs for image analysis, speech-to-text and text-to-speech, translation, and text/entity/sentiment analysis - no training needed.

Contact Center AI / Dialogflow

Conversational agents and virtual call-center assistants (CCAI), including generative agents.

BigQuery ML

Train and run ML models (and call Gemini/embeddings) directly with SQL in BigQuery - great for analysts and data-resident ML.

Vector search options

Option | Use when
AlloyDB vector search | Vectors alongside operational Postgres data, with high performance (Google's ScaNN-based indexing).
Cloud SQL for PostgreSQL (pgvector) | Vectors in an existing Cloud SQL Postgres, modest scale, simplest path.
BigQuery vector search | Vectors + embeddings at analytics scale, alongside your warehouse data, in SQL.
Vertex AI Vector Search | Purpose-built, very large-scale, low-latency similarity search (managed ANN).
AI note - keep vectors near the governed data
Storing embeddings in AlloyDB/Cloud SQL/BigQuery means retrieval inherits your existing IAM, VPC-SC perimeter, backups, and row/column security - you combine similarity search with ordinary WHERE filters so retrieval respects entitlements. Use Vertex Vector Search when scale/latency demands a dedicated ANN service. Either way, filter retrieved context to what the requesting user is allowed to see.

RAG architecture on Google Cloud

[Diagram: Ingestion path: Docs in Cloud Storage → Chunk + embed → Vectors (AlloyDB / BQ). Runtime query path: User query → Serving layer (Cloud Run: authz + guardrails) → Retrieve top-k (entitlement-filtered) → Gemini (Vertex AI) grounded answer → Audit + logging]
Ingestion: docs → chunk → embed → store vectors. Runtime: query → governed serving layer → entitlement-filtered retrieval → Gemini generates a grounded, audited answer. Or use Vertex AI Search/Agent Builder to manage the retrieval.
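
The entitlement-filtered retrieval step can be sketched in a few lines. This is an in-memory illustration, not any Google Cloud API: the `Chunk` shape, group names, and two-dimensional embeddings are hypothetical, and a real system would push the filter into the vector store query (for example as a WHERE clause) rather than filter in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: list            # toy vector; real embeddings are much longer
    allowed_groups: frozenset  # groups entitled to see this chunk


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: list, chunks: list, user_groups: set, k: int = 3) -> list:
    """Filter by entitlement first, then rank the visible chunks by similarity."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        Chunk("hr-1", "salary bands", [1.0, 0.0], frozenset({"hr"})),
        Chunk("sales-1", "pricing sheet", [0.9, 0.1], frozenset({"sales"})),
        Chunk("pub-1", "holiday calendar", [0.0, 1.0], frozenset({"sales", "hr"})),
    ]
    hits = retrieve([1.0, 0.0], corpus, {"sales"}, k=2)
    print([c.doc_id for c in hits])
```

Filtering before ranking is the important ordering: the closest match for this query is an HR chunk, and it never becomes a candidate for a sales user, so it cannot reach the prompt.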

Enterprise patterns

Pattern | How | Watch out for
Chat with documents | RAG over Cloud Storage docs + vector search + Gemini (or Vertex AI Search) | Chunking quality; stale index; citations
Chat with database | Retrieve from curated views; generate grounded answers | Never expose raw prod OLTP; use a serving layer
Natural language to SQL | Gemini proposes SQL against a governed schema/catalog | Validate/parametrize; read-only; no dynamic SQL
RAG with BigQuery | Embeddings + vector search in BigQuery over warehouse data | Column/row security on retrieval
Document processing pipeline | Document AI → extract → BigQuery / workflow | Human review of low-confidence extractions
Call center AI | CCAI / Dialogflow + Gemini + knowledge base | Grounding; escalation to humans
MLOps pipeline | Vertex AI Pipelines + Feature Store + Model Registry + Monitoring | Reproducibility; drift monitoring
Governed private GenAI | Private endpoints + VPC-SC + curated data + audit | Entitlement-aware retrieval

Governance and security for GenAI

  • Serving layer, always - agents/LLMs call a governed API (e.g. Cloud Run) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
  • Entitlement-aware retrieval - filter retrieved context to what the requesting user may see (row/column/document level) so RAG cannot leak across users.
  • Private & perimetered - keep model and data traffic private (Private Service Connect / private endpoints); wrap data services in VPC Service Controls.
  • Credential hygiene - secrets in Secret Manager, access via service accounts / Workload Identity; the model never sees raw credentials.
  • Auditability - log prompts, retrieved context IDs, and responses (per privacy rules) so answers are explainable.
  • Responsible AI - safety filters, evaluation, and Model Monitoring for drift/quality; human review for consequential outputs.

Warnings (read before connecting AI to enterprise data)

Do not do these
  • Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
  • Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
  • Protect credentials. No DB passwords, keys, or wallets in prompts, code, or agent memory. Use Secret Manager + service accounts / Workload Identity.
  • Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
  • Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
  • Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
  • Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; isolate and sanitize retrieved/user content.
  • Check Gemini model availability, region, quota, and pricing before you design - these change frequently and vary by region.
AI note - the pattern that scales safely
The durable enterprise GenAI shape is: curated/governed data → entitlement-filtered retrieval → model behind a serving API → validated, audited output, all inside a VPC-SC perimeter. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first.
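
As an illustration of the "validated, read-only" rule for model-generated SQL, the sketch below applies a few conservative checks in the serving layer before a query is allowed to run. It is regex-based and deliberately strict, so it will reject some legitimate queries (CTE names, keywords inside string literals) and can miss cases such as comma joins; a production guard would more likely combine a real SQL parser with a read-only database role. The function name and the table names are hypothetical.

```python
import re

# Keywords that indicate writes, DDL, or procedure calls.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|call|execute)\b",
    re.IGNORECASE,
)
# Table references after FROM / JOIN, with optional backticks.
TABLE_REF = re.compile(r"\b(?:from|join)\s+`?([A-Za-z_][\w.\-]*)", re.IGNORECASE)


def validate_generated_sql(sql: str, allowed_tables: set) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        problems.append("multiple statements")
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        problems.append("not a SELECT query")
    if FORBIDDEN.search(statement):
        problems.append("contains a write/DDL keyword")
    referenced = {t.lower() for t in TABLE_REF.findall(statement)}
    unknown = referenced - {t.lower() for t in allowed_tables}
    if unknown:
        problems.append(f"tables outside the curated schema: {sorted(unknown)}")
    return problems


if __name__ == "__main__":
    curated = {"reporting.sales_v"}  # hypothetical curated view
    for candidate in [
        "SELECT region, SUM(amount) FROM reporting.sales_v GROUP BY region;",
        "DELETE FROM reporting.sales_v",
        "SELECT * FROM `prod.orders`",
    ]:
        print(candidate, "->", validate_generated_sql(candidate, curated) or "ok")
```

Passing these checks should be treated as necessary, not sufficient: the query still runs under a least-privilege, read-only identity, and the result is logged.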

13. Migration and Disaster Recovery

Getting workloads into Google Cloud (VMs, databases, data) and keeping them recoverable - the migration tooling, the DR patterns by tier, and how RTO/RPO drive architecture and cost.

Last reviewed: July 2026 Verify supported sources for MVM/DMS/Datastream in current docs.
TL;DR

Migrate VMs with Migrate to Virtual Machines, databases with Database Migration Service (and Datastream for CDC/low-downtime), and bulk data with Storage Transfer Service / Transfer Appliance. Plan with Migration Center. For DR, choose per tier: backup & restore (cheapest, slow), cold/pilot light, warm standby, or hot/active-active. Global load balancing + Cloud DNS handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.

Migration tooling

Move | Tooling | Notes
Assess & plan | Migration Center | Discovery, assessment, TCO, and grouping before you move
VMs | Migrate to Virtual Machines (MVM) | Lift-and-shift VMware/AWS/Azure/physical VMs to Compute Engine
VMware estates | Google Cloud VMware Engine | Run VMware as-is in GCP, then modernize gradually
Databases (low downtime) | Database Migration Service (DMS) + Datastream | Managed migrations to Cloud SQL/AlloyDB; DMS supports Oracle→PostgreSQL conversion
Bulk data | Storage Transfer Service / Transfer Appliance | Online from other clouds/on-prem/HTTP, or physical appliance for very large sets
Data warehouse | BigQuery Data Transfer Service + migration tooling | From Teradata/Redshift/others into BigQuery
Backup & DR | Backup and DR service | Application-consistent backup and DR orchestration for VMs/databases

Database migration paths

Source → target | Method | Downtime
PostgreSQL/MySQL → Cloud SQL | DMS continuous (logical replication) | Near-zero
PostgreSQL → AlloyDB | DMS continuous | Near-zero
Oracle → PostgreSQL/AlloyDB | DMS with schema/code conversion (heterogeneous) | Low, plus conversion effort
Oracle/MySQL/PostgreSQL → BigQuery/Storage | Datastream (CDC) | Near-zero (ongoing feed)
Any → self-managed on Compute Engine | Native dump/restore, replication, or Data Guard (Oracle) | Depends on method
DBA note - heterogeneous Oracle moves are conversions
Oracle → Cloud SQL/AlloyDB PostgreSQL is a heterogeneous migration: DMS handles data movement and assists with schema/PL-SQL conversion, but datatype, PL/SQL, and feature differences need real work and testing. If the app must stay on Oracle, plan for self-managed Oracle on Compute Engine or Oracle Database@Google Cloud instead of a conversion. Use Datastream when you only need Oracle data flowing into BigQuery for analytics.

DR patterns

Pattern | Standby state | RTO | RPO | Cost
Backup & restore | Backups in another region; nothing running | Hours+ | Since last backup | Lowest
Cold / pilot light | Core data replicated (e.g. cross-region replica); app off | Tens of min | Small (replica lag) | Low
Warm standby | Scaled-down full stack running in DR region | Minutes | Small | Medium
Hot / active-active | Both regions serving (global LB, Spanner/replicated data) | Near-zero | Near-zero | Highest + complexity

Building blocks: cross-region Cloud SQL/AlloyDB replicas (promote for DR) or Spanner/Firestore multi-region (built-in); Cloud Storage dual/multi-region or Transfer for objects; regional PD and PD snapshots for VMs; the global external Application LB for automatic cross-region failover; and Cloud DNS for DNS-based failover where the LB doesn't cover it.

Architect note - global LB simplifies app DR
Because the global external Application LB uses one anycast IP with backends in multiple regions and health-based routing, a regional outage fails traffic over automatically for stateless tiers - no DNS TTL wait. That removes a lot of classic DR plumbing for the app layer. The hard part remains the data tier: pick cross-region replicas (promote) or a multi-region database (Spanner/Firestore) per your RTO/RPO, and rehearse the promotion.
Common mistake - active-active data is hard
Stateless tiers go active-active easily behind a global LB; stateful databases generally do not without conflict handling. Most "active-active" requirements are met by active/passive with fast failover, or by using a database that is natively multi-region (Spanner). Don't take on multi-master complexity unless the requirement truly demands it.

RTO and RPO

  • RTO - how long you can be down → drives standby readiness and automation.
  • RPO - how much data you can lose → drives replication mode (synchronous vs async vs backup interval).
  • Zero data loss needs synchronous replication (regional HA, or Spanner multi-region) and low inter-region latency; verify the network and the performance trade-off.
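
One way to make "RTO/RPO targets pick the pattern" operational is a small lookup that maps targets to a DR tier and a replication mode. The sketch below mirrors the DR patterns described in this section, but the numeric thresholds are illustrative assumptions only; set them from your own measured failover and restore times.

```python
def choose_dr_pattern(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery targets to a DR tier. Thresholds are illustrative assumptions."""
    if rto_minutes < 1 or rpo_minutes < 1:
        return "hot / active-active"
    if rto_minutes < 15:
        return "warm standby"
    if rto_minutes < 120 or rpo_minutes < 60:
        return "cold / pilot light"
    return "backup & restore"


def replication_mode(rpo_minutes: float, backup_interval_minutes: float = 24 * 60) -> str:
    """RPO drives how data is copied: synchronous, asynchronous, or backups."""
    if rpo_minutes == 0:
        return "synchronous replication"
    if rpo_minutes < backup_interval_minutes:
        return "asynchronous replication"
    return "scheduled backups"


if __name__ == "__main__":
    # Hypothetical workload tiers: (name, RTO minutes, RPO minutes).
    for name, rto, rpo in [("payments", 0.5, 0), ("storefront", 5, 5),
                           ("reporting", 60, 5), ("archive", 240, 1440)]:
        print(f"{name}: {choose_dr_pattern(rto, rpo)}, {replication_mode(rpo)}")
```

Classifying each workload this way keeps the expensive tiers for the few systems whose targets actually require them.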

Architecture examples

[Diagram: Global external App LB in front of Region A (primary): App (regional MIG) + Cloud SQL (HA), and Region B (DR): App (warm MIG) + cross-region replica, with replication from Region A to Region B]
One global LB fronts both regions (auto failover for stateless tiers); the database uses a cross-region replica promoted on failover.
  • On-prem VM → GCP: Migrate to VMs; cut over with the network in place.
  • DB → Cloud SQL: DMS continuous, cut over at low replication lag.
  • PostgreSQL → AlloyDB: DMS; validate performance/analytics gains.
  • Oracle → Compute Engine (where required): self-managed, Data Guard to a second-region VM, sole-tenant for licensing.
  • Cross-region DR (app): global LB + warm MIG + storage replication.
  • Cross-region DR (database): cross-region replica (promote) or multi-region Spanner/Firestore.
  • Backup-based DR: Backup and DR service / cross-region snapshots, rebuild on demand.

DR testing

DR you have never tested is a hope, not a plan
Run regular drills: replica promotion, region failover, and full app validation in DR (not just "the DB opened"). Verify RTO/RPO are actually met, that CMEK keys exist and are usable in the DR region (a missing key makes the replica unusable), that the global LB/DNS failover works, and that runbooks and connection strings are current. Backup and DR service can help codify and rehearse plans.
  • Cross-region replica within RPO (monitor replication lag); promotion rehearsed.
  • CMEK keys present and usable in the DR region.
  • App tier can start and connect in DR; config points to DR endpoints.
  • Global LB / Cloud DNS failover tested and time-measured.
  • Object data (dual/multi-region or replicated) within RPO.
  • Capacity available in DR (reservations if RTO is tight); runbook current.

14. Cost Management and Governance

How Google Cloud charges, the tools to track and cap spend (billing export, budgets, quotas), the discount levers (CUDs, sustained-use, Spot), and the governance model - ending in a monthly cost-review checklist.

Last reviewed: July 2026 Pricing and discount models change - verify all rates on the pricing pages.
TL;DR

Google Cloud bills mainly by compute (vCPU/memory-hours), storage GB, BigQuery bytes scanned or slots, and network egress. Track with billing export to BigQuery + Cost tables/dashboards, cap with Budgets & alerts (notify) and quotas (block). The big levers: committed use discounts for steady state, sustained-use discounts (automatic on some families), Spot VMs for fault-tolerant work, right-sizing / custom machine types, BigQuery query controls, storage lifecycle, and killing idle resources. Governance = resource hierarchy + Org Policies + budgets + labels.

Pricing basics

Dimension | Charged on | Notes
Compute Engine | vCPU + memory per second (per machine type) | Sustained-use discounts on some families; CUDs; Spot for big savings
Persistent Disk / storage | Provisioned GB-month (+ IOPS/throughput for Hyperdisk) | Snapshots add up; choose disk type deliberately
Cloud Storage | GB-month by class + operations + retrieval (colder classes) | Lifecycle to colder classes; watch retrieval/egress
BigQuery | Bytes scanned (on-demand) or slot-hours (capacity) + storage | Query design and partitioning dominate cost
Network | Internet egress + inter-region + some inter-zone; ingress free | Keep chatty services co-located; use private access
Managed services | Per-service (Cloud SQL vCPU/RAM/storage, Cloud Run per-request, etc.) | Right-size; scale-to-zero where possible
Logging / Monitoring | Ingestion volume (logs), some metrics | Exclude noisy logs; set retention

Cost tracking tools

Tool | Does
Billing export to BigQuery | Detailed usage/cost data for your own analysis and dashboards (turn on early).
Cost table / Cost breakdown / Reports | Console views of spend by project, service, SKU, label, time.
Budgets & alerts | Track spend against a target per billing account/project/label; alert at thresholds (and optionally trigger Pub/Sub automation). Budgets notify - they don't block.
Quotas | Hard caps on resource usage per project - the "block" control.
Pricing Calculator | Estimate before you build.
Recommender / Active Assist | Right-sizing, idle-resource, and commitment recommendations.
Cost note - budgets alert, quotas enforce
Use them together: a budget warns you spend is trending over; a quota stops a project creating the expensive thing. Attribute everything via labels + billing export to BigQuery so chargeback works. This only works if labeling is enforced from the start (section 1).
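
The notify-versus-block distinction can be shown with two small functions. This is a conceptual sketch, not the Cloud Billing or quota API: the threshold values, the linear forecast, and the function names are assumptions made for illustration.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple = (0.5, 0.9, 1.0)) -> list:
    """Budget side: which alert thresholds has actual spend reached? Notifies only."""
    return [t for t in thresholds if spend >= t * budget]


def linear_forecast(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Naive month-end projection, assuming spend continues at the same daily rate."""
    return spend_to_date / day_of_month * days_in_month


def quota_allows(in_use: int, requested: int, limit: int) -> bool:
    """Quota side: a request that would exceed the limit is refused outright."""
    return in_use + requested <= limit


if __name__ == "__main__":
    budget = 1000.0
    spend, day, days = 600.0, 10, 30
    print("thresholds crossed:", crossed_thresholds(spend, budget))
    print("forecast over budget:", linear_forecast(spend, day, days) > budget)
    # The budget only reports; the quota is what stops the next 16 vCPUs.
    print("new VM allowed:", quota_allows(in_use=56, requested=16, limit=64))
```

In the example, the alert fires on day 10 while spend keeps climbing; only the quota check actually refuses the new resource.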

Discounts

  • Committed use discounts (CUDs) - 1 or 3-year commitment (resource-based or spend-based) for a substantial discount on steady-state usage.
  • Sustained-use discounts - automatic discounts for running certain machine families a large fraction of the month.
  • Spot VMs - 60-90% off for fault-tolerant workloads that can tolerate preemption.
  • Custom machine types - stop paying for vCPU or memory you don't use.

Governance model

Governance is enforced through the same primitives as security: the resource hierarchy (projects/folders for isolation and attribution), Organization Policies (restrict locations, machine types, external IPs), budgets + quotas, labels/tags, and a landing zone deployed as code. This keeps spend controlled and attributable by design rather than by cleanup.

Cost optimization examples

Action | Typical saving | Effort
Stop / schedule non-prod VMs off-hours | High (up to ~65-70% of that compute) | Low
Right-size VMs (Recommender) / custom machine types | High | Low
Committed use discounts for baseline | High | Medium
Spot VMs for fault-tolerant / batch | Very high (60-90%) | Medium
Choose the right disk type / delete unused disks | Medium | Low
Cloud Storage lifecycle to colder classes | Medium-High | Low
BigQuery: partition/cluster, max-bytes-billed, capacity vs on-demand | High for heavy BQ | Medium
Reduce logging ingestion (exclusion filters) | Medium | Low
Reduce inter-region / egress traffic | Medium | Medium
Delete old snapshots & unused external IPs | Medium | Low
Cloud SQL right-sizing / scale-to-zero serverless | Medium-High | Low
Cost note - cheap wins first
Before re-architecting, do the high-ROI basics: schedule non-prod off-hours, act on Recommender right-sizing, apply storage lifecycle, buy CUDs for baseline, and fix BigQuery query patterns. Idle external IPs, orphaned disks, and unbounded log ingestion quietly add up - clean them monthly.
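
The off-hours scheduling figure is simple arithmetic on hours per week, sketched below. It covers compute charges only (stopped VMs still pay for disks and reserved IPs), and the schedules shown are examples rather than recommendations.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    for hours, days in [(12, 5), (10, 5), (24, 7)]:
        print(f"{hours}h x {days}d per week: {schedule_saving(hours, days):.1%} saved")
```

A 12-hour weekday schedule runs 60 of 168 hours, which is where a saving in the mid-60s percent comes from; a 10-hour schedule pushes it to about 70%.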

Monthly Google Cloud cost review checklist

  • Review cost reports month-over-month by project, service, and label; investigate spikes.
  • Check each budget: which projects/labels are over or trending over target.
  • Act on Recommender right-sizing and idle-resource recommendations.
  • Confirm non-prod stop/schedule ran (nothing running 24x7 by accident).
  • Find and delete unused persistent disks, orphaned snapshots, and idle VMs.
  • Release unused external (static) IPs - they bill when unattached.
  • Review CUD coverage vs. steady-state usage; buy/adjust commitments.
  • BigQuery: top queries by bytes scanned; add partitioning/clustering; set max-bytes-billed; review on-demand vs capacity.
  • Cloud Storage: are lifecycle rules moving cold data to Nearline/Coldline/Archive?
  • Logging/Monitoring ingestion: exclude noisy logs; check retention settings.
  • Review egress / inter-region charges; co-locate chatty services; use private access.
  • Cloud SQL / managed DB sizing vs. actual utilization; scale down over-provisioned.
  • Confirm every resource is labeled (cost-center/env/owner) for attribution.
  • Validate quotas still reflect intent; check for anomalous new spend by service.
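
The labeling item in the checklist above is easy to automate once you have an inventory. The sketch below assumes a list of resource dictionaries with `name` and `labels` keys, a hypothetical shape such as you might derive from an asset or billing export, and uses the cost-center/env/owner keys named in the checklist.

```python
REQUIRED_LABELS = ("cost-center", "env", "owner")


def missing_labels(resources: list) -> dict:
    """Map each resource name to the required labels it lacks or leaves empty."""
    report = {}
    for resource in resources:
        labels = resource.get("labels") or {}
        missing = [key for key in REQUIRED_LABELS if not labels.get(key)]
        if missing:
            report[resource["name"]] = missing
    return report


if __name__ == "__main__":
    # Hypothetical inventory rows.
    inventory = [
        {"name": "vm-web-1", "labels": {"cost-center": "cc-101", "env": "prod", "owner": "web-team"}},
        {"name": "disk-orphan-7", "labels": {"env": "dev"}},
        {"name": "bucket-tmp", "labels": None},
    ]
    for name, missing in missing_labels(inventory).items():
        print(f"{name}: missing {', '.join(missing)}")
```

Running a check like this monthly turns "confirm every resource is labeled" into a short list of exceptions to chase.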

15. Enterprise Architecture Patterns

Reference blueprints for real Google Cloud deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026 Blueprints are starting points - validate sizing/services against current docs and requirements.
HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: global external App LB + Cloud Armor → private compute (Cloud Run / regional MIG / GKE) → managed database on private IP → Private Google Access / PSC for Google APIs → centralized logging → cross-region DR, all inside a governed landing zone.

Foundational three-tier (reference backbone)

Three-tier enterprise application
The pattern most others extend
[Diagram: Users → Global LB + Cloud Armor → Shared VPC (global, subnet per region) → Web/app tier: Cloud Run or regional MIG (no external IP) across zone-a and zone-b → Data tier: Cloud SQL / AlloyDB (private IP, HA; Cloud SQL HA + PITR); supporting services: Cloud NAT, PGA/PSC, Secret Manager, Logging]
Business case: Standard internal/external web or enterprise app needing HA and controlled exposure.
Services: Shared VPC, global external App LB + Cloud Armor + Cloud CDN, Cloud Run or regional MIG, Cloud SQL/AlloyDB (private IP), Cloud NAT, PGA/PSC, Secret Manager, Cloud Monitoring/Logging.
Traffic flow: User → Cloud Armor/LB → app (private) → DB (private IP); app → Google APIs via PGA/PSC; egress via Cloud NAT.
Security: No external IPs; firewall by SA; DB private; CMEK; secrets in Secret Manager; Org Policies + VPC-SC; SCC on.
HA: Regional MIG / Cloud Run across zones; Cloud SQL HA; LB health-based routing.
DR: Second-region backends behind the same global LB + cross-region DB replica.
Monitoring: LB/backends, app, DB metrics; alerts → notification channels; central logs.
Cost: Cloud Run scale-to-zero or right-sized MIG + CUD; storage lifecycle; BQ controls.
Risks / mistakes: Health-check firewall rule missing; DB public IP; no zone spread; secrets in code.
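
The first risk in the card above, the missing health-check firewall rule, can be closed in code. A minimal sketch, assuming the app-vpc network and app-runtime service account from the Terraform examples in section 17 and a backend listening on TCP 8080 (the port is an assumption; confirm the probe ranges for your load balancer type in current docs):

```hcl
# Allow load balancer health-check probes to reach the private app tier.
resource "google_compute_firewall" "allow_lb_health_checks" {
  name      = "allow-lb-health-checks"
  network   = google_compute_network.vpc.id # assumed: VPC from section 17
  direction = "INGRESS"
  priority  = 1000

  # Probe source ranges used by Application Load Balancer health checks.
  source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]

  # Target by service account rather than by network tag.
  target_service_accounts = [google_service_account.app.email]

  allow {
    protocol = "tcp"
    ports    = ["8080"] # assumed backend port
  }
}
```

With Shared VPC this rule belongs in the host project, since that is where the network and its firewall rules live.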

Pattern library

Simple web application (Small)
Case: Low-complexity site/app, cost-sensitive.
Services: Cloud Run + Cloud SQL (or Firestore) + Cloud Storage for assets + global LB + Cloud Armor.
HA/DR/cost: Cloud Run multi-zone by default; Cloud SQL HA; scale-to-zero. Risk: public DB IP, no backups.

Highly available application (HA)
Case: Must survive zone (and ideally region) failure.
Services: Regional MIG / Cloud Run across zones, regional PD for state, Cloud SQL HA or Spanner, global LB.
DR: Second-region backends + cross-region DB replica or multi-region Spanner. Risk: state on a single zonal disk; untested failover.

Private enterprise application (Regulated)
Case: Internal-only, reachable from on-prem, no public footprint.
Services: Private subnets, internal Application LB, HA VPN/Interconnect via Cloud Router, PGA/PSC, IAP for admin, no external IPs.
Security/risk: VPC-SC perimeter; hierarchical firewall; DB private IP. Risk: CIDR overlap; DNS forwarding gaps.

Shared VPC / centralized networking & security (Platform)
Case: Many teams/projects with centrally-governed network, security, and logging.
Services: Host project (Shared VPC), hierarchical firewall, Cloud NAT/DNS, central logging project (aggregated sink), security project (SCC, KMS), org-level Org Policies.
Risk: Under-scoped networkUser grants; teams creating shadow VPCs; missing perimeter.

Multi-project landing zone (Governance)
Case: Governed foundation before workloads land.
Services: Org/folder/project hierarchy, baseline IAM (groups), preventive Org Policies, Shared VPC, central logging + SCC, budgets/quotas, labels - all Terraform (Cloud Foundation blueprints).
Risk: Skipping it and retrofitting governance later.

Cloud SQL / AlloyDB application (DB)
Case: Relational app backend.
Services: Cloud SQL (or AlloyDB) private IP + HA + PITR, Auth Proxy / Serverless VPC Access, cross-region replica for DR, Query Insights.
Risk: Public IP + broad authorized networks; HA not enabled; untested restore.

BigQuery analytics platform / data lake (Data)
Case: Enterprise analytics on curated + raw data.
Services: Cloud Storage lake (zones) + BigQuery/BigLake + Dataflow/Dataform + Datastream (CDC) + Dataplex (govern) + Looker; VPC-SC perimeter.
Cost/risk: Partition/cluster + query controls; column-level security. Risk: ungoverned "data swamp"; runaway scans.

GKE platform (Cloud native)
Case: Container platform for many microservices with CI/CD.
Services: Private GKE (Autopilot or Standard) in Shared VPC, Gateway API LB, Workload Identity, Artifact Registry (scanned) + Binary Authorization, Cloud Build/Deploy, service mesh optional.
Risk: Pod IP exhaustion; over-privileged Workload Identity; public control plane.

Cloud Run serverless application (Serverless)
Case: Stateless services/APIs with minimal ops.
Services: Cloud Run (services + jobs) behind global LB + Cloud Armor, Serverless VPC Access to private DB, Pub/Sub/Eventarc for async, Secret Manager.
Cost/risk: Scale-to-zero; per-request billing. Risk: cold-start latency for spiky critical paths (use min instances).

Event-driven architecture (Events)
Case: Decoupled, resilient processing pipelines.
Services: Eventarc + Pub/Sub + Cloud Run/Functions + Workflows + Cloud Tasks/Scheduler; dead-letter topics.
Risk: Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers.

Hybrid cloud (Hybrid)
Case: Workloads split across on-prem and GCP.
Services: Interconnect (primary) + HA VPN (backup) via Cloud Router, Shared VPC / NCC, hybrid Cloud DNS, hierarchical firewall.
Risk: CIDR overlap; expecting transitive peering; single link with no backup; asymmetric routing.

Multi-region DR (DR)
Case: Business-critical stack needing regional resilience.
Services: Global LB with multi-region backends, cross-region DB replica or multi-region Spanner/Firestore, dual/multi-region Cloud Storage, reservations, Backup and DR service.
Risk: Untested DR; CMEK key missing in DR region; capacity unavailable at failover.

Secure landing zone (Security)
Case: Preventive-guardrail foundation.
Services: Org Policies (no external IP, location restriction, no SA keys, OS Login), VPC-SC perimeters, central SCC + logging, KMS/Secret Manager, break-glass, budgets/quotas - as code.
Risk: Guardrails left off; over-broad break-glass.

GenAI with private enterprise data (AI)
Case: RAG/assistant over internal data, governed.
Services: Cloud Storage (docs) + vectors in AlloyDB/BigQuery + Gemini (Vertex AI) or Vertex AI Search behind a Cloud Run serving API + Secret Manager + VPC-SC + logging.
Flow / risk: Query → serving layer (authz + guardrails) → entitlement-filtered retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings).

Common mistakes across all patterns
  • Databases/services on external IPs "to get it working"; missing private access planning.
  • No zone/region spread - a zone event takes the whole "HA" tier.
  • Health-check firewall ranges forgotten, so LB backends are unhealthy on day one.
  • DR designed but never tested; CMEK keys missing in the DR region.
  • Secrets in code/metadata instead of Secret Manager; long-lived SA keys.
  • No centralized logging/SCC until an incident needs it.
  • CIDR overlap / expecting transitive peering discovered during hybrid setup.
  • Landing zone / Org Policies skipped and retrofitted painfully later.
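
Several of the mistakes above (external IPs, long-lived SA keys, skipped guardrails) can be prevented rather than detected. A sketch of three preventive Org Policies, assuming a hypothetical var.org_id variable and a US-only location policy; adjust the constraints and values to your own requirements:

```hcl
variable "org_id" {
  type = string # hypothetical: numeric organization ID
}

# List constraint: deny external IPs on all VMs.
resource "google_org_policy_policy" "no_external_ip" {
  name   = "organizations/${var.org_id}/policies/compute.vmExternalIpAccess"
  parent = "organizations/${var.org_id}"
  spec {
    rules {
      deny_all = "TRUE"
    }
  }
}

# Boolean constraint: block creation of service-account keys.
resource "google_org_policy_policy" "no_sa_keys" {
  name   = "organizations/${var.org_id}/policies/iam.disableServiceAccountKeyCreation"
  parent = "organizations/${var.org_id}"
  spec {
    rules {
      enforce = "TRUE"
    }
  }
}

# List constraint: restrict where resources can be created.
resource "google_org_policy_policy" "locations" {
  name   = "organizations/${var.org_id}/policies/gcp.resourceLocations"
  parent = "organizations/${var.org_id}"
  spec {
    rules {
      values {
        allowed_values = ["in:us-locations"] # assumed value group
      }
    }
  }
}
```

Exceptions are usually better expressed as a folder- or project-level policy for the specific workload than by loosening the organization-wide rule.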

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and gcloud where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.

Last reviewed: July 2026 Verify gcloud syntax with gcloud <group> --help.
General method
Work top-down: identity/API (is the caller allowed? is the API enabled? right project?) → network (route + firewall both ways + private access) → host/service (listening/healthy?) → data. For "cannot reach," run a Connectivity Test; for "permission denied," use Policy Troubleshooter and the access mental model in section 2. Almost everything is in Cloud Logging and Cloud Audit Logs.

Compute & access

⚑ VM not reachable / SSH / OS Login issue

Symptoms: SSH times out or is denied. Causes: firewall doesn't allow SSH from your source (allow the IAP range 35.235.240.0/20 and use IAP, or allow your own CIDR); VM has no external IP and you're not using IAP; you lack roles/iap.tunnelResourceAccessor for IAP tunneling; OS Login is enabled but you lack roles/compute.osLogin (or roles/compute.osAdminLogin for sudo) or haven't completed 2FA where it is enforced; VM stopped or boot failed; OS-level firewall. Checks: serial console for boot; IAM roles; firewall; gcloud compute ssh --tunnel-through-iap. Fix: grant the OS Login and IAP tunnel roles, open the IAP range, use IAP tunneling. Prevention: standardize IAP + OS Login; no external IPs.

gcloud compute ssh VM --tunnel-through-iap --zone=ZONE
gcloud compute instances get-serial-port-output VM --zone=ZONE
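
The prevention step (standardize IAP + OS Login) could be captured in Terraform roughly as follows. This is a sketch under assumptions: the VPC and var.project_id come from the section 17 examples, and the vm-operators group is hypothetical.

```hcl
# Allow SSH only from the IAP TCP-forwarding range.
resource "google_compute_firewall" "allow_iap_ssh" {
  name          = "allow-iap-ssh"
  network       = google_compute_network.vpc.id # assumed: VPC from section 17
  direction     = "INGRESS"
  source_ranges = ["35.235.240.0/20"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}

# Operators may open IAP tunnels to VMs in the project.
resource "google_project_iam_member" "iap_tunnel" {
  project = var.project_id
  role    = "roles/iap.tunnelResourceAccessor"
  member  = "group:vm-operators@example.com" # hypothetical group
}

# Operators may log in through OS Login (non-admin).
resource "google_project_iam_member" "os_login" {
  project = var.project_id
  role    = "roles/compute.osLogin"
  member  = "group:vm-operators@example.com" # hypothetical group
}
```

If the VM runs as a service account, operators typically also need roles/iam.serviceAccountUser on that SA before OS Login succeeds.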
⚑ VM boot issue

Causes: bad fstab mount, full boot disk, kernel/driver issue, failed startup script. Checks: serial console output; startup-script logs. Fix: detach the boot disk, attach to a rescue VM, correct config, reattach; keep boot-disk snapshots. Prevention: test image changes in non-prod.

⚑ High CPU / memory pressure / disk full

CPU: check the Cloud Monitoring trend and run top on the host; right-size or autoscale. Memory: visibility requires the Ops Agent (memory isn't collected by default) - install it, then right-size. Disk full: resize the PD online, grow the filesystem, and clean logs/temp; alert at 85% so it is caught early.
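
The 85% disk alert might look like the sketch below. It assumes the Ops Agent is installed (the metric comes from the agent, not the hypervisor) and reuses the email notification channel from the section 17 examples. Unlike CPU utilization, this metric is assumed to report 0-100 rather than 0-1; verify the metric name, labels, and scale in current docs.

```hcl
resource "google_monitoring_alert_policy" "disk_full" {
  display_name = "VM disk usage high"
  combiner     = "OR"

  conditions {
    display_name = "Disk used > 85%"
    condition_threshold {
      # Ops Agent metric; state="used" selects the used-space series.
      filter          = "resource.type=\"gce_instance\" AND metric.type=\"agent.googleapis.com/disk/percent_used\" AND metric.labels.state=\"used\""
      comparison      = "COMPARISON_GT"
      threshold_value = 85 # percent, assumed 0-100 scale
      duration        = "600s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  # assumed: channel defined in the section 17 alert example
  notification_channels = [google_monitoring_notification_channel.email.id]
}
```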

⚑ Persistent Disk attachment issue

Causes: disk in a different zone than the VM; not formatted/mounted; wrong device name. Checks: gcloud compute instances describe; lsblk. Fix: attach in the same zone, format & mount by UUID in fstab. Prevention: regional PD for HA; automate mount in the startup script.

Storage

⚑ Cloud Storage access denied / public access issue

Denied: missing IAM at bucket or project level (for example roles/storage.objectViewer for reads, roles/storage.objectAdmin for read/write); wrong project; UBLA on but you relied on an ACL; VPC-SC perimeter blocking; API disabled. Public access blocked: the Org Policy constraint storage.publicAccessPrevention is (correctly) enforcing - don't disable it; use signed URLs or IAM instead. Checks: gcloud storage buckets get-iam-policy gs://BUCKET; Policy Troubleshooter; audit logs. Fix: grant least-privilege IAM; use signed URLs for external sharing.

⚑ Filestore mount issue

Causes: firewall blocking NFS between client and Filestore; wrong mount IP/path; client not in the allowed network. Fix: allow NFS ports from the client subnet, verify the mount target IP and export path, ensure same VPC/region reachability.

Network

⚑ VPC routing / firewall / Cloud NAT / Private Google Access / PSC / peering / Shared VPC

Method: run a Connectivity Test (names the blocking route/firewall) and Network Analyzer for config issues. Then per case:

  • Firewall: implied deny-ingress; check priority/direction; allow health-check (35.191.0.0/16, 130.211.0.0/22) and IAP (35.235.240.0/20) ranges; prefer SA targeting.
  • Route: a custom route shadowing the default internet route; missing dynamic route from Cloud Router.
  • Cloud NAT: outbound-only; port exhaustion (raise min-ports / enable dynamic allocation); NAT not covering the subnet/region.
  • PGA: not enabled on the subnet; DNS/route for private.googleapis.com missing; VPC-SC blocking.
  • PSC: endpoint/DNS mapping wrong; producer not accepting the connection.
  • Peering: not transitive; overlapping ranges; missing firewall for the peer range.
  • Shared VPC: service-project SA lacks compute.networkUser; resources created in the wrong network.
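
For the Cloud NAT port-exhaustion case in the list above, dynamic port allocation can be enabled on the NAT gateway. A sketch written as a variant of the section 17 NAT resource (a replacement for it, not an addition); the port limits are illustrative assumptions:

```hcl
resource "google_compute_router_nat" "nat" {
  name                               = "app-nat"
  router                             = google_compute_router.rt.name # assumed: router from section 17
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  # Dynamic port allocation: start small, grow per VM on demand.
  enable_dynamic_port_allocation = true
  min_ports_per_vm               = 64   # illustrative; must be a power of 2
  max_ports_per_vm               = 4096 # illustrative; must be a power of 2

  # Dynamic allocation requires endpoint-independent mapping to be off.
  enable_endpoint_independent_mapping = false

  # Log only failures so dropped connections are visible.
  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}
```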
⚑ VPN down / Interconnect issue / Cloud Router BGP

Causes: IKE/IPSec parameter mismatch (VPN); attachment/BGP session down (Interconnect); Cloud Router not advertising subnets or not learning on-prem routes; CIDR overlap. Checks: tunnel/attachment status; BGP session state and advertised/learned routes. Fix: align IKE params, fix BGP advertisements both ways, resolve overlap. Prevention: Interconnect + HA VPN backup; alarms on tunnel/BGP state.

⚑ Cloud DNS / private DNS issue

Causes: private zone not attached to the VPC; missing record; forwarding/peering not set for hybrid; wrong resolver. Checks: dig from a VM; zone attachment. Fix: attach the private zone, add records, set inbound/outbound server policies for on-prem forwarding.
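
The three fixes above (attach the private zone, forward to on-prem, accept inbound queries) map to three resources. A sketch, assuming the section 17 VPC; the zone names, domains, and the on-prem resolver IP are hypothetical:

```hcl
# Private zone attached to the VPC.
resource "google_dns_managed_zone" "private" {
  name       = "corp-internal"
  dns_name   = "corp.internal." # hypothetical domain
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = google_compute_network.vpc.id
    }
  }
}

# Outbound forwarding: send on-prem names to the on-prem resolver.
resource "google_dns_managed_zone" "onprem_forward" {
  name       = "onprem-forward"
  dns_name   = "onprem.example.com." # hypothetical domain
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = google_compute_network.vpc.id
    }
  }

  forwarding_config {
    target_name_servers {
      ipv4_address = "10.200.0.53" # hypothetical on-prem resolver
    }
  }
}

# Inbound server policy: lets on-prem resolvers query Cloud DNS.
resource "google_dns_policy" "inbound" {
  name                      = "allow-inbound-forwarding"
  enable_inbound_forwarding = true

  networks {
    network_url = google_compute_network.vpc.id
  }
}
```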

Load balancer & databases

⚑ LB backend unhealthy / SSL cert issue

Unhealthy: firewall not allowing health-check ranges (35.191.0.0/16, 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app listening only on localhost; wrong backend protocol/named port. SSL: Google-managed cert stuck in PROVISIONING because DNS doesn't point at the LB IP yet; missing SAN; expired self-managed cert. Fix: per section 7.
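
Two of the usual culprits, the health-check definition and the managed certificate, are short enough to review side by side. A sketch only: the port, path, and domain are assumptions and must match what the application actually serves.

```hcl
# Health check: port, path, and protocol must match the backend.
resource "google_compute_health_check" "lb" {
  name                = "app-lb-hc"
  check_interval_sec  = 10
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 8080       # assumed backend port
    request_path = "/healthz" # assumed health endpoint
  }
}

# Google-managed certificate: stays in PROVISIONING until the
# domain's DNS record resolves to the load balancer IP.
resource "google_compute_managed_ssl_certificate" "app" {
  name = "app-cert"

  managed {
    domains = ["app.example.com"] # hypothetical domain
  }
}
```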

⚑ Cloud SQL connection / performance / backup issue

Connection: use the Auth Proxy or private IP; check the runtime SA has roles/cloudsql.client; authorized networks / SSL for public IP; Serverless VPC Access for Cloud Run/Functions. Performance: Query Insights + Cloud Monitoring (CPU/connections/storage/lag); add read replicas; tune queries/flags. Backup failed: check storage, PITR/binary logging enabled, and quota; test a restore.

gcloud sql instances describe INSTANCE
./cloud-sql-proxy --private-ip PROJECT:REGION:INSTANCE
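
Many connection problems come down to how the instance was provisioned. The sketch below shows one way to create a private-IP, HA, PITR-enabled instance with the client role on the runtime SA. Assumptions: the VPC, var.project_id, and app-runtime SA from section 17; PostgreSQL 16; an illustrative machine tier and peering range size.

```hcl
# Private services access: reserved range + peering for managed services.
resource "google_compute_global_address" "psa" {
  name          = "psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16 # illustrative size
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa.name]
}

resource "google_sql_database_instance" "app" {
  name                = "app-pg"
  database_version    = "POSTGRES_16" # assumed engine/version
  region              = "us-central1"
  deletion_protection = true
  depends_on          = [google_service_networking_connection.psa]

  settings {
    tier              = "db-custom-2-8192" # illustrative size
    availability_type = "REGIONAL"         # HA across zones

    ip_configuration {
      ipv4_enabled    = false # no public IP
      private_network = google_compute_network.vpc.id
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}

# The runtime SA needs this role to connect through the Auth Proxy.
resource "google_project_iam_member" "sql_client" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.app.email}"
}
```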

IAM & service accounts

⚑ IAM permission denied / impersonation / SA key issue

Permission denied: walk the section 2 mental model - right project? which principal? role/permission? scope/inheritance? deny policy? Org Policy? VPC-SC? API enabled? For workloads: does the caller have actAs (roles/iam.serviceAccountUser) or roles/iam.serviceAccountTokenCreator on the SA, and does the SA have the role on the target? Impersonation: caller needs roles/iam.serviceAccountTokenCreator on the SA. SA key issue: key creation may be blocked by Org Policy (good) - use impersonation/Workload Identity instead; a key that was deleted, disabled (for example after a leak), or rotated out stops working. Tools: Policy Troubleshooter, Policy Analyzer, audit logs.

Serverless & GKE

⚑ Cloud Run revision / Cloud Functions timeout / Pub/Sub backlog

Cloud Run: container must listen on $PORT and start fast; check the runtime SA permissions and startup CPU; roll back a bad revision. Functions timeout: raise timeout/memory, make idempotent, offload long work. Pub/Sub backlog: slow/failing subscriber - check ack deadline, errors, scale consumers, add a dead-letter topic.
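
For the Pub/Sub backlog case, the ack deadline, retry backoff, and dead-letter topic are all properties of the subscription. A sketch with hypothetical topic and subscription names and illustrative values:

```hcl
resource "google_pubsub_topic" "orders" {
  name = "orders" # hypothetical topic
}

resource "google_pubsub_topic" "orders_dlq" {
  name = "orders-dlq" # hypothetical dead-letter topic
}

resource "google_pubsub_subscription" "orders_worker" {
  name                 = "orders-worker"
  topic                = google_pubsub_topic.orders.id
  ack_deadline_seconds = 60 # illustrative; size to handler duration

  # After 5 failed deliveries, park the message instead of retrying forever.
  dead_letter_policy {
    dead_letter_topic     = google_pubsub_topic.orders_dlq.id
    max_delivery_attempts = 5
  }

  # Back off between redeliveries so a failing consumer is not hammered.
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "600s"
  }
}
```

Dead-lettering only works once the Pub/Sub service agent can publish to the dead-letter topic and subscribe to the source subscription, so those two IAM grants need to be in place as well.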

⚑ GKE pod not starting / Ingress issue

Pod: Pending (capacity / pod-IP exhaustion / requests too big), ImagePullBackOff (Artifact Registry read perms / PGA for private cluster), CrashLoopBackOff (config/probes). Ingress: allow health-check ranges; readiness probe aligned; managed cert needs DNS → LB IP first; verify BackendConfig/NEG. Tools: kubectl describe/logs. (Section 10.)

Observability

⚑ Alert not firing / logs missing

Alert: wrong metric/filter, threshold/duration never met, policy disabled, notification channel unverified, or maintenance suppression. Test by forcing the condition; use absence conditions for heartbeats. Logs missing: log not enabled (e.g. Data Access audit logs off), Ops Agent not installed on the VM, a log exclusion filter dropping them, wrong project/log bucket, or retention expired. Fix: enable the log, install the agent, review Log Router sinks/exclusions.
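
When the missing logs are Data Access audit logs, enabling them is an IAM audit configuration on the project. A sketch for Cloud Storage only, assuming var.project_id from section 17; the choice of service is illustrative, and Data Access logs can be high-volume, so scope them deliberately:

```hcl
resource "google_project_iam_audit_config" "storage_data_access" {
  project = var.project_id
  service = "storage.googleapis.com" # illustrative; one service at a time

  audit_log_config {
    log_type = "DATA_READ"
  }

  audit_log_config {
    log_type = "DATA_WRITE"
  }
}
```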

17. gcloud CLI, Terraform, and Automation

Practical, copy-friendly automation: gcloud setup and configurations, service-account impersonation, common commands, the Google Terraform provider, and clean examples for VPC, VMs, buckets, IAM, and alerts - plus state and structure practices.

Last reviewed: July 2026 Verify provider version and resource argument names against current docs.
TL;DR

The gcloud CLI uses named configurations (account + project + region). Prefer Application Default Credentials and service-account impersonation over downloaded keys. Build production infrastructure with Terraform (the Google provider); keep state remote and locked in a Cloud Storage backend, structure code into modules, and separate environments by workspace/backend + tfvars. Run it in a pipeline (Cloud Build) with an impersonated deployer SA - no keys.

gcloud CLI setup & configurations

# Install the Cloud SDK, then authenticate
gcloud auth login                       # human login
gcloud auth application-default login    # ADC for local tools/Terraform

# Named configurations (switch between projects/accounts fast)
gcloud config configurations create prod
gcloud config set account jane@example.com
gcloud config set project acme-app-prod-01
gcloud config set compute/region us-central1

gcloud config configurations activate prod
gcloud config configurations list

Service-account impersonation (no keys)

# Your user needs roles/iam.serviceAccountTokenCreator on the SA
gcloud config set auth/impersonate_service_account deployer@PROJECT.iam.gserviceaccount.com
gcloud compute instances list      # now runs AS the SA, short-lived token

# Or per-command:
gcloud storage ls --impersonate-service-account=deployer@PROJECT.iam.gserviceaccount.com

# Terraform: impersonate via ADC + the provider's impersonate_service_account setting (no key file)
Security note
Do not download service-account JSON keys. Use ADC for local dev, impersonation for acting as an SA, and Workload Identity Federation for CI/CD. Block key creation org-wide with iam.disableServiceAccountKeyCreation.
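
The Token Creator prerequisite can itself be managed in Terraform. A minimal sketch, assuming a hypothetical deployer SA and a hypothetical platform-deployers group; granting to a group rather than to individuals keeps the binding stable as people change roles:

```hcl
resource "google_service_account" "deployer" {
  account_id   = "deployer"
  display_name = "Terraform deployer SA"
}

# Members of the group may mint short-lived tokens for the deployer SA.
resource "google_service_account_iam_member" "deployers_impersonate" {
  service_account_id = google_service_account.deployer.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "group:platform-deployers@example.com" # hypothetical group
}
```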

Common gcloud commands

# Projects / APIs
gcloud projects list
gcloud services enable compute.googleapis.com run.googleapis.com --project PROJECT
gcloud services list --enabled

# Compute
gcloud compute instances list
gcloud compute instances create web-1 --machine-type=e2-standard-2 --no-address --zone=us-central1-a
gcloud compute ssh web-1 --tunnel-through-iap --zone=us-central1-a

# Storage
gcloud storage ls
gcloud storage cp -r ./data gs://my-bucket/data

# IAM
gcloud projects add-iam-policy-binding PROJECT --member="group:app@example.com" --role="roles/run.developer"
gcloud projects get-iam-policy PROJECT

# Logs
gcloud logging read 'severity>=ERROR' --limit 20 --freshness 1h

Terraform provider setup

# versions.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 6.0" }   # verify current major
  }
  backend "gcs" {                     # remote, locked state
    bucket = "acme-tfstate-prod"
    prefix = "app"
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
  # Impersonate a deployer SA using ADC - no key file
  impersonate_service_account = var.deployer_sa
}

Create a custom-mode VPC + subnet + Cloud NAT

resource "google_compute_network" "vpc" {
  name                    = "app-vpc"
  auto_create_subnetworks = false        # custom mode
}

resource "google_compute_subnetwork" "app" {
  name                     = "app-us-central1"
  ip_cidr_range            = "10.10.0.0/20"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true         # Private Google Access
  secondary_ip_range {                    # for GKE pods, if needed
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16"
  }
}

resource "google_compute_router" "rt" {
  name    = "app-rt"
  region  = "us-central1"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "app-nat"
  router                             = google_compute_router.rt.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

Create a Compute Engine VM (no external IP, Shielded)

resource "google_compute_instance" "app" {
  name         = "app-1"
  machine_type = "e2-standard-2"
  zone         = "us-central1-a"
  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }
  network_interface {
    subnetwork = google_compute_subnetwork.app.id
    # no access_config block = no external IP
  }
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
  service_account {
    email  = google_service_account.app.email
    scopes = ["cloud-platform"]
  }
  metadata = { enable-oslogin = "TRUE" }
}

Create a hardened Cloud Storage bucket

resource "google_storage_bucket" "data" {
  name                        = "acme-app-data-prod"
  location                    = "US"
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 30 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
  # encryption { default_kms_key_name = google_kms_crypto_key.data.id }  # CMEK
}
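
The commented-out CMEK line refers to a key that is not defined in this section. One possible definition is sketched below, assuming a KMS key ring in the us multi-region to match the bucket's US location and a 90-day rotation period (both illustrative). The Cloud Storage service agent must be able to use the key before the bucket can reference it.

```hcl
resource "google_kms_key_ring" "data" {
  name     = "app-data"
  location = "us" # must match the bucket location
}

resource "google_kms_crypto_key" "data" {
  name            = "bucket-key"
  key_ring        = google_kms_key_ring.data.id
  rotation_period = "7776000s" # 90 days, illustrative

  lifecycle {
    prevent_destroy = true # losing the key means losing the data
  }
}

# Look up the project's Cloud Storage service agent.
data "google_storage_project_service_account" "gcs" {}

# Let the service agent encrypt/decrypt with the key.
resource "google_kms_crypto_key_iam_member" "gcs_uses_key" {
  crypto_key_id = google_kms_crypto_key.data.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${data.google_storage_project_service_account.gcs.email_address}"
}
```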

IAM binding, service account, and a conditional binding

resource "google_service_account" "app" {
  account_id   = "app-runtime"
  display_name = "App runtime SA"
}

# Least-privilege: object read on one bucket only, granted to the runtime SA
resource "google_storage_bucket_iam_member" "read" {
  bucket = google_storage_bucket.data.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.app.email}"
}

# Conditional project binding (time-boxed / resource-scoped)
resource "google_project_iam_member" "cond" {
  project = var.project_id
  role    = "roles/compute.viewer"
  member  = "group:oncall@example.com"
  condition {
    title      = "prod-hours"
    expression = "request.time.getHours('America/New_York') >= 8 && request.time.getHours('America/New_York') < 20"
  }
}

Create a Cloud Monitoring alert

resource "google_monitoring_notification_channel" "email" {
  display_name = "ops-email"
  type         = "email"
  labels = { email_address = "oncall@example.com" }
}

resource "google_monitoring_alert_policy" "cpu" {
  display_name = "VM CPU high"
  combiner     = "OR"
  conditions {
    display_name = "CPU > 85%"
    condition_threshold {
      filter          = "resource.type=\"gce_instance\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85
      duration        = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
  notification_channels = [google_monitoring_notification_channel.email.id]
}

State, structure, and CI/CD

  • Remote, locked state: a GCS backend (object versioning on the state bucket) with state locking. Never keep prod state on a laptop; never commit state (it holds secrets).
  • Modular structure: reusable modules (network, compute, data, iam, monitoring) composed per environment.
  • Environment separation: separate state per env (workspaces or separate backends/prefixes) driven by dev.tfvars / prod.tfvars; separate projects and ideally separate deployer SAs/pipelines.
  • CI/CD: run plan/apply in Cloud Build (or another pipeline) as a dedicated deployer SA - impersonated, or reached through Workload Identity Federation when the pipeline runs outside Google Cloud - with no keys. Gate apply with approvals; run plan on PRs for review.
  • No secrets in code: reference Secret Manager / KMS by name; keep secret tfvars out of git.
gcp-infra/
  modules/
    network/   compute/   data/   iam/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  cloudbuild.yaml
  README.md
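
As one possible wiring of the layout above, each environment directory carries its own backend and composes the shared modules. A sketch only: the module inputs and outputs (project_id, region, subnetwork, subnetwork_id) are hypothetical and depend on how your modules are written.

```hcl
# envs/prod/backend.tf - separate state per environment
terraform {
  backend "gcs" {
    bucket = "acme-tfstate-prod"
    prefix = "envs/prod"
  }
}

# envs/prod/main.tf - compose shared modules with prod values
module "network" {
  source     = "../../modules/network"
  project_id = var.project_id # hypothetical module input
  region     = var.region     # hypothetical module input
}

module "compute" {
  source     = "../../modules/compute"
  project_id = var.project_id
  subnetwork = module.network.subnetwork_id # hypothetical module output
}
```

A plan for this environment would then be run as terraform -chdir=envs/prod plan -var-file=prod.tfvars, with dev using the same modules and its own backend and tfvars.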
Architect note
Prod and DR should be provably identical because they come from the same modules with different variables. Manual Console changes in prod are the enemy of a working DR - enforce "infrastructure changes go through Terraform," run plan in CI on every PR, and use Cloud Asset Inventory / drift detection to catch out-of-band changes.

18. Google Cloud Architecture Framework

The five pillars Google uses to review a design - operational excellence, security/privacy/compliance, reliability, cost optimization, and performance. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.

Last reviewed: July 2026 Verify against the current Architecture Framework docs.
HOW TO USE THIS

Run a design (or an existing system) through all five pillars. For each, ask the checklist questions, map to concrete Google Cloud services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass.


Operational excellence

What it means: run, monitor, and improve systems and processes reliably and repeatably - automation, observability, incident response, and change management.

Why it matters: most outages are caused by change and by not seeing problems early. Operational maturity is what turns a good design into a dependable service.

Supporting services: Cloud Monitoring/Logging, Error Reporting/Trace/Profiler, Cloud Build/Deploy, Terraform + Cloud Asset Inventory (IaC + drift), VM Manager, Service Health, Active Assist.

Practical examples: everything as code with peer-reviewed changes; SLOs with error budgets; centralized logs; golden images + automated patching; runbooks tied to alerts; blameless post-mortems.

Common mistakes
Manual Console changes in prod; alerting on causes not symptoms (or alert fatigue); no SLOs; observability bolted on after an incident; no defined incident process.
Review checklist
Is all infra in code with review? Are there SLOs + error budgets? Are logs centralized and audit logs on? Is patching automated and reported? Are alerts symptom-based and actionable? Are runbooks current and rehearsed?
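
The "logs centralized and audit logs on" question is easier to answer yes to when the sink is code. A sketch of an organization-level aggregated sink into a central log bucket, assuming hypothetical var.org_id and var.logging_project_id variables and an illustrative 365-day retention:

```hcl
variable "org_id" {
  type = string # hypothetical: numeric organization ID
}

variable "logging_project_id" {
  type = string # hypothetical: central logging project
}

# Central log bucket in the logging project.
resource "google_logging_project_bucket_config" "central" {
  project        = var.logging_project_id
  location       = "global"
  bucket_id      = "central-audit"
  retention_days = 365 # illustrative retention
}

# Aggregated sink: audit logs from every project under the org.
resource "google_logging_organization_sink" "audit" {
  name             = "org-audit-to-central"
  org_id           = var.org_id
  include_children = true
  filter           = "logName:\"cloudaudit.googleapis.com\""
  destination      = "logging.googleapis.com/${google_logging_project_bucket_config.central.id}"
}

# The sink's writer identity must be allowed to write to the bucket.
resource "google_project_iam_member" "sink_writer" {
  project = var.logging_project_id
  role    = "roles/logging.bucketWriter"
  member  = google_logging_organization_sink.audit.writer_identity
}
```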

Security, privacy, and compliance

What it means: protect identities, data, and workloads; meet regulatory obligations; and be able to prove it.

Why it matters: a single over-broad grant, public bucket, or long-lived key can undo everything else. Security is a design property, not an add-on.

Supporting services: Cloud IAM (least privilege, groups), Organization Policies, VPC Service Controls, Security Command Center, Cloud KMS/HSM + Secret Manager, Cloud Armor, IAP, Binary Authorization, Sensitive Data Protection, Cloud Audit Logs.

Practical examples: no basic roles; SA keys disabled org-wide; preventive Org Policies; VPC-SC around sensitive data; CMEK; private IPs + IAP; SCC org-wide; centralized audit logs. (See section 8's checklist.)

Common mistakes
Owner/Editor everywhere; long-lived SA keys; public storage; public DB endpoints; secrets in code; no VPC-SC for sensitive data; audit logs off or not centralized; Org Policies unset.
Review checklist
Least privilege via groups/predefined roles? SA keys disabled + impersonation/WIF used? Preventive Org Policies on? VPC-SC around sensitive data? CMEK + Secret Manager? SCC + centralized audit logs? Public exposure minimized? DR keys available cross-region?

Reliability

What it means: the system meets its availability and durability targets and recovers from failures - designed around resource scope (zonal/regional/multi-region), redundancy, and tested DR.

Why it matters: reliability targets (SLOs) drive architecture and cost. You cannot bolt on availability after an outage.

Supporting services: regional MIGs + autohealing, regional PD, global external LB (health-based failover), Cloud SQL HA / cross-region replicas, Spanner/Firestore multi-region, Backup and DR, Cloud Monitoring SLOs.

Practical examples: multi-zone by default (regional resources); a defined DR pattern per tier with tested RTO/RPO; graceful degradation; capacity planning + reservations; error budgets governing release pace.

Common mistakes
Single zonal VM/disk for "prod"; DR never tested; CMEK keys missing in DR; no capacity reservation for failover; assuming stateful active-active is easy.
Review checklist
What scope is each critical resource (zonal/regional/multi-region)? Multi-zone by default? Defined + tested DR pattern per tier with RTO/RPO? Health checks + autohealing? SLOs monitored? Capacity for failover?
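
"Multi-zone by default" and "health checks + autohealing" translate fairly directly into a regional MIG. A sketch, assuming an instance template named google_compute_instance_template.app is defined elsewhere, with an illustrative port, health path, and zone list:

```hcl
resource "google_compute_health_check" "autoheal" {
  name                = "app-autoheal-hc"
  check_interval_sec  = 10
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 8080       # assumed app port
    request_path = "/healthz" # assumed health endpoint
  }
}

resource "google_compute_region_instance_group_manager" "app" {
  name               = "app-mig"
  region             = "us-central1"
  base_instance_name = "app"
  target_size        = 3

  # Spread instances across three zones in the region.
  distribution_policy_zones = ["us-central1-a", "us-central1-b", "us-central1-c"]

  version {
    instance_template = google_compute_instance_template.app.id # assumed to exist
  }

  named_port {
    name = "http"
    port = 8080
  }

  # Recreate instances that fail the health check after startup grace.
  auto_healing_policies {
    health_check      = google_compute_health_check.autoheal.id
    initial_delay_sec = 120
  }
}
```

If an autoscaler manages this group, target_size is normally left unset so Terraform and the autoscaler do not fight over the instance count.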

Cost optimization

What it means: deliver the required value at the lowest sustainable cost - right-sizing, discounts, eliminating waste, and attributing spend.

Why it matters: unmanaged cloud spend grows silently; cost is a first-class design and operational concern, not a finance afterthought.

Supporting services: billing export to BigQuery, Budgets + alerts, quotas, Recommender/Active Assist, CUDs, Spot VMs, custom machine types, storage lifecycle, BigQuery query controls.

Practical examples: labels + billing export for attribution; CUDs for baseline; Spot for batch; scheduled non-prod shutdown; storage lifecycle; BigQuery partitioning + max-bytes-billed; monthly review (section 14).

Common mistakes
No labels/attribution; over-provisioned VMs and disks; on-demand for steady-state; unbounded BigQuery scans; idle IPs and orphaned disks; noisy log ingestion.
Review checklist
Is spend attributed via labels + billing export? Budgets + quotas in place? CUD coverage for baseline? Right-sizing acted on? Storage lifecycle + BigQuery controls? A recurring cost review?
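
A budget with threshold alerts is one of the cheaper checklist items to automate. A sketch, assuming hypothetical var.billing_account_id and var.project_number variables and an illustrative monthly amount:

```hcl
variable "billing_account_id" {
  type = string # hypothetical
}

variable "project_number" {
  type = string # hypothetical: numeric project number
}

resource "google_billing_budget" "prod" {
  billing_account = var.billing_account_id
  display_name    = "prod-monthly"

  budget_filter {
    projects = ["projects/${var.project_number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "5000" # illustrative monthly budget
    }
  }

  # Alert at 50% and 90% of actual spend.
  threshold_rules {
    threshold_percent = 0.5
  }

  threshold_rules {
    threshold_percent = 0.9
  }

  # Alert when the forecast says the budget will be exceeded.
  threshold_rules {
    threshold_percent = 1.0
    spend_basis       = "FORECASTED_SPEND"
  }
}
```

Budgets notify; they do not cap spend. Hard limits still come from quotas and Org Policies.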

Performance optimization

What it means: resources meet latency/throughput requirements efficiently as demand changes - right machine types, autoscaling, caching, data locality, and query design.

Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.

Supporting services: machine families (C-series for CPU, M-series for memory, GPUs/TPUs), autoscaling (MIG/Cloud Run/GKE HPA), Cloud CDN, Memorystore, global LB, Hyperdisk (tunable IOPS), BigQuery partitioning/clustering/BI Engine.

Practical examples: match machine family to the bottleneck; autoscale on the right signal; cache at the edge (CDN) and in-memory (Memorystore); co-locate data and compute (reduce egress/latency); partition/cluster BigQuery; load-test before launch.

Common mistakes
Wrong machine family (CPU-bound on a general shape); no autoscaling or scaling on the wrong metric; disk IOPS ceiling mistaken for CPU; chatty cross-region calls; unpartitioned BigQuery scans.
Review checklist
Is the machine family matched to the workload? Autoscaling on a meaningful signal? Caching (CDN/Memorystore) where it helps? Data co-located with compute? Storage/DB performance sized (Hyperdisk IOPS, BQ partitioning)? Load-tested?
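
For the BigQuery items, partitioning and clustering are declared on the table, and a required partition filter guards against full scans. A sketch with a hypothetical dataset, table, and schema; verify argument names against your provider version:

```hcl
resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics" # hypothetical dataset
  location   = "US"
}

resource "google_bigquery_table" "events" {
  dataset_id          = google_bigquery_dataset.analytics.dataset_id
  table_id            = "events" # hypothetical table
  deletion_protection = true

  # Daily partitions on the event timestamp.
  time_partitioning {
    type  = "DAY"
    field = "event_ts"
  }

  # Cluster within each partition by the most common filter column.
  clustering = ["customer_id"]

  # Reject queries that do not filter on the partition column.
  require_partition_filter = true

  schema = jsonencode([
    { name = "event_ts", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "customer_id", type = "STRING", mode = "REQUIRED" },
    { name = "payload", type = "JSON", mode = "NULLABLE" },
  ])
}
```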

19. Learning Path

A structured route from Google Cloud fundamentals to enterprise-grade architecture, security, data, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026 Certification names/exam details change - verify on Google Cloud training before scheduling.
Beginner
Fundamentals: hierarchy, IAM, VPC, Compute, Storage, Monitoring
Intermediate
LB, MIGs, private networking, Cloud SQL, KMS, SCC, cost
Advanced
Shared VPC, VPC-SC, GKE, BigQuery, Dataflow, Vertex AI, DR, Terraform
How to use this
Do the labs, don't just read. Use the Free Tier / a trial project for hands-on. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (Cloud Digital Leader → Associate Cloud Engineer → Professional Cloud Architect, plus role specialties) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations
Goal: deploy and connect basic Google Cloud resources confidently

What to learn

  • Fundamentals: global infra, regions/zones, resource scope, and the org/folder/project hierarchy (section 1).
  • IAM basics: principals, predefined roles (not basic), inheritance, groups, service accounts (section 2).
  • VPC basics: custom-mode VPC, regional subnets, firewall rules, Cloud NAT (section 3).
  • Compute Engine basics: machine types, images, OS Login, IAP SSH (section 4).
  • Cloud Storage basics: buckets, classes, UBLA/public access prevention (section 5).
  • Cloud Monitoring/Logging basics: metrics, the Ops Agent, an alert, logs (section 9).

Why it matters

Every design rests on the hierarchy, IAM, and the global-VPC / regional-subnet model. Get these right and everything later is easier.

Hands-on labs

  • Create a project; add a group; grant a predefined role; test access.
  • Build a custom-mode VPC with a subnet, firewall rules, and Cloud NAT.
  • Launch a VM with no external IP; SSH via IAP; use OS Login.
  • Create a hardened bucket (UBLA + public access prevention); upload objects.
  • Install the Ops Agent; create a CPU alert to an email channel.

Common mistakes

Using the default network / auto mode; basic roles; per-user grants; external IPs everywhere; forgetting the Ops Agent for memory.

Expected outcome

You can stand up a properly-segmented VPC, reach a private VM via IAP, use IAM correctly, and see basic telemetry.

Intermediate

Level 2 - Building real workloads
Goal: deploy an HA app + managed database with monitoring, security, and cost control

What to learn

  • Load balancing (global external App LB + Cloud Armor) and managed instance groups + autoscaling (sections 7, 4).
  • Private networking: Private Google Access, Cloud NAT, private IP for services; Cloud VPN / Interconnect basics (section 3).
  • Cloud SQL: HA, private IP, Auth Proxy, backups/PITR, read replicas (section 6).
  • Cloud Logging (sinks) and Cloud KMS / Secret Manager (sections 9, 8).
  • Security Command Center and Org Policies basics (section 8).
  • Cost management: budgets, labels, billing export, CUDs (section 14).

Why it matters

This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.

Hands-on labs

  • Deploy a 3-tier app: global LB + Cloud Armor → regional MIG (or Cloud Run) → Cloud SQL (private IP, HA).
  • Allow the health-check ranges; confirm backends healthy; force a failover.
  • Store the DB password in Secret Manager; connect via the Auth Proxy with a runtime SA.
  • Create alerts (CPU, unhealthy backend, DB storage) and a notification channel.
  • Set a budget + labels + billing export; add a couple of Org Policies (no external IP, restrict locations).
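For the health-check step in the labs above, a small check like the following can confirm that an ingress rule's source ranges cover the probe ranges before you start debugging unhealthy backends. It is a sketch under assumptions: `35.191.0.0/16` and `130.211.0.0/22` are the ranges Google documents for Application Load Balancer health checks at the time of writing (verify against current documentation), and the function name is hypothetical.

```python
import ipaddress

# Documented health-check probe ranges for Application Load Balancer backends.
# Assumption: confirm these against current Google Cloud documentation.
HEALTH_CHECK_RANGES = ("35.191.0.0/16", "130.211.0.0/22")


def uncovered_health_check_ranges(allowed_source_ranges):
    """Return the probe ranges not fully contained in any single allowed range."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    missing = []
    for cidr in HEALTH_CHECK_RANGES:
        probe = ipaddress.ip_network(cidr)
        covered = any(
            probe.version == net.version and probe.subnet_of(net)
            for net in allowed
        )
        if not covered:
            missing.append(cidr)
    return missing


# A rule that only allows internal traffic misses both probe ranges.
print(uncovered_health_check_ranges(["10.0.0.0/8"]))
# A rule listing both ranges leaves nothing uncovered.
print(uncovered_health_check_ranges(["35.191.0.0/16", "130.211.0.0/22"]))
```

The check is deliberately conservative: a probe range covered only by the union of several smaller source ranges is still reported as missing.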

Common mistakes

Health-check firewall rule missing; DB on public IP; secrets in code; noisy alerts; no labels for attribution.
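The missing-labels mistake is easy to catch before deployment with a simple validator. The sketch below assumes a team convention of three required keys (`env`, `team`, `cost-center`, all hypothetical) and applies a simplified, ASCII-only version of the label format rules: lowercase letters, digits, underscores, and hyphens, up to 63 characters, with keys starting with a letter. Google Cloud also accepts international characters, which this sketch ignores.

```python
import re

# Simplified ASCII-only label rules; real labels also allow international characters.
KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")

# Hypothetical team convention for cost attribution.
REQUIRED_KEYS = ("env", "team", "cost-center")


def label_problems(labels):
    """Return human-readable problems for one resource's label dict."""
    problems = [
        f"missing required label: {key}"
        for key in REQUIRED_KEYS
        if key not in labels
    ]
    for key, value in labels.items():
        if not KEY_RE.fullmatch(key):
            problems.append(f"invalid key: {key!r}")
        if not VALUE_RE.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


print(label_problems({"env": "prod", "team": "payments", "cost-center": "cc-1042"}))
print(label_problems({"env": "Prod", "team": "payments"}))
```

Running a check like this in CI, before `terraform apply`, keeps unlabeled resources out of the billing export rather than chasing them afterwards.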

Expected outcome

You can deploy a secure, monitored, HA application + managed database, connect it privately, and keep its cost and access under control.

Advanced

Level 3 - Enterprise architecture, data & AI
Goal: design governed, multi-region, data-and-AI-capable platforms

What to learn

  • Shared VPC, Organization Policies, and VPC Service Controls; Private Service Connect (sections 3, 8, 2).
  • GKE (Autopilot/Standard), Workload Identity, and Cloud Run at scale (section 10).
  • Pub/Sub, Eventarc, Workflows for event-driven systems (section 10).
  • BigQuery (mental model, partitioning/clustering, cost), Dataflow, Dataplex (section 11).
  • Vertex AI, Gemini, and vector search / governed RAG (section 12).
  • Multi-region DR (global LB + cross-region replicas / Spanner) (section 13).
  • Terraform + remote state + CI/CD; a landing zone (sections 17, 1, 14).
  • Enterprise security at scale: CMEK, Binary Authorization, centralized logging + SCC (section 8).

Why it matters

At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.

Hands-on labs

  • Deploy a landing zone via Terraform: hierarchy, groups, Org Policies, Shared VPC, central logging + SCC, budgets.
  • Stand up a private GKE (Autopilot) cluster in the Shared VPC with Workload Identity and a Cloud Build/Deploy pipeline.
  • Build a BigQuery + Cloud Storage lakehouse with a Datastream CDC feed and Dataplex governance; tune a query with partitioning/clustering.
  • Build a governed RAG assistant: Cloud Storage + vectors in AlloyDB/BigQuery + Gemini behind a Cloud Run serving API, with VPC-SC + audit + entitlement-filtered retrieval.
  • Implement cross-region DR for a Cloud SQL app (replica promotion) behind a global LB; rehearse failover and confirm CMEK keys in DR.

Common mistakes

Skipping the landing zone; DR never tested; over-privileged Workload Identity; pod-IP exhaustion; unbounded BigQuery scans; connecting AI to production data without a governed serving layer.
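Pod-IP exhaustion is mostly arithmetic, so it can be estimated before the cluster exists. The sketch below assumes the GKE Standard behaviour in which each node reserves a block from the pod secondary range holding at least twice its maximum pods (for example, a /24 per node at the default of 110 pods). Treat that sizing rule as an assumption to verify with current GKE documentation; the function names are hypothetical.

```python
def per_node_prefix(max_pods_per_node):
    """Prefix length of the pod block reserved per node.

    Assumes the block is the smallest power of two holding at least
    twice the node's maximum pod count.
    """
    bits = (2 * max_pods_per_node - 1).bit_length()
    return 32 - bits


def max_nodes(pod_range_prefix, max_pods_per_node):
    """Upper bound on nodes that a pod secondary range can serve."""
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < pod_range_prefix:
        return 0  # the range is smaller than a single node's block
    return 2 ** (node_prefix - pod_range_prefix)


print(per_node_prefix(110))   # 24
print(max_nodes(14, 110))     # 1024
print(max_nodes(20, 110))     # 16
print(max_nodes(20, 32))      # 64
```

The last two lines show the usual lever: lowering max pods per node from 110 to 32 lets the same /20 range serve four times as many nodes, which matters in a Shared VPC where secondary ranges are scarce.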

Expected outcome

You can design and operate a governed, automated, multi-region Google Cloud platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.

Certification checkpoints (optional)

Level | Typical certification track
Beginner | Cloud Digital Leader; Associate Cloud Engineer
Intermediate | Professional Cloud Architect; Professional Cloud Network Engineer
Advanced | Professional Cloud Security Engineer, Data Engineer, Database Engineer, DevOps Engineer, Machine Learning Engineer
Verify before scheduling
Google updates exam content and role certifications regularly. Confirm the current track and objectives on Google Cloud's official training site before you prepare. Certifications validate knowledge; the labs above build the capability employers pay for.