Google Cloud Deep Dive Portal

A practical reference for Cloud Architects, DBAs, Data Engineers, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Google Cloud environments - not a marketing overview.

19 deep sections | Architecture patterns | Troubleshooting runbooks | gcloud & Terraform | Architecture Framework

Last reviewed: July 2026. Google Cloud changes frequently - verify with current Google Cloud documentation before production use.

WHO THIS IS FOR

Cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, DevOps, security, data, and AI engineers - and anyone moving from traditional infrastructure or another cloud into Google Cloud. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Google Cloud and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, machine types, quotas, model availability, service names), a Verify with current Google Cloud documentation flag.

Learn

Foundations first

Sections 1-2 establish the mental model: the resource hierarchy (org / folder / project), regions/zones, and the IAM allow/deny policy model that everything else depends on.

Build

Service deep dives

Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data analytics, and AI - with diagrams, tables, and gotchas.

Operate

Run and govern

Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Architecture Framework, and a structured learning path.

Reading the callouts

Several note types recur. They flag the perspective that matters most for a point.

Architect note

Design-time decisions, trade-offs, and things to settle before production.

DBA note

Database-specific behavior - what Google manages vs. what you manage, patching, backups, connectivity.

Security note

Exposure, least privilege, encryption, and audit considerations.

Cost note

Where money is spent and commonly wasted.

Operations note

Day-2 behavior: patching, scaling, maintenance, and reliability.

Data engineering note

BigQuery, pipelines, lake/warehouse design, and query-cost behavior.

AI note

Vertex AI, Gemini, vector search, and governed GenAI patterns.

Common mistake

A specific error teams repeatedly make, and how to avoid it.

The Google Cloud shared responsibility model (orientation)

Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Google already does.

Layer	Compute Engine (IaaS)	GKE Standard	Cloud SQL / managed DB	BigQuery / Cloud Run (serverless)
Physical / hypervisor	Google	Google	Google	Google
OS patching	You	You (nodes) / Google (control plane)	Google	Google
Runtime / engine patching	You	Shared	Google (in window)	Google
Backup config	You	You	Managed, you configure	Managed / you export
Scaling / HA	You build it (MIG)	You configure	You enable HA	Automatic
Data, schema, access, IAM	You	You	You	You

The rule that never moves

Google secures the cloud. You secure what you put in it: identities, IAM, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.

1. Google Cloud Fundamentals

The global infrastructure and the resource hierarchy (organization, folders, projects) that every Google Cloud deployment is built on - plus the mental model that makes the rest of the platform predictable.

Last reviewed: July 2026. Verify region list, quotas, and service availability in the Console.

What GCP is | Global infrastructure | Resource scope | Hierarchy | Mental model | Org policies | Labels vs tags | Names & IDs | Console / gcloud / IaC | Landing zone

TL;DR

Google Cloud is a set of regions (each with multiple zones) sitting on Google's global network. Resources have a scope - global, regional, or zonal - and that scope drives HA design. Everything lives in a hierarchy: Organization > Folders > Projects > resources. The project is the fundamental unit of deployment, billing, quota, and isolation. IAM grants access down the hierarchy; Organization Policies restrict what is allowed. Get the hierarchy and a landing zone right before production - restructuring later is painful.

What Google Cloud is

Google Cloud Platform (GCP) is Google's public cloud: on-demand compute, storage, networking, databases, data analytics, and AI/ML services delivered from Google-operated regions, consumed over Google's private global network, and billed by usage. Its distinctive strengths for enterprises are its global network (a software-defined, private backbone that makes a VPC a global object and enables global load balancing from a single anycast IP), its data and analytics stack (BigQuery, Dataflow, Pub/Sub, Dataplex), and its AI/ML platform (Vertex AI, Gemini). If you come from traditional infrastructure, the biggest early surprises are that the network is global, the project is the main boundary, and much of the platform is API-first and serverless.

Google Cloud global infrastructure

Region > zones on Google's global network; resources are global, multi-region, regional, or zonal

Resource scope: global, multi-region, regional, zonal

This is the single most important idea for HA design on GCP - a resource's scope determines what failure it survives and where it can be reached.

Scope	Spans	Examples	Fails if
Global	All regions	VPC network, firewall rules, routes, global external Application LB, images, IAM, HTTP(S) load balancer IP	Not tied to any region; affected only by a global control-plane issue
Multi-region	A set of regions (e.g. "US", "EU")	Cloud Storage multi-region buckets, BigQuery multi-region datasets	Only if multiple regions in the set are lost; survives a single-region loss
Regional	All zones in one region	Regional MIG, regional persistent disk, regional GKE control plane, subnets, Cloud SQL (HA)	The whole region is lost
Zonal	One zone	A single VM, zonal persistent disk, zonal GKE cluster	That one zone is lost

Architect note - design by scope

Availability is a function of scope. A single (zonal) VM has no zone-failure protection. To survive a zone loss, use a regional managed instance group across zones and regional disks. To survive a region loss, deploy to a second region with global load balancing and cross-region data replication. Draw the scope of every resource before you draw the diagram.
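The "draw the scope of every resource" step can be sketched as a small lookup. This is a minimal illustration, not a Google API: the `SURVIVES` table is a simplification of the scope table above, and the resource names in `design` are hypothetical.

```python
# Simplified model (assumption): which failure domains each scope survives.
SURVIVES = {
    "zonal": set(),
    "regional": {"zone"},
    "multi-region": {"zone", "region"},
    "global": {"zone", "region"},
}

def weakest_links(resources, failure):
    """Return the resources that do NOT survive the given failure ('zone' or 'region')."""
    return [name for name, scope in resources.items() if failure not in SURVIVES[scope]]

# Hypothetical design: resource name -> scope.
design = {
    "web-mig": "regional",
    "db-disk": "zonal",
    "lb-frontend": "global",
    "assets-bucket": "multi-region",
}

print(weakest_links(design, "zone"))    # ['db-disk']
print(weakest_links(design, "region"))  # ['web-mig', 'db-disk']
```

Anything the function returns for a failure you claim to survive is a gap in the design.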

Common mistake

Treating a subnet as if it were global because "the VPC is global." The VPC is global; subnets are regional. A resource in a subnet lives in that region. You do not need one VPC per region (you need one global VPC with a subnet per region), but you do place workloads region by region.

The resource hierarchy

Organization > Folders > Projects > resources. IAM and Org Policies inherit downward.

Organization - the root node, tied to a Cloud Identity / Google Workspace domain. The top of IAM and policy inheritance.
Folders - grouping nodes (by department, environment, or BU) for delegated administration and policy inheritance. Can nest.
Projects - the fundamental unit. Every resource belongs to exactly one project. A project is the boundary for billing, quotas, API enablement, IAM, and isolation. It has a mutable name, an immutable project ID (globally unique), and a project number.
Resources - VMs, buckets, datasets, etc., inside a project.
Billing account - a separate object linked to projects; it is where charges accrue and can span many projects. Org and billing are managed independently.
Resource Manager - the API/service that manages this hierarchy; Cloud Asset Inventory gives you a searchable, historical inventory of all resources and IAM across the org.

The Google Cloud mental model

Concept	Is the boundary for	Think of it as
Organization	Everything; identity domain	The enterprise root
Folder	Delegated admin & policy grouping	A governance grouping (dept / env / BU)
Project	Billing, quota, API, IAM, isolation	The main deployment and billing boundary
IAM allow/deny policy	Who can do what	The access-control boundary
Organization Policy	What is allowed at all	The governance / restriction boundary
VPC network	Private networking	A global network object (subnets are regional)
Region / zone	Physical placement	The workload placement boundary (drives HA)

Two different questions

IAM answers "can this principal do this action on this resource?" Organization Policy answers "is this action allowed to exist here at all?" (e.g. "no external IPs in this folder," "only these regions"). They are separate engines - you often need both: IAM to grant, Org Policy to constrain.
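The two questions can be modeled as two independent checks that must both pass. The sketch below is a toy model with hypothetical function names and inputs; it only mirrors the "no external IPs" and "only these regions" examples and is not how the real services are implemented.

```python
def iam_allows(granted_permissions, permission):
    """IAM question: does the principal hold this permission?"""
    return permission in granted_permissions

def org_policy_allows(request, allowed_regions, external_ips_allowed):
    """Org Policy question: may this configuration exist here at all?"""
    if request["region"] not in allowed_regions:
        return False
    if request["external_ip"] and not external_ips_allowed:
        return False
    return True

def can_create_vm(granted_permissions, request, allowed_regions, external_ips_allowed):
    # Both engines must say yes; neither can stand in for the other.
    return (iam_allows(granted_permissions, "compute.instances.create")
            and org_policy_allows(request, allowed_regions, external_ips_allowed))

perms = {"compute.instances.create"}
regions = {"us-central1", "europe-west1"}

# IAM grants the permission, but the external IP violates the guardrail.
print(can_create_vm(perms, {"region": "us-central1", "external_ip": True}, regions, False))   # False
# Same caller, compliant configuration.
print(can_create_vm(perms, {"region": "us-central1", "external_ip": False}, regions, False))  # True
```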

Organization Policies

Organization Policy Service lets you set constraints that apply to a node (org, folder, or project) and inherit downward - guardrails that IAM cannot express. Common ones:

constraints/compute.vmExternalIpAccess - block external IPs on VMs (a major reduction in internet exposure).
constraints/gcp.resourceLocations - restrict which regions resources can be created in (data residency).
constraints/iam.disableServiceAccountKeyCreation - stop long-lived SA keys.
constraints/iam.allowedPolicyMemberDomains - domain restricted sharing: only allow IAM grants to your org's identities.
constraints/compute.requireOsLogin - enforce OS Login instead of metadata SSH keys.

Architect note

Set a baseline of preventive Org Policies at the organization node in your landing zone: disable SA key creation, enforce domain-restricted sharing, restrict resource locations, and block external IPs by default (exempt specific folders that genuinely need them). These prevent whole classes of mistakes rather than detecting them after the fact.

Labels vs tags

	Labels	Tags
What	Key/value metadata on resources	Key/value objects defined at org/project, bound to resources
Main use	Cost attribution, billing export grouping, organization	Conditional IAM and Org Policy / firewall targeting (governance)
Governed?	Free-form (anyone with edit can set)	Yes - tag keys/values are IAM-controlled resources

Cost note

Define a small, enforced set of labels (cost-center, environment, owner, app) from day one and turn on billing export to BigQuery. Chargeback and cost analysis are only as good as your labeling, and retrofitting labels across thousands of resources is slow and never complete. Use tags (not labels) when you need the value to drive IAM conditions or Org Policy.
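A label standard is easier to enforce when it can be checked in a pipeline. The following is a minimal sketch of such a check, assuming the four label keys named above; the key pattern is a simplified assumption about label syntax, so verify the exact rules in current documentation.

```python
import re

REQUIRED_LABELS = {"cost-center", "environment", "owner", "app"}

# Simplified pattern (assumption): a lowercase letter first, then lowercase
# letters, digits, underscores, or dashes, 63 characters at most.
LABEL_KEY = re.compile(r"[a-z][a-z0-9_-]{0,62}")

def label_problems(labels):
    """Return a list of human-readable problems for one resource's labels."""
    problems = [f"missing label: {k}" for k in sorted(REQUIRED_LABELS - set(labels))]
    problems += [f"invalid key: {k}" for k in labels if not LABEL_KEY.fullmatch(k)]
    problems += [f"empty value: {k}" for k, v in labels.items() if not v]
    return problems

# Hypothetical resource labels.
print(label_problems({"environment": "prod", "Owner": "dba-team", "app": ""}))
# ['missing label: cost-center', 'missing label: owner', 'invalid key: Owner', 'empty value: app']
```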

Resource names, project IDs, project numbers

Project ID - globally unique, human-chosen, immutable (e.g. acme-app-prod-01). Used in most CLI/API calls. Choose a naming convention up front.
Project number - an auto-assigned numeric ID; some APIs and service-agent identities use it.
Project name - a mutable display name.
Full resource names are hierarchical, e.g. //compute.googleapis.com/projects/<id>/zones/<zone>/instances/<name>.
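A naming convention is easier to keep when project IDs are validated before creation. The sketch below assumes the commonly documented project ID rules (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen); the helper names are hypothetical.

```python
import re

# Assumed rule set: 6-30 chars, starts with a letter, no trailing hyphen.
PROJECT_ID = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")

def is_valid_project_id(project_id):
    return PROJECT_ID.fullmatch(project_id) is not None

def instance_full_name(project_id, zone, name):
    """Build the hierarchical full resource name of a Compute Engine instance."""
    return f"//compute.googleapis.com/projects/{project_id}/zones/{zone}/instances/{name}"

print(is_valid_project_id("acme-app-prod-01"))  # True
print(is_valid_project_id("Acme-App"))          # False (uppercase)
print(instance_full_name("acme-app-prod-01", "us-central1-a", "web-1"))
# //compute.googleapis.com/projects/acme-app-prod-01/zones/us-central1-a/instances/web-1
```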

Ways to work with Google Cloud

Cloud Console

The web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.

gcloud CLI (Cloud SDK)

The primary command line. Config profiles, project/account selection, service-account impersonation. See section 17.

Cloud Shell

Browser terminal, pre-authenticated as your Console identity, with gcloud/Terraform/kubectl installed. The VM is ephemeral, but by default the home directory persists on a small (5 GB) disk.

Client libraries & REST APIs

Idiomatic libraries (Python, Go, Java, Node, ...) and REST/gRPC APIs for building applications and tooling.

Terraform (recommended IaC)

The Google provider is the standard way to build infrastructure declaratively. Deployment Manager is legacy and deprecated (verify its end-of-support status in current docs); new work uses Terraform (or Infrastructure Manager, managed Terraform).

Cloud Asset Inventory

Search, export, and monitor all resources and IAM across the org, with history and feeds. Essential for governance and audits.

Designing the hierarchy & a landing zone

How to structure organization, folders, and projects (Design)

Common pattern: Org > Folders by environment or business unit > Projects per app-environment. A frequent shape is a top-level split of Shared/Platform, Security, and per-BU folders, each with prod/nonprod sub-folders.
One app + one environment per project is the norm - it gives clean IAM, quota, billing, and blast-radius isolation. Resist the urge to pile many apps into one project.
Shared projects hold cross-cutting infrastructure: the host project for Shared VPC (networking), centralized logging, centralized security tooling (SCC), and a monitoring project.
Never mix with production: sandbox/experimentation, personal projects, and unreviewed workloads must live in separate folders with their own guardrails and budgets - never in the prod project.

Separating dev / test / stage / prod / shared / security / networking / logging (Design)

Separate projects per environment for independent IAM, quotas, budgets, and cost reporting.
A dedicated networking (Shared VPC host) project, a logging project (aggregated log sink target), a security project (SCC, org-level tooling), and a monitoring project.
Use folders to apply environment-wide IAM and Org Policies once and inherit them.
Keep prod under stricter Org Policies (no external IPs, restricted locations, no SA keys) than nonprod.

What a Google Cloud landing zone includes (Design)

A landing zone is a codified, repeatable baseline (Terraform) deployed before workloads:

Resource hierarchy (org, folders, projects) and naming standards.
Identity: Cloud Identity/Workspace, groups, break-glass, federation.
Baseline IAM (groups, not users) and preventive Org Policies.
Networking: Shared VPC host project, subnets, hierarchical firewall, Cloud NAT, DNS, hybrid connectivity.
Security: SCC, VPC Service Controls perimeters, KMS, org audit-log sink to a logging project (and optionally BigQuery/SIEM).
Guardrails: budgets and alerts, quotas, labels/tags, billing export to BigQuery.
All as code, reviewed and version-controlled. Google's Cloud Foundation / landing-zone blueprints are a starting point.

Common mistakes in hierarchy design

Running everything under one project (or worse, one shared "default" project) so least privilege and cost attribution become impossible.
Skipping the landing zone and retrofitting Org Policies, Shared VPC, and log centralization after workloads exist.
Granting IAM at the org level for convenience, so every project inherits broad access.
No naming standard for projects/folders, breaking automation and reporting.
Mixing sandbox and production in the same folder with the same guardrails.

Official documentation: Google Cloud setup & resource hierarchy →

2. Identity and Access Management

Who can do what, on which resource, in which project - and the guardrails (deny policies, Org Policies, VPC Service Controls) around it. IAM is where most Google Cloud access issues and security incidents originate, so this section goes deep on principals, roles, inheritance, and troubleshooting.

Last reviewed: July 2026. Verify role names and IAM Conditions/deny features in current docs.

Model | Principals | Roles | Allow & deny | Inheritance & scope | Service accounts | Federation | IAP, Access Context, VPC-SC | Scenarios | Common mistakes | Access troubleshooting model

TL;DR

IAM binds a principal (user, group, service account, or federated identity) to a role (a bundle of permissions) on a resource, in an allow policy. Policies inherit down the hierarchy (org → folder → project → resource), and grants are additive. Deny policies and Org Policies can block regardless of allows. Use groups (not users) and predefined/custom roles (not basic Owner/Editor/Viewer), prefer impersonation over service-account keys, and wrap sensitive data in VPC Service Controls.

The IAM model

An IAM allow policy is attached to a resource (or a hierarchy node) and contains bindings: each binding maps a role to one or more members (principals), optionally with a condition. When a principal calls an API, Google evaluates the effective policy (the union of allow policies inherited from the resource up through project, folder, and org), checks for any applicable deny policy, and checks Org Policy and VPC Service Controls. Access requires an allow, no matching deny, and no blocking Org Policy/perimeter.
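The evaluation described above can be sketched as a toy model. It is deliberately simplified: bindings map principals straight to permissions rather than to roles, conditions are ignored, and the Org Policy and VPC Service Controls checks are left out. All node, group, and variable names are hypothetical.

```python
# Toy hierarchy: child -> parent.
PARENT = {"bucket-a": "proj-app", "proj-app": "folder-prod", "folder-prod": "org"}

# node -> list of (principal, permission) pairs.
ALLOW = {
    "org": [("group:auditors", "storage.objects.list")],
    "proj-app": [("group:app-team", "storage.objects.delete")],
}
DENY = {"folder-prod": [("group:app-team", "storage.objects.delete")]}

def ancestry(node):
    """Yield the resource, then each ancestor up to the organization."""
    while node is not None:
        yield node
        node = PARENT.get(node)

def is_allowed(principal, permission, resource):
    nodes = list(ancestry(resource))
    # A matching deny anywhere in the ancestry wins over any allow.
    if any((principal, permission) in DENY.get(n, []) for n in nodes):
        return False
    # Otherwise the effective allow policy is the union over the ancestry.
    return any((principal, permission) in ALLOW.get(n, []) for n in nodes)

print(is_allowed("group:auditors", "storage.objects.list", "bucket-a"))    # True (inherited from org)
print(is_allowed("group:app-team", "storage.objects.delete", "bucket-a"))  # False (deny at folder)
print(is_allowed("group:app-team", "storage.objects.list", "bucket-a"))    # False (no allow)
```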

Principals (members)

Principal	What it is	Use for
Google account (user)	A human identity in Cloud Identity, Workspace, or a consumer Google account	People - but grant via groups, not directly
Google group	A collection of users/service accounts	All human access management (add/remove members, not IAM bindings)
Service account (SA)	A non-human identity for workloads (an app, VM, function)	Machine-to-service auth; the workload's identity
Workload identity (federated)	An external workload identity (GKE, other clouds, CI) mapped to a GCP identity	Letting workloads authenticate without SA keys
Workforce identity (federated)	External human users from your IdP (Okta, Entra, etc.)	Console/gcloud access for a federated workforce
allAuthenticatedUsers / allUsers	Any Google identity / anyone on the internet	Almost never - public exposure; block with Org Policy

Common mistake - confusing service accounts with users

A service account is a workload identity, not a person. Do not share SA credentials among people, do not use a personal account as an app identity, and do not grant humans broad rights "because the app has them." Humans authenticate as users (via groups); workloads authenticate as service accounts (ideally via impersonation or workload identity, not keys).

Roles: basic, predefined, custom

Role type	What	Guidance
Basic (primitive)	Owner, Editor, Viewer - broad, legacy roles spanning almost all services	Avoid in production. Owner/Editor are enormous grants. Use only in throwaway sandboxes.
Predefined	Curated, service-specific roles (e.g. `roles/compute.instanceAdmin.v1`, `roles/bigquery.dataViewer`)	The default. Pick the narrowest predefined role that fits the job.
Custom	You compose a role from specific permissions	When no predefined role fits without over-granting. Maintain them (permissions change).

Common mistake - basic roles

Granting roles/editor or roles/owner "to move fast" gives near-total control over the project (create/delete almost anything, and Owner can change IAM). It defeats least privilege and makes audits meaningless. Use predefined roles scoped to the job; reserve Owner for a tiny, monitored break-glass group.

Allow policies and deny policies; conditional IAM

Allow policy - grants roles to principals. Additive across the hierarchy; there is no "subtract."
Deny policy - explicitly denies specified permissions to specified principals, and is evaluated before allows - a matching deny wins. Use it to carve exceptions ("nobody except break-glass may delete buckets in this folder").
Conditional IAM - attach a condition (CEL expression) to a binding: by resource name/type, by request time, by tag. E.g. grant access only to resources tagged env=dev, or only during business hours.

# Conditional binding: grant only on buckets whose name starts with "app-"
gcloud projects add-iam-policy-binding my-proj \
  --member="group:app-team@example.com" \
  --role="roles/storage.objectAdmin" \
  --condition='expression=resource.name.startsWith("projects/_/buckets/app-"),title=app-buckets-only'

Inheritance and scope

Grant at the lowest node that works. A role granted at the org applies to every folder, project, and resource beneath it; a role granted on a single bucket applies only there.

Scope	Grant here when	Risk
Organization	Truly org-wide roles (org admins, security auditors)	Highest - inherited everywhere
Folder	A whole environment/BU needs the same access	Medium
Project	Most workload access	Contained to one project
Resource (bucket, dataset, SA)	Fine-grained, single-resource access	Lowest
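One way to make the risk column concrete is to count how many projects a grant reaches from each node. The sketch below walks a hypothetical hierarchy in which any node without children is treated as a project; it is an illustration only, and Policy Analyzer is the real tool for this question.

```python
# Hypothetical hierarchy: node -> children. Nodes without an entry are projects.
CHILDREN = {
    "org": ["folder-prod", "folder-nonprod"],
    "folder-prod": ["proj-app-prod", "proj-sec"],
    "folder-nonprod": ["proj-app-dev"],
}

def projects_reached(node):
    """Return every project that inherits a role granted at this node."""
    kids = CHILDREN.get(node)
    if kids is None:
        return [node]
    return [project for kid in kids for project in projects_reached(kid)]

print(projects_reached("org"))             # ['proj-app-prod', 'proj-sec', 'proj-app-dev']
print(projects_reached("folder-nonprod"))  # ['proj-app-dev']
print(projects_reached("proj-app-dev"))    # ['proj-app-dev']
```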

Common mistake - org-level grants

Granting a role at the organization "so it works everywhere" is the IAM equivalent of a firewall any-any rule. It inherits into every project including prod and security. Grant at the project (or resource) level; reserve org/folder grants for genuinely cross-cutting roles and review them regularly with Policy Analyzer.

Service accounts: keys vs impersonation vs workload identity

Mechanism	How the workload authenticates	Risk
Attached SA (metadata)	A VM/GKE/Run resource runs as an SA; credentials come from the metadata server	Low - no keys on disk
Workload Identity Federation	External workload (other cloud, CI, on-prem) exchanges its native token for short-lived GCP creds	Low - no keys
SA impersonation	A principal with `roles/iam.serviceAccountTokenCreator` mints short-lived tokens for an SA	Low - short-lived, auditable
SA key (JSON)	A long-lived downloaded private key	High - leaks, gets committed to git, outlives its owner

Security note - kill long-lived keys

Prefer, in order: attached service accounts (workloads in GCP), Workload Identity Federation (workloads outside GCP), and impersonation (humans/CI acting as an SA). Reserve SA keys for the rare case with no alternative, and block their creation org-wide with constraints/iam.disableServiceAccountKeyCreation. A leaked JSON key is a standing, long-lived credential, and leaked keys are among the most common causes of serious GCP incidents.

Cloud Identity, Workspace, and federation

Cloud Identity / Google Workspace - where your human identities and groups live; the org is tied to a domain. Manage joiners/movers/leavers here.
Workforce Identity Federation - let human users from an external IdP (Okta, Entra ID, Ping) access Google Cloud without provisioning Google accounts.
Workload Identity Federation - let external workloads (GitHub Actions, AWS, on-prem) get short-lived GCP credentials by trust, no keys.

Architect note - groups + federation

Federate human identities to your corporate IdP and drive all access through Google Groups mapped from IdP groups. IAM bindings target groups, never individuals. This makes access reviews, joiners/leavers, and audits tractable; per-user bindings become unmanageable at scale.

IAP, Access Context Manager, VPC Service Controls

Identity-Aware Proxy (IAP)

Context-aware access to apps and VMs (SSH/RDP/TCP) without a VPN or public IPs - authenticate the user and check policy at the proxy. The modern replacement for bastion + public IP.

Access Context Manager

Defines access levels (by IP range, device posture, identity) used by IAP and VPC-SC to make access conditional on context.

VPC Service Controls (VPC-SC)

Draws a service perimeter around projects so data in managed services (Cloud Storage, BigQuery, etc.) cannot be exfiltrated to projects outside the perimeter - even by a valid identity with a valid key. The key control for sensitive-data isolation.

IAM Recommender / Policy Analyzer / Troubleshooter

Recommender flags over-granted roles; Policy Analyzer answers "who can access what"; Troubleshooter explains why a specific request was allowed or denied.

Security note - VPC-SC for sensitive data

IAM controls who can call an API; it does not stop a compromised-but-authorized identity from copying data to an attacker's project. VPC Service Controls adds an egress boundary around your data services so BigQuery/Cloud Storage data cannot leave the perimeter. For regulated or high-value data, VPC-SC is not optional.

Real IAM scenarios

Read-only security auditor across the org - Low risk

Who: the security team. Scope: organization (auditors legitimately need breadth). Role: roles/iam.securityReviewer + roles/logging.viewer (and SCC roles), granted to a group. Risk: low - read-only. Safer alternative: scope to specific folders if their remit is narrower. Common misuse: giving auditors Viewer at org (broader than needed and includes data read on many services).

App team deploys to their own project only - Medium risk

Who: the app team group. Scope: their project (not folder/org). Role: specific predefined roles (e.g. roles/run.developer, roles/artifactregistry.writer, roles/logging.viewer), not Editor. Risk: medium - contained to one project. Safer alternative: deploy via a pipeline SA and give humans only view + trigger. Common misuse: roles/editor on the project "to unblock them."

Workload reads a bucket, no keys - Low risk

Who: a VM/Cloud Run service. Scope: a single bucket. Role: roles/storage.objectViewer granted to the workload's attached service account on that bucket. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a downloaded SA key baked into the image + project-level Storage Admin.

CI/CD pipeline outside GCP deploys into GCP - Medium risk

Who: GitHub Actions / external CI. Scope: the target project. Role: a deployer SA the pipeline impersonates via Workload Identity Federation - no key. Risk: medium, but no standing credential. Common misuse: storing a long-lived SA JSON key as a CI secret.

Common Google Cloud IAM mistakes

Basic roles (Owner/Editor/Viewer) too broadly - use predefined roles scoped to the task.
Granting at org level unnecessarily - inherits everywhere; grant at project/resource.
Long-lived service account keys - a leading cause of serious incidents; use impersonation / workload identity and disable key creation.
Not using service-account impersonation - humans and CI should mint short-lived tokens, not hold keys.
Confusing service accounts with human users - different lifecycles, different controls.
Not using groups - per-user bindings are unmanageable and invisible in reviews.
Not understanding inheritance - a broad grant high in the hierarchy silently reaches prod.
Not using conditional IAM where a resource/time/tag condition would tighten a grant.
Ignoring VPC Service Controls for sensitive data - IAM alone does not stop exfiltration.
Not reviewing audit logs - Admin Activity and Data Access logs are your evidence and detection.

Google Cloud access troubleshooting mental model

When a request is denied (or unexpectedly allowed), walk the layers in order. Most "permission denied" tickets are one of these:

"Permission denied" - the checklist

Which org / folder / project is the resource in? (Wrong project selected is the #1 cause.)
Which principal is making the request - user, group, service account, or federated identity? (For workloads, which SA is actually attached?)
What role is assigned, and does it contain the required permission?
At what scope is it granted (resource / project / folder / org)? Does inheritance reach this resource?
Is there an IAM deny policy matching this principal + permission?
Is an Organization Policy blocking the action (e.g. location, external IP, SA key)?
Is VPC Service Controls blocking it (cross-perimeter access to a managed service)?
Is the API enabled in the project?
For workloads acting on other resources: is the service account permitted to act on that resource, and does the caller have iam.serviceAccounts.actAs (to attach or run as the SA) or roles/iam.serviceAccountTokenCreator (to impersonate it) on the SA?
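The checklist is an ordered walk that stops at the first failing layer. A minimal sketch of that idea follows; it collapses the nine questions into six checks, and every field name in the `request` dictionary is a hypothetical stand-in for facts you would gather from Policy Troubleshooter and audit logs.

```python
# Ordered checks: (failure label, predicate that returns True when the layer passes).
CHECKS = [
    ("wrong project", lambda r: r["resource_project"] == r["selected_project"]),
    ("missing permission", lambda r: r["permission"] in r["granted_permissions"]),
    ("deny policy", lambda r: not r["deny_match"]),
    ("org policy", lambda r: not r["org_policy_block"]),
    ("VPC-SC perimeter", lambda r: not r["vpc_sc_block"]),
    ("API not enabled", lambda r: r["api_enabled"]),
]

def first_failure(request):
    """Return the first layer that blocks the request, or None if all pass."""
    for label, passes in CHECKS:
        if not passes(request):
            return label
    return None

# Hypothetical facts about one denied request.
request = {
    "resource_project": "acme-app-prod-01",
    "selected_project": "acme-app-prod-01",
    "permission": "storage.objects.get",
    "granted_permissions": {"storage.objects.get", "storage.objects.list"},
    "deny_match": False,
    "org_policy_block": False,
    "vpc_sc_block": True,
    "api_enabled": True,
}
print(first_failure(request))  # VPC-SC perimeter
```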

Tools

Policy Troubleshooter (explains allow/deny for a specific principal+resource+permission), Policy Analyzer (who-can-do-what), IAM Recommender (over-grants), and Cloud Audit Logs (the denied request with the reason).

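# Show the allow policy set directly on the project (inherited bindings are not included)
gcloud projects get-iam-policy PROJECT_ID
# Policy Analyzer: what can this principal access across the org?
gcloud asset analyze-iam-policy --organization=ORG_ID \
  --identity="user:jane@example.com"
# Is the API enabled?
gcloud services list --enabled --project PROJECT_ID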

Official documentation: Cloud IAM overview & best practices →

3. Networking Deep Dive

The global VPC, regional subnets, firewall model, private access to Google APIs, Shared VPC, and hybrid connectivity - plus the traffic-flow reasoning you need to design and debug real Google Cloud networks.

Last reviewed: July 2026. Verify quotas, firewall behavior, and connectivity limits in current docs.

VPC & CIDR | Firewall | Routes & Cloud NAT | Private access | Shared VPC & peering | Hybrid | DNS | Traffic flow | Diagrams | Network Intelligence | Troubleshooting | Gotchas

TL;DR

A VPC is a global object; its subnets are regional with a CIDR you choose (use custom mode, never auto mode in production). Firewall rules are stateful, have a priority and direction, and there is an implied deny-ingress / allow-egress at the bottom. Private workloads reach Google APIs via Private Google Access or Private Service Connect, and reach the internet outbound-only via Cloud NAT. Shared VPC centralizes networking in a host project; VPC peering is not transitive. Plan non-overlapping CIDRs before anything else.

VPC networks and CIDR planning

A VPC network is a global, software-defined private network. Unlike most clouds, one VPC spans every region; you add a subnet per region as you expand, and resources in different regions on the same VPC route to each other over Google's backbone with no peering.

Custom mode vs auto mode: auto mode auto-creates a subnet in every region from a fixed 10.128.0.0/9 range - convenient, but it hands you unplanned CIDRs that are likely to overlap with other networks. Always use custom mode in production and assign subnets deliberately.
Subnets are regional; a subnet has a primary CIDR and optional secondary ranges (used for GKE pods/services alias IPs).
CIDRs must not overlap with each other, with peered/shared VPCs, or with on-premises. Overlap is the number-one cause of hybrid that "connects but won't route."
You can expand a subnet's primary range but not shrink it; plan generously so you rarely need to.
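The no-overlap rule is easy to check mechanically before anything is connected. The sketch below uses Python's standard `ipaddress` module; the range names and all CIDRs except the auto-mode 10.128.0.0/9 block are hypothetical.

```python
import ipaddress
from itertools import combinations

def overlapping_pairs(ranges):
    """Return every pair of named CIDR ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]

# Hypothetical inventory: on-prem, an auto-mode VPC, and two planned ranges.
ranges = {
    "onprem-dc1": "10.128.0.0/16",
    "gcp-auto-mode": "10.128.0.0/9",
    "gcp-prod-us": "10.10.0.0/20",
    "gke-pods-us": "10.10.64.0/18",
}
print(overlapping_pairs(ranges))  # [('onprem-dc1', 'gcp-auto-mode')]
```

Here the auto-mode range collides with the on-premises data center, which is the "connects but won't route" case described above.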

Architect note - a workable IP plan

Reserve a large private supernet for GCP, allocate a block per environment, and a subnet per region per tier, leaving room for GKE secondary ranges (pods need a lot of IPs). Keep a documented IPAM. Because the VPC is global, you do not make one VPC per region - you make one (or a few) VPCs with a regional subnet where you deploy. Overlap is a re-IP project later; a too-large plan costs nothing.
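The allocation scheme in this note can be sketched with the standard `ipaddress` module. The supernet, the prefix lengths, and the environment and region lists below are all illustrative assumptions; size them against your own IPAM.

```python
import ipaddress

# Hypothetical supernet reserved for Google Cloud.
supernet = ipaddress.ip_network("10.64.0.0/12")

# One /14 block per environment.
envs = ["prod", "nonprod", "shared"]
env_blocks = dict(zip(envs, supernet.subnets(new_prefix=14)))

def plan_env(block, regions):
    """Give each region a /16, then carve a /20 primary range and a /17 GKE pod range."""
    plan = {}
    for region, region_block in zip(regions, block.subnets(new_prefix=16)):
        lower, upper = region_block.subnets(new_prefix=17)
        plan[region] = {
            "primary": next(lower.subnets(new_prefix=20)),
            "gke_pods": upper,
        }
    return plan

prod = plan_env(env_blocks["prod"], ["us-central1", "europe-west1"])
print(env_blocks["nonprod"])             # 10.68.0.0/14
print(prod["us-central1"]["primary"])    # 10.64.0.0/20
print(prod["europe-west1"]["gke_pods"])  # 10.65.128.0/17
```

Most of each regional /16 stays unallocated on purpose, which leaves room for more tiers and secondary ranges later.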

Common mistake

Using auto-mode VPCs (or the default network) in production. They pre-create subnets in every region with fixed ranges you did not plan, which collide with on-prem and other VPCs the moment you connect them. Delete/avoid the default network; build a custom-mode VPC against your enterprise IP plan.

Firewall rules and hierarchical firewall policies

GCP firewalls are stateful and evaluated per VM by priority (lower number wins). Every network has two implied rules at priority 65535: deny all ingress and allow all egress. You open what you need above that.

Concept	Detail
Direction	Ingress (to targets) or egress (from targets). Rules are directional - a common source of confusion.
Priority	0-65535, lower = higher priority; first match wins. Implied deny-ingress/allow-egress sit at 65535.
Targets	All instances, by network tag, or by service account (prefer SA targeting - it can't be self-assigned like tags).
Source/dest	CIDR ranges, source tags/SAs, or (for some) source service.
Hierarchical firewall policies	Rules set at org/folder that apply to all VPCs beneath - central guardrails (e.g. allow IAP range, deny risky ports) evaluated before VPC-level rules.
Firewall Rules Logging	Log matched connections per rule - use it to see what is actually being allowed/denied.
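Priority and the implied rules can be illustrated with a small first-match evaluator. This is a toy model, not the real engine: it ignores targets, protocols, statefulness, hierarchical policies, and tie-breaking between rules of equal priority. The rule dictionaries use a hypothetical shape.

```python
import ipaddress

# The two implied rules every VPC network has at the lowest priority.
IMPLIED = [
    {"name": "implied-deny-ingress", "direction": "INGRESS", "priority": 65535,
     "action": "deny", "cidr": "0.0.0.0/0", "port": None},
    {"name": "implied-allow-egress", "direction": "EGRESS", "priority": 65535,
     "action": "allow", "cidr": "0.0.0.0/0", "port": None},
]

def evaluate(rules, direction, remote_ip, port):
    """Return (action, rule name) for the matching rule with the lowest priority number."""
    ip = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules + IMPLIED, key=lambda r: r["priority"]):
        if (rule["direction"] == direction
                and ip in ipaddress.ip_network(rule["cidr"])
                and rule["port"] in (None, port)):  # None means any port
            return rule["action"], rule["name"]

# Hypothetical rule: allow SSH only from the IAP range.
rules = [
    {"name": "allow-iap-ssh", "direction": "INGRESS", "priority": 1000,
     "action": "allow", "cidr": "35.235.240.0/20", "port": 22},
]
print(evaluate(rules, "INGRESS", "35.235.240.10", 22))  # ('allow', 'allow-iap-ssh')
print(evaluate(rules, "INGRESS", "203.0.113.5", 22))    # ('deny', 'implied-deny-ingress')
print(evaluate(rules, "EGRESS", "8.8.8.8", 443))        # ('allow', 'implied-allow-egress')
```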

Security note - target by service account, not tag

Network tags are just strings any instance-admin can add to a VM, so a "db-allowed" tag rule can be self-granted. Prefer targeting firewall rules by service account: an attacker can't attach an SA to a VM without actAs permission. Use hierarchical policies for org-wide guardrails (allow the IAP range 35.235.240.0/20 for SSH, deny egress to known-bad ranges) so individual teams can't undo them.

Common mistake

Forgetting the implied deny-ingress and expecting traffic to flow, or forgetting that egress is implicitly allowed and leaving VMs able to reach anywhere outbound. Also: writing a broad 0.0.0.0/0 allow-ingress on port 22 instead of allowing only the IAP range and using IAP for SSH.

Routes, Cloud Router, and Cloud NAT

Routes - system-generated (subnet routes, default internet route) plus custom static routes; dynamic routes are learned via Cloud Router (BGP) from VPN/Interconnect.
Cloud Router - the regional BGP speaker for hybrid connectivity; advertises your subnets to on-prem and learns on-prem routes. The VPC's dynamic routing mode (regional or global) decides how far those learned routes propagate.
Cloud NAT - managed, outbound-only NAT so instances with no external IP can reach the internet (patches, external APIs). It is regional and requires a Cloud Router. It does not allow inbound.
External vs internal IPs - internal (RFC1918) always; external only when you truly need internet-facing exposure. Block external IPs by Org Policy where possible.

Common mistake

Expecting Cloud NAT to make a service reachable from the internet - it only handles outbound. Inbound comes from a load balancer or an external IP. Also: assuming a private VM can reach the internet with no NAT and no external IP - it can't; add Cloud NAT (or use Private Google Access for Google APIs).

Private Google Access, Private Service Connect, private services access

Three different "private" mechanisms that are constantly confused:

Mechanism	What it does	Use for
Private Google Access (PGA)	Lets VMs with only internal IPs reach Google APIs/services (storage.googleapis.com, etc.) without an external IP	Private VMs calling Google APIs (Cloud Storage, BigQuery, Artifact Registry)
Private Service Connect (PSC)	A private endpoint (internal IP) in your VPC that maps to a Google API bundle or a published service	Private, controlled access to Google APIs or to a service in another VPC/producer
Private services access (PSA)	A VPC peering to a Google-managed producer VPC for services like Cloud SQL private IP, Memorystore	Giving managed services a private IP reachable from your VPC
Serverless VPC Access	A connector that lets serverless (Cloud Run/Functions/App Engine) reach VPC internal IPs	Serverless calling private resources (a private Cloud SQL, an internal service)

Architect note - PGA vs PSC

Enable Private Google Access on a subnet so private VMs can reach Google APIs over internal IPs (pair with a route/DNS to private.googleapis.com). Use Private Service Connect when you want a specific internal endpoint IP (for tighter control, VPC-SC alignment, or to consume a published producer service). For managed databases' private IP, you need private services access (a reserved range + peering). These are not interchangeable - pick by the resource you are reaching.

Shared VPC and VPC peering

Shared VPC - a host project owns the VPC and subnets; service projects attach and deploy resources into shared subnets. Networking is centralized (one team owns IP space, firewall, connectivity) while app teams keep their own projects for IAM/billing. The enterprise default for multi-project networking.
VPC Network Peering - connects two VPCs privately. Crucially, peering is not transitive: if A peers B and B peers C, A cannot reach C. And you cannot peer overlapping ranges.
Network Connectivity Center (NCC) - a hub-and-spoke model to interconnect many VPCs and hybrid links through a central hub, addressing peering's non-transitivity at scale.

Common mistake - assuming peering is transitive

Teams build A↔B and B↔C peerings and expect A to reach C through B. It does not work - VPC peering is non-transitive. By default a peered VPC also cannot use another VPC's VPN/Interconnect path to on-prem; that only works when custom routes are explicitly exported and imported across the peering. Use Shared VPC for centralized networking, or Network Connectivity Center for a transitive hub, rather than chains of peerings.
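
A minimal sketch of why peering chains fail: reachability is modelled as direct peerings only, with no hop through an intermediate VPC. The VPC names and helper functions are hypothetical.

```python
from itertools import combinations


def can_reach(peerings, src, dst):
    """True only if src and dst are the same VPC or directly peered."""
    if src == dst:
        return True
    return frozenset((src, dst)) in {frozenset(p) for p in peerings}


def full_mesh_peerings(vpcs):
    """Peerings needed for every VPC to reach every other one: n*(n-1)/2."""
    return list(combinations(sorted(vpcs), 2))


peerings = [("A", "B"), ("B", "C")]
print(can_reach(peerings, "A", "B"))   # directly peered
print(can_reach(peerings, "A", "C"))   # only connected through B
print(len(full_mesh_peerings(["A", "B", "C", "D", "E"])))
```

The full-mesh count grows quadratically, which is the practical reason to move to Shared VPC or a Network Connectivity Center hub as the number of VPCs rises.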

Hybrid connectivity: Cloud VPN and Interconnect

	HA VPN	Dedicated / Partner Interconnect
Path	Over the internet, IPSec-encrypted	Private physical connection (direct or via partner)
Bandwidth	Per-tunnel (Gbps-class aggregate with multiple tunnels)	10/100 Gbps (Dedicated); flexible sizes (Partner)
SLA / latency	Best-effort internet; HA VPN offers an SLA with the right topology	Consistent, low latency; higher SLA
Setup	Minutes	Days-weeks (provisioning)
Use as	Quick start / backup / lower bandwidth	Primary enterprise link, large data, low latency

Both use Cloud Router for BGP. HA VPN uses two interfaces for a 99.99% topology. Common pattern: Interconnect primary + HA VPN backup, with BGP preferring Interconnect.

Cloud DNS

Public zones for internet-facing names; private zones for internal resolution within (and across, via peering) your VPCs.
DNS peering and forwarding integrate with on-prem DNS (inbound/outbound server policies) for hybrid name resolution.
Managed private zones for *.googleapis.com (e.g. private.googleapis.com / restricted.googleapis.com) route Google-API traffic privately for PGA/VPC-SC.

How traffic flows in Google Cloud

Destination inside the same VPC (any region)? Routes locally over the backbone - only firewall rules apply.
Outside the VPC? The route table (subnet/static/dynamic routes) picks the next hop: default internet route (needs external IP or Cloud NAT), a VPN/Interconnect route (via Cloud Router), or a peering route.
Firewall (hierarchical policies, then VPC rules, by priority) must allow it - remember implied deny-ingress / allow-egress.
For Google APIs from private VMs: PGA/PSC + DNS to the private API endpoint.

Debugging is almost always: is there a route to the right next hop? does the firewall allow it (egress at the source and ingress at the destination, at the winning priority)? is external IP / Cloud NAT / PGA in place for the destination type?

Reference diagrams

Three-tier with global external Application Load Balancer

Global LB + Cloud Armor front a regional MIG (no external IPs); Cloud SQL on private IP; egress via Cloud NAT; Google APIs via PGA/PSC.

Shared VPC (centralized networking)

One host project owns the network; service projects deploy into shared subnets - central IP/firewall control, per-project IAM and billing.

Private Google Access & Cloud NAT egress

Private Google Access sends Google-API traffic privately; Cloud NAT handles outbound internet - no external IPs on the VMs.

Network Intelligence Center

Tool	What it gives you
Connectivity Tests	Static reachability analysis A→B: tells you the exact firewall rule / route / config blocking a path. First stop for "cannot reach."
Network Analyzer	Automatic detection of misconfigurations (shadowed routes, unused rules, IP exhaustion, sub-optimal config).
VPC Flow Logs	Sampled connection records for monitoring, forensics, and "is my rule dropping this?"
Firewall Rules Logging / Packet Mirroring	Per-rule connection logs; mirror traffic to an IDS/collector for deep inspection.

Start with Connectivity Tests

Before hand-checking rules, run a Connectivity Test for the source, destination, protocol, and port. It evaluates routes, firewall rules (incl. hierarchical), and config and names the first blocker - turning a long hunt into a quick answer.

Networking troubleshooting

⚑ VM cannot reach the internet / cannot download patches

Likely causes & checks

No external IP and no Cloud NAT for the subnet's region - private VMs need Cloud NAT for outbound.
Egress firewall (or a hierarchical policy) denies the destination, or a deny rule at higher priority matches.
Default internet route removed/overridden by a custom route.
OS firewall on the VM.

Fix / prevention

Add Cloud NAT (+ Cloud Router) for the region. Google-hosted package repositories are reachable via PGA; third-party OS mirrors still need Cloud NAT. Standardize NAT + PGA in the subnet module.

gcloud compute routers nats describe NAT --router=RT --region=REGION
gcloud compute firewall-rules list --filter="direction=EGRESS"

⚑ VM cannot reach Google APIs privately

Causes: Private Google Access not enabled on the subnet; DNS not resolving *.googleapis.com to the private VIP; no route to private.googleapis.com (199.36.153.8/30); firewall egress blocking 443 to that range; or VPC-SC perimeter blocking. Fix: enable PGA on the subnet, add the private-API DNS zone + route, allow egress 443 to the restricted/private VIP range. Use a Connectivity Test to confirm.
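
A minimal sketch for the DNS cause above: classify the address that a *.googleapis.com name resolves to. The private range is the one quoted in this section; the restricted range (199.36.153.4/30) is an assumption from Google's documentation and should be verified. The function name is hypothetical.

```python
import ipaddress

PRIVATE_VIP = ipaddress.ip_network("199.36.153.8/30")     # private.googleapis.com
RESTRICTED_VIP = ipaddress.ip_network("199.36.153.4/30")  # restricted.googleapis.com (assumption)


def classify_api_ip(ip):
    """Say which Google API VIP range, if any, an IPv4 address falls in."""
    addr = ipaddress.ip_address(ip)
    if addr in PRIVATE_VIP:
        return "private.googleapis.com"
    if addr in RESTRICTED_VIP:
        return "restricted.googleapis.com"
    return "public or other"


print(classify_api_ip("199.36.153.10"))
print(classify_api_ip("142.250.1.1"))
```

On the affected VM, pass in the result of resolving storage.googleapis.com (for example with socket.gethostbyname); an answer of "public or other" points to the private DNS zone being missing or not attached to the VPC.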

⚑ Application cannot connect across VPCs / Shared VPC issue

Causes: relying on transitive peering (not supported); overlapping CIDRs; missing firewall allowing the peer range; for Shared VPC, the service project's SA lacks compute.networkUser on the subnet, or resources were created in the wrong (local) network. Fix: use Shared VPC or NCC instead of peering chains; grant networkUser; ensure resources deploy into the shared subnet; open firewall for the source range.

⚑ On-premises cannot reach Google Cloud

Causes: CIDR overlap; Cloud Router not advertising the subnet, or on-prem not advertising its routes; VPN tunnel down (IKE mismatch) or Interconnect/BGP down; firewall not allowing the on-prem range. Fix: resolve overlap, verify BGP advertisements both ways, check tunnel/attachment state, open firewall. Console: Hybrid Connectivity > VPN / Interconnect; Cloud Routers.

⚑ Load balancer backend unhealthy

Causes: firewall not allowing the health-check ranges (35.191.0.0/16 and 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app not listening or bound to localhost; wrong backend-service protocol. Fix: allow the health-check ranges to the backend SA/tag on the port; align the health check; bind to 0.0.0.0. Full flow in section 7.
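
A minimal sketch that checks whether a firewall rule's source ranges cover both health-check ranges named above. It is deliberately simple: each health-check range must sit inside a single allowed range, so coverage by a union of smaller ranges is not detected. The function name is hypothetical.

```python
import ipaddress

HEALTH_CHECK_RANGES = ["35.191.0.0/16", "130.211.0.0/22"]


def uncovered_health_check_ranges(allowed_source_ranges):
    """Return the health-check ranges not contained in any allowed source range."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    missing = []
    for hc in (ipaddress.ip_network(r) for r in HEALTH_CHECK_RANGES):
        if not any(hc.subnet_of(a) for a in allowed):
            missing.append(str(hc))
    return missing


print(uncovered_health_check_ranges(["35.191.0.0/16"]))
```

An empty list means the ranges are covered; the port, target tag or SA, and rule priority still need checking separately.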

⚑ DNS / firewall / route / Cloud NAT / PSC issue

Method: run a Connectivity Test (names the blocking rule/route), then Network Analyzer for config issues and VPC Flow Logs / Firewall Rules Logging to see drops. For Cloud NAT, check port-allocation exhaustion (increase min-ports or enable dynamic port allocation). For PSC, verify the endpoint, the DNS mapping, and that the producer accepts the connection.
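
The Cloud NAT port-exhaustion check comes down to arithmetic. This sketch assumes static port allocation and 64,512 usable source ports per NAT IP (65,536 minus the first 1,024), which is how the documentation describes it at the time of writing; verify before sizing production. The VM counts and function names are hypothetical.

```python
PORTS_PER_NAT_IP = 64512  # assumption: 65536 minus the 1024 well-known ports


def vms_per_nat_ip(min_ports_per_vm):
    """How many VMs one NAT IP can serve at a given static port allocation."""
    return PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count, min_ports_per_vm):
    """NAT IPs required for vm_count VMs (ceiling division)."""
    per_ip = vms_per_nat_ip(min_ports_per_vm)
    return -(-vm_count // per_ip)


print(vms_per_nat_ip(64))         # → 1008
print(nat_ips_needed(500, 1024))  # → 8
```

Raising min-ports per VM fixes exhaustion on busy VMs but shrinks how many VMs each NAT IP can serve, so the two settings have to be sized together.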

Google Cloud networking gotchas

VPC is global, subnets are regional - don't build one VPC per region, and don't treat a subnet as global.
PGA vs PSC vs private services access are different - pick by what you're reaching (Google APIs vs published service vs managed-DB private IP).
Overlapping CIDRs break peering and hybrid - plan IP space early, avoid auto mode and the default network.
Poor Shared VPC design - decide host/service split and who owns firewall/IP before workloads land.
Databases/internal services on external IPs - use private IP + private access; block external IPs by Org Policy.
Firewall direction & priority - rules are directional and first-match-by-priority; the implied deny-ingress/allow-egress is easy to forget.
Peering is not transitive - use Shared VPC or NCC for hub-and-spoke.
Cloud NAT is outbound only - inbound needs an LB or external IP.
APIs not enabled - many "network" failures are actually a disabled API (compute, dns, servicenetworking).
Egress & inter-region charges - internet egress and cross-region traffic are metered; keep chatty services co-located and use private access.

Official documentation: VPC networking overview →

4. Compute Deep Dive

Compute Engine machine families, managed instance groups, Spot VMs, and the serverless options (Cloud Run, Functions, App Engine) - how to choose, place, scale, and operate compute on Google Cloud.

Last reviewed: July 2026 Machine families/types and pricing change - verify current shapes in the Console.

TL;DR

Compute Engine VMs come in machine families (general/compute/memory/accelerator-optimized) plus custom types. Use regional managed instance groups + autoscaling for HA and elasticity, instance templates as the blueprint, and Spot VMs for fault-tolerant batch. Prefer OS Login over metadata SSH keys, and enable Shielded VM by default. For new apps, consider Cloud Run (serverless containers) before managing VMs. Committed use discounts and right-sizing are the main cost levers.

Machine families

Family	Series (examples)	Best for
General purpose	E2 (cost), N2/N2D/N4, C3/C3D, Tau T2D/T2A (scale-out; T2A is Arm)	Web, app, microservices, most workloads - the default
Compute optimized	C2/C2D, H3	High per-core performance: gaming, HPC, CPU-bound apps
Memory optimized	M1/M2/M3	Large in-memory: SAP HANA, large databases, in-memory analytics
Accelerator optimized	A2/A3 (NVIDIA GPUs), G2; TPUs separately	AI/ML training & inference, GPU/TPU workloads
Custom machine types	N-series custom vCPU/memory	Right-sizing when predefined shapes waste vCPU or memory (great for licensing)

DBA note - memory-optimized for big databases and SAP

For SAP HANA and large database VMs, the M-series (memory-optimized) shapes provide the certified high memory-to-core ratios. Custom machine types let you tune vCPU/memory to a licensing sweet spot (fewer cores, more RAM) when a per-core-licensed engine is involved. Confirm certification (SAP, Oracle) for the exact shape and OS before committing.

Custom types, Spot VMs, sole-tenant, GPUs/TPUs

Option	What it does	Use when
Spot VMs	Deeply discounted VMs Google can preempt anytime (successor to preemptible)	Fault-tolerant, stateless, restartable batch/CI/render - never stateful prod
Sole-tenant nodes	Physical host dedicated to your project	Compliance/isolation, or per-core licensing that needs host affinity
GPUs	Attach NVIDIA GPUs to VMs (or use A3/G2)	ML training/inference, rendering, HPC
TPUs	Google's custom ML accelerators (Cloud TPU)	Large-scale training/inference on supported frameworks
Reservations	Reserve capacity of a machine type in a zone	Guaranteeing capacity for scale-out or DR failover
Committed use discounts (CUDs)	1/3-year commitment for a big discount	Steady-state baseline compute (see section 14)

Cost note

Cover steady-state compute with committed use discounts (and benefit automatically from sustained-use discounts on some families), burst on on-demand, and run fault-tolerant batch on Spot VMs (often 60-90% cheaper). Custom machine types stop you paying for vCPU or memory you don't use. These three levers usually beat any re-architecture.

Images, machine images, templates

Public images (Debian, Ubuntu, RHEL, Windows Server, etc.) and custom images (your golden image).
Machine images capture a full VM (disks + metadata) for cloning/backup; images capture a boot disk.
Instance templates define shape, image, disks, network, metadata/startup script - the blueprint for MIGs.
Startup scripts (and cloud-init on supported images) bootstrap a VM on boot; the metadata server (169.254.169.254) exposes metadata and workload credentials.
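
A minimal sketch of reading the metadata server from inside a VM, for example in a startup script. It assumes the v1 endpoint under /computeMetadata/v1/ and the required Metadata-Flavor: Google header; the helper names are hypothetical, and read_metadata only works on a Compute Engine VM.

```python
import urllib.request

METADATA_ROOT = "http://169.254.169.254/computeMetadata/v1/"


def build_metadata_request(path):
    """Build a request for a metadata path such as 'instance/zone'."""
    return urllib.request.Request(
        METADATA_ROOT + path.lstrip("/"),
        headers={"Metadata-Flavor": "Google"},  # requests without this header are rejected
    )


def read_metadata(path, timeout=2.0):
    """Fetch a metadata value as text. Call this only from a Compute Engine VM."""
    with urllib.request.urlopen(build_metadata_request(path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_metadata_request("instance/zone")
print(req.full_url)
```

A startup script could call read_metadata("instance/attributes/app-version") to pick up a custom metadata key; that key name is illustrative only.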

Architect note - golden image + startup script

Bake slow-changing config (hardening, agents, base packages) into a custom image; use a startup script for fast-changing wiring (app version, config). This keeps MIG scale-ups fast and identical. Automate image builds (e.g. with a pipeline) so images are reproducible and patched.

Managed instance groups & autoscaling

Building block	Role
Instance template	Immutable blueprint for the VMs.
Managed Instance Group (MIG)	Creates/maintains identical VMs from a template. Regional MIGs spread across zones for HA; zonal MIGs don't.
Autoscaling	Scales the MIG by CPU, LB utilization, custom metrics, or schedule.
Autohealing	Recreates VMs failing a health check.
Rolling updates / canary	Update the template and roll instances gradually (surge/max-unavailable).

Common mistake

Running a single zonal VM (or a zonal MIG) for a "production" service - a zone maintenance event or failure takes it down. Use a regional MIG across zones with autohealing behind a load balancer, and regional persistent disks for stateful cases.
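
To make utilization-based autoscaling concrete, here is a simplified proportional model: scale the group so average utilization lands at the target, then clamp to the configured bounds. This is an assumption-level sketch, not the Compute Engine autoscaler's actual algorithm, which also applies stabilization and initialization periods. Utilization is passed as integer percentages to avoid float rounding.

```python
def recommended_size(current_size, current_util_pct, target_util_pct,
                     min_size, max_size):
    """Proportional sizing: ceil(current_size * current / target), clamped."""
    if target_util_pct <= 0:
        raise ValueError("target utilization must be positive")
    # Ceiling division on integers.
    wanted = -(-current_size * current_util_pct // target_util_pct)
    return max(min_size, min(max_size, wanted))


print(recommended_size(4, 90, 60, min_size=2, max_size=10))  # → 6
print(recommended_size(8, 95, 60, min_size=2, max_size=10))  # → 10
```

The second call shows the cap at work: demand asks for 13 instances and max_size holds the group at 10, which is the situation to alert on.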

Shielded VMs, Confidential VMs, OS Login

Shielded VM - secure boot, vTPM, integrity monitoring; enable by default (some Org Policies require it).
Confidential VM - memory encrypted in use (AMD SEV / Intel TDX) for sensitive workloads.
OS Login - manage SSH access via IAM (and 2FA) instead of project/instance metadata SSH keys. Enforce it with constraints/compute.requireOsLogin.
IAP for SSH/RDP - reach VMs with no external IP through Identity-Aware Proxy, gated by IAM.

Security note - OS Login + IAP, no external IPs

The modern secure pattern: VMs have no external IP, admins connect via IAP (IAM-gated, no bastion), and SSH access is governed by OS Login (IAM roles + optional 2FA) rather than shared metadata keys. This removes public SSH exposure and ties every login to an IAM identity you can audit and revoke.

Serverless & managed compute

Service	What it is	Use for
Cloud Run	Serverless containers, scale-to-zero, request- or job-based	Most new stateless services/APIs and batch jobs - the default serverless choice
Cloud Functions	Event-driven functions (now Cloud Run functions)	Small event handlers, glue, automation
App Engine	PaaS for web apps (standard/flex)	Legacy/existing App Engine apps; new work usually goes to Cloud Run
Batch	Managed batch job scheduling on Compute Engine	Large batch/HPC jobs without managing a scheduler

Architect note - reach for Cloud Run first

For a new stateless service, start with Cloud Run: no VMs to patch, scale-to-zero, per-request billing, and a container you can run anywhere. Drop to Compute Engine/GKE only when you need persistent local state, specialized kernels, fine-grained GPU control, very long-lived connections, or an orchestration ecosystem. This inverts the old "spin up a VM" reflex and cuts a lot of ops.

Choosing compute by workload

Workload	Starting point
Web / API (stateless)	Cloud Run; or regional MIG (E2/N2/T2D) behind a global LB
Middleware	Regional MIG on N-series, memory-leaning
Databases (self-managed)	N2/C3 or M-series (large), regional PD/Hyperdisk; prefer managed DB (section 6)
Oracle workloads	Compute Engine VM (self-managed Oracle) or the Oracle Database@Google service; sole-tenant for licensing
SAP	Memory-optimized M-series (certified), reservations
Batch / CI / render	Spot VMs in a MIG, or Batch service
Memory-heavy	M-series or custom high-memory
CPU-heavy	C2/C3/H3 compute-optimized
GPU / AI training	A3/A2/G2 with GPUs, or TPUs; consider Vertex AI (section 12)
Cost-sensitive / spiky	E2 or T2D + autoscaling + CUDs; Spot for fault-tolerant parts
Event-driven / bursty	Cloud Run / Cloud Functions (scale to zero)

Operational guidance

Resize a Compute Engine VM (Ops)

Stop the VM, change the machine type (or custom vCPU/memory), start it - brief downtime. In a MIG, update the template and roll.
Changing architecture (x86 ↔ Arm/T2A) is a rebuild, not a resize - watch for arch-specific binaries.

Patch VMs safely (Ops)

Use VM Manager (OS patch management) for scheduled, reported patching across a fleet; combine with OS inventory.
For MIGs, prefer replacing instances from a new patched image (immutable) over in-place patching.

Troubleshoot boot / SSH / high CPU / memory / disk (Ops)

Boot/SSH: use the serial console to read boot output; verify OS Login/IAM roles and the IAP firewall range; check the VM isn't stopped.
High CPU/memory: Cloud Monitoring (install the Ops Agent for memory metrics, which aren't collected by default); right-size or autoscale.
Disk full: resize the persistent disk online, then grow the filesystem; alert at 85%.
Disk attach: confirm the disk is attached and in the same zone; format/mount and add to fstab by UUID.
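
A minimal sketch of the 85% disk alert as a local check, using only the standard library. In practice the Ops Agent and a Cloud Monitoring alert policy would do this; the function names and default threshold parameter here are illustrative.

```python
import shutil


def usage_percent(total_bytes, used_bytes):
    """Used space as a percentage of total."""
    return 100.0 * used_bytes / total_bytes


def disk_alerts(paths, threshold_pct=85.0):
    """Return (path, percent_used) for each mount at or above the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        pct = usage_percent(usage.total, usage.used)
        if pct >= threshold_pct:
            alerts.append((path, round(pct, 1)))
    return alerts


print(disk_alerts(["/"]))
```

An empty list means every checked mount is under the threshold.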

Design compute for production HA (Design)

Regional MIG across ≥2 zones + autohealing behind a load balancer.
Instance template + autoscaling for elasticity and reproducibility.
No external IPs; OS Login + IAP for access; Shielded VMs.
Ops Agent for metrics/logs; image pipeline for patched golden images; regional disks or managed data services for state.
Reservations/CUDs for capacity + cost; a second region for DR.

Operations note - live migration

Compute Engine live-migrates most VMs during host maintenance with no reboot (host maintenance policy = migrate), so infrastructure maintenance is largely transparent. GPU and some specialized VMs are set to terminate instead - design those workloads to tolerate a maintenance restart, and use MIG autohealing.

Official documentation: Compute Engine & Cloud Run →

5. Storage Deep Dive

Block (Persistent Disk, Hyperdisk, Local SSD), file (Filestore), and object (Cloud Storage) storage - their scope, performance, durability, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes.

Last reviewed: July 2026 Verify disk types, storage classes, and retrieval behavior in current docs.

TL;DR

Persistent Disk / Hyperdisk = network block storage for VMs (zonal or regional). Local SSD = ultra-fast but ephemeral (data lost on stop). Filestore = managed NFS for shared filesystems. Cloud Storage = object storage (buckets/objects) for backups, data lakes, static content, and archives - not a filesystem. Choose block for boot/DB, Filestore for shared POSIX, Cloud Storage for objects. Lock down buckets with uniform bucket-level access + public access prevention.

Block storage: Persistent Disk, Hyperdisk, Local SSD

Type	Scope	Notes
Persistent Disk (pd-balanced/ssd/standard/extreme)	Zonal or regional (synchronously replicated across 2 zones)	Network block storage; resize online; snapshots. Regional PD is a key HA building block for stateful VMs.
Hyperdisk	Zonal (some regional)	Next-gen block storage with independently tunable IOPS/throughput (Balanced/Throughput/Extreme/ML) - decouple performance from capacity.
Local SSD	Zonal, attached to the VM's host	Ephemeral - data is lost when the VM is stopped or deleted, or when the host fails; it survives a guest reboot and live migration. Highest IOPS; only for scratch/cache/temp.

DBA note - size performance, and never put a DB on Local SSD alone

Database performance on VMs is usually a disk IOPS/throughput ceiling, not CPU. Use Hyperdisk (or pd-ssd/pd-extreme) sized for the I/O profile; performance scales with provisioned IOPS/throughput and, for PD, with size. Local SSD is ephemeral - only use it for temp/redo-scratch that you can rebuild, never as the sole home of datafiles. For stateful HA, use regional PD so the disk survives a zone loss.

Filestore

Filestore is managed NFS (v3) for shared POSIX filesystems mounted by many VMs/GKE pods. Tiers (Basic, Zonal, Regional, Enterprise) trade performance, capacity, and availability. Use it for shared application state, home directories, media/render scratch, and lift-and-shift apps that expect a filesystem.

Security note

Restrict Filestore access with the VPC firewall (NFS ports) to the client subnet/SA, keep it on private IPs, and use IAM for management. Treat an over-open NFS share the same as any other data exposure.

Cloud Storage

Buckets & objects - a bucket has a globally unique name, a location (region, dual-region, or multi-region), and a default storage class.
Storage classes: Standard (hot), Nearline (~30-day), Coldline (~90-day), Archive (~365-day). Colder = cheaper storage, higher retrieval cost / minimum-storage-duration. Autoclass auto-moves objects between classes by access.
Lifecycle management - rules to transition class or delete objects by age/version.
Versioning keeps prior object versions; Object holds and retention policies + Bucket Lock give WORM/compliance immutability (a locked retention policy cannot be shortened or removed).
Access: Uniform bucket-level access (UBLA) (IAM only, no per-object ACLs) + Public access prevention should be the default. Signed URLs grant time-boxed access to a specific object without giving the requester an IAM role.
Cloud Storage FUSE mounts a bucket as a filesystem (with caveats - it is still object storage underneath).
Transfer: Storage Transfer Service (online, from other clouds/on-prem/HTTP) and Transfer Appliance (physical, for very large datasets).
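
The minimum-storage-duration trade-off can be sketched numerically. This uses the approximate durations listed above (30/90/365 days) and models only early-deletion billing; retrieval fees and per-GB prices are left out, so treat it as a first filter rather than a cost model. The function names are hypothetical.

```python
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billable_days(storage_class, days_stored):
    """Days you are billed for: at least the class's minimum duration."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


def coldest_class_without_penalty(expected_lifetime_days):
    """Coldest class whose minimum duration fits the object's expected lifetime."""
    eligible = [c for c, d in MIN_STORAGE_DAYS.items() if d <= expected_lifetime_days]
    return max(eligible, key=MIN_STORAGE_DAYS.get)


print(billable_days("COLDLINE", 10))      # → 90
print(coldest_class_without_penalty(45))  # → NEARLINE
```

An object deleted after 10 days in Coldline is still billed for 90, which is why short-lived data belongs in Standard.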

Common mistake - Cloud Storage is not a filesystem

Objects are immutable blobs - no in-place random writes, no POSIX locking, and the "/" in names is just a convention. Don't run a database or a lock-dependent app on a FUSE-mounted bucket. Use block or Filestore for filesystem semantics; use Cloud Storage for whole-object put/get (backups, media, lake data, static sites).

Security note - lock buckets down by default

Turn on uniform bucket-level access and public access prevention at creation (enforce via Org Policy storage.publicAccessPrevention and storage.uniformBucketLevelAccess). Legacy per-object ACLs and allUsers grants are how buckets get accidentally exposed. Use signed URLs for controlled external sharing, keep them short-lived, and inventory them. For backup/compliance buckets, add versioning + a locked retention policy so data can't be deleted early (ransomware/accident protection).

Encryption

All data is encrypted at rest by default with Google-managed keys.
CMEK (customer-managed encryption keys) via Cloud KMS - you control rotation and can disable a key to render data unreadable. Use for sensitive/regulated data.
CSEK (customer-supplied keys) - you provide the raw key (niche; you manage all key handling).
Encryption in transit uses TLS across Google's network.

Security note - CMEK for sensitive data

Use CMEK for disks, buckets, and databases holding sensitive data so key control (and the emergency "disable the key" switch) is yours. Keep KMS keys in a locked-down security project, grant only the service agents that need encrypt/decrypt, and rotate on a schedule.

When to use which

Need	Use
VM boot / DB datafiles	Persistent Disk or Hyperdisk (regional PD for zone-HA)
Ultra-fast scratch/cache	Local SSD (ephemeral - rebuildable data only)
Shared POSIX filesystem for many VMs/pods	Filestore
Backups (DB/app)	Cloud Storage (Nearline/Coldline) + lifecycle + retention
Log / long-term archive	Cloud Storage Archive + lifecycle + Bucket Lock
Data lake	Cloud Storage (Standard) - queried by BigQuery/BigLake
Static website / media	Cloud Storage + Cloud CDN
Bulk data into GCP	Storage Transfer Service (online) / Transfer Appliance (physical)
App/VM backup & DR	Backup and DR service; PD snapshots (scheduled)

Practical examples

Database backups to Cloud Storage (DBA)

Managed DBs (Cloud SQL/AlloyDB) back up automatically; for self-managed DBs on VMs, dump/backup to a Cloud Storage bucket over Private Google Access (no internet). Lifecycle to Coldline/Archive; versioning + locked retention for immutability; enable a second-region copy (dual-region bucket or Transfer) for DR.
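
A minimal sketch that generates a lifecycle configuration for a backup bucket. It assumes the {"rule": [...]} JSON layout accepted by the lifecycle-file options of the Cloud Storage CLI tools (verify against current docs); the day thresholds are hypothetical, and a locked retention policy would still block any delete that falls inside the retention period.

```python
import json


def backup_lifecycle(coldline_after_days, archive_after_days, delete_after_days):
    """Build lifecycle rules: Coldline, then Archive, then delete, by object age."""
    if not coldline_after_days < archive_after_days < delete_after_days:
        raise ValueError("ages must increase: coldline < archive < delete")
    return {
        "rule": [
            {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
             "condition": {"age": coldline_after_days}},
            {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
             "condition": {"age": archive_after_days}},
            {"action": {"type": "Delete"},
             "condition": {"age": delete_after_days}},
        ]
    }


# Hypothetical schedule: Coldline at 30 days, Archive at 90, delete after ~7 years.
print(json.dumps(backup_lifecycle(30, 90, 2555), indent=2))
```

Keeping the policy in code lets the same schedule be reviewed and applied to every backup bucket instead of being clicked together per bucket.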

Data lake on Cloud Storage (Data)

Raw / curated / consumption prefixes in Standard buckets; BigQuery external tables / BigLake read them; Dataplex governs. Lifecycle cold raw data to Nearline/Coldline. See section 11.

Shared filesystem for an app cluster (Apps)

Filestore instance mounted on all app VMs/GKE pods; firewall to the app subnet; Enterprise/Regional tier for HA; snapshots for recovery.

Storage gotchas

Cloud Storage is object storage, not a filesystem - no random writes/locks.
Archive/Coldline have retrieval and minimum-duration costs - don't put frequently-read or short-lived data there.
Persistent Disk is zonal or regional - a zonal disk dies with its zone; use regional PD for HA.
Local SSD is ephemeral - never the only copy of anything.
Public-bucket mistakes - enforce UBLA + public access prevention by Org Policy.
Signed URL risk - they're bearer tokens; keep them short and tracked.
Locked retention policy - once Bucket Lock is applied you cannot shorten/delete it (that's the point) - set the duration carefully.
Snapshot cost growth - scheduled snapshots accumulate; set retention.
Cross-region replication / transfer cost - dual/multi-region and egress cost money and take time; a lagging copy isn't DR.
Wrong storage class - Standard for hot data you keep accessing; colder classes only for genuinely cold data.

Official documentation: Cloud Storage, Persistent Disk & Filestore →

6. Database Services Deep Dive

Google Cloud's database portfolio - Cloud SQL, AlloyDB, Spanner, Firestore, Bigtable, Memorystore - what each manages for you, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from Oracle.

Last reviewed: July 2026 DB features, versions, and limits change - verify in current docs.

TL;DR

For relational OLTP, start with Cloud SQL (managed PostgreSQL/MySQL/SQL Server); step up to AlloyDB (PostgreSQL-compatible, higher performance + analytics + vector) when you outgrow it, or Spanner for globally-distributed, horizontally-scalable strong consistency. For NoSQL, Firestore (document, app backends) and Bigtable (wide-column, huge scale, time-series). Memorystore for Redis/Valkey/Memcached caching. Choose the service that leaves you the least to manage while still meeting the workload; managed services own patching/backup/HA, you own schema, queries, and access.

The portfolio at a glance

Service	Model	Sweet spot	You manage	Google manages
Cloud SQL	Managed PostgreSQL / MySQL / SQL Server	Standard relational OLTP, lift-and-shift	Schema, queries, flags, access	Provisioning, patching (in window), backups, HA, replicas
AlloyDB	PostgreSQL-compatible, Google-enhanced	Demanding PostgreSQL, HTAP, PostgreSQL + analytics + vector	Schema, queries, access	Patching, backups, HA, autoscaling read pools
Spanner	Globally-distributed, horizontally scalable, strongly consistent relational	Global OLTP, unlimited scale, five-nines	Schema (different mindset), queries, access	Almost everything - sharding, replication, HA
Firestore	Serverless document NoSQL	Web/mobile app backends, real-time sync	Data model, security rules, indexes	Scaling, replication, HA
Bigtable	Wide-column NoSQL, petabyte-scale, low latency	Time-series, IoT, adtech, huge key-value/analytics	Row-key design (critical), schema	Scaling, replication
Memorystore	Managed Redis / Valkey / Memcached	Cache, session store, leaderboards	Keys/TTL, client	Provisioning, patching, HA

Service deep dives

Cloud SQL (PostgreSQL, MySQL, SQL Server)

HA - regional, synchronous standby in another zone with automatic failover (enable HA; it is not on by default).
Read replicas - in-region and cross-region replicas for read scaling and DR; a cross-region replica can be promoted for regional DR.
Backups - automated daily backups + point-in-time recovery (binary/WAL logging); on-demand backups; you set retention.
Patching - Google patches during your maintenance window; you choose timing and get notifications, but you don't control every patch.
Connectivity - private IP (via private services access), the Cloud SQL Auth Proxy (IAM-authenticated, encrypted), authorized networks (public IP), and IAM database authentication (Postgres/MySQL).
SQL Server - license is included in the price (no BYOL for Cloud SQL SQL Server); watch edition/feature limits.

DBA note

Cloud SQL removes provisioning/patching/backup toil but is not full DBA control: no OS access, limited superuser, controlled flags, and maintenance you schedule but Google performs. Enable HA explicitly, test PITR, and use the Auth Proxy or private IP - never expose a public DB endpoint with broad authorized networks.

AlloyDB for PostgreSQL

PostgreSQL-compatible, Google-built engine aimed at demanding transactional and mixed (HTAP) workloads: a columnar accelerator for analytics, autoscaling read pools, and strong price/performance vs. self-managed Postgres. Supports vector search (pgvector + Google enhancements) for AI/RAG on operational data.

HA - regional with automatic failover; read pools scale reads.
Backups / PITR - continuous backup with point-in-time recovery.
Use it when Cloud SQL Postgres runs out of headroom, when you want analytics on operational data without a separate warehouse, or for Postgres-native vector search at scale. AlloyDB Omni runs the engine on-prem/other clouds.

DBA note

AlloyDB is PostgreSQL-compatible, not Oracle-compatible. Migrating Oracle to AlloyDB is a Postgres migration (schema and PL/SQL conversion via Database Migration Service + the Oracle-to-Postgres tooling), not a lift-and-shift. Plan for datatype, PL/SQL, and feature differences.

Spanner

Google's globally-distributed relational database: horizontal scale to virtually unlimited throughput, external (strong) consistency across regions, and up to 99.999% availability - no manual sharding. SQL interface (GoogleSQL/PostgreSQL dialect).

Scaling - add compute (nodes/processing units); storage and throughput scale with it. No failover to manage.
HA/DR - multi-region configurations replicate synchronously across regions; regional configs across zones.
Mindset shift - schema and primary-key design must avoid hotspots (no monotonically increasing keys); interleaving models parent-child locality. Not a drop-in for a single-node RDBMS.

Architect note - when Spanner is right

Choose Spanner when you genuinely need global scale + strong consistency + high availability beyond what a single primary can give (global user base, unbounded write scale, five-nines). For a normal regional app database, Spanner is over-powered and pricier than Cloud SQL/AlloyDB - and it demands a different data model. Don't reach for it by default.

Firestore, Bigtable, Memorystore

Firestore - serverless document database with real-time listeners and offline sync; great for web/mobile backends. Security Rules control client access. Not relational - model for your queries, mind index and hotspot limits.
Bigtable - wide-column, low-latency, petabyte-scale for time-series, IoT, adtech, and analytics feeding. Row-key design is everything - a bad key hotspots a node. HBase-compatible API.
Memorystore - managed Redis/Valkey/Memcached for caching, sessions, rate limiting; HA tiers with replicas/failover.

DBA note - these are not relational

Firestore and Bigtable are NoSQL: no joins, no ad-hoc SQL, no relational integrity. You design around access patterns and denormalize. Forcing a relational schema onto them (or expecting SQL tuning to apply) leads to hotspots and cost surprises. Pick them when the access pattern (document, wide-column key-value, real-time) genuinely fits.

Database service decision table

Workload	Recommended	Reason	HA	DR	Ops responsibility	Cost lever
PostgreSQL app DB	Cloud SQL PostgreSQL	Managed, standard OLTP	Regional HA (enable it)	Cross-region replica	Schema/queries	Right-size + CUD; auto-storage
MySQL web app	Cloud SQL MySQL	Managed, common	Regional HA	Cross-region replica	Schema/queries	Right-size; read replicas
SQL Server workload	Cloud SQL SQL Server	Managed, license included	Regional HA	Cross-region replica	Schema/queries	Edition/size choice
Demanding Postgres / HTAP / vector	AlloyDB	Performance + analytics + pgvector	Regional + read pools	Cross-region (config)	Schema/queries	Scale read pools
Global transactional	Spanner	Global scale + strong consistency	Built-in (multi-region)	Built-in	Schema/key design	Right-size compute units
Web/mobile app backend	Firestore	Serverless doc, real-time	Built-in	Multi-region option	Data model + rules	Query/index efficiency
High-scale NoSQL / time-series / IoT	Bigtable	Petabyte scale, low latency	Built-in (replication)	Multi-cluster	Row-key design	Node count; storage type
Cache / session	Memorystore	Managed Redis/Valkey	HA tier (replicas)	Rebuild from source	Keys/TTL	Right-size tier
Data warehouse	BigQuery (section 11)	Serverless analytics, not OLTP	Built-in	Multi-region dataset	Schema/queries	Slot/query cost control
Unsupported engine / full control (e.g. Oracle)	Compute Engine (self-managed) or Oracle DB@Google	Engine/version not offered managed	You build it	You build it	Everything	Sole-tenant/licensing

Connectivity & observability

Private IP (via private services access) is the production default - no public endpoint. Serverless VPC Access lets Cloud Run/Functions reach a private-IP DB.
Cloud SQL Auth Proxy / connectors - IAM-authenticated, encrypted connections without managing SSL certs or IP allowlists.
Authorized networks (public IP) - avoid; if used, restrict tightly and require SSL.
Query Insights (Cloud SQL/AlloyDB) - query-level performance analysis; plus Cloud Monitoring metrics for CPU, connections, storage, replication lag.

Common mistake

Giving a database a public IP with broad authorized networks "to connect quickly." That exposes it to the internet. Use private IP + the Auth Proxy / Serverless VPC Access, and plan the private-services-access range and DNS up front - retrofitting private IP later means a connectivity migration.

How HA, DR, backup, and patching differ

Service	HA	DR	Backup	Patching
Cloud SQL	Regional standby (opt-in), auto failover	Cross-region read replica → promote	Automated + PITR, you set retention	Google, in your maintenance window
AlloyDB	Regional, auto failover; read pools	Cross-region config	Continuous + PITR	Google, in window
Spanner	Built-in (multi-zone/region)	Multi-region config	Backups + PITR	Fully managed, transparent
Firestore / Bigtable	Built-in replication	Multi-region / multi-cluster	Managed backup/export	Fully managed

Operations note - test restores

"Managed backups" does not equal "proven recoverability." Periodically restore a backup / do a PITR to a fresh instance and validate. Also verify your cross-region DR: replica lag within RPO, promotion procedure rehearsed, and app connection strings/DNS ready to repoint.

Google Cloud database gotchas for Oracle DBAs

For DBAs coming from Oracle

Cloud SQL is managed, not self-managed - no OS/SYSDBA-level control, controlled flags, Google-run patching. Your runbooks change.
AlloyDB is PostgreSQL-compatible, not Oracle-compatible - Oracle-to-AlloyDB is a full Postgres migration (schema + PL/SQL conversion), not lift-and-shift.
Spanner needs a different data-modeling mindset - key design to avoid hotspots, interleaving for locality; no single-node RDBMS assumptions.
Firestore and Bigtable are not relational - no joins/SQL; design for access patterns.
Patching control differs by service - you schedule windows, Google patches; not your OPatch world.
Backup access differs - backups are service-managed artifacts (+ PITR), not files you copy; export for portability.
Private IP and DNS must be planned - reserve the private-services-access range; plan resolution before you build.
Performance troubleshooting differs - Query Insights / Cloud Monitoring instead of AWR/ASH; different wait/metric vocabulary.
Oracle itself: for Oracle Database you either self-manage on Compute Engine (you own everything, sole-tenant for licensing) or use the Oracle Database@Google Cloud partnership (Exadata/Autonomous run by Oracle inside Google Cloud) - there is no native "managed Oracle" like Cloud SQL.

Enterprise examples

PostgreSQL application database OLTP

Cloud SQL PostgreSQL with HA enabled, private IP, Auth Proxy from the app, automated backups + PITR, a cross-region read replica for DR, Query Insights on. Move to AlloyDB if you need more performance or in-DB analytics/vector.
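As a rough sketch, the core of that configuration maps onto a single gcloud invocation. The instance name, region, tier, version, network path, and backup time are hypothetical placeholders, and private services access is assumed to be set up on the VPC already. The command is printed rather than executed unless APPLY=1 is set; verify flags against current gcloud documentation.

```shell
# Sketch: Cloud SQL for PostgreSQL with HA, private IP only, backups + PITR, Query Insights.
INSTANCE="${INSTANCE:-app-pg-prod}"    # hypothetical instance name
REGION="${REGION:-us-central1}"        # hypothetical region
NETWORK="${NETWORK:-projects/my-project/global/networks/prod-vpc}"  # hypothetical VPC

# Print the command by default; set APPLY=1 to execute it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

create_ha_instance() {
  run gcloud sql instances create "$INSTANCE" \
    --database-version=POSTGRES_16 \
    --tier=db-custom-4-16384 \
    --region="$REGION" \
    --availability-type=REGIONAL \
    --network="$NETWORK" \
    --no-assign-ip \
    --backup-start-time=02:00 \
    --enable-point-in-time-recovery \
    --insights-config-query-insights-enabled
}

create_ha_instance
```

The two flags that matter most here are --availability-type=REGIONAL (HA is opt-in) and --no-assign-ip (no public endpoint). The cross-region replica for DR is a separate instance created afterwards.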

Globally distributed transactional workload Global

Spanner multi-region config; schema designed for even key distribution; app uses the client library with strong reads where needed and stale reads for scale.

IoT / time-series at scale NoSQL

Bigtable with a row key that spreads writes (e.g. reversed/hashed device ID + timestamp), multi-cluster replication for HA, feeding BigQuery/Dataflow for analytics.
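A minimal sketch of such a row key, assuming a layout of hash prefix, device ID, then reversed timestamp. The layout, the 4-character prefix length, and the function name are illustrative assumptions, not a Bigtable requirement; it relies on sha256sum from GNU coreutils.

```shell
# Sketch: build a Bigtable row key that spreads writes across the key space.
# Assumed layout: <4-hex hash prefix>#<device id>#<reversed timestamp>
MAX_TS=9999999999  # upper bound for 10-digit epoch seconds, used to reverse the timestamp

row_key() {
  device_id="$1"
  epoch_seconds="$2"
  # Short hash prefix so sequential device IDs do not land next to each other.
  prefix="$(printf '%s' "$device_id" | sha256sum | cut -c1-4)"
  # Reversed timestamp so the newest reading for a device sorts first.
  reversed=$((MAX_TS - epoch_seconds))
  printf '%s#%s#%010d\n' "$prefix" "$device_id" "$reversed"
}

row_key sensor-0042 1751328000
```

Because the prefix is derived from the device ID, all rows for one device stay contiguous (cheap per-device scans) while different devices scatter across the table.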

Self-managed Oracle on Compute Engine Oracle

When a managed option can't run the engine/version: Oracle on a memory-optimized VM, regional PD/Hyperdisk sized for IOPS, Data Guard you configure to a second-region VM, sole-tenant nodes for licensing, backups to Cloud Storage. Consider Oracle Database@Google for a managed alternative.

Official documentation: Google Cloud databases →

7. Load Balancing and Traffic Management

Google Cloud Load Balancing - the global and regional Application, Network, and proxy load balancers, their components (forwarding rule, target proxy, URL map, backend service, health check), and how to choose and debug them, with Cloud CDN and Cloud Armor.

Last reviewed: July 2026 LB names/tiers evolve - verify current LB types and features in docs.

TL;DR

Google Cloud Load Balancing is a family. The global external Application Load Balancer gives you one anycast IP serving users worldwide (L7, HTTP/S, with Cloud CDN + Cloud Armor). There are also regional external/internal Application LBs, passthrough Network LBs (L4, external/internal), and proxy Network LBs (L4 proxy). An Application LB is assembled from a forwarding rule → target proxy → URL map → backend service → backends, with health checks; proxy Network LBs have no URL map, and passthrough Network LBs go from the forwarding rule straight to the backend service. The #1 failure is a firewall not allowing the health-check ranges.

The load balancer family

Load balancer	Layer / scope	Use for
Global external Application LB	L7, global, single anycast IP	Internet-facing web/APIs served worldwide; CDN + Cloud Armor; cross-region failover
Regional external Application LB	L7, regional	Regional internet-facing L7 (data-residency, regional-only)
Internal Application LB	L7, regional/cross-region internal	Internal microservice HTTP routing
External passthrough Network LB	L4, regional, preserves client IP	Non-HTTP TCP/UDP internet-facing; source-IP-sensitive
Internal passthrough Network LB	L4, regional internal	Internal TCP/UDP (e.g. internal service VIP, HA databases)
Proxy Network LB (external/internal)	L4 proxy	TCP with TLS offload / where a proxy is wanted (no client-IP preservation)

Anatomy of a load balancer

Forwarding rule → target proxy (+ cert) → URL map → backend service → backends (MIGs, NEGs, buckets); health checks probe backends.

Forwarding rule - the frontend IP + port. Target proxy terminates client connections and (for HTTPS) holds the certificate. URL map does host/path routing. Backend service defines the balancing policy, session affinity, timeouts, and health check. Backends are MIGs, NEGs (network endpoint groups - incl. serverless NEGs for Cloud Run/Functions/App Engine), or backend buckets (static content).
SSL certificates - Google-managed certs (auto-provision/renew) or self-managed, via Certificate Manager for scale.
Session affinity - client IP / cookie based, when needed.

Cloud CDN and Cloud Armor

Cloud CDN - cache cacheable responses at Google's edge; enable on a backend service/bucket to cut latency and egress.
Cloud Armor - edge WAF/DDoS for the global external Application LB: OWASP rules, rate limiting, geo/IP allow-deny, bot management, and adaptive protection. Attach a security policy to the backend service.

Security note

Front public HTTP(S) with the global external Application LB + Cloud Armor: it absorbs DDoS at Google's edge, enforces WAF/rate-limit rules, and lets your backends live on private IPs with no direct internet exposure. Terminate TLS at the LB with Google-managed certs; keep backends in private subnets.

When to use which

Global internet web/API, want CDN + WAF + one IP

Global external Application LB + Cloud CDN + Cloud Armor

Regional-only L7 (residency)

Regional external Application LB

Internal microservice HTTP routing

Internal Application LB

Non-HTTP, need real client IP, high throughput

External passthrough Network LB

Internal L4 VIP (e.g. HA DB, internal service)

Internal passthrough Network LB

Serverless (Cloud Run) behind a global IP + WAF

Global external App LB with a serverless NEG

Load balancer troubleshooting

⚑ Backend unhealthy / 502

Likely causes (in order)

Firewall doesn't allow the health-check ranges 35.191.0.0/16 and 130.211.0.0/22 to the backend port - the #1 cause.
Health check port/path/protocol wrong vs. what the app serves.
App not listening / bound to localhost instead of 0.0.0.0.
Wrong backend service protocol (HTTP vs HTTPS vs HTTP/2) or named port mismatch on the MIG.
Cloud Armor rule or URL map misrouting; OS firewall on the VM.

Checks

gcloud compute backend-services get-health BACKEND --global
gcloud compute firewall-rules list --filter="sourceRanges~35.191 OR sourceRanges~130.211"

Fix / prevention

Allow the health-check ranges to the backend SA/tag on the port; align the health check; fix the named port/protocol; bind to 0.0.0.0. Template the LB + firewall together in Terraform.
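A sketch of the firewall rule behind that fix, using the two health-check ranges listed above. The rule name, network, tag, and port are hypothetical placeholders; the command is printed rather than executed unless APPLY=1 is set.

```shell
# Sketch: allow Google health-check probes to reach tagged backends on one port.
NETWORK="${NETWORK:-prod-vpc}"            # hypothetical VPC name
BACKEND_TAG="${BACKEND_TAG:-lb-backend}"  # hypothetical network tag on backend VMs
BACKEND_PORT="${BACKEND_PORT:-8080}"      # hypothetical serving port
HEALTH_CHECK_RANGES="35.191.0.0/16,130.211.0.0/22"

# Print the command by default; set APPLY=1 to execute it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

allow_health_checks() {
  run gcloud compute firewall-rules create allow-lb-health-checks \
    --network="$NETWORK" \
    --direction=INGRESS \
    --action=ALLOW \
    --rules="tcp:${BACKEND_PORT}" \
    --source-ranges="$HEALTH_CHECK_RANGES" \
    --target-tags="$BACKEND_TAG"
}

allow_health_checks
```

To target by service account instead of tag, swap --target-tags for --target-service-accounts.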

⚑ SSL certificate / cert issue

Causes: Google-managed cert stuck in PROVISIONING (the domain must resolve to the LB IP and DNS must be correct before it validates); wrong/missing domain on the cert; expired self-managed cert; HTTP→HTTPS redirect missing. Fix: point DNS at the LB IP first, then wait for provisioning; include all SANs; use Certificate Manager at scale; add a redirect URL map.

⚑ Wrong forwarding rule / URL map / Cloud Armor blocking valid traffic

Causes: forwarding rule on the wrong IP/port/protocol; URL map path/host rule not matching (default backend catching everything); Cloud Armor rule denying legitimate clients (over-broad geo/IP or a WAF rule false positive). Fix: verify the forwarding-rule frontend; test URL map routing (path matchers, order); review Cloud Armor logs and preview mode before enforcing; tune the offending rule.

Official documentation: Cloud Load Balancing →

8. Security Deep Dive

Defense in depth on Google Cloud: identity, governance (Org Policy, VPC-SC), network, data, and detective controls (SCC, audit logs) - plus concrete guidance for securing projects, storage, compute, and databases, ending in a production checklist.

Last reviewed: July 2026 Security service capabilities evolve - verify SCC tiers and features in docs.

TL;DR

Layer your controls: IAM (least privilege, groups, no basic roles, no SA keys), governance (Organization Policies to forbid the risky thing; VPC Service Controls to stop data exfiltration), network (private IPs, firewall, no public exposure, IAP), data (CMEK, Secret Manager, DLP/Sensitive Data Protection), and detection (Security Command Center, centralized Cloud Audit Logs). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive guardrails over after-the-fact detection.

Google Cloud shared responsibility model

Google secures the infrastructure (physical, hardware, host, network fabric, and managed-service internals). You are responsible for: IAM and identity, data classification and access, network exposure and firewall, key management choices, workload/OS security (for IaaS/GKE nodes), secure configuration, and monitoring/response. The higher up the managed-service stack you go (VM → GKE → Cloud SQL → BigQuery/Cloud Run), the more Google handles - but data, access, and configuration always remain yours.

The control layers

Layer	Controls	Key services
Identity & access	Who can do what	Cloud IAM, groups, deny policies, conditional IAM, Workload/Workforce Identity Federation
Governance	What is allowed at all; data can't leave	Organization Policies, VPC Service Controls, Access Context Manager
Network	What can reach what	Firewall + hierarchical policies, private IPs, Cloud Armor, IAP, Private Google Access/PSC
Data	Protect data at rest/in transit	Cloud KMS/HSM (CMEK), Secret Manager, Sensitive Data Protection (DLP), CA Service
Workload / supply chain	Trusted, hardened workloads	Shielded/Confidential VMs, Binary Authorization, Artifact Analysis, Web Security Scanner
Detective / posture	Find misconfig & threats	Security Command Center, Cloud Audit Logs, Cloud Logging, Chronicle

Cloud KMS, HSM, and Secret Manager

Cloud KMS - manage encryption keys for CMEK across storage, disks, databases, and app-level crypto. Cloud HSM gives FIPS 140-2 Level 3 hardware-backed keys; Cloud EKM lets keys live in an external KMS.
Secret Manager - store API keys, DB passwords, certs as versioned secrets; workloads read them via IAM (no secrets in code, images, or env files).
Certificate Authority Service - private CA for issuing internal certificates at scale.

Security note - Secret Manager + CMEK in a locked project

Put KMS keys and secrets in a dedicated security project where only a small key-admin group has admin, and grant workloads only use (encrypt/decrypt / secret accessor) via their service accounts. Never commit secrets or SA keys to source. Disabling a CMEK key is your emergency "make this data unreadable" control - hold it tightly and audit it.

Security Command Center and detection

Security Command Center (SCC)

Central security & risk platform: asset inventory, misconfiguration findings (Security Health Analytics), threat detection (Event Threat Detection), attack-path/risk analysis, and posture management across the org. Turn it on org-wide.

Cloud Audit Logs

Admin Activity (always on), Data Access, System Event, and Policy Denied logs - your immutable evidence trail. Enable Data Access logs where needed and centralize with a sink.

Cloud Logging + Log Router

Route org audit logs to a central logging project (log bucket + BigQuery + SIEM) via an aggregated sink for cross-project visibility and retention.

Chronicle / Sensitive Data Protection

Chronicle (since renamed Google Security Operations) for SIEM-scale threat analytics; Sensitive Data Protection (DLP) to discover/classify/mask PII in storage and BigQuery.

Architect note - centralize audit logs on day one

Create an organization-level aggregated log sink to a dedicated logging project (and onward to BigQuery/SIEM) as part of the landing zone. Retrofitting centralized audit logging after an incident - when you find the Data Access logs were never enabled - is the classic post-mortem finding. Enable SCC org-wide alongside it.

Perimeter and supply chain

VPC Service Controls - a data-exfiltration perimeter around managed services so a valid identity can't copy BigQuery/Cloud Storage data to an outside project. The key control for sensitive data (see section 2).
Identity-Aware Proxy (IAP) - context-aware access to apps/VMs without VPN or public IPs.
Binary Authorization - only allow signed/attested container images to deploy (GKE/Cloud Run).
Artifact Analysis - scan images in Artifact Registry for vulnerabilities; Web Security Scanner for app scanning.
Confidential Computing - encrypt data in use (Confidential VMs/GKE).

How to secure specific things

Secure a production project (and multi-project env) Foundation

Federate identities; enforce 2FA; drive access through groups; no basic roles; least-privilege predefined roles at project/resource scope.
Preventive Org Policies at the org/folder: disable SA key creation, domain-restricted sharing, restrict resource locations, block external IPs, require OS Login/Shielded VM.
Shared VPC for centralized network control; VPC-SC perimeter around data projects.
SCC org-wide; aggregated audit-log sink to a logging project; budgets + quotas.
CMEK + Secret Manager in a security project; break-glass Owner group, monitored.
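A sketch of switching on a few of the boolean Org Policy guardrails from the list above at the organization level. The organization ID is a hypothetical placeholder, and this uses the older resource-manager command group, so check whether the newer org-policies commands are preferred in your environment. List-type constraints such as external-IP restrictions need a policy file and are not covered here. Commands are printed rather than executed unless APPLY=1 is set.

```shell
# Sketch: enforce a few boolean Organization Policy constraints org-wide.
ORG_ID="${ORG_ID:-123456789012}"  # hypothetical organization ID

# Print the command by default; set APPLY=1 to execute it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

enforce_guardrails() {
  for constraint in \
    iam.disableServiceAccountKeyCreation \
    compute.requireOsLogin \
    compute.requireShieldedVm
  do
    run gcloud resource-manager org-policies enable-enforce "$constraint" \
      --organization="$ORG_ID"
  done
}

enforce_guardrails
```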

Secure Cloud Storage Storage

Uniform bucket-level access + public access prevention (enforced by Org Policy); no allUsers.
CMEK for sensitive buckets; versioning + locked retention for backup/compliance.
Signed URLs short-lived and inventoried; access via IAM + service accounts, not keys.

Secure Compute Engine Compute

No external IPs; access via IAP; OS Login (+2FA) instead of metadata SSH keys; Shielded VMs.
Target firewall rules by service account; allow SSH only from the IAP TCP-forwarding range (35.235.240.0/20).
VM Manager for patch compliance; Ops Agent for logs/metrics; CMEK on disks for sensitive data.

Secure databases & public load balancers Data / Edge

DBs on private IP, Auth Proxy/Serverless VPC Access, no public endpoint; CMEK; IAM DB auth where supported.
Public HTTP behind the global external App LB + Cloud Armor (WAF, rate limiting, geo/IP); backends private.
Reduce public exposure everywhere: block external IPs by Org Policy, prefer IAP + private access.

Production Google Cloud security checklist

Human access federated; 2FA enforced; access granted via groups, never individuals.
No basic roles (Owner/Editor/Viewer) in production; least-privilege predefined roles at project/resource scope.
Service account key creation disabled org-wide; workloads use attached SAs / Workload Identity Federation / impersonation.
Preventive Org Policies: block external IPs, restrict resource locations, domain-restricted sharing, require OS Login + Shielded VM.
Security Command Center enabled org-wide with findings triaged.
VPC Service Controls perimeter around projects holding sensitive data.
Shared VPC / hierarchical firewall centrally managed; no broad 0.0.0.0/0 ingress; SSH via IAP range only.
Databases and internal services on private IP; no public database endpoints.
Public HTTP behind global external App LB + Cloud Armor; backends private.
All sensitive data encrypted with CMEK; keys in a locked security project; rotation on.
Secrets in Secret Manager; nothing sensitive in code, images, or metadata.
Cloud Storage buckets: UBLA + public access prevention; versioning + retention on backups.
Org-level aggregated audit-log sink to a central logging project (+ BigQuery/SIEM); Data Access logs on where needed.
Alerts on IAM/policy changes, new SA keys, public exposure, and anomalous access.
Budgets + quotas as guardrails; consistent labels/tags for attribution.
DR and backups tested (restores verified), including CMEK key availability in the DR region.

Common security mistakes

Common Google Cloud security mistakes

Granting Owner/Editor too broadly, or roles at org level.
Long-lived service account keys (the top cause of serious incidents).
Public Cloud Storage exposure (allUsers, legacy ACLs).
Over-permissive firewall rules; broad public SSH instead of IAP.
Public database endpoints with wide authorized networks.
Not enabling / not centralizing audit logs.
Storing secrets in code instead of Secret Manager.
Not using impersonation; not using VPC Service Controls for sensitive data.
Not enforcing Organization Policies (leaving the guardrails off).

Official documentation: Google Cloud security best practices →

9. Observability, Monitoring, and Operations

Cloud Monitoring, Cloud Logging, and the Cloud Operations suite - what to monitor per service, how to build useful alerts without noise, how to centralize logs across projects, and Active Assist for optimization.

Last reviewed: July 2026 Verify metric names and Ops suite features in current docs.

TL;DR

Cloud Monitoring holds metrics, uptime checks, dashboards, and alerting policies that fire to notification channels. Cloud Logging collects logs; the Log Router sends them to log buckets, BigQuery, Pub/Sub, or Cloud Storage via sinks. Install the Ops Agent on VMs for memory/disk/process metrics and logs (not collected by default). Use Managed Service for Prometheus for container metrics. Alert on user-visible symptoms, route by severity, and centralize logs across projects with an aggregated sink.

The observability stack

Service	Role
Cloud Monitoring	Metrics, uptime checks, dashboards, alerting policies, SLOs.
Alerting policies + notification channels	Threshold/absence conditions → email, PagerDuty, Slack, SMS, Pub/Sub, webhook.
Cloud Logging + Log Router + sinks	Collect, route, and retain logs; export to log buckets / BigQuery / Cloud Storage / Pub/Sub.
Cloud Audit Logs	Admin Activity, Data Access, System Event, Policy Denied - the control-plane record.
Error Reporting / Trace / Profiler	Aggregate errors; distributed latency traces; continuous CPU/heap profiling.
Managed Service for Prometheus	Prometheus-compatible metrics at scale (GKE and beyond).
Ops Agent	On-VM agent for host metrics (memory, disk, swap), process metrics, and logs.
Network Intelligence Center / Flow Logs	Network observability (see section 3).
Cloud Asset Inventory / Recommender / Active Assist	Inventory, recommendations, and automated insights (cost, security, reliability).

Operations note - install the Ops Agent

Compute Engine reports CPU, disk I/O, and network by default but not memory, disk-space utilization, or process metrics. Install the Ops Agent (via VM Manager or a startup script) to get memory pressure, disk usage, and application logs. Many "we couldn't see the memory leak" incidents trace back to the agent never being installed.

What to monitor per area

Compute Engine

CPU, memory (agent), disk usage & IOPS/throughput, instance up/health, MIG size vs. target, autohealing events.

Persistent Disk

Throughput/IOPS vs. provisioned limits, disk usage %, latency. Hitting the disk ceiling is a common hidden bottleneck.

Cloud Storage

Request/error rates, object counts, and unusual access patterns (via Data Access logs).

Cloud SQL / databases

CPU, memory, connections vs. limit, storage used %, replication lag, backup success, Query Insights.

Load balancers

Backend health, request count, 5xx rate, latency, backend utilization.

VPC / security

VPN/Interconnect status, flow-log anomalies, firewall drops; SCC findings, audit-log anomalies (policy/key/public-exposure changes).

Building useful alerts

Alert on symptoms users feel (5xx rate, unhealthy backends, DB down, high latency, SLO burn), not just causes.
Use appropriate aligners/reducers and a duration to avoid flapping (e.g. mean over 5 min, not a single spike).
Use absence conditions for signals that should always report (heartbeat, backup completion).
Route by severity: critical → page; warning → ticket/Slack; info → dashboard.
Adopt SLOs and alert on error-budget burn rather than raw thresholds where you can.
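To see why a windowed mean plus duration avoids flapping, here is a toy illustration in shell and awk. It is not Cloud Monitoring's implementation or API; the function name and sample values are made up for the example. It fires only when the mean of the most recent samples exceeds the threshold.

```shell
# Sketch: windowed-mean alert condition.
# Input: one CPU % sample per line (e.g. one per minute) on stdin.
# Usage: should_alert THRESHOLD WINDOW
should_alert() {
  threshold="$1"; window="$2"
  awk -v t="$threshold" -v w="$window" '
    { v[NR] = $1 }
    END {
      if (NR < w) { print "no"; exit }          # not enough data yet
      for (i = NR - w + 1; i <= NR; i++) sum += v[i]
      print ((sum / w) > t ? "yes" : "no")
    }'
}

printf '%s\n' 40 45 99 42 41 | should_alert 85 5   # single spike: mean 53.4
printf '%s\n' 90 92 88 95 91 | should_alert 85 5   # sustained load: mean 91.2
```

The single 99% spike does not fire; the sustained load does. A raw threshold on the latest sample would have paged on both.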

Example alerts to implement

Alert	Condition	Severity
VM CPU high	CPU > 85% mean for 5-10 min	Warning → Critical
VM unavailable	Instance up-check fails / metric absent	Critical
Memory pressure	Memory (agent) > 90%	Warning
Disk usage / IOPS	Disk > 85% used; throughput near provisioned limit	Warning
LB unhealthy backend	Healthy backend count < desired	Critical
Cloud SQL CPU / storage / connections	CPU > 90%; storage > 85%; connections near max	Warning → Critical
Failed backups	Backup failure / success signal absent	Critical
VPN tunnel down / Interconnect issue	Tunnel/attachment status != up	Critical
Cloud Storage unusual access	Spike / unexpected public access (Data Access logs)	Security review
Cloud Run / Functions errors	5xx rate or error count over threshold	Warning → Critical
Pub/Sub backlog high	Oldest unacked message age / undelivered count high	Warning
GKE pod crash loops	Container restart count rising	Warning

Common mistake - alert fatigue

Paging on every transient spike trains people to ignore alerts. Use longer windows, appropriate reducers, duration conditions, severity routing (only real user-impact pages), maintenance suppression, and prune alerts nobody acts on. An alert that never leads to action should be a dashboard metric, not a page.

Centralizing logs across projects

Create an aggregated sink at the org or folder level that routes all projects' logs (especially audit logs) to a central logging project - a log bucket for retention, BigQuery for analysis, and/or Pub/Sub to a SIEM. This gives cross-project security visibility and satisfies retention/compliance without per-project setup.

# Org-level aggregated sink of all audit logs to a central BigQuery dataset
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/central-logging/datasets/org_audit \
  --organization=ORG_ID --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'
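
A sink only delivers once its writer identity can write to the destination. A sketch of that follow-up step, reusing the sink and project names from the command above; ORG_ID is a placeholder. The project-level grant is shown for brevity, and a dataset-level grant is tighter. Commands are printed rather than executed unless APPLY=1 is set.

```shell
# Sketch: grant the sink's writer identity permission to write to BigQuery.
ORG_ID="${ORG_ID:-123456789012}"            # hypothetical organization ID
SINK="${SINK:-org-audit-sink}"
LOG_PROJECT="${LOG_PROJECT:-central-logging}"

# Print the command by default; set APPLY=1 to execute it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

grant_sink_writer() {
  if [ "${APPLY:-0}" = "1" ]; then
    # Look up the service account the sink writes as.
    writer="$(gcloud logging sinks describe "$SINK" \
      --organization="$ORG_ID" --format='value(writerIdentity)')"
  else
    writer="serviceAccount:SINK_WRITER_IDENTITY"  # placeholder shown in dry-run mode
  fi
  run gcloud projects add-iam-policy-binding "$LOG_PROJECT" \
    --member="$writer" \
    --role=roles/bigquery.dataEditor
}

grant_sink_writer
```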

Active Assist & recommendations

Recommender / Active Assist surface actionable insights: IAM over-grants, idle VMs/disks/IPs, right-sizing, commitment recommendations, and reliability/security findings. Review them monthly (they feed the cost checklist in section 14). Service Health shows Google-side incidents and maintenance events affecting your resources.

Official documentation: Cloud Monitoring, Logging & Operations suite →

10. Containers, Kubernetes, and Cloud Native

GKE (Autopilot and Standard), Cloud Run, and the serverless / event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and event-driven systems.

Last reviewed: July 2026 Verify GKE modes, Gateway API, and Cloud Run features in current docs.

TL;DR

GKE (managed Kubernetes) for orchestrated, long-running microservices - Autopilot when you want Google to manage nodes, Standard when you need node-level control. Cloud Run for serverless containers that scale to zero (the default for most new stateless services). Cloud Functions for small event handlers. Around them: Artifact Registry, Cloud Build/Deploy, Pub/Sub, Eventarc, Workflows, Cloud Tasks/Scheduler, and Apigee/API Gateway. GKE workloads use Workload Identity (not SA keys) to call Google APIs.

The cloud-native services

Service	What it is	Use for
GKE (Autopilot / Standard)	Managed Kubernetes control plane + nodes (or fully-managed nodes in Autopilot)	Orchestrated microservices, platform teams, portable K8s workloads
Cloud Run	Serverless containers (services & jobs), scale-to-zero	Most new stateless services/APIs and batch jobs
Cloud Functions	Event-driven functions (on Cloud Run)	Small event handlers, glue
App Engine	PaaS for web apps	Existing App Engine apps
Artifact Registry	Managed registry for images and packages	Store/scan images (Container Registry is legacy)
Cloud Build / Cloud Deploy	CI (build) and CD (progressive delivery)	Container build + deploy pipelines
Pub/Sub	Global messaging / event bus	Decoupling, streaming ingestion, fan-out
Eventarc	Event routing (from Google services / Pub/Sub) to Run/GKE/Workflows	Event-driven triggers
Workflows / Cloud Tasks / Cloud Scheduler	Orchestration / task queues / cron	Serverless orchestration and scheduling
Apigee / API Gateway	Full API management / lightweight API gateway	Publishing, securing, and managing APIs

GKE deep dive

Autopilot vs Standard: Autopilot manages nodes, scaling, and much of the security hardening for you and bills for the resources your pods request - less to run, fewer knobs. Standard gives you node pools and full control (custom machine types, GPUs, DaemonSets that need node access) at more operational cost.
Node pools (Standard) - groups of nodes with a machine type/image; scale and upgrade per pool; use Spot node pools for fault-tolerant workloads.
Regional vs zonal clusters - regional replicates the control plane and spreads nodes across zones (HA); zonal is single-zone.
Networking - VPC-native clusters use alias IP ranges (secondary subnet ranges) for pods and services; plan those ranges (pods need many IPs).
Ingress / Gateway API - a Kubernetes Ingress or the newer Gateway API provisions a Google Cloud load balancer for HTTP routing; a Service type=LoadBalancer provisions an L4 LB.
Workload Identity - map a Kubernetes service account to a Google service account so pods call Google APIs with short-lived credentials, no keys.

# Bind a Kubernetes SA to a Google SA (Workload Identity)
gcloud iam service-accounts add-iam-policy-binding GSA@PROJECT.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"
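
The IAM binding is one half of the mapping; the Kubernetes service account is also annotated with the Google service account it should act as. A sketch of that second step, with hypothetical default values standing in for the NAMESPACE, KSA_NAME, GSA, and PROJECT placeholders used above. The command is printed rather than executed unless APPLY=1 is set.

```shell
# Sketch: annotate the Kubernetes SA with the Google SA it maps to.
NAMESPACE="${NAMESPACE:-default}"   # hypothetical namespace
KSA_NAME="${KSA_NAME:-app-ksa}"     # hypothetical Kubernetes service account
GSA="${GSA:-app-gsa}"               # hypothetical Google service account name
PROJECT="${PROJECT:-my-project}"    # hypothetical project ID

# Print the command by default; set APPLY=1 to execute it.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

annotate_ksa() {
  run kubectl annotate serviceaccount "$KSA_NAME" \
    --namespace "$NAMESPACE" \
    "iam.gke.io/gcp-service-account=${GSA}@${PROJECT}.iam.gserviceaccount.com"
}

annotate_ksa
```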

Architect note - Autopilot by default, plan pod IPs

Start with GKE Autopilot unless you need node-level features (specific GPUs/DaemonSets, custom kernels) - it removes node ops and improves security posture. Either way, size the pod alias range for peak pods across the cluster; running out of pod IPs stalls scheduling in ways that look like mysterious Pending pods. Use Workload Identity for all pod-to-Google-API access - never mount SA keys.
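
The pod-range sizing is simple arithmetic, sketched below. It assumes each node reserves a fixed slice of the pod range: a /24 per node is what GKE Standard allocates at its default of 110 max pods per node, and a lower max-pods setting yields smaller slices (for example /26 at up to 32 pods per node). The function name is illustrative; verify the per-node allocation for your cluster mode and version.

```shell
# Sketch: how many nodes a pod secondary range can support.
# Usage: max_nodes POD_RANGE_PREFIX [PER_NODE_PREFIX]
max_nodes() {
  pod_range_prefix="$1"        # e.g. 17 for a /17 pod secondary range
  per_node_prefix="${2:-24}"   # slice reserved per node; /24 assumed by default
  echo $(( 1 << (per_node_prefix - pod_range_prefix) ))
}

max_nodes 17      # /17 range, /24 per node -> 128
max_nodes 14      # /14 range, /24 per node -> 1024
max_nodes 21 26   # /21 range, /26 per node -> 32
```

When the cluster reaches that node count, new nodes cannot get a pod slice, which surfaces as the Pending pods described above.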

GKE vs Cloud Run vs Cloud Functions vs Compute Engine

Many orchestrated microservices, need the K8s ecosystem

GKE (Autopilot unless you need node control)

Stateless service/API that should scale to zero

Cloud Run (the default for new services)

Small event handler / glue

Cloud Functions (+ Eventarc)

Persistent state, special kernels, full control

Compute Engine / GKE Standard

Batch job

Cloud Run jobs or Batch service

Cost note

Don't run a GKE cluster for one or two containers - the cluster fee and node baseline cost more than Cloud Run, which by default bills only while requests are being handled and scales to zero. Reserve GKE for genuine orchestration needs (fleets, service mesh, complex scheduling). Use Spot node pools for fault-tolerant GKE workloads.

Networking, IAM, and security for containers

Networking - VPC-native GKE in a (shared) VPC subnet with pod/service alias ranges; private clusters keep the control-plane endpoint private; Cloud Run connects to VPC via Serverless VPC Access or Direct VPC egress.
IAM - cluster access via IAM + Kubernetes RBAC; Workload Identity for pod-to-Google auth; Cloud Run/Functions run as a service account you set.
Supply chain - scan images in Artifact Registry (Artifact Analysis); enforce Binary Authorization so only attested images deploy.
Runtime - network policies, pod security, secrets from Secret Manager, and least-privilege service accounts.
Monitoring - GKE integrates with Cloud Monitoring/Logging and Managed Service for Prometheus; Cloud Run emits request metrics and logs automatically.

CI/CD for containers

Cloud Build builds and tests images (triggered from a repo), pushes to Artifact Registry (scanned), and Cloud Deploy promotes them through environments (dev → staging → prod) with approvals and rollback. Binary Authorization gates what can deploy.

Architecture patterns

Event-driven: an object upload triggers Eventarc → Cloud Run processes it and writes to BigQuery/DB or publishes to Pub/Sub for fan-out.

Microservices on GKE - deployments behind Gateway API/Ingress LB, HPA autoscaling, optional service mesh (Cloud Service Mesh) for mTLS/traffic control, Workload Identity, Cloud Deploy pipelines.
Microservices on Cloud Run - each service a container, private ingress + internal LB for service-to-service, Pub/Sub/Eventarc for async - minimal ops.
Serverless function on a Cloud Storage event - as diagrammed; image/ETL/validation triggers.
Event-driven architecture - Eventarc + Pub/Sub + Workflows + Cloud Run/Functions + Cloud Tasks for decoupled, resilient pipelines.
Private container platform - private GKE cluster in a Shared VPC, internal LBs, Binary Authorization, no public endpoints.

Troubleshooting

⚑ GKE pod not starting (Pending / ImagePullBackOff / CrashLoopBackOff)

Causes: Pending = no schedulable capacity, pod IP exhaustion (alias range too small), or resource requests too large for any node. ImagePullBackOff = bad image path, missing Artifact Registry read permission on the node service account (image pulls use the node's identity, not the pod's Workload Identity), or no route to the registry (private cluster without PGA/Artifact Registry access). CrashLoopBackOff = app failing on start (missing config/secret, bad liveness probe).
Checks: kubectl describe pod, kubectl logs --previous, node capacity, alias-range free IPs.
Fix: scale the node pool or fix the alias range; grant roles/artifactregistry.reader to the node service account; fix probes/config; enable PGA for private clusters.

⚑ GKE Ingress issue

Causes: health checks failing (firewall not allowing 35.191.0.0/16 and 130.211.0.0/22, or a readiness probe that doesn't match the LB health check); managed certificate missing or not yet provisioned (DNS must point at the LB IP first); wrong BackendConfig/NEG; Ingress class/annotations misconfigured.
Fix: allow the health-check ranges, align the readiness probe with the LB health check, point DNS at the LB and then wait for the certificate, verify the BackendConfig.

⚑ Cloud Run revision / Cloud Functions timeout / Pub/Sub backlog

Cloud Run: new revision serving 100% of traffic but failing - check that the container starts and listens on $PORT, startup CPU boost, min instances, and the runtime SA's permissions; roll back to the previous revision.
Functions timeout: raise the timeout/memory, make work idempotent, and offload long work to a Cloud Run job.
Pub/Sub backlog: slow or failing subscriber - check the ack deadline, subscriber errors, and scale consumers; use a dead-letter topic for poison messages.
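The "listens on $PORT" check in the Cloud Run runbook above is the most common cause of a revision that never becomes healthy. As a minimal sketch using only the Python standard library: the handler name and response body are illustrative, and the 8080 fallback is the usual Cloud Run default, which you should verify for your setup.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(env=None) -> int:
    """Read the listening port from PORT, falling back to 8080."""
    env = os.environ if env is None else env
    return int(env.get("PORT", "8080"))


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a plain 'ok'."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int) -> HTTPServer:
    # Bind to all interfaces: a server bound only to 127.0.0.1 is not
    # reachable from outside the container.
    return HTTPServer(("0.0.0.0", port), HealthHandler)


if __name__ == "__main__":
    make_server(resolve_port()).serve_forever()
```

A container that hard-codes a different port, or binds to localhost, starts cleanly in local tests and still fails its startup check after deployment.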

Official documentation: GKE, Cloud Run & Eventarc →

11. Analytics, Data, and Integration

Data is Google Cloud's strongest area. BigQuery, the lake/lakehouse stack (Cloud Storage, BigLake, Dataplex), the pipeline tools (Dataflow, Dataproc, Data Fusion, Dataform), streaming (Pub/Sub), CDC (Datastream), and BI (Looker) - with the BigQuery mental model that trips up newcomers.

Last reviewed: July 2026 Verify BigQuery editions/pricing models and service availability in docs.

TL;DR

Land data in Cloud Storage (the lake) and analyze it in BigQuery (serverless warehouse - storage and compute are separate). Transform with Dataflow (streaming/batch Beam), Dataproc (managed Spark/Hadoop), Data Fusion (visual ETL), or Dataform (SQL transformations in BigQuery). Ingest streams with Pub/Sub, CDC with Datastream, govern with Dataplex, and visualize with Looker / Looker Studio. BigQuery cost is driven by bytes scanned (on-demand) or slots (capacity) - query design directly affects the bill.

The services

Service	Role
BigQuery	Serverless data warehouse - SQL analytics at petabyte scale; separated storage/compute; ML (BigQuery ML) and vector search built in.
BigLake	Unify BigQuery-managed and open-format (Parquet/Iceberg) data in Cloud Storage under one governed table interface.
Dataplex	Data governance, cataloging, lineage, quality, and organization across the lake/warehouse (Data Catalog is part of it).
Cloud Storage	The data lake landing/curation/consumption zones.
Dataflow	Fully-managed Apache Beam - unified streaming & batch pipelines.
Dataproc	Managed Spark/Hadoop clusters (and serverless Spark).
Data Fusion	Visual, code-free ETL/ELT (CDAP-based).
Dataform	SQL-based transformation/versioning inside BigQuery (ELT, tests, docs).
Pub/Sub	Global messaging for streaming ingestion and event distribution.
Datastream	Serverless CDC from operational DBs (Oracle, MySQL, PostgreSQL) into BigQuery/Cloud Storage.
Cloud Composer	Managed Apache Airflow for orchestration.
Analytics Hub	Securely share/exchange datasets across projects/orgs.
Looker / Looker Studio	Governed BI/semantic modeling (Looker) and self-serve dashboards (Looker Studio).

Data engineering note - Datastream + BigQuery for CDC

To get near-real-time operational data into the warehouse, Datastream streams change data (including from Oracle) into BigQuery with low source impact - a common alternative to heavyweight ETL. Pair with BigQuery's CDC/merge handling. Mind supplemental logging and schema-drift handling on the source, and validate throughput for high-change tables.

The BigQuery mental model

BigQuery is not a traditional database

Storage and compute are separated. Data sits in columnar storage; queries spin up compute (slots) on demand. You are not managing a server with a fixed size.
It is analytical (OLAP), not transactional. Great for scans/aggregations over huge tables; wrong for high-rate single-row OLTP (use Cloud SQL/Spanner for that).
SQL, but different operations. No indexes in the OLTP sense; performance comes from partitioning (by date/ingestion time) and clustering (by high-cardinality filter columns) to prune data.
Query design drives cost. On-demand pricing charges by bytes scanned - SELECT * and unpartitioned full scans are expensive. Select only needed columns; filter on partition/cluster keys.
Slots are the unit of compute. On-demand allocates slots automatically; capacity pricing (editions) reserves slots for predictable cost/performance and can autoscale them.
Loading options differ: batch loads (no load charge by default), streaming inserts / Storage Write API (real-time, priced), and external/BigLake tables (query data in Cloud Storage in place) each have different cost/latency trade-offs.

Common mistake

Treating BigQuery like Postgres: SELECT * across an unpartitioned multi-TB table, per-row updates, or expecting OLTP latency. That scans (and bills) enormous data and performs poorly. Partition + cluster tables, select only needed columns, and keep transactional workloads in Cloud SQL/Spanner.
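A rough back-of-the-envelope sketch shows why column selection and partition pruning dominate the on-demand bill. It assumes columns are equally sized and partitions evenly filled, which real tables rarely are, and it treats the per-TiB price as a parameter: the 5.0 used below is a placeholder, not a quoted rate.

```python
TIB = 2 ** 40


def estimated_scan_bytes(table_bytes: float,
                         column_fraction: float = 1.0,
                         partition_fraction: float = 1.0) -> float:
    """Bytes a query reads after column selection and partition pruning."""
    for name, value in (("column_fraction", column_fraction),
                        ("partition_fraction", partition_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return table_bytes * column_fraction * partition_fraction


def on_demand_cost(scan_bytes: float, price_per_tib: float) -> float:
    """Cost of one query at a given (assumed) price per TiB scanned."""
    return scan_bytes / TIB * price_per_tib


if __name__ == "__main__":
    table = 10 * TIB
    price = 5.0  # placeholder price per TiB; check the pricing page
    full = estimated_scan_bytes(table)
    pruned = estimated_scan_bytes(table, column_fraction=3 / 30,
                                  partition_fraction=7 / 365)
    print(f"SELECT * full scan : {on_demand_cost(full, price):.2f}")
    print(f"3 columns, 7 days  : {on_demand_cost(pruned, price):.4f}")
```

Clustering reduces the scan further by skipping blocks inside a partition; that effect depends on data layout and is not modeled here. For a real query, the dry-run estimate is the number to trust.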

Common data patterns

Pattern	Built from
Data lake	Cloud Storage (raw/curated/consumption) + Dataplex + BigLake
Data warehouse	BigQuery + Dataform/Dataflow loads + Looker
Lakehouse	BigLake over open formats (Iceberg/Parquet in GCS) + Dataplex governance, queried by BigQuery
ETL / ELT	Dataflow / Dataproc / Data Fusion (E/T) + Dataform (in-warehouse T)
Streaming ingestion	Pub/Sub → Dataflow → BigQuery (or Storage Write API direct)
CDC	Datastream → BigQuery / Cloud Storage
Reporting / BI	BigQuery + Looker (governed) / Looker Studio (self-serve)
AI-ready data	Curated lake + BigQuery ML / vector search + Vertex AI (section 12)
Cross-org data sharing	Analytics Hub (publish/subscribe datasets) with governance

Data governance with Dataplex

Dataplex gives you a data catalog, business/technical metadata, data lineage, data quality, and policy-based access across lakes and BigQuery - so a growing lake stays a governed asset instead of a "data swamp." Combine with VPC Service Controls (perimeter around BigQuery/Storage) and column/row-level security in BigQuery for sensitive data.

Security note

Wrap analytics projects (BigQuery + the lake buckets) in a VPC Service Controls perimeter so data cannot be exported to outside projects, use column-level security and data masking for sensitive fields, discover PII with Sensitive Data Protection (DLP), and control sharing through Analytics Hub rather than ad-hoc dataset grants.

Reference architecture: lakehouse + BI

CDC/streaming into a Cloud Storage lake, Dataflow transforms, BigQuery serves analytics (BigLake over open formats), Dataplex governs, Looker visualizes.

BigQuery cost control

Cost note - control the scan

Partition and cluster big tables so queries prune to the relevant data.
Select only needed columns; avoid SELECT *. Preview cost with the dry-run estimator before running.
Set custom quotas / maximum bytes billed per query to cap runaway scans.
Choose the right pricing: on-demand (per-byte) for spiky/low volume, capacity (editions with slot reservations + autoscaling) for predictable heavy workloads.
Use materialized views and BI Engine for repeated aggregations; expire temp datasets.
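The on-demand versus capacity choice in the list above comes down to a break-even volume. The sketch below assumes a constant number of slots billed for every hour of the month (autoscaled slots bill only while scaled up, so this overstates capacity cost for bursty use). Both prices are placeholder parameters, not quoted rates.

```python
def monthly_on_demand(tib_scanned: float, price_per_tib: float) -> float:
    """On-demand spend for a month's scanned volume."""
    return tib_scanned * price_per_tib


def monthly_capacity(slots: int, hours: float, price_per_slot_hour: float) -> float:
    """Capacity spend for a fixed slot count billed for the given hours."""
    return slots * hours * price_per_slot_hour


def breakeven_tib(slots: int, hours: float,
                  price_per_slot_hour: float, price_per_tib: float) -> float:
    """Monthly scan volume above which the reservation is cheaper."""
    return monthly_capacity(slots, hours, price_per_slot_hour) / price_per_tib


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    volume = breakeven_tib(slots=100, hours=730,
                           price_per_slot_hour=0.05, price_per_tib=5.0)
    print(f"Capacity wins above roughly {volume:.0f} TiB scanned per month")
```

This only compares cost. Capacity also changes performance behavior, because queries compete for the reserved slots instead of drawing on the shared on-demand pool.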

Official documentation: BigQuery, Dataflow, Dataplex & the data stack →

12. AI, ML, and Generative AI on Google Cloud

Vertex AI as the unified ML platform, Gemini models and Agent Builder, vector search across AlloyDB / Cloud SQL / BigQuery, and the pretrained AI APIs - plus the enterprise RAG patterns and governance guardrails that separate a demo from something you can run on real data.

Last reviewed: July 2026 Model names, regions, quotas & pricing change fast - verify Gemini/Vertex availability in the Console.

TL;DR

Vertex AI is the unified platform for building, tuning, deploying, and operating models - Model Garden (incl. Gemini), Vertex AI Studio, custom/AutoML training, endpoints, pipelines, feature store, model registry, and Agent Builder / Vertex AI Search for grounded assistants. For RAG, generate embeddings and store vectors in AlloyDB, Cloud SQL (pgvector), BigQuery, or Vertex Vector Search. BigQuery ML runs ML in SQL. The hard part is not the model - it is governing what the model can reach.

Vertex AI

Capability	What it does
Model Garden	Catalog of Google (Gemini), open, and third-party models to deploy/tune.
Vertex AI Studio	Prompt, test, and tune generative models; multimodal.
Gemini models	Google's frontier multimodal LLMs (text, image, audio, video, code) served via Vertex AI.
Agent Builder / Vertex AI Search	Build grounded search and agents over your data with managed retrieval (less RAG plumbing).
AutoML / custom training	Train without/with your own code; distributed training on GPUs/TPUs.
Endpoints (online / batch prediction)	Serve models for low-latency online or high-throughput batch inference.
Pipelines / Feature Store / Experiments / Model Registry	MLOps: reproducible pipelines, feature serving, experiment tracking, versioned model governance.
Model Monitoring	Drift/skew detection and quality monitoring for deployed models.

Pretrained AI APIs & BigQuery ML

Document AI

Parse and extract structured data (text, tables, entities) from documents - invoices, forms, contracts.

Vision / Speech / Translation / Natural Language

Pretrained APIs for image analysis, speech-to-text and text-to-speech, translation, and text/entity/sentiment analysis - no training needed.

Contact Center AI / Dialogflow

Conversational agents and virtual call-center assistants (CCAI), including generative agents.

BigQuery ML

Train and run ML models (and call Gemini/embeddings) directly with SQL in BigQuery - great for analysts and data-resident ML.

Vector search options

Option	Use when
AlloyDB vector search	Vectors alongside operational Postgres data, with high performance (Google's ScaNN-based indexing).
Cloud SQL for PostgreSQL (pgvector)	Vectors in an existing Cloud SQL Postgres, modest scale, simplest path.
BigQuery vector search	Vectors + embeddings at analytics scale, alongside your warehouse data, in SQL.
Vertex AI Vector Search	Purpose-built, very large-scale, low-latency similarity search (managed ANN).

AI note - keep vectors near the governed data

Storing embeddings in AlloyDB/Cloud SQL/BigQuery means retrieval inherits your existing IAM, VPC-SC perimeter, backups, and row/column security - you combine similarity search with ordinary WHERE filters so retrieval respects entitlements. Use Vertex Vector Search when scale/latency demands a dedicated ANN service. Either way, filter retrieved context to what the requesting user is allowed to see.
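As an in-memory illustration of entitlement-filtered retrieval: the filter runs before ranking, so a chunk the user cannot see never competes for a place in the context. The Chunk shape, group names, and two-dimensional embeddings are hypothetical; in a database the same idea is a WHERE clause combined with the vector distance ordering.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    allowed_groups: frozenset
    embedding: tuple


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, user_groups, chunks, k=3):
    """Return the ids of the top-k chunks the user is entitled to see."""
    visible = [c for c in chunks if c.allowed_groups & set(user_groups)]
    ranked = sorted(visible,
                    key=lambda c: cosine(query_embedding, c.embedding),
                    reverse=True)
    return [c.doc_id for c in ranked[:k]]


# Hypothetical corpus: two restricted documents and one shared document.
CHUNKS = [
    Chunk("hr-policy", frozenset({"hr"}), (1.0, 0.0)),
    Chunk("eng-runbook", frozenset({"eng"}), (0.9, 0.1)),
    Chunk("public-faq", frozenset({"hr", "eng"}), (0.2, 0.8)),
]

if __name__ == "__main__":
    print(retrieve((1.0, 0.0), {"eng"}, CHUNKS))
    print(retrieve((1.0, 0.0), {"hr"}, CHUNKS))
```

Filtering after ranking instead would let restricted chunks push permitted ones out of the top k, and risks leaking them if the later filter is ever skipped.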

RAG architecture on Google Cloud

Ingestion: docs → chunk → embed → store vectors. Runtime: query → governed serving layer → entitlement-filtered retrieval → Gemini generates a grounded, audited answer. Or use Vertex AI Search/Agent Builder to manage the retrieval.
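The "chunk" step of ingestion is often the first thing to tune. A minimal sketch of fixed-size chunking with overlap follows; it splits on whitespace for simplicity, whereas a real pipeline would more likely count model tokens and respect headings or paragraphs. The sizes are arbitrary.

```python
def chunk_words(text: str, size: int, overlap: int) -> list:
    """Split text into word windows of `size`, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g", size=3, overlap=1):
        print(chunk)
```

Overlap keeps a sentence that straddles a boundary retrievable from either side, at the cost of storing and embedding some text twice.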

Enterprise patterns

Pattern	How	Watch out for
Chat with documents	RAG over Cloud Storage docs + vector search + Gemini (or Vertex AI Search)	Chunking quality; stale index; citations
Chat with database	Retrieve from curated views; generate grounded answers	Never expose raw prod OLTP; use a serving layer
Natural language to SQL	Gemini proposes SQL against a governed schema/catalog	Validate/parametrize; read-only; no dynamic SQL
RAG with BigQuery	Embeddings + vector search in BigQuery over warehouse data	Column/row security on retrieval
Document processing pipeline	Document AI → extract → BigQuery / workflow	Human review of low-confidence extractions
Call center AI	CCAI / Dialogflow + Gemini + knowledge base	Grounding; escalation to humans
MLOps pipeline	Vertex AI Pipelines + Feature Store + Model Registry + Monitoring	Reproducibility; drift monitoring
Governed private GenAI	Private endpoints + VPC-SC + curated data + audit	Entitlement-aware retrieval

Governance and security for GenAI

Serving layer, always - agents/LLMs call a governed API (e.g. Cloud Run) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
Entitlement-aware retrieval - filter retrieved context to what the requesting user may see (row/column/document level) so RAG cannot leak across users.
Private & perimetered - keep model and data traffic private (Private Service Connect / private endpoints); wrap data services in VPC Service Controls.
Credential hygiene - secrets in Secret Manager, access via service accounts / Workload Identity; the model never sees raw credentials.
Auditability - log prompts, retrieved context IDs, and responses (per privacy rules) so answers are explainable.
Responsible AI - safety filters, evaluation, and Model Monitoring for drift/quality; human review for consequential outputs.

Warnings (read before connecting AI to enterprise data)

Do not do these

Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
Protect credentials. No DB passwords, keys, or wallets in prompts, code, or agent memory. Use Secret Manager + service accounts / Workload Identity.
Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; isolate and sanitize retrieved/user content.
Check Gemini model availability, region, quota, and pricing before you design - these change frequently and vary by region.
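To illustrate the "validated, read-only queries against a curated schema" warning, here is a deliberately conservative gate for model-proposed SQL. It is a keyword-and-regex sketch, not a SQL parser: it rejects anything it does not recognize, including some legitimate queries (CTEs, quoted identifiers, EXTRACT ... FROM). The table names are hypothetical. It complements, and does not replace, read-only credentials on curated views.

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|call|execute)\b",
    re.IGNORECASE,
)
# Captures the table after FROM/JOIN, an optional alias, and a trailing
# comma (a comma means an implicit join the pattern cannot inspect).
TABLE_REF = re.compile(
    r"\b(?:from|join)\s+([A-Za-z_][\w.]*)(?:\s+(?:as\s+)?[A-Za-z_]\w*)?\s*(,?)",
    re.IGNORECASE,
)


def is_safe_query(sql: str, allowed_tables) -> bool:
    """True only for a single SELECT that reads allowlisted tables."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or more than one statement
    if "--" in statement or "/*" in statement:
        return False  # comments can hide content from naive checks
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False
    if FORBIDDEN.search(statement):
        return False
    refs = TABLE_REF.findall(statement)
    if not refs or any(comma for _, comma in refs):
        return False
    allowed = {name.lower() for name in allowed_tables}
    return {table.lower() for table, _ in refs} <= allowed


if __name__ == "__main__":
    allowed = {"reporting.orders"}
    print(is_safe_query(
        "SELECT id, total FROM reporting.orders WHERE region = @region", allowed))
    print(is_safe_query("SELECT 1; DROP TABLE reporting.orders", allowed))
```

User-supplied values should still travel as query parameters (the @region placeholder above) and never be spliced into the SQL text.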

AI note - the pattern that scales safely

The durable enterprise GenAI shape is: curated/governed data → entitlement-filtered retrieval → model behind a serving API → validated, audited output, all inside a VPC-SC perimeter. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first.

Official documentation: Vertex AI, Gemini & Vector Search →

13. Migration and Disaster Recovery

Getting workloads into Google Cloud (VMs, databases, data) and keeping them recoverable - the migration tooling, the DR patterns by tier, and how RTO/RPO drive architecture and cost.

Last reviewed: July 2026 Verify supported sources for MVM/DMS/Datastream in current docs.

TL;DR

Migrate VMs with Migrate to Virtual Machines, databases with Database Migration Service (and Datastream for CDC/low-downtime), and bulk data with Storage Transfer Service / Transfer Appliance. Plan with Migration Center. For DR, choose per tier: backup & restore (cheapest, slow), cold/pilot light, warm standby, or hot/active-active. Global load balancing + Cloud DNS handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.

Migration tooling

Move	Tooling	Notes
Assess & plan	Migration Center	Discovery, assessment, TCO, and grouping before you move
VMs	Migrate to Virtual Machines (MVM)	Lift-and-shift VMware/AWS/Azure/physical VMs to Compute Engine
VMware estates	Google Cloud VMware Engine	Run VMware as-is in GCP, then modernize gradually
Databases (low downtime)	Database Migration Service (DMS) + Datastream	Managed migrations to Cloud SQL/AlloyDB; DMS supports Oracle→PostgreSQL conversion
Bulk data	Storage Transfer Service / Transfer Appliance	Online from other clouds/on-prem/HTTP, or physical appliance for very large sets
Data warehouse	BigQuery Data Transfer Service + migration tooling	From Teradata/Redshift/others into BigQuery
Backup & DR	Backup and DR service	Application-consistent backup and DR orchestration for VMs/databases

Database migration paths

Source → target	Method	Downtime
PostgreSQL/MySQL → Cloud SQL	DMS continuous (native replication: logical replication for PostgreSQL, binlog for MySQL)	Near-zero
PostgreSQL → AlloyDB	DMS continuous	Near-zero
Oracle → PostgreSQL/AlloyDB	DMS with schema/code conversion (heterogeneous)	Low, plus conversion effort
Oracle/MySQL/PostgreSQL → BigQuery/Storage	Datastream (CDC)	Near-zero (ongoing feed)
Any → self-managed on Compute Engine	Native dump/restore, replication, or Data Guard (Oracle)	Depends on method

DBA note - heterogeneous Oracle moves are conversions

Oracle → Cloud SQL/AlloyDB PostgreSQL is a heterogeneous migration: DMS handles data movement and assists with schema/PL-SQL conversion, but datatype, PL/SQL, and feature differences need real work and testing. If the app must stay on Oracle, plan for self-managed Oracle on Compute Engine or Oracle Database@Google Cloud instead of a conversion. Use Datastream when you only need Oracle data flowing into BigQuery for analytics.

DR patterns

Pattern	Standby state	RTO	RPO	Cost
Backup & restore	Backups in another region; nothing running	Hours+	Since last backup	Lowest
Cold / pilot light	Core data replicated (e.g. cross-region replica); app off	Tens of min	Small (replica lag)	Low
Warm standby	Scaled-down full stack running in DR region	Minutes	Small	Medium
Hot / active-active	Both regions serving (global LB, Spanner/replicated data)	Near-zero	Near-zero	Highest + complexity

Building blocks: cross-region Cloud SQL/AlloyDB replicas (promote for DR) or Spanner/Firestore multi-region (built-in); Cloud Storage dual/multi-region or Transfer for objects; regional PD and PD snapshots for VMs; the global external Application LB for automatic cross-region failover; and Cloud DNS for DNS-based failover where the LB doesn't cover it.

Architect note - global LB simplifies app DR

Because the global external Application LB uses one anycast IP with backends in multiple regions and health-based routing, a regional outage fails traffic over automatically for stateless tiers - no DNS TTL wait. That removes a lot of classic DR plumbing for the app layer. The hard part remains the data tier: pick cross-region replicas (promote) or a multi-region database (Spanner/Firestore) per your RTO/RPO, and rehearse the promotion.

Common mistake - active-active data is hard

Stateless tiers go active-active easily behind a global LB; stateful databases generally do not without conflict handling. Most "active-active" requirements are met by active/passive with fast failover, or by using a database that is natively multi-region (Spanner). Don't take on multi-master complexity unless the requirement truly demands it.

RTO and RPO

RTO - how long you can be down → drives standby readiness and automation.
RPO - how much data you can lose → drives replication mode (synchronous vs async vs backup interval).
Zero data loss needs synchronous replication (regional HA, or Spanner multi-region) and low inter-region latency; verify the network and the performance trade-off.
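One way to turn RTO/RPO targets into a starting pattern is a simple decision function over the DR patterns table. The minute thresholds below are illustrative assumptions chosen to match the table's "hours / tens of minutes / minutes / near-zero" bands, not official guidance; adjust them to your own tiers.

```python
def pick_dr_pattern(rto_minutes: float, rpo_minutes: float,
                    backup_interval_minutes: float = 24 * 60) -> str:
    """Suggest the cheapest DR pattern that can plausibly meet the targets."""
    if rto_minutes < 0 or rpo_minutes < 0:
        raise ValueError("targets must be non-negative")
    if rto_minutes <= 1:
        pattern = "hot / active-active"
    elif rto_minutes <= 15:
        pattern = "warm standby"
    elif rto_minutes <= 120:
        pattern = "cold / pilot light"
    else:
        pattern = "backup & restore"
    # Backups alone can lose up to one backup interval of data, so a
    # tighter RPO needs replicated data even when the RTO is relaxed.
    if pattern == "backup & restore" and rpo_minutes < backup_interval_minutes:
        pattern = "cold / pilot light"
    return pattern


def replication_mode(rpo_minutes: float) -> str:
    """Zero RPO implies synchronous replication; anything else can be async."""
    return "synchronous" if rpo_minutes == 0 else "asynchronous"


if __name__ == "__main__":
    for rto, rpo in [(480, 1440), (480, 5), (10, 5), (0.5, 0)]:
        print(rto, rpo, pick_dr_pattern(rto, rpo), replication_mode(rpo))
```

The output is a starting point for the design conversation; cost, data-tier capabilities, and drill results decide the final pattern.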

Architecture examples

One global LB fronts both regions (auto failover for stateless tiers); the database uses a cross-region replica promoted on failover.

On-prem VM → GCP: Migrate to VMs; cut over with the network in place.
DB → Cloud SQL: DMS continuous, cut over at low replication lag.
PostgreSQL → AlloyDB: DMS; validate performance/analytics gains.
Oracle → Compute Engine (where required): self-managed, Data Guard to a second-region VM, sole-tenant for licensing.
Cross-region DR (app): global LB + warm MIG + storage replication.
Cross-region DR (database): cross-region replica (promote) or multi-region Spanner/Firestore.
Backup-based DR: Backup and DR service / cross-region snapshots, rebuild on demand.

DR testing

DR you have never tested is a hope, not a plan

Run regular drills: replica promotion, region failover, and full app validation in DR (not just "the DB opened"). Verify RTO/RPO are actually met, that CMEK keys exist and are usable in the DR region (a missing key makes the replica unusable), that the global LB/DNS failover works, and that runbooks and connection strings are current. Backup and DR service can help codify and rehearse plans.

Cross-region replica within RPO (monitor replication lag); promotion rehearsed.
CMEK keys present and usable in the DR region.
App tier can start and connect in DR; config points to DR endpoints.
Global LB / Cloud DNS failover tested and time-measured.
Object data (dual/multi-region or replicated) within RPO.
Capacity available in DR (reservations if RTO is tight); runbook current.

Official documentation: DR planning & migration →

14. Cost Management and Governance

How Google Cloud charges, the tools to track and cap spend (billing export, budgets, quotas), the discount levers (CUDs, sustained-use, Spot), and the governance model - ending in a monthly cost-review checklist.

Last reviewed: July 2026 Pricing and discount models change - verify all rates on the pricing pages.

TL;DR

Google Cloud bills mainly by compute (vCPU/memory-hours), storage GB, BigQuery bytes scanned or slots, and network egress. Track with billing export to BigQuery + Cost tables/dashboards, cap with Budgets & alerts (notify) and quotas (block). The big levers: committed use discounts for steady state, sustained-use discounts (automatic on some families), Spot VMs for fault-tolerant work, right-sizing / custom machine types, BigQuery query controls, storage lifecycle, and killing idle resources. Governance = resource hierarchy + Org Policies + budgets + labels.

Pricing basics

Dimension	Charged on	Notes
Compute Engine	vCPU + memory per second (per machine type)	Sustained-use discounts on some families; CUDs; Spot for big savings
Persistent Disk / storage	Provisioned GB-month (+ IOPS/throughput for Hyperdisk)	Snapshots add up; choose disk type deliberately
Cloud Storage	GB-month by class + operations + retrieval (colder classes)	Lifecycle to colder classes; watch retrieval/egress
BigQuery	Bytes scanned (on-demand) or slot-hours (capacity) + storage	Query design and partitioning dominate cost
Network	Internet egress + inter-region + some inter-zone; ingress free	Keep chatty services co-located; use private access
Managed services	Per-service (Cloud SQL vCPU/RAM/storage, Cloud Run per-request, etc.)	Right-size; scale-to-zero where possible
Logging / Monitoring	Ingestion volume (logs), some metrics	Exclude noisy logs; set retention

Cost tracking tools

Tool	Does
Billing export to BigQuery	Detailed usage/cost data for your own analysis and dashboards (turn on early).
Cost table / Cost breakdown / Reports	Console views of spend by project, service, SKU, label, time.
Budgets & alerts	Track spend against a target per billing account/project/label; alert at thresholds (and optionally trigger Pub/Sub automation). Budgets notify - they don't block.
Quotas	Hard caps on resource usage per project - the "block" control.
Pricing Calculator	Estimate before you build.
Recommender / Active Assist	Right-sizing, idle-resource, and commitment recommendations.

Cost note - budgets alert, quotas enforce

Use them together: a budget warns you spend is trending over; a quota stops a project creating the expensive thing. Attribute everything via labels + billing export to BigQuery so chargeback works. This only works if labeling is enforced from the start (section 1).
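Label-based attribution can be sketched in a few lines. The row shape below (a cost value plus a list of key/value label pairs) is an assumption modeled loosely on the billing export; check the actual export schema before relying on field names. The label key and values are hypothetical.

```python
from collections import defaultdict


def cost_by_label(rows, label_key: str, unlabeled: str = "(unlabeled)") -> dict:
    """Sum cost per value of one label; rows without it land in a bucket."""
    totals = defaultdict(float)
    for row in rows:
        labels = {item["key"]: item["value"] for item in row.get("labels", [])}
        totals[labels.get(label_key, unlabeled)] += row["cost"]
    return dict(totals)


# Hypothetical rows standing in for exported billing data.
ROWS = [
    {"cost": 10.0, "labels": [{"key": "cost-center", "value": "finance"}]},
    {"cost": 2.5, "labels": [{"key": "cost-center", "value": "finance"},
                             {"key": "env", "value": "prod"}]},
    {"cost": 4.0, "labels": []},
]

if __name__ == "__main__":
    for owner, total in cost_by_label(ROWS, "cost-center").items():
        print(f"{owner}: {total:.2f}")
```

The size of the unlabeled bucket is a useful governance metric in itself: if it grows, chargeback is drifting away from reality.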

Discounts

Committed use discounts (CUDs) - 1 or 3-year commitment (resource-based or spend-based) for a substantial discount on steady-state usage.
Sustained-use discounts - automatic discounts for running certain machine families a large fraction of the month.
Spot VMs - 60-90% off in exchange for preemption at any time; use them for fault-tolerant workloads.
Custom machine types - stop paying for vCPU or memory you don't use.
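Why commitments belong on the baseline rather than the peak can be shown with simple arithmetic: committed capacity is paid for whether or not it is used. The discount rate and unit price below are placeholders, and the usage profile is invented for illustration.

```python
def hourly_cost(usage: float, committed: float,
                on_demand_rate: float, cud_discount: float) -> float:
    """Cost of one hour: the commitment is billed even when unused."""
    committed_cost = committed * on_demand_rate * (1 - cud_discount)
    overflow = max(usage - committed, 0)
    return committed_cost + overflow * on_demand_rate


def total_cost(hourly_usage, committed: float,
               on_demand_rate: float, cud_discount: float) -> float:
    """Sum the hourly cost over a usage profile."""
    return sum(hourly_cost(u, committed, on_demand_rate, cud_discount)
               for u in hourly_usage)


if __name__ == "__main__":
    # Hypothetical day: 40 vCPUs overnight, 100 vCPUs during business hours.
    profile = [40] * 12 + [100] * 12
    for commit in (0, 40, 100):
        cost = total_cost(profile, commit, on_demand_rate=1.0, cud_discount=0.3)
        print(f"commit {commit:>3} vCPUs -> {cost:.0f} units/day")
```

With these placeholder numbers, committing to the 40-vCPU baseline lowers the bill, while committing to the 100-vCPU peak costs the same as no commitment at all.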

Governance model

Governance is enforced through the same primitives as security: the resource hierarchy (projects/folders for isolation and attribution), Organization Policies (restrict locations, machine types, external IPs), budgets + quotas, labels/tags, and a landing zone deployed as code. This keeps spend controlled and attributable by design rather than by cleanup.

Cost optimization examples

Action	Typical saving	Effort
Stop / schedule non-prod VMs off-hours	High (up to ~65-70% of that compute)	Low
Right-size VMs (Recommender) / custom machine types	High	Low
Committed use discounts for baseline	High	Medium
Spot VMs for fault-tolerant / batch	Very high (60-90%)	Medium
Choose the right disk type / delete unused disks	Medium	Low
Cloud Storage lifecycle to colder classes	Medium-High	Low
BigQuery: partition/cluster, max-bytes-billed, capacity vs on-demand	High for heavy BQ	Medium
Reduce logging ingestion (exclusion filters)	Medium	Low
Reduce inter-region / egress traffic	Medium	Medium
Delete old snapshots & unused external IPs	Medium	Low
Cloud SQL right-sizing / stopping idle non-prod instances	Medium-High	Low

Cost note - cheap wins first

Before re-architecting, do the high-ROI basics: schedule non-prod off-hours, act on Recommender right-sizing, apply storage lifecycle, buy CUDs for baseline, and fix BigQuery query patterns. Idle external IPs, orphaned disks, and unbounded log ingestion quietly add up - clean them monthly.
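The saving from scheduling non-prod off-hours is straightforward arithmetic on the 168 hours in a week. This sketch covers only the vCPU/memory portion of a VM; attached disks and reserved IPs keep billing while the instance is stopped.

```python
HOURS_PER_WEEK = 7 * 24


def off_hours_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by running on a schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule is outside a 24x7 week")
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


if __name__ == "__main__":
    for hours, days in [(12, 5), (10, 5), (24, 7)]:
        print(f"{hours}h x {days}d -> {off_hours_saving(hours, days):.0%} saved")
```

A 12-hour weekday schedule avoids about 64% of the compute hours and a 10-hour schedule about 70%, which is where the "~65-70%" figure in the table comes from.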

Monthly Google Cloud cost review checklist

Review cost reports month-over-month by project, service, and label; investigate spikes.
Check each budget: which projects/labels are over or trending over target.
Act on Recommender right-sizing and idle-resource recommendations.
Confirm non-prod stop/schedule ran (nothing running 24x7 by accident).
Find and delete unused persistent disks, orphaned snapshots, and idle VMs.
Release unused external (static) IPs - they bill when unattached.
Review CUD coverage vs. steady-state usage; buy/adjust commitments.
BigQuery: top queries by bytes scanned; add partitioning/clustering; set max-bytes-billed; review on-demand vs capacity.
Cloud Storage: are lifecycle rules moving cold data to Nearline/Coldline/Archive?
Logging/Monitoring ingestion: exclude noisy logs; check retention settings.
Review egress / inter-region charges; co-locate chatty services; use private access.
Cloud SQL / managed DB sizing vs. actual utilization; scale down over-provisioned.
Confirm every resource is labeled (cost-center/env/owner) for attribution.
Validate quotas still reflect intent; check for anomalous new spend by service.

Official documentation: Cost management, Budgets & CUDs →

15. Enterprise Architecture Patterns

Reference blueprints for real Google Cloud deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026 Blueprints are starting points - validate sizing/services against current docs and requirements.

HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: global external App LB + Cloud Armor → private compute (Cloud Run / regional MIG / GKE) → managed database on private IP → Private Google Access / PSC for Google APIs → centralized logging → cross-region DR, all inside a governed landing zone.

Foundational three-tier (reference backbone)

Three-tier enterprise application

The pattern most others extend

Business case	Standard internal/external web or enterprise app needing HA and controlled exposure.
Services	Shared VPC, global external App LB + Cloud Armor + Cloud CDN, Cloud Run or regional MIG, Cloud SQL/AlloyDB (private IP), Cloud NAT, PGA/PSC, Secret Manager, Cloud Monitoring/Logging.
Traffic flow	User → Cloud Armor/LB → app (private) → DB (private IP); app → Google APIs via PGA/PSC; egress via Cloud NAT.
Security	No external IPs; firewall by SA; DB private; CMEK; secrets in Secret Manager; Org Policies + VPC-SC; SCC on.
HA	Regional MIG / Cloud Run across zones; Cloud SQL HA; LB health-based routing.
DR	Second-region backends behind the same global LB + cross-region DB replica.
Monitoring	LB/backends, app, DB metrics; alerts → notification channels; central logs.
Cost	Cloud Run scale-to-zero or right-sized MIG + CUD; storage lifecycle; BQ controls.
Risks / mistakes	Health-check firewall rule missing; DB public IP; no zone spread; secrets in code.

Pattern library

Simple web application Small

Case	Low-complexity site/app, cost-sensitive.
Services	Cloud Run + Cloud SQL (or Firestore) + Cloud Storage for assets + global LB + Cloud Armor.
HA/DR/cost	Cloud Run multi-zone by default; Cloud SQL HA; scale-to-zero. Risk: public DB IP, no backups.

Highly available application HA

Case	Must survive zone (and ideally region) failure.
Services	Regional MIG / Cloud Run across zones, regional PD for state, Cloud SQL HA or Spanner, global LB.
DR	Second-region backends + cross-region DB replica or multi-region Spanner. Risk: state on a single zonal disk; untested failover.

Private enterprise application Regulated

Case	Internal-only, reachable from on-prem, no public footprint.
Services	Private subnets, internal Application LB, HA VPN/Interconnect via Cloud Router, PGA/PSC, IAP for admin, no external IPs.
Security/risk	VPC-SC perimeter; hierarchical firewall; DB private IP. Risk: CIDR overlap; DNS forwarding gaps.

Shared VPC / centralized networking & security (Platform)

Case	Many teams/projects with centrally-governed network, security, and logging.
Services	Host project (Shared VPC), hierarchical firewall, Cloud NAT/DNS, central logging project (aggregated sink), security project (SCC, KMS), org-level Org Policies.
Risk	Under-scoped `networkUser` grants; teams creating shadow VPCs; missing perimeter.

Multi-project landing zone (Governance)

Case	Governed foundation before workloads land.
Services	Org/folder/project hierarchy, baseline IAM (groups), preventive Org Policies, Shared VPC, central logging + SCC, budgets/quotas, labels - all Terraform (Cloud Foundation blueprints).
Risk	Skipping it and retrofitting governance later.

Cloud SQL / AlloyDB application (DB)

Case	Relational app backend.
Services	Cloud SQL (or AlloyDB) private IP + HA + PITR, Auth Proxy / Serverless VPC Access, cross-region replica for DR, Query Insights.
Risk	Public IP + broad authorized networks; HA not enabled; untested restore.

BigQuery analytics platform / data lake (Data)

Case	Enterprise analytics on curated + raw data.
Services	Cloud Storage lake (zones) + BigQuery/BigLake + Dataflow/Dataform + Datastream (CDC) + Dataplex (govern) + Looker; VPC-SC perimeter.
Cost/risk	Partition/cluster + query controls; column-level security. Risk: ungoverned "data swamp"; runaway scans.

GKE platform (Cloud native)

Case	Container platform for many microservices with CI/CD.
Services	Private GKE (Autopilot or Standard) in Shared VPC, Gateway API LB, Workload Identity, Artifact Registry (scanned) + Binary Authorization, Cloud Build/Deploy, service mesh optional.
Risk	Pod IP exhaustion; over-privileged Workload Identity; public control plane.

Cloud Run serverless application (Serverless)

Case	Stateless services/APIs with minimal ops.
Services	Cloud Run (services + jobs) behind global LB + Cloud Armor, Serverless VPC Access to private DB, Pub/Sub/Eventarc for async, Secret Manager.
Cost/risk	Scale-to-zero; per-request billing. Risk: cold-start latency for spiky critical paths (use min instances).
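
The service choices in this pattern can be sketched as a single deploy helper. This is a minimal sketch under assumptions, not a definitive deployment: the function name deploy_private_api, the secret name db-password, and the env var DB_PASSWORD are hypothetical, and the flags should be verified with gcloud run deploy --help.

```shell
# Deploy a Cloud Run service that is reachable only through the load balancer
# (or internal traffic), reaches the private DB through a Serverless VPC Access
# connector, reads its DB password from Secret Manager, and keeps one warm
# instance to soften cold starts on critical paths.
deploy_private_api() {  # usage: deploy_private_api SERVICE IMAGE REGION RUNTIME_SA CONNECTOR
  gcloud run deploy "$1" --image="$2" --region="$3" \
    --service-account="$4" \
    --vpc-connector="$5" \
    --ingress=internal-and-cloud-load-balancing \
    --set-secrets="DB_PASSWORD=db-password:latest" \
    --min-instances=1
}
```

Setting --min-instances above zero trades the scale-to-zero saving for lower cold-start latency, so it is usually reserved for the latency-sensitive services only.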

Event-driven architecture (Events)

Case	Decoupled, resilient processing pipelines.
Services	Eventarc + Pub/Sub + Cloud Run/Functions + Workflows + Cloud Tasks/Scheduler; dead-letter topics.
Risk	Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers.

Hybrid cloud (Hybrid)

Case	Workloads split across on-prem and GCP.
Services	Interconnect (primary) + HA VPN (backup) via Cloud Router, Shared VPC / NCC, hybrid Cloud DNS, hierarchical firewall.
Risk	CIDR overlap; expecting transitive peering; single link with no backup; asymmetric routing.

Multi-region DR (DR)

Case	Business-critical stack needing regional resilience.
Services	Global LB with multi-region backends, cross-region DB replica or multi-region Spanner/Firestore, dual/multi-region Cloud Storage, reservations, Backup and DR service.
Risk	Untested DR; CMEK key missing in DR region; capacity unavailable at failover.

Secure landing zone (Security)

Case	Preventive-guardrail foundation.
Services	Org Policies (no external IP, location restriction, no SA keys, OS Login), VPC-SC perimeters, central SCC + logging, KMS/Secret Manager, break-glass, budgets/quotas - as code.
Risk	Guardrails left off; over-broad break-glass.

GenAI with private enterprise data (AI)

Case	RAG/assistant over internal data, governed.
Services	Cloud Storage (docs) + vectors in AlloyDB/BigQuery + Gemini (Vertex AI) or Vertex AI Search behind a Cloud Run serving API + Secret Manager + VPC-SC + logging.
Flow / risk	Query → serving layer (authz + guardrails) → entitlement-filtered retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings).

Common mistakes across all patterns

Databases/services on external IPs "to get it working"; missing private access planning.
No zone/region spread - a zone event takes the whole "HA" tier.
Health-check firewall ranges forgotten, so LB backends are unhealthy on day one.
DR designed but never tested; CMEK keys missing in the DR region.
Secrets in code/metadata instead of Secret Manager; long-lived SA keys.
No centralized logging/SCC until an incident needs it.
CIDR overlap / expecting transitive peering discovered during hybrid setup.
Landing zone / Org Policies skipped and retrofitted painfully later.

Official documentation: Google Cloud reference architectures →

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and gcloud where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.

Last reviewed: July 2026 Verify gcloud syntax with gcloud <group> --help.

General method

Work top-down: identity/API (is the caller allowed? is the API enabled? right project?) → network (route + firewall both ways + private access) → host/service (listening/healthy?) → data. For "cannot reach," run a Connectivity Test; for "permission denied," use Policy Troubleshooter and the access mental model in section 2. Almost everything is in Cloud Logging and Cloud Audit Logs.
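
The top-down order can be sketched as three small helpers, one per layer. This is a minimal sketch under assumptions: the function names and argument order are hypothetical, and each gcloud command should be checked with gcloud <group> --help before use.

```shell
# 1. Identity/API: print the service name if the API is enabled in the project
#    (empty output means it is not enabled).
check_api_enabled() {  # usage: check_api_enabled PROJECT SERVICE
  gcloud services list --enabled --project "$1" \
    --filter="config.name=$2" --format="value(config.name)"
}

# 2. Identity: ask Policy Troubleshooter why a principal has or lacks a
#    permission on a project.
troubleshoot_permission() {  # usage: troubleshoot_permission PROJECT EMAIL PERMISSION
  gcloud policy-intelligence troubleshoot-policy iam \
    "//cloudresourcemanager.googleapis.com/projects/$1" \
    --principal-email="$2" --permission="$3"
}

# 3. Network: create a Connectivity Test from a VM to a destination IP and port.
test_reachability() {  # usage: test_reachability TEST_NAME PROJECT ZONE VM DEST_IP DEST_PORT
  gcloud network-management connectivity-tests create "$1" \
    --source-instance="projects/$2/zones/$3/instances/$4" \
    --destination-ip-address="$5" --destination-port="$6" --protocol=TCP
}
```

For example, check_api_enabled my-project sqladmin.googleapis.com rules out a disabled API before any time is spent on firewall rules.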

Compute | Storage | Network | LB | DB | IAM | Serverless | GKE | Observability

Compute & access

⚑ VM not reachable / SSH / OS Login issue

Symptoms: SSH times out or is denied. Causes: firewall doesn't allow SSH from your source (allow the IAP range 35.235.240.0/20 and use IAP, or your CIDR); VM has no external IP and you're not using IAP; OS Login enabled but you lack roles/compute.osLogin (or osAdminLogin) / 2FA; VM stopped or boot failed; OS firewall. Checks: serial console for boot; IAM roles; firewall; gcloud compute ssh --tunnel-through-iap. Fix: grant OS Login role, open IAP range, use IAP tunneling. Prevention: standardize IAP + OS Login; no external IPs.

gcloud compute ssh VM --tunnel-through-iap --zone=ZONE
gcloud compute instances get-serial-port-output VM --zone=ZONE

⚑ VM boot issue

Causes: bad fstab mount, full boot disk, kernel/driver issue, failed startup script. Checks: serial console output; startup-script logs. Fix: detach the boot disk, attach to a rescue VM, correct config, reattach; keep boot-disk snapshots. Prevention: test image changes in non-prod.

⚑ High CPU / memory pressure / disk full

CPU: check the trend in Cloud Monitoring; on the host, run top to find the process; then right-size or autoscale. Memory: requires the Ops Agent (memory isn't collected by default) - install it, then right-size. Disk full: resize the PD online, grow the filesystem, alert at 85%; clean logs/temp.
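
The 85% disk line can be checked on the host with a short filter over df output. This is a minimal sketch: the function name disks_over_threshold is hypothetical, and it assumes mount points without spaces.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` style output on stdin, so it can be exercised with sample text.
disks_over_threshold() {  # usage: df -P | disks_over_threshold 85
  awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6, $5 }'
}

# Check the local machine against the 85% alert line.
df -P | disks_over_threshold 85
```

After the disk itself is resized (for example with gcloud compute disks resize DISK --size=200GB --zone=ZONE), the partition and filesystem may still need to be grown inside the guest before df reports the new space.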

⚑ Persistent Disk attachment issue

Causes: disk in a different zone than the VM; not formatted/mounted; wrong device name. Checks: gcloud compute instances describe; lsblk. Fix: attach in the same zone, format & mount by UUID in fstab. Prevention: regional PD for HA; automate mount in the startup script.

Storage

⚑ Cloud Storage access denied / public access issue

Denied: missing IAM (for example roles/storage.objectViewer for reads or roles/storage.objectAdmin for writes) at bucket or project; wrong project; UBLA on but you relied on an ACL; VPC-SC perimeter blocking; API disabled. Public access blocked: Org Policy storage.publicAccessPrevention is (correctly) enforcing - don't disable it; use signed URLs or IAM instead. Checks: gcloud storage buckets get-iam-policy gs://BUCKET (legacy: gsutil iam get gs://BUCKET); Policy Troubleshooter; audit logs. Fix: grant least-privilege IAM; use signed URLs for external sharing.

⚑ Filestore mount issue

Causes: firewall blocking NFS between client and Filestore; wrong mount IP/path; client not in the allowed network. Fix: allow NFS ports from the client subnet, verify the mount target IP and export path, ensure same VPC/region reachability.

Network

⚑ VPC routing / firewall / Cloud NAT / Private Google Access / PSC / peering / Shared VPC

Method: run a Connectivity Test (names the blocking route/firewall) and Network Analyzer for config issues. Then per case:

Firewall: implied deny-ingress; check priority/direction; allow health-check (35.191.0.0/16, 130.211.0.0/22) and IAP (35.235.240.0/20) ranges; prefer SA targeting.
Route: a custom route shadowing the default internet route; missing dynamic route from Cloud Router.
Cloud NAT: outbound-only; port exhaustion (raise the minimum ports per VM / enable dynamic port allocation); NAT not covering the subnet/region.
PGA: not enabled on the subnet; DNS/route for private.googleapis.com missing; VPC-SC blocking.
PSC: endpoint/DNS mapping wrong; producer not accepting the connection.
Peering: not transitive; overlapping ranges; missing firewall for the peer range.
Shared VPC: service-project SA lacks roles/compute.networkUser on the host project or subnet; resources created in the wrong network.
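
Overlapping ranges come up in the peering case here and again in hybrid setups, and they can be checked before any peering or VPN is built. The following is a minimal sketch in plain POSIX shell for IPv4 only; the function names ip_to_int and cidrs_overlap are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split on "." into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) when two IPv4 CIDR ranges overlap.
# Two ranges overlap exactly when they agree on the bits of the shorter prefix.
cidrs_overlap() {  # usage: cidrs_overlap 10.10.0.0/20 10.10.8.0/24
  a=$(ip_to_int "${1%/*}"); a_len=${1#*/}
  b=$(ip_to_int "${2%/*}"); b_len=${2#*/}
  if [ "$a_len" -lt "$b_len" ]; then len=$a_len; else len=$b_len; fi
  mask=$(( (4294967295 << (32 - $len)) & 4294967295 ))
  [ $(( $a & $mask )) -eq $(( $b & $mask )) ]
}
```

For example, cidrs_overlap 10.10.0.0/20 10.10.8.0/24 succeeds (the /24 sits inside the /20), while cidrs_overlap 10.10.0.0/20 10.20.0.0/16 fails, so those two ranges can coexist.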

⚑ VPN down / Interconnect issue / Cloud Router BGP

Causes: IKE/IPSec parameter mismatch (VPN); attachment/BGP session down (Interconnect); Cloud Router not advertising subnets or not learning on-prem routes; CIDR overlap. Checks: tunnel/attachment status; BGP session state and advertised/learned routes. Fix: align IKE params, fix BGP advertisements both ways, resolve overlap. Prevention: Interconnect + HA VPN backup; alarms on tunnel/BGP state.
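
The tunnel and BGP checks can be sketched as two read-only helpers. The function names are hypothetical, and the output fields should be confirmed against the current gcloud reference.

```shell
# Show the tunnel state and the human-readable reason behind it.
tunnel_status() {  # usage: tunnel_status TUNNEL REGION
  gcloud compute vpn-tunnels describe "$1" --region="$2" \
    --format="value(status,detailedStatus)"
}

# Show Cloud Router runtime state: BGP peer status plus advertised and learned routes.
bgp_status() {  # usage: bgp_status ROUTER REGION
  gcloud compute routers get-status "$1" --region="$2"
}
```

If the tunnel is established but on-prem prefixes are missing from the learned routes, the problem is usually on the BGP side rather than in the IKE parameters.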

⚑ Cloud DNS / private DNS issue

Causes: private zone not attached to the VPC; missing record; forwarding/peering not set for hybrid; wrong resolver. Checks: dig from a VM; zone attachment. Fix: attach the private zone, add records, set inbound/outbound server policies for on-prem forwarding.

Load balancer & databases

⚑ LB backend unhealthy / SSL cert issue

Unhealthy: firewall not allowing health-check ranges (35.191.0.0/16, 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app listening only on localhost; wrong backend protocol/named port. SSL: Google-managed cert stuck PROVISIONING because DNS doesn't point at the LB IP yet; missing SAN; expired self-managed cert. Fix: per section 7.
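
The three most common checks for this entry can be sketched as helpers. This is a sketch under assumptions: the function names are hypothetical, the backend service and certificate are assumed to be global, and the rule targets backends by service account.

```shell
# Allow Google health-check probes to reach the backend port.
allow_health_checks() {  # usage: allow_health_checks RULE NETWORK PORT TARGET_SA
  gcloud compute firewall-rules create "$1" \
    --network="$2" --direction=INGRESS --action=ALLOW --rules="tcp:$3" \
    --source-ranges=35.191.0.0/16,130.211.0.0/22 \
    --target-service-accounts="$4"
}

# Show per-backend health as the load balancer sees it.
backend_health() {  # usage: backend_health BACKEND_SERVICE
  gcloud compute backend-services get-health "$1" --global
}

# Show the provisioning status of a Google-managed certificate.
managed_cert_status() {  # usage: managed_cert_status CERT
  gcloud compute ssl-certificates describe "$1" --global \
    --format="value(managed.status)"
}
```

Running backend_health before and after the firewall change confirms whether the rule, rather than the application, was the cause.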

⚑ Cloud SQL connection / performance / backup issue

Connection: use the Auth Proxy or private IP; check the runtime SA has roles/cloudsql.client; authorized networks / SSL for public IP; Serverless VPC Access for Cloud Run/Functions. Performance: Query Insights + Cloud Monitoring (CPU/connections/storage/lag); add read replicas; tune queries/flags. Backup failed: check storage, PITR/binary logging enabled, and quota; test a restore.

gcloud sql instances describe INSTANCE
./cloud-sql-proxy --private-ip PROJECT:REGION:INSTANCE

IAM & service accounts

⚑ IAM permission denied / impersonation / SA key issue

Permission denied: walk the section 2 mental model - right project? which principal? role/permission? scope/inheritance? deny policy? Org Policy? VPC-SC? API enabled? For workloads: does the caller have actAs/tokenCreator on the SA, and does the SA have the role on the target? Impersonation: caller needs roles/iam.serviceAccountTokenCreator on the SA. SA key issue: key creation may be blocked by Org Policy (good) - use impersonation/Workload Identity instead; a key that was deleted during rotation or disabled after a leak stops working, so confirm which key the workload still presents. Tools: Policy Troubleshooter, Policy Analyzer, audit logs.
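
The impersonation check can be sketched as two helpers: one lists who holds roles on the service account itself, the other proves end to end that the current caller can mint a token. The function names are hypothetical.

```shell
# List the IAM bindings ON the service account (who may act as it or mint tokens).
sa_bindings() {  # usage: sa_bindings SA_EMAIL
  gcloud iam service-accounts get-iam-policy "$1"
}

# Try to mint a short-lived token as the SA; the token itself is discarded.
can_impersonate() {  # usage: can_impersonate SA_EMAIL
  if gcloud auth print-access-token --impersonate-service-account="$1" >/dev/null 2>&1; then
    echo "impersonation ok"
  else
    echo "impersonation failed"
    return 1
  fi
}
```

A failure here points at the caller's grant on the SA; a success followed by a denied API call points at the SA's own roles on the target resource.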

Serverless & GKE

⚑ Cloud Run revision / Cloud Functions timeout / Pub/Sub backlog

Cloud Run: container must listen on $PORT and start fast; check the runtime SA permissions and startup CPU; roll back a bad revision. Functions timeout: raise timeout/memory, make idempotent, offload long work. Pub/Sub backlog: slow/failing subscriber - check ack deadline, errors, scale consumers, add a dead-letter topic.
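
Two of the fixes above can be sketched as helpers. The function names are hypothetical, and the delivery-attempt and ack-deadline values are illustrative assumptions rather than recommendations.

```shell
# Send all traffic back to a known-good Cloud Run revision.
rollback_revision() {  # usage: rollback_revision SERVICE REGION REVISION
  gcloud run services update-traffic "$1" --region="$2" --to-revisions="$3=100"
}

# Attach a dead-letter topic and lengthen the ack deadline on a subscription.
add_dead_letter() {  # usage: add_dead_letter SUBSCRIPTION DLQ_TOPIC
  gcloud pubsub subscriptions update "$1" \
    --dead-letter-topic="$2" --max-delivery-attempts=5 --ack-deadline=60
}
```

Dead-lettering only works once the Pub/Sub service agent can publish to the dead-letter topic and subscribe to the source subscription, so check those grants if messages never arrive there.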

⚑ GKE pod not starting / Ingress issue

Pod: Pending (capacity / pod-IP exhaustion / requests too big), ImagePullBackOff (Artifact Registry read perms / PGA for private cluster), CrashLoopBackOff (config/probes). Ingress: allow health-check ranges; readiness probe aligned; managed cert needs DNS → LB IP first; verify BackendConfig/NEG. Tools: kubectl describe/logs. (Section 10.)
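
The kubectl describe/logs step can be sketched as one helper that collects the three outputs that usually distinguish Pending, ImagePullBackOff, and CrashLoopBackOff. The function name diagnose_pod is hypothetical.

```shell
# Gather the usual evidence for a pod that will not start.
diagnose_pod() {  # usage: diagnose_pod NAMESPACE POD
  # Events at the bottom show scheduling, image pull, and probe failures.
  kubectl describe pod "$2" -n "$1"
  # Logs of the previous (crashed) container; errors if the pod never started.
  kubectl logs "$2" -n "$1" --previous --tail=50
  # Namespace events in time order, for capacity and IP-exhaustion messages.
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}
```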

Observability

⚑ Alert not firing / logs missing

Alert: wrong metric/filter, threshold/duration never met, policy disabled, notification channel unverified, or maintenance suppression. Test by forcing the condition; use absence conditions for heartbeats. Logs missing: log not enabled (e.g. Data Access audit logs off), Ops Agent not installed on the VM, a log exclusion filter dropping them, wrong project/log bucket, or retention expired. Fix: enable the log, install the agent, review Log Router sinks/exclusions.

Official documentation: Google Cloud troubleshooting →

17. gcloud CLI, Terraform, and Automation

Practical, copy-friendly automation: gcloud setup and configurations, service-account impersonation, common commands, the Google Terraform provider, and clean examples for VPC, VMs, buckets, IAM, and alerts - plus state and structure practices.

Last reviewed: July 2026 Verify provider version and resource argument names against current docs.

gcloud setup | Impersonation | Common commands | Terraform setup | VPC | VM | Bucket | IAM | Alert | State & structure

TL;DR

The gcloud CLI uses named configurations (account + project + region). Prefer Application Default Credentials and service-account impersonation over downloaded keys. Build production infrastructure with Terraform (the Google provider); keep state remote and locked in a Cloud Storage backend, structure code into modules, and separate environments by workspace/backend + tfvars. Run it in a pipeline (Cloud Build) with an impersonated deployer SA - no keys.

gcloud CLI setup & configurations

# Install the Cloud SDK, then authenticate
gcloud auth login                       # human login
gcloud auth application-default login    # ADC for local tools/Terraform

# Named configurations (switch between projects/accounts fast)
gcloud config configurations create prod
gcloud config set account jane@example.com
gcloud config set project acme-app-prod-01
gcloud config set compute/region us-central1

gcloud config configurations activate prod
gcloud config configurations list

Service-account impersonation (no keys)

# Your user needs roles/iam.serviceAccountTokenCreator on the SA
gcloud config set auth/impersonate_service_account deployer@PROJECT.iam.gserviceaccount.com
gcloud compute instances list      # now runs AS the SA, short-lived token

# Or per-command:
gcloud storage ls --impersonate-service-account=deployer@PROJECT.iam.gserviceaccount.com

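# Terraform: impersonate via ADC + the provider's impersonate_service_account
# setting (no key file); the Google provider also reads this environment variable:
export GOOGLE_IMPERSONATE_SERVICE_ACCOUNT=deployer@PROJECT.iam.gserviceaccount.com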

Security note

Do not download service-account JSON keys. Use ADC for local dev, impersonation for acting as an SA, and Workload Identity Federation for CI/CD. Block key creation org-wide with iam.disableServiceAccountKeyCreation.

Common gcloud commands

# Projects / APIs
gcloud projects list
gcloud services enable compute.googleapis.com run.googleapis.com --project PROJECT
gcloud services list --enabled

# Compute
gcloud compute instances list
gcloud compute instances create web-1 --machine-type=e2-standard-2 --no-address --zone=us-central1-a
gcloud compute ssh web-1 --tunnel-through-iap --zone=us-central1-a

# Storage
gcloud storage ls
gcloud storage cp -r ./data gs://my-bucket/data

# IAM
gcloud projects add-iam-policy-binding PROJECT --member="group:app@example.com" --role="roles/run.developer"
gcloud projects get-iam-policy PROJECT

# Logs
gcloud logging read 'severity>=ERROR' --limit 20 --freshness 1h

Terraform provider setup

# versions.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 6.0" }   # verify current major
  }
  backend "gcs" {                        # remote, locked state
    bucket = "acme-tfstate-prod"
    prefix = "app"
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
  # Impersonate a deployer SA using ADC - no key file
  impersonate_service_account = var.deployer_sa
}

Create a custom-mode VPC + subnet + Cloud NAT

resource "google_compute_network" "vpc" {
  name                    = "app-vpc"
  auto_create_subnetworks = false        # custom mode
}

resource "google_compute_subnetwork" "app" {
  name                     = "app-us-central1"
  ip_cidr_range            = "10.10.0.0/20"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true         # Private Google Access
  secondary_ip_range {                    # for GKE pods, if needed
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16"
  }
}

resource "google_compute_router" "rt" {
  name    = "app-rt"
  region  = "us-central1"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "app-nat"
  router                             = google_compute_router.rt.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

Create a Compute Engine VM (no external IP, Shielded)

resource "google_compute_instance" "app" {
  name         = "app-1"
  machine_type = "e2-standard-2"
  zone         = "us-central1-a"
  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }
  network_interface {
    subnetwork = google_compute_subnetwork.app.id
    # no access_config block = no external IP
  }
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
  service_account {
    email  = google_service_account.app.email
    scopes = ["cloud-platform"]
  }
  metadata = { enable-oslogin = "TRUE" }
}

Create a hardened Cloud Storage bucket

resource "google_storage_bucket" "data" {
  name                        = "acme-app-data-prod"
  location                    = "US"
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 30 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
  # encryption { default_kms_key_name = google_kms_crypto_key.data.id }  # CMEK
}

IAM binding, service account, and a conditional binding

resource "google_service_account" "app" {
  account_id   = "app-runtime"
  display_name = "App runtime SA"
}

# Least-privilege: only object read on one bucket, granted to the runtime SA
resource "google_storage_bucket_iam_member" "read" {
  bucket = google_storage_bucket.data.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.app.email}"
}

# Conditional project binding (time-boxed / resource-scoped)
resource "google_project_iam_member" "cond" {
  project = var.project_id
  role    = "roles/compute.viewer"
  member  = "group:oncall@example.com"
  condition {
    title      = "prod-hours"
    expression = "request.time.getHours('America/New_York') >= 8 && request.time.getHours('America/New_York') < 20"
  }
}

Create a Cloud Monitoring alert

resource "google_monitoring_notification_channel" "email" {
  display_name = "ops-email"
  type         = "email"
  labels = { email_address = "oncall@example.com" }
}

resource "google_monitoring_alert_policy" "cpu" {
  display_name = "VM CPU high"
  combiner     = "OR"
  conditions {
    display_name = "CPU > 85%"
    condition_threshold {
      filter          = "resource.type=\"gce_instance\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85
      duration        = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
  notification_channels = [google_monitoring_notification_channel.email.id]
}

State, structure, and CI/CD

Remote, locked state: a GCS backend (object versioning on the state bucket) with state locking. Never keep prod state on a laptop; never commit state (it holds secrets).
Modular structure: reusable modules (network, compute, data, iam, monitoring) composed per environment.
Environment separation: separate state per env (workspaces or separate backends/prefixes) driven by dev.tfvars / prod.tfvars; separate projects and ideally separate deployer SAs/pipelines.
CI/CD: run plan/apply in Cloud Build (or another pipeline) as a dedicated deployer SA, with no keys: attach or impersonate the SA in Cloud Build, and use Workload Identity Federation when the pipeline runs outside Google Cloud. Gate apply with approvals; run plan on PRs for review.
No secrets in code: reference Secret Manager / KMS by name; keep secret tfvars out of git.
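
The per-environment practices above can be sketched as two pipeline helpers. This is a minimal sketch assuming a layout with one directory per environment under envs/, each holding its own backend and a tfvars file named after the environment; the function names are hypothetical.

```shell
# Initialise the environment's backend and write a reviewable plan file.
plan_env() {  # usage: plan_env dev   (run from the repo root)
  terraform -chdir="envs/$1" init -input=false &&
  terraform -chdir="envs/$1" plan -input=false -var-file="$1.tfvars" -out=tfplan
}

# Apply exactly the plan that was reviewed, nothing recomputed.
apply_env() {  # usage: apply_env dev   (after plan_env and approval)
  terraform -chdir="envs/$1" apply -input=false tfplan
}
```

Applying the saved plan file rather than re-planning keeps the approved change and the executed change identical, which is the point of gating apply behind review.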

gcp-infra/
  modules/
    network/   compute/   data/   iam/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  cloudbuild.yaml
  README.md

Architect note

Prod and DR should be provably identical because they come from the same modules with different variables. Manual Console changes in prod are the enemy of a working DR - enforce "infrastructure changes go through Terraform," run plan in CI on every PR, and use Cloud Asset Inventory / drift detection to catch out-of-band changes.
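
One way to catch out-of-band changes is a scheduled plan that only reports whether anything differs. This is a sketch under the same assumed envs/ layout; the function name detect_drift is hypothetical. It relies on terraform plan -detailed-exitcode, which exits 0 when there are no changes, 2 when there are, and 1 on error.

```shell
# Report whether the live environment still matches the code on this branch.
detect_drift() {  # usage: detect_drift prod   (run from the repo root)
  rc=0
  terraform -chdir="envs/$1" plan -input=false -detailed-exitcode \
    -var-file="$1.tfvars" >/dev/null || rc=$?
  case "$rc" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

Run against the already-applied main branch, a "drift" result means something changed outside Terraform; on a feature branch it also includes the branch's own pending changes.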

Official documentation: gcloud CLI & Terraform on Google Cloud →

18. Google Cloud Architecture Framework

The five pillars Google uses to review a design - operational excellence, security/privacy/compliance, reliability, cost optimization, and performance. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.

Last reviewed: July 2026 Verify against the current Architecture Framework docs.

HOW TO USE THIS

Run a design (or an existing system) through all five pillars. For each, ask the checklist questions, map to concrete Google Cloud services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass.

Operational excellence | Security & compliance | Reliability | Cost optimization | Performance

Operational excellence

What it means: run, monitor, and improve systems and processes reliably and repeatably - automation, observability, incident response, and change management.

Why it matters: most outages are caused by change and by not seeing problems early. Operational maturity is what turns a good design into a dependable service.

Supporting services: Cloud Monitoring/Logging, Error Reporting/Trace/Profiler, Cloud Build/Deploy, Terraform + Cloud Asset Inventory (IaC + drift), VM Manager, Service Health, Active Assist.

Practical examples: everything as code with peer-reviewed changes; SLOs with error budgets; centralized logs; golden images + automated patching; runbooks tied to alerts; blameless post-mortems.

Common mistakes

Manual Console changes in prod; alerting on causes not symptoms (or alert fatigue); no SLOs; observability bolted on after an incident; no defined incident process.

Review checklist

Is all infra in code with review? Are there SLOs + error budgets? Are logs centralized and audit logs on? Is patching automated and reported? Are alerts symptom-based and actionable? Are runbooks current and rehearsed?

Security, privacy, and compliance

What it means: protect identities, data, and workloads; meet regulatory obligations; and be able to prove it.

Why it matters: a single over-broad grant, public bucket, or long-lived key can undo everything else. Security is a design property, not an add-on.

Supporting services: Cloud IAM (least privilege, groups), Organization Policies, VPC Service Controls, Security Command Center, Cloud KMS/HSM + Secret Manager, Cloud Armor, IAP, Binary Authorization, Sensitive Data Protection, Cloud Audit Logs.

Practical examples: no basic roles; SA keys disabled org-wide; preventive Org Policies; VPC-SC around sensitive data; CMEK; private IPs + IAP; SCC org-wide; centralized audit logs. (See section 8's checklist.)

Common mistakes

Owner/Editor everywhere; long-lived SA keys; public storage; public DB endpoints; secrets in code; no VPC-SC for sensitive data; audit logs off or not centralized; Org Policies unset.

Review checklist

Least privilege via groups/predefined roles? SA keys disabled + impersonation/WIF used? Preventive Org Policies on? VPC-SC around sensitive data? CMEK + Secret Manager? SCC + centralized audit logs? Public exposure minimized? DR keys available cross-region?

Reliability

What it means: the system meets its availability and durability targets and recovers from failures - designed around resource scope (zonal/regional/multi-region), redundancy, and tested DR.

Why it matters: reliability targets (SLOs) drive architecture and cost. You cannot bolt on availability after an outage.

Supporting services: regional MIGs + autohealing, regional PD, global external LB (health-based failover), Cloud SQL HA / cross-region replicas, Spanner/Firestore multi-region, Backup and DR, Cloud Monitoring SLOs.

Practical examples: multi-zone by default (regional resources); a defined DR pattern per tier with tested RTO/RPO; graceful degradation; capacity planning + reservations; error budgets governing release pace.
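
The error-budget idea reduces to simple arithmetic: the budget is the fraction of the window the SLO allows to be bad. A minimal sketch, with a hypothetical function name, that converts an availability SLO into minutes of allowed downtime:

```shell
# Allowed downtime in minutes for an availability SLO over a window of days.
error_budget_minutes() {  # usage: error_budget_minutes SLO_PERCENT WINDOW_DAYS
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30    # prints 43.2
error_budget_minutes 99.99 30   # prints 4.3
```

Moving from 99.9% to 99.99% cuts the monthly budget from about 43 minutes to about 4, which is why each extra nine tends to change the architecture and not just the alert thresholds.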

Common mistakes

Single zonal VM/disk for "prod"; DR never tested; CMEK keys missing in DR; no capacity reservation for failover; assuming stateful active-active is easy.

Review checklist

What scope is each critical resource (zonal/regional/multi-region)? Multi-zone by default? Defined + tested DR pattern per tier with RTO/RPO? Health checks + autohealing? SLOs monitored? Capacity for failover?

Cost optimization

What it means: deliver the required value at the lowest sustainable cost - right-sizing, discounts, eliminating waste, and attributing spend.

Why it matters: unmanaged cloud spend grows silently; cost is a first-class design and operational concern, not a finance afterthought.

Supporting services: billing export to BigQuery, Budgets + alerts, quotas, Recommender/Active Assist, CUDs, Spot VMs, custom machine types, storage lifecycle, BigQuery query controls.

Practical examples: labels + billing export for attribution; CUDs for baseline; Spot for batch; scheduled non-prod shutdown; storage lifecycle; BigQuery partitioning + max-bytes-billed; monthly review (section 14).
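
The max-bytes-billed control can be sketched as a wrapper that estimates first and then runs with a hard cap. The function name bounded_query is hypothetical; the bq flags should be verified against the current bq reference.

```shell
# Dry-run a query to see the bytes it would scan, then run it with a byte cap
# so the job fails instead of billing more than MAX_BYTES.
bounded_query() {  # usage: bounded_query MAX_BYTES 'SELECT ...'
  bq query --use_legacy_sql=false --dry_run "$2" &&
  bq query --use_legacy_sql=false --maximum_bytes_billed="$1" "$2"
}
```

For example, bounded_query 10000000000 'SELECT ...' caps the job at roughly 10 GB scanned; a query that needs more is rejected rather than silently billed.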

Common mistakes

No labels/attribution; over-provisioned VMs and disks; on-demand for steady-state; unbounded BigQuery scans; idle IPs and orphaned disks; noisy log ingestion.

Review checklist

Is spend attributed via labels + billing export? Budgets + quotas in place? CUD coverage for baseline? Right-sizing acted on? Storage lifecycle + BigQuery controls? A recurring cost review?

Performance optimization

What it means: resources meet latency/throughput requirements efficiently as demand changes - right machine types, autoscaling, caching, data locality, and query design.

Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.

Supporting services: machine families (C-series for CPU, M-series for memory, GPUs/TPUs), autoscaling (MIG/Cloud Run/GKE HPA), Cloud CDN, Memorystore, global LB, Hyperdisk (tunable IOPS), BigQuery partitioning/clustering/BI Engine.

Practical examples: match machine family to the bottleneck; autoscale on the right signal; cache at the edge (CDN) and in-memory (Memorystore); co-locate data and compute (reduce egress/latency); partition/cluster BigQuery; load-test before launch.

Common mistakes

Wrong machine family (CPU-bound on a general shape); no autoscaling or scaling on the wrong metric; disk IOPS ceiling mistaken for CPU; chatty cross-region calls; unpartitioned BigQuery scans.

Review checklist

Is the machine family matched to the workload? Autoscaling on a meaningful signal? Caching (CDN/Memorystore) where it helps? Data co-located with compute? Storage/DB performance sized (Hyperdisk IOPS, BQ partitioning)? Load-tested?

Official documentation: Google Cloud Architecture Framework →

19. Learning Path

A structured route from Google Cloud fundamentals to enterprise-grade architecture, security, data, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026 Certification names/exam details change - verify on Google Cloud training before scheduling.

Beginner - Fundamentals: hierarchy, IAM, VPC, Compute, Storage, Monitoring

Intermediate - LB, MIGs, private networking, Cloud SQL, KMS, SCC, cost

Advanced - Shared VPC, VPC-SC, GKE, BigQuery, Dataflow, Vertex AI, DR, Terraform

How to use this

Do the labs, don't just read. Use the Free Tier / a trial project for hands-on. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (Cloud Digital Leader → Associate Cloud Engineer → Professional Cloud Architect, plus role specialties) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations

Goal: deploy and connect basic Google Cloud resources confidently

What to learn

Fundamentals: global infra, regions/zones, resource scope, and the org/folder/project hierarchy (section 1).
IAM basics: principals, predefined roles (not basic), inheritance, groups, service accounts (section 2).
VPC basics: custom-mode VPC, regional subnets, firewall rules, Cloud NAT (section 3).
Compute Engine basics: machine types, images, OS Login, IAP SSH (section 4).
Cloud Storage basics: buckets, classes, UBLA/public access prevention (section 5).
Cloud Monitoring/Logging basics: metrics, the Ops Agent, an alert, logs (section 9).

Why it matters

Every design rests on the hierarchy, IAM, and the global-VPC / regional-subnet model. Get these right and everything later is easier.

Hands-on labs

Create a project; add a group; grant a predefined role; test access.
Build a custom-mode VPC with a subnet, firewall rules, and Cloud NAT.
Launch a VM with no external IP; SSH via IAP; use OS Login.
Create a hardened bucket (UBLA + public access prevention); upload objects.
Install the Ops Agent; create a CPU alert to an email channel.

Common mistakes

Using the default network / auto mode; basic roles; per-user grants; external IPs everywhere; forgetting the Ops Agent for memory.

Expected outcome

You can stand up a properly-segmented VPC, reach a private VM via IAP, use IAM correctly, and see basic telemetry.

Intermediate

Level 2 - Building real workloads

Goal: deploy an HA app + managed database with monitoring, security, and cost control

What to learn

Load balancing (global external App LB + Cloud Armor) and managed instance groups + autoscaling (sections 7, 4).
Private networking: Private Google Access, Cloud NAT, private IP for services; Cloud VPN / Interconnect basics (section 3).
Cloud SQL: HA, private IP, Auth Proxy, backups/PITR, read replicas (section 6).
Cloud Logging (sinks) and Cloud KMS / Secret Manager (sections 9, 8).
Security Command Center and Org Policies basics (section 8).
Cost management: budgets, labels, billing export, CUDs (section 14).

Why it matters

This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.

Hands-on labs

Deploy a 3-tier app: global LB + Cloud Armor → regional MIG (or Cloud Run) → Cloud SQL (private IP, HA).
Allow the health-check ranges; confirm backends healthy; force a failover.
Store the DB password in Secret Manager; connect via the Auth Proxy with a runtime SA.
Create alerts (CPU, unhealthy backend, DB storage) and a notification channel.
Set a budget + labels + billing export; add a couple of Org Policies (no external IP, restrict locations).

Common mistakes

Health-check firewall rule missing; DB on public IP; secrets in code; noisy alerts; no labels for attribution.

Expected outcome

You can deploy a secure, monitored, HA application + managed database, connect it privately, and keep its cost and access under control.

Advanced

Level 3 - Enterprise architecture, data & AI

Goal: design governed, multi-region, data-and-AI-capable platforms

What to learn

Shared VPC, Organization Policies, and VPC Service Controls; Private Service Connect (sections 3, 8, 2).
GKE (Autopilot/Standard), Workload Identity, and Cloud Run at scale (section 10).
Pub/Sub, Eventarc, Workflows for event-driven systems (section 10).
BigQuery (mental model, partitioning/clustering, cost), Dataflow, Dataplex (section 11).
Vertex AI, Gemini, and vector search / governed RAG (section 12).
Multi-region DR (global LB + cross-region replicas / Spanner) (section 13).
Terraform + remote state + CI/CD; a landing zone (sections 17, 1, 14).
Enterprise security at scale: CMEK, Binary Authorization, centralized logging + SCC (section 8).

Why it matters

At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.

Hands-on labs

Deploy a landing zone via Terraform: hierarchy, groups, Org Policies, Shared VPC, central logging + SCC, budgets.
Stand up a private GKE (Autopilot) cluster in the Shared VPC with Workload Identity and a Cloud Build/Deploy pipeline.
Build a BigQuery + Cloud Storage lakehouse with a Datastream CDC feed and Dataplex governance; tune a query with partitioning/clustering.
Build a governed RAG assistant: Cloud Storage + vectors in AlloyDB/BigQuery + Gemini behind a Cloud Run serving API, with VPC-SC + audit + entitlement-filtered retrieval.
Implement cross-region DR for a Cloud SQL app (replica promotion) behind a global LB; rehearse failover and confirm CMEK keys in DR.
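The "entitlement-filtered retrieval" in the RAG lab above is easiest to reason about as an ordering rule: filter by what the caller may see first, rank by similarity second. The sketch below is a toy in-memory version under stated assumptions. The `Chunk` type, the group names, and the two-dimensional embeddings are all hypothetical, and a real build would push the filter into the vector store query (for example a SQL predicate alongside the similarity search) rather than filter in application code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str                 # hypothetical document identifier
    embedding: tuple            # toy embedding; real ones have hundreds of dimensions
    allowed_groups: frozenset   # groups entitled to read this chunk


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, user_groups, chunks, top_k=3):
    """Drop chunks the caller is not entitled to, then rank the rest by similarity."""
    groups = set(user_groups)
    visible = [c for c in chunks if c.allowed_groups & groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return [c.doc_id for c in ranked[:top_k]]


if __name__ == "__main__":
    corpus = [
        Chunk("hr-policy", (1.0, 0.0), frozenset({"hr"})),
        Chunk("eng-runbook", (0.9, 0.1), frozenset({"eng"})),
        Chunk("shared-faq", (0.0, 1.0), frozenset({"hr", "eng"})),
    ]
    print(retrieve((1.0, 0.0), ["eng"], corpus))
```

Filtering before ranking matters: if the top-k is computed first and filtered afterwards, a restricted document can crowd out permitted ones, and the mere size of the result set can leak that restricted material exists.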

Common mistakes

Skipping the landing zone; DR never tested; over-privileged Workload Identity; pod-IP exhaustion; unbounded BigQuery scans; connecting AI to production data without a governed serving layer.
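Pod-IP exhaustion in the list above is mostly arithmetic that can be done before the cluster exists. The sketch below assumes the sizing rule GKE documents for VPC-native clusters, where each node reserves a pod range with at least twice as many addresses as its maximum pods per node; treat that rule and the function names as assumptions and verify against current GKE documentation. It estimates how many nodes a pod secondary range can support.

```python
def per_node_pod_prefix(max_pods_per_node):
    """Prefix length of the smallest per-node range holding 2x the max pods per node."""
    needed = 2 * max_pods_per_node
    # (needed - 1).bit_length() is ceil(log2(needed)) for needed >= 1, with no float math.
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


def max_nodes(pod_range_prefix, max_pods_per_node):
    """Number of per-node ranges that fit in a pod secondary range of the given prefix."""
    node_prefix = per_node_pod_prefix(max_pods_per_node)
    if node_prefix < pod_range_prefix:
        return 0  # the secondary range is smaller than a single node's allocation
    return 2 ** (node_prefix - pod_range_prefix)


if __name__ == "__main__":
    # 110 max pods per node reserves a /24 per node under the assumed rule.
    print(per_node_pod_prefix(110), max_nodes(14, 110))
    print(per_node_pod_prefix(32), max_nodes(20, 32))
```

The practical point is that node count, not pod count, is what runs out: a /20 pod range with 110 max pods per node supports only 16 nodes under this rule, regardless of how few pods each node actually runs. Lowering max pods per node or sizing the secondary range up front are the usual levers.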

Expected outcome

You can design and operate a governed, automated, multi-region Google Cloud platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.

Certification checkpoints (optional)

Level	Typical certification track
Beginner	Cloud Digital Leader; Associate Cloud Engineer
Intermediate	Professional Cloud Architect; Professional Cloud Network Engineer
Advanced	Professional Cloud Security Engineer, Data Engineer, Database Engineer, DevOps Engineer, Machine Learning Engineer

Verify before scheduling

Google updates exam content and role certifications regularly. Confirm the current track and objectives on Google Cloud's official training site before you prepare. Certifications validate knowledge; the labs above build the capability employers pay for.

Official: Google Cloud training & certification →
