Google Cloud Deep Dive Portal
A practical reference for Cloud Architects, DBAs, Data Engineers, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Google Cloud environments - not a marketing overview.
Who this is for: cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, and DevOps, security, data, and AI engineers, plus anyone moving from traditional infrastructure or another cloud into Google Cloud. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Google Cloud and what changes operationally.
How this portal is organized
Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, machine types, quotas, model availability, service names), a Verify with current Google Cloud documentation flag.
Sections 1-2 establish the mental model: the resource hierarchy (org / folder / project), regions/zones, and the IAM allow/deny policy model that everything else depends on.
Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data analytics, and AI - with diagrams, tables, and gotchas.
Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Architecture Framework, and a structured learning path.
Reading the callouts
Several note types recur. They flag the perspective that matters most for a point.
The Google Cloud shared responsibility model (orientation)
Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Google already does.
| Layer | Compute Engine (IaaS) | GKE Standard | Cloud SQL / managed DB | BigQuery / Cloud Run (serverless) |
|---|---|---|---|---|
| Physical / hypervisor | Google | Google | Google | Google |
| OS patching | You | You (nodes) / Google (control plane) | Google | Google |
| Runtime / engine patching | You | Shared | Google (in window) | Google |
| Backup config | You | You | Managed, you configure | Managed / you export |
| Scaling / HA | You build it (MIG) | You configure | You enable HA | Automatic |
| Data, schema, access, IAM | You | You | You | You |
Suggested reading order
1. Google Cloud Fundamentals
The global infrastructure and the resource hierarchy (organization, folders, projects) that every Google Cloud deployment is built on - plus the mental model that makes the rest of the platform predictable.
Google Cloud is a set of regions (each with multiple zones) sitting on Google's global network. Resources have a scope - global, regional, or zonal - and that scope drives HA design. Everything lives in a hierarchy: Organization > Folders > Projects > resources. The project is the fundamental unit of deployment, billing, quota, and isolation. IAM grants access down the hierarchy; Organization Policies restrict what is allowed. Get the hierarchy and a landing zone right before production - restructuring later is painful.
What Google Cloud is
Google Cloud Platform (GCP) is Google's public cloud: on-demand compute, storage, networking, databases, data analytics, and AI/ML services delivered from Google-operated regions, consumed over Google's private global network, and billed by usage. Its distinctive strengths for enterprises are its global network (a software-defined, private backbone that makes a VPC a global object and enables global load balancing from a single anycast IP), its data and analytics stack (BigQuery, Dataflow, Pub/Sub, Dataplex), and its AI/ML platform (Vertex AI, Gemini). If you come from traditional infrastructure, the biggest early surprises are that the network is global, the project is the main boundary, and much of the platform is API-first and serverless.
Google Cloud global infrastructure
Resource scope: global, multi-region, regional, zonal
This is the single most important idea for HA design on GCP - a resource's scope determines what failure it survives and where it can be reached.
| Scope | Spans | Examples | Fails if |
|---|---|---|---|
| Global | All regions | VPC network, firewall rules, routes, global external Application LB, images, IAM, HTTP(S) load balancer IP | Not tied to any region; only a global control-plane issue affects it |
| Multi-region | A set of regions (e.g. "US", "EU") | Cloud Storage multi-region buckets, BigQuery multi-region datasets | More than one region in the set is lost; it survives the loss of a single region |
| Regional | All zones in one region | Regional MIG, regional persistent disk, regional GKE control plane, subnets, Cloud SQL (HA) | The whole region is lost |
| Zonal | One zone | A single VM, zonal persistent disk, zonal GKE cluster | That one zone is lost |
The resource hierarchy
- Organization - the root node, tied to a Cloud Identity / Google Workspace domain. The top of IAM and policy inheritance.
- Folders - grouping nodes (by department, environment, or BU) for delegated administration and policy inheritance. Can nest.
- Projects - the fundamental unit. Every resource belongs to exactly one project. A project is the boundary for billing, quotas, API enablement, IAM, and isolation. It has a mutable name, an immutable project ID (globally unique), and a project number.
- Resources - VMs, buckets, datasets, etc., inside a project.
- Billing account - a separate object linked to projects; it is where charges accrue and can span many projects. Org and billing are managed independently.
- Resource Manager - the API/service that manages this hierarchy; Cloud Asset Inventory gives you a searchable, historical inventory of all resources and IAM across the org.
The Google Cloud mental model
| Concept | Is the boundary for | Think of it as |
|---|---|---|
| Organization | Everything; identity domain | The enterprise root |
| Folder | Delegated admin & policy grouping | A governance grouping (dept / env / BU) |
| Project | Billing, quota, API, IAM, isolation | The main deployment and billing boundary |
| IAM allow/deny policy | Who can do what | The access-control boundary |
| Organization Policy | What is allowed at all | The governance / restriction boundary |
| VPC network | Private networking | A global network object (subnets are regional) |
| Region / zone | Physical placement | The workload placement boundary (drives HA) |
Organization Policies
Organization Policy Service lets you set constraints that apply to a node (org, folder, or project) and inherit downward - guardrails that IAM cannot express. Common ones:
- constraints/compute.vmExternalIpAccess - block external IPs on VMs (huge for reducing exposure).
- constraints/gcp.resourceLocations - restrict which regions resources can be created in (data residency).
- constraints/iam.disableServiceAccountKeyCreation - stop long-lived SA keys.
- constraints/iam.allowedPolicyMemberDomains - domain restricted sharing: only allow IAM grants to your org's identities.
- constraints/compute.requireOsLogin - enforce OS Login instead of metadata SSH keys.
Labels vs tags
| | Labels | Tags |
|---|---|---|
| What | Key/value metadata on resources | Key/value objects defined at org/project, bound to resources |
| Main use | Cost attribution, billing export grouping, organization | Conditional IAM and Org Policy / firewall targeting (governance) |
| Governed? | Free-form (anyone with edit can set) | Yes - tag keys/values are IAM-controlled resources |
Resource names, project IDs, project numbers
- Project ID - globally unique, human-chosen, immutable (e.g. acme-app-prod-01). Used in most CLI/API calls. Choose a naming convention up front.
- Project number - an auto-assigned numeric ID; some APIs and service-agent identities use it.
- Project name - a mutable display name.
- Full resource names are hierarchical, e.g. //compute.googleapis.com/projects/<id>/zones/<zone>/instances/<name>.
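The naming rules above are easy to check in tooling before a project is created. The following is a minimal sketch, not an official validator: the function names are hypothetical, the project ID pattern reflects the commonly documented rule (6-30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen), and the parser only handles names built from collection/ID pairs like the Compute Engine example above.

```python
import re

# Assumed project ID rule: 6-30 chars, lowercase letters, digits, hyphens;
# starts with a letter; does not end with a hyphen.
PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def is_valid_project_id(project_id):
    """Return True if the string matches the assumed project ID rule."""
    return PROJECT_ID_RE.fullmatch(project_id) is not None


def parse_full_resource_name(name):
    """Split '//service/collection/id/collection/id/...' into a dict."""
    if not name.startswith("//"):
        raise ValueError("full resource names start with '//'")
    service, _, path = name[2:].partition("/")
    segments = path.split("/")
    # This sketch only supports names made of collection/ID pairs.
    if not service or len(segments) % 2 != 0 or not all(segments):
        raise ValueError(f"unexpected resource path: {path!r}")
    parts = dict(zip(segments[0::2], segments[1::2]))
    return {"service": service, **parts}
```

Running a check like this in CI for new project requests keeps the naming convention enforceable rather than aspirational.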
Ways to work with Google Cloud
- Cloud Console - the web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.
- gcloud CLI - the primary command line. Config profiles, project/account selection, service-account impersonation. See section 17.
- Cloud Shell - a browser terminal, pre-authenticated as your Console identity, with gcloud/Terraform/kubectl installed and ephemeral home storage.
- Client libraries and APIs - idiomatic libraries (Python, Go, Java, Node, ...) and REST/gRPC APIs for building applications and tooling.
- Terraform - the Google provider is the standard way to build infrastructure declaratively. Deployment Manager still exists but is legacy; new work uses Terraform (or Infrastructure Manager, managed Terraform).
- Cloud Asset Inventory - search, export, and monitor all resources and IAM across the org, with history and feeds. Essential for governance and audits.
Designing the hierarchy & a landing zone
How to structure organization, folders, and projects (Design)
- Common pattern: Org > Folders by environment or business unit > Projects per app-environment. A frequent shape is a top-level split of Shared/Platform, Security, and per-BU folders, each with prod/nonprod sub-folders.
- One app + one environment per project is the norm - it gives clean IAM, quota, billing, and blast-radius isolation. Resist the urge to pile many apps into one project.
- Shared projects hold cross-cutting infrastructure: the host project for Shared VPC (networking), centralized logging, centralized security tooling (SCC), and a monitoring project.
- Never mix with production: sandbox/experimentation, personal projects, and unreviewed workloads must live in separate folders with their own guardrails and budgets - never in the prod project.
Separating dev / test / stage / prod / shared / security / networking / logging (Design)
- Separate projects per environment for independent IAM, quotas, budgets, and cost reporting.
- A dedicated networking (Shared VPC host) project, a logging project (aggregated log sink target), a security project (SCC, org-level tooling), and a monitoring project.
- Use folders to apply environment-wide IAM and Org Policies once and inherit them.
- Keep prod under stricter Org Policies (no external IPs, restricted locations, no SA keys) than nonprod.
What a Google Cloud landing zone includes (Design)
A landing zone is a codified, repeatable baseline (Terraform) deployed before workloads:
- Resource hierarchy (org, folders, projects) and naming standards.
- Identity: Cloud Identity/Workspace, groups, break-glass, federation.
- Baseline IAM (groups, not users) and preventive Org Policies.
- Networking: Shared VPC host project, subnets, hierarchical firewall, Cloud NAT, DNS, hybrid connectivity.
- Security: SCC, VPC Service Controls perimeters, KMS, org audit-log sink to a logging project (and optionally BigQuery/SIEM).
- Guardrails: budgets and alerts, quotas, labels/tags, billing export to BigQuery.
- All as code, reviewed and version-controlled. Google's Cloud Foundation / landing-zone blueprints are a starting point.
Common hierarchy and landing-zone mistakes
- Running everything under one project (or worse, one shared "default" project) so least privilege and cost attribution become impossible.
- Skipping the landing zone and retrofitting Org Policies, Shared VPC, and log centralization after workloads exist.
- Granting IAM at the org level for convenience, so every project inherits broad access.
- No naming standard for projects/folders, breaking automation and reporting.
- Mixing sandbox and production in the same folder with the same guardrails.
2. Identity and Access Management
Who can do what, on which resource, in which project - and the guardrails (deny policies, Org Policies, VPC Service Controls) around it. IAM is where most Google Cloud access issues and security incidents originate, so this section goes deep on principals, roles, inheritance, and troubleshooting.
IAM binds a principal (user, group, service account, or federated identity) to a role (a bundle of permissions) on a resource, in an allow policy. Policies inherit down the hierarchy (org → folder → project → resource), and grants are additive. Deny policies and Org Policies can block regardless of allows. Use groups (not users) and predefined/custom roles (not basic Owner/Editor/Viewer), prefer impersonation over service-account keys, and wrap sensitive data in VPC Service Controls.
The IAM model
An IAM allow policy is attached to a resource (or a hierarchy node) and contains bindings: each binding maps a role to one or more members (principals), optionally with a condition. When a principal calls an API, Google evaluates the effective policy (the union of allow policies inherited from the resource up through project, folder, and org), checks for any applicable deny policy, and checks Org Policy and VPC Service Controls. Access requires an allow, no matching deny, and no blocking Org Policy/perimeter.
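The evaluation order described above can be sketched as a toy model. This is an illustration under stated assumptions, not Google's implementation: the `Node` class and `is_allowed` function are hypothetical, the role-to-permission table is an abbreviated subset, conditions are ignored, and Org Policy and VPC Service Controls checks are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A hierarchy node (org, folder, project, or resource) with its policies."""
    name: str
    parent: Optional["Node"] = None
    allow: dict = field(default_factory=dict)  # role -> set of principals
    deny: set = field(default_factory=set)     # (principal, permission) pairs


# Abbreviated, illustrative subsets of what these roles contain.
ROLE_PERMISSIONS = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/storage.objectAdmin": {
        "storage.objects.get",
        "storage.objects.list",
        "storage.objects.delete",
    },
}


def ancestors(node):
    """Yield the node itself, then each parent up to the root."""
    while node is not None:
        yield node
        node = node.parent


def is_allowed(node, principal, permission):
    chain = list(ancestors(node))
    # A matching deny anywhere in the chain wins over any allow.
    if any((principal, permission) in n.deny for n in chain):
        return False
    # Allows are additive: a grant on the node or any ancestor is enough.
    for n in chain:
        for role, members in n.allow.items():
            if principal in members and permission in ROLE_PERMISSIONS.get(role, set()):
                return True
    return False
```

The model makes two properties easy to see: a grant on a child never reaches its parent, and no allow lower in the tree can override a deny set higher up.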
Principals (members)
| Principal | What it is | Use for |
|---|---|---|
| Google account (user) | A human identity in Cloud Identity / Workspace / consumer | People - but grant via groups, not directly |
| Google group | A collection of users/service accounts | All human access management (add/remove members, not IAM bindings) |
| Service account (SA) | A non-human identity for workloads (an app, VM, function) | Machine-to-service auth; the workload's identity |
| Workload identity (federated) | An external workload identity (GKE, other clouds, CI) mapped to a GCP identity | Letting workloads authenticate without SA keys |
| Workforce identity (federated) | External human users from your IdP (Okta, Entra, etc.) | Console/gcloud access for a federated workforce |
| allAuthenticatedUsers / allUsers | Any Google identity / anyone on the internet | Almost never - public exposure; block with Org Policy |
Roles: basic, predefined, custom
| Role type | What | Guidance |
|---|---|---|
| Basic (primitive) | Owner, Editor, Viewer - broad, legacy roles spanning almost all services | Avoid in production. Owner/Editor are enormous grants. Use only in throwaway sandboxes. |
| Predefined | Curated, service-specific roles (e.g. roles/compute.instanceAdmin.v1, roles/bigquery.dataViewer) | The default. Pick the narrowest predefined role that fits the job. |
| Custom | You compose a role from specific permissions | When no predefined role fits without over-granting. Maintain them (permissions change). |
Granting roles/editor or roles/owner "to move fast" gives near-total control over the project (create/delete almost anything, and Owner can change IAM). It defeats least privilege and makes audits meaningless. Use predefined roles scoped to the job; reserve Owner for a tiny, monitored break-glass group.
Allow policies and deny policies; conditional IAM
- Allow policy - grants roles to principals. Additive across the hierarchy; there is no "subtract."
- Deny policy - explicitly denies specified permissions to specified principals, and is evaluated before allows - a matching deny wins. Use it to carve exceptions ("nobody except break-glass may delete buckets in this folder").
- Conditional IAM - attach a condition (CEL expression) to a binding: by resource name/type, by request time, by tag. E.g. grant access only to resources tagged env=dev, or only during business hours.
# Conditional binding: grant only on buckets whose name starts with "app-"
gcloud projects add-iam-policy-binding my-proj \
--member="group:app-team@example.com" \
--role="roles/storage.objectAdmin" \
--condition='expression=resource.name.startsWith("projects/_/buckets/app-"),title=app-buckets-only'
Inheritance and scope
Grant at the lowest node that works. A role granted at the org applies to every folder, project, and resource beneath it; a role granted on a single bucket applies only there.
| Scope | Grant here when | Risk |
|---|---|---|
| Organization | Truly org-wide roles (org admins, security auditors) | Highest - inherited everywhere |
| Folder | A whole environment/BU needs the same access | Medium |
| Project | Most workload access | Contained to one project |
| Resource (bucket, dataset, SA) | Fine-grained, single-resource access | Lowest |
Service accounts: keys vs impersonation vs workload identity
| Mechanism | How the workload authenticates | Risk |
|---|---|---|
| Attached SA (metadata) | A VM/GKE/Run resource runs as an SA; credentials come from the metadata server | Low - no keys on disk |
| Workload Identity Federation | External workload (other cloud, CI, on-prem) exchanges its native token for short-lived GCP creds | Low - no keys |
| SA impersonation | A principal with roles/iam.serviceAccountTokenCreator on the SA mints short-lived tokens for it | Low - short-lived, auditable |
| SA key (JSON) | A long-lived downloaded private key | High - leaks, gets committed to git, outlives its owner |
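Avoid service account keys wherever possible, and block their creation with the Org Policy constraint iam.disableServiceAccountKeyCreation. A leaked JSON key is a standing, long-lived credential - one of the most common causes of serious GCP incidents.
Cloud Identity, Workspace, and federation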
- Cloud Identity / Google Workspace - where your human identities and groups live; the org is tied to a domain. Manage joiners/movers/leavers here.
- Workforce Identity Federation - let human users from an external IdP (Okta, Entra ID, Ping) access Google Cloud without provisioning Google accounts.
- Workload Identity Federation - let external workloads (GitHub Actions, AWS, on-prem) get short-lived GCP credentials by trust, no keys.
IAP, Access Context Manager, VPC Service Controls
- Identity-Aware Proxy (IAP) - context-aware access to apps and VMs (SSH/RDP/TCP) without a VPN or public IPs: authenticate the user and check policy at the proxy. The modern replacement for bastion + public IP.
- Access Context Manager - defines access levels (by IP range, device posture, identity) used by IAP and VPC-SC to make access conditional on context.
- VPC Service Controls - draws a service perimeter around projects so data in managed services (Cloud Storage, BigQuery, etc.) cannot be exfiltrated to projects outside the perimeter - even by a valid identity with a valid key. The key control for sensitive-data isolation.
- IAM Recommender, Policy Analyzer, and Policy Troubleshooter - Recommender flags over-granted roles; Policy Analyzer answers "who can access what"; Policy Troubleshooter explains why a specific request was allowed or denied.
Real IAM scenarios
Read-only security auditor across the org (Low risk)
Who: the security team. Scope: organization (auditors legitimately need breadth). Role: roles/iam.securityReviewer + roles/logging.viewer (and SCC roles), granted to a group. Risk: low - read-only. Safer alternative: scope to specific folders if their remit is narrower. Common misuse: giving auditors Viewer at org (broader than needed and includes data read on many services).
App team deploys to their own project only (Medium risk)
Who: the app team group. Scope: their project (not folder/org). Role: specific predefined roles (e.g. roles/run.developer, roles/artifactregistry.writer, roles/logging.viewer), not Editor. Risk: medium - contained to one project. Safer alternative: deploy via a pipeline SA and give humans only view + trigger. Common misuse: roles/editor on the project "to unblock them."
Workload reads a bucket - no keys (Low risk)
Who: a VM/Cloud Run service. Scope: a single bucket. Role: roles/storage.objectViewer granted to the workload's attached service account on that bucket. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a downloaded SA key baked into the image + project-level Storage Admin.
CI/CD pipeline outside GCP deploys in (Medium risk)
Who: GitHub Actions / external CI. Scope: the target project. Role: a deployer SA the pipeline impersonates via Workload Identity Federation - no key. Risk: medium, but no standing credential. Common misuse: storing a long-lived SA JSON key as a CI secret.
Common Google Cloud IAM mistakes
- Basic roles (Owner/Editor/Viewer) too broadly - use predefined roles scoped to the task.
- Granting at org level unnecessarily - inherits everywhere; grant at project/resource.
- Long-lived service account keys - the top serious incident; use impersonation / workload identity and disable key creation.
- Not using service-account impersonation - humans and CI should mint short-lived tokens, not hold keys.
- Confusing service accounts with human users - different lifecycles, different controls.
- Not using groups - per-user bindings are unmanageable and invisible in reviews.
- Not understanding inheritance - a broad grant high in the hierarchy silently reaches prod.
- Not using conditional IAM where a resource/time/tag condition would tighten a grant.
- Ignoring VPC Service Controls for sensitive data - IAM alone does not stop exfiltration.
- Not reviewing audit logs - Admin Activity and Data Access logs are your evidence and detection.
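Several of these mistakes can be spotted mechanically in an exported allow policy. The sketch below assumes the input has the standard allow-policy JSON shape (a `bindings` list of `role` / `members` entries), as returned by `gcloud projects get-iam-policy PROJECT_ID --format=json`. The `review_policy` function and its finding messages are hypothetical, and it covers only three of the checks above.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def review_policy(policy):
    """Return human-readable findings for an allow policy dict."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "")
        members = binding.get("members", [])
        # Basic roles are broad, legacy grants.
        if role in BASIC_ROLES:
            findings.append(f"basic role {role} granted to {len(members)} member(s)")
        for member in members:
            if member in PUBLIC_MEMBERS:
                findings.append(f"public member {member} has {role}")
            elif member.startswith("user:"):
                findings.append(f"direct user binding {member} on {role}; prefer a group")
    return findings
```

A scan like this complements IAM Recommender rather than replacing it: it flags structural smells, while Recommender uses actual usage data.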
Google Cloud access troubleshooting mental model
When a request is denied (or unexpectedly allowed), walk the layers in order. Most "permission denied" tickets are one of these:
- Which org / folder / project is the resource in? (Wrong project selected is the #1 cause.)
- Which principal is making the request - user, group, service account, or federated identity? (For workloads, which SA is actually attached?)
- What role is assigned, and does it contain the required permission?
- At what scope is it granted (resource / project / folder / org)? Does inheritance reach this resource?
- Is there an IAM deny policy matching this principal + permission?
- Is an Organization Policy blocking the action (e.g. location, external IP, SA key)?
- Is VPC Service Controls blocking it (cross-perimeter access to a managed service)?
- Is the API enabled in the project?
- For workloads acting on other resources: is the service account permitted to act on that resource, and does the caller have actAs / tokenCreator on the SA?
Tools
Policy Troubleshooter (explains allow/deny for a specific principal+resource+permission), Policy Analyzer (who-can-do-what), IAM Recommender (over-grants), and Cloud Audit Logs (the denied request with the reason).
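# Show the allow policy attached to a project
gcloud projects get-iam-policy PROJECT_ID
# Analyze what a principal can access across the org (Policy Analyzer)
gcloud asset analyze-iam-policy --organization=ORG_ID \
--identity="user:jane@example.com"
# Is the API enabled?
gcloud services list --enabled --project PROJECT_ID
3. Networking Deep Dive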
The global VPC, regional subnets, firewall model, private access to Google APIs, Shared VPC, and hybrid connectivity - plus the traffic-flow reasoning you need to design and debug real Google Cloud networks.
A VPC is a global object; its subnets are regional with a CIDR you choose (use custom mode, never auto mode in production). Firewall rules are stateful, have a priority and direction, and there is an implied deny-ingress / allow-egress at the bottom. Private workloads reach Google APIs via Private Google Access or Private Service Connect, and reach the internet outbound-only via Cloud NAT. Shared VPC centralizes networking in a host project; VPC peering is not transitive. Plan non-overlapping CIDRs before anything else.
VPC networks and CIDR planning
A VPC network is a global, software-defined private network. Unlike most clouds, one VPC spans every region; you add a subnet per region as you expand, and resources in different regions on the same VPC route to each other over Google's backbone with no peering.
- Custom mode vs auto mode: auto mode auto-creates a subnet in every region from a fixed 10.128.0.0/9 range - convenient but it hands you overlapping, non-planned CIDRs. Always use custom mode in production and assign subnets deliberately.
- Subnets are regional; a subnet has a primary CIDR and optional secondary ranges (used for GKE pods/services alias IPs).
- CIDRs must not overlap with each other, with peered/shared VPCs, or with on-premises. Overlap is the number-one cause of hybrid that "connects but won't route."
- You can expand a subnet's primary range; plan generously so you rarely need to.
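Overlap checks are cheap to automate before any subnet is created. The sketch below uses Python's standard `ipaddress` module; the function names are hypothetical and the input is assumed to be a simple name-to-CIDR mapping exported from your IP plan. It also flags ranges that fall inside the auto-mode block mentioned above.

```python
import ipaddress
from itertools import combinations

# The fixed block that auto-mode VPCs allocate subnets from.
AUTO_MODE_RANGE = ipaddress.ip_network("10.128.0.0/9")


def find_overlaps(ranges):
    """Return sorted (name_a, name_b) pairs whose CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    )


def collides_with_auto_mode(cidr):
    """True if the range overlaps the auto-mode block."""
    return ipaddress.ip_network(cidr).overlaps(AUTO_MODE_RANGE)
```

Include on-premises ranges, peered VPC ranges, and GKE secondary ranges in the same mapping, since all of them must be mutually non-overlapping.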
Never use auto-mode VPCs (or the default network) in production. They pre-create subnets in every region with fixed ranges you did not plan, which collide with on-prem and other VPCs the moment you connect them. Delete/avoid the default network; build a custom-mode VPC against your enterprise IP plan.
Firewall rules and hierarchical firewall policies
GCP firewalls are stateful and evaluated per VM by priority (lower number wins). Every network has two implied rules at priority 65535: deny all ingress and allow all egress. You open what you need above that.
| Concept | Detail |
|---|---|
| Direction | Ingress (to targets) or egress (from targets). Rules are directional - a common source of confusion. |
| Priority | 0-65535, lower = higher priority; first match wins. Implied deny-ingress/allow-egress sit at 65535. |
| Targets | All instances, by network tag, or by service account (prefer SA targeting - it can't be self-assigned like tags). |
| Source/dest | CIDR ranges, source tags/SAs, or (for some) source service. |
| Hierarchical firewall policies | Rules set at org/folder that apply to all VPCs beneath - central guardrails (e.g. allow IAP range, deny risky ports) evaluated before VPC-level rules. |
| Firewall Rules Logging | Log matched connections per rule - use it to see what is actually being allowed/denied. |
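The priority and implied-rule behaviour in the table can be sketched as a small model. This is a simplification under stated assumptions, not the real rule engine: `Rule` and `evaluate` are hypothetical names, every rule is assumed to target the VM in question, only IPv4 and a single port are modelled, hierarchical policies are ignored, and the tie-break (a deny beats an allow at equal priority) reflects the documented behaviour as I understand it.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    direction: str   # "INGRESS" or "EGRESS"
    priority: int    # 0-65535, lower number wins
    action: str      # "allow" or "deny"
    cidr: str        # source range for ingress, destination range for egress
    port: Optional[int] = None  # None matches every port


# The two implied rules every VPC network carries at the lowest priority.
IMPLIED = [
    Rule("implied-deny-ingress", "INGRESS", 65535, "deny", "0.0.0.0/0"),
    Rule("implied-allow-egress", "EGRESS", 65535, "allow", "0.0.0.0/0"),
]


def evaluate(rules, direction, peer_ip, port):
    """Return the rule that decides a new connection."""
    ip = ipaddress.ip_address(peer_ip)
    candidates = [
        r for r in list(rules) + IMPLIED
        if r.direction == direction
        and ip in ipaddress.ip_network(r.cidr)
        and r.port in (None, port)
    ]
    # Lowest priority number wins; at equal priority a deny beats an allow.
    return min(candidates, key=lambda r: (r.priority, r.action != "deny"))
```

Because the firewall is stateful, only the first packet of a connection is evaluated this way; return traffic for an allowed connection does not need its own rule.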
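Prefer service-account targets over network tags: anyone who can edit a VM can add a tag, whereas attaching a service account requires the actAs permission. Use hierarchical policies for org-wide guardrails (allow the IAP range 35.235.240.0/20 for SSH, deny egress to known-bad ranges) so individual teams can't undo them.
Common mistake: a 0.0.0.0/0 allow-ingress on port 22 instead of allowing only the IAP range and using IAP for SSH.
Routes, Cloud Router, and Cloud NAT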
- Routes - system-generated (subnet routes, default internet route) plus custom static routes; dynamic routes are learned via Cloud Router (BGP) from VPN/Interconnect.
- Cloud Router - the BGP speaker for hybrid connectivity and for regional dynamic routing; advertises your subnets to on-prem and learns on-prem routes.
- Cloud NAT - managed, outbound-only NAT so instances with no external IP can reach the internet (patches, external APIs). It is regional and requires a Cloud Router. It does not allow inbound.
- External vs internal IPs - internal (RFC1918) always; external only when you truly need internet-facing exposure. Block external IPs by Org Policy where possible.
Private Google Access, Private Service Connect, private services access
Three different "private" mechanisms are constantly confused, and Serverless VPC Access is often mixed in with them:
| Mechanism | What it does | Use for |
|---|---|---|
| Private Google Access (PGA) | Lets VMs with only internal IPs reach Google APIs/services (storage.googleapis.com, etc.) without an external IP | Private VMs calling Google APIs (Cloud Storage, BigQuery, Artifact Registry) |
| Private Service Connect (PSC) | A private endpoint (internal IP) in your VPC that maps to a Google API bundle or a published service | Private, controlled access to Google APIs or to a service in another VPC/producer |
| Private services access (PSA) | A VPC peering to a Google-managed producer VPC for services like Cloud SQL private IP, Memorystore | Giving managed services a private IP reachable from your VPC |
| Serverless VPC Access | A connector that lets serverless (Cloud Run/Functions/App Engine) reach VPC internal IPs | Serverless calling private resources (a private Cloud SQL, an internal service) |
Use Private Google Access when VMs with only internal IPs just need to reach Google APIs (optionally through private.googleapis.com). Use Private Service Connect when you want a specific internal endpoint IP (for tighter control, VPC-SC alignment, or to consume a published producer service). For managed databases' private IP, you need private services access (a reserved range + peering). These are not interchangeable - pick by the resource you are reaching.
Shared VPC and VPC peering
- Shared VPC - a host project owns the VPC and subnets; service projects attach and deploy resources into shared subnets. Networking is centralized (one team owns IP space, firewall, connectivity) while app teams keep their own projects for IAM/billing. The enterprise default for multi-project networking.
- VPC Network Peering - connects two VPCs privately. Crucially, peering is not transitive: if A peers B and B peers C, A cannot reach C. And you cannot peer overlapping ranges.
- Network Connectivity Center (NCC) - a hub-and-spoke model to interconnect many VPCs and hybrid links through a central hub, addressing peering's non-transitivity at scale.
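Non-transitivity is easiest to see with a tiny model. In the sketch below (hypothetical function names, VPCs represented by plain strings), reachability over peering is limited to direct peers, and a second helper counts how many peerings a full mesh would need, which is the scaling problem a hub-and-spoke design addresses.

```python
from itertools import combinations


def reachable_by_peering(peerings, source):
    """VPCs reachable from `source` over VPC Network Peering: direct peers only."""
    direct = set()
    for a, b in peerings:
        if a == source:
            direct.add(b)
        elif b == source:
            direct.add(a)
    return direct


def full_mesh_peerings(vpcs):
    """Every pair that must be peered directly for all VPCs to reach each other."""
    return list(combinations(sorted(vpcs), 2))
```

With A peered to B and B peered to C, `reachable_by_peering` returns only B for A, and a full mesh of n VPCs needs n(n-1)/2 peerings.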
Hybrid connectivity: Cloud VPN and Interconnect
| | HA VPN | Dedicated / Partner Interconnect |
|---|---|---|
| Path | Over the internet, IPSec-encrypted | Private physical connection (direct or via partner) |
| Bandwidth | Per-tunnel (Gbps-class aggregate with multiple tunnels) | 10/100 Gbps (Dedicated); flexible sizes (Partner) |
| SLA / latency | Best-effort internet; HA VPN offers an SLA with the right topology | Consistent, low latency; higher SLA |
| Setup | Minutes | Days-weeks (provisioning) |
| Use as | Quick start / backup / lower bandwidth | Primary enterprise link, large data, low latency |
Both use Cloud Router for BGP. HA VPN uses two interfaces for a 99.99% topology. Common pattern: Interconnect primary + HA VPN backup, with BGP preferring Interconnect.
Cloud DNS
- Public zones for internet-facing names; private zones for internal resolution within (and across, via peering) your VPCs.
- DNS peering and forwarding integrate with on-prem DNS (inbound/outbound server policies) for hybrid name resolution.
- Managed private zones for *.googleapis.com (e.g. private.googleapis.com / restricted.googleapis.com) route Google-API traffic privately for PGA/VPC-SC.
How traffic flows in Google Cloud
- Destination inside the same VPC (any region)? Routes locally over the backbone - only firewall rules apply.
- Outside the VPC? The route table (subnet/static/dynamic routes) picks the next hop: default internet route (needs external IP or Cloud NAT), a VPN/Interconnect route (via Cloud Router), or a peering route.
- Firewall (hierarchical policies, then VPC rules, by priority) must allow it - remember implied deny-ingress / allow-egress.
- For Google APIs from private VMs: PGA/PSC + DNS to the private API endpoint.
Debugging is almost always: is there a route to the right next hop? does firewall allow it (both directions/priority)? is external IP / Cloud NAT / PGA in place for the destination type?
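The "is there a route to the right next hop?" question can be sketched as a longest-prefix lookup. This is a simplified model, assuming routes are given as (CIDR, priority, next hop) tuples and that selection prefers the most specific prefix and then the lowest priority number; the real route order has additional rules, and `pick_route` is a hypothetical helper.

```python
import ipaddress


def pick_route(routes, dest_ip):
    """Pick the next hop: most specific matching prefix, then lowest priority number."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), priority, next_hop)
        for cidr, priority, next_hop in routes
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route at all: the packet is dropped
    _, _, next_hop = max(matches, key=lambda m: (m[0].prefixlen, -m[1]))
    return next_hop
```

A matching default internet route is necessary but not sufficient: the VM still needs an external IP or Cloud NAT, and the firewall must allow the flow.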
Reference diagrams
Three-tier with global external Application Load Balancer
Shared VPC (centralized networking)
Private Google Access & Cloud NAT egress
Network Intelligence Center
| Tool | What it gives you |
|---|---|
| Connectivity Tests | Static reachability analysis A→B: tells you the exact firewall rule / route / config blocking a path. First stop for "cannot reach." |
| Network Analyzer | Automatic detection of misconfigurations (shadowed routes, unused rules, IP exhaustion, sub-optimal config). |
| VPC Flow Logs | Sampled connection records for monitoring, forensics, and "is my rule dropping this?" |
| Firewall Rules Logging / Packet Mirroring | Per-rule connection logs; mirror traffic to an IDS/collector for deep inspection. |
Networking troubleshooting
VM cannot reach the internet
Likely causes & checks
- No external IP and no Cloud NAT for the subnet's region - private VMs need Cloud NAT for outbound.
- Egress firewall (or a hierarchical policy) denies the destination, or a deny rule at higher priority matches.
- Default internet route removed/overridden by a custom route.
- OS firewall on the VM.
Fix / prevention
Add Cloud NAT (+ Cloud Router) for the region; for OS/package repos, Google mirrors are reachable via PGA. Standardize NAT + PGA in the subnet module.
gcloud compute routers nats describe NAT --router=RT --region=REGION
gcloud compute firewall-rules list --filter="direction=EGRESS"Causes: Private Google Access not enabled on the subnet; DNS not resolving *.googleapis.com to the private VIP; no route to private.googleapis.com (199.36.153.8/30); firewall egress blocking 443 to that range; or VPC-SC perimeter blocking. Fix: enable PGA on the subnet, add the private-API DNS zone + route, allow egress 443 to the restricted/private VIP range. Use a Connectivity Test to confirm.
Causes: relying on transitive peering (not supported); overlapping CIDRs; missing firewall allowing the peer range; for Shared VPC, the service project's SA lacks compute.networkUser on the subnet, or resources were created in the wrong (local) network. Fix: use Shared VPC or NCC instead of peering chains; grant networkUser; ensure resources deploy into the shared subnet; open firewall for the source range.
Causes: CIDR overlap; Cloud Router not advertising the subnet, or on-prem not advertising its routes; VPN tunnel down (IKE mismatch) or Interconnect/BGP down; firewall not allowing the on-prem range. Fix: resolve overlap, verify BGP advertisements both ways, check tunnel/attachment state, open firewall. Console: Hybrid Connectivity > VPN / Interconnect; Cloud Routers.
Causes: firewall not allowing the health-check ranges (35.191.0.0/16 and 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app not listening or bound to localhost; wrong backend-service protocol. Fix: allow the health-check ranges to the backend SA/tag on the port; align the health check; bind to 0.0.0.0. Full flow in section 7.
Method: run a Connectivity Test (names the blocking rule/route), then Network Analyzer for config issues and VPC Flow Logs / Firewall Rules Logging to see drops. For Cloud NAT, check port-allocation exhaustion (increase min-ports or enable dynamic port allocation). For PSC, verify the endpoint, the DNS mapping, and that the producer accepts the connection.
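Port-allocation exhaustion is mostly arithmetic, so it helps to size it before raising min-ports. This is a rough Python sketch for static port allocation only. It assumes each NAT IP offers 64,512 usable source ports (65,536 minus the first 1,024) and that every VM reserves exactly the minimum; the function name is hypothetical, and the figure should be checked against current Cloud NAT documentation.

```python
import math

# Usable source ports per NAT IP address (assumption: 65,536 minus the first 1,024).
PORTS_PER_NAT_IP = 64_512


def nat_ips_needed(vm_count: int, min_ports_per_vm: int) -> int:
    """Estimate NAT IPs required when every VM reserves min_ports_per_vm ports."""
    if vm_count <= 0 or min_ports_per_vm <= 0:
        raise ValueError("vm_count and min_ports_per_vm must be positive")
    vms_per_ip = PORTS_PER_NAT_IP // min_ports_per_vm
    if vms_per_ip == 0:
        raise ValueError("min_ports_per_vm exceeds the ports of a single NAT IP")
    return math.ceil(vm_count / vms_per_ip)
```

With 64 ports per VM, one IP covers 1,008 VMs; raising min-ports to 1,024 drops that to 63 VMs per IP, which is why a min-ports increase often has to come with more NAT IPs.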
Google Cloud networking gotchas
- VPC is global, subnets are regional - don't build one VPC per region, and don't treat a subnet as global.
- PGA vs PSC vs private services access are different - pick by what you're reaching (Google APIs vs published service vs managed-DB private IP).
- Overlapping CIDRs break peering and hybrid - plan IP space early, avoid auto mode and the default network.
- Poor Shared VPC design - decide host/service split and who owns firewall/IP before workloads land.
- Databases/internal services on external IPs - use private IP + private access; block external IPs by Org Policy.
- Firewall direction & priority - rules are directional and first-match-by-priority; the implied deny-ingress/allow-egress is easy to forget.
- Peering is not transitive - use Shared VPC or NCC for hub-and-spoke.
- Cloud NAT is outbound only - inbound needs an LB or external IP.
- APIs not enabled - many "network" failures are actually a disabled API (compute, dns, servicenetworking).
- Egress & inter-region charges - internet egress and cross-region traffic are metered; keep chatty services co-located and use private access.
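CIDR overlap is cheap to catch before anything is peered or connected. Here is a minimal Python sketch using the standard ipaddress module; the range names and CIDRs in the usage note are hypothetical placeholders for your own IP plan, and the helper name is not from any Google tool.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]
```

For example, `find_overlaps({"prod-vpc": "10.0.0.0/16", "onprem": "10.0.128.0/20", "dev-vpc": "10.1.0.0/16"})` flags the prod-vpc/onprem pair. Running a check like this in the subnet module's CI is one way to keep overlaps out before peering or hybrid links exist.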
4. Compute Deep Dive
Compute Engine machine families, managed instance groups, Spot VMs, and the serverless options (Cloud Run, Functions, App Engine) - how to choose, place, scale, and operate compute on Google Cloud.
Compute Engine VMs come in machine families (general/compute/memory/accelerator-optimized) plus custom types. Use regional managed instance groups + autoscaling for HA and elasticity, instance templates as the blueprint, and Spot VMs for fault-tolerant batch. Prefer OS Login over metadata SSH keys and Shielded VMs by default. For new apps, consider Cloud Run (serverless containers) before managing VMs. Committed use discounts and right-sizing are the main cost levers.
Machine families
| Family | Series (examples) | Best for |
|---|---|---|
| General purpose | E2 (cost), N2/N2D/N4, C3/C3D, Tau T2D/T2A (scale-out; T2A is Arm) | Web, app, microservices, most workloads - the default |
| Compute optimized | C2/C2D, H3 | High per-core performance: gaming, HPC, CPU-bound apps |
| Memory optimized | M1/M2/M3 | Large in-memory: SAP HANA, large databases, in-memory analytics |
| Accelerator optimized | A2/A3 (NVIDIA GPUs), G2; TPUs separately | AI/ML training & inference, GPU/TPU workloads |
| Custom machine types | N-series custom vCPU/memory | Right-sizing when predefined shapes waste vCPU or memory (great for licensing) |
Custom types, Spot VMs, sole-tenant, GPUs/TPUs
| Option | What it does | Use when |
|---|---|---|
| Spot VMs | Deeply discounted VMs Google can preempt anytime (successor to preemptible) | Fault-tolerant, stateless, restartable batch/CI/render - never stateful prod |
| Sole-tenant nodes | Physical host dedicated to your project | Compliance/isolation, or per-core licensing that needs host affinity |
| GPUs | Attach NVIDIA GPUs to VMs (or use A3/G2) | ML training/inference, rendering, HPC |
| TPUs | Google's custom ML accelerators (Cloud TPU) | Large-scale training/inference on supported frameworks |
| Reservations | Reserve capacity of a machine type in a zone | Guaranteeing capacity for scale-out or DR failover |
| Committed use discounts (CUDs) | 1- or 3-year commitment in exchange for a large discount | Steady-state baseline compute (see section 14) |
Images, machine images, templates
- Public images (Debian, Ubuntu, RHEL, Windows Server, etc.) and custom images (your golden image).
- Machine images capture a full VM (disks + metadata) for cloning/backup; images capture a boot disk.
- Instance templates define shape, image, disks, network, metadata/startup script - the blueprint for MIGs.
- Startup scripts (and cloud-init on supported images) bootstrap a VM on boot; the metadata server (169.254.169.254) exposes metadata and workload credentials.
Managed instance groups & autoscaling
| Building block | Role |
|---|---|
| Instance template | Immutable blueprint for the VMs. |
| Managed Instance Group (MIG) | Creates/maintains identical VMs from a template. Regional MIGs spread across zones for HA; zonal MIGs don't. |
| Autoscaling | Scales the MIG by CPU, LB utilization, custom metrics, or schedule. |
| Autohealing | Recreates VMs failing a health check. |
| Rolling updates / canary | Update the template and roll instances gradually (surge/max-unavailable). |
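To build intuition for target-utilization autoscaling, the sketch below models the basic proportion: grow or shrink the group so that average utilization lands near the target. This is a deliberately simplified Python illustration, not the autoscaler's actual algorithm (which also applies stabilization and initialization periods); the function and parameter names are hypothetical, and utilization is given in whole percent.

```python
def recommended_size(
    current_size: int,
    observed_util_pct: int,
    target_util_pct: int,
    min_size: int,
    max_size: int,
) -> int:
    """Estimate a MIG size that brings average utilization back to the target."""
    if not 0 < target_util_pct <= 100:
        raise ValueError("target_util_pct must be between 1 and 100")
    load = current_size * observed_util_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    wanted = (load + target_util_pct - 1) // target_util_pct
    # Clamp to the autoscaler's configured bounds.
    return max(min_size, min(max_size, wanted))
```

Four VMs at 90% CPU with a 60% target suggest six VMs; the same group at 30% suggests two. The clamp is the part to watch in practice: a max size set too low hides a capacity problem behind a "healthy" autoscaler.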
Shielded VMs, Confidential VMs, OS Login
- Shielded VM - secure boot, vTPM, integrity monitoring; enable by default (some Org Policies require it).
- Confidential VM - memory encrypted in use (AMD SEV / Intel TDX) for sensitive workloads.
- OS Login - manage SSH access via IAM (and 2FA) instead of project/instance metadata SSH keys. Enforce it with constraints/compute.requireOsLogin.
- IAP for SSH/RDP - reach VMs with no external IP through Identity-Aware Proxy, gated by IAM.
Serverless & managed compute
| Service | What it is | Use for |
|---|---|---|
| Cloud Run | Serverless containers, scale-to-zero, request- or job-based | Most new stateless services/APIs and batch jobs - the default serverless choice |
| Cloud Functions | Event-driven functions (now Cloud Run functions) | Small event handlers, glue, automation |
| App Engine | PaaS for web apps (standard/flex) | Legacy/existing App Engine apps; new work usually goes to Cloud Run |
| Batch | Managed batch job scheduling on Compute Engine | Large batch/HPC jobs without managing a scheduler |
Choosing compute by workload
| Workload | Starting point |
|---|---|
| Web / API (stateless) | Cloud Run; or regional MIG (E2/N2/T2D) behind a global LB |
| Middleware | Regional MIG on N-series, memory-leaning |
| Databases (self-managed) | N2/C3 or M-series (large), regional PD/Hyperdisk; prefer managed DB (section 6) |
| Oracle workloads | Compute Engine VM (self-managed Oracle) or the Oracle Database@Google service; sole-tenant for licensing |
| SAP | Memory-optimized M-series (certified), reservations |
| Batch / CI / render | Spot VMs in a MIG, or Batch service |
| Memory-heavy | M-series or custom high-memory |
| CPU-heavy | C2/C2D/H3 compute-optimized, or C3 general purpose |
| GPU / AI training | A3/A2/G2 with GPUs, or TPUs; consider Vertex AI (section 12) |
| Cost-sensitive / spiky | E2 or T2D + autoscaling + CUDs; Spot for fault-tolerant parts |
| Event-driven / bursty | Cloud Run / Cloud Functions (scale to zero) |
Operational guidance
Resize a Compute Engine VM Ops
- Stop the VM, change the machine type (or custom vCPU/memory), start it - brief downtime. In a MIG, update the template and roll.
- Changing architecture (x86 ↔ Arm/T2A) is a rebuild, not a resize - watch for arch-specific binaries.
Patch VMs safely Ops
- Use VM Manager (OS patch management) for scheduled, reported patching across a fleet; combine with OS inventory.
- For MIGs, prefer replacing instances from a new patched image (immutable) over in-place patching.
Troubleshoot boot / SSH / high CPU / memory / disk Ops
- Boot/SSH: use the serial console to read boot output; verify OS Login/IAM roles and the IAP firewall range; check the VM isn't stopped.
- High CPU/memory: Cloud Monitoring (install the Ops Agent for memory metrics, which aren't collected by default); right-size or autoscale.
- Disk full: resize the persistent disk online, then grow the filesystem; alert at 85%.
- Disk attach: confirm the disk is attached and in the same zone; format/mount and add to fstab by UUID.
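The 85% disk threshold is normally enforced with a Cloud Monitoring alert fed by the Ops Agent, but the same check is easy to run ad hoc on a VM. Below is a small Python sketch using only the standard library; the function names are hypothetical, and the default threshold simply mirrors the figure above.

```python
import shutil

ALERT_THRESHOLD_PCT = 85  # matches the "alert at 85%" guidance above


def needs_alert(used_bytes: int, total_bytes: int,
                threshold_pct: int = ALERT_THRESHOLD_PCT) -> bool:
    """True when usage is at or above the threshold (integer math, no rounding)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100 >= total_bytes * threshold_pct


def check_mount(path: str = "/") -> bool:
    """Check a mounted filesystem on the local machine."""
    usage = shutil.disk_usage(path)
    return needs_alert(usage.used, usage.total)
```

`check_mount("/")` reports on the boot disk; pass the mount point of a data disk to check that instead.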
Design compute for production HA Design
- Regional MIG across ≥2 zones + autohealing behind a load balancer.
- Instance template + autoscaling for elasticity and reproducibility.
- No external IPs; OS Login + IAP for access; Shielded VMs.
- Ops Agent for metrics/logs; image pipeline for patched golden images; regional disks or managed data services for state.
- Reservations/CUDs for capacity + cost; a second region for DR.
5. Storage Deep Dive
Block (Persistent Disk, Hyperdisk, Local SSD), file (Filestore), and object (Cloud Storage) storage - their scope, performance, durability, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes.
Persistent Disk / Hyperdisk = network block storage for VMs (zonal or regional). Local SSD = ultra-fast but ephemeral (data lost on stop). Filestore = managed NFS for shared filesystems. Cloud Storage = object storage (buckets/objects) for backups, data lakes, static content, and archives - not a filesystem. Choose block for boot/DB, Filestore for shared POSIX, Cloud Storage for objects. Lock down buckets with uniform bucket-level access + public access prevention.
Block storage: Persistent Disk, Hyperdisk, Local SSD
| Type | Scope | Notes |
|---|---|---|
| Persistent Disk (pd-balanced/ssd/standard/extreme) | Zonal or regional (synchronously replicated across 2 zones) | Network block storage; resize online; snapshots. Regional PD is a key HA building block for stateful VMs. |
| Hyperdisk | Zonal (some regional) | Next-gen block storage with independently tunable IOPS/throughput (Balanced/Throughput/Extreme/ML) - decouple performance from capacity. |
| Local SSD | Zonal, attached to the VM's host | Ephemeral - data is lost when the VM is stopped, deleted, or preempted (it survives a guest reboot and live migration). Highest IOPS; only for scratch/cache/temp. |
Filestore
Filestore is managed NFS (v3) for shared POSIX filesystems mounted by many VMs/GKE pods. Tiers (Basic, Zonal, Regional, Enterprise) trade performance, capacity, and availability. Use it for shared application state, home directories, media/render scratch, and lift-and-shift apps that expect a filesystem.
Cloud Storage
- Buckets & objects - a bucket has a globally unique name, a location (region, dual-region, or multi-region), and a default storage class.
- Storage classes: Standard (hot), Nearline (~30-day), Coldline (~90-day), Archive (~365-day). Colder = cheaper storage, higher retrieval cost / minimum-storage-duration. Autoclass auto-moves objects between classes by access.
- Lifecycle management - rules to transition class or delete objects by age/version.
- Versioning keeps prior object versions; Object holds and retention policies + Bucket Lock give WORM/compliance immutability (a locked retention policy cannot be shortened or removed).
- Access: Uniform bucket-level access (UBLA) (IAM only, no per-object ACLs) + Public access prevention should be the default. Signed URLs grant time-boxed access without IAM.
- Cloud Storage FUSE mounts a bucket as a filesystem (with caveats - it is still object storage underneath).
- Transfer: Storage Transfer Service (online, from other clouds/on-prem/HTTP) and Transfer Appliance (physical, for very large datasets).
Enforce uniform bucket-level access and public access prevention by Org Policy (storage.publicAccessPrevention and storage.uniformBucketLevelAccess). Legacy per-object ACLs and allUsers grants are how buckets get accidentally exposed. Use signed URLs for controlled external sharing, keep them short-lived, and inventory them. For backup/compliance buckets, add versioning + a locked retention policy so data can't be deleted early (ransomware/accident protection).
Encryption
- All data is encrypted at rest by default with Google-managed keys.
- CMEK (customer-managed encryption keys) via Cloud KMS - you control rotation and can disable a key to render data unreadable. Use for sensitive/regulated data.
- CSEK (customer-supplied keys) - you provide the raw key (niche; you manage all key handling).
- Encryption in transit uses TLS across Google's network.
When to use which
| Need | Use |
|---|---|
| VM boot / DB datafiles | Persistent Disk or Hyperdisk (regional PD for zone-HA) |
| Ultra-fast scratch/cache | Local SSD (ephemeral - rebuildable data only) |
| Shared POSIX filesystem for many VMs/pods | Filestore |
| Backups (DB/app) | Cloud Storage (Nearline/Coldline) + lifecycle + retention |
| Log / long-term archive | Cloud Storage Archive + lifecycle + Bucket Lock |
| Data lake | Cloud Storage (Standard) - queried by BigQuery/BigLake |
| Static website / media | Cloud Storage + Cloud CDN |
| Bulk data into GCP | Storage Transfer Service (online) / Transfer Appliance (physical) |
| App/VM backup & DR | Backup and DR service; PD snapshots (scheduled) |
Practical examples
Database backups to Cloud Storage DBA
Managed DBs (Cloud SQL/AlloyDB) back up automatically; for self-managed DBs on VMs, dump/backup to a Cloud Storage bucket over Private Google Access (no internet). Lifecycle to Coldline/Archive; versioning + locked retention for immutability; enable a second-region copy (dual-region bucket or Transfer) for DR.
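The lifecycle part of this pattern can be expressed as a small JSON document. The Python sketch below builds one in the Cloud Storage lifecycle configuration shape (a top-level "rule" list of action/condition pairs); the day counts in the example are hypothetical, and the function name is illustrative only. Check the rule format against current Cloud Storage documentation before applying it.

```python
import json


def backup_lifecycle(coldline_after_days: int, archive_after_days: int,
                     delete_after_days: int) -> dict:
    """Build a lifecycle config that cools backups down, then deletes them."""
    if not 0 < coldline_after_days < archive_after_days < delete_after_days:
        raise ValueError("ages must be positive and strictly increasing")
    return {
        "rule": [
            {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
             "condition": {"age": coldline_after_days}},
            {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
             "condition": {"age": archive_after_days}},
            {"action": {"type": "Delete"},
             "condition": {"age": delete_after_days}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical schedule: Coldline at 30 days, Archive at 1 year, delete at 7 years.
    print(json.dumps(backup_lifecycle(30, 365, 2555), indent=2))
```

Keep the delete age at or beyond any locked retention period, so the lifecycle rule and the retention policy describe the same intent.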
Data lake on Cloud Storage Data
Raw / curated / consumption prefixes in Standard buckets; BigQuery external tables / BigLake read them; Dataplex governs. Lifecycle cold raw data to Nearline/Coldline. See section 11.
Shared filesystem for an app cluster Apps
Filestore instance mounted on all app VMs/GKE pods; firewall to the app subnet; Enterprise/Regional tier for HA; snapshots for recovery.
Storage gotchas
- Cloud Storage is object storage, not a filesystem - no random writes/locks.
- Archive/Coldline have retrieval and minimum-duration costs - don't put frequently-read or short-lived data there.
- Persistent Disk is zonal or regional - a zonal disk dies with its zone; use regional PD for HA.
- Local SSD is ephemeral - never the only copy of anything.
- Public-bucket mistakes - enforce UBLA + public access prevention by Org Policy.
- Signed URL risk - they're bearer tokens; keep them short and tracked.
- Locked retention policy - once Bucket Lock is applied you cannot shorten/delete it (that's the point) - set the duration carefully.
- Snapshot cost growth - scheduled snapshots accumulate; set retention.
- Cross-region replication / transfer cost - dual/multi-region and egress cost money and take time; a lagging copy isn't DR.
- Wrong storage class - Standard for hot data you keep accessing; colder classes only for genuinely cold data.
6. Database Services Deep Dive
Google Cloud's database portfolio - Cloud SQL, AlloyDB, Spanner, Firestore, Bigtable, Memorystore - what each manages for you, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from Oracle.
For relational OLTP, start with Cloud SQL (managed PostgreSQL/MySQL/SQL Server); step up to AlloyDB (PostgreSQL-compatible, higher performance + analytics + vector) when you outgrow it, or Spanner for globally-distributed, horizontally-scalable strong consistency. For NoSQL, Firestore (document, app backends) and Bigtable (wide-column, huge scale, time-series). Memorystore for Redis/Valkey/Memcached caching. Choose the least you must manage that meets the workload; managed services own patching/backup/HA, you own schema, queries, and access.
The portfolio at a glance
| Service | Model | Sweet spot | You manage | Google manages |
|---|---|---|---|---|
| Cloud SQL | Managed PostgreSQL / MySQL / SQL Server | Standard relational OLTP, lift-and-shift | Schema, queries, flags, access | Provisioning, patching (in window), backups, HA, replicas |
| AlloyDB | PostgreSQL-compatible, Google-enhanced | Demanding PostgreSQL, HTAP, PostgreSQL + analytics + vector | Schema, queries, access | Patching, backups, HA, autoscaling read pools |
| Spanner | Globally-distributed, horizontally scalable, strongly consistent relational | Global OLTP, unlimited scale, five-nines | Schema (different mindset), queries, access | Almost everything - sharding, replication, HA |
| Firestore | Serverless document NoSQL | Web/mobile app backends, real-time sync | Data model, security rules, indexes | Scaling, replication, HA |
| Bigtable | Wide-column NoSQL, petabyte-scale, low latency | Time-series, IoT, adtech, huge key-value/analytics | Row-key design (critical), schema | Scaling, replication |
| Memorystore | Managed Redis / Valkey / Memcached | Cache, session store, leaderboards | Keys/TTL, client | Provisioning, patching, HA |
Service deep dives
Cloud SQL (PostgreSQL, MySQL, SQL Server)
- HA - regional, synchronous standby in another zone with automatic failover (enable HA; it is not on by default).
- Read replicas - in-region and cross-region replicas for read scaling and DR; a cross-region replica can be promoted for regional DR.
- Backups - automated daily backups + point-in-time recovery (binary/WAL logging); on-demand backups; you set retention.
- Patching - Google patches during your maintenance window; you choose timing and get notifications, but you don't control every patch.
- Connectivity - private IP (via private services access), the Cloud SQL Auth Proxy (IAM-authenticated, encrypted), authorized networks (public IP), and IAM database authentication (Postgres/MySQL).
- SQL Server - license is included in the price (no BYOL for Cloud SQL SQL Server); watch edition/feature limits.
AlloyDB for PostgreSQL
PostgreSQL-compatible, Google-built engine aimed at demanding transactional and mixed (HTAP) workloads: a columnar accelerator for analytics, autoscaling read pools, and strong price/performance vs. self-managed Postgres. Supports vector search (pgvector + Google enhancements) for AI/RAG on operational data.
- HA - regional with automatic failover; read pools scale reads.
- Backups / PITR - continuous backup with point-in-time recovery.
- Use it when Cloud SQL Postgres runs out of headroom, when you want analytics on operational data without a separate warehouse, or for Postgres-native vector search at scale. AlloyDB Omni runs the engine on-prem/other clouds.
Spanner
Google's globally-distributed relational database: horizontal scale to virtually unlimited throughput, external (strong) consistency across regions, and up to 99.999% availability - no manual sharding. SQL interface (GoogleSQL/PostgreSQL dialect).
- Scaling - add compute (nodes/processing units); storage and throughput scale with it. No failover to manage.
- HA/DR - multi-region configurations replicate synchronously across regions; regional configs across zones.
- Mindset shift - schema and primary-key design must avoid hotspots (no monotonically increasing keys); interleaving models parent-child locality. Not a drop-in for a single-node RDBMS.
Firestore, Bigtable, Memorystore
- Firestore - serverless document database with real-time listeners and offline sync; great for web/mobile backends. Security Rules control client access. Not relational - model for your queries, mind index and hotspot limits.
- Bigtable - wide-column, low-latency, petabyte-scale for time-series, IoT, adtech, and analytics feeding. Row-key design is everything - a bad key hotspots a node. HBase-compatible API.
- Memorystore - managed Redis/Valkey/Memcached for caching, sessions, rate limiting; HA tiers with replicas/failover.
Database service decision table
| Workload | Recommended | Reason | HA | DR | Ops responsibility | Cost lever |
|---|---|---|---|---|---|---|
| PostgreSQL app DB | Cloud SQL PostgreSQL | Managed, standard OLTP | Regional HA (enable it) | Cross-region replica | Schema/queries | Right-size + CUD; auto-storage |
| MySQL web app | Cloud SQL MySQL | Managed, common | Regional HA | Cross-region replica | Schema/queries | Right-size; read replicas |
| SQL Server workload | Cloud SQL SQL Server | Managed, license included | Regional HA | Cross-region replica | Schema/queries | Edition/size choice |
| Demanding Postgres / HTAP / vector | AlloyDB | Performance + analytics + pgvector | Regional + read pools | Cross-region (config) | Schema/queries | Scale read pools |
| Global transactional | Spanner | Global scale + strong consistency | Built-in (multi-region) | Built-in | Schema/key design | Right-size compute units |
| Web/mobile app backend | Firestore | Serverless doc, real-time | Built-in | Multi-region option | Data model + rules | Query/index efficiency |
| High-scale NoSQL / time-series / IoT | Bigtable | Petabyte scale, low latency | Built-in (replication) | Multi-cluster | Row-key design | Node count; storage type |
| Cache / session | Memorystore | Managed Redis/Valkey | HA tier (replicas) | Rebuild from source | Keys/TTL | Right-size tier |
| Data warehouse | BigQuery (section 11) | Serverless analytics, not OLTP | Built-in | Multi-region dataset | Schema/queries | Slot/query cost control |
| Unsupported engine / full control (e.g. Oracle) | Compute Engine (self-managed) or Oracle DB@Google | Engine/version not offered managed | You build it | You build it | Everything | Sole-tenant/licensing |
Connectivity & observability
- Private IP (via private services access) is the production default - no public endpoint. Serverless VPC Access lets Cloud Run/Functions reach a private-IP DB.
- Cloud SQL Auth Proxy / connectors - IAM-authenticated, encrypted connections without managing SSL certs or IP allowlists.
- Authorized networks (public IP) - avoid; if used, restrict tightly and require SSL.
- Query Insights (Cloud SQL/AlloyDB) - query-level performance analysis; plus Cloud Monitoring metrics for CPU, connections, storage, replication lag.
How HA, DR, backup, and patching differ
| Service | HA | DR | Backup | Patching |
|---|---|---|---|---|
| Cloud SQL | Regional standby (opt-in), auto failover | Cross-region read replica → promote | Automated + PITR, you set retention | Google, in your maintenance window |
| AlloyDB | Regional, auto failover; read pools | Cross-region config | Continuous + PITR | Google, in window |
| Spanner | Built-in (multi-zone/region) | Multi-region config | Backups + PITR | Fully managed, transparent |
| Firestore / Bigtable | Built-in replication | Multi-region / multi-cluster | Managed backup/export | Fully managed |
Google Cloud database gotchas for Oracle DBAs
- Cloud SQL is managed, not self-managed - no OS/SYSDBA-level control, controlled flags, Google-run patching. Your runbooks change.
- AlloyDB is PostgreSQL-compatible, not Oracle-compatible - Oracle-to-AlloyDB is a full Postgres migration (schema + PL/SQL conversion), not lift-and-shift.
- Spanner needs a different data-modeling mindset - key design to avoid hotspots, interleaving for locality; no single-node RDBMS assumptions.
- Firestore and Bigtable are not relational - no joins/SQL; design for access patterns.
- Patching control differs by service - you schedule windows, Google patches; not your RMAN/opatch world.
- Backup access differs - backups are service-managed artifacts (+ PITR), not files you copy; export for portability.
- Private IP and DNS must be planned - reserve the private-services-access range; plan resolution before you build.
- Performance troubleshooting differs - Query Insights / Cloud Monitoring instead of AWR/ASH; different wait/metric vocabulary.
- Oracle itself: for Oracle Database you either self-manage on Compute Engine (you own everything, sole-tenant for licensing) or use the Oracle Database@Google Cloud partnership (Exadata/Autonomous run by Oracle inside Google Cloud) - there is no native "managed Oracle" like Cloud SQL.
Enterprise examples
PostgreSQL application database OLTP
Cloud SQL PostgreSQL with HA enabled, private IP, Auth Proxy from the app, automated backups + PITR, a cross-region read replica for DR, Query Insights on. Move to AlloyDB if you need more performance or in-DB analytics/vector.
Globally distributed transactional workload Global
Spanner multi-region config; schema designed for even key distribution; app uses the client library with strong reads where needed and stale reads for scale.
IoT / time-series at scale NoSQL
Bigtable with a row key that spreads writes (e.g. reversed/hashed device ID + timestamp), multi-cluster replication for HA, feeding BigQuery/Dataflow for analytics.
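One way to build such a key is to prefix the device ID with a short hash, so sequential device IDs scatter across the keyspace while each device's rows stay contiguous and time-ordered. This Python sketch is an illustration under assumptions, not a prescribed Bigtable schema: the "#" separator, the 4-character prefix, and the zero-padded epoch seconds are all hypothetical choices.

```python
import hashlib


def row_key(device_id: str, epoch_seconds: int, prefix_len: int = 4) -> str:
    """Build a row key of the form <hash prefix>#<device id>#<timestamp>."""
    if "#" in device_id:
        raise ValueError("device_id must not contain the '#' separator")
    # A stable hash prefix spreads devices across the keyspace.
    prefix = hashlib.sha256(device_id.encode("utf-8")).hexdigest()[:prefix_len]
    # Zero-padding keeps lexicographic order equal to chronological order.
    return f"{prefix}#{device_id}#{epoch_seconds:010d}"
```

Because the prefix depends only on the device ID, a prefix scan on `<hash prefix>#<device id>#` returns one device's readings in time order. If the newest readings should come first, a common variant stores a reversed timestamp instead.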
Self-managed Oracle on Compute Engine Oracle
When a managed option can't run the engine/version: Oracle on a memory-optimized VM, regional PD/Hyperdisk sized for IOPS, Data Guard you configure to a second-region VM, sole-tenant nodes for licensing, backups to Cloud Storage. Consider Oracle Database@Google for a managed alternative.
7. Load Balancing and Traffic Management
Google Cloud Load Balancing - the global and regional Application, Network, and proxy load balancers, their components (forwarding rule, target proxy, URL map, backend service, health check), and how to choose and debug them, with Cloud CDN and Cloud Armor.
Google Cloud Load Balancing is a family. The global external Application Load Balancer gives you one anycast IP serving users worldwide (L7, HTTP/S, with Cloud CDN + Cloud Armor). There are also regional external/internal Application LBs, passthrough Network LBs (L4, external/internal), and proxy Network LBs (L4 proxy). An Application LB is assembled from a forwarding rule → target proxy → URL map → backend service → backends, with health checks; proxy Network LBs have no URL map, and passthrough Network LBs go straight from the forwarding rule to the backend service. The #1 failure is a firewall not allowing the health-check ranges.
The load balancer family
| Load balancer | Layer / scope | Use for |
|---|---|---|
| Global external Application LB | L7, global, single anycast IP | Internet-facing web/APIs served worldwide; CDN + Cloud Armor; cross-region failover |
| Regional external Application LB | L7, regional | Regional internet-facing L7 (data-residency, regional-only) |
| Internal Application LB | L7, regional/cross-region internal | Internal microservice HTTP routing |
| External passthrough Network LB | L4, regional, preserves client IP | Non-HTTP TCP/UDP internet-facing; source-IP-sensitive |
| Internal passthrough Network LB | L4, regional internal | Internal TCP/UDP (e.g. internal service VIP, HA databases) |
| Proxy Network LB (external/internal) | L4 proxy | TCP with TLS offload / where a proxy is wanted (no client-IP preservation) |
Anatomy of a load balancer
- Forwarding rule - the frontend IP + port. Target proxy terminates and (for HTTPS) holds the certificate. URL map does host/path routing. Backend service defines the balancing policy, session affinity, timeouts, and health check. Backends are MIGs, NEGs (network endpoint groups - incl. serverless NEGs for Cloud Run/Functions/App Engine), or backend buckets (static content).
- SSL certificates - Google-managed certs (auto-provision/renew) or self-managed, via Certificate Manager for scale.
- Session affinity - client IP / cookie based, when needed.
Cloud CDN and Cloud Armor
- Cloud CDN - cache cacheable responses at Google's edge; enable on a backend service/bucket to cut latency and egress.
- Cloud Armor - edge WAF/DDoS for the global external Application LB: OWASP rules, rate limiting, geo/IP allow-deny, bot management, and adaptive protection. Attach a security policy to the backend service.
When to use which
Load balancer troubleshooting
Likely causes (in order)
- Firewall doesn't allow the health-check ranges 35.191.0.0/16 and 130.211.0.0/22 to the backend port - the #1 cause.
- Health check port/path/protocol wrong vs. what the app serves.
- App not listening / bound to localhost instead of 0.0.0.0.
- Wrong backend service protocol (HTTP vs HTTPS vs HTTP/2) or named port mismatch on the MIG.
- Cloud Armor rule or URL map misrouting; OS firewall on the VM.
Checks
gcloud compute backend-services get-health BACKEND --global
gcloud compute firewall-rules list --filter="sourceRanges~35.191 OR sourceRanges~130.211"
Fix / prevention
Allow the health-check ranges to the backend SA/tag on the port; align the health check; fix the named port/protocol; bind to 0.0.0.0. Template the LB + firewall together in Terraform.
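Since the health-check ranges are the #1 cause, it is worth checking them mechanically against a firewall rule's source ranges. Here is a minimal Python sketch using the standard ipaddress module; it handles IPv4 only, uses the two ranges quoted above, and the function name is hypothetical.

```python
import ipaddress

# Health-check source ranges quoted in this section.
HEALTH_CHECK_RANGES = [
    ipaddress.ip_network("35.191.0.0/16"),
    ipaddress.ip_network("130.211.0.0/22"),
]


def missing_health_check_ranges(allowed_source_ranges: list[str]) -> list[str]:
    """Return the health-check ranges not fully covered by the allowed ranges."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    return [
        str(hc)
        for hc in HEALTH_CHECK_RANGES
        if not any(hc.subnet_of(a) for a in allowed)
    ]
```

Pass it the sourceRanges of the ingress rule that targets the backend tag or service account; an empty list means both ranges are covered, though the rule must still allow the backend port.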
Causes: Google-managed cert stuck in PROVISIONING (the domain must resolve to the LB IP and DNS must be correct before it validates); wrong/missing domain on the cert; expired self-managed cert; HTTP→HTTPS redirect missing. Fix: point DNS at the LB IP first, then wait for provisioning; include all SANs; use Certificate Manager at scale; add a redirect URL map.
Causes: forwarding rule on the wrong IP/port/protocol; URL map path/host rule not matching (default backend catching everything); Cloud Armor rule denying legitimate clients (over-broad geo/IP or a WAF rule false positive). Fix: verify the forwarding-rule frontend; test URL map routing (path matchers, order); review Cloud Armor logs and preview mode before enforcing; tune the offending rule.
8. Security Deep Dive
Defense in depth on Google Cloud: identity, governance (Org Policy, VPC-SC), network, data, and detective controls (SCC, audit logs) - plus concrete guidance for securing projects, storage, compute, and databases, ending in a production checklist.
Layer your controls: IAM (least privilege, groups, no basic roles, no SA keys), governance (Organization Policies to forbid the risky thing; VPC Service Controls to stop data exfiltration), network (private IPs, firewall, no public exposure, IAP), data (CMEK, Secret Manager, DLP/Sensitive Data Protection), and detection (Security Command Center, centralized Cloud Audit Logs). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive guardrails over after-the-fact detection.
Google Cloud shared responsibility model
Google secures the infrastructure (physical, hardware, host, network fabric, and managed-service internals). You are responsible for: IAM and identity, data classification and access, network exposure and firewall, key management choices, workload/OS security (for IaaS/GKE nodes), secure configuration, and monitoring/response. The higher up the managed-service stack you go (VM → GKE → Cloud SQL → BigQuery/Cloud Run), the more Google handles - but data, access, and configuration always remain yours.
The control layers
| Layer | Controls | Key services |
|---|---|---|
| Identity & access | Who can do what | Cloud IAM, groups, deny policies, conditional IAM, Workload/Workforce Identity Federation |
| Governance | What is allowed at all; data can't leave | Organization Policies, VPC Service Controls, Access Context Manager |
| Network | What can reach what | Firewall + hierarchical policies, private IPs, Cloud Armor, IAP, Private Google Access/PSC |
| Data | Protect data at rest/in transit | Cloud KMS/HSM (CMEK), Secret Manager, Sensitive Data Protection (DLP), CA Service |
| Workload / supply chain | Trusted, hardened workloads | Shielded/Confidential VMs, Binary Authorization, Artifact Analysis, Web Security Scanner |
| Detective / posture | Find misconfig & threats | Security Command Center, Cloud Audit Logs, Cloud Logging, Chronicle |
Cloud KMS, HSM, and Secret Manager
- Cloud KMS - manage encryption keys for CMEK across storage, disks, databases, and app-level crypto. Cloud HSM gives FIPS 140-2 Level 3 hardware-backed keys; Cloud EKM lets keys live in an external KMS.
- Secret Manager - store API keys, DB passwords, certs as versioned secrets; workloads read them via IAM (no secrets in code, images, or env files).
- Certificate Authority Service - private CA for issuing internal certificates at scale.
Security Command Center and detection
Security Command Center is the central security and risk platform: asset inventory, misconfiguration findings (Security Health Analytics), threat detection (Event Threat Detection), attack-path/risk analysis, and posture management across the org. Turn it on org-wide.
Cloud Audit Logs: Admin Activity (always on), Data Access, System Event, and Policy Denied logs - your immutable evidence trail. Enable Data Access logs where needed and centralize with a sink.
Route org audit logs to a central logging project (log bucket + BigQuery + SIEM) via an aggregated sink for cross-project visibility and retention.
Chronicle for SIEM-scale threat analytics; Sensitive Data Protection (DLP) to discover/classify/mask PII in storage and BigQuery.
Perimeter and supply chain
- VPC Service Controls - a data-exfiltration perimeter around managed services so a valid identity can't copy BigQuery/Cloud Storage data to an outside project. The key control for sensitive data (see section 2).
- Identity-Aware Proxy (IAP) - context-aware access to apps/VMs without VPN or public IPs.
- Binary Authorization - only allow signed/attested container images to deploy (GKE/Cloud Run).
- Artifact Analysis - scan images in Artifact Registry for vulnerabilities; Web Security Scanner for app scanning.
- Confidential Computing - encrypt data in use (Confidential VMs/GKE).
How to secure specific things
Secure a production project (and multi-project env) Foundation
- Federate identities; enforce 2FA; drive access through groups; no basic roles; least-privilege predefined roles at project/resource scope.
- Preventive Org Policies at the org/folder: disable SA key creation, domain-restricted sharing, restrict resource locations, block external IPs, require OS Login/Shielded VM.
- Shared VPC for centralized network control; VPC-SC perimeter around data projects.
- SCC org-wide; aggregated audit-log sink to a logging project; budgets + quotas.
- CMEK + Secret Manager in a security project; break-glass Owner group, monitored.
Secure Cloud Storage Storage
- Uniform bucket-level access + public access prevention (enforced by Org Policy); no allUsers.
- CMEK for sensitive buckets; versioning + locked retention for backup/compliance.
- Signed URLs short-lived and inventoried; access via IAM + service accounts, not keys.
Secure Compute Engine Compute
- No external IPs; access via IAP; OS Login (+2FA) instead of metadata SSH keys; Shielded VMs.
- Target firewall rules by service account; allow only the IAP range for SSH.
- VM Manager for patch compliance; Ops Agent for logs/metrics; CMEK on disks for sensitive data.
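The "allow only the IAP range for SSH" rule above can be checked mechanically. The following is a minimal sketch, not a Google Cloud API client: the rule dictionaries are a hypothetical simplified shape, and the IAP TCP forwarding range (35.235.240.0/20) should be verified against current Google Cloud documentation before you rely on it.

```python
import ipaddress

# IAP TCP forwarding source range; verify against current Google Cloud docs.
IAP_RANGE = ipaddress.ip_network("35.235.240.0/20")
ADMIN_PORTS = {22, 3389}  # SSH and RDP


def audit_ingress_rule(rule):
    """Return findings for one simplified ingress allow rule.

    `rule` is a hypothetical dict: {"name": str, "source_ranges": [cidr],
    "ports": [int]}. It is not the Compute Engine API schema.
    """
    findings = []
    admin = sorted(ADMIN_PORTS.intersection(rule["ports"]))
    for cidr in rule["source_ranges"]:
        net = ipaddress.ip_network(cidr)
        if net.prefixlen == 0:
            findings.append(f"{rule['name']}: open to the internet ({cidr})")
        inside_iap = net.version == IAP_RANGE.version and net.subnet_of(IAP_RANGE)
        if admin and not inside_iap:
            findings.append(f"{rule['name']}: admin ports {admin} reachable from {cidr}")
    return findings


rules = [
    {"name": "allow-ssh-anywhere", "source_ranges": ["0.0.0.0/0"], "ports": [22]},
    {"name": "allow-ssh-iap", "source_ranges": ["35.235.240.0/20"], "ports": [22]},
]
for r in rules:
    for finding in audit_ingress_rule(r):
        print(finding)
```

In practice the rule list would come from an export of your firewall rules and policies; the check itself stays the same.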
Secure databases & public load balancers Data / Edge
- DBs on private IP, Auth Proxy/Serverless VPC Access, no public endpoint; CMEK; IAM DB auth where supported.
- Public HTTP behind the global external App LB + Cloud Armor (WAF, rate limiting, geo/IP); backends private.
- Reduce public exposure everywhere: block external IPs by Org Policy, prefer IAP + private access.
Production Google Cloud security checklist
- Human access federated; 2FA enforced; access granted via groups, never individuals.
- No basic roles (Owner/Editor/Viewer) in production; least-privilege predefined roles at project/resource scope.
- Service account key creation disabled org-wide; workloads use attached SAs / Workload Identity Federation / impersonation.
- Preventive Org Policies: block external IPs, restrict resource locations, domain-restricted sharing, require OS Login + Shielded VM.
- Security Command Center enabled org-wide with findings triaged.
- VPC Service Controls perimeter around projects holding sensitive data.
- Shared VPC / hierarchical firewall centrally managed; no broad 0.0.0.0/0 ingress; SSH via IAP range only.
- Databases and internal services on private IP; no public database endpoints.
- Public HTTP behind global external App LB + Cloud Armor; backends private.
- All sensitive data encrypted with CMEK; keys in a locked security project; rotation on.
- Secrets in Secret Manager; nothing sensitive in code, images, or metadata.
- Cloud Storage buckets: UBLA + public access prevention; versioning + retention on backups.
- Org-level aggregated audit-log sink to a central logging project (+ BigQuery/SIEM); Data Access logs on where needed.
- Alerts on IAM/policy changes, new SA keys, public exposure, and anomalous access.
- Budgets + quotas as guardrails; consistent labels/tags for attribution.
- DR and backups tested (restores verified), including CMEK key availability in the DR region.
Common security mistakes
- Granting Owner/Editor too broadly, or roles at org level.
- Long-lived service account keys (the top cause of serious incidents).
- Public Cloud Storage exposure (allUsers, legacy ACLs).
- Over-permissive firewall rules; broad public SSH instead of IAP.
- Public database endpoints with wide authorized networks.
- Not enabling / not centralizing audit logs.
- Storing secrets in code instead of Secret Manager.
- Not using impersonation; not using VPC Service Controls for sensitive data.
- Not enforcing Organization Policies (leaving the guardrails off).
9. Observability, Monitoring, and Operations
Cloud Monitoring, Cloud Logging, and the Cloud Operations suite - what to monitor per service, how to build useful alerts without noise, how to centralize logs across projects, and Active Assist for optimization.
Cloud Monitoring holds metrics, uptime checks, dashboards, and alerting policies that fire to notification channels. Cloud Logging collects logs; the Log Router sends them to log buckets, BigQuery, Pub/Sub, or Cloud Storage via sinks. Install the Ops Agent on VMs for memory/disk/process metrics and logs (not collected by default). Use Managed Service for Prometheus for container metrics. Alert on user-visible symptoms, route by severity, and centralize logs across projects with an aggregated sink.
The observability stack
| Service | Role |
|---|---|
| Cloud Monitoring | Metrics, uptime checks, dashboards, alerting policies, SLOs. |
| Alerting policies + notification channels | Threshold/absence conditions → email, PagerDuty, Slack, SMS, Pub/Sub, webhook. |
| Cloud Logging + Log Router + sinks | Collect, route, and retain logs; export to log buckets / BigQuery / Cloud Storage / Pub/Sub. |
| Cloud Audit Logs | Admin Activity, Data Access, System Event, Policy Denied - the control-plane record. |
| Error Reporting / Trace / Profiler | Aggregate errors; distributed latency traces; continuous CPU/heap profiling. |
| Managed Service for Prometheus | Prometheus-compatible metrics at scale (GKE and beyond). |
| Ops Agent | On-VM agent for host metrics (memory, disk, swap), process metrics, and logs. |
| Network Intelligence Center / Flow Logs | Network observability (see section 3). |
| Cloud Asset Inventory / Recommender / Active Assist | Inventory, recommendations, and automated insights (cost, security, reliability). |
What to monitor per area
- Compute Engine - CPU, memory (agent), disk usage & IOPS/throughput, instance up/health, MIG size vs. target, autohealing events.
- Persistent Disk - Throughput/IOPS vs. provisioned limits, disk usage %, latency. Hitting the disk ceiling is a common hidden bottleneck.
- Cloud Storage - Request/error rates, object counts, and unusual access patterns (via Data Access logs).
- Cloud SQL / databases - CPU, memory, connections vs. limit, storage used %, replication lag, backup success, Query Insights.
- Load balancers - Backend health, request count, 5xx rate, latency, backend utilization.
- Network and security - VPN/Interconnect status, flow-log anomalies, firewall drops; SCC findings, audit-log anomalies (policy/key/public-exposure changes).
Building useful alerts
- Alert on symptoms users feel (5xx rate, unhealthy backends, DB down, high latency, SLO burn), not just causes.
- Use appropriate aligners/reducers and a duration to avoid flapping (e.g. mean over 5 min, not a single spike).
- Use absence conditions for signals that should always report (heartbeat, backup completion).
- Route by severity: critical → page; warning → ticket/Slack; info → dashboard.
- Adopt SLOs and alert on error-budget burn rather than raw thresholds where you can.
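The ideas in the list above (a duration window instead of a single sample, absence as its own state, burn rate, and routing by severity) can be sketched in a few lines. This is an illustration of the logic only, not Cloud Monitoring's implementation; the function names are hypothetical, and treating "too few samples" as absence is a simplification of a real absence condition.

```python
from statistics import mean


def evaluate_alert(samples, threshold, window):
    """Return "firing", "ok", or "absent" for the latest `window` samples.

    The mean over the window stands in for an aligner plus duration, so a
    single spike does not fire. Too few samples is reported as "absent".
    """
    if len(samples) < window:
        return "absent"
    return "firing" if mean(samples[-window:]) > threshold else "ok"


def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo_target)


def route(severity):
    """Map a severity to a destination, mirroring the routing bullet above."""
    return {"critical": "page", "warning": "ticket", "info": "dashboard"}[severity]


cpu = [40, 42, 99, 41, 43]             # one spike; the 5-sample mean is 53
print(evaluate_alert(cpu, 85, 5))      # the spike alone does not fire
print(round(burn_rate(0.01, 0.999)))   # 1% errors against a 99.9% SLO
print(route("critical"))
```

A burn rate well above 1 sustained over a short window is the usual paging signal; the exact windows and multipliers are a policy choice for your team.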
Example alerts to implement
| Alert | Condition | Severity |
|---|---|---|
| VM CPU high | CPU > 85% mean for 5-10 min | Warning → Critical |
| VM unavailable | Instance up-check fails / metric absent | Critical |
| Memory pressure | Memory (agent) > 90% | Warning |
| Disk usage / IOPS | Disk > 85% used; throughput near provisioned limit | Warning |
| LB unhealthy backend | Healthy backend count < desired | Critical |
| Cloud SQL CPU / storage / connections | CPU > 90%; storage > 85%; connections near max | Warning → Critical |
| Failed backups | Backup failure / success signal absent | Critical |
| VPN tunnel down / Interconnect issue | Tunnel/attachment status != up | Critical |
| Cloud Storage unusual access | Spike / unexpected public access (Data Access logs) | Security review |
| Cloud Run / Functions errors | 5xx rate or error count over threshold | Warning → Critical |
| Pub/Sub backlog high | Oldest unacked message age / undelivered count high | Warning |
| GKE pod crash loops | Container restart count rising | Warning |
Centralizing logs across projects
Create an aggregated sink at the org or folder level that routes all projects' logs (especially audit logs) to a central logging project - a log bucket for retention, BigQuery for analysis, and/or Pub/Sub to a SIEM. This gives cross-project security visibility and satisfies retention/compliance without per-project setup.
# Org-level aggregated sink of all audit logs to a central BigQuery dataset
gcloud logging sinks create org-audit-sink \
bigquery.googleapis.com/projects/central-logging/datasets/org_audit \
--organization=ORG_ID --include-children \
--log-filter='logName:"cloudaudit.googleapis.com"'
Active Assist & recommendations
Recommender / Active Assist surface actionable insights: IAM over-grants, idle VMs/disks/IPs, right-sizing, commitment recommendations, and reliability/security findings. Review them monthly (they feed the cost checklist in section 14). Service Health shows Google-side incidents and maintenance events affecting your resources.
10. Containers, Kubernetes, and Cloud Native
GKE (Autopilot and Standard), Cloud Run, and the serverless / event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and event-driven systems.
GKE (managed Kubernetes) for orchestrated, long-running microservices - Autopilot when you want Google to manage nodes, Standard when you need node-level control. Cloud Run for serverless containers that scale to zero (the default for most new stateless services). Cloud Functions for small event handlers. Around them: Artifact Registry, Cloud Build/Deploy, Pub/Sub, Eventarc, Workflows, Cloud Tasks/Scheduler, and Apigee/API Gateway. GKE workloads use Workload Identity (not SA keys) to call Google APIs.
The cloud-native services
| Service | What it is | Use for |
|---|---|---|
| GKE (Autopilot / Standard) | Managed Kubernetes control plane + nodes (or fully-managed nodes in Autopilot) | Orchestrated microservices, platform teams, portable K8s workloads |
| Cloud Run | Serverless containers (services & jobs), scale-to-zero | Most new stateless services/APIs and batch jobs |
| Cloud Functions | Event-driven functions (on Cloud Run) | Small event handlers, glue |
| App Engine | PaaS for web apps | Existing App Engine apps |
| Artifact Registry | Managed registry for images and packages | Store/scan images (Container Registry is legacy) |
| Cloud Build / Cloud Deploy | CI (build) and CD (progressive delivery) | Container build + deploy pipelines |
| Pub/Sub | Global messaging / event bus | Decoupling, streaming ingestion, fan-out |
| Eventarc | Event routing (from Google services / Pub/Sub) to Run/GKE/Workflows | Event-driven triggers |
| Workflows / Cloud Tasks / Cloud Scheduler | Orchestration / task queues / cron | Serverless orchestration and scheduling |
| Apigee / API Gateway | Full API management / lightweight API gateway | Publishing, securing, and managing APIs |
GKE deep dive
- Autopilot vs Standard: Autopilot manages nodes, scaling, and much security for you and bills per-pod - less to run, fewer knobs. Standard gives you node pools and full control (custom machine types, GPUs, DaemonSets that need node access) at more operational cost.
- Node pools (Standard) - groups of nodes with a machine type/image; scale and upgrade per pool; use Spot node pools for fault-tolerant workloads.
- Regional vs zonal clusters - regional replicates the control plane and spreads nodes across zones (HA); zonal is single-zone.
- Networking - VPC-native clusters use alias IP ranges (secondary subnet ranges) for pods and services; plan those ranges (pods need many IPs).
- Ingress / Gateway API - a Kubernetes Ingress or the newer Gateway API provisions a Google Cloud load balancer for HTTP routing; a Service type=LoadBalancer provisions an L4 LB.
- Workload Identity - map a Kubernetes service account to a Google service account so pods call Google APIs with short-lived credentials, no keys.
# Bind a Kubernetes SA to a Google SA (Workload Identity)
gcloud iam service-accounts add-iam-policy-binding GSA@PROJECT.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"
GKE vs Cloud Run vs Cloud Functions vs Compute Engine
Networking, IAM, and security for containers
- Networking - VPC-native GKE in a (shared) VPC subnet with pod/service alias ranges; private clusters keep the control-plane endpoint private; Cloud Run connects to VPC via Serverless VPC Access or Direct VPC egress.
- IAM - cluster access via IAM + Kubernetes RBAC; Workload Identity for pod-to-Google auth; Cloud Run/Functions run as a service account you set.
- Supply chain - scan images in Artifact Registry (Artifact Analysis); enforce Binary Authorization so only attested images deploy.
- Runtime - network policies, pod security, secrets from Secret Manager, and least-privilege service accounts.
- Monitoring - GKE integrates with Cloud Monitoring/Logging and Managed Service for Prometheus; Cloud Run emits request metrics and logs automatically.
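Because pod alias ranges are sized up front, it helps to estimate how many nodes a range can hold. The sketch below assumes the commonly documented GKE behaviour that each node receives a pod range with at least twice its max-pods setting, rounded up to a power of two (a /24 at the default of 110 pods per node). Treat that rule and the function names as assumptions and verify them against current GKE documentation.

```python
def per_node_prefix(max_pods_per_node):
    """Prefix length of the pod range carved out for each node.

    Assumes at least 2x max pods, rounded up to a power of two.
    """
    addresses = 1 << (2 * max_pods_per_node - 1).bit_length()
    return 32 - (addresses.bit_length() - 1)


def max_nodes(pod_range_prefix, max_pods_per_node=110):
    """How many nodes fit in a pod secondary range of the given prefix length."""
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < pod_range_prefix:
        return 0  # the range is smaller than a single node's slice
    return 2 ** (node_prefix - pod_range_prefix)


print(per_node_prefix(110))  # per-node slice at the default max pods
print(max_nodes(14))         # nodes supported by a /14 pod range
print(max_nodes(20, 32))     # a smaller range with max pods lowered to 32
```

Lowering max pods per node is the usual lever when the pod range is fixed and the cluster needs more nodes.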
CI/CD for containers
Cloud Build builds and tests images (triggered from a repo), pushes to Artifact Registry (scanned), and Cloud Deploy promotes them through environments (dev → staging → prod) with approvals and rollback. Binary Authorization gates what can deploy.
Architecture patterns
- Microservices on GKE - deployments behind Gateway API/Ingress LB, HPA autoscaling, optional service mesh (Cloud Service Mesh) for mTLS/traffic control, Workload Identity, Cloud Deploy pipelines.
- Microservices on Cloud Run - each service a container, private ingress + internal LB for service-to-service, Pub/Sub/Eventarc for async - minimal ops.
- Serverless function on a Cloud Storage event - as diagrammed; image/ETL/validation triggers.
- Event-driven architecture - Eventarc + Pub/Sub + Workflows + Cloud Run/Functions + Cloud Tasks for decoupled, resilient pipelines.
- Private container platform - private GKE cluster in a Shared VPC, internal LBs, Binary Authorization, no public endpoints.
Troubleshooting
Causes: Pending = no schedulable capacity or pod IP exhaustion (alias range too small) or resource requests too big; ImagePullBackOff = bad image path or missing Artifact Registry read permission on the node/Workload-Identity SA, or no route to the registry (private cluster without PGA/Artifact Registry access); CrashLoopBackOff = app failing on start (config/secret missing, bad liveness probe). Checks: kubectl describe pod, kubectl logs --previous, node capacity, alias-range free IPs. Fix: scale the node pool / fix alias range; grant artifactregistry.reader; fix probes/config; enable PGA for private clusters.
Causes: health checks failing (firewall not allowing 35.191.0.0/16 & 130.211.0.0/22, or wrong readiness probe); missing/managed cert not provisioned (DNS must point at the LB IP first); wrong BackendConfig/NEG; Ingress class/annotations misconfigured. Fix: allow health-check ranges, align readiness probe with the LB health check, point DNS then wait for cert, verify BackendConfig.
Cloud Run: new revision serving 100% but failing - check container starts and listens on $PORT, startup CPU boost, min instances, and the runtime SA's permissions; roll back to the previous revision. Functions timeout: raise the timeout/memory, make work idempotent, offload long work to a Cloud Run job. Pub/Sub backlog: slow/failing subscriber - check ack deadline, subscriber errors, and scale consumers; use a dead-letter topic for poison messages.
11. Analytics, Data, and Integration
Data is Google Cloud's strongest area. BigQuery, the lake/lakehouse stack (Cloud Storage, BigLake, Dataplex), the pipeline tools (Dataflow, Dataproc, Data Fusion, Dataform), streaming (Pub/Sub), CDC (Datastream), and BI (Looker) - with the BigQuery mental model that trips up newcomers.
Land data in Cloud Storage (the lake) and analyze it in BigQuery (serverless warehouse - storage and compute are separate). Transform with Dataflow (streaming/batch Beam), Dataproc (managed Spark/Hadoop), Data Fusion (visual ETL), or Dataform (SQL transformations in BigQuery). Ingest streams with Pub/Sub, CDC with Datastream, govern with Dataplex, and visualize with Looker / Looker Studio. BigQuery cost is driven by bytes scanned (on-demand) or slots (capacity) - query design directly affects the bill.
The services
| Service | Role |
|---|---|
| BigQuery | Serverless data warehouse - SQL analytics at petabyte scale; separated storage/compute; ML (BigQuery ML) and vector search built in. |
| BigLake | Unify BigQuery-managed and open-format (Parquet/Iceberg) data in Cloud Storage under one governed table interface. |
| Dataplex | Data governance, cataloging, lineage, quality, and organization across the lake/warehouse (Data Catalog is part of it). |
| Cloud Storage | The data lake landing/curation/consumption zones. |
| Dataflow | Fully-managed Apache Beam - unified streaming & batch pipelines. |
| Dataproc | Managed Spark/Hadoop clusters (and serverless Spark). |
| Data Fusion | Visual, code-free ETL/ELT (CDAP-based). |
| Dataform | SQL-based transformation/versioning inside BigQuery (ELT, tests, docs). |
| Pub/Sub | Global messaging for streaming ingestion and event distribution. |
| Datastream | Serverless CDC from operational DBs (Oracle, MySQL, PostgreSQL) into BigQuery/Cloud Storage. |
| Cloud Composer | Managed Apache Airflow for orchestration. |
| Analytics Hub | Securely share/exchange datasets across projects/orgs. |
| Looker / Looker Studio | Governed BI/semantic modeling (Looker) and self-serve dashboards (Looker Studio). |
The BigQuery mental model
- Storage and compute are separated. Data sits in columnar storage; queries spin up compute (slots) on demand. You are not managing a server with a fixed size.
- It is analytical (OLAP), not transactional. Great for scans/aggregations over huge tables; wrong for high-rate single-row OLTP (use Cloud SQL/Spanner for that).
- SQL, but different operations. No indexes in the OLTP sense; performance comes from partitioning (by date/ingestion time) and clustering (by high-cardinality filter columns) to prune data.
- Query design drives cost. On-demand pricing charges by bytes scanned - SELECT * and unpartitioned full scans are expensive. Select only needed columns; filter on partition/cluster keys.
- Slots are the unit of compute. On-demand auto-allocates; capacity (editions) reserves slots for predictable cost/performance and lets you autoscale.
- Loading options differ: batch load (free load), streaming inserts / Storage Write API (real-time, priced), and external/BigLake tables (query data in Cloud Storage in place) each have different cost/latency trade-offs.
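To make the "query design drives cost" point concrete, here is a rough back-of-the-envelope model of bytes scanned. It assumes evenly sized columns and partitions, which real tables rarely have, and the price per TiB is a placeholder parameter, not a quoted Google Cloud price.

```python
TIB = 1024 ** 4


def scanned_bytes(table_bytes, column_fraction=1.0, partitions_read=1, partitions_total=1):
    """Rough bytes scanned by a query.

    Columnar storage reads only the selected columns, and a partition filter
    skips whole partitions. Assumes evenly sized columns and partitions.
    """
    return table_bytes * column_fraction * partitions_read / partitions_total


def on_demand_cost(bytes_scanned, usd_per_tib):
    """On-demand cost for a scan; `usd_per_tib` is a placeholder rate."""
    return bytes_scanned / TIB * usd_per_tib


table = 10 * TIB
full = scanned_bytes(table)                # SELECT * with no partition filter
pruned = scanned_bytes(table, 0.2, 7, 365)  # 20% of columns, 7 of 365 daily partitions
rate = 5.0                                  # hypothetical USD per TiB
print(on_demand_cost(full, rate), round(on_demand_cost(pruned, rate), 2))
```

The ratio between the two numbers is the point, not the absolute values: narrowing columns and partitions multiplies together.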
SELECT * across an unpartitioned multi-TB table, per-row updates, or expecting OLTP latency. That scans (and bills) enormous data and performs poorly. Partition + cluster tables, select only needed columns, and keep transactional workloads in Cloud SQL/Spanner.
Common data patterns
| Pattern | Built from |
|---|---|
| Data lake | Cloud Storage (raw/curated/consumption) + Dataplex + BigLake |
| Data warehouse | BigQuery + Dataform/Dataflow loads + Looker |
| Lakehouse | BigLake over open formats (Iceberg/Parquet in GCS) + Dataplex governance, queried by BigQuery |
| ETL / ELT | Dataflow / Dataproc / Data Fusion (E/T) + Dataform (in-warehouse T) |
| Streaming ingestion | Pub/Sub → Dataflow → BigQuery (or Storage Write API direct) |
| CDC | Datastream → BigQuery / Cloud Storage |
| Reporting / BI | BigQuery + Looker (governed) / Looker Studio (self-serve) |
| AI-ready data | Curated lake + BigQuery ML / vector search + Vertex AI (section 12) |
| Cross-org data sharing | Analytics Hub (publish/subscribe datasets) with governance |
Data governance with Dataplex
Dataplex gives you a data catalog, business/technical metadata, data lineage, data quality, and policy-based access across lakes and BigQuery - so a growing lake stays a governed asset instead of a "data swamp." Combine with VPC Service Controls (perimeter around BigQuery/Storage) and column/row-level security in BigQuery for sensitive data.
Reference architecture: lakehouse + BI
BigQuery cost control
- Partition and cluster big tables so queries prune to the relevant data.
- Select only needed columns; avoid SELECT *. Preview cost with the dry-run estimator before running.
- Set custom quotas / maximum bytes billed per query to cap runaway scans.
- Choose the right pricing: on-demand (per-byte) for spiky/low volume, capacity (editions with slot reservations + autoscaling) for predictable heavy workloads.
- Use materialized views and BI Engine for repeated aggregations; expire temp datasets.
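Two of the controls above reduce to simple arithmetic: a per-query byte cap, and the break-even point between on-demand and capacity pricing. The sketch below models both under stated assumptions; the function names are hypothetical, the prices are placeholders rather than real rates, and the reservation is assumed to run all month with no autoscaling.

```python
def check_query(estimated_bytes, max_bytes_billed):
    """Mimic a maximum-bytes-billed cap using a dry-run estimate.

    Raises ValueError instead of running a query that would exceed the cap.
    """
    if estimated_bytes > max_bytes_billed:
        raise ValueError(
            f"estimate {estimated_bytes} bytes exceeds cap {max_bytes_billed} bytes"
        )
    return estimated_bytes


def breakeven_tib_per_month(slot_count, usd_per_slot_hour, usd_per_tib, hours=730):
    """TiB scanned per month at which an always-on reservation equals on-demand cost."""
    return slot_count * usd_per_slot_hour * hours / usd_per_tib


# Placeholder prices: 100 slots at 0.05 USD per slot-hour versus 5 USD per TiB.
print(breakeven_tib_per_month(100, 0.05, 5.0))
```

Below the break-even volume on-demand is cheaper; above it, capacity pricing starts to pay off, provided the reserved slots are enough to run the workload.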
12. AI, ML, and Generative AI on Google Cloud
Vertex AI as the unified ML platform, Gemini models and Agent Builder, vector search across AlloyDB / Cloud SQL / BigQuery, and the pretrained AI APIs - plus the enterprise RAG patterns and governance guardrails that separate a demo from something you can run on real data.
Vertex AI is the unified platform for building, tuning, deploying, and operating models - Model Garden (incl. Gemini), Vertex AI Studio, custom/AutoML training, endpoints, pipelines, feature store, model registry, and Agent Builder / Vertex AI Search for grounded assistants. For RAG, generate embeddings and store vectors in AlloyDB, Cloud SQL (pgvector), BigQuery, or Vertex Vector Search. BigQuery ML runs ML in SQL. The hard part is not the model - it is governing what the model can reach.
Vertex AI
| Capability | What it does |
|---|---|
| Model Garden | Catalog of Google (Gemini), open, and third-party models to deploy/tune. |
| Vertex AI Studio | Prompt, test, and tune generative models; multimodal. |
| Gemini models | Google's frontier multimodal LLMs (text, image, audio, video, code) served via Vertex AI. |
| Agent Builder / Vertex AI Search | Build grounded search and agents over your data with managed retrieval (less RAG plumbing). |
| AutoML / custom training | Train without/with your own code; distributed training on GPUs/TPUs. |
| Endpoints (online / batch prediction) | Serve models for low-latency online or high-throughput batch inference. |
| Pipelines / Feature Store / Experiments / Model Registry | MLOps: reproducible pipelines, feature serving, experiment tracking, versioned model governance. |
| Model Monitoring | Drift/skew detection and quality monitoring for deployed models. |
Pretrained AI APIs & BigQuery ML
- Document AI - Parse and extract structured data (text, tables, entities) from documents - invoices, forms, contracts.
- Vision, Speech, Translation, and Natural Language APIs - Pretrained APIs for image analysis, speech-to-text and text-to-speech, translation, and text/entity/sentiment analysis - no training needed.
- Dialogflow / Contact Center AI (CCAI) - Conversational agents and virtual call-center assistants, including generative agents.
- BigQuery ML - Train and run ML models (and call Gemini/embeddings) directly with SQL in BigQuery - great for analysts and data-resident ML.
Vector search options
| Option | Use when |
|---|---|
| AlloyDB vector search | Vectors alongside operational Postgres data, with high performance (Google's ScaNN-based indexing). |
| Cloud SQL for PostgreSQL (pgvector) | Vectors in an existing Cloud SQL Postgres, modest scale, simplest path. |
| BigQuery vector search | Vectors + embeddings at analytics scale, alongside your warehouse data, in SQL. |
| Vertex AI Vector Search | Purpose-built, very large-scale, low-latency similarity search (managed ANN). |
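All four options answer the same question: which stored vectors are closest to the query embedding. A minimal exact (brute-force) version is sketched below for intuition only. The managed services use approximate nearest-neighbour indexes rather than a full scan, and the tiny two-dimensional vectors and document IDs here are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, items, k=2):
    """Exact nearest neighbours over (doc_id, vector) pairs, best first."""
    ranked = sorted(items, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(top_k([1.0, 0.0], docs))
```

A full scan like this is linear in the number of vectors, which is why a dedicated ANN service becomes attractive as the corpus grows.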
WHERE filters so retrieval respects entitlements. Use Vertex Vector Search when scale/latency demands a dedicated ANN service. Either way, filter retrieved context to what the requesting user is allowed to see.
RAG architecture on Google Cloud
Enterprise patterns
| Pattern | How | Watch out for |
|---|---|---|
| Chat with documents | RAG over Cloud Storage docs + vector search + Gemini (or Vertex AI Search) | Chunking quality; stale index; citations |
| Chat with database | Retrieve from curated views; generate grounded answers | Never expose raw prod OLTP; use a serving layer |
| Natural language to SQL | Gemini proposes SQL against a governed schema/catalog | Validate/parametrize; read-only; no dynamic SQL |
| RAG with BigQuery | Embeddings + vector search in BigQuery over warehouse data | Column/row security on retrieval |
| Document processing pipeline | Document AI → extract → BigQuery / workflow | Human review of low-confidence extractions |
| Call center AI | CCAI / Dialogflow + Gemini + knowledge base | Grounding; escalation to humans |
| MLOps pipeline | Vertex AI Pipelines + Feature Store + Model Registry + Monitoring | Reproducibility; drift monitoring |
| Governed private GenAI | Private endpoints + VPC-SC + curated data + audit | Entitlement-aware retrieval |
Governance and security for GenAI
- Serving layer, always - agents/LLMs call a governed API (e.g. Cloud Run) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
- Entitlement-aware retrieval - filter retrieved context to what the requesting user may see (row/column/document level) so RAG cannot leak across users.
- Private & perimetered - keep model and data traffic private (Private Service Connect / private endpoints); wrap data services in VPC Service Controls.
- Credential hygiene - secrets in Secret Manager, access via service accounts / Workload Identity; the model never sees raw credentials.
- Auditability - log prompts, retrieved context IDs, and responses (per privacy rules) so answers are explainable.
- Responsible AI - safety filters, evaluation, and Model Monitoring for drift/quality; human review for consequential outputs.
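Entitlement-aware retrieval and auditability from the list above can be sketched as one step in the serving layer: drop any retrieved chunk the caller is not entitled to, and record which context IDs were used. The `Chunk` type, group names, and audit-record shape are hypothetical; a real system would resolve groups from your identity provider and write to a durable log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset


def entitled_context(chunks, user_groups, audit_log):
    """Keep only chunks the user may see and record which IDs were used."""
    groups = set(user_groups)
    visible = [c for c in chunks if groups & c.allowed_groups]
    audit_log.append(
        {"groups": sorted(groups), "context_ids": [c.doc_id for c in visible]}
    )
    return visible


retrieved = [
    Chunk("fin-1", "Q3 revenue summary", frozenset({"finance"})),
    Chunk("hr-7", "Salary bands", frozenset({"hr"})),
    Chunk("pub-2", "Holiday calendar", frozenset({"finance", "hr", "all-staff"})),
]
log = []
context = entitled_context(retrieved, ["finance"], log)
print([c.doc_id for c in context])
```

Filtering after retrieval is the simplest form; pushing the same entitlement predicate into the vector query itself avoids fetching chunks that will be discarded.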
Warnings (read before connecting AI to enterprise data)
- Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
- Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
- Protect credentials. No DB passwords, keys, or wallets in prompts, code, or agent memory. Use Secret Manager + service accounts / Workload Identity.
- Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
- Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
- Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
- Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; isolate and sanitize retrieved/user content.
- Check Gemini model availability, region, quota, and pricing before you design - these change frequently and vary by region.
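The "validated, read-only, curated schema" rule for NL-to-SQL can start with a conservative pre-check like the one below. This is a deliberately simple sketch with hypothetical names: it is a first-line filter, not a security boundary, and it does not handle every SQL form (comma joins and CTEs, for example). A production validator should use a real SQL parser and run under read-only credentials.

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant)\b", re.I
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def validate_generated_sql(sql, allowed_tables):
    """Return the statement if it looks like one read-only SELECT over allowed tables.

    Raises ValueError otherwise. Conservative: it may reject valid queries.
    """
    statement = sql.strip().rstrip(";").strip().replace("`", "")
    if ";" in statement:
        raise ValueError("multiple statements")
    if not re.match(r"select\b", statement, re.I):
        raise ValueError("not a SELECT")
    if FORBIDDEN.search(statement):
        raise ValueError("write or DDL keyword present")
    tables = {t.lower() for t in TABLE_REF.findall(statement)}
    unknown = tables - {t.lower() for t in allowed_tables}
    if not tables or unknown:
        raise ValueError(f"tables not in the curated schema: {sorted(unknown)}")
    return statement


allowed = {"reporting.sales_summary"}
ok = validate_generated_sql(
    "SELECT region, total FROM reporting.sales_summary WHERE region = @region;",
    allowed,
)
print(ok)
```

The query keeps its named parameter (@region) so user-supplied values are bound by the database client rather than concatenated into the SQL text.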
13. Migration and Disaster Recovery
Getting workloads into Google Cloud (VMs, databases, data) and keeping them recoverable - the migration tooling, the DR patterns by tier, and how RTO/RPO drive architecture and cost.
Migrate VMs with Migrate to Virtual Machines, databases with Database Migration Service (and Datastream for CDC/low-downtime), and bulk data with Storage Transfer Service / Transfer Appliance. Plan with Migration Center. For DR, choose per tier: backup & restore (cheapest, slow), cold/pilot light, warm standby, or hot/active-active. Global load balancing + Cloud DNS handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.
Migration tooling
| Move | Tooling | Notes |
|---|---|---|
| Assess & plan | Migration Center | Discovery, assessment, TCO, and grouping before you move |
| VMs | Migrate to Virtual Machines (MVM) | Lift-and-shift VMware/AWS/Azure/physical VMs to Compute Engine |
| VMware estates | Google Cloud VMware Engine | Run VMware as-is in GCP, then modernize gradually |
| Databases (low downtime) | Database Migration Service (DMS) + Datastream | Managed migrations to Cloud SQL/AlloyDB; DMS supports Oracle→PostgreSQL conversion |
| Bulk data | Storage Transfer Service / Transfer Appliance | Online from other clouds/on-prem/HTTP, or physical appliance for very large sets |
| Data warehouse | BigQuery Data Transfer Service + migration tooling | From Teradata/Redshift/others into BigQuery |
| Backup & DR | Backup and DR service | Application-consistent backup and DR orchestration for VMs/databases |
Database migration paths
| Source → target | Method | Downtime |
|---|---|---|
| PostgreSQL/MySQL → Cloud SQL | DMS continuous (logical replication) | Near-zero |
| PostgreSQL → AlloyDB | DMS continuous | Near-zero |
| Oracle → PostgreSQL/AlloyDB | DMS with schema/code conversion (heterogeneous) | Low, plus conversion effort |
| Oracle/MySQL/PostgreSQL → BigQuery/Storage | Datastream (CDC) | Near-zero (ongoing feed) |
| Any → self-managed on Compute Engine | Native dump/restore, replication, or Data Guard (Oracle) | Depends on method |
DR patterns
| Pattern | Standby state | RTO | RPO | Cost |
|---|---|---|---|---|
| Backup & restore | Backups in another region; nothing running | Hours+ | Since last backup | Lowest |
| Cold / pilot light | Core data replicated (e.g. cross-region replica); app off | Tens of min | Small (replica lag) | Low |
| Warm standby | Scaled-down full stack running in DR region | Minutes | Small | Medium |
| Hot / active-active | Both regions serving (global LB, Spanner/replicated data) | Near-zero | Near-zero | Highest + complexity |
Building blocks: cross-region Cloud SQL/AlloyDB replicas (promote for DR) or Spanner/Firestore multi-region (built-in); Cloud Storage dual/multi-region or Transfer for objects; regional PD and PD snapshots for VMs; the global external Application LB for automatic cross-region failover; and Cloud DNS for DNS-based failover where the LB doesn't cover it.
RTO and RPO
- RTO - how long you can be down → drives standby readiness and automation.
- RPO - how much data you can lose → drives replication mode (synchronous vs async vs backup interval).
- Zero data loss needs synchronous replication (regional HA, or Spanner multi-region) and low inter-region latency; verify the network and the performance trade-off.
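How RTO/RPO targets pick a pattern can be expressed as "the cheapest pattern whose achievable RTO and RPO both fit". In the sketch below the minute values attached to each pattern are illustrative assumptions chosen to match the qualitative table above (hours, tens of minutes, minutes, near-zero); they are not measured or guaranteed figures.

```python
# (pattern, assumed achievable RTO in minutes, assumed achievable RPO in minutes),
# ordered cheapest first. The numbers are illustrative assumptions only.
PATTERNS = [
    ("backup & restore", 240, 1440),
    ("cold / pilot light", 30, 5),
    ("warm standby", 5, 5),
    ("hot / active-active", 0, 0),
]


def choose_pattern(rto_minutes, rpo_minutes):
    """Return the cheapest pattern whose assumed RTO and RPO both fit the targets."""
    for name, rto, rpo in PATTERNS:
        if rto <= rto_minutes and rpo <= rpo_minutes:
            return name
    return PATTERNS[-1][0]  # nothing cheaper fits, so fall back to the strongest


print(choose_pattern(480, 1440))  # a relaxed tier
print(choose_pattern(10, 5))      # a tight tier
```

Running this per application tier, rather than once for the whole estate, is what keeps DR cost proportional to business impact.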
Architecture examples
- On-prem VM → GCP: Migrate to VMs; cut over with the network in place.
- DB → Cloud SQL: DMS continuous, cut over at low replication lag.
- PostgreSQL → AlloyDB: DMS; validate performance/analytics gains.
- Oracle → Compute Engine (where required): self-managed, Data Guard to a second-region VM, sole-tenant for licensing.
- Cross-region DR (app): global LB + warm MIG + storage replication.
- Cross-region DR (database): cross-region replica (promote) or multi-region Spanner/Firestore.
- Backup-based DR: Backup and DR service / cross-region snapshots, rebuild on demand.
DR testing
- Cross-region replica within RPO (monitor replication lag); promotion rehearsed.
- CMEK keys present and usable in the DR region.
- App tier can start and connect in DR; config points to DR endpoints.
- Global LB / Cloud DNS failover tested and time-measured.
- Object data (dual/multi-region or replicated) within RPO.
- Capacity available in DR (reservations if RTO is tight); runbook current.
14. Cost Management and Governance
How Google Cloud charges, the tools to track and cap spend (billing export, budgets, quotas), the discount levers (CUDs, sustained-use, Spot), and the governance model - ending in a monthly cost-review checklist.
Google Cloud bills mainly by compute (vCPU/memory-hours), storage GB, BigQuery bytes scanned or slots, and network egress. Track with billing export to BigQuery + Cost tables/dashboards, cap with Budgets & alerts (notify) and quotas (block). The big levers: committed use discounts for steady state, sustained-use discounts (automatic on some families), Spot VMs for fault-tolerant work, right-sizing / custom machine types, BigQuery query controls, storage lifecycle, and killing idle resources. Governance = resource hierarchy + Org Policies + budgets + labels.
Pricing basics
| Dimension | Charged on | Notes |
|---|---|---|
| Compute Engine | vCPU + memory per second (per machine type) | Sustained-use discounts on some families; CUDs; Spot for big savings |
| Persistent Disk / storage | Provisioned GB-month (+ IOPS/throughput for Hyperdisk) | Snapshots add up; choose disk type deliberately |
| Cloud Storage | GB-month by class + operations + retrieval (colder classes) | Lifecycle to colder classes; watch retrieval/egress |
| BigQuery | Bytes scanned (on-demand) or slot-hours (capacity) + storage | Query design and partitioning dominate cost |
| Network | Internet egress + inter-region + some inter-zone; ingress free | Keep chatty services co-located; use private access |
| Managed services | Per-service (Cloud SQL vCPU/RAM/storage, Cloud Run per-request, etc.) | Right-size; scale-to-zero where possible |
| Logging / Monitoring | Ingestion volume (logs), some metrics | Exclude noisy logs; set retention |
Cost tracking tools
| Tool | Does |
|---|---|
| Billing export to BigQuery | Detailed usage/cost data for your own analysis and dashboards (turn on early). |
| Cost table / Cost breakdown / Reports | Console views of spend by project, service, SKU, label, time. |
| Budgets & alerts | Track spend against a target per billing account/project/label; alert at thresholds (and optionally trigger Pub/Sub automation). Budgets notify - they don't block. |
| Quotas | Hard caps on resource usage per project - the "block" control. |
| Pricing Calculator | Estimate before you build. |
| Recommender / Active Assist | Right-sizing, idle-resource, and commitment recommendations. |
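A budget with threshold alerts can be created from the CLI. The sketch below wraps `gcloud billing budgets create` in a hypothetical `create_budget` function; the 50% / 90% / forecasted-100% thresholds and the example name and amount are assumptions, and the billing account ID is a placeholder. Verify flag behavior against the current gcloud reference before relying on it.

```shell
# Create a budget that alerts at 50% and 90% of actual spend and at 100% of
# forecasted spend. Budgets notify only; they do not stop resources.
create_budget() {
  local billing_account="$1" name="$2" amount="$3"
  gcloud billing budgets create \
    --billing-account="$billing_account" \
    --display-name="$name" \
    --budget-amount="$amount" \
    --threshold-rule=percent=0.50 \
    --threshold-rule=percent=0.90 \
    --threshold-rule=percent=1.00,basis=forecasted-spend
}

# Usage (placeholder values):
#   create_budget BILLING_ACCOUNT_ID "app-prod-monthly" 1000USD
```

To cap usage rather than just report it, pair the budget with quotas, which are the blocking control.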
Discounts
- Committed use discounts (CUDs) - 1 or 3-year commitment (resource-based or spend-based) for a substantial discount on steady-state usage.
- Sustained-use discounts - automatic discounts for running certain machine families a large fraction of the month.
- Spot VMs - 60-90% off for preemptible, fault-tolerant workloads.
- Custom machine types - stop paying for vCPU or memory you don't use.
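To make the Spot range concrete, here is a small worked calculation. The `discounted_cost` helper and the 100.00 baseline are hypothetical; only the 60-90% range comes from the text above, and actual Spot prices vary by machine type and region.

```shell
# Cost after a percentage discount, printed to two decimals.
# $1 = on-demand cost, $2 = discount in percent.
discounted_cost() {
  awk -v base="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", base * (100 - pct) / 100 }'
}

discounted_cost 100 60   # 40.00
discounted_cost 100 90   # 10.00
```

A batch fleet that costs 100.00 on demand would land between 10.00 and 40.00 on Spot, provided the workload tolerates preemption.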
Governance model
Governance is enforced through the same primitives as security: the resource hierarchy (projects/folders for isolation and attribution), Organization Policies (restrict locations, machine types, external IPs), budgets + quotas, labels/tags, and a landing zone deployed as code. This keeps spend controlled and attributable by design rather than by cleanup.
Cost optimization examples
| Action | Typical saving | Effort |
|---|---|---|
| Stop / schedule non-prod VMs off-hours | High (up to ~65-70% of that compute) | Low |
| Right-size VMs (Recommender) / custom machine types | High | Low |
| Committed use discounts for baseline | High | Medium |
| Spot VMs for fault-tolerant / batch | Very high (60-90%) | Medium |
| Choose the right disk type / delete unused disks | Medium | Low |
| Cloud Storage lifecycle to colder classes | Medium-High | Low |
| BigQuery: partition/cluster, max-bytes-billed, capacity vs on-demand | High for heavy BQ | Medium |
| Reduce logging ingestion (exclusion filters) | Medium | Low |
| Reduce inter-region / egress traffic | Medium | Medium |
| Delete old snapshots & unused external IPs | Medium | Low |
| Cloud SQL right-sizing; scale-to-zero for serverless services (e.g. Cloud Run) | Medium-High | Low |
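The "~65-70%" figure for off-hours scheduling follows from simple arithmetic on a 168-hour week. A minimal sketch, assuming a hypothetical `schedule_saving_pct` helper that takes hours per day and days per week:

```shell
# Percentage of weekly compute hours avoided when a VM runs only on a schedule.
# $1 = hours running per day, $2 = days running per week (a week has 168 hours).
schedule_saving_pct() {
  awk -v h="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (1 - (h * d) / 168) * 100 }'
}

schedule_saving_pct 12 5   # 64.3
schedule_saving_pct 10 5   # 70.2
```

This covers compute hours only; attached disks and reserved IPs keep billing while the VM is stopped.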
Monthly Google Cloud cost review checklist
- Review cost reports month-over-month by project, service, and label; investigate spikes.
- Check each budget: which projects/labels are over or trending over target.
- Act on Recommender right-sizing and idle-resource recommendations.
- Confirm non-prod stop/schedule ran (nothing running 24x7 by accident).
- Find and delete unused persistent disks, orphaned snapshots, and idle VMs.
- Release unused external (static) IPs - they bill when unattached.
- Review CUD coverage vs. steady-state usage; buy/adjust commitments.
- BigQuery: top queries by bytes scanned; add partitioning/clustering; set max-bytes-billed; review on-demand vs capacity.
- Cloud Storage: are lifecycle rules moving cold data to Nearline/Coldline/Archive?
- Logging/Monitoring ingestion: exclude noisy logs; check retention settings.
- Review egress / inter-region charges; co-locate chatty services; use private access.
- Cloud SQL / managed DB sizing vs. actual utilization; scale down over-provisioned.
- Confirm every resource is labeled (cost-center/env/owner) for attribution.
- Validate quotas still reflect intent; check for anomalous new spend by service.
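Two of the checklist items (unused disks and unattached static IPs) can be listed from the CLI. The sketch below is read-only and assumes a hypothetical wrapper named `list_cleanup_candidates`; the filters rely on the `users` field of disks and the `status` / `addressType` fields of addresses, so confirm the output against the Console before deleting anything.

```shell
# Print cleanup candidates for one project. Read-only: nothing is deleted.
list_cleanup_candidates() {
  local project="$1"
  echo "== Persistent disks with no attached VM =="
  gcloud compute disks list --project="$project" \
    --filter="-users:*" \
    --format="table(name,zone.basename(),sizeGb)"
  echo "== Static external IPs reserved but not in use =="
  gcloud compute addresses list --project="$project" \
    --filter="status=RESERVED AND addressType=EXTERNAL" \
    --format="table(name,region.basename(),address)"
}

# Usage (PROJECT_ID is a placeholder):
#   list_cleanup_candidates PROJECT_ID
```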
15. Enterprise Architecture Patterns
Reference blueprints for real Google Cloud deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.
Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: global external App LB + Cloud Armor → private compute (Cloud Run / regional MIG / GKE) → managed database on private IP → Private Google Access / PSC for Google APIs → centralized logging → cross-region DR, all inside a governed landing zone.
Foundational three-tier (reference backbone)
| Business case | Standard internal/external web or enterprise app needing HA and controlled exposure. |
|---|---|
| Services | Shared VPC, global external App LB + Cloud Armor + Cloud CDN, Cloud Run or regional MIG, Cloud SQL/AlloyDB (private IP), Cloud NAT, PGA/PSC, Secret Manager, Cloud Monitoring/Logging. |
| Traffic flow | User → Cloud Armor/LB → app (private) → DB (private IP); app → Google APIs via PGA/PSC; egress via Cloud NAT. |
| Security | No external IPs; firewall by SA; DB private; CMEK; secrets in Secret Manager; Org Policies + VPC-SC; SCC on. |
| HA | Regional MIG / Cloud Run across zones; Cloud SQL HA; LB health-based routing. |
| DR | Second-region backends behind the same global LB + cross-region DB replica. |
| Monitoring | LB/backends, app, DB metrics; alerts → notification channels; central logs. |
| Cost | Cloud Run scale-to-zero or right-sized MIG + CUD; storage lifecycle; BQ controls. |
| Risks / mistakes | Health-check firewall rule missing; DB public IP; no zone spread; secrets in code. |
Pattern library
Simple web application (Small)
| Case | Low-complexity site/app, cost-sensitive. |
|---|---|
| Services | Cloud Run + Cloud SQL (or Firestore) + Cloud Storage for assets + global LB + Cloud Armor. |
| HA/DR/cost | Cloud Run multi-zone by default; Cloud SQL HA; scale-to-zero. Risk: public DB IP, no backups. |
Highly available application (HA)
| Case | Must survive zone (and ideally region) failure. |
|---|---|
| Services | Regional MIG / Cloud Run across zones, regional PD for state, Cloud SQL HA or Spanner, global LB. |
| DR | Second-region backends + cross-region DB replica or multi-region Spanner. Risk: state on a single zonal disk; untested failover. |
Private enterprise application (Regulated)
| Case | Internal-only, reachable from on-prem, no public footprint. |
|---|---|
| Services | Private subnets, internal Application LB, HA VPN/Interconnect via Cloud Router, PGA/PSC, IAP for admin, no external IPs. |
| Security/risk | VPC-SC perimeter; hierarchical firewall; DB private IP. Risk: CIDR overlap; DNS forwarding gaps. |
Shared VPC / centralized networking & security (Platform)
| Case | Many teams/projects with centrally-governed network, security, and logging. |
|---|---|
| Services | Host project (Shared VPC), hierarchical firewall, Cloud NAT/DNS, central logging project (aggregated sink), security project (SCC, KMS), org-level Org Policies. |
| Risk | Under-scoped networkUser grants; teams creating shadow VPCs; missing perimeter. |
Multi-project landing zone (Governance)
| Case | Governed foundation before workloads land. |
|---|---|
| Services | Org/folder/project hierarchy, baseline IAM (groups), preventive Org Policies, Shared VPC, central logging + SCC, budgets/quotas, labels - all Terraform (Cloud Foundation blueprints). |
| Risk | Skipping it and retrofitting governance later. |
Cloud SQL / AlloyDB application (DB)
| Case | Relational app backend. |
|---|---|
| Services | Cloud SQL (or AlloyDB) private IP + HA + PITR, Auth Proxy / Serverless VPC Access, cross-region replica for DR, Query Insights. |
| Risk | Public IP + broad authorized networks; HA not enabled; untested restore. |
BigQuery analytics platform / data lake (Data)
| Case | Enterprise analytics on curated + raw data. |
|---|---|
| Services | Cloud Storage lake (zones) + BigQuery/BigLake + Dataflow/Dataform + Datastream (CDC) + Dataplex (govern) + Looker; VPC-SC perimeter. |
| Cost/risk | Partition/cluster + query controls; column-level security. Risk: ungoverned "data swamp"; runaway scans. |
GKE platform (Cloud native)
| Case | Container platform for many microservices with CI/CD. |
|---|---|
| Services | Private GKE (Autopilot or Standard) in Shared VPC, Gateway API LB, Workload Identity, Artifact Registry (scanned) + Binary Authorization, Cloud Build/Deploy, service mesh optional. |
| Risk | Pod IP exhaustion; over-privileged Workload Identity; public control plane. |
Cloud Run serverless application (Serverless)
| Case | Stateless services/APIs with minimal ops. |
|---|---|
| Services | Cloud Run (services + jobs) behind global LB + Cloud Armor, Serverless VPC Access to private DB, Pub/Sub/Eventarc for async, Secret Manager. |
| Cost/risk | Scale-to-zero; per-request billing. Risk: cold-start latency for spiky critical paths (use min instances). |
Event-driven architecture (Events)
| Case | Decoupled, resilient processing pipelines. |
|---|---|
| Services | Eventarc + Pub/Sub + Cloud Run/Functions + Workflows + Cloud Tasks/Scheduler; dead-letter topics. |
| Risk | Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers. |
Hybrid cloud (Hybrid)
| Case | Workloads split across on-prem and GCP. |
|---|---|
| Services | Interconnect (primary) + HA VPN (backup) via Cloud Router, Shared VPC / NCC, hybrid Cloud DNS, hierarchical firewall. |
| Risk | CIDR overlap; expecting transitive peering; single link with no backup; asymmetric routing. |
Multi-region DR (DR)
| Case | Business-critical stack needing regional resilience. |
|---|---|
| Services | Global LB with multi-region backends, cross-region DB replica or multi-region Spanner/Firestore, dual/multi-region Cloud Storage, reservations, Backup and DR service. |
| Risk | Untested DR; CMEK key missing in DR region; capacity unavailable at failover. |
Secure landing zone (Security)
| Case | Preventive-guardrail foundation. |
|---|---|
| Services | Org Policies (no external IP, location restriction, no SA keys, OS Login), VPC-SC perimeters, central SCC + logging, KMS/Secret Manager, break-glass, budgets/quotas - as code. |
| Risk | Guardrails left off; over-broad break-glass. |
GenAI with private enterprise data (AI)
| Case | RAG/assistant over internal data, governed. |
|---|---|
| Services | Cloud Storage (docs) + vectors in AlloyDB/BigQuery + Gemini (Vertex AI) or Vertex AI Search behind a Cloud Run serving API + Secret Manager + VPC-SC + logging. |
| Flow / risk | Query → serving layer (authz + guardrails) → entitlement-filtered retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings). |
Common mistakes across these patterns:
- Databases/services on external IPs "to get it working"; missing private access planning.
- No zone/region spread - a zone event takes the whole "HA" tier.
- Health-check firewall ranges forgotten, so LB backends are unhealthy on day one.
- DR designed but never tested; CMEK keys missing in the DR region.
- Secrets in code/metadata instead of Secret Manager; long-lived SA keys.
- No centralized logging/SCC until an incident needs it.
- CIDR overlap / expecting transitive peering discovered during hybrid setup.
- Landing zone / Org Policies skipped and retrofitted painfully later.
16. Troubleshooting Guides
A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and gcloud where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.
Compute & access
Symptoms: SSH times out or is denied. Causes: firewall doesn't allow SSH from your source (allow the IAP range 35.235.240.0/20 and use IAP, or your CIDR); VM has no external IP and you're not using IAP; OS Login enabled but you lack roles/compute.osLogin (or osAdminLogin) / 2FA; VM stopped or boot failed; OS firewall. Checks: serial console for boot; IAM roles; firewall; gcloud compute ssh --tunnel-through-iap. Fix: grant OS Login role, open IAP range, use IAP tunneling. Prevention: standardize IAP + OS Login; no external IPs.
gcloud compute ssh VM --tunnel-through-iap --zone=ZONE
gcloud compute instances get-serial-port-output VM --zone=ZONE
Causes: bad fstab mount, full boot disk, kernel/driver issue, failed startup script. Checks: serial console output; startup-script logs. Fix: detach the boot disk, attach to a rescue VM, correct config, reattach; keep boot-disk snapshots. Prevention: test image changes in non-prod.
CPU: check the Monitoring trend; run top on the host; right-size/autoscale. Memory: requires the Ops Agent (memory isn't collected by default) - install it, then right-size. Disk full: resize the PD online, grow the filesystem, alert at 85%; clean logs/temp.
Causes: disk in a different zone than the VM; not formatted/mounted; wrong device name. Checks: gcloud compute instances describe; lsblk. Fix: attach in the same zone, format & mount by UUID in fstab. Prevention: regional PD for HA; automate mount in the startup script.
Storage
Denied: missing IAM (needs roles/storage.objectViewer or roles/storage.objectAdmin) at bucket or project; wrong project; UBLA on but you relied on an ACL; VPC-SC perimeter blocking; API disabled. Public access blocked: Org Policy storage.publicAccessPrevention is (correctly) enforcing - don't disable it; use signed URLs or IAM instead. Checks: gcloud storage buckets get-iam-policy gs://BUCKET; Policy Troubleshooter; audit logs. Fix: grant least-privilege IAM; use signed URLs for external sharing.
Causes: firewall blocking NFS between client and Filestore; wrong mount IP/path; client not in the allowed network. Fix: allow NFS ports from the client subnet, verify the mount target IP and export path, ensure same VPC/region reachability.
Network
Method: run a Connectivity Test (names the blocking route/firewall) and Network Analyzer for config issues. Then per case:
- Firewall: implied deny-ingress; check priority/direction; allow health-check (35.191.0.0/16, 130.211.0.0/22) and IAP (35.235.240.0/20) ranges; prefer SA targeting.
- Route: a custom route shadowing the default internet route; missing dynamic route from Cloud Router.
- Cloud NAT: outbound-only; port exhaustion (raise min-ports / enable dynamic allocation); NAT not covering the subnet/region.
- PGA: not enabled on the subnet; DNS/route for private.googleapis.com missing; VPC-SC blocking.
- PSC: endpoint/DNS mapping wrong; producer not accepting the connection.
- Peering: not transitive; overlapping ranges; missing firewall for the peer range.
- Shared VPC: service-project SA lacks compute.networkUser; resources created in the wrong network.
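When reading firewall logs it helps to tell quickly whether a denied source IP belongs to one of the Google ranges named in the firewall item. The sketch below is a pure-shell CIDR check; the function names are hypothetical, it handles IPv4 only, and the three ranges are the ones quoted in this section (verify them against current documentation).

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 falls inside CIDR block $2.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Name the Google range a source IP belongs to, if any.
classify_source() {
  if in_cidr "$1" 35.191.0.0/16 || in_cidr "$1" 130.211.0.0/22; then
    echo "health-check"
  elif in_cidr "$1" 35.235.240.0/20; then
    echo "iap"
  else
    echo "other"
  fi
}

classify_source 35.191.10.5    # health-check
classify_source 35.235.250.1   # iap
```

A denied packet classified as health-check or iap usually points at a missing allow rule rather than a routing problem.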
Causes: IKE/IPSec parameter mismatch (VPN); attachment/BGP session down (Interconnect); Cloud Router not advertising subnets or not learning on-prem routes; CIDR overlap. Checks: tunnel/attachment status; BGP session state and advertised/learned routes. Fix: align IKE params, fix BGP advertisements both ways, resolve overlap. Prevention: Interconnect + HA VPN backup; alarms on tunnel/BGP state.
Causes: private zone not attached to the VPC; missing record; forwarding/peering not set for hybrid; wrong resolver. Checks: dig from a VM; zone attachment. Fix: attach the private zone, add records, set inbound/outbound server policies for on-prem forwarding.
Load balancer & databases
Unhealthy: firewall not allowing health-check ranges (35.191.0.0/16, 130.211.0.0/22) to the backend port; wrong health-check port/path/protocol; app listening only on localhost; wrong backend protocol/named port. SSL: Google-managed cert stuck in PROVISIONING because DNS doesn't point at the LB IP yet; missing SAN; expired self-managed cert. Fix: per section 7.
Connection: use the Auth Proxy or private IP; check the runtime SA has roles/cloudsql.client; authorized networks / SSL for public IP; Serverless VPC Access for Cloud Run/Functions. Performance: Query Insights + Cloud Monitoring (CPU/connections/storage/lag); add read replicas; tune queries/flags. Backup failed: check storage, PITR/binary logging enabled, and quota; test a restore.
gcloud sql instances describe INSTANCE
./cloud-sql-proxy --private-ip PROJECT:REGION:INSTANCE
IAM & service accounts
Permission denied: walk the section 2 mental model - right project? which principal? role/permission? scope/inheritance? deny policy? Org Policy? VPC-SC? API enabled? For workloads: does the caller have actAs/tokenCreator on the SA, and does the SA have the role on the target? Impersonation: caller needs roles/iam.serviceAccountTokenCreator on the SA. SA key issue: key creation may be blocked by Org Policy (good) - use impersonation/Workload Identity instead; a leaked/rotated key stops working. Tools: Policy Troubleshooter, Policy Analyzer, audit logs.
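For the "role/permission?" step, a quick first check is to list what a principal holds directly on the project. The `roles_for_principal` wrapper below is hypothetical; it uses the standard gcloud `--flatten`, `--filter`, and `--format` output flags. It does not see bindings inherited from folders or the organization, deny policies, or conditions, so treat an empty result as a prompt to use Policy Troubleshooter, not as proof of no access.

```shell
# List the roles one principal holds directly on a project.
# $1 = project ID, $2 = principal, e.g. serviceAccount:NAME@PROJECT.iam.gserviceaccount.com
roles_for_principal() {
  local project="$1" principal="$2"
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.members:${principal}" \
    --format="value(bindings.role)"
}

# Usage (placeholder values):
#   roles_for_principal PROJECT_ID group:app@example.com
```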
Serverless & GKE
Cloud Run: container must listen on $PORT and start fast; check the runtime SA permissions and startup CPU; roll back a bad revision. Functions timeout: raise timeout/memory, make idempotent, offload long work. Pub/Sub backlog: slow/failing subscriber - check ack deadline, errors, scale consumers, add a dead-letter topic.
Pod: Pending (capacity / pod-IP exhaustion / requests too big), ImagePullBackOff (Artifact Registry read perms / PGA for private cluster), CrashLoopBackOff (config/probes). Ingress: allow health-check ranges; readiness probe aligned; managed cert needs DNS → LB IP first; verify BackendConfig/NEG. Tools: kubectl describe/logs. (Section 10.)
Observability
Alert: wrong metric/filter, threshold/duration never met, policy disabled, notification channel unverified, or maintenance suppression. Test by forcing the condition; use absence conditions for heartbeats. Logs missing: log not enabled (e.g. Data Access audit logs off), Ops Agent not installed on the VM, a log exclusion filter dropping them, wrong project/log bucket, or retention expired. Fix: enable the log, install the agent, review Log Router sinks/exclusions.
17. gcloud CLI, Terraform, and Automation
Practical, copy-friendly automation: gcloud setup and configurations, service-account impersonation, common commands, the Google Terraform provider, and clean examples for VPC, VMs, buckets, IAM, and alerts - plus state and structure practices.
The gcloud CLI uses named configurations (account + project + region). Prefer Application Default Credentials and service-account impersonation over downloaded keys. Build production infrastructure with Terraform (the Google provider); keep state remote and locked in a Cloud Storage backend, structure code into modules, and separate environments by workspace/backend + tfvars. Run it in a pipeline (Cloud Build) with an impersonated deployer SA - no keys.
gcloud CLI setup & configurations
# Install the Cloud SDK, then authenticate
gcloud auth login # human login
gcloud auth application-default login # ADC for local tools/Terraform
# Named configurations (switch between projects/accounts fast)
gcloud config configurations create prod
gcloud config set account jane@example.com
gcloud config set project acme-app-prod-01
gcloud config set compute/region us-central1
gcloud config configurations activate prod
gcloud config configurations list
Service-account impersonation (no keys)
# Your user needs roles/iam.serviceAccountTokenCreator on the SA
gcloud config set auth/impersonate_service_account deployer@PROJECT.iam.gserviceaccount.com
gcloud compute instances list # now runs AS the SA, short-lived token
# Or per-command:
gcloud storage ls --impersonate-service-account=deployer@PROJECT.iam.gserviceaccount.com
# Terraform: impersonate via ADC + provider setting (no key file)
Org Policy constraint that blocks key creation: iam.disableServiceAccountKeyCreation.
Common gcloud commands
# Projects / APIs
gcloud projects list
gcloud services enable compute.googleapis.com run.googleapis.com --project PROJECT
gcloud services list --enabled
# Compute
gcloud compute instances list
gcloud compute instances create web-1 --machine-type=e2-standard-2 --no-address --zone=us-central1-a
gcloud compute ssh web-1 --tunnel-through-iap --zone=us-central1-a
# Storage
gcloud storage ls
gcloud storage cp -r ./data gs://my-bucket/data
# IAM
gcloud projects add-iam-policy-binding PROJECT --member="group:app@example.com" --role="roles/run.developer"
gcloud projects get-iam-policy PROJECT
# Logs
gcloud logging read 'severity>=ERROR' --limit 20 --freshness 1h
Terraform provider setup
# versions.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 6.0" } # verify current major
  }
  # remote, locked state
  backend "gcs" {
    bucket = "acme-tfstate-prod"
    prefix = "app"
  }
}
provider "google" {
  project = var.project_id
  region  = var.region
  # Impersonate a deployer SA using ADC - no key file
  impersonate_service_account = var.deployer_sa
}
Create a custom-mode VPC + subnet + Cloud NAT
resource "google_compute_network" "vpc" {
  name                    = "app-vpc"
  auto_create_subnetworks = false # custom mode
}
resource "google_compute_subnetwork" "app" {
  name                     = "app-us-central1"
  ip_cidr_range            = "10.10.0.0/20"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true # Private Google Access
  secondary_ip_range { # for GKE pods, if needed
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16"
  }
}
resource "google_compute_router" "rt" {
  name    = "app-rt"
  region  = "us-central1"
  network = google_compute_network.vpc.id
}
resource "google_compute_router_nat" "nat" {
  name                               = "app-nat"
  router                             = google_compute_router.rt.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
Create a Compute Engine VM (no external IP, Shielded)
resource "google_compute_instance" "app" {
  name         = "app-1"
  machine_type = "e2-standard-2"
  zone         = "us-central1-a"
  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }
  network_interface {
    subnetwork = google_compute_subnetwork.app.id
    # no access_config block = no external IP
  }
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
  service_account {
    email  = google_service_account.app.email
    scopes = ["cloud-platform"]
  }
  metadata = { enable-oslogin = "TRUE" }
}
Create a hardened Cloud Storage bucket
resource "google_storage_bucket" "data" {
  name                        = "acme-app-data-prod"
  location                    = "US"
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  versioning { enabled = true }
  lifecycle_rule {
    condition { age = 30 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
  # encryption { default_kms_key_name = google_kms_crypto_key.data.id } # CMEK
}
IAM binding, service account, and a conditional binding
resource "google_service_account" "app" {
  account_id   = "app-runtime"
  display_name = "App runtime SA"
}
# Least-privilege: only object read on one bucket, granted to the runtime SA
resource "google_storage_bucket_iam_member" "read" {
  bucket = google_storage_bucket.data.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.app.email}"
}
# Conditional project binding (time-boxed / resource-scoped)
resource "google_project_iam_member" "cond" {
  project = var.project_id
  role    = "roles/compute.viewer"
  member  = "group:oncall@example.com"
  condition {
    title      = "prod-hours"
    expression = "request.time.getHours('America/New_York') >= 8 && request.time.getHours('America/New_York') < 20"
  }
}
Create a Cloud Monitoring alert
resource "google_monitoring_notification_channel" "email" {
  display_name = "ops-email"
  type         = "email"
  labels       = { email_address = "oncall@example.com" }
}
resource "google_monitoring_alert_policy" "cpu" {
  display_name = "VM CPU high"
  combiner     = "OR"
  conditions {
    display_name = "CPU > 85%"
    condition_threshold {
      filter          = "resource.type=\"gce_instance\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85
      duration        = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
  notification_channels = [google_monitoring_notification_channel.email.id]
}
State, structure, and CI/CD
- Remote, locked state: a GCS backend (object versioning on the state bucket) with state locking. Never keep prod state on a laptop; never commit state (it holds secrets).
- Modular structure: reusable modules (network, compute, data, iam, monitoring) composed per environment.
- Environment separation: separate state per env (workspaces or separate backends/prefixes) driven by dev.tfvars/prod.tfvars; separate projects and ideally separate deployer SAs/pipelines.
- CI/CD: run plan/apply in Cloud Build (or another pipeline) using an impersonated deployer SA via Workload Identity Federation - no keys. Gate apply with approvals; run plan on PRs for review.
- No secrets in code: reference Secret Manager / KMS by name; keep secret tfvars out of git.
gcp-infra/
  modules/
    network/ compute/ data/ iam/ monitoring/
  envs/
    dev/ main.tf dev.tfvars backend.tf
    prod/ main.tf prod.tfvars backend.tf
  cloudbuild.yaml
  README.md
Run plan in CI on every PR, and use Cloud Asset Inventory / drift detection to catch out-of-band changes.
18. Google Cloud Architecture Framework
The five pillars Google uses to review a design - operational excellence, security/privacy/compliance, reliability, cost optimization, and performance. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.
Run a design (or an existing system) through all five pillars. For each, ask the checklist questions, map to concrete Google Cloud services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass.
Operational excellence
What it means: run, monitor, and improve systems and processes reliably and repeatably - automation, observability, incident response, and change management.
Why it matters: most outages are caused by change and by not seeing problems early. Operational maturity is what turns a good design into a dependable service.
Supporting services: Cloud Monitoring/Logging, Error Reporting/Trace/Profiler, Cloud Build/Deploy, Terraform + Cloud Asset Inventory (IaC + drift), VM Manager, Service Health, Active Assist.
Practical examples: everything as code with peer-reviewed changes; SLOs with error budgets; centralized logs; golden images + automated patching; runbooks tied to alerts; blameless post-mortems.
Security, privacy, and compliance
What it means: protect identities, data, and workloads; meet regulatory obligations; and be able to prove it.
Why it matters: a single over-broad grant, public bucket, or long-lived key can undo everything else. Security is a design property, not an add-on.
Supporting services: Cloud IAM (least privilege, groups), Organization Policies, VPC Service Controls, Security Command Center, Cloud KMS/HSM + Secret Manager, Cloud Armor, IAP, Binary Authorization, Sensitive Data Protection, Cloud Audit Logs.
Practical examples: no basic roles; SA keys disabled org-wide; preventive Org Policies; VPC-SC around sensitive data; CMEK; private IPs + IAP; SCC org-wide; centralized audit logs. (See section 8's checklist.)
Reliability
What it means: the system meets its availability and durability targets and recovers from failures - designed around resource scope (zonal/regional/multi-region), redundancy, and tested DR.
Why it matters: reliability targets (SLOs) drive architecture and cost. You cannot bolt on availability after an outage.
Supporting services: regional MIGs + autohealing, regional PD, global external LB (health-based failover), Cloud SQL HA / cross-region replicas, Spanner/Firestore multi-region, Backup and DR, Cloud Monitoring SLOs.
Practical examples: multi-zone by default (regional resources); a defined DR pattern per tier with tested RTO/RPO; graceful degradation; capacity planning + reservations; error budgets governing release pace.
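An error budget is just the downtime an SLO permits over a window. A minimal worked calculation, assuming a hypothetical `error_budget_minutes` helper and a 30-day window:

```shell
# Allowed downtime in minutes for an availability SLO over a window.
# $1 = SLO in percent (e.g. 99.9), $2 = window length in days.
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" 'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30    # 43.2
error_budget_minutes 99.99 30   # 4.3
```

Moving from 99.9% to 99.99% cuts the monthly budget from about 43 minutes to about 4, which is usually the point where manual failover stops being viable and the DR pattern has to change.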
Cost optimization
What it means: deliver the required value at the lowest sustainable cost - right-sizing, discounts, eliminating waste, and attributing spend.
Why it matters: unmanaged cloud spend grows silently; cost is a first-class design and operational concern, not a finance afterthought.
Supporting services: billing export to BigQuery, Budgets + alerts, quotas, Recommender/Active Assist, CUDs, Spot VMs, custom machine types, storage lifecycle, BigQuery query controls.
Practical examples: labels + billing export for attribution; CUDs for baseline; Spot for batch; scheduled non-prod shutdown; storage lifecycle; BigQuery partitioning + max-bytes-billed; monthly review (section 14).
Performance optimization
What it means: resources meet latency/throughput requirements efficiently as demand changes - right machine types, autoscaling, caching, data locality, and query design.
Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.
Supporting services: machine families (C-series for CPU, M-series for memory, GPUs/TPUs), autoscaling (MIG/Cloud Run/GKE HPA), Cloud CDN, Memorystore, global LB, Hyperdisk (tunable IOPS), BigQuery partitioning/clustering/BI Engine.
Practical examples: match machine family to the bottleneck; autoscale on the right signal; cache at the edge (CDN) and in-memory (Memorystore); co-locate data and compute (reduce egress/latency); partition/cluster BigQuery; load-test before launch.
19. Learning Path
A structured route from Google Cloud fundamentals to enterprise-grade architecture, security, data, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.
Beginner
What to learn
- Fundamentals: global infra, regions/zones, resource scope, and the org/folder/project hierarchy (section 1).
- IAM basics: principals, predefined roles (not basic), inheritance, groups, service accounts (section 2).
- VPC basics: custom-mode VPC, regional subnets, firewall rules, Cloud NAT (section 3).
- Compute Engine basics: machine types, images, OS Login, IAP SSH (section 4).
- Cloud Storage basics: buckets, classes, UBLA/public access prevention (section 5).
- Cloud Monitoring/Logging basics: metrics, the Ops Agent, an alert, logs (section 9).
Why it matters
Every design rests on the hierarchy, IAM, and the global-VPC / regional-subnet model. Get these right and everything later is easier.
Hands-on labs
- Create a project; add a group; grant a predefined role; test access.
- Build a custom-mode VPC with a subnet, firewall rules, and Cloud NAT.
- Launch a VM with no external IP; SSH via IAP; use OS Login.
- Create a hardened bucket (UBLA + public access prevention); upload objects.
- Install the Ops Agent; create a CPU alert to an email channel.
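Before building the custom-mode VPC in the lab above, it can help to check the planned subnet ranges for overlaps, since subnet ranges in one VPC must not overlap. The sketch below does that with Python's standard `ipaddress` module; the subnet names and CIDR ranges are hypothetical and one range is deliberately wrong.

```python
import ipaddress
from itertools import combinations


def overlapping_subnets(plan: dict) -> list:
    """Return pairs of subnet names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical plan: one regional subnet per tier and region.
plan = {
    "app-us-central1": "10.10.0.0/20",
    "app-europe-west1": "10.20.0.0/20",
    "db-us-central1": "10.10.8.0/24",  # deliberately inside app-us-central1
}

for first, second in overlapping_subnets(plan):
    print(f"overlap: {first} <-> {second}")
```

Running the check on the plan before creating anything is cheaper than renumbering a subnet that already has workloads in it.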
Common mistakes
Using the default network or auto-mode VPCs; granting basic roles; per-user grants instead of groups; external IPs everywhere; forgetting that VM memory metrics require the Ops Agent.
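Two of these mistakes (basic roles and per-user grants) can be spotted mechanically in an exported IAM policy. The following is a minimal sketch that assumes the policy has the usual `bindings` / `role` / `members` JSON shape of an exported allow policy; the function name and the sample principals are made up for illustration.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def lint_policy(policy: dict) -> list:
    """Flag basic roles and grants made directly to individual users."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "")
        if role in BASIC_ROLES:
            findings.append(f"basic role in use: {role}")
        for member in binding.get("members", []):
            # Members are typed strings such as "user:", "group:", "serviceAccount:".
            if member.startswith("user:"):
                findings.append(f"per-user grant: {member} has {role}")
    return findings


# Hypothetical exported policy for a lab project.
sample_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:alice@example.com"]},
        {"role": "roles/compute.viewer", "members": ["group:ops@example.com"]},
    ]
}

for finding in lint_policy(sample_policy):
    print(finding)
```

The group binding with a predefined role passes cleanly, which is the pattern the lab above aims for.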
Expected outcome
You can stand up a properly segmented VPC, reach a private VM via IAP, use IAM correctly, and see basic telemetry.
Intermediate
What to learn
- Load balancing (global external App LB + Cloud Armor) and managed instance groups + autoscaling (sections 7, 4).
- Private networking: Private Google Access, Cloud NAT, private IP for services; Cloud VPN / Interconnect basics (section 3).
- Cloud SQL: HA, private IP, Auth Proxy, backups/PITR, read replicas (section 6).
- Cloud Logging (sinks) and Cloud KMS / Secret Manager (sections 9, 8).
- Security Command Center and Org Policies basics (section 8).
- Cost management: budgets, labels, billing export, committed use discounts (CUDs) (section 14).
Why it matters
This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.
Hands-on labs
- Deploy a 3-tier app: global LB + Cloud Armor → regional MIG (or Cloud Run) → Cloud SQL (private IP, HA).
- Allow the health-check ranges; confirm the backends report healthy; force a failover.
- Store the DB password in Secret Manager; connect via the Auth Proxy with a runtime SA.
- Create alerts (CPU, unhealthy backend, DB storage) and a notification channel.
- Set a budget + labels + billing export; add a couple of Org Policies (no external IP, restrict locations).
Common mistakes
Health-check firewall rule missing; DB on public IP; secrets in code; noisy alerts; no labels for attribution.
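The missing health-check firewall rule is easy to test for ahead of time. The sketch below checks whether a set of allowed ingress source ranges covers the probe ranges that Google documents for Application Load Balancer health checks (35.191.0.0/16 and 130.211.0.0/22 at the time of writing; verify with current Google Cloud documentation). The function name is an assumption, and the check is simplified: it only recognizes a probe range that sits entirely inside a single allowed range.

```python
import ipaddress

# Documented health-check probe ranges; confirm against current documentation.
HEALTH_CHECK_RANGES = ["35.191.0.0/16", "130.211.0.0/22"]


def uncovered_health_check_ranges(allowed_source_ranges: list) -> list:
    """Return the probe ranges that no single allowed source range contains."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    missing = []
    for probe_cidr in HEALTH_CHECK_RANGES:
        probe = ipaddress.ip_network(probe_cidr)
        covered = any(
            a.version == probe.version and probe.subnet_of(a) for a in allowed
        )
        if not covered:
            missing.append(probe_cidr)
    return missing


# A rule that only admits internal traffic leaves both probe ranges blocked.
print(uncovered_health_check_ranges(["10.0.0.0/8"]))
# Adding the two probe ranges closes the gap.
print(uncovered_health_check_ranges(["10.0.0.0/8", "35.191.0.0/16", "130.211.0.0/22"]))
```

An empty result means the probes can reach the backends at the network layer; backends can still report unhealthy if the port, path, or application response is wrong.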
Expected outcome
You can deploy a secure, monitored, HA application + managed database, connect it privately, and keep its cost and access under control.
Advanced
What to learn
- Shared VPC, Organization Policies, and VPC Service Controls; Private Service Connect (sections 3, 8, 2).
- GKE (Autopilot/Standard), Workload Identity, and Cloud Run at scale (section 10).
- Pub/Sub, Eventarc, Workflows for event-driven systems (section 10).
- BigQuery (mental model, partitioning/clustering, cost), Dataflow, Dataplex (section 11).
- Vertex AI, Gemini, and vector search / governed RAG (section 12).
- Multi-region DR (global LB + cross-region replicas / Spanner) (section 13).
- Terraform + remote state + CI/CD; a landing zone (sections 17, 1, 14).
- Enterprise security at scale: CMEK, Binary Authorization, centralized logging + SCC (section 8).
Why it matters
At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.
Hands-on labs
- Deploy a landing zone via Terraform: hierarchy, groups, Org Policies, Shared VPC, central logging + SCC, budgets.
- Stand up a private GKE (Autopilot) cluster in the Shared VPC with Workload Identity and a Cloud Build / Cloud Deploy pipeline.
- Build a BigQuery + Cloud Storage lakehouse with a Datastream CDC feed and Dataplex governance; tune a query with partitioning/clustering.
- Build a governed RAG assistant: Cloud Storage + vectors in AlloyDB/BigQuery + Gemini behind a Cloud Run serving API, with VPC Service Controls, audit logging, and entitlement-filtered retrieval.
- Implement cross-region DR for a Cloud SQL app (replica promotion) behind a global LB; rehearse failover and confirm the CMEK keys are usable in the DR region.
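For the BigQuery tuning lab in the list above, the effect of partition pruning can be estimated before running anything. This is a simplified model, assuming a date-partitioned table where a filter on the partition column limits the scan to matching partitions; it ignores column pruning and clustering, and the table sizes are invented for the example.

```python
from datetime import date, timedelta

GIB = 1024 ** 3


def bytes_scanned(partition_bytes: dict, start: date, end: date) -> int:
    """Sum the sizes of daily partitions whose date falls within [start, end]."""
    return sum(size for day, size in partition_bytes.items() if start <= day <= end)


# Hypothetical table: 90 daily partitions of 10 GiB each, starting 2024-01-01.
first_day = date(2024, 1, 1)
partitions = {first_day + timedelta(days=i): 10 * GIB for i in range(90)}

full_scan = sum(partitions.values())
pruned = bytes_scanned(partitions, date(2024, 3, 24), date(2024, 3, 30))

print(f"no partition filter: {full_scan // GIB} GiB")
print(f"7-day partition filter: {pruned // GIB} GiB")
```

Under on-demand pricing, where cost follows bytes processed, the ratio between the two numbers is roughly the ratio between the two bills, which is why the partition filter is the first thing to check on an expensive query.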
Common mistakes
Skipping the landing zone; DR never tested; over-privileged Workload Identity; pod-IP exhaustion; unbounded BigQuery scans; connecting AI to production data without a governed serving layer.
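Pod-IP exhaustion in particular can be predicted with simple arithmetic. The sketch below assumes GKE's documented VPC-native behavior of reserving, for each node, the smallest range that holds at least twice the maximum pods per node (a /24 for the Standard default of 110); verify with current Google Cloud documentation, as defaults differ by mode and version. The function names and the example ranges are hypothetical.

```python
import ipaddress


def per_node_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the smallest IPv4 block with >= 2x max pods addresses."""
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be >= 1")
    needed = 2 * max_pods_per_node
    host_bits = (needed - 1).bit_length()  # ceil(log2(needed)) without floats
    return 32 - host_bits


def max_nodes(pod_range_cidr: str, max_pods_per_node: int = 110) -> int:
    """Upper bound on nodes that a pod secondary range can supply with addresses."""
    pod_range = ipaddress.ip_network(pod_range_cidr)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < pod_range.prefixlen:
        return 0  # the range is smaller than a single node's allocation
    return 2 ** (node_prefix - pod_range.prefixlen)


print(max_nodes("10.4.0.0/14"))                        # 110 pods/node, /24 per node
print(max_nodes("10.4.0.0/20"))                        # same density, small range
print(max_nodes("10.4.0.0/20", max_pods_per_node=32))  # lower density, /26 per node
```

The comparison between the last two calls shows the usual trade-off: lowering max pods per node stretches a small range across more nodes, at the cost of pod density.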
Expected outcome
You can design and operate a governed, automated, multi-region Google Cloud platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.
Certification checkpoints (optional)
| Level | Typical certification track |
|---|---|
| Beginner | Cloud Digital Leader; Associate Cloud Engineer |
| Intermediate | Professional Cloud Architect; Professional Cloud Network Engineer |
| Advanced | Professional Cloud Security Engineer, Data Engineer, Database Engineer, DevOps Engineer, Machine Learning Engineer |