Microsoft Azure Deep Dive Portal

A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Azure environments - not a marketing overview.

19 deep sections | Architecture patterns | Troubleshooting runbooks | CLI / Bicep / Terraform | Well-Architected

Last reviewed: July 2026. Azure changes frequently - verify with current Microsoft documentation before production use.

WHO THIS IS FOR

Cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, DevOps, security engineers, Microsoft administrators - and anyone moving from traditional infrastructure or another cloud into Azure. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Azure and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, VM sizes, quotas, model availability, service names), a Verify with current Microsoft documentation flag.

Learn

Foundations first

Sections 1-2 establish the mental model: the governance hierarchy (management group / subscription / resource group), Entra ID, and the RBAC model everything depends on.

Build

Service deep dives

Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data, and AI - with diagrams, tables, and gotchas.

Operate

Run and govern

Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Well-Architected Framework, and a learning path.

Reading the callouts

Several note types recur. They flag the perspective that matters most for a point.

Architect note

Design-time decisions, trade-offs, and things to settle before production.

DBA note

Database-specific behavior - what Azure manages vs. what you manage, patching, backups, connectivity.

Security note

Exposure, least privilege, encryption, and audit considerations.

Cost note

Where money is spent and commonly wasted.

Operations note

Day-2 behavior: patching, scaling, maintenance, and reliability.

Common mistake

A specific error teams repeatedly make, and how to avoid it.

The Azure shared responsibility model (orientation)

Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Microsoft already does.

Layer	Virtual Machines (IaaS)	AKS	Azure SQL DB / PaaS DB	App Service / Functions
Physical / hypervisor	Microsoft	Microsoft	Microsoft	Microsoft
OS patching	You (Update Manager)	You (nodes) / MS (control plane)	Microsoft	Microsoft
Runtime / engine patching	You	Shared	Microsoft (in window)	Microsoft
Backup config	You (Azure Backup)	You	Managed, you configure	Managed / you export
Scaling / HA	You (Scale Sets, zones)	You configure	You enable zone-redundancy	Automatic
Data, schema, access, RBAC	You	You	You	You

The rule that never moves

Microsoft secures the cloud. You secure what you put in it: identities, RBAC, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.

1. Azure Fundamentals

The global infrastructure and the governance hierarchy (management groups, subscriptions, resource groups) that every Azure deployment is built on - plus the mental model that makes the rest of the platform predictable.

Last reviewed: July 2026. Verify region list, zones, quotas, and service availability in the portal.

What Azure is | Global infrastructure | Governance hierarchy | Mental model | ARM / Bicep / tools | Policy & locks | Tags | Landing zones

TL;DR

Azure is a set of regions (most paired for DR, many with Availability Zones) on Microsoft's global network. Governance is a hierarchy: Management Groups > Subscriptions > Resource Groups > resources, all under one Microsoft Entra tenant (the identity boundary). The subscription is the billing/deployment/quota boundary; the resource group is the lifecycle boundary. Azure Resource Manager (ARM) is the deployment/control plane; RBAC grants access and Azure Policy constrains what is allowed. Get the hierarchy and a landing zone right before production.

What Azure is

Microsoft Azure is Microsoft's public cloud: on-demand compute, storage, networking, databases, data, and AI services delivered from Microsoft-operated regions, consumed over Microsoft's global network, and billed by usage. Its distinctive strengths for enterprises are deep Microsoft ecosystem integration (Entra ID / Active Directory, Windows Server, SQL Server, Microsoft 365), strong hybrid tooling (Azure Arc, Azure Stack, ExpressRoute), and a mature governance model (management groups, Azure Policy, landing zones). If you come from traditional Windows/SQL/AD infrastructure, much of Azure will feel familiar. The two biggest early shifts are the resource hierarchy and the fact that identity lives in Entra ID, separate from Azure resource access (RBAC).

Azure global infrastructure

Concept	What it is	Protects against / used for
Region	A set of data centers in a geography; your primary deployment + residency boundary	Choose by latency, residency, and service/zone availability
Region pair	Most regions are paired with another region, usually in the same geography (some newer regions have no pair); Microsoft sequences updates and some services replicate to the pair	Regional DR target; ordered platform maintenance
Availability Zone (AZ)	Physically separate datacenters within a region (independent power/cooling/network)	Datacenter-level failure - spread across zones for in-region HA
Availability Set	A grouping that spreads VMs across fault domains (racks) and update domains (maintenance groups) within a single datacenter	Rack/maintenance failure when zones aren't used
Fault domain	A rack of hardware (shared power/network)	Anti-affinity within an availability set
Update domain	A group updated/rebooted together during planned maintenance	Ensures not all instances reboot at once
Sovereign clouds	Isolated clouds (Azure Government, Azure operated by 21Vianet in China)	Regulatory/sovereignty isolation - separate endpoints and features

Architect note - Availability Zones vs Availability Sets

Prefer Availability Zones (spread across physically separate datacenters, ~99.99% VM SLA with instances in 2+ zones) for new HA designs. Availability Sets only protect against rack/maintenance failure within one datacenter - use them in regions without zones or for tightly-coupled legacy tiers. A VM cannot be placed in both an Availability Set and an Availability Zone; decide per tier. Confirm zone support for your region and VM size.
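To make the fault-domain / update-domain idea concrete, here is a minimal sketch of how an Availability Set spreads VMs. It assumes simple round-robin placement and illustrative counts (3 fault domains, 5 update domains); the function names are hypothetical and the real placement logic is Azure's, not this code.

```python
from collections import Counter


def place_vms(vm_count, fault_domains, update_domains):
    """Assign each VM a (fault_domain, update_domain) pair in round-robin order."""
    return [(i % fault_domains, i % update_domains) for i in range(vm_count)]


def largest_group(placements, index):
    """Most VMs sharing a single domain (index 0 = fault domain, 1 = update domain)."""
    return max(Counter(p[index] for p in placements).values())


# Illustrative values only: 6 VMs, 3 fault domains, 5 update domains.
placements = place_vms(vm_count=6, fault_domains=3, update_domains=5)
print("VMs lost if one rack fails:", largest_group(placements, 0))
print("VMs rebooted together during maintenance:", largest_group(placements, 1))
```

The point of the exercise: the number that matters for capacity planning is the largest group, because that is how many instances a single rack failure or maintenance wave can take out at once.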

Common mistake

Assuming every region has Availability Zones or that "the region pair" is an automatic DR solution. Zone availability varies by region and service; region-pair replication only applies to specific services (e.g. GRS storage). You still design and test your own DR.

The governance hierarchy

Entra tenant > Management Groups > Subscriptions > Resource Groups > resources. RBAC and Azure Policy inherit downward.

Microsoft Entra tenant - the identity boundary (users, groups, apps). One tenant can hold many subscriptions. Identity (Entra) is separate from resource access (Azure RBAC) - a crucial distinction (section 2).
Management groups - a hierarchy above subscriptions for applying RBAC and Azure Policy at scale (e.g. a Platform MG and a Landing Zones MG under the root). Can nest.
Subscriptions - the unit of billing, quota/limits, and deployment. Also a strong isolation and blast-radius boundary. Large enterprises use many subscriptions (per app/environment/BU).
Resource groups - a lifecycle/management container for resources that share a lifecycle; RBAC, locks, and deletion apply at this level. A resource belongs to exactly one RG.
Azure Resource Manager (ARM) - the control-plane API/service for all deployments and management; ARM templates and Bicep are its native IaC languages.

The Azure mental model

Concept	Is the boundary for
Entra tenant	Identity (who exists)
Subscription	Billing and deployment (and quota/limits)
Resource group	Lifecycle and management (deploy/delete/lock together)
Management group	Governance (apply RBAC/Policy across many subscriptions)
Azure Policy	Control and compliance (what is allowed to exist / be configured)
Azure RBAC	Access (who can do what to Azure resources)

Two different questions

RBAC answers "can this identity perform this action on this resource?" Azure Policy answers "is this resource/configuration allowed to exist here at all?" (e.g. "only these regions", "no public IPs", "require a tag"). They are separate engines - you often need both: RBAC to grant, Policy to constrain. And Entra directory roles (Global Administrator, etc.) are different again from Azure RBAC (section 2).
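The two-engine idea can be sketched as two independent checks that must both pass. This is a toy model, not how Azure evaluates requests internally: the principals, grants, allowed regions, and policy names below are all hypothetical, and Policy is only applied to writes in this sketch.

```python
from dataclasses import dataclass

# Hypothetical data for illustration; real grants and policies live in Azure.
RBAC_GRANTS = {"app-team": {"read", "write", "delete"}, "auditors": {"read"}}
ALLOWED_LOCATIONS = {"westeurope", "northeurope"}


@dataclass(frozen=True)
class Request:
    principal: str
    action: str          # "read", "write", or "delete"
    location: str
    has_public_ip: bool = False


def rbac_allows(req):
    """RBAC question: may this identity perform this action?"""
    return req.action in RBAC_GRANTS.get(req.principal, set())


def policy_violations(req):
    """Policy question: is the resulting resource allowed to exist here?"""
    if req.action != "write":      # this sketch only evaluates creates/updates
        return []
    violations = []
    if req.location not in ALLOWED_LOCATIONS:
        violations.append("allowed-locations")
    if req.has_public_ip:
        violations.append("deny-public-ip")
    return violations


def evaluate(req):
    """Both engines must agree: RBAC grants, then Policy constrains."""
    if not rbac_allows(req):
        return "denied by RBAC"
    violations = policy_violations(req)
    if violations:
        return "denied by Policy: " + ", ".join(violations)
    return "allowed"


print(evaluate(Request("app-team", "write", "westeurope")))
print(evaluate(Request("app-team", "write", "eastus", has_public_ip=True)))
print(evaluate(Request("auditors", "write", "westeurope")))
```

Note that the second request fails even though the caller holds a write grant: a role never overrides a deny policy, which is why the two are designed together.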

ARM, Bicep, and the tools

Azure Portal

The web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.

Azure CLI & Azure PowerShell

Two first-class command lines (az and Az module). Pick one house standard; both cover the control plane. See section 17.

Cloud Shell

Browser shell pre-authenticated as your identity, with CLI/PowerShell/Terraform/kubectl and persistent storage.

ARM templates & Bicep

Native IaC. Bicep is the modern, readable authoring language that compiles to ARM JSON - prefer it over raw ARM templates.

Terraform

The azurerm provider is widely used for multi-cloud/standardized IaC. Bicep vs Terraform is a house choice; both are valid.

REST APIs & SDKs

Everything is an ARM REST call; idiomatic SDKs (.NET, Python, Java, JS, Go) for building tooling and apps.

Azure Policy and resource locks

Azure Policy - define, assign, and enforce rules (deny, audit, deployIfNotExists, modify) at MG/subscription/RG scope, inheriting down. Common: allowed regions, allowed SKUs, require tags, deny public IPs/public storage, require diagnostic settings. Initiatives bundle policies (e.g. a regulatory baseline).
Resource locks - CanNotDelete or ReadOnly locks on a subscription/RG/resource to prevent accidental deletion/changes (e.g. lock the hub network and prod databases).
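As a rough illustration of how the two lock levels behave and inherit, the sketch below models scopes as resource ID prefixes. The subscription, resource group, and resource names are hypothetical, and the model covers control-plane operations only; treat it as a reasoning aid rather than a description of the ARM implementation.

```python
# Hypothetical locks, keyed by the scope they are applied at.
LOCKS = {
    "/subscriptions/sub1/resourceGroups/rg-hub": "CanNotDelete",
    "/subscriptions/sub1/resourceGroups/rg-hub/providers/Microsoft.KeyVault/vaults/kv-platform": "ReadOnly",
}


def applicable_locks(resource_id):
    """Locks on the resource itself plus any inherited from parent scopes."""
    rid = resource_id.lower()
    return {
        level
        for scope, level in LOCKS.items()
        if rid == scope.lower() or rid.startswith(scope.lower() + "/")
    }


def lock_permits(resource_id, operation):
    """operation is 'read', 'write', or 'delete'; the most restrictive lock wins."""
    locks = applicable_locks(resource_id)
    if "ReadOnly" in locks:
        return operation == "read"
    if "CanNotDelete" in locks:
        return operation != "delete"
    return True


hub_vnet = "/subscriptions/sub1/resourceGroups/rg-hub/providers/Microsoft.Network/virtualNetworks/vnet-hub"
key_vault = "/subscriptions/sub1/resourceGroups/rg-hub/providers/Microsoft.KeyVault/vaults/kv-platform"

print(lock_permits(hub_vnet, "write"))    # modify is fine under CanNotDelete
print(lock_permits(hub_vnet, "delete"))   # blocked by the inherited RG lock
print(lock_permits(key_vault, "write"))   # blocked by the ReadOnly lock
```

A practical consequence: a ReadOnly lock also stops legitimate changes, so it suits resources that genuinely should not change, while CanNotDelete is the safer default for foundational resources that still need updates.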

Architect note

Set a baseline of preventive Azure Policy at the management-group level in your landing zone: allowed locations, deny public IPs by default, require diagnostic settings to a central Log Analytics workspace, enforce tags. Add CanNotDelete locks on foundational resources (hub VNet, Key Vault, prod data). This prevents whole classes of mistakes rather than detecting them after the fact.

Designing the hierarchy & landing zones

How to structure tenants, management groups, subscriptions, resource groups (Design)

One Entra tenant for the enterprise (identity boundary); multiple tenants only for genuine sovereignty/M&A isolation.
Management groups following the Cloud Adoption Framework (CAF) pattern: a Platform MG (connectivity, management, identity subscriptions) and a Landing Zones MG (corp/online workload subscriptions), plus Sandbox and Decommissioned.
Subscriptions per workload/environment (or per app), not one giant subscription - they are the quota and blast-radius boundary. Separate platform subscriptions for connectivity (hub network), management (logging/monitoring), and identity.
Resource groups per app-tier or per lifecycle (things deployed and deleted together).

Separating dev / test / stage / prod / shared / security / networking / logging (Design)

Separate subscriptions per environment for independent quota, budgets, RBAC, and blast radius; use management groups to apply environment-wide Policy/RBAC once.
Dedicated platform subscriptions: connectivity (hub VNet, Azure Firewall, ExpressRoute/VPN), management (Log Analytics, Automation, Backup), identity (domain controllers / Entra Domain Services if needed).
Keep prod under stricter Policy (no public IPs, restricted regions/SKUs, mandatory logging) than nonprod.
Never mix sandbox/experimentation with production - separate MG with its own guardrails and budgets.

What an Azure landing zone includes (Design)

An Azure Landing Zone (CAF) is a codified, repeatable baseline deployed before workloads:

Management-group hierarchy, subscription organization, and naming/tagging standards.
Identity: Entra tenant config, groups, PIM, break-glass, Conditional Access baseline.
Baseline RBAC (groups, not users) and preventive Azure Policy initiatives.
Connectivity: hub-and-spoke (or Virtual WAN), Azure Firewall, DNS, ExpressRoute/VPN in the connectivity subscription.
Management: central Log Analytics workspace, diagnostic settings policy, Azure Monitor, Backup, Update Manager.
Security: Defender for Cloud, Sentinel, Key Vault, Private Link/Private DNS strategy.
Guardrails: budgets, quotas, resource locks, tags - all as code (Terraform / Bicep; the CAF ALZ accelerator is a starting point).

Common mistakes in Azure governance

One big subscription for everything, so quota, RBAC, and cost attribution collapse together.
Skipping the landing zone and retrofitting management groups, Policy, and hub networking later.
Confusing the Entra tenant (identity) with a subscription (billing) - they are different boundaries.
Granting RBAC at subscription/MG level for convenience so everything inherits broad access.
No naming/tagging standard; mixing sandbox with production.

Official documentation: Cloud Adoption Framework & landing zones

2. Identity and Access Management

Microsoft Entra ID and Azure RBAC - the two different systems that decide who exists and who can do what - plus managed identities, PIM, Conditional Access, and a troubleshooting model. This is where most Azure access issues and security incidents originate.

Last reviewed: July 2026. Verify role names, PIM, and Conditional Access features in current docs.

Entra vs RBAC | Principals | Managed identities | RBAC & scope | PIM & Conditional Access | Scenarios | Common mistakes | Access troubleshooting model

TL;DR

Azure has two access systems: Microsoft Entra ID (identity + directory roles like Global Administrator that govern Entra/M365) and Azure RBAC (roles like Contributor that govern Azure resources at MG/subscription/RG/resource scope). Confusing them is the #1 IAM error. Use groups (not users), built-in roles scoped narrowly (not Owner at subscription), managed identities for workloads (not secrets), PIM for just-in-time privileged access, and Conditional Access + MFA for sign-in. Deny assignments and Policy can block regardless of role.

Entra ID roles vs Azure RBAC roles (the critical distinction)

	Microsoft Entra directory roles	Azure RBAC roles
Govern	Entra ID & Microsoft 365 (users, groups, apps, tenant settings)	Azure resources (VMs, storage, networks, databases)
Examples	Global Administrator, User Administrator, Application Administrator	Owner, Contributor, Reader, Storage Blob Data Contributor
Scope	Tenant (and administrative units)	Management group / subscription / resource group / resource
Managed in	Entra ID > Roles and administrators	Resource > Access control (IAM)

Common mistake - confusing the two

A Global Administrator (Entra) does not automatically have access to Azure resources, and an Azure Owner cannot manage Entra users. They are separate systems with separate roles and scopes. (A Global Admin can elevate to gain User Access Administrator over all subscriptions - a deliberate, audited action, not the default.) Always know which system a task needs before assigning a role.

Principals (who can be granted access)

Principal	What it is	Use for
User	A human identity in Entra ID	People - but grant via groups, not directly
Group	An Entra security group of users/principals	All human access management
Administrative unit	A container to scope Entra role admins to a subset of users/groups	Delegated identity admin (e.g. per-region helpdesk)
App registration + service principal	An application identity; the app registration is the definition, the service principal is its instance in a tenant	Apps/CI authenticating to Azure/Graph (prefer managed identity where possible)
Managed identity	An Azure-managed service principal with no credentials you handle	Azure workloads calling Azure/Entra - the preferred workload identity
External identity (B2B guest)	A user from another tenant invited as a guest	Partner/vendor collaboration

Managed identities: system- vs user-assigned

	System-assigned	User-assigned
Lifecycle	Tied to one resource; created/deleted with it	Standalone resource; attach to many
Use for	A single workload needing its own identity	Sharing one identity across resources; pre-creating and granting RBAC before deploy
Credentials	None you manage - Azure rotates them	None you manage

Security note - managed identities over secrets

An app-registration client secret or certificate is a long-lived credential that can leak, be committed to git, expire unexpectedly, or outlive its owner. A managed identity has no secret you handle - Azure issues and rotates tokens automatically, scoped by RBAC. If an Azure workload needs to call Azure or Entra, it should almost always use a managed identity, not an app secret. Reserve app registrations + secrets/certs for external/CI cases, and prefer workload identity federation (no secret) even there.

Azure RBAC: roles and scope

Built-in roles - Owner (full + manage access), Contributor (full except manage access), Reader, plus hundreds of granular roles (e.g. Virtual Machine Contributor, Storage Blob Data Reader). Prefer the narrowest built-in role.
Custom roles - compose specific actions/dataActions when no built-in role fits without over-granting.
Role assignment = role + principal + scope. Scope is MG / subscription / RG / resource; assignments inherit downward and are additive.
Deny assignments - explicitly block actions regardless of role (used by Azure-managed features like Blueprints/managed apps); evaluated before allows.
Control-plane vs data-plane - some roles govern management (create/delete a storage account) vs data (read blobs). A Contributor on a storage account can't necessarily read the blobs without a data role - a frequent surprise.
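The interaction of scope inheritance, additive assignments, and the control-plane / data-plane split can be sketched as follows. The role definitions are trimmed and simplified for illustration (they are not the full built-in definitions), and the principals and resource IDs are hypothetical.

```python
from fnmatch import fnmatchcase

# Trimmed, simplified role definitions for illustration only.
ROLES = {
    "Contributor": {
        "actions": ["*"],
        "not_actions": ["Microsoft.Authorization/*/Write", "Microsoft.Authorization/*/Delete"],
        "data_actions": [],
    },
    "Storage Blob Data Reader": {
        "actions": ["Microsoft.Storage/storageAccounts/blobServices/containers/read"],
        "not_actions": [],
        "data_actions": ["Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"],
    },
}

RG = "/subscriptions/sub1/resourceGroups/rg-app"
STORAGE = RG + "/providers/Microsoft.Storage/storageAccounts/stapp"
BLOB_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"

# Role assignment = principal + role + scope (hypothetical examples).
ASSIGNMENTS = [
    ("app-team", "Contributor", RG),
    ("app-mi", "Storage Blob Data Reader", STORAGE),
]


def _matches(operation, patterns):
    """Case-insensitive wildcard match of an operation against role patterns."""
    return any(fnmatchcase(operation.lower(), p.lower()) for p in patterns)


def _in_scope(scope, resource_id):
    """An assignment applies at its scope and everything beneath it."""
    scope, resource_id = scope.lower(), resource_id.lower()
    return resource_id == scope or resource_id.startswith(scope + "/")


def is_allowed(principal, operation, resource_id, data_plane=False):
    """Additive evaluation: any in-scope assignment that grants the operation is enough."""
    for who, role_name, scope in ASSIGNMENTS:
        if who != principal or not _in_scope(scope, resource_id):
            continue
        role = ROLES[role_name]
        if data_plane:
            granted = _matches(operation, role["data_actions"])
        else:
            granted = _matches(operation, role["actions"]) and not _matches(
                operation, role["not_actions"]
            )
        if granted:
            return True
    return False


print(is_allowed("app-team", "Microsoft.Storage/storageAccounts/write", STORAGE))
print(is_allowed("app-team", BLOB_READ, STORAGE, data_plane=True))
print(is_allowed("app-mi", BLOB_READ, STORAGE, data_plane=True))
```

The second call is the "frequent surprise" in miniature: the team's Contributor role reaches the storage account through RG inheritance, yet it grants no data actions, so reading blobs still fails until a data role is assigned.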

Common mistake - Owner / subscription-scope grants

Assigning Owner (or even Contributor) at subscription scope "to unblock a team" gives broad control over everything in it, inherited by every RG and resource. Grant the narrowest built-in role at the resource group (or resource) scope; reserve subscription/MG-scope and Owner for a small, PIM-gated admin group.

Privileged Identity Management & Conditional Access

Privileged Identity Management (PIM)

Just-in-time, time-bound, approval-and-MFA-gated activation of privileged roles (Entra and Azure RBAC). Admins are eligible, not permanently assigned - they activate when needed, with justification and audit. The single biggest reduction of standing privilege.

Conditional Access

Sign-in policies that require MFA, compliant/managed devices, trusted locations, or block risky sign-ins - the enforcement layer for Zero Trust. Enforce MFA for all users, especially admins.

Identity Protection

Risk-based detection (leaked credentials, impossible travel) feeding Conditional Access to block/step-up risky sign-ins.

Access reviews & entitlement management

Periodic recertification of group/role/guest access, and access packages for governed self-service - keep access from silently accumulating.

Security note - PIM + break-glass

Make privileged roles (Owner, User Access Administrator, Global Administrator) eligible via PIM, not permanent - activation requires MFA, justification, and (for the highest roles) approval, all audited. Keep two break-glass (emergency access) accounts: cloud-only, excluded from Conditional Access that could lock you out, with very long unique passwords stored offline, MFA via a separate method, and alerting on every sign-in. They exist so you can still administer the tenant if federation/PIM/Conditional Access breaks.

Real RBAC / identity scenarios

App team manages their resource group only (Medium risk)

Who: the app team group. Scope: their resource group (not subscription). Role: Contributor on the RG (or narrower roles like Virtual Machine Contributor + Web Plan Contributor). Risk: medium - contained to one RG. Safer alternative: deploy via a pipeline managed identity and give humans Reader + specific action roles. Common misuse: Owner/Contributor at subscription scope.

Workload reads a storage account - no secrets (Low risk)

Who: a VM/App Service/AKS workload. Scope: a single storage account (or container). Role: Storage Blob Data Reader assigned to the workload's managed identity. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a storage account key or SAS in app config + Contributor at RG.

Read-only auditor across the platform (Low risk)

Who: security/audit group. Scope: the root or Platform management group (auditors need breadth). Role: Reader (+ Security Reader for Defender), granted to a group. Risk: low - read-only. Common misuse: giving auditors Contributor "just in case".

Emergency change requiring elevated access (Higher risk)

Who: an on-call engineer. Scope: the affected subscription. Role: Contributor/Owner made eligible via PIM - activated just-in-time with MFA + justification, time-boxed, audited. Risk: higher, but no standing privilege and full audit trail. Common misuse: a permanent Owner assignment "for emergencies".

Common Azure IAM mistakes

Confusing Entra roles with Azure RBAC roles - different systems, scopes, and portals.
Owner assigned too broadly - use narrow built-in roles at RG/resource scope.
Unnecessary subscription-scope grants - they inherit into every RG; grant at a lower scope.
Not using PIM - standing privileged access is the biggest avoidable risk.
Break-glass accounts not protected/monitored correctly - or not existing at all.
Too many app secrets, not rotated - and used where a managed identity would work.
Not using managed identities for Azure workloads.
Mixing human users and workload identities - different lifecycles/controls.
Weak Conditional Access - no MFA enforcement, no risk policies, admins unprotected.

Azure access troubleshooting mental model

When access fails (or unexpectedly works), walk the layers in order:

"Access denied" / "does not have authorization" - the checklist

Which tenant are you signed into? (Guest in the wrong tenant is common.)
Which subscription is the resource in, and is it the selected one?
Which identity is making the request - user, group, service principal, or managed identity? (For workloads, which MI is actually assigned?)
What role is assigned, and does it include the required action (control-plane vs data-plane)?
At what scope (MG/subscription/RG/resource)? Does inheritance reach this resource?
Is there a deny assignment blocking it?
Is Azure Policy blocking the action (deny effect)?
Is Conditional Access blocking the sign-in (device/location/MFA)?
Does the role require PIM activation that hasn't been done (eligible but not active)?
Is the resource provider registered / API available in the subscription?
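The checklist above can be walked mechanically. The sketch below is one hypothetical way to encode it: you record what you have verified for each layer (True means that layer is fine) and the function reports the first layer that is failing or not yet checked. The keys and messages are made up for this example.

```python
# Checklist layers in the order they should be examined.
CHECKLIST = [
    ("tenant", "signed into the wrong tenant"),
    ("subscription", "wrong subscription selected"),
    ("identity", "request made by an unexpected identity"),
    ("role_action", "assigned role lacks the required action (control vs data plane)"),
    ("scope", "role assignment scope does not reach the resource"),
    ("deny_assignment", "a deny assignment blocks the action"),
    ("policy", "Azure Policy deny effect blocks the action"),
    ("conditional_access", "Conditional Access blocks the sign-in"),
    ("pim", "role is eligible in PIM but not activated"),
    ("provider", "resource provider not registered in the subscription"),
]


def first_blocker(findings):
    """Return the first layer that fails or has not been verified yet."""
    for key, problem in CHECKLIST:
        if key not in findings:
            return f"not yet checked: {key}"
        if not findings[key]:
            return f"blocked: {problem}"
    return "no blocker found in the checklist"


# Example: every layer verified as fine except PIM activation.
findings = {key: True for key, _ in CHECKLIST}
findings["pim"] = False
print(first_blocker(findings))
```

Walking the layers in a fixed order matters because later layers are meaningless if an earlier one is wrong: there is no point reading role assignments while signed into the wrong tenant.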

Tools

Resource > Access control (IAM) > Check access and View my access; Entra sign-in logs (for Conditional Access); Activity Log (for the denied operation); PIM (eligible vs active).

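az role assignment list --assignee <objectId> --all -o table   # role assignments for this principal across the current subscription
az account show                                                # which tenant/subscription am I in?
az provider show -n Microsoft.Sql --query registrationState    # is the resource provider registered here?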

Official documentation: Azure RBAC & Microsoft Entra ID

3. Networking Deep Dive

Virtual Networks, NSGs vs Azure Firewall, Private Endpoint vs Service Endpoint, hub-and-spoke, hybrid connectivity, and Private DNS - plus the traffic-flow reasoning you need to design and debug real Azure networks.

Last reviewed: July 2026. Verify gateway SKUs, limits, and Private DNS behavior in current docs.

VNet & CIDR | NSG & ASG | Routes & egress | Azure Firewall | Private Endpoint vs Service Endpoint | Peering & hub-spoke | Hybrid | DNS | Traffic flow | Diagrams | Network Watcher | Troubleshooting | Gotchas

TL;DR

A Virtual Network (VNet) is regional; you carve subnets from its CIDR. NSGs are stateful L3/L4 allow/deny rules (per subnet/NIC); Azure Firewall is a managed L3-L7 firewall for centralized egress/inspection in a hub. Reach PaaS privately with Private Endpoint (a private IP in your VNet, works cross-region/hybrid) or the older Service Endpoint (keeps traffic on the backbone but the service keeps a public endpoint). VNet peering is not transitive - use hub-and-spoke or Virtual WAN. Private Endpoint needs correct Private DNS zone linkage or nothing resolves. Plan non-overlapping CIDRs first.

Virtual Network and CIDR planning

A VNet is regional with one or more address spaces; subnets partition it. Azure reserves 5 IPs per subnet (first 4 + last).
Some services want dedicated/delegated subnets (Azure Firewall needs AzureFirewallSubnet, gateways need GatewaySubnet, Bastion needs AzureBastionSubnet, App Gateway its own subnet). Plan these in advance.
CIDRs must not overlap with peered VNets or on-premises ranges. Overlap is the #1 cause of hybrid connectivity that "connects but won't route."
Plan generously: leave room for growth, gateway/firewall/bastion subnets, and AKS (which consumes many IPs with Azure CNI).
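Two of the points above are easy to check with Python's standard `ipaddress` module: how many addresses a subnet really offers after Azure's five reserved IPs, and whether any planned ranges overlap. This is a minimal sketch; the range names and CIDRs are made up for the example.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr):
    """Addresses left for workloads after Azure's reserved IPs."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AZURE_RESERVED_PER_SUBNET, 0)


def find_overlaps(ranges):
    """Return every pair of named ranges whose address space overlaps."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


print(usable_ips("10.0.1.0/24"))
print(usable_ips("10.0.0.0/29"))

# Hypothetical plan: the on-premises range collides with a spoke.
plan = {
    "hub": "10.0.0.0/16",
    "spoke-app": "10.1.0.0/16",
    "onprem": "10.1.128.0/17",
}
print(find_overlaps(plan))
```

Running a check like this against the documented IP plan before creating a VNet or peering is cheap; finding the overlap after ExpressRoute is connected is not.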

Architect note - IP plan first

Reserve a large private supernet for Azure, allocate a block per region/environment, and subnets per tier plus the required platform subnets (GatewaySubnet, AzureFirewallSubnet, AzureBastionSubnet). Keep a documented IPAM. Overlapping ranges break peering and hybrid and are a re-IP project later; a too-large plan costs nothing.
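A plan like this can be generated rather than hand-maintained. The sketch below carves one block per region/environment from a supernet and then reserves the platform subnets inside a VNet. The supernet, region names, block size, and subnet sizes (/26 for firewall and Bastion, /27 for the gateway) are assumptions for illustration; confirm the current minimum sizes in Microsoft's documentation.

```python
import ipaddress
from itertools import product


def allocate(supernet, regions, environments, block_prefix=16):
    """Hand out one block per (region, environment), in order, from the supernet."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=block_prefix)
    return {key: next(blocks) for key in product(regions, environments)}


def platform_subnets(vnet):
    """Reserve the named platform subnets at the start of a VNet's address space."""
    s26 = vnet.subnets(new_prefix=26)
    firewall, bastion, third = next(s26), next(s26), next(s26)
    gateway = next(third.subnets(new_prefix=27))
    return {
        "AzureFirewallSubnet": firewall,
        "AzureBastionSubnet": bastion,
        "GatewaySubnet": gateway,
    }


# Assumed supernet and names, for illustration only.
plan = allocate("10.64.0.0/12", ["westeurope", "northeurope"], ["prod", "nonprod"])
for key, block in plan.items():
    print(key, block)

for name, subnet in platform_subnets(plan[("westeurope", "prod")]).items():
    print(name, subnet)
```

Generating the plan from a small set of inputs keeps the IPAM document and the deployed ranges from drifting apart, since both can come from the same source.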

Network Security Groups & Application Security Groups

NSG - stateful allow/deny rules on source/dest, port, protocol, evaluated in priority order (lowest number first; the first matching rule wins). Attach to a subnet and/or a NIC; when both are attached, traffic must be allowed by both. Default rules allow intra-VNet traffic and deny inbound from the internet.
Application Security Group (ASG) - a named group of NICs you reference in NSG rules instead of IPs, so rules read "web-asg → db-asg on 1433" and update automatically as VMs scale.
Service tags - Microsoft-maintained IP groups (e.g. Storage, Sql, AzureMonitor, Internet) you use in rules instead of hardcoding ranges.
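NSG evaluation is easy to reason about once you see it as "sort by priority, first match wins, then fall through to the defaults". The sketch below models inbound rules only, matches on source and destination port, and stands in a fixed CIDR for the VirtualNetwork tag; it also omits the default load-balancer rule. All rule names and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    source: str                # CIDR, or "*" for any source
    dest_port: Optional[int]   # None means any port
    access: str                # "Allow" or "Deny"


VNET = "10.0.0.0/16"  # assumed address space standing in for the VirtualNetwork tag

# Simplified defaults (the AzureLoadBalancer allow rule is left out of this sketch).
DEFAULT_INBOUND = [
    Rule(65000, "AllowVnetInBound", VNET, None, "Allow"),
    Rule(65500, "DenyAllInBound", "*", None, "Deny"),
]


def evaluate(rules, src_ip, dest_port):
    """Return (access, rule name) for the first rule matching in priority order."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules + DEFAULT_INBOUND, key=lambda r: r.priority):
        source_ok = rule.source == "*" or ip in ipaddress.ip_network(rule.source)
        port_ok = rule.dest_port is None or rule.dest_port == dest_port
        if source_ok and port_ok:
            return rule.access, rule.name
    return "Deny", "no matching rule"


def inbound_allowed(subnet_rules, nic_rules, src_ip, dest_port):
    """With NSGs on both the subnet and the NIC, both must allow the traffic."""
    return all(
        evaluate(rules, src_ip, dest_port)[0] == "Allow"
        for rules in (subnet_rules, nic_rules)
    )


db_subnet_rules = [
    Rule(100, "allow-web-to-sql", "10.0.1.0/24", 1433, "Allow"),
    Rule(200, "deny-other-sql", "*", 1433, "Deny"),
]

print(evaluate(db_subnet_rules, "10.0.1.10", 1433))
print(evaluate(db_subnet_rules, "10.0.2.10", 1433))
print(evaluate(db_subnet_rules, "203.0.113.7", 443))
```

In real rules the sources would be ASGs and service tags rather than raw CIDRs; the evaluation order is the part this sketch is meant to show.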

Security note - ASGs + service tags

Write NSG rules against ASGs (by workload role) and service tags (by Azure service) rather than raw IP ranges. Rules stay readable and self-updating, and you avoid the classic "someone hardcoded a range that changed." Keep a tight default: deny inbound from Internet, allow only what a tier needs from the tier that needs it.

Common mistake - NSG vs Azure Firewall confusion

An NSG is stateful L3/L4 packet filtering attached to subnets/NICs - it is not a firewall appliance (no FQDN filtering, no L7, no centralized logging of app traffic, no threat intel). Azure Firewall is a managed stateful firewall for centralized egress/inspection with FQDN rules, threat intel, and DNAT. Use both: NSGs for micro-segmentation at the subnet/NIC, Azure Firewall in the hub for controlled egress. They are complementary, not alternatives.

Routes, UDRs, and internet egress

System routes handle intra-VNet, peering, and a default route to the internet. User Defined Routes (UDRs) in a route table override them - e.g. force 0.0.0.0/0 through the Azure Firewall (forced tunneling / centralized egress).
NAT Gateway - the recommended way to give a subnet scalable, stable outbound internet (SNAT) without per-VM public IPs. Outbound only.
Public IPs (Standard SKU, zone-redundant) for inbound-facing resources; avoid on workload VMs - use a load balancer / Bastion / firewall instead.
Default outbound access is being retired - new VMs should have an explicit outbound method (NAT Gateway, LB outbound rule, or firewall). Don't rely on implicit outbound.
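How a UDR "overrides" a system route comes down to route selection: the longest matching prefix wins, and when prefixes tie, a user-defined route beats a BGP route, which beats a system route. The sketch below models that selection; the routes, next-hop labels, and firewall IP are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

# Tie-break order when two routes have the same prefix length.
SOURCE_PRECEDENCE = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str   # "user", "bgp", or "system"


def select_route(routes, dest_ip):
    """Pick the effective route: longest prefix first, then source precedence."""
    ip = ipaddress.ip_address(dest_ip)
    candidates = [r for r in routes if ip in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r.prefix).prefixlen,
            SOURCE_PRECEDENCE[r.source],
        ),
    )


# Hypothetical spoke subnet: system routes plus a UDR forcing egress via the firewall.
routes = [
    Route("10.0.0.0/16", "VnetLocal", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
    Route("0.0.0.0/0", "VirtualAppliance 10.100.0.4", "user"),
]

print(select_route(routes, "8.8.8.8").next_hop)
print(select_route(routes, "10.0.1.5").next_hop)
```

The second lookup shows why a 0.0.0.0/0 UDR does not capture intra-VNet traffic: the more specific system route for the VNet's own address space still wins unless you add a UDR for that prefix too.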

Common mistake

Relying on Azure's implicit/default outbound internet for VMs (being deprecated) or expecting NAT Gateway to allow inbound - it's outbound only. Give every subnet an explicit outbound path (NAT Gateway or firewall) and use a load balancer/App Gateway/firewall DNAT for inbound. Also: a UDR that sends 0.0.0.0/0 to the firewall without a matching route back can black-hole traffic - design symmetric routing.

Azure Firewall & Firewall Manager

Azure Firewall is a managed, highly-available, stateful network firewall (Basic, Standard, and Premium SKUs; Premium adds TLS inspection, IDPS, URL filtering). Placed in the hub VNet with spoke UDRs pointing at it, it centralizes egress control (FQDN + network rules), DNAT for inbound, and logging. Firewall Manager centrally manages firewall policies across hubs (incl. Virtual WAN secured hubs).

Private Endpoint vs Service Endpoint (the constant confusion)

	Private Endpoint (Private Link)	Service Endpoint
What it is	A private IP in your subnet that maps to a specific PaaS resource instance	Extends the subnet's identity to the service over the backbone; the service keeps its public endpoint
Traffic	Fully private; the PaaS resource can disable public access entirely	Stays on the Azure backbone, but the resource still has a public endpoint (restricted by rules)
Cross-region / on-prem	Yes - reachable from peered VNets and on-prem (ExpressRoute/VPN)	No - VNet/region local; not reachable from on-prem
DNS	Requires a Private DNS zone mapping the service FQDN to the private IP	No DNS change; uses the public FQDN
Cost	Per-endpoint + data	Free
Use for	The modern default for private PaaS access (Storage, SQL, Key Vault...)	Simpler/cheaper cases where a public endpoint restricted to your subnet is acceptable

Common mistake - Private Endpoint DNS not linked

The single most common Private Endpoint failure: the Private DNS zone is not linked to the VNet (or on-prem DNS has no conditional forwarder), so the service FQDN still resolves to the public IP and connections fail or bypass the private path. For each PaaS type you need the correct zone (e.g. privatelink.blob.core.windows.net, privatelink.database.windows.net), an A record for the endpoint, and a VNet link (and hybrid forwarding for on-prem). Automate this - it is easy to forget and hard to notice.
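A quick way to catch this failure is to look at what the service FQDN actually resolves to from inside the VNet. The sketch below assumes you already have the resolved IP (for example from `nslookup` on a VM in the VNet) and classifies it; the helper names, the sample FQDN, and the expected subnet range are hypothetical, and the privatelink naming shown applies to the blob and SQL patterns mentioned above.

```python
import ipaddress


def privatelink_name(public_fqdn):
    """Name the public FQDN is aliased to once a Private Endpoint exists."""
    host, _, suffix = public_fqdn.partition(".")
    return f"{host}.privatelink.{suffix}"


def classify_resolution(resolved_ip, expected_cidr):
    """Judge a DNS answer against the subnet where the Private Endpoint lives."""
    ip = ipaddress.ip_address(resolved_ip)
    if ip in ipaddress.ip_network(expected_cidr):
        return "private endpoint path"
    if ip.is_private:
        return "private IP outside the expected range - check which zone or record answered"
    return "public IP - Private DNS zone not linked or forwarder missing"


print(privatelink_name("mystorage.blob.core.windows.net"))
print(classify_resolution("10.1.2.4", "10.1.2.0/24"))
print(classify_resolution("20.60.1.1", "10.1.2.0/24"))
```

A check like this fits well in a post-deployment test, since a missing VNet link produces no error at deploy time and only shows up as traffic taking the public path.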

VNet peering & hub-and-spoke

VNet peering (regional or global) connects two VNets privately over the backbone. Crucially, peering is not transitive: spoke A peered to hub H and spoke B peered to H cannot reach each other unless you route through a hub appliance (Azure Firewall/NVA) with UDRs and "allow forwarded traffic".
Hub-and-spoke - a central hub VNet (firewall, gateways, DNS, Bastion) peered to workload spokes; spokes route egress and cross-spoke traffic through the hub firewall.
Virtual WAN - a Microsoft-managed hub-and-spoke at scale (managed hubs, integrated firewall, any-to-any transit, branch/VPN/ExpressRoute) - use it instead of hand-built hubs for large/global topologies.
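Non-transitivity is easiest to see as a reachability check over the peering list: two VNets talk only if they are directly peered, unless the hub explicitly forwards traffic between them. This is a toy model with hypothetical VNet names; `hub_forwards` stands for the combination of a hub firewall/NVA, UDRs in the spokes, and "allow forwarded traffic" on the peerings.

```python
# Hypothetical topology: two spokes, each peered only to the hub.
PEERINGS = {
    frozenset({"hub", "spoke-a"}),
    frozenset({"hub", "spoke-b"}),
}


def directly_peered(a, b):
    """True only when a peering exists between exactly these two VNets."""
    return frozenset({a, b}) in PEERINGS


def can_reach(src, dst, hub_forwards=False, hub="hub"):
    """Peering alone is one hop; spoke-to-spoke needs a forwarding hub."""
    if src == dst or directly_peered(src, dst):
        return True
    return hub_forwards and directly_peered(src, hub) and directly_peered(hub, dst)


print(can_reach("spoke-a", "hub"))
print(can_reach("spoke-a", "spoke-b"))
print(can_reach("spoke-a", "spoke-b", hub_forwards=True))
```

The model deliberately has no "follow the graph" step: Azure does not route through an intermediate VNet just because peerings exist on both sides.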

Common mistake - assuming peering is transitive

Teams peer spoke-to-hub and expect spoke-to-spoke to work - it doesn't. Route cross-spoke traffic through a hub firewall/NVA (UDRs + "allow forwarded traffic" on the peering), or adopt Virtual WAN which provides managed transit. Don't build meshes of direct spoke peerings.

Hybrid connectivity: VPN Gateway and ExpressRoute

	VPN Gateway (Site-to-Site)	ExpressRoute
Path	Over the internet, IPSec-encrypted	Private, dedicated circuit via a partner/provider
Bandwidth	Up to gateway-SKU limits (hundreds of Mbps to several Gbps)	50 Mbps to 100 Gbps circuits
SLA / latency	Best-effort internet	Consistent, low latency; higher SLA; private
Setup	Minutes-hours	Days-weeks (provider provisioning)
Use as	Quick start / backup / lower bandwidth	Primary enterprise link; large data; low latency; private access to Microsoft/M365

ExpressRoute Global Reach connects on-prem sites to each other through Azure. Common pattern: ExpressRoute primary + VPN backup. Gateway SKU determines bandwidth and features - size it deliberately (a frequent bottleneck).

Azure DNS & Private DNS

Azure DNS public zones for internet-facing names; Private DNS zones for internal resolution, linked to VNets (with optional auto-registration).
Private Endpoints depend on Private DNS zones resolving the privatelink.* FQDNs to the private IPs.
Azure DNS Private Resolver - a managed resolver for hybrid DNS (inbound/outbound endpoints + forwarding rulesets) so on-prem can resolve Azure private names and vice-versa, without running DNS VMs.

How traffic flows in Azure

Destination inside the VNet (or a peered VNet)? Routes over the backbone - only NSG rules apply.
Outside? The effective route table (system + UDRs) picks the next hop: internet (via NAT Gateway / public IP), the firewall/NVA (if a UDR forces it), the VPN/ExpressRoute gateway, or a peering.
NSGs (subnet + NIC, by priority) must allow it - remember default deny-inbound-from-internet.
For private PaaS: Private Endpoint + correct Private DNS resolution, or Service Endpoint with the public FQDN.

Debugging is almost always: what do effective routes say? what do effective NSG rules say (both directions)? does DNS resolve to the right (private) IP?
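
To make "what do effective routes say?" concrete, here is a minimal sketch of the selection logic: longest prefix match first, then user-defined routes over BGP routes over system routes when prefixes tie. The route entries and next-hop labels are hypothetical sample data, and the function is a simplified model rather than an Azure API.

```python
import ipaddress

# Tie-break order when two routes share the same prefix length.
SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}


def next_hop(dest_ip, routes):
    """Pick a route: longest prefix wins, then user > bgp > system."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [r for r in routes if ip in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    best = min(
        matches,
        key=lambda r: (
            -ipaddress.ip_network(r["prefix"]).prefixlen,
            SOURCE_RANK[r["source"]],
        ),
    )
    return best["next_hop"]


# Hypothetical effective route table for a spoke NIC.
routes = [
    {"prefix": "10.0.0.0/16", "source": "system", "next_hop": "VnetLocal"},
    {"prefix": "0.0.0.0/0", "source": "system", "next_hop": "Internet"},
    {"prefix": "0.0.0.0/0", "source": "user", "next_hop": "VirtualAppliance 10.0.1.4"},
    {"prefix": "192.168.0.0/16", "source": "bgp", "next_hop": "VirtualNetworkGateway"},
]
```

Here internet-bound traffic goes to the appliance because the UDR beats the system default route, while in-VNet traffic still takes the more specific `VnetLocal` route.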

Reference diagrams

Hub-and-spoke with centralized egress

Central hub holds firewall, gateways, DNS, and Bastion; spokes route egress and cross-spoke traffic through the hub.

Private Endpoint pattern

A Private Endpoint gives the PaaS resource a private IP in your subnet; the Private DNS zone (linked to the VNet) resolves its FQDN to that IP. Public access is disabled.

Network Watcher

Tool	What it gives you
IP Flow Verify	"Is this specific 5-tuple allowed or denied, and by which NSG rule?" First stop for NSG questions.
Effective security rules / Effective routes	The actual merged NSG rules and routes applied to a NIC - what really governs the traffic.
Connection Monitor / Connection Troubleshoot	Test and continuously monitor reachability/latency A→B, naming the blocker.
NSG flow logs	Connection records for monitoring, forensics, and "is my rule dropping this?" (into a storage account / Log Analytics).

Start with IP Flow Verify + Effective routes

Before hand-reading rules, run IP Flow Verify (names the deciding NSG rule) and check Effective routes on the NIC (shows the real next hop). Together they resolve most "cannot reach" cases in seconds.
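
What IP Flow Verify reports can be pictured as first-match evaluation in priority order (lowest number first). The sketch below is a simplified model under stated assumptions: it checks only source address and destination port, ignores protocol, destination, and service tags, and the rule names are hypothetical apart from the `DenyAllInBound` entry that mirrors the built-in default.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    priority: int  # lower number is evaluated first
    source: str    # CIDR, or "*" for any
    port: str      # single port as a string, or "*" for any
    access: str    # "Allow" or "Deny"


def evaluate(rules, src_ip, dst_port):
    """Return (access, rule name) for the first matching rule by priority."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        src_ok = rule.source == "*" or ip in ipaddress.ip_network(rule.source)
        port_ok = rule.port == "*" or int(rule.port) == dst_port
        if src_ok and port_ok:
            return rule.access, rule.name
    return "Deny", "no-match"


# Hypothetical inbound rule set on a subnet NSG.
inbound = [
    Rule("allow-https", 100, "*", "443", "Allow"),
    Rule("allow-ssh-from-hub", 200, "10.0.0.0/24", "22", "Allow"),
    Rule("DenyAllInBound", 65500, "*", "*", "Deny"),
]
```

The returned rule name is the useful part: it tells you which rule decided the flow, which is the same answer IP Flow Verify gives for a real NIC.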

Networking troubleshooting

⚑ VM cannot reach internet / private VM cannot download patches

Causes: no explicit outbound (NAT Gateway/firewall) and default outbound retired; a UDR sends 0.0.0.0/0 to a firewall that blocks it or has asymmetric routing; NSG egress deny; no public IP where one is required. Checks: Effective routes on the NIC; NAT Gateway on the subnet; firewall rules; IP Flow Verify. Fix: add a NAT Gateway (or firewall egress allow for repos/service tags); fix UDR/return routing. For OS repos, allow the relevant service tags/FQDNs on the firewall.

⚑ VM cannot reach a storage account / PaaS privately

Causes: Private Endpoint's Private DNS zone not linked to the VNet (FQDN resolves to public IP); missing A record; on-prem DNS lacks a conditional forwarder; the storage account still allows only the public path; NSG blocking. Checks: nslookup the FQDN from the VM (should return the private IP); Private DNS zone VNet links; endpoint connection state. Fix: link the correct privatelink.* zone to the VNet, add the A record (or use the auto DNS integration), set hybrid forwarding.
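
The nslookup check can be scripted. This is a minimal sketch that assumes the Private Endpoint sits in RFC 1918 address space (true for most VNets, but not guaranteed); the function names are illustrative and the FQDN in the usage note is a placeholder.

```python
import ipaddress
import socket

# RFC 1918 ranges; adjust if your VNets use other address space.
PRIVATE_NETS = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")


def is_private_path(resolved_ip):
    """True if the address falls in RFC 1918 private space."""
    ip = ipaddress.ip_address(resolved_ip)
    return any(ip in ipaddress.ip_network(net) for net in PRIVATE_NETS)


def check_fqdn(fqdn):
    """Resolve the FQDN from this host and report whether it is private."""
    resolved = socket.gethostbyname(fqdn)
    return resolved, is_private_path(resolved)
```

Run `check_fqdn("<account>.blob.core.windows.net")` from the affected VM: a public address in the result means the privatelink zone is not linked or the record is missing.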

⚑ Application cannot connect across VNets / peering issue

Causes: relying on transitive peering (unsupported); overlapping CIDRs; peering missing "allow forwarded traffic"/gateway transit; NSG blocking the peer range; UDR not routing spoke-to-spoke via the hub firewall. Fix: route cross-spoke through the hub firewall (UDR + allow forwarded traffic) or use Virtual WAN; resolve overlap; open NSGs for the peer range.

⚑ On-premises cannot reach Azure / ExpressRoute / VPN issue

Causes: CIDR overlap; BGP not advertising routes both ways; VPN tunnel down (IKE/PSK mismatch); ExpressRoute circuit/peering down or route filters wrong; gateway SKU bandwidth exhausted; NSG/firewall blocking on-prem range. Checks: gateway/connection status; BGP learned/advertised routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU. Portal: Virtual network gateways / ExpressRoute circuits.

⚑ Application Gateway health probe failing

Causes: NSG on the backend/App Gateway subnet blocking the probe or the required App Gateway management ports; wrong probe host/path/port/protocol; backend app on localhost or wrong port; certificate mismatch on HTTPS probes; App Gateway subnet missing required outbound. Fix: allow the probe + management traffic, align the probe, bind the app to all interfaces. Full flow in section 7.

⚑ DNS / NSG / route / Azure Firewall / Private DNS issue

Method: IP Flow Verify (NSG decision), Effective routes/rules (real next hop + merged rules), NSG flow logs (drops), and nslookup for DNS. For Azure Firewall, check the network/application rule collections and the UDR forcing traffic to it; for Private DNS, verify zone links and forwarders.

Azure networking gotchas

NSG vs Azure Firewall - NSG is stateful L3/L4 filtering, not a firewall appliance; use both.
Private Endpoint vs Service Endpoint - different mechanisms; Private Endpoint needs Private DNS zone linkage.
Forgetting Private DNS zone linkage - the top Private Endpoint failure.
Overlapping CIDRs - break peering and hybrid; plan IP space early.
Poor hub-and-spoke - decide hub ownership, firewall, and UDRs before spokes land; consider Virtual WAN.
Databases/services on public endpoints - use Private Endpoint; disable public access; deny public by Policy.
Peering is not transitive - route spoke-to-spoke via the hub firewall or use Virtual WAN.
Route tables (UDR) - forced tunneling and asymmetric routing black-hole traffic; design return paths.
Default outbound retiring - give every subnet an explicit outbound (NAT Gateway/firewall).
Gateway SKU limits - VPN/ExpressRoute/App Gateway SKUs cap bandwidth and features; size deliberately.
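
Overlap is cheap to catch before anything is deployed. A minimal sketch using Python's standard `ipaddress` module; the address plan below is hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(address_spaces):
    """Return pairs of names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical IP plan: spoke-data sits inside spoke-app's range.
plan = {
    "hub": "10.0.0.0/16",
    "spoke-app": "10.1.0.0/16",
    "spoke-data": "10.1.128.0/17",
    "on-prem": "192.168.0.0/16",
}
```

Running `find_overlaps(plan)` flags the `spoke-app` / `spoke-data` pair; a check like this fits naturally in a pipeline step ahead of VNet creation.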

Official documentation: Azure Virtual Network →

4. Compute Deep Dive

Azure Virtual Machines (series, disks, placement, HA), Scale Sets, Spot, images, and the managed/serverless options (App Service, Functions, Container Apps) - how to choose, place, scale, and operate compute on Azure.

Last reviewed: July 2026 VM series/sizes and pricing change - verify current SKUs and zone support in the portal.

VM series · Zones / sets / scale sets · Spot / dedicated hosts · Images & disks · Access & management · App Service / Functions / Container Apps · Choosing VMs · Operations

TL;DR

Azure VMs come in series (B/D general, F compute, E/M memory, L storage, N GPU) with families and sizes. Use Availability Zones + Virtual Machine Scale Sets (flexible orchestration) for HA and autoscale, managed disks (Premium SSD v2 / Ultra for demanding I/O), Bastion for RDP/SSH without public IPs, and managed identity for keyless API access. For new apps, consider App Service, Container Apps, or Functions before managing VMs. Reserved Instances / Savings Plan and Azure Hybrid Benefit are the main cost levers.

VM series and families

Series	Family	Best for
B	Burstable	Low-average, spiky workloads (dev, small web, bastions)
D (Dv5/Dasv5...)	General purpose	Web, app, microservices - the default
F	Compute optimized	High CPU-to-memory: batch, gaming, app servers
E	Memory optimized	Databases, in-memory caches, mid-size SAP
M / Mv3	Memory optimized (very large)	Large databases, SAP HANA (certified), in-memory analytics
L (Lsv3)	Storage optimized	High local-disk IOPS/throughput: NoSQL, big data, data nodes
N (NC/ND/NV)	GPU	AI/ML training/inference, rendering, visualization
DC	Confidential computing	Data-in-use encryption (SGX / AMD SEV-SNP)

DBA note - E/M series for databases and SAP

For SQL Server / Oracle / large databases and SAP HANA, use E or M (memory-optimized) sizes with high memory-to-vCPU ratios and pair with Premium SSD v2 / Ultra Disk sized for IOPS/throughput. HANA and large DB sizes have specific certified SKUs - confirm certification and the constrained-vCPU options (fewer active cores for per-core licensing) before committing.

Availability Zones, Availability Sets, and Scale Sets

Mechanism	Protects against	Notes
Availability Zones	Datacenter failure within a region	Spread VMs/scale sets across 2-3 zones; ~99.99% VM SLA. Preferred for new HA.
Availability Set	Rack (fault domain) & maintenance (update domain) failure in one datacenter	Use where zones aren't available; cannot span zones.
Virtual Machine Scale Sets (VMSS)	-	Manage identical VMs with autoscale + rolling upgrades; Flexible orchestration is the modern default (can span zones + mix sizes).
Proximity Placement Group	-	Co-locates VMs for lowest latency (e.g. app + DB tier); trades off against zone spread.

Common mistake

Running a single VM (or an availability set in a zone-capable region) for a production service - a datacenter event takes it down. Use a zone-spread VMSS (Flexible) behind a zone-redundant load balancer with health probes and rolling upgrades. Note that a single VM carries a lower SLA than a multi-instance availability-set or zone deployment.
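
SLA percentages are easier to compare as allowed downtime. The arithmetic below is a small sketch; 99.99% is the zone-spread figure from the table above, and any other percentage you feed in should come from the current SLA documents.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def max_downtime_minutes(sla_percent):
    """Minutes per year a service may be down while still meeting the SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


def serial_availability(*sla_percents):
    """Composite availability (%) when every component must be up."""
    result = 1.0
    for pct in sla_percents:
        result *= pct / 100
    return result * 100
```

For example, `max_downtime_minutes(99.99)` is roughly 53 minutes a year, and chaining two 99.99% components with `serial_availability(99.99, 99.99)` lands just under 99.99%, which is why every tier in the request path needs its own redundancy.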

Spot VMs and dedicated hosts

Option	What it does	Use when
Spot VMs	Deeply discounted capacity Azure can evict with 30s notice	Fault-tolerant, interruptible batch/CI/render - never stateful prod
Dedicated Hosts	A physical server dedicated to your subscription	Compliance/isolation, or per-core licensing needing host affinity/visibility
Reserved Instances / Savings Plan	1- or 3-year commitment for a discount	Steady-state baseline compute (section 14)
Azure Hybrid Benefit	Apply existing Windows Server / SQL Server licenses	Big savings for Microsoft-licensed workloads

Cost note

Cover steady-state with Reserved Instances or the Savings Plan for Compute, run interruptible batch on Spot (often 60-90% off), and apply Azure Hybrid Benefit to Windows/SQL licenses you already own (frequently the biggest single saving for a Microsoft shop). Right-size before you commit.

Images, Compute Gallery, and managed disks

Marketplace/platform images and custom images; the Azure Compute Gallery versions and replicates images across regions for scale.
Managed disks: Standard HDD, Standard SSD, Premium SSD, Premium SSD v2 (independently tunable IOPS/throughput/size), and Ultra Disk (highest performance, sub-ms). Ephemeral OS disks live on the host (fast, free, but lost on deallocate - stateless only).
Snapshots and disk encryption (platform-managed keys, customer-managed keys via Key Vault, or Azure Disk Encryption/host-based).

DBA note - Premium SSD v2 / Ultra for databases

Database I/O is usually the bottleneck. Premium SSD v2 lets you tune IOPS and throughput independently of size (better cost/performance than provisioning huge Premium SSDs for IOPS); use Ultra Disk for the most demanding low-latency workloads (log disks). Stripe multiple disks for higher aggregate throughput where a single disk caps out, and keep anything that must persist off host-local temp/ephemeral disks, which are wiped on deallocation, redeploy, or a host move.
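
Striping only helps up to the VM size's own disk limits. The sketch below shows that arithmetic; every number in it is a hypothetical placeholder, so substitute the published limits for your disk SKU and VM size.

```python
def striped_performance(disks, vm_iops_cap, vm_mbps_cap):
    """Aggregate IOPS/throughput of a stripe set, limited by the VM's caps."""
    total_iops = sum(d["iops"] for d in disks)
    total_mbps = sum(d["mbps"] for d in disks)
    return min(total_iops, vm_iops_cap), min(total_mbps, vm_mbps_cap)


# Hypothetical: four data disks provisioned at 5,000 IOPS / 200 MB/s each.
data_disks = [{"iops": 5000, "mbps": 200}] * 4
```

With a hypothetical VM cap of 12,800 IOPS and 600 MB/s, `striped_performance(data_disks, 12800, 600)` returns the VM caps rather than the 20,000 IOPS the disks could deliver: past that point, a larger VM size is the fix, not another disk.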

Access and management

Azure Bastion - browser-based RDP/SSH to VMs with no public IP and no exposed 3389/22, gated by RBAC. The secure default for admin access.
Serial console + boot diagnostics for out-of-band recovery.
VM extensions (Custom Script, DSC), Run Command for ad-hoc commands, Azure Monitor Agent for telemetry.
Update Manager for patch orchestration; Azure Arc to manage on-prem/other-cloud servers with the same tooling (policy, monitoring, extensions).

Security note - Bastion + managed identity, no public IPs

The secure pattern: VMs have no public IP, admins connect via Azure Bastion (RBAC-gated, no exposed RDP/SSH), and the VM uses a managed identity for any Azure API access. This removes public RDP/SSH exposure (a top attack vector) and eliminates stored credentials. Enforce "no public IP on VMs" with Azure Policy.

App Service, Functions, Container Apps

Service	What it is	Use for
App Service	Managed PaaS for web apps/APIs (Windows/Linux, code or container)	Web apps/APIs without managing VMs - a common default
Azure Functions	Event-driven serverless functions (Consumption / Flex Consumption / Premium plans)	Event handlers, glue, automation, scale-to-zero
Container Apps	Serverless containers (Kubernetes/KEDA under the hood, no cluster to run)	Microservices/containers without managing AKS - scale to zero, Dapr optional
Container Instances (ACI)	Single containers on demand	Simple, short-lived container tasks

Architect note - reach for PaaS/serverless first

For a new stateless web app or API, start with App Service or Container Apps; for event-driven code, Functions. No VM patching, built-in scaling and slots, managed TLS, and easy managed identity. Drop to VMs/AKS only when you need OS control, specialized kernels/GPUs, long-lived stateful services, or the Kubernetes ecosystem. This cuts a lot of ops versus the old "spin up a VM" reflex.

Choosing VM families by workload

Workload	Starting point
Web / API	App Service/Container Apps; or D-series VMSS behind a load balancer
Middleware	D/E-series VMSS, memory-leaning
Databases (self-managed)	E/M-series + Premium SSD v2/Ultra; or Azure SQL/PaaS (section 6)
Oracle workloads	E/M-series VM (self-managed Oracle) or Oracle Database@Azure; constrained-vCPU for licensing
SAP	M-series (HANA-certified), proximity placement group, Ultra Disk
Batch / CI / render	Spot VMs in a VMSS, or Azure Batch
Memory-heavy	E or M series
CPU-heavy	F series
Storage-heavy (local IOPS)	L series
GPU / AI	N series (NC/ND/NV); or Azure ML (section 12)
Cost-sensitive / spiky	B-series + autoscale; Spot for fault-tolerant parts; RIs/Savings Plan for baseline

Operational guidance

Resize / patch VMs safely [Ops]

Resize: pick a size that is available in the region/zone; the VM restarts, so expect brief downtime. If the target size is not offered on the current hardware cluster, stop/deallocate first, then resize and start. In a VMSS, update the model and roll the instances.
Patch: use Update Manager for scheduled, reported OS patching across VMs (and Arc servers); for VMSS prefer replacing instances from a new image (immutable) over in-place patching.

Troubleshoot boot / high CPU / memory / disk [Ops]

Boot: enable boot diagnostics + serial console to see boot output; use the VM "Redeploy"/"Reset password"/repair-VM options for stuck boots.
High CPU / memory: Azure Monitor + VM Insights (install the Azure Monitor Agent - guest memory isn't collected without it); right-size or autoscale.
Disk full: expand the managed disk, then grow the partition/filesystem; alert at 85%.
Disk attach: confirm the disk is attached and initialized in the guest; check that it is in the same region (and zone, for zonal VMs) as the VM; verify the LUN mapping.
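
The 85% threshold can also be checked from inside the guest. This is a minimal standard-library sketch for ad-hoc use or a cron job; it is an illustration only, and in production the same threshold would normally live in an Azure Monitor alert rule.

```python
import shutil


def disk_usage_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def needs_alert(used_percent, threshold=85.0):
    """True once usage reaches the alert threshold."""
    return used_percent >= threshold
```

Call `needs_alert(disk_usage_percent("/var"))` (or a drive path on Windows) for each mount point you care about.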

Design compute for production HA [Design]

Zone-spread VMSS (Flexible) + autoscale + health probes behind a zone-redundant load balancer.
No public IPs; Bastion for access; managed identity for API access.
Azure Monitor Agent + VM Insights; Update Manager for patch compliance; Compute Gallery for golden images.
RIs/Savings Plan + Hybrid Benefit for cost; a paired region for DR (Azure Site Recovery, section 13).

Operations note - maintenance & live migration

Azure uses memory-preserving live migration and scheduled maintenance for most host events, often with no reboot; some events require a reboot you can control via Scheduled Events (the metadata endpoint that warns the VM). Design for it: zone/VMSS spread + health probes so a maintenance reboot never takes the whole tier. GPU and specialized sizes may not live-migrate.
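
A minimal sketch of reading Scheduled Events from inside a VM. The endpoint address and the required `Metadata: true` header belong to the Instance Metadata Service; the API version shown is one known published version (check the docs for the current one), and the filtering helper and its name are illustrative.

```python
import json
import urllib.request

# Instance Metadata Service endpoint; reachable only from inside the VM.
ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_events(timeout=5):
    """Query the metadata endpoint and return the parsed JSON document."""
    req = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def disruptive_events(document, types=("Reboot", "Redeploy", "Preempt", "Terminate")):
    """Return (EventId, EventType, NotBefore) for events that interrupt the VM."""
    return [
        (e["EventId"], e["EventType"], e.get("NotBefore", ""))
        for e in document.get("Events", [])
        if e.get("EventType") in types
    ]
```

A small agent can poll `fetch_events()` and, when `disruptive_events()` returns anything, drain the node from the load balancer before the `NotBefore` time.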

Official documentation: Azure Virtual Machines & App Service →

5. Storage Deep Dive

Managed Disks, Blob Storage, Azure Files, and NetApp Files - their performance, redundancy (LRS/ZRS/GRS/GZRS), tiers, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes. Plus Azure Backup and Site Recovery.

Last reviewed: July 2026 Verify disk SKUs, redundancy options, and access-tier behavior in current docs.

Managed disks · Blob Storage · Redundancy (LRS/ZRS/GRS/GZRS) · Files & NetApp · Backup & Site Recovery · When to use which · Gotchas

TL;DR

Managed Disks (Standard HDD/SSD, Premium SSD, Premium SSD v2, Ultra) = block storage for VMs. Blob Storage (Hot/Cool/Cold/Archive tiers) = object storage for backups, data lakes, static content - not a filesystem. Azure Files = managed SMB/NFS shares; Azure NetApp Files = high-performance enterprise NAS. Choose redundancy (LRS/ZRS/GRS/GZRS) deliberately per data. Lock down storage accounts (disable public blob access, use Private Endpoints, prefer managed identity over keys/SAS). Azure Backup for backup, Site Recovery for DR replication.

Managed Disks

Disk type	Profile	Use for
Standard HDD	Lowest cost, low IOPS	Dev/test, infrequently accessed data
Standard SSD	Better latency/consistency than HDD	Light production, web servers
Premium SSD	Production-grade, size-linked performance	Most production VMs/databases
Premium SSD v2	IOPS/throughput tunable independently of size	Best cost/perf for databases needing IOPS without huge capacity
Ultra Disk	Highest performance, sub-ms latency	Top-tier databases (log disks), SAP HANA
Ephemeral OS disk	On the host - fast, free, lost on deallocate	Stateless VMSS OS disks only

DBA note - size for IOPS, not just capacity

Provision Premium SSD v2 (tune IOPS/throughput) or Ultra for database data/log disks. On Premium SSD (v1), enable read-only host caching for data disks and none for log disks (per SQL/Oracle guidance); Premium SSD v2 and Ultra Disk do not support host caching. Stripe disks for higher aggregate throughput when one disk caps out. Zone and regional resilience for a stateful VM comes from zone-redundant deployment + Azure Backup/Site Recovery, not the disk alone (Premium SSD v2 and Ultra disks are zonal).

Blob Storage

A storage account holds blob containers (plus optionally Files/Queues/Tables). Blobs come in access tiers: Hot (frequent access), Cool (infrequent; 30-day minimum retention), Cold (rare; 90-day minimum retention), and Archive (offline, cheapest; must be rehydrated before reading, which takes hours).
Lifecycle management auto-moves/deletes blobs by age/access; versioning + soft delete protect against accidental change/delete; immutable storage (time-based/legal hold) gives WORM compliance; object replication copies between accounts/regions.
Data Lake Storage Gen2 = a storage account with hierarchical namespace enabled (real directories, POSIX ACLs) for analytics - covered in section 11.
Access: prefer Entra ID + RBAC data roles + managed identity. SAS tokens (account/service/user-delegation) grant scoped, time-boxed access; stored access policies let you revoke them. Storage account keys are all-powerful - avoid distributing them.
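
As an illustration of lifecycle management, the sketch below builds a policy document in the JSON shape the storage lifecycle feature accepts. The rule name, prefix, and day thresholds are hypothetical; verify the schema against current documentation before applying it.

```python
import json


def lifecycle_rule(name, prefix, cool_days, archive_days, delete_days):
    """Build one lifecycle rule that tiers down and finally deletes block blobs."""
    if not 0 < cool_days < archive_days < delete_days:
        raise ValueError("thresholds must increase: cool < archive < delete")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_days},
                    "tierToArchive": {"daysAfterModificationGreaterThan": archive_days},
                    "delete": {"daysAfterModificationGreaterThan": delete_days},
                }
            },
        },
    }


# Hypothetical policy: backups go Cool at 30 days, Archive at 90, deleted at ~7 years.
policy = {"rules": [lifecycle_rule("age-out-backups", "backups/", 30, 90, 2555)]}
policy_json = json.dumps(policy, indent=2)
```

Keeping the policy in code (and in source control) makes retention changes reviewable, which matters once immutability and compliance retention are involved.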

Common mistake - Blob is not a filesystem; public exposure

Blobs are objects (no in-place random writes, no POSIX locking on flat namespace) - don't run a lock-dependent app on them; use Files/NetApp for filesystem semantics. And the classic breach: a storage account with "Allow Blob public access" on and an anonymous container. Disable public blob access at the account level (enforce via Azure Policy), use Private Endpoints, and prefer managed identity + RBAC over keys/SAS.

Security note - SAS tokens are a leak risk

A SAS token is a bearer credential - anyone with the URL has its access until it expires. Prefer user-delegation SAS (backed by Entra, revocable) or managed identity + RBAC instead of account-key SAS; keep lifetimes short; back long-lived SAS with a stored access policy so you can revoke; and never commit SAS/keys to code. Turn on soft delete + versioning for recoverability.
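
A quick way to audit a SAS URL you find in config or logs is to read its query parameters. The sketch below relies on the standard SAS fields (`sp` permissions, `se` expiry, `si` stored access policy, `skoid` present on user-delegation SAS); the function name and the sample URL in the usage note are illustrative, and nothing here contacts Azure or validates the signature.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def inspect_sas(url, now=None):
    """Summarise the risk-relevant fields of a SAS URL."""
    params = parse_qs(urlparse(url).query)
    now = now or datetime.now(timezone.utc)
    hours_left = None
    expiry_raw = params.get("se", [None])[0]
    if expiry_raw:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:  # date-only expiry: treat as UTC
            expiry = expiry.replace(tzinfo=timezone.utc)
        hours_left = (expiry - now).total_seconds() / 3600
    return {
        "permissions": params.get("sp", [""])[0],
        "hours_left": hours_left,               # None if expiry comes from a policy
        "user_delegation": "skoid" in params,   # signed with an Entra-issued key
        "stored_access_policy": "si" in params, # revocable via the policy
    }
```

For example, `inspect_sas("https://example.blob.core.windows.net/c/b.txt?sp=r&se=2026-07-01T12%3A00%3A00Z&skoid=abc&sig=xyz")` reports read-only permission on a user-delegation SAS; a long `hours_left` with neither `user_delegation` nor `stored_access_policy` set is the combination to chase down.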

Redundancy: LRS, ZRS, GRS, GZRS

Option	Copies	Protects against
LRS	3 copies in one datacenter	Disk/rack failure only
ZRS	3 copies across zones in the region	Datacenter/zone failure (in-region HA)
GRS	LRS + async copy to the paired region	Regional disaster (read access with RA-GRS)
GZRS	ZRS + async copy to the paired region	Zone and regional failure (highest)

Architect note - choose redundancy per data

Don't default everything to GRS (paying for cross-region copies you may not need) or leave critical data on LRS (no zone/region protection). Use ZRS/GZRS for data that must survive a zone loss, GRS/GZRS for data needing regional DR, and LRS for easily-reproducible or dev data. GRS is async - failover has an RPO and (for customer-initiated failover) operational steps; it is not zero-RPO.
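
The decision reduces to two questions per dataset. A minimal sketch of that mapping follows; the function and parameter names are illustrative.

```python
def choose_redundancy(survive_zone_loss, survive_region_loss):
    """Map resilience requirements to a storage redundancy option."""
    if survive_zone_loss and survive_region_loss:
        return "GZRS"
    if survive_region_loss:
        return "GRS"
    if survive_zone_loss:
        return "ZRS"
    return "LRS"
```

Recording the two answers per dataset (for example as resource tags) makes the choice auditable; remember that the geo options replicate asynchronously, so they still carry a non-zero RPO.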

Azure Files, File Sync, and NetApp Files

Azure Files - managed SMB and NFS shares (Standard on HDD, Premium on SSD). Mount from VMs, on-prem, or containers. Identity-based auth (Entra/AD) for SMB.
Azure File Sync - cache Azure Files on on-prem Windows servers with cloud tiering (hybrid file access).
Azure NetApp Files - high-performance, low-latency enterprise NAS (NFS/SMB) for demanding workloads: SAP, HPC, large shared filesystems, and Oracle datafiles over NFS. Separate service, higher performance/cost.

DBA note

For Oracle on Azure VMs needing shared/NFS storage or extreme performance, Azure NetApp Files is the common choice (certified performance tiers, snapshots). For SAP shared filesystems (sapmnt, transport) NetApp Files or Premium Files is standard. Confirm certification and performance tier for the workload.

Azure Backup and Azure Site Recovery

	Azure Backup	Azure Site Recovery (ASR)
Purpose	Point-in-time backup & restore (VMs, disks, files, SQL/SAP in VM, Blob)	Continuous replication for DR failover to another region
Recovery	Restore from a recovery point (higher RTO/RPO)	Fail over the whole workload with low RPO
Use for	Data protection, retention, ransomware recovery (immutable vault)	Cross-region DR of running workloads

Security note

Enable immutability and soft delete on the Recovery Services/Backup vault, plus multi-user authorization, so the backups themselves survive a ransomware or insider attack. A backup an attacker can delete is not a backup.

When to use which

Need	Use
VM OS / database disks	Managed Disks (Premium SSD v2 / Ultra for DB)
Backups (DB/app)	Azure Backup + Blob (Cool/Archive) with lifecycle + immutability
Log / long-term archive	Blob Archive tier + lifecycle + immutable policy
Data lake	Blob with hierarchical namespace (ADLS Gen2)
Shared filesystem (SMB/NFS)	Azure Files (or NetApp Files for high performance)
High-performance NAS / SAP / Oracle NFS	Azure NetApp Files
Static website / media	Blob static website + Front Door/CDN
Hybrid file access	Azure File Sync
Bulk data into Azure	Data Box (physical) / AzCopy (online)
Cross-region DR of workloads	Azure Site Recovery

Storage gotchas

Blob is object storage, not a filesystem - no random writes/locks; use Files/NetApp for that.
Archive tier has a rehydration delay (hours) - never for data you need immediately.
SAS tokens are a security risk - short-lived, user-delegation, revocable; prefer managed identity + RBAC.
Public blob access exposure - disable at account level; enforce by Policy.
Private Endpoint DNS mistakes - link the correct privatelink.* zone or nothing resolves.
Disk performance sizing - IOPS/throughput follow SKU/size; use Premium SSD v2/Ultra + striping for DBs.
Snapshot cost growth - incremental snapshots still accumulate; set retention.
Cross-region replication cost - GRS/object replication cost money and lag (async RPO).
Wrong redundancy - confusing LRS/ZRS/GRS/GZRS leaves data under- or over-protected.
Ephemeral OS disk data loss - it's lost on deallocate; stateless only.

Official documentation: Azure Storage, Disks, Files & NetApp Files →

6. Database Services Deep Dive

Azure's database portfolio - Azure SQL (Database / Managed Instance / on VM), PostgreSQL, MySQL, Cosmos DB, Cache for Redis, and the analytics stores - what each manages, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from SQL Server or Oracle.

Last reviewed: July 2026 DB features, tiers, and limits change - verify in current docs. Oracle Database@Azure especially.

Portfolio · Service deep dives · Decision table · Connectivity & security · HA / DR / backup · Oracle DBA gotchas · Examples

TL;DR

For SQL Server workloads, choose along a spectrum: Azure SQL Database (fully-managed, cloud-native, most managed) → SQL Managed Instance (near-full SQL Server compatibility, managed) → SQL Server on a VM (full control, you manage). For open source, Azure Database for PostgreSQL / MySQL (Flexible Server). For NoSQL at global scale, Cosmos DB. For cache, Azure Cache for Redis. For analytics, Synapse / Microsoft Fabric. Oracle has no native managed service - use a VM or Oracle Database@Azure. Managed services own patching/backup/HA; you own schema, queries, and access.

The portfolio at a glance

Service	Model	Sweet spot
Azure SQL Database	Fully-managed SQL (single DB / elastic pool; DTU or vCore)	New/cloud-native SQL apps; most managed, least control
Azure SQL Managed Instance	Managed instance with near-full SQL Server surface (SQL Agent, cross-DB, CLR, linked servers)	Lift-and-shift SQL Server needing instance features
SQL Server on Azure VM	IaaS - you run SQL Server	Full control, unsupported features/versions, OS access
Azure DB for PostgreSQL / MySQL (Flexible Server)	Managed OSS databases	Postgres/MySQL apps; zone-redundant HA
Cosmos DB	Globally-distributed multi-model NoSQL (NoSQL/Mongo/Cassandra/Gremlin/Table APIs)	Global scale, low latency, elastic throughput
Azure Cache for Redis	Managed Redis	Cache, session, rate limiting, leaderboards
Synapse / Microsoft Fabric / Data Explorer	Analytics warehouse / lakehouse / log-time-series	Analytics, not OLTP (section 11)
Oracle Database@Azure	Oracle Exadata/Autonomous run by Oracle inside Azure datacenters	Oracle workloads wanting managed Oracle in Azure (verify current availability)

Service deep dives

Azure SQL Database

Purchasing models: vCore (recommended; General Purpose / Business Critical / Hyperscale service tiers) or legacy DTU. Serverless compute auto-scales and can auto-pause. Hyperscale scales storage to 100TB+ with fast backups/restores.
HA: built-in; Business Critical and zone-redundant options give higher SLAs. Failover groups + active geo-replication for cross-region DR (readable secondaries).
Backups: automatic with point-in-time restore (retention configurable) + long-term retention; you don't manage backup files.
Patching: fully Microsoft-managed (in maintenance windows you can set).
Limits: no SQL Agent, no cross-database queries by default, no instance-level features - it's a single database service. If you need those, use Managed Instance.

DBA note

Azure SQL Database is not "SQL Server in the cloud" - it's a cloud-native database service. Great for new apps and single-database workloads, but instance-level features (Agent jobs, cross-DB, CLR, linked servers) aren't there. Choose Managed Instance when a lift-and-shift needs them.

Azure SQL Managed Instance

A managed instance with near-full SQL Server compatibility: SQL Agent, cross-database queries, CLR, Service Broker, linked servers, and instance-scoped features - deployed into your VNet (private). The best target for lifting an existing SQL Server estate while offloading patching/backup/HA.

HA: built-in; Business Critical + zone redundancy; failover groups for cross-region DR.
Backups/patching: managed (automatic backups + PITR; Microsoft patches in windows).
Networking: lives in a delegated subnet in your VNet - plan the subnet and connectivity.
Still has limits vs. a full SQL Server on a VM (some instance features, no OS access, specific configurations). Verify the surface for your app.

Azure Database for PostgreSQL / MySQL (Flexible Server)

Flexible Server is the current model: zone-redundant HA (standby in another zone), configurable maintenance windows, private access (VNet-injected or Private Endpoint), and server parameters.
HA/DR: zone-redundant HA in-region; read replicas (incl. cross-region) for scaling and DR.
Backups: automatic + PITR; you set retention.
Postgres supports the pgvector extension for AI/vector workloads (section 12); MySQL for common LAMP-style apps.

DBA note

Flexible Server gives more control (parameters, maintenance timing, HA choice) than the older Single Server. Plan private networking (VNet integration or Private Endpoint) up front; enable zone-redundant HA explicitly for production.

Cosmos DB & Azure Cache for Redis

Cosmos DB - globally-distributed, multi-model NoSQL with turnkey multi-region writes, 5 consistency levels, elastic RU/s (or autoscale/serverless), and single-digit-ms latency. Partition key design is critical - a poor key hotspots partitions and inflates RU cost. Has an analytical store + vector search.
Azure Cache for Redis - managed Redis for caching, sessions, and pub/sub; tiers (Basic/Standard/Premium/Enterprise) trade HA, clustering, persistence, and modules.

DBA note - Cosmos is not relational

Cosmos DB is NoSQL - no joins, no ad-hoc SQL across containers, throughput is provisioned as RU/s and cost is driven by partition-key design and query efficiency. Model for access patterns and even key distribution; don't force a relational schema onto it.
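
The effect of key choice on partition load can be simulated. In the sketch below the hash function is a stand-in (Cosmos DB uses its own internal hashing), and the partition count and request mix are hypothetical.

```python
import hashlib
from collections import Counter


def partition_load(keys, partitions=4):
    """Count requests per partition using a stand-in hash of the key."""
    load = Counter()
    for key in keys:
        digest = hashlib.sha256(key.encode()).digest()
        load[int.from_bytes(digest[:8], "big") % partitions] += 1
    return load


def hottest_share(load):
    """Fraction of all requests that land on the busiest partition."""
    return max(load.values()) / sum(load.values())


# Hypothetical workload of 1,000 requests.
by_country = ["US"] * 900 + ["DE"] * 50 + ["JP"] * 50  # low-cardinality, skewed key
by_order_id = [f"order-{i}" for i in range(1000)]      # high-cardinality key
```

Comparing `hottest_share(partition_load(by_country))` with the same call on `by_order_id` shows the skewed key pushing at least 90% of requests onto one partition, while the high-cardinality key spreads them roughly evenly. That hot partition is what throttles first and drives RU cost.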

Database service decision table

Workload	Recommended	Reason	HA	DR	Ops responsibility	Cost lever
New SQL app (single DB)	Azure SQL Database	Most managed, cloud-native	Built-in / zone-redundant	Failover group / geo-replica	Schema/queries	vCore right-size; serverless auto-pause
Lift-and-shift SQL Server	SQL Managed Instance	Instance-level compatibility, managed	Built-in / zone-redundant	Failover group	Schema/queries + agent jobs	Right-size; Hybrid Benefit
Full control / unsupported feature	SQL Server on Azure VM	OS + full SQL control	You build (AG/FCI + zones)	You build (AG/ASR)	Everything	Hybrid Benefit; VM size
PostgreSQL app	Azure DB for PostgreSQL (Flexible)	Managed, zone-redundant	Zone-redundant HA	Cross-region read replica	Schema/queries	Right-size; burstable tiers
MySQL web app	Azure DB for MySQL (Flexible)	Managed, common	Zone-redundant HA	Read replica	Schema/queries	Right-size
Global NoSQL / low latency	Cosmos DB	Global distribution, elastic	Built-in	Multi-region	Data model / partition key	Autoscale RU / serverless
Cache / session	Azure Cache for Redis	Managed Redis	Premium/Enterprise HA	Geo-replication (passive on Premium, active on Enterprise)	Keys/TTL	Right-size tier
Data warehouse / analytics	Synapse / Microsoft Fabric	Analytics, not OLTP	Built-in	Config-dependent	Schema/queries	Pause/scale compute
Oracle workload	Oracle DB@Azure or Oracle on VM	No native managed Oracle	Oracle-managed / you build	Data Guard	Oracle side	Licensing; verify offering

Connectivity & security

Private Endpoint (or VNet injection for MI/Flexible Server) is the production default - disable public network access. Plan the Private DNS zone (privatelink.database.windows.net, etc.).
Entra authentication + managed identity - apps authenticate to Azure SQL/Postgres/MySQL via Entra tokens (no passwords in config). Prefer this over SQL logins.
Encryption: TDE (transparent data encryption, on by default; customer-managed keys via Key Vault), Always Encrypted for column-level protection from even DBAs, TLS in transit.
Microsoft Defender for SQL - vulnerability assessment + threat detection; Query Performance Insight + automatic tuning for performance.

Common mistake

Leaving a database on its public endpoint with broad firewall rules "to connect quickly." Disable public network access, use Private Endpoint + Private DNS, and authenticate with Entra + managed identity. Retrofitting private connectivity later is a migration - plan it up front.

How HA, DR, backup, and patching differ

Service	HA	DR	Backup	Patching
Azure SQL DB	Built-in; zone-redundant / Business Critical	Failover groups + active geo-replication	Automatic + PITR + LTR	Microsoft (windows)
SQL Managed Instance	Built-in; zone-redundant	Failover groups	Automatic + PITR	Microsoft (windows)
PostgreSQL / MySQL Flexible	Zone-redundant HA (opt-in)	Cross-region read replica	Automatic + PITR	Microsoft (windows)
SQL Server on VM	You build: Always On AG / FCI + zones	You build: AG replica / Azure Site Recovery	You configure (Azure Backup for SQL)	You (Update Manager)
Cosmos DB	Built-in	Multi-region (turnkey)	Periodic (default) or continuous backup with PITR	Fully managed

Operations note - test restores & failover

"Automatic backups" does not prove recoverability - periodically restore/PITR to a new database and validate. For DR, rehearse the failover group failover (or AG failover for VMs): confirm the secondary is within RPO, connection strings/listeners repoint, and the app works end-to-end - not just that the database opened.

Azure database gotchas for Oracle DBAs

For DBAs coming from Oracle (or SQL Server on-prem)

Azure SQL Database != SQL Server on a VM - it's a cloud-native single-database service without instance features (Agent, cross-DB, linked servers, OS access).
Managed Instance gives more compatibility but still has limits - verify your specific instance features are supported before assuming a clean lift-and-shift.
Patching control differs by service - PaaS is patched by Microsoft inside maintenance windows you can schedule, not on your own opatch/CU cadence.
Backup access differs - PaaS backups are service-managed (PITR/LTR), not files you copy; export (BACPAC/dump) for portability.
Private Endpoint and DNS must be planned - the privatelink.* Private DNS zone must be linked to the VNet, or the server name resolves to the public endpoint (and the connection fails outright once public network access is disabled).
Licensing choices matter - Azure Hybrid Benefit (bring your SQL/Windows licenses) materially changes cost; for Oracle, licensing/counting on Azure VMs needs LMS-aware planning.
Performance troubleshooting differs - Query Performance Insight / Query Store / automatic tuning and Azure Monitor, not your on-prem toolset.
Oracle is a special case - there is no native "managed Oracle." You self-manage Oracle on a VM (Data Guard, backups, ASM/NetApp Files, constrained-vCPU for licensing) or use Oracle Database@Azure (Oracle-operated Exadata/Autonomous in Azure) - verify current regional availability and terms in the official documentation before designing.

Enterprise examples

SQL Server enterprise workload SQL

Lift to SQL Managed Instance (Business Critical + zone redundancy) in a delegated subnet, Private Endpoint/VNet only, failover group to a paired region, Azure Hybrid Benefit, Defender for SQL on. Use SQL-on-VM only if a required feature isn't in MI.

Oracle workload on Azure Oracle

Self-managed Oracle on E/M-series VMs with Premium SSD v2/Ultra or Azure NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU + Dedicated Host for licensing. Or Oracle Database@Azure for a managed Oracle Exadata/Autonomous experience inside Azure (verify availability).

PostgreSQL application database OSS

Azure Database for PostgreSQL Flexible Server, zone-redundant HA, Private Endpoint, Entra auth, automatic backups + PITR, cross-region read replica for DR, pgvector if doing AI/RAG.

Globally distributed NoSQL NoSQL

Cosmos DB with multi-region writes, partition key chosen for even distribution, autoscale RU/s, chosen consistency level per workload, analytical store or vector search where needed.

Official documentation: Azure SQL, PostgreSQL, MySQL & Cosmos DB →

7. Load Balancing and Traffic Management

The four Azure load-balancing services - Load Balancer (L4), Application Gateway (L7 regional), Front Door (L7 global), and Traffic Manager (DNS) - when to use each, how they are assembled, and how to debug the classic unhealthy-backend and DNS failures.

Last reviewed: July 2026 Verify SKUs (Standard LB, App Gateway v2, Front Door tiers) in current docs.

TL;DR

Four services, two axes (L4 vs L7, regional vs global): Azure Load Balancer (L4, regional, public/internal, ultra-fast), Application Gateway (L7, regional, with WAF, path/host routing, TLS), Front Door (L7, global, anycast + CDN + WAF + global failover), and Traffic Manager (DNS-based global routing). Combine them (e.g. Front Door → regional App Gateway → backend). The #1 failure is an NSG blocking the health probe or the App Gateway management ports.

The four load-balancing services

Service	Layer / scope	Use for
Azure Load Balancer (Standard)	L4 (TCP/UDP), regional, public or internal, zone-redundant	High-throughput L4, internal VIPs (e.g. SQL AG listener), non-HTTP; also provides outbound rules
Application Gateway (v2)	L7 (HTTP/S), regional, with WAF	Regional web apps needing path/host routing, TLS termination, WAF, autoscaling
Azure Front Door	L7 (HTTP/S), global, anycast + CDN + WAF	Internet-facing global apps: edge acceleration, global load balancing/failover, WAF at the edge
Traffic Manager	DNS-based, global	DNS-level routing across regions/endpoints (priority/weighted/performance/geographic); works for non-HTTP too
Cross-region Load Balancer	L4 global	Global L4 with a single anycast frontend across regional LBs
NAT Gateway	Outbound SNAT	Scalable outbound internet for a subnet (not a load balancer, but the modern outbound method)

When to use which

Global internet web app, want edge/CDN + WAF + failover

Front Door (optionally → regional App Gateway)

Regional web app, path/host routing + WAF + TLS

Application Gateway v2 (with WAF)

L4 TCP/UDP, high throughput, internal VIP

Azure Load Balancer (internal or public)

DNS-level routing across regions / non-HTTP global

Traffic Manager (or Cross-region LB for L4)

Outbound internet for a subnet

NAT Gateway
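
The decision list above can be written down as a small lookup, which is handy for design reviews or intake forms. This is a sketch under stated assumptions: the function name and its four boolean parameters are hypothetical, and the return strings simply echo the recommendations in the list.

```python
def choose_traffic_service(*, http: bool, global_scope: bool,
                           dns_level: bool = False,
                           outbound_only: bool = False) -> str:
    """Map coarse requirements to the service named in the decision list."""
    if outbound_only:
        # Outbound internet for a subnet is not a load-balancing problem.
        return "NAT Gateway"
    if global_scope and (dns_level or not http):
        # DNS-level routing, or global traffic that is not HTTP.
        return "Traffic Manager (or Cross-region LB for L4)"
    if global_scope:
        # Global HTTP with edge/CDN + WAF + failover.
        return "Front Door (optionally -> regional App Gateway)"
    if http:
        # Regional HTTP with path/host routing, TLS, and WAF.
        return "Application Gateway v2 (with WAF)"
    # Regional L4 TCP/UDP, internal or public.
    return "Azure Load Balancer (internal or public)"
```

For example, `choose_traffic_service(http=True, global_scope=False)` returns the Application Gateway entry. Real designs usually combine layers, so treat the result as the entry-point choice rather than the whole stack.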

Architect note - they compose

These are layers, not competitors. A common enterprise stack: Front Door (global edge + WAF + failover) → regional Application Gateway (or directly to App Service) → backend pool; internal tiers use an internal Load Balancer. Put WAF at the layer facing the internet (Front Door for global, App Gateway for regional) - don't double-pay for two WAFs unless you need defense in depth.

Application Gateway anatomy + WAF

Listener (+ cert) → rules (path/host) + WAF → HTTP settings + probe → backend pool. WAF (OWASP) protects at L7.

Components: frontend IP, listener (port/protocol + cert), rules (path/host routing), HTTP settings (backend port/protocol, cookie affinity, probe), backend pool (VMSS/NICs/App Service/IPs), health probe, and WAF policy.
SSL: termination at the gateway (offload) or end-to-end (re-encrypt to backend). Manage certs via Key Vault integration.
WAF - OWASP core rule set + custom rules, geo/rate limiting; run in Detection mode first, then Prevention.

Front Door

Front Door is the global L7 entry point: anycast frontend, edge TLS + caching/CDN, WAF at the edge, and health-based global routing/failover across regional backends. Use it for internet-facing apps that need low latency worldwide and automatic region failover. Managed certificates auto-provision once DNS points at the Front Door endpoint.

Load balancing troubleshooting

⚑ Backend unhealthy / App Gateway probe failing

Likely causes (in order)

NSG blocks the probe or the required App Gateway management ports on the App Gateway subnet (v2 needs inbound TCP 65200-65535 from the GatewayManager service tag, plus the AzureLoadBalancer tag).
Health probe host/path/port/protocol wrong vs. what the app serves (probe expects 200-399).
App not listening / bound to localhost / wrong backend port in HTTP settings.
HTTPS probe cert/hostname mismatch (end-to-end SSL); backend expects a specific host header.
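For Azure Load Balancer: probe port not open, or the NSG doesn't allow the probe source (AzureLoadBalancer service tag, 168.63.129.16). Standard LB is closed by default, so backends also need an explicit NSG allow for the inbound traffic. A missing outbound rule breaks backend-initiated outbound connectivity, not probe replies.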

Checks & fix

App Gateway > Backend health (shows the reason); allow the probe + management traffic on the NSG; align the probe (path/port/protocol/host); bind the app to all interfaces; verify HTTP settings backend port/protocol.

az network application-gateway show-backend-health -g RG -n APPGW -o table
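
When the same command is run with -o json, the output can be filtered down to the servers that are not healthy, together with the probe's stated reason. The sketch below assumes the JSON nests as backendAddressPools > backendHttpSettingsCollection > servers, with address, health, and healthProbeLog fields on each server; verify that shape against your own output before relying on it. The function name is hypothetical.

```python
import json


def unhealthy_servers(backend_health: dict) -> list[tuple[str, str, str]]:
    """Return (address, health, probe log) for every server not reported Healthy."""
    results = []
    for pool in backend_health.get("backendAddressPools", []):
        for settings in pool.get("backendHttpSettingsCollection", []):
            for server in settings.get("servers", []):
                health = server.get("health", "Unknown")
                if health != "Healthy":
                    results.append((
                        server.get("address", "?"),
                        health,
                        server.get("healthProbeLog", ""),
                    ))
    return results


def unhealthy_from_json(text: str) -> list[tuple[str, str, str]]:
    """Convenience wrapper for the raw JSON text saved from the CLI."""
    return unhealthy_servers(json.loads(text))
```

Saving the CLI output to a file and passing its contents to `unhealthy_from_json` gives a short list to work through; an empty list means every server reported Healthy.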

⚑ SSL certificate issue

Causes: Front Door/App Gateway managed cert not provisioned (DNS must point at the frontend first, then validation completes); expired/incomplete cert chain; Key Vault access policy/permissions missing for App Gateway's managed identity; hostname/SNI mismatch. Fix: point DNS at the frontend and wait for provisioning; grant the gateway identity get on the Key Vault secret/cert; include the full chain and correct SANs.

⚑ Wrong listener/rule, WAF blocking valid traffic, or Private Endpoint DNS

Causes: listener on the wrong port/host or a catch-all rule masking a specific one; WAF denying legitimate requests (over-broad rule / false positive - check WAF logs, use Detection mode first, then tune/exclude); backend behind a Private Endpoint whose Private DNS isn't resolving; wrong frontend IP config (public vs private). Fix: verify listener/rule order and host, review WAF logs and add exclusions, confirm Private DNS zone linkage, check the frontend IP.

Official documentation: Azure load-balancing options →

8. Security Deep Dive

Defense in depth on Azure: identity (Entra, PIM, Conditional Access), governance (Azure Policy, locks), network (NSG, Firewall, Private Link, DDoS), data (Key Vault, encryption), and detection (Defender for Cloud, Sentinel) - plus how to secure subscriptions, storage, VMs, and databases, ending in a production checklist.

Last reviewed: July 2026 Defender/Sentinel capabilities evolve - verify plans and features in docs.

TL;DR

Layer your controls: identity (Entra + PIM for privileged, Conditional Access + MFA, managed identities, no standing Owner), governance (Azure Policy to forbid the risky thing; resource locks), network (private endpoints, NSG/ASG, Azure Firewall, DDoS, no public exposure, Bastion), data (Key Vault/Managed HSM, CMK, TDE, Always Encrypted), and detection (Defender for Cloud, Sentinel, centralized diagnostic logs to Log Analytics). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive Policy over after-the-fact detection.

Azure shared responsibility model

Microsoft secures the infrastructure (physical, host, network fabric, and managed-service internals). You own: identity and access (Entra + RBAC), data classification and access, network exposure and firewall, key management choices, workload/OS security (IaaS/AKS nodes), secure configuration, and monitoring/response. The higher up the managed stack (VM → AKS → Azure SQL → App Service/Functions), the more Microsoft handles - but data, identity, and configuration always remain yours.

The control layers

Layer	Controls	Key services
Identity & access	Who can sign in and do what	Entra ID, Conditional Access, PIM, Identity Protection, RBAC, managed identities
Governance	What is allowed to exist/be configured	Azure Policy (deny/audit/deployIfNotExists), resource locks, management groups
Network	What can reach what	NSG/ASG, Azure Firewall, Private Link/Endpoint, DDoS Protection, Bastion, WAF
Data	Protect data at rest/in transit	Key Vault / Managed HSM (CMK), TDE, Always Encrypted, storage encryption, TLS
Detective / posture	Find misconfig & threats	Microsoft Defender for Cloud, Microsoft Sentinel, Activity Log, diagnostic settings

Key Vault, Managed HSM, and encryption

Key Vault - store keys, secrets, and certificates; workloads read them via managed identity + RBAC (or access policies). Managed HSM for FIPS 140-2 Level 3 single-tenant HSM keys.
Encryption at rest - on by default (platform keys); use customer-managed keys (CMK) in Key Vault for storage/disks/databases where you need key control and the "disable key" switch.
Always Encrypted / TDE for databases; TLS everywhere in transit.
Turn on Key Vault soft-delete + purge protection so keys/secrets can't be permanently destroyed by mistake or malice.

Security note - Key Vault + managed identity in a locked resource group

Put Key Vaults in a controlled resource group where only a small key-admin group has admin, and grant workloads only the data actions they need (get secret / wrap-unwrap key) via their managed identity. Never store secrets in app settings, code, or pipeline variables in clear text - reference Key Vault. Enable purge protection; a vault whose keys can be purged is a single point of catastrophic data loss.

Defender for Cloud & Sentinel

Microsoft Defender for Cloud

CSPM (secure score, misconfiguration recommendations) + CWP (Defender plans for servers, SQL, storage, containers, Key Vault, etc.) with threat detection. Turn on the relevant plans; work the recommendations down to push the secure score up.

Microsoft Sentinel

Cloud-native SIEM/SOAR on Log Analytics - ingest Azure + M365 + third-party logs, detect with analytics rules, and automate response with playbooks.

Activity Log & diagnostic settings

Activity Log = control-plane operations (who did what). Diagnostic settings route resource + platform logs/metrics to a central Log Analytics workspace / storage / Event Hub. Enable them everywhere via Policy.

DDoS Protection & WAF

DDoS Network/IP Protection for L3/4 volumetric attacks; WAF (App Gateway/Front Door) for L7. Protect public frontends.

Architect note - centralize logs on day one

Use Azure Policy to enforce diagnostic settings sending every resource's logs to a central Log Analytics workspace (and onward to Sentinel), and route the Activity Log too - set this up in the landing zone. Retrofitting centralized logging after an incident, when you find the logs were never enabled, is the classic post-mortem finding. Turn on Defender for Cloud across the management group.

How to secure specific things

Secure a production subscription (and multi-subscription env) Foundation

Access via groups; no standing Owner at subscription scope; least-privilege built-in roles at RG/resource; privileged roles eligible via PIM; Conditional Access + MFA; protected break-glass accounts.
Preventive Azure Policy at the management group: allowed regions/SKUs, deny public IPs, deny public blob access, require diagnostic settings, require tags; resource locks on foundational resources.
Hub-and-spoke with Azure Firewall; Private Endpoints for PaaS; DDoS on public frontends.
Defender for Cloud (all relevant plans) + Sentinel; central Log Analytics; budgets + quotas.
Key Vault + managed identities; CMK for sensitive data.

Secure storage accounts Storage

Disable public blob access + "allow shared key" where possible; use Entra + RBAC data roles + managed identity over keys/SAS.
Private Endpoint + Private DNS; firewall to specific VNets; CMK for sensitive data; soft delete + versioning + immutability for backups.
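If SAS is required: prefer a user-delegation SAS with a short lifetime (revoke it by revoking the user delegation key or removing the RBAC assignment); for a service SAS, attach a stored access policy so it can be revoked.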

Secure VMs & databases Compute / Data

VMs: no public IP; Bastion for access; managed identity; NSG/ASG micro-segmentation; Update Manager; Defender for Servers.
Databases: Private Endpoint / VNet only, disable public access; Entra auth + managed identity; TDE + CMK; Always Encrypted for sensitive columns; Defender for SQL.

Secure public load balancers & reduce exposure Edge

Public HTTP behind Front Door / App Gateway + WAF and DDoS; backends private (no public IPs).
Enforce "no public IP on VMs" and "no public blob access" via Policy; use Private Endpoints for all sensitive PaaS.
Prefer Bastion + private access; audit for stray public IPs regularly (Defender/Policy).

Production Azure security checklist

Human access via groups; MFA enforced via Conditional Access; risky sign-ins blocked (Identity Protection).
Privileged roles (Owner, User Access Admin, Global Admin) eligible via PIM, not permanent; approvals + audit on.
Two protected, monitored break-glass accounts (excluded from lock-out policies, alerted on every sign-in).
No standing Owner/Contributor at subscription/MG scope for daily work; least-privilege built-in roles at RG/resource.
Workloads use managed identities; app secrets minimized, rotated, and in Key Vault.
Preventive Azure Policy: deny public IPs, deny public blob access, allowed regions/SKUs, require diagnostic settings + tags.
Resource locks (CanNotDelete) on hub network, Key Vault, and production data.
Databases and PaaS on Private Endpoints; public network access disabled; no public database endpoints.
Public HTTP behind Front Door/App Gateway + WAF; DDoS Protection on public frontends; backends private.
Sensitive data encrypted with CMK in Key Vault (soft delete + purge protection on).
Diagnostic settings + Activity Log centralized to a Log Analytics workspace; Sentinel ingesting.
Defender for Cloud plans enabled across the management group; secure score tracked.
Alerts on RBAC/PIM changes, new app secrets, public exposure, Key Vault access anomalies.
Budgets + quotas as guardrails; consistent tags for attribution.
Backups immutable (vault soft-delete/immutability); DR tested incl. CMK key availability in the DR region.

Common security mistakes

Common Azure security mistakes

Owner assigned too broadly; not using PIM (standing privilege).
Weak Conditional Access (no MFA, admins unprotected).
Storage account public exposure; over-permissive NSGs.
Public database endpoints instead of Private Endpoints.
Not enabling diagnostic logs; not centralizing logs.
Secrets in code/app settings instead of Key Vault; not using managed identities.
Not using Private Endpoints for sensitive services; not enforcing Azure Policy (guardrails off).

Official documentation: Microsoft cloud security benchmark & Defender for Cloud →

9. Observability, Monitoring, and Operations

Azure Monitor (metrics, logs, alerts), Log Analytics + KQL, Application Insights, and the operations tooling - what to monitor per service, how to build useful alerts without noise, and how to centralize logs across subscriptions.

Last reviewed: July 2026 Verify agent (AMA) and alert-rule details in current docs.

TL;DR

Azure Monitor is the umbrella: metrics (near-real-time numeric), logs (in a Log Analytics workspace, queried with KQL), Application Insights (APM for apps), and alerts (metric/log/activity) firing to action groups. Install the Azure Monitor Agent (AMA) on VMs for guest metrics/logs (memory isn't collected by default). Diagnostic settings route each resource's logs to the workspace. Centralize across subscriptions with a shared workspace + Policy, alert on user-visible symptoms, and route by severity.

The observability stack

Service	Role
Azure Monitor Metrics	Platform + custom numeric metrics, near-real-time, for dashboards and metric alerts.
Log Analytics workspace + KQL	Central log store; query with Kusto Query Language; the target for diagnostic settings.
Application Insights	APM: requests, dependencies, exceptions, traces, live metrics, availability tests for apps.
Alerts + Action Groups	Metric/log/activity/resource-health alerts → email, SMS, webhook, Logic App, ITSM, Functions.
Diagnostic settings	Route resource logs/metrics to Log Analytics / storage / Event Hub. Enable via Policy everywhere.
Activity Log / Resource Health / Service Health	Control-plane operations; per-resource health; Azure-side incidents & planned maintenance.
VM Insights / Container Insights	Curated VM and AKS monitoring (perf, maps, container logs) via the agent.
Azure Monitor Agent (AMA)	The agent for VM/Arc guest metrics and logs; configured by Data Collection Rules.
Workbooks / Dashboards	Interactive reports and shared operational views.

Operations note - install the agent for guest metrics

Azure collects host-level VM metrics (CPU, disk, network) by default, but guest memory, disk-free %, and process/log data require the Azure Monitor Agent (via a Data Collection Rule) - use VM Insights to deploy it. Many "we couldn't see the memory leak / disk filling" incidents trace back to the agent never being deployed.

What to monitor per area

VMs

CPU, memory (agent), disk free %, disk IOPS/throughput vs. SKU, availability/heartbeat, VMSS instance health.

Disks / storage accounts

Disk IOPS/throughput vs. provisioned; storage transactions, throttling (503 Server Busy), availability, capacity, unusual access.

Databases

DTU/vCore/CPU, storage %, connections, deadlocks, replication lag, backup status; Query Performance Insight.

Load balancers / App Gateway

Backend health, unhealthy host count, response time, 5xx, throughput, WAF blocks.

Networking

VPN/ExpressRoute status, NSG flow-log anomalies, NAT SNAT port usage, DNS.

Security

Defender alerts, Activity Log anomalies (RBAC/policy/public-exposure changes), Key Vault access failures.

Building useful alerts

Alert on symptoms users feel (5xx, unhealthy backends, DB down, latency), not only causes.
Use appropriate aggregation (avg/percentile) and an evaluation window + frequency to avoid flapping.
Use log alerts (KQL) for things metrics can't express; metric alerts for fast numeric thresholds; activity-log alerts for governance events.
Route by severity via action groups: Sev0/1 → page; Sev2/3 → ticket/Teams; info → dashboard.
Consider dynamic thresholds (ML baselines) for noisy signals, and Resource Health alerts for platform issues.
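
The windowing and routing ideas above can be sketched in a few lines. This is a simplified model, not how Azure Monitor evaluates rules internally: the function names, the fixed-size windows, and the severity-to-route table (taken from the routing bullet, with anything else going to a dashboard) are assumptions for illustration.

```python
from statistics import mean

# Severity routing as described above; unknown severities fall back to a dashboard.
ROUTES = {0: "page", 1: "page", 2: "ticket", 3: "ticket"}


def route_for(severity: int) -> str:
    return ROUTES.get(severity, "dashboard")


def window_averages(samples: list[float], window: int) -> list[float]:
    """Average consecutive windows, aligned so the newest sample ends a window."""
    if window < 1:
        raise ValueError("window must be >= 1")
    usable = len(samples) - len(samples) % window
    trimmed = samples[len(samples) - usable:]  # drop the oldest leftovers
    return [mean(trimmed[i:i + window]) for i in range(0, usable, window)]


def should_fire(samples: list[float], threshold: float,
                window: int, breaches_required: int) -> bool:
    """Fire only when the most recent N window averages all exceed the threshold."""
    if breaches_required < 1:
        raise ValueError("breaches_required must be >= 1")
    recent = window_averages(samples, window)[-breaches_required:]
    return len(recent) == breaches_required and all(a > threshold for a in recent)
```

With a threshold of 85, a single spike such as [50, 60, 99, 40, 45, 50] does not fire on 3-sample windows, while a sustained run such as [90, 92, 95, 88, 91, 97] does. That is the behaviour you want before paging anyone.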

Example alerts to implement

Alert	Condition	Severity
VM CPU high	CPU > 85% avg for 5-10 min	Warning → Critical
VM unavailable	Heartbeat missing / Resource Health unavailable	Critical
Memory pressure	Available memory (agent) < threshold	Warning
Disk usage / IOPS	Disk free < 15%; IOPS near provisioned limit	Warning
App Gateway unhealthy backend	Unhealthy host count > 0	Critical
Azure SQL CPU / DTU-vCore / storage	CPU/DTU > 90%; storage > 85%	Warning → Critical
Failed backups	Backup job failed / success signal absent	Critical
VPN tunnel down / ExpressRoute issue	Connection/circuit status != connected	Critical
Storage unusual access / throttling	Spike / 503 Server Busy throttling / anomalous access	Warning / Security
Function errors / throttles	Failure rate / throttle count over threshold	Warning → Critical
Key Vault access denied spikes	Forbidden/denied requests rising	Security review

Common mistake - alert fatigue

Paging on every transient spike trains people to ignore alerts. Use longer windows, appropriate aggregation, severity routing (only real user-impact pages), dynamic thresholds for noisy signals, maintenance suppression, and prune alerts nobody acts on. An alert that never leads to action should be a dashboard tile, not a page.

Centralizing logs across subscriptions

Use a shared Log Analytics workspace (in the management subscription) and enforce diagnostic settings across all resources via Azure Policy (deployIfNotExists) so every subscription sends logs there. Route the Activity Log too, and connect the workspace to Sentinel for security analytics. This gives cross-subscription visibility and satisfies retention/compliance without per-resource setup.

// KQL: top 20 VMs by average CPU over the last hour (requires VM Insights)
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCpu = avg(Val) by Computer
| top 20 by AvgCpu desc

Operations tooling

Update Manager for OS patch orchestration/compliance; Change Tracking & Inventory for drift; Automation Account for runbooks.
Azure Arc to bring on-prem/other-cloud servers, Kubernetes, and data services under the same monitoring, policy, and update tooling.
Defender for Cloud recommendations and Advisor for security/reliability/cost/performance guidance; Cost Management exports for spend (section 14).

Official documentation: Azure Monitor, Log Analytics & Application Insights →

10. Containers, Kubernetes, and Cloud Native

AKS, Container Apps, and the serverless / event-driven building blocks (Functions, Event Grid, Service Bus, Event Hubs, Logic Apps) - when to use each, how networking and identity work for containers, and reference patterns.

Last reviewed: July 2026 Verify AKS networking modes, workload identity, and Container Apps features in docs.

TL;DR

AKS (managed Kubernetes) for orchestrated microservices when you need the K8s ecosystem; Container Apps for serverless containers without running a cluster (scale-to-zero, KEDA, Dapr); Functions for event-driven code; App Service for web apps. Around them: Azure Container Registry, Event Grid / Event Hubs / Service Bus, Logic Apps, and API Management. AKS workloads use Workload Identity (federated, no secrets) to reach Azure/Entra.

The cloud-native services

Service	What it is	Use for
Azure Kubernetes Service (AKS)	Managed Kubernetes (free control plane; you manage node pools, or use node autoprovisioning)	Orchestrated microservices, platform teams, portable K8s
Container Apps	Serverless containers on managed Kubernetes/KEDA - no cluster ops, scale to zero	Most containerized microservices without AKS overhead
Container Instances (ACI)	Single containers on demand	Simple/short-lived tasks; AKS virtual-node burst
App Service	Managed web app/API PaaS (code or container)	Web apps/APIs
Azure Functions	Event-driven serverless functions	Event handlers, glue, automation
Azure Container Registry (ACR)	Private registry with scanning, geo-replication, tasks	Store/scan/build images
API Management (APIM)	Full API gateway/management	Publishing, securing, throttling, versioning APIs
Event Grid / Event Hubs / Service Bus	Eventing / big-data streaming / enterprise messaging	Event-driven and decoupled architectures
Logic Apps	Low-code workflow/integration with connectors	Integration, orchestration, SaaS connectors

AKS deep dive

Node pools: a system node pool (runs cluster-critical pods) and one or more user node pools (your workloads). Use Spot user pools for fault-tolerant work; scale per pool; virtual nodes (ACI) for burst.
Networking: Azure CNI (pods get VNet IPs - plan a big subnet; CNI Overlay reduces IP usage) vs kubenet (legacy). Private clusters keep the API server private.
Ingress: the Application Gateway Ingress Controller (AGIC) or the managed App Routing add-on / Gateway API provisions an App Gateway or LB; a Service type=LoadBalancer creates an Azure Load Balancer.
Identity: Microsoft Entra Workload Identity federates a Kubernetes service account to a managed identity so pods get Entra tokens with no secrets. Use Entra + Azure RBAC for cluster access, plus Kubernetes RBAC.
Security/ops: Defender for Containers, image scanning in ACR, Azure Policy for AKS (Gatekeeper), Container Insights for monitoring.

Architect note - Container Apps unless you need AKS

Reach for Container Apps first for containerized services - it removes cluster management, scales to zero, and supports KEDA/Dapr. Use AKS when you genuinely need the Kubernetes ecosystem (operators, service mesh, DaemonSets, GPU scheduling, cluster-level control). With AKS + Azure CNI, size the pod subnet for peak pods (or use CNI Overlay) - IP exhaustion stalls scheduling in ways that look like mysterious Pending pods. Use Workload Identity for pod-to-Azure access, never mounted secrets.
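
A rough sizing check helps catch an undersized subnet before it shows up as Pending pods. The sketch below assumes the traditional Azure CNI model, where each node takes one VNet IP plus one pre-allocated IP per possible pod, and it adds surge nodes for upgrades. It also assumes Azure reserves 5 addresses per subnet. CNI Overlay and dynamic IP allocation behave differently, so treat the numbers as a planning estimate; all function names are hypothetical.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # assumption: addresses Azure reserves in every subnet


def required_ips(nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """IPs needed at peak: every node (incl. upgrade surge) plus its pod slots."""
    total_nodes = nodes + surge_nodes
    return total_nodes * (1 + max_pods_per_node)


def usable_ips(cidr: str) -> int:
    """Addresses left in the subnet after the platform-reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


def subnet_fits(cidr: str, nodes: int, max_pods_per_node: int,
                surge_nodes: int = 1) -> bool:
    return required_ips(nodes, max_pods_per_node, surge_nodes) <= usable_ips(cidr)
```

For 50 nodes at 30 pods per node, the estimate is 51 x 31 = 1581 addresses, so a /22 (1019 usable) is too small and a /21 (2043 usable) fits. Size for the autoscaler maximum, not the current node count.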

AKS vs Container Apps vs Functions vs VMs

Many orchestrated microservices, need K8s ecosystem

AKS

Containerized services, no cluster ops, scale to zero

Container Apps

Event-driven code / glue

Functions (+ Event Grid)

Web app / API without containers

App Service

Full OS control / specialized kernels/GPUs

VMs / AKS with GPU pools

Cost note

Don't run an AKS cluster for a couple of containers - even with a free control plane, node pools cost more than Container Apps, which bills per usage and scales to zero. Reserve AKS for genuine orchestration needs, use Spot node pools for fault-tolerant workloads, and right-size node pools with the cluster autoscaler / node autoprovisioning.

Networking & identity for containers

Networking - AKS/Container Apps deploy into a VNet subnet; use private endpoints for dependencies (ACR, Key Vault, databases), Private DNS, and NSGs. Internal ingress + Private Link for private platforms.
Identity - Entra Workload Identity (AKS) / managed identity (Container Apps, Functions, App Service) for keyless access to Key Vault, storage, databases.
Supply chain - scan images in ACR (Defender), sign/verify, restrict pull to the workload identity; Azure Policy for AKS to enforce baselines.
Monitoring - Container Insights (AKS), built-in metrics/logs for Container Apps/Functions; App Insights for app-level tracing.

Messaging & events

Service	Model	Use for
Event Grid	Discrete event routing (pub/sub, reactive)	React to Azure/resource events (e.g. blob created) → Functions/Container Apps
Event Hubs	High-throughput streaming (Kafka-compatible)	Telemetry/log/IoT ingestion pipelines
Service Bus	Enterprise messaging (queues/topics, ordering, transactions, dead-letter)	Reliable decoupling, work queues, ordered processing
Logic Apps	Low-code workflows with connectors	Integration/orchestration across SaaS + Azure

Architecture patterns

Event-driven: a blob upload raises an Event Grid event → a Function processes it and writes to a database or hands off to Service Bus for reliable downstream processing.

Microservices on AKS - deployments behind AGIC/Gateway ingress, HPA/KEDA autoscaling, optional service mesh, Workload Identity, ACR + a deployment pipeline.
Microservices on Container Apps - each service a container, internal ingress + Dapr for service-to-service, KEDA scaling, managed identity - minimal ops.
Serverless function on a Blob event - as diagrammed; image/ETL/validation.
Event-driven architecture - Event Grid + Functions/Container Apps + Service Bus + Event Hubs for decoupled, resilient pipelines.
Private container platform - private AKS/Container Apps in a spoke VNet, internal ingress, private endpoints to ACR/Key Vault/DB, no public endpoints.
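
Event handlers in these patterns should tolerate duplicate deliveries, because event delivery is generally at-least-once. The sketch below shows the core of an idempotent handler for a blob-created event. It assumes the Event Grid event schema fields id, eventType, and data.url, and it uses an in-memory set as a stand-in for a durable de-duplication store; the function name and the process callback are hypothetical.

```python
from typing import Callable

BLOB_CREATED = "Microsoft.Storage.BlobCreated"


def handle_blob_created(event: dict, seen_ids: set[str],
                        process: Callable[[str], None]) -> bool:
    """Process a blob-created event once; return True only if work was done."""
    if event.get("eventType") != BLOB_CREATED:
        return False  # not an event this handler cares about
    event_id = event["id"]
    if event_id in seen_ids:
        return False  # duplicate delivery: already handled
    process(event["data"]["url"])
    # Record the id only after processing succeeds, so a failure can be retried.
    seen_ids.add(event_id)
    return True
```

In a real Function the seen-id check would live in a database or cache shared by all instances, and long-running work would be handed to Service Bus rather than done inline.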

Troubleshooting

⚑ AKS pod not starting (Pending / ImagePullBackOff / CrashLoopBackOff)

Causes: Pending = no schedulable capacity or pod IP exhaustion (Azure CNI subnet too small) or resource requests too big; ImagePullBackOff = ACR pull permission missing (grant AcrPull to the kubelet/managed identity), private ACR unreachable (needs private endpoint/DNS), or wrong image path; CrashLoopBackOff = app config/secret missing or bad probes. Checks: kubectl describe pod, kubectl logs --previous, node capacity, subnet free IPs. Fix: scale/enable autoscale or use CNI Overlay; grant AcrPull; fix probes/config/Workload Identity.
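
The status-to-cause mapping above lends itself to a small triage helper. This sketch parses the default text output of kubectl get pods, assuming the usual NAME / READY / STATUS column order, and attaches the likely causes listed in this runbook. The names TRIAGE and failing_pods are hypothetical.

```python
TRIAGE = {
    "Pending": [
        "no schedulable capacity",
        "pod IP exhaustion (Azure CNI subnet too small)",
        "resource requests too big",
    ],
    "ImagePullBackOff": [
        "AcrPull missing on the kubelet/managed identity",
        "private ACR unreachable (private endpoint/DNS)",
        "wrong image path",
    ],
    "CrashLoopBackOff": [
        "app config/secret missing",
        "bad probes",
    ],
}


def failing_pods(kubectl_output: str) -> dict[str, list[str]]:
    """Map each pod in a known failing status to its likely causes."""
    findings = {}
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 3:
            continue
        name, status = cols[0], cols[2]
        if status in TRIAGE:
            findings[name] = TRIAGE[status]
    return findings
```

The output is only a starting point; kubectl describe pod and kubectl logs --previous remain the source of truth for the actual reason.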

⚑ Container App revision issue / Function timeout / trigger issue

Container Apps: a new revision serving traffic but failing - check the container listens on the target port, ingress config, scale rules (min replicas), and the managed identity's RBAC; roll back to a previous revision or split traffic. Function timeout: raise the timeout/plan (Consumption caps duration; use Premium/Flex for long work), make idempotent, offload long work. Trigger not firing: check the binding/connection (managed identity or connection string), the event source, and the function's logs/metrics; verify the Event Grid subscription/filter.

Official documentation: AKS, Container Apps & Functions →

11. Analytics, Data, and Integration

The Azure data stack - Microsoft Fabric and Synapse, Data Factory, Data Lake Storage Gen2, Databricks, Event Hubs and Stream Analytics, Data Explorer, Purview governance, and Power BI - with the common lake/warehouse/streaming patterns.

Last reviewed: July 2026 Fabric is evolving fast and consolidating Synapse - verify current positioning in docs.

TL;DR

Land data in Data Lake Storage Gen2 (a storage account with hierarchical namespace). Analyze with Microsoft Fabric (the unified SaaS analytics platform: OneLake, Lakehouse, Warehouse, Data Factory, Power BI) or Synapse/Databricks. Ingest streams with Event Hubs + Stream Analytics, orchestrate ETL with Data Factory, query time-series/logs with Data Explorer, govern with Microsoft Purview, and visualize with Power BI.

The services

Service	Role
Microsoft Fabric	Unified SaaS analytics: OneLake (one logical lake), Lakehouse, Data Warehouse, Data Factory, Real-Time Intelligence, and Power BI - one capacity, one governance surface.
Azure Synapse Analytics	Integrated analytics (dedicated/serverless SQL pools, Spark, pipelines). Much of it is converging into Fabric - check current guidance.
Azure Databricks	First-party Apache Spark + Delta Lakehouse for large-scale data engineering/ML.
Data Factory	Cloud ETL/ELT orchestration with 100+ connectors (also in Synapse/Fabric).
Data Lake Storage Gen2	Blob + hierarchical namespace (directories, POSIX ACLs) - the lake foundation.
Event Hubs	High-throughput event streaming (Kafka-compatible) for ingestion.
Stream Analytics	Serverless real-time stream processing (SQL over streams).
Azure Data Explorer (ADX / Kusto)	Fast analytics over logs/time-series/telemetry (KQL).
Microsoft Purview	Data governance: catalog, classification, lineage, and access policies across the estate.
Power BI	BI/reporting and semantic models (native in Fabric).

Data engineering note - Fabric vs Synapse vs Databricks

Microsoft is consolidating analytics into Fabric (SaaS, OneLake, capacity-based). New greenfield analytics often start in Fabric; large existing Spark/Delta estates frequently stay on Databricks; Synapse remains for existing workloads and is being folded toward Fabric. Choose by your existing investment and team skills, and verify the current Microsoft positioning - this area moves fast.

The Fabric / OneLake model (mental model)

How Fabric differs from a database

OneLake is one logical data lake for the whole tenant (built on ADLS Gen2, open Delta/Parquet format) - "shortcuts" reference data in place instead of copying.
Storage and compute are separate; you buy capacity (compute) and workloads (Lakehouse, Warehouse, Power BI) share it.
Lakehouse (Spark/notebooks + a read-only SQL analytics endpoint) vs Warehouse (full T-SQL with multi-table transactions) - both store Delta data in OneLake.
Direct Lake lets Power BI read Delta directly for speed without import/DirectQuery trade-offs.
Lakehouse and Warehouse are analytical (OLAP) engines, not OLTP systems - keep OLTP in Azure SQL/Cosmos and feed the lake.

Common data patterns

Pattern	Built from
Data lake	ADLS Gen2 / OneLake (bronze/silver/gold) + Purview governance
Data warehouse	Synapse dedicated SQL pool or Fabric Warehouse + Power BI
Lakehouse	Fabric Lakehouse or Databricks (Delta) over ADLS/OneLake
ETL / ELT	Data Factory pipelines (+ Spark/Databricks for transforms)
Streaming ingestion	Event Hubs → Stream Analytics / Fabric Real-Time → lake/warehouse
Event-driven integration	Event Grid + Functions/Logic Apps + Service Bus
Reporting / BI	Power BI over Warehouse/Lakehouse (Direct Lake)
AI-ready data	Curated lake + Azure ML / Azure OpenAI + vector search (section 12)
Cross-org data sharing	Azure Data Share / Fabric sharing with governance
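The bronze/silver/gold layering in the data lake row can be sketched in plain Python. This is an illustration of the idea only: the sample records and function names are hypothetical, and a real pipeline would run in Spark, Data Factory, or Fabric over Delta tables.

```python
from collections import defaultdict

# Bronze: raw records as landed (hypothetical sample data, including bad rows).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate
    {"order_id": "2", "region": "US", "amount": "n/a"},    # unparseable amount
    {"order_id": "3", "region": "US", "amount": "4.25"},
]


def to_silver(rows):
    """Silver: validated, typed, de-duplicated rows."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine the row, not drop it silently
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        out.append({"order_id": int(row["order_id"]), "region": row["region"], "amount": amount})
    return out


def to_gold(rows):
    """Gold: business-level aggregate (revenue per region)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```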

Governance with Purview

Microsoft Purview provides a unified catalog, automated classification (PII/sensitive), lineage, and data access policies across Azure data sources, on-prem, and (increasingly) multicloud - so a growing lake stays governed instead of a "data swamp." Combine with Private Endpoints around data services, storage/lake ACLs, and column/row-level security in the warehouse for sensitive data.

Security note

Put analytics data behind Private Endpoints and (for the warehouse) column-/row-level security and dynamic data masking; classify and discover PII with Purview; and control sharing through governed mechanisms (Data Share / Fabric domains) rather than ad-hoc access. For sensitive data, keep the storage account private and use managed identity + RBAC, not keys.

Reference architecture: lakehouse + BI

Batch (Data Factory) and streaming (Event Hubs) land in the Delta lake; Spark transforms; Fabric Warehouse serves analytics; Purview governs; Power BI visualizes.

Official documentation: Microsoft Fabric, Synapse, Data Factory & Purview →

12. AI, ML, and Generative AI on Azure

Azure AI Foundry, Azure OpenAI, Azure AI Search, the pretrained AI services, and Azure Machine Learning - plus vector search across Cosmos DB / PostgreSQL, the enterprise RAG patterns, and the governance guardrails that separate a demo from something you can run on real data.

Last reviewed: July 2026 Model names, regions, quotas & pricing change fast - verify Azure OpenAI/Foundry availability in the portal.

AI Foundry & OpenAI AI Search & RAG Applied AI services Vector search RAG architecture Enterprise patterns Governance Warnings

TL;DR

Azure AI Foundry is the platform for building/deploying AI (models, agents, prompt flow, evaluation); Azure OpenAI serves GPT/embedding models with enterprise controls; Azure AI Search provides the retrieval layer (keyword + vector + semantic) for RAG. Store vectors in AI Search, Cosmos DB, or PostgreSQL (pgvector). Pretrained Azure AI services (Document Intelligence, Vision, Language, Speech, Translator) cover common tasks; Azure ML handles custom models and MLOps. The hard part is not the model - it is governing what the model can reach; use private endpoints, managed identity, and Content Safety.

Azure AI Foundry & Azure OpenAI

Capability	What it does
Azure AI Foundry	The unified platform/portal + SDK to build, evaluate, deploy, and monitor generative AI apps and agents; model catalog, prompt flow, tracing, and evaluation.
Azure OpenAI Service	GPT (chat/completions), embeddings, and other OpenAI models with Azure enterprise controls (private networking, RBAC, content filtering, data-not-used-to-train).
Model catalog	OpenAI + open + partner models to deploy (managed or serverless endpoints).
Prompt flow	Author, test, and evaluate LLM app flows (prompts, tools, retrieval) with versioning.
Content Safety	Detect/block harmful content, jailbreaks/prompt-injection, and groundedness issues.
Fine-tuning	Customize supported models where available (verify per model/region).

Azure AI Search (the retrieval layer)

Azure AI Search is the managed retrieval engine for RAG: it indexes your content and supports keyword, vector, and hybrid search plus semantic ranking. Integrated vectorization and indexers can chunk, embed, and index content from Blob/ADLS/SQL/Cosmos automatically. It is the most common grounding store for Azure OpenAI RAG.

Applied AI services & Azure ML

Document Intelligence

Extract text, tables, key-value pairs, and structure from documents (invoices, forms, contracts).

Vision / Language / Speech / Translator

Pretrained APIs for image analysis, entity/sentiment/PII, speech-to-text/text-to-speech, and translation - no training.

Azure Bot Service / Copilot patterns

Conversational bots and assistant patterns grounded on your data.

Azure Machine Learning

Full MLOps: training, pipelines, model registry, managed online/batch endpoints, and monitoring for custom models.

Vector search options

Option	Use when
Azure AI Search (vector/hybrid)	Purpose-built retrieval with hybrid + semantic ranking; the default RAG store.
Cosmos DB vector search	Vectors alongside operational NoSQL data at global scale.
Azure DB for PostgreSQL (pgvector)	Vectors in an existing Postgres, alongside relational data.
Azure SQL vector support	Vectors alongside relational SQL data (verify current availability).

AI note - keep vectors near governed data

Storing embeddings in Cosmos DB / PostgreSQL / Azure SQL means retrieval inherits your existing RBAC, Private Endpoints, backups, and row/column security - you combine similarity search with ordinary filters so retrieval respects entitlements. Use Azure AI Search when you want the best hybrid + semantic retrieval and integrated vectorization. Either way, filter retrieved context to what the requesting user is allowed to see.
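Combining similarity search with an entitlement filter can be sketched as below. It is a toy in-memory illustration, not the API of AI Search, Cosmos DB, or pgvector: the document shape, the `allowed_groups` field, and the two-dimensional vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, user_groups, top_k=2):
    """Filter by entitlement first, then rank what remains by similarity."""
    allowed = [d for d in docs if d["allowed_groups"] & user_groups]
    ranked = sorted(allowed, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


docs = [
    {"id": "hr-policy", "vector": [0.9, 0.1], "allowed_groups": {"hr"}},
    {"id": "eng-runbook", "vector": [0.8, 0.2], "allowed_groups": {"eng"}},
    {"id": "all-hands", "vector": [0.1, 0.9], "allowed_groups": {"hr", "eng"}},
]

eng_results = search([1.0, 0.0], docs, {"eng"})
```

Filtering before ranking means the most similar document (`hr-policy` here) never reaches an engineer's context, however well it matches the query.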

RAG architecture on Azure

Ingestion: docs → chunk → embed → Azure AI Search. Runtime: query → governed serving layer → security-trimmed retrieval → Azure OpenAI generates a grounded, audited answer (Content Safety on the loop).
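The chunk step of ingestion is often a sliding window with overlap so that a sentence split at a boundary still appears whole in one chunk. A minimal sketch, assuming character-based windows; the sizes are arbitrary, and production pipelines usually chunk by tokens or document structure instead.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", size=4, overlap=2)
```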

Enterprise patterns

Pattern	How	Watch out for
Chat with documents	RAG over Blob/ADLS + AI Search + Azure OpenAI	Chunking quality; stale index; citations
Chat with database	Retrieve from curated views; grounded answers	Never raw prod OLTP; use a serving layer
Natural language to SQL	Azure OpenAI proposes SQL over a governed schema	Validate/parametrize; read-only; no dynamic SQL
RAG with private data	Private endpoints for OpenAI + Search + storage; security trimming	Entitlement-aware retrieval
Document processing	Document Intelligence → extract → SQL/warehouse	Human review of low-confidence extractions
Call center AI	Speech + Language + Azure OpenAI + bot	Grounding; human escalation
MLOps pipeline	Azure ML pipelines + registry + managed endpoints + monitoring	Reproducibility; drift monitoring
Private GenAI	Private networking + managed identity + Content Safety + audit	Security trimming, prompt-injection defense

Governance and security for GenAI

Serving layer, always - agents/LLMs call a governed API (App Service/Function) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
Security-trimmed retrieval - filter retrieved context to what the requesting user may see (document/row/column) so RAG cannot leak across users.
Private & perimetered - use Private Endpoints for Azure OpenAI, AI Search, storage, and databases; keep model/data traffic off the internet.
Credential hygiene - secrets in Key Vault, access via managed identity; the model never sees raw credentials.
Auditability - log prompts, retrieved document IDs, and responses (per privacy rules); use Content Safety and groundedness checks.
Responsible AI - evaluate quality/safety in prompt flow, monitor deployed models, and require human review for consequential outputs.

Warnings (read before connecting AI to enterprise data)

Do not do these

Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
Protect credentials. No DB passwords, keys, or connection strings in prompts, code, or agent memory. Use Key Vault + managed identity.
Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; use Content Safety prompt-shields and isolate/sanitize retrieved and user content.
Check Azure OpenAI regional and model availability, quota, and pricing before you design - these vary by region and change frequently.
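The "validated, parameterized, read-only queries against a curated schema" rule can be enforced by the database connection rather than by inspecting SQL text. The sketch below uses SQLite's authorizer hook purely as a local stand-in. The table names are hypothetical, and on Azure SQL or PostgreSQL the equivalent control would be a read-only login with grants limited to curated views.

```python
import sqlite3

ALLOWED_TABLES = {"sales_summary"}  # hypothetical curated reporting table


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Permit SELECT statements and column reads on allow-listed tables; deny the rest."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def run_generated_query(conn, sql, params=()):
    """Run model-proposed SQL; return rows, or None if the statement is rejected."""
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error:
        return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
conn.execute("CREATE TABLE customers_raw (name TEXT, card_number TEXT)")
conn.execute("INSERT INTO sales_summary VALUES ('EU', 120.0), ('US', 80.0)")
conn.commit()
conn.set_authorizer(read_only_authorizer)  # lock down only after setup

ok = run_generated_query(conn, "SELECT region, total FROM sales_summary WHERE region = ?", ("EU",))
blocked_write = run_generated_query(conn, "DELETE FROM sales_summary")
blocked_table = run_generated_query(conn, "SELECT name FROM customers_raw")
```

User-supplied values travel as bound parameters, and both the write and the read of the raw table are refused by the engine rather than by string matching.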

AI note - the pattern that scales safely

The durable enterprise GenAI shape is: curated/governed data → security-trimmed retrieval → model behind a serving API → validated, audited output, all over private endpoints with Content Safety on the loop. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first.

Official documentation: Azure AI Foundry, Azure OpenAI & AI Search →

13. Migration and Disaster Recovery

Getting workloads into Azure (VMs, databases, data) and keeping them recoverable - Azure Migrate, Site Recovery, Backup, Database Migration Service - plus DR patterns by tier and how RTO/RPO drive architecture and cost.

Last reviewed: July 2026 Verify supported sources for Azure Migrate/ASR/DMS in current docs.

Migration tooling Database migration DR patterns RTO / RPO Examples DR testing

TL;DR

Assess and migrate servers with Azure Migrate (VMware/Hyper-V/physical → VMs), databases with Database Migration Service (+ tooling like DMA/SSMA), and bulk data with Data Box / AzCopy. For DR, choose per tier: backup & restore (cheapest, slow), pilot light, warm standby, or hot/active-active. Azure Site Recovery replicates running workloads for region failover; Front Door / Traffic Manager handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.

Migration tooling

Move	Tooling	Notes
Assess & plan	Azure Migrate	Discovery, dependency mapping, right-sizing, and cost estimates before you move
Servers / VMs	Azure Migrate: Server Migration (agentless/agent)	VMware, Hyper-V, physical, and other-cloud VMs → Azure VMs
Databases (low downtime)	Database Migration Service (DMS) + DMA/SSMA	SQL/Postgres/MySQL to Azure PaaS; SSMA for heterogeneous (Oracle→SQL) conversions
Bulk data	Data Box / AzCopy / Storage Mover	Physical appliance for large sets; online for the rest
Files	Azure File Sync / Storage Mover	Hybrid file access + migration to Azure Files
DR replication	Azure Site Recovery (ASR)	Replicate running VMs to another region for failover

Database migration paths

Source → target	Method	Downtime
SQL Server → Managed Instance / SQL DB	DMS online (log replay), or Managed Instance link (near-real-time)	Near-zero
SQL Server → SQL on VM	Backup/restore, Always On, or ASR	Window-dependent
PostgreSQL/MySQL → Flexible Server	DMS online / native replication	Near-zero
Oracle → SQL / Postgres	SSMA / DMS (heterogeneous conversion)	Low + conversion effort
Oracle → Oracle on Azure VM / DB@Azure	Data Pump / RMAN / Data Guard	Depends on method

DBA note - heterogeneous Oracle moves are conversions

Oracle → Azure SQL/PostgreSQL is a heterogeneous migration: SSMA/DMS help with schema/code conversion and data movement, but PL/SQL, datatypes, and features need real work and testing. If the app must stay on Oracle, plan self-managed Oracle on an Azure VM (Data Pump/RMAN/Data Guard) or Oracle Database@Azure instead of a conversion. The Managed Instance link is a clean low-downtime path for SQL Server.

DR patterns

Pattern	Standby state	RTO	RPO	Cost
Backup & restore	Backups in another region; nothing running	Hours+	Since last backup	Lowest
Pilot light	Core data replicated (geo-replica / ASR); app off	Tens of minutes	Small	Low
Warm standby	Scaled-down full stack in the DR region	Minutes	Small	Medium
Hot / active-active	Both regions serving (Front Door + multi-region data)	Near-zero	Near-zero	Highest + complexity

Building blocks: Azure SQL failover groups / geo-replication or Cosmos multi-region (data); GZRS/GRS storage and object replication (objects); Azure Site Recovery for VM replication; Availability Zones for in-region HA and region pairs for DR; and Front Door / Traffic Manager for traffic failover.

Architect note - Front Door simplifies app DR

Front Door (or Traffic Manager) with backends in two regions and health-based routing fails traffic over automatically for stateless tiers on a regional outage. That removes a lot of DR plumbing for the app layer. The hard part remains the data tier: use SQL failover groups / Cosmos multi-region / ASR per your RTO/RPO, and rehearse the failover (including that the app repoints to the promoted database).

Common mistake - active-active data is hard

Stateless tiers go active-active easily behind Front Door; stateful databases generally do not without conflict handling. Most "active-active" requirements are satisfied by active/passive with fast failover (failover groups) or a natively multi-region store (Cosmos DB). Don't take on multi-master complexity unless the requirement truly demands it.

RTO and RPO

RTO - how long you can be down → drives standby readiness and automation.
RPO - how much data you can lose → drives replication mode (sync HA vs async geo-replication vs backup interval).
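Zero data loss needs synchronous replication (zone-redundant HA, Business Critical, or a sync AG) and low latency between replicas; GRS/geo-replication are async (an RPO applies). Synchronous commit adds write latency, so measure the network path and weigh that trade-off before committing to an RPO of zero.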
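How RTO/RPO targets select a pattern from the DR table can be sketched as a simple decision function. The minute thresholds below are illustrative assumptions, not Azure guidance, so replace them with your own tiering policy.

```python
def choose_dr_pattern(rto_minutes, rpo_minutes):
    """Map RTO/RPO targets to the cheapest pattern that can plausibly meet them.

    Thresholds are illustrative assumptions only.
    """
    if rto_minutes < 5 or rpo_minutes < 1:
        return "hot / active-active"
    if rto_minutes < 30:
        return "warm standby"
    if rto_minutes < 240 or rpo_minutes < 60:
        return "pilot light"  # data is replicated, so RPO stays small
    return "backup & restore"


tiers = {
    "payments": choose_dr_pattern(rto_minutes=1, rpo_minutes=0),
    "intranet": choose_dr_pattern(rto_minutes=120, rpo_minutes=15),
    "archive": choose_dr_pattern(rto_minutes=1440, rpo_minutes=1440),
}
```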

Architecture examples

Front Door fronts both regions (auto failover for stateless tiers); Azure SQL uses a failover group with a geo-secondary promoted on failover.

On-prem VM → Azure: Azure Migrate server migration.
SQL Server → Azure: DMS / Managed Instance link, cut over at low lag.
Oracle → Azure: self-managed on VM (Data Guard) or Oracle DB@Azure; SSMA if converting to SQL/Postgres.
Cross-region DR (app): Front Door + warm tier + storage replication.
Cross-region DR (database): failover groups / geo-replication / Cosmos multi-region.
Backup-based DR: Azure Backup with cross-region restore; rebuild on demand.
ASR pattern: replicate IaaS VMs to the paired region, fail over with recovery plans.

DR testing

DR you have never tested is a hope, not a plan

Run regular drills: ASR test failover (isolated, non-disruptive), failover-group failover, and full app validation in DR (not just "the DB opened"). Verify RTO/RPO are actually met, that Key Vault CMK keys exist and are usable in the DR region (a missing key makes encrypted data/replica unusable), that Front Door/Traffic Manager failover works, and that runbooks and connection strings are current. Use ASR recovery plans to codify and rehearse.

Geo-secondary / ASR replication within RPO; failover rehearsed.
CMK keys present and usable in the DR region.
App tier can start and connect in DR; config points to DR endpoints.
Front Door / Traffic Manager failover tested and time-measured.
Object data (GZRS/GRS or replicated) within RPO.
Capacity available in DR; runbook / recovery plan current.
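To make "verify RTO/RPO are actually met" concrete, a drill can record three timestamps and compare them with the targets. A minimal sketch; the function name, field names, and sample times are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(outage_start, service_restored, last_replicated_write, rto_target, rpo_target):
    """Compare the measured recovery time and data-loss window against the targets."""
    measured_rto = service_restored - outage_start
    measured_rpo = outage_start - last_replicated_write
    return {
        "measured_rto": measured_rto,
        "measured_rpo": measured_rpo,
        "rto_met": measured_rto <= rto_target,
        "rpo_met": measured_rpo <= rpo_target,
    }


result = evaluate_drill(
    outage_start=datetime(2026, 7, 1, 10, 0, 0),
    service_restored=datetime(2026, 7, 1, 10, 22, 0),
    last_replicated_write=datetime(2026, 7, 1, 9, 59, 30),
    rto_target=timedelta(minutes=15),
    rpo_target=timedelta(minutes=1),
)
```

In this sample the data loss is within target but recovery took 22 minutes against a 15-minute RTO, the kind of gap that only a timed drill reveals.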

Official documentation: Azure Migrate, Site Recovery & DR →

14. Cost Management and Governance

How Azure charges, the tools to track and cap spend (Cost Management, budgets, exports), the discount levers (Reservations, Savings Plan, Hybrid Benefit, Spot), and the governance model - ending in a monthly cost-review checklist.

Last reviewed: July 2026 Pricing and discount models change - verify all rates on the pricing pages.

Pricing basics Cost tools Discounts Governance Optimization Monthly checklist

TL;DR

Azure bills mainly by compute (VM size-hours), storage GB + transactions + redundancy, database vCore/DTU, and data egress. Track with Microsoft Cost Management + budgets + cost exports to a storage account for your own analysis; alert with budgets and block with quotas. Big levers: Reserved Instances / Savings Plan for Compute for baseline, Azure Hybrid Benefit for Windows/SQL licenses, Spot for interruptible work, right-sizing, storage lifecycle, and reducing Log Analytics ingestion. Governance = management groups + Azure Policy + budgets + tags.

Pricing basics

Dimension	Charged on	Notes
VMs	Size per hour (+ OS licensing)	RIs / Savings Plan; Hybrid Benefit; Spot; Dev/Test pricing
Managed disks	Provisioned GB + tier (+ IOPS/throughput for v2/Ultra)	Snapshots accumulate; choose SKU deliberately
Storage accounts	GB + tier + transactions + redundancy (LRS<ZRS<GRS<GZRS)	Lifecycle to cool/archive; watch egress/retrieval
Azure SQL / databases	vCore/DTU + storage + backups (LTR)	Serverless auto-pause; Hybrid Benefit; right-size tier
Networking	Egress + inter-region; Application Gateway & Azure Firewall have hourly + data costs	Firewall/App Gateway are often surprising line items - right-size and consolidate
Log Analytics	Ingestion (GB) + retention	Noisy logs get expensive - filter and set retention/basic-logs tiers

Cost note - the sneaky ones

Azure Firewall, Application Gateway, and Log Analytics ingestion are frequently underestimated. Consolidate firewalls in the hub (don't run one per spoke), right-size App Gateway (autoscale v2), and aggressively filter/route logs (exclude chatty sources, use Basic Logs / Auxiliary tiers, set retention). GRS/GZRS redundancy also roughly doubles the per-GB storage price compared with LRS, because a second copy is kept in another region.

Cost tracking tools

Tool	Does
Cost Management + cost analysis	Spend by subscription, RG, resource, service, tag, time; forecasts.
Budgets + cost alerts	Track spend against a target at MG/subscription/RG scope; alert at thresholds (and trigger automation). Budgets notify - they don't block.
Cost exports	Scheduled detailed usage to a storage account for your own BI/analysis.
Quotas	Per-subscription limits - the "block" control (e.g. cap vCPU by family/region).
Advisor	Cost (right-sizing, idle, reservation) + reliability/security/performance recommendations.
Pricing / TCO calculators	Estimate before you build.

Cost note - budgets alert, quotas enforce

Use both: a budget warns you that spend is trending over target; a quota stops a subscription from creating the expensive thing. Attribute everything via tags (enforced by Policy) + cost exports so chargeback works. Attribution only holds up if tagging is enforced from the start (section 1).
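The notify-only behavior of a budget can be sketched as a threshold check. This is an illustration, not the Cost Management API: the function name, the percentage thresholds, and the amounts are assumptions.

```python
def budget_alerts(budget, actual, forecast, thresholds_pct=(50, 80, 100)):
    """Return alert messages for crossed thresholds. Nothing is blocked: alerts only notify."""
    alerts = [
        f"actual spend reached {pct}% of budget"
        for pct in thresholds_pct
        if actual * 100 >= budget * pct
    ]
    if forecast > budget:
        alerts.append("forecast exceeds budget")
    return alerts


alerts = budget_alerts(budget=1000, actual=850, forecast=1200)
```

The function only returns messages. Stopping the spend still requires a quota, a Policy deny, or automation triggered by the alert.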

Discounts

Reserved Instances (RIs) - 1/3-year commitment to specific VM families/regions (or other services) for a big discount on steady state.
Savings Plan for Compute - a 1/3-year hourly-spend commitment that flexes across VM families/regions (more flexible than RIs, sometimes smaller discount).
Azure Hybrid Benefit - apply existing Windows Server / SQL Server licenses to Azure - often the single biggest saving for a Microsoft shop.
Spot VMs - deep discounts (often 60-90%, varying by size, region, and time) for interruptible workloads that tolerate eviction.
Dev/Test pricing - reduced rates for non-production under Dev/Test subscriptions.
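Whether a commitment pays off depends on how many hours the workload actually runs: a reservation is paid for every hour, pay-as-you-go only for hours used. A sketch of the break-even arithmetic, using made-up hourly rates; take real rates from the pricing pages.

```python
def break_even_utilization(payg_hourly, reserved_hourly):
    """Fraction of hours a VM must run for the reservation to cost no more than pay-as-you-go."""
    return reserved_hourly / payg_hourly


def cheaper_option(payg_hourly, reserved_hourly, expected_utilization):
    """Compare the average cost per clock hour for a given expected utilization (0..1)."""
    payg_cost = payg_hourly * expected_utilization  # pay only for hours used
    reserved_cost = reserved_hourly                 # pay for every hour, used or not
    return "reservation" if reserved_cost < payg_cost else "pay-as-you-go"


# Hypothetical rates: 0.20/h pay-as-you-go vs 0.12/h effective reserved rate.
threshold = break_even_utilization(0.20, 0.12)
```

At these example rates the reservation wins only if the VM runs more than about 60% of the time, which is why commitments suit the steady baseline and not dev/test machines that are shut down off-hours.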

Governance model

Governance is enforced through the same primitives as security: the management-group hierarchy and subscriptions (isolation + attribution), Azure Policy (restrict regions/SKUs, require tags, deny public exposure), budgets + quotas, resource locks, and tags - all deployed as a landing zone in code. This keeps spend controlled and attributable by design.

Cost optimization examples

Action	Typical saving	Effort
Stop / auto-shutdown non-prod VMs off-hours	High (up to ~65-70% of that compute)	Low
Right-size VMs (Advisor)	High	Low
Reserved Instances / Savings Plan for baseline	High	Medium
Azure Hybrid Benefit (Windows/SQL)	Very high (Microsoft licenses)	Low
Spot VMs for interruptible / batch	Very high (60-90%)	Medium
Choose correct disk type / delete unused disks	Medium	Low
Storage lifecycle to cool/archive	Medium-High	Low
Reduce Log Analytics ingestion (filters, tiers, retention)	Medium-High	Low
Reduce cross-region traffic / consolidate firewalls	Medium	Medium
Delete old snapshots & unused public IPs	Medium	Low
Database right-sizing / serverless auto-pause	Medium-High	Low
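The "~65-70%" figure for off-hours shutdown follows from the share of the 168-hour week a VM is left running. A small worked calculation; the schedules are examples, and the saving applies to compute only, since disks of a deallocated VM still bill.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def shutdown_saving(hours_per_day, days_per_week):
    """Fraction of compute cost avoided when a VM runs only on the given schedule."""
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


twelve_by_five = round(shutdown_saving(12, 5), 3)  # 60 of 168 hours running
ten_by_five = round(shutdown_saving(10, 5), 3)     # 50 of 168 hours running
```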

Cost note - cheap wins first

Before re-architecting: auto-shutdown non-prod, act on Advisor right-sizing, apply Azure Hybrid Benefit, buy RIs/Savings Plan for baseline, and cut Log Analytics ingestion. Idle public IPs, orphaned disks, over-sized App Gateway/Firewall, and unbounded log ingestion quietly add up - review monthly.

Monthly Azure cost review checklist

Review Cost Management month-over-month by subscription, service, and tag; investigate spikes.
Check each budget: which subscriptions/RGs/tags are over or trending over target.
Act on Advisor right-sizing and idle-resource recommendations.
Confirm non-prod auto-shutdown ran (nothing running 24x7 by accident).
Find and delete unused managed disks, orphaned snapshots, and idle VMs.
Release unassociated public IPs (Standard IPs bill when idle).
Review RI / Savings Plan coverage and utilization; buy/adjust; confirm Hybrid Benefit applied.
Log Analytics: top ingestion sources; add filters / Basic Logs; check retention.
Storage: lifecycle rules moving cold data to cool/archive; review redundancy choices.
Right-size Azure SQL / databases vs. utilization; serverless auto-pause where suitable.
Review App Gateway / Azure Firewall sizing and consolidation.
Review egress / inter-region charges; co-locate chatty services; use private access.
Confirm every resource is tagged (cost-center/env/owner) for attribution.
Validate quotas still reflect intent; check for anomalous new spend by service.
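The tagging item in the checklist is easy to automate. The sketch below assumes a resource inventory already loaded as a list of dictionaries with `name` and `tags` keys (for example, parsed from an exported JSON listing). The required tag names mirror the checklist and the sample resources are hypothetical. Tag names are compared case-insensitively.

```python
REQUIRED_TAGS = {"cost-center", "env", "owner"}


def untagged(resources, required=REQUIRED_TAGS):
    """Return {resource name: sorted missing tag names} for resources lacking required tags."""
    report = {}
    for res in resources:
        present = {key.lower() for key in (res.get("tags") or {})}
        missing = sorted(required - present)
        if missing:
            report[res["name"]] = missing
    return report


inventory = [
    {"name": "vm-app-01", "tags": {"Env": "prod", "owner": "team-a", "cost-center": "cc-42"}},
    {"name": "disk-orphan", "tags": None},
    {"name": "pip-old", "tags": {"env": "dev"}},
]
report = untagged(inventory)
```

A report like this finds gaps after the fact. An Azure Policy that requires tags at creation prevents them in the first place.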

Official documentation: Microsoft Cost Management & billing →

15. Enterprise Architecture Patterns

Reference blueprints for real Azure deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026 Blueprints are starting points - validate sizing/services against current docs and requirements.

HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: Front Door/App Gateway + WAF → private compute (App Service / VMSS / AKS) → managed database on a Private Endpoint → Private Endpoints for PaaS → centralized Log Analytics → cross-region DR, all inside a governed landing zone (management groups + Policy + hub-and-spoke).

Foundational three-tier (reference backbone)

Three-tier enterprise application

The pattern most others extend

Business case	Standard internal/external web or enterprise app needing HA and controlled exposure.
Services	Hub-and-spoke VNet, Front Door + WAF (or App Gateway), App Service/VMSS, Azure SQL (Private Endpoint), NAT/Firewall, Key Vault, Azure Monitor/Log Analytics.
Traffic flow	User → Front Door/WAF → app (private) → SQL (Private Endpoint); egress via NAT/firewall; PaaS via Private Endpoints.
Security	No public IPs on workloads; NSG/ASG; DB private; CMK; secrets in Key Vault; Azure Policy + Private Endpoints; Defender on.
HA	Zone-spread app + zone-redundant SQL; Front Door/App Gateway health probes.
DR	Second-region backends behind Front Door + SQL failover group.
Monitoring	Backend health, app + DB metrics; alerts → action groups; central Log Analytics.
Cost	App Service/VMSS right-size + RIs/Savings Plan; Hybrid Benefit; storage lifecycle.
Risks / mistakes	Probe NSG rule missing; DB public endpoint; no zone spread; secrets in app settings; Private DNS not linked.

Pattern library

Simple web application Small

Case	Low-complexity site/app, cost-sensitive.
Services	App Service + Azure SQL (or Cosmos) + Blob for assets + Front Door + WAF.
HA/DR/cost	App Service zone-redundant; SQL HA; scale rules. Risk: public DB endpoint; backups never restore-tested.

Highly available application HA

Case	Must survive zone (and ideally region) failure.
Services	Zone-spread VMSS/App Service, zone-redundant Azure SQL, Front Door, zone-redundant LB/App Gateway.
DR	Second-region backends + SQL failover group. Risk: state on a single zonal disk; untested failover.

Private enterprise application Regulated

Case	Internal-only, reachable from on-prem, no public footprint.
Services	Private subnets, internal App Gateway/LB, ExpressRoute/VPN via hub, Private Endpoints, Bastion, no public IPs.
Risk	CIDR overlap; Private DNS not linked; transitive-peering assumption.

Hub-and-spoke / centralized networking & security & logging Platform

Case	Many subscriptions with centrally-governed network, security, and logging.
Services	Connectivity subscription (hub VNet, Azure Firewall, gateways, DNS, Bastion), management subscription (Log Analytics, Automation, Backup), Defender + Sentinel, MG-level Policy.
Risk	Firewall per spoke (cost); shadow VNets; missing UDR return routes.

Multi-subscription landing zone Governance

Case	Governed foundation before workloads land.
Services	Management-group hierarchy, platform + landing-zone subscriptions, baseline RBAC (groups) + PIM, preventive Azure Policy, hub network, central logging + Defender, budgets/quotas, locks - all IaC (CAF ALZ).
Risk	Skipping it and retrofitting governance later.

SQL Server enterprise workload SQL

Case	Lift a SQL Server estate to managed.
Services	SQL Managed Instance (Business Critical + zone redundancy), delegated subnet/Private Endpoint, failover group, Hybrid Benefit, Defender for SQL.
Risk	Unsupported instance feature; public endpoint; untested failover.

Oracle workload on Azure Oracle

Case	Run Oracle on Azure.
Services	Self-managed Oracle on E/M VMs + Premium SSD v2/Ultra or NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU/Dedicated Host for licensing; or Oracle Database@Azure for managed Oracle.
Risk	Licensing/counting on Azure; storage IOPS undersized; verify DB@Azure availability.

Azure SQL application PaaS DB

Case	New relational app backend.
Services	Azure SQL Database (vCore, zone-redundant), Private Endpoint, Entra auth + managed identity, PITR + LTR, failover group for DR, Query Performance Insight.
Risk	Public endpoint; HA not enabled; instance features expected (use MI instead).

Data warehouse / data lake Data

Case	Enterprise analytics on curated + raw data.
Services	ADLS Gen2 / OneLake (bronze/silver/gold) + Fabric/Synapse/Databricks + Data Factory + Event Hubs + Purview + Power BI; Private Endpoints.
Cost/risk	Capacity/query controls; column-level security. Risk: ungoverned "data swamp"; runaway compute.

Kubernetes platform Cloud native

Case	Container platform for many microservices with CI/CD.
Services	Private AKS in a spoke VNet, AGIC/Gateway ingress, Workload Identity, ACR (scanned) + Defender for Containers, Azure Policy for AKS, pipeline (GitHub Actions/Azure DevOps).
Risk	Pod IP exhaustion (CNI); over-privileged Workload Identity; public API server.

Serverless application Serverless

Case	Event-driven / stateless services with minimal ops.
Services	Container Apps / Functions + Front Door + Event Grid/Service Bus + Cosmos/SQL + Key Vault; VNet integration to private data.
Cost/risk	Scale-to-zero. Risk: cold starts for spiky critical paths (min replicas/Premium plan).

Event-driven architecture Events

Case	Decoupled, resilient pipelines.
Services	Event Grid + Functions/Container Apps + Service Bus + Event Hubs; dead-letter queues.
Risk	Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers.
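The poison-message risk can be illustrated with a toy queue that dead-letters a message after a maximum number of failed deliveries. This mimics the general behavior of a broker's max-delivery-count setting and is not Service Bus SDK code; the function and message names are hypothetical.

```python
def drain(queue, handler, max_deliveries=3):
    """Process messages; dead-letter a message after max_deliveries failed attempts."""
    processed, dead_letter = [], []
    pending = [(msg, 0) for msg in queue]
    while pending:
        msg, attempts = pending.pop(0)
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            attempts += 1
            if attempts >= max_deliveries:
                dead_letter.append(msg)  # park it for inspection instead of retrying forever
            else:
                pending.append((msg, attempts))  # redeliver later
    return processed, dead_letter


def handler(msg):
    if msg == "poison":
        raise ValueError("cannot parse message")


processed, dead_letter = drain(["a", "poison", "b"], handler)
```

Without the dead-letter branch the poison message would be retried indefinitely and could block or slow the healthy messages behind it.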

Hybrid cloud Hybrid

Case	Workloads split across on-prem and Azure.
Services	ExpressRoute (primary) + VPN (backup) via hub, hub-and-spoke / Virtual WAN, hybrid DNS (Private Resolver), Azure Arc for on-prem management.
Risk	CIDR overlap; transitive-peering assumption; single link; asymmetric routing.
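CIDR overlap is cheap to detect before any gateway or peering is built. A minimal sketch using Python's standard `ipaddress` module; the network names and ranges are made-up examples.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of network names whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


planned = {
    "onprem": "10.0.0.0/16",
    "hub": "10.1.0.0/16",
    "spoke1": "10.0.4.0/24",  # falls inside the on-prem range
}
overlaps = find_overlaps(planned)
```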

Multi-region DR DR

Case	Business-critical stack needing regional resilience.
Services	Front Door with multi-region backends, SQL failover group / Cosmos multi-region, GZRS storage, Azure Site Recovery, reservations.
Risk	Untested DR; CMK key missing in DR region; capacity unavailable at failover.

Secure landing zone Security

Case	Preventive-guardrail foundation.
Services	Azure Policy (deny public IP/blob, location/SKU restrictions, require diagnostics), PIM + Conditional Access, hub firewall, Private Link strategy, central Defender + Sentinel + logging, Key Vault, budgets/quotas - as code.
Risk	Guardrails off; over-broad break-glass; standing Owner.

GenAI with private enterprise data AI

Case	RAG/assistant over internal data, governed.
Services	Blob/ADLS + Azure AI Search + Azure OpenAI behind an App Service/Function serving API + Key Vault + Private Endpoints + Content Safety + Log Analytics.
Flow / risk	Query → serving layer (authz + guardrails) → security-trimmed retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings).

Common mistakes across all patterns

Databases/services on public endpoints "to get it working"; Private DNS zone not linked.
No zone/region spread - a zone event takes the whole "HA" tier.
Health-probe NSG rule (and App Gateway management ports) forgotten - unhealthy backends on day one.
DR designed but never tested; CMK keys missing in the DR region.
Secrets in app settings/code instead of Key Vault; standing Owner instead of PIM.
No centralized logging/Defender until an incident needs it.
CIDR overlap / transitive-peering assumption discovered during hybrid setup.
Landing zone / Azure Policy skipped and retrofitted painfully later.

Official documentation: Azure Architecture Center reference architectures →

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with portal path, CLI, and PowerShell where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.

Last reviewed: July 2026. Verify CLI/PowerShell syntax with az <group> --help / Get-Help.

General method

Work top-down: identity/API (right tenant/subscription? role/scope? PIM active? provider registered?) → network (Effective routes + IP Flow Verify + DNS) → host/service (listening/healthy?) → data. For "cannot reach," use Network Watcher (IP Flow Verify / Connection Troubleshoot); for "access denied," use IAM > Check access and the section 2 model. For evidence, the Activity Log records control-plane operations (who changed what, and when), while resource and guest logs reach Log Analytics only where diagnostic settings or the Azure Monitor Agent send them.

Compute & access

⚑ VM not reachable / SSH / RDP / Bastion issue

Causes: NSG blocking 22/3389 from your source; VM has no public IP and you're not using Bastion; Bastion subnet (AzureBastionSubnet) missing/misconfigured or RBAC missing; VM stopped/boot failed; OS firewall; wrong credentials. Checks: Effective security rules + IP Flow Verify; boot diagnostics/serial console; Bastion deployment. Fix: use Bastion (no public IP needed), allow the source in the NSG, reset password/SSH via "Reset password"/Run Command. Prevention: standardize Bastion + no public IPs.

# RG, VM, IP (the VM's private IP) and SRC (your client IP) are placeholders
az network watcher test-ip-flow -g RG --vm VM --direction Inbound --protocol TCP --local IP:22 --remote SRC:0
az vm boot-diagnostics get-boot-log -g RG -n VM

⚑ VM boot issue

Causes: bad fstab mount, full OS disk, driver/kernel issue, failed extension. Checks: boot diagnostics screenshot + serial console. Fix: use the VM "Repair" (attach OS disk to a rescue VM) to fix config; keep disk snapshots. Prevention: test image/extension changes in non-prod.

⚑ High CPU / memory pressure / disk full

CPU: check the Azure Monitor trend, confirm inside the guest (top on Linux, Task Manager on Windows), then right-size or autoscale (VMSS). Memory: requires the Azure Monitor Agent (guest memory not collected by default) - deploy via VM Insights, then right-size. Disk full: expand the managed disk, then grow the partition/filesystem inside the OS; alert at 85%.

⚑ Managed disk attachment issue

Causes: disk in a different region/zone than the VM; not initialized; LUN mapping; disk still attached elsewhere. Checks: disk state + LUN; lsblk/Disk Management. Fix: attach in the same region/zone, initialize/mount, add to fstab by UUID. Prevention: automate via extension/cloud-init.

Storage

⚑ Storage account access denied / Blob public access issue

Denied: missing RBAC data role (needs Storage Blob Data Reader/Contributor - control-plane roles like Contributor don't grant data access); wrong or expired SAS; storage firewall blocking the source; "allow shared key" disabled but code uses a key; VNet/Private Endpoint restriction; resource provider not registered. Public access: account-level "allow blob public access" is off (correct) - use SAS/RBAC instead. Checks: IAM data roles; storage firewall; az storage blob list --account-name <account> -c <container> --auth-mode login. Fix: grant the data role to the managed identity; use Private Endpoint + Entra auth.

⚑ Azure Files mount issue

Causes: port 445 blocked (ISP/NSG) for SMB; wrong credentials (storage key or Entra/AD Kerberos); Private Endpoint DNS not resolving; NFS export/rules wrong. Fix: use Private Endpoint (avoids 445-over-internet), configure identity-based auth, verify Private DNS, check firewall.

Network

⚑ DNS / NSG / route / Azure Firewall / NAT / Private Endpoint / peering / Private DNS

Method: IP Flow Verify (which NSG rule decides), Effective routes/rules on the NIC (real next hop + merged rules), Connection Troubleshoot, and nslookup. Per case:

NSG: wrong priority (lowest number is evaluated first, first match wins) or direction; default deny-inbound with no matching allow.
Route (UDR): forced-tunnel to firewall with no return route (black hole); missing route.
Azure Firewall: app/network rule collection missing the FQDN/port; DNAT config; UDR not pointing at it.
NAT Gateway: outbound only; SNAT port exhaustion; not associated with the subnet.
Private Endpoint: Private DNS zone not linked / no A record - FQDN resolves to public IP.
Peering: not transitive; address-space overlap; missing "allow forwarded traffic"/gateway transit.
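
Most NSG surprises come from evaluation order. The sketch below is a simplified Python model, not Azure's implementation: rules are sorted by priority, the first match decides, and only the final default rule (DenyAllInBound at 65500) is modeled. Service tags, protocols, and the separate subnet-level and NIC-level NSGs are left out, and the rule names and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    name: str
    source: str            # CIDR prefix; "0.0.0.0/0" stands in for Any
    port: Optional[int]    # destination port; None stands in for Any
    access: str            # "Allow" or "Deny"

# Simplified stand-in for the default inbound rules: only the final
# DenyAllInBound (65500) is modeled, service-tag rules are left out.
DEFAULT_INBOUND = [Rule(65500, "DenyAllInBound", "0.0.0.0/0", None, "Deny")]

def evaluate_inbound(rules, src_ip, dst_port):
    """Return (access, rule name) for the first matching rule in priority order."""
    src = ipaddress.ip_address(src_ip)
    for rule in sorted(list(rules) + DEFAULT_INBOUND, key=lambda r: r.priority):
        source_matches = src in ipaddress.ip_network(rule.source)
        port_matches = rule.port is None or rule.port == dst_port
        if source_matches and port_matches:
            return rule.access, rule.name
    raise RuntimeError("unreachable: DenyAllInBound matches all traffic")

# Hypothetical custom rules, listed out of order on purpose.
rules = [
    Rule(200, "deny-ssh-any", "0.0.0.0/0", 22, "Deny"),
    Rule(100, "allow-ssh-admin", "203.0.113.0/24", 22, "Allow"),
    Rule(300, "allow-https", "0.0.0.0/0", 443, "Allow"),
]

for src, port in [("203.0.113.7", 22), ("198.51.100.9", 22), ("198.51.100.9", 8080)]:
    print(src, port, evaluate_inbound(rules, src, port))
```

The second lookup is denied by deny-ssh-any rather than by the default rule. That is the kind of answer IP Flow Verify gives you: the name of the rule that decided, not just allow or deny.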

⚑ VPN down / ExpressRoute issue

Causes: IKE/PSK mismatch (VPN); BGP not advertising/learning; ExpressRoute circuit/peering down or route filters; gateway SKU bandwidth exhausted; CIDR overlap. Checks: connection/circuit status; BGP routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU.

Load balancing & databases

⚑ App Gateway backend unhealthy / LB probe failure / SSL cert

Unhealthy: NSG blocks the health probe or App Gateway management ports (GatewayManager service tag) on the App Gateway subnet; wrong probe host/path/port/protocol; app listening only on localhost; backend HTTP settings mismatch. LB: probe port not open, or the NSG does not allow the AzureLoadBalancer service tag; Standard LB backends have no default outbound path, so add an outbound rule or NAT Gateway. SSL: managed cert needs DNS → frontend first; Key Vault access for the gateway identity; chain/SAN. Fix: per section 7.

⚑ Azure SQL connection / performance / backup issue

Connection: firewall/public access disabled but no Private Endpoint path; Private DNS not resolving; Entra token/permission; wrong connection string (server FQDN); transient errors (retry logic). Performance: Query Performance Insight / Query Store; DTU/vCore/log-IO limits; missing indexes; enable automatic tuning. Backup: PITR/LTR retention config; test a restore. Checks: az sql db show; audit/diagnostic logs.
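
The "transient errors (retry logic)" point usually means capped exponential backoff with jitter around the connection or query call. A minimal Python sketch of that pattern follows; TransientError is a hypothetical stand-in for whatever your database driver raises for transient faults, and the delay values are illustrative.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a driver error classified as transient."""

def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(); on TransientError retry with capped exponential backoff and full jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)  # full jitter: wait between 0 and ceiling seconds

def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}
    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("connection dropped")
        return "ok"
    return op

# Record the waits instead of sleeping so the example runs instantly.
waits = []
result = call_with_retry(make_flaky(2), sleep=waits.append, rng=lambda: 1.0)
print(result, waits)
```

Only errors you have positively classified as transient should be retried; retrying a login failure or a syntax error just delays the real diagnosis.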

Identity

⚑ RBAC permission denied / managed identity / app secret expired

Denied: walk the section 2 model - right tenant/subscription? which principal? role + control vs data plane? scope/inheritance? deny assignment? Azure Policy? Conditional Access? PIM eligible but not activated? Managed identity: is it assigned to the resource, and does it have the target role (data role too)? propagation delay after assignment. App secret expired: a service principal's client secret/cert expired - rotate it (and move to a managed identity / workload identity federation to avoid recurrence). Tools: IAM > Check access; Entra sign-in logs (Conditional Access); Activity Log.
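
Expired app secrets are predictable, so they can be flagged ahead of time. The Python sketch below assumes you have exported an application's credentials as a list of objects with displayName and endDateTime fields (the shape is an assumption modeled on Microsoft Graph password credentials, with whole-second UTC timestamps); the names and dates are invented.

```python
from datetime import datetime, timedelta, timezone

def expiring(credentials, now, within_days=30):
    """Return names of credentials that are expired or expire within `within_days`."""
    cutoff = now + timedelta(days=within_days)
    flagged = []
    for cred in credentials:
        # Normalize the trailing "Z" so older Python versions can parse it.
        end = datetime.fromisoformat(cred["endDateTime"].replace("Z", "+00:00"))
        if end <= cutoff:
            flagged.append(cred["displayName"])
    return flagged

# Hypothetical exported credentials.
creds = [
    {"displayName": "ci-deploy", "endDateTime": "2026-07-20T00:00:00Z"},
    {"displayName": "legacy-batch", "endDateTime": "2026-06-01T00:00:00Z"},
    {"displayName": "reporting", "endDateTime": "2027-01-15T00:00:00Z"},
]
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
print(expiring(creds, now))
```

A scheduled check like this buys time to rotate, but the durable fix is the one named above: move the workload to a managed identity or workload identity federation so there is no secret to expire.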

Serverless & AKS

⚑ Function timeout / trigger issue / Container App revision

Function timeout: the Consumption plan caps execution time (5 minutes by default, 10 at most) - use Premium/Flex, make the function idempotent, offload long work. Trigger not firing: check the binding connection (managed identity / connection string), the source (queue/blob/Event Grid subscription + filter), and function logs. Container Apps: new revision failing - container must listen on the target port, check ingress + scale rules (min replicas) + managed identity RBAC; roll back or split traffic.
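
"Make idempotent" matters because a timed-out or retried invocation delivers the same message again. A minimal Python sketch of the idea, independent of the Functions runtime: the handler skips message IDs it has already recorded. The in-memory set is a stand-in for a durable store, and the message shape is an assumption.

```python
def make_handler(processed_ids, side_effect):
    """Build a handler that skips messages whose ID was already recorded."""
    def handle(message):
        msg_id = message["id"]
        if msg_id in processed_ids:
            return "skipped"          # duplicate delivery or retry after a timeout
        side_effect(message["body"])
        processed_ids.add(msg_id)     # record only after the work succeeded
        return "processed"
    return handle

seen = set()       # in-memory stand-in; a real function needs a durable store
applied = []       # collects the side effects so they can be inspected
handle = make_handler(seen, applied.append)

first = handle({"id": "m-1", "body": "charge order 42"})
second = handle({"id": "m-1", "body": "charge order 42"})
print(first, second, applied)
```

The side effect and the bookkeeping are not atomic here, so a crash between the two lines can still replay a message; where that matters, put both in one transaction or make the side effect itself safe to repeat.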

⚑ AKS pod not starting

Causes: Pending (capacity / pod-IP exhaustion with Azure CNI / requests too big), ImagePullBackOff (grant AcrPull to the kubelet identity; private ACR needs endpoint/DNS), CrashLoopBackOff (config/probes). Tools: kubectl describe/logs, node capacity, subnet IPs. (Section 10.)
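
Pod-IP exhaustion is arithmetic you can do before the cluster exists. The Python sketch below assumes the Azure CNI mode in which every pod takes an address from the node subnet, so each node needs one address for itself plus one per potential pod, with one extra node's worth for upgrade surge. Overlay and dynamic-allocation modes size differently, and the node and pod counts are illustrative.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # Azure reserves five addresses in every subnet

def ips_required(nodes, max_pods, surge_nodes=1):
    """Addresses needed: one per node plus one per potential pod, including surge nodes."""
    return (nodes + surge_nodes) * (max_pods + 1)

def usable_ips(cidr):
    """Addresses a subnet can actually hand out after Azure's reservations."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET

def fits(cidr, nodes, max_pods, surge_nodes=1):
    """True if the subnet can hold the cluster at full pod density."""
    return ips_required(nodes, max_pods, surge_nodes) <= usable_ips(cidr)

print(ips_required(nodes=10, max_pods=30))
print(usable_ips("10.10.1.0/24"))
print(fits("10.10.1.0/24", nodes=10, max_pods=30))
print(fits("10.10.0.0/23", nodes=10, max_pods=30))
```

Ten nodes at 30 pods each do not fit in a /24 under this model, which is why pods sit in Pending long before the node count looks large.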

Monitoring

⚑ Azure Monitor alert not firing / diagnostic logs missing

Alert: wrong signal/scope/threshold, evaluation window never met, action group has no verified receiver, alert disabled, or maintenance suppression. Test by forcing the condition; check the alert's history. Logs missing: diagnostic settings not enabled on the resource, Azure Monitor Agent / DCR not deployed on the VM, wrong Log Analytics workspace, or ingestion delay/retention expired. Fix: enable diagnostic settings (enforce via Policy), deploy the agent via VM Insights, verify the workspace.
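
"Evaluation window never met" is easier to see with numbers. The Python sketch below is a simplified model of an "Average greater than threshold" rule over a 5-minute window of one-minute samples, like the CPU alert in section 17; it is not Azure Monitor's evaluation engine, and treating an incomplete window as "do not fire" is an assumption.

```python
from statistics import mean

def alert_fires(samples, threshold=85.0, window=5):
    """Evaluate an 'Average > threshold' rule over the last `window` samples."""
    if len(samples) < window:
        return False  # assumption: an incomplete window does not fire
    return mean(samples[-window:]) > threshold

spike = [40, 42, 99, 41, 40]        # one hot minute, average 52.4
sustained = [88, 90, 86, 91, 89]    # consistently hot, average 88.8

print(alert_fires(spike), alert_fires(sustained))
```

A one-minute spike to 99% never fires this rule because the window average stays low. If short spikes matter, use Maximum as the aggregation or a shorter window instead of lowering the threshold.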

Official documentation: Network Watcher & Azure troubleshooting →

17. Azure CLI, PowerShell, Bicep, ARM & Terraform

Practical, copy-friendly automation: CLI/PowerShell setup and tenant/subscription selection, managed identity, Bicep vs ARM vs Terraform, clean examples for VNet, VM, storage, RBAC, and alerts - plus state, structure, and CI/CD practices.

Last reviewed: July 2026. Verify provider/API versions and resource argument names in current docs.

TL;DR

Two CLIs: Azure CLI (az) and Azure PowerShell (Az) - pick a house standard. Always set the right tenant + subscription before acting. Prefer managed identity / federated credentials over service-principal secrets. For IaC, use Bicep (Azure-native, readable) or Terraform (azurerm, multicloud) - both beat hand-written ARM JSON. Keep Terraform state remote and locked (a storage account with blob lease), structure into modules, separate environments, and deploy via a pipeline using a federated (secretless) identity.

Azure CLI and PowerShell setup

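# Azure CLI
az login                                   # interactive; add --tenant <tenantId> if you belong to several tenants
az account list -o table                   # subscriptions you can see
az account set --subscription "Prod-01"    # select subscription (name or ID)
az account show                            # confirm tenant + subscription

# Azure PowerShell
Connect-AzAccount                          # add -Tenant <tenantId> to pin the tenant
Set-AzContext -Subscription "Prod-01"
Get-AzContext                              # confirm tenant + subscription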

Auth & managed identity (no secrets)

# On an Azure resource with a managed identity, tools authenticate automatically:
az login --identity                          # system-assigned MI
az login --identity --client-id <clientId>   # user-assigned MI (older CLI versions: --username <clientId>)

# CI/CD: use OIDC workload identity federation (no client secret)
# - create an app registration + federated credential for the pipeline
# - the pipeline exchanges its OIDC token for an Azure token; nothing to store

Security note

Prefer managed identities for Azure-hosted automation and workload identity federation (OIDC) for external CI (GitHub Actions/Azure DevOps) - both avoid long-lived service-principal secrets. If a secret is unavoidable, store it in Key Vault, scope the SP minimally, and rotate it.

Common commands

# Resource groups & providers
az group create -n rg-app-prod -l eastus2
az provider register --namespace Microsoft.Sql

# Compute
az vm list -o table
az vm create -g rg-app-prod -n web1 --image Ubuntu2204 --public-ip-address "" --size Standard_D2s_v5

# Storage
az storage account create -g rg-app-prod -n stappprod01 --sku Standard_ZRS --allow-blob-public-access false
az storage blob upload-batch -d container -s ./data --account-name stappprod01 --auth-mode login

# RBAC
az role assignment create --assignee-object-id <objId> --role "Storage Blob Data Reader" --scope <resourceId>
az role assignment list --assignee <objId> --all -o table

# Logs (KQL)
az monitor log-analytics query -w <workspaceId> --analytics-query "AzureActivity | take 20"

Bicep vs ARM vs Terraform

	Bicep	ARM templates	Terraform
Language	Concise DSL → compiles to ARM	Verbose JSON	HCL (`azurerm` provider)
Best for	Azure-only, native, day-1 feature parity	Legacy; avoid authoring by hand	Multicloud / standardized IaC, rich ecosystem
State	None (ARM tracks deployments)	None	You manage remote state

Architect note

Both Bicep and Terraform are valid - pick one house standard. Bicep if you're Azure-only and want native, immediate feature support with no state to manage. Terraform if you standardize IaC across clouds or want its module ecosystem. Don't author raw ARM JSON by hand; if you have ARM, decompile to Bicep.

Create a VNet + subnet + NSG

// Bicep: network.bicep
resource vnet 'Microsoft.Network/virtualNetworks@2023-11-01' = {
  name: 'vnet-app'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.10.0.0/20' ] }
    subnets: [
      {
        name: 'snet-app'
        properties: {
          addressPrefix: '10.10.1.0/24'
          networkSecurityGroup: { id: nsg.id }
          privateEndpointNetworkPolicies: 'Disabled'
        }
      }
    ]
  }
}
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-app'
  location: resourceGroup().location
}

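# Terraform: network.tf (assumes azurerm_resource_group.app is defined elsewhere)
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  address_space       = ["10.10.0.0/20"]
}
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.app.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
}
resource "azurerm_network_security_group" "app" {
  name                = "nsg-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
}
# In azurerm the NSG is attached to the subnet through a separate association resource
resource "azurerm_subnet_network_security_group_association" "app" {
  subnet_id                 = azurerm_subnet.app.id
  network_security_group_id = azurerm_network_security_group.app.id
}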

Create a VM (no public IP)

# Terraform: a Linux VM with a system-assigned managed identity, no public IP (NIC defined elsewhere)
resource "azurerm_linux_virtual_machine" "app" {
  name                  = "vm-app-1"
  resource_group_name   = azurerm_resource_group.app.name
  location              = azurerm_resource_group.app.location
  size                  = "Standard_D2s_v5"
  admin_username        = "azureuser"
  network_interface_ids = [azurerm_network_interface.app.id]
  identity { type = "SystemAssigned" }
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}

Create a hardened storage account

resource "azurerm_storage_account" "data" {
  name                            = "stappdataprod01"
  resource_group_name             = azurerm_resource_group.app.name
  location                        = azurerm_resource_group.app.location
  account_tier                    = "Standard"
  account_replication_type        = "ZRS"
  allow_nested_items_to_be_public = false        # no public blob access
  shared_access_key_enabled       = false        # force Entra auth (the azurerm provider then needs storage_use_azuread = true)
  min_tls_version                 = "TLS1_2"
  blob_properties {
    versioning_enabled = true
    delete_retention_policy { days = 30 }
  }
}

Create an RBAC assignment

# Least-privilege: data role on one storage account, to a GROUP
resource "azurerm_role_assignment" "blob_read" {
  scope                = azurerm_storage_account.data.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azuread_group.app_team.object_id
}

Create a metric alert

resource "azurerm_monitor_action_group" "ops" {
  name                = "ops-email"
  resource_group_name = azurerm_resource_group.app.name
  short_name          = "ops"
  email_receiver {
    name          = "oncall"
    email_address = "oncall@example.com"
  }
}
resource "azurerm_monitor_metric_alert" "cpu" {
  name                = "vm-cpu-high"
  resource_group_name = azurerm_resource_group.app.name
  scopes              = [azurerm_linux_virtual_machine.app.id]
  criteria {
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }
  window_size = "PT5M"
  frequency   = "PT1M"
  action { action_group_id = azurerm_monitor_action_group.ops.id }
}

State, structure, and CI/CD

Remote, locked state (Terraform): an Azure storage account backend (blob) with blob leasing for locking and versioning on. Never keep prod state on a laptop; never commit state (it holds secrets).
Modular structure: reusable modules (network, compute, data, rbac, monitoring) composed per environment.
Environment separation: separate state per env (workspaces or separate backends/keys) driven by dev.tfvars / prod.tfvars; separate subscriptions and ideally separate deployer identities/pipelines.
CI/CD: run plan/apply (or Bicep what-if/deploy) in GitHub Actions or Azure DevOps using OIDC workload identity federation (no secret). Gate apply with approvals; run plan/what-if on PRs.
No secrets in code: reference Key Vault by name; keep secret tfvars out of git.

azure-infra/
  modules/
    network/   compute/   data/   rbac/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  .github/workflows/  or  azure-pipelines.yml
  README.md

Architect note

Prod and DR should be provably identical because they come from the same modules with different variables. Manual portal changes in prod are the enemy of a working DR - enforce "infrastructure changes go through IaC," run plan/what-if in CI on every PR, and use Azure Policy + Change Tracking / drift detection to catch out-of-band changes.

Official documentation: Azure CLI, Bicep & Terraform on Azure →

18. Azure Well-Architected Framework

The five pillars Microsoft uses to review a workload - Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.

Last reviewed: July 2026. Verify against the current Well-Architected Framework docs.

HOW TO USE THIS

Run a workload (or a design) through all five pillars. For each, ask the checklist questions, map to concrete Azure services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass. (Microsoft's WAF review + the Advisor score are good companions.)

Reliability

What it means: the workload meets its availability and recovery targets and withstands failures - built around zones, redundancy, health modeling, and tested DR.

Why it matters: reliability targets (SLOs) drive architecture and cost; you can't bolt on availability after an outage.

Supporting services: Availability Zones, zone-redundant VMSS/App Service/LB, Azure SQL failover groups / zone redundancy, Cosmos multi-region, Azure Site Recovery, Backup, Front Door, Azure Monitor (health/SLOs).

Practical examples: zone-spread every tier; a defined + tested DR pattern per tier with RTO/RPO; health probes + autohealing; graceful degradation and retries with backoff; capacity planning; chaos/failure testing.

Common mistakes

Single-zone "prod"; DR never tested; CMKs missing in DR; no retry/timeout handling for transient faults; no health model.

Review checklist

Is every critical tier zone-redundant? Defined + tested DR per tier with RTO/RPO? Health probes + autohealing? Transient-fault handling (retries/circuit breakers)? Backups immutable and restore-tested? Capacity for failover?
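
Reliability targets are easier to argue about with the arithmetic written down. The Python sketch below uses the standard formulas: tiers in series multiply, and independent redundant copies fail only if all of them fail. The availability figures are illustrative, not published SLAs, and the redundancy formula assumes independent failures and a failover mechanism that itself works.

```python
from math import prod

def serial(*availabilities):
    """All components must be up, so availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """The service is down only if every independent copy is down."""
    return 1 - (1 - availability) ** copies

def downtime_minutes_per_month(availability, minutes=30 * 24 * 60):
    """Expected downtime in a 30-day month."""
    return (1 - availability) * minutes

# Illustrative figures only: front end, app tier, database in one region.
single_region = serial(0.9995, 0.9999, 0.9999)
two_regions = redundant(single_region, 2)

print(f"single region: {single_region:.6f}, "
      f"{downtime_minutes_per_month(single_region):.1f} min/month")
print(f"two regions:   {two_regions:.8f}, "
      f"{downtime_minutes_per_month(two_regions):.3f} min/month")
```

The two-region number is only as real as the last failover test. Untested DR, missing keys in the second region, or no capacity at failover all break the independence assumption behind the formula.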

Security

What it means: protect identities, data, and workloads and meet compliance - Zero Trust, least privilege, encryption, and defense in depth.

Why it matters: a single over-broad role, public storage account, or long-lived secret can undo everything else. Security is a design property.

Supporting services: Entra ID + Conditional Access + PIM, Azure RBAC, managed identities, Azure Policy, Key Vault/Managed HSM, Private Link, Azure Firewall/WAF/DDoS, Defender for Cloud, Sentinel.

Practical examples: no standing Owner (PIM); MFA everywhere; managed identities (no secrets); preventive Policy; Private Endpoints; CMK; centralized logs; Defender on. (See section 8's checklist.)

Common mistakes

Owner everywhere / no PIM; weak Conditional Access; public storage/DB; secrets in code; no Private Endpoints for sensitive services; Policy off; logs not centralized.

Review checklist

Least privilege via groups/RBAC + PIM? MFA + Conditional Access? Managed identities, minimal secrets? Preventive Azure Policy on? Private Endpoints for sensitive PaaS? CMK + Key Vault purge protection? Defender + centralized logs? DR keys cross-region?

Cost Optimization

What it means: deliver required value at the lowest sustainable cost - right-sizing, commitments, eliminating waste, and attributing spend.

Why it matters: unmanaged Azure spend grows silently; cost is a first-class design and operational concern.

Supporting services: Cost Management + budgets + exports, quotas, Advisor, Reservations, Savings Plan, Azure Hybrid Benefit, Spot, storage lifecycle, Log Analytics tiers.

Practical examples: tags + exports for attribution; RIs/Savings Plan for baseline; Hybrid Benefit; auto-shutdown non-prod; storage lifecycle; Log Analytics filtering; monthly review (section 14).

Common mistakes

No tags/attribution; over-provisioned VMs/disks; on-demand for steady-state; unfiltered Log Analytics; over-sized App Gateway/Firewall; idle IPs and orphaned disks.

Review checklist

Spend attributed via tags + exports? Budgets + quotas? RI/Savings Plan + Hybrid Benefit applied? Right-sizing acted on? Storage lifecycle + Log Analytics tiers? A recurring cost review?
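
"On-demand for steady-state" versus a commitment comes down to a break-even utilization. The Python sketch below shows the arithmetic; the hourly prices are hypothetical placeholders, and it models a commitment as a flat effective hourly rate paid for every hour of the month, which is a simplification of how reservations and savings plans are billed.

```python
def breakeven_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of hours a VM must run for the commitment to cost less than on-demand."""
    return reserved_hourly / on_demand_hourly

def monthly_cost(hours_used, on_demand_hourly, reserved_hourly=None, hours_in_month=730):
    """Monthly cost on-demand (pay per hour used) or committed (pay for every hour)."""
    if reserved_hourly is None:
        return hours_used * on_demand_hourly
    return hours_in_month * reserved_hourly

# Hypothetical prices for illustration; look up real rates before deciding.
on_demand, reserved = 0.10, 0.06

print(f"break-even utilization: {breakeven_utilization(on_demand, reserved):.0%}")
for label, hours in [("dev box, 200 h", 200), ("steady prod, 730 h", 730)]:
    od = monthly_cost(hours, on_demand)
    ri = monthly_cost(hours, on_demand, reserved)
    print(f"{label}: on-demand {od:.2f}, committed {ri:.2f}")
```

With these placeholder prices the commitment wins only above 60% utilization, so the steady production VM should be covered and the part-time dev box should be shut down on a schedule instead.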

Operational Excellence

What it means: run and improve the workload reliably and repeatably - automation, observability, incident response, and safe change.

Why it matters: most outages come from change and from not seeing problems early; operational maturity turns a good design into a dependable service.

Supporting services: Bicep/Terraform + pipelines (GitHub Actions/Azure DevOps), Azure Monitor/Log Analytics/App Insights, Update Manager, Change Tracking, Automation, Azure Policy (drift), Deployment stacks.

Practical examples: everything as code with peer review; SLOs + alerts on symptoms; centralized logs; automated patching; runbooks tied to alerts; blameless post-mortems; progressive delivery (slots/rings).

Common mistakes

Manual portal changes in prod; alerting on causes / alert fatigue; no SLOs; observability added after an incident; no defined incident process.

Review checklist

All infra in code with review? SLOs + symptom-based alerts? Centralized logs + Activity Log? Automated patching? Runbooks current + rehearsed? Safe deployment (slots/canary) + rollback?

Performance Efficiency

What it means: meet performance requirements efficiently as demand changes - right SKUs, autoscaling, caching, data locality, and query design.

Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.

Supporting services: VM series (F/E/M/L), autoscaling (VMSS/App Service/AKS HPA), Front Door + CDN, Azure Cache for Redis, Premium SSD v2/Ultra, Azure SQL tiers/read replicas, App Insights.

Practical examples: match VM series to the bottleneck; autoscale on the right signal; cache at the edge (Front Door/CDN) and in-memory (Redis); co-locate data + compute (proximity placement groups) to cut latency/egress; tune database indexes/queries; load-test before launch.

Common mistakes

Wrong VM series (CPU-bound on general purpose); no autoscaling / scaling on the wrong metric; disk IOPS ceiling mistaken for CPU; chatty cross-region calls; untuned database queries.

Review checklist

VM series matched to workload? Autoscaling on a meaningful signal? Caching (Front Door/Redis) where it helps? Data co-located with compute? Storage/DB performance sized (Premium SSD v2/Ultra, right DB tier)? Load-tested?

Official documentation: Azure Well-Architected Framework →

19. Learning Path

A structured route from Azure fundamentals to enterprise-grade architecture, security, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026. Certification names/exam details change - verify on Microsoft Learn before scheduling.

Beginner

Fundamentals: hierarchy, Entra, RBAC, VNet, VM, Storage, Monitor

Intermediate

LB/App Gateway, private networking, Azure SQL, Key Vault, Defender, Policy, cost

Advanced

Management groups, landing zones, hub-spoke, Firewall, Private Link, PIM, AKS, Synapse/Fabric, OpenAI, DR, Terraform

How to use this

Do the labs, don't just read. Use a free trial / Azure free account for hands-on. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (AZ-900 → AZ-104 → AZ-305, plus specialties) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations

Goal: deploy and connect basic Azure resources confidently

What to learn

Fundamentals: regions/zones, the governance hierarchy (management group / subscription / resource group), and the Azure mental model (section 1).
Entra ID basics and RBAC vs Entra roles; groups; managed identities (section 2).
VNet basics: subnets, NSGs, and how traffic flows (section 3).
VM basics: sizes, managed disks, Bastion access (section 4).
Storage basics: storage accounts, blob tiers, redundancy, disabling public access (section 5).
Azure Monitor basics: metrics, the agent, an alert, Log Analytics (section 9).

Why it matters

Every design rests on the hierarchy, the Entra/RBAC split, and the VNet model. Get these right and everything later is easier.

Hands-on labs

Create a resource group; assign a built-in role to a group at RG scope; test with Check access.
Build a VNet with subnets + NSGs; deploy a VM with no public IP; connect via Bastion.
Create a storage account (ZRS, public access disabled); upload blobs with Entra auth.
Deploy the Azure Monitor Agent (VM Insights); create a CPU alert to an action group.

Common mistakes

Confusing Entra roles with RBAC; Owner at subscription; public storage; no agent for memory; VMs with public IPs.

Expected outcome

You can stand up a segmented VNet, reach a private VM via Bastion, use RBAC correctly, and see basic telemetry.

Intermediate

Level 2 - Building real workloads

Goal: deploy an HA app + managed database with monitoring, security, and cost control

What to learn

Load balancing (Load Balancer, Application Gateway + WAF, Front Door) and VMSS + autoscale (sections 7, 4).
Private networking: Private Endpoints, NAT Gateway, VPN Gateway, ExpressRoute basics (section 3).
Azure SQL: service tiers, zone redundancy, Private Endpoint, PITR, failover groups (section 6).
Key Vault + managed identities; Defender for Cloud; Azure Policy basics (section 8).
Cost Management: budgets, tags, Reservations, Hybrid Benefit (section 14).

Why it matters

This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.

Hands-on labs

Deploy a 3-tier app: Front Door + WAF → App Service/VMSS → Azure SQL (Private Endpoint, zone-redundant).
Allow the health probe (+ App Gateway management ports); confirm backends healthy; force a failover.
Store the DB connection secret in Key Vault; connect via managed identity / Entra auth.
Create alerts (CPU, unhealthy backend, DB storage) + an action group; wire a budget + tags.
Apply a couple of Azure Policies (deny public IP, allowed regions) at the resource group.

Common mistakes

Health-probe NSG rule missing; DB public endpoint; secrets in app settings; Private DNS not linked; noisy alerts.

Expected outcome

You can deploy a secure, monitored, HA application + managed database, connect it privately, and control its cost and access.

Advanced

Level 3 - Enterprise architecture, data & AI

Goal: design governed, multi-region, data-and-AI-capable platforms

What to learn

Management groups, landing zones (CAF ALZ), and Azure Policy at scale (sections 1, 8, 14).
Hub-and-spoke / Virtual WAN, Azure Firewall, Private Link, DNS Private Resolver (section 3).
Advanced RBAC + PIM + Conditional Access (section 2).
AKS / Container Apps, Functions, Event Grid/Service Bus (section 10).
Synapse / Microsoft Fabric, Data Factory, Purview (section 11).
Azure OpenAI, Azure AI Search, and governed RAG / vector search (section 12).
Multi-region DR (Front Door + failover groups / Cosmos) (section 13).
Bicep/Terraform + pipelines; enterprise security; large-enterprise architecture (sections 17, 8).

Why it matters

At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.

Hands-on labs

Deploy a landing zone (CAF ALZ accelerator): management groups, baseline RBAC + PIM, Azure Policy, hub network, central logging + Defender, budgets.
Stand up a private AKS cluster in a spoke with Workload Identity and a GitHub Actions/Azure DevOps pipeline.
Build a Fabric/Synapse lakehouse with a Data Factory pipeline and Purview governance; tune a query.
Build a governed RAG assistant: Blob + Azure AI Search + Azure OpenAI behind an App Service serving API, Private Endpoints, Content Safety, and security-trimmed retrieval.
Implement cross-region DR for an Azure SQL app (failover group) behind Front Door; rehearse failover and confirm the CMKs are available in the DR region.

Common mistakes

Skipping the landing zone; DR never tested; standing Owner instead of PIM; pod-IP exhaustion; connecting AI to production data without a governed serving layer.

Expected outcome

You can design and operate a governed, automated, multi-region Azure platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.

Certification checkpoints (optional)

Level	Typical certification track
Beginner	AZ-900 (Azure Fundamentals); AZ-104 (Administrator)
Intermediate	AZ-104; AZ-500 (Security); AZ-700 (Networking)
Advanced	AZ-305 (Solutions Architect Expert); DP-700/DP-600 (Data; DP-203 is retired); AI-102 (AI Engineer)

Verify before scheduling

Microsoft updates exam content and role certifications regularly. Confirm the current track and objectives on Microsoft Learn before you prepare. Certifications validate knowledge; the labs above build the capability employers pay for.

Official: Microsoft Learn training & certifications →
