
Microsoft Azure Deep Dive Portal

A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Azure environments - not a marketing overview.

19 deep sections | Architecture patterns | Troubleshooting runbooks | CLI / Bicep / Terraform | Well-Architected
Last reviewed: July 2026. Azure changes frequently - verify with current Microsoft documentation before production use.
WHO THIS IS FOR

Cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, DevOps, security engineers, Microsoft administrators - and anyone moving from traditional infrastructure or another cloud into Azure. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Azure and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, VM sizes, quotas, model availability, service names), a Verify with current Microsoft documentation flag.

Learn
Foundations first

Sections 1-2 establish the mental model: the governance hierarchy (management group / subscription / resource group), Entra ID, and the RBAC model everything depends on.

Build
Service deep dives

Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data, and AI - with diagrams, tables, and gotchas.

Operate
Run and govern

Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Well-Architected Framework, and a learning path.

Reading the callouts

Several note types recur. They flag the perspective that matters most for a point.

Architect note: Design-time decisions, trade-offs, and things to settle before production.
DBA note: Database-specific behavior - what Azure manages vs. what you manage, patching, backups, connectivity.
Security note: Exposure, least privilege, encryption, and audit considerations.
Cost note: Where money is spent and commonly wasted.
Operations note: Day-2 behavior: patching, scaling, maintenance, and reliability.
Common mistake: A specific error teams repeatedly make, and how to avoid it.

The Azure shared responsibility model (orientation)

Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Microsoft already does.

Layer | Virtual Machines (IaaS) | AKS | Azure SQL DB / PaaS DB | App Service / Functions
Physical / hypervisor | Microsoft | Microsoft | Microsoft | Microsoft
OS patching | You (Update Manager) | You (nodes) / MS (control plane) | Microsoft | Microsoft
Runtime / engine patching | You | Shared | Microsoft (in window) | Microsoft
Backup config | You (Azure Backup) | You | Managed, you configure | Managed / you export
Scaling / HA | You (Scale Sets, zones) | You configure | You enable zone-redundancy | Automatic
Data, schema, access, RBAC | You | You | You | You
The rule that never moves
Microsoft secures the cloud. You secure what you put in it: identities, RBAC, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.

Suggested reading order

Accuracy & independence
This is an independent educational resource, not official Microsoft material and not a sales tool. Service names, VM sizes, quotas, regional availability, and pricing change often. Treat every concrete number, SKU, and limit here as a starting point and confirm it in the Azure portal for your subscription and the official Microsoft documentation before any design, sizing, or purchasing decision.

1. Azure Fundamentals

The global infrastructure and the governance hierarchy (management groups, subscriptions, resource groups) that every Azure deployment is built on - plus the mental model that makes the rest of the platform predictable.

Last reviewed: July 2026. Verify region list, zones, quotas, and service availability in the portal.
TL;DR

Azure is a set of regions (most paired for DR, many with Availability Zones) on Microsoft's global network. Governance is a hierarchy: Management Groups > Subscriptions > Resource Groups > resources, all under one Microsoft Entra tenant (the identity boundary). The subscription is the billing/deployment/quota boundary; the resource group is the lifecycle boundary. Azure Resource Manager (ARM) is the deployment/control plane; RBAC grants access and Azure Policy constrains what is allowed. Get the hierarchy and a landing zone right before production.

What Azure is

Microsoft Azure is Microsoft's public cloud: on-demand compute, storage, networking, databases, data, and AI services delivered from Microsoft-operated regions, consumed over Microsoft's global network, and billed by usage. Its distinctive strengths for enterprises are deep Microsoft ecosystem integration (Entra ID / Active Directory, Windows Server, SQL Server, Microsoft 365), strong hybrid tooling (Azure Arc, Azure Stack, ExpressRoute), and a mature governance model (management groups, Azure Policy, landing zones). If you come from traditional Windows/SQL/AD infrastructure, much of Azure will feel familiar. The biggest early shifts are the resource hierarchy and the fact that identity lives in Entra ID, which is separate from access to Azure resources (RBAC).

Azure global infrastructure

Concept | What it is | Protects against / used for
Region | A set of data centers in a geography; your primary deployment + residency boundary | Choose by latency, residency, and service/zone availability
Region pair | Most regions are paired with another in the same geography; Microsoft sequences updates and some services replicate to the pair | Regional DR target; ordered platform maintenance
Availability Zone (AZ) | Physically separate datacenters within a region (independent power/cooling/network) | Datacenter-level failure - spread across zones for in-region HA
Availability Set | A grouping that spreads VMs across fault domains (racks) and update domains (maintenance groups) within a single datacenter | Rack/maintenance failure when zones aren't used
Fault domain | A rack of hardware (shared power/network) | Anti-affinity within an availability set
Update domain | A group updated/rebooted together during planned maintenance | Ensures not all instances reboot at once
Sovereign clouds | Isolated clouds (Azure Government, Azure operated by 21Vianet in China) | Regulatory/sovereignty isolation - separate endpoints and features
Architect note - Availability Zones vs Availability Sets
Prefer Availability Zones (spread across physically separate datacenters, ~99.99% VM SLA with 2+ instances across 2+ zones) for new HA designs. Availability Sets only protect against rack/maintenance failure within one datacenter - use them in regions without zones or for tightly-coupled legacy tiers. A VM cannot be placed in both an availability set and an availability zone; decide per tier. Confirm zone support for your region and VM size.
Common mistake
Assuming every region has Availability Zones or that "the region pair" is an automatic DR solution. Zone availability varies by region and service; region-pair replication only applies to specific services (e.g. GRS storage). You still design and test your own DR.

The governance hierarchy

[Diagram: Entra tenant (identity boundary) > Root management group > management groups (Platform, Landing Zones, Sandbox / Decommissioned) > subscriptions (connectivity, management, identity, corp-prod, corp-nonprod) > resource groups (app-network, app-compute). RBAC + Policy inherit down. Subscription = billing/quota/deploy. Resource group = lifecycle unit.]
Entra tenant > Management Groups > Subscriptions > Resource Groups > resources. RBAC and Azure Policy inherit downward.
  • Microsoft Entra tenant - the identity boundary (users, groups, apps). One tenant can hold many subscriptions. Identity (Entra) is separate from resource access (Azure RBAC) - a crucial distinction (section 2).
  • Management groups - a hierarchy above subscriptions for applying RBAC and Azure Policy at scale (e.g. a Platform MG and a Landing Zones MG under the root). Can nest.
  • Subscriptions - the unit of billing, quota/limits, and deployment. Also a strong isolation and blast-radius boundary. Large enterprises use many subscriptions (per app/environment/BU).
  • Resource groups - a lifecycle/management container for resources that share a lifecycle; RBAC, locks, and deletion apply at this level. A resource belongs to exactly one RG.
  • Azure Resource Manager (ARM) - the control-plane API/service for all deployments and management; ARM templates and Bicep are its native IaC languages.

The Azure mental model

Concept | Is the boundary for
Entra tenant | Identity (who exists)
Subscription | Billing and deployment (and quota/limits)
Resource group | Lifecycle and management (deploy/delete/lock together)
Management group | Governance (apply RBAC/Policy across many subscriptions)
Azure Policy | Control and compliance (what is allowed to exist / be configured)
Azure RBAC | Access (who can do what to Azure resources)
Two different questions
RBAC answers "can this identity perform this action on this resource?" Azure Policy answers "is this resource/configuration allowed to exist here at all?" (e.g. "only these regions", "no public IPs", "require a tag"). They are separate engines - you often need both: RBAC to grant, Policy to constrain. And Entra directory roles (Global Administrator, etc.) are different again from Azure RBAC (section 2).
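As a rough illustration of the two questions, the Python sketch below models a deployment request that has to pass both checks. The role-to-action map, the allowed-locations rule, and every name in it are hypothetical simplifications for illustration; it is not how ARM evaluates requests internally.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified role definitions (real roles list many more actions).
ROLE_ACTIONS = {
    "Reader": {"read"},
    "Contributor": {"read", "write", "delete"},
}

@dataclass(frozen=True)
class Request:
    principal: str
    action: str      # "read", "write", or "delete"
    location: str    # region of the resource being touched

def rbac_allows(assignments: dict, req: Request) -> bool:
    """RBAC question: may this identity perform this action?"""
    role = assignments.get(req.principal)
    return role is not None and req.action in ROLE_ACTIONS[role]

def policy_allows(allowed_locations: set, req: Request) -> bool:
    """Policy question: is this configuration allowed to exist here at all?"""
    if req.action != "write":
        return True  # this sketch only constrains deployments
    return req.location in allowed_locations

def evaluate(assignments: dict, allowed_locations: set, req: Request) -> str:
    """Both engines must agree before the request goes through."""
    if not rbac_allows(assignments, req):
        return "denied by RBAC"
    if not policy_allows(allowed_locations, req):
        return "denied by Policy"
    return "allowed"

assignments = {"app-team": "Contributor", "auditors": "Reader"}
allowed = {"westeurope", "northeurope"}

print(evaluate(assignments, allowed, Request("app-team", "write", "westeurope")))
print(evaluate(assignments, allowed, Request("app-team", "write", "eastus")))
print(evaluate(assignments, allowed, Request("auditors", "write", "westeurope")))
```

The second request shows why both are needed: the app team holds a role that permits the write, yet the deployment is still refused because the region is outside the policy.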

ARM, Bicep, and the tools

Azure Portal

The web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.

Azure CLI & Azure PowerShell

Two first-class command lines (the az CLI and the Az PowerShell module). Pick one house standard; both cover the control plane. See section 17.

Cloud Shell

Browser shell pre-authenticated as your identity, with CLI/PowerShell/Terraform/kubectl and persistent storage.

ARM templates & Bicep

Native IaC. Bicep is the modern, readable authoring language that compiles to ARM JSON - prefer it over raw ARM templates.

Terraform

The azurerm provider is widely used for multi-cloud/standardized IaC. Bicep vs Terraform is a house choice; both are valid.

REST APIs & SDKs

Every control-plane operation is an ARM REST call; idiomatic SDKs (.NET, Python, Java, JS, Go) wrap those calls for building tooling and apps.

Recommended posture
Portal to learn and inspect. CLI/PowerShell for glue and ad-hoc ops. Bicep or Terraform for anything that reaches production. Manual portal clicks in prod are the root cause of most "why is DR different?" incidents. (Azure Blueprints is being retired in favor of Terraform / template specs / deployment stacks - use those for landing-zone deployment.)

Azure Policy and resource locks

  • Azure Policy - define, assign, and enforce rules (deny, audit, deployIfNotExists, modify) at MG/subscription/RG scope, inheriting down. Common: allowed regions, allowed SKUs, require tags, deny public IPs/public storage, require diagnostic settings. Initiatives bundle policies (e.g. a regulatory baseline).
  • Resource locks - CanNotDelete or ReadOnly locks on a subscription/RG/resource to prevent accidental deletion/changes (e.g. lock the hub network and prod databases).
Architect note
Set a baseline of preventive Azure Policy at the management-group level in your landing zone: allowed locations, deny public IPs by default, require diagnostic settings to a central Log Analytics workspace, enforce tags. Add CanNotDelete locks on foundational resources (hub VNet, Key Vault, prod data). This prevents whole classes of mistakes rather than detecting them after the fact.

Tags

Tags are key/value metadata on resources, resource groups, and subscriptions - used for cost attribution (they flow into Cost Management and billing exports), organization, and automation. Tags are not inherited by default (a resource does not automatically get its RG's tags), though Azure Policy can enforce/inherit them.

Cost note
Define a small, enforced tag set (cost-center, environment, owner, application) from day one and enforce it with Azure Policy (require + inherit from RG). Chargeback and Cost Management analysis are only as good as your tagging, and retrofitting tags across thousands of resources is slow and never complete.
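To make the "not inherited by default" point concrete, here is a small Python sketch of what a resource ends up with, with and without an inherit-from-resource-group policy. The function names, the tag values, and the rule that existing resource tags are never overwritten are assumptions of this sketch, not a description of a specific built-in policy.

```python
REQUIRED_TAGS = ("cost-center", "environment", "owner", "application")

def effective_tags(resource_tags: dict, rg_tags: dict, inherit_from_rg: bool) -> dict:
    """Return the tags a resource ends up with.

    Without an inherit policy the resource keeps only its own tags.
    With one, missing required tags are copied from the resource group;
    values already set on the resource are left alone (assumption of this sketch).
    """
    tags = dict(resource_tags)
    if inherit_from_rg:
        for key in REQUIRED_TAGS:
            if key not in tags and key in rg_tags:
                tags[key] = rg_tags[key]
    return tags

def missing_tags(tags: dict) -> list:
    """Required tags still absent: what a 'require tag' policy would flag."""
    return [key for key in REQUIRED_TAGS if key not in tags]

# Hypothetical resource group and VM tags.
rg = {"cost-center": "cc-1042", "environment": "prod",
      "owner": "team-a", "application": "billing"}
vm = {"owner": "alice"}

print(missing_tags(effective_tags(vm, rg, inherit_from_rg=False)))
print(effective_tags(vm, rg, inherit_from_rg=True))
```

Without the inherit policy the VM is missing three of the four required tags even though its resource group carries all of them, which is the gap that shows up later in cost reports.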

Designing the hierarchy & landing zones

How to structure tenants, management groups, subscriptions, resource groups (Design)
  • One Entra tenant for the enterprise (identity boundary); multiple tenants only for genuine sovereignty/M&A isolation.
  • Management groups following the Cloud Adoption Framework (CAF) pattern: a Platform MG (connectivity, management, identity subscriptions) and a Landing Zones MG (corp/online workload subscriptions), plus Sandbox and Decommissioned.
  • Subscriptions per workload/environment (or per app), not one giant subscription - they are the quota and blast-radius boundary. Separate platform subscriptions for connectivity (hub network), management (logging/monitoring), and identity.
  • Resource groups per app-tier or per lifecycle (things deployed and deleted together).
Separating dev / test / stage / prod / shared / security / networking / logging (Design)
  • Separate subscriptions per environment for independent quota, budgets, RBAC, and blast radius; use management groups to apply environment-wide Policy/RBAC once.
  • Dedicated platform subscriptions: connectivity (hub VNet, Azure Firewall, ExpressRoute/VPN), management (Log Analytics, Automation, Backup), identity (domain controllers / Entra Domain Services if needed).
  • Keep prod under stricter Policy (no public IPs, restricted regions/SKUs, mandatory logging) than nonprod.
  • Never mix sandbox/experimentation with production - separate MG with its own guardrails and budgets.
What an Azure landing zone includes (Design)

An Azure Landing Zone (CAF) is a codified, repeatable baseline deployed before workloads:

  • Management-group hierarchy, subscription organization, and naming/tagging standards.
  • Identity: Entra tenant config, groups, PIM, break-glass, Conditional Access baseline.
  • Baseline RBAC (groups, not users) and preventive Azure Policy initiatives.
  • Connectivity: hub-and-spoke (or Virtual WAN), Azure Firewall, DNS, ExpressRoute/VPN in the connectivity subscription.
  • Management: central Log Analytics workspace, diagnostic settings policy, Azure Monitor, Backup, Update Manager.
  • Security: Defender for Cloud, Sentinel, Key Vault, Private Link/Private DNS strategy.
  • Guardrails: budgets, quotas, resource locks, tags - all as code (Terraform / Bicep; the CAF ALZ accelerator is a starting point).
Common mistakes in Azure governance
  • One big subscription for everything, so quota, RBAC, and cost attribution collapse together.
  • Skipping the landing zone and retrofitting management groups, Policy, and hub networking later.
  • Confusing the Entra tenant (identity) with a subscription (billing) - they are different boundaries.
  • Granting RBAC at subscription/MG level for convenience so everything inherits broad access.
  • No naming/tagging standard; mixing sandbox with production.

2. Identity and Access Management

Microsoft Entra ID and Azure RBAC - the two different systems that decide who exists and who can do what - plus managed identities, PIM, Conditional Access, and a troubleshooting model. This is where most Azure access issues and security incidents originate.

Last reviewed: July 2026. Verify role names, PIM, and Conditional Access features in current docs.
TL;DR

Azure has two access systems: Microsoft Entra ID (identity + directory roles like Global Administrator that govern Entra/M365) and Azure RBAC (roles like Contributor that govern Azure resources at MG/subscription/RG/resource scope). Confusing them is the #1 IAM error. Use groups (not users), built-in roles scoped narrowly (not Owner at subscription), managed identities for workloads (not secrets), PIM for just-in-time privileged access, and Conditional Access + MFA for sign-in. Deny assignments and Policy can block regardless of role.

Entra ID roles vs Azure RBAC roles (the critical distinction)

Aspect | Microsoft Entra directory roles | Azure RBAC roles
Govern | Entra ID & Microsoft 365 (users, groups, apps, tenant settings) | Azure resources (VMs, storage, networks, databases)
Examples | Global Administrator, User Administrator, Application Administrator | Owner, Contributor, Reader, Storage Blob Data Contributor
Scope | Tenant (and administrative units) | Management group / subscription / resource group / resource
Managed in | Entra ID > Roles and administrators | Resource > Access control (IAM)
Common mistake - confusing the two
A Global Administrator (Entra) does not automatically have access to Azure resources, and an Azure Owner cannot manage Entra users. They are separate systems with separate roles and scopes. (A Global Admin can elevate to gain User Access Administrator over all subscriptions - a deliberate, audited action, not the default.) Always know which system a task needs before assigning a role.

Principals (who can be granted access)

Principal | What it is | Use for
User | A human identity in Entra ID | People - but grant via groups, not directly
Group | An Entra security group of users/principals | All human access management
Administrative unit | A container to scope Entra role admins to a subset of users/groups | Delegated identity admin (e.g. per-region helpdesk)
App registration + service principal | An application identity; the app registration is the definition, the service principal is its instance in a tenant | Apps/CI authenticating to Azure/Graph (prefer managed identity where possible)
Managed identity | An Azure-managed service principal with no credentials you handle | Azure workloads calling Azure/Entra - the preferred workload identity
External identity (B2B guest) | A user from another tenant invited as a guest | Partner/vendor collaboration

Managed identities: system- vs user-assigned

Aspect | System-assigned | User-assigned
Lifecycle | Tied to one resource; created/deleted with it | Standalone resource; attach to many
Use for | A single workload needing its own identity | Sharing one identity across resources; pre-creating and granting RBAC before deploy
Credentials | None you manage - Azure rotates them | None you manage
Security note - managed identities over secrets
An app-registration client secret or certificate is a long-lived credential that can leak, be committed to git, expire unexpectedly, or outlive its owner. A managed identity has no secret you handle - Azure issues and rotates tokens automatically, scoped by RBAC. If an Azure workload needs to call Azure or Entra, it should almost always use a managed identity, not an app secret. Reserve app registrations + secrets/certs for external/CI cases, and prefer workload identity federation (no secret) even there.
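To show what "no secret you handle" looks like in practice, the sketch below builds the token request a VM workload makes to the Azure Instance Metadata Service (IMDS) for its managed identity. It uses only the Python standard library and does not send the request unless you call fetch_token from inside an Azure VM with an identity assigned. The helper names are made up for this example; in real applications an identity library such as azure-identity (for example DefaultAzureCredential) is usually the better choice, and other hosts such as App Service expose a different endpoint.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# IMDS is a link-local endpoint reachable only from inside the VM.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"

def build_token_request(resource: str, client_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the IMDS token request for a managed identity."""
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        # Only needed to select a specific user-assigned identity.
        params["client_id"] = client_id
    url = f"{IMDS_TOKEN_URL}?{urllib.parse.urlencode(params)}"
    # IMDS requires the Metadata header on every request.
    return urllib.request.Request(url, headers={"Metadata": "true"})

def fetch_token(resource: str, client_id: Optional[str] = None) -> str:
    """Call IMDS and return the access token. Works only on an Azure VM."""
    request = build_token_request(resource, client_id)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)["access_token"]

req = build_token_request("https://storage.azure.com/")
print(req.full_url)
print(req.get_header("Metadata"))
```

Note that nothing in the request is a credential: the platform decides which identity the caller is, and RBAC on the target resource decides what the resulting token can do.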

Azure RBAC: roles and scope

  • Built-in roles - Owner (full + manage access), Contributor (full except manage access), Reader, plus hundreds of granular roles (e.g. Virtual Machine Contributor, Storage Blob Data Reader). Prefer the narrowest built-in role.
  • Custom roles - compose specific actions/dataActions when no built-in role fits without over-granting.
  • Role assignment = role + principal + scope. Scope is MG / subscription / RG / resource; assignments inherit downward and are additive.
  • Deny assignments - explicitly block actions regardless of role (used by Azure-managed features like Blueprints/managed apps); evaluated before allows.
  • Control-plane vs data-plane - some roles govern management (create/delete a storage account) vs data (read blobs). A Contributor on a storage account can't necessarily read the blobs without a data role - a frequent surprise.
Common mistake - Owner / subscription-scope grants
Assigning Owner (or even Contributor) at subscription scope "to unblock a team" gives broad control over everything in it, inherited by every RG and resource. Grant the narrowest built-in role at the resource group (or resource) scope; reserve subscription/MG-scope and Owner for a small, PIM-gated admin group.
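The inheritance rule can be sketched in a few lines of Python: an assignment applies at its own scope and at everything beneath it, and a principal's effective roles are the union of all assignments that reach the resource. The subscription ID, resource names, and helper functions are hypothetical. Management-group scope is left out because it is not a simple path prefix of a subscription, and deny assignments are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleAssignment:
    principal: str
    role: str
    scope: str  # ARM-style scope path

def applies_to(assignment_scope: str, resource_id: str) -> bool:
    """An assignment applies at its own scope and everything beneath it."""
    a = assignment_scope.rstrip("/").lower()
    r = resource_id.rstrip("/").lower()
    return r == a or r.startswith(a + "/")

def effective_roles(assignments, principal: str, resource_id: str) -> set:
    """Assignments are additive: collect every role that reaches the resource."""
    return {
        a.role
        for a in assignments
        if a.principal == principal and applies_to(a.scope, resource_id)
    }

# Hypothetical scopes.
SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
RG_APP = SUB + "/resourceGroups/rg-app"
RG_DATA = SUB + "/resourceGroups/rg-data"
VM = RG_APP + "/providers/Microsoft.Compute/virtualMachines/vm1"
STORAGE = RG_DATA + "/providers/Microsoft.Storage/storageAccounts/stdata01"

assignments = [
    RoleAssignment("app-team", "Reader", SUB),
    RoleAssignment("app-team", "Contributor", RG_APP),
    RoleAssignment("app-team", "Storage Blob Data Reader", STORAGE),
]

print(sorted(effective_roles(assignments, "app-team", VM)))
print(sorted(effective_roles(assignments, "app-team", STORAGE)))
print(sorted(effective_roles(assignments, "app-team", RG_DATA)))
```

Swapping the Reader assignment at subscription scope for Owner would add Owner to every result above, which is exactly the broad inheritance the callout warns about.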

Privileged Identity Management & Conditional Access

Privileged Identity Management (PIM)

Just-in-time, time-bound, approval-and-MFA-gated activation of privileged roles (Entra and Azure RBAC). Admins are eligible, not permanently assigned - they activate when needed, with justification and audit. The single biggest reduction of standing privilege.

Conditional Access

Sign-in policies that require MFA, compliant/managed devices, trusted locations, or block risky sign-ins - the enforcement layer for Zero Trust. Enforce MFA for all users, especially admins.

Identity Protection

Risk-based detection (leaked credentials, impossible travel) feeding Conditional Access to block/step-up risky sign-ins.

Access reviews & entitlement management

Periodic recertification of group/role/guest access, and access packages for governed self-service - keep access from silently accumulating.

Security note - PIM + break-glass
Make privileged roles (Owner, User Access Administrator, Global Administrator) eligible via PIM, not permanent - activation requires MFA, justification, and (for the highest roles) approval, all audited. Keep two break-glass (emergency access) accounts: cloud-only, excluded from Conditional Access that could lock you out, with very long unique passwords stored offline, MFA via a separate method, and alerting on every sign-in. They exist so you can still administer the tenant if federation/PIM/Conditional Access breaks.
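The difference between eligible and active can be pictured with a toy model like the one below. It is only a sketch of the idea (activation gated on MFA and a justification, then expiring on its own); the class, the eight-hour cap, and the ticket reference are invented for illustration and do not reflect PIM's actual API or settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class EligibleAssignment:
    principal: str
    role: str
    max_duration: timedelta = timedelta(hours=8)  # hypothetical policy cap
    active_until: Optional[datetime] = None       # None means not activated

    def activate(self, now: datetime, mfa_passed: bool,
                 justification: str, duration: timedelta) -> None:
        """Turn eligibility into a time-boxed active assignment."""
        if not mfa_passed:
            raise PermissionError("MFA is required to activate")
        if not justification.strip():
            raise PermissionError("a justification is required")
        # Requested duration is capped by the policy maximum.
        self.active_until = now + min(duration, self.max_duration)

    def is_active(self, now: datetime) -> bool:
        """True only inside the activation window."""
        return self.active_until is not None and now < self.active_until

t0 = datetime(2026, 7, 1, 9, 0)
oncall = EligibleAssignment("oncall-engineer", "Contributor")

print(oncall.is_active(t0))                       # eligible, not yet active
oncall.activate(t0, mfa_passed=True,
                justification="INC-1234 restore service",  # hypothetical ticket
                duration=timedelta(hours=2))
print(oncall.is_active(t0 + timedelta(hours=1)))  # inside the window
print(oncall.is_active(t0 + timedelta(hours=3)))  # window has expired
```

The useful property is the last line: access lapses without anyone having to remember to remove it.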

Real RBAC / identity scenarios

App team manages their resource group only (Medium risk)

Who: the app team group. Scope: their resource group (not subscription). Role: Contributor on the RG (or narrower roles like Virtual Machine Contributor + Web Plan Contributor). Risk: medium - contained to one RG. Safer alternative: deploy via a pipeline managed identity and give humans Reader + specific action roles. Common misuse: Owner/Contributor at subscription scope.

Workload reads a storage account - no secrets (Low risk)

Who: a VM/App Service/AKS workload. Scope: a single storage account (or container). Role: Storage Blob Data Reader assigned to the workload's managed identity. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a storage account key or SAS in app config + Contributor at RG.

Read-only auditor across the platform (Low risk)

Who: security/audit group. Scope: the root or Platform management group (auditors need breadth). Role: Reader (+ Security Reader for Defender), granted to a group. Risk: low - read-only. Common misuse: giving auditors Contributor "just in case".

Emergency change requiring elevated access (Higher risk)

Who: an on-call engineer. Scope: the affected subscription. Role: Contributor/Owner made eligible via PIM - activated just-in-time with MFA + justification, time-boxed, audited. Risk: higher, but no standing privilege and full audit trail. Common misuse: a permanent Owner assignment "for emergencies".

Common Azure IAM mistakes

  • Confusing Entra roles with Azure RBAC roles - different systems, scopes, and portals.
  • Owner assigned too broadly - use narrow built-in roles at RG/resource scope.
  • Subscription-scope grants unnecessarily - inherit into every RG; grant lower.
  • Not using PIM - standing privileged access is the biggest avoidable risk.
  • Break-glass accounts not protected/monitored correctly - or not existing at all.
  • Too many app secrets, not rotated - and used where a managed identity would work.
  • Not using managed identities for Azure workloads.
  • Mixing human users and workload identities - different lifecycles/controls.
  • Weak Conditional Access - no MFA enforcement, no risk policies, admins unprotected.

Azure access troubleshooting mental model

When access fails (or unexpectedly works), walk the layers in order:

⚑ "Access denied" / "does not have authorization" - the checklist
  1. Which tenant are you signed into? (Guest in the wrong tenant is common.)
  2. Which subscription is the resource in, and is it the selected one?
  3. Which identity is making the request - user, group, service principal, or managed identity? (For workloads, which MI is actually assigned?)
  4. What role is assigned, and does it include the required action (control-plane vs data-plane)?
  5. At what scope (MG/subscription/RG/resource)? Does inheritance reach this resource?
  6. Is there a deny assignment blocking it?
  7. Is Azure Policy blocking the action (deny effect)?
  8. Is Conditional Access blocking the sign-in (device/location/MFA)?
  9. Does the role require PIM activation that hasn't been done (eligible but not active)?
  10. Is the resource provider registered / API available in the subscription?

Tools

Resource > Access control (IAM) > Check access and View my access; Entra sign-in logs (for Conditional Access); Activity Log (for the denied operation); PIM (eligible vs active).

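az role assignment list --assignee <objectId> --all -o table    # every role assignment for this principal, at any scope
az account show     # which tenant/subscription am I in?
az provider show -n Microsoft.Sql --query registrationState     # "Registered" if the resource provider is enabled here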

3. Networking Deep Dive

Virtual Networks, NSGs vs Azure Firewall, Private Endpoint vs Service Endpoint, hub-and-spoke, hybrid connectivity, and Private DNS - plus the traffic-flow reasoning you need to design and debug real Azure networks.

Last reviewed: July 2026. Verify gateway SKUs, limits, and Private DNS behavior in current docs.
TL;DR

A Virtual Network (VNet) is regional; you carve subnets from its CIDR. NSGs are stateful L3/L4 allow/deny rules (per subnet/NIC); Azure Firewall is a managed L3-L7 firewall for centralized egress/inspection in a hub. Reach PaaS privately with Private Endpoint (a private IP in your VNet, works cross-region/hybrid) or the older Service Endpoint (keeps traffic on the backbone but the service keeps a public endpoint). VNet peering is not transitive - use hub-and-spoke or Virtual WAN. Private Endpoint needs correct Private DNS zone linkage or nothing resolves. Plan non-overlapping CIDRs first.

Virtual Network and CIDR planning

  • A VNet is regional with one or more address spaces; subnets partition it. Azure reserves 5 IPs per subnet (first 4 + last).
  • Some services want dedicated/delegated subnets (Azure Firewall needs AzureFirewallSubnet, gateways need GatewaySubnet, Bastion needs AzureBastionSubnet, App Gateway its own subnet). Plan these in advance.
  • CIDRs must not overlap with peered VNets or on-premises. Overlap is the #1 cause of hybrid that "connects but won't route."
  • Plan generously: leave room for growth, gateway/firewall/bastion subnets, and AKS (which consumes many IPs with Azure CNI).
Architect note - IP plan first
Reserve a large private supernet for Azure, allocate a block per region/environment, and subnets per tier plus the required platform subnets (GatewaySubnet, AzureFirewallSubnet, AzureBastionSubnet). Keep a documented IPAM. Overlapping ranges break peering and hybrid and are a re-IP project later; a too-large plan costs nothing.
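A quick way to sanity-check an IP plan before anything is deployed is to compute usable addresses per subnet and look for overlaps. The sketch below does both with Python's standard ipaddress module; the plan entries and function names are hypothetical, and the only Azure-specific fact it encodes is the five reserved addresses per subnet mentioned above.

```python
import ipaddress
from itertools import combinations

AZURE_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one

def usable_ips(cidr: str) -> int:
    """Addresses left for your resources after Azure's reservations."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)

def find_overlaps(ranges: dict) -> list:
    """Return every pair of named ranges whose address space overlaps."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]

# Hypothetical plan: one range deliberately collides with on-premises.
plan = {
    "on-prem": "10.0.0.0/16",
    "hub-weu": "10.10.0.0/22",
    "spoke-app": "10.10.4.0/22",
    "spoke-legacy": "10.0.8.0/24",
}

print(usable_ips("10.10.0.0/26"))   # 64 addresses minus 5 reserved
print(usable_ips("10.10.0.0/29"))   # 8 addresses minus 5 reserved
print(find_overlaps(plan))
```

Running a check like this in CI against the documented IPAM catches the overlap while it is still a one-line change rather than a re-IP project.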

Network Security Groups & Application Security Groups

  • NSG - stateful allow/deny rules on source/dest, port, protocol, evaluated in priority order (lowest number first; the first matching rule wins). Attach to a subnet and/or a NIC; both apply. Default rules allow intra-VNet and deny inbound from internet.
  • Application Security Group (ASG) - a named group of NICs you reference in NSG rules instead of IPs, so rules read "web-asg → db-asg on 1433" and update automatically as VMs scale.
  • Service tags - Microsoft-maintained IP groups (e.g. Storage, Sql, AzureMonitor, Internet) you use in rules instead of hardcoding ranges.
Security note - ASGs + service tags
Write NSG rules against ASGs (by workload role) and service tags (by Azure service) rather than raw IP ranges. Rules stay readable and self-updating, and you avoid the classic "someone hardcoded a range that changed." Keep a tight default: deny inbound from Internet, allow only what a tier needs from the tier that needs it.
Common mistake - NSG vs Azure Firewall confusion
An NSG is stateful L3/L4 packet filtering attached to subnets/NICs - it is not a firewall appliance (no FQDN filtering, no L7, no centralized logging of app traffic, no threat intel). Azure Firewall is a managed stateful firewall for centralized egress/inspection with FQDN rules, threat intel, and DNAT. Use both: NSGs for micro-segmentation at the subnet/NIC, Azure Firewall in the hub for controlled egress. They are complementary, not alternatives.
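The priority rule for NSGs is easy to misread, so here is a minimal Python model of inbound evaluation: rules are tried from the lowest priority number upward and the first match decides. The rule set is hypothetical, source matching is reduced to a CIDR or "*", and the VNet default rule uses a CIDR as a stand-in for the VirtualNetwork service tag; real NSGs also match on protocol, destination, and tags.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    source: str               # CIDR, or "*" for any source
    dest_port: Optional[int]  # None means any port
    access: str               # "Allow" or "Deny"

def matches(rule: Rule, src_ip: str, port: int) -> bool:
    if rule.dest_port is not None and rule.dest_port != port:
        return False
    if rule.source == "*":
        return True
    return ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source)

def evaluate_inbound(rules, src_ip: str, port: int) -> str:
    """Lowest priority number is checked first; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, src_ip, port):
            return f"{rule.access} ({rule.name})"
    return "Deny (no matching rule)"

rules = [
    Rule(200, "deny-sql-from-anywhere", "*", 1433, "Deny"),
    Rule(100, "allow-web-to-sql", "10.10.1.0/24", 1433, "Allow"),
    # Stand-ins for the default rules at the bottom of every NSG.
    Rule(65000, "AllowVnetInBound", "10.10.0.0/16", None, "Allow"),
    Rule(65500, "DenyAllInBound", "*", None, "Deny"),
]

print(evaluate_inbound(rules, "10.10.1.5", 1433))
print(evaluate_inbound(rules, "10.10.2.9", 1433))
print(evaluate_inbound(rules, "10.10.2.9", 443))
print(evaluate_inbound(rules, "203.0.113.7", 443))
```

The second case is the one to notice: traffic from another subnet in the same VNet would have been allowed by the default intra-VNet rule, but the explicit deny at priority 200 is reached first.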

Routes, UDRs, and internet egress

  • System routes handle intra-VNet, peering, and a default route to the internet. User Defined Routes (UDRs) in a route table override them - e.g. force 0.0.0.0/0 through the Azure Firewall (forced tunneling / centralized egress).
  • NAT Gateway - the recommended way to give a subnet scalable, stable outbound internet (SNAT) without per-VM public IPs. Outbound only.
  • Public IPs (Standard SKU, zone-redundant) for inbound-facing resources; avoid on workload VMs - use a load balancer / Bastion / firewall instead.
  • Default outbound access is being retired - new VMs should have an explicit outbound method (NAT Gateway, LB outbound rule, or firewall). Don't rely on implicit outbound.
Common mistake
Relying on Azure's implicit/default outbound internet for VMs (being deprecated) or expecting NAT Gateway to allow inbound - it's outbound only. Give every subnet an explicit outbound path (NAT Gateway or firewall) and use a load balancer/App Gateway/firewall DNAT for inbound. Also: a UDR that sends 0.0.0.0/0 to the firewall without a matching route back can black-hole traffic - design symmetric routing.
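How a UDR overrides a system route can be sketched as a next-hop lookup: the most specific matching prefix wins, and when prefixes tie, a user-defined route is preferred over a BGP route, which is preferred over a system route. The route table below is hypothetical (including the firewall address 10.10.0.4), and the model ignores details such as service-tag routes.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

# Tie-break order when two routes have the same prefix.
SOURCE_PREFERENCE = {"user": 0, "bgp": 1, "system": 2}

@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str  # "user", "bgp", or "system"

def select_route(routes, dest_ip: str) -> Optional[Route]:
    """Longest prefix match first, then user > bgp > system."""
    dest = ipaddress.ip_address(dest_ip)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen,
                       SOURCE_PREFERENCE[r.source]),
    )

# Hypothetical effective routes for a spoke subnet.
routes = [
    Route("10.10.0.0/16", "VnetLocal", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
    Route("0.0.0.0/0", "VirtualAppliance 10.10.0.4", "user"),  # forced egress via firewall
    Route("10.0.0.0/16", "VirtualNetworkGateway", "bgp"),      # learned from on-prem
]

print(select_route(routes, "10.10.1.5").next_hop)
print(select_route(routes, "8.8.8.8").next_hop)
print(select_route(routes, "10.0.3.3").next_hop)
```

The internet-bound packet goes to the firewall because the UDR ties with the system default route on prefix length and wins on source, while intra-VNet traffic still follows the more specific system route.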

Azure Firewall & Firewall Manager

Azure Firewall is a managed, highly-available, stateful network firewall (Basic/Standard/Premium SKUs; Premium adds TLS inspection, IDPS, URL filtering). Placed in the hub VNet with spoke UDRs pointing at it, it centralizes egress control (FQDN + network rules), DNAT for inbound, and logging. Firewall Manager centrally manages firewall policies across hubs (incl. Virtual WAN secured hubs).

Private Endpoint vs Service Endpoint (the constant confusion)

Aspect | Private Endpoint (Private Link) | Service Endpoint
What it is | A private IP in your subnet that maps to a specific PaaS resource instance | Extends the subnet's identity to the service over the backbone; the service keeps its public endpoint
Traffic | Fully private; the PaaS resource can disable public access entirely | Stays on the Azure backbone, but the resource still has a public endpoint (restricted by rules)
Cross-region / on-prem | Yes - reachable from peered VNets and on-prem (ExpressRoute/VPN) | No - VNet/region local; not reachable from on-prem
DNS | Requires a Private DNS zone mapping the service FQDN to the private IP | No DNS change; uses the public FQDN
Cost | Per-endpoint + data | Free
Use for | The modern default for private PaaS access (Storage, SQL, Key Vault...) | Simpler/cheaper cases where a public endpoint restricted to your subnet is acceptable
Common mistake - Private Endpoint DNS not linked
The single most common Private Endpoint failure: the Private DNS zone is not linked to the VNet (or on-prem DNS has no conditional forwarder), so the service FQDN still resolves to the public IP and connections fail or bypass the private path. For each PaaS type you need the correct zone (e.g. privatelink.blob.core.windows.net, privatelink.database.windows.net), an A record for the endpoint, and a VNet link (and hybrid forwarding for on-prem). Automate this - it is easy to forget and hard to notice.
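The failure mode is easier to see as a lookup chain. In the Python sketch below, the public name is a CNAME to the privatelink name; a VNet linked to the private zone gets the private IP, and an unlinked VNet falls through to public DNS and gets the public IP. All records, IP addresses, and VNet names are made up, and the resolver is a toy model rather than how Azure DNS is implemented.

```python
from typing import Optional

# Hypothetical public DNS: the service name is a CNAME to the privatelink name,
# which public DNS resolves to the service's public IP.
PUBLIC_DNS = {
    "stdata01.blob.core.windows.net": ("CNAME", "stdata01.privatelink.blob.core.windows.net"),
    "stdata01.privatelink.blob.core.windows.net": ("A", "20.60.0.10"),
}

# Hypothetical Private DNS zone with one A record and one VNet link.
PRIVATE_ZONES = {
    "privatelink.blob.core.windows.net": {
        "records": {"stdata01": "10.10.4.5"},
        "linked_vnets": {"vnet-spoke-app"},
    }
}

def resolve(fqdn: str, vnet: str) -> Optional[str]:
    """Resolve a name as seen from a given VNet."""
    name = fqdn
    for _ in range(5):  # bound the CNAME chain
        for zone, data in PRIVATE_ZONES.items():
            suffix = "." + zone
            if name.endswith(suffix) and vnet in data["linked_vnets"]:
                # A linked private zone answers for its names; a missing record
                # is modeled as no answer (assumed default behaviour).
                return data["records"].get(name[: -len(suffix)])
        record_type, value = PUBLIC_DNS[name]
        if record_type == "A":
            return value
        name = value  # follow the CNAME
    raise RuntimeError("CNAME chain too long")

print(resolve("stdata01.blob.core.windows.net", "vnet-spoke-app"))
print(resolve("stdata01.blob.core.windows.net", "vnet-unlinked"))
```

The second lookup is the silent failure described above: nothing errors, the name simply resolves to the public address, and the connection is then refused or bypasses the private path.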

VNet peering & hub-and-spoke

  • VNet peering (regional or global) connects two VNets privately over the backbone. Crucially, peering is not transitive: spoke A peered to hub H and spoke B peered to H cannot reach each other unless you route through a hub appliance (Azure Firewall/NVA) with UDRs and "allow forwarded traffic".
  • Hub-and-spoke - a central hub VNet (firewall, gateways, DNS, Bastion) peered to workload spokes; spokes route egress and cross-spoke traffic through the hub firewall.
  • Virtual WAN - a Microsoft-managed hub-and-spoke at scale (managed hubs, integrated firewall, any-to-any transit, branch/VPN/ExpressRoute) - use it instead of hand-built hubs for large/global topologies.
Common mistake - assuming peering is transitive
Teams peer spoke-to-hub and expect spoke-to-spoke to work - it doesn't. Route cross-spoke traffic through a hub firewall/NVA (UDRs + "allow forwarded traffic" on the peering), or adopt Virtual WAN which provides managed transit. Don't build meshes of direct spoke peerings.
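
The non-transitive rule can be modelled in a few lines. This is a simplified sketch with hypothetical VNet names: it treats a peering as a single hop and allows spoke-to-spoke only when a hub that forwards traffic (firewall/NVA plus UDRs) sits between them.

```python
def can_reach(src, dst, peerings, hub=None, hub_forwards=False):
    """Peering is one hop only: reachable if directly peered, or via a forwarding hub."""
    def peered(a, b):
        return frozenset((a, b)) in peerings

    if src == dst or peered(src, dst):
        return True
    # Transit needs both spokes peered to the hub AND the hub actually routing between them.
    return bool(hub) and hub_forwards and peered(src, hub) and peered(hub, dst)

peerings = {frozenset(("spoke-a", "hub")), frozenset(("spoke-b", "hub"))}

print(can_reach("spoke-a", "hub", peerings))                                     # direct peering
print(can_reach("spoke-a", "spoke-b", peerings))                                 # no transit by default
print(can_reach("spoke-a", "spoke-b", peerings, hub="hub", hub_forwards=True))   # routed via hub
```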

Hybrid connectivity: VPN Gateway and ExpressRoute

Aspect | VPN Gateway (Site-to-Site) | ExpressRoute
Path | Over the internet, IPSec-encrypted | Private, dedicated circuit via a partner/provider
Bandwidth | Up to gateway-SKU limits (hundreds of Mbps-Gbps) | 50 Mbps to 100 Gbps circuits
SLA / latency | Best-effort internet | Consistent, low latency; higher SLA; private
Setup | Minutes-hours | Days-weeks (provider provisioning)
Use as | Quick start / backup / lower bandwidth | Primary enterprise link; large data; low latency; private access to Microsoft/M365

ExpressRoute Global Reach connects on-prem sites to each other through their ExpressRoute circuits over the Microsoft backbone. A common pattern is ExpressRoute as the primary link with a VPN as backup. The gateway SKU determines bandwidth and features, and an undersized gateway is a frequent bottleneck, so size it deliberately.

Azure DNS & Private DNS

  • Azure DNS public zones for internet-facing names; Private DNS zones for internal resolution, linked to VNets (with optional auto-registration).
  • Private Endpoints depend on Private DNS zones resolving the privatelink.* FQDNs to the private IPs.
  • Azure DNS Private Resolver - a managed resolver for hybrid DNS (inbound/outbound endpoints + forwarding rulesets) so on-prem can resolve Azure private names and vice-versa, without running DNS VMs.

How traffic flows in Azure

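  1. Destination inside the VNet (or a peered VNet)? By default, system routes deliver it directly over the backbone and only NSG rules apply (unless a UDR overrides that route and sends it to a firewall/NVA).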
  2. Outside? The effective route table (system + UDRs) picks the next hop: internet (via NAT Gateway / public IP), the firewall/NVA (if a UDR forces it), the VPN/ExpressRoute gateway, or a peering.
  3. NSGs (subnet + NIC, by priority) must allow it - remember default deny-inbound-from-internet.
  4. For private PaaS: Private Endpoint + correct Private DNS resolution, or Service Endpoint with the public FQDN.

Debugging is almost always: what do effective routes say? what do effective NSG rules say (both directions)? does DNS resolve to the right (private) IP?
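
The next-hop decision in step 2 can be sketched as a longest-prefix match over the effective routes, with a user-defined route winning over BGP and system routes when prefixes tie. The `Route` type, the sample table, and the next-hop labels below are illustrative assumptions, not output from Azure.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str  # "user" (UDR), "bgp", or "system"

# On equal prefix length, prefer UDR, then BGP, then system routes.
SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}

def pick_route(dest_ip, routes):
    """Return the route that governs dest_ip, or None if nothing matches."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [r for r in routes if ip in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, SOURCE_RANK[r.source]),
    )

routes = [
    Route("10.0.0.0/16", "VnetLocal", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
    Route("0.0.0.0/0", "VirtualAppliance 10.0.1.4", "user"),  # forces egress via the hub firewall
]

print(pick_route("10.0.2.5", routes).next_hop)  # stays inside the VNet
print(pick_route("8.8.8.8", routes).next_hop)   # the UDR overrides the system default route
```

This is the same question "Effective routes" answers on a NIC, so use the sketch to reason about a design and the portal/CLI to confirm what is actually applied.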

Reference diagrams

Hub-and-spoke with centralized egress

[Diagram: a hub VNet (Azure Firewall for egress and inspection; VPN/ExpressRoute gateway, DNS, Bastion) peered to a prod spoke VNet (app + db subnets with NSGs) and a nonprod spoke VNet, with on-premises connected via ExpressRoute/VPN. Spokes route 0.0.0.0/0 (UDR) to the hub firewall; spoke-to-spoke transits the firewall (peering is not transitive).]
Central hub holds firewall, gateways, DNS, and Bastion; spokes route egress and cross-spoke traffic through the hub.

Private Endpoint pattern

[Diagram: a spoke VNet/subnet with no public IPs holds an app VM and a Private Endpoint; a Private DNS zone (privatelink.database...) resolves the service name to the private IP; the target is Azure SQL / Storage with public access disabled.]
A Private Endpoint gives the PaaS resource a private IP in your subnet; the Private DNS zone (linked to the VNet) resolves its FQDN to that IP. Public access is disabled.

Network Watcher

Tool | What it gives you
IP Flow Verify | "Is this specific 5-tuple allowed or denied, and by which NSG rule?" First stop for NSG questions.
Effective security rules / Effective routes | The actual merged NSG rules and routes applied to a NIC - what really governs the traffic.
Connection Monitor / Connection Troubleshoot | Test and continuously monitor reachability/latency A→B, naming the blocker.
NSG flow logs | Connection records for monitoring, forensics, and "is my rule dropping this?" (into a storage account / Log Analytics).
Start with IP Flow Verify + Effective routes
Before hand-reading rules, run IP Flow Verify (names the deciding NSG rule) and check Effective routes on the NIC (shows the real next hop). Together they resolve most "cannot reach" cases in seconds.
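
To see why a single deciding rule exists, here is a minimal model of NSG evaluation: rules are processed in priority order (lowest number first) and the first match wins. It is a sketch under simplifying assumptions: one NSG only (real traffic must pass both the subnet and NIC NSGs), source CIDR and destination port as the only match fields, and an illustrative rule set.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    name: str
    priority: int             # lower number is evaluated first
    source: str               # source CIDR
    dest_port: Optional[int]  # None means any port
    access: str               # "Allow" or "Deny"

def flow_verify(src_ip, dest_port, rules):
    """Return (access, rule name) for the first rule that matches, in priority order."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip in ipaddress.ip_network(rule.source) and rule.dest_port in (None, dest_port):
            return rule.access, rule.name
    return "Deny", None

rules = [
    Rule("allow-sql-from-app", 100, "10.0.1.0/24", 1433, "Allow"),
    Rule("DenyAllInBound", 65500, "0.0.0.0/0", None, "Deny"),
]

print(flow_verify("10.0.1.7", 1433, rules))  # allowed by the app-subnet rule
print(flow_verify("10.0.9.9", 1433, rules))  # falls through to the default deny
```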

Networking troubleshooting

⚑ VM cannot reach internet / private VM cannot download patches

Causes: no explicit outbound path (NAT Gateway/firewall) on a subnet where default outbound access is no longer available; a UDR sends 0.0.0.0/0 to a firewall that blocks the traffic or whose return path is asymmetric; an NSG egress deny; no public IP where one is required. Checks: Effective routes on the NIC; NAT Gateway on the subnet; firewall rules; IP Flow Verify. Fix: add a NAT Gateway (or a firewall egress allow for repos/service tags); fix the UDR/return routing. For OS repos, allow the relevant service tags/FQDNs on the firewall.

⚑ VM cannot reach a storage account / PaaS privately

Causes: the Private Endpoint's Private DNS zone is not linked to the VNet (the FQDN resolves to the public IP); missing A record; on-prem DNS lacks a conditional forwarder; the storage account is still reachable only over its public path (for example, the Private Endpoint connection is not approved); NSG blocking. Checks: nslookup the FQDN from the VM (should return the private IP); Private DNS zone VNet links; endpoint connection state. Fix: link the correct privatelink.* zone to the VNet, add the A record (or use the auto DNS integration), set hybrid forwarding.

⚑ Application cannot connect across VNets / peering issue

Causes: relying on transitive peering (unsupported); overlapping CIDRs; peering missing "allow forwarded traffic"/gateway transit; NSG blocking the peer range; UDR not routing spoke-to-spoke via the hub firewall. Fix: route cross-spoke through the hub firewall (UDR + allow forwarded traffic) or use Virtual WAN; resolve overlap; open NSGs for the peer range.
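
Overlapping CIDRs are easy to detect before peering is attempted. A minimal sketch using Python's `ipaddress` module; the VNet names and address spaces are hypothetical.

```python
import ipaddress
from itertools import combinations

def find_overlaps(address_spaces):
    """Return pairs of names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2) if nets[a].overlaps(nets[b])]

address_spaces = {
    "hub": "10.0.0.0/16",
    "spoke-a": "10.1.0.0/16",
    "spoke-b": "10.1.128.0/17",  # sits inside spoke-a's range
}

print(find_overlaps(address_spaces))
```

Including on-prem ranges in the same dictionary extends the check to hybrid connectivity.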

⚑ On-premises cannot reach Azure / ExpressRoute / VPN issue

Causes: CIDR overlap; BGP not advertising routes both ways; VPN tunnel down (IKE/PSK mismatch); ExpressRoute circuit/peering down or route filters wrong; gateway SKU bandwidth exhausted; NSG/firewall blocking on-prem range. Checks: gateway/connection status; BGP learned/advertised routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU. Portal: Virtual network gateways / ExpressRoute circuits.

⚑ Application Gateway health probe failing

Causes: NSG on the backend/App Gateway subnet blocking the probe or the required App Gateway management ports; wrong probe host/path/port/protocol; backend app on localhost or wrong port; certificate mismatch on HTTPS probes; App Gateway subnet missing required outbound. Fix: allow the probe + management traffic, align the probe, bind the app to all interfaces. Full flow in section 7.

⚑ DNS / NSG / route / Azure Firewall / Private DNS issue

Method: IP Flow Verify (NSG decision), Effective routes/rules (real next hop + merged rules), NSG flow logs (drops), and nslookup for DNS. For Azure Firewall, check the network/application rule collections and the UDR forcing traffic to it; for Private DNS, verify zone links and forwarders.

Azure networking gotchas

  • NSG vs Azure Firewall - NSG is stateful L3/L4 filtering, not a firewall appliance; use both.
  • Private Endpoint vs Service Endpoint - different mechanisms; Private Endpoint needs Private DNS zone linkage.
  • Forgetting Private DNS zone linkage - the top Private Endpoint failure.
  • Overlapping CIDRs - break peering and hybrid; plan IP space early.
  • Poor hub-and-spoke - decide hub ownership, firewall, and UDRs before spokes land; consider Virtual WAN.
  • Databases/services on public endpoints - use Private Endpoint; disable public access; deny public by Policy.
  • Peering is not transitive - route spoke-to-spoke via the hub firewall or use Virtual WAN.
  • Route tables (UDR) - forced tunneling and asymmetric routing black-hole traffic; design return paths.
  • Default outbound retiring - give every subnet an explicit outbound (NAT Gateway/firewall).
  • Gateway SKU limits - VPN/ExpressRoute/App Gateway SKUs cap bandwidth and features; size deliberately.

4. Compute Deep Dive

Azure Virtual Machines (series, disks, placement, HA), Scale Sets, Spot, images, and the managed/serverless options (App Service, Functions, Container Apps) - how to choose, place, scale, and operate compute on Azure.

Last reviewed: July 2026 VM series/sizes and pricing change - verify current SKUs and zone support in the portal.
TL;DR

Azure VMs come in series (B/D general, F compute, E/M memory, L storage, N GPU) with families and sizes. Use Availability Zones + Virtual Machine Scale Sets (flexible orchestration) for HA and autoscale, managed disks (Premium SSD v2 / Ultra for demanding I/O), Bastion for keyless RDP/SSH, and managed identity for keyless API access. For new apps, consider App Service, Container Apps, or Functions before managing VMs. Reserved Instances / Savings Plan and Azure Hybrid Benefit are the main cost levers.

VM series and families

Series | Family | Best for
B | Burstable | Low-average, spiky workloads (dev, small web, bastions)
D (Dv5/Dasv5...) | General purpose | Web, app, microservices - the default
F | Compute optimized | High CPU-to-memory: batch, gaming, app servers
E | Memory optimized | Databases, in-memory caches, mid-size SAP
M / Mv3 | Memory optimized (very large) | Large databases, SAP HANA (certified), in-memory analytics
L (Lsv3) | Storage optimized | High local-disk IOPS/throughput: NoSQL, big data, data nodes
N (NC/ND/NV) | GPU | AI/ML training/inference, rendering, visualization
DC | Confidential computing | Data-in-use encryption (SGX / AMD SEV-SNP)
DBA note - E/M series for databases and SAP
For SQL Server / Oracle / large databases and SAP HANA, use E or M (memory-optimized) sizes with high memory-to-vCPU ratios and pair with Premium SSD v2 / Ultra Disk sized for IOPS/throughput. HANA and large DB sizes have specific certified SKUs - confirm certification and the constrained-vCPU options (fewer active cores for per-core licensing) before committing.

Availability Zones, Availability Sets, and Scale Sets

Mechanism | Protects against | Notes
Availability Zones | Datacenter failure within a region | Spread VMs/scale sets across 2-3 zones; ~99.99% VM SLA. Preferred for new HA.
Availability Set | Rack (fault domain) & maintenance (update domain) failure in one datacenter | Use where zones aren't available; cannot span zones.
Virtual Machine Scale Sets (VMSS) | - | Manage identical VMs with autoscale + rolling upgrades; Flexible orchestration is the modern default (can span zones + mix sizes).
Proximity Placement Group | - | Co-locates VMs for lowest latency (e.g. app + DB tier); trades off against zone spread.
Common mistake
Running a single VM (or an availability set in a zone-capable region) for a production service - a datacenter event takes it down. Use a zone-spread VMSS (Flexible) behind a zone-redundant load balancer with health probes and rolling upgrades. Note that a single VM gets a lower SLA than a multi-instance availability-set or zone deployment.

Spot VMs and dedicated hosts

Option | What it does | Use when
Spot VMs | Deeply discounted capacity Azure can evict with 30s notice | Fault-tolerant, interruptible batch/CI/render - never stateful prod
Dedicated Hosts | A physical server dedicated to your subscription | Compliance/isolation, or per-core licensing needing host affinity/visibility
Reserved Instances / Savings Plan | 1- or 3-year commitment for a discount | Steady-state baseline compute (section 14)
Azure Hybrid Benefit | Apply existing Windows Server / SQL Server licenses | Big savings for Microsoft-licensed workloads
Cost note
Cover steady-state with Reserved Instances or the Savings Plan for Compute, run interruptible batch on Spot (often 60-90% off), and apply Azure Hybrid Benefit to Windows/SQL licenses you already own (frequently the biggest single saving for a Microsoft shop). Right-size before you commit.

Images, Compute Gallery, and managed disks

  • Marketplace/platform images and custom images; the Azure Compute Gallery versions and replicates images across regions for scale.
  • Managed disks: Standard HDD, Standard SSD, Premium SSD, Premium SSD v2 (independently tunable IOPS/throughput/size), and Ultra Disk (highest performance, sub-ms). Ephemeral OS disks live on the host (fast, free, but lost on deallocate - stateless only).
  • Snapshots and disk encryption (platform-managed keys, customer-managed keys via Key Vault, or Azure Disk Encryption/host-based).
DBA note - Premium SSD v2 / Ultra for databases
Database I/O is usually the bottleneck. Premium SSD v2 lets you tune IOPS and throughput independently of size (better cost/performance than provisioning huge Premium SSDs for IOPS); use Ultra Disk for the most demanding low-latency workloads (log disks). Stripe multiple disks for higher aggregate throughput where a single disk caps out, and don't put temp/scratch data on ephemeral-only (host-local) storage if it must persist.

Access and management

  • Azure Bastion - browser-based RDP/SSH to VMs with no public IP and no exposed 3389/22, gated by RBAC. The secure default for admin access.
  • Serial console + boot diagnostics for out-of-band recovery.
  • VM extensions (Custom Script, DSC), Run Command for ad-hoc commands, Azure Monitor Agent for telemetry.
  • Update Manager for patch orchestration; Azure Arc to manage on-prem/other-cloud servers with the same tooling (policy, monitoring, extensions).
Security note - Bastion + managed identity, no public IPs
The secure pattern: VMs have no public IP, admins connect via Azure Bastion (RBAC-gated, no exposed RDP/SSH), and the VM uses a managed identity for any Azure API access. This removes public RDP/SSH exposure (a top attack vector) and eliminates stored credentials. Enforce "no public IP on VMs" with Azure Policy.

App Service, Functions, Container Apps

Service | What it is | Use for
App Service | Managed PaaS for web apps/APIs (Windows/Linux, code or container) | Web apps/APIs without managing VMs - a common default
Azure Functions | Event-driven serverless functions (Consumption/Premium/Flex plans) | Event handlers, glue, automation, scale-to-zero
Container Apps | Serverless containers (Kubernetes/KEDA under the hood, no cluster to run) | Microservices/containers without managing AKS - scale to zero, Dapr optional
Container Instances (ACI) | Single containers on demand | Simple, short-lived container tasks
Architect note - reach for PaaS/serverless first
For a new stateless web app or API, start with App Service or Container Apps; for event-driven code, Functions. No VM patching, built-in scaling and slots, managed TLS, and easy managed identity. Drop to VMs/AKS only when you need OS control, specialized kernels/GPUs, long-lived stateful services, or the Kubernetes ecosystem. This cuts a lot of ops versus the old "spin up a VM" reflex.

Choosing VM families by workload

Workload | Starting point
Web / API | App Service/Container Apps; or D-series VMSS behind a load balancer
Middleware | D/E-series VMSS, memory-leaning
Databases (self-managed) | E/M-series + Premium SSD v2/Ultra; or Azure SQL/PaaS (section 6)
Oracle workloads | E/M-series VM (self-managed Oracle) or Oracle Database@Azure; constrained-vCPU for licensing
SAP | M-series (HANA-certified), proximity placement group, Ultra Disk
Batch / CI / render | Spot VMs in a VMSS, or Azure Batch
Memory-heavy | E or M series
CPU-heavy | F series
Storage-heavy (local IOPS) | L series
GPU / AI | N series (NC/ND/NV); or Azure ML (section 12)
Cost-sensitive / spiky | B-series + autoscale; Spot for fault-tolerant parts; RIs/Savings Plan for baseline

Operational guidance

Resize / patch VMs safely (Ops)
  • Resize: changing size restarts the VM (brief downtime), and the target size must be available in the region/zone. If the size isn't offered on the current host cluster, stop/deallocate first, resize, then start. In a VMSS, update the model and roll the instances.
  • Patch: use Update Manager for scheduled, reported OS patching across VMs (and Arc servers); for VMSS prefer replacing instances from a new image (immutable) over in-place patching.
Troubleshoot boot / high CPU / memory / disk (Ops)
  • Boot: enable boot diagnostics + serial console to see boot output; use the VM "Redeploy"/"Reset password"/repair-VM options for stuck boots.
  • High CPU / memory: Azure Monitor + VM Insights (install the Azure Monitor Agent - guest memory isn't collected without it); right-size or autoscale.
  • Disk full: expand the managed disk, then grow the partition/filesystem; alert at 85%.
  • Disk attach: confirm the disk is attached and initialized; check the disk is in the same region; LUN mapping.
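
The 85% disk alert above is simple to express as a guest-side check. A minimal sketch, assuming a hypothetical helper that would feed whatever alerting you already use; in practice Azure Monitor / VM Insights would usually carry this rule.

```python
import shutil

def disk_usage_ratio(path="/"):
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def should_alert(used_ratio, threshold=0.85):
    """True when usage has reached the alert threshold (85% by default)."""
    return used_ratio >= threshold

if should_alert(disk_usage_ratio("/")):
    print("Disk above 85%: expand the managed disk, then grow the partition/filesystem")
```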
Design compute for production HA (Design)
  • Zone-spread VMSS (Flexible) + autoscale + health probes behind a zone-redundant load balancer.
  • No public IPs; Bastion for access; managed identity for API access.
  • Azure Monitor Agent + VM Insights; Update Manager for patch compliance; Compute Gallery for golden images.
  • RIs/Savings Plan + Hybrid Benefit for cost; a paired region for DR (Azure Site Recovery, section 13).
Operations note - maintenance & live migration
Azure uses memory-preserving live migration and scheduled maintenance for most host events, often with no reboot; some events require a reboot you can control via Scheduled Events (the metadata endpoint that warns the VM). Design for it: zone/VMSS spread + health probes so a maintenance reboot never takes the whole tier. GPU and specialized sizes may not live-migrate.
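
A VM can poll Scheduled Events and drain itself before a disruptive event. The sketch below only parses a response document; the endpoint URL, api-version, field names, and event types are written from memory of the Instance Metadata Service and should be treated as assumptions to verify against current documentation. The sample document and VM name are made up.

```python
import json

# Assumed IMDS endpoint; a real request also needs the header "Metadata: true"
# and only works from inside an Azure VM.
SCHEDULED_EVENTS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Event types treated here as disruptive (assumption: "Freeze" is a short pause, so excluded).
DISRUPTIVE = {"Reboot", "Redeploy", "Preempt", "Terminate"}

def pending_disruptions(document, vm_name):
    """Return the disruptive event types that target vm_name in a Scheduled Events document."""
    events = json.loads(document).get("Events", [])
    return [
        e["EventType"]
        for e in events
        if vm_name in e.get("Resources", []) and e.get("EventType") in DISRUPTIVE
    ]

sample = json.dumps({
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "hypothetical-1", "EventType": "Reboot", "ResourceType": "VirtualMachine",
         "Resources": ["vm-app-01"], "EventStatus": "Scheduled",
         "NotBefore": "Wed, 01 Jul 2026 12:00:00 GMT"},
        {"EventId": "hypothetical-2", "EventType": "Freeze", "ResourceType": "VirtualMachine",
         "Resources": ["vm-app-01"], "EventStatus": "Scheduled",
         "NotBefore": "Wed, 01 Jul 2026 13:00:00 GMT"},
    ],
})

print(pending_disruptions(sample, "vm-app-01"))
```

A health probe endpoint can return unhealthy while this list is non-empty, so the load balancer drains the instance before the reboot.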

5. Storage Deep Dive

Managed Disks, Blob Storage, Azure Files, and NetApp Files - their performance, redundancy (LRS/ZRS/GRS/GZRS), tiers, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes. Plus Azure Backup and Site Recovery.

Last reviewed: July 2026 Verify disk SKUs, redundancy options, and access-tier behavior in current docs.
TL;DR

Managed Disks (Standard HDD/SSD, Premium SSD, Premium SSD v2, Ultra) = block storage for VMs. Blob Storage (Hot/Cool/Cold/Archive tiers) = object storage for backups, data lakes, static content - not a filesystem. Azure Files = managed SMB/NFS shares; Azure NetApp Files = high-performance enterprise NAS. Choose redundancy (LRS/ZRS/GRS/GZRS) deliberately per data. Lock down storage accounts (disable public blob access, use Private Endpoints, prefer managed identity over keys/SAS). Azure Backup for backup, Site Recovery for DR replication.

Managed Disks

Disk type | Profile | Use for
Standard HDD | Lowest cost, low IOPS | Dev/test, infrequently accessed data
Standard SSD | Better latency/consistency than HDD | Light production, web servers
Premium SSD | Production-grade, size-linked performance | Most production VMs/databases
Premium SSD v2 | IOPS/throughput tunable independently of size | Best cost/perf for databases needing IOPS without huge capacity
Ultra Disk | Highest performance, sub-ms latency | Top-tier databases (log disks), SAP HANA
Ephemeral OS disk | On the host - fast, free, lost on deallocate | Stateless VMSS OS disks only
DBA note - size for IOPS, not just capacity
Provision Premium SSD v2 (tune IOPS/throughput) or Ultra for database data/log disks; enable read-only host caching for data disks and none for log disks (per SQL/Oracle guidance). Stripe disks for higher aggregate throughput when one disk caps out. Regional resilience for a stateful VM comes from zone-redundant deployment + Azure Backup/Site Recovery, not the disk alone (managed disks are zonal).

Blob Storage

  • A storage account holds blob containers (plus optionally Files/Queues/Tables). Blobs come in access tiers: Hot (frequent access), Cool (infrequent; ~30-day minimum), Cold (rare; ~90-day minimum), Archive (offline, cheapest; must be rehydrated before reading, which takes hours).
  • Lifecycle management auto-moves/deletes blobs by age/access; versioning + soft delete protect against accidental change/delete; immutable storage (time-based/legal hold) gives WORM compliance; object replication copies between accounts/regions.
  • Data Lake Storage Gen2 = a storage account with hierarchical namespace enabled (real directories, POSIX ACLs) for analytics - covered in section 11.
  • Access: prefer Entra ID + RBAC data roles + managed identity. SAS tokens (account/service/user-delegation) grant scoped, time-boxed access; stored access policies let you revoke them. Storage account keys are all-powerful - avoid distributing them.
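
A lifecycle rule is essentially an age-to-tier mapping. The sketch below shows that logic in isolation; the 30/90/180-day thresholds are hypothetical policy values chosen for illustration, not Azure defaults, and a real policy is defined on the storage account rather than in application code.

```python
def target_tier(days_since_access, cool_after=30, cold_after=90, archive_after=180):
    """Map a blob's age since last access to an access tier (thresholds are illustrative)."""
    if days_since_access >= archive_after:
        return "Archive"
    if days_since_access >= cold_after:
        return "Cold"
    if days_since_access >= cool_after:
        return "Cool"
    return "Hot"

for age in (5, 45, 120, 400):
    print(age, target_tier(age))
```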
Common mistake - Blob is not a filesystem; public exposure
Blobs are objects (no in-place random writes, no POSIX locking on flat namespace) - don't run a lock-dependent app on them; use Files/NetApp for filesystem semantics. And the classic breach: a storage account with "Allow Blob public access" on and an anonymous container. Disable public blob access at the account level (enforce via Azure Policy), use Private Endpoints, and prefer managed identity + RBAC over keys/SAS.
Security note - SAS tokens are a leak risk
A SAS token is a bearer credential - anyone with the URL has its access until it expires. Prefer user-delegation SAS (backed by Entra, revocable) or managed identity + RBAC instead of account-key SAS; keep lifetimes short; back long-lived SAS with a stored access policy so you can revoke; and never commit SAS/keys to code. Turn on soft delete + versioning for recoverability.
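
SAS hygiene can be checked mechanically from the token's query string. This sketch reads the expiry (`se`), treats the presence of `skoid` as the marker of a user-delegation SAS and `si` as a reference to a stored access policy, and flags lifetimes over a limit. The one-hour limit, the function name, and the sample URL are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def inspect_sas(url, now, max_lifetime=timedelta(hours=1)):
    """Summarize the risk-relevant properties of a SAS URL without contacting Azure."""
    params = parse_qs(urlparse(url).query)
    report = {
        "user_delegation": "skoid" in params,      # signed with an Entra-backed delegation key
        "stored_access_policy": "si" in params,    # revocable via the policy
        "expired": None,
        "too_long": None,
    }
    if "se" in params:  # expiry may instead live in the stored access policy
        expiry = datetime.fromisoformat(params["se"][0].replace("Z", "+00:00"))
        report["expired"] = expiry <= now
        report["too_long"] = (expiry - now) > max_lifetime
    return report

sas_url = (
    "https://example.blob.core.windows.net/container/report.csv"
    "?sv=2022-11-02&sr=b&sp=r&se=2026-07-01T12:30:00Z"
    "&skoid=00000000-0000-0000-0000-000000000000&sig=REDACTED"
)
now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
print(inspect_sas(sas_url, now))
```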

Redundancy: LRS, ZRS, GRS, GZRS

Option | Copies | Protects against
LRS | 3 copies in one datacenter | Disk/rack failure only
ZRS | 3 copies across zones in the region | Datacenter/zone failure (in-region HA)
GRS | LRS + async copy to the paired region | Regional disaster (read access with RA-GRS)
GZRS | ZRS + async copy to the paired region | Zone and regional failure (highest)
Architect note - choose redundancy per data
Don't default everything to GRS (paying for cross-region copies you may not need) or leave critical data on LRS (no zone/region protection). Use ZRS/GZRS for data that must survive a zone loss, GRS/GZRS for data needing regional DR, and LRS for easily-reproducible or dev data. GRS is async - failover has an RPO and (for customer-initiated failover) operational steps; it is not zero-RPO.
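
The per-data choice reduces to two questions, which a small helper makes explicit. This is a sketch of the rule of thumb in the note above, with a hypothetical function name; it ignores cost and read-access variants such as RA-GRS.

```python
def choose_redundancy(survive_zone_loss, need_regional_dr):
    """Pick a storage redundancy option from the two requirements that drive it."""
    if survive_zone_loss and need_regional_dr:
        return "GZRS"
    if need_regional_dr:
        return "GRS"
    if survive_zone_loss:
        return "ZRS"
    return "LRS"  # easily reproducible or dev data

print(choose_redundancy(survive_zone_loss=True, need_regional_dr=False))
print(choose_redundancy(survive_zone_loss=False, need_regional_dr=False))
```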

Azure Files, File Sync, and NetApp Files

  • Azure Files - managed SMB and NFS shares (Standard on HDD, Premium on SSD). Mount from VMs, on-prem, or containers. Identity-based auth (Entra/AD) for SMB.
  • Azure File Sync - cache Azure Files on on-prem Windows servers with cloud tiering (hybrid file access).
  • Azure NetApp Files - high-performance, low-latency enterprise NAS (NFS/SMB) for demanding workloads: SAP, HPC, large shared filesystems, and Oracle datafiles over NFS. Separate service, higher performance/cost.
DBA note
For Oracle on Azure VMs needing shared/NFS storage or extreme performance, Azure NetApp Files is the common choice (certified performance tiers, snapshots). For SAP shared filesystems (sapmnt, transport) NetApp Files or Premium Files is standard. Confirm certification and performance tier for the workload.

Azure Backup and Azure Site Recovery

Aspect | Azure Backup | Azure Site Recovery (ASR)
Purpose | Point-in-time backup & restore (VMs, disks, files, SQL/SAP in VM, Blob) | Continuous replication for DR failover to another region
Recovery | Restore from a recovery point (higher RTO/RPO) | Fail over the whole workload with low RPO
Use for | Data protection, retention, ransomware recovery (immutable vault) | Cross-region DR of running workloads
Security note
Enable immutable + soft-delete on the Recovery Services/Backup vault and multi-user authorization so backups themselves survive a ransomware/insider attack. A backup an attacker can delete is not a backup.

When to use which

Need | Use
VM OS / database disks | Managed Disks (Premium SSD v2 / Ultra for DB)
Backups (DB/app) | Azure Backup + Blob (Cool/Archive) with lifecycle + immutability
Log / long-term archive | Blob Archive tier + lifecycle + immutable policy
Data lake | Blob with hierarchical namespace (ADLS Gen2)
Shared filesystem (SMB/NFS) | Azure Files (or NetApp Files for high performance)
High-performance NAS / SAP / Oracle NFS | Azure NetApp Files
Static website / media | Blob static website + Front Door/CDN
Hybrid file access | Azure File Sync
Bulk data into Azure | Data Box (physical) / AzCopy (online)
Cross-region DR of workloads | Azure Site Recovery

Storage gotchas

  • Blob is object storage, not a filesystem - no random writes/locks; use Files/NetApp for that.
  • Archive tier has a rehydration delay (hours) - never for data you need immediately.
  • SAS tokens are a security risk - short-lived, user-delegation, revocable; prefer managed identity + RBAC.
  • Public blob access exposure - disable at account level; enforce by Policy.
  • Private Endpoint DNS mistakes - link the correct privatelink.* zone or nothing resolves.
  • Disk performance sizing - IOPS/throughput follow SKU/size; use Premium SSD v2/Ultra + striping for DBs.
  • Snapshot cost growth - incremental snapshots still accumulate; set retention.
  • Cross-region replication cost - GRS/object replication cost money and lag (async RPO).
  • Wrong redundancy - confusing LRS/ZRS/GRS/GZRS leaves data under- or over-protected.
  • Ephemeral OS disk data loss - it's lost on deallocate; stateless only.

6. Database Services Deep Dive

Azure's database portfolio - Azure SQL (Database / Managed Instance / on VM), PostgreSQL, MySQL, Cosmos DB, Cache for Redis, and the analytics stores - what each manages, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from SQL Server or Oracle.

Last reviewed: July 2026 DB features, tiers, and limits change - verify in current docs. Oracle Database@Azure especially.
TL;DR

For SQL Server workloads, choose along a spectrum: Azure SQL Database (fully-managed, cloud-native, most managed) → SQL Managed Instance (near-full SQL Server compatibility, managed) → SQL Server on a VM (full control, you manage). For open source, Azure Database for PostgreSQL / MySQL (Flexible Server). For NoSQL at global scale, Cosmos DB. For cache, Azure Cache for Redis. For analytics, Synapse / Microsoft Fabric. Oracle has no native managed service - use a VM or Oracle Database@Azure. Managed services own patching/backup/HA; you own schema, queries, and access.

The portfolio at a glance

Service | Model | Sweet spot
Azure SQL Database | Fully-managed SQL (single DB / elastic pool; DTU or vCore) | New/cloud-native SQL apps; most managed, least control
Azure SQL Managed Instance | Managed instance with near-full SQL Server surface (SQL Agent, cross-DB, CLR, linked servers) | Lift-and-shift SQL Server needing instance features
SQL Server on Azure VM | IaaS - you run SQL Server | Full control, unsupported features/versions, OS access
Azure DB for PostgreSQL / MySQL (Flexible Server) | Managed OSS databases | Postgres/MySQL apps; zone-redundant HA
Cosmos DB | Globally-distributed multi-model NoSQL (NoSQL/Mongo/Cassandra/Gremlin/Table APIs) | Global scale, low latency, elastic throughput
Azure Cache for Redis | Managed Redis | Cache, session, rate limiting, leaderboards
Synapse / Microsoft Fabric / Data Explorer | Analytics warehouse / lakehouse / log-time-series | Analytics, not OLTP (section 11)
Oracle Database@Azure | Oracle Exadata/Autonomous run by Oracle inside Azure datacenters | Oracle workloads wanting managed Oracle in Azure (verify current availability)

Service deep dives

Azure SQL Database

  • Purchasing models: vCore (recommended; General Purpose / Business Critical / Hyperscale service tiers) or legacy DTU. Serverless compute auto-scales and can auto-pause. Hyperscale scales storage to 100TB+ with fast backups/restores.
  • HA: built-in; Business Critical and zone-redundant options give higher SLAs. Failover groups + active geo-replication for cross-region DR (readable secondaries).
  • Backups: automatic with point-in-time restore (retention configurable) + long-term retention; you don't manage backup files.
  • Patching: fully Microsoft-managed (in maintenance windows you can set).
  • Limits: no SQL Agent, no cross-database queries by default, no instance-level features - it's a single database service. If you need those, use Managed Instance.
DBA note
Azure SQL Database is not "SQL Server in the cloud" - it's a cloud-native database service. Great for new apps and single-database workloads, but instance-level features (Agent jobs, cross-DB, CLR, linked servers) aren't there. Choose Managed Instance when a lift-and-shift needs them.

Azure SQL Managed Instance

A managed instance with near-full SQL Server compatibility: SQL Agent, cross-database queries, CLR, Service Broker, linked servers, and instance-scoped features - deployed into your VNet (private). The best target for lifting an existing SQL Server estate while offloading patching/backup/HA.

  • HA: built-in; Business Critical + zone redundancy; failover groups for cross-region DR.
  • Backups/patching: managed (automatic backups + PITR; Microsoft patches in windows).
  • Networking: lives in a delegated subnet in your VNet - plan the subnet and connectivity.
  • Still has limits vs. a full SQL Server on a VM (some instance features, no OS access, specific configurations). Verify the supported surface for your app.

Azure Database for PostgreSQL / MySQL (Flexible Server)

  • Flexible Server is the current model: zone-redundant HA (standby in another zone), configurable maintenance windows, private access (VNet-injected or Private Endpoint), and server parameters.
  • HA/DR: zone-redundant HA in-region; read replicas (incl. cross-region) for scaling and DR.
  • Backups: automatic + PITR; you set retention.
  • Postgres supports the pgvector extension for AI/vector workloads (section 12); MySQL for common LAMP-style apps.
DBA note
Flexible Server gives more control (parameters, maintenance timing, HA choice) than the older Single Server. Plan private networking (VNet integration or Private Endpoint) up front; enable zone-redundant HA explicitly for production.

Cosmos DB & Azure Cache for Redis

  • Cosmos DB - globally-distributed, multi-model NoSQL with turnkey multi-region writes, 5 consistency levels, elastic RU/s (or autoscale/serverless), and single-digit-ms latency. Partition key design is critical - a poor key hotspots partitions and inflates RU cost. Has an analytical store + vector search.
  • Azure Cache for Redis - managed Redis for caching, sessions, and pub/sub; tiers (Basic/Standard/Premium/Enterprise) trade HA, clustering, persistence, and modules.
DBA note - Cosmos is not relational
Cosmos DB is NoSQL - no joins, no ad-hoc SQL across containers, throughput is provisioned as RU/s and cost is driven by partition-key design and query efficiency. Model for access patterns and even key distribution; don't force a relational schema onto it.
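
Key skew can be estimated from a sample of documents before committing to a partition key. The sketch below counts items per logical key and reports the share held by the busiest one; it is a rough proxy only (it ignores request volume and how Cosmos DB maps logical keys to physical partitions), and the sample keys are made up.

```python
from collections import Counter

def hottest_share(partition_keys):
    """Fraction of items that land on the single most common partition key value."""
    counts = Counter(partition_keys)
    return max(counts.values()) / sum(counts.values())

by_country = ["US"] * 9 + ["DE"]             # low-cardinality key: one value dominates
by_user = [f"user-{i}" for i in range(10)]   # high-cardinality key: evenly spread

print(hottest_share(by_country))
print(hottest_share(by_user))
```

A share close to 1.0 suggests a hot partition; a value near 1/N for N keys suggests even distribution.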

Database service decision table

Workload | Recommended | Reason | HA | DR | Ops responsibility | Cost lever
New SQL app (single DB) | Azure SQL Database | Most managed, cloud-native | Built-in / zone-redundant | Failover group / geo-replica | Schema/queries | vCore right-size; serverless auto-pause
Lift-and-shift SQL Server | SQL Managed Instance | Instance-level compatibility, managed | Built-in / zone-redundant | Failover group | Schema/queries + agent jobs | Right-size; Hybrid Benefit
Full control / unsupported feature | SQL Server on Azure VM | OS + full SQL control | You build (AG/FCI + zones) | You build (AG/ASR) | Everything | Hybrid Benefit; VM size
PostgreSQL app | Azure DB for PostgreSQL (Flexible) | Managed, zone-redundant | Zone-redundant HA | Cross-region read replica | Schema/queries | Right-size; burstable tiers
MySQL web app | Azure DB for MySQL (Flexible) | Managed, common | Zone-redundant HA | Read replica | Schema/queries | Right-size
Global NoSQL / low latency | Cosmos DB | Global distribution, elastic | Built-in | Multi-region | Data model / partition key | Autoscale RU / serverless
Cache / session | Azure Cache for Redis | Managed Redis | Premium/Enterprise HA | Geo-replication (Enterprise) | Keys/TTL | Right-size tier
Data warehouse / analytics | Synapse / Microsoft Fabric | Analytics, not OLTP | Built-in | Config-dependent | Schema/queries | Pause/scale compute
Oracle workload | Oracle DB@Azure or Oracle on VM | No native managed Oracle | Oracle-managed / you build | Data Guard | Oracle side | Licensing; verify offering

Connectivity & security

  • Private Endpoint (or VNet injection for MI/Flexible Server) is the production default - disable public network access. Plan the Private DNS zone (privatelink.database.windows.net, etc.).
  • Entra authentication + managed identity - apps authenticate to Azure SQL/Postgres/MySQL via Entra tokens (no passwords in config). Prefer this over SQL logins.
  • Encryption: TDE (transparent data encryption, on by default; customer-managed keys via Key Vault), Always Encrypted for column-level protection from even DBAs, TLS in transit.
  • Microsoft Defender for SQL - vulnerability assessment + threat detection; Query Performance Insight + automatic tuning for performance.
Common mistake
Leaving a database on its public endpoint with broad firewall rules "to connect quickly." Disable public network access, use Private Endpoint + Private DNS, and authenticate with Entra + managed identity. Retrofitting private connectivity later is a migration - plan it up front.

How HA, DR, backup, and patching differ

Service | HA | DR | Backup | Patching
Azure SQL DB | Built-in; zone-redundant / Business Critical | Failover groups + active geo-replication | Automatic + PITR + LTR | Microsoft (maintenance windows)
SQL Managed Instance | Built-in; zone-redundant | Failover groups | Automatic + PITR | Microsoft (maintenance windows)
PostgreSQL / MySQL Flexible | Zone-redundant HA (opt-in) | Cross-region read replica | Automatic + PITR | Microsoft (maintenance windows)
SQL Server on VM | You build: Always On AG / FCI + zones | You build: AG replica / Azure Site Recovery | You configure (Azure Backup for SQL) | You (Update Manager)
Cosmos DB | Built-in | Multi-region (turnkey) | Continuous backup + PITR | Fully managed
Operations note - test restores & failover
"Automatic backups" does not prove recoverability - periodically restore/PITR to a new database and validate. For DR, rehearse the failover group failover (or AG failover for VMs): confirm the secondary is within RPO, connection strings/listeners repoint, and the app works end-to-end - not just that the database opened.

Azure database gotchas for Oracle DBAs

For DBAs coming from Oracle (or SQL Server on-prem)
  • Azure SQL Database != SQL Server on a VM - it's a cloud-native single-database service without instance features (Agent, cross-DB, linked servers, OS access).
  • Managed Instance gives more compatibility but still has limits - verify your specific instance features are supported before assuming a clean lift-and-shift.
  • Patching control differs by service - PaaS is patched by Microsoft in maintenance windows you can schedule, not on your opatch/CU cadence.
  • Backup access differs - PaaS backups are service-managed (PITR/LTR), not files you copy; export (BACPAC/dump) for portability.
  • Private Endpoint and DNS must be planned - the private path needs the privatelink.* zone linked to the VNet; otherwise the name resolves to the public endpoint, and connections fail if public access is disabled.
  • Licensing choices matter - Azure Hybrid Benefit (bring your SQL/Windows licenses) materially changes cost; for Oracle, licensing/counting on Azure VMs needs LMS-aware planning.
  • Performance troubleshooting differs - Query Performance Insight / Query Store / automatic tuning and Azure Monitor, not your on-prem toolset.
  • Oracle is a special case - there is no native "managed Oracle." You self-manage Oracle on a VM (Data Guard, backups, ASM/NetApp Files, constrained-vCPU for licensing) or use Oracle Database@Azure (Oracle-operated Exadata/Autonomous in Azure) - verify current regional availability and terms in the official documentation before designing.

Enterprise examples

SQL Server enterprise workload [SQL]

Lift to SQL Managed Instance (Business Critical + zone redundancy) in a delegated subnet, Private Endpoint/VNet only, failover group to a paired region, Azure Hybrid Benefit, Defender for SQL on. Use SQL-on-VM only if a required feature isn't in MI.

Oracle workload on Azure [Oracle]

Self-managed Oracle on E/M-series VMs with Premium SSD v2/Ultra or Azure NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU + Dedicated Host for licensing. Or Oracle Database@Azure for a managed Oracle Exadata/Autonomous experience inside Azure (verify availability).

PostgreSQL application database [OSS]

Azure Database for PostgreSQL Flexible Server, zone-redundant HA, Private Endpoint, Entra auth, automatic backups + PITR, cross-region read replica for DR, pgvector if doing AI/RAG.

Globally distributed NoSQL [NoSQL]

Cosmos DB with multi-region writes, a partition key chosen for even distribution, autoscale RU/s, a consistency level chosen per workload, and analytical store or vector search where needed.

7. Load Balancing and Traffic Management

The four Azure load-balancing services - Load Balancer (L4), Application Gateway (L7 regional), Front Door (L7 global), and Traffic Manager (DNS) - when to use each, how they are assembled, and how to debug the classic unhealthy-backend and DNS failures.

Last reviewed: July 2026 Verify SKUs (Standard LB, App Gateway v2, Front Door tiers) in current docs.
TL;DR

Four services, two axes (L4 vs L7, regional vs global): Azure Load Balancer (L4, regional, public/internal, ultra-fast), Application Gateway (L7, regional, with WAF, path/host routing, TLS), Front Door (L7, global, anycast + CDN + WAF + global failover), and Traffic Manager (DNS-based global routing). Combine them (e.g. Front Door → regional App Gateway → backend). The #1 failure is an NSG blocking the health probe or the App Gateway management ports.

The four load-balancing services

Service | Layer / scope | Use for
Azure Load Balancer (Standard) | L4 (TCP/UDP), regional, public or internal, zone-redundant | High-throughput L4, internal VIPs (e.g. SQL AG listener), non-HTTP; also provides outbound rules
Application Gateway (v2) | L7 (HTTP/S), regional, with WAF | Regional web apps needing path/host routing, TLS termination, WAF, autoscaling
Azure Front Door | L7 (HTTP/S), global, anycast + CDN + WAF | Internet-facing global apps: edge acceleration, global load balancing/failover, WAF at the edge
Traffic Manager | DNS-based, global | DNS-level routing across regions/endpoints (priority/weighted/performance/geographic); works for non-HTTP too
Cross-region Load Balancer | L4 global | Global L4 with a single anycast frontend across regional LBs
NAT Gateway | Outbound SNAT | Scalable outbound internet for a subnet (not a load balancer, but the modern outbound method)

When to use which

Global internet web app, want edge/CDN + WAF + failover: Front Door (optionally → regional App Gateway)
Regional web app, path/host routing + WAF + TLS: Application Gateway v2 (with WAF)
L4 TCP/UDP, high throughput, internal VIP: Azure Load Balancer (internal or public)
DNS-level routing across regions / non-HTTP global: Traffic Manager (or Cross-region LB for L4)
Outbound internet for a subnet: NAT Gateway
Architect note - they compose
These are layers, not competitors. A common enterprise stack: Front Door (global edge + WAF + failover) → regional Application Gateway (or directly to App Service) → backend pool; internal tiers use an internal Load Balancer. Put WAF at the layer facing the internet (Front Door for global, App Gateway for regional) - don't double-pay for two WAFs unless you need defense in depth.
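The two axes (L4 vs L7, regional vs global) can be written down as a small lookup. This is only an illustrative sketch of the decision list above: the function name and parameters are made up for the example, and real designs usually combine several of these services.

```python
def choose_load_balancer(http: bool, global_scope: bool, outbound_only: bool = False) -> str:
    """Map the two axes (L4 vs L7, regional vs global) to a service name.

    http          -- True for HTTP/S (L7) traffic, False for other TCP/UDP (L4)
    global_scope  -- True when traffic must be routed across regions
    outbound_only -- True when the need is outbound internet for a subnet
    """
    if outbound_only:
        return "NAT Gateway"
    if http:
        return "Front Door" if global_scope else "Application Gateway v2 (with WAF)"
    if global_scope:
        return "Traffic Manager or Cross-region Load Balancer"
    return "Azure Load Balancer"

print(choose_load_balancer(http=True, global_scope=True))
print(choose_load_balancer(http=False, global_scope=False))
```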

Application Gateway anatomy + WAF

[Diagram: Clients → Application Gateway (frontend IP + listener (+ cert) → rules (path/host) + WAF policy → HTTP settings + health probe → backend pool) → backends (VMSS / App Service, across zones), checked by the health probe]
Listener (+ cert) → rules (path/host) + WAF → HTTP settings + probe → backend pool. WAF (OWASP) protects at L7.
  • Components: frontend IP, listener (port/protocol + cert), rules (path/host routing), HTTP settings (backend port/protocol, cookie affinity, probe), backend pool (VMSS/NICs/App Service/IPs), health probe, and WAF policy.
  • SSL: termination at the gateway (offload) or end-to-end (re-encrypt to backend). Manage certs via Key Vault integration.
  • WAF - OWASP core rule set + custom rules, geo/rate limiting; run in Detection mode first, then Prevention.

Front Door

Front Door is the global L7 entry point: anycast frontend, edge TLS + caching/CDN, WAF at the edge, and health-based global routing/failover across regional backends. Use it for internet-facing apps that need low latency worldwide and automatic region failover. Managed certificates auto-provision once DNS points at the Front Door endpoint.

Load balancing troubleshooting

⚑ Backend unhealthy / App Gateway probe failing

Likely causes (in order)

  1. NSG blocks the probe or the required App Gateway management ports on the App Gateway subnet (v2 needs inbound TCP 65200-65535 from the GatewayManager service tag, plus the AzureLoadBalancer tag).
  2. Health probe host/path/port/protocol wrong vs. what the app serves (probe expects 200-399).
  3. App not listening / bound to localhost / wrong backend port in HTTP settings.
  4. HTTPS probe cert/hostname mismatch (end-to-end SSL); backend expects a specific host header.
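  5. For Azure Load Balancer: probe port not open, or the NSG does not allow the AzureLoadBalancer service tag (probes come from 168.63.129.16). Standard LB is closed to inbound traffic by default, so an NSG must explicitly allow it.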

Checks & fix

App Gateway > Backend health (shows the reason); allow the probe + management traffic on the NSG; align the probe (path/port/protocol/host); bind the app to all interfaces; verify HTTP settings backend port/protocol.

az network application-gateway show-backend-health -g RG -n APPGW -o table
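Causes 2 and 3 above (probe settings that don't match what the app serves) can be reasoned about offline before touching the gateway. The sketch below is a simplified model, not the gateway's actual logic: the `Probe` class, the `unhealthy_reasons` helper, and the sample app routes are all hypothetical, and only the 200-399 healthy range comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    protocol: str            # "Http" or "Https"
    port: int
    path: str
    healthy_min: int = 200   # healthy status range cited above: 200-399
    healthy_max: int = 399

def unhealthy_reasons(probe, app_protocol, app_port, status_by_path):
    """Return the reasons this probe would mark the backend unhealthy."""
    reasons = []
    if probe.protocol.lower() != app_protocol.lower():
        reasons.append(f"protocol mismatch: probe {probe.protocol}, app {app_protocol}")
    if probe.port != app_port:
        reasons.append(f"port mismatch: probe {probe.port}, app {app_port}")
    status = status_by_path.get(probe.path)
    if status is None:
        reasons.append(f"app does not serve {probe.path}")
    elif not probe.healthy_min <= status <= probe.healthy_max:
        reasons.append(f"status {status} outside {probe.healthy_min}-{probe.healthy_max}")
    return reasons

# Hypothetical app: listens on 8080, root path requires authentication.
app_routes = {"/healthz": 200, "/": 401}

bad = unhealthy_reasons(Probe("Http", 80, "/"), "Http", 8080, app_routes)
good = unhealthy_reasons(Probe("Http", 8080, "/healthz"), "Http", 8080, app_routes)
print(bad)
print(good)
```

A dedicated unauthenticated health path on the real backend port is usually the simplest way to keep the probe aligned with the app.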
⚑ SSL certificate issue

Causes: Front Door/App Gateway managed cert not provisioned (DNS must point at the frontend first, then validation completes); expired/incomplete cert chain; Key Vault access policy/permissions missing for App Gateway's managed identity; hostname/SNI mismatch. Fix: point DNS at the frontend and wait for provisioning; grant the gateway identity get on the Key Vault secret/cert; include the full chain and correct SANs.

⚑ Wrong listener/rule, WAF blocking valid traffic, or Private Endpoint DNS

Causes: listener on the wrong port/host or a catch-all rule masking a specific one; WAF denying legitimate requests (over-broad rule / false positive - check WAF logs, use Detection mode first, then tune/exclude); backend behind a Private Endpoint whose Private DNS isn't resolving; wrong frontend IP config (public vs private). Fix: verify listener/rule order and host, review WAF logs and add exclusions, confirm Private DNS zone linkage, check the frontend IP.

8. Security Deep Dive

Defense in depth on Azure: identity (Entra, PIM, Conditional Access), governance (Azure Policy, locks), network (NSG, Firewall, Private Link, DDoS), data (Key Vault, encryption), and detection (Defender for Cloud, Sentinel) - plus how to secure subscriptions, storage, VMs, and databases, ending in a production checklist.

Last reviewed: July 2026 Defender/Sentinel capabilities evolve - verify plans and features in docs.
TL;DR

Layer your controls: identity (Entra + PIM for privileged, Conditional Access + MFA, managed identities, no standing Owner), governance (Azure Policy to forbid the risky thing; resource locks), network (private endpoints, NSG/ASG, Azure Firewall, DDoS, no public exposure, Bastion), data (Key Vault/Managed HSM, CMK, TDE, Always Encrypted), and detection (Defender for Cloud, Sentinel, centralized diagnostic logs to Log Analytics). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive Policy over after-the-fact detection.

Azure shared responsibility model

Microsoft secures the infrastructure (physical, host, network fabric, and managed-service internals). You own: identity and access (Entra + RBAC), data classification and access, network exposure and firewall, key management choices, workload/OS security (IaaS/AKS nodes), secure configuration, and monitoring/response. The higher up the managed stack (VM → AKS → Azure SQL → App Service/Functions), the more Microsoft handles - but data, identity, and configuration always remain yours.

The control layers

Layer | Controls | Key services
Identity & access | Who can sign in and do what | Entra ID, Conditional Access, PIM, Identity Protection, RBAC, managed identities
Governance | What is allowed to exist/be configured | Azure Policy (deny/audit/deployIfNotExists), resource locks, management groups
Network | What can reach what | NSG/ASG, Azure Firewall, Private Link/Endpoint, DDoS Protection, Bastion, WAF
Data | Protect data at rest/in transit | Key Vault / Managed HSM (CMK), TDE, Always Encrypted, storage encryption, TLS
Detective / posture | Find misconfig & threats | Microsoft Defender for Cloud, Microsoft Sentinel, Activity Log, diagnostic settings

Key Vault, Managed HSM, and encryption

  • Key Vault - store keys, secrets, and certificates; workloads read them via managed identity + RBAC (or access policies). Managed HSM for FIPS 140-2 Level 3 single-tenant HSM keys.
  • Encryption at rest - on by default (platform keys); use customer-managed keys (CMK) in Key Vault for storage/disks/databases where you need key control and the "disable key" switch.
  • Always Encrypted / TDE for databases; TLS everywhere in transit.
  • Turn on Key Vault soft-delete + purge protection so keys/secrets can't be permanently destroyed by mistake or malice.
Security note - Key Vault + managed identity in a locked resource group
Put Key Vaults in a controlled resource group where only a small key-admin group has admin, and grant workloads only the data actions they need (get secret / wrap-unwrap key) via their managed identity. Never store secrets in app settings, code, or pipeline variables in clear text - reference Key Vault. Enable purge protection; a vault whose keys can be purged is a single point of catastrophic data loss.

Defender for Cloud & Sentinel

Microsoft Defender for Cloud

CSPM (secure score, misconfiguration recommendations) + CWP (Defender plans for servers, SQL, storage, containers, Key Vault, etc.) with threat detection. Turn on the relevant plans; work through the recommendations to raise the secure score.

Microsoft Sentinel

Cloud-native SIEM/SOAR on Log Analytics - ingest Azure + M365 + third-party logs, detect with analytics rules, and automate response with playbooks.

Activity Log & diagnostic settings

Activity Log = control-plane operations (who did what). Diagnostic settings route resource + platform logs/metrics to a central Log Analytics workspace / storage / Event Hub. Enable them everywhere via Policy.

DDoS Protection & WAF

DDoS Network/IP Protection for L3/4 volumetric attacks; WAF (App Gateway/Front Door) for L7. Protect public frontends.

Architect note - centralize logs on day one
Use Azure Policy to enforce diagnostic settings sending every resource's logs to a central Log Analytics workspace (and onward to Sentinel), and route the Activity Log too - set this up in the landing zone. Retrofitting centralized logging after an incident, when you find the logs were never enabled, is the classic post-mortem finding. Turn on Defender for Cloud across the management group.

How to secure specific things

Secure a production subscription (and multi-subscription env) [Foundation]
  • Access via groups; no standing Owner at subscription scope; least-privilege built-in roles at RG/resource; privileged roles eligible via PIM; Conditional Access + MFA; protected break-glass accounts.
  • Preventive Azure Policy at the management group: allowed regions/SKUs, deny public IPs, deny public blob access, require diagnostic settings, require tags; resource locks on foundational resources.
  • Hub-and-spoke with Azure Firewall; Private Endpoints for PaaS; DDoS on public frontends.
  • Defender for Cloud (all relevant plans) + Sentinel; central Log Analytics; budgets + quotas.
  • Key Vault + managed identities; CMK for sensitive data.
Secure storage accounts [Storage]
  • Disable public blob access + "allow shared key" where possible; use Entra + RBAC data roles + managed identity over keys/SAS.
  • Private Endpoint + Private DNS; firewall to specific VNets; CMK for sensitive data; soft delete + versioning + immutability for backups.
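  • If SAS is required: prefer a user-delegation SAS with a short lifetime (revoke it by revoking the user delegation key); if a service SAS is unavoidable, bind it to a stored access policy so it can be revoked.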
Secure VMs & databases [Compute / Data]
  • VMs: no public IP; Bastion for access; managed identity; NSG/ASG micro-segmentation; Update Manager; Defender for Servers.
  • Databases: Private Endpoint / VNet only, disable public access; Entra auth + managed identity; TDE + CMK; Always Encrypted for sensitive columns; Defender for SQL.
Secure public load balancers & reduce exposure [Edge]
  • Public HTTP behind Front Door / App Gateway + WAF and DDoS; backends private (no public IPs).
  • Enforce "no public IP on VMs" and "no public blob access" via Policy; use Private Endpoints for all sensitive PaaS.
  • Prefer Bastion + private access; audit for stray public IPs regularly (Defender/Policy).

Production Azure security checklist

  • Human access via groups; MFA enforced via Conditional Access; risky sign-ins blocked (Identity Protection).
  • Privileged roles (Owner, User Access Admin, Global Admin) eligible via PIM, not permanent; approvals + audit on.
  • Two protected, monitored break-glass accounts (excluded from lock-out policies, alerted on every sign-in).
  • No standing Owner/Contributor at subscription/MG for daily work; least-privilege built-in roles at RG/resource.
  • Workloads use managed identities; app secrets minimized, rotated, and in Key Vault.
  • Preventive Azure Policy: deny public IPs, deny public blob access, allowed regions/SKUs, require diagnostic settings + tags.
  • Resource locks (CanNotDelete) on hub network, Key Vault, and production data.
  • Databases and PaaS on Private Endpoints; public network access disabled; no public database endpoints.
  • Public HTTP behind Front Door/App Gateway + WAF; DDoS Protection on public frontends; backends private.
  • Sensitive data encrypted with CMK in Key Vault (soft delete + purge protection on).
  • Diagnostic settings + Activity Log centralized to a Log Analytics workspace; Sentinel ingesting.
  • Defender for Cloud plans enabled across the management group; secure score tracked.
  • Alerts on RBAC/PIM changes, new app secrets, public exposure, Key Vault access anomalies.
  • Budgets + quotas as guardrails; consistent tags for attribution.
  • Backups immutable (vault soft-delete/immutability); DR tested incl. CMK key availability in the DR region.

Common security mistakes

Common Azure security mistakes
  • Owner assigned too broadly; not using PIM (standing privilege).
  • Weak Conditional Access (no MFA, admins unprotected).
  • Storage account public exposure; over-permissive NSGs.
  • Public database endpoints instead of Private Endpoints.
  • Not enabling diagnostic logs; not centralizing logs.
  • Secrets in code/app settings instead of Key Vault; not using managed identities.
  • Not using Private Endpoints for sensitive services; not enforcing Azure Policy (guardrails off).

9. Observability, Monitoring, and Operations

Azure Monitor (metrics, logs, alerts), Log Analytics + KQL, Application Insights, and the operations tooling - what to monitor per service, how to build useful alerts without noise, and how to centralize logs across subscriptions.

Last reviewed: July 2026 Verify agent (AMA) and alert-rule details in current docs.
TL;DR

Azure Monitor is the umbrella: metrics (near-real-time numeric), logs (in a Log Analytics workspace, queried with KQL), Application Insights (APM for apps), and alerts (metric/log/activity) firing to action groups. Install the Azure Monitor Agent (AMA) on VMs for guest metrics/logs (memory isn't collected by default). Diagnostic settings route each resource's logs to the workspace. Centralize across subscriptions with a shared workspace + Policy, alert on user-visible symptoms, and route by severity.

The observability stack

Service | Role
Azure Monitor Metrics | Platform + custom numeric metrics, near-real-time, for dashboards and metric alerts.
Log Analytics workspace + KQL | Central log store; query with Kusto Query Language; the target for diagnostic settings.
Application Insights | APM: requests, dependencies, exceptions, traces, live metrics, availability tests for apps.
Alerts + Action Groups | Metric/log/activity/resource-health alerts → email, SMS, webhook, Logic App, ITSM, Functions.
Diagnostic settings | Route resource logs/metrics to Log Analytics / storage / Event Hub. Enable via Policy everywhere.
Activity Log / Resource Health / Service Health | Control-plane operations; per-resource health; Azure-side incidents & planned maintenance.
VM Insights / Container Insights | Curated VM and AKS monitoring (perf, maps, container logs) via the agent.
Azure Monitor Agent (AMA) | The agent for VM/Arc guest metrics and logs; configured by Data Collection Rules.
Workbooks / Dashboards | Interactive reports and shared operational views.
Operations note - install the agent for guest metrics
Azure collects host-level VM metrics (CPU, disk, network) by default, but guest memory, disk-free %, and process/log data require the Azure Monitor Agent (via a Data Collection Rule) - use VM Insights to deploy it. Many "we couldn't see the memory leak / disk filling" incidents trace back to the agent never being deployed.

What to monitor per area

VMs

CPU, memory (agent), disk free %, disk IOPS/throughput vs. SKU, availability/heartbeat, VMSS instance health.

Disks / storage accounts

Disk IOPS/throughput vs. provisioned; storage transactions, throttling (429), availability, capacity, unusual access.

Databases

DTU/vCore/CPU, storage %, connections, deadlocks, replication lag, backup status; Query Performance Insight.

Load balancers / App Gateway

Backend health, unhealthy host count, response time, 5xx, throughput, WAF blocks.

Networking

VPN/ExpressRoute status, NSG flow-log anomalies, NAT SNAT port usage, DNS.

Security

Defender alerts, Activity Log anomalies (RBAC/policy/public-exposure changes), Key Vault access failures.

Building useful alerts

  • Alert on symptoms users feel (5xx, unhealthy backends, DB down, latency), not only causes.
  • Use appropriate aggregation (avg/percentile) and an evaluation window + frequency to avoid flapping.
  • Use log alerts (KQL) for things metrics can't express; metric alerts for fast numeric thresholds; activity-log alerts for governance events.
  • Route by severity via action groups: Sev0/1 → page; Sev2/3 → ticket/Teams; info → dashboard.
  • Consider dynamic thresholds (ML baselines) for noisy signals, and Resource Health alerts for platform issues.
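To see why an evaluation window reduces flapping, the sketch below compares a naive per-sample threshold with a windowed average, then routes by severity as in the list above. It is a toy model, not how Azure Monitor evaluates rules: the function names, sample values, and the 85% threshold are assumptions for illustration.

```python
from statistics import mean

def should_fire(samples, threshold, window):
    """Fire only when the average of the last `window` samples exceeds threshold."""
    if len(samples) < window:
        return False  # not enough data to evaluate the full window yet
    return mean(samples[-window:]) > threshold

def route(severity):
    """Severity routing from the list above: Sev0/1 page, Sev2/3 ticket, rest dashboard."""
    if severity <= 1:
        return "page"
    if severity <= 3:
        return "ticket"
    return "dashboard"

transient = [40, 42, 97, 41, 43, 44]   # one short spike (hypothetical CPU %)
sustained = [40, 88, 90, 92, 91, 93]   # genuinely high for the whole window

naive_fires = any(v > 85 for v in transient)        # a per-sample check pages on the spike
windowed_transient = should_fire(transient, 85, 5)  # the windowed average ignores it
windowed_sustained = should_fire(sustained, 85, 5)  # sustained load still fires

print(naive_fires, windowed_transient, windowed_sustained)
print(route(1), route(2), route(4))
```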

Example alerts to implement

Alert | Condition | Severity
VM CPU high | CPU > 85% avg for 5-10 min | Warning → Critical
VM unavailable | Heartbeat missing / Resource Health unavailable | Critical
Memory pressure | Available memory (agent) < threshold | Warning
Disk usage / IOPS | Disk free < 15%; IOPS near provisioned limit | Warning
App Gateway unhealthy backend | Unhealthy host count > 0 | Critical
Azure SQL CPU / DTU-vCore / storage | CPU/DTU > 90%; storage > 85% | Warning → Critical
Failed backups | Backup job failed / success signal absent | Critical
VPN tunnel down / ExpressRoute issue | Connection/circuit status != connected | Critical
Storage unusual access / throttling | Spike / 429 throttling / anomalous access | Warning / Security
Function errors / throttles | Failure rate / throttle count over threshold | Warning → Critical
Key Vault access denied spikes | Forbidden/denied requests rising | Security review
Common mistake - alert fatigue
Paging on every transient spike trains people to ignore alerts. Use longer windows, appropriate aggregation, severity routing (only real user-impact pages), dynamic thresholds for noisy signals, maintenance suppression, and prune alerts nobody acts on. An alert that never leads to action should be a dashboard tile, not a page.

Centralizing logs across subscriptions

Use a shared Log Analytics workspace (in the management subscription) and enforce diagnostic settings across all resources via Azure Policy (deployIfNotExists) so every subscription sends logs there. Route the Activity Log too, and connect the workspace to Sentinel for security analytics. This gives cross-subscription visibility and satisfies retention/compliance without per-resource setup.

// KQL: top 20 VMs by average CPU over the last hour (VM Insights data)
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCpu = avg(Val) by Computer
| top 20 by AvgCpu desc

Operations tooling

  • Update Manager for OS patch orchestration/compliance; Change Tracking & Inventory for drift; Automation Account for runbooks.
  • Azure Arc to bring on-prem/other-cloud servers, Kubernetes, and data services under the same monitoring, policy, and update tooling.
  • Defender for Cloud recommendations and Advisor for security/reliability/cost/performance guidance; Cost Management exports for spend (section 14).

10. Containers, Kubernetes, and Cloud Native

AKS, Container Apps, and the serverless / event-driven building blocks (Functions, Event Grid, Service Bus, Event Hubs, Logic Apps) - when to use each, how networking and identity work for containers, and reference patterns.

Last reviewed: July 2026 Verify AKS networking modes, workload identity, and Container Apps features in docs.
TL;DR

AKS (managed Kubernetes) for orchestrated microservices when you need the K8s ecosystem; Container Apps for serverless containers without running a cluster (scale-to-zero, KEDA, Dapr); Functions for event-driven code; App Service for web apps. Around them: Azure Container Registry, Event Grid / Event Hubs / Service Bus, Logic Apps, and API Management. AKS workloads use Workload Identity (federated, no secrets) to reach Azure/Entra.

The cloud-native services

Service | What it is | Use for
Azure Kubernetes Service (AKS) | Managed Kubernetes (free control plane; you manage node pools, or use node autoprovisioning) | Orchestrated microservices, platform teams, portable K8s
Container Apps | Serverless containers on managed Kubernetes/KEDA - no cluster ops, scale to zero | Most containerized microservices without AKS overhead
Container Instances (ACI) | Single containers on demand | Simple/short-lived tasks; AKS virtual-node burst
App Service | Managed web app/API PaaS (code or container) | Web apps/APIs
Azure Functions | Event-driven serverless functions | Event handlers, glue, automation
Azure Container Registry (ACR) | Private registry with scanning, geo-replication, tasks | Store/scan/build images
API Management (APIM) | Full API gateway/management | Publishing, securing, throttling, versioning APIs
Event Grid / Event Hubs / Service Bus | Eventing / big-data streaming / enterprise messaging | Event-driven and decoupled architectures
Logic Apps | Low-code workflow/integration with connectors | Integration, orchestration, SaaS connectors

AKS deep dive

  • Node pools: a system node pool (runs cluster-critical pods) and one or more user node pools (your workloads). Use Spot user pools for fault-tolerant work; scale per pool; virtual nodes (ACI) for burst.
  • Networking: Azure CNI (pods get VNet IPs - plan a big subnet; CNI Overlay reduces IP usage) vs kubenet (legacy). Private clusters keep the API server private.
  • Ingress: the Application Gateway Ingress Controller (AGIC) or the managed App Routing add-on / Gateway API provisions an App Gateway or LB; a Service type=LoadBalancer creates an Azure Load Balancer.
  • Identity: Microsoft Entra Workload Identity federates a Kubernetes service account to a managed identity so pods get Entra tokens with no secrets. Use Entra + Azure RBAC for cluster access, plus Kubernetes RBAC.
  • Security/ops: Defender for Containers, image scanning in ACR, Azure Policy for AKS (Gatekeeper), Container Insights for monitoring.
Architect note - Container Apps unless you need AKS
Reach for Container Apps first for containerized services - it removes cluster management, scales to zero, and supports KEDA/Dapr. Use AKS when you genuinely need the Kubernetes ecosystem (operators, service mesh, DaemonSets, GPU scheduling, cluster-level control). With AKS + Azure CNI, size the pod subnet for peak pods (or use CNI Overlay) - IP exhaustion stalls scheduling in ways that look like mysterious Pending pods. Use Workload Identity for pod-to-Azure access, never mounted secrets.
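A quick back-of-the-envelope check for subnet sizing is sketched below. It assumes the traditional Azure CNI mode, where each node reserves one IP for itself plus one per potential pod, allows for surge nodes during upgrades, and subtracts the five addresses Azure reserves in every subnet. The function names and example numbers are illustrative; CNI Overlay and other modes are sized differently, so confirm the formula for your mode in current documentation.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet

def required_ips(nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """IPs needed when every node pre-allocates an IP for itself and each pod slot."""
    total_nodes = nodes + surge_nodes  # leave room for upgrade surge nodes
    return total_nodes + total_nodes * max_pods_per_node

def subnet_fits(cidr: str, nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> bool:
    """True if the subnet has enough usable addresses for the planned cluster."""
    usable = ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_IPS
    return usable >= required_ips(nodes, max_pods_per_node, surge_nodes)

# Hypothetical plan: 50 nodes at 30 pods per node, one surge node.
needed = required_ips(50, 30)
print(needed)
print(subnet_fits("10.240.0.0/22", 50, 30))  # 1019 usable addresses
print(subnet_fits("10.240.0.0/21", 50, 30))  # 2043 usable addresses
```

Size for peak node count after autoscaling, not for the initial deployment.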

AKS vs Container Apps vs Functions vs VMs

Many orchestrated microservices, need K8s ecosystem: AKS
Containerized services, no cluster ops, scale to zero: Container Apps
Event-driven code / glue: Functions (+ Event Grid)
Web app / API without containers: App Service
Full OS control / specialized kernels/GPUs: VMs / AKS with GPU pools
Cost note
Don't run an AKS cluster for a couple of containers - even with a free control plane, node pools cost more than Container Apps, which bills per usage and scales to zero. Reserve AKS for genuine orchestration needs, use Spot node pools for fault-tolerant workloads, and right-size node pools with the cluster autoscaler / node autoprovisioning.

Networking & identity for containers

  • Networking - AKS/Container Apps deploy into a VNet subnet; use private endpoints for dependencies (ACR, Key Vault, databases), Private DNS, and NSGs. Internal ingress + Private Link for private platforms.
  • Identity - Entra Workload Identity (AKS) / managed identity (Container Apps, Functions, App Service) for keyless access to Key Vault, storage, databases.
  • Supply chain - scan images in ACR (Defender), sign/verify, restrict pull to the workload identity; Azure Policy for AKS to enforce baselines.
  • Monitoring - Container Insights (AKS), built-in metrics/logs for Container Apps/Functions; App Insights for app-level tracing.

Messaging & events

Service | Model | Use for
Event Grid | Discrete event routing (pub/sub, reactive) | React to Azure/resource events (e.g. blob created) → Functions/Container Apps
Event Hubs | High-throughput streaming (Kafka-compatible) | Telemetry/log/IoT ingestion pipelines
Service Bus | Enterprise messaging (queues/topics, ordering, transactions, dead-letter) | Reliable decoupling, work queues, ordered processing
Logic Apps | Low-code workflows with connectors | Integration/orchestration across SaaS + Azure

Architecture patterns

[Diagram: Blob Storage → Event Grid (blob created) → Function (process) → SQL / Cosmos, or Service Bus → downstream]
Event-driven: a blob upload raises an Event Grid event → a Function processes it and writes to a database or hands off to Service Bus for reliable downstream processing.
  • Microservices on AKS - deployments behind AGIC/Gateway ingress, HPA/KEDA autoscaling, optional service mesh, Workload Identity, ACR + a deployment pipeline.
  • Microservices on Container Apps - each service a container, internal ingress + Dapr for service-to-service, KEDA scaling, managed identity - minimal ops.
  • Serverless function on a Blob event - as diagrammed; image/ETL/validation.
  • Event-driven architecture - Event Grid + Functions/Container Apps + Service Bus + Event Hubs for decoupled, resilient pipelines.
  • Private container platform - private AKS/Container Apps in a spoke VNet, internal ingress, private endpoints to ACR/Key Vault/DB, no public endpoints.
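Event Grid delivers at least once, so a handler in the blob-event pattern above may see the same event twice and should be idempotent. The sketch below shows the idea in plain Python, without the Functions runtime: the handler name, the in-memory `processed_ids` set (a stand-in for a durable store), and the sample storage account URL are all hypothetical. The `eventType`, `id`, and `data.url` fields follow the Event Grid event schema for blob-created events.

```python
processed_ids = set()  # stand-in for a durable store such as a database table

def handle_event(event: dict, sink: list) -> bool:
    """Process one Event Grid event; return True only if new work was done."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return False                      # ignore event types we don't handle
    if event["id"] in processed_ids:
        return False                      # duplicate delivery: already processed
    sink.append(event["data"]["url"])     # the "work": record the blob to process
    processed_ids.add(event["id"])        # mark done only after the work succeeds
    return True

# Hypothetical event payload (trimmed to the fields used here).
sample = {
    "id": "evt-001",
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/uploads/blobs/report.csv",
    "data": {"url": "https://examplestorage.blob.core.windows.net/uploads/report.csv"},
}

work_queue = []
first = handle_event(sample, work_queue)
second = handle_event(sample, work_queue)  # simulated redelivery of the same event
print(first, second, work_queue)
```

In a real deployment the dedupe record and the write to the database should be committed together, or the downstream write itself should be an upsert keyed on the blob or event id.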

Troubleshooting

⚑ AKS pod not starting (Pending / ImagePullBackOff / CrashLoopBackOff)

Causes: Pending = no schedulable capacity or pod IP exhaustion (Azure CNI subnet too small) or resource requests too big; ImagePullBackOff = ACR pull permission missing (grant AcrPull to the kubelet/managed identity), private ACR unreachable (needs private endpoint/DNS), or wrong image path; CrashLoopBackOff = app config/secret missing or bad probes. Checks: kubectl describe pod, kubectl logs --previous, node capacity, subnet free IPs. Fix: scale/enable autoscale or use CNI Overlay; grant AcrPull; fix probes/config/Workload Identity.

⚑ Container App revision issue / Function timeout / trigger issue

Container Apps: a new revision serving traffic but failing - check the container listens on the target port, ingress config, scale rules (min replicas), and the managed identity's RBAC; roll back to a previous revision or split traffic. Function timeout: raise the timeout/plan (Consumption caps duration; use Premium/Flex for long work), make idempotent, offload long work. Trigger not firing: check the binding/connection (managed identity or connection string), the event source, and the function's logs/metrics; verify the Event Grid subscription/filter.

11. Analytics, Data, and Integration

The Azure data stack - Microsoft Fabric and Synapse, Data Factory, Data Lake Storage Gen2, Databricks, Event Hubs and Stream Analytics, Data Explorer, Purview governance, and Power BI - with the common lake/warehouse/streaming patterns.

Last reviewed: July 2026. Fabric is evolving fast and consolidating Synapse - verify current positioning in docs.
TL;DR

Land data in Data Lake Storage Gen2 (a storage account with hierarchical namespace). Analyze with Microsoft Fabric (the unified SaaS analytics platform: OneLake, Lakehouse, Warehouse, Data Factory, Power BI) or Synapse/Databricks. Ingest streams with Event Hubs + Stream Analytics, orchestrate ETL with Data Factory, query time-series/logs with Data Explorer, govern with Microsoft Purview, and visualize with Power BI.

The services

Service | Role
Microsoft Fabric | Unified SaaS analytics: OneLake (one logical lake), Lakehouse, Data Warehouse, Data Factory, Real-Time Intelligence, and Power BI - one capacity, one governance surface.
Azure Synapse Analytics | Integrated analytics (dedicated/serverless SQL pools, Spark, pipelines). Much of it is converging into Fabric - check current guidance.
Azure Databricks | First-party Apache Spark + Delta Lakehouse for large-scale data engineering/ML.
Data Factory | Cloud ETL/ELT orchestration with 100+ connectors (also in Synapse/Fabric).
Data Lake Storage Gen2 | Blob + hierarchical namespace (directories, POSIX ACLs) - the lake foundation.
Event Hubs | High-throughput event streaming (Kafka-compatible) for ingestion.
Stream Analytics | Serverless real-time stream processing (SQL over streams).
Azure Data Explorer (ADX / Kusto) | Fast analytics over logs/time-series/telemetry (KQL).
Microsoft Purview | Data governance: catalog, classification, lineage, and access policies across the estate.
Power BI | BI/reporting and semantic models (native in Fabric).
Data engineering note - Fabric vs Synapse vs Databricks
Microsoft is consolidating analytics into Fabric (SaaS, OneLake, capacity-based). New greenfield analytics often start in Fabric; large existing Spark/Delta estates frequently stay on Databricks; Synapse remains for existing workloads and is being folded toward Fabric. Choose by your existing investment and team skills, and verify the current Microsoft positioning - this area moves fast.

The Fabric / OneLake model (mental model)

How Fabric differs from a database
  • OneLake is one logical data lake for the whole tenant (built on ADLS Gen2, open Delta/Parquet format) - "shortcuts" reference data in place instead of copying.
  • Storage and compute are separate; you buy capacity (compute) and workloads (Lakehouse, Warehouse, Power BI) share it.
  • Lakehouse (Spark/notebooks + a read-only SQL analytics endpoint) vs Warehouse (full T-SQL, including multi-statement transactions) - both store their tables as Delta in OneLake.
  • Direct Lake lets Power BI read Delta directly for speed without import/DirectQuery trade-offs.
  • Fabric is an analytical (OLAP) platform, not an OLTP database - keep OLTP in Azure SQL/Cosmos and feed the lake.

Common data patterns

Pattern | Built from
Data lake | ADLS Gen2 / OneLake (bronze/silver/gold) + Purview governance
Data warehouse | Synapse dedicated SQL pool or Fabric Warehouse + Power BI
Lakehouse | Fabric Lakehouse or Databricks (Delta) over ADLS/OneLake
ETL / ELT | Data Factory pipelines (+ Spark/Databricks for transforms)
Streaming ingestion | Event Hubs → Stream Analytics / Fabric Real-Time → lake/warehouse
Event-driven integration | Event Grid + Functions/Logic Apps + Service Bus
Reporting / BI | Power BI over Warehouse/Lakehouse (Direct Lake)
AI-ready data | Curated lake + Azure ML / Azure OpenAI + vector search (section 12)
Cross-org data sharing | Azure Data Share / Fabric sharing with governance

Governance with Purview

Microsoft Purview provides a unified catalog, automated classification (PII/sensitive), lineage, and data access policies across Azure data sources, on-prem, and (increasingly) multicloud - so a growing lake stays governed instead of a "data swamp." Combine with Private Endpoints around data services, storage/lake ACLs, and column/row-level security in the warehouse for sensitive data.

Security note
Put analytics data behind Private Endpoints and (for the warehouse) column-/row-level security and dynamic data masking; classify and discover PII with Purview; and control sharing through governed mechanisms (Data Share / Fabric domains) rather than ad-hoc access. For sensitive data, keep the storage account private and use managed identity + RBAC, not keys.

Reference architecture: lakehouse + BI

[Diagram: Source DBs (OLTP) → Data Factory, and Events / IoT → Event Hubs, both landing in ADLS Gen2 / OneLake (bronze / silver / gold, Delta) → Databricks / Spark → Fabric Warehouse → Power BI, with Purview governing.]
Batch (Data Factory) and streaming (Event Hubs) land in the Delta lake; Spark transforms; Fabric Warehouse serves analytics; Purview governs; Power BI visualizes.
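
The bronze / silver / gold layering can be illustrated with a small sketch. In practice these steps would run in Spark over Delta tables; plain Python is used here only to show what each layer does. The field names (`order_id`, `region`, `amount`) and the validation rules are hypothetical.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Bronze → silver: validate, normalize types, and de-duplicate raw rows."""
    seen, silver = set(), []
    for row in bronze_rows:
        order_id = str(row.get("order_id", "")).strip()
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # reject rows with a missing or unparseable amount
        if not order_id or order_id in seen:
            continue  # reject blank IDs and duplicates
        seen.add(order_id)
        region = str(row.get("region", "unknown")).strip().lower()
        silver.append({"order_id": order_id, "region": region, "amount": amount})
    return silver


def to_gold(silver_rows):
    """Silver → gold: aggregate to the shape reports consume (total per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    bronze = [
        {"order_id": "1", "region": " EU ", "amount": "10.5"},
        {"order_id": "1", "region": "EU", "amount": "10.5"},   # duplicate
        {"order_id": "2", "region": "us", "amount": "bad"},    # invalid amount
        {"order_id": "3", "region": "EU", "amount": 4},
    ]
    print(to_gold(to_silver(bronze)))
```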

12. AI, ML, and Generative AI on Azure

Azure AI Foundry, Azure OpenAI, Azure AI Search, the pretrained AI services, and Azure Machine Learning - plus vector search across Cosmos DB / PostgreSQL, the enterprise RAG patterns, and the governance guardrails that separate a demo from something you can run on real data.

Last reviewed: July 2026. Model names, regions, quotas & pricing change fast - verify Azure OpenAI/Foundry availability in the portal.
TL;DR

Azure AI Foundry is the platform for building/deploying AI (models, agents, prompt flow, evaluation); Azure OpenAI serves GPT/embedding models with enterprise controls; Azure AI Search provides the retrieval layer (keyword + vector + semantic) for RAG. Store vectors in AI Search, Cosmos DB, or PostgreSQL (pgvector). Pretrained Azure AI services (Document Intelligence, Vision, Language, Speech, Translator) cover common tasks; Azure ML for custom models/MLOps. The hard part is not the model - it is governing what it can reach; use private endpoints, managed identity, and Content Safety.

Azure AI Foundry & Azure OpenAI

Capability | What it does
Azure AI Foundry | The unified platform/portal + SDK to build, evaluate, deploy, and monitor generative AI apps and agents; model catalog, prompt flow, tracing, and evaluation.
Azure OpenAI Service | GPT (chat/completions), embeddings, and other OpenAI models with Azure enterprise controls (private networking, RBAC, content filtering, data-not-used-to-train).
Model catalog | OpenAI + open + partner models to deploy (managed or serverless endpoints).
Prompt flow | Author, test, and evaluate LLM app flows (prompts, tools, retrieval) with versioning.
Content Safety | Detect/block harmful content, jailbreaks/prompt-injection, and groundedness issues.
Fine-tuning | Customize supported models where available (verify per model/region).

Azure AI Search is the managed retrieval engine for RAG: it indexes your content and supports keyword, vector, and hybrid search plus semantic ranking. Integrated vectorization and indexers can chunk, embed, and index content from Blob/ADLS/SQL/Cosmos automatically. It is the most common grounding store for Azure OpenAI RAG.
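
To make the "chunk" step concrete, here is a minimal fixed-size chunker with overlap. It is a sketch of the general technique, not how Azure AI Search's integrated vectorization splits text; the function name and the character-based sizes are assumptions.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    chunks = []
    start = 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a somewhat larger index.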

Applied AI services & Azure ML

Document Intelligence

Extract text, tables, key-value pairs, and structure from documents (invoices, forms, contracts).

Vision / Language / Speech / Translator

Pretrained APIs for image analysis, entity/sentiment/PII, speech-to-text/text-to-speech, and translation - no training.

Azure Bot Service / Copilot patterns

Conversational bots and assistant patterns grounded on your data.

Azure Machine Learning

Full MLOps: training, pipelines, model registry, managed online/batch endpoints, and monitoring for custom models.

Vector search options

Option | Use when
Azure AI Search (vector/hybrid) | Purpose-built retrieval with hybrid + semantic ranking; the default RAG store.
Cosmos DB vector search | Vectors alongside operational NoSQL data at global scale.
Azure DB for PostgreSQL (pgvector) | Vectors in an existing Postgres, alongside relational data.
Azure SQL vector support | Vectors alongside relational SQL data (verify current availability).
AI note - keep vectors near governed data
Storing embeddings in Cosmos DB / PostgreSQL / Azure SQL means retrieval inherits your existing RBAC, Private Endpoints, backups, and row/column security - you combine similarity search with ordinary filters so retrieval respects entitlements. Use Azure AI Search when you want the best hybrid + semantic retrieval and integrated vectorization. Either way, filter retrieved context to what the requesting user is allowed to see.
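
Combining similarity search with an entitlement filter can be sketched like this. The filter runs before ranking, so a document the user may not see can never enter the context. The document shape, group names, and two-dimensional vectors are hypothetical; a real store would apply the filter inside the query.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, user_groups, k=3):
    """Return IDs of the top-k most similar documents the user is entitled to see."""
    allowed = [d for d in docs if user_groups & d["allowed_groups"]]  # security trim first
    ranked = sorted(allowed, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


if __name__ == "__main__":
    docs = [
        {"id": "hr-salaries", "vector": [1.0, 0.0], "allowed_groups": {"hr"}},
        {"id": "handbook", "vector": [0.8, 0.6], "allowed_groups": {"all-staff"}},
        {"id": "menu", "vector": [0.0, 1.0], "allowed_groups": {"all-staff"}},
    ]
    # The best match is HR-only, so a regular employee does not get it.
    print(retrieve([1.0, 0.0], docs, {"all-staff"}))
```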

RAG architecture on Azure

[Diagram: ingestion path - Docs in Blob / ADLS → Chunk + embed → Azure AI Search index. Runtime query path - User query → Serving layer (App Svc/Function: authz + guardrails) → Retrieve top-k (security-trimmed) → Azure OpenAI (GPT) grounded answer → Audit + Content Safety.]
Ingestion: docs → chunk → embed → Azure AI Search. Runtime: query → governed serving layer → security-trimmed retrieval → Azure OpenAI generates a grounded, audited answer (Content Safety on the loop).

Enterprise patterns

Pattern | How | Watch out for
Chat with documents | RAG over Blob/ADLS + AI Search + Azure OpenAI | Chunking quality; stale index; citations
Chat with database | Retrieve from curated views; grounded answers | Never raw prod OLTP; use a serving layer
Natural language to SQL | Azure OpenAI proposes SQL over a governed schema | Validate/parametrize; read-only; no dynamic SQL
RAG with private data | Private endpoints for OpenAI + Search + storage; security trimming | Entitlement-aware retrieval
Document processing | Document Intelligence → extract → SQL/warehouse | Human review of low-confidence extractions
Call center AI | Speech + Language + Azure OpenAI + bot | Grounding; human escalation
MLOps pipeline | Azure ML pipelines + registry + managed endpoints + monitoring | Reproducibility; drift monitoring
Private GenAI | Private networking + managed identity + Content Safety + audit | Security trimming, prompt-injection defense

Governance and security for GenAI

  • Serving layer, always - agents/LLMs call a governed API (App Service/Function) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
  • Security-trimmed retrieval - filter retrieved context to what the requesting user may see (document/row/column) so RAG cannot leak across users.
  • Private & perimetered - use Private Endpoints for Azure OpenAI, AI Search, storage, and databases; keep model/data traffic off the internet.
  • Credential hygiene - secrets in Key Vault, access via managed identity; the model never sees raw credentials.
  • Auditability - log prompts, retrieved document IDs, and responses (per privacy rules); use Content Safety and groundedness checks.
  • Responsible AI - evaluate quality/safety in prompt flow, monitor deployed models, and require human review for consequential outputs.

Warnings (read before connecting AI to enterprise data)

Do not do these
  • Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
  • Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
  • Protect credentials. No DB passwords, keys, or connection strings in prompts, code, or agent memory. Use Key Vault + managed identity.
  • Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
  • Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
  • Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
  • Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; use Content Safety prompt-shields and isolate/sanitize retrieved and user content.
  • Check Azure OpenAI regional and model availability, quota, and pricing before you design - these vary by region and change frequently.
AI note - the pattern that scales safely
The durable enterprise GenAI shape is: curated/governed data → security-trimmed retrieval → model behind a serving API → validated, audited output, all over private endpoints with Content Safety on the loop. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first.
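
One piece of the governed path, validating model-proposed SQL before it runs, might look like the sketch below. It is a deliberately conservative token check, not a SQL parser: it rejects some harmless queries, and it should sit in front of a read-only database role rather than replace one. The function name, keyword list, and the `reporting.sales_summary` view are hypothetical.

```python
import re

# Keywords that change data, schema, or permissions (assumed list, extend as needed).
FORBIDDEN = {
    "insert", "update", "delete", "merge", "drop", "alter", "create",
    "truncate", "exec", "execute", "grant", "revoke", "into",
}
# Rejects comma-separated table lists such as "FROM a, b", which the table check would miss.
COMMA_JOIN = re.compile(r"\bfrom\s+[\w.]+(?:\s+(?:as\s+)?\w+)?\s*,", re.IGNORECASE)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE)


def is_safe_select(sql: str, allowed_tables: set[str]) -> bool:
    """Accept only a single SELECT over allow-listed tables or views."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty or multiple statements
    if "--" in statement or "/*" in statement:
        return False  # comments can hide payloads
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", statement.lower())
    if not tokens or tokens[0] != "select":
        return False
    if FORBIDDEN.intersection(tokens) or COMMA_JOIN.search(statement):
        return False
    tables = TABLE_REF.findall(statement)
    return bool(tables) and all(t.lower() in allowed_tables for t in tables)


if __name__ == "__main__":
    allowed = {"reporting.sales_summary"}
    print(is_safe_select("SELECT region, total FROM reporting.sales_summary WHERE year = ?", allowed))
    print(is_safe_select("SELECT * FROM dbo.orders", allowed))
```

Values still travel as bound parameters (the `?` placeholder), never concatenated into the statement text.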

13. Migration and Disaster Recovery

Getting workloads into Azure (VMs, databases, data) and keeping them recoverable - Azure Migrate, Site Recovery, Backup, Database Migration Service - plus DR patterns by tier and how RTO/RPO drive architecture and cost.

Last reviewed: July 2026. Verify supported sources for Azure Migrate/ASR/DMS in current docs.
TL;DR

Assess and migrate servers with Azure Migrate (VMware/Hyper-V/physical → VMs), databases with Database Migration Service (+ tooling like DMA/SSMA), and bulk data with Data Box / AzCopy. For DR, choose per tier: backup & restore (cheapest, slow), pilot light, warm standby, or hot/active-active. Azure Site Recovery replicates running workloads for region failover; Front Door / Traffic Manager handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.

Migration tooling

Move | Tooling | Notes
Assess & plan | Azure Migrate | Discovery, dependency mapping, right-sizing, and cost estimates before you move
Servers / VMs | Azure Migrate: Server Migration (agentless/agent) | VMware, Hyper-V, physical, and other-cloud VMs → Azure VMs
Databases (low downtime) | Database Migration Service (DMS) + DMA/SSMA | SQL/Postgres/MySQL to Azure PaaS; SSMA for heterogeneous (Oracle→SQL) conversions
Bulk data | Data Box / AzCopy / Storage Mover | Physical appliance for large sets; online for the rest
Files | Azure File Sync / Storage Mover | Hybrid file access + migration to Azure Files
DR replication | Azure Site Recovery (ASR) | Replicate running VMs to another region for failover

Database migration paths

Source → target | Method | Downtime
SQL Server → Managed Instance / SQL DB | DMS online (log replay), or Managed Instance link (near-real-time) | Near-zero
SQL Server → SQL on VM | Backup/restore, Always On, or ASR | Window-dependent
PostgreSQL/MySQL → Flexible Server | DMS online / native replication | Near-zero
Oracle → SQL / Postgres | SSMA / DMS (heterogeneous conversion) | Low + conversion effort
Oracle → Oracle on Azure VM / DB@Azure | Data Pump / RMAN / Data Guard | Depends on method
DBA note - heterogeneous Oracle moves are conversions
Oracle → Azure SQL/PostgreSQL is a heterogeneous migration: SSMA/DMS help with schema/code conversion and data movement, but PL/SQL, datatypes, and features need real work and testing. If the app must stay on Oracle, plan self-managed Oracle on an Azure VM (Data Pump/RMAN/Data Guard) or Oracle Database@Azure instead of a conversion. The Managed Instance link is a clean low-downtime path for SQL Server.

DR patterns

Pattern | Standby state | RTO | RPO | Cost
Backup & restore | Backups in another region; nothing running | Hours+ | Since last backup | Lowest
Pilot light | Core data replicated (geo-replica / ASR); app off | Tens of min | Small | Low
Warm standby | Scaled-down full stack in the DR region | Minutes | Small | Medium
Hot / active-active | Both regions serving (Front Door + multi-region data) | Near-zero | Near-zero | Highest + complexity

Building blocks: Azure SQL failover groups / geo-replication or Cosmos multi-region (data); GZRS/GRS storage and object replication (objects); Azure Site Recovery for VM replication; Availability Zones for in-region HA and region pairs for DR; and Front Door / Traffic Manager for traffic failover.

Architect note - Front Door simplifies app DR
Front Door (or Traffic Manager) with backends in two regions and health-based routing fails traffic over automatically for stateless tiers on a regional outage. That removes a lot of DR plumbing for the app layer. The hard part remains the data tier: use SQL failover groups / Cosmos multi-region / ASR per your RTO/RPO, and rehearse the failover (including that the app repoints to the promoted database).
Common mistake - active-active data is hard
Stateless tiers go active-active easily behind Front Door; stateful databases generally do not without conflict handling. Most "active-active" requirements are satisfied by active/passive with fast failover (failover groups) or a natively multi-region store (Cosmos DB). Don't take on multi-master complexity unless the requirement truly demands it.

RTO and RPO

  • RTO - how long you can be down → drives standby readiness and automation.
  • RPO - how much data you can lose → drives replication mode (sync HA vs async geo-replication vs backup interval).
  • Zero data loss needs synchronous replication (zone-redundant HA, Business Critical, or a sync AG) and low latency; GRS/geo-replication are async (an RPO applies). Verify the network and trade-offs.
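
How RTO and RPO pick a pattern can be sketched as a small decision function. The numeric thresholds (240, 30, and 5 minutes for RTO; 60 minutes for RPO) are illustrative assumptions chosen to mirror the "hours / tens of minutes / minutes / near-zero" bands, not Azure guidance; substitute your own tiers.

```python
def pick_dr_pattern(rto_minutes: float, rpo_minutes: float) -> tuple[str, str]:
    """Return (DR pattern, replication mode) for the given targets."""
    # RTO drives standby readiness (assumed thresholds).
    if rto_minutes >= 240:
        pattern = "backup & restore"
    elif rto_minutes >= 30:
        pattern = "pilot light"
    elif rto_minutes >= 5:
        pattern = "warm standby"
    else:
        pattern = "hot / active-active"

    # RPO drives replication mode (assumed threshold for backup-only).
    if rpo_minutes <= 0:
        replication = "synchronous"
    elif rpo_minutes < 60:
        replication = "async geo-replication"
    else:
        replication = "backup interval"

    # Backup & restore has no live replica, so a tight RPO forces at least pilot light.
    if pattern == "backup & restore" and replication != "backup interval":
        pattern = "pilot light"
    return pattern, replication


if __name__ == "__main__":
    print(pick_dr_pattern(rto_minutes=480, rpo_minutes=5))
```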

Architecture examples

[Diagram: Azure Front Door in front of Region A (primary): App (zone-spread) + Azure SQL (primary), and Region B (paired, DR): App (warm) + SQL geo-secondary; the two databases are joined by a failover group.]
Front Door fronts both regions (auto failover for stateless tiers); Azure SQL uses a failover group with a geo-secondary promoted on failover.
  • On-prem VM → Azure: Azure Migrate server migration.
  • SQL Server → Azure: DMS / Managed Instance link, cut over at low lag.
  • Oracle → Azure: self-managed on VM (Data Guard) or Oracle DB@Azure; if converting, SSMA targets SQL Server/Azure SQL, while a move to PostgreSQL needs a Postgres-specific converter such as ora2pg.
  • Cross-region DR (app): Front Door + warm tier + storage replication.
  • Cross-region DR (database): failover groups / geo-replication / Cosmos multi-region.
  • Backup-based DR: Azure Backup with cross-region restore; rebuild on demand.
  • ASR pattern: replicate IaaS VMs to the paired region, fail over with recovery plans.

DR testing

DR you have never tested is a hope, not a plan
Run regular drills: ASR test failover (isolated, non-disruptive), failover-group failover, and full app validation in DR (not just "the DB opened"). Verify RTO/RPO are actually met, that Key Vault CMK keys exist and are usable in the DR region (a missing key makes encrypted data/replica unusable), that Front Door/Traffic Manager failover works, and that runbooks and connection strings are current. Use ASR recovery plans to codify and rehearse.
  • Geo-secondary / ASR replication within RPO; failover rehearsed.
  • CMK keys present and usable in the DR region.
  • App tier can start and connect in DR; config points to DR endpoints.
  • Front Door / Traffic Manager failover tested and time-measured.
  • Object data (GZRS/GRS or replicated) within RPO.
  • Capacity available in DR; runbook / recovery plan current.
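
A drill is only useful if its result is compared against the targets. The sketch below records a few of the checklist items and reports which ones failed; the class, field names, and sample numbers are hypothetical, and a real drill report would cover every item in the list above.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    measured_rto_min: float      # time from declared outage to validated service in DR
    measured_rpo_min: float      # data loss window observed at failover
    cmk_usable_in_dr: bool       # customer-managed keys present and usable in the DR region
    app_validated_in_dr: bool    # full application check, not just "the DB opened"


def drill_failures(result: DrillResult, target_rto_min: float, target_rpo_min: float) -> list[str]:
    """List the checks that did not pass; an empty list means the drill met its targets."""
    failures = []
    if result.measured_rto_min > target_rto_min:
        failures.append("RTO missed")
    if result.measured_rpo_min > target_rpo_min:
        failures.append("RPO missed")
    if not result.cmk_usable_in_dr:
        failures.append("CMK key unusable in DR region")
    if not result.app_validated_in_dr:
        failures.append("application not validated in DR")
    return failures


if __name__ == "__main__":
    drill = DrillResult(measured_rto_min=42, measured_rpo_min=3,
                        cmk_usable_in_dr=False, app_validated_in_dr=True)
    print(drill_failures(drill, target_rto_min=30, target_rpo_min=5))
```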

14. Cost Management and Governance

How Azure charges, the tools to track and cap spend (Cost Management, budgets, exports), the discount levers (Reservations, Savings Plan, Hybrid Benefit, Spot), and the governance model - ending in a monthly cost-review checklist.

Last reviewed: July 2026. Pricing and discount models change - verify all rates on the pricing pages.
TL;DR

Azure bills mainly by compute (VM size-hours), storage GB + transactions + redundancy, database vCore/DTU, and data egress. Track spend with Microsoft Cost Management, budgets, and cost exports to a storage account for your own analysis; control it with budgets (alert) and quotas (block). Big levers: Reserved Instances / Savings Plan for Compute for baseline, Azure Hybrid Benefit for Windows/SQL licenses, Spot for interruptible work, right-sizing, storage lifecycle, and reducing Log Analytics ingestion. Governance = management groups + Azure Policy + budgets + tags.

Pricing basics

Dimension | Charged on | Notes
VMs | Size per hour (+ OS licensing) | RIs / Savings Plan; Hybrid Benefit; Spot; Dev/Test pricing
Managed disks | Provisioned GB + tier (+ IOPS/throughput for v2/Ultra) | Snapshots accumulate; choose SKU deliberately
Storage accounts | GB + tier + transactions + redundancy (LRS<ZRS<GRS<GZRS) | Lifecycle to cool/archive; watch egress/retrieval
Azure SQL / databases | vCore/DTU + storage + backups (LTR) | Serverless auto-pause; Hybrid Benefit; right-size tier
Networking | Egress + inter-region; Application Gateway & Azure Firewall have hourly + data costs | Firewall/App Gateway are often surprising line items - right-size and consolidate
Log Analytics | Ingestion (GB) + retention | Noisy logs get expensive - filter and set retention/basic-logs tiers
Cost note - the sneaky ones
Azure Firewall, Application Gateway, and Log Analytics ingestion are frequently underestimated. Consolidate firewalls in the hub (don't run one per spoke), right-size App Gateway (autoscale v2), and aggressively filter/route logs (exclude chatty sources, use Basic Logs / Auxiliary tiers, set retention). GRS/GZRS redundancy also roughly doubles the per-GB storage price compared with LRS/ZRS, because a second copy is kept in the paired region.

Cost tracking tools

Tool | Does
Cost Management + cost analysis | Spend by subscription, RG, resource, service, tag, time; forecasts.
Budgets + cost alerts | Track spend against a target at MG/subscription/RG scope; alert at thresholds (and trigger automation). Budgets notify - they don't block.
Cost exports | Scheduled detailed usage to a storage account for your own BI/analysis.
Quotas | Per-subscription limits - the "block" control (e.g. cap vCPU by family/region).
Advisor | Cost (right-sizing, idle, reservation) + reliability/security/performance recommendations.
Pricing / TCO calculators | Estimate before you build.
Cost note - budgets alert, quotas enforce
Use both: a budget warns you spend is trending over; a quota stops a subscription creating the expensive thing. Attribute everything via tags (enforced by Policy) + cost exports so chargeback works. This only works if tagging is enforced from the start (section 1).
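
The "trending over" idea can be sketched with a simple run-rate projection: extrapolate month-to-date spend to month end and see which alert thresholds the actual and projected figures cross. This is a linear approximation for illustration, not the forecast model Cost Management uses; the function names, thresholds, and amounts are assumptions.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, today: date) -> float:
    """Project month-end spend assuming the daily run rate so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def crossed_thresholds(amount: float, budget: float, thresholds=(0.5, 0.8, 1.0)) -> list[float]:
    """Return the budget fractions that `amount` has reached or exceeded."""
    return [t for t in thresholds if amount >= budget * t]


if __name__ == "__main__":
    today = date(2026, 6, 10)      # hypothetical review date
    actual, budget = 600.0, 1500.0  # hypothetical spend and budget
    projected = forecast_month_end(actual, today)
    print("actual alerts:", crossed_thresholds(actual, budget))
    print("forecast alerts:", crossed_thresholds(projected, budget))
```

Alerting on the projected figure gives earlier warning than alerting on actual spend alone, which is why both are worth tracking.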

Discounts

  • Reserved Instances (RIs) - a 1- or 3-year commitment to specific VM families/regions (or other services) for a big discount on steady state.
  • Savings Plan for Compute - a 1- or 3-year hourly-spend commitment that flexes across VM families/regions (more flexible than RIs, sometimes a smaller discount).
  • Azure Hybrid Benefit - apply existing Windows Server / SQL Server licenses to Azure - often the single biggest saving for a Microsoft shop.
  • Spot VMs - 60-90% off for interruptible workloads.
  • Dev/Test pricing - reduced rates for non-production under Dev/Test subscriptions.
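
Whether a commitment pays off depends on utilization. As a simplified sketch: if a reservation costs `(1 - discount)` of the on-demand rate for every hour of the term, it breaks even when the resource would otherwise have run for `1 - discount` of the hours. The discount and rate below are hypothetical; real rates vary by SKU, region, and term.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term the resource must run for the reservation to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def reservation_saving(on_demand_hourly: float, discount: float,
                       hours_used: float, hours_in_term: float) -> float:
    """Positive = reservation is cheaper than on-demand; negative = it costs more."""
    on_demand_cost = on_demand_hourly * hours_used
    reserved_cost = on_demand_hourly * (1.0 - discount) * hours_in_term  # paid for every hour
    return on_demand_cost - reserved_cost


if __name__ == "__main__":
    # Hypothetical: 40% discount, 1.00/hour on-demand, one-year term (8760 hours).
    print(break_even_utilization(0.40))
    print(reservation_saving(1.00, 0.40, hours_used=8760, hours_in_term=8760))
    print(reservation_saving(1.00, 0.40, hours_used=4380, hours_in_term=8760))
```

This is why reservations fit the steady baseline, while bursty or short-lived capacity is better left on demand or moved to Spot.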

Governance model

Governance is enforced through the same primitives as security: the management-group hierarchy and subscriptions (isolation + attribution), Azure Policy (restrict regions/SKUs, require tags, deny public exposure), budgets + quotas, resource locks, and tags - all deployed as a landing zone in code. This keeps spend controlled and attributable by design.

Cost optimization examples

Action | Typical saving | Effort
Stop / auto-shutdown non-prod VMs off-hours | High (up to ~65-70% of that compute) | Low
Right-size VMs (Advisor) | High | Low
Reserved Instances / Savings Plan for baseline | High | Medium
Azure Hybrid Benefit (Windows/SQL) | Very high (Microsoft licenses) | Low
Spot VMs for interruptible / batch | Very high (60-90%) | Medium
Choose correct disk type / delete unused disks | Medium | Low
Storage lifecycle to cool/archive | Medium-High | Low
Reduce Log Analytics ingestion (filters, tiers, retention) | Medium-High | Low
Reduce cross-region traffic / consolidate firewalls | Medium | Medium
Delete old snapshots & unused public IPs | Medium | Low
Database right-sizing / serverless auto-pause | Medium-High | Low
Cost note - cheap wins first
Before re-architecting: auto-shutdown non-prod, act on Advisor right-sizing, apply Azure Hybrid Benefit, buy RIs/Savings Plan for baseline, and cut Log Analytics ingestion. Idle public IPs, orphaned disks, over-sized App Gateway/Firewall, and unbounded log ingestion quietly add up - review monthly.
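
The auto-shutdown figure is simple arithmetic over the 168 hours in a week. The sketch below shows where a "~65-70%" saving comes from, assuming the VM is deallocated outside working hours; the schedules are examples, and the saving applies to the compute meter only, since managed disks keep billing while the VM is deallocated.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def shutdown_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly compute hours avoided by running only on the given schedule."""
    running = hours_per_day * days_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule exceeds the hours in a week")
    return 1 - running / HOURS_PER_WEEK


if __name__ == "__main__":
    print(round(shutdown_saving(12, 5), 3))  # 12 h/day, weekdays only
    print(round(shutdown_saving(10, 5), 3))  # 10 h/day, weekdays only
```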

Monthly Azure cost review checklist

  • Review Cost Management month-over-month by subscription, service, and tag; investigate spikes.
  • Check each budget: which subscriptions/RGs/tags are over or trending over target.
  • Act on Advisor right-sizing and idle-resource recommendations.
  • Confirm non-prod auto-shutdown ran (nothing running 24x7 by accident).
  • Find and delete unused managed disks, orphaned snapshots, and idle VMs.
  • Release unassociated public IPs (Standard IPs bill when idle).
  • Review RI / Savings Plan coverage and utilization; buy/adjust; confirm Hybrid Benefit applied.
  • Log Analytics: top ingestion sources; add filters / Basic Logs; check retention.
  • Storage: lifecycle rules moving cold data to cool/archive; review redundancy choices.
  • Right-size Azure SQL / databases vs. utilization; serverless auto-pause where suitable.
  • Review App Gateway / Azure Firewall sizing and consolidation.
  • Review egress / inter-region charges; co-locate chatty services; use private access.
  • Confirm every resource is tagged (cost-center/env/owner) for attribution.
  • Validate quotas still reflect intent; check for anomalous new spend by service.
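
The tagging item on this checklist lends itself to automation. A minimal sketch, assuming you already have a resource inventory as a list of dictionaries (for example from a cost export or a resource query): it reports which required tags each resource lacks. The tag names follow the checklist; the inventory shape and resource names are hypothetical.

```python
REQUIRED_TAGS = ("cost-center", "env", "owner")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource name → list of required tags that are absent or empty."""
    report = {}
    for res in resources:
        # Compare tag names case-insensitively.
        tags = {k.lower(): v for k, v in (res.get("tags") or {}).items()}
        missing = [t for t in required if not tags.get(t)]
        if missing:
            report[res["name"]] = missing
    return report


if __name__ == "__main__":
    inventory = [
        {"name": "vm-app-01", "tags": {"cost-center": "1234", "env": "prod", "owner": "team-a"}},
        {"name": "disk-old-07", "tags": {"Env": "dev"}},
        {"name": "pip-unused", "tags": None},
    ]
    print(missing_tags(inventory))
```

A report like this complements, rather than replaces, an Azure Policy that requires the tags at creation time.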

15. Enterprise Architecture Patterns

Reference blueprints for real Azure deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026. Blueprints are starting points - validate sizing/services against current docs and requirements.
HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: Front Door/App Gateway + WAF → private compute (App Service / VMSS / AKS) → managed database on a Private Endpoint → Private Endpoints for PaaS → centralized Log Analytics → cross-region DR, all inside a governed landing zone (management groups + Policy + hub-and-spoke).

Foundational three-tier (reference backbone)

Three-tier enterprise application
The pattern most others extend
[Diagram: Users → Front Door + WAF → Spoke VNet (hub-and-spoke): App tier App Service / VMSS (no public IP) across zone 1 and zone 2 → Data: Azure SQL (Private Endpoint, HA), zone-redundant; supporting NAT / FW, Private Endpoint, Key Vault, and Log Analytics.]
Business case: Standard internal/external web or enterprise app needing HA and controlled exposure.
Services: Hub-and-spoke VNet, Front Door + WAF (or App Gateway), App Service/VMSS, Azure SQL (Private Endpoint), NAT/Firewall, Key Vault, Azure Monitor/Log Analytics.
Traffic flow: User → Front Door/WAF → app (private) → SQL (Private Endpoint); egress via NAT/firewall; PaaS via Private Endpoints.
Security: No public IPs on workloads; NSG/ASG; DB private; CMK; secrets in Key Vault; Azure Policy + Private Endpoints; Defender on.
HA: Zone-spread app + zone-redundant SQL; Front Door/App Gateway health probes.
DR: Second-region backends behind Front Door + SQL failover group.
Monitoring: Backend health, app + DB metrics; alerts → action groups; central Log Analytics.
Cost: App Service/VMSS right-size + RIs/Savings Plan; Hybrid Benefit; storage lifecycle.
Risks / mistakes: Probe NSG rule missing; DB public endpoint; no zone spread; secrets in app settings; Private DNS not linked.

Pattern library

Simple web application (Small)
Case: Low-complexity site/app, cost-sensitive.
Services: App Service + Azure SQL (or Cosmos) + Blob for assets + Front Door + WAF.
HA/DR/cost: App Service zone-redundant; SQL HA; scale rules. Risk: public DB endpoint, no backups tested.
Highly available application (HA)
Case: Must survive zone (and ideally region) failure.
Services: Zone-spread VMSS/App Service, zone-redundant Azure SQL, Front Door, zone-redundant LB/App Gateway.
DR: Second-region backends + SQL failover group. Risk: state on a single zonal disk; untested failover.
Private enterprise application (Regulated)
Case: Internal-only, reachable from on-prem, no public footprint.
Services: Private subnets, internal App Gateway/LB, ExpressRoute/VPN via hub, Private Endpoints, Bastion, no public IPs.
Risk: CIDR overlap; Private DNS not linked; transitive-peering assumption.
Hub-and-spoke / centralized networking & security & logging (Platform)
Case: Many subscriptions with centrally-governed network, security, and logging.
Services: Connectivity subscription (hub VNet, Azure Firewall, gateways, DNS, Bastion), management subscription (Log Analytics, Automation, Backup), Defender + Sentinel, MG-level Policy.
Risk: Firewall per spoke (cost); shadow VNets; missing UDR return routes.
Multi-subscription landing zone (Governance)
Case: Governed foundation before workloads land.
Services: Management-group hierarchy, platform + landing-zone subscriptions, baseline RBAC (groups) + PIM, preventive Azure Policy, hub network, central logging + Defender, budgets/quotas, locks - all IaC (CAF ALZ).
Risk: Skipping it and retrofitting governance later.
SQL Server enterprise workload (SQL)
Case: Lift a SQL Server estate to managed.
Services: SQL Managed Instance (Business Critical + zone redundancy), delegated subnet/Private Endpoint, failover group, Hybrid Benefit, Defender for SQL.
Risk: Unsupported instance feature; public endpoint; untested failover.
Oracle workload on Azure (Oracle)
Case: Run Oracle on Azure.
Services: Self-managed Oracle on E/M VMs + Premium SSD v2/Ultra or NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU/Dedicated Host for licensing; or Oracle Database@Azure for managed Oracle.
Risk: Licensing/counting on Azure; storage IOPS undersized; verify DB@Azure availability.
Azure SQL application (PaaS DB)
Case: New relational app backend.
Services: Azure SQL Database (vCore, zone-redundant), Private Endpoint, Entra auth + managed identity, PITR + LTR, failover group for DR, Query Performance Insight.
Risk: Public endpoint; HA not enabled; instance features expected (use MI instead).
Data warehouse / data lake (Data)
Case: Enterprise analytics on curated + raw data.
Services: ADLS Gen2 / OneLake (bronze/silver/gold) + Fabric/Synapse/Databricks + Data Factory + Event Hubs + Purview + Power BI; Private Endpoints.
Cost/risk: Capacity/query controls; column-level security. Risk: ungoverned "data swamp"; runaway compute.
Kubernetes platform (Cloud native)
Case: Container platform for many microservices with CI/CD.
Services: Private AKS in a spoke VNet, AGIC/Gateway ingress, Workload Identity, ACR (scanned) + Defender for Containers, Azure Policy for AKS, pipeline (GitHub Actions/Azure DevOps).
Risk: Pod IP exhaustion (CNI); over-privileged Workload Identity; public API server.
Serverless application (Serverless)
Case: Event-driven / stateless services with minimal ops.
Services: Container Apps / Functions + Front Door + Event Grid/Service Bus + Cosmos/SQL + Key Vault; VNet integration to private data.
Cost/risk: Scale-to-zero. Risk: cold starts for spiky critical paths (min replicas/Premium plan).
Event-driven architecture (Events)
Case: Decoupled, resilient pipelines.
Services: Event Grid + Functions/Container Apps + Service Bus + Event Hubs; dead-letter queues.
Risk: Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers.
Hybrid cloud (Hybrid)
Case: Workloads split across on-prem and Azure.
Services: ExpressRoute (primary) + VPN (backup) via hub, hub-and-spoke / Virtual WAN, hybrid DNS (Private Resolver), Azure Arc for on-prem management.
Risk: CIDR overlap; transitive-peering assumption; single link; asymmetric routing.
Multi-region DR (DR)
Case: Business-critical stack needing regional resilience.
Services: Front Door with multi-region backends, SQL failover group / Cosmos multi-region, GZRS storage, Azure Site Recovery, reservations.
Risk: Untested DR; CMK key missing in DR region; capacity unavailable at failover.
Secure landing zone (Security)
Case: Preventive-guardrail foundation.
Services: Azure Policy (deny public IP/blob, location/SKU restrictions, require diagnostics), PIM + Conditional Access, hub firewall, Private Link strategy, central Defender + Sentinel + logging, Key Vault, budgets/quotas - as code.
Risk: Guardrails off; over-broad break-glass; standing Owner.
GenAI with private enterprise data (AI)
Case: RAG/assistant over internal data, governed.
Services: Blob/ADLS + Azure AI Search + Azure OpenAI behind an App Service/Function serving API + Key Vault + Private Endpoints + Content Safety + Log Analytics.
Flow / risk: Query → serving layer (authz + guardrails) → security-trimmed retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings).
Common mistakes across all patterns
  • Databases/services on public endpoints "to get it working"; Private DNS zone not linked.
  • No zone/region spread - a zone event takes the whole "HA" tier.
  • Health-probe NSG rule (and App Gateway management ports) forgotten - unhealthy backends on day one.
  • DR designed but never tested; CMK keys missing in the DR region.
  • Secrets in app settings/code instead of Key Vault; standing Owner instead of PIM.
  • No centralized logging/Defender until an incident needs it.
  • CIDR overlap / transitive-peering assumption discovered during hybrid setup.
  • Landing zone / Azure Policy skipped and retrofitted painfully later.

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with portal path, CLI, and PowerShell where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.

Last reviewed: July 2026 Verify CLI/PowerShell syntax with az <group> --help / Get-Help.
General method
Work top-down: identity/API (right tenant/subscription? role/scope? PIM active? provider registered?) → network (Effective routes + IP Flow Verify + DNS) → host/service (listening/healthy?) → data. For "cannot reach," use Network Watcher (IP Flow Verify / Connection Troubleshoot); for "access denied," use IAM > Check access and the section 2 model. Control-plane operations are recorded in the Activity Log; resource and guest logs reach Log Analytics only where diagnostic settings or the Azure Monitor Agent are configured.

Compute & access

⚑ VM not reachable / SSH / RDP / Bastion issue

Causes: NSG blocking 22/3389 from your source; VM has no public IP and you're not using Bastion; Bastion subnet (AzureBastionSubnet) missing/misconfigured or RBAC missing; VM stopped/boot failed; OS firewall; wrong credentials. Checks: Effective security rules + IP Flow Verify; boot diagnostics/serial console; Bastion deployment. Fix: use Bastion (no public IP needed), allow the source in the NSG, reset password/SSH via "Reset password"/Run Command. Prevention: standardize Bastion + no public IPs.

az network watcher test-ip-flow -g RG --vm VM --direction Inbound --protocol TCP --local VM_IP:22 --remote 'SRC_IP:*'
az vm boot-diagnostics get-boot-log -g RG -n VM
⚑ VM boot issue

Causes: bad fstab mount, full OS disk, driver/kernel issue, failed extension. Checks: boot diagnostics screenshot + serial console. Fix: use the VM "Repair" (attach OS disk to a rescue VM) to fix config; keep disk snapshots. Prevention: test image/extension changes in non-prod.

⚑ High CPU / memory pressure / disk full

CPU: check the Azure Monitor trend and, on the host, top; then right-size or autoscale (VMSS). Memory: requires the Azure Monitor Agent (guest memory is not collected by default) - deploy via VM Insights, then right-size. Disk full: expand the managed disk, grow the partition/filesystem, alert at 85%.

⚑ Managed disk attachment issue

Causes: disk in a different region/zone than the VM; not initialized; LUN mapping; disk still attached elsewhere. Checks: disk state + LUN; lsblk/Disk Management. Fix: attach in the same region/zone, initialize/mount, add to fstab by UUID. Prevention: automate via extension/cloud-init.

Storage

⚑ Storage account access denied / Blob public access issue

Denied: missing RBAC data role (needs Storage Blob Data Reader/Contributor - control-plane roles like Contributor don't grant data access); wrong or expired SAS; firewall blocking; "allow shared key" disabled but code uses a key; VNet/Private Endpoint restriction; resource provider not registered. Public access: account-level "allow blob public access" is off (correct) - use SAS/RBAC instead. Checks: IAM data roles; storage firewall; az storage blob list --account-name <account> --container-name <container> --auth-mode login. Fix: grant the data role to the managed identity; use Private Endpoint + Entra auth.

⚑ Azure Files mount issue

Causes: port 445 blocked (ISP/NSG) for SMB; wrong credentials (storage key or Entra/AD Kerberos); Private Endpoint DNS not resolving; NFS export/rules wrong. Fix: use Private Endpoint (avoids 445-over-internet), configure identity-based auth, verify Private DNS, check firewall.

Network

⚑ DNS / NSG / route / Azure Firewall / NAT / Private Endpoint / peering / Private DNS

Method: IP Flow Verify (which NSG rule decides), Effective routes/rules on the NIC (real next hop + merged rules), Connection Troubleshoot, and nslookup. Per case:

  • NSG: priority/direction; default deny-inbound; missing allow.
  • Route (UDR): forced-tunnel to firewall with no return route (black hole); missing route.
  • Azure Firewall: app/network rule collection missing the FQDN/port; DNAT config; UDR not pointing at it.
  • NAT Gateway: outbound only; SNAT port exhaustion; not on the subnet.
  • Private Endpoint: Private DNS zone not linked / no A record - FQDN resolves to public IP.
  • Peering: not transitive; overlap; missing "allow forwarded traffic"/gateway transit.
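
The Private Endpoint case (FQDN resolving to a public IP) can be checked from a VM inside the VNet. A minimal sketch using only the standard library; the account name in the usage comment and the sample addresses are hypothetical, and `resolve` needs working DNS on the host where it runs.

```python
import ipaddress
import socket


def classify_addresses(addresses):
    """Label each resolved IP as 'private' or 'public'."""
    return {
        a: ("private" if ipaddress.ip_address(a).is_private else "public")
        for a in addresses
    }


def resolve(fqdn):
    """Resolve an FQDN from the current host (requires network/DNS access)."""
    infos = socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# From a VM inside the VNet (hypothetical account name):
#   classify_addresses(resolve("stappprod01.blob.core.windows.net"))
result = classify_addresses(["10.10.1.5", "20.60.1.10"])
```

A "public" answer from inside the VNet usually points at the Private DNS zone link or a missing A record, as listed above.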
⚑ VPN down / ExpressRoute issue

Causes: IKE/PSK mismatch (VPN); BGP not advertising/learning; ExpressRoute circuit/peering down or route filters; gateway SKU bandwidth exhausted; CIDR overlap. Checks: connection/circuit status; BGP routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU.

Load balancing & databases

⚑ App Gateway backend unhealthy / LB probe failure / SSL cert

Unhealthy: NSG blocks the health probe or App Gateway management ports (GatewayManager service tag) on the App Gateway subnet; wrong probe host/path/port/protocol; app on localhost; backend HTTP settings mismatch. LB: probe port not open or the AzureLoadBalancer probe source blocked by an NSG; backends behind a Standard LB get outbound connectivity only via an outbound rule, NAT Gateway, or instance public IP. SSL: managed cert needs DNS → frontend first; Key Vault access for the gateway identity; chain/SAN. Fix: per section 7.

⚑ Azure SQL connection / performance / backup issue

Connection: firewall/public access disabled but no Private Endpoint path; Private DNS not resolving; Entra token/permission; wrong connection string (server FQDN); transient errors (retry logic). Performance: Query Performance Insight / Query Store; DTU/vCore/log-IO limits; missing indexes; enable automatic tuning. Backup: PITR/LTR retention config; test a restore. Checks: az sql db show; audit/diagnostic logs.
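
The "transient errors (retry logic)" point is usually handled with capped exponential backoff plus jitter. A minimal sketch: `TransientError` is a hypothetical stand-in, and mapping real driver error codes to it is left to the caller. Many database drivers and SDKs ship their own retry policy, which is preferable where available.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable fault (hypothetical; map real driver errors to it)."""


def retry(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation(); on TransientError wait up to base_delay * 2**n (capped), then retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                                  # budget exhausted: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))            # full jitter avoids synchronized retries
```

Only errors known to be transient should be retried; permission or syntax errors should fail immediately, which this sketch does by catching `TransientError` alone.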

Identity

⚑ RBAC permission denied / managed identity / app secret expired

Denied: walk the section 2 model - right tenant/subscription? which principal? role + control vs data plane? scope/inheritance? deny assignment? Azure Policy? Conditional Access? PIM eligible but not activated? Managed identity: is it assigned to the resource, and does it have the target role (data role too)? propagation delay after assignment. App secret expired: a service principal's client secret/cert expired - rotate it (and move to a managed identity / workload identity federation to avoid recurrence). Tools: IAM > Check access; Entra sign-in logs (Conditional Access); Activity Log.
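
Expired app secrets are predictable, so they can be reported before they break a pipeline. A minimal sketch under stated assumptions: the `name`/`end` record shape is hypothetical, and the end dates would have to be collected separately (for example from `az ad app credential list` output).

```python
from datetime import datetime, timedelta, timezone


def expiring(credentials, now, warn_days=30):
    """Return (name, days_left) for credentials expired or expiring within warn_days."""
    limit = now + timedelta(days=warn_days)
    return sorted(
        (c["name"], (c["end"] - now).days)
        for c in credentials
        if c["end"] <= limit
    )


# Hypothetical inventory
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
inventory = [
    {"name": "ci-deployer", "end": datetime(2026, 7, 15, tzinfo=timezone.utc)},
    {"name": "legacy-app", "end": datetime(2026, 6, 20, tzinfo=timezone.utc)},
    {"name": "reporting", "end": datetime(2027, 1, 1, tzinfo=timezone.utc)},
]
report = expiring(inventory, now)
```

A negative `days_left` means the credential has already expired. Moving the workload to a managed identity or workload identity federation removes the entry from the report for good.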

Serverless & AKS

⚑ Function timeout / trigger issue / Container App revision

Function timeout: Consumption plan caps duration (5 min by default, 10 min maximum) - use Premium/Flex, make idempotent, offload long work. Trigger not firing: check the binding connection (managed identity / connection string), the source (queue/blob/Event Grid subscription + filter), and function logs. Container Apps: new revision failing - container must listen on the target port, check ingress + scale rules (min replicas) + managed identity RBAC; roll back or split traffic.

⚑ AKS pod not starting

Causes: Pending (capacity / pod-IP exhaustion with Azure CNI / requests too big), ImagePullBackOff (grant AcrPull to the kubelet identity; private ACR needs endpoint/DNS), CrashLoopBackOff (config/probes). Tools: kubectl describe/logs, node capacity, subnet IPs. (Section 10.)
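
Pod-IP exhaustion is arithmetic that can be done before the cluster exists. A minimal sketch, assuming the traditional Azure CNI mode in which every pod slot takes a VNet address and one surge node is added during upgrades; overlay or dynamic-allocation modes size differently, so verify against the current AKS networking docs. The function names and node counts are hypothetical.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5   # Azure reserves 5 addresses in every subnet


def ips_required(nodes, max_pods_per_node, surge_nodes=1):
    """IPs consumed when each node takes one address plus one per pod slot."""
    total_nodes = nodes + surge_nodes          # extra node(s) created during upgrades
    return total_nodes * (1 + max_pods_per_node)


def subnet_capacity(cidr):
    """Usable addresses in an Azure subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


needed = ips_required(nodes=10, max_pods_per_node=30)   # 11 * 31 = 341
fits_24 = needed <= subnet_capacity("10.10.1.0/24")     # 251 usable
fits_23 = needed <= subnet_capacity("10.10.2.0/23")     # 507 usable
```

Ten nodes at 30 pods each do not fit a /24 under these assumptions, which is why a cluster can run out of addresses well before it runs out of CPU.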

Monitoring

⚑ Azure Monitor alert not firing / diagnostic logs missing

Alert: wrong signal/scope/threshold, evaluation window never met, action group has no verified receiver, alert disabled, or maintenance suppression. Test by forcing the condition; check the alert's history. Logs missing: diagnostic settings not enabled on the resource, Azure Monitor Agent / DCR not deployed on the VM, wrong Log Analytics workspace, or ingestion delay/retention expired. Fix: enable diagnostic settings (enforce via Policy), deploy the agent via VM Insights, verify the workspace.
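
"Evaluation window never met" is easiest to see with numbers. The sketch below is a simplified model, not Azure Monitor's actual evaluator: it averages one-minute samples over a sliding window and compares the average with the threshold. The sample data is hypothetical.

```python
def alert_fires(samples, window, threshold):
    """True if the average of any `window` consecutive samples exceeds threshold."""
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        if sum(chunk) / window > threshold:
            return True
    return False


# One-minute CPU samples (hypothetical data)
spike = [40, 40, 99, 40, 40, 40, 40]        # one hot minute
sustained = [40, 90, 92, 95, 91, 93, 40]    # five hot minutes

spike_fires = alert_fires(spike, window=5, threshold=85)
sustained_fires = alert_fires(sustained, window=5, threshold=85)
```

With Average aggregation over five minutes, the single 99% minute is diluted to about 52% and never fires; only the sustained load does. If short spikes matter, a Maximum aggregation or a shorter window is the usual adjustment.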

17. Azure CLI, PowerShell, Bicep, ARM & Terraform

Practical, copy-friendly automation: CLI/PowerShell setup and tenant/subscription selection, managed identity, Bicep vs ARM vs Terraform, clean examples for VNet, VM, storage, RBAC, and alerts - plus state, structure, and CI/CD practices.

Last reviewed: July 2026 Verify provider/API versions and resource argument names in current docs.
TL;DR

Two CLIs: Azure CLI (az) and Azure PowerShell (Az) - pick a house standard. Always set the right tenant + subscription before acting. Prefer managed identity / federated credentials over service-principal secrets. For IaC, use Bicep (Azure-native, readable) or Terraform (azurerm, multicloud) - both beat hand-written ARM JSON. Keep Terraform state remote and locked (a storage account with blob lease), structure into modules, separate environments, and deploy via a pipeline using a federated (secretless) identity.

Azure CLI and PowerShell setup

# Azure CLI
az login                                  # interactive
az account list -o table                   # subscriptions you can see
az account set --subscription "Prod-01"    # select subscription
az account show                            # confirm tenant + subscription

# Azure PowerShell
Connect-AzAccount
Set-AzContext -Subscription "Prod-01"
Get-AzContext

Auth & managed identity (no secrets)

# On an Azure resource with a managed identity, tools authenticate automatically:
az login --identity                        # system-assigned MI
az login --identity --client-id <clientId>  # user-assigned MI (older CLI versions: --username <clientId>)

# CI/CD: use OIDC workload identity federation (no client secret)
# - create an app registration + federated credential for the pipeline
# - the pipeline exchanges its OIDC token for an Azure token; nothing to store
Security note
Prefer managed identities for Azure-hosted automation and workload identity federation (OIDC) for external CI (GitHub Actions/Azure DevOps) - both avoid long-lived service-principal secrets. If a secret is unavoidable, store it in Key Vault, scope the SP minimally, and rotate it.

Common commands

# Resource groups & providers
az group create -n rg-app-prod -l eastus2
az provider register --namespace Microsoft.Sql

# Compute
az vm list -o table
az vm create -g rg-app-prod -n web1 --image Ubuntu2204 --public-ip-address "" --size Standard_D2s_v5

# Storage
az storage account create -g rg-app-prod -n stappprod01 --sku Standard_ZRS --allow-blob-public-access false
az storage blob upload-batch --account-name stappprod01 -d container -s ./data --auth-mode login

# RBAC
az role assignment create --assignee-object-id <objId> --role "Storage Blob Data Reader" --scope <resourceId>
az role assignment list --assignee <objId> --all -o table

# Logs (KQL)
az monitor log-analytics query -w <workspaceId> --analytics-query "AzureActivity | take 20"

Bicep vs ARM vs Terraform

Aspect   | Bicep | ARM templates | Terraform
Language | Concise DSL → compiles to ARM | Verbose JSON | HCL (azurerm provider)
Best for | Azure-only, native, day-1 feature parity | Legacy; avoid authoring by hand | Multicloud / standardized IaC, rich ecosystem
State    | None (ARM tracks deployments) | None | You manage remote state
Architect note
Both Bicep and Terraform are valid - pick one house standard. Bicep if you're Azure-only and want native, immediate feature support with no state to manage. Terraform if you standardize IaC across clouds or want its module/ecosystem. Don't author raw ARM JSON by hand; if you have ARM, decompile to Bicep.

Create a VNet + subnet + NSG

// Bicep: network.bicep
resource vnet 'Microsoft.Network/virtualNetworks@2023-11-01' = {
  name: 'vnet-app'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.10.0.0/20' ] }
    subnets: [
      {
        name: 'snet-app'
        properties: {
          addressPrefix: '10.10.1.0/24'
          networkSecurityGroup: { id: nsg.id }
          privateEndpointNetworkPolicies: 'Disabled'
        }
      }
    ]
  }
}
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-app'
  location: resourceGroup().location
}
# Terraform: network.tf
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  address_space       = ["10.10.0.0/20"]
}
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.app.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
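}
# NSG + association, matching the Bicep example above
resource "azurerm_network_security_group" "app" {
  name                = "nsg-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
}
resource "azurerm_subnet_network_security_group_association" "app" {
  subnet_id                 = azurerm_subnet.app.id
  network_security_group_id = azurerm_network_security_group.app.id
}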
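
The address ranges used in the examples above can be sanity-checked before deployment. A minimal sketch with Python's standard `ipaddress` module, using the same 10.10.0.0/20 VNet and 10.10.1.0/24 subnet; the variable names are hypothetical.

```python
import ipaddress

vnet = ipaddress.ip_network("10.10.0.0/20")
subnet = ipaddress.ip_network("10.10.1.0/24")

inside = subnet.subnet_of(vnet)            # subnet must sit inside the VNet space
usable = subnet.num_addresses - 5          # Azure reserves 5 addresses per subnet
free_24s = [s for s in vnet.subnets(new_prefix=24) if s != subnet]
```

The /20 leaves fifteen more /24 ranges for later subnets (Bastion, Private Endpoints, gateways), so it is worth reserving them in the plan now instead of discovering the shortage later.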

Create a VM (no public IP)

# Terraform: a Linux VM with a system-assigned managed identity, no public IP
resource "azurerm_linux_virtual_machine" "app" {
  name                = "vm-app-1"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  size                = "Standard_D2s_v5"
  admin_username      = "azureuser"
  network_interface_ids = [azurerm_network_interface.app.id]
  identity { type = "SystemAssigned" }
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}

Create a hardened storage account

resource "azurerm_storage_account" "data" {
  name                            = "stappdataprod01"
  resource_group_name             = azurerm_resource_group.app.name
  location                        = azurerm_resource_group.app.location
  account_tier                    = "Standard"
  account_replication_type        = "ZRS"
  allow_nested_items_to_be_public = false        # no public blob access
  shared_access_key_enabled       = false        # force Entra auth (set storage_use_azuread = true in the azurerm provider block)
  min_tls_version                 = "TLS1_2"
  blob_properties {
    versioning_enabled = true
    delete_retention_policy {
      days = 30
    }
  }
}

Create an RBAC assignment

# Least-privilege: data role on one storage account, to a GROUP
resource "azurerm_role_assignment" "blob_read" {
  scope                = azurerm_storage_account.data.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azuread_group.app_team.object_id
}

Create a metric alert

resource "azurerm_monitor_action_group" "ops" {
  name                = "ops-email"
  resource_group_name = azurerm_resource_group.app.name
  short_name          = "ops"
  email_receiver {
    name          = "oncall"
    email_address = "oncall@example.com"
  }
}
resource "azurerm_monitor_metric_alert" "cpu" {
  name                = "vm-cpu-high"
  resource_group_name = azurerm_resource_group.app.name
  scopes              = [azurerm_linux_virtual_machine.app.id]
  criteria {
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }
  window_size = "PT5M"
  frequency   = "PT1M"
  action { action_group_id = azurerm_monitor_action_group.ops.id }
}

State, structure, and CI/CD

  • Remote, locked state (Terraform): an Azure storage account backend (blob) with blob leasing for locking and versioning on. Never keep prod state on a laptop; never commit state (it holds secrets).
  • Modular structure: reusable modules (network, compute, data, rbac, monitoring) composed per environment.
  • Environment separation: separate state per env (workspaces or separate backends/keys) driven by dev.tfvars / prod.tfvars; separate subscriptions and ideally separate deployer identities/pipelines.
  • CI/CD: run plan/apply (or Bicep what-if/deploy) in GitHub Actions or Azure DevOps using OIDC workload identity federation (no secret). Gate apply with approvals; run plan/what-if on PRs.
  • No secrets in code: reference Key Vault by name; keep secret tfvars out of git.
azure-infra/
  modules/
    network/   compute/   data/   rbac/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  .github/workflows/  or  azure-pipelines.yml
  README.md
Architect note
Prod and DR should be provably identical because they come from the same modules with different variables. Manual portal changes in prod are the enemy of a working DR - enforce "infrastructure changes go through IaC," run plan/what-if in CI on every PR, and use Azure Policy + Change Tracking / drift detection to catch out-of-band changes.

18. Azure Well-Architected Framework

The five pillars Microsoft uses to review a workload - Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.

Last reviewed: July 2026 Verify against the current Well-Architected Framework docs.
HOW TO USE THIS

Run a workload (or a design) through all five pillars. For each, ask the checklist questions, map to concrete Azure services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass. (Microsoft's WAF review + the Advisor score are good companions.)


Reliability

What it means: the workload meets its availability and recovery targets and withstands failures - built around zones, redundancy, health modeling, and tested DR.

Why it matters: reliability targets (SLOs) drive architecture and cost; you can't bolt on availability after an outage.

Supporting services: Availability Zones, zone-redundant VMSS/App Service/LB, Azure SQL failover groups / zone redundancy, Cosmos multi-region, Azure Site Recovery, Backup, Front Door, Azure Monitor (health/SLOs).

Practical examples: zone-spread every tier; a defined + tested DR pattern per tier with RTO/RPO; health probes + autohealing; graceful degradation and retries with backoff; capacity planning; chaos/failure testing.

Common mistakes
Single-zone "prod"; DR never tested; CMK keys missing in DR; no retry/timeout handling for transient faults; no health model.
Review checklist
Is every critical tier zone-redundant? Defined + tested DR per tier with RTO/RPO? Health probes + autohealing? Transient-fault handling (retries/circuit breakers)? Backups immutable and restore-tested? Capacity for failover?
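
Reliability targets become concrete once the tiers are multiplied out. A minimal sketch of composite availability; the SLA figures are hypothetical placeholders (look up each service's current SLA), and the redundancy formula assumes independent failures and a perfect failover mechanism, which real regions only approximate.

```python
def serial_availability(*slas):
    """Availability of components that must all be up: multiply them."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def redundant_availability(sla, copies):
    """Availability of N independent copies where any one suffices."""
    return 1 - (1 - sla) ** copies


def downtime_minutes_per_month(availability, minutes=30 * 24 * 60):
    """Expected downtime in a 30-day month."""
    return (1 - availability) * minutes


# Hypothetical figures for illustration only
single_region = serial_availability(0.9995, 0.9999, 0.9995)   # front end, LB, database
two_regions = redundant_availability(single_region, copies=2)
```

Three tiers that each look strong compose to roughly 99.89% in this example, lower than any single tier. That is the argument for zone or region redundancy on every critical tier, not just the one with the weakest SLA.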

Security

What it means: protect identities, data, and workloads and meet compliance - Zero Trust, least privilege, encryption, and defense in depth.

Why it matters: a single over-broad role, public storage account, or long-lived secret can undo everything else. Security is a design property.

Supporting services: Entra ID + Conditional Access + PIM, Azure RBAC, managed identities, Azure Policy, Key Vault/Managed HSM, Private Link, Azure Firewall/WAF/DDoS, Defender for Cloud, Sentinel.

Practical examples: no standing Owner (PIM); MFA everywhere; managed identities (no secrets); preventive Policy; Private Endpoints; CMK; centralized logs; Defender on. (See section 8's checklist.)

Common mistakes
Owner everywhere / no PIM; weak Conditional Access; public storage/DB; secrets in code; no Private Endpoints for sensitive services; Policy off; logs not centralized.
Review checklist
Least privilege via groups/RBAC + PIM? MFA + Conditional Access? Managed identities, minimal secrets? Preventive Azure Policy on? Private Endpoints for sensitive PaaS? CMK + Key Vault purge protection? Defender + centralized logs? DR keys cross-region?

Cost Optimization

What it means: deliver required value at the lowest sustainable cost - right-sizing, commitments, eliminating waste, and attributing spend.

Why it matters: unmanaged Azure spend grows silently; cost is a first-class design and operational concern.

Supporting services: Cost Management + budgets + exports, quotas, Advisor, Reservations, Savings Plan, Azure Hybrid Benefit, Spot, storage lifecycle, Log Analytics tiers.

Practical examples: tags + exports for attribution; RIs/Savings Plan for baseline; Hybrid Benefit; auto-shutdown non-prod; storage lifecycle; Log Analytics filtering; monthly review (section 14).

Common mistakes
No tags/attribution; over-provisioned VMs/disks; on-demand for steady-state; unfiltered Log Analytics; over-sized App Gateway/Firewall; idle IPs and orphaned disks.
Review checklist
Spend attributed via tags + exports? Budgets + quotas? RI/Savings Plan + Hybrid Benefit applied? Right-sizing acted on? Storage lifecycle + Log Analytics tiers? A recurring cost review?
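
Tag-based attribution comes down to a group-by with an explicit bucket for untagged spend. A minimal sketch over hypothetical rows; the `costCenter` tag name and the row shape are assumptions, not the schema of a Cost Management export.

```python
from collections import defaultdict


def spend_by_tag(rows, tag="costCenter"):
    """Sum cost per tag value; resources without the tag land in UNTAGGED."""
    totals = defaultdict(float)
    for row in rows:
        owner = row.get("tags", {}).get(tag, "UNTAGGED")
        totals[owner] += row["cost"]
    return dict(totals)


# Hypothetical export rows
rows = [
    {"resource": "vm-app-1", "cost": 120.0, "tags": {"costCenter": "retail"}},
    {"resource": "stappdataprod01", "cost": 30.0, "tags": {"costCenter": "retail"}},
    {"resource": "disk-orphan", "cost": 45.5, "tags": {}},
]
totals = spend_by_tag(rows)
```

The size of the UNTAGGED bucket is a useful review metric in its own right: it measures how much spend nobody currently owns.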

Operational Excellence

What it means: run and improve the workload reliably and repeatably - automation, observability, incident response, and safe change.

Why it matters: most outages come from change and from not seeing problems early; operational maturity turns a good design into a dependable service.

Supporting services: Bicep/Terraform + pipelines (GitHub Actions/Azure DevOps), Azure Monitor/Log Analytics/App Insights, Update Manager, Change Tracking, Automation, Azure Policy (drift), Deployment stacks.

Practical examples: everything as code with peer review; SLOs + alerts on symptoms; centralized logs; automated patching; runbooks tied to alerts; blameless post-mortems; progressive delivery (slots/rings).

Common mistakes
Manual portal changes in prod; alerting on causes / alert fatigue; no SLOs; observability added after an incident; no defined incident process.
Review checklist
All infra in code with review? SLOs + symptom-based alerts? Centralized logs + Activity Log? Automated patching? Runbooks current + rehearsed? Safe deployment (slots/canary) + rollback?

Performance Efficiency

What it means: meet performance requirements efficiently as demand changes - right SKUs, autoscaling, caching, data locality, and query design.

Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.

Supporting services: VM series (F/E/M/L), autoscaling (VMSS/App Service/AKS HPA), Front Door + CDN, Azure Cache for Redis, Premium SSD v2/Ultra, Azure SQL tiers/read replicas, App Insights.

Practical examples: match VM series to the bottleneck; autoscale on the right signal; cache at the edge (Front Door/CDN) and in-memory (Redis); co-locate data + compute (proximity placement groups) to cut latency/egress; tune database indexes/queries; load-test before launch.

Common mistakes
Wrong VM series (CPU-bound on general purpose); no autoscaling / scaling on the wrong metric; disk IOPS ceiling mistaken for CPU; chatty cross-region calls; untuned database queries.
Review checklist
VM series matched to workload? Autoscaling on a meaningful signal? Caching (Front Door/Redis) where it helps? Data co-located with compute? Storage/DB performance sized (Premium SSD v2/Ultra, right DB tier)? Load-tested?

19. Learning Path

A structured route from Azure fundamentals to enterprise-grade architecture, security, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026 Certification names/exam details change - verify on Microsoft Learn before scheduling.
Beginner
Fundamentals: hierarchy, Entra, RBAC, VNet, VM, Storage, Monitor
Intermediate
LB/App Gateway, private networking, Azure SQL, Key Vault, Defender, Policy, cost
Advanced
Management groups, landing zones, hub-spoke, Firewall, Private Link, PIM, AKS, Synapse/Fabric, OpenAI, DR, Terraform
How to use this
Do the labs, don't just read. Use a free trial / Azure free account for hands-on. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (AZ-900 → AZ-104 → AZ-305, plus specialties) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations
Goal: deploy and connect basic Azure resources confidently

What to learn

  • Fundamentals: regions/zones, the governance hierarchy (management group / subscription / resource group), and the Azure mental model (section 1).
  • Entra ID basics and RBAC vs Entra roles; groups; managed identities (section 2).
  • VNet basics: subnets, NSGs, and how traffic flows (section 3).
  • VM basics: sizes, managed disks, Bastion access (section 4).
  • Storage basics: storage accounts, blob tiers, redundancy, disabling public access (section 5).
  • Azure Monitor basics: metrics, the agent, an alert, Log Analytics (section 9).

Why it matters

Every design rests on the hierarchy, the Entra/RBAC split, and the VNet model. Get these right and everything later is easier.

Hands-on labs

  • Create a resource group; assign a built-in role to a group at RG scope; test with Check access.
  • Build a VNet with subnets + NSGs; deploy a VM with no public IP; connect via Bastion.
  • Create a storage account (ZRS, public access disabled); upload blobs with Entra auth.
  • Deploy the Azure Monitor Agent (VM Insights); create a CPU alert to an action group.

Common mistakes

Confusing Entra roles with RBAC; Owner at subscription; public storage; no agent for memory; VMs with public IPs.

Expected outcome

You can stand up a segmented VNet, reach a private VM via Bastion, use RBAC correctly, and see basic telemetry.

Intermediate

Level 2 - Building real workloads
Goal: deploy an HA app + managed database with monitoring, security, and cost control

What to learn

  • Load balancing (Load Balancer, Application Gateway + WAF, Front Door) and VMSS + autoscale (sections 7, 4).
  • Private networking: Private Endpoints, NAT Gateway, VPN Gateway, ExpressRoute basics (section 3).
  • Azure SQL: service tiers, zone redundancy, Private Endpoint, PITR, failover groups (section 6).
  • Key Vault + managed identities; Defender for Cloud; Azure Policy basics (section 8).
  • Cost Management: budgets, tags, Reservations, Hybrid Benefit (section 14).

Why it matters

This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.

Hands-on labs

  • Deploy a 3-tier app: Front Door + WAF → App Service/VMSS → Azure SQL (Private Endpoint, zone-redundant).
  • Allow the health probe (+ App Gateway management ports); confirm backends healthy; force a failover.
  • Store the DB connection secret in Key Vault; connect via managed identity / Entra auth.
  • Create alerts (CPU, unhealthy backend, DB storage) + an action group; wire a budget + tags.
  • Apply a couple of Azure Policies (deny public IP, allowed regions) at resource-group scope.

Common mistakes

Health-probe NSG rule missing; DB public endpoint; secrets in app settings; Private DNS not linked; noisy alerts.

Expected outcome

You can deploy a secure, monitored, HA application + managed database, connect it privately, and control its cost and access.

Advanced

Level 3 - Enterprise architecture, data & AI
Goal: design governed, multi-region, data-and-AI-capable platforms

What to learn

  • Management groups, landing zones (CAF ALZ), and Azure Policy at scale (sections 1, 8, 14).
  • Hub-and-spoke / Virtual WAN, Azure Firewall, Private Link, DNS Private Resolver (section 3).
  • Advanced RBAC + PIM + Conditional Access (section 2).
  • AKS / Container Apps, Functions, Event Grid/Service Bus (section 10).
  • Synapse / Microsoft Fabric, Data Factory, Purview (section 11).
  • Azure OpenAI, Azure AI Search, and governed RAG / vector search (section 12).
  • Multi-region DR (Front Door + failover groups / Cosmos) (section 13).
  • Bicep/Terraform + pipelines; enterprise security; large-enterprise architecture (sections 17, 8).

Why it matters

At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.

Hands-on labs

  • Deploy a landing zone (CAF ALZ accelerator): management groups, baseline RBAC + PIM, Azure Policy, hub network, central logging + Defender, budgets.
  • Stand up a private AKS cluster in a spoke with Workload Identity and a GitHub Actions/Azure DevOps pipeline.
  • Build a Fabric/Synapse lakehouse with a Data Factory pipeline and Purview governance; tune a query.
  • Build a governed RAG assistant: Blob + Azure AI Search + Azure OpenAI behind an App Service serving API, Private Endpoints, Content Safety, and security-trimmed retrieval.
  • Implement cross-region DR for an Azure SQL app (failover group) behind Front Door; rehearse failover and confirm CMK keys in DR.

Common mistakes

Skipping the landing zone; DR never tested; standing Owner instead of PIM; pod-IP exhaustion; connecting AI to production data without a governed serving layer.

Expected outcome

You can design and operate a governed, automated, multi-region Azure platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.

Certification checkpoints (optional)

Level        | Typical certification track
Beginner     | AZ-900 (Azure Fundamentals); AZ-104 (Administrator)
Intermediate | AZ-104; AZ-500 (Security); AZ-700 (Networking)
Advanced     | AZ-305 (Solutions Architect Expert); DP-700/DP-600 (Fabric data; DP-203 is retired); AI-102 (AI Engineer)
Verify before scheduling
Microsoft updates exam content and role certifications regularly. Confirm the current track and objectives on Microsoft Learn before you prepare. Certifications validate knowledge; the labs above build the capability employers pay for.