Microsoft Azure Deep Dive Portal
A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, secure, and troubleshoot real Azure environments - not a marketing overview.
Written for cloud architects, infrastructure engineers, Apps DBAs, DBAs, enterprise architects, DevOps, security engineers, Microsoft administrators - and anyone moving from traditional infrastructure or another cloud into Azure. It assumes you know servers, networks, storage, and databases, and focuses on how those map into Azure and what changes operationally.
How this portal is organized
Each section is a self-contained deep dive. Use the left navigation or the top-bar search to jump to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, VM sizes, quotas, model availability, service names), a Verify with current Microsoft documentation flag.
Sections 1-2 establish the mental model: the governance hierarchy (management group / subscription / resource group), Entra ID, and the RBAC model everything depends on.
Sections 3-12 cover networking, compute, storage, databases, load balancing, security, observability, containers, data, and AI - with diagrams, tables, and gotchas.
Sections 13-19 cover migration and DR, cost and governance, reference patterns, troubleshooting runbooks, automation, the Well-Architected Framework, and a learning path.
Reading the callouts
Several note types recur. They flag the perspective that matters most for a point.
The Azure shared responsibility model (orientation)
Responsibility is split, and the split moves depending on the service. Get it wrong and you either leave gaps (exposed data, lost recoverability) or redo work Microsoft already does.
| Layer | Virtual Machines (IaaS) | AKS | Azure SQL DB / PaaS DB | App Service / Functions |
|---|---|---|---|---|
| Physical / hypervisor | Microsoft | Microsoft | Microsoft | Microsoft |
| OS patching | You (Update Manager) | You (nodes) / MS (control plane) | Microsoft | Microsoft |
| Runtime / engine patching | You | Shared | Microsoft (in window) | Microsoft |
| Backup config | You (Azure Backup) | You | Managed, you configure | Managed / you export |
| Scaling / HA | You (Scale Sets, zones) | You configure | You enable zone-redundancy | Automatic |
| Data, schema, access, RBAC | You | You | You | You |
Suggested reading order
1. Azure Fundamentals
The global infrastructure and the governance hierarchy (management groups, subscriptions, resource groups) that every Azure deployment is built on - plus the mental model that makes the rest of the platform predictable.
Azure is a set of regions (most paired for DR, many with Availability Zones) on Microsoft's global network. Governance is a hierarchy: Management Groups > Subscriptions > Resource Groups > resources, all under one Microsoft Entra tenant (the identity boundary). The subscription is the billing/deployment/quota boundary; the resource group is the lifecycle boundary. Azure Resource Manager (ARM) is the deployment/control plane; RBAC grants access and Azure Policy constrains what is allowed. Get the hierarchy and a landing zone right before production.
What Azure is
Microsoft Azure is Microsoft's public cloud: on-demand compute, storage, networking, databases, data, and AI services delivered from Microsoft-operated regions, consumed over Microsoft's global network, and billed by usage. Its distinctive strengths for enterprises are deep Microsoft ecosystem integration (Entra ID / Active Directory, Windows Server, SQL Server, Microsoft 365), strong hybrid tooling (Azure Arc, Azure Stack, ExpressRoute), and a mature governance model (management groups, Azure Policy, landing zones). If you come from traditional Windows/SQL/AD infrastructure, much of Azure will feel familiar; the biggest early shift is the resource hierarchy and that identity lives in Entra ID, separate from Azure resource access (RBAC).
Azure global infrastructure
| Concept | What it is | Protects against / used for |
|---|---|---|
| Region | A set of data centers in a geography; your primary deployment + residency boundary | Choose by latency, residency, and service/zone availability |
| Region pair | Most regions are paired with another region, usually in the same geography; Microsoft sequences updates and some services replicate to the pair | Regional DR target; ordered platform maintenance |
| Availability Zone (AZ) | Physically separate datacenters within a region (independent power/cooling/network) | Datacenter-level failure - spread across zones for in-region HA |
| Availability Set | A grouping that spreads VMs across fault domains (racks) and update domains (maintenance groups) within a single datacenter | Rack/maintenance failure when zones aren't used |
| Fault domain | A rack of hardware (shared power/network) | Anti-affinity within an availability set |
| Update domain | A group updated/rebooted together during planned maintenance | Ensures not all instances reboot at once |
| Sovereign clouds | Isolated clouds (Azure Government, Azure operated by 21Vianet in China) | Regulatory/sovereignty isolation - separate endpoints and features |
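The fault-domain and update-domain rows are easier to reason about with a small model. The sketch below is illustrative only: it assumes simple round-robin placement and hypothetical counts of 3 fault domains and 5 update domains, and `place_instances` is a made-up helper, not an Azure API.

```python
from collections import Counter

def place_instances(count, fault_domains=3, update_domains=5):
    """Assign each instance index to a (fault domain, update domain) pair round-robin."""
    return [(i % fault_domains, i % update_domains) for i in range(count)]

placement = place_instances(6)

# Losing one rack (fault domain) takes out only the instances placed on it.
per_fault_domain = Counter(fd for fd, _ in placement)
worst_case_loss = max(per_fault_domain.values())

print(placement)
print(f"worst-case instances lost to one rack failure: {worst_case_loss} of {len(placement)}")
```

With six instances over three fault domains, a single rack failure removes at most two of them, which is the property an availability set is meant to provide.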
The governance hierarchy
- Microsoft Entra tenant - the identity boundary (users, groups, apps). One tenant can hold many subscriptions. Identity (Entra) is separate from resource access (Azure RBAC) - a crucial distinction (section 2).
- Management groups - a hierarchy above subscriptions for applying RBAC and Azure Policy at scale (e.g. a Platform MG and a Landing Zones MG under the root). Can nest.
- Subscriptions - the unit of billing, quota/limits, and deployment. Also a strong isolation and blast-radius boundary. Large enterprises use many subscriptions (per app/environment/BU).
- Resource groups - a lifecycle/management container for resources that share a lifecycle; RBAC, locks, and deletion apply at this level. A resource belongs to exactly one RG.
- Azure Resource Manager (ARM) - the control-plane API/service for all deployments and management; ARM templates and Bicep are its native IaC languages.
The Azure mental model
| Concept | Is the boundary for |
|---|---|
| Entra tenant | Identity (who exists) |
| Subscription | Billing and deployment (and quota/limits) |
| Resource group | Lifecycle and management (deploy/delete/lock together) |
| Management group | Governance (apply RBAC/Policy across many subscriptions) |
| Azure Policy | Control and compliance (what is allowed to exist / be configured) |
| Azure RBAC | Access (who can do what to Azure resources) |
ARM, Bicep, and the tools
Azure portal - The web UI. Best for learning, exploring, and reading state. Not for repeatable production changes - use IaC.
Azure CLI and Azure PowerShell - Two first-class command lines (az and Az module). Pick one house standard; both cover the control plane. See section 17.
Azure Cloud Shell - Browser shell pre-authenticated as your identity, with CLI/PowerShell/Terraform/kubectl and persistent storage.
ARM templates and Bicep - Native IaC. Bicep is the modern, readable authoring language that compiles to ARM JSON - prefer it over raw ARM templates.
Terraform - The azurerm provider is widely used for multi-cloud/standardized IaC. Bicep vs Terraform is a house choice; both are valid.
REST API and SDKs - Everything is an ARM REST call; idiomatic SDKs (.NET, Python, Java, JS, Go) for building tooling and apps.
Azure Policy and resource locks
- Azure Policy - define, assign, and enforce rules (deny, audit, deployIfNotExists, modify) at MG/subscription/RG scope, inheriting down. Common: allowed regions, allowed SKUs, require tags, deny public IPs/public storage, require diagnostic settings. Initiatives bundle policies (e.g. a regulatory baseline).
- Resource locks - CanNotDelete or ReadOnly locks on a subscription/RG/resource to prevent accidental deletion/changes (e.g. lock the hub network and prod databases).
Tags
Tags are key/value metadata on resources, resource groups, and subscriptions - used for cost attribution (they flow into Cost Management and billing exports), organization, and automation. Tags are not inherited by default (a resource does not automatically get its RG's tags), though Azure Policy can enforce/inherit them.
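A small model makes the inheritance point concrete. This is a sketch of the behaviour, not of Azure Policy itself: `effective_tags` and the tag names are hypothetical, and the `inherit_keys` argument stands in for a policy that copies a tag from the resource group when the resource lacks it.

```python
def effective_tags(resource_tags, rg_tags, inherit_keys=()):
    """Return the resource's tags, copying only the listed keys from the RG when missing."""
    merged = dict(resource_tags)
    for key in inherit_keys:
        if key not in merged and key in rg_tags:
            merged[key] = rg_tags[key]
    return merged

rg_tags = {"costCenter": "CC-1042", "env": "prod"}   # hypothetical resource group tags
vm_tags = {"owner": "app-team"}                      # hypothetical resource tags

# Default behaviour: the resource sees none of the resource group's tags.
print(effective_tags(vm_tags, rg_tags))

# With an inherit rule for costCenter, only that key is copied down.
print(effective_tags(vm_tags, rg_tags, inherit_keys=["costCenter"]))
```

The practical consequence is that cost reports grouped by a tag will miss any resource where neither a person nor a policy applied that tag directly.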
Designing the hierarchy & landing zones
How to structure tenants, management groups, subscriptions, resource groups (Design)
- One Entra tenant for the enterprise (identity boundary); multiple tenants only for genuine sovereignty/M&A isolation.
- Management groups following the Cloud Adoption Framework (CAF) pattern: a Platform MG (connectivity, management, identity subscriptions) and a Landing Zones MG (corp/online workload subscriptions), plus Sandbox and Decommissioned.
- Subscriptions per workload/environment (or per app), not one giant subscription - they are the quota and blast-radius boundary. Separate platform subscriptions for connectivity (hub network), management (logging/monitoring), and identity.
- Resource groups per app-tier or per lifecycle (things deployed and deleted together).
Separating dev / test / stage / prod / shared / security / networking / logging (Design)
- Separate subscriptions per environment for independent quota, budgets, RBAC, and blast radius; use management groups to apply environment-wide Policy/RBAC once.
- Dedicated platform subscriptions: connectivity (hub VNet, Azure Firewall, ExpressRoute/VPN), management (Log Analytics, Automation, Backup), identity (domain controllers / Entra Domain Services if needed).
- Keep prod under stricter Policy (no public IPs, restricted regions/SKUs, mandatory logging) than nonprod.
- Never mix sandbox/experimentation with production - separate MG with its own guardrails and budgets.
What an Azure landing zone includes (Design)
An Azure Landing Zone (CAF) is a codified, repeatable baseline deployed before workloads:
- Management-group hierarchy, subscription organization, and naming/tagging standards.
- Identity: Entra tenant config, groups, PIM, break-glass, Conditional Access baseline.
- Baseline RBAC (groups, not users) and preventive Azure Policy initiatives.
- Connectivity: hub-and-spoke (or Virtual WAN), Azure Firewall, DNS, ExpressRoute/VPN in the connectivity subscription.
- Management: central Log Analytics workspace, diagnostic settings policy, Azure Monitor, Backup, Update Manager.
- Security: Defender for Cloud, Sentinel, Key Vault, Private Link/Private DNS strategy.
- Guardrails: budgets, quotas, resource locks, tags - all as code (Terraform / Bicep; the CAF ALZ accelerator is a starting point).
Common mistakes to avoid:
- One big subscription for everything, so quota, RBAC, and cost attribution collapse together.
- Skipping the landing zone and retrofitting management groups, Policy, and hub networking later.
- Confusing the Entra tenant (identity) with a subscription (billing) - they are different boundaries.
- Granting RBAC at subscription/MG level for convenience so everything inherits broad access.
- No naming/tagging standard; mixing sandbox with production.
2. Identity and Access Management
Microsoft Entra ID and Azure RBAC - the two different systems that decide who exists and who can do what - plus managed identities, PIM, Conditional Access, and a troubleshooting model. This is where most Azure access issues and security incidents originate.
Azure has two access systems: Microsoft Entra ID (identity + directory roles like Global Administrator that govern Entra/M365) and Azure RBAC (roles like Contributor that govern Azure resources at MG/subscription/RG/resource scope). Confusing them is the #1 IAM error. Use groups (not users), built-in roles scoped narrowly (not Owner at subscription), managed identities for workloads (not secrets), PIM for just-in-time privileged access, and Conditional Access + MFA for sign-in. Deny assignments and Policy can block regardless of role.
Entra ID roles vs Azure RBAC roles (the critical distinction)
| | Microsoft Entra directory roles | Azure RBAC roles |
|---|---|---|
| Govern | Entra ID & Microsoft 365 (users, groups, apps, tenant settings) | Azure resources (VMs, storage, networks, databases) |
| Examples | Global Administrator, User Administrator, Application Administrator | Owner, Contributor, Reader, Storage Blob Data Contributor |
| Scope | Tenant (and administrative units) | Management group / subscription / resource group / resource |
| Managed in | Entra ID > Roles and administrators | Resource > Access control (IAM) |
Principals (who can be granted access)
| Principal | What it is | Use for |
|---|---|---|
| User | A human identity in Entra ID | People - but grant via groups, not directly |
| Group | An Entra security group of users/principals | All human access management |
| Administrative unit | A container to scope Entra role admins to a subset of users/groups | Delegated identity admin (e.g. per-region helpdesk) |
| App registration + service principal | An application identity; the app registration is the definition, the service principal is its instance in a tenant | Apps/CI authenticating to Azure/Graph (prefer managed identity where possible) |
| Managed identity | An Azure-managed service principal with no credentials you handle | Azure workloads calling Azure/Entra - the preferred workload identity |
| External identity (B2B guest) | A user from another tenant invited as a guest | Partner/vendor collaboration |
Managed identities: system- vs user-assigned
| | System-assigned | User-assigned |
|---|---|---|
| Lifecycle | Tied to one resource; created/deleted with it | Standalone resource; attach to many |
| Use for | A single workload needing its own identity | Sharing one identity across resources; pre-creating and granting RBAC before deploy |
| Credentials | None you manage - Azure rotates them | None you manage |
Azure RBAC: roles and scope
- Built-in roles - Owner (full + manage access), Contributor (full except manage access), Reader, plus hundreds of granular roles (e.g. Virtual Machine Contributor, Storage Blob Data Reader). Prefer the narrowest built-in role.
- Custom roles - compose specific actions/dataActions when no built-in role fits without over-granting.
- Role assignment = role + principal + scope. Scope is MG / subscription / RG / resource; assignments inherit downward and are additive.
- Deny assignments - explicitly block actions regardless of role (used by Azure-managed features like Blueprints/managed apps); evaluated before allows.
- Control-plane vs data-plane - some roles govern management (create/delete a storage account) vs data (read blobs). A Contributor on a storage account can't necessarily read the blobs without a data role - a frequent surprise.
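The "role + principal + scope" rule and downward, additive inheritance can be sketched as a prefix match on ARM-style scope paths. This is a simplified model under stated assumptions: the subscription ID, resource names, `Assignment`, `scope_covers`, and `effective_roles` are all hypothetical; management-group scopes (which are not path prefixes of subscriptions) and deny assignments are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assignment:
    principal: str
    role: str
    scope: str  # ARM-style scope path

def scope_covers(scope, resource_id):
    """True when resource_id is at or below scope in the hierarchy (case-insensitive)."""
    scope = scope.rstrip("/").lower()
    resource_id = resource_id.rstrip("/").lower()
    return resource_id == scope or resource_id.startswith(scope + "/")

def effective_roles(assignments, principal, resource_id):
    """Union of roles from every assignment whose scope covers the resource (additive)."""
    return {a.role for a in assignments
            if a.principal == principal and scope_covers(a.scope, resource_id)}

SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"   # placeholder subscription
RG = SUB + "/resourceGroups/app-rg"
STORAGE = RG + "/providers/Microsoft.Storage/storageAccounts/appdata"

assignments = [
    Assignment("app-team", "Reader", SUB),
    Assignment("app-team", "Contributor", RG),
    Assignment("app-team", "Storage Blob Data Reader", STORAGE),
]

# On the storage account, all three assignments apply.
print(sorted(effective_roles(assignments, "app-team", STORAGE)))
# In a different resource group, only the subscription-level Reader is inherited.
print(sorted(effective_roles(assignments, "app-team", SUB + "/resourceGroups/other-rg")))
```

The second call shows why subscription-scope grants deserve caution: whatever is assigned there appears in every resource group beneath it.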
Privileged Identity Management & Conditional Access
Privileged Identity Management (PIM) - Just-in-time, time-bound, approval-and-MFA-gated activation of privileged roles (Entra and Azure RBAC). Admins are eligible, not permanently assigned - they activate when needed, with justification and audit. The single biggest reduction of standing privilege.
Conditional Access - Sign-in policies that require MFA, compliant/managed devices, trusted locations, or block risky sign-ins - the enforcement layer for Zero Trust. Enforce MFA for all users, especially admins.
Microsoft Entra ID Protection - Risk-based detection (leaked credentials, impossible travel) feeding Conditional Access to block/step-up risky sign-ins.
Access reviews and entitlement management - Periodic recertification of group/role/guest access, and access packages for governed self-service - keep access from silently accumulating.
Real RBAC / identity scenarios
App team manages their resource group only (Medium risk)
Who: the app team group. Scope: their resource group (not subscription). Role: Contributor on the RG (or narrower roles like Virtual Machine Contributor + Web Plan Contributor). Risk: medium - contained to one RG. Safer alternative: deploy via a pipeline managed identity and give humans Reader + specific action roles. Common misuse: Owner/Contributor at subscription scope.
Workload reads a storage account - no secrets (Low risk)
Who: a VM/App Service/AKS workload. Scope: a single storage account (or container). Role: Storage Blob Data Reader assigned to the workload's managed identity. Risk: low - narrow, keyless. This is the pattern to imitate. Common misuse: a storage account key or SAS in app config + Contributor at RG.
Read-only auditor across the platform (Low risk)
Who: security/audit group. Scope: the root or Platform management group (auditors need breadth). Role: Reader (+ Security Reader for Defender), granted to a group. Risk: low - read-only. Common misuse: giving auditors Contributor "just in case".
Emergency change requiring elevated access (Higher risk)
Who: an on-call engineer. Scope: the affected subscription. Role: Contributor/Owner made eligible via PIM - activated just-in-time with MFA + justification, time-boxed, audited. Risk: higher, but no standing privilege and full audit trail. Common misuse: a permanent Owner assignment "for emergencies".
Common Azure IAM mistakes
- Confusing Entra roles with Azure RBAC roles - different systems, scopes, and portals.
- Owner assigned too broadly - use narrow built-in roles at RG/resource scope.
- Subscription-scope grants unnecessarily - inherit into every RG; grant lower.
- Not using PIM - standing privileged access is the biggest avoidable risk.
- Break-glass accounts not protected/monitored correctly - or not existing at all.
- Too many app secrets, not rotated - and used where a managed identity would work.
- Not using managed identities for Azure workloads.
- Mixing human users and workload identities - different lifecycles/controls.
- Weak Conditional Access - no MFA enforcement, no risk policies, admins unprotected.
Azure access troubleshooting mental model
When access fails (or unexpectedly works), walk the layers in order:
- Which tenant are you signed into? (Guest in the wrong tenant is common.)
- Which subscription is the resource in, and is it the selected one?
- Which identity is making the request - user, group, service principal, or managed identity? (For workloads, which MI is actually assigned?)
- What role is assigned, and does it include the required action (control-plane vs data-plane)?
- At what scope (MG/subscription/RG/resource)? Does inheritance reach this resource?
- Is there a deny assignment blocking it?
- Is Azure Policy blocking the action (deny effect)?
- Is Conditional Access blocking the sign-in (device/location/MFA)?
- Does the role require PIM activation that hasn't been done (eligible but not active)?
- Is the resource provider registered / API available in the subscription?
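The ordered walk above can be captured as a checklist that stops at the first layer not confirmed healthy. This is only a sketch of the reasoning order: the layer keys and the `first_blocker` helper are hypothetical, and the findings are values you would fill in by hand from the portal or CLI.

```python
# Ordered checks, mirroring the layers listed above.
CHECKS = [
    ("tenant", "Signed into the tenant that owns the resource?"),
    ("subscription", "Correct subscription selected?"),
    ("identity", "Request made by the identity you expect?"),
    ("role", "Assigned role includes the required action (control vs data plane)?"),
    ("scope", "Assignment scope reaches this resource?"),
    ("deny_assignment", "No deny assignment blocking it?"),
    ("policy", "No Azure Policy deny effect?"),
    ("conditional_access", "Sign-in passes Conditional Access?"),
    ("pim", "Eligible role activated in PIM?"),
    ("provider", "Resource provider registered?"),
]

def first_blocker(findings):
    """Return the first layer not confirmed OK; unverified layers count as blockers."""
    for layer, question in CHECKS:
        if findings.get(layer) is not True:
            return layer, question
    return None

# Example: the first three layers were verified, the role check failed.
findings = {"tenant": True, "subscription": True, "identity": True, "role": False}
print(first_blocker(findings))
```

Treating an unverified layer as a blocker keeps the investigation from skipping ahead to Policy or Conditional Access before the basics are confirmed.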
Tools
Resource > Access control (IAM) > Check access and View my access; Entra sign-in logs (for Conditional Access); Activity Log (for the denied operation); PIM (eligible vs active).
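az role assignment list --assignee <objectId> --all -o table # role assignments for this principal at every scope in the current subscription
az account show # which tenant/subscription am I in?
az provider show -n Microsoft.Sql --query registrationState # is the resource provider registered?
3. Networking Deep Dive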
Virtual Networks, NSGs vs Azure Firewall, Private Endpoint vs Service Endpoint, hub-and-spoke, hybrid connectivity, and Private DNS - plus the traffic-flow reasoning you need to design and debug real Azure networks.
A Virtual Network (VNet) is regional; you carve subnets from its CIDR. NSGs are stateful L3/L4 allow/deny rules (per subnet/NIC); Azure Firewall is a managed L3-L7 firewall for centralized egress/inspection in a hub. Reach PaaS privately with Private Endpoint (a private IP in your VNet, works cross-region/hybrid) or the older Service Endpoint (keeps traffic on the backbone but the service keeps a public endpoint). VNet peering is not transitive - use hub-and-spoke or Virtual WAN. Private Endpoint needs correct Private DNS zone linkage; without it the service FQDN keeps resolving to the public IP. Plan non-overlapping CIDRs first.
Virtual Network and CIDR planning
- A VNet is regional with one or more address spaces; subnets partition it. Azure reserves 5 IPs per subnet (first 4 + last).
- Some services want dedicated/delegated subnets (Azure Firewall needs AzureFirewallSubnet, gateways need GatewaySubnet, Bastion needs AzureBastionSubnet, App Gateway its own subnet). Plan these in advance.
- CIDRs must not overlap with peered VNets or on-premises. Overlap is the #1 cause of hybrid that "connects but won't route."
- Plan generously: leave room for growth, gateway/firewall/bastion subnets, and AKS (which consumes many IPs with Azure CNI).
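Both rules above (five reserved addresses per subnet, no overlapping ranges) are easy to check before anything is deployed. The sketch below uses only Python's standard `ipaddress` module; the address ranges and the `usable_ips` / `find_overlaps` helpers are hypothetical examples, not part of any Azure tooling.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one

def usable_ips(cidr):
    """Addresses left for workloads in a subnet after Azure's reserved ones."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)

def find_overlaps(named_ranges):
    """Return pairs of names whose address ranges overlap."""
    items = [(name, ipaddress.ip_network(cidr)) for name, cidr in named_ranges.items()]
    return [(a, b)
            for i, (a, net_a) in enumerate(items)
            for b, net_b in items[i + 1:]
            if net_a.overlaps(net_b)]

ranges = {
    "hub-vnet": "10.0.0.0/16",
    "spoke-app": "10.1.0.0/16",
    "on-prem": "10.1.128.0/17",  # hypothetical on-premises range that collides with spoke-app
}

print(usable_ips("10.0.1.0/24"))  # 256 addresses minus 5 reserved
print(usable_ips("10.0.2.0/29"))  # 8 addresses minus 5 reserved
print(find_overlaps(ranges))
```

Running a check like this against the full list of VNet and on-premises ranges in a pipeline is one way to catch an overlap before a peering or gateway connection is created.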
Network Security Groups & Application Security Groups
- NSG - stateful allow/deny rules on source/dest, port, protocol, evaluated in priority order (the lowest priority number is evaluated first, and the first match wins). Attach to a subnet and/or a NIC; both apply. Default rules allow intra-VNet and deny inbound from internet.
- Application Security Group (ASG) - a named group of NICs you reference in NSG rules instead of IPs, so rules read "web-asg → db-asg on 1433" and update automatically as VMs scale.
- Service tags - Microsoft-maintained IP groups (e.g. Storage, Sql, AzureMonitor, Internet) you use in rules instead of hardcoding ranges.
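Priority-ordered evaluation is the part of NSG behaviour that most often surprises people, so here is a minimal sketch of it. It is a deliberate simplification: it models inbound rules with only a source range and destination port, ignores protocol, direction, ASGs, and service tags, and the rule names other than the default DenyAllInBound are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Union

@dataclass
class Rule:
    name: str
    priority: int
    source: str             # CIDR, or "*" for any
    port: Union[int, str]   # destination port, or "*" for any
    access: str             # "Allow" or "Deny"

def matches(rule, src_ip, dst_port):
    src_ok = rule.source == "*" or ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source)
    port_ok = rule.port == "*" or rule.port == dst_port
    return src_ok and port_ok

def evaluate(rules, src_ip, dst_port):
    """Walk rules from lowest priority number upward; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, src_ip, dst_port):
            return rule.access, rule.name
    return "Deny", "no matching rule"

rules = [
    Rule("deny-vnet-to-sql", 200, "10.0.0.0/8", 1433, "Deny"),
    Rule("allow-web-to-sql", 100, "10.1.1.0/24", 1433, "Allow"),
    Rule("DenyAllInBound", 65500, "*", "*", "Deny"),
]

print(evaluate(rules, "10.1.1.10", 1433))  # web subnet: priority 100 matches first
print(evaluate(rules, "10.1.2.10", 1433))  # other subnet: falls to priority 200
print(evaluate(rules, "10.1.1.10", 22))    # nothing specific matches: default deny
```

When an NSG is attached at both the subnet and the NIC, traffic has to pass an evaluation like this at each level, which is why IP Flow Verify reports the specific rule that made the decision.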
Routes, UDRs, and internet egress
- System routes handle intra-VNet, peering, and a default route to the internet. User Defined Routes (UDRs) in a route table override them - e.g. force 0.0.0.0/0 through the Azure Firewall (forced tunneling / centralized egress).
- NAT Gateway - the recommended way to give a subnet scalable, stable outbound internet (SNAT) without per-VM public IPs. Outbound only.
- Public IPs (Standard SKU, zone-redundant) for inbound-facing resources; avoid on workload VMs - use a load balancer / Bastion / firewall instead.
- Default outbound access is being retired - new VMs should have an explicit outbound method (NAT Gateway, LB outbound rule, or firewall). Don't rely on implicit outbound.
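How a UDR "overrides" a system route can be sketched as route selection: the longest matching prefix wins, and when prefixes tie, a user-defined route is preferred over a BGP route, which is preferred over a system route. The route table, the firewall address 10.0.0.4, and the `next_hop` helper below are hypothetical illustrations.

```python
import ipaddress

# Tie-break order when two routes have the same prefix length (lower wins).
SOURCE_PREFERENCE = {"user": 0, "bgp": 1, "system": 2}

def next_hop(routes, destination_ip):
    """Pick the next hop: longest prefix match first, then route source preference."""
    dest = ipaddress.ip_address(destination_ip)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda r: (-ipaddress.ip_network(r["prefix"]).prefixlen,
                              SOURCE_PREFERENCE[r["source"]]))
    return best["next_hop"]

routes = [
    {"prefix": "10.1.0.0/16", "source": "system", "next_hop": "VnetLocal"},
    {"prefix": "0.0.0.0/0", "source": "system", "next_hop": "Internet"},
    {"prefix": "0.0.0.0/0", "source": "user", "next_hop": "VirtualAppliance 10.0.0.4"},
]

print(next_hop(routes, "10.1.2.5"))       # intra-VNet traffic stays local
print(next_hop(routes, "93.184.216.34"))  # internet-bound traffic is forced to the firewall
```

The same logic explains a common surprise: a 0.0.0.0/0 UDR does not capture intra-VNet traffic, because the more specific VNet prefix still wins.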
Forcing 0.0.0.0/0 to the firewall without a matching route back can black-hole traffic - design symmetric routing.
Azure Firewall & Firewall Manager
Azure Firewall is a managed, highly-available, stateful network firewall (Standard/Premium; Premium adds TLS inspection, IDPS, URL filtering). Placed in the hub VNet with spoke UDRs pointing at it, it centralizes egress control (FQDN + network rules), DNAT for inbound, and logging. Firewall Manager centrally manages firewall policies across hubs (incl. Virtual WAN secured hubs).
Private Endpoint vs Service Endpoint (the constant confusion)
| | Private Endpoint (Private Link) | Service Endpoint |
|---|---|---|
| What it is | A private IP in your subnet that maps to a specific PaaS resource instance | Extends the subnet's identity to the service over the backbone; the service keeps its public endpoint |
| Traffic | Fully private; the PaaS resource can disable public access entirely | Stays on the Azure backbone, but the resource still has a public endpoint (restricted by rules) |
| Cross-region / on-prem | Yes - reachable from peered VNets and on-prem (ExpressRoute/VPN) | No - VNet/region local; not reachable from on-prem |
| DNS | Requires a Private DNS zone mapping the service FQDN to the private IP | No DNS change; uses the public FQDN |
| Cost | Per-endpoint + data | Free |
| Use for | The modern default for private PaaS access (Storage, SQL, Key Vault...) | Simpler/cheaper cases where a public endpoint restricted to your subnet is acceptable |
A Private Endpoint only works when DNS is wired up: a Private DNS zone (e.g. privatelink.blob.core.windows.net, privatelink.database.windows.net), an A record for the endpoint, and a VNet link (and hybrid forwarding for on-prem). Automate this - it is easy to forget and hard to notice.
VNet peering & hub-and-spoke
- VNet peering (regional or global) connects two VNets privately over the backbone. Crucially, peering is not transitive: spoke A peered to hub H and spoke B peered to H cannot reach each other unless you route through a hub appliance (Azure Firewall/NVA) with UDRs and "allow forwarded traffic".
- Hub-and-spoke - a central hub VNet (firewall, gateways, DNS, Bastion) peered to workload spokes; spokes route egress and cross-spoke traffic through the hub firewall.
- Virtual WAN - a Microsoft-managed hub-and-spoke at scale (managed hubs, integrated firewall, any-to-any transit, branch/VPN/ExpressRoute) - use it instead of hand-built hubs for large/global topologies.
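Non-transitivity is simple to state and easy to forget, so a toy reachability check may help. This is a sketch only: the VNet names and the `can_reach` helper are hypothetical, and the `hub_forwards` flag stands in for the combination of spoke UDRs, a hub firewall/NVA, and "allow forwarded traffic" on the peerings.

```python
def directly_peered(peerings, a, b):
    return frozenset((a, b)) in peerings

def can_reach(peerings, src, dst, hub=None, hub_forwards=False):
    """Peering is pairwise; spoke-to-spoke needs an explicit transit path through the hub."""
    if src == dst or directly_peered(peerings, src, dst):
        return True
    return bool(hub_forwards and hub is not None
                and directly_peered(peerings, src, hub)
                and directly_peered(peerings, hub, dst))

peerings = {frozenset(("hub", "spoke-a")), frozenset(("hub", "spoke-b"))}

print(can_reach(peerings, "spoke-a", "hub"))        # directly peered
print(can_reach(peerings, "spoke-a", "spoke-b"))    # no transit through the hub by default
print(can_reach(peerings, "spoke-a", "spoke-b", hub="hub", hub_forwards=True))  # routed via hub appliance
```

The third call succeeds only because something in the hub actively forwards the traffic; the peerings alone never provide that path.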
Hybrid connectivity: VPN Gateway and ExpressRoute
| | VPN Gateway (Site-to-Site) | ExpressRoute |
|---|---|---|
| Path | Over the internet, IPSec-encrypted | Private, dedicated circuit via a partner/provider |
| Bandwidth | Up to gateway-SKU limits (hundreds of Mbps-Gbps) | 50 Mbps to 10 Gbps provider circuits; 10 or 100 Gbps ports with ExpressRoute Direct |
| SLA / latency | Best-effort internet | Consistent, low latency; higher SLA; private |
| Setup | Minutes-hours | Days-weeks (provider provisioning) |
| Use as | Quick start / backup / lower bandwidth | Primary enterprise link; large data; low latency; private access to Microsoft/M365 |
ExpressRoute Global Reach connects on-prem sites to each other through Azure. Common pattern: ExpressRoute primary + VPN backup. Gateway SKU determines bandwidth and features - size it deliberately (a frequent bottleneck).
Azure DNS & Private DNS
- Azure DNS public zones for internet-facing names; Private DNS zones for internal resolution, linked to VNets (with optional auto-registration).
- Private Endpoints depend on Private DNS zones resolving the privatelink.* FQDNs to the private IPs.
- Azure DNS Private Resolver - a managed resolver for hybrid DNS (inbound/outbound endpoints + forwarding rulesets) so on-prem can resolve Azure private names and vice-versa, without running DNS VMs.
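A quick way to tell whether Private DNS is doing its job is to look at what the service FQDN resolves to from inside the VNet. The sketch below classifies a list of resolved addresses using Python's standard `ipaddress` module; `classify_resolution` is a hypothetical helper, both sample addresses are made up, and in practice the input would come from an nslookup run on the affected VM.

```python
import ipaddress

def classify_resolution(resolved_ips):
    """Classify DNS answers as 'private', 'public', 'mixed', or 'no answer'."""
    flags = {ipaddress.ip_address(ip).is_private for ip in resolved_ips}
    if not flags:
        return "no answer"
    if flags == {True}:
        return "private"
    if flags == {False}:
        return "public"
    return "mixed"

# Healthy Private Endpoint: the FQDN resolves to an address inside the VNet range.
print(classify_resolution(["10.1.4.5"]))
# Missing zone link or A record: the FQDN still resolves to the public endpoint.
print(classify_resolution(["20.60.1.1"]))
```

A "public" result from a VM that should be using the Private Endpoint points at the zone link, the A record, or the on-premises forwarder rather than at NSGs or routing.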
How traffic flows in Azure
- Destination inside the VNet (or a peered VNet)? Routes over the backbone - only NSG rules apply.
- Outside? The effective route table (system + UDRs) picks the next hop: internet (via NAT Gateway / public IP), the firewall/NVA (if a UDR forces it), the VPN/ExpressRoute gateway, or a peering.
- NSGs (subnet + NIC, by priority) must allow it - remember default deny-inbound-from-internet.
- For private PaaS: Private Endpoint + correct Private DNS resolution, or Service Endpoint with the public FQDN.
Debugging is almost always: what do effective routes say? what do effective NSG rules say (both directions)? does DNS resolve to the right (private) IP?
Reference diagrams
Hub-and-spoke with centralized egress
Private Endpoint pattern
Network Watcher
| Tool | What it gives you |
|---|---|
| IP Flow Verify | "Is this specific 5-tuple allowed or denied, and by which NSG rule?" First stop for NSG questions. |
| Effective security rules / Effective routes | The actual merged NSG rules and routes applied to a NIC - what really governs the traffic. |
| Connection Monitor / Connection Troubleshoot | Test and continuously monitor reachability/latency A→B, naming the blocker. |
| NSG flow logs | Connection records for monitoring, forensics, and "is my rule dropping this?" (into a storage account / Log Analytics). |
Networking troubleshooting
Symptom: a VM has no outbound internet access (for example, it cannot reach OS package repos). Causes: no explicit outbound (NAT Gateway/firewall) and default outbound retired; a UDR sends 0.0.0.0/0 to a firewall that blocks it or has asymmetric routing; NSG egress deny; no public IP where one is required. Checks: Effective routes on the NIC; NAT Gateway on the subnet; firewall rules; IP Flow Verify. Fix: add a NAT Gateway (or firewall egress allow for repos/service tags); fix UDR/return routing. For OS repos, allow the relevant service tags/FQDNs on the firewall.
Symptom: a workload cannot reach a storage account through its Private Endpoint. Causes: Private Endpoint's Private DNS zone not linked to the VNet (FQDN resolves to public IP); missing A record; on-prem DNS lacks a conditional forwarder; the storage account still allows only the public path; NSG blocking. Checks: nslookup the FQDN from the VM (should return the private IP); Private DNS zone VNet links; endpoint connection state. Fix: link the correct privatelink.* zone to the VNet, add the A record (or use the auto DNS integration), set hybrid forwarding.
Symptom: spoke VNets cannot reach each other. Causes: relying on transitive peering (unsupported); overlapping CIDRs; peering missing "allow forwarded traffic"/gateway transit; NSG blocking the peer range; UDR not routing spoke-to-spoke via the hub firewall. Fix: route cross-spoke through the hub firewall (UDR + allow forwarded traffic) or use Virtual WAN; resolve overlap; open NSGs for the peer range.
Symptom: on-premises and Azure cannot reach each other over VPN/ExpressRoute. Causes: CIDR overlap; BGP not advertising routes both ways; VPN tunnel down (IKE/PSK mismatch); ExpressRoute circuit/peering down or route filters wrong; gateway SKU bandwidth exhausted; NSG/firewall blocking on-prem range. Checks: gateway/connection status; BGP learned/advertised routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU. Portal: Virtual network gateways / ExpressRoute circuits.
Symptom: Application Gateway reports its backend as unhealthy. Causes: NSG on the backend/App Gateway subnet blocking the probe or the required App Gateway management ports; wrong probe host/path/port/protocol; backend app on localhost or wrong port; certificate mismatch on HTTPS probes; App Gateway subnet missing required outbound. Fix: allow the probe + management traffic, align the probe, bind the app to all interfaces. Full flow in section 7.
Method: IP Flow Verify (NSG decision), Effective routes/rules (real next hop + merged rules), NSG flow logs (drops), and nslookup for DNS. For Azure Firewall, check the network/application rule collections and the UDR forcing traffic to it; for Private DNS, verify zone links and forwarders.
Azure networking gotchas
- NSG vs Azure Firewall - NSG is stateful L3/L4 filtering, not a firewall appliance; use both.
- Private Endpoint vs Service Endpoint - different mechanisms; Private Endpoint needs Private DNS zone linkage.
- Forgetting Private DNS zone linkage - the top Private Endpoint failure.
- Overlapping CIDRs - break peering and hybrid; plan IP space early.
- Poor hub-and-spoke - decide hub ownership, firewall, and UDRs before spokes land; consider Virtual WAN.
- Databases/services on public endpoints - use Private Endpoint; disable public access; deny public by Policy.
- Peering is not transitive - route spoke-to-spoke via the hub firewall or use Virtual WAN.
- Route tables (UDR) - forced tunneling and asymmetric routing black-hole traffic; design return paths.
- Default outbound retiring - give every subnet an explicit outbound (NAT Gateway/firewall).
- Gateway SKU limits - VPN/ExpressRoute/App Gateway SKUs cap bandwidth and features; size deliberately.
4. Compute Deep Dive
Azure Virtual Machines (series, disks, placement, HA), Scale Sets, Spot, images, and the managed/serverless options (App Service, Functions, Container Apps) - how to choose, place, scale, and operate compute on Azure.
Azure VMs come in series (B/D general, F compute, E/M memory, L storage, N GPU) with families and sizes. Use Availability Zones + Virtual Machine Scale Sets (flexible orchestration) for HA and autoscale, managed disks (Premium SSD v2 / Ultra for demanding I/O), Bastion for keyless RDP/SSH, and managed identity for keyless API access. For new apps, consider App Service, Container Apps, or Functions before managing VMs. Reserved Instances / Savings Plan and Azure Hybrid Benefit are the main cost levers.
VM series and families
| Series | Family | Best for |
|---|---|---|
| B | Burstable | Low-average, spiky workloads (dev, small web, bastions) |
| D (Dv5/Dasv5...) | General purpose | Web, app, microservices - the default |
| F | Compute optimized | High CPU-to-memory: batch, gaming, app servers |
| E | Memory optimized | Databases, in-memory caches, mid-size SAP |
| M / Mv3 | Memory optimized (very large) | Large databases, SAP HANA (certified), in-memory analytics |
| L (Lsv3) | Storage optimized | High local-disk IOPS/throughput: NoSQL, big data, data nodes |
| N (NC/ND/NV) | GPU | AI/ML training/inference, rendering, visualization |
| DC | Confidential computing | Data-in-use encryption (SGX / AMD SEV-SNP) |
Availability Zones, Availability Sets, and Scale Sets
| Mechanism | Protects against | Notes |
|---|---|---|
| Availability Zones | Datacenter failure within a region | Spread VMs/scale sets across 2-3 zones; ~99.99% VM SLA. Preferred for new HA. |
| Availability Set | Rack (fault domain) & maintenance (update domain) failure in one datacenter | Use where zones aren't available; cannot span zones. |
| Virtual Machine Scale Sets (VMSS) | - | Manage identical VMs with autoscale + rolling upgrades; Flexible orchestration is the modern default (can span zones + mix sizes). |
| Proximity Placement Group | - | Co-locates VMs for lowest latency (e.g. app + DB tier); trades off against zone spread. |
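To make the SLA figure in the table concrete, the sketch below converts an availability percentage into the downtime it permits. It is plain arithmetic, not an Azure API; the function name and the 30-day month are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability target permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# The ~99.99% zone-spread figure from the table, over a 30-day month.
print(round(allowed_downtime_minutes(99.99), 2))  # prints 4.32
```

Running the same function with 99.9 gives roughly ten times the budget, which is a quick way to compare placement options against a recovery objective.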
Spot VMs and dedicated hosts
| Option | What it does | Use when |
|---|---|---|
| Spot VMs | Deeply discounted capacity Azure can evict with 30s notice | Fault-tolerant, interruptible batch/CI/render - never stateful prod |
| Dedicated Hosts | A physical server dedicated to your subscription | Compliance/isolation, or per-core licensing needing host affinity/visibility |
| Reserved Instances / Savings Plan | 1/3-year commitment for a discount | Steady-state baseline compute (section 14) |
| Azure Hybrid Benefit | Apply existing Windows Server / SQL Server licenses | Big savings for Microsoft-licensed workloads |
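Spot eviction notices are surfaced inside the VM through the Instance Metadata Service Scheduled Events endpoint, where an eviction appears as an event of type Preempt. The sketch below only shows how a workload might filter such a document; the sample payload is hypothetical, the helper name is an assumption, and the HTTP call itself (which needs the `Metadata: true` header) is deliberately left out.

```python
import json

# Instance Metadata Service endpoint for Scheduled Events (queried from inside
# the VM with the header "Metadata: true"); not called in this sketch.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def preempt_events(document: dict) -> list[dict]:
    """Return the events in a Scheduled Events document that signal eviction."""
    return [e for e in document.get("Events", []) if e.get("EventType") == "Preempt"]


# Hypothetical payload shaped like a Scheduled Events response.
sample = json.loads("""
{
  "DocumentIncarnation": 1,
  "Events": [
    {"EventId": "example-id", "EventType": "Preempt",
     "ResourceType": "VirtualMachine", "Resources": ["vm-0"],
     "EventStatus": "Scheduled", "NotBefore": "Mon, 01 Jan 2024 00:00:30 GMT"}
  ]
}
""")

for event in preempt_events(sample):
    print("eviction pending for", event["Resources"], "not before", event["NotBefore"])
```

A batch worker would poll the endpoint on a short interval and checkpoint or drain as soon as a Preempt event shows up, since the notice window is only about 30 seconds.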
Images, Compute Gallery, and managed disks
- Marketplace/platform images and custom images; the Azure Compute Gallery versions and replicates images across regions for scale.
- Managed disks: Standard HDD, Standard SSD, Premium SSD, Premium SSD v2 (independently tunable IOPS/throughput/size), and Ultra Disk (highest performance, sub-ms). Ephemeral OS disks live on the host (fast, free, but lost on deallocate - stateless only).
- Snapshots and disk encryption (platform-managed keys, customer-managed keys via Key Vault, or Azure Disk Encryption/host-based).
Access and management
- Azure Bastion - browser-based RDP/SSH to VMs with no public IP and no exposed 3389/22, gated by RBAC. The secure default for admin access.
- Serial console + boot diagnostics for out-of-band recovery.
- VM extensions (Custom Script, DSC), Run Command for ad-hoc commands, Azure Monitor Agent for telemetry.
- Update Manager for patch orchestration; Azure Arc to manage on-prem/other-cloud servers with the same tooling (policy, monitoring, extensions).
App Service, Functions, Container Apps
| Service | What it is | Use for |
|---|---|---|
| App Service | Managed PaaS for web apps/APIs (Windows/Linux, code or container) | Web apps/APIs without managing VMs - a common default |
| Azure Functions | Event-driven serverless functions (Consumption/Premium/Flex plans) | Event handlers, glue, automation, scale-to-zero |
| Container Apps | Serverless containers (Kubernetes/KEDA under the hood, no cluster to run) | Microservices/containers without managing AKS - scale to zero, Dapr optional |
| Container Instances (ACI) | Single containers on demand | Simple, short-lived container tasks |
Choosing VM families by workload
| Workload | Starting point |
|---|---|
| Web / API | App Service/Container Apps; or D-series VMSS behind a load balancer |
| Middleware | D/E-series VMSS, memory-leaning |
| Databases (self-managed) | E/M-series + Premium SSD v2/Ultra; or Azure SQL/PaaS (section 6) |
| Oracle workloads | E/M-series VM (self-managed Oracle) or Oracle Database@Azure; constrained-vCPU for licensing |
| SAP | M-series (HANA-certified), proximity placement group, Ultra Disk |
| Batch / CI / render | Spot VMs in a VMSS, or Azure Batch |
| Memory-heavy | E or M series |
| CPU-heavy | F series |
| Storage-heavy (local IOPS) | L series |
| GPU / AI | N series (NC/ND/NV); or Azure ML (section 12) |
| Cost-sensitive / spiky | B-series + autoscale; Spot for fault-tolerant parts; RIs/Savings Plan for baseline |
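One rough way to apply the table is to start from the workload's memory-per-vCPU ratio. The sketch below assumes approximate ratios of about 2 GiB per vCPU for F, 4 for D, and 8 for E, with M above that; the thresholds and function name are illustrative assumptions, so check current size documentation before committing to a family.

```python
def suggest_series(vcpus: int, memory_gib: float) -> str:
    """Map a workload's memory-per-vCPU ratio to a starting VM series.

    Thresholds are rough assumptions for illustration, not Azure limits.
    """
    ratio = memory_gib / vcpus
    if ratio <= 2:
        return "F"  # compute optimized
    if ratio <= 4:
        return "D"  # general purpose
    if ratio <= 8:
        return "E"  # memory optimized
    return "M"      # very large memory


print(suggest_series(vcpus=8, memory_gib=64))  # prints E
```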
Operational guidance
Resize / patch VMs safely (Ops)
- Resize: pick a size that is available in the VM's region/zone, then change the size; expect a restart and brief downtime. If the target size isn't offered on the current hardware cluster, stop/deallocate the VM first, resize, then start. In a VMSS, update the model and roll the instances.
- Patch: use Update Manager for scheduled, reported OS patching across VMs (and Arc servers); for VMSS prefer replacing instances from a new image (immutable) over in-place patching.
Troubleshoot boot / high CPU / memory / disk (Ops)
- Boot: enable boot diagnostics + serial console to see boot output; use the VM "Redeploy"/"Reset password"/repair-VM options for stuck boots.
- High CPU / memory: Azure Monitor + VM Insights (install the Azure Monitor Agent - guest memory isn't collected without it); right-size or autoscale.
- Disk full: expand the managed disk, then grow the partition/filesystem; alert at 85%.
- Disk attach: confirm the disk is attached and initialized; check the disk is in the same region; LUN mapping.
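The 85% disk alert above is normally an Azure Monitor rule, but the same check can be sketched locally with the Python standard library, for example inside a guest health script. The function names and default path are assumptions for illustration.

```python
import shutil


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Percentage of a filesystem that is in use."""
    return 100 * used_bytes / total_bytes


def should_alert(path: str = "/", threshold: float = 85.0) -> bool:
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return used_percent(usage.total, usage.used) >= threshold


print(should_alert("/"))
```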
Design compute for production HA (Design)
- Zone-spread VMSS (Flexible) + autoscale + health probes behind a zone-redundant load balancer.
- No public IPs; Bastion for access; managed identity for API access.
- Azure Monitor Agent + VM Insights; Update Manager for patch compliance; Compute Gallery for golden images.
- RIs/Savings Plan + Hybrid Benefit for cost; a paired region for DR (Azure Site Recovery, section 13).
5. Storage Deep Dive
Managed Disks, Blob Storage, Azure Files, and NetApp Files - their performance, redundancy (LRS/ZRS/GRS/GZRS), tiers, and the decision of which to use for databases, shared filesystems, backups, archives, and data lakes. Plus Azure Backup and Site Recovery.
Managed Disks (Standard HDD/SSD, Premium SSD, Premium SSD v2, Ultra) = block storage for VMs. Blob Storage (Hot/Cool/Cold/Archive tiers) = object storage for backups, data lakes, static content - not a filesystem. Azure Files = managed SMB/NFS shares; Azure NetApp Files = high-performance enterprise NAS. Choose redundancy (LRS/ZRS/GRS/GZRS) deliberately per data. Lock down storage accounts (disable public blob access, use Private Endpoints, prefer managed identity over keys/SAS). Azure Backup for backup, Site Recovery for DR replication.
Managed Disks
| Disk type | Profile | Use for |
|---|---|---|
| Standard HDD | Lowest cost, low IOPS | Dev/test, infrequently accessed data |
| Standard SSD | Better latency/consistency than HDD | Light production, web servers |
| Premium SSD | Production-grade, size-linked performance | Most production VMs/databases |
| Premium SSD v2 | IOPS/throughput tunable independently of size | Best cost/perf for databases needing IOPS without huge capacity |
| Ultra Disk | Highest performance, sub-ms latency | Top-tier databases (log disks), SAP HANA |
| Ephemeral OS disk | On the host - fast, free, lost on deallocate | Stateless VMSS OS disks only |
Blob Storage
- A storage account holds blob containers (plus optionally Files/Queues/Tables). Blobs come in access tiers: Hot (frequent access), Cool (infrequent; 30-day minimum retention), Cold (rare; 90-day minimum retention), and Archive (offline and cheapest; must be rehydrated before reading, which takes hours).
- Lifecycle management auto-moves/deletes blobs by age/access; versioning + soft delete protect against accidental change/delete; immutable storage (time-based/legal hold) gives WORM compliance; object replication copies between accounts/regions.
- Data Lake Storage Gen2 = a storage account with hierarchical namespace enabled (real directories, POSIX ACLs) for analytics - covered in section 11.
- Access: prefer Entra ID + RBAC data roles + managed identity. SAS tokens (account/service/user-delegation) grant scoped, time-boxed access; a stored access policy lets you revoke a service SAS, and a user-delegation SAS is revoked by revoking its user delegation key. Storage account keys are all-powerful - avoid distributing them.
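Lifecycle management boils down to age-based rules. The sketch below mimics that decision in plain Python so the tiering logic is easy to reason about; the day thresholds are hypothetical choices, not service defaults, and a real policy is defined on the storage account rather than in application code.

```python
# Hypothetical age thresholds (days since last modification) for each tier.
TIER_AFTER_DAYS = [(180, "Archive"), (90, "Cold"), (30, "Cool")]


def tier_for_age(days_since_modified: int) -> str:
    """Return the coldest tier whose threshold the blob's age has reached."""
    for threshold, tier in TIER_AFTER_DAYS:
        if days_since_modified >= threshold:
            return tier
    return "Hot"


for age in (5, 45, 120, 400):
    print(age, tier_for_age(age))
```

Listing the thresholds from coldest to warmest keeps the rule order explicit: the first match wins, so a 400-day-old blob lands in Archive rather than Cool.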
Redundancy: LRS, ZRS, GRS, GZRS
| Option | Copies | Protects against |
|---|---|---|
| LRS | 3 copies in one datacenter | Disk/rack failure only |
| ZRS | 3 copies across zones in the region | Datacenter/zone failure (in-region HA) |
| GRS | LRS + async copy to the paired region | Regional disaster (read access with RA-GRS) |
| GZRS | ZRS + async copy to the paired region | Zone and regional failure (highest) |
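The table reduces to two questions: must the data survive a zone failure, and must it survive a regional failure? A minimal sketch of that decision, with a hypothetical function name:

```python
def choose_redundancy(survive_zone_failure: bool, survive_region_failure: bool) -> str:
    """Pick the storage redundancy option that covers the stated failure modes."""
    if survive_zone_failure and survive_region_failure:
        return "GZRS"
    if survive_region_failure:
        return "GRS"
    if survive_zone_failure:
        return "ZRS"
    return "LRS"


print(choose_redundancy(survive_zone_failure=True, survive_region_failure=False))  # prints ZRS
```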
Azure Files, File Sync, and NetApp Files
- Azure Files - managed SMB and NFS shares (Standard on HDD, Premium on SSD). Mount from VMs, on-prem, or containers. Identity-based auth (Entra/AD) for SMB.
- Azure File Sync - cache Azure Files on on-prem Windows servers with cloud tiering (hybrid file access).
- Azure NetApp Files - high-performance, low-latency enterprise NAS (NFS/SMB) for demanding workloads: SAP, HPC, large shared filesystems, and Oracle datafiles over NFS. Separate service, higher performance/cost.
Azure Backup and Azure Site Recovery
| | Azure Backup | Azure Site Recovery (ASR) |
|---|---|---|
| Purpose | Point-in-time backup & restore (VMs, disks, files, SQL/SAP in VM, Blob) | Continuous replication for DR failover to another region |
| Recovery | Restore from a recovery point (higher RTO/RPO) | Fail over the whole workload with low RPO |
| Use for | Data protection, retention, ransomware recovery (immutable vault) | Cross-region DR of running workloads |
When to use which
| Need | Use |
|---|---|
| VM OS / database disks | Managed Disks (Premium SSD v2 / Ultra for DB) |
| Backups (DB/app) | Azure Backup + Blob (Cool/Archive) with lifecycle + immutability |
| Log / long-term archive | Blob Archive tier + lifecycle + immutable policy |
| Data lake | Blob with hierarchical namespace (ADLS Gen2) |
| Shared filesystem (SMB/NFS) | Azure Files (or NetApp Files for high performance) |
| High-performance NAS / SAP / Oracle NFS | Azure NetApp Files |
| Static website / media | Blob static website + Front Door/CDN |
| Hybrid file access | Azure File Sync |
| Bulk data into Azure | Data Box (physical) / AzCopy (online) |
| Cross-region DR of workloads | Azure Site Recovery |
Storage gotchas
- Blob is object storage, not a filesystem - no random writes/locks; use Files/NetApp for that.
- Archive tier has a rehydration delay (hours) - never for data you need immediately.
- SAS tokens are a security risk - keep them short-lived, user-delegation where possible, and revocable; prefer managed identity + RBAC.
- Public blob access exposure - disable at account level; enforce by Policy.
- Private Endpoint DNS mistakes - link the correct privatelink.* zone or nothing resolves.
- Disk performance sizing - IOPS/throughput follow SKU/size; use Premium SSD v2/Ultra + striping for DBs.
- Snapshot cost growth - incremental snapshots still accumulate; set retention.
- Cross-region replication cost - GRS/object replication cost money and lag (async RPO).
- Wrong redundancy - confusing LRS/ZRS/GRS/GZRS leaves data under- or over-protected.
- Ephemeral OS disk data loss - it's lost on deallocate; stateless only.
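For the Private Endpoint DNS gotcha, a quick check from a VM inside the VNet is whether the service hostname resolves to a private address. The sketch below uses only the standard library; the helper names and the storage account name in the comment are hypothetical.

```python
import ipaddress
import socket


def all_private(addresses: list[str]) -> bool:
    """True when every resolved address is in a private range."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


def resolve(hostname: str) -> list[str]:
    """Resolve a hostname to its addresses using the local resolver."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Hypothetical account name; run from a VM inside the VNet:
#   print(all_private(resolve("examplestorageacct.blob.core.windows.net")))
print(all_private(["10.1.2.4"]), all_private(["20.60.1.1"]))  # prints True False
```

If the check returns False from inside the VNet, the usual suspect is a privatelink zone that exists but is not linked to the VNet doing the lookup.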
6. Database Services Deep Dive
Azure's database portfolio - Azure SQL (Database / Managed Instance / on VM), PostgreSQL, MySQL, Cosmos DB, Cache for Redis, and the analytics stores - what each manages, how HA/DR/backup/patching differ, how to choose, and what changes for a DBA coming from SQL Server or Oracle.
For SQL Server workloads, choose along a spectrum: Azure SQL Database (fully-managed, cloud-native, most managed) → SQL Managed Instance (near-full SQL Server compatibility, managed) → SQL Server on a VM (full control, you manage). For open source, Azure Database for PostgreSQL / MySQL (Flexible Server). For NoSQL at global scale, Cosmos DB. For cache, Azure Cache for Redis. For analytics, Synapse / Microsoft Fabric. Oracle has no native managed service - use a VM or Oracle Database@Azure. Managed services own patching/backup/HA; you own schema, queries, and access.
The portfolio at a glance
| Service | Model | Sweet spot |
|---|---|---|
| Azure SQL Database | Fully-managed SQL (single DB / elastic pool; DTU or vCore) | New/cloud-native SQL apps; most managed, least control |
| Azure SQL Managed Instance | Managed instance with near-full SQL Server surface (SQL Agent, cross-DB, CLR, linked servers) | Lift-and-shift SQL Server needing instance features |
| SQL Server on Azure VM | IaaS - you run SQL Server | Full control, unsupported features/versions, OS access |
| Azure DB for PostgreSQL / MySQL (Flexible Server) | Managed OSS databases | Postgres/MySQL apps; zone-redundant HA |
| Cosmos DB | Globally-distributed multi-model NoSQL (NoSQL/Mongo/Cassandra/Gremlin/Table APIs) | Global scale, low latency, elastic throughput |
| Azure Cache for Redis | Managed Redis | Cache, session, rate limiting, leaderboards |
| Synapse / Microsoft Fabric / Data Explorer | Analytics warehouse / lakehouse / log-time-series | Analytics, not OLTP (section 11) |
| Oracle Database@Azure | Oracle Exadata/Autonomous run by Oracle inside Azure datacenters | Oracle workloads wanting managed Oracle in Azure (verify current availability) |
Service deep dives
Azure SQL Database
- Purchasing models: vCore (recommended; General Purpose / Business Critical / Hyperscale service tiers) or legacy DTU. Serverless compute auto-scales and can auto-pause. Hyperscale scales storage to 100TB+ with fast backups/restores.
- HA: built-in; Business Critical and zone-redundant options give higher SLAs. Failover groups + active geo-replication for cross-region DR (readable secondaries).
- Backups: automatic with point-in-time restore (retention configurable) + long-term retention; you don't manage backup files.
- Patching: fully Microsoft-managed (in maintenance windows you can set).
- Limits: no SQL Agent, no cross-database queries by default, no instance-level features - it's a single database service. If you need those, use Managed Instance.
Azure SQL Managed Instance
A managed instance with near-full SQL Server compatibility: SQL Agent, cross-database queries, CLR, Service Broker, linked servers, and instance-scoped features - deployed into your VNet (private). The best target for lifting an existing SQL Server estate while offloading patching/backup/HA.
- HA: built-in; Business Critical + zone redundancy; failover groups for cross-region DR.
- Backups/patching: managed (automatic backups + PITR; Microsoft patches in windows).
- Networking: lives in a delegated subnet in your VNet - plan the subnet and connectivity.
- Still has limits vs. a full SQL Server on a VM (some instance features, no OS access, restrictions on specific configurations). Verify the surface for your app.
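The SQL Server spectrum described in this section (SQL Database, Managed Instance, SQL Server on a VM) can be sketched as a two-question decision. The function and parameter names are assumptions, and a real assessment would check individual features rather than two flags.

```python
def pick_sql_target(needs_os_or_unsupported_feature: bool,
                    needs_instance_features: bool) -> str:
    """Walk the SQL spectrum from most control to most managed."""
    if needs_os_or_unsupported_feature:
        return "SQL Server on Azure VM"
    if needs_instance_features:  # SQL Agent, cross-DB queries, CLR, linked servers
        return "Azure SQL Managed Instance"
    return "Azure SQL Database"


print(pick_sql_target(needs_os_or_unsupported_feature=False,
                      needs_instance_features=True))
```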
Azure Database for PostgreSQL / MySQL (Flexible Server)
- Flexible Server is the current model: zone-redundant HA (standby in another zone), configurable maintenance windows, private access (VNet-injected or Private Endpoint), and server parameters.
- HA/DR: zone-redundant HA in-region; read replicas (incl. cross-region) for scaling and DR.
- Backups: automatic + PITR; you set retention.
- Postgres supports the pgvector extension for AI/vector workloads (section 12); MySQL for common LAMP-style apps.
Cosmos DB & Azure Cache for Redis
- Cosmos DB - globally-distributed, multi-model NoSQL with turnkey multi-region writes, 5 consistency levels, elastic RU/s (or autoscale/serverless), and single-digit-ms latency. Partition key design is critical - a poor key hotspots partitions and inflates RU cost. Has an analytical store + vector search.
- Azure Cache for Redis - managed Redis for caching, sessions, and pub/sub; tiers (Basic/Standard/Premium/Enterprise) trade HA, clustering, persistence, and modules.
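To see why partition key choice matters, it can help to measure how concentrated a candidate key is in a sample of your data. The sketch below only counts logical key values; it does not model how Cosmos DB hashes keys onto physical partitions, and the sample data is hypothetical.

```python
from collections import Counter


def hottest_share(partition_key_values: list[str]) -> float:
    """Fraction of items that land on the single most common key value."""
    counts = Counter(partition_key_values)
    return max(counts.values()) / len(partition_key_values)


# Hypothetical samples: a low-cardinality key versus a high-cardinality one.
by_country = ["US"] * 8 + ["DE", "JP"]
by_user_id = [f"user-{i}" for i in range(10)]
print(hottest_share(by_country), hottest_share(by_user_id))  # prints 0.8 0.1
```

A key where one value holds most of the items is a likely hotspot; a share close to 1/N across N values suggests an even spread.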
Database service decision table
| Workload | Recommended | Reason | HA | DR | Ops responsibility | Cost lever |
|---|---|---|---|---|---|---|
| New SQL app (single DB) | Azure SQL Database | Most managed, cloud-native | Built-in / zone-redundant | Failover group / geo-replica | Schema/queries | vCore right-size; serverless auto-pause |
| Lift-and-shift SQL Server | SQL Managed Instance | Instance-level compatibility, managed | Built-in / zone-redundant | Failover group | Schema/queries + agent jobs | Right-size; Hybrid Benefit |
| Full control / unsupported feature | SQL Server on Azure VM | OS + full SQL control | You build (AG/FCI + zones) | You build (AG/ASR) | Everything | Hybrid Benefit; VM size |
| PostgreSQL app | Azure DB for PostgreSQL (Flexible) | Managed, zone-redundant | Zone-redundant HA | Cross-region read replica | Schema/queries | Right-size; burstable tiers |
| MySQL web app | Azure DB for MySQL (Flexible) | Managed, common | Zone-redundant HA | Read replica | Schema/queries | Right-size |
| Global NoSQL / low latency | Cosmos DB | Global distribution, elastic | Built-in | Multi-region | Data model / partition key | Autoscale RU / serverless |
| Cache / session | Azure Cache for Redis | Managed Redis | Premium/Enterprise HA | Geo-replication (Enterprise) | Keys/TTL | Right-size tier |
| Data warehouse / analytics | Synapse / Microsoft Fabric | Analytics, not OLTP | Built-in | Config-dependent | Schema/queries | Pause/scale compute |
| Oracle workload | Oracle DB@Azure or Oracle on VM | No native managed Oracle | Oracle-managed / you build | Data Guard | Oracle side | Licensing; verify offering |
Connectivity & security
- Private Endpoint (or VNet injection for MI/Flexible Server) is the production default - disable public network access. Plan the Private DNS zone (privatelink.database.windows.net, etc.).
- Entra authentication + managed identity - apps authenticate to Azure SQL/Postgres/MySQL via Entra tokens (no passwords in config). Prefer this over SQL logins.
- Encryption: TDE (transparent data encryption, on by default; customer-managed keys via Key Vault), Always Encrypted for column-level protection from even DBAs, TLS in transit.
- Microsoft Defender for SQL - vulnerability assessment + threat detection; Query Performance Insight + automatic tuning for performance.
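As an illustration of token-based sign-in to Azure SQL, one common pattern with the Microsoft ODBC driver is to pass the Entra access token as a pre-connect attribute. The sketch below covers only the token packing step, which runs without any Azure packages; the commented connection lines assume the azure-identity and pyodbc packages and a connection string named `conn_str`, none of which come from this document.

```python
import struct

# ODBC connection attribute used by the Microsoft ODBC driver to accept an
# access token instead of a username and password.
SQL_COPT_SS_ACCESS_TOKEN = 1256


def pack_access_token(token: str) -> bytes:
    """Encode a token as the length-prefixed UTF-16-LE structure ODBC expects."""
    raw = token.encode("utf-16-le")
    return struct.pack(f"<I{len(raw)}s", len(raw), raw)


# With the azure-identity and pyodbc packages installed (not imported here),
# the packed token would be passed roughly like this:
#   token = DefaultAzureCredential().get_token("https://database.windows.net/.default").token
#   pyodbc.connect(conn_str, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: pack_access_token(token)})
packed = pack_access_token("abc")
print(len(packed))  # prints 10
```

The point of the pattern is that no password ever appears in configuration; the managed identity obtains a short-lived token at runtime.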
How HA, DR, backup, and patching differ
| Service | HA | DR | Backup | Patching |
|---|---|---|---|---|
| Azure SQL DB | Built-in; zone-redundant / Business Critical | Failover groups + active geo-replication | Automatic + PITR + LTR | Microsoft (windows) |
| SQL Managed Instance | Built-in; zone-redundant | Failover groups | Automatic + PITR | Microsoft (windows) |
| PostgreSQL / MySQL Flexible | Zone-redundant HA (opt-in) | Cross-region read replica | Automatic + PITR | Microsoft (windows) |
| SQL Server on VM | You build: Always On AG / FCI + zones | You build: AG replica / Azure Site Recovery | You configure (Azure Backup for SQL) | You (Update Manager) |
| Cosmos DB | Built-in | Multi-region (turnkey) | Continuous backup + PITR | Fully managed |
Azure database gotchas for SQL Server and Oracle DBAs
- Azure SQL Database != SQL Server on a VM - it's a cloud-native single-database service without instance features (Agent, cross-DB, linked servers, OS access).
- Managed Instance gives more compatibility but still has limits - verify your specific instance features are supported before assuming a clean lift-and-shift.
- Patching control differs by service - PaaS patches in Microsoft-run windows you schedule, not your opatch/CU cadence.
- Backup access differs - PaaS backups are service-managed (PITR/LTR), not files you copy; export (BACPAC/dump) for portability.
- Private Endpoint and DNS must be planned - the private path needs the privatelink.* zone linked, or connections use the public endpoint.
- Licensing choices matter - Azure Hybrid Benefit (bring your SQL/Windows licenses) materially changes cost; for Oracle, licensing/counting on Azure VMs needs LMS-aware planning.
- Performance troubleshooting differs - Query Performance Insight / Query Store / automatic tuning and Azure Monitor, not your on-prem toolset.
- Oracle is a special case - there is no native "managed Oracle." You self-manage Oracle on a VM (Data Guard, backups, ASM/NetApp Files, constrained-vCPU for licensing) or use Oracle Database@Azure (Oracle-operated Exadata/Autonomous in Azure) - verify current regional availability and terms in the official documentation before designing.
Enterprise examples
SQL Server enterprise workload (SQL)
Lift to SQL Managed Instance (Business Critical + zone redundancy) in a delegated subnet, Private Endpoint/VNet only, failover group to a paired region, Azure Hybrid Benefit, Defender for SQL on. Use SQL-on-VM only if a required feature isn't in MI.
Oracle workload on Azure (Oracle)
Self-managed Oracle on E/M-series VMs with Premium SSD v2/Ultra or Azure NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU + Dedicated Host for licensing. Or Oracle Database@Azure for a managed Oracle Exadata/Autonomous experience inside Azure (verify availability).
PostgreSQL application database (OSS)
Azure Database for PostgreSQL Flexible Server, zone-redundant HA, Private Endpoint, Entra auth, automatic backups + PITR, cross-region read replica for DR, pgvector if doing AI/RAG.
Globally distributed NoSQL (NoSQL)
Cosmos DB with multi-region writes, partition key chosen for even distribution, autoscale RU/s, chosen consistency level per workload, analytical store or vector search where needed.
7. Load Balancing and Traffic Management
The four Azure load-balancing services - Load Balancer (L4), Application Gateway (L7 regional), Front Door (L7 global), and Traffic Manager (DNS) - when to use each, how they are assembled, and how to debug the classic unhealthy-backend and DNS failures.
Four services, two axes (L4 vs L7, regional vs global): Azure Load Balancer (L4, regional, public/internal, ultra-fast), Application Gateway (L7, regional, with WAF, path/host routing, TLS), Front Door (L7, global, anycast + CDN + WAF + global failover), and Traffic Manager (DNS-based global routing). Combine them (e.g. Front Door → regional App Gateway → backend). The #1 failure is an NSG blocking the health probe or the App Gateway management ports.
The four load-balancing services
| Service | Layer / scope | Use for |
|---|---|---|
| Azure Load Balancer (Standard) | L4 (TCP/UDP), regional, public or internal, zone-redundant | High-throughput L4, internal VIPs (e.g. SQL AG listener), non-HTTP; also provides outbound rules |
| Application Gateway (v2) | L7 (HTTP/S), regional, with WAF | Regional web apps needing path/host routing, TLS termination, WAF, autoscaling |
| Azure Front Door | L7 (HTTP/S), global, anycast + CDN + WAF | Internet-facing global apps: edge acceleration, global load balancing/failover, WAF at the edge |
| Traffic Manager | DNS-based, global | DNS-level routing across regions/endpoints (priority/weighted/performance/geographic); works for non-HTTP too |
| Cross-region Load Balancer | L4 global | Global L4 with a single anycast frontend across regional LBs |
| NAT Gateway | Outbound SNAT | Scalable outbound internet for a subnet (not a load balancer, but the modern outbound method) |
When to use which
Application Gateway anatomy + WAF
- Components: frontend IP, listener (port/protocol + cert), rules (path/host routing), HTTP settings (backend port/protocol, cookie affinity, probe), backend pool (VMSS/NICs/App Service/IPs), health probe, and WAF policy.
- SSL: termination at the gateway (offload) or end-to-end (re-encrypt to backend). Manage certs via Key Vault integration.
- WAF - OWASP core rule set + custom rules, geo/rate limiting; run in Detection mode first, then Prevention.
Front Door
Front Door is the global L7 entry point: anycast frontend, edge TLS + caching/CDN, WAF at the edge, and health-based global routing/failover across regional backends. Use it for internet-facing apps that need low latency worldwide and automatic region failover. Managed certificates auto-provision once DNS points at the Front Door endpoint.
Load balancing troubleshooting
Likely causes (in order)
- NSG blocks the probe or the required App Gateway management ports (v2 needs specific inbound from the GatewayManager service tag) on the App Gateway subnet.
- Health probe host/path/port/protocol wrong vs. what the app serves (probe expects 200-399).
- App not listening / bound to localhost / wrong backend port in HTTP settings.
- HTTPS probe cert/hostname mismatch (end-to-end SSL); backend expects a specific host header.
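- For Azure Load Balancer: probe port not open on the backend, or (Standard LB) no NSG rule allowing the traffic and the AzureLoadBalancer probe source - Standard LB is closed to inbound flows until an NSG allows them.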
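The probe rule in the list above (a response in the 200-399 range counts as healthy) is simple enough to state as code. The sketch below is a local illustration with a hypothetical function name, useful when reasoning about why a backend that answers 401 or 404 on the probe path is marked unhealthy.

```python
def probe_passes(status_code: int, low: int = 200, high: int = 399) -> bool:
    """True when a probe response falls inside the accepted status range."""
    return low <= status_code <= high


for status in (200, 302, 401, 404, 503):
    print(status, probe_passes(status))
```

An app that requires authentication on its root path will fail a default probe, which is one reason to point the probe at a dedicated health path.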
Checks & fix
App Gateway > Backend health (shows the reason); allow the probe + management traffic on the NSG; align the probe (path/port/protocol/host); bind the app to all interfaces; verify HTTP settings backend port/protocol.
az network application-gateway show-backend-health -g RG -n APPGW -o table
TLS / certificate errors. Causes: Front Door/App Gateway managed cert not provisioned (DNS must point at the frontend first, then validation completes); expired/incomplete cert chain; Key Vault access policy/permissions missing for App Gateway's managed identity; hostname/SNI mismatch. Fix: point DNS at the frontend and wait for provisioning; grant the gateway's identity Get permission on the Key Vault secret/cert; include the full chain and correct SANs.
Requests misrouted, blocked, or not reaching the backend. Causes: listener on the wrong port/host or a catch-all rule masking a specific one; WAF denying legitimate requests (over-broad rule / false positive - check WAF logs, use Detection mode first, then tune/exclude); backend behind a Private Endpoint whose Private DNS isn't resolving; wrong frontend IP config (public vs private). Fix: verify listener/rule order and host, review WAF logs and add exclusions, confirm Private DNS zone linkage, check the frontend IP.
8. Security Deep Dive
Defense in depth on Azure: identity (Entra, PIM, Conditional Access), governance (Azure Policy, locks), network (NSG, Firewall, Private Link, DDoS), data (Key Vault, encryption), and detection (Defender for Cloud, Sentinel) - plus how to secure subscriptions, storage, VMs, and databases, ending in a production checklist.
Layer your controls: identity (Entra + PIM for privileged, Conditional Access + MFA, managed identities, no standing Owner), governance (Azure Policy to forbid the risky thing; resource locks), network (private endpoints, NSG/ASG, Azure Firewall, DDoS, no public exposure, Bastion), data (Key Vault/Managed HSM, CMK, TDE, Always Encrypted), and detection (Defender for Cloud, Sentinel, centralized diagnostic logs to Log Analytics). Reduce public exposure, encrypt with keys you control, centralize logs, and prefer preventive Policy over after-the-fact detection.
Azure shared responsibility model
Microsoft secures the infrastructure (physical, host, network fabric, and managed-service internals). You own: identity and access (Entra + RBAC), data classification and access, network exposure and firewall, key management choices, workload/OS security (IaaS/AKS nodes), secure configuration, and monitoring/response. The higher up the managed stack (VM → AKS → Azure SQL → App Service/Functions), the more Microsoft handles - but data, identity, and configuration always remain yours.
The control layers
| Layer | Controls | Key services |
|---|---|---|
| Identity & access | Who can sign in and do what | Entra ID, Conditional Access, PIM, Identity Protection, RBAC, managed identities |
| Governance | What is allowed to exist/be configured | Azure Policy (deny/audit/deployIfNotExists), resource locks, management groups |
| Network | What can reach what | NSG/ASG, Azure Firewall, Private Link/Endpoint, DDoS Protection, Bastion, WAF |
| Data | Protect data at rest/in transit | Key Vault / Managed HSM (CMK), TDE, Always Encrypted, storage encryption, TLS |
| Detective / posture | Find misconfig & threats | Microsoft Defender for Cloud, Microsoft Sentinel, Activity Log, diagnostic settings |
Key Vault, Managed HSM, and encryption
- Key Vault - store keys, secrets, and certificates; workloads read them via managed identity + RBAC (or access policies). Managed HSM for FIPS 140-2 Level 3 single-tenant HSM keys.
- Encryption at rest - on by default (platform keys); use customer-managed keys (CMK) in Key Vault for storage/disks/databases where you need key control and the "disable key" switch.
- Always Encrypted / TDE for databases; TLS everywhere in transit.
- Turn on Key Vault soft-delete + purge protection so keys/secrets can't be permanently destroyed by mistake or malice.
Defender for Cloud & Sentinel
Defender for Cloud: CSPM (secure score, misconfiguration recommendations) + CWP (Defender plans for servers, SQL, storage, containers, Key Vault, etc.) with threat detection. Turn on the relevant plans, then work through the recommendations to raise the secure score.
Microsoft Sentinel: cloud-native SIEM/SOAR on Log Analytics - ingest Azure + M365 + third-party logs, detect with analytics rules, and automate response with playbooks.
Activity Log = control-plane operations (who did what). Diagnostic settings route resource + platform logs/metrics to a central Log Analytics workspace / storage / Event Hub. Enable them everywhere via Policy.
DDoS Network/IP Protection for L3/4 volumetric attacks; WAF (App Gateway/Front Door) for L7. Protect public frontends.
How to secure specific things
Secure a production subscription and multi-subscription environment (Foundation)
- Access via groups; no standing Owner at subscription scope; least-privilege built-in roles at RG/resource; privileged roles eligible via PIM; Conditional Access + MFA; protected break-glass accounts.
- Preventive Azure Policy at the management group: allowed regions/SKUs, deny public IPs, deny public blob access, require diagnostic settings, require tags; resource locks on foundational resources.
- Hub-and-spoke with Azure Firewall; Private Endpoints for PaaS; DDoS on public frontends.
- Defender for Cloud (all relevant plans) + Sentinel; central Log Analytics; budgets + quotas.
- Key Vault + managed identities; CMK for sensitive data.
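A preventive policy such as "deny public IPs" is expressed as an if/then rule. The sketch below writes one such rule as a Python dictionary and adds a toy evaluator to show the intent; the evaluator is purely illustrative (Azure evaluates policies at deployment time), handles only this one condition, and the "not_applicable" result is a label invented here rather than a policy effect.

```python
import json

# A policy rule that denies creation of public IP address resources.
deny_public_ip_rule = {
    "if": {"field": "type", "equals": "Microsoft.Network/publicIPAddresses"},
    "then": {"effect": "deny"},
}


def evaluate(rule: dict, resource_type: str) -> str:
    """Toy evaluator for the single 'field type equals' condition above."""
    condition = rule["if"]
    matches = resource_type.lower() == condition["equals"].lower()
    return rule["then"]["effect"] if matches else "not_applicable"


print(json.dumps(deny_public_ip_rule, indent=2))
print(evaluate(deny_public_ip_rule, "Microsoft.Network/publicIPAddresses"))
```

Assigning a rule like this at the management group means every subscription underneath inherits it, which is what makes the control preventive rather than detective.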
Secure storage accounts (Storage)
- Disable public blob access + "allow shared key" where possible; use Entra + RBAC data roles + managed identity over keys/SAS.
- Private Endpoint + Private DNS; firewall to specific VNets; CMK for sensitive data; soft delete + versioning + immutability for backups.
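- If SAS is required: prefer a user-delegation SAS with a short lifetime (revoke it by revoking the user delegation key); for a service SAS, attach a stored access policy so it can be revoked.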
Secure VMs & databases (Compute / Data)
- VMs: no public IP; Bastion for access; managed identity; NSG/ASG micro-segmentation; Update Manager; Defender for Servers.
- Databases: Private Endpoint / VNet only, disable public access; Entra auth + managed identity; TDE + CMK; Always Encrypted for sensitive columns; Defender for SQL.
Secure public load balancers & reduce exposure (Edge)
- Public HTTP behind Front Door / App Gateway + WAF and DDoS; backends private (no public IPs).
- Enforce "no public IP on VMs" and "no public blob access" via Policy; use Private Endpoints for all sensitive PaaS.
- Prefer Bastion + private access; audit for stray public IPs regularly (Defender/Policy).
Production Azure security checklist
- Human access via groups; MFA enforced via Conditional Access; risky sign-ins blocked (Identity Protection).
- Privileged roles (Owner, User Access Admin, Global Admin) eligible via PIM, not permanent; approvals + audit on.
- Two protected, monitored break-glass accounts (excluded from lock-out policies, alerted on every sign-in).
- No standing Owner/Contributor at subscription/MG for daily work; least-privilege built-in roles at RG/resource.
- Workloads use managed identities; app secrets minimized, rotated, and in Key Vault.
- Preventive Azure Policy: deny public IPs, deny public blob access, allowed regions/SKUs, require diagnostic settings + tags.
- Resource locks (CanNotDelete) on hub network, Key Vault, and production data.
- Databases and PaaS on Private Endpoints; public network access disabled; no public database endpoints.
- Public HTTP behind Front Door/App Gateway + WAF; DDoS Protection on public frontends; backends private.
- Sensitive data encrypted with CMK in Key Vault (soft delete + purge protection on).
- Diagnostic settings + Activity Log centralized to a Log Analytics workspace; Sentinel ingesting.
- Defender for Cloud plans enabled across the management group; secure score tracked.
- Alerts on RBAC/PIM changes, new app secrets, public exposure, Key Vault access anomalies.
- Budgets + quotas as guardrails; consistent tags for attribution.
- Backups immutable (vault soft-delete/immutability); DR tested incl. CMK key availability in the DR region.
Common security mistakes
- Owner assigned too broadly; not using PIM (standing privilege).
- Weak Conditional Access (no MFA, admins unprotected).
- Storage account public exposure; over-permissive NSGs.
- Public database endpoints instead of Private Endpoints.
- Not enabling diagnostic logs; not centralizing logs.
- Secrets in code/app settings instead of Key Vault; not using managed identities.
- Not using Private Endpoints for sensitive services; not enforcing Azure Policy (guardrails off).
9. Observability, Monitoring, and Operations
Azure Monitor (metrics, logs, alerts), Log Analytics + KQL, Application Insights, and the operations tooling - what to monitor per service, how to build useful alerts without noise, and how to centralize logs across subscriptions.
Azure Monitor is the umbrella: metrics (near-real-time numeric), logs (in a Log Analytics workspace, queried with KQL), Application Insights (APM for apps), and alerts (metric/log/activity) firing to action groups. Install the Azure Monitor Agent (AMA) on VMs for guest metrics/logs (memory isn't collected by default). Diagnostic settings route each resource's logs to the workspace. Centralize across subscriptions with a shared workspace + Policy, alert on user-visible symptoms, and route by severity.
The observability stack
| Service | Role |
|---|---|
| Azure Monitor Metrics | Platform + custom numeric metrics, near-real-time, for dashboards and metric alerts. |
| Log Analytics workspace + KQL | Central log store; query with Kusto Query Language; the target for diagnostic settings. |
| Application Insights | APM: requests, dependencies, exceptions, traces, live metrics, availability tests for apps. |
| Alerts + Action Groups | Metric/log/activity/resource-health alerts → email, SMS, webhook, Logic App, ITSM, Functions. |
| Diagnostic settings | Route resource logs/metrics to Log Analytics / storage / Event Hub. Enable via Policy everywhere. |
| Activity Log / Resource Health / Service Health | Control-plane operations; per-resource health; Azure-side incidents & planned maintenance. |
| VM Insights / Container Insights | Curated VM and AKS monitoring (perf, maps, container logs) via the agent. |
| Azure Monitor Agent (AMA) | The agent for VM/Arc guest metrics and logs; configured by Data Collection Rules. |
| Workbooks / Dashboards | Interactive reports and shared operational views. |
What to monitor per area
- Compute (VMs/VMSS) - CPU, memory (agent), disk free %, disk IOPS/throughput vs. SKU, availability/heartbeat, VMSS instance health.
- Storage - disk IOPS/throughput vs. provisioned; storage transactions, throttling (429), availability, capacity, unusual access.
- Databases - DTU/vCore/CPU, storage %, connections, deadlocks, replication lag, backup status; Query Performance Insight.
- Load balancing - backend health, unhealthy host count, response time, 5xx, throughput, WAF blocks.
- Networking - VPN/ExpressRoute status, NSG flow-log anomalies, NAT SNAT port usage, DNS.
- Security - Defender alerts, Activity Log anomalies (RBAC/policy/public-exposure changes), Key Vault access failures.
Building useful alerts
- Alert on symptoms users feel (5xx, unhealthy backends, DB down, latency), not only causes.
- Use appropriate aggregation (avg/percentile) and an evaluation window + frequency to avoid flapping.
- Use log alerts (KQL) for things metrics can't express; metric alerts for fast numeric thresholds; activity-log alerts for governance events.
- Route by severity via action groups: Sev0/1 → page; Sev2/3 → ticket/Teams; info → dashboard.
- Consider dynamic thresholds (ML baselines) for noisy signals, and Resource Health alerts for platform issues.
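The windowing and severity-routing ideas above can be sketched in plain Python. This is an illustration of the logic only, not the Azure Monitor API: `AlertRule`, `should_fire`, `route`, and the channel names are hypothetical, and the rule shown (two consecutive 5-sample windows averaging above 85%) is an assumed example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlertRule:
    name: str
    threshold: float        # fire when a window's average exceeds this
    window: int             # samples per evaluation window
    violations_needed: int  # consecutive breaching windows before firing
    severity: int           # 0 (most urgent) to 4 (informational)


def should_fire(rule: AlertRule, samples: list[float]) -> bool:
    """Return True only if the most recent windows all breach the threshold."""
    needed = rule.window * rule.violations_needed
    if len(samples) < needed:
        return False  # not enough data yet; stay quiet rather than flap
    recent = samples[-needed:]
    windows = [recent[i:i + rule.window] for i in range(0, needed, rule.window)]
    return all(mean(w) > rule.threshold for w in windows)


def route(severity: int) -> str:
    """Map severity to a channel (hypothetical channel names)."""
    if severity <= 1:
        return "page"
    if severity <= 3:
        return "ticket"
    return "dashboard"


if __name__ == "__main__":
    rule = AlertRule("vm-cpu-high", threshold=85.0, window=5,
                     violations_needed=2, severity=1)
    spike = [50.0] * 9 + [100.0]   # one hot sample: averaged away
    sustained = [90.0] * 10        # two full windows above threshold
    print(should_fire(rule, spike), should_fire(rule, sustained))
    print(route(rule.severity))
```

Averaging over a window and requiring consecutive breaches is what keeps a single hot sample from paging anyone. Tune both values per signal.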
Example alerts to implement
| Alert | Condition | Severity |
|---|---|---|
| VM CPU high | CPU > 85% avg for 5-10 min | Warning → Critical |
| VM unavailable | Heartbeat missing / Resource Health unavailable | Critical |
| Memory pressure | Available memory (agent) < threshold | Warning |
| Disk usage / IOPS | Disk free < 15%; IOPS near provisioned limit | Warning |
| App Gateway unhealthy backend | Unhealthy host count > 0 | Critical |
| Azure SQL CPU / DTU-vCore / storage | CPU/DTU > 90%; storage > 85% | Warning → Critical |
| Failed backups | Backup job failed / success signal absent | Critical |
| VPN tunnel down / ExpressRoute issue | Connection/circuit status != connected | Critical |
| Storage unusual access / throttling | Spike / 429 throttling / anomalous access | Warning / Security |
| Function errors / throttles | Failure rate / throttle count over threshold | Warning → Critical |
| Key Vault access denied spikes | Forbidden/denied requests rising | Security review |
Centralizing logs across subscriptions
Use a shared Log Analytics workspace (in the management subscription) and enforce diagnostic settings across all resources via Azure Policy (deployIfNotExists) so every subscription sends logs there. Route the Activity Log too, and connect the workspace to Sentinel for security analytics. This gives cross-subscription visibility and satisfies retention/compliance without per-resource setup.
// KQL: top VMs by average CPU over the last hour
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize avg_Val = avg(Val) by Computer
| top 20 by avg_Val desc
Operations tooling
- Update Manager for OS patch orchestration/compliance; Change Tracking & Inventory for drift; Automation Account for runbooks.
- Azure Arc to bring on-prem/other-cloud servers, Kubernetes, and data services under the same monitoring, policy, and update tooling.
- Defender for Cloud recommendations and Advisor for security/reliability/cost/performance guidance; Cost Management exports for spend (section 14).
10. Containers, Kubernetes, and Cloud Native
AKS, Container Apps, and the serverless / event-driven building blocks (Functions, Event Grid, Service Bus, Event Hubs, Logic Apps) - when to use each, how networking and identity work for containers, and reference patterns.
AKS (managed Kubernetes) for orchestrated microservices when you need the K8s ecosystem; Container Apps for serverless containers without running a cluster (scale-to-zero, KEDA, Dapr); Functions for event-driven code; App Service for web apps. Around them: Azure Container Registry, Event Grid / Event Hubs / Service Bus, Logic Apps, and API Management. AKS workloads use Workload Identity (federated, no secrets) to reach Azure/Entra.
The cloud-native services
| Service | What it is | Use for |
|---|---|---|
| Azure Kubernetes Service (AKS) | Managed Kubernetes (control plane is free on the Free tier; you manage node pools, or use node autoprovisioning) | Orchestrated microservices, platform teams, portable K8s |
| Container Apps | Serverless containers on managed Kubernetes/KEDA - no cluster ops, scale to zero | Most containerized microservices without AKS overhead |
| Container Instances (ACI) | Single containers on demand | Simple/short-lived tasks; AKS virtual-node burst |
| App Service | Managed web app/API PaaS (code or container) | Web apps/APIs |
| Azure Functions | Event-driven serverless functions | Event handlers, glue, automation |
| Azure Container Registry (ACR) | Private registry with scanning, geo-replication, tasks | Store/scan/build images |
| API Management (APIM) | Full API gateway/management | Publishing, securing, throttling, versioning APIs |
| Event Grid / Event Hubs / Service Bus | Eventing / big-data streaming / enterprise messaging | Event-driven and decoupled architectures |
| Logic Apps | Low-code workflow/integration with connectors | Integration, orchestration, SaaS connectors |
AKS deep dive
- Node pools: a system node pool (runs cluster-critical pods) and one or more user node pools (your workloads). Use Spot user pools for fault-tolerant work; scale per pool; virtual nodes (ACI) for burst.
- Networking: Azure CNI (pods get VNet IPs - plan a big subnet; CNI Overlay reduces IP usage) vs kubenet (legacy). Private clusters keep the API server private.
- Ingress: the Application Gateway Ingress Controller (AGIC) or the managed App Routing add-on / Gateway API provisions an App Gateway or LB; a Service of type=LoadBalancer creates an Azure Load Balancer.
- Identity: Microsoft Entra Workload Identity federates a Kubernetes service account to a managed identity so pods get Entra tokens with no secrets. Use Entra + Azure RBAC for cluster access, plus Kubernetes RBAC.
- Security/ops: Defender for Containers, image scanning in ACR, Azure Policy for AKS (Gatekeeper), Container Insights for monitoring.
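The "plan a big subnet" advice for Azure CNI can be turned into a quick sizing check. The sketch below is a rule of thumb, not an official calculator: it assumes one VNet IP per node plus one per pod slot, one spare node for upgrade surge, and the 5 addresses Azure reserves in every subnet. With CNI Overlay it assumes only nodes consume VNet IPs. The function names are hypothetical.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet


def usable_ips(cidr: str) -> int:
    """Addresses left in the subnet after Azure's reserved ones."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AZURE_RESERVED_IPS


def ips_needed(nodes: int, max_pods_per_node: int,
               surge_nodes: int = 1, overlay: bool = False) -> int:
    """Estimate VNet IPs consumed by a node pool (simplified assumption)."""
    total_nodes = nodes + surge_nodes
    if overlay:
        # CNI Overlay: pods draw from a separate private CIDR, not the VNet
        return total_nodes
    # Flat Azure CNI: each node takes one IP plus one per pod slot
    return total_nodes * (max_pods_per_node + 1)


def subnet_fits(cidr: str, nodes: int, max_pods_per_node: int,
                surge_nodes: int = 1, overlay: bool = False) -> bool:
    needed = ips_needed(nodes, max_pods_per_node, surge_nodes, overlay)
    return needed <= usable_ips(cidr)


if __name__ == "__main__":
    for cidr in ("10.240.0.0/24", "10.240.0.0/23"):
        print(cidr, usable_ips(cidr), subnet_fits(cidr, 10, 30))
    print("overlay /24:", subnet_fits("10.240.0.0/24", 10, 30, overlay=True))
```

Under these assumptions, 10 nodes at 30 pods each need 341 addresses, which overflows a /24 (251 usable) but fits a /23. The same pool on CNI Overlay needs only 11.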
AKS vs Container Apps vs Functions vs VMs
Networking & identity for containers
- Networking - AKS/Container Apps deploy into a VNet subnet; use private endpoints for dependencies (ACR, Key Vault, databases), Private DNS, and NSGs. Internal ingress + Private Link for private platforms.
- Identity - Entra Workload Identity (AKS) / managed identity (Container Apps, Functions, App Service) for keyless access to Key Vault, storage, databases.
- Supply chain - scan images in ACR (Defender), sign/verify, restrict pull to the workload identity; Azure Policy for AKS to enforce baselines.
- Monitoring - Container Insights (AKS), built-in metrics/logs for Container Apps/Functions; App Insights for app-level tracing.
Messaging & events
| Service | Model | Use for |
|---|---|---|
| Event Grid | Discrete event routing (pub/sub, reactive) | React to Azure/resource events (e.g. blob created) → Functions/Container Apps |
| Event Hubs | High-throughput streaming (Kafka-compatible) | Telemetry/log/IoT ingestion pipelines |
| Service Bus | Enterprise messaging (queues/topics, ordering, transactions, dead-letter) | Reliable decoupling, work queues, ordered processing |
| Logic Apps | Low-code workflows with connectors | Integration/orchestration across SaaS + Azure |
Architecture patterns
- Microservices on AKS - deployments behind AGIC/Gateway ingress, HPA/KEDA autoscaling, optional service mesh, Workload Identity, ACR + a deployment pipeline.
- Microservices on Container Apps - each service a container, internal ingress + Dapr for service-to-service, KEDA scaling, managed identity - minimal ops.
- Serverless function on a Blob event - as diagrammed; image/ETL/validation.
- Event-driven architecture - Event Grid + Functions/Container Apps + Service Bus + Event Hubs for decoupled, resilient pipelines.
- Private container platform - private AKS/Container Apps in a spoke VNet, internal ingress, private endpoints to ACR/Key Vault/DB, no public endpoints.
Troubleshooting
Pods stuck in Pending, ImagePullBackOff, or CrashLoopBackOff:
- Pending: no schedulable capacity, pod IP exhaustion (Azure CNI subnet too small), or resource requests too big.
- ImagePullBackOff: ACR pull permission missing (grant AcrPull to the kubelet/managed identity), private ACR unreachable (needs private endpoint/DNS), or wrong image path.
- CrashLoopBackOff: app config/secret missing or bad probes.
- Checks: kubectl describe pod, kubectl logs --previous, node capacity, subnet free IPs.
- Fix: scale/enable autoscale or use CNI Overlay; grant AcrPull; fix probes/config/Workload Identity.
Container Apps and Functions:
- Container Apps, a new revision serving traffic but failing: check the container listens on the target port, ingress config, scale rules (min replicas), and the managed identity's RBAC; roll back to a previous revision or split traffic.
- Function timeout: raise the timeout/plan (Consumption caps duration; use Premium/Flex for long work), make the function idempotent, offload long work.
- Trigger not firing: check the binding/connection (managed identity or connection string), the event source, and the function's logs/metrics; verify the Event Grid subscription/filter.
11. Analytics, Data, and Integration
The Azure data stack - Microsoft Fabric and Synapse, Data Factory, Data Lake Storage Gen2, Databricks, Event Hubs and Stream Analytics, Data Explorer, Purview governance, and Power BI - with the common lake/warehouse/streaming patterns.
Land data in Data Lake Storage Gen2 (a storage account with hierarchical namespace). Analyze with Microsoft Fabric (the unified SaaS analytics platform: OneLake, Lakehouse, Warehouse, Data Factory, Power BI) or Synapse/Databricks. Ingest streams with Event Hubs + Stream Analytics, orchestrate ETL with Data Factory, query time-series/logs with Data Explorer, govern with Microsoft Purview, and visualize with Power BI.
The services
| Service | Role |
|---|---|
| Microsoft Fabric | Unified SaaS analytics: OneLake (one logical lake), Lakehouse, Data Warehouse, Data Factory, Real-Time Intelligence, and Power BI - one capacity, one governance surface. |
| Azure Synapse Analytics | Integrated analytics (dedicated/serverless SQL pools, Spark, pipelines). Much of it is converging into Fabric - check current guidance. |
| Azure Databricks | First-party Apache Spark + Delta Lakehouse for large-scale data engineering/ML. |
| Data Factory | Cloud ETL/ELT orchestration with 100+ connectors (also in Synapse/Fabric). |
| Data Lake Storage Gen2 | Blob + hierarchical namespace (directories, POSIX ACLs) - the lake foundation. |
| Event Hubs | High-throughput event streaming (Kafka-compatible) for ingestion. |
| Stream Analytics | Serverless real-time stream processing (SQL over streams). |
| Azure Data Explorer (ADX / Kusto) | Fast analytics over logs/time-series/telemetry (KQL). |
| Microsoft Purview | Data governance: catalog, classification, lineage, and access policies across the estate. |
| Power BI | BI/reporting and semantic models (native in Fabric). |
The Fabric / OneLake model (mental model)
- OneLake is one logical data lake for the whole tenant (built on ADLS Gen2, open Delta/Parquet format) - "shortcuts" reference data in place instead of copying.
- Storage and compute are separate; you buy capacity (compute) and workloads (Lakehouse, Warehouse, Power BI) share it.
- Lakehouse (Spark/notebooks + SQL endpoint) vs Warehouse (T-SQL, transactional) - both over the same Delta data in OneLake.
- Direct Lake lets Power BI read Delta directly for speed without import/DirectQuery trade-offs.
- It is analytical (OLAP), not transactional - keep OLTP in Azure SQL/Cosmos and feed the lake.
Common data patterns
| Pattern | Built from |
|---|---|
| Data lake | ADLS Gen2 / OneLake (bronze/silver/gold) + Purview governance |
| Data warehouse | Synapse dedicated SQL pool or Fabric Warehouse + Power BI |
| Lakehouse | Fabric Lakehouse or Databricks (Delta) over ADLS/OneLake |
| ETL / ELT | Data Factory pipelines (+ Spark/Databricks for transforms) |
| Streaming ingestion | Event Hubs → Stream Analytics / Fabric Real-Time → lake/warehouse |
| Event-driven integration | Event Grid + Functions/Logic Apps + Service Bus |
| Reporting / BI | Power BI over Warehouse/Lakehouse (Direct Lake) |
| AI-ready data | Curated lake + Azure ML / Azure OpenAI + vector search (section 12) |
| Cross-org data sharing | Azure Data Share / Fabric sharing with governance |
Governance with Purview
Microsoft Purview provides a unified catalog, automated classification (PII/sensitive), lineage, and data access policies across Azure data sources, on-prem, and (increasingly) multicloud - so a growing lake stays governed instead of a "data swamp." Combine with Private Endpoints around data services, storage/lake ACLs, and column/row-level security in the warehouse for sensitive data.
Reference architecture: lakehouse + BI
12. AI, ML, and Generative AI on Azure
Azure AI Foundry, Azure OpenAI, Azure AI Search, the pretrained AI services, and Azure Machine Learning - plus vector search across Cosmos DB / PostgreSQL, the enterprise RAG patterns, and the governance guardrails that separate a demo from something you can run on real data.
Azure AI Foundry is the platform for building/deploying AI (models, agents, prompt flow, evaluation); Azure OpenAI serves GPT/embedding models with enterprise controls; Azure AI Search provides the retrieval layer (keyword + vector + semantic) for RAG. Store vectors in AI Search, Cosmos DB, or PostgreSQL (pgvector). Pretrained Azure AI services (Document Intelligence, Vision, Language, Speech, Translator) cover common tasks; Azure ML for custom models/MLOps. The hard part is not the model - it is governing what it can reach; use private endpoints, managed identity, and Content Safety.
Azure AI Foundry & Azure OpenAI
| Capability | What it does |
|---|---|
| Azure AI Foundry | The unified platform/portal + SDK to build, evaluate, deploy, and monitor generative AI apps and agents; model catalog, prompt flow, tracing, and evaluation. |
| Azure OpenAI Service | GPT (chat/completions), embeddings, and other OpenAI models with Azure enterprise controls (private networking, RBAC, content filtering, data-not-used-to-train). |
| Model catalog | OpenAI + open + partner models to deploy (managed or serverless endpoints). |
| Prompt flow | Author, test, and evaluate LLM app flows (prompts, tools, retrieval) with versioning. |
| Content Safety | Detect/block harmful content, jailbreaks/prompt-injection, and groundedness issues. |
| Fine-tuning | Customize supported models where available (verify per model/region). |
Azure AI Search (the retrieval layer)
Azure AI Search is the managed retrieval engine for RAG: it indexes your content and supports keyword, vector, and hybrid search plus semantic ranking. Integrated vectorization and indexers can chunk, embed, and index content from Blob/ADLS/SQL/Cosmos automatically. It is the most common grounding store for Azure OpenAI RAG.
Applied AI services & Azure ML
Extract text, tables, key-value pairs, and structure from documents (invoices, forms, contracts).
Pretrained APIs for image analysis, entity/sentiment/PII, speech-to-text/text-to-speech, and translation - no training.
Conversational bots and assistant patterns grounded on your data.
Full MLOps: training, pipelines, model registry, managed online/batch endpoints, and monitoring for custom models.
Vector search options
| Option | Use when |
|---|---|
| Azure AI Search (vector/hybrid) | Purpose-built retrieval with hybrid + semantic ranking; the default RAG store. |
| Cosmos DB vector search | Vectors alongside operational NoSQL data at global scale. |
| Azure DB for PostgreSQL (pgvector) | Vectors in an existing Postgres, alongside relational data. |
| Azure SQL vector support | Vectors alongside relational SQL data (verify current availability). |
RAG architecture on Azure
Enterprise patterns
| Pattern | How | Watch out for |
|---|---|---|
| Chat with documents | RAG over Blob/ADLS + AI Search + Azure OpenAI | Chunking quality; stale index; citations |
| Chat with database | Retrieve from curated views; grounded answers | Never raw prod OLTP; use a serving layer |
| Natural language to SQL | Azure OpenAI proposes SQL over a governed schema | Validate/parametrize; read-only; no dynamic SQL |
| RAG with private data | Private endpoints for OpenAI + Search + storage; security trimming | Entitlement-aware retrieval |
| Document processing | Document Intelligence → extract → SQL/warehouse | Human review of low-confidence extractions |
| Call center AI | Speech + Language + Azure OpenAI + bot | Grounding; human escalation |
| MLOps pipeline | Azure ML pipelines + registry + managed endpoints + monitoring | Reproducibility; drift monitoring |
| Private GenAI | Private networking + managed identity + Content Safety + audit | Security trimming, prompt-injection defense |
Governance and security for GenAI
- Serving layer, always - agents/LLMs call a governed API (App Service/Function) that enforces authN/authZ, rate limits, input/output validation, and logging. They do not touch data stores directly.
- Security-trimmed retrieval - filter retrieved context to what the requesting user may see (document/row/column) so RAG cannot leak across users.
- Private & perimetered - use Private Endpoints for Azure OpenAI, AI Search, storage, and databases; keep model/data traffic off the internet.
- Credential hygiene - secrets in Key Vault, access via managed identity; the model never sees raw credentials.
- Auditability - log prompts, retrieved document IDs, and responses (per privacy rules); use Content Safety and groundedness checks.
- Responsible AI - evaluate quality/safety in prompt flow, monitor deployed models, and require human review for consequential outputs.
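Security-trimmed retrieval from the list above can be illustrated with a small filter that runs before any context reaches the model. This is a sketch under assumptions: `Chunk`, `allowed_groups`, and `trim` are hypothetical names, and a real retriever would typically push the same group filter into the search query rather than post-filter in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # group IDs permitted to read this chunk
    score: float               # retrieval relevance score


def trim(chunks, user_groups, top_k=3):
    """Keep only chunks the user may see, then take the best-scoring ones."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


if __name__ == "__main__":
    chunks = [
        Chunk("hr-1", "salary bands", frozenset({"hr"}), 0.9),
        Chunk("fin-1", "Q3 forecast", frozenset({"finance"}), 0.7),
        Chunk("fin-2", "board pack", frozenset({"finance", "exec"}), 0.8),
    ]
    context = trim(chunks, frozenset({"finance"}))
    print([c.doc_id for c in context])
```

The highest-scoring chunk here belongs to HR, so a finance user never sees it, however relevant it is. Trimming has to happen before ranking results are handed to the prompt.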
Warnings (read before connecting AI to enterprise data)
- Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query surface for a probabilistic agent.
- Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
- Protect credentials. No DB passwords, keys, or connection strings in prompts, code, or agent memory. Use Key Vault + managed identity.
- Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
- Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
- Validate output before business use. Treat model output as a draft/suggestion until a human or deterministic check confirms it.
- Monitor prompt injection and data-leakage risks - untrusted content in the context can hijack instructions; use Content Safety prompt-shields and isolate/sanitize retrieved and user content.
- Check Azure OpenAI regional and model availability, quota, and pricing before you design - these vary by region and change frequently.
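The "validated, parameterized, read-only" rule for NL-to-SQL can be sketched as a guard that sits between the model and the database. This is deliberately conservative and is not a SQL parser: the keyword list, the `sales_summary` view, and the function names are assumptions for illustration, and SQLite stands in for the curated serving layer.

```python
import re
import sqlite3

ALLOWED_TABLES = {"sales_summary"}  # hypothetical curated, read-only view
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|exec|execute|grant)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE)


def validate(sql: str) -> str:
    """Return the statement if it is one SELECT over allowed tables, else raise."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt or "--" in stmt or "/*" in stmt:
        raise ValueError("multiple statements or comments are not allowed")
    if not re.match(r"(?i)select\b", stmt):
        raise ValueError("only SELECT is allowed")
    if FORBIDDEN.search(stmt):
        raise ValueError("write/DDL keyword found")
    tables = {t.lower() for t in TABLE_REF.findall(stmt)}
    if not tables or not tables <= ALLOWED_TABLES:
        raise ValueError(f"table not in curated schema: {sorted(tables)}")
    return stmt


def run_readonly(conn, sql, params=()):
    """Validate first, then execute with bound parameters (never string-built)."""
    return conn.execute(validate(sql), params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
    conn.execute("INSERT INTO sales_summary VALUES ('emea', 10.0), ('apac', 7.5)")
    print(run_readonly(conn, "SELECT total FROM sales_summary WHERE region = ?", ("emea",)))
```

A guard like this is one layer only. The database login behind it should itself be read-only and scoped to the curated schema, so a validation miss cannot become a write.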
13. Migration and Disaster Recovery
Getting workloads into Azure (VMs, databases, data) and keeping them recoverable - Azure Migrate, Site Recovery, Backup, Database Migration Service - plus DR patterns by tier and how RTO/RPO drive architecture and cost.
Assess and migrate servers with Azure Migrate (VMware/Hyper-V/physical → VMs), databases with Database Migration Service (+ tooling like DMA/SSMA), and bulk data with Data Box / AzCopy. For DR, choose per tier: backup & restore (cheapest, slow), pilot light, warm standby, or hot/active-active. Azure Site Recovery replicates running workloads for region failover; Front Door / Traffic Manager handle traffic failover. Your RTO/RPO targets pick the pattern - and DR you never test is not DR.
Migration tooling
| Move | Tooling | Notes |
|---|---|---|
| Assess & plan | Azure Migrate | Discovery, dependency mapping, right-sizing, and cost estimates before you move |
| Servers / VMs | Azure Migrate: Server Migration (agentless/agent) | VMware, Hyper-V, physical, and other-cloud VMs → Azure VMs |
| Databases (low downtime) | Database Migration Service (DMS) + DMA/SSMA | SQL/Postgres/MySQL to Azure PaaS; SSMA for heterogeneous (Oracle→SQL) conversions |
| Bulk data | Data Box / AzCopy / Storage Mover | Physical appliance for large sets; online for the rest |
| Files | Azure File Sync / Storage Mover | Hybrid file access + migration to Azure Files |
| DR replication | Azure Site Recovery (ASR) | Replicate running VMs to another region for failover |
Database migration paths
| Source → target | Method | Downtime |
|---|---|---|
| SQL Server → Managed Instance / SQL DB | DMS online (log replay), or Managed Instance link (near-real-time) | Near-zero |
| SQL Server → SQL on VM | Backup/restore, Always On, or ASR | Window-dependent |
| PostgreSQL/MySQL → Flexible Server | DMS online / native replication | Near-zero |
| Oracle → SQL / Postgres | SSMA / DMS (heterogeneous conversion) | Low + conversion effort |
| Oracle → Oracle on Azure VM / DB@Azure | Data Pump / RMAN / Data Guard | Depends on method |
DR patterns
| Pattern | Standby state | RTO | RPO | Cost |
|---|---|---|---|---|
| Backup & restore | Backups in another region; nothing running | Hours+ | Since last backup | Lowest |
| Pilot light | Core data replicated (geo-replica / ASR); app off | Tens of min | Small | Low |
| Warm standby | Scaled-down full stack in the DR region | Minutes | Small | Medium |
| Hot / active-active | Both regions serving (Front Door + multi-region data) | Near-zero | Near-zero | Highest + complexity |
Building blocks: Azure SQL failover groups / geo-replication or Cosmos multi-region (data); GZRS/GRS storage and object replication (objects); Azure Site Recovery for VM replication; Availability Zones for in-region HA and region pairs for DR; and Front Door / Traffic Manager for traffic failover.
RTO and RPO
- RTO - how long you can be down → drives standby readiness and automation.
- RPO - how much data you can lose → drives replication mode (sync HA vs async geo-replication vs backup interval).
- Zero data loss needs synchronous replication (zone-redundant HA, Business Critical, or a sync AG) over a low-latency link; GRS/geo-replication are async, so an RPO applies. Verify inter-replica latency and the write-latency trade-off before committing.
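How RTO/RPO targets pick a DR pattern can be shown as a small lookup over the patterns table. The minute values below are illustrative assumptions standing in for the table's qualitative entries ("Hours+", "Tens of min", "Minutes", "Near-zero"); replace them with figures measured in your own DR tests. `Pattern` and `cheapest_pattern` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    rto_minutes: float  # assumed achievable recovery time
    rpo_minutes: float  # assumed achievable data-loss window
    cost_rank: int      # 1 = lowest cost


# Illustrative numbers only; not Microsoft-published figures.
PATTERNS = [
    Pattern("backup & restore", 8 * 60, 24 * 60, 1),
    Pattern("pilot light", 45, 15, 2),
    Pattern("warm standby", 15, 5, 3),
    Pattern("hot / active-active", 1, 1, 4),
]


def cheapest_pattern(rto_target: float, rpo_target: float):
    """Return the lowest-cost pattern meeting both targets, or None."""
    for p in sorted(PATTERNS, key=lambda p: p.cost_rank):
        if p.rto_minutes <= rto_target and p.rpo_minutes <= rpo_target:
            return p.name
    return None


if __name__ == "__main__":
    print(cheapest_pattern(24 * 60, 24 * 60))  # relaxed targets
    print(cheapest_pattern(30, 10))            # tighter targets
```

Walking the list from cheapest to most expensive makes the cost consequence of each tightened target explicit. A `None` result means the targets need synchronous, multi-region design work rather than a standby tier.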
Architecture examples
- On-prem VM → Azure: Azure Migrate server migration.
- SQL Server → Azure: DMS / Managed Instance link, cut over at low lag.
- Oracle → Azure: self-managed on VM (Data Guard) or Oracle DB@Azure; SSMA if converting to SQL/Postgres.
- Cross-region DR (app): Front Door + warm tier + storage replication.
- Cross-region DR (database): failover groups / geo-replication / Cosmos multi-region.
- Backup-based DR: Azure Backup with cross-region restore; rebuild on demand.
- ASR pattern: replicate IaaS VMs to the paired region, fail over with recovery plans.
DR testing
- Geo-secondary / ASR replication within RPO; failover rehearsed.
- CMK keys present and usable in the DR region.
- App tier can start and connect in DR; config points to DR endpoints.
- Front Door / Traffic Manager failover tested and time-measured.
- Object data (GZRS/GRS or replicated) within RPO.
- Capacity available in DR; runbook / recovery plan current.
14. Cost Management and Governance
How Azure charges, the tools to track and cap spend (Cost Management, budgets, exports), the discount levers (Reservations, Savings Plan, Hybrid Benefit, Spot), and the governance model - ending in a monthly cost-review checklist.
Azure bills mainly by compute (VM size-hours), storage GB + transactions + redundancy, database vCore/DTU, and data egress. Track with Microsoft Cost Management + budgets + cost exports to a storage account for your own analysis; alert with budgets and block with quotas. Big levers: Reserved Instances / Savings Plan for Compute for baseline, Azure Hybrid Benefit for Windows/SQL licenses, Spot for interruptible work, right-sizing, storage lifecycle, and reducing Log Analytics ingestion. Governance = management groups + Azure Policy + budgets + tags.
Pricing basics
| Dimension | Charged on | Notes |
|---|---|---|
| VMs | Size per hour (+ OS licensing) | RIs / Savings Plan; Hybrid Benefit; Spot; Dev/Test pricing |
| Managed disks | Provisioned GB + tier (+ IOPS/throughput for v2/Ultra) | Snapshots accumulate; choose SKU deliberately |
| Storage accounts | GB + tier + transactions + redundancy (LRS<ZRS<GRS<GZRS) | Lifecycle to cool/archive; watch egress/retrieval |
| Azure SQL / databases | vCore/DTU + storage + backups (LTR) | Serverless auto-pause; Hybrid Benefit; right-size tier |
| Networking | Egress + inter-region; Application Gateway & Azure Firewall have hourly + data costs | Firewall/App Gateway are often surprising line items - right-size and consolidate |
| Log Analytics | Ingestion (GB) + retention | Noisy logs get expensive - filter and set retention/basic-logs tiers |
Cost tracking tools
| Tool | Does |
|---|---|
| Cost Management + cost analysis | Spend by subscription, RG, resource, service, tag, time; forecasts. |
| Budgets + cost alerts | Track spend against a target at MG/subscription/RG scope; alert at thresholds (and trigger automation). Budgets notify - they don't block. |
| Cost exports | Scheduled detailed usage to a storage account for your own BI/analysis. |
| Quotas | Per-subscription limits - the "block" control (e.g. cap vCPU by family/region). |
| Advisor | Cost (right-sizing, idle, reservation) + reliability/security/performance recommendations. |
| Pricing / TCO calculators | Estimate before you build. |
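Budget alerting as described in the table (notify at thresholds, never block) reduces to simple arithmetic. The sketch below assumes thresholds of 50%, 80%, and 100% and a naive straight-line forecast; both are illustrative choices, and the function names are hypothetical rather than part of Cost Management.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the budget fractions that current spend has reached."""
    return [t for t in thresholds if spend >= t * budget]


def linear_forecast(spend: float, day: int, days_in_month: int) -> float:
    """Project month-end spend assuming the daily run rate stays constant."""
    return spend / day * days_in_month


if __name__ == "__main__":
    budget = 1000.0
    print(crossed_thresholds(850.0, budget))
    forecast = linear_forecast(600.0, day=15, days_in_month=30)
    print(forecast, forecast > budget)
```

Alerting on the forecast as well as on actual spend gives earlier warning: in the example, spend is at 60% of budget mid-month but is on course to exceed it.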
Discounts
- Reserved Instances (RIs) - a 1- or 3-year commitment to specific VM families/regions (or other services) for a big discount on steady state.
- Savings Plan for Compute - a 1- or 3-year hourly-spend commitment that flexes across VM families/regions (more flexible than RIs, sometimes a smaller discount).
- Azure Hybrid Benefit - apply existing Windows Server / SQL Server licenses to Azure - often the single biggest saving for a Microsoft shop.
- Spot VMs - 60-90% off for interruptible workloads.
- Dev/Test pricing - reduced rates for non-production under Dev/Test subscriptions.
Governance model
Governance is enforced through the same primitives as security: the management-group hierarchy and subscriptions (isolation + attribution), Azure Policy (restrict regions/SKUs, require tags, deny public exposure), budgets + quotas, resource locks, and tags - all deployed as a landing zone in code. This keeps spend controlled and attributable by design.
Cost optimization examples
| Action | Typical saving | Effort |
|---|---|---|
| Stop / auto-shutdown non-prod VMs off-hours | High (up to ~65-70% of that compute) | Low |
| Right-size VMs (Advisor) | High | Low |
| Reserved Instances / Savings Plan for baseline | High | Medium |
| Azure Hybrid Benefit (Windows/SQL) | Very high (Microsoft licenses) | Low |
| Spot VMs for interruptible / batch | Very high (60-90%) | Medium |
| Choose correct disk type / delete unused disks | Medium | Low |
| Storage lifecycle to cool/archive | Medium-High | Low |
| Reduce Log Analytics ingestion (filters, tiers, retention) | Medium-High | Low |
| Reduce cross-region traffic / consolidate firewalls | Medium | Medium |
| Delete old snapshots & unused public IPs | Medium | Low |
| Database right-sizing / serverless auto-pause | Medium-High | Low |
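The ~65-70% figure for off-hours shutdown in the first row follows from the share of the week a VM actually runs. A short worked calculation, assuming compute is billed only while the VM is running and that the VM is deallocated when stopped (disks continue to bill, so the saving applies to the compute line only):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def shutdown_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by running only these hours."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    for hours in (12, 10):
        saving = shutdown_saving(hours, days_per_week=5)
        print(f"{hours}h x 5 days: {saving:.1%} of compute cost avoided")
```

Running 12 hours on weekdays uses 60 of 168 hours, so about 64% of the compute cost is avoided. Tightening to 10 hours raises that to about 70%.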
Monthly Azure cost review checklist
- Review Cost Management month-over-month by subscription, service, and tag; investigate spikes.
- Check each budget: which subscriptions/RGs/tags are over or trending over target.
- Act on Advisor right-sizing and idle-resource recommendations.
- Confirm non-prod auto-shutdown ran (nothing running 24x7 by accident).
- Find and delete unused managed disks, orphaned snapshots, and idle VMs.
- Release unassociated public IPs (Standard IPs bill when idle).
- Review RI / Savings Plan coverage and utilization; buy/adjust; confirm Hybrid Benefit applied.
- Log Analytics: top ingestion sources; add filters / Basic Logs; check retention.
- Storage: lifecycle rules moving cold data to cool/archive; review redundancy choices.
- Right-size Azure SQL / databases vs. utilization; serverless auto-pause where suitable.
- Review App Gateway / Azure Firewall sizing and consolidation.
- Review egress / inter-region charges; co-locate chatty services; use private access.
- Confirm every resource is tagged (cost-center/env/owner) for attribution.
- Validate quotas still reflect intent; check for anomalous new spend by service.
15. Enterprise Architecture Patterns
Reference blueprints for real Azure deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.
Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: Front Door/App Gateway + WAF → private compute (App Service / VMSS / AKS) → managed database on a Private Endpoint → Private Endpoints for PaaS → centralized Log Analytics → cross-region DR, all inside a governed landing zone (management groups + Policy + hub-and-spoke).
Foundational three-tier (reference backbone)
| Business case | Standard internal/external web or enterprise app needing HA and controlled exposure. |
|---|---|
| Services | Hub-and-spoke VNet, Front Door + WAF (or App Gateway), App Service/VMSS, Azure SQL (Private Endpoint), NAT/Firewall, Key Vault, Azure Monitor/Log Analytics. |
| Traffic flow | User → Front Door/WAF → app (private) → SQL (Private Endpoint); egress via NAT/firewall; PaaS via Private Endpoints. |
| Security | No public IPs on workloads; NSG/ASG; DB private; CMK; secrets in Key Vault; Azure Policy + Private Endpoints; Defender on. |
| HA | Zone-spread app + zone-redundant SQL; Front Door/App Gateway health probes. |
| DR | Second-region backends behind Front Door + SQL failover group. |
| Monitoring | Backend health, app + DB metrics; alerts → action groups; central Log Analytics. |
| Cost | App Service/VMSS right-size + RIs/Savings Plan; Hybrid Benefit; storage lifecycle. |
| Risks / mistakes | Probe NSG rule missing; DB public endpoint; no zone spread; secrets in app settings; Private DNS not linked. |
Pattern library
Simple web application (Small)
| Case | Low-complexity site/app, cost-sensitive. |
|---|---|
| Services | App Service + Azure SQL (or Cosmos) + Blob for assets + Front Door + WAF. |
| HA/DR/cost | App Service zone-redundant; SQL HA; scale rules. Risk: public DB endpoint; backups never restore-tested. |
Highly available application HA
| Case | Must survive zone (and ideally region) failure. |
|---|---|
| Services | Zone-spread VMSS/App Service, zone-redundant Azure SQL, Front Door, zone-redundant LB/App Gateway. |
| DR | Second-region backends + SQL failover group. Risk: state on a single zonal disk; untested failover. |
Private enterprise application Regulated
| Case | Internal-only, reachable from on-prem, no public footprint. |
|---|---|
| Services | Private subnets, internal App Gateway/LB, ExpressRoute/VPN via hub, Private Endpoints, Bastion, no public IPs. |
| Risk | CIDR overlap; Private DNS not linked; transitive-peering assumption. |
Hub-and-spoke / centralized networking & security & logging Platform
| Case | Many subscriptions with centrally-governed network, security, and logging. |
|---|---|
| Services | Connectivity subscription (hub VNet, Azure Firewall, gateways, DNS, Bastion), management subscription (Log Analytics, Automation, Backup), Defender + Sentinel, MG-level Policy. |
| Risk | Firewall per spoke (cost); shadow VNets; missing UDR return routes. |
Multi-subscription landing zone Governance
| Case | Governed foundation before workloads land. |
|---|---|
| Services | Management-group hierarchy, platform + landing-zone subscriptions, baseline RBAC (groups) + PIM, preventive Azure Policy, hub network, central logging + Defender, budgets/quotas, locks - all IaC (CAF ALZ). |
| Risk | Skipping it and retrofitting governance later. |
SQL Server enterprise workload SQL
| Case | Lift a SQL Server estate to managed. |
|---|---|
| Services | SQL Managed Instance (Business Critical + zone redundancy), delegated subnet/Private Endpoint, failover group, Hybrid Benefit, Defender for SQL. |
| Risk | Unsupported instance feature; public endpoint; untested failover. |
Oracle workload on Azure Oracle
| Case | Run Oracle on Azure. |
|---|---|
| Services | Self-managed Oracle on E/M VMs + Premium SSD v2/Ultra or NetApp Files, Data Guard to a paired region, backups to Blob, constrained-vCPU/Dedicated Host for licensing; or Oracle Database@Azure for managed Oracle. |
| Risk | Licensing/counting on Azure; storage IOPS undersized; verify DB@Azure availability. |
Azure SQL application PaaS DB
| Case | New relational app backend. |
|---|---|
| Services | Azure SQL Database (vCore, zone-redundant), Private Endpoint, Entra auth + managed identity, PITR + LTR, failover group for DR, Query Performance Insight. |
| Risk | Public endpoint; HA not enabled; instance features expected (use MI instead). |
Data warehouse / data lake Data
| Case | Enterprise analytics on curated + raw data. |
|---|---|
| Services | ADLS Gen2 / OneLake (bronze/silver/gold) + Fabric/Synapse/Databricks + Data Factory + Event Hubs + Purview + Power BI; Private Endpoints. |
| Cost/risk | Capacity/query controls; column-level security. Risk: ungoverned "data swamp"; runaway compute. |
Kubernetes platform Cloud native
| Case | Container platform for many microservices with CI/CD. |
|---|---|
| Services | Private AKS in a spoke VNet, AGIC/Gateway ingress, Workload Identity, ACR (scanned) + Defender for Containers, Azure Policy for AKS, pipeline (GitHub Actions/Azure DevOps). |
| Risk | Pod IP exhaustion (CNI); over-privileged Workload Identity; public API server. |
Serverless application Serverless
| Case | Event-driven / stateless services with minimal ops. |
|---|---|
| Services | Container Apps / Functions + Front Door + Event Grid/Service Bus + Cosmos/SQL + Key Vault; VNet integration to private data. |
| Cost/risk | Scale-to-zero. Risk: cold starts for spiky critical paths (min replicas/Premium plan). |
Event-driven architecture Events
| Case | Decoupled, resilient pipelines. |
|---|---|
| Services | Event Grid + Functions/Container Apps + Service Bus + Event Hubs; dead-letter queues. |
| Risk | Poison messages without DLQ; non-idempotent handlers; backlog from slow consumers. |
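The two handler risks in this card (poison messages and non-idempotent handlers) can be illustrated without any broker. This is a minimal in-memory sketch, not Service Bus or Event Grid code: the queue, the `MAX_DELIVERIES` limit, and the `apply_payment` handler are all hypothetical stand-ins.

```python
from collections import deque

MAX_DELIVERIES = 3  # assumed limit, similar in spirit to a broker's max delivery count

def drain(queue, handler, processed_ids, dead_letters):
    """Process messages until the queue is empty, with dedup and dead-lettering."""
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: already handled, skip safely
        try:
            handler(msg)
            processed_ids.add(msg["id"])
        except Exception as exc:
            msg["deliveries"] = msg.get("deliveries", 0) + 1
            if msg["deliveries"] >= MAX_DELIVERIES:
                dead_letters.append((msg, str(exc)))  # park the poison message
            else:
                queue.append(msg)  # retry later instead of blocking the queue

totals = {}

def apply_payment(msg):
    if msg["amount"] is None:
        raise ValueError("malformed payload")
    totals[msg["account"]] = totals.get(msg["account"], 0) + msg["amount"]

queue = deque([
    {"id": "m1", "account": "a", "amount": 10},
    {"id": "m1", "account": "a", "amount": 10},    # redelivered duplicate
    {"id": "m2", "account": "a", "amount": None},  # poison message
    {"id": "m3", "account": "b", "amount": 5},
])
seen, dlq = set(), []
drain(queue, apply_payment, seen, dlq)
print(totals, [m["id"] for m, _ in dlq])
```

The duplicate is applied once because the handler is keyed on the message ID, and the malformed message ends up in the dead-letter list after three attempts rather than stalling the messages behind it.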
Hybrid cloud Hybrid
| Case | Workloads split across on-prem and Azure. |
|---|---|
| Services | ExpressRoute (primary) + VPN (backup) via hub, hub-and-spoke / Virtual WAN, hybrid DNS (Private Resolver), Azure Arc for on-prem management. |
| Risk | CIDR overlap; transitive-peering assumption; single link; asymmetric routing. |
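CIDR overlap is cheap to catch before any peering or gateway exists. The sketch below uses Python's standard `ipaddress` module; the address plan is hypothetical (only 10.10.0.0/20 is taken from the VNet example in section 17).

```python
import ipaddress
from itertools import combinations

def find_overlaps(ranges):
    """Return pairs of named CIDR ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]

# Hypothetical address plan covering on-prem and Azure ranges.
plan = {
    "on-prem-dc": "10.0.0.0/16",
    "hub": "10.1.0.0/22",
    "spoke-app": "10.10.0.0/20",
    "spoke-data": "10.10.8.0/21",
}
print(find_overlaps(plan))
```

Here spoke-data sits inside spoke-app, so the pair is reported. Running a check like this over the full plan, on-prem ranges included, is one way to surface the conflict at design time.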
Multi-region DR DR
| Case | Business-critical stack needing regional resilience. |
|---|---|
| Services | Front Door with multi-region backends, SQL failover group / Cosmos multi-region, GZRS storage, Azure Site Recovery, on-demand capacity reservations. |
| Risk | Untested DR; CMK key missing in DR region; capacity unavailable at failover. |
Secure landing zone Security
| Case | Preventive-guardrail foundation. |
|---|---|
| Services | Azure Policy (deny public IP/blob, location/SKU restrictions, require diagnostics), PIM + Conditional Access, hub firewall, Private Link strategy, central Defender + Sentinel + logging, Key Vault, budgets/quotas - as code. |
| Risk | Guardrails off; over-broad break-glass; standing Owner. |
GenAI with private enterprise data AI
| Case | RAG/assistant over internal data, governed. |
|---|---|
| Services | Blob/ADLS + Azure AI Search + Azure OpenAI behind an App Service/Function serving API + Key Vault + Private Endpoints + Content Safety + Log Analytics. |
| Flow / risk | Query → serving layer (authz + guardrails) → security-trimmed retrieval → grounded, audited answer. Risk: ungoverned data access, dynamic SQL, credential leakage (section 12 warnings). |
- Databases/services on public endpoints "to get it working"; Private DNS zone not linked.
- No zone/region spread - a zone event takes the whole "HA" tier.
- Health-probe NSG rule (and App Gateway management ports) forgotten - unhealthy backends on day one.
- DR designed but never tested; CMK keys missing in the DR region.
- Secrets in app settings/code instead of Key Vault; standing Owner instead of PIM.
- No centralized logging/Defender until an incident needs it.
- CIDR overlap / transitive-peering assumption discovered during hybrid setup.
- Landing zone / Azure Policy skipped and retrofitted painfully later.
16. Troubleshooting Guides
A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with portal path, CLI, and PowerShell where useful), fixes, and prevention. Deeper versions of some live in their service sections; this is the consolidated index.
Compute & access
Causes: NSG blocking 22/3389 from your source; VM has no public IP and you're not using Bastion; Bastion subnet (AzureBastionSubnet) missing/misconfigured or RBAC missing; VM stopped/boot failed; OS firewall; wrong credentials. Checks: Effective security rules + IP Flow Verify; boot diagnostics/serial console; Bastion deployment. Fix: use Bastion (no public IP needed), allow the source in the NSG, reset password/SSH via "Reset password"/Run Command. Prevention: standardize Bastion + no public IPs.
az network watcher test-ip-flow -g RG --vm VM --direction Inbound --protocol TCP --local IP:22 --remote SRC:0
az vm boot-diagnostics get-boot-log -g RG -n VM
Causes: bad fstab mount, full OS disk, driver/kernel issue, failed extension. Checks: boot diagnostics screenshot + serial console. Fix: use the VM "Repair" (attach OS disk to a rescue VM) to fix config; keep disk snapshots. Prevention: test image/extension changes in non-prod.
CPU: check the Azure Monitor trend, then top on the host; right-size or autoscale (VMSS). Memory: requires the Azure Monitor Agent (guest memory not collected by default) - deploy via VM Insights, then right-size. Disk full: expand the managed disk, grow the partition/filesystem, alert at 85%.
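For a quick in-guest check of the 85% disk threshold, a minimal sketch using only the Python standard library follows. The `disk_alert` helper is hypothetical, and the percentage can differ slightly from `df` because of reserved blocks.

```python
import shutil

def disk_alert(path="/", threshold_pct=85.0):
    """Return (used_pct, breached) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return round(used_pct, 1), used_pct >= threshold_pct

used, breached = disk_alert("/")
print(f"/ is {used}% full; alert={'yes' if breached else 'no'}")
```

This only spot-checks one VM; the durable control is still an Azure Monitor alert on the guest disk metric collected by the agent.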
Causes: disk in a different region/zone than the VM; not initialized; LUN mapping; disk still attached elsewhere. Checks: disk state + LUN; lsblk/Disk Management. Fix: attach in the same region/zone, initialize/mount, add to fstab by UUID. Prevention: automate via extension/cloud-init.
Storage
Denied: missing RBAC data role (needs Storage Blob Data Reader/Contributor - control-plane roles like Contributor don't grant data access); wrong SAS/expired; firewall blocking; "allow shared key" disabled but code uses a key; VNet/Private Endpoint restriction; API not registered. Public access: account-level "allow blob public access" is off (correct) - use SAS/RBAC instead. Checks: IAM data roles; storage firewall; az storage blob list --auth-mode login. Fix: grant the data role to the managed identity; use Private Endpoint + Entra auth.
Causes: port 445 blocked (ISP/NSG) for SMB; wrong credentials (storage key or Entra/AD Kerberos); Private Endpoint DNS not resolving; NFS export/rules wrong. Fix: use Private Endpoint (avoids 445-over-internet), configure identity-based auth, verify Private DNS, check firewall.
Network
Method: IP Flow Verify (which NSG rule decides), Effective routes/rules on the NIC (real next hop + merged rules), Connection Troubleshoot, and nslookup. Per case:
- NSG: priority/direction; default deny-inbound; missing allow.
- Route (UDR): forced-tunnel to firewall with no return route (black hole); missing route.
- Azure Firewall: app/network rule collection missing the FQDN/port; DNAT config; UDR not pointing at it.
- NAT Gateway: outbound only; SNAT port exhaustion; not on the subnet.
- Private Endpoint: Private DNS zone not linked / no A record - FQDN resolves to public IP.
- Peering: not transitive; overlap; missing "allow forwarded traffic"/gateway transit.
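The NSG case is easier to reason about once the evaluation order is explicit: rules are processed from the lowest priority number upward and the first match decides. The sketch below is a simplified model, not the Azure implementation. It handles only source CIDR and a single destination port, and it models only the built-in DenyAllInBound default (priority 65500), leaving out the two default allow rules for VNet and load balancer traffic.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    priority: int  # lower number = evaluated first
    source: str    # CIDR, or "*" for any
    port: str      # single port as text, or "*" for any
    access: str    # "Allow" or "Deny"

# Simplified stand-in for the built-in inbound default that denies everything else.
DEFAULT_DENY = Rule("DenyAllInBound", 65500, "*", "*", "Deny")

def evaluate_inbound(rules, src_ip, dst_port):
    """Return the first rule, in priority order, that matches the flow."""
    for rule in sorted(rules + [DEFAULT_DENY], key=lambda r: r.priority):
        src_ok = rule.source == "*" or (
            ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source)
        )
        port_ok = rule.port == "*" or int(rule.port) == dst_port
        if src_ok and port_ok:
            return rule
    raise AssertionError("unreachable: the default deny matches every flow")

rules = [
    Rule("allow-ssh-from-office", 200, "203.0.113.0/24", "22", "Allow"),
    Rule("deny-ssh-all", 300, "*", "22", "Deny"),
]
for ip in ("203.0.113.7", "198.51.100.9"):
    hit = evaluate_inbound(rules, ip, 22)
    print(ip, "->", hit.access, f"({hit.name})")
```

IP Flow Verify answers the same question against the real effective rules, including the name of the rule that decided.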
Causes: IKE/PSK mismatch (VPN); BGP not advertising/learning; ExpressRoute circuit/peering down or route filters; gateway SKU bandwidth exhausted; CIDR overlap. Checks: connection/circuit status; BGP routes; effective routes. Fix: align IKE, correct BGP/route filters, resolve overlap, right-size the gateway SKU.
Load balancing & databases
Unhealthy: NSG blocks the health probe or App Gateway management ports (GatewayManager service tag) on the App Gateway subnet; wrong probe host/path/port/protocol; app on localhost; backend HTTP settings mismatch. LB: probe port not open; Standard LB needs an outbound rule. SSL: managed cert needs DNS → frontend first; Key Vault access for the gateway identity; chain/SAN. Fix: per section 7.
Connection: firewall/public access disabled but no Private Endpoint path; Private DNS not resolving; Entra token/permission; wrong connection string (server FQDN); transient errors (retry logic). Performance: Query Performance Insight / Query Store; DTU/vCore/log-IO limits; missing indexes; enable automatic tuning. Backup: PITR/LTR retention config; test a restore. Checks: az sql db show; audit/diagnostic logs.
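The "transient errors (retry logic)" item usually means capped exponential backoff with jitter around the database call. A minimal sketch follows; `TransientError` and `flaky_query` are hypothetical stand-ins, and real code would catch the driver's own retryable exceptions instead.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure such as a dropped connection or throttling."""

def with_retries(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `operation`, retrying transient failures with capped exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # full jitter avoids synchronized retries

# Hypothetical operation that fails twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection reset")
    return "ok"

print(with_retries(flaky_query, sleep=lambda s: None), "after", calls["n"], "calls")
```

Only errors known to be transient should be retried; a permission or firewall failure will not fix itself and should fail fast.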
Identity
Denied: walk the section 2 model - right tenant/subscription? which principal? role + control vs data plane? scope/inheritance? deny assignment? Azure Policy? Conditional Access? PIM eligible but not activated? Managed identity: is it assigned to the resource, and does it have the target role (data role too)? propagation delay after assignment. App secret expired: a service principal's client secret/cert expired - rotate it (and move to a managed identity / workload identity federation to avoid recurrence). Tools: IAM > Check access; Entra sign-in logs (Conditional Access); Activity Log.
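The "scope/inheritance?" question can be checked mechanically for the subscription, resource group, and resource levels, because a child's resource ID starts with its parent's. The sketch below is a simplified illustration with a hypothetical `scope_covers` helper and a placeholder subscription ID. It does not cover management-group scopes, which are not a path prefix of the subscription ID, nor deny assignments.

```python
def scope_covers(assignment_scope, resource_id):
    """True if a role assigned at `assignment_scope` is inherited by `resource_id`."""
    # Resource IDs are case-insensitive; compare whole path segments, not raw prefixes.
    scope = [s for s in assignment_scope.lower().split("/") if s]
    target = [s for s in resource_id.lower().split("/") if s]
    return target[:len(scope)] == scope

SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"  # placeholder subscription ID
vm = f"{SUB}/resourceGroups/rg-app-prod/providers/Microsoft.Compute/virtualMachines/web1"

print(scope_covers(f"{SUB}/resourceGroups/rg-app-prod", vm))
print(scope_covers(f"{SUB}/resourceGroups/rg-app", vm))
```

Comparing segments rather than raw string prefixes matters: an assignment on rg-app must not be read as covering rg-app-prod.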
Serverless & AKS
Function timeout: Consumption plan caps duration - use Premium/Flex, make idempotent, offload long work. Trigger not firing: check the binding connection (managed identity / connection string), the source (queue/blob/Event Grid subscription + filter), and function logs. Container Apps: new revision failing - container must listen on the target port, check ingress + scale rules (min replicas) + managed identity RBAC; roll back or split traffic.
Causes: Pending (capacity / pod-IP exhaustion with Azure CNI / requests too big), ImagePullBackOff (grant AcrPull to the kubelet identity; private ACR needs endpoint/DNS), CrashLoopBackOff (config/probes). Tools: kubectl describe/logs, node capacity, subnet IPs. (Section 10.)
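Pod-IP exhaustion is mostly a subnet-sizing problem. The sketch below assumes the traditional (node subnet) Azure CNI mode, where every node and every pod takes an address from the subnet, a default of 30 pods per node, one extra node for upgrade surge, and the five addresses Azure reserves in each subnet. Overlay and dynamic IP allocation size differently, so verify the mode and defaults against current Microsoft documentation.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # Azure reserves five addresses in every subnet

def required_ips(nodes, max_pods_per_node=30, surge_nodes=1):
    """IPs needed when every node and every pod takes an address from the subnet."""
    return (nodes + surge_nodes) * (1 + max_pods_per_node)

def subnet_fits(cidr, nodes, max_pods_per_node=30, surge_nodes=1):
    """Return (usable addresses, addresses needed, whether the subnet is large enough)."""
    usable = ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET
    needed = required_ips(nodes, max_pods_per_node, surge_nodes)
    return usable, needed, usable >= needed

for cidr in ("10.10.1.0/24", "10.10.0.0/22"):
    usable, needed, ok = subnet_fits(cidr, nodes=10)
    print(f"{cidr}: usable={usable} needed={needed} fits={ok}")
```

Under these assumptions a 10-node cluster needs 341 addresses, so a /24 is too small before the first scale-out.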
Monitoring
Alert: wrong signal/scope/threshold, evaluation window never met, action group has no verified receiver, alert disabled, or maintenance suppression. Test by forcing the condition; check the alert's history. Logs missing: diagnostic settings not enabled on the resource, Azure Monitor Agent / DCR not deployed on the VM, wrong Log Analytics workspace, or ingestion delay/retention expired. Fix: enable diagnostic settings (enforce via Policy), deploy the agent via VM Insights, verify the workspace.
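"Evaluation window never met" is often an aggregation effect: with Average over a five-minute window, a one-minute spike rarely crosses the threshold. The sketch below is a simplified model of that behaviour, using the same 85% threshold and PT5M window as the metric alert example in section 17. The `alert_fires` helper and the sample series are hypothetical.

```python
def alert_fires(samples, threshold=85.0, window=5):
    """True if the average of the last `window` one-minute samples exceeds the threshold."""
    if len(samples) < window:
        return False  # not enough data to evaluate the window yet
    recent = samples[-window:]
    return sum(recent) / window > threshold

spike = [40, 42, 99, 41, 40]      # one-minute spike: average 52.4
sustained = [88, 90, 91, 87, 92]  # sustained load: average 89.6
print(alert_fires(spike), alert_fires(sustained))
```

If short spikes matter, the usual options are a Maximum aggregation or a shorter window, at the cost of noisier alerts.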
17. Azure CLI, PowerShell, Bicep, ARM & Terraform
Practical, copy-friendly automation: CLI/PowerShell setup and tenant/subscription selection, managed identity, Bicep vs ARM vs Terraform, clean examples for VNet, VM, storage, RBAC, and alerts - plus state, structure, and CI/CD practices.
Two CLIs: Azure CLI (az) and Azure PowerShell (Az) - pick a house standard. Always set the right tenant + subscription before acting. Prefer managed identity / federated credentials over service-principal secrets. For IaC, use Bicep (Azure-native, readable) or Terraform (azurerm, multicloud) - both beat hand-written ARM JSON. Keep Terraform state remote and locked (a storage account with blob lease), structure into modules, separate environments, and deploy via a pipeline using a federated (secretless) identity.
Azure CLI and PowerShell setup
# Azure CLI
az login # interactive
az account list -o table # subscriptions you can see
az account set --subscription "Prod-01" # select subscription
az account show # confirm tenant + subscription
# Azure PowerShell
Connect-AzAccount
Set-AzContext -Subscription "Prod-01"
Get-AzContext
Auth & managed identity (no secrets)
# On an Azure resource with a managed identity, tools authenticate automatically:
az login --identity # system-assigned MI
az login --identity --username <clientId> # user-assigned MI
# CI/CD: use OIDC workload identity federation (no client secret)
# - create an app registration + federated credential for the pipeline
# - the pipeline exchanges its OIDC token for an Azure token; nothing to store
Common commands
# Resource groups & providers
az group create -n rg-app-prod -l eastus2
az provider register --namespace Microsoft.Sql
# Compute
az vm list -o table
az vm create -g rg-app-prod -n web1 --image Ubuntu2204 --public-ip-address "" --size Standard_D2s_v5
# Storage
az storage account create -g rg-app-prod -n stappprod01 --sku Standard_ZRS --allow-blob-public-access false
az storage blob upload-batch -d container -s ./data --auth-mode login
# RBAC
az role assignment create --assignee-object-id <objId> --role "Storage Blob Data Reader" --scope <resourceId>
az role assignment list --assignee <objId> --all -o table
# Logs (KQL)
az monitor log-analytics query -w <workspaceId> --analytics-query "AzureActivity | take 20"
Bicep vs ARM vs Terraform
| | Bicep | ARM templates | Terraform |
|---|---|---|---|
| Language | Concise DSL → compiles to ARM | Verbose JSON | HCL (azurerm provider) |
| Best for | Azure-only, native, day-1 feature parity | Legacy; avoid authoring by hand | Multicloud / standardized IaC, rich ecosystem |
| State | None (ARM tracks deployments) | None | You manage remote state |
Create a VNet + subnet + NSG
// Bicep: network.bicep
resource vnet 'Microsoft.Network/virtualNetworks@2023-11-01' = {
  name: 'vnet-app'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.10.0.0/20' ] }
    subnets: [
      {
        name: 'snet-app'
        properties: {
          addressPrefix: '10.10.1.0/24'
          networkSecurityGroup: { id: nsg.id }
          privateEndpointNetworkPolicies: 'Disabled'
        }
      }
    ]
  }
}
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-app'
  location: resourceGroup().location
}
# Terraform: network.tf
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-app"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  address_space       = ["10.10.0.0/20"]
}
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.app.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
}
Create a VM (no public IP)
# Terraform: a Linux VM with a system-assigned managed identity, no public IP
resource "azurerm_linux_virtual_machine" "app" {
  name                  = "vm-app-1"
  resource_group_name   = azurerm_resource_group.app.name
  location              = azurerm_resource_group.app.location
  size                  = "Standard_D2s_v5"
  admin_username        = "azureuser"
  network_interface_ids = [azurerm_network_interface.app.id]
  identity { type = "SystemAssigned" }
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}
Create a hardened storage account
resource "azurerm_storage_account" "data" {
  name                            = "stappdataprod01"
  resource_group_name             = azurerm_resource_group.app.name
  location                        = azurerm_resource_group.app.location
  account_tier                    = "Standard"
  account_replication_type        = "ZRS"
  allow_nested_items_to_be_public = false # no public blob access
  shared_access_key_enabled       = false # force Entra auth (set storage_use_azuread = true in the provider block)
  min_tls_version                 = "TLS1_2"
  blob_properties {
    versioning_enabled = true
    delete_retention_policy {
      days = 30
    }
  }
}
Create an RBAC assignment
# Least-privilege: data role on one storage account, to a GROUP
resource "azurerm_role_assignment" "blob_read" {
  scope                = azurerm_storage_account.data.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azuread_group.app_team.object_id
}
Create a metric alert
resource "azurerm_monitor_action_group" "ops" {
  name                = "ops-email"
  resource_group_name = azurerm_resource_group.app.name
  short_name          = "ops"
  email_receiver {
    name          = "oncall"
    email_address = "oncall@example.com"
  }
}
resource "azurerm_monitor_metric_alert" "cpu" {
  name                = "vm-cpu-high"
  resource_group_name = azurerm_resource_group.app.name
  scopes              = [azurerm_linux_virtual_machine.app.id]
  window_size         = "PT5M" # evaluate the last 5 minutes
  frequency           = "PT1M" # re-evaluate every minute
  criteria {
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }
  action { action_group_id = azurerm_monitor_action_group.ops.id }
}
State, structure, and CI/CD
- Remote, locked state (Terraform): an Azure storage account backend (blob) with blob leasing for locking and versioning on. Never keep prod state on a laptop; never commit state (it holds secrets).
- Modular structure: reusable modules (network, compute, data, rbac, monitoring) composed per environment.
- Environment separation: separate state per env (workspaces or separate backends/keys) driven by dev.tfvars/prod.tfvars; separate subscriptions and ideally separate deployer identities/pipelines.
- CI/CD: run plan/apply (or Bicep what-if/deploy) in GitHub Actions or Azure DevOps using OIDC workload identity federation (no secret). Gate apply with approvals; run plan/what-if on PRs.
- No secrets in code: reference Key Vault by name; keep secret tfvars out of git.
azure-infra/
  modules/
    network/ compute/ data/ rbac/ monitoring/
  envs/
    dev/ main.tf dev.tfvars backend.tf
    prod/ main.tf prod.tfvars backend.tf
  .github/workflows/ or azure-pipelines.yml
  README.md
Run plan/what-if in CI on every PR, and use Azure Policy + Change Tracking / drift detection to catch out-of-band changes.
18. Azure Well-Architected Framework
The five pillars Microsoft uses to review a workload - Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Written for real architecture reviews: what each means, the services that support it, examples, mistakes, and a review checklist.
Run a workload (or a design) through all five pillars. For each, ask the checklist questions, map to concrete Azure services, and record gaps as action items. A pillar with no owner and no evidence is a risk, not a pass. (Microsoft's WAF review + the Advisor score are good companions.)
Reliability
What it means: the workload meets its availability and recovery targets and withstands failures - built around zones, redundancy, health modeling, and tested DR.
Why it matters: reliability targets (SLOs) drive architecture and cost; you can't bolt on availability after an outage.
Supporting services: Availability Zones, zone-redundant VMSS/App Service/LB, Azure SQL failover groups / zone redundancy, Cosmos multi-region, Azure Site Recovery, Backup, Front Door, Azure Monitor (health/SLOs).
Practical examples: zone-spread every tier; a defined + tested DR pattern per tier with RTO/RPO; health probes + autohealing; graceful degradation and retries with backoff; capacity planning; chaos/failure testing.
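Reliability targets become concrete once they are turned into arithmetic: components in series multiply their availabilities, and independent redundant copies fail only if all of them fail. The sketch below uses illustrative figures only, not published SLAs, and assumes failures are independent, which real regions and shared dependencies only approximate.

```python
from math import prod

def serial(*availabilities):
    """All components are required: availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """Any one of `copies` independent replicas is enough."""
    return 1 - (1 - availability) ** copies

def downtime_minutes_per_month(availability, days=30):
    """Expected downtime in minutes over a month of `days` days."""
    return (1 - availability) * days * 24 * 60

# Illustrative figures only, not published SLAs.
front_end, app, database = 0.9999, 0.999, 0.9999
single_region = serial(front_end, app, database)
two_regions = serial(front_end, redundant(serial(app, database), copies=2))
for label, a in (("single region", single_region), ("two regions", two_regions)):
    print(f"{label}: {a:.5%} ~ {downtime_minutes_per_month(a):.1f} min/month")
```

The point of the exercise is the comparison, not the absolute numbers: it shows which tier dominates the composite figure and what a second region buys, which is the input to the RTO/RPO and cost discussion.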
Security
What it means: protect identities, data, and workloads and meet compliance - Zero Trust, least privilege, encryption, and defense in depth.
Why it matters: a single over-broad role, public storage account, or long-lived secret can undo everything else. Security is a design property.
Supporting services: Entra ID + Conditional Access + PIM, Azure RBAC, managed identities, Azure Policy, Key Vault/Managed HSM, Private Link, Azure Firewall/WAF/DDoS, Defender for Cloud, Sentinel.
Practical examples: no standing Owner (PIM); MFA everywhere; managed identities (no secrets); preventive Policy; Private Endpoints; CMK; centralized logs; Defender on. (See section 8's checklist.)
Cost Optimization
What it means: deliver required value at the lowest sustainable cost - right-sizing, commitments, eliminating waste, and attributing spend.
Why it matters: unmanaged Azure spend grows silently; cost is a first-class design and operational concern.
Supporting services: Cost Management + budgets + exports, quotas, Advisor, Reservations, Savings Plan, Azure Hybrid Benefit, Spot, storage lifecycle, Log Analytics tiers.
Practical examples: tags + exports for attribution; RIs/Savings Plan for baseline; Hybrid Benefit; auto-shutdown non-prod; storage lifecycle; Log Analytics filtering; monthly review (section 14).
Operational Excellence
What it means: run and improve the workload reliably and repeatably - automation, observability, incident response, and safe change.
Why it matters: most outages come from change and from not seeing problems early; operational maturity turns a good design into a dependable service.
Supporting services: Bicep/Terraform + pipelines (GitHub Actions/Azure DevOps), Azure Monitor/Log Analytics/App Insights, Update Manager, Change Tracking, Automation, Azure Policy (drift), Deployment stacks.
Practical examples: everything as code with peer review; SLOs + alerts on symptoms; centralized logs; automated patching; runbooks tied to alerts; blameless post-mortems; progressive delivery (slots/rings).
Performance Efficiency
What it means: meet performance requirements efficiently as demand changes - right SKUs, autoscaling, caching, data locality, and query design.
Why it matters: performance affects user experience and cost simultaneously; the right shape and data design often beat simply adding capacity.
Supporting services: VM series (F/E/M/L), autoscaling (VMSS/App Service/AKS HPA), Front Door + CDN, Azure Cache for Redis, Premium SSD v2/Ultra, Azure SQL tiers/read replicas, App Insights.
Practical examples: match VM series to the bottleneck; autoscale on the right signal; cache at the edge (Front Door/CDN) and in-memory (Redis); co-locate data + compute (proximity groups) to cut latency/egress; tune database indexes/queries; load-test before launch.
19. Learning Path
A structured route from Azure fundamentals to enterprise-grade architecture, security, and AI - aimed at people coming from traditional infrastructure or another cloud. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.
Beginner
What to learn
- Fundamentals: regions/zones, the governance hierarchy (management group / subscription / resource group), and the Azure mental model (section 1).
- Entra ID basics and RBAC vs Entra roles; groups; managed identities (section 2).
- VNet basics: subnets, NSGs, and how traffic flows (section 3).
- VM basics: sizes, managed disks, Bastion access (section 4).
- Storage basics: storage accounts, blob tiers, redundancy, disabling public access (section 5).
- Azure Monitor basics: metrics, the agent, an alert, Log Analytics (section 9).
Why it matters
Every design rests on the hierarchy, the Entra/RBAC split, and the VNet model. Get these right and everything later is easier.
Hands-on labs
- Create a resource group; assign a built-in role to a group at RG scope; test with Check access.
- Build a VNet with subnets + NSGs; deploy a VM with no public IP; connect via Bastion.
- Create a storage account (ZRS, public access disabled); upload blobs with Entra auth.
- Deploy the Azure Monitor Agent (VM Insights); create a CPU alert to an action group.
Common mistakes
Confusing Entra roles with RBAC; Owner at subscription; public storage; no agent for memory; VMs with public IPs.
Expected outcome
You can stand up a segmented VNet, reach a private VM via Bastion, use RBAC correctly, and see basic telemetry.
Intermediate
What to learn
- Load balancing (Load Balancer, Application Gateway + WAF, Front Door) and VMSS + autoscale (sections 7, 4).
- Private networking: Private Endpoints, NAT Gateway, VPN Gateway, ExpressRoute basics (section 3).
- Azure SQL: service tiers, zone redundancy, Private Endpoint, PITR, failover groups (section 6).
- Key Vault + managed identities; Defender for Cloud; Azure Policy basics (section 8).
- Cost Management: budgets, tags, Reservations, Hybrid Benefit (section 14).
Why it matters
This is the day job: HA app tiers, managed databases, and the operational, security, and cost controls that make them production-worthy.
Hands-on labs
- Deploy a 3-tier app: Front Door + WAF → App Service/VMSS → Azure SQL (Private Endpoint, zone-redundant).
- Allow the health probe (+ App Gateway management ports); confirm backends healthy; force a failover.
- Store the DB connection secret in Key Vault; connect via managed identity / Entra auth.
- Create alerts (CPU, unhealthy backend, DB storage) + an action group; wire a budget + tags.
- Apply a couple of Azure Policies (deny public IP, allowed regions) at resource-group scope.
Common mistakes
Health-probe NSG rule missing; DB public endpoint; secrets in app settings; Private DNS not linked; noisy alerts.
Expected outcome
You can deploy a secure, monitored, HA application + managed database, connect it privately, and control its cost and access.
Advanced
What to learn
- Management groups, landing zones (CAF ALZ), and Azure Policy at scale (sections 1, 8, 14).
- Hub-and-spoke / Virtual WAN, Azure Firewall, Private Link, DNS Private Resolver (section 3).
- Advanced RBAC + PIM + Conditional Access (section 2).
- AKS / Container Apps, Functions, Event Grid/Service Bus (section 10).
- Synapse / Microsoft Fabric, Data Factory, Purview (section 11).
- Azure OpenAI, Azure AI Search, and governed RAG / vector search (section 12).
- Multi-region DR (Front Door + failover groups / Cosmos) (section 13).
- Bicep/Terraform + pipelines; enterprise security; large-enterprise architecture (sections 17, 8).
Why it matters
At this level you own governance, resilience, data platforms, and AI enablement across many teams - decisions that are expensive to reverse.
Hands-on labs
- Deploy a landing zone (CAF ALZ accelerator): management groups, baseline RBAC + PIM, Azure Policy, hub network, central logging + Defender, budgets.
- Stand up a private AKS cluster in a spoke with Workload Identity and a GitHub Actions/Azure DevOps pipeline.
- Build a Fabric/Synapse lakehouse with a Data Factory pipeline and Purview governance; tune a query.
- Build a governed RAG assistant: Blob + Azure AI Search + Azure OpenAI behind an App Service serving API, Private Endpoints, Content Safety, and security-trimmed retrieval.
- Implement cross-region DR for an Azure SQL app (failover group) behind Front Door; rehearse failover and confirm CMK keys in DR.
Common mistakes
Skipping the landing zone; DR never tested; standing Owner instead of PIM; pod-IP exhaustion; connecting AI to production data without a governed serving layer.
Expected outcome
You can design and operate a governed, automated, multi-region Azure platform - including data and AI workloads - and defend the trade-offs on security, reliability, and cost.
Certification checkpoints (optional)
| Level | Typical certification track |
|---|---|
| Beginner | AZ-900 (Azure Fundamentals); AZ-104 (Administrator) |
| Intermediate | AZ-104; AZ-500 (Security); AZ-700 (Networking) |
| Advanced | AZ-305 (Solutions Architect Expert); DP-700/DP-600 (Data; DP-700 replaces the retired DP-203); AI-102 (AI Engineer) |