As of 26 June 2026

Azure AI, the practical way

An architecture-first reference for the Microsoft Azure AI stack as of June 2026. The platform formerly called Azure AI Foundry is now Microsoft Foundry - one surface for models, agents, evaluation, and governance. This portal covers Foundry, the model catalog, Foundry Agent Service, Copilot, and the silicon - trade-offs and risks, no marketing.

Refreshed June 2026 - Architecture-first - Enterprise focus - Vendor-neutral
Naming, 2026
Azure AI Foundry is being renamed Microsoft Foundry (docs, blogs, and SDKs are mid-transition; many URLs still say ai-foundry / foundry). Azure OpenAI models now live inside Foundry as part of Foundry Models. Agent 365 is the Microsoft-365-side governance layer for agents. Same lineage, new packaging.
TL;DR

Azure's 2026 AI story has three pillars. Microsoft Foundry is the unified build platform: Foundry Models (OpenAI GPT-5.x, o-series, plus Llama, Mistral, DeepSeek, Phi, Nemotron, and partner catalogs), Foundry Agent Service (GA - Responses-API runtime, MCP + A2A, connected multi-agent), a model router, evaluations, and observability. Copilot is the distribution engine - Microsoft 365 Copilot, Copilot Studio, GitHub/Security Copilot, governed by Agent 365. Underneath sit Azure's data services (AI Search, Cosmos/SQL/PostgreSQL vectors, Fabric) and custom silicon (Maia, Cobalt) alongside NVIDIA GPUs. If you are a Microsoft shop with OpenAI ambitions and M365 reach, this stack is the default - the cost is keeping up with the fastest-moving naming in the industry.

The Azure AI mental model

LAYER 3 - COPILOT & AGENTS (consume): Microsoft 365 Copilot - Copilot Studio - GitHub / Security / Azure Copilot - Agent 365 governance. Pre-built or low-code. Governed by Entra ID + Agent 365. Configure tools and knowledge, not weights.
LAYER 2 - MICROSOFT FOUNDRY (build): Foundry Models (GPT-5.x, o-series, Llama, Phi, Nemotron…) - Model Router - Foundry Agent Service - MCP/A2A - Evaluations - Observability - Content Safety - Prompt Flow - Fine-tuning & distillation. Pick/route models, build & govern agents, evaluate, red-team, deploy, monitor - one platform.
LAYER 1 - DATA & INFRASTRUCTURE (ground): Azure AI Search - Cosmos DB / Azure SQL / PostgreSQL vectors - Microsoft Fabric / OneLake - Blob Storage - Maia AI accelerators - Cobalt CPU - ND-series GPUs (GB200) - Azure AI infrastructure. Your data and vectors live here, next to the rest of your Azure estate and Entra identity.
Figure 1 - The Azure AI stack. Most teams enter at Layer 3 (Copilot) or Layer 2 (Foundry); drop to Layer 1 for data gravity and silicon.

What sets Azure apart in 2026

Differentiator | What it means in practice
First access to OpenAI frontier | GPT-5.x (5.4, 5.4 Mini, 5.5) and o-series land on Foundry with enterprise SLAs, private networking, and quota tiers - often the cleanest enterprise path to OpenAI's newest models.
Open standards in the runtime | Foundry Agent Service speaks MCP, A2A, and OpenAPI natively - connected agents, tool reuse, and cross-vendor interop without protocol lock-in.
M365 distribution | One-click publish from Foundry to Microsoft 365 Copilot and Teams; Agent 365 gives a single registry and guardrails across every agent in the tenant.
Model breadth + router | OpenAI, Anthropic, Meta, Mistral, DeepSeek, Microsoft Phi, NVIDIA Nemotron, Fireworks-hosted open models - and a model router that auto-picks the cheapest model that clears quality.
Entra identity + governance | Identity, Content Safety, Foundry Control Plane, Observability (tracing, evals, continuous red-teaming) are first-class, not bolt-ons.

Where Azure is weaker (be honest)

Naming velocity & sprawl
Azure AI Foundry to Microsoft Foundry, Azure OpenAI folded into Foundry Models, Copilot/Agent 365/Copilot Studio/Agent Framework overlapping - the surface changes faster than docs can keep up. Budget real time to map current names to what you already deployed.
OpenAI dependency
The frontier story leans heavily on the OpenAI relationship. Microsoft's own frontier models (beyond Phi SLMs) are still emerging; if that relationship shifts, the top of the catalog is a partner's, not Microsoft's.

How to read this portal

Each service tab follows the same shape: Overview, Architecture, Capabilities, Pricing, Risks, When to use. If you only read one sub-tab, read Risks & gotchas. The others tell you what something does; Risks tells you what bites you in production.

What's New - Ignite 2025 through June 2026

Material changes that affect architecture, cost, or risk. Curated, not a press-release dump.

TL;DR

Three threads dominate. One: the platform rebrand - Azure AI Foundry to Microsoft Foundry, with Foundry Agent Service reaching GA (Responses-API runtime, private networking, MCP OAuth passthrough) and Observability GA. Two: the GPT-5 cadence - 5.4, 5.4 Mini, then 5.5 - plus a model router, Priority Processing, Phi-4 vision/reasoning, GPT-image-2, and Fireworks-hosted open models (DeepSeek V3.2, gpt-oss, Kimi, MiniMax). Three: agent governance across the tenant - Agent 365, the Agent registry, and Foundry Control Plane.

Date | Release | Why it matters
Nov 2025 | Ignite 2025: Foundry Agent Service, Foundry Control Plane (preview), Agent 365 | Production agent runtime + one-click publish to M365/Teams; a single place to govern any agent (Foundry, Copilot Studio, third-party) with guardrails on inputs/outputs/tool calls.
Nov 2025 | Observability GA; Microsoft Agent Framework (AutoGen + Semantic Kernel lineage) | Evals + OpenTelemetry tracing + continuous red-teaming + Azure Monitor; a code-first agent framework supporting AG-UI and ChatKit front-ends.
Dec 2025 | Azure AI Foundry to Microsoft Foundry rebrand underway | One platform brand; Azure OpenAI becomes part of Foundry Models. Watch SDK/role/URL changes.
Jan 2026 | Model router; SDK 2.0 GA | Auto-select the optimal model per prompt to cut cost while holding quality; stable SDK surface.
Mar 2026 | GPT-5.4 (GA), GPT-5.4 Mini, Phi-4 Vision, Priority Processing, new evaluations | 5.4 targets agent reliability (task-drift, mid-workflow failures, tool-call consistency); Mini for cheap classify/extract; Priority Processing reserves low-latency compute lanes.
Mar 2026 | Foundry Agent Service GA runtime: Responses API, end-to-end private networking, MCP OAuth passthrough | Production-ready agent hosting with private networking and standardized tool auth. Migrate 2025 agent pilots here.
Apr-May 2026 | GPT-5.5; GPT-image-2 (4K); Fireworks AI open models; Nemotron first-class | Latest frontier OpenAI tier; high-res image gen; DeepSeek V3.2 / gpt-oss-120b / Kimi K2.5 / MiniMax M2.5 hosted; broader open-model choice in one catalog.
2026 | MCP + A2A first-class in Agent Service; connected (multi-)agents | Agents call agents as tools and interoperate across vendors via open standards - real multi-agent systems, with the governance burden that implies.
Practical read
If you built on Azure OpenAI + a hand-rolled orchestrator in 2025, plan a migration to Foundry Agent Service (managed runtime, MCP/A2A, Observability) and adopt the model router to control spend. Re-check IAM/role names and SDK packages against current Microsoft Foundry docs.

Service Map

The Azure AI services worth knowing, grouped by what you do with them.

PLATFORM - Microsoft Foundry

Formerly Azure AI Foundry. Models, agents, router, evaluations, observability, content safety - one build platform.

MODELS - Foundry Models

OpenAI GPT-5.x & o-series, plus Llama, Mistral, DeepSeek, Phi-4, Nemotron, Fireworks-hosted open models.

AGENTS - Foundry Agent Service

GA. Responses-API runtime, MCP + A2A, connected multi-agent, private networking, observability.

ASSIST - Copilot family

Microsoft 365 Copilot, Copilot Studio, GitHub/Security/Azure Copilot - governed by Agent 365.

APPLIED - Azure AI Services

Vision, Document Intelligence, Language, Speech, Translator, Content Understanding.

DATA - AI Search & vectors

Azure AI Search (vector + hybrid + semantic ranker), Cosmos/SQL/PostgreSQL vectors, Microsoft Fabric.

SILICON - Maia & Cobalt

Maia AI accelerators, Cobalt ARM CPUs, ND-series NVIDIA GPUs (GB200), AI infrastructure.

GOVERN - Content Safety + Agent 365

Content filters, Control Plane guardrails, Agent registry, evaluations, continuous red-teaming.

GROUND - RAG & grounding

Azure AI Search retrieval, "On Your Data", Bing grounding, Fabric data agents.

How to read this
The flagship services (Microsoft Foundry, Foundry Agent Service) carry the full Overview / Architecture / Capabilities / Pricing / Risks / When-to-use sub-tabs with reference-architecture diagrams. Secondary services use a single rich page with the same architecture-first, risk-honest treatment. Start at the service you care about; read its Risks before its Overview if you're scoping production.

Microsoft Foundry (formerly Azure AI Foundry)

The unified platform to choose models, build and govern agents, evaluate, and ship - the center of gravity for AI on Azure.


Overview
Architecture
Capabilities
Pricing model
Risks & gotchas
When to use
TL;DR

Foundry is the single pane for the whole AI lifecycle on Azure: a 1000+ model catalog (Foundry Models) with a router, a managed Agent Service, prompt flow, fine-tuning/distillation, evaluations, content safety, and observability - all under Entra identity, private networking, and Azure billing. You organize work in projects inside a Foundry resource/hub, and promote from experiment to production without leaving the platform.

What problem this solves

Enterprises don't want to wire together a model API, a vector store, a guardrail service, an eval harness, an agent orchestrator, and a monitoring stack from separate vendors - each with its own identity and billing. Foundry's offer is one governed surface where you swap models without rewriting the app, apply the same Content Safety policy across every model, and trace/evaluate agents in production. The trade-off is breadth: the platform is large and renaming fast, so onboarding has a real learning curve.

The building blocks

Concept | What it is
Foundry resource / hub | The top-level Azure resource that holds shared config, connections, and security boundaries.
Project | A workspace for a use case - models, data connections, agents, evaluations, and deployments scoped together.
Foundry Models | The model catalog: OpenAI, Microsoft, and partner/open models, sold directly by Azure or via the marketplace.
Foundry Agent Service | The managed runtime for production agents (see its own tab).
Evaluations & Observability | Quality/safety evaluation, OpenTelemetry tracing, continuous red-teaming, Azure Monitor.
Rule of thumb
Start every Azure GenAI workload as a Foundry project. Only drop to raw model endpoints or a custom stack when you have a specific need Foundry doesn't cover - and you will need fewer of those than you think.

Reference architecture

Azure VNet (private endpoints, NSG) - Entra ID identity
Application: AKS / Functions / App Service; Foundry SDK 2.0 / Responses API
Foundry project endpoint: private endpoint, Entra auth; Content Safety in-line
Model router + Foundry Models: OpenAI GPT-5.x / o-series / GPT-image-2; Microsoft Phi-4 (vision / reasoning); Llama / Mistral / DeepSeek / Nemotron; Fireworks-hosted open models; Anthropic Claude (via catalog)
Serverless (pay-go) deployment: per-token; standard or Provisioned Throughput (PTU); Priority Processing lanes for low latency
Agent Service runtime: Responses API, threads/state, MCP + A2A tools; connected (multi-)agents, private networking
Knowledge / RAG: Azure AI Search (vector+hybrid); Cosmos / SQL / PostgreSQL vectors; Fabric / OneLake
Govern & observe: Content Safety, Control Plane; evals, OTel tracing, red-team; Azure Monitor, Defender for AI
Identity & security: Entra ID, managed identity, RBAC; Key Vault, customer-managed keys; Private Link, no public egress
Figure - Microsoft Foundry reference shape. Private endpoints + Entra keep traffic off the public internet; Content Safety sits in-line on every model call.

Network and identity

Foundry projects support private endpoints (Private Link) so model and agent traffic never traverses the public internet. Authentication is Entra ID; apps use managed identities and RBAC scoped to the Foundry resource, project, and deployment. Secrets and keys belong in Key Vault, and you can enforce customer-managed keys (CMK) for data at rest. For regulated workloads, combine private networking, CMK, no-public-egress NSG rules, and Defender for AI monitoring.

Where the data goes

Microsoft's stated position is that prompts and completions in Azure OpenAI / Foundry Models are not used to train the foundation models, and data stays within your Azure tenant and chosen region/data-zone. You control whether request/response logging is enabled. For data residency, use region- or data-zone-pinned deployments and confirm the specific model's availability there before designing around it.

Capability matrix (June 2026)

Capability | Status and notes
Model catalog + router | 1000+ models; router auto-selects the cheapest model that clears quality.
Foundry Agent Service | GA - Responses API runtime, MCP + A2A, connected multi-agent.
Evaluations | Automated + LLM-judge quality/safety evals, including agent evals.
Observability | GA - OpenTelemetry tracing, continuous red-teaming, Azure Monitor.
Content Safety | In-line filters: hate/sexual/violence/self-harm, jailbreak/prompt-shield, groundedness, protected material.
Fine-tuning / distillation | Supervised fine-tuning, distillation; reinforcement methods on select models.
Prompt Flow | Author, test, and deploy prompt/orchestration flows.
Private networking | Private Link / VNet integration end to end.
Provisioned Throughput (PTU) | Reserved capacity for predictable latency/cost at volume.
Priority Processing | Preview - dedicated low-latency compute lanes for real-time agents/chat.

How Foundry bills

Mode | How you pay | Best for
Standard (pay-as-you-go) | Per input/output token, per model. | Prototyping, variable/low volume, model comparison.
Provisioned Throughput (PTU) | Reserved throughput units (hourly/monthly/annual reservations). | Steady high volume needing predictable latency and cost.
Priority Processing | Premium for reserved low-latency lanes. | Customer-facing real-time chat / agents with strict latency.
Fine-tuning | Training tokens + hosting of the tuned deployment. | Narrow tasks where a tuned small model beats prompting a large one.
Agent Service / tools | Underlying model tokens x steps + tool/runtime charges. | Production agents - watch step count.
Rule of thumb
Cross from Standard to PTU when sustained traffic makes reserved units cheaper than pay-go and you need latency guarantees. Use the model router and GPT-5.x Mini to keep the per-call cost down before reserving capacity.
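
The Standard-versus-PTU crossover is plain arithmetic once you know your monthly token volume. The sketch below is illustrative only: the function names, the per-million-token prices, and the PTU unit price are all made-up assumptions, and it does not model how many PTUs a given throughput actually requires. Take real rates and sizing from the Azure pricing pages.

```python
def monthly_paygo_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Pay-as-you-go cost for one month; prices are per million tokens."""
    return (input_tokens / 1_000_000) * in_price_per_m + \
           (output_tokens / 1_000_000) * out_price_per_m


def cheaper_mode(input_tokens, output_tokens, in_price_per_m, out_price_per_m,
                 ptu_units, ptu_price_per_unit_month):
    """Return (mode, paygo_cost, ptu_cost) where mode is "standard" or "ptu".

    Assumes ptu_units is already sized to carry the whole load (hypothetical).
    """
    paygo = monthly_paygo_cost(input_tokens, output_tokens,
                               in_price_per_m, out_price_per_m)
    ptu = ptu_units * ptu_price_per_unit_month
    return ("ptu" if ptu < paygo else "standard", paygo, ptu)


# Hypothetical prices: 1.00 per M input tokens, 4.00 per M output tokens,
# 300.00 per PTU per month, 10 PTUs reserved.
steady = cheaper_mode(2_000_000_000, 400_000_000, 1.0, 4.0, 10, 300.0)
pilot = cheaper_mode(100_000_000, 20_000_000, 1.0, 4.0, 10, 300.0)
print(steady)  # high, steady volume
print(pilot)   # low volume
```

With these invented numbers the steady workload favors the reservation and the pilot stays on Standard. Cost is only half of the rule of thumb: the latency guarantee is a separate reason to reserve.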

Risks & gotchas

Naming & API churn
Foundry, Azure OpenAI, SDK 1.x vs 2.0, classic AI Studio - names, roles, and endpoints are mid-migration. Pin SDK versions, watch deprecations, and confirm role assignments after upgrades.
Quota & capacity
Newest models (and PTU) have tiered quotas and regional capacity limits. Request increases early; the model you demo with may not have headroom in your region at launch.
Agent cost & actions
Agent loops multiply token cost by steps and can take real actions. Cap loop length, scope tool permissions via Entra, and require approval on high-impact tools.
Data zone / residency
Data-zone and regional deployments differ by model. Confirm the exact model is available in your required region/data-zone before committing an architecture.

When to use

  • Use Foundry when you are on Azure and want one governed surface for models, agents, evals, and safety - which is almost every Azure GenAI workload.
  • Lead with the model router + GPT-5.x Mini for cost; reserve PTU once volume is steady.
  • Go straight to Agent Service for anything heading to production rather than hand-rolling an orchestrator.
  • Drop to raw endpoints / custom stack only for a specific capability Foundry doesn't cover.

Foundry Models

The model catalog behind Foundry - OpenAI frontier, Microsoft Phi, and a broad partner/open selection, with a router to pick between them.


Family | Examples (June 2026) | Use
OpenAI flagship | GPT-5.5, GPT-5.4, GPT-5.4 Mini, o-series | Hardest reasoning, agents, coding; Mini for cheap classify/extract/tool-calls.
OpenAI media | GPT-image-2 (4K), TTS / Realtime | Image generation/editing and voice.
Microsoft (own) | Phi-4, Phi-4 Vision, Phi-4 Reasoning Vision 15B | Small, efficient multimodal/reasoning models; on-prem/edge via Foundry Local.
Open / partner | Llama, Mistral, DeepSeek V3.2, NVIDIA Nemotron, Anthropic Claude | Open-weight customization, cost, or specific-vendor strengths.
Fireworks-hosted | gpt-oss-120b, Kimi K2.5, MiniMax M2.5 | High-performance open-model inference without standing up your own serving.
Model router
Rather than hard-coding a model, point at the router: it auto-selects the cheapest model that meets the quality bar for each prompt. Pair with evals so "quality bar" is something you measured, not assumed.
Sold-by-Azure vs marketplace
Some models are sold directly by Azure (first-party billing/SLA); others are partner/marketplace offers with different terms. Check the billing and data-handling terms per model before production.

Foundry Agent Service (GA)

The managed runtime for production agents on Azure - Responses-API based, with MCP, A2A, connected multi-agent, private networking, and observability.


Overview
Architecture
Tools & protocols
Risks & gotchas
When to use
TL;DR

Agent Service turns a model + instructions + tools + knowledge into a managed, stateful agent you don't have to host. The 2026 GA runtime is built on the Responses API with threads/state, end-to-end private networking, and standardized tool auth (MCP OAuth passthrough). It speaks MCP, A2A, and OpenAPI, and supports connected agents - agents calling other agents as tools - so you can compose specialists instead of building one monolith.

What problem this solves

Hand-built agent loops are easy to prototype and hard to operate: state, retries, tool auth, networking, tracing, and safety all become your problem. Agent Service makes those managed concerns and standardizes the integration surface (MCP/A2A/OpenAPI) so tools and other agents plug in without bespoke glue. You publish to Microsoft 365 Copilot and Teams in one click and govern everything through Agent 365 and the Foundry Control Plane.

Migrate 2025 pilots
If you built agents on Azure OpenAI + a custom orchestrator, the GA runtime replaces most of that scaffolding. Move for the managed state, private networking, MCP auth, and observability alone.

Reference architecture

Foundry project - VNet, Entra, Content Safety in-line
Channel: M365 Copilot / Teams / app; one-click publish
Agent (Responses API): model + instructions; threads / state / memory; Content Safety + Control Plane
Model router: GPT-5.x / o-series / Phi; cheapest that clears quality; Priority Processing optional
Tools (governed): MCP servers (OAuth passthrough); OpenAPI / Logic Apps / Functions; Bing grounding, code interpreter, file search, browser
Connected agents (A2A): specialist agents as tools; cross-vendor via A2A v1; orchestrator delegates tasks
Agent 365: registry + identity
Knowledge: Azure AI Search (vector+hybrid); Cosmos / SQL / PostgreSQL vectors; Fabric data agents; "On Your Data"
Figure - Foundry Agent Service. The agent runtime brokers model calls (via the router), governed tools (MCP/OpenAPI), connected specialist agents (A2A), and knowledge - all under Entra and Content Safety.

Tools & protocols

Surface | What it gives you
MCP (Model Context Protocol) | Connect external MCP servers as governed tools, with OAuth passthrough for delegated auth.
A2A (Agent-to-Agent) | Call other agents - your own or third-party - as interoperable endpoints.
OpenAPI tools | Wrap any REST API as a tool from its spec.
Connected agents | Compose specialist agents; an orchestrator delegates subtasks.
Hosted tools | Bing grounding, file search, code interpreter, browser, Logic Apps / Functions.
Knowledge | Azure AI Search, "On Your Data", Cosmos/SQL/PostgreSQL vectors, Fabric data agents.
Why open standards matter
MCP + A2A + OpenAPI mean your tools and agents aren't trapped in one vendor's format. The same MCP tool can serve a Foundry agent today and a different runtime tomorrow.

Risks & gotchas

Multi-agent blast radius
Connected agents + A2A make systems powerful and hard to reason about. Use Agent 365 registry, per-agent Entra identities, least-privilege tool scopes, budgets, and Observability traces from day one.
Tool auth & data exfiltration
MCP/OpenAPI tools can read and send data. Vet every tool, prefer OAuth passthrough with scoped consent, and keep agents on private networking with egress controls.
Runaway loops
Cap steps and tokens per conversation; route routine steps to GPT-5.x Mini and reserve the flagship for the hard parts.
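
A step and token cap can be as small as a counter wrapped around the agent loop. The sketch below is a generic illustration, not Agent Service configuration: `ConversationBudget`, `run_agent`, and the `step_fn` callable (which returns whether the agent is done and how many tokens the step spent) are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ConversationBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0


def run_agent(step_fn, budget):
    """Run step_fn until it finishes or the budget stops the loop.

    step_fn is a hypothetical callable returning (done, tokens_spent).
    """
    while True:
        # Check the step cap before spending anything on another step.
        if budget.steps_used >= budget.max_steps:
            return "stopped: step cap reached"
        done, tokens = step_fn()
        budget.steps_used += 1
        budget.tokens_used += tokens
        if done:
            return "completed"
        # Token usage is only known after the step, so check it afterwards.
        if budget.tokens_used >= budget.max_tokens:
            return "stopped: token cap reached"


def never_done():
    """Stand-in for an agent stuck in a loop: 100 tokens per step, never done."""
    return (False, 100)


step_limited = ConversationBudget(max_steps=5, max_tokens=10_000)
token_limited = ConversationBudget(max_steps=50, max_tokens=250)
print(run_agent(never_done, step_limited))
print(run_agent(never_done, token_limited))
```

The token check runs after each step because a step's usage is not known in advance, so a conversation can overshoot the token cap by at most one step.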
Publish surface
One-click publish to M365/Teams is convenient and broad - make sure governance (Agent 365, DLP, Content Safety) is set before you expose an agent to the whole tenant.

When to use

  • Use Agent Service for any agent heading to production - you get managed state, private networking, MCP/A2A, and observability without building them yourself.
  • Use connected agents / A2A when the problem decomposes into specialists; keep a single monolith only for simple flows.
  • Pair with the model router and Priority Processing for cost and latency control.
  • Govern through Agent 365 before publishing to M365/Teams.

Governance & Safety

The controls that make agents and models safe to run in an enterprise tenant.


Control | What it does
Azure AI Content Safety | In-line filters for hate/sexual/violence/self-harm, plus prompt shields (jailbreak), groundedness detection, and protected-material checks - applied to any model.
Foundry Control Plane | Govern every agent in one place (Foundry, Copilot Studio, third-party) with consistent guardrails across inputs, outputs, tool calls, and tool responses.
Agent 365 + Agent registry | Tenant-wide discovery, identity, and management of all agents, with admin guardrails and DLP.
Observability | Evaluations, OpenTelemetry tracing, continuous red-teaming, Azure Monitor insights.
Defender for AI / Purview | Threat protection for AI workloads and data governance/compliance across prompts and outputs.
Entra identity | Managed identities, RBAC, conditional access - the same identity plane as the rest of Azure.
Apply at the platform layer
Put Content Safety and Control Plane guardrails between the app and the model so the same policy holds no matter which model the router picks - don't scatter safety logic across app code.
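
To make the "one policy, any model" idea concrete, here is a minimal sketch of a wrapper that applies the same input and output checks around whichever model function sits behind it. It is not the Content Safety or Control Plane API: `with_guardrails`, the two toy checks, and the stand-in models are hypothetical, and real filters are far richer than a substring test.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # prompt in, completion out
Check = Callable[[str], bool]   # True when the text is allowed


def with_guardrails(model: ModelFn, check_input: Check, check_output: Check,
                    refusal: str = "[blocked by policy]") -> ModelFn:
    """Wrap a model so the same checks run regardless of which model it is."""
    def guarded(prompt: str) -> str:
        if not check_input(prompt):
            return refusal          # never reaches the model
        answer = model(prompt)
        return answer if check_output(answer) else refusal
    return guarded


# Toy policy (assumption): block an obvious injection phrase and leaked secrets.
def input_ok(text: str) -> bool:
    return "ignore previous" not in text.lower()


def output_ok(text: str) -> bool:
    return "secret" not in text.lower()


# Two stand-in "models" the router might pick between.
def echo_model(prompt: str) -> str:
    return "echo: " + prompt


def leaky_model(prompt: str) -> str:
    return "the secret is 1234"


safe_echo = with_guardrails(echo_model, input_ok, output_ok)
safe_leaky = with_guardrails(leaky_model, input_ok, output_ok)
print(safe_echo("hello"))
print(safe_leaky("hello"))
```

Because the policy lives in the wrapper rather than in each call site, swapping the underlying model does not change which checks run.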
Governance before scale
Agent 365 and DLP should be configured before you one-click-publish agents to a tenant of users. Retrofitting governance after broad exposure is the hard path.

Azure vs AWS vs OCI vs GCP

A practitioner's quick read. Every cloud does the basics; the differences are in defaults, data gravity, and silicon.

Dimension | Azure | AWS | OCI | GCP
Frontier model | OpenAI GPT-5.x | Nova (mid); Claude hosted | None (partners) | Gemini 3.x
Model breadth (managed) | Foundry Models (1000+) | Bedrock (widest) | Broad (OCI Gen AI) | Model Garden (200+)
Agents | Foundry Agent Service + MCP/A2A | AgentCore | Enterprise AI Agents | Agent Platform + A2A
Custom silicon | Maia (emerging) | Trainium/Inferentia | GPU (NVIDIA) | TPU (Ironwood/8th)
Data gravity | Fabric / OneLake | S3 / Redshift | Oracle DB 26ai (in-DB vectors) | BigQuery
Distribution | Microsoft 365 | Console / partners | Oracle apps / EBS | Workspace
Best when | Microsoft shop; want OpenAI frontier + M365 reach | Already on AWS; want model choice + silicon economics | Run Oracle DB/EBS; want in-DB vectors + sovereignty | BigQuery/Workspace central; want Gemini + TPU full stack
Honest take
The cloud your data and identity already live in usually wins - gravity beats a marginally better model. Azure's edge is the OpenAI frontier plus unmatched Microsoft 365 distribution; its tax is the fastest naming churn of the four.

Sources

Primary Microsoft material used for this portal (June 2026). Names and versions are mid-transition - confirm in current docs before designing.

Accuracy note
Compiled by Brijesh Gogia for expertoracle.com. Independent and not affiliated with Microsoft. Azure's AI naming changed substantially across Ignite 2025 / 2026 - treat this as orientation and confirm in the portal/docs before designing.

Model Router

One endpoint that auto-selects the cheapest model clearing your quality bar - per prompt.


Instead of hard-coding GPT-5.5 (expensive) or GPT-5.4 Mini (cheap) at every call site, you target the router. It classifies each request and dispatches to the model that meets the quality target at the lowest cost and latency, escalating to stronger models only when the prompt needs it. For mixed workloads - where most requests are easy and a few are genuinely hard - it is one of the simplest cost levers on the platform.

Use the router when | Skip it when
Workload mixes easy and hard prompts; you want cost savings without re-engineering call sites. | You need a single fixed model version for reproducibility or a compliance attestation.
You can tolerate a small classification step before dispatch. | Latency budget is so tight the routing hop is unacceptable.
"Quality bar" must be measured
The router only saves money safely if your quality target is something you evaluated. Wire it to Foundry evaluations so routing is grounded in measured task accuracy, not assumptions - otherwise you trade cost for silent quality regressions.
Pairs with
Priority Processing (low-latency lanes) and Provisioned Throughput (reserved capacity) for the high-volume tier the router sends most traffic to.
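
The selection rule described above ("cheapest model that clears a measured quality bar") can be sketched in a few lines. This is not how the Foundry router is implemented: the real router classifies each prompt itself, while this sketch assumes you already know a prompt's tier and have eval scores per tier. The `Candidate` class, model names, prices, and scores are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str                      # hypothetical deployment name
    cost_per_m_tokens: float       # made-up blended price per million tokens
    eval_scores: Dict[str, float]  # measured accuracy per prompt tier, 0..1


def pick_model(candidates: List[Candidate], tier: str, quality_bar: float) -> Candidate:
    """Return the cheapest candidate whose measured score clears the bar."""
    eligible = [c for c in candidates
                if c.eval_scores.get(tier, 0.0) >= quality_bar]
    if not eligible:
        raise ValueError(f"no candidate clears {quality_bar} on tier {tier!r}")
    return min(eligible, key=lambda c: c.cost_per_m_tokens)


CANDIDATES = [
    Candidate("mini", 0.5, {"easy": 0.96, "hard": 0.71}),
    Candidate("flagship", 10.0, {"easy": 0.98, "hard": 0.93}),
]

print(pick_model(CANDIDATES, "easy", 0.90).name)
print(pick_model(CANDIDATES, "hard", 0.90).name)
```

The scores are the important input: if they come from guesses rather than evaluations, the cheap model will be picked for work it cannot do, which is the silent regression the callout warns about.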

Knowledge & RAG - Azure AI Search

Keep answers grounded in your data. Azure AI Search is the default retrieval engine; several databases can also serve vectors.


Ingestion (offline): Sources (Blob / SharePoint / DB) -> Chunk + embed (skillset / integrated vectorization) -> Azure AI Search index (vector + keyword + semantic; hybrid + semantic ranker)
Query (online): User query (agent / app / Copilot) -> Retrieve + rerank (hybrid search, semantic ranker) -> Grounded generation (model + cited context) -> Content Safety (groundedness + filters)
Figure - Azure RAG. AI Search holds the hybrid (vector + keyword) index with a semantic ranker; Content Safety checks groundedness on the way out.
Retrieval option | Best for
Azure AI Search | The default - vector + keyword + semantic ranking, integrated vectorization, security trimming. Most RAG starts here.
"On Your Data" | Fastest path - point a model at a data source and get grounded answers with minimal code.
Cosmos DB / Azure SQL / PostgreSQL vectors | When vectors must live beside operational data with transactional consistency.
Microsoft Fabric data agents | Grounding over the lakehouse / OneLake for analytics-centric estates.
Retrieval is the failure point
Most "the model hallucinated" incidents are actually retrieval misses. Use hybrid search + the semantic ranker, enable groundedness detection in Content Safety, and evaluate retrieval quality before blaming the LLM.
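
Two pieces of that advice are easy to illustrate offline. Hybrid search has to merge a keyword ranking and a vector ranking into one list; Azure AI Search documents Reciprocal Rank Fusion (RRF) for this, and the sketch below shows the general technique, not the service's code. The second function is a simple recall@k check for measuring retrieval on a labeled set. Document IDs, the constant `k=60`, and the relevance labels are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; earlier ranks contribute more."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a keyword (BM25) query
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from a vector query

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
print(recall_at_k(fused, relevant={"doc_c", "doc_d"}, k=2))
```

Tracking recall@k on a small labeled query set separates "retrieval never surfaced the passage" from "the model ignored it", which is the distinction the callout asks you to make.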

Copilot & Agent 365

Microsoft's distribution layer - buy the assistant inside the tools people already use, and govern every agent in the tenant.


Product | What it is
Microsoft 365 Copilot | The assistant embedded in Word, Excel, Outlook, Teams - grounded in your Graph (mail, files, chats) with user permissions.
Copilot Studio | Low-code builder for custom agents/topics; uses OpenAI and Anthropic models; publishes to Teams, web, and M365 Copilot.
GitHub Copilot | Coding agent across the SDLC - completion, chat, agent mode, code review.
Security Copilot | SOC assistant for triage, hunting, and incident summarization across Defender/Sentinel.
Azure Copilot | Operations assistant for managing and troubleshooting Azure resources.
Agent 365 | Tenant-wide governance: the Agent registry discovers and manages every agent (Copilot Studio, Agent Builder, SharePoint, M365 Agent SDK, Foundry, third-party), with identity, DLP, and admin guardrails.
Build vs buy
For internal knowledge or productivity assistants, pilot Microsoft 365 Copilot or a Copilot Studio agent before building custom on Foundry - the Graph grounding and permission inheritance save real work. Build on Foundry/Agent Service when you need bespoke logic, custom UX, or tools Copilot Studio can't express.
Govern before you scale
One-click publish to a tenant of users is powerful. Configure Agent 365 registry, DLP, and Content Safety before broad exposure - retrofitting governance after the fact is the hard path.

Applied AI Services

Task-specific managed APIs in Azure AI Services - call them, no model selection required.


Service | Task
Azure AI Vision | Image analysis, OCR, spatial analysis, image captioning/tags.
Document Intelligence | Extract text, tables, key-value pairs, and structure from documents (the former Form Recognizer).
Azure AI Language | Entity recognition, sentiment, PII detection, summarization, custom classification, question answering.
Azure AI Speech | Speech-to-text, text-to-speech (incl. custom/neural voices), translation, diarization.
Translator | Neural machine translation across many languages, document translation.
Content Understanding | Multimodal extraction across documents, images, audio, and video into structured output.
Trend to watch
Many classic tasks (extraction, classification, summarization) are increasingly done with a Foundry multimodal model or Content Understanding. Reach for the dedicated applied service when it is cheaper, lower-latency, or compliance-certified for that exact task; reach for Foundry when you need flexibility.

Data & Vectors

Where embeddings and ground-truth live. Pick by where your data already is.


Store | Best for
Azure AI Search (vector) | The default RAG index - hybrid (vector + keyword) search with a semantic ranker and security trimming.
Azure Cosmos DB (vector) | Vectors beside globally-distributed operational/app data, low latency, NoSQL.
Azure SQL / SQL DB (vector) | Vectors next to relational data with transactional consistency.
Azure Database for PostgreSQL (pgvector + DiskANN) | Open-source vector path beside Postgres data; DiskANN for scale.
Microsoft Fabric / OneLake | Lakehouse-scale data and Fabric data agents for analytics-centric grounding.
Default
Most teams start with Azure AI Search for RAG. Move vectors into Cosmos/SQL/PostgreSQL when they must sit beside operational rows; use Fabric when the corpus is the warehouse.

Maia & Silicon

The compute under the stack - Microsoft's custom accelerators alongside NVIDIA GPUs.


Silicon | Role
Maia AI accelerator | Microsoft's in-house AI chip for training/inference economics on first-party and hosted workloads; the emerging cost lever versus pure-GPU.
Cobalt (Arm CPU) | Microsoft's Arm-based general-purpose CPU - efficient serving and supporting workloads around AI.
ND-series GPU VMs (NVIDIA, incl. GB200) | Top-end GPU training/inference with full CUDA/framework compatibility.
Azure AI infrastructure / Maia clusters | Network-dense accelerator fabrics for large-scale training; reserved capacity options.
Architect's lever
Most teams consume models via Foundry and never touch raw silicon. When you do run dedicated capacity at volume, benchmark Maia and ND-series GPUs on your workload - the price/perf gap can dominate TCO. Keep GPUs where a specific CUDA/framework path is required.

Architecture Patterns

The shapes most Azure GenAI workloads fall into.

1. Grounded enterprise assistant

Foundry model + Azure AI Search (hybrid) + Content Safety, fronted by an app or Copilot Studio. The default knowledge assistant.

2. Production agent

Foundry Agent Service + model router + MCP/OpenAPI tools + connected agents (A2A), governed by Agent 365 and Observability. Human-in-the-loop on high-impact actions.

3. M365-embedded

Microsoft 365 Copilot or a Copilot Studio agent over Graph data - buy the assistant in the tools people already use.

4. Multimodal pipeline

Content Understanding or a Foundry multimodal model extracts from docs/images/audio/video into structured output feeding Search or a warehouse.

5. Custom / open model service

Fine-tune or host an open model (Phi, Llama) via Foundry; distill to cut run-cost once quality is proven.

6. Cost-routed high volume

Model router + GPT-5.x Mini for routine traffic, PTU for the steady tier, flagship models only on hard prompts.
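
Pattern 2 calls for human-in-the-loop on high-impact actions. One way to picture that is a small gate between the agent and its tools: a tool runs only if it is on the allow-list, and tools marked high-impact also need an approval callback to say yes. This is a generic sketch, not an Agent Service or Agent 365 feature; `ToolPolicy`, `dispatch`, the tool names, and the `approve` callback are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ToolPolicy:
    allowed: Set[str]                                        # least-privilege allow-list
    needs_approval: Set[str] = field(default_factory=set)    # high-impact tools


def dispatch(tool_name, args, tools, policy, approve):
    """Run a tool call only if policy allows it and, if required, a human approves.

    approve is a hypothetical callback: (tool_name, args) -> bool.
    """
    if tool_name not in policy.allowed:
        return {"status": "denied", "tool": tool_name}
    if tool_name in policy.needs_approval and not approve(tool_name, args):
        return {"status": "rejected", "tool": tool_name}
    return {"status": "ok", "tool": tool_name, "result": tools[tool_name](**args)}


# Illustrative tools: one read-only, one that moves money.
TOOLS = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "issue_refund": lambda order_id, amount: f"refunded {amount} on order {order_id}",
}
POLICY = ToolPolicy(allowed={"lookup_order", "issue_refund"},
                    needs_approval={"issue_refund"})


def nobody_approves(tool_name, args):
    return False


print(dispatch("lookup_order", {"order_id": 7}, TOOLS, POLICY, nobody_approves))
print(dispatch("issue_refund", {"order_id": 7, "amount": 20}, TOOLS, POLICY, nobody_approves))
```

Read-only tools pass straight through, while the refund stalls until someone approves it. Unknown tools are denied outright, which keeps the default least-privilege.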

Decision Matrix

Fast answers for design reviews.

Question | Default answer
Which model? | Router by default; GPT-5.x Mini for routine, GPT-5.5/o-series for hard reasoning, Phi for small/edge, Claude/open when they win your eval.
Buy or build the assistant? | M365 Copilot / Copilot Studio first; build on Foundry/Agent Service for bespoke logic or UX.
Agent runtime? | Foundry Agent Service for anything production; connected agents + A2A when the problem decomposes into specialists.
RAG how? | Azure AI Search (hybrid + semantic ranker) by default; "On Your Data" for speed; DB-native vectors for locality.
Standard or PTU? | Standard for variable/low volume; PTU once traffic is steady and you need latency/cost predictability.
Where do vectors live? | AI Search default; Cosmos/SQL/PostgreSQL when beside operational data; Fabric for lakehouse scale.

Pricing & Cost Control

Shape, not exact numbers - rates change and vary by model/region. Confirm on the Azure pricing pages.

Lever | How it bills | Control
Standard (pay-go) | Per input/output token, per model. | Use the router; GPT-5.x Mini for routine; cap output tokens; cache where possible.
Provisioned Throughput (PTU) | Reserved throughput units (hourly + reservations). | For steady high volume needing predictable latency; commit after you know the load.
Priority Processing | Premium for low-latency lanes. | Only for strict real-time chat/agents.
AI Search | Service tier (search units) + storage. | Right-size the tier; prune stale docs; tune replicas/partitions to load.
Agents | Model tokens x steps + tool/runtime. | Cap loop length; route routine steps to Mini; budget per conversation.
The agent cost trap
Agent loops multiply token cost by steps, and connected agents multiply again. A multi-agent flow on a flagship model is the most common surprise bill. Budget per-conversation, log token usage, and let the router + Mini handle routine steps.
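
A rough estimator shows how quickly the multiplication adds up. The model below is a deliberate simplification with invented numbers: every step is assumed to use the same token count at one blended price, each delegation to a connected agent runs a fixed number of sub-steps, and tool/runtime charges are ignored. `conversation_cost` and its parameters are hypothetical names.

```python
def conversation_cost(steps, tokens_per_step, price_per_m_tokens,
                      delegations=0, sub_steps=0):
    """Estimate the model-token cost of one conversation.

    steps        - orchestrator steps in the conversation
    delegations  - calls out to connected agents
    sub_steps    - steps each connected agent runs per delegation
    """
    own_tokens = steps * tokens_per_step
    delegated_tokens = delegations * sub_steps * tokens_per_step
    return (own_tokens + delegated_tokens) / 1_000_000 * price_per_m_tokens


# Hypothetical flagship price: 10.00 per million tokens, 4,000 tokens per step.
single_call = conversation_cost(1, 4_000, 10.0)
agent_loop = conversation_cost(8, 4_000, 10.0)
multi_agent = conversation_cost(8, 4_000, 10.0, delegations=3, sub_steps=6)

for label, cost in [("single call", single_call),
                    ("8-step agent", agent_loop),
                    ("multi-agent", multi_agent)]:
    print(f"{label}: {cost:.2f} per conversation")
```

Under these assumptions the multi-agent flow costs roughly 26 times the single call, which is why per-conversation budgets and routing routine steps to a cheaper model matter more than shaving the prompt.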

Risks & Gotchas

Read this one. What actually bites teams in production.

Naming & API churn
Azure AI Foundry to Microsoft Foundry, Azure OpenAI into Foundry Models, SDK 1.x vs 2.0, classic AI Studio. Names, roles, and endpoints are mid-migration - pin SDK versions, watch deprecations, re-verify RBAC after upgrades.
Quota & regional capacity
Newest models and PTU have tiered quotas and per-region capacity limits. The model you demo with may have no headroom in your region at launch - request increases early and design for backoff.
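
"Design for backoff" usually means retrying throttled calls with exponentially growing, jittered delays. The sketch below is a generic pattern, not SDK behavior: `QuotaExceeded` stands in for whatever throttling error your client raises (for example an HTTP 429), and the delay values are illustrative. The `sleep` and `rng` parameters exist so the function can be exercised without real waiting.

```python
import random
import time


class QuotaExceeded(Exception):
    """Stand-in for a throttling error from the model endpoint."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying on QuotaExceeded with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the capped delay so
            # many clients do not retry in lockstep.
            sleep(delay * rng())


# Demo: a call that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise QuotaExceeded()
    return "ok"


waits = []
result = call_with_backoff(flaky_call, sleep=waits.append, rng=lambda: 1.0)
print(result, waits)
```

Backoff smooths short throttling spikes. It does not create capacity, so it complements an early quota request rather than replacing one.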
Multi-agent blast radius
Connected agents + A2A + broad tool access cause cost blowouts and unintended actions. Enforce per-agent Entra identity, least-privilege tools, budgets, Agent 365 governance, and Observability from day one.
Data residency / data zone
Data-zone and regional deployments differ by model. Confirm the exact model is available in your required region/data-zone before committing - and verify logging settings for sensitive data.
OpenAI dependency
The frontier tier leans on the OpenAI relationship. Keep prompts and evals portable so you can shift to Phi/open/Anthropic models if commercial or availability terms change.
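
Portability mostly comes down to keeping the eval harness ignorant of the provider. In the sketch below every model is just a function from prompt to text, so the same cases can score an OpenAI, Phi, open, or Anthropic deployment once you write a thin adapter for each. The stub models, the substring-match scoring, and the names `run_eval` and `compare` are illustrative assumptions, not a Foundry evaluations API.

```python
from typing import Callable, Dict, List, Tuple

ChatFn = Callable[[str], str]   # prompt in, completion out
Case = Tuple[str, str]          # (prompt, substring expected in the answer)


def run_eval(model: ChatFn, cases: List[Case]) -> float:
    """Fraction of cases whose expected substring appears in the answer."""
    hits = sum(1 for prompt, expected in cases
               if expected.lower() in model(prompt).lower())
    return hits / len(cases)


def compare(models: Dict[str, ChatFn], cases: List[Case]) -> Dict[str, float]:
    """Score every model on the same cases so results are comparable."""
    return {name: run_eval(fn, cases) for name, fn in models.items()}


# Stub adapters; in practice each would wrap one provider's client.
def model_a(prompt: str) -> str:
    return "Paris" if "France" in prompt else "I am not sure"


def model_b(prompt: str) -> str:
    answers = {"France": "Paris", "Japan": "Tokyo"}
    return next((v for k, v in answers.items() if k in prompt), "unknown")


CASES = [("Capital of France?", "paris"), ("Capital of Japan?", "tokyo")]
scores = compare({"model_a": model_a, "model_b": model_b}, CASES)
print(scores)
```

If the eval cases and prompts live outside any one SDK, switching the frontier tier becomes a matter of writing one new adapter and rerunning the comparison.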
Estate fit
Azure AI shines when Entra, M365, and Fabric are central. If your data lives in AWS/GCP, weigh egress and identity friction before committing.