Azure AI, the practical way

An architecture-first reference for the Microsoft Azure AI stack as of June 2026. The platform formerly called Azure AI Foundry is now Microsoft Foundry - one surface for models, agents, evaluation, and governance. This portal covers Foundry, the model catalog, Foundry Agent Service, Copilot, and the silicon - trade-offs and risks, no marketing.

Refreshed June 2026 · Architecture-first · Enterprise focus · Vendor-neutral

Naming, 2026

Azure AI Foundry is being renamed Microsoft Foundry (docs, blogs, and SDKs are mid-transition; many URLs still say ai-foundry / foundry). Azure OpenAI models now live inside Foundry as part of Foundry Models. Agent 365 is the Microsoft-365-side governance layer for agents. Same lineage, new packaging.

TL;DR

Azure's 2026 AI story has three pillars. Microsoft Foundry is the unified build platform: Foundry Models (OpenAI GPT-5.x, o-series, plus Llama, Mistral, DeepSeek, Phi, Nemotron, and partner catalogs), Foundry Agent Service (GA - Responses-API runtime, MCP + A2A, connected multi-agent), a model router, evaluations, and observability. Copilot is the distribution engine - Microsoft 365 Copilot, Copilot Studio, GitHub/Security Copilot, governed by Agent 365. Underneath sit Azure's data services (AI Search, Cosmos/SQL/PostgreSQL vectors, Fabric) and custom silicon (Maia, Cobalt) alongside NVIDIA GPUs. If you are a Microsoft shop with OpenAI ambitions and M365 reach, this stack is the default - the cost is keeping up with the fastest-moving naming in the industry.

The Azure AI mental model

Figure 1 - The Azure AI stack. Most teams enter at Layer 3 (Copilot) or Layer 2 (Foundry); drop to Layer 1 for data gravity and silicon.

What sets Azure apart in 2026

Differentiator	What it means in practice
First access to OpenAI frontier	GPT-5.x (5.4, 5.4 Mini, 5.5) and o-series land on Foundry with enterprise SLAs, private networking, and quota tiers - often the cleanest enterprise path to OpenAI's newest models.
Open standards in the runtime	Foundry Agent Service speaks MCP, A2A, and OpenAPI natively - connected agents, tool reuse, and cross-vendor interop without protocol lock-in.
M365 distribution	One-click publish from Foundry to Microsoft 365 Copilot and Teams; Agent 365 gives a single registry and guardrails across every agent in the tenant.
Model breadth + router	OpenAI, Anthropic, Meta, Mistral, DeepSeek, Microsoft Phi, NVIDIA Nemotron, Fireworks-hosted open models - and a model router that auto-picks the cheapest model that clears your quality bar.
Entra identity + governance	Identity, Content Safety, Foundry Control Plane, Observability (tracing, evals, continuous red-teaming) are first-class, not bolt-ons.

Where Azure is weaker (be honest)

Naming velocity & sprawl

Azure AI Foundry became Microsoft Foundry, Azure OpenAI was folded into Foundry Models, and Copilot, Agent 365, Copilot Studio, and Agent Framework overlap - the surface changes faster than the docs can keep up. Budget real time to map current names to what you already deployed.

OpenAI dependency

The frontier story leans heavily on the OpenAI relationship. Microsoft's own frontier models (beyond Phi SLMs) are still emerging; if that relationship shifts, the top of the catalog is a partner's, not Microsoft's.

How to read this portal

Each service tab follows the same shape: Overview, Architecture, Capabilities, Pricing, Risks, When to use. If you only read one sub-tab, read Risks & gotchas. The others tell you what something does; Risks tells you what bites you in production.

What's New - Ignite 2025 through June 2026

Material changes that affect architecture, cost, or risk. Curated, not a press-release dump.

TL;DR

Three threads dominate. One: the platform rebrand - Azure AI Foundry to Microsoft Foundry, with Foundry Agent Service reaching GA (Responses-API runtime, private networking, MCP OAuth passthrough) and Observability GA. Two: the GPT-5 cadence - 5.4, 5.4 Mini, then 5.5 - plus a model router, Priority Processing, Phi-4 vision/reasoning, GPT-image-2, and Fireworks-hosted open models (DeepSeek V3.2, gpt-oss, Kimi, MiniMax). Three: agent governance across the tenant - Agent 365, the Agent registry, and Foundry Control Plane.

Date	Release	Why it matters
Nov 2025	Ignite 2025: Foundry Agent Service, Foundry Control Plane (preview), Agent 365	Production agent runtime + one-click publish to M365/Teams; a single place to govern any agent (Foundry, Copilot Studio, third-party) with guardrails on inputs/outputs/tool calls.
Nov 2025	Observability GA; Microsoft Agent Framework (AutoGen + Semantic Kernel lineage)	Evals + OpenTelemetry tracing + continuous red-teaming + Azure Monitor; a code-first agent framework supporting AG-UI and ChatKit front-ends.
Dec 2025	Azure AI Foundry to Microsoft Foundry rebrand underway	One platform brand; Azure OpenAI becomes part of Foundry Models. Watch SDK/role/URL changes.
Jan 2026	Model router; SDK 2.0 GA	Auto-select the optimal model per prompt to cut cost while holding quality; stable SDK surface.
Mar 2026	GPT-5.4 (GA), GPT-5.4 Mini, Phi-4 Vision, Priority Processing, new evaluations	5.4 targets agent reliability (task-drift, mid-workflow failures, tool-call consistency); Mini for cheap classify/extract; Priority Processing reserves low-latency compute lanes.
Mar 2026	Foundry Agent Service GA runtime: Responses API, end-to-end private networking, MCP OAuth passthrough	Production-ready agent hosting with private networking and standardized tool auth. Migrate 2025 agent pilots here.
Apr-May 2026	GPT-5.5; GPT-image-2 (4K); Fireworks AI open models; Nemotron first-class	Latest frontier OpenAI tier; high-res image gen; DeepSeek V3.2 / gpt-oss-120b / Kimi K2.5 / MiniMax M2.5 hosted; broader open-model choice in one catalog.
2026	MCP + A2A first-class in Agent Service; connected (multi-)agents	Agents call agents as tools and interoperate across vendors via open standards - real multi-agent systems, with the governance burden that implies.

Practical read

If you built on Azure OpenAI + a hand-rolled orchestrator in 2025, plan a migration to Foundry Agent Service (managed runtime, MCP/A2A, Observability) and adopt the model router to control spend. Re-check IAM/role names and SDK packages against current Microsoft Foundry docs.

Service Map

The Azure AI services worth knowing, grouped by what you do with them.

PLATFORM: Microsoft Foundry

Formerly Azure AI Foundry. Models, agents, router, evaluations, observability, content safety - one build platform.

MODELS: Foundry Models

OpenAI GPT-5.x & o-series, plus Llama, Mistral, DeepSeek, Phi-4, Nemotron, Fireworks-hosted open models.

AGENTS: Foundry Agent Service

GA. Responses-API runtime, MCP + A2A, connected multi-agent, private networking, observability.

ASSIST: Copilot family

Microsoft 365 Copilot, Copilot Studio, GitHub/Security/Azure Copilot - governed by Agent 365.

APPLIED: Azure AI Services

Vision, Document Intelligence, Language, Speech, Translator, Content Understanding.

DATA: AI Search & vectors

Azure AI Search (vector + hybrid + semantic ranker), Cosmos/SQL/PostgreSQL vectors, Microsoft Fabric.

SILICON: Maia & Cobalt

Maia AI accelerators, Cobalt ARM CPUs, ND-series NVIDIA GPUs (GB200), AI infrastructure.

GOVERN: Content Safety + Agent 365

Content filters, Control Plane guardrails, Agent registry, evaluations, continuous red-teaming.

GROUND: RAG & grounding

Azure AI Search retrieval, "On Your Data", Bing grounding, Fabric data agents.

How to read this

The flagship services (Microsoft Foundry, Foundry Agent Service) carry the full Overview / Architecture / Capabilities / Pricing / Risks / When-to-use sub-tabs with reference-architecture diagrams. Secondary services use a single rich page with the same architecture-first, risk-honest treatment. Start at the service you care about; read its Risks before its Overview if you're scoping production.

Microsoft Foundry (was Azure AI Foundry)

The unified platform to choose models, build and govern agents, evaluate, and ship - the center of gravity for AI on Azure.

Official documentation ↗

Sub-tabs: Overview · Architecture · Capabilities · Pricing model · Risks & gotchas · When to use

TL;DR

Foundry is the single pane for the whole AI lifecycle on Azure: a 1000+ model catalog (Foundry Models) with a router, a managed Agent Service, prompt flow, fine-tuning/distillation, evaluations, content safety, and observability - all under Entra identity, private networking, and Azure billing. You organize work in projects inside a Foundry resource/hub, and promote from experiment to production without leaving the platform.

What problem this solves

Enterprises don't want to wire together a model API, a vector store, a guardrail service, an eval harness, an agent orchestrator, and a monitoring stack from separate vendors - each with its own identity and billing. Foundry's offer is one governed surface where you swap models without rewriting the app, apply the same Content Safety policy across every model, and trace/evaluate agents in production. The trade-off is breadth: the platform is large and renaming fast, so onboarding has a real learning curve.

The building blocks

Concept	What it is
Foundry resource / hub	The top-level Azure resource that holds shared config, connections, and security boundaries.
Project	A workspace for a use case - models, data connections, agents, evaluations, and deployments scoped together.
Foundry Models	The model catalog: OpenAI, Microsoft, and partner/open models, sold directly by Azure or via the marketplace.
Foundry Agent Service	The managed runtime for production agents (see its own tab).
Evaluations & Observability	Quality/safety evaluation, OpenTelemetry tracing, continuous red-teaming, Azure Monitor.

Rule of thumb

Start every Azure GenAI workload as a Foundry project. Only drop to raw model endpoints or a custom stack when you have a specific need Foundry doesn't cover - and you will need fewer of those than you think.

Reference architecture

Figure - Microsoft Foundry reference shape. Private endpoints + Entra keep traffic off the public internet; Content Safety sits in-line on every model call.

Network and identity

Foundry projects support private endpoints (Private Link) so model and agent traffic never traverses the public internet. Authentication is Entra ID; apps use managed identities and RBAC scoped to the Foundry resource, project, and deployment. Secrets and keys belong in Key Vault, and you can enforce customer-managed keys (CMK) for data at rest. For regulated workloads, combine private networking, CMK, no-public-egress NSG rules, and Defender for AI monitoring.

Where the data goes

Microsoft's stated position is that prompts and completions in Azure OpenAI / Foundry Models are not used to train the foundation models, and data stays within your Azure tenant and chosen region/data-zone. You control whether request/response logging is enabled. For data residency, use region- or data-zone-pinned deployments and confirm the specific model's availability there before designing around it.

Capability matrix (June 2026)

Capability	Status	Notes
Model catalog + router	●	1000+ models; router auto-selects the cheapest model that clears your quality bar.
Foundry Agent Service	●	GA - Responses API runtime, MCP + A2A, connected multi-agent.
Evaluations	●	Automated + LLM-judge quality/safety evals, including agent evals.
Observability	●	GA - OpenTelemetry tracing, continuous red-teaming, Azure Monitor.
Content Safety	●	In-line filters: hate/sexual/violence/self-harm, jailbreak/prompt-shield, groundedness, protected material.
Fine-tuning / distillation	●	Supervised fine-tuning, distillation; reinforcement methods on select models.
Prompt Flow	●	Author, test, and deploy prompt/orchestration flows.
Private networking	●	Private Link / VNet integration end to end.
Provisioned Throughput (PTU)	●	Reserved capacity for predictable latency/cost at volume.
Priority Processing	◐	Preview - dedicated low-latency compute lanes for real-time agents/chat.

How Foundry bills

Mode	How you pay	Best for
Standard (pay-as-you-go)	Per input/output token, per model.	Prototyping, variable/low volume, model comparison.
Provisioned Throughput (PTU)	Reserved throughput units (hourly/monthly/annual reservations).	Steady high volume needing predictable latency and cost.
Priority Processing	Premium for reserved low-latency lanes.	Customer-facing real-time chat / agents with strict latency.
Fine-tuning	Training tokens + hosting of the tuned deployment.	Narrow tasks where a tuned small model beats prompting a large one.
Agent Service / tools	Underlying model tokens × steps + tool/runtime charges.	Production agents - watch step count.

Rule of thumb

Cross from Standard to PTU when sustained traffic makes reserved units cheaper than pay-go and you need latency guarantees. Use the model router and GPT-5.x Mini to keep the per-call cost down before reserving capacity.
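
The crossover is plain arithmetic once you know your traffic. A minimal sketch, assuming monthly token volumes you have measured and prices you have looked up yourself; every number you pass in is yours, and nothing here is an Azure rate or an Azure API:

```python
def monthly_paygo_cost(input_tokens: float, output_tokens: float,
                       price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Pay-as-you-go cost for one month of traffic, priced per 1M tokens."""
    return (input_tokens / 1e6) * price_in_per_1m + (output_tokens / 1e6) * price_out_per_1m


def monthly_ptu_cost(ptu_count: int, price_per_ptu_hour: float, hours: float = 730.0) -> float:
    """Reserved-capacity cost: you pay for the units whether or not they are busy."""
    return ptu_count * price_per_ptu_hour * hours


def cheaper_mode(paygo_cost: float, ptu_cost: float) -> str:
    """Return which billing mode is cheaper on cost alone (ties go to Standard)."""
    return "PTU" if ptu_cost < paygo_cost else "Standard"
```

Cost is only half of the decision: the sketch ignores the latency guarantee, and it assumes the PTU count you plug in is actually enough to serve the load. Size the units from a capacity estimate before comparing.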

Risks & gotchas

Naming & API churn

Foundry, Azure OpenAI, SDK 1.x vs 2.0, classic AI Studio - names, roles, and endpoints are mid-migration. Pin SDK versions, watch deprecations, and confirm role assignments after upgrades.

Quota & capacity

Newest models (and PTU) have tiered quotas and regional capacity limits. Request increases early; the model you demo with may not have headroom in your region at launch.

Agent cost & actions

Agent loops multiply token cost by steps and can take real actions. Cap loop length, scope tool permissions via Entra, and require approval on high-impact tools.
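
The approval requirement can be enforced in your own tool-dispatch layer regardless of what the runtime offers. A minimal sketch, assuming a hypothetical `ToolGate` wrapper that you write yourself; the class names, the `approver` callback, and the tool names are illustrative and not part of any Foundry SDK:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    high_impact: bool = False  # e.g. sends money, deletes data, emails customers


@dataclass
class ToolGate:
    tools: Dict[str, Tool]                 # the allowlist: anything absent is denied
    approver: Callable[[str, dict], bool]  # human-in-the-loop hook (hypothetical)
    audit_log: List[str] = field(default_factory=list)

    def call(self, name: str, args: dict) -> str:
        tool = self.tools.get(name)
        if tool is None:
            self.audit_log.append(f"denied:{name}:not-allowlisted")
            raise PermissionError(f"tool {name!r} is not allowlisted")
        if tool.high_impact and not self.approver(name, args):
            self.audit_log.append(f"denied:{name}:not-approved")
            raise PermissionError(f"tool {name!r} requires approval")
        self.audit_log.append(f"allowed:{name}")
        return tool.run(args)
```

The design choice is deny-by-default: a tool the agent invents or a high-impact tool without approval fails closed, and every decision leaves an audit line you can ship to your tracing backend.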

Data zone / residency

Data-zone and regional deployments differ by model. Confirm the exact model is available in your required region/data-zone before committing an architecture.

When to use

Use Foundry when you are on Azure and want one governed surface for models, agents, evals, and safety - which is almost every Azure GenAI workload.
Lead with the model router + GPT-5.x Mini for cost; reserve PTU once volume is steady.
Go straight to Agent Service for anything heading to production rather than hand-rolling an orchestrator.
Drop to raw endpoints / custom stack only for a specific capability Foundry doesn't cover.

Foundry Models

The model catalog behind Foundry - OpenAI frontier, Microsoft Phi, and a broad partner/open selection, with a router to pick between them.

Official documentation ↗

Family	Examples (June 2026)	Use
OpenAI flagship	GPT-5.5, GPT-5.4, GPT-5.4 Mini, o-series	Hardest reasoning, agents, coding; Mini for cheap classify/extract/tool-calls.
OpenAI media	GPT-image-2 (4K), TTS / Realtime	Image generation/editing and voice.
Microsoft (own)	Phi-4, Phi-4 Vision, Phi-4 Reasoning Vision 15B	Small, efficient multimodal/reasoning models; on-prem/edge via Foundry Local.
Open / partner	Llama, Mistral, DeepSeek V3.2, NVIDIA Nemotron, Anthropic Claude	Open-weight customization, cost, or specific-vendor strengths.
Fireworks-hosted	gpt-oss-120b, Kimi K2.5, MiniMax M2.5	High-performance open-model inference without standing up your own serving.

Model router

Rather than hard-coding a model, point at the router: it auto-selects the cheapest model that meets the quality bar for each prompt. Pair with evals so "quality bar" is something you measured, not assumed.

Sold-by-Azure vs marketplace

Some models are sold directly by Azure (first-party billing/SLA); others are partner/marketplace offers with different terms. Check the billing and data-handling terms per model before production.

Foundry Agent Service (GA)

The managed runtime for production agents on Azure - Responses-API based, with MCP, A2A, connected multi-agent, private networking, and observability.

Official documentation ↗

Sub-tabs: Overview · Architecture · Tools & protocols · Risks & gotchas · When to use

TL;DR

Agent Service turns a model + instructions + tools + knowledge into a managed, stateful agent you don't have to host. The 2026 GA runtime is built on the Responses API with threads/state, end-to-end private networking, and standardized tool auth (MCP OAuth passthrough). It speaks MCP, A2A, and OpenAPI, and supports connected agents - agents calling other agents as tools - so you can compose specialists instead of building one monolith.

What problem this solves

Hand-built agent loops are easy to prototype and hard to operate: state, retries, tool auth, networking, tracing, and safety all become your problem. Agent Service makes those managed concerns and standardizes the integration surface (MCP/A2A/OpenAPI) so tools and other agents plug in without bespoke glue. You publish to Microsoft 365 Copilot and Teams in one click and govern everything through Agent 365 and the Foundry Control Plane.

Migrate 2025 pilots

If you built agents on Azure OpenAI + a custom orchestrator, the GA runtime replaces most of that scaffolding. Managed state, private networking, MCP auth, and observability are reason enough to move.

Reference architecture

Figure - Foundry Agent Service. The agent runtime brokers model calls (via the router), governed tools (MCP/OpenAPI), connected specialist agents (A2A), and knowledge - all under Entra and Content Safety.

Tools & protocols

Surface	What it gives you
MCP (Model Context Protocol)	Connect external MCP servers as governed tools, with OAuth passthrough for delegated auth.
A2A (Agent-to-Agent)	Call other agents - your own or third-party - as interoperable endpoints.
OpenAPI tools	Wrap any REST API as a tool from its spec.
Connected agents	Compose specialist agents; an orchestrator delegates subtasks.
Hosted tools	Bing grounding, file search, code interpreter, browser, Logic Apps / Functions.
Knowledge	Azure AI Search, "On Your Data", Cosmos/SQL/PostgreSQL vectors, Fabric data agents.

Why open standards matter

MCP + A2A + OpenAPI mean your tools and agents aren't trapped in one vendor's format. The same MCP tool can serve a Foundry agent today and a different runtime tomorrow.
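
Portability comes from the wire format: MCP tool invocations are JSON-RPC 2.0 messages using the `tools/call` method, so any runtime that speaks the protocol can issue them. A minimal sketch of building a request and reading a text result; the tool name and arguments are made up, and transport, session setup, and OAuth are deliberately left out:

```python
import json
from typing import Any, Dict, List


def build_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def extract_text(response_json: str) -> List[str]:
    """Return the text parts of a tools/call result, raising on either error style."""
    msg = json.loads(response_json)
    if "error" in msg:  # protocol-level JSON-RPC error
        raise RuntimeError(msg["error"].get("message", "unknown error"))
    result = msg["result"]
    if result.get("isError"):  # the tool itself reported a failure
        raise RuntimeError("tool reported an error")
    return [p["text"] for p in result.get("content", []) if p.get("type") == "text"]
```

In practice you would let an MCP client library or the agent runtime produce these messages; the point of the sketch is that nothing in the payload is specific to one vendor.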

Risks & gotchas

Multi-agent blast radius

Connected agents + A2A make systems powerful and hard to reason about. Use Agent 365 registry, per-agent Entra identities, least-privilege tool scopes, budgets, and Observability traces from day one.

Tool auth & data exfiltration

MCP/OpenAPI tools can read and send data. Vet every tool, prefer OAuth passthrough with scoped consent, and keep agents on private networking with egress controls.

Runaway loops

Cap steps and tokens per conversation; route routine steps to GPT-5.x Mini and reserve the flagship for the hard parts.
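
A cap is only real if something enforces it on every step. A minimal sketch of a per-conversation budget guard plus a tier picker, assuming you control the loop that drives the agent; `ConversationBudget`, `BudgetExceeded`, and the tier labels are hypothetical names, not runtime settings:

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a conversation would go past its step or token cap."""


@dataclass
class ConversationBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Account for one agent step; refuse it if either cap would be exceeded."""
        if self.steps_used + 1 > self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.steps_used += 1
        self.tokens_used += tokens


def pick_tier(step_is_hard: bool) -> str:
    """Route routine steps to the cheap tier, hard steps to the flagship."""
    return "flagship" if step_is_hard else "mini"
```

Call `charge` before dispatching each step with your token estimate, and treat `BudgetExceeded` as a normal outcome: return a partial answer or hand off to a human rather than retrying.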

Publish surface

One-click publish to M365/Teams is convenient and broad - make sure governance (Agent 365, DLP, Content Safety) is set before you expose an agent to the whole tenant.

When to use

Use Agent Service for any agent heading to production - managed state, private networking, MCP/A2A, and observability come with the runtime instead of being yours to build.
Use connected agents / A2A when the problem decomposes into specialists; keep a single monolith only for simple flows.
Pair with the model router and Priority Processing for cost and latency control.
Govern through Agent 365 before publishing to M365/Teams.

Governance & Safety

The controls that make agents and models safe to run in an enterprise tenant.

Official documentation ↗

Control	What it does
Azure AI Content Safety	In-line filters for hate/sexual/violence/self-harm, plus prompt shields (jailbreak), groundedness detection, and protected-material checks - applied to any model.
Foundry Control Plane	Govern every agent in one place (Foundry, Copilot Studio, third-party) with consistent guardrails across inputs, outputs, tool calls, and tool responses.
Agent 365 + Agent registry	Tenant-wide discovery, identity, and management of all agents, with admin guardrails and DLP.
Observability	Evaluations, OpenTelemetry tracing, continuous red-teaming, Azure Monitor insights.
Defender for AI / Purview	Threat protection for AI workloads and data governance/compliance across prompts and outputs.
Entra identity	Managed identities, RBAC, conditional access - the same identity plane as the rest of Azure.

Apply at the platform layer

Put Content Safety and Control Plane guardrails between the app and the model so the same policy holds no matter which model the router picks - don't scatter safety logic across app code.

Governance before scale

Agent 365 and DLP should be configured before you one-click-publish agents to an entire tenant. Retrofitting governance after broad exposure is the hard path.

Azure vs AWS vs OCI vs GCP

A practitioner's quick read. Every cloud does the basics; the differences are in defaults, data gravity, and silicon.

Dimension	Azure	AWS	OCI	GCP
Frontier model	OpenAI GPT-5.x	Nova (mid); Claude hosted	None (partners)	Gemini 3.x
Model breadth (managed)	Foundry Models (1000+)	Bedrock (widest)	Broad (OCI Gen AI)	Model Garden (200+)
Agents	Foundry Agent Service + MCP/A2A	AgentCore	Enterprise AI Agents	Agent Platform + A2A
Custom silicon	Maia (emerging)	Trainium/Inferentia	GPU (NVIDIA)	TPU (Ironwood/8th)
Data gravity	Fabric / OneLake	S3 / Redshift	Oracle DB 26ai (in-DB vectors)	BigQuery
Distribution	Microsoft 365	Console / partners	Oracle apps / EBS	Workspace
Best when	Microsoft shop; want OpenAI frontier + M365 reach	Already on AWS; want model choice + silicon economics	Run Oracle DB/EBS; want in-DB vectors + sovereignty	BigQuery/Workspace central; want Gemini + TPU full stack

Honest take

The cloud your data and identity already live in usually wins - gravity beats a marginally better model. Azure's edge is the OpenAI frontier plus unmatched Microsoft 365 distribution; its tax is the fastest naming churn of the four.

Sources

Primary Microsoft material used for this portal (June 2026). Names and versions are mid-transition - confirm in current docs before designing.

Microsoft Foundry (Azure AI Foundry) · Foundry docs
Foundry Agent Service overview · MCP tools · A2A
What's new in Microsoft Foundry - Mar 2026 (and Apr/May 2026 editions)
GPT-5 in Azure AI Foundry · GPT-5.5 in Microsoft Foundry
Foundry Agent Service at Ignite 2025
Azure AI Content Safety

Accuracy note

Compiled by Brijesh Gogia for expertoracle.com. Independent and not affiliated with Microsoft. Azure's AI naming changed substantially across Ignite 2025 / 2026 - treat this as orientation and confirm in the portal/docs before designing.

Model Router

One endpoint that auto-selects the cheapest model clearing your quality bar - per prompt.

Official documentation ↗

Instead of hard-coding GPT-5.5 (expensive) or GPT-5.4 Mini (cheap) at every call site, you target the router. It classifies each request and dispatches to the model that meets the quality target at the lowest cost and latency, escalating to stronger models only when the prompt needs it. For mixed workloads - where most requests are easy and a few are genuinely hard - it is one of the simplest cost levers on the platform.

Use the router when	Skip it when
Workload mixes easy and hard prompts; you want cost savings without re-engineering call sites.	You need a single fixed model version for reproducibility or a compliance attestation.
You can tolerate a small classification step before dispatch.	Latency budget is so tight the routing hop is unacceptable.

"Quality bar" must be measured

The router only saves money safely if your quality target is something you evaluated. Wire it to Foundry evaluations so routing is grounded in measured task accuracy, not assumptions - otherwise you trade cost for silent quality regressions.
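
To make "measured" concrete, here is the selection rule in miniature: among models whose evaluated score clears the bar for a prompt class, take the cheapest. This is a sketch of the idea only, not how the Foundry router is implemented; the `Candidate` type, the model labels, the costs, and the scores are all hypothetical:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float      # relative cost, made up for illustration
    eval_score: Dict[str, float]   # measured accuracy per prompt class, from your evals


def route(prompt_class: str, candidates: List[Candidate], quality_bar: float) -> Candidate:
    """Pick the cheapest candidate whose measured score clears the bar."""
    eligible = [c for c in candidates
                if c.eval_score.get(prompt_class, 0.0) >= quality_bar]
    if not eligible:
        # Nothing clears the bar: fall back to the best-scoring model, not the cheapest.
        return max(candidates, key=lambda c: c.eval_score.get(prompt_class, 0.0))
    return min(eligible, key=lambda c: c.cost_per_1k_tokens)
```

Note what happens to a prompt class with no eval data: it scores 0.0 for every model, so nothing is eligible and the fallback chooses between equal scores. That is the silent-regression case in code form; an unevaluated class should be treated as a gap to fix, not a routing decision.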

Pairs with

Priority Processing (low-latency lanes) and Provisioned Throughput (reserved capacity) for the high-volume tier the router sends most traffic to.

Knowledge & RAG - Azure AI Search

Keep answers grounded in your data. Azure AI Search is the default retrieval engine; several databases can also serve vectors.

Official documentation ↗

Figure - Azure RAG. AI Search holds the hybrid (vector + keyword) index with a semantic ranker; Content Safety checks groundedness on the way out.

Retrieval option	Best for
Azure AI Search	The default - vector + keyword + semantic ranking, integrated vectorization, security trimming. Most RAG starts here.
"On Your Data"	Fastest path - point a model at a data source and get grounded answers with minimal code.
Cosmos DB / Azure SQL / PostgreSQL vectors	When vectors must live beside operational data with transactional consistency.
Microsoft Fabric data agents	Grounding over the lakehouse / OneLake for analytics-centric estates.

Retrieval is the failure point

Most "the model hallucinated" incidents are actually retrieval misses. Use hybrid search + the semantic ranker, enable groundedness detection in Content Safety, and evaluate retrieval quality before blaming the LLM.
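
Evaluating retrieval can start very small: a labeled set of queries with the document IDs that should come back, and recall@k over what the index actually returns. A minimal sketch, assuming you have such a gold set; the function names and data shapes are illustrative:

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every query in the gold set."""
    scores = [recall_at_k(runs[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)
```

If recall@k is low, the right passage never reached the model and no prompt change will fix the answer. Track the number before and after each change to chunking, hybrid weighting, or the semantic ranker.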

Copilot & Agent 365

Microsoft's distribution layer - buy the assistant inside the tools people already use, and govern every agent in the tenant.

Official documentation ↗

Product	What it is
Microsoft 365 Copilot	The assistant embedded in Word, Excel, Outlook, Teams - grounded in your Graph (mail, files, chats) with user permissions.
Copilot Studio	Low-code builder for custom agents/topics; uses OpenAI and Anthropic models; publishes to Teams, web, and M365 Copilot.
GitHub Copilot	Coding agent across the SDLC - completion, chat, agent mode, code review.
Security Copilot	SOC assistant for triage, hunting, and incident summarization across Defender/Sentinel.
Azure Copilot	Operations assistant for managing and troubleshooting Azure resources.
Agent 365	Tenant-wide governance: the Agent registry discovers and manages every agent (Copilot Studio, Agent Builder, SharePoint, M365 Agent SDK, Foundry, third-party), with identity, DLP, and admin guardrails.

Build vs buy

For internal knowledge or productivity assistants, pilot Microsoft 365 Copilot or a Copilot Studio agent before building custom on Foundry - the Graph grounding and permission inheritance save real work. Build on Foundry/Agent Service when you need bespoke logic, custom UX, or tools Copilot Studio can't express.

Govern before you scale

One-click publish to an entire tenant is powerful. Configure Agent 365 registry, DLP, and Content Safety before broad exposure - retrofitting governance after the fact is the hard path.

Applied AI Services

Task-specific managed APIs in Azure AI Services - call them, no model selection required.

Official documentation ↗

Service	Task
Azure AI Vision	Image analysis, OCR, spatial analysis, image captioning/tags.
Document Intelligence	Extract text, tables, key-value pairs, and structure from documents (the former Form Recognizer).
Azure AI Language	Entity recognition, sentiment, PII detection, summarization, custom classification, question answering.
Azure AI Speech	Speech-to-text, text-to-speech (incl. custom/neural voices), translation, diarization.
Translator	Neural machine translation across many languages, document translation.
Content Understanding	Multimodal extraction across documents, images, audio, and video into structured output.

Trend to watch

Many classic tasks (extraction, classification, summarization) are increasingly done with a Foundry multimodal model or Content Understanding. Reach for the dedicated applied service when it is cheaper, lower-latency, or compliance-certified for that exact task; reach for Foundry when you need flexibility.

Data & Vectors

Where embeddings and ground-truth live. Pick by where your data already is.

Official documentation ↗

Store	Best for
Azure AI Search (vector)	The default RAG index - hybrid (vector + keyword) search with a semantic ranker and security trimming.
Azure Cosmos DB (vector)	Vectors beside globally-distributed operational/app data, low latency, NoSQL.
Azure SQL / SQL DB (vector)	Vectors next to relational data with transactional consistency.
Azure Database for PostgreSQL (pgvector + DiskANN)	Open-source vector path beside Postgres data; DiskANN for scale.
Microsoft Fabric / OneLake	Lakehouse-scale data and Fabric data agents for analytics-centric grounding.

Default

Most teams start with Azure AI Search for RAG. Move vectors into Cosmos/SQL/PostgreSQL when they must sit beside operational rows; use Fabric when the corpus is the warehouse.
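
Whichever store you pick, hybrid search has to merge a keyword ranking and a vector ranking into one list. Reciprocal Rank Fusion (RRF) is a common way to do that, and Azure AI Search documents it as its hybrid scoring method. The sketch below is a generic illustration of the technique, not the service's code; the constant `k = 60` is the value commonly used in the literature and is an assumption here:

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in;
    documents ranked well by more than one retriever rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF needs only ranks, not comparable scores, which is why it works across a BM25-style keyword ranker and a cosine-similarity vector ranker. If you move vectors into Cosmos DB, SQL, or PostgreSQL and still want hybrid behavior, this fusion step is something you may have to own yourself.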

Maia & Silicon

The compute under the stack - Microsoft's custom accelerators alongside NVIDIA GPUs.

Official documentation ↗

Silicon	Role
Maia AI accelerator	Microsoft's in-house AI chip for training/inference economics on first-party and hosted workloads; the emerging cost lever versus pure-GPU.
Cobalt (Arm CPU)	Microsoft's Arm-based general-purpose CPU - efficient serving and supporting workloads around AI.
ND-series GPU VMs (NVIDIA, incl. GB200)	Top-end GPU training/inference with full CUDA/framework compatibility.
Azure AI infrastructure / Maia clusters	Network-dense accelerator fabrics for large-scale training; reserved capacity options.

Architect's lever

Most teams consume models via Foundry and never touch raw silicon. When you do run dedicated capacity at volume, benchmark Maia and ND-series GPUs on your workload - the price/perf gap can dominate TCO. Keep GPUs where a specific CUDA/framework path is required.

Architecture Patterns

The shapes most Azure GenAI workloads fall into.

1. Grounded enterprise assistant

Foundry model + Azure AI Search (hybrid) + Content Safety, fronted by an app or Copilot Studio. The default knowledge assistant.

2. Production agent

Foundry Agent Service + model router + MCP/OpenAPI tools + connected agents (A2A), governed by Agent 365 and Observability. Human-in-the-loop on high-impact actions.

3. M365-embedded

Microsoft 365 Copilot or a Copilot Studio agent over Graph data - buy the assistant in the tools people already use.

4. Multimodal pipeline

Content Understanding or a Foundry multimodal model extracts from docs/images/audio/video into structured output feeding Search or a warehouse.

5. Custom / open model service

Fine-tune or host an open model (Phi, Llama) via Foundry; distill to cut run-cost once quality is proven.

6. Cost-routed high volume

Model router + GPT-5.x Mini for routine traffic, PTU for the steady tier, flagship models only on hard prompts.

Decision Matrix

Fast answers for design reviews.

Question	Default answer
Which model?	Router by default; GPT-5.x Mini for routine, GPT-5.5/o-series for hard reasoning, Phi for small/edge, Claude/open when they win your eval.
Buy or build the assistant?	M365 Copilot / Copilot Studio first; build on Foundry/Agent Service for bespoke logic or UX.
Agent runtime?	Foundry Agent Service for anything production; connected agents + A2A when the problem decomposes into specialists.
RAG how?	Azure AI Search (hybrid + semantic ranker) by default; "On Your Data" for speed; DB-native vectors for locality.
Standard or PTU?	Standard for variable/low volume; PTU once traffic is steady and you need latency/cost predictability.
Where do vectors live?	AI Search default; Cosmos/SQL/PostgreSQL when beside operational data; Fabric for lakehouse scale.

Pricing & Cost Control

Shape, not exact numbers - rates change and vary by model/region. Confirm on the Azure pricing pages.

Lever	How it bills	Control
Standard (pay-go)	Per input/output token, per model.	Use the router; GPT-5.x Mini for routine; cap output tokens; cache where possible.
Provisioned Throughput (PTU)	Reserved throughput units (hourly + reservations).	For steady high volume needing predictable latency; commit after you know the load.
Priority Processing	Premium for low-latency lanes.	Only for strict real-time chat/agents.
AI Search	Service tier (search units) + storage.	Right-size the tier; prune stale docs; tune replicas/partitions to load.
Agents	Model tokens × steps + tool/runtime.	Cap loop length; route routine steps to Mini; budget per conversation.

The agent cost trap

Agent loops multiply token cost by steps, and connected agents multiply again. A multi-agent flow on a flagship model is the most common surprise bill. Budget per-conversation, log token usage, and let the router + Mini handle routine steps.
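
The multiplication is worth writing down before the bill does it for you. A rough estimator, assuming a flat token count per step and a single blended price; both are simplifications, tool and runtime charges are excluded, and all parameter names are illustrative:

```python
def estimate_conversation_cost(orchestrator_steps: int,
                               tokens_per_step: int,
                               price_per_1m_tokens: float,
                               delegations: int = 0,
                               steps_per_delegation: int = 0) -> float:
    """Estimate model-token cost for one conversation.

    Every delegation to a connected agent runs its own loop, so its steps
    are added on top of the orchestrator's.
    """
    total_steps = orchestrator_steps + delegations * steps_per_delegation
    total_tokens = total_steps * tokens_per_step
    return total_tokens / 1_000_000 * price_per_1m_tokens
```

With made-up inputs of 8 orchestrator steps at 4,000 tokens each and a price of 10 per million tokens, the estimate is 0.32 per conversation; add 3 delegations of 6 steps each and it becomes 1.04, more than triple, before any tool charges.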

Risks & Gotchas

Read this one. What actually bites teams in production.

Naming & API churn

Azure AI Foundry to Microsoft Foundry, Azure OpenAI into Foundry Models, SDK 1.x vs 2.0, classic AI Studio. Names, roles, and endpoints are mid-migration - pin SDK versions, watch deprecations, re-verify RBAC after upgrades.

Quota & regional capacity

Newest models and PTU have tiered quotas and per-region capacity limits. The model you demo with may have no headroom in your region at launch - request increases early and design for backoff.
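
Designing for backoff usually means exponential delays with jitter around every model call. A minimal sketch, assuming a hypothetical `ThrottledError` that your client code raises when the endpoint reports throttling or no capacity; the exception name and default delays are illustrative, not SDK values:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling or capacity error from an endpoint."""


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on ThrottledError with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Jitter matters because many clients throttled at the same moment would otherwise retry at the same moment. If the service returns a retry-after hint, prefer it over the computed delay.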

Multi-agent blast radius

Connected agents + A2A + broad tool access cause cost blowouts and unintended actions. Enforce per-agent Entra identity, least-privilege tools, budgets, Agent 365 governance, and Observability from day one.

Data residency / data zone

Data-zone and regional deployments differ by model. Confirm the exact model is available in your required region/data-zone before committing - and verify logging settings for sensitive data.

OpenAI dependency

The frontier tier leans on the OpenAI relationship. Keep prompts and evals portable so you can shift to Phi/open/Anthropic models if commercial or availability terms change.

Estate fit

Azure AI shines when Entra, M365, and Fabric are central. If your data lives in AWS/GCP, weigh egress and identity friction before committing.