Oracle Cloud Infrastructure Deep Dive Portal

A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, and troubleshoot real OCI environments - not a marketing overview.

18 deep sections | Architecture patterns | Troubleshooting runbooks | CLI & Terraform | Decision matrices

Last reviewed: July 2026. Cloud services change - verify with current Oracle documentation before production use.

WHO THIS IS FOR

Oracle Cloud Architects, Apps DBAs, Oracle DBAs, infrastructure engineers, cloud engineers, enterprise architects, and anyone moving from traditional on-premises Oracle environments into OCI. It assumes you already understand servers, storage, networks, and Oracle Database - and focuses on how those ideas map into OCI and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the search box in the top bar to jump directly to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, shape names, service limits, model availability), a Verify with current Oracle documentation flag.

Learn

Foundations first

Sections 1-2 establish the mental model: regions, ADs, fault domains, tenancy, compartments, and the IAM policy language that everything else depends on.

Build

Service deep dives

Sections 3-12 cover networking, compute, storage, database, load balancing, security, observability, containers, analytics, and AI - with diagrams, tables, and gotchas.

Operate

Run and recover

Sections 13-18 cover migration and DR, cost and governance, reference architecture patterns, troubleshooting runbooks, automation, and a structured learning path.

Reading the callouts

Five callout types recur throughout. They flag the perspective that matters most for a given point.

Architect note

Design-time decisions, trade-offs, and things you must decide before production.

DBA note

Database-specific behavior, what Oracle manages vs. what you manage, patching and backup nuances.

Security note

Exposure, least privilege, encryption, and audit considerations.

Cost note

Where money is spent and where it is commonly wasted.

Common mistake

A specific design or operational error teams repeatedly make, and how to avoid it.

The OCI shared responsibility model (orientation)

Everything in this portal sits on one idea: in cloud, responsibility is split, and the split moves depending on the service. Get this wrong and you will either leave gaps (security incidents, unrecoverable data) or do work Oracle already does for you (wasted effort).

Layer	IaaS (Compute + you install DB)	Base Database / VM DB	Exadata Database Service	Autonomous Database
Physical / hypervisor	Oracle	Oracle	Oracle	Oracle
OS patching	You	You (guest VM)	You (guest VM)	Oracle
DB software install/patch	You	Oracle tooling, you trigger	Oracle tooling, you trigger	Oracle
Backup config	You	Managed, you configure	Managed, you configure	Oracle (you set retention)
HA / RAC	You build it	Optional, you choose	Built-in RAC	Built-in
Schema, SQL, tuning	You	You	You	You
Data classification & access	You	You	You	You

The one rule that never moves

Oracle secures the cloud. You secure what you put in the cloud: identities, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.

1. OCI Fundamentals

The physical and logical building blocks of Oracle Cloud Infrastructure, and the tenancy and compartment structure that every enterprise deployment stands or falls on.

Last reviewed: July 2026. Verify region list, service limits, and quotas in the OCI Console.

What OCI is | Global architecture | Regions / AD / FD | Tenancy & compartments | Limits & quotas | OCIDs | Tags | Console / CLI / SDK / Terraform | Enterprise design

TL;DR

OCI is a set of regions, each built from isolated Availability Domains (data centers) that are further split into Fault Domains (racks). Your account is a tenancy; you organize resources into compartments (logical, not network) and control access with IAM policies written against those compartments. Compartment and tenancy structure is the single most important thing to get right before production - it is painful to restructure later.

What OCI is

Oracle Cloud Infrastructure is Oracle's public cloud: on-demand compute, storage, networking, database, and platform services delivered from Oracle-operated data centers, consumed over the network, and billed by usage. Compared to Oracle's first-generation cloud, OCI ("Gen 2") was rebuilt with an off-box network virtualization design - the virtualization and network isolation run on separate hardware from the customer's compute, which is the basis for its bare-metal offerings and its network isolation guarantees.

Practically, OCI gives you four things that matter to an enterprise Oracle shop:

Real bare metal - you can rent an entire physical server with no hypervisor, which matters for licensing and for the highest-performance database workloads.
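Exadata as a cloud service - the same engineered system you may run on-premises, delivered as Exadata Database Service in OCI regions or as Exadata Cloud@Customer in your own data center. (Base Database Service, by contrast, runs on standard VM shapes rather than Exadata.)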
Autonomous Database - a self-managing database platform where Oracle runs patching, tuning, backup, and scaling.
A flat, predictable network - a non-blocking, low-latency backbone with off-instance network virtualization.

OCI global architecture

OCI hierarchy: Realm > Region > Availability Domain > Fault Domain

Regions, Availability Domains, Fault Domains

Concept	What it is	Failure it protects against	What you do with it
Region	A localized geographic area containing one or more Availability Domains. Your data residency boundary.	Regional disaster, large-scale outage	Choose based on latency to users, data residency law, and service availability. Deploy DR to a second region.
Availability Domain (AD)	One or more isolated data centers within a region, with independent power, cooling, and network.	Data-center-level failure	Spread instances / DB nodes across ADs (in multi-AD regions) for HA. Many regions have only 1 AD.
Fault Domain (FD)	A grouping of hardware within an AD (think: a rack). Every AD has exactly 3 FDs.	Rack / hardware / maintenance failure within an AD	Anti-affinity: place HA pairs in different FDs. This is your only in-region HA lever in a single-AD region.
Realm	A hard isolation boundary (commercial OC1, US Gov, UK Gov, dedicated). Identities and tenancies never cross realms.	Compliance / sovereignty isolation	Usually fixed by your contract; matters for regulated workloads.

Architect note - single-AD regions

Many OCI regions have only one Availability Domain. Do not assume multi-AD HA is available everywhere. In a single-AD region, your in-region resilience comes entirely from spreading across the three Fault Domains, and your true disaster resilience comes from a second region. Confirm the AD count for your chosen region before designing HA.
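
The fault-domain spread described above can be sketched as a round-robin assignment. This is an illustration of the placement logic only, not an OCI API call: the function name and the node names are hypothetical, and the FAULT-DOMAIN-n labels assume the naming OCI shows for its three fault domains per AD.

```python
from itertools import cycle

# Every AD exposes three fault domains.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]

def spread_across_fault_domains(nodes):
    """Assign nodes to fault domains round-robin so peers land on different hardware."""
    return dict(zip(nodes, cycle(FAULT_DOMAINS)))

# Hypothetical node names for a four-node application tier.
placement = spread_across_fault_domains(["app-1", "app-2", "app-3", "app-4"])
for node, fault_domain in placement.items():
    print(node, fault_domain)
```

With more than three nodes the fourth wraps back to the first fault domain, so a single rack failure still takes out at most roughly a third of the tier.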

Common mistake

Designing a "multi-AD" active-active database only to discover the target region has a single AD. RAC on Base/Exadata in a single-AD region still gives you node HA across fault domains, but AD-level HA (spreading nodes across ADs) is only possible in multi-AD regions like Ashburn, Phoenix, or Frankfurt.

Home region and subscriptions

When you sign up you pick a home region. IAM resources (users, groups, policies, dynamic groups, compartments, federation, and in the legacy model the tenancy's identity metadata) are mastered in the home region and replicated read-only to subscribed regions. You then subscribe the tenancy to additional regions to deploy workloads there.

You cannot change the home region after it is set. Choose deliberately - it should be a region close to your identity administrators and one you intend to keep long-term.
IAM writes (create a user, edit a policy) always go to the home region and propagate out. A home-region outage can therefore affect identity administration globally even while workloads keep running.
Subscribing to a region is easy; unsubscribing is not - plan region subscriptions rather than turning them on casually.

Tenancy and compartments

Your tenancy is the root container for your entire OCI account - it is itself the root compartment. Everything lives under it.

Compartments are logical containers for resources (compute, VCNs, buckets, databases). They are the primary unit of access control, isolation, quota, and cost tracking. Key properties that trip people up coming from AWS/on-prem:

Compartments are global, not regional. A compartment exists across all subscribed regions; the resources inside it are regional.
Compartments are logical, not a network boundary. Two instances in different compartments can talk over the network if the VCN/subnet/security rules allow it. Isolation is by IAM policy, not by compartment walls.
Compartments can be nested (up to six levels deep). Policies and quotas can be scoped at any level.
Most resources (not all) can be moved between compartments, some with caveats. Deleting a compartment requires that it be empty, and the deletion itself is asynchronous.

A common enterprise compartment topology: shared services, per-environment workloads, and governance

How to structure compartments for an enterprise (Design)

There is no single correct model, but the durable patterns are:

By environment (most common): top-level compartments or a Workloads parent with Prod / Stage / Test / Dev / DR children. Clean blast-radius separation and simple policies ("group X can manage instances in Dev only").
By business unit: a compartment per BU, each with its own environment sub-compartments. Fits organizations that charge back and delegate admin per BU.
Shared services split out: put the network hub, Vault, logging, and security tooling in their own compartment(s) so platform teams own them and workload teams cannot alter them.
Hybrid (recommended for large orgs): Shared-Services + Security + per-BU (each BU has Prod/NonProd) + Sandbox. Landing zone frameworks (see the Cost & Governance section) codify this.

Architect note

Favor a shallow, predictable tree. Deep nesting (5-6 levels) makes policies hard to reason about and console navigation slow. Two to three levels covers almost every enterprise.
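
A quick way to keep a proposed tree honest is to check nesting depth before anything is created. The sketch below assumes compartment paths written in the colon-separated form used in policy locations (for example Prod:AppTier); the function name, the thresholds as constants, and the sample paths are illustrative, not part of any OCI tool.

```python
MAX_DEPTH = 6          # nesting limit stated above
RECOMMENDED_DEPTH = 3  # the "two to three levels" guidance

def depth_report(paths):
    """Classify colon-separated compartment paths (below the root) by nesting depth."""
    report = {}
    for path in paths:
        depth = len(path.split(":"))
        if depth > MAX_DEPTH:
            report[path] = "invalid"   # deeper than OCI allows
        elif depth > RECOMMENDED_DEPTH:
            report[path] = "deep"      # legal, but hard to reason about
        else:
            report[path] = "ok"
    return report

# Hypothetical compartment paths.
proposed = ["Workloads:Prod", "BU-A:Prod:App:Web:Blue", "Shared-Services"]
print(depth_report(proposed))
```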

Separating dev, test, stage, prod, and DR (Design)

Separate compartments per environment give you independent IAM, quotas, budgets, and cost reporting.
Strongly consider separate VCNs (or at least separate subnets with strict NSGs) so a mistake in Dev cannot reach Prod data.
Some regulated shops use separate tenancies for Prod vs. non-Prod for hard isolation and separate billing. This is the strongest separation but adds identity/federation overhead and cross-tenancy networking complexity. Decide based on your risk and compliance posture.
DR lives in a different region. Keep its compartment structure a mirror of Prod so IAM policies and automation translate cleanly.

Designing for multiple business units (Design)

Give each BU a top-level compartment with delegated admin (a BU-admin group with manage rights scoped to that compartment only).
Apply compartment quotas to cap what each BU can consume, and budgets with alerts for spend.
Centralize the network in a Shared-Services compartment and connect BU spoke VCNs via a DRG hub, so BUs cannot each build divergent, unmanaged network topologies.
Use defined tags (cost-center, owner, environment) enforced by tag defaults so chargeback works from day one.

Common mistakes in tenancy / compartment design

Putting everything in the root compartment "to start" - it becomes impossible to apply least privilege later.
Modeling compartments as if they were network boundaries. They are not; network isolation is VCN/subnet/security rules.
Over-nesting. Six-level trees look tidy but make policy debugging miserable.
No naming standard. Inconsistent names (prod vs Production vs PRD) break automation and reporting.
Not reserving compartments for shared services and security up front, so those resources end up scattered in workload compartments.

Limits, quotas, and service limits

Mechanism	Set by	Scope	Purpose
Service limits	Oracle (per tenancy, per region, sometimes per AD)	Tenancy	The maximum of a resource you can create (e.g. number of OCPUs of a shape). Raise via a limit-increase request in the Console.
Compartment quotas	You (policy-like statements)	Compartment	Cap/allow/deny resource creation per compartment. Your governance lever - e.g. "no bare metal in Dev".
Budgets	You	Compartment / tag	Track and alert on spend. Do not block creation - they notify.

Quota statements look like policy but control resource counts. Example that blocks expensive shapes in a Dev compartment:

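# Quota policy scoped to the Dev compartment (Governance > Quota Policies)
# Form: set|zero <family> quota <quota-name> [to <n>] in compartment <name>
# Quota names below are examples - confirm the current names per service
zero compute-core quota standard-e5-core-count in compartment Dev
set compute-core quota standard-e4-core-count to 64 in compartment Dev
# Block all Database service resources (bare metal included) in this compartment
zero database quotas in compartment Dev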

Cost note

Compartment quotas are the cheapest guardrail you have. A "zero bare-metal" and "cap OCPUs" quota in Dev/Test prevents a single mis-click from provisioning a five-figure monthly resource. Set them before you hand a compartment to a team.
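
If guardrails are handed out per team, it can help to generate the quota statements from a small table rather than typing them. The sketch below assumes the quota statement form "set <family> quota <name> to <n> in compartment <c>" and "zero <family> quota <name> in compartment <c>"; the helper name is hypothetical and the quota names are examples to confirm against the current service documentation.

```python
def quota_statements(compartment, caps):
    """Render quota statements from {(family, quota_name): limit}; a limit of 0 becomes 'zero'."""
    lines = []
    for (family, name), limit in sorted(caps.items()):
        if limit == 0:
            lines.append(f"zero {family} quota {name} in compartment {compartment}")
        else:
            lines.append(f"set {family} quota {name} to {limit} in compartment {compartment}")
    return lines

# Example caps for a Dev compartment (quota names are assumptions to verify).
dev_caps = {
    ("compute-core", "standard-e4-core-count"): 64,
    ("compute-core", "standard-e5-core-count"): 0,
}
statements = quota_statements("Dev", dev_caps)
print("\n".join(statements))
```

The output is plain text to paste into a quota policy or feed into Terraform; nothing here talks to OCI.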

Resource OCIDs

Every OCI resource has a globally unique Oracle Cloud Identifier (OCID). You will use these constantly in CLI, Terraform, and support tickets. The format is readable:

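ocid1.instance.oc1.iad.anuwcljr...abcd1234
  |       |     |   |         |
version type  realm region  unique id (opaque)

The type segment (instance, vcn, bucket, database, compartment, user, policy...) tells you what the resource is at a glance.
The region segment is a region key or identifier: Ashburn and Phoenix typically appear as the three-letter keys iad and phx, while other regions use the full identifier (for example eu-frankfurt-1).
Some resources are regionless (users, groups, compartments, policies) - their region segment is empty, which leaves two consecutive dots, as in ocid1.compartment.oc1..aaaa...
OCIDs are stable for the life of the resource; scripts and Terraform state key off them.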
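
Because the segments are dot-separated, a few lines of code can pull them apart for logging or sanity checks. This is a minimal sketch that assumes the documented shape ocid1.<type>.<realm>.[region][.future use].<unique id>; the Ocid type, the parse_ocid name, and the sample identifier are hypothetical.

```python
from typing import NamedTuple

class Ocid(NamedTuple):
    version: str
    resource_type: str
    realm: str
    region: str      # empty string for regionless resources
    unique_id: str

def parse_ocid(ocid: str) -> Ocid:
    """Split an OCID into its dot-separated segments."""
    parts = ocid.split(".")
    # Require at least version, type, realm, region, unique id.
    if len(parts) < 5 or not parts[0].startswith("ocid"):
        raise ValueError(f"not an OCID: {ocid!r}")
    # The unique id is always the last segment, even if a future-use segment appears.
    return Ocid(parts[0], parts[1], parts[2], parts[3], parts[-1])

# Made-up identifier for a regionless resource.
print(parse_ocid("ocid1.compartment.oc1..exampleuniqueid"))
```

Treat the unique id as opaque: the parser is useful for reading the type, realm, and region, not for deriving anything from the trailing string.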

Tags: freeform, defined, and namespaces

Tag type	Structure	Governed?	Use for
Freeform tags	Simple key:value, no schema	No control - anyone with manage rights can set anything	Quick, informal labels. Avoid for anything you report or bill on.
Defined tags	Live in a tag namespace; keys are predefined, values can be validated/restricted	Yes - controlled by IAM policy and value lists	Cost tracking, environment, owner, data-classification. The enterprise standard.
Tag namespace	A container for defined tag keys (e.g. `Finance` namespace with `CostCenter`, `Project`)	Yes	Grouping and governing related tag keys; can be retired/reactivated.
Tag defaults	Auto-applied defined tags on any new resource in a compartment	Yes	Guaranteeing every resource is tagged (e.g. auto-stamp `CreatedBy`, `Environment`).

Architect note - decide the tag model before launch

Define a small, enforced set of defined tags (CostCenter, Environment, Owner, DataClassification) with a tag namespace and tag defaults from day one. Retrofitting tags across thousands of resources for chargeback or a security audit is slow, error-prone, and never fully complete. You can also drive cost tracking and even some IAM conditions off tags - but only if they exist consistently.
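
A tag model only pays off if compliance is checked. The sketch below assumes resource tags arrive as a nested mapping of namespace to key to value, which is how defined tags are usually represented in API output; the Operations namespace, the required-tag table, and the function name are hypothetical.

```python
# Required defined tags per namespace (Finance.CostCenter follows the example above;
# the Operations namespace is an assumption for illustration).
REQUIRED_TAGS = {"Finance": ["CostCenter"], "Operations": ["Environment", "Owner"]}

def missing_defined_tags(defined_tags, required=REQUIRED_TAGS):
    """Return 'Namespace.Key' names that are required but absent or empty."""
    missing = []
    for namespace, keys in required.items():
        present = defined_tags.get(namespace, {})
        for key in keys:
            if not present.get(key):
                missing.append(f"{namespace}.{key}")
    return missing

# Hypothetical tags on one resource.
resource_tags = {"Finance": {"CostCenter": "CC-4412"}, "Operations": {"Environment": "Dev"}}
print(missing_defined_tags(resource_tags))
```

Run against an inventory export, a check like this turns "are we tagged?" into a list of specific resources and keys to fix.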

Ways to work with OCI

OCI Console

The web UI. Best for learning, exploring, one-off tasks, and reading state. Not for repeatable production changes - use IaC for those.

OCI CLI

Python-based command line. Great for scripting, ad-hoc automation, and things not yet in Terraform. Uses config profiles (see section 17).

SDKs

Java, Python, Go, TypeScript/JavaScript, .NET, Ruby, PL/SQL. For building applications and tooling against OCI APIs.

Cloud Shell

Browser-based terminal, pre-authenticated as your Console identity, with CLI/Terraform/kubectl pre-installed and a small persistent home directory (5 GB at the time of writing - verify current limits). Ideal for quick, credential-free tasks.

Terraform provider for OCI

The recommended way to build and manage infrastructure declaratively. Oracle maintains the provider. Run it locally, in a pipeline, or in OCI Resource Manager (managed Terraform with state).

Resource Manager

OCI's managed Terraform service - stores state, runs plan/apply, supports stacks and drift detection without you hosting a state backend. Covered in section 17.

What to decide before production

Home region - irreversible. Pick with identity-admin proximity and long-term intent in mind.
Realm - commercial vs. government/dedicated - usually contractual, but confirm it matches your compliance needs.
Tenancy strategy - single tenancy with compartments, or separate Prod/non-Prod tenancies.
Compartment topology - environment vs. BU vs. hybrid; where shared services and security live.
Identity domain strategy - default domain vs. multiple domains, and federation to your IdP (see section 2).
Network CIDR plan - non-overlapping with on-premises and future regions (see section 3). This is very hard to change later.
Tag model - defined tags + namespaces + tag defaults for cost and governance.
Guardrails - compartment quotas, budgets, Security Zones, Cloud Guard baseline.
DR region and RTO/RPO targets - which region, which DR pattern per tier.

Common mistake

Treating the first workload as "just a POC" and letting it define the tenancy, home region, and CIDR plan by accident. POCs become production. Decide the list above deliberately before the first real workload lands.

Official documentation: OCI concepts & getting started →

2. Identity and Access Management

Who can do what, to which resources, in which compartment. IAM is where most OCI security incidents and access-denied tickets originate, so this section goes deep on the policy language, identity domains, and least-privilege design.

Last reviewed: July 2026. Identity Domains behavior evolves - verify verb/resource-type names in current docs.

Model | Users / groups / dynamic groups | Policy syntax | Verbs & resource types | Conditions | Instance / resource principals | Identity domains, SSO, MFA | Credentials | Policy examples | Common mistakes | Troubleshooting

TL;DR

Access in OCI is granted only by policies. A policy is a set of human-readable statements: Allow <subject> to <verb> <resource-type> in <compartment> [where <condition>]. There is no implicit access - if no policy allows it, it is denied. Groups get people access; dynamic groups and instance/resource principals let workloads authenticate without stored keys. Write policies at the lowest compartment that works, use the least verb that works, and separate admin from operator from read-only.

The IAM model

Modern OCI IAM runs inside Identity Domains - each domain is an isolated identity and access management container with its own users, groups, applications, and security settings (its lineage is Oracle Identity Cloud Service). A tenancy has a Default domain and can have additional domains. Within a domain you have users and groups; across the tenancy you have compartments and policies that reference those groups.

The evaluation is simple to state and important to internalize: a request is allowed only if at least one policy statement permits it; otherwise it is denied. There are no "deny" statements in classic IAM policy - you control access by what you grant and to which compartment. (Deny-style controls come from other layers: Security Zones, quotas, and network rules.)

Users, groups, and dynamic groups

Principal	What it is	How it authenticates	Use for
User	A person or a service identity in a domain	Password + MFA (console), API key, auth token	Humans and, sparingly, service accounts that must have static credentials
Group	A named set of users	n/a - policies target groups	All human access. Never write policies against individual users.
Dynamic group	A set of resources (instances, functions, OKE nodes, DB systems...) matched by rules	Instance/resource principal (no stored credentials)	Letting workloads call OCI APIs without API keys
Federated user	A user authenticated by an external IdP (Entra ID, Okta, AD)	SAML/OIDC SSO, mapped to a domain group	Enterprise SSO - the standard for human access at scale

Common mistake - confusing dynamic groups with user groups

A user group contains people; a dynamic group contains resources (compute instances, functions, autonomous DBs) selected by matching rules. You put a person in a user group. You never put a person in a dynamic group - and you never put an instance in a user group. Policies for workloads must target the dynamic group.

Example dynamic group matching rule (all instances in a compartment):

ALL {instance.compartment.id = 'ocid1.compartment.oc1..aaaa...'}
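
To reason about whether a given instance would fall into a dynamic group, it can help to evaluate a rule offline. The sketch below is a toy evaluator, not OCI's rule engine: it handles only flat ALL {...} / ANY {...} rules with equality tests, and the function name and sample OCID are hypothetical. Real matching rules support more (tags, nesting), so treat a result here as a hint, not proof.

```python
import re

def matches_rule(rule, attributes):
    """Evaluate a flat 'ALL {...}' or 'ANY {...}' matching rule against resource attributes."""
    m = re.fullmatch(r"\s*(ALL|ANY)\s*\{(.*)\}\s*", rule, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported rule shape")
    # Each condition looks like: variable = 'value'
    conditions = re.findall(r"([\w.]+)\s*=\s*'([^']*)'", m.group(2))
    if not conditions:
        raise ValueError("no conditions found")
    results = [attributes.get(name) == value for name, value in conditions]
    return all(results) if m.group(1).upper() == "ALL" else any(results)

# Made-up compartment OCID for illustration.
rule = "ALL {instance.compartment.id = 'ocid1.compartment.oc1..example'}"
instance = {"instance.compartment.id": "ocid1.compartment.oc1..example"}
print(matches_rule(rule, instance))  # True
```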

Policy syntax

A policy is a named collection of statements living in a compartment (or the tenancy). Every statement follows this grammar:

Allow <subject> to <verb> <resource-type> in <location> [where <conditions>]

# subject:        group Admins | group 'DomainName'/'Admins' | dynamic-group AppServers | any-user
# verb:           inspect | read | use | manage
# resource-type:  instances | virtual-network-family | object-family | database-family | all-resources
# location:       tenancy | compartment Prod | compartment Prod:AppTier
# conditions:     where request.region = 'iad'   (three-letter region key)

Where the policy lives matters: a policy attached to a compartment can only grant access to that compartment and its children. Tenancy-level policies can grant anywhere, which is exactly why you should minimize them.
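
For policy reviews it is often enough to split each statement into the grammar's parts and inspect them in bulk. The sketch below covers only the Allow form shown above and the four standard verbs; the regular expression and function name are illustrative and are not how OCI itself parses policies.

```python
import re

STATEMENT = re.compile(
    r"^Allow\s+(?P<subject>.+?)\s+to\s+(?P<verb>inspect|read|use|manage)\s+"
    r"(?P<resource_type>[\w-]+)\s+in\s+(?P<location>.+?)"
    r"(?:\s+where\s+(?P<conditions>.+))?$",
    re.IGNORECASE,
)

def parse_statement(text):
    """Split one policy statement into its grammar parts; returns None if it does not fit."""
    # Collapse line breaks and indentation so multi-line statements parse as one line.
    m = STATEMENT.match(" ".join(text.split()))
    return m.groupdict() if m else None

parsed = parse_statement(
    "Allow dynamic-group AppServers to read objects in compartment Data\n"
    "  where target.bucket.name = 'app-config'"
)
print(parsed)
```

Once statements are dictionaries, questions such as "which statements use manage in tenancy?" become a simple filter over the parsed list.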

Verbs and resource types

The four verbs are cumulative - each includes everything in the ones before it, plus more permissions.

Verb	Grants	Typical use	Risk
inspect	List resources (metadata only; often hides sensitive contents)	Auditors, inventory tools	Low
read	inspect + get resource details / contents	Read-only operators, dashboards	Low
use	read + work with existing resources (start/stop, attach, update some attributes) - generally not create/delete	Operators running day-2 tasks	Medium
manage	use + create and delete resources; full control	Admins, automation that provisions	High
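
The cumulative ordering in the table can be written down as a simple comparison. This is a simplification: the exact permissions behind each verb differ per resource type, so the sketch only models the ordering, and the function name is hypothetical.

```python
VERBS = ["inspect", "read", "use", "manage"]  # weakest to strongest

def verb_covers(granted, required):
    """True if the granted verb includes everything the required verb allows."""
    return VERBS.index(granted) >= VERBS.index(required)

print(verb_covers("use", "read"))     # True: use includes read
print(verb_covers("read", "manage"))  # False: read cannot create or delete
```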

Resource types can be individual (instances, buckets, subnets) or aggregate family types that bundle related resources:

virtual-network-family - VCNs, subnets, route tables, security lists, gateways, NSGs, DRGs.
database-family - DB systems, databases, backups, Data Guard, etc.
object-family - buckets and objects.
instance-family - compute instances, images, boot/block volume attachments.
all-resources - everything. Use with extreme care (see mistakes).

Architect note

Prefer the narrowest resource type that satisfies the task, and the least verb. "Operators can restart app servers" is use instances in the app compartment - not manage instance-family in tenancy. Every widening you allow is a widening an attacker or a mistake can use.

Conditions in policies

Conditions add a where clause that must be true for the statement to apply. They reference request/target variables.

# Restrict admin actions to a single region (request.region takes the three-letter region key)
Allow group NetAdmins to manage virtual-network-family in tenancy
  where request.region = 'iad'

# Only allow managing resources tagged for a given cost center
Allow group ProjectX to manage instances in compartment Workloads
  where target.resource.tag.Finance.CostCenter = 'CC-4412'

# Restrict to a specific API operation via request.operation (fine-grained)
# InstanceAction covers start/stop/reset, not terminate
Allow group Operators to use instances in compartment Prod
  where request.operation = 'InstanceAction'

Handy condition variables

request.region, request.operation, request.user.id, request.groups.id, target.resource.tag.<ns>.<key>, request.principal.type. Combine with tags for attribute-based access control.
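
The interaction between default deny and where clauses can be modelled in a few lines. This toy evaluator represents each statement only by its conditions as variable/value pairs and supports equality only; real conditions also allow any/all groupings, inequality, and patterns. All names here are hypothetical.

```python
def statement_applies(conditions, request):
    """A statement applies only when every 'where' variable equals the expected value."""
    # A variable missing from the request never matches, so the statement does not apply.
    return all(request.get(var) == expected for var, expected in conditions.items())

def is_allowed(statements, request):
    """Default deny: the request passes only if at least one statement applies."""
    return any(statement_applies(conditions, request) for conditions in statements)

statements = [
    {"request.operation": "InstanceAction"},
    {"target.resource.tag.Finance.CostCenter": "CC-4412"},
]
request = {"request.operation": "TerminateInstance"}
print(is_allowed(statements, request))  # False: no statement applies
```

The useful takeaway is the shape of the logic: conditions narrow a grant, and an unmet or inapplicable condition leaves the request with no grant at all.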

Instance principals and resource principals

These solve the "how does my code authenticate to OCI without storing keys" problem - the single most important security improvement most teams can make.

Mechanism	The principal is	Used by	How it works
Instance principal	A compute instance	Code running on a VM/BM instance	Instance is a member of a dynamic group; SDK/CLI obtains short-lived credentials from the instance metadata service. No API key on disk.
Resource principal	A managed resource (Function, Data Science notebook, Autonomous DB, etc.)	Serverless / managed services	The service injects a short-lived token the SDK uses automatically. Common in Functions and OKE workload identity.

Security note - prefer principals over stored keys

An API key in a config file, a script, or an environment variable is a long-lived secret that can leak, get committed to git, or outlive the person who made it. Instance/resource principals issue short-lived, automatically rotated credentials scoped by a dynamic group and policy. If a workload runs in OCI and needs to call OCI, it should almost always use a principal, not an API key.

# Let instances in the AppServers dynamic group read objects from one bucket in compartment Data
Allow dynamic-group AppServers to read objects in compartment Data
  where target.bucket.name = 'app-artifacts'

# CLI using instance principal - no ~/.oci/config keys needed
oci os object list --bucket-name app-artifacts --auth instance_principal

Identity domains, federation, SSO, MFA

Identity domains are self-contained IAM stacks in your tenancy. Each has its own users, groups, password policy, MFA settings, sign-on policies, and federation config.

Default domain - created with the tenancy; where the initial administrator lives.
Additional domains - useful to separate, for example, employees from external partners, or Prod-admin identities from Dev identities, each with their own MFA/sign-on rules. Domains also come in types/licensing tiers (Free, Oracle Apps, Premium, External User) that affect available features - verify current tiers.
Federation / SSO - connect an external IdP (Microsoft Entra ID, Okta, ADFS, Ping) via SAML or OIDC so users sign in with corporate credentials. Map IdP groups to domain groups; policies reference the domain groups.
MFA - enforced via sign-on policies in the domain. Require MFA for all human users, especially administrators. Exempt only automated service identities that use API keys/principals, not passwords.

Architect note - plan domains early

Deciding to split identities into multiple domains after you have policies, federation, and MFA configured against the default domain is disruptive - group references in policies use the domain-qualified name (Allow group 'DomainName'/'GroupName' to ...). Decide your domain model (single default vs. multi-domain) alongside your compartment model, before production identities exist.

Common mistake

Standing up local OCI users for every engineer instead of federating to the corporate IdP. Local users mean separate lifecycle (joiners/movers/leavers), separate MFA, and passwords that outlive employment. Federate human access; reserve local users for break-glass and specific service needs.

Credential types

Credential	Used for	Notes / risk
Console password + MFA	Human Console sign-in	Federate + enforce MFA. Rotate per policy.
API signing key (PEM)	CLI/SDK/Terraform as a user	Long-lived. High risk if leaked. Prefer instance/resource principals where possible; rotate and scope tightly otherwise.
Auth token	Basic-auth style access (e.g. some Git/registry, Swift/RMAN to Object Storage)	Password-equivalent. Store in Vault, not scripts.
Customer secret key	S3-compatible Object Storage access (access key/secret)	For tools that speak the S3 API. Treat like AWS keys.
OAuth 2.0 client credentials	App-to-app / confidential apps in a domain	Managed as domain applications; scope to least privilege.
Database passwords / wallets	DB connectivity (e.g. ADB mTLS wallet)	Store wallets/secrets in Vault; never in application images.

Security note - break-glass users

Keep one or two break-glass local admin users (not federated), with very strong unique passwords and MFA, credentials sealed in your enterprise secrets vault, and every login alarmed via Audit + Events. They exist so you can still administer OCI if the IdP/federation is down. Do not use them for daily work; monitor them heavily.

Real policy examples

Read-only auditor across the tenancy (Low risk)

Allow group Auditors to inspect all-resources in tenancy
Allow group Auditors to read audit-events in tenancy

Allows: listing and viewing audit trails and resource metadata everywhere; no changes, and inspect hides many secret contents. Where: tenancy (auditors legitimately need breadth). Risk: low, but still scope to inspect/read only. Safer alternative: if auditors only cover certain BUs, scope to those compartments instead of tenancy.

App operators can start/stop instances in Prod only (Medium risk)

Allow group AppOperators to read instance-family in compartment Prod
Allow group AppOperators to use instances in compartment Prod
  where request.operation = 'InstanceAction'

Allows: viewing instances and starting/stopping/resetting them, but not creating or terminating. Where: the Prod compartment (not tenancy). Risk: medium - stop can cause outage; scoping to InstanceAction prevents delete. Safer alternative: further restrict with a tag condition so operators only touch app-tier instances, not databases.

Workload reads a bucket via instance principal (Low risk)

Allow dynamic-group AppServers to read objects in compartment Data
  where target.bucket.name = 'app-config'

Allows: only the matched instances, only read, only that one bucket. Where: the Data compartment. Risk: low - tightly scoped and no stored keys. This is the pattern to imitate: dynamic group + least verb + resource condition.

Delegated compartment admin for a business unit (Higher risk)

Allow group BU_A_Admins to manage all-resources in compartment BU-A

Allows: full control, but only inside the BU-A compartment subtree - not the rest of the tenancy. Where: the BU-A compartment. Risk: higher (manage all-resources), but blast radius is contained to one compartment. Safer alternative: split into role-specific groups (network admin, DB admin, compute admin) even within the BU so no single group holds everything. Pair with quotas and a Security Zone.

Common OCI IAM mistakes

manage all-resources too broadly - especially at tenancy level. This is effectively tenancy admin. Scope to a compartment and split by role.
Tenancy-level policies by default - a policy in the tenancy grants everywhere. Put policies in the lowest compartment that works.
Confusing dynamic groups with user groups - workloads authenticate via dynamic groups; people via user groups. Mixing them either fails or over-grants.
No separation of admin / operator / read-only - one "everyone" group with manage rights removes least privilege and makes audit meaningless.
Not planning identity domains - retrofitting a multi-domain model after federation and policies exist is painful.
Storing API keys instead of using instance/resource principals - long-lived keys leak; principals are short-lived and scoped.
Policies against individual users - unmanageable and invisible in group-based reviews. Always target groups.
Leaving the default tenancy administrator group over-used - reserve it for break-glass; do daily work as scoped roles.

IAM troubleshooting

⚑ "Authorization failed or requested resource not found"

Symptoms

A user or workload gets a 404/authorization error even though the resource exists.

Likely causes

No policy grants the action (default deny).
Policy is attached to the wrong compartment (a policy reaches only the compartment it is attached to and that compartment's descendants, and only the location named in its statement; a policy in one child compartment cannot grant on a sibling).
Verb too weak (read where use/manage is needed) or wrong resource-type/family.
For workloads: the instance is not in the dynamic group, or the matching rule does not match, or the dynamic-group policy is missing.
A where condition (region, tag, operation) is not satisfied.
Wrong identity domain - group referenced without the domain qualifier.

Checks to perform

Confirm the user's group membership and the group's policies (Identity > Domains > Groups; Identity > Policies).
Trace the compartment path of the target resource and confirm a policy covers that compartment or an ancestor.
For dynamic groups: verify the instance OCID matches the rule (Identity > Domains > Dynamic groups) and that a policy grants the dynamic group.
Check conditions: is the request in the allowed region? Does the target carry the required tag?

Console path

Identity & Security > Policies (and Domains > Users / Groups / Dynamic groups). Use the tenancy Audit log to see the denied request details.

CLI

oci iam policy list --compartment-id <compartment-ocid> --all
oci iam group list-users --group-id <group-ocid>
oci iam dynamic-group get --dynamic-group-id <dg-ocid>

Fix options

Add the least-privilege statement at the correct compartment; fix the dynamic-group rule; correct the verb/resource-type; relax or correct the condition.

Prevention

Adopt a policy naming and location standard, review policies in code (Terraform), and keep a "who can do what" matrix per compartment.

Official documentation: IAM policies & policy reference →

3. Networking Deep Dive

The VCN is the foundation every other OCI service plugs into. This section covers CIDR planning, subnets, gateways, security rules, hybrid connectivity, DNS, and the traffic-flow reasoning you need to design and debug real networks.

Last reviewed: July 2026. Verify gateway limits, VPN specs, and FastConnect options in current docs.

TL;DR

A VCN is your private, software-defined network in a region with a CIDR you choose. Inside it you create subnets (regional or AD-specific, public or private). Traffic leaves the VCN only through a gateway - Internet (IGW), NAT, Service (to OCI services privately), DRG (to on-prem and other VCNs), or peering. Two rule layers govern packets: stateful security lists (subnet-wide) and NSGs (per-VNIC). Plan your CIDR to never overlap with on-premises or other regions - it is the hardest thing to change later.

VCN and CIDR planning

A Virtual Cloud Network is a regional, private network. You assign it one or more CIDR blocks (RFC 1918 private ranges are standard; you can also use public ranges you own). Everything - instances, load balancers, databases, mount targets, private endpoints - gets an IP from a subnet inside the VCN.

A VCN can have multiple CIDR blocks, added after creation, which helps when you outgrow the first block. But you cannot shrink or trivially renumber - plan generously.
VCN CIDRs and subnet CIDRs must not overlap with each other, with peered VCNs, or with your on-premises networks. Overlap is the number-one cause of hybrid connectivity that "connects but cannot route."
Reserve non-overlapping space for every region and every environment you might ever run, plus DR. Treat it like an enterprise IP address plan, because it is one.
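An overlap check is easy to automate before any VCN is created. The following is a minimal sketch using Python's standard `ipaddress` module; the network names and ranges in `plan` are made up for illustration and would come from your own IP plan.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_cidrs):
    """Return every pair of names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Hypothetical plan: one on-premises range and two candidate VCN ranges.
plan = {
    "on-prem-dc1": "10.0.0.0/12",
    "oci-prod-vcn": "10.0.0.0/16",
    "oci-dr-vcn": "10.64.0.0/16",
}

print(find_overlaps(plan))  # [('on-prem-dc1', 'oci-prod-vcn')]
```

Running a check like this against the full enterprise plan whenever a new range is proposed catches the overlap while it is still a spreadsheet change.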

Architect note - a workable CIDR scheme

Carve a large private supernet for OCI (e.g. 10.0.0.0/8 subdivided), then allocate a /16 per region, a /20 per VCN/environment, and /24s per subnet tier. Keep a documented IPAM spreadsheet. Leave gaps. The cost of a too-large plan is zero; the cost of overlap is a re-IP project during a migration.
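The /16, /20, /24 hierarchy described here can be generated rather than typed by hand. This sketch uses the standard `ipaddress` module; the region block `10.16.0.0/16` is an arbitrary example, not a recommendation.

```python
import ipaddress
from itertools import islice

# Hypothetical /16 allocated to one region out of the 10.0.0.0/8 supernet.
region = ipaddress.ip_network("10.16.0.0/16")

# A /16 splits into sixteen /20 blocks, one per VCN or environment.
vcns = list(region.subnets(new_prefix=20))
prod_vcn = vcns[0]

# Each /20 splits into sixteen /24 blocks; take the first three as subnet tiers.
tiers = list(islice(prod_vcn.subnets(new_prefix=24), 3))

print(len(vcns))                # 16
print(prod_vcn)                 # 10.16.0.0/20
print([str(t) for t in tiers])  # ['10.16.0.0/24', '10.16.1.0/24', '10.16.2.0/24']
```

Leaving some of the generated /20 and /24 blocks unassigned is one simple way to keep the gaps the note recommends.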

Common mistake

Using 10.0.0.0/16 for the first VCN because it is the console default, then discovering on-premises already uses 10.0.x.x. Now FastConnect/VPN is up but nothing routes. Choose CIDRs against your existing enterprise IP plan on day one.

Subnets

Subnet property	Options	Guidance
Scope	Regional (spans all ADs) or AD-specific	Prefer regional subnets - simpler HA, resources can land in any AD/FD. AD-specific subnets are legacy/niche.
Public vs private	Public = resources can have public IPs; Private = private IPs only	Default to private. Only front-facing load balancers and bastions belong in public subnets.
Route table	One per subnet	Determines which gateway off-VCN traffic uses. Public subnet → IGW; private subnet → NAT/Service/DRG.
Security lists	Zero or more per subnet	Subnet-wide stateful rules. Combine with NSGs.
DHCP options	One per subnet	Controls DNS resolver and search domain handed to instances.

"Public subnet" does not mean "has a public IP"

A subnet being public only means resources may have a public IP. An instance in a public subnet with no public IP assigned and no IGW route is not reachable from the internet. Reachability = public IP and an IGW route and security rules allowing it. Missing any one blocks it.

Gateways - the only ways out of a VCN

Gateway	Direction / purpose	Public IP?	Typical route target for
Internet Gateway (IGW)	Bidirectional internet for resources with public IPs	Yes	Public subnets (LB, bastion)
NAT Gateway	Outbound-only internet for private resources (patching, external APIs)	Uses OCI-managed public IP	Private subnets needing egress
Service Gateway	Private access to OCI services (Object Storage, ADB, etc.) without internet	No	Private subnets reaching OCI services
Dynamic Routing Gateway (DRG)	On-premises (VPN/FastConnect) and VCN-to-VCN / cross-region routing hub	No	Hybrid + hub-and-spoke
Local Peering Gateway (LPG)	VCN-to-VCN peering in the same region (legacy; DRG now preferred)	No	Same-region VCN peering
Remote Peering Connection (RPC)	DRG-to-DRG peering across regions	No	Cross-region VCN connectivity

Architect note - Service Gateway is not optional

Without a Service Gateway, a private instance reaching Object Storage or backing up a database to it has to go out through a NAT Gateway to the public Object Storage endpoint - slower, less secure, and potentially subject to metered internet egress. The Service Gateway keeps that traffic on the OCI backbone and off the internet. Add a route to it in every private subnet that talks to OCI services, and pair it with the correct service CIDR label ("All OSN Services" vs "Object Storage only") for least exposure.

Common mistake

Adding a 0.0.0.0/0 route to a NAT Gateway in a private subnet and assuming Object Storage traffic is "private." It is not - it egresses to the internet-facing Object Storage endpoint. Add a Service Gateway and a route for the OCI services CIDR label so that traffic stays on the backbone; keep NAT only for genuine internet destinations.

Security lists vs. Network Security Groups

	Security List	Network Security Group (NSG)
Applies to	Every VNIC in the subnet	Only VNICs you add to the NSG
Granularity	Subnet-wide (coarse)	Per-workload / per-tier (fine)
Rule source/dest	CIDR, service CIDR	CIDR, service CIDR, or another NSG
Stateful?	Yes (can also be stateless)	Yes (can also be stateless)
Best for	Baseline subnet rules (e.g. allow intra-VCN)	App-tier-to-DB-tier rules by group, not IP

Both are evaluated. A packet is allowed if either the applicable security lists or the NSGs permit it (they are additive for allows; there is no deny rule - you allow what you need and everything else is implicitly denied). Effective rules = union of all security lists on the subnet + all NSGs on the VNIC.
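The additive behavior is worth seeing concretely, because it is where a tight NSG gets silently widened. The sketch below models ingress rules as simple records and is not how OCI stores or evaluates rules internally; `IngressRule` and the addresses are hypothetical, and NSG-to-NSG references are left out.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    source: str    # source CIDR
    protocol: str  # "tcp", "udp", ...
    port: int      # destination port


def permits(rule, src_ip, protocol, port):
    """True if this single allow rule matches the packet."""
    return (rule.protocol == protocol
            and rule.port == port
            and ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source))


def ingress_allowed(security_list_rules, nsg_rules, src_ip, protocol, port):
    """Effective rules are the union; any matching allow admits the packet."""
    effective = list(security_list_rules) + list(nsg_rules)
    return any(permits(r, src_ip, protocol, port) for r in effective)


# NSG: only the app subnet may reach the database listener.
db_nsg = [IngressRule("10.0.1.0/24", "tcp", 1521)]
# Security list on the DB subnet: a broad "allow intra-VCN" style rule.
wide_security_list = [IngressRule("10.0.0.0/16", "tcp", 1521)]

# A host outside the app subnet is blocked by the NSG alone...
print(ingress_allowed([], db_nsg, "10.0.9.5", "tcp", 1521))                  # False
# ...but admitted once the permissive security list is added to the union.
print(ingress_allowed(wide_security_list, db_nsg, "10.0.9.5", "tcp", 1521))  # True
```

Because there is no deny rule to subtract with, the only way to tighten the result is to narrow or remove the broader allow.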

Architect note - prefer NSGs

NSGs let you write rules like "app-tier NSG may reach db-tier NSG on 1521" without hardcoding IPs. As instances scale, membership updates automatically. Use security lists for a thin subnet-wide baseline (e.g. allow ICMP path MTU, allow intra-VCN) and put the real application rules in NSGs referencing other NSGs.

Common mistake - stateful vs stateless confusion

Mixing a stateless rule in one direction with a stateful rule in the other creates asymmetric behavior and dropped return traffic. Keep rules stateful unless you have a specific reason (very high connection rates) and, if stateless, define matching rules for both directions.

IPs, VNICs, and secondary addresses

Object	What it is	Notes
Private IP	An IP from the subnet CIDR on a VNIC	Primary private IP is fixed for the VNIC's life; secondaries can move.
Ephemeral public IP	Temporary public IP tied to a private IP/instance lifecycle	Released when the instance/VNIC is deleted. Cheapest for transient needs.
Reserved public IP	A public IP you own independent of any instance	Survives instance deletion; re-map to another resource. Use for stable ingress/whitelisting.
VNIC (primary)	The instance's first network interface	Cannot be removed; determines the instance's primary subnet.
Secondary VNIC	Additional interface, can be in a different subnet	Multi-homing (e.g. app + management network). Requires OS-level config.
Secondary private IP	Extra private IP on a VNIC	Used for IP failover / floating VIPs across instances.

DBA note - VIPs and secondary IPs

For custom HA (a floating virtual IP moved between two DB or app nodes), OCI uses a secondary private IP that you unassign from the failed node and assign to the surviving node. Grid Infrastructure/RAC on Base/Exadata handles its own VIP/SCAN addressing; for hand-rolled active/passive, plan the secondary-IP failover and the IAM permission for the failover automation to move it.

DNS and DHCP

VCN Resolver - each VCN has a built-in DNS resolver at 169.254.169.254. It resolves internal hostnames and forwards public queries.
Private DNS zones/views - create private zones for custom internal names, and use the resolver's endpoints/forwarding rules to integrate with on-premises DNS (conditional forwarding both ways for hybrid name resolution).
DHCP options - per-subnet; controls whether instances use the VCN resolver or a custom resolver, and the search domain. Point at custom resolvers when integrating enterprise DNS.

Common mistake

Hybrid apps that "cannot resolve" on-prem hostnames because the subnet's DHCP options still use the default VCN resolver with no forwarding rule to the corporate DNS. Set up the DNS resolver endpoints and conditional forwarders in both directions.

How traffic flows in OCI

For any packet leaving an instance, OCI decides the path in this order:

Is the destination inside the VCN? If yes, it routes locally (no gateway) - only security rules apply. Intra-VCN routing is automatic.
If outside the VCN, the subnet's route table is consulted for the most specific matching rule → picks a gateway (IGW, NAT, Service, DRG, LPG).
Security rules (security lists + NSGs on the VNIC) must allow the egress, and the return path must be allowed (stateful handles the return automatically).
At the destination side, the same security-rule check happens for ingress.

Debugging almost always comes down to three questions: Is there a route to the right gateway? Do the security rules allow it at both ends? Is there a return path?
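The first two steps of that order (local delivery, then most-specific route match) can be sketched in a few lines. This is an illustrative model only: the gateway names are plain strings, and `192.0.2.0/24` is a documentation range standing in for the OCI services CIDR label, which in a real route table is a label rather than a literal CIDR. The security-rule checks are deliberately left out.

```python
import ipaddress


def next_hop(dest_ip, vcn_cidrs, route_rules):
    """Return 'local', the target of the most specific matching rule, or None."""
    dest = ipaddress.ip_address(dest_ip)

    # Step 1: destinations inside the VCN are routed locally, no gateway involved.
    if any(dest in ipaddress.ip_network(cidr) for cidr in vcn_cidrs):
        return "local"

    # Step 2: otherwise pick the matching route rule with the longest prefix.
    matches = [(ipaddress.ip_network(cidr), target)
               for cidr, target in route_rules
               if dest in ipaddress.ip_network(cidr)]
    if not matches:
        return None  # no route: the packet has nowhere to go
    return max(matches, key=lambda m: m[0].prefixlen)[1]


vcn = ["10.16.0.0/20"]
routes = [
    ("0.0.0.0/0", "NAT"),
    ("172.16.0.0/12", "DRG"),          # hypothetical on-premises range
    ("192.0.2.0/24", "SERVICE_GW"),    # stand-in for the services CIDR label
]

print(next_hop("10.16.1.10", vcn, routes))  # local
print(next_hop("172.16.5.9", vcn, routes))  # DRG
print(next_hop("192.0.2.10", vcn, routes))  # SERVICE_GW
print(next_hop("8.8.8.8", vcn, routes))     # NAT
```

Removing the Service Gateway entry from `routes` makes the third lookup fall through to `NAT`, which is the misrouting this section warns about.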

Reference diagrams

Three-tier architecture (public LB, private app, private DB)

Public LB in a public subnet fronts private app and DB tiers; egress via NAT, OCI services via Service Gateway. HA across fault domains.

Hub-and-spoke with DRG

DRG as the central hub: spoke VCNs, on-premises, and other regions all attach to one DRG. Inspect east-west traffic through a firewall in the hub VCN.

Service Gateway to Object Storage (private backend/backups)

RMAN backups and object reads travel the OCI backbone via the Service Gateway - no NAT, no internet exposure.

On-premises to OCI hybrid (VPN + FastConnect)

FastConnect as the private primary link, IPSec VPN as encrypted backup, both terminating on the DRG. On-prem and OCI CIDRs must not overlap.

Hybrid connectivity: VPN vs FastConnect

	Site-to-Site VPN (IPSec)	FastConnect
Path	Over the public internet, encrypted	Private, dedicated connection (via partner or colo)
Bandwidth	Limited per tunnel; multiple tunnels for HA	1/10/100 Gbps port options
Latency/jitter	Variable (internet)	Consistent, low
Setup time	Minutes	Days-weeks (physical/partner provisioning)
Use as	Quick start, or encrypted backup to FastConnect	Primary enterprise link, production DB replication, large migration

Architect note

The common enterprise pattern is FastConnect primary + VPN backup, both on the DRG, with routing (BGP) preferring FastConnect and failing over to VPN. Start on VPN during a project's early phase, then cut FastConnect in as it provisions. Design routing so failover is automatic and tested.

Flow logs, path analysis, and packet capture

Tool	What it gives you	Use when
VCN Flow Logs	Accepted/rejected connection records per subnet/VNIC (into the Logging service)	Auditing, "is my rule dropping this?", security forensics
Network Path Analyzer	Static analysis of whether a path from A to B is allowed, listing the rules/routes that permit or block it	Before deploying, or first step when "cannot reach" - it tells you the exact blocking rule
VTAP (Virtual Test Access Point)	Mirrors VNIC traffic to a capture target for deep packet inspection	IDS/IPS, deep debugging, compliance capture
Instance-side capture	`tcpdump` on the instance OS	Confirming what actually arrives at the host

Start with Network Path Analyzer

Before you SSH around checking rules by hand, run the Network Path Analyzer for the source/destination/port. It evaluates route tables, security lists, NSGs, and gateways and tells you the first rule that blocks the path - turning a 30-minute hunt into a 30-second answer.

Network Firewall and Web Application Firewall

Network Firewall - a managed, OCI-native firewall (Palo Alto-based) you place in a hub VCN to inspect north-south and east-west traffic: stateful filtering, IPS/IDS, URL filtering, TLS inspection. Route spoke traffic through it via the DRG hub.
Web Application Firewall (WAF) - Layer-7 protection (OWASP rules, bot management, rate limiting) applied in front of public HTTP endpoints, either as a WAF policy attached to a load balancer or as an edge service. Covered further in section 8.

Networking troubleshooting

⚑ Instance cannot reach the internet

Likely causes & checks

Private subnet, no egress: needs a 0.0.0.0/0 route to a NAT Gateway. (An IGW route only helps an instance that has a public IP, which requires a public subnet.) Check the subnet's route table.
Public subnet but no public IP: assign an ephemeral/reserved public IP, and confirm a 0.0.0.0/0 route to the IGW.
Security rules: egress must allow the destination (often 0.0.0.0/0 on 443); return traffic is automatic if stateful.
OS firewall: iptables/firewalld on the instance may block it - Oracle Linux images ship with firewall rules.

Console path

Networking > VCN > Subnet > Route Table / Security Lists; Instance > Attached VNICs > public IP.

CLI

oci network route-table get --rt-id <ocid>
oci network security-list get --security-list-id <ocid>
# on the instance:
curl -s https://ifconfig.me ; sudo firewall-cmd --list-all

Fix / prevention

Add NAT route for private egress; standardize a subnet template (route + security rules) in Terraform so every subnet is consistent.

⚑ Instance cannot reach Object Storage privately

Likely causes & checks

No Service Gateway on the VCN, or subnet route table has no route to it for the OCI services CIDR label.
Service Gateway configured for the wrong service label (needs "Object Storage" or "All OSN Services in region").
Security rules do not allow egress to the service CIDR.
Using the wrong endpoint - use the regional Object Storage endpoint that the Service Gateway serves.

Console path

Networking > VCN > Service Gateway; then the subnet Route Table.

Fix / prevention

Create the Service Gateway, add a route rule (target = Service Gateway, destination = the services CIDR label), and an egress security rule to that label. Bake this into your standard private-subnet module.

⚑ On-premises cannot reach OCI

Likely causes & checks

CIDR overlap between on-prem and the VCN - routing is ambiguous. This is the most common root cause.
DRG route table / route distribution not advertising the VCN CIDR, or on-prem BGP not advertising its routes.
VPN tunnel down (Phase 1/2 mismatch) or FastConnect BGP session down.
Security lists/NSGs not allowing the on-prem CIDR.
On-prem firewall blocking return traffic.

Console path

Networking > Dynamic Routing Gateways > DRG > Route Tables / Attachments; Site-to-Site VPN > Tunnel status.

Fix / prevention

Resolve overlap (re-IP or NAT), confirm route advertisement both ways, verify tunnel/BGP state. Document the IP plan to prevent future overlap.

⚑ Load balancer health check is failing

Likely causes & checks

Backend subnet's security rules do not allow the health-check probe from the load balancer subnet/NSG on the backend port.
Health check configured for the wrong port/path/protocol vs. what the app serves.
Backend app not listening on the expected port, or bound to 127.0.0.1 instead of 0.0.0.0.
OS firewall on the backend dropping the probe.

Fix / prevention

Allow the LB source in the backend NSG on the health-check port; align health-check path/port with the app; confirm the app binds to all interfaces. See section 7 for the full LB troubleshooting flow.
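When it is unclear whether the network or the application is at fault, a plain TCP connect from a host in the load balancer's subnet approximates what a TCP health check does. This is a minimal stdlib sketch, not the load balancer's actual probe; the host and port in the comment are placeholders.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (placeholder private IP and port of a backend):
# tcp_probe("10.16.1.10", 8080)
```

A probe that succeeds against 127.0.0.1 on the backend itself but fails against the backend's private IP points at the bind address or the OS firewall; one that fails only from the load balancer subnet points at security rules.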

⚑ DNS resolution issue

Likely causes & checks

Subnet DHCP options point at a resolver that cannot resolve the name (e.g. custom resolver without a forwarder).
Hybrid: no conditional forwarding between the VCN resolver and corporate DNS.
Private DNS zone missing the record, or wrong view attached to the VCN.

CLI / checks

nslookup db.internal.example.com 169.254.169.254
cat /etc/resolv.conf   # confirm which resolver the OS uses

Fix / prevention

Fix DHCP options, add resolver endpoints + forwarding rules for hybrid, and put internal records in a private zone attached to the VCN's resolver view.

⚑ Security list / NSG or route table blocking traffic

Fast method

Run Network Path Analyzer for source, destination, protocol, and port. It reports the exact security rule or missing route causing the block. Then enable VCN Flow Logs and look for REJECT records to confirm which rule dropped the packet.

Common route-table gotcha

Route rules are matched most-specific-first. Having both a 0.0.0.0/0 route to NAT and a more specific route to a Service Gateway is correct; if the specific route is missing, OCI-service traffic goes out the NAT by mistake. For DRG route propagation issues, check the DRG route table import/export and that the attachment advertises the CIDR.

OCI Networking gotchas

CIDR overlap is the cardinal sin - plan against your enterprise IPAM, leave room, never reuse on-prem ranges.
"Public subnet" != reachable. Reachability needs public IP + IGW route + security rules, all three.
Service Gateway forgotten - Object Storage/ADB traffic silently goes over NAT/internet. Always add it for private subnets.
Security list + NSG are additive - a permissive security list can undo the tight NSG you thought was protecting a VNIC. Audit both.
Stateless rules asymmetry - keep rules stateful unless you have a measured reason not to.
Regional vs AD subnets - use regional subnets; AD-specific ones complicate HA and are rarely needed now.
DRG vs LPG - LPG is legacy; use the upgraded DRG (DRG v2) as the routing hub for peering and hybrid.
Route table is per subnet - a change in one subnet's route table does not affect others; inconsistency between subnets causes "works here, not there."
OS firewall - Oracle Linux images enforce their own firewall; a perfect VCN config still fails if firewalld blocks the port.
Egress data transfer - internet egress is metered; keep OCI-service traffic on the Service Gateway to avoid unnecessary internet-path cost and exposure.

Official documentation: Networking overview →

4. Compute Deep Dive

Shapes, images, placement, scaling, and the operational patterns for running application and database compute in OCI - including how to pick shapes and how licensing interacts with instance choice.

Last reviewed: July 2026. Shape names/specs change often - verify current shapes in the Console.

TL;DR

OCI compute comes as VMs or bare metal. Modern shapes are flexible - you dial OCPUs and memory independently. On x86 shapes, one OCPU = one physical core (two vCPU threads), which matters a lot for Oracle licensing. Use fault domains and instance pools + autoscaling for HA and elasticity, instance configurations as templates, and instance principals so instances call OCI APIs without stored keys.

Shapes: OCPU, memory, flexible

OCPU vs vCPU: OCI historically measures CPU in OCPUs. One OCPU = one physical core = two hardware threads (vCPUs) on x86. Oracle Database licensing counts cores, so 1 OCPU generally corresponds to the licensing unit for a core (subject to the core factor). Newer shapes may also be expressed in vCPUs - confirm the unit for the exact shape.
Flexible shapes (e.g. the E-series AMD, and Ampere Arm A-series) let you choose OCPU count and memory GB independently, within per-OCPU memory ranges. You pay for what you allocate.
Fixed shapes come in preset sizes (older standard shapes, some bare metal, GPU shapes).
Processor families: AMD EPYC (E-series), Intel Xeon (X/Standard), Ampere Arm (A-series, strong price/performance for scale-out and cloud-native), and NVIDIA GPU shapes for AI/ML.

DBA note - OCPUs and Oracle licensing

Because 1 OCPU = 1 core, your Oracle Database license consumption on IaaS compute is driven directly by the OCPU count you allocate (with the OCI core factor policy - verify the current policy). Right-sizing OCPUs is a licensing decision, not just a performance one. Bare metal and dedicated hosts give you control over the physical boundary for hard-partitioning/licensing arguments - discuss with Oracle LMS before relying on it.

Bare metal vs virtual machines

	Virtual Machine	Bare Metal
Tenancy	Shared host, isolated VM	Entire physical server, single tenant
Hypervisor overhead	Minimal (off-box virtualization)	None - you get the whole box
Use for	Most apps, middleware, web, smaller DBs	Highest performance, large DBs, licensing isolation, specialized workloads
Live migration	Supported for many VM shapes during infra maintenance	Not applicable - you manage maintenance windows/reboot migration
Cost model	Per-OCPU/hour, fine-grained	Whole-server; higher floor, better for dense/large

Dedicated hosts, capacity reservation, preemptible

Option	What it does	Use when
Dedicated VM Host	Your VMs run on a physical host reserved to you (no other tenants)	Compliance/licensing isolation while keeping VM flexibility
Capacity Reservation	Reserves capacity of a shape in an AD so it is guaranteed available when you launch	Guaranteeing DR failover capacity, large launches, scale-out headroom
Preemptible instances	Cheaper VMs OCI can reclaim with short notice	Fault-tolerant batch, stateless workers, CI - never for stateful/production-critical
Burstable instances	Baseline fraction of an OCPU with bursting	Low-average, spiky small workloads (bastions, light services)

Cost note - reserve DR capacity deliberately

A pilot-light/warm-standby DR plan assumes capacity will be available in the DR region when you fail over. During a large regional event, popular shapes can be constrained. Capacity Reservations guarantee it - at a cost. Decide per tier whether the RTO justifies paying for reserved DR capacity vs. accepting best-effort.

Images, custom images, and cloud-init

Platform images - Oracle Linux, and other OSes maintained by Oracle.
Custom images - capture a configured instance as an image for repeatable launches (golden images).
Bring Your Own Image (BYOI) - import a supported OS image; must meet OCI's paravirtualized/emulated driver requirements.
cloud-init - pass a startup script (user data) that runs on first boot to configure the instance (install agents, mount volumes, join config management). The standard way to bootstrap without baking everything into an image.
Instance metadata service - at 169.254.169.254, exposes instance metadata and is how instance principals fetch credentials. Restrict access to it inside the OS where appropriate.

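#cloud-config
# Example cloud-init passed as user_data (base64) to bootstrap an app host.
# "#cloud-config" must be the first line, or cloud-init will not treat the file as cloud-config.
# myapp and register-with-lb.sh are placeholders for your own service and script.
package_update: true
packages: [oracle-instantclient-basic, jq]
runcmd:
  - [ systemctl, enable, --now, myapp ]
  - [ /opt/app/register-with-lb.sh ]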
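The instance metadata service listed above can be read from inside an instance with nothing but the standard library. The sketch below assumes the v2 endpoint (`/opc/v2/`), which expects the `Authorization: Bearer Oracle` header; the helper names are illustrative, and the fetch only works when run on an OCI instance.

```python
import json
import urllib.request

# Link-local address of the instance metadata service, v2 path.
IMDS_BASE = "http://169.254.169.254/opc/v2"


def build_imds_request(path):
    """Build a request for a metadata path, with the header the v2 endpoint requires."""
    return urllib.request.Request(
        f"{IMDS_BASE}/{path.lstrip('/')}",
        headers={"Authorization": "Bearer Oracle"},
    )


def fetch_instance_metadata(timeout=2.0):
    """Return the instance metadata document as a dict (only works on an OCI instance)."""
    with urllib.request.urlopen(build_imds_request("instance/"), timeout=timeout) as resp:
        return json.load(resp)
```

On an instance, `fetch_instance_metadata()` should return a document that includes fields such as the shape and fault domain, which makes it a convenient input for bootstrap scripts; verify the exact field names against the current metadata documentation.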

Architect note - golden image + cloud-init

Bake slow-changing things (OS hardening, agents, base packages) into a custom image; use cloud-init for fast-changing config (app version, environment wiring). This keeps launches fast and reproducible and is what instance pools/autoscaling need to bring nodes up identically.

Placement, fault domains, and HA

High availability for compute is about anti-affinity: never put both halves of an HA pair on the same failure unit.

In a multi-AD region: spread across ADs for data-center-level resilience, and across FDs within each AD.
In a single-AD region: spread across the three fault domains - that is your in-region HA. DR to another region covers AD/region loss.
Instance pools distribute instances across FDs/ADs automatically per the placement configuration.
Live migration: for supported VM shapes, OCI can live-migrate your VM off hardware needing maintenance, avoiding a reboot; some events still require a reboot-migration you schedule. Bare metal you always manage yourself.

Common mistake

Launching two "HA" app nodes and letting OCI place both in the same fault domain (or not checking). One rack/maintenance event takes out both. Explicitly set fault-domain placement (or use an instance pool with a spread policy) and verify it.
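Verifying placement is a small check once you have each instance's fault domain (for example from the Console or from the instance details returned by the API). The sketch below works on a plain mapping of instance name to fault domain; the instance names are hypothetical, and the fault-domain labels follow the FAULT-DOMAIN-1..3 naming OCI uses.

```python
from collections import Counter

FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def spread(instance_names):
    """Assign instances to fault domains round-robin."""
    return {name: FAULT_DOMAINS[i % len(FAULT_DOMAINS)]
            for i, name in enumerate(instance_names)}


def shared_fault_domains(placement):
    """Return the fault domains that host more than one instance."""
    counts = Counter(placement.values())
    return sorted(fd for fd, n in counts.items() if n > 1)


# Both nodes landed in the same fault domain: the "HA" pair shares a failure unit.
bad = {"app-1": "FAULT-DOMAIN-1", "app-2": "FAULT-DOMAIN-1"}
print(shared_fault_domains(bad))                         # ['FAULT-DOMAIN-1']

# Explicit round-robin placement keeps the pair apart.
print(shared_fault_domains(spread(["app-1", "app-2"])))  # []
```

With more than three nodes some sharing is unavoidable; the useful assertion then becomes that no fault domain holds all members of a tier.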

Autoscaling, instance pools, and configurations

Building block	Role
Instance Configuration	A template: shape, image, network, metadata/cloud-init, volumes. Immutable versioned blueprint.
Instance Pool	Manages a set of identical instances from a configuration, across FDs/ADs, with a target size.
Autoscaling	Adjusts pool size by metric (CPU/memory) thresholds or a schedule (e.g. scale down nights/weekends).
Cluster Networks	High-performance RDMA-connected pools for HPC/AI (ultra-low-latency interconnect).

Cost note - schedule-based autoscaling

Non-production compute rarely needs to run nights and weekends. A scheduled autoscaling policy (or a simple stop schedule) that shrinks Dev/Test pools to zero off-hours is one of the highest-ROI cost actions in OCI. Combine with the shutdown checklist in section 14.
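The off-hours rule is simple enough to state as a function, which also makes it easy to review with the application team. This is a sketch of the decision only, not of OCI's autoscaling configuration; the sizes and the 07:00 to 19:00 weekday window are assumed defaults.

```python
from datetime import datetime


def target_pool_size(now, business_size=4, off_hours_size=0,
                     start_hour=7, end_hour=19):
    """Return the desired pool size for a Dev/Test pool at the given local time."""
    is_weekday = now.weekday() < 5                 # Monday=0 ... Friday=4
    in_hours = start_hour <= now.hour < end_hour
    return business_size if (is_weekday and in_hours) else off_hours_size


print(target_pool_size(datetime(2026, 7, 6, 10, 0)))   # Monday morning   -> 4
print(target_pool_size(datetime(2026, 7, 6, 22, 0)))   # Monday night     -> 0
print(target_pool_size(datetime(2026, 7, 11, 10, 0)))  # Saturday morning -> 0
```

With these defaults the pool runs 60 of the 168 hours in a week, which is where the saving comes from.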

Choosing shapes by workload

Workload	Starting point	Why
General applications	Flexible AMD E-series VM (balanced OCPU:memory)	Cost-effective, dial exactly what you need
Oracle Database (IaaS)	Higher-memory flexible VM or bare metal; consider Base/Exadata service instead	Memory for SGA/PGA; bare metal for licensing isolation and top performance
Web servers	Ampere Arm A-series or small E-series, autoscaled	Excellent price/performance for stateless scale-out
Middleware (WebLogic, etc.)	Balanced flexible VM, memory-leaning	JVM heaps like memory; scale OCPU to concurrency
EBS application tier	Flexible VM sized to concurrent users; multiple nodes behind LB	Horizontal scale + HA across FDs
Batch processing	Preemptible or Arm pools, autoscaled/scheduled	Fault-tolerant, cheap, elastic
Memory-heavy (in-memory, caches, analytics)	High memory-per-OCPU flexible shape	Push memory up without paying for unneeded cores
CPU-heavy (compute, encoding)	High-OCPU flexible or dedicated; Arm for throughput	Cores are the bottleneck
AI/ML training & inference	GPU shapes (NVIDIA); cluster networks for multi-node	Accelerators + RDMA fabric

DBA note - IaaS DB vs managed DB

You can run Oracle Database on a plain compute instance, and sometimes must (specific versions/configs). But you then own patching, backup, HA, and Grid Infrastructure yourself. For most cases the Base Database Service or Exadata Database Service (section 6) removes that toil. Choose IaaS DB only when a managed option genuinely cannot meet the requirement.

Operational guidance

How to resize compute Ops

Flexible VM: change OCPU/memory - typically requires a reboot; plan a window. Scaling within the same shape family is straightforward.
Change shape family/architecture (e.g. Intel → Arm): not an in-place resize - rebuild from the image/config on the new shape (watch for architecture-specific binaries).
For pools: update the instance configuration and roll instances, or increase pool size and drain old ones.

How to move workloads safely Ops

Prefer rebuild-from-image over "lift the disk" - create a custom image, launch in the target compartment/AD/region, validate, cut over.
For cross-region moves, copy the custom image to the target region, or use Block Volume/boot volume cross-region replication.
Keep IPs stable with reserved public IPs and secondary private IPs where clients pin addresses; better, front with a load balancer or DNS name so moves are transparent.

How to troubleshoot boot / access issues Ops

Use the serial console / console connection to see boot output and reach the OS when SSH is dead (bad fstab, firewall lockout, failed service).
Check the instance's boot volume is healthy; you can detach it and attach to a rescue instance to fix an unbootable OS.
Confirm SSH key was injected (cloud-init) and the security rules/route allow 22 from your source.

How to troubleshoot performance Ops

Check Monitoring metrics: CPU utilization, memory, and especially block volume throughput/IOPS vs. the volume's performance tier limits.
A common surprise: the instance is not CPU-bound, the block volume is at its IOPS/throughput ceiling. Raise the volume's performance tier or use higher-VPU settings (section 5).
Network: verify you are not hitting shape bandwidth limits; larger shapes get more network bandwidth.
Right-size: autoscale or resize based on sustained utilization, not peak fear.

How to design compute for production Design

At least two nodes across different fault domains (and ADs where available) behind a load balancer.
Instance configuration + pool + autoscaling so capacity is elastic and nodes are reproducible.
Instance principals for API access; no stored keys.
OS Management for patch compliance; custom golden image + cloud-init for consistency.
Monitoring alarms on CPU/memory/volume; boot/block volume backups; capacity reservation for DR if RTO demands it.

OS management, patching, and recovery

OS Management (Hub) - manage OS updates/patch compliance across fleets of Oracle Linux (and other) instances from OCI.
Serial console connection - out-of-band access for recovery.
Instance recovery - stop/start moves the VM to healthy hardware; for corrupted OS, rescue via boot-volume detach/attach.

Compute troubleshooting quick runbook

⚑ Instance unreachable / SSH fails

Checks

Security rules + route allow 22 from your IP; instance has the right public/private IP path.
Instance is Running (not Stopped); boot completed - check serial console.
SSH key injected (cloud-init logs); correct user (opc for Oracle Linux).
OS firewall (firewalld) not blocking; fail2ban not locking you out.

CLI

oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance-console-connection create --instance-id <ocid> --ssh-public-key-file key.pub
oci compute instance action --instance-id <ocid> --action SOFTRESET

Prevention

Bastion service for SSH (no public IPs on hosts), standardized security rules, and console-connection procedures documented.

Official documentation: Compute overview →

5. Storage Deep Dive

Block, Object, and File storage in OCI - their performance models, backup and replication behavior, and the decision of which to use for databases, shared file systems, backups, archives, and data lakes.

Last reviewed: July 2026. Verify performance tiers, VPU values, and archive restore times in current docs.

TL;DR

Block Volume = network-attached disks for instances/databases (like SAN/iSCSI), with tunable performance (VPU). Object Storage = HTTP key-value store for backups, data lakes, artifacts, archives - not a file system. File Storage (FSS) = managed NFS for shared POSIX file systems. Block for boot/DB, Object for backups/archives/lakes, File for shared app filesystems. Archive tier is cheap but has a restore delay - plan for it.

The three storage services

	Block Volume	Object Storage	File Storage (FSS)
Interface	iSCSI / paravirtualized block device	REST/HTTP (S3-compatible, Swift)	NFS v3
Looks like	A disk you format & mount	Buckets of objects (no directories)	A shared mounted filesystem
Attached to	One instance at a time (or shared/multi-attach for clusters)	Nothing - accessed over network by URL	Many instances concurrently
Scale	Per-volume size limit; attach many	Effectively unlimited	Grows automatically to petabytes
Best for	Boot disks, DB datafiles, app storage needing block I/O	Backups, archives, images, logs, data lake, static content	Shared home dirs, app clusters, EBS shared APPL_TOP

Block volumes and boot volumes

Boot volume - the OS disk created with an instance. Can be backed up, cloned, and detached for rescue.
Block volume - additional data disks. Attach via iSCSI or paravirtualized; multi-attach for shared-disk clusters (e.g. some RAC/HA configs).
Performance tiers (VPU/GB): Lower Cost, Balanced, Higher Performance, and Ultra High Performance - set by Volume Performance Units (VPU) per GB. Higher VPU = more IOPS and throughput per GB. Auto-tune can lower performance (and cost) when a volume is detached/idle and raise it when in use.
Volume groups - group volumes (e.g. all of a DB's volumes) so backups/clones are crash-consistent across the set.
Backups - full or incremental, policy-based (scheduled), and can be copied cross-region for DR. Clones are instant copy-on-write; replication keeps a volume asynchronously mirrored to another region.

DBA note - size block volume performance, not just capacity

Database performance problems on IaaS are frequently block-volume I/O ceilings, not CPU. IOPS and throughput scale with size and VPU. A small, low-VPU volume will throttle a busy redo/datafile workload no matter how many OCPUs you add. Provision VPU for the I/O profile, use volume groups for consistent backups, and put redo/data on appropriately-tiered volumes.
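
The sizing logic in this note can be sketched as a small calculation. This is a minimal model that assumes IOPS scale linearly with size up to a per-volume ceiling; the function names are hypothetical, and the per-GB rate and ceiling are inputs you must take from the current Block Volume documentation for the tier you choose, not values supplied here.

```python
import math


def volume_iops(size_gb, iops_per_gb, max_iops_per_volume):
    """Estimated IOPS: linear in size until the per-volume ceiling applies."""
    return min(size_gb * iops_per_gb, max_iops_per_volume)


def min_size_gb_for_iops(target_iops, iops_per_gb, max_iops_per_volume):
    """Smallest size that reaches target_iops, or None if the tier cannot."""
    if target_iops > max_iops_per_volume:
        return None  # needs a higher tier or several striped volumes
    return math.ceil(target_iops / iops_per_gb)
```

With placeholder inputs of 60 IOPS/GB and a 25,000 IOPS ceiling, a 100 GB volume is estimated at 6,000 IOPS, and reaching 12,000 IOPS needs at least 200 GB. That is the point of the note: a volume sized only for capacity can sit far below the I/O the database needs.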

Cost note

Block storage cost scales with GB and VPU. Don't put everything on Ultra High Performance. Use Balanced as a default, Higher/Ultra for hot DB volumes, and Lower Cost + auto-tune for cold/detached volumes. Delete orphaned volumes and stale backups (section 14).

Object Storage

Namespace - a tenancy-wide unique container name; buckets live in it, scoped to a compartment and region.
Storage tiers: Standard (hot, frequent access), Infrequent Access (cheaper storage, retrieval fee), Archive (cheapest, must be restored before reading, with a restore delay). Auto-Tiering can move objects between Standard/IA based on access.
Multipart upload - upload large objects in parallel parts; required/recommended for big files (backups, images).
Pre-Authenticated Requests (PARs) - time-boxed URLs granting access to a bucket/object without IAM credentials - handy but a sharing risk if leaked.
Lifecycle rules - auto-transition objects to Archive or delete them after N days.
Retention rules & versioning - retention locks objects against deletion for a period (WORM-style, supports compliance); versioning keeps prior versions.
Replication - asynchronously replicate a bucket to another region/bucket for DR.

Common mistake - Object Storage is not a file system

There are no real directories - the "/" in object names is just a naming convention, and there is no in-place random write or POSIX locking. Do not try to mount a bucket and run a database or app that expects file semantics on it. Use Block or File storage for that; use Object Storage for whole-object put/get workloads (backups, artifacts, media, lake data).

Security note - PARs and public buckets

A Pre-Authenticated Request is a bearer URL: anyone with the link has the access it grants until it expires. Never make buckets public unless the content is genuinely public. Prefer short PAR lifetimes, or IAM + instance principals. Turn on versioning + retention for backup buckets so ransomware/accidental deletes are recoverable, and audit bucket visibility regularly (Cloud Guard flags public buckets).
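
One way to act on "prefer short PAR lifetimes" is a periodic inventory check. The sketch below assumes you have already exported the PAR list into plain dictionaries; the keys "name" and "time_expires" and the 7-day threshold are illustrative assumptions, not the service's response format or an Oracle recommendation.

```python
from datetime import timedelta


def long_lived_pars(pars, now, max_lifetime=timedelta(days=7)):
    """Return names of active PARs that expire later than now + max_lifetime."""
    flagged = []
    for par in pars:
        expires = par["time_expires"]
        if expires <= now:
            continue  # already expired, no longer grants access
        if expires - now > max_lifetime:
            flagged.append(par["name"])
    return flagged
```

Anything the function flags is a candidate to delete and reissue with a shorter lifetime, or to replace with IAM-based access.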

File Storage Service (FSS)

File system - the NFS-exported filesystem; grows automatically, snapshots supported.
Mount target - the NFS endpoint (with a private IP in your subnet) that instances mount. It carries the export set.
Export paths & NFS export options - control which CIDRs/hosts can mount, and with what access (read/write, root squash).
Snapshots - point-in-time, space-efficient; replication mirrors a file system to another region for DR.

Security note - NFS export options are your access control

FSS security is enforced by the mount target's NFS export options (source CIDR, access level, root-squash) plus the subnet's security rules on NFS ports. A mount target open to the whole VCN lets any instance mount sensitive shares. Restrict export options to the specific client CIDRs, enable root squash where appropriate, and lock the NFS ports in the NSG.
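
The two checks in this note (source CIDR scope and root squash) are easy to automate once export settings are in hand. This is a minimal sketch using Python's ipaddress module; the dictionary keys "path", "source_cidr", and "root_squash" are hypothetical names for illustration and do not mirror the FSS API fields.

```python
import ipaddress


def export_findings(exports, allowed_cidrs):
    """Flag exports open beyond the allowed client CIDRs or lacking root squash."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    findings = []
    for export in exports:
        source = ipaddress.ip_network(export["source_cidr"])
        # subnet_of is True only when source sits entirely inside an allowed range
        if not any(source.subnet_of(net) for net in allowed):
            findings.append(
                (export["path"], f"source {source} is outside the allowed client CIDRs")
            )
        if not export["root_squash"]:
            findings.append((export["path"], "root squash disabled"))
    return findings
```

An export with source 10.0.0.0/16 checked against an allowed list of 10.0.1.0/24 is flagged, because the whole VCN range is wider than the app-tier subnet.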

When to use which

Need	Use	Why
OS boot disk	Boot Volume (Block)	Block I/O, bootable, backup/clone
Database datafiles / redo	Block Volume (right VPU) or managed DB storage	Low-latency block I/O; tune performance
Shared filesystem for an app cluster	File Storage (FSS)	Concurrent POSIX access from many nodes
Database backups (RMAN)	Object Storage (via Service Gateway)	Durable, cheap, off-host, cross-region copy
Log/data archive, long retention	Object Storage Archive tier + lifecycle + retention	Lowest cost, WORM compliance
Data lake	Object Storage (Standard)	Effectively unlimited scale, queried by analytics services
EBS shared APPL_TOP / concurrent tier	File Storage (FSS)	Shared filesystem semantics EBS expects
Static website / media	Object Storage + PAR/CDN	HTTP-native object serving

Practical examples

RMAN backup to Object Storage (DBA)

Configure RMAN to write to Object Storage via the Oracle Database Cloud Backup Module (or the DBaaS backup tooling on managed DBs). Traffic should traverse the Service Gateway, not NAT/internet. Enable a lifecycle rule to move older backup pieces to Archive, plus versioning + retention on the bucket for immutability. For DR, enable cross-region bucket replication or copy backups to the DR region.

DBA note

On Base/Exadata/Autonomous, backups to Object Storage are largely managed - you set retention and, for Autonomous, the whole cycle is automatic. On IaaS DB you own the RMAN config, the backup module install, and the Service Gateway routing.

Application shared file system (Apps)

Create an FSS file system + mount target in the app subnet, restrict export options to the app-tier CIDR/NSG, mount on all app nodes. Snapshot on a schedule; replicate to the DR region. This is the standard shared storage for EBS APPL_TOP, WebLogic domains, or any scale-out app needing a common filesystem.

Log archive with lifecycle (Ops)

Ship logs (via Service Connector Hub) to an Object Storage bucket. Lifecycle rule: Standard for 30 days → Archive for long-term → delete after the compliance period. Retention rule prevents early deletion during the required window.
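
The outcome of that lifecycle and retention setup can be modelled in a few lines. In OCI this is configured as bucket lifecycle and retention rules, not code; the sketch below only illustrates what should happen to an object of a given age. The function names and the 2555-day (roughly seven-year) default are assumptions for illustration.

```python
def lifecycle_action(age_days, archive_after_days=30, delete_after_days=2555):
    """Tier an object should be in, or 'delete', based on its age in days."""
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= archive_after_days:
        return "archive"
    return "standard"


def deletion_allowed(age_days, retention_days):
    """A retention rule blocks deletion until the retention window has passed."""
    return age_days >= retention_days
```

Keeping the retention window no longer than the lifecycle delete age avoids a rule that tries to delete objects the retention rule still protects.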

Data transfer options

Data Transfer Service (disk/appliance) - ship physical media/appliances to Oracle to bulk-load very large datasets when network transfer is impractical.
Online - CLI/SDK multipart upload, oci os object bulk-upload, or rclone and Storage Gateway-style tooling for ongoing sync.

Storage gotchas

Object Storage is not a filesystem - no random writes, no POSIX locks, no real directories.
Archive restore delay - Archive objects must be restored before reading, and the restore can take on the order of hours. Never put anything you might need immediately in Archive.
Block volume undersized for IOPS - performance follows size and VPU; a small volume throttles regardless of CPU.
NFS mount target too open - lock export options to specific CIDRs and enable root squash.
Backup policy gaps - a volume/DB with no backup policy attached is silently unprotected. Audit that every prod volume has a policy.
Cross-region copy cost & timing - replication/copy incurs egress and takes time; a "DR copy" that lags your RPO is not DR. Measure it.
Detached-but-billed volumes - deleting an instance may leave block volumes behind, still billing. Clean them up.
PAR sprawl - untracked pre-authenticated URLs are a data-exfiltration risk. Inventory and expire them.
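
Two of these gotchas (detached-but-billed volumes and backup policy gaps) lend themselves to a scheduled audit. The sketch below assumes you have already collected the volume list, the set of attached volume IDs, and the set of volume IDs with a backup policy assignment; the data shapes and the function name are hypothetical.

```python
def audit_volumes(volumes, attached_ids, policy_assigned_ids):
    """Report volumes that are unattached or have no backup policy."""
    report = {"detached": [], "no_backup_policy": []}
    for volume in volumes:
        if volume["id"] not in attached_ids:
            report["detached"].append(volume["name"])
        if volume["id"] not in policy_assigned_ids:
            report["no_backup_policy"].append(volume["name"])
    return report
```

A detached volume is not automatically waste (it may be a deliberate rescue copy), so treat the report as a review list, not a delete list.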

Official documentation: Block Volume, Object Storage & File Storage →

6. Database Services Deep Dive

OCI's database portfolio, from self-managed IaaS through Base Database and Exadata to fully-managed Autonomous - what each one manages for you, how HA/DR/backup/patching differ, and how to choose.

Last reviewed: July 2026. DB service names, versions (23ai/26ai), and features change - verify in current docs.

In this section: Portfolio map, Service deep dives, Decision table, HA & DR, Patching & backup, Data tooling, Licensing, Enterprise examples

TL;DR

Pick along a spectrum of control vs. toil. DB on IaaS = full control, all the work. Base Database Service = managed VM DB systems, you still trigger patches and own the guest OS. Exadata Database Service / Cloud@Customer = engineered-system performance and scale, RAC built in. Autonomous Database = Oracle runs patching, tuning, backup, and scaling; you own schema, SQL, and data. Choose the least you need to manage that still meets performance, control, and compliance requirements.

The portfolio at a glance

Service	Form	You manage	Oracle manages	Sweet spot
DB on Compute (IaaS)	You install Oracle DB on a VM/BM	Everything above the hypervisor	Infra only	Special versions/configs a managed service can't do
Base Database Service	Managed VM DB system (single node or 2-node RAC)	Guest OS, patch scheduling, schema	Provisioning, patch tooling, backup automation	Small-to-mid Oracle DBs wanting managed lifecycle
Exadata Database Service (ExaDB-D)	Exadata infra in OCI, VM clusters	Databases, patch scheduling, schema	Exadata hardware, storage cells, RAC substrate	Large, high-performance, consolidation, mission-critical
Exadata Cloud@Customer (ExaCC)	Exadata in your data center, OCI-managed	Databases, schema	Full stack, remotely operated by Oracle	Data residency / low-latency-to-on-prem with cloud ops
Autonomous Database	Self-driving (ATP/ADW/AJD)	Schema, SQL, data, access	Patching, tuning, backup, scaling, much of security	Most new OLTP/DW/JSON where you want minimal DBA toil

Service deep dives

Base Database Service

Managed Oracle Database on VM DB systems. You choose Standard/Enterprise Edition, version, and shape; OCI provisions the VM(s), Grid Infrastructure, and database, and provides managed backup and patching workflows.

Topologies: single-node VM DB system, or 2-node RAC VM DB system for node HA.
You still own: the guest OS (including OS patching), schema, SQL, and the timing of quarterly patches. Oracle provides the DB patch bundles; you decide when to apply them.
Backups: automatic backups to Object Storage with a retention you set; point-in-time restore.
Data Guard: one-click association to a standby (same or cross-region).

DBA note

Base Database feels closest to what an on-prem DBA already does - you still run RMAN concepts, Data Guard, and patching, but the provisioning and much of the plumbing is automated. Good stepping stone from on-prem before adopting Autonomous.

Exadata Database Service & Cloud@Customer

The Exadata engineered system delivered as a cloud service: scale-out compute (DB servers) + intelligent storage cells with Smart Scan, storage indexes, and flash. Databases run as RAC across the VM cluster.

ExaDB-D (Dedicated): Exadata infrastructure in an OCI region; you create VM clusters and databases on it. Elastic scaling of DB and storage servers.
ExaCC: the same Exadata hardware placed in your data center, control plane in OCI, operated by Oracle - for data-residency or ultra-low-latency-to-on-prem needs.
Why Exadata: Smart Scan offloads query processing to storage, huge consolidation density, consistent low latency, built-in RAC HA, and the top end of Oracle DB performance.
You manage: databases, PDBs, patch scheduling (Oracle provides one-click patching of the infra and DB), schema, and tuning.

Architect note - when Exadata is the right fit

Choose Exadata Database Service when you have large, latency-sensitive, or consolidated Oracle estates - dozens/hundreds of databases, heavy mixed OLTP+analytics, or workloads already tuned for Exadata features (Smart Scan, HCC). For a handful of modest databases it is over-provisioned; Base Database or Autonomous will be cheaper and simpler.

Autonomous Database

Self-managing Oracle Database. Oracle automates patching, tuning, backups, scaling, and much of the security. Workload flavors share one engine:

ATP (Transaction Processing) - OLTP/mixed, optimized for many short transactions.
ADW (Data Warehouse) - analytics, optimized for scans/aggregations, columnar.
AJD (JSON Database) - document/JSON-centric, SODA APIs.
Also APEX Service and (verify) transaction/AI-vector capabilities in 23ai/26ai.

Deployment models:

	Serverless	Dedicated
Infra	Shared, Oracle-managed Exadata fleet	Exadata infra dedicated to you
Isolation	Logical	Physical - your own infra
Control	Least ops, fastest start	More control (maintenance windows, isolation policies)
Use for	Most workloads, dev/test, variable load	Regulated/large estates wanting private Autonomous

Autoscaling: OCPU and storage can auto-scale (e.g. up to 3x base OCPU) to absorb spikes; scale to/near zero for dev with auto-stop.
Autonomous Data Guard: one-click managed standby (in-region or cross-region) with automatic failover options.
Backups: fully automatic with a retention you choose; point-in-time restore.
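
The autoscaling bullet above can be read as a simple clamp: consumption follows demand, never drops below the base allocation, and never exceeds a multiple of it. The sketch below models that reading using the 3x example from the text; the function name is hypothetical and the real service's scaling behavior and billing granularity should be checked in current docs.

```python
import math


def effective_ocpus(base_ocpus, demand_ocpus, autoscale=True, max_factor=3):
    """OCPUs available to the workload under a base-to-ceiling autoscale model."""
    if not autoscale:
        return base_ocpus
    ceiling = base_ocpus * max_factor
    return max(base_ocpus, min(math.ceil(demand_ocpus), ceiling))
```

With a base of 4 OCPUs, a demand of 10 is served in full, while a demand of 20 is capped at 12. If spikes regularly exceed the ceiling, raise the base allocation instead of relying on autoscaling.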

DBA note - what you still do on Autonomous

You are not out of a job - you own schema design, indexing strategy (beyond auto-indexing), SQL quality, data modeling, partitioning choices, access control, and application performance. Autonomous removes infrastructure and routine maintenance toil, not data architecture. Some features and privileges are restricted (no SYSDBA in the traditional sense, limited OS access), which is exactly what trips up lift-and-shift of DBs that rely on OS-level hooks.

When NOT to use Autonomous

Apps requiring specific unsupported features, custom OS packages, non-standard init parameters, or direct OS/filesystem access.
Databases pinned to a version/patch level Autonomous won't run.
Certified packaged apps (some EBS/Siebel configs) that require Base/Exadata, not Autonomous - check certification.
Workloads needing very specific licensing or hard-partitioning arrangements.

RAC, CDB/PDB, and encryption

RAC (Real Application Clusters): multiple DB instances on multiple nodes serving one database for node-level HA and scale. Built into Exadata and available as 2-node on Base Database. Survives a node failure with brief brownout; not a DR substitute (it is one site).
CDB/PDB (multitenant): a Container Database hosts Pluggable Databases. PDBs are the unit of consolidation, cloning, and mobility - you can clone/relocate a PDB, and Autonomous/Exadata lean heavily on the multitenant model. Great for consolidating many databases with isolation.
TDE (Transparent Data Encryption): encryption at rest is standard in OCI databases. Keys can be Oracle-managed or customer-managed via OCI Vault (or Oracle Key Vault / External Key Management). Encryption in transit uses TLS/native network encryption.
Database Vault / Data Safe: Database Vault enforces separation of duties (even DBAs can't see app data) where licensed; Data Safe provides assessment, masking, and activity auditing (see Data tooling and operations below).

Security note - customer-managed keys

For regulated data, use customer-managed TDE keys in OCI Vault so you control key rotation and can revoke access to the data by disabling the key. Oracle-managed keys are simpler but give you less control. Decide per data-classification; wire the DB's key to a Vault in a locked-down security compartment.

Database service decision table

Workload	Recommended service	Reason	HA	DR	Ops responsibility	Cost lever
App OLTP (new build)	Autonomous (ATP) Serverless	Minimal toil, autoscale	Built-in	Autonomous Data Guard	Schema/SQL only	Autoscale + auto-stop dev
Data warehouse	Autonomous (ADW)	Columnar, scan-optimized, elastic	Built-in	Autonomous DG cross-region	Model + load	Scale OCPU to query load
Reporting DB	ADW or Exadata (if huge)	Read-heavy analytics	Built-in / RAC	DG / backup	Model + tuning	Scale for report windows
EBS database	Base Database or Exadata (per size/cert)	Certification + control needed	2-node RAC	Data Guard	Patch + schema	Right-size OCPU; BYOL
Consolidated estate	Exadata (ExaDB-D)	Density + performance + PDB isolation	RAC	Data Guard	DBs + patch schedule	Consolidate; scale cells
Dev/test DB	Autonomous (auto-stop) or Base single-node	Cheapest managed option	N/A	Backup	Minimal	Auto-stop off-hours
AI / vector search	23ai/26ai (Autonomous or Exadata) with AI Vector Search	In-DB vectors + SQL	Built-in/RAC	DG	Schema + embeddings	Scale for embedding jobs
JSON / document	Autonomous JSON (AJD)	SODA, JSON-native, low cost	Built-in	DG	Collections	Serverless autoscale
Special version/config	DB on Compute (IaaS)	Full control when managed can't	You build it	You build it	Everything	Right-size; BYOL

How HA and DR work across services

Service	In-region HA	DR
DB on IaaS	You build RAC / clustering / FD spread	You configure Data Guard + cross-region
Base Database	2-node RAC option; FD placement	One-click Data Guard (in/cross-region)
Exadata (ExaDB-D/CC)	RAC across cluster nodes (built-in)	Data Guard / Active Data Guard to another Exadata
Autonomous	Built into the platform	Autonomous Data Guard (managed standby, optional auto-failover)

Architect note - RAC is HA, Data Guard is DR

Do not conflate them. RAC protects against node/instance failure within one site (fast, no data movement). Data Guard maintains a physically separate standby (another AD/region) that protects against site/region loss and corruption, with configurable sync (zero data loss with SYNC/Far Sync) or async. Mission-critical Oracle typically wants both: RAC for uptime, Active Data Guard cross-region for DR and read-offload.
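
With asynchronous Data Guard, whether the standby actually meets your RPO is a measurement, not a setting. The sketch below summarizes a series of lag samples against an RPO target; it assumes you collect the standby's lag in seconds from your own monitoring, and the function name and report keys are hypothetical.

```python
def rpo_report(lag_seconds, rpo_seconds):
    """Summarize standby lag samples against an RPO target (both in seconds)."""
    if not lag_seconds:
        raise ValueError("need at least one lag sample")
    breaches = [lag for lag in lag_seconds if lag > rpo_seconds]
    return {
        "worst_lag": max(lag_seconds),
        "breach_ratio": len(breaches) / len(lag_seconds),
        "meets_rpo": not breaches,
    }
```

Samples of 5, 12, 300, and 8 seconds against a 60-second RPO give a worst lag of 300 and a breach ratio of 0.25, so that standby does not meet the target even though it is usually close to current.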

Patching and backup differences

Service	Patching	Backup
DB on IaaS	Entirely you (OS + Grid + DB)	You configure RMAN + destination
Base Database	Oracle provides patch bundles; you schedule/apply	Automatic to Object Storage; you set retention
Exadata	One-click infra + DB patching; you schedule maintenance	Managed backup to Object Storage / local; retention configurable
Autonomous	Fully automatic (near-zero downtime), you do nothing	Fully automatic + point-in-time; you set retention window

DBA note - test restores, always

Managed backups still need restore testing. "Backups are automatic" does not prove recoverability - periodically restore to a clone and validate. On Autonomous, use point-in-time clone to a test instance; on Base/Exadata, script a periodic restore to a scratch system and check open + data integrity.

Data tooling and operations

Data Safe

Security assessment, user assessment, data discovery, masking, and activity auditing for your OCI databases. Start here for DB security posture.

Database Management

Monitoring, performance, and fleet management for databases (managed and, via agent, on-prem). Performance Hub, SQL insights.

Operations Insights

Capacity planning, SQL/resource analytics, and forecasting across the DB fleet - warehouse-style analytics on your database performance data.

Performance Hub

Real-time and historical ASH/AWR-style performance analysis in the Console for OCI databases.

GoldenGate

Real-time replication and CDC - migrations with minimal downtime, active/active, and streaming data pipelines.

Zero Downtime Migration (ZDM) & DMS

ZDM automates Data Guard-based migrations; Database Migration Service orchestrates online/offline migrations to OCI. See section 13.

SQL tuning options

Performance Hub + SQL Tuning Advisor + automatic indexing (Autonomous) + Operations Insights cover most tuning workflows. On Autonomous, auto-indexing and automatic SQL plan management do a lot; you still validate execution plans for critical SQL.

Licensing: BYOL vs License Included

	License Included (LI)	Bring Your Own License (BYOL)
What it is	The service price bundles the Oracle DB license	You apply existing on-prem Oracle licenses to the cloud service
Best when	New workloads, no existing licenses, want simplicity	You have unused/EE licenses and options; usually lower run cost
Watch for	Higher per-OCPU rate	Correct edition/options mapping, core factor, compliance with LMS

Cost note - BYOL is usually the biggest DB lever

For Oracle shops with existing Enterprise Edition + options, BYOL often cuts the effective database run-rate substantially versus License Included. But map editions/options carefully (Autonomous BYOL, Exadata BYOL, and options like RAC/Partitioning/Advanced Security have specific rules) and keep LMS-defensible records. Combine BYOL with right-sizing OCPUs, since license requirements scale with core count.

Enterprise examples

Oracle E-Business Suite database (Apps DBA)

Typically Base Database Service (2-node RAC) or Exadata depending on size and certification. Data Guard cross-region for DR. App tier on compute behind a load balancer, shared APPL_TOP on FSS. BYOL for the DB. This is the classic Apps DBA lift-and-shift; verify EBS certification for the target DB service and version.

Enterprise data warehouse (Analytics)

ADW for most; Exadata if extreme scale or already Exadata-tuned. Object Storage data lake feeding it; Data Integration/GoldenGate for loads; Oracle Analytics Cloud on top. Autoscale for reporting windows.

Consolidated database platform (Platform)

Exadata (ExaDB-D) with many PDBs for isolation and density. Standardized patch windows, Data Guard to a second region, Data Safe for auditing across the estate, Operations Insights for capacity planning.

AI / vector search database (AI)

Oracle Database 23ai/26ai (Autonomous or Exadata) using AI Vector Search to store embeddings alongside relational data and run similarity search in SQL - the backbone of RAG over enterprise data (see section 12). Keep it governed: agents query through a read-only/serving layer, not ad-hoc against production.

Official documentation: OCI Database services →

7. Load Balancing and Traffic Management

The Layer-7 Load Balancer and Layer-4 Network Load Balancer, their listeners/backend sets/health checks, SSL handling, routing, and a disciplined approach to the most common failure: unhealthy backends.

Last reviewed: July 2026. Verify bandwidth shapes and certificate service details in current docs.

In this section: LB vs NLB, Anatomy, SSL & certificates, Routing, When to use which, Troubleshooting

TL;DR

Two products: the Load Balancer (LBaaS) is Layer 7 (HTTP/HTTPS-aware: SSL termination, path/host routing, cookies) with a flexible bandwidth shape; the Network Load Balancer (NLB) is Layer 4 (TCP/UDP, ultra-low latency, preserves source IP, scales huge). Both can be public or private. A load balancer is built from listeners → backend sets → backends with health checks. The number-one issue is a backend marked unhealthy because a security rule blocks the health-check probe.

Load Balancer vs Network Load Balancer

	Load Balancer (LBaaS, L7)	Network Load Balancer (NLB, L4)
Layer	7 (HTTP/HTTPS/TCP)	4 (TCP/UDP/ICMP)
Features	SSL termination/E2E, path & host routing, cookie persistence, WAF integration	Pass-through, preserves client source IP, very low latency, extreme scale
Source IP	Rewritten (adds X-Forwarded-For)	Preserved (great for apps that need real client IP)
Sizing	Flexible bandwidth (min/max Mbps)	Scales automatically, high throughput
Use for	Web/API tiers needing HTTP intelligence	Non-HTTP, high-throughput, or source-IP-sensitive services, including database endpoints fronted at L4

Both come in public (internet-facing, in a public subnet) and private (internal, in a private subnet) variants.

Anatomy: listeners, backend sets, backends, health checks

Listener terminates the connection, backend set balances across backends (spread over fault domains), health checks continuously probe each backend.

Listener - the front-end port/protocol (e.g. 443/HTTPS). Handles SSL, routing rules, and WAF.
Backend set - a group of backends plus the balancing policy (round robin, least connections, IP hash) and the health check definition.
Backend - an actual server (IP:port), weighted, drained gracefully during maintenance.
Health check - protocol/port/path/interval defining "healthy." Unhealthy backends are pulled from rotation.
Session persistence - cookie-based (LB-generated or app cookie) or IP-based, to pin a client to a backend.
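
The three balancing policies named above differ only in how the next backend is chosen. The sketch below shows the idea behind each in plain Python; it illustrates the general techniques and is not the load balancer's actual implementation (in particular, the hash function used for IP hash here is an arbitrary choice).

```python
import hashlib
import itertools


def round_robin(backends):
    """Hand out backends in a repeating fixed order."""
    return itertools.cycle(backends)


def least_connections(active_connections):
    """Pick the backend with the fewest open connections (first wins ties)."""
    return min(active_connections, key=active_connections.get)


def ip_hash(client_ip, backends):
    """Map a client IP to a stable backend via a hash of the address."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Round robin suits uniform requests, least connections suits uneven request durations, and IP hash gives per-client stickiness without cookies, at the price of reshuffling clients whenever the backend list changes.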

SSL termination, end-to-end SSL, certificates

Mode	Where TLS terminates	Use when
SSL termination	At the LB; plaintext to backends	Offload crypto from backends; inspect/route on content; backends in trusted private subnet
End-to-end SSL	LB terminates then re-encrypts to backend	Compliance requiring encryption in transit all the way; LB still does L7
SSL pass-through (NLB)	Not terminated - backend does TLS	Backend must own the cert / mTLS; L4 only

Security note - manage certs in the Certificates service

Use the OCI Certificates service (or import) to manage LB certificates, with rotation. Terminating TLS at the LB centralizes cert management and lets WAF inspect traffic, but means the LB-to-backend hop is plaintext unless you use end-to-end SSL - keep that hop inside a private subnet with tight NSGs, or use E2E SSL for regulated data.

Hostname and path-based routing

Hostname-based routing - one LB serves multiple virtual hosts (api.example.com vs app.example.com) to different backend sets.
Path-based routing - route by URL path (/api/* → API backends, /static/* → static backends).
Rule sets - header manipulation, redirects (HTTP→HTTPS), access control by source.
Logging - enable access and error logs to the Logging service for traffic analysis and troubleshooting.
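
Hostname and path routing amount to an ordered rule match with a default. The sketch below models that with first-match-wins rules; the rule format, the function name, and the backend set names are hypothetical, and the real service's matching and precedence rules should be taken from its documentation.

```python
from fnmatch import fnmatchcase


def choose_backend_set(host, path, rules, default):
    """Return the backend set of the first rule matching host and path."""
    for rule_host, path_glob, backend_set in rules:
        # A rule host of None means "any virtual host"
        host_ok = rule_host is None or rule_host == host.lower()
        if host_ok and fnmatchcase(path, path_glob):
            return backend_set
    return default


RULES = [
    ("api.example.com", "*", "api-backends"),
    (None, "/api/*", "api-backends"),
    (None, "/static/*", "static-backends"),
]
```

With these rules, a request for app.example.com/static/css/a.css goes to static-backends and a request for app.example.com/ falls through to the default. Because the first match wins, put the most specific rules first.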

When to use which

Need	Use
Public web / API with HTTPS, routing, WAF	Public Load Balancer (L7) + WAF policy
Internal microservice traffic	Private Load Balancer (or NLB for pure L4)
Very high throughput, non-HTTP, need real client IP	Network Load Balancer
TLS/mTLS must terminate on the backend	NLB pass-through
Kubernetes service of type LoadBalancer	OKE provisions an LB or NLB via annotations (see section 10)

Load balancer troubleshooting

⚑ Backend shows unhealthy / 502

Symptoms

Backends "Critical/Warning" in the LB health page; clients get 502/503 or intermittent errors.

Likely causes (in order)

Security rules block the probe: the backend NSG/security list must allow the health-check source (the LB subnet/NSG) on the backend port. This is the most common cause.
Health check misconfigured: wrong port, path, protocol, or expected status code vs. what the app actually returns.
App not listening / wrong bind: service down, or bound to 127.0.0.1 not 0.0.0.0.
OS firewall on the backend dropping the probe.
Route table issue between LB and backend subnet (rare within one VCN, common across peered VCNs).
SSL mismatch: health check uses HTTPS but backend serves HTTP (or cert invalid in E2E mode).

Checks

From a host in the LB subnet, curl the backend's health-check URL directly - does it return the expected code?
Confirm the backend NSG allows the LB source on the port; run Network Path Analyzer LB→backend.
Check the app is listening: ss -tlnp | grep <port>; check bind address.
Review LB error logs (Logging service).
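
The first check (hit the health-check URL directly and compare the status code) can be scripted so it is repeatable. The sketch below is a minimal stand-in for the curl test using Python's standard library; the function name, the default path /health, and the default expected status are assumptions, so pass the values from your own health check policy.

```python
import http.client


def probe(host, port, path="/health", expected_status=200, timeout=3.0, use_tls=False):
    """Send one GET like a health check would and report (ok, reason)."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        # Covers refused/timed-out connections, TLS failures, and protocol mismatches
        return False, f"connection failed: {exc}"
    finally:
        conn.close()
    if status != expected_status:
        return False, f"got {status}, expected {expected_status}"
    return True, "healthy"
```

Run it from a host in the LB subnet. A connection failure points at security rules, the OS firewall, or the bind address; a wrong status code points at the health check path or the expected-status setting.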

Console path

Networking > Load Balancers > (LB) > Backend Sets > Health; and Backend Sets > Health Check Policy.

Fix / prevention

Open the health-check port from the LB source in the backend NSG; align the health-check path/port/protocol/status; ensure the app binds to all interfaces; template the LB + NSG in Terraform so every environment matches.

Common LB mistakes

Health check on the wrong port (LB listener 443 but backend serves 8080 - the check must target the backend port).
Forgetting the health-check probe source when writing backend NSG rules.
Wrong listener protocol (TCP vs HTTP) - HTTP features/routing require an HTTP listener.
App bound to localhost only, so it works on the host but not through the LB.
Certificate expired or chain incomplete on end-to-end SSL backends.
No backend spread across fault domains - a single FD loss takes all backends.
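
The "bound to localhost only" mistake is easy to confirm from ss output. The sketch below parses the text produced by ss -tln (or ss -tlnp) and reports whether a port is listening only on loopback; it assumes the usual column layout where the fourth column is Local Address:Port, and the function names are hypothetical.

```python
def listen_addresses(ss_output, port):
    """Return the local addresses listening on `port` in `ss -tln` output."""
    addresses = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skips the header row and blank lines
        address, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            addresses.append(address)
    return addresses


def loopback_only(ss_output, port):
    """True when the port is listening, but only on loopback addresses."""
    addresses = listen_addresses(ss_output, port)
    return bool(addresses) and all(
        a.startswith("127.") or a == "[::1]" for a in addresses
    )
```

If loopback_only returns True for the backend port, the app answers local requests on the host but the load balancer's probes and traffic can never reach it.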

Official documentation: Load Balancer & Network Load Balancer →

8. Security Deep Dive

Defense in depth on OCI: identity, network, data, and detective controls - plus concrete guidance for securing a production tenancy, databases, Object Storage, and public endpoints, ending in a production security checklist.

Last reviewed: July 2026. Security service capabilities evolve - verify Cloud Guard/Security Zones features in current docs.

In this section: Principles, The layers, Vault, keys, secrets, Cloud Guard, Zones, scanning, Data & DB security, How to secure X, Production checklist

TL;DR

Security in OCI is layered: IAM (least-privilege policies, MFA, principals), network (private subnets, NSGs, no public IPs, WAF/Network Firewall), data (TDE with customer-managed keys in Vault, Object Storage retention, Data Safe), and detection (Cloud Guard, Security Zones, Vulnerability Scanning, Audit + Logging). Reduce public exposure, encrypt with keys you control, watch everything, and enforce guardrails that make the insecure thing impossible - not just discouraged.

Security design principles

Least privilege - lowest verb, narrowest resource type, lowest compartment, conditions where useful. Separate admin/operator/read-only.
Reduce blast radius - compartments, separate VCNs, separate keys per classification, break-glass isolation.
Private by default - no public IPs unless required; front-facing only via LB/WAF; Bastion for admin access.
Encrypt everywhere - at rest (TDE, block/object/file encryption) and in transit (TLS); customer-managed keys for sensitive data.
Guardrails over guidelines - Security Zones and quotas that prevent misconfiguration beat policies that merely recommend it.
Assume breach - detect and audit - Cloud Guard, Audit (immutable), Logging, alarms on anomalies. You cannot respond to what you cannot see.

The control layers

Layer	Controls	Key services
Identity	Who can do what	IAM policies, Identity Domains, MFA, federation, instance/resource principals
Network	What can reach what	Private subnets, NSGs/security lists, gateways, Bastion, WAF, Network Firewall, DDoS protection
Data	Protect data at rest/in transit	Vault (KMS), TDE, Object Storage encryption/retention, Data Safe, certificates
Detective / posture	Find and stop misconfig & threats	Cloud Guard, Security Zones, Vulnerability Scanning, Audit, Logging, Logging Analytics

Vault: keys, secrets, certificates

Vault - managed key management (KMS/HSM-backed). Create keys for TDE, block/object/file encryption, and app-level crypto.
Oracle-managed vs customer-managed keys - Oracle-managed is default and simplest; customer-managed lets you control rotation and revoke access to encrypted data by disabling the key. Vaults can be software or HSM-protected (higher assurance).
Secrets - store DB passwords, API tokens, wallets as versioned secrets; apps fetch them at runtime via principals - never bake secrets into images or code.
Certificates - managed TLS certs and CAs for load balancers and services, with rotation.

Security note - put Vault in a locked-down compartment

Keep Vaults, keys, and secrets in a dedicated security/shared-services compartment where only a small key-admin group has manage rights, and workloads get use (encrypt/decrypt/read-secret) via dynamic-group policies - not manage. Disabling a customer-managed key is your emergency "make this data unreadable" switch; that power must be tightly held and audited.

Cloud Guard, Security Zones, Vulnerability Scanning

Cloud Guard

Continuously detects misconfigurations and risky activity (public buckets, over-permissive policies, exposed ports, risky IAM) across the tenancy, scores them, and can auto-remediate via responders. Turn it on tenancy-wide.

Security Zones

Attach a policy-enforced recipe to a compartment that blocks non-compliant actions outright - e.g. no public subnets, no unencrypted volumes, no unapproved keys. Preventive, not just detective.

Vulnerability Scanning

Scans compute instances and container images for CVEs and open ports; schedule recurring scans and feed results into your patch process.

Bastion service

Time-boxed, audited SSH/RDP sessions to private hosts without public IPs or a standing jump box. Sessions expire; access is policy-controlled.

Web Application Firewall

OWASP protection, bot management, rate limiting, and geo/IP rules in front of public HTTP endpoints; attach to load balancers.

Network Firewall

Managed next-gen firewall (Palo Alto-based) in a hub VCN for stateful inspection, IPS/IDS, URL filtering, and TLS inspection of north-south/east-west traffic.

Architect note - Security Zones + Cloud Guard together

Cloud Guard tells you something is wrong; a Security Zone stops it from happening. Put production compartments in a Security Zone so a mis-click cannot create a public bucket or unencrypted volume, and run Cloud Guard tenancy-wide to catch everything the zones don't. This pairing turns security from after-the-fact cleanup into prevention.

Data and database security

Encryption at rest - on by default for block/boot/object/file storage and databases (TDE). Choose customer-managed keys for sensitive data.
Encryption in transit - TLS for service endpoints; native network encryption / TLS for DB connections; ADB uses mTLS wallets.
Data Safe - security assessment, user risk assessment, sensitive data discovery, dynamic/static masking (for non-prod copies), and DB activity auditing. The first tool to point at any production database.
Database Vault (where licensed) - realms and separation of duties so even DBAs cannot read application data.
Audit - the tenancy Audit service records all API calls (control-plane) immutably; combine with DB-level and OS logs.

Common mistake - unmasked production data in non-prod

Cloning production into Dev/Test for "realistic testing" copies real customer/PII data into weakly-controlled environments. Use Data Safe masking to sanitize sensitive columns during the clone, and keep non-prod under the same (or stricter) access controls until it is masked.

How to secure specific things

Secure a production tenancy (Tenancy)

Federate human access to the corporate IdP; enforce MFA for all users, especially admins. Reserve break-glass local admins, sealed and alarmed.
Least-privilege IAM by role and compartment; no manage all-resources in tenancy for daily groups.
Enable Cloud Guard tenancy-wide; put prod compartments in Security Zones; enable Vulnerability Scanning.
Centralize network egress/inspection through a hub (Network Firewall); minimize public subnets.
Enable Audit retention and stream logs to a central logging compartment via Service Connector Hub.
Compartment quotas + budgets as guardrails; tag everything with data-classification.

Secure databases (DB)

Private subnet only, no public IP; access via app tier / Bastion / private endpoints.
TDE with customer-managed keys in Vault; rotate keys.
Data Safe: run assessments, mask non-prod, audit activity.
Least-privilege DB accounts; Database Vault for separation of duties on the most sensitive systems.
Native network encryption / TLS for all client connections; store wallets/passwords in Vault secrets.

Secure Object Storage (Storage)

Keep buckets private; Cloud Guard alarms on any public bucket.
Enable versioning + retention rules on backup/compliance buckets (ransomware/accidental-delete recovery, WORM).
Prefer IAM + instance principals over pre-authenticated requests (PARs); if PARs are needed, give them short lifetimes and keep an inventory.
Customer-managed keys for sensitive buckets; access via Service Gateway, not internet.

Secure public load balancers / reduce public IP exposure (Network)

Only load balancers and Bastion live in public subnets; everything else private.
WAF in front of public HTTP; NSGs restrict listener sources where possible; rate limiting and geo rules.
Use reserved public IPs you can whitelist; terminate TLS at the LB with managed certs; consider E2E SSL for regulated data.
Replace standing jump hosts with the Bastion service (time-boxed, audited).
Audit the tenancy for stray public IPs regularly (Cloud Guard + a scheduled report).

Monitor suspicious activity (Detect)

Cloud Guard problems → notifications; Audit + Logging → Logging Analytics for correlation.
Alarms on: root/administrator logins, policy changes, new API keys, security-list changes, public IP creation, unusual Object Storage access.
Any break-glass user login should page someone, every time.

Production OCI security checklist

Human access federated to corporate IdP; MFA enforced for all users and especially admins.
Break-glass local admins created, credentials sealed, every login alarmed.
IAM least privilege: roles split (admin/operator/read-only), scoped to compartments, no broad manage all-resources in tenancy.
Workloads use instance/resource principals - no long-lived API keys stored on hosts.
Cloud Guard enabled tenancy-wide with notifications and responders configured.
Production compartments enrolled in Security Zones with appropriate recipes.
Vulnerability Scanning enabled for instances and container images.
No unintended public IPs; databases and app tiers in private subnets; Bastion for admin access.
WAF in front of public HTTP endpoints; Network Firewall inspecting hub traffic.
All data encrypted at rest; customer-managed keys in Vault for sensitive data; keys rotated.
Vault holds all secrets/wallets; nothing sensitive in images, code, or env files.
Object Storage buckets private; versioning + retention on backup/compliance buckets.
Data Safe assessments run; non-prod data masked; DB activity auditing on.
Audit log retention configured; logs centralized via Service Connector Hub.
Alarms on privilege/policy/network changes and anomalous access.
Compartment quotas + budgets as guardrails; everything tagged with data-classification.
DR and backups tested (restores verified), including key availability in the DR region.

Compliance basics

OCI maintains a broad set of certifications/attestations (SOC, ISO, PCI, HIPAA, FedRAMP/Gov in the relevant realms - verify current scope). Your responsibility is configuring services to meet your obligations: data residency (region/realm choice), encryption with controlled keys, access logging, and evidence. Cloud Guard and Security Zones help demonstrate continuous compliance; Audit provides the evidence trail.

Official documentation: OCI security overview & best practices →

9. Observability, Monitoring, and Operations

Metrics, alarms, logs, events, and the operational tooling to run OCI day-2 - including what to monitor per service, how to build useful alarms, and how to avoid drowning in noise.

Last reviewed: July 2026 Verify metric namespaces and query syntax in current docs.

TL;DR

Monitoring collects metrics and fires Alarms that publish to Notifications (email, PagerDuty, Functions, Slack via webhook). Logging centralizes service, audit, and custom logs; Logging Analytics analyzes them. Service Connector Hub is the pipe that moves logs/metrics/events between services (e.g. logs → Object Storage or SIEM). Events trigger automation on resource changes. Monitor the golden signals per tier, alarm on symptoms users feel, and route by severity to avoid alert fatigue.

The observability stack

Service	Role
Monitoring (Metrics)	Time-series metrics per service (CPU, memory, IOPS, LB health, DB metrics). Query with MQL.
Alarms	Threshold/absence rules on metrics that fire notifications and can trigger automation.
Notifications (ONS)	Topics with subscriptions: email, HTTPS webhook, PagerDuty, Slack, Functions, SMS.
Logging	Central store for service logs (LB, VCN flow, WAF), audit logs, and custom application logs.
Logging Analytics	Parse, search, correlate, and visualize large log volumes; dashboards and ML-assisted analysis.
Events	Reacts to resource lifecycle changes (e.g. bucket created, instance terminated) → Functions/Notifications/Streaming.
Service Connector Hub	Moves data between sources and targets (logs → Object Storage, metrics → Functions, events → SIEM).
Audit	Immutable record of all API/control-plane activity in the tenancy.
Operations Insights / Database Management / APM	Deep DB analytics, fleet DB monitoring, and application performance tracing.
Management Agent / OS Management Hub	Agent-based host metrics/logs and OS patch compliance.

What to monitor per area

Compute

CPU utilization, memory utilization, load, instance status; per-process via agent. Watch for sustained saturation and crash loops.

Storage

Block volume IOPS/throughput vs. tier ceiling, latency; Object Storage request/error rates; FSS throughput. A volume pinned at its I/O ceiling is a common hidden bottleneck.

Database

CPU, sessions, wait classes, storage used %, tablespace, backup success, Data Guard apply lag, blocked sessions. Use Performance Hub + Database Management.

Load balancers

Healthy/unhealthy backend count, active connections, response time, 5xx rate, bandwidth vs. shape.

Network

VPN tunnel state, FastConnect BGP/light levels, NAT/Service GW throughput, VCN flow-log rejects, DNS query health.

Security

Cloud Guard problems, audit anomalies (policy/key/public-IP changes), unusual Object Storage access, failed logins.

Building useful alarms

Alarm on symptoms users feel (LB 5xx, unhealthy backends, DB down, high latency), not only causes.
Use appropriate statistics and windows (e.g. mean over 5 min, not a single spike) and a sensible trigger duration to avoid flapping.
Set severity and route: critical → page; warning → ticket/Slack; info → dashboard only.
Use absence alarms for "should always report" signals (heartbeat, backup completion).
Tag alarms by service/team so ownership is clear.

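# MQL: alarm when an instance's mean CPU over a 5-minute interval exceeds 85%
# (set the alarm's trigger delay separately to require the breach to persist)
CpuUtilization[5m]{resourceId = "ocid1.instance.oc1..xxxx"}.mean() > 85

# MQL: alarm when any backend set has unhealthy backends
UnHealthyBackendServers[1m].max() > 0

# Absence alarm: no datapoints for 10m (agent/host down). With no argument,
# absent() falls back to the default absence detection period - verify in current docs
CpuUtilization[1m].absent(10m)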
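
To see why a windowed mean plus a trigger duration suppresses flapping, here is a small simulation. It is an illustrative sketch, not how the Monitoring service evaluates alarms internally; the function names and sample values are assumptions.

```python
from statistics import mean


def windowed_means(samples: list[float], window: int) -> list[float]:
    """Mean of each consecutive, non-overlapping block of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples) - window + 1, window)]


def alarm_fires(samples: list[float], window: int, threshold: float, trigger_intervals: int) -> bool:
    """Fire only when the windowed mean breaches for `trigger_intervals` windows in a row."""
    streak = 0
    for value in windowed_means(samples, window):
        streak = streak + 1 if value > threshold else 0
        if streak >= trigger_intervals:
            return True
    return False


if __name__ == "__main__":
    spike = [20.0] * 4 + [100.0] + [20.0] * 5   # one-minute samples with a single spike
    sustained = [90.0] * 10                      # ten minutes of real saturation
    print(alarm_fires(spike, window=5, threshold=85, trigger_intervals=1))
    print(alarm_fires(sustained, window=5, threshold=85, trigger_intervals=2))
```

A max-based rule would page on the single spike; the mean over the window ignores it and still catches the sustained case.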

Example alarms to implement

Alarm	Signal / condition	Severity
CPU high	Instance CPU mean > 85% for 5-10 min	Warning → Critical if sustained
Memory pressure	Memory utilization > 90% (agent metric)	Warning
Disk / filesystem full	Filesystem used > 85%	Warning → Critical > 95%
Block volume throughput ceiling	Throughput/IOPS near the tier limit sustained	Warning (capacity)
LB unhealthy backend	UnHealthyBackendServers.max() > 0	Critical
Database CPU	DB CPU utilization > 90% sustained	Warning
Database storage	Storage used > 85% / tablespace threshold	Warning → Critical
Failed backup	Backup job failed / absence of success event	Critical
Data Guard apply lag	Apply/transport lag > RPO threshold	Critical
VPN tunnel down	Tunnel state != UP	Critical
FastConnect issue	BGP session down / light-level alarm	Critical
Object Storage unusual access	Spike in requests / unexpected public access (via logs)	Security review

Avoiding noisy alerts

Common mistake - alert fatigue

Firing a page for every transient CPU spike trains people to ignore alarms. Fix it with: longer evaluation windows, appropriate statistics (mean/percentile not max), trigger durations, severity-based routing (only true user-impact pages), de-duplication/suppression during maintenance, and regularly pruning alarms nobody acts on. An alarm that never leads to action should be deleted or downgraded to a dashboard metric.

Operational dashboards and reports

Build Console dashboards (and Logging Analytics dashboards) per audience: an on-call "is anything on fire" view, a service-owner view, and an exec/cost view.
Use Service Connector Hub to ship logs/metrics to Object Storage for retention or to your enterprise SIEM.
Turn on Cost and usage reports and pair with Budgets (section 14) for a spend dashboard.
Use Resource Manager drift detection to check that deployed infrastructure still matches its Terraform configuration.

Architect note - centralize observability early

Create a dedicated observability/logging compartment, enable service logs and Audit tenancy-wide, and wire Service Connector Hub to a central bucket + SIEM from the start. Discovering during an incident that the logs you needed were never enabled is the classic post-incident finding, and retrofitting centralized logging afterwards cannot recover them.

Official documentation: Monitoring, Logging & Observability →

10. Containers, Kubernetes, and Cloud Native

OKE, Container Instances, Functions, and the event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and serverless.

Last reviewed: July 2026 Verify OKE modes (basic/enhanced), virtual nodes, and DevOps features in current docs.

TL;DR

OKE (managed Kubernetes) for long-running microservices at scale; Container Instances for a single container without running a cluster; Functions (serverless, Fn-based) for short event-driven code; plain Compute when containers add no value. Around them: Container/Artifact Registry, API Gateway, Events, Streaming, Queue, Notifications, and the DevOps service for CI/CD. OKE services of type LoadBalancer auto-provision an OCI LB/NLB; workloads use resource-principal workload identity for IAM.

The cloud-native services

Service	What it is	Use for
OKE (Kubernetes Engine)	Managed Kubernetes control plane + your worker nodes/node pools (or virtual nodes)	Microservices, platform teams, portable container workloads
Container Instances	Run containers directly, serverless, no cluster to manage	Single/few containers, batch, simple services without K8s overhead
Functions	Serverless FaaS (open-source Fn), event-triggered, scales to zero	Short event-driven tasks, glue, automation
Container Registry (OCIR)	Managed private Docker/OCI image registry	Storing/scanning images
Artifact Registry	Generic artifacts (not just images)	Build outputs, packages
API Gateway	Managed API front door: auth, routing, rate limiting, request/response transform	Exposing functions/microservices as APIs
Events	Reacts to resource changes → triggers Functions/Notifications/Streaming	Event-driven automation
Streaming	Kafka-compatible event streaming	High-throughput ingestion, pub/sub pipelines
Queue	Managed message queue (transactional, at-least-once)	Decoupling producers/consumers, work queues
Notifications	Pub/sub topics to email/webhook/Functions	Fan-out alerts and events
DevOps service	Managed Git repos, build & deployment pipelines (to OKE/Functions/Instances)	CI/CD inside OCI

OKE deep dive

Control plane - managed by Oracle (you don't run etcd/API servers). Choose Basic or Enhanced clusters; Enhanced adds features such as cluster add-ons, workload identity, virtual nodes, higher limits, and an SLA.
Node pools - groups of worker nodes (managed VMs/BM) with a shape and image; scale and upgrade per pool.
Virtual nodes - serverless worker capacity where Oracle manages the node lifecycle (you don't patch/scale VMs); pods run without you managing the underlying node.
Networking - OKE uses VCN-native pod networking (pods get VCN IPs) or flannel overlay; plan subnet CIDRs to have enough IPs for pods and nodes.
Load balancing - a K8s Service of type LoadBalancer makes OKE provision an OCI Load Balancer (or NLB via annotation); Ingress controllers front HTTP routing.
Storage - CSI driver provisions Block Volumes as PVs; FSS for shared RWX volumes.
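IAM - workload identity - on enhanced clusters, IAM policies grant access to a specific workload (namespace + K8s service account + cluster OCID), so pods call OCI APIs with no keys in the pod. Dynamic groups apply to node instance principals, which is the coarser, node-wide alternative; verify the current policy syntax in the docs.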

# Expose a deployment via an OCI Network Load Balancer from Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    oci.oraclecloud.com/load-balancer-type: "nlb"   # or omit for L7 LB
    oci-network-load-balancer.oraclecloud.com/security-list-management-mode: "None"
spec:
  type: LoadBalancer
  selector: { app: web }
  ports: [ { port: 443, targetPort: 8443 } ]

Architect note - size pod subnets

VCN-native pod networking gives every pod a real VCN IP - excellent for network policy and observability, but it consumes subnet IP space fast. Size the pod subnet CIDR for peak pods across all nodes (pods-per-node x max nodes), with headroom. Running out of pod IPs stalls scheduling in ways that look like mysterious pending pods.
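
The sizing rule (pods-per-node x max nodes, plus headroom) can be worked through with the standard library. This is a sketch under stated assumptions: 31 pods per node, 50 nodes, and 20% headroom are illustrative numbers, and the count of three reserved addresses per subnet reflects OCI's documented behavior at the time of writing.

```python
import ipaddress

OCI_RESERVED_IPS_PER_SUBNET = 3  # first two addresses and the last one in each subnet


def required_pod_ips(pods_per_node: int, max_nodes: int, headroom_pct: int = 20) -> int:
    """Peak pod IPs (pods-per-node x max nodes) plus percentage headroom, rounded up."""
    peak = pods_per_node * max_nodes
    return (peak * (100 + headroom_pct) + 99) // 100


def smallest_prefix_for(ip_count: int) -> int:
    """Longest IPv4 prefix length whose subnet still offers `ip_count` usable addresses."""
    needed = ip_count + OCI_RESERVED_IPS_PER_SUBNET
    host_bits = max((needed - 1).bit_length(), 2)  # /30 is the smallest subnet considered
    return 32 - host_bits


def fits(cidr: str, ip_count: int) -> bool:
    """Check whether an existing subnet CIDR has room for `ip_count` pod IPs."""
    usable = ipaddress.ip_network(cidr).num_addresses - OCI_RESERVED_IPS_PER_SUBNET
    return usable >= ip_count


if __name__ == "__main__":
    need = required_pod_ips(pods_per_node=31, max_nodes=50)
    print(need, f"/{smallest_prefix_for(need)}", fits("10.0.32.0/22", need))
```

Running the numbers before the VCN is built is cheap; resizing a pod subnet under a live cluster is not.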

OKE vs Functions vs Container Instances vs Compute

Situation	Choose
Many long-running microservices, need K8s ecosystem	OKE
A container or two, no desire to run a cluster	Container Instances
Short, event-triggered code that should scale to zero	Functions (+ API Gateway if HTTP)
Traditional app, VM lifecycle, no container benefit	Compute
Kafka-style ingestion pipeline	Streaming + consumers (OKE/Functions)

Cost note - do not run OKE for one container

A Kubernetes cluster carries operational and (for enhanced) control-plane cost and complexity. For a handful of containers, Container Instances or Functions are cheaper and simpler. Reserve OKE for when you genuinely need orchestration, service mesh, autoscaling fleets, or the K8s ecosystem.

Networking, IAM, and security for containers

Networking - clusters live in a VCN; control-plane/worker/pod/LB subnets with the right security rules. Private clusters keep the API endpoint off the internet.
IAM - cluster admins via IAM policy; in-cluster RBAC for K8s objects; workload identity for pod → OCI API access.
Image security - scan images in OCIR (Vulnerability Scanning); sign/verify; least-privilege pull secrets or instance principals.
Runtime security - network policies, pod security standards, secrets from OCI Vault (via CSI/secret store), and Cloud Guard over the tenancy.
Monitoring - cluster/node/pod metrics to Monitoring; container logs to Logging; APM for tracing.

Architecture patterns

Serverless event chain: an object upload triggers Events → a Function processes it and writes results / notifies. No servers to manage.

Microservices on OKE - deployments behind Ingress/LB, HPA autoscaling, service mesh for mTLS/traffic control, DevOps pipelines deploying images from OCIR, secrets from Vault, workload identity for OCI access.
Serverless function triggered by Object Storage event - as diagrammed; ideal for image processing, ETL kick-off, validation.
Event-driven architecture - Events + Streaming + Functions + Queue + Notifications for decoupled, resilient pipelines.
Container deployment pipeline - DevOps build pipeline (from managed Git or GitHub) → image to OCIR (scanned) → deployment pipeline to OKE/Functions/Container Instances, gated by approvals.

Troubleshooting: OKE pod not starting

⚑ Pod stuck Pending / CrashLoopBackOff / ImagePullBackOff

Likely causes

Pending: no schedulable node capacity, or out of pod IPs in the pod subnet, or resource requests too large, or taints/affinity mismatch.
ImagePullBackOff: bad image path, missing OCIR pull permission (instance principal/secret), or private registry unreachable (no Service Gateway/NAT route).
CrashLoopBackOff: app failing on start - config/secret missing, DB unreachable, bad liveness probe.
Cannot reach OCI APIs: workload identity/dynamic-group/policy not set up.

Checks

kubectl describe pod <pod>        # events explain Pending/ImagePull
kubectl logs <pod> --previous     # crash reason
kubectl get nodes -o wide          # capacity / readiness
oci ce cluster get --cluster-id <ocid>

Fix / prevention

Scale the node pool / fix pod-subnet sizing; grant OCIR pull via policy; fix probes and config; wire workload identity. Prevent with capacity headroom, image scanning gates, and correct subnet CIDR sizing.
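
As a rough triage aid, the likely causes above can be attached to the STATUS column of kubectl output. This sketch assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...); the hint text simply restates this runbook and the function name is hypothetical.

```python
IMAGE_PULL_HINT = "check image path, OCIR pull permission, Service Gateway/NAT route"

HINTS = {
    "Pending": "check node capacity, pod-subnet free IPs, resource requests, taints/affinity",
    "ImagePullBackOff": IMAGE_PULL_HINT,
    "ErrImagePull": IMAGE_PULL_HINT,
    "CrashLoopBackOff": "check config/secrets, DB reachability, liveness probe (kubectl logs --previous)",
}


def triage(kubectl_output: str) -> dict[str, str]:
    """Map each unhealthy pod name to its first checks.

    Input is the text of `kubectl get pods --no-headers`; STATUS is the third column.
    """
    findings = {}
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        name, status = fields[0], fields[2]
        if status in HINTS:
            findings[name] = HINTS[status]
    return findings


if __name__ == "__main__":
    sample = "api-5f6b 0/1 CrashLoopBackOff 12 (2m ago) 1h\njob-x1 0/1 Pending 0 5m\n"
    for pod, hint in triage(sample).items():
        print(f"{pod}: {hint}")
```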

⚑ Function not triggering

Checks

Event rule condition matches the resource/action; rule enabled; correct compartment.
Function has a policy allowing the trigger (Events/API Gateway invoke); resource principal permissions for what the function does.
Function deployed to the right app; concurrency/timeout limits; cold-start not mistaken for failure.
Check function logs (Logging) and the invocation metrics.

Fix / prevention

Correct the Event rule filter and the invoke policy; verify the function's own IAM (resource principal); add logging and a test invocation to your deploy pipeline.

Official documentation: OKE, Functions & Container Instances →

11. Analytics, Data, and Integration

The services that move, catalog, transform, and analyze data on OCI, and the common data-lake, warehouse, streaming, and CDC patterns built from them.

Last reviewed: July 2026 Verify service availability (Big Data, OpenSearch, OIC) per region in current docs.

TL;DR

Land raw data in Object Storage (the lake), move/transform it with Data Integration / Data Flow (Spark) / GoldenGate (CDC), catalog it with Data Catalog, serve analytics from Autonomous Data Warehouse, visualize with Oracle Analytics Cloud, and build models with Data Science. Streaming/Queue handle real-time ingestion; OpenSearch handles search/log analytics.

The services

Service	Role	Analogy for an Oracle person
Oracle Analytics Cloud (OAC)	BI, dashboards, self-service analytics, augmented analytics	OBIEE / modern BI, managed
Data Integration	Visual ETL/ELT with data flows and pipelines	ODI-style integration, cloud-native
Data Flow	Fully-managed Apache Spark (serverless)	Run Spark jobs without managing a cluster
Data Catalog	Metadata harvesting, glossary, data discovery/lineage	Enterprise data dictionary for the lake
Data Science	Notebooks, model training/deployment, MLOps	Managed JupyterLab + model catalog
GoldenGate	Real-time replication & change data capture	The GoldenGate you know, as a service
Streaming	Kafka-compatible event streaming	Managed Kafka
Queue	Managed message queue	AQ-style decoupling, managed
Service Connector Hub	Move data between OCI services	The plumbing/glue
API Gateway	Managed API front door	API management layer
Integration Cloud (OIC)	Application integration, prebuilt SaaS adapters, process automation	SOA Suite / iPaaS for connecting apps (EBS, Fusion, SaaS)
Big Data Service	Managed Hadoop/Spark clusters	Cloudera-style big data, where still needed
OpenSearch	Managed search & log analytics	Elasticsearch/Kibana, managed

Object Storage is the center of gravity

Almost every OCI analytics pattern uses Object Storage as the durable, cheap, infinitely scalable landing zone. ADW can query it directly (external tables), Data Flow/Spark reads and writes it, GoldenGate can deliver to it, and Data Catalog harvests it. Design your bucket/prefix layout (raw / curated / consumption zones) deliberately.

Common data patterns

Pattern	Built from	Notes
Data lake	Object Storage (raw/curated/consumption zones) + Data Catalog + Data Flow	Schema-on-read; ADW queries external data
Data warehouse	ADW + Data Integration + OAC	Curated, modeled, governed; serves BI
Streaming ingestion	Streaming (Kafka) → Functions/Data Flow → Object Storage/ADW	Real-time events into the lake/warehouse
Batch ingestion	Data Integration / Data Flow scheduled loads	Nightly/periodic bulk loads
CDC replication	GoldenGate from source DB → target (ADW/DB/Object Storage)	Near-real-time, low source impact; migrations & live feeds
Reporting architecture	ADW (or read replica) + OAC dashboards	Offload reporting off the OLTP system
AI-ready data	Curated lake + Data Science + 23ai AI Vector Search	Feed embeddings/models; see section 12

DBA note - GoldenGate for zero/low-downtime

GoldenGate is the workhorse for two very different jobs: migrations (keep the source live while OCI catches up, then cut over with minimal downtime) and ongoing replication (reporting offload, active/active, feeding a warehouse). For heterogeneous or cross-version moves where Data Guard/ZDM won't fit, GoldenGate is usually the answer. Watch supplemental logging overhead on the source and conflict handling in active/active.

Reference architecture: lakehouse + BI

Sources replicate/ingest into an Object Storage lake, Spark/DI transform, ADW serves modeled data, Data Catalog governs, OAC visualizes.

Official documentation: Data Integration, Data Flow, Analytics & GoldenGate →

12. AI, ML, and Generative AI on OCI

OCI's AI stack - Generative AI, Agents, AI Vector Search in the database, Data Science, and the pretrained AI services - plus the enterprise RAG patterns and the governance guardrails that separate a demo from something you can run on real business data.

Last reviewed: July 2026 GenAI models & regions change fast - verify available models, regions, and pricing in the Console.

TL;DR

OCI Generative AI serves foundation models (chat, embeddings) via API, with dedicated AI clusters for isolation and fine-tuning. Generative AI Agents add managed RAG over your data. AI Vector Search in Oracle Database 23ai/26ai stores embeddings next to relational data so you do similarity search in SQL - the backbone of enterprise RAG. Around them sit Data Science (build/deploy models) and pretrained AI services (Language, Vision, Speech, Document Understanding, Anomaly Detection, Forecasting). The hard part is not the model - it is governing what the model can touch.

The AI services

Service	What it does	Use for
OCI Generative AI	Managed LLM inference (chat, embeddings, rerank); dedicated AI clusters; fine-tuning	Chatbots, summarization, extraction, RAG generation
Generative AI Agents	Managed agent/RAG service that grounds answers on your data sources	Chat-with-your-docs/data with less plumbing
AI Vector Search (in DB 23ai/26ai)	VECTOR data type + similarity search in SQL, alongside relational data	Enterprise RAG retrieval, semantic search
Data Science	Notebooks, model training, model catalog, model deployment, MLOps, AI Quick Actions	Custom ML, deploy open models, feature engineering
Language	Pretrained NLP: sentiment, entities, key phrases, translation, PII detection	Text analytics without training
Vision	Image classification, object detection, OCR	Document/image analysis
Speech	Speech-to-text (and related)	Transcription, voice input
Document Understanding	Extract text/tables/key-values from documents	Invoice/form processing pipelines
Anomaly Detection	Multivariate anomaly models	Ops/fraud/equipment monitoring
Forecasting	Time-series forecasting	Demand/capacity planning
Digital Assistant	Conversational assistant/chatbot platform	Structured skills + LLM-backed chat
OpenSearch	Keyword + vector/hybrid search	Search backends, hybrid retrieval

AI Vector Search in Oracle Database

Oracle Database 23ai/26ai adds a native VECTOR data type and vector indexes so you store embeddings in the same database as your relational data and run similarity search with SQL. For an Oracle shop this is significant: no separate vector database to operate, and you can combine semantic search with normal SQL filters, joins, and existing security.

-- Store document chunks with their embedding vectors
CREATE TABLE doc_chunks (
  id        NUMBER PRIMARY KEY,
  doc_id    NUMBER,
  chunk     CLOB,
  embedding VECTOR(1024, FLOAT32)
);

-- Retrieve the 5 most similar chunks to a query embedding (RAG retrieval)
SELECT id, doc_id, chunk
FROM   doc_chunks
ORDER  BY VECTOR_DISTANCE(embedding, :query_vec, COSINE)
FETCH FIRST 5 ROWS ONLY;

DBA note - vectors alongside relational data

Keeping embeddings in the Oracle DB means RAG retrieval inherits your existing access controls, backups, Data Guard, and auditing - you are not standing up and securing a second data store. Combine VECTOR_DISTANCE with ordinary WHERE filters (e.g. restrict to documents a user is entitled to) so retrieval respects row-level entitlements. Size vector indexes and memory for your embedding dimension and volume.
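
For intuition about what the SQL above computes, here is a pure-Python sketch of cosine-distance ranking with an entitlement filter applied before ranking. It only illustrates the logic; in practice the database does this with VECTOR_DISTANCE and a WHERE clause. The chunk dictionaries and function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def retrieve(chunks: list[dict], query_vec: list[float], allowed_doc_ids: set[int], k: int = 5) -> list[int]:
    """Filter to documents the user is entitled to, then rank by cosine distance."""
    entitled = [c for c in chunks if c["doc_id"] in allowed_doc_ids]
    ranked = sorted(entitled, key=lambda c: cosine_distance(c["embedding"], query_vec))
    return [c["id"] for c in ranked[:k]]


if __name__ == "__main__":
    chunks = [
        {"id": 1, "doc_id": 10, "embedding": [1.0, 0.0]},
        {"id": 2, "doc_id": 20, "embedding": [0.9, 0.1]},  # close match, but not entitled
        {"id": 3, "doc_id": 10, "embedding": [0.0, 1.0]},
    ]
    print(retrieve(chunks, [1.0, 0.0], allowed_doc_ids={10}, k=2))
```

Filtering before ranking matters: a chunk the user may not see never becomes a candidate, however similar it is.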

RAG architecture on OCI

Ingestion: docs → chunk → embed → store vectors in DB. Runtime: query → governed serving layer → entitlement-filtered retrieval → LLM generates a grounded answer, all audited.

Enterprise patterns

Pattern	How	Watch out for
Chat with documents	RAG over Object Storage docs + Vector Search + GenAI	Chunking quality; stale index; citations
Chat with database	Retrieve from curated views; generate grounded answers	Never expose raw prod OLTP; use a serving layer
Natural language to SQL	LLM proposes SQL against a governed schema/catalog	Validate/parameterize; read-only; guard against dynamic SQL
RAG with Object Storage + Vector Search	Standard enterprise RAG stack	Entitlement filtering at retrieval
AI assistant for operations	RAG over runbooks/logs; suggest actions	Human-in-the-loop before any change
AI assistant for business users	Governed metrics/curated data + NL interface	Answer only from curated, validated data
AI over EBS / app data	Read-only reporting layer / extracts, not live OLTP	Performance impact + data governance

Governance and security for GenAI

Serving layer, always - agents and LLMs call a governed API/service that enforces authentication, authorization, rate limits, input/output validation, and logging. They do not touch data stores directly.
Entitlement-aware retrieval - filter retrieved context to what the requesting user is allowed to see (row/document-level), so RAG cannot leak data across users.
Private connectivity - keep model and data traffic on private endpoints / the OCI backbone; dedicated AI clusters for isolation where required.
Credential hygiene - secrets in Vault, access via principals; the model never sees raw credentials.
Auditability - log prompts, retrieved context IDs, and responses (subject to privacy rules) so answers are explainable and reviewable.
Human validation - outputs that drive business decisions or changes are reviewed before use; agents that act get approval gates.

Warnings (read before connecting AI to enterprise data)

Do not do these

Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query playground for a probabilistic agent.
Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
Protect credentials. No database passwords, wallets, or API keys in prompts, code, or agent memory. Use Vault + principals.
Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
Validate output before business use. Treat model output as a draft/suggestion until a human or a deterministic check confirms it.
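
A coarse first gate for model-generated SQL might look like the following. It is deliberately conservative and is not a SQL parser: it rejects anything that is not a single SELECT over allowlisted objects, and it will refuse some legitimate queries (CTEs, comma joins may slip past the table check). Real enforcement should still come from a read-only database account with grants only on curated views. All names here are hypothetical.

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke"
    r"|execute|call|begin|declare)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w$#.]*)", re.IGNORECASE)


def is_safe_select(sql: str, allowed_tables: set[str]) -> bool:
    """Accept only a single, comment-free SELECT that references allowlisted tables."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement or "--" in statement or "/*" in statement:
        return False  # multiple statements or comments: reject outright
    if not re.match(r"(?i)^select\b", statement):
        return False
    if FORBIDDEN.search(statement):
        return False
    tables = {t.lower() for t in TABLE_REF.findall(statement)}
    return bool(tables) and tables <= {t.lower() for t in allowed_tables}


if __name__ == "__main__":
    allowed = {"sales_summary"}  # hypothetical curated view
    print(is_safe_select("SELECT region, total FROM sales_summary WHERE yr = :yr", allowed))
    print(is_safe_select("SELECT 1 FROM sales_summary; DROP TABLE sales_summary", allowed))
```

Bind variables (such as :yr) keep user-supplied values out of the SQL text, which is the parameterization the warning above asks for.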

Architect note - the pattern that scales safely

The durable enterprise GenAI shape is: curated/governed data → entitlement-filtered retrieval → model behind a serving API → validated, audited output. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first; it is far cheaper than retrofitting controls after an incident.

Cost and endpoint considerations

On-demand vs dedicated AI clusters - on-demand inference is pay-per-use and simple; dedicated clusters give isolation, predictable throughput, and fine-tuning, at a fixed cost. Match to volume and isolation needs.
Embedding jobs are a real cost at scale - batch and cache embeddings; re-embed only changed content.
Token/context size drives cost and latency - retrieve the smallest sufficient context, not everything.
Region/model availability - models and dedicated-cluster availability vary by region; verify before designing.
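
The "re-embed only changed content" advice can be sketched with a content-hash cache. The embed_batch argument is a hypothetical callable standing in for whatever embedding endpoint you use; it takes a list of texts and returns one vector per text.

```python
import hashlib
from typing import Callable


def content_key(text: str) -> str:
    """Stable cache key derived from the chunk text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_changed(
    chunks: dict[str, str],
    cache: dict[str, list[float]],
    embed_batch: Callable[[list[str]], list[list[float]]],
) -> int:
    """Embed only chunks whose content hash is not cached; return how many were sent."""
    pending = {content_key(t): t for t in chunks.values() if content_key(t) not in cache}
    if pending:
        keys = list(pending)
        vectors = embed_batch([pending[k] for k in keys])  # one batched call
        cache.update(zip(keys, vectors))
    return len(pending)


if __name__ == "__main__":
    def fake_embed(texts: list[str]) -> list[list[float]]:
        # Stand-in for a real embedding call, for demonstration only.
        return [[float(len(t))] for t in texts]

    cache: dict[str, list[float]] = {}
    print(embed_changed({"a": "hello", "b": "world"}, cache, fake_embed))
    print(embed_changed({"a": "hello", "b": "world!"}, cache, fake_embed))
```

Keying on content rather than on document ID means an unchanged chunk is never re-embedded, even if the document around it was edited.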

Official documentation: OCI Generative AI, Agents & AI Services →

13. Migration and Disaster Recovery

Getting workloads into OCI, and keeping them recoverable once there - the migration tooling, the DR patterns by tier, and how RTO/RPO drive the architecture and cost.

Last reviewed: July 2026 Verify ZDM/DMS/Full Stack DR capabilities and supported sources in current docs.

Migration tooling Database migration DR patterns RTO / RPO Examples DR testing

TL;DR

Migrate databases with Zero Downtime Migration (ZDM) or Database Migration Service (both typically use Data Guard or GoldenGate under the hood), and migrate compute by rebuilding from images or replicating volumes. For DR, choose per tier: backup-and-restore (cheapest, slowest), pilot light, warm standby, or active/active (fastest, priciest). Full Stack Disaster Recovery orchestrates cross-region failover of the whole stack. Your RTO/RPO targets pick the pattern; DR you never test is not DR.

Migration tooling

Move	Tooling	Notes
Database (low downtime)	ZDM, Database Migration Service, GoldenGate	Physical (Data Guard-based) or logical (Data Pump/GoldenGate-based) workflows; GoldenGate for heterogeneous/cross-version
Database (offline)	RMAN / Data Pump / cross-endian	Simple, requires downtime window
Compute / VM	Rebuild from image, or import; block/boot volume replication	Prefer rebuild-from-golden-image over lift-and-shift of disks
Bulk data	Data Transfer appliance/disk, or online (multipart upload, rclone)	Physical transfer when network is impractical
Files	rsync/rclone to FSS or Object Storage; FSS replication	Preserve permissions for app filesystems
Applications (EBS etc.)	Oracle Cloud lift tooling + DB migration + app-tier rebuild	Whole-stack; validate certification

Database migration paths

Method	Downtime	Best for
Zero Downtime Migration (ZDM)	Near-zero	Compatible Oracle-to-OCI moves, typically physical via Data Guard; automated orchestration
Database Migration Service (DMS)	Low to near-zero	Managed online/offline migrations to OCI targets
GoldenGate	Near-zero	Heterogeneous, cross-version, or cross-endian moves; source stays active during cutover
Data Pump	Downtime window	Logical export/import; schema-level; cross-version
RMAN / cross-endian	Downtime window	Physical restore/transport to OCI

DBA note - pick by downtime tolerance and heterogeneity

If source and target are compatible Oracle and you can use Data Guard, ZDM gives the cleanest near-zero-downtime path. If you are changing endianness, version, or platform, or need the source live throughout, GoldenGate is the tool. Reserve Data Pump/RMAN for cases where a downtime window is acceptable and simplicity wins.
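
The selection rule in this note can be written down as a small decision function. It is only a sketch of the reasoning above; the function and parameter names are hypothetical, and real choices also depend on version support and licensing that the function does not model.

```python
def pick_migration_method(
    data_guard_compatible: bool,
    downtime_window_ok: bool,
    source_must_stay_live: bool = False,
) -> str:
    """Map the DBA note's criteria to a migration method name."""
    if downtime_window_ok and not source_must_stay_live:
        # Simplicity wins when an outage window is acceptable.
        return "Data Pump / RMAN"
    if data_guard_compatible and not source_must_stay_live:
        # Compatible source and target: cleanest near-zero-downtime path.
        return "ZDM"
    # Endianness/version/platform change, or the source must remain live.
    return "GoldenGate"


print(pick_migration_method(data_guard_compatible=True, downtime_window_ok=False))
print(pick_migration_method(data_guard_compatible=False, downtime_window_ok=False))
```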

DR patterns

Pattern	Standby state	RTO	RPO	Cost
Backup & restore	Backups in DR region; nothing running	Hours+	Since last backup	Lowest
Pilot light	Core (DB standby) running, app tier off	Tens of minutes	Small (DG lag)	Low
Warm standby	Scaled-down full stack running	Minutes	Small	Medium
Active / passive	Full-size standby, promote on failover	Minutes	Near-zero (SYNC)	High
Active / active	Both regions serving	Near-zero	Near-zero	Highest + complexity

Building blocks: Data Guard / Active Data Guard and Autonomous Data Guard for databases; Object/Block/File replication for storage; cross-region image copy for compute; DNS failover and load balancers for traffic redirection; Full Stack Disaster Recovery to orchestrate the whole failover as a plan you can run and test.

Architect note - tier your DR, don't blanket it

Not every workload deserves active/active. Classify workloads by business impact and assign each a DR pattern: mission-critical DB → Active Data Guard cross-region; important apps → warm standby; the rest → pilot light or backup-restore. Blanket active/active for everything is a budget-buster; blanket backup-restore leaves critical systems with an unacceptable RTO. Match pattern to tier.

Common mistake - active/active is rarely truly active/active

True active/active for stateful databases means solving write conflicts (GoldenGate bidirectional with conflict handling) - hard, and unnecessary for most. Most "active/active" requirements are satisfied by active/passive with fast failover, or active/read-only (Active Data Guard serving reads in the second region). Don't take on multi-master complexity unless the requirement genuinely demands it.

RTO and RPO - the two numbers that drive everything

RTO (Recovery Time Objective) - how long you can be down. Drives standby readiness (running vs. cold) and automation.
RPO (Recovery Point Objective) - how much data you can lose. Drives replication mode: async (small lag), SYNC/Far Sync (zero data loss), or backup interval (large).
Zero-data-loss (RPO 0) needs SYNC transport (Data Guard Maximum Protection/Availability, often via a Far Sync instance) - which requires low latency between sites and has performance implications. Confirm the network and the trade-off.
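
To make "your RTO/RPO targets pick the pattern" concrete, the sketch below maps two target numbers to one of the patterns in the DR table. The minute thresholds and the `backup_interval_minutes` default are illustrative assumptions, not Oracle guidance; tune them to your own tiers.

```python
def pick_dr_pattern(
    rto_minutes: float,
    rpo_minutes: float,
    backup_interval_minutes: float = 1440,  # assumed daily backups
) -> str:
    """Return the cheapest DR pattern that plausibly meets the targets."""
    if rto_minutes >= 240 and rpo_minutes >= backup_interval_minutes:
        return "backup & restore"        # hours of downtime and loss since last backup are acceptable
    if rto_minutes < 1:
        return "active/active"           # both regions must already be serving
    if rpo_minutes < 1:
        return "active/passive (SYNC)"   # zero data loss needs synchronous transport
    if rto_minutes < 15:
        return "warm standby"            # scaled-down full stack already running
    return "pilot light"                 # DB standby running, app tier started on demand


for targets in [(480, 1440), (10, 5), (60, 5), (10, 0)]:
    print(targets, "->", pick_dr_pattern(*targets))
```

Note how a long RTO combined with a short RPO still lands on pilot light rather than backup-and-restore: the RPO rules out waiting for the last backup even though the RTO would allow it.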

Architecture examples

Cross-region DR: Active Data Guard replicates the database, bucket replication mirrors backups/objects, warm app tier + DNS/LB failover completes the picture. Full Stack DR orchestrates the switch.

On-prem Oracle DB → OCI: ZDM/Data Guard over FastConnect, cut over in a window, keep on-prem as fallback briefly.
EBS → OCI: migrate DB (Base/Exadata) + rebuild app tier + shared FSS; DNS cutover; validate certification.
VM → OCI: custom image import/rebuild; block volume replication for large data disks.
Cross-region DR (app): warm standby + DNS/LB failover + storage replication.
Cross-region DR (database): Active Data Guard (or Autonomous Data Guard) with configurable failover.
Backup-based DR: cross-region backup copies; rebuild in DR on demand (lowest cost, highest RTO).
GoldenGate-based DR: when heterogeneous or active/active-ish read serving is required.

DR testing

DR you have never tested is a hope, not a plan

Schedule regular DR drills: switchover (planned role change) at minimum, and periodic failover tests. Verify that RTO/RPO are actually met, that the app works end-to-end in DR (not just that the database opens), that keys/secrets exist in the DR region (a customer-managed TDE key missing in DR makes the standby unusable), that DNS/LB failover works, and that runbooks are current. Full Stack Disaster Recovery lets you codify and rehearse the whole plan.

Standby database is applying and within RPO (monitor apply lag).
Encryption keys/wallets present and usable in the DR region.
App tier can start and connect in DR; config points to DR endpoints.
DNS/LB failover mechanism tested and time-measured.
Storage (buckets/volumes/FSS) replication within RPO.
Runbook current; roles assigned; switchover rehearsed on a schedule.
Capacity available in DR (reservations if RTO is tight).
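
A drill is easier to judge when its measurements are compared against the targets mechanically. The sketch below assumes you record a few numbers during each drill; `DrillMeasurement`, its fields, and `drill_failures` are hypothetical names, not part of Full Stack Disaster Recovery.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrillMeasurement:
    """Values recorded during one DR drill (all field names are illustrative)."""
    apply_lag_seconds: float     # standby apply lag when the drill started
    storage_lag_seconds: float   # age of the oldest unreplicated bucket/volume/FSS change
    failover_seconds: float      # drill start until the app served traffic in DR
    keys_usable_in_dr: bool      # TDE keys/wallets opened successfully in the DR region
    app_end_to_end_ok: bool      # a real transaction succeeded, not just "DB is open"


def drill_failures(m: DrillMeasurement, rto_seconds: float, rpo_seconds: float) -> List[str]:
    """Return the checklist items this drill did not satisfy."""
    failures = []
    if m.apply_lag_seconds > rpo_seconds:
        failures.append("database apply lag exceeds RPO")
    if m.storage_lag_seconds > rpo_seconds:
        failures.append("storage replication lag exceeds RPO")
    if m.failover_seconds > rto_seconds:
        failures.append("measured failover time exceeds RTO")
    if not m.keys_usable_in_dr:
        failures.append("encryption keys/wallets not usable in DR region")
    if not m.app_end_to_end_ok:
        failures.append("application not working end-to-end in DR")
    return failures


drill = DrillMeasurement(20, 400, 1500, True, True)
print(drill_failures(drill, rto_seconds=1800, rpo_seconds=300))
```

An empty list means the drill met its targets; anything else is a finding to fix before the next rehearsal.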

Official documentation: Full Stack DR, ZDM & Database Migration →

14. Cost Management and Governance

How OCI charges, the tools to track and cap spend, the governance model (landing zones, quotas, budgets), and a concrete monthly cost-review checklist.

Last reviewed: July 2026 Pricing changes constantly - verify all rates on the official pricing pages.

Pricing basics Cost tools Governance & landing zone Optimization Monthly checklist

TL;DR

OCI bills mainly by OCPU-hours, storage GB, and network egress, purchased via Universal Credits. Track with Cost Analysis and Usage Reports, cap with Budgets (alert) and Quotas (block), attribute with tags and compartments. The biggest levers: BYOL for databases, right-sizing OCPUs, stopping/scheduling non-prod, correct block-volume tiers, Object Storage lifecycle, and killing orphaned resources. Governance = a landing zone + quotas + budgets + tags so spend is controlled by design.

Pricing basics

Dimension	Charged on	Notes
Compute	OCPU-hours (+ memory for flex), per shape	Arm often cheaper per unit; preemptible cheaper still
Block storage	GB-month x performance (VPU)	Higher VPU costs more; auto-tune down when idle
Object storage	GB-month by tier + requests + retrieval (IA/Archive)	Archive cheapest to store, has retrieval cost/delay
Network	Internet egress (ingress free); some cross-region	Keep OCI-service traffic on Service Gateway to avoid internet egress
Database	OCPU-hours + storage + edition/options; LI vs BYOL	Usually your largest line item; BYOL is the big lever
Load balancer	Bandwidth shape (LBaaS) / usage	Size the flexible bandwidth to real need

Universal Credits - a consumption model where credits apply across eligible OCI services; annual commitments earn better rates than pure pay-as-you-go.
BYOL vs License Included - covered in section 6; BYOL is typically the biggest database saving if you already own licenses.

Cost tracking tools

Tool	Does
Cost Analysis	Console visualizations of spend by compartment, service, tag, time.
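Usage/Cost Reports	Detailed CSV usage and cost data written to an Oracle-owned Object Storage bucket, which you read (via a cross-tenancy policy) for your own analysis/BI.
Budgets	Track spend against a target on a compartment or tag; alert at thresholds. Budgets do not block spending or resource creation.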
Compartment Quotas	Policy-like statements that block resource creation - the hard cap.
Cloud Advisor	Recommendations: rightsizing, idle resources, performance/cost/availability.

Architect note - budgets alert, quotas enforce

Use them together: a budget tells you spend is trending over; a quota stops a team from creating the expensive thing in the first place. Attribute everything via defined tags (CostCenter, Environment, Owner) and per-compartment reporting so chargeback is possible. This only works if tagging was enforced from day one (section 1).
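
What "trending over" means can be illustrated with a simple linear projection of month-to-date spend. This is only an arithmetic sketch with hypothetical function names and an assumed 80% alert threshold; it is not how OCI Budgets computes its own forecast.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, today: date) -> float:
    """Project month-end spend by extending the average daily spend so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_status(spend_to_date: float, budget: float, today: date, alert_pct: float = 80.0) -> str:
    """Classify a compartment's spend against its monthly budget."""
    if spend_to_date >= budget:
        return "over"
    if forecast_month_end(spend_to_date, today) >= budget:
        return "trending over"
    if spend_to_date >= budget * alert_pct / 100:
        return "threshold reached"
    return "ok"


# 4,000 spent by 10 June projects to 12,000 for the month against a 10,000 budget.
print(budget_status(4000, 10000, date(2026, 6, 10)))
```

A budget can only report this state; stopping the spend still requires a quota or a human decision.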

Governance model and landing zones

A landing zone is a codified baseline for a well-governed tenancy: compartment topology, IAM groups/policies, network hub, logging/audit, Security Zones, Cloud Guard, budgets, quotas, and tag defaults - deployed as Terraform so it is repeatable and reviewable. Oracle publishes CIS-aligned landing zone reference architectures/Terraform to start from.

Compartments for isolation and cost attribution (section 1).
Quotas + budgets per compartment for control.
Tags + tag defaults for attribution and automation.
Security Zones + Cloud Guard for preventive/detective guardrails (section 8).
Everything as code so environments are consistent and auditable.

Cost optimization examples

Action	Typical saving	Effort
Stop non-prod compute nights/weekends (schedule)	High - up to ~65-70% of that compute	Low
Right-size over-provisioned shapes (Cloud Advisor)	High	Low
Correct block-volume performance tier (don't over-VPU)	Medium	Low
Object Storage lifecycle to Archive / delete	Medium-High for large stores	Low
Database BYOL instead of License Included	Very High (Oracle shops)	Medium
Autonomous autoscale + auto-stop dev	High for variable/non-prod	Low
Exadata capacity planning / consolidation	High at scale	High
Remove unused reserved public IPs	Small each, adds up	Low
Delete orphaned block volumes / old backups / snapshots	Medium	Low
Move scale-out tiers to Arm shapes	Medium (price/perf)	Medium
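
The "~65-70%" figure for scheduling non-prod follows from the share of the week an instance stays stopped. The arithmetic below is a sketch that covers compute hours only; attached boot and block volumes continue to bill while an instance is stopped, and stop-billing behavior should be verified for your shape.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving_pct(hours_per_day: float, days_per_week: int) -> float:
    """Percentage of compute hours avoided versus running 24x7."""
    running_hours = hours_per_day * days_per_week
    return round(100 * (1 - running_hours / HOURS_PER_WEEK), 1)


print(schedule_saving_pct(12, 5))  # 12 h on weekdays -> 64.3
print(schedule_saving_pct(10, 5))  # 10 h on weekdays -> 70.2
```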

Cost note - the cheap wins first

Before any complex re-architecture, do the boring high-ROI things: schedule non-prod off-hours, act on Cloud Advisor rightsizing, apply Object Storage lifecycle rules, and BYOL your databases. These are low-effort and recover the most money. Re-architecture (Arm migration, Exadata consolidation) comes after.

Monthly OCI cost review checklist

Review Cost Analysis month-over-month by compartment and service; investigate any spike.
Check each budget: which compartments/tags are over or trending over target.
Act on Cloud Advisor rightsizing and idle-resource recommendations.
Confirm non-prod stop/scale schedules ran (no Dev/Test running 24x7 by accident).
Find and terminate orphaned block volumes, unattached boot volumes, and idle instances.
Delete stale volume backups, DB backups beyond retention, and old snapshots.
Review Object Storage: are lifecycle rules moving cold data to Archive / deleting expired data?
Remove unused reserved public IPs and idle load balancers.
Verify database licensing posture (BYOL applied where owned; OCPU counts right-sized).
Check block-volume performance tiers vs. actual IOPS - downgrade over-provisioned volumes.
Review egress charges - is OCI-service traffic correctly on the Service Gateway?
Confirm every resource is tagged (CostCenter/Environment/Owner) for attribution.
Validate quotas still reflect intent (no team quietly raised a cap).
Reconcile Universal Credits burn-down vs. commitment; forecast to renewal.
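
The orphaned-volume item lends itself to a small script. The sketch below assumes you have already exported two JSON arrays, for example the `data` arrays from `oci bv volume list` and `oci compute volume-attachment list`; the key names (`id`, `display-name`, `volume-id`, `lifecycle-state`) are assumptions about that output and should be checked against what your CLI version returns.

```python
from typing import Dict, List


def unattached_volumes(volumes: List[Dict], attachments: List[Dict]) -> List[str]:
    """Return display names of available block volumes with no live attachment."""
    attached_ids = {
        a["volume-id"]
        for a in attachments
        if a.get("lifecycle-state") in ("ATTACHING", "ATTACHED")
    }
    return [
        v["display-name"]
        for v in volumes
        if v.get("lifecycle-state") == "AVAILABLE" and v["id"] not in attached_ids
    ]


# Hypothetical sample data standing in for exported CLI output.
volumes = [
    {"id": "vol-1", "display-name": "app-data", "lifecycle-state": "AVAILABLE"},
    {"id": "vol-2", "display-name": "old-test-disk", "lifecycle-state": "AVAILABLE"},
]
attachments = [
    {"volume-id": "vol-1", "lifecycle-state": "ATTACHED"},
    {"volume-id": "vol-2", "lifecycle-state": "DETACHED"},
]
print(unattached_volumes(volumes, attachments))
```

Treat the result as a review list rather than a delete list: a detached volume may be a deliberate cold copy, which is what the Owner tag is for.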

Official documentation: Cost management, Budgets, Quotas & Governance →

15. Enterprise Architecture Patterns

Reference blueprints for real OCI deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026 Blueprints are starting points - validate sizing/services against current docs and your requirements.

HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the relevant service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: public LB + WAF → private app tier across fault domains → managed database → Service Gateway for OCI services → centralized logging → cross-region DR.

Foundational three-tier (reference backbone)

Three-tier enterprise application

The pattern most other patterns extend

Business case	Standard internal/external web or enterprise app needing HA and controlled exposure.
Services	VCN + public/private subnets, WAF, Load Balancer, Compute (instance pool), Base/Autonomous DB, Service Gateway, NAT, Vault, Monitoring/Logging.
Traffic flow	User → WAF → public LB → app tier (private) → DB (private); app → OCI services via Service Gateway; egress via NAT.
Security	Only LB/Bastion public; NSGs per tier (app→db on 1521 only); TDE + customer keys; secrets in Vault; Cloud Guard + Security Zone.
HA	App pool + DB nodes spread across fault domains (and ADs where available); LB health checks.
DR	Active Data Guard + warm app tier in a second region; DNS/LB failover.
Monitoring	LB backend health, app CPU/mem, DB metrics, alarms → Notifications; central logs.
Cost	Right-size app pool + autoscale; BYOL DB; schedule non-prod.
Risks / mistakes	Backends unhealthy from missing health-check NSG rule; DB in public subnet; no FD spread; secrets in images.

Pattern library

Each pattern below follows the same dimension set. Expand the ones relevant to you.

Simple web application Small

Case	Low-complexity site/app, cost-sensitive.
Services	1 VCN, public LB (or public instance), Arm compute, Autonomous DB (auto-stop for dev), Object Storage for assets.
Flow	User → LB → app → ADB; static assets from Object Storage.
Security/HA/DR	WAF on LB; 2 instances across FDs; ADB backups + optional Autonomous DG; Cloud Guard.
Cost / risk	Arm + auto-stop = cheap. Risk: single instance / no backups if cut too far.

Highly available application HA

Case	App that must survive node, rack, and (where possible) AD failure.
Services	Instance pool + autoscaling across FDs/ADs, LB, RAC DB or ADB, FSS for shared state, Vault.
Flow / HA	LB spreads to app pool; DB RAC for node HA; storage replicated; no single FD holds all.
DR / monitoring	Active Data Guard cross-region; alarms on backend health + DB; Full Stack DR plan.
Risk	State stored on a single node instead of shared/managed store; untested failover.

EBS on OCI Apps DBA

Case	Migrate/run Oracle E-Business Suite on OCI.
Services	Base/Exadata DB (2-node RAC), app-tier compute pool behind LB, shared APPL_TOP on FSS, Vault, Object Storage backups, Bastion.
Flow	Users → LB → EBS app nodes (shared FSS) → DB; concurrent/admin tiers as needed.
Security/HA/DR	Private DB, TDE; app across FDs; Data Guard DR; validated EBS certification for the DB service/version.
Cost / risk	BYOL DB; right-size nodes. Risk: unsupported DB service/version; FSS export too open.

Oracle database on OCI (managed) DB

Case	Run a production Oracle DB with managed lifecycle.
Services	Base Database (2-node RAC) or Exadata; Data Guard; Object Storage backups via Service Gateway; Data Safe; Vault keys.
Security/HA/DR	Private subnet; customer-managed TDE; RAC HA; Data Guard cross-region DR; audited via Data Safe.
Cost / risk	BYOL; right-size OCPU. Risk: untested restores; keys absent in DR region.

Exadata Cloud Service platform Scale

Case	Large, high-performance, or consolidated Oracle estate.
Services	ExaDB-D VM clusters + many PDBs; Data Guard to a second Exadata; Ops Insights; Data Safe.
Security/HA/DR	RAC built-in; Active Data Guard; PDB isolation; standardized patch windows.
Cost / risk	Consolidate for density; BYOL. Risk: over-provisioning for small estates; noisy-neighbor PDBs without resource management.

Autonomous Database application Low-ops

Case	New app wanting minimal DBA toil, elastic scale.
Services	ATP (Serverless), private endpoint, app tier on OKE/compute, Vault for wallet, APEX optional.
Security/HA/DR	Private endpoint (no public); mTLS wallet from Vault; Autonomous DG; auto-backups.
Cost / risk	Autoscale + auto-stop dev. Risk: assuming Autonomous fits a DB needing OS/feature access it can't provide.

Data warehouse & data lake Analytics

Case	Enterprise analytics / BI on curated + raw data.
Services	Object Storage lake (zones) + Data Integration/Data Flow + ADW + Data Catalog + OAC; GoldenGate CDC feeds.
Security/HA/DR	Private access; Data Catalog governance; ADW auto-backups + DG; masked non-prod.
Cost / risk	Scale ADW to query windows; lifecycle cold lake data. Risk: ungoverned lake ("data swamp").

Kubernetes platform Cloud native

Case	Container platform for many microservices with CI/CD.
Services	OKE (enhanced), node pools/virtual nodes, OCIR (scanned), API Gateway, DevOps pipelines, Vault, LB/NLB, service mesh.
Security/HA/DR	Private cluster; workload identity; network policies; multi-FD node pools; backup of cluster state/config; images in OCIR replicated.
Cost / risk	Right-size pools; Arm nodes. Risk: pod-subnet IP exhaustion; over-privileged workload identity.

Private enterprise application (no internet exposure) Regulated

Case	Internal-only app reachable from on-prem, no public footprint.
Services	Private subnets only, private LB, FastConnect/VPN via DRG, Service Gateway, private endpoints, Bastion for admin.
Security/HA/DR	No public IPs/IGW; access from corporate network only; Network Firewall inspection; cross-region DR over private links.
Cost / risk	FastConnect cost. Risk: CIDR overlap with on-prem; DNS forwarding gaps.

Hybrid cloud Hybrid

Case	Workloads split across on-prem and OCI with shared connectivity.
Services	DRG hub + FastConnect (primary) + VPN (backup), hub-and-spoke VCNs, hybrid DNS, Network Firewall.
Security/HA/DR	Redundant links with BGP failover; centralized inspection; consistent IAM/tagging.
Cost / risk	FastConnect + egress. Risk: CIDR overlap; single link with no backup; asymmetric routing.

Multi-region DR DR

Case	Business-critical stack needing regional resilience.
Services	Mirrored compartments/VCNs in 2 regions, Active Data Guard, storage replication, capacity reservation, Full Stack DR, DNS/LB failover.
Security/HA/DR	Keys present in both regions; RTO/RPO per tier; rehearsed switchover/failover.
Cost / risk	Standby cost vs. RTO. Risk: untested DR; missing DR keys; capacity unavailable at failover.

Secure landing zone Governance

Case	Governed foundation before workloads land.
Services	Compartment topology, IAM roles/policies, network hub, Vault, centralized logging/audit, Cloud Guard, Security Zones, budgets, quotas, tag defaults - all as Terraform (CIS-aligned).
Security/HA/DR	Preventive guardrails; least privilege; audit centralized; break-glass isolated.
Cost / risk	Quotas/budgets cap spend. Risk: skipping the landing zone and retrofitting governance later.

GenAI with private enterprise data AI

Case	RAG/assistant over internal documents and data, governed.
Services	Object Storage (docs) + DB 23ai Vector Search + OCI Generative AI (or Agents) behind a serving API on OKE/Functions + API Gateway + Vault + Logging.
Flow	Query → serving layer (authz + guardrails) → entitlement-filtered vector retrieval → grounded generation → audited response.
Security/HA/DR	Private endpoints; no direct model→OLTP; secrets in Vault; full audit; validated output.
Cost / risk	Cache embeddings; right-size dedicated clusters. Risk: ungoverned data access, dynamic SQL, credential leakage (see section 12 warnings).

Common mistakes across all patterns

Databases or app tiers exposed in public subnets "just to get it working."
No fault-domain spread - a single rack event takes down the whole "HA" tier.
Health-check NSG rules forgotten, so LB backends are unhealthy on day one.
DR designed but never tested; keys/secrets missing in the DR region.
Secrets baked into images/code instead of Vault + principals.
No centralized logging/audit until an incident needs it.
CIDR overlap discovered during the hybrid connectivity phase.
Governance (landing zone, quotas, tags) skipped and retrofitted painfully later.
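
CIDR overlap is cheap to catch before any VCN is created. A minimal check with Python's standard `ipaddress` module, using hypothetical network names and ranges:

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


planned = {
    "onprem": "10.0.0.0/16",
    "vcn-prod": "10.0.8.0/21",   # falls inside the on-prem range
    "vcn-dr": "10.1.0.0/16",
}
print(find_overlaps(planned))  # [('onprem', 'vcn-prod')]
```

Running a check like this against the full address plan (on-prem, every VCN, and any partner networks) during design avoids discovering the clash when the DRG routes are first advertised.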

Official documentation: OCI reference architectures & adoption framework →

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and CLI where useful), fixes, and prevention. Deeper versions of some runbooks live in their service sections; this is the consolidated index.

Last reviewed: July 2026 CLI syntax evolves - verify commands with oci <service> --help.

General method

Work top-down through the layers: identity (is the caller allowed?) → network (is there a route + do rules allow it, both ways?) → host/service (is it listening/healthy?) → data. For any "cannot reach" problem, run Network Path Analyzer and enable VCN Flow Logs first - they usually name the exact blocking rule.

Compute Storage Network LB DB IAM Containers Observability

Compute & access

⚑ Compute instance not reachable / SSH fails

Symptoms: SSH times out or is refused. Likely causes: missing security rule or route for TCP 22, no public IP or wrong network path, instance stopped or boot failed, wrong SSH key/user, OS firewall/fail2ban. Checks: instance state is Running; serial console for boot output; security rules + route allow TCP 22 from your IP; correct default user for the image (opc on Oracle Linux, ubuntu on Ubuntu); firewalld. Console: Compute > Instance > Attached VNICs; Instance > Console connection. Fix: allow TCP 22 in the NSG, assign a public IP or fix the route, reset via console. Prevention: use Bastion service (no public IPs), standard security templates.

oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance action --instance-id <ocid> --action SOFTRESET

⚑ Instance boot issue

Symptoms: instance up but OS unreachable/services down. Likely causes: bad /etc/fstab mount, kernel/driver issue, full root disk, failed cloud-init. Checks: serial console boot output; single-user mode. Fix: detach boot volume, attach to a rescue instance, correct fstab/config, reattach. Prevention: test image changes in non-prod; keep boot volume backups.

⚑ High CPU

Symptoms: slow app, CPU alarm. Likely causes: undersized shape, runaway process, missing autoscale, batch overlap. Checks: Monitoring CPU trend; on host top/pidstat. Fix: resize the shape (requires a reboot) or scale out the pool; fix the offending process. Prevention: autoscaling + right-sizing from sustained metrics.

⚑ Disk full

Symptoms: writes fail, app errors. Likely causes: logs/temp growth, volume undersized, no rotation. Checks: df -h, du -sh. Fix: clean/rotate; grow the block volume online then extend the filesystem (growpart/resize2fs/xfs_growfs). Prevention: filesystem-used alarm at 85%, log rotation, lifecycle to Object Storage.

⚑ Block volume attachment issue

Symptoms: attached volume not visible in OS. Likely causes: iSCSI login steps not run (iSCSI attach), device not mounted, wrong attach type. Checks: Console shows Attached; run the iSCSI commands from the Console's attach details; lsblk. Fix: run the iSCSI iscsiadm commands or use paravirtualized attach; mount and add to fstab (use UUID). Prevention: prefer paravirtualized attach; automate mount via cloud-init.

Storage

⚑ Object Storage access denied

Symptoms: 403/404 on bucket/object. Likely causes: missing IAM policy for the user/dynamic group, wrong compartment, condition (bucket name) unmet, using API key where principal expected, expired PAR. Checks: policy grants read/manage objects in the bucket's compartment; dynamic-group membership; Audit for the denied call. Fix: add least-privilege policy; use instance principal. Prevention: standard bucket-access policy per workload; avoid PAR sprawl.

oci os object list --bucket-name <name> --auth instance_principal

⚑ File Storage (FSS) mount issue

Symptoms: NFS mount hangs or permission denied. Likely causes: NFS ports (TCP 111 and 2048-2050; UDP 111 and 2048) blocked in NSG/security list, export options exclude the client CIDR, root squash, wrong mount-target IP/path. Checks: security rules for NFS between client subnet and mount target; export options; showmount/mount -v. Fix: open the NFS ports, add the client CIDR to export options, correct the mount path. Prevention: FSS module with standard ports + export options.

Network

⚑ Service Gateway not working

Symptoms: private instance can't reach Object Storage/ADB privately. Likely causes: no Service Gateway, missing route to it for the OSN CIDR label, wrong service label, egress rule missing. Checks: VCN > Service Gateway exists + correct label; subnet route has a rule (target = SGW, dest = services CIDR label). Fix: create SGW, add route + egress rule. Prevention: bake SGW into the private-subnet Terraform module.

⚑ NAT Gateway not working

Symptoms: private instance can't reach the internet (patch repos, external APIs). Likely causes: no 0.0.0.0/0 route to NAT, egress rule missing, OS firewall. Checks: route table; security egress; curl test. Fix: add NAT route + egress. Prevention: standard subnet module; keep OCI-service traffic on SGW, not NAT.

⚑ VPN tunnel down

Symptoms: on-prem/OCI connectivity lost, tunnel state != UP. Likely causes: Phase 1/2 mismatch (encryption, PSK, lifetimes), CPE public IP change, BGP/static route misconfig, on-prem firewall. Checks: Site-to-Site VPN > tunnel status + logs; compare IKE params both ends. Fix: align IKE/IPSec parameters, correct routes/BGP. Prevention: redundant tunnels; FastConnect primary; alarm on tunnel state.

⚑ FastConnect issue

Symptoms: primary link degraded/down, BGP session down. Likely causes: physical/optical fault, BGP config, provider issue, MTU/MACsec. Checks: FastConnect state + light levels; BGP session; provider status. Fix: engage provider/DC; fail over to VPN backup. Prevention: redundant FastConnect + VPN backup with BGP failover; alarms.

⚑ DNS issue

Symptoms: names don't resolve. Likely causes: DHCP options point at wrong resolver, no hybrid forwarding, missing private-zone record/view. Checks: nslookup name 169.254.169.254; /etc/resolv.conf. Fix: fix DHCP options, add resolver forwarders both ways, add records to the private zone. Prevention: standard DNS/resolver design per VCN.

Load balancer & certificates

⚑ Load balancer backend unhealthy

Symptoms: backends Critical, 502/503. Likely causes (order): backend NSG doesn't allow the LB health-check source on the port; wrong health-check port/path/protocol; app not listening or bound to localhost; OS firewall; SSL mismatch. Checks: curl the backend health URL from the LB subnet; Path Analyzer LB→backend; ss -tlnp. Fix: allow the probe source; align health-check; bind to 0.0.0.0. Prevention: template LB + NSG together. (Full runbook in section 7.)

⚑ SSL certificate issue

Symptoms: TLS errors, browser warnings, handshake failures. Likely causes: expired cert, incomplete chain, SNI/hostname mismatch, E2E backend cert invalid. Checks: openssl s_client -connect host:443 -servername host; cert dates/chain. Fix: renew/replace via Certificates service; include full chain; match hostname. Prevention: managed certs with rotation + expiry alarms.
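
An expiry check is easy to script alongside the `openssl s_client` test. The sketch below uses only Python's standard `ssl` and `socket` modules; the function names are hypothetical, and the 30-day warning threshold in the usage note is an assumption. Because `fetch_not_after` verifies the chain and hostname, an already expired or mismatched certificate raises an `ssl.SSLError`, which is itself the finding.

```python
import socket
import ssl
import time


def days_until(not_after: str, now: float) -> float:
    """Days from 'now' (epoch seconds) until a certificate's notAfter timestamp."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect with SNI and return the server certificate's notAfter string."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a fixed timestamp in the format getpeercert() returns.
reference_now = ssl.cert_time_to_seconds("Dec  2 00:00:00 2029 GMT")
print(days_until("Jan  1 00:00:00 2030 GMT", reference_now))  # 30.0
```

In a scheduled job you would call `days_until(fetch_not_after("app.example.com"), time.time())` (a placeholder hostname) and raise an alert when the result drops below, say, 30 days.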

Database

⚑ Database backup failed

Symptoms: backup job error / no recent backup. Likely causes: Object Storage access (Service Gateway/policy), space, wallet/credential expiry, RMAN config (IaaS), retention conflict. Checks: backup job logs; Object Storage reachability; on IaaS, run RMAN LIST BACKUP. Fix: restore connectivity/credentials; correct RMAN/backup config. Prevention: alarm when no successful backup has completed within the expected window; periodic restore tests.

⚑ Database performance issue

Symptoms: slow queries, high waits. Likely causes: bad plans, missing indexes, I/O ceiling (block volume/storage), CPU saturation, contention. Checks: Performance Hub / AWR / ASH; wait classes; volume IOPS vs. tier. Fix: tune SQL/indexes, raise storage performance, scale OCPU, resource management. Prevention: Ops Insights capacity planning; auto-indexing (Autonomous); baseline plans.

⚑ Autonomous Database connection issue

Symptoms: app can't connect to ADB. Likely causes: wallet expired/wrong, mTLS vs TLS mismatch, private endpoint NSG rules, ACL blocks client IP (public), TNS alias wrong. Checks: wallet validity; ADB network config (private endpoint vs. public + ACL); NSG for the PE. Fix: refresh wallet from Vault, fix ACL/NSG, correct connection string. Prevention: store wallet in Vault; use private endpoints; automate wallet rotation.

IAM

⚑ IAM policy issue (authorization failed)

Symptoms: 404/authz error though resource exists. Likely causes: no policy grants it, wrong compartment, weak verb/wrong family, unmet condition, wrong domain qualifier. Checks: group membership + policies; compartment path; Audit for the denied request. Fix: add least-privilege statement at the correct compartment. Prevention: policies in Terraform; access matrix per compartment. (Full runbook in section 2.)

⚑ Dynamic group issue (workload can't call OCI)

Symptoms: instance/function gets authz errors using a principal. Likely causes: instance not matched by the DG rule, no policy grants the DG, wrong matching attribute/compartment, using API key not principal. Checks: DG matching rule vs. instance OCID/compartment; policy targets dynamic-group; --auth instance_principal. Fix: correct the rule, add the DG policy. Prevention: standard DG + policy per workload type.

Containers & automation

⚑ OKE pod not starting

Symptoms: Pending / ImagePullBackOff / CrashLoopBackOff. Likely causes: no capacity or pod-IP exhaustion; OCIR pull permission missing; app config/secret missing; bad probes; workload identity not set. Checks: kubectl describe pod, kubectl logs --previous, node capacity, pod-subnet free IPs. Fix: scale pool/fix subnet size; grant OCIR pull; fix config/probes/identity. Prevention: capacity headroom, correct CIDR sizing, image-scan gates. (Section 10.)

⚑ Function not triggering

Symptoms: event doesn't invoke the function. Likely causes: Event rule filter mismatch/disabled; invoke policy missing; function IAM (resource principal) missing; timeout/concurrency; cold start mistaken for failure. Checks: Event rule condition; function logs + invocation metrics; policies. Fix: correct the rule + invoke policy; verify resource-principal permissions. Prevention: test invocation in the deploy pipeline; logging on.

Observability

⚑ Alarm not firing

Symptoms: expected alert never arrives. Likely causes: wrong metric/namespace/dimension in the query, threshold/window never met, alarm disabled, Notifications topic has no confirmed subscription, suppression active. Checks: run the MQL in Metrics Explorer; alarm state history; topic subscription confirmed. Fix: correct the query/threshold; confirm subscription. Prevention: test alarms by forcing a condition; monitor the monitors (absence alarms).

⚑ Logging missing

Symptoms: expected logs absent when investigating. Likely causes: service log (LB/VCN flow/WAF) not enabled, agent not installed/configured, Service Connector not wired, wrong compartment/log group, retention expired. Checks: Logging > Logs for the resource; agent status; connector state. Fix: enable the log, install/config agent, wire the connector. Prevention: enable service logs + Audit tenancy-wide from the start; centralize via Service Connector Hub. (Section 9.)

Official documentation: OCI documentation home →

17. OCI CLI and Terraform Examples

Practical, copy-friendly automation: CLI setup and common commands, the Terraform provider, Resource Manager, and clean examples for VCN, compute, buckets, IAM, alarms, and tags - plus state and structure practices.

Last reviewed: July 2026 Verify provider version, resource argument names, and CLI syntax against current docs.

CLI setup Common commands Terraform setup VCN Compute Bucket IAM Alarm Tags State & structure

TL;DR

The CLI (Python) authenticates with API-key profiles in ~/.oci/config or, on hosts running in OCI, with an instance principal (no config profile needed). Terraform with the Oracle-maintained provider is the standard way to build production infrastructure; run it locally, in a pipeline, or in Resource Manager (managed state, plan/apply, drift detection). Keep state remote and locked, structure code into reusable modules, and separate environments by workspace/backend plus tfvars - never by copy-paste.

OCI CLI setup, profiles, authentication

# Install (Linux/macOS)
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"

# Interactive setup: creates ~/.oci/config + API signing keypair
oci setup config

# ~/.oci/config with two profiles
[DEFAULT]
user=ocid1.user.oc1..aaaa...
fingerprint=aa:bb:cc:...
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-ashburn-1

[PROD]
user=ocid1.user.oc1..bbbb...
fingerprint=dd:ee:ff:...
key_file=~/.oci/prod_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-phoenix-1

# Use a profile; or use instance principal (no keys on disk)
oci os ns get --profile PROD
oci compute instance list -c <compartment> --auth instance_principal

Security note

On any host running in OCI, prefer --auth instance_principal over an API key in ~/.oci/config. In Cloud Shell you are already authenticated as your Console identity - no key needed. Reserve API keys for off-cloud automation and store the PEM securely (never in git).
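
Where an API key is unavoidable, a quick permission check on the PEM file catches the most common storage mistake. A minimal sketch: key_is_private is a hypothetical helper that succeeds only when the file mode is exactly 600.

```shell
# Hypothetical helper: succeed only if the key file is mode 600 (owner read/write only).
# $1 = path to the PEM file
key_is_private() {
  [ -n "$(find "$1" -prune -perm 600 2>/dev/null)" ]
}
```

For example, key_is_private ~/.oci/oci_api_key.pem || chmod 600 ~/.oci/oci_api_key.pem. The CLI also ships oci setup repair-file-permissions --file <path> for the same purpose.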

Common CLI commands

# Identity / discovery
oci iam compartment list --all
oci iam region-subscription list
oci os ns get                                  # Object Storage namespace

# Compute
oci compute instance list -c <compartment-ocid> --output table
oci compute instance action --instance-id <ocid> --action STOP
oci compute image list -c <compartment-ocid> --operating-system "Oracle Linux"

# Networking
oci network vcn list -c <compartment-ocid>
oci network security-list get --security-list-id <ocid>

# Object Storage
oci os object bulk-upload -bn <bucket> --src-dir ./data --auth instance_principal
oci os object list -bn <bucket>

# Database
oci db system list -c <compartment-ocid>
oci db autonomous-database list -c <compartment-ocid>

# Query + filter with JMESPath
oci compute instance list -c <compartment-ocid> \
  --query "data[?\"lifecycle-state\"=='RUNNING'].{name:\"display-name\",ocid:id}" --output table
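
In scripts, --query is usually paired with --raw-output so the result can be captured in a shell variable without JSON quoting. The helper names below are hypothetical; the commands are the ones already listed above.

```shell
# Capture the Object Storage namespace as a bare string.
get_namespace() {
  oci os ns get --query 'data' --raw-output
}

# Space-separated OCIDs of running instances in one compartment.
# $1 = compartment OCID
running_instance_ids() {
  oci compute instance list -c "$1" \
    --query "data[?\"lifecycle-state\"=='RUNNING'].id | join(' ', @)" --raw-output
}
```

A typical use is ns=$(get_namespace), followed by commands that need the namespace as an argument.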

Terraform provider setup

# provider.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    oci = { source = "oracle/oci", version = "~> 6.0" }   # verify current major
  }
}

# Auth via config file profile (local runs)
provider "oci" {
  config_file_profile = "PROD"
  region              = var.region
}

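# For instance-principal runs, use this block instead. Arguments go on
# separate lines; HCL rejects two arguments in a single-line block.
# provider "oci" {
#   auth   = "InstancePrincipal"
#   region = var.region
# }
# In Resource Manager the provider is authenticated for you; setting region
# is typically enough (verify against current docs).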

Resource Manager

OCI Resource Manager is managed Terraform: it stores state for you, runs plan/apply as stacks, supports drift detection and private templates, and authenticates via the Console identity/resource principal - no state backend to host and no local keys. Great default for teams who don't want to run their own Terraform pipeline.
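
Stacks can also be driven from the CLI, which is useful for scheduled plans and drift checks. A minimal sketch, assuming an existing stack: rm_plan and rm_drift are hypothetical wrapper names, and the stack OCID is supplied by the caller.

```shell
# Start a plan job for a stack. $1 = stack OCID
rm_plan() {
  oci resource-manager job create-plan-job --stack-id "$1"
}

# Start drift detection for a stack. $1 = stack OCID
rm_drift() {
  oci resource-manager stack detect-drift --stack-id "$1"
}
```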

Create a VCN with subnets and a gateway

resource "oci_core_vcn" "main" {
  compartment_id = var.compartment_ocid
  cidr_blocks    = ["10.10.0.0/20"]
  display_name   = "app-vcn"
  dns_label      = "appvcn"
}

resource "oci_core_nat_gateway" "nat" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "app-nat"
}

resource "oci_core_service_gateway" "sgw" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  services { service_id = data.oci_core_services.all.services[0].id }
  display_name   = "app-sgw"
}

resource "oci_core_route_table" "private_rt" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "private-rt"
  route_rules {
    destination       = "0.0.0.0/0"
    network_entity_id = oci_core_nat_gateway.nat.id
  }
  route_rules {
    destination_type  = "SERVICE_CIDR_BLOCK"
    destination       = data.oci_core_services.all.services[0].cidr_block
    network_entity_id = oci_core_service_gateway.sgw.id
  }
}

resource "oci_core_subnet" "app" {
  compartment_id             = var.compartment_ocid
  vcn_id                     = oci_core_vcn.main.id
  cidr_block                 = "10.10.2.0/24"
  display_name               = "app-private"
  route_table_id             = oci_core_route_table.private_rt.id
  prohibit_public_ip_on_vnic = true          # private subnet
  dns_label                  = "app"
}

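# Select the "All <region> Services In Oracle Services Network" entry explicitly;
# without a filter, services[0] depends on the order the API returns.
data "oci_core_services" "all" {
  filter {
    name   = "name"
    values = ["All .* Services In Oracle Services Network"]
    regex  = true
  }
}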
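
The VCN above uses a /20 block and a /24 subnet. The arithmetic behind that sizing can be sketched as follows; usable_ips and subnets_in are hypothetical helper names, and the calculation relies on OCI reserving three addresses in every subnet.

```shell
# Usable addresses in an OCI subnet: 2^(32 - prefix) minus the 3 reserved by OCI.
# $1 = subnet prefix length
usable_ips() {
  echo $(( (1 << (32 - $1)) - 3 ))
}

# How many subnets of a given prefix fit in a VCN block.
# $1 = VCN prefix length, $2 = subnet prefix length
subnets_in() {
  echo $(( 1 << ($2 - $1) ))
}
```

usable_ips 24 prints 253, and subnets_in 20 24 prints 16, so the /20 above has room for sixteen /24 tiers.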

Create a compute instance

resource "oci_core_instance" "app" {
  compartment_id      = var.compartment_ocid
  availability_domain = var.ad
  display_name        = "app-01"
  shape               = "VM.Standard.E5.Flex"
  shape_config {
    ocpus         = 2
    memory_in_gbs = 32
  }

  create_vnic_details {
    subnet_id        = oci_core_subnet.app.id
    assign_public_ip = false
    nsg_ids          = [oci_core_network_security_group.app.id]   # NSG defined elsewhere (not shown)
  }
  source_details {
    source_type = "image"
    source_id   = var.image_ocid
  }
  metadata = {
    ssh_authorized_keys = file("~/.ssh/id_rsa.pub")
    user_data           = base64encode(file("${path.module}/cloud-init.yaml"))   # path relative to this module
  }
}

Create an Object Storage bucket

data "oci_objectstorage_namespace" "ns" { compartment_id = var.tenancy_ocid }

resource "oci_objectstorage_bucket" "backups" {
  compartment_id = var.compartment_ocid
  namespace      = data.oci_objectstorage_namespace.ns.namespace
  name           = "db-backups"
  access_type    = "NoPublicAccess"
  versioning     = "Enabled"
  # kms_key_id = oci_kms_key.data.id   # customer-managed key
}

Create an IAM dynamic group and policy

resource "oci_identity_dynamic_group" "app_servers" {
  compartment_id = var.tenancy_ocid
  name           = "app-servers"
  description    = "App instances in the app compartment"
  matching_rule  = "ALL {instance.compartment.id = '${var.compartment_ocid}'}"
}

resource "oci_identity_policy" "app_bucket_read" {
  compartment_id = var.compartment_ocid
  name           = "app-bucket-read"
  description    = "App servers read backups bucket"
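  statements = [
    # Referencing the resource (not a literal name) makes Terraform create the dynamic group first.
    "Allow dynamic-group ${oci_identity_dynamic_group.app_servers.name} to read objects in compartment id ${var.compartment_ocid} where target.bucket.name = 'db-backups'"
  ]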
}

Create a monitoring alarm

resource "oci_ons_notification_topic" "ops" {
  compartment_id = var.compartment_ocid
  name           = "ops-alerts"
}

resource "oci_monitoring_alarm" "cpu_high" {
  compartment_id        = var.compartment_ocid
  display_name          = "app-cpu-high"
  metric_compartment_id = var.compartment_ocid
  namespace             = "oci_computeagent"
  query                 = "CpuUtilization[5m].mean() > 85"
  severity              = "WARNING"
  destinations          = [oci_ons_notification_topic.ops.id]
  is_enabled            = true
  body                  = "App CPU above 85% for 5 minutes."
  pending_duration      = "PT5M"
}

Tag namespace, tag, and tag default

resource "oci_identity_tag_namespace" "finance" {
  compartment_id = var.tenancy_ocid
  name           = "Finance"
  description    = "Cost attribution tags"
}

resource "oci_identity_tag" "cost_center" {
  tag_namespace_id = oci_identity_tag_namespace.finance.id
  name             = "CostCenter"
  description      = "Charge-back cost center"
  validator {
    validator_type = "ENUM"
    values         = ["CC-4412", "CC-5501", "CC-7788"]
  }
}

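# The tag default below needs an Environment tag definition.
# It is placed in the Finance namespace here for brevity (an assumption).
resource "oci_identity_tag" "environment" {
  tag_namespace_id = oci_identity_tag_namespace.finance.id
  name             = "Environment"
  description      = "Deployment environment"
}

# Auto-apply the Environment tag to every new resource in a compartment
resource "oci_identity_tag_default" "env_default" {
  compartment_id    = var.compartment_ocid
  tag_definition_id = oci_identity_tag.environment.id
  value             = "prod"
}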

State management and structure

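Remote, locked state: use Resource Manager (state managed for you) or a remote backend that supports locking. Confirm locking for your backend and Terraform version; not every Object Storage setup provides it. Never keep prod state only on a laptop; never commit state (it can hold secrets in plain text).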
Modular structure: reusable modules (network, compute, db, iam, monitoring) composed per environment - not copy-pasted stacks.
Environment separation: separate state per env (workspaces or separate backends/stacks) driven by dev.tfvars / prod.tfvars; separate compartments and, ideally, separate credentials/pipelines.
Drift detection: Resource Manager (or scheduled plan) to catch manual Console changes.
No secrets in code: reference Vault secrets/keys by OCID; keep tfvars with secrets out of git.

oci-infra/
  modules/
    network/   compute/   database/   iam/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  README.md
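
With the layout above, each environment is planned from its own directory with its own tfvars file. The sketch below is one possible wrapper, not a prescribed workflow: tf_plan and tf_drift are hypothetical names, and the drift check relies on terraform plan -detailed-exitcode returning 0 for no changes, 2 for pending changes, and 1 for an error.

```shell
# Plan one environment. $1 = directory name under envs/ (dev or prod)
# With -chdir, the tfvars and plan file paths are relative to envs/<env>/.
tf_plan() {
  terraform -chdir="envs/$1" init -input=false &&
  terraform -chdir="envs/$1" plan -input=false -var-file="$1.tfvars" -out="$1.tfplan"
}

# Scheduled drift check for one environment. $1 = dev or prod
tf_drift() {
  rc=0
  terraform -chdir="envs/$1" plan -input=false -detailed-exitcode \
    -var-file="$1.tfvars" >/dev/null || rc=$?
  case "$rc" in
    0) echo "no drift" ;;
    2) echo "drift detected"; return 2 ;;
    *) echo "plan failed"; return 1 ;;
  esac
}
```

Running tf_drift prod on a schedule and alerting on a non-zero exit is a simple way to catch manual Console changes when Resource Manager is not in use.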

Architect note

The goal is that Prod and DR are provably the same because they come from the same modules with different variables. Manual Console changes in prod are the enemy of a working DR; enforce "infrastructure changes go through Terraform" and use drift detection to catch violations.

Official documentation: OCI CLI, Terraform provider & Resource Manager →

18. Learning Path

A structured route from OCI fundamentals to enterprise-grade architecture and operations, aimed at people coming from traditional Oracle/infrastructure backgrounds. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026 Certification names/exam codes change - verify on Oracle University before scheduling.

Beginner

Foundations: identity, network, compute, storage

Intermediate

LB, private networking, DB, monitoring, security, cost

Advanced

Exadata, DR, FastConnect, OKE, Terraform, landing zone, GenAI

How to use this

Do the labs, don't just read. Use an Always Free tenancy or a trial for hands-on where possible. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (OCI Foundations → Architect Associate → Architect Professional, plus specialty tracks) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations

Goal: deploy and connect basic OCI resources confidently

What to learn

OCI fundamentals: regions, ADs, fault domains, realms, home region (section 1).
Tenancy and compartments; how to structure them.
IAM basics: users, groups, policies, the verb/resource-type model (section 2).
VCN basics: subnets, route tables, security lists/NSGs, gateways (section 3).
Compute basics: shapes, images, SSH, cloud-init (section 4).
Storage basics: block, object, file - and when to use each (section 5).

Why it matters

Every OCI design rests on these. Get the tenancy/compartment/network mental model right now and everything later is easier; get it wrong and you rebuild.

Hands-on labs

Create a compartment, a group, and a least-privilege policy; add yourself and test access.
Build a VCN with a public and a private subnet, IGW, NAT, and Service Gateway.
Launch a public bastion and a private instance; SSH to the private one through the bastion.
Attach and mount a block volume; create a bucket and upload an object; create an FSS share.
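
For the bastion lab, OpenSSH's ProxyJump option reaches the private instance in one command without copying keys to the bastion. A minimal sketch, assuming Oracle Linux images (default user opc) on both hosts; ssh_via_bastion is a hypothetical helper name.

```shell
# $1 = bastion public IP, $2 = private instance IP
# -J (ProxyJump) tunnels the second connection through the first.
ssh_via_bastion() {
  ssh -J "opc@$1" "opc@$2"
}
```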

Common mistakes

Everything in root compartment; public subnet used for everything; forgetting the Service Gateway; no fault-domain awareness.

Expected outcome

You can stand up a properly-segmented VCN with public/private tiers, reach a private host securely, and use all three storage types - and explain the shared responsibility model.

Intermediate

Level 2 - Building real workloads

Goal: deploy an HA app + database with monitoring, security, and cost control

What to learn

Load balancers: L7 vs NLB, listeners/backend sets/health checks, SSL (section 7).
Private networking depth: NSGs by tier, DNS/DHCP, flow logs, Path Analyzer (section 3).
Hybrid connectivity: Site-to-Site VPN and DRG (section 3/13).
Database services: Base Database and Autonomous - provisioning, backups, Data Guard basics (section 6).
Monitoring & logging: metrics, alarms, notifications, Service Connector Hub (section 9).
Security services: Vault, Cloud Guard, Bastion, Security Zones, WAF (section 8).
Cost management: Cost Analysis, Budgets, Quotas, tagging (section 14).

Why it matters

This is the day-job: HA application tiers, managed databases, and the operational and security controls that make them production-worthy.

Hands-on labs

Deploy a 3-tier app: public LB + WAF → instance pool (multi-FD) → Autonomous/Base DB (private).
Wire NSGs so only app→db on the DB port is allowed; verify with Path Analyzer.
Set up alarms (CPU, unhealthy backend, DB storage) to a Notifications topic; force one to fire.
Store the DB wallet/secret in Vault; give the app instance-principal access.
Add a budget + quota to the compartment; tag all resources with CostCenter/Environment.
Set up a VPN/DRG to a simulated on-prem network.

Common mistakes

LB health-check NSG rule missing; DB in a public subnet; secrets in cloud-init instead of Vault; noisy alarms; no tags so cost can't be attributed.

Expected outcome

You can deploy a secure, monitored, HA application and database, connect it to on-prem, and keep its cost and access under control.

Advanced

Level 3 - Enterprise architecture & operations

Goal: design governed, multi-region, automated enterprise platforms

What to learn

Exadata Database Service (formerly Exadata Cloud Service) and Autonomous deep dive: consolidation, RAC, PDBs, patching at scale (section 6).
DR design: Data Guard/Active Data Guard, Full Stack DR, RTO/RPO, pilot-light to active/active (section 13).
FastConnect and advanced hybrid: redundant links, BGP, Network Firewall, hub-and-spoke (sections 3/13).
OKE and cloud native: node pools/virtual nodes, workload identity, DevOps pipelines, service mesh (section 10).
Terraform + Resource Manager: modules, remote state, environment separation, drift (section 17).
Landing zone: CIS-aligned governance, Security Zones, centralized logging/audit (sections 8/14).
Enterprise security: customer-managed keys, Data Safe, Database Vault, break-glass, least privilege at scale (section 8).
GenAI & AI Vector Search: governed RAG over enterprise data, the serving-layer pattern (section 12).
Large-enterprise architecture: multi-BU tenancy, multi-region, chargeback, standardization (sections 1/15).

Why it matters

At this level you are responsible for governance, resilience, automation, and cost across many teams and workloads - decisions that are expensive to reverse.

Hands-on labs

Deploy a CIS-aligned landing zone via Terraform/Resource Manager (compartments, IAM, network hub, Security Zones, logging, budgets, quotas, tag defaults).
Build cross-region DR for a database (Active Data Guard) and rehearse a switchover; confirm keys exist in DR.
Stand up an OKE platform with private cluster, workload identity, and a DevOps CI/CD pipeline from OCIR.
Implement a governed RAG assistant: Object Storage + DB 23ai Vector Search + Generative AI behind a serving API, with audit and entitlement-filtered retrieval.
Refactor a hand-built environment into reusable Terraform modules with per-env state and drift detection.

Common mistakes

Skipping the landing zone and retrofitting governance; DR never tested; over-privileged workload identity/IAM; Console drift breaking DR parity; connecting AI agents to production data without a governed serving layer.

Expected outcome

You can design and operate a governed, automated, multi-region OCI platform for a large enterprise - and defend the trade-offs on security, resilience, and cost.

Certification checkpoints (optional)

Level	Typical certification track
Beginner	OCI Foundations Associate
Intermediate	OCI Architect Associate (+ Operations/Networking specialty as relevant)
Advanced	OCI Architect Professional (+ specialty: Security, Multicloud, Data/AI, Autonomous)

Verify before scheduling

Oracle updates exam codes and content regularly (and Oracle University often offers free training/exam windows). Confirm the current track and objectives on the official Oracle University site before you prepare. Certifications validate knowledge; the labs above build the capability employers actually pay for.

Official: Oracle University OCI learning paths & certifications →
