
Oracle Cloud Infrastructure Deep Dive Portal

A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, and troubleshoot real OCI environments - not a marketing overview.

18 deep sections | Architecture patterns | Troubleshooting runbooks | CLI & Terraform | Decision matrices
Last reviewed: July 2026. Cloud services change; verify with current Oracle documentation before production use.
WHO THIS IS FOR

Oracle Cloud Architects, Apps DBAs, Oracle DBAs, infrastructure engineers, cloud engineers, enterprise architects, and anyone moving from traditional on-premises Oracle environments into OCI. It assumes you already understand servers, storage, networks, and Oracle Database - and focuses on how those ideas map into OCI and what changes operationally.

How this portal is organized

Each section is a self-contained deep dive. Use the left navigation or the search box in the top bar to jump directly to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, shape names, service limits, model availability), a Verify with current Oracle documentation flag.

Learn
Foundations first

Sections 1-2 establish the mental model: regions, ADs, fault domains, tenancy, compartments, and the IAM policy language that everything else depends on.

Build
Service deep dives

Sections 3-12 cover networking, compute, storage, database, load balancing, security, observability, containers, analytics, and AI - with diagrams, tables, and gotchas.

Operate
Run and recover

Sections 13-18 cover migration and DR, cost and governance, reference architecture patterns, troubleshooting runbooks, automation, and a structured learning path.

Reading the callouts

Five callout types recur throughout. They flag the perspective that matters most for a given point.

Architect note
Design-time decisions, trade-offs, and things you must decide before production.
DBA note
Database-specific behavior, what Oracle manages vs. what you manage, patching and backup nuances.
Security note
Exposure, least privilege, encryption, and audit considerations.
Cost note
Where money is spent and where it is commonly wasted.
Common mistake
A specific design or operational error teams repeatedly make, and how to avoid it.

The OCI shared responsibility model (orientation)

Everything in this portal sits on one idea: in cloud, responsibility is split, and the split moves depending on the service. Get this wrong and you will either leave gaps (security incidents, unrecoverable data) or do work Oracle already does for you (wasted effort).

Layer | IaaS (Compute + you install DB) | Base Database / VM DB | Exadata Cloud Service | Autonomous Database
Physical / hypervisor | Oracle | Oracle | Oracle | Oracle
OS patching | You | You (guest VM) | You (guest VM) | Oracle
DB software install/patch | You | Oracle tooling, you trigger | Oracle tooling, you trigger | Oracle
Backup config | You | Managed, you configure | Managed, you configure | Oracle (you set retention)
HA / RAC | You build it | Optional, you choose | Built-in RAC | Built-in
Schema, SQL, tuning | You | You | You | You
Data classification & access | You | You | You | You
The one rule that never moves
Oracle secures the cloud. You secure what you put in the cloud: identities, network exposure, data classification, and access. No managed service removes your responsibility for who can reach the data and what they can do with it.


Accuracy & independence
This is an independent educational resource, not official Oracle material and not a sales tool. Service names, limits, shapes, and pricing change frequently. Treat every concrete number, shape name, and limit here as a starting point and confirm it in the OCI Console for your region and the official Oracle documentation before any design, sizing, licensing, or purchasing decision.

1. OCI Fundamentals

The physical and logical building blocks of Oracle Cloud Infrastructure, and the tenancy and compartment structure that every enterprise deployment stands or falls on.

Last reviewed: July 2026. Verify region list, service limits, and quotas in the OCI Console.
TL;DR

OCI is a set of regions, each built from isolated Availability Domains (data centers) that are further split into Fault Domains (racks). Your account is a tenancy; you organize resources into compartments (logical, not network) and control access with IAM policies written against those compartments. Compartment and tenancy structure is the single most important thing to get right before production - it is painful to restructure later.

What OCI is

Oracle Cloud Infrastructure is Oracle's public cloud: on-demand compute, storage, networking, database, and platform services delivered from Oracle-operated data centers, consumed over the network, and billed by usage. Compared to Oracle's first-generation cloud, OCI ("Gen 2") was rebuilt with an off-box network virtualization design - the virtualization and network isolation run on separate hardware from the customer's compute, which is the basis for its bare-metal offerings and its network isolation guarantees.

Practically, OCI gives you four things that matter to an enterprise Oracle shop:

  • Real bare metal - you can rent an entire physical server with no hypervisor, which matters for licensing and for the highest-performance database workloads.
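  • Exadata as a cloud service - the same engineered system you may run on-premises, delivered as Exadata Database Service in OCI regions or as Exadata Cloud@Customer in your own data center. (Base Database is a separate service on standard compute shapes, not Exadata.)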
  • Autonomous Database - a self-managing database platform where Oracle runs patching, tuning, backup, and scaling.
  • A flat, predictable network - a non-blocking, low-latency backbone with off-instance network virtualization.

OCI global architecture

Realm (e.g. OC1 commercial, OC2/OC3 government, dedicated regions): isolation boundary. Identity and tenancies do not cross realms.
  Region A (e.g. us-ashburn-1)
    Availability Domain 1: Fault Domain 1, Fault Domain 2, Fault Domain 3
    Availability Domain 2: Fault Domain 1, Fault Domain 2, Fault Domain 3
  Region B (paired for DR, e.g. us-phoenix-1): cross-region replication, backups, DR
Some regions have a single AD; FD placement still provides in-region HA.
OCI hierarchy: Realm > Region > Availability Domain > Fault Domain

Regions, Availability Domains, Fault Domains

Concept | What it is | Failure it protects against | What you do with it
Region | A localized geographic area containing one or more Availability Domains. Your data residency boundary. | Regional disaster, large-scale outage | Choose based on latency to users, data residency law, and service availability. Deploy DR to a second region.
Availability Domain (AD) | One or more isolated data centers within a region, with independent power, cooling, and network. | Data-center-level failure | Spread instances / DB nodes across ADs (in multi-AD regions) for HA. Many regions have only 1 AD.
Fault Domain (FD) | A grouping of hardware within an AD (think: a rack). Every AD has exactly 3 FDs. | Rack / hardware / maintenance failure within an AD | Anti-affinity: place HA pairs in different FDs. This is your only in-region HA lever in a single-AD region.
Realm | A hard isolation boundary (commercial OC1, US Gov, UK Gov, dedicated). Identities and tenancies never cross realms. | Compliance / sovereignty isolation | Usually fixed by your contract; matters for regulated workloads.
Architect note - single-AD regions
Many OCI regions have only one Availability Domain. Do not assume multi-AD HA is available everywhere. In a single-AD region, your in-region resilience comes entirely from spreading across the three Fault Domains, and your true disaster resilience comes from a second region. Confirm the AD count for your chosen region before designing HA.
Common mistake
Designing a "multi-AD" active-active database only to discover the target region has a single AD. RAC on Base/Exadata in a single-AD region still gives you node HA across fault domains, but AD-level HA (spreading nodes across ADs) is only possible in multi-AD regions like Ashburn, Phoenix, or Frankfurt.
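The anti-affinity idea above can be sketched in a few lines of Python. This is a minimal illustration, not an OCI API call: it assumes the standard fault domain names (FAULT-DOMAIN-1 to FAULT-DOMAIN-3), and the node names are hypothetical. It only decides which fault domain each node would be requested in.

```python
from collections import Counter
from itertools import cycle

# Every AD exposes three fault domains with these names.
FAULT_DOMAINS = ("FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3")


def spread_across_fault_domains(node_names):
    """Assign nodes to fault domains round-robin so neighbours never share one."""
    return dict(zip(node_names, cycle(FAULT_DOMAINS)))


# Hypothetical node names, listed so that HA peers are adjacent.
placement = spread_across_fault_domains(
    ["db-node-1", "db-node-2", "app-1", "app-2", "app-3"]
)
per_domain = Counter(placement.values())
```

Listing HA peers next to each other in the input keeps them in different fault domains; the resulting value would then be passed as the fault domain when each instance is launched.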

Home region and subscriptions

When you sign up you pick a home region. IAM resources (users, groups, policies, dynamic groups, compartments, federation, and in the legacy model the tenancy's identity metadata) are mastered in the home region and replicated read-only to subscribed regions. You then subscribe the tenancy to additional regions to deploy workloads there.

  • You cannot change the home region after it is set. Choose deliberately - it should be a region close to your identity administrators and one you intend to keep long-term.
  • IAM writes (create a user, edit a policy) always go to the home region and propagate out. A home-region outage can therefore affect identity administration globally even while workloads keep running.
  • Subscribing to a region is easy; unsubscribing is not - plan region subscriptions rather than turning them on casually.

Tenancy and compartments

Your tenancy is the root container for your entire OCI account - it is itself the root compartment. Everything lives under it.

Compartments are logical containers for resources (compute, VCNs, buckets, databases). They are the primary unit of access control, isolation, quota, and cost tracking. Key properties that trip people up coming from AWS/on-prem:

  • Compartments are global, not regional. A compartment exists across all subscribed regions; the resources inside it are regional.
  • Compartments are logical, not a network boundary. Two instances in different compartments can talk over the network if the VCN/subnet/security rules allow it. Isolation is by IAM policy, not by compartment walls.
  • Compartments can be nested (up to six levels deep). Policies and quotas can be scoped at any level.
  • Resources can be moved between compartments (most, not all), but some have caveats. Deleting a compartment requires it be empty and is asynchronous.
Tenancy (root compartment)
  Shared-Services (network hub)
    Hub VCN, DRG, FastConnect
    Bastion, jump hosts
    Vault (keys, secrets)
    Logging, monitoring
    Security tooling
  Workloads (per environment)
    Prod (spoke VCN, DB, apps)
    Stage
    Test
    Dev
    DR (paired region)
  Governance & Sandbox
    Security (Cloud Guard, audit)
    Network (if centralized)
    Sandbox (guardrailed, capped)
    BU-A / BU-B (business units)
    Quotas + budgets per child
A common enterprise compartment topology: shared services, per-environment workloads, and governance
How to structure compartments for an enterprise (Design)

There is no single correct model, but the durable patterns are:

  • By environment (most common): top-level compartments or a Workloads parent with Prod / Stage / Test / Dev / DR children. Clean blast-radius separation and simple policies ("group X can manage instances in Dev only").
  • By business unit: a compartment per BU, each with its own environment sub-compartments. Fits organizations that charge back and delegate admin per BU.
  • Shared services split out: put the network hub, Vault, logging, and security tooling in their own compartment(s) so platform teams own them and workload teams cannot alter them.
  • Hybrid (recommended for large orgs): Shared-Services + Security + per-BU (each BU has Prod/NonProd) + Sandbox. Landing zone frameworks (see the Cost & Governance section) codify this.
Architect note
Favor a shallow, predictable tree. Deep nesting (5-6 levels) makes policies hard to reason about and console navigation slow. Two to three levels covers almost every enterprise.
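One way to keep the tree shallow is to check a proposed layout before creating it. The sketch below assumes the layout is held as a nested dictionary (a hypothetical planning format, not something OCI exports) and lists every compartment path nested deeper than a chosen limit.

```python
def paths_deeper_than(children, limit, prefix=()):
    """Return compartment paths nested more than `limit` levels below the root."""
    found = []
    for name, grandchildren in children.items():
        path = prefix + (name,)
        if len(path) > limit:
            found.append(":".join(path))
        found.extend(paths_deeper_than(grandchildren, limit, path))
    return found


# Hypothetical planned layout: keys are compartment names, values are children.
PLANNED = {
    "Shared-Services": {},
    "Workloads": {"Prod": {}, "Dev": {}},
    "BU-A": {"Prod": {"App": {"Tier1": {}}}},
}

too_deep = paths_deeper_than(PLANNED, limit=3)
```

Here only BU-A:Prod:App:Tier1 sits at a fourth level, so it is the one path reported against a three-level limit.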
Separating dev, test, stage, prod, and DR (Design)
  • Separate compartments per environment give you independent IAM, quotas, budgets, and cost reporting.
  • Strongly consider separate VCNs (or at least separate subnets with strict NSGs) so a mistake in Dev cannot reach Prod data.
  • Some regulated shops use separate tenancies for Prod vs. non-Prod for hard isolation and separate billing. This is the strongest separation but adds identity/federation overhead and cross-tenancy networking complexity. Decide based on your risk and compliance posture.
  • DR lives in a different region. Keep its compartment structure a mirror of Prod so IAM policies and automation translate cleanly.
Designing for multiple business units (Design)
  • Give each BU a top-level compartment with delegated admin (a BU-admin group with manage rights scoped to that compartment only).
  • Apply compartment quotas to cap what each BU can consume, and budgets with alerts for spend.
  • Centralize the network in a Shared-Services compartment and connect BU spoke VCNs via a DRG hub, so BUs cannot each build divergent, unmanaged network topologies.
  • Use defined tags (cost-center, owner, environment) enforced by tag defaults so chargeback works from day one.
Common mistakes in tenancy / compartment design
  • Putting everything in the root compartment "to start" - it becomes impossible to apply least privilege later.
  • Modeling compartments as if they were network boundaries. They are not; network isolation is VCN/subnet/security rules.
  • Over-nesting. Six-level trees look tidy but make policy debugging miserable.
  • No naming standard. Inconsistent names (prod vs Production vs PRD) break automation and reporting.
  • Not reserving compartments for shared services and security up front, so those resources end up scattered in workload compartments.
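A naming standard is easiest to keep when it is checked mechanically. The following sketch assumes a hypothetical convention of `<bu>-<env>-<purpose>` in lowercase with a fixed set of environment codes; the pattern and the sample names are illustrative only, so substitute your own standard.

```python
import re

# Hypothetical standard: <bu>-<env>-<purpose>, lowercase, fixed environment codes.
NAME_PATTERN = re.compile(r"[a-z0-9]+-(prod|stage|test|dev|dr)-[a-z0-9]+")


def nonconforming(names):
    """Return the names that do not follow the naming standard."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


bad_names = nonconforming(
    ["fin-prod-db", "fin-Production-db", "hr-PRD-app", "hr-dev-app"]
)
```

Running a check like this in the provisioning pipeline catches variants such as Production or PRD before they reach reporting and automation.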

Limits, quotas, and service limits

Mechanism | Set by | Scope | Purpose
Service limits | Oracle (per tenancy, per region, sometimes per AD) | Tenancy | The maximum of a resource you can create (e.g. number of OCPUs of a shape). Raise via a limit-increase request in the Console.
Compartment quotas | You (policy-like statements) | Compartment | Cap/allow/deny resource creation per compartment. Your governance lever - e.g. "no bare metal in Dev".
Budgets | You | Compartment / tag | Track and alert on spend. Do not block creation - they notify.

Quota statements look like policy but control resource counts. Example that blocks expensive shapes in a Dev compartment:

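# Quota statements targeting the Dev compartment (Governance > Quota Policies)
# Quota names vary by service and shape; verify them in the current quota documentation
zero compute-core quota standard-e5-core-count in compartment Dev
set compute-core quota standard-e4-core-count to 64 in compartment Dev
# Deny a bare-metal database shape in this compartment (repeat per shape quota)
zero database quota bm-dense-io2-52-count in compartment Dev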
Cost note
Compartment quotas are the cheapest guardrail you have. A "zero bare-metal" and "cap OCPUs" quota in Dev/Test prevents a single mis-click from provisioning a five-figure monthly resource. Set them before you hand a compartment to a team.
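If several compartments need the same guardrails, the statements can be generated rather than typed. This is a rough sketch that assumes the `set <family> quota <name> to <n> in compartment <c>` and `zero <family> quota <name> in compartment <c>` forms from Oracle's quota policy language. The family and quota names passed in are placeholders to verify against current documentation.

```python
def quota_statements(compartment, caps):
    """Render quota statements; a cap of 0 becomes a `zero` statement."""
    lines = []
    for (family, quota_name), limit in caps.items():
        if limit == 0:
            lines.append(
                f"zero {family} quota {quota_name} in compartment {compartment}"
            )
        else:
            lines.append(
                f"set {family} quota {quota_name} to {limit} "
                f"in compartment {compartment}"
            )
    return lines


# Placeholder quota names: confirm the exact names for your services and shapes.
dev_caps = {
    ("compute-core", "standard-e4-core-count"): 64,
    ("compute-core", "standard-e5-core-count"): 0,
}
statements = quota_statements("Dev", dev_caps)
```

The rendered lines can be pasted into a quota policy or fed to whatever tool manages your governance configuration.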

Resource OCIDs

Every OCI resource has a globally unique Oracle Cloud Identifier (OCID). You will use these constantly in CLI, Terraform, and support tickets. The format is readable:

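ocid1.instance.oc1.iad.anuwcljr...abcd1234
  |      |      |   |      |
version type  realm region unique id (opaque)
  • The type segment (instance, vcn, bucket, database, compartment, user, policy...) tells you what the resource is at a glance.
  • The region segment is a region identifier: a short key such as iad (Ashburn) or phx (Phoenix), or a full region name such as eu-frankfurt-1.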
  • Some resources are regionless (users, groups, compartments, policies) - their OCID has no region segment.
  • OCIDs are stable for the life of the resource; scripts and Terraform state key off them.
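Because the segments are dot-separated, an OCID can be split in scripts to route or validate resources. The sketch below is a minimal parser based on the layout described above; the `Ocid` type and the sample identifiers are made up for illustration.

```python
from typing import NamedTuple


class Ocid(NamedTuple):
    version: str
    resource_type: str
    realm: str
    region: str  # empty string for regionless resources
    unique_id: str


def parse_ocid(value):
    """Split an OCID into its five segments, or raise ValueError."""
    # maxsplit=4 keeps any further dots inside the opaque unique id.
    parts = value.split(".", 4)
    if len(parts) != 5 or not parts[0].startswith("ocid"):
        raise ValueError(f"not an OCID: {value!r}")
    return Ocid(*parts)


# Made-up identifiers, for illustration only.
instance = parse_ocid("ocid1.instance.oc1.phx.exampleuniqueid")
compartment = parse_ocid("ocid1.compartment.oc1..exampleuniqueid")
```

Note how the regionless compartment OCID yields an empty region segment because of the two consecutive dots.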

Tags: freeform, defined, and namespaces

Tag type | Structure | Governed? | Use for
Freeform tags | Simple key:value, no schema | No control - anyone with manage rights can set anything | Quick, informal labels. Avoid for anything you report or bill on.
Defined tags | Live in a tag namespace; keys are predefined, values can be validated/restricted | Yes - controlled by IAM policy and value lists | Cost tracking, environment, owner, data-classification. The enterprise standard.
Tag namespace | A container for defined tag keys (e.g. Finance namespace with CostCenter, Project) | Yes | Grouping and governing related tag keys; can be retired/reactivated.
Tag defaults | Auto-applied defined tags on any new resource in a compartment | Yes | Guaranteeing every resource is tagged (e.g. auto-stamp CreatedBy, Environment).
Architect note - decide the tag model before launch
Define a small, enforced set of defined tags (CostCenter, Environment, Owner, DataClassification) with a tag namespace and tag defaults from day one. Retrofitting tags across thousands of resources for chargeback or a security audit is slow, error-prone, and never fully complete. You can also drive cost tracking and even some IAM conditions off tags - but only if they exist consistently.
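A tag model is only useful if gaps are found early. As a sketch, the check below takes a resource's defined tags in the nested namespace-to-key mapping that OCI APIs return and reports which required keys are missing. The `Operations` namespace and the required set are assumptions for illustration; `Finance.CostCenter` follows the example used in this section.

```python
# Hypothetical required tag model: namespace -> required keys.
REQUIRED_TAGS = {
    "Finance": ["CostCenter"],
    "Operations": ["Environment", "Owner", "DataClassification"],
}


def missing_defined_tags(defined_tags, required=REQUIRED_TAGS):
    """Return 'Namespace.Key' for every required tag that is absent or empty."""
    missing = []
    for namespace, keys in required.items():
        present = defined_tags.get(namespace, {})
        missing.extend(
            f"{namespace}.{key}" for key in keys if not present.get(key)
        )
    return missing


resource_tags = {
    "Finance": {"CostCenter": "CC-4412"},
    "Operations": {"Environment": "Dev"},
}
gaps = missing_defined_tags(resource_tags)
```

Run against an inventory export, a report like this shows how far a tenancy is from consistent tagging before chargeback or an audit depends on it.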

Ways to work with OCI

OCI Console

The web UI. Best for learning, exploring, one-off tasks, and reading state. Not for repeatable production changes - use IaC for those.

OCI CLI

Python-based command line. Great for scripting, ad-hoc automation, and things not yet in Terraform. Uses config profiles (see section 17).

SDKs

Java, Python, Go, TypeScript/JavaScript, .NET, Ruby, PL/SQL. For building applications and tooling against OCI APIs.

Cloud Shell

Browser-based terminal, pre-authenticated as your Console identity, with CLI/Terraform/kubectl pre-installed and ephemeral home storage. Ideal for quick, credential-free tasks.

Terraform provider for OCI

The recommended way to build and manage infrastructure declaratively. Oracle maintains the provider. Run it locally, in a pipeline, or in OCI Resource Manager (managed Terraform with state).

Resource Manager

OCI's managed Terraform service - stores state, runs plan/apply, supports stacks and drift detection without you hosting a state backend. Covered in section 17.

Recommended posture
Console to learn and inspect. CLI for glue and ad-hoc ops. Terraform (via Resource Manager or a pipeline) for anything that reaches production. Manual Console clicks in prod are the root cause of most "why is this different in DR?" incidents.

What to decide before production

  • Home region - irreversible. Pick with identity-admin proximity and long-term intent in mind.
  • Realm - commercial vs. government/dedicated - usually contractual, but confirm it matches your compliance needs.
  • Tenancy strategy - single tenancy with compartments, or separate Prod/non-Prod tenancies.
  • Compartment topology - environment vs. BU vs. hybrid; where shared services and security live.
  • Identity domain strategy - default domain vs. multiple domains, and federation to your IdP (see section 2).
  • Network CIDR plan - non-overlapping with on-premises and future regions (see section 3). This is very hard to change later.
  • Tag model - defined tags + namespaces + tag defaults for cost and governance.
  • Guardrails - compartment quotas, budgets, Security Zones, Cloud Guard baseline.
  • DR region and RTO/RPO targets - which region, which DR pattern per tier.
Common mistake
Treating the first workload as "just a POC" and letting it define the tenancy, home region, and CIDR plan by accident. POCs become production. Decide the list above deliberately before the first real workload lands.

2. Identity and Access Management

Who can do what, to which resources, in which compartment. IAM is where most OCI security incidents and access-denied tickets originate, so this section goes deep on the policy language, identity domains, and least-privilege design.

Last reviewed: July 2026. Identity Domains behavior evolves - verify verb/resource-type names in current docs.
TL;DR

Access in OCI is granted only by policies. A policy is a set of human-readable statements: Allow <subject> to <verb> <resource-type> in <compartment> [where <condition>]. There is no implicit access - if no policy allows it, it is denied. Groups get people access; dynamic groups and instance/resource principals let workloads authenticate without stored keys. Write policies at the lowest compartment that works, use the least verb that works, and separate admin from operator from read-only.

The IAM model

Modern OCI IAM runs inside Identity Domains - each domain is an isolated identity and access management container with its own users, groups, applications, and security settings (its lineage is Oracle Identity Cloud Service). A tenancy has a Default domain and can have additional domains. Within a domain you have users and groups; across the tenancy you have compartments and policies that reference those groups.

The evaluation is simple to state and important to internalize: a request is allowed only if at least one policy statement permits it; otherwise it is denied. There are no "deny" statements in classic IAM policy - you control access by what you grant and to which compartment. (Deny-style controls come from other layers: Security Zones, quotas, and network rules.)

Users, groups, and dynamic groups

Principal | What it is | How it authenticates | Use for
User | A person or a service identity in a domain | Password + MFA (console), API key, auth token | Humans and, sparingly, service accounts that must have static credentials
Group | A named set of users | n/a - policies target groups | All human access. Never write policies against individual users.
Dynamic group | A set of resources (instances, functions, OKE nodes, DB systems...) matched by rules | Instance/resource principal (no stored credentials) | Letting workloads call OCI APIs without API keys
Federated user | A user authenticated by an external IdP (Entra ID, Okta, AD) | SAML/OIDC SSO, mapped to a domain group | Enterprise SSO - the standard for human access at scale
Common mistake - confusing dynamic groups with user groups

A user group contains people; a dynamic group contains resources (compute instances, functions, autonomous DBs) selected by matching rules. You put a person in a user group. You never put a person in a dynamic group - and you never put an instance in a user group. Policies for workloads must target the dynamic group.

Example dynamic group matching rule (all instances in a compartment):

ALL {instance.compartment.id = 'ocid1.compartment.oc1..aaaa...'}

Policy syntax

A policy is a named collection of statements living in a compartment (or the tenancy). Every statement follows this grammar:

Allow <subject> to <verb> <resource-type> in <location> [where <conditions>]

# subject:        group Admins | dynamic-group AppServers | any-user | group 'DomainName'/'Admins'
# verb:           inspect | read | use | manage
# resource-type:  instances | virtual-network-family | object-family | database-family | all-resources
# location:       tenancy | compartment Prod | compartment Prod:AppTier
# conditions:     where request.region = 'us-ashburn-1'

Where the policy lives matters: a policy attached to a compartment can only grant access to that compartment and its children. Tenancy-level policies can grant anywhere, which is exactly why you should minimize them.
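For policy reviews it can help to break statements into their parts. The sketch below is a simplified parser for the grammar shown above. It is an approximation under stated assumptions: it handles the common `Allow ... to ... in ... [where ...]` shape only, not every form the real policy language accepts, and the function name is hypothetical.

```python
import re

# Simplified pattern for: Allow <subject> to <verb> <resource-type> in <location> [where ...]
STATEMENT = re.compile(
    r"^Allow\s+(?P<subject>.+?)\s+to\s+(?P<verb>inspect|read|use|manage)\s+"
    r"(?P<resource_type>[\w-]+)\s+in\s+"
    r"(?P<location>tenancy|compartment\s+[\w:-]+)"
    r"(?:\s+where\s+(?P<condition>.+))?$",
    re.IGNORECASE,
)


def parse_statement(text):
    """Return the statement's parts as a dict, or raise ValueError."""
    normalized = " ".join(text.split())  # collapse line breaks and indentation
    match = STATEMENT.match(normalized)
    if match is None:
        raise ValueError(f"unrecognized statement: {normalized!r}")
    return match.groupdict()


parsed = parse_statement(
    "Allow group NetAdmins to manage virtual-network-family in tenancy\n"
    "  where request.region = 'us-ashburn-1'"
)
```

A review script could use the parsed fields to flag, for example, every statement whose location is the tenancy and whose verb is manage.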

Verbs and resource types

The four verbs are cumulative - each includes everything in the ones before it, plus more permissions.

Verb | Grants | Typical use | Risk
inspect | List resources (metadata only; often hides sensitive contents) | Auditors, inventory tools | Low
read | inspect + get resource details / contents | Read-only operators, dashboards | Low
use | read + work with existing resources (start/stop, attach, update some attributes) - generally not create/delete | Operators running day-2 tasks | Medium
manage | use + create and delete resources; full control | Admins, automation that provisions | High
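The cumulative ordering of the verbs can be modeled as a simple ranking. This sketch captures only the ordering described in the table; the exact permissions behind each verb differ per resource type, so treat it as a reasoning aid rather than an authorization check.

```python
# Weakest to strongest; each verb includes everything to its left.
VERBS = ("inspect", "read", "use", "manage")


def verb_satisfies(granted, needed):
    """True if the granted verb includes the needed one."""
    return VERBS.index(granted) >= VERBS.index(needed)


def least_verb(needed_verbs):
    """Return the weakest single verb that covers all the needed ones."""
    return max(needed_verbs, key=VERBS.index)


operator_verb = least_verb(["inspect", "use", "read"])
```

For an operator who must list, view, and restart instances, the weakest sufficient grant is use; there is no reason to reach for manage.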

Resource types can be individual (instances, buckets, subnets) or aggregate family types that bundle related resources:

  • virtual-network-family - VCNs, subnets, route tables, security lists, gateways, NSGs, DRGs.
  • database-family - DB systems, databases, backups, Data Guard, etc.
  • object-family - buckets and objects.
  • instance-family - compute instances, images, boot/block volume attachments.
  • all-resources - everything. Use with extreme care (see mistakes).
Architect note
Prefer the narrowest resource type that satisfies the task, and the least verb. "Operators can restart app servers" is use instances in the app compartment - not manage instance-family in tenancy. Every widening you allow is a widening an attacker or a mistake can use.

Conditions in policies

Conditions add a where clause that must be true for the statement to apply. They reference request/target variables.

# Restrict admin actions to a single region
Allow group NetAdmins to manage virtual-network-family in tenancy
  where request.region = 'us-ashburn-1'

# Only allow managing resources tagged for a given cost center
Allow group ProjectX to manage instances in compartment Workloads
  where target.resource.tag.Finance.CostCenter = 'CC-4412'

# Restrict operators to a single API operation via request.operation (fine-grained):
# InstanceAction covers start/stop/reset and cannot terminate the instance
Allow group Operators to use instances in compartment Prod
  where request.operation = 'InstanceAction'
Handy condition variables
request.region, request.operation, request.user.id, request.groups.id, target.resource.tag.<ns>.<key>, request.principal.type. Combine with tags for attribute-based access control.

Instance principals and resource principals

These solve the "how does my code authenticate to OCI without storing keys" problem - the single most important security improvement most teams can make.

Mechanism | The principal is | Used by | How it works
Instance principal | A compute instance | Code running on a VM/BM instance | Instance is a member of a dynamic group; SDK/CLI obtains short-lived credentials from the instance metadata service. No API key on disk.
Resource principal | A managed resource (Function, Data Science notebook, Autonomous DB, etc.) | Serverless / managed services | The service injects a short-lived token the SDK uses automatically. Common in Functions and OKE workload identity.
Security note - prefer principals over stored keys
An API key in a config file, a script, or an environment variable is a long-lived secret that can leak, get committed to git, or outlive the person who made it. Instance/resource principals issue short-lived, automatically rotated credentials scoped by a dynamic group and policy. If a workload runs in OCI and needs to call OCI, it should almost always use a principal, not an API key.
# Let instances in the AppServers dynamic group read objects from one bucket in the Data compartment
Allow dynamic-group AppServers to read objects in compartment Data
  where target.bucket.name = 'app-artifacts'

# CLI using instance principal - no ~/.oci/config keys needed
oci os object list --bucket-name app-artifacts --auth instance_principal
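The same pattern works from application code. The sketch below uses the OCI Python SDK's instance principal signer to list objects in the bucket from the example above. It assumes the `oci` package is installed and that the code runs on a compute instance covered by the dynamic group and policy; the function name is hypothetical, and only the first page of results is returned.

```python
def list_bucket_objects(bucket_name):
    """List object names in a bucket, authenticating as the instance itself."""
    # Imported lazily so this module loads even where the SDK is not installed.
    import oci

    # Obtains short-lived credentials from the instance metadata service;
    # this only works when run on an OCI compute instance.
    signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
    client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

    namespace = client.get_namespace().data
    response = client.list_objects(namespace, bucket_name)
    return [obj.name for obj in response.data.objects]


# Example call on an instance in the AppServers dynamic group:
#   list_bucket_objects("app-artifacts")
```

No config file or API key is read at any point, which is the property that makes this approach preferable to stored credentials.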

Identity domains, federation, SSO, MFA

Identity domains are self-contained IAM stacks in your tenancy. Each has its own users, groups, password policy, MFA settings, sign-on policies, and federation config.

  • Default domain - created with the tenancy; where the initial administrator lives.
  • Additional domains - useful to separate, for example, employees from external partners, or Prod-admin identities from Dev identities, each with their own MFA/sign-on rules. Domains also come in types/licensing tiers (Free, Oracle Apps, Premium, External User) that affect available features - verify current tiers.
  • Federation / SSO - connect an external IdP (Microsoft Entra ID, Okta, ADFS, Ping) via SAML or OIDC so users sign in with corporate credentials. Map IdP groups to domain groups; policies reference the domain groups.
  • MFA - enforced via sign-on policies in the domain. Require MFA for all human users, especially administrators. Exempt only automated service identities that use API keys/principals, not passwords.
Architect note - plan domains early
Deciding to split identities into multiple domains after you have policies, federation, and MFA configured against the default domain is disruptive - group references in policies use the domain-qualified name (Allow group 'DomainName'/'GroupName' to ...). Decide your domain model (single default vs. multi-domain) alongside your compartment model, before production identities exist.
Common mistake
Standing up local OCI users for every engineer instead of federating to the corporate IdP. Local users mean separate lifecycle (joiners/movers/leavers), separate MFA, and passwords that outlive employment. Federate human access; reserve local users for break-glass and specific service needs.

Credential types

Credential | Used for | Notes / risk
Console password + MFA | Human Console sign-in | Federate + enforce MFA. Rotate per policy.
API signing key (PEM) | CLI/SDK/Terraform as a user | Long-lived. High risk if leaked. Prefer instance/resource principals where possible; rotate and scope tightly otherwise.
Auth token | Basic-auth style access (e.g. some Git/registry, Swift/RMAN to Object Storage) | Password-equivalent. Store in Vault, not scripts.
Customer secret key | S3-compatible Object Storage access (access key/secret) | For tools that speak the S3 API. Treat like AWS keys.
OAuth 2.0 client credentials | App-to-app / confidential apps in a domain | Managed as domain applications; scope to least privilege.
Database passwords / wallets | DB connectivity (e.g. ADB mTLS wallet) | Store wallets/secrets in Vault; never in application images.
Security note - break-glass users
Keep one or two break-glass local admin users (not federated), with very strong unique passwords and MFA, credentials sealed in your enterprise secrets vault, and every login alarmed via Audit + Events. They exist so you can still administer OCI if the IdP/federation is down. Do not use them for daily work; monitor them heavily.

Real policy examples

Read-only auditor across the tenancy (Low risk)
Allow group Auditors to inspect all-resources in tenancy
Allow group Auditors to read audit-events in tenancy

Allows: listing and viewing audit trails and resource metadata everywhere; no changes, and inspect hides many secret contents. Where: tenancy (auditors legitimately need breadth). Risk: low, but still scope to inspect/read only. Safer alternative: if auditors only cover certain BUs, scope to those compartments instead of tenancy.

App operators can start/stop instances in Prod only (Medium risk)
Allow group AppOperators to read instance-family in compartment Prod
Allow group AppOperators to use instances in compartment Prod
  where request.operation = 'InstanceAction'

Allows: viewing instances and starting/stopping/resetting them, but not creating or terminating. Where: the Prod compartment (not tenancy). Risk: medium - stop can cause outage; scoping to InstanceAction prevents delete. Safer alternative: further restrict with a tag condition so operators only touch app-tier instances, not databases.

Workload reads a bucket via instance principal (Low risk)
Allow dynamic-group AppServers to read objects in compartment Data
  where target.bucket.name = 'app-config'

Allows: only the matched instances, only read, only that one bucket. Where: the Data compartment. Risk: low - tightly scoped and no stored keys. This is the pattern to imitate: dynamic group + least verb + resource condition.

Delegated compartment admin for a business unit (Higher risk)
Allow group BU_A_Admins to manage all-resources in compartment BU-A

Allows: full control, but only inside the BU-A compartment subtree - not the rest of the tenancy. Where: the BU-A compartment. Risk: higher (manage all-resources), but blast radius is contained to one compartment. Safer alternative: split into role-specific groups (network admin, DB admin, compute admin) even within the BU so no single group holds everything. Pair with quotas and a Security Zone.

Common OCI IAM mistakes

  • manage all-resources too broadly - especially at tenancy level. This is effectively tenancy admin. Scope to a compartment and split by role.
  • Tenancy-level policies by default - a policy in the tenancy grants everywhere. Put policies in the lowest compartment that works.
  • Confusing dynamic groups with user groups - workloads authenticate via dynamic groups; people via user groups. Mixing them either fails or over-grants.
  • No separation of admin / operator / read-only - one "everyone" group with manage rights removes least privilege and makes audit meaningless.
  • Not planning identity domains - retrofitting a multi-domain model after federation and policies exist is painful.
  • Storing API keys instead of using instance/resource principals - long-lived keys leak; principals are short-lived and scoped.
  • Policies against individual users - unmanageable and invisible in group-based reviews. Always target groups.
  • Leaving the default tenancy administrator group over-used - reserve it for break-glass; do daily work as scoped roles.

IAM troubleshooting

⚑ "Authorization failed or requested resource not found"

Symptoms

A user or workload gets a 404/authorization error even though the resource exists.

Likely causes

  • No policy grants the action (default deny).
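  • Policy is attached to or scoped to the wrong compartment: a statement covers the compartment it names plus that compartment's children, so it never reaches a sibling or a parent, and a policy attached to a child compartment cannot grant on resources outside that subtree.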
  • Verb too weak (read where use/manage is needed) or wrong resource-type/family.
  • For workloads: the instance is not in the dynamic group, or the matching rule does not match, or the dynamic-group policy is missing.
  • A where condition (region, tag, operation) is not satisfied.
  • Wrong identity domain - group referenced without the domain qualifier.

Checks to perform

  • Confirm the user's group membership and the group's policies (Identity > Domains > Groups; Identity > Policies).
  • Trace the compartment path of the target resource and confirm a policy covers that compartment or an ancestor.
  • For dynamic groups: verify the instance OCID matches the rule (Identity > Domains > Dynamic groups) and that a policy grants the dynamic group.
  • Check conditions: is the request in the allowed region? Does the target carry the required tag?

Console path

Identity & Security > Policies (and Domains > Users / Groups / Dynamic groups). Use the tenancy Audit log to see the denied request details.

CLI

oci iam policy list --compartment-id <compartment-ocid> --all
oci iam group list-users --compartment-id <tenancy-ocid> --group-id <group-ocid>
oci iam dynamic-group get --dynamic-group-id <dg-ocid>

Fix options

Add the least-privilege statement at the correct compartment; fix the dynamic-group rule; correct the verb/resource-type; relax or correct the condition.

Prevention

Adopt a policy naming and location standard, review policies in code (Terraform), and keep a "who can do what" matrix per compartment.
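
The "verb too weak" cause above can be reasoned about as a simple ordering check. This is a minimal sketch, assuming only the four OCI policy verbs in increasing order of privilege (inspect, read, use, manage); the function names are hypothetical, and the verb a given API operation actually requires must be taken from the policy reference for that resource type.

```python
# OCI policy verbs, weakest to strongest.
VERBS = ["inspect", "read", "use", "manage"]


def verb_satisfies(granted, required):
    """Return True when the granted verb is at least as strong as the required one."""
    return VERBS.index(granted.lower()) >= VERBS.index(required.lower())


def diagnose(granted, required):
    """Explain the outcome; granted=None models 'no matching statement' (default deny)."""
    if granted is None:
        return "denied: no statement grants this (default deny)"
    if not verb_satisfies(granted, required):
        return f"denied: '{granted}' is weaker than required '{required}'"
    return "allowed"


if __name__ == "__main__":
    print(diagnose("read", "use"))
    print(diagnose("manage", "use"))
    print(diagnose(None, "inspect"))
```

The sketch deliberately ignores compartment scope and where conditions; those are separate checks in the list above.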

3. Networking Deep Dive

The VCN is the foundation every other OCI service plugs into. This section covers CIDR planning, subnets, gateways, security rules, hybrid connectivity, DNS, and the traffic-flow reasoning you need to design and debug real networks.

Last reviewed: July 2026 Verify gateway limits, VPN specs, and FastConnect options in current docs.
TL;DR

A VCN is your private, software-defined network in a region with a CIDR you choose. Inside it you create subnets (regional or AD-specific, public or private). Traffic leaves the VCN only through a gateway - Internet (IGW), NAT, Service (to OCI services privately), DRG (to on-prem and other VCNs), or peering. Two rule layers govern packets: security lists (subnet-wide) and NSGs (per-VNIC), both stateful by default. Plan your CIDR so it never overlaps with on-premises or other regions - it is the hardest thing to change later.

VCN and CIDR planning

A Virtual Cloud Network is a regional, private network. You assign it one or more CIDR blocks (RFC 1918 private ranges are standard; you can also use public ranges you own). Everything - instances, load balancers, databases, mount targets, private endpoints - gets an IP from a subnet inside the VCN.

  • A VCN can have multiple CIDR blocks, added after creation, which helps when you outgrow the first block. But you cannot shrink or trivially renumber - plan generously.
  • VCN CIDRs and subnet CIDRs must not overlap with each other, with peered VCNs, or with your on-premises networks. Overlap is the number-one cause of hybrid connectivity that "connects but cannot route."
  • Reserve non-overlapping space for every region and every environment you might ever run, plus DR. Treat it like an enterprise IP address plan, because it is one.
Architect note - a workable CIDR scheme
Carve a large private supernet for OCI (e.g. 10.0.0.0/8 subdivided), then allocate a /16 per region, a /20 per VCN/environment, and /24s per subnet tier. Keep a documented IPAM spreadsheet. Leave gaps. The cost of a too-large plan is zero; the cost of overlap is a re-IP project during a migration.
Common mistake
Using 10.0.0.0/16 for the first VCN because it is the console default, then discovering on-premises already uses 10.0.x.x. Now FastConnect/VPN is up but nothing routes. Choose CIDRs against your existing enterprise IP plan on day one.
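
The CIDR scheme and the overlap check described above can be sketched with Python's standard ipaddress module. This is an illustration under stated assumptions, not a planning tool: the ON_PREM list stands in for an export from your enterprise IPAM, and the helper names are hypothetical.

```python
import ipaddress

# Hypothetical on-premises ranges, as if exported from an enterprise IPAM.
ON_PREM = ["10.0.0.0/16", "10.200.0.0/16"]


def overlaps(candidate, existing):
    """Return the existing ranges that overlap the candidate CIDR."""
    cand = ipaddress.ip_network(candidate)
    return [e for e in existing if cand.overlaps(ipaddress.ip_network(e))]


def first_free(parent, new_prefix, taken):
    """Return the first /new_prefix block inside parent that overlaps nothing in taken."""
    for block in ipaddress.ip_network(parent).subnets(new_prefix=new_prefix):
        if not any(block.overlaps(ipaddress.ip_network(t)) for t in taken):
            return block
    raise ValueError("no free block of that size in " + parent)


def carve(parent, new_prefix, count):
    """Return the first `count` /new_prefix blocks inside parent."""
    blocks = ipaddress.ip_network(str(parent)).subnets(new_prefix=new_prefix)
    return [next(blocks) for _ in range(count)]


if __name__ == "__main__":
    print(overlaps("10.0.0.0/16", ON_PREM))        # the console default collides
    region = first_free("10.0.0.0/8", 16, ON_PREM)  # /16 per region
    vcn = carve(region, 20, 1)[0]                   # /20 per VCN/environment
    tiers = carve(vcn, 24, 3)                       # /24 per subnet tier
    print(region, vcn, [str(t) for t in tiers])
```

Running the same overlap check against every peered VCN and every planned DR region before creating anything is the cheap version of the IPAM discipline recommended above.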

Subnets

Subnet property | Options | Guidance
Scope | Regional (spans all ADs) or AD-specific | Prefer regional subnets - simpler HA, resources can land in any AD/FD. AD-specific subnets are legacy/niche.
Public vs private | Public = resources can have public IPs; Private = private IPs only | Default to private. Only front-facing load balancers and bastions belong in public subnets.
Route table | One per subnet | Determines which gateway off-VCN traffic uses. Public subnet → IGW; private subnet → NAT/Service/DRG.
Security lists | Zero or more per subnet | Subnet-wide stateful rules. Combine with NSGs.
DHCP options | One per subnet | Controls DNS resolver and search domain handed to instances.
"Public subnet" does not mean "has a public IP"
A subnet being public only means resources may have a public IP. An instance in a public subnet with no public IP assigned and no IGW route is not reachable from the internet. Reachability = public IP and an IGW route and security rules allowing it. Missing any one blocks it.

Gateways - the only ways out of a VCN

Gateway | Direction / purpose | Public IP? | Typical route target for
Internet Gateway (IGW) | Bidirectional internet for resources with public IPs | Yes | Public subnets (LB, bastion)
NAT Gateway | Outbound-only internet for private resources (patching, external APIs) | Uses OCI-managed public IP | Private subnets needing egress
Service Gateway | Private access to OCI services (Object Storage, ADB, etc.) without internet | No | Private subnets reaching OCI services
Dynamic Routing Gateway (DRG) | On-premises (VPN/FastConnect) and VCN-to-VCN / cross-region routing hub | No | Hybrid + hub-and-spoke
Local Peering Gateway (LPG) | VCN-to-VCN peering in the same region (legacy; DRG now preferred) | No | Same-region VCN peering
Remote Peering Connection (RPC) | DRG-to-DRG peering across regions | No | Cross-region VCN connectivity
Architect note - Service Gateway is not optional
Without a Service Gateway, a private instance reaching Object Storage (or backing up a database to it) has to go out through a NAT Gateway to the service's public endpoint - a less direct, less private path that can also raise egress considerations. The Service Gateway keeps that traffic on the OCI backbone and off the internet. Create one on the VCN, add a route to it in every private subnet that talks to OCI services, and pick the narrowest service CIDR label that works ("OCI <region> Object Storage" rather than "All <region> Services in Oracle Services Network") for least exposure.
Common mistake
Putting a NAT Gateway route as 0.0.0.0/0 in a private subnet and assuming Object Storage traffic is "private." It is not - it egresses to the internet-facing Object Storage endpoint. Add a Service Gateway and a route for the OCI services CIDR label so that traffic stays on the backbone; keep NAT only for genuine internet destinations.

Security lists vs. Network Security Groups

Aspect | Security List | Network Security Group (NSG)
Applies to | Every VNIC in the subnet | Only VNICs you add to the NSG
Granularity | Subnet-wide (coarse) | Per-workload / per-tier (fine)
Rule source/dest | CIDR, service CIDR | CIDR, service CIDR, or another NSG
Stateful? | Yes (can also be stateless) | Yes (can also be stateless)
Best for | Baseline subnet rules (e.g. allow intra-VCN) | App-tier-to-DB-tier rules by group, not IP

Both are evaluated. A packet is allowed if either the applicable security lists or the NSGs permit it (they are additive for allows; there is no deny rule - you allow what you need and everything else is implicitly denied). Effective rules = union of all security lists on the subnet + all NSGs on the VNIC.
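
The additive behavior can be made concrete with a small model. This is a simplified sketch, assuming ingress rules keyed only on source CIDR, protocol, and a single port; NSG-as-source rules, port ranges, and stateless rules are not modeled, and the class and function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    source_cidr: str
    protocol: str                # "tcp", "udp", "icmp", or "all"
    port: Optional[int] = None   # None means any port


def rule_allows(rule, src_ip, protocol, port):
    """True when this single allow rule matches the packet."""
    in_source = ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source_cidr)
    proto_ok = rule.protocol in ("all", protocol)
    port_ok = rule.port is None or rule.port == port
    return in_source and proto_ok and port_ok


def ingress_allowed(security_list_rules, nsg_rules, src_ip, protocol, port):
    """Effective rules are the union of both layers; no match means implicit deny."""
    effective = list(security_list_rules) + list(nsg_rules)
    return any(rule_allows(r, src_ip, protocol, port) for r in effective)


# DB-tier NSG: only the app subnet may reach the listener port.
TIGHT_NSG = [IngressRule("10.10.2.0/24", "tcp", 1521)]
# A broad subnet-wide baseline that allows everything inside the VCN.
PERMISSIVE_LIST = [IngressRule("10.10.0.0/20", "all")]

if __name__ == "__main__":
    # A host in the public subnet probing the DB listener:
    print(ingress_allowed([], TIGHT_NSG, "10.10.1.5", "tcp", 1521))
    print(ingress_allowed(PERMISSIVE_LIST, TIGHT_NSG, "10.10.1.5", "tcp", 1521))
```

The second call returns a different answer from the first purely because of the security list, which is why both layers need auditing together.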

Architect note - prefer NSGs
NSGs let you write rules like "app-tier NSG may reach db-tier NSG on 1521" without hardcoding IPs. As instances scale, membership updates automatically. Use security lists for a thin subnet-wide baseline (e.g. allow ICMP path MTU, allow intra-VCN) and put the real application rules in NSGs referencing other NSGs.
Common mistake - stateful vs stateless confusion
Mixing a stateless rule in one direction with a stateful rule in the other creates asymmetric behavior and dropped return traffic. Keep rules stateful unless you have a specific reason (very high connection rates) and, if stateless, define matching rules for both directions.

IPs, VNICs, and secondary addresses

Object | What it is | Notes
Private IP | An IP from the subnet CIDR on a VNIC | Primary private IP is fixed for the VNIC's life; secondaries can move.
Ephemeral public IP | Temporary public IP tied to a private IP/instance lifecycle | Released when the instance/VNIC is deleted. Cheapest for transient needs.
Reserved public IP | A public IP you own independent of any instance | Survives instance deletion; re-map to another resource. Use for stable ingress/whitelisting.
VNIC (primary) | The instance's first network interface | Cannot be removed; determines the instance's primary subnet.
Secondary VNIC | Additional interface, can be in a different subnet | Multi-homing (e.g. app + management network). Requires OS-level config.
Secondary private IP | Extra private IP on a VNIC | Used for IP failover / floating VIPs across instances.
DBA note - VIPs and secondary IPs
For custom HA (a floating virtual IP moved between two DB or app nodes), OCI uses a secondary private IP that you unassign from the failed node and assign to the surviving node. Grid Infrastructure/RAC on Base/Exadata handles its own VIP/SCAN addressing; for hand-rolled active/passive, plan the secondary-IP failover and the IAM permission for the failover automation to move it.

DNS and DHCP

  • VCN Resolver - each VCN has a built-in DNS resolver at 169.254.169.254. It resolves internal hostnames and forwards public queries.
  • Private DNS zones/views - create private zones for custom internal names, and use the resolver's endpoints/forwarding rules to integrate with on-premises DNS (conditional forwarding both ways for hybrid name resolution).
  • DHCP options - per-subnet; controls whether instances use the VCN resolver or a custom resolver, and the search domain. Point at custom resolvers when integrating enterprise DNS.
Common mistake
Hybrid apps that "cannot resolve" on-prem hostnames because the subnet's DHCP options still use the default VCN resolver with no forwarding rule to the corporate DNS. Set up the DNS resolver endpoints and conditional forwarders in both directions.

How traffic flows in OCI

For any packet leaving an instance, OCI decides the path in this order:

  1. Is the destination inside the VCN? If yes, it routes locally (no gateway) - only security rules apply. Intra-VCN routing is automatic.
  2. If outside the VCN, the subnet's route table is consulted for the most specific matching rule → picks a gateway (IGW, NAT, Service, DRG, LPG).
  3. Security rules (security lists + NSGs on the VNIC) must allow the egress, and the return path must be allowed (stateful handles the return automatically).
  4. At the destination side, the same security-rule check happens for ingress.

Debugging almost always comes down to three questions: Is there a route to the right gateway? Do the security rules allow it both ends? Is there a return path?
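
Steps 1 and 2 above (local delivery first, then the most specific route rule) can be sketched as a longest-prefix match. This is a minimal model, not OCI's implementation: the rule list and target names are illustrative, and 192.0.2.0/24 (a documentation range) stands in for the Service Gateway's service CIDR label, which in a real route table is a label rather than a literal CIDR.

```python
import ipaddress


def next_hop(dest_ip, vcn_cidrs, route_rules):
    """Return 'local', the target of the most specific matching rule, or None."""
    dest = ipaddress.ip_address(dest_ip)
    # Step 1: destinations inside the VCN route locally, no gateway involved.
    if any(dest in ipaddress.ip_network(c) for c in vcn_cidrs):
        return "local"
    # Step 2: among matching rules, the longest prefix wins.
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_rules
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the packet is dropped
    return max(matches, key=lambda m: m[0].prefixlen)[1]


VCN = ["10.10.0.0/20"]
PRIVATE_SUBNET_RULES = [
    ("0.0.0.0/0", "NAT"),
    ("10.200.0.0/16", "DRG"),   # on-premises range
    ("192.0.2.0/24", "SGW"),    # placeholder for the OCI services label
]

if __name__ == "__main__":
    for ip in ["10.10.3.7", "10.200.4.4", "192.0.2.10", "8.8.8.8"]:
        print(ip, "->", next_hop(ip, VCN, PRIVATE_SUBNET_RULES))
```

Removing the Service Gateway rule from the list makes the third lookup fall through to NAT, which is the failure mode described in the gateways section.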

Reference diagrams

Three-tier architecture (public LB, private app, private DB)

[Diagram] VCN 10.10.0.0/20. Internet → IGW → public subnet 10.10.1.0/24 (public load balancer) → private app subnet 10.10.2.0/24 (NSG: app; App VMs in FD1 and FD2) → private DB subnet 10.10.3.0/24 (NSG: db, allow 1521 from app NSG; DB nodes in FD1 and FD2). Service GW and NAT GW attached to the VCN.
Public LB in a public subnet fronts private app and DB tiers; egress via NAT, OCI services via Service Gateway. HA across fault domains.

Hub-and-spoke with DRG

[Diagram] DRG (hub, central routing) with attachments to: Hub VCN (shared services: firewall, bastion, DNS); Spoke: Prod VCN (apps + DB); Spoke: NonProd VCN (dev/test); On-premises (via FastConnect/VPN); Other region (via RPC).
DRG as the central hub: spoke VCNs, on-premises, and other regions all attach to one DRG. Inspect east-west traffic through a firewall in the hub VCN.

Service Gateway to Object Storage (private backend/backups)

[Diagram] Private subnet (no internet route): DB / App VM → Service Gateway (route: OSN CIDR) → Object Storage (OCI backbone, no internet).
RMAN backups and object reads travel the OCI backbone via the Service Gateway - no NAT, no internet exposure.

On-premises to OCI hybrid (VPN + FastConnect)

[Diagram] Data center (CPE router, 10.200.0.0/16) → FastConnect (primary, private) and IPSec VPN (backup) → DRG → OCI VCN 10.10.0.0/16 (no CIDR overlap!).
FastConnect as the private primary link, IPSec VPN as encrypted backup, both terminating on the DRG. On-prem and OCI CIDRs must not overlap.

Hybrid connectivity: VPN vs FastConnect

Aspect | Site-to-Site VPN (IPSec) | FastConnect
Path | Over the public internet, encrypted | Private, dedicated connection (via partner or colo)
Bandwidth | Limited per tunnel; multiple tunnels for HA | 1/10/100 Gbps port options
Latency/jitter | Variable (internet) | Consistent, low
Setup time | Minutes | Days-weeks (physical/partner provisioning)
Use as | Quick start, or encrypted backup to FastConnect | Primary enterprise link, production DB replication, large migration
Architect note
The common enterprise pattern is FastConnect primary + VPN backup, both on the DRG, with routing (BGP) preferring FastConnect and failing over to VPN. Start on VPN during a project's early phase, then cut FastConnect in as it provisions. Design routing so failover is automatic and tested.

Flow logs, path analysis, and packet capture

Tool | What it gives you | Use when
VCN Flow Logs | Accepted/rejected connection records per subnet/VNIC (into the Logging service) | Auditing, "is my rule dropping this?", security forensics
Network Path Analyzer | Static analysis of whether a path from A to B is allowed, listing the rules/routes that permit or block it | Before deploying, or first step when "cannot reach" - it tells you the exact blocking rule
VTAP (Virtual Test Access Point) | Mirrors VNIC traffic to a capture target for deep packet inspection | IDS/IPS, deep debugging, compliance capture
Instance-side capture | tcpdump on the instance OS | Confirming what actually arrives at the host
Start with Network Path Analyzer
Before you SSH around checking rules by hand, run the Network Path Analyzer for the source/destination/port. It evaluates route tables, security lists, NSGs, and gateways and tells you the first rule that blocks the path - turning a 30-minute hunt into a 30-second answer.

Network Firewall and Web Application Firewall

  • Network Firewall - a managed, OCI-native firewall (Palo Alto-based) you place in a hub VCN to inspect north-south and east-west traffic: stateful filtering, IPS/IDS, URL filtering, TLS inspection. Route spoke traffic through it via the DRG hub.
  • Web Application Firewall (WAF) - Layer-7 protection (OWASP rules, bot management, rate limiting) applied in front of public HTTP endpoints, either as a WAF policy attached to a load balancer or as an edge policy. Covered further in section 8.

Networking troubleshooting

⚑ Instance cannot reach the internet

Likely causes & checks

  • Private subnet, no egress: needs a route 0.0.0.0/0 to a NAT Gateway (or IGW if it has a public IP). Check the subnet's route table.
  • Public subnet but no public IP: assign an ephemeral/reserved public IP, and confirm a 0.0.0.0/0 route to the IGW.
  • Security rules: egress must allow the destination (often 0.0.0.0/0 on 443); return traffic is automatic if stateful.
  • OS firewall: iptables/firewalld on the instance may block it - Oracle Linux images ship with firewall rules.

Console path

Networking > VCN > Subnet > Route Table / Security Lists; Instance > Attached VNICs > public IP.

CLI

oci network route-table get --rt-id <ocid>
oci network security-list get --security-list-id <ocid>
# on the instance:
curl -s https://ifconfig.me ; sudo firewall-cmd --list-all

Fix / prevention

Add NAT route for private egress; standardize a subnet template (route + security rules) in Terraform so every subnet is consistent.

⚑ Instance cannot reach Object Storage privately

Likely causes & checks

  • No Service Gateway on the VCN, or subnet route table has no route to it for the OCI services CIDR label.
  • Service Gateway configured for the wrong service CIDR label (it must be "OCI <region> Object Storage" or "All <region> Services in Oracle Services Network").
  • Security rules do not allow egress to the service CIDR.
  • Using the wrong endpoint - use the regional Object Storage endpoint that the Service Gateway serves.

Console path

Networking > VCN > Service Gateway; then the subnet Route Table.

Fix / prevention

Create the Service Gateway, add a route rule (target = Service Gateway, destination = the services CIDR label), and an egress security rule to that label. Bake this into your standard private-subnet module.

⚑ On-premises cannot reach OCI

Likely causes & checks

  • CIDR overlap between on-prem and the VCN - routing is ambiguous. This is the most common root cause.
  • DRG route table / route distribution not advertising the VCN CIDR, or on-prem BGP not advertising its routes.
  • VPN tunnel down (Phase 1/2 mismatch) or FastConnect BGP session down.
  • Security lists/NSGs not allowing the on-prem CIDR.
  • On-prem firewall blocking return traffic.

Console path

Networking > Dynamic Routing Gateways > DRG > Route Tables / Attachments; Site-to-Site VPN > Tunnel status.

Fix / prevention

Resolve overlap (re-IP or NAT), confirm route advertisement both ways, verify tunnel/BGP state. Document the IP plan to prevent future overlap.

⚑ Load balancer health check is failing

Likely causes & checks

  • Backend subnet's security rules do not allow the health-check probe from the load balancer subnet/NSG on the backend port.
  • Health check configured for the wrong port/path/protocol vs. what the app serves.
  • Backend app not listening on the expected port, or bound to 127.0.0.1 instead of 0.0.0.0.
  • OS firewall on the backend dropping the probe.

Fix / prevention

Allow the LB source in the backend NSG on the health-check port; align health-check path/port with the app; confirm the app binds to all interfaces. See section 7 for the full LB troubleshooting flow.

⚑ DNS resolution issue

Likely causes & checks

  • Subnet DHCP options point at a resolver that cannot resolve the name (e.g. custom resolver without a forwarder).
  • Hybrid: no conditional forwarding between the VCN resolver and corporate DNS.
  • Private DNS zone missing the record, or wrong view attached to the VCN.

CLI / checks

nslookup db.internal.example.com 169.254.169.254
cat /etc/resolv.conf   # confirm which resolver the OS uses

Fix / prevention

Fix DHCP options, add resolver endpoints + forwarding rules for hybrid, and put internal records in a private zone attached to the VCN's resolver view.

⚑ Security list / NSG or route table blocking traffic

Fast method

Run Network Path Analyzer for source, destination, protocol, and port. It reports the exact security rule or missing route causing the block. Then enable VCN Flow Logs and look for REJECT records to confirm the packet is actually being dropped at that subnet/VNIC.

Common route-table gotcha

Route rules are matched most-specific-first. Having both a 0.0.0.0/0 route to NAT and a more specific route to a Service Gateway is correct; if the specific route is missing, OCI-service traffic goes out the NAT by mistake. For DRG route propagation issues, check the DRG route table import/export and that the attachment advertises the CIDR.

OCI Networking gotchas

  • CIDR overlap is the cardinal sin - plan against your enterprise IPAM, leave room, never reuse on-prem ranges.
  • "Public subnet" != reachable. Reachability needs public IP + IGW route + security rules, all three.
  • Service Gateway forgotten - Object Storage/ADB traffic silently goes over NAT/internet. Always add it for private subnets.
  • Security list + NSG are additive - a permissive security list can undo the tight NSG you thought was protecting a VNIC. Audit both.
  • Stateless rules asymmetry - keep rules stateful unless you have a measured reason not to.
  • Regional vs AD subnets - use regional subnets; AD-specific ones complicate HA and are rarely needed now.
  • DRG vs LPG - LPG is legacy; use the upgraded DRG (DRG v2) as the routing hub for peering and hybrid.
  • Route table is per subnet - a change in one subnet's route table does not affect others; inconsistency between subnets causes "works here, not there."
  • OS firewall - Oracle Linux images enforce their own firewall; a perfect VCN config still fails if firewalld blocks the port.
  • Egress data transfer - internet egress is metered; keep OCI-service traffic on the Service Gateway to avoid unnecessary internet-path cost and exposure.

4. Compute Deep Dive

Shapes, images, placement, scaling, and the operational patterns for running application and database compute in OCI - including how to pick shapes and how licensing interacts with instance choice.

Last reviewed: July 2026 Shape names/specs change often - verify current shapes in the Console.
TL;DR

OCI compute comes as VMs or bare metal. Modern shapes are flexible - you dial OCPUs and memory independently. One OCPU = one physical core (two vCPU threads on x86), which matters a lot for Oracle licensing. Use fault domains and instance pools + autoscaling for HA and elasticity, instance configurations as templates, and instance principals so instances call OCI APIs without stored keys.

Shapes: OCPU, memory, flexible

  • OCPU vs vCPU: OCI historically measures CPU in OCPUs. One OCPU = one physical core = two hardware threads (vCPUs) on x86. Oracle Database licensing counts cores, so 1 OCPU generally corresponds to the licensing unit for a core (subject to the core factor). Newer shapes may also be expressed in vCPUs - confirm the unit for the exact shape.
  • Flexible shapes (e.g. the E-series AMD, and Ampere Arm A-series) let you choose OCPU count and memory GB independently, within per-OCPU memory ranges. You pay for what you allocate.
  • Fixed shapes come in preset sizes (older standard shapes, some bare metal, GPU shapes).
  • Processor families: AMD EPYC (E-series), Intel Xeon (X/Standard), Ampere Arm (A-series, strong price/performance for scale-out and cloud-native), and NVIDIA GPU shapes for AI/ML.
DBA note - OCPUs and Oracle licensing
Because 1 OCPU = 1 core, your Oracle Database license consumption on IaaS compute is driven directly by the OCPU count you allocate (with the OCI core factor policy - verify the current policy). Right-sizing OCPUs is a licensing decision, not just a performance one. Bare metal and dedicated hosts give you control over the physical boundary for hard-partitioning/licensing arguments - discuss with Oracle LMS before relying on it.
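
When sizing from a vCPU-based inventory (another cloud, or a hypervisor report), the OCPU conversion is simple arithmetic. This is a minimal sketch assuming the x86 ratio stated above (1 OCPU = 2 vCPU threads); the function names are hypothetical, and the ratio parameter should be set from the documentation for the exact shape. It says nothing about license counts, which depend on the current Oracle licensing policy.

```python
import math


def ocpus_to_vcpus(ocpus, threads_per_ocpu=2):
    """Hardware threads exposed by a given OCPU allocation (x86 default ratio)."""
    return ocpus * threads_per_ocpu


def vcpus_to_ocpus(vcpus, threads_per_ocpu=2):
    """Smallest whole OCPU count that covers a vCPU requirement."""
    return math.ceil(vcpus / threads_per_ocpu)


if __name__ == "__main__":
    print(ocpus_to_vcpus(4))   # threads visible to the OS on a 4-OCPU x86 VM
    print(vcpus_to_ocpus(7))   # a 7-vCPU source VM rounds up
```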

Bare metal vs virtual machines

Aspect | Virtual Machine | Bare Metal
Tenancy | Shared host, isolated VM | Entire physical server, single tenant
Hypervisor overhead | Minimal (off-box virtualization) | None - you get the whole box
Use for | Most apps, middleware, web, smaller DBs | Highest performance, large DBs, licensing isolation, specialized workloads
Live migration | Supported for many VM shapes during infra maintenance | Not applicable - you manage maintenance windows/reboot migration
Cost model | Per-OCPU/hour, fine-grained | Whole-server; higher floor, better for dense/large

Dedicated hosts, capacity reservation, preemptible

Option | What it does | Use when
Dedicated VM Host | Your VMs run on a physical host reserved to you (no other tenants) | Compliance/licensing isolation while keeping VM flexibility
Capacity Reservation | Reserves capacity of a shape in an AD so it is guaranteed available when you launch | Guaranteeing DR failover capacity, large launches, scale-out headroom
Preemptible instances | Cheaper VMs OCI can reclaim with short notice | Fault-tolerant batch, stateless workers, CI - never for stateful/production-critical
Burstable instances | Baseline fraction of an OCPU with bursting | Low-average, spiky small workloads (bastions, light services)
Cost note - reserve DR capacity deliberately
A pilot-light/warm-standby DR plan assumes capacity will be available in the DR region when you fail over. During a large regional event, popular shapes can be constrained. Capacity Reservations guarantee it - at a cost. Decide per tier whether the RTO justifies paying for reserved DR capacity vs. accepting best-effort.

Images, custom images, and cloud-init

  • Platform images - Oracle Linux, and other OSes maintained by Oracle.
  • Custom images - capture a configured instance as an image for repeatable launches (golden images).
  • Bring Your Own Image (BYOI) - import a supported OS image; must meet OCI's paravirtualized/emulated driver requirements.
  • cloud-init - pass a startup script (user data) that runs on first boot to configure the instance (install agents, mount volumes, join config management). The standard way to bootstrap without baking everything into an image.
  • Instance metadata service - at 169.254.169.254, exposes instance metadata and is how instance principals fetch credentials. Restrict access to it inside the OS where appropriate.
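#cloud-config
# Example cloud-init passed as user_data (base64) to bootstrap an app host.
# "#cloud-config" must be the very first line, or cloud-init will not treat this as cloud-config.
package_update: true
# Assumes the repository that provides oracle-instantclient-basic is already enabled on the image.
packages: [oracle-instantclient-basic, jq]
runcmd:
  # Assumes a "myapp" systemd unit and the script below are baked into the image.
  - [ systemctl, enable, --now, myapp ]
  - [ /opt/app/register-with-lb.sh ]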
Architect note - golden image + cloud-init
Bake slow-changing things (OS hardening, agents, base packages) into a custom image; use cloud-init for fast-changing config (app version, environment wiring). This keeps launches fast and reproducible and is what instance pools/autoscaling need to bring nodes up identically.
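
Because user data is passed base64-encoded, a small pre-flight check in the launch tooling can catch a malformed header before it reaches an instance. This is a sketch under stated assumptions: the helper name is hypothetical, the sample document is illustrative, and the only rule enforced is that cloud-config user data begins with the "#cloud-config" line.

```python
import base64

# Illustrative cloud-config; "myapp" is a hypothetical service name.
CLOUD_CONFIG = """#cloud-config
package_update: true
packages: [jq]
runcmd:
  - [ systemctl, enable, --now, myapp ]
"""


def encode_user_data(text):
    """Validate the cloud-config header, then return base64 text for user_data."""
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line != "#cloud-config":
        raise ValueError("cloud-config user data must start with '#cloud-config'")
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    encoded = encode_user_data(CLOUD_CONFIG)
    print(encoded[:40] + "...")
```

The encoded string is what goes into the instance configuration's user_data metadata, so the same check can run in CI against the template that instance pools launch from.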

Placement, fault domains, and HA

High availability for compute is about anti-affinity: never put both halves of an HA pair on the same failure unit.

  • In a multi-AD region: spread across ADs for data-center-level resilience, and across FDs within each AD.
  • In a single-AD region: spread across the three fault domains - that is your in-region HA. DR to another region covers AD/region loss.
  • Instance pools distribute instances across FDs/ADs automatically per the placement configuration.
  • Live migration: for supported VM shapes, OCI can live-migrate your VM off hardware needing maintenance, avoiding a reboot; some events still require a reboot-migration you schedule. Bare metal you always manage yourself.
Common mistake
Launching two "HA" app nodes and letting OCI place both in the same fault domain (or not checking). One rack/maintenance event takes out both. Explicitly set fault-domain placement (or use an instance pool with a spread policy) and verify it.
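
Verifying anti-affinity is easy to automate once you have each node's fault domain (for example from the instance details). This is a minimal sketch: the placement dictionaries are hand-written stand-ins for real inventory data, the helper names are hypothetical, and the round-robin function is only one simple way to spread nodes.

```python
from collections import Counter

FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def assign_round_robin(instance_names):
    """Spread instances over the three fault domains in order."""
    return {
        name: FAULT_DOMAINS[i % len(FAULT_DOMAINS)]
        for i, name in enumerate(instance_names)
    }


def colocated(placement):
    """Return the fault domains that host more than one member of the HA group."""
    counts = Counter(placement.values())
    return sorted(fd for fd, n in counts.items() if n > 1)


if __name__ == "__main__":
    risky = {"app1": "FAULT-DOMAIN-1", "app2": "FAULT-DOMAIN-1"}
    print(colocated(risky))                                  # shared failure unit
    print(colocated(assign_round_robin(["app1", "app2"])))   # spread pair
```

With more nodes than fault domains some sharing is unavoidable; the check is most useful for pairs and small clusters where one fault domain must never hold every member.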

Autoscaling, instance pools, and configurations

Building block | Role
Instance Configuration | A template: shape, image, network, metadata/cloud-init, volumes. Immutable versioned blueprint.
Instance Pool | Manages a set of identical instances from a configuration, across FDs/ADs, with a target size.
Autoscaling | Adjusts pool size by metric (CPU/memory) thresholds or a schedule (e.g. scale down nights/weekends).
Cluster Networks | High-performance RDMA-connected pools for HPC/AI (ultra-low-latency interconnect).
Cost note - schedule-based autoscaling
Non-production compute rarely needs to run nights and weekends. A scheduled autoscaling policy (or a simple stop schedule) that shrinks Dev/Test pools to zero off-hours is one of the highest-ROI cost actions in OCI. Combine with the shutdown checklist in section 14.
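
The size of the saving is simple arithmetic on running hours. This is a back-of-envelope sketch, assuming compute charges scale with the hours instances actually run and ignoring storage that keeps billing while instances are stopped; the 12-hours-by-5-days schedule is an illustrative assumption, not a recommendation.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_running_hours(hours_per_day, days_per_week):
    """Hours per week the pool is actually up under the schedule."""
    return hours_per_day * days_per_week


def savings_fraction(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by the schedule."""
    return 1 - weekly_running_hours(hours_per_day, days_per_week) / HOURS_PER_WEEK


if __name__ == "__main__":
    # Dev/Test up 12 hours a day, Monday to Friday: 60 of 168 hours.
    print(f"{savings_fraction(12, 5):.1%} of compute hours avoided")
```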

Choosing shapes by workload

Workload | Starting point | Why
General applications | Flexible AMD E-series VM (balanced OCPU:memory) | Cost-effective, dial exactly what you need
Oracle Database (IaaS) | Higher-memory flexible VM or bare metal; consider Base/Exadata service instead | Memory for SGA/PGA; bare metal for licensing isolation and top performance
Web servers | Ampere Arm A-series or small E-series, autoscaled | Excellent price/performance for stateless scale-out
Middleware (WebLogic, etc.) | Balanced flexible VM, memory-leaning | JVM heaps like memory; scale OCPU to concurrency
EBS application tier | Flexible VM sized to concurrent users; multiple nodes behind LB | Horizontal scale + HA across FDs
Batch processing | Preemptible or Arm pools, autoscaled/scheduled | Fault-tolerant, cheap, elastic
Memory-heavy (in-memory, caches, analytics) | High memory-per-OCPU flexible shape | Push memory up without paying for unneeded cores
CPU-heavy (compute, encoding) | High-OCPU flexible or dedicated; Arm for throughput | Cores are the bottleneck
AI/ML training & inference | GPU shapes (NVIDIA); cluster networks for multi-node | Accelerators + RDMA fabric
DBA note - IaaS DB vs managed DB
You can run Oracle Database on a plain compute instance, and sometimes must (specific versions/configs). But you then own patching, backup, HA, and Grid Infrastructure yourself. For most cases the Base Database Service or Exadata Database Service (section 6) removes that toil. Choose IaaS DB only when a managed option genuinely cannot meet the requirement.

Operational guidance

How to resize compute Ops
  • Flexible VM: change OCPU/memory - typically requires a reboot; plan a window. Scaling within the same shape family is straightforward.
  • Change shape family/architecture (e.g. Intel → Arm): not an in-place resize - rebuild from the image/config on the new shape (watch for architecture-specific binaries).
  • For pools: update the instance configuration and roll instances, or increase pool size and drain old ones.
How to move workloads safely Ops
  • Prefer rebuild-from-image over "lift the disk" - create a custom image, launch in the target compartment/AD/region, validate, cut over.
  • For cross-region moves, copy the custom image to the target region, or use Block Volume/boot volume cross-region replication.
  • Keep IPs stable with reserved public IPs and secondary private IPs where clients pin addresses; better, front with a load balancer or DNS name so moves are transparent.
How to troubleshoot boot / access issues Ops
  • Use the serial console / console connection to see boot output and reach the OS when SSH is dead (bad fstab, firewall lockout, failed service).
  • Check the instance's boot volume is healthy; you can detach it and attach to a rescue instance to fix an unbootable OS.
  • Confirm SSH key was injected (cloud-init) and the security rules/route allow 22 from your source.
How to troubleshoot performance Ops
  • Check Monitoring metrics: CPU utilization, memory, and especially block volume throughput/IOPS vs. the volume's performance tier limits.
  • A common surprise: the instance is not CPU-bound, the block volume is at its IOPS/throughput ceiling. Raise the volume's performance tier or use higher-VPU settings (section 5).
  • Network: verify you are not hitting shape bandwidth limits; larger shapes get more network bandwidth.
  • Right-size: autoscale or resize based on sustained utilization, not peak fear.
How to design compute for production Design
  • At least two nodes across different fault domains (and ADs where available) behind a load balancer.
  • Instance configuration + pool + autoscaling so capacity is elastic and nodes are reproducible.
  • Instance principals for API access; no stored keys.
  • OS Management for patch compliance; custom golden image + cloud-init for consistency.
  • Monitoring alarms on CPU/memory/volume; boot/block volume backups; capacity reservation for DR if RTO demands it.

OS management, patching, and recovery

  • OS Management (Hub) - manage OS updates/patch compliance across fleets of Oracle Linux (and other) instances from OCI.
  • Serial console connection - out-of-band access for recovery.
  • Instance recovery - stop/start moves the VM to healthy hardware; for corrupted OS, rescue via boot-volume detach/attach.

Compute troubleshooting quick runbook

⚑ Instance unreachable / SSH fails

Checks

  • Security rules + route allow 22 from your IP; instance has the right public/private IP path.
  • Instance is Running (not Stopped); boot completed - check serial console.
  • SSH key injected (cloud-init logs); correct user (opc for Oracle Linux).
  • OS firewall (firewalld) not blocking; fail2ban not locking you out.

CLI

oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance-console-connection create --instance-id <ocid> --ssh-public-key-file key.pub
oci compute instance action --instance-id <ocid> --action SOFTRESET

Prevention

Bastion service for SSH (no public IPs on hosts), standardized security rules, and console-connection procedures documented.
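Before opening the serial console, it can help to tell a dropped connection from a refused one. The sketch below uses only the Python standard library; the function name and return labels are illustrative, not part of any OCI tooling.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port.

    "open"        - handshake completed; SSH problems are then key/user related
    "refused"     - the host answered but nothing listens on the port
    "unreachable" - timeout or no route; suspect security rules, routing,
                    or an OS firewall silently dropping packets
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:  # includes timeouts and "no route to host"
        return "unreachable"
```

A result of `unreachable` points at the first check in the list (security rules and route), while `refused` usually means the instance is reachable but sshd is not running or is bound elsewhere.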

5. Storage Deep Dive

Block, Object, and File storage in OCI - their performance models, backup and replication behavior, and the decision of which to use for databases, shared file systems, backups, archives, and data lakes.

Last reviewed: July 2026 Verify performance tiers, VPU values, and archive restore times in current docs.
TL;DR

Block Volume = network-attached disks for instances/databases (like SAN/iSCSI), with tunable performance (VPU). Object Storage = HTTP key-value store for backups, data lakes, artifacts, archives - not a file system. File Storage (FSS) = managed NFS for shared POSIX file systems. Block for boot/DB, Object for backups/archives/lakes, File for shared app filesystems. Archive tier is cheap but has a restore delay - plan for it.

The three storage services

Aspect | Block Volume | Object Storage | File Storage (FSS)
Interface | iSCSI / paravirtualized block device | REST/HTTP (S3-compatible, Swift) | NFS v3
Looks like | A disk you format & mount | Buckets of objects (no directories) | A shared mounted filesystem
Attached to | One instance at a time (or shared/multi-attach for clusters) | Nothing - accessed over network by URL | Many instances concurrently
Scale | Per-volume size limit; attach many | Effectively unlimited | Grows automatically to petabytes
Best for | Boot disks, DB datafiles, app storage needing block I/O | Backups, archives, images, logs, data lake, static content | Shared home dirs, app clusters, EBS shared APPL_TOP

Block volumes and boot volumes

  • Boot volume - the OS disk created with an instance. Can be backed up, cloned, and detached for rescue.
  • Block volume - additional data disks. Attach via iSCSI or paravirtualized; multi-attach for shared-disk clusters (e.g. some RAC/HA configs).
  • Performance tiers (VPU/GB): Lower Cost, Balanced, Higher Performance, and Ultra High Performance - set by Volume Performance Units (VPUs) per GB. Higher VPU = more IOPS and throughput per GB. Auto-tune can lower performance (and cost) when a volume is detached/idle and raise it when in use.
  • Volume groups - group volumes (e.g. all of a DB's volumes) so backups/clones are crash-consistent across the set.
  • Backups - full or incremental, policy-based (scheduled), and can be copied cross-region for DR. Clones are instant copy-on-write; replication keeps a volume asynchronously mirrored to another region.
DBA note - size block volume performance, not just capacity
Database performance problems on IaaS are frequently block-volume I/O ceilings, not CPU. IOPS and throughput scale with size and VPU. A small, low-VPU volume will throttle a busy redo/datafile workload no matter how many OCPUs you add. Provision VPU for the I/O profile, use volume groups for consistent backups, and put redo/data on appropriately-tiered volumes.
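The "size and VPU" relationship can be made concrete with a small calculation. This is a sketch under stated assumptions: the per-GB rate and per-volume cap below are illustrative placeholders for a Balanced-style tier, not authoritative figures, so take the real values from the current Block Volume performance documentation.

```python
def volume_iops_ceiling(size_gb: float, iops_per_gb: float, per_volume_cap: float) -> float:
    """IOPS ceiling of one volume: scales with size until the tier's cap is reached."""
    return min(size_gb * iops_per_gb, per_volume_cap)


# Assumed, illustrative tier figures - verify before using for sizing.
ASSUMED_IOPS_PER_GB = 60
ASSUMED_PER_VOLUME_CAP = 25_000

small = volume_iops_ceiling(100, ASSUMED_IOPS_PER_GB, ASSUMED_PER_VOLUME_CAP)
large = volume_iops_ceiling(1000, ASSUMED_IOPS_PER_GB, ASSUMED_PER_VOLUME_CAP)
print(f"100 GB volume: {small:.0f} IOPS ceiling; 1000 GB volume: {large:.0f} IOPS ceiling")
```

Under these assumed figures a 100 GB volume tops out well below what a busy redo or datafile workload may need, regardless of how many OCPUs the instance has; growing the volume or raising its VPU setting moves the ceiling.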
Cost note
Block storage cost scales with GB and VPU. Don't put everything on Ultra High Performance. Use Balanced as a default, Higher/Ultra for hot DB volumes, and Lower Cost + auto-tune for cold/detached volumes. Delete orphaned volumes and stale backups (section 14).

Object Storage

  • Namespace - the tenancy's unique, system-assigned top-level container for all buckets and objects; each bucket lives in the namespace and is scoped to a compartment and region.
  • Storage tiers: Standard (hot, frequent access), Infrequent Access (cheaper storage, retrieval fee), Archive (cheapest, must be restored before reading, with a restore delay). Auto-Tiering can move objects between Standard/IA based on access.
  • Multipart upload - upload large objects in parallel parts; required/recommended for big files (backups, images).
  • Pre-Authenticated Requests (PARs) - time-boxed URLs granting access to a bucket/object without IAM credentials - handy but a sharing risk if leaked.
  • Lifecycle rules - auto-transition objects to Archive or delete them after N days.
  • Retention rules & versioning - retention locks objects against deletion for a period (WORM-style, supports compliance); versioning keeps prior versions.
  • Replication - asynchronously replicate a bucket to another region/bucket for DR.
Common mistake - Object Storage is not a file system
There are no real directories - the "/" in object names is just a naming convention, and there is no in-place random write or POSIX locking. Do not try to mount a bucket and run a database or app that expects file semantics on it. Use Block or File storage for that; use Object Storage for whole-object put/get workloads (backups, artifacts, media, lake data).
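To see why "/" is only a naming convention, the following sketch emulates a prefix-and-delimiter listing over a flat set of object names. It is a local illustration in plain Python, not the Object Storage API; the object names are made up.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Emulate a prefix + delimiter listing over a flat key namespace.

    Returns (objects, prefixes): objects directly "under" the prefix, and the
    pseudo-folders derived purely from the names of deeper objects.
    """
    objects, prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A deeper "path" - report only its first segment as a pseudo-folder.
            prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(prefixes)


keys = ["logs/2026/01/app.log", "logs/2026/02/app.log", "logs/readme.txt", "backup.dmp"]
print(list_objects(keys, prefix="logs/"))
```

The "folders" exist only because some object names share a prefix. Delete the last object under a prefix and the folder disappears, and there is nothing to lock, rename atomically, or write into in place.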
Security note - PARs and public buckets
A Pre-Authenticated Request is a bearer URL: anyone with the link has the access it grants until it expires. Never make buckets public unless the content is genuinely public. Prefer short PAR lifetimes, or IAM + instance principals. Turn on versioning + retention for backup buckets so ransomware/accidental deletes are recoverable, and audit bucket visibility regularly (Cloud Guard flags public buckets).

File Storage Service (FSS)

  • File system - the NFS-exported filesystem; grows automatically, snapshots supported.
  • Mount target - the NFS endpoint (with a private IP in your subnet) that instances mount. It carries the export set.
  • Export paths & NFS export options - control which CIDRs/hosts can mount, and with what access (read/write, root squash).
  • Snapshots - point-in-time, space-efficient; replication mirrors a file system to another region for DR.
Security note - NFS export options are your access control
FSS security is enforced by the mount target's NFS export options (source CIDR, access level, root-squash) plus the subnet's security rules on NFS ports. A mount target open to the whole VCN lets any instance mount sensitive shares. Restrict export options to the specific client CIDRs, enable root squash where appropriate, and lock the NFS ports in the NSG.
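A quick way to review export options is to compare each export's source CIDR with the VCN CIDR. The helper below is a sketch using Python's `ipaddress` module; the function name and the sample CIDRs are hypothetical.

```python
import ipaddress


def audit_export(export_cidr: str, vcn_cidr: str, client_ip: str) -> dict:
    """Check one NFS export source CIDR against a client and the VCN range."""
    export = ipaddress.ip_network(export_cidr)
    vcn = ipaddress.ip_network(vcn_cidr)
    return {
        # Would this client be allowed to mount under the export's source rule?
        "client_allowed": ipaddress.ip_address(client_ip) in export,
        # An export source that covers the whole VCN (or more) is too open.
        "too_broad": vcn.subnet_of(export),
    }


print(audit_export("10.0.1.0/24", "10.0.0.0/16", "10.0.1.15"))
```

An export restricted to the app-tier subnet reports `too_broad` as false, while a source of the full VCN CIDR or `0.0.0.0/0` is flagged for tightening.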

When to use which

Need | Use | Why
OS boot disk | Boot Volume (Block) | Block I/O, bootable, backup/clone
Database datafiles / redo | Block Volume (right VPU) or managed DB storage | Low-latency block I/O; tune performance
Shared filesystem for an app cluster | File Storage (FSS) | Concurrent POSIX access from many nodes
Database backups (RMAN) | Object Storage (via Service Gateway) | Durable, cheap, off-host, cross-region copy
Log/data archive, long retention | Object Storage Archive tier + lifecycle + retention | Lowest cost, WORM compliance
Data lake | Object Storage (Standard) | Scales infinitely, queried by analytics services
EBS shared APPL_TOP / concurrent tier | File Storage (FSS) | Shared filesystem semantics EBS expects
Static website / media | Object Storage + PAR/CDN | HTTP-native object serving

Practical examples

RMAN backup to Object Storage DBA

Configure RMAN to write to Object Storage via the Oracle Database Cloud Backup Module (or the DBaaS backup tooling on managed DBs). Traffic should traverse the Service Gateway, not NAT/internet. Enable a lifecycle rule to move older backup pieces to Archive, plus versioning + retention on the bucket for immutability. For DR, enable cross-region bucket replication or copy backups to the DR region.

DBA note
On Base/Exadata/Autonomous, backups to Object Storage are largely managed - you set retention and, for Autonomous, the whole cycle is automatic. On IaaS DB you own the RMAN config, the backup module install, and the Service Gateway routing.
Application shared file system Apps

Create an FSS file system + mount target in the app subnet, restrict export options to the app-tier CIDR/NSG, mount on all app nodes. Snapshot on a schedule; replicate to the DR region. This is the standard shared storage for EBS APPL_TOP, WebLogic domains, or any scale-out app needing a common filesystem.

Log archive with lifecycle Ops

Ship logs (via Service Connector Hub) to an Object Storage bucket. Lifecycle rule: Standard for 30 days → Archive for long-term → delete after the compliance period. Retention rule prevents early deletion during the required window.
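The lifecycle described above could be expressed as a pair of rules. The sketch below only builds the rule documents. The field names (`name`, `action`, `timeAmount`, `timeUnit`, `isEnabled`, `objectNameFilter`) follow my understanding of the Object Storage lifecycle rule model and should be checked against the current API reference. The rule names, the `logs/` prefix, and the 2555-day retention are hypothetical.

```python
import json


def lifecycle_rules(archive_after_days: int, delete_after_days: int, prefix: str = "") -> list:
    """Build an archive-then-delete rule pair for a log bucket."""
    if delete_after_days <= archive_after_days:
        raise ValueError("delete_after_days must be greater than archive_after_days")

    def rule(name: str, action: str, days: int) -> dict:
        doc = {
            "name": name,
            "action": action,        # assumed values: "ARCHIVE" / "DELETE"
            "timeAmount": days,
            "timeUnit": "DAYS",
            "isEnabled": True,
        }
        if prefix:
            doc["objectNameFilter"] = {"inclusionPrefixes": [prefix]}
        return doc

    return [
        rule("archive-old-logs", "ARCHIVE", archive_after_days),
        rule("delete-expired-logs", "DELETE", delete_after_days),
    ]


rules = lifecycle_rules(30, 2555, prefix="logs/")
print(json.dumps({"items": rules}, indent=2))
```

Keeping the delete threshold strictly after the archive threshold, and pairing it with a retention rule on the bucket, avoids a lifecycle rule deleting data inside the compliance window.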

Data transfer options

  • Data Transfer Service (disk/appliance) - ship physical media/appliances to Oracle to bulk-load very large datasets when network transfer is impractical.
  • Online - CLI/SDK multipart upload, oci os object bulk-upload, or Storage Gateway-style/rclone tooling for ongoing sync.

Storage gotchas

  • Object Storage is not a filesystem - no random writes, no POSIX locks, no real directories.
  • Archive restore delay - Archive objects must be restored before reading, and the restore takes time (typically an hour or more; verify current restore times). Never put anything you might need immediately in Archive.
  • Block volume undersized for IOPS - performance follows size and VPU; a small volume throttles regardless of CPU.
  • NFS mount target too open - lock export options to specific CIDRs and enable root squash.
  • Backup policy gaps - a volume/DB with no backup policy attached is silently unprotected. Audit that every prod volume has a policy.
  • Cross-region copy cost & timing - replication/copy incurs egress and takes time; a "DR copy" that lags your RPO is not DR. Measure it.
  • Detached-but-billed volumes - deleting an instance may leave block volumes behind, still billing. Clean them up.
  • PAR sprawl - untracked pre-authenticated URLs are a data-exfiltration risk. Inventory and expire them.
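The "measure it" advice for cross-region copies reduces to comparing replication lag with the RPO. A minimal sketch follows; the function names are illustrative, and where the last-sync timestamp comes from (replication status, backup copy completion time) is left as an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_lag(last_synced_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Age of the newest data known to exist in the DR region."""
    now = now or datetime.now(timezone.utc)
    return now - last_synced_at


def meets_rpo(last_synced_at: datetime, rpo: timedelta, now: Optional[datetime] = None) -> bool:
    """True when the DR copy is no older than the recovery point objective."""
    return replication_lag(last_synced_at, now) <= rpo


now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
last_sync = datetime(2026, 7, 1, 11, 20, tzinfo=timezone.utc)
print(meets_rpo(last_sync, timedelta(minutes=15), now))
```

In this example a copy that is 40 minutes behind fails a 15-minute RPO. Running the same comparison on a schedule and alarming on failure turns the gotcha into a monitored control.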

6. Database Services Deep Dive

OCI's database portfolio, from self-managed IaaS through Base Database and Exadata to fully-managed Autonomous - what each one manages for you, how HA/DR/backup/patching differ, and how to choose.

Last reviewed: July 2026 DB service names, versions (23ai/26ai), and features change - verify in current docs.
TL;DR

Pick along a spectrum of control vs. toil. DB on IaaS = full control, all the work. Base Database Service = managed VM DB systems, you still trigger patches and own the guest OS. Exadata Database Service / Cloud@Customer = engineered-system performance and scale, RAC built in. Autonomous Database = Oracle runs patching, tuning, backup, and scaling; you own schema, SQL, and data. Choose the least you need to manage that still meets performance, control, and compliance requirements.

The portfolio at a glance

Service | Form | You manage | Oracle manages | Sweet spot
DB on Compute (IaaS) | You install Oracle DB on a VM/BM | Everything above the hypervisor | Infra only | Special versions/configs a managed service can't do
Base Database Service | Managed VM DB system (single node or 2-node RAC) | Guest OS, patch scheduling, schema | Provisioning, patch tooling, backup automation | Small-to-mid Oracle DBs wanting managed lifecycle
Exadata Database Service (ExaDB-D) | Exadata infra in OCI, VM clusters | Databases, patch scheduling, schema | Exadata hardware, storage cells, RAC substrate | Large, high-performance, consolidation, mission-critical
Exadata Cloud@Customer (ExaCC) | Exadata in your data center, OCI-managed | Databases, schema | Full stack, remotely operated by Oracle | Data residency / low-latency-to-on-prem with cloud ops
Autonomous Database | Self-driving (ATP/ADW/AJD) | Schema, SQL, data, access | Patching, tuning, backup, scaling, much of security | Most new OLTP/DW/JSON where you want minimal DBA toil

Service deep dives


Base Database Service

Managed Oracle Database on VM DB systems. You choose Standard/Enterprise Edition, version, and shape; OCI provisions the VM(s), Grid Infrastructure, and database, and provides managed backup and patching workflows.

  • Topologies: single-node VM DB system, or 2-node RAC VM DB system for node HA.
  • You still own: the guest OS (patching the OS, though Oracle provides the DB patch bundles), schema, SQL, and when to apply quarterly patches.
  • Backups: automatic backups to Object Storage with a retention you set; point-in-time restore.
  • Data Guard: one-click association to a standby (same or cross-region).
DBA note
Base Database feels closest to what an on-prem DBA already does - you still run RMAN concepts, Data Guard, and patching, but the provisioning and much of the plumbing is automated. Good stepping stone from on-prem before adopting Autonomous.

Exadata Database Service & Cloud@Customer

The Exadata engineered system delivered as a cloud service: Scale-Out compute (DB servers) + intelligent storage cells with Smart Scan, storage indexes, and flash. Databases run as RAC across the VM cluster.

  • ExaDB-D (Dedicated): Exadata infrastructure in an OCI region; you create VM clusters and databases on it. Elastic scaling of DB and storage servers.
  • ExaCC: the same Exadata hardware placed in your data center, control plane in OCI, operated by Oracle - for data-residency or ultra-low-latency-to-on-prem needs.
  • Why Exadata: Smart Scan offloads query processing to storage, huge consolidation density, consistent low latency, built-in RAC HA, and the top end of Oracle DB performance.
  • You manage: databases, PDBs, patch scheduling (Oracle provides one-click patching of the infra and DB), schema, and tuning.
Architect note - when Exadata is the right fit
Choose Exadata Database Service when you have large, latency-sensitive, or consolidated Oracle estates - dozens/hundreds of databases, heavy mixed OLTP+analytics, or workloads already tuned for Exadata features (Smart Scan, HCC). For a handful of modest databases it is over-provisioned; Base Database or Autonomous will be cheaper and simpler.

Autonomous Database

Self-managing Oracle Database. Oracle automates patching, tuning, backups, scaling, and much of the security. Workload flavors share one engine:

  • ATP (Transaction Processing) - OLTP/mixed, optimized for many short transactions.
  • ADW (Data Warehouse) - analytics, optimized for scans/aggregations, columnar.
  • AJD (JSON Database) - document/JSON-centric, SODA APIs.
  • Also available: APEX Service and, in 23ai/26ai, newer transaction and AI Vector Search capabilities (verify availability in current docs).

Deployment models:

Aspect | Serverless | Dedicated
Infra | Shared, Oracle-managed Exadata fleet | Exadata infra dedicated to you
Isolation | Logical | Physical - your own infra
Control | Least ops, fastest start | More control (maintenance windows, isolation policies)
Use for | Most workloads, dev/test, variable load | Regulated/large estates wanting private Autonomous
  • Autoscaling: OCPU and storage can auto-scale (e.g. up to 3x base OCPU) to absorb spikes; scale to/near zero for dev with auto-stop.
  • Autonomous Data Guard: one-click managed standby (in-region or cross-region) with automatic failover options.
  • Backups: fully automatic with a retention you choose; point-in-time restore.
DBA note - what you still do on Autonomous
You are not out of a job - you own schema design, indexing strategy (beyond auto-indexing), SQL quality, data modeling, partitioning choices, access control, and application performance. Autonomous removes infrastructure and routine maintenance toil, not data architecture. Some features and privileges are restricted (no SYSDBA in the traditional sense, limited OS access), which is exactly what trips up lift-and-shift of DBs that rely on OS-level hooks.
When NOT to use Autonomous
  • Apps requiring specific unsupported features, custom OS packages, non-standard init parameters, or direct OS/filesystem access.
  • Databases pinned to a version/patch level Autonomous won't run.
  • Certified packaged apps (some EBS/Siebel configs) that require Base/Exadata, not Autonomous - check certification.
  • Workloads needing very specific licensing or hard-partitioning arrangements.

RAC, CDB/PDB, and encryption

  • RAC (Real Application Clusters): multiple DB instances on multiple nodes serving one database for node-level HA and scale. Built into Exadata and available as 2-node on Base Database. Survives a node failure with brief brownout; not a DR substitute (it is one site).
  • CDB/PDB (multitenant): a Container Database hosts Pluggable Databases. PDBs are the unit of consolidation, cloning, and mobility - you can clone/relocate a PDB, and Autonomous/Exadata lean heavily on the multitenant model. Great for consolidating many databases with isolation.
  • TDE (Transparent Data Encryption): encryption at rest is standard in OCI databases. Keys can be Oracle-managed or customer-managed via OCI Vault (or Oracle Key Vault / External Key Management). Encryption in transit uses TLS/native network encryption.
  • Database Vault / Data Safe: Database Vault enforces separation of duties (even DBAs can't see app data) where licensed; Data Safe provides assessment, masking, and activity auditing (see "Data tooling and operations" below).
Security note - customer-managed keys
For regulated data, use customer-managed TDE keys in OCI Vault so you control key rotation and can revoke access to the data by disabling the key. Oracle-managed keys are simpler but give you less control. Decide per data-classification; wire the DB's key to a Vault in a locked-down security compartment.

Database service decision table

Workload | Recommended service | Reason | HA | DR | Ops responsibility | Cost lever
App OLTP (new build) | Autonomous (ATP) Serverless | Minimal toil, autoscale | Built-in | Autonomous Data Guard | Schema/SQL only | Autoscale + auto-stop dev
Data warehouse | Autonomous (ADW) | Columnar, scan-optimized, elastic | Built-in | Autonomous DG cross-region | Model + load | Scale OCPU to query load
Reporting DB | ADW or Exadata (if huge) | Read-heavy analytics | Built-in / RAC | DG / backup | Model + tuning | Scale for report windows
EBS database | Base Database or Exadata (per size/cert) | Certification + control needed | 2-node RAC | Data Guard | Patch + schema | Right-size OCPU; BYOL
Consolidated estate | Exadata (ExaDB-D) | Density + performance + PDB isolation | RAC | Data Guard | DBs + patch schedule | Consolidate; scale cells
Dev/test DB | Autonomous (auto-stop) or Base single-node | Cheapest managed option | N/A | Backup | Minimal | Auto-stop off-hours
AI / vector search | 23ai/26ai (Autonomous or Exadata) with AI Vector Search | In-DB vectors + SQL | Built-in/RAC | DG | Schema + embeddings | Scale for embedding jobs
JSON / document | Autonomous JSON (AJD) | SODA, JSON-native, low cost | Built-in | DG | Collections | Serverless autoscale
Special version/config | DB on Compute (IaaS) | Full control when managed can't | You build it | You build it | Everything | Right-size; BYOL

How HA and DR work across services

Service | In-region HA | DR
DB on IaaS | You build RAC / clustering / FD spread | You configure Data Guard + cross-region
Base Database | 2-node RAC option; FD placement | One-click Data Guard (in/cross-region)
Exadata (ExaDB-D/CC) | RAC across cluster nodes (built-in) | Data Guard / Active Data Guard to another Exadata
Autonomous | Built into the platform | Autonomous Data Guard (managed standby, optional auto-failover)
Architect note - RAC is HA, Data Guard is DR
Do not conflate them. RAC protects against node/instance failure within one site (fast, no data movement). Data Guard maintains a physically separate standby (another AD/region) that protects against site/region loss and corruption, with configurable sync (zero data loss with SYNC/Far Sync) or async. Mission-critical Oracle typically wants both: RAC for uptime, Active Data Guard cross-region for DR and read-offload.

Patching and backup differences

Service | Patching | Backup
DB on IaaS | Entirely you (OS + Grid + DB) | You configure RMAN + destination
Base Database | Oracle provides patch bundles; you schedule/apply | Automatic to Object Storage; you set retention
Exadata | One-click infra + DB patching; you schedule maintenance | Managed backup to Object Storage / local; retention configurable
Autonomous | Fully automatic (near-zero downtime), you do nothing | Fully automatic + point-in-time; you set retention window
DBA note - test restores, always
Managed backups still need restore testing. "Backups are automatic" does not prove recoverability - periodically restore to a clone and validate. On Autonomous, use point-in-time clone to a test instance; on Base/Exadata, script a periodic restore to a scratch system and check open + data integrity.

Data tooling and operations

Data Safe

Security assessment, user assessment, data discovery, masking, and activity auditing for your OCI databases. Start here for DB security posture.

Database Management

Monitoring, performance, and fleet management for databases (managed and, via agent, on-prem). Performance Hub, SQL insights.

Operations Insights

Capacity planning, SQL/resource analytics, and forecasting across the DB fleet - warehouse-style analytics on your database performance data.

Performance Hub

Real-time and historical ASH/AWR-style performance analysis in the Console for OCI databases.

GoldenGate

Real-time replication and CDC - migrations with minimal downtime, active/active, and streaming data pipelines.

Zero Downtime Migration (ZDM) & DMS

ZDM automates Data Guard-based migrations; Database Migration Service orchestrates online/offline migrations to OCI. See section 13.

SQL tuning options
Performance Hub + SQL Tuning Advisor + automatic indexing (Autonomous) + Operations Insights cover most tuning workflows. On Autonomous, auto-indexing and automatic SQL plan management do a lot; you still validate execution plans for critical SQL.

Licensing: BYOL vs License Included

Aspect | License Included (LI) | Bring Your Own License (BYOL)
What it is | The service price bundles the Oracle DB license | You apply existing on-prem Oracle licenses to the cloud service
Best when | New workloads, no existing licenses, want simplicity | You have unused/EE licenses and options; usually lower run cost
Watch for | Higher per-OCPU rate | Correct edition/options mapping, core factor, compliance with LMS
Cost note - BYOL is usually the biggest DB lever
For Oracle shops with existing Enterprise Edition + options, BYOL often cuts the effective database run-rate substantially versus License Included. But map editions/options carefully (Autonomous BYOL, Exadata BYOL, and options like RAC/Partitioning/Advanced Security have specific rules) and keep LMS-defensible records. Combine BYOL with right-sizing OCPUs since license = cores.
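The arithmetic behind that lever is simple enough to sketch. The hourly rates below are made-up placeholders, not Oracle prices, and the 730-hour month is a convention, so substitute figures from the current price list or cost estimator.

```python
def monthly_db_cost(ocpus: int, rate_per_ocpu_hour: float, hours: int = 730) -> float:
    """Monthly compute cost of a database at a flat per-OCPU hourly rate."""
    return ocpus * rate_per_ocpu_hour * hours


def saving_pct(baseline: float, candidate: float) -> float:
    """Percentage saved by the candidate relative to the baseline."""
    return 100 * (baseline - candidate) / baseline


# Placeholder rates for illustration only.
LI_RATE, BYOL_RATE = 0.50, 0.25

li_16 = monthly_db_cost(16, LI_RATE)      # License Included, oversized
byol_16 = monthly_db_cost(16, BYOL_RATE)  # BYOL, same size
byol_8 = monthly_db_cost(8, BYOL_RATE)    # BYOL after right-sizing to 8 OCPUs
print(saving_pct(li_16, byol_16), saving_pct(li_16, byol_8))
```

With these placeholder rates, BYOL alone halves the run-rate and right-sizing from 16 to 8 OCPUs halves it again, which is why the two levers are worth evaluating together.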

Enterprise examples

Oracle E-Business Suite database Apps DBA

Typically Base Database Service (2-node RAC) or Exadata depending on size and certification. Data Guard cross-region for DR. App tier on compute behind a load balancer, shared APPL_TOP on FSS. BYOL for the DB. This is the classic Apps DBA lift-and-shift; verify EBS certification for the target DB service and version.

Enterprise data warehouse Analytics

ADW for most; Exadata if extreme scale or already Exadata-tuned. Object Storage data lake feeding it; Data Integration/GoldenGate for loads; Oracle Analytics Cloud on top. Autoscale for reporting windows.

Consolidated database platform Platform

Exadata (ExaDB-D) with many PDBs for isolation and density. Standardized patch windows, Data Guard to a second region, Data Safe for auditing across the estate, Operations Insights for capacity planning.

AI / vector search database AI

Oracle Database 23ai/26ai (Autonomous or Exadata) using AI Vector Search to store embeddings alongside relational data and run similarity search in SQL - the backbone of RAG over enterprise data (see section 12). Keep it governed: agents query through a read-only/serving layer, not ad-hoc against production.

7. Load Balancing and Traffic Management

The Layer-7 Load Balancer and Layer-4 Network Load Balancer, their listeners/backend sets/health checks, SSL handling, routing, and a disciplined approach to the most common failure: unhealthy backends.

Last reviewed: July 2026 Verify bandwidth shapes and certificate service details in current docs.
TL;DR

Two products: the Load Balancer (LBaaS) is Layer 7 (HTTP/HTTPS-aware: SSL termination, path/host routing, cookies) with a flexible bandwidth shape; the Network Load Balancer (NLB) is Layer 4 (TCP/UDP, ultra-low latency, preserves source IP, scales huge). Both can be public or private. A load balancer is built from listeners → backend sets → backends with health checks. The number-one issue is a backend marked unhealthy because a security rule blocks the health-check probe.

Load Balancer vs Network Load Balancer

Aspect | Load Balancer (LBaaS, L7) | Network Load Balancer (NLB, L4)
Layer | 7 (HTTP/HTTPS/TCP) | 4 (TCP/UDP/ICMP)
Features | SSL termination/E2E, path & host routing, cookie persistence, WAF integration | Pass-through, preserves client source IP, very low latency, extreme scale
Source IP | Rewritten (adds X-Forwarded-For) | Preserved (great for apps that need real client IP)
Sizing | Flexible bandwidth (min/max Mbps) | Scales automatically, high throughput
Use for | Web/API tiers needing HTTP intelligence | Non-HTTP, high-throughput, source-IP-sensitive, or DB/NLB-fronted services

Both come in public (internet-facing, in a public subnet) and private (internal, in a private subnet) variants.

Anatomy: listeners, backend sets, backends, health checks

[Diagram: Clients → Load Balancer → Listener :443 (HTTPS) → Backend set + policy → Backends :8080 in FD1, FD2, and FD3, with a health check probing each backend.]
Listener terminates the connection, backend set balances across backends (spread over fault domains), health checks continuously probe each backend.
  • Listener - the front-end port/protocol (e.g. 443/HTTPS). Handles SSL, routing rules, and WAF.
  • Backend set - a group of backends plus the balancing policy (round robin, least connections, IP hash) and the health check definition.
  • Backend - an actual server (IP:port), weighted, drained gracefully during maintenance.
  • Health check - protocol/port/path/interval defining "healthy." Unhealthy backends are pulled from rotation.
  • Session persistence - cookie-based (LB-generated or app cookie) or IP-based, to pin a client to a backend.
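How a backend set combines a balancing policy with health state can be sketched in a few lines. This is a conceptual model only: the class, its attributes, and the CRC-based hash are assumptions for illustration and do not describe how the OCI load balancer is implemented.

```python
import zlib


class BackendSet:
    """Toy backend set: a list of backends, health state, and three policies."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)           # maintained by health checks
        self.active = {b: 0 for b in self.backends}  # open connections per backend
        self._next = 0

    def _pool(self):
        pool = [b for b in self.backends if b in self.healthy]
        if not pool:
            raise RuntimeError("no healthy backends")  # clients would see 502/503
        return pool

    def round_robin(self):
        pool = self._pool()
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend

    def least_connections(self):
        return min(self._pool(), key=lambda b: self.active[b])

    def ip_hash(self, client_ip: str):
        pool = self._pool()
        return pool[zlib.crc32(client_ip.encode()) % len(pool)]


bs = BackendSet(["10.0.1.10:8080", "10.0.2.10:8080", "10.0.3.10:8080"])
first_four = [bs.round_robin() for _ in range(4)]
```

Every policy selects only from the healthy pool, which is why a blocked health-check probe takes a perfectly good backend out of rotation.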

SSL termination, end-to-end SSL, certificates

Mode | Where TLS terminates | Use when
SSL termination | At the LB; plaintext to backends | Offload crypto from backends; inspect/route on content; backends in trusted private subnet
End-to-end SSL | LB terminates then re-encrypts to backend | Compliance requiring encryption in transit all the way; LB still does L7
SSL pass-through (NLB) | Not terminated - backend does TLS | Backend must own the cert / mTLS; L4 only
Security note - manage certs in the Certificates service
Use the OCI Certificates service (or import) to manage LB certificates, with rotation. Terminating TLS at the LB centralizes cert management and lets WAF inspect traffic, but means the LB-to-backend hop is plaintext unless you use end-to-end SSL - keep that hop inside a private subnet with tight NSGs, or use E2E SSL for regulated data.

Hostname and path-based routing

  • Hostname-based routing - one LB serves multiple virtual hosts (api.example.com vs app.example.com) to different backend sets.
  • Path-based routing - route by URL path (/api/* → API backends, /static/* → static backends).
  • Rule sets - header manipulation, redirects (HTTP→HTTPS), access control by source.
  • Logging - enable access and error logs to the Logging service for traffic analysis and troubleshooting.
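Host and path routing amounts to a first-match lookup from (hostname, path pattern) to a backend set. The sketch below models that idea in plain Python. The rule table and backend set names are hypothetical, and real LB routing policies have their own rule syntax and precedence, so treat this as a mental model only.

```python
from fnmatch import fnmatchcase

# (hostname, path pattern, backend set) - evaluated top to bottom, first match wins.
RULES = [
    ("api.example.com", "/*", "api-backends"),
    ("app.example.com", "/api/*", "api-backends"),
    ("app.example.com", "/static/*", "static-backends"),
]


def route(host: str, path: str, default: str = "app-backends") -> str:
    """Return the backend set for a request, or the listener's default set."""
    for rule_host, pattern, backend_set in RULES:
        if host == rule_host and fnmatchcase(path, pattern):
            return backend_set
    return default


print(route("app.example.com", "/api/v1/users"))
```

Requests that match no rule fall through to the default backend set, so a missing or misordered rule tends to show up as traffic landing on the wrong tier rather than as an error.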

When to use which

  • Public web / API with HTTPS, routing, WAF - Public Load Balancer (L7) + WAF policy
  • Internal microservice traffic - Private Load Balancer (or NLB for pure L4)
  • Very high throughput, non-HTTP, need real client IP - Network Load Balancer
  • TLS/mTLS must terminate on the backend - NLB pass-through
  • Kubernetes service of type LoadBalancer - OKE provisions an LB or NLB via annotations (see section 10)

Load balancer troubleshooting

⚑ Backend shows unhealthy / 502

Symptoms

Backends "Critical/Warning" in the LB health page; clients get 502/503 or intermittent errors.

Likely causes (in order)

  1. Security rules block the probe: the backend NSG/security list must allow the health-check source (the LB subnet/NSG) on the backend port. This is the most common cause.
  2. Health check misconfigured: wrong port, path, protocol, or expected status code vs. what the app actually returns.
  3. App not listening / wrong bind: service down, or bound to 127.0.0.1 not 0.0.0.0.
  4. OS firewall on the backend dropping the probe.
  5. Route table issue between LB and backend subnet (rare within one VCN, common across peered VCNs).
  6. SSL mismatch: health check uses HTTPS but backend serves HTTP (or cert invalid in E2E mode).

Checks

  • From a host in the LB subnet, curl the backend's health-check URL directly - does it return the expected code?
  • Confirm the backend NSG allows the LB source on the port; run Network Path Analyzer LB→backend.
  • Check the app is listening: ss -tlnp | grep <port>; check bind address.
  • Review LB error logs (Logging service).
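The first check can be scripted so that it reports the same thing the health checker cares about: did the backend answer, and with which status. This is a standard-library sketch meant to be run from a host in the LB subnet; the function name and messages are illustrative.

```python
import urllib.error
import urllib.request

# Ignore any proxy settings so the probe hits the backend directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_health(url: str, expected_status: int = 200, timeout: float = 3.0) -> str:
    """Fetch a backend's health-check URL and compare the status code."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the backend answered, but with an error status
    except (urllib.error.URLError, OSError) as err:
        # No answer at all: blocked by a security rule or OS firewall,
        # wrong port, or the app is not listening on this address.
        return f"no response: {err}"
    if status == expected_status:
        return "healthy"
    return f"unexpected status {status}"
```

Note that `urllib` follows redirects, so a health path that answers 302 would be reported by its final status. A result of `no response` maps to causes 1, 3, and 4 in the list above, and `unexpected status` maps to cause 2.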

Console path

Networking > Load Balancers > (LB) > Backend Sets > Health; and Backend Sets > Health Check Policy.

Fix / prevention

Open the health-check port from the LB source in the backend NSG; align the health-check path/port/protocol/status; ensure the app binds to all interfaces; template the LB + NSG in Terraform so every environment matches.

Common LB mistakes
  • Health check on the wrong port (LB listener 443 but backend serves 8080 - the check must target the backend port).
  • Forgetting the health-check probe source when writing backend NSG rules.
  • Wrong listener protocol (TCP vs HTTP) - HTTP features/routing require an HTTP listener.
  • App bound to localhost only, so it works on the host but not through the LB.
  • Certificate expired or chain incomplete on end-to-end SSL backends.
  • No backend spread across fault domains - a single FD loss takes all backends.
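The last mistake is easy to check mechanically. The sketch below takes a mapping of backend to fault domain (hypothetical input, for example assembled from an instance inventory) and reports the worst-case share of backends lost if one fault domain fails.

```python
from collections import Counter


def fd_exposure(backend_fds: dict) -> dict:
    """Summarize how backends are spread across fault domains."""
    counts = Counter(backend_fds.values())
    total = sum(counts.values())
    worst_fd, worst_count = counts.most_common(1)[0]
    return {
        "fault_domains": len(counts),
        "worst_fd": worst_fd,
        # Fraction of backends lost if the most crowded fault domain fails.
        "worst_case_loss": worst_count / total,
    }


report = fd_exposure({
    "app1": "FAULT-DOMAIN-1",
    "app2": "FAULT-DOMAIN-1",
    "app3": "FAULT-DOMAIN-2",
    "app4": "FAULT-DOMAIN-3",
})
print(report)
```

A `worst_case_loss` of 1.0 means every backend shares one fault domain. The remaining capacity after the worst-case loss should still carry peak load.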

8. Security Deep Dive

Defense in depth on OCI: identity, network, data, and detective controls - plus concrete guidance for securing a production tenancy, databases, Object Storage, and public endpoints, ending in a production security checklist.

Last reviewed: July 2026 Security service capabilities evolve - verify Cloud Guard/Security Zones features in current docs.
TL;DR

Security in OCI is layered: IAM (least-privilege policies, MFA, principals), network (private subnets, NSGs, no public IPs, WAF/Network Firewall), data (TDE with customer-managed keys in Vault, Object Storage retention, Data Safe), and detection (Cloud Guard, Security Zones, Vulnerability Scanning, Audit + Logging). Reduce public exposure, encrypt with keys you control, watch everything, and enforce guardrails that make the insecure thing impossible - not just discouraged.

Security design principles

  • Least privilege - lowest verb, narrowest resource type, lowest compartment, conditions where useful. Separate admin/operator/read-only.
  • Reduce blast radius - compartments, separate VCNs, separate keys per classification, break-glass isolation.
  • Private by default - no public IPs unless required; front-facing only via LB/WAF; Bastion for admin access.
  • Encrypt everywhere - at rest (TDE, block/object/file encryption) and in transit (TLS); customer-managed keys for sensitive data.
  • Guardrails over guidelines - Security Zones and quotas that prevent misconfiguration beat policies that merely recommend it.
  • Assume breach - detect and audit - Cloud Guard, Audit (immutable), Logging, alarms on anomalies. You cannot respond to what you cannot see.

The control layers

Layer | Controls | Key services
Identity | Who can do what | IAM policies, Identity Domains, MFA, federation, instance/resource principals
Network | What can reach what | Private subnets, NSGs/security lists, gateways, Bastion, WAF, Network Firewall, DDoS protection
Data | Protect data at rest/in transit | Vault (KMS), TDE, Object Storage encryption/retention, Data Safe, certificates
Detective / posture | Find and stop misconfig & threats | Cloud Guard, Security Zones, Vulnerability Scanning, Audit, Logging, Logging Analytics

Vault: keys, secrets, certificates

  • Vault - managed key management (KMS/HSM-backed). Create keys for TDE, block/object/file encryption, and app-level crypto.
  • Oracle-managed vs customer-managed keys - Oracle-managed is default and simplest; customer-managed lets you control rotation and revoke access to encrypted data by disabling the key. Keys can be software-protected or HSM-protected (higher assurance).
  • Secrets - store DB passwords, API tokens, wallets as versioned secrets; apps fetch them at runtime via principals - never bake secrets into images or code.
  • Certificates - managed TLS certs and CAs for load balancers and services, with rotation.
Security note - put Vault in a locked-down compartment
Keep Vaults, keys, and secrets in a dedicated security/shared-services compartment where only a small key-admin group has manage rights, and workloads get use (encrypt/decrypt/read-secret) via dynamic-group policies - not manage. Disabling a customer-managed key is your emergency "make this data unreadable" switch; that power must be tightly held and audited.
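A minimal sketch of the manage-versus-use split described above, written as a Python helper that emits IAM policy statements. The group, dynamic-group, and compartment names are hypothetical; the verbs and resource types (vaults, keys, secret-family, secret-bundles) follow the OCI policy language as commonly documented, but verify them against the current policy reference before use.

```python
def vault_policies(admin_group, workload_dg, compartment):
    """Return policy statements: key admins manage, workloads only use/read."""
    admin = [
        f"Allow group {admin_group} to manage {resource} in compartment {compartment}"
        for resource in ("vaults", "keys", "secret-family")
    ]
    workload = [
        # Workloads may encrypt/decrypt with keys but not create, rotate, or disable them.
        f"Allow dynamic-group {workload_dg} to use keys in compartment {compartment}",
        # Workloads may fetch secret contents but not create or change secrets.
        f"Allow dynamic-group {workload_dg} to read secret-bundles in compartment {compartment}",
    ]
    return admin + workload


# Hypothetical names, for illustration only.
statements = vault_policies("KeyAdmins", "app-instances", "security")
for statement in statements:
    print(statement)
```

Generating the statements from one place keeps the "workloads never get manage" rule easy to review; the same split can be written by hand or in Terraform.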

Cloud Guard, Security Zones, Vulnerability Scanning

Cloud Guard

Continuously detects misconfigurations and risky activity (public buckets, over-permissive policies, exposed ports, risky IAM) across the tenancy, scores them, and can auto-remediate via responders. Turn it on tenancy-wide.

Security Zones

Attach a policy-enforced recipe to a compartment that blocks non-compliant actions outright - e.g. no public subnets, no unencrypted volumes, no unapproved keys. Preventive, not just detective.

Vulnerability Scanning

Scans compute instances and container images for CVEs and open ports; schedule recurring scans and feed results into your patch process.

Bastion service

Time-boxed, audited SSH/RDP sessions to private hosts without public IPs or a standing jump box. Sessions expire; access is policy-controlled.

Web Application Firewall

OWASP protection, bot management, rate limiting, and geo/IP rules in front of public HTTP endpoints; attach to load balancers.

Network Firewall

Managed next-gen firewall (Palo Alto-based) in a hub VCN for stateful inspection, IPS/IDS, URL filtering, and TLS inspection of north-south/east-west traffic.

Architect note - Security Zones + Cloud Guard together
Cloud Guard tells you something is wrong; a Security Zone stops it happening. Put production compartments in a Security Zone so a mis-click cannot create a public bucket or unencrypted volume, and run Cloud Guard tenancy-wide to catch everything the zones don't. This pairing turns security from after-the-fact cleanup into prevention.

Data and database security

  • Encryption at rest - on by default for block/boot/object/file storage and databases (TDE). Choose customer-managed keys for sensitive data.
  • Encryption in transit - TLS for service endpoints; native network encryption / TLS for DB connections; ADB uses mTLS wallets.
  • Data Safe - security assessment, user risk assessment, sensitive data discovery, static data masking (for non-prod copies), and DB activity auditing. The first tool to point at any production database.
  • Database Vault (where licensed) - realms and separation of duties so even DBAs cannot read application data.
  • Audit - the tenancy Audit service records all API calls (control-plane) immutably; combine with DB-level and OS logs.
Common mistake - unmasked production data in non-prod
Cloning production into Dev/Test for "realistic testing" copies real customer/PII data into weakly-controlled environments. Use Data Safe masking to sanitize sensitive columns during the clone, and keep non-prod under the same (or stricter) access controls until it is masked.

How to secure specific things

Secure a production tenancy [Tenancy]
  • Federate human access to the corporate IdP; enforce MFA for all users, especially admins. Reserve break-glass local admins, sealed and alarmed.
  • Least-privilege IAM by role and compartment; no manage all-resources in tenancy for daily groups.
  • Enable Cloud Guard tenancy-wide; put prod compartments in Security Zones; enable Vulnerability Scanning.
  • Centralize network egress/inspection through a hub (Network Firewall); minimize public subnets.
  • Enable Audit retention and stream logs to a central logging compartment via Service Connector Hub.
  • Compartment quotas + budgets as guardrails; tag everything with data-classification.
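The "no manage all-resources in tenancy for daily groups" rule is easy to check mechanically. The sketch below is a hypothetical lint over policy statements held as plain strings (for example, exported from your policy-as-code repository); the allow-listed group name and the sample statements are assumptions, and the regular expressions are deliberately simple rather than a full policy parser.

```python
import re

# Matches statements that grant `manage all-resources` at tenancy scope.
BROAD = re.compile(r"\bto\s+manage\s+all-resources\s+in\s+tenancy\b", re.IGNORECASE)
SUBJECT = re.compile(r"\bgroup\s+(\S+)", re.IGNORECASE)


def find_broad_statements(statements, allowed_groups=("Administrators",)):
    """Return tenancy-wide manage grants whose subject is not allow-listed."""
    flagged = []
    for statement in statements:
        if not BROAD.search(statement):
            continue
        match = SUBJECT.search(statement)
        subject = match.group(1) if match else None  # None for any-user etc.
        if subject not in allowed_groups:
            flagged.append(statement)
    return flagged


# Hypothetical policy statements for illustration.
policies = [
    "Allow group Administrators to manage all-resources in tenancy",
    "Allow group AppOps to manage all-resources in tenancy",
    "Allow group AppOps to manage instance-family in compartment prod-app",
    "Allow group Auditors to read all-resources in tenancy",
]
flagged = find_broad_statements(policies)
print(flagged)
```

Run a check like this in the pipeline that applies policies, so a broad grant fails review instead of being found later by Cloud Guard.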
Secure databases [DB]
  • Private subnet only, no public IP; access via app tier / Bastion / private endpoints.
  • TDE with customer-managed keys in Vault; rotate keys.
  • Data Safe: run assessments, mask non-prod, audit activity.
  • Least-privilege DB accounts; Database Vault for separation of duties on the most sensitive systems.
  • Native network encryption / TLS for all client connections; store wallets/passwords in Vault secrets.
Secure Object Storage [Storage]
  • Keep buckets private; Cloud Guard alarms on any public bucket.
  • Enable versioning + retention rules on backup/compliance buckets (ransomware/accidental-delete recovery, WORM).
  • Prefer IAM + instance principals over PARs; if PARs are needed, short lifetimes and an inventory.
  • Customer-managed keys for sensitive buckets; access via Service Gateway, not internet.
Secure public load balancers / reduce public IP exposure [Network]
  • Only load balancers and Bastion live in public subnets; everything else private.
  • WAF in front of public HTTP; NSGs restrict listener sources where possible; rate limiting and geo rules.
  • Use reserved public IPs you can allowlist; terminate TLS at the LB with managed certs; consider end-to-end TLS for regulated data.
  • Replace standing jump hosts with the Bastion service (time-boxed, audited).
  • Audit the tenancy for stray public IPs regularly (Cloud Guard + a scheduled report).
Monitor suspicious activity [Detect]
  • Cloud Guard problems → notifications; Audit + Logging → Logging Analytics for correlation.
  • Alarms on: root/administrator logins, policy changes, new API keys, security-list changes, public IP creation, unusual Object Storage access.
  • Break-glass user login should page someone every time.

Production OCI security checklist

  • Human access federated to corporate IdP; MFA enforced for all users and especially admins.
  • Break-glass local admins created, credentials sealed, every login alarmed.
  • IAM least privilege: roles split (admin/operator/read-only), scoped to compartments, no broad manage all-resources in tenancy.
  • Workloads use instance/resource principals - no long-lived API keys stored on hosts.
  • Cloud Guard enabled tenancy-wide with notifications and responders configured.
  • Production compartments enrolled in Security Zones with appropriate recipes.
  • Vulnerability Scanning enabled for instances and container images.
  • No unintended public IPs; databases and app tiers in private subnets; Bastion for admin access.
  • WAF in front of public HTTP endpoints; Network Firewall inspecting hub traffic.
  • All data encrypted at rest; customer-managed keys in Vault for sensitive data; keys rotated.
  • Vault holds all secrets/wallets; nothing sensitive in images, code, or env files.
  • Object Storage buckets private; versioning + retention on backup/compliance buckets.
  • Data Safe assessments run; non-prod data masked; DB activity auditing on.
  • Audit log retention configured; logs centralized via Service Connector Hub.
  • Alarms on privilege/policy/network changes and anomalous access.
  • Compartment quotas + budgets as guardrails; everything tagged with data-classification.
  • DR and backups tested (restores verified), including key availability in the DR region.

Compliance basics

OCI maintains a broad set of certifications/attestations (SOC, ISO, PCI, HIPAA, FedRAMP/Gov in the relevant realms - verify current scope). Your responsibility is configuring services to meet your obligations: data residency (region/realm choice), encryption with controlled keys, access logging, and evidence. Cloud Guard and Security Zones help demonstrate continuous compliance; Audit provides the evidence trail.

9. Observability, Monitoring, and Operations

Metrics, alarms, logs, events, and the operational tooling to run OCI day-2 - including what to monitor per service, how to build useful alarms, and how to avoid drowning in noise.

Last reviewed: July 2026. Verify metric namespaces and query syntax in current docs.
TL;DR

Monitoring collects metrics and fires Alarms that publish to Notifications (email, PagerDuty, Functions, Slack via webhook). Logging centralizes service, audit, and custom logs; Logging Analytics analyzes them. Service Connector Hub is the pipe that moves logs/metrics/events between services (e.g. logs → Object Storage or SIEM). Events trigger automation on resource changes. Monitor the golden signals per tier, alarm on symptoms users feel, and route by severity to avoid alert fatigue.

The observability stack

Service | Role
Monitoring (Metrics) | Time-series metrics per service (CPU, memory, IOPS, LB health, DB metrics). Query with MQL.
Alarms | Threshold/absence rules on metrics that fire notifications and can trigger automation.
Notifications (ONS) | Topics with subscriptions: email, HTTPS webhook, PagerDuty, Slack, Functions, SMS.
Logging | Central store for service logs (LB, VCN flow, WAF), audit logs, and custom application logs.
Logging Analytics | Parse, search, correlate, and visualize large log volumes; dashboards and ML-assisted analysis.
Events | Reacts to resource lifecycle changes (e.g. bucket created, instance terminated) → Functions/Notifications/Streaming.
Service Connector Hub | Moves data between sources and targets (logs → Object Storage, metrics → Functions, events → SIEM).
Audit | Immutable record of all API/control-plane activity in the tenancy.
Operations Insights / Database Management / APM | Deep DB analytics, fleet DB monitoring, and application performance tracing.
Management Agent / OS Management | Agent-based host metrics/logs and OS patch compliance.

What to monitor per area

Compute

CPU utilization, memory utilization, load, instance status; per-process via agent. Watch for sustained saturation and crash loops.

Storage

Block volume IOPS/throughput vs. tier ceiling, latency; Object Storage request/error rates; FSS throughput. A volume running at its I/O ceiling is a common hidden bottleneck.

Database

CPU, sessions, wait classes, storage used %, tablespace, backup success, Data Guard apply lag, blocked sessions. Use Performance Hub + Database Management.

Load balancers

Healthy/unhealthy backend count, active connections, response time, 5xx rate, bandwidth vs. shape.

Network

VPN tunnel state, FastConnect BGP/light levels, NAT/Service GW throughput, VCN flow-log rejects, DNS query health.

Security

Cloud Guard problems, audit anomalies (policy/key/public-IP changes), unusual Object Storage access, failed logins.

Building useful alarms

  • Alarm on symptoms users feel (LB 5xx, unhealthy backends, DB down, high latency), not only causes.
  • Use appropriate statistics and windows (e.g. mean over 5 min, not a single spike) and a sensible trigger duration to avoid flapping.
  • Set severity and route: critical → page; warning → ticket/Slack; info → dashboard only.
  • Use absence alarms for "should always report" signals (heartbeat, backup completion).
  • Tag alarms by service/team so ownership is clear.
# MQL: alarm when an instance's mean CPU over a 5-minute interval exceeds 85%
CpuUtilization[5m]{resourceId = "ocid1.instance.oc1..xxxx"}.mean() > 85

# MQL: alarm when any backend set has unhealthy backends
UnHealthyBackendServers[1m].max() > 0

# Absence alarm: no data points for 10m (agent/host down); the detection period is the absent() argument
CpuUtilization[1m].absent(10m)
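
To see why a windowed mean plus a trigger duration avoids flapping, here is a small simulation. It is a sketch only, not how the Monitoring service is implemented; the sample values, window, and trigger count are made up for illustration.

```python
from statistics import mean


def alarm_states(samples, window, threshold, trigger_count):
    """Evaluate once per sample: FIRING only after the windowed mean has
    exceeded the threshold for `trigger_count` consecutive evaluations."""
    states, breaches = [], 0
    for i in range(len(samples)):
        window_values = samples[max(0, i - window + 1): i + 1]
        breaches = breaches + 1 if mean(window_values) > threshold else 0
        states.append("FIRING" if breaches >= trigger_count else "OK")
    return states


# One-minute CPU samples: a single spike, then sustained saturation.
cpu = [40, 99, 40, 40, 40, 90, 92, 95, 93, 96, 94, 97]
states = alarm_states(cpu, window=5, threshold=85, trigger_count=3)

# A naive rule on the raw sample would have fired on the lone spike.
naive = ["FIRING" if sample > 85 else "OK" for sample in cpu]
print(states)
print(naive)
```

The windowed rule stays quiet through the one-sample spike and fires only once saturation has persisted, while the naive rule fires on the spike itself.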

Example alarms to implement

Alarm | Signal / condition | Severity
CPU high | Instance CPU mean > 85% for 5-10 min | Warning → Critical if sustained
Memory pressure | Memory utilization > 90% (agent metric) | Warning
Disk / filesystem full | Filesystem used > 85% | Warning → Critical > 95%
Block volume throughput ceiling | Throughput/IOPS near the tier limit sustained | Warning (capacity)
LB unhealthy backend | UnHealthyBackendServers.max() > 0 | Critical
Database CPU | DB CPU utilization > 90% sustained | Warning
Database storage | Storage used > 85% / tablespace threshold | Warning → Critical
Failed backup | Backup job failed / absence of success event | Critical
Data Guard apply lag | Apply/transport lag > RPO threshold | Critical
VPN tunnel down | Tunnel state != UP | Critical
FastConnect issue | BGP session down / light-level alarm | Critical
Object Storage unusual access | Spike in requests / unexpected public access (via logs) | Security review

Avoiding noisy alerts

Common mistake - alert fatigue
Firing a page for every transient CPU spike trains people to ignore alarms. Fix it with: longer evaluation windows, appropriate statistics (mean/percentile not max), trigger durations, severity-based routing (only true user-impact pages), de-duplication/suppression during maintenance, and regularly pruning alarms nobody acts on. An alarm that never leads to action should be deleted or downgraded to a dashboard metric.
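
Severity-based routing and maintenance suppression can be sketched in a few lines. The destination names and the maintenance set below are hypothetical; in OCI the equivalent is usually separate Notifications topics per severity plus alarm suppression, so treat this as an illustration of the rule rather than a service API.

```python
ROUTES = {"critical": "page", "warning": "ticket", "info": "dashboard"}


def route_alert(severity, resource, in_maintenance=frozenset()):
    """Return the destination for an alert, or None if it is suppressed."""
    if resource in in_maintenance:
        return None  # suppressed during a maintenance window
    # Unknown severities fall back to the quietest channel rather than paging.
    return ROUTES.get(severity.lower(), "dashboard")


print(route_alert("critical", "lb-prod"))           # page
print(route_alert("WARNING", "db1"))                # ticket
print(route_alert("critical", "db1", {"db1"}))      # None (suppressed)
```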

Operational dashboards and reports

  • Build Console dashboards (and Logging Analytics dashboards) per audience: an on-call "is anything on fire" view, a service-owner view, and an exec/cost view.
  • Use Service Connector Hub to ship logs/metrics to Object Storage for retention or to your enterprise SIEM.
  • Turn on Cost and usage reports and pair with Budgets (section 14) for a spend dashboard.
  • Resource Manager + drift detection to monitor infrastructure conformance.
Architect note - centralize observability early
Create a dedicated observability/logging compartment, enable service logs and Audit tenancy-wide, and wire Service Connector Hub to a central bucket + SIEM from the start. Discovering after an incident that the logs you needed were never enabled is the classic post-incident finding, and retrofitting centralized logging at that point comes too late to help.

10. Containers, Kubernetes, and Cloud Native

OKE, Container Instances, Functions, and the event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and serverless.

Last reviewed: July 2026. Verify OKE modes (basic/enhanced), virtual nodes, and DevOps features in current docs.
TL;DR

OKE (managed Kubernetes) for long-running microservices at scale; Container Instances for a single container without running a cluster; Functions (serverless, Fn-based) for short event-driven code; plain Compute when containers add no value. Around them: Container/Artifact Registry, API Gateway, Events, Streaming, Queue, Notifications, and the DevOps service for CI/CD. OKE services of type LoadBalancer auto-provision an OCI LB/NLB; workloads use resource-principal workload identity for IAM.

The cloud-native services

Service | What it is | Use for
OKE (Kubernetes Engine) | Managed Kubernetes control plane + your worker nodes/node pools (or virtual nodes) | Microservices, platform teams, portable container workloads
Container Instances | Run containers directly, serverless, no cluster to manage | Single/few containers, batch, simple services without K8s overhead
Functions | Serverless FaaS (open-source Fn), event-triggered, scales to zero | Short event-driven tasks, glue, automation
Container Registry (OCIR) | Managed private Docker/OCI image registry | Storing/scanning images
Artifact Registry | Generic artifacts (not just images) | Build outputs, packages
API Gateway | Managed API front door: auth, routing, rate limiting, request/response transform | Exposing functions/microservices as APIs
Events | Reacts to resource changes → triggers Functions/Notifications/Streaming | Event-driven automation
Streaming | Kafka-compatible event streaming | High-throughput ingestion, pub/sub pipelines
Queue | Managed message queue (transactional, at-least-once) | Decoupling producers/consumers, work queues
Notifications | Pub/sub topics to email/webhook/Functions | Fan-out alerts and events
DevOps service | Managed Git repos, build & deployment pipelines (to OKE/Functions/Instances) | CI/CD inside OCI

OKE deep dive

  • Control plane - managed by Oracle (you don't run etcd/API servers). Choose Basic or Enhanced clusters (enhanced adds features like more add-ons, workload identity, higher limits, SLA).
  • Node pools - groups of worker nodes (managed VMs/BM) with a shape and image; scale and upgrade per pool.
  • Virtual nodes - serverless worker capacity where Oracle manages the node lifecycle (you don't patch/scale VMs); pods run without you managing the underlying node.
  • Networking - OKE uses VCN-native pod networking (pods get VCN IPs) or flannel overlay; plan subnet CIDRs to have enough IPs for pods and nodes.
  • Load balancing - a K8s Service of type LoadBalancer makes OKE provision an OCI Load Balancer (or NLB via annotation); Ingress controllers front HTTP routing.
  • Storage - CSI driver provisions Block Volumes as PVs; FSS for shared RWX volumes.
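  • IAM - workload identity - pods authenticate as their cluster, namespace, and K8s service account (a resource principal), and IAM policies grant access by matching those attributes, so pods call OCI APIs with no keys in the pod. Node-level instance principals via dynamic groups are the coarser alternative.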
# Expose a deployment via an OCI Network Load Balancer from Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    oci.oraclecloud.com/load-balancer-type: "nlb"   # or omit for L7 LB
    oci-network-load-balancer.oraclecloud.com/security-list-management-mode: "None"
spec:
  type: LoadBalancer
  selector: { app: web }
  ports: [ { port: 443, targetPort: 8443 } ]
Architect note - size pod subnets
VCN-native pod networking gives every pod a real VCN IP - excellent for network policy and observability, but it consumes subnet IP space fast. Size the pod subnet CIDR for peak pods across all nodes (pods-per-node x max nodes), with headroom. Running out of pod IPs stalls scheduling in ways that look like mysterious pending pods.
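
The sizing rule above (pods-per-node x max nodes, with headroom) is simple arithmetic. A sketch, assuming three addresses per subnet are reserved by OCI (the first two and the last; verify in current docs) and using hypothetical pod and node counts:

```python
import ipaddress
import math

RESERVED_PER_SUBNET = 3  # assumption: first two addresses and the last are reserved


def pod_subnet_prefix(pods_per_node, max_nodes, headroom=0.25):
    """Return the longest IPv4 prefix (smallest subnet) that fits peak pods plus headroom."""
    peak = pods_per_node * max_nodes
    needed = math.ceil(peak * (1 + headroom)) + RESERVED_PER_SUBNET
    host_bits = (needed - 1).bit_length()  # smallest n with 2**n >= needed
    return 32 - host_bits


# Hypothetical cluster: 31 pods per node, node pools scaling to 50 nodes.
prefix = pod_subnet_prefix(pods_per_node=31, max_nodes=50)
size = ipaddress.ip_network(f"10.0.0.0/{prefix}").num_addresses
print(f"pod subnet needs at least a /{prefix} ({size} addresses)")
```

With these numbers the peak is 1,550 pod IPs, 25% headroom brings it to 1,938, and the smallest subnet that fits is a /21. Subnet CIDRs are hard to change later, so round up when in doubt.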

OKE vs Functions vs Container Instances vs Compute

Many long-running microservices, need K8s ecosystem → OKE
A container or two, no desire to run a cluster → Container Instances
Short, event-triggered code that should scale to zero → Functions (+ API Gateway if HTTP)
Traditional app, VM lifecycle, no container benefit → Compute
Kafka-style ingestion pipeline → Streaming + consumers (OKE/Functions)
Cost note - do not run OKE for one container
A Kubernetes cluster carries operational and (for enhanced) control-plane cost and complexity. For a handful of containers, Container Instances or Functions are cheaper and simpler. Reserve OKE for when you genuinely need orchestration, service mesh, autoscaling fleets, or the K8s ecosystem.

Networking, IAM, and security for containers

  • Networking - clusters live in a VCN; control-plane/worker/pod/LB subnets with the right security rules. Private clusters keep the API endpoint off the internet.
  • IAM - cluster admins via IAM policy; in-cluster RBAC for K8s objects; workload identity for pod → OCI API access.
  • Image security - scan images in OCIR (Vulnerability Scanning); sign/verify; least-privilege pull secrets or instance principals.
  • Runtime security - network policies, pod security standards, secrets from OCI Vault (via CSI/secret store), and Cloud Guard over the tenancy.
  • Monitoring - cluster/node/pod metrics to Monitoring; container logs to Logging; APM for tracing.

Architecture patterns

[Diagram: Object Storage → Events (object created) → Function (process / write to DB) → Notifications (email/Slack)]
Serverless event chain: an object upload triggers Events → a Function processes it and writes results / notifies. No servers to manage.
  • Microservices on OKE - deployments behind Ingress/LB, HPA autoscaling, service mesh for mTLS/traffic control, DevOps pipelines deploying images from OCIR, secrets from Vault, workload identity for OCI access.
  • Serverless function triggered by Object Storage event - as diagrammed; ideal for image processing, ETL kick-off, validation.
  • Event-driven architecture - Events + Streaming + Functions + Queue + Notifications for decoupled, resilient pipelines.
  • Container deployment pipeline - DevOps build pipeline (from managed Git or GitHub) → image to OCIR (scanned) → deployment pipeline to OKE/Functions/Container Instances, gated by approvals.

Troubleshooting: OKE pod not starting

⚑ Pod stuck Pending / CrashLoopBackOff / ImagePullBackOff

Likely causes

  • Pending: no schedulable node capacity, or out of pod IPs in the pod subnet, or resource requests too large, or taints/affinity mismatch.
  • ImagePullBackOff: bad image path, missing OCIR pull permission (instance principal/secret), or private registry unreachable (no Service Gateway/NAT route).
  • CrashLoopBackOff: app failing on start - config/secret missing, DB unreachable, bad liveness probe.
  • Cannot reach OCI APIs: workload identity/dynamic-group/policy not set up.

Checks

kubectl describe pod <pod>        # events explain Pending/ImagePull
kubectl logs <pod> --previous     # crash reason
kubectl get nodes -o wide          # capacity / readiness
oci ce cluster get --cluster-id <ocid>

Fix / prevention

Scale the node pool / fix pod-subnet sizing; grant OCIR pull via policy; fix probes and config; wire workload identity. Prevent with capacity headroom, image scanning gates, and correct subnet CIDR sizing.

⚑ Function not triggering

Checks

  • Event rule condition matches the resource/action; rule enabled; correct compartment.
  • Function has a policy allowing the trigger (Events/API Gateway invoke); resource principal permissions for what the function does.
  • Function deployed to the right app; concurrency/timeout limits; cold-start not mistaken for failure.
  • Check function logs (Logging) and the invocation metrics.

Fix / prevention

Correct the Event rule filter and the invoke policy; verify the function's own IAM (resource principal); add logging and a test invocation to your deploy pipeline.
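
A rule filter that does not match the event is the most common cause, and you can test the filter offline before deploying. The matcher below is a simplified sketch, not the Events service's actual matching logic (it ignores wildcards, for example). The event type string follows the Object Storage event naming; the bucket name and the exact event fields shown are assumptions for illustration.

```python
def matches(condition, event):
    """Simplified matcher: every key in the condition must be present in the
    event; a list means 'any of these values', a dict recurses."""
    for key, expected in condition.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(expected, list):
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


# Hypothetical rule: object-created events for one bucket only.
rule = {
    "eventType": ["com.oraclecloud.objectstorage.createobject"],
    "data": {"additionalDetails": {"bucketName": ["invoices-raw"]}},
}
event = {
    "eventType": "com.oraclecloud.objectstorage.createobject",
    "data": {"resourceName": "a.pdf", "additionalDetails": {"bucketName": "invoices-raw"}},
}
other_bucket = {
    "eventType": "com.oraclecloud.objectstorage.createobject",
    "data": {"resourceName": "a.pdf", "additionalDetails": {"bucketName": "scratch"}},
}
print(matches(rule, event), matches(rule, other_bucket))
```

Capturing a real event payload from Logging and replaying it against the rule in a test is a cheap way to catch a wrong bucket name or event type before production.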

11. Analytics, Data, and Integration

The services that move, catalog, transform, and analyze data on OCI, and the common data-lake, warehouse, streaming, and CDC patterns built from them.

Last reviewed: July 2026. Verify service availability (Big Data, OpenSearch, OIC) per region in current docs.
TL;DR

Land raw data in Object Storage (the lake), move/transform it with Data Integration / Data Flow (Spark) / GoldenGate (CDC), catalog it with Data Catalog, serve analytics from Autonomous Data Warehouse, visualize with Oracle Analytics Cloud, and build models with Data Science. Streaming/Queue handle real-time ingestion; OpenSearch handles search/log analytics.

The services

Service | Role | Analogy for an Oracle person
Oracle Analytics Cloud (OAC) | BI, dashboards, self-service analytics, augmented analytics | OBIEE / modern BI, managed
Data Integration | Visual ETL/ELT with data flows and pipelines | ODI-style integration, cloud-native
Data Flow | Fully-managed Apache Spark (serverless) | Run Spark jobs without managing a cluster
Data Catalog | Metadata harvesting, glossary, data discovery/lineage | Enterprise data dictionary for the lake
Data Science | Notebooks, model training/deployment, MLOps | Managed JupyterLab + model catalog
GoldenGate | Real-time replication & change data capture | The GoldenGate you know, as a service
Streaming | Kafka-compatible event streaming | Managed Kafka
Queue | Managed message queue | AQ-style decoupling, managed
Service Connector Hub | Move data between OCI services | The plumbing/glue
API Gateway | Managed API front door | API management layer
Integration Cloud (OIC) | Application integration, prebuilt SaaS adapters, process automation | SOA Suite / iPaaS for connecting apps (EBS, Fusion, SaaS)
Big Data Service | Managed Hadoop/Spark clusters | Cloudera-style big data, where still needed
OpenSearch | Managed search & log analytics | Elasticsearch/Kibana, managed
Object Storage is the center of gravity
Almost every OCI analytics pattern uses Object Storage as the durable, cheap, infinitely scalable landing zone. ADW can query it directly (external tables), Data Flow/Spark reads and writes it, GoldenGate can deliver to it, and Data Catalog harvests it. Design your bucket/prefix layout (raw / curated / consumption zones) deliberately.
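
One way to make the zone layout deliberate is to build every object name through a single helper. The sketch below assumes a zone/source/dataset/date layout with a Hive-style dt= partition; the layout, source, and dataset names are illustrative choices, not an OCI requirement.

```python
from datetime import date

ZONES = ("raw", "curated", "consumption")


def object_name(zone, source, dataset, day, filename):
    """Build a zone/source/dataset/date-partitioned object name."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/dt={day.isoformat()}/{filename}"


# Hypothetical source system and dataset.
name = object_name("raw", "ebs", "ap_invoices", date(2026, 7, 1), "part-0001.parquet")
print(name)
```

A predictable prefix scheme lets IAM policies, lifecycle rules, and external-table definitions target a zone or dataset by prefix instead of by individual object.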

Common data patterns

Pattern | Built from | Notes
Data lake | Object Storage (raw/curated/consumption zones) + Data Catalog + Data Flow | Schema-on-read; ADW queries external data
Data warehouse | ADW + Data Integration + OAC | Curated, modeled, governed; serves BI
Streaming ingestion | Streaming (Kafka) → Functions/Data Flow → Object Storage/ADW | Real-time events into the lake/warehouse
Batch ingestion | Data Integration / Data Flow scheduled loads | Nightly/periodic bulk loads
CDC replication | GoldenGate from source DB → target (ADW/DB/Object Storage) | Near-real-time, low source impact; migrations & live feeds
Reporting architecture | ADW (or read replica) + OAC dashboards | Offload reporting off the OLTP system
AI-ready data | Curated lake + Data Science + 23ai AI Vector Search | Feed embeddings/models; see section 12
DBA note - GoldenGate for zero/low-downtime
GoldenGate is the workhorse for two very different jobs: migrations (keep the source live while OCI catches up, then cut over with minimal downtime) and ongoing replication (reporting offload, active/active, feeding a warehouse). For heterogeneous or cross-version moves where Data Guard/ZDM won't fit, GoldenGate is usually the answer. Watch supplemental logging overhead on the source and conflict handling in active/active.

Reference architecture: lakehouse + BI

[Diagram: Source DBs (OLTP) and SaaS / files / IoT → GoldenGate / DI → Object Storage lake (raw / curated / consumption) → Data Flow (Spark) → ADW (warehouse) → OAC; Data Catalog governs the lake]
Sources replicate/ingest into an Object Storage lake, Spark/DI transform, ADW serves modeled data, Data Catalog governs, OAC visualizes.

12. AI, ML, and Generative AI on OCI

OCI's AI stack - Generative AI, Agents, AI Vector Search in the database, Data Science, and the pretrained AI services - plus the enterprise RAG patterns and the governance guardrails that separate a demo from something you can run on real business data.

Last reviewed: July 2026. GenAI models & regions change fast - verify available models, regions, and pricing in the Console.
TL;DR

OCI Generative AI serves foundation models (chat, embeddings) via API, with dedicated AI clusters for isolation and fine-tuning. Generative AI Agents add managed RAG over your data. AI Vector Search in Oracle Database 23ai/26ai stores embeddings next to relational data so you do similarity search in SQL - the backbone of enterprise RAG. Around them sit Data Science (build/deploy models) and pretrained AI services (Language, Vision, Speech, Document Understanding, Anomaly Detection, Forecasting). The hard part is not the model - it is governing what the model can touch.

The AI services

Service | What it does | Use for
OCI Generative AI | Managed LLM inference (chat, embeddings, rerank); dedicated AI clusters; fine-tuning | Chatbots, summarization, extraction, RAG generation
Generative AI Agents | Managed agent/RAG service that grounds answers on your data sources | Chat-with-your-docs/data with less plumbing
AI Vector Search (in DB 23ai/26ai) | VECTOR data type + similarity search in SQL, alongside relational data | Enterprise RAG retrieval, semantic search
Data Science | Notebooks, model training, model catalog, model deployment, MLOps, AI Quick Actions | Custom ML, deploy open models, feature engineering
Language | Pretrained NLP: sentiment, entities, key phrases, translation, PII detection | Text analytics without training
Vision | Image classification, object detection, OCR | Document/image analysis
Speech | Speech-to-text (and related) | Transcription, voice input
Document Understanding | Extract text/tables/key-values from documents | Invoice/form processing pipelines
Anomaly Detection | Multivariate anomaly models | Ops/fraud/equipment monitoring
Forecasting | Time-series forecasting | Demand/capacity planning
Digital Assistant | Conversational assistant/chatbot platform | Structured skills + LLM-backed chat
OpenSearch | Keyword + vector/hybrid search | Search backends, hybrid retrieval

AI Vector Search in Oracle Database

Oracle Database 23ai/26ai adds a native VECTOR data type and vector indexes so you store embeddings in the same database as your relational data and run similarity search with SQL. For an Oracle shop this is significant: no separate vector database to operate, and you can combine semantic search with normal SQL filters, joins, and existing security.

-- Store document chunks with their embedding vectors
CREATE TABLE doc_chunks (
  id        NUMBER PRIMARY KEY,
  doc_id    NUMBER,
  chunk     CLOB,
  embedding VECTOR(1024, FLOAT32)
);

-- Retrieve the 5 most similar chunks to a query embedding (RAG retrieval)
SELECT id, doc_id, chunk
FROM   doc_chunks
ORDER  BY VECTOR_DISTANCE(embedding, :query_vec, COSINE)
FETCH FIRST 5 ROWS ONLY;
DBA note - vectors alongside relational data
Keeping embeddings in the Oracle DB means RAG retrieval inherits your existing access controls, backups, Data Guard, and auditing - you are not standing up and securing a second data store. Combine VECTOR_DISTANCE with ordinary WHERE filters (e.g. restrict to documents a user is entitled to) so retrieval respects row-level entitlements. Size vector indexes and memory for your embedding dimension and volume.
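
To make the retrieval step concrete, here is what "filter by entitlement, then rank by cosine distance" computes, in plain Python. This is an illustration of the logic only, assuming cosine distance is 1 minus cosine similarity; in practice the database does this with VECTOR_DISTANCE plus a WHERE clause and a vector index. The chunk data and entitlement set are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def top_k(chunks, query_vec, allowed_doc_ids, k=5):
    """Filter to documents the user is entitled to, then rank by distance."""
    entitled = [c for c in chunks if c["doc_id"] in allowed_doc_ids]
    return sorted(entitled, key=lambda c: cosine_distance(c["embedding"], query_vec))[:k]


# Tiny 2-dimensional embeddings for illustration.
chunks = [
    {"id": 1, "doc_id": 10, "embedding": [1.0, 0.0]},
    {"id": 2, "doc_id": 20, "embedding": [0.9, 0.1]},  # similar, but not entitled
    {"id": 3, "doc_id": 10, "embedding": [0.0, 1.0]},
]
hits = top_k(chunks, [1.0, 0.0], allowed_doc_ids={10}, k=2)
print([c["id"] for c in hits])
```

Chunk 2 is a close match but never reaches the model because the user is not entitled to document 20; filtering before ranking is what keeps RAG from leaking across users.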

RAG architecture on OCI

[Diagram: ingestion path: Docs in Object Storage → Chunk + embed → DB 23ai Vector Search. Runtime query path: User query → Serving layer (API, authz, guardrails) → Retrieve top-k chunks (entitlement-filtered) → OCI Generative AI (grounded answer) → Audit + logging]
Ingestion: docs → chunk → embed → store vectors in DB. Runtime: query → governed serving layer → entitlement-filtered retrieval → LLM generates a grounded answer, all audited.

Enterprise patterns

Pattern | How | Watch out for
Chat with documents | RAG over Object Storage docs + Vector Search + GenAI | Chunking quality; stale index; citations
Chat with database | Retrieve from curated views; generate grounded answers | Never expose raw prod OLTP; use a serving layer
Natural language to SQL | LLM proposes SQL against a governed schema/catalog | Validate/parametrize; read-only; guard against dynamic SQL
RAG with Object Storage + Vector Search | Standard enterprise RAG stack | Entitlement filtering at retrieval
AI assistant for operations | RAG over runbooks/logs; suggest actions | Human-in-the-loop before any change
AI assistant for business users | Governed metrics/curated data + NL interface | Answer only from curated, validated data
AI over EBS / app data | Read-only reporting layer / extracts, not live OLTP | Performance impact + data governance

Governance and security for GenAI

  • Serving layer, always - agents and LLMs call a governed API/service that enforces authentication, authorization, rate limits, input/output validation, and logging. They do not touch data stores directly.
  • Entitlement-aware retrieval - filter retrieved context to what the requesting user is allowed to see (row/document-level), so RAG cannot leak data across users.
  • Private connectivity - keep model and data traffic on private endpoints / the OCI backbone; dedicated AI clusters for isolation where required.
  • Credential hygiene - secrets in Vault, access via principals; the model never sees raw credentials.
  • Auditability - log prompts, retrieved context IDs, and responses (subject to privacy rules) so answers are explainable and reviewable.
  • Human validation - outputs that drive business decisions or changes are reviewed before use; agents that act get approval gates.

Warnings (read before connecting AI to enterprise data)

Do not do these
  • Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query playground for a probabilistic agent.
  • Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
  • Protect credentials. No database passwords, wallets, or API keys in prompts, code, or agent memory. Use Vault + principals.
  • Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
  • Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
  • Validate output before business use. Treat model output as a draft/suggestion until a human or a deterministic check confirms it.
Architect note - the pattern that scales safely
The durable enterprise GenAI shape is: curated/governed data → entitlement-filtered retrieval → model behind a serving API → validated, audited output. Everything risky (raw OLTP access, dynamic SQL, embedded credentials, unlogged answers) is a shortcut that works in a demo and fails an audit. Build the governed path first; it is far cheaper than retrofitting controls after an incident.
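
As an illustration of the "validated, read-only" rule for NL-to-SQL, the sketch below shows a fail-closed statement check a serving layer might run before executing model-generated SQL. It is a deliberately simple regex deny-list, not a SQL parser, and the function name is hypothetical. The real enforcement should still be a read-only database account with grants only on curated views, with values passed as bind variables.

```python
import re

# Keywords that should never appear in a generated reporting query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|truncate|create|grant|revoke|"
    r"execute|call|begin|declare)\b",
    re.IGNORECASE,
)


def validate_generated_sql(sql):
    """Return (ok, reason). Accepts only a single SELECT/WITH statement."""
    stmt = sql.strip().rstrip(";").strip()
    if not stmt:
        return False, "empty statement"
    if ";" in stmt:
        return False, "multiple statements"
    if not re.match(r"(select|with)\b", stmt, re.IGNORECASE):
        return False, "not a SELECT"
    if FORBIDDEN.search(stmt):
        return False, "forbidden keyword"
    return True, "ok"


print(validate_generated_sql(
    "SELECT region, SUM(amount) FROM sales_summary_v "
    "WHERE fiscal_year = :yr GROUP BY region"
))
print(validate_generated_sql("SELECT 1 FROM dual; DROP TABLE t"))
```

The check rejects some harmless queries (for example, a string literal containing the word "update"). That is intentional: a false rejection costs a retry, a false acceptance costs an incident.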

Cost and endpoint considerations

  • On-demand vs dedicated AI clusters - on-demand inference is pay-per-use and simple; dedicated clusters give isolation, predictable throughput, and fine-tuning, at a fixed cost. Match to volume and isolation needs.
  • Embedding jobs are a real cost at scale - batch and cache embeddings; re-embed only changed content.
  • Token/context size drives cost and latency - retrieve the smallest sufficient context, not everything.
  • Region/model availability - models and dedicated-cluster availability vary by region; verify before designing.
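
The "re-embed only changed content" advice can be sketched as a content-hash cache. This is a minimal in-memory illustration; `EmbeddingCache` and `fake_embed` are hypothetical stand-ins, and a real pipeline would call the embedding model and persist the hash alongside the vector.

```python
import hashlib


class EmbeddingCache:
    """Re-embed a document only when its content hash changes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}   # doc_id -> (content_hash, vector)
        self.calls = 0     # number of (billable) embedding calls made

    def get(self, doc_id, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._store.get(doc_id)
        if cached and cached[0] == digest:
            return cached[1]            # unchanged content: no model call
        self.calls += 1
        vector = self._embed_fn(text)
        self._store[doc_id] = (digest, vector)
        return vector


def fake_embed(text):
    # Stand-in for a real embedding call; returns a tiny deterministic vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = EmbeddingCache(fake_embed)
cache.get("runbook-1", "Restart the listener.")
cache.get("runbook-1", "Restart the listener.")            # cache hit
cache.get("runbook-1", "Restart the listener, then ping.")  # content changed
print(cache.calls)
```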

13. Migration and Disaster Recovery

Getting workloads into OCI, and keeping them recoverable once there - the migration tooling, the DR patterns by tier, and how RTO/RPO drive the architecture and cost.

Last reviewed: July 2026 Verify ZDM/DMS/Full Stack DR capabilities and supported sources in current docs.
TL;DR

Migrate databases with Zero Downtime Migration (ZDM) or Database Migration Service (both often Data Guard/GoldenGate under the hood), and compute by rebuilding from images or replicating volumes. For DR, choose per tier: backup-and-restore (cheapest, slow), pilot light, warm standby, or active/active (fastest, priciest). Full Stack Disaster Recovery orchestrates cross-region failover of the whole stack. Your RTO/RPO targets pick the pattern; DR you never test is not DR.

Migration tooling

Move | Tooling | Notes
Database (low downtime) | ZDM, Database Migration Service, GoldenGate | Data Guard-based physical or logical; GoldenGate for heterogeneous/cross-version
Database (offline) | RMAN / Data Pump / cross-endian | Simple, requires downtime window
Compute / VM | Rebuild from image, or import; block/boot volume replication | Prefer rebuild-from-golden-image over lift-and-shift of disks
Bulk data | Data Transfer appliance/disk, or online (multipart upload, rclone) | Physical transfer when network is impractical
Files | rsync/rclone to FSS or Object Storage; FSS replication | Preserve permissions for app filesystems
Applications (EBS etc.) | Oracle Cloud lift tooling + DB migration + app-tier rebuild | Whole-stack; validate certification

Database migration paths

Method | Downtime | Best for
Zero Downtime Migration (ZDM) | Near-zero | Same-platform Oracle-to-OCI using Data Guard; automated orchestration
Database Migration Service (DMS) | Low to near-zero | Managed online/offline migrations to OCI targets
GoldenGate | Near-zero | Heterogeneous, cross-version, cross-endian, active during cutover
Data Pump | Downtime window | Logical export/import; schema-level; cross-version
RMAN / cross-endian | Downtime window | Physical restore/transport to OCI
DBA note - pick by downtime tolerance and heterogeneity
If source and target are compatible Oracle and you can use Data Guard, ZDM gives the cleanest near-zero-downtime path. If you are changing endianness, version, or platform, or need the source live throughout, GoldenGate is the tool. Reserve Data Pump/RMAN for cases where a downtime window is acceptable and simplicity wins.
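
The selection logic in this note can be written down as a small decision function. The sketch below is one reading of the guidance, not an Oracle-defined rule: the parameter names are hypothetical, and the final fallback to Database Migration Service for the case the note does not cover is an assumption.

```python
def choose_migration_method(heterogeneous, data_guard_usable,
                            downtime_window_ok, source_live_throughout=False):
    """Map the note's criteria to a migration method (illustrative only)."""
    # A downtime window is acceptable and nothing forces an online method:
    # simplicity wins.
    if downtime_window_ok and not source_live_throughout:
        return "Data Pump / RMAN"
    # Changing endianness, version, or platform, or source must stay live.
    if heterogeneous or source_live_throughout:
        return "GoldenGate"
    # Compatible Oracle source and target with Data Guard available.
    if data_guard_usable:
        return "ZDM (Data Guard)"
    # Assumption: case not covered by the note; fall back to the managed service.
    return "Database Migration Service (online)"


print(choose_migration_method(heterogeneous=False, data_guard_usable=True,
                              downtime_window_ok=False))
print(choose_migration_method(heterogeneous=True, data_guard_usable=False,
                              downtime_window_ok=False))
```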

DR patterns

Pattern | Standby state | RTO | RPO | Cost
Backup & restore | Backups in DR region; nothing running | Hours+ | Since last backup | Lowest
Pilot light | Core (DB standby) running, app tier off | Tens of minutes | Small (DG lag) | Low
Warm standby | Scaled-down full stack running | Minutes | Small | Medium
Active / passive | Full-size standby, promote on failover | Minutes | Near-zero (SYNC) | High
Active / active | Both regions serving | Near-zero | Near-zero | Highest + complexity

Building blocks: Data Guard / Active Data Guard and Autonomous Data Guard for databases; Object/Block/File replication for storage; cross-region image copy for compute; DNS failover and load balancers for traffic redirection; Full Stack Disaster Recovery to orchestrate the whole failover as a plan you can run and test.

Architect note - tier your DR, don't blanket it
Not every workload deserves active/active. Classify workloads by business impact and assign each a DR pattern: mission-critical DB → Active Data Guard cross-region; important apps → warm standby; the rest → pilot light or backup-restore. Blanket active/active for everything is a budget-buster; blanket backup-restore leaves critical systems with an unacceptable RTO. Match pattern to tier.
Common mistake - active/active is rarely truly active/active
True active/active for stateful databases means solving write conflicts (GoldenGate bidirectional with conflict handling) - hard, and unnecessary for most. Most "active/active" requirements are satisfied by active/passive with fast failover, or active/read-only (Active Data Guard serving reads in the second region). Don't take on multi-master complexity unless the requirement genuinely demands it.

RTO and RPO - the two numbers that drive everything

  • RTO (Recovery Time Objective) - how long you can be down. Drives standby readiness (running vs. cold) and automation.
  • RPO (Recovery Point Objective) - how much data you can lose. Drives replication mode: async (small lag), SYNC/Far Sync (zero data loss), or backup interval (large).
  • Zero-data-loss (RPO 0) needs SYNC transport (Data Guard Maximum Protection/Availability, often via a Far Sync instance) - which requires low latency between sites and has performance implications. Confirm the network and the trade-off.
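
To show how RTO/RPO targets pick the pattern, the sketch below walks the DR patterns from cheapest to most expensive and returns the first one that meets both targets. The minute values attached to each pattern are illustrative assumptions chosen to mirror the qualitative table ("hours+", "tens of minutes", "minutes", "near-zero"); replace them with figures measured in your own drills.

```python
# Cheapest first: (pattern, achievable RTO in minutes, achievable RPO in minutes).
# The numbers are placeholders, not Oracle-published figures.
PATTERNS = [
    ("backup & restore", 240, 1440),
    ("pilot light", 30, 5),
    ("warm standby", 10, 5),
    ("active/passive", 10, 0),
    ("active/active", 0, 0),
]


def choose_dr_pattern(rto_target_min, rpo_target_min):
    """Return the cheapest pattern whose RTO and RPO both meet the targets."""
    for name, rto, rpo in PATTERNS:
        if rto <= rto_target_min and rpo <= rpo_target_min:
            return name
    return "active/active"


for targets in [(480, 1440), (60, 15), (15, 15), (15, 0), (0, 0)]:
    print(targets, "->", choose_dr_pattern(*targets))
```

Note how the RPO target alone moves a 15-minute-RTO workload from warm standby to active/passive: zero data loss forces SYNC transport regardless of how fast you need to be back up.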

Architecture examples

[Diagram: primary region (Ashburn) with app tier (LB), DB primary (RAC), and Object Storage; DR region (Phoenix) with app tier (warm), DB standby (ADG), and replicated bucket; the regions are linked by Active Data Guard and bucket replication.]
Cross-region DR: Active Data Guard replicates the database, bucket replication mirrors backups/objects, warm app tier + DNS/LB failover completes the picture. Full Stack DR orchestrates the switch.
  • On-prem Oracle DB → OCI: ZDM/Data Guard over FastConnect, cut over in a window, keep on-prem as fallback briefly.
  • EBS → OCI: migrate DB (Base/Exadata) + rebuild app tier + shared FSS; DNS cutover; validate certification.
  • VM → OCI: custom image import/rebuild; block volume replication for large data disks.
  • Cross-region DR (app): warm standby + DNS/LB failover + storage replication.
  • Cross-region DR (database): Active Data Guard (or Autonomous Data Guard) with configurable failover.
  • Backup-based DR: cross-region backup copies; rebuild in DR on demand (lowest cost, highest RTO).
  • GoldenGate-based DR: when heterogeneous or active/active-ish read serving is required.

DR testing

DR you have never tested is a hope, not a plan
Schedule regular DR drills: switchover (planned role change) at minimum, and periodic failover tests. Verify RTO/RPO are actually met, that the app works end-to-end in DR (not just the DB opens), that keys/secrets exist in the DR region (a customer-managed TDE key missing in DR makes the standby unusable), that DNS/LB failover works, and that runbooks are current. Full Stack Disaster Recovery lets you codify and rehearse the whole plan.
  • Standby database is applying and within RPO (monitor apply lag).
  • Encryption keys/wallets present and usable in the DR region.
  • App tier can start and connect in DR; config points to DR endpoints.
  • DNS/LB failover mechanism tested and time-measured.
  • Storage (buckets/volumes/FSS) replication within RPO.
  • Runbook current; roles assigned; switchover rehearsed on a schedule.
  • Capacity available in DR (reservations if RTO is tight).

14. Cost Management and Governance

How OCI charges, the tools to track and cap spend, the governance model (landing zones, quotas, budgets), and a concrete monthly cost-review checklist.

Last reviewed: July 2026 Pricing changes constantly - verify all rates on the official pricing pages.
TL;DR

OCI bills mainly by OCPU-hours, storage GB, and network egress, purchased via Universal Credits. Track with Cost Analysis and Usage Reports, cap with Budgets (alert) and Quotas (block), attribute with tags and compartments. The biggest levers: BYOL for databases, right-sizing OCPUs, stopping/scheduling non-prod, correct block-volume tiers, Object Storage lifecycle, and killing orphaned resources. Governance = a landing zone + quotas + budgets + tags so spend is controlled by design.

Pricing basics

Dimension | Charged on | Notes
Compute | OCPU-hours (+ memory for flex), per shape | Arm often cheaper per unit; preemptible cheaper still
Block storage | GB-month x performance (VPU) | Higher VPU costs more; auto-tune down when idle
Object storage | GB-month by tier + requests + retrieval (IA/Archive) | Archive cheapest to store, has retrieval cost/delay
Network | Internet egress (ingress free); some cross-region | Keep OCI-service traffic on Service Gateway to avoid internet egress
Database | OCPU-hours + storage + edition/options; LI vs BYOL | Usually your largest line item; BYOL is the big lever
Load balancer | Bandwidth shape (LBaaS) / usage | Size the flexible bandwidth to real need
  • Universal Credits - a consumption model where credits apply across eligible OCI services; annual commitments earn better rates than pure pay-as-you-go.
  • BYOL vs License Included - covered in section 6; BYOL is typically the biggest database saving if you already own licenses.

Cost tracking tools

Tool | Does
Cost Analysis | Console visualizations of spend by compartment, service, tag, time.
Usage/Cost Reports | Detailed CSV usage dropped to an Oracle-owned bucket for your own analysis/BI.
Budgets | Track spend against a target on a compartment or tag; alert at thresholds. Do not block.
Compartment Quotas | Policy-like statements that block resource creation - the hard cap.
Cloud Advisor | Recommendations: rightsizing, idle resources, performance/cost/availability.
Architect note - budgets alert, quotas enforce
Use them together: a budget tells you spend is trending over; a quota stops a team from creating the expensive thing in the first place. Attribute everything via defined tags (CostCenter, Environment, Owner) and per-compartment reporting so chargeback is possible. This only works if tagging was enforced from day one (section 1).
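
A minimal sketch of tag-based attribution: sum cost per `CostCenter` and surface anything untagged. The row shape here (`cost`, `tags`) is a simplified, hypothetical structure, not the actual cost-report CSV schema; map the real report columns into it when you load the file.

```python
from collections import defaultdict


def cost_by_tag(rows, tag_key="CostCenter"):
    """Sum cost per tag value; rows without the tag land in UNTAGGED."""
    totals = defaultdict(float)
    for row in rows:
        owner = row.get("tags", {}).get(tag_key) or "UNTAGGED"
        totals[owner] += row["cost"]
    return dict(totals)


# Hypothetical, simplified cost rows
rows = [
    {"service": "COMPUTE", "cost": 120.0, "tags": {"CostCenter": "CC-100"}},
    {"service": "DATABASE", "cost": 300.5, "tags": {"CostCenter": "CC-200"}},
    {"service": "BLOCK_STORAGE", "cost": 30.0, "tags": {"CostCenter": "CC-100"}},
    {"service": "OBJECTSTORAGE", "cost": 12.25, "tags": {}},
]

print(cost_by_tag(rows))
```

A non-zero `UNTAGGED` bucket is the signal that tag enforcement has a gap; chargeback cannot be trusted until it is empty.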

Governance model and landing zones

A landing zone is a codified baseline for a well-governed tenancy: compartment topology, IAM groups/policies, network hub, logging/audit, Security Zones, Cloud Guard, budgets, quotas, and tag defaults - deployed as Terraform so it is repeatable and reviewable. Oracle publishes CIS-aligned landing zone reference architectures/Terraform to start from.

  • Compartments for isolation and cost attribution (section 1).
  • Quotas + budgets per compartment for control.
  • Tags + tag defaults for attribution and automation.
  • Security Zones + Cloud Guard for preventive/detective guardrails (section 8).
  • Everything as code so environments are consistent and auditable.

Cost optimization examples

Action | Typical saving | Effort
Stop non-prod compute nights/weekends (schedule) | High - up to ~65-70% of that compute | Low
Right-size over-provisioned shapes (Cloud Advisor) | High | Low
Correct block-volume performance tier (don't over-VPU) | Medium | Low
Object Storage lifecycle to Archive / delete | Medium-High for large stores | Low
Database BYOL instead of License Included | Very High (Oracle shops) | Medium
Autonomous autoscale + auto-stop dev | High for variable/non-prod | Low
Exadata capacity planning / consolidation | High at scale | High
Remove unused reserved public IPs | Small each, adds up | Low
Delete orphaned block volumes / old backups / snapshots | Medium | Low
Move scale-out tiers to Arm shapes | Medium (price/perf) | Medium
Cost note - the cheap wins first
Before any complex re-architecture, do the boring high-ROI things: schedule non-prod off-hours, act on Cloud Advisor rightsizing, apply Object Storage lifecycle rules, and BYOL your databases. These are low-effort and recover the most money. Re-architecture (Arm migration, Exadata consolidation) comes after.
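
The "~65-70%" figure for scheduling non-prod off-hours is simple arithmetic on the 168-hour week, sketched below. It covers compute hours only: attached boot and block volumes keep billing while an instance is stopped, so the saving on the total bill is smaller.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving(hours_on_per_day, days_on_per_week):
    """Fraction of compute hours avoided versus running 24x7."""
    running = hours_on_per_day * days_on_per_week
    return 1 - running / HOURS_PER_WEEK


# Weekdays only, 12 h/day and 10 h/day
print(f"{schedule_saving(12, 5):.1%}")
print(f"{schedule_saving(10, 5):.1%}")
```

Running 12 hours on weekdays avoids roughly 64% of compute hours; tightening to 10 hours reaches roughly 70%.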

Monthly OCI cost review checklist

  • Review Cost Analysis month-over-month by compartment and service; investigate any spike.
  • Check each budget: which compartments/tags are over or trending over target.
  • Act on Cloud Advisor rightsizing and idle-resource recommendations.
  • Confirm non-prod stop/scale schedules ran (no Dev/Test running 24x7 by accident).
  • Find and terminate orphaned block volumes, unattached boot volumes, and idle instances.
  • Delete stale volume backups, DB backups beyond retention, and old snapshots.
  • Review Object Storage: are lifecycle rules moving cold data to Archive / deleting expired data?
  • Remove unused reserved public IPs and idle load balancers.
  • Verify database licensing posture (BYOL applied where owned; OCPU counts right-sized).
  • Check block-volume performance tiers vs. actual IOPS - downgrade over-provisioned volumes.
  • Review egress charges - is OCI-service traffic correctly on the Service Gateway?
  • Confirm every resource is tagged (CostCenter/Environment/Owner) for attribution.
  • Validate quotas still reflect intent (no team quietly raised a cap).
  • Reconcile Universal Credits burn-down vs. commitment; forecast to renewal.
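
For the last checklist item, a straight-line forecast is usually enough to tell whether the commitment will run out early or be left unused. The sketch below assumes a constant daily burn rate and a 365-day term; the function name and figures are hypothetical.

```python
def forecast_burn(commitment, spent_to_date, days_elapsed, term_days=365):
    """Linear projection of credit consumption to the end of the term."""
    daily = spent_to_date / days_elapsed
    projected = daily * term_days
    return {
        "daily_burn": daily,
        "projected_total": projected,
        # Positive = projected overage, negative = unused commitment
        "variance": projected - commitment,
        # Day of the term on which credits run out at the current rate
        "exhausted_on_day": commitment / daily,
    }


# Hypothetical: 120,000 committed, 40,000 consumed after 100 days
print(forecast_burn(commitment=120_000, spent_to_date=40_000, days_elapsed=100))
```

In this example the credits run out on day 300, leaving about two months of consumption beyond the commitment. A linear projection ignores seasonality and planned migrations, so treat it as a trigger for a conversation, not a budget.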

15. Enterprise Architecture Patterns

Reference blueprints for real OCI deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.

Last reviewed: July 2026 Blueprints are starting points - validate sizing/services against current docs and your requirements.
HOW TO READ THESE

Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the relevant service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: public LB + WAF → private app tier across fault domains → managed database → Service Gateway for OCI services → centralized logging → cross-region DR.

Foundational three-tier (reference backbone)

Three-tier enterprise application
The pattern most other patterns extend
[Diagram: Users → WAF + public LB → private app subnet (NSG: app; App FD1, App FD2) → private DB subnet (NSG: db; managed DB, RAC / ADB); supporting services: NAT GW, Service GW, Vault, Logging.]
Business case: Standard internal/external web or enterprise app needing HA and controlled exposure.
Services: VCN + public/private subnets, WAF, Load Balancer, Compute (instance pool), Base/Autonomous DB, Service Gateway, NAT, Vault, Monitoring/Logging.
Traffic flow: User → WAF → public LB → app tier (private) → DB (private); app → OCI services via Service Gateway; egress via NAT.
Security: Only LB/Bastion public; NSGs per tier (app→db on 1521 only); TDE + customer keys; secrets in Vault; Cloud Guard + Security Zone.
HA: App pool + DB nodes spread across fault domains (and ADs where available); LB health checks.
DR: Active Data Guard + warm app tier in a second region; DNS/LB failover.
Monitoring: LB backend health, app CPU/mem, DB metrics, alarms → Notifications; central logs.
Cost: Right-size app pool + autoscale; BYOL DB; schedule non-prod.
Risks / mistakes: Backends unhealthy from missing health-check NSG rule; DB in public subnet; no FD spread; secrets in images.

Pattern library

Each pattern below follows the same dimension set. Expand the ones relevant to you.

Simple web application Small
Case: Low-complexity site/app, cost-sensitive.
Services: 1 VCN, public LB (or public instance), Arm compute, Autonomous DB (auto-stop for dev), Object Storage for assets.
Flow: User → LB → app → ADB; static assets from Object Storage.
Security/HA/DR: WAF on LB; 2 instances across FDs; ADB backups + optional Autonomous DG; Cloud Guard.
Cost / risk: Arm + auto-stop = cheap. Risk: single instance / no backups if cut too far.
Highly available application HA
Case: App that must survive node, rack, and (where possible) AD failure.
Services: Instance pool + autoscaling across FDs/ADs, LB, RAC DB or ADB, FSS for shared state, Vault.
Flow / HA: LB spreads to app pool; DB RAC for node HA; storage replicated; no single FD holds all.
DR / monitoring: Active Data Guard cross-region; alarms on backend health + DB; Full Stack DR plan.
Risk: State stored on a single node instead of shared/managed store; untested failover.
EBS on OCI Apps DBA
Case: Migrate/run Oracle E-Business Suite on OCI.
Services: Base/Exadata DB (2-node RAC), app-tier compute pool behind LB, shared APPL_TOP on FSS, Vault, Object Storage backups, Bastion.
Flow: Users → LB → EBS app nodes (shared FSS) → DB; concurrent/admin tiers as needed.
Security/HA/DR: Private DB, TDE; app across FDs; Data Guard DR; validated EBS certification for the DB service/version.
Cost / risk: BYOL DB; right-size nodes. Risk: unsupported DB service/version; FSS export too open.
Oracle database on OCI (managed) DB
Case: Run a production Oracle DB with managed lifecycle.
Services: Base Database (2-node RAC) or Exadata; Data Guard; Object Storage backups via Service Gateway; Data Safe; Vault keys.
Security/HA/DR: Private subnet; customer-managed TDE; RAC HA; Data Guard cross-region DR; audited via Data Safe.
Cost / risk: BYOL; right-size OCPU. Risk: untested restores; keys absent in DR region.
Exadata Cloud Service platform Scale
Case: Large, high-performance, or consolidated Oracle estate.
Services: ExaDB-D VM clusters + many PDBs; Data Guard to a second Exadata; Ops Insights; Data Safe.
Security/HA/DR: RAC built-in; Active Data Guard; PDB isolation; standardized patch windows.
Cost / risk: Consolidate for density; BYOL. Risk: over-provisioning for small estates; noisy-neighbor PDBs without resource management.
Autonomous Database application Low-ops
Case: New app wanting minimal DBA toil, elastic scale.
Services: ATP (Serverless), private endpoint, app tier on OKE/compute, Vault for wallet, APEX optional.
Security/HA/DR: Private endpoint (no public); mTLS wallet from Vault; Autonomous DG; auto-backups.
Cost / risk: Autoscale + auto-stop dev. Risk: assuming Autonomous fits a DB needing OS/feature access it can't provide.
Data warehouse & data lake Analytics
Case: Enterprise analytics / BI on curated + raw data.
Services: Object Storage lake (zones) + Data Integration/Data Flow + ADW + Data Catalog + OAC; GoldenGate CDC feeds.
Security/HA/DR: Private access; Data Catalog governance; ADW auto-backups + DG; masked non-prod.
Cost / risk: Scale ADW to query windows; lifecycle cold lake data. Risk: ungoverned lake ("data swamp").
Kubernetes platform Cloud native
Case: Container platform for many microservices with CI/CD.
Services: OKE (enhanced), node pools/virtual nodes, OCIR (scanned), API Gateway, DevOps pipelines, Vault, LB/NLB, service mesh.
Security/HA/DR: Private cluster; workload identity; network policies; multi-FD node pools; backup of cluster state/config; images in OCIR replicated.
Cost / risk: Right-size pools; Arm nodes. Risk: pod-subnet IP exhaustion; over-privileged workload identity.
Private enterprise application (no internet exposure) Regulated
Case: Internal-only app reachable from on-prem, no public footprint.
Services: Private subnets only, private LB, FastConnect/VPN via DRG, Service Gateway, private endpoints, Bastion for admin.
Security/HA/DR: No public IPs/IGW; access from corporate network only; Network Firewall inspection; cross-region DR over private links.
Cost / risk: FastConnect cost. Risk: CIDR overlap with on-prem; DNS forwarding gaps.
Hybrid cloud Hybrid
Case: Workloads split across on-prem and OCI with shared connectivity.
Services: DRG hub + FastConnect (primary) + VPN (backup), hub-and-spoke VCNs, hybrid DNS, Network Firewall.
Security/HA/DR: Redundant links with BGP failover; centralized inspection; consistent IAM/tagging.
Cost / risk: FastConnect + egress. Risk: CIDR overlap; single link with no backup; asymmetric routing.
Multi-region DR DR
Case: Business-critical stack needing regional resilience.
Services: Mirrored compartments/VCNs in 2 regions, Active Data Guard, storage replication, capacity reservation, Full Stack DR, DNS/LB failover.
Security/HA/DR: Keys present in both regions; RTO/RPO per tier; rehearsed switchover/failover.
Cost / risk: Standby cost vs. RTO. Risk: untested DR; missing DR keys; capacity unavailable at failover.
Secure landing zone Governance
Case: Governed foundation before workloads land.
Services: Compartment topology, IAM roles/policies, network hub, Vault, centralized logging/audit, Cloud Guard, Security Zones, budgets, quotas, tag defaults - all as Terraform (CIS-aligned).
Security/HA/DR: Preventive guardrails; least privilege; audit centralized; break-glass isolated.
Cost / risk: Quotas/budgets cap spend. Risk: skipping the landing zone and retrofitting governance later.
GenAI with private enterprise data AI
Case: RAG/assistant over internal documents and data, governed.
Services: Object Storage (docs) + DB 23ai Vector Search + OCI Generative AI (or Agents) behind a serving API on OKE/Functions + API Gateway + Vault + Logging.
Flow: Query → serving layer (authz + guardrails) → entitlement-filtered vector retrieval → grounded generation → audited response.
Security/HA/DR: Private endpoints; no direct model→OLTP; secrets in Vault; full audit; validated output.
Cost / risk: Cache embeddings; right-size dedicated clusters. Risk: ungoverned data access, dynamic SQL, credential leakage (see section 12 warnings).
Common mistakes across all patterns
  • Databases or app tiers exposed in public subnets "just to get it working."
  • No fault-domain spread - a single rack event takes the whole "HA" tier.
  • Health-check NSG rules forgotten, so LB backends are unhealthy on day one.
  • DR designed but never tested; keys/secrets missing in the DR region.
  • Secrets baked into images/code instead of Vault + principals.
  • No centralized logging/audit until an incident needs it.
  • CIDR overlap discovered during the hybrid connectivity phase.
  • Governance (landing zone, quotas, tags) skipped and retrofitted painfully later.

16. Troubleshooting Guides

A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and CLI where useful), fixes, and prevention. Deeper versions of some runbooks live in their service sections; this is the consolidated index.

Last reviewed: July 2026 CLI syntax evolves - verify commands with oci <service> --help.
General method
Work top-down through the layers: identity (is the caller allowed?) → network (is there a route + do rules allow it, both ways?) → host/service (is it listening/healthy?) → data. For any "cannot reach" problem, run Network Path Analyzer and enable VCN Flow Logs first - they usually name the exact blocking rule.

Compute & access

⚑ Compute instance not reachable / SSH fails

Symptoms: SSH times out or refuses. Likely causes: security rule/route missing (22), no public IP or wrong path, instance stopped/boot failed, wrong SSH key/user, OS firewall/fail2ban. Checks: instance state Running; serial console for boot; security rules + route allow 22 from your IP; correct user for the image (opc on Oracle Linux, ubuntu on Ubuntu); firewalld. Console: Compute > Instance > Attached VNICs; Instance > Console connection. Fix: open 22 in the NSG/security list, assign IP/route, reset via console. Prevention: use Bastion service (no public IPs), standard security templates.

oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance action --instance-id <ocid> --action SOFTRESET
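
Whether SSH times out or is refused narrows the cause considerably, so it is worth testing explicitly. The Python sketch below (a hypothetical helper, standard library only) classifies a TCP connection attempt. As a rule of thumb, a timeout suggests packets are being dropped before they reach the host (security rule, route, no public IP), while a refusal suggests the host was reached but nothing is listening on the port or the OS firewall is rejecting.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, port closed or actively rejected
    except (socket.timeout, TimeoutError):
        return "timeout"   # packets silently dropped somewhere on the path
    except OSError as exc:
        return f"error: {exc}"   # e.g. name resolution or no route to host
```

Usage: `probe_tcp("<instance-ip>", 22)` from the same network location as the failing client; results differ by source because security rules are source-specific.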
⚑ Instance boot issue

Symptoms: instance up but OS unreachable/services down. Likely causes: bad /etc/fstab mount, kernel/driver issue, full root disk, failed cloud-init. Checks: serial console boot output; single-user mode. Fix: detach boot volume, attach to a rescue instance, correct fstab/config, reattach. Prevention: test image changes in non-prod; keep boot volume backups.

⚑ High CPU

Symptoms: slow app, CPU alarm. Likely causes: undersized shape, runaway process, missing autoscale, batch overlap. Checks: Monitoring CPU trend; on host top/pidstat. Fix: resize (reboot) or scale the pool; fix the offending process. Prevention: autoscaling + right-sizing from sustained metrics.

⚑ Disk full

Symptoms: writes fail, app errors. Likely causes: logs/temp growth, volume undersized, no rotation. Checks: df -h, du -sh. Fix: clean/rotate; grow the block volume online then extend the filesystem (growpart/resize2fs/xfs_growfs). Prevention: filesystem-used alarm at 85%, log rotation, lifecycle to Object Storage.
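
The 85% threshold can also be checked from a host-side script (for example, from cron or a custom-metric publisher). This is a minimal standard-library sketch with hypothetical function names; note that `df` may report a slightly different percentage because it accounts for blocks reserved for root.

```python
import shutil


def filesystem_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path="/", threshold=85.0):
    """True when usage is at or above the alarm threshold."""
    return filesystem_used_percent(path) >= threshold


print(f"/ is {filesystem_used_percent('/'):.1f}% used; alarm={over_threshold('/')}")
```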

⚑ Block volume attachment issue

Symptoms: attached volume not visible in OS. Likely causes: iSCSI login steps not run (iSCSI attach), device not mounted, wrong attach type. Checks: Console shows Attached; run the iSCSI commands from the Console's attach details; lsblk. Fix: run the iSCSI iscsiadm commands or use paravirtualized attach; mount and add to fstab (use UUID). Prevention: prefer paravirtualized attach; automate mount via cloud-init.

Storage

⚑ Object Storage access denied

Symptoms: 403/404 on bucket/object. Likely causes: missing IAM policy for the user/dynamic group, wrong compartment, condition (bucket name) unmet, using API key where principal expected, expired PAR. Checks: policy grants read/manage objects in the bucket's compartment; dynamic-group membership; Audit for the denied call. Fix: add least-privilege policy; use instance principal. Prevention: standard bucket-access policy per workload; avoid PAR sprawl.

oci os object list --bucket-name <name> --auth instance_principal
⚑ File Storage (FSS) mount issue

Symptoms: NFS mount hangs or permission denied. Likely causes: NFS ports (111/2048-2050) blocked in NSG/security list, export options exclude the client CIDR, root squash, wrong mount-target IP/path. Checks: security rules for NFS between client subnet and mount target; export options; showmount/mount -v. Fix: open NFS ports, add client CIDR to export options, correct mount path. Prevention: FSS module with standard ports + export options.

Network

⚑ Service Gateway not working

Symptoms: private instance can't reach Object Storage/ADB privately. Likely causes: no Service Gateway, missing route to it for the OSN CIDR label, wrong service label, egress rule missing. Checks: VCN > Service Gateway exists + correct label; subnet route has a rule (target = SGW, dest = services CIDR label). Fix: create SGW, add route + egress rule. Prevention: bake SGW into the private-subnet Terraform module.

⚑ NAT Gateway not working

Symptoms: private instance can't reach the internet (patch repos, external APIs). Likely causes: no 0.0.0.0/0 route to NAT, egress rule missing, OS firewall. Checks: route table; security egress; curl test. Fix: add NAT route + egress. Prevention: standard subnet module; keep OCI-service traffic on SGW, not NAT.

⚑ VPN tunnel down

Symptoms: on-prem/OCI connectivity lost, tunnel state != UP. Likely causes: Phase 1/2 mismatch (encryption, PSK, lifetimes), CPE public IP change, BGP/static route misconfig, on-prem firewall. Checks: Site-to-Site VPN > tunnel status + logs; compare IKE params both ends. Fix: align IKE/IPSec parameters, correct routes/BGP. Prevention: redundant tunnels; FastConnect primary; alarm on tunnel state.

⚑ FastConnect issue

Symptoms: primary link degraded/down, BGP session down. Likely causes: physical/optical fault, BGP config, provider issue, MTU/MACsec. Checks: FastConnect state + light levels; BGP session; provider status. Fix: engage provider/DC; fail over to VPN backup. Prevention: redundant FastConnect + VPN backup with BGP failover; alarms.

⚑ DNS issue

Symptoms: names don't resolve. Likely causes: DHCP options point at wrong resolver, no hybrid forwarding, missing private-zone record/view. Checks: nslookup name 169.254.169.254; /etc/resolv.conf. Fix: fix DHCP options, add resolver forwarders both ways, add records to the private zone. Prevention: standard DNS/resolver design per VCN.

Load balancer & certificates

⚑ Load balancer backend unhealthy

Symptoms: backends Critical, 502/503. Likely causes (order): backend NSG doesn't allow the LB health-check source on the port; wrong health-check port/path/protocol; app not listening or bound to localhost; OS firewall; SSL mismatch. Checks: curl the backend health URL from the LB subnet; Path Analyzer LB→backend; ss -tlnp. Fix: allow the probe source; align health-check; bind to 0.0.0.0. Prevention: template LB + NSG together. (Full runbook in section 7.)

⚑ SSL certificate issue

Symptoms: TLS errors, browser warnings, handshake failures. Likely causes: expired cert, incomplete chain, SNI/hostname mismatch, E2E backend cert invalid. Checks: openssl s_client -connect host:443 -servername host; cert dates/chain. Fix: renew/replace via Certificates service; include full chain; match hostname. Prevention: managed certs with rotation + expiry alarms.
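
For expiry alarms, the remaining lifetime can be computed from the certificate's `notAfter` field. The sketch below assumes the date string is in the format Python's `ssl` module returns from `getpeercert()` (for example `Jan  5 09:34:43 2018 GMT`); the helper name is hypothetical, and fetching the certificate from the endpoint is left out.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    `now` is a Unix timestamp; it defaults to the current time and is
    injectable so the calculation can be tested deterministically.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


# Example with a fixed "now" exactly 10 days before expiry
not_after = "Jan  5 09:34:43 2018 GMT"
fixed_now = ssl.cert_time_to_seconds(not_after) - 10 * 86400
print(days_until_expiry(not_after, now=fixed_now))
```

A negative result means the certificate has already expired; alerting at 30 days or more leaves time for a renewal that needs change approval.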

Database

⚑ Database backup failed

Symptoms: backup job error / no recent backup. Likely causes: Object Storage access (Service Gateway/policy), space, wallet/credential expiry, RMAN config (IaaS), retention conflict. Checks: backup job logs; Object Storage reachability; on IaaS RMAN LIST BACKUP. Fix: restore connectivity/credentials; correct RMAN/backup config. Prevention: alarm on backup success absence; periodic restore tests.

⚑ Database performance issue

Symptoms: slow queries, high waits. Likely causes: bad plans, missing indexes, I/O ceiling (block volume/storage), CPU saturation, contention. Checks: Performance Hub / AWR / ASH; wait classes; volume IOPS vs. tier. Fix: tune SQL/indexes, raise storage performance, scale OCPU, resource management. Prevention: Ops Insights capacity planning; auto-indexing (Autonomous); baseline plans.

⚑ Autonomous Database connection issue

Symptoms: app can't connect to ADB. Likely causes: wallet expired/wrong, mTLS vs TLS mismatch, private endpoint NSG rules, ACL blocks client IP (public), TNS alias wrong. Checks: wallet validity; ADB network config (private endpoint vs. public + ACL); NSG for the PE. Fix: refresh wallet from Vault, fix ACL/NSG, correct connection string. Prevention: store wallet in Vault; use private endpoints; automate wallet rotation.

IAM

⚑ IAM policy issue (authorization failed)

Symptoms: 404/authz error though resource exists. Likely causes: no policy grants it, wrong compartment, weak verb/wrong family, unmet condition, wrong domain qualifier. Checks: group membership + policies; compartment path; Audit for the denied request. Fix: add least-privilege statement at the correct compartment. Prevention: policies in Terraform; access matrix per compartment. (Full runbook in section 2.)

⚑ Dynamic group issue (workload can't call OCI)

Symptoms: instance/function gets authz errors using a principal. Likely causes: instance not matched by the DG rule, no policy grants the DG, wrong matching attribute/compartment, using API key not principal. Checks: DG matching rule vs. instance OCID/compartment; policy targets dynamic-group; --auth instance_principal. Fix: correct the rule, add the DG policy. Prevention: standard DG + policy per workload type.

Containers & automation

⚑ OKE pod not starting

Symptoms: Pending / ImagePullBackOff / CrashLoopBackOff. Likely causes: no capacity or pod-IP exhaustion; OCIR pull permission missing; app config/secret missing; bad probes; workload identity not set. Checks: kubectl describe pod, kubectl logs --previous, node capacity, pod-subnet free IPs. Fix: scale pool/fix subnet size; grant OCIR pull; fix config/probes/identity. Prevention: capacity headroom, correct CIDR sizing, image-scan gates. (Section 10.)

⚑ Function not triggering

Symptoms: event doesn't invoke the function. Likely causes: Event rule filter mismatch/disabled; invoke policy missing; function IAM (resource principal) missing; timeout/concurrency; cold start mistaken for failure. Checks: Event rule condition; function logs + invocation metrics; policies. Fix: correct the rule + invoke policy; verify resource-principal permissions. Prevention: test invocation in the deploy pipeline; logging on.

Observability

⚑ Alarm not firing

Symptoms: expected alert never arrives. Likely causes: wrong metric/namespace/dimension in the query, threshold/window never met, alarm disabled, Notifications topic has no confirmed subscription, suppression active. Checks: run the MQL in Metrics Explorer; alarm state history; topic subscription confirmed. Fix: correct the query/threshold; confirm subscription. Prevention: test alarms by forcing a condition; monitor the monitors (absence alarms).
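"Monitor the monitors" means alerting when data stops arriving, not only when it crosses a threshold. The sketch below shows the idea offline: is_absent is a hypothetical helper that reports whether any datapoint landed in the most recent window. In OCI itself the managed equivalent is an absence alarm built with MQL's absent() operator; verify the exact query syntax in the current Monitoring docs.

```python
from datetime import datetime, timedelta


def is_absent(datapoint_times, now, window):
    """True if no datapoint falls inside the interval (now - window, now]."""
    cutoff = now - window
    return not any(cutoff < t <= now for t in datapoint_times)


now = datetime(2026, 7, 1, 12, 0)
recent = [now - timedelta(minutes=2)]
stale = [now - timedelta(minutes=30)]

print(is_absent(recent, now, timedelta(minutes=5)))  # False: data is flowing
print(is_absent(stale, now, timedelta(minutes=5)))   # True: the metric went silent
```

A threshold alarm on a metric that has gone silent never fires, which is why the absence check has to be a separate alarm.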

⚑ Logging missing

Symptoms: expected logs absent when investigating. Likely causes: service log (LB/VCN flow/WAF) not enabled, agent not installed/configured, Service Connector not wired, wrong compartment/log group, retention expired. Checks: Logging > Logs for the resource; agent status; connector state. Fix: enable the log, install/config agent, wire the connector. Prevention: enable service logs tenancy-wide from the start and export Audit (which is always on) for long-term retention; centralize via Service Connector Hub. (Section 9.)

17. OCI CLI and Terraform Examples

Practical, copy-friendly automation: CLI setup and common commands, the Terraform provider, Resource Manager, and clean examples for VCN, compute, buckets, IAM, alarms, and tags - plus state and structure practices.

Last reviewed: July 2026 Verify provider version, resource argument names, and CLI syntax against current docs.
TL;DR

The CLI (Python) authenticates with an API key from a ~/.oci/config profile or, on OCI hosts, with an instance principal. Terraform (Oracle-maintained provider) is the way to build production infrastructure; run it locally, in a pipeline, or in Resource Manager (managed state + plan/apply + drift). Keep state remote and locked, structure code into reusable modules, and separate environments by workspace/backend + tfvars - never by copy-paste.

OCI CLI setup, profiles, authentication

# Install (Linux/macOS)
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"

# Interactive setup: creates ~/.oci/config + API signing keypair
oci setup config

# ~/.oci/config with two profiles
[DEFAULT]
user=ocid1.user.oc1..aaaa...
fingerprint=aa:bb:cc:...
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-ashburn-1

[PROD]
user=ocid1.user.oc1..bbbb...
fingerprint=dd:ee:ff:...
key_file=~/.oci/prod_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-phoenix-1

# Use a profile; or use instance principal (no keys on disk)
oci os ns get --profile PROD
oci compute instance list -c <compartment> --auth instance_principal
Security note
On any host running in OCI, prefer --auth instance_principal over an API key in ~/.oci/config. In Cloud Shell you are already authenticated as your Console identity - no key needed. Reserve API keys for off-cloud automation and store the PEM securely (never in git).
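The profile file is plain INI, so it can be inspected with Python's standard configparser before blaming the CLI. The sketch below parses an in-memory copy of a two-profile config (placeholder OCIDs, and tenancy deliberately omitted from PROD) and returns the effective settings for one profile. load_profile is a hypothetical helper. configparser fills missing keys from [DEFAULT], which matches how Oracle's documentation describes profile inheritance, but confirm that behavior against the current SDK/CLI configuration docs.

```python
import configparser

SAMPLE = """\
[DEFAULT]
user=ocid1.user.oc1..exampleuser
fingerprint=aa:bb:cc
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..exampletenancy
region=us-ashburn-1

[PROD]
user=ocid1.user.oc1..exampleprod
fingerprint=dd:ee:ff
key_file=~/.oci/prod_key.pem
region=us-phoenix-1
"""


def load_profile(text: str, profile: str) -> dict:
    """Return the effective key/value settings for one profile."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if profile != "DEFAULT" and not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    # Section lookups fall back to [DEFAULT] for keys the profile omits.
    return dict(parser[profile])


prod = load_profile(SAMPLE, "PROD")
print(prod["region"])   # us-phoenix-1
print(prod["tenancy"])  # ocid1.tenancy.oc1..exampletenancy (inherited from DEFAULT)
```

A quick check like this catches the common mistakes: a misspelled profile name, a key_file path that points at the wrong PEM, or a region inherited from DEFAULT by accident.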

Common CLI commands

# Identity / discovery
oci iam compartment list --all
oci iam region-subscription list
oci os ns get                                  # Object Storage namespace

# Compute
oci compute instance list -c <compartment-ocid> --output table
oci compute instance action --instance-id <ocid> --action STOP
oci compute image list -c <compartment-ocid> --operating-system "Oracle Linux"

# Networking
oci network vcn list -c <compartment-ocid>
oci network security-list get --security-list-id <ocid>

# Object Storage
oci os object bulk-upload -bn <bucket> --src-dir ./data --auth instance_principal
oci os object list -bn <bucket>

# Database
oci db system list -c <compartment-ocid>
oci db autonomous-database list -c <compartment-ocid>

# Query + filter with JMESPath
oci compute instance list -c <ocid> \
  --query "data[?\"lifecycle-state\"=='RUNNING'].{name:\"display-name\",ocid:id}" --output table
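If the JMESPath expression is hard to read, the same filter and projection can be written out in plain Python against the CLI's JSON output. The sketch below uses a hand-written sample response with placeholder OCIDs; running_instances is a hypothetical helper that mirrors what the --query expression above selects.

```python
import json

# Trimmed, hand-written stand-in for `oci compute instance list` JSON output.
SAMPLE_RESPONSE = json.loads("""
{"data": [
  {"display-name": "app-01", "id": "ocid1.instance.oc1..example1", "lifecycle-state": "RUNNING"},
  {"display-name": "app-02", "id": "ocid1.instance.oc1..example2", "lifecycle-state": "STOPPED"}
]}
""")


def running_instances(response: dict) -> list:
    """Keep RUNNING instances and project them to name/ocid pairs."""
    return [
        {"name": item["display-name"], "ocid": item["id"]}
        for item in response["data"]
        if item.get("lifecycle-state") == "RUNNING"
    ]


print(running_instances(SAMPLE_RESPONSE))
# [{'name': 'app-01', 'ocid': 'ocid1.instance.oc1..example1'}]
```

The hyphenated keys are the reason the JMESPath version needs escaped double quotes around "lifecycle-state" and "display-name".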

Terraform provider setup

# provider.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    oci = { source = "oracle/oci", version = "~> 6.0" }   # verify current major
  }
}

# Auth via config file profile (local runs)
provider "oci" {
  config_file_profile = "PROD"
  region              = var.region
}

# In Resource Manager / instance-principal runs, use:
# provider "oci" {
#   auth   = "InstancePrincipal"
#   region = var.region
# }
Resource Manager
OCI Resource Manager is managed Terraform: it stores state for you, runs plan/apply as stacks, supports drift detection and private templates, and authenticates via the Console identity/resource principal - no state backend to host and no local keys. Great default for teams who don't want to run their own Terraform pipeline.

Create a VCN with a private subnet and gateways

resource "oci_core_vcn" "main" {
  compartment_id = var.compartment_ocid
  cidr_blocks    = ["10.10.0.0/20"]
  display_name   = "app-vcn"
  dns_label      = "appvcn"
}

resource "oci_core_nat_gateway" "nat" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "app-nat"
}

resource "oci_core_service_gateway" "sgw" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  services { service_id = data.oci_core_services.all.services[0].id }
  display_name   = "app-sgw"
}

resource "oci_core_route_table" "private_rt" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "private-rt"
  route_rules {
    destination       = "0.0.0.0/0"
    network_entity_id = oci_core_nat_gateway.nat.id
  }
  route_rules {
    destination_type  = "SERVICE_CIDR_BLOCK"
    destination       = data.oci_core_services.all.services[0].cidr_block
    network_entity_id = oci_core_service_gateway.sgw.id
  }
}

resource "oci_core_subnet" "app" {
  compartment_id             = var.compartment_ocid
  vcn_id                     = oci_core_vcn.main.id
  cidr_block                 = "10.10.2.0/24"
  display_name               = "app-private"
  route_table_id             = oci_core_route_table.private_rt.id
  prohibit_public_ip_on_vnic = true          # private subnet
  dns_label                  = "app"
}

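# Select the "All <region> Services In Oracle Services Network" entry explicitly;
# without a filter, services[0] depends on list order and may be the wrong service.
data "oci_core_services" "all" {
  filter {
    name   = "name"
    values = ["All .* Services In Oracle Services Network"]
    regex  = true
  }
}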
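Before applying, it is worth checking that every subnet CIDR sits inside a VCN CIDR and that subnets do not overlap each other; a failed apply is a slow way to find a typo. A minimal sketch using Python's standard ipaddress module follows, with the CIDRs from the example above. subnet_fits and overlaps are hypothetical helper names.

```python
import ipaddress


def subnet_fits(vcn_cidrs, subnet_cidr: str) -> bool:
    """True if the subnet lies entirely inside one of the VCN CIDR blocks."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return any(subnet.subnet_of(ipaddress.ip_network(c)) for c in vcn_cidrs)


def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """True if the two CIDR blocks share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(subnet_fits(["10.10.0.0/20"], "10.10.2.0/24"))   # True
print(subnet_fits(["10.10.0.0/20"], "10.10.16.0/24"))  # False: /20 ends at 10.10.15.255
print(overlaps("10.10.2.0/24", "10.10.2.128/25"))      # True
```

The same check is useful across VCNs and on-premises ranges before peering or attaching a DRG, where overlapping CIDRs cannot be routed.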

Create a compute instance

resource "oci_core_instance" "app" {
  compartment_id      = var.compartment_ocid
  availability_domain = var.ad
  display_name        = "app-01"
  shape               = "VM.Standard.E5.Flex"
  shape_config {
    ocpus         = 2
    memory_in_gbs = 32
  }

  create_vnic_details {
    subnet_id        = oci_core_subnet.app.id
    assign_public_ip = false
    nsg_ids          = [oci_core_network_security_group.app.id]   # NSG assumed to be defined elsewhere
  }
  source_details {
    source_type = "image"
    source_id   = var.image_ocid
  }
  metadata = {
    ssh_authorized_keys = file("~/.ssh/id_rsa.pub")
    user_data           = base64encode(file("cloud-init.yaml"))
  }
}

Create an Object Storage bucket

data "oci_objectstorage_namespace" "ns" { compartment_id = var.tenancy_ocid }

resource "oci_objectstorage_bucket" "backups" {
  compartment_id = var.compartment_ocid
  namespace      = data.oci_objectstorage_namespace.ns.namespace
  name           = "db-backups"
  access_type    = "NoPublicAccess"
  versioning     = "Enabled"
  # kms_key_id = oci_kms_key.data.id   # customer-managed key
}

Create a dynamic group and policy

resource "oci_identity_dynamic_group" "app_servers" {
  compartment_id = var.tenancy_ocid
  name           = "app-servers"
  description    = "App instances in the app compartment"
  matching_rule  = "ALL {instance.compartment.id = '${var.compartment_ocid}'}"
}

resource "oci_identity_policy" "app_bucket_read" {
  compartment_id = var.compartment_ocid
  name           = "app-bucket-read"
  description    = "App servers read backups bucket"
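  statements = [
    # Referencing the resource (not a literal name) makes Terraform create the dynamic group first.
    "Allow dynamic-group ${oci_identity_dynamic_group.app_servers.name} to read objects in compartment id ${var.compartment_ocid} where target.bucket.name = 'db-backups'"
  ]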
}

Create a monitoring alarm

resource "oci_ons_notification_topic" "ops" {
  compartment_id = var.compartment_ocid
  name           = "ops-alerts"
}

resource "oci_monitoring_alarm" "cpu_high" {
  compartment_id        = var.compartment_ocid
  display_name          = "app-cpu-high"
  metric_compartment_id = var.compartment_ocid
  namespace             = "oci_computeagent"
  query                 = "CpuUtilization[5m].mean() > 85"
  severity              = "WARNING"
  destinations          = [oci_ons_notification_topic.ops.id]
  is_enabled            = true
  body                  = "App CPU above 85% for 5 minutes."
  pending_duration      = "PT5M"
}
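The interaction between the 5-minute mean and pending_duration explains many "alarm not firing" reports: a short spike raises the windowed mean only briefly, and the condition must then hold for the whole pending duration. The sketch below is a simplified model, not the Monitoring service's actual evaluation algorithm. It assumes one sample per minute and one evaluation per minute, and first_firing_minute is a hypothetical helper.

```python
from statistics import mean


def first_firing_minute(samples, window=5, threshold=85.0, pending=5):
    """Return the sample index at which the alarm would fire, or None.

    samples: one CPU reading per minute.
    window:  minutes averaged per evaluation (the [5m] in the query).
    pending: consecutive breaching evaluations required (PT5M -> 5).
    """
    consecutive = 0
    for minute in range(window - 1, len(samples)):
        window_mean = mean(samples[minute - window + 1 : minute + 1])
        consecutive = consecutive + 1 if window_mean > threshold else 0
        if consecutive >= pending:
            return minute
    return None


sustained = [90] * 10
spike = [50] * 5 + [100] * 3 + [50] * 5

print(first_firing_minute(sustained))  # 8
print(first_firing_minute(spike))      # None: the 5-minute mean peaks at 80
```

Three minutes at 100% CPU never fires this alarm. If short saturation matters, use a shorter window, max() instead of mean(), or a shorter pending duration.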

Tag namespace, tag, and tag default

resource "oci_identity_tag_namespace" "finance" {
  compartment_id = var.tenancy_ocid
  name           = "Finance"
  description    = "Cost attribution tags"
}

resource "oci_identity_tag" "cost_center" {
  tag_namespace_id = oci_identity_tag_namespace.finance.id
  name             = "CostCenter"
  description      = "Charge-back cost center"
  validator {
    validator_type = "ENUM"
    values         = ["CC-4412", "CC-5501", "CC-7788"]
  }
}

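# Environment tag definition referenced by the tag default below
resource "oci_identity_tag" "environment" {
  tag_namespace_id = oci_identity_tag_namespace.finance.id
  name             = "Environment"
  description      = "Deployment environment"
}

# Auto-apply Environment tag to every new resource in a compartment
resource "oci_identity_tag_default" "env_default" {
  compartment_id    = var.compartment_ocid
  tag_definition_id = oci_identity_tag.environment.id
  value             = "prod"
}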

State management and structure

  • Remote, locked state: use Resource Manager (state managed for you) or an Object Storage/other backend with locking. Never keep prod state only on a laptop; never commit state (it holds secrets).
  • Modular structure: reusable modules (network, compute, db, iam, monitoring) composed per environment - not copy-pasted stacks.
  • Environment separation: separate state per env (workspaces or separate backends/stacks) driven by dev.tfvars / prod.tfvars; separate compartments and, ideally, separate credentials/pipelines.
  • Drift detection: Resource Manager (or scheduled plan) to catch manual Console changes.
  • No secrets in code: reference Vault secrets/keys by OCID; keep tfvars with secrets out of git.
oci-infra/
  modules/
    network/   compute/   database/   iam/   monitoring/
  envs/
    dev/    main.tf  dev.tfvars   backend.tf
    prod/   main.tf  prod.tfvars  backend.tf
  README.md
Architect note
The goal is that Prod and DR are provably the same because they come from the same modules with different variables. Manual Console changes in prod are the enemy of a working DR; enforce "infrastructure changes go through Terraform" and use drift detection to catch violations.
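One way to keep environments from drifting apart is to make the per-environment command line mechanical, so dev and prod differ only by directory and tfvars file. The sketch below builds those commands for the oci-infra/envs layout shown above. terraform_command and the ENVS tuple are hypothetical; -chdir and -var-file are standard Terraform CLI options, and the tfvars path is given relative to the environment directory because -chdir switches the working directory first.

```python
from pathlib import PurePosixPath

ENVS = ("dev", "prod")
ACTIONS = ("init", "plan", "apply")


def terraform_command(env: str, action: str, root: str = "oci-infra") -> list:
    """Build the argument list for one Terraform action in one environment."""
    if env not in ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    env_dir = PurePosixPath(root) / "envs" / env
    cmd = ["terraform", f"-chdir={env_dir}", action]
    if action in ("plan", "apply"):
        # Each environment reads only its own variables file.
        cmd.append(f"-var-file={env}.tfvars")
    return cmd


print(" ".join(terraform_command("dev", "plan")))
# terraform -chdir=oci-infra/envs/dev plan -var-file=dev.tfvars
```

In a pipeline, the returned list can be passed to subprocess.run unchanged; keeping it as a list avoids shell quoting problems.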

18. Learning Path

A structured route from OCI fundamentals to enterprise-grade architecture and operations, aimed at people coming from traditional Oracle/infrastructure backgrounds. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.

Last reviewed: July 2026 Certification names/exam codes change - verify on Oracle University before scheduling.
Beginner - Foundations: identity, network, compute, storage
Intermediate - LB, private networking, DB, monitoring, security, cost
Advanced - Exadata, DR, FastConnect, OKE, Terraform, landing zone, GenAI
How to use this
Do the labs, don't just read. Use an Always Free tenancy or a trial for hands-on where possible. Map each level to the deep-dive sections above - the learning path is the syllabus, the sections are the textbook. Certifications (OCI Foundations → Architect Associate → Architect Professional, plus specialty tracks) are useful checkpoints, but capability comes from building.

Beginner

Level 1 - Foundations
Goal: deploy and connect basic OCI resources confidently

What to learn

  • OCI fundamentals: regions, ADs, fault domains, realms, home region (section 1).
  • Tenancy and compartments; how to structure them.
  • IAM basics: users, groups, policies, the verb/resource-type model (section 2).
  • VCN basics: subnets, route tables, security lists/NSGs, gateways (section 3).
  • Compute basics: shapes, images, SSH, cloud-init (section 4).
  • Storage basics: block, object, file - and when to use each (section 5).

Why it matters

Every OCI design rests on these. Get the tenancy/compartment/network mental model right now and everything later is easier; get it wrong and you rebuild.

Hands-on labs

  • Create a compartment, a group, and a least-privilege policy; add yourself and test access.
  • Build a VCN with a public and a private subnet, IGW, NAT, and Service Gateway.
  • Launch a public bastion and a private instance; SSH to the private one through the bastion.
  • Attach and mount a block volume; create a bucket and upload an object; create an FSS share.

Common mistakes

Everything in the root compartment; public subnet used for everything; forgetting the Service Gateway; no fault-domain awareness.

Expected outcome

You can stand up a properly segmented VCN with public/private tiers, reach a private host securely, and use all three storage types - and explain the shared responsibility model.

Intermediate

Level 2 - Building real workloads
Goal: deploy an HA app + database with monitoring, security, and cost control

What to learn

  • Load balancers: L7 vs NLB, listeners/backend sets/health checks, SSL (section 7).
  • Private networking depth: NSGs by tier, DNS/DHCP, flow logs, Path Analyzer (section 3).
  • Hybrid connectivity: Site-to-Site VPN and DRG (sections 3/13).
  • Database services: Base Database and Autonomous - provisioning, backups, Data Guard basics (section 6).
  • Monitoring & logging: metrics, alarms, notifications, Service Connector Hub (section 9).
  • Security services: Vault, Cloud Guard, Bastion, Security Zones, WAF (section 8).
  • Cost management: Cost Analysis, Budgets, Quotas, tagging (section 14).

Why it matters

This is the day job: HA application tiers, managed databases, and the operational and security controls that make them production-worthy.

Hands-on labs

  • Deploy a 3-tier app: public LB + WAF → instance pool (multi-FD) → Autonomous/Base DB (private).
  • Wire NSGs so only app→db on the DB port is allowed; verify with Path Analyzer.
  • Set up alarms (CPU, unhealthy backend, DB storage) to a Notifications topic; force one to fire.
  • Store the DB wallet/secret in Vault; give the app instance-principal access.
  • Add a budget + quota to the compartment; tag all resources with CostCenter/Environment.
  • Set up a VPN/DRG to a simulated on-prem network.

Common mistakes

LB health-check NSG rule missing; DB in a public subnet; secrets in cloud-init instead of Vault; noisy alarms; no tags so cost can't be attributed.

Expected outcome

You can deploy a secure, monitored, HA application and database, connect it to on-prem, and keep its cost and access under control.

Advanced

Level 3 - Enterprise architecture & operations
Goal: design governed, multi-region, automated enterprise platforms

What to learn

  • Exadata Database Service (formerly Exadata Cloud Service) and Autonomous deep dive: consolidation, RAC, PDBs, patching at scale (section 6).
  • DR design: Data Guard/Active Data Guard, Full Stack DR, RTO/RPO, pilot-light to active/active (section 13).
  • FastConnect and advanced hybrid: redundant links, BGP, Network Firewall, hub-and-spoke (sections 3/13).
  • OKE and cloud native: node pools/virtual nodes, workload identity, DevOps pipelines, service mesh (section 10).
  • Terraform + Resource Manager: modules, remote state, environment separation, drift (section 17).
  • Landing zone: CIS-aligned governance, Security Zones, centralized logging/audit (sections 8/14).
  • Enterprise security: customer-managed keys, Data Safe, Database Vault, break-glass, least privilege at scale (section 8).
  • GenAI & AI Vector Search: governed RAG over enterprise data, the serving-layer pattern (section 12).
  • Large-enterprise architecture: multi-BU tenancy, multi-region, chargeback, standardization (sections 1/15).

Why it matters

At this level you are responsible for governance, resilience, automation, and cost across many teams and workloads - decisions that are expensive to reverse.

Hands-on labs

  • Deploy a CIS-aligned landing zone via Terraform/Resource Manager (compartments, IAM, network hub, Security Zones, logging, budgets, quotas, tag defaults).
  • Build cross-region DR for a database (Active Data Guard) and rehearse a switchover; confirm keys exist in DR.
  • Stand up an OKE platform with private cluster, workload identity, and a DevOps CI/CD pipeline from OCIR.
  • Implement a governed RAG assistant: Object Storage + DB 23ai Vector Search + Generative AI behind a serving API, with audit and entitlement-filtered retrieval.
  • Refactor a hand-built environment into reusable Terraform modules with per-env state and drift detection.

Common mistakes

Skipping the landing zone and retrofitting governance; DR never tested; over-privileged workload identity/IAM; Console drift breaking DR parity; connecting AI agents to production data without a governed serving layer.

Expected outcome

You can design and operate a governed, automated, multi-region OCI platform for a large enterprise - and defend the trade-offs on security, resilience, and cost.

Certification checkpoints (optional)

Level          Typical certification track
Beginner       OCI Foundations Associate
Intermediate   OCI Architect Associate (+ Operations/Networking specialty as relevant)
Advanced       OCI Architect Professional (+ specialty: Security, Multicloud, Data/AI, Autonomous)
Verify before scheduling
Oracle updates exam codes and content regularly (and Oracle University often offers free training/exam windows). Confirm the current track and objectives on the official Oracle University site before you prepare. Certifications validate knowledge; the labs above build the capability employers actually pay for.