Oracle Cloud Infrastructure Deep Dive Portal
A practical reference for Cloud Architects, DBAs, and Enterprise Infrastructure Teams. Built to be used while you learn, design, implement, operate, and troubleshoot real OCI environments - not a marketing overview.
This portal is written for Oracle Cloud Architects, Apps DBAs, Oracle DBAs, infrastructure engineers, cloud engineers, enterprise architects, and anyone moving from traditional on-premises Oracle environments into OCI. It assumes you already understand servers, storage, networks, and Oracle Database - and focuses on how those ideas map into OCI and what changes operationally.
How this portal is organized
Each section is a self-contained deep dive. Use the left navigation or the search box in the top bar to jump directly to a topic. Every section carries a Last reviewed date and, where content changes frequently (pricing, shape names, service limits, model availability), a Verify with current Oracle documentation flag.
Sections 1-2 establish the mental model: regions, ADs, fault domains, tenancy, compartments, and the IAM policy language that everything else depends on.
Sections 3-12 cover networking, compute, storage, database, load balancing, security, observability, containers, analytics, and AI - with diagrams, tables, and gotchas.
Sections 13-18 cover migration and DR, cost and governance, reference architecture patterns, troubleshooting runbooks, automation, and a structured learning path.
Reading the callouts
Four note types recur throughout. They flag the perspective that matters most for a given point.
The OCI shared responsibility model (orientation)
Everything in this portal sits on one idea: in cloud, responsibility is split, and the split moves depending on the service. Get this wrong and you will either leave gaps (security incidents, unrecoverable data) or do work Oracle already does for you (wasted effort).
| Layer | IaaS (Compute + you install DB) | Base Database / VM DB | Exadata Cloud Service | Autonomous Database |
|---|---|---|---|---|
| Physical / hypervisor | Oracle | Oracle | Oracle | Oracle |
| OS patching | You | You (guest VM) | You (guest VM) | Oracle |
| DB software install/patch | You | Oracle tooling, you trigger | Oracle tooling, you trigger | Oracle |
| Backup config | You | Managed, you configure | Managed, you configure | Oracle (you set retention) |
| HA / RAC | You build it | Optional, you choose | Built-in RAC | Built-in |
| Schema, SQL, tuning | You | You | You | You |
| Data classification & access | You | You | You | You |
Suggested reading order
1. OCI Fundamentals
The physical and logical building blocks of Oracle Cloud Infrastructure, and the tenancy and compartment structure that every enterprise deployment stands or falls on.
OCI is a set of regions, each built from isolated Availability Domains (data centers) that are further split into Fault Domains (racks). Your account is a tenancy; you organize resources into compartments (logical, not network) and control access with IAM policies written against those compartments. Compartment and tenancy structure is the single most important thing to get right before production - it is painful to restructure later.
What OCI is
Oracle Cloud Infrastructure is Oracle's public cloud: on-demand compute, storage, networking, database, and platform services delivered from Oracle-operated data centers, consumed over the network, and billed by usage. Compared to Oracle's first-generation cloud, OCI ("Gen 2") was rebuilt with an off-box network virtualization design - the virtualization and network isolation run on separate hardware from the customer's compute, which is the basis for its bare-metal offerings and its network isolation guarantees.
Practically, OCI gives you four things that matter to an enterprise Oracle shop:
- Real bare metal - you can rent an entire physical server with no hypervisor, which matters for licensing and for the highest-performance database workloads.
- Exadata as a cloud service - the same engineered system you may run on-premises, delivered as Exadata Database Service or Exadata Cloud@Customer.
- Autonomous Database - a self-managing database platform where Oracle runs patching, tuning, backup, and scaling.
- A flat, predictable network - a non-blocking, low-latency backbone with off-instance network virtualization.
OCI global architecture
Regions, Availability Domains, Fault Domains
| Concept | What it is | Failure it protects against | What you do with it |
|---|---|---|---|
| Region | A localized geographic area containing one or more Availability Domains. Your data residency boundary. | Regional disaster, large-scale outage | Choose based on latency to users, data residency law, and service availability. Deploy DR to a second region. |
| Availability Domain (AD) | One or more isolated data centers within a region, with independent power, cooling, and network. | Data-center-level failure | Spread instances / DB nodes across ADs (in multi-AD regions) for HA. Many regions have only 1 AD. |
| Fault Domain (FD) | A grouping of hardware within an AD (think: a rack). Every AD has exactly 3 FDs. | Rack / hardware / maintenance failure within an AD | Anti-affinity: place HA pairs in different FDs. This is your only in-region HA lever in a single-AD region. |
| Realm | A hard isolation boundary (commercial OC1, US Gov, UK Gov, dedicated). Identities and tenancies never cross realms. | Compliance / sovereignty isolation | Usually fixed by your contract; matters for regulated workloads. |
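The anti-affinity guidance in the Fault Domain row can be sketched as a simple round-robin placement. This is a minimal illustration, not an OCI API call: `assign_fault_domains` and the instance names are hypothetical, and only the three fault domain labels follow OCI's naming.

```python
from itertools import cycle

# Every AD exposes three fault domains with these labels.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def assign_fault_domains(instance_names):
    """Round-robin instances across the three FDs (hypothetical helper)."""
    return dict(zip(instance_names, cycle(FAULT_DOMAINS)))


# Hypothetical instance names; consecutive members of a pair land in different FDs.
placement = assign_fault_domains(["app-1", "app-2", "db-1", "db-2"])
for name, fd in placement.items():
    print(f"{name} -> {fd}")
```

In a real deployment the chosen label would be passed as the fault domain when launching each instance; the sketch only shows the spreading logic.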
Home region and subscriptions
When you sign up you pick a home region. IAM resources (users, groups, policies, dynamic groups, compartments, federation, and in the legacy model the tenancy's identity metadata) are mastered in the home region and replicated read-only to subscribed regions. You then subscribe the tenancy to additional regions to deploy workloads there.
- You cannot change the home region after it is set. Choose deliberately - it should be a region close to your identity administrators and one you intend to keep long-term.
- IAM writes (create a user, edit a policy) always go to the home region and propagate out. A home-region outage can therefore affect identity administration globally even while workloads keep running.
- Subscribing to a region is easy; unsubscribing is not - plan region subscriptions rather than turning them on casually.
Tenancy and compartments
Your tenancy is the root container for your entire OCI account - it is itself the root compartment. Everything lives under it.
Compartments are logical containers for resources (compute, VCNs, buckets, databases). They are the primary unit of access control, isolation, quota, and cost tracking. Key properties that trip people up coming from AWS/on-prem:
- Compartments are global, not regional. A compartment exists across all subscribed regions; the resources inside it are regional.
- Compartments are logical, not a network boundary. Two instances in different compartments can talk over the network if the VCN/subnet/security rules allow it. Isolation is by IAM policy, not by compartment walls.
- Compartments can be nested (up to six levels deep). Policies and quotas can be scoped at any level.
- Resources can be moved between compartments (most, not all), but some have caveats. Deleting a compartment requires it be empty and is asynchronous.
How to structure compartments for an enterprise Design
There is no single correct model, but the durable patterns are:
- By environment (most common): top-level compartments or a Workloads parent with Prod / Stage / Test / Dev / DR children. Clean blast-radius separation and simple policies ("group X can manage instances in Dev only").
- By business unit: a compartment per BU, each with its own environment sub-compartments. Fits organizations that charge back and delegate admin per BU.
- Shared services split out: put the network hub, Vault, logging, and security tooling in their own compartment(s) so platform teams own them and workload teams cannot alter them.
- Hybrid (recommended for large orgs): Shared-Services + Security + per-BU (each BU has Prod/NonProd) + Sandbox. Landing zone frameworks (see the Cost & Governance section) codify this.
Separating dev, test, stage, prod, and DR Design
- Separate compartments per environment give you independent IAM, quotas, budgets, and cost reporting.
- Strongly consider separate VCNs (or at least separate subnets with strict NSGs) so a mistake in Dev cannot reach Prod data.
- Some regulated shops use separate tenancies for Prod vs. non-Prod for hard isolation and separate billing. This is the strongest separation but adds identity/federation overhead and cross-tenancy networking complexity. Decide based on your risk and compliance posture.
- DR lives in a different region. Keep its compartment structure a mirror of Prod so IAM policies and automation translate cleanly.
Designing for multiple business units Design
- Give each BU a top-level compartment with delegated admin (a BU-admin group with manage rights scoped to that compartment only).
- Apply compartment quotas to cap what each BU can consume, and budgets with alerts for spend.
- Centralize the network in a Shared-Services compartment and connect BU spoke VCNs via a DRG hub, so BUs cannot each build divergent, unmanaged network topologies.
- Use defined tags (cost-center, owner, environment) enforced by tag defaults so chargeback works from day one.
Common compartment design mistakes:
- Putting everything in the root compartment "to start" - it becomes impossible to apply least privilege later.
- Modeling compartments as if they were network boundaries. They are not; network isolation is VCN/subnet/security rules.
- Over-nesting. Six-level trees look tidy but make policy debugging miserable.
- No naming standard. Inconsistent names (prod vs Production vs PRD) break automation and reporting.
- Not reserving compartments for shared services and security up front, so those resources end up scattered in workload compartments.
Limits, quotas, and service limits
| Mechanism | Set by | Scope | Purpose |
|---|---|---|---|
| Service limits | Oracle (per tenancy, per region, sometimes per AD) | Tenancy | The maximum of a resource you can create (e.g. number of OCPUs of a shape). Raise via a limit-increase request in the Console. |
| Compartment quotas | You (policy-like statements) | Compartment | Cap/allow/deny resource creation per compartment. Your governance lever - e.g. "no bare metal in Dev". |
| Budgets | You | Compartment / tag | Track and alert on spend. Do not block creation - they notify. |
Quota statements look like policy but control resource counts. Example that blocks expensive shapes in a Dev compartment:
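# Scoped to the Dev compartment (Governance > Quotas); verify quota names against current Oracle documentation
zero compute-core quota standard-e5-core-count in compartment Dev
set compute-core quota standard-e4-core-count to 64 in compartment Dev
# Deny a bare-metal database shape in this compartment (one statement per shape quota)
zero database quota bm-dense-io2-52-count in compartment Dev
Resource OCIDs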
Every OCI resource has a globally unique Oracle Cloud Identifier (OCID). You will use these constantly in CLI, Terraform, and support tickets. The format is readable:
ocid1.instance.oc1.us-ashburn-1.anuwcljr...abcd1234
  |       |     |        |               |
version  type realm   region     unique id (opaque)
- The type segment (instance, vcn, bucket, database, compartment, user, policy...) tells you what the resource is at a glance.
- Some resources are regionless (users, groups, compartments, policies) - their OCID has no region segment.
- OCIDs are stable for the life of the resource; scripts and Terraform state key off them.
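Because the segments are dot-separated, an OCID can be split for quick inspection in scripts. The sketch below assumes the five-segment layout described above; `parse_ocid` and the example identifiers are made up for illustration, and any extra segment after the region is simply folded into the unique id.

```python
from typing import NamedTuple


class Ocid(NamedTuple):
    version: str
    resource_type: str
    realm: str
    region: str  # empty string for regionless resources
    unique_id: str


def parse_ocid(ocid: str) -> Ocid:
    """Split an OCID into its segments (hypothetical helper)."""
    parts = ocid.split(".")
    if len(parts) < 5 or parts[0] != "ocid1":
        raise ValueError(f"not a recognizable OCID: {ocid!r}")
    version, resource_type, realm, region = parts[:4]
    # Everything after the region is treated as the opaque unique id.
    return Ocid(version, resource_type, realm, region, ".".join(parts[4:]))


# Made-up identifiers, not real resources.
instance = parse_ocid("ocid1.instance.oc1.us-ashburn-1.exampleuniqueid")
compartment = parse_ocid("ocid1.compartment.oc1..exampleuniqueid")
print(instance.resource_type, instance.region)
print(compartment.resource_type, repr(compartment.region))
```

The unique id stays opaque; the only safe uses of the parsed value are the type, realm, and region checks.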
Tags: freeform, defined, and namespaces
| Tag type | Structure | Governed? | Use for |
|---|---|---|---|
| Freeform tags | Simple key:value, no schema | No control - anyone with manage rights can set anything | Quick, informal labels. Avoid for anything you report or bill on. |
| Defined tags | Live in a tag namespace; keys are predefined, values can be validated/restricted | Yes - controlled by IAM policy and value lists | Cost tracking, environment, owner, data-classification. The enterprise standard. |
| Tag namespace | A container for defined tag keys (e.g. Finance namespace with CostCenter, Project) | Yes | Grouping and governing related tag keys; can be retired/reactivated. |
| Tag defaults | Auto-applied defined tags on any new resource in a compartment | Yes | Guaranteeing every resource is tagged (e.g. auto-stamp CreatedBy, Environment). |
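A tag model is only useful if it is checked. The sketch below assumes defined tags arrive as a nested mapping of namespace to key/value pairs, and reports which required keys are missing. The `REQUIRED_TAGS` policy, the `Operations` namespace, and the helper name are hypothetical.

```python
# Hypothetical tagging standard: namespace -> required keys.
REQUIRED_TAGS = {
    "Finance": ["CostCenter"],
    "Operations": ["Owner", "Environment"],
}


def missing_defined_tags(defined_tags, required=REQUIRED_TAGS):
    """Return 'Namespace.Key' for every required tag that is absent or empty."""
    missing = []
    for namespace, keys in required.items():
        present = defined_tags.get(namespace, {})
        missing.extend(f"{namespace}.{key}" for key in keys if not present.get(key))
    return missing


resource_tags = {
    "Finance": {"CostCenter": "CC-4412"},
    "Operations": {"Owner": "dba-team"},
}
print(missing_defined_tags(resource_tags))  # ['Operations.Environment']
```

A check like this fits an inventory or compliance report; tag defaults remain the mechanism that prevents the gap in the first place.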
Ways to work with OCI
- Console - the web UI. Best for learning, exploring, one-off tasks, and reading state. Not for repeatable production changes - use IaC for those.
- CLI - Python-based command line. Great for scripting, ad-hoc automation, and things not yet in Terraform. Uses config profiles (see section 17).
- SDKs - Java, Python, Go, TypeScript/JavaScript, .NET, Ruby, PL/SQL. For building applications and tooling against OCI APIs.
- Cloud Shell - browser-based terminal, pre-authenticated as your Console identity, with CLI/Terraform/kubectl pre-installed and ephemeral home storage. Ideal for quick, credential-free tasks.
- Terraform - the recommended way to build and manage infrastructure declaratively. Oracle maintains the provider. Run it locally, in a pipeline, or in OCI Resource Manager (managed Terraform with state).
- Resource Manager - OCI's managed Terraform service. It stores state, runs plan/apply, and supports stacks and drift detection without you hosting a state backend. Covered in section 17.
What to decide before production
- Home region - irreversible. Pick with identity-admin proximity and long-term intent in mind.
- Realm - commercial vs. government/dedicated - usually contractual, but confirm it matches your compliance needs.
- Tenancy strategy - single tenancy with compartments, or separate Prod/non-Prod tenancies.
- Compartment topology - environment vs. BU vs. hybrid; where shared services and security live.
- Identity domain strategy - default domain vs. multiple domains, and federation to your IdP (see section 2).
- Network CIDR plan - non-overlapping with on-premises and future regions (see section 3). This is very hard to change later.
- Tag model - defined tags + namespaces + tag defaults for cost and governance.
- Guardrails - compartment quotas, budgets, Security Zones, Cloud Guard baseline.
- DR region and RTO/RPO targets - which region, which DR pattern per tier.
2. Identity and Access Management
Who can do what, to which resources, in which compartment. IAM is where most OCI security incidents and access-denied tickets originate, so this section goes deep on the policy language, identity domains, and least-privilege design.
Access in OCI is granted only by policies. A policy is a set of human-readable statements: Allow <subject> to <verb> <resource-type> in <compartment> [where <condition>]. There is no implicit access - if no policy allows it, it is denied. Groups get people access; dynamic groups and instance/resource principals let workloads authenticate without stored keys. Write policies at the lowest compartment that works, use the least verb that works, and separate admin from operator from read-only.
The IAM model
Modern OCI IAM runs inside Identity Domains - each domain is an isolated identity and access management container with its own users, groups, applications, and security settings (its lineage is Oracle Identity Cloud Service). A tenancy has a Default domain and can have additional domains. Within a domain you have users and groups; across the tenancy you have compartments and policies that reference those groups.
The evaluation is simple to state and important to internalize: a request is allowed only if at least one policy statement permits it; otherwise it is denied. There are no "deny" statements in classic IAM policy - you control access by what you grant and to which compartment. (Deny-style controls come from other layers: Security Zones, quotas, and network rules.)
Users, groups, and dynamic groups
| Principal | What it is | How it authenticates | Use for |
|---|---|---|---|
| User | A person or a service identity in a domain | Password + MFA (console), API key, auth token | Humans and, sparingly, service accounts that must have static credentials |
| Group | A named set of users | n/a - policies target groups | All human access. Never write policies against individual users. |
| Dynamic group | A set of resources (instances, functions, OKE nodes, DB systems...) matched by rules | Instance/resource principal (no stored credentials) | Letting workloads call OCI APIs without API keys |
| Federated user | A user authenticated by an external IdP (Entra ID, Okta, AD) | SAML/OIDC SSO, mapped to a domain group | Enterprise SSO - the standard for human access at scale |
A user group contains people; a dynamic group contains resources (compute instances, functions, autonomous DBs) selected by matching rules. You put a person in a user group. You never put a person in a dynamic group - and you never put an instance in a user group. Policies for workloads must target the dynamic group.
Example dynamic group matching rule (all instances in a compartment):
ALL {instance.compartment.id = 'ocid1.compartment.oc1..aaaa...'}
Policy syntax
A policy is a named collection of statements living in a compartment (or the tenancy). Every statement follows this grammar:
Allow <subject> to <verb> <resource-type> in <location> [where <conditions>]
# subject: group Admins | dynamic-group AppServers | any-user | group DomainName/Admins
# verb: inspect | read | use | manage
# resource-type: instances | virtual-network-family | object-family | database-family | all-resources
# location: tenancy | compartment Prod | compartment Prod:AppTier
# conditions: where request.region = 'iad' (request.region takes the three-letter region key, not the full region name)
Where the policy lives matters: a policy attached to a compartment can only grant access to that compartment and its children. Tenancy-level policies can grant anywhere, which is exactly why you should minimize them.
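The statement grammar shown above can be sketched as a regular expression, which is handy for linting policy text in code review. This is a deliberately narrow illustration covering only the basic `Allow` form with unquoted names; `parse_statement` is a hypothetical helper, not part of any OCI SDK, and other statement forms are not handled.

```python
import re

# Basic form only: Allow <subject> to <verb> <resource-type> in <location> [where ...]
STATEMENT = re.compile(
    r"^Allow\s+(?P<subject>.+?)\s+to\s+(?P<verb>inspect|read|use|manage)\s+"
    r"(?P<resource_type>[\w-]+)\s+in\s+(?P<location>tenancy|compartment\s+\S+)"
    r"(?:\s+where\s+(?P<conditions>.+))?$",
    re.IGNORECASE,
)


def parse_statement(text: str) -> dict:
    """Split one policy statement into its grammar parts (hypothetical helper)."""
    # Collapse line breaks and repeated spaces so multi-line statements match.
    match = STATEMENT.match(" ".join(text.split()))
    if match is None:
        raise ValueError(f"does not match the basic Allow grammar: {text!r}")
    return match.groupdict()


parsed = parse_statement(
    """Allow dynamic-group AppServers to read objects in compartment Data
       where target.bucket.name = 'app-config'"""
)
for part, value in parsed.items():
    print(f"{part}: {value}")
```

A statement without a `where` clause parses with `conditions` set to `None`, which makes it easy to flag broad grants that carry no condition.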
Verbs and resource types
The four verbs are cumulative - each includes everything in the ones before it, plus more permissions.
| Verb | Grants | Typical use | Risk |
|---|---|---|---|
| inspect | List resources (metadata only; often hides sensitive contents) | Auditors, inventory tools | Low |
| read | inspect + get resource details / contents | Read-only operators, dashboards | Low |
| use | read + work with existing resources (start/stop, attach, update some attributes) - generally not create/delete | Operators running day-2 tasks | Medium |
| manage | use + create and delete resources; full control | Admins, automation that provisions | High |
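The cumulative verbs and the default-deny rule can be modeled in a few lines. This is a simplified sketch: real OCI evaluation maps each verb to per-resource-type permissions and also handles family types, compartment inheritance, and conditions, none of which are modeled here. The tuple layout and function names are assumptions for illustration.

```python
# Higher rank includes everything below it.
VERB_RANK = {"inspect": 0, "read": 1, "use": 2, "manage": 3}


def verb_covers(granted: str, needed: str) -> bool:
    """True if the granted verb includes the needed one (verbs are cumulative)."""
    return VERB_RANK[granted] >= VERB_RANK[needed]


def is_allowed(grants, group, needed_verb, resource_type, compartment):
    """Default deny: allow only if at least one grant matches the request."""
    return any(
        g_group == group
        and g_type == resource_type
        and g_comp == compartment
        and verb_covers(g_verb, needed_verb)
        for g_group, g_verb, g_type, g_comp in grants
    )


# One grant: (group, verb, resource-type, compartment)
grants = [("AppOperators", "use", "instances", "Prod")]
print(is_allowed(grants, "AppOperators", "read", "instances", "Prod"))    # True
print(is_allowed(grants, "AppOperators", "manage", "instances", "Prod"))  # False
print(is_allowed(grants, "Auditors", "inspect", "instances", "Prod"))     # False
```

The third call is the important one: with no matching statement there is nothing to evaluate, so the answer is deny.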
Resource types can be individual (instances, buckets, subnets) or aggregate family types that bundle related resources:
- virtual-network-family - VCNs, subnets, route tables, security lists, gateways, NSGs, DRGs.
- database-family - DB systems, databases, backups, Data Guard, etc.
- object-family - buckets and objects.
- instance-family - compute instances, images, boot/block volume attachments.
- all-resources - everything. Use with extreme care (see mistakes).
Grant the narrowest verb and scope that works: use instances in the app compartment - not manage instance-family in tenancy. Every widening you allow is a widening an attacker or a mistake can use.
Conditions in policies
Conditions add a where clause that must be true for the statement to apply. They reference request/target variables.
# Restrict admin actions to a single region
Allow group NetAdmins to manage virtual-network-family in tenancy
where request.region = 'iad'
# Only allow managing resources tagged for a given cost center
Allow group ProjectX to manage instances in compartment Workloads
where target.resource.tag.Finance.CostCenter = 'CC-4412'
# Restrict to a specific API operation via request.operation (fine-grained):
# InstanceAction covers start/stop/reset, not delete
Allow group Operators to use instances in compartment Prod
where request.operation = 'InstanceAction'
Common condition variables: request.region, request.operation, request.user.id, request.groups.id, target.resource.tag.<ns>.<key>, request.principal.type. Combine with tags for attribute-based access control.
Instance principals and resource principals
These solve the "how does my code authenticate to OCI without storing keys" problem - the single most important security improvement most teams can make.
| Mechanism | The principal is | Used by | How it works |
|---|---|---|---|
| Instance principal | A compute instance | Code running on a VM/BM instance | Instance is a member of a dynamic group; SDK/CLI obtains short-lived credentials from the instance metadata service. No API key on disk. |
| Resource principal | A managed resource (Function, Data Science notebook, Autonomous DB, etc.) | Serverless / managed services | The service injects a short-lived token the SDK uses automatically. Common in Functions and OKE workload identity. |
# Let instances in the AppServers dynamic group read objects from one bucket in the Data compartment
Allow dynamic-group AppServers to read objects in compartment Data
where target.bucket.name = 'app-artifacts'
# CLI using instance principal - no ~/.oci/config keys needed
oci os object list --bucket-name app-artifacts --auth instance_principal
Identity domains, federation, SSO, MFA
Identity domains are self-contained IAM stacks in your tenancy. Each has its own users, groups, password policy, MFA settings, sign-on policies, and federation config.
- Default domain - created with the tenancy; where the initial administrator lives.
- Additional domains - useful to separate, for example, employees from external partners, or Prod-admin identities from Dev identities, each with their own MFA/sign-on rules. Domains also come in types/licensing tiers (Free, Oracle Apps, Premium, External User) that affect available features - verify current tiers.
- Federation / SSO - connect an external IdP (Microsoft Entra ID, Okta, ADFS, Ping) via SAML or OIDC so users sign in with corporate credentials. Map IdP groups to domain groups; policies reference the domain groups.
- MFA - enforced via sign-on policies in the domain. Require MFA for all human users, especially administrators. Exempt only automated service identities that use API keys/principals, not passwords.
In policy statements, qualify groups that live outside the Default domain with the domain name (Allow group 'DomainName'/'GroupName' to ...). Decide your domain model (single default vs. multi-domain) alongside your compartment model, before production identities exist.
Credential types
| Credential | Used for | Notes / risk |
|---|---|---|
| Console password + MFA | Human Console sign-in | Federate + enforce MFA. Rotate per policy. |
| API signing key (PEM) | CLI/SDK/Terraform as a user | Long-lived. High risk if leaked. Prefer instance/resource principals where possible; rotate and scope tightly otherwise. |
| Auth token | Basic-auth style access (e.g. some Git/registry, Swift/RMAN to Object Storage) | Password-equivalent. Store in Vault, not scripts. |
| Customer secret key | S3-compatible Object Storage access (access key/secret) | For tools that speak the S3 API. Treat like AWS keys. |
| OAuth 2.0 client credentials | App-to-app / confidential apps in a domain | Managed as domain applications; scope to least privilege. |
| Database passwords / wallets | DB connectivity (e.g. ADB mTLS wallet) | Store wallets/secrets in Vault; never in application images. |
Real policy examples
Read-only auditor across the tenancy Low risk
Allow group Auditors to inspect all-resources in tenancy
Allow group Auditors to read audit-events in tenancy
Allows: listing and viewing audit trails and resource metadata everywhere; no changes, and inspect hides many secret contents. Where: tenancy (auditors legitimately need breadth). Risk: low, but still scope to inspect/read only. Safer alternative: if auditors only cover certain BUs, scope to those compartments instead of tenancy.
App operators can start/stop instances in Prod only Medium risk
Allow group AppOperators to read instance-family in compartment Prod
Allow group AppOperators to use instances in compartment Prod
where request.operation = 'InstanceAction'
Allows: viewing instances and starting/stopping/resetting them, but not creating or terminating. Where: the Prod compartment (not tenancy). Risk: medium - stop can cause outage; scoping to InstanceAction prevents delete. Safer alternative: further restrict with a tag condition so operators only touch app-tier instances, not databases.
Workload reads a bucket via instance principal Low risk
Allow dynamic-group AppServers to read objects in compartment Data
where target.bucket.name = 'app-config'
Allows: only the matched instances, only read, only that one bucket. Where: the Data compartment. Risk: low - tightly scoped and no stored keys. This is the pattern to imitate: dynamic group + least verb + resource condition.
Delegated compartment admin for a business unit Higher risk
Allow group BU_A_Admins to manage all-resources in compartment BU-A
Allows: full control, but only inside the BU-A compartment subtree - not the rest of the tenancy. Where: the BU-A compartment. Risk: higher (manage all-resources), but blast radius is contained to one compartment. Safer alternative: split into role-specific groups (network admin, DB admin, compute admin) even within the BU so no single group holds everything. Pair with quotas and a Security Zone.
Common OCI IAM mistakes
- manage all-resources too broadly - especially at tenancy level. This is effectively tenancy admin. Scope to a compartment and split by role.
- Tenancy-level policies by default - a policy in the tenancy grants everywhere. Put policies in the lowest compartment that works.
- Confusing dynamic groups with user groups - workloads authenticate via dynamic groups; people via user groups. Mixing them either fails or over-grants.
- No separation of admin / operator / read-only - one "everyone" group with manage rights removes least privilege and makes audit meaningless.
- Not planning identity domains - retrofitting a multi-domain model after federation and policies exist is painful.
- Storing API keys instead of using instance/resource principals - long-lived keys leak; principals are short-lived and scoped.
- Policies against individual users - unmanageable and invisible in group-based reviews. Always target groups.
- Leaving the default tenancy administrator group over-used - reserve it for break-glass; do daily work as scoped roles.
IAM troubleshooting
Symptoms
A user or workload gets a 404 NotAuthorizedOrNotFound (or similar authorization) error even though the resource exists.
Likely causes
- No policy grants the action (default deny).
- Policy is attached to the wrong compartment (a policy covers only the compartment it is attached to and that compartment's descendants; it cannot grant on a sibling or a parent).
- Verb too weak (read where use/manage is needed) or wrong resource-type/family.
- For workloads: the instance is not in the dynamic group, or the matching rule does not match, or the dynamic-group policy is missing.
- A where condition (region, tag, operation) is not satisfied.
- Wrong identity domain - group referenced without the domain qualifier.
Checks to perform
- Confirm the user's group membership and the group's policies (Identity > Domains > Groups; Identity > Policies).
- Trace the compartment path of the target resource and confirm a policy covers that compartment or an ancestor.
- For dynamic groups: verify the instance OCID matches the rule (Identity > Domains > Dynamic groups) and that a policy grants the dynamic group.
- Check conditions: is the request in the allowed region? Does the target carry the required tag?
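The compartment-path check in the list above reduces to a prefix comparison. The sketch below assumes compartment paths are written from the root with colons (the same notation as `compartment Prod:AppTier`), with an empty string standing for the tenancy. The `policy_covers` helper and the `Workloads` tree are hypothetical.

```python
def policy_covers(policy_compartment: str, target_compartment: str) -> bool:
    """True if the policy's compartment is the target or one of its ancestors.

    Paths are colon-separated from the root, e.g. 'Workloads:Prod:AppTier'.
    An empty policy path means the policy is at the tenancy (root) level.
    """
    if policy_compartment == "":
        return True
    policy_parts = policy_compartment.split(":")
    target_parts = target_compartment.split(":")
    return target_parts[: len(policy_parts)] == policy_parts


print(policy_covers("Workloads:Prod", "Workloads:Prod:AppTier"))  # True  (ancestor)
print(policy_covers("Workloads:Dev", "Workloads:Prod"))           # False (sibling)
print(policy_covers("Workloads:Prod:AppTier", "Workloads:Prod"))  # False (child cannot grant upward)
```

If this returns False for every policy that mentions the group, the access-denied result is a location problem rather than a verb or condition problem.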
Console path
Identity & Security > Policies (and Domains > Users / Groups / Dynamic groups). Use the tenancy Audit log to see the denied request details.
CLI
oci iam policy list --compartment-id <compartment-ocid> --all
oci iam group list-users --compartment-id <tenancy-ocid> --group-id <group-ocid>
oci iam dynamic-group get --dynamic-group-id <dg-ocid>
Fix options
Add the least-privilege statement at the correct compartment; fix the dynamic-group rule; correct the verb/resource-type; relax or correct the condition.
Prevention
Adopt a policy naming and location standard, review policies in code (Terraform), and keep a "who can do what" matrix per compartment.
3. Networking Deep Dive
The VCN is the foundation every other OCI service plugs into. This section covers CIDR planning, subnets, gateways, security rules, hybrid connectivity, DNS, and the traffic-flow reasoning you need to design and debug real networks.
A VCN is your private, software-defined network in a region with a CIDR you choose. Inside it you create subnets (regional or AD-specific, public or private). Traffic leaves the VCN only through a gateway - Internet (IGW), NAT, Service (to OCI services privately), DRG (to on-prem and other VCNs), or peering. Two rule layers govern packets: stateful security lists (subnet-wide) and NSGs (per-VNIC). Plan your CIDR to never overlap with on-premises or other regions - it is the hardest thing to change later.
VCN and CIDR planning
A Virtual Cloud Network is a regional, private network. You assign it one or more CIDR blocks (RFC 1918 private ranges are standard; you can also use public ranges you own). Everything - instances, load balancers, databases, mount targets, private endpoints - gets an IP from a subnet inside the VCN.
- A VCN can have multiple CIDR blocks, added after creation, which helps when you outgrow the first block. But you cannot shrink or trivially renumber - plan generously.
- VCN CIDRs and subnet CIDRs must not overlap with each other, with peered VCNs, or with your on-premises networks. Overlap is the number-one cause of hybrid connectivity that "connects but cannot route."
- Reserve non-overlapping space for every region and every environment you might ever run, plus DR. Treat it like an enterprise IP address plan, because it is one.
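Overlap is easy to check mechanically before any VCN is created. The sketch below uses Python's standard `ipaddress` module to carve equal-sized blocks from a parent range and to list overlapping pairs in a plan. The plan entries, prefix sizes, and helper names are hypothetical.

```python
import ipaddress
from itertools import combinations, islice


def carve(parent_cidr, new_prefix, count):
    """Return the first `count` blocks of size /new_prefix inside parent_cidr."""
    parent = ipaddress.ip_network(parent_cidr)
    return [str(n) for n in islice(parent.subnets(new_prefix=new_prefix), count)]


def find_overlaps(named_cidrs):
    """Return every pair of names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Hypothetical allocation: one /16 per region out of a larger private range.
print(carve("10.0.0.0/8", 16, 3))  # ['10.0.0.0/16', '10.1.0.0/16', '10.2.0.0/16']

# Hypothetical plan in which the first VCN collides with on-premises space.
plan = {
    "on-prem": "10.0.0.0/14",
    "vcn-prod": "10.0.0.0/16",
    "vcn-dr": "10.16.0.0/16",
}
print(find_overlaps(plan))  # [('on-prem', 'vcn-prod')]
```

Running a check like this against the enterprise IP plan in a pipeline catches the collision before it becomes a routing problem.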
Pick a supernet for OCI (for example, part of 10.0.0.0/8 subdivided), then allocate a /16 per region, a /20 per VCN/environment, and /24s per subnet tier. Keep a documented IPAM spreadsheet. Leave gaps. The cost of a too-large plan is zero; the cost of overlap is a re-IP project during a migration.
Common mistake: using 10.0.0.0/16 for the first VCN because it is the console default, then discovering on-premises already uses 10.0.x.x. Now FastConnect/VPN is up but nothing routes. Choose CIDRs against your existing enterprise IP plan on day one.
Subnets
| Subnet property | Options | Guidance |
|---|---|---|
| Scope | Regional (spans all ADs) or AD-specific | Prefer regional subnets - simpler HA, resources can land in any AD/FD. AD-specific subnets are legacy/niche. |
| Public vs private | Public = resources can have public IPs; Private = private IPs only | Default to private. Only front-facing load balancers and bastions belong in public subnets. |
| Route table | One per subnet | Determines which gateway off-VCN traffic uses. Public subnet → IGW; private subnet → NAT/Service/DRG. |
| Security lists | Zero or more per subnet | Subnet-wide stateful rules. Combine with NSGs. |
| DHCP options | One per subnet | Controls DNS resolver and search domain handed to instances. |
Gateways - the only ways out of a VCN
| Gateway | Direction / purpose | Public IP? | Typical route target for |
|---|---|---|---|
| Internet Gateway (IGW) | Bidirectional internet for resources with public IPs | Yes | Public subnets (LB, bastion) |
| NAT Gateway | Outbound-only internet for private resources (patching, external APIs) | Uses OCI-managed public IP | Private subnets needing egress |
| Service Gateway | Private access to OCI services (Object Storage, ADB, etc.) without internet | No | Private subnets reaching OCI services |
| Dynamic Routing Gateway (DRG) | On-premises (VPN/FastConnect) and VCN-to-VCN / cross-region routing hub | No | Hybrid + hub-and-spoke |
| Local Peering Gateway (LPG) | VCN-to-VCN peering in the same region (legacy; DRG now preferred) | No | Same-region VCN peering |
| Remote Peering Connection (RPC) | DRG-to-DRG peering across regions | No | Cross-region VCN connectivity |
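How a subnet's route table picks a gateway can be sketched as a most-specific-match lookup. This is a simplified model under stated assumptions: the route rules, the on-premises range, and the VCN CIDR are hypothetical, and Service Gateway rules are left out because they target a service CIDR label rather than a literal CIDR.

```python
import ipaddress

# Hypothetical route rules for a private subnet: (destination CIDR, target gateway).
ROUTES = [
    ("0.0.0.0/0", "NAT Gateway"),
    ("172.16.0.0/12", "DRG"),  # assumed on-premises range
]


def next_hop(dest_ip, vcn_cidr="10.0.0.0/16", routes=ROUTES):
    """Return the gateway a packet would use, or None if no rule matches."""
    ip = ipaddress.ip_address(dest_ip)
    # Traffic to addresses inside the VCN never consults the route rules.
    if ip in ipaddress.ip_network(vcn_cidr):
        return "local (stays inside the VCN)"
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The most specific (longest-prefix) rule wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop("10.0.1.5"))     # local (stays inside the VCN)
print(next_hop("172.16.4.10"))  # DRG
print(next_hop("8.8.8.8"))      # NAT Gateway
```

Walking a destination through this lookup by hand is often the quickest way to explain why a packet left through an unexpected gateway.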
Common mistake: routing 0.0.0.0/0 to a NAT Gateway in a private subnet and assuming Object Storage traffic is "private." It is not - it egresses to the internet-facing Object Storage endpoint. Add a Service Gateway and a route for the OCI services CIDR label so that traffic stays on the backbone; keep NAT only for genuine internet destinations.
Security lists vs. Network Security Groups
| | Security List | Network Security Group (NSG) |
|---|---|---|
| Applies to | Every VNIC in the subnet | Only VNICs you add to the NSG |
| Granularity | Subnet-wide (coarse) | Per-workload / per-tier (fine) |
| Rule source/dest | CIDR, service CIDR | CIDR, service CIDR, or another NSG |
| Stateful? | Yes (can also be stateless) | Yes (can also be stateless) |
| Best for | Baseline subnet rules (e.g. allow intra-VCN) | App-tier-to-DB-tier rules by group, not IP |
Both are evaluated. A packet is allowed if either the applicable security lists or the NSGs permit it (they are additive for allows; there is no deny rule - you allow what you need and everything else is implicitly denied). Effective rules = union of all security lists on the subnet + all NSGs on the VNIC.
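The additive behavior described above can be sketched in a few lines of Python. This is a simplified model, not OCI's implementation: the `IngressRule` class and the `is_allowed` helper are hypothetical names, rules are matched only on source CIDR, protocol, and port, and NSG-to-NSG references are left out.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified allow rule (CIDR source only)."""
    source_cidr: str        # e.g. "10.0.1.0/24"
    protocol: str           # "tcp", "udp", or "all"
    port: Optional[int]     # None means any port


def rule_matches(rule, src_ip, protocol, port):
    """True if a single allow rule covers this packet."""
    in_cidr = ip_address(src_ip) in ip_network(rule.source_cidr)
    proto_ok = rule.protocol in ("all", protocol)
    port_ok = rule.port is None or rule.port == port
    return in_cidr and proto_ok and port_ok


def is_allowed(security_list_rules, nsg_rules, src_ip, protocol, port):
    """Effective rules are the union; anything unmatched is implicitly denied."""
    effective = list(security_list_rules) + list(nsg_rules)
    return any(rule_matches(r, src_ip, protocol, port) for r in effective)


# A tight NSG (only the app subnet may reach 1521) ...
nsg = [IngressRule("10.0.1.0/24", "tcp", 1521)]
# ... combined with a permissive subnet-wide security list.
security_list = [IngressRule("10.0.0.0/16", "all", None)]

# 10.0.9.5 is outside the app subnet, yet the security list lets it through.
print(is_allowed(security_list, nsg, "10.0.9.5", "tcp", 1521))
print(is_allowed([], nsg, "10.0.9.5", "tcp", 1521))
```

The first call shows why both rule sets need auditing: the NSG alone would reject the packet, but the union with the broad security list allows it.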
IPs, VNICs, and secondary addresses
| Object | What it is | Notes |
|---|---|---|
| Private IP | An IP from the subnet CIDR on a VNIC | Primary private IP is fixed for the VNIC's life; secondaries can move. |
| Ephemeral public IP | Temporary public IP tied to a private IP/instance lifecycle | Released when the instance/VNIC is deleted. Cheapest for transient needs. |
| Reserved public IP | A public IP you own independent of any instance | Survives instance deletion; re-map to another resource. Use for stable ingress/whitelisting. |
| VNIC (primary) | The instance's first network interface | Cannot be removed; determines the instance's primary subnet. |
| Secondary VNIC | Additional interface, can be in a different subnet | Multi-homing (e.g. app + management network). Requires OS-level config. |
| Secondary private IP | Extra private IP on a VNIC | Used for IP failover / floating VIPs across instances. |
DNS and DHCP
- VCN Resolver - each VCN has a built-in DNS resolver at 169.254.169.254. It resolves internal hostnames and forwards public queries.
- Private DNS zones/views - create private zones for custom internal names, and use the resolver's endpoints/forwarding rules to integrate with on-premises DNS (conditional forwarding both ways for hybrid name resolution).
- DHCP options - per-subnet; controls whether instances use the VCN resolver or a custom resolver, and the search domain. Point at custom resolvers when integrating enterprise DNS.
How traffic flows in OCI
For any packet leaving an instance, OCI decides the path in this order:
- Is the destination inside the VCN? If yes, it routes locally (no gateway) - only security rules apply. Intra-VCN routing is automatic.
- If outside the VCN, the subnet's route table is consulted for the most specific matching rule → picks a gateway (IGW, NAT, Service, DRG, LPG).
- Security rules (security lists + NSGs on the VNIC) must allow the egress, and the return path must be allowed (stateful handles the return automatically).
- At the destination side, the same security-rule check happens for ingress.
Debugging almost always comes down to three questions: Is there a route to the right gateway? Do the security rules allow it at both ends? Is there a return path?
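The first two steps of that decision order (local delivery, then most-specific route match) can be sketched as follows. This is an illustrative model only: `pick_target` is a hypothetical helper, the CIDRs are made up, and Service Gateway rules are not modeled because they use a service CIDR label rather than a literal CIDR.

```python
from ipaddress import ip_address, ip_network


def pick_target(dest_ip, vcn_cidrs, route_rules):
    """Return "local", a gateway name, or None if no route matches.

    route_rules is a list of (destination_cidr, target) pairs taken
    from the subnet's route table.
    """
    dest = ip_address(dest_ip)

    # Step 1: destinations inside the VCN are routed locally, no gateway.
    if any(dest in ip_network(cidr) for cidr in vcn_cidrs):
        return "local"

    # Step 2: among matching rules, the longest prefix (most specific) wins.
    matches = [
        (ip_network(cidr), target)
        for cidr, target in route_rules
        if dest in ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the packet cannot leave the VCN
    return max(matches, key=lambda m: m[0].prefixlen)[1]


vcn = ["10.0.0.0/16"]
routes = [("0.0.0.0/0", "NAT"), ("172.16.0.0/12", "DRG")]

for ip in ("10.0.2.5", "172.16.4.4", "8.8.8.8"):
    print(ip, "->", pick_target(ip, vcn, routes))
```

Security rules still apply after a target is chosen; this sketch answers only the "is there a route to the right gateway?" question.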
Reference diagrams
Three-tier architecture (public LB, private app, private DB)
Hub-and-spoke with DRG
Service Gateway to Object Storage (private backend/backups)
On-premises to OCI hybrid (VPN + FastConnect)
Hybrid connectivity: VPN vs FastConnect
| | Site-to-Site VPN (IPSec) | FastConnect |
|---|---|---|
| Path | Over the public internet, encrypted | Private, dedicated connection (via partner or colo) |
| Bandwidth | Limited per tunnel; multiple tunnels for HA | 1/10/100 Gbps port options |
| Latency/jitter | Variable (internet) | Consistent, low |
| Setup time | Minutes | Days-weeks (physical/partner provisioning) |
| Use as | Quick start, or encrypted backup to FastConnect | Primary enterprise link, production DB replication, large migration |
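For migration planning, the bandwidth row translates directly into transfer time. The arithmetic below is a rough sketch: `transfer_hours` is a hypothetical helper, it assumes decimal terabytes, and the `efficiency` factor (protocol overhead, contention) is a guess you should replace with a measured value.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 = ideal line rate; real links are lower).
    """
    bits = data_tb * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Moving 10 TB at ideal line rate on different port speeds.
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: {transfer_hours(10, gbps):.2f} h")

# The same 10 TB on a 1 Gbps path that only sustains half its nominal rate.
print(f"1 Gbps at 50%: {transfer_hours(10, 1, efficiency=0.5):.2f} h")
```

If the result exceeds your cutover window even on FastConnect, that is the signal to consider physical data transfer instead of the network.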
Flow logs, path analysis, and packet capture
| Tool | What it gives you | Use when |
|---|---|---|
| VCN Flow Logs | Accepted/rejected connection records per subnet/VNIC (into the Logging service) | Auditing, "is my rule dropping this?", security forensics |
| Network Path Analyzer | Static analysis of whether a path from A to B is allowed, listing the rules/routes that permit or block it | Before deploying, or first step when "cannot reach" - it tells you the exact blocking rule |
| VTAP (Virtual Test Access Point) | Mirrors VNIC traffic to a capture target for deep packet inspection | IDS/IPS, deep debugging, compliance capture |
| Instance-side capture | tcpdump on the instance OS | Confirming what actually arrives at the host |
Network Firewall and Web Application Firewall
- Network Firewall - a managed, OCI-native firewall (Palo Alto-based) you place in a hub VCN to inspect north-south and east-west traffic: stateful filtering, IPS/IDS, URL filtering, TLS inspection. Route spoke traffic through it via the DRG hub.
- Web Application Firewall (WAF) - Layer-7 protection (OWASP rules, bot management, rate limiting) applied in front of public HTTP endpoints, either as a WAF policy attached to a load balancer or as an edge service. Covered further in section 8.
Networking troubleshooting
Likely causes & checks
- Private subnet, no egress: needs a route 0.0.0.0/0 to a NAT Gateway (or IGW if it has a public IP). Check the subnet's route table.
- Public subnet but no public IP: assign an ephemeral/reserved public IP, and confirm a 0.0.0.0/0 route to the IGW.
- Security rules: egress must allow the destination (often 0.0.0.0/0 on 443); return traffic is automatic if stateful.
- OS firewall: iptables/firewalld on the instance may block it - Oracle Linux images ship with firewall rules.
Console path
Networking > VCN > Subnet > Route Table / Security Lists; Instance > Attached VNICs > public IP.
CLI
oci network route-table get --rt-id <ocid>
oci network security-list get --security-list-id <ocid>
# on the instance:
curl -s https://ifconfig.me ; sudo firewall-cmd --list-all
Fix / prevention
Add NAT route for private egress; standardize a subnet template (route + security rules) in Terraform so every subnet is consistent.
Likely causes & checks
- No Service Gateway on the VCN, or subnet route table has no route to it for the OCI services CIDR label.
- Service Gateway configured for the wrong service label (needs "Object Storage" or "All OSN Services in region").
- Security rules do not allow egress to the service CIDR.
- Using the wrong endpoint - use the regional Object Storage endpoint that the Service Gateway serves.
Console path
Networking > VCN > Service Gateway; then the subnet Route Table.
Fix / prevention
Create the Service Gateway, add a route rule (target = Service Gateway, destination = the services CIDR label), and an egress security rule to that label. Bake this into your standard private-subnet module.
Likely causes & checks
- CIDR overlap between on-prem and the VCN - routing is ambiguous. This is the most common root cause.
- DRG route table / route distribution not advertising the VCN CIDR, or on-prem BGP not advertising its routes.
- VPN tunnel down (Phase 1/2 mismatch) or FastConnect BGP session down.
- Security lists/NSGs not allowing the on-prem CIDR.
- On-prem firewall blocking return traffic.
Console path
Networking > Dynamic Routing Gateways > DRG > Route Tables / Attachments; Site-to-Site VPN > Tunnel status.
Fix / prevention
Resolve overlap (re-IP or NAT), confirm route advertisement both ways, verify tunnel/BGP state. Document the IP plan to prevent future overlap.
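Overlap is easy to check before anything is built. A minimal sketch using Python's standard `ipaddress` module; `find_overlaps` is a hypothetical helper and the CIDRs are illustrative, not taken from any real IP plan.

```python
from ipaddress import ip_network
from itertools import product


def find_overlaps(onprem_cidrs, vcn_cidrs):
    """Return every (on-prem, VCN) CIDR pair whose address ranges overlap."""
    return [
        (a, b)
        for a, b in product(onprem_cidrs, vcn_cidrs)
        if ip_network(a).overlaps(ip_network(b))
    ]


onprem = ["10.0.0.0/8", "192.168.0.0/16"]
planned_vcns = ["10.20.0.0/16", "172.16.0.0/16"]

# 10.20.0.0/16 sits inside the on-prem 10.0.0.0/8 range, so it is flagged.
print(find_overlaps(onprem, planned_vcns))
```

Running a check like this against the enterprise IPAM export as part of VCN provisioning catches the problem before routing becomes ambiguous.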
Likely causes & checks
- Backend subnet's security rules do not allow the health-check probe from the load balancer subnet/NSG on the backend port.
- Health check configured for the wrong port/path/protocol vs. what the app serves.
- Backend app not listening on the expected port, or bound to 127.0.0.1 instead of 0.0.0.0.
- OS firewall on the backend dropping the probe.
Fix / prevention
Allow the LB source in the backend NSG on the health-check port; align health-check path/port with the app; confirm the app binds to all interfaces. See section 7 for the full LB troubleshooting flow.
Likely causes & checks
- Subnet DHCP options point at a resolver that cannot resolve the name (e.g. custom resolver without a forwarder).
- Hybrid: no conditional forwarding between the VCN resolver and corporate DNS.
- Private DNS zone missing the record, or wrong view attached to the VCN.
CLI / checks
nslookup db.internal.example.com 169.254.169.254
cat /etc/resolv.conf # confirm which resolver the OS uses
Fix / prevention
Fix DHCP options, add resolver endpoints + forwarding rules for hybrid, and put internal records in a private zone attached to the VCN's resolver view.
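When checking many hosts, the resolv.conf inspection can be scripted. The sketch below parses the file's text and reports whether the VCN resolver is in use; `parse_resolv_conf` is a hypothetical helper and the sample contents (including the search domain) are made up for illustration.

```python
VCN_RESOLVER = "169.254.169.254"


def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":   # skip blanks and comment lines
            continue
        parts = line.split()
        if parts[0] == "nameserver" and len(parts) > 1:
            nameservers.append(parts[1])
        elif parts[0] == "search":
            search = parts[1:]
    return nameservers, search


# Illustrative contents; on a host, read /etc/resolv.conf instead.
sample = """\
# example file
search vcn1.oraclevcn.com
nameserver 169.254.169.254
"""

servers, domains = parse_resolv_conf(sample)
print("nameservers:", servers)
print("search:", domains)
print("uses VCN resolver:", VCN_RESOLVER in servers)
```

If the VCN resolver is missing from the list, the subnet's DHCP options point at a custom resolver, and that resolver needs a forwarder for names it does not own.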
Fast method
Run Network Path Analyzer for source, destination, protocol, and port. It reports the exact security rule or missing route causing the block. Then enable VCN Flow Logs and look for REJECT records to confirm which rule dropped the packet.
Common route-table gotcha
Route rules are matched most-specific-first. Having both a 0.0.0.0/0 route to NAT and a more specific route to a Service Gateway is correct; if the specific route is missing, OCI-service traffic goes out through the NAT by mistake. For DRG route propagation issues, check the DRG route table import/export and that the attachment advertises the CIDR.
OCI Networking gotchas
- CIDR overlap is the cardinal sin - plan against your enterprise IPAM, leave room, never reuse on-prem ranges.
- "Public subnet" != reachable. Reachability needs public IP + IGW route + security rules, all three.
- Service Gateway forgotten - Object Storage/ADB traffic silently goes over NAT/internet. Always add it for private subnets.
- Security list + NSG are additive - a permissive security list can undo the tight NSG you thought was protecting a VNIC. Audit both.
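- Stateless rule asymmetry - a stateless rule does not track connections, so return traffic needs its own explicit rule in the opposite direction. Keep rules stateful unless you have a measured reason not to.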
- Regional vs AD subnets - use regional subnets; AD-specific ones complicate HA and are rarely needed now.
- DRG vs LPG - LPG is legacy; use the upgraded DRG (DRG v2) as the routing hub for peering and hybrid.
- Route table is per subnet - a change in one subnet's route table does not affect others; inconsistency between subnets causes "works here, not there."
- OS firewall - Oracle Linux images enforce their own firewall; a perfect VCN config still fails if firewalld blocks the port.
- Egress data transfer - internet egress is metered; keep OCI-service traffic on the Service Gateway to avoid unnecessary internet-path cost and exposure.
4. Compute Deep Dive
Shapes, images, placement, scaling, and the operational patterns for running application and database compute in OCI - including how to pick shapes and how licensing interacts with instance choice.
OCI compute comes as VMs or bare metal. Modern shapes are flexible - you dial OCPUs and memory independently. One OCPU = one physical core (two vCPU threads), which matters a lot for Oracle licensing. Use fault domains and instance pools + autoscaling for HA and elasticity, instance configurations as templates, and instance principals so instances call OCI APIs without stored keys.
Shapes: OCPU, memory, flexible
- OCPU vs vCPU: OCI historically measures CPU in OCPUs. One OCPU = one physical core = two hardware threads (vCPUs) on x86. Oracle Database licensing counts cores, so 1 OCPU generally corresponds to the licensing unit for a core (subject to the core factor). Newer shapes may also be expressed in vCPUs - confirm the unit for the exact shape.
- Flexible shapes (e.g. the E-series AMD, and Ampere Arm A-series) let you choose OCPU count and memory GB independently, within per-OCPU memory ranges. You pay for what you allocate.
- Fixed shapes come in preset sizes (older standard shapes, some bare metal, GPU shapes).
- Processor families: AMD EPYC (E-series), Intel Xeon (X/Standard), Ampere Arm (A-series, strong price/performance for scale-out and cloud-native), and NVIDIA GPU shapes for AI/ML.
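The OCPU-to-vCPU relationship and the per-OCPU memory range can be expressed as a small validation step. This is a sketch only: `describe_flex_shape` is a hypothetical helper, and the memory bounds passed in below are placeholders, not published limits for any shape - look up the real range for the shape you use.

```python
def describe_flex_shape(ocpus, memory_gb, min_gb_per_ocpu, max_gb_per_ocpu,
                        threads_per_ocpu=2):
    """Validate a flexible-shape request and report its vCPU count.

    threads_per_ocpu=2 reflects the x86 case described above
    (one OCPU = one core = two hardware threads).
    """
    low = ocpus * min_gb_per_ocpu
    high = ocpus * max_gb_per_ocpu
    if not low <= memory_gb <= high:
        raise ValueError(
            f"{memory_gb} GB is outside the allowed {low}-{high} GB "
            f"for {ocpus} OCPUs"
        )
    return {
        "ocpus": ocpus,
        "vcpus": ocpus * threads_per_ocpu,
        "memory_gb": memory_gb,
    }


# Placeholder bounds: 1-64 GB per OCPU (an assumption for illustration).
print(describe_flex_shape(4, 64, min_gb_per_ocpu=1, max_gb_per_ocpu=64))
```

The OCPU figure is the one to carry into licensing conversations; the vCPU figure is what the guest OS reports as logical CPUs on x86.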
Bare metal vs virtual machines
| | Virtual Machine | Bare Metal |
|---|---|---|
| Tenancy | Shared host, isolated VM | Entire physical server, single tenant |
| Hypervisor overhead | Minimal (off-box virtualization) | None - you get the whole box |
| Use for | Most apps, middleware, web, smaller DBs | Highest performance, large DBs, licensing isolation, specialized workloads |
| Live migration | Supported for many VM shapes during infra maintenance | Not applicable - you manage maintenance windows/reboot migration |
| Cost model | Per-OCPU/hour, fine-grained | Whole-server; higher floor, better for dense/large |
Dedicated hosts, capacity reservation, preemptible
| Option | What it does | Use when |
|---|---|---|
| Dedicated VM Host | Your VMs run on a physical host reserved to you (no other tenants) | Compliance/licensing isolation while keeping VM flexibility |
| Capacity Reservation | Reserves capacity of a shape in an AD so it is guaranteed available when you launch | Guaranteeing DR failover capacity, large launches, scale-out headroom |
| Preemptible instances | Cheaper VMs OCI can reclaim with short notice | Fault-tolerant batch, stateless workers, CI - never for stateful/production-critical |
| Burstable instances | Baseline fraction of an OCPU with bursting | Low-average, spiky small workloads (bastions, light services) |
Images, custom images, and cloud-init
- Platform images - Oracle Linux, and other OSes maintained by Oracle.
- Custom images - capture a configured instance as an image for repeatable launches (golden images).
- Bring Your Own Image (BYOI) - import a supported OS image; must meet OCI's paravirtualized/emulated driver requirements.
- cloud-init - pass a startup script (user data) that runs on first boot to configure the instance (install agents, mount volumes, join config management). The standard way to bootstrap without baking everything into an image.
- Instance metadata service - at 169.254.169.254, exposes instance metadata and is how instance principals fetch credentials. Restrict access to it inside the OS where appropriate.
#cloud-config
# Example cloud-init passed as user_data (base64) to bootstrap an app host.
# The "#cloud-config" marker must be the first line of the file.
package_update: true
packages: [oracle-instantclient-basic, jq]
runcmd:
  - [ systemctl, enable, --now, myapp ]
  - [ /opt/app/register-with-lb.sh ]
Placement, fault domains, and HA
High availability for compute is about anti-affinity: never put both halves of an HA pair on the same failure unit.
- In a multi-AD region: spread across ADs for data-center-level resilience, and across FDs within each AD.
- In a single-AD region: spread across the three fault domains - that is your in-region HA. DR to another region covers AD/region loss.
- Instance pools distribute instances across FDs/ADs automatically per the placement configuration.
- Live migration: for supported VM shapes, OCI can live-migrate your VM off hardware needing maintenance, avoiding a reboot; some events still require a reboot-migration you schedule. Bare metal you always manage yourself.
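The anti-affinity idea can be illustrated with a round-robin spread across fault domains. This is a conceptual sketch of the placement logic, not how instance pools are implemented; `spread` and `shares_failure_unit` are hypothetical helpers and the instance names are made up.

```python
from itertools import cycle

FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def spread(instances, fault_domains=FAULT_DOMAINS):
    """Assign instances to fault domains round-robin."""
    return dict(zip(instances, cycle(fault_domains)))


def shares_failure_unit(placement, a, b):
    """True if two instances would be lost together in one FD failure."""
    return placement[a] == placement[b]


placement = spread(["app-1", "app-2", "app-3", "app-4"])
for name, fd in placement.items():
    print(name, "->", fd)

# The HA pair app-1/app-2 lands in different fault domains.
print(shares_failure_unit(placement, "app-1", "app-2"))
```

With more instances than fault domains, some sharing is unavoidable (app-4 joins app-1 here); what matters is that no HA pair ends up on the same failure unit.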
Autoscaling, instance pools, and configurations
| Building block | Role |
|---|---|
| Instance Configuration | A template: shape, image, network, metadata/cloud-init, volumes. Immutable versioned blueprint. |
| Instance Pool | Manages a set of identical instances from a configuration, across FDs/ADs, with a target size. |
| Autoscaling | Adjusts pool size by metric (CPU/memory) thresholds or a schedule (e.g. scale down nights/weekends). |
| Cluster Networks | High-performance RDMA-connected pools for HPC/AI (ultra-low-latency interconnect). |
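Metric-based autoscaling boils down to a threshold rule bounded by a minimum and maximum pool size. The function below is a simplified sketch of that decision, not the Autoscaling service's logic; `desired_size` and all threshold values are illustrative assumptions.

```python
def desired_size(current, cpu_pct, scale_out_at, scale_in_at,
                 min_size, max_size, step=1):
    """Return the pool size a threshold policy would aim for."""
    if cpu_pct > scale_out_at:
        target = current + step      # sustained load: add capacity
    elif cpu_pct < scale_in_at:
        target = current - step      # idle: remove capacity
    else:
        target = current             # inside the band: leave it alone
    # Never go below the HA floor or above the cost ceiling.
    return max(min_size, min(max_size, target))


policy = dict(scale_out_at=70, scale_in_at=30, min_size=2, max_size=6)

print(desired_size(4, 85, **policy))   # busy pool grows by one
print(desired_size(6, 95, **policy))   # already at the ceiling
print(desired_size(2, 10, **policy))   # already at the floor
```

Keeping a gap between the scale-out and scale-in thresholds avoids flapping, and a minimum of two keeps the pool spread across fault domains even when idle.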
Choosing shapes by workload
| Workload | Starting point | Why |
|---|---|---|
| General applications | Flexible AMD E-series VM (balanced OCPU:memory) | Cost-effective, dial exactly what you need |
| Oracle Database (IaaS) | Higher-memory flexible VM or bare metal; consider Base/Exadata service instead | Memory for SGA/PGA; bare metal for licensing isolation and top performance |
| Web servers | Ampere Arm A-series or small E-series, autoscaled | Excellent price/performance for stateless scale-out |
| Middleware (WebLogic, etc.) | Balanced flexible VM, memory-leaning | JVM heaps like memory; scale OCPU to concurrency |
| EBS application tier | Flexible VM sized to concurrent users; multiple nodes behind LB | Horizontal scale + HA across FDs |
| Batch processing | Preemptible or Arm pools, autoscaled/scheduled | Fault-tolerant, cheap, elastic |
| Memory-heavy (in-memory, caches, analytics) | High memory-per-OCPU flexible shape | Push memory up without paying for unneeded cores |
| CPU-heavy (compute, encoding) | High-OCPU flexible or dedicated; Arm for throughput | Cores are the bottleneck |
| AI/ML training & inference | GPU shapes (NVIDIA); cluster networks for multi-node | Accelerators + RDMA fabric |
Operational guidance
How to resize compute Ops
- Flexible VM: change OCPU/memory - typically requires a reboot; plan a window. Scaling within the same shape family is straightforward.
- Change shape family/architecture (e.g. Intel → Arm): not an in-place resize - rebuild from the image/config on the new shape (watch for architecture-specific binaries).
- For pools: update the instance configuration and roll instances, or increase pool size and drain old ones.
How to move workloads safely Ops
- Prefer rebuild-from-image over "lift the disk" - create a custom image, launch in the target compartment/AD/region, validate, cut over.
- For cross-region moves, copy the custom image to the target region, or use Block Volume/boot volume cross-region replication.
- Keep IPs stable with reserved public IPs and secondary private IPs where clients pin addresses; better, front with a load balancer or DNS name so moves are transparent.
How to troubleshoot boot / access issues Ops
- Use the serial console / console connection to see boot output and reach the OS when SSH is dead (bad fstab, firewall lockout, failed service).
- Check the instance's boot volume is healthy; you can detach it and attach to a rescue instance to fix an unbootable OS.
- Confirm SSH key was injected (cloud-init) and the security rules/route allow 22 from your source.
How to troubleshoot performance Ops
- Check Monitoring metrics: CPU utilization, memory, and especially block volume throughput/IOPS vs. the volume's performance tier limits.
- A common surprise: the instance is not CPU-bound, the block volume is at its IOPS/throughput ceiling. Raise the volume's performance tier or use higher-VPU settings (section 5).
- Network: verify you are not hitting shape bandwidth limits; larger shapes get more network bandwidth.
- Right-size: autoscale or resize based on sustained utilization, not peak fear.
How to design compute for production Design
- At least two nodes across different fault domains (and ADs where available) behind a load balancer.
- Instance configuration + pool + autoscaling so capacity is elastic and nodes are reproducible.
- Instance principals for API access; no stored keys.
- OS Management for patch compliance; custom golden image + cloud-init for consistency.
- Monitoring alarms on CPU/memory/volume; boot/block volume backups; capacity reservation for DR if RTO demands it.
OS management, patching, and recovery
- OS Management (Hub) - manage OS updates/patch compliance across fleets of Oracle Linux (and other) instances from OCI.
- Serial console connection - out-of-band access for recovery.
- Instance recovery - stop/start moves the VM to healthy hardware; for corrupted OS, rescue via boot-volume detach/attach.
Compute troubleshooting quick runbook
Checks
- Security rules + route allow 22 from your IP; instance has the right public/private IP path.
- Instance is Running (not Stopped); boot completed - check serial console.
- SSH key injected (cloud-init logs); correct user (opc for Oracle Linux).
- OS firewall (firewalld) not blocking; fail2ban not locking you out.
CLI
oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance-console-connection create --instance-id <ocid> --ssh-public-key-file key.pub
oci compute instance action --instance-id <ocid> --action SOFTRESET
Prevention
Bastion service for SSH (no public IPs on hosts), standardized security rules, and console-connection procedures documented.
5. Storage Deep Dive
Block, Object, and File storage in OCI - their performance models, backup and replication behavior, and the decision of which to use for databases, shared file systems, backups, archives, and data lakes.
Block Volume = network-attached disks for instances/databases (like SAN/iSCSI), with tunable performance (VPU). Object Storage = HTTP key-value store for backups, data lakes, artifacts, archives - not a file system. File Storage (FSS) = managed NFS for shared POSIX file systems. Block for boot/DB, Object for backups/archives/lakes, File for shared app filesystems. Archive tier is cheap but has a restore delay - plan for it.
The three storage services
| | Block Volume | Object Storage | File Storage (FSS) |
|---|---|---|---|
| Interface | iSCSI / paravirtualized block device | REST/HTTP (S3-compatible, Swift) | NFS v3 |
| Looks like | A disk you format & mount | Buckets of objects (no directories) | A shared mounted filesystem |
| Attached to | One instance at a time (or shared/multi-attach for clusters) | Nothing - accessed over network by URL | Many instances concurrently |
| Scale | Per-volume size limit; attach many | Effectively unlimited | Grows automatically to petabytes |
| Best for | Boot disks, DB datafiles, app storage needing block I/O | Backups, archives, images, logs, data lake, static content | Shared home dirs, app clusters, EBS shared APPL_TOP |
Block volumes and boot volumes
- Boot volume - the OS disk created with an instance. Can be backed up, cloned, and detached for rescue.
- Block volume - additional data disks. Attach via iSCSI or paravirtualized; multi-attach for shared-disk clusters (e.g. some RAC/HA configs).
- Performance tiers (VPU/GB): Lower Cost, Balanced, Higher Performance, and Ultra High Performance - set by Volume Performance Units (VPUs) per GB. Higher VPU = more IOPS and throughput per GB. Auto-tune can lower performance (and cost) when a volume is detached/idle and raise it when in use.
- Volume groups - group volumes (e.g. all of a DB's volumes) so backups/clones are crash-consistent across the set.
- Backups - full or incremental, policy-based (scheduled), and can be copied cross-region for DR. Clones are instant copy-on-write; replication keeps a volume asynchronously mirrored to another region.
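Because performance scales per GB up to a per-volume cap, a small volume can be the bottleneck even on a high tier. The arithmetic is sketched below; `effective_iops` is a hypothetical helper and the per-GB rate and cap are illustrative placeholders, so take the real figures for each tier from current Oracle documentation.

```python
def effective_iops(size_gb, iops_per_gb, max_iops_per_volume):
    """IOPS ceiling for a volume: scales with size until the tier cap."""
    return min(size_gb * iops_per_gb, max_iops_per_volume)


# Placeholder tier figures for illustration only.
IOPS_PER_GB = 60
VOLUME_CAP = 25_000

for size_gb in (100, 500, 1000):
    print(f"{size_gb:>5} GB -> {effective_iops(size_gb, IOPS_PER_GB, VOLUME_CAP)} IOPS")
```

With these placeholder numbers, the 100 GB volume tops out far below the cap, which is the "not CPU-bound, volume-bound" situation: grow the volume or raise its VPU setting.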
Object Storage
- Namespace - a tenancy-wide unique container name; buckets live in it, scoped to a compartment and region.
- Storage tiers: Standard (hot, frequent access), Infrequent Access (cheaper storage, retrieval fee), Archive (cheapest, must be restored before reading, with a restore delay). Auto-Tiering can move objects between Standard/IA based on access.
- Multipart upload - upload large objects in parallel parts; required/recommended for big files (backups, images).
- Pre-Authenticated Requests (PARs) - time-boxed URLs granting access to a bucket/object without IAM credentials - handy but a sharing risk if leaked.
- Lifecycle rules - auto-transition objects to Archive or delete them after N days.
- Retention rules & versioning - retention locks objects against deletion for a period (WORM-style, supports compliance); versioning keeps prior versions.
- Replication - asynchronously replicate a bucket to another region/bucket for DR.
File Storage Service (FSS)
- File system - the NFS-exported filesystem; grows automatically, snapshots supported.
- Mount target - the NFS endpoint (with a private IP in your subnet) that instances mount. It carries the export set.
- Export paths & NFS export options - control which CIDRs/hosts can mount, and with what access (read/write, root squash).
- Snapshots - point-in-time, space-efficient; replication mirrors a file system to another region for DR.
When to use which
| Need | Use | Why |
|---|---|---|
| OS boot disk | Boot Volume (Block) | Block I/O, bootable, backup/clone |
| Database datafiles / redo | Block Volume (right VPU) or managed DB storage | Low-latency block I/O; tune performance |
| Shared filesystem for an app cluster | File Storage (FSS) | Concurrent POSIX access from many nodes |
| Database backups (RMAN) | Object Storage (via Service Gateway) | Durable, cheap, off-host, cross-region copy |
| Log/data archive, long retention | Object Storage Archive tier + lifecycle + retention | Lowest cost, WORM compliance |
| Data lake | Object Storage (Standard) | Scales infinitely, queried by analytics services |
| EBS shared APPL_TOP / concurrent tier | File Storage (FSS) | Shared filesystem semantics EBS expects |
| Static website / media | Object Storage + PAR/CDN | HTTP-native object serving |
Practical examples
RMAN backup to Object Storage DBA
Configure RMAN to write to Object Storage via the Oracle Database Cloud Backup Module (or the DBaaS backup tooling on managed DBs). Traffic should traverse the Service Gateway, not NAT/internet. Enable a lifecycle rule to move older backup pieces to Archive, plus versioning + retention on the bucket for immutability. For DR, enable cross-region bucket replication or copy backups to the DR region.
Application shared file system Apps
Create an FSS file system + mount target in the app subnet, restrict export options to the app-tier CIDR/NSG, mount on all app nodes. Snapshot on a schedule; replicate to the DR region. This is the standard shared storage for EBS APPL_TOP, WebLogic domains, or any scale-out app needing a common filesystem.
Log archive with lifecycle Ops
Ship logs (via Service Connector Hub) to an Object Storage bucket. Lifecycle rule: Standard for 30 days → Archive for long-term → delete after the compliance period. Retention rule prevents early deletion during the required window.
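The lifecycle and retention logic for that bucket can be modeled as two small functions. This is a sketch of the policy's intent, not Object Storage's rule engine; the function names and the day counts other than the 30-day Standard window are assumptions.

```python
def lifecycle_state(age_days, archive_after_days=30, delete_after_days=365):
    """Where an object of this age should be under the lifecycle policy."""
    if age_days >= delete_after_days:
        return "DELETE"
    if age_days >= archive_after_days:
        return "ARCHIVE"
    return "STANDARD"


def can_delete(age_days, retention_days):
    """A retention rule blocks deletion until the window has passed."""
    return age_days >= retention_days


for age in (5, 45, 400):
    print(age, "days ->", lifecycle_state(age))

# A manual delete at day 45 is refused under a 365-day retention rule.
print(can_delete(45, retention_days=365))
```

Set the lifecycle delete threshold at or after the retention period; a delete rule that fires earlier cannot remove objects the retention rule still protects.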
Data transfer options
- Data Transfer Service (disk/appliance) - ship physical media/appliances to Oracle to bulk-load very large datasets when network transfer is impractical.
- Online - CLI/SDK multipart upload, oci os object bulk-upload, or Storage Gateway-style/rclone tooling for ongoing sync.
Storage gotchas
- Object Storage is not a filesystem - no random writes, no POSIX locks, no real directories.
- Archive restore delay - Archive objects must be restored before reading, which takes time (hours-class). Never put anything you might need immediately in Archive.
- Block volume undersized for IOPS - performance follows size and VPU; a small volume throttles regardless of CPU.
- NFS mount target too open - lock export options to specific CIDRs and enable root squash.
- Backup policy gaps - a volume/DB with no backup policy attached is silently unprotected. Audit that every prod volume has a policy.
- Cross-region copy cost & timing - replication/copy incurs egress and takes time; a "DR copy" that lags your RPO is not DR. Measure it.
- Detached-but-billed volumes - deleting an instance may leave block volumes behind, still billing. Clean them up.
- PAR sprawl - untracked pre-authenticated URLs are a data-exfiltration risk. Inventory and expire them.
6. Database Services Deep Dive
OCI's database portfolio, from self-managed IaaS through Base Database and Exadata to fully-managed Autonomous - what each one manages for you, how HA/DR/backup/patching differ, and how to choose.
Pick along a spectrum of control vs. toil. DB on IaaS = full control, all the work. Base Database Service = managed VM DB systems, you still trigger patches and own the guest OS. Exadata Database Service / Cloud@Customer = engineered-system performance and scale, RAC built in. Autonomous Database = Oracle runs patching, tuning, backup, and scaling; you own schema, SQL, and data. Choose the least you need to manage that still meets performance, control, and compliance requirements.
The portfolio at a glance
| Service | Form | You manage | Oracle manages | Sweet spot |
|---|---|---|---|---|
| DB on Compute (IaaS) | You install Oracle DB on a VM/BM | Everything above the hypervisor | Infra only | Special versions/configs a managed service can't do |
| Base Database Service | Managed VM DB system (single node or 2-node RAC) | Guest OS, patch scheduling, schema | Provisioning, patch tooling, backup automation | Small-to-mid Oracle DBs wanting managed lifecycle |
| Exadata Database Service (ExaDB-D) | Exadata infra in OCI, VM clusters | Databases, patch scheduling, schema | Exadata hardware, storage cells, RAC substrate | Large, high-performance, consolidation, mission-critical |
| Exadata Cloud@Customer (ExaCC) | Exadata in your data center, OCI-managed | Databases, schema | Full stack, remotely operated by Oracle | Data residency / low-latency-to-on-prem with cloud ops |
| Autonomous Database | Self-driving (ATP/ADW/AJD) | Schema, SQL, data, access | Patching, tuning, backup, scaling, much of security | Most new OLTP/DW/JSON where you want minimal DBA toil |
Service deep dives
Base Database Service
Managed Oracle Database on VM DB systems. You choose Standard/Enterprise Edition, version, and shape; OCI provisions the VM(s), Grid Infrastructure, and database, and provides managed backup and patching workflows.
- Topologies: single-node VM DB system, or 2-node RAC VM DB system for node HA.
- You still own: the guest OS (patching the OS, though Oracle provides the DB patch bundles), schema, SQL, and when to apply quarterly patches.
- Backups: automatic backups to Object Storage with a retention you set; point-in-time restore.
- Data Guard: one-click association to a standby (same or cross-region).
Exadata Database Service & Cloud@Customer
The Exadata engineered system delivered as a cloud service: Scale-Out compute (DB servers) + intelligent storage cells with Smart Scan, storage indexes, and flash. Databases run as RAC across the VM cluster.
- ExaDB-D (Dedicated): Exadata infrastructure in an OCI region; you create VM clusters and databases on it. Elastic scaling of DB and storage servers.
- ExaCC: the same Exadata hardware placed in your data center, control plane in OCI, operated by Oracle - for data-residency or ultra-low-latency-to-on-prem needs.
- Why Exadata: Smart Scan offloads query processing to storage, huge consolidation density, consistent low latency, built-in RAC HA, and the top end of Oracle DB performance.
- You manage: databases, PDBs, patch scheduling (Oracle provides one-click patching of the infra and DB), schema, and tuning.
Autonomous Database
Self-managing Oracle Database. Oracle automates patching, tuning, backups, scaling, and much of the security. Workload flavors share one engine:
- ATP (Transaction Processing) - OLTP/mixed, optimized for many short transactions.
- ADW (Data Warehouse) - analytics, optimized for scans/aggregations, columnar.
- AJD (JSON Database) - document/JSON-centric, SODA APIs.
- Also APEX Service and (verify) transaction/AI-vector capabilities in 23ai/26ai.
Deployment models:
| | Serverless | Dedicated |
|---|---|---|
| Infra | Shared, Oracle-managed Exadata fleet | Exadata infra dedicated to you |
| Isolation | Logical | Physical - your own infra |
| Control | Least ops, fastest start | More control (maintenance windows, isolation policies) |
| Use for | Most workloads, dev/test, variable load | Regulated/large estates wanting private Autonomous |
- Autoscaling: OCPU and storage can auto-scale (e.g. up to 3x base OCPU) to absorb spikes; scale to/near zero for dev with auto-stop.
- Autonomous Data Guard: one-click managed standby (in-region or cross-region) with automatic failover options.
- Backups: fully automatic with a retention you choose; point-in-time restore.
Cases where Autonomous is usually not the right fit:
- Apps requiring specific unsupported features, custom OS packages, non-standard init parameters, or direct OS/filesystem access.
- Databases pinned to a version/patch level Autonomous won't run.
- Certified packaged apps (some EBS/Siebel configs) that require Base/Exadata, not Autonomous - check certification.
- Workloads needing very specific licensing or hard-partitioning arrangements.
RAC, CDB/PDB, and encryption
- RAC (Real Application Clusters): multiple DB instances on multiple nodes serving one database for node-level HA and scale. Built into Exadata and available as 2-node on Base Database. Survives a node failure with brief brownout; not a DR substitute (it is one site).
- CDB/PDB (multitenant): a Container Database hosts Pluggable Databases. PDBs are the unit of consolidation, cloning, and mobility - you can clone/relocate a PDB, and Autonomous/Exadata lean heavily on the multitenant model. Great for consolidating many databases with isolation.
- TDE (Transparent Data Encryption): encryption at rest is standard in OCI databases. Keys can be Oracle-managed or customer-managed via OCI Vault (or Oracle Key Vault / External Key Management). Encryption in transit uses TLS/native network encryption.
- Database Vault / Data Safe: Database Vault enforces separation of duties (even DBAs can't see app data) where licensed; Data Safe provides assessment, masking, and activity auditing (see Data tooling tab).
Database service decision table
| Workload | Recommended service | Reason | HA | DR | Ops responsibility | Cost lever |
|---|---|---|---|---|---|---|
| App OLTP (new build) | Autonomous (ATP) Serverless | Minimal toil, autoscale | Built-in | Autonomous Data Guard | Schema/SQL only | Autoscale + auto-stop dev |
| Data warehouse | Autonomous (ADW) | Columnar, scan-optimized, elastic | Built-in | Autonomous DG cross-region | Model + load | Scale OCPU to query load |
| Reporting DB | ADW or Exadata (if huge) | Read-heavy analytics | Built-in / RAC | DG / backup | Model + tuning | Scale for report windows |
| EBS database | Base Database or Exadata (per size/cert) | Certification + control needed | 2-node RAC | Data Guard | Patch + schema | Right-size OCPU; BYOL |
| Consolidated estate | Exadata (ExaDB-D) | Density + performance + PDB isolation | RAC | Data Guard | DBs + patch schedule | Consolidate; scale cells |
| Dev/test DB | Autonomous (auto-stop) or Base single-node | Cheapest managed option | N/A | Backup | Minimal | Auto-stop off-hours |
| AI / vector search | 23ai/26ai (Autonomous or Exadata) with AI Vector Search | In-DB vectors + SQL | Built-in/RAC | DG | Schema + embeddings | Scale for embedding jobs |
| JSON / document | Autonomous JSON (AJD) | SODA, JSON-native, low cost | Built-in | DG | Collections | Serverless autoscale |
| Special version/config | DB on Compute (IaaS) | Full control when managed can't | You build it | You build it | Everything | Right-size; BYOL |
How HA and DR work across services
| Service | In-region HA | DR |
|---|---|---|
| DB on IaaS | You build RAC / clustering / FD spread | You configure Data Guard + cross-region |
| Base Database | 2-node RAC option; FD placement | One-click Data Guard (in/cross-region) |
| Exadata (ExaDB-D/CC) | RAC across cluster nodes (built-in) | Data Guard / Active Data Guard to another Exadata |
| Autonomous | Built into the platform | Autonomous Data Guard (managed standby, optional auto-failover) |
Patching and backup differences
| Service | Patching | Backup |
|---|---|---|
| DB on IaaS | Entirely you (OS + Grid + DB) | You configure RMAN + destination |
| Base Database | Oracle provides patch bundles; you schedule/apply | Automatic to Object Storage; you set retention |
| Exadata | One-click infra + DB patching; you schedule maintenance | Managed backup to Object Storage / local; retention configurable |
| Autonomous | Fully automatic (near-zero downtime), you do nothing | Fully automatic + point-in-time; you set retention window |
Data tooling and operations
Data Safe - Security assessment, user assessment, data discovery, masking, and activity auditing for your OCI databases. Start here for DB security posture.
Database Management - Monitoring, performance, and fleet management for databases (managed and, via agent, on-prem). Performance Hub, SQL insights.
Operations Insights - Capacity planning, SQL/resource analytics, and forecasting across the DB fleet - warehouse-style analytics on your database performance data.
Performance Hub - Real-time and historical ASH/AWR-style performance analysis in the Console for OCI databases.
GoldenGate - Real-time replication and CDC - migrations with minimal downtime, active/active, and streaming data pipelines.
ZDM automates Data Guard-based migrations; Database Migration Service orchestrates online/offline migrations to OCI. See section 13.
Licensing: BYOL vs License Included
| | License Included (LI) | Bring Your Own License (BYOL) |
|---|---|---|
| What it is | The service price bundles the Oracle DB license | You apply existing on-prem Oracle licenses to the cloud service |
| Best when | New workloads, no existing licenses, want simplicity | You have unused/EE licenses and options; usually lower run cost |
| Watch for | Higher per-OCPU rate | Correct edition/options mapping, core factor, compliance with LMS |
Enterprise examples
Oracle E-Business Suite database Apps DBA
Typically Base Database Service (2-node RAC) or Exadata depending on size and certification. Data Guard cross-region for DR. App tier on compute behind a load balancer, shared APPL_TOP on FSS. BYOL for the DB. This is the classic Apps DBA lift-and-shift; verify EBS certification for the target DB service and version.
Enterprise data warehouse Analytics
ADW for most; Exadata if extreme scale or already Exadata-tuned. Object Storage data lake feeding it; Data Integration/GoldenGate for loads; Oracle Analytics Cloud on top. Autoscale for reporting windows.
Consolidated database platform Platform
Exadata (ExaDB-D) with many PDBs for isolation and density. Standardized patch windows, Data Guard to a second region, Data Safe for auditing across the estate, Operations Insights for capacity planning.
AI / vector search database AI
Oracle Database 23ai/26ai (Autonomous or Exadata) using AI Vector Search to store embeddings alongside relational data and run similarity search in SQL - the backbone of RAG over enterprise data (see section 12). Keep it governed: agents query through a read-only/serving layer, not ad-hoc against production.
7. Load Balancing and Traffic Management
The Layer-7 Load Balancer and Layer-4 Network Load Balancer, their listeners/backend sets/health checks, SSL handling, routing, and a disciplined approach to the most common failure: unhealthy backends.
Two products: the Load Balancer (LBaaS) is Layer 7 (HTTP/HTTPS-aware: SSL termination, path/host routing, cookies) with a flexible bandwidth shape; the Network Load Balancer (NLB) is Layer 4 (TCP/UDP, ultra-low latency, preserves source IP, scales to very high throughput). Both can be public or private. A load balancer is built from listeners → backend sets → backends with health checks. The number-one issue is a backend marked unhealthy because a security rule blocks the health-check probe.
Load Balancer vs Network Load Balancer
| | Load Balancer (LBaaS, L7) | Network Load Balancer (NLB, L4) |
|---|---|---|
| Layer | 7 (HTTP/HTTPS/TCP) | 4 (TCP/UDP/ICMP) |
| Features | SSL termination/E2E, path & host routing, cookie persistence, WAF integration | Pass-through, preserves client source IP, very low latency, extreme scale |
| Source IP | Rewritten (adds X-Forwarded-For) | Preserved (great for apps that need real client IP) |
| Sizing | Flexible bandwidth (min/max Mbps) | Scales automatically, high throughput |
| Use for | Web/API tiers needing HTTP intelligence | Non-HTTP, high-throughput, or source-IP-sensitive services (for example, database listeners) |
Both come in public (internet-facing, in a public subnet) and private (internal, in a private subnet) variants.
Anatomy: listeners, backend sets, backends, health checks
- Listener - the front-end port/protocol (e.g. 443/HTTPS). Handles SSL, routing rules, and WAF.
- Backend set - a group of backends plus the balancing policy (round robin, least connections, IP hash) and the health check definition.
- Backend - an actual server (IP:port), weighted, drained gracefully during maintenance.
- Health check - protocol/port/path/interval defining "healthy." Unhealthy backends are pulled from rotation.
- Session persistence - cookie-based (LB-generated or app cookie) or IP-based, to pin a client to a backend.
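The relationship between a backend set, its balancing policy, and health state can be sketched in a few lines of Python. This is a toy model for intuition only, not how the OCI service is implemented: the class and field names are hypothetical, weights and draining are ignored, and the IP-hash function is an arbitrary choice.

```python
import hashlib
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    address: str          # "ip:port" of the backend server
    healthy: bool = True  # flipped by the health check
    active: int = 0       # currently open connections


class BackendSet:
    """Toy backend set: backends + balancing policy + health state."""

    def __init__(self, backends, policy="round_robin"):
        self.backends = list(backends)
        self.policy = policy
        self._turn = count()

    def in_rotation(self):
        # Unhealthy backends are pulled from rotation.
        return [b for b in self.backends if b.healthy]

    def pick(self, client_ip=""):
        pool = self.in_rotation()
        if not pool:
            raise RuntimeError("no healthy backends (clients would see 502/503)")
        if self.policy == "round_robin":
            return pool[next(self._turn) % len(pool)]
        if self.policy == "least_connections":
            return min(pool, key=lambda b: b.active)
        if self.policy == "ip_hash":
            # Same client IP maps to the same backend while the pool is stable.
            digest = hashlib.sha256(client_ip.encode()).digest()
            return pool[int.from_bytes(digest[:8], "big") % len(pool)]
        raise ValueError(f"unknown policy: {self.policy}")
```

Marking every backend unhealthy makes `pick()` raise, which mirrors why a blocked health-check probe surfaces to clients as 502/503 even though the application itself is running.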
SSL termination, end-to-end SSL, certificates
| Mode | Where TLS terminates | Use when |
|---|---|---|
| SSL termination | At the LB; plaintext to backends | Offload crypto from backends; inspect/route on content; backends in trusted private subnet |
| End-to-end SSL | LB terminates then re-encrypts to backend | Compliance requiring encryption in transit all the way; LB still does L7 |
| SSL pass-through (NLB) | Not terminated - backend does TLS | Backend must own the cert / mTLS; L4 only |
Hostname and path-based routing
- Hostname-based routing - one LB serves multiple virtual hosts (api.example.com vs app.example.com) to different backend sets.
- Path-based routing - route by URL path (/api/* → API backends, /static/* → static backends).
- Rule sets - header manipulation, redirects (HTTP→HTTPS), access control by source.
- Logging - enable access and error logs to the Logging service for traffic analysis and troubleshooting.
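Hostname and path routing amount to a first-match lookup from (host, path) to a backend set. The sketch below illustrates that idea under stated assumptions: the routing table, the backend set names, and the glob-style matching are all hypothetical and are not the LB's actual rule syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: (hostname pattern, path pattern, backend set).
ROUTES = [
    ("api.example.com", "/*", "api-backends"),
    ("app.example.com", "/api/*", "api-backends"),
    ("app.example.com", "/static/*", "static-backends"),
]
DEFAULT_BACKEND_SET = "app-backends"


def route(host, path):
    """Return the backend set for a request; first matching rule wins."""
    host = host.split(":")[0].lower()  # drop any :port from the Host header
    for host_pattern, path_pattern, backend_set in ROUTES:
        if fnmatchcase(host, host_pattern) and fnmatchcase(path, path_pattern):
            return backend_set
    return DEFAULT_BACKEND_SET
```

Because the first match wins, rule order matters: a broad pattern placed above a narrow one would shadow it.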
When to use which
Load balancer troubleshooting
Symptoms
Backends "Critical/Warning" in the LB health page; clients get 502/503 or intermittent errors.
Likely causes (in order)
- Security rules block the probe: the backend NSG/security list must allow the health-check source (the LB subnet/NSG) on the backend port. This is the most common cause.
- Health check misconfigured: wrong port, path, protocol, or expected status code vs. what the app actually returns.
- App not listening / wrong bind: service down, or bound to 127.0.0.1 not 0.0.0.0.
- OS firewall on the backend dropping the probe.
- Route table issue between LB and backend subnet (rare within one VCN, common across peered VCNs).
- SSL mismatch: health check uses HTTPS but backend serves HTTP (or cert invalid in E2E mode).
Checks
- From a host in the LB subnet, curl the backend's health-check URL directly - does it return the expected code?
- Confirm the backend NSG allows the LB source on the port; run Network Path Analyzer LB→backend.
- Check the app is listening: ss -tlnp | grep <port>; check bind address.
- Review LB error logs (Logging service).
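The checks above can be partly scripted. The following is a minimal sketch using only the Python standard library, meant to be run from a host in the LB subnet. The function name, the messages, and the mapping from failure mode to likely cause are assumptions that follow the cause list in this section; they are not output of any OCI tool.

```python
import socket
import urllib.error
import urllib.request

# Bypass any proxy configured in the environment: a probe should go direct.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, expected_status=200, timeout=3.0):
    """Return (healthy, reason) for one health-check style GET."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the app answered, but with a 4xx/5xx
    except urllib.error.URLError as err:
        if isinstance(err.reason, ConnectionRefusedError):
            return False, "refused: app not listening, or bound to 127.0.0.1"
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return False, "timeout: security rule or OS firewall may be dropping the probe"
        return False, f"connection error: {err.reason}"
    except (socket.timeout, TimeoutError):
        return False, "timeout: connected, but the app did not respond in time"
    if status != expected_status:
        return False, f"status {status}, expected {expected_status}: check path/port/protocol"
    return True, "ok"
```

A refusal points at the application or its bind address, while a silent timeout points at a security rule or host firewall, which is why the two are reported separately.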
Console path
Networking > Load Balancers > (LB) > Backend Sets > Health; and Backend Sets > Health Check Policy.
Fix / prevention
Open the health-check port from the LB source in the backend NSG; align the health-check path/port/protocol/status; ensure the app binds to all interfaces; template the LB + NSG in Terraform so every environment matches.
- Health check on the wrong port (LB listener 443 but backend serves 8080 - the check must target the backend port).
- Forgetting the health-check probe source when writing backend NSG rules.
- Wrong listener protocol (TCP vs HTTP) - HTTP features/routing require an HTTP listener.
- App bound to localhost only, so it works on the host but not through the LB.
- Certificate expired or chain incomplete on end-to-end SSL backends.
- No backend spread across fault domains - a single FD loss takes all backends.
8. Security Deep Dive
Defense in depth on OCI: identity, network, data, and detective controls - plus concrete guidance for securing a production tenancy, databases, Object Storage, and public endpoints, ending in a production security checklist.
Security in OCI is layered: IAM (least-privilege policies, MFA, principals), network (private subnets, NSGs, no public IPs, WAF/Network Firewall), data (TDE with customer-managed keys in Vault, Object Storage retention, Data Safe), and detection (Cloud Guard, Security Zones, Vulnerability Scanning, Audit + Logging). Reduce public exposure, encrypt with keys you control, watch everything, and enforce guardrails that make the insecure thing impossible - not just discouraged.
Security design principles
- Least privilege - lowest verb, narrowest resource type, lowest compartment, conditions where useful. Separate admin/operator/read-only.
- Reduce blast radius - compartments, separate VCNs, separate keys per classification, break-glass isolation.
- Private by default - no public IPs unless required; front-facing only via LB/WAF; Bastion for admin access.
- Encrypt everywhere - at rest (TDE, block/object/file encryption) and in transit (TLS); customer-managed keys for sensitive data.
- Guardrails over guidelines - Security Zones and quotas that prevent misconfiguration beat policies that merely recommend it.
- Assume breach - detect and audit - Cloud Guard, Audit (immutable), Logging, alarms on anomalies. You cannot respond to what you cannot see.
The control layers
| Layer | Controls | Key services |
|---|---|---|
| Identity | Who can do what | IAM policies, Identity Domains, MFA, federation, instance/resource principals |
| Network | What can reach what | Private subnets, NSGs/security lists, gateways, Bastion, WAF, Network Firewall, DDoS protection |
| Data | Protect data at rest/in transit | Vault (KMS), TDE, Object Storage encryption/retention, Data Safe, certificates |
| Detective / posture | Find and stop misconfig & threats | Cloud Guard, Security Zones, Vulnerability Scanning, Audit, Logging, Logging Analytics |
Vault: keys, secrets, certificates
- Vault - managed key management (KMS/HSM-backed). Create keys for TDE, block/object/file encryption, and app-level crypto.
- Oracle-managed vs customer-managed keys - Oracle-managed is default and simplest; customer-managed lets you control rotation and revoke access to encrypted data by disabling the key. Vaults can be software or HSM-protected (higher assurance).
- Secrets - store DB passwords, API tokens, wallets as versioned secrets; apps fetch them at runtime via principals - never bake secrets into images or code.
- Certificates - managed TLS certs and CAs for load balancers and services, with rotation.
Cloud Guard, Security Zones, Vulnerability Scanning
Cloud Guard - Continuously detects misconfigurations and risky activity (public buckets, over-permissive policies, exposed ports, risky IAM) across the tenancy, scores them, and can auto-remediate via responders. Turn it on tenancy-wide.
Security Zones - Attach a policy-enforced recipe to a compartment that blocks non-compliant actions outright - e.g. no public subnets, no unencrypted volumes, no unapproved keys. Preventive, not just detective.
Vulnerability Scanning - Scans compute instances and container images for CVEs and open ports; schedule recurring scans and feed results into your patch process.
Bastion - Time-boxed, audited SSH/RDP sessions to private hosts without public IPs or a standing jump box. Sessions expire; access is policy-controlled.
Web Application Firewall (WAF) - OWASP protection, bot management, rate limiting, and geo/IP rules in front of public HTTP endpoints; attach to load balancers.
Network Firewall - Managed next-gen firewall (Palo Alto-based) in a hub VCN for stateful inspection, IPS/IDS, URL filtering, and TLS inspection of north-south/east-west traffic.
Data and database security
- Encryption at rest - on by default for block/boot/object/file storage and databases (TDE). Choose customer-managed keys for sensitive data.
- Encryption in transit - TLS for service endpoints; native network encryption / TLS for DB connections; ADB uses mTLS wallets.
- Data Safe - security assessment, user risk assessment, sensitive data discovery, dynamic/static masking (for non-prod copies), and DB activity auditing. The first tool to point at any production database.
- Database Vault (where licensed) - realms and separation of duties so even DBAs cannot read application data.
- Audit - the tenancy Audit service records all API calls (control-plane) immutably; combine with DB-level and OS logs.
How to secure specific things
Secure a production tenancy Tenancy
- Federate human access to the corporate IdP; enforce MFA for all users, especially admins. Reserve break-glass local admins, sealed and alarmed.
- Least-privilege IAM by role and compartment; no manage all-resources in tenancy for daily groups.
- Enable Cloud Guard tenancy-wide; put prod compartments in Security Zones; enable Vulnerability Scanning.
- Centralize network egress/inspection through a hub (Network Firewall); minimize public subnets.
- Enable Audit retention and stream logs to a central logging compartment via Service Connector Hub.
- Compartment quotas + budgets as guardrails; tag everything with data-classification.
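The "no manage all-resources in tenancy for daily groups" rule lends itself to an automated check over exported policy statements. The sketch below assumes the statements are already available as plain strings and that group names are simple unquoted tokens; the function name and the exempt-group default are hypothetical.

```python
import re

# Matches statements granting `manage all-resources` at tenancy scope.
BROAD = re.compile(r"\bto\s+manage\s+all-resources\s+in\s+tenancy\b", re.IGNORECASE)
GROUP = re.compile(r"\s*allow\s+group\s+(\S+)", re.IGNORECASE)


def broad_statements(statements, exempt_groups=("Administrators",)):
    """Return tenancy-wide manage-all statements not held by an exempt group."""
    findings = []
    for stmt in statements:
        if not BROAD.search(stmt):
            continue
        match = GROUP.match(stmt)
        group = match.group(1) if match else None
        if group not in exempt_groups:
            findings.append(stmt)
    return findings
```

Running a check like this in the pipeline that applies policies turns the guideline into a gate, in the spirit of guardrails over guidelines.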
Secure databases DB
- Private subnet only, no public IP; access via app tier / Bastion / private endpoints.
- TDE with customer-managed keys in Vault; rotate keys.
- Data Safe: run assessments, mask non-prod, audit activity.
- Least-privilege DB accounts; Database Vault for separation of duties on the most sensitive systems.
- Native network encryption / TLS for all client connections; store wallets/passwords in Vault secrets.
Secure Object Storage Storage
- Keep buckets private; Cloud Guard alarms on any public bucket.
- Enable versioning + retention rules on backup/compliance buckets (ransomware/accidental-delete recovery, WORM).
- Prefer IAM + instance principals over PARs; if PARs are needed, short lifetimes and an inventory.
- Customer-managed keys for sensitive buckets; access via Service Gateway, not internet.
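Keeping pre-authenticated requests (PARs) short-lived and inventoried is easier with a periodic audit. The sketch below works on an in-memory inventory rather than calling OCI; the record shape (a `name` and a timezone-aware `time_expires`) and the 7-day limit are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def audit_pars(pars, now, max_lifetime=timedelta(days=7)):
    """Split a PAR inventory into expired entries and over-long lifetimes."""
    report = {"expired": [], "too_long": []}
    for par in pars:
        remaining = par["time_expires"] - now
        if remaining <= timedelta(0):
            report["expired"].append(par["name"])    # candidates for cleanup
        elif remaining > max_lifetime:
            report["too_long"].append(par["name"])   # candidates for review
    return report


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    inventory = [  # hypothetical entries
        {"name": "vendor-upload", "time_expires": now + timedelta(days=2)},
        {"name": "legacy-export", "time_expires": now + timedelta(days=365)},
        {"name": "old-share", "time_expires": now - timedelta(days=30)},
    ]
    print(audit_pars(inventory, now))
```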
Secure public load balancers / reduce public IP exposure Network
- Only load balancers and Bastion live in public subnets; everything else private.
- WAF in front of public HTTP; NSGs restrict listener sources where possible; rate limiting and geo rules.
- Use reserved public IPs you can whitelist; terminate TLS at the LB with managed certs; consider E2E SSL for regulated data.
- Replace standing jump hosts with the Bastion service (time-boxed, audited).
- Audit the tenancy for stray public IPs regularly (Cloud Guard + a scheduled report).
Monitor suspicious activity Detect
- Cloud Guard problems → notifications; Audit + Logging → Logging Analytics for correlation.
- Alarms on: root/administrator logins, policy changes, new API keys, security-list changes, public IP creation, unusual Object Storage access.
- Break-glass user login should page someone every time.
Production OCI security checklist
- Human access federated to corporate IdP; MFA enforced for all users and especially admins.
- Break-glass local admins created, credentials sealed, every login alarmed.
- IAM least privilege: roles split (admin/operator/read-only), scoped to compartments, no broad manage all-resources in tenancy.
- Workloads use instance/resource principals - no long-lived API keys stored on hosts.
- Cloud Guard enabled tenancy-wide with notifications and responders configured.
- Production compartments enrolled in Security Zones with appropriate recipes.
- Vulnerability Scanning enabled for instances and container images.
- No unintended public IPs; databases and app tiers in private subnets; Bastion for admin access.
- WAF in front of public HTTP endpoints; Network Firewall inspecting hub traffic.
- All data encrypted at rest; customer-managed keys in Vault for sensitive data; keys rotated.
- Vault holds all secrets/wallets; nothing sensitive in images, code, or env files.
- Object Storage buckets private; versioning + retention on backup/compliance buckets.
- Data Safe assessments run; non-prod data masked; DB activity auditing on.
- Audit log retention configured; logs centralized via Service Connector Hub.
- Alarms on privilege/policy/network changes and anomalous access.
- Compartment quotas + budgets as guardrails; everything tagged with data-classification.
- DR and backups tested (restores verified), including key availability in the DR region.
Compliance basics
OCI maintains a broad set of certifications/attestations (SOC, ISO, PCI, HIPAA, FedRAMP/Gov in the relevant realms - verify current scope). Your responsibility is configuring services to meet your obligations: data residency (region/realm choice), encryption with controlled keys, access logging, and evidence. Cloud Guard and Security Zones help demonstrate continuous compliance; Audit provides the evidence trail.
9. Observability, Monitoring, and Operations
Metrics, alarms, logs, events, and the operational tooling to run OCI day-2 - including what to monitor per service, how to build useful alarms, and how to avoid drowning in noise.
Monitoring collects metrics and fires Alarms that publish to Notifications (email, PagerDuty, Functions, Slack via webhook). Logging centralizes service, audit, and custom logs; Logging Analytics analyzes them. Service Connector Hub is the pipe that moves logs/metrics/events between services (e.g. logs → Object Storage or SIEM). Events trigger automation on resource changes. Monitor the golden signals per tier, alarm on symptoms users feel, and route by severity to avoid alert fatigue.
The observability stack
| Service | Role |
|---|---|
| Monitoring (Metrics) | Time-series metrics per service (CPU, memory, IOPS, LB health, DB metrics). Query with MQL. |
| Alarms | Threshold/absence rules on metrics that fire notifications and can trigger automation. |
| Notifications (ONS) | Topics with subscriptions: email, HTTPS webhook, PagerDuty, Slack, Functions, SMS. |
| Logging | Central store for service logs (LB, VCN flow, WAF), audit logs, and custom application logs. |
| Logging Analytics | Parse, search, correlate, and visualize large log volumes; dashboards and ML-assisted analysis. |
| Events | Reacts to resource lifecycle changes (e.g. bucket created, instance terminated) → Functions/Notifications/Streaming. |
| Service Connector Hub | Moves data between sources and targets (logs → Object Storage, metrics → Functions, events → SIEM). |
| Audit | Immutable record of all API/control-plane activity in the tenancy. |
| Operations Insights / Database Management / APM | Deep DB analytics, fleet DB monitoring, and application performance tracing. |
| Management Agent / OS Management | Agent-based host metrics/logs and OS patch compliance. |
What to monitor per area
Compute - CPU utilization, memory utilization, load, instance status; per-process via agent. Watch for sustained saturation and crash loops.
Storage - Block volume IOPS/throughput vs. tier ceiling, latency; Object Storage request/error rates; FSS throughput. Volume at its I/O ceiling is a top hidden bottleneck.
Database - CPU, sessions, wait classes, storage used %, tablespace, backup success, Data Guard apply lag, blocked sessions. Use Performance Hub + Database Management.
Load balancer - Healthy/unhealthy backend count, active connections, response time, 5xx rate, bandwidth vs. shape.
Network - VPN tunnel state, FastConnect BGP/light levels, NAT/Service GW throughput, VCN flow-log rejects, DNS query health.
Security - Cloud Guard problems, audit anomalies (policy/key/public-IP changes), unusual Object Storage access, failed logins.
Building useful alarms
- Alarm on symptoms users feel (LB 5xx, unhealthy backends, DB down, high latency), not only causes.
- Use appropriate statistics and windows (e.g. mean over 5 min, not a single spike) and a sensible trigger duration to avoid flapping.
- Set severity and route: critical → page; warning → ticket/Slack; info → dashboard only.
- Use absence alarms for "should always report" signals (heartbeat, backup completion).
- Tag alarms by service/team so ownership is clear.
# MQL: alarm when average CPU across an instance exceeds 85% for 5 min
CpuUtilization[5m]{resourceId = "ocid1.instance.oc1..xxxx"}.mean() > 85
# MQL: alarm when any backend set has unhealthy backends
UnHealthyBackendServers[1m].max() > 0
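# Absence alarm: no metric reported for 10m (agent/host down).
# The detection period is passed to absent(); without it a much longer default applies (verify in current MQL docs).
CpuUtilization[1m].absent(10m)
Example alarms to implement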
| Alarm | Signal / condition | Severity |
|---|---|---|
| CPU high | Instance CPU mean > 85% for 5-10 min | Warning → Critical if sustained |
| Memory pressure | Memory utilization > 90% (agent metric) | Warning |
| Disk / filesystem full | Filesystem used > 85% | Warning → Critical > 95% |
| Block volume throughput ceiling | Throughput/IOPS near the tier limit sustained | Warning (capacity) |
| LB unhealthy backend | UnHealthyBackendServers.max() > 0 | Critical |
| Database CPU | DB CPU utilization > 90% sustained | Warning |
| Database storage | Storage used > 85% / tablespace threshold | Warning → Critical |
| Failed backup | Backup job failed / absence of success event | Critical |
| Data Guard apply lag | Apply/transport lag > RPO threshold | Critical |
| VPN tunnel down | Tunnel state != UP | Critical |
| FastConnect issue | BGP session down / light-level alarm | Critical |
| Object Storage unusual access | Spike in requests / unexpected public access (via logs) | Security review |
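The advice to alarm on a windowed statistic with a trigger duration, rather than on a single spike, can be illustrated with a small simulation. This is a simplified sketch, not the Monitoring service's evaluation logic: it uses a sliding per-minute window, and the function name, parameters, and routing map are hypothetical.

```python
from statistics import mean

# Hypothetical severity routing, following the guidance in this section.
SEVERITY_ROUTES = {"critical": "page", "warning": "ticket", "info": "dashboard"}


def alarm_states(samples, threshold, window=5, trigger=3):
    """Return "OK"/"FIRING" per minute for a windowed-mean alarm.

    samples: one metric value per minute.
    window:  minutes averaged into each evaluation (like a 5m mean).
    trigger: consecutive breaching evaluations required before firing.
    """
    states, breaches = [], 0
    for i in range(len(samples)):
        if i + 1 < window:
            states.append("OK")  # not enough data for a full window yet
            continue
        value = mean(samples[i + 1 - window : i + 1])
        breaches = breaches + 1 if value > threshold else 0
        states.append("FIRING" if breaches >= trigger else "OK")
    return states
```

With a threshold of 85, a one-minute spike to 100 on a baseline of 10 never fires, because the 5-minute mean peaks at 28. A sustained run at 95 fires only once the mean has stayed above the threshold for three consecutive evaluations.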
Avoiding noisy alerts
Operational dashboards and reports
- Build Console dashboards (and Logging Analytics dashboards) per audience: an on-call "is anything on fire" view, a service-owner view, and an exec/cost view.
- Use Service Connector Hub to ship logs/metrics to Object Storage for retention or to your enterprise SIEM.
- Turn on Cost and usage reports and pair with Budgets (section 14) for a spend dashboard.
- Resource Manager + drift detection to monitor infrastructure conformance.
10. Containers, Kubernetes, and Cloud Native
OKE, Container Instances, Functions, and the event-driven building blocks - when to use each, how networking and IAM work for containers, and reference patterns for microservices and serverless.
OKE (managed Kubernetes) for long-running microservices at scale; Container Instances for a single container without running a cluster; Functions (serverless, Fn-based) for short event-driven code; plain Compute when containers add no value. Around them: Container/Artifact Registry, API Gateway, Events, Streaming, Queue, Notifications, and the DevOps service for CI/CD. OKE services of type LoadBalancer auto-provision an OCI LB/NLB; workloads use resource-principal workload identity for IAM.
The cloud-native services
| Service | What it is | Use for |
|---|---|---|
| OKE (Kubernetes Engine) | Managed Kubernetes control plane + your worker nodes/node pools (or virtual nodes) | Microservices, platform teams, portable container workloads |
| Container Instances | Run containers directly, serverless, no cluster to manage | Single/few containers, batch, simple services without K8s overhead |
| Functions | Serverless FaaS (open-source Fn), event-triggered, scales to zero | Short event-driven tasks, glue, automation |
| Container Registry (OCIR) | Managed private Docker/OCI image registry | Storing/scanning images |
| Artifact Registry | Generic artifacts (not just images) | Build outputs, packages |
| API Gateway | Managed API front door: auth, routing, rate limiting, request/response transform | Exposing functions/microservices as APIs |
| Events | Reacts to resource changes → triggers Functions/Notifications/Streaming | Event-driven automation |
| Streaming | Kafka-compatible event streaming | High-throughput ingestion, pub/sub pipelines |
| Queue | Managed message queue (transactional, at-least-once) | Decoupling producers/consumers, work queues |
| Notifications | Pub/sub topics to email/webhook/Functions | Fan-out alerts and events |
| DevOps service | Managed Git repos, build & deployment pipelines (to OKE/Functions/Instances) | CI/CD inside OCI |
OKE deep dive
- Control plane - managed by Oracle (you don't run etcd/API servers). Choose Basic or Enhanced clusters (enhanced adds features like more add-ons, workload identity, higher limits, SLA).
- Node pools - groups of worker nodes (managed VMs/BM) with a shape and image; scale and upgrade per pool.
- Virtual nodes - serverless worker capacity where Oracle manages the node lifecycle (you don't patch/scale VMs); pods run without you managing the underlying node.
- Networking - OKE uses VCN-native pod networking (pods get VCN IPs) or flannel overlay; plan subnet CIDRs to have enough IPs for pods and nodes.
- Load balancing - a K8s Service of type LoadBalancer makes OKE provision an OCI Load Balancer (or NLB via annotation); Ingress controllers front HTTP routing.
- Storage - CSI driver provisions Block Volumes as PVs; FSS for shared RWX volumes.
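- IAM - workload identity - pods authenticate as a workload principal (cluster, namespace, and K8s service account), and IAM policies grant that principal access to OCI APIs, so no keys live in the pod. Node-level access via instance principals uses dynamic groups instead.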
# Expose a deployment via an OCI Network Load Balancer from Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    oci.oraclecloud.com/load-balancer-type: "nlb"   # or omit for L7 LB
    oci-network-load-balancer.oraclecloud.com/security-list-management-mode: "None"
spec:
  type: LoadBalancer
  selector: { app: web }
  ports: [ { port: 443, targetPort: 8443 } ]
OKE vs Functions vs Container Instances vs Compute
Networking, IAM, and security for containers
- Networking - clusters live in a VCN; control-plane/worker/pod/LB subnets with the right security rules. Private clusters keep the API endpoint off the internet.
- IAM - cluster admins via IAM policy; in-cluster RBAC for K8s objects; workload identity for pod → OCI API access.
- Image security - scan images in OCIR (Vulnerability Scanning); sign/verify; least-privilege pull secrets or instance principals.
- Runtime security - network policies, pod security standards, secrets from OCI Vault (via CSI/secret store), and Cloud Guard over the tenancy.
- Monitoring - cluster/node/pod metrics to Monitoring; container logs to Logging; APM for tracing.
Architecture patterns
- Microservices on OKE - deployments behind Ingress/LB, HPA autoscaling, service mesh for mTLS/traffic control, DevOps pipelines deploying images from OCIR, secrets from Vault, workload identity for OCI access.
- Serverless function triggered by Object Storage event - as diagrammed; ideal for image processing, ETL kick-off, validation.
- Event-driven architecture - Events + Streaming + Functions + Queue + Notifications for decoupled, resilient pipelines.
- Container deployment pipeline - DevOps build pipeline (from managed Git or GitHub) → image to OCIR (scanned) → deployment pipeline to OKE/Functions/Container Instances, gated by approvals.
Troubleshooting: OKE pod not starting
Likely causes
- Pending: no schedulable node capacity, or out of pod IPs in the pod subnet, or resource requests too large, or taints/affinity mismatch.
- ImagePullBackOff: bad image path, missing OCIR pull permission (instance principal/secret), or private registry unreachable (no Service Gateway/NAT route).
- CrashLoopBackOff: app failing on start - config/secret missing, DB unreachable, bad liveness probe.
- Cannot reach OCI APIs: workload identity/dynamic-group/policy not set up.
Checks
kubectl describe pod <pod> # events explain Pending/ImagePull
kubectl logs <pod> --previous # crash reason
kubectl get nodes -o wide # capacity / readiness
oci ce cluster get --cluster-id <ocid>
Fix / prevention
Scale the node pool / fix pod-subnet sizing; grant OCIR pull via policy; fix probes and config; wire workload identity. Prevent with capacity headroom, image scanning gates, and correct subnet CIDR sizing.
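Pod-subnet exhaustion is simple arithmetic that can be checked before a cluster is built. The sketch below assumes VCN-native pod networking (one VCN IP per pod), three addresses reserved per subnet, and a worst case where every node runs its maximum pod count. The function name and the example figures are illustrative, so check the real per-node pod limit for your shape.

```python
import ipaddress


def pod_ip_headroom(pod_subnet_cidr, nodes, max_pods_per_node, reserved=3):
    """Usable pod IPs minus worst-case demand; negative means too small."""
    subnet = ipaddress.ip_network(pod_subnet_cidr)
    usable = subnet.num_addresses - reserved
    return usable - nodes * max_pods_per_node


if __name__ == "__main__":
    for cidr in ("10.0.10.0/24", "10.0.8.0/22"):
        print(cidr, pod_ip_headroom(cidr, nodes=10, max_pods_per_node=31))
```

Under these assumptions a /24 pod subnet is 57 addresses short for 10 nodes at 31 pods each, while a /22 leaves 711 spare, which is the kind of headroom worth planning for before node pools scale.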
Checks
- Event rule condition matches the resource/action; rule enabled; correct compartment.
- Function has a policy allowing the trigger (Events/API Gateway invoke); resource principal permissions for what the function does.
- Function deployed to the right app; concurrency/timeout limits; cold-start not mistaken for failure.
- Check function logs (Logging) and the invocation metrics.
Fix / prevention
Correct the Event rule filter and the invoke policy; verify the function's own IAM (resource principal); add logging and a test invocation to your deploy pipeline.
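A frequent reason a function never fires is that the Event rule condition does not match the event that was actually emitted. The matcher below is a simplified model for reasoning about that, not the Events service's implementation: it treats lists as "any of these values" and nested objects recursively, and it ignores wildcard support. The bucket names are hypothetical.

```python
def matches(condition, event):
    """True if every key in the rule condition matches the event.

    Lists mean "any of these values"; nested dicts are matched recursively.
    """
    for key, expected in condition.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(expected, list):
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


if __name__ == "__main__":
    rule = {
        "eventType": ["com.oraclecloud.objectstorage.createobject"],
        "data": {"additionalDetails": {"bucketName": ["uploads"]}},
    }
    event = {
        "eventType": "com.oraclecloud.objectstorage.createobject",
        "data": {"additionalDetails": {"bucketName": "uploads-dev"}},
    }
    print(matches(rule, event))  # False: the bucket name differs
```

Comparing the rule against a captured sample event in this way quickly shows whether the filter, rather than the function or its policy, is the problem.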
11. Analytics, Data, and Integration
The services that move, catalog, transform, and analyze data on OCI, and the common data-lake, warehouse, streaming, and CDC patterns built from them.
Land raw data in Object Storage (the lake), move/transform it with Data Integration / Data Flow (Spark) / GoldenGate (CDC), catalog it with Data Catalog, serve analytics from Autonomous Data Warehouse, visualize with Oracle Analytics Cloud, and build models with Data Science. Streaming/Queue handle real-time ingestion; OpenSearch handles search/log analytics.
The services
| Service | Role | Analogy for an Oracle person |
|---|---|---|
| Oracle Analytics Cloud (OAC) | BI, dashboards, self-service analytics, augmented analytics | OBIEE / modern BI, managed |
| Data Integration | Visual ETL/ELT with data flows and pipelines | ODI-style integration, cloud-native |
| Data Flow | Fully-managed Apache Spark (serverless) | Run Spark jobs without managing a cluster |
| Data Catalog | Metadata harvesting, glossary, data discovery/lineage | Enterprise data dictionary for the lake |
| Data Science | Notebooks, model training/deployment, MLOps | Managed JupyterLab + model catalog |
| GoldenGate | Real-time replication & change data capture | The GoldenGate you know, as a service |
| Streaming | Kafka-compatible event streaming | Managed Kafka |
| Queue | Managed message queue | AQ-style decoupling, managed |
| Service Connector Hub | Move data between OCI services | The plumbing/glue |
| API Gateway | Managed API front door | API management layer |
| Integration Cloud (OIC) | Application integration, prebuilt SaaS adapters, process automation | SOA Suite / iPaaS for connecting apps (EBS, Fusion, SaaS) |
| Big Data Service | Managed Hadoop/Spark clusters | Cloudera-style big data, where still needed |
| OpenSearch | Managed search & log analytics | Elasticsearch/Kibana, managed |
Common data patterns
| Pattern | Built from | Notes |
|---|---|---|
| Data lake | Object Storage (raw/curated/consumption zones) + Data Catalog + Data Flow | Schema-on-read; ADW queries external data |
| Data warehouse | ADW + Data Integration + OAC | Curated, modeled, governed; serves BI |
| Streaming ingestion | Streaming (Kafka) → Functions/Data Flow → Object Storage/ADW | Real-time events into the lake/warehouse |
| Batch ingestion | Data Integration / Data Flow scheduled loads | Nightly/periodic bulk loads |
| CDC replication | GoldenGate from source DB → target (ADW/DB/Object Storage) | Near-real-time, low source impact; migrations & live feeds |
| Reporting architecture | ADW (or read replica) + OAC dashboards | Offload reporting off the OLTP system |
| AI-ready data | Curated lake + Data Science + 23ai AI Vector Search | Feed embeddings/models; see section 12 |
Reference architecture: lakehouse + BI
12. AI, ML, and Generative AI on OCI
OCI's AI stack - Generative AI, Agents, AI Vector Search in the database, Data Science, and the pretrained AI services - plus the enterprise RAG patterns and the governance guardrails that separate a demo from something you can run on real business data.
OCI Generative AI serves foundation models (chat, embeddings) via API, with dedicated AI clusters for isolation and fine-tuning. Generative AI Agents add managed RAG over your data. AI Vector Search in Oracle Database 23ai/26ai stores embeddings next to relational data so you do similarity search in SQL - the backbone of enterprise RAG. Around them sit Data Science (build/deploy models) and pretrained AI services (Language, Vision, Speech, Document Understanding, Anomaly Detection, Forecasting). The hard part is not the model - it is governing what the model can touch.
The AI services
| Service | What it does | Use for |
|---|---|---|
| OCI Generative AI | Managed LLM inference (chat, embeddings, rerank); dedicated AI clusters; fine-tuning | Chatbots, summarization, extraction, RAG generation |
| Generative AI Agents | Managed agent/RAG service that grounds answers on your data sources | Chat-with-your-docs/data with less plumbing |
| AI Vector Search (in DB 23ai/26ai) | VECTOR data type + similarity search in SQL, alongside relational data | Enterprise RAG retrieval, semantic search |
| Data Science | Notebooks, model training, model catalog, model deployment, MLOps, AI Quick Actions | Custom ML, deploy open models, feature engineering |
| Language | Pretrained NLP: sentiment, entities, key phrases, translation, PII detection | Text analytics without training |
| Vision | Image classification, object detection, OCR | Document/image analysis |
| Speech | Speech-to-text (and related) | Transcription, voice input |
| Document Understanding | Extract text/tables/key-values from documents | Invoice/form processing pipelines |
| Anomaly Detection | Multivariate anomaly models | Ops/fraud/equipment monitoring |
| Forecasting | Time-series forecasting | Demand/capacity planning |
| Digital Assistant | Conversational assistant/chatbot platform | Structured skills + LLM-backed chat |
| OpenSearch | Keyword + vector/hybrid search | Search backends, hybrid retrieval |
AI Vector Search in Oracle Database
Oracle Database 23ai/26ai adds a native VECTOR data type and vector indexes so you store embeddings in the same database as your relational data and run similarity search with SQL. For an Oracle shop this is significant: no separate vector database to operate, and you can combine semantic search with normal SQL filters, joins, and existing security.
-- Store document chunks with their embedding vectors
CREATE TABLE doc_chunks (
id NUMBER PRIMARY KEY,
doc_id NUMBER,
chunk CLOB,
embedding VECTOR(1024, FLOAT32)
);
-- Retrieve the 5 most similar chunks to a query embedding (RAG retrieval)
SELECT id, doc_id, chunk
FROM doc_chunks
ORDER BY VECTOR_DISTANCE(embedding, :query_vec, COSINE)
FETCH FIRST 5 ROWS ONLY;
Combine VECTOR_DISTANCE with ordinary WHERE filters (e.g. restrict to documents a user is entitled to) so retrieval respects row-level entitlements. Size vector indexes and memory for your embedding dimension and volume.
RAG architecture on OCI
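The retrieval step of a RAG flow, with the entitlement filter applied before ranking, can be sketched in a few lines. This is an in-memory illustration of the logic only, not how the database executes it: the `Chunk` class, `retrieve` function, and the sample document IDs are hypothetical, and in practice the filter and distance ordering would run in SQL as shown earlier.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    id: int
    doc_id: int
    text: str
    embedding: list[float]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def retrieve(chunks, query_vec, allowed_doc_ids, k=5):
    """Filter to entitled documents first, then rank by distance."""
    candidates = [c for c in chunks if c.doc_id in allowed_doc_ids]
    ranked = sorted(candidates, key=lambda c: cosine_distance(c.embedding, query_vec))
    return ranked[:k]


chunks = [
    Chunk(1, 1, "public policy text", [1.0, 0.0]),
    Chunk(2, 2, "restricted finance text", [0.9, 0.1]),
    Chunk(3, 1, "public FAQ text", [0.0, 1.0]),
]
# A user entitled only to document 1 never sees chunk 2, however similar it is.
print([c.id for c in retrieve(chunks, [1.0, 0.0], allowed_doc_ids={1})])
```

Filtering before ranking matters: ranking first and filtering afterwards can return fewer than k results and makes it easier to leak restricted text into logs or prompts.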
Enterprise patterns
| Pattern | How | Watch out for |
|---|---|---|
| Chat with documents | RAG over Object Storage docs + Vector Search + GenAI | Chunking quality; stale index; citations |
| Chat with database | Retrieve from curated views; generate grounded answers | Never expose raw prod OLTP; use a serving layer |
| Natural language to SQL | LLM proposes SQL against a governed schema/catalog | Validate/parameterize; read-only; guard against dynamic SQL |
| RAG with Object Storage + Vector Search | Standard enterprise RAG stack | Entitlement filtering at retrieval |
| AI assistant for operations | RAG over runbooks/logs; suggest actions | Human-in-the-loop before any change |
| AI assistant for business users | Governed metrics/curated data + NL interface | Answer only from curated, validated data |
| AI over EBS / app data | Read-only reporting layer / extracts, not live OLTP | Performance impact + data governance |
Governance and security for GenAI
- Serving layer, always - agents and LLMs call a governed API/service that enforces authentication, authorization, rate limits, input/output validation, and logging. They do not touch data stores directly.
- Entitlement-aware retrieval - filter retrieved context to what the requesting user is allowed to see (row/document-level), so RAG cannot leak data across users.
- Private connectivity - keep model and data traffic on private endpoints / the OCI backbone; dedicated AI clusters for isolation where required.
- Credential hygiene - secrets in Vault, access via principals; the model never sees raw credentials.
- Auditability - log prompts, retrieved context IDs, and responses (subject to privacy rules) so answers are explainable and reviewable.
- Human validation - outputs that drive business decisions or changes are reviewed before use; agents that act get approval gates.
Warnings (read before connecting AI to enterprise data)
- Do not connect LLM agents directly to production OLTP databases without a governed serving layer. Live transactional systems are not a query playground for a probabilistic agent.
- Avoid uncontrolled dynamic SQL. NL-to-SQL must produce validated, parameterized, read-only queries against a curated schema - never free-form DML against production.
- Protect credentials. No database passwords, wallets, or API keys in prompts, code, or agent memory. Use Vault + principals.
- Add auditability. If you cannot show what data an answer came from and who asked, you cannot defend it to security or compliance.
- Use curated datasets, APIs, or read-only reporting layers as the AI's data surface - not raw production tables.
- Validate output before business use. Treat model output as a draft/suggestion until a human or a deterministic check confirms it.
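The "validated, parameterized, read-only" rule for NL-to-SQL can be enforced with a deterministic check in the serving layer before anything reaches the database. The sketch below is deliberately conservative and is an assumption-laden illustration: the view names in `ALLOWED_TABLES`, the keyword list, and the function name are hypothetical, and a keyword check complements (never replaces) a read-only database account.

```python
import re

# Hypothetical curated, read-only views that the assistant may query.
ALLOWED_TABLES = {"sales_summary_v", "customer_dim_v"}

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|"
    r"execute|begin|declare|call)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-z_][\w$#.]*)", re.IGNORECASE)


def validate_generated_sql(sql: str) -> list[str]:
    """Return a list of problems; an empty list means the query passed."""
    problems = []
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        problems.append("multiple statements")
    if not re.match(r"(select|with)\b", stmt, re.IGNORECASE):
        problems.append("not a SELECT")
    if FORBIDDEN.search(stmt):
        problems.append("forbidden keyword")
    if "'" in stmt:
        # Literals suggest user input was spliced in; require bind variables.
        problems.append("literal value; use bind variables")
    for table in TABLE_REF.findall(stmt):
        if table.lower() not in ALLOWED_TABLES:
            problems.append(f"table not in allowlist: {table.lower()}")
    return problems


print(validate_generated_sql("SELECT region, total FROM sales_summary_v WHERE year = :yr"))
print(validate_generated_sql("DELETE FROM sales_summary_v"))
```

The check rejects some legitimate queries (for example, CTE names are treated as unknown tables). For a probabilistic generator that is the intended bias: fail closed and let a human widen the allowlist.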
Cost and endpoint considerations
- On-demand vs dedicated AI clusters - on-demand inference is pay-per-use and simple; dedicated clusters give isolation, predictable throughput, and fine-tuning, at a fixed cost. Match to volume and isolation needs.
- Embedding jobs are a real cost at scale - batch and cache embeddings; re-embed only changed content.
- Token/context size drives cost and latency - retrieve the smallest sufficient context, not everything.
- Region/model availability - models and dedicated-cluster availability vary by region; verify before designing.
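"Re-embed only changed content" usually comes down to keying the cache on a hash of the chunk text. A minimal sketch follows; `embed_batch` is a hypothetical callable standing in for whatever embedding endpoint you use, and the cache here is a plain dictionary rather than a real store.

```python
import hashlib


def content_key(text: str) -> str:
    """Stable cache key derived from the chunk text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_changed_only(chunks: dict, cache: dict, embed_batch) -> dict:
    """Return chunk_id -> embedding, calling embed_batch only for new text.

    chunks: chunk_id -> text
    cache:  content hash -> embedding (updated in place)
    embed_batch: hypothetical function taking a list of texts and
                 returning one embedding per text, in order.
    """
    missing = {}
    for text in chunks.values():
        key = content_key(text)
        if key not in cache and key not in missing:
            missing[key] = text
    if missing:
        keys = list(missing)
        vectors = embed_batch([missing[k] for k in keys])
        cache.update(zip(keys, vectors))
    return {cid: cache[content_key(text)] for cid, text in chunks.items()}


calls = []


def fake_embed(texts):
    """Stand-in embedder that records each batch it is asked to embed."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = {}
first = embed_changed_only({"a": "hello", "b": "world!"}, cache, fake_embed)
second = embed_changed_only({"a": "hello", "b": "changed"}, cache, fake_embed)
print(calls)  # the second batch contains only the changed chunk
```

Because the key is the content hash rather than the chunk ID, unchanged text that moves between documents is also served from the cache.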
13. Migration and Disaster Recovery
Getting workloads into OCI, and keeping them recoverable once there - the migration tooling, the DR patterns by tier, and how RTO/RPO drive the architecture and cost.
Migrate databases with Zero Downtime Migration (ZDM) or Database Migration Service (both typically use Data Guard or GoldenGate under the hood), and compute by rebuilding from images or replicating volumes. For DR, choose per tier: backup-and-restore (cheapest, slow), pilot light, warm standby, or active/active (fastest, priciest). Full Stack Disaster Recovery orchestrates cross-region failover of the whole stack. Your RTO/RPO targets pick the pattern; DR you never test is not DR.
Migration tooling
| Move | Tooling | Notes |
|---|---|---|
| Database (low downtime) | ZDM, Database Migration Service, GoldenGate | Data Guard-based physical or logical; GoldenGate for heterogeneous/cross-version |
| Database (offline) | RMAN / Data Pump / cross-endian | Simple, requires downtime window |
| Compute / VM | Rebuild from image, or import; block/boot volume replication | Prefer rebuild-from-golden-image over lift-and-shift of disks |
| Bulk data | Data Transfer appliance/disk, or online (multipart upload, rclone) | Physical transfer when network is impractical |
| Files | rsync/rclone to FSS or Object Storage; FSS replication | Preserve permissions for app filesystems |
| Applications (EBS etc.) | Oracle Cloud lift tooling + DB migration + app-tier rebuild | Whole-stack; validate certification |
Database migration paths
| Method | Downtime | Best for |
|---|---|---|
| Zero Downtime Migration (ZDM) | Near-zero | Same-platform Oracle-to-OCI using Data Guard; automated orchestration |
| Database Migration Service (DMS) | Low to near-zero | Managed online/offline migrations to OCI targets |
| GoldenGate | Near-zero | Heterogeneous, cross-version, cross-endian, active during cutover |
| Data Pump | Downtime window | Logical export/import; schema-level; cross-version |
| RMAN / cross-endian | Downtime window | Physical restore/transport to OCI |
DR patterns
| Pattern | Standby state | RTO | RPO | Cost |
|---|---|---|---|---|
| Backup & restore | Backups in DR region; nothing running | Hours+ | Since last backup | Lowest |
| Pilot light | Core (DB standby) running, app tier off | Tens of minutes | Small (DG lag) | Low |
| Warm standby | Scaled-down full stack running | Minutes | Small | Medium |
| Active / passive | Full-size standby, promote on failover | Minutes | Near-zero (SYNC) | High |
| Active / active | Both regions serving | Near-zero | Near-zero | Highest + complexity |
Building blocks: Data Guard / Active Data Guard and Autonomous Data Guard for databases; Object/Block/File replication for storage; cross-region image copy for compute; DNS failover and load balancers for traffic redirection; Full Stack Disaster Recovery to orchestrate the whole failover as a plan you can run and test.
RTO and RPO - the two numbers that drive everything
- RTO (Recovery Time Objective) - how long you can be down. Drives standby readiness (running vs. cold) and automation.
- RPO (Recovery Point Objective) - how much data you can lose. Drives replication mode: async (small lag), SYNC/Far Sync (zero data loss), or backup interval (large).
- Zero-data-loss (RPO 0) needs SYNC transport (Data Guard Maximum Protection/Availability, often via a Far Sync instance) - which requires low latency between sites and has performance implications. Confirm the network and the trade-off.
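The way RTO and RPO pick a DR pattern can be expressed as a lookup over the patterns table, cheapest first. The numeric ceilings below are assumptions chosen to mirror the qualitative values in the table ("Hours+", "Tens of minutes", "Minutes"); replace them with figures measured in your own DR tests.

```python
from typing import NamedTuple, Optional


class DrPattern(NamedTuple):
    name: str
    rto_minutes: float  # assumed achievable recovery time
    rpo_minutes: float  # assumed achievable data-loss window


# Ordered from lowest to highest cost, as in the DR patterns table.
# All numbers are illustrative assumptions, not Oracle figures.
PATTERNS = [
    DrPattern("Backup & restore", 480, 1440),
    DrPattern("Pilot light", 60, 5),
    DrPattern("Warm standby", 15, 5),
    DrPattern("Active / passive", 15, 0),
    DrPattern("Active / active", 1, 0),
]


def cheapest_pattern(rto_target: float, rpo_target: float) -> Optional[DrPattern]:
    """Return the first (cheapest) pattern that meets both targets."""
    for pattern in PATTERNS:
        if pattern.rto_minutes <= rto_target and pattern.rpo_minutes <= rpo_target:
            return pattern
    return None


print(cheapest_pattern(rto_target=120, rpo_target=15))
print(cheapest_pattern(rto_target=30, rpo_target=0))
```

Running the lookup per tier rather than per application tends to show that only a few systems justify the expensive patterns.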
Architecture examples
- On-prem Oracle DB → OCI: ZDM/Data Guard over FastConnect, cut over in a window, keep on-prem as fallback briefly.
- EBS → OCI: migrate DB (Base/Exadata) + rebuild app tier + shared FSS; DNS cutover; validate certification.
- VM → OCI: custom image import/rebuild; block volume replication for large data disks.
- Cross-region DR (app): warm standby + DNS/LB failover + storage replication.
- Cross-region DR (database): Active Data Guard (or Autonomous Data Guard) with configurable failover.
- Backup-based DR: cross-region backup copies; rebuild in DR on demand (lowest cost, highest RTO).
- GoldenGate-based DR: when heterogeneous or active/active-ish read serving is required.
DR testing
- Standby database is applying and within RPO (monitor apply lag).
- Encryption keys/wallets present and usable in the DR region.
- App tier can start and connect in DR; config points to DR endpoints.
- DNS/LB failover mechanism tested and time-measured.
- Storage (buckets/volumes/FSS) replication within RPO.
- Runbook current; roles assigned; switchover rehearsed on a schedule.
- Capacity available in DR (reservations if RTO is tight).
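The first checklist item (standby lag within RPO) is easy to automate once the lag value is in hand. The sketch below assumes the lag arrives as a day-to-second interval string such as `+00 00:01:23`, the form Data Guard statistics views typically report; how you fetch that value (SQL query, monitoring metric) is left out, and the function names are hypothetical.

```python
import re

# Assumed input format: "+DD HH:MM:SS" (day-to-second interval).
LAG_RE = re.compile(r"^\+?(\d+) (\d{2}):(\d{2}):(\d{2})$")


def lag_seconds(value: str) -> int:
    """Convert an interval string like '+00 00:01:23' to seconds."""
    match = LAG_RE.match(value.strip())
    if not match:
        raise ValueError(f"unrecognized lag format: {value!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def within_rpo(lag_value: str, rpo_seconds: int) -> bool:
    """True when the reported lag does not exceed the RPO target."""
    return lag_seconds(lag_value) <= rpo_seconds


print(within_rpo("+00 00:01:23", rpo_seconds=300))
print(within_rpo("+00 00:10:00", rpo_seconds=300))
```

The same parser can be pointed at the transport lag value, which is the one that bounds data loss if the primary is lost outright.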
14. Cost Management and Governance
How OCI charges, the tools to track and cap spend, the governance model (landing zones, quotas, budgets), and a concrete monthly cost-review checklist.
OCI bills mainly by OCPU-hours, storage GB, and network egress, purchased via Universal Credits. Track with Cost Analysis and Usage Reports, cap with Budgets (alert) and Quotas (block), attribute with tags and compartments. The biggest levers: BYOL for databases, right-sizing OCPUs, stopping/scheduling non-prod, correct block-volume tiers, Object Storage lifecycle, and killing orphaned resources. Governance = a landing zone + quotas + budgets + tags so spend is controlled by design.
Pricing basics
| Dimension | Charged on | Notes |
|---|---|---|
| Compute | OCPU-hours (+ memory for flex), per shape | Arm often cheaper per unit; preemptible cheaper still |
| Block storage | GB-month x performance (VPU) | Higher VPU costs more; auto-tune down when idle |
| Object storage | GB-month by tier + requests + retrieval (IA/Archive) | Archive cheapest to store, has retrieval cost/delay |
| Network | Internet egress (ingress free); some cross-region | Keep OCI-service traffic on Service Gateway to avoid internet egress |
| Database | OCPU-hours + storage + edition/options; LI vs BYOL | Usually your largest line item; BYOL is the big lever |
| Load balancer | Bandwidth shape (LBaaS) / usage | Size the flexible bandwidth to real need |
- Universal Credits - a consumption model where credits apply across eligible OCI services; annual commitments earn better rates than pure pay-as-you-go.
- BYOL vs License Included - covered in section 6; BYOL typically the biggest database savings if you own licenses.
Cost tracking tools
| Tool | Does |
|---|---|
| Cost Analysis | Console visualizations of spend by compartment, service, tag, time. |
| Usage/Cost Reports | Detailed CSV usage dropped to an Oracle-owned bucket for your own analysis/BI. |
| Budgets | Track spend against a target on a compartment or tag; alert at thresholds. Do not block. |
| Compartment Quotas | Policy-like statements that block resource creation - the hard cap. |
| Cloud Advisor | Recommendations: rightsizing, idle resources, performance/cost/availability. |
Governance model and landing zones
A landing zone is a codified baseline for a well-governed tenancy: compartment topology, IAM groups/policies, network hub, logging/audit, Security Zones, Cloud Guard, budgets, quotas, and tag defaults - deployed as Terraform so it is repeatable and reviewable. Oracle publishes CIS-aligned landing zone reference architectures/Terraform to start from.
- Compartments for isolation and cost attribution (section 1).
- Quotas + budgets per compartment for control.
- Tags + tag defaults for attribution and automation.
- Security Zones + Cloud Guard for preventive/detective guardrails (section 8).
- Everything as code so environments are consistent and auditable.
Cost optimization examples
| Action | Typical saving | Effort |
|---|---|---|
| Stop non-prod compute nights/weekends (schedule) | High - up to ~65-70% of that compute | Low |
| Right-size over-provisioned shapes (Cloud Advisor) | High | Low |
| Correct block-volume performance tier (don't over-VPU) | Medium | Low |
| Object Storage lifecycle to Archive / delete | Medium-High for large stores | Low |
| Database BYOL instead of License Included | Very High (Oracle shops) | Medium |
| Autonomous autoscale + auto-stop dev | High for variable/non-prod | Low |
| Exadata capacity planning / consolidation | High at scale | High |
| Remove unused reserved public IPs | Small each, adds up | Low |
| Delete orphaned block volumes / old backups / snapshots | Medium | Low |
| Move scale-out tiers to Arm shapes | Medium (price/perf) | Medium |
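The "~65-70%" figure for scheduling non-prod compute follows from the share of the week the instances are stopped. A small worked calculation, assuming compute is billed only while running and that the schedules shown are hypothetical examples:

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute hours avoided by a run schedule."""
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


# Weekdays only, 12 hours a day versus 10 hours a day.
print(round(schedule_saving(12, 5), 3))
print(round(schedule_saving(10, 5), 3))
```

The saving applies to the compute line only; attached boot and block volumes continue to accrue storage charges while an instance is stopped.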
Monthly OCI cost review checklist
- Review Cost Analysis month-over-month by compartment and service; investigate any spike.
- Check each budget: which compartments/tags are over or trending over target.
- Act on Cloud Advisor rightsizing and idle-resource recommendations.
- Confirm non-prod stop/scale schedules ran (no Dev/Test running 24x7 by accident).
- Find and terminate orphaned block volumes, unattached boot volumes, and idle instances.
- Delete stale volume backups, DB backups beyond retention, and old snapshots.
- Review Object Storage: are lifecycle rules moving cold data to Archive / deleting expired data?
- Remove unused reserved public IPs and idle load balancers.
- Verify database licensing posture (BYOL applied where owned; OCPU counts right-sized).
- Check block-volume performance tiers vs. actual IOPS - downgrade over-provisioned volumes.
- Review egress charges - is OCI-service traffic correctly on the Service Gateway?
- Confirm every resource is tagged (CostCenter/Environment/Owner) for attribution.
- Validate quotas still reflect intent (no team quietly raised a cap).
- Reconcile Universal Credits burn-down vs. commitment; forecast to renewal.
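Two of the checklist items (untagged resources and orphaned volumes) lend themselves to a scripted pass over a resource inventory. The sketch below works on plain dictionaries and is illustrative only: the record layout (`name`, `type`, `tags`, `attached`) is a hypothetical export format, not an OCI API response.

```python
REQUIRED_TAGS = ("CostCenter", "Environment", "Owner")
VOLUME_TYPES = ("block-volume", "boot-volume")


def missing_tags(resource: dict) -> list:
    """Names of required tags that are absent or empty on a resource."""
    tags = resource.get("tags", {})
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


def review(resources: list) -> dict:
    """Summarize untagged resources and unattached volumes."""
    untagged = {}
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            untagged[resource["name"]] = gaps
    orphaned = [
        r["name"]
        for r in resources
        if r["type"] in VOLUME_TYPES and not r.get("attached", False)
    ]
    return {"untagged": untagged, "orphaned_volumes": orphaned}


inventory = [
    {"name": "app01", "type": "instance",
     "tags": {"CostCenter": "1001", "Environment": "prod", "Owner": "team-a"}},
    {"name": "vol-old", "type": "block-volume", "attached": False,
     "tags": {"CostCenter": "1001"}},
    {"name": "vol-data", "type": "block-volume", "attached": True,
     "tags": {"CostCenter": "1001", "Environment": "prod", "Owner": "team-a"}},
]
print(review(inventory))
```

Feeding the result into the monthly review keeps the checklist evidence-based instead of relying on a manual Console walk-through.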
15. Enterprise Architecture Patterns
Reference blueprints for real OCI deployments. Each card gives the business case, services, traffic flow, and the security / HA / DR / monitoring / cost / risk dimensions so you can adapt rather than start from a blank page.
Every pattern lists the same dimensions. Start from the one closest to your workload, then apply the relevant service deep dives (sections 3-12) and the DR/cost guidance (13-14). The recurring backbone is: public LB + WAF → private app tier across fault domains → managed database → Service Gateway for OCI services → centralized logging → cross-region DR.
Foundational three-tier (reference backbone)
| Business case | Standard internal/external web or enterprise app needing HA and controlled exposure. |
|---|---|
| Services | VCN + public/private subnets, WAF, Load Balancer, Compute (instance pool), Base/Autonomous DB, Service Gateway, NAT, Vault, Monitoring/Logging. |
| Traffic flow | User → WAF → public LB → app tier (private) → DB (private); app → OCI services via Service Gateway; egress via NAT. |
| Security | Only LB/Bastion public; NSGs per tier (app→db on 1521 only); TDE + customer keys; secrets in Vault; Cloud Guard + Security Zone. |
| HA | App pool + DB nodes spread across fault domains (and ADs where available); LB health checks. |
| DR | Active Data Guard + warm app tier in a second region; DNS/LB failover. |
| Monitoring | LB backend health, app CPU/mem, DB metrics, alarms → Notifications; central logs. |
| Cost | Right-size app pool + autoscale; BYOL DB; schedule non-prod. |
| Risks / mistakes | Backends unhealthy from missing health-check NSG rule; DB in public subnet; no FD spread; secrets in images. |
Pattern library
Each pattern below follows the same dimension set. Expand the ones relevant to you.
Simple web application Small
| Case | Low-complexity site/app, cost-sensitive. |
|---|---|
| Services | 1 VCN, public LB (or public instance), Arm compute, Autonomous DB (auto-stop for dev), Object Storage for assets. |
| Flow | User → LB → app → ADB; static assets from Object Storage. |
| Security/HA/DR | WAF on LB; 2 instances across FDs; ADB backups + optional Autonomous DG; Cloud Guard. |
| Cost / risk | Arm + auto-stop = cheap. Risk: single instance / no backups if cut too far. |
Highly available application HA
| Case | App that must survive node, rack, and (where possible) AD failure. |
|---|---|
| Services | Instance pool + autoscaling across FDs/ADs, LB, RAC DB or ADB, FSS for shared state, Vault. |
| Flow / HA | LB spreads to app pool; DB RAC for node HA; storage replicated; no single FD holds all. |
| DR / monitoring | Active Data Guard cross-region; alarms on backend health + DB; Full Stack DR plan. |
| Risk | State stored on a single node instead of shared/managed store; untested failover. |
EBS on OCI Apps DBA
| Case | Migrate/run Oracle E-Business Suite on OCI. |
|---|---|
| Services | Base/Exadata DB (2-node RAC), app-tier compute pool behind LB, shared APPL_TOP on FSS, Vault, Object Storage backups, Bastion. |
| Flow | Users → LB → EBS app nodes (shared FSS) → DB; concurrent/admin tiers as needed. |
| Security/HA/DR | Private DB, TDE; app across FDs; Data Guard DR; validated EBS certification for the DB service/version. |
| Cost / risk | BYOL DB; right-size nodes. Risk: unsupported DB service/version; FSS export too open. |
Oracle database on OCI (managed) DB
| Case | Run a production Oracle DB with managed lifecycle. |
|---|---|
| Services | Base Database (2-node RAC) or Exadata; Data Guard; Object Storage backups via Service Gateway; Data Safe; Vault keys. |
| Security/HA/DR | Private subnet; customer-managed TDE; RAC HA; Data Guard cross-region DR; audited via Data Safe. |
| Cost / risk | BYOL; right-size OCPU. Risk: untested restores; keys absent in DR region. |
Exadata Cloud Service platform Scale
| Case | Large, high-performance, or consolidated Oracle estate. |
|---|---|
| Services | ExaDB-D VM clusters + many PDBs; Data Guard to a second Exadata; Ops Insights; Data Safe. |
| Security/HA/DR | RAC built-in; Active Data Guard; PDB isolation; standardized patch windows. |
| Cost / risk | Consolidate for density; BYOL. Risk: over-provisioning for small estates; noisy-neighbor PDBs without resource management. |
Autonomous Database application Low-ops
| Case | New app wanting minimal DBA toil, elastic scale. |
|---|---|
| Services | ATP (Serverless), private endpoint, app tier on OKE/compute, Vault for wallet, APEX optional. |
| Security/HA/DR | Private endpoint (no public); mTLS wallet from Vault; Autonomous DG; auto-backups. |
| Cost / risk | Autoscale + auto-stop dev. Risk: assuming Autonomous fits a DB needing OS/feature access it can't provide. |
Data warehouse & data lake Analytics
| Case | Enterprise analytics / BI on curated + raw data. |
|---|---|
| Services | Object Storage lake (zones) + Data Integration/Data Flow + ADW + Data Catalog + OAC; GoldenGate CDC feeds. |
| Security/HA/DR | Private access; Data Catalog governance; ADW auto-backups + DG; masked non-prod. |
| Cost / risk | Scale ADW to query windows; lifecycle cold lake data. Risk: ungoverned lake ("data swamp"). |
Kubernetes platform Cloud native
| Case | Container platform for many microservices with CI/CD. |
|---|---|
| Services | OKE (enhanced), node pools/virtual nodes, OCIR (scanned), API Gateway, DevOps pipelines, Vault, LB/NLB, service mesh. |
| Security/HA/DR | Private cluster; workload identity; network policies; multi-FD node pools; backup of cluster state/config; images in OCIR replicated. |
| Cost / risk | Right-size pools; Arm nodes. Risk: pod-subnet IP exhaustion; over-privileged workload identity. |
Private enterprise application (no internet exposure) Regulated
| Case | Internal-only app reachable from on-prem, no public footprint. |
|---|---|
| Services | Private subnets only, private LB, FastConnect/VPN via DRG, Service Gateway, private endpoints, Bastion for admin. |
| Security/HA/DR | No public IPs/IGW; access from corporate network only; Network Firewall inspection; cross-region DR over private links. |
| Cost / risk | FastConnect cost. Risk: CIDR overlap with on-prem; DNS forwarding gaps. |
Hybrid cloud Hybrid
| Case | Workloads split across on-prem and OCI with shared connectivity. |
|---|---|
| Services | DRG hub + FastConnect (primary) + VPN (backup), hub-and-spoke VCNs, hybrid DNS, Network Firewall. |
| Security/HA/DR | Redundant links with BGP failover; centralized inspection; consistent IAM/tagging. |
| Cost / risk | FastConnect + egress. Risk: CIDR overlap; single link with no backup; asymmetric routing. |
Multi-region DR DR
| Case | Business-critical stack needing regional resilience. |
|---|---|
| Services | Mirrored compartments/VCNs in 2 regions, Active Data Guard, storage replication, capacity reservation, Full Stack DR, DNS/LB failover. |
| Security/HA/DR | Keys present in both regions; RTO/RPO per tier; rehearsed switchover/failover. |
| Cost / risk | Standby cost vs. RTO. Risk: untested DR; missing DR keys; capacity unavailable at failover. |
Secure landing zone Governance
| Case | Governed foundation before workloads land. |
|---|---|
| Services | Compartment topology, IAM roles/policies, network hub, Vault, centralized logging/audit, Cloud Guard, Security Zones, budgets, quotas, tag defaults - all as Terraform (CIS-aligned). |
| Security/HA/DR | Preventive guardrails; least privilege; audit centralized; break-glass isolated. |
| Cost / risk | Quotas/budgets cap spend. Risk: skipping the landing zone and retrofitting governance later. |
GenAI with private enterprise data AI
| Case | RAG/assistant over internal documents and data, governed. |
|---|---|
| Services | Object Storage (docs) + DB 23ai Vector Search + OCI Generative AI (or Agents) behind a serving API on OKE/Functions + API Gateway + Vault + Logging. |
| Flow | Query → serving layer (authz + guardrails) → entitlement-filtered vector retrieval → grounded generation → audited response. |
| Security/HA/DR | Private endpoints; no direct model→OLTP; secrets in Vault; full audit; validated output. |
| Cost / risk | Cache embeddings; right-size dedicated clusters. Risk: ungoverned data access, dynamic SQL, credential leakage (see section 12 warnings). |
- Databases or app tiers exposed in public subnets "just to get it working."
- No fault-domain spread - a single rack event takes the whole "HA" tier.
- Health-check NSG rules forgotten, so LB backends are unhealthy on day one.
- DR designed but never tested; keys/secrets missing in the DR region.
- Secrets baked into images/code instead of Vault + principals.
- No centralized logging/audit until an incident needs it.
- CIDR overlap discovered during the hybrid connectivity phase.
- Governance (landing zone, quotas, tags) skipped and retrofitted painfully later.
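CIDR overlap, the recurring hybrid risk in the patterns above, can be caught at design time by checking every planned range against every other. A minimal sketch using the standard `ipaddress` module; the network names and ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_cidrs: dict) -> list:
    """Return pairs of network names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


plan = {
    "onprem-dc1": "10.0.0.0/16",
    "oci-hub": "10.1.0.0/16",
    "oci-spoke-app": "10.0.128.0/20",  # collides with the on-prem range
}
print(find_overlaps(plan))
```

Running a check like this in the Terraform review pipeline surfaces the collision before FastConnect or VPN routing is built on top of it.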
16. Troubleshooting Guides
A runbook catalog for the failures you will actually hit. Each entry lists symptoms, likely causes, checks (with Console path and CLI where useful), fixes, and prevention. Deeper versions of some runbooks live in their service sections; this is the consolidated index.
Compute & access
Symptoms: SSH times out or refuses. Likely causes: security rule/route missing (22), no public IP or wrong path, instance stopped/boot failed, wrong SSH key/user, OS firewall/fail2ban. Checks: instance state Running; serial console for boot; security rules + route allow 22 from your IP; correct user (opc); firewalld. Console: Compute > Instance > Attached VNICs; Instance > Console connection. Fix: open 22 in NSG, assign IP/route, reset via console. Prevention: use Bastion service (no public IPs), standard security templates.
oci compute instance get --instance-id <ocid> --query 'data."lifecycle-state"'
oci compute instance action --instance-id <ocid> --action SOFTRESET
Symptoms: instance up but OS unreachable/services down. Likely causes: bad /etc/fstab mount, kernel/driver issue, full root disk, failed cloud-init. Checks: serial console boot output; single-user mode. Fix: detach boot volume, attach to a rescue instance, correct fstab/config, reattach. Prevention: test image changes in non-prod; keep boot volume backups.
Symptoms: slow app, CPU alarm. Likely causes: undersized shape, runaway process, missing autoscale, batch overlap. Checks: Monitoring CPU trend; on host top/pidstat. Fix: resize (reboot) or scale the pool; fix the offending process. Prevention: autoscaling + right-sizing from sustained metrics.
Symptoms: writes fail, app errors. Likely causes: logs/temp growth, volume undersized, no rotation. Checks: df -h, du -sh. Fix: clean/rotate; grow the block volume online then extend the filesystem (growpart/resize2fs/xfs_growfs). Prevention: filesystem-used alarm at 85%, log rotation, lifecycle to Object Storage.
Symptoms: attached volume not visible in OS. Likely causes: iSCSI login steps not run (iSCSI attach), device not mounted, wrong attach type. Checks: Console shows Attached; run the iSCSI commands from the Console's attach details; lsblk. Fix: run the iSCSI iscsiadm commands or use paravirtualized attach; mount and add to fstab (use UUID). Prevention: prefer paravirtualized attach; automate mount via cloud-init.
Storage
Symptoms: 403/404 on bucket/object. Likely causes: missing IAM policy for the user/dynamic group, wrong compartment, condition (bucket name) unmet, using API key where principal expected, expired PAR. Checks: policy grants read/manage objects in the bucket's compartment; dynamic-group membership; Audit for the denied call. Fix: add least-privilege policy; use instance principal. Prevention: standard bucket-access policy per workload; avoid PAR sprawl.
oci os object list --bucket-name <name> --auth instance_principal
Symptoms: NFS mount hangs or permission denied. Likely causes: NFS ports (111/2048-2050) blocked in NSG/security list, export options exclude the client CIDR, root squash, wrong mount-target IP/path. Checks: security rules for NFS between client subnet and mount target; export options; showmount/mount -v. Fix: open NFS ports, add client CIDR to export options, correct mount path. Prevention: FSS module with standard ports + export options.
Network
Symptoms: private instance can't reach Object Storage/ADB privately. Likely causes: no Service Gateway, missing route to it for the OSN CIDR label, wrong service label, egress rule missing. Checks: VCN > Service Gateway exists + correct label; subnet route has a rule (target = SGW, dest = services CIDR label). Fix: create SGW, add route + egress rule. Prevention: bake SGW into the private-subnet Terraform module.
Symptoms: private instance can't reach the internet (patch repos, external APIs). Likely causes: no 0.0.0.0/0 route to NAT, egress rule missing, OS firewall. Checks: route table; security egress; curl test. Fix: add NAT route + egress. Prevention: standard subnet module; keep OCI-service traffic on SGW, not NAT.
Symptoms: on-prem/OCI connectivity lost, tunnel state != UP. Likely causes: Phase 1/2 mismatch (encryption, PSK, lifetimes), CPE public IP change, BGP/static route misconfig, on-prem firewall. Checks: Site-to-Site VPN > tunnel status + logs; compare IKE params both ends. Fix: align IKE/IPSec parameters, correct routes/BGP. Prevention: redundant tunnels; FastConnect primary; alarm on tunnel state.
Symptoms: primary link degraded/down, BGP session down. Likely causes: physical/optical fault, BGP config, provider issue, MTU/MACsec. Checks: FastConnect state + light levels; BGP session; provider status. Fix: engage provider/DC; fail over to VPN backup. Prevention: redundant FastConnect + VPN backup with BGP failover; alarms.
Symptoms: names don't resolve. Likely causes: DHCP options point at wrong resolver, no hybrid forwarding, missing private-zone record/view. Checks: nslookup name 169.254.169.254; /etc/resolv.conf. Fix: fix DHCP options, add resolver forwarders both ways, add records to the private zone. Prevention: standard DNS/resolver design per VCN.
Load balancer & certificates
Symptoms: backends Critical, 502/503. Likely causes (order): backend NSG doesn't allow the LB health-check source on the port; wrong health-check port/path/protocol; app not listening or bound to localhost; OS firewall; SSL mismatch. Checks: curl the backend health URL from the LB subnet; Path Analyzer LB→backend; ss -tlnp. Fix: allow the probe source; align health-check; bind to 0.0.0.0. Prevention: template LB + NSG together. (Full runbook in section 7.)
Symptoms: TLS errors, browser warnings, handshake failures. Likely causes: expired cert, incomplete chain, SNI/hostname mismatch, E2E backend cert invalid. Checks: openssl s_client -connect host:443 -servername host; cert dates/chain. Fix: renew/replace via Certificates service; include full chain; match hostname. Prevention: managed certs with rotation + expiry alarms.
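To turn the openssl check above into an expiry number you can alarm on, something like the following may help. It is a sketch under assumptions: GNU date (for -d parsing) and openssl are available, and the function names cert_enddate and cert_days_left are illustrative.

```bash
#!/usr/bin/env bash
# Print the leaf certificate's end date for host[:port], with SNI set
# through -servername, in the form "notAfter=Jan  2 00:00:00 2030 GMT".
cert_enddate() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Whole days between a reference time and a notAfter line.
# $1: line as printed by `openssl x509 -noout -enddate`
# $2: reference time in epoch seconds (defaults to now)
cert_days_left() {
  local end="${1#notAfter=}" now="${2:-$(date +%s)}"
  local end_epoch
  end_epoch=$(date -u -d "$end" +%s) || return 1   # GNU date assumed
  echo $(( (end_epoch - now) / 86400 ))
}
```

For example, cert_days_left "$(cert_enddate app.example.com)" prints the remaining days for a hypothetical host; a negative value means the certificate has already expired.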
Database
Symptoms: backup job error / no recent backup. Likely causes: Object Storage access (Service Gateway/policy), space, wallet/credential expiry, RMAN config (IaaS), retention conflict. Checks: backup job logs; Object Storage reachability; on IaaS RMAN LIST BACKUP. Fix: restore connectivity/credentials; correct RMAN/backup config. Prevention: alarm on backup success absence; periodic restore tests.
Symptoms: slow queries, high waits. Likely causes: bad plans, missing indexes, I/O ceiling (block volume/storage), CPU saturation, contention. Checks: Performance Hub / AWR / ASH; wait classes; volume IOPS vs. tier. Fix: tune SQL/indexes, raise storage performance, scale OCPU, resource management. Prevention: Ops Insights capacity planning; auto-indexing (Autonomous); baseline plans.
Symptoms: app can't connect to ADB. Likely causes: wallet expired/wrong, mTLS vs TLS mismatch, private endpoint NSG rules, ACL blocks client IP (public), TNS alias wrong. Checks: wallet validity; ADB network config (private endpoint vs. public + ACL); NSG for the PE. Fix: refresh wallet from Vault, fix ACL/NSG, correct connection string. Prevention: store wallet in Vault; use private endpoints; automate wallet rotation.
IAM
Symptoms: 404/authz error though resource exists. Likely causes: no policy grants it, wrong compartment, weak verb/wrong family, unmet condition, wrong domain qualifier. Checks: group membership + policies; compartment path; Audit for the denied request. Fix: add least-privilege statement at the correct compartment. Prevention: policies in Terraform; access matrix per compartment. (Full runbook in section 2.)
Symptoms: instance/function gets authz errors using a principal. Likely causes: instance not matched by the DG rule, no policy grants the DG, wrong matching attribute/compartment, using API key not principal. Checks: DG matching rule vs. instance OCID/compartment; policy targets dynamic-group; --auth instance_principal. Fix: correct the rule, add the DG policy. Prevention: standard DG + policy per workload type.
Containers & automation
Symptoms: Pending / ImagePullBackOff / CrashLoopBackOff. Likely causes: no capacity or pod-IP exhaustion; OCIR pull permission missing; app config/secret missing; bad probes; workload identity not set. Checks: kubectl describe pod, kubectl logs --previous, node capacity, pod-subnet free IPs. Fix: scale pool/fix subnet size; grant OCIR pull; fix config/probes/identity. Prevention: capacity headroom, correct CIDR sizing, image-scan gates. (Section 10.)
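For the pod-IP exhaustion cause above, a quick sizing calculation is often enough to confirm the suspicion. The sketch below assumes VCN-native pod networking, where each pod takes an address from the pod subnet, and uses the fact that OCI reserves three addresses in every subnet (the first two and the last). The function name is illustrative.

```bash
#!/usr/bin/env bash
# Usable IP addresses in an OCI subnet of the given prefix length.
# OCI reserves three addresses per subnet: the first two and the last.
subnet_usable_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 3 ))
}

# Example: compare pod-subnet capacity with the number of running pods.
# The kubectl line is left commented out because it needs cluster access.
# pods=$(kubectl get pods -A --no-headers | wc -l)
# echo "capacity: $(subnet_usable_ips 24), pods: ${pods}"
```

A /24 pod subnet yields 253 usable addresses, so a cluster approaching that many pods will leave new pods Pending regardless of node capacity.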
Symptoms: event doesn't invoke the function. Likely causes: Event rule filter mismatch/disabled; invoke policy missing; function IAM (resource principal) missing; timeout/concurrency; cold start mistaken for failure. Checks: Event rule condition; function logs + invocation metrics; policies. Fix: correct the rule + invoke policy; verify resource-principal permissions. Prevention: test invocation in the deploy pipeline; logging on.
Observability
Symptoms: expected alert never arrives. Likely causes: wrong metric/namespace/dimension in the query, threshold/window never met, alarm disabled, Notifications topic has no confirmed subscription, suppression active. Checks: run the MQL in Metrics Explorer; alarm state history; topic subscription confirmed. Fix: correct the query/threshold; confirm subscription. Prevention: test alarms by forcing a condition; monitor the monitors (absence alarms).
Symptoms: expected logs absent when investigating. Likely causes: service log (LB/VCN flow/WAF) not enabled, agent not installed/configured, Service Connector not wired, wrong compartment/log group, retention expired. Checks: Logging > Logs for the resource; agent status; connector state. Fix: enable the log, install/config agent, wire the connector. Prevention: enable service logs + Audit tenancy-wide from the start; centralize via Service Connector Hub. (Section 9.)
17. OCI CLI and Terraform Examples
Practical, copy-friendly automation: CLI setup and common commands, the Terraform provider, Resource Manager, and clean examples for VCN, compute, buckets, IAM, alarms, and tags - plus state and structure practices.
The CLI (Python) authenticates with an API key from a ~/.oci/config profile or, on OCI compute, with an instance principal. Terraform (Oracle-maintained provider) is the way to build production infrastructure; run it locally, in a pipeline, or in Resource Manager (managed state + plan/apply + drift). Keep state remote and locked, structure code into reusable modules, and separate environments by workspace/backend + tfvars - never by copy-paste.
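Before passing --profile to a command, it can help to confirm which profiles a config file actually defines and which region each one targets. The helpers below are a small sketch with hypothetical names (oci_profiles, oci_profile_value); they only read the INI-style file and never call the CLI.

```bash
#!/usr/bin/env bash
# List the profile names defined in an OCI CLI config file.
# Usage: oci_profiles [config-file]   (defaults to ~/.oci/config)
oci_profiles() {
  local cfg="${1:-$HOME/.oci/config}"
  # Profile headers look like [DEFAULT] or [PROD] at the start of a line.
  sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$cfg"
}

# Print the value of one key (for example region) inside a given profile.
# Usage: oci_profile_value <profile> <key> [config-file]
oci_profile_value() {
  local profile="$1" key="$2" cfg="${3:-$HOME/.oci/config}"
  awk -v p="[$profile]" -v k="$key" '
    /^\[/ { in_p = ($0 == p); next }          # track which profile we are in
    in_p {
      split($0, kv, "=")
      gsub(/[ \t]/, "", kv[1])               # tolerate "key = value"
      if (kv[1] == k) { sub(/^[^=]*=[ \t]*/, ""); print; exit }
    }
  ' "$cfg"
}
```

For instance, oci_profile_value PROD region prints the region a PROD profile would target, which is a cheap guard against running a command in the wrong region.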
OCI CLI setup, profiles, authentication
# Install (Linux/macOS)
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
# Interactive setup: creates ~/.oci/config + API signing keypair
oci setup config
# ~/.oci/config with two profiles
[DEFAULT]
user=ocid1.user.oc1..aaaa...
fingerprint=aa:bb:cc:...
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-ashburn-1
[PROD]
user=ocid1.user.oc1..bbbb...
fingerprint=dd:ee:ff:...
key_file=~/.oci/prod_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-phoenix-1
# Use a profile; or use instance principal (no keys on disk)
oci os ns get --profile PROD
oci compute instance list -c <compartment> --auth instance_principal
On OCI compute instances, prefer --auth instance_principal over an API key in ~/.oci/config. In Cloud Shell you are already authenticated as your Console identity - no key needed. Reserve API keys for off-cloud automation and store the PEM securely (never in git).
Common CLI commands
# Identity / discovery
oci iam compartment list --all
oci iam region-subscription list
oci os ns get # Object Storage namespace
# Compute
oci compute instance list -c <compartment-ocid> --output table
oci compute instance action --instance-id <ocid> --action STOP
oci compute image list -c <compartment-ocid> --operating-system "Oracle Linux"
# Networking
oci network vcn list -c <compartment-ocid>
oci network security-list get --security-list-id <ocid>
# Object Storage
oci os object bulk-upload -bn <bucket> --src-dir ./data --auth instance_principal
oci os object list -bn <bucket>
# Database
oci db system list -c <compartment-ocid>
oci db autonomous-database list -c <compartment-ocid>
# Query + filter with JMESPath
oci compute instance list -c <ocid> \
  --query "data[?\"lifecycle-state\"=='RUNNING'].{name:\"display-name\",ocid:id}" --output table
Terraform provider setup
# provider.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    oci = { source = "oracle/oci", version = "~> 6.0" } # verify current major
  }
}
# Auth via config file profile (local runs)
provider "oci" {
  config_file_profile = "PROD"
  region              = var.region
}
# In Resource Manager / instance-principal runs, use:
# provider "oci" {
#   auth   = "InstancePrincipal"
#   region = var.region
# }
Create a VCN with subnets and a gateway
resource "oci_core_vcn" "main" {
  compartment_id = var.compartment_ocid
  cidr_blocks    = ["10.10.0.0/20"]
  display_name   = "app-vcn"
  dns_label      = "appvcn"
}
resource "oci_core_nat_gateway" "nat" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "app-nat"
}
resource "oci_core_service_gateway" "sgw" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  services { service_id = data.oci_core_services.all.services[0].id }
  display_name   = "app-sgw"
}
resource "oci_core_route_table" "private_rt" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.main.id
  display_name   = "private-rt"
  # Internet-bound traffic leaves through the NAT gateway
  route_rules {
    destination       = "0.0.0.0/0"
    network_entity_id = oci_core_nat_gateway.nat.id
  }
  # OCI service traffic stays on the Service Gateway
  route_rules {
    destination_type  = "SERVICE_CIDR_BLOCK"
    destination       = data.oci_core_services.all.services[0].cidr_block
    network_entity_id = oci_core_service_gateway.sgw.id
  }
}
resource "oci_core_subnet" "app" {
  compartment_id             = var.compartment_ocid
  vcn_id                     = oci_core_vcn.main.id
  cidr_block                 = "10.10.2.0/24"
  display_name               = "app-private"
  route_table_id             = oci_core_route_table.private_rt.id
  prohibit_public_ip_on_vnic = true # private subnet
  dns_label                  = "app"
}
# Filter to the regional "All ... Services In Oracle Services Network" entry so
# that services[0] is deterministic rather than whichever service sorts first.
data "oci_core_services" "all" {
  filter {
    name   = "name"
    values = ["All .* Services In Oracle Services Network"]
    regex  = true
  }
}
Create a compute instance
resource "oci_core_instance" "app" {
  compartment_id      = var.compartment_ocid
  availability_domain = var.ad
  display_name        = "app-01"
  shape               = "VM.Standard.E5.Flex"
  shape_config {
    ocpus         = 2
    memory_in_gbs = 32
  }
  create_vnic_details {
    subnet_id        = oci_core_subnet.app.id
    assign_public_ip = false
    nsg_ids          = [oci_core_network_security_group.app.id] # NSG defined elsewhere
  }
  source_details {
    source_type = "image"
    source_id   = var.image_ocid
  }
  metadata = {
    ssh_authorized_keys = file("~/.ssh/id_rsa.pub")
    user_data           = base64encode(file("cloud-init.yaml"))
  }
}
Create an Object Storage bucket
data "oci_objectstorage_namespace" "ns" { compartment_id = var.tenancy_ocid }
resource "oci_objectstorage_bucket" "backups" {
  compartment_id = var.compartment_ocid
  namespace      = data.oci_objectstorage_namespace.ns.namespace
  name           = "db-backups"
  access_type    = "NoPublicAccess"
  versioning     = "Enabled"
  # kms_key_id   = oci_kms_key.data.id # customer-managed key
}
Create IAM group, dynamic group, and policy
resource "oci_identity_dynamic_group" "app_servers" {
  compartment_id = var.tenancy_ocid
  name           = "app-servers"
  description    = "App instances in the app compartment"
  matching_rule  = "ALL {instance.compartment.id = '${var.compartment_ocid}'}"
}
resource "oci_identity_policy" "app_bucket_read" {
  compartment_id = var.compartment_ocid
  name           = "app-bucket-read"
  description    = "App servers read backups bucket"
  statements = [
    "Allow dynamic-group app-servers to read objects in compartment id ${var.compartment_ocid} where target.bucket.name = 'db-backups'"
  ]
}
Create a monitoring alarm
resource "oci_ons_notification_topic" "ops" {
  compartment_id = var.compartment_ocid
  name           = "ops-alerts"
}
resource "oci_monitoring_alarm" "cpu_high" {
  compartment_id        = var.compartment_ocid
  display_name          = "app-cpu-high"
  metric_compartment_id = var.compartment_ocid
  namespace             = "oci_computeagent"
  query                 = "CpuUtilization[5m].mean() > 85"
  severity              = "WARNING"
  destinations          = [oci_ons_notification_topic.ops.id]
  is_enabled            = true
  body                  = "App CPU above 85% for 5 minutes."
  pending_duration      = "PT5M"
}
Tag namespace, tag, and tag default
resource "oci_identity_tag_namespace" "finance" {
  compartment_id = var.tenancy_ocid
  name           = "Finance"
  description    = "Cost attribution tags"
}
resource "oci_identity_tag" "cost_center" {
  tag_namespace_id = oci_identity_tag_namespace.finance.id
  name             = "CostCenter"
  description      = "Charge-back cost center"
  validator {
    validator_type = "ENUM"
    values         = ["CC-4412", "CC-5501", "CC-7788"]
  }
}
# Auto-apply Environment tag to every new resource in a compartment.
# Assumes an Environment tag (oci_identity_tag.environment) is defined
# elsewhere, in the same way as cost_center above.
resource "oci_identity_tag_default" "env_default" {
  compartment_id    = var.compartment_ocid
  tag_definition_id = oci_identity_tag.environment.id
  value             = "prod"
}
State management and structure
- Remote, locked state: use Resource Manager (state managed for you) or an Object Storage/other backend with locking. Never keep prod state only on a laptop; never commit state (it holds secrets).
- Modular structure: reusable modules (network, compute, db, iam, monitoring) composed per environment - not copy-pasted stacks.
- Environment separation: separate state per env (workspaces or separate backends/stacks) driven by dev.tfvars/prod.tfvars; separate compartments and, ideally, separate credentials/pipelines.
- Drift detection: Resource Manager (or scheduled plan) to catch manual Console changes.
- No secrets in code: reference Vault secrets/keys by OCID; keep tfvars with secrets out of git.
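The "scheduled plan" form of drift detection can be sketched as a small wrapper around terraform plan -detailed-exitcode, which exits 0 when nothing would change, 1 on error, and 2 when changes are pending. The function names and the envs/<env>/<env>.tfvars convention are assumptions made for this illustration, not part of any Oracle tooling.

```bash
#!/usr/bin/env bash
# Map the exit status of `terraform plan -detailed-exitcode` to a verdict.
# 0 = no changes, 2 = changes pending (drift), anything else = error.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a read-only plan for one environment directory and print the verdict.
# Assumes a layout of envs/<env>/ containing <env>.tfvars.
check_drift() {
  local env="$1" rc=0
  terraform -chdir="envs/${env}" plan -detailed-exitcode -input=false \
    -var-file="${env}.tfvars" >/dev/null || rc=$?
  classify_plan_exit "$rc"
}
```

Running check_drift prod from a nightly pipeline and alerting on anything other than "in-sync" is one way to catch manual Console changes without applying anything.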
oci-infra/
  modules/
    network/ compute/ database/ iam/ monitoring/
  envs/
    dev/  main.tf dev.tfvars backend.tf
    prod/ main.tf prod.tfvars backend.tf
  README.md
18. Learning Path
A structured route from OCI fundamentals to enterprise-grade architecture and operations, aimed at people coming from traditional Oracle/infrastructure backgrounds. Each level lists what to learn, why, hands-on labs, common mistakes, and the outcome you should reach.
Beginner
What to learn
- OCI fundamentals: regions, ADs, fault domains, realms, home region (section 1).
- Tenancy and compartments; how to structure them.
- IAM basics: users, groups, policies, the verb/resource-type model (section 2).
- VCN basics: subnets, route tables, security lists/NSGs, gateways (section 3).
- Compute basics: shapes, images, SSH, cloud-init (section 4).
- Storage basics: block, object, file - and when to use each (section 5).
Why it matters
Every OCI design rests on these. Get the tenancy/compartment/network mental model right now and everything later is easier; get it wrong and you rebuild.
Hands-on labs
- Create a compartment, a group, and a least-privilege policy; add yourself and test access.
- Build a VCN with a public and a private subnet, IGW, NAT, and Service Gateway.
- Launch a public bastion and a private instance; SSH to the private one through the bastion.
- Attach and mount a block volume; create a bucket and upload an object; create an FSS share.
Common mistakes
Everything in root compartment; public subnet used for everything; forgetting the Service Gateway; no fault-domain awareness.
Expected outcome
You can stand up a properly segmented VCN with public/private tiers, reach a private host securely, and use all three storage types - and explain the shared responsibility model.
Intermediate
What to learn
- Load balancers: L7 vs NLB, listeners/backend sets/health checks, SSL (section 7).
- Private networking depth: NSGs by tier, DNS/DHCP, flow logs, Path Analyzer (section 3).
- Hybrid connectivity: Site-to-Site VPN and DRG (sections 3/13).
- Database services: Base Database and Autonomous - provisioning, backups, Data Guard basics (section 6).
- Monitoring & logging: metrics, alarms, notifications, Service Connector Hub (section 9).
- Security services: Vault, Cloud Guard, Bastion, Security Zones, WAF (section 8).
- Cost management: Cost Analysis, Budgets, Quotas, tagging (section 14).
Why it matters
This is the day-job: HA application tiers, managed databases, and the operational and security controls that make them production-worthy.
Hands-on labs
- Deploy a 3-tier app: public LB + WAF → instance pool (multi-FD) → Autonomous/Base DB (private).
- Wire NSGs so only app→db on the DB port is allowed; verify with Path Analyzer.
- Set up alarms (CPU, unhealthy backend, DB storage) to a Notifications topic; force one to fire.
- Store the DB wallet/secret in Vault; give the app instance-principal access.
- Add a budget + quota to the compartment; tag all resources with CostCenter/Environment.
- Set up a VPN/DRG to a simulated on-prem network.
Common mistakes
LB health-check NSG rule missing; DB in a public subnet; secrets in cloud-init instead of Vault; noisy alarms; no tags so cost can't be attributed.
Expected outcome
You can deploy a secure, monitored, HA application and database, connect it to on-prem, and keep its cost and access under control.
Advanced
What to learn
- Exadata Cloud Service and Autonomous deep dive: consolidation, RAC, PDBs, patching at scale (section 6).
- DR design: Data Guard/Active Data Guard, Full Stack DR, RTO/RPO, pilot-light to active/active (section 13).
- FastConnect and advanced hybrid: redundant links, BGP, Network Firewall, hub-and-spoke (sections 3/13).
- OKE and cloud native: node pools/virtual nodes, workload identity, DevOps pipelines, service mesh (section 10).
- Terraform + Resource Manager: modules, remote state, environment separation, drift (section 17).
- Landing zone: CIS-aligned governance, Security Zones, centralized logging/audit (sections 8/14).
- Enterprise security: customer-managed keys, Data Safe, Database Vault, break-glass, least privilege at scale (section 8).
- GenAI & AI Vector Search: governed RAG over enterprise data, the serving-layer pattern (section 12).
- Large-enterprise architecture: multi-BU tenancy, multi-region, chargeback, standardization (sections 1/15).
Why it matters
At this level you are responsible for governance, resilience, automation, and cost across many teams and workloads - decisions that are expensive to reverse.
Hands-on labs
- Deploy a CIS-aligned landing zone via Terraform/Resource Manager (compartments, IAM, network hub, Security Zones, logging, budgets, quotas, tag defaults).
- Build cross-region DR for a database (Active Data Guard) and rehearse a switchover; confirm keys exist in DR.
- Stand up an OKE platform with private cluster, workload identity, and a DevOps CI/CD pipeline from OCIR.
- Implement a governed RAG assistant: Object Storage + DB 23ai Vector Search + Generative AI behind a serving API, with audit and entitlement-filtered retrieval.
- Refactor a hand-built environment into reusable Terraform modules with per-env state and drift detection.
Common mistakes
Skipping the landing zone and retrofitting governance; DR never tested; over-privileged workload identity/IAM; Console drift breaking DR parity; connecting AI agents to production data without a governed serving layer.
Expected outcome
You can design and operate a governed, automated, multi-region OCI platform for a large enterprise - and defend the trade-offs on security, resilience, and cost.
Certification checkpoints (optional)
| Level | Typical certification track |
|---|---|
| Beginner | OCI Foundations Associate |
| Intermediate | OCI Architect Associate (+ Operations/Networking specialty as relevant) |
| Advanced | OCI Architect Professional (+ specialty: Security, Multicloud, Data/AI, Autonomous) |