AWS Cloud Deep Dive Portal

A practical reference for cloud architects, DBAs, and enterprise infrastructure teams

Not a service catalog and not a marketing tour. This is a working reference for the people who design, build, run, secure, and troubleshoot AWS in production: how the services actually behave, where the model breaks your assumptions, the gotchas that cost you a weekend, the decision guidance for choosing between overlapping options, and the runbooks you reach for when something is down at 2 a.m.

Last reviewed June 2026 | AWS-only | Architecture-first | Production-oriented

HOW TO USE THIS PORTAL

Use the left navigation as a table of contents. Each section is a standalone deep dive. Start at AWS Fundamentals for the account and global-infrastructure model, then use the AWS Service Index (searchable, filterable) to find any service fast. The deep-dive tabs (IAM, Networking, Compute, Storage, Database, and so on) go beyond definitions into decision guidance, gotchas, and operations. When something breaks, jump to Troubleshooting Runbooks. When you are designing, use Architecture Patterns, Well-Architected, and the Decision tables inside each deep dive.

Who this is for

Architects & Enterprise Architects

Account topology, landing zones, network design, pattern selection, Well-Architected reviews, and the trade-offs behind each.

DBAs & Apps DBAs

RDS vs Aurora vs EC2 self-managed, Oracle on AWS realities, HA/DR, backups, patching, and what AWS manages vs what you still own. Look for the DBA note callouts.

Infra, DevOps & Security

Compute, storage, IAM least privilege, guardrails, observability, cost control, IaC, and incident runbooks. Look for the Security, Cost, and Operations callouts.

Callout legend

These labeled boxes recur throughout. Scan for them when you want the practitioner shortcut rather than the full prose.

Architect note

A design-level trade-off, boundary decision, or pattern choice that is expensive to change later.

DBA note

Something a database professional specifically needs to know: control you keep or lose, backup and recovery behaviour, licensing, patching.

Security note

A control, exposure, or misconfiguration that has real blast radius.

Cost note

A charge that surprises teams, or a lever that meaningfully reduces spend.

Operations note

Day-2 behaviour: what to monitor, how to patch, how it fails.

Common mistake

A pattern teams repeat that causes outages, cost, or security incidents. Avoid it deliberately.

Match / scope badges used in tables

In the Service Index and elsewhere, these badges describe the operational model at a glance:

Badge	Meaning
Serverless	No servers to manage or patch; you pay per request/usage and AWS runs the capacity.
Managed	AWS runs the control plane, patching, and HA plumbing; you configure and operate the workload on top.
Self-managed	You run it on primitives (usually EC2); AWS provides the infrastructure only.
Global	The service or resource spans all Regions (for example IAM, Route 53, CloudFront, WAF for CloudFront).
Regional	Scoped to one Region; resilient across Availability Zones within it.
AZ-scoped	Lives in a single Availability Zone; you design redundancy across AZs yourself (for example EBS, an EC2 instance, a subnet).
Legacy	Still supported but superseded; do not choose it for new work without a specific reason.

The AWS mental model in five sentences

Everything is an API call, and every API call is authorized by IAM. If you understand the request-evaluation logic, you understand AWS security.
The Region is your blast-radius and data-residency boundary; the Availability Zone is your fault-isolation boundary. Design for AZ failure by default; design for Region failure when the business requires it.
The account is the hard isolation and billing boundary. Multi-account is the norm for anything serious, organized by AWS Organizations.
Managed does not mean hands-off. AWS runs the undifferentiated plumbing; you still own configuration, data, IAM, network placement, and cost.
Almost every problem is one of: IAM (permission), networking (reachability), or a service limit/quota. Check those three first.

Accuracy and freshness

AWS ships changes constantly. Everything here reflects June 2026. Exact service limits, instance types, Region and feature availability, and pricing change frequently and are the most volatile facts in this portal. Where a detail is fast-moving, it is marked "verify with current AWS documentation"; check it before production use. Treat this as a reasoning and design aid, not a substitute for the official docs and the console for current values.

AWS Service Index

A searchable, filterable reference of the AWS services that matter to infrastructure, data, and platform teams. Search any field; filter by domain or operational model. Click a row to expand what it is, when to use it, the main gotcha, and a cost note. This is the fast lookup layer; the deep-dive tabs carry the detail.


1. AWS Fundamentals

The global infrastructure, account model, and organizational primitives that every other decision sits on top of. Get the account topology and Region/AZ model right first; almost everything else is easier to change later than these are.

TL;DR

AWS is a set of regional service APIs on top of a global backbone. The Region is your data-residency and blast-radius boundary; the Availability Zone is your fault-isolation boundary; the account is your hard security and billing boundary. Real-world estates use many accounts under AWS Organizations, provisioned and governed through Control Tower as a landing zone. Design the account and network topology before you deploy production.

What AWS is

AWS is a collection of independent services, each exposed as an API, running in data centers grouped into Regions around the world. You do not "log into a server"; you make authenticated, authorized API calls (via the console, CLI, SDK, or IaC) that create, configure, and destroy resources. Two consequences follow and shape everything:

Every action is an IAM-authorized API call. Security, automation, and troubleshooting all reduce to "which principal called which API on which resource, and was it allowed."
Services are largely regional and independent. A failure or a limit in one service/Region usually does not cascade to others. You compose resilient systems from these independent pieces.

AWS global infrastructure

Construct	What it is	Use it for	Key fact
Region [Regional]	A geographic area with multiple isolated data-center clusters (AZs). Example: us-east-1, eu-west-1, ap-south-1.	Data residency, latency to users, service/feature availability, disaster-recovery separation.	Most services are regional and isolated per Region. Data does not leave a Region unless you move it. Not all services/features exist in every Region.
Availability Zone (AZ) [AZ-scoped]	One or more discrete data centers with independent power, cooling, and networking, within a Region, interconnected by low-latency links.	Fault isolation. Spread instances/subnets across AZs so one AZ failure does not take you down.	AZ IDs (use1-az1) are consistent across accounts; AZ names (us-east-1a) are randomized per account. Cross-AZ traffic is chargeable.
Local Zone	An extension of a Region placing compute/storage closer to a metro area (for example large cities).	Single-digit-millisecond latency for specific metros: media, gaming, real-time apps.	Subset of services only; treated like a special AZ you opt into. Verify service availability.
Wavelength Zone	AWS infrastructure embedded in telecom 5G networks.	Ultra-low-latency mobile/edge apps over 5G.	Niche; only relevant for 5G edge use cases.
Edge location / PoP [Global]	Hundreds of points of presence used by CloudFront, Route 53, Global Accelerator, and WAF.	Caching, DNS, DDoS absorption, and accelerated ingress close to users.	Far more numerous than Regions; you do not deploy servers here, you use edge services.

Architect note

Default to a single Region with workloads spread across at least three AZs. Go multi-Region only when a specific requirement (RTO/RPO, data residency, global latency) justifies the large jump in cost and complexity. "Multi-Region because it feels safer" is not a requirement; it is a budget line and an operational tax.

Common mistake

Assuming us-east-1a in one account is the same physical AZ as us-east-1a in another. AZ names are shuffled per account; only AZ IDs (like use1-az1) are stable. This matters when you place partner or cross-account resources for latency or cost.
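One way to avoid this trap is to compare placements by the ZoneId field of the EC2 DescribeAvailabilityZones response instead of ZoneName. The sketch below works on already-fetched response dictionaries, so it makes no API calls. The helper names and the two sample mappings are invented for illustration; real mappings differ per account.

```python
def zone_ids_by_name(response):
    """Map AZ name -> AZ ID from a DescribeAvailabilityZones-shaped dict."""
    return {z["ZoneName"]: z["ZoneId"] for z in response["AvailabilityZones"]}


def same_physical_az(resp_a, name_a, resp_b, name_b):
    """True when two account-local AZ names resolve to the same AZ ID."""
    return zone_ids_by_name(resp_a)[name_a] == zone_ids_by_name(resp_b)[name_b]


# Illustrative responses for two accounts (the name-to-ID mappings are made up).
account_a = {"AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az1"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az2"},
]}
account_b = {"AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az2"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az1"},
]}

# Same name, different physical AZ in this sample.
print(same_physical_az(account_a, "us-east-1a", account_b, "us-east-1a"))
# Different names, same physical AZ in this sample.
print(same_physical_az(account_a, "us-east-1a", account_b, "us-east-1b"))
```

In practice you would feed this the output of the CLI or SDK call from each account and agree on AZ IDs, not names, with the partner team.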

Accounts and AWS Organizations

An AWS account is the fundamental container and the strongest isolation boundary AWS offers: separate resources, separate IAM, separate default limits, and a separate bill. Serious environments do not run everything in one account. They use many accounts, grouped and governed by AWS Organizations.

Concept	Meaning	Guidance
Management (payer) account	The root of the organization. Owns billing, creates member accounts, applies SCPs.	Keep it nearly empty. No workloads. Tightly restricted human access, hardware MFA on root, minimal IAM. Its compromise is organization-wide.
Member account	Any account managed within the organization.	Where workloads and environments actually live.
Organizational Unit (OU)	A folder grouping accounts for policy inheritance.	Group by function/environment, not by team org chart. Common: Security, Infrastructure, Workloads (with Prod/Non-Prod under it), Sandbox, Suspended.
Service Control Policy (SCP)	An org-level guardrail that sets the maximum permissions for accounts/OUs. Restricts only; never grants.	Use to deny dangerous actions org-wide (leaving the org, disabling CloudTrail/GuardDuty, using disallowed Regions, deleting log buckets).

Security note

An SCP is a ceiling, not a grant. Even AdministratorAccess in a member account cannot exceed what the SCPs above it allow. Effective permission = intersection of (SCPs) and (IAM identity/resource policies), minus any explicit Deny. This is the backbone of preventive multi-account governance.
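As an illustration of a restrict-only guardrail, the sketch below builds a Region-restriction SCP as a Python dictionary and prints it as JSON. It follows the common pattern of a Deny with NotAction plus the aws:RequestedRegion condition key. The function name and the list of exempted global-service actions are assumptions for the example; the exemption list is deliberately short and would need review against your own service usage before attaching anything like this.

```python
import json

# Assumption: a short, incomplete list of global-service actions to exempt.
GLOBAL_SERVICE_ACTIONS = ["iam:*", "organizations:*", "route53:*",
                          "cloudfront:*", "support:*", "sts:*"]


def build_region_guardrail(allowed_regions, exempt_actions):
    """Return an SCP document that denies requests outside allowed Regions.

    The statement only denies. It grants nothing, which is the point of an SCP.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            "NotAction": sorted(exempt_actions),
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "aws:RequestedRegion": sorted(allowed_regions)
                }
            },
        }],
    }


scp = build_region_guardrail(["eu-west-1", "eu-central-1"], GLOBAL_SERVICE_ACTIONS)
print(json.dumps(scp, indent=2))
```

Attach a policy like this to a non-prod OU first; a Region deny that forgets a global service your tooling depends on fails in ways that look like IAM bugs.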

Control Tower and the Landing Zone

A landing zone is a pre-configured, secure, multi-account baseline: the org structure, centralized logging, guardrails, identity, and network scaffolding, ready before workloads arrive. AWS Control Tower is the managed way to build and maintain one.

Account Factory vends new accounts from a standard blueprint (baseline IAM, logging, guardrails, VPC).
Guardrails are preventive (SCP-based) and detective (Config-based) controls applied per OU.
It wires up a Log Archive account and an Audit/Security account automatically.

Operations note

Control Tower is opinionated. If you need heavy customization or GitOps-driven account vending, look at Account Factory for Terraform (AFT) or building the landing zone yourself with Organizations + StackSets. Fighting Control Tower's opinions with manual changes leads to drift it will flag.

Reference multi-account structure

OU	Accounts	Purpose
Root / Management	Management (payer)	Billing, Organizations, SCPs. No workloads.
Security	Log Archive, Audit/Security Tooling	Immutable central log store; GuardDuty/Security Hub/Config delegated admin; break-glass.
Infrastructure	Network (Transit Gateway/Cloud WAN, DNS), Shared Services	Central networking, shared AD/CI/artifact/endpoint services consumed by workload accounts.
Workloads	Per app-and-environment: app-a-prod, app-a-nonprod, app-b-prod...	The actual applications. Prod isolated from non-prod at the account boundary.
Sandbox	Individual/team sandboxes	Experimentation with tight budget alarms and restrictive SCPs; detached from prod networks.

Architect note: dev/test/stage/prod separation

Separate environments by account, not just by tag or VPC. Account boundaries give you clean blast-radius isolation, independent quotas, per-environment cost visibility, and the ability to grant broad access in sandbox while locking prod down. Put shared plumbing (network hub, logging, security tooling) in dedicated infrastructure/security accounts so workload teams consume it without owning it.

Naming, ARNs, tags, and Resource Groups

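ARN (Amazon Resource Name) uniquely identifies every resource: arn:partition:service:region:account-id:resource, where the partition is aws in the standard commercial Regions. IAM policies, cross-account grants, and automation all reference ARNs. Some ARNs leave fields empty: IAM ARNs have no Region because IAM is global, and S3 bucket ARNs omit both Region and account ID even though each bucket lives in one Region.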
Tags are key/value metadata. They drive cost allocation, access control (ABAC), automation targeting, and inventory. Define a mandatory tagging standard early (Owner, Environment, CostCenter, Application, DataClassification) and enforce it with SCPs/Config/tag policies.
Resource Groups collect resources by tag or CloudFormation stack for bulk operations and views.
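Because so much automation keys on ARNs, it helps to split them reliably. The sketch below is a minimal parser that assumes the six-field colon layout and treats everything after the fifth colon as the resource part; the type and function names are made up for this example.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str       # empty for global services such as IAM
    account_id: str   # empty for S3 bucket ARNs
    resource: str     # may itself contain ':' or '/'


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fields, keeping colons inside the resource part."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


print(parse_arn("arn:aws:s3:::my-reports-bucket"))
print(parse_arn("arn:aws:iam::111122223333:role/CICDDeployer"))
```

Limiting the split to five colons matters: resource parts such as function:name contain colons of their own.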

Cost note

Cost allocation tags are not active by default. Activate them in the Billing console; only then do they appear in Cost Explorer and the Cost and Usage Report. Retroactive tagging does not backfill historical cost data, so define and enforce the tag standard on day one.
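A tag standard is only useful if something checks it. The sketch below is a tiny validator for the mandatory keys named above (Owner, Environment, CostCenter, Application, DataClassification). It is an illustration of the check a pipeline or Config rule would perform, not a replacement for tag policies; the function name and sample tags are hypothetical.

```python
REQUIRED_TAGS = ("Owner", "Environment", "CostCenter",
                 "Application", "DataClassification")


def missing_tags(tags: dict) -> list:
    """Return required tag keys that are absent or blank, in standard order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


resource_tags = {"Owner": "data-platform", "Environment": "prod", "CostCenter": " "}
print(missing_tags(resource_tags))
```

Running a check like this in the IaC pipeline catches untagged resources before they start accruing unattributed cost.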

Ways to interact with AWS

Interface	What	When
Console	Web UI.	Learning, exploration, one-off tasks, break-glass. Not for repeatable prod changes.
AWS CLI	Command-line access to every API.	Scripting, automation, troubleshooting. Use named profiles and role assumption.
SDKs	Language libraries (Python/boto3, Java, JS, Go, .NET...).	Applications and custom tooling.
CloudShell	Browser-based shell with the CLI pre-authenticated as your console identity.	Quick CLI work without local setup.
CloudFormation	Native IaC (YAML/JSON), managed stacks, StackSets across accounts.	Declarative provisioning within AWS.
CDK	IaC in real programming languages, synthesizes to CloudFormation.	Teams who want code constructs, reuse, and testing.
Terraform	Third-party, multi-cloud IaC via the AWS provider.	The de facto standard for many enterprises; portable skill set and state-based workflow.

Common mistake: ClickOps in production

Building prod by hand in the console gives you nothing to review, no reproducibility, and no drift detection. Use IaC for anything that must be recreated, audited, or promoted across environments. Reserve the console for read-only investigation and genuine break-glass.

Decide before production

Account topology and OU structure (multi-account from the start; retrofitting is painful).
Region strategy: primary Region, whether multi-Region DR is required, allowed Regions (enforced by SCP).
Network CIDR plan across all VPCs and on-prem (non-overlapping, room to grow).
Identity: IAM Identity Center wired to your corporate IdP; no per-account IAM users for humans.
Centralized logging (CloudTrail org trail, Config, VPC Flow Logs) to a locked log-archive account.
Guardrails: baseline SCPs and Config rules; GuardDuty and Security Hub org-wide.
Tagging standard and cost allocation tags activated.
IaC tooling and pipeline; state management strategy.
Backup and DR baseline (AWS Backup plans, RTO/RPO targets).

Common mistakes in account design

Common mistakes

Running everything in one account and separating only by tags. No hard blast-radius or quota isolation.
Putting workloads in the management account. It should be nearly empty and heavily locked down.
Modeling OUs on the human org chart instead of on function/environment/policy needs.
Creating dozens of accounts with no automation (Account Factory/AFT), leading to inconsistent baselines and drift.
No central logging account, so logs live in the same accounts an attacker could tamper with.
Deferring the CIDR plan, then discovering overlaps that block Transit Gateway/peering later.

2. Identity and Access Management Deep Dive

IAM is the authorization engine in front of every AWS API call. Understand the evaluation logic and the difference between the policy types, and most "why can/can't this work" questions answer themselves.

TL;DR

Prefer roles and short-lived credentials over IAM users and long-lived keys. Give humans access through IAM Identity Center (SSO), and give workloads access through roles (instance profiles, task roles, IRSA, execution roles). A request is allowed only if some policy explicitly allows it, nothing explicitly denies it, and it is within the SCP/permission-boundary ceiling. Design least privilege and verify with Access Analyzer and the policy simulator.

The building blocks

Object	What it is	Use it for	Note
IAM user	A long-lived identity with a password and/or access keys.	Increasingly rare. Break-glass, or legacy systems that cannot federate.	Avoid for humans. Every human user is a long-lived credential to rotate and protect.
IAM group	A collection of users sharing policies.	Attaching permissions to sets of users.	Groups cannot be nested and cannot be principals in a trust policy.
IAM role	An identity with permissions but no long-lived credentials; assumed to get temporary credentials via STS.	Workloads, cross-account access, federated humans, AWS services acting on your behalf.	The intended pattern for almost everything. Defined by a permissions policy + a trust policy.
Identity-based policy	JSON attached to a user/group/role saying what that identity can do.	Granting permissions to principals.	Managed (AWS or customer) or inline.
Resource-based policy	JSON attached to a resource (S3 bucket, SQS queue, KMS key, Lambda) saying who can access it.	Cross-account access and per-resource control.	Has a `Principal` element. Cross-account access needs allow on both sides (except a few resource-policy-only cases).
Permission boundary	An advanced policy capping the maximum permissions an identity can have.	Safe delegation: let teams create roles without exceeding a ceiling.	Does not grant; it limits. Effective = intersection of boundary and identity policy.
SCP	Org-level guardrail (see Fundamentals).	Org-wide maximum permissions.	Applies to every principal in member accounts, including the root user. Does not affect the management account or service-linked roles.
Session policy	Passed at AssumeRole time to further scope the session.	Temporarily narrowing a broad role for a specific task/session.	Can only restrict, never expand, the role's permissions.
Trust policy	The resource-based policy on a role defining who can assume it.	Controlling role assumption (which account/service/principal).	Separate from the role's permissions. Getting this wrong is the classic cross-account failure.

STS, temporary credentials, and role assumption

STS issues short-lived credentials (access key + secret + session token) when a principal assumes a role or federates. This is how you avoid long-lived keys everywhere:

Instance profile attaches a role to EC2; the app reads temporary creds from IMDSv2. No keys on disk.
Task role (ECS) / IRSA or Pod Identity (EKS) / execution role (Lambda) give containers and functions scoped credentials.
Cross-account roles: a principal in account A assumes a role in account B whose trust policy permits it.
Federation: SAML or OIDC exchanges an external identity for AWS credentials (the basis of IAM Identity Center and web/OIDC workloads like GitHub Actions).

IAM Identity Center and federation

IAM Identity Center (formerly AWS SSO) is how humans should access AWS across many accounts. You connect it to an identity source (Microsoft Entra ID, Okta, or its built-in directory) via SAML/SCIM, define permission sets (which become IAM roles in each target account), and assign users/groups to accounts. Users get a portal, pick an account and role, and receive short-lived credentials. No per-account IAM users, no long-lived keys, centralized deprovisioning.

Architect note

Wire IAM Identity Center to your corporate IdP once, at the org level, before onboarding teams. Model access as permission sets mapped to IdP groups. This gives you one place to grant/revoke, consistent roles across accounts, and an audit trail that ties AWS actions back to a corporate identity.

MFA, access keys, and credential hygiene

Enforce MFA everywhere, especially root and any remaining IAM users. Prefer hardware/FIDO2 for privileged access.
Root account: hardware MFA, no access keys, used only for the handful of tasks that require it. Lock it away.
Avoid long-lived access keys. Where unavoidable, rotate on a schedule and scope tightly. Prefer roles/federation.
Use the credential report (account-wide CSV of users, key age, MFA, last use) and last-accessed data to prune.
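The pruning step can be partly automated. The sketch below scans a credential report that has already been downloaded as CSV text and flags console users without MFA and old active keys. It uses only a few of the report's columns (user, password_enabled, mfa_active, access_key_1_active, access_key_1_last_rotated). The 90-day threshold, the function name, and the sample rows are assumptions for illustration, and only the first access key is checked to keep the example short.

```python
import csv
import io
from datetime import datetime, timezone

SAMPLE_REPORT = """\
user,password_enabled,mfa_active,access_key_1_active,access_key_1_last_rotated
alice,true,false,false,N/A
deploy-bot,false,false,true,2025-01-10T00:00:00+00:00
bob,true,true,true,2026-05-20T00:00:00+00:00
"""


def risky_users(report_csv, now, max_key_age_days=90):
    """Return {user: [issues]} for users that break basic credential hygiene."""
    findings = {}
    for row in csv.DictReader(io.StringIO(report_csv)):
        issues = []
        if row["password_enabled"] == "true" and row["mfa_active"] != "true":
            issues.append("console password without MFA")
        if row["access_key_1_active"] == "true":
            stamp = row["access_key_1_last_rotated"].replace("Z", "+00:00")
            age_days = (now - datetime.fromisoformat(stamp)).days
            if age_days > max_key_age_days:
                issues.append(f"access key 1 older than {max_key_age_days} days")
        if issues:
            findings[row["user"]] = issues
    return findings


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
print(risky_users(SAMPLE_REPORT, now))
```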

Example policy statements

Each example below states what it allows, where to attach it, its risk level, a safer alternative, and how it gets misused.

Example 1: read-only access to one S3 bucket

# Identity-based policy attached to a role/permission set
{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-reports-bucket" },
    { "Sid": "GetObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-reports-bucket/*" }
  ]
}

Allows	List and read objects in exactly one bucket. Nothing else.
Attach to	A role/permission set for an app or analyst that only needs to read reports.
Risk	Low. Scoped to one bucket and read-only.
Safer alt	Add a condition (`aws:SourceVpce` or `aws:PrincipalOrgID`) if access should be restricted to a VPC endpoint or the org.
Misuse	Widening `Resource` to `arn:aws:s3:::*` "to save time," which grants read to every bucket.

Example 2: the dangerous one - full admin

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "*", "Resource": "*" }
  ]
}

Allows	Every action on every resource. This is `AdministratorAccess`.
Attach to	Almost nothing. A tightly controlled break-glass role at most.
Risk	Critical. Full blast radius if the identity is compromised.
Safer alt	Job-function managed policies, or purpose-built least-privilege policies. Cap with a permission boundary and SCPs.
Misuse	Handing `AdministratorAccess` to developers "so they stop filing access tickets." The most common over-permission in AWS.
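Wildcard grants like the one above are easy to catch mechanically. The sketch below is a deliberately small linter that flags Allow statements whose Action is * or whose Resource is * or an all-buckets S3 ARN. It is an illustration only and does not replace IAM Access Analyzer policy validation; the function names are hypothetical and the two sample policies mirror Examples 1 and 2.

```python
def _as_list(value):
    """IAM lets Action/Resource/Statement be a single value or a list."""
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy):
    """Return (statement_index, message) pairs for over-broad Allow statements."""
    findings = []
    for index, stmt in enumerate(_as_list(policy["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            if action == "*":
                findings.append((index, "Action is *"))
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*" or resource.endswith(":::*"):
                findings.append((index, f"Resource is {resource}"))
    return findings


read_only = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": ["s3:ListBucket"],
     "Resource": "arn:aws:s3:::my-reports-bucket"},
    {"Effect": "Allow", "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::my-reports-bucket/*"},
]}
full_admin = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "*", "Resource": "*"},
]}

print(wildcard_findings(read_only))
print(wildcard_findings(full_admin))
```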

Example 3: a cross-account trust policy

# Trust policy ON the role in account B (the account being accessed)
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/CICDDeployer" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "shared-secret-value" } } }
  ]
}

Allows	Only the `CICDDeployer` role in account 1111... to assume this role, and only when it presents the matching ExternalId.
Attach to	The role in the target account (trust policy). The caller also needs `sts:AssumeRole` permission on their side.
Risk	Medium. Controlled, but the assumed role's permissions policy is where the real power lives.
Safer alt	Scope the principal to a specific role ARN (not the whole account root `:root`), use ExternalId for third parties, and least-privilege the permissions policy.
Misuse	Trusting `"AWS": "arn:aws:iam::111122223333:root"` (the whole account) plus a broad permissions policy, so any principal in that account can assume a powerful role.
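The same kind of mechanical review applies to trust policies. The sketch below flags statements that trust a whole account (a :root ARN or a bare 12-digit account ID), trust any AWS principal, or lack an sts:ExternalId condition. Whether a missing ExternalId matters depends on whether the caller is a third party, so treat that finding as a prompt to check, not a verdict. Function names are hypothetical.

```python
def _as_list(value):
    return value if isinstance(value, list) else [value]


def trust_findings(trust_policy):
    """Return human-readable findings for risky patterns in a role trust policy."""
    findings = []
    for stmt in _as_list(trust_policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        # Principal may be the string "*" or a dict such as {"AWS": ...}.
        aws_principals = ["*"] if principal == "*" else _as_list(principal.get("AWS", []))
        for p in aws_principals:
            if p == "*":
                findings.append("any AWS principal can assume this role")
            elif p.endswith(":root") or p.isdigit():
                findings.append(f"whole account trusted: {p}")
        conditions = stmt.get("Condition", {}).get("StringEquals", {})
        if aws_principals and "sts:ExternalId" not in conditions:
            findings.append("no sts:ExternalId condition")
    return findings


scoped = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/CICDDeployer"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "shared-secret-value"}},
}]}
broad = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
    "Action": "sts:AssumeRole",
}]}

print(trust_findings(scoped))
print(trust_findings(broad))
```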

Example 4: a permission boundary for delegated role creation

# Boundary limiting what any role a team creates can ever do
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["s3:*","dynamodb:*","logs:*","cloudwatch:*"],
      "Resource": "*" }
  ]
}

Allows	Sets the ceiling: any role created with this boundary can never exceed S3/DynamoDB/logs/CloudWatch, even if its permissions policy says more.
Attach to	Roles created by delegated admins; combined with a policy requiring the boundary on any `iam:CreateRole`.
Risk	Low when used to constrain delegation; it only limits.
Safer alt	Pair with an SCP that denies `iam:CreateRole`/`PutRolePolicy` unless the boundary is attached, so it cannot be bypassed.
Misuse	Confusing a boundary with a grant. It does not give access; it caps it.

Least-privilege design in practice

Start from zero and add specific actions/resources, not from * and trim.
Use IAM Access Analyzer policy generation to build a policy from actual CloudTrail activity.
Add conditions: aws:PrincipalOrgID, aws:SourceIp, aws:SourceVpce, aws:RequestedRegion, tag conditions (ABAC).
Prefer ABAC (attribute/tag-based) for scale: "allow if resource tag Team = principal tag Team" instead of enumerating resources.
Review with last-accessed data and prune unused permissions periodically.
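To make the ABAC idea concrete, the sketch below builds one policy statement that allows actions only when the resource's tag equals the caller's principal tag, using the aws:ResourceTag and aws:PrincipalTag condition keys, and pairs it with a local function that mimics the comparison. The chosen actions, the Team tag key, and the function names are assumptions for the example; not every action supports tag conditions, so verify per service.

```python
import json


def abac_statement(actions, tag_key):
    """Allow the actions only where resource tag == principal tag for tag_key."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                f"aws:ResourceTag/{tag_key}": "${aws:PrincipalTag/" + tag_key + "}"
            }
        },
    }


def tags_match(principal_tags, resource_tags, tag_key):
    """Local stand-in for the condition: both sides tagged, values equal."""
    value = principal_tags.get(tag_key)
    return value is not None and resource_tags.get(tag_key) == value


statement = abac_statement(["ec2:StartInstances", "ec2:StopInstances"], "Team")
print(json.dumps(statement, indent=2))
print(tags_match({"Team": "payments"}, {"Team": "payments"}, "Team"))
print(tags_match({"Team": "payments"}, {"Team": "search"}, "Team"))
```

One statement like this covers every current and future resource carrying the tag, which is why ABAC scales better than enumerating ARNs.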

Tooling

Tool	Use
IAM Access Analyzer	Finds resources shared outside your account/org, validates policies, generates least-privilege policies from CloudTrail, and flags unused access.
Credential report	Account-wide CSV of all users: key age, MFA status, last activity. Use to prune and enforce hygiene.
Policy simulator	Test whether a given principal would be allowed a given action on a resource before deploying.
Last accessed data	Shows which services/actions a principal actually used, to right-size policies.

The AWS IAM mental model

When you evaluate any "is this allowed?" question, walk this in order. AWS evaluates every request against all applicable policies:

Who is making the request? (Principal)

An IAM user, an assumed role (workload or human via SSO), an AWS service principal, or a federated identity. The principal's identity-based policies define its potential permissions.

What action, on which resource?

Every request is a specific API action against a specific ARN, possibly with request context (source IP, VPC endpoint, tags, Region).

Is there an explicit Deny anywhere?

An explicit Deny in any applicable policy (identity, resource, SCP, boundary, session) immediately denies the request. Deny always wins.

Is it within the SCP and permission-boundary ceiling?

The request must be permitted by the SCPs on the account/OU and by any permission boundary on the principal. These set the maximum.

Is there an explicit Allow?

With no Deny and within the ceiling, there must be an explicit Allow (identity-based, or resource-based) for the action. Default is implicit deny.

Is this cross-account?

For cross-account, you generally need an Allow on both sides: the caller's identity policy (or trust for AssumeRole) and the target's resource policy or role trust.

One-line rule

Effective permission = ( identity-based Allow OR resource-based Allow ) AND ( within SCP ) AND ( within permission boundary ) AND ( within session policy ) AND ( no explicit Deny anywhere ). Miss any factor and the call fails. Across accounts, the first factor generally tightens to an Allow on both sides.
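The one-line rule can be expressed as a toy evaluator. The sketch below is a heavily simplified same-account model, not the real IAM engine: it ignores conditions, NotAction, principals, and the cross-account rules, treats each SCP in the list as one level that must allow, and uses fnmatch for wildcard matching, which is only an approximation of IAM's matching. All names are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(stmt, action, resource):
    """True when the statement's Action and Resource patterns cover the request."""
    action_hit = any(fnmatchcase(action.lower(), p.lower())
                     for p in _as_list(stmt["Action"]))
    resource_hit = any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"]))
    return action_hit and resource_hit


def _any_statement(policies, effect, action, resource):
    return any(stmt["Effect"] == effect and _matches(stmt, action, resource)
               for policy in policies
               for stmt in _as_list(policy["Statement"]))


def is_allowed(action, resource, identity=(), resource_based=(),
               scps=(), boundary=None, session=None):
    """Apply: no explicit Deny, inside every ceiling, and an explicit Allow."""
    grants = list(identity) + list(resource_based)
    ceilings = list(scps) + [p for p in (boundary, session) if p is not None]
    # 1. An explicit Deny in any applicable policy wins.
    if _any_statement(grants + ceilings, "Deny", action, resource):
        return False
    # 2. Every ceiling that is present (SCP level, boundary, session) must allow.
    if not all(_any_statement([c], "Allow", action, resource) for c in ceilings):
        return False
    # 3. Otherwise an explicit Allow is required; the default is implicit deny.
    return _any_statement(grants, "Allow", action, resource)


read_reports = {"Statement": [{"Effect": "Allow", "Action": "s3:GetObject",
                               "Resource": "arn:aws:s3:::my-reports-bucket/*"}]}
full_access_scp = {"Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
narrow_boundary = {"Statement": [{"Effect": "Allow",
                                  "Action": ["dynamodb:*", "logs:*"], "Resource": "*"}]}
obj = "arn:aws:s3:::my-reports-bucket/q1.csv"

print(is_allowed("s3:GetObject", obj, identity=[read_reports], scps=[full_access_scp]))
print(is_allowed("s3:GetObject", obj, identity=[read_reports],
                 scps=[full_access_scp], boundary=narrow_boundary))
```

The second call fails even though the identity policy allows the read, because the boundary does not include S3. That is the "ceiling, not a grant" behaviour in miniature.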

Break-glass access

Even with SSO, keep a tightly controlled emergency path for when the IdP or Identity Center is unavailable: a dedicated break-glass IAM role or user per critical account, hardware MFA, credentials sealed and split, every use alarmed via CloudTrail/EventBridge to security. Test it periodically so it works when you need it.

Common AWS IAM mistakes

Common mistakes

Using the root account for daily work. Root is for the few tasks that require it. Lock it with hardware MFA, no keys.
Long-lived access keys where roles would do. Every static key is a liability; use instance profiles, task roles, IRSA, and federation.
AdministratorAccess handed out broadly. Use job-function policies and boundaries; grant admin narrowly and temporarily.
Confusing roles with users. Roles are assumed for temporary creds; users are long-lived. Workloads use roles, not embedded user keys.
Misunderstanding trust policies. The trust policy controls who can assume; the permissions policy controls what they can do. Both matter.
Not using IAM Identity Center. Per-account IAM users for humans do not scale and are hard to deprovision.
Poor cross-account design. Trusting :root of another account plus broad permissions is effectively trusting anyone in that account.
Careless SCPs. Too loose and they protect nothing; too tight and they break legitimate work. Test in a non-prod OU first.
Wildcards everywhere. Action:"*" and Resource:"*" are the norm in over-permissioned accounts; scope both.
Ignoring permission boundaries when delegating role creation, so a team can escalate to admin by creating a powerful role.

3. Networking Deep Dive

The VPC and everything around it. Networking is where the most time is lost in AWS, because reachability depends on the intersection of routing, security groups, NACLs, gateways, and DNS, all of which must line up. This section covers the model, the diagrams, and the runbooks.

TL;DR

A VPC is a regional private network; subnets are AZ-scoped. Public vs private is defined by the route table (a route to an Internet Gateway makes a subnet public). Security groups are stateful and attached to ENIs; NACLs are stateless and attached to subnets. Reach the internet outbound from private subnets via NAT Gateway; reach AWS services privately via VPC endpoints. Connect networks with Transit Gateway (hub), peering (point-to-point), VPN, or Direct Connect. Plan non-overlapping CIDRs first.

Core building blocks

Component	What it does	Key fact / gotcha
VPC	Isolated virtual network in a Region with a CIDR block.	Regional. CIDR cannot shrink; you can add secondary CIDRs. Plan for growth and non-overlap.
Subnet	A CIDR slice bound to one AZ.	AZ-scoped. AWS reserves 5 IPs per subnet. "Public" vs "private" is purely about the route table.
Route table	Directs traffic by destination CIDR to a target (IGW, NAT, TGW, endpoint, peering).	Most specific route wins. The presence of a `0.0.0.0/0 -> IGW` route is what makes a subnet public.
Internet Gateway (IGW)	Enables bidirectional internet for resources with public IPs in public subnets.	Horizontally scaled, no bottleneck. Needs a public IP + route + SG/NACL allow to actually work.
NAT Gateway	Outbound-only internet for private subnets.	AZ-scoped; deploy one per AZ. Hourly + per-GB charge. No inbound. Common surprise cost.
Egress-only IGW	Outbound-only internet for IPv6.	The IPv6 analogue of NAT (IPv6 is globally routable, so you need explicit egress-only control).
Elastic IP	Static public IPv4 address you allocate and attach.	Since February 2024 AWS charges hourly for every public IPv4 address, attached or idle. Release unused EIPs.
ENI	Elastic Network Interface: a virtual NIC with private IP(s), SGs, and optional public IP.	Security groups attach to ENIs. Instances, ALBs, endpoints, RDS all use ENIs, which consume subnet IPs.
Security Group	Stateful virtual firewall on an ENI. Allow rules only.	Return traffic is automatically allowed. Can reference other SGs as source. First line of instance-level control.
Network ACL (NACL)	Stateless firewall at the subnet boundary. Allow and deny rules, numbered.	You must allow return traffic explicitly (ephemeral ports). Default NACL allows all; custom NACLs deny all until you add rules.
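The "most specific route wins" rule from the route table row can be sketched with the standard ipaddress module. This is a simplified model of longest-prefix matching over a dictionary of destination CIDR to target; the target IDs are placeholders, and real route tables have extra rules (for example propagated versus static routes) that are not modelled here.

```python
import ipaddress


def pick_route(route_table, destination_ip):
    """Return the target of the longest-prefix route covering destination_ip."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [(ipaddress.ip_network(cidr), target)
               for cidr, target in route_table.items()
               if ip in ipaddress.ip_network(cidr)]
    if not matches:
        return None  # no route: the packet is dropped
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def is_public(route_table):
    """A subnet is public when its default route points at an Internet Gateway."""
    return route_table.get("0.0.0.0/0", "").startswith("igw-")


# Placeholder target IDs for illustration.
private_rt = {"10.0.0.0/16": "local",
              "10.20.0.0/16": "tgw-0example",
              "0.0.0.0/0": "nat-0example"}
public_rt = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-0example"}

print(pick_route(private_rt, "10.0.1.5"))
print(pick_route(private_rt, "10.20.3.4"))
print(pick_route(private_rt, "8.8.8.8"))
print(is_public(private_rt), is_public(public_rt))
```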

Security groups vs NACLs (the classic confusion)

Security group: stateful, on the ENI/instance, allow-only, can reference other SGs. If you allow inbound 443, the response goes out automatically. NACL: stateless, on the subnet, allow+deny, evaluated in rule-number order. If you allow inbound 443, you must separately allow outbound on the ephemeral port range for the response. Use SGs as your primary control; use NACLs for coarse subnet-level deny (for example blocking a bad IP range).
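The stateful versus stateless difference is easiest to see in a small simulation. The sketch below models only ports (no CIDRs or protocols): a NACL is a list of numbered rules evaluated in order with an implicit final deny, and a round trip needs both the inbound rule for the server port and an outbound rule for the client's ephemeral port. A security group needs only the inbound allow. All names and the sample port numbers are illustrative.

```python
def nacl_allows(rules, port):
    """First matching rule in rule-number order decides; no match means deny."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        if rule["from_port"] <= port <= rule["to_port"]:
            return rule["action"] == "allow"
    return False  # the implicit catch-all deny


def sg_round_trip(inbound_ports, server_port):
    """Stateful: allowing the request implicitly allows the response."""
    return server_port in inbound_ports


def nacl_round_trip(inbound, outbound, server_port, client_port):
    """Stateless: the request and the response are evaluated separately."""
    return nacl_allows(inbound, server_port) and nacl_allows(outbound, client_port)


https_in = [{"number": 100, "action": "allow", "from_port": 443, "to_port": 443}]
ephemeral_out = [{"number": 100, "action": "allow",
                  "from_port": 1024, "to_port": 65535}]

print(sg_round_trip({443}, 443))
print(nacl_round_trip(https_in, [], 443, 50000))
print(nacl_round_trip(https_in, ephemeral_out, 443, 50000))
```

The middle case is the classic symptom: connections time out even though "443 is open", because the reply to the ephemeral port is silently dropped.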

CIDR planning

Pick RFC 1918 space with room to grow (a /16 per VPC is common). Reserve distinct ranges per environment/Region/account.
Never overlap CIDRs across VPCs or with on-premises if they might ever connect. Overlap makes Transit Gateway/peering/VPN routing impossible without NAT.
Leave room for multiple subnet tiers across 3 AZs (public, private-app, private-data, per AZ).
Watch IP consumption: the EKS VPC CNI, interface endpoints, and large fleets consume many ENIs/IPs.
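The tier-by-AZ layout above can be worked out with the standard ipaddress module before any VPC exists. The sketch below carves equal-sized subnets out of a VPC CIDR, one per tier and AZ, and subtracts the 5 addresses AWS reserves per subnet. Equal sizing is an assumption made to keep the example short (data tiers are often smaller in practice), and the function names and AZ IDs are illustrative.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr, tiers, azs, new_prefix):
    """Assign one subnet of size /new_prefix to every (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = len(tiers) * len(azs)
    available = 2 ** (new_prefix - vpc.prefixlen)
    if needed > available:
        raise ValueError(f"need {needed} subnets, VPC holds only {available}")
    blocks = vpc.subnets(new_prefix=new_prefix)
    return {(tier, az): next(blocks) for tier in tiers for az in azs}


def usable_ips(subnet):
    """Addresses left for your resources after the AWS-reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


plan = plan_subnets("10.0.0.0/16",
                    ["public", "private-app", "private-data"],
                    ["use1-az1", "use1-az2", "use1-az3"],
                    new_prefix=20)
for (tier, az), subnet in plan.items():
    print(f"{tier:13} {az}  {subnet}  usable={usable_ips(subnet)}")
```

With a /16 and /20 subnets, nine of the sixteen possible blocks are used, which leaves seven spare for future tiers or a fourth AZ.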

Common mistake

Using the same default-looking CIDR (like 10.0.0.0/16) in every VPC. The day you need to connect them (merger, shared services, hybrid) you cannot route between overlapping ranges. Maintain a central IP allocation plan (IPAM can help) before the first VPC.
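An allocation plan can be checked for overlaps with a few lines of standard-library Python. The sketch below is a minimal check over a name-to-CIDR mapping; it is not a substitute for a managed IPAM, and the allocation names and ranges are invented for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(allocations):
    """Return sorted name pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]


allocations = {
    "app-a-prod": "10.0.0.0/16",
    "app-b-prod": "10.0.0.0/16",    # same default-looking range reused
    "shared-services": "10.1.0.0/16",
    "on-prem": "10.0.128.0/17",     # sits inside 10.0.0.0/16
}
print(find_overlaps(allocations))
```

Running a check like this against the plan in version control, before any VPC is created, is far cheaper than renumbering later.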

Diagram: three-tier VPC across AZs

Web/ALB tier in public subnets; app tier and data tier in private subnets; database Multi-AZ across two AZs. Outbound from private subnets goes through a NAT Gateway in each AZ. The database has no route to the internet.

Diagram: private subnet reaching S3 via a Gateway VPC endpoint

A Gateway endpoint (S3, DynamoDB only) adds a route so traffic to the S3 prefix list stays on the AWS network. It is free and removes NAT data-processing charges for S3 traffic. For other services use an Interface endpoint (PrivateLink), which is an ENI in your subnet, priced per hour + per GB.

Diagram: hub-and-spoke with Transit Gateway

Transit Gateway is a regional router. Each VPC and the on-prem connection attach once. TGW route tables control which spokes can reach which (for example, prod isolated from non-prod, both reaching shared services). This replaces an unmanageable mesh of VPC peerings. Watch per-attachment and per-GB data-processing charges.

Connecting networks: which option

Option	Use when	Watch out for
VPC Peering	A few VPCs need direct, low-latency connectivity.	Non-transitive (A-B and B-C does not give A-C). Scales poorly (N-squared). No overlapping CIDRs.
Transit Gateway	Many VPCs and hybrid connections; you want central routing/segmentation.	Per-attachment + per-GB charges; route-table design complexity.
Cloud WAN	Large, global, multi-Region networks managed by one central policy.	Newer, higher-level; evaluate cost/fit vs TGW.
PrivateLink	Expose/consume a specific service privately (yours or a SaaS partner's) without exposing the whole VPC.	One-directional service access, not general network connectivity. Per-endpoint cost.
Site-to-Site VPN	You need encrypted hybrid connectivity quickly, or a backup path for Direct Connect.	Throughput limits; use 2 tunnels + BGP for HA.
Direct Connect	Consistent low latency, high throughput, large steady transfer, cheaper egress.	Weeks to provision; pair with VPN backup; DX Gateway for multi-VPC/Region.

Architect note: PrivateLink vs peering vs TGW

Use PrivateLink when you want to share one service across accounts without joining networks (great for SaaS and internal platform teams, avoids CIDR overlap issues entirely). Use peering for a couple of VPCs. Use Transit Gateway when connectivity is many-to-many or hybrid. Mixing these deliberately is normal in a mature estate.
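
The "scales poorly" point about peering is simple arithmetic: a full mesh needs one peering connection per pair of VPCs, while a hub needs one attachment per VPC. The function names below are illustrative.

```python
def full_mesh_peerings(vpcs: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """Transit Gateway attachments for the same reachability: one per VPC."""
    return vpcs


for n in (5, 20, 100):
    print(f"{n:>3} VPCs: {full_mesh_peerings(n):>5} peerings vs {tgw_attachments(n):>3} TGW attachments")
```

The counts ignore cost: each TGW attachment and each GB processed is billed, so a handful of VPCs can still be cheaper on peering.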

Route 53 and DNS

Public hosted zones serve internet DNS; private hosted zones serve names inside associated VPCs.
Resolver endpoints (inbound/outbound) bridge DNS between on-prem and VPC for hybrid name resolution.
Routing policies: simple, weighted, latency, failover, geolocation, geoproximity, multivalue. Failover + health checks are the basis of DNS-level DR.
Alias records point to AWS resources (ALB, CloudFront, S3 website) and are free to query.

Firewalls and traffic security

Service	Layer	Use
Security Groups	Instance (ENI), stateful	Primary allow-list for workload traffic.
NACLs	Subnet, stateless	Coarse subnet deny (blocklists).
AWS Network Firewall	VPC, stateful IDS/IPS	Deep packet inspection, domain filtering, centralized egress control.
AWS WAF	L7 (ALB/API GW/CloudFront)	OWASP rules, rate limiting, bot control for web apps.
AWS Shield	Edge	DDoS protection (Standard free; Advanced paid).
Firewall Manager	Org-wide	Centrally manage WAF/Shield/Network Firewall/SG policies across accounts.

Visibility

VPC Flow Logs: record accepted/rejected traffic metadata per ENI/subnet/VPC to CloudWatch Logs or S3. Essential for "is traffic even arriving, and is it being rejected."
Reachability Analyzer: static analysis of whether a path exists between two resources and, if not, which component blocks it. The fastest way to debug SG/NACL/route problems.
Traffic Mirroring: copy packets to security/monitoring appliances for deep inspection.
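
For quick triage, Flow Log records in the default format are space-separated and easy to filter. The sketch below assumes the default (version 2) field order; custom log formats will differ, and the sample records, account ID, and ENI ID are invented.

```python
# Field order of the default (version 2) VPC Flow Logs format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line):
    """Split one default-format record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Return only the records whose action is REJECT."""
    return [rec for rec in map(parse_record, lines) if rec["action"] == "REJECT"]


# Invented sample records for illustration.
sample = [
    "2 123456789012 eni-0abc1234 203.0.113.12 10.0.1.25 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc1234 198.51.100.7 10.0.1.25 50234 22 6 3 180 1700000000 1700000060 REJECT OK",
]

for rec in rejected(sample):
    print(rec["interface_id"], rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"])
```

No record at all for the expected flow usually means the packet never reached the ENI, which points back at routing rather than SGs or NACLs.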

Networking troubleshooting runbooks

EC2 instance cannot reach the internet [connectivity]

Checks: Is the instance in a public subnet with a public IP/EIP and a 0.0.0.0/0 -> IGW route? Or in a private subnet needing a 0.0.0.0/0 -> NAT route? Does the security group allow outbound (default allows all; a locked-down SG may not)? Does the NACL allow outbound and the return ephemeral ports? Is the NAT Gateway healthy and in a public subnet with an IGW route?
Fastest path: run Reachability Analyzer from the ENI to an internet destination; check VPC Flow Logs for REJECT. Prevention: standardize subnet/route templates in IaC.

Private instance cannot download OS patches [connectivity]

Likely: no NAT route, NAT in a failed AZ, or the repo is reached over the internet with no path. Checks: private subnet route table has 0.0.0.0/0 -> nat-...; NAT is in a public subnet with an IGW route; SG/NACL allow 443 out. Better design: use Systems Manager Patch Manager with S3/SSM VPC endpoints so patching does not depend on the internet at all.

Instance cannot reach S3 privately [endpoint]

Checks: Is there an S3 Gateway endpoint associated with the route table for that subnet? Does the endpoint policy allow the action/bucket? Does the bucket policy allow access from the endpoint/VPC (aws:SourceVpce)? Is the SG outbound allowing HTTPS? Gotcha: Gateway endpoint routes are added to specific route tables; a subnet using a different route table will not use it.
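
The bucket-policy side of that check usually looks like a deny statement keyed on `aws:SourceVpce`. The sketch below only builds the JSON document; the bucket name and endpoint ID are placeholders, and the helper name is hypothetical.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a bucket policy that denies S3 access not arriving via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder names for illustration only.
policy = vpce_only_bucket_policy("example-app-data", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A deny like this also blocks console and administrative access from outside the endpoint, so test it on a non-critical bucket and keep a recovery path before applying it widely.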

Application cannot connect across VPCs [routing]

Checks: Is there a peering/TGW attachment, and are both route tables updated for the remote CIDR? For TGW, does the TGW route table associate/propagate correctly for both attachments? Do SGs on both ends allow the traffic (SG references do not cross VPC/peering unless same-Region peering with SG referencing enabled)? Any CIDR overlap? Tool: Reachability Analyzer across the VPCs.

On-premises cannot reach AWS [hybrid]

Checks: VPN tunnel(s) UP and BGP routes exchanged? DX virtual interface up and advertising routes? VPC/TGW route tables have the on-prem CIDR pointing at the VGW/TGW? On-prem firewall/routing has the AWS CIDR pointing at the tunnel? SG/NACL allow the source CIDR? Gotcha: asymmetric routing when both VPN and DX exist; set route priorities deliberately.

Load balancer health checks failing [elb]

Checks: target SG allows the health-check port from the load balancer SG; the health-check path returns 200 (not a redirect); the app is actually listening on the target port; target-group protocol/port match the app; instance is in a subnet the LB can reach; NACL allows the traffic. See the Load Balancing tab for the full runbook.

DNS resolution issue [dns]

Checks: Are enableDnsSupport and enableDnsHostnames on for the VPC? For private zones, is the private hosted zone associated with the VPC? For hybrid, are Resolver endpoints/rules configured toward on-prem? Is the record correct and TTL not stale? Tool: dig/nslookup from an instance in the VPC; check the .2 resolver (VPC base +2).
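
The ".2 resolver" is the VPC's base address plus two. A tiny sketch with the standard library, using an example CIDR (for a VPC with several CIDR blocks, use the primary one):

```python
import ipaddress


def vpc_resolver_ip(vpc_cidr):
    """Return the VPC DNS resolver address: network base address + 2."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


print(vpc_resolver_ip("10.20.0.0/16"))    # 10.20.0.2
print(vpc_resolver_ip("172.31.0.0/16"))   # 172.31.0.2
```

Pointing dig at that address from an instance (for example `dig @10.20.0.2 name`) confirms whether the VPC resolver itself returns the record.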

Security group vs NACL vs route table isolation [firewall]

Isolate the layer: (1) Route table: is there a route to the destination at all? (2) NACL: is inbound and outbound (ephemeral) allowed on both subnets? (3) SG: does the target allow the source (IP or SG) inbound on the port, and does the source SG allow the outbound connection (return traffic is automatic for SGs)? Flow Logs show ACCEPT/REJECT per ENI, which narrows down where the packet was dropped. Change one thing at a time.

Transit Gateway route issue [tgw]

Checks: attachment is associated with the correct TGW route table; the destination CIDR is propagated or statically added; the VPC subnet route table sends the remote CIDR to the TGW; appliance mode is on if using a centralized inspection VPC (prevents asymmetric flows across AZs). Gotcha: forgetting the VPC-side route to the TGW even though the TGW route table is correct.

VPC endpoint policy issue [endpoint]

Checks: the endpoint policy is not overly restrictive (default is allow-all; a custom policy may block the action/resource); for interface endpoints, the endpoint SG allows 443 from the client; private DNS is enabled so the service hostname resolves to the endpoint. Gotcha: combining a restrictive endpoint policy with a restrictive bucket policy and an IAM policy, then debugging only one of the three.

AWS networking gotchas

Gotchas and how to avoid them

SG vs NACL confusion. SG stateful/instance; NACL stateless/subnet. Return traffic is automatic only for SGs.
Forgetting NACL return traffic. With custom NACLs you must allow outbound ephemeral ports for responses.
Overlapping CIDRs. Blocks peering/TGW/VPN routing unless you add NAT workarounds. Plan IP space centrally up front.
Too many peering connections. N-squared mesh; move to Transit Gateway.
NAT assumed to allow inbound. It is outbound-only. Inbound needs an IGW plus a public IP on the instance or a public load balancer in front of it.
One NAT for all AZs. Cross-AZ charges and an AZ-failure single point. One NAT per AZ.
Databases in public subnets. Put data tiers in private subnets with no internet route.
Ignoring VPC endpoint policies. They silently restrict access on top of IAM and resource policies.
Cross-AZ data charges. Chatty cross-AZ traffic (app-to-DB, replication) adds up; keep tightly-coupled traffic AZ-local where sensible while staying HA.
Not using VPC endpoints for AWS traffic. Routing S3/ECR/API traffic through NAT wastes money and adds an internet dependency.

4. Compute Deep Dive

From EC2 virtual machines to Lambda functions. The recurring decision is how much of the stack you want to own: full OS (EC2), containers (ECS/EKS/Fargate, covered in the Containers tab), or just code (Lambda). This tab focuses on EC2 and the surrounding compute primitives.

TL;DR

Choose the smallest unit that fits: Lambda for event-driven/short tasks, containers for services, EC2 when you need a full OS or specialized hardware. On EC2, pick the right instance family for the workload profile, run across AZs behind an Auto Scaling Group, enforce IMDSv2, manage it with Systems Manager (no SSH), and buy Savings Plans/Reserved for steady load and Spot for interruptible work.

EC2 essentials

Concept	What	Note
Instance family/type	Named by family + generation + size (for example `m7i.xlarge`): general, compute, memory, storage, accelerated.	The letter is the workload profile; the number is the generation (newer = better price/perf). Match to the workload.
Nitro system	AWS's hypervisor/hardware offload powering modern instances.	Enables bare-metal, higher performance, better security isolation, and features like EBS-optimized-by-default.
AMI	Amazon Machine Image: the template (OS + config) an instance boots from.	Build golden AMIs with EC2 Image Builder; bake in the SSM agent, CloudWatch agent, and hardening.
EBS-backed vs instance store	Root/data on network EBS (persistent) vs local NVMe (ephemeral, lost on stop/terminate).	Instance store is fast but temporary; never put durable data on it. EBS survives stop/start.
User data	Bootstrap script run at first boot.	Use for lightweight bootstrap; prefer baked AMIs + config management for repeatability.
IMDS / IMDSv2	Instance Metadata Service exposes instance info and role credentials at 169.254.169.254.	Enforce IMDSv2 (session/token-based). IMDSv1 was exploited via SSRF to steal role credentials.
Instance profile	The wrapper that attaches an IAM role to an instance.	The right way to give an instance AWS permissions. No keys on disk.
Placement groups	Control physical placement: cluster (low latency), spread (max isolation), partition (large distributed).	Cluster for HPC/low-latency; spread for small critical sets; partition for HDFS/Cassandra-style.
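
The naming scheme in the first row (family letters, generation number, attribute letters, then size) can be split mechanically. The parser below is a sketch for common names only; the pattern is an assumption about the convention, and names that do not follow it (for example the high-memory `u-` types) are rejected rather than guessed.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # workload profile letters, e.g. "m", "r", "trn"
    generation: int   # higher is newer
    attributes: str   # extra capability letters, e.g. "i", "g", "gn", "iedn"
    size: str         # e.g. "xlarge"


_PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z0-9-]*)\.(?P<size>[a-z0-9-]+)$"
)


def parse_instance_type(name):
    """Split a name such as 'm7i.xlarge' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type name: {name!r}")
    return InstanceType(
        family=match["family"],
        generation=int(match["generation"]),
        attributes=match["attributes"],
        size=match["size"],
    )


for name in ("m7i.xlarge", "t4g.micro", "x2iedn.32xlarge", "trn1.2xlarge"):
    print(name, parse_instance_type(name))
```

This is handy for fleet reports, for example flagging instances more than two generations behind the newest one you have approved.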

Security note: enforce IMDSv2

Set instance metadata to HttpTokens=required (IMDSv2) and set the hop limit to 1 (use 2 only where containers on the instance must reach IMDS). This closes the SSRF-to-credential-theft class of attacks that has repeatedly leaked EC2 role credentials. Enforce it org-wide via an SCP/Config rule and bake it into launch templates and AMIs.
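
What "session/token-based" means in practice: a client first obtains a token with a PUT request, then presents it on every metadata read. The sketch below uses only the standard library; the helper names are illustrative, and `fetch_metadata` works only when run on an EC2 instance.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def token_request(ttl_seconds=21600):
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path, token):
    """Build a metadata GET request that carries the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path, timeout=2.0):
    """Token handshake followed by the read. Only reachable from an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()


# On an instance: print(fetch_metadata("instance-id"))
```

A typical SSRF bug lets an attacker trigger a simple GET but not a PUT with a custom header, which is why requiring the token step blocks that path.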

Choosing an instance family

Workload	Family type	Examples (verify current gen)	Why
Web / middleware / general	General purpose (M, or Graviton Mg)	m7i, m7g, t4g (burstable)	Balanced CPU:memory. Graviton (Arm) often 20-40% better price/perf if the stack supports Arm.
CPU-heavy (batch, encoding, HPC front)	Compute optimized (C)	c7i, c7g	Higher CPU:memory ratio and clock for compute-bound work.
Databases / in-memory / analytics	Memory optimized (R, X, z)	r7i, r7g, x2idn, z1d	High memory per vCPU. X/high-memory for SAP HANA and large in-memory DBs.
Oracle / SQL Server (self-managed)	Memory optimized (R/X), consider core count for licensing	r7i, x2iedn	Memory-bound and license-cost-sensitive; fewer, faster cores can cut license cost.
SAP workloads	Memory optimized / high-memory (SAP-certified)	Certified X/High Memory/U instances	SAP requires certified instances; check the SAP-on-AWS certified list.
Storage / big-data / high-IOPS local	Storage optimized (I, D, Im)	i4i, im4gn, d3	Large fast local NVMe for NoSQL, data nodes, warehouses.
GPU / ML / graphics	Accelerated (P, G, Trn/Inf)	p5, g6, trn1, inf2	P for training, G for inference/graphics, Trainium/Inferentia for cost-efficient ML.
Cost-sensitive / bursty non-prod	Burstable (T) + Spot	t4g, t3	CPU credits for low-average, spiky load; combine with Spot for savings.

DBA note: cores and licensing

For self-managed Oracle or SQL Server on EC2, instance choice drives license cost. Both are licensed by core (with vendor core factors). Prefer higher-clock, memory-optimized instances so you meet performance with fewer cores, and pin/verify vCPU-to-core mapping (Nitro exposes cores + threads). You can also use the Optimize CPUs options (active core count and threads per core) to disable hyperthreading or run fewer cores, which can reduce the licensable count. Always confirm your specific licensing terms; this is contractual, not technical.

Purchasing and capacity models

Model	Use	Trade-off
On-Demand	Default, spiky, short-lived, unpredictable.	Most expensive per hour; zero commitment.
Savings Plans	Steady baseline compute (EC2, Fargate, Lambda).	1 or 3-year commit for up to ~72% off. Compute Savings Plans are the most flexible.
Reserved Instances	Steady load on specific instance families (largely legacy; Savings Plans suit most cases).	Similar savings but less flexible than Savings Plans.
Spot Instances	Fault-tolerant, interruptible: batch, CI, stateless web, big data.	Up to ~90% off but can be reclaimed with a 2-minute warning. Never for stateful singletons.
Dedicated Hosts	Physical-host isolation, socket/core-based BYOL licensing (Windows/Oracle), compliance.	Most expensive; needed for host-based licensing and strict isolation.
Dedicated Instances	Hardware isolated from other accounts (but not host-level control).	Isolation without the BYOL granularity of Dedicated Hosts.
Capacity Reservations	Guarantee capacity in a specific AZ (for example for a launch or DR).	Pay for reserved capacity whether or not used; combine with Savings Plans for the discount.

Cost note

The biggest EC2 savings come from three moves: right-size (Compute Optimizer), schedule non-prod off nights/weekends, and commit the steady baseline with a Compute Savings Plan while running interruptible work on Spot. Do these before chasing smaller optimizations.
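
The scheduling move is worth quantifying before arguing about it. A minimal sketch of the arithmetic, assuming an instance billed only while running and a 12-hours-a-day, weekdays-only schedule picked for illustration:

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_savings(hours_on_per_day, days_on_per_week):
    """Fraction of instance-hours saved versus running 24x7."""
    hours_on = hours_on_per_day * days_on_per_week
    if not 0 <= hours_on <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit inside one week")
    return 1 - hours_on / HOURS_PER_WEEK


# Non-prod running 12 hours a day, Monday to Friday.
print(f"{scheduled_savings(12, 5):.0%}")  # 64%
```

This covers compute hours only: attached EBS volumes keep billing while the instance is stopped.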

Auto Scaling and high availability

Run instances in an Auto Scaling Group spanning at least 3 AZs, behind a load balancer, launched from a launch template (launch configurations are legacy).
Use ELB health checks (not just EC2 status checks) so unhealthy-but-running app instances are replaced.
Set scaling policies (target tracking on CPU/requests) and lifecycle hooks for graceful add/remove.
Use mixed instances + Spot in the ASG for cost with capacity resilience.
Treat instances as cattle: no unique state on the box; store state in RDS/S3/ElastiCache.
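
For the target-tracking item above, the policy body is small. The sketch below builds the target-tracking configuration document as a plain dict in the shape the Auto Scaling PutScalingPolicy API accepts; the helper name and the 50% target are assumptions, and nothing here calls AWS.

```python
import json


def cpu_target_tracking(target_percent):
    """Target-tracking configuration that holds average ASG CPU near a target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": float(target_percent),
        "DisableScaleIn": False,  # let the policy remove capacity as well as add it
    }


config = cpu_target_tracking(50)  # hypothetical target
print(json.dumps(config, indent=2))
```

For request-driven services behind an ALB, a request-count-per-target metric is often a better signal than CPU.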

Managing EC2 with Systems Manager

Capability	Replaces	Note
Session Manager	SSH, bastion hosts	Browser/CLI shell with no open port 22, no keys; session activity can be logged to CloudWatch Logs/S3.
Patch Manager	Manual patching	Scheduled patch baselines with maintenance windows and compliance reporting.
Run Command	Ad hoc SSH loops	Run a command across a fleet by tag, with output captured.
Parameter Store	Config files/secrets sprawl	Central config and SecureString secrets (cheaper than Secrets Manager for simple cases).
Inventory / State Manager	Manual audits	Track installed software/config and enforce desired state.

Operations note: kill SSH

Requirements for Session Manager: the SSM agent (pre-installed on modern AMIs), an instance role with the SSM managed policy, and network reachability to SSM endpoints (via NAT or, better, interface VPC endpoints). Once in place, close inbound 22/3389 entirely. You get keyless access, full audit, and no bastion to patch.

Other compute options

Service	Use
Lambda	Event-driven functions, API backends, automation. 15-min max, scales to zero. See the AI/Integration tabs for patterns.
AWS Batch	Managed batch job scheduling over EC2/Fargate/Spot for HPC, rendering, genomics.
Elastic Beanstalk	PaaS that provisions EC2/ASG/ELB for you from app code. Simple, but less control; often superseded by containers.
Lightsail	Simplified VPS with predictable flat pricing for small/simple workloads and quick prototypes.
EC2 Image Builder	Automated, tested, hardened golden-AMI pipelines.

EC2 operational runbooks

How to resize an EC2 instance [op]

Stop the instance (EBS-backed), change the instance type, start it. Requires downtime for that instance. For zero downtime, launch new instances of the new type in the ASG and drain the old (rolling replacement). Verify the new family supports your AMI architecture (x86 vs Arm/Graviton) and any enhanced networking/EBS-optimized requirements.

How to patch EC2 safely [op]

Use Patch Manager with a patch baseline and a maintenance window. Patch a canary/one AZ first, verify health via ELB checks, then proceed. For immutable infra, patch by building a new AMI (Image Builder) and rolling the ASG rather than patching in place. Always have a rollback (previous AMI/launch template version).

Troubleshoot boot / instance not reachable [rb]

Check EC2 status checks: system status (AWS infra, often fixed by stop/start which moves hosts) vs instance status (OS-level). Use EC2 Serial Console or get-console-output to see boot logs. Common causes: full root disk, bad /etc/fstab mount, failed cloud-init/user-data, kernel/GRUB issue. For access issues, verify SG/subnet/route and prefer Session Manager over SSH.

Troubleshoot high CPU [rb]

CloudWatch CPUUtilization shows the trend; for burstable (T) instances also check CPUCreditBalance (exhausted credits throttle you). On the box: top/htop, identify the process, check for runaway threads, GC, or a code loop. Fix by scaling out (ASG), scaling up (larger/compute-optimized), or fixing the app. For chronic T-instance credit exhaustion, move to M/C family.

Troubleshoot memory pressure [rb]

Memory is not a default CloudWatch metric; install the CloudWatch agent to get mem_used_percent. On the box: free -m, check for OOM-killer events in dmesg/syslog, identify the consumer. Fix: right-size to a memory-optimized (R) family, tune the app/JVM heap, or add swap as a stopgap (not a real fix).

Troubleshoot disk full [rb]

Disk metrics also need the CloudWatch agent. On the box: df -h to find the full filesystem, du -sh /* to find the culprit (often logs, /tmp, or a runaway file). Fix immediately by clearing/rotating; longer term grow the EBS volume online (modify volume, then extend the partition and filesystem with growpart/resize2fs/xfs_growfs). Set up log rotation and disk alarms.

Design EC2 for production HA [design]

ASG across 3 AZs, behind an ALB/NLB, min capacity that survives one AZ loss, ELB health checks, target-tracking scaling, launch template with IMDSv2/SSM/CloudWatch agent baked in, no local state, automated patching, and Multi-AZ backing stores (RDS/EFS). Add capacity reservations only where guaranteed capacity is required.
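
"Min capacity that survives one AZ loss" has a simple sizing rule: the remaining AZs must carry the full load on their own. The sketch assumes the ASG keeps instances evenly balanced across AZs; the function name and numbers are illustrative.

```python
import math


def min_capacity_for_az_loss(required_instances, azs):
    """Smallest evenly balanced fleet that still has `required_instances` after losing one AZ."""
    if azs < 2:
        raise ValueError("need at least two AZs to survive an AZ loss")
    per_az = math.ceil(required_instances / (azs - 1))
    return per_az * azs


# Six instances are needed to serve peak load.
print(min_capacity_for_az_loss(6, 3))  # 9  (3 per AZ; 6 remain after one AZ fails)
print(min_capacity_for_az_loss(6, 2))  # 12 (6 per AZ; 6 remain after one AZ fails)
```

The 3-AZ layout needs 50% headroom where the 2-AZ layout needs 100%, which is a large part of the cost case for spreading across three AZs.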

5. Storage Deep Dive

Block, object, and file storage, plus archive, backup, and data-movement services. The first decision is almost always block (EBS) vs object (S3) vs file (EFS/FSx); get that right and the rest is configuration.

TL;DR

EBS = a disk for one EC2 instance (AZ-scoped block). S3 = durable object store for anything (not a filesystem). EFS = shared POSIX/NFS across many instances (regional). FSx = managed Windows/Lustre/ONTAP/OpenZFS file systems. Use storage classes and lifecycle on S3, and gp3 as the EBS default. Snapshots are how block data crosses AZ/Region boundaries.

Choosing the storage type

Type	Access	Scope	Use for
EBS (block)	Attached to one instance (or multi-attach for io2)	AZ	Boot disks, database files, low-latency single-instance storage.
Instance store (block)	Local NVMe on the host	Ephemeral	Scratch, cache, temp data you can lose on stop/terminate.
S3 (object)	HTTPS API, not a mount	Regional	Data lake, backups, static assets, logs, artifacts, archive.
EFS (file, NFS)	Mounted by many Linux instances/containers	Regional	Shared POSIX filesystem, lift-and-shift apps needing a shared mount.
FSx (file)	SMB/Lustre/NFS depending on flavour	AZ / Multi-AZ	Windows SMB shares, HPC scratch, NetApp ONTAP features, OpenZFS.

EBS in depth

Volume type	Profile	Use
gp3 (SSD)	Baseline SSD; IOPS/throughput provisioned independently of size.	The default for most workloads. Cheaper than gp2 and you tune performance without growing the disk.
io2 / io2 Block Express (SSD)	High, consistent IOPS with high durability; Block Express for the largest/fastest.	Mission-critical databases (Oracle, SQL Server, high-TPS PostgreSQL) needing guaranteed IOPS/low latency.
st1 (HDD)	Throughput-optimized HDD.	Big sequential workloads: logs, data warehouse staging, streaming. Not for random I/O or boot.
sc1 (HDD)	Cold HDD, cheapest.	Infrequently accessed large data where cost beats performance.

Snapshots are incremental, stored in S3 (managed), and are how you back up, copy across AZ/Region, and create AMIs.
Encryption via KMS; enable encryption-by-default at the account level so no unencrypted volume is ever created.
Multi-Attach (io2) lets a volume attach to multiple instances in one AZ, but the app/cluster filesystem must handle concurrency (for example clustered databases). Not a general share.
Performance: gp3 gives 3,000 IOPS / 125 MB/s baseline, tunable up; ensure the instance is EBS-optimized and large enough to not cap volume throughput.
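
To see why gp3 decouples performance from size, compare it with gp2, whose baseline scales at 3 IOPS per GiB (floor 100, ceiling 16,000). A short sketch of that arithmetic; gp3 maximums are deliberately left out because they change over time and should be taken from the current EBS documentation.

```python
import math

GP3_BASELINE_IOPS = 3000  # included with every gp3 volume, regardless of size


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, minimum 100, maximum 16,000."""
    return min(max(100, 3 * size_gib), 16000)


def gp2_size_for_iops(target_iops):
    """Smallest gp2 volume (GiB) whose baseline reaches the target."""
    if target_iops > 16000:
        raise ValueError("gp2 baseline tops out at 16,000 IOPS")
    return math.ceil(target_iops / 3)


size = 100  # GiB, an example data volume
print(f"gp2 {size} GiB baseline: {gp2_baseline_iops(size)} IOPS")
print(f"gp3 {size} GiB baseline: {GP3_BASELINE_IOPS} IOPS")
print(f"gp2 size needed for 3000 IOPS baseline: {gp2_size_for_iops(3000)} GiB")
```

On gp2, small volumes reach higher IOPS only in bursts while credits last; on gp3 the 3,000 IOPS are sustained and extra IOPS are provisioned separately.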

DBA note: database storage on EBS

For self-managed databases, prefer io2 Block Express or well-provisioned gp3, separate data/redo/temp onto volumes with appropriate profiles, and size the instance so EBS bandwidth is not the bottleneck. Use crash-consistent snapshots carefully (quiesce or use application-consistent backups for transactional integrity). Monitor VolumeQueueLength and, on gp2/st1/sc1 volumes, BurstBalance.

Common mistake

Leaving unattached EBS volumes and old snapshots around. They bill silently forever. Also: assuming an EBS volume is available in another AZ; it is AZ-locked. To move it, snapshot and restore in the target AZ/Region.

S3 in depth

Storage classes

Class	Use	Note
Standard	Frequently accessed data.	Default; multi-AZ durability.
Intelligent-Tiering	Unknown/changing access patterns.	Auto-moves objects between tiers; small monitoring fee, no retrieval fees. Good default for unpredictable data.
Standard-IA	Infrequent but rapid access.	Cheaper storage, per-GB retrieval fee, min duration/size charges.
One Zone-IA	Infrequent, re-creatable data.	Single AZ (less durable); cheaper. Not for irreplaceable data.
Glacier Instant Retrieval	Archive with millisecond access.	Cheap storage, higher retrieval cost; for rarely-read-but-need-it-now data.
Glacier Flexible Retrieval	Archive, minutes-to-hours retrieval.	Cheaper; retrieval delay (expedited/standard/bulk).
Glacier Deep Archive	Long-term archive, hours retrieval.	Cheapest; for 7-10 year retention (compliance).

Key features

Versioning keeps prior object versions (protects against overwrite/delete); pair with lifecycle to expire noncurrent versions.
Lifecycle rules transition objects to cheaper classes and expire them on a schedule.
Object Lock (WORM) enforces retention for compliance/ransomware protection; combine with a backup account.
Replication (CRR/SRR) copies objects across Regions/buckets for DR or locality.
Multipart upload for large objects; pre-signed URLs for temporary, credential-free access.
Access Points give per-application access policies to a shared bucket.
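
Versioning and lifecycle rules work together, so it helps to see them in one rule. The dict below is a sketch in the shape of an S3 lifecycle configuration; the rule ID, prefix, and day counts are assumptions, and the small checker only catches ordering mistakes, not every constraint S3 enforces.

```python
import json

# Hypothetical rule: tier logs down over time, expire after roughly 7 years,
# and clean up noncurrent versions created by versioning.
lifecycle = {
    "Rules": [
        {
            "ID": "logs-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}


def check_rule(rule):
    """Raise ValueError if transitions are out of order or expiry precedes them."""
    days = [t["Days"] for t in rule.get("Transitions", [])]
    if days != sorted(set(days)):
        raise ValueError("transition days must be strictly increasing")
    expiry = rule.get("Expiration", {}).get("Days")
    if expiry is not None and days and expiry <= days[-1]:
        raise ValueError("expiration must come after the last transition")


for rule in lifecycle["Rules"]:
    check_rule(rule)
print(json.dumps(lifecycle, indent=2))
```

Without the noncurrent-version expiration, a versioned bucket keeps every overwritten copy indefinitely and storage cost grows without any visible objects.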

S3 security

Security note: locking down S3

Keep Block Public Access ON at the account and bucket level unless you have a deliberate public use case (and even then, front it with CloudFront + OAC). Control access with bucket policies + IAM (prefer these over ACLs; disable ACLs with the bucket-owner-enforced setting). Enable default encryption, versioning, and access logging. Use aws:SourceVpce/aws:PrincipalOrgID conditions to scope access. Public bucket exposure remains one of the most common causes of AWS data breaches.

EFS and FSx

EFS: regional, elastic NFS; create a mount target per AZ so every AZ can mount it. Performance modes (General Purpose vs Max I/O) and throughput modes (Bursting/Elastic/Provisioned). Lifecycle to IA/Archive for cost. Higher latency/cost than EBS for single-instance work, so use it when you genuinely need a shared filesystem.
FSx for Windows File Server: managed SMB with AD integration, for Windows apps and shares.
FSx for Lustre: high-performance parallel filesystem for HPC/ML, can link to S3.
FSx for NetApp ONTAP: full ONTAP (snapshots, dedup, compression, SnapMirror, multi-protocol) for enterprises standardizing on NetApp.
FSx for OpenZFS: ZFS features (snapshots, clones) with NFS access.

Data movement and hybrid

Service	Use
AWS Backup	Centralized, policy-based backup and cross-Region/account copies across many services.
Storage Gateway	Hybrid access: present S3/EBS/tape to on-prem via file (SMB/NFS), volume, or tape gateway.
DataSync	Fast, scheduled bulk transfer between on-prem/other clouds and AWS storage (S3/EFS/FSx).
Transfer Family	Managed SFTP/FTPS/FTP endpoints backed by S3/EFS.
Snow Family	Physical devices to move petabytes when the network is impractical (Snowball Edge / Snowcone).

Storage decision examples

Need	Choose	Why
Database data files (self-managed)	EBS io2 Block Express / gp3	Low-latency block, provisioned IOPS, per-instance.
Shared app filesystem (Linux)	EFS	Many instances mount the same POSIX tree.
Windows file share	FSx for Windows	Native SMB + AD.
Static website / assets	S3 + CloudFront	Durable object store served at the edge.
Backups	S3 (IA/Glacier) via AWS Backup	Durable, lifecycle to archive, cross-Region copy.
Log archive / long retention	S3 Glacier Deep Archive + Object Lock	Cheapest, WORM for compliance.
Data lake	S3 + Glue/Athena/Lake Formation	Object store as the lake substrate.
HPC scratch	FSx for Lustre	Parallel high-throughput filesystem, S3-linked.
Hybrid file access to cloud	Storage Gateway / DataSync	Bridge on-prem to S3/EFS/FSx.
Move petabytes offline	Snowball Edge	Network transfer would take too long.

Storage gotchas

Gotchas

S3 is object storage, not a filesystem. No in-place edits, no POSIX semantics; whole-object PUT/GET. Use EFS/FSx if you need a mount.
EBS is AZ-scoped. Cross-AZ/Region movement is via snapshots.
EFS is regional but reachable only through mount targets in each AZ; forget the mount target in an AZ and instances there cannot mount.
Public bucket exposure. Keep Block Public Access on; disable ACLs.
Glacier restore delay. Deep Archive retrieval is hours; do not archive data you might need immediately unless using Instant Retrieval.
EBS performance sizing. gp3 baseline may be too low; provision IOPS/throughput and ensure the instance is not the bottleneck.
Snapshot cost growth. Incremental but accumulates; use lifecycle/DLM to age them out.
Cross-Region replication and cross-AZ transfer costs. Both are per-GB and add up on chatty or large flows.

6. Database Services Deep Dive

AWS offers a purpose-built database for almost every workload. The hard part is not any single service; it is choosing correctly and understanding exactly what you give up in control when you move from self-managed to managed. This section is written with DBAs in mind.

TL;DR

Default to managed (RDS/Aurora) and drop to self-managed on EC2 only when a hard requirement forces it (OS/root access, unsupported feature, RAC, specific patch control). Use Aurora for MySQL/PostgreSQL that needs scale/performance, RDS for standard relational engines including Oracle and SQL Server, DynamoDB for high-scale key-value, Redshift for warehousing, ElastiCache for caching. Multi-AZ = HA; read replicas = read scale; cross-Region replicas = DR/locality. With managed, AWS owns the OS and HA plumbing; you still own schema, tuning, query performance, and cost.

The database portfolio

Service	Type	Use
RDS (Oracle, PostgreSQL, MySQL, MariaDB, SQL Server)	Managed relational	Standard OLTP/apps needing a specific engine without managing the OS.
Aurora (MySQL/PostgreSQL-compatible)	Managed relational, cloud-native storage	Higher performance, faster failover, up to 15 replicas, Serverless v2 autoscaling.
DynamoDB	Serverless NoSQL key-value/document	Massive scale, single-digit-ms latency, serverless apps.
ElastiCache (Redis/Valkey/Memcached)	Managed in-memory cache	Caching, sessions, rate limiting.
MemoryDB	Durable in-memory (Redis/Valkey)	Redis speed as a primary, durable datastore.
Redshift	Columnar data warehouse	Analytics/BI at scale; querying the S3 lake (Spectrum).
DocumentDB	MongoDB-compatible document DB	MongoDB-API workloads, managed.
Neptune	Graph DB	Highly connected data (fraud, knowledge graphs).
Keyspaces	Cassandra-compatible	CQL workloads, serverless.
Timestream	Time-series DB	IoT/metrics/telemetry with time-based retention.

When to use what (decision table)

Workload	Recommended	Reason	HA	DR	You manage
Oracle DB (managed, standard features)	RDS for Oracle	Managed Oracle without OS work.	Multi-AZ	Cross-Region read replica / snapshots	Schema, tuning, params (within limits), licensing choice.
Oracle DB (RAC, full control, unsupported features)	EC2 self-managed (or Oracle DB@AWS where available)	RDS cannot do RAC / full SYSDBA / every option.	Data Guard / RAC (you build)	Data Guard standby	Everything: OS, DB, HA, backup, patching.
PostgreSQL app backend	Aurora PostgreSQL (or RDS PostgreSQL)	Performance + read scaling; RDS if you want vanilla PG.	Multi-AZ / Aurora replicas	Aurora Global Database / cross-Region replica	Schema, tuning, queries.
MySQL web app	Aurora MySQL (or RDS MySQL)	Scale and failover; RDS for simplicity/cost.	Multi-AZ / Aurora replicas	Global Database / replica	Schema, tuning.
SQL Server app	RDS for SQL Server	Managed SQL Server.	Multi-AZ	Cross-Region snapshots/replica	Schema, tuning, licensing.
High-scale key-value	DynamoDB	Serverless, any scale, low latency.	Built-in multi-AZ	Global Tables (multi-Region active-active)	Data model / access patterns.
Enterprise data warehouse	Redshift	Columnar MPP for analytics.	Multi-AZ (RA3/Serverless)	Snapshots to another Region	Modeling, dist/sort keys, queries.
Reporting / ad hoc over lake	Athena (+ Redshift Spectrum)	Serverless SQL over S3.	Managed	Data in S3 (durable/replicable)	Partitioning, formats, queries.
Cache layer	ElastiCache	Reduce DB load, sub-ms reads.	Multi-AZ replication	Rebuildable / snapshots	Cache strategy, eviction.
Graph workload	Neptune	Native graph queries.	Multi-AZ	Snapshots / cross-Region	Graph model.
Document workload	DocumentDB (or DynamoDB)	MongoDB API, managed.	Multi-AZ	Snapshots / global cluster	Data model.
Time-series	Timestream	Purpose-built, tiered retention.	Managed	Managed	Schema, retention tiers.

HA, DR, scaling, backup, patching across services

Dimension	RDS	Aurora	DynamoDB	Redshift
HA	Multi-AZ (synchronous standby, automatic failover)	Storage replicated across 3 AZs; replicas promote fast	Automatic across AZs	Multi-AZ (RA3/Serverless), cluster relocation
Read scale	Up to 15 read replicas (async; lower for some engines)	Up to 15 low-lag replicas sharing storage	Horizontal by design; DAX for caching	Concurrency scaling
DR	Cross-Region read replica, snapshot copy	Aurora Global Database (sub-second lag, fast promote)	Global Tables (active-active)	Cross-Region snapshots
Backup	Automated backups + manual snapshots; PITR	Continuous to S3; PITR; backtrack (MySQL)	PITR + on-demand backups	Automated + manual snapshots
Patching	You pick maintenance windows; AWS patches OS+engine	Managed, minimal downtime; Blue/Green deployments	Fully managed (serverless)	Managed maintenance windows
Scale compute	Change instance class (brief downtime unless Multi-AZ)	Change class or Serverless v2 (ACUs autoscale)	On-demand or provisioned capacity	Resize / Serverless RPUs

Architect note: Multi-AZ is not read scaling

An RDS Multi-AZ standby is for availability; in the classic single-standby deployment it is not readable. For read scale you add read replicas. Multi-AZ DB clusters (one writer plus two readable standbys) do serve reads from the standbys, but that is a different deployment type. Do not size for read load by "turning on Multi-AZ"; add replicas and offload reads deliberately.

Operational features you should use

Performance Insights: database load by wait event and top SQL; the first place to look for RDS/Aurora performance issues.
Enhanced Monitoring: OS-level metrics at high resolution.
RDS Proxy: connection pooling in front of RDS/Aurora; smooths connection storms (important for Lambda and many-client apps), preserves connections across failover.
Blue/Green deployments: create a synchronized green copy, test, then switch with minimal downtime (great for major-version upgrades).
Parameter groups / option groups: engine configuration and features; the managed equivalent of tuning init.ora and installing options, within AWS-permitted bounds.
Secrets Manager + KMS: managed credential rotation and encryption at rest.

Enterprise examples

Scenario	Design
Oracle DB on RDS	RDS for Oracle, Multi-AZ, automated backups + cross-Region snapshot copy, Performance Insights, License Included or BYOL, params via parameter/option groups. Accept RDS feature limits.
Oracle DB self-managed on EC2	Memory-optimized EC2, io2 Block Express, Data Guard standby in another AZ/Region, RMAN to S3, you own patching/HA. Choose this for RAC or full control.
PostgreSQL app database	Aurora PostgreSQL, writer + 2 readers across AZs, RDS Proxy, Global Database for DR.
MySQL web app	Aurora MySQL or RDS MySQL Multi-AZ, read replicas for reporting, ElastiCache in front for hot reads.
High-scale key-value	DynamoDB on-demand, single-table design, DAX cache, Global Tables if multi-Region.
Data warehouse	Redshift RA3/Serverless, Spectrum over the S3 lake, snapshots cross-Region, QuickSight for BI.
Reporting database	Read replica of the OLTP DB, or Athena/Redshift over exported data, to isolate reporting load.
Cache layer	ElastiCache (Redis/Valkey) cluster-mode, Multi-AZ, with a defined cache-aside strategy.
Graph / Document / Time-series	Neptune / DocumentDB / Timestream respectively, Multi-AZ, snapshots.

AWS database gotchas for Oracle DBAs

This is the section most likely to catch an experienced Oracle DBA off guard. RDS is genuinely managed, which means AWS takes away control you are used to having.

RDS for Oracle: what you lose vs self-managed

No SYSDBA / no OS access. You get a master user with elevated (but not full DBA) privileges. Many operations are done through Amazon rdsadmin packages, not native commands. No SSH to the host, no direct filesystem access.
No RAC. RDS for Oracle is single-instance with Multi-AZ standby (not RAC). Need RAC, go EC2 or Oracle Database@AWS where available.
Option/parameter group restrictions. Only AWS-supported options/parameters can be changed; some are fixed. Features like certain audit/replication options are enabled through option groups, not manually.
Storage scaling behaviour. Storage auto-scaling and modifications happen online but within RDS rules; you do not manage ASM/diskgroups. There are limits and cooldowns on scaling.
Licensing choices. License Included (AWS provides the license, SE2 only) vs BYOL (your license, EE possible). This is a real architectural decision with cost and edition implications.
Backup/restore model. RMAN is largely replaced by RDS automated backups, snapshots, and PITR. You cannot run arbitrary RMAN commands; Data Pump export/import is supported through the S3 integration option.
Multi-AZ behaviour. The standby is not open/readable (unlike an Active Data Guard standby you might run yourself). Failover is automatic but you do not control it like a manual Data Guard switchover.
Performance troubleshooting differs. Use Performance Insights and CloudWatch instead of full access to AWR/ASH the way you would on-prem (AWR is available with the right license/option, but access patterns differ).
When EC2 is required instead: RAC, Data Guard with full control, unsupported options/features, specific patch timing, or third-party tools needing OS access.
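The "when EC2 is required" rule above can be sketched as a small per-workload check. This is a minimal illustration, not an AWS API: the OracleWorkload fields and the oracle_platform helper are hypothetical names that mirror the list items.

```python
from dataclasses import dataclass


@dataclass
class OracleWorkload:
    """Hypothetical requirement flags, one per EC2 trigger named in the list above."""
    needs_rac: bool = False
    needs_full_data_guard_control: bool = False
    needs_unsupported_option: bool = False
    needs_specific_patch_timing: bool = False
    needs_os_access: bool = False


def oracle_platform(workload: OracleWorkload) -> tuple[str, list[str]]:
    """Return the suggested platform and the flags that forced the choice."""
    reasons = [name for name, value in vars(workload).items() if value]
    platform = "EC2 self-managed" if reasons else "RDS for Oracle"
    return platform, reasons


# A workload with no special requirements defaults to the managed option.
print(oracle_platform(OracleWorkload()))
print(oracle_platform(OracleWorkload(needs_rac=True, needs_os_access=True)))
```

Recording the reasons alongside the answer keeps the decision auditable per workload rather than by blanket policy.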

Architect note: the managed vs self-managed trade

Managed removes toil (OS patching, backup plumbing, HA failover) at the cost of control and some features. For most standard databases that trade is worth it. For a small number of workloads (RAC, deep customization, strict change-control) the control of EC2 self-managed is worth the operational burden. Decide per workload, not by blanket policy.

Migration

DMS (Database Migration Service): homogeneous and heterogeneous migration with CDC for near-zero-downtime cutover.
Schema Conversion Tool (SCT): converts schema/code for heterogeneous engine changes (for example Oracle to PostgreSQL/Aurora).
For Oracle-to-Oracle lift-and-shift, consider native tooling (Data Pump, RMAN, Data Guard) plus DMS for the CDC phase. Always run data validation. See the Migration & DR tab.

7. Load Balancing and Traffic Management

Distributing traffic to healthy targets, terminating TLS, routing by content, and steering users globally. Pick the right load balancer for the protocol and the requirement; most "it works locally but not through the LB" issues are health checks, security groups, or listener rules.

TL;DR

ALB for HTTP/HTTPS with host/path routing. NLB for TCP/UDP, extreme performance, static IPs, or PrivateLink. GWLB to insert inline security appliances. Use ACM for free managed TLS certs. Steer globally with Route 53 (DNS-based) or Global Accelerator (anycast over the backbone), and cache web content with CloudFront. Health checks must hit a real application path.

Load balancer types

Type	Layer	Use	Key features
Application LB (ALB)	L7 (HTTP/HTTPS)	Web apps, microservices, container services.	Host/path routing, listener rules, WebSocket, redirects, auth (OIDC/Cognito), targets = instances/IPs/Lambda/containers.
Network LB (NLB)	L4 (TCP/UDP/TLS)	Extreme throughput/low latency, non-HTTP, static IP, source-IP preservation, PrivateLink.	Millions of req/s, ultra-low latency, static/Elastic IP per AZ, TLS passthrough or termination.
Gateway LB (GWLB)	L3/4 (GENEVE)	Insert third-party virtual appliances (firewalls, IDS/IPS) transparently inline.	Fleet of appliances behind one entry point; used for centralized inspection.
Classic LB (CLB, legacy)	L4/L7	Do not use for new work.	Superseded by ALB/NLB; migrate off it.

Core concepts

Concept	Meaning
Target group	The set of backends (instances/IPs/Lambda/containers) an LB routes to, with its own health check and protocol/port.
Listener	A port/protocol the LB accepts on (for example HTTPS:443).
Listener rule (ALB)	Conditions (host, path, header, method) that route to a target group. Evaluated by priority.
Health check	Periodic probe; only healthy targets receive traffic. Must succeed for an instance to serve.
TLS/ACM	ACM provisions and auto-renews free public certs for ALB/NLB/CloudFront/API GW. Terminate TLS at the LB (or passthrough on NLB).
Sticky sessions	Bind a client to one target (cookie-based on ALB). Use sparingly; prefer stateless apps with external session stores.
Cross-zone LB	Distributes evenly across targets in all AZs. On by default for ALB; optional (and may incur cross-AZ charges) for NLB.
Internal vs internet-facing	Internal LBs have private IPs (internal tiers/services); internet-facing have public IPs in public subnets.

Global traffic management

Service	Use	Not for
Route 53	DNS-based routing: latency, weighted, failover, geolocation, geoproximity, multivalue. DR failover across Regions.	Instant failover (bound by DNS TTL/caching).
Global Accelerator	Anycast static IPs routing over the AWS backbone to the nearest healthy Regional endpoint; fast failover; TCP/UDP.	Content caching (no cache).
CloudFront	CDN caching web content/APIs at the edge; TLS, WAF, OAC to private origins; Functions/Lambda@Edge.	Non-HTTP TCP/UDP transport.

Architect note: which global tool

Cacheable web content -> CloudFront. Non-HTTP or you need fixed anycast entry IPs and fast regional failover -> Global Accelerator. DNS-level steering, weighted rollouts, or cross-Region DR failover -> Route 53. They compose: Route 53 -> CloudFront -> ALB -> targets is a very common stack.

Choosing quickly

HTTP(S) app with routing rules, containers, or Lambda targets -> ALB.
Millions of connections, gaming/IoT/TCP, static IP, source IP preservation, or exposing a service via PrivateLink -> NLB.
Inline third-party firewall/IDS for all traffic -> GWLB.
Global static web assets -> CloudFront. Global low-latency TCP/UDP -> Global Accelerator. Cross-Region DR -> Route 53 failover.
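The regional load balancer rules above can be written as a first-match decision function. A rough sketch only; the function and parameter names are hypothetical, and real selection usually weighs more factors than these flags.

```python
def choose_load_balancer(
    protocol: str,
    *,
    inline_appliance: bool = False,
    needs_static_ip: bool = False,
    preserve_source_ip: bool = False,
    needs_privatelink: bool = False,
) -> str:
    """Map the quick-choice rules to a load balancer type (first match wins)."""
    if inline_appliance:
        return "GWLB"  # transparent third-party firewall/IDS insertion
    if needs_static_ip or preserve_source_ip or needs_privatelink:
        return "NLB"   # these requirements override the HTTP default
    if protocol.upper() in {"HTTP", "HTTPS"}:
        return "ALB"   # routing rules, containers, Lambda targets
    return "NLB"       # any other TCP/UDP workload


print(choose_load_balancer("https"))
print(choose_load_balancer("tcp"))
print(choose_load_balancer("https", needs_privatelink=True))
```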

Troubleshooting: target unhealthy and related

Target shows unhealthy elb

Walk these in order:

Security groups: the target's SG must allow the health-check port from the load balancer's SG (reference the LB SG, not an IP). This is the number-one cause.
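Health check path/port: the path must return a status code in the target group's success matcher, which is HTTP 200 by default (a 301/302 redirect fails the check unless you widen the matcher). Confirm the app listens on the configured target port.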
App actually up: curl the health path locally on the instance/container.
Protocol/port match: target-group protocol (HTTP vs HTTPS) and port match what the app serves.
NACL: subnet NACL allows the health-check traffic and its return (ephemeral ports).
Subnet/AZ: the LB is enabled in the AZ where the target lives; the target is in a reachable subnet.
Timeouts/thresholds: a slow app may need a longer health-check timeout or a lighter check path.
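To see why a redirect fails the check, here is a minimal sketch of how a success-code matcher such as "200", "200,202", or "200-399" could be evaluated. The helper names are hypothetical and this only imitates the matcher syntax; it is not the load balancer's own code.

```python
def parse_matcher(matcher: str) -> set[int]:
    """Expand a matcher string like '200,202' or '200-399' into a set of codes."""
    codes: set[int] = set()
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def health_check_passes(status_code: int, matcher: str = "200") -> bool:
    """True when the probe's status code is inside the configured matcher."""
    return status_code in parse_matcher(matcher)


print(health_check_passes(200))             # default matcher accepts 200
print(health_check_passes(302))             # a redirect is not a success
print(health_check_passes(302, "200-399"))  # unless the matcher is widened
```

Widening the matcher hides a redirect rather than fixing it; pointing the check at a path that answers 200 directly is usually the cleaner fix.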

TLS / SSL certificate issue tls

Certificate not valid/mismatched: confirm the ACM cert covers the requested hostname (SAN), is in the same Region as the LB (CloudFront needs the cert in us-east-1), and is attached to the HTTPS listener. For DNS-validated ACM certs, the validation CNAME must exist. Check the listener's security policy for protocol/cipher compatibility. Backend re-encryption needs the target to present a trusted cert if using HTTPS target-group protocol.

Wrong listener rule routing alb

ALB rules evaluate in priority order, lowest number first, and the first match wins; a broad catch-all with a low priority number can shadow a more specific rule below it. Verify host/path conditions and priorities. The default rule fires when nothing matches. Test with explicit Host headers.
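The shadowing problem is easy to reproduce in a few lines. This is a simplified model, assuming path-only conditions and using Python's fnmatch as a stand-in for ALB wildcard matching (fnmatch also treats [..] specially, which ALB does not); the rule dictionaries and target names are made up for illustration.

```python
from fnmatch import fnmatchcase


def route(path: str, rules: list[dict], default: str = "default-target-group") -> str:
    """Return the target of the first matching rule, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if fnmatchcase(path, rule["path"]):
            return rule["target"]
    return default  # the default rule fires when nothing matches


# Broken: the catch-all has the lowest number, so it wins for every path.
shadowed = [
    {"priority": 10, "path": "/*", "target": "web"},
    {"priority": 20, "path": "/api/*", "target": "api"},
]

# Fixed: the specific rule is evaluated before the catch-all.
fixed = [
    {"priority": 10, "path": "/api/*", "target": "api"},
    {"priority": 20, "path": "/*", "target": "web"},
]

print(route("/api/users", shadowed))  # web (the specific rule never fires)
print(route("/api/users", fixed))     # api
```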

Security group / NACL blocking traffic firewall

Client -> LB: the LB SG must allow inbound from clients (or CloudFront/GA). LB -> target: the target SG must allow the LB SG on the app/health port. NACLs on both subnets must allow the traffic and the ephemeral return ports. Use VPC Flow Logs / Reachability Analyzer to find the blocking layer.

Route table issue routing

An internet-facing LB needs public subnets with an IGW route; targets in private subnets need to be reachable from the LB subnets (same VPC routing). For internal LBs, ensure clients have a route to the LB subnets.

Application not responding on backend port app

The app is bound to localhost/127.0.0.1 instead of 0.0.0.0, or on a different port than the target group expects, or crashed. SSH-free check via Session Manager: confirm the listener socket (ss -ltnp) and hit the health path locally.

Wrong target group protocol / health check path config

HTTP target group pointing at an HTTPS-only app (or vice versa) fails silently as unhealthy. The health-check path may not exist (404) or require auth. Use a dedicated lightweight /healthz endpoint that returns 200 without dependencies.
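A dependency-free health endpoint can be very small. The sketch below uses only the Python standard library; the port, the "ok" body, and the helper names are assumptions for illustration, and a real service would expose the same path from its own web framework.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def health_response(path: str) -> tuple[int, bytes]:
    """Answer /healthz with 200 and no downstream calls; everything else is 404."""
    if path == "/healthz":
        return 200, b"ok"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe traffic out of the application log


def make_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Bind to all interfaces so the load balancer can reach the port."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# To run it: make_server().serve_forever()
```

Binding to 0.0.0.0 rather than 127.0.0.1 matters here: a loopback-only listener answers a local curl but never the load balancer's probe.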

8. Security Deep Dive

Security in AWS is layered: the shared responsibility split, identity, network, data protection, detection, and governance. Most incidents trace to a small set of mistakes (public S3, over-broad IAM, open security groups, no detection). Close those first, then build depth.

TL;DR

AWS secures the cloud; you secure what you put in it. Enforce identity least privilege (IAM tab), keep data private by default (Block Public Access, encryption everywhere via KMS), reduce internet exposure (private subnets, endpoints, WAF/Shield on what must be public), and turn on detection org-wide (CloudTrail, Config, GuardDuty, Security Hub) with logs centralized in a locked log-archive account.

The shared responsibility model

AWS is responsible for (security OF the cloud)	You are responsible for (security IN the cloud)
Physical data centers, hardware, the hypervisor/Nitro, the global network, and managed-service control planes.	IAM and credentials, network configuration (SG/NACL/routing), OS patching on EC2, data classification and encryption choices, application security, and configuration of every service.

Common mistake

Assuming "managed" means "secured for me." AWS patches the RDS host, but you still control who can reach it, whether it is public, how it is encrypted, and who has credentials. The line moves per service (AWS takes on more for serverless and managed services, less for EC2), but configuration and data are always yours.

The security service map

Category	Services	Purpose
Identity	IAM, Identity Center, Cognito, Organizations/SCPs	Who can do what.
Network	Security Groups, NACLs, Network Firewall, WAF, Shield, Firewall Manager	What can reach what.
Data protection	KMS, CloudHSM, Secrets Manager, SSM Parameter Store, ACM, S3/EBS/RDS encryption	Keys, secrets, encryption, certs.
Detection	CloudTrail, Config, GuardDuty, Security Hub, Inspector, Detective, Macie	Audit, posture, threats, vulnerabilities, sensitive-data discovery.
Governance/compliance	Control Tower, Audit Manager, Artifact, Config rules, Access Analyzer	Guardrails, evidence, compliance reports.

Data protection

KMS: customer-managed keys with key policies + IAM; integrated with S3, EBS, RDS, secrets, and more. Enable encryption-by-default account-wide. Deleting a key makes everything encrypted under it permanently unrecoverable, and disabling one blocks access until it is re-enabled, so manage key lifecycle deliberately; use multi-Region keys only when needed.
CloudHSM: dedicated single-tenant HSM for strict compliance or custom cryptographic needs.
Secrets Manager vs SSM Parameter Store: Secrets Manager for secrets needing rotation (DB creds); Parameter Store SecureString for simpler/cheaper config and secrets without rotation.
ACM: managed TLS certificates, auto-renewed, for LBs/CloudFront/API GW.
Encrypt in transit (TLS) and at rest (KMS) everywhere; make it the default, not a per-resource decision.

Security note: stop storing secrets in code

No credentials in source, container images, user-data, or environment variables baked into AMIs. Use Secrets Manager/Parameter Store fetched at runtime via an instance/task role. Scan repos and images for leaked secrets; rotate anything that ever touched a repo.
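As a taste of what repo scanning looks for, here is a heuristic sketch that flags strings shaped like AWS access key IDs. The regex is an assumption based on the common AKIA/ASIA prefix plus 16 uppercase alphanumerics; dedicated secret scanners cover many more formats, and a match here is a lead to investigate, not proof.

```python
import re

# Heuristic: long-term keys start with AKIA, temporary ones with ASIA.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_access_key_ids(text: str) -> list[str]:
    """Return every substring that looks like an AWS access key ID."""
    return ACCESS_KEY_ID.findall(text)


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'
print(find_access_key_ids(sample))
```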

Detection and response

Service	What it catches / does
CloudTrail	Management API calls by default (who/what/when/where); data events such as S3 object access are opt-in. The audit backbone; enable an org trail to a locked central bucket.
AWS Config	Resource configuration history + compliance rules + drift + auto-remediation.
GuardDuty	Threat detection from logs (compromised keys, crypto-mining, recon, malware, anomalous behaviour).
Security Hub	Aggregates and scores findings against standards (CIS, AWS FSBP); single pane across accounts.
Inspector	Continuous vulnerability scanning of EC2, ECR images, and Lambda.
Detective	Investigate/visualize the scope of a finding (entity behaviour over time).
Macie	Discovers sensitive data (PII) in S3.

Operations note: wire detection to action

Route GuardDuty/Security Hub findings through EventBridge to automated response (Lambda/SSM Automation) and to your ticketing/SIEM. Detection without response just fills a dashboard. Delegate a security-tooling account as the org admin for these services so coverage is org-wide, not per-account opt-in.
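The routing step starts with an EventBridge rule pattern. A minimal sketch of a pattern that selects high-severity GuardDuty findings; the severity cut-off of 7 and the function name are assumptions to adjust for your own triage policy.

```python
import json


def guardduty_high_severity_pattern(min_severity: float = 7.0) -> dict:
    """Build an EventBridge event pattern for GuardDuty findings at or above a severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


# The JSON below is what you would attach to the rule (console, CLI, or IaC).
print(json.dumps(guardduty_high_severity_pattern(), indent=2))
```

The rule's targets (a Lambda function, an SSM Automation runbook, a ticketing integration) are configured separately from the pattern.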

How to secure specific things

Secure a production AWS account how-to

Root locked (hardware MFA, no keys); humans via Identity Center with MFA; least-privilege roles; CloudTrail + Config + GuardDuty on; encryption-by-default; Block Public Access; SGs least-open; logs shipped to the log-archive account; budget + billing alarms; Security Hub scoring.

Secure a multi-account environment how-to

Organizations + SCP guardrails (deny disabling CloudTrail/GuardDuty, deny leaving org, restrict Regions); Control Tower landing zone; delegated security admin; centralized logging (org trail) to a dedicated Log Archive account with S3 Object Lock; Firewall Manager for org-wide WAF/SG policy; Access Analyzer for external-sharing detection.
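The guardrails named above could look roughly like this as an SCP document. A sketch under stated assumptions: the allowed Regions, the list of exempted global services, and the statement IDs are illustrative, and a production SCP usually also exempts a break-glass or platform role.

```python
import json


def guardrail_scp(allowed_regions: list[str]) -> dict:
    """Build an SCP that protects detection services, org membership, and Region use."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProtectDetection",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "guardduty:DeleteDetector",
                ],
                "Resource": "*",
            },
            {
                "Sid": "DenyLeavingOrg",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "RestrictRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working (illustrative list).
                "NotAction": ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:RequestedRegion": allowed_regions}},
            },
        ],
    }


print(json.dumps(guardrail_scp(["eu-west-1", "eu-central-1"]), indent=2))
```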

Secure S3 how-to

Block Public Access on (account + bucket); disable ACLs (bucket-owner-enforced); bucket policy least-privilege with aws:PrincipalOrgID/aws:SourceVpce conditions; default encryption (SSE-KMS); versioning + lifecycle; access logging; Object Lock for WORM; Macie for PII discovery. Serve public content via CloudFront + OAC, never a public bucket.
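A bucket policy that applies the aws:PrincipalOrgID condition might be assembled like this. A minimal sketch: the bucket name and organization ID are placeholders, the TLS-only statement is an extra commonly paired with it, and the AWS-service exemption is an assumption so that service principals such as log delivery are not blocked.

```python
import json


def org_only_bucket_policy(bucket: str, org_id: str) -> dict:
    """Deny access from outside the organization and over plain HTTP."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideOrg",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {
                    "StringNotEqualsIfExists": {"aws:PrincipalOrgID": org_id},
                    "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
                },
            },
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# "example-data-bucket" and "o-exampleorgid" are hypothetical values.
print(json.dumps(org_only_bucket_policy("example-data-bucket", "o-exampleorgid"), indent=2))
```

Deny statements like these narrow access; they do not grant it. Principals inside the organization still need an explicit allow from IAM or the bucket policy.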

Secure EC2 how-to

IMDSv2 enforced; instance role (no static keys); Session Manager instead of open SSH; SGs least-open; private subnets where possible; patched via Patch Manager; Inspector scanning; encrypted EBS; golden hardened AMIs.

Secure RDS how-to

Private subnets only (no public accessibility); SG restricted to app tier; KMS encryption at rest + TLS in transit; credentials in Secrets Manager with rotation; automated backups + snapshots; deletion protection; audit logging to CloudWatch; IAM database authentication where supported.
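Several of these controls can be checked mechanically. The sketch below inspects a dictionary shaped like one entry of the RDS DescribeDBInstances response; the function name and finding strings are hypothetical, and network placement, TLS enforcement, and secret rotation still need separate checks.

```python
def rds_findings(db: dict) -> list[str]:
    """List posture gaps for one DB instance description."""
    findings = []
    if db.get("PubliclyAccessible", False):
        findings.append("publicly accessible")
    if not db.get("StorageEncrypted", False):
        findings.append("storage not encrypted")
    if not db.get("DeletionProtection", False):
        findings.append("deletion protection off")
    if db.get("BackupRetentionPeriod", 0) == 0:
        findings.append("automated backups disabled")
    if not db.get("IAMDatabaseAuthenticationEnabled", False):
        findings.append("IAM database authentication off")
    return findings


hardened = {
    "PubliclyAccessible": False,
    "StorageEncrypted": True,
    "DeletionProtection": True,
    "BackupRetentionPeriod": 14,
    "IAMDatabaseAuthenticationEnabled": True,
}
print(rds_findings(hardened))                      # no findings
print(rds_findings({"PubliclyAccessible": True}))  # everything flagged
```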

Secure a public load balancer how-to

WAF attached (managed OWASP + rate limiting); Shield (Advanced for high-risk); TLS via ACM with a modern security policy; only the LB is public, targets in private subnets; restrict origin access so backends only accept the LB SG; log access to S3.

Reduce public internet exposure how-to

Default resources to private subnets; use VPC endpoints for AWS service traffic; expose only what must be public (behind LB + WAF + CloudFront); no public IPs on app/DB tiers; centralized inspected egress (Network Firewall) if required; audit with Access Analyzer and GuardDuty.

Centralize security logs how-to

Org CloudTrail -> central S3 in the Log Archive account (Object Lock, restricted policy). VPC Flow Logs, Config, GuardDuty, ELB/S3/CloudFront logs aggregated there. Optionally forward to a SIEM or use Security Lake to normalize into OCSF.

Production security checklist

Baseline checklist

Root: hardware MFA, no access keys, used only when required.
Humans via IAM Identity Center with MFA; no per-account IAM users.
SCP guardrails: deny org-leave, deny disabling CloudTrail/GuardDuty/Config, restrict Regions.
Org CloudTrail to a locked central bucket in a dedicated Log Archive account.
AWS Config + conformance packs; GuardDuty + Security Hub org-wide.
Block Public Access on; ACLs disabled; default encryption on S3/EBS/RDS.
Secrets in Secrets Manager/Parameter Store; none in code or images.
Security groups least-open; databases private; IMDSv2 enforced.
WAF + Shield on internet-facing endpoints.
Access Analyzer for external sharing; credential report reviewed; unused access pruned.
Backups (AWS Backup) with cross-account/Region copies and periodic restore tests.
Billing/budget alarms and cost anomaly detection.

Common security mistakes

Avoid these

Weakly protected root account.
Public S3 buckets / disabled Block Public Access.
Over-permissive IAM (AdministratorAccess, wildcards).
Open security groups (0.0.0.0/0 on 22/3389/database ports).
CloudTrail not enabled everywhere / not centralized.
No GuardDuty, no Config.
Secrets in code, images, or environment.
No credential rotation; long-lived keys everywhere.
Routing private AWS traffic over the internet instead of VPC endpoints.
Logs stored in the same account (and thus tamper-reachable) as the workloads.
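The open-security-group mistake is one of the easier ones to detect automatically. A minimal sketch that inspects one inbound rule shaped like an IpPermissions entry from EC2 DescribeSecurityGroups; the port table and function name are assumptions for illustration.

```python
# Ports called out in the list above, plus common database listeners.
SENSITIVE_PORTS = {
    22: "SSH", 3389: "RDP", 1433: "SQL Server",
    1521: "Oracle", 3306: "MySQL", 5432: "PostgreSQL",
}


def open_sensitive_ports(permission: dict) -> list[int]:
    """Return sensitive ports that this inbound rule exposes to the whole internet."""
    world_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in permission.get("IpRanges", []))
    world_v6 = any(r.get("CidrIpv6") == "::/0" for r in permission.get("Ipv6Ranges", []))
    if not (world_v4 or world_v6):
        return []
    if permission.get("IpProtocol") == "-1":  # "-1" means all protocols and ports
        return sorted(SENSITIVE_PORTS)
    low = permission.get("FromPort", 0)
    high = permission.get("ToPort", 65535)
    return sorted(port for port in SENSITIVE_PORTS if low <= port <= high)


ssh_open = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
print(open_sensitive_ports(ssh_open))  # [22]
```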

9. Observability, Monitoring, and Operations

You cannot operate what you cannot see. CloudWatch is the hub for metrics, logs, and alarms; CloudTrail and Config cover audit and configuration; Systems Manager covers fleet operations. The goal is useful signal, not noise.

TL;DR

Metrics + alarms + logs live in CloudWatch (install the agent for memory/disk, which are not default). Audit is CloudTrail; config/compliance is Config; events/automation route through EventBridge; fleet ops through Systems Manager. Set log retention (default is forever), alarm on symptoms not every metric, and centralize logs across accounts.

The observability stack

Service	Role
CloudWatch Metrics	Time-series metrics from AWS services and custom sources; alarms and math on them.
CloudWatch Logs	Central log ingestion, search (Logs Insights), retention, metric filters, subscriptions.
CloudWatch Alarms	Threshold/anomaly alarms -> SNS/EventBridge/Auto Scaling/OpsItems.
CloudWatch Dashboards	Operational views combining metrics, logs, and alarms.
CloudWatch Agent	Collects OS metrics (memory, disk) and logs from EC2/on-prem. Not installed by default.
EventBridge	Event bus + Scheduler + Pipes; routes state changes to targets for automation.
CloudTrail	API audit trail.
AWS Config	Resource config history and compliance.
Systems Manager	Inventory, Patch Manager, Session Manager, Automation, OpsCenter.
Health Dashboard	AWS service health + your account's Personal Health Dashboard (events affecting your resources).
X-Ray / OpenTelemetry	Distributed tracing across microservices.
Managed Prometheus / Managed Grafana	Managed metrics store and dashboards for container/Prometheus ecosystems.
DevOps Guru	ML-based anomaly detection and operational insights.

What to monitor, by resource

Resource	Watch
EC2	CPUUtilization, StatusCheckFailed (system/instance), CPUCreditBalance (T family), plus agent-based mem/disk.
EBS	VolumeReadOps/WriteOps, VolumeQueueLength, BurstBalance (gp2/st1/sc1), throughput vs provisioned.
S3	4xx/5xx errors, request counts, bucket size/object count, replication latency, unusual access (via CloudTrail data events).
RDS/Aurora	CPUUtilization, FreeableMemory, FreeStorageSpace, DatabaseConnections, Read/WriteLatency, ReplicaLag, and Performance Insights DB load.
Load balancers	UnHealthyHostCount, HTTPCode_Target_5XX_Count, TargetResponseTime, RejectedConnectionCount, ActiveConnectionCount.
VPC	NAT Gateway ErrorPortAllocation/BytesOut, VPN TunnelState, Flow Logs REJECTs, endpoint metrics.
Lambda	Errors, Throttles, Duration, ConcurrentExecutions, IteratorAge (streams).
Security	Root login, IAM changes, GuardDuty findings, Config non-compliance, failed console logins.

Designing good alarms

Alarm on symptoms users feel (error rate, latency, unhealthy hosts, saturation) before alarming on causes.
Use appropriate periods, evaluation periods, and M-of-N datapoints to avoid flapping.
Use anomaly detection bands for metrics without a fixed threshold.
Set alarms to actionable destinations; page only what a human must act on now. Everything else is a dashboard or ticket.
Alarm on missing data where absence is bad (a stopped heartbeat).
Composite alarms to reduce noise (alert once when several related alarms fire).
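The M-of-N idea is worth seeing in code, because it is what stops a single spike from paging someone. This is a simplified model only: it assumes a greater-than threshold and no missing datapoints, whereas CloudWatch's own evaluation also applies your treat-missing-data setting.

```python
def alarm_state(datapoints: list[float], threshold: float, m: int, n: int) -> str:
    """Evaluate the last n datapoints; alarm when at least m of them breach."""
    window = datapoints[-n:]
    if len(window) < n:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= m else "OK"


# One spike in five periods does not trip a 3-of-5 alarm...
print(alarm_state([10, 95, 20, 30, 40], threshold=90, m=3, n=5))  # OK
# ...but sustained breaching does.
print(alarm_state([10, 95, 20, 91, 92], threshold=90, m=3, n=5))  # ALARM
```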

Cost note

CloudWatch Logs log groups default to never expire, and log ingestion and storage are a common runaway cost. Set retention per log group, filter noisy logs before ingestion, and use metric filters instead of storing everything long-term. Custom metrics and high-resolution alarms also add up.
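Finding the offenders is straightforward: in the CloudWatch Logs DescribeLogGroups response, a log group that never expires simply has no retentionInDays field. A minimal sketch over that response shape; the function name and sample log groups are hypothetical.

```python
def groups_without_retention(log_groups: list[dict]) -> list[str]:
    """Return names of log groups that have no retention policy set."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


sample_groups = [
    {"logGroupName": "/aws/lambda/orders", "retentionInDays": 30},
    {"logGroupName": "/app/legacy-batch"},  # never expires
]
print(groups_without_retention(sample_groups))
```

Feed the resulting list into your IaC or a scheduled job that applies the retention baseline.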

Example alarms

Alarm	Metric / condition	Action
EC2 CPU high	CPUUtilization > 80% for 5 min	Scale out / investigate
EC2 status check failed	StatusCheckFailed >= 1	Auto-recover / replace
Memory pressure	mem_used_percent > 90% (agent)	Investigate / right-size
Disk usage	disk_used_percent > 85% (agent)	Clean / grow volume
EBS burst balance low	BurstBalance < 20%	Move to gp3/provisioned
EBS queue length	VolumeQueueLength sustained high	Provision IOPS / larger instance
ALB unhealthy targets	UnHealthyHostCount > 0	Investigate targets
ALB 5xx	HTTPCode_Target_5XX_Count rising	App/backend investigation
RDS CPU	CPUUtilization > 80%	Tune queries / scale
RDS storage low	FreeStorageSpace < threshold	Grow storage / enable autoscaling
RDS connections	DatabaseConnections near max	RDS Proxy / pool / investigate
Failed backups	AWS Backup job failed event	Page backup owner
NAT Gateway errors	ErrorPortAllocation > 0	Investigate port exhaustion
VPN tunnel down	TunnelState = 0	Failover / investigate
Direct Connect issue	ConnectionState / BGP down	Failover to VPN
S3 unusual access	CloudTrail data-event anomaly	Security review
Lambda errors/throttles	Errors or Throttles > 0 sustained	Investigate / raise concurrency

Centralizing observability across accounts

Cross-account CloudWatch observability: link source accounts to a central monitoring account for a unified metrics/logs/traces view.
Ship logs to a central account (subscription filters -> Amazon Data Firehose, formerly Kinesis Data Firehose -> central S3, or Security Lake for security data).
Aggregate GuardDuty/Security Hub/Config to delegated admin accounts.
Standardize dashboards and alarms via IaC so every account gets the same baseline.

Operations note

Bake the CloudWatch agent, SSM agent, standard log config, and a baseline alarm set into your golden AMIs and IaC modules. Observability retrofitted per-app is inconsistent; observability built into the platform is uniform and auditable.

10. Containers, Kubernetes, and Cloud Native

Running containers, functions, and event-driven systems on AWS. The main choice is ECS vs EKS vs Lambda, and then EC2 vs Fargate for the compute underneath. Pick by operational appetite and portability needs, not by hype.

TL;DR

ECS for AWS-native containers with low operational overhead. EKS when you need standard Kubernetes (skills, tooling, portability, multi-cloud). Fargate to avoid managing nodes; EC2 nodes for cost at high steady utilization or special hardware. Lambda for event-driven/short work. Give containers permissions with task roles (ECS) or IRSA/Pod Identity (EKS), never node/instance roles.

The choices

Option	Use when	Trade-off
ECS	You want containers with minimal ops and deep AWS integration; no Kubernetes needed.	AWS-specific; less portable than k8s.
EKS	You need Kubernetes: existing k8s skills/manifests, Helm, operators, portability.	More operational overhead (upgrades, add-ons, CNI IP management, control-plane cost).
Fargate	You do not want to manage nodes; bursty or low-ops workloads.	Higher per-vCPU cost; some features need EC2 launch type.
EC2 launch type / node groups	High steady utilization, GPUs, custom kernels, or per-vCPU cost matters.	You patch/scale/secure the nodes.
Lambda	Event-driven, short (up to 15 min per invocation), spiky, scale-to-zero.	Time/memory limits, cold starts, not for long-running services.
App Runner	Simplest path: container/source to a managed HTTPS service with autoscaling.	Less control; good for straightforward web apps/APIs.

Architect note: ECS vs EKS honestly

If Kubernetes is not a requirement, ECS is less to operate and integrates natively (ALB, IAM task roles, CloudWatch, Service Connect). Choose EKS when you genuinely need the Kubernetes ecosystem or portability, and staff for the ongoing operational cost (version upgrades, add-on management, VPC CNI IP planning). "We might go multi-cloud someday" is often not enough to justify EKS overhead today.

How networking works with containers

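ECS awsvpc mode gives each task its own ENI, and the EKS VPC CNI gives each pod its own VPC IP (drawn from the node's ENIs). This gives real VPC networking, with security groups per task (and per pod where security groups for pods is enabled), but consumes subnet IPs fast; plan CIDRs accordingly (or use prefix delegation / custom networking on EKS).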
Load balancers: ALB in front of ECS services / the AWS Load Balancer Controller for EKS Ingress; NLB for TCP/UDP or extreme performance.
Service-to-service discovery: ECS Service Connect / Cloud Map; Kubernetes Services + CoreDNS on EKS.
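The subnet IP pressure is easy to quantify before you deploy. A rough sketch: AWS reserves five addresses in every subnet, and the 20 percent headroom default is an assumption to cover nodes, load balancer ENIs, and rolling deployments.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr: str) -> int:
    """Addresses actually assignable in an AWS subnet of this size."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AWS_RESERVED_PER_SUBNET, 0)


def fits(cidr: str, tasks_or_pods: int, headroom_pct: int = 20) -> bool:
    """True when one IP per task/pod plus headroom fits in the subnet."""
    headroom = -(-tasks_or_pods * headroom_pct // 100)  # integer ceiling
    return tasks_or_pods + headroom <= usable_ips(cidr)


print(usable_ips("10.0.1.0/24"))  # 251
print(fits("10.0.1.0/24", 200))   # True: 240 needed, 251 usable
print(fits("10.0.1.0/24", 220))   # False: 264 needed
```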

How IAM works with containers

ECS task role: permissions for the app in the container; the separate execution role lets ECS pull images and write logs. Do not rely on the EC2 instance role for app permissions.
EKS IRSA / EKS Pod Identity: map a Kubernetes service account to an IAM role so pods get scoped, short-lived credentials, not the node role.
Least-privilege per task/pod is the goal; the node/instance role should be minimal.
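What makes a role a task role or a pod role is its trust policy. A minimal sketch of the two trust documents, assuming ECS task roles and EKS Pod Identity (IRSA uses an OIDC federated principal instead and is not shown); the function names are hypothetical.

```python
import json


def ecs_task_trust_policy() -> dict:
    """Trust policy letting ECS tasks assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def eks_pod_identity_trust_policy() -> dict:
    """Trust policy letting EKS Pod Identity assume the role for a pod."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


print(json.dumps(ecs_task_trust_policy(), indent=2))
print(json.dumps(eks_pod_identity_trust_policy(), indent=2))
```

Neither document trusts ec2.amazonaws.com, which is the point: the workload role is separate from the node role.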

Security note

A common container security hole is giving pods/tasks the broad node instance role. Use IRSA/Pod Identity (EKS) and task roles (ECS) so each workload has only what it needs. Also enforce IMDSv2 with a hop limit of 1 so a compromised container cannot reach the node's credentials via the metadata endpoint.

CI/CD and registry

ECR for private images; enable scan-on-push and lifecycle policies to purge old tags.
CodeBuild/CodeDeploy/CodePipeline or GitHub Actions/GitLab for build-test-deploy; deploy with rolling or blue/green (CodeDeploy supports ECS blue/green).
CodeArtifact for private package repositories.

Serverless and event-driven building blocks

Service	Role
Lambda	Functions triggered by events/HTTP.
API Gateway	Managed API front door (REST/HTTP/WebSocket).
EventBridge	Event bus, scheduling, and pipes for decoupling.
Step Functions	Orchestrate multi-step workflows with retries/error handling.
SQS / SNS	Queues (decouple/buffer) and pub/sub (fan-out).
Kinesis / MSK	Streaming ingestion and processing.
Cloud Map	Service discovery.

Architecture examples

Microservices on ECS (Fargate) pattern

ALB -> ECS services (one per microservice) on Fargate across 3 AZs, each with its own task role and target group; Service Connect for east-west traffic; secrets from Secrets Manager; logs to CloudWatch; images in ECR; CodePipeline blue/green deploys. Low ops, native integration.

Microservices on EKS pattern

EKS control plane + managed node groups (or Fargate profiles) across 3 AZs; AWS Load Balancer Controller for ingress (ALB); IRSA/Pod Identity for pod permissions; VPC CNI with prefix delegation for IP density; Karpenter for node autoscaling; GitOps (Argo/Flux) for deploys; Managed Prometheus/Grafana for metrics.

Serverless function on S3 event pattern

S3 PutObject -> EventBridge/S3 notification -> Lambda processes the object (thumbnail, parse, load) -> writes to DynamoDB/RDS or another bucket. DLQ on failure; concurrency limits to protect downstream.
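For the direct S3 notification path, the handler's first job is to pull the bucket and key out of the event. A minimal sketch, assuming the S3 notification record shape (events delivered through EventBridge use a different layout); the processing step is a placeholder and the return value is illustrative.

```python
from urllib.parse import unquote_plus


def objects_from_event(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs; keys arrive URL-encoded in S3 notifications."""
    return [
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
        if "s3" in record
    ]


def handler(event, context):
    """Lambda entry point: process each new object and report what was handled."""
    processed = []
    for bucket, key in objects_from_event(event):
        # Placeholder: thumbnail, parse, or load the object here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}


sample_event = {"Records": [{"s3": {
    "bucket": {"name": "example-uploads"},
    "object": {"key": "uploads/my+photo%281%29.jpg"},
}}]}
print(handler(sample_event, None))
```

Forgetting to decode the key is a classic bug: object names with spaces or parentheses then fail with "key not found".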

Event-driven with EventBridge + Lambda + SNS + SQS pattern

Producers emit events to EventBridge; rules route to Lambda for processing and to SNS for fan-out; SNS -> multiple SQS queues -> worker fleets (Lambda/ECS) each processing independently with retries and DLQs. Decoupled, resilient, independently scalable.

Container deployment pipeline pattern

Commit -> CodeBuild (build+test+scan) -> push image to ECR (scan-on-push) -> CodeDeploy blue/green to ECS (on EKS, use a Kubernetes-native rollout instead, since CodeDeploy does not target EKS) -> automated smoke tests -> shift traffic -> rollback on alarm. IaC (CDK/Terraform) provisions the pipeline itself.

Private container platform pattern

Private subnets only; images from ECR via interface endpoints; no NAT dependency for AWS traffic; internal ALB; PrivateLink to expose specific services to other accounts; centralized logging and Security Hub; SCP guardrails on the platform accounts.

Operating containers

Monitor task/pod health, restarts, CPU/memory limits vs usage, and pending/unschedulable tasks (capacity).
Right-size task/pod requests and limits; over-requesting wastes capacity, under-requesting causes throttling/OOM.
Use Container Insights (CloudWatch) or Prometheus/Grafana for cluster and workload metrics.
On EKS, keep the control plane and node AMIs/add-ons on supported versions; plan quarterly upgrades.

Common mistake

Running EKS "because Kubernetes" for a handful of simple services, then absorbing continuous upgrade and add-on toil. Match the platform to the team. Conversely, forcing everything into Lambda when a long-running service is the natural fit leads to timeout/cold-start contortions.

11. Analytics, Data, and Integration

Storing, moving, transforming, querying, and streaming data at scale, plus the integration services that connect systems. The lake-on-S3 plus purpose-built engines model is the backbone of most modern AWS data platforms.

TL;DR

Land raw data in S3 (the lake), catalog with Glue, govern with Lake Formation, query serverlessly with Athena, warehouse in Redshift, stream with Kinesis/MSK, transform with Glue/EMR, replicate databases with DMS, and visualize with QuickSight. Use columnar formats (Parquet) and partitioning to control cost.

The data and analytics services

Service	Role
S3	The data lake substrate: durable, cheap, decoupled storage.
Glue	Serverless ETL, crawlers, and the Data Catalog (metadata shared by Athena/Redshift/EMR).
Lake Formation	Central, fine-grained (row/column/tag) permissions and cross-account sharing over the lake.
Athena	Serverless SQL (Trino/Presto) over S3; pay per TB scanned.
Redshift	Columnar MPP data warehouse; Serverless and Spectrum (query S3).
EMR	Managed Spark/Hive/Presto/HBase (on EC2/EKS/Serverless) for large-scale processing.
Kinesis	Data Streams (ingest), Firehose (deliver to S3/Redshift/OpenSearch), Managed Service for Apache Flink (analytics).
MSK	Managed Apache Kafka for the Kafka ecosystem.
OpenSearch Service	Search, log analytics, and vector search.
QuickSight	Serverless BI dashboards (SPICE engine, Q natural language).
DMS	Database migration and CDC replication into the lake/warehouse.
AppFlow	No-code SaaS data flows (Salesforce, ServiceNow, etc.) to AWS.
MWAA	Managed Apache Airflow for orchestrating pipelines.
DataZone	Data catalog/governance and a business data marketplace across the org.

Integration services

Service	Use
EventBridge	Event routing, scheduling, SaaS integration.
API Gateway	Publish and secure APIs.
Step Functions	Workflow orchestration.
SQS / SNS	Queues and pub/sub.
Amazon MQ	Managed ActiveMQ/RabbitMQ for lift-and-shift of existing JMS/AMQP apps.

Architect note: EventBridge vs SNS vs SQS vs MQ vs Kinesis

SQS: decouple with a durable queue (pull, at-least-once). SNS: fan-out one message to many (push). EventBridge: content-based routing of events from AWS/SaaS/custom sources with filtering and schema. Kinesis/MSK: ordered, replayable, high-throughput streams for analytics/multiple consumers. Amazon MQ: only when migrating an app that already speaks JMS/AMQP and you cannot refactor. Do not default to MQ for new cloud-native work.

Common data patterns

Data lake on S3 pattern

Raw/curated/consumption zones in S3 (Parquet, partitioned); Glue crawlers/catalog; Lake Formation permissions; Athena for ad hoc SQL; Redshift Spectrum for warehouse joins; lifecycle rules to archive raw. The default analytics foundation.

Data warehouse on Redshift pattern

Load curated data (from Glue/DMS/Firehose) into Redshift RA3/Serverless; model with appropriate distribution/sort keys; Spectrum to query cold data in S3 without loading; QuickSight for BI. Scale with concurrency scaling.

Serverless query with Athena pattern

Point Athena at S3 via the Glue catalog; query with SQL, pay per TB scanned. Partition by date/tenant and store Parquet to cut cost dramatically. Good for log analysis and infrequent/ad hoc analytics.
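
Partitioning by date and tenant usually means a Hive-style key=value layout in the object key, so the engine can skip prefixes that a query's WHERE clause excludes. A minimal sketch of that layout follows; the bucket, table, and partition key names (dt, tenant) are hypothetical.

```python
from datetime import date


def partition_prefix(bucket: str, table: str, tenant: str, day: date) -> str:
    """Build a Hive-style partitioned S3 prefix (key=value path segments)."""
    return f"s3://{bucket}/{table}/dt={day.isoformat()}/tenant={tenant}/"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(partition_prefix("example-lake", "logs", "acme", date(2026, 6, 1)))
```

A query that filters on dt and tenant then reads only the matching prefixes instead of the whole table.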

ETL with Glue pattern

Glue Spark/Python jobs read raw S3, clean/join/aggregate, write curated Parquet; crawlers keep the catalog current; jobs orchestrated by Glue workflows, Step Functions, or MWAA.

Streaming ingestion with Kinesis pattern

Producers -> Kinesis Data Streams -> Firehose -> S3/Redshift/OpenSearch (buffered), and/or Managed Flink for real-time aggregation/alerts. Use on-demand mode to avoid shard sizing, or provision shards for cost control at steady volume.
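
If you provision shards, the sizing is arithmetic on per-shard limits. The sketch below assumes the standard provisioned-mode limits of 1 MB/s or 1,000 records/s written per shard and 2 MB/s read per shard for shared-throughput consumers; verify the current quotas before relying on it. The function name is illustrative.

```python
import math


def shards_needed(write_mb_per_s: float, records_per_s: float,
                  read_mb_per_s: float) -> int:
    """Estimate provisioned shards from peak throughput.

    Assumed per-shard limits: 1 MB/s or 1,000 records/s in, 2 MB/s out
    (shared across consumers that do not use enhanced fan-out).
    """
    by_write_bytes = write_mb_per_s / 1.0
    by_write_records = records_per_s / 1000.0
    by_read = read_mb_per_s / 2.0
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))


if __name__ == "__main__":
    # 5 MB/s in, 2,000 records/s, 10 MB/s total read across consumers.
    print(shards_needed(5, 2000, 10))
```

Size for peak, not average, and add headroom for hot partition keys.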

Kafka with MSK pattern

MSK (provisioned or Serverless) as the event backbone; producers/consumers use the Kafka API; connect to sinks with MSK Connect; use when you need Kafka semantics/ecosystem specifically.

CDC with DMS pattern

DMS replicates ongoing changes from a source database (Oracle/SQL Server/PostgreSQL/MySQL) into S3/Redshift/Kinesis for near-real-time analytics or to keep a target in sync during migration. Monitor CDC latency and validate data.

Cross-account data sharing pattern

Lake Formation cross-account grants (or Redshift data sharing / S3 Access Points) let a data-producer account share governed datasets with consumer accounts without copying data, with row/column controls.

AI-ready data architecture pattern

Curated, governed lake -> feature/embedding pipelines -> vector store (OpenSearch/Aurora pgvector) -> Bedrock/SageMaker consume it. Governance (Lake Formation/DataZone) and lineage matter so AI uses trusted, permissioned data. See the AI/ML tab.

Cost note: the two biggest analytics levers

(1) File format and partitioning: converting JSON/CSV to partitioned Parquet routinely cuts Athena/Redshift Spectrum scan cost by 10-100x. (2) Right engine for the job: do not run a permanent EMR/Redshift cluster for occasional queries when Athena serverless would do, and do not hammer Athena for high-concurrency BI when Redshift/QuickSight-SPICE fits better.
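
The first lever is easy to put numbers on. This is a rough sketch, assuming a flat price per TB scanned; the 5.00 USD/TB default is an illustrative assumption, so check current pricing for your Region. The reduction factor stands for how much less data a partitioned Parquet layout scans than the raw format.

```python
def monthly_scan_cost_usd(queries_per_month: int, tb_per_query: float,
                          reduction_factor: float = 1.0,
                          usd_per_tb: float = 5.0) -> float:
    """Estimate monthly pay-per-scan cost.

    usd_per_tb is an assumed price, not a quoted one.
    reduction_factor=10 means the optimized layout scans 10x less data.
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return queries_per_month * (tb_per_query / reduction_factor) * usd_per_tb


if __name__ == "__main__":
    raw = monthly_scan_cost_usd(1000, 0.5)
    optimized = monthly_scan_cost_usd(1000, 0.5, reduction_factor=10)
    print(f"raw: {raw:.2f} USD, partitioned Parquet: {optimized:.2f} USD")
```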

12. AI, ML, and Generative AI on AWS

From custom ML on SageMaker to generative AI on Bedrock and applied-AI APIs. The enterprise question is rarely "which model" and usually "how do we connect models to our data safely, cheaply, and with an audit trail." This section covers both the services and the governance.

TL;DR

Bedrock for generative AI (managed foundation models, Agents, Knowledge Bases, Guardrails) without running model infrastructure. SageMaker for custom model build/train/deploy and MLOps. Applied-AI APIs (Textract, Comprehend, Rekognition, etc.) for common tasks. For RAG, put embeddings in a vector store (OpenSearch or Aurora/RDS pgvector). Never wire an LLM agent directly to a production OLTP database; put a governed serving layer in front.

Generative AI: Amazon Bedrock

Foundation models from multiple providers (Anthropic Claude, Amazon Nova/Titan, Meta, Mistral, and others) behind one API. Availability varies by Region; verify.
Knowledge Bases: managed RAG (chunk, embed, store, retrieve) over your data in S3, backed by a vector store.
Agents: orchestrate multi-step tasks and tool/API calls with a model.
Guardrails: content filters, denied topics, PII redaction, and grounding checks applied to prompts/responses.
Pricing: per input/output token (on-demand) or provisioned throughput; Knowledge Bases add embedding + vector-store cost.

Custom ML: Amazon SageMaker

Studio (IDE), training jobs, hyperparameter tuning, endpoints (real-time/serverless/async/batch), Pipelines (MLOps), Model Registry, Feature Store, and Clarify (bias/explainability).
Use for training/fine-tuning your own models and productionizing the full ML lifecycle.

Cost note

SageMaker real-time endpoints bill continuously while running, even idle. This is the most common ML cost leak. Use serverless or asynchronous inference for spiky traffic, auto-scale to zero where supported, and shut down notebooks/training you are not using. For occasional batch scoring, use batch transform, not a standing endpoint.

Applied-AI APIs

Service	Task
Textract	Extract text/tables/forms from documents.
Comprehend	NLP: entities, sentiment, PII detection, classification.
Rekognition	Image/video analysis.
Transcribe / Polly	Speech-to-text / text-to-speech.
Translate	Machine translation.
Kendra	Enterprise semantic search (also usable as a RAG retriever).
Lex	Conversational bots (chat/voice).
Forecast / Personalize	Time-series forecasting / recommendations.
Fraud Detector	ML fraud detection.

Vector stores for RAG

Option	Use when
OpenSearch (vector)	Large-scale vector + keyword hybrid search; dedicated retrieval layer.
Aurora / RDS PostgreSQL pgvector	You already run PostgreSQL and want vectors alongside relational data (joinable in SQL).
DocumentDB vector	MongoDB-API apps needing vector search.
Bedrock Knowledge Bases	You want managed RAG and let AWS handle chunk/embed/store/retrieve.

Enterprise GenAI patterns

Chat with documents (RAG) pattern

Docs in S3 -> Bedrock Knowledge Base (chunk + embed) -> vector store (OpenSearch/pgvector) -> retrieval augments the prompt -> Bedrock model answers with citations. Guardrails on I/O; access scoped by IAM so users only retrieve permitted content.

Chat with database / natural language to SQL pattern

The safe pattern: the model generates SQL against a read-only reporting replica or governed views, through a query layer that validates/parameterizes and enforces row/column permissions, with logging. Never let the model run arbitrary DML against production OLTP.

AI assistant for operations / business users pattern

A Bedrock agent with tools that call curated, permissioned APIs (not raw databases). Each tool has least-privilege IAM, input validation, and an audit trail. The agent reasons; the tools enforce policy.

Document processing pipeline pattern

S3 upload -> Textract extracts -> Comprehend/model structures + classifies -> results to DynamoDB/RDS + human-in-the-loop review (A2I) for low-confidence -> downstream systems. Event-driven with Step Functions.

Call-center AI pattern

Transcribe for speech-to-text, Comprehend for sentiment/intent, Bedrock/Lex for responses and summaries, Connect as the contact-center platform. Redact PII with Guardrails/Comprehend before storage.

MLOps pipeline pattern

SageMaker Pipelines: data prep -> train -> evaluate -> register (Model Registry) -> approve -> deploy endpoint -> monitor (Model Monitor for drift) -> retrain trigger. IaC-provisioned, CI/CD-driven.

Governance and safety warnings

Do not connect LLM agents directly to production OLTP

Putting a model or agent in front of your live transactional database with the ability to run generated SQL is a data-integrity and security incident waiting to happen. Interpose a governed serving layer: read-only replicas or curated views, a query catalog of approved parameterized queries, or read-only reporting APIs. The model requests; the layer authorizes, validates, and audits.
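
One way to picture the "query catalog of approved parameterized queries" is below. This is a minimal sketch, not a production serving layer: it uses an in-memory SQLite database as a stand-in for a read-only reporting replica, and the catalog, table, and function names are all hypothetical. The model only chooses a query name and supplies parameters; it never supplies SQL text.

```python
import sqlite3

# Hypothetical catalog of approved, parameterized, read-only queries.
APPROVED_QUERIES = {
    "orders_by_customer":
        "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
}


def run_approved(conn, name, params, audit_log):
    """Authorize by catalog name, bind parameters, and record an audit entry."""
    if name not in APPROVED_QUERIES:
        audit_log.append(("rejected", name, params))
        raise PermissionError(f"query {name!r} is not in the approved catalog")
    audit_log.append(("allowed", name, params))
    return conn.execute(APPROVED_QUERIES[name], params).fetchall()


def demo_connection():
    """Build a small demo database, then switch the connection to read-only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 7, 10.0), (2, 7, 5.5), (3, 8, 99.0)],
    )
    conn.commit()
    conn.execute("PRAGMA query_only = ON")  # stand-in for a read-only replica
    return conn


if __name__ == "__main__":
    log = []
    print(run_approved(demo_connection(), "orders_by_customer", (7,), log))
    print(log)
```

Row and column permissions would sit in the same layer, typically as views or per-role catalogs.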

GenAI enterprise guardrails

Avoid uncontrolled dynamic SQL; use parameterized, allow-listed queries or a serving layer.
Protect credentials: models/agents get access via least-privilege IAM roles, never embedded keys.
Add auditability: log prompts, retrievals, tool calls, and responses.
Use curated datasets, APIs, or read-only reporting layers, not raw production stores.
Validate AI output before business use; keep a human in the loop for consequential actions.
Monitor for prompt injection and data leakage; apply Guardrails and content filtering.
Enforce data-access boundaries so retrieval respects the requesting user's permissions (no cross-tenant leakage).
Watch cost: token usage and standing endpoints can grow fast; set budgets and quotas.
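
The data-access boundary is the guardrail most often skipped, so here is a minimal sketch of it: filter retrieved chunks by the requesting user's tenant and groups before anything reaches the prompt. The Chunk shape and field names are hypothetical; a real vector store would apply the same filter as metadata conditions at query time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant: str
    allowed_groups: frozenset


def permitted(chunks, user_tenant, user_groups):
    """Keep only chunks in the user's tenant that share at least one group."""
    groups = set(user_groups)
    return [
        c for c in chunks
        if c.tenant == user_tenant and c.allowed_groups & groups
    ]


if __name__ == "__main__":
    retrieved = [
        Chunk("HR policy", "tenant-a", frozenset({"hr"})),
        Chunk("Other tenant's policy", "tenant-b", frozenset({"hr"})),
        Chunk("Finance forecast", "tenant-a", frozenset({"finance"})),
    ]
    print([c.text for c in permitted(retrieved, "tenant-a", ["hr"])])
```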

DBA note

When "chat with our data" lands on your desk, steer it to a reporting layer. Stand up a read replica or materialized reporting views, expose them through a controlled query interface, and keep the model out of the primary. This protects the OLTP workload's performance and integrity while still delivering the capability. This is the same discipline as any reporting-offload decision, applied to AI.

13. Migration and Disaster Recovery

Getting workloads into AWS, and keeping them recoverable once there. Migration and DR share tooling (replication, backup, failover) and the same core discipline: define your objectives (RTO/RPO) first, then pick the pattern that meets them at acceptable cost.

TL;DR

For migration, use MGN (rehost servers), DMS + SCT (databases), DataSync/Transfer/Snow (data). For DR, pick a pattern by RTO/RPO: Backup & Restore (cheapest, slowest) -> Pilot Light -> Warm Standby -> Active/Active (fastest, priciest). Use AWS Backup for backups, Elastic Disaster Recovery for low-RPO server DR, and Route 53 for failover. Always test DR.

RTO and RPO drive everything

Term	Meaning	Drives
RTO (Recovery Time Objective)	How long you can be down.	How much standing infrastructure you keep warm.
RPO (Recovery Point Objective)	How much data you can lose (time).	Replication frequency and mechanism.

Migration services

Service	Use
Migration Hub	Track and coordinate migrations across tools and accounts.
Application Migration Service (MGN)	Lift-and-shift rehosting of servers into EC2 via block replication.
Database Migration Service (DMS)	Migrate/replicate databases with CDC for minimal downtime.
Schema Conversion Tool (SCT)	Convert schema/code for heterogeneous engine changes.
DataSync	Fast bulk file/object transfer to S3/EFS/FSx.
Transfer Family	Managed SFTP/FTPS/FTP into S3/EFS.
Snow Family	Offline petabyte-scale transfer devices.

The 7 Rs of migration

Strategy	Meaning	When
Rehost	Lift-and-shift as-is (MGN).	Speed; refactor later. Most common first move.
Replatform	Minor optimization (for example DB -> RDS).	Small changes for big managed-service wins.
Repurchase	Move to SaaS.	Commodity apps (email, CRM).
Refactor	Re-architect cloud-native.	High-value apps needing scale/agility.
Relocate	Move VMware without conversion.	VMware Cloud on AWS estates.
Retain	Keep on-prem for now.	Not ready / dependencies / compliance.
Retire	Decommission.	Unused apps found during discovery.

Architect note

Rehost first to exit the data center on schedule, then modernize the workloads that justify it. Trying to refactor everything during migration stalls the program. Also: right-size at migration time. Do not carry on-prem over-provisioning into EC2, or you pay for idle capacity forever.

DR patterns (ordered by RTO/RPO and cost)

Pattern	RTO / RPO	Cost	How
Backup & Restore	Hours / hours	Lowest	Backups (AWS Backup) + IaC to rebuild in the DR Region on demand.
Pilot Light	Tens of min / minutes	Low	Core data replicated and minimal services running; scale up on failover.
Warm Standby	Minutes / seconds-min	Medium	A scaled-down but functional copy always running; scale up and shift traffic on failover.
Active/Active (Multi-site)	Near-zero / near-zero	Highest	Full capacity in multiple Regions serving live; failover is traffic steering.
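
The selection logic in the table can be sketched as "walk from cheapest to priciest and stop at the first pattern that meets both objectives". The minute values below are illustrative assumptions that put numbers on the table's ranges (hours, tens of minutes, minutes, near-zero); replace them with what your own drills measure.

```python
# Cheapest first: (pattern, assumed achievable RTO in minutes, assumed RPO in minutes).
PATTERNS = [
    ("Backup & Restore", 240, 240),
    ("Pilot Light", 30, 10),
    ("Warm Standby", 5, 1),
    ("Active/Active", 0, 0),
]


def cheapest_pattern(rto_minutes: float, rpo_minutes: float) -> str:
    """Return the lowest-cost pattern whose assumed RTO and RPO meet the targets."""
    for name, rto, rpo in PATTERNS:
        if rto <= rto_minutes and rpo <= rpo_minutes:
            return name
    return PATTERNS[-1][0]


if __name__ == "__main__":
    # Business accepts one hour down and 15 minutes of data loss.
    print(cheapest_pattern(60, 15))
```

Note that the tighter of the two objectives decides: a relaxed RTO with a five-minute RPO still rules out the two cheapest patterns under these assumptions.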

Cost note

DR cost scales with how much you keep warm. Most workloads do not need active/active. Match the pattern to the business RTO/RPO, not to a desire for maximum safety. Backup & restore or pilot light covers a large share of real requirements at a fraction of the cost.

DR building blocks

AWS Backup: policy-based backups with cross-Region/cross-account copies and Vault Lock (immutability against ransomware).
Elastic Disaster Recovery (DRS): continuous block replication for low-RPO server recovery; drill without disrupting production.
Database replication: RDS cross-Region read replicas, Aurora Global Database (sub-second lag), DynamoDB Global Tables (active-active).
Route 53 failover: health-check-based DNS failover to the DR Region.
Multi-AZ for in-Region resilience (the default); multi-Region for Region-level DR.
IaC: the DR Region must be reproducible from code; manual DR is DR that fails.

Architecture examples

On-prem VM migration to AWS

Discover (Migration Hub/Application Discovery) -> install MGN agents -> continuous block replication to a staging area -> test-launch and validate in a test subnet -> right-size -> cutover during a window -> decommission source. Databases handled separately via DMS.

Oracle database migration to AWS

Homogeneous (Oracle -> Oracle on RDS/EC2): Data Pump for the bulk load + DMS CDC to sync changes, cut over at low lag. Heterogeneous (Oracle -> PostgreSQL/Aurora): SCT converts schema and PL/SQL, DMS moves data + CDC, then validate thoroughly. Decide RDS vs EC2 by the control/feature needs in the Database tab.

Application migration to AWS

Rehost the app servers (MGN), replatform the database (RDS), front with an ALB, put static assets on S3/CloudFront, wire IAM roles and Secrets Manager, add monitoring, then iterate toward containers/serverless where it pays off.

Cross-Region DR for an application

Warm standby: scaled-down ASG + ALB in the DR Region, database replicated (Aurora Global DB / cross-Region replica), config/IaC identical, Route 53 failover record with health checks. On failover: scale up, promote the DB, shift DNS.

Cross-Region DR for a database

Aurora Global Database for sub-second RPO and fast promotion, or RDS cross-Region read replica for simpler engines, or DynamoDB Global Tables for active-active. Copy snapshots cross-Region as an additional cheap safety net.

Backup-based DR

AWS Backup plans with cross-Region + cross-account copies to a locked vault; IaC to stand up the environment in the DR Region; documented and tested restore runbook. Lowest cost, higher RTO.

Route 53 failover pattern

Primary and secondary records with health checks; when the primary endpoint fails health checks, Route 53 answers with the secondary. Mind DNS TTL/caching for actual failover time; combine with Global Accelerator for faster IP-level failover if needed.
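
To see why TTL matters, add the detection time to the cache time. This is a rough upper-bound sketch, assuming detection takes check interval times failure threshold and that resolvers honor the record TTL (some do not); the defaults of a 30-second interval and a threshold of 3 match the standard Route 53 health-check settings, and the 60-second TTL is an illustrative choice.

```python
def dns_failover_seconds(check_interval_s: int = 30,
                         failure_threshold: int = 3,
                         record_ttl_s: int = 60) -> int:
    """Approximate worst-case time until clients resolve the secondary."""
    detection = check_interval_s * failure_threshold  # consecutive failed checks
    return detection + record_ttl_s                   # plus cached answers expiring


if __name__ == "__main__":
    print(dns_failover_seconds())                     # standard interval
    print(dns_failover_seconds(check_interval_s=10))  # fast interval
```

If the result exceeds the RTO, shorten the TTL and check interval or move to IP-level failover.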

Common mistake: untested DR

A DR plan that has never been executed is a hypothesis. Schedule regular game-day drills (DRS/Aurora Global DB support non-disruptive drills), measure actual RTO/RPO against targets, and fix the gaps. Also verify backups are restorable, not just that jobs succeeded.

14. Cost Management and Governance

AWS cost is an engineering discipline, not a finance afterthought. The bill is driven by architecture decisions (data transfer, NAT, idle resources, over-provisioning) far more than by unit prices. Governance keeps those decisions consistent across many accounts.

TL;DR

Understand the pricing levers (On-Demand vs Savings Plans vs Spot; data transfer; NAT; storage classes), see spend with Cost Explorer/CUR and control it with Budgets + anomaly detection, right-size with Compute Optimizer, and govern with Organizations/SCPs, tagging, quotas, and Control Tower. Do the big three first: right-size, schedule off, and commit the baseline.

Pricing model basics

Lever	Note
On-Demand	No commitment, highest per-unit. Default for spiky/unknown.
Savings Plans	1- or 3-year commitment. EC2 Instance Savings Plans reach up to ~72% off; Compute Savings Plans (EC2, Fargate, Lambda) reach up to ~66% and are the most flexible.
Reserved Instances	Similar savings, less flexible; still used for RDS/ElastiCache/Redshift/OpenSearch reservations.
Spot	Up to ~90% off, interruptible. Batch/CI/stateless/big-data.
Dedicated Hosts	Priciest; for BYOL host-based licensing and isolation.

The costs that surprise teams

Cost	Why it grows	Control
Data transfer	Egress to internet, cross-Region, and cross-AZ all cost per GB.	Keep tightly-coupled traffic AZ-local (while staying HA); use CloudFront for egress; use endpoints for AWS traffic.
NAT Gateway	Hourly + per-GB processed; all private-subnet egress flows through it.	VPC endpoints for S3/ECR/etc.; consolidate; one-per-AZ but no more.
EBS	Unattached volumes and old snapshots bill forever.	gp3 default; delete orphans; DLM snapshot lifecycle.
S3	Wrong storage class, versioning without expiry, request costs.	Lifecycle/Intelligent-Tiering; expire noncurrent versions.
RDS	Over-sized instances, Multi-AZ doubling compute, idle non-prod.	Right-size, stop non-prod, reservations for steady prod.
Lambda	High memory setting or very high steady volume.	Tune memory (also affects speed); consider containers at high steady load.
CloudWatch	Never-expiring logs, custom metrics, high-res alarms.	Set retention; filter; reduce cardinality.
SageMaker / idle endpoints	Standing endpoints and notebooks bill while idle.	Serverless/async inference; shut down idle resources.

Cost note: data transfer is the silent line item

Cross-AZ, cross-Region, and internet egress charges frequently rival compute on chatty or data-heavy systems, and they hide inside "data transfer" on the bill. Architect for locality, route AWS-service traffic through VPC endpoints, and serve internet egress through CloudFront. This is design work, not a billing setting.
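
A quick way to size the NAT portion of this is below. It is a back-of-envelope sketch: the hourly and per-GB rates are illustrative assumptions (check current pricing for your Region), and the savings function assumes the traffic moves to a gateway endpoint (S3 or DynamoDB), which carries no per-GB charge. Interface endpoints have their own hourly and per-GB rates and are not modeled here.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def nat_monthly_usd(gb_processed: float, gateways: int = 1,
                    hourly_usd: float = 0.045, per_gb_usd: float = 0.045) -> float:
    """Monthly NAT Gateway cost: hourly charge per gateway plus per-GB processing.

    hourly_usd and per_gb_usd are assumed rates, not quoted prices.
    """
    return gateways * hourly_usd * HOURS_PER_MONTH + gb_processed * per_gb_usd


def gateway_endpoint_savings_usd(gb_moved: float, per_gb_usd: float = 0.045) -> float:
    """NAT processing charge avoided by sending gb_moved through a gateway endpoint."""
    return gb_moved * per_gb_usd


if __name__ == "__main__":
    print(f"{nat_monthly_usd(10_000, gateways=3):.2f} USD per month")
    print(f"{gateway_endpoint_savings_usd(4_000):.2f} USD saved per month")
```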

Cost management tools

Tool	Use
Cost Explorer	Visualize and forecast spend; slice by tag/service/account.
Cost and Usage Report (CUR)	The raw, granular billing data in S3 for deep analysis (Athena/QuickSight).
Budgets	Alerts and automated actions on cost/usage/RI-SP-coverage thresholds.
Cost Anomaly Detection	ML-based alerts on unusual spend.
Compute Optimizer	Right-sizing recommendations for EC2/ASG/EBS/Lambda/RDS.
Trusted Advisor	Idle resources, low-utilization, and cost hygiene checks.
Cost allocation tags	Attribute spend to teams/apps/environments (must be activated).

Governance

Organizations + consolidated billing: one bill, shared volume discounts and Savings Plan/RI benefit across accounts.
SCPs: restrict Regions, instance families, or services to prevent expensive mistakes.
Tagging strategy + tag policies: enforce mandatory tags (Owner, Environment, CostCenter, App) so cost is attributable.
Service Quotas: know and manage limits proactively (they also cap runaway spend).
Control Tower / landing zone: consistent guardrails and account baselines including budgets.
Per-account or per-OU budgets with alerts; chargeback/showback from CUR.
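
A tag-compliance check is a few lines once you have each resource's tags as a dictionary. This is a minimal sketch using the four mandatory tags named above; how you collect the tags (for example from a tagging or inventory export) is left out, and the function name is illustrative.

```python
REQUIRED_TAGS = ("Owner", "Environment", "CostCenter", "App")


def missing_tags(tags: dict) -> list:
    """Return the mandatory tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


if __name__ == "__main__":
    print(missing_tags({"Owner": "dba-team", "Environment": "prod"}))
```

Tag keys are case-sensitive, so "owner" does not satisfy "Owner".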

Cost optimization examples

Action	Impact
Stop non-prod EC2/RDS nights and weekends	Up to ~65% off the compute cost of those resources (use a scheduler); storage still bills while stopped.
Right-size EC2/RDS (Compute Optimizer)	Cut over-provisioning, often 20-40%.
Compute Savings Plans on steady baseline	Up to ~66% vs On-Demand (EC2 Instance Savings Plans: up to ~72%).
Spot for batch/CI/stateless	Up to ~90% off.
gp3 instead of gp2	~20% cheaper + tunable performance.
S3 lifecycle / Intelligent-Tiering	Large savings on cold data.
VPC endpoints to cut NAT processing	Removes per-GB NAT cost for AWS traffic.
Reduce cross-AZ chatter	Cuts data-transfer line item.
Delete old snapshots + unattached EBS + idle EIPs	Removes silent recurring charges.
CloudWatch log retention	Stops unbounded log storage growth.
Reserved capacity planning (RDS/Redshift/etc.)	Discounts on steady managed services.
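
The ~65% figure for scheduling follows from hours-on divided by the 168 hours in a week. A small sketch of that arithmetic, assuming the saving applies to compute hours only (storage keeps billing) and that the schedule is the same every weekday:

```python
HOURS_PER_WEEK = 168


def schedule_savings(hours_on_per_day: float, days_on_per_week: int = 5) -> float:
    """Fraction of always-on compute hours avoided by a start/stop schedule."""
    hours_on = hours_on_per_day * days_on_per_week
    if not 0 <= hours_on <= HOURS_PER_WEEK:
        raise ValueError("scheduled hours must fit within one week")
    return 1 - hours_on / HOURS_PER_WEEK


if __name__ == "__main__":
    # Running 12 hours a day, Monday to Friday.
    print(f"{schedule_savings(12):.1%}")
```

For RDS, remember that a stopped instance is started again automatically after seven days, so the scheduler has to keep stopping it.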

Monthly AWS cost review checklist

Run this monthly

Review Cost Explorer month-over-month by service and by account/tag; investigate anomalies.
Check Savings Plan / RI coverage and utilization; buy/adjust for the steady baseline.
Action Compute Optimizer and Trusted Advisor right-sizing/idle findings.
Find and delete: unattached EBS, old snapshots, idle EIPs, idle load balancers, orphaned NAT gateways, stopped-but-costing resources.
Verify non-prod scheduling is working.
Review S3 storage-class distribution and lifecycle effectiveness.
Check CloudWatch log retention and largest log groups.
Review data-transfer costs (cross-AZ/Region/internet) for new hotspots.
Confirm untagged resources are shrinking; enforce tag policy.
Validate budgets/alerts are current for each account/OU.

15. Enterprise Architecture Patterns

Reference designs for real AWS deployments. Each card gives the use case, the services, the traffic flow, and how security, HA, DR, monitoring, and cost play out, plus the risks and common mistakes. Use them as starting points, not blueprints to copy blindly.

How to read these

Every pattern assumes the account/landing-zone and IAM baselines from the Fundamentals and IAM tabs (multi-account, Identity Center, guardrails, centralized logging). The cards focus on the workload-specific design.

Simple web application

Use case: a small public site/app.
Services: Route 53 -> CloudFront -> S3 (static) and/or ALB -> EC2/ECS -> RDS; ACM for TLS.
Flow: user -> Route 53 -> CloudFront (cache) -> S3 for static, ALB for dynamic -> app -> database.
Security: WAF on CloudFront/ALB, Block Public Access (serve S3 via OAC), SGs least-open, DB private.
HA: multi-AZ ALB + app; RDS Multi-AZ. DR: snapshots/backups (backup-and-restore). Monitoring: CloudWatch alarms on 5xx/latency/DB. Cost: low; CloudFront cuts egress.
Risks/mistakes: public S3 bucket instead of OAC; single-AZ database; no WAF.

Three-tier enterprise application

Use case: classic web/app/data tiers with internal users and integrations.
Services: ALB (public) -> ASG/ECS app tier (private) -> RDS/Aurora (private data subnets); ElastiCache; Secrets Manager; CloudWatch.
Flow: user -> ALB in public subnets -> app tier in private subnets -> DB in isolated data subnets (no internet route).
Security: tier-to-tier SGs (each references the SG above it), WAF, private data tier, KMS everywhere, secrets in Secrets Manager.
HA: everything across 3 AZs; RDS Multi-AZ + read replicas. DR: pilot light/warm standby cross-Region. Monitoring: per-tier dashboards + Performance Insights. Cost: right-size app tier, reservations for DB.
Risks/mistakes: DB in public subnet; SGs opened by CIDR instead of SG reference; one NAT for all AZs.

Highly available application

Use case: must survive an AZ failure with no manual action.
Services: ALB + ASG across 3 AZs, Aurora (multi-AZ, fast failover), ElastiCache Multi-AZ, stateless app, S3 for shared assets.
Flow: ALB spreads traffic; ASG self-heals via ELB health checks; DB fails over automatically.
Security: as three-tier. HA: N+1 capacity so one AZ can be lost; no singletons. DR: add cross-Region for Region failure. Monitoring: unhealthy-host and failover alarms. Cost: extra capacity for redundancy is the price of availability.
Risks/mistakes: capacity sized so that losing one AZ overloads the rest; stateful instance data lost on replacement.
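
The capacity rule works out to "the surviving AZs must carry the whole peak". A minimal sketch of the arithmetic, assuming identical instances spread evenly across AZs; the function name is illustrative.

```python
import math


def instances_per_az(peak_instances: int, azs: int = 3,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the remaining AZs still cover peak load."""
    surviving = azs - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(peak_instances / surviving)


if __name__ == "__main__":
    per_az = instances_per_az(12)  # peak needs 12 instances
    print(per_az, "per AZ,", per_az * 3, "in total")
```

With 3 AZs that is 50% more than peak; with 2 AZs it is 100% more, which is one reason three AZs is the usual choice.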

Private enterprise application (no internet exposure)

Use case: internal-only app for corporate users via VPN/Direct Connect.
Services: internal ALB, private subnets only, VPC endpoints (no NAT dependency for AWS traffic), Route 53 private zones, PrivateLink for shared services.
Flow: corporate network -> DX/VPN -> internal ALB -> app -> DB; AWS-service calls via endpoints.
Security: no public IPs, endpoints with policies, SG/NACL least-open, centralized egress inspection if needed. HA/DR: as HA app. Monitoring: Flow Logs + standard. Cost: endpoints instead of NAT can be cheaper for AWS-heavy traffic.
Risks/mistakes: accidental public exposure; forgetting endpoints and routing everything through NAT.

Shared services account pattern

Use case: centralize AD, CI/CD, artifact repos, DNS, and shared endpoints consumed by many workload accounts.
Services: Shared Services VPC + Transit Gateway/Cloud WAN; PrivateLink to expose services; Route 53 Resolver; central ECR/CodeArtifact.
Flow: workload accounts reach shared services over TGW or PrivateLink, not the internet.
Security: tightly-scoped cross-account access; resource policies; endpoint policies. HA/DR: shared services must be as resilient as the workloads depending on them. Cost: shared once, consumed by many.
Risks/mistakes: making shared services a single point of failure; unbounded cross-account trust.

Centralized networking (hub-and-spoke)

Use case: many VPCs and hybrid connectivity with central routing and inspection.
Services: Transit Gateway (or Cloud WAN), central inspection VPC (Network Firewall), DX/VPN, central egress.
Flow: spokes attach to TGW; route tables segment prod/non-prod/shared; egress and inspection centralized.
Security: segmentation via TGW route tables; inspection appliances (GWLB/Network Firewall). HA: per-AZ attachments, appliance-mode. Cost: watch TGW per-GB and central-egress data charges.
Risks/mistakes: asymmetric routing without appliance mode; CIDR overlaps; central egress becoming a bottleneck/cost center.

Multi-account landing zone

Use case: the organizational foundation everything else sits on.
Services: Organizations, Control Tower, SCP guardrails, dedicated Log Archive + Audit accounts, IAM Identity Center, Account Factory/AFT.
Flow: accounts vended from a blueprint; logs centralized; guardrails inherited by OU.
Security: the whole point: preventive + detective guardrails, centralized immutable logs, least-privilege SSO. DR: IaC-reproducible. Cost: baseline services per account.
Risks/mistakes: workloads in the management account; OUs modeled on org chart; no account-vending automation.

Oracle database on AWS

Use case: run Oracle with the right balance of managed vs control.
Services: RDS for Oracle (managed) OR EC2 self-managed with Data Guard (full control); io2/gp3 storage; Secrets Manager; AWS Backup/RMAN-to-S3.
Flow: app tier -> private DB (RDS or EC2) in data subnets.
Security: private subnets, KMS, TLS, restricted SG, deletion protection. HA: RDS Multi-AZ or Data Guard. DR: cross-Region replica/snapshots or Data Guard standby. Cost: licensing (License Included vs BYOL) is a major factor; core count on EC2.
Risks/mistakes: choosing RDS then hitting a feature limit (RAC/SYSDBA); public DB; ignoring license core-count on EC2. See the Database tab's Oracle gotchas.

Data lake pattern

Use case: central governed analytics store.
Services: S3 (zones) + Glue (catalog/ETL) + Lake Formation (governance) + Athena/Redshift Spectrum + QuickSight.
Flow: ingest (DMS/Firehose/DataSync) -> raw -> ETL -> curated -> query/BI.
Security: Lake Formation row/column/tag permissions; encryption; cross-account sharing without copies. HA/DR: S3 durability + cross-Region replication. Cost: Parquet + partitioning; serverless query.
Risks/mistakes: unpartitioned raw formats blowing up scan cost; no governance ("data swamp").

Kubernetes platform pattern

Use case: internal container platform for many teams.
Services: EKS (managed node groups + Karpenter or Fargate), AWS Load Balancer Controller, IRSA/Pod Identity, ECR, GitOps, Managed Prometheus/Grafana.
Flow: ALB ingress -> pods across AZs; per-pod IAM; images from ECR endpoints.
Security: least-privilege pod roles, network policies, image scanning, IMDSv2. HA: multi-AZ node groups. DR: IaC + GitOps redeploy + data replication. Cost: node right-sizing/Karpenter, Spot for stateless.
Risks/mistakes: node role over-permissioning pods; VPC CNI IP exhaustion; skipped version upgrades.

Serverless application pattern

Use case: event-driven or spiky app with minimal ops.
Services: API Gateway -> Lambda -> DynamoDB; S3 events; EventBridge; Step Functions; SQS/SNS.
Flow: request/event -> function -> data store; async work decoupled via queues.
Security: per-function least-privilege roles, API auth (Cognito/authorizers), input validation. HA: built-in multi-AZ. DR: DynamoDB Global Tables + multi-Region deploy. Cost: pay-per-use, scales to zero; watch high steady volume.
Risks/mistakes: cold starts on latency-critical paths; no DLQs; overly broad function roles.

Event-driven architecture

Use case: decoupled services reacting to events.
Services: EventBridge (routing) + SNS (fan-out) + SQS (buffering) + Lambda/ECS (processing) + DLQs.
Flow: producers emit events -> routed/fanned out -> independent consumers with retries.
Security: per-consumer roles; event-bus resource policies. HA/DR: managed, multi-AZ; replicate critical state. Cost: per-event, cheap.
Risks/mistakes: no DLQ; poison messages; assuming ordering where Standard queues do not guarantee it.
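
The poison-message risk is easiest to see in a toy model. The sketch below is a simplified in-memory simulation, not the SQS API: a message that keeps failing is retried until it has been received max_receive_count times, then set aside in a dead-letter list so it stops blocking the consumer. All names are hypothetical.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process messages with retries; park repeat failures in a dead-letter list."""
    queue = deque((msg, 0) for msg in messages)
    processed, dead_letter = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if receives >= max_receive_count:
                dead_letter.append(msg)        # give up: inspect and redrive later
            else:
                queue.append((msg, receives))  # retry after the other messages
    return processed, dead_letter


def fail_on_bad(msg):
    """Example handler that always rejects one poison message."""
    if msg == "bad":
        raise ValueError("cannot parse message")


if __name__ == "__main__":
    print(drain(["a", "bad", "b"], fail_on_bad))
```

Without the dead-letter step, the poison message would be retried indefinitely while consuming worker capacity.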

Hybrid cloud pattern

Use case: AWS integrated with on-prem systems.
Services: Direct Connect (+ VPN backup), Transit Gateway, Route 53 Resolver (hybrid DNS), Storage Gateway/DataSync.
Flow: on-prem <-> DX/VPN <-> TGW <-> VPCs; DNS resolves both ways via Resolver endpoints.
Security: encrypted links, restricted routing, no overlapping CIDRs. HA: redundant DX/VPN. Cost: DX reduces egress cost for steady transfer.
Risks/mistakes: single DX with no backup; asymmetric routing; CIDR overlap.
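
CIDR overlap is cheap to catch before anything is attached. A minimal sketch using the Python standard library follows; the address ranges are made-up examples, and in practice the input would be your on-prem ranges plus every VPC CIDR.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return every pair of CIDR blocks that overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


if __name__ == "__main__":
    # Hypothetical on-prem range followed by two VPC ranges.
    print(overlapping_cidrs(["10.0.0.0/16", "10.0.128.0/17", "172.16.0.0/12"]))
```

Running a check like this in the account-vending pipeline keeps overlaps from reaching the routing layer.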

Multi-Region DR pattern

Use case: survive a Region failure within a defined RTO/RPO.
Services: warm standby in a second Region: ALB+ASG scaled down, Aurora Global Database, replicated S3, Route 53 failover, IaC parity.
Flow: normal traffic to primary; on failure, scale up DR, promote DB, shift DNS/Global Accelerator.
Security: identical guardrails both Regions. DR: the point of the pattern; test regularly. Cost: pay for standby capacity + replication.
Risks/mistakes: untested failover; config drift between Regions; DNS TTL delaying failover.

Secure landing zone pattern

Use case: a governed, secure baseline for regulated environments.
Services: Control Tower, SCPs, GuardDuty/Security Hub/Config org-wide, central logging with Object Lock, Firewall Manager, IAM Identity Center, Network Firewall central egress.
Flow: every account inherits guardrails; all logs immutable and central; detection to a security account.
Security: preventive + detective + response wired together. Cost: baseline security services per account.
Risks/mistakes: detection without response; logs not immutable; guardrails not tested in non-prod first.

GenAI with private enterprise data ai

Use case: AI assistant grounded in internal, permissioned data.
Services: Bedrock (models + Guardrails) + Knowledge Bases + vector store (OpenSearch/pgvector) + S3 (curated docs) + a governed serving layer for structured data.
Flow: user query -> retrieve permitted context (respecting user permissions) -> model answers with citations -> logged.
Security: least-privilege retrieval, Guardrails, no direct OLTP access, full audit. Cost: tokens + vector store + embeddings.
Risks/mistakes: connecting the model to production databases directly; cross-tenant data leakage in retrieval; no output validation. See the AI/ML tab warnings.

16. Troubleshooting Guides

Runbooks for the failures you will actually hit. Each has symptoms, likely causes, checks (console path + CLI where useful), fixes, and prevention. The meta-rule: almost every AWS problem is IAM (permission), networking (reachability), or a quota/limit. Isolate which, then dig.

First three checks, always

(1) IAM: is the calling principal allowed this action on this resource (check CloudTrail for the AccessDenied, and the policy simulator)? (2) Networking: is there a path (route + SG + NACL + endpoint), verified with Reachability Analyzer / Flow Logs? (3) Quotas: are you at a Service Quota limit (check Service Quotas / Trusted Advisor)?

Compute and access

EC2 instance not reachable compute

Symptoms: connection timeouts to the instance.
Causes: SG/NACL blocking, no route/public IP, instance down, wrong subnet.
Checks: SG allows the port from your source; route table + public IP (if internet-facing); instance running and status checks passing; NACL allows in+out (ephemeral). Console: EC2 > Instances > (Networking, Status checks). CLI:

aws ec2 describe-instance-status --instance-ids i-0abc
aws ec2 describe-security-groups --group-ids sg-0abc

Fix: correct SG/route; use Reachability Analyzer. Prevention: standardize network templates; use Session Manager (no inbound port needed).

SSH issue compute

Symptoms: permission denied / timeout on SSH.
Causes: wrong key, wrong user, SG not allowing 22 from your IP, key perms, host firewall.
Checks: correct .pem and username (ec2-user/ubuntu/etc.); chmod 400 the key; SG allows 22 from your IP. Fix: prefer Session Manager to bypass SSH entirely. Prevention: close 22; use SSM.

Session Manager not working compute

Causes: SSM agent not running, no instance role with SSM permissions, no network path to SSM endpoints.
Checks: agent installed/running; instance role has AmazonSSMManagedInstanceCore; reachability to ssm, ssmmessages, ec2messages endpoints (NAT or interface endpoints). Console: Systems Manager > Fleet Manager (managed node shows). Fix: attach role, add endpoints. Prevention: bake agent+role into AMIs/launch templates.

Instance boot issue compute

Causes: full root disk, bad fstab mount, failed cloud-init/user-data, kernel/GRUB.
Checks: EC2 Serial Console; aws ec2 get-console-output --instance-id i-0abc; look for mount/cloud-init errors. Fix: detach root volume, attach to a rescue instance, repair, reattach. Prevention: test AMIs; avoid fragile user-data; disk alarms.

EC2 status check failure compute

System status (AWS infra): often resolved by stop/start (moves to new hardware). Instance status (OS): boot/network/config issue, needs OS fix. Checks: which check failed. Fix: stop/start for system; console output/rescue for instance. Prevention: enable auto-recovery alarms.

High CPU compute

Checks: CloudWatch CPUUtilization; T-family CPUCreditBalance; top on the instance itself. Fix: scale out (ASG), scale up/compute-optimized, or fix the app; move off T-family if credits exhaust. Prevention: target-tracking scaling; right family.
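
For the T-family case, a quick estimate shows how long a burst can last. The sketch relies on the definition that one CPU credit is one vCPU at 100% for one minute. The `BurstableProfile` shape and the example figures are assumptions for illustration; confirm the vCPU count and hourly earn rate for your instance size in the EC2 documentation.

```typescript
interface BurstableProfile {
  vcpus: number;
  creditsEarnedPerHour: number; // per-size figure from the EC2 docs
}

// Hours until CPUCreditBalance reaches zero at a sustained average utilization.
// CPUUtilization is averaged across vCPUs, so spend scales with the vCPU count.
function hoursUntilCreditsExhausted(
  profile: BurstableProfile,
  creditBalance: number,
  avgCpuUtilizationPct: number,
): number {
  const spentPerHour = (profile.vcpus * avgCpuUtilizationPct * 60) / 100;
  const netBurnPerHour = spentPerHour - profile.creditsEarnedPerHour;
  if (netBurnPerHour <= 0) return Infinity; // at or below baseline: balance does not shrink
  return creditBalance / netBurnPerHour;
}

// Example inputs (assumed): 2 vCPUs earning 24 credits per hour, 288 credits banked.
const profile: BurstableProfile = { vcpus: 2, creditsEarnedPerHour: 24 };

console.log(hoursUntilCreditsExhausted(profile, 288, 60)); // 6
console.log(hoursUntilCreditsExhausted(profile, 288, 20)); // Infinity
```

In standard mode the instance is throttled to baseline once the balance hits zero; in unlimited mode it keeps bursting and surplus credits are billed instead, so the same arithmetic becomes a cost estimate.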

Memory pressure compute

Checks: memory needs the CloudWatch agent (mem_used_percent); free -m, OOM in dmesg. Fix: memory-optimized family, tune heap. Prevention: agent + alarms; right-size.

Disk full compute

Checks: agent disk metrics; df -h, du -sh /*. Fix: clear/rotate logs; grow EBS online then growpart + resize2fs/xfs_growfs. Prevention: log rotation, disk alarms.

EBS volume attachment issue storage

Causes: volume and instance in different AZs; already attached; device name conflict.
Checks: same AZ; volume state available. CLI: aws ec2 attach-volume --volume-id vol-0abc --instance-id i-0abc --device /dev/xvdf then mount in OS. Fix: snapshot + restore in target AZ if cross-AZ. Prevention: place volumes in the instance AZ.

Storage

S3 access denied storage

Causes: IAM policy, bucket policy, Block Public Access, VPC endpoint policy, KMS key policy, an SCP, or an explicit Deny in any of them; object owned by a different account.
Checks: every layer (IAM, bucket policy, endpoint policy, KMS, SCP); CloudTrail shows which call was denied, but object-level calls appear only if S3 data events are enabled. Fix: grant on the failing layer; ensure the KMS key policy allows the principal. Prevention: use Access Analyzer; document the intended access.

S3 public access issue storage

Symptom: objects unexpectedly public, or a needed public-read failing.
Checks: Block Public Access settings (account + bucket), ACLs (should be disabled), bucket policy. Fix: for exposure, enable Block Public Access and remove public statements; for legitimate public content, use CloudFront + OAC instead. Prevention: account-level BPA on in every account, with an SCP that denies changing it; Macie for PII.

EFS mount issue storage

Causes: no mount target in the instance's AZ, SG blocking NFS (2049), missing efs-utils, wrong DNS.
Checks: mount target exists in that AZ; EFS SG allows 2049 from the instance SG; mount -t efs with the correct FS id. Fix: create the missing mount target; open 2049. Prevention: mount target per AZ; template the SG.

Networking and load balancing

Load balancer target unhealthy elb

See the Load Balancing tab for the full ordered walk. Short version: target SG must allow the LB SG on the health-check port; health path returns 200; app listening on the target port; protocol/port match; NACL allows return; LB enabled in the target's AZ.

SSL certificate issue elb

Cert must cover the hostname (SAN), be in the same Region as the LB (us-east-1 for CloudFront), be attached to the HTTPS listener, and (if DNS-validated) have its validation record present. Check the listener security policy for cipher/protocol support.

Route 53 DNS issue dns

Checks: record exists and is correct; private zone associated with the VPC; VPC enableDnsSupport/enableDnsHostnames on; TTL not masking a change; hybrid resolution via Resolver endpoints/rules. CLI: dig @169.254.169.253 name from inside the VPC. Fix: correct record/association. Prevention: low TTL before changes; alias records for AWS targets.

VPC routing issue network

Confirm the subnet's route table has a route to the destination (IGW/NAT/TGW/peering/endpoint), most-specific-wins; the target is healthy. Use Reachability Analyzer. Prevention: IaC route templates; document intended routing.
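
"Most-specific-wins" is longest-prefix matching. The sketch below models only that rule for IPv4. The route targets are placeholder IDs, `selectRoute` is a hypothetical helper, and real route tables add further rules (for example how static and propagated routes are prioritized) that are not modelled here.

```typescript
interface Route {
  destination: string; // IPv4 CIDR
  target: string;      // placeholder target ID
}

function toInt(ip: string): number {
  return ip.split(".").map(Number).reduce((acc, octet) => acc * 256 + octet, 0);
}

function prefixLength(cidr: string): number {
  return Number(cidr.split("/")[1]);
}

// True when the address falls inside the CIDR block.
function contains(cidr: string, ip: string): boolean {
  const network = cidr.split("/")[0];
  const size = 2 ** (32 - prefixLength(cidr));
  return Math.floor(toInt(network) / size) === Math.floor(toInt(ip) / size);
}

// Among all matching routes, the one with the longest prefix wins.
function selectRoute(routes: Route[], destinationIp: string): Route | undefined {
  return routes
    .filter((r) => contains(r.destination, destinationIp))
    .sort((a, b) => prefixLength(b.destination) - prefixLength(a.destination))[0];
}

const routes: Route[] = [
  { destination: "10.0.0.0/16", target: "local" },
  { destination: "0.0.0.0/0", target: "nat-0abc" },
  { destination: "10.20.0.0/16", target: "tgw-0abc" },
  { destination: "10.20.5.0/24", target: "pcx-0abc" },
];

console.log(selectRoute(routes, "10.20.5.7")?.target); // pcx-0abc
console.log(selectRoute(routes, "10.20.9.1")?.target); // tgw-0abc
console.log(selectRoute(routes, "8.8.8.8")?.target);   // nat-0abc
```

The first lookup is the classic surprise: a narrower peering route silently overrides the broader Transit Gateway route for part of the range.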

Security Group issue network

SGs are stateful/allow-only; return traffic is automatic. Verify the target SG allows the source (IP or referencing SG) on the port. Cross-VPC SG references work over same-Region VPC peering; over Transit Gateway they need security group referencing support enabled on the TGW and the VPC attachment. Prevention: reference SGs, not CIDRs, for east-west.

NACL issue network

NACLs are stateless: allow inbound and the outbound ephemeral return (1024-65535), evaluated by rule number (lowest first, explicit deny possible). A custom NACL denies all until you add allows. Prevention: keep NACLs simple; use SGs as the primary control.
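
The two properties that cause most NACL incidents (first match by rule number, and no automatic return traffic) can be shown in a small model. This sketch matches on port only and ignores protocol and CIDR; `evaluate` and `connectionWorks` are hypothetical helpers.

```typescript
type Action = "allow" | "deny";

interface NaclRule {
  ruleNumber: number;
  action: Action;
  fromPort: number;
  toPort: number;
}

// Rules are checked in ascending rule-number order; the first match decides.
function evaluate(rules: NaclRule[], port: number): Action {
  const ordered = [...rules].sort((a, b) => a.ruleNumber - b.ruleNumber);
  for (const rule of ordered) {
    if (port >= rule.fromPort && port <= rule.toPort) return rule.action;
  }
  return "deny"; // the implicit final "*" rule
}

// Stateless: the request must pass inbound AND the reply must pass outbound,
// where the reply is addressed to the client's ephemeral port.
function connectionWorks(
  inbound: NaclRule[],
  outbound: NaclRule[],
  serverPort: number,
  clientEphemeralPort: number,
): boolean {
  return evaluate(inbound, serverPort) === "allow" && evaluate(outbound, clientEphemeralPort) === "allow";
}

const inbound: NaclRule[] = [{ ruleNumber: 100, action: "allow", fromPort: 443, toPort: 443 }];
const outboundBroken: NaclRule[] = [{ ruleNumber: 100, action: "allow", fromPort: 443, toPort: 443 }];
const outboundFixed: NaclRule[] = [
  ...outboundBroken,
  { ruleNumber: 110, action: "allow", fromPort: 1024, toPort: 65535 },
];

console.log(connectionWorks(inbound, outboundBroken, 443, 50000)); // false
console.log(connectionWorks(inbound, outboundFixed, 443, 50000));  // true

// A lower-numbered deny beats a broader, higher-numbered allow.
const sshBlocked: NaclRule[] = [
  { ruleNumber: 200, action: "allow", fromPort: 0, toPort: 65535 },
  { ruleNumber: 100, action: "deny", fromPort: 22, toPort: 22 },
];
console.log(evaluate(sshBlocked, 22)); // deny
```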

NAT Gateway issue network

Causes: NAT in a failed AZ, no 0.0.0.0/0 -> nat route, NAT in a private subnet (must be public with IGW route), port exhaustion.
Checks: NAT in a public subnet with IGW route; private route table points to it; ErrorPortAllocation metric. Fix: one NAT per AZ; correct routes. Prevention: VPC endpoints reduce NAT dependency and cost.

VPC endpoint issue network

Gateway endpoint (S3/DynamoDB): route added to the right route table; endpoint policy allows the action; bucket policy allows the VPCE. Interface endpoint: SG allows 443 from clients; private DNS enabled. Prevention: keep endpoint policies deliberate and documented.

Transit Gateway route issue network

Attachment associated with the right TGW route table; destination CIDR propagated/added; VPC subnet route table points the remote CIDR at the TGW; appliance mode on for centralized inspection. Prevention: name and document TGW route tables per segment.

VPN down hybrid

Checks: tunnel state (should have 2 up); BGP session; pre-shared keys / IKE params match; on-prem device config; SG/NACL for the traffic. CLI: aws ec2 describe-vpn-connections. Fix: reset/reconfigure tunnel; failover to the second tunnel. Prevention: 2 tunnels + BGP + tunnel-state alarm.

Direct Connect issue hybrid

Check the connection/VIF state, BGP session, and advertised routes; verify the DX Gateway associations. Have a VPN backup path and set route priorities to avoid asymmetric routing when both exist. Prevention: redundant DX or DX + VPN; connection-state alarms.

Database

RDS backup failed database

Causes: storage full, long-running transaction blocking the snapshot, IAM/KMS issue, maintenance conflict.
Checks: RDS events + CloudWatch FreeStorageSpace; AWS Backup job error. Fix: free/grow storage, resolve blocking transaction, fix KMS key access. Prevention: storage-autoscaling, storage alarms, backup-failure alarms.

RDS performance issue database

Checks: Performance Insights (top SQL, wait events), CPUUtilization, FreeableMemory, ReadIOPS/WriteIOPS vs provisioned, connection count, ReplicaLag. Fix: tune queries/indexes, right-size, add read replicas, RDS Proxy for connection storms, provision IOPS. Prevention: PI enabled; query review; alarms.

RDS connection issue database

Causes: SG not allowing the app SG on the DB port, DB in a different subnet/AZ path, max connections reached, wrong endpoint (writer vs reader), credentials.
Checks: SG, DatabaseConnections metric, endpoint used, secret value. Fix: open SG from app SG; RDS Proxy/pooling for exhaustion; use the correct endpoint. Prevention: Proxy, connection limits, private-subnet placement.

IAM and serverless

IAM permission denied iam

Checks: CloudTrail event shows the exact action/resource and the deny; run the policy simulator; check for explicit Deny, SCP ceiling, permission boundary, and (cross-account) both sides. Fix: add the specific action/resource to the identity or resource policy within the ceiling. Prevention: Access Analyzer policy generation; least privilege from the start.
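
The order of those checks follows the policy evaluation logic: an explicit Deny anywhere wins, then each ceiling must allow the action, and only then does the identity policy's Allow count. The model below is a deliberately reduced sketch for same-account requests. It matches on action names only and leaves out resources, conditions, resource policies, and session policies; all type and function names are hypothetical.

```typescript
type Effect = "Allow" | "Deny";

interface Statement {
  effect: Effect;
  actions: string[]; // supports "*" wildcards, for example "s3:Get*"
}

interface AccessRequest {
  action: string;
  identityPolicy: Statement[];
  scp?: Statement[];
  permissionsBoundary?: Statement[];
}

interface Decision {
  effect: Effect;
  reason: string;
}

function matches(pattern: string, action: string): boolean {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`, "i").test(action); // action names are case-insensitive
}

function has(statements: Statement[], effect: Effect, action: string): boolean {
  return statements.some((s) => s.effect === effect && s.actions.some((p) => matches(p, action)));
}

function evaluateAccess(req: AccessRequest): Decision {
  const layers = [req.identityPolicy, req.scp ?? [], req.permissionsBoundary ?? []];
  if (layers.some((layer) => has(layer, "Deny", req.action))) {
    return { effect: "Deny", reason: "explicit Deny" };
  }
  if (req.scp && !has(req.scp, "Allow", req.action)) {
    return { effect: "Deny", reason: "not allowed by SCP" };
  }
  if (req.permissionsBoundary && !has(req.permissionsBoundary, "Allow", req.action)) {
    return { effect: "Deny", reason: "outside permissions boundary" };
  }
  if (!has(req.identityPolicy, "Allow", req.action)) {
    return { effect: "Deny", reason: "no identity-policy Allow (implicit deny)" };
  }
  return { effect: "Allow", reason: "allowed by every layer" };
}

const policies = {
  identityPolicy: [{ effect: "Allow", actions: ["s3:*"] }] as Statement[],
  scp: [
    { effect: "Allow", actions: ["*"] },
    { effect: "Deny", actions: ["s3:DeleteBucket"] },
  ] as Statement[],
  permissionsBoundary: [{ effect: "Allow", actions: ["s3:Get*", "s3:List*"] }] as Statement[],
};

for (const action of ["s3:GetObject", "s3:PutObject", "s3:DeleteBucket"]) {
  console.log(action, evaluateAccess({ action, ...policies }));
}
```

The `s3:PutObject` case is the one that wastes time in practice: the identity policy clearly allows it, so the fix has to be made at the boundary (or the request re-scoped), not by adding more Allow statements.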

Cross-account role issue iam

Causes: trust policy does not permit the caller, caller lacks sts:AssumeRole, missing ExternalId, or the assumed role lacks the needed permissions.
Checks: trust policy principal + condition; caller's policy; the role's permission policy. CLI: aws sts assume-role --role-arn arn:... --role-session-name test. Fix: align both sides. Prevention: scope trust to specific role ARNs; use ExternalId for third parties.
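
The prevention advice (trust a specific role ARN, add ExternalId for third parties) maps to a trust policy document like the one this sketch generates. `buildTrustPolicy` is a hypothetical helper, the ARNs and ExternalId are example values, and the ARN check covers the commercial `aws` partition only.

```typescript
interface TrustStatement {
  Effect: "Allow";
  Principal: { AWS: string };
  Action: "sts:AssumeRole";
  Condition?: { StringEquals: { "sts:ExternalId": string } };
}

interface TrustPolicy {
  Version: "2012-10-17";
  Statement: TrustStatement[];
}

function buildTrustPolicy(callerRoleArn: string, externalId?: string): TrustPolicy {
  // Accept only a specific role ARN: no account root and no wildcards.
  if (!/^arn:aws:iam::\d{12}:role\/[\w+=,.@\/-]+$/.test(callerRoleArn)) {
    throw new Error(`expected a specific IAM role ARN, got: ${callerRoleArn}`);
  }
  const statement: TrustStatement = {
    Effect: "Allow",
    Principal: { AWS: callerRoleArn },
    Action: "sts:AssumeRole",
  };
  if (externalId !== undefined) {
    // The caller must pass the same value as --external-id when assuming the role.
    statement.Condition = { StringEquals: { "sts:ExternalId": externalId } };
  }
  return { Version: "2012-10-17", Statement: [statement] };
}

const trust = buildTrustPolicy("arn:aws:iam::111122223333:role/CICDRunner", "example-external-id");
console.log(JSON.stringify(trust, null, 2));
```

The trust policy is only one side. The caller still needs `sts:AssumeRole` on the target role in its own policy, and the assumed role needs a permission policy for the work it is meant to do.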

Lambda timeout serverless

Causes: downstream slowness, VPC ENI/DNS, cold start on a heavy init, timeout set too low, no connection reuse.
Checks: Duration vs configured timeout; X-Ray trace; if in a VPC, endpoint/NAT reachability. Fix: raise timeout/memory (memory also raises CPU), reuse connections (RDS Proxy), optimize init, provisioned concurrency for latency. Prevention: right-size, trace, avoid VPC unless needed.

Lambda permission issue serverless

Two directions: the function's execution role (what the function can call) and its resource policy (what can invoke it, for example API Gateway/S3/EventBridge). Check both. Prevention: least-privilege execution role; explicit invoke permissions per trigger.

ECS task not starting containers

Causes: image pull failure (ECR perms/endpoint), execution role missing, insufficient capacity/CPU-mem, failing health check, bad env/secret.
Checks: stopped-task reason in the console; CloudWatch logs; execution role has ECR + logs perms. Fix: correct role/endpoints/capacity/health path. Prevention: scan images; validate task defs in CI.

EKS pod not starting containers

Checks: kubectl describe pod events; common states: ImagePullBackOff (ECR perms/endpoint), Pending (no schedulable node / IP exhaustion), CrashLoopBackOff (app error), IRSA misconfig (no AWS perms). Fix: per cause. Prevention: Karpenter/capacity, prefix delegation for IPs, IRSA validation.

Observability

CloudWatch alarm not firing obs

Causes: metric not being published (agent missing for mem/disk), wrong namespace/dimensions, insufficient datapoints, treat-missing-data setting, alarm in INSUFFICIENT_DATA.
Checks: the metric exists with data; alarm math and period; missing-data behaviour. Fix: install agent / correct dimensions / adjust evaluation. Prevention: test alarms; alarm on missing data where absence is bad.

CloudTrail logs missing obs

Causes: trail not multi-Region/org, data events not enabled, S3 bucket policy blocking delivery, KMS access, log-file validation gaps.
Checks: trail config, destination bucket policy, KMS key policy. Fix: enable org multi-Region trail; fix bucket/KMS policy. Prevention: org trail to a locked central bucket; SCP denying trail changes.

17. AWS CLI, CloudFormation, CDK, and Terraform Examples

Practical, copy-friendly automation. Prefer IaC for anything that must be reproducible, reviewed, or promoted across environments. Reserve the CLI for investigation and glue. All snippets are illustrative; verify against current provider/service versions.

Verify before production use

APIs, resource arguments, and provider versions change. Treat these as patterns; pin versions and test in a non-prod account first.

AWS CLI

Setup and profiles (with SSO / role assumption)

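# Configure IAM Identity Center (SSO) - the recommended way for humans
aws configure sso

# ~/.aws/config: the SSO session the profiles below refer to (start URL is a placeholder)
[sso-session corp]
sso_start_url = https://example.awsapps.com/start
sso_region = us-east-1
sso_registration_scopes = sso:account:access

# ~/.aws/config: a profile that signs in through an Identity Center permission set
[profile prod-admin]
sso_session = corp
sso_account_id = 111122223333
sso_role_name = AdministratorAccess
region = us-east-1

# ~/.aws/config: a profile that assumes a cross-account role from a base profile
[profile deploy]
role_arn = arn:aws:iam::444455556666:role/CICDDeployer
source_profile = prod-admin
# Only needed when the role's trust policy requires sts:ExternalId; it is an identifier, not a secret
external_id = example-external-id

# Shell: sign in, then use a profile
aws sso login --profile prod-admin
aws sts get-caller-identity --profile deploy
aws s3 ls --profile prod-admin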

Common commands

# Who am I / what account
aws sts get-caller-identity

# EC2
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,PrivateIpAddress]" --output table

# Assume a role manually
aws sts assume-role --role-arn arn:aws:iam::444455556666:role/ReadOnly \
  --role-session-name debug

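# S3 sync (--delete removes destination objects that are missing from the source; preview with --dryrun first)
aws s3 sync ./build s3://my-site-bucket --delete --dryrun
aws s3 sync ./build s3://my-site-bucket --delete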

# Tail logs
aws logs tail /aws/lambda/my-func --follow

# Start a keyless shell (no SSH)
aws ssm start-session --target i-0abc123

CloudFormation (YAML)

# An S3 bucket, encrypted, private, versioned
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ReportsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "reports-${AWS::AccountId}-${AWS::Region}"
      VersioningConfiguration: { Status: Enabled }
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault: { SSEAlgorithm: "aws:kms" }
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - { Key: Environment, Value: prod }
        - { Key: Owner, Value: platform }
Outputs:
  BucketName:
    Value: !Ref ReportsBucket

Deploy across accounts/Regions with StackSets; detect manual changes with drift detection.

AWS CDK (TypeScript)

import { Stack, StackProps, RemovalPolicy } from "aws-cdk-lib";
import { Bucket, BucketEncryption, BlockPublicAccess } from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";

export class StorageStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    new Bucket(this, "ReportsBucket", {
      versioned: true,
      encryption: BucketEncryption.KMS_MANAGED,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
      removalPolicy: RemovalPolicy.RETAIN,
    });
  }
}

CDK synthesizes to CloudFormation, so you inherit its deploy semantics with real language constructs, reuse, and tests.

Terraform

Provider and remote state

terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
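  # Remote state in S3 with DynamoDB lock (create these once, separately).
  # Terraform 1.10+ also supports S3-native locking (use_lockfile = true), and newer
  # releases deprecate dynamodb_table; check the S3 backend docs for your version.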
  backend "s3" {
    bucket         = "tfstate-111122223333"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-lock"
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"
  default_tags { tags = { Environment = "prod", ManagedBy = "terraform" } }
}

VPC (minimal, three-AZ public/private)

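# AZ names for the current Region; referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags = { Name = "prod-vpc" }
}

# Only the private tier is shown; a public tier follows the same pattern
# with its own, non-overlapping netnum range in cidrsubnet().
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = { Name = "private-${count.index}", Tier = "private" }
}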
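
It helps to know what the `cidrsubnet` call in the subnet resource above evaluates to before running a plan. The sketch below is an independent reimplementation of the same arithmetic for IPv4, written for illustration; it is not Terraform's code, and `cidrSubnet` is a hypothetical name.

```typescript
function toInt(ip: string): number {
  return ip.split(".").map(Number).reduce((acc, octet) => acc * 256 + octet, 0);
}

function toDotted(n: number): string {
  return [3, 2, 1, 0].map((pos) => Math.floor(n / 256 ** pos) % 256).join(".");
}

// Extend the prefix by newBits and pick the netNum-th subnet of that size.
function cidrSubnet(prefix: string, newBits: number, netNum: number): string {
  const [ip, lengthText] = prefix.split("/");
  const parentLength = Number(lengthText);
  const newLength = parentLength + newBits;
  if (newLength > 32) throw new Error("not enough address bits left");
  if (netNum < 0 || netNum >= 2 ** newBits) throw new Error("netNum out of range");
  const parentSize = 2 ** (32 - parentLength);
  const base = Math.floor(toInt(ip) / parentSize) * parentSize; // clear host bits
  const subnetSize = 2 ** (32 - newLength);
  return `${toDotted(base + netNum * subnetSize)}/${newLength}`;
}

// Mirrors count = 3 with cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
for (let index = 0; index < 3; index++) {
  console.log(cidrSubnet("10.0.0.0/16", 4, index));
}
// 10.0.0.0/20
// 10.0.16.0/20
// 10.0.32.0/20
```

Four new bits on a /16 give sixteen /20 blocks, so netnum values 3 to 15 remain free for other tiers or future AZs.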

EC2 with instance profile and IMDSv2 enforced

# Assumes data.aws_ami.al2023 and aws_security_group.app are defined elsewhere in the configuration
resource "aws_instance" "app" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = "m7i.large"
  subnet_id              = aws_subnet.private[0].id
  iam_instance_profile   = aws_iam_instance_profile.app.name
  vpc_security_group_ids = [aws_security_group.app.id]
  metadata_options {
    http_tokens                 = "required" # enforce IMDSv2
    http_endpoint               = "enabled"
    http_put_response_hop_limit = 1 # raise to 2 only if containers on the host need IMDS
  }
  tags = { Name = "app-1", Environment = "prod" }
}

S3 bucket, locked down

resource "aws_s3_bucket" "reports" {
  bucket = "reports-111122223333-us-east-1"
}
resource "aws_s3_bucket_versioning" "reports" {
  bucket = aws_s3_bucket.reports.id
  versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_public_access_block" "reports" {
  bucket                  = aws_s3_bucket.reports.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "reports" {
  bucket = aws_s3_bucket.reports.id
  # Nested blocks cannot share a single line in HCL, so expand them
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

IAM role and least-privilege policy

data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    # A single-line block may hold only one argument, so this one is expanded
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}
resource "aws_iam_role" "app" {
  name               = "app-role"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}
data "aws_iam_policy_document" "app" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.reports.arn}/*"]
  }
}
resource "aws_iam_role_policy" "app" {
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.app.json
}
resource "aws_iam_instance_profile" "app" {
  name = "app-profile"
  role = aws_iam_role.app.name
}

CloudWatch alarm

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { InstanceId = aws_instance.app.id }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

Terraform best practices

Remote state in S3 with state locking (S3-native lockfile on current Terraform, DynamoDB on older versions) or a managed backend; never commit state to git, because state contains secrets.
Separate state per environment (and often per component) to limit blast radius.
Modules for reusable components (network, EKS, RDS); version them.
Environment separation via directories/workspaces + tfvars; distinct accounts per environment.
Pin provider and module versions; run plan in CI and require review before apply.
Use default_tags for consistent tagging; enforce with policy-as-code (OPA/Sentinel/Checkov).
Least-privilege the CI role that runs Terraform; do not use admin.

Modular structure

infra/
  modules/
    vpc/          # reusable VPC module
    eks/
    rds/
  envs/
    prod/
      main.tf     # calls modules with prod inputs + prod backend
      prod.tfvars
    nonprod/
      main.tf
      nonprod.tfvars

CI/CD for infrastructure

Pipeline: fmt + validate + security scan (Checkov, or Trivy, which has absorbed tfsec) -> plan (posted for review) -> manual approval -> apply with a least-privilege CI role, per environment. Never run apply from a laptop against prod. Store plan artifacts for audit.

18. AWS Well-Architected Framework

Six pillars for evaluating and improving architectures. Treat them as a review lens with concrete questions, not a checklist to tick. A real Well-Architected Review surfaces trade-offs you made implicitly and forces you to make them on purpose.

TL;DR

The pillars are Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. They pull against each other (more reliability/performance often costs more); the framework is about making those trade-offs deliberately and documenting them. Use the Well-Architected Tool to run structured reviews per workload.

Operational Excellence

Meaning: run and monitor systems to deliver business value and continually improve. Why: most outages and slow recoveries come from operational gaps, not architecture.
Services: CloudFormation/CDK/Terraform, Systems Manager, CloudWatch, EventBridge, CloudTrail, Config, DevOps Guru.
Practical: everything as code, small frequent reversible changes, runbooks/playbooks, game days, observability built into the platform, post-incident reviews that produce fixes.

Common mistakes

ClickOps in prod; no runbooks; alerting on causes not symptoms; no post-incident learning; manual deployments.

Review checklist

Is all infra and config in version-controlled IaC?
Are deployments automated, small, and reversible?
Do you have runbooks for common failures, and have you tested them?
Is observability standardized across the platform?
Do incidents produce tracked corrective actions?

Security

Meaning: protect data, systems, and assets. Why: a single misconfiguration (public S3, over-broad IAM) can be catastrophic.
Services: IAM/Identity Center, Organizations/SCPs, KMS, Secrets Manager, CloudTrail, Config, GuardDuty, Security Hub, WAF/Shield, Macie, Inspector.
Practical: least privilege, identity-based access with MFA, encryption everywhere, private-by-default networking, centralized immutable logging, detection wired to response, automated guardrails.

Common mistakes

Root for daily work; wildcard IAM; public buckets; no CloudTrail/GuardDuty; secrets in code; logs not centralized/immutable.

Review checklist

Root locked; humans via SSO+MFA; least-privilege roles?
Encryption at rest + in transit everywhere?
Block Public Access + no unintended exposure?
CloudTrail/Config/GuardDuty/Security Hub org-wide?
Secrets managed, rotated, not in code?
Detection findings drive automated/tracked response?

Reliability

Meaning: a workload performs its function correctly and recovers from failure. Why: availability and recovery are what the business feels.
Services: Multi-AZ everything, Auto Scaling, ELB, Route 53, AWS Backup, Elastic Disaster Recovery, Aurora Global DB, Service Quotas.
Practical: design for failure (assume AZ loss), automate recovery (self-healing ASGs, auto-failover DBs), manage quotas proactively, test DR with game days, define and measure RTO/RPO.

Common mistakes

Single-AZ components; capacity that cannot absorb one AZ loss; untested DR; hitting an unmanaged service quota during an incident; stateful singletons.

Review checklist

Multi-AZ with N+1 capacity?
Automated failover and self-healing?
RTO/RPO defined, met, and DR tested?
Backups tested by restore, not just job success?
Quotas known and monitored?
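
The "Multi-AZ with N+1 capacity" question comes down to simple arithmetic: after one AZ is lost, the remaining AZs must still carry peak load. A minimal sketch, assuming evenly balanced AZs and capacity counted in identical instances; `planForOneAzLoss` is a hypothetical helper.

```typescript
interface AzPlan {
  perAz: number;       // instances to run in each AZ
  total: number;       // fleet size in normal operation
  overheadPct: number; // extra capacity relative to peak demand
}

// Size each AZ so that (azCount - 1) AZs can carry the full peak.
function planForOneAzLoss(peakInstances: number, azCount: number): AzPlan {
  if (azCount < 2) throw new Error("need at least two AZs to survive the loss of one");
  const perAz = Math.ceil(peakInstances / (azCount - 1));
  const total = perAz * azCount;
  const overheadPct = Math.round(((total - peakInstances) / peakInstances) * 100);
  return { perAz, total, overheadPct };
}

console.log(planForOneAzLoss(12, 3)); // { perAz: 6, total: 18, overheadPct: 50 }
console.log(planForOneAzLoss(12, 2)); // { perAz: 12, total: 24, overheadPct: 100 }
```

Spreading across three AZs halves the standby overhead compared with two, which is one reason the reliability and cost pillars both tend to favour three-AZ designs.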

Performance Efficiency

Meaning: use resources efficiently and adapt as demand and technology change. Why: right technology + right sizing delivers performance without waste.
Services: the right compute (Graviton, serverless), purpose-built databases, caching (CloudFront/ElastiCache/DAX), Compute Optimizer, auto scaling.
Practical: pick purpose-built services over one-size-fits-all, benchmark, cache aggressively, scale elastically, adopt newer instance generations, use managed/serverless to offload undifferentiated work.

Common mistakes

One database engine for every workload; no caching; over/under-provisioning; ignoring Graviton/newer generations; synchronous where async would scale better.

Review checklist

Purpose-built services matched to each workload?
Right-sized (Compute Optimizer) and current-generation?
Caching at the right layers?
Elastic scaling tied to real demand signals?

Cost Optimization

Meaning: deliver business value at the lowest price point. Why: unmanaged cloud cost grows silently through idle and over-provisioned resources.
Services: Cost Explorer, Budgets, CUR, Compute Optimizer, Savings Plans, Spot, S3 lifecycle, Trusted Advisor.
Practical: right-size, schedule non-prod off, commit steady baseline (Savings Plans), Spot for interruptible, lifecycle storage, cut data-transfer/NAT, attribute cost via tags, review monthly.

Common mistakes

Idle/unattached resources; On-Demand for steady baseline; no tagging/attribution; ignoring data-transfer and NAT costs; never-expiring logs.

Review checklist

Right-sized and idle resources removed?
Savings Plans/RIs cover the baseline; Spot for interruptible?
Storage lifecycle in place?
Cost attributed via tags and reviewed monthly?
Data-transfer/NAT costs understood and minimized?

Sustainability

Meaning: minimize the environmental impact of running cloud workloads. Why: efficiency and sustainability align, and it is increasingly a reporting/compliance requirement.
Services/levers: Graviton (better perf/watt), serverless (no idle), right-sizing, storage-class efficiency, Region choice, the Customer Carbon Footprint Tool.
Practical: maximize utilization, eliminate idle, prefer efficient instance types, delete unused data, choose managed/serverless to raise shared-infrastructure efficiency.

Common mistakes

Over-provisioned always-on fleets; hoarding unused data/snapshots; ignoring more efficient architectures because "it works."

Review checklist

High utilization, minimal idle?
Efficient instance types (Graviton) and serverless where fitting?
Unused data/resources removed on a schedule?

Architect note: run real reviews

Use the Well-Architected Tool per workload, involve the people who operate it, and record the trade-offs you accept (for example "single-Region because RTO allows it and cost matters"). A documented, deliberate trade-off is a good architecture decision; an undocumented default is technical debt waiting to surprise you.

19. Learning Path

A structured route from fundamentals to enterprise-grade AWS. Each level lists what to learn, why it matters, hands-on labs, common mistakes, and the outcome you should be able to demonstrate. Build in a sandbox account with a budget alarm; tear down what you create.

Before you start

Create a dedicated learning account (or use a sandbox OU), set a low Budget alarm, and habitually delete resources after labs (NAT gateways, EIPs, RDS, and load balancers cost money even when idle). Do all labs with an IAM Identity Center role, not root.

Beginner

Topic	Why it matters
AWS fundamentals, Regions/AZs	The blast-radius and residency model behind every decision.
IAM basics (users, groups, roles, policies)	Every action is IAM-authorized; this is the security core.
VPC basics (subnets, route tables, SG/NACL, IGW/NAT)	Reachability is the #1 source of confusion.
EC2 basics (launch, connect via SSM, AMIs, EBS)	The foundational compute + storage primitives.
S3 basics (buckets, classes, Block Public Access)	The default durable store; also the default breach vector if misconfigured.
CloudWatch basics (metrics, logs, alarms)	You cannot operate what you cannot see.

Hands-on labs: build a VPC with public+private subnets across 2 AZs; launch an EC2 instance, connect via Session Manager (no SSH); host a static site on S3+CloudFront; create a CloudWatch alarm on CPU.
Common mistakes: using root; opening 0.0.0.0/0; public buckets; leaving resources running.
Outcome: stand up a basic, private, monitored workload and explain how traffic and permissions flow.

Intermediate

Topic	Why it matters
Load balancers + Auto Scaling	HA and elasticity for real apps.
Private networking, VPC endpoints	Reduce exposure and NAT cost.
VPN + Direct Connect basics	Hybrid connectivity.
RDS / Aurora (Multi-AZ, replicas, backups)	Managed databases and their HA/DR model.
CloudTrail + Config	Audit and configuration compliance.
Security services (GuardDuty, Security Hub, KMS, Secrets Manager)	Detection and data protection.
Cost management (Cost Explorer, Budgets, Savings Plans)	Control spend deliberately.

Hands-on labs: ALB + ASG three-tier app with RDS Multi-AZ; add VPC endpoints and remove NAT for AWS traffic; enable CloudTrail+Config+GuardDuty; set budgets; store DB creds in Secrets Manager.
Common mistakes: DB in public subnet; Multi-AZ mistaken for read scaling; no backups tested; ignoring data-transfer cost.
Outcome: deploy a highly available, monitored, secured, cost-aware three-tier application.

Advanced

Topic	Why it matters
Organizations, Control Tower, SCPs, landing zones	Multi-account governance at enterprise scale.
Transit Gateway / Cloud WAN, multi-account networking	Scalable, segmented connectivity.
Advanced IAM (boundaries, ABAC, cross-account, Access Analyzer)	Least privilege that scales.
ECS / EKS / Lambda / API Gateway	Container and serverless platforms.
EventBridge / Step Functions	Event-driven and orchestrated systems.
Redshift / Lake Formation / analytics	Data platform at scale.
Bedrock / SageMaker	GenAI and ML with governance.
Multi-region DR	Region-level resilience.
Terraform / IaC at scale, enterprise security, large-scale architecture	Operating a real cloud estate reproducibly and safely.

Hands-on labs: build a multi-account landing zone (Control Tower); hub-and-spoke with TGW; EKS or ECS platform with per-workload IAM; an event-driven serverless app with DLQs; a data lake (S3+Glue+Athena); a Bedrock RAG assistant; a warm-standby cross-Region DR with tested failover; provision it all with Terraform + CI.
Common mistakes: workloads in the management account; over-permissioned pods/tasks; untested DR; unmanaged Terraform state; connecting an LLM directly to prod databases.
Outcome: design, secure, automate, and operate an enterprise multi-account AWS environment and justify the trade-offs.

Suggested certification path (optional)

Stage	Certification
Foundational	AWS Certified Cloud Practitioner (optional warm-up).
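Associate	Solutions Architect Associate; then CloudOps Engineer (formerly SysOps Administrator) or Developer Associate by role.
Professional	Solutions Architect Professional; DevOps Engineer Professional.
Specialty	Security, Advanced Networking (by focus). The Database specialty exam has been retired; check the current AWS Certification catalog for data and machine learning options.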

Architect note

Certifications validate breadth but do not substitute for building. The fastest path to real competence is repeatedly designing, deploying (via IaC), breaking, and troubleshooting workloads in a sandbox. Use the troubleshooting runbooks in this portal as deliberate practice: cause a failure, then fix it.

Sources and Accuracy

How to treat the content in this portal, and where to confirm current facts.

STANCE

This portal is a reasoning and design aid, not a live reference for volatile facts. Concepts, models, patterns, trade-offs, and troubleshooting logic are stable. Exact service limits, instance types, feature/Region availability, and pricing change constantly and must be confirmed in the official AWS documentation and the console before production use.

Primary sources (preferred)

AWS Documentation - the authoritative per-service reference.
AWS Well-Architected Framework - pillars, lenses, and the Well-Architected Tool.
AWS Architecture Center and Prescriptive Guidance - reference architectures and patterns.
AWS Security Documentation and the Security Reference Architecture (AWS SRA).
Service Quotas and the AWS Pricing Calculator - for current limits and cost estimates.
AWS What's New and service release notes - for the latest features and changes.
The AWS Console - the ground truth for what exists in your account/Region right now.

What changes fast (always verify)

Volatile	Confirm in
Service limits / quotas	Service Quotas console.
Instance families/types and generations	EC2 docs / console.
Feature and Region availability	AWS Regional Services list / What's New.
Pricing	Service pricing pages / Pricing Calculator / your CUR.
Model availability (Bedrock)	Bedrock console per Region.
Security best-practice defaults	Current AWS security docs.

Verify with current AWS documentation before production use

Where a detail is fast-moving, this portal marks it explicitly. Even where it does not, treat any specific number (limit, price, instance spec) as a starting point to confirm, not a guarantee. Architecture patterns and reasoning are the durable value here.

About this portal

Last reviewed June 2026. Built as a practical learning and reference aid for cloud architects, DBAs, and enterprise infrastructure teams. Part of expertoracle.com.