AWS Cloud Deep Dive Portal
Not a service catalog and not a marketing tour. This is a working reference for the people who design, build, run, secure, and troubleshoot AWS in production: how the services actually behave, where the model breaks your assumptions, the gotchas that cost you a weekend, the decision guidance for choosing between overlapping options, and the runbooks you reach for when something is down at 2 a.m.
Use the left navigation as a table of contents. Each section is a standalone deep dive. Start at AWS Fundamentals for the account and global-infrastructure model, then use the AWS Service Index (searchable, filterable) to find any service fast. The deep-dive tabs (IAM, Networking, Compute, Storage, Database, and so on) go beyond definitions into decision guidance, gotchas, and operations. When something breaks, jump to Troubleshooting Runbooks. When you are designing, use Architecture Patterns, Well-Architected, and the Decision tables inside each deep dive.
Who this is for
Callout legend
These labeled boxes recur throughout. Scan for them when you want the practitioner shortcut rather than the full prose.
Model / scope badges used in tables
In the Service Index and elsewhere, these badges describe the operational model at a glance:
| Badge | Meaning |
|---|---|
| Serverless | No servers to manage or patch; you pay per request/usage and AWS runs the capacity. |
| Managed | AWS runs the control plane, patching, and HA plumbing; you configure and operate the workload on top. |
| Self-managed | You run it on primitives (usually EC2); AWS provides the infrastructure only. |
| Global | The service or resource spans all Regions (for example IAM, Route 53, CloudFront, WAF for CloudFront). |
| Regional | Scoped to one Region; resilient across Availability Zones within it. |
| AZ-scoped | Lives in a single Availability Zone; you design redundancy across AZs yourself (for example EBS, an EC2 instance, a subnet). |
| Legacy | Still supported but superseded; do not choose it for new work without a specific reason. |
The AWS mental model in five sentences
- Everything is an API call, and every API call is authorized by IAM. If you understand the request-evaluation logic, you understand AWS security.
- The Region is your blast-radius and data-residency boundary; the Availability Zone is your fault-isolation boundary. Design for AZ failure by default; design for Region failure when the business requires it.
- The account is the hard isolation and billing boundary. Multi-account is the norm for anything serious, organized by AWS Organizations.
- Managed does not mean hands-off. AWS runs the undifferentiated plumbing; you still own configuration, data, IAM, network placement, and cost.
- Almost every problem is one of: IAM (permission), networking (reachability), or a service limit/quota. Check those three first.
AWS Service Index
A searchable, filterable reference of the AWS services that matter to infrastructure, data, and platform teams. Search any field; filter by domain or operational model. Click a row to expand what it is, when to use it, the main gotcha, and a cost note. This is the fast lookup layer; the deep-dive tabs carry the detail.
| Service | Domain | What it is | Model / Scope |
|---|---|---|---|
1. AWS Fundamentals
The global infrastructure, account model, and organizational primitives that every other decision sits on top of. Get the account topology and Region/AZ model right first; almost everything else is easier to change later than these are.
AWS is a set of regional service APIs on top of a global backbone. The Region is your data-residency and blast-radius boundary; the Availability Zone is your fault-isolation boundary; the account is your hard security and billing boundary. Real-world estates use many accounts under AWS Organizations, provisioned and governed through Control Tower as a landing zone. Design the account and network topology before you deploy production.
What AWS is
AWS is a collection of independent services, each exposed as an API, running in data centers grouped into Regions around the world. You do not "log into a server"; you make authenticated, authorized API calls (via the console, CLI, SDK, or IaC) that create, configure, and destroy resources. Two consequences follow and shape everything:
- Every action is an IAM-authorized API call. Security, automation, and troubleshooting all reduce to "which principal called which API on which resource, and was it allowed."
- Services are largely regional and independent. A failure or a limit in one service/Region usually does not cascade to others. You compose resilient systems from these independent pieces.
AWS global infrastructure
| Construct | What it is | Use it for | Key fact |
|---|---|---|---|
| Region [Regional] | A geographic area with multiple isolated data-center clusters (AZs). Example: us-east-1, eu-west-1, ap-south-1. | Data residency, latency to users, service/feature availability, disaster-recovery separation. | Most services are regional and isolated per Region. Data does not leave a Region unless you move it. Not all services/features exist in every Region. |
| Availability Zone (AZ) [AZ-scoped] | One or more discrete data centers with independent power, cooling, and networking, within a Region, interconnected by low-latency links. | Fault isolation. Spread instances/subnets across AZs so one AZ failure does not take you down. | AZ IDs (use1-az1) are consistent across accounts; AZ names (us-east-1a) are randomized per account. Cross-AZ traffic is chargeable. |
| Local Zone | An extension of a Region placing compute/storage closer to a metro area (for example large cities). | Single-digit-millisecond latency for specific metros: media, gaming, real-time apps. | Subset of services only; treated like a special AZ you opt into. Verify service availability. |
| Wavelength Zone | AWS infrastructure embedded in telecom 5G networks. | Ultra-low-latency mobile/edge apps over 5G. | Niche; only relevant for 5G edge use cases. |
| Edge location / PoP [Global] | Hundreds of points of presence used by CloudFront, Route 53, Global Accelerator, and WAF. | Caching, DNS, DDoS absorption, and accelerated ingress close to users. | Far more numerous than Regions; you do not deploy servers here, you use edge services. |
Accounts and AWS Organizations
An AWS account is the fundamental container and the strongest isolation boundary AWS offers: separate resources, separate IAM, separate default limits, and a separate bill. Serious environments do not run everything in one account. They use many accounts, grouped and governed by AWS Organizations.
| Concept | Meaning | Guidance |
|---|---|---|
| Management (payer) account | The root of the organization. Owns billing, creates member accounts, applies SCPs. | Keep it nearly empty. No workloads. Tightly restricted human access, hardware MFA on root, minimal IAM. Its compromise is organization-wide. |
| Member account | Any account managed within the organization. | Where workloads and environments actually live. |
| Organizational Unit (OU) | A folder grouping accounts for policy inheritance. | Group by function/environment, not by team org chart. Common: Security, Infrastructure, Workloads (with Prod/Non-Prod under it), Sandbox, Suspended. |
| Service Control Policy (SCP) | An org-level guardrail that sets the maximum permissions for accounts/OUs. Restricts only; never grants. | Use to deny dangerous actions org-wide (leaving the org, disabling CloudTrail/GuardDuty, using disallowed Regions, deleting log buckets). |
AdministratorAccess in a member account cannot exceed what the SCPs above it allow. Effective permission = intersection of (SCPs) and (IAM identity/resource policies), minus any explicit Deny. This is the backbone of preventive multi-account governance.
Control Tower and the Landing Zone
A landing zone is a pre-configured, secure, multi-account baseline: the org structure, centralized logging, guardrails, identity, and network scaffolding, ready before workloads arrive. AWS Control Tower is the managed way to build and maintain one.
- Account Factory vends new accounts from a standard blueprint (baseline IAM, logging, guardrails, VPC).
- Guardrails are preventive (SCP-based) and detective (Config-based) controls applied per OU.
- It wires up a Log Archive account and an Audit/Security account automatically.
Reference multi-account structure
| OU | Accounts | Purpose |
|---|---|---|
| Root / Management | Management (payer) | Billing, Organizations, SCPs. No workloads. |
| Security | Log Archive, Audit/Security Tooling | Immutable central log store; GuardDuty/Security Hub/Config delegated admin; break-glass. |
| Infrastructure | Network (Transit Gateway/Cloud WAN, DNS), Shared Services | Central networking, shared AD/CI/artifact/endpoint services consumed by workload accounts. |
| Workloads | Per app-and-environment: app-a-prod, app-a-nonprod, app-b-prod... | The actual applications. Prod isolated from non-prod at the account boundary. |
| Sandbox | Individual/team sandboxes | Experimentation with tight budget alarms and restrictive SCPs; detached from prod networks. |
Naming, ARNs, tags, and Resource Groups
- ARN (Amazon Resource Name) uniquely identifies every resource: arn:aws:service:region:account-id:resource. IAM policies, cross-account grants, and automation all reference ARNs. Some ARNs omit the Region field, such as IAM resources (IAM is global) and S3 buckets (the bucket name is globally unique, although each bucket lives in one Region).
- Tags are key/value metadata. They drive cost allocation, access control (ABAC), automation targeting, and inventory. Define a mandatory tagging standard early (Owner, Environment, CostCenter, Application, DataClassification) and enforce it with SCPs/Config/tag policies.
- Resource Groups collect resources by tag or CloudFormation stack for bulk operations and views.
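To make the ARN layout concrete, here is a minimal Python sketch that splits an ARN into its colon-separated fields. The `Arn` type and `parse_arn` helper are illustrative names, not part of any AWS SDK, and the sketch does no validation beyond counting fields.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str      # empty when the ARN omits the Region (IAM, S3 buckets)
    account_id: str  # empty when the ARN omits the account (S3 buckets)
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its six colon-separated fields."""
    parts = arn.split(":", 5)  # the resource part may itself contain colons
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    return Arn(partition, service, region, account_id, resource)


role = parse_arn("arn:aws:iam::111122223333:role/CICDDeployer")
bucket = parse_arn("arn:aws:s3:::my-reports-bucket")
print(role.service, repr(role.region), role.account_id, role.resource)
print(bucket.service, repr(bucket.region), repr(bucket.account_id), bucket.resource)
```

The empty Region field on the IAM role and the empty Region and account fields on the bucket are the "global-looking" ARNs mentioned in the ARN bullet above.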
Ways to interact with AWS
| Interface | What | When |
|---|---|---|
| Console | Web UI. | Learning, exploration, one-off tasks, break-glass. Not for repeatable prod changes. |
| AWS CLI | Command-line access to every API. | Scripting, automation, troubleshooting. Use named profiles and role assumption. |
| SDKs | Language libraries (Python/boto3, Java, JS, Go, .NET...). | Applications and custom tooling. |
| CloudShell | Browser-based shell with the CLI pre-authenticated as your console identity. | Quick CLI work without local setup. |
| CloudFormation | Native IaC (YAML/JSON), managed stacks, StackSets across accounts. | Declarative provisioning within AWS. |
| CDK | IaC in real programming languages, synthesizes to CloudFormation. | Teams who want code constructs, reuse, and testing. |
| Terraform | Third-party, multi-cloud IaC via the AWS provider. | The de facto standard for many enterprises; portable skill set and state-based workflow. |
Decide before production
- Account topology and OU structure (multi-account from the start; retrofitting is painful).
- Region strategy: primary Region, whether multi-Region DR is required, allowed Regions (enforced by SCP).
- Network CIDR plan across all VPCs and on-prem (non-overlapping, room to grow).
- Identity: IAM Identity Center wired to your corporate IdP; no per-account IAM users for humans.
- Centralized logging (CloudTrail org trail, Config, VPC Flow Logs) to a locked log-archive account.
- Guardrails: baseline SCPs and Config rules; GuardDuty and Security Hub org-wide.
- Tagging standard and cost allocation tags activated.
- IaC tooling and pipeline; state management strategy.
- Backup and DR baseline (AWS Backup plans, RTO/RPO targets).
Common mistakes in account design
- Running everything in one account and separating only by tags. No hard blast-radius or quota isolation.
- Putting workloads in the management account. It should be nearly empty and heavily locked down.
- Modeling OUs on the human org chart instead of on function/environment/policy needs.
- Creating dozens of accounts with no automation (Account Factory/AFT), leading to inconsistent baselines and drift.
- No central logging account, so logs live in the same accounts an attacker could tamper with.
- Deferring the CIDR plan, then discovering overlaps that block Transit Gateway/peering later.
2. Identity and Access Management Deep Dive
IAM is the authorization engine in front of every AWS API call. Understand the evaluation logic and the difference between the policy types, and most "why can/can't this work" questions answer themselves.
Prefer roles and short-lived credentials over IAM users and long-lived keys. Give humans access through IAM Identity Center (SSO), give workloads access through roles (instance profiles, task roles, IRSA, execution roles). A request is allowed only if some policy explicitly allows it and nothing explicitly denies it and it is within the SCP/permission-boundary ceiling. Design least privilege and verify with Access Analyzer and the policy simulator.
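The allow/deny/ceiling rule in the paragraph above can be sketched as a small Python model. This is a deliberately simplified illustration, not the real evaluation engine: it looks only at actions, ignores resources, conditions, resource-based policies, and session policies, and every name in it (`is_allowed`, `org_scp`, `team_boundary`) is hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(statement, action):
    """True if the statement's Action patterns cover the requested action."""
    patterns = statement["Action"]
    if isinstance(patterns, str):
        patterns = [patterns]
    # IAM action names are case-insensitive; '*' is a wildcard.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def _has(statements, effect, action):
    return any(s["Effect"] == effect and _matches(s, action) for s in statements)


def is_allowed(action, identity, scps, boundary=None):
    """identity/boundary: lists of statements; scps: one list per SCP level."""
    ceilings = list(scps) + ([boundary] if boundary is not None else [])
    # 1. An explicit Deny in any applicable policy wins.
    if any(_has(policy, "Deny", action) for policy in [identity, *ceilings]):
        return False
    # 2. Every ceiling (each SCP level, plus the boundary) must allow the action.
    if not all(_has(policy, "Allow", action) for policy in ceilings):
        return False
    # 3. The identity policy must hold an explicit Allow; otherwise implicit deny.
    return _has(identity, "Allow", action)


admin = [{"Effect": "Allow", "Action": "*"}]
org_scp = [
    {"Effect": "Allow", "Action": "*"},
    {"Effect": "Deny", "Action": "cloudtrail:StopLogging"},
]
team_boundary = [{"Effect": "Allow",
                  "Action": ["s3:*", "dynamodb:*", "logs:*", "cloudwatch:*"]}]

print(is_allowed("s3:GetObject", admin, [org_scp]))
print(is_allowed("cloudtrail:StopLogging", admin, [org_scp]))
print(is_allowed("ec2:RunInstances", admin, [org_scp], team_boundary))
```

Even an identity with `"Action": "*"` is stopped by the SCP Deny in the second call and by the boundary ceiling in the third, which is the point of layering the policy types.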
The building blocks
| Object | What it is | Use it for | Note |
|---|---|---|---|
| IAM user | A long-lived identity with a password and/or access keys. | Increasingly rare. Break-glass, or legacy systems that cannot federate. | Avoid for humans. Every human user is a long-lived credential to rotate and protect. |
| IAM group | A collection of users sharing policies. | Attaching permissions to sets of users. | Groups cannot be nested and cannot be principals in a trust policy. |
| IAM role | An identity with permissions but no long-lived credentials; assumed to get temporary credentials via STS. | Workloads, cross-account access, federated humans, AWS services acting on your behalf. | The intended pattern for almost everything. Defined by a permissions policy + a trust policy. |
| Identity-based policy | JSON attached to a user/group/role saying what that identity can do. | Granting permissions to principals. | Managed (AWS or customer) or inline. |
| Resource-based policy | JSON attached to a resource (S3 bucket, SQS queue, KMS key, Lambda) saying who can access it. | Cross-account access and per-resource control. | Has a Principal element. Cross-account access needs allow on both sides (except a few resource-policy-only cases). |
| Permission boundary | An advanced policy capping the maximum permissions an identity can have. | Safe delegation: let teams create roles without exceeding a ceiling. | Does not grant; it limits. Effective = intersection of boundary and identity policy. |
| SCP | Org-level guardrail (see Fundamentals). | Org-wide maximum permissions. | Applies to everything in member accounts, including root. |
| Session policy | Passed at AssumeRole time to further scope the session. | Temporarily narrowing a broad role for a specific task/session. | Can only restrict, never expand, the role's permissions. |
| Trust policy | The resource-based policy on a role defining who can assume it. | Controlling role assumption (which account/service/principal). | Separate from the role's permissions. Getting this wrong is the classic cross-account failure. |
STS, temporary credentials, and role assumption
STS issues short-lived credentials (access key + secret + session token) when a principal assumes a role or federates. This is how you avoid long-lived keys everywhere:
- Instance profile attaches a role to EC2; the app reads temporary creds from IMDSv2. No keys on disk.
- Task role (ECS) / IRSA or Pod Identity (EKS) / execution role (Lambda) give containers and functions scoped credentials.
- Cross-account roles: a principal in account A assumes a role in account B whose trust policy permits it.
- Federation: SAML or OIDC exchanges an external identity for AWS credentials (the basis of IAM Identity Center and web/OIDC workloads like GitHub Actions).
IAM Identity Center and federation
IAM Identity Center (formerly AWS SSO) is how humans should access AWS across many accounts. You connect it to an identity source (Microsoft Entra ID, Okta, or its built-in directory) via SAML/SCIM, define permission sets (which become IAM roles in each target account), and assign users/groups to accounts. Users get a portal, pick an account and role, and receive short-lived credentials. No per-account IAM users, no long-lived keys, centralized deprovisioning.
MFA, access keys, and credential hygiene
- Enforce MFA everywhere, especially root and any remaining IAM users. Prefer hardware/FIDO2 for privileged access.
- Root account: hardware MFA, no access keys, used only for the handful of tasks that require it. Lock it away.
- Avoid long-lived access keys. Where unavoidable, rotate on a schedule and scope tightly. Prefer roles/federation.
- Use the credential report (account-wide CSV of users, key age, MFA, last use) and last-accessed data to prune.
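As one way to act on the credential report, the sketch below scans a report for users without MFA and for active access keys older than a threshold. The sample rows are made up and cut down to four columns; the column names (`user`, `mfa_active`, `access_key_1_active`, `access_key_1_last_rotated`) are assumed to match the downloaded CSV, so check them against a real report before relying on this.

```python
import csv
import io
from datetime import datetime, timezone

# Fabricated sample, reduced to the columns this sketch reads.
SAMPLE_REPORT = """user,mfa_active,access_key_1_active,access_key_1_last_rotated
alice,true,false,N/A
bob,false,true,2023-01-15T09:30:00+00:00
ci-legacy,true,true,2021-06-01T00:00:00+00:00
"""


def find_hygiene_issues(report_csv, now, max_key_age_days=90):
    """Return (user, problem) pairs for missing MFA and stale active keys."""
    issues = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["mfa_active"] != "true":
            issues.append((row["user"], "no MFA"))
        if row["access_key_1_active"] == "true":
            rotated = datetime.fromisoformat(row["access_key_1_last_rotated"])
            age_days = (now - rotated).days
            if age_days > max_key_age_days:
                issues.append((row["user"], f"access key 1 is {age_days} days old"))
    return issues


now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed date keeps the run repeatable
issues = find_hygiene_issues(SAMPLE_REPORT, now)
for user, problem in issues:
    print(user, "->", problem)
```

A real report also carries a second access key and password columns; the same loop extends to them.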
Example policy statements
Each example below states what it allows, where to attach it, its risk level, a safer alternative, and how it gets misused.
Example 1: read-only access to one S3 bucket
Identity-based policy attached to a role/permission set:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ListBucket",
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::my-reports-bucket"
        },
        {
          "Sid": "GetObjects",
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::my-reports-bucket/*"
        }
      ]
    }

| Allows | List and read objects in exactly one bucket. Nothing else. |
|---|---|
| Attach to | A role/permission set for an app or analyst that only needs to read reports. |
| Risk | Low. Scoped to one bucket and read-only. |
| Safer alt | Add a condition (aws:SourceVpce or aws:PrincipalOrgID) if access should be restricted to a VPC endpoint or the org. |
| Misuse | Widening Resource to arn:aws:s3:::* "to save time," which grants read to every bucket. |
Example 2: the dangerous one - full admin

    {
      "Version": "2012-10-17",
      "Statement": [
        { "Effect": "Allow", "Action": "*", "Resource": "*" }
      ]
    }

| Allows | Every action on every resource. This is AdministratorAccess. |
|---|---|
| Attach to | Almost nothing. A tightly controlled break-glass role at most. |
| Risk | Critical. Full blast radius if the identity is compromised. |
| Safer alt | Job-function managed policies, or purpose-built least-privilege policies. Cap with a permission boundary and SCPs. |
| Misuse | Handing AdministratorAccess to developers "so they stop filing access tickets." The most common over-permission in AWS. |
Example 3: a cross-account trust policy
Trust policy ON the role in account B (the account being accessed):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::111122223333:role/CICDDeployer" },
          "Action": "sts:AssumeRole",
          "Condition": { "StringEquals": { "sts:ExternalId": "shared-secret-value" } }
        }
      ]
    }

| Allows | Only the CICDDeployer role in account 1111... to assume this role, and only when it presents the matching ExternalId. |
|---|---|
| Attach to | The role in the target account (trust policy). The caller also needs sts:AssumeRole permission on their side. |
| Risk | Medium. Controlled, but the assumed role's permissions policy is where the real power lives. |
| Safer alt | Scope the principal to a specific role ARN (not the whole account root :root), use ExternalId for third parties, and least-privilege the permissions policy. |
| Misuse | Trusting "AWS": "arn:aws:iam::111122223333:root" (the whole account) plus a broad permissions policy, so any principal in that account can assume a powerful role. |
Example 4: a permission boundary for delegated role creation
Boundary limiting what any role a team creates can ever do:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:*", "dynamodb:*", "logs:*", "cloudwatch:*"],
          "Resource": "*"
        }
      ]
    }

| Allows | Sets the ceiling: any role created with this boundary can never exceed S3/DynamoDB/logs/CloudWatch, even if its permissions policy says more. |
|---|---|
| Attach to | Roles created by delegated admins; combined with a policy requiring the boundary on any iam:CreateRole. |
| Risk | Low when used to constrain delegation; it only limits. |
| Safer alt | Pair with an SCP that denies iam:CreateRole/PutRolePolicy unless the boundary is attached, so it cannot be bypassed. |
| Misuse | Confusing a boundary with a grant. It does not give access; it caps it. |
Least-privilege design in practice
- Start from zero and add specific actions/resources, not from * and trim.
- Use IAM Access Analyzer policy generation to build a policy from actual CloudTrail activity.
- Add conditions: aws:PrincipalOrgID, aws:SourceIp, aws:SourceVpce, aws:RequestedRegion, tag conditions (ABAC).
- Prefer ABAC (attribute/tag-based) for scale: "allow if resource tag Team = principal tag Team" instead of enumerating resources.
- Review with last-accessed data and prune unused permissions periodically.
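The ABAC idea in the list above ("allow if resource tag Team = principal tag Team") can be sketched by building the policy document in Python. The helper name `abac_policy` and the chosen EC2 actions are illustrative assumptions; not every action supports resource-tag conditions, so confirm support for the actions you actually use.

```python
import json


def abac_policy(actions, tag_key="Team"):
    """Build an identity policy that allows actions only when tags match."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWhenTagsMatch",
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # The resource's tag must equal the caller's principal tag.
                        f"aws:ResourceTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
                    }
                },
            }
        ],
    }


policy = abac_policy(["ec2:StopInstances", "ec2:StartInstances"])
print(json.dumps(policy, indent=2))
```

One policy like this serves every team, because the match is resolved per request from the tags instead of from an enumerated resource list.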
Tooling
| Tool | Use |
|---|---|
| IAM Access Analyzer | Finds resources shared outside your account/org, validates policies, generates least-privilege policies from CloudTrail, and flags unused access. |
| Credential report | Account-wide CSV of all users: key age, MFA status, last activity. Use to prune and enforce hygiene. |
| Policy simulator | Test whether a given principal would be allowed a given action on a resource before deploying. |
| Last accessed data | Shows which services/actions a principal actually used, to right-size policies. |
The AWS IAM mental model
When you evaluate any "is this allowed?" question, walk this in order. AWS evaluates every request against all applicable policies:
- Who is making the request? (Principal) An IAM user, an assumed role (workload or human via SSO), an AWS service principal, or a federated identity. The principal's identity-based policies define its potential permissions.
- What action, on which resource? Every request is a specific API action against a specific ARN, possibly with request context (source IP, VPC endpoint, tags, Region).
- Is there an explicit Deny anywhere? An explicit Deny in any applicable policy (identity, resource, SCP, boundary, session) immediately denies the request. Deny always wins.
- Is it within the SCP and permission-boundary ceiling? The request must be permitted by the SCPs on the account/OU and by any permission boundary on the principal. These set the maximum.
- Is there an explicit Allow? With no Deny and within the ceiling, there must be an explicit Allow (identity-based or resource-based) for the action. Default is implicit deny.
- Is this cross-account? For cross-account, you generally need an Allow on both sides: the caller's identity policy (or trust for AssumeRole) and the target's resource policy or role trust.
Break-glass access
Even with SSO, keep a tightly controlled emergency path for when the IdP or Identity Center is unavailable: a dedicated break-glass IAM role or user per critical account, hardware MFA, credentials sealed and split, every use alarmed via CloudTrail/EventBridge to security. Test it periodically so it works when you need it.
Common AWS IAM mistakes
- Using the root account for daily work. Root is for the few tasks that require it. Lock it with hardware MFA, no keys.
- Long-lived access keys where roles would do. Every static key is a liability; use instance profiles, task roles, IRSA, and federation.
- AdministratorAccess handed out broadly. Use job-function policies and boundaries; grant admin narrowly and temporarily.
- Confusing roles with users. Roles are assumed for temporary creds; users are long-lived. Workloads use roles, not embedded user keys.
- Misunderstanding trust policies. The trust policy controls who can assume; the permissions policy controls what they can do. Both matter.
- Not using IAM Identity Center. Per-account IAM users for humans do not scale and are hard to deprovision.
- Poor cross-account design. Trusting :root of another account plus broad permissions is effectively trusting anyone in that account.
- Careless SCPs. Too loose and they protect nothing; too tight and they break legitimate work. Test in a non-prod OU first.
- Wildcards everywhere. Action:"*" and Resource:"*" are the norm in over-permissioned accounts; scope both.
- Ignoring permission boundaries when delegating role creation, so a team can escalate to admin by creating a powerful role.
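The wildcard mistake in the list above is easy to catch mechanically. The following is a rough lint sketch, not a substitute for Access Analyzer policy validation: `find_wildcards` is a hypothetical helper that flags Allow statements whose Action is `*` or `service:*`, or whose Resource is `*`.

```python
def _as_list(value):
    return [value] if isinstance(value, str) else list(value)


def find_wildcards(policy):
    """Return (statement label, element, value) for broad Allow statements."""
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue  # broad Deny statements are not the risk here
        label = stmt.get("Sid", f"statement[{index}]")
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append((label, "Action", action))
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append((label, "Resource", resource))
    return findings


full_admin = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
read_reports = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ListBucket", "Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": "arn:aws:s3:::my-reports-bucket"},
        {"Sid": "GetObjects", "Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::my-reports-bucket/*"},
    ],
}

print(find_wildcards(full_admin))
print(find_wildcards(read_reports))
```

A check like this fits in a CI step for policy files, with findings reviewed by a person, since some wildcards (for example in a permission boundary) are intentional.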
3. Networking Deep Dive
The VPC and everything around it. Networking is where the most time is lost in AWS, because reachability depends on the intersection of routing, security groups, NACLs, gateways, and DNS, all of which must line up. This section covers the model, the diagrams, and the runbooks.
A VPC is a regional private network; subnets are AZ-scoped. Public vs private is defined by the route table (a route to an Internet Gateway makes a subnet public). Security groups are stateful and attached to ENIs; NACLs are stateless and attached to subnets. Reach the internet outbound from private subnets via NAT Gateway; reach AWS services privately via VPC endpoints. Connect networks with Transit Gateway (hub), peering (point-to-point), VPN, or Direct Connect. Plan non-overlapping CIDRs first.
Core building blocks
| Component | What it does | Key fact / gotcha |
|---|---|---|
| VPC | Isolated virtual network in a Region with a CIDR block. | Regional. CIDR cannot shrink; you can add secondary CIDRs. Plan for growth and non-overlap. |
| Subnet | A CIDR slice bound to one AZ. | AZ-scoped. AWS reserves 5 IPs per subnet. "Public" vs "private" is purely about the route table. |
| Route table | Directs traffic by destination CIDR to a target (IGW, NAT, TGW, endpoint, peering). | Most specific route wins. The presence of a 0.0.0.0/0 -> IGW route is what makes a subnet public. |
| Internet Gateway (IGW) | Enables bidirectional internet for resources with public IPs in public subnets. | Horizontally scaled, no bottleneck. Needs a public IP + route + SG/NACL allow to actually work. |
| NAT Gateway | Outbound-only internet for private subnets. | AZ-scoped; deploy one per AZ. Hourly + per-GB charge. No inbound. Common surprise cost. |
| Egress-only IGW | Outbound-only internet for IPv6. | The IPv6 analogue of NAT (IPv6 is globally routable, so you need explicit egress-only control). |
| Elastic IP | Static public IPv4 you own and attach. | Every public IPv4 address is billed hourly, whether attached or idle (since February 2024). Clean up unused EIPs. |
| ENI | Elastic Network Interface: a virtual NIC with private IP(s), SGs, and optional public IP. | Security groups attach to ENIs. Instances, ALBs, endpoints, RDS all use ENIs, which consume subnet IPs. |
| Security Group | Stateful virtual firewall on an ENI. Allow rules only. | Return traffic is automatically allowed. Can reference other SGs as source. First line of instance-level control. |
| Network ACL (NACL) | Stateless firewall at the subnet boundary. Allow and deny rules, numbered. | You must allow return traffic explicitly (ephemeral ports). Default NACL allows all; custom NACLs deny all until you add rules. |
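The "most specific route wins" rule from the route table row can be sketched with Python's `ipaddress` module. This models longest-prefix matching only; real route tables have further tie-breaking between static and propagated routes, and the target names (`tgw-example`, `nat-example`) are placeholders.

```python
import ipaddress


def pick_route(routes, destination):
    """Return the target of the most specific route covering the destination."""
    dest = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matching:
        return None  # no route: the packet is dropped
    # Longest prefix (largest prefixlen) is the most specific match.
    return max(matching, key=lambda item: item[0].prefixlen)[1]


private_subnet_routes = {
    "10.0.0.0/16": "local",       # the VPC's own CIDR
    "10.0.0.0/8": "tgw-example",  # other VPCs and on-prem via Transit Gateway
    "0.0.0.0/0": "nat-example",   # everything else goes out through NAT
}

for ip in ("10.0.5.20", "10.9.1.1", "8.8.8.8"):
    print(ip, "->", pick_route(private_subnet_routes, ip))
```

Swapping the `0.0.0.0/0` target from a NAT Gateway to an Internet Gateway is the only difference between this private subnet and a public one.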
CIDR planning
- Pick RFC 1918 space with room to grow (a /16 per VPC is common). Reserve distinct ranges per environment/Region/account.
- Never overlap CIDRs across VPCs or with on-premises if they might ever connect. Overlap makes Transit Gateway/peering/VPN routing impossible without NAT.
- Leave room for multiple subnet tiers across 3 AZs (public, private-app, private-data, per AZ).
- Watch IP consumption: the EKS VPC CNI, interface endpoints, and large fleets consume many ENIs/IPs.
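A CIDR plan along these lines can be checked before anything is deployed. The sketch below carves one VPC block into a subnet per tier per AZ, subtracts the 5 addresses AWS reserves in each subnet, and detects overlaps between ranges. The CIDRs, tier names, and the /20 subnet size are example assumptions.

```python
import ipaddress
from itertools import combinations

AWS_RESERVED_PER_SUBNET = 5  # addresses AWS keeps for itself in every subnet


def plan_subnets(vpc_cidr, tiers, azs, new_prefix):
    """Assign one equally sized subnet to each (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    slots = [(tier, az) for tier in tiers for az in azs]
    if len(slots) > len(blocks):
        raise ValueError("VPC CIDR is too small for this layout")
    return dict(zip(slots, blocks))


def overlapping(cidrs):
    """Return the pairs of named ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


plan = plan_subnets("10.20.0.0/16",
                    ["public", "private-app", "private-data"],
                    ["a", "b", "c"], new_prefix=20)
for (tier, az), block in plan.items():
    usable = block.num_addresses - AWS_RESERVED_PER_SUBNET
    print(f"{tier:13} az-{az} {block} usable={usable}")

conflicts = overlapping({"prod": "10.20.0.0/16",
                         "dev": "10.21.0.0/16",
                         "onprem": "10.20.128.0/17"})
print(conflicts)
```

Nine /20 subnets use just over half of the /16, leaving spare blocks for later tiers; the overlap check flags the on-premises range that collides with prod.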
Diagram: three-tier VPC across AZs
Diagram: private subnet reaching S3 via a Gateway VPC endpoint
Diagram: hub-and-spoke with Transit Gateway
Connecting networks: which option
| Option | Use when | Watch out for |
|---|---|---|
| VPC Peering | A few VPCs need direct, low-latency connectivity. | Non-transitive (A-B and B-C does not give A-C). Scales poorly (N-squared). No overlapping CIDRs. |
| Transit Gateway | Many VPCs and hybrid connections; you want central routing/segmentation. | Per-attachment + per-GB charges; route-table design complexity. |
| Cloud WAN | Large, global, multi-Region networks managed by one central policy. | Newer, higher-level; evaluate cost/fit vs TGW. |
| PrivateLink | Expose/consume a specific service privately (yours or a SaaS partner's) without exposing the whole VPC. | One-directional service access, not general network connectivity. Per-endpoint cost. |
| Site-to-Site VPN | You need encrypted hybrid connectivity quickly, or a backup path for Direct Connect. | Throughput limits; use 2 tunnels + BGP for HA. |
| Direct Connect | Consistent low latency, high throughput, large steady transfer, cheaper egress. | Weeks to provision; pair with VPN backup; DX Gateway for multi-VPC/Region. |
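The "N-squared" warning on VPC Peering in the table above is simple arithmetic: a full mesh of n VPCs needs n(n-1)/2 peering connections, while a Transit Gateway hub needs one attachment per VPC. A short sketch, assuming every VPC must reach every other:

```python
def peering_connections(vpcs: int) -> int:
    """Full mesh: one connection per unordered pair of VPCs."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """Hub-and-spoke: one attachment per VPC."""
    return vpcs


for n in (3, 10, 50):
    print(f"{n} VPCs: {peering_connections(n)} peerings vs {tgw_attachments(n)} TGW attachments")
```

Each peering connection also needs route entries on both sides, so the operational load grows faster than the connection count suggests.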
Route 53 and DNS
- Public hosted zones serve internet DNS; private hosted zones serve names inside associated VPCs.
- Resolver endpoints (inbound/outbound) bridge DNS between on-prem and VPC for hybrid name resolution.
- Routing policies: simple, weighted, latency, failover, geolocation, geoproximity, multivalue. Failover + health checks are the basis of DNS-level DR.
- Alias records point to AWS resources (ALB, CloudFront, S3 website) and are free to query.
Firewalls and traffic security
| Service | Layer | Use |
|---|---|---|
| Security Groups | Instance (ENI), stateful | Primary allow-list for workload traffic. |
| NACLs | Subnet, stateless | Coarse subnet deny (blocklists). |
| AWS Network Firewall | VPC, stateful IDS/IPS | Deep packet inspection, domain filtering, centralized egress control. |
| AWS WAF | L7 (ALB/API GW/CloudFront) | OWASP rules, rate limiting, bot control for web apps. |
| AWS Shield | Edge | DDoS protection (Standard free; Advanced paid). |
| Firewall Manager | Org-wide | Centrally manage WAF/Shield/Network Firewall/SG policies across accounts. |
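The stateful/stateless difference in the first two rows is the one that bites most often, so here is a toy model of NACL evaluation. It assumes rules are checked in ascending rule-number order with the first match deciding, and it ignores protocol and CIDR matching; the function names are illustrative.

```python
EPHEMERAL = (1024, 65535)


def nacl_allows(rules, port):
    """rules: (rule_number, (port_low, port_high), 'allow' | 'deny') tuples."""
    for _number, (low, high), action in sorted(rules, key=lambda r: r[0]):
        if low <= port <= high:
            return action == "allow"  # the first matching rule decides
    return False  # the implicit final rule denies everything else


def round_trip_ok(inbound, outbound, service_port, client_port):
    """Stateless check: the request in AND the response out must both pass."""
    request_in = nacl_allows(inbound, service_port)
    response_out = nacl_allows(outbound, client_port)
    return request_in and response_out


inbound = [(100, (443, 443), "allow")]
print(round_trip_ok(inbound, [], 443, 50000))
print(round_trip_ok(inbound, [(100, EPHEMERAL, "allow")], 443, 50000))

blocklist = [(90, (22, 22), "deny"), (100, (0, 65535), "allow")]
print(nacl_allows(blocklist, 22), nacl_allows(blocklist, 443))
```

A security group would need only the inbound 443 rule, because it tracks the connection and lets the response out automatically; the NACL needs the explicit outbound ephemeral-port rule.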
Visibility
- VPC Flow Logs: record accepted/rejected traffic metadata per ENI/subnet/VPC to CloudWatch Logs or S3. Essential for "is traffic even arriving, and is it being rejected."
- Reachability Analyzer: static analysis of whether a path exists between two resources and, if not, which component blocks it. The fastest way to debug SG/NACL/route problems.
- Traffic Mirroring: copy packets to security/monitoring appliances for deep inspection.
Networking troubleshooting runbooks
EC2 instance cannot reach the internet [connectivity]
0.0.0.0/0 -> IGW route? Or in a private subnet needing a 0.0.0.0/0 -> NAT route? Does the security group allow outbound (default allows all; a locked-down SG may not)? Does the NACL allow outbound and the return ephemeral ports? Is the NAT Gateway healthy and in a public subnet with an IGW route?
Fastest path: run Reachability Analyzer from the ENI to an internet destination; check VPC Flow Logs for REJECT. Prevention: standardize subnet/route templates in IaC.
Private instance cannot download OS patches [connectivity]
0.0.0.0/0 -> nat-...; NAT is in a public subnet with an IGW route; SG/NACL allow 443 out. Better design: use Systems Manager Patch Manager with S3/SSM VPC endpoints so patching does not depend on the internet at all.
Instance cannot reach S3 privately [endpoint]
aws:SourceVpce)? Is the SG outbound allowing HTTPS? Gotcha: Gateway endpoint routes are added to specific route tables; a subnet using a different route table will not use it.
Application cannot connect across VPCs [routing]
On-premises cannot reach AWS [hybrid]
Load balancer health checks failing [elb]
DNS resolution issue [dns]
Are enableDnsSupport and enableDnsHostnames on for the VPC? For private zones, is the private hosted zone associated with the VPC? For hybrid, are Resolver endpoints/rules configured toward on-prem? Is the record correct and TTL not stale? Tool: dig/nslookup from an instance in the VPC; check the .2 resolver (VPC base +2).
Security group vs NACL vs route table isolation [firewall]
Transit Gateway route issue [tgw]
VPC endpoint policy issue [endpoint]
AWS networking gotchas
- SG vs NACL confusion. SG stateful/instance; NACL stateless/subnet. Return traffic is automatic only for SGs.
- Forgetting NACL return traffic. With custom NACLs you must allow outbound ephemeral ports for responses.
- Overlapping CIDRs. Blocks peering/TGW/VPN routing forever. Plan IP space centrally up front.
- Too many peering connections. N-squared mesh; move to Transit Gateway.
- NAT assumed to allow inbound. It is outbound-only. Inbound needs an IGW plus a public IP on the instance or an internet-facing load balancer.
- One NAT for all AZs. Cross-AZ charges and an AZ-failure single point. One NAT per AZ.
- Databases in public subnets. Put data tiers in private subnets with no internet route.
- Ignoring VPC endpoint policies. They silently restrict access on top of IAM and resource policies.
- Cross-AZ data charges. Chatty cross-AZ traffic (app-to-DB, replication) adds up; keep tightly-coupled traffic AZ-local where sensible while staying HA.
- Not using VPC endpoints for AWS traffic. Routing S3/ECR/API traffic through NAT wastes money and adds an internet dependency.
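Two of the gotchas above are easy to check mechanically: overlapping CIDRs and the growth of a full peering mesh. A minimal sketch with the standard library; the VPC names and ranges are made-up examples.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR blocks that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed to connect every VPC to every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


plan = {"prod": "10.0.0.0/16", "dev": "10.1.0.0/16", "legacy": "10.0.128.0/17"}
print(overlapping_pairs(plan))   # [('prod', 'legacy')]
print(full_mesh_peerings(10))    # 45
```

Running a check like the first one in CI against your IP plan catches overlaps before a VPC is created, which is the only cheap time to fix them.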
4. Compute Deep Dive
From EC2 virtual machines to Lambda functions. The recurring decision is how much of the stack you want to own: full OS (EC2), containers (ECS/EKS/Fargate, covered in the Containers tab), or just code (Lambda). This tab focuses on EC2 and the surrounding compute primitives.
Choose the smallest unit that fits: Lambda for event-driven/short tasks, containers for services, EC2 when you need a full OS or specialized hardware. On EC2, pick the right instance family for the workload profile, run across AZs behind an Auto Scaling Group, enforce IMDSv2, manage it with Systems Manager (no SSH), and buy Savings Plans/Reserved for steady load and Spot for interruptible work.
EC2 essentials
| Concept | What | Note |
|---|---|---|
| Instance family/type | Named by family + generation + size (for example m7i.xlarge): general, compute, memory, storage, accelerated. | The letter is the workload profile; the number is the generation (newer = better price/perf). Match to the workload. |
| Nitro system | AWS's hypervisor/hardware offload powering modern instances. | Enables bare-metal, higher performance, better security isolation, and features like EBS-optimized-by-default. |
| AMI | Amazon Machine Image: the template (OS + config) an instance boots from. | Build golden AMIs with EC2 Image Builder; bake in the SSM agent, CloudWatch agent, and hardening. |
| EBS-backed vs instance store | Root/data on network EBS (persistent) vs local NVMe (ephemeral, lost on stop/terminate). | Instance store is fast but temporary; never put durable data on it. EBS survives stop/start. |
| User data | Bootstrap script run at first boot. | Use for lightweight bootstrap; prefer baked AMIs + config management for repeatability. |
| IMDS / IMDSv2 | Instance Metadata Service exposes instance info and role credentials at 169.254.169.254. | Enforce IMDSv2 (session/token-based). IMDSv1 was exploited via SSRF to steal role credentials. |
| Instance profile | The wrapper that attaches an IAM role to an instance. | The right way to give an instance AWS permissions. No keys on disk. |
| Placement groups | Control physical placement: cluster (low latency), spread (max isolation), partition (large distributed). | Cluster for HPC/low-latency; spread for small critical sets; partition for HDFS/Cassandra-style. |
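The IMDSv2 row above describes a session/token flow: a PUT to obtain a token, then a GET that presents it. A minimal sketch with the standard library; the helper names are illustrative, and `read_metadata` only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: PUT request that asks IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def read_metadata(path: str) -> str:
    """Run both steps. Off EC2 the address is unreachable and this raises."""
    with urllib.request.urlopen(token_request(), timeout=2) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=2) as resp:
        return resp.read().decode()
```

On an instance, `read_metadata("instance-id")` returns the instance ID. A plain GET without the token header is what IMDSv1 accepted, which is why a server-side request forgery that can only issue GETs no longer reaches role credentials once tokens are required.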
Require IMDSv2 (HttpTokens=required) and set the hop limit to 1. This closes the SSRF-to-credential-theft class of attacks that has repeatedly leaked EC2 role credentials. Enforce it org-wide via an SCP/Config rule and bake it into launch templates and AMIs.
Choosing an instance family
| Workload | Family type | Examples (verify current gen) | Why |
|---|---|---|---|
| Web / middleware / general | General purpose (M; Graviton variants carry a g suffix) | m7i, m7g, t4g (burstable) | Balanced CPU:memory. Graviton (Arm) often 20-40% better price/perf if the stack supports Arm. |
| CPU-heavy (batch, encoding, HPC front) | Compute optimized (C) | c7i, c7g | Higher CPU:memory ratio and clock for compute-bound work. |
| Databases / in-memory / analytics | Memory optimized (R, X, z) | r7i, r7g, x2idn, z1d | High memory per vCPU. X/high-memory for SAP HANA and large in-memory DBs. |
| Oracle / SQL Server (self-managed) | Memory optimized (R/X), consider core count for licensing | r7i, x2iedn | Memory-bound and license-cost-sensitive; fewer, faster cores can cut license cost. |
| SAP workloads | Memory optimized / high-memory (SAP-certified) | Certified X/High Memory/U instances | SAP requires certified instances; check the SAP-on-AWS certified list. |
| Storage / big-data / high-IOPS local | Storage optimized (I, D, Im) | i4i, im4gn, d3 | Large fast local NVMe for NoSQL, data nodes, warehouses. |
| GPU / ML / graphics | Accelerated (P, G, Trn/Inf) | p5, g6, trn1, inf2 | P for training, G for inference/graphics, Trainium/Inferentia for cost-efficient ML. |
| Cost-sensitive / bursty non-prod | Burstable (T) + Spot | t4g, t3 | CPU credits for low-average, spiky load; combine with Spot for savings. |
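Instance type names follow the family + generation + size pattern, so they can be split mechanically when you audit a fleet. The sketch below is a simplification: the regular expression and the `InstanceType` tuple are my own, and names that do not fit the common pattern (for example the high-memory `u-` types) are rejected rather than guessed at.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # workload-profile letters, e.g. "m", "c", "trn"
    generation: int   # higher is newer
    attributes: str   # trailing letters such as "i", "g", "d", "n" (may be empty)
    size: str         # e.g. "xlarge"


_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name like 'm7i.xlarge' into its parts, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


print(parse_instance_type("m7i.xlarge"))
print(parse_instance_type("x2iedn.32xlarge"))
```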
Purchasing and capacity models
| Model | Use | Trade-off |
|---|---|---|
| On-Demand | Default, spiky, short-lived, unpredictable. | Most expensive per hour; zero commitment. |
| Savings Plans | Steady baseline compute (EC2, Fargate, Lambda). | 1 or 3-year commit for up to ~72% off. Compute Savings Plans are the most flexible. |
| Reserved Instances | Steady, specific instance families (legacy vs Savings Plans for most). | Similar savings but less flexible than Savings Plans. |
| Spot Instances | Fault-tolerant, interruptible: batch, CI, stateless web, big data. | Up to ~90% off but can be reclaimed with a 2-minute warning. Never for stateful singletons. |
| Dedicated Hosts | Physical-host isolation, socket/core-based BYOL licensing (Windows/Oracle), compliance. | Most expensive; needed for host-based licensing and strict isolation. |
| Dedicated Instances | Hardware isolated from other accounts (but not host-level control). | Isolation without the BYOL granularity of Dedicated Hosts. |
| Capacity Reservations | Guarantee capacity in a specific AZ (for example for a launch or DR). | Pay for reserved capacity whether or not used; combine with Savings Plans for the discount. |
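The commitment trade-off in the table reduces to a break-even calculation: a commitment is paid for every hour, so it only wins when the instance runs more than (1 - discount) of the time. A rough sketch; the hourly rate and discount below are invented numbers, not AWS prices.

```python
def monthly_costs(on_demand_rate: float, discount: float, hours_used: float,
                  hours_in_month: float = 730) -> tuple[float, float]:
    """Return (on-demand cost, committed cost) for one instance over a month."""
    on_demand = on_demand_rate * hours_used
    # The commitment is billed for every hour, whether or not the instance runs.
    committed = on_demand_rate * (1 - discount) * hours_in_month
    return on_demand, committed


def break_even_utilization(discount: float) -> float:
    """Fraction of the month the instance must run for the commitment to pay off."""
    return 1 - discount


# Hypothetical: $0.20/hour on demand, 40% discount, running half the month.
on_demand, committed = monthly_costs(0.20, 0.40, hours_used=365)
print(f"on-demand ${on_demand:.2f} vs committed ${committed:.2f}")
print(f"break-even at {break_even_utilization(0.40):.0%} utilization")
```

In this example the workload runs 50% of the time, below the 60% break-even, so On-Demand is cheaper. Commit only to the baseline that is always on.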
Auto Scaling and high availability
- Run instances in an Auto Scaling Group spanning at least 3 AZs, behind a load balancer, launched from a launch template (launch configurations are legacy).
- Use ELB health checks (not just EC2 status checks) so unhealthy-but-running app instances are replaced.
- Set scaling policies (target tracking on CPU/requests) and lifecycle hooks for graceful add/remove.
- Use mixed instances + Spot in the ASG for cost with capacity resilience.
- Treat instances as cattle: no unique state on the box; store state in RDS/S3/ElastiCache.
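Target tracking is roughly proportional: capacity scales by the ratio of the observed metric to the target. The function below is a simplified model for capacity planning, not the service's actual algorithm, which also involves alarms, cooldowns, and more conservative scale-in.

```python
import math


def desired_capacity(current: int, metric_value: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Proportional estimate of the capacity needed to bring the metric back to target."""
    raw = math.ceil(current * metric_value / target)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, raw))


# 4 instances at 80% average CPU with a 50% target.
print(desired_capacity(4, 80, 50, min_size=2, max_size=10))  # 7
# 4 instances at 20% CPU: the estimate is 2, but min_size holds it at 3.
print(desired_capacity(4, 20, 50, min_size=3, max_size=10))  # 3
```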
Managing EC2 with Systems Manager
| Capability | Replaces | Note |
|---|---|---|
| Session Manager | SSH, bastion hosts | Browser/CLI shell with no open port 22, no keys; every session logged to CloudWatch/S3. |
| Patch Manager | Manual patching | Scheduled patch baselines with maintenance windows and compliance reporting. |
| Run Command | Ad hoc SSH loops | Run a command across a fleet by tag, with output captured. |
| Parameter Store | Config files/secrets sprawl | Central config and SecureString secrets (cheaper than Secrets Manager for simple cases). |
| Inventory / State Manager | Manual audits | Track installed software/config and enforce desired state. |
Other compute options
| Service | Use |
|---|---|
| Lambda | Event-driven functions, API backends, automation. 15-min max, scales to zero. See the AI/Integration tabs for patterns. |
| AWS Batch | Managed batch job scheduling over EC2/Fargate/Spot for HPC, rendering, genomics. |
| Elastic Beanstalk | PaaS that provisions EC2/ASG/ELB for you from app code. Simple, but less control; often superseded by containers. |
| Lightsail | Simplified VPS with predictable flat pricing for small/simple workloads and quick prototypes. |
| EC2 Image Builder | Automated, tested, hardened golden-AMI pipelines. |
EC2 operational runbooks
How to resize an EC2 instance op
How to patch EC2 safely op
Troubleshoot boot / instance not reachable rb
Common causes: a bad /etc/fstab mount, failed cloud-init/user-data, or a kernel/GRUB issue. For access issues, verify SG/subnet/route and prefer Session Manager over SSH.
Troubleshoot high CPU rb
CloudWatch CPUUtilization shows the trend; for burstable (T) instances also check CPUCreditBalance (exhausted credits throttle you). On the box: top/htop, identify the process, check for runaway threads, GC, or a code loop. Fix by scaling out (ASG), scaling up (larger/compute-optimized), or fixing the app. For chronic T-instance credit exhaustion, move to M/C family.
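Credit exhaustion on burstable instances is predictable arithmetic: one CPU credit is one vCPU at 100% for one minute. The sketch below estimates time to exhaustion. The earn rate and balance in the example are illustrative inputs (look up the real figures for your instance size), and it assumes standard mode, where an empty balance throttles rather than bills.

```python
from typing import Optional


def credits_spent_per_hour(vcpus: int, avg_utilization: float) -> float:
    """Credits consumed per hour; avg_utilization is a fraction from 0.0 to 1.0."""
    return vcpus * avg_utilization * 60


def hours_until_exhausted(balance: float, earned_per_hour: float,
                          vcpus: int, avg_utilization: float) -> Optional[float]:
    """Hours until CPUCreditBalance reaches zero, or None if it is not shrinking."""
    net = earned_per_hour - credits_spent_per_hour(vcpus, avg_utilization)
    if net >= 0:
        return None  # at or below baseline: the balance holds or grows
    return balance / -net


# Illustrative: 2 vCPUs at 50% average, earning 24 credits/hour, 144 banked.
print(hours_until_exhausted(144, 24, vcpus=2, avg_utilization=0.5))  # 4.0
```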
Troubleshoot memory pressure rb
Memory is not a default EC2 metric in CloudWatch; install the CloudWatch agent to publish mem_used_percent. On the box: free -m, check for OOM-killer events in dmesg/syslog, identify the consumer. Fix: right-size to a memory-optimized (R) family, tune the app/JVM heap, or add swap as a stopgap (not a real fix).
Troubleshoot disk full rb
Run df -h to find the full filesystem, then du -sh /* to find the culprit (often logs, /tmp, or a runaway file). Fix immediately by clearing/rotating; longer term grow the EBS volume online (modify volume, then extend the partition and filesystem with growpart/resize2fs/xfs_growfs). Set up log rotation and disk alarms.
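When you want the same checks in a script (for a cron job or a custom alarm), the standard library covers the basics. This is a rough stand-in for df and du, not a replacement; the function names are illustrative, and the percentage can differ slightly from df because of filesystem reserved blocks.

```python
import os
import shutil


def percent_used(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def largest_files(root: str, top: int = 5) -> list[tuple[int, str]]:
    """Return (size in bytes, path) for the largest files under root, biggest first."""
    sizes = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top]
```

For example, `largest_files("/var/log")` is a quick way to confirm whether logs are the culprit before you start deleting anything.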
Design EC2 for production HA design
5. Storage Deep Dive
Block, object, and file storage, plus archive, backup, and data-movement services. The first decision is almost always block (EBS) vs object (S3) vs file (EFS/FSx); get that right and the rest is configuration.
EBS = a disk for one EC2 instance (AZ-scoped block). S3 = durable object store for anything (not a filesystem). EFS = shared POSIX/NFS across many instances (regional). FSx = managed Windows/Lustre/ONTAP/OpenZFS file systems. Use storage classes and lifecycle on S3, and gp3 as the EBS default. Snapshots are how block data crosses AZ/Region boundaries.
Choosing the storage type
| Type | Access | Scope | Use for |
|---|---|---|---|
| EBS (block) | Attached to one instance (or multi-attach for io2) | AZ | Boot disks, database files, low-latency single-instance storage. |
| Instance store (block) | Local NVMe on the host | Ephemeral | Scratch, cache, temp data you can lose on stop/terminate. |
| S3 (object) | HTTPS API, not a mount | Regional | Data lake, backups, static assets, logs, artifacts, archive. |
| EFS (file, NFS) | Mounted by many Linux instances/containers | Regional | Shared POSIX filesystem, lift-and-shift apps needing a shared mount. |
| FSx (file) | SMB/Lustre/NFS depending on flavour | AZ / Multi-AZ | Windows SMB shares, HPC scratch, NetApp ONTAP features, OpenZFS. |
EBS in depth
| Volume type | Profile | Use |
|---|---|---|
| gp3 (SSD) | Baseline SSD; IOPS/throughput provisioned independently of size. | The default for most workloads. Cheaper than gp2 and you tune performance without growing the disk. |
| io2 / io2 Block Express (SSD) | High, consistent IOPS with high durability; Block Express for the largest/fastest. | Mission-critical databases (Oracle, SQL Server, high-TPS PostgreSQL) needing guaranteed IOPS/low latency. |
| st1 (HDD) | Throughput-optimized HDD. | Big sequential workloads: logs, data warehouse staging, streaming. Not for random I/O or boot. |
| sc1 (HDD) | Cold HDD, cheapest. | Infrequently accessed large data where cost beats performance. |
- Snapshots are incremental, stored in S3 (managed), and are how you back up, copy across AZ/Region, and create AMIs.
- Encryption via KMS; enable encryption-by-default at the account level so no unencrypted volume is ever created.
- Multi-Attach (io2) lets a volume attach to multiple instances in one AZ, but the app/cluster filesystem must handle concurrency (for example clustered databases). Not a general share.
- Performance: gp3 gives 3,000 IOPS / 125 MB/s baseline, tunable up; ensure the instance is EBS-optimized and large enough to not cap volume throughput.
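Throughput is IOPS multiplied by I/O size, and the delivered figure is capped by whichever limit is lowest: the workload, the volume, or the instance. A small sketch of that arithmetic using the gp3 baseline figures quoted above; the instance limit in the example is a made-up number.

```python
def throughput_mib_s(iops: int, io_size_kib: int) -> float:
    """Throughput implied by a given IOPS rate and I/O size."""
    return iops * io_size_kib / 1024


def effective_throughput(demand: float, volume_limit: float, instance_limit: float) -> float:
    """Delivered throughput is bounded by the tightest of the three limits."""
    return min(demand, volume_limit, instance_limit)


# 3,000 IOPS of 16 KiB I/O stays under the 125 baseline.
print(throughput_mib_s(3000, 16))   # 46.875
# The same IOPS at 64 KiB would need 187.5, so throughput is the bottleneck first.
demand = throughput_mib_s(3000, 64)
print(effective_throughput(demand, volume_limit=125, instance_limit=100))  # 100
```

In the second case, raising the volume's provisioned throughput would not help until the (hypothetical) 100 instance limit is lifted by moving to a larger instance.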
Watch VolumeQueueLength and burst balance.
S3 in depth
Storage classes
| Class | Use | Note |
|---|---|---|
| Standard | Frequently accessed data. | Default; multi-AZ durability. |
| Intelligent-Tiering | Unknown/changing access patterns. | Auto-moves objects between tiers; small monitoring fee, no retrieval fees. Good default for unpredictable data. |
| Standard-IA | Infrequent but rapid access. | Cheaper storage, per-GB retrieval fee, min duration/size charges. |
| One Zone-IA | Infrequent, re-creatable data. | Single AZ (less durable); cheaper. Not for irreplaceable data. |
| Glacier Instant Retrieval | Archive with millisecond access. | Cheap storage, higher retrieval cost; for rarely-read-but-need-it-now data. |
| Glacier Flexible Retrieval | Archive, minutes-to-hours retrieval. | Cheaper; retrieval delay (expedited/standard/bulk). |
| Glacier Deep Archive | Long-term archive, hours retrieval. | Cheapest; for 7-10 year retention (compliance). |
Key features
- Versioning keeps prior object versions (protects against overwrite/delete); pair with lifecycle to expire noncurrent versions.
- Lifecycle rules transition objects to cheaper classes and expire them on a schedule.
- Object Lock (WORM) enforces retention for compliance/ransomware protection; combine with a backup account.
- Replication (CRR/SRR) copies objects across Regions/buckets for DR or locality.
- Multipart upload for large objects; pre-signed URLs for temporary, credential-free access.
- Access Points give per-application access policies to a shared bucket.
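Versioning and lifecycle rules from the list above usually travel together. A minimal sketch of a lifecycle configuration document as a Python dict, in the shape the S3 API expects; the prefix, day counts, and rule ID are example values to adapt.

```python
import json


def tier_and_expire(prefix: str) -> dict:
    """Lifecycle configuration: tier down over time, then expire."""
    return {
        "Rules": [{
            "ID": f"tier-and-expire-{prefix.strip('/')}",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Delete current versions after a year.
            "Expiration": {"Days": 365},
            # On versioned buckets, also age out superseded versions.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    }


print(json.dumps(tier_and_expire("logs/"), indent=2))
```

With boto3 this dict would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Without the noncurrent-version rule, a versioned bucket keeps every overwritten object forever and the bill grows quietly.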
S3 security
Use aws:SourceVpce/aws:PrincipalOrgID conditions in bucket policies to scope access. Public bucket exposure remains one of the most common causes of AWS data breaches.
EFS and FSx
- EFS: regional, elastic NFS; create a mount target per AZ so every AZ can mount it. Performance modes (General Purpose vs Max I/O) and throughput modes (Bursting/Elastic/Provisioned). Lifecycle to IA/Archive for cost. Higher latency/cost than EBS for single-instance work, so use it when you genuinely need a shared filesystem.
- FSx for Windows File Server: managed SMB with AD integration, for Windows apps and shares.
- FSx for Lustre: high-performance parallel filesystem for HPC/ML, can link to S3.
- FSx for NetApp ONTAP: full ONTAP (snapshots, dedup, compression, SnapMirror, multi-protocol) for enterprises standardizing on NetApp.
- FSx for OpenZFS: ZFS features (snapshots, clones) with NFS access.
Data movement and hybrid
| Service | Use |
|---|---|
| AWS Backup | Centralized, policy-based backup and cross-Region/account copies across many services. |
| Storage Gateway | Hybrid access: present S3/EBS/tape to on-prem via file (SMB/NFS), volume, or tape gateway. |
| DataSync | Fast, scheduled bulk transfer between on-prem/other clouds and AWS storage (S3/EFS/FSx). |
| Transfer Family | Managed SFTP/FTPS/FTP endpoints backed by S3/EFS. |
| Snow Family | Physical devices to move petabytes when the network is impractical (Snowball Edge / Snowcone). |
Storage decision examples
| Need | Choose | Why |
|---|---|---|
| Database data files (self-managed) | EBS io2 Block Express / gp3 | Low-latency block, provisioned IOPS, per-instance. |
| Shared app filesystem (Linux) | EFS | Many instances mount the same POSIX tree. |
| Windows file share | FSx for Windows | Native SMB + AD. |
| Static website / assets | S3 + CloudFront | Durable object store served at the edge. |
| Backups | S3 (IA/Glacier) via AWS Backup | Durable, lifecycle to archive, cross-Region copy. |
| Log archive / long retention | S3 Glacier Deep Archive + Object Lock | Cheapest, WORM for compliance. |
| Data lake | S3 + Glue/Athena/Lake Formation | Object store as the lake substrate. |
| HPC scratch | FSx for Lustre | Parallel high-throughput filesystem, S3-linked. |
| Hybrid file access to cloud | Storage Gateway / DataSync | Bridge on-prem to S3/EFS/FSx. |
| Move petabytes offline | Snowball Edge | Network transfer would take too long. |
Storage gotchas
- S3 is object storage, not a filesystem. No in-place edits, no POSIX semantics; whole-object PUT/GET. Use EFS/FSx if you need a mount.
- EBS is AZ-scoped. Cross-AZ/Region movement is via snapshots.
- EFS is regional but reachable only through mount targets in each AZ; forget the mount target in an AZ and instances there cannot mount.
- Public bucket exposure. Keep Block Public Access on; disable ACLs.
- Glacier restore delay. Deep Archive retrieval is hours; do not archive data you might need immediately unless using Instant Retrieval.
- EBS performance sizing. gp3 baseline may be too low; provision IOPS/throughput and ensure the instance is not the bottleneck.
- Snapshot cost growth. Incremental but accumulates; use lifecycle/DLM to age them out.
- Cross-Region replication and cross-AZ transfer costs. Both are per-GB and add up on chatty or large flows.
6. Database Services Deep Dive
AWS offers a purpose-built database for almost every workload. The hard part is not any single service; it is choosing correctly and understanding exactly what you give up in control when you move from self-managed to managed. This section is written with DBAs in mind.
Default to managed (RDS/Aurora) and drop to self-managed on EC2 only when a hard requirement forces it (OS/root access, unsupported feature, RAC, specific patch control). Use Aurora for MySQL/PostgreSQL that needs scale/performance, RDS for standard relational engines including Oracle and SQL Server, DynamoDB for high-scale key-value, Redshift for warehousing, ElastiCache for caching. Multi-AZ = HA; read replicas = read scale; cross-Region replicas = DR/locality. With managed, AWS owns the OS and HA plumbing; you still own schema, tuning, query performance, and cost.
The database portfolio
| Service | Type | Use |
|---|---|---|
| RDS (Oracle, PostgreSQL, MySQL, MariaDB, SQL Server) | Managed relational | Standard OLTP/apps needing a specific engine without managing the OS. |
| Aurora (MySQL/PostgreSQL-compatible) | Managed relational, cloud-native storage | Higher performance, faster failover, up to 15 replicas, Serverless v2 autoscaling. |
| DynamoDB | Serverless NoSQL key-value/document | Massive scale, single-digit-ms latency, serverless apps. |
| ElastiCache (Redis/Valkey/Memcached) | Managed in-memory cache | Caching, sessions, rate limiting. |
| MemoryDB | Durable in-memory (Redis/Valkey) | Redis speed as a primary, durable datastore. |
| Redshift | Columnar data warehouse | Analytics/BI at scale; querying the S3 lake (Spectrum). |
| DocumentDB | MongoDB-compatible document DB | MongoDB-API workloads, managed. |
| Neptune | Graph DB | Highly connected data (fraud, knowledge graphs). |
| Keyspaces | Cassandra-compatible | CQL workloads, serverless. |
| Timestream | Time-series DB | IoT/metrics/telemetry with time-based retention. |
When to use what (decision table)
| Workload | Recommended | Reason | HA | DR | You manage |
|---|---|---|---|---|---|
| Oracle DB (managed, standard features) | RDS for Oracle | Managed Oracle without OS work. | Multi-AZ | Cross-Region read replica / snapshots | Schema, tuning, params (within limits), licensing choice. |
| Oracle DB (RAC, full control, unsupported features) | EC2 self-managed (or Oracle DB@AWS where available) | RDS cannot do RAC / full SYSDBA / every option. | Data Guard / RAC (you build) | Data Guard standby | Everything: OS, DB, HA, backup, patching. |
| PostgreSQL app backend | Aurora PostgreSQL (or RDS PostgreSQL) | Performance + read scaling; RDS if you want vanilla PG. | Multi-AZ / Aurora replicas | Aurora Global Database / cross-Region replica | Schema, tuning, queries. |
| MySQL web app | Aurora MySQL (or RDS MySQL) | Scale and failover; RDS for simplicity/cost. | Multi-AZ / Aurora replicas | Global Database / replica | Schema, tuning. |
| SQL Server app | RDS for SQL Server | Managed SQL Server. | Multi-AZ | Cross-Region snapshots/replica | Schema, tuning, licensing. |
| High-scale key-value | DynamoDB | Serverless, any scale, low latency. | Built-in multi-AZ | Global Tables (multi-Region active-active) | Data model / access patterns. |
| Enterprise data warehouse | Redshift | Columnar MPP for analytics. | Multi-AZ (RA3/Serverless) | Snapshots to another Region | Modeling, dist/sort keys, queries. |
| Reporting / ad hoc over lake | Athena (+ Redshift Spectrum) | Serverless SQL over S3. | Managed | Data in S3 (durable/replicable) | Partitioning, formats, queries. |
| Cache layer | ElastiCache | Reduce DB load, sub-ms reads. | Multi-AZ replication | Rebuildable / snapshots | Cache strategy, eviction. |
| Graph workload | Neptune | Native graph queries. | Multi-AZ | Snapshots / cross-Region | Graph model. |
| Document workload | DocumentDB (or DynamoDB) | MongoDB API, managed. | Multi-AZ | Snapshots / global cluster | Data model. |
| Time-series | Timestream | Purpose-built, tiered retention. | Managed | Managed | Schema, retention tiers. |
HA, DR, scaling, backup, patching across services
| Dimension | RDS | Aurora | DynamoDB | Redshift |
|---|---|---|---|---|
| HA | Multi-AZ (synchronous standby, automatic failover) | Storage replicated across 3 AZs; replicas promote fast | Automatic across AZs | Multi-AZ (RA3/Serverless), cluster relocation |
| Read scale | Up to 15 read replicas (async) | Up to 15 low-lag replicas sharing storage | Horizontal by design; DAX for caching | Concurrency scaling |
| DR | Cross-Region read replica, snapshot copy | Aurora Global Database (sub-second lag, fast promote) | Global Tables (active-active) | Cross-Region snapshots |
| Backup | Automated backups + manual snapshots; PITR | Continuous to S3; PITR; backtrack (MySQL) | PITR + on-demand backups | Automated + manual snapshots |
| Patching | You pick maintenance windows; AWS patches OS+engine | Managed, minimal downtime; Blue/Green deployments | Fully managed (serverless) | Managed maintenance windows |
| Scale compute | Change instance class (brief downtime; with Multi-AZ it is typically limited to a failover) | Change class or Serverless v2 (ACUs autoscale) | On-demand or provisioned capacity | Resize / Serverless RPUs |
Operational features you should use
- Performance Insights: database load by wait event and top SQL; the first place to look for RDS/Aurora performance issues.
- Enhanced Monitoring: OS-level metrics at high resolution.
- RDS Proxy: connection pooling in front of RDS/Aurora; smooths connection storms (important for Lambda and many-client apps), preserves connections across failover.
- Blue/Green deployments: create a synchronized green copy, test, then switch with minimal downtime (great for major-version upgrades).
- Parameter groups / option groups: engine configuration and features; the managed equivalent of tuning init.ora and installing options, within AWS-permitted bounds.
- Secrets Manager + KMS: managed credential rotation and encryption at rest.
Enterprise examples
| Scenario | Design |
|---|---|
| Oracle DB on RDS | RDS for Oracle, Multi-AZ, automated backups + cross-Region snapshot copy, Performance Insights, License Included or BYOL, params via parameter/option groups. Accept RDS feature limits. |
| Oracle DB self-managed on EC2 | Memory-optimized EC2, io2 Block Express, Data Guard standby in another AZ/Region, RMAN to S3, you own patching/HA. Choose this for RAC or full control. |
| PostgreSQL app database | Aurora PostgreSQL, writer + 2 readers across AZs, RDS Proxy, Global Database for DR. |
| MySQL web app | Aurora MySQL or RDS MySQL Multi-AZ, read replicas for reporting, ElastiCache in front for hot reads. |
| High-scale key-value | DynamoDB on-demand, single-table design, DAX cache, Global Tables if multi-Region. |
| Data warehouse | Redshift RA3/Serverless, Spectrum over the S3 lake, snapshots cross-Region, QuickSight for BI. |
| Reporting database | Read replica of the OLTP DB, or Athena/Redshift over exported data, to isolate reporting load. |
| Cache layer | ElastiCache (Redis/Valkey) cluster-mode, Multi-AZ, with a defined cache-aside strategy. |
| Graph / Document / Time-series | Neptune / DocumentDB / Timestream respectively, Multi-AZ, snapshots. |
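The cache-layer row calls for a defined cache-aside strategy. A minimal in-process sketch of the pattern: read the cache, fall back to the database on a miss, populate with a TTL, and invalidate on write. A plain dict stands in for ElastiCache here, and the class and parameter names are illustrative.

```python
import time


class CacheAside:
    """Cache-aside reads with TTL expiry; a dict stands in for the real cache."""

    def __init__(self, load_from_db, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]              # hit: the database is not touched
        value = self._load(key)          # miss or expired: read the source of truth
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key) -> None:
        """Call after writing to the database so the next read reloads."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be when an invalidation is missed. Choose it per data type instead of using one global value.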
AWS database gotchas for Oracle DBAs
This is the section most likely to catch an experienced Oracle DBA off guard. RDS is genuinely managed, which means AWS takes away control you are used to having.
- No SYSDBA / no OS access. You get a master user with elevated (but not full DBA) privileges. Many operations are done through Amazon rdsadmin packages, not native commands. No SSH to the host, no direct filesystem access.
- No RAC. RDS for Oracle is single-instance with Multi-AZ standby (not RAC). If you need RAC, go to EC2 or Oracle Database@AWS where available.
- Option/parameter group restrictions. Only AWS-supported options/parameters can be changed; some are fixed. Features like certain audit/replication options are enabled through option groups, not manually.
- Storage scaling behaviour. Storage auto-scaling and modifications happen online but within RDS rules; you do not manage ASM/diskgroups. There are limits and cooldowns on scaling.
- Licensing choices. License Included (AWS provides the license, SE2 only) vs BYOL (your license, EE possible). This is a real architectural decision with cost and edition implications.
- Backup/restore model. RMAN is largely replaced by RDS automated backups, snapshots, and PITR. You cannot run arbitrary RMAN; some export/import via Data Pump to S3 integration.
- Multi-AZ behaviour. The standby is not open/readable (unlike an Active Data Guard standby you might run yourself). Failover is automatic but you do not control it like a manual Data Guard switchover.
- Performance troubleshooting differs. Use Performance Insights and CloudWatch instead of full access to AWR/ASH the way you would on-prem (AWR is available with the right license/option, but access patterns differ).
- When EC2 is required instead: RAC, Data Guard with full control, unsupported options/features, specific patch timing, or third-party tools needing OS access.
Migration
- DMS (Database Migration Service): homogeneous and heterogeneous migration with CDC for near-zero-downtime cutover.
- Schema Conversion Tool (SCT): converts schema/code for heterogeneous engine changes (for example Oracle to PostgreSQL/Aurora).
- For Oracle-to-Oracle lift-and-shift, consider native tooling (Data Pump, RMAN, Data Guard) plus DMS for the CDC phase. Always run data validation. See the Migration & DR tab.
7. Load Balancing and Traffic Management
Distributing traffic to healthy targets, terminating TLS, routing by content, and steering users globally. Pick the right load balancer for the protocol and the requirement; most "it works locally but not through the LB" issues are health checks, security groups, or listener rules.
ALB for HTTP/HTTPS with host/path routing. NLB for TCP/UDP, extreme performance, static IPs, or PrivateLink. GWLB to insert inline security appliances. Use ACM for free managed TLS certs. Steer globally with Route 53 (DNS-based) or Global Accelerator (anycast over the backbone), and cache web content with CloudFront. Health checks must hit a real application path.
Load balancer types
| Type | Layer | Use | Key features |
|---|---|---|---|
| Application LB (ALB) | L7 (HTTP/HTTPS) | Web apps, microservices, container services. | Host/path routing, listener rules, WebSocket, redirects, auth (OIDC/Cognito), targets = instances/IPs/Lambda/containers. |
| Network LB (NLB) | L4 (TCP/UDP/TLS) | Extreme throughput/low latency, non-HTTP, static IP, source-IP preservation, PrivateLink. | Millions of req/s, ultra-low latency, static/Elastic IP per AZ, TLS passthrough or termination. |
| Gateway LB (GWLB) | L3/4 (GENEVE) | Insert third-party virtual appliances (firewalls, IDS/IPS) transparently inline. | Fleet of appliances behind one entry point; used for centralized inspection. |
| Classic LB (CLB) Legacy | L4/L7 | Do not use for new work. | Superseded by ALB/NLB; migrate off it. |
Core concepts
| Concept | Meaning |
|---|---|
| Target group | The set of backends (instances/IPs/Lambda/containers) an LB routes to, with its own health check and protocol/port. |
| Listener | A port/protocol the LB accepts on (for example HTTPS:443). |
| Listener rule (ALB) | Conditions (host, path, header, method) that route to a target group. Evaluated by priority. |
| Health check | Periodic probe; only healthy targets receive traffic. Must succeed for an instance to serve. |
| TLS/ACM | ACM provisions and auto-renews free public certs for ALB/NLB/CloudFront/API GW. Terminate TLS at the LB (or passthrough on NLB). |
| Sticky sessions | Bind a client to one target (cookie-based on ALB). Use sparingly; prefer stateless apps with external session stores. |
| Cross-zone LB | Distributes evenly across targets in all AZs. On by default for ALB; optional (and may incur cross-AZ charges) for NLB. |
| Internal vs internet-facing | Internal LBs have private IPs (internal tiers/services); internet-facing have public IPs in public subnets. |
Global traffic management
| Service | Use | Not for |
|---|---|---|
| Route 53 | DNS-based routing: latency, weighted, failover, geolocation, geoproximity, multivalue. DR failover across Regions. | Instant failover (bound by DNS TTL/caching). |
| Global Accelerator | Anycast static IPs routing over the AWS backbone to the nearest healthy Regional endpoint; fast failover; TCP/UDP. | Content caching (no cache). |
| CloudFront | CDN caching web content/APIs at the edge; TLS, WAF, OAC to private origins; Functions/Lambda@Edge. | Non-HTTP TCP/UDP transport. |
Choosing quickly
- HTTP(S) app with routing rules, containers, or Lambda targets -> ALB.
- Millions of connections, gaming/IoT/TCP, static IP, source IP preservation, or exposing a service via PrivateLink -> NLB.
- Inline third-party firewall/IDS for all traffic -> GWLB.
- Global static web assets -> CloudFront. Global low-latency TCP/UDP -> Global Accelerator. Cross-Region DR -> Route 53 failover.
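The load balancer part of the rules above can be written down as a small decision function, which is handy as a review checklist in design docs. This is a sketch of the guidance in this section, not an official selection algorithm; the function and parameter names are made up.

```python
def choose_load_balancer(protocol: str, *, needs_static_ip: bool = False,
                         needs_privatelink: bool = False,
                         preserve_source_ip: bool = False,
                         inline_appliance: bool = False) -> str:
    """Map the requirements from the list above to ALB, NLB, or GWLB."""
    proto = protocol.upper()
    if inline_appliance:
        return "GWLB"   # transparent insertion of third-party appliances
    if (proto in ("TCP", "UDP", "TLS") or needs_static_ip
            or needs_privatelink or preserve_source_ip):
        return "NLB"
    if proto in ("HTTP", "HTTPS"):
        return "ALB"
    raise ValueError(f"no recommendation for protocol {protocol!r}")


print(choose_load_balancer("https"))                          # ALB
print(choose_load_balancer("https", needs_privatelink=True))  # NLB
print(choose_load_balancer("tcp", inline_appliance=True))     # GWLB
```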
Troubleshooting: target unhealthy and related
Target shows unhealthy elb
- Security groups: the target's SG must allow the health-check port from the load balancer's SG (reference the LB SG, not an IP). This is the number-one cause.
- Health check path/port: the path must return a status code the target group accepts (HTTP 200 by default; a 301/302 redirect fails the check unless you add it to the success codes). Confirm the app listens on the configured target port.
- App actually up: curl the health path locally on the instance/container.
- Protocol/port match: target-group protocol (HTTP vs HTTPS) and port match what the app serves.
- NACL: subnet NACL allows the health-check traffic and its return (ephemeral ports).
- Subnet/AZ: the LB is enabled in the AZ where the target lives; the target is in a reachable subnet.
- Timeouts/thresholds: a slow app may need a longer health-check timeout or a lighter check path.
TLS / SSL certificate issue tls
Wrong listener rule routing alb
Security group / NACL blocking traffic firewall
Route table issue routing
Application not responding on backend port app
Confirm the app is listening on the backend port (ss -ltnp) and hit the health path locally.
Wrong target group protocol / health check path config
Prefer a dedicated /healthz endpoint that returns 200 without calling downstream dependencies.
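A dependency-free health endpoint can be very small. The sketch below uses only the standard library to show the shape. In a real service you would add the same route to your existing web framework instead of running a second server, and the default port is an arbitrary choice.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)   # plain 200, no redirect, no auth
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of stderr


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), HealthHandler)
```

Start it with `make_server().serve_forever()` and point the target group health check at `/healthz` on the same port. Because the handler touches no database or downstream API, a dependency outage does not mark every target unhealthy at once.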
8. Security Deep Dive
Security in AWS is layered: the shared responsibility split, identity, network, data protection, detection, and governance. Most incidents trace to a small set of mistakes (public S3, over-broad IAM, open security groups, no detection). Close those first, then build depth.
AWS secures the cloud; you secure what you put in it. Enforce identity least privilege (IAM tab), keep data private by default (Block Public Access, encryption everywhere via KMS), reduce internet exposure (private subnets, endpoints, WAF/Shield on what must be public), and turn on detection org-wide (CloudTrail, Config, GuardDuty, Security Hub) with logs centralized in a locked log-archive account.
The shared responsibility model
| AWS is responsible for (security OF the cloud) | You are responsible for (security IN the cloud) |
|---|---|
| Physical data centers, hardware, the hypervisor/Nitro, the global network, and managed-service control planes. | IAM and credentials, network configuration (SG/NACL/routing), OS patching on EC2, data classification and encryption choices, application security, and configuration of every service. |
The security service map
| Category | Services | Purpose |
|---|---|---|
| Identity | IAM, Identity Center, Cognito, Organizations/SCPs | Who can do what. |
| Network | Security Groups, NACLs, Network Firewall, WAF, Shield, Firewall Manager | What can reach what. |
| Data protection | KMS, CloudHSM, Secrets Manager, SSM Parameter Store, ACM, S3/EBS/RDS encryption | Keys, secrets, encryption, certs. |
| Detection | CloudTrail, Config, GuardDuty, Security Hub, Inspector, Detective, Macie | Audit, posture, threats, vulnerabilities, sensitive-data discovery. |
| Governance/compliance | Control Tower, Audit Manager, Artifact, Config rules, Access Analyzer | Guardrails, evidence, compliance reports. |
Data protection
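- KMS: customer-managed keys with key policies + IAM; integrated with S3, EBS, RDS, secrets, and more. Turn on default encryption wherever the service offers it (EBS encryption by default is a per-Region setting, so enable it in every Region you use). Deleting a key makes everything encrypted under it permanently unrecoverable (KMS enforces a 7-30 day waiting period; disabling is reversible), so manage key lifecycle deliberately; use multi-Region keys only when needed.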
- CloudHSM: dedicated single-tenant HSM for strict compliance or custom cryptographic needs.
- Secrets Manager vs SSM Parameter Store: Secrets Manager for secrets needing rotation (DB creds); Parameter Store SecureString for simpler/cheaper config and secrets without rotation.
- ACM: managed TLS certificates, auto-renewed, for LBs/CloudFront/API GW.
- Encrypt in transit (TLS) and at rest (KMS) everywhere; make it the default, not a per-resource decision.
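One common way to make TLS the default for S3 is a bucket policy that denies any request not sent over HTTPS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document; the function name and the bucket name `example-bucket` are placeholders for illustration.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that rejects every non-TLS request."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-bucket" is a placeholder name.
policy_json = json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2)
```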
Detection and response
| Service | What it catches / does |
|---|---|
| CloudTrail | Every API call (who/what/when/where). The audit backbone; enable an org trail to a locked central bucket. |
| AWS Config | Resource configuration history + compliance rules + drift + auto-remediation. |
| GuardDuty | Threat detection from logs (compromised keys, crypto-mining, recon, malware, anomalous behaviour). |
| Security Hub | Aggregates and scores findings against standards (CIS, AWS FSBP); single pane across accounts. |
| Inspector | Continuous vulnerability scanning of EC2, ECR images, and Lambda. |
| Detective | Investigate/visualize the scope of a finding (entity behaviour over time). |
| Macie | Discovers sensitive data (PII) in S3. |
How to secure specific things
Secure a production AWS account how-to
Secure a multi-account environment how-to
Secure S3 how-to
aws:PrincipalOrgID/aws:SourceVpce conditions; default encryption (SSE-KMS); versioning + lifecycle; access logging; Object Lock for WORM; Macie for PII discovery. Serve public content via CloudFront + OAC, never a public bucket.
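The `aws:PrincipalOrgID` condition mentioned above can be sketched as a deny statement for principals outside your organization. The function name and the organization ID are placeholders. The `aws:PrincipalIsAWSService` clause is an addition made here so that AWS service principals (for example log delivery) are not caught by the deny; verify it against the services that write to your bucket.

```python
def deny_outside_org_statement(bucket_name, org_id):
    """Deny S3 access to any principal that is not in the given organization."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Sid": "DenyPrincipalsOutsideOrg",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [bucket_arn, f"{bucket_arn}/*"],
        "Condition": {
            # Both operators must match for the deny to apply.
            "StringNotEquals": {"aws:PrincipalOrgID": org_id},
            # Assumption: exempt calls made by AWS services themselves.
            "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
        },
    }


# "o-exampleorgid" is a placeholder organization ID.
statement = deny_outside_org_statement("example-bucket", "o-exampleorgid")
```

An `aws:SourceVpce` restriction follows the same shape, with a `StringNotEquals` test on the endpoint ID.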
Secure EC2 how-to
Secure RDS how-to
Secure a public load balancer how-to
Reduce public internet exposure how-to
Centralize security logs how-to
Production security checklist
- Root: hardware MFA, no access keys, used only when required.
- Humans via IAM Identity Center with MFA; no per-account IAM users.
- SCP guardrails: deny org-leave, deny disabling CloudTrail/GuardDuty/Config, restrict Regions.
- Org CloudTrail to a locked central bucket in a dedicated Log Archive account.
- AWS Config + conformance packs; GuardDuty + Security Hub org-wide.
- Block Public Access on; ACLs disabled; default encryption on S3/EBS/RDS.
- Secrets in Secrets Manager/Parameter Store; none in code or images.
- Security groups least-open; databases private; IMDSv2 enforced.
- WAF + Shield on internet-facing endpoints.
- Access Analyzer for external sharing; credential report reviewed; unused access pruned.
- Backups (AWS Backup) with cross-account/Region copies and periodic restore tests.
- Billing/budget alarms and cost anomaly detection.
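The SCP guardrails in the checklist can be sketched as a single policy document. This is an illustrative baseline, not a complete one: the set of exempted global services in `NotAction` and the example Regions are assumptions you would tune for your organization.

```python
import json


def baseline_scp(allowed_regions):
    """Build an SCP that blocks org-leave, protects detection, and restricts Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeaveOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "ProtectDetectionServices",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "guardduty:DeleteDetector",
                    "config:StopConfigurationRecorder",
                    "config:DeleteConfigurationRecorder",
                ],
                "Resource": "*",
            },
            {
                "Sid": "RestrictRegions",
                "Effect": "Deny",
                # Assumption: global services are exempted so they keep working.
                "NotAction": ["iam:*", "organizations:*", "route53:*",
                              "cloudfront:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": sorted(allowed_regions)}
                },
            },
        ],
    }


# Example Regions are placeholders.
scp_json = json.dumps(baseline_scp({"eu-west-1", "us-east-1"}), indent=2)
```

Test any SCP on a sandbox OU first; a Region restriction that omits a global service you depend on fails in ways that look like unrelated permission errors.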
Common security mistakes
- Weakly protected root account.
- Public S3 buckets / disabled Block Public Access.
- Over-permissive IAM (AdministratorAccess, wildcards).
- Open security groups (0.0.0.0/0 on 22/3389/database ports).
- CloudTrail not enabled everywhere / not centralized.
- No GuardDuty, no Config.
- Secrets in code, images, or environment.
- No credential rotation; long-lived keys everywhere.
- Routing private AWS traffic over the internet instead of VPC endpoints.
- Logs stored in the same account (and thus tamper-reachable) as the workloads.
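The open-security-group mistake is easy to detect mechanically. The sketch below scans a security group dictionary in the shape returned by the EC2 `DescribeSecurityGroups` API; the port list and function name are assumptions for illustration, and it only inspects inbound rules.

```python
# Assumption: ports treated as sensitive (SSH, RDP, and common database ports).
SENSITIVE_PORTS = {22, 3389, 1433, 1521, 3306, 5432}


def open_sensitive_rules(security_group):
    """Return (group_id, port) pairs for sensitive ports open to the whole internet."""
    findings = []
    for perm in security_group.get("IpPermissions", []):
        world_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        world_v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
        if not (world_v4 or world_v6):
            continue
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            exposed = set(SENSITIVE_PORTS)  # all protocols, all ports
        elif protocol in ("tcp", "udp"):
            low, high = perm.get("FromPort"), perm.get("ToPort")
            if low is None or high is None:
                continue
            exposed = {p for p in SENSITIVE_PORTS if low <= p <= high}
        else:
            continue  # ICMP and other protocols carry no port range
        findings.extend((security_group["GroupId"], p) for p in sorted(exposed))
    return findings
```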
9. Observability, Monitoring, and Operations
You cannot operate what you cannot see. CloudWatch is the hub for metrics, logs, and alarms; CloudTrail and Config cover audit and configuration; Systems Manager covers fleet operations. The goal is useful signal, not noise.
Metrics + alarms + logs live in CloudWatch (install the agent for memory/disk, which are not default). Audit is CloudTrail; config/compliance is Config; events/automation route through EventBridge; fleet ops through Systems Manager. Set log retention (default is forever), alarm on symptoms not every metric, and centralize logs across accounts.
The observability stack
| Service | Role |
|---|---|
| CloudWatch Metrics | Time-series metrics from AWS services and custom sources; alarms and math on them. |
| CloudWatch Logs | Central log ingestion, search (Logs Insights), retention, metric filters, subscriptions. |
| CloudWatch Alarms | Threshold/anomaly alarms -> SNS/EventBridge/Auto Scaling/OpsItems. |
| CloudWatch Dashboards | Operational views combining metrics, logs, and alarms. |
| CloudWatch Agent | Collects OS metrics (memory, disk) and logs from EC2/on-prem. Not installed by default. |
| EventBridge | Event bus + Scheduler + Pipes; routes state changes to targets for automation. |
| CloudTrail | API audit trail. |
| AWS Config | Resource config history and compliance. |
| Systems Manager | Inventory, Patch Manager, Session Manager, Automation, OpsCenter. |
| Health Dashboard | AWS service health + your account's Personal Health Dashboard (events affecting your resources). |
| X-Ray / OpenTelemetry | Distributed tracing across microservices. |
| Managed Prometheus / Managed Grafana | Managed metrics store and dashboards for container/Prometheus ecosystems. |
| DevOps Guru | ML-based anomaly detection and operational insights. |
What to monitor, by resource
| Resource | Watch |
|---|---|
| EC2 | CPUUtilization, StatusCheckFailed (system/instance), CPUCreditBalance (T family), plus agent-based mem/disk. |
| EBS | VolumeReadOps/WriteOps, VolumeQueueLength, BurstBalance (gp2/st1/sc1), throughput vs provisioned. |
| S3 | 4xx/5xx errors, request counts, bucket size/object count, replication latency, unusual access (via CloudTrail data events). |
| RDS/Aurora | CPUUtilization, FreeableMemory, FreeStorageSpace, DatabaseConnections, Read/WriteLatency, ReplicaLag, and Performance Insights DB load. |
| Load balancers | UnHealthyHostCount, HTTPCode_Target_5XX_Count, TargetResponseTime, RejectedConnectionCount, ActiveConnectionCount. |
| VPC | NAT Gateway ErrorPortAllocation/BytesOutToDestination, VPN TunnelState, Flow Logs REJECTs, endpoint metrics. |
| Lambda | Errors, Throttles, Duration, ConcurrentExecutions, IteratorAge (streams). |
| Security | Root login, IAM changes, GuardDuty findings, Config non-compliance, failed console logins. |
Designing good alarms
- Alarm on symptoms users feel (error rate, latency, unhealthy hosts, saturation) before alarming on causes.
- Use appropriate periods, evaluation periods, and M-of-N datapoints to avoid flapping.
- Use anomaly detection bands for metrics without a fixed threshold.
- Route alarms to actionable destinations; page only for what a human must act on now. Everything else is a dashboard or ticket.
- Alarm on missing data where absence is bad (a stopped heartbeat).
- Composite alarms to reduce noise (alert once when several related alarms fire).
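The M-of-N idea can be illustrated with a few lines of Python. This is a simplified model of the evaluation, not CloudWatch's actual algorithm (which also has configurable missing-data handling); the function name and state strings are chosen for the example.

```python
def alarm_state(datapoints, threshold, m, n):
    """Evaluate the last n datapoints; alarm when at least m exceed the threshold."""
    window = datapoints[-n:]
    if len(window) < n:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= m else "OK"


# A single spike does not page anyone with 3-of-5 evaluation.
spiky = [50, 95, 60, 70, 65]
sustained = [50, 95, 60, 92, 91]
states = (alarm_state(spiky, 80, 3, 5), alarm_state(sustained, 80, 3, 5))
```

Raising N smooths out flapping at the cost of slower detection; choose M and N per metric rather than using one setting everywhere.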
Example alarms
| Alarm | Metric / condition | Action |
|---|---|---|
| EC2 CPU high | CPUUtilization > 80% for 5 min | Scale out / investigate |
| EC2 status check failed | StatusCheckFailed >= 1 | Auto-recover / replace |
| Memory pressure | mem_used_percent > 90% (agent) | Investigate / right-size |
| Disk usage | disk_used_percent > 85% (agent) | Clean / grow volume |
| EBS burst balance low | BurstBalance < 20% | Move to gp3/provisioned |
| EBS queue length | VolumeQueueLength sustained high | Provision IOPS / larger instance |
| ALB unhealthy targets | UnHealthyHostCount > 0 | Investigate targets |
| ALB 5xx | HTTPCode_Target_5XX_Count rising | App/backend investigation |
| RDS CPU | CPUUtilization > 80% | Tune queries / scale |
| RDS storage low | FreeStorageSpace < threshold | Grow storage / enable autoscaling |
| RDS connections | DatabaseConnections near max | RDS Proxy / pool / investigate |
| Failed backups | AWS Backup job failed event | Page backup owner |
| NAT Gateway errors | ErrorPortAllocation > 0 | Investigate port exhaustion |
| VPN tunnel down | TunnelState = 0 | Failover / investigate |
| Direct Connect issue | ConnectionState / BGP down | Failover to VPN |
| S3 unusual access | CloudTrail data-event anomaly | Security review |
| Lambda errors/throttles | Errors or Throttles > 0 sustained | Investigate / raise concurrency |
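As one worked row from the table, the ALB unhealthy-targets alarm can be expressed as the parameter set for the CloudWatch `PutMetricAlarm` API. The sketch only builds the dictionary; the alarm name, dimension values, SNS topic ARN, and the 2-of-3 evaluation are placeholder assumptions.

```python
def alb_unhealthy_targets_alarm(name, load_balancer, target_group, sns_topic_arn):
    """Build PutMetricAlarm parameters for UnHealthyHostCount > 0."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Maximum",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 3,       # N
        "DatapointsToAlarm": 2,       # M (assumed 2-of-3 to avoid flapping)
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# All identifiers below are placeholders.
params = alb_unhealthy_targets_alarm(
    "alb-unhealthy-targets",
    "app/example-lb/0123456789abcdef",
    "targetgroup/example-tg/0123456789abcdef",
    "arn:aws:sns:us-east-1:111122223333:ops-alerts",
)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```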
Centralizing observability across accounts
- Cross-account CloudWatch observability: link source accounts to a central monitoring account for a unified metrics/logs/traces view.
- Ship logs to a central account (subscription filters -> Kinesis Firehose -> central S3, or Security Lake for security data).
- Aggregate GuardDuty/Security Hub/Config to delegated admin accounts.
- Standardize dashboards and alarms via IaC so every account gets the same baseline.
10. Containers, Kubernetes, and Cloud Native
Running containers, functions, and event-driven systems on AWS. The main choice is ECS vs EKS vs Lambda, and then EC2 vs Fargate for the compute underneath. Pick by operational appetite and portability needs, not by hype.
ECS for AWS-native containers with low operational overhead. EKS when you need standard Kubernetes (skills, tooling, portability, multi-cloud). Fargate to avoid managing nodes; EC2 nodes for cost at high steady utilization or special hardware. Lambda for event-driven/short work. Give containers permissions with task roles (ECS) or IRSA/Pod Identity (EKS), never node/instance roles.
The choices
| Option | Use when | Trade-off |
|---|---|---|
| ECS | You want containers with minimal ops and deep AWS integration; no Kubernetes needed. | AWS-specific; less portable than k8s. |
| EKS | You need Kubernetes: existing k8s skills/manifests, Helm, operators, portability. | More operational overhead (upgrades, add-ons, CNI IP management, control-plane cost). |
| Fargate | You do not want to manage nodes; bursty or low-ops workloads. | Higher per-vCPU cost; some features need EC2 launch type. |
| EC2 launch type / node groups | High steady utilization, GPUs, custom kernels, or per-vCPU cost matters. | You patch/scale/secure the nodes. |
| Lambda | Event-driven, short (up to 15 min per invocation), spiky, scale-to-zero. | Time/memory limits, cold starts, not for long-running services. |
| App Runner | Simplest path: container/source to a managed HTTPS service with autoscaling. | Less control; good for straightforward web apps/APIs. |
How networking works with containers
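- ECS awsvpc mode gives each task its own ENI; the EKS VPC CNI gives each pod a VPC IP drawn from the node's ENIs (a dedicated branch ENI only when security groups for pods is enabled). Either way you get real VPC networking, with SGs per task and optionally per pod, but subnet IPs are consumed fast; plan CIDRs accordingly (or use prefix delegation / custom networking on EKS).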
- Load balancers: ALB in front of ECS services / the AWS Load Balancer Controller for EKS Ingress; NLB for TCP/UDP or extreme performance.
- Service-to-service discovery: ECS Service Connect / Cloud Map; Kubernetes Services + CoreDNS on EKS.
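Because every task or pod takes a VPC IP, subnet sizing is simple arithmetic worth doing up front. The sketch below uses the fact that AWS reserves five addresses in every subnet; the function names and the headroom parameter are assumptions for illustration.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast


def usable_ips(cidr):
    """Addresses actually assignable in one subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_tasks(subnet_cidrs, ips_for_other_resources=0):
    """Upper bound on one-IP-per-task workloads across the given subnets."""
    total = sum(usable_ips(cidr) for cidr in subnet_cidrs)
    return max(total - ips_for_other_resources, 0)


# Three /24 app subnets, with 60 IPs held back for nodes, endpoints, and LBs.
capacity = max_tasks(["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"], 60)
```

Three /24 subnets give well under 800 task or pod IPs, which a rolling deployment that doubles task count can exhaust quickly.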
How IAM works with containers
- ECS task role: permissions for the app in the container; the separate execution role lets ECS pull images and write logs. Do not rely on the EC2 instance role for app permissions.
- EKS IRSA / EKS Pod Identity: map a Kubernetes service account to an IAM role so pods get scoped, short-lived credentials, not the node role.
- Least-privilege per task/pod is the goal; the node/instance role should be minimal.
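The IRSA mapping comes down to an IAM role trust policy scoped to one Kubernetes service account. The sketch below builds that document; the account ID, OIDC provider path, namespace, and service account name are all placeholders.

```python
def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy letting exactly one service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to a single namespace + service account.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder cluster OIDC provider and workload identity.
trust = irsa_trust_policy(
    "111122223333",
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789",
    "payments",
    "api",
)
```

Omitting the `sub` condition would let any service account in the cluster assume the role, which defeats the per-pod scoping.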
CI/CD and registry
- ECR for private images; enable scan-on-push and lifecycle policies to purge old tags.
- CodeBuild/CodeDeploy/CodePipeline or GitHub Actions/GitLab for build-test-deploy; deploy with rolling or blue/green (CodeDeploy supports ECS blue/green).
- CodeArtifact for private package repositories.
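An ECR lifecycle policy that purges old images can be sketched as the JSON document ECR expects. The retention numbers and function name are illustrative assumptions, not recommendations.

```python
import json


def ecr_lifecycle_policy(keep_images=30, untagged_days=7):
    """Expire untagged images quickly, then cap the total image count."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images after {untagged_days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # A rule with tagStatus "any" must carry the highest priority number.
                "rulePriority": 2,
                "description": f"Keep only the newest {keep_images} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_images,
                },
                "action": {"type": "expire"},
            },
        ]
    }


lifecycle_json = json.dumps(ecr_lifecycle_policy(), indent=2)
```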
Serverless and event-driven building blocks
| Service | Role |
|---|---|
| Lambda | Functions triggered by events/HTTP. |
| API Gateway | Managed API front door (REST/HTTP/WebSocket). |
| EventBridge | Event bus, scheduling, and pipes for decoupling. |
| Step Functions | Orchestrate multi-step workflows with retries/error handling. |
| SQS / SNS | Queues (decouple/buffer) and pub/sub (fan-out). |
| Kinesis / MSK | Streaming ingestion and processing. |
| Cloud Map | Service discovery. |
Architecture examples
Microservices on ECS (Fargate) pattern
Microservices on EKS pattern
Serverless function on S3 event pattern
Event-driven with EventBridge + Lambda + SNS + SQS pattern
Container deployment pipeline pattern
Private container platform pattern
Operating containers
- Monitor task/pod health, restarts, CPU/memory limits vs usage, and pending/unschedulable tasks (capacity).
- Right-size task/pod requests and limits; over-requesting wastes capacity, under-requesting causes throttling/OOM.
- Use Container Insights (CloudWatch) or Prometheus/Grafana for cluster and workload metrics.
- On EKS, keep the control plane and node AMIs/add-ons on supported versions; plan quarterly upgrades.
11. Analytics, Data, and Integration
Storing, moving, transforming, querying, and streaming data at scale, plus the integration services that connect systems. The lake-on-S3 plus purpose-built engines model is the backbone of most modern AWS data platforms.
Land raw data in S3 (the lake), catalog with Glue, govern with Lake Formation, query serverlessly with Athena, warehouse in Redshift, stream with Kinesis/MSK, transform with Glue/EMR, replicate databases with DMS, and visualize with QuickSight. Use columnar formats (Parquet) and partitioning to control cost.
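Partitioning in practice usually means Hive-style `key=value` prefixes in the object key, so that query engines can skip whole prefixes. A minimal sketch, assuming a hypothetical `curated/` zone and a date-based layout:

```python
from datetime import date


def partitioned_key(dataset, day, filename):
    """Build a Hive-style partitioned S3 key for one day's data."""
    return (
        f"curated/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# The dataset and file names are placeholders.
key = partitioned_key("orders", date(2024, 1, 15), "part-0000.parquet")
```

A query filtered on `year`, `month`, and `day` then reads one prefix instead of the whole dataset, which is where the scan-cost saving comes from.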
The data and analytics services
| Service | Role |
|---|---|
| S3 | The data lake substrate: durable, cheap, decoupled storage. |
| Glue | Serverless ETL, crawlers, and the Data Catalog (metadata shared by Athena/Redshift/EMR). |
| Lake Formation | Central, fine-grained (row/column/tag) permissions and cross-account sharing over the lake. |
| Athena | Serverless SQL (Trino/Presto) over S3; pay per TB scanned. |
| Redshift | Columnar MPP data warehouse; Serverless and Spectrum (query S3). |
| EMR | Managed Spark/Hive/Presto/HBase (on EC2/EKS/Serverless) for large-scale processing. |
| Kinesis | Data Streams (ingest), Firehose (deliver to S3/Redshift/OpenSearch), Managed Service for Apache Flink (analytics). |
| MSK | Managed Apache Kafka for the Kafka ecosystem. |
| OpenSearch Service | Search, log analytics, and vector search. |
| QuickSight | Serverless BI dashboards (SPICE engine, Q natural language). |
| DMS | Database migration and CDC replication into the lake/warehouse. |
| AppFlow | No-code SaaS data flows (Salesforce, ServiceNow, etc.) to AWS. |
| MWAA | Managed Apache Airflow for orchestrating pipelines. |
| DataZone | Data catalog/governance and a business data marketplace across the org. |
Integration services
| Service | Use |
|---|---|
| EventBridge | Event routing, scheduling, SaaS integration. |
| API Gateway | Publish and secure APIs. |
| Step Functions | Workflow orchestration. |
| SQS / SNS | Queues and pub/sub. |
| Amazon MQ | Managed ActiveMQ/RabbitMQ for lift-and-shift of existing JMS/AMQP apps. |
Common data patterns
Data lake on S3 pattern
Data warehouse on Redshift pattern
Serverless query with Athena pattern
ETL with Glue pattern
Streaming ingestion with Kinesis pattern
Kafka pattern with MSK pattern
CDC with DMS pattern
Cross-account data sharing pattern
AI-ready data architecture pattern
12. AI, ML, and Generative AI on AWS
From custom ML on SageMaker to generative AI on Bedrock and applied-AI APIs. The enterprise question is rarely "which model" and usually "how do we connect models to our data safely, cheaply, and with an audit trail." This section covers both the services and the governance.
Bedrock for generative AI (managed foundation models, Agents, Knowledge Bases, Guardrails) without running model infrastructure. SageMaker for custom model build/train/deploy and MLOps. Applied-AI APIs (Textract, Comprehend, Rekognition, etc.) for common tasks. For RAG, put embeddings in a vector store (OpenSearch or Aurora/RDS pgvector). Never wire an LLM agent directly to a production OLTP database; put a governed serving layer in front.
Generative AI: Amazon Bedrock
- Foundation models from multiple providers (Anthropic Claude, Amazon Nova/Titan, Meta, Mistral, and others) behind one API. Availability varies by Region; verify.
- Knowledge Bases: managed RAG (chunk, embed, store, retrieve) over your data in S3, backed by a vector store.
- Agents: orchestrate multi-step tasks and tool/API calls with a model.
- Guardrails: content filters, denied topics, PII redaction, and grounding checks applied to prompts/responses.
- Pricing: per input/output token (on-demand) or provisioned throughput; Knowledge Bases add embedding + vector-store cost.
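On-demand token pricing is easy to model before you commit to a design. The prices in the example are placeholders, not real Bedrock rates; look up the current per-model price for your Region.

```python
def on_demand_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate on-demand spend from token counts and per-1,000-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


# Placeholder prices (USD per 1,000 tokens) and a hypothetical monthly volume.
monthly = on_demand_cost(
    input_tokens=2_000_000,
    output_tokens=500_000,
    price_in_per_1k=0.003,
    price_out_per_1k=0.015,
)
```

Output tokens are typically priced higher than input tokens, so capping response length often moves the bill more than trimming prompts.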
Custom ML: Amazon SageMaker
- Studio (IDE), training jobs, hyperparameter tuning, endpoints (real-time/serverless/async/batch), Pipelines (MLOps), Model Registry, Feature Store, and Clarify (bias/explainability).
- Use for training/fine-tuning your own models and productionizing the full ML lifecycle.
Applied-AI APIs
| Service | Task |
|---|---|
| Textract | Extract text/tables/forms from documents. |
| Comprehend | NLP: entities, sentiment, PII detection, classification. |
| Rekognition | Image/video analysis. |
| Transcribe / Polly | Speech-to-text / text-to-speech. |
| Translate | Machine translation. |
| Kendra | Enterprise semantic search (also usable as a RAG retriever). |
| Lex | Conversational bots (chat/voice). |
| Forecast / Personalize | Time-series forecasting / recommendations. |
| Fraud Detector | ML fraud detection. |
Vector stores for RAG
| Option | Use when |
|---|---|
| OpenSearch (vector) | Large-scale vector + keyword hybrid search; dedicated retrieval layer. |
| Aurora / RDS PostgreSQL pgvector | You already run PostgreSQL and want vectors alongside relational data (joinable in SQL). |
| DocumentDB vector | MongoDB-API apps needing vector search. |
| Bedrock Knowledge Bases | You want managed RAG and let AWS handle chunk/embed/store/retrieve. |
Enterprise GenAI patterns
Chat with documents (RAG) pattern
Chat with database / natural language to SQL pattern
AI assistant for operations / business users pattern
Document processing pipeline pattern
Call-center AI pattern
MLOps pipeline pattern
Governance and safety warnings
- Avoid uncontrolled dynamic SQL; use parameterized, allow-listed queries or a serving layer.
- Protect credentials: models/agents get access via least-privilege IAM roles, never embedded keys.
- Add auditability: log prompts, retrievals, tool calls, and responses.
- Use curated datasets, APIs, or read-only reporting layers, not raw production stores.
- Validate AI output before business use; keep a human in the loop for consequential actions.
- Monitor for prompt injection and data leakage; apply Guardrails and content filtering.
- Enforce data-access boundaries so retrieval respects the requesting user's permissions (no cross-tenant leakage).
- Watch cost: token usage and standing endpoints can grow fast; set budgets and quotas.
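The first warning, allow-listed parameterized queries instead of model-written SQL, can be sketched with the standard library. The table, query names, and `run_allowed` helper are hypothetical; SQLite stands in for whatever read-only reporting layer you expose.

```python
import sqlite3

# The model may only choose a query by name and supply parameters.
ALLOWED_QUERIES = {
    "orders_by_customer": "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
    "order_count": "SELECT COUNT(*) FROM orders WHERE customer_id = ?",
}


def run_allowed(conn, name, params):
    """Execute an allow-listed query with bound parameters; reject anything else."""
    sql = ALLOWED_QUERIES.get(name)
    if sql is None:
        raise PermissionError(f"query {name!r} is not allow-listed")
    # Parameters are bound by the driver, never interpolated into the SQL text.
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [("c1", 10.0), ("c1", 5.5), ("c2", 7.0)],
)
rows = run_allowed(conn, "orders_by_customer", ("c1",))
```

An injection attempt passed as a parameter is treated as a literal value and simply matches no rows.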
13. Migration and Disaster Recovery
Getting workloads into AWS, and keeping them recoverable once there. Migration and DR share tooling (replication, backup, failover) and the same core discipline: define your objectives (RTO/RPO) first, then pick the pattern that meets them at acceptable cost.
For migration, use MGN (rehost servers), DMS + SCT (databases), DataSync/Transfer/Snow (data). For DR, pick a pattern by RTO/RPO: Backup & Restore (cheapest, slowest) -> Pilot Light -> Warm Standby -> Active/Active (fastest, priciest). Use AWS Backup for backups, Elastic Disaster Recovery for low-RPO server DR, and Route 53 for failover. Always test DR.
RTO and RPO drive everything
| Term | Meaning | Drives |
|---|---|---|
| RTO (Recovery Time Objective) | How long you can be down. | How much standing infrastructure you keep warm. |
| RPO (Recovery Point Objective) | How much data you can lose (time). | Replication frequency and mechanism. |
Migration services
| Service | Use |
|---|---|
| Migration Hub | Track and coordinate migrations across tools and accounts. |
| Application Migration Service (MGN) | Lift-and-shift rehosting of servers into EC2 via block replication. |
| Database Migration Service (DMS) | Migrate/replicate databases with CDC for minimal downtime. |
| Schema Conversion Tool (SCT) | Convert schema/code for heterogeneous engine changes. |
| DataSync | Fast bulk file/object transfer to S3/EFS/FSx. |
| Transfer Family | Managed SFTP/FTPS/FTP into S3/EFS. |
| Snow Family | Offline petabyte-scale transfer devices. |
The 7 Rs of migration
| Strategy | Meaning | When |
|---|---|---|
| Rehost | Lift-and-shift as-is (MGN). | Speed; refactor later. Most common first move. |
| Replatform | Minor optimization (for example DB -> RDS). | Small changes for big managed-service wins. |
| Repurchase | Move to SaaS. | Commodity apps (email, CRM). |
| Refactor | Re-architect cloud-native. | High-value apps needing scale/agility. |
| Relocate | Move VMware without conversion. | VMware Cloud on AWS estates. |
| Retain | Keep on-prem for now. | Not ready / dependencies / compliance. |
| Retire | Decommission. | Unused apps found during discovery. |
DR patterns (ordered by RTO/RPO and cost)
| Pattern | RTO / RPO | Cost | How |
|---|---|---|---|
| Backup & Restore | Hours / hours | Lowest | Backups (AWS Backup) + IaC to rebuild in the DR Region on demand. |
| Pilot Light | Tens of min / minutes | Low | Core data replicated and minimal services running; scale up on failover. |
| Warm Standby | Minutes / seconds-min | Medium | A scaled-down but functional copy always running; scale up and shift traffic on failover. |
| Active/Active (Multi-site) | Near-zero / near-zero | Highest | Full capacity in multiple Regions serving live; failover is traffic steering. |
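Choosing a pattern from the table is a matter of finding the cheapest row that meets both objectives. The sketch below makes that explicit; the numeric RTO/RPO figures per pattern are illustrative assumptions standing in for the table's qualitative ranges, so replace them with values you have measured in drills.

```python
# (pattern, achievable RTO in minutes, achievable RPO in minutes), cheapest first.
# The numbers are assumptions for illustration only.
DR_PATTERNS = [
    ("Backup & Restore", 24 * 60, 24 * 60),
    ("Pilot Light", 60, 15),
    ("Warm Standby", 15, 1),
    ("Active/Active", 1, 0),
]


def cheapest_pattern(required_rto_min, required_rpo_min):
    """Return the first (cheapest) pattern meeting both objectives, or None."""
    for name, rto, rpo in DR_PATTERNS:
        if rto <= required_rto_min and rpo <= required_rpo_min:
            return name
    return None


choice = cheapest_pattern(required_rto_min=240, required_rpo_min=60)
```

A `None` result means the stated objectives are tighter than any pattern in the list can promise, which is a prompt to revisit the objectives with the business.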
DR building blocks
- AWS Backup: policy-based backups with cross-Region/cross-account copies and Vault Lock (immutability against ransomware).
- Elastic Disaster Recovery (DRS): continuous block replication for low-RPO server recovery; drill without disrupting production.
- Database replication: RDS cross-Region read replicas, Aurora Global Database (sub-second lag), DynamoDB Global Tables (active-active).
- Route 53 failover: health-check-based DNS failover to the DR Region.
- Multi-AZ for in-Region resilience (treat it as the default); multi-Region for Region-level DR.
- IaC: the DR Region must be reproducible from code; manual DR is DR that fails.
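Route 53 failover routing is a pair of record sets with the same name, one `PRIMARY` tied to a health check and one `SECONDARY`. The sketch builds the change batch in the shape the `ChangeResourceRecordSets` API accepts; the hostnames and health check ID are placeholders, and CNAME records are an assumption made for brevity (alias records are the usual choice for load balancers and are required at the zone apex).

```python
def failover_change_batch(name, primary_target, secondary_target, health_check_id, ttl=60):
    """Build a Route 53 change batch for active-passive DNS failover."""

    def record(role, target, health_check=None):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-{target}",
            "Failover": role,
            "TTL": ttl,  # keep low so clients pick up the failover quickly
            "ResourceRecords": [{"Value": target}],
        }
        if health_check is not None:
            rrset["HealthCheckId"] = health_check
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Changes": [
            record("PRIMARY", primary_target, health_check_id),
            record("SECONDARY", secondary_target),
        ]
    }


# Placeholder names and IDs.
batch = failover_change_batch(
    "app.example.com.",
    "primary.us-east-1.example.com",
    "standby.us-west-2.example.com",
    "00000000-0000-0000-0000-000000000000",
)
```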
Architecture examples
On-prem VM migration to AWS migrate
Oracle database migration to AWS migrate
Application migration to AWS migrate
Cross-Region DR for an application dr
Cross-Region DR for a database dr
Backup-based DR dr
Route 53 failover pattern dr
14. Cost Management and Governance
AWS cost is an engineering discipline, not a finance afterthought. The bill is driven by architecture decisions (data transfer, NAT, idle resources, over-provisioning) far more than by unit prices. Governance keeps those decisions consistent across many accounts.
Understand the pricing levers (On-Demand vs Savings Plans vs Spot; data transfer; NAT; storage classes), see spend with Cost Explorer/CUR and control it with Budgets + anomaly detection, right-size with Compute Optimizer, and govern with Organizations/SCPs, tagging, quotas, and Control Tower. Do the big three first: right-size, schedule off, and commit the baseline.
Pricing model basics
| Lever | Note |
|---|---|
| On-Demand | No commitment, highest per-unit. Default for spiky/unknown. |
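| Savings Plans | 1- or 3-year commitment to an hourly spend. Compute Savings Plans (EC2, Fargate, Lambda) are most flexible, up to ~66% off; EC2 Instance Savings Plans reach up to ~72% in exchange for committing to an instance family in a Region. |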
| Reserved Instances | Similar savings, less flexible; still used for RDS/ElastiCache/Redshift/OpenSearch reservations. |
| Spot | Up to ~90% off, interruptible. Batch/CI/stateless/big-data. |
| Dedicated Hosts | Priciest; for BYOL host-based licensing and isolation. |
The costs that surprise teams
| Cost | Why it grows | Control |
|---|---|---|
| Data transfer | Egress to internet, cross-Region, and cross-AZ all cost per GB. | Keep tightly-coupled traffic AZ-local (while staying HA); use CloudFront for egress; use endpoints for AWS traffic. |
| NAT Gateway | Hourly + per-GB processed; all private-subnet egress flows through it. | VPC endpoints for S3/ECR/etc.; consolidate; one-per-AZ but no more. |
| EBS | Unattached volumes and old snapshots bill forever. | gp3 default; delete orphans; DLM snapshot lifecycle. |
| S3 | Wrong storage class, versioning without expiry, request costs. | Lifecycle/Intelligent-Tiering; expire noncurrent versions. |
| RDS | Over-sized instances, Multi-AZ doubling compute, idle non-prod. | Right-size, stop non-prod, reservations for steady prod. |
| Lambda | High memory setting or very high steady volume. | Tune memory (also affects speed); consider containers at high steady load. |
| CloudWatch | Never-expiring logs, custom metrics, high-res alarms. | Set retention; filter; reduce cardinality. |
| SageMaker / idle endpoints | Standing endpoints and notebooks bill while idle. | Serverless/async inference; shut down idle resources. |
Cost management tools
| Tool | Use |
|---|---|
| Cost Explorer | Visualize and forecast spend; slice by tag/service/account. |
| Cost and Usage Report (CUR) | The raw, granular billing data in S3 for deep analysis (Athena/QuickSight). |
| Budgets | Alerts and automated actions on cost/usage/RI-SP-coverage thresholds. |
| Cost Anomaly Detection | ML-based alerts on unusual spend. |
| Compute Optimizer | Right-sizing recommendations for EC2/ASG/EBS/Lambda/RDS. |
| Trusted Advisor | Idle resources, low-utilization, and cost hygiene checks. |
| Cost allocation tags | Attribute spend to teams/apps/environments (must be activated). |
Governance
- Organizations + consolidated billing: one bill, shared volume discounts and Savings Plan/RI benefit across accounts.
- SCPs: restrict Regions, instance families, or services to prevent expensive mistakes.
- Tagging strategy + tag policies: enforce mandatory tags (Owner, Environment, CostCenter, App) so cost is attributable.
- Service Quotas: know and manage limits proactively (they also cap runaway spend).
- Control Tower / landing zone: consistent guardrails and account baselines including budgets.
- Per-account or per-OU budgets with alerts; chargeback/showback from CUR.
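A tag-compliance check against the mandatory tags named above is a few lines of code. The sketch takes tags in the `[{"Key": ..., "Value": ...}]` shape most AWS APIs return; the function name is an assumption, and treating blank values as missing is a design choice made here.

```python
REQUIRED_TAGS = ("Owner", "Environment", "CostCenter", "App")


def missing_tags(tags):
    """Return the required tag keys that are absent or have a blank value."""
    present = {t["Key"] for t in tags if str(t.get("Value", "")).strip()}
    return [key for key in REQUIRED_TAGS if key not in present]


# Example resource tags (placeholder values).
gaps = missing_tags([
    {"Key": "Owner", "Value": "platform-team"},
    {"Key": "Environment", "Value": "prod"},
    {"Key": "CostCenter", "Value": ""},
])
```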
Cost optimization examples
| Action | Impact |
|---|---|
| Stop non-prod EC2/RDS nights and weekends | Up to ~65% off those resources (scheduler). |
| Right-size EC2/RDS (Compute Optimizer) | Cut over-provisioning, often 20-40%. |
| Compute Savings Plans on steady baseline | Up to ~66% vs On-Demand (EC2 Instance Savings Plans reach ~72%). |
| Spot for batch/CI/stateless | Up to ~90% off. |
| gp3 instead of gp2 | ~20% cheaper + tunable performance. |
| S3 lifecycle / Intelligent-Tiering | Large savings on cold data. |
| VPC endpoints to cut NAT processing | Removes per-GB NAT cost for AWS traffic. |
| Reduce cross-AZ chatter | Cuts data-transfer line item. |
| Delete old snapshots + unattached EBS + idle EIPs | Removes silent recurring charges. |
| CloudWatch log retention | Stops unbounded log storage growth. |
| Reserved capacity planning (RDS/Redshift/etc.) | Discounts on steady managed services. |
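The scheduling figure in the first row follows from simple arithmetic on running hours per week. A small sketch (the function name is illustrative; it covers compute-hours only, since storage keeps billing while an instance is stopped):

```python
HOURS_PER_WEEK = 24 * 7


def schedule_savings(hours_per_day, days_per_week):
    """Fraction of compute-hours saved versus running 24x7."""
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


# Non-prod running 12 hours a day, Monday to Friday.
weekday_savings = schedule_savings(12, 5)
```

Running 12 hours on weekdays leaves 60 of 168 hours billed, a saving of roughly 64% on the compute portion.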
Monthly AWS cost review checklist
- Review Cost Explorer month-over-month by service and by account/tag; investigate anomalies.
- Check Savings Plan / RI coverage and utilization; buy/adjust for the steady baseline.
- Action Compute Optimizer and Trusted Advisor right-sizing/idle findings.
- Find and delete: unattached EBS, old snapshots, idle EIPs, idle load balancers, orphaned NAT gateways, stopped-but-costing resources.
- Verify non-prod scheduling is working.
- Review S3 storage-class distribution and lifecycle effectiveness.
- Check CloudWatch log retention and largest log groups.
- Review data-transfer costs (cross-AZ/Region/internet) for new hotspots.
- Confirm untagged resources are shrinking; enforce tag policy.
- Validate budgets/alerts are current for each account/OU.
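The "find and delete" item can be partly automated. The sketch below filters volume records in the shape returned by the EC2 `DescribeVolumes` API, where an unattached volume reports the state `available`; the function name and sample data are placeholders, and it only reports candidates rather than deleting anything.

```python
def unattached_volumes(volumes):
    """Return (volume_id, size_gib) for volumes that are not attached to anything."""
    return [
        (v["VolumeId"], v["Size"])
        for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


# Sample records (placeholder IDs).
sample = [
    {"VolumeId": "vol-aaa", "Size": 100, "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "Size": 500, "State": "available", "Attachments": []},
]
orphans = unattached_volumes(sample)
wasted_gib = sum(size for _, size in orphans)
```

Snapshot before deleting if there is any doubt about ownership; the tag check from your tagging policy is usually the fastest way to find the owner.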
15. Enterprise Architecture Patterns
Reference designs for real AWS deployments. Each card gives the use case, the services, the traffic flow, and how security, HA, DR, monitoring, and cost play out, plus the risks and common mistakes. Use them as starting points, not blueprints to copy blindly.
Simple web application web
Services: Route 53 -> CloudFront -> S3 (static) and/or ALB -> EC2/ECS -> RDS; ACM for TLS.
Flow: user -> Route 53 -> CloudFront (cache) -> S3 for static, ALB for dynamic -> app -> database.
Security: WAF on CloudFront/ALB, Block Public Access (serve S3 via OAC), SGs least-open, DB private.
HA: multi-AZ ALB + app; RDS Multi-AZ. DR: snapshots/backups (backup-and-restore). Monitoring: CloudWatch alarms on 5xx/latency/DB. Cost: low; CloudFront cuts egress.
Risks/mistakes: public S3 bucket instead of OAC; single-AZ database; no WAF.
Three-tier enterprise application web
Services: ALB (public) -> ASG/ECS app tier (private) -> RDS/Aurora (private data subnets); ElastiCache; Secrets Manager; CloudWatch.
Flow: user -> ALB in public subnets -> app tier in private subnets -> DB in isolated data subnets (no internet route).
Security: tier-to-tier SGs (each references the SG above it), WAF, private data tier, KMS everywhere, secrets in Secrets Manager.
HA: everything across 3 AZs; RDS Multi-AZ + read replicas. DR: pilot light/warm standby cross-Region. Monitoring: per-tier dashboards + Performance Insights. Cost: right-size app tier, reservations for DB.
Risks/mistakes: DB in public subnet; SGs opened by CIDR instead of SG reference; one NAT for all AZs.
Highly available application ha
Services: ALB + ASG across 3 AZs, Aurora (multi-AZ, fast failover), ElastiCache Multi-AZ, stateless app, S3 for shared assets.
Flow: ALB spreads traffic; ASG self-heals via ELB health checks; DB fails over automatically.
Security: as three-tier. HA: N+1 capacity so one AZ can be lost; no singletons. DR: add cross-Region for Region failure. Monitoring: unhealthy-host and failover alarms. Cost: extra capacity for redundancy is the price of availability.
Risks/mistakes: capacity sized so that losing one AZ overloads the rest; stateful instance data lost on replacement.
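The N+1 sizing rule can be checked with one formula: size each AZ so that the remaining AZs still carry peak load after one is lost. A minimal sketch, with an illustrative function name and example numbers:

```python
import math


def instances_per_az(peak_instances, az_count):
    """Instances to run in each AZ so peak load survives the loss of one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to tolerate an AZ failure")
    return math.ceil(peak_instances / (az_count - 1))


# Peak load needs 12 instances; spread across 3 AZs.
per_az = instances_per_az(12, 3)
total = per_az * 3
```

With three AZs the redundancy overhead is 50% over peak; with two AZs it is 100%, which is one reason three-AZ designs are usually cheaper for the same availability.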
Private enterprise application (no internet exposure) private
Services: internal ALB, private subnets only, VPC endpoints (no NAT dependency for AWS traffic), Route 53 private zones, PrivateLink for shared services.
Flow: corporate network -> DX/VPN -> internal ALB -> app -> DB; AWS-service calls via endpoints.
Security: no public IPs, endpoints with policies, SG/NACL least-open, centralized egress inspection if needed. HA/DR: as HA app. Monitoring: Flow Logs + standard. Cost: endpoints instead of NAT can be cheaper for AWS-heavy traffic.
Risks/mistakes: accidental public exposure; forgetting endpoints and routing everything through NAT.
Shared services account pattern platform
Services: Shared Services VPC + Transit Gateway/Cloud WAN; PrivateLink to expose services; Route 53 Resolver; central ECR/CodeArtifact.
Flow: workload accounts reach shared services over TGW or PrivateLink, not the internet.
Security: tightly-scoped cross-account access; resource policies; endpoint policies. HA/DR: shared services must be as resilient as the workloads depending on them. Cost: shared once, consumed by many.
Risks/mistakes: making shared services a single point of failure; unbounded cross-account trust.
Centralized networking (hub-and-spoke) network
Services: Transit Gateway (or Cloud WAN), central inspection VPC (Network Firewall), DX/VPN, central egress.
Flow: spokes attach to TGW; route tables segment prod/non-prod/shared; egress and inspection centralized.
Security: segmentation via TGW route tables; inspection appliances (GWLB/Network Firewall). HA: per-AZ attachments, appliance-mode. Cost: watch TGW per-GB and central-egress data charges.
Risks/mistakes: asymmetric routing without appliance mode; CIDR overlaps; central egress becoming a bottleneck/cost center.
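CIDR overlap is cheap to catch before a spoke is attached. The following is a small sketch of an IPv4 overlap check that could run in a pre-attachment review step; the helper names are hypothetical and it deliberately ignores IPv6.

```typescript
// Parse "a.b.c.d/len" into the first and last address of the block.
// Plain arithmetic is used instead of bitwise operators to avoid 32-bit sign issues.
function cidrRange(cidr: string): { start: number; end: number } {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/.exec(cidr);
  if (match === null) throw new Error(`not an IPv4 CIDR: ${cidr}`);
  const octets = match.slice(1, 5).map(Number);
  const prefixLength = Number(match[5]);
  if (octets.some((octet) => octet > 255) || prefixLength > 32) {
    throw new Error(`not an IPv4 CIDR: ${cidr}`);
  }
  let address = 0;
  for (const octet of octets) address = address * 256 + octet;
  const size = 2 ** (32 - prefixLength);
  const start = Math.floor(address / size) * size; // mask off host bits
  return { start, end: start + size - 1 };
}

// Two blocks overlap when their address ranges intersect.
function cidrsOverlap(a: string, b: string): boolean {
  const ra = cidrRange(a);
  const rb = cidrRange(b);
  return ra.start <= rb.end && rb.start <= ra.end;
}

// Every overlapping pair in a list of VPC / on-prem CIDRs.
function findOverlaps(cidrs: string[]): Array<[string, string]> {
  const overlaps: Array<[string, string]> = [];
  cidrs.forEach((a, index) => {
    for (const b of cidrs.slice(index + 1)) {
      if (cidrsOverlap(a, b)) overlaps.push([a, b]);
    }
  });
  return overlaps;
}
```

For example, `findOverlaps(["10.0.0.0/16", "10.1.0.0/16", "10.1.4.0/22"])` reports the single pair `10.1.0.0/16` and `10.1.4.0/22`.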
Multi-account landing zone foundation
Services: Organizations, Control Tower, SCP guardrails, dedicated Log Archive + Audit accounts, IAM Identity Center, Account Factory/AFT.
Flow: accounts vended from a blueprint; logs centralized; guardrails inherited by OU.
Security: the whole point: preventive + detective guardrails, centralized immutable logs, least-privilege SSO. DR: IaC-reproducible. Cost: baseline services per account.
Risks/mistakes: workloads in the management account; OUs modeled on org chart; no account-vending automation.
Oracle database on AWS database
Services: RDS for Oracle (managed) OR EC2 self-managed with Data Guard (full control); io2/gp3 storage; Secrets Manager; AWS Backup/RMAN-to-S3.
Flow: app tier -> private DB (RDS or EC2) in data subnets.
Security: private subnets, KMS, TLS, restricted SG, deletion protection. HA: RDS Multi-AZ or Data Guard. DR: cross-Region replica/snapshots or Data Guard standby. Cost: licensing (LI vs BYOL) is a major factor; core count on EC2.
Risks/mistakes: choosing RDS then hitting a feature limit (RAC/SYSDBA); public DB; ignoring license core-count on EC2. See the Database tab's Oracle gotchas.
Data lake pattern data
Services: S3 (zones) + Glue (catalog/ETL) + Lake Formation (governance) + Athena/Redshift Spectrum + QuickSight.
Flow: ingest (DMS/Firehose/DataSync) -> raw -> ETL -> curated -> query/BI.
Security: Lake Formation row/column/tag permissions; encryption; cross-account sharing without copies. HA/DR: S3 durability + cross-Region replication. Cost: Parquet + partitioning; serverless query.
Risks/mistakes: unpartitioned raw formats blowing up scan cost; no governance ("data swamp").
Kubernetes platform pattern containers
Services: EKS (managed node groups + Karpenter or Fargate), AWS Load Balancer Controller, IRSA/Pod Identity, ECR, GitOps, Managed Prometheus/Grafana.
Flow: ALB ingress -> pods across AZs; per-pod IAM; images from ECR endpoints.
Security: least-privilege pod roles, network policies, image scanning, IMDSv2. HA: multi-AZ node groups. DR: IaC + GitOps redeploy + data replication. Cost: node right-sizing/Karpenter, Spot for stateless.
Risks/mistakes: node role over-permissioning pods; VPC CNI IP exhaustion; skipped version upgrades.
Serverless application pattern serverless
Services: API Gateway -> Lambda -> DynamoDB; S3 events; EventBridge; Step Functions; SQS/SNS.
Flow: request/event -> function -> data store; async work decoupled via queues.
Security: per-function least-privilege roles, API auth (Cognito/authorizers), input validation. HA: built-in multi-AZ. DR: DynamoDB Global Tables + multi-Region deploy. Cost: pay-per-use, scales to zero; watch high steady volume.
Risks/mistakes: cold starts on latency-critical paths; no DLQs; overly broad function roles.
Event-driven architecture integration
Services: EventBridge (routing) + SNS (fan-out) + SQS (buffering) + Lambda/ECS (processing) + DLQs.
Flow: producers emit events -> routed/fanned out -> independent consumers with retries.
Security: per-consumer roles; event-bus resource policies. HA/DR: managed, multi-AZ; replicate critical state. Cost: per-event, cheap.
Risks/mistakes: no DLQ; poison messages; assuming ordering where Standard queues do not guarantee it.
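To make the DLQ and ordering risks concrete, here is an in-memory sketch of a consumer with a redrive limit. It is a simulation only, not SQS client code; `drainQueue`, `Message`, and `maxReceiveCount` as used here are illustrative names chosen to echo the queue concepts.

```typescript
interface Message {
  id: string;
  body: string;
}

interface DrainResult {
  processed: string[];    // ids in the order they were successfully handled
  deadLettered: string[]; // ids moved aside after too many failed receives
}

// A message that fails maxReceiveCount times is moved to the dead-letter list
// instead of being retried forever (the poison-message case).
function drainQueue(
  messages: Message[],
  handler: (message: Message) => void,
  maxReceiveCount: number,
): DrainResult {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
    throw new RangeError("maxReceiveCount must be an integer >= 1");
  }
  const queue = messages.map((message) => ({ message, receives: 0 }));
  const result: DrainResult = { processed: [], deadLettered: [] };
  while (queue.length > 0) {
    const entry = queue.shift();
    if (entry === undefined) break;
    entry.receives += 1;
    try {
      handler(entry.message);
      result.processed.push(entry.message.id);
    } catch {
      if (entry.receives >= maxReceiveCount) {
        result.deadLettered.push(entry.message.id); // poison message isolated
      } else {
        queue.push(entry); // retried later, which reorders it behind newer messages
      }
    }
  }
  return result;
}
```

Note the side effect of retries: a message that fails once is handled after messages that arrived behind it, which is why consumers should not assume ordering unless the queue type guarantees it.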
Hybrid cloud pattern hybrid
Services: Direct Connect (+ VPN backup), Transit Gateway, Route 53 Resolver (hybrid DNS), Storage Gateway/DataSync.
Flow: on-prem <-> DX/VPN <-> TGW <-> VPCs; DNS resolves both ways via Resolver endpoints.
Security: encrypted links, restricted routing, no overlapping CIDRs. HA: redundant DX/VPN. Cost: DX reduces egress cost for steady transfer.
Risks/mistakes: single DX with no backup; asymmetric routing; CIDR overlap.
Multi-region DR pattern dr
Services: warm standby in a second Region: ALB+ASG scaled down, Aurora Global Database, replicated S3, Route 53 failover, IaC parity.
Flow: normal traffic to primary; on failure, scale up DR, promote DB, shift DNS/Global Accelerator.
Security: identical guardrails both Regions. DR: the point of the pattern; test regularly. Cost: pay for standby capacity + replication.
Risks/mistakes: untested failover; config drift between Regions; DNS TTL delaying failover.
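The DNS TTL risk can be put into numbers. This is a rough sketch, assuming DNS-based failover where a health check must fail a set number of consecutive times before the record flips, and clients may cache the old answer for one full TTL. The parameter names are illustrative, and the estimate ignores database promotion, scale-up time, and resolvers that do not honor TTLs.

```typescript
interface DnsFailoverInputs {
  healthCheckIntervalSeconds: number; // how often the health checker probes
  failureThreshold: number;           // consecutive failures before "unhealthy"
  recordTtlSeconds: number;           // TTL on the failover record
}

// Rough upper bound on how long clients may keep hitting the failed Region
// because of DNS alone: detection time plus one full TTL of cached answers.
function worstCaseDnsShiftSeconds(inputs: DnsFailoverInputs): number {
  const { healthCheckIntervalSeconds, failureThreshold, recordTtlSeconds } = inputs;
  for (const value of [healthCheckIntervalSeconds, failureThreshold, recordTtlSeconds]) {
    if (!(value >= 0)) throw new RangeError("inputs must be non-negative numbers");
  }
  const detectionSeconds = healthCheckIntervalSeconds * failureThreshold;
  return detectionSeconds + recordTtlSeconds;
}
```

With a 30-second interval, a threshold of 3, and a 300-second TTL, DNS alone can account for 390 seconds; dropping the TTL to 60 brings that to 150. Compare the result against the RTO before relying on the pattern.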
Secure landing zone pattern security
Services: Control Tower, SCPs, GuardDuty/Security Hub/Config org-wide, central logging with Object Lock, Firewall Manager, IAM Identity Center, Network Firewall central egress.
Flow: every account inherits guardrails; all logs immutable and central; detection to a security account.
Security: preventive + detective + response wired together. Cost: baseline security services per account.
Risks/mistakes: detection without response; logs not immutable; guardrails not tested in non-prod first.
GenAI with private enterprise data ai
Services: Bedrock (models + Guardrails) + Knowledge Bases + vector store (OpenSearch/pgvector) + S3 (curated docs) + a governed serving layer for structured data.
Flow: user query -> retrieve permitted context (respecting user permissions) -> model answers with citations -> logged.
Security: least-privilege retrieval, Guardrails, no direct OLTP access, full audit. Cost: tokens + vector store + embeddings.
Risks/mistakes: connecting the model to production databases directly; cross-tenant data leakage in retrieval; no output validation. See the AI/ML tab warnings.
16. Troubleshooting Guides
Runbooks for the failures you will actually hit. Each has symptoms, likely causes, checks (console path + CLI where useful), fixes, and prevention. The meta-rule: almost every AWS problem is IAM (permission), networking (reachability), or a quota/limit. Isolate which, then dig.
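The meta-rule can be applied mechanically as a first triage step. The sketch below sorts an error string into the three buckets using substring patterns; the pattern list is an illustrative starting point, not an exhaustive or official mapping, and a match is a hint about where to look first rather than a diagnosis.

```typescript
type ProblemClass = "iam" | "quota" | "networking" | "unknown";

// Checked in order; the first matching pattern wins.
const PATTERNS: Array<[ProblemClass, RegExp]> = [
  ["iam", /AccessDenied|UnauthorizedOperation|not authorized/i],
  ["quota", /LimitExceeded|QuotaExceeded|Throttl|TooManyRequests/i],
  ["networking", /timed out|timeout|connection refused|could not connect/i],
];

// Classify an SDK/CLI error message into the bucket to investigate first.
function classifyAwsError(message: string): ProblemClass {
  for (const [problemClass, pattern] of PATTERNS) {
    if (pattern.test(message)) return problemClass;
  }
  return "unknown";
}
```

For example, a message containing `AccessDenied` lands in `iam`, `ThrottlingException` in `quota`, and a connect timeout in `networking`; anything else returns `unknown` and needs a human look.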
Compute and access
EC2 instance not reachable compute
Causes: SG/NACL blocking, no route/public IP, instance down, wrong subnet.
Checks: SG allows the port from your source; route table + public IP (if internet-facing); instance running and status checks passing; NACL allows in+out (ephemeral). Console: EC2 > Instances > (Networking, Status checks). CLI:
aws ec2 describe-instance-status --instance-ids i-0abc
aws ec2 describe-security-groups --group-ids sg-0abc
Fix: correct SG/route; use Reachability Analyzer. Prevention: standardize network templates; use Session Manager (no inbound port needed).
SSH issue compute
Causes: wrong key, wrong user, SG not allowing 22 from your IP, key perms, host firewall.
Checks: correct .pem and username (ec2-user/ubuntu/etc.); chmod 400 the key; SG allows 22 from your IP. Fix: prefer Session Manager to bypass SSH entirely. Prevention: close 22; use SSM.
Session Manager not working compute
Checks: agent installed/running; instance role has AmazonSSMManagedInstanceCore; reachability to ssm, ssmmessages, ec2messages endpoints (NAT or interface endpoints). Console: Systems Manager > Fleet Manager (managed node shows). Fix: attach role, add endpoints. Prevention: bake agent+role into AMIs/launch templates.
Instance boot issue compute
Checks: EC2 Serial Console; aws ec2 get-console-output --instance-id i-0abc; look for mount/cloud-init errors. Fix: detach root volume, attach to a rescue instance, repair, reattach. Prevention: test AMIs; avoid fragile user-data; disk alarms.
EC2 status check failure compute
High CPU compute
top. Fix: scale out (ASG), scale up/compute-optimized, or fix the app; move off T-family if credits exhaust. Prevention: target-tracking scaling; right family.
Memory pressure compute
mem_used_percent); free -m, OOM in dmesg. Fix: memory-optimized family, tune heap. Prevention: agent + alarms; right-size.
Disk full compute
df -h, du -sh /*. Fix: clear/rotate logs; grow EBS online then growpart + resize2fs/xfs_growfs. Prevention: log rotation, disk alarms.
EBS volume attachment issue storage
Checks: same AZ; volume state available. CLI (then mount in the OS):
aws ec2 attach-volume --volume-id vol-0abc --instance-id i-0abc --device /dev/xvdf
Fix: snapshot + restore in target AZ if cross-AZ. Prevention: place volumes in the instance AZ.
Storage
S3 access denied storage
Checks: all four layers (IAM, bucket policy, endpoint policy, KMS); CloudTrail shows which denied. Fix: grant on the failing layer; ensure the KMS key policy allows the principal. Prevention: use Access Analyzer; document the intended access.
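The layered check above can be written down as a small decision function. This is a deliberately simplified mental model, not the real IAM evaluation engine: it assumes each layer has already been reduced to explicit allow, explicit deny, or no matching statement, and it ignores SCPs, permission boundaries, and session policies. All type and function names are hypothetical.

```typescript
type Effect = "allow" | "deny" | "none"; // explicit Allow, explicit Deny, no matching statement

interface S3Request {
  sameAccount: boolean;    // caller and bucket are in the same account
  identityPolicy: Effect;
  bucketPolicy: Effect;
  endpointPolicy?: Effect; // set only when the call goes through a VPC endpoint
  kmsKeyAccess?: Effect;   // set only when the object is SSE-KMS encrypted
}

// Returns the layer to fix, or null when the simplified model says access is allowed.
function denyingLayer(req: S3Request): string | null {
  const layers: Array<[string, Effect | undefined]> = [
    ["IAM identity policy", req.identityPolicy],
    ["bucket policy", req.bucketPolicy],
    ["endpoint policy", req.endpointPolicy],
    ["KMS key access", req.kmsKeyAccess],
  ];
  // 1. An explicit Deny in any layer always wins.
  for (const [name, effect] of layers) {
    if (effect === "deny") return name;
  }
  // 2. Gate layers, when in the path, must explicitly allow.
  if (req.endpointPolicy !== undefined && req.endpointPolicy !== "allow") return "endpoint policy";
  if (req.kmsKeyAccess !== undefined && req.kmsKeyAccess !== "allow") return "KMS key access";
  // 3. Same account: identity OR bucket policy may grant. Cross-account: both must.
  const identityAllows = req.identityPolicy === "allow";
  const bucketAllows = req.bucketPolicy === "allow";
  if (req.sameAccount) {
    return identityAllows || bucketAllows ? null : "IAM identity policy";
  }
  if (!identityAllows) return "IAM identity policy";
  if (!bucketAllows) return "bucket policy";
  return null;
}
```

A common case it captures: a cross-account caller with a correct identity policy still fails at the bucket policy, and a same-account caller with everything else right still fails when the KMS key does not allow the principal.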
S3 public access issue storage
Checks: Block Public Access settings (account + bucket), ACLs (should be disabled), bucket policy. Fix: for exposure, enable Block Public Access and remove public statements; for legitimate public content, use CloudFront + OAC instead. Prevention: BPA on org-wide via SCP; Macie for PII.
EFS mount issue storage
Checks: mount target exists in that AZ; EFS SG allows 2049 from the instance SG; mount -t efs with the correct FS id. Fix: create the missing mount target; open 2049. Prevention: mount target per AZ; template the SG.
Networking and load balancing
Load balancer target unhealthy elb
SSL certificate issue elb
Route 53 DNS issue dns
enableDnsSupport/enableDnsHostnames on; TTL not masking a change; hybrid resolution via Resolver endpoints/rules. CLI: dig @169.254.169.253 name from inside the VPC. Fix: correct record/association. Prevention: low TTL before changes; alias records for AWS targets.
VPC routing issue network
Security Group issue network
NACL issue network
NAT Gateway issue network
0.0.0.0/0 -> nat route, NAT in a private subnet (must be public with IGW route), port exhaustion.
Checks: NAT in a public subnet with IGW route; private route table points to it; ErrorPortAllocation metric. Fix: one NAT per AZ; correct routes. Prevention: VPC endpoints reduce NAT dependency and cost.
VPC endpoint issue network
Transit Gateway route issue network
VPN down hybrid
aws ec2 describe-vpn-connections. Fix: reset/reconfigure tunnel; failover to the second tunnel. Prevention: 2 tunnels + BGP + tunnel-state alarm.
Direct Connect issue hybrid
Database
RDS backup failed database
Checks: RDS events + CloudWatch FreeStorageSpace; AWS Backup job error. Fix: free/grow storage, resolve blocking transaction, fix KMS key access. Prevention: storage-autoscaling, storage alarms, backup-failure alarms.
RDS performance issue database
RDS connection issue database
Checks: SG, DatabaseConnections metric, endpoint used, secret value. Fix: open SG from app SG; RDS Proxy/pooling for exhaustion; use the correct endpoint. Prevention: Proxy, connection limits, private-subnet placement.
IAM and serverless
IAM permission denied iam
Cross-account role issue iam
sts:AssumeRole, missing ExternalId, or the assumed role lacks the needed permissions.
Checks: trust policy principal + condition; caller's policy; the role's permission policy. CLI:
aws sts assume-role --role-arn arn:... --role-session-name test
Fix: align both sides. Prevention: scope trust to specific role ARNs; use ExternalId for third parties.
Lambda timeout serverless
Checks: Duration vs configured timeout; X-Ray trace; if in a VPC, endpoint/NAT reachability. Fix: raise timeout/memory (memory also raises CPU), reuse connections (RDS Proxy), optimize init, provisioned concurrency for latency. Prevention: right-size, trace, avoid VPC unless needed.
Lambda permission issue serverless
ECS task not starting containers
Checks: stopped-task reason in the console; CloudWatch logs; execution role has ECR + logs perms. Fix: correct role/endpoints/capacity/health path. Prevention: scan images; validate task defs in CI.
EKS pod not starting containers
kubectl describe pod events; common states: ImagePullBackOff (ECR perms/endpoint), Pending (no schedulable node / IP exhaustion), CrashLoopBackOff (app error), IRSA misconfig (no AWS perms). Fix: per cause. Prevention: Karpenter/capacity, prefix delegation for IPs, IRSA validation.
Observability
CloudWatch alarm not firing obs
Checks: the metric exists with data; alarm math and period; missing-data behaviour. Fix: install agent / correct dimensions / adjust evaluation. Prevention: test alarms; alarm on missing data where absence is bad.
CloudTrail logs missing obs
Checks: trail config, destination bucket policy, KMS key policy. Fix: enable org multi-Region trail; fix bucket/KMS policy. Prevention: org trail to a locked central bucket; SCP denying trail changes.
17. AWS CLI, CloudFormation, CDK, and Terraform Examples
Practical, copy-friendly automation. Prefer IaC for anything that must be reproducible, reviewed, or promoted across environments. Reserve the CLI for investigation and glue. All snippets are illustrative; verify against current provider/service versions.
AWS CLI
Setup and profiles (with SSO / role assumption)
# Configure IAM Identity Center (SSO) - the recommended way for humans
# (the wizard also writes the [sso-session corp] block that sso_session refers to)
aws configure sso
# ~/.aws/config: a profile that assumes a role via SSO
[profile prod-admin]
sso_session = corp
sso_account_id = 111122223333
sso_role_name = AdministratorAccess
region = us-east-1
# A profile that assumes a cross-account role from a base profile
[profile deploy]
role_arn = arn:aws:iam::444455556666:role/CICDDeployer
source_profile = prod-admin
external_id = shared-secret-value
# Use a profile
aws sts get-caller-identity --profile deploy
aws s3 ls --profile prod-admin
Common commands
# Who am I / what account
aws sts get-caller-identity
# EC2
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,PrivateIpAddress]" --output table
# Assume a role manually
aws sts assume-role --role-arn arn:aws:iam::444455556666:role/ReadOnly \
  --role-session-name debug
# S3 sync
aws s3 sync ./build s3://my-site-bucket --delete
# Tail logs
aws logs tail /aws/lambda/my-func --follow
# Start a keyless shell (no SSH)
aws ssm start-session --target i-0abc123
CloudFormation (YAML)
# An S3 bucket, encrypted, private, versioned
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ReportsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "reports-${AWS::AccountId}-${AWS::Region}"
      VersioningConfiguration: { Status: Enabled }
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault: { SSEAlgorithm: "aws:kms" }
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - { Key: Environment, Value: prod }
        - { Key: Owner, Value: platform }
Outputs:
  BucketName:
    Value: !Ref ReportsBucket
Deploy across accounts/Regions with StackSets; detect manual changes with drift detection.
AWS CDK (TypeScript)
import { Stack, StackProps, RemovalPolicy } from "aws-cdk-lib";
import { Bucket, BucketEncryption, BlockPublicAccess } from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";
export class StorageStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    new Bucket(this, "ReportsBucket", {
      versioned: true,
      encryption: BucketEncryption.KMS_MANAGED,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
      removalPolicy: RemovalPolicy.RETAIN,
    });
  }
}
CDK synthesizes to CloudFormation, so you inherit its deploy semantics with real language constructs, reuse, and tests.
Terraform
Provider and remote state
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  # Remote state in S3 with DynamoDB lock (create these once, separately).
  # Newer Terraform releases also offer S3-native locking (use_lockfile); verify for your version.
  backend "s3" {
    bucket         = "tfstate-111122223333"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-lock"
    encrypt        = true
  }
}
provider "aws" {
  region = "us-east-1"
  default_tags { tags = { Environment = "prod", ManagedBy = "terraform" } }
}
VPC (minimal, three-AZ public/private)
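# Only the private subnets are shown here; public subnets would follow the same pattern.
# AZ lookup used by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags                 = { Name = "prod-vpc" }
}
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index) # one /20 per AZ
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags              = { Name = "private-${count.index}", Tier = "private" }
}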
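To see which subnets the `cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)` call above produces, the calculation can be reproduced outside Terraform. The following is a TypeScript sketch of the same arithmetic for IPv4 only; `cidrSubnet` is a hypothetical helper written for illustration, not Terraform's implementation.

```typescript
// Carve subnet number `netNum` out of `prefix` by extending the prefix length by `newBits`.
function cidrSubnet(prefix: string, newBits: number, netNum: number): string {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/.exec(prefix);
  if (match === null) throw new Error(`not an IPv4 CIDR: ${prefix}`);
  const octets = match.slice(1, 5).map(Number);
  const baseLength = Number(match[5]);
  if (octets.some((octet) => octet > 255) || baseLength > 32) {
    throw new Error(`not an IPv4 CIDR: ${prefix}`);
  }
  const newLength = baseLength + newBits;
  if (!Number.isInteger(newBits) || newBits < 0 || newLength > 32) {
    throw new RangeError("newBits does not fit in an IPv4 address");
  }
  if (!Number.isInteger(netNum) || netNum < 0 || netNum >= 2 ** newBits) {
    throw new RangeError("netNum is out of range for newBits");
  }
  let address = 0;
  for (const octet of octets) address = address * 256 + octet;
  const baseSize = 2 ** (32 - baseLength);
  const base = Math.floor(address / baseSize) * baseSize; // network address of the parent block
  const start = base + netNum * 2 ** (32 - newLength);    // network address of the child block
  const dotted = [24, 16, 8, 0]
    .map((shift) => Math.floor(start / 2 ** shift) % 256)
    .join(".");
  return `${dotted}/${newLength}`;
}
```

For the VPC above, indexes 0, 1, and 2 yield `10.0.0.0/20`, `10.0.16.0/20`, and `10.0.32.0/20`, leaving thirteen more /20 blocks in the /16 for public or data subnets.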
EC2 with instance profile and IMDSv2 enforced
# data.aws_ami.al2023 and aws_security_group.app are assumed to be defined elsewhere
resource "aws_instance" "app" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = "m7i.large"
  subnet_id              = aws_subnet.private[0].id
  iam_instance_profile   = aws_iam_instance_profile.app.name
  vpc_security_group_ids = [aws_security_group.app.id]
  metadata_options {
    http_tokens                 = "required" # enforce IMDSv2
    http_endpoint               = "enabled"
    http_put_response_hop_limit = 1
  }
  tags = { Name = "app-1", Environment = "prod" }
}
S3 bucket, locked down
resource "aws_s3_bucket" "reports" {
  bucket = "reports-111122223333-us-east-1"
}
resource "aws_s3_bucket_versioning" "reports" {
  bucket = aws_s3_bucket.reports.id
  versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_public_access_block" "reports" {
  bucket                  = aws_s3_bucket.reports.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "reports" {
  bucket = aws_s3_bucket.reports.id
  # Nested blocks cannot use single-line block syntax, so this is written out
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
IAM role and least-privilege policy
# Trust policy: lets EC2 assume the role
data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}
resource "aws_iam_role" "app" {
  name               = "app-role"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}
# Permission policy: read-only access to objects in the reports bucket
data "aws_iam_policy_document" "app" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.reports.arn}/*"]
  }
}
resource "aws_iam_role_policy" "app" {
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.app.json
}
resource "aws_iam_instance_profile" "app" {
  name = "app-profile"
  role = aws_iam_role.app.name
}
CloudWatch alarm
# aws_sns_topic.alerts is assumed to be defined elsewhere
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { InstanceId = aws_instance.app.id }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
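How `period`, `evaluation_periods`, and `threshold` interact is easier to see with data. This is a simplified sketch of the evaluation, assuming every period has a datapoint and that all evaluation periods must breach (no separate datapoints-to-alarm setting); the function names are illustrative.

```typescript
// Index of the datapoint at which the alarm would first enter ALARM, or -1 if it never does.
// Each datapoint is one period's Average; the comparison is strictly greater than the threshold.
function firstAlarmIndex(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
): number {
  if (!Number.isInteger(evaluationPeriods) || evaluationPeriods < 1) {
    throw new RangeError("evaluationPeriods must be an integer >= 1");
  }
  let consecutiveBreaches = 0;
  return datapoints.findIndex((value) => {
    consecutiveBreaches = value > threshold ? consecutiveBreaches + 1 : 0;
    return consecutiveBreaches >= evaluationPeriods;
  });
}

// Shortest possible time from the start of a sustained breach to ALARM.
function minimumDetectionSeconds(periodSeconds: number, evaluationPeriods: number): number {
  return periodSeconds * evaluationPeriods;
}
```

With the settings above (threshold 80, 2 periods of 300 seconds), the series 70, 85, 79, 82, 91 alarms on the fifth datapoint: the single spike to 85 is ignored, and the earliest a sustained breach can page is 600 seconds in.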
Terraform best practices
- Remote state in S3 with DynamoDB locking (or a managed backend); never commit state to git; state contains secrets.
- Separate state per environment (and often per component) to limit blast radius.
- Modules for reusable components (network, EKS, RDS); version them.
- Environment separation via directories/workspaces + tfvars; distinct accounts per environment.
- Pin provider and module versions; run plan in CI and require review before apply.
- Use default_tags for consistent tagging; enforce with policy-as-code (OPA/Sentinel/Checkov).
- Least-privilege the CI role that runs Terraform; do not use admin.
Modular structure
infra/
  modules/
    vpc/            # reusable VPC module
    eks/
    rds/
  envs/
    prod/
      main.tf       # calls modules with prod inputs + prod backend
      prod.tfvars
    nonprod/
      main.tf
      nonprod.tfvars
fmt + validate + security scan (Checkov/tfsec) -> plan (posted for review) -> manual approval -> apply with a least-privilege CI role, per environment. Never run apply from a laptop against prod. Store plan artifacts for audit.
18. AWS Well-Architected Framework
Six pillars for evaluating and improving architectures. Treat them as a review lens with concrete questions, not a checklist to tick. A real Well-Architected Review surfaces trade-offs you made implicitly and forces you to make them on purpose.
The pillars are Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. They pull against each other (more reliability/performance often costs more); the framework is about making those trade-offs deliberately and documenting them. Use the Well-Architected Tool to run structured reviews per workload.
Operational Excellence
Meaning: run and monitor systems to deliver business value and continually improve. Why: most outages and slow recoveries come from operational gaps, not architecture.
Services: CloudFormation/CDK/Terraform, Systems Manager, CloudWatch, EventBridge, CloudTrail, Config, DevOps Guru.
Practical: everything as code, small frequent reversible changes, runbooks/playbooks, game days, observability built into the platform, post-incident reviews that produce fixes.
- Is all infra and config in version-controlled IaC?
- Are deployments automated, small, and reversible?
- Do you have runbooks for common failures, and have you tested them?
- Is observability standardized across the platform?
- Do incidents produce tracked corrective actions?
Security
Meaning: protect data, systems, and assets. Why: a single misconfiguration (public S3, over-broad IAM) can be catastrophic.
Services: IAM/Identity Center, Organizations/SCPs, KMS, Secrets Manager, CloudTrail, Config, GuardDuty, Security Hub, WAF/Shield, Macie, Inspector.
Practical: least privilege, identity-based access with MFA, encryption everywhere, private-by-default networking, centralized immutable logging, detection wired to response, automated guardrails.
- Root locked; humans via SSO+MFA; least-privilege roles?
- Encryption at rest + in transit everywhere?
- Block Public Access + no unintended exposure?
- CloudTrail/Config/GuardDuty/Security Hub org-wide?
- Secrets managed, rotated, not in code?
- Detection findings drive automated/tracked response?
Reliability
Meaning: a workload performs its function correctly and recovers from failure. Why: availability and recovery are what the business feels.
Services: Multi-AZ everything, Auto Scaling, ELB, Route 53, AWS Backup, Elastic Disaster Recovery, Aurora Global DB, Service Quotas.
Practical: design for failure (assume AZ loss), automate recovery (self-healing ASGs, auto-failover DBs), manage quotas proactively, test DR with game days, define and measure RTO/RPO.
- Multi-AZ with N+1 capacity?
- Automated failover and self-healing?
- RTO/RPO defined, met, and DR tested?
- Backups tested by restore, not just job success?
- Quotas known and monitored?
Performance Efficiency
Meaning: use resources efficiently and adapt as demand and technology change. Why: right technology + right sizing delivers performance without waste.
Services: the right compute (Graviton, serverless), purpose-built databases, caching (CloudFront/ElastiCache/DAX), Compute Optimizer, auto scaling.
Practical: pick purpose-built services over one-size-fits-all, benchmark, cache aggressively, scale elastically, adopt newer instance generations, use managed/serverless to offload undifferentiated work.
- Purpose-built services matched to each workload?
- Right-sized (Compute Optimizer) and current-generation?
- Caching at the right layers?
- Elastic scaling tied to real demand signals?
Cost Optimization
Meaning: deliver business value at the lowest price point. Why: unmanaged cloud cost grows silently through idle and over-provisioned resources.
Services: Cost Explorer, Budgets, CUR, Compute Optimizer, Savings Plans, Spot, S3 lifecycle, Trusted Advisor.
Practical: right-size, schedule non-prod off, commit steady baseline (Savings Plans), Spot for interruptible, lifecycle storage, cut data-transfer/NAT, attribute cost via tags, review monthly.
- Right-sized and idle resources removed?
- Savings Plans/RIs cover the baseline; Spot for interruptible?
- Storage lifecycle in place?
- Cost attributed via tags and reviewed monthly?
- Data-transfer/NAT costs understood and minimized?
Sustainability
Meaning: minimize the environmental impact of running cloud workloads. Why: efficiency and sustainability align, and it is increasingly a reporting/compliance requirement.
Services/levers: Graviton (better perf/watt), serverless (no idle), right-sizing, storage-class efficiency, Region choice, the Customer Carbon Footprint Tool.
Practical: maximize utilization, eliminate idle, prefer efficient instance types, delete unused data, choose managed/serverless to raise shared-infrastructure efficiency.
- High utilization, minimal idle?
- Efficient instance types (Graviton) and serverless where fitting?
- Unused data/resources removed on a schedule?
19. Learning Path
A structured route from fundamentals to enterprise-grade AWS. Each level lists what to learn, why it matters, hands-on labs, common mistakes, and the outcome you should be able to demonstrate. Build in a sandbox account with a budget alarm; tear down what you create.
Beginner
| Topic | Why it matters |
|---|---|
| AWS fundamentals, Regions/AZs | The blast-radius and residency model behind every decision. |
| IAM basics (users, groups, roles, policies) | Every action is IAM-authorized; this is the security core. |
| VPC basics (subnets, route tables, SG/NACL, IGW/NAT) | Reachability is the #1 source of confusion. |
| EC2 basics (launch, connect via SSM, AMIs, EBS) | The foundational compute + storage primitives. |
| S3 basics (buckets, classes, Block Public Access) | The default durable store; also the default breach vector if misconfigured. |
| CloudWatch basics (metrics, logs, alarms) | You cannot operate what you cannot see. |
Common mistakes: using root; opening 0.0.0.0/0; public buckets; leaving resources running.
Outcome: stand up a basic, private, monitored workload and explain how traffic and permissions flow.
Intermediate
| Topic | Why it matters |
|---|---|
| Load balancers + Auto Scaling | HA and elasticity for real apps. |
| Private networking, VPC endpoints | Reduce exposure and NAT cost. |
| VPN + Direct Connect basics | Hybrid connectivity. |
| RDS / Aurora (Multi-AZ, replicas, backups) | Managed databases and their HA/DR model. |
| CloudTrail + Config | Audit and configuration compliance. |
| Security services (GuardDuty, Security Hub, KMS, Secrets Manager) | Detection and data protection. |
| Cost management (Cost Explorer, Budgets, Savings Plans) | Control spend deliberately. |
Common mistakes: DB in public subnet; Multi-AZ mistaken for read scaling; no backups tested; ignoring data-transfer cost.
Outcome: deploy a highly available, monitored, secured, cost-aware three-tier application.
Advanced
| Topic | Why it matters |
|---|---|
| Organizations, Control Tower, SCPs, landing zones | Multi-account governance at enterprise scale. |
| Transit Gateway / Cloud WAN, multi-account networking | Scalable, segmented connectivity. |
| Advanced IAM (boundaries, ABAC, cross-account, Access Analyzer) | Least privilege that scales. |
| ECS / EKS / Lambda / API Gateway | Container and serverless platforms. |
| EventBridge / Step Functions | Event-driven and orchestrated systems. |
| Redshift / Lake Formation / analytics | Data platform at scale. |
| Bedrock / SageMaker | GenAI and ML with governance. |
| Multi-region DR | Region-level resilience. |
| Terraform / IaC at scale, enterprise security, large-scale architecture | Operating a real estate reproducibly and safely. |
Common mistakes: workloads in the management account; over-permissioned pods/tasks; untested DR; unmanaged Terraform state; connecting an LLM directly to prod databases.
Outcome: design, secure, automate, and operate an enterprise multi-account AWS environment and justify the trade-offs.
Suggested certification path (optional)
| Stage | Certification |
|---|---|
| Foundational | AWS Certified Cloud Practitioner (optional warm-up). |
| Associate | Solutions Architect Associate; then SysOps or Developer Associate by role. |
| Professional | Solutions Architect Professional; DevOps Engineer Professional. |
| Specialty | Security, Advanced Networking, Machine Learning, Database (by focus). |
Sources and Accuracy
How to treat the content in this portal, and where to confirm current facts.
This portal is a reasoning and design aid, not a live reference for volatile facts. Concepts, models, patterns, trade-offs, and troubleshooting logic are stable. Exact service limits, instance types, feature/Region availability, and pricing change constantly and must be confirmed in the official AWS documentation and the console before production use.
Primary sources (preferred)
- AWS Documentation - the authoritative per-service reference.
- AWS Well-Architected Framework - pillars, lenses, and the Well-Architected Tool.
- AWS Architecture Center and Prescriptive Guidance - reference architectures and patterns.
- AWS Security Documentation and the Security Reference Architecture (AWS SRA).
- Service Quotas and the AWS Pricing Calculator - for current limits and cost estimates.
- AWS What's New and service release notes - for the latest features and changes.
- The AWS Console - the ground truth for what exists in your account/Region right now.
What changes fast (always verify)
| Volatile | Confirm in |
|---|---|
| Service limits / quotas | Service Quotas console. |
| Instance families/types and generations | EC2 docs / console. |
| Feature and Region availability | AWS Regional Services list / What's New. |
| Pricing | Service pricing pages / Pricing Calculator / your CUR. |
| Model availability (Bedrock) | Bedrock console per Region. |
| Security best-practice defaults | Current AWS security docs. |