Multicloud DevOps
With AI

By Veera Sir

26 Topics

โ˜๏ธ Introduction to Cloud Computing Fundamentals
๐Ÿ–ฅ๏ธ Virtualization Fundamentals
๐Ÿง Linux Basics Fundamentals
๐Ÿ–ฅ๏ธ EC2 โ€” Elastic Compute Cloud Compute
โš™๏ธ EC2 Instance Management Compute
โš–๏ธ Elastic Load Balancing Compute
๐Ÿ’ฐ Billing & Monitoring Management
๐Ÿ“ˆ Auto Scaling Compute
๐Ÿ’พ EBS / EFS Storage Storage
๐Ÿฐ VPC โ€” Virtual Private Cloud Networking
๐Ÿ›ก๏ธ VPC Controls Networking
๐Ÿชฃ S3 โ€” Simple Storage Service Storage
๐Ÿชฃ S3 Advanced Storage
๐Ÿ”“ S3 Access Control Storage
๐Ÿ” IAM Security
๐Ÿ”‘ Secrets & Keys Security
๐Ÿ“Š CloudWatch Monitoring
๐Ÿ”” CloudWatch Advanced Monitoring
๐Ÿ›ก๏ธ Security Tools Security
โšก Lambda Serverless
๐Ÿ”Œ Lambda Integrations Serverless
๐ŸŒ Route 53 Networking
๐ŸŒ CloudFront Networking
๐Ÿ›๏ธ Terraform IaC
๐Ÿ Python Boto3 Scripting
๐Ÿ”„ DMS Migration Migration
โ˜๏ธ

Introduction to Cloud Computing

Cloud Fundamentals

Why Cloud Computing?

Before cloud computing, every company had to build and manage its own data centers: buying servers, networking gear, and storage; hiring infrastructure teams; and paying for power and cooling. This was expensive, slow, and inflexible. Cloud computing solves this by providing IT infrastructure over the internet as a service.

  • No Upfront Capital Expense (CAPEX → OPEX): Instead of buying hardware, pay only for what you use. Convert capital expenses into variable operational expenses.
  • Scale Globally in Minutes: Deploy workloads in any AWS Region worldwide and go from 1 server to 1,000 servers in minutes.
  • Increased Speed & Agility: Developers can provision resources in seconds, versus the weeks needed in traditional data centers. Faster time to market.
  • Focus on Core Business: AWS manages data centers, hardware maintenance, and physical security; you focus on building products.
  • Economies of Scale: AWS buys hardware in massive volume, and that bulk buying power means lower costs passed to customers than running your own data center.
  • Stop Guessing Capacity: No need to predict infrastructure needs months in advance. Scale up and down on demand.

Benefits of Cloud Computing

💰 Cost Savings

  • No hardware purchase
  • Pay-as-you-go pricing
  • No maintenance costs
  • No idle capacity
  • Free tier available

⚡ Performance & Speed

  • Latest hardware always
  • Global low-latency network
  • High availability SLAs
  • Provision in seconds

🔒 Security & Reliability

  • Physical security by AWS
  • Compliance certifications
  • Encryption built-in
  • Multiple redundancy layers

Types of Cloud Computing

Type | Description | Who Controls Hardware | Example
Public Cloud | Owned and operated by third-party cloud providers; resources shared over the internet (multi-tenant). | Cloud Provider (AWS) | AWS, Azure, GCP
Private Cloud | Cloud infrastructure used exclusively by a single organization; can be on-premises or hosted by a third party. | Organization or hosting provider | VMware, OpenStack, IBM Cloud Private
Hybrid Cloud | Combination of public and private clouds with data and application portability between them. Best of both worlds. | Both | AWS Outposts + AWS Cloud
Multi-Cloud | Services from multiple cloud providers used simultaneously; avoids vendor lock-in. | Multiple providers | AWS + Azure + GCP together
Community Cloud | Shared infrastructure for a specific community with common concerns (compliance, security). | Community/Provider | Government cloud, healthcare cloud

Cloud Service Models - IaaS, PaaS, SaaS

These three models define how much of the stack you manage versus how much the cloud provider manages. Think of the pizza analogy: how much do you make yourself versus order in?

IaaS - Infrastructure as a Service

  • Provider manages: Hardware, networking, virtualization, storage
  • You manage: OS, runtime, middleware, applications, data
  • Most control, most responsibility
  • AWS Examples: EC2, VPC, EBS, S3
  • Good for: Lift-and-shift migrations, custom OS needs

PaaS - Platform as a Service

  • Provider manages: Hardware + OS + runtime + middleware
  • You manage: Applications and data only
  • Focus on code, not infrastructure
  • AWS Examples: Elastic Beanstalk, RDS, Lambda
  • Good for: Developers who want to just deploy code

SaaS - Software as a Service

  • Provider manages: Everything including the application
  • You manage: Only your data and user access
  • Least control, least responsibility
  • Examples: Gmail, Office 365, Salesforce
  • Good for: End-users, no IT required
Memory Trick: IaaS = "I" manage almost everything. PaaS = "P"latform handles the plumbing. SaaS = "S"omeone else does it all. On-Premises = You manage 100% everything.

Scaling in Cloud Computing

📈 Vertical Scaling (Scale Up/Down)

  • Increase/decrease the size of an existing instance
  • Example: t3.micro → t3.xlarge (more CPU + RAM)
  • Has a physical hardware limit (ceiling)
  • Usually requires downtime/reboot
  • Simple to implement (no code changes)
  • Also called "scaling up" or "scaling down"

📊 Horizontal Scaling (Scale Out/In)

  • Add more instances to handle increased load
  • Example: 2 EC2 instances → 10 EC2 instances
  • Virtually unlimited capacity
  • No single point of failure
  • Works with Auto Scaling Groups + ELB
  • Applications must be stateless for best results
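Both models map to concrete AWS operations. A minimal sketch with the AWS CLI (the instance ID and ASG name below are placeholders):

# Vertical scaling: stop, resize, and restart an existing instance (note the downtime)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --instance-type Value=t3.xlarge
aws ec2 start-instances --instance-ids i-1234567890abcdef0

# Horizontal scaling: change the instance count of an Auto Scaling Group
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg --desired-capacity 10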

Cloud Computing Issues & Challenges

  • Vendor Lock-In: Heavy use of AWS-specific services (DynamoDB, Lambda) makes switching providers expensive and difficult.
  • Data Security & Privacy: Data stored off-premises raises concerns about compliance (GDPR, HIPAA), data sovereignty, and breaches.
  • Internet Dependency: Cloud access requires reliable, high-speed internet connectivity. Outages = no access.
  • Downtime Risk: Even top providers have outages. AWS S3 outage in 2017 impacted much of the internet. Need multi-region strategies.
  • Cost Management (FinOps): Uncontrolled usage can lead to surprise bills. Need budget alerts, cost allocation tags, Reserved Instances.
  • Compliance Complexity: Different data residency laws in different countries. Must understand where data is stored.
  • Limited Customization: Managed services abstract away control. Cannot always configure OS-level settings.

Shared Responsibility Model

Key Concept: AWS is responsible for security OF the cloud (the physical infrastructure). YOU are responsible for security IN the cloud (your data, configurations, IAM, application code).

โ˜๏ธ AWS Responsible For

  • Physical data center security (guards, cameras, locks)
  • Hardware (servers, networking equipment)
  • Hypervisor / virtualization layer
  • Global network infrastructure
  • Managed service OS patching (RDS, Lambda)
  • Compliance of underlying infrastructure

👤 Customer Responsible For

  • Data encryption (in transit and at rest)
  • IAM users, roles, policies, MFA
  • OS patching on EC2 instances
  • Application code security
  • Network/Security Group configuration
  • S3 bucket policies and public access settings

The responsibility shifts based on the service type: EC2 (IaaS) = you manage OS and above. RDS (PaaS) = AWS manages OS and DB engine patching. Lambda (Serverless) = AWS manages almost everything.

Cloud Costing Models

Model | Description | Savings vs On-Demand | Best For
On-Demand | Pay per second/hour, no commitment, no upfront cost | Baseline (0%) | Unpredictable workloads, testing, short-term
Reserved Instances (RI) | 1-year or 3-year commitment. Standard RI or Convertible RI. | Up to 72% | Steady-state predictable workloads (databases, web servers)
Spot Instances | Bid on unused AWS EC2 capacity. Can be interrupted with a 2-minute warning. | Up to 90% | Fault-tolerant batch jobs, big data, CI/CD, stateless apps
Savings Plans | Flexible 1-3 year commitment to a usage amount ($/hr). Covers EC2, Fargate, Lambda. | Up to 66% | Flexible usage across instance types and regions
Dedicated Hosts | Physical server dedicated to you. Bring your own license (BYOL). | 0-30% (BYOL savings) | Compliance requirements, software licensing, HIPAA
Dedicated Instances | Runs on hardware dedicated to you, but AWS manages the host. | Small premium over On-Demand | Compliance requiring dedicated hardware

AWS Global Infrastructure

  • Regions (34+): Geographic areas, each containing multiple Availability Zones. Examples: us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland). Data does NOT leave a Region unless you explicitly configure it to.
  • Availability Zones (AZs) (108+): One or more discrete data centers within a Region with redundant power, networking, and connectivity. AZs are connected via private, high-speed fiber links. New Regions launch with a minimum of 3 AZs.
  • Edge Locations (400+): CDN endpoints for Amazon CloudFront and Route 53. Caches content closer to users. NOT full AWS regions โ€” limited services only.
  • Local Zones: AWS infrastructure placed in metro areas closer to large population centers. Example: Los Angeles, Boston. Low-latency for demanding applications.
  • Wavelength Zones: AWS infrastructure embedded in telecom 5G networks for ultra-low latency mobile apps.
  • AWS Outposts: AWS managed hardware running in your own on-premises data center. Extends AWS cloud into your facility.
Exam Tip: For High Availability, always deploy across multiple AZs within a Region. For Disaster Recovery, deploy across multiple Regions. AZ failures happen; Regional failures are extremely rare.
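The global infrastructure is queryable from the AWS CLI; a quick sketch for exploring Regions and AZs:

# List all Regions enabled for your account
aws ec2 describe-regions --query "Regions[].RegionName" --output table

# List the Availability Zones in one Region
aws ec2 describe-availability-zones --region us-east-1 \
  --query "AvailabilityZones[].{Zone:ZoneName,State:State}" --output table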
🖥️ Virtualization

Cloud Fundamentals

What is Virtualization?

Virtualization is the process of creating a software-based (virtual) representation of physical computing resources such as servers, storage, networks, and desktops. It uses a software layer called a Hypervisor (VMM, Virtual Machine Monitor) to abstract physical hardware and present it to multiple virtual machines simultaneously.

Without virtualization, one physical server runs one operating system. With virtualization, one physical server can run 10, 50, or 100+ virtual machines, each with its own OS, isolated from the others.

Virtualization and Cloud Computing

Virtualization is the foundation of cloud computing. Every EC2 instance you launch in AWS is actually a virtual machine running on AWS physical hardware. When you launch 100 EC2 instances, AWS spins up 100 VMs across their physical servers using the Nitro Hypervisor. You share physical hardware with other AWS customers but remain completely isolated.

Key Insight: Cloud computing = virtualizing physical data center resources and selling them as on-demand services over the internet. Virtualization enables the "elastic" in Elastic Compute Cloud (EC2).

Types of Virtualization

Type | Description | How It Works | AWS Service
Server/Hardware Virtualization | Multiple VMs share one physical server | Hypervisor divides CPU/RAM/storage into VMs | EC2 Instances
Storage Virtualization | Multiple physical storage devices pooled into one logical storage unit | Abstraction layer manages storage allocation transparently | EBS, EFS, S3
Network Virtualization | Physical network resources abstracted into software-defined networks | VLANs, SDN, overlay networks | VPC, Security Groups, ENI
Desktop Virtualization (VDI) | Desktop environments hosted on a central server, accessed remotely | Users stream the desktop from the server | Amazon WorkSpaces
Application Virtualization | Application runs in an isolated environment separate from the host OS | Container or sandbox wraps the app | Docker, ECS, EKS
OS-Level Virtualization (Containers) | Multiple isolated user-space instances on the same OS kernel | Namespaces + cgroups isolate processes | ECS, EKS, Fargate

Hypervisor Types - Type 1 vs Type 2

Type 1 - Bare Metal Hypervisor

  • Runs directly on physical hardware; no host OS needed
  • Better performance (no OS overhead)
  • Better security (smaller attack surface)
  • Used in production environments and cloud
  • AWS uses: Nitro Hypervisor (based on KVM)
  • Others: VMware ESXi, Microsoft Hyper-V, Xen, KVM

Type 2 - Hosted Hypervisor

  • Runs on top of a host operating system
  • Host OS adds overhead and latency
  • Easier to set up for development/testing
  • Lower performance than Type 1
  • Examples: Oracle VirtualBox, VMware Workstation, VMware Fusion, Parallels Desktop
  • Used for: learning, dev testing on laptops
Type 1 vs Type 2 Hypervisor Architecture (diagram):
  Type 1 (bare metal): Physical Hardware → Hypervisor (Nitro / ESXi) → VM 1/2/3 (each with Guest OS + App)
  Type 2 (hosted): Physical Hardware → Host OS (Windows / Linux / macOS) → Hypervisor (VirtualBox / VMware Workstation) → VM 1/2/3 (each with Guest OS + App)

Key Virtualization Terminologies

  • Host Machine: The physical computer running the hypervisor. The actual hardware server.
  • Guest Machine (VM): The virtual machine running on the host. Has its own virtual CPU, RAM, storage, NIC.
  • vCPU: Virtual CPU, a share of the physical CPU's processing power. Each AWS EC2 instance is given a number of vCPUs.
  • Snapshot: A point-in-time copy of a VM's disk state. Used for backups, rollback, and creating templates.
  • AMI (Template/Image): Amazon Machine Image, a pre-configured VM template used to launch new EC2 instances quickly.
  • Live Migration: Moving a running VM to another physical host with zero downtime. AWS does this during host maintenance.
  • Overprovisioning: Allocating more virtual resources than physical resources available. Works because VMs rarely use 100% simultaneously.
  • Overcommitting: Assigning more vCPUs/vRAM than physical CPUs/RAM exist. Relies on statistical multiplexing.
  • Para-Virtualization: Guest OS is modified to be aware of the hypervisor and uses special APIs. Faster than full virtualization.
  • Full Virtualization: Guest OS runs unmodified; hardware is fully simulated. VMs cannot tell they are virtualized.

Containers vs Virtual Machines

๐Ÿ–ฅ๏ธ Virtual Machines

  • Each VM has its own full OS (Guest OS)
  • Heavier: GBs in size
  • Slower startup (minutes)
  • Complete isolation between VMs
  • Better security isolation
  • AWS: EC2 Instances

📦 Containers (Docker)

  • Share the host OS kernel; no Guest OS needed
  • Lighter: MBs in size
  • Very fast startup (seconds)
  • Process-level isolation
  • Portable across environments
  • AWS: ECS, EKS, Fargate
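The startup-time gap is easy to demonstrate on any machine with Docker and the AWS CLI installed; a hedged sketch (the image and AMI ID are examples):

# Container: shares the host kernel, starts in roughly a second
time docker run --rm nginx:alpine nginx -v

# VM: a full EC2 instance boots its own guest OS and takes minutes
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type t3.micro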

Benefits of Virtualization

  • Server Consolidation: Run many VMs on one physical server. Reduces number of servers needed by 10x-20x.
  • Cost Reduction: Fewer physical servers = less hardware cost, less power, less cooling, less data center space.
  • Rapid Provisioning: New servers can be deployed in minutes by cloning a template (AMI in AWS).
  • Isolation: Each VM is isolated. A crash in VM1 doesn't affect VM2. Security breaches are contained.
  • Disaster Recovery: Snapshots and VM replication make backup and recovery much easier and faster.
  • Better Utilization: Physical servers typically run at 10-15% capacity. VMs help utilize 70-80% of physical capacity.
  • Testing & Development: Developers can run multiple OS environments on one laptop for testing.

Virtualization Vendors

  • AWS: Nitro Hypervisor (KVM-based)
  • VMware: ESXi / vSphere
  • Microsoft: Hyper-V
  • Red Hat: KVM + libvirt
  • Citrix: XenServer / Citrix Hypervisor
  • Oracle: VirtualBox (Type 2)
  • Parallels: Parallels Desktop (macOS)
๐Ÿง

Linux Basics


Why Linux in AWS?

Linux is the dominant OS in cloud computing. Over 90% of AWS workloads run on Linux. Amazon Linux 2 and Amazon Linux 2023 are AWS's own Linux distributions optimized for EC2. Linux is free, open-source, stable, and highly customizable, making it perfect for servers.

All-Important Linux Commands

๐Ÿ“ File & Directory

  • ls -la โ€” list all files with permissions, hidden files
  • pwd โ€” print working directory (current location)
  • cd /path โ€” change directory
  • cd .. โ€” go up one directory
  • cd ~ โ€” go to home directory
  • mkdir -p dir/subdir โ€” create directory (with parents)
  • rm -rf dir โ€” remove directory recursively (careful!)
  • cp -r src dst โ€” copy file/dir recursively
  • mv src dst โ€” move or rename
  • touch file.txt โ€” create empty file
  • cat file โ€” display file content
  • less file โ€” paginated view (q to quit)
  • head -20 file โ€” first 20 lines
  • tail -20 file โ€” last 20 lines
  • tail -f /var/log/app.log โ€” follow log in real-time
  • grep "text" file โ€” search pattern in file
  • grep -r "text" /dir โ€” recursive search
  • find / -name "*.conf" โ€” find files by name
  • wc -l file โ€” count lines
  • diff file1 file2 โ€” compare two files
  • ln -s /target /link โ€” create symbolic link

โš™๏ธ System & Process

  • top โ€” live process monitor (q to quit)
  • htop โ€” improved interactive process viewer
  • ps aux โ€” list all processes with user/CPU/mem
  • ps aux | grep nginx โ€” find specific process
  • kill PID โ€” terminate process gracefully (SIGTERM)
  • kill -9 PID โ€” force kill (SIGKILL)
  • pkill nginx โ€” kill by name
  • df -h โ€” disk usage (human readable)
  • du -sh /var โ€” directory disk usage
  • free -h โ€” memory usage
  • uptime โ€” system uptime + load average
  • uname -r โ€” kernel version
  • uname -a โ€” all system info
  • hostname โ€” show/set hostname
  • whoami โ€” current logged-in user
  • id โ€” current user's UID, GID, groups
  • history โ€” command history
  • which cmd โ€” full path of command
  • sudo cmd โ€” run as superuser (root)
  • su - username โ€” switch user
  • env โ€” show environment variables
  • echo $HOME โ€” print environment variable
  • export VAR=value โ€” set environment variable

The Linux Filesystem Hierarchy

Directory | Full Name | Purpose & Contents
/ | Root | Top of the entire filesystem hierarchy. Everything is under /
/bin | Binaries | Essential user command binaries (ls, cp, mv, cat, grep). Available to all users.
/sbin | System Binaries | System administration binaries (iptables, fdisk, mount). Mostly for the root user.
/etc | Et Cetera | System-wide configuration files: /etc/hosts (hostname resolution), /etc/fstab (mounts), /etc/nginx (nginx config)
/home | Home | User home directories: /home/ubuntu, /home/ec2-user. Personal files, settings.
/root | Root Home | Home directory for the root user (NOT the same as /)
/var | Variable | Variable data that changes frequently: /var/log (logs), /var/www (web files), /var/lib (databases)
/tmp | Temporary | Temporary files. Cleared on reboot. World-writable. Use for scratch space.
/usr | Unix System Resources | User programs and data: /usr/bin (most user commands), /usr/lib (libraries), /usr/local (manually installed software)
/opt | Optional | Optional/third-party software. JDK, AWS CLI, custom apps installed here.
/proc | Process | Virtual filesystem: /proc/cpuinfo, /proc/meminfo, /proc/PID/. The kernel exposes system info here.
/dev | Devices | Device files: /dev/sda (disk), /dev/null (discard output), /dev/random (random data)
/mnt | Mount | Temporary mount point for external/additional filesystems (USB drives, EBS volumes)
/boot | Boot | Boot loader files, Linux kernel (vmlinuz), initrd. Do NOT delete!
/lib | Libraries | Essential shared libraries for /bin and /sbin binaries

File Permissions

Linux permissions control who can read, write, and execute files. Every file has three permission sets: Owner (user), Group, and Others.

# Permission output from ls -la:
# -rwxr-xr--  1  ec2-user  ec2-user  1234  Jan 1  file.sh
#
# First character:     - = regular file, d = directory, l = symlink
# Characters 2-4 (rwx) = Owner: read, write, execute
# Characters 5-7 (r-x) = Group: read and execute
# Characters 8-10 (r--) = Others: read only

# Permission values: r=4, w=2, x=1
# rwx = 4+2+1 = 7
# r-x = 4+0+1 = 5
# r-- = 4+0+0 = 4
# rw- = 4+2+0 = 6

chmod 755 script.sh     # owner=rwx(7), group=rx(5), others=rx(5)
chmod 644 config.txt    # owner=rw(6), group=r(4), others=r(4)
chmod 600 key.pem       # owner=rw only, no one else can read
chmod 400 key.pem       # owner=read only (SSH key requirement)
chmod +x script.sh      # add execute permission for all
chmod -x script.sh      # remove execute permission
chmod u+w file          # add write for owner (u=user/owner, g=group, o=others, a=all)

# Change ownership
chown ec2-user file.txt           # change owner
chown ec2-user:developers file    # change owner AND group
chgrp developers file             # change group only
chown -R ec2-user /var/www        # recursive (entire directory)

Process Management

ps aux                    # list all processes (a=all users, u=user-oriented, x=no terminal)
top                       # real-time process monitor (press 'q' to quit)
htop                      # colorful interactive process viewer
kill -l                   # list all signals (SIGTERM=15, SIGKILL=9)
kill 1234                 # send SIGTERM (graceful shutdown) to PID 1234
kill -9 1234              # send SIGKILL (force kill) to PID 1234
killall nginx             # kill all processes named 'nginx'

# Background processes
nohup ./script.sh &       # run in background, ignore hangup signal, output to nohup.out
./script.sh &             # run in background (killed on terminal close)
jobs                      # list background jobs
fg %1                     # bring job #1 to foreground
bg %1                     # resume job #1 in background
disown -h %1              # disown job so it persists after logout

# Systemd (modern Linux service management)
systemctl start nginx     # start service
systemctl stop nginx      # stop service
systemctl restart nginx   # stop then start
systemctl reload nginx    # reload config without restart
systemctl status nginx    # check service status (running/stopped/failed)
systemctl enable nginx    # auto-start on boot
systemctl disable nginx   # disable auto-start
systemctl list-units --type=service   # list all services

User Account Management

# User management
useradd -m username           # create user with home directory
useradd -m -s /bin/bash user  # create with bash shell
passwd username               # set/change password
usermod -aG sudo username     # add to sudo group (Ubuntu)
usermod -aG wheel username    # add to wheel group (RHEL/CentOS)
usermod -s /bin/bash user     # change shell
userdel username              # delete user
userdel -r username           # delete user + home directory

# Group management
groupadd developers           # create group
groupdel developers           # delete group
groups username               # show groups for user
id username                   # show UID, GID, groups

# Important user files
cat /etc/passwd               # user accounts (username:x:UID:GID:comment:home:shell)
cat /etc/shadow               # password hashes (root only)
cat /etc/group                # group definitions

# Sudo configuration
visudo                        # safely edit /etc/sudoers
# Add: username ALL=(ALL) NOPASSWD: ALL  (passwordless sudo)

Software Package Management

# Ubuntu/Debian - APT package manager
sudo apt update               # update package index (always do this first)
sudo apt upgrade              # upgrade all installed packages
sudo apt install nginx -y     # install nginx
sudo apt remove nginx         # remove nginx
sudo apt purge nginx          # remove nginx + config files
sudo apt autoremove           # remove unused dependencies
apt search nginx              # search for package
apt show nginx                # show package info
dpkg -l | grep nginx          # list installed packages matching nginx
dpkg -l                       # list all installed packages

# Amazon Linux 2 / RHEL / CentOS - YUM package manager
sudo yum update               # update all packages
sudo yum install httpd -y     # install Apache
sudo yum remove httpd         # remove Apache
sudo yum list installed       # list installed packages
sudo yum info httpd           # show package info
sudo yum search nginx         # search packages

# Amazon Linux 2023 / RHEL 8+ - DNF (newer yum)
sudo dnf update
sudo dnf install nginx
sudo dnf remove nginx

Backup and Restore Management

# TAR - tape archive (most common backup tool)
tar -czf backup.tar.gz /data/           # compress directory (c=create, z=gzip, f=filename)
tar -cjf backup.tar.bz2 /data/         # compress with bzip2 (better compression)
tar -xzf backup.tar.gz                 # extract (x=extract, z=gzip)
tar -xzf backup.tar.gz -C /restore/    # extract to specific directory
tar -tzf backup.tar.gz                 # list contents without extracting

# RSYNC - efficient file sync (only transfers changes)
rsync -avz /local/dir/ user@host:/remote/dir/   # sync to remote (a=archive, v=verbose, z=compress)
rsync -avz --delete /src/ /dst/                 # sync and delete files not in source
rsync -avz --exclude="*.log" /src/ /dst/        # exclude log files
rsync -avz --dry-run /src/ /dst/                # preview without executing

# SCP - secure copy (simpler than rsync)
scp file.txt user@host:/path/          # copy file to remote
scp user@host:/path/file.txt .         # copy from remote
scp -r /local/dir user@host:/remote/  # copy directory recursively

# DD - disk image backup
dd if=/dev/sda of=/backup/disk.img bs=4M status=progress  # full disk image
dd if=/backup/disk.img of=/dev/sda bs=4M                  # restore disk image

Systemd and Monitoring

# Journald - systemd journal (logs)
journalctl -u nginx                    # logs for nginx service
journalctl -u nginx -f                 # follow nginx logs in real-time
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2024-01-01" --until "2024-01-02"
journalctl -p err                      # only errors
journalctl -b                          # logs since last boot
journalctl --disk-usage                # how much disk journal uses

# Traditional log files
tail -f /var/log/syslog                # Ubuntu: follow system log
tail -f /var/log/messages              # RHEL: system messages
tail -f /var/log/nginx/access.log      # nginx access log
tail -f /var/log/nginx/error.log       # nginx error log
tail -f /var/log/cloud-init.log        # EC2 user-data script execution log

# System monitoring commands
vmstat 1 5                             # virtual memory stats every 1s for 5 iterations
iostat -x 1                            # I/O statistics per device
sar -u 5 3                             # CPU usage report (5s interval, 3 times)
netstat -tuln                          # open ports listening
ss -tuln                               # modern replacement for netstat
lsof -i :80                            # what process is using port 80

Storage Management

# Block device management (critical for EBS volumes on EC2)
lsblk                                  # list block devices (disks/partitions)
lsblk -f                               # with filesystem info
fdisk -l                               # detailed partition table info
blkid                                  # show UUIDs of block devices

# Create filesystem and mount (new EBS volume workflow)
sudo fdisk /dev/xvdb                   # partition the disk (optional for small disks)
sudo mkfs.ext4 /dev/xvdb               # format as ext4
sudo mkfs.xfs /dev/xvdb                # format as xfs (Amazon Linux default)
sudo mkdir -p /mnt/data                # create mount point
sudo mount /dev/xvdb /mnt/data         # mount the disk
df -h                                  # verify mount and available space

# Persistent mount (survives reboot) - add to /etc/fstab
echo "UUID=$(blkid -s UUID -o value /dev/xvdb) /mnt/data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount -a                          # test fstab entries
sudo umount /mnt/data                  # unmount

Networking in Linux

# Network interface info
ip addr show                           # show all interfaces and IPs
ip addr show eth0                      # specific interface
ifconfig                               # older command (same as ip addr)
ip link show                           # link layer info (MAC address)

# Routing
ip route show                          # show routing table
ip route add 10.0.0.0/8 via 10.0.1.1  # add static route
route -n                               # older routing table command

# Connectivity testing
ping -c 4 google.com                   # send 4 ICMP packets
ping 8.8.8.8                           # ping Google DNS
traceroute google.com                  # trace packet path
mtr google.com                         # combined ping + traceroute (real-time)

# DNS lookup
nslookup google.com                    # basic DNS lookup
dig google.com                         # detailed DNS info
dig google.com MX                      # look up MX records
dig @8.8.8.8 google.com               # query specific DNS server
host google.com                        # simple name resolution

# HTTP/HTTPS testing
curl https://api.example.com           # fetch URL content
curl -I https://example.com            # fetch headers only
curl -o file.zip https://example.com/file.zip  # download file
wget https://example.com/file.zip      # download file (alternative to curl)
curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' https://api.example.com

# Port and connection info
netstat -tuln                          # TCP/UDP listening ports
ss -tuln                               # same, modern version
ss -tnp                                # connections with process names
lsof -i TCP:80                         # what's using port 80
telnet host 3306                       # test if port is reachable
nc -zv host 3306                       # netcat port test (preferred over telnet)
🖥️ EC2 - Elastic Compute Cloud

Compute

What is EC2?

Amazon EC2 (Elastic Compute Cloud) provides resizable virtual computing capacity (virtual machines) in the cloud. EC2 gives you complete control: choose the OS, configure networking, manage security, attach storage, and install any software you need. It is the backbone of AWS; almost every architecture involves EC2 or services built on EC2.

Think of it as: Renting a virtual computer in AWS's data center. You control everything above the hardware level (OS and above). AWS manages the physical server, network, power, and hypervisor.

EC2 Instance Types (Families)

Family | Optimized For | Instance Types | When to Use
General Purpose | Balanced CPU/RAM/network | t3, t4g, m5, m6i, m7i | Web servers, dev/test, small DBs, microservices
Compute Optimized | High-performance CPU | c5, c6g, c7g, c6i | Batch processing, media encoding, gaming servers, HPC
Memory Optimized | Large in-memory datasets | r5, r6i, x2idn, z1d, u-6tb1 | Big data, in-memory DBs (Redis/SAP HANA), real-time analytics
Storage Optimized | High sequential I/O read/write | i3, i4i, d2, h1, im4gn | NoSQL DBs, data warehouses, log processing, Hadoop
Accelerated Computing | GPU/FPGA hardware | p4, g5, f1, inf2, trn1 | ML training/inference, scientific computing, video rendering
HPC Optimized | Extreme compute + networking | hpc6a, hpc7g | High Performance Computing clusters, CFD, molecular dynamics

Instance Naming Convention

Understanding how to read an instance type name: m5.2xlarge

m = Instance family (m=general, c=compute, r=memory, etc.)
5 = Generation (higher = newer, better price/performance)
2xlarge = Instance size (nano < micro < small < medium < large < xlarge < 2xlarge < 4xlarge ...)
Suffixes: g=Graviton (ARM), a=AMD, n=higher network, d=NVMe SSD storage, e=extra storage
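You can verify what any type actually provides by querying the EC2 API; for example:

# Inspect the vCPU, memory, and network specs of a given instance type
aws ec2 describe-instance-types --instance-types m5.2xlarge \
  --query "InstanceTypes[].{vCPU:VCpuInfo.DefaultVCpus,MemoryMiB:MemoryInfo.SizeInMiB,Network:NetworkInfo.NetworkPerformance}"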

Instance Launch Process (Step by Step)

  1. Choose an AMI (Amazon Machine Image): The OS + software pre-installed template. Choose Amazon Linux 2, Ubuntu 22.04, Windows Server, RHEL, or a Marketplace AMI.
  2. Select Instance Type: Choose based on workload needs. For learning/dev: t2.micro or t3.micro (free tier eligible). For production: based on CPU/RAM requirements.
  3. Configure Instance Details: Choose VPC, subnet (public/private), IAM role for AWS API access, user data script (runs at first boot), shutdown behavior, termination protection.
  4. Add Storage (EBS): Root volume (OS disk, default 8-30 GB gp3). Add additional data volumes as needed. Set "Delete on Termination" appropriately.
  5. Add Tags: Key-value metadata. Name=MyWebServer, Environment=Production, Owner=TeamA. Essential for cost allocation and resource management.
  6. Configure Security Group: Virtual firewall. Add inbound rules for SSH (22), HTTP (80), HTTPS (443). Restrict SSH to your IP only.
  7. Review and Launch: Choose or create a Key Pair (for SSH access). Download the .pem file; this is your only chance to download it!
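The same seven steps collapse into a single CLI call; a minimal sketch where every ID and name is a placeholder:

# Steps 1-7 in one command: AMI, type, network + IAM role + user data,
# storage, tags, security group, key pair
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --subnet-id subnet-12345678 \
  --iam-instance-profile Name=MyAppRole \
  --user-data file://bootstrap.sh \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=20,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=MyWebServer}]' \
  --security-group-ids sg-12345678 \
  --key-name my-keypair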

EC2 Connection Methods

🔑 SSH (Linux Instances)

  • Standard method for Linux/Unix
  • Requires Key Pair (.pem file)
  • Requires port 22 open in Security Group
  • Set key permissions: chmod 400 key.pem
  • Command: ssh -i key.pem ec2-user@<public-ip>
  • Default users: Amazon Linux=ec2-user, Ubuntu=ubuntu, RHEL=ec2-user, Debian=admin

🪟 RDP (Windows Instances)

  • Remote Desktop Protocol - Windows GUI
  • Port 3389 must be open in Security Group
  • Right-click instance → Get Windows Password
  • Use Key Pair to decrypt the initial password
  • Connect with Windows Remote Desktop (mstsc.exe) or Mac RDP client

๐ŸŒ EC2 Instance Connect

  • Browser-based SSH from AWS Console
  • No key pair needed โ€” AWS pushes temporary key
  • Requires port 22 open to AWS IP ranges
  • Only works with Amazon Linux 2, Amazon Linux 2023, Ubuntu
  • Good for quick access without SSH client

๐Ÿ›ก๏ธ Session Manager (SSM)

  • No SSH, no port 22, no key pair needed!
  • Encrypted session via SSM Agent
  • Works for instances in private subnets (no public IP needed)
  • Requires: SSM Agent + IAM role with AmazonSSMManagedInstanceCore
  • All sessions logged in CloudTrail โ€” full audit
  • Best practice for production instances
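With the agent and IAM role in place, a session is one command away (the instance ID is a placeholder; the Session Manager plugin for the AWS CLI must be installed):

# Open an interactive shell without SSH or port 22
aws ssm start-session --target i-1234567890abcdef0

# See which instances are registered with SSM
aws ssm describe-instance-information \
  --query "InstanceInformationList[].{Id:InstanceId,Ping:PingStatus}"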

AMI - Amazon Machine Image

An AMI is a pre-configured template that provides the information required to launch an EC2 instance. It contains: the OS, application server, any pre-installed applications, configuration, and EBS snapshot(s).

AMI Type | Source | Cost | Use Case
AWS-Provided | Amazon maintains these | Free (just EC2 cost) | Amazon Linux 2/2023, Ubuntu, Windows Server, RHEL
AWS Marketplace | Third-party vendors | License fee + EC2 | LAMP stacks, NGINX Plus, SAP, security appliances
Community AMIs | Other AWS users | Free (community) | Public images shared by the community (use with caution)
Custom AMIs | You create them from an existing EC2 instance | Storage cost (EBS snapshots) | "Golden images" pre-configured with your software for fast Auto Scaling
Creating a Custom AMI: Launch EC2 → Install all software → Configure everything → Actions → Image and templates → Create image. This creates an AMI + EBS Snapshot. Use this AMI in Launch Templates for Auto Scaling Groups. Instances launch pre-configured = faster scale-out!
Important: AMIs are regional. To use an AMI in a different region, you must copy it. AMIs can be shared with specific AWS accounts or made public.
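Both operations are available from the CLI as well; a sketch with placeholder IDs:

# Create a custom (golden) AMI from a configured instance
aws ec2 create-image \
  --instance-id i-1234567890abcdef0 \
  --name "golden-web-v1" \
  --description "Pre-configured web server image"

# AMIs are regional: copy one to another Region before using it there
aws ec2 copy-image \
  --source-region us-east-1 \
  --source-image-id ami-0abcdef1234567890 \
  --region ap-south-1 \
  --name "golden-web-v1"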

Elastic IP (EIP)

By default, when you stop and start an EC2 instance, it gets a new public IP address. Elastic IP is a static, public IPv4 address that stays the same regardless of instance state.

  • You allocate an EIP to your account, then associate it to an EC2 instance or ENI
  • EIP can be quickly remapped to a different instance (useful for failover)
  • Billing: FREE when associated with a running instance. You are CHARGED ($0.005/hr) when EIP is allocated but NOT associated (wasting public IPs)
  • Maximum 5 EIPs per region per account (can request increase)
  • One EIP per instance (by default)
# Allocate and associate Elastic IP using AWS CLI
aws ec2 allocate-address --domain vpc                          # allocate EIP
aws ec2 associate-address \
  --instance-id i-1234567890abcdef0 \
  --allocation-id eipalloc-12345678                           # associate to instance
aws ec2 disassociate-address --association-id eipassoc-xxx    # disassociate
aws ec2 release-address --allocation-id eipalloc-xxx          # release (delete EIP)

Placement Groups

Placement Groups control how EC2 instances are physically placed on AWS hardware to optimize performance or availability:

🔥 Cluster

  • Instances packed close together in one AZ
  • Ultra-low latency, high bandwidth (10 Gbps+)
  • Risk: if the hardware fails, all instances fail
  • Use: HPC, big data, ML training

📊 Spread

  • Each instance on separate physical hardware
  • Max 7 instances per AZ per group
  • Reduces correlated failures
  • Use: Critical apps needing high availability

๐Ÿข Partition

  • Groups of instances on separate racks
  • Up to 7 partitions per AZ, 100s of instances
  • Partition failure doesn't affect others
  • Use: Hadoop, Cassandra, Kafka
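A sketch of creating each strategy and launching into one (group names are examples):

# Create placement groups with each strategy
aws ec2 create-placement-group --group-name hpc-cluster --strategy cluster
aws ec2 create-placement-group --group-name critical-spread --strategy spread
aws ec2 create-placement-group --group-name kafka-partition --strategy partition --partition-count 7

# Launch an instance into the cluster group
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type c5.large \
  --placement GroupName=hpc-cluster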
⚙️ EC2 Instance Management

Compute

Key Pair Management

Key pairs are used for secure authentication to EC2 instances. They use asymmetric cryptography: AWS stores the public key on the EC2 instance, and you keep the private key (.pem file) on your local machine.

  • RSA (2048-bit): Older, widely supported, works with PuTTY and OpenSSH
  • ED25519: Newer algorithm, more secure, faster, smaller key size. Not supported on Windows instances.
  • Once created, you CANNOT re-download the private key; save it securely immediately!
  • Set correct permissions on Linux/Mac: chmod 400 mykey.pem (otherwise SSH rejects the key)
# SSH with key pair
chmod 400 mykey.pem                                    # required permission (400 = owner read-only)
ssh -i mykey.pem ec2-user@54.12.34.56                  # connect to Amazon Linux
ssh -i mykey.pem ubuntu@54.12.34.56                    # connect to Ubuntu
ssh -i mykey.pem -p 2222 ec2-user@54.12.34.56          # custom port

# Lost your key pair? Recovery steps:
# 1. Stop instance
# 2. Detach root EBS volume
# 3. Attach volume to another "helper" EC2 instance as /dev/xvdf
# 4. Mount: sudo mount /dev/xvdf1 /mnt/recovery
# 5. Add your new public key to: /mnt/recovery/home/ec2-user/.ssh/authorized_keys
# 6. Unmount, detach, reattach to original instance, start it

Security Groups - In-Depth

Security Groups are stateful virtual firewalls at the instance level. They control inbound and outbound traffic to/from EC2 instances.

Feature | Security Group | Network ACL (NACL)
Level | Instance (ENI) level | Subnet level
Rule type | Allow rules ONLY | Allow AND Deny rules
Stateful? | YES; return traffic auto-allowed | NO; must define both directions explicitly
Rule evaluation | All rules evaluated; most permissive wins | Rules evaluated in number order; first match wins
Default behavior | All inbound denied; all outbound allowed | Default NACL: all allowed. Custom NACL: all denied.
Scope | Can be assigned to multiple instances | Applies to all instances in the subnet
Stateful explained: If you allow inbound HTTP (port 80) traffic, the Security Group automatically allows the response to go back out, even if there is no outbound rule for port 80. NACLs require explicit rules in BOTH directions.

Security Group Best Practices

  • Never allow SSH (22) from 0.0.0.0/0 (anywhere) in production; restrict to your office IP or VPN CIDR
  • Create separate SGs for each tier: web-sg (80/443 public), app-sg (8080 from web-sg only), db-sg (3306 from app-sg only)
  • Reference other Security Groups as sources instead of hardcoding IPs; it is dynamic and cleaner
  • Never allow 0.0.0.0/0 to RDS or database ports
  • Use outbound rules to restrict what your EC2 instances can reach
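The tiered pattern above can be built with SG-to-SG references; a minimal sketch (the sg-* IDs are placeholders):

# Web tier: HTTPS from the internet
aws ec2 authorize-security-group-ingress --group-id sg-web111 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# App tier: port 8080 only from the web tier's security group
aws ec2 authorize-security-group-ingress --group-id sg-app222 \
  --protocol tcp --port 8080 --source-group sg-web111

# DB tier: MySQL only from the app tier
aws ec2 authorize-security-group-ingress --group-id sg-db333 \
  --protocol tcp --port 3306 --source-group sg-app222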

EBS Volume Types (Deep Dive)

Volume | Type | Max IOPS | Max Throughput | Max Size | Use Case
gp3 | SSD | 16,000 | 1,000 MB/s | 16 TiB | Boot volumes, dev/test, low-latency interactive apps
gp2 | SSD (older) | 16,000 | 250 MB/s | 16 TiB | Legacy workloads (migrate to gp3: cheaper and better)
io1 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | 16 TiB | I/O-intensive databases
io2 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | 16 TiB | Critical databases (99.999% durability)
io2 Block Express | Provisioned IOPS SSD | 256,000 | 4,000 MB/s | 64 TiB | SAP HANA, Oracle RAC, highest performance
st1 | Throughput HDD | 500 | 500 MB/s | 16 TiB | Big data, log processing, streaming workloads
sc1 | Cold HDD | 250 | 250 MB/s | 16 TiB | Infrequent-access archives, lowest-cost HDD
gp3 vs gp2: gp3 is 20% cheaper and offers independent IOPS/throughput configuration (no burst credits). Always prefer gp3 for new workloads. gp3 baseline = 3,000 IOPS / 125 MB/s (vs gp2's 3 IOPS per GB, minimum 100).
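The gp2-to-gp3 migration is an online operation; a sketch with a placeholder volume ID:

# Convert gp2 to gp3 in place (no detach, no downtime)
aws ec2 modify-volume --volume-id vol-12345678 \
  --volume-type gp3 --iops 3000 --throughput 125

# Watch the modification progress
aws ec2 describe-volumes-modifications --volume-ids vol-12345678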

Storage and Snapshots

  • EBS Snapshot: Point-in-time backup of an EBS volume stored in S3 (you don't see the S3 bucket directly)
  • First snapshot: full copy of all data. Subsequent snapshots: incremental (only changed blocks)
  • Snapshots are stored across multiple AZs within a Region, making them highly durable
  • Use snapshots to: backup data, copy volumes to different AZs/Regions, create encrypted volumes from unencrypted
  • Fast Snapshot Restore (FSR): Eliminates the "lazy loading" behavior; volume is immediately at full performance. Costs extra.
  • Recycle Bin: Protect snapshots and AMIs from accidental deletion. Set retention rules.
  • Snapshot Lifecycle Manager (DLM): Automate snapshot creation and deletion on a schedule
# EBS Snapshot operations (AWS CLI)
# Create snapshot
aws ec2 create-snapshot \
  --volume-id vol-12345678 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Copy snapshot to another region
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-12345678 \
  --region ap-south-1 \
  --description "Cross-region copy"

# Create volume from snapshot (in different AZ)
aws ec2 create-volume \
  --snapshot-id snap-12345678 \
  --availability-zone ap-south-1b \
  --volume-type gp3

User Data and Metadata

User Data is a script that runs automatically when an EC2 instance is launched for the first time (first boot only by default). It runs as root user.

#!/bin/bash
# This runs at FIRST BOOT as root
set -e                              # exit on any error
exec > /var/log/user-data.log 2>&1  # redirect output to log file

yum update -y
yum install -y httpd php mysql git
systemctl start httpd
systemctl enable httpd

# Create a simple webpage (heredoc delimiter left unquoted so $(...) expands at boot)
cat > /var/www/html/index.html << EOF
<html><body>
<h1>Hello from EC2!</h1>
<p>Instance ID: $(curl -s http://169.254.169.254/latest/meta-data/instance-id)</p>
</body></html>
EOF

echo "User data completed successfully"

Instance Metadata is information about the running instance accessible from within the instance at the special IP 169.254.169.254. This is a link-local address, reachable only from within the instance itself.

# Instance Metadata Service (IMDS) - v1 (simpler)
curl http://169.254.169.254/latest/meta-data/                    # list all metadata categories
curl http://169.254.169.254/latest/meta-data/instance-id         # get instance ID
curl http://169.254.169.254/latest/meta-data/instance-type       # get instance type
curl http://169.254.169.254/latest/meta-data/public-ipv4         # get public IP
curl http://169.254.169.254/latest/meta-data/local-ipv4          # get private IP
curl http://169.254.169.254/latest/meta-data/hostname            # get hostname
curl http://169.254.169.254/latest/meta-data/placement/region    # get region
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole  # get IAM role temp creds

# Instance Metadata Service v2 (IMDSv2) - more secure (token-based)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

# User data (view the script that ran)
curl http://169.254.169.254/latest/user-data
IMDSv2 is recommended over IMDSv1. It requires a session token, protecting against SSRF attacks where malicious applications might try to read metadata to steal IAM credentials. You can enforce IMDSv2 on new instances in Launch Templates.

Launch Templates (vs Launch Configurations)

A Launch Template stores the full EC2 instance configuration. It is the recommended way to define configurations for Auto Scaling Groups and EC2 Fleet.

✅ Launch Templates (Recommended)

  • Supports multiple versions (v1, v2, v3...)
  • Mix On-Demand + Spot instances
  • Support all EC2 features including T2/T3 Unlimited, Dedicated Hosts, Capacity Reservations
  • Can be used with EC2 Fleet and Spot Fleet
  • Supports inheritance (create child from parent)

โŒ Launch Configurations (Legacy)

  • Immutable โ€” can't be modified after creation
  • Only On-Demand instances
  • Missing newer EC2 features
  • Being phased out โ€” AWS recommends migrating to Launch Templates
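A minimal sketch of creating and versioning a launch template (all values are examples; note that HttpTokens=required enforces IMDSv2):

# Version 1 of a simple web template
aws ec2 create-launch-template \
  --launch-template-name web-template \
  --launch-template-data '{
    "ImageId": "ami-0abcdef1234567890",
    "InstanceType": "t3.micro",
    "SecurityGroupIds": ["sg-12345678"],
    "MetadataOptions": {"HttpTokens": "required"}
  }'

# Add version 2 that changes only the instance type;
# ASGs can pin a specific version or track $Latest
aws ec2 create-launch-template-version \
  --launch-template-name web-template \
  --source-version 1 \
  --launch-template-data '{"InstanceType": "t3.small"}'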
⚖️ Elastic Load Balancing (ELB)

Compute

What is Load Balancing?

A Load Balancer sits in front of your servers and distributes incoming traffic across multiple targets (EC2 instances, containers, Lambda functions, or IP addresses) in multiple Availability Zones. It continuously monitors the health of registered targets and routes traffic only to healthy ones.

Why use a Load Balancer? Single server = single point of failure. Load balancer = high availability (if one instance fails, others continue serving), horizontal scalability (add more instances behind LB), SSL termination (offload HTTPS decryption), and health monitoring.

Types of Load Balancers

Type | OSI Layer | Protocols | Key Features | Best For
ALB (Application) | Layer 7 (Application) | HTTP, HTTPS, WebSocket, HTTP/2 | Path/host/header routing, WAF integration, Lambda targets, sticky sessions | Web apps, microservices, REST APIs, containers
NLB (Network) | Layer 4 (Transport) | TCP, UDP, TLS | Ultra-high performance, static IP per AZ, preserves source IP, TLS termination | Gaming, IoT, real-time trading, VPC Endpoint Services
GLB (Gateway) | Layers 3/4 (Network) | GENEVE (port 6081) | Transparent bump-in-the-wire traffic inspection, scales third-party appliances | Firewalls, IDS/IPS, DPI appliances
CLB (Classic) | Layer 4/7 | HTTP, HTTPS, TCP, SSL | Legacy service being deprecated | Old EC2-Classic apps (migrate to ALB/NLB)

Application Load Balancer (ALB) โ€” Deep Dive

ALB operates at Layer 7 (HTTP/HTTPS) and makes routing decisions based on request content.

ALB Routing Rules

  • Path-Based Routing: Route based on URL path. /api/* → API servers, /images/* → Image servers, / → Main app
  • Host-Based Routing: Route based on the HTTP Host header. app.example.com → App servers, api.example.com → API servers
  • HTTP Header Routing: Route based on custom headers (e.g., X-App-Version: v2 → new deployment)
  • Query String Routing: Route based on query parameters (e.g., ?version=mobile → mobile-optimized servers)
  • Weighted Target Groups: Send X% to target group 1, Y% to target group 2. Perfect for canary/blue-green deployments.
  • IP-Based Routing: Route specific IP addresses to specific target groups
# ALB Rule example (in AWS Console / CLI):
# IF path is /api/* → Forward to API-TG (api target group)
# IF path is /static/* → Forward to S3 bucket origin
# IF path starts with /admin AND source IP is 10.0.0.0/8 → Forward to Admin-TG
# IF host is mobile.example.com → Redirect to https://m.example.com/#{path}
# DEFAULT → Forward to Web-TG

ALB Fixed Response & Redirects

  • Fixed Response: Return a custom HTTP response (200, 404, etc.) without reaching any target
  • Redirect: Return HTTP 301/302 to redirect clients (e.g., HTTP to HTTPS redirect)

Network Load Balancer (NLB) โ€” Deep Dive

  • Handles millions of requests per second with extremely low latency
  • One static IP per AZ, useful when clients need to whitelist specific IPs (can use Elastic IPs)
  • Preserves source IP (client IP not replaced with the NLB IP). The target sees the actual client IP.
  • Supports TLS termination, offloading TLS decryption from targets
  • No Security Groups on the NLB itself; access is controlled by Security Groups on the target instances
  • Supports UDP (ALB does not), which is essential for DNS, VoIP, gaming
  • Health checks support TCP, HTTP, HTTPS

Target Groups โ€” Configuration

A Target Group is a logical grouping of targets that receives requests from a Load Balancer. Each listener rule points to a Target Group.

Target Type | What It Is | Use Case
Instance | EC2 instances by instance ID | Traditional EC2 workloads
IP Address | Specific IP addresses (private IPs in the VPC or on-premises) | Containers with dynamic ports, on-premises servers via Direct Connect/VPN
Lambda Function | A Lambda function (ALB only) | Serverless backends, event-driven apps
ALB | Another ALB (NLB only) | When you need NLB's static IP but ALB's HTTP routing

Health Checks

Health checks run continuously. Unhealthy targets are removed from rotation until they recover. You configure:

  • Protocol: HTTP, HTTPS, TCP, TCP_UDP, TLS
  • Path: URL path to check (e.g., /health or /ping)
  • Port: Which port to check (usually traffic port)
  • Healthy threshold: # consecutive successes before marking healthy (default 3)
  • Unhealthy threshold: # consecutive failures before marking unhealthy (default 2)
  • Interval: Seconds between health checks (default 30s)
  • Timeout: Max time for response (default 5s)
  • Success codes: HTTP codes that count as success (default 200)
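These settings map directly onto target-group creation; a sketch with a placeholder VPC ID:

aws elbv2 create-target-group \
  --name web-tg \
  --protocol HTTP --port 80 \
  --vpc-id vpc-12345678 \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 2 \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --matcher HttpCode=200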

Sticky Sessions (Session Affinity)

Sticky sessions ensure a user's requests go to the SAME target throughout a session. Useful for stateful apps that store session data locally on the instance.

  • Application-based cookies: Your application sets the cookie with custom name and TTL
  • Duration-based cookies: LB generates a cookie with TTL you set (AWSALB cookie by default)
  • Warning: Can create uneven load distribution if some users have long sessions
  • Better architecture: use ElastiCache or DynamoDB for session storage, making the app stateless so sticky sessions are not needed
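Duration-based stickiness is a target-group attribute; a sketch (the ARN is a placeholder):

# Enable LB-generated cookie stickiness with a 1-day TTL
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/abc123 \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=lb_cookie \
               Key=stickiness.lb_cookie.duration_seconds,Value=86400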

Cross-Zone Load Balancing

With cross-zone load balancing, each LB node distributes traffic evenly across ALL registered instances in ALL enabled AZs.

Without Cross-Zone

  • AZ-a LB node: distributes among only AZ-a instances
  • AZ-b LB node: distributes among only AZ-b instances
  • If AZ-a has 2 instances and AZ-b has 8, they get different traffic
  • Uneven distribution possible

With Cross-Zone

  • Each LB node distributes across ALL instances in ALL AZs
  • 10 instances total = each gets exactly 10% of traffic
  • ALB: enabled by default, no charge
  • NLB/GLB: disabled by default, data transfer charges apply
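For an NLB the setting is a load balancer attribute; a sketch (the ARN is a placeholder):

# Enable cross-zone load balancing on an NLB (disabled by default)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/abc123 \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true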
💰 Billing and Monitoring

Compute

AWS Billing Overview

AWS billing is usage-based: you pay only for what you use, when you use it. There are no upfront costs for most services. Understanding billing is critical to avoid unexpected charges.

Key Billing Dimensions

  • Compute Time: EC2 billed per second (minimum 60 seconds) for Linux. Per hour for Windows/RHEL/SUSE.
  • Data Transfer: Inbound (ingress) is FREE. Outbound (egress) to internet is charged. Traffic within same AZ is free. Between AZs in same Region: $0.01/GB.
  • Storage: EBS (per GB provisioned/month), S3 (per GB stored + requests), etc.
  • Requests: API calls to services like S3 (PUT/GET requests), API Gateway, Lambda invocations.

AWS Free Tier

Service | Free Tier Amount | Duration | Type
EC2 | 750 hours/month of t2.micro or t3.micro (Linux and Windows counted separately) | 12 months | New accounts
S3 | 5 GB storage, 20,000 GET requests, 2,000 PUT requests | 12 months | New accounts
RDS | 750 hours/month of db.t2.micro or db.t3.micro, Single-AZ | 12 months | New accounts
Lambda | 1 million requests + 400,000 GB-seconds compute time | Always free | Perpetual
CloudWatch | 10 custom metrics, 10 alarms, 5 GB log data | Always free | Perpetual
SNS | 1 million publishes, 100,000 HTTP deliveries | Always free | Perpetual
DynamoDB | 25 GB storage, 25 WCUs + 25 RCUs (enough for ~200M requests/month) | Always free | Perpetual

AWS Billing Tools

💳 Cost Explorer

  • Visualize cost and usage over past 12 months
  • Forecast next 12 months based on trends
  • Filter/group by service, region, account, tag, instance type
  • Savings Plans and Reserved Instance recommendations
  • Free to use

📋 AWS Budgets

  • Set custom cost and usage budgets
  • Alert via email or SNS when you hit thresholds (e.g., 80%, 100% of budget)
  • Types: Cost budget, Usage budget, RI utilization budget, Savings Plans budget
  • First 2 budgets free; $0.02/day per additional budget

🧾 Cost and Usage Reports (CUR)

  • Most detailed billing data available
  • Exported to S3 bucket daily or monthly
  • Line-item charges for every resource hourly
  • Can analyze with Athena or import to Redshift

๐Ÿท๏ธ Cost Allocation Tags

  • Tag resources (e.g., Project=AppA, Environment=Prod)
  • Activate tags in Billing console
  • Filter Cost Explorer by these tags
  • Essential for multi-project/multi-team accounts
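Cost Explorer data is also available programmatically; a sketch (the dates are examples):

# Monthly unblended cost, grouped by service
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE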

CloudWatch Alarms - Detailed

CloudWatch Alarms watch a single metric and perform one or more actions when that metric breaches a threshold over a specified number of evaluation periods.

Alarm State | Meaning | When It Occurs
OK | Metric is within the defined threshold | Metric is healthy
ALARM | Metric has breached the threshold for the specified periods | Action is triggered
INSUFFICIENT_DATA | Not enough data points to determine state | Service just started, metric gap, new alarm

Alarm Configuration

  • Metric: What to watch (e.g., CPUUtilization for EC2 instance i-xxxx)
  • Statistic: How to aggregate data points (Average, Sum, Minimum, Maximum, p90)
  • Period: Time window for each evaluation (60s, 300s, 3600s)
  • Evaluation Periods: How many consecutive periods must breach threshold to trigger alarm
  • Datapoints to Alarm: Of the evaluation periods, how many must breach (M of N evaluation)
  • Threshold: The value that defines the breach (e.g., > 80%)
  • Actions: What to do when the alarm triggers (SNS notification, EC2 action, Auto Scaling)
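Putting the pieces together, a CPU alarm with M-of-N evaluation (2 of 3 five-minute periods) might look like this; the instance ID and SNS ARN are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-web-01" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts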

Common EC2 CloudWatch Metrics

Metric | Description | Monitoring Period | Notes
CPUUtilization | % of allocated EC2 compute units in use | Basic: 5 min, Detailed: 1 min | Available by default
NetworkIn / NetworkOut | Bytes received/sent on all network interfaces | Basic: 5 min | Available by default
NetworkPacketsIn / NetworkPacketsOut | Packets received/sent | Basic: 5 min | Available by default
DiskReadOps / DiskWriteOps | IOPS completed for instance store | Basic: 5 min | Instance store only (not EBS)
DiskReadBytes / DiskWriteBytes | Bytes read/written to instance store | Basic: 5 min | Instance store only
StatusCheckFailed_Instance | Instance OS/software failure | 1 min | Action: reboot/recover
StatusCheckFailed_System | AWS physical host failure | 1 min | Action: recover (migrates to a new host)
MemoryUtilization* | % of RAM in use | Custom metric | Requires CloudWatch Agent!
DiskSpaceUtilization* | % of disk used | Custom metric | Requires CloudWatch Agent!
Memory & Disk NOT collected by default! Memory utilization, disk space usage, and swap usage require you to install the CloudWatch Agent on your EC2 instance and configure it to send these metrics. This is a very common exam question.

Billing Alarms Setup

# Set up billing alarm (must be in us-east-1 region)
# Step 1: Enable billing alerts in Billing โ†’ Billing Preferences โ†’ Receive Billing Alerts

# Step 2: Create SNS topic for notification
aws sns create-topic --name billing-alerts --region us-east-1

# Step 3: Subscribe your email to topic  
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789:billing-alerts \
  --protocol email \
  --notification-endpoint your@email.com

# Step 4: Create CloudWatch alarm (ONLY works in us-east-1)
aws cloudwatch put-metric-alarm \
  --alarm-name "Monthly-Bill-Exceeds-10USD" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 86400 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:billing-alerts \
  --dimensions Name=Currency,Value=USD \
  --region us-east-1
📈 Auto Scaling

Compute

What is EC2 Auto Scaling?

EC2 Auto Scaling automatically adds or removes EC2 instances based on demand conditions you define. It ensures your application always has the right number of instances available to handle load, provides fault tolerance by replacing unhealthy instances, and optimizes costs by removing unnecessary instances.

⚡ Performance

  • Scale out when demand spikes
  • Users always get fast response times
  • No manual intervention needed
  • Works with ELB for traffic distribution

💰 Cost Optimization

  • Scale in when demand drops
  • Pay only for instances you need
  • Use Spot Instances in the ASG for up to 90% savings
  • No over-provisioning

๐Ÿ›ก๏ธ High Availability

  • Replace unhealthy instances automatically
  • Spans multiple AZs
  • Rebalances instances across AZs
  • Minimum capacity always maintained

Auto Scaling Group (ASG) - Key Concepts

  • Launch Template: Defines what instances to launch (AMI, instance type, security groups, key pair, IAM role, user data)
  • Minimum Capacity: ASG never goes below this number (even during quiet periods). Ensures minimum availability.
  • Desired Capacity: The number of instances ASG tries to maintain right now. ASG launches/terminates to reach this number.
  • Maximum Capacity: ASG never exceeds this number (cost protection). Sets your maximum scale-out limit.
  • VPC and Subnets: Choose subnets in multiple AZs. ASG will balance instances across them.
  • Load Balancer: Attach to ALB/NLB target group. New instances automatically registered; terminated instances deregistered.
  • Health Checks: EC2 status checks (default) or ELB health checks (recommended for web apps)
Example: Min=2, Desired=4, Max=10. ASG maintains 4 instances normally. During high traffic, scales to up to 10. During quiet periods, scales down to minimum 2 (never less).
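A sketch of creating such a group via the CLI (the template name, subnet IDs, and target group ARN are placeholders):

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-template,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 4 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --target-group-arns arn:aws:elasticloadbalancing:ap-south-1:123456789012:targetgroup/web-tg/abc123 \
  --health-check-type ELB --health-check-grace-period 300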

Scaling Types and Policies

Policy Type | How It Works | Trigger | Best For
Manual Scaling | Manually change desired capacity in console or CLI | Human action | Planned events, maintenance windows, fixed capacity
Simple Scaling | One action per alarm breach. Waits for cooldown before next action. | CloudWatch alarm | Simple workloads (legacy; prefer Step/Target)
Step Scaling | Different actions based on HOW FAR metric is from threshold | CloudWatch alarm | When you need proportional response to varying load
Target Tracking | Automatically scale to keep a metric at a target value | Metric target value | Most workloads: simplest and most effective
Scheduled Scaling | Scale based on time (cron expression) | Date/time schedule | Known traffic patterns (business hours, weekly peaks)
Predictive Scaling | ML model predicts future load and pre-scales proactively | ML forecast | Recurring cyclical patterns (daily, weekly)

Target Tracking Policy (Recommended)

The most commonly used scaling policy. You specify a target value for a metric and Auto Scaling creates CloudWatch alarms automatically to scale in/out to maintain the target.

Predefined Metric | Description | Common Target
ASGAverageCPUUtilization | Average CPU across all instances in the ASG | 50-70%
ALBRequestCountPerTarget | Number of requests per instance from ALB | 1000 req/instance
ASGAverageNetworkIn | Average network bytes in per instance | Depends on app
ASGAverageNetworkOut | Average network bytes out per instance | Depends on app
# Target Tracking: Keep average CPU at 50%
# ASG will automatically:
# - Add instances if CPU goes above 50%
# - Remove instances if CPU drops below ~45% (built-in buffer)
# You don't write alarm rules; AWS manages them automatically
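A sketch of attaching such a policy via the CLI (the ASG and policy names are placeholders):

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
    "TargetValue": 50.0
  }'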

Step Scaling

Define multiple scaling steps based on how much the metric breaches the threshold. More granular control than Simple Scaling. Does NOT wait for cooldown between steps.

# Example Step Scaling Configuration:
# Scale OUT (add capacity):
#   CPU 50-60% → add 1 instance
#   CPU 60-75% → add 2 instances
#   CPU 75-90% → add 3 instances
#   CPU > 90%  → add 4 instances

# Scale IN (remove capacity):
#   CPU 40-50% → remove 1 instance
#   CPU 30-40% → remove 2 instances
#   CPU < 30%  → remove 3 instances

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name scale-out-policy \
  --policy-type StepScaling \
  --step-adjustments MetricIntervalLowerBound=0,MetricIntervalUpperBound=10,ScalingAdjustment=1 \
                     MetricIntervalLowerBound=10,MetricIntervalUpperBound=25,ScalingAdjustment=2 \
                     MetricIntervalLowerBound=25,ScalingAdjustment=3 \
  --adjustment-type ChangeInCapacity \
  --metric-aggregation-type Average

Scheduled Scaling

# Scale up Mon-Fri at 8 AM IST (2:30 AM UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-up-mornings \
  --recurrence "30 2 * * 1-5" \
  --min-size 4 --max-size 20 --desired-capacity 8

# Scale down Mon-Fri at 8 PM IST (2:30 PM UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-down-evenings \
  --recurrence "30 14 * * 1-5" \
  --min-size 2 --max-size 10 --desired-capacity 2

# Scale up for expected traffic spike (one-time)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name pre-event-scale \
  --start-time "2024-03-01T02:00:00Z" \
  --desired-capacity 20

Instance Lifecycle and Termination Policy

  • Scale-out lifecycle: Pending → InService → Healthy (registered with LB)
  • Scale-in lifecycle: InService → Terminating:Wait (lifecycle hook) → Terminated
  • Lifecycle Hooks: Pause instance during launch or termination to run custom actions (configure software, drain connections, extract logs); see the sketch below
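A sketch of a termination hook that holds instances for up to 5 minutes so a script can drain connections or ship logs (the ASG and hook names are placeholders):

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name web-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE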
Termination Policy | How ASG Decides Which Instance to Terminate
Default | AZ with the most instances → oldest launch template/configuration → closest to next billing hour → random
OldestInstance | Terminates the oldest instance in the group
NewestInstance | Terminates the newest instance (useful for rolling updates testing)
OldestLaunchTemplate | Terminates instances using the oldest launch template (good for rolling updates)
ClosestToNextInstanceHour | Terminates the instance closest to the next billing hour (cost optimization)
Cooldown Period: Default 300 seconds (5 min). After a scaling activity, ASG waits for cooldown before evaluating more alarms. Prevents thrashing. Use shorter cooldowns for scale-in (saves money faster). Use default or longer for scale-out (let metrics stabilize).
💾

EBS / EFS Storage

Storage Services

EBS — Elastic Block Store

EBS provides persistent block-level storage for EC2 instances. Think of it as a network-attached hard drive. When you terminate an EC2 instance, the EBS root volume is deleted by default (configurable), but additional EBS volumes persist. EBS volumes are automatically replicated within their AZ.

Block Storage vs Object Storage: Block storage (EBS) stores data in fixed-size blocks, like a hard drive. You can format it with a filesystem (ext4, xfs) and use it like a disk. Object storage (S3) stores entire files as objects; you cannot mount it like a drive.

EBS Volume Types (Detailed)

Volume Type | IOPS | Throughput | Size | Multi-Attach | Use Case
gp3 (General SSD) | 3,000-16,000 | 125-1,000 MB/s | 1 GiB-16 TiB | No | Boot volumes, dev/test, small/medium databases, virtual desktops
io2 (Provisioned IOPS SSD) | 100-64,000 | 1,000 MB/s | 4 GiB-16 TiB | Yes (same AZ) | I/O-intensive databases: MySQL, Oracle, SQL Server
io2 Block Express | up to 256,000 | 4,000 MB/s | 4 GiB-64 TiB | Yes | SAP HANA, Oracle RAC, mission-critical workloads
st1 (Throughput HDD) | 500 max | 500 MB/s | 125 GiB-16 TiB | No | Big data, data warehouses, log processing, Hadoop
sc1 (Cold HDD) | 250 max | 250 MB/s | 125 GiB-16 TiB | No | Cold data requiring few scans/day. Cheapest option.
IOPS vs Throughput: IOPS = Input/Output Operations Per Second (number of read/write requests). Throughput = MB/s (amount of data transferred). Small random I/O (databases) = care about IOPS. Large sequential I/O (big data, video) = care about throughput.

Key EBS Facts to Know

  • EBS volume and the EC2 instance it attaches to must be in the same AZ
  • To use an EBS volume in a different AZ: take snapshot → create new volume from snapshot in target AZ
  • To use in a different region: take snapshot → copy snapshot to target region → create volume (see the sketch after this list)
  • Root EBS volume: deleted on instance termination by default (change DeleteOnTermination to false)
  • Additional EBS volumes: NOT deleted on termination by default
  • EBS Multi-Attach (io1/io2 only): Same volume attached to up to 16 instances in same AZ simultaneously. Application must handle concurrent writes (clustering software like Oracle RAC).
  • Snapshots are incremental: only changed data blocks are stored after the initial full snapshot
  • You can take a snapshot of a running instance, but it's better to stop first for consistency
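A sketch of the cross-region move (volume and snapshot IDs are placeholders; copy-snapshot returns the new snapshot ID used in the final step):

# 1. Snapshot the source volume
aws ec2 create-snapshot --volume-id vol-0abc123 --description "migration snapshot"
# 2. Copy the snapshot to the target region
aws ec2 copy-snapshot --source-region ap-south-1 --source-snapshot-id snap-0abc123 \
  --region us-east-1 --description "cross-region copy"
# 3. Create a volume from the copied snapshot in the target AZ
aws ec2 create-volume --snapshot-id snap-0def456 --volume-type gp3 \
  --availability-zone us-east-1a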

EBS Encryption

  • Uses AWS KMS Customer Master Keys (CMK), either AWS-managed or customer-managed
  • Encryption is at rest AND in transit between EC2 and EBS
  • Minimal performance impact (handled by hardware)
  • Snapshots of encrypted volumes are automatically encrypted
  • Volumes created from encrypted snapshots are automatically encrypted
  • You can't remove encryption from an encrypted volume
# How to encrypt an existing UNENCRYPTED EBS volume:
# Direct encryption of an existing volume is NOT possible; use this workaround:
# Step 1: Create a snapshot of the unencrypted volume
aws ec2 create-snapshot --volume-id vol-unencrypted --description "Pre-encryption backup"

# Step 2: Copy the snapshot with encryption enabled
aws ec2 copy-snapshot \
  --source-region ap-south-1 \
  --source-snapshot-id snap-unencrypted \
  --encrypted \
  --kms-key-id arn:aws:kms:ap-south-1:123:key/your-key

# Step 3: Create a new encrypted volume from the encrypted snapshot
aws ec2 create-volume --snapshot-id snap-encrypted --volume-type gp3 \
  --availability-zone ap-south-1a

# Step 4: Detach old volume, attach new encrypted volume to instance
# Step 5: Update /etc/fstab if needed

Mounting and Managing EBS Volumes on EC2

# Step 1: Verify the volume is attached
lsblk                              # shows: xvda (root), xvdb (new unformatted)
lsblk -f                           # check if filesystem exists

# Step 2: Create filesystem (first time only; destroys existing data!)
sudo mkfs.ext4 /dev/xvdb           # format as ext4
# OR
sudo mkfs.xfs /dev/xvdb            # format as xfs (Amazon Linux default)

# Step 3: Create mount point
sudo mkdir -p /data

# Step 4: Mount the volume
sudo mount /dev/xvdb /data
df -h                              # verify it's mounted and available space

# Step 5: Make it permanent by adding to /etc/fstab
# Get UUID first (better than device name; device names can change)
sudo blkid /dev/xvdb               # shows UUID

# Add to /etc/fstab (edit with: sudo nano /etc/fstab):
# UUID=xxxx-xxxx /data ext4 defaults,nofail 0 2
# "nofail" is critical โ€” prevents boot failure if volume not attached

# Test fstab entry
sudo umount /data
sudo mount -a                      # mounts everything in fstab
df -h                              # verify

EFS — Elastic File System

EFS is a fully managed, scalable, shared file system (NFS, Network File System) for Linux workloads. Unlike EBS (one instance at a time), EFS can be mounted concurrently by thousands of EC2 instances across multiple AZs simultaneously. It automatically grows and shrinks as you add and remove files; no capacity management is needed.

EFS vs EBS vs S3 — Comparison

Feature | EFS (Elastic File System) | EBS (Elastic Block Store) | S3 (Simple Storage)
Storage type | File (NFS) | Block | Object
Multi-instance access | YES: thousands of instances | NO (one at a time, except Multi-Attach io2) | YES: accessible from anywhere
Multi-AZ | YES (Standard, Regional) | NO: single AZ only | YES: minimum 3 AZs
OS support | Linux only (POSIX) | Linux and Windows | Any (HTTP API)
Mount as filesystem | YES (NFS mount) | YES (block device) | NO (not a filesystem)
Capacity management | Automatic (elastic) | Fixed (you provision) | Unlimited
Max size | Petabytes (auto-scale) | 16 TiB (64 TiB for io2 Block Express) | Unlimited
Relative cost | ~3x gp2 EBS | Baseline | Cheapest per GB
Use case | Shared storage, CMS, home dirs, containers | Boot volumes, databases, app data | Backups, static assets, data lakes

EFS Storage Classes and Lifecycle

Storage Class | Availability | Cost | Use Case
EFS Standard | Multi-AZ (3+ AZs) | $0.30/GB/month | Frequently accessed files
EFS Standard-IA | Multi-AZ | $0.025/GB/month + retrieval | Infrequent access (save 92% vs Standard)
EFS One Zone | Single AZ | $0.153/GB/month | Dev/test, non-critical data (20% cheaper)
EFS One Zone-IA | Single AZ | $0.0133/GB/month | Dev/test infrequent access (cheapest)

EFS Lifecycle Management: Automatically moves files to Standard-IA after they haven't been accessed for 7, 14, 30, 60, or 90 days. Files moved back to Standard on access. Reduces storage costs significantly for mixed workloads.
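A sketch of configuring this via the CLI (the filesystem ID is a placeholder):

aws efs put-lifecycle-configuration \
  --file-system-id fs-0123456789abcdef0 \
  --lifecycle-policies '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'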

EFS Performance Modes

  • General Purpose (default): Lowest latency. Ideal for web servers, home directories, content management. Limit of 35,000 IOPS.
  • Max I/O: Scale to higher throughput (hundreds of thousands of operations/sec). Slightly higher latency. Ideal for parallel workloads: big data, media processing, genomics.
  • Bursting Throughput: Throughput scales with filesystem size (50 MB/s per TB, burst to 100 MB/s per TB)
  • Provisioned Throughput: Specify throughput independently of storage size. Pay for provisioned throughput above earned burst rate.
  • Elastic Throughput (newest): Automatically scales throughput up and down based on workload. Best for spiky or unpredictable workloads.

Mounting EFS on EC2 Instances

# Install EFS utilities (handles NFS mounting and TLS encryption)
sudo yum install -y amazon-efs-utils        # Amazon Linux
sudo apt-get install amazon-efs-utils       # Ubuntu

# Mount using EFS mount helper (recommended; supports encryption in transit)
sudo mkdir -p /mnt/efs
sudo mount -t efs -o tls fs-0123456789:/ /mnt/efs    # with TLS encryption
sudo mount -t efs fs-0123456789:/ /mnt/efs            # without TLS

# Mount specific directory/subdirectory
sudo mount -t efs -o tls fs-0123456789:/myapp /mnt/app

# Verify mount
df -h /mnt/efs
ls /mnt/efs

# Auto-mount on reboot (/etc/fstab)
fs-0123456789:/ /mnt/efs efs _netdev,tls,iam 0 0
# _netdev = wait for network before mounting
# iam = use IAM for authorization

# Mount using NFS directly (without efs utils)
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-0123456789.efs.ap-south-1.amazonaws.com:/ /mnt/efs
Security: EFS uses Security Groups for network access. The Security Group attached to EFS must allow inbound NFS (port 2049) from the Security Groups of EC2 instances that need to mount it.
๐Ÿฐ

VPC — Virtual Private Cloud

Networking

What is a VPC?

A Virtual Private Cloud (VPC) is your own logically isolated section of the AWS cloud. Think of it as your own private data center inside AWS: you have complete control over your virtual networking environment, including IP address ranges, subnets, route tables, and network gateways. Every AWS account gets a default VPC in each region so you can launch resources immediately.

Analogy: A VPC is like your own apartment building. You control who enters (Internet Gateway), how floors are organized (subnets), which rooms talk to each other (route tables), and security at each door (Security Groups & NACLs). AWS just provides the building infrastructure.

Core VPC Concepts

Concept | Description | Example
VPC | Isolated virtual network in a region. Spans all AZs in that region. | 10.0.0.0/16 (65,536 IPs)
Subnet | A subdivision of a VPC within a single AZ. Resources live in subnets. | 10.0.1.0/24 in ap-south-1a
Route Table | Set of rules (routes) that determine where network traffic is directed. | 0.0.0.0/0 → IGW
Internet Gateway (IGW) | Allows communication between VPC and the internet. Horizontally scaled, HA, no bandwidth limits. | Attach to VPC for internet access
NAT Gateway | Allows private subnet resources to access the internet but prevents inbound connections from the internet. | Private EC2 downloading updates
Security Group | Virtual stateful firewall at instance level. Controls inbound/outbound traffic. | Allow port 80 from 0.0.0.0/0
NACL | Stateless firewall at subnet level. Rules evaluated in order by number. | Deny rule 100: block bad IP
CIDR Block | IP address range assigned to VPC or subnet using CIDR notation. | 192.168.0.0/24 = 256 IPs

IP Addressing Deep Dive

Understanding IP addressing is fundamental to VPC design. AWS uses IPv4 CIDR notation where the number after the slash indicates how many bits are the network portion.

CIDR | Total IPs | Usable IPs (AWS reserves 5) | Use Case
/16 | 65,536 | 65,531 | VPC (large enterprise)
/20 | 4,096 | 4,091 | Large subnet
/24 | 256 | 251 | Standard subnet
/28 | 16 | 11 | Small subnet (minimum for AWS)
AWS reserves 5 IPs per subnet: .0 (network address), .1 (VPC router), .2 (DNS server), .3 (future use), .255 (broadcast). So a /24 gives you 251 usable IPs, not 256.
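The arithmetic, worked for a /24:

# Worked example: 10.0.1.0/24
# 32 - 24 = 8 host bits → 2^8 = 256 total addresses
# AWS reserves 10.0.1.0, .1, .2, .3 and 10.0.1.255 → 256 - 5 = 251 usable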

Private IP Ranges (RFC 1918)

These IP ranges are not routable on the public internet; they're used for private networks like VPCs. Always use these for VPC CIDR blocks.

10.0.0.0    - 10.255.255.255   (10.0.0.0/8)     Class A, 16M addresses
172.16.0.0  - 172.31.255.255   (172.16.0.0/12)  Class B, 1M addresses
192.168.0.0 - 192.168.255.255  (192.168.0.0/16) Class C, 65K addresses

# AWS default VPC always uses: 172.31.0.0/16
# Best practice for custom VPC: use 10.0.0.0/16 (avoids overlap with default)

Public vs Private Subnets

🌐 Public Subnet

  • Route table has route to Internet Gateway (0.0.0.0/0 → IGW)
  • Resources can have public IPs
  • Accessible from internet
  • Used for: Load balancers, Bastion hosts, NAT Gateways, Web servers
  • Example CIDR: 10.0.1.0/24, 10.0.2.0/24

🔒 Private Subnet

  • No route to Internet Gateway
  • Internet access via NAT Gateway only (outbound)
  • NOT directly reachable from internet
  • Used for: Application servers, Databases, Lambda in VPC, Cache
  • Example CIDR: 10.0.11.0/24, 10.0.12.0/24

Internet Gateway (IGW)

The IGW is the door between your VPC and the public internet. It performs Network Address Translation (NAT) for instances with public IPs, translating private IPs to public IPs for outbound traffic and vice versa for inbound.

  • One IGW per VPC; horizontally scaled by AWS, fully HA
  • Free: no cost for the gateway itself (only data transfer charges)
  • Two steps to enable internet: (1) attach IGW to VPC, (2) add route in public subnet route table
  • Only works for resources with a public/Elastic IP address
# Public subnet route table
Destination     Target
10.0.0.0/16    local          ← all VPC traffic stays local
0.0.0.0/0      igw-xxxxxxxx   ← everything else goes to internet

NAT Gateway

NAT (Network Address Translation) Gateway allows EC2 instances in private subnets to initiate outbound connections to the internet (download patches, call APIs) while preventing the internet from initiating connections into your private instances.

  • Managed by AWS: highly available, auto-scales up to 100 Gbps
  • Placed in a public subnet; needs a public subnet + Elastic IP
  • Cost: ~$0.045/hour + $0.045/GB data processed (can be expensive!)
  • For HA: Create one NAT Gateway per AZ; if one AZ fails, other AZs still have internet access
  • Not needed for S3/DynamoDB: use VPC Endpoints instead (free)
# Private subnet route table
Destination     Target
10.0.0.0/16    local              ← VPC traffic stays local
0.0.0.0/0      nat-xxxxxxxx       ← internet via NAT Gateway

Custom VPC Creation — Step by Step

  1. Plan CIDR: Choose VPC CIDR (e.g., 10.0.0.0/16). Plan subnets: public /24s and private /24s across 2+ AZs for HA.
  2. Create VPC: AWS Console → VPC → Create VPC. Enter name, IPv4 CIDR. Enable DNS hostnames and DNS resolution.
  3. Create Subnets: Create public subnets (10.0.1.0/24 in AZ-a, 10.0.2.0/24 in AZ-b) and private subnets (10.0.11.0/24 in AZ-a, 10.0.12.0/24 in AZ-b).
  4. Create & Attach IGW: Create Internet Gateway → Attach to your VPC. One IGW per VPC.
  5. Create NAT Gateway: In a public subnet → allocate Elastic IP → create NAT GW. Create one per AZ for HA.
  6. Configure Route Tables: Public RT: add 0.0.0.0/0 → IGW, associate public subnets. Private RT: add 0.0.0.0/0 → NAT GW, associate private subnets.
  7. Enable Auto-assign Public IP: On public subnets → enable auto-assign public IPv4 so instances get public IPs automatically.
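The same steps sketched with the CLI (every ID shown is a placeholder for the value returned by the previous call):

aws ec2 create-vpc --cidr-block 10.0.0.0/16                                   # → vpc-111
aws ec2 modify-vpc-attribute --vpc-id vpc-111 --enable-dns-hostnames '{"Value":true}'
aws ec2 create-subnet --vpc-id vpc-111 --cidr-block 10.0.1.0/24 \
  --availability-zone ap-south-1a                                             # → subnet-aaa (public)
aws ec2 create-internet-gateway                                               # → igw-222
aws ec2 attach-internet-gateway --internet-gateway-id igw-222 --vpc-id vpc-111
aws ec2 allocate-address --domain vpc                                         # → eipalloc-333
aws ec2 create-nat-gateway --subnet-id subnet-aaa --allocation-id eipalloc-333
aws ec2 create-route-table --vpc-id vpc-111                                   # → rtb-444
aws ec2 create-route --route-table-id rtb-444 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-222
aws ec2 associate-route-table --route-table-id rtb-444 --subnet-id subnet-aaa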

VPC Flow Logs

Flow Logs capture information about IP traffic going to/from network interfaces in your VPC. Essential for security analysis, troubleshooting, and compliance.

  • Can be enabled at VPC, subnet, or ENI (network interface) level
  • Destinations: CloudWatch Logs, S3, Kinesis Data Firehose
  • Captures: source IP, destination IP, port, protocol, bytes, action (ACCEPT/REJECT), status
  • Does NOT capture: DNS queries, DHCP traffic, instance metadata (169.254.x.x), Windows license activation
  • Use Athena on S3 to query flow logs for security investigations
# Flow log record format:
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2       123456789   eni-abc123   10.0.1.5  8.8.8.8  54321   443     6        10      5000  1609459200 1609459260 ACCEPT OK
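A sketch of enabling flow logs with S3 as the destination (the VPC ID and bucket ARN are placeholders):

aws ec2 create-flow-logs \
  --resource-type VPC --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-flow-logs-bucket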

Elastic Network Interface (ENI)

An ENI is a virtual network card you can attach to EC2 instances. Every instance has at least one ENI (eth0, the primary). You can create additional ENIs and attach/detach them from instances.

  • Has: primary private IP, secondary private IPs, Elastic IP, MAC address, security groups
  • Bound to a specific AZ; cannot move across AZs
  • Use cases: Dual-homed instances (in two subnets), management network separation, move IP between instances on failover
  • Used by Lambda (when in VPC), ECS tasks, RDS, ElastiCache internally

AWS Networking Best Practices

Production VPC Design:
  • Always use multiple AZs (minimum 2, ideally 3) for high availability
  • Separate public, private-app, and private-db subnet tiers
  • Use /16 for VPC CIDR: gives room to grow (65K IPs)
  • Never use overlapping CIDRs if you plan to peer VPCs
  • One NAT Gateway per AZ to avoid cross-AZ traffic costs and AZ dependency
  • Use VPC Endpoints for S3 and DynamoDB to avoid NAT Gateway costs
  • Enable VPC Flow Logs from day one; invaluable for debugging and security
  • Use security groups as primary firewall, NACLs only for broad subnet-level rules

Reference Architecture: 3-Tier VPC

Region: ap-south-1
VPC: 10.0.0.0/16
├── AZ: ap-south-1a                    AZ: ap-south-1b
│   ├── Public Subnet 10.0.1.0/24      Public Subnet 10.0.2.0/24
│   │   ├── ALB node                   ALB node
│   │   └── NAT Gateway (EIP)          NAT Gateway (EIP)
│   ├── Private-App 10.0.11.0/24       Private-App 10.0.12.0/24
│   │   └── EC2 App Servers            EC2 App Servers
│   └── Private-DB  10.0.21.0/24       Private-DB  10.0.22.0/24
│       └── RDS Primary                RDS Standby (Multi-AZ)
│
├── Internet Gateway (attached to VPC)
├── Public Route Table  → 0.0.0.0/0 to IGW
├── Private Route Table → 0.0.0.0/0 to NAT GW (per AZ)
└── VPC Endpoints: S3 Gateway, DynamoDB Gateway (free!)
🛡️

VPC Controls

Networking

Security Groups vs NACLs — The Key Difference

AWS gives you two layers of network security. Understanding when to use each is critical for both the exam and real-world architecture.

🔒 Security Groups (SG)

  • Level: Instance/ENI level
  • Stateful: Return traffic automatically allowed
  • Rules: Allow only โ€” no deny rules
  • Evaluation: All rules evaluated together
  • Default: Deny all inbound, allow all outbound
  • Scope: Applies to specific instances
  • Can reference other SGs as source/destination

🧱 NACLs (Network ACL)

  • Level: Subnet level
  • Stateless: Must explicitly allow return traffic
  • Rules: Allow AND Deny rules
  • Evaluation: Rules evaluated in number order (lowest first)
  • Default NACL: Allows all in/out
  • Scope: Applies to all instances in subnet
  • Cannot reference SGs
Stateless means you must open ephemeral ports! When a client connects, the response comes back on an ephemeral port (1024-65535). NACLs must explicitly allow this range on inbound rules for return traffic.

Security Group — Deep Dive

Security Groups act as virtual firewalls controlling traffic to/from EC2 instances. They're the primary and most-used security control in AWS.

# Security Group Rules: key concepts
# Inbound: who can SEND traffic TO your instance
# Outbound: where your instance can SEND traffic TO

# Example: Web server SG
Inbound Rules:
  Type      Port    Source          Purpose
  HTTP      80      0.0.0.0/0       Allow all web traffic
  HTTPS     443     0.0.0.0/0       Allow all HTTPS traffic
  SSH       22      10.0.0.0/8      Allow SSH from internal only

Outbound Rules:
  Type      Port    Destination     Purpose
  All       All     0.0.0.0/0       Allow all outbound (default)

# SG referencing another SG (powerful pattern):
# App server SG inbound: port 8080 source = web-server-SG-id
# This means: only instances IN the web server SG can reach app server
# No need to know IP addresses; scales automatically (see the CLI sketch below)
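A sketch of that SG-referencing rule via the CLI (both group IDs are placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id sg-app123 \
  --protocol tcp --port 8080 \
  --source-group sg-web456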

NACL Rules — Deep Dive

NACLs are the subnet-level firewall. Each subnet can only be associated with one NACL at a time. Rules are processed in ascending order; the first match wins.

Rule # | Type | Protocol | Port | Source | Action
100 | HTTP | TCP | 80 | 0.0.0.0/0 | ALLOW
110 | HTTPS | TCP | 443 | 0.0.0.0/0 | ALLOW
120 | Custom TCP | TCP | 1024-65535 | 0.0.0.0/0 | ALLOW ← ephemeral!
200 | SSH | TCP | 22 | 1.2.3.4/32 | ALLOW
* | All traffic | All | All | 0.0.0.0/0 | DENY ← catch-all
Rule numbering best practice: Use increments of 10 or 100 so you can insert rules later. Rule * (asterisk) is the default deny; it is always last and cannot be modified.

VPC Peering

VPC Peering creates a direct, private network connection between two VPCs, allowing instances to communicate as if they were in the same network, using private IPs with no internet involved.

  • Cross-region peering: Yes; peer VPCs across different AWS regions
  • Cross-account peering: Yes; peer VPCs in different AWS accounts
  • Non-transitive: If A↔B and B↔C, A cannot reach C through B. You need a direct A↔C peering.
  • No overlapping CIDRs: VPCs being peered cannot have overlapping IP ranges
  • Route table update required: Must add routes in BOTH VPCs pointing to the peering connection
  • SG reference cross-account: Can reference SGs from peered VPC (same region only)
# VPC A (10.0.0.0/16) peered with VPC B (172.16.0.0/16)
# VPC A route table must add:
Destination       Target
172.16.0.0/16    pcx-xxxxxxxxx   ← peering connection to VPC B

# VPC B route table must add:
Destination       Target
10.0.0.0/16      pcx-xxxxxxxxx   ← peering connection to VPC A
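A sketch of the full peering flow via the CLI (VPC, route table, and peering connection IDs are placeholders):

aws ec2 create-vpc-peering-connection --vpc-id vpc-aaa --peer-vpc-id vpc-bbb   # → pcx-123
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-123
aws ec2 create-route --route-table-id rtb-vpc-a --destination-cidr-block 172.16.0.0/16 \
  --vpc-peering-connection-id pcx-123
aws ec2 create-route --route-table-id rtb-vpc-b --destination-cidr-block 10.0.0.0/16 \
  --vpc-peering-connection-id pcx-123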

VPC Endpoints

VPC Endpoints allow private connectivity to AWS services without traffic leaving the AWS network: no internet, no NAT Gateway, and no extra cost per GB (for Gateway endpoints).

Gateway Endpoint (FREE)

  • Supports: S3 and DynamoDB ONLY
  • Added as a route in route table
  • No additional cost; saves NAT Gateway data charges
  • Scoped to region, not a specific AZ
  • Works via route table entry pointing to vpce-xxx

Interface Endpoint (PrivateLink)

  • Supports: 100+ AWS services (CloudWatch, SNS, SQS, SSM, Secrets Manager, etc.)
  • Creates an ENI in your subnet with private IP
  • Cost: ~$0.01/hour/AZ + $0.01/GB
  • Uses DNS resolution to redirect service calls
  • Works across VPC peering and Direct Connect
# S3 Gateway Endpoint - add to private route table:
Destination        Target
pl-xxxxxxxx        vpce-xxxxxxxx   ← S3 prefix list → endpoint

# No code change needed! Your existing S3 calls
# boto3.client('s3').upload_file(...)  ← automatically uses endpoint
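Creating the endpoint itself is one call (the VPC and route table IDs are placeholders):

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.ap-south-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-0def456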

AWS Transit Gateway

Transit Gateway is a network hub that connects thousands of VPCs, on-premises networks, and VPN connections through a single gateway. Instead of creating a mesh of VPC peering connections, all VPCs connect to the TGW hub.

  • Hub-and-spoke model: Each VPC/VPN connects once to TGW; TGW handles routing between them
  • Transitive routing: Unlike VPC peering, A can reach C through TGW (A→TGW→C)
  • Cross-region: Peer Transit Gateways across regions for global connectivity
  • Cross-account: Share TGW with AWS RAM (Resource Access Manager)
  • Route tables: TGW has its own route tables to control which attachments can talk to each other (segregation)
  • Cost: $0.05/attachment/hour + $0.02/GB; expensive for many VPCs but cheaper than mesh peering
# Without TGW: 10 VPCs need 45 peering connections (n*(n-1)/2)
# With TGW: 10 VPCs each connect once to TGW = 10 attachments

TGW Attachments:
  ├── VPC-A (prod)
  ├── VPC-B (staging)
  ├── VPC-C (shared-services)
  ├── VPN Connection (on-premises data center)
  └── Direct Connect Gateway

AWS Direct Connect

Direct Connect establishes a dedicated physical network connection from your on-premises data center to AWS, bypassing the public internet entirely for more consistent performance, lower latency, and reduced data transfer costs.

  • Speeds: 1 Gbps, 10 Gbps, 100 Gbps (hosted connections: 50 Mbps to 10 Gbps)
  • Not encrypted by default; combine with VPN for encryption over Direct Connect
  • Not redundant by default; order two connections in different facilities for HA
  • Lead time: Weeks to months to provision physical connection
  • Use Direct Connect Gateway to connect to multiple VPCs across regions from one Direct Connect

VPN Connections

AWS Site-to-Site VPN creates an encrypted IPsec tunnel between your on-premises network and your AWS VPC over the public internet.

Site-to-Site VPN

  • Encrypted tunnel over internet
  • Quick to setup (minutes)
  • Up to 1.25 Gbps per tunnel
  • 2 tunnels for redundancy (different AWS endpoints)
  • Uses Virtual Private Gateway (VGW) on AWS side
  • Cost: ~$0.05/hour + data transfer

Client VPN

  • OpenVPN-based for individual users
  • Users connect laptop โ†’ AWS VPC
  • AD/SAML authentication
  • Split tunneling option
  • Cost: ~$0.10/hour per association + $0.05/hour per connection

AWS PrivateLink

PrivateLink allows you to expose your service privately to other VPCs without peering, without public internet, and without exposing your entire VPC. It's the technology behind Interface VPC Endpoints.

  • Service provider creates a Network Load Balancer in front of their service
  • Creates a VPC Endpoint Service; consumers create Interface Endpoints to connect
  • Traffic never leaves the AWS network; completely private
  • Works across accounts and regions
  • Used by AWS to provide 100+ services privately (SSM, Secrets Manager, etc.)

Route 53 Resolver (DNS in VPC)

Route 53 Resolver is the built-in DNS resolver that handles DNS queries from within your VPC. Understanding it is key for hybrid cloud DNS.

  • VPC DNS resolver: Available at VPC CIDR +2 (e.g., 10.0.0.2 for 10.0.0.0/16)
  • Inbound Resolver Endpoints: Allow on-premises to resolve AWS private DNS names
  • Outbound Resolver Endpoints: Allow VPC to resolve on-premises DNS names
  • Forwarding Rules: Forward specific domain queries (e.g., corp.internal) to on-premises DNS
  • Enable DNS resolution and DNS hostnames in VPC settings for Route 53 private hosted zones to work
🪣

S3 — Simple Storage Service

Object Storage

What is S3?

Amazon S3 is an object storage service: not a filesystem, not a database. You store objects (files) in buckets. S3 provides 11 nines of durability (99.999999999%) by storing data across a minimum of 3 Availability Zones. S3 is accessed via HTTP/HTTPS API calls (PUT, GET, DELETE), not mounted as a filesystem.

S3 Durability: 99.999999999% = "eleven nines". If you store 10 million objects, you'd expect to lose 1 object every 10,000 years. This is achieved by storing each object in multiple facilities.

Core S3 Concepts

Bucket | Container for objects. Name must be globally unique across ALL AWS accounts. Region-specific but name global. Max 100 buckets per account (soft limit).
Object | Files stored in S3. Consists of Key (unique identifier/path), Value (the actual data/content), Metadata (key-value pairs), Version ID, Access Control.
Key | Full path of object within bucket. E.g., images/2024/jan/photo.jpg. There are no actual folders; the key is just a string with / in it.
Object Size | Min 0 bytes, Max 5 TB. For objects larger than 100 MB: use Multipart Upload. Required for objects over 5 GB.
URL Format | https://bucket-name.s3.region.amazonaws.com/key, e.g. https://my-bucket.s3.ap-south-1.amazonaws.com/images/photo.jpg
Metadata | System metadata (Content-Type, ETag, Content-Length) set by AWS. User-defined metadata (x-amz-meta-* headers) set by you.

S3 Storage Classes

Class | Durability | Availability | AZs | Min Duration | Retrieval | Best For
S3 Standard | 11 9s | 99.99% | ≥3 | None | Milliseconds (free) | Frequently accessed data, websites, mobile apps
S3 Intelligent-Tiering | 11 9s | 99.9% | ≥3 | None | Milliseconds to hours | Unknown or changing access patterns
S3 Standard-IA | 11 9s | 99.9% | ≥3 | 30 days | Milliseconds (per GB fee) | Disaster recovery, backups accessed monthly
S3 One Zone-IA | 11 9s | 99.5% | 1 | 30 days | Milliseconds (per GB fee) | Non-critical infrequent data. 20% cheaper than Standard-IA.
Glacier Instant | 11 9s | 99.9% | ≥3 | 90 days | Milliseconds (per GB fee) | Archives accessed once a quarter
Glacier Flexible | 11 9s | 99.9% | ≥3 | 90 days | 1-5 min (expedited), 3-5 hrs (standard), 5-12 hrs (bulk) | Archives accessed 1-2 times/year
Glacier Deep Archive | 11 9s | 99.9% | ≥3 | 180 days | 12 hrs (standard), 48 hrs (bulk) | Compliance archives, 7-10 year retention
Intelligent-Tiering: S3 monitors access patterns and automatically moves objects between access tiers: Frequent Access → Infrequent Access → Archive Instant Access → Archive Access → Deep Archive Access. No retrieval fees. Small monthly monitoring fee per object. Best when access patterns are unknown.

S3 Versioning

Versioning stores multiple versions of the same object in a bucket. Every upload creates a new version ID. This protects against accidental overwrites and deletes.

  • Enable versioning at the bucket level: Properties → Bucket Versioning → Enable
  • Once enabled, versioning can be suspended but NOT disabled. Existing versions remain.
  • When you "delete" a versioned object, S3 adds a Delete Marker (a special version). The object isn't actually deleted; you can restore it by deleting the Delete Marker (see the CLI sketch after this list).
  • To permanently delete: delete the specific version ID
  • Objects uploaded BEFORE versioning was enabled get version ID = null
  • MFA Delete: Require MFA authentication to permanently delete versions or suspend versioning. Requires AWS CLI (not Console).
  • Versioning increases storage costs (multiple copies of same file). Use Lifecycle rules to expire old versions.
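A sketch of these operations via the CLI (bucket/key names and the version IDs are placeholders):

# Enable versioning
aws s3api put-bucket-versioning --bucket my-bucket \
  --versioning-configuration Status=Enabled
# Inspect versions (and delete markers) of an object
aws s3api list-object-versions --bucket my-bucket --prefix report.pdf
# Permanently delete one specific version
aws s3api delete-object --bucket my-bucket --key report.pdf --version-id <version-id>
# "Undelete" an object by deleting its delete marker
aws s3api delete-object --bucket my-bucket --key report.pdf --version-id <delete-marker-id>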

S3 Bucket Policies vs ACLs

📋 Bucket Policies (JSON)

  • Resource-based IAM-style JSON policy attached to bucket
  • Can grant access to: specific IAM users, roles, other AWS accounts, anonymous users (public access)
  • Can require HTTPS-only access
  • Can restrict by IP address or VPC
  • Recommended approach; more powerful than ACLs

📄 ACLs (Legacy)

  • Pre-IAM access control mechanism
  • Less granular than policies
  • Disabled by default on new buckets (Object Ownership: Bucket owner enforced)
  • Apply at bucket or object level
  • AWS recommends: disable ACLs and use bucket policies instead

Bucket Policy Examples

# Make specific objects publicly readable
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

# Force HTTPS only (deny HTTP)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {
      "Bool": { "aws:SecureTransport": "false" }
    }
  }]
}

# Allow specific IAM role to access bucket
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::123456789:role/AppRole" },
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

S3 Block Public Access

Block Public Access is a safety net that prevents S3 buckets from being accidentally made public. Enabled by default on all new buckets and at the account level.

  • BlockPublicAcls: Rejects PUTs that include public ACLs
  • IgnorePublicAcls: Ignores any public ACLs on bucket/objects
  • BlockPublicPolicy: Rejects bucket policies that grant public access
  • RestrictPublicBuckets: Ignores public bucket policies
  • For static website hosting: you must turn off Block Public Access and add a bucket policy allowing public GetObject
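The four settings map to a single CLI call (the bucket name is a placeholder):

aws s3api put-public-access-block --bucket my-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true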

S3 Operations via AWS CLI

# Create bucket
aws s3 mb s3://my-unique-bucket-name --region ap-south-1

# Upload file
aws s3 cp myfile.txt s3://my-bucket/
aws s3 cp myfile.txt s3://my-bucket/folder/renamed.txt

# Download file
aws s3 cp s3://my-bucket/myfile.txt ./localfile.txt

# List bucket contents
aws s3 ls s3://my-bucket/
aws s3 ls s3://my-bucket/ --recursive       # list all files including subdirs

# Sync (only copies new or modified files)
aws s3 sync ./local-folder/ s3://my-bucket/
aws s3 sync s3://source-bucket/ s3://dest-bucket/

# Delete file
aws s3 rm s3://my-bucket/myfile.txt
aws s3 rm s3://my-bucket/ --recursive       # delete all objects (careful!)

# Make object public
aws s3api put-object-acl --bucket my-bucket --key file.txt --acl public-read
🪣

S3 Advanced

Object Storage

S3 Data Partitioning and Performance

S3 automatically partitions data based on key prefixes for performance. AWS can handle 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix.

# Single prefix = limited to 5,500 GET/s
s3://bucket/2024/all-files...         # all under same prefix = limited

# Multiple prefixes = multiply performance
s3://bucket/2024/q1/file            # prefix 1: 5,500 GET/s
s3://bucket/2024/q2/file            # prefix 2: 5,500 GET/s
s3://bucket/2024/q3/file            # prefix 3: 5,500 GET/s
s3://bucket/2024/q4/file            # prefix 4: 5,500 GET/s
# Total: 22,000 GET/s with 4 prefixes!

# Tip: Randomize prefixes to avoid hotspots (old advice for SSE-KMS uploads)
# Modern S3 handles random keys well natively

Multipart Upload

  • Recommended for objects larger than 100 MB
  • Required for objects larger than 5 GB
  • Splits file into up to 10,000 parts, uploads in parallel
  • If one part fails, only that part is retried (not whole file)
  • All parts must be uploaded before S3 assembles the final object
  • Incomplete multipart uploads should be cleaned up with lifecycle rules (cost saving)
# Multipart upload via CLI (handled automatically)
aws s3 cp largefile.iso s3://my-bucket/ --expected-size 4294967296

# Or specify multipart threshold and chunk size
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB

# Clean up incomplete multipart uploads
aws s3api list-multipart-uploads --bucket my-bucket
aws s3api abort-multipart-upload \
  --bucket my-bucket \
  --key object-key \
  --upload-id upload-id

S3 Transfer Acceleration

S3 Transfer Acceleration speeds up long-distance uploads to S3 by routing through AWS CloudFront Edge Locations. Instead of uploading directly to S3, data goes to the nearest Edge Location, then travels over AWS backbone to S3.

  • Best for: global users uploading to a centralized S3 bucket
  • Example: Users in Australia, India, Europe all upload to us-east-1 S3 bucket
  • Use special endpoint: bucket.s3-accelerate.amazonaws.com
  • Additional cost: $0.04-0.08/GB transferred through acceleration
  • Test if it helps for your scenario: Speed Comparison Tool

Cross-Region Replication (CRR) & Same-Region Replication (SRR)

Replication automatically copies objects between S3 buckets, either within the same region or across regions.

Feature | CRR (Cross-Region) | SRR (Same-Region)
Purpose | Compliance, lower latency, cross-account backups | Log aggregation, data sharing, test/prod sync
Data transfer cost | Yes: inter-region charges | No extra charges
Latency | Near real-time (asynchronous) | Near real-time (asynchronous)
Versioning | Required on both source and destination | Required on both
  • Replication does NOT copy existing objects automatically; use S3 Batch Replication for existing objects
  • New objects uploaded after enabling replication are replicated
  • Delete markers: NOT replicated by default (optional setting). Version deletions are NEVER replicated.
  • Replication supports cross-account (set ACL/bucket policy on destination)
  • Replication Time Control (RTC): 99.99% of objects replicated within 15 minutes (SLA-backed)
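A minimal replication configuration sketch (the role and bucket ARNs are placeholders; versioning must already be enabled on both buckets):

aws s3api put-bucket-replication --bucket source-bucket --replication-configuration '{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-all",
    "Prefix": "",
    "Status": "Enabled",
    "Destination": { "Bucket": "arn:aws:s3:::dest-bucket" }
  }]
}'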

S3 Lifecycle Management

Lifecycle policies automate transitioning objects between storage classes and expiring old objects/versions. Reduces storage costs significantly.

# Typical lifecycle policy example:
# Day 0:   Upload to S3 Standard
# Day 30:  Transition to S3 Standard-IA
# Day 90:  Transition to S3 Glacier Flexible Retrieval
# Day 365: Transition to S3 Glacier Deep Archive
# Day 2555 (7 years): Delete permanently

# Also useful for:
# - Expire incomplete multipart uploads after 7 days
# - Delete old versions after 30 days (with versioning enabled)
# - Delete expired object delete markers
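The policy above, expressed as an actual lifecycle configuration (the bucket name is a placeholder):

aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration '{
  "Rules": [{
    "ID": "archive-then-delete",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 },
    "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
  }]
}'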

S3 Encryption

Encryption Type | Key Management | Header Required | Notes
SSE-S3 (default since Jan 2023) | AWS manages keys entirely. AES-256. | x-amz-server-side-encryption: AES256 | No configuration needed. Automatic on all new objects.
SSE-KMS | AWS KMS. You choose CMK. | x-amz-server-side-encryption: aws:kms | Audit trail in CloudTrail. KMS API quota limits. Use S3 Bucket Keys to reduce API calls.
SSE-C | You provide the key with EVERY request. | Key in request header | MUST use HTTPS. AWS doesn't store the key. You lose key = you lose data.
Client-Side Encryption | You encrypt before uploading. Complete control. | N/A (encrypted before upload) | AWS never sees plaintext. Use AWS Encryption SDK or your own solution.

Static Website Hosting with S3

  1. Create S3 bucket with the same name as your domain (e.g., www.example.com)
  2. Enable Static website hosting in bucket Properties → set Index document = index.html, Error document = error.html
  3. Disable Block Public Access on the bucket (all 4 settings)
  4. Add bucket policy to allow public GetObject on all objects
  5. Upload your HTML/CSS/JS/image files to the bucket
  6. (Optional) Point a Route 53 Alias record or CNAME to the S3 website endpoint
  7. (Optional) Put CloudFront in front for HTTPS, custom domain, and global CDN
S3 websites support only HTTP by default! For HTTPS on a custom domain, you must use CloudFront + ACM certificate in front of S3.

S3 Events and Notifications

S3 can send event notifications when specific events occur on objects (create, delete, restore, replication).

Destination | Use Case | Latency
SNS Topic | Fan-out to multiple systems, email alerts | Seconds
SQS Queue | Decouple processing, retry failed events | Seconds
Lambda Function | Process objects on upload (resize, validate, extract) | Seconds
EventBridge | Advanced filtering, 20+ targets, archive/replay events | Seconds
# Example: Trigger Lambda when image is uploaded to /images/ prefix
# S3 Event: ObjectCreated (PUT, POST, COPY)
# Filter: Prefix = images/, Suffix = .jpg
# Destination: Lambda function ARN

# Common use case: Image processing pipeline
# 1. User uploads image to S3 (s3://my-bucket/images/photo.jpg)
# 2. S3 sends event notification to Lambda
# 3. Lambda reads original image from S3
# 4. Lambda resizes to multiple dimensions
# 5. Lambda writes thumbnails back to S3 (s3://my-bucket/thumbnails/)
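A minimal sketch of the receiving Lambda in step 2; the event shape is the standard S3 notification format, and the processing body is left as a placeholder:

import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # One notification can carry multiple records
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 events
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Placeholder: download, resize, and write thumbnails back here
        print(f"New object: s3://{bucket}/{key}")
    return {'statusCode': 200}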
🔓

S3 Access Control

Object Storage

Cross-Account Access for S3

When another AWS account needs to access your S3 bucket, you have three main approaches:

Method 1: Bucket Policy (Most Common)

Add a bucket policy to Account A's bucket that grants permissions to Account B's users/roles.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "CrossAccountAccess",
    "Effect": "Allow",
    "Principal": {
      "AWS": [
        "arn:aws:iam::ACCOUNT-B-ID:root",           // all of Account B
        "arn:aws:iam::ACCOUNT-B-ID:role/SpecificRole" // or just a specific role
      ]
    },
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::account-a-bucket",
      "arn:aws:s3:::account-a-bucket/*"
    ]
  }]
}
// Account B users ALSO need IAM permission to make the S3 calls

Method 2: IAM Role Assumption (STS)

# Account A creates IAM Role with S3 access + trust policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::ACCOUNT-B-ID:root" },
    "Action": "sts:AssumeRole"
  }]
}

# Account B user assumes the role:
aws sts assume-role \
  --role-arn "arn:aws:iam::ACCOUNT-A-ID:role/S3AccessRole" \
  --role-session-name "cross-account-session"
# Returns: AccessKeyId, SecretAccessKey, SessionToken (valid 1 hour)

# Use temporary credentials to access Account A's S3:
AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=yyy AWS_SESSION_TOKEN=zzz \
  aws s3 ls s3://account-a-bucket/

Pre-Signed URLs

Generate a time-limited URL that grants temporary access to a specific S3 object. The URL includes authentication information embedded in it. Anyone with the URL can access the object for the duration.

  • URL inherits permissions of the IAM identity that generated it
  • If the generating user/role loses permissions, the URL also stops working
  • Default expiry: 1 hour (3600 seconds) with the CLI; maximum 7 days, and only when the URL is signed with IAM user long-term credentials (Console-generated URLs max out at 12 hours)
  • For roles (EC2 instance profile, Lambda): max expiry = role's session duration
  • Works for GET (share private objects) and PUT (allow uploads to specific path)
# Generate pre-signed URL for downloading (valid 1 hour)
aws s3 presign s3://my-bucket/private-report.pdf --expires-in 3600

# Output: https://my-bucket.s3.amazonaws.com/private-report.pdf?X-Amz-Algorithm=...&X-Amz-Expires=3600&...

# Generate pre-signed URL for uploading (PUT)
aws s3 presign s3://my-bucket/upload-here.jpg --expires-in 7200

# Python example
import boto3
s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'private-file.pdf'},
    ExpiresIn=3600
)

S3 Access Points

Access Points are named network endpoints attached to a bucket, each with their own permissions policy. Instead of a single complex bucket policy managing hundreds of users, create one Access Point per use case.

# Example: Data lake bucket accessed by multiple teams
# Instead of one complex bucket policy:
# Create separate access points:
# - data-scientists-ap: allow read/write to /analytics/ prefix only
# - finance-ap: allow read to /finance/ prefix only  
# - dev-team-ap: allow read/write to /dev/ prefix only, VPC-only access

aws s3control create-access-point \
  --account-id 123456789012 \
  --name data-scientists-ap \
  --bucket my-data-lake \
  --vpc-configuration VpcId=vpc-12345678   # VPC-only access

# Access point ARN: arn:aws:s3:region:account:accesspoint/data-scientists-ap
# Use access point ARN anywhere you'd use a bucket name in S3 API calls

S3 Object Lambda

S3 Object Lambda adds your code to process data retrieved from S3 before returning it to the requesting application. Data is modified on-the-fly without storing multiple versions.

Without Object Lambda

  • Store multiple copies of same data in different formats
  • Original + anonymized + compressed + watermarked = 4x storage
  • High storage costs
  • Synchronization complexity

With Object Lambda

  • Store one copy of data
  • Lambda processes on retrieval
  • Different users get different views of same data
  • No extra storage needed
  • Use cases: Redact PII (remove SSN/credit card numbers), convert XML to JSON, resize images, add watermarks, decompress/compress
  • Works with any application that uses S3 GET API calls
  • Creates a new S3 Object Lambda Access Point; use its ARN instead of the bucket name
# Lambda function for S3 Object Lambda (redact PII)
import boto3, re, requests   # requests is used below; bundle it with the deployment package

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    # Get object from S3
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]
    
    # Retrieve original object
    response = requests.get(s3_url)
    original_content = response.text
    
    # Redact SSNs (pattern: XXX-XX-XXXX)
    redacted = re.sub(r'\d{3}-\d{2}-\d{4}', 'XXX-XX-XXXX', original_content)
    
    # Return modified content
    s3_client.write_get_object_response(
        Body=redacted,
        RequestRoute=request_route,
        RequestToken=request_token
    )
    return {'status_code': 200}
🔐

IAM — Identity and Access Management

Identity & Access

Root Account vs IAM User

👑 Root Account (Account Owner)

  • Created when you sign up for AWS
  • Email address + password login
  • Complete unrestricted access to everything
  • Cannot be restricted by IAM policies
  • Only for: change account settings, close account, change email, view certain billing, first IAM admin user creation
  • NEVER use for day-to-day operations!
  • Enable MFA immediately after account creation

👤 IAM User

  • Created within your AWS account by root or admin
  • Username + password (Console) or Access Keys (CLI/API)
  • No permissions by default โ€” must be explicitly granted
  • Can have both Console access AND programmatic access
  • Long-term credentials (up to two access keys per user, to allow rotation)
  • Suitable for individual humans or service accounts

Multi-Factor Authentication (MFA)

MFA requires users to provide two forms of authentication: something they know (password) and something they have (MFA device). Even if password is stolen, attacker can't log in without the MFA device.

MFA Type | Description | Examples
Virtual MFA Device | TOTP (Time-based One-Time Password) app on smartphone | Google Authenticator, Authy, Microsoft Authenticator, Duo
Hardware TOTP Token | Physical device that generates 6-digit codes | Gemalto token, RSA SecurID
FIDO Security Key (U2F) | Physical USB/NFC key; press button to authenticate | YubiKey, Titan Security Key
Passkey / Biometric | Built-in biometric (fingerprint, face) stored in device | Touch ID on Mac, Windows Hello, smartphone biometrics

IAM Password Policy

  • Minimum password length (1-128 characters)
  • Require specific character types: uppercase, lowercase, numbers, symbols
  • Password expiration (force password change every N days)
  • Prevent password reuse (remember last N passwords)
  • Allow users to change their own passwords

IAM Users, Groups, Roles — Concepts

Entity | What It Is | Credentials | Best For
User | Person or application with long-term identity in your account | Password + Access Keys | Human employees, CI/CD pipelines (when no other option)
Group | Collection of IAM users; policies applied to the group apply to all members | N/A (inherits from policies) | Organizing users by job function (Developers, Admins, Read-Only)
Role | IAM identity with permission policies, but NO permanent credentials. Assumed by trusted entities. | Temporary credentials (STS) | EC2/Lambda accessing AWS services, cross-account access, identity federation
Policy | JSON document defining permissions (Allow/Deny actions on resources) | N/A | Attached to users, groups, roles, or resources
IAM Groups: Users can belong to multiple groups. Groups CANNOT be nested (no group within a group). Groups are for users only; you cannot assign roles to groups.

IAM Roles — Deep Dive

Roles are the AWS-recommended way to grant AWS service permissions. Instead of creating IAM users with access keys for EC2 instances (insecure), you create a role with an Instance Profile.

โŒ Bad Practice (Access Keys on EC2)

  • aws configure on EC2 โ†’ hardcoded access keys
  • Keys stored in ~/.aws/credentials
  • If instance is compromised, keys are stolen
  • Keys must be manually rotated
  • Can't audit which instance used which credentials

✅ Good Practice (IAM Role)

  • Attach IAM role to EC2 instance profile
  • Temporary credentials auto-rotated every hour
  • Credentials available at metadata endpoint
  • CloudTrail logs show instance ID + role
  • No credentials stored on disk
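Under the hood, the SDK fetches those temporary credentials from the instance metadata service. A sketch using IMDSv2 (the role name is a placeholder):

# Get a session token first (IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# Fetch the temporary credentials for the attached role
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
# Returns AccessKeyId, SecretAccessKey, Token, and Expiration (auto-rotated)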

IAM Policy โ€” Structure and Examples

{
  "Version": "2012-10-17",       // Always use this version
  "Statement": [
    {
      "Sid": "AllowS3ReadWrite",  // Optional: human-readable ID for this statement
      "Effect": "Allow",           // Allow or Deny
      "Action": [                  // What API calls are allowed/denied
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [                // What resources the action applies to
        "arn:aws:s3:::my-bucket",           // bucket (for ListBucket)
        "arn:aws:s3:::my-bucket/*"          // all objects in bucket
      ],
      "Condition": {               // Optional: when this policy applies
        "StringEquals": {
          "aws:RequestedRegion": "ap-south-1"   // Only in Mumbai region
        },
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"  // Only when using MFA
        }
      }
    },
    {
      "Sid": "DenyDeleteProduction",
      "Effect": "Deny",            // Explicit Deny always wins over Allow
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::production-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalTag/Environment": "Admin"  // Unless user has Admin tag
        }
      }
    }
  ]
}

IAM Policy Types

AWS Managed Policies

  • Created and maintained by AWS
  • Updated automatically when new services launch
  • Cannot be modified by customers
  • Common examples: AdministratorAccess, ReadOnlyAccess, PowerUserAccess, AmazonS3FullAccess, AmazonEC2ReadOnlyAccess

Customer Managed Policies

  • You create and manage
  • Can be reused across multiple users/roles/groups
  • Versioned: up to 5 versions, rollback supported
  • Recommended for most custom use cases

Inline Policies

  • Embedded directly in a user, group, or role
  • Not reusable: 1:1 relationship
  • Deleted when entity is deleted
  • Use only when policy must not be accidentally attached to another identity

AWS CLI Configuration and IAM Access

# Configure CLI with access keys (for human users)
aws configure
# Prompts: Access Key ID, Secret Access Key, Region, Output format

# Or set specific profile
aws configure --profile myproject

# Use specific profile
aws s3 ls --profile myproject

# List configured profiles
aws configure list-profiles

# View current identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn (who am I?)

# On EC2 with an IAM role: NO configuration needed!
aws s3 ls s3://my-bucket/  # uses instance profile automatically

# Access key rotation (best practice: every 90 days)
aws iam create-access-key --user-name myuser
aws iam delete-access-key --user-name myuser --access-key-id AKIAIOSFODNN7EXAMPLE

IAM Best Practices

  • Least Privilege: Grant only the permissions required to do the job, nothing more
  • Root Account: Never use root for daily operations. Enable MFA. Delete or rotate root access keys.
  • MFA: Enable MFA for root account and all privileged users immediately
  • Roles over Users: Use IAM roles for EC2, Lambda, ECS; no hardcoded credentials
  • Key Rotation: Rotate access keys every 90 days or less
  • Groups for Permissions: Assign permissions to groups, add users to groups
  • Never Share Credentials: Create individual IAM users; never share usernames/passwords
  • Use IAM Access Analyzer: Identify over-permissive policies and external access
  • Monitor with CloudTrail: Log and alert on sensitive API calls

Auditing User Activity

  • IAM Credential Report: CSV showing ALL users in account: when they last used console/access keys, whether MFA is enabled, when passwords were changed. Download from IAM Console → Credential Report.
  • IAM Access Advisor: Shows service-level permissions granted to a user AND the last time those services were accessed. Use this to identify and remove unused permissions.
  • CloudTrail: Every API call in your account is logged (who, what, when, from where, success/fail). Essential for incident investigation.
  • IAM Access Analyzer: Scans resource policies and reports any that grant access to external principals. Helps find unintended public or cross-account access.
🔑

Secrets & Keys

Security

Why Never Hardcode Credentials?

Hardcoding passwords, API keys, or tokens directly in code is one of the most dangerous security mistakes. If code is pushed to GitHub (even accidentally), credentials are exposed publicly. Scanners, bots, and attackers actively scrape GitHub for AWS keys; a compromised key can result in thousands of dollars of AWS charges within minutes.

Real incident: Developers have accidentally committed AWS keys to public GitHub repos and received $50,000+ bills within hours from crypto miners spinning up GPU instances globally. AWS may cover some charges but not always.

AWS Secrets Manager

Secrets Manager is a dedicated service for storing, rotating, and retrieving secrets. Applications call the API at runtime instead of having credentials in code or config files.

Feature | Details
Automatic Rotation | Rotates RDS, Aurora, Redshift, DocumentDB credentials on schedule via Lambda. Zero downtime: updates DB password and stores the new value atomically.
Encryption | All secrets encrypted with KMS (AWS-managed or your own CMK)
Versioning | Keeps previous versions (AWSPREVIOUS) during rotation for zero-downtime cutover
Audit Trail | Every GetSecretValue call logged in CloudTrail: full audit of who accessed what and when
Cross-account | Share secrets across AWS accounts using resource-based policies
Cost | $0.40/secret/month + $0.05 per 10,000 API calls
import boto3, json
import pymysql  # assumes the secret stores host/username/password/dbname keys

def get_secret(secret_name):
    client = boto3.client('secretsmanager', region_name='ap-south-1')
    resp = client.get_secret_value(SecretId=secret_name)
    return json.loads(resp['SecretString'])

# Usage - credentials fetched at runtime, never stored in code
creds = get_secret('prod/myapp/rds')
conn = pymysql.connect(host=creds['host'], user=creds['username'],
                       password=creds['password'], database=creds['dbname'])

Rotation Deep Dive

Automatic rotation works by triggering a Lambda function on schedule. AWS provides pre-built rotation Lambdas for RDS, Aurora, Redshift, and DocumentDB. For other services, you write a custom Lambda.

  1. createSecret: Lambda generates a new random password
  2. setSecret: Lambda updates the password in the database
  3. testSecret: Lambda tests new credentials can authenticate
  4. finishSecret: Lambda marks new version as AWSCURRENT, old as AWSPREVIOUS
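
A minimal sketch of a custom rotation handler, assuming a secret that stores a single opaque password. The step names and event fields (Step, SecretId, ClientRequestToken) are the real rotation contract; the setSecret/testSecret bodies are engine-specific placeholders.

import boto3

sm = boto3.client('secretsmanager')

def lambda_handler(event, context):
    arn = event['SecretId']
    token = event['ClientRequestToken']
    step = event['Step']  # createSecret | setSecret | testSecret | finishSecret

    if step == 'createSecret':
        # Stage a new random password under the AWSPENDING label
        new_pw = sm.get_random_password(PasswordLength=32)['RandomPassword']
        sm.put_secret_value(SecretId=arn, ClientRequestToken=token,
                            SecretString=new_pw, VersionStages=['AWSPENDING'])
    elif step == 'setSecret':
        pass  # update the password in the database (engine-specific)
    elif step == 'testSecret':
        pass  # open a test connection using the AWSPENDING credentials
    elif step == 'finishSecret':
        # Promote AWSPENDING to AWSCURRENT (old version becomes AWSPREVIOUS)
        meta = sm.describe_secret(SecretId=arn)
        current = [v for v, stages in meta['VersionIdsToStages'].items()
                   if 'AWSCURRENT' in stages][0]
        sm.update_secret_version_stage(SecretId=arn, VersionStage='AWSCURRENT',
                                       MoveToVersionId=token,
                                       RemoveFromVersionId=current)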

Secrets Manager vs SSM Parameter Store

Secrets Manager

  • Purpose-built for secrets
  • Automatic rotation built-in
  • $0.40/secret/month
  • Cross-account sharing
  • Secret versioning
  • Best for: DB passwords, API keys needing auto-rotation

SSM Parameter Store

  • Config values + secrets
  • No built-in auto rotation
  • Standard tier: FREE (up to 10,000 params)
  • Advanced tier: $0.05/param/month
  • SecureString = KMS encrypted
  • Best for: config flags, feature toggles, non-rotating values (see the sketch below)
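
A quick sketch of reading Parameter Store values with Boto3; the parameter names under /myapp/prod/ are hypothetical.

import boto3

ssm = boto3.client('ssm', region_name='ap-south-1')

# SecureString values are KMS-decrypted when WithDecryption=True
resp = ssm.get_parameter(Name='/myapp/prod/db_url', WithDecryption=True)
db_url = resp['Parameter']['Value']

# Fetch a whole namespace at once (handy at application startup)
params = ssm.get_parameters_by_path(Path='/myapp/prod/', Recursive=True,
                                    WithDecryption=True)
config = {p['Name']: p['Value'] for p in params['Parameters']}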

AWS KMS (Key Management Service)

KMS is the central key management service for all AWS encryption. It creates and controls the cryptographic keys used to encrypt data. Crucially, plaintext keys NEVER leave KMS - all encrypt/decrypt operations happen inside the service via API calls.

Key Type | Who Manages | Cost | Use Case
AWS Owned Keys | AWS (hidden) | Free | Default for S3, SQS, DynamoDB
AWS Managed Keys | AWS (visible) | Free | aws/s3, aws/ebs, aws/rds
Customer Managed CMK | You | $1/month + $0.03/10K API calls | Custom rotation, cross-account, audit
Imported Keys | You (bring your own key) | $1/month | Regulatory compliance (BYOK)
Envelope Encryption: KMS generates a Data Key (DEK) → you encrypt your data locally with the DEK (fast) → KMS encrypts the DEK with your CMK → you store the encrypted data + encrypted DEK together. The data itself never goes to KMS - only the small DEK does. This is how S3, EBS, and RDS encryption work under the hood (a hand-rolled sketch follows below).
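
A minimal sketch of envelope encryption done by hand with Boto3 and the cryptography package; the key alias alias/my-app-key is a placeholder. AWS services do the equivalent internally.

import base64
import boto3
from cryptography.fernet import Fernet  # pip install cryptography

kms = boto3.client('kms')

# 1. Ask KMS for a data key: a plaintext copy + a copy encrypted under the CMK
dk = kms.generate_data_key(KeyId='alias/my-app-key', KeySpec='AES_256')

# 2. Encrypt the data locally with the plaintext data key (fast, any size)
f = Fernet(base64.urlsafe_b64encode(dk['Plaintext']))
ciphertext = f.encrypt(b'very large payload ...')

# 3. Store ciphertext + encrypted data key together; discard the plaintext key
encrypted_dek = dk['CiphertextBlob']

# To decrypt later: KMS decrypts the small DEK, then you decrypt locally
plain_key = kms.decrypt(CiphertextBlob=encrypted_dek)['Plaintext']
data = Fernet(base64.urlsafe_b64encode(plain_key)).decrypt(ciphertext)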

KMS Key Policies

Unlike IAM policies, KMS keys REQUIRE a key policy - without one, no one (not even root) can use the key. Key policies are resource-based policies attached directly to the CMK.

{
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow Lambda to use key",
      "Effect": "Allow", 
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/lambda-role"},
      "Action": ["kms:Decrypt","kms:GenerateDataKey"],
      "Resource": "*"
    }
  ]
}
📊

CloudWatch

Monitoring

What is CloudWatch?

CloudWatch is AWS's unified observability platform - the single place to monitor all your AWS resources and applications. It collects metrics (numbers), logs (text), and traces (request paths), and lets you set alarms, create dashboards, and trigger automated actions. Think of it as the "nervous system" of your AWS infrastructure.

📈 Metrics

  • Numerical time-series data points
  • Default resolution: 1 minute (detailed) or 5 minutes (standard)
  • Retention: 15 months (1-min data kept 15 days, then rolled up)
  • Free tier: basic EC2, RDS, S3 metrics
  • Custom metrics: $0.30/metric/month (see the publishing sketch after these cards)

📋 Logs

  • Organized in Log Groups → Log Streams → Log Events
  • Configurable retention (1 day to 10 years, or never expire)
  • Ingest cost: $0.50/GB, Storage: $0.03/GB/month
  • Query with CloudWatch Logs Insights
  • Subscribe to Lambda/Kinesis for real-time processing

🔔 Alarms

  • Watch a single metric, trigger actions
  • States: OK, ALARM, INSUFFICIENT_DATA
  • Actions: SNS notification, EC2 action, Auto Scaling, Systems Manager
  • Composite alarms: combine multiple alarms with AND/OR
  • Cost: $0.10/alarm/month (standard)
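
The custom-metrics bullet above in practice: a minimal Boto3 sketch publishing one data point. The MyApp namespace and OrdersProcessed metric name are hypothetical.

import boto3

cw = boto3.client('cloudwatch')

# Publish a custom application metric (namespace and names are examples)
cw.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'OrdersProcessed',
        'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
        'Value': 42,
        'Unit': 'Count'
    }]
)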

Default EC2 Metrics (No Agent Needed)

Metric | Description | Unit | Alarm Threshold
CPUUtilization | % of CPU used by the instance | Percent | Alert if >80% for 5 mins
NetworkIn / NetworkOut | Bytes received/sent | Bytes | Alert on traffic spikes
DiskReadOps / DiskWriteOps | I/O operations (instance store only) | Count | Detect disk bottleneck
StatusCheckFailed_Instance | OS-level issues (kernel panic, etc.) | Count (0 or 1) | Alert on any failure
StatusCheckFailed_System | AWS hardware issues | Count (0 or 1) | Alert on any failure
Memory and Disk Space are NOT available by default! These require installing the CloudWatch Agent on your EC2 instance. This is a very common exam question.

CloudWatch Agent

The CloudWatch Agent is software you install on EC2 (or on-premises servers) to collect metrics and logs that aren't available by default - especially memory utilization, disk space, and custom application logs.

# Install CloudWatch Agent on Amazon Linux 2
sudo yum install -y amazon-cloudwatch-agent

# Run configuration wizard (interactive)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Or use a config file (stored in SSM Parameter Store for central management)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c ssm:/AmazonCloudWatch-Config

# Start and enable
sudo systemctl start amazon-cloudwatch-agent
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl status amazon-cloudwatch-agent

CloudWatch Agent Config (JSON)

{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/", "/data"],
        "metrics_collection_interval": 300
      },
      "cpu": {
        "totalcpu": true,
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{hostname}"
          }
        ]
      }
    }
  }
}

CloudWatch Alarms โ€” Full Configuration

Setting | What it means | Example
Metric | What to watch | CPUUtilization, namespace=AWS/EC2
Statistic | How to aggregate data points | Average, Maximum, Sum, p99
Period | Length of each evaluation window | 300 seconds (5 min)
Evaluation Periods | Total windows to look at | 3 (look at the last 15 min)
Datapoints to Alarm | How many windows must breach (M of N) | 2 of 3
Threshold | The trigger value | > 80%
Missing data | How to treat gaps | notBreaching / breaching / ignore
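
A sketch wiring the settings above together with Boto3: CPU > 80% in 2 of 3 five-minute windows. The instance ID and SNS topic ARN are placeholders.

import boto3

cw = boto3.client('cloudwatch')

cw.put_metric_alarm(
    AlarmName='high-cpu-web-1',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
    Statistic='Average',
    Period=300,                    # 5-minute windows
    EvaluationPeriods=3,           # look at the last 3 windows
    DatapointsToAlarm=2,           # alarm when 2 of 3 breach
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:ap-south-1:123456789012:ops-alerts']
)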

CloudWatch Logs Insights Queries

# Find all ERROR log lines in last 1 hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Count errors by type
fields @message
| filter @message like /Exception/
| parse @message "* Exception: *" as prefix, errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

# Lambda: find slow invocations (>3 seconds)
filter @type = "REPORT"
| fields @requestId, @duration, @billedDuration, @memorySize, @maxMemoryUsed
| filter @duration > 3000
| sort @duration desc

CloudWatch Dashboards

  • Create custom dashboards mixing metrics from different services and regions
  • Widget types: Line graph, Number (single value), Alarm status, Log table, Text (Markdown), Bar chart, Pie chart
  • Cross-account and cross-region on a single dashboard - great for centralized monitoring
  • Share dashboards: publicly (with link), privately (IAM), or with specific accounts
  • Auto-refresh: 10s, 1min, 2min, 5min, 15min
  • Cost: First 3 dashboards free, then $3/dashboard/month
🔔

CloudWatch Advanced

Monitoring

Amazon EventBridge (formerly CloudWatch Events)

EventBridge is a serverless event bus that connects applications using events. AWS services emit events when things happen (EC2 state change, S3 object uploaded, CodePipeline failed). EventBridge routes these events to target services for automated responses - enabling event-driven architectures without polling.

Event Sources

  • AWS services (EC2, S3, RDS, CodePipeline, Health, GuardDuty...)
  • Your custom applications (PutEvents API)
  • SaaS partners (Zendesk, Datadog, Shopify, PagerDuty)
  • Scheduled rules (cron or rate expressions)

Event Targets (20+)

  • Lambda functions
  • Step Functions state machines
  • SQS queues, SNS topics
  • Kinesis Data Streams/Firehose
  • ECS tasks, CodePipeline, CodeBuild
  • API Gateway, CloudWatch Log Groups
# Scheduled rule examples:
rate(5 minutes)         # every 5 minutes
rate(1 hour)            # every hour  
cron(0 18 ? * MON-FRI *) # 6 PM UTC weekdays
cron(30 3 * * ? *)       # 3:30 AM UTC daily (9 AM IST)

# Event pattern โ€” trigger when EC2 instance stops:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["stopped", "terminated"]
  }
}

# Event pattern โ€” S3 object uploaded:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["my-uploads-bucket"]},
    "object": {"key": [{"prefix": "images/"}]}
  }
}
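
A sketch of registering the EC2 state-change pattern above as a rule with Boto3; the rule name and target Lambda ARN are placeholders.

import boto3, json

events = boto3.client('events')

# Create (or update) a rule carrying the event pattern shown above
events.put_rule(
    Name='ec2-stopped',
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Instance State-change Notification'],
        'detail': {'state': ['stopped', 'terminated']}
    })
)

# Point the rule at a Lambda target (the function must also grant
# events.amazonaws.com invoke permission via lambda add-permission)
events.put_targets(
    Rule='ec2-stopped',
    Targets=[{'Id': 'notify',
              'Arn': 'arn:aws:lambda:ap-south-1:123456789012:function:on-ec2-stop'}]
)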

EventBridge Pipes

EventBridge Pipes connects event sources (SQS, DynamoDB Streams, Kinesis) to targets with optional filtering, enrichment (Lambda/Step Functions), and transformation - all without writing integration code.

Source (SQS) → Filter → Enrichment (Lambda) → Target (Step Functions)

Composite Alarms

Composite alarms combine multiple CloudWatch alarms into a single alarm using AND/OR/NOT logic. They reduce alert noise by only notifying when multiple conditions are true simultaneously.

# Only alarm when BOTH CPU is high AND memory is high
# Prevents noisy false positives from individual metric spikes
{
  "AlarmRule": "ALARM(cpu-alarm) AND ALARM(memory-alarm)"
}

# Alert when ANY of these critical conditions occur
{
  "AlarmRule": "ALARM(disk-full) OR ALARM(health-check-failed) OR ALARM(db-connections-max)"
}
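
The same composite alarm created via the API - a minimal sketch assuming the child alarms cpu-alarm and memory-alarm already exist, with a placeholder SNS topic.

import boto3

cw = boto3.client('cloudwatch')

# Children fire individually; only the composite notifies anyone
cw.put_composite_alarm(
    AlarmName='instance-degraded',
    AlarmRule='ALARM(cpu-alarm) AND ALARM(memory-alarm)',
    AlarmActions=['arn:aws:sns:ap-south-1:123456789012:ops-alerts']
)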

CloudWatch Metric Math

Metric Math lets you create new time series by performing mathematical operations on existing metrics - compute error rates, percentages, sums across instances, and so on.

# Error rate calculation
METRICS:
  m1: Errors (Count)
  m2: Requests (Count)
EXPRESSION:
  e1: (m1/m2)*100    → ErrorRate (%)

# Sum CPU across all EC2 instances in an ASG
SEARCH('{AWS/EC2,InstanceId} CPUUtilization', 'Average', 300)
# Then: SUM(METRICS())  → total CPU across all instances
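
The same error-rate expression evaluated through the GetMetricData API - a sketch with hypothetical MyApp/Errors and MyApp/Requests metrics.

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client('cloudwatch')

resp = cw.get_metric_data(
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    MetricDataQueries=[
        {'Id': 'm1', 'MetricStat': {
            'Metric': {'Namespace': 'MyApp', 'MetricName': 'Errors'},
            'Period': 300, 'Stat': 'Sum'}, 'ReturnData': False},
        {'Id': 'm2', 'MetricStat': {
            'Metric': {'Namespace': 'MyApp', 'MetricName': 'Requests'},
            'Period': 300, 'Stat': 'Sum'}, 'ReturnData': False},
        # The math expression itself - only this series is returned
        {'Id': 'e1', 'Expression': '(m1/m2)*100', 'Label': 'ErrorRate'}
    ]
)
print(resp['MetricDataResults'][0]['Values'])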

CloudWatch Contributor Insights

Analyzes log data to identify the "top contributors" to performance problems - e.g., which IP addresses generate the most 404s, which Lambda functions cause the most errors, and which URLs have the highest latency.

  • Uses rules to parse logs and extract fields for analysis
  • Works with CloudWatch Logs, VPC Flow Logs, DynamoDB, API Gateway access logs
  • Updates in near real-time - great for identifying bad actors or hot partitions

CloudWatch Anomaly Detection

Uses machine learning to automatically create a "band" of expected values for any metric based on historical patterns. Alarms trigger when the metric goes outside the expected band - no manual threshold needed.

  • Learns seasonality: daily patterns (peak at 9 AM), weekly patterns (lower on weekends)
  • Set sensitivity: how wide the expected band is (1 = tight, 2 = normal, 3 = loose)
  • Great for metrics with natural variation where a fixed threshold would cause too many false alarms

AWS X-Ray (Distributed Tracing)

X-Ray traces requests as they travel through your distributed application - from API Gateway → Lambda → DynamoDB → external API. It shows you exactly where latency comes from and which service is causing errors.

  • Creates a Service Map - a visual graph of all services and their connections with latency/error rates
  • Traces: End-to-end record of a single request across all services
  • Segments: Data from a single service about work it did for a request
  • Subsegments: Detailed breakdown (specific DB query, HTTP call, etc.)
  • Annotations: Key-value pairs you add to traces for filtering (user_id, order_id)
  • SDK available for Node.js, Python, Java, Go, .NET, Ruby
# Python Lambda with X-Ray tracing
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all
patch_all()  # Auto-instrument boto3, requests, pymysql

table = boto3.resource('dynamodb').Table('orders')  # example table name

@xray_recorder.capture('process_order')
def process_order(order_id):
    xray_recorder.put_annotation('order_id', order_id)
    # your code - the get_item call below is traced automatically as a subsegment
    result = table.get_item(Key={'order_id': order_id})
    return result

CloudWatch Synthetics

Synthetics lets you create "canaries" - scripts that run on a schedule to test your endpoints and APIs from outside your application, simulating user behavior 24/7.

  • Runs as often as every 1 minute - detects outages before users do
  • Checks: HTTP endpoints, APIs, web pages (headless browser), broken links
  • Results appear as CloudWatch metrics and can trigger alarms
  • Pre-built blueprints: Heartbeat monitor, API canary, Broken link checker, GUI workflow
🛡️

AWS Security Tools

Security

AWS Security Landscape

AWS provides a layered security approach - multiple services working together to cover different aspects of security: edge protection, identity, vulnerability management, threat detection, compliance, and incident response. Understanding which tool does what is essential.

Service | Category | What it does
Shield | DDoS Protection | Protects against volumetric network attacks
WAF | App Firewall | Blocks malicious HTTP requests (SQLi, XSS)
ACM | SSL/TLS | Free certificates for AWS services
GuardDuty | Threat Detection | ML-based anomaly detection across your account
Inspector | Vulnerability Scan | CVE scanning for EC2, Lambda, containers
Macie | Data Security | Finds PII/sensitive data in S3
Security Hub | CSPM | Centralized security findings dashboard
CloudTrail | Audit | Records all API calls in the account
Config | Compliance | Tracks config changes, evaluates rules
Trusted Advisor | Best Practices | Recommendations across 5 pillars

AWS Certificate Manager (ACM)

ACM provisions, manages, and auto-renews SSL/TLS certificates. Public certificates are completely FREE when used with AWS services - no more paying certificate authorities or worrying about expiration dates.

  • Auto-renews 60 days before expiration - never get caught with an expired cert
  • Domain validation: add a CNAME record to DNS (recommended - fully automated with Route 53) or verify via email
  • Certificates can ONLY be used with: ALB, CloudFront, API Gateway, Elastic Beanstalk, AppSync
  • The private key cannot be exported - you can't use ACM certs on EC2 directly (terminate TLS at an ALB instead)
  • Private CA ($400/month): issue private certs for internal services

AWS Shield

Shield Standard (FREE)

  • Automatic protection for all AWS customers
  • Protects against most common DDoS attacks: SYN floods, UDP reflection, DNS amplification
  • Layer 3/4 protection only
  • No configuration needed - always on
  • Protects: EC2, ELB, CloudFront, Route 53, Global Accelerator

Shield Advanced ($3,000/month)

  • Enhanced DDoS protection with attack visibility
  • 24/7 AWS DDoS Response Team (DRT) access
  • Cost protection - AWS credits charges caused by scaling during a DDoS attack
  • Near real-time attack notifications
  • WAF included at no extra cost
  • Historical attack reports

AWS WAF โ€” Web Application Firewall

WAF inspects HTTP/HTTPS requests at Layer 7 and blocks malicious traffic before it reaches your application. Deploy on CloudFront, ALB, API Gateway, or AppSync.

Rule Type | Description | Example
IP Set Rules | Allow/block specific IPs or CIDRs | Block known bad IP ranges
Geographic Rules | Allow/block by country | Only allow India and US
Rate-Based Rules | Limit requests per IP per 5 minutes | Max 2,000 req/5 min per IP
SQL Injection Match | Detect SQL injection patterns in requests | Block ' OR 1=1-- in query string
XSS Match | Detect cross-site scripting patterns | Block script tags in body
Regex Pattern | Custom regex matching on request parts | Block specific User-Agent strings
AWS Managed Rules | Pre-built rulesets maintained by AWS | Core Rule Set, Known Bad Inputs, PHP, WordPress

Amazon GuardDuty

GuardDuty is a threat detection service that continuously monitors your AWS account for malicious activity using machine learning, anomaly detection, and threat intelligence feeds. It requires no agents - it analyzes VPC Flow Logs, CloudTrail, DNS logs, and S3 data events automatically.

  • Detects: compromised instances (crypto mining, C&C communication), credential theft, unusual API calls from Tor exit nodes, S3 data exfiltration, privilege escalation attempts
  • Findings are rated: Low, Medium, High severity with detailed remediation guidance
  • Enable per region - 30-day free trial, then ~$4/million events
  • Integrate with Security Hub and EventBridge for automated remediation
# Auto-remediate a GuardDuty finding via EventBridge + Lambda:
# Trigger: finding type "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration"
# Action: deny all sessions issued before now, then notify the security team

import json
from datetime import datetime, timezone
import boto3

def lambda_handler(event, context):
    iam = boto3.client('iam')
    role_name = event['detail']['resource']['accessKeyDetails']['userName']
    # Inline deny policy: any session token issued before this moment loses access
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny", "Action": "*", "Resource": "*",
            "Condition": {"DateLessThan": {
                "aws:TokenIssueTime": datetime.now(timezone.utc).isoformat()
            }}
        }]
    }
    iam.put_role_policy(RoleName=role_name, PolicyName='RevokeAllSessions',
                        PolicyDocument=json.dumps(policy))

AWS Inspector

Inspector continuously scans EC2 instances, Lambda functions, and container images in ECR for software vulnerabilities (CVEs) and unintended network exposure. Unlike manual scans, Inspector rescans automatically when new CVEs are published.

  • Uses SSM Agent on EC2 for OS package scanning (no separate agent needed)
  • Generates risk score combining CVSS base score + AWS environment context
  • Network reachability findings: identifies EC2 instances reachable from internet on unexpected ports
  • Integrates with Security Hub - all findings in one place

Amazon Macie

Macie uses ML to automatically discover, classify, and protect sensitive data stored in S3. It identifies Personally Identifiable Information (PII) like names, credit card numbers, SSNs, passport numbers, and health records.

  • Scans S3 buckets for: PII, credentials (private keys, passwords), financial data, health data
  • Findings go to Security Hub and EventBridge for automated response
  • Useful for compliance: GDPR, HIPAA, PCI-DSS data discovery
  • Cost: $1/GB for first 50 GB/month scanned

AWS CloudTrail

CloudTrail records every API call made in your AWS account - the who, what, when, and from where of every action. It's your primary tool for security investigations, compliance auditing, and troubleshooting permission issues.

Event Type | What's captured | Default? | Cost
Management Events | Control plane: create/delete/modify resources (RunInstances, CreateBucket, PutRolePolicy) | Yes (90 days in console) | Free for first trail
Data Events | Data plane: S3 GetObject/PutObject, Lambda invocations, DynamoDB ops | No | $0.10/100K events
Insight Events | Unusual API activity (sudden spike in TerminateInstances calls) | No | $0.35/100K events
# CloudTrail log entry example - who deleted an S3 bucket:
{
  "eventTime": "2024-01-15T14:23:11Z",
  "eventName": "DeleteBucket",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "dev-john",
    "arn": "arn:aws:iam::123456789:user/dev-john"
  },
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-backup-bucket"},
  "responseElements": null,
  "errorCode": null   โ† null means SUCCESS (bucket deleted!)
}
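
A sketch of asking the same question ("who called DeleteBucket?") with Boto3's lookup_events, which searches the last 90 days of management events.

import boto3

ct = boto3.client('cloudtrail')

# Page through all DeleteBucket calls recorded in the account
pages = ct.get_paginator('lookup_events').paginate(
    LookupAttributes=[{'AttributeKey': 'EventName',
                       'AttributeValue': 'DeleteBucket'}]
)
for page in pages:
    for e in page['Events']:
        print(e['EventTime'], e.get('Username'), e['EventName'])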

AWS Config

Config continuously records configuration changes of your AWS resources and evaluates them against compliance rules. If something is misconfigured (public S3 bucket, unencrypted EBS volume), Config flags it and can auto-remediate.

  • Configuration history: "What did this EC2 instance look like 30 days ago? Who changed the security group?"
  • Config Rules: AWS-managed (170+ pre-built) or custom (Lambda-based)
  • Auto Remediation: Trigger SSM Automation documents to fix violations automatically
  • Conformance Packs: Bundles of rules for frameworks like PCI-DSS, HIPAA, CIS Benchmarks
  • Cost: $0.003/configuration item recorded + $0.001/rule evaluation

AWS Security Hub

Security Hub provides a centralized dashboard aggregating security findings from GuardDuty, Inspector, Macie, IAM Access Analyzer, Firewall Manager, and third-party tools into one place with a security score.

  • Runs continuous automated checks against CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices
  • Security score (0-100): shows your overall security posture
  • Multi-account: aggregate findings from all accounts in an AWS Organization
  • Send findings to EventBridge for automated workflows

SNS โ€” Simple Notification Service

SNS is a fully managed pub/sub messaging service. Publishers send messages to topics, and all subscribers receive a copy. It's the glue that connects AWS monitoring alerts to humans and automated systems.

Subscriber Type | Use Case
Email / Email-JSON | Alert engineers when an alarm fires
SMS | Critical alerts to phones
HTTP/HTTPS | Webhook to external systems (PagerDuty, Slack)
Lambda | Automated remediation on alert
SQS | Fan-out: one message → multiple queues processed independently
Kinesis Firehose | Stream alerts to S3/Splunk/Elasticsearch
Mobile Push | iOS/Android push notifications
SNS Fan-out Pattern: One SNS topic → multiple SQS queues. Publish once, and multiple systems each independently process the message. Used for order processing: inventory, payment, and email all triggered from one order event (a publishing sketch follows below).
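
A minimal publish sketch for the fan-out pattern above; the topic ARN and payload are hypothetical.

import boto3, json

sns = boto3.client('sns')

# One publish - every subscriber (email, SQS queues, Lambda) gets a copy
sns.publish(
    TopicArn='arn:aws:sns:ap-south-1:123456789012:order-events',
    Subject='Order placed',
    Message=json.dumps({'order_id': 'O-1001', 'amount': 499})
)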

AWS Trusted Advisor

Trusted Advisor analyzes your AWS environment against AWS best practices across 5 categories and gives you recommendations. It's like having an AWS solutions architect review your account automatically.

Category | Example Checks | Support Plan
💰 Cost Optimization | Idle EC2 instances, underutilized RDS, unattached EIPs, old snapshots | All plans
⚡ Performance | CloudFront enabled, EC2 instance types, EBS throughput | Business+
🔒 Security | Open security group ports, MFA on root, S3 bucket permissions, exposed access keys | 7 basic checks for all
🛡️ Fault Tolerance | Multi-AZ RDS, ELB health checks, EBS snapshots, Route 53 failover | Business+
📊 Service Limits | Approaching EC2, EIP, VPC limits | All plans

AWS Global Accelerator

Global Accelerator improves the performance of internet applications by routing traffic over the AWS global backbone network instead of the unpredictable public internet - reducing latency by 60%+ for global users.

  • Provides 2 static Anycast IPv4 addresses - users worldwide connect to the nearest AWS edge location
  • Traffic travels AWS's private network to your endpoint (EC2, ALB, NLB, Elastic IP)
  • Built-in health checks and failover - traffic is automatically rerouted if an endpoint fails
  • Works for TCP and UDP - good for gaming, VoIP, IoT, and any non-HTTP protocol
  • vs CloudFront: Global Accelerator = performance for dynamic/non-cacheable content + TCP/UDP. CloudFront = caching HTTP content at the edge.
⚡

Lambda

Serverless

What is Lambda?

Lambda is a serverless, event-driven compute service. You write code, upload it, and Lambda runs it in response to events. You never manage servers โ€” AWS handles provisioning, scaling, patching, and availability automatically. You pay only when code runs (per 1ms of execution).

Serverless ≠ No Servers. It means YOU don't manage servers. AWS provides and manages the infrastructure invisibly.

Lambda Configuration

Setting | Range | Notes
Memory | 128 MB - 10,240 MB | CPU power scales proportionally with memory
Timeout | 1 second - 15 minutes | Function killed after timeout; set appropriately
/tmp Storage | 512 MB - 10,240 MB | Temporary disk; shared across warm invocations
Concurrency | Up to 1,000 (default, per region) | Request an increase; set reserved concurrency to limit
Package size | 50 MB (zip), 250 MB (unzipped) | Use Layers for large dependencies
Env variables | 4 KB total | Use Secrets Manager for sensitive values

Supported Runtimes

Python 3.8-3.12, Node.js 18-20, Java 11-21, Go 1.x, .NET 6/8, Ruby 3.2, Custom Runtime (any language)

Lambda Function Structure

import json, boto3, os

# Code OUTSIDE handler runs once per container (cold start)
# Reuse these across warm invocations!
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TABLE_NAME'])

def lambda_handler(event, context):
    # event: input data (from API GW, S3, SQS, etc.)
    # context: runtime info (function name, remaining time, etc.)
    
    print(f"Function: {context.function_name}")
    print(f"Remaining time: {context.get_remaining_time_in_millis()}ms")
    print(f"Event: {json.dumps(event)}")
    
    # Process
    name = event.get('name', 'World')
    
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'message': f'Hello, {name}!'})
    }

Lambda Layers

Layers are ZIP archives containing libraries, custom runtimes, or dependencies shared across multiple functions. They shrink deployment packages and encourage code reuse.

  • Up to 5 layers per function, total <250 MB unzipped
  • Versioned - each publish creates a new version
  • Share across accounts (specific accounts or public)
  • AWS provides ready-made layers: AWS SDK, Powertools for Lambda
# Create a layer with Python packages
mkdir -p python/lib/python3.12/site-packages
pip install pandas numpy requests -t python/lib/python3.12/site-packages/
zip -r pandas-layer.zip python/
aws lambda publish-layer-version \
  --layer-name pandas-numpy \
  --zip-file fileb://pandas-layer.zip \
  --compatible-runtimes python3.12

Cold Start vs Warm Start

🥶 Cold Start

  • New execution environment created
  • Download code, initialize runtime, run init code
  • Adds 100 ms-3 s of latency
  • Happens when: first invocation, scaling out, no recent use
  • Worst with Java/C# runtimes and large packages

🔥 Warm Start

  • Existing environment reused
  • Only the handler code runs
  • Millisecond latency
  • External connections (DB, APIs) are reused!
  • Tip: keep DB connections outside the handler
Reduce Cold Starts: Provisioned Concurrency (pre-warm N environments, extra cost), smaller deployment packages, lazy imports, SnapStart (Java 11+, takes a snapshot after init). Python and Node.js have the fastest cold starts.

Lambda Event Sources (Triggers)

Source | Invocation Type | Use Case
API Gateway / ALB | Synchronous | REST APIs, web backends
S3 | Asynchronous | Image processing on upload, data pipelines
DynamoDB Streams | Stream (polling) | React to DB changes, replicate data
SQS | Stream (polling) | Process queue messages, decoupled workflows
SNS | Asynchronous | Fan-out processing, notifications
EventBridge | Asynchronous | Scheduled tasks (cron), event-driven workflows
Kinesis | Stream (polling) | Real-time data stream processing
CloudWatch Logs | Asynchronous | Log processing, alerting from log patterns
🔌

Lambda Integrations

Serverless

Lambda Limits

Limit | Value | Notes
Max timeout | 15 minutes | For longer tasks use Step Functions or ECS
Max memory | 10,240 MB (10 GB) | More memory = more vCPU
Concurrency (default) | 1,000/region | Request an increase via support
Package size (zip) | 50 MB | Use Layers for larger deps
Package (unzipped) | 250 MB | Including all layers
Response payload (sync) | 6 MB | Use S3 for large responses
Async payload | 256 KB | Pass an S3 key for large data
Env variables | 4 KB total | -

Lambda โ†’ RDS Connection (via RDS Proxy)

Problem: Lambda can have 1,000 concurrent executions, each opening a DB connection = 1,000 connections. Most DBs max out at 100-500 connections. Solution: Use RDS Proxy to pool connections.
import boto3, pymysql, json, os

# Initialize OUTSIDE handler (connection reuse on warm invocations)
db_conn = None

def get_db_connection():
    creds = boto3.client('secretsmanager').get_secret_value(
        SecretId='prod/rds/mysql')
    c = json.loads(creds['SecretString'])
    return pymysql.connect(
        host=os.environ['RDS_PROXY_ENDPOINT'],  # Proxy, not RDS endpoint!
        user=c['username'], password=c['password'],
        database='myapp', cursorclass=pymysql.cursors.DictCursor,
        connect_timeout=5
    )

def lambda_handler(event, context):
    global db_conn
    if not db_conn or not db_conn.open:
        db_conn = get_db_connection()
    
    with db_conn.cursor() as cursor:
        cursor.execute("SELECT * FROM users LIMIT 10")
        return {'statusCode': 200, 'body': json.dumps(cursor.fetchall())}

Lambda โ†’ DynamoDB

import boto3
from decimal import Decimal

# Outside handler = reuse
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

def lambda_handler(event, context):
    # Write (no connection pool needed - HTTP API)
    table.put_item(Item={
        'order_id': event['order_id'],
        'user_id': event['user_id'],
        'amount': Decimal(str(event['amount'])),
        'status': 'pending'
    })
    
    # Read
    resp = table.get_item(Key={'order_id': event['order_id']})
    return resp.get('Item', {})

Lambda โ†’ API Gateway Integration

# API Gateway Proxy Integration passes full HTTP context to Lambda
# Request: POST /users โ†’ Lambda receives:
event = {
    "httpMethod": "POST",
    "path": "/users",
    "pathParameters": {"id": "123"},
    "queryStringParameters": {"page": "1"},
    "headers": {"Authorization": "Bearer token..."},
    "body": '{"name":"Ravi","email":"<a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="126073647b52776a737f627e773c717d7f">[email protected]</a>"}',
    "isBase64Encoded": False
}

# Lambda MUST return this structure:
return {
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json",
        "Access-Control-Allow-Origin": "*"  # for CORS
    },
    "body": json.dumps({"user_id": "U123", "name": "Ravi"})
}

Lambda Environment Variables & Secrets

# Environment variables (non-sensitive config)
import os
TABLE_NAME = os.environ['TABLE_NAME']
REGION = os.environ.get('AWS_REGION', 'ap-south-1')

# For SECRETS โ†’ use Secrets Manager (not env vars!)
import boto3, json
_secret_cache = {}

def get_secret(name):
    if name not in _secret_cache:
        client = boto3.client('secretsmanager')
        _secret_cache[name] = json.loads(
            client.get_secret_value(SecretId=name)['SecretString']
        )
    return _secret_cache[name]  # cached after first call
🌐

Route 53

Networking

What is DNS and How Does it Work?

DNS (Domain Name System) translates human-readable domain names (google.com) into IP addresses (142.250.80.46) that computers use. Without DNS, you'd need to memorize IP addresses for every website. Route 53 is AWS's highly available and scalable DNS web service โ€” named after the DNS port (53).

User types www.example.com in the browser:
1. Browser checks its local cache → not found
2. OS asks the Recursive Resolver (usually the ISP's, or 8.8.8.8)
3. Recursive Resolver asks a Root Nameserver → "ask the .com TLD server"
4. Asks the .com TLD server → "ask ns-123.awsdns-45.com"
5. Asks the Route 53 Nameserver → returns "93.184.216.34"
6. Browser connects to 93.184.216.34
Total time: ~50-200 ms (first time), ~0 ms (cached)

DNS Record Types

Record | Maps | Important Notes | Example
A | hostname → IPv4 | Most common record | example.com → 93.184.216.34
AAAA | hostname → IPv6 | Next-gen internet | example.com → 2606:2800:220:1:248:1893:25c8:1946
CNAME | hostname → hostname | Cannot be used at the zone apex (example.com), only on subdomains (www.example.com) | www.example.com → example.com
Alias | hostname → AWS resource | AWS extension. Works at the apex. FREE queries. Use instead of CNAME for AWS resources. | example.com → myalb.amazonaws.com
MX | domain → mail servers | Priority number (lower = preferred) | 10 mail.example.com
TXT | domain → text string | Domain verification, SPF, DKIM | "v=spf1 include:amazonses.com ~all"
NS | zone → nameservers | Which servers are authoritative for the zone | ns-123.awsdns-45.com
SOA | zone metadata | Start of Authority: admin info, TTL defaults | Auto-created with the hosted zone
PTR | IP → hostname | Reverse DNS lookup | 34.216.93.184 → ec2.amazonaws.com
SRV | service location | Used for VoIP, XMPP, Kubernetes | _http._tcp.example.com

Hosted Zones

๐ŸŒ Public Hosted Zone

  • Answers DNS queries from the internet
  • $0.50/zone/month + $0.40/million queries
  • Created automatically when registering domain in Route 53
  • For external-facing websites, APIs, services
  • Nameservers assigned automatically (4 NS records)

🔒 Private Hosted Zone

  • Answers DNS queries only from within associated VPCs
  • $0.50/zone/month
  • Must associate with VPC(s) โ€” can associate multiple VPCs (even cross-account)
  • For internal services: db.internal, api.company.local
  • Requires: enableDnsHostnames + enableDnsSupport on VPC

TTL (Time to Live)

TTL tells DNS resolvers how long to cache a record. Choosing the right TTL is a balance between DNS query costs and propagation speed.

  • High TTL (86400 = 24 hours): Fewer queries (cheaper), but changes take 24 hours to propagate worldwide
  • Low TTL (60 = 1 min): Changes propagate in 1 minute, but 1440x more DNS queries (more expensive)
  • Best practice before migration: Lower TTL to 60s a week before making changes, then raise after
  • Alias records don't have a configurable TTL - AWS sets it automatically

Routing Policies โ€” In Depth

Policy | Algorithm | Best For | Health Checks
Simple | Returns all values; the client picks randomly | Single resource, no health checks needed | No
Weighted | Route X% to A, Y% to B based on weights (0-255) | A/B testing, blue/green deployments, gradual migrations | Optional
Failover | Primary active, secondary passive; auto-switch on health check failure | DR setups, active-passive HA | Required on primary
Geolocation | Route based on the user's geographic location (continent, country, state) | Content localization, GDPR data residency, language-specific content | Optional
Geoproximity | Route based on distance with an adjustable bias (+/-) | Shifting traffic between regions, fine-grained global routing | Optional
Latency-based | Route to the AWS region with the lowest measured latency for the user | Global apps where performance matters most | Optional
Multi-Value | Returns up to 8 healthy records randomly | Simple client-side load balancing (not a replacement for ELB) | Integrated
IP-based | Route based on the client's originating IP CIDR | Route ISP traffic to specific endpoints, optimize peering | No

Health Checks

Route 53 health checkers are deployed in 15+ locations globally. They check your endpoints every 10 or 30 seconds and mark them unhealthy if enough checks fail โ€” automatically removing them from DNS responses.

  • HTTP/HTTPS/TCP health checks: Check endpoint response. Must return 2xx/3xx within 4 seconds, response body can be checked for string match (first 5120 bytes).
  • Threshold: 3 consecutive failures = unhealthy. 3 consecutive successes = healthy (configurable).
  • Calculated Health Checks: Combine child health checks with AND/OR/NOT logic. Useful for "healthy if 2 of 3 servers up".
  • CloudWatch Alarm Health Checks: For private resources not reachable from the internet - check the CloudWatch alarm state instead.
  • Cost: $0.50-0.75/health check/month
# Failover routing - example setup:
Primary record:   www.example.com → ALB in us-east-1 (health check attached)
Secondary record: www.example.com → ALB in eu-west-1 (failover target)

If the primary health check fails 3+ consecutive times:
→ Route 53 automatically serves the secondary record
→ Recovery is automatic once the primary becomes healthy again
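
A sketch of creating or updating a record with Boto3; the hosted zone ID and IP address are placeholders. UPSERT creates the record if it's missing and updates it otherwise.

import boto3

r53 = boto3.client('route53')

r53.change_resource_record_sets(
    HostedZoneId='Z1234567890ABC',           # your zone ID
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'www.example.com',
            'Type': 'A',
            'TTL': 60,
            'ResourceRecords': [{'Value': '203.0.113.10'}]
        }
    }]}
)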

Domain Registration

  • Route 53 is an ICANN-accredited domain registrar - buy domains directly in AWS
  • Supports 400+ TLDs (.com, .io, .in, .co, .tech, .cloud, etc.)
  • Domains auto-renew by default (can disable)
  • Transfer existing domains from other registrars into Route 53
  • Privacy protection available โ€” hides personal info from WHOIS lookup
🌐

CloudFront

DNS & CDN

CloudFront Overview

CloudFront is AWS's Content Delivery Network (CDN) with 400+ Points of Presence (edge locations) globally. Content is cached at edge locations closest to users, reducing latency and origin load.

  • Supports HTTP/HTTPS, WebSocket, streaming (HLS, DASH)
  • Integrates with Shield (DDoS), WAF (Layer 7), ACM (SSL/TLS), Lambda@Edge
  • Default TTL: 24 hours. Configurable per behavior.
  • HTTP to HTTPS redirect built-in (viewer protocol policy)

Origins

Origin Type | Use Case | Security
S3 Bucket | Static websites, file downloads, media | OAC (Origin Access Control) blocks direct S3 URL access
ALB | Dynamic web apps, APIs behind a load balancer | Custom header (X-Origin-Key) to verify requests come from CloudFront
EC2 Instance | Custom servers (must have a public IP) | Security group allowing CloudFront IP ranges
Any HTTP Endpoint | On-premises, third-party servers | Custom headers, IP whitelisting

Cache Behaviors

  • Map URL path patterns to different origins: /api/* → ALB (no cache), /images/* → S3 (cache 7 days), /* → S3 (cache 24 h)
  • Cache Policy: What goes into cache key (headers, cookies, query strings). More = less cache hits.
  • Origin Request Policy: What to forward to origin (headers, cookies, query strings)
  • Viewer protocol policy: HTTP + HTTPS, HTTPS only, Redirect HTTP to HTTPS

Invalidations

# Force CloudFront to fetch fresh content from origin
aws cloudfront create-invalidation \
  --distribution-id E1234567ABCDEF \
  --paths "/*"                    # all files
  # OR "/index.html"              # specific file
  # OR "/images/*"                # specific path

# Cost: First 1,000 invalidation paths/month free, then $0.005 each
# Better approach: use versioned filenames (main.v2.3.css) - no invalidation needed!

Security Features

  • Origin Access Control (OAC): CloudFront sends signed requests to S3. S3 bucket policy allows only from CloudFront OAC. Direct S3 URL = 403 Forbidden.
  • Signed URLs: Grant time-limited access to individual files (see the sketch after this list). Use for: paid content, user-specific files
  • Signed Cookies: Grant access to multiple files without changing URLs. Use for: premium video libraries
  • Geographic Restrictions: Whitelist (allow only) or Blacklist (deny) countries
  • Field-Level Encryption: Encrypt specific POST data fields (e.g., credit cards) at edge
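
A sketch of generating a signed URL with botocore's CloudFrontSigner and the rsa package; the key pair ID, private key file, and distribution domain are placeholders.

from datetime import datetime, timedelta, timezone
from botocore.signers import CloudFrontSigner
import rsa  # pip install rsa

def rsa_signer(message):
    # Sign with the private key whose public half is registered in CloudFront
    with open('private_key.pem', 'rb') as f:
        key = rsa.PrivateKey.load_pkcs1(f.read())
    return rsa.sign(message, key, 'SHA-1')  # CloudFront expects SHA-1 here

signer = CloudFrontSigner('K2JCJMDEHXQW5F', rsa_signer)  # key pair ID

url = signer.generate_presigned_url(
    'https://d111111abcdef8.cloudfront.net/videos/premium.mp4',
    date_less_than=datetime.now(timezone.utc) + timedelta(hours=1)
)
print(url)  # the link expires in 1 hour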

Lambda@Edge

Run Lambda functions at CloudFront edge locations to customize content delivery. 4 trigger points per request cycle:

  1. Viewer Request - before the cache check (auth, URL rewrites)
  2. Origin Request - on a cache miss, before the origin (modify headers)
  3. Origin Response - after the origin responds (add headers)
  4. Viewer Response - before delivery to the user (security headers)

  • Runtimes: Node.js and Python only
  • Limits: Viewer triggers = 1 MB code / 128 MB memory / 5 s. Origin triggers = 50 MB / 10 GB / 30 s.
  • Use cases: A/B testing, auth checks, SEO, URL rewriting, redirects, response compression
  • CloudFront Functions (newer, cheaper): JavaScript only, sub-millisecond, viewer request/response only. Ideal for URL redirects and header manipulation.
๐Ÿ›๏ธ

Terraform with AWS

Infrastructure as Code

What is Infrastructure as Code?

Infrastructure as Code (IaC) means defining your cloud infrastructure in code files instead of clicking through consoles. This brings software engineering best practices (version control, code review, testing, CI/CD) to infrastructure management.

✅ Benefits of IaC

  • Reproducible: same code = same infrastructure every time
  • Version controlled: see every change in Git history
  • Reviewable: infrastructure changes go through PR process
  • Automated: deploy via CI/CD pipeline, no manual clicks
  • Documented: code IS the documentation
  • Disaster recovery: rebuild entire environment in minutes

Terraform vs CloudFormation

  • Terraform: Multi-cloud (AWS, Azure, GCP, K8s), HCL language, huge ecosystem, state file management needed
  • CloudFormation: AWS-only, JSON/YAML, deep AWS integration, no state management (AWS handles it), free
  • Both are valid - Terraform is preferred for multi-cloud, CloudFormation for AWS-only shops needing deep integration

Terraform Core Concepts

Concept | Description | Example
Provider | Plugin that talks to a cloud API; translates HCL into API calls | hashicorp/aws, hashicorp/azurerm, hashicorp/kubernetes
Resource | Infrastructure component you want to create/manage | aws_instance, aws_s3_bucket, aws_vpc
Data Source | Read existing resource info (don't manage it, just read it) | data.aws_ami.latest, data.aws_vpc.default
Variable | Input parameter that makes code reusable | var.instance_type, var.environment
Local | Computed value within a module; avoids repetition | local.name_prefix = "prod-app"
Output | Export values after apply, for other modules or scripts | output: EC2 IP, RDS endpoint
Module | Reusable package of Terraform code | module "vpc" { source = "./modules/vpc" }
State | JSON file tracking what Terraform has created; the source of truth | terraform.tfstate (store it in S3!)

Terraform Workflow

# 1. Initialize - download providers, set up the backend
terraform init

# 2. Format code (always run before committing)
terraform fmt -recursive

# 3. Validate syntax
terraform validate

# 4. ALWAYS review plan before applying!
terraform plan
terraform plan -out=tfplan.out    # save plan for apply

# 5. Apply changes
terraform apply                   # interactive confirmation
terraform apply tfplan.out        # apply saved plan
terraform apply -auto-approve     # CI/CD (no prompt)

# Other useful commands
terraform destroy                 # destroy everything (careful!)
terraform output                  # show outputs
terraform state list              # list managed resources
terraform state show aws_instance.web  # inspect resource state
terraform import aws_s3_bucket.logs my-bucket-name  # import existing resource

Remote State โ€” Critical for Teams

By default, Terraform stores state locally (terraform.tfstate). This breaks down in teams - two people can't work simultaneously and state isn't shared. ALWAYS use remote state with S3 + DynamoDB locking in production.

# versions.tf - remote backend configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket         = "mycompany-terraform-state"    # must exist first!
    key            = "prod/ap-south-1/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true                           # encrypt state at rest
    dynamodb_table = "terraform-locks"              # prevent concurrent applies
  }
}

# Create the S3 bucket and DynamoDB table manually first (bootstrap):
aws s3api create-bucket --bucket mycompany-terraform-state --region ap-south-1
aws dynamodb create-table --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Complete VPC + EC2 Example

# variables.tf
variable "env"    { default = "dev" }
variable "region" { default = "ap-south-1" }

# vpc.tf
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "${var.env}-vpc" }
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.region}a"
  map_public_ip_on_launch = true
  tags = { Name = "${var.env}-public-1a" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.env}-igw" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route { cidr_block = "0.0.0.0/0"; gateway_id = aws_internet_gateway.igw.id }
  tags = { Name = "${var.env}-public-rt" }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# ec2.tf
data "aws_ami" "al2" {
  most_recent = true
  owners      = ["amazon"]
  filter { name = "name"; values = ["amzn2-ami-hvm-*-x86_64-gp2"] }
}

resource "aws_security_group" "web" {
  name   = "${var.env}-web-sg"
  vpc_id = aws_vpc.main.id
  ingress { from_port=80;  to_port=80;  protocol="tcp"; cidr_blocks=["0.0.0.0/0"] }
  ingress { from_port=443; to_port=443; protocol="tcp"; cidr_blocks=["0.0.0.0/0"] }
  ingress { from_port=22;  to_port=22;  protocol="tcp"; cidr_blocks=["10.0.0.0/8"] }
  egress  { from_port=0;   to_port=0;   protocol="-1";  cidr_blocks=["0.0.0.0/0"] }
}

resource "aws_instance" "web" {
  ami                    = data.aws_ami.al2.id
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]
  user_data = <<-EOF
    #!/bin/bash
    yum install -y nginx
    systemctl start nginx
    systemctl enable nginx
  EOF
  tags = { Name = "${var.env}-web" }
}

# outputs.tf
output "web_public_ip"  { value = aws_instance.web.public_ip }
output "web_public_dns" { value = aws_instance.web.public_dns }

Terraform Modules

Modules are reusable packages of Terraform configuration. Instead of copy-pasting VPC code across multiple projects, create a VPC module once and reuse it everywhere.

# Using community modules from Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
  
  name = "my-vpc"
  cidr = "10.0.0.0/16"
  azs             = ["ap-south-1a", "ap-south-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true
  tags = { Terraform = "true", Environment = "dev" }
}

# Reference module outputs
resource "aws_instance" "app" {
  subnet_id = module.vpc.private_subnets[0]
  # ...
}

Terraform Best Practices

  • Always run terraform plan and review before apply
  • Store state in S3 with DynamoDB locking - never commit .tfstate to Git
  • Use workspaces or separate state files per environment (dev/staging/prod)
  • Pin provider versions: version = "~> 5.0"
  • Use terraform fmt pre-commit hooks for consistent formatting
  • Use terraform validate in CI/CD pipeline
  • Sensitive outputs: mark with sensitive = true
  • Use count or for_each instead of duplicating resources
  • Organize large configs: separate files (vpc.tf, ec2.tf, rds.tf, variables.tf, outputs.tf)
🐍

Python Boto3 for AWS Automation

Scripting

What is Boto3?

Boto3 is the official AWS SDK for Python. It lets you programmatically interact with AWS services - create resources, manage infrastructure, automate tasks, and build applications that use AWS.

pip install boto3

# Two interfaces:
import boto3
# 1. Client (low-level, 1:1 map to AWS API)
ec2_client = boto3.client('ec2', region_name='ap-south-1')
# 2. Resource (high-level, object-oriented)
s3 = boto3.resource('s3')

# Authentication order:
# 1. Environment variables (AWS_ACCESS_KEY_ID, etc.)
# 2. ~/.aws/credentials file (aws configure)
# 3. IAM Instance Profile (EC2) or Task Role (ECS/Lambda) ← recommended on AWS

EC2 Automation

import boto3, datetime  # datetime is used by snapshot_volume below

ec2 = boto3.client('ec2', region_name='ap-south-1')

# List all running instances with details
def list_instances(state='running'):
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=[{'Name':'instance-state-name','Values':[state]}]):
        for r in page['Reservations']:
            for i in r['Instances']:
                name = next((t['Value'] for t in i.get('Tags',[]) if t['Key']=='Name'), 'N/A')
                print(f"{i['InstanceId']:20} {i['InstanceType']:12} {i.get('PublicIpAddress','Private'):15} {name}")

# Start/Stop/Reboot
ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
ec2.stop_instances(InstanceIds=['i-1234567890abcdef0'])
ec2.reboot_instances(InstanceIds=['i-1234567890abcdef0'])

# Create snapshot with tags
def snapshot_volume(volume_id, desc="Auto backup"):
    snap = ec2.create_snapshot(VolumeId=volume_id, Description=desc,
        TagSpecifications=[{'ResourceType':'snapshot',
            'Tags':[{'Key':'AutoCreated','Value':'true'},
                    {'Key':'Date','Value':str(datetime.date.today())}]}])
    return snap['SnapshotId']

# Delete old snapshots (older than N days)
def cleanup_old_snapshots(days=30):
    from datetime import datetime, timezone, timedelta
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    snaps = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    for s in snaps:
        if s['StartTime'] < cutoff and s.get('Tags'):
            if any(t['Key']=='AutoCreated' for t in s['Tags']):
                ec2.delete_snapshot(SnapshotId=s['SnapshotId'])
                print(f"Deleted {s['SnapshotId']}")

S3 Automation

import boto3, os
from pathlib import Path

s3 = boto3.client('s3')

# Upload file with progress
def upload_file(path, bucket, key=None, extra_args=None):
    key = key or os.path.basename(path)
    s3.upload_file(path, bucket, key, ExtraArgs=extra_args or {})
    print(f"โœ“ Uploaded {path} โ†’ s3://{bucket}/{key}")

# Upload with metadata and encryption
upload_file('report.pdf', 'my-bucket', 'reports/report.pdf', {
    'ContentType': 'application/pdf',
    'ServerSideEncryption': 'aws:kms',
    'Metadata': {'author': 'Ravi', 'version': '2.0'}
})

# List all objects with pagination
def list_all_objects(bucket, prefix=''):
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            print(f"{obj['Key']:60} {obj['Size']:10} bytes")

# Generate pre-signed URL for download
def get_presigned_url(bucket, key, expires=3600):
    return s3.generate_presigned_url('get_object',
        Params={'Bucket': bucket, 'Key': key}, ExpiresIn=expires)

# Clean up old files
def delete_old_files(bucket, prefix, days=30):
    from datetime import datetime, timezone, timedelta
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        old = [{'Key': o['Key']} for o in page.get('Contents',[]) if o['LastModified'] < cutoff]
        if old:
            s3.delete_objects(Bucket=bucket, Delete={'Objects': old})
            print(f"Deleted {len(old)} old files")

Lambda Automation

import boto3, json, zipfile, io

lambda_client = boto3.client('lambda', region_name='ap-south-1')

# Invoke Lambda synchronously
def invoke_lambda(func_name, payload):
    resp = lambda_client.invoke(
        FunctionName=func_name,
        InvocationType='RequestResponse',  # sync
        Payload=json.dumps(payload)
    )
    result = json.loads(resp['Payload'].read())
    if resp.get('FunctionError'):
        raise Exception(f"Lambda error: {result}")
    return result

# Update function code from local file
def deploy_function(func_name, code_file):
    with open(code_file, 'rb') as f:
        lambda_client.update_function_code(
            FunctionName=func_name, ZipFile=f.read())
    print(f"โœ“ Deployed {func_name}")

# Update environment variables
lambda_client.update_function_configuration(
    FunctionName='my-function',
    Environment={'Variables': {'TABLE_NAME': 'NewTable', 'ENV': 'prod'}}
)

CloudWatch and RDS Automation

import boto3

logs = boto3.client('logs')
rds = boto3.client('rds', region_name='ap-south-1')

# Get Lambda error logs from last N hours
def get_lambda_errors(func_name, hours=1):
    import time
    log_group = f'/aws/lambda/{func_name}'
    start_ms = int((time.time() - hours*3600) * 1000)
    resp = logs.filter_log_events(
        logGroupName=log_group, startTime=start_ms, filterPattern='ERROR')
    for e in resp['events']:
        print(e['message'].strip())

# Create RDS snapshot
def backup_rds(db_instance_id):
    import datetime
    snap_id = f"{db_instance_id}-{datetime.date.today().isoformat()}"
    rds.create_db_snapshot(DBInstanceIdentifier=db_instance_id, DBSnapshotIdentifier=snap_id)
    print(f"โœ“ Snapshot {snap_id} created")

# List RDS instances with status
def list_rds_instances():
    for db in rds.describe_db_instances()['DBInstances']:
        print(f"{db['DBInstanceIdentifier']:30} {db['DBInstanceStatus']:12} {db['DBInstanceClass']}")
🔄

DMS โ€” Database Migration Service

Migration

What is DMS?

AWS Database Migration Service (DMS) helps migrate databases to AWS with minimal downtime. The source database remains fully operational during migration โ€” your application keeps running. Only a brief cutover pause (seconds to minutes) is needed at the very end. DMS handles the complexity of moving data, keeping it in sync, and notifying you when it's safe to switch.

Key Value Proposition: Traditional database migrations required hours or days of planned downtime. DMS enables "live migration" - continuously syncing changes so cutover is just updating a connection string.

DMS Architecture

  • Replication Instance: The EC2 instance DMS runs on. Reads from source, writes to target. Choose size based on data volume.
  • Source Endpoint: Connection details for your source DB (hostname, port, credentials, engine type)
  • Target Endpoint: Connection details for your destination DB
  • Replication Task: Defines what to migrate, which tables, migration type, and settings
Flow:
Source DB ──► Replication Instance ──► Target DB
(MySQL EC2)   (reads changes via CDC)   (Amazon Aurora)

Migration Types

Type | How it Works | Downtime | When to Use
Full Load | Copies all existing data; no CDC. Source must be static during migration. | High (must stop writes) | Dev/test DBs, small non-critical DBs that can afford downtime
Full Load + CDC | Full load first, then CDC captures ongoing changes, keeping the target in sync until cutover | Minutes (cutover only) | Production systems; the most common approach
CDC Only | Replicates only ongoing changes; assumes the initial data is already in the target | None | Data already loaded manually (e.g. pg_dump) and needing ongoing sync

Change Data Capture (CDC)

CDC is the technology that enables near-zero downtime migration. DMS reads the database's transaction log (binlog for MySQL, WAL for PostgreSQL, redo log for Oracle) to capture every INSERT, UPDATE, DELETE and replay it on the target.

  • MySQL: Enable binlog: set binlog_format=ROW, binlog_row_image=FULL
  • PostgreSQL: Enable logical replication: set wal_level=logical
  • Oracle: Enable supplemental logging, use LogMiner or Binary Reader
  • SQL Server: Enable MS-CDC on tables to migrate

Step-by-Step: MySQL EC2 โ†’ Amazon Aurora

  1. Pre-Migration: Enable the MySQL binlog on the source EC2 instance. Create the Aurora cluster as the target. Run the DMS premigration assessment to identify issues.
  2. Create Replication Instance: DMS Console → Replication instances → Create. Choose a class (dms.t3.medium for small, dms.r5.large for large). Multi-AZ: Yes for production. Wait for "Available".
  3. Create Source Endpoint: Engine=MySQL, Server=EC2 private IP, Port=3306, Username=dms_user, Password. Click "Test connection" - it must show Success.
  4. Create Target Endpoint: Engine=Aurora MySQL, Server=Aurora cluster endpoint, Port=3306. Test the connection.
  5. Create Migration Task: Select the replication instance + both endpoints. Migration type: Full load + CDC. Table mappings: include all schemas or specific tables. Enable logging.
  6. Monitor Progress: Watch the "Table statistics" tab - rows loaded, inserts/updates/deletes applied via CDC. Check "CDC latency" - it should trend toward 0 (see the monitoring sketch below).
  7. Validate Data: Use the DMS Data Validation feature (row counts + checksums) to verify the target matches the source.
  8. Cutover: Stop writes to the source → wait for CDC latency = 0 → update the app connection strings to the Aurora endpoint → resume writes → stop the DMS task.
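
A monitoring sketch for step 6 using Boto3; the task ARN is a placeholder. Note that CDC latency itself is published as CloudWatch metrics in the AWS/DMS namespace.

import boto3

dms = boto3.client('dms', region_name='ap-south-1')

# Check task status and full-load progress before planning cutover
task = dms.describe_replication_tasks(
    Filters=[{'Name': 'replication-task-arn',
              'Values': ['arn:aws:dms:ap-south-1:123456789012:task:ABCDE']}]
)['ReplicationTasks'][0]

print(task['Status'])                                    # e.g. 'running'
print(task['ReplicationTaskStats']['FullLoadProgressPercent'])

# CDC latency lives in CloudWatch (AWS/DMS namespace):
# CDCLatencySource / CDCLatencyTarget - both should be near 0 at cutover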

AWS Schema Conversion Tool (SCT)

SCT is required for heterogeneous migrations (source and target are different database engines). It converts database schema, stored procedures, views, and functions from one SQL dialect to another.

Supported Conversions

  • Oracle → Aurora PostgreSQL/MySQL
  • SQL Server → Aurora PostgreSQL/MySQL
  • Teradata → Amazon Redshift
  • SAP ASE → Aurora PostgreSQL/MySQL
  • IBM Db2 → Aurora PostgreSQL

SCT Assessment Report

  • Shows % of schema auto-converted vs manual effort
  • Red items: require manual rewrite (stored procs, vendor-specific functions)
  • Green items: converted automatically
  • Estimates conversion effort in person-days
  • Free download from AWS website

DMS Sources and Targets

Sources

  • Oracle, SQL Server, MySQL, MariaDB, PostgreSQL
  • MongoDB, IBM Db2, SAP ASE, Sybase
  • Amazon RDS (all engines), Aurora
  • Amazon S3 (CSV/Parquet as source)
  • Azure SQL, Google Cloud SQL

Targets

  • All RDS engines, Aurora, Redshift
  • Amazon S3 (CSV/Parquet - data lake)
  • Amazon DynamoDB (from relational)
  • Amazon OpenSearch Service
  • Amazon Kinesis Data Streams
  • Apache Kafka (MSK)

Replication Instance Sizing

Class | vCPU | RAM | Use Case
dms.t3.micro | 2 | 1 GB | Dev/test, very small DBs (<1 GB)
dms.t3.medium | 2 | 4 GB | Small production (<10 GB)
dms.r5.large | 2 | 16 GB | Medium production (10-100 GB)
dms.r5.xlarge | 4 | 32 GB | Large production (100 GB+)
dms.r5.4xlarge | 16 | 128 GB | Very large migrations (TB scale)

DMS Best Practices

  • Always run Pre-Migration Assessment to catch issues before starting
  • Enable Multi-AZ replication instance for production migrations
  • Place replication instance in same VPC as target (minimize latency)
  • Drop indexes and foreign keys on target before full load, recreate after (3-5x faster)
  • Use Parallel Load for large tables (partition into segments, load in parallel)
  • Enable Data Validation to verify row counts and checksums post-migration
  • Test on a staging environment that mirrors production before go-live
  • Monitor CDC Latency Source and CDC Latency Target - both should be near 0 before cutover
  • Keep source DB running for 1-2 weeks post-cutover as rollback option