Multicloud DevOps
With AI

By Veera Sir

26 Topics

โ˜๏ธ Introduction to Cloud Computing Fundamentals
๐Ÿ–ฅ๏ธ Virtualization Fundamentals
๐Ÿง Linux Basics Fundamentals
๐Ÿ–ฅ๏ธ EC2 โ€” Elastic Compute Cloud Compute
โš™๏ธ EC2 Instance Management Compute
โš–๏ธ Elastic Load Balancing Compute
๐Ÿ’ฐ Billing & Monitoring Management
๐Ÿ“ˆ Auto Scaling Compute
๐Ÿ’พ EBS / EFS Storage Storage
๐Ÿฐ VPC โ€” Virtual Private Cloud Networking
๐Ÿ›ก๏ธ VPC Controls Networking
๐Ÿชฃ S3 โ€” Simple Storage Service Storage
๐Ÿชฃ S3 Advanced Storage
๐Ÿ”“ S3 Access Control Storage
๐Ÿ” IAM Security
๐Ÿ”‘ Secrets & Keys Security
๐Ÿ“Š CloudWatch Monitoring
๐Ÿ”” CloudWatch Advanced Monitoring
๐Ÿ›ก๏ธ Security Tools Security
โšก Lambda Serverless
๐Ÿ”Œ Lambda Integrations Serverless
๐ŸŒ Route 53 Networking
๐ŸŒ CloudFront Networking
๐Ÿ›๏ธ Terraform IaC
๐Ÿ Python Boto3 Scripting
๐Ÿ”„ DMS Migration Migration
โ˜๏ธ

Introduction to Cloud Computing

Cloud Fundamentals

Why Cloud Computing?

Before cloud computing, every company had to build and manage its own data centers: buying servers, networking gear, and storage; hiring infrastructure teams; and paying for power and cooling. This was expensive, slow, and inflexible. Cloud computing solves this by providing IT infrastructure over the internet as a service.

  • No Upfront Capital Expense (CAPEX → OPEX): Instead of buying hardware, pay only for what you use. Convert capital expenses into variable operational expenses.
  • Scale Globally in Minutes: Deploy workloads in any AWS Region worldwide and go from 1 server to 1,000 servers in minutes.
  • Increased Speed & Agility: Developers can provision resources in seconds, versus the weeks needed in traditional data centers. Faster time to market.
  • Focus on Core Business: AWS manages data centers, hardware maintenance, and physical security; you focus on building products.
  • Economies of Scale: AWS buys hardware in massive volume, and that bulk buying power means lower costs passed to customers than running your own data center.
  • Stop Guessing Capacity: No need to predict infrastructure needs months in advance. Scale up and down on demand.

Benefits of Cloud Computing

💰 Cost Savings

  • No hardware purchase
  • Pay-as-you-go pricing
  • No maintenance costs
  • No idle capacity
  • Free tier available

⚡ Performance & Speed

  • Latest hardware always
  • Global low-latency network
  • High availability SLAs
  • Provision in seconds

🔒 Security & Reliability

  • Physical security by AWS
  • Compliance certifications
  • Encryption built-in
  • Multiple redundancy layers

Types of Cloud Computing

Type | Description | Who Controls Hardware | Example
Public Cloud | Owned and operated by third-party cloud providers; resources shared over the internet (multi-tenant). | Cloud Provider (AWS) | AWS, Azure, GCP
Private Cloud | Cloud infrastructure used exclusively by a single organization; can be on-premises or hosted by a third party. | Organization or hosting provider | VMware, OpenStack, IBM Cloud Private
Hybrid Cloud | Combination of public and private clouds with data and application portability between them. Best of both worlds. | Both | AWS Outposts + AWS Cloud
Multi-Cloud | Services from multiple cloud providers used simultaneously; avoids vendor lock-in. | Multiple providers | AWS + Azure + GCP together
Community Cloud | Shared infrastructure for a specific community with common concerns (compliance, security). | Community/Provider | Government cloud, healthcare cloud

Cloud Service Models - IaaS, PaaS, SaaS

These three models define how much of the stack you manage versus how much the cloud provider manages. Think of the pizza analogy: how much do you make yourself versus order in?

IaaS - Infrastructure as a Service

  • Provider manages: Hardware, networking, virtualization, storage
  • You manage: OS, runtime, middleware, applications, data
  • Most control, most responsibility
  • AWS Examples: EC2, VPC, EBS, S3
  • Good for: Lift-and-shift migrations, custom OS needs

PaaS - Platform as a Service

  • Provider manages: Hardware + OS + runtime + middleware
  • You manage: Applications and data only
  • Focus on code, not infrastructure
  • AWS Examples: Elastic Beanstalk, RDS, Lambda
  • Good for: Developers who want to just deploy code

SaaS - Software as a Service

  • Provider manages: Everything including the application
  • You manage: Only your data and user access
  • Least control, least responsibility
  • Examples: Gmail, Office 365, Salesforce
  • Good for: End-users, no IT required
Memory Trick: IaaS = "I" manage almost everything. PaaS = "P"latform handles the plumbing. SaaS = "S"omeone else does it all. On-Premises = You manage 100% everything.

Scaling in Cloud Computing

📈 Vertical Scaling (Scale Up/Down)

  • Increase/decrease the size of an existing instance
  • Example: t3.micro → t3.xlarge (more CPU + RAM)
  • Has a physical hardware limit (ceiling)
  • Usually requires downtime/reboot
  • Simple to implement (no code changes)
  • Also called "scaling up" or "scaling down"

📊 Horizontal Scaling (Scale Out/In)

  • Add more instances to handle increased load
  • Example: 2 EC2 instances → 10 EC2 instances
  • Virtually unlimited capacity
  • No single point of failure
  • Works with Auto Scaling Groups + ELB
  • Applications must be stateless for best results
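Both models map to concrete AWS operations. A minimal sketch with the AWS CLI (the instance ID and ASG name below are placeholders):

# Vertical scaling: stop, resize, and restart an existing instance (note the downtime)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --instance-type Value=t3.xlarge
aws ec2 start-instances --instance-ids i-1234567890abcdef0

# Horizontal scaling: change the instance count of an Auto Scaling Group
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg --desired-capacity 10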

Cloud Computing Issues & Challenges

  • Vendor Lock-In: Heavy use of AWS-specific services (DynamoDB, Lambda) makes switching providers expensive and difficult.
  • Data Security & Privacy: Data stored off-premises raises concerns about compliance (GDPR, HIPAA), data sovereignty, and breaches.
  • Internet Dependency: Cloud access requires reliable, high-speed internet connectivity. Outages = no access.
  • Downtime Risk: Even top providers have outages. AWS S3 outage in 2017 impacted much of the internet. Need multi-region strategies.
  • Cost Management (FinOps): Uncontrolled usage can lead to surprise bills. Need budget alerts, cost allocation tags, Reserved Instances.
  • Compliance Complexity: Different data residency laws in different countries. Must understand where data is stored.
  • Limited Customization: Managed services abstract away control. Cannot always configure OS-level settings.

Shared Responsibility Model

Key Concept: AWS is responsible for security OF the cloud (the physical infrastructure). YOU are responsible for security IN the cloud (your data, configurations, IAM, application code).

โ˜๏ธ AWS Responsible For

  • Physical data center security (guards, cameras, locks)
  • Hardware (servers, networking equipment)
  • Hypervisor / virtualization layer
  • Global network infrastructure
  • Managed service OS patching (RDS, Lambda)
  • Compliance of underlying infrastructure

👤 Customer Responsible For

  • Data encryption (in transit and at rest)
  • IAM users, roles, policies, MFA
  • OS patching on EC2 instances
  • Application code security
  • Network/Security Group configuration
  • S3 bucket policies and public access settings

The responsibility shifts based on the service type: EC2 (IaaS) = you manage OS and above. RDS (PaaS) = AWS manages OS and DB engine patching. Lambda (Serverless) = AWS manages almost everything.

Cloud Costing Models

Model | Description | Savings vs On-Demand | Best For
On-Demand | Pay per second/hour, no commitment, no upfront cost | Baseline (0%) | Unpredictable workloads, testing, short-term
Reserved Instances (RI) | 1-year or 3-year commitment. Standard RI or Convertible RI. | Up to 72% | Steady-state predictable workloads (databases, web servers)
Spot Instances | Bid on unused AWS EC2 capacity. Can be interrupted with a 2-minute warning. | Up to 90% | Fault-tolerant batch jobs, big data, CI/CD, stateless apps
Savings Plans | Flexible 1-3 year commitment to a usage amount ($/hr). Covers EC2, Fargate, Lambda. | Up to 66% | Flexible usage across instance types and regions
Dedicated Hosts | Physical server dedicated to you. Bring your own license (BYOL). | 0-30% (BYOL savings) | Compliance requirements, software licensing, HIPAA
Dedicated Instances | Runs on hardware dedicated to you, but AWS manages the host. | Small premium over On-Demand | Compliance requiring dedicated hardware

AWS Global Infrastructure

  • Regions (34+): Geographic areas, each containing multiple Availability Zones. Examples: us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland). Data does NOT leave a Region unless you explicitly configure it to.
  • Availability Zones (AZs) (108+): One or more discrete data centers within a Region with redundant power, networking, and connectivity. AZs are connected via private, high-speed fiber links. New Regions launch with a minimum of 3 AZs.
  • Edge Locations (400+): CDN endpoints for Amazon CloudFront and Route 53. Caches content closer to users. NOT full AWS regions โ€” limited services only.
  • Local Zones: AWS infrastructure placed in metro areas closer to large population centers. Example: Los Angeles, Boston. Low-latency for demanding applications.
  • Wavelength Zones: AWS infrastructure embedded in telecom 5G networks for ultra-low latency mobile apps.
  • AWS Outposts: AWS managed hardware running in your own on-premises data center. Extends AWS cloud into your facility.
Exam Tip: For High Availability, always deploy across multiple AZs within a Region. For Disaster Recovery, deploy across multiple Regions. AZ failures happen; Regional failures are extremely rare.
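The global infrastructure is queryable from the AWS CLI; a quick sketch for exploring Regions and AZs:

# List all Regions enabled for your account
aws ec2 describe-regions --query "Regions[].RegionName" --output table

# List the Availability Zones in one Region
aws ec2 describe-availability-zones --region us-east-1 \
  --query "AvailabilityZones[].{Zone:ZoneName,State:State}" --output table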
🖥️ Virtualization

Cloud Fundamentals

What is Virtualization?

Virtualization is the process of creating a software-based (virtual) representation of physical computing resources such as servers, storage, networks, and desktops. It uses a software layer called a Hypervisor (VMM, Virtual Machine Monitor) to abstract physical hardware and present it to multiple virtual machines simultaneously.

Without virtualization, one physical server runs one operating system. With virtualization, one physical server can run 10, 50, or 100+ virtual machines, each with its own OS, isolated from the others.

Virtualization and Cloud Computing

Virtualization is the foundation of cloud computing. Every EC2 instance you launch in AWS is actually a virtual machine running on AWS physical hardware. When you launch 100 EC2 instances, AWS spins up 100 VMs across their physical servers using the Nitro Hypervisor. You share physical hardware with other AWS customers but remain completely isolated.

Key Insight: Cloud computing = virtualizing physical data center resources and selling them as on-demand services over the internet. Virtualization enables the "elastic" in Elastic Compute Cloud (EC2).

Types of Virtualization

Type | Description | How It Works | AWS Service
Server/Hardware Virtualization | Multiple VMs share one physical server | Hypervisor divides CPU/RAM/storage into VMs | EC2 Instances
Storage Virtualization | Multiple physical storage devices pooled into one logical storage unit | Abstraction layer manages storage allocation transparently | EBS, EFS, S3
Network Virtualization | Physical network resources abstracted into software-defined networks | VLANs, SDN, overlay networks | VPC, Security Groups, ENI
Desktop Virtualization (VDI) | Desktop environments hosted on a central server, accessed remotely | Users stream the desktop from the server | Amazon WorkSpaces
Application Virtualization | Application runs in an isolated environment separate from the host OS | Container or sandbox wraps the app | Docker, ECS, EKS
OS-Level Virtualization (Containers) | Multiple isolated user-space instances on the same OS kernel | Namespaces + cgroups isolate processes | ECS, EKS, Fargate

Hypervisor Types - Type 1 vs Type 2

Type 1 - Bare Metal Hypervisor

  • Runs directly on physical hardware; no host OS needed
  • Better performance (no OS overhead)
  • Better security (smaller attack surface)
  • Used in production environments and cloud
  • AWS uses: Nitro Hypervisor (based on KVM)
  • Others: VMware ESXi, Microsoft Hyper-V, Xen, KVM

Type 2 - Hosted Hypervisor

  • Runs on top of a host operating system
  • Host OS adds overhead and latency
  • Easier to set up for development/testing
  • Lower performance than Type 1
  • Examples: Oracle VirtualBox, VMware Workstation, VMware Fusion, Parallels Desktop
  • Used for: learning, dev testing on laptops
Type 1 vs Type 2 Hypervisor Architecture (diagram):
  Type 1 (bare metal): Physical Hardware → Hypervisor (Nitro / ESXi) → VM 1/2/3 (each with Guest OS + App)
  Type 2 (hosted): Physical Hardware → Host OS (Windows / Linux / macOS) → Hypervisor (VirtualBox / VMware Workstation) → VM 1/2/3 (each with Guest OS + App)

Key Virtualization Terminologies

  • Host Machine: The physical computer running the hypervisor. The actual hardware server.
  • Guest Machine (VM): The virtual machine running on the host. Has its own virtual CPU, RAM, storage, NIC.
  • vCPU: Virtual CPU, a share of the physical CPU's processing power. Each AWS EC2 instance is given a number of vCPUs.
  • Snapshot: A point-in-time copy of a VM's disk state. Used for backups, rollback, and creating templates.
  • AMI (Template/Image): Amazon Machine Image, a pre-configured VM template used to launch new EC2 instances quickly.
  • Live Migration: Moving a running VM to another physical host with zero downtime. AWS does this during host maintenance.
  • Overprovisioning: Allocating more virtual resources than physical resources available. Works because VMs rarely use 100% simultaneously.
  • Overcommitting: Assigning more vCPUs/vRAM than physical CPUs/RAM exist. Relies on statistical multiplexing.
  • Para-Virtualization: Guest OS is modified to be aware of the hypervisor and uses special APIs. Faster than full virtualization.
  • Full Virtualization: Guest OS runs unmodified; hardware is fully simulated. VMs cannot tell they are virtualized.

Containers vs Virtual Machines

๐Ÿ–ฅ๏ธ Virtual Machines

  • Each VM has its own full OS (Guest OS)
  • Heavier: GBs in size
  • Slower startup (minutes)
  • Complete isolation between VMs
  • Better security isolation
  • AWS: EC2 Instances

📦 Containers (Docker)

  • Share the host OS kernel; no Guest OS needed
  • Lighter: MBs in size
  • Very fast startup (seconds)
  • Process-level isolation
  • Portable across environments
  • AWS: ECS, EKS, Fargate
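The startup-time gap is easy to demonstrate on any machine with Docker and the AWS CLI installed; a hedged sketch (the image and AMI ID are examples):

# Container: shares the host kernel, starts in roughly a second
time docker run --rm nginx:alpine nginx -v

# VM: a full EC2 instance boots its own guest OS and takes minutes
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type t3.micro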

Benefits of Virtualization

  • Server Consolidation: Run many VMs on one physical server. Reduces number of servers needed by 10x-20x.
  • Cost Reduction: Fewer physical servers = less hardware cost, less power, less cooling, less data center space.
  • Rapid Provisioning: New servers can be deployed in minutes by cloning a template (AMI in AWS).
  • Isolation: Each VM is isolated. A crash in VM1 doesn't affect VM2. Security breaches are contained.
  • Disaster Recovery: Snapshots and VM replication make backup and recovery much easier and faster.
  • Better Utilization: Physical servers typically run at 10-15% capacity. VMs help utilize 70-80% of physical capacity.
  • Testing & Development: Developers can run multiple OS environments on one laptop for testing.

Virtualization Vendors

  • AWS: Nitro Hypervisor (KVM-based)
  • VMware: ESXi / vSphere
  • Microsoft: Hyper-V
  • Red Hat: KVM + libvirt
  • Citrix: XenServer / Citrix Hypervisor
  • Oracle: VirtualBox (Type 2)
  • Parallels: Parallels Desktop (macOS)
๐Ÿง

Linux Basics


Why Linux in AWS?

Linux is the dominant OS in cloud computing. Over 90% of AWS workloads run on Linux. Amazon Linux 2 and Amazon Linux 2023 are AWS's own Linux distributions optimized for EC2. Linux is free, open-source, stable, and highly customizable, making it perfect for servers.

All-Important Linux Commands

๐Ÿ“ File & Directory

  • ls -la โ€” list all files with permissions, hidden files
  • pwd โ€” print working directory (current location)
  • cd /path โ€” change directory
  • cd .. โ€” go up one directory
  • cd ~ โ€” go to home directory
  • mkdir -p dir/subdir โ€” create directory (with parents)
  • rm -rf dir โ€” remove directory recursively (careful!)
  • cp -r src dst โ€” copy file/dir recursively
  • mv src dst โ€” move or rename
  • touch file.txt โ€” create empty file
  • cat file โ€” display file content
  • less file โ€” paginated view (q to quit)
  • head -20 file โ€” first 20 lines
  • tail -20 file โ€” last 20 lines
  • tail -f /var/log/app.log โ€” follow log in real-time
  • grep "text" file โ€” search pattern in file
  • grep -r "text" /dir โ€” recursive search
  • find / -name "*.conf" โ€” find files by name
  • wc -l file โ€” count lines
  • diff file1 file2 โ€” compare two files
  • ln -s /target /link โ€” create symbolic link

โš™๏ธ System & Process

  • top โ€” live process monitor (q to quit)
  • htop โ€” improved interactive process viewer
  • ps aux โ€” list all processes with user/CPU/mem
  • ps aux | grep nginx โ€” find specific process
  • kill PID โ€” terminate process gracefully (SIGTERM)
  • kill -9 PID โ€” force kill (SIGKILL)
  • pkill nginx โ€” kill by name
  • df -h โ€” disk usage (human readable)
  • du -sh /var โ€” directory disk usage
  • free -h โ€” memory usage
  • uptime โ€” system uptime + load average
  • uname -r โ€” kernel version
  • uname -a โ€” all system info
  • hostname โ€” show/set hostname
  • whoami โ€” current logged-in user
  • id โ€” current user's UID, GID, groups
  • history โ€” command history
  • which cmd โ€” full path of command
  • sudo cmd โ€” run as superuser (root)
  • su - username โ€” switch user
  • env โ€” show environment variables
  • echo $HOME โ€” print environment variable
  • export VAR=value โ€” set environment variable

The Linux Filesystem Hierarchy

Directory | Full Name | Purpose & Contents
/ | Root | Top of the entire filesystem hierarchy. Everything is under /
/bin | Binaries | Essential user command binaries (ls, cp, mv, cat, grep). Available to all users.
/sbin | System Binaries | System administration binaries (iptables, fdisk, mount). Mostly for the root user.
/etc | Et Cetera | System-wide configuration files: /etc/hosts (hostname resolution), /etc/fstab (mounts), /etc/nginx (nginx config)
/home | Home | User home directories: /home/ubuntu, /home/ec2-user. Personal files, settings.
/root | Root Home | Home directory for the root user (NOT the same as /)
/var | Variable | Variable data that changes frequently: /var/log (logs), /var/www (web files), /var/lib (databases)
/tmp | Temporary | Temporary files. Cleared on reboot. World-writable. Use for scratch space.
/usr | Unix System Resources | User programs and data: /usr/bin (most user commands), /usr/lib (libraries), /usr/local (manually installed software)
/opt | Optional | Optional/third-party software. JDK, AWS CLI, custom apps installed here.
/proc | Process | Virtual filesystem: /proc/cpuinfo, /proc/meminfo, /proc/PID/. The kernel exposes system info here.
/dev | Devices | Device files: /dev/sda (disk), /dev/null (discard output), /dev/random (random data)
/mnt | Mount | Temporary mount point for external/additional filesystems (USB drives, EBS volumes)
/boot | Boot | Boot loader files, Linux kernel (vmlinuz), initrd. Do NOT delete!
/lib | Libraries | Essential shared libraries for /bin and /sbin binaries

File Permissions

Linux permissions control who can read, write, and execute files. Every file has three permission sets: Owner (user), Group, and Others.

# Permission output from ls -la:
# -rwxr-xr--  1  ec2-user  ec2-user  1234  Jan 1  file.sh
#
# First character:     - = regular file, d = directory, l = symlink
# Characters 2-4 (rwx) = Owner: read, write, execute
# Characters 5-7 (r-x) = Group: read and execute
# Characters 8-10 (r--) = Others: read only

# Permission values: r=4, w=2, x=1
# rwx = 4+2+1 = 7
# r-x = 4+0+1 = 5
# r-- = 4+0+0 = 4
# rw- = 4+2+0 = 6

chmod 755 script.sh     # owner=rwx(7), group=rx(5), others=rx(5)
chmod 644 config.txt    # owner=rw(6), group=r(4), others=r(4)
chmod 600 key.pem       # owner=rw only, no one else can read
chmod 400 key.pem       # owner=read only (SSH key requirement)
chmod +x script.sh      # add execute permission for all
chmod -x script.sh      # remove execute permission
chmod u+w file          # add write for owner (u=user/owner, g=group, o=others, a=all)

# Change ownership
chown ec2-user file.txt           # change owner
chown ec2-user:developers file    # change owner AND group
chgrp developers file             # change group only
chown -R ec2-user /var/www        # recursive (entire directory)

Process Management

ps aux                    # list all processes (a=all users, u=user-oriented, x=no terminal)
top                       # real-time process monitor (press 'q' to quit)
htop                      # colorful interactive process viewer
kill -l                   # list all signals (SIGTERM=15, SIGKILL=9)
kill 1234                 # send SIGTERM (graceful shutdown) to PID 1234
kill -9 1234              # send SIGKILL (force kill) to PID 1234
killall nginx             # kill all processes named 'nginx'

# Background processes
nohup ./script.sh &       # run in background, ignore hangup signal, output to nohup.out
./script.sh &             # run in background (killed on terminal close)
jobs                      # list background jobs
fg %1                     # bring job #1 to foreground
bg %1                     # resume job #1 in background
disown -h %1              # disown job so it persists after logout

# Systemd (modern Linux service management)
systemctl start nginx     # start service
systemctl stop nginx      # stop service
systemctl restart nginx   # stop then start
systemctl reload nginx    # reload config without restart
systemctl status nginx    # check service status (running/stopped/failed)
systemctl enable nginx    # auto-start on boot
systemctl disable nginx   # disable auto-start
systemctl list-units --type=service   # list all services

User Account Management

# User management
useradd -m username           # create user with home directory
useradd -m -s /bin/bash user  # create with bash shell
passwd username               # set/change password
usermod -aG sudo username     # add to sudo group (Ubuntu)
usermod -aG wheel username    # add to wheel group (RHEL/CentOS)
usermod -s /bin/bash user     # change shell
userdel username              # delete user
userdel -r username           # delete user + home directory

# Group management
groupadd developers           # create group
groupdel developers           # delete group
groups username               # show groups for user
id username                   # show UID, GID, groups

# Important user files
cat /etc/passwd               # user accounts (username:x:UID:GID:comment:home:shell)
cat /etc/shadow               # password hashes (root only)
cat /etc/group                # group definitions

# Sudo configuration
visudo                        # safely edit /etc/sudoers
# Add: username ALL=(ALL) NOPASSWD: ALL  (passwordless sudo)

Software Package Management

# Ubuntu/Debian - APT package manager
sudo apt update               # update package index (always do this first)
sudo apt upgrade              # upgrade all installed packages
sudo apt install nginx -y     # install nginx
sudo apt remove nginx         # remove nginx
sudo apt purge nginx          # remove nginx + config files
sudo apt autoremove           # remove unused dependencies
apt search nginx              # search for package
apt show nginx                # show package info
dpkg -l | grep nginx          # list installed packages matching nginx
dpkg -l                       # list all installed packages

# Amazon Linux 2 / RHEL / CentOS - YUM package manager
sudo yum update               # update all packages
sudo yum install httpd -y     # install Apache
sudo yum remove httpd         # remove Apache
sudo yum list installed       # list installed packages
sudo yum info httpd           # show package info
sudo yum search nginx         # search packages

# Amazon Linux 2023 / RHEL 8+ - DNF (newer yum)
sudo dnf update
sudo dnf install nginx
sudo dnf remove nginx

Backup and Restore Management

# TAR - tape archive (most common backup tool)
tar -czf backup.tar.gz /data/           # compress directory (c=create, z=gzip, f=filename)
tar -cjf backup.tar.bz2 /data/         # compress with bzip2 (better compression)
tar -xzf backup.tar.gz                 # extract (x=extract, z=gzip)
tar -xzf backup.tar.gz -C /restore/    # extract to specific directory
tar -tzf backup.tar.gz                 # list contents without extracting

# RSYNC - efficient file sync (only transfers changes)
rsync -avz /local/dir/ user@host:/remote/dir/   # sync to remote (a=archive, v=verbose, z=compress)
rsync -avz --delete /src/ /dst/                 # sync and delete files not in source
rsync -avz --exclude="*.log" /src/ /dst/        # exclude log files
rsync -avz --dry-run /src/ /dst/                # preview without executing

# SCP - secure copy (simpler than rsync)
scp file.txt user@host:/path/          # copy file to remote
scp user@host:/path/file.txt .         # copy from remote
scp -r /local/dir user@host:/remote/  # copy directory recursively

# DD - disk image backup
dd if=/dev/sda of=/backup/disk.img bs=4M status=progress  # full disk image
dd if=/backup/disk.img of=/dev/sda bs=4M                  # restore disk image

Systemd and Monitoring

# Journald - systemd journal (logs)
journalctl -u nginx                    # logs for nginx service
journalctl -u nginx -f                 # follow nginx logs in real-time
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2024-01-01" --until "2024-01-02"
journalctl -p err                      # only errors
journalctl -b                          # logs since last boot
journalctl --disk-usage                # how much disk journal uses

# Traditional log files
tail -f /var/log/syslog                # Ubuntu: follow system log
tail -f /var/log/messages              # RHEL: system messages
tail -f /var/log/nginx/access.log      # nginx access log
tail -f /var/log/nginx/error.log       # nginx error log
tail -f /var/log/cloud-init.log        # EC2 user-data script execution log

# System monitoring commands
vmstat 1 5                             # virtual memory stats every 1s for 5 iterations
iostat -x 1                            # I/O statistics per device
sar -u 5 3                             # CPU usage report (5s interval, 3 times)
netstat -tuln                          # open ports listening
ss -tuln                               # modern replacement for netstat
lsof -i :80                            # what process is using port 80

Storage Management

# Block device management (critical for EBS volumes on EC2)
lsblk                                  # list block devices (disks/partitions)
lsblk -f                               # with filesystem info
fdisk -l                               # detailed partition table info
blkid                                  # show UUIDs of block devices

# Create filesystem and mount (new EBS volume workflow)
sudo fdisk /dev/xvdb                   # partition the disk (optional for small disks)
sudo mkfs.ext4 /dev/xvdb               # format as ext4
sudo mkfs.xfs /dev/xvdb                # format as xfs (Amazon Linux default)
sudo mkdir -p /mnt/data                # create mount point
sudo mount /dev/xvdb /mnt/data         # mount the disk
df -h                                  # verify mount and available space

# Persistent mount (survives reboot) - add to /etc/fstab
echo "UUID=$(blkid -s UUID -o value /dev/xvdb) /mnt/data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount -a                          # test fstab entries
sudo umount /mnt/data                  # unmount

Networking in Linux

# Network interface info
ip addr show                           # show all interfaces and IPs
ip addr show eth0                      # specific interface
ifconfig                               # older command (same as ip addr)
ip link show                           # link layer info (MAC address)

# Routing
ip route show                          # show routing table
ip route add 10.0.0.0/8 via 10.0.1.1  # add static route
route -n                               # older routing table command

# Connectivity testing
ping -c 4 google.com                   # send 4 ICMP packets
ping 8.8.8.8                           # ping Google DNS
traceroute google.com                  # trace packet path
mtr google.com                         # combined ping + traceroute (real-time)

# DNS lookup
nslookup google.com                    # basic DNS lookup
dig google.com                         # detailed DNS info
dig google.com MX                      # look up MX records
dig @8.8.8.8 google.com               # query specific DNS server
host google.com                        # simple name resolution

# HTTP/HTTPS testing
curl https://api.example.com           # fetch URL content
curl -I https://example.com            # fetch headers only
curl -o file.zip https://example.com/file.zip  # download file
wget https://example.com/file.zip      # download file (alternative to curl)
curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' https://api.example.com

# Port and connection info
netstat -tuln                          # TCP/UDP listening ports
ss -tuln                               # same, modern version
ss -tnp                                # connections with process names
lsof -i TCP:80                         # what's using port 80
telnet host 3306                       # test if port is reachable
nc -zv host 3306                       # netcat port test (preferred over telnet)
🖥️ EC2 - Elastic Compute Cloud

Compute

What is EC2?

Amazon EC2 (Elastic Compute Cloud) provides resizable virtual computing capacity (virtual machines) in the cloud. EC2 gives you complete control: choose the OS, configure networking, manage security, attach storage, and install any software you need. It is the backbone of AWS; almost every architecture involves EC2 or services built on EC2.

Think of it as: Renting a virtual computer in AWS's data center. You control everything above the hardware level (OS and above). AWS manages the physical server, network, power, and hypervisor.

EC2 Instance Types (Families)

Family | Optimized For | Instance Types | When to Use
General Purpose | Balanced CPU/RAM/network | t3, t4g, m5, m6i, m7i | Web servers, dev/test, small DBs, microservices
Compute Optimized | High-performance CPU | c5, c6g, c7g, c6i | Batch processing, media encoding, gaming servers, HPC
Memory Optimized | Large in-memory datasets | r5, r6i, x2idn, z1d, u-6tb1 | Big data, in-memory DBs (Redis/SAP HANA), real-time analytics
Storage Optimized | High sequential I/O read/write | i3, i4i, d2, h1, im4gn | NoSQL DBs, data warehouses, log processing, Hadoop
Accelerated Computing | GPU/FPGA hardware | p4, g5, f1, inf2, trn1 | ML training/inference, scientific computing, video rendering
HPC Optimized | Extreme compute + networking | hpc6a, hpc7g | High Performance Computing clusters, CFD, molecular dynamics

Instance Naming Convention

Understanding how to read an instance type name: m5.2xlarge

m = Instance family (m=general, c=compute, r=memory, etc.)
5 = Generation (higher = newer, better price/performance)
2xlarge = Instance size (nano < micro < small < medium < large < xlarge < 2xlarge < 4xlarge ...)
Suffixes: g=Graviton (ARM), a=AMD, n=higher network, d=NVMe SSD storage, e=extra storage
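You can verify what any type actually provides by querying the EC2 API; for example:

# Inspect the vCPU, memory, and network specs of a given instance type
aws ec2 describe-instance-types --instance-types m5.2xlarge \
  --query "InstanceTypes[].{vCPU:VCpuInfo.DefaultVCpus,MemoryMiB:MemoryInfo.SizeInMiB,Network:NetworkInfo.NetworkPerformance}"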

Instance Launch Process (Step by Step)

  1. Choose an AMI (Amazon Machine Image): The OS + software pre-installed template. Choose Amazon Linux 2, Ubuntu 22.04, Windows Server, RHEL, or a Marketplace AMI.
  2. Select Instance Type: Choose based on workload needs. For learning/dev: t2.micro or t3.micro (free tier eligible). For production: based on CPU/RAM requirements.
  3. Configure Instance Details: Choose VPC, subnet (public/private), IAM role for AWS API access, user data script (runs at first boot), shutdown behavior, termination protection.
  4. Add Storage (EBS): Root volume (OS disk, default 8-30 GB gp3). Add additional data volumes as needed. Set "Delete on Termination" appropriately.
  5. Add Tags: Key-value metadata. Name=MyWebServer, Environment=Production, Owner=TeamA. Essential for cost allocation and resource management.
  6. Configure Security Group: Virtual firewall. Add inbound rules for SSH (22), HTTP (80), HTTPS (443). Restrict SSH to your IP only.
  7. Review and Launch: Choose or create a Key Pair (for SSH access). Download the .pem file; this is your only chance to download it!
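The same seven steps collapse into a single CLI call; a minimal sketch where every ID and name is a placeholder:

# Steps 1-7 in one command: AMI, type, network + IAM role + user data,
# storage, tags, security group, key pair
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --subnet-id subnet-12345678 \
  --iam-instance-profile Name=MyAppRole \
  --user-data file://bootstrap.sh \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=20,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=MyWebServer}]' \
  --security-group-ids sg-12345678 \
  --key-name my-keypair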

EC2 Connection Methods

🔑 SSH (Linux Instances)

  • Standard method for Linux/Unix
  • Requires Key Pair (.pem file)
  • Requires port 22 open in Security Group
  • Set key permissions: chmod 400 key.pem
  • Command: ssh -i key.pem ec2-user@<public-ip>
  • Default users: Amazon Linux=ec2-user, Ubuntu=ubuntu, RHEL=ec2-user, Debian=admin

🪟 RDP (Windows Instances)

  • Remote Desktop Protocol - Windows GUI
  • Port 3389 must be open in Security Group
  • Right-click instance → Get Windows Password
  • Use Key Pair to decrypt the initial password
  • Connect with Windows Remote Desktop (mstsc.exe) or Mac RDP client

๐ŸŒ EC2 Instance Connect

  • Browser-based SSH from AWS Console
  • No key pair needed โ€” AWS pushes temporary key
  • Requires port 22 open to AWS IP ranges
  • Only works with Amazon Linux 2, Amazon Linux 2023, Ubuntu
  • Good for quick access without SSH client

๐Ÿ›ก๏ธ Session Manager (SSM)

  • No SSH, no port 22, no key pair needed!
  • Encrypted session via SSM Agent
  • Works for instances in private subnets (no public IP needed)
  • Requires: SSM Agent + IAM role with AmazonSSMManagedInstanceCore
  • All sessions logged in CloudTrail โ€” full audit
  • Best practice for production instances
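With the agent and IAM role in place, a session is one command away (the instance ID is a placeholder; the Session Manager plugin for the AWS CLI must be installed):

# Open an interactive shell without SSH or port 22
aws ssm start-session --target i-1234567890abcdef0

# See which instances are registered with SSM
aws ssm describe-instance-information \
  --query "InstanceInformationList[].{Id:InstanceId,Ping:PingStatus}"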

AMI - Amazon Machine Image

An AMI is a pre-configured template that provides the information required to launch an EC2 instance. It contains: the OS, application server, any pre-installed applications, configuration, and EBS snapshot(s).

AMI Type | Source | Cost | Use Case
AWS-Provided | Amazon maintains these | Free (just EC2 cost) | Amazon Linux 2/2023, Ubuntu, Windows Server, RHEL
AWS Marketplace | Third-party vendors | License fee + EC2 | LAMP stacks, NGINX Plus, SAP, security appliances
Community AMIs | Other AWS users | Free (community) | Public images shared by the community (use with caution)
Custom AMIs | You create them from an existing EC2 instance | Storage cost (EBS snapshots) | "Golden images" pre-configured with your software for fast Auto Scaling
Creating a Custom AMI: Launch EC2 → Install all software → Configure everything → Actions → Image and templates → Create image. This creates an AMI + EBS Snapshot. Use this AMI in Launch Templates for Auto Scaling Groups. Instances launch pre-configured = faster scale-out!
Important: AMIs are regional. To use an AMI in a different region, you must copy it. AMIs can be shared with specific AWS accounts or made public.
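Both operations are available from the CLI as well; a sketch with placeholder IDs:

# Create a custom (golden) AMI from a configured instance
aws ec2 create-image \
  --instance-id i-1234567890abcdef0 \
  --name "golden-web-v1" \
  --description "Pre-configured web server image"

# AMIs are regional: copy one to another Region before using it there
aws ec2 copy-image \
  --source-region us-east-1 \
  --source-image-id ami-0abcdef1234567890 \
  --region ap-south-1 \
  --name "golden-web-v1"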

Elastic IP (EIP)

By default, when you stop and start an EC2 instance, it gets a new public IP address. Elastic IP is a static, public IPv4 address that stays the same regardless of instance state.

  • You allocate an EIP to your account, then associate it to an EC2 instance or ENI
  • EIP can be quickly remapped to a different instance (useful for failover)
  • Billing: FREE when associated with a running instance. You are CHARGED ($0.005/hr) when EIP is allocated but NOT associated (wasting public IPs)
  • Maximum 5 EIPs per region per account (can request increase)
  • One EIP per instance (by default)
# Allocate and associate Elastic IP using AWS CLI
aws ec2 allocate-address --domain vpc                          # allocate EIP
aws ec2 associate-address \
  --instance-id i-1234567890abcdef0 \
  --allocation-id eipalloc-12345678                           # associate to instance
aws ec2 disassociate-address --association-id eipassoc-xxx    # disassociate
aws ec2 release-address --allocation-id eipalloc-xxx          # release (delete EIP)

Placement Groups

Placement Groups control how EC2 instances are physically placed on AWS hardware to optimize performance or availability:

🔥 Cluster

  • Instances packed close together in one AZ
  • Ultra-low latency, high bandwidth (10 Gbps+)
  • Risk: if the hardware fails, all instances fail
  • Use: HPC, big data, ML training

📊 Spread

  • Each instance on separate physical hardware
  • Max 7 instances per AZ per group
  • Reduces correlated failures
  • Use: Critical apps needing high availability

๐Ÿข Partition

  • Groups of instances on separate racks
  • Up to 7 partitions per AZ, 100s of instances
  • Partition failure doesn't affect others
  • Use: Hadoop, Cassandra, Kafka
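A sketch of creating each strategy and launching into one (group names are examples):

# Create placement groups with each strategy
aws ec2 create-placement-group --group-name hpc-cluster --strategy cluster
aws ec2 create-placement-group --group-name critical-spread --strategy spread
aws ec2 create-placement-group --group-name kafka-partition --strategy partition --partition-count 7

# Launch an instance into the cluster group
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type c5.large \
  --placement GroupName=hpc-cluster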
⚙️ EC2 Instance Management

Compute

Key Pair Management

Key pairs are used for secure authentication to EC2 instances. They use asymmetric cryptography: AWS stores the public key on the EC2 instance, and you keep the private key (.pem file) on your local machine.

  • RSA (2048-bit): Older, widely supported, works with PuTTY and OpenSSH
  • ED25519: Newer algorithm, more secure, faster, smaller key size. Not supported on Windows instances.
  • Once created, you CANNOT re-download the private key; save it securely immediately!
  • Set correct permissions on Linux/Mac: chmod 400 mykey.pem (otherwise SSH rejects the key)
# SSH with key pair
chmod 400 mykey.pem                                    # required permission (400 = owner read-only)
ssh -i mykey.pem ec2-user@54.12.34.56                  # connect to Amazon Linux
ssh -i mykey.pem ubuntu@54.12.34.56                    # connect to Ubuntu
ssh -i mykey.pem -p 2222 ec2-user@54.12.34.56          # custom port

# Lost your key pair? Recovery steps:
# 1. Stop instance
# 2. Detach root EBS volume
# 3. Attach volume to another "helper" EC2 instance as /dev/xvdf
# 4. Mount: sudo mount /dev/xvdf1 /mnt/recovery
# 5. Add your new public key to: /mnt/recovery/home/ec2-user/.ssh/authorized_keys
# 6. Unmount, detach, reattach to original instance, start it

Security Groups - In-Depth

Security Groups are stateful virtual firewalls at the instance level. They control inbound and outbound traffic to/from EC2 instances.

Feature | Security Group | Network ACL (NACL)
Level | Instance (ENI) level | Subnet level
Rule type | Allow rules ONLY | Allow AND Deny rules
Stateful? | YES; return traffic auto-allowed | NO; must define both directions explicitly
Rule evaluation | All rules evaluated; most permissive wins | Rules evaluated in number order; first match wins
Default behavior | All inbound denied; all outbound allowed | Default NACL: all allowed. Custom NACL: all denied.
Scope | Can be assigned to multiple instances | Applies to all instances in the subnet
Stateful explained: If you allow inbound HTTP (port 80) traffic, the Security Group automatically allows the response to go back out, even if there is no outbound rule for port 80. NACLs require explicit rules in BOTH directions.

Security Group Best Practices

  • Never allow SSH (22) from 0.0.0.0/0 (anywhere) in production; restrict to your office IP or VPN CIDR
  • Create separate SGs for each tier: web-sg (80/443 public), app-sg (8080 from web-sg only), db-sg (3306 from app-sg only)
  • Reference other Security Groups as sources instead of hardcoding IPs; it is dynamic and cleaner
  • Never allow 0.0.0.0/0 to RDS or database ports
  • Use outbound rules to restrict what your EC2 instances can reach
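The tiered pattern above can be built with SG-to-SG references; a minimal sketch (the sg-* IDs are placeholders):

# Web tier: HTTPS from the internet
aws ec2 authorize-security-group-ingress --group-id sg-web111 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# App tier: port 8080 only from the web tier's security group
aws ec2 authorize-security-group-ingress --group-id sg-app222 \
  --protocol tcp --port 8080 --source-group sg-web111

# DB tier: MySQL only from the app tier
aws ec2 authorize-security-group-ingress --group-id sg-db333 \
  --protocol tcp --port 3306 --source-group sg-app222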

EBS Volume Types (Deep Dive)

Volume | Type | Max IOPS | Max Throughput | Max Size | Use Case
gp3 | SSD | 16,000 | 1,000 MB/s | 16 TiB | Boot volumes, dev/test, low-latency interactive apps
gp2 | SSD (older) | 16,000 | 250 MB/s | 16 TiB | Legacy workloads (migrate to gp3: cheaper and better)
io1 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | 16 TiB | I/O-intensive databases
io2 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | 16 TiB | Critical databases (99.999% durability)
io2 Block Express | Provisioned IOPS SSD | 256,000 | 4,000 MB/s | 64 TiB | SAP HANA, Oracle RAC, highest performance
st1 | Throughput HDD | 500 | 500 MB/s | 16 TiB | Big data, log processing, streaming workloads
sc1 | Cold HDD | 250 | 250 MB/s | 16 TiB | Infrequent-access archives, lowest-cost HDD
gp3 vs gp2: gp3 is 20% cheaper and offers independent IOPS/throughput configuration (no burst credits). Always prefer gp3 for new workloads. gp3 baseline = 3,000 IOPS / 125 MB/s (vs gp2's 3 IOPS per GB, minimum 100).
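The gp2-to-gp3 migration is an online operation; a sketch with a placeholder volume ID:

# Convert gp2 to gp3 in place (no detach, no downtime)
aws ec2 modify-volume --volume-id vol-12345678 \
  --volume-type gp3 --iops 3000 --throughput 125

# Watch the modification progress
aws ec2 describe-volumes-modifications --volume-ids vol-12345678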

Storage and Snapshots

  • EBS Snapshot: Point-in-time backup of an EBS volume stored in S3 (you don't see the S3 bucket directly)
  • First snapshot: full copy of all data. Subsequent snapshots: incremental (only changed blocks)
  • Snapshots are stored across multiple AZs within a Region, making them highly durable
  • Use snapshots to: backup data, copy volumes to different AZs/Regions, create encrypted volumes from unencrypted
  • Fast Snapshot Restore (FSR): Eliminates the "lazy loading" behavior; volume is immediately at full performance. Costs extra.
  • Recycle Bin: Protect snapshots and AMIs from accidental deletion. Set retention rules.
  • Snapshot Lifecycle Manager (DLM): Automate snapshot creation and deletion on a schedule
# EBS Snapshot operations (AWS CLI)
# Create snapshot
aws ec2 create-snapshot \
  --volume-id vol-12345678 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Copy snapshot to another region
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-12345678 \
  --region ap-south-1 \
  --description "Cross-region copy"

# Create volume from snapshot (in different AZ)
aws ec2 create-volume \
  --snapshot-id snap-12345678 \
  --availability-zone ap-south-1b \
  --volume-type gp3

User Data and Metadata

User Data is a script that runs automatically when an EC2 instance is launched for the first time (first boot only by default). It runs as root user.

#!/bin/bash
# This runs at FIRST BOOT as root
set -e                              # exit on any error
exec > /var/log/user-data.log 2>&1  # redirect output to log file

yum update -y
yum install -y httpd php mysql git
systemctl start httpd
systemctl enable httpd

# Create a simple webpage (heredoc delimiter left unquoted so $(...) expands at boot)
cat > /var/www/html/index.html << EOF
<html><body>
<h1>Hello from EC2!</h1>
<p>Instance ID: $(curl -s http://169.254.169.254/latest/meta-data/instance-id)</p>
</body></html>
EOF

echo "User data completed successfully"

Instance Metadata is information about the running instance accessible from within the instance at the special IP 169.254.169.254. This is a link-local address, reachable only from within the instance itself.

# Instance Metadata Service (IMDS) - v1 (simpler)
curl http://169.254.169.254/latest/meta-data/                    # list all metadata categories
curl http://169.254.169.254/latest/meta-data/instance-id         # get instance ID
curl http://169.254.169.254/latest/meta-data/instance-type       # get instance type
curl http://169.254.169.254/latest/meta-data/public-ipv4         # get public IP
curl http://169.254.169.254/latest/meta-data/local-ipv4          # get private IP
curl http://169.254.169.254/latest/meta-data/hostname            # get hostname
curl http://169.254.169.254/latest/meta-data/placement/region    # get region
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole  # get IAM role temp creds

# Instance Metadata Service v2 (IMDSv2) - more secure (token-based)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

# User data (view the script that ran)
curl http://169.254.169.254/latest/user-data
IMDSv2 is recommended over IMDSv1. It requires a session token, protecting against SSRF attacks where malicious applications might try to read metadata to steal IAM credentials. You can enforce IMDSv2 on new instances in Launch Templates.

Launch Templates (vs Launch Configurations)

A Launch Template stores the full EC2 instance configuration. It is the recommended way to define configurations for Auto Scaling Groups and EC2 Fleet.

✅ Launch Templates (Recommended)

  • Supports multiple versions (v1, v2, v3...)
  • Mix On-Demand + Spot instances
  • Support all EC2 features including T2/T3 Unlimited, Dedicated Hosts, Capacity Reservations
  • Can be used with EC2 Fleet and Spot Fleet
  • Supports inheritance (create child from parent)

โŒ Launch Configurations (Legacy)

  • Immutable โ€” can't be modified after creation
  • Only On-Demand instances
  • Missing newer EC2 features
  • Being phased out โ€” AWS recommends migrating to Launch Templates
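A minimal sketch of creating and versioning a launch template (all values are examples; note that HttpTokens=required enforces IMDSv2):

# Version 1 of a simple web template
aws ec2 create-launch-template \
  --launch-template-name web-template \
  --launch-template-data '{
    "ImageId": "ami-0abcdef1234567890",
    "InstanceType": "t3.micro",
    "SecurityGroupIds": ["sg-12345678"],
    "MetadataOptions": {"HttpTokens": "required"}
  }'

# Add version 2 that changes only the instance type;
# ASGs can pin a specific version or track $Latest
aws ec2 create-launch-template-version \
  --launch-template-name web-template \
  --source-version 1 \
  --launch-template-data '{"InstanceType": "t3.small"}'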
⚖️ Elastic Load Balancing (ELB)

Compute

What is Load Balancing?

A Load Balancer sits in front of your servers and distributes incoming traffic across multiple targets (EC2 instances, containers, Lambda functions, or IP addresses) in multiple Availability Zones. It continuously monitors the health of registered targets and routes traffic only to healthy ones.

Why use a Load Balancer? Single server = single point of failure. Load balancer = high availability (if one instance fails, others continue serving), horizontal scalability (add more instances behind LB), SSL termination (offload HTTPS decryption), and health monitoring.

Types of Load Balancers

Type | OSI Layer | Protocols | Key Features | Best For
ALB (Application) | Layer 7 (Application) | HTTP, HTTPS, WebSocket, HTTP/2 | Path/host/header routing, WAF integration, Lambda targets, sticky sessions | Web apps, microservices, REST APIs, containers
NLB (Network) | Layer 4 (Transport) | TCP, UDP, TLS | Ultra-high performance, static IP per AZ, preserves source IP, TLS termination | Gaming, IoT, real-time trading, VPC Endpoint Services
GLB (Gateway) | Layers 3/4 (Network) | GENEVE (port 6081) | Transparent bump-in-the-wire traffic inspection, scales third-party appliances | Firewalls, IDS/IPS, DPI appliances
CLB (Classic) | Layer 4/7 | HTTP, HTTPS, TCP, SSL | Legacy service being deprecated | Old EC2-Classic apps (migrate to ALB/NLB)

Application Load Balancer (ALB) โ€” Deep Dive

ALB operates at Layer 7 (HTTP/HTTPS) and makes routing decisions based on request content.

ALB Routing Rules

  • Path-Based Routing: Route based on URL path. /api/* → API servers, /images/* → Image servers, / → Main app
  • Host-Based Routing: Route based on the HTTP Host header. app.example.com → App servers, api.example.com → API servers
  • HTTP Header Routing: Route based on custom headers (e.g., X-App-Version: v2 → new deployment)
  • Query String Routing: Route based on query parameters (e.g., ?version=mobile → mobile-optimized servers)
  • Weighted Target Groups: Send X% to target group 1, Y% to target group 2. Perfect for canary/blue-green deployments.
  • IP-Based Routing: Route specific IP addresses to specific target groups
# ALB Rule example (in AWS Console / CLI):
# IF path is /api/* → Forward to API-TG (api target group)
# IF path is /static/* → Forward to S3 bucket origin
# IF path starts with /admin AND source IP is 10.0.0.0/8 → Forward to Admin-TG
# IF host is mobile.example.com → Redirect to https://m.example.com/#{path}
# DEFAULT → Forward to Web-TG

ALB Fixed Response & Redirects

  • Fixed Response: Return a custom HTTP response (200, 404, etc.) without reaching any target
  • Redirect: Return HTTP 301/302 to redirect clients (e.g., HTTP to HTTPS redirect)

Network Load Balancer (NLB) โ€” Deep Dive

  • Handles millions of requests per second with extremely low latency
  • One static IP per AZ, useful when clients need to whitelist specific IPs (can use Elastic IPs)
  • Preserves source IP (client IP not replaced with the NLB IP). The target sees the actual client IP.
  • Supports TLS termination, offloading TLS decryption from targets
  • No Security Groups on the NLB itself; access is controlled by Security Groups on the target instances
  • Supports UDP (ALB does not), which is essential for DNS, VoIP, gaming
  • Health checks support TCP, HTTP, HTTPS

Target Groups โ€” Configuration

A Target Group is a logical grouping of targets that receives requests from a Load Balancer. Each listener rule points to a Target Group.

Target Type | What It Is | Use Case
Instance | EC2 instances by instance ID | Traditional EC2 workloads
IP Address | Specific IP addresses (private IPs in the VPC or on-premises) | Containers with dynamic ports, on-premises servers via Direct Connect/VPN
Lambda Function | A Lambda function (ALB only) | Serverless backends, event-driven apps
ALB | Another ALB (NLB only) | When you need NLB's static IP but ALB's HTTP routing

Health Checks

Health checks run continuously. Unhealthy targets are removed from rotation until they recover. You configure:

  • Protocol: HTTP, HTTPS, TCP, TCP_UDP, TLS
  • Path: URL path to check (e.g., /health or /ping)
  • Port: Which port to check (usually traffic port)
  • Healthy threshold: # consecutive successes before marking healthy (default 3)
  • Unhealthy threshold: # consecutive failures before marking unhealthy (default 2)
  • Interval: Seconds between health checks (default 30s)
  • Timeout: Max time for response (default 5s)
  • Success codes: HTTP codes that count as success (default 200)
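These settings map directly onto target-group creation; a sketch with a placeholder VPC ID:

aws elbv2 create-target-group \
  --name web-tg \
  --protocol HTTP --port 80 \
  --vpc-id vpc-12345678 \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 2 \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --matcher HttpCode=200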

Sticky Sessions (Session Affinity)

Sticky sessions ensure a user's requests go to the SAME target throughout a session. Useful for stateful apps that store session data locally on the instance.

  • Application-based cookies: Your application sets the cookie with custom name and TTL
  • Duration-based cookies: LB generates a cookie with TTL you set (AWSALB cookie by default)
  • Warning: Can create uneven load distribution if some users have long sessions
  • Better architecture: use ElastiCache or DynamoDB for session storage, making the app stateless so sticky sessions are not needed
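Duration-based stickiness is a target-group attribute; a sketch (the ARN is a placeholder):

# Enable LB-generated cookie stickiness with a 1-day TTL
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/abc123 \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=lb_cookie \
               Key=stickiness.lb_cookie.duration_seconds,Value=86400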

Cross-Zone Load Balancing

With cross-zone load balancing, each LB node distributes traffic evenly across ALL registered instances in ALL enabled AZs.

Without Cross-Zone

  • AZ-a LB node: distributes among only AZ-a instances
  • AZ-b LB node: distributes among only AZ-b instances
  • If AZ-a has 2 instances and AZ-b has 8, they get different traffic
  • Uneven distribution possible

With Cross-Zone

  • Each LB node distributes across ALL instances in ALL AZs
  • 10 instances total = each gets exactly 10% of traffic
  • ALB: enabled by default, no charge
  • NLB/GLB: disabled by default, data transfer charges apply
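For an NLB the setting is a load balancer attribute; a sketch (the ARN is a placeholder):

# Enable cross-zone load balancing on an NLB (disabled by default)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/abc123 \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true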
💰 Billing and Monitoring

Compute

AWS Billing Overview

AWS billing is usage-based: you pay only for what you use, when you use it. There are no upfront costs for most services. Understanding billing is critical to avoid unexpected charges.

Key Billing Dimensions

  • Compute Time: EC2 billed per second (minimum 60 seconds) for Linux. Per hour for Windows/RHEL/SUSE.
  • Data Transfer: Inbound (ingress) is FREE. Outbound (egress) to internet is charged. Traffic within same AZ is free. Between AZs in same Region: $0.01/GB.
  • Storage: EBS (per GB provisioned/month), S3 (per GB stored + requests), etc.
  • Requests: API calls to services like S3 (PUT/GET requests), API Gateway, Lambda invocations.

AWS Free Tier

Service | Free Tier Amount | Duration | Type
EC2 | 750 hours/month of t2.micro or t3.micro (Linux and Windows counted separately) | 12 months | New accounts
S3 | 5 GB storage, 20,000 GET requests, 2,000 PUT requests | 12 months | New accounts
RDS | 750 hours/month of db.t2.micro or db.t3.micro, Single-AZ | 12 months | New accounts
Lambda | 1 million requests + 400,000 GB-seconds compute time | Always free | Perpetual
CloudWatch | 10 custom metrics, 10 alarms, 5 GB log data | Always free | Perpetual
SNS | 1 million publishes, 100,000 HTTP deliveries | Always free | Perpetual
DynamoDB | 25 GB storage, 25 WCUs + 25 RCUs (enough for ~200M requests/month) | Always free | Perpetual

AWS Billing Tools

💳 Cost Explorer

  • Visualize cost and usage over past 12 months
  • Forecast next 12 months based on trends
  • Filter/group by service, region, account, tag, instance type
  • Savings Plans and Reserved Instance recommendations
  • Free to use

📋 AWS Budgets

  • Set custom cost and usage budgets
  • Alert via email or SNS when you hit thresholds (e.g., 80%, 100% of budget)
  • Types: Cost budget, Usage budget, RI utilization budget, Savings Plans budget
  • First 2 budgets free; $0.02/day per additional budget

🧾 Cost and Usage Reports (CUR)

  • Most detailed billing data available
  • Exported to S3 bucket daily or monthly
  • Line-item charges for every resource hourly
  • Can analyze with Athena or import to Redshift

๐Ÿท๏ธ Cost Allocation Tags

  • Tag resources (e.g., Project=AppA, Environment=Prod)
  • Activate tags in Billing console
  • Filter Cost Explorer by these tags
  • Essential for multi-project/multi-team accounts
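Cost Explorer data is also available programmatically; a sketch (the dates are examples):

# Monthly unblended cost, grouped by service
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE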

CloudWatch Alarms - Detailed

CloudWatch Alarms watch a single metric and perform one or more actions when that metric breaches a threshold over a specified number of evaluation periods.

Alarm State | Meaning | When It Occurs
OK | Metric is within the defined threshold | Metric is healthy
ALARM | Metric has breached the threshold for the specified periods | Action is triggered
INSUFFICIENT_DATA | Not enough data points to determine state | Service just started, metric gap, new alarm

Alarm Configuration

  • Metric: What to watch (e.g., CPUUtilization for EC2 instance i-xxxx)
  • Statistic: How to aggregate data points (Average, Sum, Minimum, Maximum, p90)
  • Period: Time window for each evaluation (60s, 300s, 3600s)
  • Evaluation Periods: How many consecutive periods must breach threshold to trigger alarm
  • Datapoints to Alarm: Of the evaluation periods, how many must breach (M of N evaluation)
  • Threshold: The value that defines the breach (e.g., > 80%)
  • Actions: What to do when the alarm triggers (SNS notification, EC2 action, Auto Scaling)
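Putting the pieces together, a CPU alarm with M-of-N evaluation (2 of 3 five-minute periods) might look like this; the instance ID and SNS ARN are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-web-01" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts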

Common EC2 CloudWatch Metrics

Metric | Description | Monitoring Period | Notes
CPUUtilization | % of allocated EC2 compute units in use | Basic: 5 min, Detailed: 1 min | Available by default
NetworkIn / NetworkOut | Bytes received/sent on all network interfaces | Basic: 5 min | Available by default
NetworkPacketsIn / NetworkPacketsOut | Packets received/sent | Basic: 5 min | Available by default
DiskReadOps / DiskWriteOps | IOPS completed for instance store | Basic: 5 min | Instance store only (not EBS)
DiskReadBytes / DiskWriteBytes | Bytes read/written to instance store | Basic: 5 min | Instance store only
StatusCheckFailed_Instance | Instance OS/software failure | 1 min | Action: reboot/recover
StatusCheckFailed_System | AWS physical host failure | 1 min | Action: recover (migrates to a new host)
MemoryUtilization* | % of RAM in use | Custom metric | Requires CloudWatch Agent!
DiskSpaceUtilization* | % of disk used | Custom metric | Requires CloudWatch Agent!
Memory & Disk NOT collected by default! Memory utilization, disk space usage, and swap usage require you to install the CloudWatch Agent on your EC2 instance and configure it to send these metrics. This is a very common exam question.

Billing Alarms Setup

# Set up billing alarm (must be in us-east-1 region)
# Step 1: Enable billing alerts in Billing โ†’ Billing Preferences โ†’ Receive Billing Alerts

# Step 2: Create SNS topic for notification
aws sns create-topic --name billing-alerts --region us-east-1

# Step 3: Subscribe your email to topic  
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789:billing-alerts \
  --protocol email \
  --notification-endpoint your@email.com

# Step 4: Create CloudWatch alarm (ONLY works in us-east-1)
aws cloudwatch put-metric-alarm \
  --alarm-name "Monthly-Bill-Exceeds-10USD" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 86400 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:billing-alerts \
  --dimensions Name=Currency,Value=USD \
  --region us-east-1
📈 Auto Scaling

Compute

What is EC2 Auto Scaling?

EC2 Auto Scaling automatically adds or removes EC2 instances based on demand conditions you define. It ensures your application always has the right number of instances available to handle load, provides fault tolerance by replacing unhealthy instances, and optimizes costs by removing unnecessary instances.

⚡ Performance

  • Scale out when demand spikes
  • Users always get fast response times
  • No manual intervention needed
  • Works with ELB for traffic distribution

💰 Cost Optimization

  • Scale in when demand drops
  • Pay only for instances you need
  • Use Spot Instances in the ASG for up to 90% savings
  • No over-provisioning

๐Ÿ›ก๏ธ High Availability

  • Replace unhealthy instances automatically
  • Spans multiple AZs
  • Rebalances instances across AZs
  • Minimum capacity always maintained

Auto Scaling Group (ASG) - Key Concepts

  • Launch Template: Defines what instances to launch (AMI, instance type, security groups, key pair, IAM role, user data)
  • Minimum Capacity: ASG never goes below this number (even during quiet periods). Ensures minimum availability.
  • Desired Capacity: The number of instances ASG tries to maintain right now. ASG launches/terminates to reach this number.
  • Maximum Capacity: ASG never exceeds this number (cost protection). Sets your maximum scale-out limit.
  • VPC and Subnets: Choose subnets in multiple AZs. ASG will balance instances across them.
  • Load Balancer: Attach to ALB/NLB target group. New instances automatically registered; terminated instances deregistered.
  • Health Checks: EC2 status checks (default) or ELB health checks (recommended for web apps)
Example: Min=2, Desired=4, Max=10. ASG maintains 4 instances normally. During high traffic, scales to up to 10. During quiet periods, scales down to minimum 2 (never less).
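A sketch of creating such a group via the CLI (the template name, subnet IDs, and target group ARN are placeholders):

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-template,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 4 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --target-group-arns arn:aws:elasticloadbalancing:ap-south-1:123456789012:targetgroup/web-tg/abc123 \
  --health-check-type ELB --health-check-grace-period 300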

Scaling Types and Policies

Policy Type | How It Works | Trigger | Best For
Manual Scaling | Manually change desired capacity in console or CLI | Human action | Planned events, maintenance windows, fixed capacity
Simple Scaling | One action per alarm breach. Waits for cooldown before next action. | CloudWatch alarm | Simple workloads (legacy; prefer Step/Target)
Step Scaling | Different actions based on HOW FAR metric is from threshold | CloudWatch alarm | When you need proportional response to varying load
Target Tracking | Automatically scale to keep a metric at a target value | Metric target value | Most workloads: simplest and most effective
Scheduled Scaling | Scale based on time (cron expression) | Date/time schedule | Known traffic patterns (business hours, weekly peaks)
Predictive Scaling | ML model predicts future load and pre-scales proactively | ML forecast | Recurring cyclical patterns (daily, weekly)

Target Tracking Policy (Recommended)

The most commonly used scaling policy. You specify a target value for a metric and Auto Scaling creates CloudWatch alarms automatically to scale in/out to maintain the target.

Predefined Metric | Description | Common Target
ASGAverageCPUUtilization | Average CPU across all instances in the ASG | 50-70%
ALBRequestCountPerTarget | Number of requests per instance from ALB | 1000 req/instance
ASGAverageNetworkIn | Average network bytes in per instance | Depends on app
ASGAverageNetworkOut | Average network bytes out per instance | Depends on app
# Target Tracking: Keep average CPU at 50%
# ASG will automatically:
# - Add instances if CPU goes above 50%
# - Remove instances if CPU drops below ~45% (built-in buffer)
# You don't write alarm rules; AWS manages them automatically
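A sketch of attaching such a policy via the CLI (the ASG and policy names are placeholders):

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
    "TargetValue": 50.0
  }'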

Step Scaling

Define multiple scaling steps based on how much the metric breaches the threshold. More granular control than Simple Scaling. Does NOT wait for cooldown between steps.

# Example Step Scaling Configuration:
# Scale OUT (add capacity):
#   CPU 50-60% → add 1 instance
#   CPU 60-75% → add 2 instances
#   CPU 75-90% → add 3 instances
#   CPU > 90%  → add 4 instances

# Scale IN (remove capacity):
#   CPU 40-50% → remove 1 instance
#   CPU 30-40% → remove 2 instances
#   CPU < 30%  → remove 3 instances

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name scale-out-policy \
  --policy-type StepScaling \
  --step-adjustments MetricIntervalLowerBound=0,MetricIntervalUpperBound=10,ScalingAdjustment=1 \
                     MetricIntervalLowerBound=10,MetricIntervalUpperBound=25,ScalingAdjustment=2 \
                     MetricIntervalLowerBound=25,ScalingAdjustment=3 \
  --adjustment-type ChangeInCapacity \
  --metric-aggregation-type Average

Scheduled Scaling

# Scale up Mon-Fri at 8 AM IST (2:30 AM UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-up-mornings \
  --recurrence "30 2 * * 1-5" \
  --min-size 4 --max-size 20 --desired-capacity 8

# Scale down Mon-Fri at 8 PM IST (2:30 PM UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-down-evenings \
  --recurrence "30 14 * * 1-5" \
  --min-size 2 --max-size 10 --desired-capacity 2

# Scale up for expected traffic spike (one-time)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name pre-event-scale \
  --start-time "2024-03-01T02:00:00Z" \
  --desired-capacity 20

Instance Lifecycle and Termination Policy

  • Scale-out lifecycle: Pending → InService → Healthy (registered with LB)
  • Scale-in lifecycle: InService → Terminating:Wait (lifecycle hook) → Terminated
  • Lifecycle Hooks: Pause instance during launch or termination to run custom actions (configure software, drain connections, extract logs); see the sketch below
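A sketch of a termination hook that holds instances for up to 5 minutes so a script can drain connections or ship logs (the ASG and hook names are placeholders):

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name web-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE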
Termination Policy | How ASG Decides Which Instance to Terminate
Default | AZ with the most instances → oldest launch template/configuration → closest to next billing hour → random
OldestInstance | Terminates the oldest instance in the group
NewestInstance | Terminates the newest instance (useful for rolling updates testing)
OldestLaunchTemplate | Terminates instances using the oldest launch template (good for rolling updates)
ClosestToNextInstanceHour | Terminates the instance closest to the next billing hour (cost optimization)
Cooldown Period: Default 300 seconds (5 min). After a scaling activity, ASG waits for cooldown before evaluating more alarms. Prevents thrashing. Use shorter cooldowns for scale-in (saves money faster). Use default or longer for scale-out (let metrics stabilize).
💾

EBS / EFS Storage

Storage Services

EBS — Elastic Block Store

EBS provides persistent block-level storage for EC2 instances. Think of it as a network-attached hard drive. When you terminate an EC2 instance, the EBS root volume is deleted by default (configurable), but additional EBS volumes persist. EBS volumes are automatically replicated within their AZ.

Block Storage vs Object Storage: Block storage (EBS) stores data in fixed-size blocks, like a hard drive. You can format it with a filesystem (ext4, xfs) and use it like a disk. Object storage (S3) stores entire files as objects; you cannot mount it like a drive.

EBS Volume Types (Detailed)

Volume Type | IOPS | Throughput | Size | Multi-Attach | Use Case
gp3 (General SSD) | 3,000-16,000 | 125-1,000 MB/s | 1 GiB-16 TiB | No | Boot volumes, dev/test, small/medium databases, virtual desktops
io2 (Provisioned IOPS SSD) | 100-64,000 | 1,000 MB/s | 4 GiB-16 TiB | Yes (same AZ) | I/O-intensive databases: MySQL, Oracle, SQL Server
io2 Block Express | up to 256,000 | 4,000 MB/s | 4 GiB-64 TiB | Yes | SAP HANA, Oracle RAC, mission-critical workloads
st1 (Throughput HDD) | 500 max | 500 MB/s | 125 GiB-16 TiB | No | Big data, data warehouses, log processing, Hadoop
sc1 (Cold HDD) | 250 max | 250 MB/s | 125 GiB-16 TiB | No | Cold data requiring few scans/day. Cheapest option.
IOPS vs Throughput: IOPS = Input/Output Operations Per Second (number of read/write requests). Throughput = MB/s (amount of data transferred). Small random I/O (databases) = care about IOPS. Large sequential I/O (big data, video) = care about throughput.

Key EBS Facts to Know

  • EBS volume and the EC2 instance it attaches to must be in the same AZ
  • To use an EBS volume in a different AZ: take snapshot → create new volume from snapshot in target AZ
  • To use in a different region: take snapshot → copy snapshot to target region → create volume (see the sketch after this list)
  • Root EBS volume: deleted on instance termination by default (change DeleteOnTermination to false)
  • Additional EBS volumes: NOT deleted on termination by default
  • EBS Multi-Attach (io1/io2 only): Same volume attached to up to 16 instances in same AZ simultaneously. Application must handle concurrent writes (clustering software like Oracle RAC).
  • Snapshots are incremental: only changed data blocks are stored after the initial full snapshot
  • You can take a snapshot of a running instance, but it's better to stop first for consistency
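A sketch of the cross-region move (volume and snapshot IDs are placeholders; copy-snapshot returns the new snapshot ID used in the final step):

# 1. Snapshot the source volume
aws ec2 create-snapshot --volume-id vol-0abc123 --description "migration snapshot"
# 2. Copy the snapshot to the target region
aws ec2 copy-snapshot --source-region ap-south-1 --source-snapshot-id snap-0abc123 \
  --region us-east-1 --description "cross-region copy"
# 3. Create a volume from the copied snapshot in the target AZ
aws ec2 create-volume --snapshot-id snap-0def456 --volume-type gp3 \
  --availability-zone us-east-1a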

EBS Encryption

  • Uses AWS KMS Customer Master Keys (CMK), either AWS-managed or customer-managed
  • Encryption is at rest AND in transit between EC2 and EBS
  • Minimal performance impact (handled by hardware)
  • Snapshots of encrypted volumes are automatically encrypted
  • Volumes created from encrypted snapshots are automatically encrypted
  • You can't remove encryption from an encrypted volume
# How to encrypt an existing UNENCRYPTED EBS volume:
# Direct encryption of an existing volume is NOT possible; use this workaround:
# Step 1: Create a snapshot of the unencrypted volume
aws ec2 create-snapshot --volume-id vol-unencrypted --description "Pre-encryption backup"

# Step 2: Copy the snapshot with encryption enabled
aws ec2 copy-snapshot \
  --source-region ap-south-1 \
  --source-snapshot-id snap-unencrypted \
  --encrypted \
  --kms-key-id arn:aws:kms:ap-south-1:123:key/your-key

# Step 3: Create a new encrypted volume from the encrypted snapshot
aws ec2 create-volume --snapshot-id snap-encrypted --volume-type gp3 \
  --availability-zone ap-south-1a

# Step 4: Detach old volume, attach new encrypted volume to instance
# Step 5: Update /etc/fstab if needed

Mounting and Managing EBS Volumes on EC2

# Step 1: Verify the volume is attached
lsblk                              # shows: xvda (root), xvdb (new unformatted)
lsblk -f                           # check if filesystem exists

# Step 2: Create filesystem (first time only; destroys existing data!)
sudo mkfs.ext4 /dev/xvdb           # format as ext4
# OR
sudo mkfs.xfs /dev/xvdb            # format as xfs (Amazon Linux default)

# Step 3: Create mount point
sudo mkdir -p /data

# Step 4: Mount the volume
sudo mount /dev/xvdb /data
df -h                              # verify it's mounted and available space

# Step 5: Make it permanent by adding to /etc/fstab
# Get UUID first (better than device name; device names can change)
sudo blkid /dev/xvdb               # shows UUID

# Add to /etc/fstab (edit with: sudo nano /etc/fstab):
# UUID=xxxx-xxxx /data ext4 defaults,nofail 0 2
# "nofail" is critical โ€” prevents boot failure if volume not attached

# Test fstab entry
sudo umount /data
sudo mount -a                      # mounts everything in fstab
df -h                              # verify

EFS — Elastic File System

EFS is a fully managed, scalable, shared file system (NFS, Network File System) for Linux workloads. Unlike EBS (one instance at a time), EFS can be mounted concurrently by thousands of EC2 instances across multiple AZs simultaneously. It automatically grows and shrinks as you add and remove files; no capacity management is needed.

EFS vs EBS vs S3 — Comparison

Feature | EFS (Elastic File System) | EBS (Elastic Block Store) | S3 (Simple Storage)
Storage type | File (NFS) | Block | Object
Multi-instance access | YES: thousands of instances | NO (one at a time, except Multi-Attach io2) | YES: accessible from anywhere
Multi-AZ | YES (Standard, Regional) | NO: single AZ only | YES: minimum 3 AZs
OS support | Linux only (POSIX) | Linux and Windows | Any (HTTP API)
Mount as filesystem | YES (NFS mount) | YES (block device) | NO (not a filesystem)
Capacity management | Automatic (elastic) | Fixed (you provision) | Unlimited
Max size | Petabytes (auto-scale) | 16 TiB (64 TiB for io2 Block Express) | Unlimited
Relative cost | ~3x gp2 EBS | Baseline | Cheapest per GB
Use case | Shared storage, CMS, home dirs, containers | Boot volumes, databases, app data | Backups, static assets, data lakes

EFS Storage Classes and Lifecycle

Storage Class | Availability | Cost | Use Case
EFS Standard | Multi-AZ (3+ AZs) | $0.30/GB/month | Frequently accessed files
EFS Standard-IA | Multi-AZ | $0.025/GB/month + retrieval | Infrequent access (save 92% vs Standard)
EFS One Zone | Single AZ | $0.153/GB/month | Dev/test, non-critical data (20% cheaper)
EFS One Zone-IA | Single AZ | $0.0133/GB/month | Dev/test infrequent access (cheapest)

EFS Lifecycle Management: Automatically moves files to Standard-IA after they haven't been accessed for 7, 14, 30, 60, or 90 days. Files moved back to Standard on access. Reduces storage costs significantly for mixed workloads.
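A sketch of configuring this via the CLI (the filesystem ID is a placeholder):

aws efs put-lifecycle-configuration \
  --file-system-id fs-0123456789abcdef0 \
  --lifecycle-policies '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'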

EFS Performance Modes

  • General Purpose (default): Lowest latency. Ideal for web servers, home directories, content management. Limit of 35,000 IOPS.
  • Max I/O: Scale to higher throughput (hundreds of thousands of operations/sec). Slightly higher latency. Ideal for parallel workloads: big data, media processing, genomics.
  • Bursting Throughput: Throughput scales with filesystem size (50 MB/s per TB, burst to 100 MB/s per TB)
  • Provisioned Throughput: Specify throughput independently of storage size. Pay for provisioned throughput above earned burst rate.
  • Elastic Throughput (newest): Automatically scales throughput up and down based on workload. Best for spiky or unpredictable workloads.

Mounting EFS on EC2 Instances

# Install EFS utilities (handles NFS mounting and TLS encryption)
sudo yum install -y amazon-efs-utils        # Amazon Linux
sudo apt-get install amazon-efs-utils       # Ubuntu

# Mount using EFS mount helper (recommended; supports encryption in transit)
sudo mkdir -p /mnt/efs
sudo mount -t efs -o tls fs-0123456789:/ /mnt/efs    # with TLS encryption
sudo mount -t efs fs-0123456789:/ /mnt/efs            # without TLS

# Mount specific directory/subdirectory
sudo mount -t efs -o tls fs-0123456789:/myapp /mnt/app

# Verify mount
df -h /mnt/efs
ls /mnt/efs

# Auto-mount on reboot (/etc/fstab)
fs-0123456789:/ /mnt/efs efs _netdev,tls,iam 0 0
# _netdev = wait for network before mounting
# iam = use IAM for authorization

# Mount using NFS directly (without efs utils)
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-0123456789.efs.ap-south-1.amazonaws.com:/ /mnt/efs
Security: EFS uses Security Groups for network access. The Security Group attached to EFS must allow inbound NFS (port 2049) from the Security Groups of EC2 instances that need to mount it.
๐Ÿฐ

VPC — Virtual Private Cloud

Networking

What is a VPC?

A Virtual Private Cloud (VPC) is your own logically isolated section of the AWS cloud. Think of it as your own private data center inside AWS: you have complete control over your virtual networking environment, including IP address ranges, subnets, route tables, and network gateways. Every AWS account gets a default VPC in each region so you can launch resources immediately.

Analogy: A VPC is like your own apartment building. You control who enters (Internet Gateway), how floors are organized (subnets), which rooms talk to each other (route tables), and security at each door (Security Groups & NACLs). AWS just provides the building infrastructure.

Core VPC Concepts

Concept | Description | Example
VPC | Isolated virtual network in a region. Spans all AZs in that region. | 10.0.0.0/16 (65,536 IPs)
Subnet | A subdivision of a VPC within a single AZ. Resources live in subnets. | 10.0.1.0/24 in ap-south-1a
Route Table | Set of rules (routes) that determine where network traffic is directed. | 0.0.0.0/0 → IGW
Internet Gateway (IGW) | Allows communication between VPC and the internet. Horizontally scaled, HA, no bandwidth limits. | Attach to VPC for internet access
NAT Gateway | Allows private subnet resources to access the internet but prevents inbound connections from the internet. | Private EC2 downloading updates
Security Group | Virtual stateful firewall at instance level. Controls inbound/outbound traffic. | Allow port 80 from 0.0.0.0/0
NACL | Stateless firewall at subnet level. Rules evaluated in order by number. | Deny rule 100: block bad IP
CIDR Block | IP address range assigned to VPC or subnet using CIDR notation. | 192.168.0.0/24 = 256 IPs

IP Addressing Deep Dive

Understanding IP addressing is fundamental to VPC design. AWS uses IPv4 CIDR notation where the number after the slash indicates how many bits are the network portion.

CIDR | Total IPs | Usable IPs (AWS reserves 5) | Use Case
/16 | 65,536 | 65,531 | VPC (large enterprise)
/20 | 4,096 | 4,091 | Large subnet
/24 | 256 | 251 | Standard subnet
/28 | 16 | 11 | Small subnet (minimum for AWS)
AWS reserves 5 IPs per subnet: .0 (network address), .1 (VPC router), .2 (DNS server), .3 (future use), .255 (broadcast). So a /24 gives you 251 usable IPs, not 256.
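The arithmetic, worked for a /24:

# Worked example: 10.0.1.0/24
# 32 - 24 = 8 host bits → 2^8 = 256 total addresses
# AWS reserves 10.0.1.0, .1, .2, .3 and 10.0.1.255 → 256 - 5 = 251 usable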

Private IP Ranges (RFC 1918)

These IP ranges are not routable on the public internet; they're used for private networks like VPCs. Always use these for VPC CIDR blocks.

10.0.0.0    - 10.255.255.255   (10.0.0.0/8)     Class A, 16M addresses
172.16.0.0  - 172.31.255.255   (172.16.0.0/12)  Class B, 1M addresses
192.168.0.0 - 192.168.255.255  (192.168.0.0/16) Class C, 65K addresses

# AWS default VPC always uses: 172.31.0.0/16
# Best practice for custom VPC: use 10.0.0.0/16 (avoids overlap with default)

Public vs Private Subnets

🌐 Public Subnet

  • Route table has route to Internet Gateway (0.0.0.0/0 → IGW)
  • Resources can have public IPs
  • Accessible from internet
  • Used for: Load balancers, Bastion hosts, NAT Gateways, Web servers
  • Example CIDR: 10.0.1.0/24, 10.0.2.0/24

🔒 Private Subnet

  • No route to Internet Gateway
  • Internet access via NAT Gateway only (outbound)
  • NOT directly reachable from internet
  • Used for: Application servers, Databases, Lambda in VPC, Cache
  • Example CIDR: 10.0.11.0/24, 10.0.12.0/24

Internet Gateway (IGW)

The IGW is the door between your VPC and the public internet. It performs Network Address Translation (NAT) for instances with public IPs, translating private IPs to public IPs for outbound traffic and vice versa for inbound.

  • One IGW per VPC; horizontally scaled by AWS, fully HA
  • Free: no cost for the gateway itself (only data transfer charges)
  • Two steps to enable internet: (1) attach IGW to VPC, (2) add route in public subnet route table
  • Only works for resources with a public/Elastic IP address
# Public subnet route table
Destination     Target
10.0.0.0/16    local          ← all VPC traffic stays local
0.0.0.0/0      igw-xxxxxxxx   ← everything else goes to internet

NAT Gateway

NAT (Network Address Translation) Gateway allows EC2 instances in private subnets to initiate outbound connections to the internet (download patches, call APIs) while preventing the internet from initiating connections into your private instances.

  • Managed by AWS: highly available, auto-scales up to 100 Gbps
  • Placed in a public subnet; needs a public subnet + Elastic IP
  • Cost: ~$0.045/hour + $0.045/GB data processed (can be expensive!)
  • For HA: Create one NAT Gateway per AZ; if one AZ fails, other AZs still have internet access
  • Not needed for S3/DynamoDB: use VPC Endpoints instead (free)
# Private subnet route table
Destination     Target
10.0.0.0/16    local              ← VPC traffic stays local
0.0.0.0/0      nat-xxxxxxxx       ← internet via NAT Gateway

Custom VPC Creation — Step by Step

  1. Plan CIDR: Choose VPC CIDR (e.g., 10.0.0.0/16). Plan subnets: public /24s and private /24s across 2+ AZs for HA.
  2. Create VPC: AWS Console → VPC → Create VPC. Enter name, IPv4 CIDR. Enable DNS hostnames and DNS resolution.
  3. Create Subnets: Create public subnets (10.0.1.0/24 in AZ-a, 10.0.2.0/24 in AZ-b) and private subnets (10.0.11.0/24 in AZ-a, 10.0.12.0/24 in AZ-b).
  4. Create & Attach IGW: Create Internet Gateway → Attach to your VPC. One IGW per VPC.
  5. Create NAT Gateway: In a public subnet → allocate Elastic IP → create NAT GW. Create one per AZ for HA.
  6. Configure Route Tables: Public RT: add 0.0.0.0/0 → IGW, associate public subnets. Private RT: add 0.0.0.0/0 → NAT GW, associate private subnets.
  7. Enable Auto-assign Public IP: On public subnets → enable auto-assign public IPv4 so instances get public IPs automatically.
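The same steps sketched with the CLI (every ID shown is a placeholder for the value returned by the previous call):

aws ec2 create-vpc --cidr-block 10.0.0.0/16                                   # → vpc-111
aws ec2 modify-vpc-attribute --vpc-id vpc-111 --enable-dns-hostnames '{"Value":true}'
aws ec2 create-subnet --vpc-id vpc-111 --cidr-block 10.0.1.0/24 \
  --availability-zone ap-south-1a                                             # → subnet-aaa (public)
aws ec2 create-internet-gateway                                               # → igw-222
aws ec2 attach-internet-gateway --internet-gateway-id igw-222 --vpc-id vpc-111
aws ec2 allocate-address --domain vpc                                         # → eipalloc-333
aws ec2 create-nat-gateway --subnet-id subnet-aaa --allocation-id eipalloc-333
aws ec2 create-route-table --vpc-id vpc-111                                   # → rtb-444
aws ec2 create-route --route-table-id rtb-444 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-222
aws ec2 associate-route-table --route-table-id rtb-444 --subnet-id subnet-aaa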

VPC Flow Logs

Flow Logs capture information about IP traffic going to/from network interfaces in your VPC. Essential for security analysis, troubleshooting, and compliance.

  • Can be enabled at VPC, subnet, or ENI (network interface) level
  • Destinations: CloudWatch Logs, S3, Kinesis Data Firehose
  • Captures: source IP, destination IP, port, protocol, bytes, action (ACCEPT/REJECT), status
  • Does NOT capture: DNS queries, DHCP traffic, instance metadata (169.254.x.x), Windows license activation
  • Use Athena on S3 to query flow logs for security investigations
# Flow log record format:
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2       123456789   eni-abc123   10.0.1.5  8.8.8.8  54321   443     6        10      5000  1609459200 1609459260 ACCEPT OK
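A sketch of enabling flow logs with S3 as the destination (the VPC ID and bucket ARN are placeholders):

aws ec2 create-flow-logs \
  --resource-type VPC --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-flow-logs-bucket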

Elastic Network Interface (ENI)

An ENI is a virtual network card you can attach to EC2 instances. Every instance has at least one ENI (eth0, the primary). You can create additional ENIs and attach/detach them from instances.

  • Has: primary private IP, secondary private IPs, Elastic IP, MAC address, security groups
  • Bound to a specific AZ; cannot move across AZs
  • Use cases: Dual-homed instances (in two subnets), management network separation, move IP between instances on failover
  • Used by Lambda (when in VPC), ECS tasks, RDS, ElastiCache internally

AWS Networking Best Practices

Production VPC Design:
  • Always use multiple AZs (minimum 2, ideally 3) for high availability
  • Separate public, private-app, and private-db subnet tiers
  • Use /16 for VPC CIDR: gives room to grow (65K IPs)
  • Never use overlapping CIDRs if you plan to peer VPCs
  • One NAT Gateway per AZ to avoid cross-AZ traffic costs and AZ dependency
  • Use VPC Endpoints for S3 and DynamoDB to avoid NAT Gateway costs
  • Enable VPC Flow Logs from day one; invaluable for debugging and security
  • Use security groups as primary firewall, NACLs only for broad subnet-level rules

Reference Architecture: 3-Tier VPC

Region: ap-south-1
VPC: 10.0.0.0/16
├── AZ: ap-south-1a                    AZ: ap-south-1b
│   ├── Public Subnet 10.0.1.0/24      Public Subnet 10.0.2.0/24
│   │   ├── ALB node                   ALB node
│   │   └── NAT Gateway (EIP)          NAT Gateway (EIP)
│   ├── Private-App 10.0.11.0/24       Private-App 10.0.12.0/24
│   │   └── EC2 App Servers            EC2 App Servers
│   └── Private-DB  10.0.21.0/24       Private-DB  10.0.22.0/24
│       └── RDS Primary                RDS Standby (Multi-AZ)
│
├── Internet Gateway (attached to VPC)
├── Public Route Table  → 0.0.0.0/0 to IGW
├── Private Route Table → 0.0.0.0/0 to NAT GW (per AZ)
└── VPC Endpoints: S3 Gateway, DynamoDB Gateway (free!)
🛡️

VPC Controls

Networking

Security Groups vs NACLs — The Key Difference

AWS gives you two layers of network security. Understanding when to use each is critical for both the exam and real-world architecture.

🔒 Security Groups (SG)

  • Level: Instance/ENI level
  • Stateful: Return traffic automatically allowed
  • Rules: Allow only โ€” no deny rules
  • Evaluation: All rules evaluated together
  • Default: Deny all inbound, allow all outbound
  • Scope: Applies to specific instances
  • Can reference other SGs as source/destination

🧱 NACLs (Network ACL)

  • Level: Subnet level
  • Stateless: Must explicitly allow return traffic
  • Rules: Allow AND Deny rules
  • Evaluation: Rules evaluated in number order (lowest first)
  • Default NACL: Allows all in/out
  • Scope: Applies to all instances in subnet
  • Cannot reference SGs
Stateless means you must open ephemeral ports! When a client connects, the response comes back on an ephemeral port (1024-65535). NACLs must explicitly allow this range on inbound rules for return traffic.

Security Group — Deep Dive

Security Groups act as virtual firewalls controlling traffic to/from EC2 instances. They're the primary and most-used security control in AWS.

# Security Group Rules: key concepts
# Inbound: who can SEND traffic TO your instance
# Outbound: where your instance can SEND traffic TO

# Example: Web server SG
Inbound Rules:
  Type      Port    Source          Purpose
  HTTP      80      0.0.0.0/0       Allow all web traffic
  HTTPS     443     0.0.0.0/0       Allow all HTTPS traffic
  SSH       22      10.0.0.0/8      Allow SSH from internal only

Outbound Rules:
  Type      Port    Destination     Purpose
  All       All     0.0.0.0/0       Allow all outbound (default)

# SG referencing another SG (powerful pattern):
# App server SG inbound: port 8080 source = web-server-SG-id
# This means: only instances IN the web server SG can reach app server
# No need to know IP addresses; scales automatically (see the CLI sketch below)
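A sketch of that SG-referencing rule via the CLI (both group IDs are placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id sg-app123 \
  --protocol tcp --port 8080 \
  --source-group sg-web456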

NACL Rules — Deep Dive

NACLs are the subnet-level firewall. Each subnet can only be associated with one NACL at a time. Rules are processed in ascending order; the first match wins.

Rule # | Type | Protocol | Port | Source | Action
100 | HTTP | TCP | 80 | 0.0.0.0/0 | ALLOW
110 | HTTPS | TCP | 443 | 0.0.0.0/0 | ALLOW
120 | Custom TCP | TCP | 1024-65535 | 0.0.0.0/0 | ALLOW ← ephemeral!
200 | SSH | TCP | 22 | 1.2.3.4/32 | ALLOW
* | All traffic | All | All | 0.0.0.0/0 | DENY ← catch-all
Rule numbering best practice: Use increments of 10 or 100 so you can insert rules later. Rule * (asterisk) is the default deny; it is always last and cannot be modified.

VPC Peering

VPC Peering creates a direct, private network connection between two VPCs, allowing instances to communicate as if they were in the same network, using private IPs with no internet involved.

  • Cross-region peering: Yes; peer VPCs across different AWS regions
  • Cross-account peering: Yes; peer VPCs in different AWS accounts
  • Non-transitive: If A↔B and B↔C, A cannot reach C through B. You need a direct A↔C peering.
  • No overlapping CIDRs: VPCs being peered cannot have overlapping IP ranges
  • Route table update required: Must add routes in BOTH VPCs pointing to the peering connection
  • SG reference cross-account: Can reference SGs from peered VPC (same region only)
# VPC A (10.0.0.0/16) peered with VPC B (172.16.0.0/16)
# VPC A route table must add:
Destination       Target
172.16.0.0/16    pcx-xxxxxxxxx   ← peering connection to VPC B

# VPC B route table must add:
Destination       Target
10.0.0.0/16      pcx-xxxxxxxxx   ← peering connection to VPC A
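A sketch of the full peering flow via the CLI (VPC, route table, and peering connection IDs are placeholders):

aws ec2 create-vpc-peering-connection --vpc-id vpc-aaa --peer-vpc-id vpc-bbb   # → pcx-123
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-123
aws ec2 create-route --route-table-id rtb-vpc-a --destination-cidr-block 172.16.0.0/16 \
  --vpc-peering-connection-id pcx-123
aws ec2 create-route --route-table-id rtb-vpc-b --destination-cidr-block 10.0.0.0/16 \
  --vpc-peering-connection-id pcx-123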

VPC Endpoints

VPC Endpoints allow private connectivity to AWS services without traffic leaving the AWS network: no internet, no NAT Gateway, and no extra cost per GB (for Gateway endpoints).

Gateway Endpoint (FREE)

  • Supports: S3 and DynamoDB ONLY
  • Added as a route in route table
  • No additional cost; saves NAT Gateway data charges
  • Scoped to region, not a specific AZ
  • Works via route table entry pointing to vpce-xxx

Interface Endpoint (PrivateLink)

  • Supports: 100+ AWS services (CloudWatch, SNS, SQS, SSM, Secrets Manager, etc.)
  • Creates an ENI in your subnet with private IP
  • Cost: ~$0.01/hour/AZ + $0.01/GB
  • Uses DNS resolution to redirect service calls
  • Works across VPC peering and Direct Connect
# S3 Gateway Endpoint - add to private route table:
Destination        Target
pl-xxxxxxxx        vpce-xxxxxxxx   ← S3 prefix list → endpoint

# No code change needed! Your existing S3 calls
# boto3.client('s3').upload_file(...)  ← automatically uses endpoint
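Creating the endpoint itself is one call (the VPC and route table IDs are placeholders):

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.ap-south-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-0def456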

AWS Transit Gateway

Transit Gateway is a network hub that connects thousands of VPCs, on-premises networks, and VPN connections through a single gateway. Instead of creating a mesh of VPC peering connections, all VPCs connect to the TGW hub.

  • Hub-and-spoke model: Each VPC/VPN connects once to TGW; TGW handles routing between them
  • Transitive routing: Unlike VPC peering, A can reach C through TGW (A→TGW→C)
  • Cross-region: Peer Transit Gateways across regions for global connectivity
  • Cross-account: Share TGW with AWS RAM (Resource Access Manager)
  • Route tables: TGW has its own route tables to control which attachments can talk to each other (segregation)
  • Cost: $0.05/attachment/hour + $0.02/GB; expensive for many VPCs but cheaper than mesh peering
# Without TGW: 10 VPCs need 45 peering connections (n*(n-1)/2)
# With TGW: 10 VPCs each connect once to TGW = 10 attachments

TGW Attachments:
  ├── VPC-A (prod)
  ├── VPC-B (staging)
  ├── VPC-C (shared-services)
  ├── VPN Connection (on-premises data center)
  └── Direct Connect Gateway

AWS Direct Connect

Direct Connect establishes a dedicated physical network connection from your on-premises data center to AWS, bypassing the public internet entirely for more consistent performance, lower latency, and reduced data transfer costs.

  • Speeds: 1 Gbps, 10 Gbps, 100 Gbps (hosted connections: 50 Mbps to 10 Gbps)
  • Not encrypted by default; combine with VPN for encryption over Direct Connect
  • Not redundant by default; order two connections in different facilities for HA
  • Lead time: Weeks to months to provision physical connection
  • Use Direct Connect Gateway to connect to multiple VPCs across regions from one Direct Connect

VPN Connections

AWS Site-to-Site VPN creates an encrypted IPsec tunnel between your on-premises network and your AWS VPC over the public internet.

Site-to-Site VPN

  • Encrypted tunnel over internet
  • Quick to setup (minutes)
  • Up to 1.25 Gbps per tunnel
  • 2 tunnels for redundancy (different AWS endpoints)
  • Uses Virtual Private Gateway (VGW) on AWS side
  • Cost: ~$0.05/hour + data transfer

Client VPN

  • OpenVPN-based for individual users
  • Users connect laptop โ†’ AWS VPC
  • AD/SAML authentication
  • Split tunneling option
  • Cost: ~$0.10/hour per association + $0.05/hour per connection

AWS PrivateLink

PrivateLink allows you to expose your service privately to other VPCs without peering, without public internet, and without exposing your entire VPC. It's the technology behind Interface VPC Endpoints.

  • Service provider creates a Network Load Balancer in front of their service
  • Creates a VPC Endpoint Service; consumers create Interface Endpoints to connect
  • Traffic never leaves the AWS network; completely private
  • Works across accounts and regions
  • Used by AWS to provide 100+ services privately (SSM, Secrets Manager, etc.)

Route 53 Resolver (DNS in VPC)

Route 53 Resolver is the built-in DNS resolver that handles DNS queries from within your VPC. Understanding it is key for hybrid cloud DNS.

  • VPC DNS resolver: Available at VPC CIDR +2 (e.g., 10.0.0.2 for 10.0.0.0/16)
  • Inbound Resolver Endpoints: Allow on-premises to resolve AWS private DNS names
  • Outbound Resolver Endpoints: Allow VPC to resolve on-premises DNS names
  • Forwarding Rules: Forward specific domain queries (e.g., corp.internal) to on-premises DNS
  • Enable DNS resolution and DNS hostnames in VPC settings for Route 53 private hosted zones to work
🪣

S3 — Simple Storage Service

Object Storage

What is S3?

Amazon S3 is an object storage service: not a filesystem, not a database. You store objects (files) in buckets. S3 provides 11 nines of durability (99.999999999%) by storing data across a minimum of 3 Availability Zones. S3 is accessed via HTTP/HTTPS API calls (PUT, GET, DELETE), not mounted as a filesystem.

S3 Durability: 99.999999999% = "eleven nines". If you store 10 million objects, you'd expect to lose 1 object every 10,000 years. This is achieved by storing each object in multiple facilities.

Core S3 Concepts

Bucket | Container for objects. Name must be globally unique across ALL AWS accounts. Region-specific but name global. Max 100 buckets per account (soft limit).
Object | Files stored in S3. Consists of Key (unique identifier/path), Value (the actual data/content), Metadata (key-value pairs), Version ID, Access Control.
Key | Full path of object within bucket. E.g., images/2024/jan/photo.jpg. There are no actual folders; the key is just a string with / in it.
Object Size | Min 0 bytes, Max 5 TB. For objects larger than 100 MB: use Multipart Upload. Required for objects over 5 GB.
URL Format | https://bucket-name.s3.region.amazonaws.com/key, e.g. https://my-bucket.s3.ap-south-1.amazonaws.com/images/photo.jpg
Metadata | System metadata (Content-Type, ETag, Content-Length) set by AWS. User-defined metadata (x-amz-meta-* headers) set by you.

S3 Storage Classes

Class | Durability | Availability | AZs | Min Duration | Retrieval | Best For
S3 Standard | 11 9s | 99.99% | ≥3 | None | Milliseconds (free) | Frequently accessed data, websites, mobile apps
S3 Intelligent-Tiering | 11 9s | 99.9% | ≥3 | None | Milliseconds to hours | Unknown or changing access patterns
S3 Standard-IA | 11 9s | 99.9% | ≥3 | 30 days | Milliseconds (per GB fee) | Disaster recovery, backups accessed monthly
S3 One Zone-IA | 11 9s | 99.5% | 1 | 30 days | Milliseconds (per GB fee) | Non-critical infrequent data. 20% cheaper than Standard-IA.
Glacier Instant | 11 9s | 99.9% | ≥3 | 90 days | Milliseconds (per GB fee) | Archives accessed once a quarter
Glacier Flexible | 11 9s | 99.9% | ≥3 | 90 days | 1-5 min (expedited), 3-5 hrs (standard), 5-12 hrs (bulk) | Archives accessed 1-2 times/year
Glacier Deep Archive | 11 9s | 99.9% | ≥3 | 180 days | 12 hrs (standard), 48 hrs (bulk) | Compliance archives, 7-10 year retention
Intelligent-Tiering: S3 monitors access patterns and automatically moves objects between access tiers: Frequent Access → Infrequent Access → Archive Instant Access → Archive Access → Deep Archive Access. No retrieval fees. Small monthly monitoring fee per object. Best when access patterns are unknown.

S3 Versioning

Versioning stores multiple versions of the same object in a bucket. Every upload creates a new version ID. This protects against accidental overwrites and deletes.

  • Enable versioning at the bucket level: Properties → Bucket Versioning → Enable
  • Once enabled, versioning can be suspended but NOT disabled. Existing versions remain.
  • When you "delete" a versioned object, S3 adds a Delete Marker (a special version). The object isn't actually deleted; you can restore it by deleting the Delete Marker (see the CLI sketch after this list).
  • To permanently delete: delete the specific version ID
  • Objects uploaded BEFORE versioning was enabled get version ID = null
  • MFA Delete: Require MFA authentication to permanently delete versions or suspend versioning. Requires AWS CLI (not Console).
  • Versioning increases storage costs (multiple copies of same file). Use Lifecycle rules to expire old versions.
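A sketch of these operations via the CLI (bucket/key names and the version IDs are placeholders):

# Enable versioning
aws s3api put-bucket-versioning --bucket my-bucket \
  --versioning-configuration Status=Enabled
# Inspect versions (and delete markers) of an object
aws s3api list-object-versions --bucket my-bucket --prefix report.pdf
# Permanently delete one specific version
aws s3api delete-object --bucket my-bucket --key report.pdf --version-id <version-id>
# "Undelete" an object by deleting its delete marker
aws s3api delete-object --bucket my-bucket --key report.pdf --version-id <delete-marker-id>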

S3 Bucket Policies vs ACLs

📋 Bucket Policies (JSON)

  • Resource-based IAM-style JSON policy attached to bucket
  • Can grant access to: specific IAM users, roles, other AWS accounts, anonymous users (public access)
  • Can require HTTPS-only access
  • Can restrict by IP address or VPC
  • Recommended approach; more powerful than ACLs

📄 ACLs (Legacy)

  • Pre-IAM access control mechanism
  • Less granular than policies
  • Disabled by default on new buckets (Object Ownership: Bucket owner enforced)
  • Apply at bucket or object level
  • AWS recommends: disable ACLs and use bucket policies instead

Bucket Policy Examples

# Make specific objects publicly readable
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

# Force HTTPS only (deny HTTP)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {
      "Bool": { "aws:SecureTransport": "false" }
    }
  }]
}

# Allow specific IAM role to access bucket
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::123456789:role/AppRole" },
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}

S3 Block Public Access

Block Public Access is a safety net that prevents S3 buckets from being accidentally made public. Enabled by default on all new buckets and at the account level.

  • BlockPublicAcls: Rejects PUTs that include public ACLs
  • IgnorePublicAcls: Ignores any public ACLs on bucket/objects
  • BlockPublicPolicy: Rejects bucket policies that grant public access
  • RestrictPublicBuckets: Ignores public bucket policies
  • For static website hosting: you must turn off Block Public Access and add a bucket policy allowing public GetObject
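The four settings map to a single CLI call (the bucket name is a placeholder):

aws s3api put-public-access-block --bucket my-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true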

S3 Operations via AWS CLI

# Create bucket
aws s3 mb s3://my-unique-bucket-name --region ap-south-1

# Upload file
aws s3 cp myfile.txt s3://my-bucket/
aws s3 cp myfile.txt s3://my-bucket/folder/renamed.txt

# Download file
aws s3 cp s3://my-bucket/myfile.txt ./localfile.txt

# List bucket contents
aws s3 ls s3://my-bucket/
aws s3 ls s3://my-bucket/ --recursive       # list all files including subdirs

# Sync (only copies new or modified files)
aws s3 sync ./local-folder/ s3://my-bucket/
aws s3 sync s3://source-bucket/ s3://dest-bucket/

# Delete file
aws s3 rm s3://my-bucket/myfile.txt
aws s3 rm s3://my-bucket/ --recursive       # delete all objects (careful!)

# Make object public
aws s3api put-object-acl --bucket my-bucket --key file.txt --acl public-read
🪣

S3 Advanced

Object Storage

S3 Data Partitioning and Performance

S3 automatically partitions data based on key prefixes for performance. AWS can handle 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix.

# Single prefix = limited to 5,500 GET/s
s3://bucket/2024/all-files...         # all under same prefix = limited

# Multiple prefixes = multiply performance
s3://bucket/2024/q1/file            # prefix 1: 5,500 GET/s
s3://bucket/2024/q2/file            # prefix 2: 5,500 GET/s
s3://bucket/2024/q3/file            # prefix 3: 5,500 GET/s
s3://bucket/2024/q4/file            # prefix 4: 5,500 GET/s
# Total: 22,000 GET/s with 4 prefixes!

# Tip: Randomize prefixes to avoid hotspots (old advice for SSE-KMS uploads)
# Modern S3 handles random keys well natively

Multipart Upload

  • Recommended for objects larger than 100 MB
  • Required for objects larger than 5 GB
  • Splits file into up to 10,000 parts, uploads in parallel
  • If one part fails, only that part is retried (not whole file)
  • All parts must be uploaded before S3 assembles the final object
  • Incomplete multipart uploads should be cleaned up with lifecycle rules (cost saving)
# Multipart upload via CLI (handled automatically)
aws s3 cp largefile.iso s3://my-bucket/ --expected-size 4294967296

# Or specify multipart threshold and chunk size
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB

# Clean up incomplete multipart uploads
aws s3api list-multipart-uploads --bucket my-bucket
aws s3api abort-multipart-upload \
  --bucket my-bucket \
  --key object-key \
  --upload-id upload-id

S3 Transfer Acceleration

S3 Transfer Acceleration speeds up long-distance uploads to S3 by routing through AWS CloudFront Edge Locations. Instead of uploading directly to S3, data goes to the nearest Edge Location, then travels over AWS backbone to S3.

  • Best for: global users uploading to a centralized S3 bucket
  • Example: Users in Australia, India, Europe all upload to us-east-1 S3 bucket
  • Use special endpoint: bucket.s3-accelerate.amazonaws.com
  • Additional cost: $0.04-0.08/GB transferred through acceleration
  • Test if it helps for your scenario: Speed Comparison Tool

Cross-Region Replication (CRR) & Same-Region Replication (SRR)

Replication automatically copies objects between S3 buckets, either within the same region or across regions.

Feature | CRR (Cross-Region) | SRR (Same-Region)
Purpose | Compliance, lower latency, cross-account backups | Log aggregation, data sharing, test/prod sync
Data transfer cost | Yes: inter-region charges | No extra charges
Latency | Near real-time (asynchronous) | Near real-time (asynchronous)
Versioning | Required on both source and destination | Required on both
  • Replication does NOT copy existing objects automatically; use S3 Batch Replication for existing objects
  • New objects uploaded after enabling replication are replicated
  • Delete markers: NOT replicated by default (optional setting). Version deletions are NEVER replicated.
  • Replication supports cross-account (set ACL/bucket policy on destination)
  • Replication Time Control (RTC): 99.99% of objects replicated within 15 minutes (SLA-backed)
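A minimal replication configuration sketch (the role and bucket ARNs are placeholders; versioning must already be enabled on both buckets):

aws s3api put-bucket-replication --bucket source-bucket --replication-configuration '{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-all",
    "Prefix": "",
    "Status": "Enabled",
    "Destination": { "Bucket": "arn:aws:s3:::dest-bucket" }
  }]
}'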

S3 Lifecycle Management

Lifecycle policies automate transitioning objects between storage classes and expiring old objects/versions. Reduces storage costs significantly.

# Typical lifecycle policy example:
# Day 0:   Upload to S3 Standard
# Day 30:  Transition to S3 Standard-IA
# Day 90:  Transition to S3 Glacier Flexible Retrieval
# Day 365: Transition to S3 Glacier Deep Archive
# Day 2555 (7 years): Delete permanently

# Also useful for:
# - Expire incomplete multipart uploads after 7 days
# - Delete old versions after 30 days (with versioning enabled)
# - Delete expired object delete markers
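The policy above, expressed as an actual lifecycle configuration (the bucket name is a placeholder):

aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration '{
  "Rules": [{
    "ID": "archive-then-delete",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 },
    "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
  }]
}'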

S3 Encryption

Encryption Type | Key Management | Header Required | Notes
SSE-S3 (default since Jan 2023) | AWS manages keys entirely. AES-256. | x-amz-server-side-encryption: AES256 | No configuration needed. Automatic on all new objects.
SSE-KMS | AWS KMS. You choose CMK. | x-amz-server-side-encryption: aws:kms | Audit trail in CloudTrail. KMS API quota limits. Use S3 Bucket Keys to reduce API calls.
SSE-C | You provide the key with EVERY request. | Key in request header | MUST use HTTPS. AWS doesn't store the key. You lose key = you lose data.
Client-Side Encryption | You encrypt before uploading. Complete control. | N/A (encrypted before upload) | AWS never sees plaintext. Use AWS Encryption SDK or your own solution.

Static Website Hosting with S3

  1. Create S3 bucket with the same name as your domain (e.g., www.example.com)
  2. Enable Static website hosting in bucket Properties → set Index document = index.html, Error document = error.html
  3. Disable Block Public Access on the bucket (all 4 settings)
  4. Add bucket policy to allow public GetObject on all objects
  5. Upload your HTML/CSS/JS/image files to the bucket
  6. (Optional) Point a Route 53 Alias record or CNAME to the S3 website endpoint
  7. (Optional) Put CloudFront in front for HTTPS, custom domain, and global CDN
S3 websites support only HTTP by default! For HTTPS on a custom domain, you must use CloudFront + ACM certificate in front of S3.

S3 Events and Notifications

S3 can send event notifications when specific events occur on objects (create, delete, restore, replication).

Destination | Use Case | Latency
SNS Topic | Fan-out to multiple systems, email alerts | Seconds
SQS Queue | Decouple processing, retry failed events | Seconds
Lambda Function | Process objects on upload (resize, validate, extract) | Seconds
EventBridge | Advanced filtering, 20+ targets, archive/replay events | Seconds
# Example: Trigger Lambda when image is uploaded to /images/ prefix
# S3 Event: ObjectCreated (PUT, POST, COPY)
# Filter: Prefix = images/, Suffix = .jpg
# Destination: Lambda function ARN

# Common use case: Image processing pipeline
# 1. User uploads image to S3 (s3://my-bucket/images/photo.jpg)
# 2. S3 sends event notification to Lambda
# 3. Lambda reads original image from S3
# 4. Lambda resizes to multiple dimensions
# 5. Lambda writes thumbnails back to S3 (s3://my-bucket/thumbnails/)
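A minimal sketch of the receiving Lambda in step 2; the event shape is the standard S3 notification format, and the processing body is left as a placeholder:

import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # One notification can carry multiple records
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 events
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Placeholder: download, resize, and write thumbnails back here
        print(f"New object: s3://{bucket}/{key}")
    return {'statusCode': 200}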
🔓

S3 Access Control

Object Storage

Cross-Account Access for S3

When another AWS account needs to access your S3 bucket, you have three main approaches:

Method 1: Bucket Policy (Most Common)

Add a bucket policy to Account A's bucket that grants permissions to Account B's users/roles.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "CrossAccountAccess",
    "Effect": "Allow",
    "Principal": {
      "AWS": [
        "arn:aws:iam::ACCOUNT-B-ID:root",           // all of Account B
        "arn:aws:iam::ACCOUNT-B-ID:role/SpecificRole" // or just a specific role
      ]
    },
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::account-a-bucket",
      "arn:aws:s3:::account-a-bucket/*"
    ]
  }]
}
// Account B users ALSO need IAM permission to make the S3 calls

Method 2: IAM Role Assumption (STS)

# Account A creates IAM Role with S3 access + trust policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::ACCOUNT-B-ID:root" },
    "Action": "sts:AssumeRole"
  }]
}

# Account B user assumes the role:
aws sts assume-role \
  --role-arn "arn:aws:iam::ACCOUNT-A-ID:role/S3AccessRole" \
  --role-session-name "cross-account-session"
# Returns: AccessKeyId, SecretAccessKey, SessionToken (valid 1 hour)

# Use temporary credentials to access Account A's S3:
AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=yyy AWS_SESSION_TOKEN=zzz \
  aws s3 ls s3://account-a-bucket/

Pre-Signed URLs

Generate a time-limited URL that grants temporary access to a specific S3 object. The URL includes authentication information embedded in it. Anyone with the URL can access the object for the duration.

  • URL inherits permissions of the IAM identity that generated it
  • If the generating user/role loses permissions, the URL also stops working
  • Default expiry: 1 hour (3600 seconds) with the CLI; maximum 7 days, and only when the URL is signed with IAM user long-term credentials (Console-generated URLs max out at 12 hours)
  • For roles (EC2 instance profile, Lambda): max expiry = role's session duration
  • Works for GET (share private objects) and PUT (allow uploads to specific path)
# Generate pre-signed URL for downloading (valid 1 hour)
aws s3 presign s3://my-bucket/private-report.pdf --expires-in 3600

# Output: https://my-bucket.s3.amazonaws.com/private-report.pdf?X-Amz-Algorithm=...&X-Amz-Expires=3600&...

# Generate pre-signed URL for uploading (PUT)
aws s3 presign s3://my-bucket/upload-here.jpg --expires-in 7200

# Python example
import boto3
s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'private-file.pdf'},
    ExpiresIn=3600
)

S3 Access Points

Access Points are named network endpoints attached to a bucket, each with their own permissions policy. Instead of a single complex bucket policy managing hundreds of users, create one Access Point per use case.

# Example: Data lake bucket accessed by multiple teams
# Instead of one complex bucket policy:
# Create separate access points:
# - data-scientists-ap: allow read/write to /analytics/ prefix only
# - finance-ap: allow read to /finance/ prefix only  
# - dev-team-ap: allow read/write to /dev/ prefix only, VPC-only access

aws s3control create-access-point \
  --account-id 123456789012 \
  --name data-scientists-ap \
  --bucket my-data-lake \
  --vpc-configuration VpcId=vpc-12345678   # VPC-only access

# Access point ARN: arn:aws:s3:region:account:accesspoint/data-scientists-ap
# Use access point ARN anywhere you'd use a bucket name in S3 API calls

S3 Object Lambda

S3 Object Lambda adds your code to process data retrieved from S3 before returning it to the requesting application. Data is modified on-the-fly without storing multiple versions.

Without Object Lambda

  • Store multiple copies of same data in different formats
  • Original + anonymized + compressed + watermarked = 4x storage
  • High storage costs
  • Synchronization complexity

With Object Lambda

  • Store one copy of data
  • Lambda processes on retrieval
  • Different users get different views of same data
  • No extra storage needed
  • Use cases: Redact PII (remove SSN/credit card numbers), convert XML to JSON, resize images, add watermarks, decompress/compress
  • Works with any application that uses S3 GET API calls
  • Creates a new S3 Object Lambda Access Point; use its ARN instead of the bucket name
# Lambda function for S3 Object Lambda (redact PII)
import boto3, re, requests   # requests is used below; bundle it with the deployment package

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    # Get object from S3
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]
    
    # Retrieve original object
    response = requests.get(s3_url)
    original_content = response.text
    
    # Redact SSNs (pattern: XXX-XX-XXXX)
    redacted = re.sub(r'\d{3}-\d{2}-\d{4}', 'XXX-XX-XXXX', original_content)
    
    # Return modified content
    s3_client.write_get_object_response(
        Body=redacted,
        RequestRoute=request_route,
        RequestToken=request_token
    )
    return {'status_code': 200}
🔐

IAM — Identity and Access Management

Identity & Access

Root Account vs IAM User

👑 Root Account (Account Owner)

  • Created when you sign up for AWS
  • Email address + password login
  • Complete unrestricted access to everything
  • Cannot be restricted by IAM policies
  • Only for: change account settings, close account, change email, view certain billing, first IAM admin user creation
  • NEVER use for day-to-day operations!
  • Enable MFA immediately after account creation

👤 IAM User

  • Created within your AWS account by root or admin
  • Username + password (Console) or Access Keys (CLI/API)
  • No permissions by default โ€” must be explicitly granted
  • Can have both Console access AND programmatic access
  • Long-term credentials (up to two access keys per user, to allow rotation)
  • Suitable for individual humans or service accounts

Multi-Factor Authentication (MFA)

MFA requires users to provide two forms of authentication: something they know (password) and something they have (MFA device). Even if password is stolen, attacker can't log in without the MFA device.

MFA Type | Description | Examples
Virtual MFA Device | TOTP (Time-based One-Time Password) app on smartphone | Google Authenticator, Authy, Microsoft Authenticator, Duo
Hardware TOTP Token | Physical device that generates 6-digit codes | Gemalto token, RSA SecurID
FIDO Security Key (U2F) | Physical USB/NFC key; press button to authenticate | YubiKey, Titan Security Key
Passkey / Biometric | Built-in biometric (fingerprint, face) stored in device | Touch ID on Mac, Windows Hello, smartphone biometrics

IAM Password Policy

  • Minimum password length (1-128 characters)
  • Require specific character types: uppercase, lowercase, numbers, symbols
  • Password expiration (force password change every N days)
  • Prevent password reuse (remember last N passwords)
  • Allow users to change their own passwords

IAM Users, Groups, Roles — Concepts

Entity | What It Is | Credentials | Best For
User | Person or application with long-term identity in your account | Password + Access Keys | Human employees, CI/CD pipelines (when no other option)
Group | Collection of IAM users; policies applied to the group apply to all members | N/A (inherits from policies) | Organizing users by job function (Developers, Admins, Read-Only)
Role | IAM identity with permission policies, but NO permanent credentials. Assumed by trusted entities. | Temporary credentials (STS) | EC2/Lambda accessing AWS services, cross-account access, identity federation
Policy | JSON document defining permissions (Allow/Deny actions on resources) | N/A | Attached to users, groups, roles, or resources
IAM Groups: Users can belong to multiple groups. Groups CANNOT be nested (no group within a group). Groups are for users only; you cannot assign roles to groups.

IAM Roles — Deep Dive

Roles are the AWS-recommended way to grant AWS service permissions. Instead of creating IAM users with access keys for EC2 instances (insecure), you create a role with an Instance Profile.

โŒ Bad Practice (Access Keys on EC2)

  • aws configure on EC2 โ†’ hardcoded access keys
  • Keys stored in ~/.aws/credentials
  • If instance is compromised, keys are stolen
  • Keys must be manually rotated
  • Can't audit which instance used which credentials

✅ Good Practice (IAM Role)

  • Attach IAM role to EC2 instance profile
  • Temporary credentials auto-rotated every hour
  • Credentials available at metadata endpoint
  • CloudTrail logs show instance ID + role
  • No credentials stored on disk
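Under the hood, the SDK fetches those temporary credentials from the instance metadata service. A sketch using IMDSv2 (the role name is a placeholder):

# Get a session token first (IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# Fetch the temporary credentials for the attached role
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
# Returns AccessKeyId, SecretAccessKey, Token, and Expiration (auto-rotated)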

IAM Policy โ€” Structure and Examples

{
  "Version": "2012-10-17",       // Always use this version
  "Statement": [
    {
      "Sid": "AllowS3ReadWrite",  // Optional: human-readable ID for this statement
      "Effect": "Allow",           // Allow or Deny
      "Action": [                  // What API calls are allowed/denied
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [                // What resources the action applies to
        "arn:aws:s3:::my-bucket",           // bucket (for ListBucket)
        "arn:aws:s3:::my-bucket/*"          // all objects in bucket
      ],
      "Condition": {               // Optional: when this policy applies
        "StringEquals": {
          "aws:RequestedRegion": "ap-south-1"   // Only in Mumbai region
        },
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"  // Only when using MFA
        }
      }
    },
    {
      "Sid": "DenyDeleteProduction",
      "Effect": "Deny",            // Explicit Deny always wins over Allow
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::production-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalTag/Environment": "Admin"  // Unless user has Admin tag
        }
      }
    }
  ]
}

IAM Policy Types

AWS Managed Policies

  • Created and maintained by AWS
  • Updated automatically when new services launch
  • Cannot be modified by customers
  • Common examples: AdministratorAccess, ReadOnlyAccess, PowerUserAccess, AmazonS3FullAccess, AmazonEC2ReadOnlyAccess

Customer Managed Policies

  • You create and manage
  • Can be reused across multiple users/roles/groups
  • Versioned: up to 5 versions, rollback supported
  • Recommended for most custom use cases

Inline Policies

  • Embedded directly in a user, group, or role
  • Not reusable: 1:1 relationship
  • Deleted when entity is deleted
  • Use only when policy must not be accidentally attached to another identity

AWS CLI Configuration and IAM Access

# Configure CLI with access keys (for human users)
aws configure
# Prompts: Access Key ID, Secret Access Key, Region, Output format

# Or set specific profile
aws configure --profile myproject

# Use specific profile
aws s3 ls --profile myproject

# List configured profiles
aws configure list-profiles

# View current identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn (who am I?)

# On EC2 with an IAM role: NO configuration needed!
aws s3 ls s3://my-bucket/  # uses instance profile automatically

# Access key rotation (best practice: every 90 days)
aws iam create-access-key --user-name myuser
aws iam delete-access-key --user-name myuser --access-key-id AKIAIOSFODNN7EXAMPLE

IAM Best Practices

  • Least Privilege: Grant only the permissions required to do the job, nothing more
  • Root Account: Never use root for daily operations. Enable MFA. Delete or rotate root access keys.
  • MFA: Enable MFA for root account and all privileged users immediately
  • Roles over Users: Use IAM roles for EC2, Lambda, ECS; no hardcoded credentials
  • Key Rotation: Rotate access keys every 90 days or less
  • Groups for Permissions: Assign permissions to groups, add users to groups
  • Never Share Credentials: Create individual IAM users; never share usernames/passwords
  • Use IAM Access Analyzer: Identify over-permissive policies and external access
  • Monitor with CloudTrail: Log and alert on sensitive API calls

Auditing User Activity

  • IAM Credential Report: CSV showing ALL users in account: when they last used console/access keys, whether MFA is enabled, when passwords were changed. Download from IAM Console → Credential Report.
  • IAM Access Advisor: Shows service-level permissions granted to a user AND the last time those services were accessed. Use this to identify and remove unused permissions.
  • CloudTrail: Every API call in your account is logged (who, what, when, from where, success/fail). Essential for incident investigation.
  • IAM Access Analyzer: Scans resource policies and reports any that grant access to external principals. Helps find unintended public or cross-account access.
🔑

Secrets & Keys

Security

Why Never Hardcode Credentials?

Hardcoding passwords, API keys, or tokens directly in code is one of the most dangerous security mistakes. If code is pushed to GitHub (even accidentally), credentials are exposed publicly. Scanners, bots, and attackers actively scrape GitHub for AWS keys; a compromised key can result in thousands of dollars of AWS charges within minutes.

Real incident: Developers have accidentally committed AWS keys to public GitHub repos and received $50,000+ bills within hours from crypto miners spinning up GPU instances globally. AWS may cover some charges but not always.

AWS Secrets Manager

Secrets Manager is a dedicated service for storing, rotating, and retrieving secrets. Applications call the API at runtime instead of having credentials in code or config files.

Feature | Details
Automatic Rotation | Rotates RDS, Aurora, Redshift, DocumentDB credentials on schedule via Lambda. Zero downtime: updates DB password and stores the new value atomically.
Encryption | All secrets encrypted with KMS (AWS-managed or your own CMK)
Versioning | Keeps previous versions (AWSPREVIOUS) during rotation for zero-downtime cutover
Audit Trail | Every GetSecretValue call logged in CloudTrail: full audit of who accessed what and when
Cross-account | Share secrets across AWS accounts using resource-based policies
Cost | $0.40/secret/month + $0.05 per 10,000 API calls
import boto3, json
import pymysql  # assumes the secret stores host/username/password/dbname keys

def get_secret(secret_name):
    client = boto3.client('secretsmanager', region_name='ap-south-1')
    resp = client.get_secret_value(SecretId=secret_name)
    return json.loads(resp['SecretString'])

# Usage - credentials fetched at runtime, never stored in code
creds = get_secret('prod/myapp/rds')
conn = pymysql.connect(host=creds['host'], user=creds['username'],
                       password=creds['password'], database=creds['dbname'])

Rotation Deep Dive

Automatic rotation works by triggering a Lambda function on schedule. AWS provides pre-built rotation Lambdas for RDS, Aurora, Redshift, and DocumentDB. For other services, you write a custom Lambda.

  1. createSecret: Lambda generates a new random password
  2. setSecret: Lambda updates the password in the database
  3. testSecret: Lambda tests new credentials can authenticate
  4. finishSecret: Lambda marks new version as AWSCURRENT, old as AWSPREVIOUS
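
A minimal sketch of a custom rotation handler, assuming a secret that stores a single opaque password. The step names and event fields (Step, SecretId, ClientRequestToken) are the real rotation contract; the setSecret/testSecret bodies are engine-specific placeholders.

import boto3

sm = boto3.client('secretsmanager')

def lambda_handler(event, context):
    arn = event['SecretId']
    token = event['ClientRequestToken']
    step = event['Step']  # createSecret | setSecret | testSecret | finishSecret

    if step == 'createSecret':
        # Stage a new random password under the AWSPENDING label
        new_pw = sm.get_random_password(PasswordLength=32)['RandomPassword']
        sm.put_secret_value(SecretId=arn, ClientRequestToken=token,
                            SecretString=new_pw, VersionStages=['AWSPENDING'])
    elif step == 'setSecret':
        pass  # update the password in the database (engine-specific)
    elif step == 'testSecret':
        pass  # open a test connection using the AWSPENDING credentials
    elif step == 'finishSecret':
        # Promote AWSPENDING to AWSCURRENT (old version becomes AWSPREVIOUS)
        meta = sm.describe_secret(SecretId=arn)
        current = [v for v, stages in meta['VersionIdsToStages'].items()
                   if 'AWSCURRENT' in stages][0]
        sm.update_secret_version_stage(SecretId=arn, VersionStage='AWSCURRENT',
                                       MoveToVersionId=token,
                                       RemoveFromVersionId=current)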

Secrets Manager vs SSM Parameter Store

Secrets Manager

  • Purpose-built for secrets
  • Automatic rotation built-in
  • $0.40/secret/month
  • Cross-account sharing
  • Secret versioning
  • Best for: DB passwords, API keys needing auto-rotation

SSM Parameter Store

  • Config values + secrets
  • No built-in auto rotation
  • Standard tier: FREE (up to 10,000 params)
  • Advanced tier: $0.05/param/month
  • SecureString = KMS encrypted
  • Best for: config flags, feature toggles, non-rotating values (see the sketch below)
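
A quick sketch of reading Parameter Store values with Boto3; the parameter names under /myapp/prod/ are hypothetical.

import boto3

ssm = boto3.client('ssm', region_name='ap-south-1')

# SecureString values are KMS-decrypted when WithDecryption=True
resp = ssm.get_parameter(Name='/myapp/prod/db_url', WithDecryption=True)
db_url = resp['Parameter']['Value']

# Fetch a whole namespace at once (handy at application startup)
params = ssm.get_parameters_by_path(Path='/myapp/prod/', Recursive=True,
                                    WithDecryption=True)
config = {p['Name']: p['Value'] for p in params['Parameters']}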

AWS KMS (Key Management Service)

KMS is the central key management service for all AWS encryption. It creates and controls the cryptographic keys used to encrypt data. Crucially, plaintext keys NEVER leave KMS - all encrypt/decrypt operations happen inside the service via API calls.

Key Type | Who Manages | Cost | Use Case
AWS Owned Keys | AWS (hidden) | Free | Default for S3, SQS, DynamoDB
AWS Managed Keys | AWS (visible) | Free | aws/s3, aws/ebs, aws/rds
Customer Managed CMK | You | $1/month + $0.03/10K API calls | Custom rotation, cross-account, audit
Imported Keys | You (bring your own key) | $1/month | Regulatory compliance (BYOK)
Envelope Encryption: KMS generates a Data Key (DEK) → you encrypt your data locally with the DEK (fast) → KMS encrypts the DEK with your CMK → you store the encrypted data + encrypted DEK together. The data itself never goes to KMS - only the small DEK does. This is how S3, EBS, and RDS encryption work under the hood (a hand-rolled sketch follows below).
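
A minimal sketch of envelope encryption done by hand with Boto3 and the cryptography package; the key alias alias/my-app-key is a placeholder. AWS services do the equivalent internally.

import base64
import boto3
from cryptography.fernet import Fernet  # pip install cryptography

kms = boto3.client('kms')

# 1. Ask KMS for a data key: a plaintext copy + a copy encrypted under the CMK
dk = kms.generate_data_key(KeyId='alias/my-app-key', KeySpec='AES_256')

# 2. Encrypt the data locally with the plaintext data key (fast, any size)
f = Fernet(base64.urlsafe_b64encode(dk['Plaintext']))
ciphertext = f.encrypt(b'very large payload ...')

# 3. Store ciphertext + encrypted data key together; discard the plaintext key
encrypted_dek = dk['CiphertextBlob']

# To decrypt later: KMS decrypts the small DEK, then you decrypt locally
plain_key = kms.decrypt(CiphertextBlob=encrypted_dek)['Plaintext']
data = Fernet(base64.urlsafe_b64encode(plain_key)).decrypt(ciphertext)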

KMS Key Policies

Unlike IAM policies, KMS keys REQUIRE a key policy - without one, no one (not even root) can use the key. Key policies are resource-based policies attached directly to the CMK.

{
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow Lambda to use key",
      "Effect": "Allow", 
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/lambda-role"},
      "Action": ["kms:Decrypt","kms:GenerateDataKey"],
      "Resource": "*"
    }
  ]
}
📊

CloudWatch

Monitoring

What is CloudWatch?

CloudWatch is AWS's unified observability platform - the single place to monitor all your AWS resources and applications. It collects metrics (numbers), logs (text), and traces (request paths), and lets you set alarms, create dashboards, and trigger automated actions. Think of it as the "nervous system" of your AWS infrastructure.

📈 Metrics

  • Numerical time-series data points
  • Default resolution: 1 minute (detailed) or 5 minutes (standard)
  • Retention: 15 months (1-min data kept 15 days, then rolled up)
  • Free tier: basic EC2, RDS, S3 metrics
  • Custom metrics: $0.30/metric/month (see the publishing sketch after these cards)

📋 Logs

  • Organized in Log Groups → Log Streams → Log Events
  • Configurable retention (1 day to 10 years, or never expire)
  • Ingest cost: $0.50/GB, Storage: $0.03/GB/month
  • Query with CloudWatch Logs Insights
  • Subscribe to Lambda/Kinesis for real-time processing

🔔 Alarms

  • Watch a single metric, trigger actions
  • States: OK, ALARM, INSUFFICIENT_DATA
  • Actions: SNS notification, EC2 action, Auto Scaling, Systems Manager
  • Composite alarms: combine multiple alarms with AND/OR
  • Cost: $0.10/alarm/month (standard)
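
The custom-metrics bullet above in practice: a minimal Boto3 sketch publishing one data point. The MyApp namespace and OrdersProcessed metric name are hypothetical.

import boto3

cw = boto3.client('cloudwatch')

# Publish a custom application metric (namespace and names are examples)
cw.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'OrdersProcessed',
        'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
        'Value': 42,
        'Unit': 'Count'
    }]
)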

Default EC2 Metrics (No Agent Needed)

Metric | Description | Unit | Alarm Threshold
CPUUtilization | % of CPU used by the instance | Percent | Alert if >80% for 5 mins
NetworkIn / NetworkOut | Bytes received/sent | Bytes | Alert on traffic spikes
DiskReadOps / DiskWriteOps | I/O operations (instance store only) | Count | Detect disk bottleneck
StatusCheckFailed_Instance | OS-level issues (kernel panic, etc.) | Count (0 or 1) | Alert on any failure
StatusCheckFailed_System | AWS hardware issues | Count (0 or 1) | Alert on any failure
Memory and Disk Space are NOT available by default! These require installing the CloudWatch Agent on your EC2 instance. This is a very common exam question.

CloudWatch Agent

The CloudWatch Agent is software you install on EC2 (or on-premises servers) to collect metrics and logs that aren't available by default - especially memory utilization, disk space, and custom application logs.

# Install CloudWatch Agent on Amazon Linux 2
sudo yum install -y amazon-cloudwatch-agent

# Run configuration wizard (interactive)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Or use a config file (stored in SSM Parameter Store for central management)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c ssm:/AmazonCloudWatch-Config

# Start and enable
sudo systemctl start amazon-cloudwatch-agent
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl status amazon-cloudwatch-agent

CloudWatch Agent Config (JSON)

{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/", "/data"],
        "metrics_collection_interval": 300
      },
      "cpu": {
        "totalcpu": true,
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{hostname}"
          }
        ]
      }
    }
  }
}

CloudWatch Alarms โ€” Full Configuration

Setting | What it means | Example
Metric | What to watch | CPUUtilization, namespace=AWS/EC2
Statistic | How to aggregate data points | Average, Maximum, Sum, p99
Period | Length of each evaluation window | 300 seconds (5 min)
Evaluation Periods | Total windows to look at | 3 (look at the last 15 min)
Datapoints to Alarm | How many windows must breach (M of N) | 2 of 3
Threshold | The trigger value | > 80%
Missing data | How to treat gaps | notBreaching / breaching / ignore
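
A sketch wiring the settings above together with Boto3: CPU > 80% in 2 of 3 five-minute windows. The instance ID and SNS topic ARN are placeholders.

import boto3

cw = boto3.client('cloudwatch')

cw.put_metric_alarm(
    AlarmName='high-cpu-web-1',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
    Statistic='Average',
    Period=300,                    # 5-minute windows
    EvaluationPeriods=3,           # look at the last 3 windows
    DatapointsToAlarm=2,           # alarm when 2 of 3 breach
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:ap-south-1:123456789012:ops-alerts']
)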

CloudWatch Logs Insights Queries

# Find all ERROR log lines in last 1 hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Count errors by type
fields @message
| filter @message like /Exception/
| parse @message "* Exception: *" as prefix, errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

# Lambda: find slow invocations (>3 seconds)
filter @type = "REPORT"
| fields @requestId, @duration, @billedDuration, @memorySize, @maxMemoryUsed
| filter @duration > 3000
| sort @duration desc

CloudWatch Dashboards

  • Create custom dashboards mixing metrics from different services and regions
  • Widget types: Line graph, Number (single value), Alarm status, Log table, Text (Markdown), Bar chart, Pie chart
  • Cross-account and cross-region on a single dashboard - great for centralized monitoring
  • Share dashboards: publicly (with link), privately (IAM), or with specific accounts
  • Auto-refresh: 10s, 1min, 2min, 5min, 15min
  • Cost: First 3 dashboards free, then $3/dashboard/month
🔔

CloudWatch Advanced

Monitoring

Amazon EventBridge (formerly CloudWatch Events)

EventBridge is a serverless event bus that connects applications using events. AWS services emit events when things happen (EC2 state change, S3 object uploaded, CodePipeline failed). EventBridge routes these events to target services for automated responses - enabling event-driven architectures without polling.

Event Sources

  • AWS services (EC2, S3, RDS, CodePipeline, Health, GuardDuty...)
  • Your custom applications (PutEvents API)
  • SaaS partners (Zendesk, Datadog, Shopify, PagerDuty)
  • Scheduled rules (cron or rate expressions)

Event Targets (20+)

  • Lambda functions
  • Step Functions state machines
  • SQS queues, SNS topics
  • Kinesis Data Streams/Firehose
  • ECS tasks, CodePipeline, CodeBuild
  • API Gateway, CloudWatch Log Groups
# Scheduled rule examples:
rate(5 minutes)         # every 5 minutes
rate(1 hour)            # every hour  
cron(0 18 ? * MON-FRI *) # 6 PM UTC weekdays
cron(30 3 * * ? *)       # 3:30 AM UTC daily (9 AM IST)

# Event pattern โ€” trigger when EC2 instance stops:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["stopped", "terminated"]
  }
}

# Event pattern โ€” S3 object uploaded:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["my-uploads-bucket"]},
    "object": {"key": [{"prefix": "images/"}]}
  }
}
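
A sketch of registering the EC2 state-change pattern above as a rule with Boto3; the rule name and target Lambda ARN are placeholders.

import boto3, json

events = boto3.client('events')

# Create (or update) a rule carrying the event pattern shown above
events.put_rule(
    Name='ec2-stopped',
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Instance State-change Notification'],
        'detail': {'state': ['stopped', 'terminated']}
    })
)

# Point the rule at a Lambda target (the function must also grant
# events.amazonaws.com invoke permission via lambda add-permission)
events.put_targets(
    Rule='ec2-stopped',
    Targets=[{'Id': 'notify',
              'Arn': 'arn:aws:lambda:ap-south-1:123456789012:function:on-ec2-stop'}]
)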

EventBridge Pipes

EventBridge Pipes connects event sources (SQS, DynamoDB Streams, Kinesis) to targets with optional filtering, enrichment (Lambda/Step Functions), and transformation - all without writing integration code.

Source (SQS) → Filter → Enrichment (Lambda) → Target (Step Functions)

Composite Alarms

Composite alarms combine multiple CloudWatch alarms into a single alarm using AND/OR/NOT logic. They reduce alert noise by only notifying when multiple conditions are true simultaneously.

# Only alarm when BOTH CPU is high AND memory is high
# Prevents noisy false positives from individual metric spikes
{
  "AlarmRule": "ALARM(cpu-alarm) AND ALARM(memory-alarm)"
}

# Alert when ANY of these critical conditions occur
{
  "AlarmRule": "ALARM(disk-full) OR ALARM(health-check-failed) OR ALARM(db-connections-max)"
}
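
The same composite alarm created via the API - a minimal sketch assuming the child alarms cpu-alarm and memory-alarm already exist, with a placeholder SNS topic.

import boto3

cw = boto3.client('cloudwatch')

# Children fire individually; only the composite notifies anyone
cw.put_composite_alarm(
    AlarmName='instance-degraded',
    AlarmRule='ALARM(cpu-alarm) AND ALARM(memory-alarm)',
    AlarmActions=['arn:aws:sns:ap-south-1:123456789012:ops-alerts']
)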

CloudWatch Metric Math

Metric Math lets you create new time series by performing mathematical operations on existing metrics - compute error rates, percentages, sums across instances, and so on.

# Error rate calculation
METRICS:
  m1: Errors (Count)
  m2: Requests (Count)
EXPRESSION:
  e1: (m1/m2)*100    → ErrorRate (%)

# Sum CPU across all EC2 instances in an ASG
SEARCH('{AWS/EC2,InstanceId} CPUUtilization', 'Average', 300)
# Then: SUM(METRICS())  → total CPU across all instances
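
The same error-rate expression evaluated through the GetMetricData API - a sketch with hypothetical MyApp/Errors and MyApp/Requests metrics.

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client('cloudwatch')

resp = cw.get_metric_data(
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    MetricDataQueries=[
        {'Id': 'm1', 'MetricStat': {
            'Metric': {'Namespace': 'MyApp', 'MetricName': 'Errors'},
            'Period': 300, 'Stat': 'Sum'}, 'ReturnData': False},
        {'Id': 'm2', 'MetricStat': {
            'Metric': {'Namespace': 'MyApp', 'MetricName': 'Requests'},
            'Period': 300, 'Stat': 'Sum'}, 'ReturnData': False},
        # The math expression itself - only this series is returned
        {'Id': 'e1', 'Expression': '(m1/m2)*100', 'Label': 'ErrorRate'}
    ]
)
print(resp['MetricDataResults'][0]['Values'])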

CloudWatch Contributor Insights

Analyzes log data to identify the "top contributors" to performance problems - e.g., which IP addresses generate the most 404s, which Lambda functions cause the most errors, and which URLs have the highest latency.

  • Uses rules to parse logs and extract fields for analysis
  • Works with CloudWatch Logs, VPC Flow Logs, DynamoDB, API Gateway access logs
  • Updates in near real-time - great for identifying bad actors or hot partitions

CloudWatch Anomaly Detection

Uses machine learning to automatically create a "band" of expected values for any metric based on historical patterns. Alarms trigger when the metric goes outside the expected band - no manual threshold needed.

  • Learns seasonality: daily patterns (peak at 9 AM), weekly patterns (lower on weekends)
  • Set sensitivity: how wide the expected band is (1 = tight, 2 = normal, 3 = loose)
  • Great for metrics with natural variation where a fixed threshold would cause too many false alarms

AWS X-Ray (Distributed Tracing)

X-Ray traces requests as they travel through your distributed application - from API Gateway → Lambda → DynamoDB → external API. It shows you exactly where latency comes from and which service is causing errors.

  • Creates a Service Map - a visual graph of all services and their connections with latency/error rates
  • Traces: End-to-end record of a single request across all services
  • Segments: Data from a single service about work it did for a request
  • Subsegments: Detailed breakdown (specific DB query, HTTP call, etc.)
  • Annotations: Key-value pairs you add to traces for filtering (user_id, order_id)
  • SDK available for Node.js, Python, Java, Go, .NET, Ruby
# Python Lambda with X-Ray tracing
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all
patch_all()  # Auto-instrument boto3, requests, pymysql

table = boto3.resource('dynamodb').Table('orders')  # example table name

@xray_recorder.capture('process_order')
def process_order(order_id):
    xray_recorder.put_annotation('order_id', order_id)
    # your code - the get_item call below is traced automatically as a subsegment
    result = table.get_item(Key={'order_id': order_id})
    return result

CloudWatch Synthetics

Synthetics lets you create "canaries" - scripts that run on a schedule to test your endpoints and APIs from outside your application, simulating user behavior 24/7.

  • Runs as often as every 1 minute - detects outages before users do
  • Checks: HTTP endpoints, APIs, web pages (headless browser), broken links
  • Results appear as CloudWatch metrics and can trigger alarms
  • Pre-built blueprints: Heartbeat monitor, API canary, Broken link checker, GUI workflow
🛡️

AWS Security Tools

Security

AWS Security Landscape

AWS provides a layered security approach - multiple services working together to cover different aspects of security: edge protection, identity, vulnerability management, threat detection, compliance, and incident response. Understanding which tool does what is essential.

Service | Category | What it does
Shield | DDoS Protection | Protects against volumetric network attacks
WAF | App Firewall | Blocks malicious HTTP requests (SQLi, XSS)
ACM | SSL/TLS | Free certificates for AWS services
GuardDuty | Threat Detection | ML-based anomaly detection across your account
Inspector | Vulnerability Scan | CVE scanning for EC2, Lambda, containers
Macie | Data Security | Finds PII/sensitive data in S3
Security Hub | CSPM | Centralized security findings dashboard
CloudTrail | Audit | Records all API calls in the account
Config | Compliance | Tracks config changes, evaluates rules
Trusted Advisor | Best Practices | Recommendations across 5 pillars

AWS Certificate Manager (ACM)

ACM provisions, manages, and auto-renews SSL/TLS certificates. Public certificates are completely FREE when used with AWS services - no more paying certificate authorities or worrying about expiration dates.

  • Auto-renews 60 days before expiration - never get caught with an expired cert
  • Domain validation: add a CNAME record to DNS (recommended - fully automated with Route 53) or verify via email
  • Certificates can ONLY be used with: ALB, CloudFront, API Gateway, Elastic Beanstalk, AppSync
  • The private key cannot be exported - you can't use ACM certs on EC2 directly (terminate TLS at an ALB instead)
  • Private CA ($400/month): issue private certs for internal services

AWS Shield

Shield Standard (FREE)

  • Automatic protection for all AWS customers
  • Protects against most common DDoS attacks: SYN floods, UDP reflection, DNS amplification
  • Layer 3/4 protection only
  • No configuration needed - always on
  • Protects: EC2, ELB, CloudFront, Route 53, Global Accelerator

Shield Advanced ($3,000/month)

  • Enhanced DDoS protection with attack visibility
  • 24/7 AWS DDoS Response Team (DRT) access
  • Cost protection - AWS credits charges caused by scaling during a DDoS attack
  • Near real-time attack notifications
  • WAF included at no extra cost
  • Historical attack reports

AWS WAF โ€” Web Application Firewall

WAF inspects HTTP/HTTPS requests at Layer 7 and blocks malicious traffic before it reaches your application. Deploy on CloudFront, ALB, API Gateway, or AppSync.

Rule Type | Description | Example
IP Set Rules | Allow/block specific IPs or CIDRs | Block known bad IP ranges
Geographic Rules | Allow/block by country | Only allow India and US
Rate-Based Rules | Limit requests per IP per 5 minutes | Max 2,000 req/5 min per IP
SQL Injection Match | Detect SQL injection patterns in requests | Block ' OR 1=1-- in query string
XSS Match | Detect cross-site scripting patterns | Block script tags in body
Regex Pattern | Custom regex matching on request parts | Block specific User-Agent strings
AWS Managed Rules | Pre-built rulesets maintained by AWS | Core Rule Set, Known Bad Inputs, PHP, WordPress

Amazon GuardDuty

GuardDuty is a threat detection service that continuously monitors your AWS account for malicious activity using machine learning, anomaly detection, and threat intelligence feeds. It requires no agents - it analyzes VPC Flow Logs, CloudTrail, DNS logs, and S3 data events automatically.

  • Detects: compromised instances (crypto mining, C&C communication), credential theft, unusual API calls from Tor exit nodes, S3 data exfiltration, privilege escalation attempts
  • Findings are rated: Low, Medium, High severity with detailed remediation guidance
  • Enable per region - 30-day free trial, then ~$4/million events
  • Integrate with Security Hub and EventBridge for automated remediation
# Auto-remediate a GuardDuty finding via EventBridge + Lambda:
# Trigger: finding type "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration"
# Action: deny all sessions issued before now, then notify the security team

import json
from datetime import datetime, timezone
import boto3

def lambda_handler(event, context):
    iam = boto3.client('iam')
    role_name = event['detail']['resource']['accessKeyDetails']['userName']
    # Inline deny policy: any session token issued before this moment loses access
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny", "Action": "*", "Resource": "*",
            "Condition": {"DateLessThan": {
                "aws:TokenIssueTime": datetime.now(timezone.utc).isoformat()
            }}
        }]
    }
    iam.put_role_policy(RoleName=role_name, PolicyName='RevokeAllSessions',
                        PolicyDocument=json.dumps(policy))

AWS Inspector

Inspector continuously scans EC2 instances, Lambda functions, and container images in ECR for software vulnerabilities (CVEs) and unintended network exposure. Unlike manual scans, Inspector rescans automatically when new CVEs are published.

  • Uses SSM Agent on EC2 for OS package scanning (no separate agent needed)
  • Generates risk score combining CVSS base score + AWS environment context
  • Network reachability findings: identifies EC2 instances reachable from internet on unexpected ports
  • Integrates with Security Hub - all findings in one place

Amazon Macie

Macie uses ML to automatically discover, classify, and protect sensitive data stored in S3. It identifies Personally Identifiable Information (PII) like names, credit card numbers, SSNs, passport numbers, and health records.

  • Scans S3 buckets for: PII, credentials (private keys, passwords), financial data, health data
  • Findings go to Security Hub and EventBridge for automated response
  • Useful for compliance: GDPR, HIPAA, PCI-DSS data discovery
  • Cost: $1/GB for first 50 GB/month scanned

AWS CloudTrail

CloudTrail records every API call made in your AWS account - the who, what, when, and from where of every action. It's your primary tool for security investigations, compliance auditing, and troubleshooting permission issues.

Event Type | What's captured | Default? | Cost
Management Events | Control plane: create/delete/modify resources (RunInstances, CreateBucket, PutRolePolicy) | Yes (90 days in console) | Free for first trail
Data Events | Data plane: S3 GetObject/PutObject, Lambda invocations, DynamoDB ops | No | $0.10/100K events
Insight Events | Unusual API activity (sudden spike in TerminateInstances calls) | No | $0.35/100K events
# CloudTrail log entry example - who deleted an S3 bucket:
{
  "eventTime": "2024-01-15T14:23:11Z",
  "eventName": "DeleteBucket",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "dev-john",
    "arn": "arn:aws:iam::123456789:user/dev-john"
  },
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-backup-bucket"},
  "responseElements": null,
  "errorCode": null   โ† null means SUCCESS (bucket deleted!)
}
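
A sketch of asking the same question ("who called DeleteBucket?") with Boto3's lookup_events, which searches the last 90 days of management events.

import boto3

ct = boto3.client('cloudtrail')

# Page through all DeleteBucket calls recorded in the account
pages = ct.get_paginator('lookup_events').paginate(
    LookupAttributes=[{'AttributeKey': 'EventName',
                       'AttributeValue': 'DeleteBucket'}]
)
for page in pages:
    for e in page['Events']:
        print(e['EventTime'], e.get('Username'), e['EventName'])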

AWS Config

Config continuously records configuration changes of your AWS resources and evaluates them against compliance rules. If something is misconfigured (public S3 bucket, unencrypted EBS volume), Config flags it and can auto-remediate.

  • Configuration history: "What did this EC2 instance look like 30 days ago? Who changed the security group?"
  • Config Rules: AWS-managed (170+ pre-built) or custom (Lambda-based)
  • Auto Remediation: Trigger SSM Automation documents to fix violations automatically
  • Conformance Packs: Bundles of rules for frameworks like PCI-DSS, HIPAA, CIS Benchmarks
  • Cost: $0.003/configuration item recorded + $0.001/rule evaluation

AWS Security Hub

Security Hub provides a centralized dashboard aggregating security findings from GuardDuty, Inspector, Macie, IAM Access Analyzer, Firewall Manager, and third-party tools into one place with a security score.

  • Runs continuous automated checks against CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices
  • Security score (0-100): shows your overall security posture
  • Multi-account: aggregate findings from all accounts in an AWS Organization
  • Send findings to EventBridge for automated workflows

SNS โ€” Simple Notification Service

SNS is a fully managed pub/sub messaging service. Publishers send messages to topics, and all subscribers receive a copy. It's the glue that connects AWS monitoring alerts to humans and automated systems.

Subscriber Type | Use Case
Email / Email-JSON | Alert engineers when an alarm fires
SMS | Critical alerts to phones
HTTP/HTTPS | Webhook to external systems (PagerDuty, Slack)
Lambda | Automated remediation on alert
SQS | Fan-out: one message → multiple queues processed independently
Kinesis Firehose | Stream alerts to S3/Splunk/Elasticsearch
Mobile Push | iOS/Android push notifications
SNS Fan-out Pattern: One SNS topic → multiple SQS queues. Publish once, and multiple systems each independently process the message. Used for order processing: inventory, payment, and email all triggered from one order event (a publishing sketch follows below).
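
A minimal publish sketch for the fan-out pattern above; the topic ARN and payload are hypothetical.

import boto3, json

sns = boto3.client('sns')

# One publish - every subscriber (email, SQS queues, Lambda) gets a copy
sns.publish(
    TopicArn='arn:aws:sns:ap-south-1:123456789012:order-events',
    Subject='Order placed',
    Message=json.dumps({'order_id': 'O-1001', 'amount': 499})
)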

AWS Trusted Advisor

Trusted Advisor analyzes your AWS environment against AWS best practices across 5 categories and gives you recommendations. It's like having an AWS solutions architect review your account automatically.

Category | Example Checks | Support Plan
💰 Cost Optimization | Idle EC2 instances, underutilized RDS, unattached EIPs, old snapshots | All plans
⚡ Performance | CloudFront enabled, EC2 instance types, EBS throughput | Business+
🔒 Security | Open security group ports, MFA on root, S3 bucket permissions, exposed access keys | 7 basic checks for all
🛡️ Fault Tolerance | Multi-AZ RDS, ELB health checks, EBS snapshots, Route 53 failover | Business+
📊 Service Limits | Approaching EC2, EIP, VPC limits | All plans

AWS Global Accelerator

Global Accelerator improves the performance of internet applications by routing traffic over the AWS global backbone network instead of the unpredictable public internet - reducing latency by 60%+ for global users.

  • Provides 2 static Anycast IPv4 addresses - users worldwide connect to the nearest AWS edge location
  • Traffic travels AWS's private network to your endpoint (EC2, ALB, NLB, Elastic IP)
  • Built-in health checks and failover - traffic is automatically rerouted if an endpoint fails
  • Works for TCP and UDP - good for gaming, VoIP, IoT, and any non-HTTP protocol
  • vs CloudFront: Global Accelerator = performance for dynamic/non-cacheable content + TCP/UDP. CloudFront = caching HTTP content at the edge.
⚡

Lambda

Serverless

What is Lambda?

Lambda is a serverless, event-driven compute service. You write code, upload it, and Lambda runs it in response to events. You never manage servers โ€” AWS handles provisioning, scaling, patching, and availability automatically. You pay only when code runs (per 1ms of execution).

Serverless ≠ No Servers. It means YOU don't manage servers. AWS provides and manages the infrastructure invisibly.

Lambda Configuration

Setting | Range | Notes
Memory | 128 MB - 10,240 MB | CPU power scales proportionally with memory
Timeout | 1 second - 15 minutes | Function killed after timeout; set appropriately
/tmp Storage | 512 MB - 10,240 MB | Temporary disk; shared across warm invocations
Concurrency | Up to 1,000 (default, per region) | Request an increase; set reserved concurrency to limit
Package size | 50 MB (zip), 250 MB (unzipped) | Use Layers for large dependencies
Env variables | 4 KB total | Use Secrets Manager for sensitive values

Supported Runtimes

Python 3.8-3.12, Node.js 18-20, Java 11-21, Go 1.x, .NET 6/8, Ruby 3.2, Custom Runtime (any language)

Lambda Function Structure

import json, boto3, os

# Code OUTSIDE handler runs once per container (cold start)
# Reuse these across warm invocations!
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TABLE_NAME'])

def lambda_handler(event, context):
    # event: input data (from API GW, S3, SQS, etc.)
    # context: runtime info (function name, remaining time, etc.)
    
    print(f"Function: {context.function_name}")
    print(f"Remaining time: {context.get_remaining_time_in_millis()}ms")
    print(f"Event: {json.dumps(event)}")
    
    # Process
    name = event.get('name', 'World')
    
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'message': f'Hello, {name}!'})
    }

Lambda Layers

Layers are ZIP archives containing libraries, custom runtimes, or dependencies shared across multiple functions. They shrink deployment packages and encourage code reuse.

  • Up to 5 layers per function, total <250 MB unzipped
  • Versioned - each publish creates a new version
  • Share across accounts (specific accounts or public)
  • AWS provides ready-made layers: AWS SDK, Powertools for Lambda
# Create a layer with Python packages
mkdir -p python/lib/python3.12/site-packages
pip install pandas numpy requests -t python/lib/python3.12/site-packages/
zip -r pandas-layer.zip python/
aws lambda publish-layer-version \
  --layer-name pandas-numpy \
  --zip-file fileb://pandas-layer.zip \
  --compatible-runtimes python3.12

Cold Start vs Warm Start

🥶 Cold Start

  • New execution environment created
  • Download code, initialize runtime, run init code
  • Adds 100 ms-3 s of latency
  • Happens when: first invocation, scaling out, no recent use
  • Worst with Java/C# runtimes and large packages

🔥 Warm Start

  • Existing environment reused
  • Only the handler code runs
  • Millisecond latency
  • External connections (DB, APIs) are reused!
  • Tip: keep DB connections outside the handler
Reduce Cold Starts: Provisioned Concurrency (pre-warm N environments, extra cost), smaller deployment packages, lazy imports, SnapStart (Java 11+, takes a snapshot after init). Python and Node.js have the fastest cold starts.

Lambda Event Sources (Triggers)

Source | Invocation Type | Use Case
API Gateway / ALB | Synchronous | REST APIs, web backends
S3 | Asynchronous | Image processing on upload, data pipelines
DynamoDB Streams | Stream (polling) | React to DB changes, replicate data
SQS | Stream (polling) | Process queue messages, decoupled workflows
SNS | Asynchronous | Fan-out processing, notifications
EventBridge | Asynchronous | Scheduled tasks (cron), event-driven workflows
Kinesis | Stream (polling) | Real-time data stream processing
CloudWatch Logs | Asynchronous | Log processing, alerting from log patterns
🔌

Lambda Integrations

Serverless

Lambda Limits

Limit | Value | Notes
Max timeout | 15 minutes | For longer tasks use Step Functions or ECS
Max memory | 10,240 MB (10 GB) | More memory = more vCPU
Concurrency (default) | 1,000/region | Request an increase via support
Package size (zip) | 50 MB | Use Layers for larger deps
Package (unzipped) | 250 MB | Including all layers
Response payload (sync) | 6 MB | Use S3 for large responses
Async payload | 256 KB | Pass an S3 key for large data
Env variables | 4 KB total | -

Lambda โ†’ RDS Connection (via RDS Proxy)

Problem: Lambda can have 1,000 concurrent executions, each opening a DB connection = 1,000 connections. Most DBs max out at 100-500 connections. Solution: Use RDS Proxy to pool connections.
import boto3, pymysql, json, os

# Initialize OUTSIDE handler (connection reuse on warm invocations)
db_conn = None

def get_db_connection():
    creds = boto3.client('secretsmanager').get_secret_value(
        SecretId='prod/rds/mysql')
    c = json.loads(creds['SecretString'])
    return pymysql.connect(
        host=os.environ['RDS_PROXY_ENDPOINT'],  # Proxy, not RDS endpoint!
        user=c['username'], password=c['password'],
        database='myapp', cursorclass=pymysql.cursors.DictCursor,
        connect_timeout=5
    )

def lambda_handler(event, context):
    global db_conn
    if not db_conn or not db_conn.open:
        db_conn = get_db_connection()
    
    with db_conn.cursor() as cursor:
        cursor.execute("SELECT * FROM users LIMIT 10")
        return {'statusCode': 200, 'body': json.dumps(cursor.fetchall())}

Lambda โ†’ DynamoDB

import boto3
from decimal import Decimal

# Outside handler = reuse
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

def lambda_handler(event, context):
    # Write (no connection pool needed - HTTP API)
    table.put_item(Item={
        'order_id': event['order_id'],
        'user_id': event['user_id'],
        'amount': Decimal(str(event['amount'])),
        'status': 'pending'
    })
    
    # Read
    resp = table.get_item(Key={'order_id': event['order_id']})
    return resp.get('Item', {})

Lambda โ†’ API Gateway Integration

# API Gateway Proxy Integration passes full HTTP context to Lambda
# Request: POST /users โ†’ Lambda receives:
event = {
    "httpMethod": "POST",
    "path": "/users",
    "pathParameters": {"id": "123"},
    "queryStringParameters": {"page": "1"},
    "headers": {"Authorization": "Bearer token..."},
    "body": '{"name":"Ravi","email":"<a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="126073647b52776a737f627e773c717d7f">[email protected]</a>"}',
    "isBase64Encoded": False
}

# Lambda MUST return this structure:
return {
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json",
        "Access-Control-Allow-Origin": "*"  # for CORS
    },
    "body": json.dumps({"user_id": "U123", "name": "Ravi"})
}

Lambda Environment Variables & Secrets

# Environment variables (non-sensitive config)
import os
TABLE_NAME = os.environ['TABLE_NAME']
REGION = os.environ.get('AWS_REGION', 'ap-south-1')

# For SECRETS โ†’ use Secrets Manager (not env vars!)
import boto3, json
_secret_cache = {}

def get_secret(name):
    if name not in _secret_cache:
        client = boto3.client('secretsmanager')
        _secret_cache[name] = json.loads(
            client.get_secret_value(SecretId=name)['SecretString']
        )
    return _secret_cache[name]  # cached after first call
🌐

Route 53

Networking

What is DNS and How Does it Work?

DNS (Domain Name System) translates human-readable domain names (google.com) into IP addresses (142.250.80.46) that computers use. Without DNS, you'd need to memorize IP addresses for every website. Route 53 is AWS's highly available and scalable DNS web service โ€” named after the DNS port (53).

User types www.example.com in the browser:
1. Browser checks its local cache → not found
2. OS asks the Recursive Resolver (usually the ISP's, or 8.8.8.8)
3. Recursive Resolver asks a Root Nameserver → "ask the .com TLD server"
4. Asks the .com TLD server → "ask ns-123.awsdns-45.com"
5. Asks the Route 53 Nameserver → returns "93.184.216.34"
6. Browser connects to 93.184.216.34
Total time: ~50-200 ms (first time), ~0 ms (cached)

DNS Record Types

Record | Maps | Important Notes | Example
A | hostname → IPv4 | Most common record | example.com → 93.184.216.34
AAAA | hostname → IPv6 | Next-gen internet | example.com → 2606:2800:220:1:248:1893:25c8:1946
CNAME | hostname → hostname | Cannot be used at the zone apex (example.com), only on subdomains (www.example.com) | www.example.com → example.com
Alias | hostname → AWS resource | AWS extension. Works at the apex. FREE queries. Use instead of CNAME for AWS resources. | example.com → myalb.amazonaws.com
MX | domain → mail servers | Priority number (lower = preferred) | 10 mail.example.com
TXT | domain → text string | Domain verification, SPF, DKIM | "v=spf1 include:amazonses.com ~all"
NS | zone → nameservers | Which servers are authoritative for the zone | ns-123.awsdns-45.com
SOA | zone metadata | Start of Authority: admin info, TTL defaults | Auto-created with the hosted zone
PTR | IP → hostname | Reverse DNS lookup | 34.216.93.184 → ec2.amazonaws.com
SRV | service location | Used for VoIP, XMPP, Kubernetes | _http._tcp.example.com

Hosted Zones

๐ŸŒ Public Hosted Zone

  • Answers DNS queries from the internet
  • $0.50/zone/month + $0.40/million queries
  • Created automatically when registering domain in Route 53
  • For external-facing websites, APIs, services
  • Nameservers assigned automatically (4 NS records)

🔒 Private Hosted Zone

  • Answers DNS queries only from within associated VPCs
  • $0.50/zone/month
  • Must associate with VPC(s) โ€” can associate multiple VPCs (even cross-account)
  • For internal services: db.internal, api.company.local
  • Requires: enableDnsHostnames + enableDnsSupport on VPC

TTL (Time to Live)

TTL tells DNS resolvers how long to cache a record. Choosing the right TTL is a balance between DNS query costs and propagation speed.

  • High TTL (86400 = 24 hours): Fewer queries (cheaper), but changes take 24 hours to propagate worldwide
  • Low TTL (60 = 1 min): Changes propagate in 1 minute, but 1440x more DNS queries (more expensive)
  • Best practice before migration: Lower TTL to 60s a week before making changes, then raise after
  • Alias records don't have a configurable TTL - AWS sets it automatically

Routing Policies โ€” In Depth

Policy | Algorithm | Best For | Health Checks
Simple | Returns all values; the client picks randomly | Single resource, no health checks needed | No
Weighted | Route X% to A, Y% to B based on weights (0-255) | A/B testing, blue/green deployments, gradual migrations | Optional
Failover | Primary active, secondary passive; auto-switch on health check failure | DR setups, active-passive HA | Required on primary
Geolocation | Route based on the user's geographic location (continent, country, state) | Content localization, GDPR data residency, language-specific content | Optional
Geoproximity | Route based on distance with an adjustable bias (+/-) | Shifting traffic between regions, fine-grained global routing | Optional
Latency-based | Route to the AWS region with the lowest measured latency for the user | Global apps where performance matters most | Optional
Multi-Value | Returns up to 8 healthy records randomly | Simple client-side load balancing (not a replacement for ELB) | Integrated
IP-based | Route based on the client's originating IP CIDR | Route ISP traffic to specific endpoints, optimize peering | No

Health Checks

Route 53 health checkers are deployed in 15+ locations globally. They check your endpoints every 10 or 30 seconds and mark them unhealthy if enough checks fail โ€” automatically removing them from DNS responses.

  • HTTP/HTTPS/TCP health checks: Check endpoint response. Must return 2xx/3xx within 4 seconds, response body can be checked for string match (first 5120 bytes).
  • Threshold: 3 consecutive failures = unhealthy. 3 consecutive successes = healthy (configurable).
  • Calculated Health Checks: Combine child health checks with AND/OR/NOT logic. Useful for "healthy if 2 of 3 servers up".
  • CloudWatch Alarm Health Checks: For private resources not reachable from the internet - check the CloudWatch alarm state instead.
  • Cost: $0.50-0.75/health check/month
# Failover routing - example setup:
Primary record:   www.example.com → ALB in us-east-1 (health check attached)
Secondary record: www.example.com → ALB in eu-west-1 (failover target)

If the primary health check fails 3+ consecutive times:
→ Route 53 automatically serves the secondary record
→ Recovery is automatic once the primary becomes healthy again
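
A sketch of creating or updating a record with Boto3; the hosted zone ID and IP address are placeholders. UPSERT creates the record if it's missing and updates it otherwise.

import boto3

r53 = boto3.client('route53')

r53.change_resource_record_sets(
    HostedZoneId='Z1234567890ABC',           # your zone ID
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'www.example.com',
            'Type': 'A',
            'TTL': 60,
            'ResourceRecords': [{'Value': '203.0.113.10'}]
        }
    }]}
)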

Domain Registration

  • Route 53 is an ICANN-accredited domain registrar - buy domains directly in AWS
  • Supports 400+ TLDs (.com, .io, .in, .co, .tech, .cloud, etc.)
  • Domains auto-renew by default (can disable)
  • Transfer existing domains from other registrars into Route 53
  • Privacy protection available โ€” hides personal info from WHOIS lookup
🌐

CloudFront

DNS & CDN

CloudFront Overview

CloudFront is AWS's Content Delivery Network (CDN) with 400+ Points of Presence (edge locations) globally. Content is cached at edge locations closest to users, reducing latency and origin load.

  • Supports HTTP/HTTPS, WebSocket, streaming (HLS, DASH)
  • Integrates with Shield (DDoS), WAF (Layer 7), ACM (SSL/TLS), Lambda@Edge
  • Default TTL: 24 hours. Configurable per behavior.
  • HTTP to HTTPS redirect built-in (viewer protocol policy)

Origins

Origin Type | Use Case | Security
S3 Bucket | Static websites, file downloads, media | OAC (Origin Access Control) blocks direct S3 URL access
ALB | Dynamic web apps, APIs behind a load balancer | Custom header (X-Origin-Key) to verify requests come from CloudFront
EC2 Instance | Custom servers (must have a public IP) | Security group allowing CloudFront IP ranges
Any HTTP Endpoint | On-premises, third-party servers | Custom headers, IP whitelisting

Cache Behaviors

  • Map URL path patterns to different origins: /api/* → ALB (no cache), /images/* → S3 (cache 7 days), /* → S3 (cache 24 h)
  • Cache Policy: What goes into cache key (headers, cookies, query strings). More = less cache hits.
  • Origin Request Policy: What to forward to origin (headers, cookies, query strings)
  • Viewer protocol policy: HTTP + HTTPS, HTTPS only, Redirect HTTP to HTTPS

Invalidations

# Force CloudFront to fetch fresh content from origin
aws cloudfront create-invalidation \
  --distribution-id E1234567ABCDEF \
  --paths "/*"                    # all files
  # OR "/index.html"              # specific file
  # OR "/images/*"                # specific path

# Cost: First 1,000 invalidation paths/month free, then $0.005 each
# Better approach: use versioned filenames (main.v2.3.css) - no invalidation needed!

Security Features

  • Origin Access Control (OAC): CloudFront sends signed requests to S3. S3 bucket policy allows only from CloudFront OAC. Direct S3 URL = 403 Forbidden.
  • Signed URLs: Grant time-limited access to individual files (see the sketch after this list). Use for: paid content, user-specific files
  • Signed Cookies: Grant access to multiple files without changing URLs. Use for: premium video libraries
  • Geographic Restrictions: Whitelist (allow only) or Blacklist (deny) countries
  • Field-Level Encryption: Encrypt specific POST data fields (e.g., credit cards) at edge
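
A sketch of generating a signed URL with botocore's CloudFrontSigner and the rsa package; the key pair ID, private key file, and distribution domain are placeholders.

from datetime import datetime, timedelta, timezone
from botocore.signers import CloudFrontSigner
import rsa  # pip install rsa

def rsa_signer(message):
    # Sign with the private key whose public half is registered in CloudFront
    with open('private_key.pem', 'rb') as f:
        key = rsa.PrivateKey.load_pkcs1(f.read())
    return rsa.sign(message, key, 'SHA-1')  # CloudFront expects SHA-1 here

signer = CloudFrontSigner('K2JCJMDEHXQW5F', rsa_signer)  # key pair ID

url = signer.generate_presigned_url(
    'https://d111111abcdef8.cloudfront.net/videos/premium.mp4',
    date_less_than=datetime.now(timezone.utc) + timedelta(hours=1)
)
print(url)  # the link expires in 1 hour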

Lambda@Edge

Run Lambda functions at CloudFront edge locations to customize content delivery. 4 trigger points per request cycle:

  1. Viewer Request - before the cache check (auth, URL rewrites)
  2. Origin Request - on a cache miss, before the origin (modify headers)
  3. Origin Response - after the origin responds (add headers)
  4. Viewer Response - before delivery to the user (security headers)

  • Runtimes: Node.js and Python only
  • Limits: Viewer triggers = 1 MB code / 128 MB memory / 5 s. Origin triggers = 50 MB / 10 GB / 30 s.
  • Use cases: A/B testing, auth checks, SEO, URL rewriting, redirects, response compression
  • CloudFront Functions (newer, cheaper): JavaScript only, sub-millisecond, viewer request/response only. Ideal for URL redirects and header manipulation.
๐Ÿ›๏ธ

Terraform with AWS

Infrastructure as Code

What is Infrastructure as Code?

Infrastructure as Code (IaC) means defining your cloud infrastructure in code files instead of clicking through consoles. This brings software engineering best practices (version control, code review, testing, CI/CD) to infrastructure management.

✅ Benefits of IaC

  • Reproducible: same code = same infrastructure every time
  • Version controlled: see every change in Git history
  • Reviewable: infrastructure changes go through PR process
  • Automated: deploy via CI/CD pipeline, no manual clicks
  • Documented: code IS the documentation
  • Disaster recovery: rebuild entire environment in minutes

Terraform vs CloudFormation

  • Terraform: Multi-cloud (AWS, Azure, GCP, K8s), HCL language, huge ecosystem, state file management needed
  • CloudFormation: AWS-only, JSON/YAML, deep AWS integration, no state management (AWS handles it), free
  • Both are valid - Terraform is preferred for multi-cloud, CloudFormation for AWS-only shops needing deep integration

Terraform Core Concepts

Concept | Description | Example
Provider | Plugin that talks to a cloud API; translates HCL into API calls | hashicorp/aws, hashicorp/azurerm, hashicorp/kubernetes
Resource | Infrastructure component you want to create/manage | aws_instance, aws_s3_bucket, aws_vpc
Data Source | Read existing resource info (don't manage it, just read it) | data.aws_ami.latest, data.aws_vpc.default
Variable | Input parameter that makes code reusable | var.instance_type, var.environment
Local | Computed value within a module; avoids repetition | local.name_prefix = "prod-app"
Output | Export values after apply, for other modules or scripts | output: EC2 IP, RDS endpoint
Module | Reusable package of Terraform code | module "vpc" { source = "./modules/vpc" }
State | JSON file tracking what Terraform has created; the source of truth | terraform.tfstate (store it in S3!)

Terraform Workflow

# 1. Initialize - download providers, set up the backend
terraform init

# 2. Format code (always run before committing)
terraform fmt -recursive

# 3. Validate syntax
terraform validate

# 4. ALWAYS review plan before applying!
terraform plan
terraform plan -out=tfplan.out    # save plan for apply

# 5. Apply changes
terraform apply                   # interactive confirmation
terraform apply tfplan.out        # apply saved plan
terraform apply -auto-approve     # CI/CD (no prompt)

# Other useful commands
terraform destroy                 # destroy everything (careful!)
terraform output                  # show outputs
terraform state list              # list managed resources
terraform state show aws_instance.web  # inspect resource state
terraform import aws_s3_bucket.logs my-bucket-name  # import existing resource

Remote State โ€” Critical for Teams

By default, Terraform stores state locally (terraform.tfstate). This breaks down in teams - two people can't work simultaneously and state isn't shared. ALWAYS use remote state with S3 + DynamoDB locking in production.

# versions.tf - remote backend configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket         = "mycompany-terraform-state"    # must exist first!
    key            = "prod/ap-south-1/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true                           # encrypt state at rest
    dynamodb_table = "terraform-locks"              # prevent concurrent applies
  }
}

# Create the S3 bucket and DynamoDB table manually first (bootstrap):
aws s3api create-bucket --bucket mycompany-terraform-state --region ap-south-1
aws dynamodb create-table --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Complete VPC + EC2 Example

# variables.tf
variable "env"    { default = "dev" }
variable "region" { default = "ap-south-1" }

# vpc.tf
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "${var.env}-vpc" }
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.region}a"
  map_public_ip_on_launch = true
  tags = { Name = "${var.env}-public-1a" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.env}-igw" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route { cidr_block = "0.0.0.0/0"; gateway_id = aws_internet_gateway.igw.id }
  tags = { Name = "${var.env}-public-rt" }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# ec2.tf
data "aws_ami" "al2" {
  most_recent = true
  owners      = ["amazon"]
  filter { name = "name"; values = ["amzn2-ami-hvm-*-x86_64-gp2"] }
}

resource "aws_security_group" "web" {
  name   = "${var.env}-web-sg"
  vpc_id = aws_vpc.main.id
  ingress { from_port=80;  to_port=80;  protocol="tcp"; cidr_blocks=["0.0.0.0/0"] }
  ingress { from_port=443; to_port=443; protocol="tcp"; cidr_blocks=["0.0.0.0/0"] }
  ingress { from_port=22;  to_port=22;  protocol="tcp"; cidr_blocks=["10.0.0.0/8"] }
  egress  { from_port=0;   to_port=0;   protocol="-1";  cidr_blocks=["0.0.0.0/0"] }
}

resource "aws_instance" "web" {
  ami                    = data.aws_ami.al2.id
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]
  user_data = <<-EOF
    #!/bin/bash
    yum install -y nginx
    systemctl start nginx
    systemctl enable nginx
  EOF
  tags = { Name = "${var.env}-web" }
}

# outputs.tf
output "web_public_ip"  { value = aws_instance.web.public_ip }
output "web_public_dns" { value = aws_instance.web.public_dns }

Terraform Modules

Modules are reusable packages of Terraform configuration. Instead of copy-pasting VPC code across multiple projects, create a VPC module once and reuse it everywhere.

# Using community modules from Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
  
  name = "my-vpc"
  cidr = "10.0.0.0/16"
  azs             = ["ap-south-1a", "ap-south-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true
  tags = { Terraform = "true", Environment = "dev" }
}

# Reference module outputs
resource "aws_instance" "app" {
  subnet_id = module.vpc.private_subnets[0]
  # ...
}

Terraform Best Practices

  • Always run terraform plan and review before apply
  • Store state in S3 with DynamoDB locking - never commit .tfstate to Git
  • Use workspaces or separate state files per environment (dev/staging/prod)
  • Pin provider versions: version = "~> 5.0"
  • Use terraform fmt pre-commit hooks for consistent formatting
  • Use terraform validate in CI/CD pipeline
  • Sensitive outputs: mark with sensitive = true
  • Use count or for_each instead of duplicating resources
  • Organize large configs: separate files (vpc.tf, ec2.tf, rds.tf, variables.tf, outputs.tf)
🐍

Python Boto3 for AWS Automation

Scripting

What is Boto3?

Boto3 is the official AWS SDK for Python. It lets you programmatically interact with AWS services - create resources, manage infrastructure, automate tasks, and build applications that use AWS.

pip install boto3

# Two interfaces:
import boto3
# 1. Client (low-level, 1:1 map to AWS API)
ec2_client = boto3.client('ec2', region_name='ap-south-1')
# 2. Resource (high-level, object-oriented)
s3 = boto3.resource('s3')

# Authentication order:
# 1. Environment variables (AWS_ACCESS_KEY_ID, etc.)
# 2. ~/.aws/credentials file (aws configure)
# 3. IAM Instance Profile (EC2) or Task Role (ECS/Lambda) ← recommended on AWS

EC2 Automation

import boto3, datetime  # datetime is used by snapshot_volume below

ec2 = boto3.client('ec2', region_name='ap-south-1')

# List all running instances with details
def list_instances(state='running'):
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=[{'Name':'instance-state-name','Values':[state]}]):
        for r in page['Reservations']:
            for i in r['Instances']:
                name = next((t['Value'] for t in i.get('Tags',[]) if t['Key']=='Name'), 'N/A')
                print(f"{i['InstanceId']:20} {i['InstanceType']:12} {i.get('PublicIpAddress','Private'):15} {name}")

# Start/Stop/Reboot
ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
ec2.stop_instances(InstanceIds=['i-1234567890abcdef0'])
ec2.reboot_instances(InstanceIds=['i-1234567890abcdef0'])

# Create snapshot with tags
def snapshot_volume(volume_id, desc="Auto backup"):
    snap = ec2.create_snapshot(VolumeId=volume_id, Description=desc,
        TagSpecifications=[{'ResourceType':'snapshot',
            'Tags':[{'Key':'AutoCreated','Value':'true'},
                    {'Key':'Date','Value':str(datetime.date.today())}]}])
    return snap['SnapshotId']

# Delete old snapshots (older than N days)
def cleanup_old_snapshots(days=30):
    from datetime import datetime, timezone, timedelta
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    snaps = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    for s in snaps:
        if s['StartTime'] < cutoff and s.get('Tags'):
            if any(t['Key']=='AutoCreated' for t in s['Tags']):
                ec2.delete_snapshot(SnapshotId=s['SnapshotId'])
                print(f"Deleted {s['SnapshotId']}")

S3 Automation

import boto3, os
from pathlib import Path

s3 = boto3.client('s3')

# Upload file with progress
def upload_file(path, bucket, key=None, extra_args=None):
    key = key or os.path.basename(path)
    s3.upload_file(path, bucket, key, ExtraArgs=extra_args or {})
    print(f"โœ“ Uploaded {path} โ†’ s3://{bucket}/{key}")

# Upload with metadata and encryption
upload_file('report.pdf', 'my-bucket', 'reports/report.pdf', {
    'ContentType': 'application/pdf',
    'ServerSideEncryption': 'aws:kms',
    'Metadata': {'author': 'Ravi', 'version': '2.0'}
})

# List all objects with pagination
def list_all_objects(bucket, prefix=''):
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            print(f"{obj['Key']:60} {obj['Size']:10} bytes")

# Generate pre-signed URL for download
def get_presigned_url(bucket, key, expires=3600):
    return s3.generate_presigned_url('get_object',
        Params={'Bucket': bucket, 'Key': key}, ExpiresIn=expires)

# Clean up old files
def delete_old_files(bucket, prefix, days=30):
    from datetime import datetime, timezone, timedelta
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        old = [{'Key': o['Key']} for o in page.get('Contents',[]) if o['LastModified'] < cutoff]
        if old:
            s3.delete_objects(Bucket=bucket, Delete={'Objects': old})
            print(f"Deleted {len(old)} old files")

Lambda Automation

import boto3, json, zipfile, io

lambda_client = boto3.client('lambda', region_name='ap-south-1')

# Invoke Lambda synchronously
def invoke_lambda(func_name, payload):
    resp = lambda_client.invoke(
        FunctionName=func_name,
        InvocationType='RequestResponse',  # sync
        Payload=json.dumps(payload)
    )
    result = json.loads(resp['Payload'].read())
    if resp.get('FunctionError'):
        raise Exception(f"Lambda error: {result}")
    return result

# Update function code from local file
def deploy_function(func_name, code_file):
    with open(code_file, 'rb') as f:
        lambda_client.update_function_code(
            FunctionName=func_name, ZipFile=f.read())
    print(f"โœ“ Deployed {func_name}")

# Update environment variables
lambda_client.update_function_configuration(
    FunctionName='my-function',
    Environment={'Variables': {'TABLE_NAME': 'NewTable', 'ENV': 'prod'}}
)

CloudWatch and RDS Automation

import boto3

logs = boto3.client('logs')
rds = boto3.client('rds', region_name='ap-south-1')

# Get Lambda error logs from last N hours
def get_lambda_errors(func_name, hours=1):
    import time
    log_group = f'/aws/lambda/{func_name}'
    start_ms = int((time.time() - hours*3600) * 1000)
    resp = logs.filter_log_events(
        logGroupName=log_group, startTime=start_ms, filterPattern='ERROR')
    for e in resp['events']:
        print(e['message'].strip())

# Create RDS snapshot
def backup_rds(db_instance_id):
    import datetime
    snap_id = f"{db_instance_id}-{datetime.date.today().isoformat()}"
    rds.create_db_snapshot(DBInstanceIdentifier=db_instance_id, DBSnapshotIdentifier=snap_id)
    print(f"โœ“ Snapshot {snap_id} created")

# List RDS instances with status
def list_rds_instances():
    for db in rds.describe_db_instances()['DBInstances']:
        print(f"{db['DBInstanceIdentifier']:30} {db['DBInstanceStatus']:12} {db['DBInstanceClass']}")
🔄

DMS โ€” Database Migration Service

Migration

What is DMS?

AWS Database Migration Service (DMS) helps migrate databases to AWS with minimal downtime. The source database remains fully operational during migration โ€” your application keeps running. Only a brief cutover pause (seconds to minutes) is needed at the very end. DMS handles the complexity of moving data, keeping it in sync, and notifying you when it's safe to switch.

Key Value Proposition: Traditional database migrations required hours or days of planned downtime. DMS enables "live migration" - continuously syncing changes so cutover is just updating a connection string.

DMS Architecture

  • Replication Instance: The EC2 instance DMS runs on. Reads from source, writes to target. Choose size based on data volume.
  • Source Endpoint: Connection details for your source DB (hostname, port, credentials, engine type)
  • Target Endpoint: Connection details for your destination DB
  • Replication Task: Defines what to migrate, which tables, migration type, and settings
Flow:
Source DB ──► Replication Instance ──► Target DB
(MySQL EC2)   (reads changes via CDC)   (Amazon Aurora)

Migration Types

Type | How it Works | Downtime | When to Use
Full Load | Copies all existing data; no CDC. Source must be static during migration. | High (must stop writes) | Dev/test DBs, small non-critical DBs that can afford downtime
Full Load + CDC | Full load first, then CDC captures ongoing changes, keeping the target in sync until cutover | Minutes (cutover only) | Production systems; the most common approach
CDC Only | Replicates only ongoing changes; assumes the initial data is already in the target | None | Data already loaded manually (e.g. pg_dump) and needing ongoing sync

Change Data Capture (CDC)

CDC is the technology that enables near-zero downtime migration. DMS reads the database's transaction log (binlog for MySQL, WAL for PostgreSQL, redo log for Oracle) to capture every INSERT, UPDATE, DELETE and replay it on the target.

  • MySQL: Enable binlog: set binlog_format=ROW, binlog_row_image=FULL
  • PostgreSQL: Enable logical replication: set wal_level=logical
  • Oracle: Enable supplemental logging, use LogMiner or Binary Reader
  • SQL Server: Enable MS-CDC on tables to migrate

Step-by-Step: MySQL EC2 โ†’ Amazon Aurora

  1. Pre-Migration: Enable the MySQL binlog on the source EC2 instance. Create the Aurora cluster as the target. Run the DMS premigration assessment to identify issues.
  2. Create Replication Instance: DMS Console → Replication instances → Create. Choose a class (dms.t3.medium for small, dms.r5.large for large). Multi-AZ: Yes for production. Wait for "Available".
  3. Create Source Endpoint: Engine=MySQL, Server=EC2 private IP, Port=3306, Username=dms_user, Password. Click "Test connection" - it must show Success.
  4. Create Target Endpoint: Engine=Aurora MySQL, Server=Aurora cluster endpoint, Port=3306. Test the connection.
  5. Create Migration Task: Select the replication instance + both endpoints. Migration type: Full load + CDC. Table mappings: include all schemas or specific tables. Enable logging.
  6. Monitor Progress: Watch the "Table statistics" tab - rows loaded, inserts/updates/deletes applied via CDC. Check "CDC latency" - it should trend toward 0 (see the monitoring sketch below).
  7. Validate Data: Use the DMS Data Validation feature (row counts + checksums) to verify the target matches the source.
  8. Cutover: Stop writes to the source → wait for CDC latency = 0 → update the app connection strings to the Aurora endpoint → resume writes → stop the DMS task.
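
A monitoring sketch for step 6 using Boto3; the task ARN is a placeholder. Note that CDC latency itself is published as CloudWatch metrics in the AWS/DMS namespace.

import boto3

dms = boto3.client('dms', region_name='ap-south-1')

# Check task status and full-load progress before planning cutover
task = dms.describe_replication_tasks(
    Filters=[{'Name': 'replication-task-arn',
              'Values': ['arn:aws:dms:ap-south-1:123456789012:task:ABCDE']}]
)['ReplicationTasks'][0]

print(task['Status'])                                    # e.g. 'running'
print(task['ReplicationTaskStats']['FullLoadProgressPercent'])

# CDC latency lives in CloudWatch (AWS/DMS namespace):
# CDCLatencySource / CDCLatencyTarget - both should be near 0 at cutover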

AWS Schema Conversion Tool (SCT)

SCT is required for heterogeneous migrations (source and target are different database engines). It converts database schema, stored procedures, views, and functions from one SQL dialect to another.

Supported Conversions

  • Oracle → Aurora PostgreSQL/MySQL
  • SQL Server → Aurora PostgreSQL/MySQL
  • Teradata → Amazon Redshift
  • SAP ASE → Aurora PostgreSQL/MySQL
  • IBM Db2 → Aurora PostgreSQL

SCT Assessment Report

  • Shows % of schema auto-converted vs manual effort
  • Red items: require manual rewrite (stored procs, vendor-specific functions)
  • Green items: converted automatically
  • Estimates conversion effort in person-days
  • Free download from AWS website

DMS Sources and Targets

Sources

  • Oracle, SQL Server, MySQL, MariaDB, PostgreSQL
  • MongoDB, IBM Db2, SAP ASE, Sybase
  • Amazon RDS (all engines), Aurora
  • Amazon S3 (CSV/Parquet as source)
  • Azure SQL, Google Cloud SQL

Targets

  • All RDS engines, Aurora, Redshift
  • Amazon S3 (CSV/Parquet - data lake)
  • Amazon DynamoDB (from relational)
  • Amazon OpenSearch Service
  • Amazon Kinesis Data Streams
  • Apache Kafka (MSK)

Replication Instance Sizing

Class | vCPU | RAM | Use Case
dms.t3.micro | 2 | 1 GB | Dev/test, very small DBs (<1 GB)
dms.t3.medium | 2 | 4 GB | Small production (<10 GB)
dms.r5.large | 2 | 16 GB | Medium production (10-100 GB)
dms.r5.xlarge | 4 | 32 GB | Large production (100 GB+)
dms.r5.4xlarge | 16 | 128 GB | Very large migrations (TB scale)

DMS Best Practices

  • Always run Pre-Migration Assessment to catch issues before starting
  • Enable Multi-AZ replication instance for production migrations
  • Place replication instance in same VPC as target (minimize latency)
  • Drop indexes and foreign keys on target before full load, recreate after (3-5x faster)
  • Use Parallel Load for large tables (partition into segments, load in parallel)
  • Enable Data Validation to verify row counts and checksums post-migration
  • Test on a staging environment that mirrors production before go-live
  • Monitor CDC Latency Source and CDC Latency Target - both should be near 0 before cutover
  • Keep source DB running for 1-2 weeks post-cutover as rollback option