Hybrid Environments and Migration
Border Gateway Protocol (BGP) – Key Concepts
- Routing protocol → Peers share info on how to reach destinations
- Path-vector protocol → Each peer shares the “best path” vector to a destination
- Autonomous System (AS) = self-managed network controlled by one entity → acts as a single BGP peer
- Viewed as a black box from outside, regardless of size
- Autonomous System Number (ASN)
- 16-bit number: 0–65535
- Private ASNs: 64512–65534
- Public ASNs: globally unique, allocated by IANA
- BGP uses TCP port 179 → reliable, distributed routing with flow control & error correction
- Peering is manual → BGP does not auto-configure
- Once peering is established, ASs exchange routing info and topology continuously
- This is the foundation of Internet routing
- Autonomous System Path (ASPATH) = “best path” to destination
- Even if multiple paths exist, AS shares only the selected ASPATH
- Does not consider link speed or condition, only the number of AS hops
- Shortest path preferred by default
- Techniques like ASPATH prepending can make slower paths less preferred
- BGP Types
- iBGP (internal) → routing within an AS
- eBGP (external) → routing between ASs (AWS mainly uses eBGP)
Used by AWS Direct Connect (DX) and dynamic Site-to-Site VPNs
Simple Example of BGP Architecture

- 3 metro areas: Brisbane (ASN=200), Adelaide (ASN=201), Alice Springs (ASN=202)
- 200↔201 & 201↔202 → fiber links
- 200↔202 → satellite link (slower)
- Route tables (RTs): each AS maintains one
- Origin AS = i in ASPATH
- ASs share routes → RT updated with new ASPATHs
- Brisbane learns Alice Springs via Adelaide → ASPATH (201, 202, i)
- Alice Springs directly to Brisbane → ASPATH (202, i) → preferred due to fewer hops
- ASPATH prepending → artificially lengthen a path to make it less preferred (see the sketch after this list)
- Example: satellite link prepended → ASPATH (202, 202, 202, i)
- Brisbane now prefers fiber path (shorter hops)
- Satellite path still exists as backup
- Result: dynamic, HA network
- All ASs maintain constantly updated topology
- Failure in one AS automatically reroutes traffic
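A minimal Python sketch of the selection logic above — not real BGP, just the hop-count comparison that prepending exploits. The ASNs and paths mirror the Brisbane/Alice Springs example:

```python
# Toy illustration of BGP best-path selection by ASPATH length.
# Paths match the example above: fiber via Adelaide vs. prepended satellite.
routes = {
    "alice-springs": [
        (201, 202, "i"),       # via Adelaide over fiber (2 AS hops)
        (202, 202, 202, "i"),  # direct satellite link, prepended by AS 202
    ],
}

def best_path(candidates):
    # BGP prefers the shortest ASPATH; it ignores link speed entirely,
    # which is exactly why prepending works as a traffic-steering tool.
    return min(candidates, key=len)

print(best_path(routes["alice-springs"]))  # -> (201, 202, 'i'): fiber wins
```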
IPsec VPN Fundamentals
IPsec – Key Concepts
- IPsec is a set of protocols used to create secure tunnels over untrusted networks.
- Example: A local device and a remote device establish a secure tunnel over the public internet → connects two endpoints.
- Often used for geographically distributed on-premises infrastructure or hybrid connections between cloud and on-premises networks.
- IPsec tunnels allow the creation of VPNs spanning multiple locations.
- Provides:
- Authentication: ensures that only trusted peers can connect.
- Encryption: protects data by transmitting it in encrypted form.
IPsec Architecture

- Tunnels are dynamically created and removed depending on traffic needs.
- If traffic matches defined criteria → create tunnel; if no matching traffic → remove tunnel.
- Interesting traffic = traffic that meets defined rules.
- Rules can be based on network prefixes or more detailed criteria.
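A minimal sketch of prefix-based "interesting traffic" matching using Python's ipaddress module; the policy prefix and addresses are made up for illustration:

```python
import ipaddress

# Hypothetical policy: traffic destined for the remote site's prefix is
# "interesting" and should trigger tunnel creation.
POLICY_PREFIX = ipaddress.ip_network("10.16.0.0/16")

def is_interesting(dst_ip: str) -> bool:
    # Real IPsec policies can also match source, protocol, and port;
    # this checks only the destination prefix.
    return ipaddress.ip_address(dst_ip) in POLICY_PREFIX

print(is_interesting("10.16.3.7"))   # True  -> bring the tunnel up
print(is_interesting("192.0.2.10"))  # False -> no tunnel needed
```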
IPsec Phases – Internet Key Exchange (IKE)
- Symmetric encryption is fast but sharing keys securely is challenging.
- Asymmetric encryption allows easy key exchange but is slower.
- Internet Key Exchange (IKE) is the protocol used to exchange keys and set up IPsec VPNs.
- IKE Phase 1 → initial, resource-intensive setup.
- IKE Phase 2 → faster, lightweight setup.
- Phase 2 tunnels are created and removed based on traffic, but Phase 1 tunnel usually remains active to simplify future Phase 2 tunnel creation.
IKE Phase 1 – Peer Authentication & Key Exchange

- Peers authenticate using pre-shared keys or certificates.
- Keys are exchanged via asymmetric encryption (e.g., Diffie-Hellman, DH).
- Each peer generates a DH private key (kept secret, never transmitted).
- Each peer derives a DH public key (safe to share with the other peer).
- Peers exchange public keys.
- Each peer combines its private key with the peer’s public key to generate a shared symmetric DH key.
- Both peers end up with the same DH key, independently derived (illustrated in the toy example below).
- DH key is used to exchange additional keying information and agreements.
- Phase 1 Security Association (SA) is established – the initial tunnel.
- DH key secures all data transmitted through Phase 1 tunnel.
- Phase 1 is slow and computationally heavy, but generates a secure key without transmitting it in plaintext over the internet.
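A toy Diffie-Hellman exchange with tiny, deliberately insecure numbers, purely to show both peers deriving the same key without ever sending it; real IKE uses large primes (e.g., 2048-bit MODP groups):

```python
# Public parameters: a small prime and generator (demo values only).
p, g = 23, 5

a = 6                 # peer A's private key (never transmitted)
b = 15                # peer B's private key (never transmitted)

A = pow(g, a, p)      # peer A's public key, sent to B
B = pow(g, b, p)      # peer B's public key, sent to A

shared_a = pow(B, a, p)   # A combines its private key with B's public key
shared_b = pow(A, b, p)   # B combines its private key with A's public key

assert shared_a == shared_b
print(shared_a)       # -> 2: the same symmetric key, derived independently
```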
IKE Phase 2 – Establishing the IPsec VPN

- Peers negotiate encryption parameters for the VPN and use the DH key to exchange additional key material.
- Includes supported cipher suites and VPN type.
- New symmetric IPsec keys are generated for efficient bulk data transfer.
- IPsec keys are independent of DH key for added security.
- If original keys are compromised, IPsec keys remain protected.
- Phase 2 Security Association (SA) is created – the VPN tunnel runs over the Phase 1 tunnel.
Types of IPsec VPNs
- Route-based VPNs – traffic is directed according to IP prefixes.
- One Phase 2 tunnel → one SA pair → one IPsec key.
- Policy-based VPNs – traffic is matched according to defined rulesets.
- Multiple rulesets can exist, each with its own SA pair and IPsec key.
- Policy-based VPNs offer more flexibility but are more complex to configure.
Route- vs Policy-based VPNs Diagram:

AWS Site-to-Site VPN
Site-to-Site VPN – Key Concepts
- AWS Site-to-Site VPN allows you to establish and manage an IPsec VPN – a logical connection between a VPC and an on-premises network, with data encrypted in transit using IPsec.
- Usually runs over the public internet, but it can also operate over Direct Connect (DX).
- Enables hybrid networking between AWS and on-premises networks, or between AWS and another cloud provider.
- Main components:
- Virtual Private Gateway (VGW)
- A logical gateway object that can be a route target in a VPC route table.
- Associated with a single VPC.
- Currently does not support IPv6.
- Functions similarly to an Internet Gateway (IGW): positioned between the VPC and AWS public network, with regional resilience.
- Customer Gateway (CGW) – refers to either:
- The physical on-premises router the VPN connects to.
- The logical AWS configuration object representing that router.
- VPN connection
- Always links one VGW and one CGW.
- Stores VPN configuration details.
- Virtual Private Gateway (VGW)
- Key features:
- Can be highly available if designed and implemented properly.
- Direct Connect integration
- Can serve as a backup/failover for DX.
- Can operate on top of DX to provide an additional encryption layer.
- Billing considerations:
- Base hourly charge.
- Data transfer fees for traffic leaving AWS ($/GB).
- Uses the on-premises internet connection, which can affect bandwidth limits or ISP data caps.
VPN Advantages and Limitations
- Benefits:
- Fast to provision (less than an hour), since it’s fully software-based and requires no hardware setup. DX provisioning takes significantly longer.
- More cost-effective than DX.
- IPsec is widely supported across routers and networking hardware.
- Limitations:
- Throughput constraints – a single VPN connection has a maximum of 1.25 Gbps (AWS limit). On-premises hardware may impose lower limits. Encryption/decryption overhead can reduce effective speed. A VGW has a combined throughput cap of 1.25 Gbps across all VPN connections.
- Latency – public internet paths are variable and inconsistent, which may not be suitable for latency-sensitive applications.
Site-to-Site VPN – Architecture
Basic / Partially Highly Available Implementation

Steps to establish VPN:
- Collect necessary information: VPC CIDR, on-premises network CIDR, and on-prem router’s public IP.
- Create a VGW and attach it to the VPC.
- VGW includes two physical endpoints in separate AZs, each with a public IP → provides HA on the AWS side.
- Create the CGW logical object in AWS using the on-prem router’s public IP.
- Create the VPN connection linking the VGW and CGW.
- IPsec tunnels are established between each VGW endpoint and the on-prem router.
- Two encrypted tunnels provide redundancy: if one fails, the other remains active.
- Additional tunnels or VGWs can be created if more redundancy is needed.
- Limitation: single on-prem router = single point of failure.
- If the router fails, the VPN fails. AWS side is HA, but the overall solution is only partially highly available.
Fully Highly Available Implementation

- Add a second on-prem router in a separate building to eliminate single point of failure.
- Create a new VPN connection with the second CGW, establishing two separate VPN connections.
- This does not reuse previous VGW endpoints; new endpoints and tunnels are created for the second router.
Static vs Dynamic VPNs

Static VPN
- Routes are manually configured in VPC route tables using static network ranges.
- Simple setup; compatible with most routers.
- Limitations:
- No load balancing or multi-connection failover without manual configuration.
- Multiple CGWs require manual failover.
- Use dynamic VPNs for HA, multi-link redundancy, or DX integration.
Dynamic VPN
- Uses BGP to exchange network routing information.
- BGP peering between VGW and CGW allows dynamic route adjustment.
- Supports multiple simultaneous links for seamless failover.
- Requires on-prem routers that support BGP.
Route Propagation
- Can be enabled in a VPC to automatically add known on-prem routes to VPC route tables.
- Works with both static and dynamic VPNs.
- Reduces manual route configuration in AWS.
- Manual route updates are still needed on on-prem routers unless BGP is used.
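A minimal boto3 sketch of enabling route propagation on a VPC route table; the IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Enable propagation so routes known to the VGW (static or BGP-learned)
# are automatically added to this route table.
ec2.enable_vgw_route_propagation(
    RouteTableId="rtb-0123456789abcdef0",   # placeholder route table ID
    GatewayId="vgw-0123456789abcdef0",      # placeholder VGW ID
)
```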
LAB: Simple Site-to-Site VPN
GOAL: On-prem laptop connects to a private AWS web application through a VPN

- For this lab, the on-premises setup is emulated inside AWS
- Normally, the on-prem network would be a corporate LAN rather than a VPC
- The on-prem router would typically be a physical device instead of an EC2 instance
- The on-prem laptop would usually be a real machine rather than an EC2 instance
- The on-prem router runs pfSense software
- Acts as a router, firewall, and VPN device
- In real deployments, this could be a pfSense Netgate appliance
- The lab walks through configuration step by step
- The on-prem laptop uses Windows
- It will access a web app hosted in an AWS VPC using Internet Explorer
- This lab is not fully covered by the free tier and may incur some cost
STAGE 0: Initial Setup
- Switch to the us-east-1 region
- Subscribe to the pfSense AMI
- Marketplace link: pfSense AMI
- Select “View Purchase Options” → “Accept Terms”
- This AMI introduces additional cost, so unsubscribe after finishing

- Unsubscribe guide
- Create an EC2 SSH key pair in us-east-1
- Navigate to EC2 → Network & Security → Key Pairs
- Download the key after creation
- Launch the environment using CloudFormation
- 1-click deployment
- Ensure the created key pair is selected
- After this stage, both the AWS VPC and simulated on-prem network are deployed

STAGE 1: Configure AWS VPN
This setup uses a static VPN configuration (no BGP).
- Create a Virtual Private Gateway (VGW) and attach it to the VPC
- Create a Customer Gateway (CGW)
- Use the public IP of the on-prem router
- Establish the VPN connection
- Select the VGW and CGW
- Choose static routing and enter the on-prem CIDR 192.168.8.0/21
- If BGP were used, routes would be exchanged dynamically

- Once active, download the VPN configuration file

- Select pfSense as the vendor
- This file includes detailed IPsec tunnel configuration needed later
- At this point, VPN endpoints are created and ready
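For reference, the same static VPN setup can be scripted; a minimal boto3 sketch with placeholder IDs and IPs (the lab itself uses the console):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. VGW: create and attach to the VPC
vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]["VpnGatewayId"]
ec2.attach_vpn_gateway(VpcId="vpc-0123456789abcdef0", VpnGatewayId=vgw)

# 2. CGW: logical object for the on-prem router (BgpAsn is required by the
#    API even though this lab uses static routing)
cgw = ec2.create_customer_gateway(
    Type="ipsec.1", PublicIp="203.0.113.10", BgpAsn=65000
)["CustomerGateway"]["CustomerGatewayId"]

# 3. VPN connection with static routing, plus the on-prem CIDR as a route
vpn = ec2.create_vpn_connection(
    Type="ipsec.1", CustomerGatewayId=cgw, VpnGatewayId=vgw,
    Options={"StaticRoutesOnly": True},
)["VpnConnection"]["VpnConnectionId"]
ec2.create_vpn_connection_route(
    VpnConnectionId=vpn, DestinationCidrBlock="192.168.8.0/21"
)
```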

STAGE 2: Configure On-Premises pfSense Router
- Retrieve the router password from EC2 system logs

- Access the router using its public IP via HTTPS
- Log in using admin and the retrieved password
Configure Networking
- WAN is preconfigured

- Add and enable a LAN interface

- Set LAN IPv4 configuration to DHCP
The router now has both public (WAN) and private (LAN) connectivity
Configure IPsec Tunnels
Phase 1 (IKE)
- Define parameters using values from the AWS config file
- Use:
- IKEv1

- Mutual PSK authentication

- AES encryption and SHA1 hashing

- The pre-shared key acts as a shared authentication secret
Phase 2 (IPsec)

Define tunnel networks:
- Local: 192.168.10.0/24
- Remote: 10.16.0.0/16
- Configure encryption, hashing, and lifetime values

- Enable keepalive using the AWS instance private IP

Repeat both phases for the second tunnel (the VGW endpoint in the other Availability Zone)
Establish Tunnels

- Navigate to Status → IPsec
- Initiate connections for both tunnels
- Verify that tunnel status becomes established

STAGE 3: Routing and Security Configuration
Even with tunnels active, communication will fail without routing and firewall updates
AWS VPC Routing
- Enable route propagation on the route table

- The on-prem CIDR is automatically added

- AWS resources can now route traffic to on-prem
On-Prem Routing
- Add a route to the AWS CIDR 10.16.0.0/16

- Use the router’s network interface as the target
- On-prem systems can now reach AWS
AWS Security Groups
- Add inbound rule:
- Source: 192.168.8.0/21
- Type: All traffic

- This allows traffic from on-prem into AWS
On-Prem Security Groups
- Allow AWS CIDR in the on-prem private network SG
- Update router SG to permit VPN-related traffic

At this point, bidirectional communication is no longer blocked
STAGE 4: Validation
- Retrieve the Windows VM password using Fleet Manager

- Connect via Remote Desktop
Connectivity Test

- Run: ping <AWS web server private IP>
- Successful replies confirm network connectivity
Application Test
- Open Internet Explorer
- Navigate to http://<AWS web server private IP>
- The page with the instance ID and a cat image should load

This confirms the VPN is functioning correctly
STAGE 5: Cleanup
- Remove VPN connection, CGW, and VGW (detach VGW first)
- Delete the CloudFormation stack
- Cancel the pfSense subscription via AWS Marketplace

This prevents further charges after completing the lab
AWS Direct Connect (DX) 101
DX – Core Concepts
- Dedicated physical link into an AWS region
- Uses fiber-optic Ethernet (commonly 1, 10, or 100 Gbps)
- Connection path: On-premises network → DX location → AWS region
- DX location acts as an intermediary site, typically a large regional data center
- One DX connection equals one port on a DX router at the DX site
- AWS does not handle the physical connection for you
- AWS only provisions and authorizes a port; you are responsible for connecting to it (directly or through a provider)
- Billing model: hourly charge for the DX port plus outbound data transfer from AWS (inbound is free)
- Characteristics
- Provisioning duration includes:
- Allocating the port from AWS
- Establishing the cross-connect at the DX site
- Extending connectivity from your premises to the DX location
- This process can take weeks or longer due to physical infrastructure requirements
- No inherent redundancy
- A single cable introduces a single point of failure
- High availability requires multiple DX connections
- Performance advantages
- Predictable, low latency and high throughput
- Traffic does not traverse the public internet
- No native encryption, which avoids encryption overhead
- Virtual Interfaces (VIFs)
- Transit VIF → integrates with Transit Gateway
- Public VIF → enables access to AWS public services (e.g., S3, SQS)
- Private VIF → connects to private resources in VPCs (e.g., EC2, RDS)
- These terminate on Virtual Private Gateways (VGWs) associated with VPCs
- No direct internet access is provided; internet access requires additional components such as proxies or other network devices
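A minimal boto3 sketch of requesting a dedicated DX port; the location code and name are placeholders (real codes come from describe_locations), and note this only allocates the port — the cross-connect and last-mile work still happen offline:

```python
import boto3

dx = boto3.client("directconnect", region_name="us-east-1")

# List DX locations for this region, then request a dedicated port.
# Note the lowercase parameter names: the Direct Connect API predates
# the capitalized convention used by most AWS APIs.
print(dx.describe_locations()["locations"])

conn = dx.create_connection(
    location="EqDC2",          # placeholder location code
    bandwidth="1Gbps",         # dedicated ports: 1Gbps, 10Gbps, or 100Gbps
    connectionName="my-dx",    # placeholder name
)
print(conn["connectionState"])  # starts as 'requested'
```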
DX – Architecture

- The on-premises router must support Direct Connect
- DX location
- Typically a major metropolitan data center facility
- Not owned by AWS; it is a shared colocation environment
- AWS installs its DX routers within this facility
- Customer connectivity options
- Large organizations may deploy their own router at the DX site
- Smaller organizations often connect through a third-party provider
- Cross-connect setup
- After port allocation, a physical link is established between:
- AWS DX router
- Customer or partner router at the DX location
- This is referred to as a cross-connect
- The cross-connect is a single Layer 2 hop; a MACsec-capable cross-connect can encrypt that hop
- After port allocation, a physical link is established between:
- The customer or partner router then links back to the on-premises network
- AWS region connectivity
- AWS regions are connected to DX locations using multiple high-speed, redundant links
- AWS regions are fully owned and operated by AWS
- A DX location may or may not be in the same physical facility as the AWS region
DX Resilience
DX Resilience – Overview
- Direct Connect is a physical connectivity solution, so it is not inherently resilient
- Resilience must be deliberately designed into the architecture
- Increasing resilience improves availability (up to fault tolerance), but also increases cost and complexity
DX – No Resilience

- A single Direct Connect link provides minimal reliability
- Contains multiple single points of failure (SPOFs) such as:
- Routers
- Cross-connect
- Connection to the business network
- DX location
- On-premises site
- Contains multiple single points of failure (SPOFs) such as:
- The only highly available component is the connection between AWS regions and DX locations
- Although not resilient by default, Direct Connect can be enhanced:
- Adding a Site-to-Site VPN as a backup improves availability
- Additional DX connections can also be introduced
DX – Moderate Resilience

- Deploy multiple DX connections within the same DX location
- If one connection fails, another can continue operating
- AWS typically provisions these connections on separate routers
- Remaining risks:
- DX location itself
- On-premises location
- Possible hidden single point of failure
- The connection between the DX site and the business network
- A provider might route multiple links through the same physical cable
- Careful planning of physical cabling is necessary to avoid unintended SPOFs
DX – Improved Resilience

- Use separate DX locations and separate business sites
- Ideally placed in different geographic areas or buildings
- This design removes single points of failure in the connection path
- However, failure is still possible:
- For example, if one full location goes down and a component in the remaining path also fails
DX – Maximum Resilience

- Replicate routers and connections across locations
- Builds on the improved design by adding redundancy at each layer
- This setup significantly lowers the chance of downtime
- Even if one location fails, additional hardware failures in the remaining path are unlikely to disrupt connectivity
- Provides a near fault-tolerant architecture, at the expense of higher cost and complexity
DX and Site-to-Site VPN
IPsec vs MACsec – Comparison
| IPsec VPN | MACsec Cross-connect |
|---|---|
| Transport independent (can run over VGW/TGW via internet or Direct Connect) | Single-hop only (between AWS DX router and customer/partner router at DX site) |
| End-to-end encryption → stronger security but introduces cryptographic overhead (can reduce throughput) | No encryption at Layer 2 → not secure by itself, but enables extremely high throughput |
| Broad vendor compatibility | Hardware support is more limited and less commonly available |
| Software-based setup → quicker deployment | Requires physical infrastructure (cabling) → longer setup time |
- IPsec VPN and MACsec serve different purposes and are not direct alternatives
Site-to-Site VPN + Direct Connect Integration
Site-to-Site VPN over Direct Connect
- Combining VPN with Direct Connect delivers both security and performance
- VPN provides encryption and authentication
- Direct Connect provides consistent, high-speed, low-latency connectivity
- This setup uses a Public Virtual Interface (VIF)
- Site-to-Site VPN endpoints are public
- Private VIFs only support private IP communication
- Therefore, VPN tunnels rely on public addressing even when using Direct Connect
- Edge case:
- It is possible to use private IP VPN endpoints with a Transit Gateway and Direct Connect Gateway using a Transit VIF
- This allows VPN termination on private IPs and enables broader multi-region connectivity
- AWS blog on private IP VPNs
Site-to-Site VPN alongside Direct Connect
- Common use cases:
- Use VPN temporarily while Direct Connect is still being provisioned
- Use VPN as a failover option if Direct Connect becomes unavailable
- Use both approaches together for flexibility
Integration Example
- VPN over the public internet
- Connects to a VGW
- Lower performance compared to Direct Connect
- Useful during provisioning or as backup
- Direct Connect using a Public VIF
- Provides high-speed physical connectivity to the VGW
- VPN tunnels can run over this connection to add encryption
- The same VPN tunnel can operate over:
- Public internet
- Direct Connect (via Public VIF)
- This makes switching between paths straightforward
- A single Public VIF can also support VPN connections to VGWs in multiple AWS regions
- Enables encrypted communication across regions and between on-premises networks and AWS environments
AWS Transit Gateway (TGW)
TGW – Key Concepts
- Network transit hub used to interconnect multiple VPCs and link them to on-premises networks
- On-prem connectivity is supported through Site-to-Site VPN and Direct Connect
- A key advantage is the reduction of overall network complexity
- Functions as a single centralized network component (similar in concept to an IGW)
- Designed to be highly available, scalable, and resilient within a region
- Integrates with other network resources using attachments
- VPC attachments
- Site-to-Site VPN attachments
- Direct Connect Gateway attachments
- Core features
- Includes a default route table that controls traffic flow between attachments
- Multiple route tables can be used for more advanced routing designs
- Supports transitive routing
- Enables communication across multiple VPCs and on-prem networks without additional peering
- Removes the need for complex mesh networking
- Integrates with Direct Connect Gateways using transit VIFs
- Supports global networking through peering
- Works across regions and accounts
- Cross-region traffic uses the AWS global network, offering better performance than internet-based routing
- Can be shared across AWS accounts using AWS Resource Access Manager (RAM)
- RAM enables controlled sharing of AWS resources between accounts
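A minimal boto3 sketch of creating a TGW and attaching one VPC; IDs are placeholders, and the attachment specifies one subnet per AZ the TGW should use:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the hub; attachments associate with the default route table
# unless configured otherwise.
tgw = ec2.create_transit_gateway(Description="central hub")[
    "TransitGateway"]["TransitGatewayId"]

# Attach a VPC, listing one subnet per AZ for the TGW's interfaces.
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw,
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    SubnetIds=[
        "subnet-0aaa0aaa0aaa0aaa0",           # placeholder, AZ a
        "subnet-0bbb0bbb0bbb0bbb0",           # placeholder, AZ b
    ],
)
```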
TGW Network Complexity Example
Without Transit Gateway

- VPC peering is not transitive
- Each VPC must establish peering with every other VPC
- This approach does not scale efficiently
- Multiple VPN connections required
- Site-to-Site VPN is also non-transitive
- Each VPC requires its own VPN connection to on-prem
- High availability requires multiple customer gateways
- Results in a full mesh topology
- High operational overhead
- Becomes increasingly difficult to manage as the environment grows
With Transit Gateway

- Single Transit Gateway acts as a central hub
- VPN attachment
- TGW becomes the AWS-side endpoint for VPN connections (replacing VGWs)
- Fewer VPN tunnels are needed while maintaining high availability
- VPC attachments
- TGW functions as a highly available router between VPCs
- Each VPC must specify subnets in each Availability Zone for TGW use
- Similar concept to other VPC-integrated components
- Transitive routing enabled
- All attached VPCs can communicate with each other through TGW
- No need for individual peering connections
- All connected VPCs and on-prem networks can communicate through the centralized TGW, simplifying architecture and improving scalability
AWS Local Zones
Traditional AWS Infrastructure (Regions and AZs)

- The standard AWS global infrastructure is highly scalable
- Designed to grow alongside application and workload demands
- Provides high performance and resilient connectivity between Availability Zones and regions
- Each region typically contains multiple Availability Zones (at least three)
- This enables strong fault tolerance and high availability
- Distance still impacts performance
- If an on-premises network is located far from the nearest AWS region, latency increases and performance may degrade, even over high-speed connections like Direct Connect
AWS Local Zones (Edge Infrastructure)
- AWS Local Zones extend a parent region closer to end users
- Function similarly to an Availability Zone located near the customer
- Designed to reduce latency and improve performance
- Key characteristics
- Operates as a single zone, so it does not provide built-in redundancy
- Usually deployed within a single physical facility
- Connectivity
- Has its own internet access
- Supports Direct Connect for high-performance networking
- Service behavior
- You can create VPC subnets inside a Local Zone
- EC2 and similar services can run there with very low latency to nearby users
- Many features still rely on the parent region
- Example: EBS snapshots are stored in S3 within the parent region
- This provides regional durability and resilience
- Service support is not universal
- Some AWS services are unavailable or limited in Local Zones
- Many features require explicit opt-in
- For supported services and limitations, refer to the AWS Local Zones features page
- Usage guidance
- Best suited for workloads requiring ultra-low latency and high performance near end users
- Always verify service compatibility before deploying
Local Zone Identification

- Each Local Zone is identified using:
- Parent region code + specific zone identifier
- Examples:
- us-west-2-las-1 → Local Zone in Las Vegas
- us-west-2-lax-1a, us-west-2-lax-1b → separate Local Zones in Los Angeles
- Naming typically follows international city codes for easier identification
AWS Storage Gateway (Volume, VTL, and File)
Storage Gateway – Overview
- Provides a smooth integration between on-premises systems and AWS cloud storage
- Works with S3, Glacier, and EBS snapshots
- Common scenarios: migrating data to AWS, extending existing storage into the cloud, and backup solutions
- Typically deployed as a virtual machine in on-prem environments, though a physical appliance option is also available
- Types:
- Volume Gateway → behaves like a block storage volume
- Tape Gateway (VTL) → emulates a traditional backup tape system
- S3 File Gateway → functions as a file-based storage system
- Exam focus is usually on selecting the appropriate Storage Gateway type for a given use case
Volume Gateway
- On-prem storage may exist as NAS or SAN solutions
- Network-Attached Storage (NAS) delivers storage over a network from a dedicated server
- Storage Area Network (SAN) consists of interconnected storage systems forming a high-performance network
- Internet Small Computer Systems Interface (iSCSI) is used to connect to these storage systems
- Enables block-level storage access over TCP/IP networks
- Volume Gateway exposes block storage volumes to on-prem systems
- Data is synchronized with AWS (stored in S3)
- Supports creation of EBS snapshots
- Uses iSCSI for communication
- Two modes: stored mode and cached mode
- Storage is managed by AWS
- Although data resides in S3, it is not directly viewable through the S3 console
- Data is stored as raw blocks, not standard S3 objects
Volume Stored

- Stores the primary data locally on-premises
- Data is asynchronously replicated to AWS using an upload buffer
- Upload Buffer:
- Temporarily holds changes before sending them to AWS
- Transfers updates asynchronously via public endpoints
- Data is stored in S3 as EBS snapshots
- Can use internet or Direct Connect
- Suitable for full disk backups
- Produces asynchronous snapshots
- Provides strong recovery objectives (RPO/RTO)
- Useful for disaster recovery in both on-prem and AWS environments
- Offers low-latency access since data is stored locally
- Does not increase total storage capacity of the data center
- Limits:
- Up to 32 volumes per gateway
- 16 TB per volume
- 512 TB per gateway
Volume Cached

- Uses S3 as the main storage location, not local disks
- Local storage acts only as a cache
- Upload Buffer is more heavily utilized compared to stored mode
- More frequent uploads to S3 as it holds primary data
- EBS snapshots can be created from S3 data
- These may not exactly reflect the local cache state
- Local Cache:
- Stores only frequently accessed data
- Provides fast access for commonly used data
- Enables extension of on-prem storage capacity using AWS
- Data not cached locally may have higher retrieval latency
- Limits:
- Up to 32 volumes per gateway
- 32 TB per volume
- 1 PB per gateway
Tape Gateway (VTL)
Enterprise Tape Backups

- Common backup strategies in large environments:
- Tape-based backups
- Disk-based backups
- Offsite backups over a network
- Linear Tape Open (LTO) is a widely used tape format
- Tape systems operate sequentially, not randomly like disks
- Updating data is inefficient, as tapes are designed for full reads/writes
- Tape components:
- Tape drives: read/write operations
- Tape loaders: automate tape handling
- Slots: store inactive tapes
- Tape Library includes drives, loaders, and slots
- Used for active backup and restore processes
- Connected to backup servers via iSCSI
- Tape Shelf refers to storage of tapes outside the active system
- Traditional tape systems involve high costs:
- Hardware, maintenance, licensing, and staffing
- Offsite storage and transport add further expense
Virtual Tape Library (VTL) with Tape Gateway

- Connects to on-prem backup servers using iSCSI
- Emulates tape drives and media changers
- Appears identical to physical tape systems, requiring minimal changes
- Virtual Tape Library (VTL) is backed by S3
- Virtual Tape Shelf (VTS) uses Glacier storage
- Includes upload buffer and local cache
- Local backups occur at LAN speed, then sync to AWS
- Virtual tape sizes range from 100 GB to 5 TB
- Inactive tapes are archived to Glacier tiers
- Flexible Retrieval for occasional access
- Deep Archive for long-term retention
- Limits:
- Up to 1 PB across 1500 tapes in VTL
- Unlimited storage in VTS
- Benefits:
- Retains existing backup workflows while reducing costs
- Adds scalable cloud-based backup capacity
- Supports migration of legacy tape data into AWS
S3 File Gateway

- Provides a link between on-prem file systems and S3 storage
- File shares are mounted using NFS (Linux) or SMB (Windows)
- Each file share maps directly to an S3 bucket
- Files written locally are stored as objects in S3
- Fully visible and manageable in the S3 console
- Primary storage resides in S3
- Local caching improves read/write performance
- Bucket Share:
- Connects a local file share to an S3 bucket
- File paths correspond directly to S3 object keys
- Supports up to 10 bucket shares per gateway
- Benefits:
- Extends file storage into low-cost S3
- Enables use of S3 features like lifecycle policies, replication, and event-driven services
- Supports hybrid and distributed architectures
S3 File Gateway – Multi-Site Architecture

- Multiple on-prem locations can connect to the same S3 bucket
- Each site uses its own File Gateway
- All sites access shared data
- Updates are uploaded to S3 automatically
- Other gateways only see updates after refreshing their view
- Notifications can be triggered using NotifyWhenUploaded
- Does not support file locking
- No built-in concurrency control
- Workarounds include read-only shares or custom access control
S3 File Gateway – Replication Architecture

- Supports S3 cross-region replication
- Enables straightforward disaster recovery across regions
S3 File Gateway – Lifecycle Architecture

- S3 lifecycle rules can transition data to lower-cost storage tiers
- Helps optimize long-term storage costs automatically
AWS Snowball
AWS Snow Family
- While online data transfers to and from AWS (over the internet or Direct Connect) are usually preferred, very large datasets may require offline methods.
- Example: transferring 200 TB over a 100 Mbps connection would take roughly 185 days, assuming no interruptions—clearly impractical (see the quick computation after this list).
- AWS Snowball devices are physical appliances designed to handle data locally.
- Primary uses:
- Offline migration of large datasets (>100TB) into or out of AWS, typically for data import into AWS.
- Edge computing, processing data in locations with limited or unreliable connectivity, such as vehicles, ships, or remote sites.
- Primary uses:
- Many older Snow products have been retired.
- Currently, only a limited set of Snowball Edge devices are actively supported.
- AWS now recommends DataSync or AWS Outposts for most use cases.
- Cantrill lectures may be outdated; always verify AWS documentation for current offerings.
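The 185-day figure above is easy to sanity-check:

```python
# Back-of-envelope: how long does 200 TB take over a 100 Mbps link?
data_bits = 200e12 * 8       # 200 TB (decimal) in bits
link_bps = 100e6             # 100 Mbps
days = data_bits / link_bps / 86_400
print(f"{days:.0f} days")    # -> ~185 days, assuming 100% utilization
```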
Snowball Edge
- Snowball Edge devices combine storage and compute capabilities.
- Jobs must be requested from AWS, then the device is shipped to your location—not an instant solution.
- Cost-effective for datasets ranging from 10TB to 10PB; multiple devices can be deployed in parallel for higher throughput.
- Connection interfaces:
- 2× 10Gbps RJ45 (only one usable at a time)
- 1× 25Gbps SFP28
- 1× 100Gbps QSFP28
- Data on the device is encrypted using AWS KMS.
- Compute functionality: supports running EC2 instances or Lambda functions on the device.
- Supported configurations:
- Storage Optimized: 210TB usable storage
- Compute Optimized: up to 104 vCPUs, 416GB RAM, and 28TB NVMe SSD dedicated for compute
- GPU-based compute is no longer offered.
- Devices can be shipped to multiple locations for distributed edge processing.
- Data import/export happens through S3 only.
- To move data to Glacier, load it onto the Snowball Edge, ship it to AWS, and then transition it from S3 to Glacier.
Discontinued AWS Snow Products
- Included here for historical context; unlikely to appear in exams.
Original Snowball
- Offered storage only, no compute.
- Primarily used for large-scale data migrations.
Snowcone
- Smaller, lighter devices for edge processing on a smaller scale.
- Integrated with technologies like AWS IoT Greengrass for lightweight compute tasks.
Snowmobile
- Essentially a data center inside a truck.
- Used for single-site migrations of enormous datasets (>10PB).
- Not suitable for distributed sites and likely retired because most large organizations have completed massive migrations.

AWS Directory Service
What is a Directory (Service)?
- A directory service (also called a name service) acts as a centralized system that stores identities and resources across a network
- It links resource names (like users or servers) to their corresponding network locations
- Common objects include users, groups, computers, servers, and shared resources
- Organized in a hierarchical, tree-like structure (domain)
- Multiple domains can be combined into a forest structure
- Early example: DNS as a basic form of directory service
- Enables centralized authentication and management
- Users can log in with the same credentials across multiple devices once those devices are joined to the directory
- Microsoft Active Directory Domain Services (AD DS) is a widely used directory service
- Common in enterprise Windows environments
- SAMBA provides an open-source alternative with partial compatibility
AWS Directory Service – Key Concepts
- A managed directory solution provided by AWS, removing the need to maintain your own directory infrastructure
- Operates as a private service inside a VPC
- Resources must either be in the same VPC or connected to it
- High availability is achieved by deploying across multiple Availability Zones
- Example: logging into Windows EC2 instances using directory credentials
- Integrates with multiple AWS services (e.g., Chime, Connect, QuickSight, RDS, Management Console)
- Some services (like Amazon WorkSpaces) require a directory service
- WorkSpaces provides virtual desktops similar to Citrix
- Supports multiple deployment modes depending on architecture and requirements
AWS Directory Service – Modes
Simple AD mode

- A basic, standalone directory service running in AWS
- Built on SAMBA 4 (open-source)
- Simplest and most cost-effective option
- Supports creation of users and objects for use with EC2 and WorkSpaces
- Capacity: up to 500 users (small) or 5000 users (large)
- Automatically deployed in a highly available configuration across subnets
- Limitations:
- Cannot integrate with on-premises directories
- Does not provide full Microsoft AD functionality
- Best suited for simple, isolated use cases
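A minimal boto3 sketch of creating a Simple AD directory; the domain name, password, and VPC details are placeholders:

```python
import boto3

ds = boto3.client("ds", region_name="us-east-1")

# Simple AD deploys across two subnets in different AZs for HA.
resp = ds.create_directory(
    Name="corp.example.com",     # placeholder FQDN
    ShortName="corp",            # placeholder NetBIOS name
    Password="Sup3rS3cret!",     # admin password; store securely in practice
    Size="Small",                # 'Small' (<=500 users) or 'Large' (<=5000)
    VpcSettings={
        "VpcId": "vpc-0123456789abcdef0",
        "SubnetIds": ["subnet-0aaa0aaa0aaa0aaa0", "subnet-0bbb0bbb0bbb0bbb0"],
    },
)
print(resp["DirectoryId"])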
AWS-Managed Microsoft AD

- A fully managed Microsoft Active Directory (AD DS) running in AWS
- Operates in Windows Server 2012 R2 mode
- Includes all capabilities of Simple AD plus additional enterprise features
- Key advantages:
- Supports applications requiring full AD features (e.g., schema extensions, enterprise apps)
- Can establish trust relationships with on-premises AD
- Connectivity via VPN or Direct Connect
- Enables synchronization between AWS and on-prem environments
- More resilient: directory remains available in AWS even if connectivity to on-prem fails
- Best suited for:
- Hybrid environments
- Applications requiring full Microsoft AD capabilities
AD Connector

- Acts as a proxy to an existing on-premises directory
- Forwards authentication and directory requests from AWS to on-prem systems
- Does not deploy any directory service in AWS
- Requires private connectivity (VPN or Direct Connect)
- Advantages:
- Simpler setup with no directory infrastructure in AWS
- Keeps all directory data on-premises
- Limitations:
- No built-in resilience in AWS
- If connectivity to on-prem fails, authentication also fails
- Best suited for:
- Extending on-prem directory usage into AWS without duplication
Summary of Directory Service Modes
| Mode | Directory in AWS | Sync with On-Prem | Primary Source | MS AD Features |
|---|---|---|---|---|
| Simple AD | Yes | No | AWS | No |
| AWS-Managed Microsoft AD | Yes | Yes | AWS | Yes |
| AD Connector | No | Yes | On-prem | Yes (if on-prem AD supports it) |
AWS DataSync
AWS DataSync – Key Concepts
- AWS DataSync is a service for transferring data to and/or from AWS at large scale
- Common use cases include migrations, moving data for processing, archiving to lower-cost storage, and disaster recovery/business continuity
- Compared to older approaches (manual transfers or Snowball), DataSync provides an end-to-end managed solution
- Data movement characteristics:
- Bidirectional (supports uploads and downloads)
- Online transfer over the network (unlike physical/offline methods such as Snowball)
Components
- Task
- Represents a configured data transfer job
- Defines source and destination, transfer settings, scheduling, and performance limits
- Location
- Each task includes a source and destination location
- On-premises: NAS or SAN storage
- AWS: services such as S3, EFS, and FSx
- Agent
- A software component deployed on-premises to interact with local storage
- Supports:
- NFS for Linux-based systems
- SMB for Windows-based systems
AWS DataSync – Key Features
- Highly scalable
- Up to 10 Gbps per agent (around 100 TB/day)
- Multiple agents can be used for higher throughput
- Supports up to 50 million files per task
- Metadata preservation
- Retains file attributes such as permissions and timestamps, useful for complex migrations
- Data validation
- Verifies integrity and structure of transferred data
- Bandwidth control
- Allows throttling to prevent network congestion
- Incremental transfers and scheduling
- Only changed data is transferred after the initial sync
- Tasks can be scheduled to run at specific times
- Security and efficiency
- Built-in compression and encryption during transfer
- Reliability
- Automatically handles errors and retries failed transfers
- Service integration
- Works with AWS storage services (S3, EFS, FSx)
- Some scenarios support direct service-to-service transfers (e.g., EFS to EFS, including cross-region)
- Pricing model
- Pay-as-you-go based on the amount of data transferred (per GB)
AWS DataSync – Architecture

- A DataSync agent is deployed on-premises
- Can run in virtual environments such as VMware
- Connects to local storage (NAS/SAN) using NFS or SMB
- Communicates securely with the DataSync service endpoint
- Transfers can be scheduled
- Allows operations during off-peak hours to minimize disruption
- Bandwidth limiting can be applied
- Ensures other network activities are not impacted during transfers
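A minimal boto3 sketch of this architecture: an NFS source wired to an S3 destination, then a task started. The agent ARN, NAS hostname, bucket, and IAM role are placeholders, and it assumes the agent is already activated:

```python
import boto3

dsync = boto3.client("datasync", region_name="us-east-1")

# Source: on-prem NFS share, reached through an already-activated agent.
src = dsync.create_location_nfs(
    ServerHostname="nas.example.internal",    # placeholder NAS hostname
    Subdirectory="/export/data",
    OnPremConfig={"AgentArns": [
        "arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"
    ]},
)["LocationArn"]

# Destination: S3 bucket, accessed via an IAM role DataSync can assume.
dst = dsync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-migration-bucket",   # placeholder bucket
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3"},
)["LocationArn"]

# Task: ties source to destination; schedules and bandwidth limits are
# further options on create_task / start_task_execution.
task = dsync.create_task(SourceLocationArn=src, DestinationLocationArn=dst)["TaskArn"]
dsync.start_task_execution(TaskArn=task)
```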
Amazon FSx 101
Amazon FSx – Key Concepts
- Amazon FSx is a fully managed file storage service (File Server-as-a-Service)
- Lets you deploy high-performance, shared file systems in AWS without managing the underlying servers
- Comparable to RDS for databases: RDS abstracts DB servers, FSx abstracts file servers
- Private service within a VPC
- Can be deployed as single-AZ or multi-AZ for resilience and high availability
- File system network interfaces (ENIs) reside in your subnets
- Built-in redundancy ensures data durability even in single-AZ deployments
- Accessible via VPC peering, VPN, or Direct Connect
- Storage options:
- SSD: optimized for low-latency, high-IOPS workloads, and small/random file operations
- HDD: optimized for high-throughput, large/sequential file operations, and cost-effective storage
- Supported file system types:
- Windows File Server – traditional Windows-compatible file shares
- Lustre – high-performance file system for compute-intensive workloads
- NetApp ONTAP – enterprise-grade storage with advanced data management
- OpenZFS – modern, open-source file system with snapshots and compression
FSx for Windows File Server
FSx for Windows File Server – Key Concepts
- FSx for Windows File Server is a fully managed, native Windows file system in AWS
- Unit of consumption: file shares (file servers themselves are abstracted and managed by AWS)
- File shares contain folders and files like a traditional Windows server
- Reduces administrative overhead compared to running Windows file servers on EC2
- Access protocols:
- SMB (Server Message Block) – standard Windows network file sharing protocol
- NTFS (New Technology File System) – native Windows file system format
- In contrast, Amazon EFS is used for Linux workloads and accessed via NFS
- Key Windows-specific features:
- Active Directory integration for authentication
- Can use AWS-managed AD (via Directory Service) or on-premises AD
- Windows permission model applies to files and folders
- Supports Distributed File System (DFS) for scaling across multiple file shares
- Uses Volume Shadow-copy Service (VSS)
- Provides file-level versioning and end-user self-service restores
- Users can restore previous versions without admin intervention
- Additional features:
- On-demand and scheduled backups
- File de-duplication to save storage space
- Encryption
- At-rest via AWS KMS
- In-transit enforced optionally
- High performance
- Throughput: 8 MB/s – 2 GB/s
- IOPS: 100,000 – 1,000,000
- Latency: <1 ms
- Can be mounted on Linux EC2 instances despite being Windows-optimized
FSx for Windows File Server – Example Architecture

- AWS WorkSpaces instances access FSx for shared file storage
- User directory can be hosted on AWS-managed Windows AD or on-prem AD
- File shares accessed via SMB using standard Windows path notation:
- Example: \\<domain-name>\<file-share>\…
- File shares are accessible to WorkSpaces users and on-premises users alike
FSx for Lustre
Lustre File System
- Designed for High-Performance Computing (HPC)
- Supports large-scale parallel data processing
- Throughput: hundreds of GB/s; Latency: sub-millisecond
- Common workloads: machine learning, big data analytics, financial modeling
- Intended for Linux clients using POSIX permissions
- Lustre can be thought of as a “Linux cluster file system”
- Data resides within the file system itself
- Lustre divides stored data across multiple storage volumes
- Metadata Targets (MDTs) → store metadata (file names, timestamps, permissions)
- Object Storage Targets (OSTs) → store actual data blocks
- Each OST typically 1.17 TiB
- Distributing data across OSTs enables high throughput
- Files can be linked to an external repository
- With FSx for Lustre, this repository is typically an S3 bucket
- The Lustre file system is independent from the repository/S3 bucket
- Repository files are visible but not automatically loaded into FS
- Files are fetched from the repository only when accessed
- Changes in FS are not automatically reflected in the repository
- Must explicitly export modified data back to the repository using the hsm_archive command
Diagram: Lustre FS ≠ Lustre Repository (S3 Bucket)

FSx for Lustre – Key Concepts
- Managed Lustre file system service in AWS
- Currently supports single-AZ deployment only to maximize performance
- Optional integration with an S3 repository
- Can mount S3 as a Lustre file system (via FSx)
- Computation outputs can be written back to S3
- Supports manual or automatic backups (0–35 day retention) to S3
- Two primary deployment types:
- Scratch
- High-performance, short-term storage
- No replication or high availability
- Optimized for temporary workloads; lower cost due to lack of replication
- Persistent
- Durable storage with replication within the same AZ
- Includes self-healing features for hardware failures (replaces failed files quickly)
- If the AZ fails entirely, all data is lost
- Suited for long-term workloads
- Performance considerations:
- Throughput is based on FS size
- Minimum FS size: 1.2 TiB; increases in 2.4 TiB increments
- Scratch: 200 MB/s per TiB base performance
- Persistent: base levels of 50, 100, or 200 MB/s per TiB
- Burst performance can reach 1300 MB/s per TiB (credit-based system similar to EBS)
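A quick sizing computation using the numbers from the list above (the 6.0 TiB file system size is an arbitrary example):

```python
# Baseline throughput scales linearly with file system size.
min_size_tib = 1.2
increment_tib = 2.4
size_tib = min_size_tib + 2 * increment_tib   # 6.0 TiB example FS

scratch = size_tib * 200      # Scratch: 200 MB/s per TiB baseline
persistent = size_tib * 100   # Persistent at the 100 MB/s per TiB tier
burst = size_tib * 1300       # burst ceiling (credit-based)

print(scratch, persistent, burst)  # -> 1200.0 600.0 7800.0 (MB/s)
```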
FSx for Lustre – Example Architecture

- Typical Lustre clients are Linux EC2 instances running Lustre software
- FSx provides AWS-managed Lustre file servers (hidden from customers)
- Each file server combines compute (with in-memory cache) and storage volumes
- Caching improves access speed for frequently used data
- Adding more storage increases the number of file servers, which increases throughput and IOPS, but also increases risk for Scratch deployments
- All file server traffic passes through a single ENI in your VPC (single-AZ limitation)
- Writes go directly to storage volumes (disk throughput)
- Reads first check the cache; if data is absent, it reads from disks
- Example configuration may include multiple disks per file server in a Persistent deployment type
FSx for NetApp ONTAP
FSx for NetApp ONTAP – Key Concepts
- Fully managed NetApp ONTAP service on AWS
- Supports multiple protocols: NFS, SMB, iSCSI
- Compatible with a variety of operating systems and services (see diagram)
- Key features:
- Snapshots, replication, storage efficiency with compression
- Instant point-in-time clones
- Useful for creating test environments or new workloads quickly
- Automatic storage scaling
- Storage capacity expands or contracts based on usage
- Data deduplication
- Reduces redundant data to save space
FSx for OpenZFS
FSx for OpenZFS – Key Concepts
- Fully managed OpenZFS file system on AWS
- Supports NFS protocol only (versions 3, 4, 4.1, 4.2)
- Compatible with multiple operating systems and services (see diagram)
- Key features:
- Snapshots, storage-efficient compression
- Instant point-in-time cloning
- Useful for testing or deploying new workloads quickly
- Extremely high performance
- Can reach up to 1 million IOPS with latency under 0.5 ms
AWS Transfer Family
AWS Transfer Family – Key Concepts
- Fully managed file transfer service for moving data to and from AWS storage
- Supports S3 and EFS as storage targets
- Deploys managed servers that can handle multiple protocols not natively supported by AWS
- Use case: connect existing applications and workflows to S3/EFS without modifying them
- Can also create new workflows using Managed File Transfer Workflows (MFTW): MFTW link
- Supported protocols:
- FTP – traditional file transfer, unencrypted, legacy protocol
- FTPS – FTP secured with TLS
- SFTP – FTP over SSH
- AS2 – for secure B2B file exchanges; used in industries with strict compliance requirements
- Identity Providers (IDPs):
- Service-managed (built-in identities) – only usable with SFTP and AS2
- AWS Directory Service
- Custom IDPs via Lambda or API Gateway
- Managed File Transfer Workflows (MFTW):
- Serverless workflow engine for automation, e.g., trigger notifications, tagging, or processing when files are uploaded
- High availability and scalability:
- Supports multi-AZ deployments
- Pricing:
- Pay hourly for provisioned servers
- Pay per GB of data transferred
- No upfront costs; billed only while the service is in use
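A minimal boto3 sketch of creating a public SFTP server with service-managed identities; the user name, role ARN, and home directory are placeholders:

```python
import boto3

tf = boto3.client("transfer", region_name="us-east-1")

# Public endpoint supports SFTP only; identities are service-managed.
server = tf.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
    EndpointType="PUBLIC",
)["ServerId"]

# A user mapped to an S3 home directory via an IAM role.
tf.create_user(
    ServerId=server,
    UserName="analyst1",                                      # placeholder
    Role="arn:aws:iam::111122223333:role/transfer-s3-access",  # placeholder
    HomeDirectory="/my-transfer-bucket/analyst1",              # placeholder
)
```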
Transfer Family – Architecture

- Transfer Family servers act as front-end gateways to AWS storage
- Can support one or more protocols per server
- Use IAM roles for accessing S3/EFS
- Authenticate external users via supported IDPs
- External clients connect using DNS names and non-native protocols; DNS resolves to Transfer Family infrastructure
Transfer Family – Endpoint Types

Public Endpoint
- Runs in AWS public zone and is accessible over the internet
- Minimal setup required (no VPC or private networking needed)
- Supports SFTP only
- Uses dynamic IPs managed by AWS, so applications must use DNS
- Cannot restrict access via IP allowlists in firewalls (NACLs or security groups)
VPC Endpoint
- Runs inside a VPC for more control and security
- Supports SFTP, FTPS, and AS2
- Can enforce access restrictions using NACLs and security groups
- Offers static IPs, so DNS is optional
- Requires more setup than Public Endpoint
- Can be configured as either:
- Internet-accessible VPC – static public IP (Elastic IP) and private IP
- Internal-only VPC – private network access only; the only configuration that supports FTP, which should not be used over the public internet