Your nightly job replicates three terabytes from the data center to S3. For two years a site-to-site VPN carried it without complaint. Then the business grew, the backup window tightened, and one quarter-end the job that used to finish in four hours took fourteen — the tunnel saturated, latency wandered between 40 ms and 300 ms, and packets retried over a public-internet path nobody at your company controls. Meanwhile the data-transfer line on the AWS bill kept climbing, because every byte left over the internet at internet egress rates.
The VPN wasn't misconfigured. It was doing the only thing a tunnel over the public internet can do: best-effort, shared, and variable. AWS Direct Connect is the other option — a private, dedicated physical line into AWS, with consistent latency and a cheaper rate for moving data. This is the long version: what it is, how a connection physically comes to exist, how traffic actually flows over it, how to make it survive a failure, and when its cost and lead time are worth it.
The problem Direct Connect was built to solve
Before Direct Connect, connecting a data center to AWS meant a Site-to-Site VPN: an IPsec tunnel from your edge router to a virtual private gateway, riding the public internet the whole way. That works, and for modest traffic it's still the right answer. But the public internet gives you three things you can't design around: variable latency (your packets take whatever path BGP on the open internet chooses, hop to hop), jitter (that path changes, so latency wanders), and throughput that tops out at what a single IPsec tunnel and the congested middle can carry. On top of that, every gigabyte you pull out of AWS over the internet is billed at the internet data-transfer-out rate — the most expensive egress AWS sells.
For a chatty management connection none of that matters. For three terabytes a night, a latency-sensitive database replica, or a trading feed, all of it matters. Direct Connect removes the public internet from the path entirely.
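To put a number on "three terabytes a night", here is a minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a perfectly steady transfer with no protocol overhead. The function name is illustrative, not part of any AWS tooling.

```python
# Sustained throughput needed to move a nightly payload inside a backup window.
# Assumptions: decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s),
# a perfectly even transfer rate, and no protocol or retransmission overhead.

def required_gbps(terabytes: float, window_hours: float) -> float:
    """Average rate in Gbps needed to move `terabytes` within `window_hours`."""
    bits = terabytes * 10**12 * 8
    seconds = window_hours * 3600
    return bits / seconds / 10**9


if __name__ == "__main__":
    for hours in (4, 14):
        rate = required_gbps(3, hours)
        print(f"3 TB in {hours} h needs about {rate:.2f} Gbps sustained")
```

Three terabytes in a four-hour window works out to roughly 1.67 Gbps sustained, and in fourteen hours to about 0.48 Gbps. For comparison, AWS's VPN documentation quotes a ceiling of about 1.25 Gbps per Site-to-Site VPN tunnel at the time of writing (verify the current figure), which is why the four-hour window stops being reachable over a single tunnel.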
What Direct Connect is — and what it is not
A Direct Connect is a physical cross-connect: a fibre cable, inside a data center AWS calls a Direct Connect location, running from your router (or a partner's) to an AWS router in that same building. You light the fibre, bring up BGP across it, and now there is a private Layer-2/Layer-3 path between your network and AWS that never touches the public internet.
Being precise about what it is not saves you from three common misconceptions:
- It is not a VPN. There is no tunnel and no encryption by default — it's a dedicated link, not an encrypted one.
- It is not the internet. There is no shared path and no route flapping; latency is consistent because the path is fixed.
- It is not, by itself, "private and secure." It is private in the sense that no one else is on your fibre. It is not confidential unless you add encryption on top.
How a connection physically comes to exist
Direct Connect is one of the few AWS services with a hard dependency on the physical world, so it's worth tracing how a port actually appears.
Two port models decide your lead time and your floor:
| Model | Speeds | How you get it | Fits when |
|---|---|---|---|
| Dedicated | 1, 10, 100, 400 Gbps | Order the port from AWS; AWS issues an LOA-CFA; you/your colo run the cross-connect | You have gear in a DX location and weeks to provision |
| Hosted | 50 Mbps – 25 Gbps | An AWS Direct Connect Partner already in the location carves you a slice of their port | You have no presence there and want it live in days |
The provisioning ceremony for a dedicated connection runs like this: you request the connection in the console (choosing the location and speed); AWS issues a Letter of Authorization and Connecting Facility Assignment (LOA-CFA) — the document that authorizes a cross-connect to a specific AWS port; you hand that LOA-CFA to the colocation provider that runs the building; and they physically patch a fibre from your equipment to AWS's. Then you create virtual interfaces and bring up BGP. The LOA-CFA is the hinge of the whole process, and not knowing what it is marks someone who has never actually ordered a circuit.
Virtual interfaces: one wire, three doors
The physical port is just glass and light. What you route over it are logical virtual interfaces (VIFs), each an 802.1Q VLAN with its own BGP session, and there are three kinds.
A private VIF carries traffic to private IP addresses in a VPC, attached either to that VPC's virtual private gateway or to a Direct Connect gateway. A public VIF reaches AWS's public service endpoints (S3, DynamoDB, public APIs) in any Region, over the private line instead of the internet; you advertise your public prefixes and receive AWS's. A transit VIF lands on a Direct Connect gateway and, through it, an AWS Transit Gateway, which is the path to many VPCs across many Regions. A single dedicated connection carries dozens of virtual interfaces (the published quota is 50 public or private VIFs plus a small, separately counted allowance of transit VIFs; check the current quota page), so one port can serve all three patterns at once. A hosted connection, by contrast, carries a single VIF.
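The mapping from "what am I trying to reach" to "which VIF type" can be sketched as a small lookup. The destination labels below are hypothetical names invented for this example, and the VLAN check only encodes the standard 802.1Q ID range (1 to 4094) plus the rule that each VIF on a connection needs its own VLAN.

```python
from enum import Enum
from typing import Iterable, List


class VifType(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    TRANSIT = "transit"


# Hypothetical destination labels, used only for illustration.
_VIF_FOR_TARGET = {
    "vpc_via_virtual_private_gateway": VifType.PRIVATE,
    "vpc_via_direct_connect_gateway": VifType.PRIVATE,
    "aws_public_endpoints": VifType.PUBLIC,   # S3, DynamoDB, public APIs
    "transit_gateway": VifType.TRANSIT,       # many VPCs across many Regions
}


def vif_for(target: str) -> VifType:
    """Return the VIF type that reaches the given kind of destination."""
    try:
        return _VIF_FOR_TARGET[target]
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None


def check_vlans(vlans: Iterable[int]) -> List[str]:
    """Flag VLAN IDs outside 1-4094 or reused by more than one VIF."""
    problems = []
    seen = set()
    for vlan in vlans:
        if not 1 <= vlan <= 4094:
            problems.append(f"VLAN {vlan} is outside the 802.1Q range 1-4094")
        if vlan in seen:
            problems.append(f"VLAN {vlan} is assigned to more than one VIF")
        seen.add(vlan)
    return problems
```

For example, `vif_for("aws_public_endpoints")` returns `VifType.PUBLIC`, and `check_vlans([100, 100])` reports the duplicate.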
BGP: how routing actually runs over the wire
Every VIF runs a BGP session between your router and AWS's. You bring your own ASN — a public ASN you own, or a private one in the 64512–65534 range — and AWS uses its own on its side. You advertise the on-prem prefixes you want AWS to reach; AWS advertises the VPC (or public) prefixes back. BGP is also how failover works: when you run two connections, the routes learned over both let traffic shift automatically if one drops.
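A quick sanity check on the ASN you plan to use can be sketched as follows. The 16-bit private range is the one quoted above; the 32-bit private range (4200000000 to 4294967294) is not mentioned in this article and comes from RFC 6996, so treat its use on Direct Connect as something to confirm for your setup.

```python
# Private-use ASN ranges. Python ranges exclude the stop value, hence the +1.
PRIVATE_ASN_16BIT = range(64512, 65534 + 1)
PRIVATE_ASN_32BIT = range(4200000000, 4294967294 + 1)  # RFC 6996, an addition here


def is_private_asn(asn: int) -> bool:
    """True if `asn` falls in a private-use range, False for public or reserved ASNs."""
    return asn in PRIVATE_ASN_16BIT or asn in PRIVATE_ASN_32BIT


if __name__ == "__main__":
    for candidate in (65000, 65535, 13335):
        kind = "private" if is_private_asn(candidate) else "not private"
        print(candidate, kind)
```

A private ASN is fine for a private or transit VIF; if the check returns `False`, make sure the number is a public ASN you actually own before configuring it.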
Deep dive: BGP tuning that separates a working link from a resilient one
Three knobs matter in production. MD5 authentication on the BGP session is optional but standard practice: it stops a misconfigured neighbor from forming a session. BFD (Bidirectional Forwarding Detection) is the one that earns its keep: default BGP hold timers take ~90 seconds to notice a dead peer, but BFD detects a failed path in well under a second, so with two connections your failover is sub-second instead of a minute-and-a-half outage. AWS enables BFD on its side of each VIF, but it takes effect only once you configure it on your router, so do that for every session. Finally, route preference: AWS evaluates longest-prefix-match first, then its own routing policy. To make one path primary and another standby, advertise more-specific prefixes on the path you prefer, or prepend your AS path on the path you want held in reserve. Active/active needs equal advertisements; active/passive needs an intentional tiebreaker.
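The two ideas in this section, detection time and route preference, can be sketched in a few lines. This is a deliberately simplified model, not AWS's actual routing policy: it ranks routes by longest prefix and then by shortest AS path only, and ignores local preference and everything else a real BGP implementation considers. The `Route` type and connection labels are made up for the example, and the BFD interval and multiplier (300 ms, 3) are assumed values to check against your router and the AWS documentation.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """BFD declares the path down after `multiplier` consecutive missed intervals."""
    return interval_ms * multiplier


@dataclass(frozen=True)
class Route:
    prefix: str        # advertised prefix, e.g. "10.0.0.0/16"
    as_path_len: int   # AS-path length after any prepending
    via: str           # label for the connection the route was learned over


def best_route(dest_ip: str, routes: Sequence[Route]) -> Optional[Route]:
    """Pick the longest matching prefix, breaking ties on the shortest AS path."""
    addr = ipaddress.ip_address(dest_ip)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.as_path_len),
    )


if __name__ == "__main__":
    # Active/passive via prepending: the standby advertises a longer AS path.
    routes = [
        Route("10.0.0.0/16", as_path_len=1, via="dx-primary"),
        Route("10.0.0.0/16", as_path_len=4, via="dx-standby"),
    ]
    print(best_route("10.0.1.5", routes).via)       # primary wins the tie-break
    print(best_route("10.0.1.5", routes[1:]).via)   # primary gone, standby takes over
    print(bfd_detection_ms(300, 3), "ms with BFD vs", 90_000, "ms hold timer")
```

Note the ordering in the sort key: a more-specific prefix wins before AS-path length is even looked at, which is why a stray more-specific advertisement on the standby path silently overrides your prepending.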
The Direct Connect gateway: one wire, many VPCs and Regions
A private VIF attached directly to a virtual private gateway reaches exactly one VPC, which is a dead end the moment you have more than one. The Direct Connect gateway (DXGW) is the global join that fixes it: a Region-agnostic object you associate your gateways with, so a single physical connection reaches VPCs anywhere.
Two patterns: associate the DXGW directly with virtual private gateways (up to 20) to reach VPCs one-to-one, or — the scalable choice — land a transit VIF on the DXGW and associate up to 6 Transit Gateways, one per Region, each fanning out to its Region's VPCs. The physical port and VIF stay single; the gateway provides the geography. (A Transit Gateway association advertises up to 200 prefixes to the DXGW.)
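The limits quoted above lend themselves to a pre-flight check on a planned topology. The sketch below uses only the numbers from this section (20 virtual private gateways, 6 Transit Gateways, 200 prefixes per Transit Gateway association); the `DxGatewayPlan` type is hypothetical. It also flags a gateway that mixes both patterns, which reflects my reading of the AWS documentation that a single DXGW uses one pattern or the other; confirm that before relying on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_VGWS_PER_DXGW = 20
MAX_TGWS_PER_DXGW = 6
MAX_PREFIXES_PER_TGW_ASSOCIATION = 200


@dataclass
class DxGatewayPlan:
    """Hypothetical description of what one Direct Connect gateway will carry."""
    vgw_ids: List[str] = field(default_factory=list)
    # Transit Gateway ID -> prefixes that association will advertise on-prem.
    tgw_prefixes: Dict[str, List[str]] = field(default_factory=dict)


def plan_problems(plan: DxGatewayPlan) -> List[str]:
    """Return human-readable problems; an empty list means the plan fits."""
    problems = []
    if len(plan.vgw_ids) > MAX_VGWS_PER_DXGW:
        problems.append(
            f"{len(plan.vgw_ids)} virtual private gateways exceeds {MAX_VGWS_PER_DXGW}"
        )
    if len(plan.tgw_prefixes) > MAX_TGWS_PER_DXGW:
        problems.append(
            f"{len(plan.tgw_prefixes)} Transit Gateways exceeds {MAX_TGWS_PER_DXGW}"
        )
    for tgw_id, prefixes in plan.tgw_prefixes.items():
        if len(prefixes) > MAX_PREFIXES_PER_TGW_ASSOCIATION:
            problems.append(
                f"{tgw_id} advertises {len(prefixes)} prefixes, "
                f"over the limit of {MAX_PREFIXES_PER_TGW_ASSOCIATION}"
            )
    if plan.vgw_ids and plan.tgw_prefixes:
        # Assumption: one DXGW serves either VGW or TGW associations, not both.
        problems.append("plan mixes VGW and Transit Gateway associations on one DXGW")
    return problems
```

The prefix limit is the one that bites first in practice: summarize VPC CIDRs into a handful of supernets per Transit Gateway rather than listing every VPC individually.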
Traffic flow, end to end
Tracing a single packet from an on-prem host to an EC2 instance is the fastest way to see where each component sits.
The host sends to a VPC private IP; your router has a BGP-learned route for the VPC CIDR pointing at the Direct Connect; the packet crosses the fibre to the AWS router, which hands it to the Direct Connect gateway, then the Transit Gateway, then into the VPC subnet and the instance's ENI. The return trip is symmetric because BGP advertised routes in both directions. No NAT, no tunnel, no internet — just routing over a private path.
One link is not a plan: resilience
A single Direct Connect is a single point of failure at several layers at once — the cross-connect can be cut, the AWS device it lands on can fail, the whole Direct Connect location can go dark — and because provisioning a physical port takes weeks to months, you cannot launch a replacement when one dies. The wire that fixed your throughput becomes your biggest availability risk.
AWS frames the choices as a Resiliency Toolkit. Maximum Resiliency uses separate connections terminating on separate AWS devices in two or more locations; it is the recommended posture for critical workloads and survives the loss of a device, a cross-connect, or an entire location. High Resiliency uses two connections across two locations. The non-redundant tier (one connection, or two on separate devices in a single location) is for dev and test only. Two more tools matter. A link aggregation group (LAG) bundles up to four sub-100 Gbps connections (or two at 100/400 Gbps) into one logical link for capacity; every member terminates on the same AWS device, so a LAG adds bandwidth and port redundancy, not device redundancy. And a Site-to-Site VPN, cheap and live in an hour, makes an honest backup path for the day the fibre is cut.
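As a rough sketch, the tiers above can be expressed as a classifier over (location, AWS device) pairs. The tier labels and the exact thresholds are my simplification: "maximum" here means at least two locations that each have connections on at least two distinct devices, which is one reading of the toolkit's definition rather than an official rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def resiliency_tier(connections: Iterable[Tuple[str, str]]) -> str:
    """Classify a design from (location, aws_device) pairs, one per connection.

    Returns "maximum", "high", "dev-test", or "none" (no connections at all).
    """
    devices_by_location: Dict[str, Set[str]] = defaultdict(set)
    for location, device in connections:
        devices_by_location[location].add(device)

    if not devices_by_location:
        return "none"

    locations = len(devices_by_location)
    # Locations where connections land on two or more distinct AWS devices.
    device_redundant = sum(
        1 for devices in devices_by_location.values() if len(devices) >= 2
    )

    if locations >= 2 and device_redundant >= 2:
        return "maximum"
    if locations >= 2:
        return "high"
    # Everything confined to one location, with or without a second device.
    return "dev-test"


if __name__ == "__main__":
    design = [("loc-a", "dev-1"), ("loc-a", "dev-2"), ("loc-b", "dev-3"), ("loc-b", "dev-4")]
    print(resiliency_tier(design))
```

Because a LAG's members share one device, count a LAG as a single entry in the list, however many ports it bundles.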
SiteLink: using the AWS backbone between your own sites
One non-obvious capability: with SiteLink enabled on VIFs at two or more Direct Connect locations, traffic can flow directly between those locations over the AWS global backbone, bypassing AWS Regions entirely. It turns Direct Connect into a wide-area network for your own data centers — data center A in Frankfurt to data center B in Singapore, riding AWS's backbone rather than your carrier's MPLS. It's billed separately (per-hour plus data), and it's the feature that lets teams retire expensive private WAN circuits.
Encryption: private is not confidential
Because the line is private, it's tempting to treat it as secure. It isn't, on its own — the bytes cross the fibre in the clear. Two ways to add confidentiality:
- MACsec: IEEE 802.1AE Layer-2 encryption, near line rate, on dedicated 10, 100, and 400 Gbps connections at supporting locations. Low overhead, but tied to specific ports and locations.
- IPsec VPN over a public VIF: run a Site-to-Site VPN across the Direct Connect's public VIF. Works anywhere, but you pay tunnel overhead and a throughput ceiling.
The cost model, with the break-even math
Two charges: port-hours for the connection, billed every hour whether it's busy or idle, and data transfer out per gigabyte — but at a rate materially lower than internet egress (the exact rate depends on the Direct Connect location and the source Region; data transfer in is free, as on the internet). The Direct Connect gateway itself is free; you pay the port and the egress.
The decision is arithmetic, not faith. The port is a fixed monthly cost; the saving is the gap between internet egress and Direct Connect egress on the volume you actually move:
$$
\text{break-even GB/month} = \frac{\text{monthly port cost}}{(\text{internet \$/GB}) - (\text{Direct Connect \$/GB})}
$$
Below that volume, a VPN over the internet is cheaper. Above it — sustained, predictable egress — Direct Connect is both faster and less expensive, and the financial case writes itself. (Plug your location's published rates into the formula; the per-GB figures vary by location and Region, so verify them on the pricing page rather than memorizing a number.)
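The formula translates directly into code. In this sketch every price is a made-up placeholder chosen to make the arithmetic easy to follow, not a published AWS rate, and 730 hours is the usual average-month approximation.

```python
HOURS_PER_MONTH = 730  # average month; an approximation


def break_even_gb(port_cost_per_hour: float,
                  internet_per_gb: float,
                  dx_per_gb: float) -> float:
    """Monthly egress volume (GB) at which the port pays for itself."""
    saving_per_gb = internet_per_gb - dx_per_gb
    if saving_per_gb <= 0:
        raise ValueError("Direct Connect egress must be cheaper than internet egress")
    return port_cost_per_hour * HOURS_PER_MONTH / saving_per_gb


def monthly_saving(volume_gb: float,
                   port_cost_per_hour: float,
                   internet_per_gb: float,
                   dx_per_gb: float) -> float:
    """Net saving per month at a given volume; negative below break-even."""
    port_cost = port_cost_per_hour * HOURS_PER_MONTH
    return volume_gb * (internet_per_gb - dx_per_gb) - port_cost


if __name__ == "__main__":
    # Placeholder prices for illustration only; substitute your location's rates.
    port, internet, dx = 0.30, 0.09, 0.02
    print(f"break-even: {break_even_gb(port, internet, dx):,.0f} GB/month")
    print(f"saving at 90,000 GB/month: ${monthly_saving(90_000, port, internet, dx):,.2f}")
```

With those placeholder prices the port costs $219 a month and the gap is $0.07 per GB, so break-even lands a little above 3,100 GB a month. This deliberately leaves out the cross-connect fee, partner charges, and any Transit Gateway processing costs, all of which push the real break-even higher.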
Standing it up: Terraform and monitoring
The dedicated connection and its cross-connect happen partly in the physical world, but the AWS-side objects — the connection request, the VIF, the gateway — are all API-driven and belong in code.
```hcl
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.40" }
  }
}

# 1. Request the dedicated connection (AWS then issues the LOA-CFA).
resource "aws_dx_connection" "primary" {
  name      = "dc1-primary"
  bandwidth = "10Gbps"
  location  = "EqDC2" # a Direct Connect location code
}

# 2. A Region-agnostic gateway, so one connection reaches many VPCs/Regions.
resource "aws_dx_gateway" "core" {
  name            = "core-dxgw"
  amazon_side_asn = 64512 # AWS-side BGP ASN for this gateway
}

# 3. A private VIF: your VLAN + BGP session into the gateway.
resource "aws_dx_private_virtual_interface" "vif" {
  connection_id  = aws_dx_connection.primary.id
  dx_gateway_id  = aws_dx_gateway.core.id
  name           = "vif-prod"
  vlan           = 4094
  address_family = "ipv4"
  bgp_asn        = 65000 # YOUR on-prem ASN (private 64512–65534, or public)
  # bgp_auth_key = "..." # optional MD5 auth: set it in production
}
```
After the cross-connect is patched and the VIF is up, confirm the BGP session is established and that you're learning the VPC CIDR on-prem and advertising your prefixes to AWS. Then wire monitoring: Direct Connect publishes CloudWatch metrics per connection and per VIF — ConnectionState, ConnectionBpsEgress/Ingress, and the BGP state — and the alarm that actually pages you is on BGP peer state dropping from up to down. A connection that's "up" at the physical layer but down at BGP carries no traffic; alarm on the session, not just the light.
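The "alarm on the session, not just the light" rule can be sketched as a pure function over the API's view of your VIFs. The dictionary shape below mirrors what I understand the Direct Connect `DescribeVirtualInterfaces` call to return (in boto3, `describe_virtual_interfaces`), with a `bgpPeers` list carrying a `bgpStatus` of `up`, `down`, or `unknown`; treat the field names as assumptions to verify against the API reference. The sample data is invented.

```python
from typing import Any, Dict, List


def vifs_with_bgp_down(response: Dict[str, Any]) -> List[str]:
    """Return IDs of virtual interfaces that have no BGP peer in the 'up' state."""
    down = []
    for vif in response.get("virtualInterfaces", []):
        peers = vif.get("bgpPeers", [])
        if not any(peer.get("bgpStatus") == "up" for peer in peers):
            down.append(vif["virtualInterfaceId"])
    return down


if __name__ == "__main__":
    # Invented sample shaped like the assumed API response.
    sample = {
        "virtualInterfaces": [
            {"virtualInterfaceId": "dxvif-aaaa1111", "bgpPeers": [{"bgpStatus": "up"}]},
            {"virtualInterfaceId": "dxvif-bbbb2222", "bgpPeers": [{"bgpStatus": "down"}]},
        ]
    }
    for vif_id in vifs_with_bgp_down(sample):
        print(f"PAGE: BGP is down on {vif_id}")
```

A VIF with two peers (IPv4 and IPv6, say) counts as healthy here if either is up; tighten that to `all(...)` if both address families carry production traffic.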
Direct Connect, VPN, or both
| Dimension | Site-to-Site VPN | Direct Connect |
|---|---|---|
| Path | IPsec tunnel over the public internet | Private dedicated fibre |
| Latency | Variable, best-effort | Consistent, low |
| Bandwidth | Per-tunnel ceiling | 50 Mbps–25 Gbps hosted, 1–400 Gbps dedicated |
| Encryption | Built in (IPsec) | None by default (add MACsec/IPsec) |
| Lead time | Minutes | Weeks to months |
| Egress cost | Internet rate | Lower per-GB rate |
Direct Connect earns its cost in three situations: sustained bandwidth a tunnel can't hold; workloads that need a consistent latency floor rather than best-effort; and large, ongoing egress where the lower per-GB rate pays back the port. Outside those, a VPN is cheaper, encrypted by default, and live in an hour. The common production answer is both — Direct Connect for the steady load, a Site-to-Site VPN as the automatic backup — which also neatly covers the weeks-long gap if a Direct Connect ever has to be reprovisioned. Whatever you choose, treat a lone Direct Connect as a single link, never a finished design: pair it with a second path before you call it production.
System design scenarios
Interview questions
The questions an interviewer actually probes on hybrid connectivity, and what a strong answer covers.