This all looks pretty promising. Some questions:
Address Allocation
- Since we're peering VPCs on the same coast and linking sibling VPCs with VPNs, do all the devices we allocate need globally unique private addresses? (I think the answer is "maybe not technically, but we'd be crazy to do anything else".)
- How much IP space is available for VPCs? (I think the answer is 10.0.0.0/8 + 172.16.0.0/12, so ~17.8M addresses, or ~17.9M if we count 192.168.0.0/16 too?)
- What's the smallest VPC / subnet we can allocate? (The smallest VPC is a /28; is the smallest we could actually do in practice a /26 with four /28 subnets, or can subnets be smaller than a /28? If it's a /26 per VPC, we can allocate about 262,000 VPCs in 10.0.0.0/8 before we run out of address blocks; see the arithmetic sketch after this list.)
- If we eventually want to allocate 256,000 VPCs per region instead of the default of 5 per region in AWS, is this OK? Is there lots of evidence that "normal customers" like us can easily get this limit raised to a zillion? Or maybe we should ask support (e.g., ask for an increase to 256 per region today, and ask what it would take to get an increase to 256,000 per region honored in the future)? Presumably this isn't a big deal, but we don't have much wiggle room if we go with VPC-per-customer and AWS says "sorry, maximum of 64 for customers with less than $1B/month spend".
- We can't control which private address a device launches with, right? If a network is a /16, do we get a random draw across the whole /16, with no way to request 9.9.9.0, 9.9.9.1, etc.? (I think the answer is "always random draw".)
- If a device has address 1.2.3.4 locally, does it always have to have address 1.2.3.4 when accessed over a VPN too? Or can the VPN expose blocks as different blocks and do address translation, at least in theory (e.g., inbound traffic from West to 9.8.7.6 is sent to 1.2.3.4 in East)? (I don't know what the answer is, but suspect it ends with "...but it doesn't matter, we shouldn't do this".)
- If, heaven forbid, a customer wants to link their corporate network to their private VPC over a VPN (for example, so Drydock can use on-premises build hosts) AND both networks are on 10.0.0.0/24, so devices have colliding addresses (their build001.company.com has the same address as our videospreadsheets001.phacility.net), what do we do? Is this a capability we could never provide for some other reason, so it's moot?
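To make the arithmetic above concrete, here's the back-of-the-envelope version (bash arithmetic). The /26-per-customer-VPC size is just the guess from the bullet above, not anything AWS requires, and the "5 reserved addresses per subnet" figure is my understanding of standard AWS behavior:

  # Total RFC 1918 private space a VPC CIDR can be drawn from:
  echo $(( 2**24 + 2**20 + 2**16 ))   # 10/8 + 172.16/12 + 192.168/16 = 17,891,328

  # Number of /26 VPCs that fit in 10.0.0.0/8 alone, if we standardize on a /26 per customer:
  echo $(( 2**(26 - 8) ))             # 262,144

  # Usable hosts in a /28 subnet, after AWS reserves 5 addresses in every subnet:
  echo $(( 2**(32 - 28) - 5 ))        # 11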
Subnets / Routes / VPN
- In the route tables, for the routes that let the sister subnets in adjacent AZs talk to one another, do you know what "eni" stands for? (Apparently, "elastic network interface".)
- Where does the actual piece of magic which makes traffic from test-a end up routing through the VPN host live? Is it the eni rule in the route table? ("Yes")
- Is it possible to make this link/route redundant by default? If we lose a VPN host, we currently experience immediate service disruption (for traffic over the link) and must intervene manually, right? (Not entirely sure what the answer is -- this must be possible in some sense, but I assume putting two routes for the same CIDR block into the table causes an explosion, not a perfect magical redundant link, and that "possible in some sense" means "pay F5 a trillion dollars for custom hardware that barely works", at which point you've just moved your single point of failure from a VPN host to an F5 router. See the route sketch after this list.)
- Are the "-2" subnets (e.g., public-2, vs. public-1) missing eni routes just because you didn't bother putting hosts there, since we didn't need to build a full minimal cluster to show that this works properly? ("Yes")
- Why doesn't Ubuntu ship with traceroute? Is there a cool new version of traceroute with a Node.js dependency that I'm supposed to be using? ("Yes, it uses blockchains")
- The actual VPN config is /etc/ipsec.conf? ("Yes")
- Is there a different VPN package we could use which is much, much harder to configure? This configuration file is only 20 lines long and human-readable, so it can't possibly be a Serious Enterprise Product.
- Where is the "balls" private key material actually stored? The config file references ipsec showhostkey, but that doesn't actually do anything, and sudo ipsec secrets exits 0 with no output. (Oh, is "PSK" just "pre-shared key", and we're encrypting the link with the robust, high-entropy password "balls"? See the secrets sketch after this list.)
- strongSwan is just ipsec from the CLI, right? They aren't two different programs which work together; the program's binary is just named ipsec? ("Yes")
- Is there a reasonable way to verify that ipsec actually encrypts the traffic as configured, i.e. that it isn't just failing silently because there's a typo somewhere? I don't know how to do this easily offhand (configure the VPN to bounce the traffic through a box running nc -l 12345 | tee sniff.log | nc target 12345?) and I don't think it's worth spending tons of time on, but we can observe everything else (e.g., traceroute shows that the routing is working), and it would be nice to observe and confirm that the encryption actually does something too. We still have to trust that the encryption is meaningful, but observing that the wire doesn't carry plaintext would be reassuring. (See the tcpdump sketch after this list.)
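For the eni / redundancy questions, here's roughly what I think the moving parts look like from the awscli side. The IDs and CIDR are placeholders, and the replace-route step is a sketch of the usual "watchdog repoints the route" pattern, not something in the current setup:

  # The piece of magic: a route in test-a's table that sends the sister VPC's CIDR
  # to the VPN host's elastic network interface (eni).
  aws ec2 create-route \
    --route-table-id rtb-PLACEHOLDER \
    --destination-cidr-block 10.2.0.0/16 \
    --network-interface-id eni-VPNHOST

  # A route table only holds one route per destination CIDR (creating a duplicate
  # just errors out), so "two routes for redundancy" isn't a thing. The usual
  # pattern is a watchdog that repoints the route at a standby VPN host:
  aws ec2 replace-route \
    --route-table-id rtb-PLACEHOLDER \
    --destination-cidr-block 10.2.0.0/16 \
    --network-interface-id eni-STANDBY

If that's right, then yes: losing a VPN host takes the link down until something (a human or a watchdog) repoints the route.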
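On where "balls" actually lives: my understanding is that with a PSK setup strongSwan reads the secret from /etc/ipsec.secrets rather than from ipsec.conf itself, with an entry along these lines (the addresses are placeholders):

  # Hypothetical /etc/ipsec.secrets entry; the format is "<local> <remote> : PSK <secret>":
  #   10.1.0.10  10.2.0.10 : PSK "balls"
  sudo cat /etc/ipsec.secrets   # check what's actually in there
  sudo ipsec rereadsecrets      # tell strongSwan to reload the file after editing it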
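And for checking that the encryption actually happens: rather than bouncing traffic through an nc relay, I think the cheap check is to sniff the VPN host's interface, filtered to the leg between the two VPN endpoints, which should carry only ESP (and IKE) if IPsec is doing its job. The interface name, peer address, far-side address, and canary string below are all placeholders:

  # On the East VPN host, watch traffic to/from the West VPN host (203.0.113.5 here);
  # healthy output is ESP (IP protocol 50) frames, or UDP 4500 if NAT traversal is in play.
  sudo tcpdump -n -i eth0 'host 203.0.113.5 and ip proto esp'

  # From test-a, push something recognizable across the link
  # (assumes something on the far side is listening, e.g. nc -l 12345):
  echo "PLAINTEXT-CANARY" | nc 10.2.1.5 12345

  # Confirm the canary never crosses the inter-VPN leg in cleartext:
  sudo tcpdump -n -A -i eth0 'host 203.0.113.5 and not port 22' | grep PLAINTEXT-CANARY
  # (expect no matches)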
Mostly as a thought experiment, the only thing which really prevents us from reusing addresses for each customer VPC is bastions, right? That is, we could imagine this scheme instead:
- All USWest VPCs are 10.1.0.0/16.
- All USEast VPCs are 10.2.0.0/16.
- ...and so on.
But then we can't peer the VPCs on the same coast together (right? see the peering sketch below), so we'd need to put a bastion host into each VPC (or into each AZ in each VPC, I suppose). There's no technical reason we couldn't do this, I think? It's just a bunch more hardware and (much more) complexity than putting a single bastion in each AZ and using peering?
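For what it's worth, my understanding is that AWS refuses to peer VPCs whose CIDR blocks overlap at all, so this scheme forecloses peering by construction rather than just making it awkward. Roughly (the VPC IDs are placeholders):

  # Request a peering between two VPCs that both use 10.1.0.0/16:
  aws ec2 create-vpc-peering-connection --vpc-id vpc-AAAA --peer-vpc-id vpc-BBBB

  # With overlapping CIDRs the connection should land in the "failed" state
  # instead of "pending-acceptance", so it can never be accepted:
  aws ec2 describe-vpc-peering-connections \
    --query 'VpcPeeringConnections[].Status.Code'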
A crazy-in-a-bad-way thought is that bastions could be ephemeral: spin one up when we need to connect, then throw it away after it sits idle for, say, 15 minutes. This puts a short (~1-2 minute?) delay on operational access to any VPC we haven't interacted with recently and generally feels kind of horrible (e.g., great Hacker News post to get upvotes, terrible idea in the real world that causes everlasting pain), but it seems like it would work alright. We'd have to send any monitoring signals over the public internet, but that doesn't seem like it's necessarily problematic.
I think peering VPCs is a better approach: it sounds like the limit we're more likely to hit is "Amazon is uncomfortable giving us 256,000 VPCs" (which this wouldn't help with) rather than "we ran out of address space", and we're generally better off in many ways if every device has a unique address. But if we run into terrible problems with VPC peering later on, this is maybe another approach we could consider. Is there a reason we can't do this, as opposed to the many reasons we shouldn't?