
Private Clusters: VPN Notes
Closed, Resolved (Public)

Description

This all looks pretty promising. Some questions:

Address Allocation

  • Since we're peering VPCs on the same coast and linking sibling VPCs with VPNs, do all devices we allocate need to have globally unique device addresses? (I think the answer is "maybe not technically, but we'd be crazy to do anything else".)
  • How much IP space is available for VPCs? (I think the answer is 172.16.0.0/12 + 10.0.0.0/8 ~= 20M addresses?)
  • What's the smallest VPC / subnet we can allocate? (The smallest VPC is a /28; is the smallest we could actually do in practice a /26 with four /28 subnets? Or can subnets be smaller? So we can allocate ~260,000 VPCs in 10.0.0.0/8 before we run out of address blocks?)
  • If we eventually want to allocate 256,000 VPCs per region instead of 5 VPCs per region in AWS, is this OK? Is there lots of evidence that "normal customers" like us can easily get this limit raised to a zillion? Or maybe we should ask support (e.g., ask for an increase to 256 per region today and what it would take to get an increase to 256,000 per region honored in the future)? Presumably this isn't a big deal but we don't have much wiggle room if we go with VPC-per-customer and AWS says "sorry, maximum of 64 for customers with less than $1B/month spend".
  • We can't control which private address a device launches with, right? If a network is a /16, we have to random draw across the whole /16, not request 9.9.9.0, 9.9.9.1, etc? (I think the answer is "always random draw".)
  • If a device has address 1.2.3.4 locally, does it always have to have address 1.2.3.4 when accessed over a VPN too? Or can the VPN expose blocks as different blocks and do address translation, at least in theory (e.g., inbound traffic from West to 9.8.7.6 is sent to 1.2.3.4 in East)? (I don't know what the answer is, but suspect it ends with "...but it doesn't matter, we shouldn't do this".)
  • If, heaven forbid, a customer wants to link their corporate network with their private VPN (for example, so Drydock can use on-premises build hosts) AND both of them are on 10.0.0.0/24 so devices have the same addresses (their build001.company.com is the same as our videospreadsheets001.phacility.net), what do we do? Is this a capability we could never provide for some other reason so it's moot?
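The subnet arithmetic in the questions above can be sanity-checked with Python's stdlib `ipaddress` module (a sketch of the hypothetical /26-with-four-/28s layout, not a statement about what AWS actually permits):

```python
import ipaddress

# Check the hypothetical from above: a /26 VPC split into four /28 subnets.
vpc = ipaddress.ip_network("10.0.0.0/26")
subnets = list(vpc.subnets(new_prefix=28))
assert len(subnets) == 4  # a /26 holds exactly four /28s

# How many /26 blocks fit in 10.0.0.0/8 before we run out of address blocks?
vpc_count = 2 ** (26 - 8)
print(vpc_count)  # 262144, i.e. the "~260,000 VPCs" figure above
```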

Subnets / Routes / VPN

  • In the route tables, for the routes that let the sister subnets in adjacent AZs talk to one another, do you know what eni stands for? (Apparently, "elastic network interface".)
  • Where does the actual piece of magic which makes traffic from test-a end up routing through the VPN host live? Is it the eni rule in the route table? ("Yes")
  • Is it possible to make this link/route redundant-by-default? If we lose a VPN host we currently experience immediate service disruption (for traffic over the link) and must intervene, right? (Not entirely sure what the answer is -- this must be possible in some sense, but I assume putting two routes for a CIDR block into the table causes an explosion, not a perfect magical redundant link, and that "possible in some sense" means "pay F5 a trillion dollars for custom hardware that barely works" and you've just moved your single point of failure from a VPN host to an F5 router.)
  • Are the 2 (e.g., public-2, vs public-1) subnets missing eni routes just because you didn't bother putting hosts there, since we didn't need to build a full minimal cluster to show that this works properly? ("Yes")
  • Why doesn't Ubuntu ship with traceroute? Is there a cool new version of traceroute with a Node.js dependency that I'm supposed to be using? ("Yes, it uses blockchains")
  • The actual VPN config is /etc/ipsec.conf? ("Yes")
  • Is there a different VPN package we could use which is much, much harder to configure? This configuration file is only 20 lines long and human-readable, so it can't possibly be a Serious Enterprise Product.
  • Where is the "balls" private key material actually stored? The config file references ipsec showhostkey but that doesn't actually do anything, and sudo ipsec secrets exits 0 but with no output. (Oh, is "PSK" just "preshared key", and we're encrypting the link with the robust, high-entropy password "balls"?)
  • strongSwan is just ipsec from the CLI, right? They aren't two different programs which work together; the program's binary is just named ipsec? ("Yes")
  • Is there a reasonable way we can verify that ipsec actually encrypts the traffic as configured, i.e. isn't just failing silently because there's a typo somewhere? I don't know how to do this easily offhand (configure the VPN to bounce the traffic through a box running nc -l 12345 | tee sniff.log | nc target 12345?) and I don't think it's worth spending tons of time on, but we can observe everything else (e.g., traceroute shows that the routing is working) and it would be nice to observe and confirm that the encryption actually does something, too. We still have to trust that the encryption is meaningful, but observing that the wire doesn't have plaintext on it would be reassuring.
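For reference, a minimal strongSwan site-to-site tunnel of the kind described above looks roughly like this (a sketch with invented placeholder addresses and CIDRs, not the actual production config):

```
# /etc/ipsec.conf -- hypothetical minimal site-to-site tunnel; all
# addresses and subnets below are illustrative placeholders.
conn east-west
    keyexchange=ikev2
    authby=secret               # PSK, stored in /etc/ipsec.secrets
    auto=start
    type=tunnel
    left=203.0.113.10           # this VPN host's public IP
    leftsubnet=10.1.0.0/16      # local VPC CIDR
    right=198.51.100.20         # peer VPN host's public IP
    rightsubnet=10.2.0.0/16     # remote VPC CIDR
    ike=aes128-sha1-modp1024
    esp=aes128-sha1-modp1024
```

The matching preshared key lives in /etc/ipsec.secrets, one line of the form `203.0.113.10 198.51.100.20 : PSK "balls"` — which answers where the "balls" key material is actually stored.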

Mostly as a thought experiment, the only thing which really prevents us from reusing addresses for each customer VPC is bastions, right? That is, we could imagine this scheme instead:

  • All USWest VPCs are 10.1.0.0/16.
  • All USEast VPCs are 10.2.0.0/16.
  • ...and so on.

But then we can't peer the VPCs on the same coast together (right?) so we'd need to put a bastion host into each VPC (or each AZ in each VPC, I suppose). There's no technical reason we couldn't do this, I think? It's just a bunch more hardware and (much more) complexity than putting a single bastion in each AZ and using peering?

A crazy-in-a-bad-way thought is that bastions could be ephemeral: spin one up when we need to connect, then throw it away after it sits idle for, say, 15 minutes. This puts a short (~1-2 minute?) delay on operational access to any VPC we haven't interacted with recently and generally feels kind of horrible (e.g., great Hacker News post to get upvotes, terrible idea in the real world that causes everlasting pain), but it seems like it would work alright. We'd have to send any monitoring signals over the public internet, but that doesn't seem like it's necessarily problematic.

I think peering VPCs is a better approach, and it sounds like the limit we're more likely to hit is "Amazon is uncomfortable giving us 256,000 VPCs" (which this wouldn't help with) than "we ran out of address space" anyway, and we're generally better off in many ways if every device has a unique address, but if we run into terrible problems later on with VPC peering this is maybe another approach we could consider. Is there a reason we can't do this, as opposed to the many reasons we shouldn't do this?


Event Timeline

Skipping questions where you have the right answer:

Do all devices we allocate need to have globally unique device addresses?

Maybe not technically, but we'd be crazy to do anything else.

How much IP space is available for VPCs?

From the Amazon docs: "You can create a VPC with a publicly routable CIDR block that falls outside of the private IPv4 address ranges specified in RFC 1918." That doesn't make any sense to me, but it's apparently possible. They "suggest" using RFC1918 space, and I think we should use 10.0.0.0/8.
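The "~20M addresses" estimate from the question can be computed exactly for the RFC 1918 ranges (a quick check; 192.168.0.0/16 barely moves the needle):

```python
import ipaddress

# Total private IPv4 space across the three RFC 1918 ranges.
rfc1918 = [
    ipaddress.ip_network("10.0.0.0/8"),       # 16,777,216 addresses
    ipaddress.ip_network("172.16.0.0/12"),    #  1,048,576 addresses
    ipaddress.ip_network("192.168.0.0/16"),   #     65,536 addresses
]
total = sum(net.num_addresses for net in rfc1918)
print(total)  # 17891328 -- closer to ~18M than ~20M
```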

What's the smallest VPC / subnet we can allocate?

From the docs: "Currently, Amazon VPC supports VPCs between /28 (in CIDR notation) and /16 in size for IPv4." Subnets have the same limitations.

If we eventually want to allocate 256,000 VPCs per region instead of 5 VPCs per region in AWS, is this OK?

I just opened a support ticket to bump us-east-1 and us-west-1 limits to 500; let's see what happens.

We can't control which private address a device launches with, right?

Other than scoping a host to a subnet, I don't think so. In theory we might be able to get clever with DHCP option sets, but I don't think being able to dictate private IP assignment would make our lives meaningfully easier. Anecdotally, I can confirm that off-the-shelf VPCs assign private IPs randomly and not starting from .1.

If a device has address 1.2.3.4 locally, does it always have to have address 1.2.3.4 when accessed over a VPN too?

I believe it would be possible to set up a NAT layer before applying the VPN, but a cursory googling doesn't inspire a lot of confidence. I'm pretty sure I'd rather retire and become an ostrich farmer before attempting this at scale.

If, heaven forbid, a customer wants to link their corporate network with their private VPN...

I think this is something we should explicitly try to support. Off the top of my head, I suggest we preserve 172.16/12 as the RFC1918 subnet that is "least likely" to already be used by customers in their private networks. If we need to rebuild a customer's VPC to avoid conflicting with their existing allocations, that just means a little downtime for that customer. As-is, the VPC customer template takes a bunch of CIDR blocks as arguments. Even if customer B's Phacility allocation conflicts with customer A's internal network, that won't matter because we're using VPCs for isolation. If someone went out of their way to add bogus routes to our VPC, that would cause problems, but I think we can avoid that particular pitfall.

Where does the actual piece of magic which makes traffic from test-a end up routing through the VPN host live? Is it the eni rule in the route table? ("Yes")

Yep, it lives in the route table that's attached to the subnet. The "actual" route tables configured on the host are totally bogus, and everything relies on the AWS SDN layer.

Is it possible to make this link/route redundant-by-default?

We could just give EIPs to all the hosts that need to communicate with each other and not rely on the VPN at all. Alternatively, there are in fact some docs about strongSwan HA, which might not even be possible in the AWS environment, but we could look into this. My gut instinct is to wait until AWS releases a product explicitly designed to peer inter-region VPCs, and make sure that the Phacility cluster does something reasonable (like disabling writes if there's no cluster quorum) when the VPN link goes down, since this will happen inevitably when a giant meteor severs every fiber between California and Virginia.

Why doesn't Ubuntu ship with traceroute?

iiam

Is there a different VPN package we could use which is much, much harder to configure?

I'm still waiting to hear back from Cisco about our RFP.

(Oh, is "PSK" just "preshared key", and we're encrypting the link with the robust, high-entropy password "balls"?)

But seriously though, it's worth spending a few minutes coming up with some way of generating these secrets. I have a bunch of tabs open on the subject, and I'll get back to you when I have a decent understanding of the risks of screwing this up and a suggested approach.

Is there a reasonable way we can verify that ipsec actually encrypts the traffic as configured, i.e. isn't just failing silently because there's a typo somewhere?

We should be able to tcpdump the VPN connection and at least convince ourselves that what is actually just gzip'ed traffic is encrypted traffic. I'll take a crack at this. On a related note, aes128-sha1-modp1024 is a cipher suite I chose by carefully copying and pasting a working example off the internet.
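The verification idea sketches out to: send a distinctive marker through the tunnel, capture the outer (wire) traffic with tcpdump, and confirm the marker never appears in cleartext. A toy illustration of the check (the marker string and capture bytes are invented stand-ins for real tcpdump output):

```python
# Hypothetical canary string; in practice, send this through the tunnel
# (e.g. nc from one side to the other) while capturing the outer interface.
MARKER = b"phacility-vpn-canary-12345"

def leaks_plaintext(wire_capture: bytes, marker: bytes = MARKER) -> bool:
    """True if the canary is visible on the wire, i.e. encryption failed."""
    return marker in wire_capture

# If ipsec is working, ESP payloads look like opaque noise and the canary
# never shows up in the capture.
encrypted_capture = b"\x45\x00\x28 opaque ESP payload bytes"
assert not leaks_plaintext(encrypted_capture)

# If ipsec silently failed (e.g. a typo in the config), the canary rides
# the wire in cleartext and the check catches it.
broken_capture = b"GET / " + MARKER + b"\r\n"
assert leaks_plaintext(broken_capture)
```

This doesn't prove the encryption is cryptographically meaningful, but it does confirm the wire carries no plaintext, which is the observable property asked for above.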

Mostly as a thought experiment, the only thing which really prevents us from reusing addresses for each customer VPC is bastions, right?

And whatever other stuff needs to live in our admin VPC.

That is, we could imagine this scheme instead:

This is actually the path that I started going down, until I realized that it limits us quite a bit when generating subnets, since they can't be any smaller than a /28. I propose that we avoid baking any particular assumptions (like "the 2nd octet tells you what region you're in") into our subnetting scheme, and just assign /24s (or larger) off the top of 10/8 forever.
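The "assign /24s off the top of 10/8 forever" scheme is simple enough to sketch with the stdlib (hypothetical allocator; names and customer keys are invented for illustration):

```python
import ipaddress

# Hand out /24s sequentially from 10.0.0.0/8, recording assignments so
# every customer VPC gets a globally unique block with no structural
# assumptions baked into the octets.
SUPERNET = ipaddress.ip_network("10.0.0.0/8")

class BlockAllocator:
    def __init__(self, supernet=SUPERNET, prefixlen=24):
        self._blocks = supernet.subnets(new_prefix=prefixlen)
        self.assigned = {}

    def allocate(self, customer):
        block = next(self._blocks)
        self.assigned[customer] = block
        return block

alloc = BlockAllocator()
print(alloc.allocate("customer-a"))  # 10.0.0.0/24
print(alloc.allocate("customer-b"))  # 10.0.1.0/24
```

A real allocator would persist the assignment table and handle reclamation, but the ordering property is the point: blocks are unique by construction.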

But then we can't peer the VPCs on the same coast together (right?)

This goes back to "we probably could but definitely shouldn't".

A crazy-in-a-bad-way thought is that bastions could be ephemeral: spin one up when we need to connect, then throw it away after it sits idle for, say, 15 minutes.

Just to add to the list of badness, I have personally experienced two AWS outages that put AZs into a state where we couldn't actually spin up new instances.

and it sounds like the limit we're more likely to hit is "Amazon is uncomfortable giving us 256,000 VPCs"

Actually, the hard limits we're most likely to hit first are "routes per route table (hard limit of 100)" or "peering connections per VPC (hard limit of 125)", since the bastion's subnet needs routes to every customer VPC, and the admin VPC should be peered with every customer VPC. We can solve the former with an indirection layer that automatically connects from real-bastion.us-east-1.phacility.net to customers-starting-with-the-letter-Z.us-east-1.phacility.net, and the latter by using our VPN hack instead of VPC peering.
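The sharding arithmetic implied by the route-table limit is straightforward (a back-of-envelope sketch using the 100-route figure quoted above):

```python
import math

# With a hard limit of ~100 routes per route table, one bastion subnet can
# reach at most ~100 customer VPCs, so covering N customers needs
# ceil(N / 100) bastion shards -- the indirection layer described above.
# The 125-peering-connections-per-VPC limit constrains the admin VPC the
# same way, hence the VPN hack instead of peering there.
ROUTES_PER_TABLE = 100

def bastion_shards(customer_vpcs: int) -> int:
    return math.ceil(customer_vpcs / ROUTES_PER_TABLE)

print(bastion_shards(256_000))  # 2560 shards at the 256,000-VPC target
```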

Is there a reason we can't do this, as opposed to the many reasons we shouldn't do this?

Not off the top of my head. But if it came down to it, I'd rather roll the dice with AWS's IPv6 support than attempt to knowingly build an infrastructure that reuses addresses.

epriestley claimed this task.

Cool, that all sounds like it's roughly what I expected. Thanks!

the 2nd octet tells you what region you're in

Oh, I meant even more aggressively reusing IPs: every customer's first host in USWest is (by default) 10.0.0.1, second host is 10.0.0.2, etc. So not "octet tells you the region" exactly, but "octets are different only for subnets which may need to connect over VPN". Regardless, I agree that this idea is a pretty bad one and having dozens of different 10.0.0.1 hosts seems like a recipe for disaster unless the alternative is somehow even worse (and it sounds like it's totally reasonable).

which might not even be possible in the AWS environment, but we could look into this

I'm not too worried about this since our failure mode should already be good (below) and these events should be rare, and "the VPN interlink goes down for a bit until we go flip the switch off and on" will probably never make it to the top 100 list of most broken stuff -- I was mostly just curious if there was some easy network-level way to make things work magically that I wasn't aware of.

and make sure that the Phacility cluster does something reasonable (like disabling writes if there's no cluster quorum)

I am scared of hard problems so none of our clustering does anything tricky like this where non-writable nodes use spooky voodoo to become writable when you least expect it. Failure mode for everything is approximately "when the meteor hits, everything not still connected to the master goes read-only until operations intervenes manually to decide which coast gets to live". Should already fail pretty safely/reasonably except for a basketful of mostly-minor product issues where the UI doesn't degrade as well as it should (T10769).