
A Pathway Towards Private Clusters
Closed, Wontfix · Public

Description

I will write some text here in a bit.

Event Timeline

I think there are broadly two halves to this which we can think about mostly separately:

  • What do we need to do to make https://phabricator.epriestley.com/ work?
  • Once we get there with a bunch of duct tape and sticky notes, what's the path toward a world of smooth operations?

SSL Certificates and Load Balancers

For https://phabricator.epriestley.com/ to work, we must have some kind of device with an SSL certificate for the domain. Specifically, these are the entrypoints we must support:

  1. Normal HTTPS traffic.
  2. Notification traffic via websockets.

I think we can do both (1) and (2) with AWS certificates and an ALB. The workflow will be:

  • We request an AWS certificate on our side.
  • We manually work with the customer to get them to click the link that authorizes it.
  • We're golden forever?

This is a mild pain but it's probably much better than getting customers to upload and rotate SSL certificates, and I assume the approval process to let us sign random certificates for customers is very very long and hard.

I believe this must be on a dedicated ALB because SNI (which would let us host multiple certificates on a single IP address + port pair) is not really reliable yet (?). Even if it is, or we expect it to be generally reliable in a few years, we probably want separate ALBs anyway. Cloudflare gets around this by signing like a hundred domains into each cert (and also using SNI?), but we can't do this without being able to sign certificates. I don't think ALB/ELB support SNI anyway?

Doing this ties us to ALB/ELB to some degree since we won't have the actual certificates, but that seems alright.

The documentation claims that AWS limits customers to 20 LBs per region, but presumably we can email them to lift that. ALBs cost approximately $20/month.

Note that we cannot send SSH traffic over an ALB -- ALBs only speak HTTP/HTTPS. However, we can apparently send it over a v1.5 VPC ELB (v1 classic ELBs do not let you listen on 22).

So the particulars here are:

  • Each cluster gets an ALB listening for HTTPS, notification websockets, and maybe HTTP to do redirects to HTTPS.
  • The ALB has an AWS SSL certificate for phabricator.epriestley.com on it.
  • Each cluster gets an ELB listening for SSH.
  • Hard costs to us are ~$40/month.
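
As a rough Terraform sketch of the per-cluster pieces (resource names, the security group, and the target group are all hypothetical; the certificate sits in a pending state until the customer approves the validation email):

resource "aws_acm_certificate" "cluster" {
  domain_name       = "phabricator.epriestley.com"
  validation_method = "EMAIL"
}

resource "aws_lb" "cluster" {
  name               = "clu-epriestley"
  load_balancer_type = "application"
  security_groups    = ["${aws_security_group.cluster_lb.id}"]
  subnets            = ["${module.vpc.public_subnet_ids}"]
}

# Terminate HTTPS with the AWS-managed cert; notification websockets ride
# the same listener since ALBs support them natively.
resource "aws_lb_listener" "https" {
  load_balancer_arn = "${aws_lb.cluster.arn}"
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = "${aws_acm_certificate.cluster.arn}"

  default_action {
    type             = "forward"
    target_group_arn = "${aws_lb_target_group.web.arn}"
  }
}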

DNS

The main domain name, phabricator.epriestley.com, must be pointed at the ALB.

A separate domain name (like vault.epriestley.com) must be selected and pointed at the SSH ELB.

I don't see any real way to do this with a single domain name. AWS doesn't have a single device which can both listen for TCP on 22 (only ELB) and terminate SSL for websockets (only ALB). If we launch such a device ourselves, we have to have users actually upload SSL certificates.

We also need MX records, in theory, but I'll deal with this in a separate section since I imagine it's a nightmare.

The actual mechanics of this are probably that we tell installs to add CNAME records like this:

  • phabricator.epriestley.com > CNAME > app-1.cluster.phacility.com
  • vault.epriestley.com > CNAME > vault-1.cluster.phacility.com

Those are DNS alias records for the ALB and ELB respectively.
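
On our side, the corresponding Route 53 alias for the ALB might look roughly like this in Terraform (zone and resource names are hypothetical; the ELB record is the same shape):

resource "aws_route53_record" "app_1" {
  zone_id = "${aws_route53_zone.cluster.zone_id}"
  name    = "app-1.cluster.phacility.com"
  type    = "A"

  # Alias records resolve directly to the ALB's current IPs.
  alias {
    name                   = "${aws_lb.cluster.dns_name}"
    zone_id                = "${aws_lb.cluster.zone_id}"
    evaluate_target_health = false
  }
}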

We should probably CNAME some additional records so that <something>.phacility.com works until DNS is set up, but I'll discuss that below.

So the particulars here are:

  • Installs need to CNAME two DNS records for whitelabeling.
  • This costs us something like a millionth of a cent per year.

As far as requesting certs goes, our "Request Certificate" button in the AWS console will work correctly by sending an email to the domain's owner:

After you request the certificate, email will be sent to the registered owner of each domain name below. The domain owner or an authorized representative can validate control of the domain and approve the certificate by following the instructions in the body of the email. After all of the domains are validated, the certificate will be issued.

Mail

no lol

I think the best strategy here is probably to continue doing both outbound and inbound through instance.phacility.com for a long time. Installs can set up forwards from bugs@epriestley.com to bugs@epriestley.phacility.com and the value here seems small and not worth the hassle of MX/DKIM/etc. If installs want this, we can come up with a plan and walk them through it and see how things go at that time. It wouldn't surprise me if no one cares.

I believe the hard part here is entirely in getting installs to set up DNS the right way, not breaking their existing mail in the process, and having them configure things in a reasonable, future-proof way that doesn't break if we need to swap mail providers or do T12677. This seems hard.

And services like SES and MailGun require a lot of per-domain verification on top of everything else.

So:

  • Punt forever.
  • This costs us $0.

(We can't ever host bugs@epriestley.com directly anyway -- that always has to forward -- and that's probably what installs would want if they care about any of this.)

At least, not until we launch Phacility Apps for Domains and rebrand it to PSuite a few years later.

Subnets

Each private cluster should be on its own subnet. If/when we let installs run custom code on an instance, and it turns out to be a security nightmare that roots the box, no other customer should be vulnerable.

This is probably the design we're leaning towards, in terms of minimum permissions?

  • Each private cluster has one subnet per region where it has presence.
  • Network ACLs allow traffic into the subnet from only a few other subnets.
  • Network ACLs allow traffic out to only the public NAT subnet and maybe a few other subnets.
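
A hedged Terraform sketch of what one of these ACLs might look like (all CIDRs and names are hypothetical; note that both directions have to be enumerated because ACLs are stateless):

resource "aws_network_acl" "private_cluster" {
  vpc_id     = "${module.vpc.vpc_id}"
  subnet_ids = ["${aws_subnet.cluster_private.id}"]

  # Allow traffic in from the bastion subnet only.
  ingress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "10.0.1.0/24"  # bastion subnet (hypothetical)
    from_port  = 0
    to_port    = 65535
  }

  # Allow traffic out to the public NAT subnet only.
  egress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "10.0.0.0/24"  # public NAT subnet (hypothetical)
    from_port  = 0
    to_port    = 65535
  }
}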

Today, instances interact with external services in these ways:

  • Deployment and external (human) operations go External → Bastion → Instance.
  • Internal operations go Admin → Bastion → Instance.
  • However, some of the stuff in the UI on admin and some administrative operations connect directly to instances, usually instance databases: Admin → Instance.
  • Instances connect directly to Admin to query config: Instance → Admin.
  • Instances do various things like install packages, publish webhooks, make build calls, etc., to the public internet: Instance → Public Internet.
  • Instances interact with S3.

So if we change none of that, the private subnet needs to be able to send and receive traffic to the subnet or subnets where "admin" and "bastion" are located.

We could stop all the Admin → Instance stuff and send traffic exclusively through the bastion instead. This would make a few things a little slower or more complicated (for example, the "Members" page currently queries instances directly to see if members are administrators on the instance or not), but nothing insurmountable.

We could stop all the Instance → Admin stuff too, although this is probably far more complicated (T12646#222228 has some discussion).

AWS network ACLs are stateless, so we can't block just inbound or just outbound traffic: return traffic for an allowed connection also has to be explicitly allowed in the opposite direction.

I also don't know what the plan is for multi-region clusters. Do we need to build a VPN between the subnets? It does not look like any of the VPC-related technologies address this directly.

Another possible consideration is that we could access admin via an internal ELB, if this changes things. This is desirable anyway.

So, open questions:

  • Do we want to stop Admin/Instance traffic? If we do, do we need to do it now or can we do it smoothly in the future?
  • What's the plan for multi-region clusters? How does the west-coast subnet talk to the east-coast subnet? We don't need this for v1, but should know what the pathway is and be confident we aren't headed down a blind alley.
  • Can we still hit S3 with restrictive network ACLs?
  • Can we still receive ALB/ELB traffic with restrictive network ACLs?

We can answer some of these questions concretely:

  • Launch a west-coast subnet and an east-coast subnet and make them talk to each other somehow (VPN?).
  • Launch a west-coast subnet with a restrictive network ACL and see if it can talk to S3.
  • Launch a west-coast subnet with a restrictive network ACL and see if it can receive ELB/ALB traffic normally.

We should pursue T12816 before this. If we do not, private clusters that have network access to the bastion will necessarily also have access to other instances.

Names

Instances should probably still have short display names (like "epriestley" corresponding to epriestley.phacility.com) for tooling consistency and, at least today, for mail. It's annoying/inconvenient if we can never call this instance "epriestley" and always have to call it "phabricator.epriestley.com". After T11413, they will also have possibly-different internal names, perhaps inst-abc or PHID-INST-abcdef.

However, I don't think we can make https://epriestley.phacility.com/ (vs https://phabricator.epriestley.com/) actually work unless we break some other rules. This can't go to the private ALB in DNS since it won't have an SSL cert there, which means it gets terminated on the main ALB and routed to the generic web tier.

web could proxy the request to the right subnet to make this "just work", but that means the subnet needs to be able to talk to web. This seems like a kinda bad dangerous mess that isn't worth the minor smoothing during setup?

Better is probably just an error page that says "keep going, you're doing great! Just a couple more things to set up."

We could also make epriestley.phacility.com work on the main tier and then move it to the private cluster once DNS + SSL is set up, although I'm not sure this is worthwhile, at least for v1.

Once it's set up, it could just redirect to the right domain.

We can also make PHID-INST-abcdef.phacility.com (or whatever) redirect appropriately.

On Admin, we need a way to let installs claim and review domains. This probably looks like:

[+] Add A Domain

+----------------------------+------+------+-----+-----+
| Domain                     | Type | DNS  | SSL | ALB |
+----------------------------+------+------+-----+-----+
| phabricator.epriestley.com | Web  | OK   | OK  | OK  |
| vault.epriestley.com       | VCS  | OK   | N/A | OK  |
+----------------------------+------+------+-----+-----+

The "DNS" and "SSL" columns walk users through getting things set up (Needs ConfigurationOK) and then go green once we confirm that settings are good.

The "ALB" column waits on operations to deploy the thing in v1 (Waiting for OperationsOK), and auto-deploys an ALB/subnet eventually.

This all dumps into a table somewhere which is probably the same table as whatever comes out of T11413, or at least updates that table as a side effect.

We should probably (?) do T11413 first with a mind toward this, although if we're severing the Admin/Instance connection maybe it doesn't matter.

I'll also suggest that we should implement some kind of "infrastructure as code" tool before going too far down the road of setting this stuff up. I'd really like to get to the point where no one ever has to push buttons in the AWS console unless something is really broken. I don't have super strong feelings about Ansible vs Chef vs Puppet, but I'd recommend against CFEngine and SaltStack.

Config

Today, we read config from admin.phacility.com on each request. This is generally good for shared clusters, although T12646 discusses some issues.

It also means that instances depend on admin to work.

I think we should probably not require this for private clusters, and should push config with deployment instead. If admin goes down, private clusters should be able to stay up -- there's no technical reason we can't build for this, and we have no need to rapidly change configuration on private clusters.

This is a little tricky but should not be particularly involved, since we already do other Admin → Bastion → Hosts stuff like starting/restarting daemons, synchronizing Almanac services, etc.

I'll also suggest that we should implement some kind of "infrastructure as code" tool

I'm concerned that this may turn a 2-month project into a 6-month project if we write something ourselves, or a 12-month project if we use Ansible, Chef or Puppet. If you think this is important, though, I'm happy to let you take the reins here.

I think we should probably pick a tool and convert all existing deployment to it first if we're going down this path. I don't have any experience with any of these tools, do you want to drive researching them, making a selection, and converting existing deployment and operations tools?

I'm concerned that this may turn a 2-month project into a 6-month project if we write something ourselves, or a 12-month project if we use Ansible, Chef or Puppet.

I have mixed feelings about this. Honestly, at every other engineering org I've been a part of, picking a tool like this would be a no-brainer, day-0 project. But if there's one big philosophical difference I've observed between Phacility and everywhere else, it's "3rd party code is evil and should be avoided at almost any cost". I've been trying to internalize that, but in this case I feel like writing our own deployment/management stack is closer to "writing our own kernel + database" than "writing our own SMTP client". In that vein, I see three possible paths forward:

  1. Drop everything and pivot all engineering resources to our new infrastructure-as-code product, "Phansible"
  2. Continue incrementally improving the existing deployment scripts until we get to a state where the basic management of private clusters (provisioning, destroying, upgrading, reconfiguring) works pretty well
  3. Pick an off-the-shelf tool, migrate our existing flows to use it, and hope that we backed the right horse (or at least a horse that doesn't constantly kick us in the face)

I don't think anyone wants to open door number one, even though it could in theory generate some revenue (RedHat apparently sells some kind of Ansible add-on that makes money).

I have no doubt that we could do number 2 relatively easily and build the tooling ourselves to do stuff like provisioning subnets/instances/ALBs/etc and configuring them. This is not a crazy amount of work, but I feel like ultimately it's never going to be the strongest, most heavily-developed area of the codebase, and is pretty orthogonal to our actual product. Inevitably, issues like "provisioning steps 1-4 succeeded, but step 5 failed in a subtle way we never saw before, so we incorrectly continued on to step 6 which deleted all customer data" won't get fixed until we hit them, which is an argument in favor of a mature codebase that's absorbed the collective wisdom of those kinds of incidents. Also, our open source advantage breaks down in this area because no one else will be running/seeing that code. And ultimately, unlike most other engineering work we could be doing, this will never help generate revenue, and it might even be cheaper to hire an intern to push buttons in the AWS console all day instead. (I'm reminded of an "automation" solution that an equipment vendor offered to Facebook, which involved contractors logging into hundreds of devices to perform manual config changes).

Aside from not having to write all this code ourselves, I think the biggest wins for a tool like Ansible are auditing (did someone forget to remove some old instances after a migration, or remove that temporary inbound from 0.0.0.0 rule after the customer's import finished?), logging (show me all the ops tasks that have been carried out on a given host), and intelligently sequencing operations with lots of dependent steps. The existing cookbooks (or playbooks, or grains, or whatever) in these products are generally pretty good about knowing how to revert themselves if downstream dependent tasks fail, which I feel is a great property to have. Even if we were meticulous about adding revert pathways to all our automation code, we'd have to build a stateful job tracking system to make sure that the revert happens eventually even if it's failing temporarily due to AWS issues. I know the daemons already mostly do this, but I would also argue that depending on, say, the availability of daemons on admin to perform operational tasks is a recipe for breaking our tools with our tools.

All that said, my gut tells me that the right move is to pick an existing tool and commit to it, but I could be wrong. I could spend a few days building a PoC deploy process for each of the contenders if you want to see what things would look like. Let me know how you'd like to move forward.

I think we just have two very different models of where operational risk lies and how to hedge against it -- the risks you associate with first-party tooling are risks I associate with third-party tooling. We can't really objectively prove that either model is better because failures are rare in either case. I think we're worse off trying to compromise the two models than picking either one and sticking to it.

I don't want to run ops forever and I think everyone with any experience is going to join and say "why aren't you using Chef/Ansible/Puppet?", so let's just pick something people actually know and which you're more comfortable with, and use that instead of custom first-party stuff. Even if I'm "right" in some sense and we would have had 0.5 incidents per year instead of 0.6 incidents per year if I hand-crafted everything, me writing all the ops tooling doesn't scale well and you're right that this isn't our core business.

I think the worst possible outcome is that we throw it all away for some reason we don't currently foresee, you learn a lot more about how the cluster works in the process, and I can answer "Why aren't you using Chef/Ansible/Puppet?" when we make the next hire.

Okie dokie, I'll start evaluating.

Another comment about inter-region connectivity: this documentation is somewhat discouraging: https://aws.amazon.com/answers/networking/aws-multiple-region-multi-vpc-connectivity/

That doc pretty much makes explicit that the best answer for our situation is "roll your own point-to-point VPN setup". Now the big question: is it worth it to isolate instance inter-region traffic? That scales poorly, since it requires n dedicated VPN instances per customer, where n is the number of regions we eventually support. Aggregating VPN traffic is definitely easier, but creates another way for instances to interfere with each other by saturating the VPN link (or the VPN instance's network interface). In theory we can address this with per-instance VPN traffic quotas, but I have no idea if OpenVPN et al. support that.

Also, I won't even pretend to be up to speed on regulatory issues, but I wonder which customers might actually require traffic isolation. In practice, I'm expecting the only thing we'll be sending inter-region is MySQL replication traffic, so if we use TLS for those streams (which we should anyway), I think we might be ok.

I'm a bit out of my depth here -- what's the alternative you're suggesting? That we have VPC <-> VPC VPNs instead of subnet <-> subnet VPNs? What, specifically, is the "just use TLS" plan?

I'm not concerned about regulatory issues, but I do want to be able to treat private clusters as compromised/adversarial without worrying that other customers are affected. If an instance wants to deploy some sketchy custom SAML extension and it happens to be full of terrible security holes, and we decide to let them, it should only be able to affect their instance: attackers who compromise the hosts in a private subnet should not be able to escalate the attack to other instances.

Particularly, a state actor should not be able to pay us $1,000/month to get access to a private cluster, then actively compromise it and use it as a staging area to mount an attack on their real target. A private cluster should be able to deploy accidentally or intentionally compromised extensions/applications without putting anyone else at risk.

That we have VPC <-> VPC VPNs instead of subnet <-> subnet VPNs?

Yeah, that's the alternative. For that plan, I also have no idea how we'd scale it if we ever needed more than one instance to handle the traffic, but AWS has 10GE instances and it's hard to imagine pushing more traffic than that.

From a security perspective, I'm not sure what an attacker could do to traffic flowing over a shared VPN infrastructure (other than consume all the bandwidth).

What, specifically, is the "just use TLS" plan?

This was just me thinking out loud: if we end-to-end encrypt the replication traffic (and anything else that would flow over the VPN), it doesn't matter if an attacker finds a way to read another user's VPN traffic.

I think the biggest downside to giving every subnet its own VPN connection (aside from potentially extra hardware expense) is that I put VPNs in the bucket of "sorta flaky things that occasionally need to be poked at", and having one per customer could potentially result in a lot of poking.

To back up a bit, how much human touching are you envisioning for bringing up a new private cluster? In my head I was thinking we'd at least have to manually deal with requesting the cert, but it looks like There's An API For That, so I think it's at least possible to make the whole thing work magically. It would be something like: filling out the form triggers the creation of a new Terraform (or whatever) config, which then gets deployed.

Unrelated question: are we going to provide SSH access to private clusters? If not, how are we going to handle custom code?

If we do VPC-level VPNs, what network-level safeguard prevents web001.evil-attacker.us-west.phacility.com from connecting directly to db001-master.wholesome-good-company.us-east.phacility.com? Subnet ACLs can no longer enforce this, right? All the traffic arriving on the east coast will originate from the local VPN endpoint, not from the corresponding west-coast subnet? Or am I misunderstanding how this works?

I had imagined a very large amount of human button pressing for the next ~year or so, especially during setup, and gradually automating things away over time, similar to the main cluster deployment process, but that's coming from a place of my "first-party everything" school of thought, not the "Chef/Ansible/Puppet/never press buttons in AWS" school of thought.

I'd like to leave open the possibility that we provide SSH access, but resist providing it for as long as possible.

Custom code will be a list of Phabricator packages you want deployed to your instance (T5055) that you define in a web UI. I'm not sure exactly how much custom code we're going to allow, but at a minimum it needs to be packaged, versioned, and checked in somewhere, not just random stuff you live-edit in production.

what network-level safeguard

I am moderately confident that we can restrict outbound subnet reachability by VPN user, but would have to do some googling to confirm.

We can definitely do per-subnet VPNs with something like OpenSWAN: http://40cloud.com/interconnecting-two-aws-vpc-regions/

Another possibility is that since we need only a very small number of connections between subnets (database replication and VCS over SSH only, I think), we could enumerate everything and then tunnel/forward them with something like stunnel/ssh/haproxy rather than a full-blown VPN. That also feels like kind of a mess, but maybe not as much of a mess as doing a full VPN setup. It would probably look something like the OpenSWAN diagram, except the OpenSWAN boxes would be haproxy/stunnel boxes or whatever instead that just forwarded a specific set of ports between devices:

[Image: Post-VPC-interconnect-net-diagram-v2.jpg -- diagram of two VPCs in different regions connected through per-VPC gateway instances]

Then each private subnet is probably really two subnets (a real private subnet with a NAT route through a shared public subnet, plus a separate public subnet with this "link gateway"), I guess?

Note that we cannot send SSH traffic over an ALB -- ALBs only speak HTTP/HTTPS. However, we can apparently send it over a v1.5 VPC ELB (v1 classic ELBs do not let you listen on 22).

I just tested this and confirmed that Classic ELBs can speak SSH and it works as expected. The ELB does a simple "is the targeted TCP port listening" health check, which also works.
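
For reference, a minimal sketch of that setup in the same style as the notifications ELB further down (names and the backend port are hypothetical):

resource "aws_elb" "vcs" {
  name_prefix     = "phvcs-"
  security_groups = ["${aws_security_group.vcs_lb.id}"]
  subnets         = ["${module.vpc.public_subnet_ids}"]

  # Plain TCP passthrough: SSH stays end-to-end, the ELB never decrypts.
  listener {
    instance_port     = 2222
    instance_protocol = "tcp"
    lb_port           = 22
    lb_protocol       = "tcp"
  }

  # "Is the targeted TCP port listening" health check.
  health_check {
    healthy_threshold   = 5
    unhealthy_threshold = 2
    target              = "TCP:2222"
    interval            = 5
    timeout             = 2
  }
}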

Another AWS gotcha learned the hard way: nodes in a "private" subnet can still be part of an "internet-facing" ELB, but the trick is that you have to attach the ELB to the "public" subnet that contains the IGW and the NAT Gateway, and the nodes in the "private" subnet need a route to the NAT Gateway.

So it occurs to me that we could have one "shared" public subnet that contains 1 shared Internet Gateway, 1 shared NAT Gateway, and then per-install VPN nodes. We should be able to set up routing tables for the private subnets that point to that install's VPN node. Since routing is enforced at the network layer (routing tables on actual EC2 nodes are bogus), that would enforce traffic separation. In theory, installs could saturate the IGW or NAT gateway, but I'm pretty sure those resources autoscale since they're managed by EC2. A NAT gateway with no traffic going through it is $32/month and, as far as I can tell, an Internet Gateway is free, so sharing them would eventually save us some money.

And here's the official AWS docs for connecting VPCs using OpenSWAN: https://aws.amazon.com/articles/5472675506466066

AWS doesn't have a single device which can both listen for TCP on 22 (only ELB) and terminate SSL for websockets (only ALB).

That's not correct. We are routing our websocket traffic for Aphlict through an ELB that has been set up using SSL instead of TCP as the protocol. In Terraform, it looks like this:

resource "aws_elb" "notifications" {
  name_prefix     = "phntf-"
  security_groups = ["${aws_security_group.notifications_lb.id}"]
  subnets         = ["${module.vpc.public_subnet_ids}"]

  listener {
    instance_port     = 22280
    instance_protocol = "tcp"
    lb_port           = 80
    lb_protocol       = "tcp"
  }

  listener {
    instance_port      = 22280
    instance_protocol  = "tcp"
    lb_port            = 443
    lb_protocol        = "ssl"
    ssl_certificate_id = "${data.aws_iam_server_certificate.notifications.arn}"
  }

  health_check {
    healthy_threshold   = 5
    unhealthy_threshold = 2
    target              = "TCP:22280"
    interval            = 5
    timeout             = 2
  }
}

We should also make a decision about whether or not we want to use "dedicated" AWS instances: https://aws.amazon.com/ec2/purchasing-options/dedicated-instances/

I'm leaning towards yes, just because I've been bitten a handful of times by weird shared AWS infra issues. Notable example: mystery packet drops because other traffic on a shared host was hitting an undocumented hypervisor-layer packets-per-second limit.

It looks like the price difference is pretty large -- $45/month for an m4.large reserved instance vs $80/month for an m4.large dedicated -- plus ~$1,500/month/region for having at least one dedicated instance. It would be easy to start using dedicated instances in the future, right? Just an issue of converting existing hosts if we run into issues?

Maybe we should wait for issues to crop up before making the leap, since we haven't seen any so far -- and we should generally be working on making it easier to swap hosts over time anyway, since we know this is a capability we need good support for in the long run: T12798 is almost certainly just the tip of the iceberg. I could imagine that, in the long run, we lose 90% of instances for non-dedicated reasons and only 10% for dedicated reasons, and the weird hypervisor packet stuff ends up being noise in the normal churn of hosts getting blown up by stray cosmic rays.

If the price difference was more like 5-10% I'd say this is probably a no-brainer, but since it looks like it's more along the lines of 2x-ish I'm less sure.
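
(For what it's worth, tenancy is just a per-instance launch attribute, so "converting" would presumably mean provisioning replacement hosts with the flag set and swapping over. A hedged sketch, with hypothetical AMI/subnet references:)

resource "aws_instance" "db001" {
  ami           = "${var.base_ami}"
  instance_type = "m4.large"
  subnet_id     = "${aws_subnet.cluster_private.id}"
  tenancy       = "dedicated"  # vs. "default" for shared tenancy
}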

AWS doesn't have a single device which can both listen for TCP on 22 (only ELB) and terminate SSL for websockets (only ALB).

That's not correct. We are routing our websocket traffic for Aphlict through an ELB that has been set up using SSL instead of TCP as the protocol. In Terraform, it looks like this:

Oh cool, I've never seen anyone actually get WebSockets working on a Classic ELB before. We'd still like to leverage the "native" WebSocket support and possibly some of the fancy path-based routing in ALBs, but it's good to know we have an alternative.

So having done a bunch more reading/thinking about private cluster isolation, I'm starting to lean towards "one VPC per customer per region" instead of "one subnet per customer per region".

Advantages:

  • Cost allocation: things like internet gateway/NAT/transfer bandwidth can be tied to a specific customer instead of being lumped together. The nightmare scenario is that someone starts hogging a shared resource and we have to start dumping VPC flow logs just to figure out who the offender is. AWS will actually do this accounting for us if we set up User-Defined Cost Allocation Tags.
  • Better isolation: even though I want to automate the creation of ACLs and security groups, creating distinct VPCs is another line of defense.
  • Less chance of dangling resources: customers terminating their service are "terminated" as soon as their VPCs have been successfully destroyed. If we forget to clean something up, it will block VPC destruction in an obvious way.
  • Not as impossible as expected: I've read several anecdotal reports of AWS users with hundreds of VPCs. There used to be a hard limit of 5 VPCs per region per account, but that's gone now. Several "best practices" discussions suggest using the "one VPC per environment and/or customer" model.

Disadvantages:

  • More expensive: creating dedicated hardware for everyone will directly increase our costs. Namely: EIPs, NAT gateways, VPN instances.
  • More IPs: similar to the "more expensive" point, each customer will have a unique list of public IPs that they'll have to punch holes for in their firewalls (but we can extract this information from AWS and throw it into a UI).
  • More AWS state: just by virtue of having to create the VPCs (as well as the subnets that we have to create anyway), we'll end up with strictly more lines of config.
  • Still possibly overkill: we don't need distinct VPCs per customer, but I think that the private cluster product should cater to the expectation that your install is truly private and that you share no resources with any other customers. The first time we have to explain to someone that their instance is down because of someone else's bad behavior will outweigh whatever cost savings we might extract from sharing resources.

The basic model would be to create an "administrative" VPC in each region we want to deploy in. That VPC contains a bastion host per subnet (and ideally nothing else, but more likely it will contain things like admin). Each customer gets a dedicated VPC per region, with at least two public and two private subnets, spread across whatever AZs are available in that region. Private subnets contain the actual customer instances; public subnets contain a NAT gateway and a dedicated StrongSWAN instance for connecting to the customer's other VPCs.
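
A hedged Terraform sketch of the per-customer skeleton (one AZ shown; the real thing repeats the subnet pair per AZ, and all names/CIDRs are hypothetical):

resource "aws_vpc" "epriestley" {
  cidr_block = "10.10.0.0/16"

  tags {
    Customer = "epriestley"  # cost-allocation tag
  }
}

resource "aws_subnet" "public_a" {
  vpc_id            = "${aws_vpc.epriestley.id}"
  cidr_block        = "10.10.0.0/24"
  availability_zone = "us-west-2a"
}

resource "aws_subnet" "private_a" {
  vpc_id            = "${aws_vpc.epriestley.id}"
  cidr_block        = "10.10.1.0/24"
  availability_zone = "us-west-2a"
}

# NAT gateway lives in the public subnet; the private subnet routes
# outbound traffic through it.
resource "aws_eip" "nat_a" {
  vpc = true
}

resource "aws_nat_gateway" "nat_a" {
  allocation_id = "${aws_eip.nat_a.id}"
  subnet_id     = "${aws_subnet.public_a.id}"
}

resource "aws_route_table" "private_a" {
  vpc_id = "${aws_vpc.epriestley.id}"

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = "${aws_nat_gateway.nat_a.id}"
  }
}

resource "aws_route_table_association" "private_a" {
  subnet_id      = "${aws_subnet.private_a.id}"
  route_table_id = "${aws_route_table.private_a.id}"
}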

I'm going to continue going down the road of "one VPC per customer" unless I come across something scary, or someone else has an objection.

bastion host per subnet

Is "subnet" a typo or am I thinking of the wrong subnets? Not 4 bastions per customer, right? Don't the bastions have to be in the customer VPCs?

bastion host per subnet

Is "subnet" a typo or am I thinking of the wrong subnets? Not 4 bastions per customer, right? Don't the bastions have to be in the customer VPCs?

So, not a typo, but definitely confusing. Subnets can't span AZs, and AZs can fail independently, so we'd want one bastion per AZ for HA (which is equivalent to one bastion per subnet) -- not 4 bastions per customer. And the bastions don't need to be in the customer VPCs, because we can set up VPC peering between the admin and customer VPCs (which is an AWS thing, not a VPN thing, since we'd be peering within a region).

That does mean that a bastion compromise would give an attacker access to all customer VPCs, but I'm having a hard time coming up with an attack where someone would be able to compromise the admin VPC's bastion, but not any dedicated customer bastion.
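
For concreteness, same-region peering is cheap to express. A hedged sketch (all names and references are hypothetical):

resource "aws_vpc_peering_connection" "admin_to_epriestley" {
  vpc_id      = "${aws_vpc.admin.id}"
  peer_vpc_id = "${aws_vpc.epriestley.id}"
  auto_accept = true  # same account, same region
}

# Route from the admin bastion subnet into the customer VPC.
resource "aws_route" "bastion_to_epriestley" {
  route_table_id            = "${aws_route_table.admin_private.id}"
  destination_cidr_block    = "${aws_vpc.epriestley.cidr_block}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.admin_to_epriestley.id}"
}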

Ah, that makes sense. Everything else sounds entirely reasonable to me on the "VPC per customer" plan.

We should enumerate all the ports we're going to open up so we can do the security groups sooner rather than later. My list:

Public-facing Load Balancers

  • 22->2222 for git+ssh
  • 443->8443 for HTTPS
  • 22280->22280 for aphlict (or we could possibly use path-based routing on the ALB instead of opening up a 2nd port?)
  • 80 for redirects to HTTPS

Private-subnet Instances

  • 22 for real SSH, only allowed from VPC peers or VPN instances
  • 2222 for git+ssh from LB
  • 8443 from LB
  • Whatever needs to be reachable from admin for fetching configs and whatever
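
A hedged Terraform sketch of the private-subnet instance group described above (SG references and the admin CIDR are hypothetical):

resource "aws_security_group" "cluster_instances" {
  name   = "cluster-instances"
  vpc_id = "${aws_vpc.epriestley.id}"

  # git+ssh, forwarded from the public LB's port 22.
  ingress {
    from_port       = 2222
    to_port         = 2222
    protocol        = "tcp"
    security_groups = ["${aws_security_group.public_lb.id}"]
  }

  # HTTPS from the LB.
  ingress {
    from_port       = 8443
    to_port         = 8443
    protocol        = "tcp"
    security_groups = ["${aws_security_group.public_lb.id}"]
  }

  # Real SSH, only from VPC peers / VPN instances.
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]  # admin VPC (hypothetical)
  }
}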

VPN Host
Do we want to enable real SSH from 0.0.0.0/0 to the customer VPN hosts, making VPN hosts de facto bastions? The alternative is that we can only access customer environments by going laptop-->admin VPC bastion-->admin VPN host--(VPN)-->customer VPN host. Also, if we use VPN hosts for carrying admin traffic, we need to make sure the VPN doesn't allow arbitrary traffic to flow back across the VPN to admin, and make double-sure that you can't route to another customer's VPC by using our VPN.

Also, I just checked, and the AWS limit is "500 security groups per VPC" and I can't find a global or region limit (except for EC2-classic), so I don't think we'll be hitting that limit any time soon.

22280->22280 for aphlict (or we could possibly use path-based routing on the ALB instead of opening up a 2nd port?)

If we're using ELBs for this (which it looks like we may be able to, per above?) I don't think (?) we can do any path-based stuff.

(Even if we can, I think separating this service is probably good in the long run -- having more flexibility around how this traffic routes seems likely to be more valuable than opening fewer ports?)

Do we want to enable real SSH from 0.0.0.0/0 to the customer VPN hosts, making VPN hosts de facto bastions?

We don't need VPN hosts in most customer subnets, right? If a customer only has a USWest presence, they only need actual Phabricator hardware? I'd guess that most private clusters will fall into this bucket, in which case VPN hosts may not be present, so we probably shouldn't give them extra responsibilities.


We also need 3306 (or some other port) for MySQL replication.

I think your list is otherwise exhaustive.

admin currently does almost everything over SSH, although it does connect directly to databases on 3306 in some cases. This is a little janky and we could replace it with an HTTP API call.


Other theoretical stuff I can try to imagine:

  • Would we ever want to open SMTP on 25 (or 465/587 for secure SMTP) to receive inbound mail? I'm not sure how we'd set this up, exactly. It would probably be easier to tell customers "configure your mailserver to forward *@x to *@internalname.phacility.com" than to try to run mailservers ourselves.
  • Uh, maybe we might want to expose Phabricator as an LDAP server some day? But probably not?
  • Maybe CalDAV? But that's over HTTP so we don't need new ports.
  • Video spreadsheets? Factorio server add-on? Music streaming?

If we're using ELBs for this (which it looks like we may be able to, per above?) I don't think (?) we can do any path-based stuff.

Yeah, per @joshuaspence, we can do websockets over ELBs, so in theory we could put all three services on an ELB. I'm slightly inclined to prefer using an ALB for the web stuff, just because it's the new hotness and ALBs explicitly advertise their support for websockets. I'll look at the pricing breakdown and I might do some perf testing just to see if there's any real difference.

We don't need VPN hosts in most customer subnets, right?

If we don't put VPN hosts in every customer subnet, we need to either A) expose public SSH on at least one of the customer's instances so we can get in to manage it, or B) use VPC peering to connect the customer to our admin VPC. We'll eventually hit the hard limit of 125 VPC peering connections per VPC, but that's plenty of headroom to start building the product. Worst-case scenario, we get around this by having multiple admin VPCs per region.

We also need 3306 (or some other port) for MySQL replication.

And in the current model where all services run on every host, every instance for a given customer will need that replication traffic, right? Both locally within a region and between regions?

Would we ever want to open SMTP on 25 (or 925 or whatever the secure SMTP port is) to receive inbound mail?

Yeah, I have no idea how we'd handle that. D18205 is probably a better path than doing our own SMTP stuff?

I'll look at the pricing breakdown and I might do some perf testing just to see if there's any real difference.

One possible non-obvious difference is idle connection timeout settings, although websocket clients should now be able to survive those with the right ELB settings.

And in the current model where all services run on every host, every instance for a given customer will need that replication traffic, right? Both locally within a region and between regions?

For larger clusters I'd expect to specialize hardware (e.g., at 8 hosts some of them are dedicated web hosts, not just 8 everything hosts) but yeah, each region will replicate between AZs and then to the other regions if they exist.

Are we going to let customers start with a single EC2 instance, or require them to have at least some form of HA? My plan is to always create the VPC/subnet infrastructure assuming that the customer will have a second AZ (and ALBs for example require you to listen on at least two different subnets).

I think the minimum size is 2 hosts, in separate AZs, in one region. e.g., everything1 in us-west-1a and everything2 in us-west-1c, so that assumption is sound.

In the modern era: I think we generally understand what private clusters will look like now, but I'd like to take a much more iterative approach to getting there than we have in the past. I had this concern above (circa June 2017):

I'm concerned that [pursuing infrastructure-as-code] may turn a 2-month project into a 6-month project if we write something ourselves, or a 12-month project if we use Ansible, Chef or Puppet.

A lot has happened since then, but it looks like this was probably optimistic. Here's how I'd like to move forward instead:


First:

  • Make shard rebalancing work ("move instance X from host Y to host Z").
  • Move to the new cache API and support renaming.

These are covered in more detail in T13076. They aren't interdependent.

Rebalancing is really only blocked by bin/host being unable to download files larger than 2GB. D19011 moves us much of the way toward a fix; it still needs integration into HTTPSFuture and testing.

The cache API probably works (at least mostly) on the rSAAS side, but needs implementation in rSERVICES. A gradual way to implement it is to first use it only as a fallback for handling 404s.


Then I'd like to make bin/provision fully provision modern shared hosts (with EBS, etc) and gradually provision a new replacement cluster with NAT, migrating all instances to it and rebalancing them as we go. We'll end up with a compacted, rebalanced cluster behind NAT with no public IPs.

Along the way, the phage --pools flag should become a little more powerful and more operational tasks should look at Almanac on admin to identify hosts and pools. T12414 will also need to happen, although I expect that to complete this week. More types of devices and services should gain representations in Almanac.

Throughout this, we're gradually moving toward a world where Almanac on admin has a complete specification for all the hardware we have running and phage reads it to execute bulk operational tasks, and we can provision new shared-cluster hosts (db, repo, web, etc) completely from bin/provision with no web UI button pressing. Other routine tasks (rebalancing, storage resizing) are also probably supported from bin/xyz by now. We should also generally have confidence this model works since we used it to move the whole cluster.


Now we let bin/provision build new private clusters, reusing all the existing provisioning support. The clusters probably look like the CF clusters in D18265 but we arrive there with more support on the software-deployment side of things and the smallest leap from shared to private clusters we can manage in terms of new software and new processes.

After T13630:

  • New hardware is provisioned with Piledriver, which is approximately "Terraphorm", except it doesn't delete all your resources by default (see previously: T12856).
  • All relevant hardware is in a private subnet with appropriate NAT/IGW configuration.
  • Instance data can be freely moved between shards.
  • Instances support complex clustering configuration (and have successfully run with it in production for extended periods of time with high read/write rates), have appropriate domain management support, etc.

The only real remaining barrier to private cluster support is adding billing support, which I don't plan to pursue since operations are winding down.