
Move secure.phabricator.com halfway into the cluster
Closed, ResolvedPublic

Assigned To
Authored By
epriestley
Jun 26 2015, 4:30 PM
Tags
Referenced Files
F549694: SSL
Jun 27 2015, 11:53 PM
F547589: Screen Shot 2015-06-27 at 9.48.10 AM.png
Jun 27 2015, 4:49 PM
Tokens
"Manufacturing Defect?" token, awarded by chad.

Description

For reference/coordination, I plan to move secure.phabricator.com sort-of-halfway into the cluster soon.

  • The secure.phabricator.com host is approaching 2.5 years old, and is a previous-generation host (m1.large).
  • The reserved instance for it recently expired. The reservation gave us a small reason to keep it on the old hardware (marginally lower price), but that reason no longer applies.
  • A lot of the setup/configuration is one-off and has better alternatives in the cluster management tools (e.g., backup and deployment stuff is way better in the cluster toolset).

However, I want to make sure that, to the greatest degree reasonable, cluster disruptions don't prevent us from administrating the cluster or coordinating about resolution, so I don't want to bring the host completely into the cluster. My plan is:

  • I'll switch it to a cluster stack (ubuntu + apache).
  • I'll carve out a cluster tier for it and let the cluster deployment, backup, etc., tools work on it.
  • But it won't run Services, or the cluster firewall rules, and I'll keep it accessible over a public interface. So if the cluster is FUBAR'd we can still access it normally, and it won't depend on the cluster to run.

That should approximately give us the best of both worlds: one environment, one set of tools, but isolation between secure and the cluster for most kinds of cluster issues.

If the whole AWS datacenter drops out it will still kill everything, but we weren't isolated from that before, anyway.

Revisions and Commits

Event Timeline

epriestley claimed this task.
epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added a project: Ops.
epriestley added subscribers: epriestley, btrahan.

This is great. When, if ever, do you plan to be able to survive the whole AWS datacenter going out?

For communication, my plan today would just be to use IRC/email/phone calls. We have a small team, run operations mostly in the open, and are all regularly available on these other channels, and the likely course of action is "wait for AWS to come back up". I think that's generally reasonable until we have a larger operations team and more operational responses available, so I'm not particularly concerned about Conpherence/Maniphest on this instance being down.

For restoring service, there's essentially nothing we could do today. We'd need to be running the cluster across multiple datacenters to survive loss of a datacenter. The path forward there is approximately T4571, T4209, T4292, and those are also largely the tasks on the way to private clusters. I'll organize a more detailed roadmap here when I plan out private clustering. For now, I'm basically relying on AWS to prevent datacenter-level events and/or recover from them quickly, but they have a good track record on this.

epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:23 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:31 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:45 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:54 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:56 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 9:59 PM
  • I created a new instance.secure security group. I removed port 25 (SMTP), since we now use Mailgun for inbound mail. I removed port 843 (Flash sockets) since we no longer use them. I changed administrative SSH from 222 to 2222 for consistency with the vault tier. This leaves us with 22, 80, 443, 2222, and 22280.
  • I launched a secure001 host and added a secure001 DNS entry for it in Route 53.
  • I created and attached sbak001, srepo001 and sdata001 volumes. This is a little overkill but should work well with the tooling.
  • I created a new "secure" host role in rCORE44a038f, and deployed the bastion to get it to recognize the role.
  • I used bin/remote deploy secure001 --port 22 to do an initial deploy.
  • I added secure.conf and the missing sshd library in rCORE95a18f3.
  • I added a bunch of config files and stuff in rCORE78c6c9a. This got SSH moved to 2222 correctly, so --port 22 is no longer required (like vault, this tier does need it for initial deploys).
  • Kept fiddling with config (see diffs), thing is sort of deploying properly now, at least.
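
The port changes in the first bullet can be checked with a little set arithmetic. This is only a sketch: the "old" rule set below is an assumption inferred from the changes described (it is not listed anywhere in this task), and `migrate_ports` is a hypothetical helper, not part of any cluster tooling.

```python
# ASSUMPTION: the old security group's ports, inferred from the described changes.
OLD_PORTS = {22, 25, 80, 222, 443, 843, 22280}


def migrate_ports(old_ports):
    """Apply the changes described above to a set of open ports."""
    ports = set(old_ports)
    ports -= {25, 843}   # drop SMTP (Mailgun handles inbound mail) and Flash sockets
    ports.discard(222)   # administrative SSH moves from 222...
    ports.add(2222)      # ...to 2222, matching the vault tier
    return ports


NEW_PORTS = migrate_ports(OLD_PORTS)
print(sorted(NEW_PORTS))
```

Under that assumption the result matches the list given in the bullet: 22, 80, 443, 2222, and 22280.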
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 10:03 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 10:21 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 10:23 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2015, 10:27 PM

I think this is kind-of working now. I'm going to take a break for today and pick it back up tomorrow since I'm pretty close to bringing the data over, which might take a while, and I'm sort of committed to finishing once I start it.

I've completed the cluster deployment and am resuming this now.

epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 11:45 AM
Diffusion added a commit: Restricted Diffusion Commit.Jun 27 2015, 11:50 AM
epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 11:52 AM
epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 11:59 AM
epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 1:26 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 1:29 PM
  • I configured HTTP -> HTTPS in a standard way (on this host, it involves some legacy oddness).
  • I added host SSH keys and switched the tier to use them.
  • I verified external connectivity and reasonable-seeming repo connectivity.
  • I created secure001.phacility.net and sbak001.phacility.net devices on Admin, so generating routine backups will depend on the cluster being available. This seems acceptable as a dependency.
  • I added run.repo to the host attributes to get repo backups working and verified repo backups get written.
  • I dumped all the data and moved it over.
  • I renamespace'd the database to secure and loaded it, then made config adjustments to use this as the storage namespace.
  • I unpacked the repos into /core/repo/secure, then made a config adjustment for this.
  • I allocated a new elastic IP and swapped DNS over (we still have javelinjs on the old box, plus we can't reuse a non-VPC elastic IP in a VPC).
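
For readers unfamiliar with the renamespace step: the idea is to rewrite the database names in a SQL dump so they carry a different namespace prefix. The sketch below illustrates that idea only. It is not the actual `bin/storage renamespace` implementation; the function names are hypothetical, and the source namespace `phabricator` is an assumption about the old host's configuration.

```python
import re


def renamespace_line(line, old_ns, new_ns):
    """Rewrite the namespace prefix on CREATE DATABASE / USE lines only.

    Row data is left untouched because the pattern is anchored to the
    start of the line and only matches these two statement types.
    """
    pattern = re.compile(
        r"^(CREATE DATABASE\b[^`]*`|USE `)" + re.escape(old_ns) + "_"
    )
    return pattern.sub(lambda m: m.group(1) + new_ns + "_", line)


def renamespace_dump(lines, old_ns, new_ns):
    """Apply renamespace_line to every line of a dump."""
    return [renamespace_line(line, old_ns, new_ns) for line in lines]


# ASSUMPTION: the old host used the "phabricator" namespace.
dump = [
    "CREATE DATABASE /*!32312 IF NOT EXISTS*/ `phabricator_user`;",
    "USE `phabricator_user`;",
    "INSERT INTO `user` VALUES ('phabricator_user');",
]
for line in renamespace_dump(dump, "phabricator", "secure"):
    print(line)
```

Anchoring on statement type keeps the rewrite from touching row data that happens to contain the old namespace string.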

Issues encountered:

  • Moving files (~2GB of data) between old-secure and new-secure was very slow (about 30 minutes), because file transfer is still an unsolved problem in 2015.
  • In bin/storage renamespace, output redirected with the > operator silently fails when the disk is full. This is another strike against using it for anything. We should switch to --out flags in the long term, e.g. usage should be command --out y.file, not command > y.file.
  • Some bootstrapping nonsense as the host deployed from itself, about what I expected; no major issues here.
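
The --out suggestion above could look roughly like the following. This is a minimal sketch in Python rather than in the actual tooling, and the command, `write_output`, and `main` are all hypothetical. The point is that when the command owns the output file, a failed write (including a full disk) surfaces as an exception instead of disappearing into a shell redirect.

```python
import argparse
import os
import tempfile


def write_output(path, data):
    """Write data to path, raising OSError if any step fails.

    The data goes to a temporary file in the same directory first, so a
    partial write never replaces an existing good file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical dump command.")
    parser.add_argument("--out", required=True, help="file to write output to")
    args = parser.parse_args(argv)
    # Placeholder payload; a real command would generate the dump here.
    write_output(args.out, b"-- dump contents would go here\n")
    return 0
```

Calling `main(["--out", "y.file"])` corresponds to running `command --out y.file`; any write error propagates to the caller rather than being lost.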

Stuff I'm still tracking:

  • Backups work but won't be run automatically by cluster administration services, although they are reported. I'm just going to cron them for now.
  • phabot doesn't start up yet.
  • I need to move javelinjs.com over (or maybe I'll just decommission it).
  • I need to put a redirect in place for phabricator.com, etc.
  • Decommission the old host.
epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 1:58 PM

Backups work but won't be run automatically by cluster administration services, although they are reported. I'm just going to cron them for now.

I installed this with crontab -e.

I need to put a redirect in place for phabricator.com, etc.

I sorted these out in rCORE191c0aed.

I manually installed xhprof on the host since it needs some special setup steps that aren't trivial to integrate into the deployment tools, and isn't particularly important to have available.

epriestley added a commit: Restricted Diffusion Commit.Jun 27 2015, 2:58 PM

Oh, I need to hook up diviner / symbols, too. I'm probably just going to cron those. So still-remaining stuff I'm aware of is:

  • Get phabot running again.
  • Get diviner running.
  • Get symbols generating.
  • Deal with javelinjs.com.
  • Decommission old host.

I think I'm going to bring .org onto this host for simplicity/consistency, too. Putting it on corp is messy because of SSL, but putting it here for now should be straightforward.

Unsure if it is due to this but chrome is complaining about your ssl cert.

Can you be more specific? I can't reproduce this.

Screen Shot 2015-06-27 at 9.48.10 AM.png (989×1 px, 211 KB)

Currently on phone so I'm a bit limited but secure.phabricator.com is showing a red lock and the "Your connection is not private" warning.

Green now, is this behind an lb or could a server be misconfigured?

It's a single host and I didn't touch anything in the last couple hours.

iiam

Huh, well I don't think I'm crazy. I'll take a screenshot if it happens again.

I had the same thing as @ftdysa but it's fine now...

Yeah I'm seeing SSL errors as well. Not sure if it helps but there are some items in yellow on https://www.ssllabs.com/ssltest/analyze.html?d=secure.phabricator.com

epriestley added a commit: Restricted Diffusion Commit.Jun 28 2015, 2:03 AM
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 2:38 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 2:50 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 2:54 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 2:59 PM
  • I (probably?) corrected the HTTPS issue by splitting the certificate and certificate chain files.
  • I allowed the host to serve blog.phacility.com, javelinjs.com, and phabricator.org.
  • Javelin and org are working fine so I updated DNS. I'll decommission the existing .org host.
  • blog has some HTTP vs HTTPS stuff that I think I'm going to clean up in Phame itself.
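
To illustrate the certificate fix in the first bullet: if the certificate and its chain start out in one combined PEM file, they can be separated into two files, which on Apache would typically be referenced by `SSLCertificateFile` and `SSLCertificateChainFile`. The sketch below is an assumption about how such a split might be done, not a record of what was actually run. `split_bundle` is a hypothetical helper, and it assumes the host's own certificate comes first in the bundle.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_bundle(bundle_text):
    """Split a combined PEM bundle into (leaf certificate, chain).

    ASSUMPTION: the first certificate in the bundle is the host's own
    certificate and everything after it is the intermediate chain.
    """
    blocks = PEM_BLOCK.findall(bundle_text)
    if not blocks:
        raise ValueError("no PEM certificates found")
    leaf = blocks[0] + "\n"
    chain = "".join(block + "\n" for block in blocks[1:])
    return leaf, chain
```

The two returned strings would then be written to separate files for the web server configuration to reference.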

So remaining work is now:

  • Get phabot running again.
  • Get diviner running.
  • Get symbols generating.
  • Deal with blog HTTPS stuff.
  • Authorize trusted external keys.
  • Document everything.
  • Decommission old secure and org hosts.
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 4:42 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 4:53 PM

Get phabot running again.

Done, albeit in a slightly funky way.

Get diviner running.
Get symbols generating.

I put reasonable scripts for these in rCORE. I just installed them with crontab rather than trying to hook them into the upgrade/restart process.

epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 5:25 PM

Authorize trusted external keys.

I added an external authorization step for this tier.

Decommission old secure and org hosts.

I stopped traffic on the old .org.

Document everything.

@btrahan, I've updated the cluster docs in Phriction to discuss secure. Brief version is:

Activity | Old Command | New Command | Notes
SSH To Host (Disaster) | ssh -p 222 ec2-user@secure.phabricator.com | ssh -p 2222 ubuntu@secure.phabricator.com | Port change from 222 to 2222, user change to ubuntu.
SSH To Host (Routine) | ssh -p 222 ec2-user@secure.phabricator.com | local:~/phacility/core/ $ ./bin/remote ssh secure001 |
Upgrade | secure:/core/ $ ./bin/update | local:~/phacility/core/ $ ./bin/remote upgrade secure001 |

Basically, just use bin/remote ssh, bin/remote upgrade, bin/remote mysql, bin/remote deploy, etc, on this host now like any other cluster host.

If the bastion explodes, you can also connect to it directly (with ... -p 2222 ubuntu@...).
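
The routine-versus-disaster choice described above can be summarized as a small decision function. This is purely illustrative: `ssh_argv` is a hypothetical helper, and it only assembles the two commands already given in this comment.

```python
def ssh_argv(bastion_available):
    """Return the command (as an argv list) for reaching the secure host."""
    if bastion_available:
        # Routine access: run from ~/phacility/core/ through the cluster tooling.
        return ["./bin/remote", "ssh", "secure001"]
    # Disaster access: connect directly over the public interface.
    return ["ssh", "-p", "2222", "ubuntu@secure.phabricator.com"]


print(" ".join(ssh_argv(bastion_available=False)))
```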

epriestley added a commit: Restricted Diffusion Commit.Jun 29 2015, 9:16 PM

I've moved the blog and stopped traffic to the old secure host. I'll decommission both hosts in a few days if nothing crops up.

epriestley added a commit: Restricted Diffusion Commit.Jul 1 2015, 5:30 PM
epriestley added a commit: Restricted Diffusion Commit.Jul 1 2015, 6:11 PM

I decommissioned the old .org host. I'll give the old secure host another week or so since we did hit a couple minor config things where it was useful to double-check that there was a config change.

Need Graphviz install too, pls. Test Page

I don't want to install it on cluster hosts, see T7785.

I've decommissioned the old secure.phabricator.com host.