Put crontabs in VCS and deploy them during provisioning
Open, Normal, Public

Description

Relevant steps (from discussion in T12857):

  • Copy the crontabs on secure into a file and check them into core/.
    • Change bin/remote upgrade to install them on appropriate hosts. Note that the secure crontab is ONLY on secure001, not the whole tier.
    • Check for crontabs I might be forgetting on other tiers. I thought we had another one on admin but it doesn't look like we do -- but maybe I'm forgetting something.
    • Change bin/remote deploy to install tmpreaper during the apt-get phase.
    • Add a crontab for repo hosts and change bin/remote upgrade to install it (this one should go to the whole tier).
    • Deploy a repo host to test (maybe do this off-hours, since there will be some disruption to instances).
      • Write a lot of junk into /tmp and come back 8 hours later to see if it got removed, I guess?
    • Deploy the rest of the repo tier -- you can use --pools repo, but note that repo012 will fail (it's still part of the pool but no longer exists).
    • Since we don't use crontabs, I'm also not sure what they do if they fail. We probably don't have any reasonable options for attaching them to alerting today, though.
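
The targeting rule in the steps above (the secure crontab goes to secure001 only, the repo crontab goes to the whole tier) can be sketched roughly as follows. This is only an illustration: the crontab file names and the helper functions are hypothetical, and nothing here reflects how bin/remote is actually implemented.

```python
# Hypothetical sketch: file names and helpers are illustrative assumptions,
# not the actual layout of core/ or the behavior of bin/remote.
import re
from typing import Optional

# Crontabs installed on exactly one named host.
SINGLE_HOST_CRONTABS = {"secure001": "secure.crontab"}

# Crontabs installed on every host in a tier.
TIER_CRONTABS = {"repo": "repo.crontab"}


def tier_of(host: str) -> str:
    """Strip the trailing digits from a host name, e.g. 'repo012' -> 'repo'."""
    return re.sub(r"\d+$", "", host)


def crontab_for_host(host: str) -> Optional[str]:
    """Return the crontab file to install on a host, or None if it gets none."""
    if host in SINGLE_HOST_CRONTABS:
        return SINGLE_HOST_CRONTABS[host]
    return TIER_CRONTABS.get(tier_of(host))
```

With this shape, secure002 and admin001 get no crontab, while every repo host gets the same one.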

Event Timeline

On secure, I could only find a crontab for the ubuntu user, which is as follows:

0 6 * * * /core/bin/host backup
0 7 * * * /core/bin/host prune --force
0 8 * * * /core/conf/util/generate-documentation
0 9 * * * /core/conf/util/generate-symbols

I couldn't find any crontabs for admin or repo.

My recollection is a little fuzzy, but I think admin may have once had a hacky crontab for backups that I replaced with a normal backup trigger at some point when the infrastructure matured a little.

The "normal" backup process does host backup and (as appropriate) host prune.

admin also does documentation generation, but it's part of the restart step and happens so quickly that it's reasonable to keep there, probably indefinitely.

It would be nice to make secure use the same stuff as everything else, but:

  • The normal stuff is triggered on admin, and I think it's generally questionable for admin to be reaching "upward" to secure to perform write operations.
  • The generate-documentation and generate-symbols steps can both take a nontrivial amount of time.

That said, it's kind of bad that documentation generation and symbol regeneration happen on a cron instead of after pushing -- for example, this means that doc fixes don't go live for up to 24 hours, even if we deploy them immediately. But they can be slow, so it's not great if they block a push.
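
To make the "up to 24 hours" figure concrete, here is a small sketch that computes how long a deployed fix waits for the next daily run. The function name is made up for illustration, and it assumes naive local times with no DST handling; the 08:00 hour comes from the generate-documentation crontab entry above.

```python
from datetime import datetime, timedelta


def wait_until_daily_run(deployed_at: datetime, cron_hour: int) -> timedelta:
    """Time from a deploy until the next daily cron run at cron_hour:00.

    A deploy landing exactly on the run time is treated as having missed it.
    """
    next_run = deployed_at.replace(
        hour=cron_hour, minute=0, second=0, microsecond=0
    )
    if next_run <= deployed_at:
        next_run += timedelta(days=1)
    return next_run - deployed_at


# A doc fix deployed one minute after the 08:00 run waits almost a full day.
print(wait_until_daily_run(datetime(2024, 1, 1, 8, 1), cron_hour=8))
```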

Maybe we should add an auxiliary post-deploy phase, like the adjust phase, for these kinds of actions (or make them occur in the existing adjust phase). The adjust phase generally has similar properties to these actions: it should run after a push, but it isn't critical, it can be slow, and it's semi-optional in the sense that a failure doesn't cause any really major problems.

(Both adjust and doc/symbol generation affect an entire service tier rather than particular servers, too -- we only need to generate docs or make schema adjustments once, no matter how many hosts are in the secure or admin tiers.)
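
A rough sketch of what such a post-deploy phase might look like, assuming a per-host push step followed by once-per-tier actions whose failures are recorded rather than fatal. The function and parameter names are hypothetical and do not correspond to anything in bin/remote.

```python
from typing import Callable, Dict, List


def run_deploy(
    hosts: List[str],
    push: Callable[[str], None],
    post_deploy: Dict[str, Callable[[], None]],
) -> List[str]:
    """Push to every host, then run each post-deploy action once for the tier.

    Returns the names of post-deploy actions that failed. A push failure
    propagates and aborts the deploy; a post-deploy failure is only recorded.
    """
    for host in hosts:
        push(host)  # critical: an exception here aborts the deploy

    failed = []
    for name, action in post_deploy.items():
        try:
            action()  # runs once per tier, not once per host
        except Exception as exc:
            print(f"post-deploy step {name!r} failed: {exc}")
            failed.append(name)
    return failed
```

In this shape the push is never blocked by a slow or broken documentation build, but the caller still learns which steps failed.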