
Temporary directory fullness can cause daemon issues?
Open, Normal, Public

Description

This is pretty fuzzy, but here's a list of observations that might lead toward something actionable:

  • One or more processes are leaving a bunch of junk in /tmp/ at times. This is rare and hard to pin down so I haven't chased it down yet.
    • After manually cleaning up /tmp/ the problem doesn't happen again for months/years.
    • The actual files look like they're being written by us (same format as TempFile).
    • A lot of these files are exactly 4194304 bytes long (the size of one file storage chunk), which is probably a smoking gun (see the one-liner after this list).
    • So: incorrectly doing MIME type detection on all chunks of large files, and then not cleaning them up for some reason? Running MIME detection separately seems fine, though. So maybe this is somewhat subtle/involved.
  • We saw an issue with daemon stability for an instance in the presence of a full /tmp/, although nothing in the logs is really a smoking gun. However, it's possible that a full /tmp/ causes some other kind of issue at an awkward time between log setup phases.
    • We currently write to the normal system /tmp/, which is on the same volume as /, so a full /tmp/ also means a full /. The actual trigger could be some other kind of failure, like a failed pidfile update: we write pidfiles to /core/run/..., but that's also on the same volume, so it doesn't help.
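
As a quick check on that chunk-size observation, something like this might help (the path and flags are just a sketch; adjust for wherever the junk actually accumulates):

  # Count files directly under /tmp that are exactly one file storage chunk
  # (4194304 bytes) in size.
  find /tmp -maxdepth 1 -type f -size 4194304c | wc -l

  # The same files, with owner and timestamps, for a closer look.
  find /tmp -maxdepth 1 -type f -size 4194304c -exec ls -l {} +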

Event Timeline

Here's another clue, from the relevant host's error log:

[Tue Jun 20 11:33:43.180796 2017] [:error] [pid 18010] [client 172.30.0.108:55959] >>> UNRECOVERABLE FATAL ERROR <<<\n\nMaximum execution time of 30 seconds exceeded\n\n/core/lib/libphutil/src/future/exec/ExecFuture.php:885

So another piece here is possibly something like this:

  • We start an ExecFuture for the FileUpload, running some underlying VCS command like git cat-file.
    • Until D18138, this wrote the entire file to disk for MIME type detection in 4MB chunks. Now, we'll only write the initial 4MB chunk, which is an improvement, but doesn't fix the issue.
    • Until D18138, this file was held on disk longer than necessary. We now clean it up as soon as we're done. However, the difference in the windows should be extremely short (a few milliseconds) so I would not expect this to close the race window very much.
  • 30 seconds later, it gets hard killed by PHP's execution time limit, so destructors don't run and the temp file doesn't get cleaned up?
    • But I only see 54 failures in the log since yesterday -- even if all of them left a 4MB chunk on disk, that should be only 216MB of temporary files. We actually accumulated closer to 3GB of temporary files.

So this is still not really a convincing explanation.

Should we just add a crontab entry to clean /tmp to paper this over until we get it fixed for real?

Possibly.

Every repo host is equally affected, so I'd like to deploy crontabs as part of the regular deployment process if we use them as part of the approach here. That would require first codifying a handful of custom crontabs, including one on secure which regenerates documentation daily on only one host. This codification should happen anyway eventually, but it's a little bit of work, and wasted effort if we're switching to Chef/Salt/Ansible/etc soon anyway.

If we write such a script, it shouldn't just wipe /tmp, since it could remove files out from under working processes. I'm not sure what the best way to detect "unused files in /tmp" is. The obvious test is maybe "more than X hours old", which I think we can do with some kind of find -print0 -ctime ... | xargs -0 ... incantation directly in a crontab, or which we could deploy a separate script for. But this isn't a conclusive test, and at least one use of /tmp (spooling very large svn ls --xml results) was a response to a real case of the underlying svn command taking more than 24 hours to run for extremely large changes. If we remove this file every, say, 8 hours, the process will never complete for similar tasks. Possibly we could use mtime or atime instead, although I don't have a deep understanding of what updates atime offhand and I suspect a lot of things we don't care about may update it.
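
For reference, the incantation I have in mind is roughly this; the 24-hour threshold and the choice of ctime are placeholders, subject to all the caveats above:

  # Remove regular files directly under /tmp whose status hasn't changed in
  # more than a day. -xdev stays on this filesystem; -print0/-0 handle odd names.
  find /tmp -maxdepth 1 -xdev -type f -ctime +1 -print0 | xargs -0 -r rm -f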

A proxy test is that the temp files are written with the owning PID. If the PID no longer exists, we can safely remove the file. But I don't think we could reasonably write a one-liner to do this inside a crontab, so we'd need to deploy a separate script for the crontab to invoke.
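
If we do write that script, here is a sketch of the shape it might take, assuming the owning PID is the trailing run of digits in the filename (that naming assumption needs to be verified against what TempFile actually writes):

  #!/bin/sh
  # Sketch: remove temp files whose owning process no longer exists.
  # ASSUMPTION: the owning PID is the trailing digits of the filename.
  for f in /tmp/*; do
    [ -f "$f" ] || continue
    pid=$(printf '%s\n' "$f" | sed -n 's/.*[^0-9]\([0-9][0-9]*\)$/\1/p')
    [ -n "$pid" ] || continue
    # On Linux, /proc/<pid> exists only while the process is running.
    if [ ! -d "/proc/$pid" ]; then
      rm -f -- "$f"
    fi
  done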

The import which was driving disk usage has also completed (touched on in a separate support thread) and /tmp is no longer filling on this host.

tmpreaper should do the trick. It runs off atime by default. We could just set the time period to several days if we wanted to. Alternatively, if the filenames for extremely long-running jobs are predictable, there's a --protect '<shell_pattern>' argument we could use to avoid cleaning up those files.
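
If we go that route, the cron entry would presumably end up looking something like this; the schedule, the three-day window, and the protect pattern are all guesses, and the flags should be double-checked against the packaged tmpreaper:

  # /etc/cron.d/tmpreaper -- sketch only
  # Reap files under /tmp not accessed in 3 days, but never touch the
  # (hypothetically named) svn spool files.
  0 * * * *  root  tmpreaper --protect '/tmp/svn-*' 3d /tmp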

If you want to move forward with that:

  • Copy the crontabs on secure into a file and check them into core/.
  • Change bin/remote upgrade to install them on appropriate hosts. Note that the secure crontab is ONLY on secure001, not the whole tier.
  • Check for crontabs I might be forgetting on other tiers. I thought we had another one on admin but it doesn't look like we do -- but maybe I'm forgetting something.
  • Change bin/remote deploy to install tmpreaper during the apt-get phase.
  • Add a crontab for repo hosts and change bin/remote upgrade to install it (this one should go to the whole tier).
  • Deploy a repo host to test (maybe do this off-hours, since there will be some disruption to instances).
    • Write a lot of junk into /tmp and come back 8 hours later to see if it got removed, I guess? (One way to shortcut the wait is sketched after this list.)
  • Deploy the rest of the repo tier -- you can use --pools repo but note that repo012 will fail (it's part of the pool still but no longer exists).
  • Since we don't use crontabs, I'm also not sure what they do if they fail. We probably don't have any reasonable options for attaching them to alerting today, though.
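
For the "write a lot of junk" test step above, one way to avoid actually waiting 8 hours is to backdate the junk so it's already eligible for cleanup (the filenames, sizes, and count here are arbitrary):

  # Create a few chunk-sized junk files and backdate atime/mtime so the reaper
  # should remove them on its next pass.
  for i in 1 2 3 4 5; do
    dd if=/dev/zero of="/tmp/reaper-test-$i" bs=1M count=4 2>/dev/null
    touch -a -m -d '2 days ago' "/tmp/reaper-test-$i"
  done
  ls -l /tmp/reaper-test-*   # should be gone after the next cron run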

I think we should have our crontabs in version control regardless of whether or not we add tmpreaper to them, so I'll make a task for that.

Tangentially related to this task and to T12611: the traffic volume from the ongoing "attack" in T13003 filled /tmp on admin.phacility.com. I pruned some old logs for now, but the mitigations in T12611 (e.g., separate log volumes) would resolve this properly.

amckinley added a revision: Restricted Differential Revision. Apr 20 2018, 7:26 PM
amckinley added a commit: Restricted Diffusion Commit. Apr 20 2018, 9:08 PM

EC2 volume ddata005.phacility.net filled up, causing problems for instances hosted on db005 and leading to PHI771. I'll dig back into the CloudWatch monitoring stuff I set up a few months ago and make the db hosts report storage metrics the same way the repo hosts already do.
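
For what it's worth, the cron-driven reporting could be as simple as something like this; the namespace, metric name, and dimension below are placeholders and almost certainly don't match whatever the repo hosts report today:

  #!/bin/sh
  # Sketch: push root-volume usage to CloudWatch from cron.
  # Namespace/metric/dimension names are placeholders, not the real ones.
  used=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
  aws cloudwatch put-metric-data \
    --namespace 'Phacility/Hosts' \
    --metric-name 'RootDiskUsedPercent' \
    --dimensions "Host=$(hostname)" \
    --value "$used" \
    --unit Percent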