
Build hosts run out of disk space because Drydock doesn't clean up working copies automatically
Closed, Duplicate (Public)

Description

Ref T9872. Drydock doesn't clean up working copies automatically (so I end up with lots of copies of the same thing), and the UI doesn't give me any decent way of seeing which working copies are unlikely to be used for a while and can be manually released.

My build hosts run out of disk space because of this, and it's a nightmare to fix (because I have to SSH through an SSH tunnel through a VPN to get to the build hosts to clean them up).

As per T9872, many of my repositories have submodules, which dramatically increases the size of a non-bare working copy. A cache of bare repositories is often much smaller, which is why I wrote the original working copy blueprint that way.

Since these build hosts reside on physical hardware, I can't just magically add more disk space like on EC2.

Event Timeline

hach-que raised the priority of this task from to Needs Triage.
hach-que updated the task description.
hach-que updated the task description.
hach-que added subscribers: hach-que, epriestley.

so I end up with lots of copies of the same thing

I can't reproduce this. For example, Blueprint 6 on this host (which runs all builds for libphutil, arcanist and Phabricator) has allocated a total of only 4 working copies in more than a month. It has performed hundreds of builds in this time.

run out of disk space

I can't reproduce this. Since Drydock has only allocated 4 working copies, the disk on the host is about 6% utilized.

nightmare to fix

I can't reproduce this. Destroying resources from the web console destroys working copies for me.

This usually happens when rebasing an upstream repository after a significant amount of time (like six months). When this happens, there'll be hundreds of builds all running at the same time, which can result in hundreds of working copies for the same repository.

In old Drydock, this was annoying, but tolerable, because at least all those repos would get cleaned up afterwards. In the upstream Drydock though, there's no expiry on them, so they just sit around forever. Over time, the build host will run out of disk space.

I can't reproduce this. Destroying resources from the web console destroys working copies for me.

It's a nightmare because the new Drydock doesn't tell me what working copies are on what hosts or what they have cloned. If the Linux build agent is out of disk space, I don't want to trawl through working copies that belong to the Mac build agent.

FYI, my Phabricator instance currently has 80 working copy resources open.

When this happens, there'll be hundreds of builds all running at the same time, which can result in hundreds of working copies for the same repository.

You can use the "limit" configuration on WorkingCopy resources to limit the number of allowed simultaneous active resources. Other jobs will wait until resources free up.

It's a nightmare because the new Drydock doesn't tell me what working copies are on what hosts or what they have cloned.

Why is this important?

If the Linux build agent is out of disk space, I don't want to trawl through working copies that belong to the Mac build agent.

What do you want to do? Why?

Why is the build agent running out of disk space?

Because if I don't clean it up, then every build from then on will fail until I fix it?

So you have a machine full of unused working copies, and Drydock is failing every build instead of reusing them?

It's running out of disk space because Drydock isn't cleaning up the working copies and they're consuming all the disk space.

I don't want a global limit on the number of working copies because they (unless I'm mistaken) still won't get cleaned up and instead Drydock will just fail to allocate.

still won't get cleaned up and instead Drydock will just fail to allocate.

They won't get cleaned up; they'll get reused.

If you have a limit of 16 working copies, Drydock will let 16 things run at once. The other 100 will wait until those resources are free, then reuse them.
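The reuse behavior described above can be illustrated with a toy model. This is only a sketch, not Drydock's scheduler: `simulate_builds` and its round-based timing are hypothetical, and it assumes every job can use any existing working copy (the case where a request matches none of them is not modeled).

```python
def simulate_builds(job_count, limit):
    """Toy model of a resource pool capped at `limit` working copies.

    Jobs run in rounds of at most `limit`. A job takes an idle working
    copy if one exists; otherwise a new one is allocated. Nothing is
    ever deleted, copies are only released back to the idle list.
    Returns (working_copies_created, rounds_needed).
    """
    idle = []      # released working copies, ready for reuse
    created = 0    # working copies allocated so far
    pending = job_count
    rounds = 0
    while pending > 0:
        batch = min(pending, limit)
        in_use = []
        for _ in range(batch):
            if idle:
                in_use.append(idle.pop())   # reuse an existing copy
            else:
                created += 1                # allocate, still under the limit
                in_use.append(created)
        pending -= batch
        idle.extend(in_use)                 # jobs finish and release
        rounds += 1
    return created, rounds


# 16 jobs start immediately, the other 100 wait and reuse.
print(simulate_builds(116, 16))  # (16, 8)
```

Under these assumptions the number of working copies on disk never exceeds the limit, no matter how many builds are queued.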

What happens if a new configuration / request comes in that doesn't match any of the available 16 working copies?

So you have a machine full of unused working copies, and Drydock is failing every build instead of reusing them?

It will fail if it tries to allocate another working copy while all the others are in use, but more often there simply isn't enough room left for build artifacts (because of unused working copies that are sitting around), which makes Harbormaster fail forever.

The old Drydock would delete working copies after it was done with them, and clone them from a cache, so unused working copies wouldn't consume lots of disk space (nor would build artifacts linger until the next use).
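The clone-from-cache, delete-after-use pattern described above might look roughly like the following. This is a sketch of the general idea, not the old blueprint's actual code: `ephemeral_working_copy`, `git_clone`, and the paths in the usage note are hypothetical names chosen for illustration.

```python
import shutil
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


def git_clone(source, dest):
    """Clone a local (for example, bare) repository into an empty directory."""
    subprocess.run(
        ["git", "clone", "--quiet", str(source), str(dest)],
        check=True,
    )


@contextmanager
def ephemeral_working_copy(cache_path, root, clone=git_clone):
    """Create a working copy from a local cache and delete it on exit.

    `clone` is injectable so the pattern can be exercised without git.
    """
    dest = Path(tempfile.mkdtemp(prefix="workingcopy-", dir=root))
    try:
        clone(cache_path, dest)
        yield dest
    finally:
        # Remove the checkout and any build artifacts left inside it,
        # whether the build succeeded or raised.
        shutil.rmtree(dest, ignore_errors=True)
```

A build step would then run inside `with ephemeral_working_copy(cache, root) as wc:`, where `cache` and `root` are whatever cache and lease directories the host uses. The trade-off is a clone per build in exchange for disk usage that returns to the size of the cache once builds finish.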

@epriestley I hit this again. These are some stats from SSHing into the machine:

linux:/srv/leases # df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        16G   16G     0 100% /
devtmpfs        998M     0  998M   0% /dev
tmpfs          1004M     0 1004M   0% /dev/shm
tmpfs          1004M  8.5M  995M   1% /run
tmpfs          1004M     0 1004M   0% /sys/fs/cgroup
linux:/srv/leases # du -sh **
12M     workingcopy-3694
11M     workingcopy-3696
11M     workingcopy-3697
5.8M    workingcopy-3698
12M     workingcopy-3699
11M     workingcopy-3700
6.7M    workingcopy-3701
11M     workingcopy-3702
1.6G    workingcopy-3703
228M    workingcopy-3704
676M    workingcopy-3705
1.4G    workingcopy-3706
114M    workingcopy-3707
675M    workingcopy-3708
1.4G    workingcopy-3709
678M    workingcopy-3710
1.4G    workingcopy-3711
111M    workingcopy-3712
687M    workingcopy-3713
111M    workingcopy-3714
111M    workingcopy-3715
673M    workingcopy-3716
673M    workingcopy-3717
673M    workingcopy-3718
133M    workingcopy-3722
112M    workingcopy-3725
41M     workingcopy-3728
113M    workingcopy-3731
113M    workingcopy-3734
62M     workingcopy-3741
62M     workingcopy-3743
71M     workingcopy-3746
69M     workingcopy-3749
69M     workingcopy-3752
69M     workingcopy-3755
66M     workingcopy-3787
69M     workingcopy-3790
66M     workingcopy-3792
66M     workingcopy-3796
3.9M    workingcopy-3955
3.9M    workingcopy-3957
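
A listing like the one above is easier to act on when it is totalled and ranked. The following is a small sketch for doing that with saved `du -sh` output. `parse_size` and `summarize` are hypothetical helpers, and because `du -h` rounds each figure, the total is only approximate.

```python
import re

# Binary multipliers for the suffixes printed by `du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a du -h size such as '1.6G' or '228M' to bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return float(number) * UNITS.get(unit, 1)


def summarize(du_output, top=3):
    """Return (approximate_total_bytes, largest_entries) for du -sh output."""
    rows = []
    for line in du_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip prompts, blank lines, and anything unexpected
        rows.append((parse_size(parts[0]), parts[1]))
    rows.sort(reverse=True)  # largest first
    total = sum(size for size, _ in rows)
    return total, rows[:top]


sample = """\
12M     workingcopy-3694
1.6G    workingcopy-3703
228M    workingcopy-3704
"""
total, largest = summarize(sample, top=1)
print(largest[0][1])  # workingcopy-3703
```

Feeding it the full listing would show which few working copies account for most of the space, which is the information the web UI was being asked to surface.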