
Daemon overseers may have a slow memory leak, possibly in FuturePool
Closed, ResolvedPublic

Description

See PHI1849 and PHI1847. In both cases, a shard that is neither overallocated nor under any apparently unusual workload ran into the OOM killer.

Coarsely, phd-daemon processes are using 0.5% of host memory after launch and 0.9% after a couple days of uptime.

Narrowly, FuturePool almost certainly has a small leak via FutureIterator, since FutureIterator does not discard futures after iterating over them.

Event Timeline

epriestley created this task.

This synthetic script demonstrates the conceptual problem with FutureIterator:

<?php

require_once 'support/init/init-script.php';

$iterator = new FutureIterator(array());

$iterator->addFuture(new_future());
$iterator->addFuture(new_future());

foreach ($iterator as $future) {
  echo '<MEM> '.sprintf('%.2f', (memory_get_usage() / (1000 * 1000)))."\n";

  // The iterator retains every future it has already yielded, so memory
  // grows on each pass.
  $iterator->addFuture(new_future());

  sleep(1);
}

function new_future() {
  // Each future carries a 1MB payload so the growth is easy to see.
  return new ImmediateFuture(str_repeat('x', 1024 * 1024));
}
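
For contrast, here is roughly the behavior you would want instead (a hypothetical sketch, not the actual FutureIterator internals): an iterator that drops its reference to each future once the consumer has seen it.

<?php

// Hypothetical sketch only: a generator that discards each future once the
// caller has consumed it, so resolved futures become collectable instead of
// accumulating for the lifetime of the iteration.
function iterate_and_discard(array $futures) {
  foreach ($futures as $key => $future) {
    yield $future;

    // Drop this reference as soon as the caller is done with this future.
    unset($futures[$key]);
  }
}

The consuming loop in the script above can't fix this on its own, since it never holds the references that matter: whatever the real fix looks like, it has to release resolved futures from inside the iterator.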

This sort of thing should be able to happen with the FuturePool in Overseer, too. I was able to get a reasonable reproduction case for the leak by doing this:

  • Adding error_log(... memory_get_usage() ...) to the main overseer loop to monitor memory usage (a sketch of this follows the list).
  • Changing shouldWakePool() to return true.
  • Running bin/phd start.
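
The monitoring change is nothing sophisticated; something along these lines is sufficient (illustrative only, not an actual patch against the overseer code):

// Illustrative: emit memory usage once per pass through the overseer's main
// loop, so growth shows up in the daemon log.
error_log(
  sprintf(
    'overseer memory: %.2fMB',
    memory_get_usage() / (1000 * 1000)));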

This produces a state where:

  • The PullLocal daemon is always running, so the FuturePool can never empty and reset its FutureIterator.
  • The Taskmaster and other hibernating daemons cycle in and out of hibernation rapidly (about once per second, instead of once every few minutes).

This is otherwise a normal state; the hibernation cycle is just ~300x faster than it would normally be under an idle queue.

My expectation is that this will cause the FutureIterator inside the FuturePool to grow without bound, holding an ExecFuture for each child ever launched. This appears to be confirmed by the log, which shows the overseer's memory usage growing slowly and steadily from 15.70M to 15.93M over the course of a minute.

This corresponds to a leak rate of 312MB/day. Since this reproduction is artificially 300x faster, that extrapolates to only ~1MB/day at production speed, which isn't nearly fast enough to explain the production leak. However, the "300x faster" figure is based on an idle queue. If the queue routinely receives small tasks, pools may wake every few seconds, and the reproduction case might only be a bit faster than production.
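
For reference, that extrapolation in code form (no new figures here, just the ones quoted above):

<?php

// Scale the reproduction's observed leak rate back down by the artificial
// speedup to estimate what an idle production queue would see from this
// mechanism alone.
$repro_leak_mb_per_day = 312;  // observed in the accelerated reproduction
$artificial_speedup = 300;     // pools wake ~300x more often than an idle queue would

printf(
  "Estimated production-speed leak: ~%.1f MB/day\n",
  $repro_leak_mb_per_day / $artificial_speedup);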

Upshot: I'm not 100% confident this fully explains the observed behavior, but it looks like a real leak, and it's plausible that it accounts for all of the observations in production.

epriestley claimed this task.

I believe D21466 resolved this, since it hasn't cropped up again even though daemon runtimes between restarts have been very long at various periods in the last year.