
Aphlict slowly grows to use an uncomfortably large amount of memory if the disk is full
Open, WishlistPublic


I caught a ~1.5GB aphlict after about 24 hours / 100K connections.

We currently have a ~5.4GB aphlict on notify001 after about 16 hours.

On notify001 this isn't really an issue since the thing just restarts after exploding, I think. However, this can create memory pressure on hosts like secure001 where the process can't have the whole server.

I'm pretty sure this didn't previously happen (I think I recall months of uptime for aphlict last year?) so there may have been, e.g., a node update which changed something we depend on. Whatever the issue is, it is likely a problem on our side, but it may not have been exposed before.

Event Timeline

The Aphlict logs don't reveal anything useful, and could use some attention: they're heavily geared toward setup/debugging, not maintenance/operations.

At least locally, I can't manage to get node to leak:

$ cat simple_server.js 
'use strict';

var http = require('http');

http.createServer(function(request, response) {
  // Force a full collection (requires --expose-gc) so we measure
  // live memory, not garbage awaiting collection.
  global.gc();

  var bytes = process.memoryUsage().rss;
  var mb = parseInt(10 * (bytes / (1024 * 1024)), 10) / 10;

  response.writeHead(404, 'Not Found');
  response.end(mb + 'MB\n');
}).listen(8080);
$ node --expose-gc simple_server.js 
$ ab -c100 -n10000 http://127.0.0.1:8080/


I actually can't get it to go above 20.9MB, so maybe this is just some finite-sized caches that aren't reachable by global.gc() filling up, or something like that.

I think the ~1.5GB node instance size is also expected.

Currently, by default v8 has a memory limit of 512MB on 32-bit systems, and 1.4GB on 64-bit systems.

I think this means that Node's default behavior is to not run the GC until it uses 1.4GB+ of memory. So any process around 1.5GB may just be stable and not recently GC'd. We can adjust this option.

(I'm having a somewhat-difficult time finding up-to-date documentation for this stuff since every google result is someone asking "how do I fix OOM errors?" and someone else saying "just use this flag to set it to infinity!!!".)

I haven't been successful locally in getting a GC'd node to reach an appreciable level of memory consumption, and the production servers are currently at ~1.5GB.

I think the large aphlict size in production yesterday may have been caused by the log volume running out of space. I noticed that incidentally and resolved it. So a possible explanation here is:

  • aphlict is fine.
  • aphlict grows to about 1.5GB because this is the default behavior of node. We can adjust this.
  • When trying to log to a disk with no space, aphlict can grow larger. This should be fixed at some point, but is probably not worth fixing any time soon (although reducing log verbosity would be nice).

Yeah, it looks like this is almost certainly the expected behavior:

writable.write(): This return value is strictly advisory. You MAY continue to write, even if it returns false. However, writes will be buffered in memory, so it is best not to do this excessively. Instead, wait for the drain event before writing more data.

So the behavior of log.write() appears to be:

  • Try to write to disk.
  • If that fails (for example, because the disk is full), buffer the string in memory and return false.

This is a Very Interesting Architecture Choice.

So we're mostly fine here, but should fix this at some point.

epriestley renamed this task from Aphlict slowly grows to use an uncomfortably large amount of memory to Aphlict slowly grows to use an uncomfortably large amount of memory if the disk is full.Mar 31 2016, 6:44 PM
epriestley lowered the priority of this task from Normal to Wishlist.