
Aphlict slowly grows to use an uncomfortably large amount of memory if the disk is full
Open, WishlistPublic


I caught a ~1.5GB aphlict after about 24 hours / 100K connections.

We currently have a ~5.4GB aphlict on notify001 after about 16 hours.

On notify001 this isn't really an issue since the thing just restarts after exploding, I think. However, this can create memory pressure on hosts like secure001 where the process can't have the whole server.

I'm pretty sure this didn't previously happen (I think I recall months of uptime for aphlict last year?) so there may have been, e.g., a node update which changed something we depend on. Whatever the issue is, it is likely a problem on our side, but it may not have been exposed before.

Event Timeline

The Aphlict logs don't reveal anything useful, and could use some attention: they're heavily geared toward setup/debugging, not maintenance/operations.

At least locally, I can't manage to get node to leak:

$ cat simple_server.js 
'use strict';

var http = require('http');

http.createServer(function(request, response) {
  // Force a full collection (requires --expose-gc) so we measure
  // live memory, not garbage awaiting collection.
  global.gc();

  var bytes = process.memoryUsage().rss;
  var mb = parseInt(10 * (bytes / (1024 * 1024)), 10) / 10;

  response.writeHead(404, 'Not Found');
  response.end(mb + 'MB\n');
}).listen(8080);
$ node --expose-gc simple_server.js 
$ ab -c100 -n10000 http://127.0.0.1:8080/


I actually can't get it to go above 20.9MB, so maybe this is just some finite-sized caches that aren't reachable by global.gc() filling up, or something like that.

I think the ~1.5GB node instance size is also expected.

Currently, by default v8 has a memory limit of 512MB on 32-bit systems, and 1.4GB on 64-bit systems.

I think this means that Node's default behavior is to not run the GC until it uses 1.4GB+ of memory. So any process around 1.5GB may just be stable and not recently GC'd. We can adjust this option.

(I'm having a somewhat-difficult time finding up-to-date documentation for this stuff since every google result is someone asking "how do I fix OOM errors?" and someone else saying "just use this flag to set it to infinity!!!".)

I haven't been successful locally in getting a GC'd node to reach an appreciable level of memory consumption, and the production servers are currently at ~1.5GB.

I think the large aphlict size in production yesterday may have been caused by the log volume running out of space. I noticed that incidentally and resolved it. So a possible explanation here is:

  • aphlict is fine.
  • aphlict grows to about 1.5GB because this is the default behavior of node. We can adjust this.
  • When trying to log to a disk with no space, aphlict can grow larger. This should be fixed at some point, but is probably not worth fixing any time soon (although reducing log verbosity would be nice).

Yeah, it looks like this is almost certainly the expected behavior:

writable.write(): This return value is strictly advisory. You MAY continue to write, even if it returns false. However, writes will be buffered in memory, so it is best not to do this excessively. Instead, wait for the drain event before writing more data.

So the behavior of log.write() appears to be:

  • Try to write to disk.
  • If that fails (for example, because the disk is full), buffer the string in memory and return false.

This is a Very Interesting Architecture Choice.

So we're mostly fine here, but should fix this at some point.

epriestley renamed this task from Aphlict slowly grows to use an uncomfortably large amount of memory to Aphlict slowly grows to use an uncomfortably large amount of memory if the disk is full.Mar 31 2016, 6:44 PM
epriestley lowered the priority of this task from Normal to Wishlist.