Indexing a task with 2,000 comments required a lot of RAM in mid-2015
Closed, ResolvedPublic

Description

Last night I queued a large number of background jobs (reparsing commit messages and reindexing search documents). A few hours later, I noticed that our two Phabricator daemon hosts had died (they quite possibly died earlier than this, but it took me a few hours to notice).

It is worth noting that I have phd.taskmasters set to 8. The instance is a c3.large. The following graphs show the CPU usage over the past 24 hours:

My theory is that the taskmasters are autoscaling themselves and then eventually running out of memory and dying miserably. I've attached some relevant log files:

  • Actually the daemon log files aren't particularly useful here.

Event Timeline

joshuaspence raised the priority of this task from to Needs Triage.
joshuaspence updated the task description.
joshuaspence added a project: Daemons.
joshuaspence added a subscriber: joshuaspence.

Use --autoscale-reserve to prevent them from autoscaling past some amount of memory. For example, this will prevent autoscaling if less than 25% of RAM is free:

phabricator/ $ ./bin/phd restart --autoscale-reserve 0.25
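
As a rough illustration of what a reserve check like this implies, here is a minimal Python sketch. It is not Phabricator's implementation (Phabricator is written in PHP), and the names `may_scale_up` and `next_pool_size` are hypothetical. It assumes the reserve only gates growth of the pool: nothing is freed or killed to reach the target.

```python
def may_scale_up(free_bytes: int, total_bytes: int, reserve: float) -> bool:
    """Return True if the pool may add a worker.

    Hypothetical helper: growth is declined when the free fraction of RAM
    is below `reserve`. Nothing is freed or killed to reach the target.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0.0 <= reserve <= 1.0:
        raise ValueError("reserve must be between 0 and 1")
    return free_bytes / total_bytes >= reserve


def next_pool_size(current: int, maximum: int, free_bytes: int,
                   total_bytes: int, reserve: float) -> int:
    """Grow by at most one worker, and only when the reserve allows it."""
    if current < maximum and may_scale_up(free_bytes, total_bytes, reserve):
        return current + 1
    return current
```

Under these assumptions, a check of this kind cannot stop a worker that is already running from allocating more memory. It only stops new workers from being added.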

Shouldn't we do something similar by default?

eadler added a subscriber: eadler. Jul 18 2015, 7:04 AM

That isn't a great solution - it should be possible to set a max ram used, rather than max percent free: http://blogs.msdn.com/b/oldnewthing/archive/2012/01/18/10257834.aspx

Shouldn't we do something similar by default?

If we do, it means that we sometimes end up in a scenario where the user says "launch 100 daemons" and we launch, say, 7 daemons. This is likely to be confusing, and I don't see a great way to explain to the user what we're doing and why.

If the user says "launch 100 daemons", the behavior is more clear if we launch 100 daemons and OOM the box than if we launch 7 daemons and prevent the OOM. This behavior is probably not what the user wanted, but it's what they asked for, and I generally think it's more important to be clear than to prevent users from doing silly things. T8761 has some additional discussion of a similar case.

This flag should be documented and autoscaling should be discussed in the documentation, but the docs need updates in general.

That isn't a great solution - it should be possible to set a max ram used, rather than max percent free

This post discusses userspace applications running on desktops written in relatively low-level languages. I don't think it's applicable here. In particular, we do not free memory to hit a target allocation; we only decline to scale the pool.

Hmm. It has happened again, even with --autoscale-reserve 0.25.

This has happened three times today :(

I'm not sure if it's related, but late last week I ran ./bin/search index --all --background and that roughly coincides with the observed OOM behavior.

OK. If I get rid of the explicit phd.taskmasters setting then it seems to work. With phd.taskmasters set to 8 it seems to lock up at 100% CPU usage after a few minutes.

Ah, I think this is relevant...

josh@ip-10-157-83-125:/usr/src/phabricator$ ./bin/worker execute --trace --id 18282255
>>> [2] <connect> phabricator_worker
<<< [2] <connect> 4,840 us
>>> [3] <query> SELECT * FROM `worker_activetask` WHERE id IN ('18282255') 
<<< [3] <query> 1,075 us
>>> [4] <query> SELECT * FROM `worker_archivetask` WHERE (id in (18282255)) ORDER BY id DESC 
<<< [4] <query> 1,055 us
>>> [5] <connect> phabricator_worker
<<< [5] <connect> 3,903 us
>>> [6] <query> UPDATE `worker_activetask` SET `failureTime` = NULL, `taskClass` = 'PhabricatorSearchWorker', `leaseOwner` = NULL, `leaseExpires` = '1437394542', `failureCount` = '0', `dataID` = '18311600', `priority` = '4000', `objectPHID` = NULL, `id` = '18282255' WHERE `id` = '18282255'
<<< [6] <query> 10,543 us
>>> [7] <query> SELECT * FROM `worker_taskdata` WHERE id = 18311600 
<<< [7] <query> 1,015 us
Executing task 18282255 (PhabricatorSearchWorker)...>>> [8] <connect> phabricator_maniphest
<<< [8] <connect> 4,070 us
>>> [9] <query> SELECT `task`.*  FROM `maniphest_task` task  WHERE (task.phid in ('PHID-TASK-bclylvkmj6c5kdn74s26'))   ORDER BY `task`.`id` DESC 
<<< [9] <query> 1,207 us
>>> [10] <query> SELECT * FROM `maniphest_transaction` x WHERE (objectPHID IN ('PHID-TASK-bclylvkmj6c5kdn74s26')) ORDER BY `id` DESC 
<<< [10] <query> 9,847,215 us
Killed
mysql> SELECT COUNT(id) FROM phabricator_maniphest.maniphest_transaction WHERE (objectPHID IN ('PHID-TASK-bclylvkmj6c5kdn74s26'));
+-----------+
| COUNT(id) |
+-----------+
|      1996 |
+-----------+
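
The trace shows the worker being killed right after a single query that selects every transaction for the task. One generic way to bound memory in a case like this is to read the rows in fixed-size pages instead of all at once. The sketch below illustrates that idea in Python against an in-memory SQLite table. It is not how Phabricator loads transactions, and it is not a description of the eventual fix. The `iter_transaction_pages` helper, the simplified two-column schema, and the `PHID-TASK-example` value are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def iter_transaction_pages(conn: sqlite3.Connection, object_phid: str,
                           page_size: int = 100) -> Iterator[List[Row]]:
    """Yield rows for one object in id-descending pages (keyset paging)."""
    last_id = None
    while True:
        if last_id is None:
            cur = conn.execute(
                "SELECT id, comment FROM maniphest_transaction "
                "WHERE objectPHID = ? ORDER BY id DESC LIMIT ?",
                (object_phid, page_size))
        else:
            # Resume strictly below the last id seen, so no row repeats.
            cur = conn.execute(
                "SELECT id, comment FROM maniphest_transaction "
                "WHERE objectPHID = ? AND id < ? ORDER BY id DESC LIMIT ?",
                (object_phid, last_id, page_size))
        page = cur.fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]


# Demo table (hypothetical, simplified schema) with the row count seen above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maniphest_transaction "
             "(id INTEGER PRIMARY KEY, objectPHID TEXT, comment TEXT)")
conn.executemany(
    "INSERT INTO maniphest_transaction (objectPHID, comment) VALUES (?, ?)",
    [("PHID-TASK-example", "comment %d" % i) for i in range(1996)])

page_sizes = [len(p) for p in iter_transaction_pages(conn, "PHID-TASK-example")]
total_rows = sum(page_sizes)
```

With this approach, at most `page_size` rows are held in memory per step, at the cost of more round trips to the database.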

This happened again today, presumably because I added a new differential revision to PHID-TASK-bclylvkmj6c5kdn74s26.

epriestley renamed this task from "Phabricator daemons autoscale themselves to OOM" to "Indexing a task with 2,000 comments required a lot of RAM in mid-2015". Feb 21 2017, 12:19 AM
epriestley triaged this task as Normal priority.
epriestley added a project: Search.

I guess this one can live for now since that's reproducible/actionable.

epriestley closed this task as Resolved. Feb 15 2019, 2:00 AM
epriestley claimed this task.

Presumably resolved elsewhere by D19503.