Indexing a task with 2,000 comments required a lot of RAM in mid-2015
Closed, Resolved (Public)

Authored By

	joshuaspence
	Jul 17 2015, 4:28 AM

Description

Last night I queued a large number of background jobs (reparsing commit messages and reindexing search documents). A few hours later, I noticed that our two Phabricator daemon hosts had died (they quite possibly died earlier than this, but it took me a few hours to notice).

It is worth noting that I have phd.taskmasters set to 8. The instance is a c3.large. The following graphs show the CPU usage over the past 24 hours:

My theory is that the taskmasters are autoscaling themselves and then eventually running out of memory and dying miserably. I've attached some relevant log files:

kern.log (4 MB)
Actually the daemon log files aren't particularly useful here.

Related Objects

Mentioned In: T12337: Provide conduit access to related tasks and in particular 'mentions'
Mentioned Here: D19503: Index only the first 1,000 comments on any object
T8761: Projects shouldn't allow setting "joinable by" to itself

Event Timeline

joshuaspence created this task. Jul 17 2015, 4:28 AM

joshuaspence set the priority of this task to Needs Triage.

joshuaspence updated the task description.

joshuaspence added a project: Daemons.

joshuaspence added a subscriber: joshuaspence.

Use --autoscale-reserve to prevent them from autoscaling past some amount of memory. For example, this will prevent autoscaling if less than 25% of RAM is free:

phabricator/ $ ./bin/phd restart --autoscale-reserve 0.25
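A minimal sketch of the kind of check a reserve setting like this implies, assuming the pool compares the fraction of free RAM against the reserve before adding a worker. The names `free_fraction` and `may_scale_up` and the byte counts are hypothetical and are not taken from Phabricator's source.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def free_fraction(free_bytes: int, total_bytes: int) -> float:
    """Return the share of total RAM that is currently free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def may_scale_up(free_bytes: int, total_bytes: int, reserve: float) -> bool:
    """Allow the pool to grow only while at least `reserve` of RAM is free.

    Hypothetical helper: with reserve=0.25, growth is declined as soon as
    less than 25% of RAM is free.
    """
    return free_fraction(free_bytes, total_bytes) >= reserve


# Illustrative numbers only (not measurements from the hosts in this task).
print(may_scale_up(2 * GIB, 4 * GIB, 0.25))    # 50% free: growth allowed
print(may_scale_up(512 * MIB, 4 * GIB, 0.25))  # 12.5% free: growth declined
```

Under these assumptions the check only gates the decision to add a worker. It places no limit on how much memory a worker that is already running goes on to allocate.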

Shouldn't we do something similar by default?

That isn't a great solution - it should be possible to set a max ram used, rather than max percent free: http://blogs.msdn.com/b/oldnewthing/archive/2012/01/18/10257834.aspx

Shouldn't we do something similar by default?

If we do, it means that we sometimes end up in a scenario where the user says "launch 100 daemons" and we launch, say, 7 daemons. This is likely to be confusing, and I don't see a great way to explain to the user what we're doing and why.

If the user says "launch 100 daemons", the behavior is more clear if we launch 100 daemons and OOM the box than if we launch 7 daemons and prevent the OOM. This behavior is probably not what the user wanted, but it's what they asked for, and I generally think it's more important to be clear than to prevent users from doing silly things. T8761 has some additional discussion of a similar case.

This flag should be documented and autoscaling should be discussed in the documentation, but the docs need updates in general.

That isn't a great solution - it should be possible to set a max ram used, rather than max percent free

This post discusses userspace applications running on desktops written in relatively low-level languages. I don't think it's applicable here. In particular, we do not free memory to hit a target allocation; we only decline to scale the pool.

Hmm. It has happened again, even with --autoscale-reserve 0.25.

This has happened three times today :(

I'm not sure if it's related, but late last week I ran ./bin/search index --all --background and that roughly coincides with the observed OOM behavior.

OK. If I get rid of the explicit phd.taskmasters setting then it seems to work. With phd.taskmasters set to 8 it seems to lock up at 100% CPU usage after a few minutes.

Ah, I think this is relevant...

josh@ip-10-157-83-125:/usr/src/phabricator$ ./bin/worker execute --trace --id 18282255
>>> [2] <connect> phabricator_worker
<<< [2] <connect> 4,840 us
>>> [3] <query> SELECT * FROM `worker_activetask` WHERE id IN ('18282255') 
<<< [3] <query> 1,075 us
>>> [4] <query> SELECT * FROM `worker_archivetask` WHERE (id in (18282255)) ORDER BY id DESC 
<<< [4] <query> 1,055 us
>>> [5] <connect> phabricator_worker
<<< [5] <connect> 3,903 us
>>> [6] <query> UPDATE `worker_activetask` SET `failureTime` = NULL, `taskClass` = 'PhabricatorSearchWorker', `leaseOwner` = NULL, `leaseExpires` = '1437394542', `failureCount` = '0', `dataID` = '18311600', `priority` = '4000', `objectPHID` = NULL, `id` = '18282255' WHERE `id` = '18282255'
<<< [6] <query> 10,543 us
>>> [7] <query> SELECT * FROM `worker_taskdata` WHERE id = 18311600 
<<< [7] <query> 1,015 us
Executing task 18282255 (PhabricatorSearchWorker)...>>> [8] <connect> phabricator_maniphest
<<< [8] <connect> 4,070 us
>>> [9] <query> SELECT `task`.*  FROM `maniphest_task` task  WHERE (task.phid in ('PHID-TASK-bclylvkmj6c5kdn74s26'))   ORDER BY `task`.`id` DESC 
<<< [9] <query> 1,207 us
>>> [10] <query> SELECT * FROM `maniphest_transaction` x WHERE (objectPHID IN ('PHID-TASK-bclylvkmj6c5kdn74s26')) ORDER BY `id` DESC 
<<< [10] <query> 9,847,215 us
Killed

mysql> SELECT COUNT(id) FROM phabricator_maniphest.maniphest_transaction WHERE (objectPHID IN ('PHID-TASK-bclylvkmj6c5kdn74s26'));
+-----------+
| COUNT(id) |
+-----------+
|      1996 |
+-----------+

This happened again today, presumably because I added a new differential revision to PHID-TASK-bclylvkmj6c5kdn74s26.

I guess this one can live for now since that's reproducible/actionable.

epriestley mentioned this in T12337: Provide conduit access to related tasks and in particular 'mentions'. Mar 1 2017, 4:44 PM

Presumably resolved elsewhere by D19503.
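
The trace above shows the search worker selecting every transaction row for one task before the process was killed. One way to bound that cost, in the spirit of the title of D19503 (index only the first 1,000 comments), is to cap the query itself. This is a sketch against an in-memory SQLite table. The table `txn`, the functions `fetch_first_transactions` and `make_demo_db`, and reading "first" as "oldest" are all illustrative assumptions, not the actual change made in D19503.

```python
import sqlite3


def fetch_first_transactions(conn, object_phid, limit=1000):
    """Return at most `limit` rows for one object, oldest first."""
    cursor = conn.execute(
        "SELECT id, objectPHID, body FROM txn "
        "WHERE objectPHID = ? ORDER BY id ASC LIMIT ?",
        (object_phid, limit),
    )
    return cursor.fetchall()


def make_demo_db(row_count, object_phid="PHID-TASK-demo"):
    """Build a throwaway in-memory table with `row_count` rows for one object."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE txn (id INTEGER PRIMARY KEY, objectPHID TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO txn (objectPHID, body) VALUES (?, ?)",
        ((object_phid, f"comment {n}") for n in range(row_count)),
    )
    conn.commit()
    return conn


# 1,996 rows matches the COUNT(id) result reported above; the data is synthetic.
conn = make_demo_db(1996)
rows = fetch_first_transactions(conn, "PHID-TASK-demo")
print(len(rows))  # no more than the cap, however many rows the object has
```

A row cap bounds the number of rows loaded but not their size, so a real fix might also need to limit how much text is read per row.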
