We have a handful of tasks which repeatedly fail after consuming more and more RAM until they run the machine out of memory. The impact is that our Phabricator service is unavailable for about 10-20 minutes every two hours (the lease time on the tasks), until an operator manually interrupts the task.
From studying the issue:
- it appears to happen while trying to render an update to a Paste,
- the same hang occurs when trying to render those Paste updates on the web,
- it applies to all 3 updates in the history of at least one affected Paste, which is about 75kB and 979 lines in size,
- but it doesn't apply to the updates on some other recently-created Pastes.
I don't know how to reproduce the issue from scratch, but I can trigger it at will on data in our Phabricator install, and I have a variety of debugging information. Happy to do other forms of tracing or debugging. More details below.
Does this look like any known or past issue? What else would you suggest to help diagnose it?
DETAILS
Each of the tasks has class PhabricatorApplicationTransactionPublishWorker. Running one with bin/worker execute --trace shows that the transaction being published is a Paste transaction. Following the PHIDs to phabricator_pastebin.pastebin_pastetransaction and then to phabricator_file.file, the two versions of the paste are both just under 75kB long, so there shouldn't be a ton of computation involved here.
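For reference, the PHID-chasing was roughly the following (database and column names per the stock Phabricator schema as I understand it; the Paste object PHID is a placeholder, while the file PHIDs and blob IDs are the real ones from the trace below):

# Content transactions on the affected Paste (object PHID is a placeholder):
mysql phabricator_pastebin -e "SELECT phid, transactionType, oldValue, newValue FROM pastebin_pastetransaction WHERE objectPHID = 'PHID-PSTE-xxxxxxxxxxxxxxxxxxxx'"

# oldValue/newValue point at file PHIDs (these two are the ones from the trace below):
mysql phabricator_file -e "SELECT phid, byteSize, storageEngine, storageHandle FROM file WHERE phid IN ('PHID-FILE-doakbz6ox64c7zaiiom5', 'PHID-FILE-cjon3dx4lp573nyfx5qf')"

# With MySQL blob storage, storageHandle is the file_storageblob id:
mysql phabricator_file -e "SELECT id, LENGTH(data) FROM file_storageblob WHERE id IN (237819, 237829)"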
When I go to the Paste page for the paste, it renders just fine. But if I hit "Show Details" on the offending transaction, I get no response on the web, and on the server I see an apache2 process showing the same symptoms as the hapless PhabricatorTaskmasterDaemon processes had: spinning at near 100% CPU and growing its memory usage by about 1GB every 5-20 seconds.
The bin/worker --trace output moves rapidly until it gets to this point:
>>> [28] <query> SELECT * FROM `file_transformedfile` WHERE transformedPHID IN ('PHID-FILE-doakbz6ox64c7zaiiom5', 'PHID-FILE-cjon3dx4lp573nyfx5qf')
<<< [28] <query> 720 us
>>> [29] <query> SELECT * FROM `file_storageblob` WHERE `id` = 237819
<<< [29] <query> 6,747 us
>>> [30] <query> SELECT * FROM `file_storageblob` WHERE `id` = 237829
<<< [30] <query> 2,104 us
Those blob IDs in fact correspond to the old and new versions of the paste. Then the bin/worker --trace process hangs there, spinning CPU and rapidly eating more and more memory.
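(For reference, re-running the stuck task by ID reproduces the hang on demand. The task ID below is a placeholder; the real IDs come out of the worker queue table, assuming the stock schema.)

# List the stuck tasks and their lease state:
mysql phabricator_worker -e "SELECT id, taskClass, leaseOwner, leaseExpires, failureCount FROM worker_activetask WHERE taskClass = 'PhabricatorApplicationTransactionPublishWorker'"

# Re-run one in the foreground (12345 is a placeholder task ID); this is the
# command whose --trace output appears above:
./bin/worker execute --id 12345 --trace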
Running strace on the bin/worker --trace process produces very boring results: there's a very, very rapid repetition of these calls:
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
and when those are filtered out, what's left looks like this:
15:12:05.045990 brk(0x8fbc000) = 0x8fbc000 <0.000028>
15:12:06.227412 brk(0x8ffc000) = 0x8ffc000 <0.000023>
15:12:07.385571 brk(0x903c000) = 0x903c000 <0.000029>
i.e., a brk call a little less often than once per second, each advancing the heap by 256KiB. (This is a lot slower than the memory use grows without strace attached.)
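For completeness, the strace invocation was something along these lines (the PID is a placeholder); the second form drops the rt_sigprocmask chatter so the brk() calls stand out:

# Attach to the spinning process; -tt prints timestamps and -T the time
# spent in each syscall (both visible in the output above).
strace -p 12345 -tt -T

# Same, but exclude the rt_sigprocmask noise.
strace -p 12345 -tt -T -e 'trace=!rt_sigprocmask'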
VERSIONS
phabricator fd5825fafdcbf44733db24e558d3b5e5fd5ae7a1 (Fri, Sep 30)
arcanist d455ddde6a9c16c607f198743c8eb7babc3ef273 (Fri, Sep 30)
phutil 496473da30b617696b89882dbf96e008db472316 (Fri, Sep 30)
pytest-phabricator 1550854ff89dca474176d307f6a485cd5eff0452 (Aug 24 2016)
Our phabricator is based on 921aab604ec759f2af997faf9b3822ce0e4ee340 (Wed, Sep 28), and differs from it by
13 files changed, 48 insertions(+), 7 deletions(-)
none of which appears related to Paste. Happy to discuss any part of the diff if it comes up, though; nothing's secret, just boring.
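(That diffstat is just the summary from comparing our tree against the upstream commit, i.e. something like the following.)

# Summarize our local changes relative to the upstream commit we track:
git diff --stat 921aab604ec759f2af997faf9b3822ce0e4ee340..HEAD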