Page MenuHomePhabricator

Unbounded runtime and memory use on some Paste transactions
Closed, ResolvedPublic

Description

We have a handful of tasks which repeatedly fail after consuming more and more RAM until they run the machine out of memory. The impact is that our Phabricator service is unavailable for a period of about 10-20 min every two hours (the lease time on the tasks), until an operator manually interrupts the task.

From studying the issue,

  • it appears to happen while trying to render an update to a Paste,
  • it also affects trying to render the same Paste updates on the web,
  • it applies to all 3 updates in the history of at least one affected Paste, which is about 75kB and 979 lines in size,
  • but it doesn't apply to the updates on some other recently-created Pastes.

I don't know how to reproduce the issue from scratch, but I can trigger it at will on data in our Phabricator install, and I have a variety of debugging information. Happy to do other forms of tracing or debugging. More details below.

Does this look like any known or past issue? What else would you suggest to help diagnose it?


DETAILS

Each of the tasks has class PhabricatorApplicationTransactionPublishWorker. Running one with bin/worker execute --trace, the transaction being published is a Paste transaction. Following the PHIDs to phabricator_pastebin.pastebin_pastetransaction and then to phabricator_file.file, the two versions of the paste are both just under 75kB long, so there isn't a ton of computation involved here.

When I go to the Paste page for the paste, it renders just fine, but if I hit "Show Details" on the offending transaction, I get no response on the web, and on the server I see an apache2 process showing the same symptoms as the hapless PhabricatorTaskmasterDaemon processes had -- spinning at near 100% CPU and increasing memory usage at a rate of 1GB every 5-20sec or so.

The bin/worker --trace output moves rapidly until it gets to this point:

>>> [28] <query> SELECT * FROM `file_transformedfile` WHERE transformedPHID IN ('PHID-FILE-doakbz6ox64c7zaiiom5', 'PHID-FILE-cjon3dx4lp573nyfx5qf') 
<<< [28] <query> 720 us
>>> [29] <query> SELECT * FROM `file_storageblob` WHERE `id` = 237819 
<<< [29] <query> 6,747 us
>>> [30] <query> SELECT * FROM `file_storageblob` WHERE `id` = 237829 
<<< [30] <query> 2,104 us

Those blob IDs in fact correspond to the old and new versions of the paste. Then the bin/worker --trace process hangs there, spinning CPU and rapidly eating more and more memory.

Running strace on the bin/worker --trace process produces very boring results: there's a very, very rapid repetition of these calls:

rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0

and when those are filtered out, what's left looks like this:

15:12:05.045990 brk(0x8fbc000)          = 0x8fbc000 <0.000028>
15:12:06.227412 brk(0x8ffc000)          = 0x8ffc000 <0.000023>
15:12:07.385571 brk(0x903c000)          = 0x903c000 <0.000029>

i.e., a brk call a little less often than once per second, each advancing by 256kiB. (This is a lot slower than the memory use advances without strace.)


VERSIONS

phabricator fd5825fafdcbf44733db24e558d3b5e5fd5ae7a1 (Fri, Sep 30)
arcanist d455ddde6a9c16c607f198743c8eb7babc3ef273 (Fri, Sep 30)
phutil 496473da30b617696b89882dbf96e008db472316 (Fri, Sep 30)
pytest-phabricator 1550854ff89dca474176d307f6a485cd5eff0452 (Aug 24 2016)

Our phabricator is based on 921aab604ec759f2af997faf9b3822ce0e4ee340 (Wed, Sep 28), and differs from it by
13 files changed, 48 insertions(+), 7 deletions(-)
none of which appears related to Paste. Happy to discuss any part of the diff if it comes up, though; nothing's secret, just boring.

Event Timeline

Can you attempt to create a paste on this install which reproduces the issue? I have a few ideas about what's up, but that's almost certainly the fastest path to a fix if it works.

Meaning here on secure.phabricator.com? Sure -- no guarantee it'll work, but I have some guesses I can throw at it.

One other quick diagnostic is: when the worker is stuck, is there a pygmentize subprocess running?

Nope -- pstree shows no subprocesses when the worker is stuck.

Yeah, just take a shot at making a paste here (secure.phabricator.com) with the same degenerate behavior. If you can get us crashing I suspect I can get a fix in about 5 minutes through the power of adding var_dump() all over the place on the production server. If not I have a few ideas but there's a lot of code that could be causing the problem and that trace isn't an obvious smoking gun to me.

Just made P2011, and I think that's a repro -- the "Show Details" link hangs.

Awesome, thanks! I'll dig into it and update you once I get somewhere.

Thank you! Not every day someone invites me to cause a crash on their production server. :-D

Oh, one diagnostic tool that I didn't mention is that you can send SIGHUP to a running bin/* command to get a stacktrace of what it's doing dumped to /tmp/phabricator_backtrace_<pid> (usually). This one looks like it is a smoking gun:

$ cat /var/folders/pj/3dbkgfxd783_9yvv7c1kss9m0000gn/T/phabricator_backtrace_68376 
#0 /Users/epriestley/dev/core/lib/libphutil/src/future/exec/PhutilSignalRouter.php(61): PhutilBacktraceSignalHandler->handleSignal(Object(PhutilSignalRouter), 1)
#1 /Users/epriestley/dev/core/lib/libphutil/src/utils/PhutilEditDistanceMatrix.php(474): PhutilSignalRouter->routeSignal(1)
#2 /Users/epriestley/dev/core/lib/libphutil/src/utils/PhutilEditDistanceMatrix.php(339): PhutilEditDistanceMatrix->computeMatrix(Array, Array)
#3 /Users/epriestley/dev/core/lib/libphutil/src/utils/PhutilEditDistanceMatrix.php(253): PhutilEditDistanceMatrix->getTypeMatrix()
#4 /Users/epriestley/dev/core/lib/libphutil/src/utils/PhutilProseDifferenceEngine.php(37): PhutilEditDistanceMatrix->getEditString()
#5 /Users/epriestley/dev/core/lib/libphutil/src/utils/PhutilProseDifferenceEngine.php(8): PhutilProseDifferenceEngine->buildDiff(Object(PhutilProseDiff), ' .arcconfig    ...', ' .arcconfig    ...', 1)
#6 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/view/PhabricatorApplicationTransactionTextDiffDetailView.php(135): PhutilProseDifferenceEngine->getDiff(' .arcconfig    ...', ' .arcconfig    ...')
#7 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/view/PhabricatorApplicationTransactionTextDiffDetailView.php(20): PhabricatorApplicationTransactionTextDiffDetailView->buildDiff()
#8 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/storage/PhabricatorApplicationTransaction.php(1387): PhabricatorApplicationTransactionTextDiffDetailView->renderForMail()
#9 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php(2872): PhabricatorApplicationTransaction->renderChangeDetailsForMail(Object(PhabricatorUser))
#10 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php(2750): PhabricatorApplicationTransactionEditor->addHeadersAndCommentsToMailBody(Object(PhabricatorMetaMTAMailBody), Array)
#11 /Users/epriestley/dev/core/lib/phabricator/src/applications/paste/editor/PhabricatorPasteEditor.php(83): PhabricatorApplicationTransactionEditor->buildMailBody(Object(PhabricatorPaste), Array)
#12 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php(2511): PhabricatorPasteEditor->buildMailBody(Object(PhabricatorPaste), Array)
#13 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php(2468): PhabricatorApplicationTransactionEditor->buildMailForTarget(Object(PhabricatorPaste), Array, Object(PhabricatorMailTarget))
#14 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php(1150): PhabricatorApplicationTransactionEditor->buildMail(Object(PhabricatorPaste), Array)
#15 /Users/epriestley/dev/core/lib/phabricator/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php(21): PhabricatorApplicationTransactionEditor->publishTransactions(Object(PhabricatorPaste), Array)
#16 /Users/epriestley/dev/core/lib/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(123): PhabricatorApplicationTransactionPublishWorker->doWork()
#17 /Users/epriestley/dev/core/lib/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php(171): PhabricatorWorker->executeTask()
#18 /Users/epriestley/dev/core/lib/phabricator/src/infrastructure/daemon/workers/management/PhabricatorWorkerManagementExecuteWorkflow.php(52): PhabricatorWorkerActiveTask->executeTask()
#19 /Users/epriestley/dev/core/lib/libphutil/src/parser/argument/PhutilArgumentParser.php(438): PhabricatorWorkerManagementExecuteWorkflow->execute(Object(PhutilArgumentParser))
#20 /Users/epriestley/dev/core/lib/libphutil/src/parser/argument/PhutilArgumentParser.php(330): PhutilArgumentParser->parseWorkflowsFull(Array)
#21 /Users/epriestley/dev/core/lib/phabricator/scripts/setup/manage_worker.php(21): PhutilArgumentParser->parseWorkflows(Array)
#22 {main}

The root issue was that we would compute a ProseEditDistanceMatrix (roughly, a Levenshtein distance matrix) for large inputs, but this is O(N*M) in the size of the inputs and both of these inputs had ~2,200 "atoms" at the top prose diff level, so the matrix had ~5M elements.

PHP is even more awful at this kind of bytewise low-level fiddly thing than it is at most things, so a 5M element matrix requires approximately 5M GB of RAM and 5M mega lightyears of runtime.

I've marked D16682 as fixing this; the "fix" there is to give up and compute a less useful diff if the inputs are too big.

This is probably not the most complete fix. Broadly, we have two difference engines:

  • Code diffs, used when diffing code (Differential, Diffusion, etc.)
  • Prose diffs, as of T7643, used when diffing human-readable text (task descriptions, etc).

Prose diffs are relatively recent and were introduced because code diffs on prose are usually garbage. Prose diffs are still a bit rough (T7643 has some improvements) but seem like a step forward overall.

However, a bunch of applications got sort of grandfathered into prose diffs when they appeared. In almost every case that was good (e.g., descriptions of every sort in every application now use prose diffs by default) but I think it's probably not the right choice for paste content diffs, at least if they don't have some specific narrow set of suffixes like ".txt".

I'll file a followup to get them converted back to code diffs at some point. That should improve the quality of these diffs generally, and stop us from hitting this "give up" case.

(I think there are a handful of other cases where prose diffs aren't as good as code diffs, too -- one that springs to mind is JSON blobs in Config, maybe? We can hunt those down too.)

This algorithm is also a good candidate for porting to C some day (T2312) although there aren't many other strong arguments for pursuing that at the moment -- doing a lot of UTF8 fiddling in PHP hasn't actually really bitten us yet.

Sweet -- thanks for the fast fix! I'll try applying it locally to clear out the affected tasks we have, and to make sure any new ones don't get me paged tonight. :-)

Thanks also for the tip about SIGHUP to get a backtrace! Had not known that, and it looks extremely useful.

There's one still-lingering task that didn't get resolved by that fix, which shows similar symptoms (except the part about making the service unavailable for a while -- that seems to come from having several of them hit at once) but isn't a paste. I'll file a separate bug for that, with the aid of that backtrace feature.

This should now be fixed in HEAD of master.

I deployed the change to this server, and the worker attached to the P2011 edit completed and the "(Show Details)" link now works (albeit without producing a hugely useful diff).

Thanks for the thorough report, and for confirming the patch in D16682. Let us know if you run into anything else!