Page MenuHomePhabricator

PhabricatorApplicationTransactionPublishWorker requires unreasonable amount of memory for large numbers of mail recipients?
Closed, InvalidPublic


I'm having some amount of PhabricatorApplicationTransactionPublishWorker tasks leased because workers reached memory limit and are killed (limit is quite high, ~20Gb).

Tried execute tasks via CLI bin/worker execute --id 123 --trace with no limit and noticed interesting behavior.

  1. Task is executed until forever, memory consumption grows. I've canceled it later.
  1. Task is running following queries:
>>> [28] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-nsnnowxztidyfw62namu', 'PHID-USER-pgi5dylkzf4n74u5trn5', 'PHID-USER-eh6b6mexpadgyyhhkhfr', 'PHID-USER-6ttizi6pnrvjkaauwb75', ...
<<< [28] <query> 16,978 us
>>> [29] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-7szdg4elqqaowpzjrkbi', 'PHID-USER-pgi5dylkzf4n74u5trn5', 'PHID-USER-eh6b6mexpadgyyhhkhfr', 'PHID-USER-6ttizi6pnrvjkaauwb75', ...
<<< [29] <query> 17,692 us
>>> [30] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-nsnnowxztidyfw62namu', 'PHID-USER-pgi5dylkzf4n74u5trn5', 'PHID-USER-eh6b6mexpadgyyhhkhfr', 'PHID-USER-6ttizi6pnrvjkaauwb75', ...
<<< [30] <query> 18,085 us
>>> [31] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-7szdg4elqqaowpzjrkbi', 'PHID-USER-pgi5dylkzf4n74u5trn5', 'PHID-USER-eh6b6mexpadgyyhhkhfr', 'PHID-USER-6ttizi6pnrvjkaauwb75', ...
<<< [31] <query> 17,795 us

If you noticed, even and odd queries are repeating. I've canceled task after noticed that because it didn't make sense to perform it any longer.

This might be our data issue but maybe we could debug the reason somehow. Do you have any idea why this is happening?

Event Timeline

Pawka assigned this task to epriestley.
Pawka updated the task description. (Show Details)
Pawka added a project: Daemons.
Pawka added subscribers: Pawka, joshuaspence.

What happens earlier in the stack trace?

@epriestley any thoughts? This issue is blocking entire queue.

Here's my best effort to figure out what's happening here:

15 >>> [9] <query> SELECT `p`.*, vm.dst viewerIsMember FROM `project` p LEFT JOIN `edge` vm ON vm.src = p.phid AND vm.type = 13 AND vm.dst = '' WHERE (phid IN ('PHID-PROJ-vi7rjrmnk3lcc5aztlqu'))   ORDER BY `p`.`name` ASC

This looks like the PhabricatorObjectQuery in PhabricatorApplicationTransactionPublishWorker->loadObject(). We're loading the project that this transaction is about (for example, we might be trying to publish a rename of a project or an edit to its description).

The next few queries are loading the profile image for the project.

Not related, but the vm.dst = '' part of this query is silly. We could skip this when the viewer is omnipotent.

25 >>> [14] <query> SELECT * FROM `project_transaction` x WHERE (phid IN ('PHID-XACT-PROJ-uoucbnrl7x3xpjl')) ORDER BY `id` DESC

We're now in PhabricatorApplicationTransactionPublishWorker->loadTransactions(), loading the actual transaction. You could look this row up to get more information, which might be helpful:

SELECT * FROM phabricator_project.project_transaction WHERE phid = 'PHID-XACT-PROJ-uoucbnrl7x3xpjl';

That would tell us exactly what transaction is being published.

The next few queries load the project again (as required to load the transaction) and implicitly load its profile image. We could eventually cache these better, but this is expected today.

37 >>> [20] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-7szdg4elqqaowpzjrkbi', ...

This is coming out of resolving the handle pool in PhabricatorApplicationTransactionQuery->willFilterPage(). We're loading the handles for all of the PHIDs required by the transaction.

The number of users getting loaded here is surprising, since we loaded only one transaction. My guess is that this transaction is a membership transaction which adds or removes a lot of members.

39 >>> [21] <query> SELECT edge.* FROM `edge` edge WHERE (edge.src IN ('PHID-PROJ-vi7rjrmnk3lcc5aztlqu')) AND (edge.type IN ('13')) ORDER BY edge.dateCreated DESC, edge.seq DESC

We've now made it into PhabricatorApplicationTransactionPublishWorker->publishTransactions(). The first thing it does is call the willPublish() hook. The PhabricatorProjectTransactionEditor subclass loads the project's members in this hook, and that's what this query is doing.

I'm about 95% confident this read is accurate up to this point, but now we're at where things go wrong and I'm not as sure what's happening.

41 >>> [22] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-nsnnowxztidyfw62namu',

After willPublish(), we call a couple of hooks which shouldn't issue any queries.

I think the next place we'll run queries is in PhabricatorMailReplyHandler->loadRecipientUsers(), which does run the right kind of query, but this doesn't make sense. This query is completely mundane and I don't see how it can loop.

So maybe the first query is actually fine, and the loop doesn't start until the second query.

>>> [23] <query> SELECT `user`.* FROM `user` user  WHERE (user.phid IN ('PHID-USER-7szdg4elqqaowpzjrkbi', ...

This next query is probably from PhabricatorApplicationTransactionEditor->loadHandles(), but I'm not sure. This should also be mundane, and I don't see why it would loop either.

I don't see why the queries after this would alternate their order.

A possible explanation is that the second query is just another handle query somewhere in mail construction, and that this isn't a loop: the project might just have an enormous number of affected users and/or mail recipients.

Does the project have a huge number of members/subscribers (more memers than the log issues queries for before running out of memory)?

SELECT * FROM phabricator_project.project WHERE phid = 'PHID-PROJ-vi7rjrmnk3lcc5aztlqu';

Does the transaction affect a huge number of members?

SELECT * FROM phabricator_project.project_transaction WHERE phid = 'PHID-XACT-PROJ-uoucbnrl7x3xpjl';
Pawka closed this task as Resolved.EditedSep 16 2015, 12:48 PM

Thank you for full explanation.

Problem was that there were over 1500 subscribers on single project and probably, as you said, that were mail construction queries.

I've removed all subscribers from project with following queries:

USE phabricator_project;
DELETE FROM `edge` WHERE ((src IN ('PHID-PROJ-vi7rjrmnk3lcc5aztlqu') AND type = 21));
USE phabricator_user;
DELETE FROM `edge` WHERE ((dst IN ('PHID-PROJ-vi7rjrmnk3lcc5aztlqu') AND type = 22));;

We can and should reduce the amount of memory required here -- I believe it's roughly O(N^2) because we load every user for every user, in order to respect visibility policies. We can reduce it to O(N) by dumping the handles after building the mail.

1,500 users is a large number but not an unreasonable number. My "well, don't do that" threshold is more like 150,000 users for this operation. I'd expect 1,500 to work on any system and expect 15,000 to work on large systems like yours.

(I'm going to rename this to track fixing it since the information here is useful in understanding the problem and likely in reproducing it, but feel free to unsubscribe if you aren't interested.)

epriestley renamed this task from PhabricatorApplicationTransactionPublishWorker gets into deadlock to PhabricatorApplicationTransactionPublishWorker requires O(N^2) memory in number of mail recipients.Sep 16 2015, 12:53 PM

My previous DELETE queries just reduced amount of tasks created, but I'm still getting few of those tasks.

Actually, I can't immediately reproduce this. Here's what I did to set the test up:

  • Stopped the daemons.
  • Sent some mail.
  • Fished the PublisherWorker task ID out of the Daemon console.
  • Added a throw new Exception() to the end of PhabricatorApplicationTransactionEditor->buildMail() so it wouldn't actually send any mail.
  • Repeatedly ran the task with bin/worker execute --id <id>.

Then I put echo mem_get_usage(); calls into the loop body to get a sense of memory usage.

Even when sending to 3750 users, memory usage peaked at 116MB, and PHP was regularly able to reclaim memory used during mail construction, suggesting that it was only permanently allocating about as much memory as the message itself required each time through the loop.

The overall usage (32KB / message) and execution rate (80 messages / second) I observed seem reasonable to me.

One possible issue is that my test was fairly synthetic, since I don't have 1,500 local user accounts, so I just copied the targets a bunch locally. It's possible that the combined size of "To" and "Cc" headers, the "To" and "Cc" mail sections, and then "X-Phabricator-To" / "X-Phabricator-Cc" headers is large enough per message that the actual size of the messages can grow large, but 20GB seems unobtainable (each recipient would need to use 8KB per message by some back-of-the-envelope math, so that each message would be 12MB and 1,500 messages would be ~20GB). But, naively, the size of a recipient in those headers should be closer to 256 bytes than 8KB, which implies total memory usage of ~500MB.

I'll generate a large number of user accounts locally and see if I can get more insight here.

epriestley renamed this task from PhabricatorApplicationTransactionPublishWorker requires O(N^2) memory in number of mail recipients to PhabricatorApplicationTransactionPublishWorker requires unreasonable amount of memory for large numbers of mail recipients?.Sep 25 2015, 3:27 PM

No clue how to reproduce this and we haven't seen anything similar since.