Page MenuHomePhabricator

Make task queue more robust against long-running tasks
ClosedPublic

Authored by epriestley on Apr 14 2014, 11:14 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 17, 11:51 AM
Unknown Object (File)
Sun, Dec 29, 5:23 PM
Unknown Object (File)
Tue, Dec 24, 10:20 PM
Unknown Object (File)
Tue, Dec 24, 8:15 AM
Unknown Object (File)
Sun, Dec 22, 8:22 PM
Unknown Object (File)
Sun, Dec 22, 8:22 PM
Unknown Object (File)
Sun, Dec 22, 8:22 PM
Unknown Object (File)
Sun, Dec 22, 8:22 PM
Subscribers

Details

Summary

See discussion in D8773. Three small adjustments which should help prevent this kind of issue:

  • When queueing followup tasks, hold them on the worker until we finish the task, then queue them only if the work was successful.
  • Increase the default lease time from 60 seconds to 2 hours. Although most tasks finish in far fewer than 60 seconds, the daemons are generally stable nowadays and these short leases don't serve much of a purpose. I think they also date from an era where lease expiry and failure were less clearly distinguished.
  • Increase the default wait-after-failure from 60 seconds to 5 minutes. This largely dates from the MetaMTA era, where Facebook ran services with high failure rates and it was appropriate to repeatedly hammer them until things went through. In modern infrastructure, such failures are rare.
Test Plan
  • Verified that tasks queued properly after the main task was updated.
  • Verified that leases default to 7200 seconds.
  • Intentionally failed a task and verified default 300 second wait before retry.
  • Removed all default leases shorter than 7200 seconds (there was only one).
  • Checked all the wait before retry implementations for anything much shorter than 5 minutes (they all seem reasonable).

Diff Detail

Repository
rP Phabricator
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

epriestley retitled this revision from to Make task queue more robust against long-running tasks.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added reviewers: sowedance, btrahan.

Generally looks good to me. Just one concern on the commit parsing workflow: if we make the lease expiration 2hrs, in case of parsing failure, does that mean that commit won't be parsed for the next 2 hours? If that's the case, I am a little worried since there are quite a few internal tools relying on the commit parsing and the delay could cause some problems for us.

parsing failure, does that mean that commit won't be parsed for the next 2 hours

That will hit the "wait before retry" case (5 minutes) instead of the lease expiration case (2 hours).

src/applications/repository/worker/PhabricatorRepositoryCommitOwnersWorker.php
6–9

(The default is now 2 hours, so this would have been shorter.)

sowedance edited edge metadata.

Ah ok I got it. Thanks!

This revision is now accepted and ready to land.Apr 15 2014, 2:12 AM
epriestley updated this revision to Diff 20827.

Closed by commit rPcb545856a9b8 (authored by @epriestley).