Make task queue more robust against long-running tasks
ClosedPublic
Actions

Authored by epriestley on Apr 14 2014, 11:14 PM.

Details

Reviewers

btrahan
• sowedance

Commits

Restricted Diffusion Commit
rPcb545856a9b8: Make task queue more robust against long-running tasks

Summary

See discussion in D8773. Three small adjustments which should help prevent this kind of issue:

When queueing followup tasks, hold them on the worker until we finish the task, then queue them only if the work was successful.
Increase the default lease time from 60 seconds to 2 hours. Although most tasks finish in far fewer than 60 seconds, the daemons are generally stable nowadays and these short leases don't serve much of a purpose. I think they also date from an era where lease expiry and failure were less clearly distinguished.
Increase the default wait-after-failure from 60 seconds to 5 minutes. This largely dates from the MetaMTA era, where Facebook ran services with high failure rates and it was appropriate to repeatedly hammer them until things went through. In modern infrastructure, such failures are rare.

Test Plan

Verified that tasks queued properly after the main task was updated.
Verified that leases default to 7200 seconds.
Intentionally failed a task and verified default 300 second wait before retry.
Removed all default leases shorter than 7200 seconds (there was only one).
Checked all the wait before retry implementations for anything much shorter than 5 minutes (they all seem reasonable).

Diff Detail

Repository

rP Phabricator

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

epriestley updated this revision to Diff 20819.Apr 14 2014, 11:14 PM

epriestley retitled this revision from to Make task queue more robust against long-running tasks.

epriestley updated this object.

epriestley edited the test plan for this revision. (Show Details)

epriestley added reviewers: • sowedance, btrahan.

Herald added a subscriber: epriestley.Apr 14 2014, 11:14 PM

epriestley edited the test plan for this revision. (Show Details)Apr 14 2014, 11:17 PM

Generally looks good to me. Just one concern on the commit parsing workflow: if we make the lease expiration 2hrs, in case of parsing failure, does that mean that commit won't be parsed for the next 2 hours? If that's the case, I am a little worried since there are quite a few internal tools relying on the commit parsing and the delay could cause some problems for us.