Make task queue more robust against long-running tasks

Description

Summary:
See discussion in D8773. Three small adjustments which should help prevent this kind of issue:

  • When queueing followup tasks, hold them on the worker until we finish the task, then queue them only if the work was successful.
  • Increase the default lease time from 60 seconds to 2 hours. Although most tasks finish in far less than 60 seconds, the daemons are generally stable nowadays and these short leases don't serve much of a purpose. I think they also date from an era when lease expiry and failure were less clearly distinguished.
  • Increase the default wait-after-failure from 60 seconds to 5 minutes. This largely dates from the MetaMTA era, where Facebook ran services with high failure rates and it was appropriate to repeatedly hammer them until things went through. In modern infrastructure, such failures are rare.
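
The three adjustments above might be sketched roughly as follows. This is an illustrative Python model only, not the actual Phabricator implementation (which is PHP); the names `Task`, `Worker`, `Queue`, `queue_followup`, `lease`, and `execute` are hypothetical, and only the 7200-second and 300-second defaults come from the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Defaults taken from the summary; everything else here is hypothetical.
DEFAULT_LEASE_SECONDS = 7200       # 2 hours (previously 60 seconds)
DEFAULT_RETRY_WAIT_SECONDS = 300   # 5 minutes (previously 60 seconds)


@dataclass
class Task:
    name: str
    work: Callable[["Worker"], None]


@dataclass
class Worker:
    """Hypothetical worker that buffers followup tasks while it runs."""
    held: List[Task] = field(default_factory=list)

    def queue_followup(self, task: Task) -> None:
        # Hold the followup on the worker instead of queueing it right away.
        self.held.append(task)


@dataclass
class Queue:
    pending: List[Task] = field(default_factory=list)

    def lease(self, now: int) -> int:
        """Return the time at which a lease taken at `now` expires."""
        return now + DEFAULT_LEASE_SECONDS

    def execute(self, task: Task) -> int:
        """Run a task; return seconds to wait before a retry (0 on success)."""
        worker = Worker()
        try:
            task.work(worker)
        except Exception:
            # The work failed: held followups are dropped, and the task
            # waits the default interval before it is retried.
            return DEFAULT_RETRY_WAIT_SECONDS
        # The work succeeded: only now do the followups reach the queue.
        self.pending.extend(worker.held)
        return 0
```

In this sketch, a task that raises partway through never leaves followups behind in the queue, because they only ever existed on the worker.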

Test Plan:

  • Verified that tasks queued properly after the main task was updated.
  • Verified that leases default to 7200 seconds.
  • Intentionally failed a task and verified the default 300-second wait before retry.
  • Removed all default leases shorter than 7200 seconds (there was only one).
  • Checked all the wait-before-retry implementations for anything much shorter than 5 minutes (they all seem reasonable).

Reviewers: btrahan, sowedance

Reviewed By: sowedance

Subscribers: epriestley

Differential Revision: https://secure.phabricator.com/D8774

Details

Provenance
epriestley Authored on
epriestley Pushed on Apr 15 2014, 3:42 PM
Reviewer
sowedance
Differential Revision
D8774: Make task queue more robust against long-running tasks
Parents
rP6a4f12600034: Give the commitownersparser a little more time
Branches
Unknown
Tags
Unknown