See PHI1794. Here, a firehose webhook queued a large number of tasks for the daemons. This is normally "okay" and caused problems only because of a connection pooling bug (resolved in D21369), but having large numbers of queued tasks creates a bit of an operational headache in general.
On Phacility instances, it's somewhat common for instances to configure a firehose webhook and then later disable the underlying service, leaving a long queue of tasks around indefinitely (these tasks GC after 7 days, but one task "floats" in the meantime for each action on the instance).
See PHI1838 and T13261. These tasks look at adding a "fail after X seconds/minutes/hours" option to Harbormaster build plans. Today, this is tricky to do with the trigger daemon because bulk-updating the timing rules for a large number of triggers is difficult.
See PHI1816 and T13125. These tasks look at adding a new workload to the daemons to perform code coverage aggregation.
These problems (and some existing daemon workloads) might be better handled by increasing the abstraction level of the worker queue. Currently, the worker queue is a single concrete database table, but this could be structured one level higher:
- The daemons work on tasks generated dynamically at runtime by "Task Generators".
- The existing worker queue is one such "Task Generator".
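As a concrete sketch of what the generator API might look like (all names here are hypothetical, not existing classes):

```php
// A hypothetical sketch of the "Task Generator" abstraction; all names
// are illustrative.
abstract class PhabricatorWorkerTaskGenerator extends Phobject {

  // Produce up to $limit executable tasks. The queue-backed generator
  // loads rows from the existing task table; other generators build
  // tasks dynamically at runtime.
  abstract public function loadNextTasks($limit);

  // The range of priorities this generator can emit, assuming larger
  // numbers mean more urgent. Most generators emit at a single
  // priority, which lets the daemon rank generators instead of tasks.
  abstract public function getPriorityRange();

}

// The existing concrete worker queue becomes one generator among many.
final class PhabricatorQueueTaskGenerator
  extends PhabricatorWorkerTaskGenerator {

  public function loadNextTasks($limit) {
    // Lease and return rows from the existing active task table.
    // ...
    return array();
  }

  public function getPriorityRange() {
    // The concrete table can hold tasks at any priority.
    return array(0, PHP_INT_MAX);
  }

}
```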
These existing workloads are likely a good fit for this model:
- Garbage collection is currently part of the trigger daemon, but could be part of the worker daemon if some of the generator state is moved into the database.
- Fact aggregation is currently part of the fact daemon, but could be part of the worker daemon if some of the generator state is moved into the database. The fact daemon could then be removed.
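For example, the GC generator's persisted state might be little more than a wakeup cursor; `PhabricatorWorkerGeneratorState` and its methods below are hypothetical:

```php
final class PhabricatorGarbageCollectorTaskGenerator
  extends PhabricatorWorkerTaskGenerator {

  public function loadNextTasks($limit) {
    // Load persisted state so any worker daemon, not one long-running
    // process, can pick up the work. The state storage is hypothetical.
    $state = id(new PhabricatorWorkerGeneratorState())
      ->loadForGenerator(get_class($this));

    // If the last pass found nothing to collect, stay asleep until the
    // stored wakeup epoch instead of emitting tasks.
    if ($state->getWakeupEpoch() > PhabricatorTime::getNow()) {
      return array();
    }

    // Otherwise, emit one ephemeral task per collector with work to do.
    // ...
    return array();
  }

  public function getPriorityRange() {
    // GC always runs at the lowest priority (0 is assumed lowest here).
    return array(0, 0);
  }

}
```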
The workloads above might be easier to handle if represented this way:
- Webhooks could primarily run off a webhook history table without floating tasks in the daemon queue.
- Harbormaster plans could trigger here in a GC-like way (see the sketch after this list); this is much simpler to manage than using triggers.
- Coverage aggregation is fact-like, and a much better fit here than in the main worker queue.
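The GC-like Harbormaster timeout mentioned above might look like this; both helper methods are hypothetical:

```php
final class HarbormasterBuildTimeoutTaskGenerator
  extends PhabricatorWorkerTaskGenerator {

  public function loadNextTasks($limit) {
    // Scan for active builds which have outlived their plan's time
    // limit, instead of registering one trigger per build. Changing a
    // plan's limit affects all builds on the next scan, with no bulk
    // trigger updates. Both helpers here are hypothetical.
    $tasks = array();
    foreach ($this->loadOverdueBuilds($limit) as $build) {
      $tasks[] = $this->newFailBuildTask($build);
    }
    return $tasks;
  }

  public function getPriorityRange() {
    // Failing a timed-out build is routine-priority work; the exact
    // value is illustrative.
    return array(3000, 3000);
  }

}
```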
These workloads seem like a less obvious fit:
- The repository daemon does a lot of complicated work, including its own scheduling, and might make sense to merge eventually, but there's little benefit in doing so today.
- The "approximately real-time" stuff in the trigger daemon (subscriptions, calendar notifications) might make sense to move eventually, but having a separate process for these time-sensitive, short-lived triggers has some value in making scheduling more reliable. There could eventually be some compromise here like having one worker only execute tasks at "realtime" priority or above.
The main technical issues I see here are:
- The task daemons currently expect to be able to operate on a PhabricatorWorkerActiveTask object, but some of this work (like GC and fact aggregation) does not make sense to couple with the actual task table. This coupling is likely easy to break, but will require some changes elsewhere (for example, a daemon should be able to report what it's doing, but can no longer just use a task PHID).
- Some handling, like PermanentFailureException, may be tricky when the task source is a dynamic generator and a task isn't directly failable.
- Priority scheduling across generators may be tricky. We'd prefer not to ask every generator for tasks before ranking tasks, but if we don't ask every generator for tasks we might not execute tasks in appropriate priority order. Since most generators can generate only a single priority of tasks, it may be reasonable to give generators a priority range and rank generators.
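A minimal sketch of ranking generators, assuming larger numbers mean higher priority (the function name is illustrative):

```php
// Sort generators descending by the top of their declared priority
// range, then drain them in order: a lower-priority generator is only
// consulted when every higher-priority generator has no work to offer.
function phabricator_select_next_tasks(array $generators, $limit) {
  usort($generators, function ($u, $v) {
    $u_range = $u->getPriorityRange();
    $v_range = $v->getPriorityRange();
    return $v_range[1] - $u_range[1];
  });

  foreach ($generators as $generator) {
    $tasks = $generator->loadNextTasks($limit);
    if ($tasks) {
      return $tasks;
    }
  }

  return array();
}
```

In this sketch, the queue-backed generator (which spans the whole priority range) always sorts first; that is one version of the imprecision described above.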
The path forward is likely:
- Sink the table-based task generator into a separate QueueTaskGenerator or similar.
- Wrap ActiveTask in some ephemeral DynamicTask container (see the sketch after this list).
- Fix all the daemon cases where a task PHID is used for something; replace this with a "Generator Class + Identifier" pair or similar.
- Move the GC work (always lowest priority).
- Move the fact aggregation work (always slightly-higher-than-normal-priority).
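The second and third steps might look roughly like this; the container and its fields are hypothetical:

```php
// A hypothetical ephemeral container: daemons operate on DynamicTask
// instead of PhabricatorWorkerActiveTask directly. Queue-backed tasks
// wrap a table row; generator-produced tasks never touch the table.
final class PhabricatorWorkerDynamicTask extends Phobject {

  private $generatorClass;
  private $identifier;
  private $activeTask;

  // Replaces task PHIDs wherever daemons report what they are doing:
  // a "Generator Class + Identifier" pair identifies any unit of work,
  // whether or not it has a table row.
  public function getWorkIdentifier() {
    return sprintf('%s:%s', $this->generatorClass, $this->identifier);
  }

  public function executeTask() {
    if ($this->activeTask) {
      // Queue-backed work: delegate to the existing ActiveTask logic.
      // ...
    } else {
      // Dynamic work: execute directly. Failure handling (including
      // PermanentFailureException) is up to the generator.
      // ...
    }
  }

}
```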
Then:
- Move the webhook work.
- Add the Harbormaster workload.
- Add the aggregation workload.
Open questions:
- What primitives do we need to keep track of Generator state (e.g., GC generators don't need to do anything if they found nothing to collect recently)?
- What primitives do we need to support webhook work (where a dynamic cursor iterates over some table)?
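For the webhook case, the cursor primitive might look roughly like this (all names hypothetical):

```php
final class PhabricatorWebhookTaskGenerator
  extends PhabricatorWorkerTaskGenerator {

  public function loadNextTasks($limit) {
    $tasks = array();
    foreach ($this->loadActiveWebhooks() as $hook) {
      // The cursor records the last history row this hook has seen;
      // rows past it are deliveries we have not yet attempted. If the
      // hook is disabled, nothing floats in the queue: the cursor just
      // stops advancing and old rows age out of the history table.
      $cursor = $this->loadCursor($hook);
      foreach ($this->loadRowsAfterCursor($hook, $cursor, $limit) as $row) {
        $tasks[] = $this->newDeliveryTask($hook, $row);
      }
    }
    return $tasks;
  }

  public function getPriorityRange() {
    // Webhook deliveries are routine-priority work; the value is
    // illustrative.
    return array(3000, 3000);
  }

}
```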