Page MenuHomePhabricator

Implement clock/trigger infrastructure for scheduling actions
ClosedPublic

Authored by epriestley on Jan 16 2015, 12:43 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 12, 6:02 PM
Unknown Object (File)
Sat, Jan 11, 4:35 AM
Unknown Object (File)
Tue, Dec 31, 10:08 PM
Unknown Object (File)
Sat, Dec 28, 5:50 PM
Unknown Object (File)
Dec 6 2024, 6:55 AM
Unknown Object (File)
Dec 6 2024, 4:45 AM
Unknown Object (File)
Dec 3 2024, 2:00 AM
Unknown Object (File)
Dec 1 2024, 10:57 PM
Subscribers

Details

Summary

Ref T6881. Hopefully, this is the hard part.

This adds a new daemon (the "trigger" daemon) which processes triggers, schedules them, and then executes them at the scheduled time. The design is a little complicated, but has these goals:

  • High resistance to race conditions: only the application writes to the trigger table; only the daemon writes to the event table. We won't lose events if someone saves a meeting at the same time as we're sending a reminder out for it.
  • Execution guarantees: scheduled events are guaranteed to execute exactly once.
  • Support for arbitrarily large queues: the daemon will make progress even if there are millions of triggers in queue. The cost to update the queue is proportional to the number of changes in it; the cost to process the queue is proportional to the number of events to execute.
  • Relatively good observability: you can monitor the state of the trigger queue reasonably well from the web UI.
  • Modular Infrastructure: this is a very low-level construct that Calendar, Phortune, etc., should be able to build on top of.

It doesn't have this stuff yet:

  • Not very robust to bad actions: a misbehaving trigger can stop the queue fairly easily. This is OK for now since we aren't planning to make it part of any other applications for a while. We do still get execute-exaclty-once, but it might not happen for a long time (until someone goes and fixes the queue), when we could theoretically continue executing other events.
  • Doesn't start automatically: normal users don't need to run this thing yet so I'm not starting it by default.
  • Not super well tested: I've vetted the basics but haven't run real workloads through this yet.
  • No sophisticated tooling: I added some basic stuff but it's missing some pieces we'll have to build sooner or later, e.g. bin/trigger cancel or whatever.
  • Intentionally not realtime: This design puts execution guarantees far above realtime concerns, and will not give you precise event execution at 1-second resolution. I think this is the correct goal to pursue architecturally, and certainly correct for subscriptions and meeting reminders. Events which execute after they have become irrelevant can simply decline to do anything (like a meeting reminder which executes after the meeting is over).

In general, the expectation for applications is:

  • When creating an object (like a calendar event) that needs to trigger a scheduled action, write a trigger (and save the PHID if you plan to update it later).
  • The daemon will process the event and schedule the action efficiently, in a race-free way.
  • If you want to move the action, update the trigger and the daemon will take care of it.
  • Your action will eventually dump a task into the task queue, and the task daemons will actually perform it.
Test Plan

Using a test script like this:

<?php

require_once 'scripts/__init_script__.php';

$trigger = id(new PhabricatorWorkerTrigger())
  ->setAction(
    new PhabricatorLogTriggerAction(
      array(
        'message' => 'test',
      )))
  ->setClock(
    new PhabricatorMetronomicTriggerClock(
      array(
        'period' => 33,
      )))
  ->save();

var_dump($trigger);

...I queued triggers and ran the daemon:

  • Verified triggers fire;
  • verified triggers reschedule;
  • verified trigger events show up in the web UI;
  • tried different periods;
  • added some triggers while the daemon was running;
  • examined phd debug output for anything suspicious.

It seems to work in trivial use case, at least.

Diff Detail

Repository
rP Phabricator
Branch
trigger3
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 3887
Build 3899: [Placeholder Plan] Wait for 30 Seconds

Event Timeline

epriestley retitled this revision from to Implement clock/trigger infrastructure for scheduling actions.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added a reviewer: btrahan.

Couple of things inline, mostly typos.

src/infrastructure/daemon/workers/PhabricatorTriggerDaemon.php
99

I'll type some more text here.

204

This seems ungrammatical.

219

This should be $last_epoch.

227–229

I need to always save this so we retain the last execution time in case triggers are later updated and the clock rule depends on the previous execution epoch (the scheduling phase does this correctly).

src/infrastructure/daemon/workers/action/PhabricatorTriggerAction.php
61

This is copy/pasted and incorrect.

btrahan edited edge metadata.

This looks pretty badass. The architecture sounds good to me; I don't think we want to get into the business of executing things at the specified second or even minute, just ballpark around then should work great for all our scenarios.

src/infrastructure/daemon/workers/PhabricatorTriggerDaemon.php
204

Maybe "Don't reschedule events unless the reschedule epoch is in the future." and maybe rename $next_epoch to $reschedule_epoch ?

This revision is now accepted and ready to land.Jan 16 2015, 6:39 PM

One other possible weakness is that this doesn't give us any direct tools to attack the "task reaper" use case, but I'm not really sold on the usefulness of it in the general case and I'm not sure if, e.g., Herald rules with timer conditions are the right approach anyway, since I think they're not expressive enough.

In particular, with special cases like T6821/T6491 (where the desire is to trigger, e.g., after no change in assignee status, specifically, for X days), I'm hard-pressed to imagine us ever building a system which can express those rules in a generic way. Pager escalation rules ("text everyone if no one touches this for 15 minutes") seem somewhat more practical, but just building that into a pager/monitoring tool probably makes more sense than doing it via Herald or some Chrono-Herald.

If T4434 ends up somewhat-generic-ish, the "Task Reaper" becomes easy to write as an external script, although it will have to be pretty generic to satisfy all the use cases in T6491.

epriestley edited edge metadata.
  • Mostly-cosmetic fixes from self-review + feedback.
This revision was automatically updated to reflect the committed changes.