
Provide a workflow to restart Harbormaster builds
ClosedPublic

Authored by yelirekim on Sep 2 2016, 6:08 AM.
Tags
None
Subscribers
Tokens
"Dat Boi" token, awarded by michaeljs1990.

Details

Summary

Ref T10867 for original use case. This workflow gives administrators a way to stop the daemons for upgrades or maintenance, then bring them back up without failing the builds that were running at the time.

On our organization's phab install, builds are running 24/7. The majority of these builds last for at least several minutes, and contain build steps which fail if interrupted and then resumed, as happens when turning daemons on and off.

Rather than allowing these build steps to resume execution as normal, this workflow instructs active builds to restart their entire build process instead of resuming whichever step they were on.

Test Plan

contrived a build plan which would fail if resumed partway through:

  • lease a working copy
  • command touch restart_{build.id}
  • command test -e restart_{build.id} && rm restart_{build.id} && sleep 60
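Why this plan breaks on resume can be sketched locally. The following is a minimal simulation, not the Harbormaster implementation: it assumes a resumed step is re-executed from its beginning, uses a made-up `BUILD_ID` in place of `{build.id}`, and drops the `sleep 60` so it finishes quickly. The helper names (`step_touch`, `step_check`, `outcome`) are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for the {build.id} build variable.
BUILD_ID=123
workdir=$(mktemp -d)
marker="$workdir/restart_${BUILD_ID}"

# Second step of the plan: create the marker file.
step_touch() { touch "$marker"; }

# Third step of the plan: require the marker, then consume it.
# The real step also runs `sleep 60`; it is omitted here.
step_check() { test -e "$marker" && rm "$marker"; }

# Run a step and report "pass" or "fail" based on its exit status.
outcome() { if "$@"; then echo pass; else echo fail; fi; }

# Normal run: touch, then check.
step_touch
first_run=$(outcome step_check)

# Resume (assumed behavior): only the interrupted step runs again.
# The marker was already removed, so the check fails.
resumed=$(outcome step_check)

# Restart: the whole plan runs again from the first command.
step_touch
restarted=$(outcome step_check)

echo "first run: $first_run"
echo "resumed:   $resumed"
echo "restarted: $restarted"

rm -rf "$workdir"
```

Under these assumptions the resumed run is the only one that fails, which matches the old-procedure and new-procedure results in this test plan.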

followed old procedure:

  • run a few of these builds manually
  • ./bin/phd stop
  • ./bin/phd start
  • saw the builds fail

followed new procedure:

  • run a few of these builds manually
  • ./bin/phd stop
  • ./bin/harbormaster restart --active
  • ./bin/phd start
  • saw the builds pass
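The new procedure could be wrapped in a small script along these lines. This is a sketch only: the three commands come from the steps above, while `PHAB_ROOT`, `DRY_RUN`, `run`, and `restart_sequence` are hypothetical names introduced for illustration. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# PHAB_ROOT (install directory) and DRY_RUN are assumptions, not part of Phabricator.
PHAB_ROOT="${PHAB_ROOT:-.}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Stop the daemons, mark active builds for restart, then start the daemons.
# Each step runs only if the previous one succeeded.
restart_sequence() {
  run "$PHAB_ROOT/bin/phd" stop &&
  run "$PHAB_ROOT/bin/harbormaster" restart --active &&
  run "$PHAB_ROOT/bin/phd" start
}

restart_sequence
```

Setting `DRY_RUN=0` would execute the commands for real. The review comments mention a "Restart %s build(s)?" message, so the restart step may ask for confirmation and is probably best run interactively.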

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

yelirekim retitled this revision from to Provide a workflow to restart Harbormaster builds.
yelirekim updated this object.
yelirekim edited the test plan for this revision. (Show Details)
yelirekim edited edge metadata.

What are the advantages of this approach over using bin/phd graceful in your environment? (Predictability of what the "restart" script does / how long it will run for?)

Do you have plans for coordinating the stop + harbormaster restart + start sequence across multiple daemon hosts ("internal deploy magic")?

> contain build steps which fail if interrupted and then resumed

Is this fixable? In theory? In practice? Offhand, this seems a little unusual (I'd expect most builds to be repeatable, e.g. commands arc unit or make should work even if run in a working copy with some leftovers from a previous failed build).

I think this is likely fine as a general operations/administration tool, but this particular restart workflow is probably not universally applicable. In particular, if your builds last several hours instead of several minutes, this throws away their work.

epriestley added a reviewer: epriestley.

Actual code looks fine, and I think this is justified on the basis of making it easier to debug/develop Harbormaster even if it isn't a universal solution to build/restart interactions.

PhutilConsole is sort of out-of-favor versus tsprintf() but I'm not really happy with either API at the moment so I think "do whatever" is reasonable for now, since I suspect the One True API For Telling Users Stuff From The Console has yet to be written.

src/applications/harbormaster/management/HarbormasterManagementRestartWorkflow.php:58

Slightly more flexible as:

pht('Restart %s build(s)?', new PhutilNumber($count))

...then go translate it in PhabricatorUSEnglishTranslation if you want pretty text.

This revision is now accepted and ready to land. Sep 2 2016, 12:18 PM
yelirekim edited edge metadata.

use pretty numbers when displaying build count

In theory we could make it so that steps resume correctly, but in practice I have very little control over the contents of the scripts that get run. People tend to wrap all of the stuff up that their build is supposed to do into a single script, and assume they're starting fresh each time it's executed.

Graceful stop isn't a great strategy because we do have builds that run for hours, and I want to be awake when the update completes. Restarting hours-long builds is better than failing hours-long builds.

This revision was automatically updated to reflect the committed changes.