Page MenuHomePhabricator

Allow build implementations to extend their taskmaster lease
AbandonedPublic

Authored by hach-que on Jan 22 2015, 12:11 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Apr 2, 5:37 AM
Unknown Object (File)
Mar 6 2024, 6:27 PM
Unknown Object (File)
Mar 2 2024, 3:40 AM
Unknown Object (File)
Jan 29 2024, 7:09 AM
Unknown Object (File)
Jan 16 2024, 2:45 PM
Unknown Object (File)
Dec 26 2023, 7:21 AM
Unknown Object (File)
Nov 22 2023, 5:15 PM
Unknown Object (File)
Oct 26 2023, 11:17 AM

Details

Summary

Fixes T7006. Currently if a Harbormaster build target runs for longer than 24 hours, it will be executed multiple times. This is because the taskmaster lease expires on the target worker task.

This provides build step implementations a mechanism to extend the task lease by a number of seconds; whileever build steps implementations are progressing they can increment the task lease time, while still ensuring that if the taskmaster encounters a fatal error the task will be picked up elsewhere.

Test Plan

Modified the lease time on Harbormaster workers down to 30 seconds, and then ran a remote command 'sleep 60'. Saw the lease expiry time stay above 24 seconds under /daemon/ as the build occurred.

Diff Detail

Repository
rP Phabricator
Branch
master
Lint
Lint Passed
SeverityLocationCodeMessage
Advicesrc/infrastructure/daemon/workers/PhabricatorWorker.php:224XHP16TODO Comment
Unit
Test Failures
Build Status
Buildable 4624
Build 4638: [Placeholder Plan] Wait for 30 Seconds

Unit TestsFailed

TimeTest
194 msPhabricatorChangeParserTestCase::testGitParser
46 msPhabricatorChangeParserTestCase::testSubversionForeignStubsParser
32 msPhabricatorChangeParserTestCase::testSubversionParser
30 msPhabricatorChangeParserTestCase::testSubversionPartialParser
29 msPhabricatorChangeParserTestCase::testSubversionValidRootParser
View Full Test Results (5 Failed · 12 Passed · 1 Skipped)

Event Timeline

hach-que retitled this revision from to Allow build implementations to extend their taskmaster lease.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.

Have you looked to see if it's practical to have this happen automatically as part of the sleep() / stillWorking() stuff? I'm actually surprised long-running commands don't get detected as hung and killed right now, so maybe that's buggy.

The expectation is that daemons must call sleep() (or stillWorking()) periodically to send a signal to the overseer indicating that they have not hung. If the overseer does not receive such a signal before a deadline, it assumes the daemon is hung and forcibly restarts it.

It actually looks like the overseer allows the daemons to hang for 86400 seconds (24 hours) before force-restarting them right now, so maybe we are seeing that.

Extending the lease by 6 seconds at a time also seems odd to me. This will cause us to do a relatively large number of fairly unnecessary writes, and it doesn't seem implausible to me that something might hang for 6 seconds sooner or later and incorrectly cause the lease to be freed.

I'd expect the rule to be more like "check if we've used up 80% of the lease time; if we have, double the lease time".

I'm actually surprised long-running commands don't get detected as hung and killed right now, so maybe that's buggy.

It's quite possible that is happening; it would explain why we didn't seem to run out of taskmasters.

I'll hook this up to sleep() / stillWorking() with the alternate logic to prevent lots of writes.

hach-que edited edge metadata.

Update based on code review feedback

src/infrastructure/daemon/workers/PhabricatorWorker.php
228

This won't behave correctly when run from a daemon; calls to stillWorking() won't propagate to the daemon instance.

However, the only call site to waitForTasks is inside Drydock, which doesn't perform stillWorking() calls during allocation anyway.

epriestley edited edge metadata.

Some possible cleanups, but I think this is basically reasonable. Feel free to skip this feedback if it all seems like nitpicky junk.

src/applications/harbormaster/step/HarbormasterBuildStepImplementation.php
5–10

Maybe just call this setWorker()? It doesn't seem ambiguous to me.

src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php
22

Can we setDaemon($this) + executeTask() here instead? This is very nitpicky, but I think it's generally cleaner to use setters over optional parameters.

src/infrastructure/daemon/workers/PhabricatorWorker.php
22

I think it's cleaner to have callers check if ($daemon) than allow null here (which is unusual/unconventional in this codebase in cases where it isn't required to overwrite a self::ATTACHABLE).

86–89

Can we remove $data and $this->data? They can be accessed with $this->activeTask->getData() now, right? And they're only used for display, it looks like.

168

This is the most minor of nitpicks, but a variable name like $renew_threshold might be more accurate and less likely to become out of date if someone changes the constant.

228

Maybe leave a TODO. We probably should propagate the daemon here eventually.

This revision now requires changes to proceed.Jan 28 2015, 7:26 PM
hach-que edited edge metadata.

Changes from code review feedback

I need to also update FeedPublisherWorker which constructs other workers to deal with the new worker constructor.

hach-que edited edge metadata.

FeedPublisherWorker manually constructs workers for Doorkeeper; make sure they pass the task in the constructor.

Also ran grep to try and spot anywhere else that's manually constructing workers, but all the other places with worker names in strings seem to be uses of scheduleTask.

  • Fix up some tests which also manually construct workers
  • The reparse workflow also creates workers using newv

Add getActiveTask to PhabricatorWorker so that the FeedPublisherWorker works

This hasn't been re-reviewed since the last upstream feedback, so I'm abandoning it (and potentially just keeping it patched locally).