
Plans: 2018 Week 12 Bonus Content
Closed, Resolved · Public

Description

Rolled forward from T13102; see PHI364. An install encountered a held lock issue in a cluster which resolved itself immediately when I appeared onsite. It seemed plausible that this was because read synchronization does not put a timeout on git fetch, so if git fetch hangs for some network-related reason the lock could be held indefinitely.

See PHI466. This isn't totally concrete yet, but it looks like a user who has resigned from a revision but is still part of a package/project can continue getting notifications (just not email) in some situations; they should not.

See PHI480. See T13109. An install would like better user feedback for write lock status updates. This may lead to better write routing.

See PHI488. See T11145. Workflow forms do not disable their submit buttons when submitted. Also, at least in Chrome and Firefox, if you click extremely fast you can still double-submit a form even though we're disabling the button immediately in the event handler (???). We can deploy stronger locks on submission against these browsers.


These were moved to T13110:

  • See PHI483. Some of the mail rules around builds and drafts are kind of whack right now.
  • See PHI356, which has some reasonable suggestions for minor improvements to the filetree view.

Event Timeline

epriestley created this task.

See PHI364. An install encountered a held lock issue...

Here's the narrative on this:

An install reported some nodes in a repository subcluster falling out of sync. bin/repository update reported the repository update lock was held, even with the daemons on the same host stopped. I wrote the lock log in T13096 as a possible tool to help with identifying lock holders in the future, but also trekked for a fortnight over land and sea to lend aid.

When I arrived on the fabled shores, the subcluster had something like 7 nodes; 4 were working correctly and 3 had fallen out of sync with held locks and been taken out of service. We reactivated the binding for the first out-of-service node and it promptly came up and synchronized correctly, giving us 5 working nodes and 2 out-of-sync nodes. This is typical of onsites because the very aura of my will is strong enough to exorcise many of the bad spirits which cause computer programs to misbehave.

Fortunately, the other two nodes were more fruitful. We found a git fetch in a bin/repository update from March 7th hung on the second node, and a git fetch from an inbound SSH connection from March 13th hung on the third node.

A plausible (if not completely satisfying) explanation here is something like: git fetch hangs for a while (for days / forever) 0.01% of the time and there's no timeout on git fetch when it's built by DiffusionCommandEngine. If git fetch hangs forever, whatever's running it will hold the read synchronization lock forever. So a possible fix is to just put a timeout on it, which I plan to do now.
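
As a rough illustration of the shape of that fix: the idea is just to give the subprocess a deadline and kill it when the deadline passes, so a wedged fetch releases the lock instead of holding it forever. This is a plain-PHP sketch with an invented runWithTimeout() helper, not how DiffusionCommandEngine actually builds commands.

<?php

// Hypothetical sketch: run a command with a hard deadline, killing it if it
// hangs. The real fix would do something equivalent wherever the `git fetch`
// for read synchronization is constructed.
function runWithTimeout($command, $timeout_seconds) {
  // Discard output in this sketch so the child can't block on a full pipe.
  $spec = array(
    0 => array('pipe', 'r'),
    1 => array('file', '/dev/null', 'w'),
    2 => array('file', '/dev/null', 'w'),
  );

  $proc = proc_open($command, $spec, $pipes);
  if (!is_resource($proc)) {
    throw new Exception("Failed to start: {$command}");
  }

  $deadline = time() + $timeout_seconds;
  while (true) {
    $status = proc_get_status($proc);
    if (!$status['running']) {
      return $status['exitcode'];
    }
    if (time() >= $deadline) {
      // The fetch has hung; kill it so the caller can release the read
      // synchronization lock instead of holding it forever.
      proc_terminate($proc, 9);
      throw new Exception("Command timed out after {$timeout_seconds} seconds.");
    }
    usleep(100000); // Poll every 100ms.
  }
}

// Example: give a fetch 15 minutes instead of letting it run forever.
// runWithTimeout('git fetch origin', 15 * 60);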

A workaround to get the host to synchronize in the meantime is to connect to it and kill all possible lock holders. There are four kinds of processes which we expect to be able to hold this lock:

  • The daemons (kill them with bin/phd stop).
  • Explicit bin/repository update commands that someone ran to debug things (find and kill them with ps auxww + kill <pid>).
  • Read synchronizations doing git fetch before HTTP reads (users accessing the repository over the web UI or the API).
  • Read synchronizations doing git fetch before SSH reads (users fetching or pushing to the repository over SSH).

Find old fetches in the latter two categories with ps auxwww / pstree -ap or similar and just kill the git fetch.

Since the hangs we could find had a git fetch in the tree I think it's about 95% likely that the timeout will fix this, although ideally we'd also figure out exactly why git fetch can hang.

See PHI483. Some of the mail rules around builds and drafts are kind of whack right now.

This also came up in PHI309. Here's how I think it probably should work. When remote builds finish:

  1. If everything passed, the revision is promoted to "Needs Review" and everyone is emailed. This is how it works today.
  2. If at least one server-side unit test failed, the author is emailed with a build failure notification. This is not how it works today.
  3. If server-side tests pass, but at least one client-side unit test failed, the situation is ambiguous. See below.
  4. Client-side lint should have no impact on any of this and never block a draft from promoting.

For case (3), the user ran arc diff, hit a local test failure, and elected to continue. The server-side tests then passed. Their decision to continue could mean one of two things:

  • (A) "I think or know that this test doesn't work locally but my code is fine, so I'm only interested in the remote result." This revision should promote if server-side tests pass.
  • (B) "I think or know that my code has broken this test, but I'm interested in learning the remote result." This revision should not promote if server-side tests pass.

Users could indicate intent (A) with --nounit. Users could indicate intent (B) with --draft.

When there are no server-side tests, I think the intent is clearly (A).

Of these options, behavior (A) is more "dangerous", since it can cause irreversible side effects (sending some email to reviewers). Behavior (B) is safer.

My initial reaction here was that we should implement behavior (B), mostly because it's the safer option. However, I think the "damage" that option (A) does is insignificant, the behavior is reasonable, and it probably aligns with user intent more often in practice. It's also a little simpler because it remains consistent in the trivial case with no tests. So I'm inclined to:

  • Implement behavior (A): promote the revision for review if server-side tests pass, even if client-side tests failed (see the sketch after this list).
  • Add guidance to the "Tests failed, continue anyway? [y/N]" prompt recommending "--draft" if you just want to see whether the server-side tests pass; the expectation is that you'll signal your "I just want to try building this" intent with "--draft".
  • Consider stronger guidance on the revision to the effect of "This promoted but has local test failures, keep an eye out!".
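
For concreteness, the behavior (A) promotion rule could look roughly like this. This is only a sketch: shouldPromoteDraft() and its parameters are invented for illustration, not Differential's real API.

<?php

// Hypothetical sketch of the behavior (A) promotion rule: server-side results
// decide promotion, client-side unit failures don't block it, and client-side
// lint never matters.
function shouldPromoteDraft(
  $server_builds_finished,
  $server_unit_failed,
  $client_unit_failed,
  $client_lint_failed) {

  if (!$server_builds_finished) {
    // Still building: don't promote yet.
    return false;
  }

  if ($server_unit_failed) {
    // Case (2): notify the author instead of promoting.
    return false;
  }

  // Cases (3) and (4): client-side unit and lint results are ignored. Users
  // who want a local failure to hold the revision back use --draft instead.
  return true;
}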

As a somewhat related change, I'm planning to remove the lint and unit test "Excuse" prompts as part of T13098. We might restore these later, but I suspect they're sort of a well-intentioned but net-negative feature in practice now, especially with the advent of drafts and support for server-side builds.


Finally, PHI283 discusses a bad behavior today where abandoning a draft starts sending email. It shouldn't.

Fixing this probably means making the "draft" flag orthogonal to the revision status. Although I'd expect to keep a "draft" status, there would also now be a secondary "draft" flag, so that a revision can be in an "Abandoned + Draft" state. I'm certain no one, least of all me, will ever find this confusing, and it won't be the source of any bugs in the future.

If this does get added, we could improve situation (2) above by automatically demoting the revision to "Changes Planned + Draft" when the server-side build fails. This would make the behavior of these revisions a little better on the Differential action dashboard, and reduce the chance that they get lost if you miss an email.
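
A minimal sketch of what the orthogonal flag would mean in practice, with invented names rather than Differential's real storage: the status and the draft flag live in separate fields, so states like "Abandoned + Draft" are representable and mail stays suppressed regardless of status.

<?php

// Hypothetical sketch: status and the draft flag as orthogonal fields.
final class RevisionState {
  private $status;   // 'needs-review', 'changes-planned', 'abandoned', ...
  private $isDraft;  // true until builds pass or the author force-promotes

  public function __construct($status, $is_draft) {
    $this->status = $status;
    $this->isDraft = $is_draft;
  }

  public function abandon() {
    // Abandoning changes the status but leaves the draft flag alone, so an
    // abandoned draft does not suddenly start sending mail (see PHI283).
    return new self('abandoned', $this->isDraft);
  }

  public function demoteOnBuildFailure() {
    // Possible improvement for case (2): drop back to "Changes Planned"
    // while keeping the draft flag set.
    return new self('changes-planned', true);
  }

  public function shouldSendMail() {
    return !$this->isDraft;
  }
}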

From reading the code, the behavior should already be (A): that is, we should already get this right in cases (3) and (4).

However, this isn't the behavior I'm seeing in practice -- for example, see D19238. I had a local test failure there (javelinsymbols, which I should probably just delete) but the revision did not promote at D19238#236605.

It did promote when I next made an update, at D19238#236614.

This could be some kind of race issue, but I think I've seen this a couple of other times so I suspect it's just a mundane bug with the accounting around autobuilds.

I had a local test failure there

Actually, a local lint failure.

D19240 exhibited the same behavior on a local lint failure so this is almost certainly a mundane bug rather than some sort of exotic race.

This is sort of a race. Here's what happens:

  • Harbormaster starts all the builds and moves the Buildable out of "Preparing" to "Building".
  • Now that the Buildable is no longer preparing (we've started all of the builds), it's eligible to pass or fail.
  • The Buildable immediately fails because some tests have failed (the local lint/unit tests). In general, it's good for us to fail early, because it means that you get feedback right away when "Important Fast Tests" fail even if "Really Slow Really Not Very Important" tests are still running. So this behavior is mostly desirable, in the general case, and not a bug.
  • The Buildable publishes the failure to the revision.
  • The Revision does not promote because there are still active builds (whatever Harbormaster started).
  • The Buildable doesn't publish again because it already failed, and its status doesn't change when those pending builds come in.

So all of these pieces are working correctly when viewed in isolation; they just aren't interacting very well.
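
Modeled as code (purely illustrative; these aren't the real Harbormaster or Differential classes), the interaction looks roughly like this: the buildable fails early and publishes once, the revision refuses to act while anything is still running, and no second publish ever arrives.

<?php

// Toy model of the interaction, not the real objects.
final class ToyBuildable {
  public $status = 'building';

  public function onBuildResult(ToyRevision $revision, $failed, $builds_remaining) {
    if ($failed && $this->status !== 'failed') {
      // Fail early: desirable on its own, since fast failures give feedback
      // before slow builds finish. Publish the failure immediately.
      $this->status = 'failed';
      $revision->onBuildableUpdate($this->status, $builds_remaining);
    }
    // When the remaining builds finish later, the status is already
    // "failed", so nothing changes and the buildable never publishes again.
  }
}

final class ToyRevision {
  public $status = 'draft';

  public function onBuildableUpdate($buildable_status, $builds_remaining) {
    if ($builds_remaining > 0) {
      // Differential ignores the update because builds are still active...
      return;
    }
    if ($buildable_status === 'passed') {
      $this->status = 'needs-review';
    }
  }
}

// Local lint/unit fails immediately while two remote builds are still running:
// the buildable publishes once, the revision ignores it, and the draft is
// stuck until the author's next update.
$revision = new ToyRevision();
$buildable = new ToyBuildable();
$buildable->onBuildResult($revision, true, 2);
echo $revision->status . "\n"; // still "draft"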

This seems tricky to fix.

One possible fix is to just ignore local lint and unit test results for all buildables. This is probably fine in practice, but feels very messy and hacky, especially if it's truly implemented as "ignore all autobuilds". A slightly-less-hacky flag like "this build does not affect build results" might be OK?

One possible fix is to not publish build updates until all builds finish. This would fix things, but users would sometimes get build failure notifications long after the information was known to Harbormaster. I don't think this is desirable, and this is probably the least attractive approach.

One possible fix is to publish every time a build finishes, even if the build status didn't change. This might be a good thing to think about, although I don't want to actually put these in the timeline. I worry that this may be hard to understand as a user ("why did the revision mysteriously update?") and hard to reason about and test as a developer ("sometimes, invisible updates occur, so account for them").

There's a more general issue here where Differential is clearly in the wrong conceptually. Consider:

  • No local lint or unit.
  • Two remote builds, "Fast A" and "Slow B".
  • "Fast A" fails. The buildable fails (since we know the result for sure, regardless of what "Slow B" does) and publishes.
  • Differential ignores this (since "Slow B" is ongoing) and never updates.

This behavior isn't strictly wrong today in practice, but only because we don't try to implement rule (2) (further above): without that rule, the "correct" response when anything fails is to ignore the revision forever.

I'm not entirely sure how to best navigate this. I think Differential should definitely change to look more like:

if (build result is known) {
  if (pass) {
    // promote
  } else {
    // today: ignore
    // soon: notify author
    // future: demote to changes planned + draft
  }
}

Another possibly-adjacent change is the idea of introducing build flags more broadly. I've been kind of waiting for this to ripen, but let me see if I can collect the use cases now.

On complex build statuses, which has come up more than a handful of times but which I haven't aggregated well:

T10568 would like a "fractional pass", i.e. "3/5 of builds pass".

T1515#16326 discusses some flags very speculatively, although these were imagined as more "evidence-based" flags than "configured" flags, to some degree, and are aimed at unit tests, not builds: "monitoring" (skip locally), "chaotic" (ignore unsound failures), "slow" (run last, prompt to skip), "advisory" (never prompt/block even on failure).

Recently via email (Suppress "harbormaster has not finished building" warning in arc land), an install wanted a way to skip the arc land prompt for ongoing builds, since for them arc land just submits to a publishing queue which re-runs builds anyway. More broadly, we could imagine doing this for slow, nonessential builds.

T5920 discussed more build-level statuses, including "unstable". There, it was a Jenkins or JUnit thing where "executable does not build" produced "fail" and "executable builds but does not pass tests" produced "unstable". This is an odd way to label things to me, but the distinction isn't necessarily useless.

It (with T5493) also discussed non-critical builds, where we basically don't care about the result.

It also discusses another "fractional pass" case, where builds pass or fail depending on whether they produce fewer or more than 100 warnings. This use case feels flimsy to me: build status isn't the best way to monitor this, and your test harness could implement the distinction anyway.

T10123 and PHI465 discuss two different features which may overlap around providing users with additional context and information beyond build status, particularly "what you should do now that your build has failed". These may also overlap with T6139. We could imagine pathways here related to build status, although I think they're probably better approached separately.

Tentatively, I think these flags may make sense:

  • ignore-when-landing: "Never", "If Building", "Always". Ignore this target when prompting the user about ongoing or failed builds in arc land. If you use a submit queue, you could run all builds as "Ignore When Landing: If Building" (if a test has completed and failed, you probably do want to warn the user).
  • ignore-when-undrafting: "Never", "If Complete", "Always". Ignore this target when choosing whether to promote or demote a draft. You can use "If Complete" if you want to wait for the build but promote even if it has failed.
  • ignore-when-aggregating: "Never", "If Complete", "Always". Ignore this target when deciding whether a build has passed or failed.

This gives us three different views of build status, which isn't great and is potentially very confusing.

However, we could then run the arc lint and arc unit targets with ignore-when-landing: Always (which would stop arc land from re-prompting you if you already "y"'d through the prompt during arc diff) and ignore-when-undrafting: Always, which would fix behaviors (3) and (4) above.
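
To make the flag semantics concrete, consumers might filter build targets roughly as below. The flag names come from the list above; everything else (function names, status strings) is invented for this sketch.

<?php

// Sketch of how `arc land` and the undraft check might consult the proposed
// flags. Nothing here exists today.
function targetsToWarnAboutWhenLanding(array $targets) {
  $warn = array();
  foreach ($targets as $target) {
    $flag = $target['ignore-when-landing']; // 'never', 'if-building', 'always'
    if ($flag === 'always') {
      continue;
    }
    if ($flag === 'if-building' && $target['status'] === 'building') {
      continue;
    }
    if ($target['status'] === 'building' || $target['status'] === 'failed') {
      $warn[] = $target;
    }
  }
  return $warn;
}

function blockersForUndrafting(array $targets) {
  $blockers = array();
  foreach ($targets as $target) {
    $flag = $target['ignore-when-undrafting']; // 'never', 'if-complete', 'always'
    if ($flag === 'always') {
      continue;
    }
    $complete = ($target['status'] !== 'building');
    if ($flag === 'if-complete' && $complete) {
      // Waited for the result; promote even if it failed.
      continue;
    }
    if ($target['status'] !== 'passed') {
      $blockers[] = $target;
    }
  }
  return $blockers;
}

// Running the arc lint/unit targets with both flags set to 'always' means a
// local failure neither re-prompts at `arc land` nor blocks promotion.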

But wait! This means we definitely need to publish multiple times, since we have to publish every time one of these sub-statuses changes. And, if we're going that far, we'd probably be better off just publishing on every update anyway, since it's at least slightly more predictable, and will work if an application is deriving some sub-sub-status on its own.

And if we're doing that, I think we might as well not do any of this for now and just hard-code the effect of "ignore-when-undrafting" against autoplans for the moment.

So maybe this still isn't ripe. Instead, the publish step moves to a ThingThatCaresAboutBuildsEngine, and applications get their own logic for responding to build updates.

In some ways, this feels somewhat bad, since it adds a whole additional layer of hidden, difficult-to-observe complexity to build publishing. However, since it makes publishing state-based and repeatable, maybe we can peel that back with bin/harbormaster publish Bxxx. I think we probably end up in a cleaner place conceptually, even if there's more going on. So I'm currently leaning in this direction.
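
A minimal sketch of what "state-based and repeatable" could mean here, with invented names: the publish step recomputes everything from the current build state and is idempotent, which is what would make a bin/harbormaster publish Bxxx style re-run safe.

<?php

// Sketch of a state-based, repeatable publish step (all names invented).
// The aggregate status is derived from scratch on every call, so running it
// again produces the same result.
function publishBuildable(array $builds, array $listeners) {
  $status = 'passed';
  foreach ($builds as $build) {
    if ($build['status'] === 'building') {
      $status = 'building';
    }
    if ($build['status'] === 'failed') {
      // A known failure dominates, even if other builds are still running.
      $status = 'failed';
      break;
    }
  }

  // Hand the current state to each interested application (Differential,
  // mail, notifications, ...). Each listener applies its own response logic
  // and is responsible for ignoring states it has already acted on.
  foreach ($listeners as $listener) {
    $listener($status, $builds);
  }

  return $status;
}

// Example listener, mirroring the Differential pseudocode further up.
$differential = function ($status, $builds) {
  if ($status === 'passed') {
    // promote
  } else if ($status === 'failed') {
    // today: ignore; soon: notify author; later: demote to "changes planned"
  }
};

// publishBuildable($builds, array($differential));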

Much of this survives in T13110; the rest is resolved.