Page MenuHomePhabricator

Add author information when creating a build in Buildkite
Closed, ResolvedPublic

Description

We've been trying out the Buildkite integration and it has been working great so far. One problem that we've encountered is Phabricator not including author information when creating a build in Buildkite.

The Buildkite API token is a personal token, so all builds are now appear as though they have been created by me personally. This breaks certain Buildkite features like the My Builds page, in addition to people blaming me for their own failed builds ๐Ÿ˜ธ

As the author field in the Buildkite API is just a freeform name and email address, it would be great if Phabricator could add the creator of the commit or revision as author of the build.

Event Timeline

epriestley triaged this task as Wishlist priority.Feb 13 2017, 12:56 PM
epriestley moved this task from Backlog to Far Future on the Harbormaster board.

@epriestley Would you guys be willing to accept a patch that implements this?

@epriestley I know you've said no on accepting a patch for this one, but we've found that without the "author" field sent to Buildkite it's not very useful.

I've gone through the process of signing the CLA, local testing, etc and I've posted my first ever arc diff up (D18075) with the changes.

From the outside look to be reasonable and small. Would love you to take the chance to reconsider your position.

The general shape of the change is reasonable. However:

  • The change will fatal for commits or diffs without a user author.
    • The most common case of this is commits made by an "Author" string we don't recognize (unknown@somewhere-else.org), like when you import a third-party repository.
    • This can also happen if a user is deleted (with bin/remove destroy), then someone else triggers a build of a diff they authored.
    • In some cases, commits or diffs may (now or in the future) be generated by non-user authors.
      • Today, I believe "Differential" (the application itself, which will have a PHID like PHID-APPS-PhabricatorDifferentialApplication) ends up as the author on the diffs which are generated when a revision is closed in response to discovering a corresponding commit.
      • In general, other applications may reasonably author diffs in the future (like: Differential, Diffusion, Harbormaster, Drydock, Nuance, whatever comes out of T9530).
    • (See also T12164 for vaguely related adjacent work.)
  • Ideally, you should use PhabricatorPeopleQuery, not loadOneWhere(), but this requires access to a $viewer which the API currently doesn't provide. Perhaps it should. I don't have a clear resolution for this offhand.
  • You use $author->getRealName(), but this field is unstructured and may be empty, and this is broadly not consistent with how Phabricator otherwise identifies users.
    • For context, see: T2950, T3167#32426.
    • Although there's an argument that we might reasonably want to try to mimic git behavior more closely here and select "Alice Xavier" instead of "axavier", we currently use usernames in synthetic authors we construct elsewhere (see DrydockLandRepositoryOperation) and we should justify this deviation more strongly if we're making it.
    • (But also see below, as we aren't really building a git authorship string.)
  • We treat primary email addresses as private information, but this change makes them public.
    • For context, see: T12257, T10279#157396. I recall some lengthier discussion elsewhere, but I think it may have been in a private thread or a dream or something since I can't immediately dig it up.
    • In some configurations, this may present an unverified primary address to Buildkite. This isn't inherently problematic here, but doing this in other contexts (like OAuth) can lead to major security issues (register the address in system A, OAuth to system B, system B trusts system A [which is improper, but very convenient], you now have a verified account on system B even though you don't own the address). Even if we decided that primary email addresses were public, I think we'd probably want to present only verified addresses to third-party systems as a general rule.
  • This duplicates knowledge ("given a PHID, how do we build revision control authorship information?") already existing in DrydockLandRepositoryOperation. Since I think this question is legitimately complex (because of all the special cases above), likely to continue to arise (as we do more synthetic generation of commits and integration with external systems) and I think we're in materially worse shape if we copy/paste it here.
    • My initial thinking is that this could reasonably be extracted into a separate Query class, but testing the callsite in Drydock currently requires five Phabricator PhD's.
  • In many cases, diffs already have authentic git authorship information (some discussion of this in T12257), which this doesn't use. But from the description here and in D18075, it sounds like the meaning of author is really "Buildkite user to notify of this build's results", not necessarily what Phabricator thinks of as an author.

For example, suppose this chain of events occurs:

  • User A creates a revision by running arc diff on a change authored by user B (who may not be a Phabricator user).
  • User C commandeers it, becoming the new owner.
  • User D clicks "Run Build" or "Restart Build" in Harbormaster (if the build was triggered automatically, this user may not exist).

As written, D18075 will use "A" as the author for Buildkite.

In arc land, we currently use "B" as the author and the user most similar to "D" (whoever is running arc land) as the committer.

In "Land Revision" from the web UI, we currently use "B" as the author and the user most similar to "D" (whoever clicked the button) as the committer.

When they differ, I think users "C" (current revision owner) or "D" (user whose direct action triggered the build) are probably more expected/consistent than "A" (user who originally created the diff). "B" is probably the worst value to use since this field does not appear to be used for commit attribution in Buildkite, which is unintuitive because it's the best "author" field to use for generating synthetic commits.


Thus, the pathway forward here is muddy:

  • I'm not sure which user to select when A, B, C, and D differ. The common case is that they do not differ and that they are all "A", so I think it's reasonable to use "always A" as a the rule here, but that is probably not the best rule we could select, or even the best simple rule.
  • We can extract the existing logic from DrydockLandRepositoryOperation and return some new proxy object (like a RepositoryActor, with fields like email and username and methods like getGitAuthorString()) to fix the DRY and "it fatals if the author isn't valid" issues.
  • This extracted query could accept a Viewer and we could paper over this in the implementations with the omnipotent viewer, but changing the methods to accept a $viewer is likely trivial and pushes the problem up to a level where it's easier to solve in the future if it needs to be solved.
  • But: I don't see any easy answer for extracting email addresses from user "A" (or any user but "B"), and from D18075 it sounds like this won't work at all without that. I'm unwilling to change the "email addresses are private unless the user explicitly agrees to disclose them" rule. A solution that would be acceptable from a technical perspective is something like a new "Buildkite Email Address (this address will be exposed to Buildkite)" field, but this is very bad from a product perspective.

To step back, the solution we use in other cases like this (JIRA, Asana) is to require uses to OAuth, the use their individual tokens to make the API call. This is more palatable for users than linking accounts by entering separate email addresses, makes sure users can't take actions in the remote system that they don't actually have permission to take, and can be more realistically wrangled into a usable product on our side. But this is still far more setup for users than the current method.

But it sounds like the current method amounts to "let any user freely impersonate any other user". Although this is presumably fine here, I'm a little suspicious of this kind of design in the general case. Principally, it can be misleading to users building on top of it if the fact that the field is user-controlled, freely selectable, and purely advisory isn't sufficiently obvious, and "author" sounds like a relatively authentic field when it sounds like the real meaning is more like "userTheCallerHasSelectedToNotifyOfBuildResults".

Regardless, the only value we can actually use without sweeping changes is "B". I think this is equivalent to a checkbox on the Buildkite side like this:

  • By default, notify commit authors instead of the API key holder (yourusername) for builds created with this API key.

Or an author=automatic option on the call itself or whatever.

I'm somewhat surprised this isn't the default behavior, so maybe I'm still misunderstanding or missing something here. Why do API key builds default to notifying the owner of the API key instead of the authors of the commits?

My thoughts are:

  • The build may fail before the "automatic author" can be identified (e.g., bad commit), and needs to show up somewhere if it does.
  • Maybe there was a huge casualty count from defaulting to the automatic author and this was a "the users are at our door with pitchforks and torches" change (by default, make users shoot themselves in the foot, not other users).

I'd naively expect the behavior to be:

  • Associate with the API key holder until a working copy is built.
  • Then, associate with the commit author at HEAD unless the author field was provided.
  • (If the author field is provided but invalid, fail immediately, but this ship has presumably sailed.)

And some flavor of this must be how normal non-API builds work, since builds that aren't externally triggered can't have a separate "author" field? But I have only a surface-level understanding of the forces at work here and the meaning of this field, so maybe there are other factors in play.

There is no real technical or product barrier on the Phabricator side to selecting user "B" here and passing it to Buildkite. When that user differs from "A", "C" and "D" it is likely the worst user to select, but all four users are almost always the same. "B" also sometimes does not exist, but it usually does.

Another attack we could make here is to query Buildkite first to find the user's verified primary email address. If we ran a query on their real name or username, or the entire list of users in the org, and that query returned their verified primary email address, we could pass that back to Buildkite without disclosing anything Buildkite does not already know -- effectively using their email address as a user ID. However, it looks like there's currently no "scrape everyone's email address via the API" feature, at least based on a cursory look on the web UI. (I poked into GraphQL a little and it looks like there's some "team.members" thing, but I am not enough of a computer wizard to guess how GraphQL works).


As meta-discussion: I'm sorry if it's frustrating that it's hard to get stuff upstreamed here -- I know it can be disappointing to spend time writing a change and hit a barrier. Avoiding this frustration is part of the reason we discourage sending patches: "no" is disappointing no matter what, but it's less impactful if you spent 10 minutes writing up a suggestion than if you spent 60 minutes writing and testing a patch that actually works in your environment.

If something is basically reasonable, not entangled in adjacent work, and will take me less than about 30 minutes to implement, I usually implement it immediately. Almost everything that survives triage is legitimately complicated for some reason. But I haven't found a way to convince anyone of this that doesn't take a very large amount of time to explain, as above. See also T12134 about generally improving alignment of contributor/upstream incentives.

To summarize where we've ended up:

  • We prioritized the initial implementation (T12173) because of a customer request.
  • That customer hasn't expressed interest in this "author" change, so there is currently no priority on this change.
  • I suspect any change here will require a significant amount of my time to either write or shepherd. I don't expect to dedicate more time to this change (whether writing or reviewing) until we see customer interest or much of the complexity resolves elsewhere naturally and we're making adjacent changes (e.g., the next Harbormaster iteration).
  • From an incentive viewpoint, I don't want to encourage or reinforce "No." โ†’ Ignore it and send a diff anyway โ†’ "Yes.". Although I think this would have little impact on the state of the world and T12134 moots this and is where we're headed anyway, it's still a bad precedent to set and users do sometimes point at cases where we bent the rules for someone else to justify bending the rules themselves.
  • If there's relevant information about Buildkite's behavior, you could leave a response here and that would be helpful when this does eventually either get priority or makes it to the head of the queue on the natural roadmap.