Page MenuHomePhabricator

Observe Mode + GitHub has weird mail interactions due to pull request refs
Closed, ResolvedPublic

Description

Once upon a time, earlier this week...

Bear with me through this bizarre story. Maybe this should be closed as "Working as Intended", but it's probably worth spelling out.

On Haskell.org's Phabricator, we have several repositories we mirror, all from GitHub. We imported these repositories long ago and they are only occasionally updated - we actually maintain our *own* mirrors/repos under haskell-infra on GitHub, and push from upstream mirrors to our own mirrors, and Phabricator observes ours (i.e. we vendor copies of upstream for ourselves).

Recently, we upgraded from a build in late April to this build from the past week. In that time frame, in 2016 Week 25 Diffusion was changed to fetch refs even though they may not be reachable from a visible branch ref.

This can lead to some weird behavior when this is combined with GitHub, and I am not sure what the right solution is.

In our case, here is what happened: when we upgraded, Phabricator went back over these forks of ours -- note, several of them having not been touched in months -- and with it's new, more aggressive ref matching, it began importing commits for every pull request ever made on the imported upstream repositories, no matter if they are open, closed, or anything else.

Many of these people also have accounts on Phabricator. Because Phabricator sees the commit has been made by a user, it will email them notifying it about them. That means before I knew it, I had spammed out at least several hundred emails to over a hundred people, about 6-8 month old commits, or forgotten pull requests they made that were never merged!

Note that this was also compounded mainly by the upgrade: Phabricator will only start sending emails *after* it has marked the repository as imported. And, well, ours was imported at the time!

To fix this, we stopped all the daemons, marked all these observed repositories as Turn off notifications (Feed, Herald and Email), and turned the daemons back on. The objects were still imported (that's totally OK) but without the emails. For the record, this current status is OK with us: none of these repositories that are observed currently need any of those features (but see below)

The main other way I can think this could crop up is a scenario which I think is pretty reasonable. I haven't exactly tested this so lol I am probably wrong, but:

  1. We use Phabricator and have all our people at $DAYJOB signed up. The people love Phabricator, and their enthusiasm is incredible.
  2. We observe some repositories from GitHub, because everyone else wants that. Observing is still useful even if the canonical repository is elsewhere: you can run CI through Harbormaster, use Owners, etc.
  3. Some people submit pull requests, etc. Maybe, people from $DAYJOB submit PRs, too! That's useful if development is ideally centralized on GitHub (perhaps proprietary stuff in your Phabricator - open source releases on GitHub for end-user accessibility)
  4. Now, anyone who ever makes a PR against the repository while being a user on Phabricator (for $DAYJOB) gets spammed, because Phabricator thinks that an un-reachable commit object it now sees is a 'real' commit.

Why this happens

Okay, so I kind of am lying every time I say 'unreachable', because these commits are totally reachable, but they're never fetched by default clients. That's why this is kind of weird.

What GitHub does is, when you make a PR, it will create a visible -- yet not a default -- reference in the master repository. This ref is named under the pull/ namespace, reflecting the pull requests. This is actually a very convenient feature: it allows you to configure git to fetch those refs by default, and thus you can get a copy of every PR locally. This is very useful for maintainers to develop locally!

Check out this gist, which shows you how to mirror the pull/ ref namespace into your local repository on git fetch.


Suggested fixes.

I can think of two ways to perhaps 'fix' this to prevent the spam, but I don't know how they overall fit in to the Diffusion design or all intended use cases (in order of actual reasonableness/feasibility):

  1. When Phabricator sees an incoming commit, on an observable repository that is not reachable by any visible reference in that repository, it should avoid emailing the user if their account exists. This seems like the most general solution and seems like a sensible UX. However, this doesn't exactly work in the above case, because there is a reachable reference that Phabricator will identify.
  2. Disable sending email notifications on observed repositories, or all notifications, by default. I honestly can't think of many major uses where I'd want email in this case - I may want Harbormaster emails, but not like, Diffusion emails for a thing like Github which already emails the hell out of me. But that means I need Herald. However, I'm under the assumption "Turn off Notifications" is exactly what it is for a reason: i.e. it is intended to be a big hammer, and splitting it into a shitload of small options makes things complex and is undesirable.
  3. Maybe just strongly suggest instead that observed repositories should have notifications turned off?
  4. When an observed repository points to GitHub, ignore emails or something on pull/ refs. I think this is really the worst one because it's too special cased -- it breaks e.g. in the case someone sets up a mirror for a GitHub repo, fetching all refs (like Phab does) then you import the mirror into Phabricator - and the mirror might not be GitHub at all.

Detailed, real repository for reproduction analysis.

Here is an easy repository where you can see this kind of specialness in action, using a small project we're developing for $DAYJOB:

  1. Create a new repository.
  2. Create a new URL for the repository in Observe mode, and set the repository URI to https://github.com/well-typed/binary-serialise-cbor.git
  3. Let the repository finish importing.
  4. Look at this Pull request for our repository in all its glory: well-typed/binary-serialise-cbor#83
  5. Open the 'Commits tab', and click on the single commit that is there.
  6. That leads you here, which gives you this commit hash: 14212004ecab637496936ce4d46f091b63d17e80. Note that this commit is part of our repository, not their fork.
  7. Go back to Phabricator, and search for 14212004ecab637496936ce4d46f091b63d17e80 in the search bar.
  8. Note that this commit object is immediately pulled up and redirects you right to it in Diffusion - it is now pulled into the repository Phabricator monitors, even though it is not part of any visible reference. The commit object simply exists in the repository, but is not pulled by most clients by default.
  9. As mentioned earlier, Phabricator identifies this as part of the 'special' pull/ namespace:
    pasted_file (635×815 px, 62 KB)
  10. Okay, now here is where it gets weird, but it is 'clear' as a corollary that...
  11. Back in the pull requests for binary-serialise-cbor, there is another important pull request: this one, #71. Crucially, PR #71 was obviously opened before #83 -- and neither are merged.
  12. Pull request #71 adds a single commit, this one.
  13. Because it was opened before #83, when the user forked our repository, their repository also has a copy of this unmerged pull request. See here: https://github.com/obadz/binary-serialise-cbor/commit/d306e0d3c55a913bc6c0f782ad9a502f812e99a3, which note is a fork, but containing commits from another fork. Again, #71 was never directly reachable from the parent when #83 was forked. In theory, if the user had created an empty repository, made a clone, and initialized a fresh one, then the commit from #71 would have never made it in (because in the initial clone, your client wouldn't have fetched it). But, obviously, everyone on GitHub hits Fork.

Event Timeline

thoughtpolice updated the task description. (Show Details)
thoughtpolice updated the task description. (Show Details)
thoughtpolice moved this task from Backlog to Details on the Haskell.org board.

A small part of the motivation for this change was to make commits reachable only from Gerrit magic refs (refs/changes/..., similar to refs/pull/...) visible in Diffusion. This didn't really make it into a single upstream task I think, but some of this is scattered across T9726 / T6878. So we have at least one use case where users want these refs to exist and be treated roughly like they were tags/branches.

You can disable email by default in SettingsGlobal Default SettingsEmail PreferencesAudits as an administrator. You may need to Create Global Defaults if you haven't yet. This won't affect users who have adjusted their own settings: the whole block of dropdowns counts as one "setting", so the global default only applies if users haven't touched anything else.

This mail is entangled with auditing, and a legitimate use case for observing a repository is so that you can use Phabricator's audit flow on it. This makes me hesitant to disable mail for observed repositories by default.

I don't particularly love doing magic on pull/, although a slightly more general version of this would be to expand "Track Only" and "Autoclose Only" to be able to match tags/refs. We probably should do this, it's just involved and not a particularly great solution to the problem on its own, since it would just let us say "go put ref-regex((refs/pull)) in "Track Only"" or whatever.

Another issue here is that commits only create events when they first appear in a repository, not when they become reachable from more branches (see T6522). For certain classes of magic on pull/, you could sneak changes into a repository by pushing them there first and then merging them to master (possibly by fast-forwarding master, so no merge commit is generated).

Users who received this mail presumably (mostly) already receive other mail, and haven't adjusted their settings yet? That is, if Phabricator mailed them when it discovered pull/, it must also mail them routinely when it discovers commits on master, I think? If they haven't changed this, maybe the mail is generally desirable (and this is mostly a one-time issue) or perhaps the setting is too hard to change (T9161, etc).

We could also possibly just drop these commit mails in all cases where no audit triggered. It's easy to write a Herald rule to restore them if you do want them. Possibly involved enough to be adjacent to T10978, though.

Thanks for listening to my hair-brained schemes. Yeah, the motivation is totally clear to me FWIW, and thinking about it more, I strongly suggest this is related to T9161.

Generally (and I imagine this is the rough approximation for everyone), I find Phabricator to be much more sensible about email overall when set up, but out of the box you do get a lot of mail. Our case is obviously maybe more atypical than most; we get many people who sign up out of curiosity in the community or to do very small things, so they only spend like 20 minutes one time, and never configure much email. This ticket is somewhat in line with that (tho it was still the right thing to disable notifications, etc on those repositories.)

I think the idea of dropping commit mails without audits is a pretty reasonable default, and letting users configure Herald to email them incessantly if they really want to see dem emails.

T11335 talks about a hosting system (gitblit) that adds meta-data to refs/meta/ refs, which show up in phabricator as real commits.

eadler added a project: Restricted Project.Sep 15 2016, 6:08 PM
epriestley edited projects, added Diffusion (v3); removed Diffusion.
Unknown Object (User) added a subscriber: Unknown Object (User).Jan 10 2018, 8:21 AM

I would like to add another use case where this would be beneficial.

At my organisation we push committed changes to production as soon as possible. Usually every commit is deployed to the server immediately. Every deployment creates two git notes and removes one at the end. These git notes are filling up the 'Recent Activity' section of our Phabricator installation.

The 'Recent Activity' section is showing three commits to refs/note/* for each deployment, which normally happens after every commit.