Once upon a time, earlier this week...
Bear with me through this bizarre story. Maybe this should be closed as "Working as Intended", but it's probably worth spelling out.
On Haskell.org's Phabricator, we have several repositories we mirror, all from GitHub. We imported these repositories long ago and they are only occasionally updated - we actually maintain our *own* mirrors/repos under haskell-infra on GitHub, and push from upstream mirrors to our own mirrors, and Phabricator observes ours (i.e. we vendor copies of upstream for ourselves).
Recently, we upgraded from a build in late April to this build from the past week. In that time frame, in 2016 Week 25 Diffusion was changed to fetch refs even though they may not be reachable from a visible branch ref.
This can lead to some weird behavior when this is combined with GitHub, and I am not sure what the right solution is.
In our case, here is what happened: when we upgraded, Phabricator went back over these forks of ours -- note, several of them having not been touched in months -- and with it's new, more aggressive ref matching, it began importing commits for every pull request ever made on the imported upstream repositories, no matter if they are open, closed, or anything else.
Many of these people also have accounts on Phabricator. Because Phabricator sees the commit has been made by a user, it will email them notifying it about them. That means before I knew it, I had spammed out at least several hundred emails to over a hundred people, about 6-8 month old commits, or forgotten pull requests they made that were never merged!
Note that this was also compounded mainly by the upgrade: Phabricator will only start sending emails *after* it has marked the repository as imported. And, well, ours was imported at the time!
To fix this, we stopped all the daemons, marked all these observed repositories as Turn off notifications (Feed, Herald and Email), and turned the daemons back on. The objects were still imported (that's totally OK) but without the emails. For the record, this current status is OK with us: none of these repositories that are observed currently need any of those features (but see below)
The main other way I can think this could crop up is a scenario which I think is pretty reasonable. I haven't exactly tested this so lol I am probably wrong, but:
- We use Phabricator and have all our people at $DAYJOB signed up. The people love Phabricator, and their enthusiasm is incredible.
- We observe some repositories from GitHub, because everyone else wants that. Observing is still useful even if the canonical repository is elsewhere: you can run CI through Harbormaster, use Owners, etc.
- Some people submit pull requests, etc. Maybe, people from $DAYJOB submit PRs, too! That's useful if development is ideally centralized on GitHub (perhaps proprietary stuff in your Phabricator - open source releases on GitHub for end-user accessibility)
- Now, anyone who ever makes a PR against the repository while being a user on Phabricator (for $DAYJOB) gets spammed, because Phabricator thinks that an un-reachable commit object it now sees is a 'real' commit.
Why this happens
Okay, so I kind of am lying every time I say 'unreachable', because these commits are totally reachable, but they're never fetched by default clients. That's why this is kind of weird.
What GitHub does is, when you make a PR, it will create a visible -- yet not a default -- reference in the master repository. This ref is named under the pull/ namespace, reflecting the pull requests. This is actually a very convenient feature: it allows you to configure git to fetch those refs by default, and thus you can get a copy of every PR locally. This is very useful for maintainers to develop locally!
Check out this gist, which shows you how to mirror the pull/ ref namespace into your local repository on git fetch.
Suggested fixes.
I can think of two ways to perhaps 'fix' this to prevent the spam, but I don't know how they overall fit in to the Diffusion design or all intended use cases (in order of actual reasonableness/feasibility):
- When Phabricator sees an incoming commit, on an observable repository that is not reachable by any visible reference in that repository, it should avoid emailing the user if their account exists. This seems like the most general solution and seems like a sensible UX. However, this doesn't exactly work in the above case, because there is a reachable reference that Phabricator will identify.
- Disable sending email notifications on observed repositories, or all notifications, by default. I honestly can't think of many major uses where I'd want email in this case - I may want Harbormaster emails, but not like, Diffusion emails for a thing like Github which already emails the hell out of me. But that means I need Herald. However, I'm under the assumption "Turn off Notifications" is exactly what it is for a reason: i.e. it is intended to be a big hammer, and splitting it into a shitload of small options makes things complex and is undesirable.
- Maybe just strongly suggest instead that observed repositories should have notifications turned off?
- When an observed repository points to GitHub, ignore emails or something on pull/ refs. I think this is really the worst one because it's too special cased -- it breaks e.g. in the case someone sets up a mirror for a GitHub repo, fetching all refs (like Phab does) then you import the mirror into Phabricator - and the mirror might not be GitHub at all.
Detailed, real repository for reproduction analysis.
Here is an easy repository where you can see this kind of specialness in action, using a small project we're developing for $DAYJOB:
- Create a new repository.
- Create a new URL for the repository in Observe mode, and set the repository URI to https://github.com/well-typed/binary-serialise-cbor.git
- Let the repository finish importing.
- Look at this Pull request for our repository in all its glory: well-typed/binary-serialise-cbor#83
- Open the 'Commits tab', and click on the single commit that is there.
- That leads you here, which gives you this commit hash: 14212004ecab637496936ce4d46f091b63d17e80. Note that this commit is part of our repository, not their fork.
- Go back to Phabricator, and search for 14212004ecab637496936ce4d46f091b63d17e80 in the search bar.
- Note that this commit object is immediately pulled up and redirects you right to it in Diffusion - it is now pulled into the repository Phabricator monitors, even though it is not part of any visible reference. The commit object simply exists in the repository, but is not pulled by most clients by default.
- As mentioned earlier, Phabricator identifies this as part of the 'special' pull/ namespace:
- Okay, now here is where it gets weird, but it is 'clear' as a corollary that...
- Back in the pull requests for binary-serialise-cbor, there is another important pull request: this one, #71. Crucially, PR #71 was obviously opened before #83 -- and neither are merged.
- Pull request #71 adds a single commit, this one.
- Because it was opened before #83, when the user forked our repository, their repository also has a copy of this unmerged pull request. See here: https://github.com/obadz/binary-serialise-cbor/commit/d306e0d3c55a913bc6c0f782ad9a502f812e99a3, which note is a fork, but containing commits from another fork. Again, #71 was never directly reachable from the parent when #83 was forked. In theory, if the user had created an empty repository, made a clone, and initialized a fresh one, then the commit from #71 would have never made it in (because in the initial clone, your client wouldn't have fetched it). But, obviously, everyone on GitHub hits Fork.