
"Commit's branches contains" condition is not re-evaluated for commits in multiple branches
Closed, ResolvedPublic

Description

We have a bunch of herald rules to trigger audits along the lines of:

  • Differential revision does not exist
  • Commit's branches contains master
  • Repository's projects include all of some-project

If a commit is pushed to another branch and then to master, an audit is not triggered.

Some references to internal stuff in case I need to find it again:

  • commit: rSKYFALL036b0846a9fc5d9f6209c5747574ac5d0db84ab1 lists master and another branch. No audit was triggered.
  • herald/transcript/487720/ shows 'Condition: Commit's branches contains master' failed

Event Timeline

cburroughs raised the priority of this task from to Needs Triage.
cburroughs updated the task description. (Show Details)
cburroughs added a project: Herald.
cburroughs added a subscriber: cburroughs.

This is expected, although not necessarily intuitive or well-explained, and maybe not desirable. "Commit's branches contain" really means "Commit's branches [at the time when Herald executes, shortly after initial discovery] contain".

A major reason it works like this is that if you push a new branch ("temp-devel-bugfix") which is branched from master, every commit in master is now on that branch. We can't realistically evaluate a million commits in response to every branch push, and no one wants us to anyway.

Hypothetically, and ignoring implementation difficulty, can you come up with a rule which does what you want without requiring us to re-evaluate the entire repository's history frequently in common/reasonable cases?

We had a similar problem (tracking feature branches and triggering a herald rule when a commit shows up in master) and solved it by specifically looking for the merge commit in the master/production branch.
Depending on what your Herald rule does, it might be necessary to apply an action to every commit contained in a merge commit. I'm not sure what happens if you trigger an audit on the merge commit; it might include more commits than you want.

Well, since I get to ignore implementation difficulty:

  • A way to say 'only try to evaluate herald rules on commits in these branches' (sort of like the auto-closes configuration).
  • A fancy constraint system that only re-evaluates rules that possibly could be affected by branch moves (so if you have no rules that reference a commit's branches, nothing happens).
  • Use knowledge of history to only re-evaluate commits since the branch point in the "temp-devel-bugfix" case.

How about:

On push (pseudocode)

for (ref in pushed_refs) {
  if (ref.change_type != create && ref.change_type != rewrite) {
    for (commit in commits_between(ref.old_commit, ref.new_commit)) {
      trigger herald, etc. as though the commit is new
    }
  }
}
cburroughs moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.
for (commit in commits_between(ref.old_commit, ref.new_commit)) {

Suppose that you have branches like this:

--o---C---D   <-- master
   \
    \
     \
      A---B   <-- feature

You've pushed all this stuff to the remote already. Now, you merge feature into master, producing this graph, and push it:

--o---C---D---M   <-- master
   \         /
    \       /
     \     /
      A---B

The ref.old_commit for master is D. The ref.new_commit for master is M.

I'm not sure what the commits_between() function is supposed to do. That's basically what I'm asking for a definition of. Some examples:

  • If commits_between(D, M) means nothing, the rule fails to trigger on M, but should.
  • If commits_between(D, M) means M, that's the current behavior, although the rule we use is more complex.
  • If commits_between(D, M) means "all ancestors of M, including M itself, that are not also ancestors of D", that defines the set of commits you likely want: A, B, M.

So let's suppose we go with that rule: commits_between() means "M, and all ancestors of M which are not also ancestors of D".
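Of these, the third definition matches what git calls `D..M`. Here's a minimal sketch of it as a toy model (not Phabricator's actual code), using the merge graph above:

```python
# Toy model of the merge example: each commit maps to its parents.
PARENTS = {
    "o": [], "C": ["o"], "D": ["C"],   # master: o--C--D
    "A": ["o"], "B": ["A"],            # feature: A--B
    "M": ["D", "B"],                   # M merges feature into master
}

def ancestors(commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen

def commits_between(old, new):
    # "new, and all ancestors of new which are not also ancestors of old"
    return ancestors(new) - ancestors(old)

print(sorted(commits_between("D", "M")))  # → ['A', 'B', 'M']
```

In git terms, this set is what `git rev-list D..M` produces.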

That will work great for merges, but it's disastrously bad for cutting new branches and some operations which rewrite branches. If we push a new branch:

--o       <-- master
   \
    \
     \
      X   <-- feature2

...the rule ("X, and all ancestors of X which are not also ancestors of <empty>") matches every commit on master, because there is no old ref. And there's no metadata available to tell us that you branched it from master specifically, so we can't cheat our way out of this by saying "ref.old_commit is really master, not <empty>": we simply can't tell.

The actual rule we use to identify new commits is "ancestors of ref.new_commit which are not ancestors of any old heads". This selects M in the merge case and X in the "new branch" case.
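Under the same toy-graph model as before (my sketch, not Phabricator's code), this rule gives the right answer in both cases:

```python
# Toy model of the rule actually used:
# "ancestors of ref.new_commit which are not ancestors of any old heads".
PARENTS = {
    "o": [], "C": ["o"], "D": ["C"],   # master: o--C--D
    "A": ["o"], "B": ["A"],            # feature: A--B
    "M": ["D", "B"],                   # merge of feature into master
    "X": ["o"],                        # feature2, cut from master's head
}

def ancestors(commit):
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen

def new_commits(new_head, old_heads):
    old = set().union(*map(ancestors, old_heads)) if old_heads else set()
    return ancestors(new_head) - old

# Merge push: old heads were D (master) and B (feature); only M is new.
print(new_commits("M", ["D", "B"]))  # → {'M'}
# New-branch push: master's head was o; only X is new.
print(new_commits("X", ["o"]))       # → {'X'}
```

In git terms, this is roughly `git rev-list ref.new_commit --not <all old heads>`.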

We can choose a different rule, but I can't immediately come up with one that doesn't have bad failure cases. I suspect no simple, intuitive rule exists, and there is a large implementation cost to adopting a complex, unintuitive rule (e.g., see T9657 for a small peek at the complexity of making arc land usually behave in a way that aligns with user expectations -- it faces a similar problem of guessing how the user is thinking about a repository).

I don't think you need to do anything new for new refs.

  • If a pushed ref is new, use your existing algorithm.
  • Otherwise, use "ancestors of ref.new_commit which are not ancestors of ref.old_commit".

Do you see degenerate cases in that? I agree that it doesn't solve 100% of possible use cases, but it certainly solves mine, and I think it solves the OP's.

A branch or tag can be moved from a very old commit to a new commit, and that rule would then re-run on a large amount of repository history.

For example, suppose a user started working on a change about a year ago, then moved on to other things for whatever reason, and is now coming back to it. So they do something like this:

$ git checkout some-really-old-change
$ git rebase -i master
$ git push --force origin some-really-old-change

I'm not endorsing or encouraging this workflow and think it's a bad one, but many users are used to working with forks on GitHub and treat git push as a sort of "Save Changes" operation, and I don't think this workflow is widely regarded as dangerous or unreasonable even though I personally regard it as both (we have a number of requests, including T10691, which ask for better support for similar workflows).

In this case, the ref is not new, but a year of changes are ancestors of the new commit and not ancestors of the old commit.

To some degree this is "the user's fault" because they configured a rule and we're interpreting it in a reasonable way from the cold, mechanical perspective of a repository commit hook, but users who receive thousands of emails they don't want tend not to see it like this, and this rule and its behavior are far from unambiguous to human users. Even if we went to great lengths to make them explicit, I think many users probably would not anticipate the behavior.

I think a user who performs the sequence of changes above is generally thinking that they are introducing no new commits (from their point of view, they aren't "creating" anything new), but there is no way we can infer this from the behavior we actually see.


Another possibly degenerate case is missing ancestors of new branch heads that users did want to run checks on. For example, suppose you have a rule like:

  • For tags named "deploy-*", make sure all the commits conform to something-or-other.

A user pushes a new branch or tag named deploy-99 pointing at the existing master.

They may really want to run on all commits that are ancestors of deploy-99 which are not ancestors of any other deploy-* ref, but the proposed rule will run on nothing since the ref is new and all commits in the repository are already existing commits.

We don't get this right currently either, and it's again clearly "the user's fault" for not thinking things through carefully, but it's more reasonable for a user to misinterpret this rule as it gets more magical.


I suspect the pathway forward may really be something like pursuing T10691 and making Herald rules run per-fork. That is, the only actual use case mentioned here is a branch named temp-devel-bugfix. In my worldview, this branch should never be published. Publishing temporary branches like this already causes many other problems. T10691 is the "give up and implement GitHub" approach to letting users use git push as "Save Changes".

In particular, for anyone hitting this, is your use case anything other than something in the vein of "my users publish temporary/development changes to branches in the remote"? If we implement GitHub and let them dump their garbage into personal forks which are isolated from a Herald perspective, would that resolve the issue?

A branch or tag can be moved from a very old commit to a new commit, and that rule would then re-run on a large amount of repository history.

I don't think you should "worry" about this case -- treat rewritten refs (force pushes) the same as new refs.

Another possibly degenerate case is missing ancestors of new branch heads that users did want to run checks on. For example, suppose you have a rule like:

Agreed -- that's the case that I don't think my proposal solves, but it's in the 1-5% of less common cases compared to the ones that OP and I have.

In particular, for anyone hitting this, is your use case anything other than something in the vein of "my users publish temporary/development changes to branches in the remote"? If we implement GitHub and let them dump their garbage into personal forks which are isolated from a Herald perspective, would that resolve the issue?

My users don't want to deal with multiple remotes, so we'd much prefer a solution where we can have garbage branches (and collaborative feature branches) mixed in with autoclose/release/CI branches in a single repo.


My use-case would be better served by another alternate solution where the rule for "new commits" becomes "ancestors of ref.new_commit which are not ancestors of any old heads for which this herald rule would apply".

This would require a pretty big refactor of how commits are processed, but I think it would actually solve close to 100% of use cases. The point here is that we want our Herald rules to apply once, when a commit first becomes eligible for that rule.
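To make the proposal concrete, here is a hypothetical sketch under a toy-graph model (the graph, the function names, and the `rule_matches_ref` predicate are all mine for illustration, not Phabricator code):

```python
# Per-rule newness: when computing which commits are "new" for a given
# Herald rule, only count old heads whose ref that rule would have matched.
PARENTS = {
    "o": [], "C": ["o"], "D": ["C"],   # master: o--C--D
    "A": ["o"], "B": ["A"],            # temp-devel-bugfix: A--B
    "M": ["D", "B"],                   # merge of the temp branch into master
}

def ancestors(commit):
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen

def new_commits_for_rule(new_head, old_refs, rule_matches_ref):
    relevant = [h for ref, h in old_refs.items() if rule_matches_ref(ref)]
    old = set().union(*map(ancestors, relevant)) if relevant else set()
    return ancestors(new_head) - old

old_refs = {"master": "D", "temp-devel-bugfix": "B"}

# A rule conditioned on "branches contains master" only considers master's
# old head, so A and B count as new when the merge M lands on master:
print(sorted(new_commits_for_rule("M", old_refs, lambda r: r == "master")))
# → ['A', 'B', 'M']
```

Under the current global rule (all old heads count), the same push would select only M, which is the behavior the original report is about.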

We're extremely unlikely to pursue any changes where 99% of the time it works great and 1% of the time a user shows up here wanting to kill us, so arguments in the vein of "it would work for me and other reasonable users who know what they're doing", while likely true, have very weak persuasive power.

I don't think these alternate rules are necessarily bad rules in general; they're just bad rules from an upstream perspective, where we bear most of the support cost of failures of any rule we pick, because they have hugely negative failure cases. We're strongly incentivized to pick an "always safe" rule over a rule which makes it possible to shoot yourself in the foot, even if the foot-shooting rule is better 99% of the time.

The proposed sub-sub-sub new-commits rule still fails in the some-really-old-change case if the user didn't actually make any changes a year ago and just pushed a new branch pointing at an ancestor of master, intending to start work on the changes without actually making progress. Then the update is not a --force push.

And I'm sure I can make up some cases where a branch is --force pushed and users do expect new commits on the branch to be evaluated.

What about my suggestion regarding evaluating commit newness on a per-herald rule basis, based on which refs would match the herald rule?

It's basically the same as the current rule, but would solve (possibly 100%?) of cases for the cost of having to make "which refs" conditions "special" for herald evaluation.

In particular, for anyone hitting this, is your use case anything other than something in the vein of "my users publish temporary/development changes to branches in the remote"? If we implement GitHub and let them dump their garbage into personal forks which are isolated from a Herald perspective, would that resolve the issue?

I have not run into this in any workflow I want to actively encourage.

I've run into this problem with my puppet repositories. A common puppet workflow is to use every branch as a puppet environment.
This way you can test a large change on a subset of production servers before applying it to the production branch itself.

We do a similar thing with our application code to deploy to a staging server or to production. (To deploy, you just push to a special branch.)

I like to see the production state in Phabricator and use Herald to generate email notifications / create audits when specific things show up in production/live. For most cases, though, I'm fine with looking for the merge commit (as we enforce merges into production), and if I really needed them separate I could create a duplicate repo in Phabricator tracking only the production branch.

eadler added a project: Restricted Project.Aug 5 2016, 5:23 PM
epriestley edited projects, added Diffusion (v3); removed Diffusion.
epriestley claimed this task.

I think this is effectively resolved by T13277.

"Autoclose Only" is now "Permanent Refs". Actions previously under the "Autoclose" and "Publish" umbrellas (Herald, Audits, feed, notifications, email, "Fixes ...", closing revisions) now trigger only once a commit is reachable from a permanent ref.

If you have a repository with some permanent refs (like "master") but also expect users to routinely push arbitrary nonsense "tmp-epriestley-fix-123" branches:

  • Configure "Permanent Refs" to be "master", or the set of permanent branches.
  • Commits not reachable from any permanent ref will parse into the web UI, but will not trigger any additional behavior.
  • This should moot the original issue, since Herald won't run until a commit becomes reachable from a permanent ref.