Page MenuHomePhabricator

Phabricator handles commits which are published and later deleted poorly
Closed, ResolvedPublic

Description

We believe that the following happened (though we haven't tried to repro so we're not 100% sure):

  • A commit was pushed to a remote branch
  • This requested an audit
  • Some time later, someone went through and deleted many remote branches as part of a clean-up effort
  • The auditor was working through his queue and saw an old (incomplete) audit corresponding to a commit on one of the deleted branches
  • Clicking on the audit shows:
Command failed with error #128!
COMMAND
git log -n 1 --format='%P' '08d5e6c2a164f10679fdf412dc187c50f2acb289'

STDOUT
(empty)

STDERR
fatal: bad object 08d5e6c2a164f10679fdf412dc187c50f2acb289

The problem we encountered is two-fold:

  1. It wasn't obvious (since the clean-up happened a while ago) that the reason for this error was that the commit was on a deleted branch. We wasted quite some time figuring out what was going on.
  2. There is no way to delete, close, or resign from this audit via the Phabricator UI, so it will forever be in the auditor's list of "open audits."

Potential solution ideas:

  • Improve the error message and provide a button to close the audit
  • Have a daemon that processes deleted commits/branches and deletes associated (incomplete?) audits
  • Give us a script we can use that deletes a branch and cleans up its audits, or that given the name of a just-deleted branch cleans up its audits, or similar

Related Objects

Event Timeline

jhurwitz raised the priority of this task from to Needs Triage.
jhurwitz updated the task description. (Show Details)
jhurwitz added projects: Restricted Project, Audit.
jhurwitz added subscribers: jhurwitz, angie, aj, rameshdharan.
jhurwitz updated the task description. (Show Details)Jul 30 2015, 11:57 PM
jhurwitz updated the task description. (Show Details)Jul 30 2015, 11:59 PM
jhurwitz updated the task description. (Show Details)
jhurwitz updated the task description. (Show Details)

Improve the error message and provide a button to close the audit

This seems reasonable.

Have a daemon that processes deleted commits/branches and deletes associated (incomplete?) audits

I think we probably don't want to do this. If someone accidentally deletes master and no one realizes for a little while, it would be bad to potentially destroy years worth of history, state and discussion contained in audits. Even just auto-closing them (without actually removing any comments/discussion) seems a bit bad.

Give us a script we can use that deletes a branch and cleans up its audits, or that given the name of a just-deleted branch cleans up its audits, or similar

You can use bin/audit delete to do this, although you'll have to massage things a bit. A working approach might be:

  • Use bin/audit delete --dry-run --repositories X to get a list of open audits for affected repositories.
  • Separately, remove all the audits from that list which are for commits which still exist in the remote, using some small script you write.
  • You can then use bin/audit delete --ids ... to close the remaining obsolete audits.

Have a daemon that processes deleted commits/branches and deletes associated (incomplete?) audits

I think we probably don't want to do this. If someone accidentally deletes master and no one realizes for a little while, it would be bad to potentially destroy years worth of history, state and discussion contained in audits. Even just auto-closing them (without actually removing any comments/discussion) seems a bit bad.

What about somehow ghosting them so that they don't show up in the UI, without actually deleting the data?

jhurwitz edited subscribers, added: ramesh_dharan; removed: rameshdharan.Aug 4 2015, 1:14 AM
ramesh_dharan added a comment.EditedAug 4 2015, 1:34 AM

You can use bin/audit delete to do this, although you'll have to massage things a bit.

This won't really address the issue for us, since with our setup running bin/audit requires superadmin access, i.e. even as the admin of a particular project I don't have access to ssh to the machine actually hosting our Phabricator installation to run bin/audit.

Improve the error message and provide a button to close the audit

This seems reasonable.

Just wanted to chime in and say that I'd be perfectly happy with this fix for now.

Any updates on this? I've still got these annoying audits in my queue and they continue to frustrate me.

I think my preference would be the "Improve the error message and provide a button to close the audit" option. @epriestley, how long would this take? Assuming it's small, I think we'd want to prioritize this.

This is probably about an hour of work, although it may be a bit involved to distinguish between "the commit no longer exists in the repository" and other errors. If you want to prioritize this I can give you an hour-ish fix or a better estimate if it looks more complicated once I get into the code.

jhurwitz moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Oct 1 2015, 9:06 PM

D14227 should fix our mishandling of these commits:

As for a button to get rid of any audits, if you've been asked to audit you can already accept the commit -- if a concern has been raised, I think this is just T2393. Or is there another state I'm missing?

angie moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Oct 2 2015, 11:22 PM

(I might not include this when I cut stable for this since the controller is a little bit of a mess, it's possible I missed something, and I don't expect it to be vetted much between 5PM Friday and 5AM Saturday.)

chad moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 2 2016, 4:46 AM
chad moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 2 2016, 4:50 AM
epriestley moved this task from Backlog to Preflight on the Prioritized board.May 25 2016, 9:20 PM

A sub-problem here is that if daemons discover commits but those commits are deleted before they can be processed, the tasks fail forever. This is normally unusual (git won't GC the commits for a few weeks, which often gives the daemons enough time to process them) but cases like fiddling with tracking, re-cloning a repository, everyone going on vacation for a couple weeks, or clustering an observed repository can exacerbate them.

I believe we can flag these commits as deleted, and revive them from the dead if they are later rediscovered fairly easily, to at least clear the daemon issues.

aj removed a subscriber: aj.May 25 2016, 9:27 PM
scode added a project: Restricted Project.May 25 2016, 9:36 PM
scode added a subscriber: scode.
epriestley renamed this task from Deleting a branch does not delete associated audits to Phabricator handles commits which are published and later deleted poorly.May 25 2016, 9:38 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jun 1 2016, 10:37 PM

To revive deleted commits when they are re-pushed, I think it is sufficient to ignore deleted commits when filling the discovery cache (PhabricatorRepositoryDiscoveryEngine->fillCommitCache()) and allow recordCommit() to detect existing commits and revive them.

This should work even if you delete a branch in the remote, add new commits to it locally, then push the branch back to the remote, so revived commits arrive underneath new commits.

I think this is straightforward, and using a free bit in the importStatus bitfield should be OK for storing the "deleted" flag for this purpose.


Marking commits as deleted is more complicated. Some strategies I can come up with:

Mark in Daemons: We can mark commits as deleted in the daemons. When a daemon tries to work on a commit and finds it does not exist, it marks the commit deleted and stops working on the task.

I think this approach has a lot of downsides:

  • Daemons perform a lot of different repository operations which might fail because of a deleted commit, and not all of them can easily or unambiguously detect "commit does not exist" vs other types of failure, so it would be somewhat hard to detect when we need to run the "mark deleted" code.
  • We could make the daemons all run an explicit "does this commit exist?" check up front to fix this, but this is relatively expensive.
  • This causes potential race conditions and contention between the daemons (marking commits deleted) and the discovery process (marking commits revived).
  • This can leave us with sparse deleted commit information: deleted commits will potentially be marked in random order, and newer deleted commits may be marked deleted before older ones. If we use the approach to revival described above, revival will stop when it encounters the first commit which is not deleted, but there may be more commits beyond it which still need to be revived.

Mark in Discovery: We can mark commits as deleted in discovery:

  • The refs step can record a list of previously known branch/tag/ref heads. When a ref is updated or deleted, the old value of the ref is added to the list.
  • The discovery step can consume this list, and identify all ancestors of the head which are known and no longer reachable.

We record commit parent relationships and can determine "known" fairly easily, but testing "no longer reachable" is tricky. A naive approach would be to run git branch --contains or similar on each commit, but this is likely to be painfully slow in large repositories (see T9898) and isn't correct anyway since ref or tag reachability should be sufficient.

I think the best way to do this is probably this mess:

  • Use PhabricatorGitGraphStream to pull a stream of every reachable commit.
  • Follow the database edges backward and test them against the stream.

This should perform well until a repository has too many commits to fit the hashes into memory.

We can use a similar strategy to implement bin/repository mark-everything-that-has-been-deleted-as-deleted.


One other thought is that there's a sort of race condition here where a commit is deleted and revived before it is processed. This could leave two sets of daemons in queue and possibly, e.g., evaluate Herald rules twice. We can mitigate this at least partially by making "natural" tasks more proactive about checking if they've already been completed, which is easy.


We should also probably resolve T6878 here.


This marginally increases our overhead in the common case, where a valid branch is amended with a small number of valid commits. I think there isn't much we can do here. This will boil down to approximately one extra git log --all, which is modest. Possibly, refs can be smart enough to avoid recording a ref on the old refs list when it's a straight amend. This largely just changes the cost to a similar git merge-base, although it's probably a little better in terms of complexity.


Overall, I plan to:

  • Resolve T6878 (fetch and discover all refs, not just branches -- this will become a bigger issue after we start marking things deleted).
  • Add a DELETED flag.
  • Provide UI messaging for the DELETED flag ("This commit has been deleted...").
  • Ignore deleted commits when filling the commit cache; revive deleted commits when re-recording them.
  • Add a mechanism for using GitGraphStream to efficiently test for the bulk existence of commits.
  • Add a bin/repository ... script for synchronizing commit deletion state in an existing repository.
  • Make "natural" daemons more proactive about aborting processing of already-processed commits.
  • Make all daemons abort processing of deleted commits.
  • Add an "old refs" list to refs which is populated with removed and rewritten refs.
  • Make discover consume the old refs and test their ancestors for deletion.

I'm vaguely suspicious that I may have missed some weird stuff here since this pipeline is complicated, but I'm going to land this stuff now and just plan to hold the release a bit if it feels shaky tomorrow.

epriestley moved this task from The Queue to Paused on the Prioritized board.Jun 16 2016, 6:49 PM

I've pushed this here, and enabled dangerous changes on rGITTEST if anyone wants to try anything.

I pushed rGITTEST450fc0d1b09b to doomed-branch, then deleted the branch, and things seem to have worked correctly: the commit is marked unreachable/deleted, and the repository summary reverted to 87 commits with the older commit as the marquee commit.

I also pushed rGITTEST38447a6c to the tag watermelon only (it is unreachable from any branch head) and it was properly discovered.

This is a relatively high-complexity change, but these operations don't seem to have left us with any spinning daemons or other horrible side effects.


Because this is a relatively complex change I'd recommend waiting a little bit before anyone puts this into production.

When you do, you can use bin/repository mark-reachable <repository> to fix the reachability flags on an existing repository. If you aren't running into problems currently you may not need to bother, but this will let any daemons that are stuck trying to operate on unreachable commits fail gracefully.

Going forward, Phabricator should automatically detect when commits become unreachable (or are revived by being pushed as ancestors of some new head).

Additionally, Phabricator now tracks all tags (except those starting with phabricator/) and all freeform refs. We may refine these rules in the future.

(One other small change I'll probably make is marking unreachable commit handles as "closed", so they get a strikethru style.)

Beyond that, I don't have any further worked planned here unless issues crop up.


GoalRefTimeNotes
Discover EverythingD16129, D161341 HourDiscover commits only reachable from tags and refs.
Unreachable BehaviorsD16130, D161311 HourMake UI and daemons behave properly in the presence of unreachable commits.
Detect UnreachabilityD16132, D161331 HourDetect when commits become unreachable.
Subtotal3 Hours
Cumulative Total3 Hours
scode added a comment.Jun 16 2016, 7:35 PM

If the mark-reachable command is used, is it subject to races in the sense that in-flight legit newly importing commits may be marked unreachable, or does it just trigger the process of going back and re-evaluating true reachability of all commits?

It basically runs git log --all, then checks every commit in the database against that list and marks the missing ones as unreachable.

It doesn't examine or interact with import status or the daemon pipeline (at least, directly).

Importing commits may be marked unreachable if all heads have already been deleted. That is, if a branch was pushed and then deleted, tasks trying to import those commits may be left in the queue. mark-reachable will identify those commits as unreachable and the tasks will fail permanently the next time they attempt to execute.

Not sure if that answers your question or not, let me know if I can clarify.

scode added a comment.Jun 16 2016, 8:11 PM

Yeah. I just wanted to make sure whether mark-reachable was safe to run without fear of accidentally marking something unreachable that isn't (i.e., if the command were to imply "mark everything in the import queue as unreachable no matter what"). Sorry it was probably a bit of a lazy question since a more careful re-reading of the rest of the ticket would probably have answered it for me.

Thanks!

Updated our Phabricator install just now. One issue I discovered is that we are having a bunch of audits triggered for commits that are newly discovered. We have Diffusion configured to track only the master branch, but now Diffusion is importing commits from all branches, it seems.

We have Diffusion configured to track only the master branch, but now Diffusion is importing commits from all branches, it seems.

The expectation is that these commits will be imported:

  • all ancestors of tracked branches;
  • all ancestors of tags not beginning phabricator/; and
  • all ancestors of other refs.

You can use bin/repository discover --verbose <repository> to see exactly what we're tracking. For example:

epriestley@orbital ~/dev/phabricator $ ./bin/repository discover --verbose R55
Discovering "R55"...
Discovering commits in repository "R55".
Examining ref "master", at "a940a9a297039d7db0fa7dae6312c3a7c7e4abc2".
Skipping, HEAD is known.
Examining ref "circle-test-1", at "4ece35174bcfd1454440a3ddba03ecbf78a5e22d".
Skipping, HEAD is known.
Examining ref "branch-and-tag", at "9537e472fefbac22273bcd4d45c5b8bbbd0b2baa".
Skipping, HEAD is known.
Examining ref "lfs-self-track", at "40378a2807bd80876fb9c13b5546159ba8e264ee".
Skipping, HEAD is known.
Examining ref "featurex", at "fd88d4c54f86bdf628ec1b88f6ef346542eb4e64".
Skipping, HEAD is known.
Examining ref "notauto", at "e25dc9c7c96031623e6723005aff3e22b1dfab11".
Skipping, HEAD is known.
Examining ref "blerps", at "118c70cf3f7b0e157c5c987550d3cd2beb2e0671".
Skipping, HEAD is known.
Examining ref "feature2", at "b5ee37cce43970dcc34cfe9dc805fe01e501e8cc".
Skipping, HEAD is known.
Examining ref "feature", at "6cf5f6d0c8c06c4c73b8783666d9b3ecce138244".
Skipping, HEAD is known.
Examining ref "quack2", at "5a9c51e86615f6e1097b2a4a73ef0fe75981c1dd".
Skipping, HEAD is known.
Examining ref "quackquack", at "b74896aa087cd643118604099bb2c10958722aaf".
Skipping, HEAD is known.
Examining ref "blarp", at "ad17c19ae1d71538bdb3d24b5485fd2290d10cd0".
Skipping, HEAD is known.
Examining ref "phabricator/diff/403", at "4ece35174bcfd1454440a3ddba03ecbf78a5e22d".
Skipping, ref is untracked.
Examining ref "phabricator/diff/404", at "4ece35174bcfd1454440a3ddba03ecbf78a5e22d".
Skipping, ref is untracked.
Examining ref "phabricator/diff/402", at "1ef0e21e517a29d5b09ec9ecd89cfda19e95d5a7".
Skipping, ref is untracked.
Examining ref "phabricator/diff/401", at "02a0f69c1ffe3dfa13fe678239f0a8a42d8a0ccc".
Skipping, ref is untracked.
Examining ref "phabricator/diff/400", at "f97829f2df15202189eebf380e25f48d78a2c458".
Skipping, ref is untracked.
Examining ref "refs/pull/2/merge", at "9c56735e3bd99d2554486c7d285eebc381f6bbf1".
Skipping, HEAD is known.
Examining ref "phabricator/base/401", at "4dcd4c492b31d82f0fec6f447cf3719b9cc079e0".
Skipping, ref is untracked.
Examining ref "phabricator/base/403", at "4dcd4c492b31d82f0fec6f447cf3719b9cc079e0".
Skipping, ref is untracked.
Examining ref "phabricator/base/402", at "4dcd4c492b31d82f0fec6f447cf3719b9cc079e0".
Skipping, ref is untracked.
Examining ref "phabricator/base/400", at "4dcd4c492b31d82f0fec6f447cf3719b9cc079e0".
Skipping, ref is untracked.
Examining ref "phabricator/base/404", at "4dcd4c492b31d82f0fec6f447cf3719b9cc079e0".
Skipping, ref is untracked.
Examining ref "phabricator/diff/54321", at "8c133a4e08fc600753042805ba5662aaed3b27f1".
Skipping, ref is untracked.
Examining ref "phabricator/diff/262", at "8c133a4e08fc600753042805ba5662aaed3b27f1".
Skipping, ref is untracked.
Examining ref "phabricator/diffs/54321", at "8c133a4e08fc600753042805ba5662aaed3b27f1".
Skipping, ref is untracked.
Examining ref "tag1", at "458f403ae28fe6eabd380912e1a000a7cdcbcf4c".
Skipping, HEAD is known.
Examining ref "phabricator/diff/261", at "d4b2710c051061eabbb40adb5c73d3bf08df2e12".
Skipping, ref is untracked.
Examining ref "phabricator/diff/260", at "94bc447a5492746a9811648997ba04e63a6f4fdd".
Skipping, ref is untracked.
Examining ref "phabricator/diff/241", at "121862a764ed6bb36a031f5cf027256192d5c8ea".
Skipping, ref is untracked.
Examining ref "phabricator/diff/240", at "9ec71ab2c21aa88bc775a6ffefc5d051ce68b534".
Skipping, ref is untracked.
Examining ref "phabricator/diff/238", at "41ba4f16074a81b58e328afbdcd462d5bd8bdf4f".
Skipping, ref is untracked.
Examining ref "refs/secret/spaghetti", at "4238a0de5ef9afc262350ca87326c94afe7d2674".
Skipping, HEAD is known.
Examining ref "derp3", at "2140215815bc3d68e67f240fe43e182852a1fcc4".
Skipping, HEAD is known.
Examining ref "refs/pull/3/head", at "118c70cf3f7b0e157c5c987550d3cd2beb2e0671".
Skipping, HEAD is known.
Examining ref "refs/pull/2/head", at "6cf5f6d0c8c06c4c73b8783666d9b3ecce138244".
Skipping, HEAD is known.
Examining ref "quack", at "c551ba8a245695daf5e9a4d139beda575b801f76".
Skipping, HEAD is known.
Examining ref "qwer", at "683d658264a5f10466671c2eeda16edb652e36a7".
Skipping, HEAD is known.
Examining ref "asdf", at "683d658264a5f10466671c2eeda16edb652e36a7".
Skipping, HEAD is known.
Done.

OK, running ./bin/repository discover --verbose rX gives me a bunch of output including this:

Examining ref "refs/remotes/origin/XYZ", at "441ca44686e5afb9054f3e25b19b70e3aefd21f7".
Skipping, HEAD is known.

Running git branch --all --contains 7a048137e59cad0f1ea1feb4c6daaa45689e91da in the working directory for this repository on our repository host (7a048137e59cad0f1ea1feb4c6daaa45689e91da is one of the commits that triggered an audit`) produces the following output:

git branch --all --contains 7a048137e59cad0f1ea1feb4c6daaa45689e91da
  remotes/origin/XYZ

Interestingly though, this branch no longer exists in the remote repository. Running git branch --all from my local machine and I don't see this branch.

Hrrm -- is the repository a non-bare working copy (e.g., has files and folders in it instead of just config/, hooks/, info/, etc)?

Is there a lack of a git remote prune origin or similar?

Also: FWIW, examining every ref like that is a complete no-go for our situation (tens of thousands of refs, huge repository, hundreds of thousands of commits in the average ref). It would work as a one-off that just runs for days, but not for any kind of recurring/incremental process.

We currently have a small patch to force the fetching to only consider the refs we care about though, so hopefully we shouldn't have an actual problem with this. But only because we're filtering it out.

(We really want Phab to only ever consider the master branch no matter what, but we're limited to the autoclose functionality right now and that doesn't affect importing or fetch scope.)

Hrrm -- is the repository a non-bare working copy (e.g., has files and folders in it instead of just config/, hooks/, info/, etc)?

Nope, it is definitely bare.

We fetch bare working copies with git fetch origin +refs/*:refs/* --prune.

We fetch non-bare working copies with git fetch all --prune.

I would expect --prune to remove the branch in both cases, but I may be misunderstanding the behavior of --prune.

I wonder if the actual authoritative repository has a refs/remotes/origin/XYZ ref which we're mirroring?

We can ignore refs/remotes/ refs for bare repositories to "fix" this, but I'd like to understand what's going on. I'm not getting any refs/remotes/... refs in, e.g., local test repositories I'm observing from GitHub.

I think checking if git ls-remote origin in the working copy lists a refs/remotes/origin/XYZ ref in the authoritative remote is the easiest way to check, although I'm not 100% sure that refs/remotes/ isn't magic with respect to ls-remote.

Ah interesting. From our GitLab host, I see that git branch --all shows a whole bunch of local branches and only a single branch of the form remotes/origin/url_shortener.

Ah interesting. From our GitLab host, I see that git branch --all shows a whole bunch of local branches and only a single branch of the form remotes/origin/url_shortener.

Wierd... I can't find any reference to this branch in refs/heads or refs/remotes.

okay so basically ghosts?

iiam

I think D16136 will "fix" the problem and I can't come up with any problems it creates, although it would be nice to figure out where the mystery ref is coming from.

OK, so I just manually removed the branch from info/refs and packed-refs and I think that maybe that worked?

FWIW, examining every ref like that is a complete no-go for our situation

In your case, I expect that keeping your "master only" patch will be fine, and running mark-reachable will clean up any clutter left over from earlier adventures.

I generally expect this approach to hold up in tens-of-thousands of refs / millions-of-commits situations since I wasn't able to reproduce any degenerate behavior with a 10K+-ref repository in T10368, but keeping the master restriction should also be fine since this doesn't seem to hold in your environment.

(The largest upstream repository on this host, rSTAGING, has 4700 refs and costs about 500ms to repository discover vs about 200ms for rP; I'd expect this cost to be linear.)

epriestley closed this task as Resolved.Jul 2 2016, 3:24 PM
epriestley claimed this task.

This appears to be stable, and is now accounted for.