
Diffusion commit history displays inconsistent results after migrating repo off of cluster service
Closed, ResolvedPublic

Assigned To
Authored By
timhirsh
Jun 30 2016, 11:16 PM

Description

I was testing out cluster repositories in development: migrating one repository onto, then back off of the cluster service. The steps to reproduce are:

  • Register an existing host as a cluster device: repo001.mycompany.net
  • Create service: repos001.mycompany.net and add bindings for repo001.mycompany.net:80 and repo001.mycompany.net:22
  • Migrate a repository onto the cluster service:
$ ./bin/repository clusterize rTEST --service repos001.mycompany.net
  • Wait for everything to synchronize
  • Expand the cluster by registering a second repository host device: repo002.mycompany.net
  • Again, wait for everything to synchronize. Now migrate the repository back off the service:
$ ./bin/repository clusterize rTEST --remove-service
  • Push a few new commits to rTEST. At this point, repo001.mycompany.net will be 2 commits ahead of repo002.mycompany.net.
  • Navigate to the Diffusion repository history and refresh the page a few times.

Expected: new commits are displayed (note the latest commit is from 5:12pm):

Screen Shot 2016-06-30 at 5.35.04 PM.png (608×1 px, 95 KB)

Then refresh. Roughly half the time, the new commits are missing (note the latest commit is from 5:09pm):

Screen Shot 2016-06-30 at 5.35.33 PM.png (542×1 px, 83 KB)

It looks like it's still using the cluster service call instead of running git log directly on disk. I also tried disabling the service bindings for repo002.mycompany.net thinking that might fix it, but after doing so I was still able to reproduce. Also tried in an incognito window to rule out browser caching/oddness.

I'm mainly just looking for a way to roll back during the migration in case any configuration issues are encountered, and maybe there's a better way to go about it than the method I described that won't trigger this edge case. Thanks!

phabricator cadac75b82bbed18d52c3ee7ba6d396bff69c009 (Fri, Jun 24)
arcanist 18b27b03fa3d9f2439bf998c5fa2e4f5bd93db16 (Sat, Jun 18)
phutil 8aa8612a094b4dafcf5c461b746a613a1e229b86 (Sat, Jun 18)

Event Timeline

Just to rule this out, are repo001 and repo002 also web nodes which may receive traffic directly? That is, could this be explained this way?

  • Half the time the load balancer sends you to repo002.
  • Whichever machine you end up on runs git log.
  • Net result: git log looks like clustering?

Otherwise, you can check these things:

  • Is almanacServicePHID in phabricator_repository.repository for the relevant repository null? That's the expected effect of --remove-service.
  • In DarkConsole, what's actually running?
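
As a rough illustration of the first check, here is a minimal Python sketch. It uses an in-memory SQLite table as a stand-in for the real MySQL `phabricator_repository.repository` table, so the schema is cut down to two columns; `almanacServicePHID` comes from the check above, while the `callsign` lookup column, the `attached_service` helper, and the sample rows are assumptions made for the example.

```python
import sqlite3


def attached_service(conn, callsign):
    """Return the almanacServicePHID for a repository, or None if it is detached."""
    row = conn.execute(
        "SELECT almanacServicePHID FROM repository WHERE callsign = ?",
        (callsign,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no repository with callsign {callsign!r}")
    return row[0]


# Stand-in data only: a real install keeps this table in MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repository (callsign TEXT, almanacServicePHID TEXT)")
conn.execute("INSERT INTO repository VALUES ('TEST', NULL)")
conn.execute("INSERT INTO repository VALUES ('OTHER', 'PHID-placeholder')")

print(attached_service(conn, "TEST"))  # None: no cluster service attached
```

A NULL (Python `None`) value is what you'd expect to see after `--remove-service`; any other value would suggest the repository is still bound to a service.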

Here are some examples of the expected service calls for /diffusion/XXX/history/master/. First, this is with an un-clusterized repository. Note that it does literally run git log:

uncluster.png (1×1 px, 527 KB)

Here's the same repository after clusterizing it:

cluster.png (1×1 px, 546 KB)

It does a bunch of Almanac lookups (first circle), then makes a Conduit service call (second circle).

(When I toggle between cluster/no-cluster locally with --service / --remove-service, everything appears to be working properly, although I didn't precisely reproduce your setup.)

You can also check Request -> Machine in DarkConsole to see which host served things, if the load balancer / web node thing might be part of the explanation.

No, there's no load balancer. In DarkConsole under Request -> Machine I'm seeing the same IP for both cases. But repo002 is configured to share the db from repo001.

DarkConsole gave me more info to help track it down though. Both daemons were updating repository_refcursor, so the query:

SELECT refNameHash, refType, commitIdentifier, isClosed FROM `repository_refcursor` WHERE repositoryPHID = 'PHID-REPO-k5tlp52sql4kz4jf4vim' AND refNameHash IN ('fCKLrzDyaVYn');

yielded different results depending on which daemon was the last to run. After disabling the daemon on repo002 I'm consistently seeing the correct commits. Thanks for the help. Knowing this, we should be fine for the initial migration. Is it possible this is still an issue in a cluster setup where one device is far out of sync from the others?
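
The "last daemon to run wins" behavior described above can be modeled with a toy Python sketch. This is not Phabricator's code: the dictionary stands in for the shared `repository_refcursor` table, and the function names and the `old-commit` / `new-commit` labels are made up for illustration.

```python
# Toy model: both hosts share one database, so there is a single cursor row per ref.
ref_cursors = {}


def daemon_update(working_copy_head, repo="rTEST", ref="master"):
    """A daemon overwrites the cursor with whatever its local working copy reports."""
    ref_cursors[(repo, ref)] = working_copy_head


def cursor_head(repo="rTEST", ref="master"):
    """What a reader of the shared table sees at this moment."""
    return ref_cursors[(repo, ref)]


repo001_head = "new-commit"  # this host received the pushes
repo002_head = "old-commit"  # stale working copy left behind on the second host

daemon_update(repo001_head)
daemon_update(repo002_head)
print(cursor_head())  # old-commit

daemon_update(repo001_head)
print(cursor_head())  # new-commit
```

Whichever daemon wrote last decides what the reader sees, which matches the roughly 50/50 flip between refreshes.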

epriestley claimed this task.

Ah! That makes sense. Originally we were more strict about which repositories we would update (which would have caught this) but this created some worse/more-confusing effects in the "enroll" direction (see T10940).

In a cluster configuration, the daemons synchronize before updating ref cursors, so this shouldn't be an issue.

Moving the working copy aside on repo002 (for example, mv /var/repo/TEST /var/repo/TEST.disabled) should also fix things in this situation without requiring the daemon to be completely disabled.

(I also added you to Community, granting you tremendous new power to tidy up around here.)