Page MenuHomePhabricator

Instance reports consistent bad repository device versions for clustered, observed Git repositories
Closed, ResolvedPublic

Assigned To
None
Authored By
epriestley
Jul 3 2017, 12:25 PM
Referenced Files
F5030310: Screen Shot 2017-07-03 at 6.09.42 AM.png
Jul 3 2017, 1:10 PM
F5030296: Screen Shot 2017-07-03 at 6.01.28 AM.png
Jul 3 2017, 1:10 PM
F5030300: Screen Shot 2017-07-03 at 6.05.02 AM.png
Jul 3 2017, 1:10 PM
F5030308: Screen Shot 2017-07-03 at 6.09.31 AM.png
Jul 3 2017, 1:10 PM
F5030304: Screen Shot 2017-07-03 at 6.08.41 AM.png
Jul 3 2017, 1:10 PM
F5030267: Screen Shot 2017-07-03 at 5.32.11 AM.png
Jul 3 2017, 12:34 PM
F5030256: Screen Shot 2017-07-03 at 5.25.30 AM.png
Jul 3 2017, 12:34 PM
F5030269: Screen Shot 2017-07-03 at 5.32.34 AM.png
Jul 3 2017, 12:34 PM

Description

An instance is reporting that all (clustered, observed, Git) repositories are coming online with bad device versions, leading to this error:

Error updating working copy: Repository "REPONAME" exists on more than one device, but no device has any repository version information. Phabricator can not guess which copy of the existing data is authoritative. Promote a device or see "Ambigous Leaders" in the documentation.

An accompanying screenshot showed a repository launching with two version records on different devices, with - (no version) as the version for each. The expected value is 0.

This error can be manually resolved with bin/repository thaw but that's a huge pain and this is obviously not intended.

Additionally, these repositories were created via the API.

Event Timeline

I'm attempting to reproduce this from the web UI first. I created a new repository, and get 0 versions as expected:

Screen Shot 2017-07-03 at 5.25.30 AM.png (988×1 px, 160 KB)

Then I added an observed URI:

Screen Shot 2017-07-03 at 5.32.11 AM.png (958×1 px, 147 KB)

However, I still see 0 versions:

Screen Shot 2017-07-03 at 5.32.34 AM.png (908×1 px, 160 KB)

(For good measure, I then activated the repository, but this didn't change anything.)

I'll try the API next.

I used this parameter bundle to create a repository:

epriestley@orbital ~/dev/phabricator $ cat cmd1.json 
{
  "transactions": [
    {
      "type": "name",
      "value": "Test X2"
    },
    {
      "type": "vcs",
      "value": "git"
    }
  ]
}
epriestley@orbital ~/dev/phabricator $ cat cmd1.json | arc call-conduit --conduit-uri http://local.phacility.com/ diffusion.repository.edit | json_pp
{
   "errorMessage" : null,
   "response" : {
      "transactions" : [
         {
            "phid" : "PHID-XACT-REPO-ijeciwl2sjmh7ij"
         },
         {
            "phid" : "PHID-XACT-REPO-uctdf3p62y4ztmz"
         },
         {
            "phid" : "PHID-XACT-REPO-kqfrah6cbmy2ne3"
         }
      ],
      "object" : {
         "phid" : "PHID-REPO-brgf4dcaoyemhgvyg43m",
         "id" : 73
      }
   },
   "error" : null
}

This created a repository with the right versions:

Screen Shot 2017-07-03 at 6.01.28 AM.png (1×2 px, 380 KB)

Then I added the remote URI:

epriestley@orbital ~/dev/phabricator $ cat cmd2.json 
{
  "transactions": [
    {
      "type": "repository",
      "value": "PHID-REPO-brgf4dcaoyemhgvyg43m"
    },
    {
      "type": "uri",
      "value": "https://github.com/epriestley/poems.git"
    },
    {
      "type": "io",
      "value": "observe"
    }
  ]
}epriestley@orbital ~/dev/phabricator $ cat cmd2.json | arc call-conduit --conduit-uri http://local.phacility.com/ diffusion.uri.edit | json_pp
{
   "response" : {
      "object" : {
         "id" : 779,
         "phid" : "PHID-RURI-xcoppuotb766cum6sb7k"
      },
      "transactions" : [
         {
            "phid" : "PHID-XACT-RURI-tii5e33yslkquru"
         },
         {
            "phid" : "PHID-XACT-RURI-qxgi6ehgmkodmkk"
         },
         {
            "phid" : "PHID-XACT-RURI-o6xgbnmf45nha3e"
         },
         {
            "phid" : "PHID-XACT-RURI-2fvgqajnrmg3ucu"
         }
      ]
   },
   "error" : null,
   "errorMessage" : null
}

The versions were still good:

Screen Shot 2017-07-03 at 6.05.02 AM.png (1×2 px, 380 KB)

Finally, I activated the repository for good measure:

epriestley@orbital ~/dev/phabricator $ cat cmd3.json 
{
  "transactions": [
    {
      "type": "status",
      "value": "active"
    }
  ],
  "objectIdentifier": "PHID-REPO-brgf4dcaoyemhgvyg43m"
}
epriestley@orbital ~/dev/phabricator $ cat cmd3.json | arc call-conduit --conduit-uri http://local.phacility.com/ diffusion.repository.edit | json_pp
{
   "response" : {
      "transactions" : [
         {
            "phid" : "PHID-XACT-REPO-s7nyzslykq7hfxm"
         }
      ],
      "object" : {
         "phid" : "PHID-REPO-brgf4dcaoyemhgvyg43m",
         "id" : "73"
      }
   },
   "error" : null,
   "errorMessage" : null
}

Versions still look good:

Screen Shot 2017-07-03 at 6.08.41 AM.png (1×2 px, 380 KB)

For completeness, I ran the daemons to actually make the repository work. It cloned and versioned properly:

Screen Shot 2017-07-03 at 6.09.31 AM.png (1×2 px, 361 KB)

Screen Shot 2017-07-03 at 6.09.42 AM.png (1×2 px, 376 KB)

So I'm not immediately able to reproduce this. Some possible questions:

  • Anything in the process above jump out at you as different from what you're doing?
  • If you go through the process manually with the web UI, are the versions OK after initial creation? After adding an observed URI?
  • If the web UI process works, where do versions get lost if you go through the CLI process step by step?

I can also just improve logging around replication which should point us in the right direction, but I need to look at this and figure out how the existing push logging, pull logging, and replication feedback logging systems will work with it.

If you go through the process manually with the web UI, are the versions OK after initial creation? After adding an observed URI?

No, after a creation in the UI, I get versions on both instances of "-". Does the version number initialization happen in the workers or synchronously in the request?

Synchronously in the request. I'll look at adding logging.

See PHI14 for priority.

The existing logs we could try to repurpose here are: the push event log, the pull event log, and the actual push log. The two former logs are more like connection logs ("someone tried to do something") and the actual push log is a log of writes ("someone actually pushed something").

None of these logs are especially suitable for this data. PushLog is probably the closest fit, but it feeds into Herald and I think we're just getting into trouble if we try to use it for this since having Herald react to version synchronization is not meaningful.

Looking at the actual code, we'd also potentially need a very large amount of diagnostic logging here since there are a lot of branches like this:

if (!$some_condition) {
  return without synchronizing;
}

One general thought here is that it might be nice to build a --trace mode for the web UI. It semi-exists today, but is a pain to access:

  • Enable debug.stop-on-redirect.
  • Enable DarkConsole.
  • Submit the request.
  • Pull stuff out of the various DarkConsole tabs, mostly Services.

If it was easier (like: enable debug mode, then press Control + T on any page to enable it) it (plus a little bit of new trace logging) might be sufficient to get to the bottom of this kind of problem. Generally, I'd like to avoid adding a lot of "fishing" logging, where the whole codebase looks like this:

if (!$some_condition) {
  log("a totally normal thing happened");
  return;
}

...since this stuff tends to be almost useless 99.99% of the time and makes maintenance harder on the balance.


I also wonder if we should push harder for access to customer environments for this stuff, since this is the kind of thing that probably takes 15 minutes to resolve with CLI access and hours per round of fishing for who-knows how many rounds remotely.