Page MenuHomePhabricator

Add an intracluster synchronization log for cluster repositories
ClosedPublic

Authored by epriestley on Nov 6 2018, 10:45 PM.
Tags
None
Referenced Files
F13046094: D19779.id47265.diff
Thu, Apr 18, 6:27 AM
Unknown Object (File)
Wed, Apr 17, 2:19 AM
Unknown Object (File)
Tue, Apr 16, 12:03 AM
Unknown Object (File)
Sat, Apr 13, 11:58 AM
Unknown Object (File)
Sat, Apr 13, 8:58 AM
Unknown Object (File)
Wed, Apr 10, 9:16 PM
Unknown Object (File)
Tue, Mar 19, 3:46 PM
Unknown Object (File)
Tue, Mar 19, 3:45 PM
Subscribers
Restricted Owners Package

Details

Summary

Depends on D19778. Ref T13216. See PHI943, PHI889, et al.

We currently have a push log and a pull log, but do not separately log intracluster synchronization events. We've encountered several specific cases where having this kind of log would be helpful:

  • In PHI943, an install was accidentally aborting locks early. Having timing information in the sync log would let us identify this more quickly.
  • In PHI889, an install hit an issue with MaxStartups configuration in sshd. A log would let us identify when this is an issue.
  • In PHI889, I floated a "push the linux kernel + fetch timeout" theory. A sync log would let us see sync/fetch timeouts to confirm this being a problem in practice.
  • A sync log will help us assess, develop, test, and monitor intracluster routing sync changes (likely those in T13211) in the future.

Some of these events are present in the pull log already, but only if they make it as far as running a git upload-pack subprocess (not the case with MaxStartups problems) -- and they can't record end-to-end timing.

No UI yet, I'll add that in a future change.

Test Plan
  • Forced all operations to synchronize by adding || true to the version check.
  • Pulled, got a sync log in the database.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Owners added a subscriber: Restricted Owners Package.Nov 6 2018, 10:45 PM
amckinley added inline comments.
src/applications/diffusion/protocol/DiffusionRepositoryClusterEngine.php
734

Worth it to add the name/PHID of the repo being synchronized? I know we get that info in the PhabricatorRepositorySyncEvent object, but seems like it would be good to log it here as well.

Also, is it always the cluster leader that we sync from? Don't we now just pick any device that's at max_version?

src/applications/repository/phid/PhabricatorRepositorySyncEventPHIDType.php
6

I'm guessing this is "sync event" and not a typo for SYNC?

src/applications/repository/query/PhabricatorRepositorySyncEventQuery.php
7–11

This could probably also use resultType, devicePHID and fromDevicePHID, no? Or are you just going to add stuff as needed when you build the UI?

This revision is now accepted and ready to land.Nov 7 2018, 6:22 PM
epriestley added inline comments.
src/applications/diffusion/protocol/DiffusionRepositoryClusterEngine.php
734

I think the repository is always clear from context: this message ends up on the user's command line after a git fetch or git push or similar, not in a general logfile. Hopefully, users know what they're fetching/pushing. Hopefully.

"Devices at max_version" and "leaders" are the same machines: the definition of "leader" is "any device with the most current version of the repository".

src/applications/repository/phid/PhabricatorRepositorySyncEventPHIDType.php
6

Yeah, the other two are PSHE ("Push Event") and PULE ("Pull Event"), so this is SYNE for consistency.

src/applications/repository/query/PhabricatorRepositorySyncEventQuery.php
7–11

Yeah, just didn't have any callers yet. I expect to flesh this out at least a bit once I add UI for it.

This revision was automatically updated to reflect the committed changes.