It isn’t obvious when repository observation breaks
Open, Needs Triage, Public

Assigned To

None

Authored By

	mavit
	Jul 4 2017, 4:19 PM

Description

Observation of a remote Git repository broke for boring SSH-key-related reasons at the remote end, but it took me a while to notice.

Browsing to /source/…/manage/status/ makes it obvious that there’s an error, with a little red warning triangle, but if you don’t happen to visit this page then it’s not clear that your repository replica is quietly going out of date.

Perhaps there could be some kind of alert for admins that something is broken, similar to the setup errors that they see? Perhaps people could choose to subscribe to alerts via Herald? Perhaps this should only trigger if mirroring has been failing for a certain period of time?

This likely overlaps somewhat with T6131.

Related Objects

Mentioned In: T13672: Arcanist: Exceptions when using Mercurial ~6.0/6.1
Mentioned Here: T12417: Show failing status of Diffusion repositories in top main menu bar.
T6131: Mirror Status on repository edit view

Event Timeline

mavit created this task. Jul 4 2017, 4:19 PM

jcarrillo7 added a subscriber: jcarrillo7. Jul 6 2017, 5:36 AM

From elsewhere:

[A main menu alert for administrators] would be good on small installs, but installs like Wikimedia let a wide set of users create repositories -- they have ~2,200 repositories, and chances are good that at least one of those is misconfigured at any time, so the warning would probably just be always-on for every admin (and not necessarily shown to the repository owners -- who are probably not admins most of the time -- although they are the ones who can actually resolve it). They also have admins who aren't engineers and probably can't help with the issue even if they do own the repository in some sense.

We also can't run a query constrained by policy efficiently ("Find all users who have permission to edit this repository") since we have to load every user and run application-level checks to evaluate policies in the general case. We could still do this, but it would need to be done in the background and probably still wouldn't really get us the right audience. I believe a lot of repositories are editable by users who wouldn't necessarily want or be able to correct credential errors, even if they technically have access to do so on the Phabricator side of things.

We could let you opt in to alerting ("when there's a problem with this repository, notify: X, Y"), but there's no source we could really use to set reasonable defaults, so I'd guess this would remain empty much of the time.

The most unambiguous changes I can come up with are:

We could surface that a problem exists more clearly on the list view and detail view, pointing you toward the Manage > Status view. There are some problems with this today -- we can't check if a large set of repositories have errors efficiently, I think -- but these are tractable. This wouldn't proactively notify you of problems, but would make them harder to overlook when you went looking.
We could extend this warning to Differential ("The repository this change will land into is experiencing a problem updating."), by checking the status of the repository associated with the revision. This isn't always a reliable link, but it's usually accurate.

These still aren't great, as they don't give users a direct set of steps to take toward resolution (they have to go figure out who the right person is to fix the issue), but they feel less perilous than trying to get broadcast notifications aimed at the right audience in all cases.

This information may be entirely useless, but it is somewhat related. We have a large number of repositories across multiple clusters, so this affects us in a major way.

We "solved" the problem with a small Python daemon that polls repository_statusmessage.epoch every minute and calculates the delta from the current time. When a delta surpasses Phabricator's max delta (21,600 sec), we page the on-call. For repositories that are mission critical, we use a smaller value. It obviously isn't a perfect solution, but it gets the job done and it's been extremely stable.
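
A minimal sketch of the kind of staleness check described above, not the actual daemon. It assumes the caller can supply (repository ID, epoch) pairs read from repository_statusmessage; `fetch_rows`, `page`, and the per-repository `overrides` mapping are hypothetical names introduced for illustration, and the database access itself is left out.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Threshold quoted in the comment above: 21,600 seconds (6 hours).
DEFAULT_MAX_DELTA = 21_600

Row = Tuple[int, int]  # (repository ID, epoch of the last status message)


def find_stale(
    rows: Iterable[Row],
    now: float,
    max_delta: int = DEFAULT_MAX_DELTA,
    overrides: Optional[Dict[int, int]] = None,
) -> List[Tuple[int, int]]:
    """Return (repository ID, delta in seconds) for each stale repository.

    `overrides` maps a repository ID to a smaller (or larger) threshold,
    e.g. for mission-critical repositories.
    """
    overrides = overrides or {}
    stale = []
    for repository_id, epoch in rows:
        delta = int(now - epoch)
        if delta > overrides.get(repository_id, max_delta):
            stale.append((repository_id, delta))
    return stale


def run_forever(
    fetch_rows: Callable[[], Iterable[Row]],  # hypothetical: reads the table
    page: Callable[[int, int], None],         # hypothetical: alerts the on-call
    overrides: Optional[Dict[int, int]] = None,
    interval: int = 60,
) -> None:
    """Poll once per `interval` seconds and page for every stale repository."""
    while True:
        for repository_id, delta in find_stale(
            fetch_rows(), time.time(), overrides=overrides
        ):
            page(repository_id, delta)
        time.sleep(interval)
```

As written, this pages again on every poll while a repository stays stale; a real deployment would probably want to de-duplicate alerts.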

joshuaspence added a subscriber: joshuaspence. Jul 10 2017, 10:00 PM

In T12896#228825, @jmeador wrote:

We "solved" the problem with a small python daemon that pings repository_statusmessage.epoch every minute and calculates the delta. When these deltas surpass Phabricator's max delta (21,600 sec) we page the oncall. For repositories that are mission critical, we use a smaller value. It obviously isn't a perfect solution, but it gets the job done and it's been extremely stable.

Nice solution! I might have to do something like that for Wikimedia.

chad merged a task: T12925: Warn more loudly when diffusion setups are broken?. Jul 17 2017, 10:49 PM

chad added subscribers: csilvers, sophiebits.

epriestley mentioned this in T13672: Arcanist: Exceptions when using Mercurial ~6.0/6.1. Mar 15 2022, 5:01 PM
