Page MenuHomePhabricator

Checking out from Phacility sometimes fails
Closed, ResolvedPublic

Description

I'm not sure but I think this started happening maybe 2 weeks ago. The error happens maybe once every 2 days, always from Jenkins when checking out at the beginning of a build. When it happens, often 2 builds will fail in a row. Restarting them get rid of the problem.

Building in workspace /Users/admin/Jenkins/Home/workspace/Hyper-Android-SideLoad
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git # timeout=10
Fetching upstream changes from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
 > git --version # timeout=10
using GIT_SSH to set credentials Phabricator
 > git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)
	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1012)
	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1043)
	at hudson.scm.SCM.checkout(SCM.java:485)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: # Fetch received by "web.phacility.net", forwarding to cluster host.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1640)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1388)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:62)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:313)
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:761)
	... 11 more
ERROR: null
NOTE: I'll try to update the list of occurrences below (times are PST):
  • Aug 12, 2016 12:56:06 AM
  • Aug 12, 2016 10:33:04 PM
  • Aug 15, 2016 3:42:05 PM
  • Aug 28, 2016 12:51:35 AM
  • Aug 29, 2016 2:55:47 AM
  • Oct 16, 2016 2:45 PM

Event Timeline

swisspol created this task.Aug 12 2016, 8:58 PM
chad added subscribers: epriestley, chad.
swisspol updated the task description. (Show Details)Aug 13 2016, 5:34 AM

I got code 500 as I checked out from this instance, was about 23:50 UTC, yesterday. Maybe related? After getting that code, I tried it again after about 10 seconds, then, everything was ok.

Why do you mean by "this instance"?

chad added a comment.Aug 13 2016, 3:09 PM

secure.phabricator.com isn't in the Phacility cluster

No disruption is expected and there was no maintenance or known disruption at those times. Nothing jumps out at me immediately in monitoring, either, and I can't reproduce this by casually cloning stuff.

We have a PullEvent log, but it's currently pretty rough since I built it to debug a Mercurial + HTTP issue. It doesn't currently record SSH pulls. We can expand it to include SSH pulls, which may shed some light on things.

We can also ramp up SSH-pathway logging, although I suspect application pull logging may cover things on its own.

I do have a theory on this: currently, instance Conduit calls from the CLI are not cached (we didn't have a clean, existing cache available when I wrote them, and the cache semantics are somewhat involved). This is getting fairly hot on admin, and Multimeter shows a sustained average of ~10 calls per second over the last 24 hours. Conduit calls are relatively heavy (see T11404 for some discussion). It's plausible that some of these are just failing at peak -- I believe I've observed occasional failures during the additional load caused by deploys.

If they are, they'd rarely be visible in the web UI (which caches and makes relatively few calls). However, I'd expect to see them in the daemon logs and I don't, so this may be barking up the wrong tree.

I'd also possibly expect to see errors in the admin log (possibly; I'm not sure if this would present as a client timeout or a server timeout) but there's a bunch of CORGI stuff mucking that up. So my plans are:

  • Fix CORGI errors, check if that reveals any obvious indicators of instance information fetch issues.
  • Refine the instance CLI cache, which is starting to run uncomfortably hot and can't be helping, even if it isn't actually the problem here.
  • If this still unresolved, bring the pull log to SSH.
  • If that doesn't get to the bottom of it either, add a dedicated SSH error log.
epriestley triaged this task as Normal priority.

Fix CORGI errors, check if that reveals any obvious indicators of instance information fetch issues.

I've deployed this and the error log is quiet for now, but I'll keep an eye on things.

swisspol updated the task description. (Show Details)Aug 15 2016, 10:43 PM

Just observed a failure, see updated description.

No smoking guns in the access or error logs on admin. The cache/load theory may just be a goose chase, since I don't see any evidence of unusual load around that time.

swisspol updated the task description. (Show Details)Aug 28 2016, 8:04 AM
swisspol updated the task description. (Show Details)Aug 29 2016, 9:57 AM
swisspol updated the task description. (Show Details)Oct 16 2016, 9:47 PM

Still failing as of today (rarely though):

Building in workspace /Users/admin/Jenkins/Home/workspace/Hyper-tvOS-AdHoc
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git # timeout=10
Fetching upstream changes from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
 > git --version # timeout=10
using GIT_SSH to set credentials Phabricator
 > git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)
	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1012)
	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1043)
	at hudson.scm.SCM.checkout(SCM.java:485)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: ssh_exchange_identification: Connection closed by remote host
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1640)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1388)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:62)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:313)
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:761)
	... 11 more
ERROR: null
Archiving artifacts
Publish artifacts to S3 Bucket Using S3 profile: Default
Publish artifacts to S3 Bucket Skipping publishing on S3 because build failed
[phabricator:non-differential-harbormaster] Sending diffusion result as: FAILURE
Finished: FAILURE
epriestley added a commit: Restricted Diffusion Commit.Oct 21 2016, 2:02 PM

This may have been the "MaxStartups" directive in sshd; if so, a partial fix is currently deployed (raising the limit on application load balancers) and a more comprehensive fix will go out tomorrow morning (also raising the limit on repository shards).

I'm no longer able to reproduce this with the partial fix in place, although my initial reproduction case was extremely shaky so I'm not sure that's conclusive.

D16744 should also improve SSH performance slightly, with a larger impact on larger instances.

epriestley closed this task as Resolved.Oct 25 2016, 4:04 PM

Another instance was reproducing this more frequently/reliably and reports that they're no longer seeing issues, and I haven't been able to reproduce it myself since the configuration changes. Presuming we're in the clear now, but let us know if you see further issues.

(I also adjusted SSH connection load balancing to even out tier load.)