Checking out from Phacility sometimes fails
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	swisspol
	Aug 12 2016, 8:58 PM

Description

I'm not sure but I think this started happening maybe 2 weeks ago. The error happens maybe once every 2 days, always from Jenkins when checking out at the beginning of a build. When it happens, often 2 builds will fail in a row. Restarting them get rid of the problem.

Building in workspace /Users/admin/Jenkins/Home/workspace/Hyper-Android-SideLoad
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git # timeout=10
Fetching upstream changes from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
 > git --version # timeout=10
using GIT_SSH to set credentials Phabricator
 > git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)
	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1012)
	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1043)
	at hudson.scm.SCM.checkout(SCM.java:485)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: # Fetch received by "web.phacility.net", forwarding to cluster host.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1640)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1388)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:62)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:313)
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:761)
	... 11 more
ERROR: null

NOTE: I'll try to update the list of occurrences below (times are PST):

Aug 12, 2016 12:56:06 AM
Aug 12, 2016 10:33:04 PM
Aug 15, 2016 3:42:05 PM
Aug 28, 2016 12:51:35 AM
Aug 29, 2016 2:55:47 AM
Oct 16, 2016 2:45 PM

Revisions and Commits

		Restricted Diffusion Commit
rP Phabricator
	D16744	rPc21a71f02489 Cache generation of the SSH authentication keyfile for sshd

Related Objects
Search...

Status	Assigned	Task
Resolved	epriestley	T11469 Checking out from Phacility sometimes fails
Resolved	epriestley	T11506 Build smarter CLI caching for instance data in the Phacility cluster
Resolved	epriestley	T11766 Write to the pull log from SSH, not just HTTP

Event Timeline

swisspol created this task.Aug 12 2016, 8:58 PM

chad added a project: Phacility.Aug 12 2016, 10:12 PM

chad added subscribers: epriestley, chad.

swisspol updated the task description. (Show Details)Aug 13 2016, 5:34 AM

I got code 500 as I checked out from this instance, was about 23:50 UTC, yesterday. Maybe related? After getting that code, I tried it again after about 10 seconds, then, everything was ok.

Why do you mean by "this instance"?

secure.phabricator.com isn't in the Phacility cluster

No disruption is expected and there was no maintenance or known disruption at those times. Nothing jumps out at me immediately in monitoring, either, and I can't reproduce this by casually cloning stuff.

We have a PullEvent log, but it's currently pretty rough since I built it to debug a Mercurial + HTTP issue. It doesn't currently record SSH pulls. We can expand it to include SSH pulls, which may shed some light on things.

We can also ramp up SSH-pathway logging, although I suspect application pull logging may cover things on its own.

I do have a theory on this: currently, instance Conduit calls from the CLI are not cached (we didn't have a clean, existing cache available when I wrote them, and the cache semantics are somewhat involved). This is getting fairly hot on admin, and Multimeter shows a sustained average of ~10 calls per second over the last 24 hours. Conduit calls are relatively heavy (see T11404 for some discussion). It's plausible that some of these are just failing at peak -- I believe I've observed occasional failures during the additional load caused by deploys.

If they are, they'd rarely be visible in the web UI (which caches and makes relatively few calls). However, I'd expect to see them in the daemon logs and I don't, so this may be barking up the wrong tree.

I'd also possibly expect to see errors in the admin log (possibly; I'm not sure if this would present as a client timeout or a server timeout) but there's a bunch of CORGI stuff mucking that up. So my plans are:

Fix CORGI errors, check if that reveals any obvious indicators of instance information fetch issues.
Refine the instance CLI cache, which is starting to run uncomfortably hot and can't be helping, even if it isn't actually the problem here.
If this still unresolved, bring the pull log to SSH.
If that doesn't get to the bottom of it either, add a dedicated SSH error log.

epriestley claimed this task.Aug 14 2016, 3:30 PM

epriestley triaged this task as Normal priority.

Fix CORGI errors, check if that reveals any obvious indicators of instance information fetch issues.

I've deployed this and the error log is quiet for now, but I'll keep an eye on things.

Just observed a failure, see updated description.

swisspol mentioned this in T11445: Commits with failed HarborMaster builds stuck in "Building" state in Diffusion.Aug 15 2016, 11:13 PM

No smoking guns in the access or error logs on admin. The cache/load theory may just be a goose chase, since I don't see any evidence of unusual load around that time.

epriestley created subtask T11506: Build smarter CLI caching for instance data in the Phacility cluster.Aug 21 2016, 1:01 PM

swisspol updated the task description. (Show Details)Aug 28 2016, 8:04 AM

michaeljs1990 added a subscriber: michaeljs1990.Aug 28 2016, 2:57 PM

swisspol updated the task description. (Show Details)Aug 29 2016, 9:57 AM

epriestley closed subtask T11506: Build smarter CLI caching for instance data in the Phacility cluster as Resolved.Sep 5 2016, 4:45 PM

Still failing as of today (rarely though):

Building in workspace /Users/admin/Jenkins/Home/workspace/Hyper-tvOS-AdHoc
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git # timeout=10
Fetching upstream changes from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
 > git --version # timeout=10
using GIT_SSH to set credentials Phabricator
 > git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)
	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1012)
	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1043)
	at hudson.scm.SCM.checkout(SCM.java:485)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: ssh_exchange_identification: Connection closed by remote host
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1640)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1388)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:62)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:313)
	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:761)
	... 11 more
ERROR: null
Archiving artifacts
Publish artifacts to S3 Bucket Using S3 profile: Default
Publish artifacts to S3 Bucket Skipping publishing on S3 because build failed
[phabricator:non-differential-harbormaster] Sending diffusion result as: FAILURE
Finished: FAILURE

epriestley mentioned this in T11766: Write to the pull log from SSH, not just HTTP.Oct 17 2016, 9:57 PM

epriestley created subtask T11766: Write to the pull log from SSH, not just HTTP.

epriestley moved this task from Backlog to Near Future on the Phacility board.

epriestley added a commit: Restricted Diffusion Commit.Oct 21 2016, 2:02 PM

This may have been the "MaxStartups" directive in sshd; if so, a partial fix is currently deployed (raising the limit on application load balancers) and a more comprehensive fix will go out tomorrow morning (also raising the limit on repository shards).

I'm no longer able to reproduce this with the partial fix in place, although my initial reproduction case was extremely shaky so I'm not sure that's conclusive.

epriestley added a revision: D16744: Cache generation of the SSH authentication keyfile for sshd.Oct 21 2016, 2:27 PM

D16744 should also improve SSH performance slightly, with a larger impact on larger instances.

epriestley added a commit: rPc21a71f02489: Cache generation of the SSH authentication keyfile for sshd.Oct 21 2016, 2:29 PM

Another instance was reproducing this more frequently/reliably and reports that they're no longer seeing issues, and I haven't been able to reproduce it myself since the configuration changes. Presuming we're in the clear now, but let us know if you see further issues.

(I also adjusted SSH connection load balancing to even out tier load.)

epriestley closed subtask T11766: Write to the pull log from SSH, not just HTTP as Resolved.Oct 25 2016, 4:05 PM

Checking out from Phacility sometimes failsClosed, ResolvedPublicActions

Description

Revisions and Commits

Related ObjectsSearch...

Event Timeline

Checking out from Phacility sometimes fails
Closed, ResolvedPublic
Actions

Related Objects
Search...