Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands
Open, LowPublic
Actions

Assigned To

None

Authored By

	epriestley
	Apr 12 2018, 12:08 AM

Description

Currently, bin/remote runs ssh -q ... -- bin/bastion ... which runs ssh -q ... -- ....

The -q flag suppresses annoying routine output, mostly about "Permanently added host X to known hosts". However, it also suppresses useful error output when things go wrong.

It's also vaguely bad that we're using -o StrictHostKeyChecking=no. It would be better to know the host keys (they're knowable) and use -o StrictHostKeyChecking=yes.

One general issue is that we probably don't want to check in a big everything.txt file with a list of every host and key since this makes deployment a mess that depends on git. So we either need to check in a general-purpose file which covers everything broadly, or generate a purpose-built file before each invocation of ssh.

When a file looks like this:

hostname.phacility.net <key>

...ssh tries to add a 1.2.3.4 <key> line when it connects, and emits a notice. This can be disabled with -o CheckHostIP=no, which gets us a little bit of the way there.

We can also use wildcards in hostnames, so we could do this:

bastion*.phacility.net <bastion-key>
web*.phacility.net <web-key>

...and so on. That feels a little janky but maybe it's OK?

We also currently don't install consistent keys on hosts. We do present a consistent key to git clone ..., but do so by specifying HostKey in the VCS SSHD, not by overwriting /etc/ssh/... on the system.

It's also theoretically sort-of good that each host have its own key?

So we could probably go down two paths:

Normalize keys: install the same key on every host, or on all hosts of a given type. Then write a wildcard file and check it into Git since it would rarely need to be updated (only when we add new tiers or cycle keys for some reason).
Keep unique keys: Stick with whatever unique key the host comes up with (or generate a new one) but give each host a unique key either way. Before connecting, bin/remote writes a temporary file with exactly the hostname and IP it expects, then uses -o UserKnownHostsFile to point ssh at that file. By writing a file immediately before use, we don't need to keep a giant list of every host/key in git.

(2) seems a little better in some sense and is "more correct", but I'm not sure it reaaaaally defuses any meaningful attack compared to per-tier-class keys?

(2) is also a bit trickier because it means bin/bastion must also generate a file on the bastion host. Since anything generating a file probably needs to make a service call to Almanac to figure out the public key, this might add a significant bit of overhead (one local call, then one call on the bastion) compared to normalizing keys per tier and storing them in git.

Revisions and Commits

	Restricted Differential Revision	Restricted Diffusion Commit
rP Phabricator
	D20240	rPe15fff00a640 Use "LogLevel=ERROR" to try to improve "ssh" hostkey behavior without doing…

Related Objects

Mentioned In: T13253: Plans: Daemon Status Reporting
T13210: Plans: 2018 Week 41-44 Bonus Content
Mentioned Here: T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018

Event Timeline

epriestley triaged this task as Low priority.Apr 12 2018, 12:08 AM

epriestley created this task.

Herald added a subscriber: eadler. · View Herald TranscriptApr 12 2018, 12:08 AM

I'm leaning toward:

Wait until all SSH-able devices are represented in Almanac.
Try approach (2).
If that's actually no good, maybe approach (1) instead.

amckinley added a subscriber: amckinley.Apr 12 2018, 2:21 AM

We can also get a "Something something won't allocate a terminal because something something pseudo-tty." warning without -q. I'm 100% confident this is the exact text of the error.

This can probably be resolved with the some combination of -t and -T flags.

One specific thing we hit in the context of T13180: sbuild001 stopped and started with a new public IP, and the Almanac device record needed to be updated with the new address.

The lease output is worthless, and I just knew this by guessing:

https://secure.phabricator.com/drydock/lease/12134/

Lease activation failed: [CommandException] Command failed with error #255!
COMMAND
ssh '-o' 'LogLevel=quiet' '-o' 'StrictHostKeyChecking=no' '-o' 'UserKnownHostsFile=/dev/null' '-o' 'BatchMode=yes' -l '********' -p '22' -i '********' '52.8.83.222' -- '(cd '\''/var/drydock/workingcopy-81/repo/phabricator/'\'' && git clean -d --force && git fetch && git reset --hard HEAD && git checkout '\''df31405d64a44a9ae7a12a324fb292dc8357be46'\'' --)'

STDOUT
(empty)

STDERR
(empty)

The LogLevel=quiet is the main culprit here making this totally useless, but providing a valid UserKnownHostsFile for an arbitrary device in Almanac seems tricky in the general case.

One cheaty approach is to just ask you to configure it.

Another possible (?) approach is to figure it out empirically ahead of the connection? Like:

Check if we have a cached key.
Try to connect with the cached key.
If it fails, try to connect without the key.
If that works, save the key we found to the cache.

Kinda a big pile of garbage.

In the context of Drydock/Harbormaster, we could reasonably eat the "pointless known hosts log every time" message for a while and don't necessarily need to fix this, but it would be nice to find a pathway forward where we aren't just wading through mountains of garbage on every side.

epriestley moved this task from Backlog to Near Future on the Phacility board.Aug 10 2018, 6:06 PM

epriestley mentioned this in T13210: Plans: 2018 Week 41-44 Bonus Content.Oct 24 2018, 12:39 AM

epriestley mentioned this in T13253: Plans: Daemon Status Reporting.Feb 15 2019, 12:23 PM

For Drydock/Harbormaster, the only real way forward I see here is:

Eat this warning by default.
Someday, perhaps allow a host key to be configured. If a key is configured, use strict host key checking. This should effectively raise the warning to an error, so we don't need to worry about it.

I think any default behavior other than "eat this warning" probably makes configuration too cumbersome relative to the value of this warning.

Another possible approach is to use -o LogLevel=ERROR. This gets us into trouble if there are useful INFO messages other than "permanently added X to list of known hosts", but presumably all the important stuff is rasied at ERROR or better.

epriestley added a revision: D20240: Use "LogLevel=ERROR" to try to improve "ssh" hostkey behavior without doing anything extreme/hacky.Mar 1 2019, 5:51 PM

epriestley added a revision: Restricted Differential Revision.Mar 1 2019, 5:53 PM

epriestley added a commit: rPe15fff00a640: Use "LogLevel=ERROR" to try to improve "ssh" hostkey behavior without doing….Mar 4 2019, 5:38 PM

epriestley added a commit: Restricted Diffusion Commit.

Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commandsOpen, LowPublicActions

Description

Revisions and Commits

Related Objects

Event Timeline

Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands
Open, LowPublic
Actions