
Phacility Deploy Log 2017 Week 8
Closed, Resolved · Public

Description

  • Some taskmasters failed to start on repo001, and perhaps elsewhere; it's not clear why. Restarting phd manually resolved this for an individual instance, but ps auxwww shows we're missing a bunch of taskmasters. I don't currently have a clean way to detect missing daemons across instances (see the sketch after this list).
  • These volumes have storage alerts:
    • repo014 (86%)
    • ddata005 (87%)
    • ddata002 (82%)
    • dbak002 (80%)
    • ddata001 (94%)
  • Hibernating trigger daemons appears to be working, but hasn't done much to relieve overall memory pressure on hosts like repo011.
  • The tier as a whole is 88% full.
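
A rough way to spot missing taskmasters by hand is just to count daemon processes per host. This is a minimal sketch, not an existing tool: the hostnames are illustrative and the process name is assumed from what normally shows up in ps output.

# Minimal sketch, not an existing tool: count taskmaster processes per host
# so a host with missing daemons stands out. Hostnames are illustrative and
# the daemon class name is assumed from typical "ps auxwww" output.
for host in repo001 repo002 repo011; do
  count=$(ssh "$host" "ps auxwww | grep -c '[P]habricatorTaskmasterDaemon'")
  echo "${host}: ${count} taskmasters"
done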

Event Timeline

> The tier as a whole is 88% full.

I had hoped to increase our instances-per-host (T12217) before we ran out of hardware, but haven't gotten there fast enough and I'm no longer comfortable with utilization. I'm going to put more hosts into production immediately before pursuing other issues.

  • Some reserved instances ended on Feb 19th; I should make sure reserved instances match production instances.

> I'm going to put more hosts into production immediately before pursuing other issues.

The db019 and db020 hosts don't seem to reach the network on outbound requests only:

ubuntu@ip-172-30-0-89:~$ curl -v http://example.com/
* Hostname was NOT found in DNS cache
*   Trying 93.184.216.34...
*   Trying 2606:2800:220:1:248:1893:25c8:1946...
* Immediate connect fail for 2606:2800:220:1:248:1893:25c8:1946: Network is unreachable

These hosts are identical to db017 and db018, which deployed successfully moments before, so I'm not sure what's up. I can't install diagnostic tools like traceroute either, since apt-get can't reach the network. I'm going to try just kicking the boxes and see if things magically fix themselves...

The hosts launched into an "Auto-Assign Public IPv4" subnet but didn't get IPv4 addresses. Restarting them didn't help, and there's no apparent way to assign public addresses after the fact. I'm going to throw them away and try again.
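
If this recurs, it should be possible to confirm from the AWS CLI whether a host actually got a public IPv4 address, and whether its subnet is really set to auto-assign one, before deploying onto it. A sketch with placeholder IDs, not part of the deploy tooling:

# Sketch with placeholder IDs, not part of the deploy tooling: check whether
# an instance received a public IPv4 address, and whether its subnet
# auto-assigns public addresses at launch.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].PublicIpAddress' --output text
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].MapPublicIpOnLaunch' --output text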

The re-launched hosts seem OK, I'm waiting for DNS to update and then redeploying them.

Shards 017 through 020 are now in production.

> Some reserved instances ended on Feb 19th; I should make sure reserved instances match production instances.

I've adjusted instance reservations to match instances in service.
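
For the record, a quick way to eyeball whether reservations match what's actually running is to compare instance counts by type. This is a rough AWS CLI sketch, not part of our tooling:

# Rough sketch, not part of our tooling: compare running instance counts by
# type against active reservations so mismatches are easy to spot.
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceType' --output text \
  | tr '\t' '\n' | sort | uniq -c
aws ec2 describe-reserved-instances \
  --filters Name=state,Values=active \
  --query 'ReservedInstances[].[InstanceType,InstanceCount]' --output text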

> repo014 (86%)

This was a bunch of repository file data in /tmp. Each file was unique, so this doesn't look like any sort of buggy loop. There's some actual bug here, but since there's a long list of other issues to get through, I just "fixed" this by clearing /tmp.
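
In case it's useful later, the cleanup amounted to something like the following; the age cutoff is illustrative, and the underlying bug still needs a real fix.

# Illustrative cleanup only; the underlying bug still needs a real fix.
# Remove stale files older than a day from the top level of /tmp.
find /tmp -maxdepth 1 -type f -mtime +1 -print -delete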

> ddata005 (87%)

I synchronized services for this shard and destroyed old instances. The volume is now 44% full.

I fixed one minor bug in the process (rCORE3ea26d36).

> ddata002 (82%)

I synchronized services for this shard and destroyed old instances. The volume is now 68% full.

> dbak002 (80%)

I pruned some very old backups (backup pruning is currently very conservative in several cases). The volume is now 72% full. This could be reduced much further by allowing bin/host prune to remove backups older than 90 days for suspended or terminated instances that have been out of service for 90 days, but backup utilization is generally fairly stable, so I'm not concerned about this.
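
For illustration only, the cutoff would look roughly like this; the real policy belongs in bin/host prune, and it would also need to check instance state, which a plain find can't. The /backup/<instance> path layout is an assumption.

# Illustration of the 90-day cutoff only; the real policy lives in
# `bin/host prune` and must also check that the instance has been
# suspended/terminated and out of service for 90 days.
find /backup -mindepth 2 -type f -mtime +90 -print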

> ddata001 (94%)

I suspended and destroyed the largest instance on the shard, a free instance with one user and no logins for >30 days (the user appears to have imported a huge repository and then never logged back in). We still have backups, but this instance was way over reasonable resource use anyway. The volume is now 78% full.

> Some taskmasters failed to start on repo001 and perhaps elsewhere.

I'm looking at this now.

I think the issue is that the --reserve flag for reserving system memory now applies even to the first daemon in a pool, because all daemons are now part of pools. Previously, it applied only to the 2nd..Nth daemons.

As we launch daemons, we eventually hit a point where system memory is >80% full, particularly during the initial launch, before trigger daemons have a chance to hibernate. This stops taskmaster pools from scaling up to even one daemon.

This is consistent with what we're observing, because we only pass --reserve to the taskmaster pool.

I'm going to fix this by ignoring reserve for the first daemon in any pool.
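
Roughly, the intended rule looks like this. This is a shell sketch of the condition only, not the actual PHP daemon code, and the values are illustrative.

# Shell sketch of the intended scaling rule, not the actual PHP daemon code.
# The first daemon in a pool launches unconditionally; only additional
# daemons are gated on the --reserve memory threshold. Values are illustrative.
pool_size=0
reserve=0.20
free_fraction=$(awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "%.2f", a/t}' /proc/meminfo)
if [ "$pool_size" -eq 0 ]; then
  echo "launching first daemon (reserve ignored)"
elif awk -v f="$free_fraction" -v r="$reserve" 'BEGIN { exit !(f >= r) }'; then
  echo "scaling up: ${free_fraction} of memory free (>= reserve ${reserve})"
else
  echo "holding: free memory below reserve"
fi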

After manually deploying that to repo001, I'm now seeing a consistent Taskmaster/PullLocal count. I'm redeploying the rest of the repo hosts.

I cycled the rest of the tier and things look better now.