
Phacility Deploy Log 2017 Week 8
Closed, Resolved · Public

Description

  • Some taskmasters failed to start on repo001, and perhaps elsewhere; it's not clear why. Restarting phd manually resolved this for an individual instance, but ps auxwww shows we're missing a bunch of taskmasters. I don't currently have a clean way to detect missing daemons across instances (see the sketch after this list).
  • These volumes have storage alerts:
    • repo014 (86%)
    • ddata005 (87%)
    • ddata002 (82%)
    • dbak002 (80%)
    • ddata001 (94%)
  • Hibernating trigger daemons appears to be working, but hasn't done much to relieve overall memory pressure on hosts like repo011.
  • The tier as a whole is 88% full.
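
A rough way to spot missing taskmasters by hand is just to count daemon processes per host. This is a minimal sketch, not an existing tool: the hostnames are illustrative and the process name is assumed from what normally shows up in ps output.

# Minimal sketch, not an existing tool: count taskmaster processes per host
# so a host with missing daemons stands out. Hostnames are illustrative and
# the daemon class name is assumed from typical "ps auxwww" output.
for host in repo001 repo002 repo011; do
  count=$(ssh "$host" "ps auxwww | grep -c '[P]habricatorTaskmasterDaemon'")
  echo "${host}: ${count} taskmasters"
done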

Event Timeline

> The tier as a whole is 88% full.

I had hoped to increase our instances-per-host (T12217) before we ran out of hardware, but haven't gotten there fast enough and I'm no longer comfortable with utilization. I'm going to put more hosts into production immediately before pursuing other issues.

  • Some reserved instances ended on Feb 19th; I should make sure reserved instances match production instances.

> I'm going to put more hosts into production immediately before pursuing other issues.

The db019 and db020 hosts don't seem to reach the network on outbound requests only:

ubuntu@ip-172-30-0-89:~$ curl -v http://example.com/
* Hostname was NOT found in DNS cache
*   Trying 93.184.216.34...
*   Trying 2606:2800:220:1:248:1893:25c8:1946...
* Immediate connect fail for 2606:2800:220:1:248:1893:25c8:1946: Network is unreachable

These hosts are identical to db017 and db018, which deployed successfully moments before, so I'm not sure what's up. I can't install diagnostic tools like traceroute either, since apt-get can't reach the network. I'm going to try just kicking the boxes and see if things magically fix themselves...

The hosts launched into an "Auto-Assign Public IPv4" subnet but didn't get IPv4 addresses. Restarting them didn't help, and there's no apparent way to assign public addresses after the fact. I'm going to throw them away and try again.
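
If this recurs, it should be possible to confirm from the AWS CLI whether a host actually got a public IPv4 address, and whether its subnet is really set to auto-assign one, before deploying onto it. A sketch with placeholder IDs, not part of the deploy tooling:

# Sketch with placeholder IDs, not part of the deploy tooling: check whether
# an instance received a public IPv4 address, and whether its subnet
# auto-assigns public addresses at launch.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].PublicIpAddress' --output text
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].MapPublicIpOnLaunch' --output text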

The re-launched hosts seem OK, I'm waiting for DNS to update and then redeploying them.

Shards 017 through 020 are now in production.

> Some reserved instances ended on Feb 19th; I should make sure reserved instances match production instances.

I've adjusted instance reservations to match instances in service.
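
For the record, a quick way to eyeball whether reservations match what's actually running is to compare instance counts by type. This is a rough AWS CLI sketch, not part of our tooling:

# Rough sketch, not part of our tooling: compare running instance counts by
# type against active reservations so mismatches are easy to spot.
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceType' --output text \
  | tr '\t' '\n' | sort | uniq -c
aws ec2 describe-reserved-instances \
  --filters Name=state,Values=active \
  --query 'ReservedInstances[].[InstanceType,InstanceCount]' --output text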

> repo014 (86%)

This was a bunch of repository file data in /tmp. Each file was unique, so this doesn't look like any sort of buggy loop. There's some actual bug here, but since there's a long list of other issues to get through, I just "fixed" this by clearing /tmp.
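
In case it's useful later, the cleanup amounted to something like the following; the age cutoff is illustrative, and the underlying bug still needs a real fix.

# Illustrative cleanup only; the underlying bug still needs a real fix.
# Remove stale files older than a day from the top level of /tmp.
find /tmp -maxdepth 1 -type f -mtime +1 -print -delete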

> ddata005 (87%)

I synchronized services for this shard and destroyed old instances. The volume is now 44% full.

I fixed one minor bug in the process (rCORE3ea26d36).

> ddata002 (82%)

I synchronized services for this shard and destroyed old instances. The volume is now 68% full.

> dbak002 (80%)

I pruned some very old backups (backup pruning is currently very conservative in several cases). The volume is now 72% full. This could be reduced much further by allowing bin/host prune to remove backups older than 90 days for suspended or terminated instances that have been out of service for 90 days, but backup utilization is generally fairly stable, so I'm not concerned about this.
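
For illustration only, the cutoff would look roughly like this; the real policy belongs in bin/host prune, and it would also need to check instance state, which a plain find can't. The /backup/<instance> path layout is an assumption.

# Illustration of the 90-day cutoff only; the real policy lives in
# `bin/host prune` and must also check that the instance has been
# suspended/terminated and out of service for 90 days.
find /backup -mindepth 2 -type f -mtime +90 -print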

> ddata001 (94%)

I suspended and destroyed the largest instance on the shard, a free instance with one user and no logins for >30 days (the user appears to have imported a huge repository and then never logged back in). We still have backups, but this instance was way over reasonable resource use anyway. The volume is now 78% full.

> Some taskmasters failed to start on repo001 and perhaps elsewhere.

I'm looking at this now.

I think the issue is that the --reserve flag for reserving system memory now applies even to the first daemon in a pool, because all daemons are now part of pools. Previously, it applied only to the 2nd..Nth daemons.

As we launch daemons, we eventually hit a point where system memory is >80% full, particularly during the initial launch, before trigger daemons have a chance to hibernate. This stops taskmaster pools from scaling up to even one daemon.

This is consistent with what we're observing, because we only pass --reserve to the taskmaster pool.

I'm going to fix this by ignoring reserve for the first daemon in any pool.
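
Roughly, the intended rule looks like this. This is a shell sketch of the condition only, not the actual PHP daemon code, and the values are illustrative.

# Shell sketch of the intended scaling rule, not the actual PHP daemon code.
# The first daemon in a pool launches unconditionally; only additional
# daemons are gated on the --reserve memory threshold. Values are illustrative.
pool_size=0
reserve=0.20
free_fraction=$(awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "%.2f", a/t}' /proc/meminfo)
if [ "$pool_size" -eq 0 ]; then
  echo "launching first daemon (reserve ignored)"
elif awk -v f="$free_fraction" -v r="$reserve" 'BEGIN { exit !(f >= r) }'; then
  echo "scaling up: ${free_fraction} of memory free (>= reserve ${reserve})"
else
  echo "holding: free memory below reserve"
fi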

After manually deploying that to repo001, I'm now seeing a consistent Taskmaster/PullLocal count. I'm redeploying the rest of the repo hosts.

I cycled the rest of the tier and things look better now.