Tue, Feb 13
(Of course, it'll probably just work the first time now...)
Mon, Feb 12
The export process is already robust at a coarse level: the dump is retained on disk, so a failed run can be retried at the "upload the whole file again" level by re-running bin/host export with the --database or --database-file flags (probably with --keep-file).
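As a sketch of that coarse-grained retry (the flag names come from the note above; the database argument is a hypothetical placeholder):

```shell
# Retry a failed export at the "upload the whole file again" level.
# "example_instance" is a hypothetical placeholder for the affected
# database; --keep-file retains the on-disk dump so the command can be
# retried again if the upload fails a second time.
bin/host export --database example_instance --keep-file
```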
The (anonymized) error the process encountered while transferring the dump to central storage was:
Nov 13 2023
Next issue: can't pull from secure.
With bin/provision events working again:
Oct 26 2022
I patched and partially deployed this in early August. Another unattended MySQL upgrade went out on Monday night; it also didn't restart MySQL on affected hosts, and it caused some downtime on hosts that didn't yet have the patch (which disables unattended upgrades). I've now deployed the patch everywhere, and I'm presuming this is fixed until evidence arises to the contrary.
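For reference, disabling unattended upgrades on Debian/Ubuntu is conventionally a small apt configuration change; a minimal sketch (whether the deployed patch works exactly this way is an assumption, not recorded here):

```shell
# Turn off apt's periodic unattended upgrades using the standard
# Debian/Ubuntu knobs. Writing this file overrides the defaults that
# the "unattended-upgrades" package installs.
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF
```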
Oct 25 2022
Jul 29 2022
Apr 20 2022
There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while actively maintained.)
This hasn't caused any more problems in like 4 years, so I guess it's kind of whatever.
This isn't really resolved, but almost certainly does not make sense to pursue given the Phacility wind-down.
Almost every host currently in production was provisioned with Piledriver and things have been stable for quite a while, so I'm calling this resolved. See elsewhere for issues with Ubuntu20, mail, etc.
Moved the rest of this to T13640.
Apr 19 2022
I deployed this and it seems to be working properly.
Hey, it worked once. Good enough for me!
No dice. We need bin/upgrade to run before mysql because it has to mount the data volume. So now I'm trying this:
... service ... start rather than service ... restart ...
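A minimal sketch of the ordering being tried (bin/upgrade is from the note above; the "mysql" service name is an assumption):

```shell
# Run the upgrade step first, since it mounts the data volume that
# MySQL needs before MySQL comes up for the first time. Then use
# "start" (a no-op if the service is already running) rather than
# "restart" (which would bounce a healthy service).
bin/upgrade
service mysql start
```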
Apr 1 2022
This has some rough edges that I'm not going to deal with for now:
Dec 19 2021
See T12847. All the technical parts of this are now solved except for billing, but since Phacility is winding down I no longer plan to pursue it.
I resolved this in rCORE320b2854.
Only one instance was impacted by this and I just credited them until 2099. I don't currently expect to pursue this.
I no longer expect to pursue this.
- Hosts in the repo class are now built by Piledriver (see T13630), which automatically creates the rbak device entries, so this error isn't likely to occur again.
- I also don't expect to launch any more hosts.
I compacted secure onto new hardware (T13671) and shut down saux001 ("Land Revision") and sbuild001 (Harbormaster remote builds). I think all the remaining work is covered under T13630 (largely, just a handful of large database migrations remain).
I just swapped configs over without merging the LBs, since it wasn't immediately obvious to me what the Application vs. Classic load balancer state of the world was, and swapping was good enough.
The aphlict/notify stuff still needs to be tweaked. I think the snlb + slb setup can be merged into a single slb with "TCP (Secure)" forwarding now.
Databases are moved and secure is out of read-only mode. I'm going to adjust repository configuration, then I should be able to tear down secure001.
I'm going to put secure back into read-only mode now and move the databases to the new host.
I brought up the new host and pointed the slb001 load balancer at it. The database is still on the old host, and the new host doesn't have repositories yet, but the basics seem to be working.