An earlier patch here (rCORE6d6170f76463) swapped binlogs to MIXED and set a 24-hour retention policy. This issue has not recurred in the cluster since that patch went out, but the root causes remain unresolved.
Oct 26 2022
Jun 13 2022
- The drydock_resource table could use a (status, ...) key to satisfy common/default queries.
Jun 7 2022
May 9 2022
There may be additional work here, but I'm presuming this is more or less resolved until evidence to the contrary arises.
May 5 2022
When this mechanism is removed (by commenting out the logic that cares about the 25% limit), we'd expect Drydock to build 8 resources at a time (limited by the number of taskmasters). It actually builds ~1-4...
May 4 2022
The outline above isn't quite sufficient because when the active resource list is nonempty, we don't actually reach the "new allocation" logic. Broadly, executeAllocator() is kind of wonky and needs some additional restructuring to cover both the D19762 case ("allocate up to the resource limit before reusing resources") and the normal set of cases. The proper logic is something like:
This issue partially reproduces (consistent with the original report, not immediately consistent with my theorizing about a root cause in PHI2177 -- actually, looks like both parts are right, see below): Drydock builds ~1 working copy per minute serially until it reaches a pool size of 5 resources. Then, it begins allocating 2 simultaneous resources.
May 3 2022
This is somewhat resolved and neither next steps nor motivation are clear any longer, so I'm going to call it done until evidence to the contrary arises.
Perhaps a philosophical question here is: do we care about which repositories are checked out in a working copy resource?
Before, instant reclaim after lease destruction:
To create resource pressure, I'm now going to try this -- I guess I don't really need the --count flag, but it does make the terminal juggling slightly easier:
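Something along these lines creates that kind of pressure (a sketch, not the exact invocation -- the count is arbitrary, and a real working-copy lease also needs repository attributes passed via --attributes, which are omitted here):

```
# Request several working-copy leases at once; a working-copy blueprint
# must already be configured, and repository attributes (--attributes)
# are omitted from this sketch.
$ ./bin/drydock lease --type working-copy --count 8
```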
The blueprint thing was on the way toward creating allocation pressure, so D21802 allows you to select a blueprint (or a set of possible blueprints) with --blueprint. You can specify an ID or PHID:
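For example (the blueprint ID and PHID below are placeholders; the text above also mentions selecting a set of possible blueprints, but that syntax isn't shown here):

```
# Constrain the lease to a specific blueprint, by ID...
$ ./bin/drydock lease --type working-copy --blueprint 5

# ...or by PHID.
$ ./bin/drydock lease --type working-copy --blueprint PHID-BLUE-abcdefghijklmnopqrst
```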
That patch is reasonable, and shouldn't break anything as long as the list you provide is a subset of the possible list.
fill in the details a bit.
After D21796:
(one orthogonal bug I found is that bin/drydock lease discards any blueprints provided in an attributes JSON)
Grab a test lease on the host with:
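Something like the standard command from the Drydock setup documentation (this assumes a host blueprint is already configured and authorized):

```
# Acquire a test lease against the configured host blueprint.
$ ./bin/drydock lease --type host
```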
Here's a fairly simple way to reproduce this:
Feb 7 2020
A saved state is likely something like this:
Sep 27 2019
One broad problem here is "chain of custody" issues in T182. A "Saved State" can easily accommodate multiple representations, and the plan above imagines using Drydock to build tags/branches out of non-repository representations, so we'd have cases where a given "Saved State" has a way to build it with a "patch list" (from the client) or a "ref pointer" (from Drydock).
Aug 20 2019
Dec 12 2018
In T12145#242682, @joshuaspence wrote: I'm having some trouble getting this new behaviour (which IIUC basically means that multiple hosts in a Drydock pool should be load-balanced across). In "active resources" I see three Drydock hosts, which all belong to the same Almanac service. In "active leases", however, I see only a single host lease and many working copy leases.
Dec 9 2018
I'm having some trouble getting this new behaviour (which IIUC basically means that multiple hosts in a Drydock pool should be load-balanced across). In "active resources" I see three Drydock hosts, which all belong to the same Almanac service. In "active leases", however, I see only a single host lease and many working copy leases.
Nov 26 2018
Nov 10 2018
Nov 1 2018
Oct 30 2018
Fantastic, thanks very much @epriestley! I had indeed intended to take care of this myself, but was on other work this week and last and planned to come back to this. It also would have taken me much longer to realize that drydock.lease.search wasn't yet upstream and to figure out how to proceed from there, so I'm glad to see you were able to handle this so easily!
Oct 26 2018
Oct 25 2018
D19762 adds a "supplemental allocation" behavior, which basically lets blueprints say "I want to grow the pool instead of allowing this otherwise valid lease acquisition".
I believe you can work around this today by disabling the binding to host "A" in Almanac, running one job (which will be forced to allocate on host "B"), then re-enabling the binding. After that, both hosts will have resources and jobs will allocate randomly, which should be good enough. This is exceptionally cumbersome and ridiculous, of course (and it's possible that it doesn't even work).
A specific subcase here is when the binding to an Almanac host has been disabled. We should possibly test this during Interface construction, treat it as a failure, then recover from it.
I believe D16594 should implement this, one way or another, unless I'm misunderstanding the request.
Complicating this: there is no drydock.lease.search call upstream. So you're probably running some variation of D16594? But that already has ownerPHIDs.
Oct 24 2018
I'm happy to make these changes myself -- or did you still want to contribute a patch, as you mentioned?
Oct 23 2018
Oct 16 2018
Oct 12 2018
Oct 10 2018
Oct 1 2018
Sep 21 2018
Sep 19 2018
Sep 14 2018
Sep 13 2018
Sep 7 2018
Aug 28 2018
The unit test results also don't currently show on individual builds, which is a little whack?
See T13189#240682 for some planning on the Unit Test result table.
Aug 27 2018
Aug 3 2018
Jun 20 2018
Back when this was originally reported, I'm pretty sure git lfs clone didn't exist (or at least I wasn't aware of its existence). The appropriate fix now is probably different to the fix suggested in the original report.
We have a similar issue -- however, I think the "fix" is probably worse than the workaround.
Jun 5 2018
Apr 16 2018
Apr 13 2018
Mar 16 2018
Mar 13 2018
Mar 12 2018
Mar 7 2018
Mar 5 2018
This is effectively paused until I'm more convinced that the stabilization changes really stabilized things -- I'm hoping to stabilize first, then work on improvements from there.