Page MenuHomePhabricator

Fix bad counting in SQL when enforcing Drydock allocator soft limits
ClosedPublic

Authored by epriestley on Oct 14 2015, 12:26 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 11, 9:35 PM
Unknown Object (File)
Wed, Dec 11, 8:25 PM
Unknown Object (File)
Fri, Nov 29, 10:47 AM
Unknown Object (File)
Thu, Nov 28, 5:33 PM
Unknown Object (File)
Wed, Nov 27, 3:49 AM
Unknown Object (File)
Tue, Nov 26, 4:40 AM
Unknown Object (File)
Tue, Nov 26, 4:40 AM
Unknown Object (File)
Tue, Nov 26, 4:15 AM
Subscribers
None

Details

Summary

Ref T9252. This fixes a bug from D14236. D14272 discusses the observable effects of the bug, primarily that the window for racing is widened from ~a few milliseconds to several minutes under our configuration.

This SQL query is missing a GROUP BY clause, so all of the resources get counted as having the same status (specifically, the alphabetically earliest status any resource had, I think). For test cases this often gets the right result since the number of resources may be small and they may all have the same status, but in production this isn't true. In particular, the allocator would sometimes see "35 destroyed resources" (or whatever), when the real counts were "32 destroyed resources + 3 pending resources".

Since this allocator behavior is soft/advisory this didn't cause any actual problems, per se (we do expect races here occasionally), it just made the race very very easy to hit. For example, Drydock in production currently has three pending working copy resources. Although we do expect this to be possible, getting 4 resources when the configured limit is 1 should be hard (not lightning strike / cosmic radiaion hard, but "happens once a year" hard).

Also exclude destroyed resources since we never care about them.

Test Plan

Followed the plan from D14272 and restarted two Harbormaster workers at the same time.

After this patch was applied, they no longer created two different resources (we expect it to be possible for this to happen, just very hard).

We should still be able to force this race by putting something like sleep(10) right before the query, then sleep(10) right after it. That would prevent the allocators from seeing one another (so they would both think there were no other resources) and push us down the pathway where we exceed the soft limit.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision from to Fix bad counting in SQL when enforcing Drydock allocator soft limits.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added a reviewer: chad.
This revision is now accepted and ready to land.Oct 14 2015, 1:15 PM
This revision was automatically updated to reflect the committed changes.