I was able to get web001 to wedge a lot of requests in "R" ("Reading Request") -- note "SS" is "seconds since beginning of most recent request".
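For reference, this scoreboard state can be pulled from mod_status directly; a minimal sketch, assuming the handler is exposed at /server-status on localhost:

```
# Machine-readable scoreboard; wedged workers show up as a run of "R" entries.
curl -s 'http://localhost/server-status?auto'

# Full HTML table, which includes the per-worker "SS" column.
curl -s 'http://localhost/server-status'
```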
Oct 16 2017
I'm also unable to wedge secure. mod_status is working there, but since I can't wedge it, it isn't very useful.
I'm going to take a more detailed look at local behavior for this specific request pattern.
Oct 14 2017
This is working somewhat better than before, but I'm still able to wedge apache with enough requests using a pre-patch version of libphutil. I'm going to take a more detailed look at local behavior for this specific request pattern.
It's currently possible for cluster hosts to hit rate limits against admin while doing normal deployment stuff, since they may make a lot of requests to admin very quickly.
I expect I can just reduce the data size to something manageable with the current workflow fairly easily.
Oct 13 2017
(I also verified that real multipart/form-data works fine from a browser, by changing a project's profile picture here.)
On the server side, the connection test looks better: secure001 saw 8 HTTP 500s (this might be a bug in the multipart/form-data parser for whatever cURL is doing) and then 135 HTTP 429s, which is more or less exactly what the code is supposed to do. So I think most of the inconclusiveness on what was happening from the client side was curl / nohup stuff.
I used ab to verify the rate limiting code here, and got killed after a bit of abuse with ab -c 10 -n 200 ...:
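For reference, the general shape of that test (the target URL is a placeholder here, not the instance I actually hit):

```
# Ten concurrent clients, 200 requests total; enough to trip the rate limit.
ab -c 10 -n 200 https://test.example.com/
```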
Oct 12 2017
There's also a bug with arc here. It's supposed to limit simultaneous uploads to 4:
Oct 11 2017
Oct 10 2017
Oct 7 2017
This is still ongoing but it isn't impacting us so I don't plan to do anything else here.
I've deployed the slightly more formal preamble. It no longer has the remote address logging, but here are the addresses captured up to now (first column: request count; second column: address):
Oct 6 2017
This is still ongoing, although we're weathering it without any issues.
Yes, use bin/storage dump --no-indexes. See the documentation for more details.
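As a sketch, assuming you want a compressed dump on local disk (the path is just an example):

```
./bin/storage dump --no-indexes | gzip > /tmp/backup.sql.gz
```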
For those of us who don't mind having to wait through a large reindex after restoring from a backup, is there a way to ask the backup process to skip dumping the index tables? (Based on the above, I presume they'll still be included.)
Oct 5 2017
merely a large number and not an enormous number
Handful of these, too:
I early 500'd these requests in preamble.php and load seems better now:
There are a very large number of originating IP addresses so I'm going to filter this traffic by path rather than by IP address.
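Roughly, the filter is a cheap early exit in preamble.php before the rest of Phabricator loads. This is only a sketch; the path prefix and response text are stand-ins, not the filter actually deployed:

```
<?php

// preamble.php: runs before Phabricator proper, so abusive request patterns
// can be rejected very cheaply. Hypothetical path prefix shown below.
if (isset($_SERVER['REQUEST_URI'])) {
  if (preg_match('(^/file/data/)', $_SERVER['REQUEST_URI'])) {
    header('HTTP/1.0 500 Internal Server Error');
    echo "This endpoint is temporarily disabled.\n";
    exit(1);
  }
}
```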
In PHI120, we hit an issue where an instance (whose members seem to be located in Asia) performed 1,600 file uploads at roughly 256KB/sec, apparently putting a substantial amount of pressure on workers for ~12 minutes.
Oct 4 2017
Another thing we can do here which will generally reduce the operational load is to be more selective about how we dump/export tables from MySQL. Broadly, we have three general classes of tables today:
Oct 3 2017
I've pushed the MySQL configuration tweaks here, and manually built the common ngrams table with a 0.15 threshold.
I'm also going to increase the db tier innodb_buffer_pool_size from 4GB to 6GB. The hosts have 7.5GB of RAM and 8GB of swap, and are pretty much sitting there with a little under 3GB of RAM mostly unused and no other load.
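The change itself is just a one-line bump in the MySQL configuration; a sketch, assuming a stock my.cnf layout:

```
[mysqld]
# Previously 4G; hosts have 7.5GB of RAM, so this still leaves headroom
# for the OS and other MySQL allocations.
innodb_buffer_pool_size = 6G
```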
The thing I'm mostly looking at is how many queries are pushed down to 0 ngrams, i.e. they can't use the ngram index.
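To illustrate what "pushed down to 0 ngrams" means, here's a toy sketch (not Phabricator's actual implementation): break a term into trigrams, drop the "common" ones, and see what's left to look up in the index.

```
<?php

// Toy example of trigram extraction plus a common-ngram filter.
function get_trigrams($term) {
  $term = ' '.strtolower($term).' ';
  $ngrams = array();
  for ($ii = 0; $ii <= strlen($term) - 3; $ii++) {
    $ngrams[substr($term, $ii, 3)] = true;
  }
  return array_keys($ngrams);
}

// Hypothetical set of ngrams which crossed the "common" threshold.
$common = array_fill_keys(array(' th', 'the', 'he '), true);

$usable = array();
foreach (get_trigrams('the') as $ngram) {
  if (empty($common[$ngram])) {
    $usable[] = $ngram;
  }
}

// Here count($usable) is 0: every trigram of the term is common, so the
// query can't use the ngram index and has to scan candidate documents.
```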
I looked at how the last 1,000 queries on this instance would be affected by different "common" thresholds. Note that these charts are all sort of garbage (X axis is nonlinear) because I couldn't figure out how basic spreadsheet software works.
Oct 2 2017
Here are some possible structural changes we can make to the table, using the Maniphest ngrams table on this install as an example dataset (this is roughly 50x smaller than the target dataset):
There's still a big element of mystery here: why did MySQL sustain 80MB/sec of write I/O without daemons running? There "should" have been no queries against the fngrams table, and a large table "should" not require huge I/O volumes if it isn't being used. I don't think it's totally implausible that this was some kind of general thrash/swap/memory management issue in MySQL (the innodb_buffer_pool_size on these hosts is 4GB, and MySQL was using all of it) but I would expect "paging-like" activities to cause a large read volume, not a large write volume.
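If this comes up again, a couple of cheap checks might disambiguate paging from real InnoDB flushing; a sketch, assuming shell access to the db host:

```
# Per-device throughput, to confirm which volume the writes are landing on.
iostat -xm 5

# Dirty/flushed page counters; a steadily climbing flush count points at real
# InnoDB write activity rather than swap or paging.
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed'"
```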
Sep 25 2017
Sep 23 2017
We had to write some code to make phage work with exec and fix the phd stop flow to use --force, but this deploy completed successfully.
Sep 18 2017
We actually already have bin/remote stop, I just run it so rarely that I forgot about it.
Next chapter: T12989: Phacility Deployment: 2017 Week 38
Sep 16 2017
vault002 is dead. Long live lb001.
Sep 15 2017
Swap DNS.
- I opened up 22 -> 2223 on lb001.
- I allowed external 22 in the security group.
- I hard-coded my hostfile and cloned successfully:
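Roughly what that looked like; the address and repository URI below are placeholders, not the real values:

```
# /etc/hosts: point the repo hostname straight at lb001 (placeholder address).
203.0.113.10  repo.example.com

# Then a normal SSH clone through port 22 on the LB.
git clone ssh://git@repo.example.com/diffusion/EXAMPLE/example.git
```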
I'm going to take a stab at this now since I think it's non-disruptive and straightforward.
We use Almanac + Passphrase + Ansible + (Dynamic inventory client) for this.
Sep 14 2017
It's also possible to write a custom instances.do-exactly-what-we-need sort of endpoint and generalize later if that seems like a more promising approach.
The other shadow lurking in the water here -- which I think we can mostly avoid -- is that Almanac is mostly a-bit-bare-bones-but-overall-pretty-functional, except that the way properties on Bindings and Services are specified and edited is complete garbage. You more or less just have to magically know which properties are valid, and there's no real support for defaults or nice UI controls or hints about what you can set or suggestions that you're making stuff up and probably typo'd something.
The "most right" way in terms of consistency is to fully convert Binding to EditEngine, then implement almanac.binding.edit which can create/edit bindings. When creating a binding, it would require transactions specifying the service and interface. PhamePostBlogTransaction is sort of an example of this: when you create a new post with phame.post.edit, you must specify a blog transaction.
What's the best way to add API endpoints for resources like bindings? Call it almanac.create_binding and have it take a service and an interface as arguments?
Sep 12 2017
Both bak volumes are now swapped. The old volumes are detached as dbak001.phacility.net-old and dbak002.phacility.net-old. I'll delete them after the deployment on Saturday if no issues arise before then.
I haven't gotten any emails yet so I may have to go muck with the WHOIS stuff and make sure there's a valid email somewhere -- phurl.io itself has no mail or MX records.
Upgrade dbak001 to 128GB (from 64GB).
Upgrade dbak002 to 128GB (from 64GB).
Swap notify001 to notify002 in the LB.
I think this is basically "node is bananas" and our AMI is Ubuntu 14 which ships with "Node for DOS".
Just sent a cert request for phurl.io. (Actually it should be two, one for each region).
I'm bumping into T12171 when bringing up the new host. I'm going to take another stab at figuring out what's going on there because the workaround I found in that task is ridiculous.
Switch all SSL to AWS.
Test that moving SSL termination to nlb001 works.
Oh, sorry, misread -- that makes more sense. phurl.io is the only one we serve anything from ourselves right now, and I think the only one we have plans to serve anything from.
Yeah I'm just talking about requesting the SSL certs, not moving the domains. Unless I'm missing something, I don't think there's any way to get stuck just by getting the certs ready.
Let's make sure all the SSL is swapped first, just so we don't get into trouble if we make it halfway, run into issues, and something expires; but that'd be helpful once SSL is in the clear. I'm not sure if nlb was the last case of SSL terminating somewhere other than LBs or not, but I think there's one left that I just don't remember offhand.
I can go request these certs via the AWS cert manager. We can also do it via CloudFormation, which will reduce the number of clicks significantly (and make it trivial to request all the same certs in multiple regions).
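A minimal sketch of what the CloudFormation version might look like (domain names shown as examples; ACM certificates have to be requested in each region that terminates traffic, so the same template would be deployed per region):

```
Resources:
  PhurlCertificate:
    Type: AWS::CertificateManager::Certificate
    Properties:
      DomainName: phurl.io
      SubjectAlternativeNames:
        - "*.phurl.io"
```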
Replace notify001, which is scheduled for AWS downtime on September 20th.
Sep 11 2017
Per above, not planning to actually go forward with the GC step since the impact isn't ultimately very large.
Looking at the actual data, I'm less sure this is a good strategy. Here's the data for this install, considering the production configuration of storage.mysql-engine.max-size as 65535: