
Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table
Closed, Resolved (Public)

Authored By: epriestley
Oct 2 2017, 3:14 PM

Description

Friday, September 29

An instance reported slow performance around noon (see PHI96).

I connected to the shard (db009) and identified multiple simultaneous backup processes running in ps auxwww. I killed them and started a new clean one, under the theory that this might be mostly a load issue, perhaps related to the Ferret engine indexing backlog, and that reducing the load might restore things to normal operation. The new backup dumped quickly at first but slowed down dramatically as time went on, and ultimately did not complete in the next 16 hours.

I also observed that the instance had 70GB of MySQL data: 50GB in repository_commit_fngrams, 10GB in repository_commit_ffield, and 10GB of actual data in all other tables. The instance also had something in the realm of 600,000 SearchWorker tasks queued. The instance has approximately 800,000 indexable commits.
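
For reference, the per-table breakdown above can be pulled straight from MySQL. This is a hedged sketch, not the exact query used during the incident; the schema-name pattern in the LIKE clause is an assumption about how the instance databases are named, so adjust it for your install.

SELECT
  table_name,
  ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
  ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.tables
WHERE table_schema LIKE '%_repository'   -- assumption: repository tables live in a "*_repository" schema
ORDER BY (data_length + index_length) DESC
LIMIT 20;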

Saturday, September 30

I observed that the backup was still ongoing after approximately 16 hours and killed it. db009 had a magnetic volume at this point (T12999) and I suspected this might be a storage I/O issue. I snapshotted the magnetic volume (this took maybe two hours), then I allocated a new SSD volume and performed a volume swap on the host with bin/swap (this also took maybe two hours).

I also suspected (or, perhaps, hoped) this might be some flavor of table fragmentation issue, since the 50GB fngrams table seemed obviously wrong given that it appeared that fewer than 200,000 commits had been indexed. However, the fdocument table actually has all ~800,000 entries, so those jobs were more likely reindexes. The exact origin remains unclear, but a table of this size is actually about what the math predicts for that number of documents.

Locally, we have 13,000 tasks and a ~1GB fngrams table. This predicts an (800,000 / 13,000) * 1GB = 61GB fngrams table for 800,000 objects. So this is at least in the ballpark.

After swapping to an SSD, I began running OPTIMIZE TABLE under the "some sort of table fragmentation" theory. This moved forward, albeit very slowly (approximately 30x slower than optimizing similar tables on secure).

Somewhere around this time (mid-morning), @amckinley noticed that the SSD burst I/O budget had been completely consumed for the volume. I swapped the volume to io1 with a ~10x larger allocation of iops, which appeared to have some sort of effect, but not much of one.

We also observed (via iotop) a sustained 80MB/sec I/O write rate coming out of MySQL. This write rate was sustained even when the daemons were stopped (that is, no new indexing tasks were running), but very little of it appeared to be going to the OPTIMIZE TABLE (which was growing the tempfile on disk at less than 4MB/sec).

I'm not familiar with tools to inspect MySQL I/O activity, the actual load on the server didn't make sense in terms of generating I/O, and Googling didn't turn up much of use. I suspected some kind of memory/swap/paging issue with the size of the fngrams table, although I'm not sure this makes much sense (why would this cause writes?). Regardless, the fngrams table still seemed like the unusual element -- and the most likely culprit.
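
For posterity, MySQL can attribute its own I/O to some degree. A hedged sketch of the sort of queries that might have helped here, assuming performance_schema is enabled (the default on modern MySQL); nothing below is specific to Phabricator:

-- Cumulative bytes written by InnoDB; sample twice and diff to get a rate.
SHOW GLOBAL STATUS LIKE 'Innodb_data_written';

-- How much of the buffer pool is dirty (a proxy for flush pressure).
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';

-- Per-file write volume: is the I/O going to the fngrams tablespace,
-- the doublewrite buffer, or the redo log?
SELECT file_name,
       ROUND(sum_number_of_bytes_write / 1024 / 1024) AS written_mb
FROM performance_schema.file_summary_by_instance
ORDER BY sum_number_of_bytes_write DESC
LIMIT 10;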

I truncated the fngrams table, the ffield table, and the worker_activetask table, then optimized them, removing the task queue and reducing the data size to approximately 10GB. I deployed db025 and migrated the instance via bin/move. This ran into several issues:

  • The large size of the dump hit HTTP errors downloading from admin.phacility.com. This is probably a config issue, but see also T12907.
  • The dump was more than 2GB, so bin/file download would have failed in HTTPSFuture anyway, since it could not store the file data in a PHP string.
  • We've previously only used bin/move to move repositories (T12817). When used on databases, it fails to create credentials on the target host before connecting to the database. This is probably an easy fix.

I worked around these issues with scp, bin/host upgrade, and a little tweaking of the code. When the instance came up on db025 I ran a new backup, which completed promptly. Load and I/O had vanished from both db009 and db025 and the system appeared stable.

Sunday, October 1

Backups ran normally without intervention.

Monday, October 2

I reverted the io1 volumes to gp2 and wrote this up.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
rP Phabricator
D18710
D18707
D18696
D18690
D18687
D18682
D18679
D18673
D18672

Event Timeline

There's still a big element of mystery here: why did MySQL sustain 80MB/sec of write I/O without daemons running? There "should" have been no queries against the fngrams table, and a large table "should" not require huge I/O volumes if it isn't being used. I don't think it's totally implausible that this was some kind of general thrash/swap/memory management issue in MySQL (the innodb_buffer_pool_size on these hosts is 4GB, and MySQL was using all of it), but I would expect "paging-like" activities to cause a large read volume, not a large write volume.

Regardless, the fngrams table looks a lot like it's in the middle of things here.

The easiest way I can come up with to reduce the size of the table is to remove common ngrams from the index. For example, almost every (English-language) document matches the ngrams " th", "the", and "he " (the leading/trailing spaces are part of the ngram). These ngrams aren't actually useful: they have no real power to reduce the size of the result set, and if the user searches for "the" we're probably better off just doing LIKE on whatever result set we can use other indexes to reduce to, since almost everything is going to hit anyway.

On this install, these are the most common ngrams in the maniphest_task_fngrams table:

mysql> SELECT ngram, COUNT(*) N from maniphest_task_fngrams GROUP BY ngram ORDER BY N desc limit 100;
+-------+-------+
| ngram | N     |
+-------+-------+
|  th   | 10736 |
| the   | 10161 |
| ng    | 10087 |
| at    | 10055 |
| on    | 10052 |
| ed    | 10038 |
| is    | 10028 |
| to    |  9996 |
|  to   |  9984 |
| ing   |  9946 |
...

We have 13,000 tasks, so all of these ngrams are present in more than 50% of the documents and pretty much useless as an index.
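
As a rough sketch of how "present in more than X% of documents" can be measured (this assumes the table has the documentID column referenced later in this task; 0.15 is just an example threshold):

-- Count documents once, then list ngrams above the threshold.
SET @doc_count = (SELECT COUNT(DISTINCT documentID) FROM maniphest_task_fngrams);

SELECT ngram, COUNT(DISTINCT documentID) AS n
FROM maniphest_task_fngrams
GROUP BY ngram
HAVING n > 0.15 * @doc_count
ORDER BY n DESC;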

If we throw out ngrams which are present in more than X% of documents, we reduce the size of the ngrams table by this much:

| Discard ngrams present in | Table size reduction |
|---------------------------|----------------------|
| >50% of documents         | 14%                  |
| >33% of documents         | 28%                  |
| >25% of documents         | 41%                  |
| >16% of documents         | 53%                  |
| >12% of documents         | 60%                  |
| >5% of documents          | 75%                  |
| >2% of documents          | 83%                  |

This isn't tremendously reassuring at first blush: the long tail is pretty thick. Throwing out ngrams present in more than 50% of documents is almost certainly very safe, but I'm less sure about throwing out ngrams present in more than 5% of documents.

That said, it may not be bad: the tail on distinct ngrams is also very thick. The table has 88,378 distinct ngrams and throwing out all the ngrams present in more than 5% of documents discards only 2,164 of them, or about 2.5%.

In fact, discarding ngrams present in more than 1% of documents only discards 6% of ngrams, and saves 88% of the table size.

Generally, if we can get away with this or something similar, I think this is the best approach. If we can't trick MySQL into scaling at the ~1M object level we probably need to deploy an external index which has more cleverness than "do a lot of joins", and that ramps up complexity explosively.

Another way we can cheat is to split the fngrams table into a "latin" table and a nonlatin table. Basically:

  • If an ngram has byte length 3, it goes in the latin table.
  • If an ngram has byte length >3 (i.e., contains unicode) it goes in the nonlatin table.

The table currently uses utf8mb4, so we're presumably wasting 9 bytes per ngram on 99% of ngrams for English documents. There may be a similar effect in the index, although I'm not sure.

However, the local table has 4,809,955 rows and a total size of ~1GB (430MB data + 570MB indexes), which suggests each row requires ~200 bytes of storage, and that shaving 9 bytes off might not do much. This would be easy to test, though.
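
A hedged sketch of that test, assuming the ngram column is a CHAR(3) in utf8mb4 (check SHOW CREATE TABLE first); the "_binary" suffix is just a scratch-table name for the comparison:

CREATE TABLE maniphest_task_fngrams_binary LIKE maniphest_task_fngrams;
ALTER TABLE maniphest_task_fngrams_binary MODIFY ngram BINARY(3) NOT NULL;

-- LENGTH() is a byte length, so this skips multibyte (non-latin) ngrams,
-- which would not fit in BINARY(3).
INSERT INTO maniphest_task_fngrams_binary
  SELECT * FROM maniphest_task_fngrams WHERE LENGTH(ngram) <= 3;

OPTIMIZE TABLE maniphest_task_fngrams_binary;
SHOW TABLE STATUS LIKE 'maniphest_task_fngrams%';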

It's also possible that using, e.g., ROW_FORMAT=compressed might help (more likely with the ffield table than the fngrams table?).

We could also remove the documentID key, which is used only for indexing, and GC old indexes instead. I suspect this won't help us, but the impact of this is also easy to test.
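
Both of those experiments are one-line ALTERs. A hedged sketch, with table and key names that follow the naming pattern in this task but may not match the real schema exactly (ROW_FORMAT=COMPRESSED also needs innodb_file_per_table):

-- Compression, probably more interesting for the ffield table:
ALTER TABLE maniphest_task_ffield ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Dropping the indexing-only key on documentID; 'key_object' is a guess
-- at the name, so confirm it with SHOW INDEX FROM maniphest_task_fngrams.
ALTER TABLE maniphest_task_fngrams DROP KEY key_object;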

If none of this works we can ultimately use a dedicated external index instead, but we're in a far stronger position if Ferret can scale through reasonable workloads, and then scale into unreasonable ones by being given a beefier database with more RAM.

epriestley added a revision: Restricted Differential Revision. (Oct 2 2017, 8:01 PM)
epriestley added a commit: Restricted Diffusion Commit. (Oct 2 2017, 8:31 PM)

Here are some possible structural changes we can make to the table, using the Maniphest ngrams table on this install as an example dataset (this is roughly 50x smaller than the target dataset):

| Change                         | Data Size   | %    | Key Size    | %    | Notes                                                          |
|--------------------------------|-------------|------|-------------|------|----------------------------------------------------------------|
| No Change                      | 429,932,544 | 100% | 571,736,064 | 100% |                                                                |
| OPTIMIZE                       | 238,780,416 | 55%  | 379,191,296 | 66%  | Seems weird, maybe SHOW TABLE STATUS "approximating" / lying?  |
| Copy Table                     | 454,033,408 | 105% | 320,290,816 | 56%  | Probably SHOW TABLE STATUS being weird?                        |
| OPTIMIZE                       | 238,780,416 | 55%  | 379,191,296 | 66%  | Same as above, so that's consistent at least.                  |
| Remove "id" column             | 254,558,208 | 59%  | 424,476,672 | 74%  | See note below.                                                |
| OPTIMIZE                       | 254,558,208 | 59%  | 424,476,672 | 74%  | Well okay then.                                                |
| Drop documentID key + OPTIMIZE | 254,558,208 | 59%  | 256,901,120 | 45%  | Updates become more difficult.                                 |
| utf8mb4 to binary + OPTIMIZE   | 247,201,792 | 57%  | 243,269,632 | 43%  | Dropped 8,282 multibyte ngrams.                                |

ID Column: We don't need an explicit ID column on this table, but my understanding is that InnoDB creates an implicit one internally if you don't assign one yourself, so you might as well assign one so you can use it. This seems borne out by the table not dropping in size after removing the ID column.


None of these seem terribly promising. Only the initial OPTIMIZE had any real impact on the reported data size. Dropping the documentID key looks like it has a real effect, but it's maybe ~15% of the total table size and makes updates far more involved.

I'm going to try dropping common ngrams instead and see how well that works.

I looked at how the last 1,000 queries on this instance would be affected by different "common" thresholds. Note that these charts are all sort of garbage (X axis is nonlinear) because I couldn't figure out how basic spreadsheet software works.

I've lightly edited some of the actual query text to preserve anonymity.

100% Threshold (Current Behavior)

Screen Shot 2017-10-03 at 10.51.50 AM.png (372×532 px, 26 KB)

Example queries which can't use ngram constraints: ~v0, -robust.

50% Threshold

Screen Shot 2017-10-03 at 10.53.24 AM.png (359×518 px, 26 KB)

Example queries which can't use ngram constraints: same as above.

25% Threshold

Screen Shot 2017-10-03 at 10.54.16 AM.png (373×541 px, 27 KB)

Example queries which can't use ngram constraints: phabricator https, party, forking.

10% Threshold

Screen Shot 2017-10-03 at 10.55.37 AM.png (382×553 px, 26 KB)

Example queries which can't use ngram constraints: handles, projects, notes, sticky, storage, extension, epriestley, component, chinese, allocation, activating, billing, install application extension, specify reviewer, logs large, revision created, branches.

5% Threshold

Screen Shot 2017-10-03 at 10.57.51 AM.png (369×540 px, 23 KB)

Example queries which can't use ngram constraints: active directory, active credentials, scrolling boxes, setcommentaction, uncommitted changes, revision dependency, phacility private, opencats, force online, changed visibility, table borders, message query.

2% Threshold

Screen Shot 2017-10-03 at 10.59.38 AM.png (372×548 px, 25 KB)

Example queries which can't use ngram constraints: upgrade guidance, orchestration, overheated, mercurial merge, materialized members, employee directory, differential ignore whitespace, dangerous change protection.

1% Threshold

Screen Shot 2017-10-03 at 11.01.20 AM.png (365×555 px, 22 KB)

Example queries which can't use ngram constraints: cache status, getroutes, edge table, milestone, minifier.

The thing I'm mostly looking at is how many queries are pushed down to 0 ngrams, i.e. they can't use the ngram index.

The 25% threshold seems generally safe, and impacts only a small number of searches which aren't really meaningful searches anyway, like party.

The 10% threshold is a little dicier. It starts to impact "searches" which have some common word. These probably also aren't meaningful searches: even a human intelligence could not work out what the user meant by querying for notes or component or branches.

The 5% threshold isn't much different from the 10% threshold, and includes some more "probably not very good searches for common terms" sorts of queries.

The 2% threshold is a bit worse than the 5% threshold, but continues including more legitimate-ish queries.

The 1% threshold isn't much worse than the 2% threshold, and more or less continues the trend of catching some actual queries.

Even at the 1% threshold, only 12% of the queries miss the ngram index.

Overall, this seems promising -- not ideal, but not disastrous. We don't seem to have a super long tail of ngrams which occur in only a tiny number of documents, and we don't seem to be catching many "good" queries even with a very aggressive threshold.

That said, this install is something of a magnet for nonsense queries and may not be particularly representative of real-world use.

I'm inclined to start with a relatively conservative threshold of 0.15 -- which the data above predicts should affect almost nothing, but reduce the table size by 50% -- and go from there. We can easily decrease the threshold later, but cannot increase it without rebuilding indexes.

I'm also going to increase the db tier innodb_buffer_pool_size from 4GB to 6GB. The hosts have 7.5GB of RAM and 8GB of swap, and are pretty much sitting there with a little under 3GB of RAM mostly unused and no other load.
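
On MySQL 5.7.5 and newer the buffer pool can even be resized online; a hedged sketch (older versions need the change in my.cnf plus a mysqld restart, and the new size gets rounded to a multiple of the buffer pool chunk size):

SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;

-- Resizing happens in the background; this reports progress.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';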

In my general reading about this I think there are also a couple other things we may be able to tweak.

epriestley added a revision: Restricted Differential Revision. (Oct 3 2017, 7:41 PM)

I've pushed the MySQL configuration tweaks here, and manually built the common ngrams table with a 0.15 threshold.

Another thing we can do here which will generally reduce the operational load is to be more selective about how we dump/export tables from MySQL. Broadly, we have three general classes of tables today:

  • Normal data tables, which store the authoritative copy of data, like maniphest_task.
  • Readthrough cache tables, which will be automatically/transparently rebuilt if the data is lost, like cache_general.
  • Index tables, which won't be automatically/transparently rebuilt, but can be rebuilt with bin/search index (or maybe some other similar command, although no other examples spring to mind).

For normal data tables, we obviously need to dump the data.

For cache tables, we never need to dump the data. Skipping the data in these tables, some of which are large, and just dumping the schema will make dumps faster and backups smaller.

For index tables, we should probably keep dumping the data for normal users, but skipping this data in the production cluster makes a significant amount of sense: we get smaller/faster backups, and the overhead of doing an index rebuild in the event of a restore isn't large.

I think the patch for this is also fairly simple, so I'm going to see if I can get it working.

epriestley added a revision: Restricted Differential Revision. (Oct 4 2017, 11:49 PM)
epriestley added a revision: Restricted Differential Revision. (Oct 4 2017, 11:54 PM)
epriestley added a revision: Restricted Differential Revision.
epriestley added a commit: Restricted Diffusion Commit. (Oct 4 2017, 11:56 PM)
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit. (Oct 5 2017, 12:27 AM)

For those of us who don't mind having to wait through a large reindex after restoring from a backup, is there a way to ask the backup process to skip dumping the index tables? (Based on the above I presume they'll still be included)

(Some background to this: a full database dump of our instance at KDE is now 165GB uncompressed - probably due to us having 3.54 million commits in our repositories)

Yes, use bin/storage dump --no-indexes. See here for more details:

https://secure.phabricator.com/book/phabricator/article/configuring_backups/#skipping-indexes

The "common ngrams" changes above (which will also reduce the size of the table, probably by 50% or more) are likely to become automatic soon too, but I want to make sure they work in production first.

epriestley added a revision: Restricted Differential Revision. (Oct 6 2017, 2:09 PM)

The pruned (at 0.15 threshold), optimized ngram index for the original affected instance is only ~13GB, which is entirely manageable with the changes to backups, and we haven't run into other issues.

I'm not planning to automatically prune ngram indexes for now, but will keep an eye on this if/when we see other installs run into issues. The --no-indexes flag, which is the bulk of the practical operational benefit here, is documented.

If you're ambitious and want to prune your own indexes, here's an outline:

  • bin/search ngrams --threshold X, where X is between 0 and 1. Small values prune more aggressively. See above for tables and analysis, but selecting X is corpus-dependent black magic. 0.15 is fairly safe and may reduce your index size by about half. 0.02 is less safe (but still looks fairly safe) and will reduce your index size by 5/6ths.
  • Searching and indexing will be affected immediately.
  • The data will be pruned in the background by the garbage collector. MySQL is bad about this, so progress may be glacially slow (we observed ~4 days to prune a ~25GB index). You could TRUNCATE and reindex instead (see the sketch after this list). When either one is done, OPTIMIZE.
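
For the TRUNCATE-and-reindex route, a hedged sketch for a single application's index; table names follow the Maniphest examples above, other applications have their own *_fngrams tables, and the rebuild itself happens from the CLI rather than SQL:

-- Drop the existing ngram rows for this application:
TRUNCATE TABLE maniphest_task_fngrams;

-- ...then rebuild indexes with something like `bin/search index --all --force`
-- (flags from memory; check `bin/search help index`), and finally reclaim space:
OPTIMIZE TABLE maniphest_task_fngrams;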

To undo it:

  • bin/search ngrams --reset. This is likely to instantly break all searches until you rebuild indexes.
  • Rebuild all indexes.