Page MenuHomePhabricator

Storage on `secure` was filled by binlogs from looping (?) rebuild-identities script
Closed, ResolvedPublic

Description

The /core/data volume on secure001 filled, which paused MySQL.

It was filled by an unusual number of binlogs. We have binlogs set to GC, but MySQL generated approximately 100GB of them in a day or so. The GC config seems correct, the fill rate was just unusually high.

I remedied the issue by:

  • Connecting to the replica on secure002.
  • Making sure replication was caught up (with SHOW SLAVE STATUS) and verifying that the current binlog on the replica was a recent binlog.
  • Removing most of the older binlogs with rm.
  • Restarting mysql with sudo service mysql restart.
  • (The replica was being a little slow to resume (via SHOW SLAVE STATUS) so I also did a STOP SLAVE; START SLAVE; on the replica.)

Using mysqlbinlog <log>, it looks like these logs are mostly updates from the rebuild-identities script. My guess is that it's running in a loop and that I have a bug in PhabricatorQueryIterator.

Event Timeline

epriestley triaged this task as Normal priority.Jun 7 2018, 12:57 PM
epriestley created this task.

My guess is that it's running in a loop and that I have a bug in PhabricatorQueryIterator.

I'm not immediately able to reproduce this on secure. I ran bin/repository rebuild-identities --all --trace and it finished in about a minute.

I did catch a rebuild-identities script eating 66% of CPU this morning, so I'm pretty sure there's a problem there somewhere, but it died at some point (likely when I restarted MySQL?) so there isn't much left in terms of forensic evidence.

A sub-issue here is that rebuild-identities (and, likewise, CommitMessageParserWorker) uses getAuthorName() from the CommitData.

This may not be the original string: in particular, it has been converted to utf8 and truncated to 255 characters.

For some data on secure, it is incomplete (probably because of different behavior in older versions of Phabricator), e.g. jack instead of jack <jack@jacksdomain.com>.

It may also be mangled somewhat by DiffusionCommitRef, which splits the author apart rather than preserving it completely faithfully.

Under the hood, DiffusionCommitRef is (possibly) transported in JSON which can not transport non-utf8 values. Fixing this is probably out of scope until wire encodings in T5955, though.

I'm going to try truncate + rebuild and see if this is a first-run issue, maybe?

That went through totally clean.

Upshot:

  • I'm reasonably suspicious (?) that PhabricatorQueryIterator is maybe (?) broken somehow and can get stuck in a loop (?).
  • I have absolutely no clue how to reproduce it.

@amckinley, any guesses what you might have done differently to get rebuild-identities into a stuck state? Can you try again and see if it sticks again?

I assume you didn't, like, run while true; do rebuild-identities; done, and ~/.bash_history isn't especially illuminating.

epriestley claimed this task.

Pretty confident D19479 was the root cause here and we got unlucky (or lucky, to find the bug) by running with 37,000 commits exactly.