Page MenuHomePhabricator

Store diffs as binary, not UTF-8
Closed, ResolvedPublic


Currently, we store diffs as UTF-8. This is a lovely idea, but diffs are not UTF-8, and they also aren't UTF-8 with only BMP characters, which is what we actually are able to store. This causes various problems which we'd be better off dealing with at a higher level than we do. Particularly, this means arc patch can never work for some set of diffs, which is probably not a reasonable behavior.

We should store diffs as binary and mangle encodings at display time (hopefully with some level of caching so this doesn't destroy performance).

The major issue with this is that it implies an enormous migration for all existing installs.

Related Objects

Event Timeline

NorthIsUp raised the priority of this task from to Needs Triage.
NorthIsUp updated the task description. (Show Details)
NorthIsUp added a project: Differential.
NorthIsUp added a subscriber: NorthIsUp.
NorthIsUp renamed this task from The poop emoji breaks Differential to The poop emoji breaks Differential 💩.Nov 1 2013, 9:15 PM
NorthIsUp updated the task description. (Show Details)

@epriestley pointed me to T1191: Comments are truncated at first non-base-plane character. This is slightly different as this is the diff contents and not just a comment.

See also T3804, T3788, T3801, T3974, and others. I'm going to merge some of those here and make this the master.

epriestley added a subscriber: Unknown Object (User).Nov 4 2013, 4:26 PM

◀ Merged tasks: T3974, T3804.

epriestley renamed this task from The poop emoji breaks Differential to Store diffs as binary, not UTF-8.Nov 4 2013, 4:30 PM
epriestley triaged this task as Normal priority.
epriestley updated the task description. (Show Details)

It may make sense to pursue T3788 first, which is likely to cover "I have this non-utf8 diff and need to do something reasonable with it". Once that's in, we can start handing Differential non-utf8 diffs.

A messy side effect of this is that we can't use JSON to transport content or filenames in all cases. Conduit probably needs builtin support for encoding/decoding arbitrary fields.

See T1191 for the root issue with valid (but not BMP-only) unicode diffs. This will resolve both BMP truncation and non-utf8 diffs.

Is there a timeline for this? We have an active project that has files that can't be parsed regularly.

We don't have a specific timeline on this. There are several steps that need to happen to provide complete support:

  • We need to change the actual column to binary. This is technically simple, but will be a fairly time consuming migration for larger installs.
  • We need to annotate the character set of the stored data, and adjust the write pathways to perform detection/annotation when saving data. We can fake this / build it partially at first to some degree.
  • Conduit uses JSON to transmit data (including diffs), which is not binary safe. We need to introduce a parameter encoding mode which can transmit binary data. This is messy.
  • We need to update arc to transmit data using this encoding mode.
  • When pulling diffs out of the VCS, we may need to adjust encoding flags or environmental variables.
  • When displaying diffs in an unknown or non-UTF8 encoding, we need to convert to UTF8 and reasonably inform the user about errors. This is probably quite messy.

T2999 and T3801 are related.

I would like to fix this relatively soon, but it's complicated and interacts with many other systems, and only affects installs that use non-UTF8 character encodings in source code. This isn't especially unusual, but isn't hugely common either.

epriestley edited this Maniphest Task.

This and T1191 are pretty much in positions #1 and #2 on my personal list of dumb, bad technical junk which is embarrassingly broken that we need to fix, but that list is like list #3 or #4 on my list of lists of stuff I need to do.

(Well, T4324 might actually be #1.)

epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.

Is there a way to put phabricator returned to normal?

The phabrcator server have our using data for one year

You can comment out this check in libphutil/src/aphront/storage/connection/mysql/AphrontMySQLDatabaseConnectionBase.php, by making validateUTF8String() just return; until this task and T1191 complete.

Any progress for this? It is currently holding up the import of some of our svn and git repositories (stuck at 99% done due to consistent commit message parse failures due to emoji).

Support Impact Broken with no workaround on a nontrivial subset of diffs.

eadler added a project: Restricted Project.Aug 5 2016, 5:24 PM
epriestley claimed this task.

This is now better tracked in followups:

  • T8475 / T8623 cover making the storage binary-safe (it has been binary safe for at least 18 months).
  • T11660, T6633 (and others?) deal with non-UTF8 filenames.
  • T5955 discusses Conduit wire protocols.