Currently, we store diffs as UTF-8. This is a lovely idea, but diffs are not necessarily UTF-8, and they are certainly not UTF-8 restricted to BMP characters, which is all we are actually able to store. This causes various problems that we'd be better off dealing with at a higher level than we do now. In particular, it means arc patch can never work for some set of diffs, which is probably not reasonable behavior.
We should store diffs as binary and mangle encodings at display time (hopefully with some level of caching so this doesn't destroy performance).
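As a rough illustration of the idea, here is a minimal Python sketch (not the actual implementation, and not in the project's own language). It assumes the raw diff bytes are kept exactly as received, and that conversion to displayable text happens only on read, behind a small in-memory cache. The names `store_diff` and `render_diff`, the `encoding_hint` parameter, and the cache size are all hypothetical.

```python
from functools import lru_cache


def store_diff(raw: bytes) -> bytes:
    """Hypothetical storage step: keep the bytes untouched.

    No decoding or re-encoding happens here, so a tool that later needs
    the original patch gets back exactly what was uploaded.
    """
    return raw


@lru_cache(maxsize=1024)  # assumed cache size, purely illustrative
def render_diff(raw: bytes, encoding_hint: str = "utf-8") -> str:
    """Hypothetical display step: convert to text only when rendering.

    Tries the hinted encoding first. If the bytes are not valid in that
    encoding (or the encoding name is unknown), falls back to UTF-8 with
    replacement characters so the page can still be drawn.
    """
    try:
        return raw.decode(encoding_hint)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    latin1_diff = "+caf\u00e9\n".encode("latin-1")  # not valid UTF-8
    stored = store_diff(latin1_diff)
    print(stored == latin1_diff)            # bytes survive storage unchanged
    print(render_diff(stored))              # lossy, display-only rendering
    print(render_diff(stored, "latin-1"))   # correct when the encoding is known
```

The point of the split is that any lossy conversion affects only what is displayed, while the stored bytes stay usable for applying the patch. A real cache would presumably need to be persistent rather than per-process.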
The major issue with this is that it implies an enormous migration for all existing installs.