
Transcode the HTML part of incoming email into UTF-8 as well
ClosedPublic

Authored by alexmv on Nov 16 2017, 9:00 AM.
Tags
None

Details

Summary

D1093 did this for just the text/plain part of incoming
email. Most text/html parts either use entity encoding or are
already UTF-8, so they need no transcoding. However, this is not
always the case, and an HTML part in another charset leads to
dropped messages, by way of:

EXCEPTION: (Exception) Failed to JSON encode value (#5: Malformed UTF-8 characters, possibly incorrectly encoded): Dictionary value at key "html" is not valid UTF8, and cannot be JSON encoded: [snip HTML part of message content]

Generalize the charset transcoding so that it applies to both the text/plain
and text/html parts, not just the text/plain part.
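
As a rough illustration of the idea (not the actual Phabricator change, which lives in PHP), the sketch below applies one transcoding step to every body part instead of only the plain-text one. The function names, the `parts` layout, and the fall-back-to-replacement behavior are all assumptions made for this example.

```python
import json
from typing import Dict, Optional, Tuple


def transcode_to_utf8(raw: bytes, charset: Optional[str]) -> str:
    """Decode raw bytes with the declared charset (hypothetical helper).

    Falls back to UTF-8 with replacement characters when the charset is
    unknown or the bytes do not match it. This fallback is an assumption
    of the sketch, not something the revision describes.
    """
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        return raw.decode("utf-8", errors="replace")


def normalize_bodies(parts: Dict[str, Tuple[bytes, Optional[str]]]) -> Dict[str, str]:
    """Transcode every body part, text and html alike, the same way."""
    return {
        name: transcode_to_utf8(raw, charset)
        for name, (raw, charset) in parts.items()
    }


# Both parts are Windows-1252 and contain a 0x92 byte.
parts = {
    "text": (b"it\x92s", "windows-1252"),
    "html": (b"<p>it\x92s</p>", "windows-1252"),
}
bodies = normalize_bodies(parts)
encoded = json.dumps(bodies)  # succeeds because both values are valid text
```

Treating the parts uniformly means a non-UTF-8 HTML body can no longer slip past the transcoding step and fail later during JSON encoding.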

Test Plan

Fed in a Windows-1252-encoded text/html part with 0x92
bytes in it; verified that $content only contained valid UTF-8 after
this change.
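
The check described in this test plan can be approximated outside Phabricator. The following is a minimal sketch in Python, assuming a made-up HTML fragment; `raw_html` and `is_valid_utf8` are hypothetical names for this illustration only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0x92 is a right single quotation mark in Windows-1252, but a lone
# 0x92 byte is not valid UTF-8.
raw_html = b"<p>It\x92s fine</p>"

valid_before = is_valid_utf8(raw_html)

# Transcode: interpret the bytes as Windows-1252, then re-encode as UTF-8.
transcoded = raw_html.decode("cp1252").encode("utf-8")

valid_after = is_valid_utf8(transcoded)
```

Here `valid_before` is false and `valid_after` is true, because the 0x92 byte becomes the three-byte UTF-8 sequence for U+2019.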

Diff Detail

Repository
rP Phabricator
Branch
transcode-html-part (branched from master)
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 18836
Build 25390: Run Core Tests
Build 25389: arc lint + arc unit

Event Timeline

In theory we should maybe store the raw input and transcode it when using it, but that's a pain to change and probably not ever relevant (except maybe for debugging transcoding issues). I think this is completely reasonable. Thanks!

This revision is now accepted and ready to land.
Nov 16 2017, 5:50 PM