Make hacky "reply-text" stripper deal with reply text broken across multiple lines
Closed, Resolved (Public)

Authored By

	epriestley
	May 27 2011, 4:57 PM

Description

See T147, where sandra used email reply but the world-class, state-of-the-art "algorithm" here failed to correctly detect it because of a linebreak:

https://secure.phabricator.com/diffusion/P/browse/origin:master/src/applications/metamta/storage/receivedmail/PhabricatorMetaMTAReceivedMail.php;ce8a406424d42352$99

Revisions and Commits

Restricted Differential Revision

Related Objects

Mentioned In: Q240: What do some content sources mean? (Answer 276)

Event Timeline

epriestley claimed this task. May 27 2011, 4:57 PM

epriestley triaged this task as Normal priority.

epriestley added subscribers: epriestley, sandra.

I think there are a few small things we could do here:
1- Erase the entire "on reply to ..." block. It isn't useful, and if the previous conversation was huge, the response ends up being the reply message plus the huge previous message too.
2- Add something at the end of the reply like "in reply to comment No. X" or "in reply to 'This was the previous mess...'", showing only the first line.
3- Add a "mail icon" to the reply box, maybe after the date, so everyone knows that the comment was sent by e-mail. PS: this part is the best, haha.

cadamo added a subscriber: cadamo. May 27 2011, 6:11 PM

Yeah, it tries to erase all the text right now (like your (1)), there's just a bug with some emails if the line that goes like this:

On Some Date, Some User <someemail@domain.com> wrote:

...gets split onto two lines:

On Some Date, Some User <
someemail@domain.com> wrote:

...it can't figure it out. I'll just make the parser smarter about that.
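
A minimal sketch of what "smarter" could look like, in Python rather than the PHP the real parser is written in. The pattern and the name `strip_quoted_reply` are hypothetical, not the code in PhabricatorMetaMTAReceivedMail.php; the idea is only to let the "On ... wrote:" line wrap once before giving up:

```python
import re

# Hypothetical pattern: a line starting with "On " that ends in "wrote:",
# where the text in between may contain at most one line break.
QUOTE_HEADER = re.compile(
    r"^On [^\n]{0,200}(?:\n[^\n]{0,200})?wrote:[ \t]*$",
    re.MULTILINE,
)


def strip_quoted_reply(body):
    """Return the part of the email body that precedes the quote header."""
    match = QUOTE_HEADER.search(body)
    if match is None:
        # No recognizable header: keep the whole body.
        return body.strip()
    return body[:match.start()].strip()
```

This still produces a false positive if the user's own text has a line starting with "On " and the same or next line ends in "wrote:", so it is a heuristic, not a guarantee.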

I also think (3) is a good idea, and if we ever roll out a mobile site we can do little phone icons too.

I will try to work on some of those, but the sandbox on my Mac doesn't work correctly. I can't get the pcntl extension to run, and I'm having some trouble setting up the repositories and the outbound and inbound emails.

epriestley changed file(s), attached 0: ; detached 0: . Jun 1 2011, 3:35 PM

Oh... I think all of us were using "Mail.app" or Gmail to reply to e-mails, but Outlook changes everything, and this check will also fail if you use some other tool in another language.
E.g., a reply from Gmail in Spanish will look something like this:

2011/6/1 epriestley (Evan Priestley) <noreply@phabricator.com>
epriestley attached Differential Revision: Fix reply email parsing for linebreaks in "On <date>,  <user> wrote:" quote identifier.

and from Outlook:

From: cadamo (Cristian Adamo) [noreply@devdebian.com.ar]
Sent: Wednesday, June 01, 2011 10:13
To: Cristian Adamo
Subject: [Maniphest] [Commented On] T1: Make Inbound and Outbound e-mails works

cadamo added a comment.
test #2 mailbox

As you can see, the check will fail on these, too.
Trying to match every possible response from every kind of e-mail tool will be a bit harder: there are a bunch of tools and even languages.

+a bunch of people who may have opinions or who use this feature.

Yeah, I think there's no great solution to this problem in the general case. I see three possible approaches:

1) Regexps: Keep using regexps, building up a larger and larger library of regexps to match a larger body of popular clients/languages.
2) Explicit markers: Add as the first line of every email something like "-~-~=~- Reply Above This Line -~=~-~-", and strip that, or force users to add something like "!end" to their emails.
3) Context-Aware: Store the previous email we sent (we already do this) and compare the text of the new email to the text of the old email, using the text in the old email to determine what is quoted.

All of these solutions have a bunch of problems. I'm inclined to see how far we can get with (1) because it's the simplest, and it might be that Mail.app + Outlook + Gmail with English + Spanish + X + Y is good enough and gets us 99%+ of the way there. This approach also probably works reasonably well for companies, which will probably often all speak one primary language, even though it may not work as well for open source projects which can have worldwide contributors.
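
As a rough illustration of (1), the "library of regexps" could be a list that gets one entry per client/language as reports come in. This is a Python sketch, not the actual implementation; the three patterns are assumptions derived only from the example headers quoted earlier in this thread, and `QUOTE_MARKERS` and `strip_quoted_reply` are made-up names:

```python
import re

# One hypothetical pattern per client/language, based on the examples above.
QUOTE_MARKERS = [
    # Mail.app / Gmail in English: "On <date>, <user> wrote:", wrapped at most once.
    re.compile(r"^On [^\n]{0,200}(?:\n[^\n]{0,200})?wrote:[ \t]*$", re.MULTILINE),
    # Gmail in Spanish, as in the example above: "2011/6/1 Name <address>".
    re.compile(r"^\d{4}/\d{1,2}/\d{1,2} [^\n]*<[^<>\n]+>[ \t]*$", re.MULTILINE),
    # Outlook: a "From:" line immediately followed by a "Sent:" line.
    re.compile(r"^From: [^\n]*\nSent: [^\n]*$", re.MULTILINE),
]


def strip_quoted_reply(body):
    """Cut the body at the earliest quote marker any pattern recognizes."""
    cut = len(body)
    for pattern in QUOTE_MARKERS:
        match = pattern.search(body)
        if match:
            cut = min(cut, match.start())
    return body[:cut].rstrip()
```

Supporting a new client then means appending one more pattern, which is why this approach stays simple as long as the set of clients is small.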

Of the alternatives, (2) makes every email ugly and it still won't cover every case because we still need to strip the "On X, user Y wrote:" line or the block Outlook adds, so we'd have to combine it with (1) anyway I think (although we'd get better results). The "!end" form seems like a wholly unreasonable burden to place on users, but Wordpress requires it in some cases (see <http://en.blog.wordpress.com/2009/05/15/comment-reply-via-email-improvements/>) -- apparently some clients don't even quote replies.
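
To make the limitation of (2) concrete, here is a small Python sketch. The marker string and "!end" are the ones suggested above; the function name `strip_with_markers` is hypothetical. It cuts at whichever comes first, but the client's own attribution line still ends up above the marker:

```python
MARKER = "-~-~=~- Reply Above This Line -~=~-~-"
END_COMMAND = "!end"


def strip_with_markers(body):
    """Keep only the lines before the reply marker or an explicit "!end"."""
    kept = []
    for line in body.splitlines():
        # "!end" must be on a line of its own.
        if line.strip() == END_COMMAND:
            break
        # Clients usually prefix quoted lines with ">", so look for the
        # marker anywhere in the line rather than at the start.
        if MARKER in line:
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

For a reply whose quoted block starts with "On Some Date, Some User <someemail@domain.com> wrote:", that line survives because the client inserts it above the marker, so a regexp pass would still be needed afterwards.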

(3) seems more complex but might actually work really well. There's a thread I found here <http://www.redmine.org/issues/2852> where someone claims it's a good solution to the problem, and claims this is what GMail does. One elegant aspect of this approach is that it handles bottom-reply and inline-reply. I think we'd need to put something like "Herald Message ID: 2939" in the body but it could go at the end and not be as ugly as a "-~-~=~-" sort of header. Actually, it could also go in the "Reply To" address. This approach is made more complicated because of linewrapping in clients, so I think we'd need to force-wrap all emails at <78 characters or be somewhat clever in matching quoted text (e.g., collapse all whitespace runs in the email to a single space and then do strpos(), but I'm not sure how many false positives this might generate).
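
The "collapse whitespace, then substring search" idea could be sketched like this in Python, with `in` standing in for strpos(). The names `normalize` and `strip_known_quotes` are hypothetical, and the sketch assumes the exact text of the email we sent is available as `original`:

```python
import re


def normalize(text):
    """Collapse every whitespace run to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def strip_known_quotes(reply, original):
    """Drop reply lines whose text also appears in the email we sent."""
    haystack = normalize(original)
    kept = []
    for line in reply.splitlines():
        # Remove leading ">" quote markers before comparing.
        candidate = normalize(re.sub(r"^[>\s]+", "", line))
        if line.lstrip().startswith(">") and not candidate:
            # A bare ">" line carries no user text.
            continue
        if candidate and candidate in haystack:
            # Short user lines such as "ok" can match here by accident:
            # this is the false-positive risk mentioned above.
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Because each line is checked on its own, this handles rewrapped quotes and inline replies, but the client's attribution line is not part of the original email and would still need separate handling.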

Apparently Google has a patent on a more complicated version of (3) and Paul Buchheit is one of the inventors: <http://www.google.com/patents/about?id=eA6AAAAAEBAJ>; we have a simpler problem to solve than the one the patent covers, though, since we always know the exact text of the original email and don't need to deal with threads.

Personally, I want to avoid spending a huge amount of time and effort on this: I think there are higher-value things to focus on (at least personally, I very rarely use this feature), this problem is really complicated, and it's not the end of the world if some quoted text shows up in an email reply. But maybe it's worth seeing how hard (3) is to implement and how well it performs.

I think the two cases where this feature is most useful are:

when the install is behind a VPN, as at Facebook. But in these cases the email clients and languages should be relatively homogenous, which suggests (1) may be a reasonable solution.
on mobile devices; in this case we could perhaps better address the need by building a mobile site. Not sure if we want to get into this or not.

There's also a similar problem with detecting auto-responders / vacation emails which we aren't currently solving. I think that one's a little easier but figure we can wait until it becomes an issue.
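
For what it's worth, a first pass at the auto-responder case could look at headers instead of the body. This is a Python sketch under assumptions: the `Auto-Submitted` header is defined by RFC 3834, while the `Precedence` values checked here are a non-standard convention that only some senders follow, and `looks_auto_generated` is a made-up name:

```python
from email import message_from_string


def looks_auto_generated(raw_message):
    """Guess whether a raw email was produced by an auto-responder."""
    msg = message_from_string(raw_message)

    # RFC 3834: any Auto-Submitted value other than "no" marks automatic mail.
    auto_submitted = (msg.get("Auto-Submitted") or "").split(";")[0].strip().lower()
    if auto_submitted and auto_submitted != "no":
        return True

    # Non-standard, assumed convention: some responders set Precedence instead.
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence in {"bulk", "junk", "auto_reply"}
```

Responders that set neither header would slip through, so this would only be a starting point.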

Everyone else on this thread, do you have thoughts here?

#2 is partially implemented, we already include To: and CC: lines on new diff emails. If you moved these lines up, they would effectively be a marker.

Personally I don't really like anything about ---reply above this line---. Feels clunky.

I think regexps are perfectly adequate for this problem actually. I don't think it will be that crazy to support a bunch of variations. For example, GMail and Outlook are extremely similar.

We spent months on this problem at my last company (an email-based personal assistant).

If I remember correctly, (3) is very very hard to do when HTML email is involved and/or when Outlook uses the quoting style without >'s. Basically, when someone replies to an email in Outlook and uses HTML email, Outlook takes your nicely formatted message, and then fucks it up in crazy ways in its Word-based rendering/composing engine.

Even the text/plain alternative version is not the same as the original... The annoying thing is that some people have Outlook set to default to HTML, so even though we send text emails we can still get back munged crap.

It might be possible to normalize the reply (e.g. stripping all whitespace) but figuring out interleaved replies/quotes will probably be really hard...

I guess we can just wait until Facebook Messages kill email.

Hahaha. Well, it was just a comment/concern. I also think that the solution implemented now is the right one for the current issues, and we can keep adding more "rules" (regexps) as issues with email clients appear.
In the future it may or may not be a problem, but right now it isn't.
We'll see if this becomes an issue or not.

The regexp solution seems to be working okay (see D1239, e.g.), at least for now. We now also annotate these replies as "(via Email)" (mentioned earlier). I'm going to close this task since there's nothing really actionable here anymore, although the discussion was pretty useful.

talshiri added a subscriber: talshiri. Jun 20 2014, 8:59 AM

adrelanos added a subscriber: adrelanos. Aug 7 2015, 5:24 PM

chad changed the visibility from "All Users" to "Public (No Login Required)". Aug 7 2015, 5:28 PM

tycho.tatitscheff mentioned this in Q240: What do some content sources mean? (Answer 276). Dec 8 2015, 12:23 PM
