Improve prose diffs (was: description changes don't generate usable diffs)
Closed, Resolved, Public

Authored By

	quiddity
	Mar 22 2015, 5:41 PM

Description

Remaining concerns:

Show Old / Show New: In the web UI, have a way to get the raw old and raw new text, in case you want to revert things or copy some of it or whatever else.
Mail Section Titles: In mail, all sections have an "EDIT DETAILS" header, but they should have per-field headers instead ("CHANGES TO SUMMARY", "CHANGES TO TEST PLAN" or similar).
Paste should Code-Diff: Paste got converted here somewhat accidentally, but is probably better as a code diff. The old behavior wasn't really particularly great (it was "half-a-code-diff", more or less) but it should probably be swapped to become more of a pure code diff. See some additional context in T11743. Prose diffing ".txt" might be OK.
Config should Code-Diff?: I think we have at least one "JSON BLOB" thing in Config (and maybe in Almanac?). These would be better as code diffs.
Plain Text Mail?: ~~Maybe try to do something with plain text mail, if we can figure out something reasonable.~~ (No apparent interest in this or suggestions for "something reasonable", and I don't have any real ideas myself.)

Original Description

As a user, when I get an email that remarks:

Alice edited the task description

I usually want to see what the change was. It's currently a complete mystery that requires me to:

open the page in a browser,
scroll down to the line where the edit was made,
click on "(Show details)".

There are 3 possible ways to solve this. In the email notification:

Provide a direct link to the diff, e.g. https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-ziuazd55shmmdl7/ (which should also link back to the task it is associated with)
or, Provide a link to the specific timestamp (to remove the need to scroll to it), e.g. https://phabricator.wikimedia.org/T75851#1139254
or, Provide the diff itself in the mail. -- [this option is possibly too complicated (HTML vs plaintext, etc) but possibly a worthwhile long-term goal?]

Thank you! w(O>O)w

Revisions and Commits

rPHU libphutil
		D16881	rPHU086df1ba443c Improve prose diffs for changes spanning very large blocks of intermediate text
		D16853	rPHU162c55d991df Exempt some more puncutation characters from diff smoothing in prose diffs
		D16839	rPHUf36c31c991ca Make prose diff algorithm more iterative, to improve prose diffs for (among…
		D16820	rPHU5d8e090fe1e2 Slightly improve some prose diffs
		D16097	rPHU7233ff63f821 Add a "summary" prose diff mode which omits unchanged text
		D16073	rPHUb11a344ba571 Allow prose and code diffs to use different smoothing
		D16071	rPHU6d1eea50fb9a Improve prose diff smoothing rules for whitespace and prefix/suffix changes
		D16068	rPHUa64328ab3d62 Apply edit smoothing to prose diffs
		D16067	rPHUc637bdba3985 Give prose diffs some basic test coverage
rARC Arcanist
		D16074	rARCc75b671b221a Use "internal" smoothing for code diffs in Arcanist
		D16069	rARC41e8e30e8c13 Use EditDistanceMatrix diff smoothing in Arcanist
rP Phabricator
		D16817	rP3f5109b66864 In prose diff dialogs (like "Show Details" in transactions), show "old", "new"…
		D16818	rP729492a8ff85 Allow transactions to specialize their mail headers for diff sections
	Audited	D16098	rPfb156af480a1 Render prose diffs in email as summaries
		D16063	rPfb2da8bd8b43 Add links and diffs for text block edits to mail

Related Objects

Status	Assigned	Task
Resolved	epriestley	T7643 Improve prose diffs (was: description changes don't generate usable diffs)
Resolved	epriestley	T3353 Make Phriction Document changes reviewable
Open	None	T11881 Make Paste content changes render as code diffs again, not prose diffs

Event Timeline

There are a very large number of changes, so older changes are hidden.

Yes. I think this has two general components:

Embed diffs in email.
Linking directly to changes from email.

On embedding diffs, I want to improve them before we embed them, because the current diffs are often quite bad. T3353 discusses improving them, which mostly boils down to wanting diffs of human-readable prose text to look more like this:

Currently, our algorithm produces a reasonable result in only a fairly small number of cases, and results are particularly bad when a large paragraph has a small number of changes.

For example, here I've added only one word ("hyperspace") but the algorithm has butchered the diff:

Screen Shot 2016-05-18 at 6.26.53 AM.png (428×1 px, 126 KB)

I want to fix that before putting it in email. It doesn't have to be perfect, but it shouldn't be horribly broken in like 60% of common cases.

Fixing this will fix the web UI diffs, and also allow us to put reasonable diffs in mail.
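As a rough illustration of why a single added word should produce a single clean insertion, here is a minimal sketch of a word-level diff. It uses Python's standard `difflib` rather than libphutil's actual algorithm, and the `tokenize` and `word_diff` helpers and the sample sentences are hypothetical, made up for this example.

```python
import difflib
import re


def tokenize(text):
    # Split into words, whitespace runs, and single punctuation marks,
    # so that joining the tokens reproduces the text exactly.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def word_diff(old, new):
    # Diff token lists instead of characters (illustrative only, not the
    # libphutil prose diff engine).
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


old = "The ship jumped into the void."
new = "The ship jumped into hyperspace the void."
changes = [op for op in word_diff(old, new) if op[0] != "equal"]
print(changes)  # [('insert', '', 'hyperspace ')]
```

Diffing at token granularity means the unchanged words on either side of the insertion cannot be split apart, which is the kind of result a human reader expects for this edit.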

Separately, we can improve linking transactions to comments. This should wait for changes in T10694 to complete (particularly, design on D15884) but I expect them to happen shortly.

Of the two link suggestions, linking directly to the transaction is slightly tricky for various reasons. Probably slightly easier is linking to the diff dialog, and improving that page so it renders better in standalone mode (particularly, links back to the object). Basically, this would put these "(Show Detail)" links into mail:

When clicked, you'd get to a similar page to where you end up today, but it would have more context about the object:

For example, the crumbs would show the parent object and the button would have appropriate text like "Return to object" or similar (or maybe we'd make this page fully render as a complete page instead of as a dialog).

We can do these separately. I'd estimate:

Rough cut of prose diffs + prose diffs in email: 4 hours.
Detail links on edits and better destination pages: 1 hour.

NOTE: I'd expect both of these mail changes to only affect HTML mail, not plain text mail. I don't have any ideas for a reasonable way to render prose diffs in plain text email, and I think the "Details" links would add too much clutter in plain text email since they have to render as https://domain.com/full/url/ instead of (View Details). After D15885, HTML mail will be the default.

NOTE: This prose diff estimate is upgrading the current "works maybe 30% of the time" terrible garbage algorithm to a new "works 90%-95% of the time" reasonably usable algorithm, not a "works 99% of the time" really nice algorithm. You should expect that there may still be significant limitations to prose diffs where output is not as good as a human could do, e.g. edits to tables and other complex embedded elements probably won't render ideally on the first pass. We can improve this over time but I don't expect to be able to get it perfect in 4 hours, just much less bad.

I think the "Details" links would add too much clutter in plain text email

Could you maybe extend the "TASK DETAILS" part of the email and link to the transaction via an #anchor? That would be sufficient for me.

We can't easily link to an #anchor from email because the transactions that render in email are different from the transactions that render on the web. For example, T10493 is a case that specifically shows a transaction in email but not on the web. Since it doesn't exist on the web, we can't easily generate an anchor to the corresponding transaction because no web version of that transaction exists. This behavior is not unique to that transaction type. Anchor generation is also more complex than this for other reasons (for example, transactions are grouped, aggregated and reordered at display time, and these changes affect which anchors are available).

This is tractable (we can send the user to some link which looks up the "most similar" anchor to an email transaction and then redirects them) but significantly more complex than linking directly to details.

A group of transactions may also have an arbitrarily large number of transactions, and an arbitrarily large number of detail links (tasks may have an unlimited number of remarkup custom fields, all of which may be edited simultaneously). Any approach we take needs to accommodate this. Adding a single link to "TASK DETAILS" does not.

epriestley mentioned this in T4768: Phabricator silently overwrites concurrent changes (no conflict detection). May 25 2016, 2:46 PM

epriestley added a project: Prioritized. May 25 2016, 4:21 PM

Herald added a subscriber: eadler. May 25 2016, 4:21 PM

epriestley moved this task from Backlog to The Queue on the Prioritized board. Jun 6 2016, 10:31 PM

epriestley added a revision: D16063: Add links and diffs for text block edits to mail. Jun 6 2016, 10:36 PM

epriestley added a commit: rPfb2da8bd8b43: Add links and diffs for text block edits to mail. Jun 7 2016, 12:12 AM

epriestley claimed this task. Jun 7 2016, 12:36 AM

epriestley triaged this task as Normal priority.

epriestley updated the task description.

Herald added a subscriber: faulconbridge. Jun 7 2016, 12:36 AM

This is pretty rough (particularly, the actual diff part), but wired up everywhere. HTML email includes a link:

This link takes you to the change details, directly:

Screen Shot 2016-06-06 at 5.38.34 PM.png (914×1 px, 163 KB)

The details are also embedded in HTML mail:

Screen Shot 2016-06-06 at 5.38.55 PM.png (270×918 px, 74 KB)

Text mail is unchanged. D16063 discusses some ideas for improving text mail, but none of them seem very good. If you have thoughts on how it could work, see that for what I came up with and let me know if you have something better.

The change details in the web UI also use the new prose diffing:

Screen Shot 2016-06-06 at 5.39.34 PM.png (914×1 px, 208 KB)

The actual quality of the prose diff is very uneven right now. This particular edit is an example where it does very poorly. It does better with some other types of edits:

Screen Shot 2016-06-06 at 5.41.29 PM.png (523×666 px, 100 KB)

Next, I'll work on identifying and refining the behavior for cases where it does fairly poorly (the diff is technically correct, but a human would prefer that it did not detect that the "a" in "says" and "remarks" was technically retained across the edit). Once I've nailed down everything I can find, we can start collecting weird cases it still misses and I can see what we can do to improve them.
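To make the "says" / "remarks" case concrete, here is a minimal sketch of one possible smoothing pass: short unchanged runs sandwiched between two changes are folded into a single change. This is an assumption about the general idea only; the `smooth` function, its `min_equal` threshold, and the `(tag, old, new)` segment format are hypothetical and are not libphutil's actual smoothing rules.

```python
def smooth(ops, min_equal=3):
    """Fold short 'equal' runs that sit between two changes into one change.

    `ops` is a list of (tag, old_text, new_text) tuples where tag is
    "equal" or "change". Hypothetical format, for illustration only.
    """
    out = []
    for i, (tag, old, new) in enumerate(ops):
        sandwiched = (
            tag == "equal"
            and len(old) < min_equal
            and 0 < i < len(ops) - 1
            and ops[i - 1][0] == "change"
            and ops[i + 1][0] == "change"
        )
        if sandwiched:
            # Technically retained, but too short to be meaningful to a reader.
            tag = "change"
        if out and out[-1][0] == "change" and tag == "change":
            prev = out.pop()
            out.append(("change", prev[1] + old, prev[2] + new))
        else:
            out.append((tag, old, new))
    return out


# A character-level alignment that keeps the "a" shared by both words.
choppy = [("change", "s", "rem"), ("equal", "a", "a"), ("change", "ys", "rks")]
print(smooth(choppy))  # [('change', 'says', 'remarks')]
```

Long unchanged runs, and unchanged text at the very start or end of the input, pass through untouched, so only the choppy middle fragments are merged.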

The old diff was frequently so bad that I think this is essentially always better than it was, even given the current cases where it doesn't do a great job of producing human-readable diffs, but I expect to be able to improve it to get pretty good output about 90-95% of the time.

epriestley mentioned this in T3353: Make Phriction Document changes reviewable. Jun 7 2016, 1:00 AM

epriestley closed subtask T3353: Make Phriction Document changes reviewable as Resolved.

epriestley added a revision: D16067: Give prose diffs some basic test coverage. Jun 7 2016, 3:06 PM

epriestley added a revision: D16068: Apply edit smoothing to prose diffs. Jun 7 2016, 3:21 PM

epriestley added a revision: D16069: Use EditDistanceMatrix diff smoothing in Arcanist. Jun 7 2016, 3:27 PM

epriestley added a commit: rPHUc637bdba3985: Give prose diffs some basic test coverage. Jun 7 2016, 3:55 PM

epriestley added a commit: rPHUa64328ab3d62: Apply edit smoothing to prose diffs.

Diffs should be less choppy now, here's the improved version of the above change:

Screen Shot 2016-06-07 at 9.01.32 AM.png (508×621 px, 67 KB)

epriestley added a commit: rARC41e8e30e8c13: Use EditDistanceMatrix diff smoothing in Arcanist. Jun 7 2016, 4:13 PM

epriestley added a revision: D16071: Improve prose diff smoothing rules for whitespace and prefix/suffix changes. Jun 7 2016, 7:15 PM

epriestley added a commit: rPHU6d1eea50fb9a: Improve prose diff smoothing rules for whitespace and prefix/suffix changes. Jun 7 2016, 7:58 PM

epriestley added a revision: D16073: Allow prose and code diffs to use different smoothing. Jun 7 2016, 8:43 PM

epriestley added a revision: D16074: Use "internal" smoothing for code diffs in Arcanist.

epriestley added a commit: rPHUb11a344ba571: Allow prose and code diffs to use different smoothing. Jun 7 2016, 9:07 PM

epriestley added a commit: rARCc75b671b221a: Use "internal" smoothing for code diffs in Arcanist. Jun 7 2016, 9:15 PM

In T7643#179137, @epriestley wrote:

Diffs should be less choppy now, here's the improved version of the above change:

Very nice! Thank you for your work in this area.

I'm sure there are still some cases where we don't do a particularly good job, but let me know when you run into them.

Some other stuff I'm thinking about:

Make unchanged text grey to de-emphasize it, particularly in email?
Possibly fold/hide large chunks of unchanged text, particularly leading and trailing text, and particularly in email?
In the web UI, have a way to get the raw old and raw new text, in case you want to revert things or copy some of it or whatever else.
In mail, all sections have an "EDIT DETAILS" header, but they should have per-field headers instead ("CHANGES TO SUMMARY", "CHANGES TO TEST PLAN" or similar).
Maybe try to do something with plain text mail, if we can figure out something reasonable.

It would also be nice to develop an inline version of this for title changes like this:

T9233 is vaguely related.

This one feels a bit un-human-like, although it might be tricky to fix:

Screen Shot 2016-06-08 at 1.53.23 PM.png (260×630 px, 29 KB)

O.o

Oh, one more thing:

Paste got converted here somewhat accidentally, but is probably better as a code diff. The old behavior wasn't really particularly great (it was "half-a-code-diff", more or less) but it should probably be swapped to become more of a pure code diff.

epriestley mentioned this in T9789: Make it easier to write custom transaction types. Jun 9 2016, 4:40 PM

epriestley mentioned this in T11116: Ponder email for newly created question includes content that suggests the question was also edited. Jun 9 2016, 4:47 PM

We currently use white-space: pre-wrap to handle linebreaks in prose diffs, but at least one client (Airmail) doesn't handle that properly. We could use explicit <br /> to improve behavior.
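A minimal sketch of the explicit `<br />` approach, assuming the diff text is plain text that gets HTML-escaped before it is placed in the mail body. The `render_linebreaks` helper is hypothetical and is not Phabricator's mail renderer.

```python
import html


def render_linebreaks(text):
    # Escape first so user text cannot inject markup, then turn each
    # newline into an explicit <br /> for clients that ignore
    # "white-space: pre-wrap". Hypothetical helper, for illustration.
    return html.escape(text).replace("\n", "<br />\n")


print(render_linebreaks("first line\nsecond <b>line</b>"))
```

Explicit breaks only cover line structure; runs of spaces would still collapse in a client that ignores `pre-wrap`.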

epriestley added a revision: D16097: Add a "summary" prose diff mode which omits unchanged text. Jun 10 2016, 3:13 PM

epriestley added a revision: D16098: Render prose diffs in email as summaries. Jun 10 2016, 3:30 PM

epriestley updated the task description. Jun 10 2016, 3:57 PM

epriestley added a commit: rPHU7233ff63f821: Add a "summary" prose diff mode which omits unchanged text. Jun 10 2016, 4:40 PM

epriestley added a commit: rPfb156af480a1: Render prose diffs in email as summaries.

Large diffs like wiki pages should render better in email now:

Screen Shot 2016-06-10 at 9.55.43 AM.png (224×588 px, 28 KB)
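The idea of a "summary" mode that omits unchanged text can be sketched roughly as follows. This is a guess at the general shape, not the implementation in D16097; the `summarize` function, its `context` and `marker` parameters, and the `(tag, text)` segment format are all hypothetical.

```python
def summarize(ops, context=10, marker=" ... "):
    """Shorten long unchanged runs, keeping some context next to each change.

    `ops` is a list of (tag, text) tuples; only "equal" segments are trimmed.
    Hypothetical format and parameters, for illustration only.
    """
    out = []
    last = len(ops) - 1
    for i, (tag, text) in enumerate(ops):
        if tag != "equal" or len(text) <= 2 * context:
            out.append((tag, text))
            continue
        # Leading text has no change before it, trailing text none after it,
        # so only keep context on the side that touches a change.
        head = text[:context] if i > 0 else ""
        tail = text[-context:] if i < last else ""
        out.append(("equal", head + marker + tail))
    return out


ops = [("equal", "x" * 50), ("insert", "NEW"), ("equal", "y" * 50)]
print(summarize(ops, context=5))
# [('equal', ' ... xxxxx'), ('insert', 'NEW'), ('equal', 'yyyyy ... ')]
```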

thx!

epriestley mentioned this in Blog Post: Development Notes (2016 Week 24). Jun 11 2016, 11:43 AM

Catching a couple more funky diffs:

This one looks like an excessive smoothing after a comma was removed:

https://secure.phabricator.com/transactions/detail/PHID-XACT-TASK-43nibwnfn5g5glm/

Screen Shot 2016-06-13 at 12.11.02 PM.png (251×892 px, 49 KB)

This one is okay but it's a single deleted space that would be nice to mark as diffed less aggressively:

Screen Shot 2016-06-13 at 12.13.13 PM.png (85×191 px, 8 KB)

aklapper added a subscriber: aklapper. Jun 16 2016, 9:04 AM

epriestley renamed this task from Email notification for "[user] edited the task description" does not contain a link to the edit to Improve prose diffs (was: description changes don't generate usable diffs). Jul 16 2016, 1:51 PM

epriestley removed a project: Mail.

epriestley mentioned this in T7047: Add user preference for diffs in emails. Jul 16 2016, 2:09 PM

epriestley mentioned this in T9272: Show x lines does not work as expected in dialog. Aug 22 2016, 6:31 PM

epriestley mentioned this in T11529: Send emails/notifications containing the link to the action on the object which triggered it. Aug 25 2016, 2:30 PM

epriestley mentioned this in T11743: Unbounded runtime and memory use on some Paste transactions. Oct 6 2016, 11:20 PM

epriestley updated the task description.

epriestley merged a task: T11825: Prose diffs unusable for changes with large blocks of unmodified text. Nov 6 2016, 9:08 PM

epriestley added a subscriber: dseifert.

epriestley added a revision: D16817: In prose diff dialogs (like "Show Details" in transactions), show "old", "new" and "diff" tabs. Nov 7 2016, 7:23 PM

epriestley added a revision: D16818: Allow transactions to specialize their mail headers for diff sections. Nov 7 2016, 7:39 PM

epriestley added a commit: rP729492a8ff85: Allow transactions to specialize their mail headers for diff sections. Nov 7 2016, 8:16 PM

epriestley added a revision: D16820: Slightly improve some prose diffs. Nov 7 2016, 10:44 PM

epriestley added a commit: rPHU5d8e090fe1e2: Slightly improve some prose diffs. Nov 7 2016, 10:54 PM

epriestley updated the task description. Nov 7 2016, 11:09 PM

epriestley added a commit: rP3f5109b66864: In prose diff dialogs (like "Show Details" in transactions), show "old", "new"…. Nov 7 2016, 11:18 PM

We still do a pretty questionable job on the first case in T7643#180264, where a comma is removed from the middle of a sentence.

I fiddled with this for a while and wasn't able to make much progress by just tweaking the existing algorithm. I have some ideas for changing how the algorithm works a little bit that I think may improve this case, and also make the algorithm a bit easier to debug. I'm going to take a shot at these improvements and see how far I get.

epriestley added a revision: D16839: Make prose diff algorithm more iterative, to improve prose diffs for (among other things) removed commas. Nov 10 2016, 8:35 PM

epriestley added a commit: rPHUf36c31c991ca: Make prose diff algorithm more iterative, to improve prose diffs for (among…. Nov 10 2016, 8:40 PM

That seemed to just work? Slightly disconcerting.

unit tests ftw

I believe this is now in reasonable shape, in the sense that the overwhelming majority of prose inputs I'm aware of produce reasonable, human-readable diffs. They aren't perfect in all cases and aren't necessarily the diffs that a human would produce, but they're useful and fairly sensible.

If you're aware of cases which we still get wrong, please let us know (after updating to include D16839, which should have fixed several of the cases we did the worst on). I imagine we'll also continue tweaking the algorithm over time as weird cases arise.

If there don't seem to be any major remaining quality issues for a little while, I'll consider this resolved and move remaining work to followups. Particularly:

Paste (and likely Config/Almanac?) should go back to code diffs at some point, but prose diffs usually aren't too bad for them.
We don't link to the comment/transaction in mail, because this is difficult in the general case. This is discussed in greater detail in T11529. (This isn't really specific to prose diffs, anyway.)

epriestley added a revision: D16853: Exempt some more puncutation characters from diff smoothing in prose diffs. Nov 13 2016, 10:05 PM

epriestley added a commit: rPHU162c55d991df: Exempt some more puncutation characters from diff smoothing in prose diffs. Nov 13 2016, 10:10 PM

D16853 tweaked one issue that I caught.
T11881 is a followup for restoring code diffs in a small set of scenarios.
I haven't seen other issues for a while, so I'm going to consider this resolved until they crop up. If you do run into cases where the new algorithm isn't doing a great job, let us know.

@epriestley One issue that is still open is adding a change to the beginning and to the end of a large prose text. The diff of that becomes unusable. This is still reproducible with the current stable branch (fc71a7e92dc26e0d93ec33d709391411d2bcd827 (Mon, Nov 14)). For details / reproduction steps please see T11825.

epriestley added a revision: D16881: Improve prose diffs for changes spanning very large blocks of intermediate text. Nov 16 2016, 6:05 PM

Ah. I believe I fixed a simpler version of that case earlier, but not the specific case you describe. D16881 should fix that case.

Note that the algorithm gives up eventually, because computing a diff would be too expensive. In the general case, adding "a" at the beginning and "z" at the end of a very large block of unchanged text will eventually hit the internal limits and render an "everything changed, good luck" diff. However, the amount of text required to hit that limit should now be more reasonable (substantially more than 2,000 words of lorem ipsum):

Screen Shot 2016-11-16 at 10.08.11 AM.png (1×2 px, 742 KB)
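The "give up past an internal limit" behavior can be sketched with a plain Levenshtein distance that refuses inputs longer than a cap. This is only an illustration of the trade-off: `bounded_edit_distance` and its `max_length` parameter are hypothetical names, and the code is not PhutilEditDistanceMatrix.

```python
def bounded_edit_distance(a, b, max_length=128):
    """Levenshtein distance over two sequences, or None if either is too long.

    Returning None stands in for the "everything changed" fallback;
    the name and the default limit are assumptions for this sketch.
    """
    if len(a) > max_length or len(b) > max_length:
        return None
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute or keep
            ))
        prev = cur
    return prev[-1]


print(bounded_edit_distance("kitten", "sitting"))  # 3
```

Because the inputs are generic sequences, the same function works on lists of paragraphs or words as well as on characters, which is how a limit counted in paragraphs could be applied.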

epriestley added a commit: rPHU086df1ba443c: Improve prose diffs for changes spanning very large blocks of intermediate text. Nov 16 2016, 6:09 PM

@epriestley thank you very much for the fast update and fix. It is working now with my example, and should be much better with a lot of pages.

I am a bit worried however about the 128 paragraph limit you mention in D16881. Though this should indeed cover a lot of pages, it explicitly leaves out those for which this issue is the most severe - very long pages :-) I did a quick check in our wiki and I find a lot of pages with more than 128 newlines, e.g. tutorial pages or pages with lots of lists (meeting minutes, check lists, etc).

Some examples that still trigger the problem:

Page	Byte Count	Word Count	Total lines	Non-empty Lines
Tutorial	14701	2402	159	101
Some meeting notes	5976	890	212	147
Phabricator Starmap	13273	2019	479	259

With the current approach I guess there will always be a limit. I don't know about the resource consumption of the algorithm and whether the default limit could be increased. But maybe it makes sense to have the default user-configurable? For example, I feel multiplying it by two would be a good initial value for our setup.

(Please let me know if I should open a new task for this topic (or reopen T11825).)

Feel free to file a new task when you encounter actual problems with real text and we can look at adjusting the limit, adding an additional block-level difference phase, rewriting PhutilEditDistanceMatrix in C (see T2312), or other approaches. We can make a better decision based on concrete examples of where the current solution is falling short than by guessing where issues might arise.

I think we are unlikely to add an option for two reasons:

I am generally hesitant to add options, for reasons discussed in T8227.
Because runtime complexity is explosive, I worry it would be difficult for users to select a value for this option which is low enough to avoid performance issues.
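As a back-of-the-envelope illustration of the second point, assume the cost is dominated by a full edit distance matrix over two inputs that are each at the limit (an assumption; the real engine may do more or less work than this). The hypothetical `matrix_cells` helper below counts the cells in that matrix.

```python
def matrix_cells(limit):
    # A full edit distance matrix over two sequences of length `limit`
    # has (limit + 1) rows and (limit + 1) columns.
    return (limit + 1) * (limit + 1)


for limit in (128, 256, 512):
    print(limit, matrix_cells(limit))
# 128 16641
# 256 66049
# 512 263169
```

Under that assumption, each doubling of the limit roughly quadruples the work, so a value that looks only slightly larger can cost far more than expected.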

If you are confident that adjusting the limit is the best solution for you, you can modify the constant in src/utils/PhutilProseDifferenceEngine.php, near line 14, by changing the value passed to setMaximumLength(...).

Twilight removed a subscriber: Twilight. Jul 16 2017, 2:09 PM

epriestley mentioned this in T12963: Aggregate similar stories in Feed, Notifications and Timelines. Aug 16 2017, 6:55 PM

I do have a real-life example where the prose diff engine rendered a suboptimal diff. I was able to fix it by changing the maximum length of the edit distance matrix from 128 to 256.

	F1924485: Screen Shot 2016-11-16 at 10.08.11 AM.png
	Nov 16 2016, 6:08 PM

	F1686240: Screen Shot 2016-06-13 at 12.11.02 PM.png
	Jun 13 2016, 7:14 PM

	F1686245: Screen Shot 2016-06-13 at 12.13.13 PM.png
	Jun 13 2016, 7:14 PM

	F1682694: Screen Shot 2016-06-10 at 9.55.43 AM.png
	Jun 10 2016, 4:56 PM

	F1680549: Screen Shot 2016-06-08 at 1.53.23 PM.png
	Jun 8 2016, 8:54 PM

	F1680382: Screen Shot 2016-06-08 at 10.18.12 AM.png
	Jun 8 2016, 5:19 PM

	F1678915: Screen Shot 2016-06-07 at 9.01.32 AM.png
	Jun 7 2016, 4:03 PM

	F1678068: Screen Shot 2016-06-06 at 5.38.55 PM.png
	Jun 7 2016, 12:46 AM
