Page MenuHomePhabricator

Make Phriction Document changes reviewable
Closed, ResolvedPublic

Assigned To
Authored By
rm
Jun 11 2013, 11:30 PM
Tags
Referenced Files
F351825: Screen_Shot_2015-03-27_at_12.49.15.png
Mar 27 2015, 12:49 PM
Tokens
"Mountain of Wealth" token, awarded by edward."Like" token, awarded by aravindn.

Description

Here at Facebook, the Memcache team is taking the bold step of designing a system before building them. We believe this will lead to fewer bugs over the long-term, while serving to document our code beforehand instead of afterwards (Ha!)

We have looked at a few possible tools to write and review these design documents:

  1. Microsoft Word
  2. Zoho Docs
  3. Differential
  • Word has great review tools for text-heavy (non-code) documents. It works great except that it doesn't solve the versioning issue at all -- we have to manually version the documents.
  • Zoho docs solves the versioning issue (kinda) and I hear it has decent review tools, but it saves the document after every key press, so you end up with thousands of 'revisions' when really we only want revisions when we state it explicitly.
  • Differential is great for versioning, but poor for reviewing non-code documents. We tried this with the Scaling Memcache at Facebook paper and it wasn't a great use of differential.

What we would really like is something with versioning like differential, rich text editing like many of the tools in phabricator, and review tools optimized for non-code prose. This last point is especially important: we need to be able to:

  1. call out specific words (not just lines)
  2. call out sentences that may be partially on several lines
  3. suggest edits (perhaps inline, as in Word?)
  4. see diffs of the text that are more intelligent than line-based diffs (eg, if a sentence re-flows due to line-wrap, we can't have that be highlighted as changed text. We need a word-based differ).

Does this seem like a reasonable thing to support for the Phabricator suite? Is there already a tool that does a great job at this that I'm missing?

Event Timeline

rm added a project: Pholio.
rm added a subscriber: rm.

This probably doesn't overlap meaningfully with Pholio. I think it basically breaks out as a bunch of infrastructure components; if we had them then building the app would just be glue. Basically, they are:

Text (vs Code) Diffing: The major blocker here is technical. Our intraline diffing is O(NM) (N=strlen(old), M=strlen(new)) in time and space, and it's implemented in PHP, and it gets unicode right, and it gets unicode combining characters right, and all of that unicode stuff is implemented in PHP too. This creates a bit of an O-hazard.

Currently, we cheat by hard-wrapping diffs of blocks of text at 80 columns, so NM is never larger than 6400. This is generally good enough in Phriction/Maniphest/etc., but not good in the general case. We pursue the easy optimizations, but diffing axxx...xxxb against cxxx...xxxd on my system (where each string has length 514) takes 2,300ms and consumes 93MB of RAM.

512 characters is probably too short to fake our way through it for normal use; I think we need roughly an order-of-magnitude improvement here. There are a few options:

  • Find a better algorithm. I'm not sure if there's an actually-better algorithm (LCS + suffix tree?) or if we just need a more clever application of the same algorithm. The approach that jumps to mind is to diff words instead of characters, which seems likely to produce a better output anyway and puts us comfortably in faking-it-range (i.e., 512 words is fair chunk of text), although it might have weird problems I don't foresee. These reduced-granularity approaches are attractive because PhutilEditDistanceMatrix already supports each "character" having multiple bytes to accommodate all the utf8 stuff. At some level above that we could diff sentences, although we'll lose some resolution, and either approach means we fail in pathological cases (a paragraph like "a x x x .... x x x b"). We can look at what other programs do, too. These approaches become more questionable in non-latin languages.
  • Implement in C, which would work but creates a big installation mess. We could do soft variations of this too, like ship it as an optional external binary rather than a PHP extension.
  • Implement diffs as colorized edit histories instead of actual diffs (e.g., the inputs are the old document and the edits which were applied to it, not the old document and the new document). This has undesirable implications for cut/paste but doesn't have big-O problems. This requires having an edit history, of course, so it won't work if we don't control the editor.

Generally, this is tractable but the best way forward isn't completely obvious to me. If we had it, we could use it immediately in Phriction, Maniphest, Pholio, and other applications where we show diffs of human-readable text (increasingly, this is all applications, since they basically have some kind of editable text thing).

WYSIWYG Editing: Discussion in T337: Add a Rich Text editor to Phriction. The short version is that I really don't want to build it because I think it's all trash and doesn't work well, and it's also very challenging and complex and time consuming. However, nontechnical users expect it and to some degree require it, and it probably needs to be considered in any discussion of document stores. I think this makes diffing more or less impossible (it must be edit-based or dramatically more sophisticated), although it makes commenting easier. If we had this and somehow thought it was good we'd probably use it in Phriction, but bleh. We inched toward this with the reminder buttons above remarkup areas which I'm happy with, but I don't really want to go any farther.

Text (vs Code) Inline Commenting: Discussion in T1894: Support commenting on Phriction documents. This is mostly a UI problem. We can probably solve it reasonably well for latin languages, but as we reduce the difficulty of creating new revisions, we increase the need to bring inline comments across revisions, which gets really messy in the presence of edits. This is easier if we have WYSIWYG editing (we can have invisible markers) or if we use explicit markers. However, explicit markers (e.g., {! comment here} in the remarkup source) seem far too technical (or, really, extremely inconvenient) for even many technical users. If we had this, we'd probably use it in Phriction today.

Real-Time Collaborative Editing: A couple of companies swear by real-time document editing tools (hackpad / etherpad stuff). Docs is also realtime. We have some realtime infrastructure for Notifications but it's all optional, unauthenticated, and intentionally unsophisticated right now. I don't personally like these tools very much, and the existing ones seem pretty good, and they seem hard to get right (e.g., the FB interview thing was broken all the time as I recall).

Conflict Resolution: I think there are real-time and not-so-real-time flavors of this, but if a document has multiple editors from the web interface we need some kind of plan around conflict resolution. Phriction is currently "last save wins", but that is likely not sufficient for a collaboration tool. Making users resolve merge conflicts from the web UI is probably beyond most nontechnical users. We could try to build some sort of visual tool for it, or use locking. There are a bunch of upsides/downsides and a lot of mess. If we had some strategy here, we'd use it in Phriction today.

Plentiful Alternatives / Many Competitors: There are a large number of popular programs/services in this space -- Google Docs particularly. There are a lot of other application opportunities where we can build something more useful and competitive with much less effort, especially if the end result looks like WYSIWYG. If the end result is something very technical it's easier to get there, but I question how many users it would really serve. In this case, for example, could Google Docs satisfy the problem? I assume it's absent from the above list because of FB vs Google stuff, but it has features like "Show less detailed revisions" and "Make a copy..." which seem like they'd be pretty reasonable approaches here. It makes little sense for us to build a tool which everyone except Facebook will ignore in favor of Google Docs.


Overall, we're thinking about this stuff but there are a lot of hard product, technical and strategic questions between where we are now and having something which can solve this problem.

In the absence of new factors, we'll probably build text diffing and commenting + inline commenting into Phriction, which would get you most of the way there, although we might pull that in a CMS-ey direction rather than a documentey direction so separating the roles might make more conceptual sense. This is probably (much) later this year at the earliest, though. I don't anticipate it ever making strategic sense to pursue a WYSIWYG Google Docs-alike, and probably not any sort of real-time collaborative editing (these are basically on an unofficial list of "stuff we will never build", alongside video chat and spreadsheets).

We're also in the middle of some very slow-moving negotiations with Facebook about the Facebook/Phacility relationship, and it's unclear what the outcome there will be.

Thanks for the detailed reply. Real-time collaboration, conflict resolution, and WYSIWYG aren't very interesting to me either for this use case, honestly. I think remarkup is "rich enough" (and you can embed images, no?

{F46254} (yes you can!)

The biggest one for me, really, is text (vs code)-based comments and somewhat less big than that would be text-based diffing (I think splitting on words and paragraphs with a few hacks for noticing when paragraphs are split would be sufficient, but I agree that it's a generally difficult problem; I'm not trying to minimize that).

Yeah, gdocs would probably be nice enough.

Yeah, gdocs would probably be nice enough.

Alas...

epriestley triaged this task as Wishlist priority.Jun 13 2013, 12:27 AM

If Facebook is interested in paying us to prioritize this I'm happy to write up a proposal (although it would be at least 6 weeks of Releeph + however long the other stuff we're sorting out takes away), but otherwise I think it's largely a Facebook-specific problem and don't anticipate much action in the upstream until after everything else on the Roadmap.

It would really make this complete if we can use this for our document review and docstore. We have used reviewboard for reStructuredtext docs, having a similar fro remarkup would be good

epriestley added a subscriber: Unknown Object (MLST).Sep 3 2013, 11:43 PM

Via FB.

chad renamed this task from Pholio (or similar) as a design document store? to Design document store?.Mar 21 2014, 5:19 AM
chad removed a project: Pholio.

This task discusses text (vs code) diffing above. Not sure if you're interested in looking at this, but here's a rough outline:

Currently, the method ArcanistDiffUtils::generateIntralineDiff($old, $new) takes an old line of text and a new line of text, like this:

aaaaaaaa
aaaabaaaa

...and produces edit strings (approximately) like this:

ssssssss
ssssissss

...where "s" means the characters are the same, "i" means characters were inserted (so we'll color that green), "d" means characters were deleted (so we'll color that red) and "x" means characters were exchanged (edited), so it gets colored red on the left and green on the right. The actual output format is a little more complicated, but that's basically what's happening.

The workhorse of this computation is PhutilEditDistanceMatrix, which takes two "vectors of scalars" (i.e., lists of strings) and returns an edit string for them. That part shouldn't need to be changed, probably (which is desirable, because it's very complicated).

Right now, in computeIntralineEdits() we break the inputs down on a character-by-character level, using either str_split() for non-utf8 inputs or phutil_utf8v_combined() for utf8 inputs.

The major problem is that if the inputs are too long, we'll just fake our way through the diffing because of $max = 80 in generateEditString(). If you increase that to 1000 and try to diff long strings, the runtime and memory use will go crazily high (note that we ignore similar prefixes and suffixes, so we can compute "aaa...a" vs "aaa...b" cheaply -- if you're testing this, you need to make both the beginning and end of the strings differ). This means that we just return "xxxxxx" if the inputs have changes separated by more than 80 characters, which most human-readable text changes do.

Instead, an alternate API could try splitting the inputs on words or sentences, so that a budget like 80 would let us diff a paragraph up to 80 sentences long, which is likely all non-pathological inputs. It could also do something like diff sentences first, then do word-diff on all the changed sentences. That could let us get very granular results without needing to actually solve computationally hard diff problems.

Generally, implementing an ArcanistDiffUtils::generateIntralineDiffForHumanReadableText($old, $new) which exhibits more reasonable behavior for human-text inputs would let us improve history diffs, Phriction diffs, etc., even if that behavior is only "okay" over all and only does the right thing "most of the time", since the current approach is quite bad in most cases. That is, we can make a big improvement with a very rough solution to this problem.

(Ahh, thanks for the merge, @epriestley.)

Real-time collab is actually one of my own top priorities (unbelievably useful for certain types of documents.) Wishlist wise, OpenDocument support would be peachy, though impractical for a first run.

FWIW, FB is using Quip for this kind of design/idea document these days: http://quip.com

I think quip works by actually paying attention to what you do as you edit the document, much like Word and Google Docs.

What @epriestley described above sounds like a pretty reasonable way to deal with human-readable input though. It's all about sentences anyway.

Note that GitHub recently introduced difference views for for Markdown documents.

https://help.github.com/articles/rendering-differences-in-prose-documents/

It's minimal, but quite nice for the common case.

Example: https://github.com/wikimedia/oojs-ui/commit/c115cc293b

Screen_Shot_2015-03-27_at_12.49.15.png (812×1 px, 69 KB)

chad changed the visibility from "All Users" to "Public (No Login Required)".Jul 26 2015, 7:52 PM
chad renamed this task from Design document store? to Make Phriction Documents "Reviewable".Jul 26 2015, 8:09 PM
chad added a subscriber: chad.
chad renamed this task from Make Phriction Documents "Reviewable" to Make Phriction Document changes reviewable.Jul 26 2015, 8:17 PM
epriestley moved this task from Backlog to Next on the Phriction board.
eadler added a project: Restricted Project.Jun 1 2016, 10:37 PM
eadler edited projects, added Restricted Project; removed Restricted Project.Jun 1 2016, 10:37 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

I think this pretty much boiled down to three technical pieces, all now better covered elsewhere by more specific tasks:

  • Text (vs Code) Diffing: This is now improved, and the actual algorithm should improve shortly. See T7643 for followup.
  • Text/Phriction Inline Comments: See T1894 for followup. I'll update this shortly in light of T7643.
  • Conflict Resolution: This has improved slightly. See T4768 for followup.

I don't think this task has anything unique or actionable remaining, so I'm going to close it in favor of those.

  • Text/Phriction Inline Comments: See T1894 for followup. I'll update this shortly in light of T7643.

Hm, I suspect there's somewhat of a gap here, in being able to propose concrete changes which are immediately viewable. For instance, when writing design documentation, you might propose a major rework which changes quite a few things: too much for a comment thread (not being able to view the proposed change makes it easy to misread/misunderstand things by trying to manually stitch suggestions together), but also not something that should be shown in the primary document view until it's actually been agreed upon (as it'll mislead people).

I couldn't find another task for this; should I file a feature request, in the full knowledge it'll land quite far into the future?

Can you do that today with view and edit policy?

In T3353#196761, @chad wrote:

Can you do that today with view and edit policy?

I'm not sure what you mean here. I don't want to restrict viewability or editability, and the only way in which I can see those policies being meaningful would be if every proposed change involved copying the page to a temporary first, then editing that. This seems not great.

Having thought about it further, I think a better path for my documentation/editing needs would be T4558, in that Differential is now usable for prose review, and it keeps the documentation together with the code, which is generally what I'm after. That being said, it's a total non-starter until there's a Conduit API or similar which allows updates to happen from something without direct database access.