
Distinguish between issue and non-issue inline comments
Open, Needs Triage, Public

Description

We've tried to format this feature request using the suggested problem statement and five-whys approach. We've also compiled user stories from feedback we've received from our end users. Let us know if you have any questions or would prefer another format.

Problem statement

As a reviewer, I cannot distinguish between a comment I expect the author to take action on and a comment I do not expect the author to take any action on. The action can be either making changes to the revision or explicitly stating they will not make the change.

  • Why? So it is clear to the author what comments they should take action on.
    • Why? So the author can focus on and not lose track of the changes necessary to submit the revision.
      • Why? Authors have missed comments that should have been addressed.
    • Why? So I can accept the revision with open issues, trusting the author to address the comments; that way I do not have to formally request changes (which seems too strong), and the revision can land without me having to accept again after the comments are addressed.
      • Why? So I am not blocking the revision from landing after the comments are addressed.
        • Why? So that revisions can ship faster while maintaining quality.

User stories

  • As the author of a change, I want an easily accessible backlog of things I need to address in order to satisfy reviewers thus far - without spending undue time or risking missing things.
  • As a reviewer of a change, I wish to differentiate between take-it-or-leave-it style feedback and “please address this before shipping”, without coming across as too formal or strong, and trusting that it will work and not be misinterpreted no matter who the author is (cross-team consistency in a large org).
  • As a reviewer, if I trust that the author will make final tweaks according to feedback and that it's then okay to ship without re-review, I want to say “fix it, then ship”. This puts additional importance on accurately conveying what needs to be fixed and what does not.
  • As a team leader, I want to reduce social friction by allowing the distinction between “please fix before ship” issues and others in a neutral, codified and consistent way.

Event Timeline

I'll dig around the tasks, but in short we feel this significantly changes the concept of "Code Review" to being "just fix these things" and not about "why was this code written".


I will add my vote for a small subset/version of this, i.e., hoisting inline comments and their "done" state into an easily perusable checklist.

I think T8250 is what you're looking for, @venky.

As a reviewer, I cannot distinguish between a comment I expect the author to take action on and a comment I do not expect the author to take any action on.

You can't currently do this by explicitly checking a box, but T9004 suggests some ways you can do this. For example:

  1. Write your feedback in a way that makes it clear to human readers whether it's a suggestion or an issue.
  2. By convention, prefix inlines with a text marker like "(ISSUE)".
  3. By convention, prefix inlines with a non-text marker like ❗ (:exclamation:) or ({icon bomb}).
  4. Use IMPORTANT: markup.
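
For concreteness, options (2) through (4) might look something like this in an inline comment (a sketch with a made-up comment; the "(ISSUE)" and emoji markers are whatever convention your team agrees on, while "IMPORTANT:" is standard Remarkup callout syntax):

  (ISSUE) This loop never terminates when the list is empty.
  ❗ This loop never terminates when the list is empty.
  IMPORTANT: This loop never terminates when the list is empty.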

I have exclusively used method (1) for the lifetime of Differential, and haven't experienced problems conveying intent (specifically, I observe a very low error rate related to missing feedback in code I review). I do have a few idiomatic phrases I use (for example, "Consider..." for suggestions and "Prefer..." or "By convention, prefer..." for trivial feedback) but am basically just writing English text and haven't encountered any meaningful difficulty with this on either side of the review fence.

In particular, I've very rarely seen authors miss comments that should have been addressed, even with a fairly liberal policy on accepting revisions with significant feedback.

It's also not clear to me how distinguishing between feedback types solves this problem, or why you arrived at this solution in response to that root problem. If this is an error you're seeing at a significant rate, an alternate suggestion which has come up in the past is either warning about inlines not marked "done" in arc land, or a stricter version (requiring users to mark all of the inlines on a revision "done" before they can land it). These seem like more direct/obvious solutions if this is an issue of careless mistakes.

So I'd like to understand:

  • Generally, why are users making these errors? (In particular, what's different about your environment that makes them more common than in the environments I've experienced?)
  • If these errors are arising from carelessness and remedying this source of error is important, why is marking inline types better than requiring users to explicitly mark all issues addressed? Why do you believe that separating feedback types would help if these errors are just arising from authors overlooking feedback? Couldn't they overlook an "issue" as easily as a "non-issue"?

To this subpoint:

trusting that it will work and not be misinterpreted no matter who the author is

Has this happened? I can't personally recall a time when an author ever misinterpreted a comment that I left and, e.g., intentionally disregarded an issue I considered important because they misunderstood (as opposed to overlooking feedback by mistake, which is rare but occasionally happens). Can you give me some examples of comments which have been misinterpreted?

Playing devil's advocate, it's not too much of a stretch of the imagination to assume that one could then make it impossible to land things which have unaddressed "issues" vs "comments" on a review. That would be the only concrete advantage I see to explicitly defining these things rather than relying on human parsing.

Maybe, but we really do a bad job today of allowing inlines to be managed and displayed. I think resolving those UI concerns goes a long way toward correcting any root issue here. At least, only after cleaning up the UI should we look at whether additional controls are needed. Even just adding some 'remarkup hints' to the inline comment box could resolve this in a lightweight, optional way.

I could imagine a possible future like:

  1. We come to understand that some set of problems exist which lead to developers carelessly overlooking inlines, which we believe are reasonably-well remedied through "you have unaddressed inlines" warnings or errors.
  2. We implement the warnings or errors.
  3. They do actually fix the problem, but create a secondary problem where authors have to go do busy work checking off inlines at the end of a review.

If we had this secondary problem, an explicit feedback type might be reasonable, although another possible approach is to allow the author to mark their own inlines "Done" when they submit them.

However, I'm still stuck on (1) so far. The only routine carelessness I can really recall seeing was with very new users or interns or student-interns or whatever, who needed a lot of handholding anyway. In those cases, I wait until the first revision is perfect before accepting it, then just gradually ease up as they come up to speed.

It's possible that I'm making a lot of adjustments like this without really realizing it, or that these adjustments aren't as obvious or easy for others to make for whatever reason (I don't think I'm an exceptionally empathetic reviewer or anything, but I do have at least a few years of experience now). There could be missing information in some contexts (e.g., maybe reviewers don't realize they're reviewing a new college hire's first revision!). There could be other things that are just outside my realm of experience (e.g., maybe a lot of language barrier issues in global development teams that could legitimately be cleared up by more explicit signaling, or cases like culture clash between acqui-hired companies and core teams coming to the table with different expectations).

On the other hand, if full-time professional engineers are just routinely super careless, that seems like it might be a problem outside the realm of Phabricator's purview, or a problem we want to try to approach differently ("how can we make engineers less careless?"). This might spiral into some larger topic like feedback systems (à la T4308).

If engineers are extremely careful when writing code but magically hopelessly careless when responding to review feedback, they probably aren't bought into the value of review. I suspect forcing it on them more aggressively by making it a painful gauntlet of checkbox procedures won't win them over, and is actually the wrong direction.

Personally my only anecdote is additional controls actually slow me down. Having to label each comment as a reviewer, or address the control as the author, is an extra mental step that has to occur before finishing the review. All the times I've had difficulty managing a review as the author come from times I've shipped a very large diff. In those cases just being able to manage "Done" more effectively would have been very appreciated. I don't think the context was important, just the overall list and state.

For example, I can describe several different levels of "careless mistakes", after the author lands a revision without addressing the feedback:

  1. The author initiates a remedy and immediately comments on the revision, like "sorry, I got interrupted and missed some of this, I'll fix it up", then promptly sends you a followup change to address the feedback.
  2. The author doesn't initiate a remedy, but is apologetic and prompt in fixing things when gently prodded ("Did you miss this?") or apologetic and prompt when sent a followup revision.
  3. The author doesn't initiate a remedy and is anything less than apologetic and prompt when gently prodded (e.g., ignores a followup).

When they occur rarely, I think (1) and usually (2) are likely legitimate careless errors. These are the classes of error I've seen (and made) most often, but the rate of these errors is low enough (in the realm of 1 in 200 revisions, perhaps?) in my experience that I think I'd be slowed down overall by having a warning prompt (and significantly by having an error). It would often raise false positives, and very rarely catch a careless error. When it did catch one, it would almost always only be making a tiny inconvenience slightly less inconvenient. Of course, there are still things we could do here, like make arc land emit a message without prompting the user or stopping the workflow.

I don't think I've ever seen (3) in a professional setting. If anyone is seeing (3), this is probably not a behavior with a technical solution.

If we're still in the realm of legitimate careless error, are these users super careless all the time? Maybe they actually are (e.g., they're legitimately new), and we can find some technical solution to this problem, like having better cues about interns or new hires (Badges is close to being able to do this).

If they're experienced engineers and super careless all the time, this seems like bad news and not something that has a technical solution. Or maybe there's some class of problem where careless engineering is perfectly fine that's just outside my realm of experience. But my expectation currently is that in most cases you can't be an effective engineer and super careless all the time.

If they're experienced engineers and magically way more careless about responding to review feedback, maybe this is a buy-in issue rather than a technical issue. If it is, I don't want to make the system harder to buy into (by adding more process): I expect this probably won't help users who have buy-in issues, and hurts users who don't.

In T11450#189540, @chad wrote:

Personally my only anecdote is additional controls actually slow me down.

Yeah, this is also motivating my current thinking around some specific things, like putting more prompting in arc land. I don't think such prompts are necessarily unreasonable, but the rate of errors I see today is so low that they're probably slight net negatives: slowing down every revision to slightly reduce the cost of a tiny mistake in 0.5% of them isn't a good tradeoff.
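
To put rough numbers on that tradeoff (the per-prompt and per-error costs here are illustrative guesses, not measurements): at a 0.5% error rate, a prompt costing ~10 seconds on every land spends about

  200 lands × 10 seconds ≈ 33 minutes

of engineer time to catch one careless miss that typically takes a minute or two to fix in a followup. The math only works out if the error rate or the per-error cost is much higher than what I see.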

We may reasonably be on the low end here, and maybe the greater body of software engineers commits errors in this vein at a higher but still basically reasonable rate of, say, 3% (say, because everyone has open floor plans and gets interrupted all the time while landing). This prompting might be worthwhile if we're just on the low end and many teams see generally similar behavior, just at slightly higher rates.

Alternatively, if most teams see behavior like us that makes this pretty marginal (say, in the realm of 1% true positives) and some teams see some wildly different number which makes this capability critical, I'd like to understand what's different about those teams. There's probably some other root cause and we'd be addressing a symptom and hurting teams without that underlying problem, whatever it is.

I think these things make sense in an open-source world, where nearly every contribution is from a new source, and having clear, action-based feedback is a net positive. But Phabricator is engineered towards enterprise/teams/trusted circles, and these things can burden both parties pretty quickly, making us less efficient. Having to classify each piece of feedback becomes an extra thought loop, forcing users into a binary choice and leaving out the possibility for grey-area conversations to take place.

Responding to as much as I can. @eadler @scode feel free to chime in / correct anything below.

Playing devil's advocate, it's not too much of a stretch of the imagination to assume that one could then make it impossible to land things which have unaddressed "issues" vs "comments" on a review. That would be the only concrete advantage I see to explicitly defining these things rather than relying on human parsing.

That's pretty much what we're looking for. To be clear, our end goal isn't necessarily to bog down code review with more forced process, but to help improve quality while maintaining speed. Distinguishing between issue and non-issue should be optional, and default to non-issue.

In T11450#189530, @chad wrote:

At least, only after cleaning up the UI should we look at whether additional controls are needed. Even just adding some 'remarkup hints' to the inline comment box could resolve this in a lightweight, optional way.

Agreed, cleaning up any UI concerns should likely take priority before addressing any root issue here.

I think these things make sense in an open-source world, where nearly every contribution is from a new source, and having clear, action-based feedback is a net positive. But Phabricator is engineered towards enterprise/teams/trusted circles

While the majority of reviews are intra-team in a large company like ours, a large chunk of reviews are inter-team / cross-functional. I think the workflow in these cases isn’t that far off from the open source use-case that you mention in the sense that engineers that you haven’t worked with before are modifying code that you own. So, “clear, action-based feedback” is useful.

Having to classify each piece of feedback becomes an extra thought loop, forcing users into a binary choice and leaving out the possibility for grey-area conversations to take place.

I agree that if you *had* to classify everything, it’d be a huge pain. That’s not something anybody wants. I think by default a comment should be classified as an optional no-big-deal comment, but we would want a way to very clearly call out things that should block the revision before landing.

A very common use-case we’ve had is a reviewer saying “Hey, most of this looks great, but this line in particular is going to break things. Make sure to fix this, then ship it.” RB has a clear way to do this in the UI, where you specify particular issues and still “accept”. The author must address the issues, and then they can land without any further action from the reviewer. In Phab’s case, the reviewer must request changes, then do another context switch to come back to the review, then accept. Thus we believe adding some ability for a reviewer to give some sort of conditional accept speeds up the process rather than bogging it down.

  1. Write your feedback in a way that makes it clear to human readers whether it's a suggestion or an issue.
  2. By convention, prefix inlines with a text marker like "(ISSUE)".
  3. By convention, prefix inlines with a non-text marker like ❗ (:exclamation:) or ({icon bomb}).
  4. Use IMPORTANT: markup.

Something like 2-4 could potentially be a reasonable solution if it were hinted at / encouraged by the UI, but we would still be missing the ability, as a reviewer, to clearly say in the UI: “fix this, then you can land it”.

Generally, why are users making these errors? (In particular, what's different about your environment that makes them more common than in the environments I've experienced?)

Good question that I don’t have the answer to; I may have to get back to you on it. I don’t know if users are doing this at a higher rate than what you’ve experienced (although anecdotally it sounds like a higher rate), but I will say that even if it is the same rate of honest carelessness, reducing it even by 0.5% would be very helpful in preventing massive bugs/incidents.

If these errors are arising from carelessness and remedying this source of error is important, why is marking inline types better than requiring users to explicitly mark all issues addressed? Why do you believe that separating feedback types would help if these errors are just arising from authors overlooking feedback? Couldn't they overlook an "issue" as easily as a "non-issue"?

I think we’d want things marked as issues to block accepting. We think this solution is a good balance of speed and quality: it ensures the most important issues get done before landing, while allowing the author to land once those issues are addressed without any further action from the reviewer. I would think it's also harder to overlook things when there are far fewer things to overlook.

All the times I've had difficulty managing a review as the author come from times I've shipped a very large diff. In those cases just being able to manage "Done" more effectively would have been very appreciated. I don't think the context was important, just the overall list and state.

Along those lines, at least a few users have commented that they miss having a “to-do” list for these large revisions. One team in particular even listed it as one of the reasons they were moving back to RB. And although you mention context isn’t important, wouldn’t having issues give you a better prioritized list of things you need to address?

I will say that even if it is the same rate of honest carelessness, reducing it even by 0.5% would be very helpful in preventing massive bugs/incidents.

I've never seen this cause a "massive bug" or incident because I don't "Accept" code which has massive bugs in it. I'll only "accept" if all of my remaining feedback is effectively cosmetic and I'd be comfortable putting code in production as-is.

That is, my experience is that in about 1 out of 200 revisions someone (myself about as often as others) will carelessly neglect to fix a typo, update an icon, or make a cosmetic change to a piece of logic. They will often follow up with a fix without nudging. Trading that one missed typo fix for 199 prompts would not be helpful at the rates I'm seeing errors.

Are you seeing users accept code with serious bugs? Is the standard for "Accept" something lower than "this would be OK (e.g., non-breaking) in production right now" in your environment?

See also T9367, perhaps, where users apparently accepted changes with serious bugs because of timezone differences between remote teams, although the exact use case behind that task remains murky to me.

My expectation is that only production-ready code is accepted, and all remaining feedback is noncritical when a reviewer accepts. If this isn't true in your environment, I'd like to understand what circumstances lead to reviewers accepting changes which have a known ability to be significantly disruptive without further updates.

A few thoughts.

Much of the above conversation has been focused on users making errors. I don't think this is the primary concern, except to the extent that we consider sub-optimal overall behavior an error. The user stories tried to show that it's about more than just making mistakes.

A few spot thoughts that relate to some of the above:

I don't think we're asking for a "warning, there are unresolved issues" prompt. I don't have data to support anything here, but that comes across as exactly the kind of extra annoying process we want to avoid, whereas providing the *utility* of easy and clear issue management is designed to *help* both the reviewer and the author. While one *could* certainly have a warning for shipping things with open issues, it's not the main point.

I disagree about it being extra process, at least as intended by us. I think we may not be understanding each other on the intent and the nature of what is being asked for. As @lvital mentioned, we envision something which never gets in your face by default but is very easy to use. From a reviewer perspective, it's then entirely optional. From an author perspective - if the reviewer cared enough to mark an issue, I don't see why clicking "fixed" or equivalent is busy work compared to the work required to actually make the fix. The intent is that it's used for things important enough that the reviewer wants an explicit ACK that the issue was fixed, and clicking a button is pretty easy compared to commenting or listing "fixed XYZ" in your next invocation of arc diff. Meanwhile it provides a mechanism for ensuring issues are burned down, thus helping the author (running around trying to look for what left-over issues are still open is also busy work). It's intended to help the user, not hinder them.

About culture: a large organization is not a close-knit team where everyone is on the same page. Culture at scale is hard. I would deem it mostly impossible to get 1000+ engineers to organically agree to use informal conventions such as phrasing (unless you make this happen very early and keep it up as you slowly grow). It's exceedingly hard to get the attention of people, and even harder to change their behavior. It also ties into things like language barriers, acquisitions, etc. A key component of having the code review system support a notion of issues to burn down is that it's inherently and automatically discoverable. As long as the reviewer decides to use the feature, the author will automatically be exposed to it, and it will be obvious that it is an actual workflow/thing to use and that it provides active utility (a feature that helps the author track issues). Each reviewer, since they are using the same feature, will be doing it in a consistent way. Prominent availability in the UI also means good discoverability for reviewers.

There is also social sensitivity. A variety of concerns, including interpersonal concerns, different modes of operating, language barriers, etc., can all contribute to making code review a more antagonistic process than it should ideally be. While there are many things that go into that and it's not going to be magically solved by a tool change, there is value in having a *neutral* and *codified* way of expressing "please don't ignore/skip this feedback before shipping". This benefits both cases: if you're trying to say "just FYI, let me spread some opinion/knowledge, but don't worry about it for now", it makes it easier to do so by simply not marking it as an issue, which encourages commenting on more than just serious problems. On the other hand, if you're trying to express a serious problem, you can again do so in a neutral, codified manner.

On the purpose of code review:

I'll dig around the tasks, but in short we feel this significantly changes the concept of "Code Review" to being "just fix these things" and not about "why was this code written".

I think the goals of code review are much wider than just "why was this code written". One could even argue that a good commit message should convey the why, not a discussion in the weeds of a code review. Code Review can have many uses, including:

  • Finding bugs (even if you also have reasonable testing).
  • Spotting bad design that is going to be costly down the road, and have a discussion about it, so the author can learn and also make a better change.
  • Spotting a lack of adequate documentation and encouraging better documentation.
  • Sharing knowledge about best practices, available libraries, etc.
  • Independent eyes scanning for readability.
  • Spotting good design, so the reviewer can perhaps learn to do better.
  • Spotting good documentation and encouraging the reviewer to do similar things in the future.
  • In general promoting consistency by exposing people to the work of others.

The first few of those are things you may want to mark as issues, whereas the last few are probably always going to be either implicit or just "hey, FYI" type comments. And there is plenty of room for disagreement about what kind of feedback is crucial and what is not. Code review as a learning process implies a lot of feedback that's constructive and teaching, and not all of it is going to be an absolute must-fix.

All of that said, I definitely agree/expect that better handling of conversations will lessen the motivation for this, because the penalty of effort/annoyance in wading through a ton of comments to spot what's left to do should be much reduced. I think it's fair to say, regardless, that improving conversations is more important and perhaps the first step, as @lvital also said earlier. If we ship that, we can also gather more input from our users as to how much they still desire issue/non-issue distinctions.

It's not optional. It's important to not confuse the fact that pressing the button is optional with the fact that a new decision has to be made with every comment.

It's not optional. It's important to not confuse the fact that pressing the button is optional with the fact that a new decision has to be made with every comment.

That is true, but also true if you're making a decision through phrasing or convention. I.e., the decision happens whenever you are intending to convey, somehow, the distinction.

My expectation is that intent is conveyed with the plain text of the comment without requiring any real effort on the part of the reviewer, and that the use of human language allows reviewers to express a very wide range of types of feedback.

Maybe it would be helpful if you can provide some examples of comments which are ambiguous without an explicit marker, but would become unambiguous with an explicit marker? Or comments which come across as unclear or antagonistic without a marker, but neutral with a marker?

I can't really recall the last time I saw confusion about tone. I need to clarify the technical content of comments with some regularity (e.g., I did a poor job of explaining an issue), but can't immediately recall a time where the tone of feedback was misunderstood.

  1. The author initiates a remedy and immediately comments on the revision, like "sorry, I got interrupted and missed some of this, I'll fix it up", then promptly sends you a followup change to address the feedback.
  2. The author doesn't initiate a remedy, but is apologetic and prompt in fixing things when gently prodded ("Did you miss this?") or apologetic and prompt when sent a followup revision.
  3. The author doesn't initiate a remedy and is anything less than apologetic and prompt when gently prodded (e.g., ignores a followup).

I don't think I've ever seen (3) in a professional setting. If anyone is seeing (3), this is probably not a behavior with a technical solution.

I have encountered (3), which is one of the many reasons I am no longer working there. :)

I do understand the desire to be able to more accurately frame the feedback, but I'm probably not on board with introducing a new workflow that we need to maintain, support, and teach to all Phabricator users. I still think no real discussion can come of this until we revamp the inline / done UI. If we have a new inline/done UI, the ability for reviewers to pre-mark dones, and an option or warning when things land before they are marked done, I think that achieves the same overall goal from a high level?

At least, I assume we all want to see T8250 first?

This discussion also doesn't really seem to define an upper bound on this feature, at least to me. If some explicit classification ("issue" vs "non-issue") is good, why isn't more better? Almost all of the arguments for this classification seem to support adding many different options ("compliment", "suggestion", "fyi", "issue", "really important issue"), perhaps multiple dimensions of classification ("serious comment", "joke intended for humans"), etc.

If we build this and motivate it as a way to reduce human/technical culture barriers, ease social sensitivity, and reduce ambiguity, I don't see how we draw the line: surely having more options furthers these goals? Why not put tone markers on top-level comments, too?

(Questions like this are hypothetical now, but may not be hypothetical if we ship this feature and get a request in a few months to add a "sarcasm" checkbox to inlines or whatever. While that feature is obviously bad, it's completely in line with some of the motivations discussed here.)

Yeah, I think we're definitely going to build T8250 before anything here or on T11451.

+1, let's revisit after T11451/T8250 and after we've shipped that to our users.

@scode, @lvital, having this level of detail on your issues before attempting T8250 is immensely helpful.

To add another perspective, several of the above use cases also apply where non-developers are looking at code reviews as quality documents, e.g., in an audit or quality-department review. Being able to easily differentiate between Change Required and Discussion/Question types of comments would make these situations easier to explain/answer -- for example:

  • Explain this code review from 8 months ago where comments were made but no followup code changes were done. How did the author/reviewer know it was ready to go upstream in this situation?

Being able to show that all Change Required comments were addressed, while Discussion/Question comments may not have been, gives them something they can consume and move on from.

Also I need some way to explain memes to auditors~
(meme macro: whatcouldgowrong)

How did the author/reviewer know it was ready to go upstream in this situation?

My expectation is that reviewers do not accept code which isn't ready to go upstream (maybe it could be improved, but all remaining feedback is essentially optional), so the author and reviewer knew it was ready to go upstream because the reviewer selected "Accept".

Is this not the meaning of "Accept" in your environment? If not, what does "Accept" mean to you? When do reviewers "Accept" code which is not suitable to publish in its current form?

How did the author/reviewer know it was ready to go upstream in this situation?

My expectation is that reviewers do not accept code which isn't ready to go upstream (maybe it could be improved, but all remaining feedback is essentially optional), so the author and reviewer knew it was ready to go upstream because the reviewer selected "Accept".

Is this not the meaning of "Accept" in your environment? If not, what does "Accept" mean to you? When do reviewers "Accept" code which is not suitable to publish in its current form?

This is the root cause of our differences. We often "accept" code in a "fix it, then ship it" way. For example: "It will be ready to ship once you've fixed the argument order on line 7. Accepting it because I trust you'll fix it prior to shipping."

My expectation is that reviewers do not accept code which isn't ready to go upstream

Yes, this expectation holds for us as well; however, comments often do continue after acceptance, which raises the question of why something was accepted if there was more to say. Having a way to distinguish whether an inline comment was specifically asking for a change or not (or some other type/context indicator) would simplify these explanations. This is not a huge issue for us; I was more trying to give the perspective of a non-developer as a consumer of code reviews.

We often "accept" code in a "fix it, then ship it" way. For example: "It will be ready to ship once you've fixed the argument order on line 7.

This also happens with us for minor things like "rename this variable and it's good to go" or "this could probably be simplified". We do this for two reasons:

  1. The change is not required to address the problem.
  2. The author is more aware of time (or other) constraints than the reviewer is for the change in question.