Page MenuHomePhabricator

Update diff/patch parsing to extract more metadata and parse a wider range of formats
Open, NormalPublic

Description

When parsing diffs, we currently do not extract metadata, particularly:

  1. Author information (name and email address).
  2. Base/parent commit information.
  3. We do not parse multiple commits from formats which support multiple commits.

Extracting this metadata likely involves meaningfully reworking how changes are stored internally, particularly for point (3), as a parse must now be able to emit multiple commits even if we immediately raise this state as an error to the user.

Additionally, there are a large number of existing arc diff or arc patch bugs, many of which are obscure but many of which are likely to be entangled with the parsing phase. It may be possible to skip many of these, but at least some are likely tied to parsing. Particularly, I believe T1022 (one of the few Git bugs) to be significantly difficult to resolve, and T9069 to be deeply entangled (and impacting at least one SAAS customer).

Parsing is particularly complex because it does not, in the general case, convert a text input into a structured output. It converts a repository into a structured output, sometimes by generating synthetic data by running additional commands. In particular, we attempt to generate full binaries for each side of a binary change because users can never resolve binary conflicts on their own, and binary diffs are useless unless you already have the starting state. The subtasks here suggest that handling of binary files leaves much to be desired.


See PHI1218. When diff A removes "+x" from a file and diff B undoes the change (so the file still has "+x"), then you diff B vs A, you get:

  • before D20478, a fully incorrect display of property changes in the UI (verbatim changes from A).
  • after D20478, a half-incorrect display of property changes in the UI, because the file has no mode data in diff B.

To fix this, we need to always provide mode data.

We can parse this out of Git diffs by examining index 35793a2108..fe6163feb9 100644 lines, or determine it from the working copy in the case that we have a working copy.

See also PHI1299.

Related Objects

StatusAssignedTask
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
Openepriestley
ResolvedNone
OpenNone
OpenNone
Openepriestley
Openepriestley
Openepriestley
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
Openepriestley

Event Timeline

epriestley updated the task description. (Show Details)May 2 2017, 5:12 PM
epriestley added a subscriber: cspeckmim.
joshuaspence added a subscriber: joshuaspence.

Not sure if I should make a separate issue for this or just mention it here, but it would be nice if you could parse the "signature" of a patch generated with git format-patch. Currently patches generated with git format-patch cannot even be uploaded to Phabricator unless you set git config format.signature=""

This seems like a pretty trivial fix, as there is only 1 line at the bottom of the file that confuses Phabricator's upload process.

epriestley added a subscriber: bcooksley.

T13023 provides an example of a synthetic "git-like" diff generated by some tool (dpkg-source?) which appears (?) to pretend to generate Git diffs (diff --git in the text).

epriestley moved this task from Backlog to Diff Parsing on the Arcanist board.Mar 5 2018, 2:16 PM
epriestley updated the task description. (Show Details)Apr 25 2019, 10:11 PM
epriestley updated the task description. (Show Details)Tue, Jun 11, 6:18 PM