When parsing diffs, we currently do not extract metadata, particularly:
# Author information (name and email address).
# Base/parent commit information.
# We do not parse multiple commits from formats which support multiple commits.
Extracting this metadata likely involves meaningfully reworking how changes are stored internally, particularly for point (3), as a parse must now be able to emit multiple commits even if we immediately raise this state as an error to the user.
Additionally, there are a large number of existing `arc diff` or `arc patch` bugs, many of which are obscure but many of which are likely to be entangled with the parsing phase. It may be possible to skip many of these, but at least some are likely tied to parsing. Particularly, I believe T1022 (one of the few Git bugs) to be significantly difficult to resolve, and T9069 to be deeply entangled (and impacting at least one SAAS customer).
Parsing is particularly complex because it does not, in the general case, convert a text input into a structured output. It converts a repository into a structured output, sometimes by generating synthetic data by running additional commands. In particular, we attempt to generate full binaries for each side of a binary change because users can never resolve binary conflicts on their own, and binary diffs are useless unless you already have the starting state. The subtasks here suggest that handling of binary files leaves much to be desired.
---
See PHI1218. When diff A removes "+x" from a file and diff B undoes the change (so the file still has "+x"), then you diff B vs A, you get:
- before D20478, a fully incorrect display of property changes in the UI (verbatim changes from A).
- after D20478, a half-incorrect display of property changes in the UI, because the file has no mode data in diff B.
To fix this, we need to always provide mode data.
We can parse this out of Git diffs by examining `index 35793a2108..fe6163feb9 100644` lines, or determine it from the working copy in the case that we have a working copy.