When parsing diffs, we currently do not extract metadata, particularly:
# Author information (name and email address).
# Base/parent commit information.
# Multiple commits, from formats which support multiple commits.
Extracting this metadata likely requires substantially reworking how changes are stored internally. This is particularly true for point (3): a parse must now be able to emit multiple commits, even if we immediately raise that state as an error to the user.
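As a rough illustration of what "a parse emits multiple commits" could mean, here is a minimal sketch in Python (not the language `arc` is written in). It assumes input shaped like `git format-patch --stdout` output, where each message starts with a `From <hash> <date>` separator line and carries a `From:` author header. The names `ParsedCommit`, `parse_commits`, `parse_single_commit`, and `MultipleCommitsError` are hypothetical and only show the shape of the data, not a proposed implementation.

```python
import re
from dataclasses import dataclass, field
from email.utils import parseaddr
from typing import List, Optional


@dataclass
class ParsedCommit:
    """One commit emitted by the parser (hypothetical structure)."""
    commit_hash: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[str] = None
    base_commit: Optional[str] = None  # not populated in this sketch
    body_lines: List[str] = field(default_factory=list)


class MultipleCommitsError(Exception):
    """Raised when a caller needs exactly one commit but the input has more."""


# Assumption: each message begins with "From <40 hex chars> <date>".
SEPARATOR = re.compile(r"^From ([0-9a-f]{40}) ")


def parse_commits(text: str) -> List[ParsedCommit]:
    """Always return a list of commits, even if callers only accept one."""
    commits: List[ParsedCommit] = []
    current: Optional[ParsedCommit] = None
    in_headers = False
    for line in text.splitlines():
        match = SEPARATOR.match(line)
        if match:
            current = ParsedCommit(commit_hash=match.group(1))
            commits.append(current)
            in_headers = True
            continue
        if current is None:
            continue  # ignore anything before the first separator
        if in_headers:
            if line.startswith("From: "):
                name, email = parseaddr(line[len("From: "):])
                current.author_name, current.author_email = name, email
            elif line == "":
                in_headers = False  # a blank line ends the header block
            continue
        current.body_lines.append(line)
    return commits


def parse_single_commit(text: str) -> ParsedCommit:
    """Parse fully first, then surface the multi-commit state as an error."""
    commits = parse_commits(text)
    if not commits:
        raise ValueError("no commits found in input")
    if len(commits) > 1:
        raise MultipleCommitsError(
            "input contains %d commits; expected exactly one" % len(commits))
    return commits[0]
```

The point of the split is that the parser itself never has to know whether the caller supports multiple commits. The error is raised one layer up, so the restriction can later be lifted without touching the parse.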
Additionally, there are a large number of existing `arc diff` and `arc patch` bugs. Many of these are obscure, but many are likely to be entangled with the parsing phase. It may be possible to skip many of them, but at least some are likely tied to parsing. In particular, I believe T1022 (one of the few Git bugs) will be quite difficult to resolve, and T9069 is deeply entangled (and affects at least one SaaS customer).
Parsing is particularly complex because it does not, in the general case, convert a text input into a structured output. It converts a repository into a structured output, sometimes by generating synthetic data by running additional commands. In particular, we attempt to generate full binaries for each side of a binary change because users can never resolve binary conflicts on their own, and binary diffs are useless unless you already have the starting state. The subtasks here suggest that handling of binary files leaves much to be desired.
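To make the "repository in, structured output out" point concrete, here is a small Python sketch of capturing both sides of a binary change by asking the repository for the file contents. The commands `arc` actually runs are not stated here. This sketch assumes Git and uses `git cat-file blob <rev>:<path>`, and the names `BinaryChange`, `git_read_blob`, and `build_binary_change` are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


def git_read_blob(rev: str, path: str) -> Optional[bytes]:
    """Return the bytes of `path` at `rev`, or None if the lookup fails.

    Simplification: any failure (file absent at that revision, bad
    revision, not a repository) is treated as "no content on this side".
    """
    result = subprocess.run(
        ["git", "cat-file", "blob", "%s:%s" % (rev, path)],
        capture_output=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout


@dataclass
class BinaryChange:
    """Both full sides of a binary change (hypothetical structure)."""
    path: str
    old_data: Optional[bytes]  # None if the file was added
    new_data: Optional[bytes]  # None if the file was removed


def build_binary_change(
    path: str,
    old_rev: str,
    new_rev: str,
    read_blob: Callable[[str, str], Optional[bytes]] = git_read_blob,
) -> BinaryChange:
    """Synthesize data the textual diff does not contain.

    The reader is injectable so the repository access can be replaced
    or tested without a working copy.
    """
    return BinaryChange(
        path=path,
        old_data=read_blob(old_rev, path),
        new_data=read_blob(new_rev, path),
    )
```

Note that nothing here consumes diff text at all: the inputs are a path and two revisions, which is why this phase cannot be treated as a pure text-to-structure conversion.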