Page MenuHomePhabricator

Under Mercurial, `arc patch` mishandles files with spaces in them
Closed, ResolvedPublic

Description

See PHI294. The behavior of Git and Mercurial differs when applying patches like this:

diff --git a/X Y.txt b/X Y.txt
new file mode 100644
--- /dev/null
+++ b/X Y.txt
@@ -0,0 +1 @@
+quack

Git creates a file named X Y.txt.

Mercurial creates a filed named X.

So this is something Mercurial could perhaps improve on. However, Git can't generate this diff itself, so this is our fault (see also T13023).

Instead, Git generates this line:

+++ b/I Love Spaces.txt\t

That is, the filename is terminated with a tab literal. This only happens sometimes. We need to:

  • Figure out what the rule for adding a tab is.
  • Add those tabs ourselves.
  • Probably get some test coverage.
  • Maybe file a bug with Mercurial; this isn't really a bug and definitely isn't very important, but maybe worth a mention. It seems reasonable to imagine that semantic trailing tabs might sometimes lead to confusing behavior.

I wasn't able to find any existing similar issues in the Mercurial bugtracker by searching for spaces, tab, or import.


Original Description

In a Mercurial repository, I got a differential revision with "Logo A.svg" and "Logo B.svg".

When we try arc patch D..., we have the following message:

Patch Failed! 
Exception
Command failed with error #255!
COMMAND
HGPLAIN=1 hg import --no-commit -

STDOUT
applying patch from stdin


STDERR
file logos/2015-refresh/Logo already exists
1 out of 1 hunks FAILED -- saving rejects to file logos/2015-refresh/Logo.rej
abort: patch failed to apply

Event Timeline

dereckson renamed this task from `arc patch` fails to add add two files starting by the same word, then a space on hg to `arc patch` fails to add two files starting by the same word, then a space on hg .
dereckson raised the priority of this task from to Needs Triage.
dereckson updated the task description. (Show Details)
dereckson added a subscriber: dereckson.

I am curious about this as well or if it has been addressed some other place. It's pretty easy to manually resolve the issue most of the time however I am not sure what the proper action would be after doing this? Does phabricator expect you to arc diff after you manually resolve it?

The issue I run into is that i'm applying a patch with a dependency and that is where the conflict is although it's not clear without looking at the patches separately. I am not sure and really wouldn't expect phabricator to be smart enough to apply this to the dependency instead of the current patch I'm trying to apply.

@michaeljs1990. You seem to have another issue.

Here, this is an avoidable conflict created by Arcanist itself, because filenames aren't protected by quotes.

epriestley renamed this task from `arc patch` fails to add two files starting by the same word, then a space on hg to Under Mercurial, `arc patch` mishandles files with spaces in them .Jan 16 2018, 3:42 PM
epriestley triaged this task as Normal priority.
epriestley updated the task description. (Show Details)
epriestley edited projects, added Git; removed Differential.

The rule Git uses appears to literally be "does the filename include a space":

diff.c:780
	name_a_tab = strchr(name_a, ' ') ? "\t" : "";
	name_b_tab = strchr(name_b, ' ') ? "\t" : "";

This change dates from 2006 and sounds like a GNU patch compatibility issue, although the GNU diff code looks like it always emits \t between other printable characters, which is at least less crazy. So I'm going to say this is mostly Git's fault, with a bit of blame on diff for using a tab separators as an encoding, although a commit from 1993 in GNU diff says:

+    /* See Posix.2 section 4.17.6.1.4 for this format.  */
+    fprintf (outfile, "%s %s\t%s",
+            mark, inf->name, ctime (&inf->stat.st_mtime));

I'm not sure how to go find the "POSIX.2" document, but the "\t" behavior was also present in 1991 when diff started using revision control. Maybe POSIX.2 just codified the behavior of diff, or maybe there's more blame to be shared here.

Anyway, the rule seems simple enough.

This is an explicit behavior in Mercurial and dates from 2007:

def parsefilename(str):
    # --- filename \t|space stuff
    s = str[4:].rstrip('\r\n')
    i = s.find('\t')
    if i < 0:
        i = s.find(' ')
        if i < 0:
            return s
    return s[:i]

https://phab.mercurial-scm.org/diffusion/HG/browse/default/mercurial/patch.py;da8bdeb1be28b976909a963c89e974264686e2bb$1463-1471

Git does a very large amount of magic, which looks like:

  • If the line ends in something that looks like a timestamp, throw that away.
  • If there's a tab, parse until the tag.
  • Take the whole line.

Also from Git:

* FIXME! The end-of-filename heuristics are kind of screwy. For existing
* files, we can happily check the index for a match, but for creating a
* new file we should try to match whatever "patch" does. I have no idea.

However, the timestamp explicitly requires a "\t" separator.

The Mercurial behavior comes from this hefty commit which just says "Add Chris Mason's mpatch library". The repository URL in that link is dead. Mason has a homepage for mpatch here:

https://oss.oracle.com/~mason/mpatch/

...but the repository link there goes to the same place and is also dead.

So it's unclear to me if there's any traceable provenance on this behavior.

I filed a summary of this in the Mercurial upstream to waste someone else's time so I feel better:

https://bz.mercurial-scm.org/show_bug.cgi?id=5771

Presumably they'll just close that as WONTFIX and we can all move on.

I'm not totally sure all variants of this are fixed, but I don't know how to reproduce any remaining issues.