Page MenuHomePhabricator

Detection of language for some file types isn't good
Open, NormalPublic

Description

We don't detect the language of files named "Makefile" correctly. Pygments does; it looks like the rule Pygments uses is "if there is no dot in the filename, use the whole filename", since the string "makefile" does not appear in the Pygments source so I believe this isn't a special case.

However, our logic currently looks like this:

$lang = detect_from_filename($filename);
if ($lang === null) {
  $lang = detect_from_content($data);
}

If we add the "use the whole filename", we'll always get a language out of detect_from_filename(), so we'll never invoke the data-based language guesser. The logic needs to look like this instead:

$lang = detect_from_filename($filename);
if (!is_valid_language($lang)) {
  $data_lang = detect_from_content($data);
  if (is_valid_langauge($data_lang)) {
    $lang = $data_lang;
  }
}

Implementing is_valid_language() is a bit tricky. A language is valid if we support it explicitly, or if Pygments supports it. We can get a list of langauges Pygments supports with pygmentize -L lexers, but we don't want to run or parse this every time. The list may change based on Pygments versions, too.

I think the simplest approach is:

  • Just copy-paste the list out of modern Pygments into libphutil, and we'll update it once a year or whatever (as new languages are created).
  • Map the lexers to human-readable language names.
  • Add the custom lexers we support (Rainbow, Invisible).

We can use this list to implement T832, too.

So, specifically, this would boil down to:

  • Add a public static getSupportedLanguages() to PhutilDefaultSyntaxHighlighterEngine.
  • This returns a big hard-coded map of Pygments languages plus additional languages.
  • Implement the modified check above ("if filename language isn't valid..").
  • Add a rule to PhutilDefaultSyntaxHighlighterEngine->getLanguageFromFilename() which uses the entire filename if nothing matches (e.g., return basename($filename); instead of return null; at the end).

If you came here looking for a short answer:

You can update your configuration to do better:
ConfigSyntax Highlightingsyntax.filemap
You can add/update it to add your favorite missing language (By filename).

The upstream is very reluctant to just update the existing hard-coded map with new values, so don't try to submit patches for it.

Event Timeline

epriestley triaged this task as Normal priority.Jul 26 2013, 2:06 PM
epriestley added a project: Badge Awarded.
epriestley added subscribers: epriestley, staticshock.
epriestley edited this Maniphest Task.Jul 26 2013, 2:06 PM

pygments allows for lexers to be provided as plugins, as well. We use
pygments-bitbake, for instance. It would be a shame is_valid_language were to mark bitbake as invalid just because your hard-coded list doesn't know about it.

The suggested structure above will still pass an "invalid" lexer to Pygments as long as we don't find a match in the file content, so it should accommodate plugins. The only problem would be, e.g., a ".bb" file which started "#!/usr/bin/php", which we would detect as PHP based on the file content. But I don't think we can do much better than that without parsing the available lexers out of Pygments, which seems like a mess.

Why not cache the output of pygmentize -L lexers at runtime and invalidate that once a month or so?

It adds a lot of complexity and I don't think we get much benefit. We don't have access to a reasonable cache in libphutil.

◀ Merged tasks: T4185.

We already have major parts of this map in getPygmentsLexerNameFromLanguageName().

In fixing T7456, I wrote part of a parser for pygmetize -L lexers. However, this turned out to be pretty messy. In particular, there are a bunch of rules which we'd need to hard code, like ".m" mapping to objective-c. This is an implicit rule which we can't infer from pygmentize -L lexers.

Hard-coding this list will also produce some questionable or wrong results across various versions of pygments, as lexer aliases are added.

We can use pygmentize -N filename to build this map at runtime, but invoking pygmentize costs about 100ms. This tends to exacerbate T5644. The price we pay to run pygmentize is already very high.

Possibly we could hard-code the top N maps (i.e., all the files anyone actually uses or has ever heard of) and then fall back to pygmentize -N for the rest of them, and accept the performance hit for highlighting weird files.

avivey changed the visibility from "All Users" to "Public (No Login Required)".Nov 1 2015, 11:59 PM
eadler added a project: Restricted Project.Jan 11 2016, 9:38 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 11 2016, 9:42 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:15 PM
avivey added a subscriber: avivey.Nov 15 2016, 10:23 PM

In a similar approach, this code just makes pygmentize spit out it's data in a nicer format:

from pygments.lexers import get_all_lexers
map = {}
for fullname, names, globs, _ in get_all_lexers():
    for glob in globs:        
        if glob in map:
            print 'duplicate glob %s' % glob
        else:
            map[glob] = fullname
for glob, name in map.iteritems():
    print '%s => %s' % (glob, name)

(It can just import pygments if it's installed using pip or whatever).
However, it leaves a lot of duplicate globs:

$ ./foo.py  | g -c 'duplicate glob'
63
avivey renamed this task from Detection of language for "Makefile" isn't good to Detection of language for some file types isn't good.Mar 14 2017, 8:56 PM
urzds added a subscriber: urzds.Jul 12 2017, 11:14 AM

We've had two customers hit this (both for Kotlin, although a config workaround is available).

And one more.

And one more, also for Kotlin.

avivey updated the task description. (Show Details)Oct 31 2017, 9:09 AM

When this eventually moves forward, it's probably also appropriate to remove the syntax-highlighter.engine config option. I think this has essentially zero use in the wild and we have better modern patterns to approach the problem (without extreme-niche config options) if there are real-world use cases.

epriestley renamed this task from Detection of language for some file types isn't good to .Apr 12 2018, 2:28 PM

Oh, good, it's about time this task be created!

Yeah, this is a combination of a bulk editor bug (which lets you delete task titles) and legacy compatibility code for transaction title rendering (which treats null -> anything transactions as "created this thing.").