Detection of language for some file types isn't good
Open, NormalPublic
Actions

Assigned To

Authored By

	epriestley
	Jul 26 2013, 2:06 PM

Description

We don't detect the language of files named "Makefile" correctly. Pygments does; it looks like the rule Pygments uses is "if there is no dot in the filename, use the whole filename", since the string "makefile" does not appear in the Pygments source so I believe this isn't a special case.

However, our logic currently looks like this:

$lang = detect_from_filename($filename);
if ($lang === null) {
  $lang = detect_from_content($data);
}

If we add the "use the whole filename", we'll always get a language out of detect_from_filename(), so we'll never invoke the data-based language guesser. The logic needs to look like this instead:

$lang = detect_from_filename($filename);
if (!is_valid_language($lang)) {
  $data_lang = detect_from_content($data);
  if (is_valid_langauge($data_lang)) {
    $lang = $data_lang;
  }
}

Implementing is_valid_language() is a bit tricky. A language is valid if we support it explicitly, or if Pygments supports it. We can get a list of langauges Pygments supports with pygmentize -L lexers, but we don't want to run or parse this every time. The list may change based on Pygments versions, too.

I think the simplest approach is:

Just copy-paste the list out of modern Pygments into libphutil, and we'll update it once a year or whatever (as new languages are created).
Map the lexers to human-readable language names.
Add the custom lexers we support (Rainbow, Invisible).

We can use this list to implement T832, too.

So, specifically, this would boil down to:

Add a public static getSupportedLanguages() to PhutilDefaultSyntaxHighlighterEngine.
This returns a big hard-coded map of Pygments languages plus additional languages.
Implement the modified check above ("if filename language isn't valid..").
Add a rule to PhutilDefaultSyntaxHighlighterEngine->getLanguageFromFilename() which uses the entire filename if nothing matches (e.g., return basename($filename); instead of return null; at the end).

If you came here looking for a short answer:

You can update your configuration to do better:
Config → Syntax Highlighting → syntax.filemap
You can add/update it to add your favorite missing language (By filename).

The upstream is very reluctant to just update the existing hard-coded map with new values, so don't try to submit patches for it.

Related Objects
Search...

Status	Assigned	Task
Resolved	epriestley	T5644 Add controls to prevent `pygmentize` from saturating system resources
Resolved	epriestley	T5179 Unearth Differential rendering technical debt
		Restricted Maniphest Task
Open	joshuaspence	T3626 Detection of language for some file types isn't good

Event Timeline

epriestley triaged this task as Normal priority.Jul 26 2013, 2:06 PM

epriestley added a project: Badge Awarded.

epriestley added subscribers: epriestley, staticshock.

epriestley edited this Maniphest Task.Jul 26 2013, 2:06 PM

pygments allows for lexers to be provided as plugins, as well. We use
pygments-bitbake, for instance. It would be a shame is_valid_language were to mark bitbake as invalid just because your hard-coded list doesn't know about it.

The suggested structure above will still pass an "invalid" lexer to Pygments as long as we don't find a match in the file content, so it should accommodate plugins. The only problem would be, e.g., a ".bb" file which started "#!/usr/bin/php", which we would detect as PHP based on the file content. But I don't think we can do much better than that without parsing the available lexers out of Pygments, which seems like a mess.

Why not cache the output of pygmentize -L lexers at runtime and invalidate that once a month or so?

It adds a lot of complexity and I don't think we get much benefit. We don't have access to a reasonable cache in libphutil.

◀ Merged tasks: T4185.

epriestley removed a project: Badge Awarded.Nov 12 2014, 1:11 PM

chad added a project: Differential.Dec 30 2014, 5:35 PM

joshuaspence added a subscriber: joshuaspence.Dec 31 2014, 12:14 AM

devurandom added a subscriber: devurandom.Jan 26 2015, 2:20 PM

We already have major parts of this map in getPygmentsLexerNameFromLanguageName().

In fixing T7456, I wrote part of a parser for pygmetize -L lexers. However, this turned out to be pretty messy. In particular, there are a bunch of rules which we'd need to hard code, like ".m" mapping to objective-c. This is an implicit rule which we can't infer from pygmentize -L lexers.

Hard-coding this list will also produce some questionable or wrong results across various versions of pygments, as lexer aliases are added.

We can use pygmentize -N filename to build this map at runtime, but invoking pygmentize costs about 100ms. This tends to exacerbate T5644. The price we pay to run pygmentize is already very high.

Possibly we could hard-code the top N maps (i.e., all the files anyone actually uses or has ever heard of) and then fall back to pygmentize -N for the rest of them, and accept the performance hit for highlighting weird files.

Parser diff: https://secure.phabricator.com/differential/diff/28809/

epriestley mentioned this in D11967: Support VHDL file extension in pygments.Mar 4 2015, 3:52 PM

epriestley moved this task from Backlog to v2 (UI/Mobile) on the Differential board.Mar 4 2015, 5:11 PM

epriestley mentioned this in D13036: Framework for external symbol search.May 28 2015, 2:53 PM

joshuaspence claimed this task.May 30 2015, 6:37 AM

epriestley mentioned this in D14376: Allow XHPAST to be built without flex and bison.Nov 1 2015, 11:58 PM

avivey changed the visibility from "All Users" to "Public (No Login Required)".Nov 1 2015, 11:59 PM

eadler added a project: Restricted Project.Jan 11 2016, 9:38 PM

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 11 2016, 9:42 PM

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:15 PM

Herald added a subscriber: eadler. · View Herald TranscriptJul 4 2016, 9:15 PM

avivey mentioned this in D16428: Add support for synta highlighting kotlin files.Oct 15 2016, 5:53 AM

epriestley merged a task: T11857: Allow .rs source code files to be parsed by Rust.Nov 14 2016, 4:31 PM

epriestley added a subscriber: dereckson.

chad merged a task: T11862: Makefile syntax highlighting isn't auto-detected.Nov 15 2016, 2:22 AM

chad added a subscriber: samking.

In a similar approach, this code just makes pygmentize spit out it's data in a nicer format:

from pygments.lexers import get_all_lexers
map = {}
for fullname, names, globs, _ in get_all_lexers():
    for glob in globs:        
        if glob in map:
            print 'duplicate glob %s' % glob
        else:
            map[glob] = fullname
for glob, name in map.iteritems():
    print '%s => %s' % (glob, name)

(It can just import pygments if it's installed using pip or whatever).
However, it leaves a lot of duplicate globs:

$ ./foo.py  | g -c 'duplicate glob'
63

epriestley mentioned this in D16856: Allow Pygments to recognize .rs files as Rust.Jan 11 2017, 3:29 PM

avivey renamed this task from Detection of language for "Makefile" isn't good to Detection of language for some file types isn't good.Mar 14 2017, 8:56 PM

urzds added a subscriber: urzds.Jul 12 2017, 11:14 AM

epriestley added a project: Customer Impact.Jul 24 2017, 3:30 PM

We've had two customers hit this (both for Kotlin, although a config workaround is available).

And one more.

And one more, also for Kotlin.

avivey updated the task description. (Show Details)Oct 31 2017, 9:09 AM

epriestley mentioned this in D19273: Make Jupyter notebooks use the fast builtin Python highlighter.Mar 30 2018, 4:48 PM

When this eventually moves forward, it's probably also appropriate to remove the syntax-highlighter.engine config option. I think this has essentially zero use in the wild and we have better modern patterns to approach the problem (without extreme-niche config options) if there are real-world use cases.

epriestley mentioned this in rPccbc8a430f62: Make Jupyter notebooks use the fast builtin Python highlighter.Mar 30 2018, 6:26 PM

epriestley renamed this task from Detection of language for some file types isn't good to .Apr 12 2018, 2:28 PM

epriestley created this task.Apr 12 2018, 2:30 PM

Oh, good, it's about time this task be created!

Yeah, this is a combination of a bulk editor bug (which lets you delete task titles) and legacy compatibility code for transaction title rendering (which treats null -> anything transactions as "created this thing.").

epriestley mentioned this in T13122: "Continue on missing fields" in Bulk Edit allows removal of required fields.Apr 12 2018, 6:42 PM

epriestley mentioned this in D20231: When highlighting source, catch "python" for the PHP parser.Feb 28 2019, 8:57 PM

epriestley mentioned this in rPHUf434f57578dd: When highlighting source, catch "python" for the PHP parser.Mar 1 2019, 3:51 AM

	F5519965: image.png
	Apr 12 2018, 3:04 PM

Detection of language for some file types isn't goodOpen, NormalPublicActions

Description

If you came here looking for a short answer:

Related ObjectsSearch...

Event Timeline

Detection of language for some file types isn't good
Open, NormalPublic
Actions

Related Objects
Search...