Page MenuHomePhabricator

Implementation of PhutilLexer for Python
ClosedPublic

Authored by sophiebits on Jun 4 2014, 3:10 AM.

Details

Summary

This appears to be ~1.5-2.0x faster than pygments on my machine, which isn't as good as I was hoping for but will help. (This is ~330 ms for a 4400-line file I chose at random.)

Test Plan

Loaded a Python paste and saw highlights that approximately (or maybe exactly) match the pygments version.

Diff Detail

Repository
rPHU libphutil
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

sophiebits updated this revision to Diff 22333.Jun 4 2014, 3:10 AM
sophiebits retitled this revision from to Implementation of PhutilLexer for Python.
sophiebits updated this object.
sophiebits edited the test plan for this revision. (Show Details)
sophiebits added a reviewer: epriestley.
sophiebits added inline comments.Jun 4 2014, 7:37 PM
src/lexer/PhutilPythonFragmentLexer.php
2

test comment, please ignore

sophiebits updated this revision to Diff 22340.Jun 4 2014, 7:50 PM
sophiebits edited edge metadata.

Oops, don't allow newlines in non-triple-quoted strings

epriestley accepted this revision.Jun 5 2014, 6:39 PM
epriestley edited edge metadata.

This stuff is pretty hard to review comprehensively, but we can tweak it and start adding unit tests or whatever if there are issues. It looks structurally correct to me, and I didn't catch anything suspicious looking.

Thanks for putting this together!

This revision is now accepted and ready to land.Jun 5 2014, 6:39 PM
epriestley closed this revision.Jun 5 2014, 6:42 PM
epriestley updated this revision to Diff 22409.

Closed by commit rPHU9b2f35480dc0 (authored by @spicyj, committed by @epriestley).

csilvers added inline comments.
src/lexer/PhutilPythonFragmentLexer.php
219

I was just looking over this due to a recent mention in a chat, and while I don't understand what's going on here, I can't help but wonder if this should be \\\\" instead of \\\\\'.

I'm not entirely sure how to test the rule, but the upstream seems to use ":

https://bitbucket.org/birkenfeld/pygments-main/src/7941677dc77d4f2bf0bbd6140ade85a9454b8b80/pygments/lexers/python.py?at=default&fileviewer=file-view-default#python.py-220

If anyone can show me a Python file where this matters, I'm happy to fix it and put some test coverage on it.

This was my best guess:

maverick.py
print r"Tom \"Maverick\" Cruise"
print r'Tom \'Maverick\' Cruise'

...but neither string highlights specially and the runtime behavior just makes me confused about why Python has this feature:

$ python maverick.py
Tom \"Maverick\" Cruise
Tom \'Maverick\' Cruise

...what? Why?

This was my best guess:

maverick.py
print r"Tom \"Maverick\" Cruise"
print r'Tom \'Maverick\' Cruise'

...but neither string highlights specially and the runtime behavior just makes me confused about why Python has this feature:

$ python maverick.py
Tom \"Maverick\" Cruise
Tom \'Maverick\' Cruise

...what? Why?

Try taking out the r prefix. It's for "raw strings", in which backslashes are interpreted literally. This is mostly used for regexps (python does not support /.../ strings, which are how "raw" strings are implemented in most languages).

Oh, my assumption was that the raw part was important because of the comment ("// included here for raw strings").

Non-raw strings seem to work correctly without changes (see inline about why):

src/lexer/PhutilPythonFragmentLexer.php
306

Additional normal escaping rules get merged in here, and seem to handle things in normal strings.

Oh, my assumption was that the raw part was important because of the comment ("// included here for raw strings").

Non-raw strings seem to work correctly without changes (see inline about why):

OK, as I say I have no idea what the line in question is doing. But I do think it's probably wrong. It may just be wrong in a way that doesn't matter.

Yeah, I think it's wrong but I'm not sure if the right version is to use a " or to remove it completely, since I can't find an input which it does anything for.