Page MenuHomePhabricator

Implementation of PhutilLexer for Python
ClosedPublic

Authored by sophiebits on Jun 4 2014, 3:10 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Nov 24, 7:53 AM
Unknown Object (File)
Oct 26 2024, 11:15 AM
Unknown Object (File)
Oct 22 2024, 2:53 AM
Unknown Object (File)
Oct 19 2024, 5:34 PM
Unknown Object (File)
Oct 11 2024, 10:28 PM
Unknown Object (File)
Oct 11 2024, 10:02 PM
Unknown Object (File)
Oct 11 2024, 10:01 PM
Unknown Object (File)
Oct 11 2024, 10:01 PM

Details

Summary

This appears to be ~1.5-2.0x faster than pygments on my machine, which isn't as good as I was hoping for but will help. (This is ~330 ms for a 4400-line file I chose at random.)

Test Plan

Loaded a Python paste and saw highlights that approximately (or maybe exactly) match the pygments version.

Diff Detail

Repository
rPHU libphutil
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sophiebits retitled this revision from to Implementation of PhutilLexer for Python.
sophiebits updated this object.
sophiebits edited the test plan for this revision. (Show Details)
sophiebits added a reviewer: epriestley.
src/lexer/PhutilPythonFragmentLexer.php
2

test comment, please ignore

sophiebits edited edge metadata.

Oops, don't allow newlines in non-triple-quoted strings

epriestley edited edge metadata.

This stuff is pretty hard to review comprehensively, but we can tweak it and start adding unit tests or whatever if there are issues. It looks structurally correct to me, and I didn't catch anything suspicious looking.

Thanks for putting this together!

This revision is now accepted and ready to land.Jun 5 2014, 6:39 PM
epriestley updated this revision to Diff 22409.

Closed by commit rPHU9b2f35480dc0 (authored by @spicyj, committed by @epriestley).

csilvers added inline comments.
src/lexer/PhutilPythonFragmentLexer.php
219

I was just looking over this due to a recent mention in a chat, and while I don't understand what's going on here, I can't help but wonder if this should be \\\\" instead of \\\\\'.

I'm not entirely sure how to test the rule, but the upstream seems to use ":

https://bitbucket.org/birkenfeld/pygments-main/src/7941677dc77d4f2bf0bbd6140ade85a9454b8b80/pygments/lexers/python.py?at=default&fileviewer=file-view-default#python.py-220

If anyone can show me a Python file where this matters, I'm happy to fix it and put some test coverage on it.

This was my best guess:

maverick.py
print r"Tom \"Maverick\" Cruise"
print r'Tom \'Maverick\' Cruise'

...but neither string highlights specially and the runtime behavior just makes me confused about why Python has this feature:

$ python maverick.py
Tom \"Maverick\" Cruise
Tom \'Maverick\' Cruise

...what? Why?

This was my best guess:

maverick.py
print r"Tom \"Maverick\" Cruise"
print r'Tom \'Maverick\' Cruise'

...but neither string highlights specially and the runtime behavior just makes me confused about why Python has this feature:

$ python maverick.py
Tom \"Maverick\" Cruise
Tom \'Maverick\' Cruise

...what? Why?

Try taking out the r prefix. It's for "raw strings", in which backslashes are interpreted literally. This is mostly used for regexps (python does not support /.../ strings, which are how "raw" strings are implemented in most languages).

Oh, my assumption was that the raw part was important because of the comment ("// included here for raw strings").

Non-raw strings seem to work correctly without changes (see inline about why):

Screen Shot 2017-05-18 at 5.32.50 PM.png (153×328 px, 17 KB)

src/lexer/PhutilPythonFragmentLexer.php
306

Additional normal escaping rules get merged in here, and seem to handle things in normal strings.

Oh, my assumption was that the raw part was important because of the comment ("// included here for raw strings").

Non-raw strings seem to work correctly without changes (see inline about why):

Screen Shot 2017-05-18 at 5.32.50 PM.png (153×328 px, 17 KB)

OK, as I say I have no idea what the line in question is doing. But I do think it's probably wrong. It may just be wrong in a way that doesn't matter.

Yeah, I think it's wrong but I'm not sure if the right version is to use a " or to remove it completely, since I can't find an input which it does anything for.