
Skip pygmentize for large source and too long lines
AbandonedPublic

Authored by gd on Oct 14 2015, 12:14 PM.

Details

Summary

Skip pygmentize syntax highlighting when the source contains overly long lines, or when the source itself is too large.

See T9566 for more info.

Test Plan

Tested on Paste application manually with ~1MB JSON oneliner.

Diff Detail

Repository
rPHU libphutil
Branch
highlighter-timeout
Lint
Lint Passed
Unit
Test Failures
Build Status
Buildable 8272
Build 9474: arc lint + arc unit

Event Timeline

gd retitled this revision from to Skip pygmentize for large source and too long lines.
gd updated this object.
gd edited the test plan for this revision. (Show Details)

This is complex for a variety of reasons.

First, is this definitely an issue with line length, not just total input size? That is, does adding syntax-preserving linebreaks to the file reduce the pygmentize runtime substantially?

We already have a lower byte limit than this for Differential (DifferentialChangesetParser::HIGHLIGHT_BYTE_LIMIT, currently set to 262144, i.e., 256KB). There is a tiny amount of interest in raising this (see T7895), but it should never be possible to hit this 1MB limit from Differential or Diffusion. It is still possible to trigger highlighting of larger files from Paste, and probably from embedding things in comments in Remarkup.

Imposing arbitrary limits is messy and I'd like to avoid them if possible, especially if users can't override them. T7895 has a case of wanting to raise the default "click to highlight" limit above 256KB (even though there's a UI option available to activate highlighting). T5258 has cases where users want to spend a large number of seconds generating thumbnails from GIFs. Adding configuration options is also messy (see T8227). See also D8040 for a broader discussion of an iterative approach to scaling features like this up.

If we land this change, files over 1MB (or larger than 256KB with a line longer than 10KB) will show in Differential as "Highlighting for this very large file was disabled, click to highlight" -- but when the user clicks, the file will not highlight, even if their webserver is configured to allow an arbitrarily long execution time.

A simpler approach here is to just setTimeout(30) (for some value of 30) on the pygmentize ExecFuture. The highlight engine will fall back automatically. However, this doesn't resolve any of the arbitrariness of the limit. Installs can (and sometimes do) configure webservers so requests are permitted to run for a very long time, and sometimes desire this behavior.
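As a rough illustration of this timeout-and-fall-back pattern (outside of libphutil), here is a minimal Python sketch using subprocess as a stand-in for ExecFuture; the function name and command are hypothetical, not part of the actual codebase:

```python
import subprocess

def highlight_with_timeout(source, cmd, timeout=15):
    """Run a highlighter subprocess; fall back to the plain source on
    timeout or failure, mirroring the highlight engine's fallback."""
    try:
        result = subprocess.run(
            cmd,
            input=source,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        return result.stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Too slow or failed outright: return the unhighlighted source.
        return source
```

With pygmentize this might be invoked as something like `highlight_with_timeout(source, ['pygmentize', '-f', 'html', '-l', 'js'])`.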

Taking a very long time to highlight 1MB of JSON also seems silly, and like it might be a real issue with Pygments. This might be fixable in the Pygments upstream. More directly, we could write a lexer (see D9370, D3336 for previous, more complex lexers). These are somewhat complex but enormously faster than invoking a pygmentize subprocess (previously, we've seen both lower startup costs and much faster execution).

I'll accept these changes which unambiguously move us forward without creating problems:

  • Apply the same HIGHLIGHT_BYTE_LIMIT logic to Paste that we do to Differential: disable highlighting for files over that size, and provide an option to enable it by clicking a link on screen. ("The content of this paste is very large (1MB) so highlighting has been disabled. Enable Highlighting" or whatever the phrasing of the Differential prompt is).
  • Write a PHP JSON lexer to sidestep this particular niche issue. This isn't a very general solution, but might be a pragmatic one for the issues you're encountering and will improve performance and give us better behavior in a reasonable range of cases -- this is unambiguously desirable, just a priority/value question.

I generally don't want to pursue these sorts of changes:

  • Introduce a second limit which works like HIGHLIGHT_BYTE_LIMIT but doesn't use any of the same code.
  • Anything which breaks the existing HIGHLIGHT_BYTE_LIMIT feature by causing "Enable Highlighting" to sometimes not actually enable highlighting.
  • Introduce hard-coded limits which a reasonable user might want to raise and can't disable (some percentage of users will always want a higher limit).
  • Introduce new configuration options for this kind of thing (I'm very hesitant about these).

From a purely pragmatic viewpoint, this smaller change will probably accomplish what you want in a more general way:

diff --git a/src/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php b/src/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php
index 53f77b3..39409f5 100644
--- a/src/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php
+++ b/src/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php
@@ -24,6 +24,7 @@ final class PhutilPygmentsSyntaxHighlighter extends Phobject {
       $future = new ExecFuture(
         'pygmentize -O encoding=utf-8 -O stripnl=False -f html -l %s',
         $language);
+      $future->setTimeout(15);
       $scrub = false;
       if ($language == 'php' && strpos($source, '<?') === false) {
         $source = "<?php\n".$source;

You could conceivably retain that as a local patch, but I don't want to upstream that because it's arbitrary and interacts badly with "Enable Highlighting".


The test failure should also be fixed at HEAD -- builds are clean in the upstream:

https://secure.phabricator.com/diffusion/PHU/branches/master/

Thank you for the detailed response. I see your point.

Just to clarify:

First, is this definitely an issue with line length, not just total input size? That is, does adding syntax-preserving linebreaks to the file reduce the pygmentize runtime substantially?

I should have provided some concrete numbers in my original post -- sorry about that. Here is an example that shows the difference between one line and many lines:

$ echo '{' $(for i in $(seq 1 10000); do printf '"key%d":"value %d",' $i $i; done) '}' > one-line.json

$ sed -e 's/,/,\n/g' one-line.json > many-lines.json

$ du -h one-line.json many-lines.json
224K	one-line.json
236K	many-lines.json

$ wc -l one-line.json many-lines.json
     1 one-line.json
 10001 many-lines.json
 10002 total

$ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'js' < many-lines.json > /dev/null

real	0m0.379s
user	0m0.344s
sys	0m0.024s

$ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'js' < one-line.json > /dev/null

real	0m26.254s
user	0m11.281s
sys	0m14.597s

$ pygmentize -V
Pygments version 1.4, (c) 2006-2008 by Georg Brandl.

As you can see, the inputs are roughly the same size, but the newlines make all the difference (0.379s vs 26.254s). If I bump the size to ~1MB (~4 times bigger), I get 0m1.341s vs 8m57.407s (~18 times longer for the one-line case).

This is a bit of a contrived example, since real JSON input usually contains more content per line than short key/value pairs, so the many-lines case would also be somewhat slower in the real world, but I think this illustrates the line length issue quite well.

I guess it's not unreasonable for someone to paste some large one-line json blob (e.g. some response dump).

I can see that slow highlighting is a recurring problem and that Differential deals with it using HIGHLIGHT_BYTE_LIMIT. I must admit that giving the user a choice is more elegant, but it seems byte count is not the only variable here, and since the problem is more general than Differential, I'd like to solve it in a more general way.

I like the timeout approach you suggested. I wasn't aware of it. I think I'll go with that locally.

I understand why configuration options are not very welcome, but I'll ask anyway: what do you think about adding a config option for the timeout in Phabricator (say, "syntax-highlighter.timeout" or "pygments.timeout") that would be passed down from PhabricatorSyntaxHighlighter::newEngine()? I'm not sure if other highlighters have this ability (or this problem in the first place). If you think that would be acceptable, I could try implementing it.

Anyway, thank you again for the insight, and it is perfectly understandable if you don't want to introduce yet another option just for this.


The test failure should also be fixed at HEAD -- builds are clean in the upstream:

https://secure.phabricator.com/diffusion/PHU/branches/master/

The test failure might be caused by something on my local machine. It wasn't anything serious -- just one newline in the generated HTML output that the assertion did not expect.

I think this illustrates the line length issue quite well.

Yeah, I agree completely. Thanks for running the numbers!

(I wonder why Pygments is so much slower on the single line input...)

it seems byte count is not the only variable here, and since the problem is more general than Differential, I'd like to solve it in a more general way

Yeah, the "one big line is really slow" issue makes this more complex.

The advantage of a byte limit is that we don't have to run Pygments or consume resources to enforce it, and it does get the right result for most inputs. A big part of the motivating use case for HIGHLIGHT_BYTE_LIMIT was to prevent cases where a user loads a huge diff/revision with a lot of big files and the server spins up a ton of pygmentize processes which will take, say, 60s each to finish, since this tends to degrade host performance for everyone. If we ran them but cut them off after 15s, this would still slow the host down for everyone for 15s. Better than slowing it down for 60s+, but being able to avoid these in the first place is even better.

So I'd still want the byte limit as a general load reduction feature even if we had an additional, more general runtime limit to catch unusual cases like "big JSON string".

What do you think about adding a config option for the timeout in Phabricator

I think we could theoretically have a behavior like this which would be pretty reasonable in all cases without needing to introduce an option:

  • If the file is too large, disable highlighting and show "This file is large, Enable Highlighting".
  • Run pygmentize for up to 15s. If it fails, disable highlighting and show "This file is very expensive to highlight, Enable Highlighting"
  • When the user clicks either button, run pygmentize with a much longer cutoff (like 300s), just to prevent runaway subprocesses on truly huge inputs, or inputs which trigger other kinds of performance problems in pygmentize like "one line of JSON".
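The three-step behavior above can be sketched as a small decision function (a Python sketch; the function name is hypothetical, and the constants come from the numbers in this discussion):

```python
HIGHLIGHT_BYTE_LIMIT = 262144  # 256KB, as in DifferentialChangesetParser
SOFT_TIMEOUT = 15              # for the automatic highlighting attempt
HARD_TIMEOUT = 300             # after the user clicks "Enable Highlighting"

def choose_highlight_plan(source_len, user_clicked):
    """Return (should_run, timeout) for the pygmentize subprocess."""
    if user_clicked:
        # Explicit user request: allow a much longer cutoff, which only
        # exists to stop runaway subprocesses on truly huge inputs.
        return (True, HARD_TIMEOUT)
    if source_len > HIGHLIGHT_BYTE_LIMIT:
        # Don't even start: show "This file is large, Enable Highlighting".
        return (False, None)
    # Try automatically; on timeout, show "This file is very expensive
    # to highlight, Enable Highlighting".
    return (True, SOFT_TIMEOUT)
```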

However, this is way more complex than setTimeout(15). I'd be OK with upstreaming a huge static timeout (like 300) because it would be part of this solution eventually, but presumably that doesn't help in your case since the highlight time was less than 300 seconds, just longer than you wanted to wait for it or longer than your webserver is configured to permit. I don't want to upstream an option because I think we could do something like the above without needing one, it would just take a bit more work.

My principal concern is that if we did upstream an option with a default like 15 and no other changes, it would be hard for users to figure out why highlighting wasn't working in some cases, which tends to increase our support costs for the feature since we get reports like "highlighting doesn't work" and have to rule this out as a possible cause. So I'd want to be able to have the UI tell the user that they hit the cutoff if we did add this option, and once we can do that it's probably better to just implement the whole UI and tell the user "Enable Highlighting" instead of telling them "Go configure x.y to adjust the arbitrary cutoff."

(And ideally we'd also implement a better JSON highlighter too, since I think pasting a 1MB JSON response from an API endpoint is a reasonable use case which we should be able to handle gracefully -- it's a little unusual, but well within the realm of things the tool should reasonably be able to accommodate.)

(I wonder why Pygments is so much slower on the single line input...)

My best guess here would be memory allocations, but I didn't check the Pygments source, so I may be completely wrong.

The advantage of a byte limit is that we don't have to run Pygments or consume resources to enforce it, and it does get the right result for most inputs. A big part of the motivating use case for HIGHLIGHT_BYTE_LIMIT was to prevent cases where a user loads a huge diff/revision with a lot of big files and the server spins up a ton of pygmentize processes which will take, say, 60s each to finish, since this tends to degrade host performance for everyone. If we ran them but cut them off after 15s, this would still slow the host down for everyone for 15s. Better than slowing it down for 60s+, but being able to avoid these in the first place is even better.

So I'd still want the byte limit as a general load reduction feature even if we had an additional, more general runtime limit to catch unusual cases like "big JSON string".

Yeah, that makes sense.

My principal concern is that if we did upstream an option with a default like 15 and no other changes, it would be hard for users to figure out why highlighting wasn't working in some cases, which tends to increase our support costs for the feature since we get reports like "highlighting doesn't work" and have to rule this out as a possible cause. So I'd want to be able to have the UI tell the user that they hit the cutoff if we did add this option, and once we can do that it's probably better to just implement the whole UI and tell the user "Enable Highlighting" instead of telling them "Go configure x.y to adjust the arbitrary cutoff."

If we are considering adding a configuration option, I'd suggest setting it to 0 (i.e., no limit) or something really big (e.g., 300) by default, and letting users discover it and set a limit if they run into problems. If it were called pygments.timeout, it should be pretty easy to discover. We could even point to it from the description of pygments.enable.


(And ideally we'd also implement a better JSON highlighter too, since I think pasting a 1MB JSON response from an API endpoint is a reasonable use case which we should be able to handle gracefully -- it's a little unusual, but well within the realm of things the tool should reasonably be able to accommodate.)

It seems this issue is not specific to JSON. I've just checked 'xml' and 'cpp' highlighting with pygmentize using a similar method as above, and it was even worse than 'json' for one-liners of similar size:

$ echo '<root>' $(for i in {1..10000}; do printf '<child id="%d">value  %d</child>' $i $i; done) '</root>' > one-line.xml

$ sed -e 's/></>\n</g' one-line.xml > many-lines.xml

$ du -h *.xml
292K	many-lines.xml
284K	one-line.xml

$ wc -l *.xml
 10000 many-lines.xml
     1 one-line.xml
 10001 total

$ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'xml' < many-lines.xml > /dev/null

real	0m0.297s
user	0m0.272s
sys	0m0.012s

$ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'xml' < one-line.xml > /dev/null

real	0m55.932s
user	0m24.862s
sys	0m30.398s

I think XML is another likely candidate for the "pasting a 1MB response from an API endpoint" use case.

Ideally this should be fixed in Pygments core, but in the meantime it would be great to have some limiting in place to keep Phabricator functioning when someone does something like that.


(Another option would be to do the highlighting client-side, but I guess there are completely valid reasons that rule that out.)

Before digging into this too much, I'm getting a 10x faster runtime on my laptop with a more recent version of pygmentize:

epriestley@orbital ~/dev/libphutil $ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'js' < one-line.json > /dev/null

real	0m4.908s
user	0m3.859s
sys	0m1.040s
epriestley@orbital ~/dev/libphutil $ pygmentize -V
Pygments version 2.0.1, (c) 2006-2014 by Georg Brandl.

I also wrote a JSON lexer -- it probably needs a little more work, but it's ~25 lines long -- and that's ~100x faster than pygmentize:

$ time ./scripts/test/highlight.php --language json < one-line.json >/dev/null

real	0m0.857s
user	0m0.809s
sys	0m0.045s

Here are a couple of the outputs -- the "\u..." rule could be better but the rest seems OK at first glance:

Screen Shot 2015-10-16 at 7.33.41 AM.png (1×1 px, 533 KB)

Screen Shot 2015-10-16 at 7.38.26 AM.png (1×1 px, 150 KB)
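The actual lexer is PHP, but the general single-pass, regex-alternation shape such a small fragment lexer can take looks roughly like this (a Python sketch; the token names and patterns are illustrative, and the "\u..." escape rule mentioned above is omitted):

```python
import re

# One alternation regex; each alternative is classified by which
# named group matched. Order matters: strings before keywords, etc.
TOKEN_RE = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")
  | (?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<keyword>true|false|null)
  | (?P<punct>[{}\[\]:,])
  | (?P<space>\s+)
''', re.VERBOSE)

def tokenize(source):
    """Split JSON source into (token_type, text) pairs in one pass."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            # Unrecognized character: emit an error token and continue.
            tokens.append(('error', source[pos]))
            pos += 1
            continue
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

A highlighter would then map each token type to a CSS class; because this is a single linear scan, runtime is unaffected by how the input is split into lines.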

Oh, the runtime difference may be largely due to Pygments 1.5 adding a dedicated JSON lexer, which you presumably don't have in Pygments 1.4. I don't see any other obvious "made everything 10x faster" notes in the changelog between Pygments 1.4 and HEAD. My laptop is also probably a bit faster than commodity server hardware, if that's what you were running on, but presumably not 10x faster.

However, I don't quite get a 10x speedup on XML, but I do get ~6x:

$ time pygmentize -O encoding=utf-8 -O stripnl=False -f html -l 'xml' < one-line.xml > /dev/null

real	0m9.209s
user	0m7.134s
sys	0m2.031s

So maybe some of this is just pygmentize getting better in the last ~5 years.

It's sort of silly to "fix" this problem by writing lexers for every language since that's obviously not a general solution, but syntax lexers don't need to be perfect and the simple ones (JSON, XML) take very little time to write.

I just checked and can confirm your results with the new pygmentize version. But your ~100x faster JSON lexer is an awesome result!

Thank you for looking into this.

Just an update: I actually went and checked the Pygments source. The issue was in the HTML formatter: changing string concatenation to list extend/join improved performance by ~10x. My pull request was accepted upstream, so these cases shouldn't be a problem in the future.
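For anyone curious, the class of fix described here -- replacing repeated string concatenation with list accumulation and a single join -- looks like this in Python (a generic sketch, not the actual Pygments patch; note that some CPython versions optimize in-place str concatenation in simple cases, but the join pattern is reliably linear):

```python
def build_quadratic(parts):
    # Repeated string concatenation: in the worst case each += copies
    # the whole buffer, so total work grows quadratically with output size.
    out = ''
    for p in parts:
        out += '<span>' + p + '</span>'
    return out

def build_linear(parts):
    # Accumulate fragments in a list and join once at the end:
    # total work is linear in the output size.
    chunks = []
    for p in parts:
        chunks.append('<span>' + p + '</span>')
    return ''.join(chunks)
```

Both produce identical output; only the asymptotic cost differs, which matters a lot for megabyte-scale highlighted HTML.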

I guess I should have gone down this path in the first place, but I needed this discussion to understand that. Thank you again for your time.

Ah, nice! I also filed T9587 as a followup here, I'll make a note about your fix there.

I do want to pursue the other changes (e.g., an "Enable Highlighting" UI in Paste) eventually, but I think 1.4->2.0 Pygments changes, your Pygments long-line fix, and making the JSON parser native should greatly reduce the number of reasonable things users can do in order to hit this problem, and I'm less concerned if they have difficulty working with a legitimately huge input (like a 100 MB JSON file). It's something we should fix eventually, but it's reasonable to imagine that 0 users are actually encountering it because that input size is so extreme.