Page MenuHomePhabricator

Create a native JSON parser.
Closed, ResolvedPublic

Description

This might be more effort than it's worth, but I've been contemplating the possibility of writing a native JSON parser for use in libphutil. JSON isn't a difficult language to parse AFAICT (see jsonlint.y and jsonlint.l). There are a couple of reasons that I think that this could be useful:

  1. The ability to lint JSON files without any external dependencies (currently, we have an ArcanistJSONLintLinter to lint JSON files, but this requires jsonlint to be installed from npm).
  2. We could, theoretically, add a mode to the JSON parser to allow // ... and/or /* ... */ style comments.
    • As mentioned in T5260, the .arclint file format doesn't currently allow comments.
    • Some JSON-like file formats (Sublime Text configuration files are one example, and I am sure that there are others) do allow comments. It's not currently possible to lint these files properly with ArcanistJSONLintLinter.

@epriestley, what are your thoughts?

Related Objects

Event Timeline

joshuaspence raised the priority of this task from to Needs Triage.
joshuaspence updated the task description. (Show Details)
joshuaspence added a project: libphutil.
joshuaspence added a subscriber: joshuaspence.

The major desirable result to me would be raising better parsing errors, e.g. "unexpected comma on line 23 in .arclint" instead of "this is not a JSON file". The errors we raise are quite bad/unhelpful right now.

I'm a bit hesitant about making the formats we parse more flexible (with comments, trailing commas, optional quotes / single quotes, ...), since I think that's a bit of a dangerous road to walk down (and, e.g., that T5260 isn't a great argument for it: high-complexity use cases should reasonably use Engines or generator scripts), but I'm not completely against it.

(Having a first-party linter would be nice too, of course.)

In T5297#4, @epriestley wrote:

The major desirable result to me would be raising better parsing errors, e.g. "unexpected comma on line 23 in .arclint" instead of "this is not a JSON file". The errors we raise are quite bad/unhelpful right now.

(which could be done with the ArcanistJSONLintLinter... but this is less than ideal)

In T5297#4, @epriestley wrote:

I'm a bit hesitant about making the formats we parse more flexible (with comments, trailing commas, optional quotes / single quotes, ...), since I think that's a bit of a dangerous road to walk down (and, e.g., that T5260 isn't a great argument for it: high-complexity use cases should reasonably use Engines or generator scripts), but I'm not completely against it.

I'm not suggesting that we get too funky with the format... but certainly allowing comments would be nice.

I'm happy to work on this, but have one question... Would you rather that this be done as a binary (like xhpast is) or using PhutilParserGenerator?

I don't think it should be a separate binary -- there's a ton of overhead for binaries and they're a huge pain to build on Windows, and we have to check .exe's in since most users don't have a compiler. This process has gotten a little less bad recently, but I think this would throw much of the value/convenience back out the window immediately, particularly because so much startup/configuration stuff depends on parsing JSON.

PhutilParserGenerator might work OK, but it might need a bit (or a lot) of work to get reasonable information about where a parse failure occurred. I implemented it in a weekend by reading some course notes from Stanford that I googled, and really have no clue what I'm doing, so I'm not hugely confident that it generates working/usable/correct parsers (although we haven't hit any issues with the typespec stuff that I recall).

Basically, I'd say:

  • Using PhutilParserGenerator might be reasonable, but be prepared to bail early and cut your losses if you hit issues.
  • Pulling in an existing pure-PHP JSON parser might be reasonable.
  • Hand-rolling a recursive descent parser might be reasonable too? This is probably the easiest way to get good error messages, but it might be an unreasonably large amount of work. JSON is pretty simple, but the typespec stuff is pretty simple too and IIRC I wasn't able to build a recursive descent parser for it in a reasonable amount of time.

Personally, I'd probably start like this:

  • Spend a few minutes googling pure-PHP JSON parsers and see if any of them look OK in terms of code quality, license, auditability and extensibility.
  • If nothing's any good, start by writing a PhutilLexer, probably? A lot of parse errors are probably really lexer errors, so this might need some work on its own to improve error messages.
  • Flip a coin and try either PhutilParserGenerator or hand-rolling recursive descent, if either seems sucky switch to the other one until something works?

I stumbled upon this a while ago which looks interesting. It's basically jsonlint ported to PHP. And it does have a standalone JSON parser class.

Edit: Hm, it uses namespaces though so it would only support PHP 5.3.

joshuaspence edited this Maniphest Task.