abstract class PhutilLexer
Arcanist Technical Documentation

Slow, inefficient regexp-based lexer. Define rules like this:

array(
  'start'  => array(...),
  'state1' => array(...),
  'state2' => array(...),
)

Lexers start at the state named 'start'. Each state should have a list of rules which can match in that state. A list of rules looks like this:

array(
  array('\s+', 'space'),
  array('\d+', 'digit'),
  array('\w+', 'word'),
)

The lexer tries each rule in the current state in order. The first rule that matches produces a token. For example, the rules above would lex this text:

3 asdf

...to produce these tokens (assuming the rules are for the 'start' state):

array('digit', '3', null),
array('space', ' ', null),
array('word', 'asdf', null),
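
The matching loop described above can be sketched roughly as follows. This is a minimal illustration and not the actual PhutilLexer code: the function name sketch_lex_single_state() is hypothetical, only a single state is handled, and wrapping each pattern in '/' delimiters with a \G anchor is an assumption of the sketch.

```php
<?php

/**
 * Sketch only: try each rule of a single state in order and emit a token
 * for the first rule that matches at the current position.
 */
function sketch_lex_single_state(array $rules, $input) {
  $tokens = array();
  $position = 0;
  $length = strlen($input);

  while ($position < $length) {
    $matched = false;

    foreach ($rules as $rule) {
      list($pattern, $type) = $rule;

      // Anchor the pattern at the current offset with \G. The '/' delimiter
      // is an assumption of this sketch, so patterns must not contain '/'.
      $regexp = '/\G(?:'.$pattern.')/';

      $matches = null;
      if (!preg_match($regexp, $input, $matches, 0, $position)) {
        continue;
      }

      if ($matches[0] === '') {
        // A zero-length match would never advance the position.
        throw new UnexpectedValueException('Zero-length match.');
      }

      // Token shape follows the text above: (type, value, context).
      $tokens[] = array($type, $matches[0], null);
      $position += strlen($matches[0]);
      $matched = true;
      break;
    }

    if (!$matched) {
      throw new UnexpectedValueException(
        'No rule matched at offset '.$position.'.');
    }
  }

  return $tokens;
}

$rules = array(
  array('\s+', 'space'),
  array('\d+', 'digit'),
  array('\w+', 'word'),
);

$tokens = sketch_lex_single_state($rules, '3 asdf');
```

Rule order matters here: '\w+' also matches digits, so "3" is only reported as a 'digit' token because the '\d+' rule is listed first.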

A rule can also cause a state transition:

array('zebra', 'animal', 'saw_zebra'),

This would match the text "zebra", emit a token of type "animal", and change the lexer state to "saw_zebra", causing the lexer to start using the rules from that state.

To pop the lexer's state and return to the previous one, use the special state '!pop' as the rule's third element.
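
One way to picture transitions and '!pop' is as operations on a stack of state names. The sketch below assumes that model. The helper sketch_apply_transition() is hypothetical, and throwing when the last state is popped is a choice made for the sketch rather than documented behavior.

```php
<?php

/**
 * Sketch only: apply a rule's third element (the next state) to a stack of
 * state names and return the updated stack.
 */
function sketch_apply_transition(array $states, $next_state) {
  if ($next_state === null) {
    // No transition: keep using the rules of the current state.
    return $states;
  }

  if ($next_state === '!pop') {
    // Drop the current state so the previous one becomes active again.
    array_pop($states);
    if (!$states) {
      throw new UnexpectedValueException('Popped the last lexer state.');
    }
    return $states;
  }

  // Any other name becomes the new current state.
  $states[] = $next_state;
  return $states;
}

$states = array('start');

$states = sketch_apply_transition($states, 'saw_zebra');
$after_push = end($states);

$states = sketch_apply_transition($states, '!pop');
$after_pop = end($states);
```

In this model the current state is always the top of the stack, so after the '!pop' the lexer is back to using the rules for 'start'.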

Finally, you can provide additional options as an array in the fourth element of a rule. Supported options are 'case-insensitive' and 'context'.

Possible values for context are push (push the token value onto the context stack), pop (pop the context stack and use it to provide context for the token), and discard (pop the context stack and throw away the value).

For example, to lex text like this:

Class::CONSTANT

You can use a rule set like this:

'start' => array(
  array('\w+(?=::)', 'class', 'saw_class', array('context' => 'push')),
),
'saw_class' => array(
  array('::', 'operator'),
  array('\w+', 'constant', '!pop', array('context' => 'pop')),
),

This would lex the above text into this token stream:

array('class', 'Class', null),
array('operator', '::', null),
array('constant', 'CONSTANT', 'Class'),
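
The effect of the 'context' option on the third element of each token can be sketched as below. This is an illustration under stated assumptions, not the lexer's own code: sketch_make_token() is a hypothetical helper, and raising an exception when the context stack is empty is a choice made for the sketch.

```php
<?php

/**
 * Sketch only: build a token for matched text, applying the 'context'
 * option against a stack of previously pushed values.
 */
function sketch_make_token(array &$context, $type, $value, $option = null) {
  if ($option === 'push') {
    // Remember this value; the token itself carries no context.
    $context[] = $value;
    return array($type, $value, null);
  }

  if ($option === 'pop' || $option === 'discard') {
    if (!$context) {
      throw new UnexpectedValueException('Context stack is empty.');
    }
    $popped = array_pop($context);

    // 'pop' attaches the popped value to the token; 'discard' drops it.
    return array($type, $value, ($option === 'pop') ? $popped : null);
  }

  // No context option: the third element stays null.
  return array($type, $value, null);
}

// Replay the matches from the Class::CONSTANT example.
$context = array();
$tokens = array(
  sketch_make_token($context, 'class', 'Class', 'push'),
  sketch_make_token($context, 'operator', '::'),
  sketch_make_token($context, 'constant', 'CONSTANT', 'pop'),
);
```

The 'constant' token ends up carrying 'Class' as its context because that was the value most recently pushed, and the context stack is empty again afterwards.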

For a concrete implementation, see PhutilPHPFragmentLexer.

Tasks

Lexer Implementation

  • abstract protected function getRawRules() — Return a set of rules for this lexer. See description in PhutilLexer.

Lexer Rules

  • protected function getRules() — Process, normalize, and validate the raw lexer rules.

Lexer Tokens

  • public function getTokens($input, $initial_state) — Lex an input string into tokens.

Other Methods

  • public function mergeTokens($tokens) — Merge adjacent tokens of the same type. For example, if a comment is tokenized as <"//", "comment">, this method will merge the two tokens into a single combined token.
  • public function getLexerState()

Methods

abstract protected function getRawRules()

Return a set of rules for this lexer. See description in PhutilLexer.

Return
dict    Lexer rules.

protected function getRules()

Process, normalize, and validate the raw lexer rules.

Return
wild

public function getTokens($input, $initial_state)

Lex an input string into tokens.

Parameters
string  $input          Input string.
string  $initial_state  Initial lexer state.
Return
list    List of lexer tokens.

public function mergeTokens($tokens)

Merge adjacent tokens of the same type. For example, if a comment is tokenized as <"//", "comment">, this method will merge the two tokens into a single combined token.

Parameters
array   $tokens
Return
wild
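
The merge described for mergeTokens() can be sketched as follows. This is an illustration only, not the method's actual body: sketch_merge_tokens() is a hypothetical name, the (type, value, context) token shape is taken from the class description, and requiring the context to match as well as the type is an assumption of the sketch.

```php
<?php

/**
 * Sketch only: fold runs of adjacent tokens that share a type (and, as an
 * assumption of this sketch, a context) into one token.
 */
function sketch_merge_tokens(array $tokens) {
  $result = array();
  $last = null;

  foreach ($tokens as $token) {
    $same = ($last !== null)
      && ($token[0] === $last[0])
      && ($token[2] === $last[2]);

    if ($same) {
      // Extend the pending token with the next token's text.
      $last[1] .= $token[1];
      continue;
    }

    // The run ended: flush the pending token and start a new one.
    if ($last !== null) {
      $result[] = $last;
    }
    $last = $token;
  }

  if ($last !== null) {
    $result[] = $last;
  }

  return $result;
}

$merged = sketch_merge_tokens(array(
  array('comment', '//', null),
  array('comment', ' a comment', null),
  array('space', "\n", null),
));
```

Here the two 'comment' tokens collapse into one token with the value "// a comment", while the trailing 'space' token is left as it is.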

public function getLexerState()

This method is not documented.
Return
wild