Page MenuHomePhabricator
Diviner libphutil Tech Docs PhutilEditDistanceMatrix

final class PhutilEditDistanceMatrix
libphutil Technical Documentation (Core Utilities)

Compute edit distance between two scalar sequences. This class uses Levenshtein (or Damerau-Levenshtein) to compute the edit distance between two inputs. The inputs are arrays containing any scalars (not just strings) so it can be used with, e.g., utf8 sequences.

$matrix = id(new PhutilEditDistanceMatrix())

->setSequences(str_split('ran'), str_split('rat'));

$cost = $matrix->getEditDistance();

Edit distance computation is slow and requires a large amount of memory; both are roughly O(N * M) in the length of the strings.

You can customize the cost of insertions, deletions and replacements. By default, each has a cost of 1.

$matrix->setReplaceCost(1);

By default, transpositions are not evaluated (i.e., the algorithm is Levenshtein). You can cause transpositions to be evaluated by setting a transpose cost (which will change the algorithm to Damerau-Levenshtein):

$matrix->setTransposeCost(1);

You can also provide a cost to alter the type of operation being applied. Many strings have several equivalently expensive edit sequences, but one some sequences are more readable to humans than others. Providing a small cost to alter operation type tends to smooth out the sequence and produce long runs of a single operation, which are generally more readable. For example, these strings:

(x)
((x))

...have edit strings "issis" and "isssi", which describe edit operations with the same cost, but the latter string is more comprehensible to human viewers.

If you set an alter cost, you must call setComputeString() to enable type computation. The alter cost should generally be smaller than c / N, where c is the smallest operational cost and N is the length of the longest string. For example, if you are using the default costs (insert = 1, delete = 1, replace = 1) and computing edit distances for strings of fewer than 1,000 characters, you might set the alter cost to 0.001.

Methods

public function setMaximumLength($maximum_length)

This method is not documented.
Parameters
$maximum_length
Return
wild

public function getMaximumLength()

This method is not documented.
Return
wild

public function didReachMaximumLength()

This method is not documented.
Return
wild

public function setComputeString($compute_string)

This method is not documented.
Parameters
$compute_string
Return
wild

public function getComputeString()

This method is not documented.
Return
wild

public function setTransposeCost($transpose_cost)

This method is not documented.
Parameters
$transpose_cost
Return
wild

public function getTransposeCost()

This method is not documented.
Return
wild

public function setReplaceCost($replace_cost)

This method is not documented.
Parameters
$replace_cost
Return
wild

public function getReplaceCost()

This method is not documented.
Return
wild

public function setDeleteCost($delete_cost)

This method is not documented.
Parameters
$delete_cost
Return
wild

public function getDeleteCost()

This method is not documented.
Return
wild

public function setInsertCost($insert_cost)

This method is not documented.
Parameters
$insert_cost
Return
wild

public function getInsertCost()

This method is not documented.
Return
wild

public function setAlterCost($alter_cost)

This method is not documented.
Parameters
$alter_cost
Return
wild

public function getAlterCost()

This method is not documented.
Return
wild

public function setApplySmoothing($apply_smoothing)

This method is not documented.
Parameters
$apply_smoothing
Return
wild

public function getApplySmoothing()

This method is not documented.
Return
wild

public function setSequences($x, $y)

This method is not documented.
Parameters
array$x
array$y
Return
wild

private function requireSequences()

This method is not documented.
Return
wild

public function getEditDistance()

This method is not documented.
Return
wild

public function getEditString()

Return a string representing the edits between the sequences. The string has these characters:

  • s (same): Same character in both strings.
  • i (insert): Character inserted.
  • d (delete): Character deleted.
  • x (replace): Character replaced.
  • t (transpose): Character transposed.
Return
wild

private function padEditString($str)

This method is not documented.
Parameters
$str
Return
wild

private function getTypeMatrix()

This method is not documented.
Return
wild

private function getDistanceMatrix()

This method is not documented.
Return
wild

private function computeMatrix($x, $y)

This method is not documented.
Parameters
array$x
array$y
Return
wild

private function getInfinity()

This method is not documented.
Return
wild

private function printMatrix($m)

This method is not documented.
Parameters
array$m
Return
wild

private function printTypeMatrix($t)

This method is not documented.
Parameters
array$t
Return
wild

private function applySmoothing($str, $full)

This method is not documented.
Parameters
$str
$full
Return
wild