Page MenuHomePhabricator

PHP sort function is unpredictable to human minds when dealing with mixed types
Closed, ResolvedPublic

Description

Some discussion in IRC. The sort() function (and other related functions) seem to behave badly in the case of mixed types (such as integers and strings).

Event Timeline

joshuaspence raised the priority of this task from to Needs Triage.
joshuaspence updated the task description. (Show Details)
joshuaspence added a project: Infrastructure.
joshuaspence added subscribers: joshuaspence, epriestley.

Here's an example which shows how zany things are:

<?php

$list = array('6e2', '1', 2, '3', 'z', 'a');
for ($ii = 0; $ii < 10000; $ii++) {
  shuffle($list);
  sort($list);
  echo implode(', ', $list)."\n";
}

This repeatedly shuffles a list, then sorts it. One might reasonably expect that the output would be identical, but it isn't. We get three different sorted orders, depending on the input order.

$ php -f test.php | sort | uniq -c
4118 1, 2, 3, 6e2, a, z
3949 1, 3, 6e2, a, z, 2
1933 1, a, z, 2, 3, 6e2

I think this is because < on mixed types is not transitive, so the actual comparisons performed depend on the input, and lead to multiple sort orders.

In some places, we prefix strings with "~" to prevent this (this forces string casting). I started doing this a while ago, because I didn't realize SORT_STRING exists, and didn't fully realize the implications of PHP's goofy behavior.

We should likely:

  • Identify callsites to sort(), ksort(), asort(), arsort(), rsort() and krsort() and add SORT_STRING after verifying they are safe.
  • Change msort() to use SORT_STRING.
  • Remove superfluous "~" and similar characters from sort keys, eventually.
  • Add a lint warning about this nonsense.

(It's technically deterministic, I suppose, just very unpredictable.)

How about adding the $sort_flag to msort as a parameter? It could very well be useful to explicitly sort things numerically?

I suspect that's never useful in this codebase, but could be wrong.

epriestley renamed this task from PHP sort function is non-deterministic when dealing with mixed types to PHP sort function is unpredictable to human minds when dealing with mixed types.Dec 25 2015, 3:18 PM
epriestley claimed this task.

Practically, msortv() and PhutilSortVector are now the solution here. They provide high-performance (no usort*()), multi-dimensional, stable, explicitly typed sort.

Since they rely on reducing dimensions down to a scalar string to avoid usort*() they trade away some flexibility, but seem to satisfy ~95%+ of the cases where we need sorts.

(It's also possible that the effort to avoid usort*() is pointless in practice, I haven't benchmarked msortv() against usort() in a modern environment. But the msortv() API is generally a bit simpler than custom comparators anyway so I think we'd probably keep it and just make it more flexible if facts come to light suggesting we should just usort().)