Some discussion in IRC. The sort() function (and other related functions) seem to behave badly in the case of mixed types (such as integers and strings).
Description
Revisions and Commits
Related Objects
Event Timeline
Here's an example which shows how zany things are:
<?php $list = array('6e2', '1', 2, '3', 'z', 'a'); for ($ii = 0; $ii < 10000; $ii++) { shuffle($list); sort($list); echo implode(', ', $list)."\n"; }
This repeatedly shuffles a list, then sorts it. One might reasonably expect that the output would be identical, but it isn't. We get three different sorted orders, depending on the input order.
$ php -f test.php | sort | uniq -c 4118 1, 2, 3, 6e2, a, z 3949 1, 3, 6e2, a, z, 2 1933 1, a, z, 2, 3, 6e2
I think this is because < on mixed types is not transitive, so the actual comparisons performed depend on the input, and lead to multiple sort orders.
In some places, we prefix strings with "~" to prevent this (this forces string casting). I started doing this a while ago, because I didn't realize SORT_STRING exists, and didn't fully realize the implications of PHP's goofy behavior.
We should likely:
- Identify callsites to sort(), ksort(), asort(), arsort(), rsort() and krsort() and add SORT_STRING after verifying they are safe.
- Change msort() to use SORT_STRING.
- Remove superfluous "~" and similar characters from sort keys, eventually.
- Add a lint warning about this nonsense.
How about adding the $sort_flag to msort as a parameter? It could very well be useful to explicitly sort things numerically?
Practically, msortv() and PhutilSortVector are now the solution here. They provide high-performance (no usort*()), multi-dimensional, stable, explicitly typed sort.
Since they rely on reducing dimensions down to a scalar string to avoid usort*() they trade away some flexibility, but seem to satisfy ~95%+ of the cases where we need sorts.
(It's also possible that the effort to avoid usort*() is pointless in practice, I haven't benchmarked msortv() against usort() in a modern environment. But the msortv() API is generally a bit simpler than custom comparators anyway so I think we'd probably keep it and just make it more flexible if facts come to light suggesting we should just usort().)