Sorting strings with embedded numbers
Friday, September 12, 2008
Suppose you have a list of file names:
picture1.txt
picture2.txt
picture3.txt
picture4.txt
picture5.txt
picture6.txt
picture7.txt
picture8.txt
picture9.txt
picture10.txt
Lexicographic order will sort picture10.txt
before picture2.txt
,
which is not what you want. The correct solution is to recognize that
the strings contain numbers, and special-case their comparison.
Doug Coleman wrote a Factor solution last December.
Today I was revisiting his code and noticed that it could be simplified significantly using Chris Double’s amazing parser expression grammars library:
USING: peg.ebnf math.parser kernel assocs sorting ;
IN: sorting.human
: find-numbers ( string -- seq )
[EBNF Result = ([0-9]+ => [[ string>number ]] | (!([0-9]) .)+)* EBNF] ;
: human-sort ( seq -- seq' )
[ dup find-numbers ] { } map>assoc sort-values keys ;
I checked the code into the repository; you can USE: sorting.human
to
use it.
A bit of explanation is in order.
The PEG [0-9]+ => [[ string>number ]]
matches one or more digits (in
the range 0 to 9), and applies the action string>number
(a Factor word
which does what the name says).
The PEG (!([0-9]) .)+
matches one or more characters which are not
digits.
The PEG idiom (A | B)*
matches a sequence of zero or more A
’s and
B
’s, which is exactly what we do here with the above two PEGs.
The PEG parser is wrapped in [EBNF ... EBNF]
, which is “inline PEG
notation”; it expands into code which takes a string as input and
returns a parse result as output. So calling find-numbers
splits out
any embedded numbers in the string.
The phrase [ dup find-numbers ] { } map>assoc
builds an association
list from a sequence where each sequence element is associated with a
tokenized string. We use sort-values
to sort the association list by
comparing values, then use keys
to extract the sequence elements, now
in sorted order. Beautiful.
I solved the same problem in Java several years ago. The result was not pretty at all:
public static int compareStrings(String str1, String str2, boolean ignoreCase)
{
char[] char1 = str1.toCharArray();
char[] char2 = str2.toCharArray();
int len = Math.min(char1.length,char2.length);
for(int i = 0, j = 0; i < len && j < len; i++, j++)
{
char ch1 = char1[i];
char ch2 = char2[j];
if(Character.isDigit(ch1) && Character.isDigit(ch2)
&& ch1 != '0' && ch2 != '0')
{
int _i = i + 1;
int _j = j + 1;
for(; _i < char1.length; _i++)
{
if(!Character.isDigit(char1[_i]))
break;
}
for(; _j < char2.length; _j++)
{
if(!Character.isDigit(char2[_j]))
break;
}
int len1 = _i - i;
int len2 = _j - j;
if(len1 > len2)
return 1;
else if(len1 < len2)
return -1;
else
{
for(int k = 0; k < len1; k++)
{
ch1 = char1[i + k];
ch2 = char2[j + k];
if(ch1 != ch2)
return ch1 - ch2;
}
}
i = _i - 1;
j = _j - 1;
}
else
{
if(ignoreCase)
{
ch1 = Character.toLowerCase(ch1);
ch2 = Character.toLowerCase(ch2);
}
if(ch1 != ch2)
return ch1 - ch2;
}
}
return char1.length - char2.length;
}
This code really sucks. The solution is very imperative and low-level; to understand what it does, you have to execute the code in your head and keep track of the various loops, early exits, and mutable local variables. I no longer try to solve problems in these low-level terms. With Factor I can solve the problem without “playing CPU”. A lot of people, when they’re first starting out with Factor, try to write code much like the above using a stack. Predictably, the stack shuffling gets out of hand. While the above algorithm could be implemented quite easily in Factor using locals, it would still be 20 times longer and more complicated than the high-level solution using PEGs.
Understanding the various abstractions and libraries which are available is key to grasping Factor: collections, generic words, fry, locals, macros, memoization, PEGs, the prettyprinter, and so on. Making effective use of these tools can reduce the amount of work required to solve a problem by an order of magnitude. This is why I invest a lot of effort into writing documentation, as well as the help system itself. Making Factor easy to learn and explore is a high priority for me.
In the Java community, there is a certain derision against “API monkeys”: programmers who always reach for a library without being able to solve a problem using just the language itself. Factor programmers on the other hand should be API monkeys and proud of it!