Sorting strings with embedded numbers

Friday, September 12, 2008

Suppose you have a list of file names:

picture1.txt
picture2.txt
picture3.txt
picture4.txt
picture5.txt
picture6.txt
picture7.txt
picture8.txt
picture9.txt
picture10.txt

Lexicographic order will sort picture10.txt before picture2.txt, which is not what you want. The correct solution is to recognize that the strings contain numbers, and special-case their comparison.

Doug Coleman wrote a Factor solution last December.

Today I was revisiting his code and noticed that it could be simplified significantly using Chris Double’s amazing parser expression grammars library:

USING: peg.ebnf math.parser kernel assocs sorting ;
IN: sorting.human

: find-numbers ( string -- seq )
    [EBNF Result = ([0-9]+ => [[ string>number ]] | (!([0-9]) .)+)* EBNF] ;

: human-sort ( seq -- seq' )
    [ dup find-numbers ] { } map>assoc sort-values keys ;

I checked the code into the repository; you can USE: sorting.human to use it.

A bit of explanation is in order.

The PEG [0-9]+ => [[ string>number ]] matches one or more digits (in the range 0 to 9), and applies the action string>number (a Factor word which does what the name says).

The PEG (!([0-9]) .)+ matches one or more characters which are not digits.

The PEG idiom (A | B)* matches a sequence of zero or more A’s and B’s, which is exactly what we do here with the above two PEGs.

The PEG parser is wrapped in [EBNF ... EBNF], which is “inline PEG notation”; it expands into code which takes a string as input and returns a parse result as output. So calling find-numbers splits out any embedded numbers in the string.

The phrase [ dup find-numbers ] { } map>assoc builds an association list from a sequence where each sequence element is associated with a tokenized string. We use sort-values to sort the association list by comparing values, then use keys to extract the sequence elements, now in sorted order. Beautiful.

I solved the same problem in Java several years ago. The result was not pretty at all:

public static int compareStrings(String str1, String str2, boolean ignoreCase)
{
    char[] char1 = str1.toCharArray();
    char[] char2 = str2.toCharArray();

    int len = Math.min(char1.length,char2.length);

    for(int i = 0, j = 0; i < len && j < len; i++, j++)
    {
        char ch1 = char1[i];
        char ch2 = char2[j];
        if(Character.isDigit(ch1) && Character.isDigit(ch2)
            && ch1 != '0' && ch2 != '0')
        {
            int _i = i + 1;
            int _j = j + 1;

            for(; _i < char1.length; _i++)
            {
                if(!Character.isDigit(char1[_i]))
                    break;
            }

            for(; _j < char2.length; _j++)
            {
                if(!Character.isDigit(char2[_j]))
                    break;
            }

            int len1 = _i - i;
            int len2 = _j - j;
            if(len1 > len2)
                return 1;
            else if(len1 < len2)
                return -1;
            else
            {
                for(int k = 0; k < len1; k++)
                {
                    ch1 = char1[i + k];
                    ch2 = char2[j + k];
                    if(ch1 != ch2)
                        return ch1 - ch2;
                }
            }

            i = _i - 1;
            j = _j - 1;
        }
        else
        {
            if(ignoreCase)
            {
                ch1 = Character.toLowerCase(ch1);
                ch2 = Character.toLowerCase(ch2);
            }

            if(ch1 != ch2)
                return ch1 - ch2;
        }
    }

    return char1.length - char2.length;
}

This code really sucks. The solution is very imperative and low-level; to understand what it does, you have to execute the code in your head and keep track of the various loops, early exits, and mutable local variables. I no longer try to solve problems in these low-level terms. With Factor I can solve the problem without “playing CPU”. A lot of people, when they’re first starting out with Factor, try to write code much like the above using a stack. Predictably, the stack shuffling gets out of hand. While the above algorithm could be implemented quite easily in Factor using locals, it would still be 20 times longer and more complicated than the high-level solution using PEGs.

Understanding the various abstractions and libraries which are available is key to grasping Factor: collections, generic words, fry, locals, macros, memoization, PEGs, the prettyprinter, and so on. Making effective use of these tools can reduce the amount of work required to solve a problem by an order of magnitude. This is why I invest a lot of effort into writing documentation, as well as the help system itself. Making Factor easy to learn and explore is a high priority for me.

In the Java community, there is a certain derision against “API monkeys”: programmers who always reach for a library without being able to solve a problem using just the language itself. Factor programmers on the other hand should be API monkeys and proud of it!