Factor Language Blog

Removing duplicates, boolean arrays, ARM

Sunday, January 21, 2007

Comp Sci lesson learned: For some reason I was always under the impression that one cannot remove duplicates from an array while preserving order in O(n) time. This is why Factor had two words, prune and hash-prune.

The former preserved order and ran in quadratic time; it would loop over the input sequence, and see if each element was already present in an accumulator, if not, it would add it to the accumulator.

The latter did not preserve order and ran in linear time; it would add all elements as keys in a new hashtable, then output the sequence of keys.

Well it turns out there’s a way to remove duplicates while keeping order in O(n) time. The idea is that as you inspect each element of the input sequence in turn, you check if it is present in a hashtable; if it is, you keep going, otherwise you push the element on an accumulator and add it to the hashtable. I guess this is obvious to most people but for some reason I was not aware of this algorithm and did not even think this is possible. The implementation is simple enough, and I changed the prune word to use this algorithm, and removed hash-prune:

: (prune) ( hash vec elt -- )
    rot 2dup hash-member?
    [ 3drop ] [ dupd dupd set-hash swap push ] if ; inline

: prune ( seq -- newseq )
    dup length <hashtable> over length <vector>
    rot [ >r 2dup r> (prune) ] each nip ;

A new data structure: The Unix native I/O code has had some words for managing a space-efficient bit array for quite some time. These words were not really useful outside of Unix I/O, and were not safe either. I cleaned them up and made them more useful; now Factor has a boolean-array class which supports the sequence protocol and behaves just like an array, except it can only store t or f, and it uses much less space. Dan asked for this to make the XML parser more efficient, and it was about time Factor got a boolean array type, anyway.

ARM: Doug Coleman got me a Netstix ARM computer as an early birthday present. Thanks Doug! I was going to buy one anyway when I felt like porting Factor to ARM, but I guess he really wanted an ARM port now :-) So I’ve decided to make this port a priority for 0.88. I’m still finishing some compiler changes and setting up the software on the Netstix, but once that is done I will start working on the port. Running on ARM will be nice, because it will open up the world of cellphones and handhelds to Factor. Especially because Factor is both high level and fast.