Factor Language Blog

Some things I've been working on

Thursday, January 31, 2008

I’ve been busy lately. Here are my current projects:

  • I implemented bit-vectors, byte-vectors and float-vectors, the resizable analogues of bit-arrays, byte-arrays and float-arrays. This nicely rounds out Factor’s sequences library; previously the only resizable types were strings and vectors; now every built-in array-like type has a resizable analogue. The literal syntax is simple:

    ?V{ t f t } ! bit vector
    BV{ 1 2 3 } ! byte vector
    FV{ 1.0 2.0 3.0 } ! float vector
    

    This is of course derived from V{ } for vectors, and ?{ } B{ } F{ } for bit, byte and float arrays, respectively.

    If you want to learn more, just look at the online help.

  • File system change notification: I have this working on Windows. When I implement it on Linux and Mac OS X I will blog about the API. This is really neat stuff, integrates with the I/O multiplexer loop so that waiting for notifications doesn’t block all of Factor.

  • New representation for strings. I’m just starting this, but it shouldn’t take too long. Right now, Factor strings are sequences of 16-bit characters, which is neither here nor there. It uses more memory than ASCII strings, but cannot represent every Unicode code point, since Unicode 5.0 requires 21 bits now. Totally brain-dead. The new representation is quite clever, and comes from Larceny Scheme. The idea is that strings are ASCII strings, but have an extra slot pointing to an ‘auxiliary vector’. If no auxiliary vector is set, the nth character of the string is just the nth byte. If an auxiliary vector is set, then the nth character has the nth byte as the least significant 8 bits, and the most significant 13 bits come from the nth double-byte in the auxiliary vector. Storing a non-ASCII character into the string creates an auxiliary vector if necessary. This reduces space usage for ASCII strings, it can represent every Unicode code point, and for strings with high code points in them, it still uses less space than the other alternative, UTF-32.