Factor Language Blog

Converting an Excel file to CSV - Factor makes XML fun!

Friday, January 19, 2007

Today I was sent a rather important Microsoft Excel document. Unlike the older binary formats, this was an XML file, and the version of OpenOffice installed on my Debian-stable Linux workstation was too old to cope with it. Rather than upgrading OpenOffice on the Debian box, or installing OpenOffice on my Mac, I decided to write a Factor program to convert the data to CSV (comma-separated values), which OpenOffice can certainly open. Just like my previous foray into parsing XML with Factor, writing this code took only a few minutes, including testing, in no small part due to Daniel’s excellent documentation.

REQUIRES: libs/xml ;
USING: io sequences xml xml-utils ;
IN: msxml>csv

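! Print each row on its own line, with its cells joined by commas.
: print-csv ( table -- ) [ "," join print ] each ;

! Build a table (a sequence of rows, each a sequence of cell strings)
! from the Row, Cell and Data tags found under Worksheet and Table.
: (msxml>csv) ( xml -- table )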
    "Worksheet" find-tag
    "Table" find-tag
    "Row" find-tags [
        "Cell" find-tags [
            "Data" find-tag children>string
        ] map
    ] map ;

: msxml>csv ( infile outfile -- )
    <file-writer> [
        <file-reader> read-xml (msxml>csv) print-csv
    ] with-stream ;

Of course, this handles only a tiny subset of the Excel XML format, and it writes each cell as-is, with no quoting of commas or quotation marks, but it was good enough to convert this particular file to CSV without losing any information.
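One gap in that tiny subset is easy to see: print-csv joins cells with a bare comma, so a cell that itself contains a comma, a quotation mark or a line break would produce a malformed line. Here is a minimal sketch of the usual CSV quoting convention (wrap the field in double quotes and double any embedded quotes). The words needs-quoting?, quote-field and csv-line, and the vocabulary name, are my own hypothetical names, not part of libs/xml. The sketch also assumes the replace word from the splitting vocabulary and surround from sequences as found in present-day Factor, which may not match the library as it stood when this post was written.

```factor
! Assumes present-day Factor vocabularies; names below are illustrative.
USING: kernel sequences splitting ;
IN: msxml>csv.quoting

! True if the field contains a character that is special in CSV.
: needs-quoting? ( str -- ? )
    [ ",\"\r\n" member? ] any? ;

! Double any embedded quotes and wrap the field in quotes,
! but only when the field actually needs it.
: quote-field ( str -- str' )
    dup needs-quoting? [
        "\"" "\"\"" replace "\"" dup surround
    ] when ;

! Turn one row (a sequence of strings) into a single CSV line.
: csv-line ( row -- str )
    [ quote-field ] map "," join ;
```

With this in place, print-csv could call csv-line print instead of "," join print for each row. For example, the row { "a" "b,c" } would come out as a,"b,c" rather than as three separate fields.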

The XML file used XML namespaces, but the parser handled that without problems. Daniel cleverly lets the programmer specify either simple strings or fully qualified name tuples when searching for tags. Quick-and-dirty code can proceed as if the namespaces weren’t there, while the full XML specification is still supported when one needs it.

The (msxml>csv) word looks pretty in its current state; however, for more complex XML processing I can easily imagine tag searches getting out of hand very quickly. An XPath-like DSL would be a welcome addition to libs/xml, but I need to spend more time working with XML in Factor before I can come up with a Factorish solution. None of the XPath-like solutions I’ve seen in other languages were really satisfactory.

It is also interesting that the program above contains no stack manipulation at all.