The process launcher, parser combinators, and /bin/sh

Saturday, November 24, 2007

The io.launcher vocabulary lets you specify a command line in one of two ways:

"foo bar baz" run-process
{ "foo" "bar" "baz" } run-process

Both have their advantages and disadvantages. The first one is more amenable to user input, the second one is safer and more predictable if some of the arguments are themselves user input and may contain special characters.

Until now, the first form was implemented on Unix by passing the command line to /bin/sh. The shell would take care of tokenizing the command line, parsing quotes and backslashes, etc. This is problematic, for two reasons. The practical reason is that the shell does a hell of a lot of stuff (expanding env vars, tokenizing on $IFS which may change, etc) which is hard to control and predict, hence the regular security advisories we see regarding programs which use system() carelessly. The second reason is philosophical – Factor should not depend on an external tool for such a trivial task (tokenizing arguments).

Java side-steps the whole issue in a lame way. The Java API also supports launching a process specified as a command line string, or as an array of strings. But their logic for tokenizing a command line string is simplistic, they just split the input on spaces. No quoting or escaping allowed. In Factor, this could be achieved just by calling " " split, but it is lame.

This is Factor, Factor is awesome, and parsing a simple command line string while respecting quoting and escaping should be simple. And it turns out that it is! I haven’t used Chris Double’s parser-combinators library for anything serious up til now, and I was (pleasantly) surprised at how simple it was. Your grammar really does map pretty much directly to parser combinators, and the clever implementation of the built-in combinators means you pretty much get a parse tree for free, with very little processing of the parse result. Chris did a wonderful job and of course the clever people in the Haskell community who invented parser combinators deserve a lot of praise too.

The command line grammar begins by defining a parser for escaped characters, and another parser for a sequence of escaped and unescaped characters. Note that the former uses &>, so that an escaped character parser result is the actual character itself, without the backslash:

LAZY: 'escaped-char' "\\" token any-char-parser &> ;

LAZY: 'chars' 'escaped-char' any-char-parser <|> <*> ;

Next up, we use the surrounded-by parser combinator to build parsers for quoted strings:

LAZY: 'quoted-1' 'chars' "\"" "\"" surrounded-by ;

LAZY: 'quoted-2' 'chars' "'" "'" surrounded-by ;

The parse result of a surrounded by parser is the result of the parser combinator in the body, so the quotes are stripped off for us automatically. Neat.

Next up, we have a parser for non-whitespace characters used in unquoted tokens:

LAZY: 'non-space-char'
    'escaped-char' [ CHAR: \s = not ] satisfy <|> ;

Since we can still escape a space in a unquoted token, we re-use our escaped char parser here.
Now a parser for unquoted tokens:

LAZY: 'unquoted' 'non-space-char' <+> ;

Finally a parser for any type of argument, whether it be quoted or unquoted; we use <@ here to convert the parse result to a string (the result of <*> is an array of parsed elements; the elements are characters because of how we defined our character parsers; we turn this array of characters into a string.)

LAZY: 'argument'
    'quoted-1' 'quoted-2' 'unquoted' <|> <|>
    [ >string ] <@ ;

Now we have our top-level parser. We define it in a memoized word, so that it is only ever constructed once and the same instance is reused:

MEMO: 'arguments' ( -- parser )
    'argument' " " token <+> list-of ;

Finally a word which uses this parser:

: tokenize-command ( command -- arguments )
    'arguments' parse-1 ;

The parse-1 word is a utility word which outputs the parse result from the first match (the parse word outputs a lazy list of matches).

Here’s an example:

"hello 'world how are you' "\\"today\\"" blah\\ blah" tokenize-command .
  { "hello" "world how are you" "\"today\"" "blah blah" }

The entire parser is 19 lines of code, or 8 lines if we remove blank lines and squish everything together. It reads easily because it is essentially the BNF grammar we’re parsing here, just written in a postfix style with some additional fluff. There’s no need to use a parser generator or any kind of external tool in the build process; this is just Factor code. Sure beats calling /bin/sh.