The process launcher, parser combinators, and /bin/sh
Saturday, November 24, 2007
The io.launcher vocabulary lets you specify a command line in one of
two ways:
"foo bar baz" run-process
{ "foo" "bar" "baz" } run-process
Both have their advantages and disadvantages. The first one is more amenable to user input, the second one is safer and more predictable if some of the arguments are themselves user input and may contain special characters.
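To make the trade-off concrete, here is a minimal sketch of the two forms given an argument that contains a space. The touch command and the file name are hypothetical, chosen only for illustration, and depending on the Factor version run-process may leave a process object on the stack.

```factor
USING: io.launcher ;

! Hypothetical file name containing a space.

! String form: the command line is tokenized, so "my" and
! "file.txt" reach the program as two separate arguments.
"touch my file.txt" run-process

! Array form: each element is passed through as exactly one
! argument, so the program sees the single name "my file.txt".
{ "touch" "my file.txt" } run-process
```

With the string form, the space would have to be quoted or escaped to get the same effect, which is exactly the tokenizing step the rest of this post is about.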
Until now, the first form was implemented on Unix by passing the command
line to /bin/sh. The shell would take care of tokenizing the command
line, parsing quotes and backslashes, etc. This is problematic for two
reasons. The practical reason is that the shell does a hell of a lot of
stuff (expanding environment variables, tokenizing on $IFS, which may
change, etc.) which is hard to control and predict, hence the regular
security advisories we see regarding programs which use system()
carelessly.
The second reason is philosophical – Factor should not depend on an
external tool for such a trivial task (tokenizing arguments).
Java side-steps the whole issue in a lame way. The Java API also
supports launching a process specified as a command line string, or as
an array of strings. But their logic for tokenizing a command line
string is simplistic: they just split the input on spaces. No quoting or
escaping allowed. In Factor, this could be achieved just by calling
" " split, but it is lame.
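A quick sketch of why naive splitting falls short, using a made-up command line: the quoted argument is cut in two and the quote characters are left in place.

```factor
USING: prettyprint splitting ;

! Splitting on spaces knows nothing about quoting.
"foo 'bar baz'" " " split .
! prints { "foo" "'bar" "baz'" }
```

What we actually want here is { "foo" "bar baz" }.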
This is Factor, Factor is awesome, and parsing a simple command line
string while respecting quoting and escaping should be simple. And it
turns out that it is! I haven’t used Chris Double’s parser-combinators
library for anything serious until now, and I was (pleasantly)
surprised at how simple it was. Your grammar really does map pretty much
directly to parser combinators, and the clever implementation of the
built-in combinators means you pretty much get a parse tree for free,
with very little processing of the parse result. Chris did a wonderful
job and of course the clever people in the Haskell community who
invented parser combinators deserve a lot of praise too.
The command line grammar begins by defining a parser for escaped
characters, and another parser for a sequence of escaped and unescaped
characters. Note that the former uses &>, so that an escaped character
parser result is the actual character itself, without the backslash:
LAZY: 'escaped-char' "\\" token any-char-parser &> ;
LAZY: 'chars' 'escaped-char' any-char-parser <|> <*> ;
Next up, we use the surrounded-by parser combinator to build parsers
for quoted strings:
LAZY: 'quoted-1' 'chars' "\"" "\"" surrounded-by ;
LAZY: 'quoted-2' 'chars' "'" "'" surrounded-by ;
The parse result of a surrounded-by parser is the result of the parser in the body, so the quotes are stripped off for us automatically. Neat.
Next up, we have a parser for non-whitespace characters used in unquoted tokens:
LAZY: 'non-space-char'
    'escaped-char' [ CHAR: \s = not ] satisfy <|> ;
Since we can still escape a space in an unquoted token, we re-use our
escaped-char parser here.
Now a parser for unquoted tokens:
LAZY: 'unquoted' 'non-space-char' <+> ;
Finally a parser for any type of argument, whether it be quoted or
unquoted; we use <@ here to convert the parse result to a string (the
result of <*> is an array of parsed elements; the elements are
characters because of how we defined our character parsers; we turn this
array of characters into a string):
LAZY: 'argument'
    'quoted-1' 'quoted-2' 'unquoted' <|> <|>
    [ >string ] <@ ;
Now we have our top-level parser. We define it in a memoized word, so that it is only ever constructed once and the same instance is reused:
MEMO: 'arguments' ( -- parser )
    'argument' " " token <+> list-of ;
Finally a word which uses this parser:
: tokenize-command ( command -- arguments )
    'arguments' parse-1 ;
The parse-1 word is a utility word which outputs the parse result from
the first match (the parse word outputs a lazy list of matches).
Here’s an example:
"hello 'world how are you' \"\\\"today\\\"\" blah\\ blah" tokenize-command .
{ "hello" "world how are you" "\"today\"" "blah blah" }
The entire parser is 19 lines of code, or 8 lines if we remove blank
lines and squish everything together. It reads easily because it is
essentially the BNF grammar we’re parsing here, just written in a
postfix style with some additional fluff. There’s no need to use a
parser generator or any kind of external tool in the build process; this
is just Factor code. Sure beats calling /bin/sh.