USA Zip code database
Tuesday, April 29, 2008
Recently someone posted a freely available Zip code database on reddit. The database is in CSV format.
Phil Dawes contributed a CSV parser to Factor a few days ago, and I thought this would be the perfect use-case to test out the parser.
I added the library to extra/usa-cities. It begins with the usual
boilerplate:
USING: io.files io.encodings.ascii sequences sequences.lib
math.parser combinators kernel memoize csv symbols inspector
words accessors math.order sorting ;
IN: usa-cities
Then, we define some singleton types for the various states of the
union. While this isn’t strictly necessary, it allows us to write
generic words which dispatch on states; for example, I’m sure Doug’s
taxes library could use this:
SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN
KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY ;
: states ( -- seq )
{
AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
} ; inline
ERROR: no-such-state name ;
M: no-such-state summary drop "No such state" ;
MEMO: string>state ( string -- state )
dup states [ word-name = ] with find nip
[ ] [ no-such-state ] ?if ;
Next up, we define a data type storing rows from the CSV database:
TUPLE: city
first-zip name state latitude longitude gmt-offset dst-offset ;
Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type:
MEMO: cities ( -- seq )
"resource:extra/usa-cities/zipcode.csv" ascii <file-reader>
csv rest-slice [
7 firstn {
[ string>number ]
[ ]
[ string>state ]
[ string>number ]
[ string>number ]
[ string>number ]
[ string>number ]
} spread city boa
] map ;
This word is tricky; some notes on its workings:
- We begin by opening a stream for reading from the CSV file with ASCII encoding.
- The
csvword reads CSV data from a stream. - The first line of the file consists of column headings and not
actual data, so we ignore it by using the non-copying variant of
rest,rest-slice(recall that the primary sequence type in Factor is an array, so removing the first element makes a copy; a slice is a virtual sequence presenting a view of a subsequence of an array). - The
spreadcombinator takes a sequence of quotations,Q1,...,QN, and n values from the stack,X1,...XN, and outputsQ1[X1],...,QN[XN]. In this case we’re taking the first 7 elements of each row of CSV data (each row has exactly 7 columns so in effect we’re pushing every element of the sequence on the stack), then usingspreadto convert some columns from their initial text format into something more useful to us; a state singleton or a number. - Finally, we use
city boato construct a city tuple “by order of arguments”; this slurps the 7 stack values and stores them into a new instance ofcity(note that the definition of thecitytype has exactly 7 slots and they are defined in the same order as the columns of the file). - Finally, we
mapover the sequence of rows to perform the above steps on each row of the file. The result is a sequence ofcityinstances.
The word is memoized so of course it will only load the database once.
We can now define words to query it:
MEMO: cities-named ( name -- cities )
cities [ name>> = ] with filter ;
MEMO: cities-named-in ( name state -- cities )
cities [
tuck [ name>> = ] [ state>> = ] 2bi* and
] with with filter ;
: find-zip-code ( code -- city )
cities [ first-zip>> <=> ] binsearch* ;
Now, let’s look at some examples.
First, let’s look up my Zip code:
( scratchpad ) 55406 find-zip-code .
T{ city f 55406 "Minneapolis" MN 44.938615 -93.22082 -6 1 }
And another famous Zip code:
( scratchpad ) 90210 find-zip-code .
T{ city f 90210 "Beverly Hills" CA 34.088808 -118.40612 -8 1 }
How many states have a city named “Minneapolis”?
( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune .
V{ NC MN KS }
What is the possible range of Zip codes for Austin?
( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi
73301
78972
There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city.