3. Power tools

3.2. csvstack: combining subsets

Large datasets are frequently distributed as many small files. At some point you will probably want to merge those files for aggregate analysis. csvstack allows you to “stack” the rows from CSV files with identical headers. To demonstrate, let’s imagine we’ve decided that Nebraska and Kansas form a “region” and that it would be useful to analyze them in a single dataset. Let’s grab the Kansas data:

$ curl -L -O https://github.com/onyxfish/csvkit/raw/master/examples/realdata/ks_1033_data.csv

Back in Getting started, we used in2csv to convert our Nebraska data from XLSX to CSV. However, we named our output data.csv for simplicity at the time. Now that we are going to be stacking multiple states, we should re-convert our Nebraska data using a file naming convention that matches our Kansas data:

$ in2csv ne_1033_data.xlsx > ne_1033_data.csv

Now let’s stack these two data files:

$ csvstack ne_1033_data.csv ks_1033_data.csv > region.csv

Using csvstat we can see that our region.csv contains both datasets:

$ csvstat -c state,acquisition_cost region.csv
  1. state
        <type 'unicode'>
        Nulls: False
        Values: KS, NE
  8. acquisition_cost
        <type 'float'>
        Nulls: False
        Min: 0.0
        Max: 658000.0
        Sum: 9447912.36
        Mean: 3618.50339334
        Median: 138.0
        Standard Deviation: 23725.9555723
        Unique values: 127
        5 most frequent values:
                120.0:  649
                499.0:  449
                138.0:  311
                6800.0: 304
                58.71:  218

Row count: 2611

If you supply the -g flag then csvstack can also add a “grouping column” to each row, so that you can tell which file each row came from. In this case we don’t need this, but you can imagine a situation in which, instead of having a state column, each of these datasets had simply been named nebraska.csv and kansas.csv. In that case, using a grouping column would prevent us from losing information when we stacked them.
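To make the idea of a grouping column concrete, here is a minimal Python sketch of stacking with a per-file label. It is not csvkit's implementation: the function name stack_with_group, the choice to put the label in the first column, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def stack_with_group(sources, group_name="group"):
    """Stack CSV texts that share a header, prefixing each row with a label.

    `sources` is a list of (label, csv_text) pairs. Hypothetical helper,
    written for this example only.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for label, text in sources:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            # First file: remember its header and write it once,
            # with the grouping column in front.
            header = file_header
            writer.writerow([group_name] + header)
        elif file_header != header:
            raise ValueError("headers differ; refusing to stack")
        for row in reader:
            writer.writerow([label] + row)
    return out.getvalue()


# Invented sample rows, not taken from the real 1033 data.
nebraska = "county,item_name\nADAMS,RIFLE\n"
kansas = "county,item_name\nALLEN,TRUCK\n"

stacked = stack_with_group([("NE", nebraska), ("KS", kansas)], group_name="state")
print(stacked)
```

Without the label, the two rows would be indistinguishable by state once stacked, which is exactly the information loss the grouping column prevents.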

3.3. csvsql and sql2csv: ultimate power

Sometimes (almost always), the command line isn’t enough. It would be crazy to try to do all your analysis using command line tools. Often, the correct tool for data analysis is SQL. csvsql and sql2csv form a bridge that eases migrating your data into and out of a SQL database. For smaller datasets csvsql can also leverage sqlite to allow execution of ad hoc SQL queries without ever touching a database.

By default, csvsql will generate a create table statement for your data. You can specify what sort of database you are using with the -i flag:

$ csvsql -i sqlite joined.csv
CREATE TABLE joined (
        state VARCHAR(2) NOT NULL,
        county VARCHAR(10) NOT NULL,
        fips INTEGER NOT NULL,
        nsn VARCHAR(16) NOT NULL,
        item_name VARCHAR(62) NOT NULL,
        quantity VARCHAR(4) NOT NULL,
        ui VARCHAR(7) NOT NULL,
        acquisition_cost FLOAT NOT NULL,
        total_cost VARCHAR(10) NOT NULL,
        ship_date DATE NOT NULL,
        federal_supply_category VARCHAR(34) NOT NULL,
        federal_supply_category_name VARCHAR(35) NOT NULL,
        federal_supply_class VARCHAR(25) NOT NULL,
        federal_supply_class_name VARCHAR(63),
        name VARCHAR(21) NOT NULL,
        total_population INTEGER NOT NULL,
        margin_of_error INTEGER NOT NULL
);

Here we have the sqlite “create table” statement for our joined data. You’ll see that, like csvstat, csvsql has done its best to infer the column types.

Often you won’t care about storing the SQL statements locally. You can also use csvsql to create the table directly in the database on your local machine. If you add the --insert option the data will also be imported:

$ csvsql --db sqlite:///leso.db --insert joined.csv

How can we check that our data was imported successfully? We could use the sqlite command line interface, but rather than worry about the specifics of another tool, we can also use sql2csv:

$ sql2csv --db sqlite:///leso.db --query "select * from joined"

Note that the --query parameter to sql2csv accepts any SQL query. For example, to export Douglas county from the joined table from our sqlite database, we would run:

$ sql2csv --db sqlite:///leso.db --query "select * from joined where county='DOUGLAS';" > douglas.csv
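For readers curious what “run a query and export the result as CSV” amounts to, the following Python sketch does the same kind of round trip with the standard sqlite3 and csv modules. It is an illustration only, not how sql2csv is implemented: it uses an in-memory database as a stand-in for leso.db, and the two-column table and its rows are invented.

```python
import csv
import io
import sqlite3

# In-memory stand-in for leso.db; the table layout and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joined (county TEXT, item_name TEXT)")
conn.executemany(
    "INSERT INTO joined VALUES (?, ?)",
    [("DOUGLAS", "RIFLE"), ("ADAMS", "TRUCK"), ("DOUGLAS", "HELMET")],
)

# Run the query; the parameter placeholder avoids building SQL by hand.
cursor = conn.execute(
    "SELECT county, item_name FROM joined WHERE county = ? ORDER BY item_name",
    ("DOUGLAS",),
)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
# cursor.description holds one entry per result column; item 0 is its name.
writer.writerow([col[0] for col in cursor.description])
writer.writerows(cursor)
douglas_csv = out.getvalue()
conn.close()

print(douglas_csv)
```

Writing to a real file instead of io.StringIO would mirror the shell redirect into douglas.csv.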

Sometimes, if you will only be running a single query, even constructing the database is a waste of time. For that case, you can actually skip the database entirely and csvsql will create one in memory for you:

$ csvsql --query "select county,item_name from joined where quantity > 5;" joined.csv | csvlook

SQL queries directly on CSVs! Keep in mind when using this that you are loading the entire dataset into an in-memory database, so it is likely to be very slow for large datasets.
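The in-memory approach can be sketched in a few lines of Python using only the standard library. This is a rough illustration of the idea, not csvsql's actual code: the sample rows are invented, and the column types are declared by hand here, whereas csvsql infers them.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for joined.csv.
csv_text = (
    "county,item_name,quantity\n"
    "ADAMS,RIFLE,10\n"
    "DOUGLAS,TRUCK,1\n"
    "HALL,HELMET,6\n"
)

# Every row is read into memory before loading, which is why this
# approach scales poorly to large files.
rows = list(csv.DictReader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")
# quantity is declared INTEGER so that "quantity > 5" compares numbers;
# as text, '10' would sort before '5'.
conn.execute("CREATE TABLE joined (county TEXT, item_name TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO joined VALUES (:county, :item_name, :quantity)", rows
)

result = conn.execute(
    "SELECT county, item_name FROM joined WHERE quantity > 5 ORDER BY county"
).fetchall()
conn.close()

print(result)
```

The database disappears when the connection closes, so nothing is left on disk, which matches the single-query use case described above.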

3.4. Summing up

csvjoin, csvstack, csvsql and sql2csv represent the power tools of csvkit. Using these tools can vastly simplify processes that would otherwise require moving data between other systems. But what about cases where these tools still don’t cut it? What if you need to move your data onto the web or into a legacy database system? We’ve got a few solutions for those problems in our final section, Going elsewhere with your data.