csvkit 0.8.1 (beta)
csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
It is inspired by pdftk, gdal and the original csvcut utility by Joe Germuska and Aaron Bycoffe.
csvkit is to tabular data what the standard Unix text processing suite (grep, sed, cut, sort) is to text. As such, csvkit adheres to the Unix philosophy.
- Small is beautiful.
- Make each program do one thing well.
- Build a prototype as soon as possible.
- Choose portability over efficiency.
- Store data in flat text files.
- Use software leverage to your advantage.
- Use shell scripts to increase leverage and portability.
- Avoid captive user interfaces.
- Make every program a filter.
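As an illustration of the "make every program a filter" principle, here is a minimal sketch written with Python's standard csv module. It is not taken from csvkit's source: the function name keep_columns and the sample data are hypothetical, chosen only for this example. The function reads rows from one stream and writes a subset of columns to another, so it can sit in the middle of a pipeline.

```python
import csv
import io


def keep_columns(infile, outfile, indices):
    """Copy only the columns at the given zero-based indices.

    keep_columns is a hypothetical helper for this sketch,
    not part of csvkit.
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    for row in reader:
        writer.writerow([row[i] for i in indices])


# In-memory streams stand in for stdin and stdout in this demonstration.
source = io.StringIO("id,name,city\n1,Ann,Oslo\n2,Bo,Lima\n")
sink = io.StringIO()
keep_columns(source, sink, [0, 2])
print(sink.getvalue(), end="")
```

Run as a real filter, the same function would be called with sys.stdin and sys.stdout, so that its output can be piped into the next tool.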
As there is no formally defined CSV format, csvkit encourages well-known formatting standards:
- Output favors compatibility with the widest range of applications. This means that quoting is done with double quotes and only when necessary, columns are separated with commas, and lines are terminated with Unix-style line endings (“\n”).
- Data that is modified or generated will prefer consistency over brevity. Floats always include at least one decimal place, even if they are round. Dates and times are written in ISO 8601 format.
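The conventions above can be approximated with Python's standard csv module. The following is a rough sketch, not csvkit's own implementation: the helper names format_value and write_rows and the sample rows are assumptions made for this example.

```python
import csv
import io
from datetime import date, datetime


def format_value(value):
    """Render one cell as text (hypothetical helper, not csvkit code)."""
    if isinstance(value, float):
        # repr keeps the decimal place for round floats: 2.0 -> '2.0'
        return repr(value)
    if isinstance(value, (date, datetime)):
        # ISO 8601 text for dates and datetimes
        return value.isoformat()
    if value is None:
        return ""
    return str(value)


def write_rows(rows):
    """Write rows with minimal double-quoting, commas and '\\n' endings."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only when necessary
        lineterminator="\n",        # Unix-style line endings
    )
    for row in rows:
        writer.writerow([format_value(v) for v in row])
    return buffer.getvalue()


rows = [
    ["name", "score", "seen"],
    ["Smith, Jane", 2.0, date(2014, 3, 1)],
    ["Lee", 3.25, datetime(2014, 3, 1, 12, 30, 0)],
]
output = write_rows(rows)
print(output, end="")
```

Only the field containing a comma is quoted, the round float keeps its decimal place, and both date values are written in ISO 8601 form. Very large or very small floats would be rendered by repr in exponent notation, which this sketch does not handle.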
If you only want to use csvkit, install it this way:
pip install csvkit
If you are installing on Ubuntu, you may need to install the Python development headers before installing csvkit:
sudo apt-get install python-dev python-pip python-setuptools build-essential
If the installation appears to be successful but running the tools fails, try updating your version of Python setuptools:
pip install setuptools --upgrade
pip install csvkit --upgrade
csvkit is routinely tested on OS X, somewhat less frequently on Linux, and once in a while on Windows. All platforms are supported. It is tested against Python 2.6, 2.7, 3.3, 3.4 and PyPy. Python 2 versions earlier than 2.6 and Python 3 versions earlier than 3.3 are not supported at this time.
If you are a developer who also wants to hack on csvkit, install it this way:
git clone git://github.com/onyxfish/csvkit.git
cd csvkit
mkvirtualenv --no-site-packages csvkit

# If running Python 2
pip install -r requirements-py2.txt

# If running Python 3
pip install -r requirements-py3.txt

python setup.py develop
tox
If you are using Python 2 and have a recent version of pip, you may need to run pip with the additional arguments --allow-external argparse.
The csvkit tutorial walks through processing and analyzing a real dataset from data.gov. It is divided into several parts for easier reading:
- 1. Getting started
- 2. Examining the data
- 2.1. Cutting up the data with csvcut
- 2.2. Statistics on demand with csvstat
- 2.3. Searching for rows with csvgrep
- 2.4. Flipping column order with csvcut
- 2.5. Sorting with csvsort
- 2.6. Using line numbers as proxy for rank
- 2.7. Reading through data with csvlook and less
- 2.8. Saving your work
- 2.9. Onward to merging
- 3. Adding another year of data
- 4. Wrapping up
csvkit is composed of a number of individual command line utilities that can be loosely divided into a few major categories: Input, Processing, and Output. Documentation and examples for each utility are provided on the following pages.
Output (and Analysis)
Using as a library
csvkit is designed to be used as a replacement for most of Python’s csv module. Important parts of the API are documented on the following pages.
Want to hack on csvkit? Here’s how:
The MIT License
Copyright (c) 2014 Christopher Groskopf and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Fix Integrity error when inserting zero rows in database with csvsql. (#299)
- Add Michael Mior to AUTHORS. (#305)
- Add --count option to CSVStat.
- Implement csvformat.
- Fix bug causing CSVKitDictWriter to output ‘utf-8’ for blank fields.
- Add pnaimoli to AUTHORS.
- Fix column specification in csvstat. (#236)
- Added “Tips and Tricks” documentation. (#297, #298)
- Add Espartaco Palma to AUTHORS.
- Remove unnecessary enumerate calls. (#292)
- Deprecated DBF support for Python 3+.
- Add support for Python 3.3 and 3.4 (#239)
- Fix date handling with openpyxl > 2.0 (#285)
- Add Kristina Durivage to AUTHORS. (#243)
- Added Richard Low to AUTHORS.
- Support SQL queries “directly” on CSV files. (#276)
- Add Tasneem Raja to AUTHORS.
- Fix off-by-one error in open ended column ranges. (#238)
- Add Matt Pettis to AUTHORS.
- Add line numbers flag to csvlook (#244)
- Only install argparse for Python < 2.7. (#224)
- Add Diego Rabatone Oliveira to AUTHORS.
- Add Ryan Murphy to AUTHORS.
- Fix DBF dependency. (#270)
- Fix CHANGELOG for release.
- Fix homepage url in setup.py.
- Fix XLSX datetime normalization bug. (#223)
- Add raistlin7447 to AUTHORS.
- Merged sql2csv utility (#259).
- Add Jeroen Janssens to AUTHORS.
- Validate csvsql DB connections before parsing CSVs. (#257)
- Clarify install process for Ubuntu. (#249)
- Clarify docs for --escapechar. (#242)
- Make import csvkit API compatible with import csv.
- Update Travis CI link. (#258)
- Add Sébastien Fievet to AUTHORS.
- Use case-sensitive name for SQLAlchemy (#237)
- Add Travis Swicegood to AUTHORS.
- Add Chris Rosenthal to AUTHORS.
- Fix multi-file input to csvsql. (#193)
- Allow passing --snifflimit=0 to disable dialect sniffing. (#190)
- Add aarcro to the AUTHORS file.
- Improve performance of csvgrep. (#204)
- Add Matt Dudys to AUTHORS.
- Add support for --skipinitialspace. (#201)
- Add Joakim Lundborg to AUTHORS.
- Add --no-inference option to in2csv and csvsql. (#206)
- Add Federico Scrinzi to AUTHORS file.
- Add --no-header-row to all tools. (#189)
- Fix csvstack blowing up on empty files. (#209)
- Add Chris Rosenthal to AUTHORS file.
- Add --db-schema option to csvsql. (#216)
- Add Shane StClair to AUTHORS file.
- Add --no-inference support to csvsort. (#222)
- Implement geojson support in csvjson. (#159)
- Optimize writing of eight bit codecs. (#175)
- Created csvpy. (#44)
- Support --not-columns for excluding columns. (#137)
- Add Jan Schulz to AUTHORS file.
- Add Windows scripts. (#111, #176)
- csvjoin, csvsql and csvstack will no longer hold open all files. (#178)
- Added Noah Hoffman to AUTHORS.
- Make csvlook output compatible with emacs table markup. (#174)
- Add Derek Wilson to AUTHORS.
- Add Kevin Schaul to AUTHORS.
- Add DBF support to in2csv. (#11, #160)
- Support --zero option for zero-based column indexing. (#144)
- Support mixing nulls and blanks in string columns.
- Add --blanks option to csvsql. (#149)
- Add multi-file (glob) support to csvsql. (#146)
- Add Gregory Temchenko to AUTHORS.
- Add --no-create option to csvsql. (#148)
- Add Anton Ian Sipos to AUTHORS.
- Fix broken pipe errors. (#150)
- Begin CHANGELOG (a bit late, I’ll admit).