Slides and code from my PyData NYC 2013 talk on Python tools for data wrangling.
Slides and code (with the exception of `pony_blanket.py`) for Beyond the dict: Python Tools for Data Wrangling by Imran S Haque are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
The `pony_blanket` module is licensed for open-source use under the GNU Affero General Public License v3 to comply with the open-source license terms of Pony ORM. See `pony_blanket/pony_blanket_example.py` for an example of usage. Run the Makefile in that directory to regenerate all necessary data files.
The `pony_blanket` module is an adapter between the schemas defined by Marty Alchin's `sheets` module, which provides schematized parsing for delimited text data, and Pony ORM, an ORM for Python with Pythonic syntax and little boilerplate. `pony_blanket` lets you define a schema once, for `sheets`, and have it automatically turned into a schema for `pony.orm`:
from sheets import Row, Dialect
from sheets import StringColumn, FloatColumn

class HapMapAllele(Row):
    Dialect = Dialect(has_header_row=True,
                      delimiter=' ')
    rsid = StringColumn()
    ref_freq = FloatColumn()
    alt_freq = FloatColumn()

    # pony_blanket extension to sheets
    # set the `indexed` attribute on any sheets column to index the
    # corresponding column in the database model
    alt_freq.indexed = True

from pony_blanket import csv_to_db
from pony.orm import db_session, select

db, models = csv_to_db({'hapmap.txt': HapMapAllele})

# Query the database to find loci with high frequency of the alternate allele
with db_session:
    print len(select(x for x in models[HapMapAllele]
                     if x.alt_freq > 0.01))
By default, `csv_to_db` will load the given files into an in-memory SQLite3 database (`sqlite3.connect(':memory:')`).
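Nothing here touches disk: an in-memory SQLite database lives only as long as its connection. A quick stdlib-only illustration of that behavior (the table and values below are made up for the example; this is not `pony_blanket` code):

import sqlite3

# Everything lives in RAM; closing the connection discards the database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE alleles (rsid TEXT, alt_freq REAL)')
conn.execute("INSERT INTO alleles VALUES ('rs123', 0.02)")
print conn.execute('SELECT count(*) FROM alleles').fetchone()[0]  # prints 1
conn.close()  # the table and its rows are gone now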
The code in `benchmarks/` generates the stats listed in my slide on numeric-storage performance. Please don't complain about the different ways that I'm Doing It Wrong -- there are definitely more efficient ways to do numeric storage (particularly in SQL). The point of these benchmarks is to get order-of-magnitude estimates of the efficiency of storing numeric data in different formats using the most obvious means in the respective libraries.
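For a flavor of what's being compared, here is a minimal sketch of that kind of order-of-magnitude size comparison. The array length and file names are illustrative, and this is not the actual code in `benchmarks/` (which also uses the `tables` module):

import csv, os, pickle, sqlite3
import numpy as np

data = np.random.rand(100000)  # 100k doubles, ~800 KB of raw payload

# Plain text: one float per CSV row
with open('data.csv', 'wb') as f:
    writer = csv.writer(f)
    for x in data:
        writer.writerow([x])

# Pickled Python list of floats
with open('data.pkl', 'wb') as f:
    pickle.dump(list(data), f, pickle.HIGHEST_PROTOCOL)

# NumPy binary format
np.save('data.npy', data)

# SQLite: one row per value
conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE vals (x REAL)')
conn.executemany('INSERT INTO vals VALUES (?)', ((float(x),) for x in data))
conn.commit()
conn.close()

for name in ('data.csv', 'data.pkl', 'data.npy', 'data.db'):
    print name, os.path.getsize(name), 'bytes'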
To regenerate the benchmarks, you will need the `tables` module installed. Just run the Makefile inside the `benchmarks/` directory.
The example code snippets on my slides are included in the `snippets/` directory in case it's easier for someone to copy-paste from there. I offer no guarantees on whether they do the right thing for you.