Slides and code from my PyData NYC 2013 talk on Python tools for data wrangling.
Slides and code (with the exception of `pony_blanket.py`) for Beyond the dict: Python Tools for Data Wrangling by Imran S Haque are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
The `pony_blanket` module is licensed for open-source use under the GNU Affero General Public License v3 to comply with the open-source license terms of Pony ORM. See `pony_blanket/pony_blanket_example.py` for an example of usage. Run the Makefile in that directory to regenerate all necessary data files.
The `pony_blanket` module is an adapter between the schemas defined by Marty Alchin's `sheets` module, which provides schematized parsing for delimited text data, and Pony ORM, an ORM for Python with Pythonic syntax and little boilerplate. `pony_blanket` lets you define a schema once, for `sheets`, and have it automatically turned into a schema for `pony.orm`:
from sheets import Row, Dialect
from sheets import StringColumn, FloatColumn

class HapMapAllele(Row):
    Dialect = Dialect(has_header_row=True,
                      delimiter=' ')
    rsid = StringColumn()
    ref_freq = FloatColumn()
    alt_freq = FloatColumn()

    # pony_blanket extension to sheets
    # set the `indexed` attribute on any sheets column to index the
    # corresponding column in the database model
    alt_freq.indexed = True

from pony_blanket import csv_to_db
from pony.orm import db_session, select

db, models = csv_to_db({'hapmap.txt': HapMapAllele})

# Query the database to find loci with high frequency of the alternate allele
with db_session:
    print len(select(x for x in models[HapMapAllele]
                     if x.alt_freq > 0.01))
By default, `csv_to_db` will load the given files into an in-memory SQLite3 database (`sqlite3.connect(':memory:')`).
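Nothing here touches disk: an in-memory SQLite database lives only as long as its connection. A quick stdlib-only illustration of that behavior (the table and values below are made up for the example; this is not `pony_blanket` code):

import sqlite3

# Everything lives in RAM; closing the connection discards the database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE alleles (rsid TEXT, alt_freq REAL)')
conn.execute("INSERT INTO alleles VALUES ('rs123', 0.02)")
print conn.execute('SELECT count(*) FROM alleles').fetchone()[0]  # prints 1
conn.close()  # the table and its rows are gone now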
The code in `benchmarks/` generates the stats listed in my slide on numeric-storage performance. Please don't complain about the different ways that I'm Doing It Wrong -- there are definitely more efficient ways to do numeric storage (particularly in SQL). The point of these benchmarks is to get order-of-magnitude estimates of the efficiency of storing numeric data in different formats using the most obvious means in the respective libraries.
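For a flavor of what's being compared, here is a minimal sketch of that kind of order-of-magnitude size comparison. The array length and file names are illustrative, and this is not the actual code in `benchmarks/` (which also uses the `tables` module):

import csv, os, pickle, sqlite3
import numpy as np

data = np.random.rand(100000)  # 100k doubles, ~800 KB of raw payload

# Plain text: one float per CSV row
with open('data.csv', 'wb') as f:
    writer = csv.writer(f)
    for x in data:
        writer.writerow([x])

# Pickled Python list of floats
with open('data.pkl', 'wb') as f:
    pickle.dump(list(data), f, pickle.HIGHEST_PROTOCOL)

# NumPy binary format
np.save('data.npy', data)

# SQLite: one row per value
conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE vals (x REAL)')
conn.executemany('INSERT INTO vals VALUES (?)', ((float(x),) for x in data))
conn.commit()
conn.close()

for name in ('data.csv', 'data.pkl', 'data.npy', 'data.db'):
    print name, os.path.getsize(name), 'bytes'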
To regenerate the benchmarks, you will need the `tables` module installed. Just run the Makefile inside the `benchmarks/` directory.
The example code snippets on my slides are included in the `snippets/` directory in case it's easier for someone to copy-paste from there. I offer no guarantees on whether they do the right thing for you.