Zip Files

As you deal with bigger datasets, you will often find them compressed. Compression takes advantage of patterns and redundancy in data to store a big file in less space.

For example, say you have a string like this: "HAHAHAHAHAHAHAHAHAHA". You could imagine inventing a notation for representing that string with fewer characters (maybe something like "HA{x10}").
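To make the idea concrete, here is a toy sketch of that made-up notation. The function name `compress_repeats` and the `{xN}` format are just illustrations (this is not how zip actually works); it only handles the special case where the whole string is one unit repeated over and over.

```python
def compress_repeats(text: str) -> str:
    """Return 'UNIT{xN}' if text is one unit repeated N > 1 times, else text unchanged."""
    n = len(text)
    # Try the shortest candidate unit first, so we find the most compact form.
    for size in range(1, n // 2 + 1):
        if n % size == 0:
            unit = text[:size]
            count = n // size
            if unit * count == text:
                return f"{unit}{{x{count}}}"
    return text


print(compress_repeats("HA" * 10))      # HA{x10}
print(compress_repeats("hello world"))  # hello world (no repetition to exploit)
```

Notice that the second string doesn't get any shorter: compression only helps when the data has patterns to exploit.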

Zip is one common compression format. In addition to compressing files, .zips often bundle multiple files together. In the past, you may have run unzip in the terminal before starting to write your code. However, it is also possible to read the contents of a .zip file directly in Python. Doing so is often more convenient, and the code may also be faster.

Generating a .zip

To create an example.zip file, run the following (don't worry, understanding this particular snippet isn't expected for this lab):

import pandas as pd
from zipfile import ZipFile, ZIP_DEFLATED
from io import TextIOWrapper

# a regular (uncompressed) file on disk, used later in the "Binary Open" section
with open("hello.txt", "w") as f:
    f.write("hello world")

# example.zip will hold three compressed files
with ZipFile("example.zip", "w", compression=ZIP_DEFLATED) as zf:
    with zf.open("hello.txt", "w") as f:
        f.write(bytes("hello world", "utf-8"))
    with zf.open("ha.txt", "w") as f:
        f.write(bytes("ha"*10000, "utf-8"))
    with zf.open("bugs.csv", "w") as f:
        # files inside a zip take bytes, so wrap f to let pandas write text
        pd.DataFrame([["Mon",7], ["Tue",4], ["Wed",3], ["Thu",6], ["Fri",9]],
                     columns=["day", "bugs"]).to_csv(TextIOWrapper(f), index=False)

ZipFile

We can access the file by using the ZipFile type, imported from the zipfile module:

from zipfile import ZipFile

ZipFiles are context managers, much like file objects. Let's try creating one using with, then loop over info about the files inside using the infolist method:

with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        print(info)

Let's print off the size and compression ratio (uncompressed size divided by compressed size) of each file:

with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        orig_mb = info.file_size / (1024**2) # there are 1024**2 bytes in a MB
        ratio = info.file_size / info.compress_size
        s = "file {name:s}, {mb:.3f} MB (uncompressed), {ratio:.1f} compression ratio"
        print(s.format(name=info.filename, mb=orig_mb, ratio=ratio))

Take a minute to look through -- which file is largest? What is its compression ratio?

Since the compression ratio is the original size divided by the compressed size, bigger means more savings. ha.txt contains "ha" repeated 10 thousand times ("hahahahaha..."), which is highly compressible.

As practice, compute the overall compression ratio (sum of all uncompressed sizes divided by sum of all compressed sizes) -- it ought to be about 216.
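If you get stuck, here is one possible sketch (try it yourself first!). The function name `overall_ratio` is made up for this example; it only relies on the `file_size` and `compress_size` attributes we used above.

```python
from zipfile import ZipFile


def overall_ratio(zip_path):
    """Sum of uncompressed sizes divided by sum of compressed sizes."""
    with ZipFile(zip_path) as zf:
        infos = zf.infolist()
        total_orig = sum(info.file_size for info in infos)
        total_comp = sum(info.compress_size for info in infos)
    # note: this divides by zero if the zip is empty or holds only empty files
    return total_orig / total_comp
```

You would call it as `overall_ratio("example.zip")`. Note that this is not the same as averaging the per-file ratios: big files count for more in the overall number.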

Binary Open

Ok, forget zips for a minute, and run the following:

with open("hello.txt", "r") as f:
    data1 = f.read()

with open("hello.txt", "rb") as f:
    data2 = f.read()

print(type(data1), type(data2))

What type does f.read() return if we use "r" for the mode? What about "rb"?

The "b" stands for "binary" or "bytes", so we get back type bytes. If we open in text mode (the default), as in the first open, the bytes automatically get translated to strings, using some encoding (like "utf-8") that assigns characters to byte-represented numbers.
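To see that translation in isolation, you can convert between str and bytes yourself with the built-in `encode` and `decode` methods. This small sketch involves no files at all; the variable names are just for illustration.

```python
text = "hello world"
raw = text.encode("utf-8")   # str -> bytes
print(raw)                   # b'hello world'
print(list(raw[:5]))         # [104, 101, 108, 108, 111] (the numbers behind "hello")

back = raw.decode("utf-8")   # bytes -> str
print(back == text)          # True

# one character is not always one byte: "\u00e9" is an accented e
accented = "\u00e9"
print(len(accented), len(accented.encode("utf-8")))  # 1 2
```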

Run this:

from io import TextIOWrapper

TextIOWrapper objects "wrap" file objects and are used to convert bytes to characters on the fly. For example, try the following:

with open("hello.txt", "rb") as f:
    tio = TextIOWrapper(f)
    data3 = tio.read()
print(type(data3))

Even though we open in binary mode, we get a string thanks to TextIOWrapper! You can think of the example where we read into data1 as a shorthand for what we did to get data3.
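TextIOWrapper doesn't care where the bytes come from, as long as it gets a binary stream. As a small sketch, here we wrap an in-memory `BytesIO` (standing in for a file opened with "rb") and name the encoding explicitly rather than relying on the platform default. The names `binary_stream` and `data4` are just for this example.

```python
from io import BytesIO, TextIOWrapper

# an in-memory binary stream, standing in for a file opened with "rb"
binary_stream = BytesIO("hello world".encode("utf-8"))

tio = TextIOWrapper(binary_stream, encoding="utf-8")
data4 = tio.read()
print(type(data4), data4)  # <class 'str'> hello world
```

In fact, if you check `type(f)` for a regular file opened in text mode, you'll see it is a TextIOWrapper already.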

Reading Files

A ZipFile has a method named open that works a lot like the open function you're familiar with. A ZipFile is a context manager, and so is the object returned by ZipFile.open(...), so we'll end up with nested with statements to make sure everything gets closed up properly. Let's take a look at the compressed hello.txt file:

with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        print(f.read())

Whoa, why do we get b'hello world'? For regular files, "r" mode defaults to reading text, but for files inside a zip, "r" mode reads in binary, so we got back bytes.

TextIOWrapper saves the day:

with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        tio = TextIOWrapper(f)
        print(tio.read())

With regular files, TextIOWrapper is a bit useless (why not just open with "r" instead of "rb"?), but for zips, it is crucial.
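One place this matters is with libraries that insist on text. For instance, Python's built-in csv module expects strings, not bytes, so a file inside a zip needs wrapping first. Here is a rough sketch; the helper name `read_csv_rows` is invented for this example, and it assumes the file is UTF-8 encoded.

```python
import csv
from io import TextIOWrapper
from zipfile import ZipFile


def read_csv_rows(zip_path, member, encoding="utf-8"):
    """Return the rows of a CSV file stored inside a zip, as a list of dicts."""
    with ZipFile(zip_path) as zf:
        with zf.open(member) as f:  # binary stream of decompressed bytes
            # csv needs text, so decode on the fly (newline="" is what the csv docs recommend)
            tio = TextIOWrapper(f, encoding=encoding, newline="")
            return list(csv.DictReader(tio))
```

Calling `read_csv_rows("example.zip", "bugs.csv")` should give one dict per row. Keep in mind that csv does no type conversion, so every value comes back as a string.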

Pandas

Pandas can read a DataFrame even from a binary stream, so no TextIOWrapper is needed here. You can do this:

with ZipFile('example.zip') as zf:
    with zf.open("bugs.csv") as f:
        df = pd.read_csv(f)
df