# 1.5.0
We've been working with Easydata extensively over the last few months to implement shared repos for data science workshops. Easydata 1.5 is the result of many, many hands (and heads) bashing against the code. Here are the big changes since 1.0.
## We slimmed down the Makefile
Easydata has been evolving for a couple of years, and there's a lot of functionality there we don't use every day. We took the opportunity to remove (or at least hide) some of the older workflows in favour of the notebook-and-Dataset approach we've been using lately.
In a future release, we'll look at reviving the `make fetch` / `make datasets`-style targets to use the Dataset dependency graph. In the meantime, these targets are deprecated.
## Fetch improvements
One of the most common Easydata tasks is fetching (and hash-validating) raw data from remote sources. We made a couple of improvements to this process:
- We added a `tqdm` status bar to fetch actions. Let's face it: little blue bars are better than staring at what looks like a hung computer.
- We added a `url_options` flag for URL-based fetches. Recently, one of our datasets was hosted on a machine with an expired SSL certificate. Since we hash-validate anyway, adding `url_options` let us ignore the SSL errors in our on-disk dataset specification.
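As a rough illustration, a fetch spec for that dataset might look something like the sketch below. The `DataSource` import path, the `add_url` signature, and the idea that `url_options` is forwarded to the underlying HTTP call are all assumptions here, not the documented API; only the `url_options` name comes from this release.

```python
# Hypothetical sketch only: method names and option shapes are assumptions.
from src.data import DataSource  # assumed import path

dsrc = DataSource("expired-cert-data")
dsrc.add_url(
    url="https://data.example.com/archive.tar.gz",  # hypothetical URL
    hash_type="sha1",
    hash_value="0123456789abcdef...",  # the fetch is still hash-validated
    url_options={"verify": False},     # assumed: forwarded to the HTTP layer,
                                       # skipping SSL certificate verification
)
```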
## Relative `download_dir`s
In a `DataSource`, `download_dir`, if given without a leading slash, is now assumed to be relative to `raw_data_path`. We expose this fact via a new property (`download_dir_fq`) and make use of it in the `fetch()` mechanisms.
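The resolution rule amounts to something like the following minimal sketch. Only `download_dir`, `download_dir_fq`, and `raw_data_path` come from these notes; the standalone function is illustrative (the release exposes this as a property):

```python
from pathlib import Path
from src import paths

def download_dir_fq(download_dir):
    """Fully qualified download dir: relative paths anchor at raw_data_path."""
    dd = Path(download_dir)
    return dd if dd.is_absolute() else Path(paths['raw_data_path']) / dd
```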
## Better JupyterHub Integration
We made a number of tweaks and improvements (especially in the documentation) to better support working on JupyterHub instances. We have big plans here, but for now, we've worked around most JupyterHub issues with better documentation.
## Improved framework documentation
Speaking of documentation, we spent some time improving the framework (i.e. Easydata) documentation, which now includes extensive sections on git configuration and our recommended git workflow.
We also recorded some videos to walk you through the main pieces of the framework.
## New Git Hosting options
Since a recent event was using GitLab, we took the opportunity to add customizable git hosting services (GitHub, GitLab, Bitbucket) and branch names (`master` vs. `main`) to our cookiecutter template.
## Added new Dataset creation helpers to `src.workflow`
The `src.workflow` namespace is where we put new Easydata features while we're sorting out what the formal API should look like. In 1.5 we added a couple of helper functions that allow for the near-instant creation of Datasets under three very common use cases (sketched after this list):
- dataset from a manually downloaded CSV file
- dataset from metadata/extra information only
- derived dataset from an existing dataset and a transformation function
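A usage sketch of all three. The helper names and signatures below are hypothetical placeholders (these notes don't spell out the actual API); only the `src.workflow` namespace and the three use cases are confirmed:

```python
# Hypothetical helper names; only the src.workflow namespace is confirmed.
from src import workflow

# 1. Dataset from a manually downloaded CSV file
ds_csv = workflow.dataset_from_csv_manual_download(
    "survey", csv_path="data/raw/survey.csv")

# 2. Dataset from metadata/extra information only
ds_meta = workflow.dataset_from_metadata(
    "malpedia-hashes", metadata={"license": "CC-BY"})

# 3. Derived dataset from an existing dataset and a transformation function
def drop_nulls(df):
    """Example transformation applied to the parent dataset's data."""
    return df.dropna()

ds_clean = workflow.derived_dataset(
    "survey-clean", source="survey", transform=drop_nulls)
```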
## Better handling of local shared data
We came across an interesting use case recently: a dataset (Malpedia) that was essentially a zipfile full of malware. Needless to say, we didn't want to be building a dataset that downloaded and unpacked these files locally. Instead, we built a Dataset template that contains EXTRA data (raw file hashes) only. By setting `extra_base` to the shared location, this dataset can be used to hash-validate and access raw malware files without the need to ship and unpack a dangerous zipfile.
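In sketch form (only `extra_base` and `Dataset.load` come from these notes; the import path, the shared location, and the direct attribute assignment are assumptions):

```python
from src.data import Dataset  # assumed import path

ds = Dataset.load("malpedia")               # ships only EXTRA file hashes
ds.extra_base = "/shared/corpora/malpedia"  # hypothetical shared location
# Raw files under extra_base can now be hash-validated against the
# shipped hashes without ever downloading or unpacking the zipfile.
```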
## New hash algorithm: `size`
It's not exactly cryptographically secure, but having a file-size check in the hash-validation arsenal has proved to be very useful, especially for remote data, where the size "hash" is basically available for free.
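The whole idea fits in a couple of lines. This is a minimal sketch of the concept, not Easydata's actual implementation:

```python
import os

def size_hash(path):
    """Use the file size, as a string, in place of a cryptographic digest."""
    return str(os.path.getsize(path))
```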
## Paths (and local configuration) UX
Easydata uses a dictionary-like mechanism for storing path information; e.g.

```python
from src import paths
paths['project_path']
```
Though it looks like a dictionary, there's actually a lot of magic going on under the hood. (To be really nerdy about it, it's a singleton object backed by a `configparser` (.ini) file using the `ExtendedInterpolation` format.)
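For the curious, the pattern looks roughly like this. It's an illustrative sketch, not Easydata's actual code, and the `Paths` section name in the .ini file is an assumption:

```python
from configparser import ConfigParser, ExtendedInterpolation

class Paths:
    """Singleton, dict-like path store backed by an .ini file."""
    _instance = None

    def __new__(cls, ini_file="catalog/config.ini"):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.ini_file = ini_file
        return cls._instance

    def __getitem__(self, key):
        # Re-read the .ini on every lookup, so edits to the file always win
        config = ConfigParser(interpolation=ExtendedInterpolation())
        config.read(self.ini_file)
        return config["Paths"][key]  # assumed section name
```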
Unfortunately, this turned out to be a dangerous UX pattern, as we found users were setting paths from notebooks and shared code. `paths` data is really meant to be local config (and hence changes should not be checked in to a git repo).
Realizing this, we've modified the implementation to always re-read from the .ini file when a `paths` value is queried. We now recommend that `paths` (and other local configuration info) be set either from the command line; e.g.
```bash
python -c "import src; src.paths['raw_data_path'] = '/path/to/big/data'"
```
or by editing the `catalog/config.ini` file directly. To help enforce this usage, setting a path interactively will issue a warning.
## Dataset Cache changes
We've had to turn off the code paths that attempt to cache non-dumped Datasets. The thinking was just plain wrong (the hashes apply to DataSources and don't contain enough information to cache a full Dataset). To be honest, it's not clear what the benefit would have been anyway.
Datasets are still dumped/cached locally -- i.e. serialized to `data/processed` -- by default, so `Dataset.load()` will always be faster the second time. In future Easydata releases (currently aiming for 1.6), we will introduce a new shared caching mechanism that speeds up Dataset creation within your team or workgroup by caching the binary blobs that comprise the hashable parts of Datasets.