
Talk Abstract

This document contains the abstract submitted to the SciPy 2022 conference committee in February 2022.


If you could reduce the size of a popular Python package by 5 MB, would it matter?

In my opinion, the answer is "yes". This talk presents an argument supporting that position.

What, specifically, will attendees learn from the talk?

short outline

  • (00:00 - 03:00) What is a Python package and what files does it usually contain?
  • (03:00 - 15:00) Specific examples of issues caused by larger package distributions.
  • (15:00 - 20:00) Specific examples of unnecessary files in some popular packages on PyPI.
  • (20:00 - 24:00) Guidance on how to keep packages small (e.g. with MANIFEST.in) and how to measure the impact of changes on package size.
  • (24:00 - 25:00) Call to action to go try to contribute changes like these in open source projects.

longer outline

Attendees will learn about the types of files that Python packages should contain, at a level of detail much more specific than ".py files and a license".

Next, they'll learn about the specific negative impacts of unnecessarily large packages. This section of the talk will be the longest, and is intended to get attendees thinking holistically and creatively about all the places that a Python package can travel to.

For example, a larger package tarball means:

  • increased outbound data transfer for package repositories like PyPI and conda-forge
    • exacerbated by the fact that newer versions of the pip resolver now often download multiple versions of the same package while searching for compatible sets of versions
  • increased inbound data transfer for anyone installing packages
    • extra problematic in the presence of weak or unreliable internet connections
  • increased storage footprint for package repositories like PyPI and conda-forge
  • increased image size for container images, which implies:
    • increased storage footprint
    • longer pull times for Docker images, which, on services like Kubernetes or Amazon ECS, may translate to increased latency when running containerized Python tasks
  • increased storage footprint in the places where people run Python code
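To give a feel for the scale of the data-transfer bullets above, here is a rough back-of-the-envelope sketch. The function name `transfer_saved_gb` and the download count are hypothetical, chosen only for illustration; they are not measurements of any real package.

```python
def transfer_saved_gb(reduction_mb: float, downloads: int) -> float:
    """Return the total data transfer avoided, in GB (using 1 GB = 1000 MB)."""
    return reduction_mb * downloads / 1000


# Hypothetical inputs: 5 MB saved per download, 1,000,000 downloads.
print(transfer_saved_gb(5, 1_000_000))  # 5000.0
```

The same amount is saved twice over: once as outbound transfer for the package repository and once as inbound transfer for the people installing the package.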

After these examples, attendees will learn about some broad classes of "unnecessary files" found in popular Python packages. Examples will include test code / data, rendered documentation files, and continuous-integration configs / scripts. The goal of this discussion will be to teach attendees some common patterns to look out for.
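As a minimal sketch of what "looking out for common patterns" might mean in practice, the snippet below sorts file paths from a package archive into the three classes named above. The directory and file names it checks, and the helper names `classify` and `summarize`, are assumptions made for illustration; real projects lay out their files in many other ways.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical heuristics, not an exhaustive or authoritative list.
TEST_DIRS = {"test", "tests"}
DOC_DIRS = {"doc", "docs"}
CI_NAMES = {".github", ".circleci", ".travis.yml", ".appveyor.yml", ".gitlab-ci.yml"}


def classify(path: str) -> str:
    """Label a path as 'ci', 'tests', 'docs', or 'other' based on its components."""
    parts = set(PurePosixPath(path).parts)
    if parts & CI_NAMES:
        return "ci"
    if parts & TEST_DIRS:
        return "tests"
    if parts & DOC_DIRS:
        return "docs"
    return "other"


def summarize(paths) -> Counter:
    """Count how many paths fall into each class."""
    return Counter(classify(p) for p in paths)


# Made-up paths, shaped like the contents of a source distribution.
example_paths = [
    "pkg-1.0/pkg/core.py",
    "pkg-1.0/pkg/tests/test_core.py",
    "pkg-1.0/docs/index.html",
    "pkg-1.0/.github/workflows/ci.yml",
]
print(summarize(example_paths))
```

Matching on whole path components (rather than substrings) avoids flagging a module such as `pkg/latest.py` as test code.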

Attendees will also learn how to reduce the size of packages (e.g. by using rules in a MANIFEST.in file), and how to measure the compressed and uncompressed size of Python packages.
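One way to measure both sizes, sketched here with only the Python standard library: treat the file size on disk as the compressed size, and sum the sizes of the archive members for the uncompressed size. The function name `package_sizes` is hypothetical, and this is a simplified approach under the assumption that the input is either a wheel (a zip file) or an sdist tarball.

```python
import os
import tarfile
import zipfile


def package_sizes(path: str) -> tuple[int, int]:
    """Return (compressed_bytes, uncompressed_bytes) for a wheel or sdist."""
    # The archive's size on disk is the compressed size.
    compressed = os.path.getsize(path)
    if zipfile.is_zipfile(path):
        # Wheels are zip files; each entry records its uncompressed size.
        with zipfile.ZipFile(path) as zf:
            uncompressed = sum(info.file_size for info in zf.infolist())
    else:
        # Otherwise assume a tarball; tarfile detects the compression itself.
        with tarfile.open(path) as tf:
            uncompressed = sum(m.size for m in tf.getmembers() if m.isfile())
    return compressed, uncompressed
```

Running this on the same package before and after a packaging change (for example, a new exclusion rule in MANIFEST.in) gives a simple before/after comparison.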

Who is the intended audience for the talk?

Anyone developing Python packages, or looking for ideas on how to contribute back to the packages they use.

How do we know you're qualified to talk about this topic?

I have been a maintainer on LightGBM (https://github.com/microsoft/LightGBM) since 2018. That project includes a Python package which wraps a large C++ library.

I have firsthand professional experience dealing with the package-size constraints on AWS Lambda, trying to create a Python Lambda using scikit-learn + pandas + lightgbm and doing things like this:

# Zip the "python" directory, skipping tests, dist-info metadata, and bytecode caches
zip \
    --exclude \*/tests/\* \*dist-info\* \*/__pycache__\* \
    -r pandas-layer.zip \
    python

I have firsthand experience proposing changes similar to those mentioned in this talk and working with maintainers to get those changes merged into popular projects:

I've given previous conference talks about open source and Python software. For example:

Other Notes

Thanks very much for considering me!