Scaling Data Science with Dask #133

Open
pavithraes opened this issue May 17, 2022 · 0 comments

Abstract (2-3 lines)

Python data science tools like pandas, NumPy, and scikit-learn are excellent. However, they mostly run on a single core, even though modern processors have many, and they are limited by your computer's RAM. In this tutorial, you'll learn to scale your data science workflow to larger datasets and models using Dask, by leveraging the full potential of your laptop, all while staying in the PyData ecosystem. You will learn the fundamentals of parallel and distributed computing, when (and when not) to consider scaling, and work through some hands-on examples.

Brief Description and Contents to be covered

Dask is an open source library for parallel and distributed computing in Python. This tutorial is meant to be an introduction to this super broad and powerful library. We will:

  • Build vocabulary: What is parallel and distributed computing? What are clusters? What do we mean by "scaling to the cloud"?
  • Introduce Dask: What is Dask? How does it work? Where is it used?
  • Learn the Dask DataFrame API, which mimics the pandas API -- how are the two APIs similar, and where do they differ?
  • Talk about Dask's Distributed Scheduler and explore Dask's (very cool) diagnostic Dashboards
  • Briefly cover the low-level Dask Delayed API, which can parallelize any general Python code
  • Conclude with some best practices and discuss resources for learning more

Pre-requisites for the talk

  • Programming fundamentals in Python (e.g., variables, data structures, for loops, etc.)
  • Some familiarity with NumPy, pandas, and scikit-learn
  • JupyterLab / Jupyter Notebooks
  • Basic familiarity with the shell/terminal

Time required for the talk

1 hr

Link to slides

https://github.com/pavithraes/dask-mini-tutorial/blob/main/slides.pdf

Will you be doing a hands-on demo as well?

Yes

Link to ipython notebook (if any)

https://github.com/pavithraes/dask-mini-tutorial

About yourself

My name is Pavithra Eswaramoorthy. I currently work as a Community Engagement Manager at Coiled, where I help support Dask users and contributors. I also contribute to the Bokeh project, and I've worked on administering the Wikimedia Foundation’s open source outreach programs in the past. In my spare time, I enjoy a good book and hot coffee. :)

Are you comfortable if the talk is recorded and uploaded to PyData Delhi's YouTube channel?

Yes
