
How to do Data Science with larger than memory data using Dask? #132

Open

arnabbiswas1 opened this issue Dec 21, 2020 · 3 comments

@arnabbiswas1

  • Abstract (2-3 lines)

As Data Scientists, we face a few challenges while dealing with large volumes of data:

  1. Popular Python libraries like NumPy & Pandas are not designed to scale beyond a single processor/core
  2. NumPy, Pandas, and Scikit-Learn are not designed to scale beyond a single machine
  3. If the data is bigger than RAM, these libraries can't be used

In this session, I will discuss how these challenges can be addressed using the parallel computing library Dask.
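To give a flavour of this, here is a minimal sketch, assuming a set of CSV files under a hypothetical data/ directory and hypothetical column names, of the pandas-style workflow Dask enables on larger-than-memory data:

```python
import dask.dataframe as dd

# Lazily partitions the CSVs into many small pandas DataFrames;
# nothing is loaded into memory at this point
df = dd.read_csv("data/*.csv")

# Builds a task graph for a familiar pandas-style aggregation
# ("category" and "value" are hypothetical column names)
mean_by_group = df.groupby("category")["value"].mean()

# compute() streams partitions through memory chunk by chunk,
# so the full dataset never has to fit in RAM at once
print(mean_by_group.compute())
```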

  • Brief Description and Contents to be covered

The talk is divided into two parts:

  1. Understanding the challenges of large data (delivered through a presentation)
    a. Fundamentals of computer architecture (with a focus on the computing unit & the memory unit)
    b. Why parallelism is necessary in a multi-core architecture
    c. Challenges with large data (data that doesn't fit in RAM) & how to address them
    d. Introduction to distributed computing

  2. How does Dask handle large data? (Code walk-through)
    a. What is Dask, and why is it needed?
    b. How does Dask parallelize jobs across cores/processors? (see the sketch after this list)
    c. How does Dask handle larger-than-memory data using out-of-core and distributed computing?
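As a reference for item 2b, a minimal sketch of how Dask can parallelize independent jobs across cores with dask.delayed; process_file and the file names here are hypothetical stand-ins:

```python
import dask
from dask.distributed import Client

def process_file(path):
    # Stand-in for any pure-Python work on one file
    with open(path) as f:
        return len(f.read())

if __name__ == "__main__":
    # Starts a local "cluster" of worker processes, one per core by default
    client = Client()

    # delayed() records each call as a node in a task graph instead of running it
    tasks = [dask.delayed(process_file)(p) for p in ["a.csv", "b.csv", "c.csv"]]

    # compute() executes the graph, running independent tasks in parallel
    results = dask.compute(*tasks)
    print(results)
```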

  • Pre-requisites for the talk

Basic knowledge of Python-based Data Science libraries like Pandas, NumPy, and Scikit-Learn

  • Time required for the talk

45 minutes to 1 hour. This talk can be extended into a 2-hour-long workshop as well.

  • Link to slides

https://speakerdeck.com/arnabbiswas1/scale-up-your-data-science-work-flow-using-dask

  • Will you be doing hands-on demo as well?

Yes.

  • Link to ipython notebook (if any)

https://github.com/arnabbiswas1/dask_workshop

  • About yourself

https://arnab.blog/about/

  • Are you comfortable if the talk is recorded and uploaded to PyData Delhi's YouTube channel ?

Yes

  • Any query ?

This talk (45 minutes) has been delivered recently to the Bangalore Python User Group, BangPypers. Here is the recording for your reference: Link

@MSanKeys963
Member

Hi @arnabbiswas1. Thanks for the proposal.
@shagunsodhani please have a look.

@shagunsodhani
Contributor

Hey @arnabbiswas1! Thanks for proposing the talk. The content looks good. Best of luck :)

@arnabbiswas1
Author

@MSanKeys963 @shagunsodhani

As mentioned in the proposal, this talk can be delivered as-is (45 minutes). Alternatively, it can be extended to 1.5-2 hours, focusing on the usage of Dask for Data Science use cases (existing content + Dask Machine Learning + the Dask Dashboard for debugging).
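For context on the Dask Dashboard part, a minimal sketch, assuming a local machine, of how the diagnostic dashboard is reached:

```python
from dask.distributed import Client

# Starting a local client also starts the diagnostic dashboard
client = Client()

# URL of the live dashboard (task stream, per-worker memory usage, etc.)
print(client.dashboard_link)
```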

Please let me know your thoughts.
