As Data Scientists, we face a few challenges while dealing with large volumes of data:
Popular Python libraries like NumPy and Pandas are not designed to scale beyond a single processor/core
NumPy, Pandas, and Scikit-Learn are not designed to scale beyond a single machine
If the data is bigger than RAM, these libraries can't be used
In this session, I will discuss how these challenges can be addressed using the parallel computing library Dask.
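To make the idea concrete, here is a minimal sketch (with hypothetical file and column names) of the kind of change Dask enables: the same groupby-aggregate computation, first with pandas, then with Dask's pandas-like API.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file and column names, for illustration only.

# pandas: loads the entire file into RAM and runs on a single core
df = pd.read_csv("transactions.csv")
pandas_result = df.groupby("category")["amount"].mean()

# Dask: reads the data in partitions, builds a lazy task graph, and only
# materializes the result on .compute(), using multiple cores in parallel
ddf = dd.read_csv("transactions-*.csv")
dask_result = ddf.groupby("category")["amount"].mean().compute()
```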
Brief Description and Contents to be covered
The talk is divided into two portions:
Understanding the challenges of large data (delivered through a presentation)
a. Fundamentals of computer architecture (with a focus on the computing unit and the memory unit)
b. Why is parallelism necessary in a multi-core architecture?
c. Challenges with large data (data that doesn't fit in RAM) and how to address them (a minimal sketch follows this list)
d. Introduction to distributed computing
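As a baseline for item (c), here is a minimal sketch (hypothetical file and column names) of how larger-than-RAM data can be handled with plain pandas by streaming it in chunks; the talk contrasts this manual approach with what Dask automates.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
# Stream a CSV that does not fit in RAM in fixed-size chunks
# and aggregate incrementally instead of loading it all at once.
total_amount = 0.0
row_count = 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total_amount += chunk["amount"].sum()
    row_count += len(chunk)

print("mean amount:", total_amount / row_count)
```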
How does Dask handle large data? (Code walkthrough)
a. What is Dask and why is it needed?
b. How does Dask parallelize jobs across cores/processors?
c. How does Dask handle larger-than-memory data using out-of-core and distributed computing? (A minimal sketch follows this list.)
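A rough sketch of what the walkthrough covers (paths and column names are placeholders): a local Dask cluster of worker processes, a lazily evaluated dataframe spanning many files, and a computation that runs in parallel, partition by partition, without holding the whole dataset in memory. The same Client API can point at a multi-node cluster for distributed computing.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Hypothetical paths and column names, for illustration only.

# Start a local "cluster" of worker processes on this machine
client = Client(n_workers=4, threads_per_worker=2)

# Lazily read many CSV files as one logical dataframe (one partition per file/block)
ddf = dd.read_csv("data/transactions-*.csv")

# Nothing runs yet: this only builds a task graph
daily_totals = ddf.groupby("date")["amount"].sum()

# compute() executes the graph in parallel across the workers,
# processing partitions a few at a time instead of the whole dataset at once
print(daily_totals.compute())

client.close()
```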
Prerequisites for the talk
Basic knowledge of Python-based data science libraries such as Pandas, NumPy, and scikit-learn
Time required for the talk
45 minutes to 1 hour. This talk can also be extended to a 2-hour-long workshop.
As mentioned in the proposal, this talk can be delivered as is (45 minutes), or it can be extended to 1.5 to 2 hours focusing on the usage of Dask specifically for data science use cases (existing content + Dask machine learning + the Dask dashboard for debugging).
https://speakerdeck.com/arnabbiswas1/scale-up-your-data-science-work-flow-using-dask
Yes.
https://github.com/arnabbiswas1/dask_workshop
https://arnab.blog/about/
Yes
This talk (45 minutes) was recently delivered to the Bangalore Python User Group, BangPypers. Here is the recording for your reference: Link