From 1378382f60ff3fed5962b5fe4ed9e6496dff5208 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Wed, 25 Oct 2023 16:45:43 +0200
Subject: [PATCH] format iframes

---
 docs/website/blog/2023-10-25-dlt-deepnote.md | 109 ++++++++++++++++---
 1 file changed, 95 insertions(+), 14 deletions(-)

diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md
index 5a0816b9f5..f042fcd93e 100644
--- a/docs/website/blog/2023-10-25-dlt-deepnote.md
+++ b/docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -63,7 +63,7 @@ However, the journey to reach these stages is stretched much longer due to the t
 The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like:
-
+
+
+ +
+
+
+---
 And now, putting one of the nested variables into a pandas data frame:
-
+
+
+ +
+
 And this little exercise needs to be repeated for each of the columns that we had to “explode” in the first place.
@@ -113,26 +128,70 @@ We leave the loading of the raw data to dlt, while we leave the data exploration
 Imagine this: you initialize a data pipeline in one line of code, and pass complicated raw data in another to be modelled, unnested and formatted. Now, watch that come to reality:
-
+
+
+ +
+ +
+ +
+
+
+
-
 And that’s pretty much it. Notice the difference in the effort you had to put in?
-The data has been loaded into a pipeline with `duckdb` as its destination. `duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow. The data has been unnested and formatted. To explore what exactly was stored in that destination, a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed.
+The data has been loaded into a pipeline with `duckdb` as its destination.
+`duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow.
+The data has been unnested and formatted. To explore what exactly was stored in that destination,
+a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed.
+
+
+ +
+
+
+
-
 At first glance, we can see that both datasets, `violence` and `wellness`, have their own base tables. One of the child tables is shown below:
-
+
+
+ +
+
+
 ### Know your data model; connect the unnested tables using dlt’s pre-assigned primary and foreign keys:
 The child tables, like `violence__value` or `wellness__age_related`, are the unnested lists of dictionaries from the original json files. The `_dlt_id` column as shown in the table above serves as a **primary key**. This will help us in connecting the children tables with ease. The `parent_id` column in the children tables serves as a **foreign key** to the base tables. If more than one child table needs to be joined together, we make use of the `_dlt_list_idx` column;
-
+
+
+ +
+
+
 ## Deepnote - the iPython Notebook turned Dashboarding tool
@@ -146,15 +205,29 @@ At this point, we would probably move towards a `plt.plot` or `plt.bar` function
 And a stacked bar chart came into existence! A little note about the query results; the **value** column corresponds to how much (in %) a person justifies violence against women. An interesting yet disturbing insight from the above plot: in many countries, women condone violence against women as often if not more often than men do!
-The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by 2 parameters, gender and demographic. The two colors represent genders. While different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence on them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence.
+The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by 2 parameters, gender and demographic. The two colors represent genders. While different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence on them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis.
If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence.
 Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison shows that education plays a much larger role than employment in women’s levels of violence justification.
-
+
+ +
+
+
+
 Let’s look at one last plot created by Deepnote for the other dataset with wellness indicators. The upward moving trend shows us that women are much less likely to have a final say on their health if they are less educated.
-
+
+ +
+ # 🌍 Clustering countries based on their wellness indicators @@ -174,7 +247,15 @@ The color bar shows us which color is associated to which cluster. Namely; 1: pu To understand briefly what each cluster represents, let’s look at the averages for each indicator across all clusters; - +
+ +
+
+
+
 This tells us that according to these datasets, cluster 2 (highlighted blue) is the cluster that is performing the best in terms of wellness of women. It has the lowest levels of justifications of violence, highest average years of education, and almost the highest percentage of women who have control over their health and finances. This is followed by clusters 3, 1, and 4 respectively; countries like the Philippines, Peru, Mozambique, Indonesia and Bolivia are comparatively better than countries like South Africa, Egypt, Zambia, Guatemala & all South Asian countries, with regard to how they treat women.