---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.5
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

Visualization of datasets

FIZ228 - Numerical Analysis
Dr. Emre S. Tasci, Hacettepe University

+++

It's always beneficial to inspect the data before and after we process it, as it can reveal hidden relations or help us spot off (outlier) values. Even though the matplotlib module offers flexibility, it is unfortunately not known for its practicality. Wrappers like the seaborn module provide the same functionality with much more ease.

+++

"El Clasico"

Let's try to do it the old way, using numpy & matplotlib. As we observed in our previous lecture, pandas is the go-to module when dealing with datasets, but for reference purposes we'll start with numpy arrays. Since numpy arrays can not (by default) store elements of different types, our string timestamps are lost during the import.

To begin with, we are going to use the meteorological data of the city of Basel, obtained from meteoblue.com:

{download}`01_meteoblue_Basel_20230303T060433.csv <data/01_meteoblue_Basel_20230303T060433.csv>`

import numpy as np
data_np = np.genfromtxt("data/01_meteoblue_Basel_20230303T060433.csv", delimiter=',',
                        filling_values=0.0,skip_header=10)
data_np
data_np.shape
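
As a side note (a minimal sketch, not used in the rest of this lecture): if we really wanted to keep the string timestamps, genfromtxt could infer a per-column ("structured") dtype for us instead of forcing everything into floats:

# Let genfromtxt infer a structured dtype so the timestamp strings survive;
# max_rows keeps this demonstration short
data_str = np.genfromtxt("data/01_meteoblue_Basel_20230303T060433.csv",
                         delimiter=',', skip_header=10,
                         dtype=None, encoding='utf-8', max_rows=5)
data_str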

We're going to construct meaningful indices for the first column by joining the year, month and day with the hour.

Checking the timestamps of the entries, we see that they run from '20220101T0000' to '20230303T2300' (with most of the last entries being blank, but we'll deal with that later).

:tags: [output_scroll]

# First pass: generate the YYMMDDHH date codes and print them to check the range
flag_break = False
for y in range(22,24):
    if(flag_break):
        break
    for m in range(1,13):
        if(flag_break):
            break
        for d in range(1,32):
            if(flag_break):
                break
            if((m==2) & (d>28)):
                continue
            if((m in [2,4,6,9,11]) & (d>30)):
                continue
            for h in range (0,24):
                print('{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h))
                date = '{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h)
                if(date == '23030323'):
                    flag_break = True
                    break
# Second pass: write the same date codes into the first (timestamp) column of data_np
i = 0
flag_break = False
for y in range(22,24):
    if(flag_break):
        break
    for m in range(1,13):
        if(flag_break):
            break
        for d in range(1,32):
            if(flag_break):
                break
            if((m==2) & (d>28)):
                continue
            if((m in [2,4,6,9,11]) & (d>30)):
                continue
            for h in range (0,24):
                #print('{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h))
                date = '{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h)
                data_np[i,0] = date
                i += 1
                if(date == '23030323'):
                    flag_break = True
                    break
print(i)
data_np
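
By the way, the nested loops above are good practice, but as an alternative sketch (not in the original workflow), pandas can generate the very same hourly date codes for us in a couple of lines:

# Hourly timestamps from 2022-01-01 00:00 to 2023-03-03 23:00, formatted as YYMMDDHH
import pandas as pd
hours = pd.date_range('2022-01-01 00:00', '2023-03-03 23:00', freq='H')
codes = hours.strftime('%y%m%d%H').astype(int)
print(len(codes), codes[0], codes[-1])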
:tags: [output_scroll]

data_np[-1000:,0]

Let's get rid of those without any temperature information (col #1):

data_np[data_np[:,1] == 0,:]

It turns out that these are the entries from Feb 24th, 2023 onwards, so:

a = data_np.copy()
a
a = np.delete(a,np.arange(0,a.shape[0])[a[:,0]>23022400],0)
a.shape[0]
data_np = a.copy()  # store the trimmed array back into data_np
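
The same trimming could also have been written in one step with a boolean mask instead of np.delete (shown here on the already-trimmed array, so the shape stays the same):

# Boolean-mask alternative: keep only the rows dated up to Feb 24th, 2023
data_np_masked = data_np[data_np[:,0] <= 23022400]
data_np_masked.shape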

... and here comes the basic plot:

import matplotlib.pyplot as plt
data_2022 = data_np[data_np[:,0]<23010100,:]
data_2022[-10:,:]
plt.plot(data_2022[:,0],data_2022[:,1],"b-s")
plt.title("Graph via Matplotlib")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.show()
plt.plot(data_2022[:,1])
plt.show()
filter_1 = (data_np[:,0]>=23010100) & (data_np[:,0]<23020100)
plt.plot(data_np[filter_1,0],data_np[filter_1,1],"b-s")
plt.title("Graph via Matplotlib")
plt.xlabel("January 2023")
plt.ylabel("Temperature")
plt.show()

Exporting a numpy array as a CSV file

While we are at it, here is how we can export a numpy array as CSV:

np.savetxt('del_this_file.csv', data_np, delimiter = ",")
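
savetxt also accepts a format string and a header line if we want finer control over the output; here is a small sketch (the header text is just illustrative):

# Export with 2 decimals per value and a header line;
# comments='' prevents numpy from prefixing the header with '#'
np.savetxt('del_this_file_fmt.csv', data_np, delimiter=',', fmt='%.2f',
           header='timestamp,temperature,humidity,cloud,sunshine,radiation',
           comments='')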

Importing a CSV file with Pandas

Now that we have experienced the pains of the "old" method, let's revive the technique we acquired last week: using pandas to hold the data in a dataframe!

import pandas as pd
pd.set_option('display.min_rows', 10)
pd.set_option('display.max_rows', 10)
data1 = pd.read_csv("data/01_meteoblue_Basel_20230303T060433.csv",
                                         skiprows=9)
data1.columns = ['Timestamp','Temperature','Relative Humidity',
                 'Cloud Coverage', 'Sunshine Duration','Radiation']
data1 = data1.set_index('Timestamp')
data1
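
As a side benefit (a minimal sketch, not used below): pandas can parse the 'YYYYMMDDTHHMM' strings of the index into real datetime objects, which makes time-based slicing and axis labels much nicer:

# Convert the string index into datetimes on a copy, so the rest of the lecture is unaffected
data1_dt = data1.copy()
data1_dt.index = pd.to_datetime(data1_dt.index, format='%Y%m%dT%H%M')
data1_dt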

Even though it is completely possible to plot a dataframe using matplotlib, there's actually a much better way to do it: enter the seaborn module!

import seaborn as sns
sns.set_theme() # To make things appear "more cool" 8)
data1.loc[:,"Relative Humidity"].max()
data1.loc[:,"Sunshine Duration"].max()
filter_202208w1 = ((data1.index>="20220801") & 
                 (data1.index<"20220808"))
data_202208w1 = data1.loc[filter_202208w1].copy()
data_202208w1
data_202208w1.shape

Here, it's as simple as it gets! We are just letting seaborn figure out what we need:

plt1 = sns.relplot(data=data_202208w1)

Plotting a specific column

We can easily designate which columns are to be used as the x & y parameters of our graph:

plt2 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity")

And here is a beauty: via the hue and size parameters, we can classify the points using other column values, making it easier to investigate the dependencies with respect to these columns:

plt3 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
                  hue="Temperature",size="Relative Humidity")

And this is our attempt to further classify things by adding the style parameter, which, alas, kind of fails:

:tags: [output_scroll]

plt3 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
                  style="Temperature")

It seems that seaborn doesn't like so many classification levels coming from continuous values. Luckily, we can work around it by smoothing things out! 8)

import numpy as np
data_202208w1
print("T_min: {:.6f}C | T_max: {:.3f}C"
      .format(data_202208w1.Temperature.min(),data_202208w1.Temperature.max()))
data_202208w1[data_202208w1.Temperature == data_202208w1.Temperature.min()]
print(data_202208w1.index[data_202208w1.Temperature == data_202208w1.Temperature.min()][0])
data_202208w1.Temperature/10
np.floor(data_202208w1.Temperature / 10.0) * 10
data_202208w1.Temperature

Here we add a new column TempFloored that stores the smoothed-out temperature values:

tempsf = np.floor(data_202208w1.loc[:,"Temperature"] / 10.0) * 10
data_202208w1.loc[:,"TempFloored"] = tempsf.loc[:]
data_202208w1
plt4 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
                  style="TempFloored")

Enough with the scatter plots, let's connect the dots via the kind parameter:

plt4 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature", 
                   kind="line", marker="^")

Here is the same thing without the markers:

plt4 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature", 
                   kind="line")
data_202208w1

Let's further classify the entries such that those with their humidity above the mean value are labeled "humid", whereas those below are labeled "dry".

Therefore, we start by calculating the mean:

data_202208w1["Relative Humidity"].mean()

and we define a new column for the job:

data_202208w1['RHClass'] = 0
data_202208w1

How do we single out the ones that have their humidity above the average? By filtering of course! 8)

filter_2 = data_202208w1['Relative Humidity']>52   # 52 ~ the mean value found above
data_202208w1.loc[filter_2,'RHClass']
filter_2
np.invert(filter_2)

So, we fill the 'RHClass' column with "humid" for the ones above the mean, and with "dry" for the others (please observe how we invert the booleans via np.invert).

data_202208w1.loc[filter_2,'RHClass'] = 'humid'
data_202208w1.loc[np.invert(filter_2),'RHClass'] = 'dry'
data_202208w1
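
By the way, the same labeling could be done in a single line via numpy's where; here is a sketch that stores the result in a separate column, so that the plots below are unaffected:

# One-line alternative, this time thresholding on the actual mean instead of 52
data_202208w1['RHClass_alt'] = np.where(
    data_202208w1['Relative Humidity'] > data_202208w1['Relative Humidity'].mean(),
    'humid', 'dry')
data_202208w1[['Relative Humidity','RHClass','RHClass_alt']]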
plt5 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature",\
                   kind="line", 
                   style="RHClass", hue="RHClass")
(plt5.map(plt.axhline,y = 22.5, color=".5", dashes=(2, 1), zorder=0)
.set_axis_labels("Day Hour", "Temperature")
.fig.suptitle("Test Graph"))
plt5 = sns.relplot(data=data_202208w1,x="Timestamp", y="Temperature", 
                   kind="line", col="RHClass")

Histogram Plots

Histograms are also essential, especially when we are dealing with distributions.

plt6 = sns.displot(data=data_202208w1,x="Temperature",
                   col="RHClass",bins=10)
data_g = np.random.normal(0,10,1000)
:tags: [output_scroll]

data_g
plt_gauss = sns.displot(data_g,bins=20,color="r",kde=True,rug=True,)
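
displot supports the same hue mechanics as relplot, so as a sketch, the "humid" and "dry" temperature distributions could also be stacked on a single histogram instead of being split into columns:

# Stack the two humidity classes in one histogram
plt7 = sns.displot(data=data_202208w1, x="Temperature",
                   hue="RHClass", bins=10, multiple="stack")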

Summary / Practical Case

+++

'Old Style' plot parameters

x_val = np.linspace(-4,5,20)
y_val = x_val**2-2*x_val-7

df_xy = pd.DataFrame({'xx':x_val,'yy':y_val})
plt_xy = sns.relplot(data=df_xy,x='xx',y='yy')
plt_xy = sns.relplot(data=df_xy,x='xx',y='yy',
                     kind="line",marker="d",
                    markersize=9,markerfacecolor="red",
                    markeredgecolor="green",
                    color="gray",linestyle="--",linewidth=3)
plt.xlabel('x values')
plt.ylabel('y values')
plt.title(r'$x^2-2x-7$')
plt.show()
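
For comparison, here is roughly the same figure drawn directly with matplotlib (a sketch, bypassing seaborn):

# The same parabola with the same marker/line options, via plt.plot
plt.plot(x_val, y_val, marker="d", markersize=9, markerfacecolor="red",
         markeredgecolor="green", color="gray", linestyle="--", linewidth=3)
plt.xlabel('x values')
plt.ylabel('y values')
plt.title(r'$x^2-2x-7$')
plt.show()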

Pretty much all the useful set

N  = 10
data2 = pd.DataFrame(np.empty((N*N,3)),columns=['x','y','val'])  # float columns so 'val' keeps its random decimals
k = 0
for i in range(N):
    for j in range(N):
        data2.iloc[k,:] = [i,j,np.random.rand()]
        k += 1
data2
data2['xymod'] = np.mod(data2.x+data2.y,5)
data2
plt2 = sns.relplot(data=data2,x='x',y='y',hue='val',
                       size='val',style='xymod')
#k=plt.legend(bbox_to_anchor=(1.8,1.01),loc='upper right')
#plt.show()
plt3 = sns.relplot(data=data2,x='x',y='y',hue='val',
                       size='val',style='xymod',col=np.mod(data2.xymod,2))
plt.show()
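
Finally, any of these figures can be written to disk: seaborn's figure-level functions return a FacetGrid, whose savefig method works just like matplotlib's (the filename below is arbitrary):

# Save the last facet plot as a PNG file
plt3.savefig('del_this_figure.png', dpi=150)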

References