FIZ228 - Numerical Analysis
Dr. Emre S. Tasci, Hacettepe University
+++
It is always worth inspecting the data before and after processing, as a plot can reveal hidden relations or make outliers easy to pick out. The matplotlib module is very flexible, but unfortunately it is not known for its practicality. Wrappers like the seaborn module provide the same functionality with ease.
+++
Let's first do it the old way, using numpy & matplotlib. As we observed in our previous lecture, pandas is the go-to module when dealing with datasets, but for reference purposes we'll start with numpy arrays. As numpy arrays cannot (by default) store elements of different types, our string timestamps are lost on import.
To begin with, we are going to use the meteorological data of the city of Basel, obtained from meteoblue.com
{download}01_meteoblue_Basel_20230303T060433.csv<data/01_meteoblue_Basel_20230303T060433.csv>
import numpy as np
data_np = np.genfromtxt("data/01_meteoblue_Basel_20230303T060433.csv", delimiter=',',
filling_values=0.0,skip_header=10)
data_np
data_np.shape
We are going to build a meaningful index in the first column by joining the year, month and day with the hour.
Checking the timestamps in the file, we see that they run from '20220101T0000' to '20230303T2300' (most of the last entries are blank, but we'll deal with that later).
:tags: [output_scroll]
flag_break = False
for y in range(22,24):
    if(flag_break):
        break
    for m in range(1,13):
        if(flag_break):
            break
        for d in range(1,32):
            if(flag_break):
                break
            if((m==2) & (d>28)):
                continue
            if((m in [2,4,6,9,11]) & (d>30)):
                continue
            for h in range (0,24):
                print('{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h))
                date = '{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h)
                if(date == '23030323'):
                    flag_break = True
                    break
i = 0
flag_break = False
for y in range(22,24):
    if(flag_break):
        break
    for m in range(1,13):
        if(flag_break):
            break
        for d in range(1,32):
            if(flag_break):
                break
            if((m==2) & (d>28)):
                continue
            if((m in [2,4,6,9,11]) & (d>30)):
                continue
            for h in range (0,24):
                #print('{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h))
                date = '{:2d}{:02d}{:02d}{:02d}'.format(y,m,d,h)
                data_np[i,0] = date
                i += 1
                if(date == '23030323'):
                    flag_break = True
                    break
print(i)
data_np
:tags: [output_scroll]
data_np[-1000:,0]
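As a cross-check on the nested loops above, the same YYMMDDHH stamps can be generated with Python's standard datetime module, which keeps track of the month lengths by itself. This is only a sketch: the helper name `hourly_stamps` is hypothetical and not part of the lecture code.

```python
from datetime import datetime, timedelta


def hourly_stamps(start, end):
    """Return YYMMDDHH integers for every hour from start to end (inclusive)."""
    stamps = []
    current = start
    while current <= end:
        # e.g. 2022-01-01 00:00 becomes the integer 22010100
        stamps.append(int(current.strftime("%y%m%d%H")))
        current += timedelta(hours=1)
    return stamps


stamps = hourly_stamps(datetime(2022, 1, 1, 0), datetime(2023, 3, 3, 23))
print(len(stamps), stamps[0], stamps[-1])  # 10248 22010100 23030323
```

If the array has exactly one row per hour, such a list could be written to the first column in one assignment instead of element by element.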
Let's get rid of those without any temperature information (col #1):
data_np[data_np[:,1] == 0,:]
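One caveat of filling the blanks with 0.0 is that a genuine 0.0 reading cannot be told apart from a missing one. A possible alternative, sketched here on a made-up three-row excerpt (the values are hypothetical, not taken from the Basel file), is to leave out `filling_values` so that missing fields become NaN:

```python
import io

import numpy as np

# Hypothetical excerpt: the second row has no temperature value.
csv_text = "22010100,2.5\n22010101,\n22010102,0.0\n"

# Without filling_values, genfromtxt stores missing float fields as NaN.
demo = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

missing = np.isnan(demo[:, 1])  # True only for the empty field
clean = demo[~missing]          # the real 0.0 reading is kept

print(missing)
print(clean)
```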
It turns out that the temperature information is missing from Feb 24th, 2023 onward, so we drop those rows:
a = data_np.copy()
a
a = np.delete(a,np.arange(0,a.shape[0])[a[:,0]>23022400],0)
a.shape[0]
np_data = a.copy()
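The same rows can also be dropped with a boolean mask instead of `np.delete`. The sketch below uses a small made-up array (column 0 as the stamp, column 1 as the temperature) to show that the two approaches agree:

```python
import numpy as np

# Hypothetical excerpt: column 0 is the YYMMDDHH stamp, column 1 the temperature.
demo = np.array([[23022322, 4.1],
                 [23022323, 3.8],
                 [23022400, 3.5],
                 [23022401, 0.0],
                 [23022402, 0.0]])

late = demo[:, 0] > 23022400  # rows to be dropped

via_delete = np.delete(demo, np.arange(demo.shape[0])[late], 0)
via_mask = demo[~late]        # keep everything that is not "late"

print(np.array_equal(via_delete, via_mask))
print(via_mask.shape)
```

The mask version reads more directly as "keep the rows that satisfy the condition" and avoids building an explicit index list.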
... and here comes the basic plot:
import matplotlib.pyplot as plt
data_2022 = data_np[data_np[:,0]<23010100,:]
data_2022[-10:,:]
plt.plot(data_2022[:,0],data_2022[:,1],"b-s")
plt.title("Graph via Matplotlib")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.show()
plt.plot(data_2022[:,1])
plt.show()
filter_1 = (data_np[:,0]>=23010100) & (data_np[:,0]<23020100)
plt.plot(data_np[filter_1,0],data_np[filter_1,1],"b-s")
plt.title("Graph via Matplotlib")
plt.xlabel("January 2023")
plt.ylabel("Temperature")
plt.show()
While we are at it, here is how we can export a numpy array as CSV:
np.savetxt('del_this_file.csv', data_np, delimiter = ",")
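By default `np.savetxt` writes every value in scientific notation, which makes the stamps hard to read. A minimal sketch of controlling the output with the `fmt` parameter, using a made-up two-row array and a temporary file:

```python
import os
import tempfile

import numpy as np

# Hypothetical two-row array: stamp in column 0, temperature in column 1.
demo = np.array([[22010100, 3.25],
                 [22010101, 2.75]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # One format per column: integer stamp, temperature with two decimals.
    np.savetxt(path, demo, delimiter=",", fmt=["%d", "%.2f"])
    with open(path) as fh:
        text = fh.read()
    restored = np.loadtxt(path, delimiter=",")

print(text)
```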
Now that we have experienced the pains of the "old" method, let's revive the technique we acquired last week: using Pandas to hold the data in a dataframe!
import pandas as pd
pd.set_option('display.min_rows', 10)
pd.set_option('display.max_rows', 10)
data1 = pd.read_csv("data/01_meteoblue_Basel_20230303T060433.csv",
skiprows=9)
data1.columns = ['Timestamp','Temperature','Relative Humidity',
'Cloud Coverage', 'Sunshine Duration','Radiation']
data1 = data1.set_index('Timestamp')
data1
Even though it is completely possible to plot a dataframe using matplotlib, there's actually a much better way to do it: enter the seaborn module!
import seaborn as sns
sns.set_theme() # To make things appear "more cool" 8)
data1.loc[:,"Relative Humidity"].max()
data1.loc[:,"Sunshine Duration"].max()
filter_202208w1 = ((data1.index>="20220801") &
(data1.index<"20220808"))
data_202208w1 = data1.loc[filter_202208w1].copy()
data_202208w1
data_202208w1.shape
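The filter above compares the timestamps as plain strings. An alternative, sketched here on a made-up excerpt (the column names `timestamp` and `temp` are hypothetical), is to parse the stamps into real datetimes so that a week can be sliced by date:

```python
import io

import pandas as pd

# Hypothetical excerpt in the same timestamp layout as the meteoblue file.
csv_text = (
    "timestamp,temp\n"
    "20220801T0000,21.5\n"
    "20220801T0100,20.9\n"
    "20220807T2300,23.4\n"
    "20220808T0000,19.2\n"
)

df = pd.read_csv(io.StringIO(csv_text))
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y%m%dT%H%M")
df = df.set_index("timestamp")

# Date-string slicing on a sorted DatetimeIndex includes the whole end day.
week1 = df.loc["2022-08-01":"2022-08-07"]
print(week1)
```

With a datetime index, plotting libraries can also place the ticks on a proper time axis instead of treating each stamp as a separate label.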
Here, it's as simple as it gets! We are just letting seaborn figure out what we need:
plt1 = sns.relplot(data=data_202208w1)
We can easily designate columns to be used for the x & y parameters for our graph:
plt2 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity")
And here is a beauty: with the hue and size parameters, we can classify the points using other column values, making it easier to investigate the dependencies with respect to these columns:
plt3 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
hue="Temperature",size="Relative Humidity")
And this is our attempt to classify things further by adding the style parameter; alas, it kind of fails:
:tags: [output_scroll]
plt3 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
style="Temperature")
It seems that seaborn does not like having so many classes, one for each distinct temperature value. Luckily, we can work around it by smoothing things out! 8)
import numpy as np
data_202208w1
print("T_min: {:.6f}C | T_max: {:.3f}C"
.format(data_202208w1.Temperature.min(),data_202208w1.Temperature.max()))
data_202208w1[data_202208w1.Temperature == data_202208w1.Temperature.min()]
print(data_202208w1.index[data_202208w1.Temperature == data_202208w1.Temperature.min()][0])
data_202208w1.Temperature/10
np.floor(data_202208w1.Temperature / 10.0) * 10
data_202208w1.Temperature
Here we add a new column, TempFloored, that stores the smoothed out temperature values (each temperature rounded down to the nearest multiple of 10):
tempsf = np.floor(data_202208w1.loc[:,"Temperature"] / 10.0) * 10
data_202208w1.loc[:,"TempFloored"] = tempsf.loc[:]
data_202208w1
plt4 = sns.relplot(data=data_202208w1,x="Temperature",y="Relative Humidity",
style="TempFloored")
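To see what the flooring does, here is a small sketch on made-up temperatures. It also shows `pd.cut` as one possible alternative when named bins are preferred; the bin edges and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures in degrees Celsius.
temps = pd.Series([14.2, 19.9, 20.0, 27.3, 31.8])

# Same arithmetic as above: round down to the nearest multiple of 10.
floored = np.floor(temps / 10.0) * 10
print(floored.tolist())  # [10.0, 10.0, 20.0, 20.0, 30.0]

# Alternative: explicit, labeled bins that are closed on the left.
labels = pd.cut(temps, bins=[10, 20, 30, 40], right=False,
                labels=["10s", "20s", "30s"])
print(labels.tolist())   # ['10s', '10s', '20s', '20s', '30s']
```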
Enough with the scatter plots, let's connect the dots with the kind parameter:
plt4 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature",
kind="line", marker="^")
Here is the same thing without the markers:
plt4 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature",
kind="line")
data_202208w1
Let's classify further: entries whose humidity is above the mean value will be labeled "humid", and those below will be labeled "dry".
We therefore start by calculating the mean:
data_202208w1["Relative Humidity"].mean()
and we define a new column for the job:
data_202208w1['RHClass'] = 0
data_202208w1
How do we single out the ones that have their humidity above the average? By filtering of course! 8)
filter_2 = data_202208w1['Relative Humidity']>52
data_202208w1.loc[filter_2,'RHClass']
filter_2
np.invert(filter_2)
So, we fill the 'RHClass' column with "humid" for the entries above the mean and with "dry" for the others (please observe how we invert the boolean filter with np.invert).
data_202208w1.loc[filter_2,'RHClass'] = 'humid'
data_202208w1.loc[np.invert(filter_2),'RHClass'] = 'dry'
data_202208w1
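The two-step labeling above can also be written in a single step with `np.where`, using the computed mean instead of a typed-in threshold. A minimal sketch on made-up humidity values:

```python
import numpy as np
import pandas as pd

# Hypothetical humidity readings (in percent).
demo = pd.DataFrame({"Relative Humidity": [40.0, 55.0, 60.0, 45.0]})

mean_rh = demo["Relative Humidity"].mean()  # 50.0 for these values
is_humid = demo["Relative Humidity"] > mean_rh

# np.where picks "humid" where the mask is True and "dry" elsewhere.
demo["RHClass"] = np.where(is_humid, "humid", "dry")
print(demo)

# The ~ operator gives the same inverted mask as np.invert.
print((~is_humid).equals(np.invert(is_humid)))
```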
plt5 = sns.relplot(data=data_202208w1,x="Timestamp",y="Temperature",\
kind="line",
style="RHClass", hue="RHClass")
(plt5.map(plt.axhline,y = 22.5, color=".5", dashes=(2, 1), zorder=0)
.set_axis_labels("Day Hour", "Temperature")
.fig.suptitle("Test Graph"))
plt5 = sns.relplot(data=data_202208w1,x="Timestamp", y="Temperature",
kind="line", col="RHClass")
Histograms are also essential, especially when we are dealing with distributions.
plt6 = sns.displot(data=data_202208w1,x="Temperature",
col="RHClass",bins=10)
data_g = np.random.normal(0,10,1000)
:tags: [output_scroll]
data_g
plt_gauss = sns.displot(data_g,bins=20,color="r",kde=True,rug=True,)
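To get a feel for the numbers behind such a histogram, the binning can be reproduced with `np.histogram`. This is only an illustration of equal-width binning under an arbitrarily chosen seed, not a claim about how seaborn computes its bars internally.

```python
import numpy as np

rng = np.random.default_rng(228)  # seed chosen arbitrarily for repeatability
sample = rng.normal(0, 10, 1000)  # mean 0, standard deviation 10

# 20 equal-width bins spanning the range of the sample.
counts, edges = np.histogram(sample, bins=20)
widths = np.diff(edges)

print(counts)
print(counts.sum())  # every sample lands in exactly one bin
```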
+++
x_val = np.linspace(-4,5,20)
y_val = x_val**2-2*x_val-7
df_xy = pd.DataFrame({'xx':x_val,'yy':y_val})
plt_xy = sns.relplot(data=df_xy,x='xx',y='yy')
plt_xy = sns.relplot(data=df_xy,x='xx',y='yy',
kind="line",marker="d",
markersize=9,markerfacecolor="red",
markeredgecolor="green",
color="gray",linestyle="--",linewidth=3)
plt.xlabel('x values')
plt.ylabel('y values')
plt.title(r'$x^2-2x-7$')
plt.show()
N = 10
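# integer columns for the grid coordinates, a float column for the random values
data2 = pd.DataFrame({'x':np.zeros(N*N,int),'y':np.zeros(N*N,int),'val':np.zeros(N*N)})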
k = 0
for i in range(N):
    for j in range(N):
        data2.iloc[k,:] = [i,j,np.random.rand()]
        k += 1
data2
data2['xymod'] = np.mod(data2.x+data2.y,5)
data2
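The same kind of grid can be built without the explicit double loop. The sketch below uses `np.meshgrid`; the name `grid` and the seed are assumptions made for illustration, and the random values will differ from those produced by the loop.

```python
import numpy as np
import pandas as pd

N = 10
rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# indexing="ij" keeps the loop order: x changes slowest, y fastest.
ii, jj = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

grid = pd.DataFrame({"x": ii.ravel(),
                     "y": jj.ravel(),
                     "val": rng.random(N * N)})
grid["xymod"] = (grid.x + grid.y) % 5
print(grid)
```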
plt2 = sns.relplot(data=data2,x='x',y='y',hue='val',
size='val',style='xymod')
#k=plt.legend(bbox_to_anchor=(1.8,1.01),loc='upper right')
#plt.show()
plt3 = sns.relplot(data=data2,x='x',y='y',hue='val',
size='val',style='xymod',col=np.mod(data2.xymod,2))
plt.show()