---
title: "Exponential distributions and the Central Limit Theorem"
author: "Fernando Flores"
date: "December 20th, 2015"
output:
  word_document: default
---
```{r setoptions, echo = FALSE, warning = FALSE}
library(knitr)
opts_chunk$set(echo = TRUE,
               warning = FALSE, # Make it FALSE for distribution
               message = FALSE, # Make it FALSE for distribution
               fig.width = 12,
               fig.height = 6)
```
## Synopsis
This report investigates the exponential distribution and applies the Central Limit Theorem (CLT) to a simulation. Descriptive statistics of the distribution of averages of exponentials, namely the mean and variance, are estimated from the simulation and compared with their theoretical values. The simulation also illustrates how the CLT lets us describe averages of exponentials with a distribution that approaches a normal one.
## The exponential distribution
The only parameter of an exponential distribution is the rate $\lambda$, which must be greater than 0. Both the mean and the standard deviation are determined by $\lambda$:
Mean = Standard deviation = $\frac{1}{\lambda}$
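As a quick empirical check of the formula above, the following sketch draws a large sample with `rexp` and compares its mean and standard deviation with $1/\lambda$. The names `check.lambda`, `check.draws` and `check.summary`, the seed, and the sample size are assumptions made for illustration and are not part of the report's simulation.

```r
# Illustrative check only: names, seed, and sample size are assumptions
set.seed(1)
check.lambda <- 0.2
check.draws <- rexp(100000, rate = check.lambda)

# Both sample statistics should be close to 1 / lambda = 5
check.summary <- c(theoretical = 1 / check.lambda,
                   sample.mean = mean(check.draws),
                   sample.sd = sd(check.draws))
check.summary
```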
### References
1. [Wikipedia](https://en.wikipedia.org/wiki/Exponential_distribution)
2. [Wolfram MathWorld](http://mathworld.wolfram.com/ExponentialDistribution.html)
## Simulations
For simulation purposes, $\lambda$ is set to 0.2. The investigation targets the distribution of averages of 40 exponentials, simulated a thousand times.
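```{r simulationSetup}
exp.n <- 1000                           # number of simulated averages
exp.avg.n <- 40                         # exponentials per average
exp.lambda <- 0.2                       # rate parameter
exp.mean <- 1 / exp.lambda              # theoretical mean
exp.sd <- 1 / exp.lambda                # theoretical standard deviation
exp.var <- (exp.sd ^ 2) / (exp.avg.n)   # theoretical variance of the average
```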
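According to the Central Limit Theorem, the standardized sample mean
$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}=
\frac{\sqrt n (\bar X_n - \mu)}{\sigma}$$
has a distribution that approaches a standard normal as $n$ increases.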
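The standardization in the formula can be tried out directly. The sketch below is only an illustration: the `clt.` names, the seed, and the 5000 replications are assumptions, not values taken from the report. It standardizes simulated means of 40 exponentials and prints the mean and standard deviation of the result, which would be near 0 and 1 for an approximately standard normal variable.

```r
# Illustrative sketch: names, seed, and number of replications are assumptions
set.seed(42)
clt.lambda <- 0.2
clt.n <- 40
clt.mu <- 1 / clt.lambda
clt.sigma <- 1 / clt.lambda

# Sample means of clt.n exponentials, then standardized as in the formula
clt.means <- replicate(5000, mean(rexp(clt.n, rate = clt.lambda)))
clt.z <- sqrt(clt.n) * (clt.means - clt.mu) / clt.sigma

# For an approximately standard normal variable these are near 0 and 1
c(mean = mean(clt.z), sd = sd(clt.z))
```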
For the sake of reproducibility, the seed is set to `9713` before the simulation starts. Then a matrix is created in which each row holds 40 generated exponentials. Finally, a data frame stores the mean of each row, that is, the average of each group of 40 exponentials.
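```{r simulationRun}
set.seed(9713)
# Preallocate one row per simulation instead of growing the matrix with rbind
exp.simulation.matrix <- matrix(NA_real_, nrow = exp.n, ncol = exp.avg.n)
for (i in seq_len(exp.n)) {
    exp.simulation.matrix[i, ] <- rexp(exp.avg.n, exp.lambda)
}
# Average of each group of 40 exponentials
exp.simulation <- data.frame(x = rowMeans(exp.simulation.matrix))
```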
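The same kind of simulation can also be sketched without a loop, by generating all draws in one call and reshaping them into a matrix. This is an alternative formulation, not the code used in the report. The `alt.` names are hypothetical, and because `matrix` fills by column here, the individual row means are not expected to match the loop version row by row.

```r
# Alternative sketch: all names prefixed with alt. are hypothetical
set.seed(9713)
alt.n <- 1000       # number of averages
alt.avg.n <- 40     # exponentials per average
alt.lambda <- 0.2   # rate parameter

# One call to rexp, reshaped so that each row holds 40 exponentials
alt.matrix <- matrix(rexp(alt.n * alt.avg.n, rate = alt.lambda),
                     nrow = alt.n, ncol = alt.avg.n)
alt.means <- rowMeans(alt.matrix)

dim(alt.matrix)
length(alt.means)
```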
## Features analysis
The plot below shows the distribution of the simulated averages, which is approximately normal.
```{r simulationPlotSetup}
exp.simulation.mean <- mean(exp.simulation$x)
exp.simulation.sd <- sd(exp.simulation$x)
exp.simulation.var <- var(exp.simulation$x)
```
```{r simulationPlot, echo = FALSE}
library(ggplot2)
plotAverages <- ggplot(exp.simulation, aes(x = x)) +
    geom_histogram(binwidth = 0.1,
                   colour = "black",
                   fill = "wheat",
                   aes(y = ..density..))
plotAverages <- plotAverages +
    stat_function(fun = dnorm,
                  args = list(mean = exp.simulation.mean,
                              sd = exp.simulation.sd),
                  size = 2) +
    geom_vline(xintercept = exp.mean,
               colour = "red",
               size = 1) +
    geom_vline(xintercept = exp.simulation.mean,
               colour = "blue",
               size = 1,
               linetype = "dashed")
plotAverages
```
*Figure 1. Distribution of 1000 simulated averages of 40 exponentials. The red solid line shows the theoretical mean and the blue dashed line shows the simulation mean.*
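One rough numerical complement to the visual comparison is to count how many simulated averages fall within 1.96 standard deviations of their mean, which would be about 95% for a normal distribution. The sketch below re-simulates the averages under the report's parameters; the `norm.` names are hypothetical and the check is a heuristic, not a formal normality test.

```r
# Illustrative check: names are hypothetical and the data are re-simulated
set.seed(9713)
norm.avgs <- rowMeans(matrix(rexp(1000 * 40, rate = 0.2), nrow = 1000))

# Standardize with the sample mean and standard deviation
norm.z <- (norm.avgs - mean(norm.avgs)) / sd(norm.avgs)

# Share of averages within 1.96 standard deviations of the mean
norm.share <- mean(abs(norm.z) <= 1.96)
norm.share
```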
### Sample mean vs theoretical mean
As seen in *Figure 1*, the theoretical mean (red solid line) is very close to the simulation mean (blue dashed line), as expected.
```{r meansTableCompute}
meansTable <- data.frame(Type = c("Theoretical mean", "Simulation mean"),
                         Value = c(exp.mean, exp.simulation.mean))
meansTable
```
*Figure 2. Theoretical and simulation means for the distribution of 1000 simulated averages of 40 exponentials.*
### Sample variance vs theoretical variance
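The variance of the sample mean is $\frac{\sigma ^ 2}{n}$, where $\sigma = \frac{1}{\lambda}$ is the standard deviation of a single exponential and $n = 40$ is the number of exponentials in each average.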
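The formula follows from the independence of the simulated exponentials. As a short derivation, writing $X_1, \dots, X_n$ for the individual exponentials (notation assumed here for illustration) and substituting the report's values $\lambda = 0.2$ and $n = 40$:

$$\operatorname{Var}(\bar X_n) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n} = \frac{(1/\lambda)^2}{n} = \frac{25}{40} = 0.625$$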
In order to compare the values:
```{r varianceTableCompute}
varianceTable <- data.frame(Type = c("Theoretical variance",
                                     "Simulation variance"),
                            Value = c(exp.var, exp.simulation.var))
varianceTable
```
*Figure 3. Theoretical and simulation variances for the distribution of 1000 simulated averages of 40 exponentials.*
Based on the values above, the simulation and theoretical variances are close for this simulation exercise, though not as close as the means compared earlier. Because the simulation variance is itself an estimate computed from a finite number of averages, it is expected to approach the theoretical value as the number of simulations increases.
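The effect of the number of simulations can be explored with a small experiment. The sketch below is an assumption-laden illustration: the `conv.` names and the chosen simulation sizes are not from the report. It recomputes the variance of the averages for increasing numbers of simulations and reports the absolute difference from the theoretical value. Since each estimate is random, the error need not shrink at every step, only in the long run.

```r
# Illustrative sketch: names and simulation sizes are assumptions
set.seed(9713)
conv.lambda <- 0.2
conv.n <- 40
conv.theoretical <- (1 / conv.lambda) ^ 2 / conv.n

# Variance of the simulated averages for increasing numbers of simulations
conv.sizes <- c(100, 1000, 10000, 100000)
conv.var <- sapply(conv.sizes, function(k) {
    var(rowMeans(matrix(rexp(k * conv.n, rate = conv.lambda), nrow = k)))
})

conv.table <- data.frame(Simulations = conv.sizes,
                         Variance = conv.var,
                         AbsError = abs(conv.var - conv.theoretical))
conv.table
```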
## Comparing the distribution
After analyzing the features of a simulation of 1000 means of 40 exponentials, we can compare it with the distribution of 1000 individual exponentials.
```{r simulationExponentialRun}
set.seed(9713)
exp.simulation.2 <- data.frame(x = rexp(exp.n, exp.lambda))
exp.simulation.2.mean <- mean(exp.simulation.2$x)
```
```{r simulationExponentialPlot, echo = FALSE}
plotExponential <- ggplot(exp.simulation.2, aes(x = x)) +
    geom_histogram(colour = "black",
                   fill = "wheat",
                   aes(y = ..density..))
plotExponential <- plotExponential +
    stat_function(fun = dexp,
                  args = list(rate = exp.lambda),
                  size = 2) +
    geom_vline(xintercept = exp.mean,
               colour = "red",
               size = 1) +
    geom_vline(xintercept = exp.simulation.2.mean,
               colour = "blue",
               size = 1,
               linetype = "dashed")
plotExponential
```
*Figure 4. Distribution of 1000 exponentials. Red solid line shows the theoretical mean and blue dashed line shows the simulation mean.*
For this new simulation, the theoretical and sample means are again close. The shape of the distribution, however, follows the exponential density rather than a normal one, in contrast with the first simulation, where the distribution of averages approximates a normal one.
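One rough way to put a number on the difference in shape between the two simulations is the sample skewness, which is 0 for a symmetric distribution such as the normal. In theory the skewness of an exponential is 2, while for the average of 40 independent exponentials it is $2 / \sqrt{40} \approx 0.32$. The sketch below uses a hypothetical helper `skewness` and `skew.` names, with larger sample sizes than the report to stabilize the estimates.

```r
# Illustrative sketch: names and sample sizes are assumptions
set.seed(9713)
skew.lambda <- 0.2

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x)) ^ 3) / sd(x) ^ 3

# Individual exponentials versus averages of 40 exponentials
skew.raw <- rexp(10000, rate = skew.lambda)
skew.avgs <- rowMeans(matrix(rexp(10000 * 40, rate = skew.lambda),
                             nrow = 10000))

c(raw = skewness(skew.raw), averages = skewness(skew.avgs))
```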
This application of the CLT is useful for gathering information about a distribution when the sample is large enough. In the current case, by expressing the data as averages of exponentials, we can inspect the sample using the same tools and properties that are used with a normal distribution.
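As one example of such a tool, a normal-based 95% confidence interval for the mean can be built from a single sample of 40 exponentials. The sketch below is an illustration under assumed names (`ci.` prefix), seed, and number of repetitions. It estimates how often that interval contains the true mean $1/\lambda$. The result should be near 95%, though not necessarily equal to it, because the average of 40 exponentials is only approximately normal.

```r
# Illustrative sketch: names, seed, and number of repetitions are assumptions
set.seed(9713)
ci.lambda <- 0.2
ci.n <- 40
ci.mu <- 1 / ci.lambda

# Does a normal-based 95% interval from one sample of 40 cover the true mean?
ci.covers <- replicate(2000, {
    x <- rexp(ci.n, rate = ci.lambda)
    half.width <- qnorm(0.975) * sd(x) / sqrt(ci.n)
    abs(mean(x) - ci.mu) <= half.width
})

# Share of intervals that contain the true mean
ci.coverage <- mean(ci.covers)
ci.coverage
```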
## Appendix
This appendix only includes the code used to generate the plots, since it takes up a lot of space in the main report and doesn't add much to the discussion. The rest of the R code is kept in the main report so the reader can follow the analysis more clearly.
### A1. Code for simulation plot of averages of 40 exponentials (Section Features analysis)
```{r simulationPlot_code, ref.label="simulationPlot", echo = TRUE, eval = FALSE}
```
### A2. Code for simulation plot of 1000 exponentials (Section Comparing the distribution)
```{r simulationExponentialPlot_code, ref.label="simulationExponentialPlot", echo = TRUE, eval = FALSE}
```