-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path03-compbayes.Rmd
executable file
·999 lines (720 loc) · 78.7 KB
/
03-compbayes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
# (PART) Regression models with brms {-}
# Computational Bayesian data analysis {#ch-compbda}
In the previous chapter, we learned how to analytically derive the posterior distribution of the parameters in our model.
In practice, however, this is possible for only a very limited number of cases. Although the numerator in Bayes' rule, the unnormalized posterior, is easy to calculate (by multiplying the likelihood and the probability density/mass function), the denominator, the marginal likelihood, requires us to carry out an integration; see Equation \@ref(eq:bayesbrms).
\begin{equation}
\begin{aligned}
p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{p(\boldsymbol{y})}\\
p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{\int_{\boldsymbol{\Theta}} p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) d\boldsymbol{\Theta} }
\end{aligned}
(\#eq:bayesbrms)
\end{equation}
Unless we are dealing with conjugate distributions, the solution will be extremely hard to derive or there will be no analytical solution. This was the major bottleneck of Bayesian analysis in the past, and required Bayesian practitioners to program an approximation method by themselves before they could even begin the Bayesian analysis. Fortunately, many of the probabilistic programming languages freely available today (see the next section for a listing) allow us to define our models without having to acquire expert knowledge about the relevant numerical techniques.
## Deriving the \index{Posterior} posterior through \index{Sampling} sampling {#sec-sampling}
```{r, echo = FALSE, message=FALSE, results= "hide"}
lst_cloze_data <- list(k = 80, N = 100)
# Fit the model with the default values of number of chains and iterations
# chains = 4, iter = 2000
bn <- system.file("stan_models", "binomial_cloze.stan", package = "bcogsci")
fit_cloze <- stan(
file = bn, data = lst_cloze_data,
warmup = 1000,
iter = 21000
)
```
```{r, echo = FALSE}
samples_theta <- rstan::extract(fit_cloze)$theta
some_samples <- toString(round(head(samples_theta, 20), 3))
df_fit_cloze <- data.frame(theta = samples_theta)
diff_means <- format(mean(samples_theta) - 84 / (84 + 24), digits = 1)
diff_var <- format(sd(samples_theta)^2 - 84 * 24 / ((84 + 24)^2 * (84 + 24 + 1)), digits = 1)
```
Let's say that we want to derive the posterior of the model from section \@ref(sec-analytical), that is, the posterior distribution of the cloze probability of "*umbrella*," $\theta$, given the following data: a word (e.g., "*umbrella*") was answered 80 out of 100 times, and assuming a binomial distribution as the likelihood function, and $\mathit{Beta}(a=4,b=4)$ as a prior distribution for the cloze probability. If we can generate samples^[In the `brms` package [@R-brms_a], samples from the posterior are referred to as draws.] from the posterior distribution of $\theta$, instead of an analytically derived posterior distribution, given enough samples we will have a good approximation of the posterior distribution. Obtaining samples from the posterior will be the only viable option in the models that we will discuss in this book. By "obtaining samples," we are talking about a situation analogous to when we use ```rbinom()``` or ```rnorm()``` to obtain samples from a particular distribution. For more details about sampling algorithms, see the section Further Reading (section \@ref(sec-ch3furtherreading)).
Thanks to \index{Probabilistic programming language} probabilistic programming languages, it will be relatively straightforward to get these samples, and we will discuss how we will do it in more detail in the next section. For now let's assume that we used some probabilistic programming language to obtain 20000 samples from the posterior distribution of the cloze probability, $\theta$: `r some_samples`, \dots. Figure \@ref(fig:betapost) shows that the approximation of the posterior looks quite similar to the analytically derived posterior. The difference between the analytically computed and approximated mean and variance are $`r diff_means`$ and $`r diff_var`$ respectively.
(ref:betapost) Histogram of the samples of $\theta$ from the posterior distribution generated via sampling. The black line shows the density plot of the analytically derived posterior.
```{r betapost, fig.cap="(ref:betapost)",echo = FALSE, fig.height = 2.5}
ggplot(df_fit_cloze, aes(theta)) +
geom_histogram(binwidth = .01, colour = "gray", alpha = .5, aes(y = after_stat(density))) +
stat_function(fun = dbeta, color = "black", args = list(
shape1 = 84,
shape2 = 24
))
```
## \index{Bayesian Regression Model} Bayesian Regression Models using Stan: \index{brms} brms
The surge in popularity of Bayesian statistics is closely tied to the increase in computing power and the appearance of \index{Probabilistic programming language} probabilistic programming languages, such as WinBUGS [@lunn2000winbugs], JAGS [@plummer2016jags], PyMC3 [@Salvatier2016], Turing [@turing], and Stan [@carpenter2017stan]; for a historical review, see @plummer2022simulation.
These probabilistic programming languages allow the user to define models without having to deal (for the most part) with the complexities of the sampling process. However, they require learning a new language since the user has to fully specify the statistical model using a particular syntax.^[The Python package PyMC3 and the Julia library Turing are recent exceptions since they are fully integrated into their respective languages.] Furthermore, some knowledge of the sampling process is needed to correctly parameterize the models and to avoid convergence issues (these topics will be covered in detail in chapter \@ref(ch-complexstan)).
There are some alternatives that allow Bayesian inference in R without having to fully specify the model "by hand." The `R` packages `rstanarm` [@rstanarm] and `brms` [@R-brms] provide Bayesian equivalents of many popular `R` model-fitting functions, such as (g)lmer [@R-lme4] and many others; both `rstanarm` and `brms` use Stan as the back-end for estimation and sampling. The package R-INLA [@lindgren2015bayesian] allows for fitting a limited selection of likelihood functions and priors in comparison to `rstanarm` and `brms` (R-INLA can fit models that can be expressed as latent Gaussian models). This package uses the integrated nested Laplace approximation (INLA) method for carrying out Bayesian inference, rather than a sampling algorithm as in the other probabilistic languages listed above. Another alternative is JASP [@JASP2019], which provides a graphical user interface for both frequentist and Bayesian modeling, and is intended to be an open-source alternative to SPSS.
We will focus on `brms` in this part of the book. This is because it can be useful for a smooth transition from frequentist models to their Bayesian equivalents. The package `brms` is not only powerful enough to satisfy the statistical needs of many cognitive scientists, it has the added benefit that the Stan code can be inspected (with the `brms` functions `make_stancode()` and `make_standata()`), allowing the user to customize their models or learn from the code produced internally by `brms` to eventually transition to writing the models entirely in Stan. In the introduction to Stan in chapter \@ref(ch-introstan), we implement in Stan the models presented in the current and the following chapters.
### A simple linear model: A single subject pressing a button repeatedly (a finger tapping task) {#sec-simplenormal}
To illustrate the basic steps for fitting a model using `brms`, consider the following example of a finger tapping task [for a review, see @hubel2013computerized]. Suppose that a subject first sees a blank screen. Then, after a certain amount of time (say $200$ ms), the subject sees a cross in the middle of a screen, and as soon as they see the cross, they tap on the space bar as fast as they can until the experiment is over ($361$ trials). The dependent measure here is the time it takes in milliseconds from one press of the space bar to the next one. The data in each trial are therefore finger tapping times in milliseconds. Suppose that the research question is: how long does it take for this particular subject to press a key?
Let's model the data with the following assumptions:
1. There is a true (unknown) underlying time, $\mu$ ms, that the subject needs to press the space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable given that finger tapping as also response times are generally skewed; we will fix this assumption later).[^rt]
[^rt]: We refer to the time it takes for a subject to respond or react to a stimulus as response time [response and reaction times are often used interchangeably, cf. @luce1986response]. In this case, however, there are no stimuli other than the cross on the screen, and the subject only taps the space bar as soon as they see the cross.
This means that the likelihood for each observation $n$ will be:
\begin{equation}
t_n \sim \mathit{Normal}(\mu, \sigma)
(\#eq:rtlik)
\end{equation}
where $n =1, \ldots, N$, and $t$ is the dependent variable (finger tapping times in milliseconds). The variable $N$ indexes the total number of data points. The symbol $\mu$ indicates the *location* of the normal distribution function; the location parameter shifts the distribution left or right on the horizontal axis. For the normal distribution, the location is also the mean of the distribution. The symbol $\sigma$ indicates the *scale* of the distribution; as the scale decreases, the distribution gets narrower. This compressing approaches a spike (all the probability mass get concentrated near one point) as the scale parameter approaches zero. For the normal distribution, the scale is also its standard deviation.
The reader may have encountered the model shown in Equation \@ref(eq:rtlik) in the form shown in Equation \@ref(eq:rtlikLM):
\begin{equation}
t_n = \mu + \varepsilon_n \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) (\#eq:rtlikLM)
\end{equation}
When the model is written in this way, it should be understood as meaning that each data point $t_n$ has some variability around a mean value $\mu$, and that variability has standard deviation $\sigma$. The term "iid" (independent and identically distributed) implies that the residuals are independently generated (they are not correlated with any of the other residual values). The statement of the model in Equation \@ref(eq:rtlikLM) is exactly the same as saying that the observed data points $t_n$ are iid and are coming from the $Normal(\mu,\sigma)$ distribution.
For a frequentist model that will give us the \index{Maximum likelihood estimate} maximum likelihood estimate (the sample mean) of the time it takes to press the space bar, this would be enough information to write the formula in `R`, `t ~ 1`, and plug it into the function `lm()` together with the data: `lm(t ~ 1, data)`. The meaning of the `1` here is that `lm()` will estimate the intercept in the model, which is the estimate of $\mu$ in our example. If the reader is completely unfamiliar with linear models, the references in section \@ref(sec-ch4furtherreading) will be helpful.
For a Bayesian linear model, we will also need to define priors for the two parameters in our model. Let's say that we know for sure that the time it takes to press a key will be positive and lower than a minute (or $60000$ ms), but we don't want to make a commitment regarding which values are more likely. We encode what we know about the noise in the task in $\sigma$: we know that this parameter must be positive and we'll assume that any value below $2000$ ms is equally likely. These priors are in general strongly discouraged: A flat (or very wide) prior will almost never be the best approximation of what we know. Prior specification is discussed in detail in the online chapter \@ref(ch-priors).
In this case, even if we know very little about the task, we know that pressing the spacebar will take at most a couple of seconds. We'll start with flat priors in this section for pedagogical purposes; the next sections will already show more realistic uses of priors.
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Uniform}(0, 60000) \\
\sigma &\sim \mathit{Uniform}(0, 2000)
\end{aligned}
(\#eq:rtpriors)
\end{equation}
First, load the data frame `df_spacebar` from the `bcogsci` package:
```{r, reading_noreading, message = FALSE}
data("df_spacebar")
df_spacebar
```
It is always a good idea to plot the data before doing anything else; see Figure \@ref(fig:m1visualize). As we suspected, the data look a bit skewed, but we ignore this for the moment.
```{r m1visualize, fig.cap="Visualizing the button-press data.", fold = TRUE, fig.height = 2.5}
ggplot(df_spacebar, aes(t)) +
geom_density() +
xlab("Finger tapping times") +
ggtitle("Button-press data")
```
#### Specifying the model in `brms`
Fit the model defined by equations \@ref(eq:rtlik) and \@ref(eq:rtpriors) with `brms` in the following way.
```{r, message = FALSE, cache = TRUE}
fit_press <-
brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior =
c(prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)),
chains = 4,
iter = 2000,
warmup = 1000)
```
The `brms` code has some differences from a model fit with `lm()`. At this beginning stage, we'll focus on the following options:
1. The term `family = gaussian()` makes it explicit that the underlying likelihood function is a normal distribution (Gaussian \index{Gaussian} and normal are synonyms). This detail is implicit in the `R` function `lm()`. Other linking functions are possible, exactly as in the `glm()` function in `R`. The default for `brms` that corresponds to the `lm()` function is `family = gaussian()`.
2. The term `prior` takes as argument a list of prior specifications. Although this \index{Prior specification} specification of priors is optional, the researcher should always explicitly specify each prior. Otherwise, `brms` will define priors by default, which may or may not be appropriate for the research area. In cases where the distribution has a restricted coverage, that is, not every value is valid (e.g., smaller than $0$ or larger than $60000$ are not valid for the intercept), we need to set lower and upper boundaries with `lb` and `ub`.^[There is an additional problem here. Although the parameter for the intercept is assigned a uniform distribution bounded between $0$ and $60000$ ms, the sampler might start sampling from an initial value outside this range producing warnings. The sampler can start from an initial value that is outside the $0$-$60000$ range because the initial value is chosen randomly (unless the user specifies an initial value explicitly). \label{fn:uniform}]
3. The term \index{\texttt{chains}} `chains` refers to the number of independent runs for sampling (by default four).
4. The term \index{\texttt{iter}} `iter` refers to the \index{Number of iterations} number of iterations that the sampler makes to sample from the posterior distribution of each parameter (by default `2000`).
5. The term \index{\texttt{warmup}} `warmup` refers to the number of iterations from the start of sampling that are eventually discarded (by default half of `iter`).
The last three options, `chains`, `iter`, `warmup` determine the behavior of the sampling algorithm: the No-U-Turn Sampler [NUTS; @hoffmanNoUTurnSamplerAdaptively2014] extension of Hamiltonian Monte Carlo [@duaneHybridMonteCarlo1987; @nealMCMCUsingHamiltonian2011]. We will discuss sampling in a bit more depth in chapter \@ref(ch-introstan), but the basic process is explained next.
#### Sampling and convergence in a nutshell {#sec-convergencenut}
The following is a gross oversimplification of the sampling process: The code specification starts (by default) four chains independently from each other. Each chain "searches" for samples of the posterior distribution in a multidimensional space, where each parameter corresponds to a dimension. The shape of this space is determined by the priors and the likelihood. The chains start at random locations, and in each iteration they take one sample each for all the parameters. When \index{Sampling} sampling begins, the samples may or may not belong to the posterior distributions of the parameters. Eventually, the chains end up in the vicinity of the posterior distribution, and from that point onward the samples will belong to the posterior.
Thus, when sampling begins, the samples from the different chains can be far from each other, but at some point they will "converge" and start delivering samples from the posterior distributions. Although there are no guarantees that the number of iterations we run the chains for will be sufficient for obtaining samples from the posteriors, the default values of `brms` (and Stan) are in many cases sufficient to achieve \index{Convergence} convergence. When the default number of iterations does not suffice, `brms` (actually, Stan) will print out warnings, with suggestions for fixing the convergence problems. If all the chains converge to the same distribution, by removing the "warmup" samples, we make sure that we do not get samples from the initial path to the posterior distributions. The default in `brms` is that half of the total number of iterations in each chain (which default to 2000) will count as "warmup". So, if one runs a model with four chains and the default number of iterations, we will obtain a total of 4000 samples from the four chains, after discarding the warmup iterations.
Figure \@ref(fig:fwarmup)(a) shows the path of the chains from the warmup phase onwards. Such plots are called \index{Traceplots} traceplots. The warmup is shown only for illustration purposes; generally, one should only inspect the chains after the point where convergence has (presumably) been achieved (i.e., after the dashed line). After convergence has occurred, a visual diagnostic check is that chains should look like a "fat hairy caterpillar." Compare the traceplots of our model in Figure \@ref(fig:fwarmup)(a) with the traceplots of a model that did not converge, shown in Figure \@ref(fig:fwarmup)(b).
Traceplots are not always diagnostic as regards convergence. The traceplots might look fine, but the model may not have converged. Fortunately, Stan automatically runs several diagnostics with the information from the chains, and if there are no warnings after fitting the model and the traceplots look fine, we can be reasonably sure that the model converged, and assume that our samples are from the true posterior distribution. However, it is necessary to run more than one chain (preferably four), with a couple of thousands of iterations (at least) in order for the diagnostics to work.
```{r warmup, fig.cap = "(ref:warmup)", echo = FALSE}
plot_all <- function(fit, xlab = 500, ylab = 100) {
chains <- as.array(fit, inc_warmup = TRUE) %>%
array_branch(margin = 3) %>%
imap_dfr(~ .x %>%
as_tibble() %>%
mutate(iter = 1:n()) %>%
tidyr::pivot_longer(1:4, names_to = "chain", values_to = "value") %>%
mutate(parameter = .y)) %>%
mutate(parameter = ifelse(parameter == "b_Intercept", "mu", parameter)) %>%
filter(!parameter %in% c("lp__","lprior", "Intercept"))
## as.mcmc(fit, inc_warmup = TRUE) %>%
## map_dfr(~ as.data.frame(.x) %>%
## as_tibble() %>%
## mutate(iter = 1:n()), .id = "chain") %>%
## rename(mu = b_Intercept) %>%
## dplyr::select(-`lp__`) %>%
## tidyr::gather("parameter", "value", -iter, -chain)
ggplot(chains, aes(x = iter, y = value, color = chain)) +
geom_line() +
facet_wrap(~parameter, ncol = 1) +
geom_vline(xintercept = 1000, linetype = "dashed") +
xlab("Iteration number") +
ylab("Sample value") +
annotate("text", x = xlab, y = ylab, color = "black", label = "Warm-up", size = 5.2) +
theme(plot.title = element_text(hjust = 0))
}
```
```{r warmup2, echo = FALSE, message = FALSE, warning = FALSE }
data_mm <- tibble(t = rnorm(500, c(5, 3000), c(5, 5)))
fit_press_bad <- brm(t ~ 1,
data = data_mm,
family = gaussian(),
prior = c(prior(uniform(0, 60000), class = Intercept),
prior(uniform(0, 2000), class = sigma)),
chains = 4,
iter = 2000,
warmup = 1000)
```
(ref:fwarmup) (a) Traceplots of our `brms` model for the button-pressing data. All the chains start from initial values above `200` and are outside of the plot. (b) Traceplots of a model that **did not** converge. We can diagnose the non-convergence by the observing that the chains do not overlap---each chain seems to be sampling from a different distribution.
```{r fwarmup, fig.cap = "(ref:fwarmup)", echo = FALSE, message = FALSE, fig.height = 6}
cowplot::plot_grid(plot_all(fit_press, ylab = 150) +
coord_cartesian(ylim = c(0, 200)) +
ggtitle("(a) Model that converged."),
plot_all(fit_press_bad, ylab = 1000) +
ggtitle("(b) Model that did not converge."), nrow = 2, ncol=1)
```
#### Output of `brms`
Once the model has been fit (and assuming that we got no warning messages about convergence problems), we can print out the \index{Sample} samples of the \index{Posterior distribution} posterior distributions of each of the parameters using `as_draws_df()` (which stores metadata about the chains) or with `as.data.frame()`:
```{r}
as_draws_df(fit_press) %>%
head(3)
```
The term `b_Intercept` in the `brms` output corresponds to our $\mu$. We can ignore the last three columns: `Intercept` is an auxiliary intercept assuming centered predictors, which coincides here with `b_Intercept` and is discussed in online section \@ref(app-intercept), `lprior` is the log-density of the (joint) prior distribution and it is there for compatibility with the package `priorsense` (https://github.com/n-kall/priorsense), and `lp` is not really part of the posterior, it's the log-density of the unnormalized posterior for each iteration (`lp` is discussed in online section \@ref(app-target), which sould be read in the context of chapter \@ref(ch-introstan)).
Plot the density and traceplots of each parameter after the warmup with `plot(fit_press)` (Figure \@ref(fig:densitytrace)).
(ref:densitytrace) Density and \index{Traceplots} traceplots of our `brms` model for the button-pressing data.
```{r densitytrace, fig.cap = "(ref:densitytrace)", echo = FALSE, fig.height = 2.5}
plot(fit_press)
```
Printing the object with the `brms` fit provides a nice, if somewhat verbose, summary:
```{r, results = "hold"}
fit_press
# posterior_summary(fit_press) is also useful
```
The \index{\texttt{Estimate}} `Estimate` is just the mean of the posterior samples, \index{\texttt{Est.Error}} `Est.Error` is the standard deviation of the posterior and the CIs mark the lower and upper bounds of the 95% \index{Credible interval} credible intervals (to distinguish credible intervals from frequentist confidence intervals, the former will be abbreviated as CrIs):
```{r}
as_draws_df(fit_press)$b_Intercept %>% mean()
as_draws_df(fit_press)$b_Intercept %>% sd()
as_draws_df(fit_press)$b_Intercept %>%
quantile(c(0.025, .975))
```
Furthermore, the summary provides the `Rhat`, `Bulk_ESS`, and `Tail_ESS` of each parameter. \index{R-hat} R-hat compares the between- and within-chain estimates of each parameter. R-hat is larger than 1 when chains have not mixed well, one can only rely on the model if the R-hats for all the parameters are less than 1.05. (R warnings will appear otherwise). \index{Bulk ESS} Bulk ESS (bulk \index{Effective sample size} effective sample size) is a measure of sampling efficiency in the bulk of the posterior distribution, that is the effective sample size for the mean and median estimates, whereas \index{Tail ESS} tail ESS (tail effective sample size) indicates the sampling efficiency at the tails of the distribution, that is the minimum of effective sample sizes for 5% and 95% quantiles. The effective sample size is generally smaller than the number of post-warmup samples, because the samples from the chains are not independent (they are correlated to some extent), and carry less information about the posterior distribution in comparison to independent samples.
In some cases, however, the effective sample size is actually larger than the number of post-warmup samples. This might happen for parameters with a normally distributed posterior (in the unconstrained space, see online section \@ref(app-target)) and low dependence on the other parameters [@vehtari2019ranknormalization]. A very low effective sample size indicates sampling problems (and is accompanied by R warnings) and in general appears together with chains that are not properly mixed. As a rule of thumb, @vehtari2019ranknormalization suggest that a minimum effective sample size of $400$ is required for statistical summaries.
We see that we can fit our model without problems, and we get some posterior distributions for our parameters. However, we should ask ourselves the following questions before we can interpret the posterior distributions of the model:
1. What information are the priors encoding? Do the priors make sense?
2. Does the likelihood assumed in the model make sense for the data?
We'll try to answer these questions by looking at the *prior and posterior predictive distributions*, and by doing sensitivity analyses. This is explained in the sections that follow.
## Prior predictive distribution {#sec-priorpred}
We had defined the following priors for our linear model:
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Uniform}(0, 60000) \\
\sigma &\sim \mathit{Uniform}(0, 2000)
\end{aligned}
(\#eq:rtpriorsrepeated)
\end{equation}
These \index{Prior} priors encode assumptions about the kind of data we would expect to see in a future study.
To understand these assumptions, we are going to generate data from the model; such distribution of data, which is generated entirely by the prior distributions, is called the prior predictive distribution. Generating prior predictive distributions repeatedly helps us to check whether the priors make sense. What we want to know here is, do the priors generate realistic-looking data?
Formally, we want to know the density $p(\cdot)$ of data points $y_{pred_1},\dots,y_{pred_N}$ from a data set $\boldsymbol{y_{pred}}$ of length $N$, given a vector of priors $\boldsymbol{\Theta}$ and our likelihood $p(\cdot|\boldsymbol{\Theta})$; (in our example, $\boldsymbol{\Theta}=\langle\mu,\sigma \rangle$). The prior predictive density is written as follows:
\begin{equation}
\begin{aligned}
p(\boldsymbol{y_{pred}}) &= p(y_{pred_1},\dots,y_{pred_n})\\
&= \int_{\boldsymbol{\Theta}} p(y_{pred_1}|\boldsymbol{\Theta})\cdot p(y_{pred_2}|\boldsymbol{\Theta})\cdots p(y_{pred_N}|\boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \, \mathrm{d}\boldsymbol{\Theta}
\end{aligned}
\end{equation}
In essence, the vector of parameters is integrated out (see section \@ref(sec-marginal)). This yields the probability distribution of possible data sets given the priors and the likelihood, *before any observations are taken into account*.
The integration can be carried out computationally by generating samples from the prior distribution.
Here is one way to generate prior predictive distributions:
Repeat the following many times:
1. Take one sample from each of the priors.
2. Plug those samples into the probability density/mass function used as the likelihood in the model to generate a data set $y_{pred_1},\ldots,y_{pred_n}$.
Each sample is an imaginary or potential data set.
Create a function that does this:
```{r}
normal_predictive_distribution <-
function(mu_samples, sigma_samples, N_obs) {
# empty data frame with headers:
df_pred <- tibble(trialn = numeric(0),
t_pred = numeric(0),
iter = numeric(0))
# i iterates from 1 to the length of mu_samples,
# which we assume is identical to
# the length of the sigma_samples:
for (i in seq_along(mu_samples)) {
mu <- mu_samples[i]
sigma <- sigma_samples[i]
df_pred <- bind_rows(df_pred,
tibble(trialn = seq_len(N_obs),
# seq_len generates 1, 2,..., N_obs
t_pred = rnorm(N_obs, mu, sigma),
iter = i))
}
df_pred
}
```
The following code produces $1000$ samples of the prior predictive distribution of the model that we defined in section \@ref(sec-simplenormal). This means that it will produce $361000$ predicted values ($361$ predicted observations for each of the $1000$ simulations). Although this approach works, it's quite slow (a couple of seconds). See the online section \@ref(app-efficientpriorpd) for a more efficient version of this function. Section \@ref(sec-lognormal) will show that it's possible to use `brms` to sample from the priors, ignoring the `t` in the data by setting `sample_prior = "only"`. However, since `brms` still depends on Stan's sampler, which uses Hamiltonian Monte Carlo, the prior sampling process can also fail to converge, especially when one uses very uninformative priors, like the ones used in this example. In contrast, our function above, which uses `rnorm()`, cannot have convergence issues and will always produce multiple sets of prior predictive data $y_{pred_1},\ldots,y_{pred_n}$.
The code below uses the `tic()` and `toc()` functions from the `tictoc` package to print out the time it takes to run the code.
```{r, echo = FALSE}
set.seed(123)
```
```{r}
N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 60000)
sigma_samples <- runif(N_samples, 0, 2000)
tic()
prior_pred <-
normal_predictive_distribution(mu_samples = mu_samples,
sigma_samples = sigma_samples,
N_obs = N_obs)
toc()
prior_pred
```
```{r, echo = FALSE}
set.seed(123)
#runs the efficient ppd from the appendix
<<efficient-ppd>>
```
Figure \@ref(fig:priorpred-simple) shows the first 18 samples of the prior predictive distribution (i.e., 18 independently generated prior predicted data sets) with the code below.
(ref:priorpred-simple) Eighteen samples from the prior predictive distribution of the model defined in section \@ref(sec-simplenormal).
```{r priorpred-simple, fig.cap = "(ref:priorpred-simple)", message = FALSE, fig.height = 5}
prior_pred %>%
filter(iter <= 18) %>%
ggplot(aes(t_pred)) +
geom_histogram(aes(y = after_stat(density))) +
xlab("predicted t (ms)") +
theme(axis.text.x = element_text(angle = 40,
vjust = 1,
hjust = 1)) +
scale_y_continuous(limits = c(0, 0.0005),
breaks = c(0, 0.00025, 0.0005),
name = "density") +
facet_wrap(~iter, ncol = 3)
```
The prior predictive distribution in Figure \@ref(fig:priorpred-simple) shows data sets generated from the model that are not realistic: Apart from the fact that the data sets show that finger tapping times distributions are symmetrical--and we know from prior experience with such data that they are generally right-skewed--some data sets present finger tapping times that are unrealistically long. Worse yet, if we inspect enough samples from the prior predicted data, it will become clear that a few data sets have negative finger tapping time values.
We can also look at the distribution of summary statistics in the prior predictive data. Even if we don't know beforehand what the data should look like, it's very likely that we have some expectations for possible mean, minimum, or maximum values. For example, in the button-pressing example, it seems reasonable to assume that average finger tapping times are between $200$-$600$ ms; finger tapping times are very unlikely to be below $50$ ms, and even long lapses of attention won't be greater than a couple of seconds.^[We'll see later how to generate prior predictive distributions of statistics such as mean, minimum, or maximum value in section \@ref(sec-lognormal) using `brms` and `pp_check()`.] Three distributions of summary statistics are shown in Figure \@ref(fig:priorpred-stats).
(ref:priorpred-stats) The prior predictive distributions of the mean, minimum, and maximum values of the button-pressing model defined in section \@ref(sec-simplenormal).
```{r priorpred-stats,fig.cap="(ref:priorpred-stats)", message = FALSE, echo = FALSE, fig.height = 3.5}
prior_pred %>%
group_by(iter) %>%
summarize(min_t = min(t_pred),
max_t = max(t_pred),
mean_t = mean(t_pred)) %>%
# we convert the previous data frame to a long one,
# where min_t, max_t, average_t are possible values
# of the columns "stat"
pivot_longer(cols = ends_with("t"),
names_to = "stat",
values_to = "t") %>%
mutate(stat = factor(stat, levels = c("mean_t", "min_t", "max_t"))) %>%
ggplot(aes(t)) +
geom_histogram(binwidth = 500, aes(y = after_stat(density))) +
facet_wrap(~stat, ncol = 1)
```
Figure \@ref(fig:priorpred-stats) shows that we used much less prior information than what we could have: Our priors were encoding the information that any mean between $0$ and $60000$ ms is equally likely. It seems clear that a value close to $0$ or to $60000$ ms would be extremely surprising. This wide range of mean values occurs because of the uniform prior on $\mu$. Similarly, maximum values are quite "uniform", spanning a much wider range than what one would expect. Finally, in the distribution of minimum values, negative finger tapping times occur. This might seem surprising (our prior for $\mu$ excluded negative values), but the reason that negative values appear is that the prior is interpreted together with the likelihood [@gelmanPriorCanOften2017], and the likelihood is a normal distribution, which will allow for negative samples even if the location parameter $\mu$ is constrained to have only positive values.
To summarize the above discussion, the priors used in the example are clearly not very realistic given what we might know about finger tapping times for such a button-pressing task. This raises the question: what priors should we have chosen? In the next section, we consider this question.
## The influence of priors: \index{Sensitivity analysis} sensitivity analysis {#sec-sensitivity}
For most cases that we will encounter in this book, there are four main classes of priors that we can choose from. Among Bayesians, there is no fixed nomenclature for classifying different kinds of priors. For this book, we have chosen specific names for each type of prior, but this is just a convention that we follow for consistency. There are also other classes of prior that we do not discuss in this book. An example is \index{Improper prior} improper priors such as $\mathit{Uniform}(-\infty,+\infty)$, which are not proper probability distributions because the \index{Area under the curve} area under the curve does not sum to one.
When thinking about priors, the reader should not get hung up on what precisely the name is for a particular type of prior; they should rather focus on what that prior means in the context of the research problem.
### \index{Flat, uninformative prior} Flat, uninformative priors
One option is to choose priors that are as uninformative as possible. The idea behind this approach is to let the data "speak for itself" and to not bias the statistical inference with "subjective" priors. There are several issues with this approach: First, the prior is as subjective as the likelihood, and in fact, different choices of likelihood might have a much stronger impact on the posterior than different choices of priors. Second, uninformative priors are in general unrealistic because they give equal weight to all values within the support of the prior distribution, ignoring the fact that usually there is some minimal information about the parameters of interest. Usually, at the very least, the order of magnitude is known (response times or finger tapping times will be in milliseconds and not days, EEG signals some microvolts and not volts, etc.). Third, uninformative priors make the sampling slower and might lead to convergence problems. Unless there is a large amount of data, it would be wise to avoid such priors. Fourth, it is not always clear which parameterization of a given distribution the flat priors should be assigned to. For example, the Normal distribution is sometimes defined based on its standard deviation ($\sigma$), variance ($\sigma^2$), or precision ($1/\sigma^2$): a flat prior for the standard deviation is not flat for the precision of the distribution. Although it is sometimes possible to find an uninformative prior that is invariant under a change of parameters [also called Jeffreys priors; @jaynes2003probability, section 6.15; @jeffreys1939theory, Chapter 3], this is not always the case. Finally, if Bayes factors need to be computed, uninformative priors can lead to very misleading conclusions (chapter \@ref(ch-bf)).
In the button-pressing example discussed in this chapter, an example of a flat, uninformative prior would be $\mu \sim \mathit{Uniform}(-10^{20},10^{20})$. On the millisecond scale, this is a very strange prior to use for a parameter representing mean button-pressing time: it allows for impossibly large positive values, and it also allows negative button-pressing times, which is of course impossible. It is technically possible to use such a prior, but it wouldn't make much sense.
### \index{Regularizing prior} Regularizing priors
If there does not exist much prior information (and if this information cannot be worked out through reasoning about the problem), and there is enough data (what "enough" means here will presently become clear when we look at specific examples), it is fine to use *regularizing priors*. These are priors that down-weight extreme values (that is, they provide regularization), they are usually not very informative, and mostly let the likelihood dominate in determining the posteriors. These priors are theory-neutral; that is, they usually do not bias the parameters to values supported by any prior belief or theory. The idea behind this type of prior is to help to stabilize computation. These priors are sometimes called \index{Weakly informative prior} *weakly informative* or \index{Mildly informative prior} *mildly informative* priors in the Bayesian literature. For many applications, they perform well, but discussed in chapter \@ref(ch-bf), they tend to be problematic if Bayes factors need to be computed.
In the button-pressing example, an example of a regularizing prior would be $\mu \sim \mathit{Normal}_{+}(0,1000)$. This is a Normal distribution prior truncated at $0$ ms, and allows a relatively constrained range of positive values for button-pressing times (roughly, up to $2000$ ms or so). This is a regularizing prior because it rules out negative button-pressing times and down-weights extreme values over $2000$ ms.
Here, one could assume a non-truncated prior like $\mathit{Normal}(0,1000)$. This could still be seen as a regularizing prior even though we don't expect $\mu$ to be negative; the data would ensure that the posterior distribution has positive values (because we would have no negative button-pressing times in the data).
### \index{Principled prior} Principled priors
The idea here is to have priors that encode all (or most of) the theory-neutral information that the researcher has. Since one generally knows what one's data do and do not look like, it is possible to build priors that truly reflect the properties of potential data sets, using prior predictive checks. In this book, many examples of this class of priors will come up.
In the button-pressing data, an example of a principled prior would be $\mu \sim \mathit{Normal}_{+}(250,100)$. This prior is not overly restrictive, but represents a guess about plausible button-pressing times. Prior predictive checks using principled priors should produce realistic distributions of the dependent variable.
### \index{Informative prior} Informative priors
There are cases where a lot of prior knowledge exists. In general, unless there are very good reasons for having relatively informative priors (see chapter \@ref(ch-bf)), it is not a good idea to let the priors have too much influence on the posterior. An example where informative priors would be important is when investigating a language-impaired population from which we can't get many subjects, but a lot of previously published papers exist on the research topic.
In the button-pressing data, an informative prior could be based on a meta-analysis of previously published or existing data, or the result of prior elicitation from an expert (or multiple experts) on the topic under investigation. An example of an informative prior would be $\mu \sim \mathit{Normal}_{+}(200,20)$. This prior will have some influence on the posterior for $\mu$, especially when one has relatively sparse data.
These four options constitute a continuum. The uniform prior from the last model (section \@ref(sec-simplenormal)) falls between flat, uninformative and regularizing priors. In practical data analysis situations, we are mostly going to choose priors that fall between regularizing and principled. Informative priors, in the sense defined above, will be used only relatively rarely; but they become more important to consider when doing Bayes factor analyses (chapter \@ref(ch-bf)).
## Revisiting the button-pressing example with different priors {#sec-revisit}
What would happen if even wider priors were used for the model defined previously (in section \@ref(sec-simplenormal))? Suppose that every value between $-10^{6}$ and $10^{6}$ ms is assumed to be equally likely for the location parameter $\mu$. This prior is clearly unrealistic and actually makes no sense at all: we are not expecting negative finger tapping times. Regarding the standard deviation, one could assume that any value between $0$ and $10^{6}$ is equally likely.^[Even though, in theory, one could use even wider priors, in practice, these are the widest priors that achieve convergence using brms.] The likelihood remains unchanged.
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Uniform}(-10^{6}, 10^{6}) \\
\sigma &\sim \mathit{Uniform}(0, 10^{6})
\end{aligned}
(\#eq:rtpriorsflat)
\end{equation}
```{r, message = FALSE, cache = TRUE}
# The default settings are used when they are not set explicitly:
# 4 chains, with half of the iterations (set as 3000) as warmup.
fit_press_unif <- brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(prior(uniform(-10^6, 10^6),
class = Intercept,
lb = -10^6,
ub = 10^6),
prior(uniform(0, 10^6),
class = sigma,
lb = 0,
ub = 10^6)),
iter = 3000,
control = list(adapt_delta = .99,
max_treedepth = 15))
```
Even with these extremely unrealistic priors, which require us to change the `adapt_delta` and `max_treedepth` default values to achieve convergence, the output of the model is virtually identical to the previous one (see Figure \@ref(fig:postcomp)).
```{r, echo = FALSE}
## Shorter output
print.brmsfit <- short_summary
```
```{r, eval = FALSE}
fit_press_unif
```
```{r, echo = FALSE}
short_summary(fit_press_unif)
```
Next, consider what happens if very informative priors are used. Assume that mean values very close to $400$ ms are the most likely, and that the standard deviation of the finger tapping times is very close to $100$ ms. Given that this is a model of button-pressing times, such an informative prior seems wrong---$200$ ms seems like a more realistic mean button-pressing time, not $400$ ms. You can check this by doing an experiment yourself and looking at the recorded times; software like Linger (http://tedlab.mit.edu/~dr/Linger/) makes it easy to set up such an experiment.
The $\mathit{Normal}_+$ notation indicates a normal distribution truncated at zero such that only positive values are allowed (see online section \@ref(app-truncation) discusses this type of distribution in detail). Even though the prior for the standard deviation is restricted to be positive, we are not required to add `lb = 0` to the prior, and it is automatically taken into account by `brms`.
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Normal}(400, 10) \\
\sigma &\sim \mathit{Normal}_+(100, 10)
\end{aligned}
(\#eq:infrtpriors)
\end{equation}
```{r, message = FALSE, cache = TRUE}
fit_press_inf <- brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(prior(normal(400, 10), class = Intercept),
# `brms` knows that SDs need to be bounded
# to exclude values below zero:
prior(normal(100, 10), class = sigma)))
```
Despite these unrealistic but informative priors, the likelihood mostly dominates and the new posterior means and credible intervals are just a couple of milliseconds away from the previous estimates (see Figure \@ref(fig:postcomp)):
```{r, eval = FALSE}
fit_press_inf
```
```{r, echo = FALSE}
short_summary(fit_press_inf)
```
As a final example of a sensitivity analysis, choose some principled priors. Assuming that we have some prior experience with previous similar experiments, suppose the mean response time is expected to be around $200$ ms, with a 95% probability of the mean ranging from $0$ to $400$ ms. This uncertainty is perhaps unreasonably large, but one might want to allow a bit more uncertainty than one really thinks is reasonable [this kind of conservativity in allowing somewhat more uncertainty is sometimes called Cromwell's rule in Bayesian statistics; see @kendall2004, section 3.19]. In such a case, one can decide on the prior $\mathit{Normal}(200, 100)$. Given that the experiment involves only one subject and the task is very simple, one might not expect the residual standard deviation $\sigma$ to be very large: As an example, one can settle on a location of $50$ ms for a truncated normal distribution, but still allow for relatively large uncertainty: $Normal_{+}(50,50)$.
The prior specifications are summarized below.
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Normal}(200, 100) \\
\sigma &\sim \mathit{Normal}_+(50, 50)
\end{aligned}
\end{equation}
Why are these priors principled? The designation "principled" here largely depends on our \index{Domain knowledge} domain knowledge. The online chapter \@ref(ch-priors) discusses how one can use domain knowledge when specifying priors.
One can achieve a better understanding of what a particular set of priors implies by visualizing the priors graphically, and carrying out prior predictive checks. These steps are skipped here, but these issues are discussed in detail in the online chapters \@ref(ch-priors) and \@ref(ch-workflow). These chapters will give more detailed information about choosing priors and on developing a principled workflow for Bayesian data analysis.
```{r, message = FALSE, cache = TRUE}
fit_press_prin <-
brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(prior(normal(200, 100), class = Intercept),
prior(normal(50, 50), class = sigma)))
```
The new estimates are virtually the same as before (see Figure \@ref(fig:postcomp)):
```{r, eval = FALSE}
fit_press_prin
```
```{r, echo = FALSE}
short_summary(fit_press_prin)
```
The above examples of using different priors should not be misunderstood to mean that priors never matter. When there is enough data, the likelihood will dominate in determining the posterior distributions. What constitutes "enough" data is also a function of the complexity of the model; as a general rule, more complex models require more data.
(ref:postcomp) Comparison of the posterior distributions from the model with "realistically" bounded uniform prior distribution (though still not recommended), `fit_press`, against the model with extremely unrealistically wide priors, `fit_press_unif`, the model with misspecified informative priors, `fit_press_inf`, and the model with principled priors, `fit_press_prin`. All the posteriors virtually overlap except for the posterior of `fit_press_inf`, which is shifted by a few milliseconds.
```{r postcomp, fig.cap="(ref:postcomp)", echo = FALSE, fig.height = 2.5}
fit_press_all <- bind_rows(as.data.frame(fit_press) %>%
select(-lp__, -lprior, -Intercept) %>%
pivot_longer(everything(), names_to = "Parameter") %>%
mutate(model = "fit_press"),
as.data.frame(fit_press_unif) %>%
select(-lp__, -lprior, -Intercept) %>%
pivot_longer(everything(), names_to = "Parameter") %>%
mutate(model = "fit_press_unif"),
as.data.frame(fit_press_inf) %>%
select(-lp__, -lprior, -Intercept) %>%
pivot_longer(everything(), names_to = "Parameter") %>%
mutate(model = "fit_press_inf"),
as.data.frame(fit_press_prin) %>%
select(-lp__, -lprior, -Intercept) %>%
pivot_longer(everything(), names_to = "Parameter") %>%
mutate(model = "fit_press_prin")) %>%
mutate(model = factor(model, levels = c("fit_press","fit_press_unif", "fit_press_inf","fit_press_prin")))
# Define the colors and linetypes for the models
model_colors <- c("fit_press" = "black", "fit_press_unif" = "black", "fit_press_inf" = "darkgrey", "fit_press_prin" = "darkgrey")
model_linetypes <- c("fit_press" = "solid", "fit_press_unif" = "dashed", "fit_press_inf" = "solid", "fit_press_prin" = "dashed")
ggplot(fit_press_all, aes(value, linetype = model, color = model)) +
stat_density(geom="line", position="identity", bw = .4)+
facet_wrap(. ~ Parameter, ncol = 2, scales = "free") +
scale_color_manual(values = model_colors) +
scale_linetype_manual(values = model_linetypes) +
xlab("Finger tapping times (ms)") +
theme(legend.position = "bottom")
```
Even in cases where there is enough data and the likelihood dominates in determining the posteriors, regularizing, principled priors (i.e., priors that are more consistent with our a priori beliefs about the data) will in general speed-up model convergence.
In order to determine the extent to which the posterior is influenced by the priors, it is a good practice to carry out a sensitivity analysis: try different priors and either verify that the posterior doesn't change drastically, or report how the posterior is affected by some specific priors [for examples from psycholinguistics, see @VasishthetalPLoSOne2013; @VasishthEngelmann2022]. Chapter \@ref(ch-bf) will demonstrate that sensitivity analysis becomes crucial for reporting Bayes factors; even in cases where the choice of priors does not affect the posterior distribution, it generally affects the Bayes factor.
## \index{Posterior predictive distribution} Posterior predictive distribution {#sec-ppd}
The posterior predictive distribution is a collection of data sets generated from the model (the likelihood and the priors). Having obtained the posterior distributions of the parameters after taking into account the data, the posterior distributions can be used to generate future data from the model. In other words, given the posterior distributions of the parameters of the model, the posterior predictive distribution gives us some indication of what future data might look like.
Once the posterior distributions $p(\boldsymbol{\Theta}\mid \boldsymbol{y})$ are available, the predictions based on these distributions can be generated by integrating out the parameters:
\begin{equation}
p(\boldsymbol{y_{pred}}\mid \boldsymbol{y} ) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{y_{pred}}, \boldsymbol{\Theta}\mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{\Theta}= \int_{\boldsymbol{\Theta}}
p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta},\boldsymbol{y})p(\boldsymbol{\Theta}\mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{\Theta}
\end{equation}
Assuming that past and future observations are conditionally independent^[Two events A and B are defined to be conditionally independent given some event E if $P(A\cap B | E) = P(A|E) P(B|E)$. See chapter 2 of @blitzstein2014introduction for examples and more discussion of conditional independence.] given $\boldsymbol{\Theta}$, i.e., $p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta},\boldsymbol{y})= p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta})$, the above equation can be written as:
\begin{equation}
p(\boldsymbol{y_{pred}}\mid \boldsymbol{y} )=\int_{\boldsymbol{\Theta}} p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta}\mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{\Theta}
(\#eq:postpp)
\end{equation}
In Equation \@ref(eq:postpp), we are conditioning $\boldsymbol{y_{pred}}$ only on $\boldsymbol{y}$, we do not condition on what we don't know ($\boldsymbol{\Theta}$); the unknown parameters have been integrated out. This posterior predictive distribution has important differences from predictions obtained with the frequentist approach. The frequentist approach gives a point estimate of each predicted observation given the maximum likelihood estimate of $\boldsymbol{\Theta}$ (a point value), whereas the Bayesian approach gives a distribution of values for each predicted observation. As with the prior predictive distribution, the integration can be carried out computationally by generating samples from the posterior predictive distribution. The same function that we created before, `normal_predictive_distribution()`, can be used here. The only difference is that instead of sampling `mu` and `sigma` from the priors, the samples come from the posterior.
```{r}
N_obs <- nrow(df_spacebar)
mu_samples <- as_draws_df(fit_press)$b_Intercept
sigma_samples <- as_draws_df(fit_press)$sigma
normal_predictive_distribution(mu_samples = mu_samples,
sigma_samples = sigma_samples,
N_obs = N_obs)
```
The `brms` function \index{\texttt{posterior\_predict}} `posterior_predict()` is a convenient function that delivers samples from the posterior predictive distribution. Using the command `posterior_predict(fit_press)` yields the predicted finger tapping times in a matrix, with the samples as rows and the observations (data-points) as columns. (Bear in mind that if a model is fit with `sample_prior = "only"`, the dependent variable is ignored and `posterior_predict` will yield samples from the prior predictive distribution).
The posterior predictive distribution can be used to examine the "descriptive adequacy" of the model under consideration [@Gelman14, Chapter 6; @shiffrinSurveyModelEvaluation2008]. Examining the posterior predictive distribution to establish \index{Descriptive adequacy} descriptive adequacy is called \index{Posterior predictive check} posterior predictive checking. The goal here is to establish that the posterior predictive data look more or less similar to the observed data. Achieving descriptive adequacy means that the current data could have been generated by the model. Although passing a test of descriptive adequacy is not strong evidence in favor of a model, a major failure in descriptive adequacy can be interpreted as strong evidence against a model [@shiffrinSurveyModelEvaluation2008].
For this reason, comparing the descriptive adequacy of different models is not enough to differentiate between their relative performance.
When doing model comparison, it is important to consider the criteria that @rp define. Although @rp are more interested in process models and not necessarily Bayesian models, their criteria are important for any kind of model comparison. Their main point is that it is not enough to have a good fit to the data for a model to be convincing. One should check that the range of predictions that the model makes is reasonably constrained; if a model can capture any possible outcome, then the model fit to a particular data set is not so informative. In the Bayesian modeling context, although posterior predictive checks are important, they should been as only sanity checks to assess whether the model behavior is reasonable (for more on this point, see the online chapter \@ref(ch-workflow)).
In many cases, one can simply use the plotting functions provided with `brms` (these act as wrappers for \index{\texttt{bayesplot}} `bayesplot` functions). For example, the plotting function \index{\texttt{pp\_check}} `pp_check()` takes as arguments the model, the number of predicted data sets, and the type of visualization, and it can display different visualizations of posterior predictive checks. In these type of plots, the observed data are plotted as $y$ and predicted data as $y_{rep}$. Below, we use `pp_check()` to investigate how well the observed distribution of finger tapping times fit our model based on some number ($11$ and $100$) of samples of the posterior predictive distributions (that is, simulated data sets); see Figures \@ref(fig:normalppc) and \@ref(fig:normalppc2).
(ref:normalppc) Histograms of eleven samples from the posterior predictive distribution of the model `fit_press` ($y_{rep}$).
```{r normalppc, fig.cap = "(ref:normalppc)", message = FALSE, fig.height = 3.5}
pp_check(fit_press, ndraws = 11, type = "hist")
```
(ref:normalppc2) A posterior predictive check that shows the fit of the model `fit_press` in comparison to data sets from the posterior predictive distribution using an overlay of density plots.
```{r normalppc2, fig.cap = "(ref:normalppc2)" , message = FALSE, fig.height = 2.2}
pp_check(fit_press, ndraws = 100, type = "dens_overlay")
```
The data is slightly skewed and has no values smaller than $100$ ms, but the predictive distributions are centered and symmetrical; see Figures \@ref(fig:normalppc) and \@ref(fig:normalppc2). This posterior predictive check shows a slight mismatch between the observed and predicted data. Can we build a better model? We’ll come back to this issue in the next section.
## The influence of the likelihood
Finger tapping times (and response times in general) are not usually normally distributed. A more realistic distribution is the log-normal. A random variable (such as time) that is log-normally distributed takes only positive real values and is right-skewed. Although other distributions can also produce data with such properties, the log-normal will turn out to be a pretty reasonable distribution for finger tapping times and response times.
### The \index{Log-normal likelihood} log-normal likelihood {#sec-lnfirst}
If $\boldsymbol{y}$ is log-normally distributed, this means that $\log(\boldsymbol{y})$ is normally distributed.^[More precisely, $\log_e(\boldsymbol{y})$ or $\ln(\boldsymbol{y})$, but we'll write it as just $\log(\boldsymbol{y})$.] The \index{Log-normal distribution} log-normal distribution is also defined using the parameters location, $\mu$, and scale, $\sigma$, but these are on the log ms scale; they correspond to the mean and standard deviation of the logarithm of the data $\boldsymbol{y}$, $\log(\boldsymbol{y})$, which will be normally distributed. Thus, when we model some data $\boldsymbol{y}$ using the log-normal likelihood, the parameters $\mu$ and $\sigma$ are on a different scale than the data $\boldsymbol{y}$. Equation \@ref(eq:lognormalequations) shows the relationship between the log-normal and the normal.
\begin{equation}
\begin{aligned}
\log(\boldsymbol{y}) &\sim \mathit{Normal}( \mu, \sigma)\\
\boldsymbol{y} &\sim \mathit{LogNormal}( \mu, \sigma)
\end{aligned}
(\#eq:lognormalequations)
\end{equation}
We can obtain samples from the log-normal distribution, using the normal distribution by first setting an auxiliary variable $z$, so that $z = \log(y)$. This means that $z \sim \mathit{Normal}(\mu, \sigma)$.
Then we can just use $exp(z)$ as samples from the $\mathit{LogNormal}(\mu, \sigma)$, since $\exp(z) =\exp(\log(y)) = y$. The code below produces Figure \@ref(fig:logndemo).
```{r logndemo,fig.cap="Two log-normal distributions with the same parameters generated by either generating samples from a log-normal distribution or exponentiating samples from a normal distribution.", message=FALSE, fig.show = "hold", out.width="49%", fig.width = 3.9, fig.height =3.2 }
mu <- 6
sigma <- 0.5
N <- 500000
# Generate N random samples from a log-normal distribution
sl <- rlnorm(N, mu, sigma)
ggplot(tibble(samples = sl), aes(samples)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
ggtitle("Log-normal distribution\n") +
coord_cartesian(xlim = c(0, 2000))
# Generate N random samples from a normal distribution,
# and then exponentiate them
sn <- exp(rnorm(N, mu, sigma))
ggplot(tibble(samples = sn), aes(samples)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
ggtitle("Exponentiated samples from\na normal distribution") +
coord_cartesian(xlim = c(0, 2000))
```
### Using a log-normal likelihood to fit data from a single subject pressing a button repeatedly {#sec-lognormal}
If we assume that finger tapping times are log-normally distributed, the likelihood function changes as follows:
\begin{equation}
t_n \sim \mathit{LogNormal}(\mu,\sigma)
\end{equation}
But now the scale of the priors needs to change! Let's start with uniform priors for ease of exposition, even though, as we mentioned earlier, these are not really appropriate here. (More realistic priors are discussed below.)
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Uniform}(0, 11) \\
\sigma &\sim \mathit{Uniform}(0, 1) \\
\end{aligned}
(\#eq:logpriorsunif)
\end{equation}
Because the parameters are on a different scale than the dependent variable, their interpretation changes and it is more complex than if we were dealing with a linear model that assumes a normal likelihood (location and scale do not coincide with the mean and standard deviation of the log-normal):
- \index{Location} *The location $\mu$*: In our previous linear model, $\mu$ represented the mean (in a normal distribution, the mean happens to be identical to the median and the mode). But now, the mean needs to be calculated by computing $\exp(\mu +\sigma ^{2}/2)$. In other words, in the log-normal, the mean is dependent on both $\mu$ and $\sigma$. The median is just $\exp(\mu)$. Notice that the prior of $\mu$ is not on the milliseconds scale, but rather on the log milliseconds scale.
- \index{Scale} *The scale $\sigma$*: This is the standard deviation of the normal distribution of $\log(\boldsymbol{y})$. The standard deviation of a log-normal distribution with *location* $\mu$ and *scale* $\sigma$ will be $\exp(\mu +\sigma ^{2}/2)\times \sqrt{\exp(\sigma^2)- 1}$. Unlike the normal distribution, the spread of the log-normal distribution depends on both $\mu$ and $\sigma$.
To understand the meaning of the priors on the millisecond scale, both the priors and the likelihood need to be taken into account. Generating a prior predictive distribution will help in interpreting the priors. This distribution can be generated by just exponentiating the samples produced by `normal_predictive_distribution()` (or, alternatively, edit the function by replacing `rnorm()` with `rlnorm()`).
```{r}
N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 11)
sigma_samples <- runif(N_samples, 0, 1)
prior_pred_ln <-
normal_predictive_distribution(mu_samples = mu_samples,
sigma_samples = sigma_samples,
N_obs = N_obs) %>%
mutate(t_pred = exp(t_pred))
```
Next, plot the distribution of some representative statistics; see Figure \@ref(fig:priorpredlogunif).
(ref:priorpredlogunif) The prior predictive distribution of the mean, median, minimum, and maximum value of the log-normal model with priors defined in Equation \@ref(eq:logpriorsunif), that is $\mu \sim \mathit{Uniform}(0, 11)$ and $\sigma \sim \mathit{Uniform}(0, 1)$. The x-axis shows values back-transformed from the log-scale.
```{r priorpredlogunif,fig.cap="(ref:priorpredlogunif)", message = FALSE, echo = FALSE, fig.height = 4}
prior_pred_ln %>%
group_by(iter) %>%
summarize(min_t = min(t_pred),
max_t = max(t_pred),
mean_t = mean(t_pred),
median_t = median(t_pred)) %>%
pivot_longer(cols = ends_with("t"), names_to = "stat", values_to = "t") %>%
mutate(stat = factor(stat, levels = c("mean_t", "median_t", "min_t", "max_t"))) %>%
ggplot(aes(t)) +
scale_x_continuous("Finger tapping times in ms",
trans = "log", breaks = c(0.001, 1, 10, 100, 1000, 10000, 100000)
) +
geom_histogram(aes(y = after_stat(density))) +
ylab("density") +
facet_wrap(~stat, ncol = 1)
```
We cannot generate negative values any more, since $\exp($any finite real number$) > 0$. These priors might work in the sense that the model might converge; but it would be better to have regularizing priors for the model. An example of regularizing priors:
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Normal}(6, 1.5) \\
\sigma &\sim \mathit{Normal}_+(0, 1) \\
\end{aligned}
(\#eq:logpriorsnorm)
\end{equation}
The prior for $\sigma$ here is a \index{Truncated distribution} truncated distribution, and although its location is zero, this is not its mean. We can calculate its approximate mean from a large number of random samples of the prior distribution using the function \index{\texttt{rtnorm}} `rtnorm()` from the package `extraDistr`. In this function, we have to set the parameter `a = 0` to express the fact that the normal distribution is truncated from the left at 0. (Online section \@ref(app-truncation) discusses this type of
distribution in detail):
```{r}
mean(rtnorm(100000, 0, 1, a = 0))
```
Even before generating the prior predictive distributions, we can calculate the range within which we are 95% sure that the expected median of the observations will lie. We do this by looking at what happens at two standard deviations away from the mean of the prior, $\mu$, that is $6 - 2\times 1.5$ and $6 + 2\times 1.5$, and exponentiating these values:
```{r}
c(lower = exp(6 - 2 * 1.5),
higher = exp(6 + 2 * 1.5))
```
This means that the prior for $\mu$ is still not too informative (these are medians; the actual values generated by the log-normal distribution can be much more spread out). Next, plot the distribution of some representative statistics of the prior predictive distributions.
`brms` allows one to sample from the priors, ignoring the observed data `t` , by setting `sample_prior = "only"` in the `brm` function.
If we want to use `brms` to generate prior predictive data in this manner even before we have any data, we do need to include a data frame with the relevant dependent variable (`y` in this case). Setting `sample_prior = "only"` will ignore the values of the dependent variable.
```{r, results="hide", message= FALSE}
fit_prior_press_ln <-
brm(t ~ 1,
data = df_spacebar,
family = lognormal(),
prior = c(prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = sigma)),
sample_prior = "only",
control = list(adapt_delta = 0.9))
```
To avoid the warnings, increase the `adapt_delta` parameter's default value from $0.8$ to $0.9$ to simulate the data. Since Stan samples from the prior distributions in the same way that it samples from the posterior distribution, one should not ignore warnings; always ensure that the model converges. In that respect, the custom function `normal_predictive_distribution()` defined in section \@ref(sec-priorpred) has the advantage that it will always yield independent samples from the prior distribution and will not experience any convergence problems. This is because it just relies on the `rnorm()` function in R.
Plot the prior predictive distribution of means with the following code. In a prior predictive distribution, we generally want to ignore the data; this requires setting `prefix = "ppd"` in `pp_check()`.
To plot the distribution of minimum, and maximum values, just replace `mean` with `min`, and `max` respectively. The distributions of the three statistics are displayed in Figure \@ref(fig:priorpredlognorm).
(ref:priorpredlognorm) The prior predictive distributions of the mean, maximum, and minimum values of the log-normal model with priors defined in Equation \@ref(eq:logpriorsnorm). The prior predictive distributions are labeled $y_{pred}$. The x-axis shows values back-transformed from the log-scale.
```{r priorpredlognorm, fig.cap="(ref:priorpredlognorm)", message = FALSE, tidy=FALSE, fig.height = 6, fig.show="hold", fold = TRUE, fig.height = 4}
p1 <- pp_check(fit_prior_press_ln,
type = "stat",
stat = "mean",
prefix = "ppd") +
coord_cartesian(xlim = c(0.001, 300000)) +
scale_x_continuous("Finger tapping times (ms)",
trans = "log",
breaks = c(0.001, 1, 100, 1000, 10000, 100000),
labels = c("0.001", "1", "100", "1000", "10000",
"100000")) +
ggtitle("Prior predictive distribution of means")
p2 <- pp_check(fit_prior_press_ln,
type = "stat",
stat = "min",
prefix = "ppd") +
coord_cartesian(xlim = c(0.001, 300000)) +
scale_x_continuous("Finger tapping times (ms)",
trans = "log",
breaks = c(0.001, 1, 100, 1000, 10000, 100000),
labels = c("0.001", "1", "100", "1000", "10000",
"100000")) +
ggtitle("Prior predictive distribution of minimum values")
p3 <- pp_check(fit_prior_press_ln,
type = "stat",
stat = "max",
prefix = "ppd") +
coord_cartesian(xlim = c(0.001, 300000)) +
scale_x_continuous("Finger tapping times (ms)",
trans = "log",
breaks = c(0.001, 1, 100, 1000, 10000, 100000),
labels = c("0.001", "1", "10", "1000", "10000",
"100000")) +
ggtitle("Prior predictive distribution of maximum values")
plot_grid(p1, p2, p3, nrow = 3, ncol =1)
```
Figure \@ref(fig:priorpredlognorm) shows that the priors used here are still quite uninformative. The tails of the prior predictive distributions that correspond to our normal priors shown in Figure \@ref(fig:priorpredlognorm) are even further to the right, reaching more extreme values than for the prior predictive distributions generated by uniform priors (shown in Figure \@ref(fig:priorpredlogunif)). The new priors are still far from representing our prior knowledge. We could run more iterations of choosing priors and generating prior predictive distributions until we have priors that generate realistic data. However, given that the bulk of the distributions of the mean, maximum, minimum values lies roughly in the correct order of magnitude, these priors are going to be acceptable. In general, summary statistics (e.g., mean, median, min, max) can be used to test whether the priors are in a plausible range. This can be done by defining, for the particular research problem under study, the extreme data that would be very implausible to ever observe (e.g., reading times at a word larger than one minute) and choosing priors such that such extreme finger tapping times occur only very rarely in the prior predictive distribution.
Next, fit the model; recall that both the distribution family and prior change in comparison to the previous example.
```{r, message = FALSE, cache = TRUE}
fit_press_ln <-
brm(t ~ 1,
data = df_spacebar,
family = lognormal(),
prior = c(prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = sigma)))
```
When we look at the summary of the posterior, the parameters are on the log-scale:
```{r, eval = FALSE}
fit_press_ln
```
```{r, echo = FALSE}
short_summary(fit_press_ln)
```
If the research goal is to find out how long it takes to press the space bar in milliseconds, we need to transform the $\mu$ (or `Intercept` in the model) to milliseconds. Because the median of the log-normal distribution is $\exp(\mu)$, the following returns the median estimate in milliseconds (recall that for the mean we would need to calculate $\exp(\mu +\sigma ^{2}/2)$):
```{r, message = FALSE}
estimate_ms <- exp(as_draws_df(fit_press_ln)$b_Intercept)
```
To display the mean and 95% credible interval of these samples, type:
```{r}
c(mean = mean(estimate_ms),
quantile(estimate_ms, probs = c(.025, .975)))
```
Next, check whether the predicted data sets look similar to
the observed data set. See Figure \@ref(fig:logppc); compare this with the earlier Figure \@ref(fig:normalppc2).
(ref:logppc) The posterior predictive distribution of `fit_press_ln`.
```{r logppc, message=FALSE, fig.cap="(ref:logppc)", fig.height = 2.2}
pp_check(fit_press_ln, ndraws = 100)
```
The key question is: Are the posterior predicted data now more similar to the observed data, compared to the case where we had a Normal likelihood? According to Figure \@ref(fig:logppc), it seems so, but it's not easy to tell.
Another way to examine the extent to which the predicted data looks similar to the observed data would be to look at the distribution of some summary statistic. As with prior predictive distributions, examine the distribution of representative summary statistics for the data sets generated by different models. However, in contrast with what occurs with prior predictive distributions, at this point we have a clear reference, our observations, and this means that we can compare the summary statistics with the observed statistics from our data. We suspect that the normal distribution would generate finger tapping times that are too fast (since the normal distribution is symmetrical) and that the log-normal distribution may capture the long tail better than the normal model. Based on this supposition, compute the distribution of minimum and maximum values for the posterior predictive distributions, and compare them with the minimum and maximum value respectively in the data. The function `pp_check()` implements this by specifying `stat` either `"min"` or `"max"` for both `fit_press`, and `fit_press_ln`; an example is shown below. The plots are shown in Figures \@ref(fig:ppcheckmin) and \@ref(fig:ppcheckmax).
```{r, eval = FALSE}
pp_check(fit_press, type = "stat", stat = "min")
```
(ref:ppcheckmin) The distributions of minimum values in a posterior predictive check, using the normal and log-normal probability density functions. The minimum in the data is $110$ ms.
```{r ppcheckmin,fig.cap="(ref:ppcheckmin)", fig.height = 3, fig.show="hold", message=FALSE, out.width = "45%", fig.width=4, echo = FALSE}
pp_check(fit_press, type = "stat", stat = "min") + ggtitle("Normal model")
pp_check(fit_press_ln, type = "stat", stat = "min") + ggtitle("Log-normal model")
```
(ref:ppcheckmax) The distributions of maximum values in a posterior predictive check using the normal and log-normal. The maximum in the data is $409$ ms.
```{r ppcheckmax,fig.cap="(ref:ppcheckmax)", fig.show="hold", fig.height = 3, message=FALSE, out.width = "45%", fig.width=4, echo = FALSE}
pp_check(fit_press, type = "stat", stat = "max") + ggtitle("Normal model")
pp_check(fit_press_ln, type = "stat", stat = "max") + ggtitle("Log-normal model")
```
Figure \@ref(fig:ppcheckmin) shows that the log-normal likelihood does a slightly better job since the minimum value is contained in the bulk of the log-normal distribution and in the tail of the normal one. Figure \@ref(fig:ppcheckmax) shows that both models are unable to capture the maximum value of the observed data. One explanation for this is that the log-normal-ish observations in our data are being generated by the task of pressing as fast as possible, whereas the observations with long finger tapping times are possibly being generated by lapses of attention.
If this assumption is correct, this would mean that the distribution of button pressing times is a mixture of two distributions; modeling such a mixture process involves more complex tools that we will take up in chapter \@ref(ch-mixture).
This completes our introduction to `brms`. We are now ready to learn more about regression models.
## List of the most important commands
Here is a list of the most important commands we learned in this chapter.
- The core `brms` function for fitting models, for generating prior predictive and posterior predictive data:
```{r eval=FALSE}
fit_press <-
brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(prior(uniform(0, 60000),
class = Intercept,
lb = 0,
ub = 60000),
prior(uniform(0, 2000),
class = sigma,
lb = 0,
ub = 2000)),
## uncomment for prior predictive:
## sample_prior = "only",
## uncomment when dealing with divergent transitions
## control = list(adapt_delta = .9)
## default values for chains, iterations and warmup:
chains = 4,
iter = 2000,
warmup = 1000)
```
- Extract samples from fitted model:
```{r eval=FALSE}
as_draws_df(fit_press)
```
- Basic plot of posteriors
```{r eval=FALSE}
plot(fit_press)
```
- Plot prior predictive/posterior predictive data
```{r eval=FALSE}
## Posterior predictive check:
pp_check(fit_press, ndraws = 100, type = "dens_overlay")
## Plot posterior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean")
## Plot prior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean",
prefix = "ppd")
```
## Summary
This chapter showed how to fit and interpret a Bayesian model with a normal likelihood. We looked at the effect of priors by investigating prior predictive distributions and by carrying out a sensitivity analysis. We also looked at the fit of the posterior, by inspecting the posterior predictive distribution (which gives us some idea about the descriptive adequacy of the model). We also showed how to fit a Bayesian model with a log-normal likelihood, and how to compare the predictive accuracy of different models.
## Further reading {#sec-ch3furtherreading}
Sampling algorithms are discussed in detail in @gamerman. Also helpful are the sections on sampling from the short open-source book by Bob Carpenter, *Probability and Statistics: a simulation-based introduction* (https://github.com/bob-carpenter/prob-stats), and the sections on sampling algorithms in @lambert2018student and @lynch2007introduction.
The evolution of probabilistic programming languages for Bayesian inference is discussed in @vstrumbelj2024past.
Introductory linear modeling theory is covered in @dobson2011introduction; more advanced treatments are in @monty and @seber. Generalized linear models are covered in detail in @mccullagh2019generalized. The reader may also benefit from our own freely available online lecture notes on linear modeling: https://github.com/vasishth/LM.