# Workflow {#ch-workflow}
Although modern Bayesian analysis tools (such as `brms`) greatly facilitate Bayesian computations, model specification is still (as it should be) the responsibility of the user. Some of the steps required to arrive at a robust and useful analysis were already described in chapter \@ref(ch-compbda). In this chapter, these ideas are brought together to spell out a principled approach to developing a workflow. This chapter is an abbreviated version of a recent introduction of a \index{Principled Bayesian workflow} principled Bayesian workflow to cognitive science [@schad2019towardarXiv]. For a revised published version of the 2019 paper, see @schad2020toward.
Much research has been done recently to create tools that help ensure reliable Bayesian data analyses [see, for example, @Gabry:2017aa; @talts2018validating]. The development of a principled Bayesian workflow for performing a probabilistic analysis is one of the most recent outcomes of this research [@Betancourt:2018aa; @schad2019towardarXiv]. This process leaves space for future advancements in methodology and offers a logical first set of steps to take for a \index{Robust analysis} robust analysis. Parts of this workflow can, in principle, be applied to any type of data analysis, whether frequentist or Bayesian, whether sampling-based or based on analytic procedures.
In this chapter, we discuss some aspects of the principled Bayesian workflow. Certain components of this workflow are particularly useful when working with advanced or non-standard models.
When fitting a model, it is important to ask several questions and perform various checks to validate a probabilistic model. Before delving into the details of this discussion, we first examine the process of model building and how different traditions have led to different approaches to these questions.
## \index{Model building} Building a model
An effective approach to model building is to begin with a minimal model that captures only the phenomenon of interest, without incorporating much other structure in the data. For instance, this could be a linear model that includes only the factor or covariate of primary interest. This model is then subjected to a series of checks, which are described in detail in the following sections. If the model passes all of these checks and does not exhibit any signs of inadequacy, it can be applied to the problem at hand with confidence, knowing that it provides reasonably robust inferences for the scientific question that needs to be answered. However, if the model fails one or more of these checks, the model may need to be improved; in addition, even the scientific question may need to be changed. For example, in a repeated measures data set, we may use a sample of $30$ subjects with the aim of estimating the correlation parameter between by-group adjustments (their random effects correlation). If the analysis shows that the sample is not large enough to reliably estimate the correlation term, the sample size may need to be increased, or the plan to investigate any correlation may need to be abandoned.
To guide and inform model development, initially an aspirational model \index{Aspirational model} $\mathcal{M}_A$ is specified.
This model is an idea that encompasses every aspect of the phenomenon and the measurement procedure, as if time, money, subjects, computational and mathematical tools, and other resources were all infinite.
It accounts for all systematic effects that might influence the measurement process, such as influences of time or heterogeneity across individuals. Using this model as a reference point helps to avoid random walks in model space during model development. The model has to capture both the latent phenomenon of interest and the environment and experiment used to probe it.
The \index{Initial model} initial model $\mathcal{M}_1$ is designed to incorporate only the phenomenon of core scientific interest, without including any additional aspects or structures relevant for modeling or measurement. This is in contrast to the aspirational model $\mathcal{M}_A$, which includes all possible details (of course, within reason) of the phenomenon and measurement process. If the initial model turns out to be inadequate, then the aspirational model guides model development. If the expanded model still shows problems in model checking, model development continues with another cycle of development.
In the following sections, prior and posterior predictive checks are discussed briefly, because they provide a foundation for a principled approach to \index{Model expansion} model expansion. Critically, model development is best built up via *expansion*. If an expanded model turns out not to be a better description of the data, it is always possible to go back to a previous, simpler version of the model.
An alternative strategy for model fitting is proposed by some researchers, whereby the model contains all group-level variance components (e.g., by-participant and by-items) that are allowed by the experimental design, as well as a full variance-covariance matrix for all group-level parameters. A commonly used name for this model, especially in psychology and psycholinguistics, is the \index{Maximal model} "maximal" model [e.g., @barr2013]. However, this model can be seen as "maximal" only in the framework of a linear model. For example, section \@ref(sec-distrmodel) treated distributional models, which are already beyond the scope of the maximal models in the sense of @barr2013. Nevertheless, a maximal model can provide an alternative starting point for the principled Bayesian workflow. Here, model expansion is not the focus. Instead, if the maximal model approach is taken, the workflow that we discuss here can be useful for specifying priors encoding domain expertise, and for ensuring model adequacy. In the principled Bayesian workflow, it may be reasonable for some steps (especially computationally demanding ones) to be executed only once for a given series of related studies (with similar designs) or only if models are coded in Stan.
The term "maximal" in a maximal model as used by @barr2013 refers to the maximal specification of the variance components within the parameters of the linear regression approximation. Models constrained by the linear regression structure are unable to account for factors like measurement error, dynamic changes in processes over time, or selection bias in the data. Crucially, the aspirational model--a representation of the actual data-generating process--is not the "maximal" model.
Last but not least, occasionally the outcomes of the Bayesian workflow will demonstrate that the data or the experimental design used is insufficient to address the specific scientific question at hand. In this case, either the level of ambition must be lowered or new data must be gathered, perhaps using an alternative experimental design that is more sensitive to the relevant phenomenon.
\index{Pre-registration} Pre-registration of experimental analyses prior to data collection is a significant development in open science practices [@chambers2019seven]. Online resources like AsPredicted (https://aspredicted.org/) and the Open Science Foundation (https://osf.io/) can be used for this [but see @szollosi2019preregistration]. What details should or can be recorded during the Bayesian workflow's preregistration? The population- and group-level effects (also referred to as fixed and random effects) and contrast coding [@schadHowCapitalizePriori2020] should be described if the maximal model in the sense of @barr2013 is going to be used for analysis.
Rigid preregistration makes little sense in the context of incremental model building unless one knows precisely what the model is: if the model is not appropriate for the data at hand, any subsequent inference will be limited, if not useless. The deeper problem with preregistration is that a model cannot be validated until the phenomenon *and* the experiment are thoroughly understood.
One feasible option is to define the initial and aspirational models, as well as the incremental strategy that will be used to probe the initial model and move it closer to the aspirational model.
It can be helpful to preregister the initial model and the principles one plans to use in model selection; this is a helpful step even though it can be challenging, or indeed impossible, to fully define the aspirational model. The incremental model building strategy towards the aspirational model may be seen as lying at the boundary between confirmatory and exploratory, and becomes more confirmatory the more clearly the aspirational model can be spelled out a priori [@lee2019robust].
## Principled questions to ask about a model
What qualities are key for a useful probabilistic model? A first quality is \index{consistency} consistency with domain expertise. Furthermore, in order to effectively address scientific inquiries, a probabilistic model must possess sufficient richness to accurately represent the structure of the actual data generation process. Two additional requirements must be met when developing very complex or non-standard models (which we will touch on briefly in this chapter): the model must allow accurate posterior approximation and it must capture enough of the experimental design to provide meaningful insights into our research questions.
What can we do to meet these requirements for our probabilistic model? We will go over the analysis steps and the questions to ask at each step.
We will first examine whether our model is consistent with our domain expertise using prior predictive checks. Additionally, posterior predictive checks evaluate whether the model adequately captures the relevant structure of the actual data-generating process for the given data set.
We will also touch on two more, computationally costly steps that are part of the principled workflow and can be used, for example, when coding complex or non-standard models: this includes examining (a) model sensitivity, and (b) the question of whether we can recover model parameters with the provided design, including checks of
\index{Computational faithfulness} computational faithfulness, by examining whether posterior estimation is accurate.
### \index{Prior predictive check} Checking whether assumptions are consistent with \index{Domain expertise} domain expertise: Prior predictive checks
When investigating the model, it is crucial to first determine whether the model and the prior parameter distributions are in line with domain knowledge.
Prior distributions may be chosen on the basis of previous studies or practicality. For complex models, it is frequently challenging to determine which prior distributions to use, and to anticipate what effects the priors on the model parameters have on the a priori expected data. One workable solution is to simulate artificial data from the model using the prior distributions, and then verify whether the simulated data make sense and align with domain knowledge. Such simulated data are frequently far easier to judge than the prior distributions of a complex model themselves.
To put this strategy into action, take the following steps:
1. Using the prior $p(\boldsymbol{\Theta})$, draw a parameter set $\boldsymbol{\Theta_{pred}}$ from it via random sampling: $\boldsymbol{\Theta_{pred}} \sim p(\boldsymbol{\Theta})$
2. Based on this parameter set $\boldsymbol{\Theta_{pred}}$, simulate artificial data $\boldsymbol{y_{pred}}$ from the model: $\boldsymbol{y_{pred}} \sim p(\boldsymbol{y} \mid \boldsymbol{\Theta_{pred}})$
It is helpful to compute summary statistics of the simulated data $t(\boldsymbol{y_{pred}})$ in order to evaluate whether prior model predictions are consistent with domain expertise. Histograms can be used to display the distribution of these summary statistics (see Figure \@ref(fig:figPriorPredCh)).
This may rapidly show whether the data falls within an expected range or whether a sizable number of \index{Extreme data} extreme data points are expected a priori.
Extreme values, for instance, might be reading times less than $50$ ms or more than $2000$ ms in a study utilizing self-paced reading times.
While reading times longer than $2000$ ms for a word are not impossible, they are unlikely and largely at odds with domain knowledge. Reading research has demonstrated that a tiny percentage of observations may truly take extreme values. However, if we find a substantial number of extreme data points in the hypothetical data and if these are at odds with domain expertise, then the model or the priors should be changed to produce hypothetical data that falls within the range of acceptable values.
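As a concrete illustration, the following sketch simulates prior predictive reading times from a simple lognormal model and computes, as a summary statistic, the proportion of a priori extreme values. The priors shown here ($\mu \sim \mathit{Normal}(6, 0.6)$ on the log-millisecond scale and $\sigma \sim \mathit{Normal}_+(0, 0.5)$) are hypothetical choices made only for this example.

```{r, eval = FALSE}
## Prior predictive simulation for reading times (in ms), assuming a
## lognormal likelihood; the priors below are hypothetical choices.
set.seed(42)
n_sim <- 10000
mu_pred <- rnorm(n_sim, mean = 6, sd = 0.6) # prior for the location (log ms)
sigma_pred <- abs(rnorm(n_sim, mean = 0, sd = 0.5)) # half-normal prior for sigma
rt_pred <- rlnorm(n_sim, meanlog = mu_pred, sdlog = sigma_pred)
## Summary statistic: proportion of a priori extreme reading times.
mean(rt_pred < 50 | rt_pred > 2000)
```

If this proportion turned out to be implausibly large, the priors (or the model) would be adjusted before any real data are analyzed.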
```{r figPriorPredCh, echo=FALSE, fig.height=2.5, fig.width=3*3, fig.cap="Prior predictive checks. a) In a first step, define a summary statistic that one wants to investigate. b) Second, define extremity thresholds (shaded areas), beyond which one does not expect a lot of data to be observed. c) Third, simulate prior model predictions for the data (histogram) and compare them with the extreme values (shaded areas).", message=FALSE, results="hide"}
c_light <- "#999999"
c_light_highlight <- "#B0ADAD"
c_mid <- "#444444"
c_mid_highlight <- "#666666"
c_dark <- "#000000"
c_dark_highlight <- "#333333"
set.seed(1234)
d <- data.frame(x = rnorm(250, 1, 1))
pPPC_A <- ggplot() +
stat_bin(data = d, aes(x = x, y = after_stat(density)), geom = "blank", bins = 30) +
geom_blank(aes(x = -3.5, y = 0.6), stat = "identity") + # , width = 3, fill = c_mid
geom_blank(aes(x = +5.5, y = 0.6), stat = "identity") + # , width = 3, fill = c_mid
geom_blank(aes(x = c(-2, -2), y = c(0, 0.6))) + # , size = 2, colour = c_dark
geom_blank(aes(x = c(+4, +4), y = c(0, 0.6))) + # , size = 2, colour = c_dark
# geom_line(aes(x=c(-3.5,-2),y=c(0.3,0.3)), arrow=arrow(length=unit(0.30,"cm"), ends="first", type = "closed"), colour=c_dark, size=2)+
# geom_line(aes(x=c(4,5.5),y=c(0.3,0.3)), arrow=arrow(length=unit(0.30,"cm"), ends="last", type = "closed"), colour=c_dark, size=2)+
labs(x = "Summary Statistic [t(y_pred)]") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pPPC_B <- ggplot() + # stat_bin(data=d, aes(x=x, y=after_stat(density))) +
geom_bar(aes(x = -3.5, y = 0.6), stat = "identity", width = 3, fill = c_mid) +
geom_bar(aes(x = +5.5, y = 0.6), stat = "identity", width = 3, fill = c_mid) +
geom_line(aes(x = c(-2, -2), y = c(0, 0.6)), linewidth = 2, colour = c_dark) +
geom_line(aes(x = c(+4, +4), y = c(0, 0.6)), linewidth = 2, colour = c_dark) +
geom_line(aes(x = c(-3.5, -2), y = c(0.3, 0.3)), arrow = arrow(length = unit(0.30, "cm"), ends = "first", type = "closed"), colour = c_dark, linewidth = 2) +
geom_line(aes(x = c(4, 5.5), y = c(0.3, 0.3)), arrow = arrow(length = unit(0.30, "cm"), ends = "last", type = "closed"), colour = c_dark, linewidth = 2) +
labs(x = "Summary Statistic [t(y_pred)]", y = "density") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
pPPC_C <- ggplot() +
stat_bin(data = d, aes(x = x, y = after_stat(density)), bins = 30) +
geom_bar(aes(x = -3.5, y = 0.6), stat = "identity", width = 3, fill = c_mid) +
geom_bar(aes(x = +5.5, y = 0.6), stat = "identity", width = 3, fill = c_mid) +
geom_line(aes(x = c(-2, -2), y = c(0, 0.6)), linewidth = 2, colour = c_dark) +
geom_line(aes(x = c(+4, +4), y = c(0, 0.6)), linewidth = 2, colour = c_dark) +
geom_line(aes(x = c(-3.5, -2), y = c(0.3, 0.3)), arrow = arrow(length = unit(0.30, "cm"), ends = "first", type = "closed"), colour = c_dark, linewidth = 2) +
geom_line(aes(x = c(4, 5.5), y = c(0.3, 0.3)), arrow = arrow(length = unit(0.30, "cm"), ends = "last", type = "closed"), colour = c_dark, linewidth = 2) +
labs(x = "Summary Statistic [t(y_pred)]") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
# quartz(width=3*3, height=2.5)
cowplot::plot_grid(pPPC_A, pPPC_B, pPPC_C, labels = c("a", "b", "c"), ncol = 3)
```
Selecting quality \index{Summary statistics} summary statistics is more art than science. However, the choice of summary statistics matters, since they indicate what information we want the model to take into account. As a result, they ought to be carefully selected and constructed in line with our expectations about the actual data-generating process and about the kinds of structures and effects we expect the data to display. One can also use summary statistics to criticize the model. For example, one can formalize criticism of an analysis into a summary statistic that one believes will exhibit undesirable behavior.
Selecting appropriate priors will be especially important when the data do not provide enough information through the likelihood to constrain the parameters (see Figure \@ref(fig:FigBayes), especially g-i). This frequently happens, for instance, in hierarchical models when a "maximal" model is fitted to a small data set that does not constrain the estimation of the variance and covariance parameters for all group-level effects.^[This issue shows up as convergence problems of the optimizer in frequentist methods (like the ones used in the `lme4` package), indicating that the likelihood is too flat and the parameter estimates are not constrained by the data.]
In such cases, a prior in a Bayesian analysis (or a more informative one instead of a relatively uninformative one) should incorporate just enough domain expertise to suppress extreme but not impossible parameter values.
Because the posterior is then sufficiently constrained, it may become possible to fit the model. Thus, by incorporating prior knowledge into the Bayesian computation, we can fit and understand models that frequentist tools are unable to estimate reliably.
More concentrated prior distributions are therefore a welcome side effect of adding more domain expertise (into what still constitutes weakly informative priors), and they can help with Bayesian computation. This makes it possible to estimate more complex models: models that could not otherwise be estimated with the tools at hand can be fitted thanks to the use of prior knowledge. Put another way, by utilizing prior knowledge, the iterative model-building process can bring us closer to the aspirational model. Moreover, MCMC algorithms will converge more quickly once more informative priors are provided.
(ref:FigBayes) The role of priors when data is informative or \index{Uninformative data} uninformative. a)-c) When the data offers good information through the likelihood (b), a flat uninformative prior (a) is sufficient to obtain a concentrated posterior (c). d)-f) When the data does not adequately constrain the parameters through the likelihood (e), using a flat \index{Uninformative prior} uninformative prior (d) results in a widely distributed posterior (f) (i.e., different combinations of parameters are equally plausible; this hints at the fact that the model is empirically not identifiable). g)-i) When the data does not constrain the parameter through the likelihood (h), adding domain knowledge through an \index{Informative prior} informative prior (g) can help constrain the posterior (i) to reasonable values.
```{r FigBayes, echo=FALSE, fig.width=6, fig.height=6, fig.cap="(ref:FigBayes)", message=FALSE, results="hide"}
FigBay <- function(corr) {
sigma <- matrix(c(1, corr, corr, 1), nrow = 2)
m <- c(0, 0)
data.grid <- expand.grid(s.1 = seq(-3, 3, length.out = 200), s.2 = seq(-3, 3, length.out = 200))
q.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m, sigma = sigma))
FigBay3c <- ggplot(q.samp, aes(x = s.1, y = s.2)) +
geom_raster(aes(fill = prob)) +
scale_fill_gradient(low = "white", high = c_dark, guide = "none") +
# scale_fill_gradient(guide = FALSE) +
labs(x = "", y = "") +
coord_fixed(xlim = c(-3, 3), ylim = c(-3, 3), ratio = 1) +
# papaja::theme_apa() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(colour = "black", linewidth = 4, fill = NA),
axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank()
)
return(FigBay3c)
}
sigma <- matrix(c(1, -0.9, -0.9, 1), nrow = 2)
m <- c(0, 0)
data.grid <- expand.grid(s.1 = seq(-3, 3, length.out = 200), s.2 = seq(-3, 3, length.out = 200))
q.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m, sigma = sigma))
maxQ <- max(q.samp$prob)
mINQ <- min(q.samp$prob)
s1 <- unique(q.samp$s.1)
q.samp$prob2 <- NA
for (i in 1:length(s1)) { # i <- 1
idxQ <- which(q.samp$s.1 == s1[i])
q.samp$prob2[idxQ] <- dnorm(q.samp$s.2[idxQ], -s1[i], 1)
}
q.samp$prob <- q.samp$prob2
FigBay2bc3b <- ggplot(q.samp, aes(x = s.1, y = s.2)) +
geom_raster(aes(fill = prob)) +
scale_fill_gradient(low = "white", high = c_dark, guide = "none") +
# scale_fill_gradient(guide = FALSE) +
labs(x = "", y = "") +
coord_fixed(xlim = c(-3, 3), ylim = c(-3, 3), ratio = 1) +
# papaja::theme_apa() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(colour = "black", linewidth = 4, fill = NA),
axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank()
)
FigBay1a2a <- FigBay(1) + labs(title = "Prior")
FigBay3c <- FigBay(-0.7) + labs(title = "Posterior")
FigBay1bc3a <- FigBay(0)
FigBay1b <- FigBay1bc3a + labs(title = "Likelihood")
FigBay1c <- FigBay1bc3a + labs(title = "Posterior")
FigBay3a <- FigBay1bc3a + labs(title = "Prior")
FigBay2b3b <- FigBay2bc3b + labs(title = "Likelihood")
FigBay2c <- FigBay2bc3b + labs(title = "Posterior")
cowplot::plot_grid(FigBay1a2a, FigBay1b, FigBay1c,
FigBay1a2a, FigBay2b3b, FigBay2c,
FigBay3a, FigBay2b3b, FigBay3c,
labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i"), ncol = 3
)
```
Adding more domain expertise to the prior also has important implications when computing Bayes factors (see chapter \@ref(ch-bf) on Bayes factors).
Crucially, the \index{Prior predictive distribution} prior predictive distribution, which describes the interaction between the prior and the likelihood, can be used to simulate prior predictive data. Mathematically, it computes an integral, that is, an average, over the possible (prior) parameter values. As previously mentioned (also refer to chapter \@ref(ch-compbda)), the prior predictive distribution is:
\begin{equation}
\begin{aligned}
p(\boldsymbol{y_{pred}}) &= \int p(\boldsymbol{y_{pred}}, \boldsymbol{\Theta}) \; d\boldsymbol{\Theta} = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta}\\ &= \int \mathrm{likelihood}(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) \cdot \mathrm{prior}(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta}
\end{aligned}
\end{equation}
As an illustration, let's say that we consider our likelihood to be a Normal distribution with mean $\mu$ and standard deviation $\sigma$. Assume that we now define $\sigma \sim \mathit{Uniform}(1,2)$ and $\mu \sim \mathit{Normal}(0,1)$ as priors on the parameters. The steps below can be used to create the prior predictive distribution:
- Perform the following 100,000 times:
    - Select one sample (m) from the distribution $\mathit{Normal}(0,1)$.
    - Take one sample (s) from a $\mathit{Uniform}(1,2)$ distribution.
    - Create and store a data point from a $\mathit{Normal}(m,s)$ distribution.
- The generated data constitute the prior predictive distribution (see the R sketch below).
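A minimal R sketch of these steps (written in vectorized form rather than as an explicit loop, which yields the same prior predictive distribution):

```{r, eval = FALSE}
## Prior predictive distribution for a Normal likelihood with priors
## mu ~ Normal(0, 1) and sigma ~ Uniform(1, 2).
set.seed(123)
n_sim <- 100000
mu_pred <- rnorm(n_sim, mean = 0, sd = 1)    # step 1: sample mu from its prior
sigma_pred <- runif(n_sim, min = 1, max = 2) # step 2: sample sigma from its prior
y_pred <- rnorm(n_sim, mean = mu_pred, sd = sigma_pred) # step 3: simulate data
## y_pred now contains samples from the prior predictive distribution:
hist(y_pred, breaks = 100)
```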
It is also possible to define more intricate generative processes involving data from repeated measures.
### \index{Computational faithfulness} Testing for correct posterior approximations: Checks of computational faithfulness
Approximations of posterior expectations can be inaccurate. For example, a computer program that is designed to sample from a posterior can be erroneous.
This may be due to an error in the likelihood specification (for example, an error in the R formula syntax) or to inadequate sampling of the entire posterior density.
The sampler may be biased, meaning the parameter samples might be systematically larger or smaller than those drawn from the exact posterior. Alternatively, the variance of the posterior samples may differ, being either larger or smaller than the variance of the exact posterior. However, posterior sampling from simple and standard models should work properly in most cases. Thus, we think that in many applications, a further check of computational faithfulness may be asking for too much, and might need to be performed only once for a given research program, where different experiments are rather similar to each other. However, checking computational faithfulness can become an important issue when dealing with more advanced/non-standard models (such as those discussed in the later chapters of this book). Here, errors in the specification of the likelihood can occur more easily.
Designing a process to verify whether the posterior approximation of choice is accurate is crucial because posterior approximations can be erroneous. For example, one should make sure that the software utilized to implement the sampling is error-free for the particular problem at hand.
This checking can be performed using \index{Simulation-based calibration} simulation-based calibration [SBC; @talts2018validating; @schad2019towardarXiv; @ModrakEtAl2023]. This is a computationally intensive procedure that can take a long time to run for particularly complex models and large data sets. We do not discuss SBC in detail here, but refer the reader to its later treatment in chapter \@ref(ch-MPT), where SBC is applied for models coded in Stan directly, as well as to the description in @schad2019towardarXiv [and also @ModrakEtAl2023].
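Although the details of SBC are deferred, its core logic can be sketched in a few lines. The toy example below uses a hypothetical conjugate model ($y \sim \mathit{Normal}(\mu, 1)$ with prior $\mu \sim \mathit{Normal}(0, 1)$), chosen so that exact posterior draws are available without running a sampler. If the posterior computation is faithful, the rank statistics computed below are uniformly distributed.

```{r, eval = FALSE}
## Core idea of SBC, illustrated with a toy conjugate model so that exact
## posterior draws are available: y ~ Normal(mu, 1), mu ~ Normal(0, 1).
set.seed(1)
n_sbc <- 1000 # number of SBC simulations
n_draws <- 99 # posterior draws per simulation
n_obs <- 10   # observations per simulated data set
ranks <- numeric(n_sbc)
for (s in 1:n_sbc) {
  mu_tilde <- rnorm(1, 0, 1)           # draw a "true" mu from the prior
  y_tilde <- rnorm(n_obs, mu_tilde, 1) # simulate data given mu_tilde
  ## Exact conjugate posterior: mean = sum(y) / (1 + n), variance = 1 / (1 + n)
  post_var <- 1 / (1 + n_obs)
  post_mean <- post_var * sum(y_tilde)
  mu_post <- rnorm(n_draws, post_mean, sqrt(post_var)) # "posterior draws"
  ranks[s] <- sum(mu_post < mu_tilde)  # rank of mu_tilde among the draws
}
## For a faithful posterior computation, the ranks are uniform on 0, ..., n_draws:
hist(ranks, breaks = seq(-0.5, n_draws + 0.5, by = 10))
```

In realistic applications, the exact posterior draws would be replaced by the draws produced by the sampler that is being validated.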
After confirming the accuracy and faithfulness of our posterior computations, we can move on to examine the sensitivity of the model analyses.
### \index{Model sensitivity} Sensitivity of the model
What can we reasonably expect from a model's posterior, and how can we determine whether these expectations are reasonable given the current configuration? First, we might expect that the posterior recovers, without \index{Bias} bias, the true values of the parameters that generated the data. In other words, when simulating hypothetical data based on a true parameter value (a point value for the parameter), we could anticipate that the posterior mean will be close to the true value. This expectation, however, might or might not be warranted for a particular model, experimental setup, and data set.
In fact, some models--such as non-linear models--may have biased parameter estimates, making it nearly impossible to determine the parameter's true value from the data. At the same time, we could expect that the posterior is very informative about the parameters that produced the data. In other words, in comparison to our prior knowledge, we could aim for a small posterior uncertainty, that is, a small posterior standard deviation.
But posterior certainty is not always high. Some experimental designs, models, priors, or data sets may produce extremely uninformative estimates, where the degree of uncertainty is not decreased relative to our prior knowledge. This may occur when there is a dearth of data or when the model is too complex for the experimental design, preventing us from constraining specific model parameters.
In order to examine model sensitivity, two model-related questions can be investigated:
1) How closely does the estimated posterior mean of a parameter match its true (simulated) value?
2) To what extent is uncertainty reduced between the posterior and the prior?
To investigate these questions, it is again possible to perform extensive simulation studies. (Often it might be sufficient to simulate a few data sets with different parameter/design settings, instead of running a full simulation.)
This is crucial to do for complex, non-standard, or cognitive models, but may be less important for simpler and more standard models.
Indeed, the same set of simulations that is used for SBC can also be used here. Therefore, both analyses can usefully be applied in tandem. We only sketch the two key summary quantities below, and refer the interested reader to @schad2019towardarXiv for details on how these computations can be implemented in full.
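The two questions above can be summarized by two quantities computed from the posterior samples of a single parameter, given the value used to simulate the data and the standard deviation of that parameter's prior: the posterior $z$-score and the posterior contraction [@schad2019towardarXiv]. The sketch below is schematic; `post`, `theta_true`, and `prior_sd` are placeholder names.

```{r, eval = FALSE}
## Hypothetical sketch: `post` holds posterior samples for one parameter,
## `theta_true` is the value used to simulate the data, and `prior_sd` is
## the standard deviation of that parameter's prior (all placeholders).
post_mean <- mean(post)
post_sd <- sd(post)
## (1) Posterior z-score: distance of the posterior mean from the truth,
##     in units of posterior standard deviations (addresses question 1).
z_score <- (post_mean - theta_true) / post_sd
## (2) Posterior contraction: how much the prior uncertainty is reduced by
##     the data (addresses question 2); values near 1 indicate strong
##     reduction, values near 0 indicate little learning from the data.
contraction <- 1 - post_sd^2 / prior_sd^2
```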
### \index{Posterior predictive check} Does the model adequately capture the data?--Posterior predictive checks
"*All models are wrong but some are useful.*" [@box1979robustness, p. 2].
We are aware that the observed data noisily reflect the \index{True data generating process} true data-generating process, which our model most likely does not fully capture. Therefore, we want to know to what extent our model accurately approximates the true process that produced the data. To compare the model against the actual data-generating process (i.e., against the data), we can simulate data from the model and compare the simulated data to the real data. This can be formulated via the posterior predictive distribution (refer to chapter \@ref(ch-compbda)): the model is fitted to the data, and new data are simulated using the estimated posterior model parameters.
The posterior predictive distribution can be expressed mathematically as follows:
\begin{equation}
p(\boldsymbol{y_{pred}} \mid \boldsymbol{y}) = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta} \mid \boldsymbol{y}) \; d \boldsymbol{\Theta}
\end{equation}
\noindent
Here, the posterior distribution over model parameters, $p(\boldsymbol{\Theta} \mid \boldsymbol{y})$, is inferred from the observed data, $\boldsymbol{y}$. To generate future data, $\boldsymbol{y_{pred}}$, the posterior distribution of
$\boldsymbol{\Theta}$ is combined with the distribution of the future data $\boldsymbol{y_{pred}}$ given $\boldsymbol{\Theta}$, $p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta})$. Averaging over various possible values for the posterior model parameters ($\boldsymbol{\Theta}$) is indicated by the integral $\int d \boldsymbol{\Theta}$.
As stated in chapter \@ref(ch-compbda), we are unable to evaluate this integral exactly. Since $\boldsymbol{\Theta}$ can be a vector with multiple parameters, this integral is extremely complex and lacks an analytical solution. However, sampling allows us to approximate it. We can first obtain samples from the posterior distribution of the parameters. Next, we can use these posterior samples to simulate new, artificial data from the model. This process approximates the posterior predictive distribution (and also removes the need to compute the exact value or integral of the posterior predictive density $p(\boldsymbol{y_{pred}} \mid \boldsymbol{y})$).
In summary, we first fit the model to the data to obtain the posterior, and then simulate new data using the estimated posterior model parameters. The crucial question is then how closely the new simulated data resembles the observed data.
One strategy for comparing the data and the model is to use important features of the data and gauge the model's ability to capture them. In fact, we already defined \index{Summary statistics} summary statistics for the prior predictive checks. Now that we have data simulated from the posterior predictive distribution, we can compute these summary statistics on them. Every summary statistic will then have a distribution. We also compute the summary statistic for the observed data. We can then determine whether the observed data falls within the distribution of the model predictions (see Figure \@ref(fig:FigPostPredCh)a) or whether the model predictions deviate substantially from the observed data (see Figure \@ref(fig:FigPostPredCh)b).
Descriptive adequacy is supported if the observed data closely match the posterior-predicted data. A substantial disparity might suggest three possibilities: (1) Our model may be overlooking important details about the processes we care about, in which case we need to apply our domain knowledge to further enhance the model. (2) Our model does not overlook important details, but there are a number of very low probability observations that were nonetheless produced by the process we modeled. (3) Our model might be missing details about less critical processes that are producing extreme observations (e.g., lapses in attention or errors in data collection). We might choose to address these issues or simply remove the problematic observations. Generally speaking, it is very difficult to distinguish between these three possibilities, so we must use our best judgment.
Specifically, we ought to modify the model only in situations where the discrepancy corresponds to a recognized missing model feature. Note that if we perform a lot of checks (e.g., per subject), it is not too surprising if a few of them show a substantial discrepancy.
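In practice, this comparison can be carried out with a few lines of code. The sketch below assumes a fitted `brms` model object named `fit` for a data frame `df` with dependent variable `rt`; these names are placeholders.

```{r, eval = FALSE}
library(brms)
## Posterior predictive data: one row per posterior draw, one column per observation.
y_rep <- posterior_predict(fit, ndraws = 500)
## Distribution of a summary statistic across the posterior predictive data sets,
## compared with the same statistic computed on the observed data:
t_rep <- apply(y_rep, 1, mean) # here, the mean serves as the summary statistic
t_obs <- mean(df$rt)
hist(t_rep)
abline(v = t_obs, lwd = 2)
## A built-in version of this check via brms/bayesplot:
pp_check(fit, type = "stat", stat = "mean")
```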
```{r FigPostPredCh, echo=FALSE, fig.height=2.5, fig.width=5, fig.cap="Posterior predictive checks. For a particular summary statistic, t(y), compare the posterior model predictions (histogram) with the observed data (vertical line). a) This illustrates a situation where the posterior model predictions (histogram) and the observed summary statistic (vertical line) coincide. b) This illustrates an instance in which the observed data's summary statistic (vertical line) falls outside of the model's posterior predictions (histogram).", message=FALSE, results="hide"}
set.seed(1234)
dx <- rnorm(250, 1, 1)
d <- data.frame(x = c(dx, dx), y = rep(c(1.4, 6), each = 250), cond = rep(c("a", "b"), each = 250))
ggplot(data = d, aes(x = x, y = after_stat(density))) +
stat_bin() +
geom_vline(aes(xintercept = y)) +
facet_wrap(~cond) +
labs(x = "Summary Statistic [t(y)]") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
```
## Further reading
Some important articles relating to developing a principled Bayesian workflow are by @Betancourt:2018aa, @Gabry:2017aa, @gelman2020bayesian, and @talts2018validating. The `stantargets` R package provides tools for a systematic, efficient, and reproducible workflow [@stantargets]. Also recommended is the article on reproducible workflows by @wilson2017good.