# (PART) Model comparison {-}
# Introduction to model comparison {#ch-comparison}
A key goal of cognitive science is to decide which theory, among multiple candidates under consideration, accounts for the experimental data better. This can be accomplished by implementing the theories (or simplified versions of them) as Bayesian models and comparing their predictive power. This kind of model comparison is closely related to hypothesis testing. There are two Bayesian perspectives on model comparison: a \index{Prior predictive perspective} *prior* predictive perspective based on the \index{Bayes factor} Bayes factor using \index{Marginal likelihood} marginal likelihoods, and a \index{Posterior predictive perspective} *posterior* predictive perspective based on \index{Cross-validation} cross-validation. The main difference between the prior predictive approach (the Bayes factor) and the posterior predictive approach (cross-validation) is the following: The Bayes factor examines how well the model (prior and likelihood) explains the experimental data. By contrast, the posterior predictive approach assesses model predictions for held-out data after seeing most of the data.
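To anticipate the formal definition that will be developed later in this part of the book, the Bayes factor comparing two models $\mathcal{M}_1$ and $\mathcal{M}_2$ given data $y$ is the ratio of their marginal likelihoods, where each marginal likelihood averages the likelihood over the corresponding model's prior:
\begin{equation}
BF_{12} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_2)}, \quad \text{where} \quad p(y \mid \mathcal{M}_i) = \int p(y \mid \theta, \mathcal{M}_i) \, p(\theta \mid \mathcal{M}_i) \, d\theta
\end{equation}
Because the prior $p(\theta \mid \mathcal{M}_i)$ appears directly inside this integral, the Bayes factor depends on the priors in a way that the posterior predictive approach does not; we return to this point below.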
## Prior predictive vs. posterior predictive model comparison
In the Bayes factor approach, \index{Predictive accuracy} predictive accuracy is based only on the model's \index{Prior predictive distribution} prior predictive distribution: the prior model predictions are used to evaluate the support that the data give to the model. By contrast, in cross-validation, the model is fit to a large subset of the data (i.e., the \index{Training data} training data). The posterior distributions of the parameters of this fitted model are then used to make predictions for the held-out or \index{Validation data} validation data, and \index{Model fit} model fit is assessed on this subset of the data. Typically, this process is repeated several times, until every subset of the data has served as held-out data. This approach attempts to assess whether the model will generalize to truly new, \index{Unobserved data} unobserved data. Of course, the \index{Held-out data} held-out data is usually not "truly new" because it is part of the data that was collected, but at least it is data that the model has not been exposed to. That is, the predictive accuracy of cross-validation methods is based on how well the \index{Posterior predictive distribution} posterior predictive distribution that is fit to most of the data (i.e., the training data) characterizes \index{Out-of-sample data} out-of-sample data (i.e., the test or held-out data). Notice that one could in principle apply the posterior predictive approach to truly new data, by repeating the experiment with new subjects and then treating the new data as held-out data.
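To make the posterior predictive perspective concrete, the following is a minimal sketch of how cross-validation is typically carried out with `brms`; the fitted model object `fit_full` and the data set `df_eeg` are placeholders for illustration, standing in for any hierarchical model fit with `brm()`.

```{r, eval = FALSE}
library(brms)
# `fit_full` is assumed to be a model previously fit with brm(), e.g.,
# fit_full <- brm(n400 ~ c_cloze + (c_cloze | subj), data = df_eeg)

# Exact 10-fold cross-validation: the model is refit 10 times, each time
# leaving out a different tenth of the data and evaluating the posterior
# predictions on that held-out tenth.
kfold_full <- kfold(fit_full, K = 10)

# Approximate leave-one-out cross-validation (PSIS-LOO), which avoids
# refitting the model for every held-out observation:
loo_full <- loo(fit_full)

# Two fitted models can then be compared on their estimated expected
# log predictive density, e.g.:
# loo_compare(loo_full, loo_null)
# (where `loo_null` would be the corresponding object for a competing model)
```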
Coming back to Bayes factors, the prior predictive distribution is obviously highly sensitive to the priors: it evaluates the probability of the observed data under prior assumptions. By contrast, the posterior predictive distribution is less dependent on the priors because the priors are combined with the likelihood (and are thus less influential, given sufficient data) before making predictions for held-out validation data.
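One way to see this sensitivity directly is to generate data from the prior predictive distribution. In `brms`, setting `sample_prior = "only"` makes the model ignore the likelihood of the data, so that predictions from the fitted object are prior predictive data. The sketch below is only an illustration: the data frame `df_eeg` and the specific priors are placeholders.

```{r, eval = FALSE}
library(brms)
# Fit the model without conditioning on the data; the draws are then
# draws from the priors, and predictions from this object constitute
# prior predictive data.
fit_prior <- brm(n400 ~ c_cloze + (c_cloze | subj),
                 data = df_eeg,
                 prior = c(prior(normal(0, 10), class = Intercept),
                           prior(normal(0, 10), class = b),
                           prior(normal(0, 50), class = sigma),
                           prior(normal(0, 20), class = sd)),
                 sample_prior = "only")
# Each row is one simulated data set; widening or narrowing the priors
# above can change these predictions (and hence the Bayes factor) a lot.
prior_pred <- posterior_predict(fit_prior)
```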
Jaynes [-@jaynes2003probability, Chapter 20] compares these two perspectives to those of "a cruel realist" and "a fair judge". According to Jaynes, the Bayes factor adopts the posture of the cruel realist, who "judge[s] each model taking into account the prior information we actually have pertaining to it; that is, we penalize a model if we do not have the best possible prior information about its parameters, although that is not really a fault of the model itself." By contrast, cross-validation adopts the posture of a scrupulously fair judge, "who insists that fairness in comparing models requires that each is delivering the best performance of which it is capable, by giving each the best possible prior probability for its parameters (similarly, in Olympic games we might consider it unfair to judge two athletes by their performance when one of them is sick or injured; the fair judge might prefer to compare them when both are doing their absolute best)."
## Some important points to consider when comparing models
Regardless of whether we use the Bayes factor, cross-validation, or any other method for model comparison, there are several important points that one should keep in mind:
1. Although the objective of model comparison might ultimately be to find out which of the models under consideration generalizes better, this generalization can only be done well within the range of the observed data [see @VehtariLampinen2002; @VehtariOjanen2012]. That is, if one hypothesis, implemented as the model $\mathcal{M}_1$, is shown to be superior to a second hypothesis, implemented as the model $\mathcal{M}_2$, according to the Bayes factor and/or cross-validation, and evaluated with a young western university student population, this doesn't mean that $\mathcal{M}_1$ will be superior to $\mathcal{M}_2$ when it is evaluated with a broader population [and in fact it seems that it often won't be; see @henrich_heine_norenzayan_2010]. However, if we can't generalize even within the range of the observed data (e.g., university students in the northern part of the western hemisphere), there is no hope of generalizing outside of that range (e.g., non-university students). @navarroDevilDeepBlue2018 argues that one of the most important functions of a model is to encourage directed exploration of new territory; our view is that this makes sense only if historical data can also be accounted for. In practice, what this means for us is that evaluating a model's performance should be carried out using \index{Historical benchmark data} historical benchmark data in addition to any new data one has; just using isolated pockets of new data to evaluate a model is not convincing. For examples from psycholinguistics of model evaluation using historical benchmark data, see @EngelmannJaegerVasishth2019, @NicenboimPreactivation2019, and @YadavetalJML2022.
2. Model comparison can provide a quantitative way to evaluate models, but this cannot replace understanding the qualitative patterns in the data [see, e.g., @navarroDevilDeepBlue2018]. A model can provide a good fit by behaving in a way that contradicts our substantive knowledge. For example, @lisson_2020 examine two computational models of sentence comprehension. One of the models yielded higher predictive accuracy when the parameter that is related to the probability of correctly comprehending a sentence was higher for impaired subjects (individuals with aphasia) than for the control population. This contradicts \index{Domain knowledge} domain knowledge---impaired subjects are generally observed to show worse performance than unimpaired control subjects---and led to a re-evaluation of the model.
3. Model comparison is based on finding the most "useful" model for characterizing our data, but neither the Bayes factor nor cross-validation (nor any other method that we are aware of) guarantees selecting the \index{Model closest to the truth} model closest to the truth (even with enough data). This is related to our previous point: a model that is closest to the true data-generating process is not guaranteed to produce the best (prior or posterior) predictions, and a model with a clearly wrong data-generating process is not guaranteed to produce poor (prior or posterior) predictions. See @WangGelman2014difficulty for an example with cross-validation, and @navarroDevilDeepBlue2018 for a toy example with Bayes factors.
4. One should also check that the effect being modeled is estimated with high precision; if the estimate of the effect has high uncertainty (the posterior distribution of the target parameter is widely spread out), then any measure of model fit can be uninformative, because we don't have an accurate estimate of the effect of interest. In the Bayesian context, this implies that the posterior predictive distributions of the effects generated by the model should be theoretically plausible and reasonably constrained, and the target parameter of interest should be estimated with as much precision as possible; this in turn implies that we need sufficient data if we want to obtain precise estimates of the parameter of interest. What counts as sufficient will depend on the topic being studied.^[As an example from psycholinguistics, strong garden-path effects like those elicited by "The horse (that was) raced past the barn fell" [@bever1970cognitive] may be easy to detect with high precision with a relatively small number of subjects, but subtle effects such as local coherence [@taboretal04] will probably require a much larger sample size to detect with high precision [@lcpaape2024].] Later in this part of the book, we will discuss the adverse impact of imprecision in the data on model comparison (see section \@ref(sec-BFvar)). We will show that, in the face of low precision, we generally won't learn much from model comparison.
5. When comparing a \index{Null model} null model with an \index{Alternative model} alternative model, it is important to be clear about what the null model specification is. For example, in section \@ref(sec-mcvivs), we encountered the correlated varying intercepts and varying slopes model for the effect of cloze probability on the N400. The `brms` formula for the full model was:
`n400 ~ 1 + c_cloze + (1 + c_cloze | subj)`
The formal statement of this model is:
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma)
\end{equation}
If we want to test the null hypothesis that centered cloze has no effect on the dependent variable, one null model is:
`n400 ~ 1 + (1 + c_cloze | subj)` (Model `M0a`)
Formally, this would be stated as follows (the $\beta$ term is removed as it is assumed to be $0$):
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot u_{subj[n],2},\sigma)
\end{equation}
In model `M0a`, by-subject variability is allowed; just the population-level (or fixed) effect of centered cloze is assumed to be zero. This is called a \index{Nested model comparison} nested model comparison, because the null model is subsumed in the full model.
An alternative null model could remove only the \index{Varying slopes} varying slopes:
`n400 ~ 1 + c_cloze + (1 | subj)` (Model `M0b`)
Formally:
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot \beta,\sigma)
\end{equation}
Model `M0b`, which is also nested inside the full model, can be used to test a different null hypothesis than `M0a` above: is the between-subject variability in the centered cloze effect zero?
Yet another possibility is to remove both the population-level and group-level (or random) effects of centered cloze:
`n400 ~ 1 + (1 | subj)` (Model `M0c`)
Formally:
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1},\sigma)
\end{equation}
Model `M0c` is also nested inside the full model, but it now has two parameters missing instead of one: $\beta$ and $u_{subj[n],2}$. Usually, it is best to compare models by removing one parameter at a time; otherwise, one cannot be sure which parameter was responsible for our rejecting or accepting the null hypothesis. A sketch of how such nested comparisons could be set up in `brms` is shown below.
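The sketch that follows fits the full model and one null model (`M0a`); the other null models would be handled analogously. The priors and the data set name `df_eeg` are placeholders: for Bayes factor comparisons in particular, the priors require careful justification, and all parameters must be saved so that bridge sampling can be used.

```{r, eval = FALSE}
library(brms)
# Placeholder priors for the full model:
priors_full <- c(prior(normal(0, 10), class = Intercept),
                 prior(normal(0, 10), class = b),
                 prior(normal(0, 50), class = sigma),
                 prior(normal(0, 20), class = sd))
# M0a has no population-level slope, so it has no parameter of class b:
priors_M0a <- c(prior(normal(0, 10), class = Intercept),
                prior(normal(0, 50), class = sigma),
                prior(normal(0, 20), class = sd))

fit_full <- brm(n400 ~ 1 + c_cloze + (1 + c_cloze | subj),
                data = df_eeg, prior = priors_full,
                save_pars = save_pars(all = TRUE))
fit_M0a <- brm(n400 ~ 1 + (1 + c_cloze | subj),
               data = df_eeg, prior = priors_M0a,
               save_pars = save_pars(all = TRUE))

# Prior predictive perspective: Bayes factor via bridge sampling
# (in practice, many more posterior draws are advisable for stable estimates).
bayes_factor(fit_full, fit_M0a)

# Posterior predictive perspective: PSIS-LOO cross-validation.
loo_compare(loo(fit_full), loo(fit_M0a))
```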
## Further reading
@rp and @pittWhenGoodFit2002 argue for the need to go beyond "a good fit" (which, in the context of Bayesian data analysis, corresponds to a good posterior predictive check), and for model comparison with a focus on measuring the generalizability of a model. @navarroDevilDeepBlue2018 deals with the problematic aspects of model selection in the context of the psychological literature and cognitive modeling. Fabian Dablander's blog post, https://fabiandablander.com/r/Law-of-Practice.html, shows a very clear comparison between the Bayes factor and PSIS-LOO-CV. @rodriguez2021and provides JAGS code for fitting models with spike-and-slab priors. Fabian Dablander has a comprehensive blog post on how to implement a Gibbs sampler in R when using such a prior: https://fabiandablander.com/r/Spike-and-Slab.html. @YadavetalJML2022 use $17$ different data sets for model comparison with cross-validation, holding out each data set successively; this is an example of evaluating the predictive performance of a model on truly new data.