# Model comparison - Extended
## Credible intervals should not be used to reject a null hypothesis {#app-null}
Researchers often incorrectly use \index{Credible intervals for null hypothesis testing} credible intervals for null hypothesis testing, that is, to test whether a parameter $\beta$ is zero or not. A common approach is to check whether zero is included in the 95\% credible interval for the parameter $\beta$: if it is, the null hypothesis that the effect is zero is accepted; if zero falls outside the interval, the null is rejected. For example, in a tutorial paper that two of the authors of this book wrote [@NicenboimVasishth2016], we incorrectly suggested that the credible interval can be used to reject the hypothesis that $\beta$ is zero. This is generally not the correct approach.
The problem with this approach is that it is a heuristic that will work in some cases and be misleading in others [for an example, see @SampleSizeCBB2021]. Unfortunately, the conditions under which this heuristic works or fails are not well-defined.
Why is the credible-interval approach only a \index{Heuristic} heuristic?
One line of (generally incorrect) reasoning that justifies looking at the overlap between credible intervals and zero is based on the fact that the most likely values of $\beta$ lie within the 95% credible interval.^[Strictly speaking, this is true only for a highest density interval (HDI), that is, a credible interval in which every point inside the interval has a higher probability density than any point outside it. However, when posterior distributions are symmetrical, these intervals are virtually identical to the equal-tail intervals we use in this book.] This entails that if zero is outside the interval, it must have a low probability density. This is true, but it is also meaningless: regardless of where zero (or any other point value) lies, it will have a probability mass of exactly zero, since we are dealing with a continuous distribution.
The lack of overlap doesn't tell us how much posterior probability the null model has.
A partial solution could be to look at a probability interval close to zero rather than at zero itself (e.g., an interval of, say, $-2$ to $2$ ms in a response time experiment), so that we obtain a non-zero probability mass. While the lack of overlap would be slightly more informative, excluding a small interval can be problematic when the prior probability mass of that interval is very small to begin with (as was the case with the regularizing priors we assigned to our parameters). @rouder2018bayesian show that if prior probability mass is added to the point value zero using a \index{Spike-and-slab prior} *spike-and-slab* prior (or to the small interval close to zero, if one considers that interval equivalent to the null model), checking whether zero is in the 95% credible interval is analogous to the Bayes factor. Unfortunately, the *spike-and-slab* prior cannot be implemented in Stan, because it relies on a discrete parameter. However, other programming tools (such as PyMC3, JAGS, or Turing) can be used if such a prior needs to be fit; see the further readings at the end of the chapter.
Rather than looking at the overlap of the 95% credible interval with zero, we might be tempted to conclude that there is evidence for an effect because the probability that a parameter is positive is high, that is, $P(\beta > 0) \gg 0.5$. However, the same logic as in the previous paragraph renders this meaningless: given that the probability mass of a point value, $P(\beta = 0)$, is zero, all we can conclude from $P(\beta > 0) \gg 0.5$ is that $\beta$ is very likely to be positive rather than negative; we cannot make any assertion about whether $\beta$ is exactly zero.
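As a concrete illustration of these two quantities, the following minimal sketch computes them from hypothetical posterior samples of $\beta$ (here simply drawn from a normal distribution; in a real analysis they would be the posterior draws of the parameter from a fitted model):

```{r}
# Hypothetical posterior samples of a coefficient beta (in ms); these are a
# stand-in for draws extracted from a fitted model.
beta_samples <- rnorm(4000, mean = 5, sd = 4)
# Posterior probability mass in a small interval around zero (-2 to 2 ms):
mean(beta_samples > -2 & beta_samples < 2)
# Posterior probability that beta is positive:
mean(beta_samples > 0)
```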
As we saw, the main problem with these heuristics is that they ignore that the \index{Null model} null model is a separate hypothesis. In many situations, the null hypothesis may not be of interest, and it might be perfectly fine to base our conclusions on credible intervals or on $P(\beta > 0)$. The problem arises when these heuristics are used to provide \index{Evidence in favor or against the null hypothesis} evidence in favor of or against the null hypothesis. If one wants to argue about the evidence in favor of or against a null hypothesis, Bayes factors or cross-validation will be needed; these are discussed in the next two chapters.
How can credible intervals be used sensibly? The \index{Region of practical equivalence} region of practical equivalence \index{ROPE} (ROPE) approach [@spiegelhalter1994bayesian; @Freedman1984; and, more recently, @kruschke2018bayesian; @kruschke2014doing] is a reasonable alternative to hypothesis testing and to arguing for or against a null. This approach is related to the spike-and-slab discussion above. In the ROPE approach, one defines, before the data are seen, a range of predicted values for a target parameter. Of course, there has to be a principled justification for choosing this range a priori; an example of a principled justification would be the prior predictions of a computational model. Then, the overlap (or lack thereof) between this predicted range and the observed credible interval can be used to infer whether the estimates are consistent (or partly consistent) with the predicted range. Here, we are not ruling out any null hypothesis, and we are not using the credible interval to make a decision like "the null hypothesis is true/false."
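A minimal sketch of the ROPE idea follows; the predicted range and the posterior samples are hypothetical and chosen only for the example:

```{r}
# Hypothetical a priori predicted range for the effect (e.g., derived from
# the prior predictions of a computational model), in milliseconds:
rope <- c(10, 40)
# Hypothetical posterior samples of the effect:
beta_samples <- rnorm(4000, mean = 25, sd = 10)
# 95% (equal-tail) credible interval of the posterior:
quantile(beta_samples, probs = c(0.025, 0.975))
# Proportion of the posterior mass that falls inside the predicted range:
mean(beta_samples >= rope[1] & beta_samples <= rope[2])
```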
There is one situation where credible intervals could arguably be used to carry out a null hypothesis test. When priors are flat, credible intervals can show frequentist properties, making it reasonable to check whether zero falls within the credible interval. For example, @newall2023evaluation use credible intervals as confidence intervals after calibration. They explicitly verify that 5% of the 95% credible intervals exclude zero when no effect exists. When using such an approach, a verification step would be necessary. We don't discuss this approach any further because our aim in this part of the book is not to derive frequentist statistics from Bayesian analysis, but to use Bayesian methods for obtaining posterior probabilities and Bayes factors, focusing on Bayesian hypothesis testing.
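The following sketch illustrates the kind of verification step we mean, under strong simplifying assumptions (a normal model with known standard deviation and a flat prior on the mean, for which the 95% credible interval has a closed form and coincides with the frequentist confidence interval); it is only meant to illustrate the idea, not to reproduce the procedure of @newall2023evaluation:

```{r}
# Simulate many data sets under the null (true mean = 0), compute the 95%
# credible interval of the mean under a flat prior (closed-form posterior:
# Normal(ybar, sigma / sqrt(n)) when sigma is known), and check how often
# zero is excluded. Under these assumptions, this should be close to 5%.
n_sim <- 10000
n <- 30
sigma <- 1
excludes_zero <- replicate(n_sim, {
  y <- rnorm(n, mean = 0, sd = sigma)
  lower <- mean(y) - qnorm(0.975) * sigma / sqrt(n)
  upper <- mean(y) + qnorm(0.975) * sigma / sqrt(n)
  lower > 0 | upper < 0
})
mean(excludes_zero)
```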
## The likelihood ratio vs the Bayes factor {#app-likR}
The \index{Likelihood ratio test} likelihood ratio test is a closely related, but \index{Frequentist approach} frequentist, approach to model comparison and hypothesis testing, which also compares the likelihood of the data under two different models. We show it here to highlight the similarities and differences between frequentist and Bayesian hypothesis testing. In contrast to the Bayes factor, the likelihood ratio test depends on the "best" (i.e., the maximum likelihood) estimate of the model parameter(s): the model parameter $\theta$ occurs on the right side of the semi-colon in the equation for each likelihood. (An aside: we do not use a \index{Conditional statement} conditional statement, i.e., the vertical bar, when talking about likelihood in the frequentist context; instead, we use a semi-colon. This is because the statement $f(y\mid \theta)$ is a conditional statement, implying that $\theta$ has a probability density function associated with it; in the frequentist framework, parameters cannot have a pdf associated with them; they are assumed to have fixed, point values.)
\begin{equation}
LikRat = \frac{P(\boldsymbol{y} ; \boldsymbol{\hat{\Theta}_1}, \mathcal{M}_1)}{P(\boldsymbol{y} ; \boldsymbol{\hat{\Theta}_2}, \mathcal{M}_2)}
\end{equation}
That means that in the likelihood ratio test, each model is evaluated on its ability to explain the data using this "best" estimate of the model parameter (here, the \index{Maximum likelihood estimate} maximum likelihood estimate $\hat{\theta}$). That is, the likelihood ratio test reduces the full range of possible parameter values to a \index{Point value} point value, leading to overfitting the model to the maximum likelihood estimate \index{MLE} (MLE). If the MLE badly misestimates the true value of the parameter (a point value), due to \index{Type M error} Type M error [@gelmancarlin], we could end up with a "significant" effect that is just a consequence of this misestimation [it will not be consistently replicable; see @VasishthMertzenJaegerGelman2018 for an example]. By contrast, the Bayes factor involves range hypotheses, which are implemented via \index{Integral} integrals over the model parameter; that is, it uses marginal likelihoods, which average the likelihood over the prior distribution of the model parameter(s). Thus, if, due to Type M error, the best point estimate (the MLE) for the model parameter(s) is not very representative of the possible values of the model parameter(s), then Bayes factors will be superior to the frequentist likelihood ratio test (see exercise \@ref(exr:bf-logn)). An additional difference, of course, is that Bayes factors rely on priors for estimating each model's parameter(s), whereas the frequentist likelihood ratio test does not (and cannot) consider priors in the estimation of the best-fitting model parameter(s). As we show in this chapter, this has far-reaching consequences for Bayes factor-based model comparison.
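As a simple illustration of the likelihood ratio (not of the Bayes factor), the following sketch compares two nested linear models fit by maximum likelihood on simulated data; the data and variable names are hypothetical and serve only to show the computation:

```{r}
# Simulate data with a small true effect of a predictor x:
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n, sd = 2)
# M2: intercept-only model; M1: model including the predictor.
m2 <- lm(y ~ 1)
m1 <- lm(y ~ x)
# Likelihood ratio evaluated at the maximum likelihood estimates:
exp(as.numeric(logLik(m1)) - as.numeric(logLik(m2)))
```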
## Approximation of the (expected) log predictive density of a model without integration {#app-integral}
To compare models based on their predictive accuracy, we often use the expected log predictive density ($elpd$), which evaluates how well a model's predictions align with likely future data. In previous sections, we introduced the idea that the $elpd$ can be calculated by integrating over all possible future data, weighting predictions by their likelihood under the true data-generating process. However, because the true data-generating distribution, $p_t$, is unknown, we instead use the observed data distribution as a proxy. This allows us to approximate the $elpd$ by summing the posterior predictive density of our observed data points, assuming they reflect the distribution of future data.
As an example, imagine that there are $N$ observations in an experiment. Suppose also that the \index{True generative process} true generative process (which is normally unknown to us) is a \index{Beta distribution} Beta distribution:
\begin{equation}
p_t(y) = \mathit{Beta}(y | 1, 3)
\end{equation}
Set $N$ and observe some simulated data $y$:
```{r}
# Number of observations:
N <- 10000
# Simulate data from the true distribution, Beta(1, 3):
y_data <- rbeta(N, 1, 3)
head(y_data)
```
Let's say that we fit the Bayesian model $\mathcal{M}_{1}$, and somehow, after obtaining the posterior distribution, we are able to derive the analytical form of the model's posterior predictive distribution:
\begin{equation}
p(y_{pred} | y, \mathcal{M}_1) = \mathit{Beta}(y_{pred} | 2, 2)
\end{equation}
This distribution tells us how likely different future observations are, and it also entails that future observations will be bounded by $0$ and $1$. (Any observation outside this range will have a probability density of zero.)
Imagine that we could know the true distribution of the data, $p_t$, which is conveniently close to our posterior predictive distribution. This means that Equation \@ref(eq:elpd), repeated below, is simple enough, and we know all its terms:
\begin{equation}
elpd = u(\mathcal{M}_1) = \int_{y_{pred}} p_t(y_{pred}) \log p(y_{pred} \mid y, \mathcal{M}_1)\, dy_{pred}
\end{equation}
We can compute this quantity in R. Notice that we don't introduce the data at any point. However, the data had to be used when `p`, the posterior predictive distribution, was derived; we skipped that step here.
```{r}
# True distribution:
p_t <- function(y) dbeta(y, 1, 3)
# Predictive distribution:
p <- function(y) dbeta(y, 2, 2)
# Integration:
integrand <- function(y) p_t(y) * log(p(y))
integrate(f = integrand, lower = 0, upper = 1)$value
```
Because we will never know `p_t`, this integral has to be approximated using the data, `y_data`; that is, the integral can be approximated without any reference to `p_t`; see Equation \@ref(eq:elpdapprox):
```{r}
# Approximation of the elpd based on the observed data:
1 / N * sum(log(p(y_data)))
```
The main problem with this approach is that we are using `y_data` twice: once to derive `p`, the posterior predictive distribution, and once to approximate the $elpd$. We'll see that cross-validation approaches rely on deriving the posterior predictive distribution with part of the data, and estimating the approximation to the $elpd$ with unseen data. (Don't worry that we don't know the analytical form of the posterior predictive distribution: we saw that we can generate samples from that distribution based on the distribution we use as the likelihood and our posterior samples.)
## The cross-validation algorithm for the expected log predictive density of a model {#app-CV-alg}
Here we spell out the Bayesian \index{Cross-validation algorithm} cross-validation algorithm in detail:
1. Split the data pseudo-randomly into $K$ held-out or validation sets $D_k$ (where $k=1,\dots,K$) that are a fraction of the original data, and $K$ training sets, $D_{-k}$. The length of the held-out data vector $D_k$ is approximately $1/K$-th the size of the full data set. It is common to use $K=10$ for $K$-fold CV. For LOO-CV, $K$ is set to the number of observations.
2. Fit $K$ models using each of the $K$ training sets, and obtain posterior distributions $p_{-k} (\Theta) = p(\Theta\mid D_{-k})$, where $\Theta$ is the vector of model parameters.
3. Each posterior distribution $p(\Theta\mid D_{-k})$ is used to compute the predictive accuracy (calculated as $\widehat{elpd}$) for each held-out data point $y_n$ in the vector $D_{k}$:
\begin{equation}
\widehat{elpd}_n = \log p(y_n \mid D_{-k}) \text{ with } y_n \in D_k
\end{equation}
Given that the posterior distribution $p(\Theta\mid D_{-k})$ is summarized by $S$ samples, the log predictive density for each data point $y_n$ in a data vector $D_k$ can be approximated as follows:
\begin{equation}
\widehat{elpd}_n = \log \left(\frac{1}{S} \sum_{s=1}^S p(y_n\mid \Theta^{k,s})\right)
(\#eq:pwkfold)
\end{equation}
where $\Theta^{k,s}$ corresponds to sample $s$ of the posterior of the model fit to the training set $D_{-k}$.
4. We obtain the $elpd_{kfold}$ (or $elpd_{loo}$) for all the held-out data points by summing up the $\widehat{elpd}_n$ (a numerical sketch of this and the previous step appears after this list):
\begin{equation}
elpd_{kfold} = \sum_{n=1}^N \widehat{elpd}_n
(\#eq:totalkfold)
\end{equation}
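To make Equations \@ref(eq:pwkfold) and \@ref(eq:totalkfold) concrete, here is a minimal numerical sketch. It assumes that we already have, for each held-out data point, the log density of that point evaluated at each of the $S$ posterior samples of the corresponding training fit; the matrix `log_lik_heldout` below is hypothetical and filled with simulated numbers rather than values from a real model:

```{r}
S <- 4000        # number of posterior samples per training fit
n_points <- 50   # number of held-out data points in total
# Hypothetical S x n_points matrix of log p(y_n | Theta^{k, s}): one column
# per held-out point, one row per posterior sample of the relevant training fit.
log_lik_heldout <- matrix(rnorm(S * n_points, mean = -1, sd = 0.2), nrow = S)
# Equation (pwkfold): elpd_n = log( (1/S) * sum_s p(y_n | Theta^{k, s}) ),
# computed stably on the log scale (log-mean-exp):
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))
elpd_hat <- apply(log_lik_heldout, 2, log_mean_exp)
# Equation (totalkfold): sum over all held-out data points:
sum(elpd_hat)
```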
The \index{Standard deviation of the sampling distribution} standard deviation of the sampling distribution (the \index{Standard error} standard error) can be computed by multiplying the standard deviation (or square root of variance) of the $N$ components by $\sqrt{N}$. Letting $\widehat{ELPD}$ be the vector $\widehat{elpd}_1,\dots,\widehat{elpd}_N$, the standard error is computed as follows:
\begin{equation}
se(\widehat{elpd}) = \sqrt{N \mathit{Var}(\widehat{ELPD})}
(\#eq:sekfold)
\end{equation}
The difference between the $elpd_{kfold}$ of two competing models, $\mathcal{M}_1$ and $\mathcal{M}_2$, is a measure of \index{Relative predictive performance} relative predictive performance. The standard error of their difference can be computed using the formula discussed in @vehtariPracticalBayesianModel2017:
\begin{equation}
se(\widehat{elpd}_{\mathcal{M}_1} - \widehat{elpd}_{\mathcal{M}_2}) = \sqrt{N \mathit{Var}(\widehat{ELPD}_{\mathcal{M}_1} - \widehat{ELPD}_{\mathcal{M}_2})}
(\#eq:sekfolddiff)
\end{equation}
\end{equation}
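Continuing the sketch above, Equations \@ref(eq:sekfold) and \@ref(eq:sekfolddiff) can be computed directly from the vectors of pointwise $\widehat{elpd}_n$ values of two competing models; the vectors `elpd_hat_m1` and `elpd_hat_m2` below are hypothetical:

```{r}
# Hypothetical pointwise elpd values for two models over the same data points:
elpd_hat_m1 <- rnorm(50, mean = -1, sd = 0.2)
elpd_hat_m2 <- rnorm(50, mean = -1.1, sd = 0.2)
n_points <- length(elpd_hat_m1)
# Standard error of one model's elpd (Equation sekfold):
sqrt(n_points * var(elpd_hat_m1))
# Difference in predictive performance and its standard error (sekfolddiff):
sum(elpd_hat_m1) - sum(elpd_hat_m2)
sqrt(n_points * var(elpd_hat_m1 - elpd_hat_m2))
```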