# Exercises
## Introduction {#sec-Foundationsexercises}
### Practice using the `pnorm()` function--Part 1 {#exr:FoundationsexercisespnormPart1 .unlisted}
Given a normal distribution with mean 500 and standard deviation 100, use the `pnorm()` function to calculate the probability of obtaining values between 200 and 800 from this distribution.
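As a hint on the mechanics (illustrated with a standard normal rather than the distribution in the exercise), the probability of landing in an interval is the difference between two `pnorm()` values:

```{r}
## P(a < X < b) = pnorm(b, mean, sd) - pnorm(a, mean, sd).
## Illustrated with a standard normal, not the exercise's parameters:
pnorm(1.96, mean = 0, sd = 1) - pnorm(-1.96, mean = 0, sd = 1)
```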
### Practice using the `pnorm()` function--Part 2 {#exr:FoundationsexercisespnormPart2 .unlisted}
```{r echo=FALSE}
mu <- round(rnorm(1, mean = 51, sd = 2), digits = 0)
sigma <- round(runif(1, min = 2, max = 4), digits = 0)
q <- round(runif(1, min = 40, max = 50))
q1 <- round(runif(1, min = 51, max = 60))
p1 <- round(pnorm(q, mean = mu, sd = sigma), digits = 3)
p2 <- round(1 - pnorm(q, mean = mu, sd = sigma), digits = 3)
p3 <- round(1 - pnorm(q1, mean = mu, sd = sigma), digits = 3)
```
Calculate the following probabilities.
Given a normal distribution with mean 800 and standard deviation 150, what is the probability of obtaining:
- a score of 700 or less
- a score of 900 or more
- a score of 800 or more
### Practice using the `pnorm()` function--Part 3 {#exr:FoundationsexercisespnormPart3 .unlisted}
Given a normal distribution with mean 600 and standard deviation 200, what is the probability of obtaining:
- a score of 550 or less.
- a score between 300 and 800.
- a score of 900 or more.
### Practice using the `qnorm()` function--Part 1 {#exr:FoundationsexercisesqnormPart1 .unlisted}
Consider a normal distribution with mean 1 and standard deviation 1.
Compute the lower and upper boundaries such that:
- the area (the probability) to the left of the lower boundary is 0.10.
- the area (the probability) to the left of the upper boundary is 0.90.
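As a reminder of the mechanics (again with a standard normal, not the distribution in the exercise), `qnorm()` inverts `pnorm()`: it returns the quantile whose left-tail area equals the given probability:

```{r}
lower <- qnorm(0.025, mean = 0, sd = 1)  ## about -1.96
upper <- qnorm(0.975, mean = 0, sd = 1)  ## about  1.96
## Sanity check: pnorm() recovers the tail areas:
pnorm(c(lower, upper), mean = 0, sd = 1)
```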
### Practice using the `qnorm()` function--Part 2 {#exr:FoundationsexercisesqnormPart2 .unlisted}
Given a normal distribution with mean 650 and standard deviation 125, there exist two quantiles, the lower quantile q1 and the upper quantile q2, that are equidistant from the mean 650, such that the \index{Area under the curve} area under the curve of the normal distribution between q1 and q2 is 80%. Find q1 and q2.
### Practice getting summaries from samples--Part 1 {#exr:Foundationsexercisessamples1 .unlisted}
Given data that is generated as follows:
```{r}
data_gen1 <- rnorm(1000, mean = 300, sd = 200)
```
Calculate the mean and variance of the sample, and find the lower quantile q1 and the upper quantile q2, with equal probability in each tail, such that the probability of observing a value between them is 80%.
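The sample-based analogues of the theoretical quantities are `mean()`, `var()`, and `quantile()`; a sketch on a fresh simulated sample (not the exercise's data):

```{r}
x <- rnorm(1000, mean = 0, sd = 1)
mean(x)
var(x)
## An equal-tailed 80% interval leaves 10% in each tail:
quantile(x, probs = c(0.1, 0.9))
```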
### Practice getting summaries from samples--Part 2. {#exr:Foundationsexercisessamples2 .unlisted}
This time we generate the data with a truncated normal distribution from the package `extraDistr`. The details of this distribution will be discussed later in section \@ref(sec-pupil) and in the online section \@ref(app-truncation), but for now we can treat it as an unknown generative process:
```{r}
library(extraDistr)
data_gen1 <- rtnorm(1000, mean = 300, sd = 200, a = 0)
```
Using the sample data, calculate the mean, variance, and the lower quantile q1 and the upper quantile q2, such that the probability of observing values between these two quantiles is 80%.
### Practice with a variance-covariance matrix for a bivariate distribution. {#exr:Foundationsexercisesvcov1 .unlisted}
Suppose that you have a bivariate distribution where one of the two random variables comes from a normal distribution with mean $\mu_X=600$ and standard deviation $\sigma_X=100$, and the other from a normal distribution with mean $\mu_Y=400$ and standard deviation $\sigma_Y=50$. The correlation $\rho_{XY}$ between the two random variables is $0.4$. Write down the variance-covariance matrix of this bivariate distribution as a matrix (with numerical values, not mathematical symbols), and then use it to generate $100$ pairs of simulated data points. Plot the simulated data such that the relationship between the random variables $X$ and $Y$ is clear. Generate two sets of new data ($100$ pairs of data points each) with correlation $-0.4$ and $0$, and plot these alongside the plot for the data with correlation $0.4$.
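One way to set this up (a sketch; `MASS::mvrnorm()` is one of several options, and `mvtnorm::rmvnorm()` would work equally well) is to build the variance-covariance matrix from the covariance $\rho_{XY}\sigma_X\sigma_Y$. Shown here with stand-in values, not the exercise's:

```{r}
library(MASS)
## Stand-in values (not the exercise's): sd_x = 10, sd_y = 5, rho = 0.5,
## so the covariance is rho * sd_x * sd_y = 25:
Sigma <- matrix(c(10^2, 25,
                  25,   5^2),
                nrow = 2, byrow = TRUE)
xy <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
plot(xy[, 1], xy[, 2], xlab = "X", ylab = "Y")
```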
## Introduction to Bayesian data analysis {#sec-BDAexercises}
### Deriving Bayes' rule {#exr:BDAexercisesDerivingBayes .unlisted}
Let $A$ and $B$ be two observable events. $P(A)$ is the probability that $A$ occurs, and $P(B)$ is the probability that $B$ occurs. $P(A|B)$ is the conditional probability that $A$ occurs given that $B$ has happened. $P(A,B)$ is the joint probability of $A$ and $B$ both occurring.
You are given the definition of conditional probability:
\begin{equation}
P(A|B)= \frac{P(A,B)}{P(B)} \hbox{ where } P(B)>0
\end{equation}
Using the above definition, and using the fact that $P(A,B)=P(B,A)$ (i.e., the probability of $A$ and $B$ both occurring is the same as the probability of $B$ and $A$ both occurring),
derive an expression for $P(B|A)$. Show the steps clearly in the derivation.
### Conjugate forms 1 {#exr:BDAexercisesConj1 .unlisted}
- Computing the general form of a PDF for a posterior
Suppose you are given data $k$ consisting of the number of successes in $n$ trials, coming from a $\mathit{Binomial}(n,\theta)$ distribution.
We define a $\mathit{Beta}(a,b)$ prior on the parameter $\theta$.
Write down the Beta distribution that represents the posterior, in terms of $a,b, n,$ and $k$.
- Practical application
We ask 10 yes/no questions from a subject, and the subject returns 0 correct answers. We assume a binomial likelihood function for these data. Also assume a $\mathit{Beta}(1,1)$ prior on the parameter $\theta$, which represents the probability of success. Use the result you derived above to write down the posterior distribution of the $\theta$ parameter.
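Once you have the analytical result, you can sanity-check it numerically: the likelihood times the prior, renormalized on a grid of $\theta$ values, should match the conjugate Beta density. A sketch with stand-in numbers ($k = 7$, $n = 10$, $\mathit{Beta}(2,2)$ prior) rather than the exercise's:

```{r}
theta <- seq(0.001, 0.999, by = 0.001)
unnorm <- dbinom(7, size = 10, prob = theta) * dbeta(theta, 2, 2)
grid_post <- unnorm / sum(unnorm * 0.001)
## Compare against the conjugate result Beta(a + k, b + n - k):
conj_post <- dbeta(theta, 2 + 7, 2 + 10 - 7)
max(abs(grid_post - conj_post))  ## close to 0
```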
### Conjugate forms 2 {#exr:BDAexercisesConj2 .unlisted}
Suppose that we perform independent trials until we get a success (e.g., a heads in a coin toss). For repeated coin tosses, observing T, T, H would correspond to $n=3$. The probability of success in each trial is $\theta$. The Geometric random variable, call it $X$, gives us the probability of getting the first success on trial $n$ as follows:
\begin{equation}
Prob(X=n)=\theta(1-\theta)^{ n-1}
\end{equation}
where $n=1,2,\dots$.
Let the prior on $\theta$ be $\mathit{Beta}(a,b)$, a beta distribution with parameters $a$,$b$.
The posterior distribution is a beta distribution with parameters $a^*$ and $b^*$. Determine these parameters in terms of $a$, $b$, and $n$.
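After deriving $a^*$ and $b^*$, you can check your answer numerically without being told it: compute the posterior on a grid and moment-match it to a Beta distribution (for a Beta with mean $m$ and variance $v$, $a = m\left(\frac{m(1-m)}{v}-1\right)$ and $b = (1-m)\left(\frac{m(1-m)}{v}-1\right)$). A sketch with stand-in values $n = 4$ and a $\mathit{Beta}(2,2)$ prior:

```{r}
theta <- seq(0.0005, 0.9995, by = 0.001)
unnorm <- theta * (1 - theta)^(4 - 1) * dbeta(theta, 2, 2)
post <- unnorm / sum(unnorm)
m <- sum(theta * post)
v <- sum((theta - m)^2 * post)
s <- m * (1 - m) / v - 1
c(a_star = m * s, b_star = (1 - m) * s)
```

The printed values should match your derived $a^*$ and $b^*$, evaluated at these stand-in numbers.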
### Conjugate forms 3 {#exr:BDAexercisesConj3 .unlisted}
Suppose that we have $n$ data points, $x_1,\dots,x_n$, drawn independently from an exponential distribution with parameter $\lambda$. The parameter of interest here (what we want to learn about from the data) is $\lambda$.
The \index{Exponential likelihood} exponential likelihood function is:
\begin{equation}
p(x_1,\dots,x_n | \lambda)=\lambda^n \exp (-\lambda \sum_{i=1}^n x_i )
\end{equation}
Starting with a Gamma prior distribution for $\lambda$ (see below), show that the posterior distribution for $\lambda$ is also a Gamma distribution. Provide formulas giving the posterior parameters $a^*, b^*$ in terms of the prior parameters $a, b$ and the data. Use the following facts about Gamma distributions.
The \index{Gamma distribution} Gamma distribution is defined in terms of the parameters $a$, $b$: $\mathit{Gamma}(a,b)$. In general, if there is a random variable $Y$ (where $y\geq 0$) that has a Gamma distribution as a PDF ($Y\sim \mathit{Gamma}(a,b)$), then:
\begin{equation}
\mathit{Gamma}(y | a,b)=\frac{b^a y^{a-1} \exp(-by)}{\Gamma(a)}
\end{equation}
The $\mathit{Gamma}(a,b)$ prior on the $\lambda$ parameter in the exponential distribution will be written:
\begin{equation}
\mathit{Gamma}(\lambda | a,b)=\frac{b^a \lambda^{a-1} \exp(-b\lambda)}{\Gamma(a)}
\end{equation}
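As a practical aside (not part of the derivation): in R this parameterization corresponds to `dgamma()` with `shape = a` and `rate = b` (not `scale`), which is worth double-checking because it is a common pitfall:

```{r}
a <- 3; b <- 2  ## arbitrary values
y <- 1.7        ## arbitrary test point
formula_val <- b^a * y^(a - 1) * exp(-b * y) / gamma(a)
all.equal(formula_val, dgamma(y, shape = a, rate = b))  ## TRUE
## The mean of a Gamma(a, b) is a / b:
integrate(function(y) y * dgamma(y, shape = a, rate = b), 0, Inf)$value
```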
### Conjugate forms 4 {#exr:BDAexercisesConj4 .unlisted}
- Computing the posterior
This is a contrived example. Suppose we are modeling the number of times that a speaker says the word "I" per day. This could be of interest if we are studying, for example, how self-oriented a speaker is. The number of times $x$ that the word is uttered over a particular time period (here, one day) can be modeled by a \index{Poisson distribution} Poisson distribution ($x=0,1,2,\dots$):
\begin{equation}
f(x\mid \theta) = \frac{\exp(-\theta) \theta^x}{x!} \hbox{ for } x=0,1,2,\dots
\end{equation}
where the rate $\theta$ is unknown, and the numbers of utterances of the target word on each day are independent given $\theta$.
As an aside: Given $n$ independent observations of a Poisson random variable with rate parameter $\theta$, the maximum-likelihood estimator (MLE) for $\theta$ turns out to be $\hat{\theta} = \frac{\sum_{i=1}^n x_i}{n}$. When we are talking about a particular sample of data, the maximum-likelihood estimate is computed using the formula for the estimator, $\frac{\sum_{i=1}^n x_i}{n}$, and is represented as $\bar{x}$.
We are told that the prior mean of $\theta$ is 100 and prior variance for $\theta$ is $225$. This information is based on the results of previous studies on the topic. We will use the $\mathit{Gamma}(a,b)$ density (see previous question) as a prior for $\theta$ because this is a conjugate prior to the Poisson distribution.
a) First, visualize the prior, a Gamma density prior for $\theta$ based on the above information.
[Hint: we know that for a Gamma density with parameters $a$, $b$, the mean is $\frac{a}{b}$ and the variance is $\frac{a}{b^2}$. Since we are given values for the mean and variance, we can solve for $a,b$, which gives us the Gamma density.]
```{r, echo=FALSE, eval=FALSE, fig.cap="\\label{fig1}The Gamma prior for the parameter theta."}
x <- 0:200
plot(x, dgamma(x, 10000 / 225, 100 / 225),
  type = "l", lty = 1,
  main = "Gamma prior", ylab = "density",
  cex.lab = 2, cex.main = 2, cex.axis = 2
)
```
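Solving the hint's two equations in code: $\mathrm{mean} = a/b$ and $\mathrm{variance} = a/b^2$ give $b = \mathrm{mean}/\mathrm{variance}$ and $a = \mathrm{mean} \times b$. With mean 100 and variance 225:

```{r}
b <- 100 / 225  ## b = mean / variance
a <- 100 * b    ## a = mean * b = 10000 / 225
c(a = a, b = b)
c(mean = a / b, variance = a / b^2)  ## recovers 100 and 225
```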
b) Next, derive the posterior distribution of the parameter $\theta$ up to proportionality, and write down the posterior distribution in terms of the parameters of a Gamma distribution.
- Practical application
Suppose we know that the numbers of "I" utterances from a particular individual over four days are $115, 97, 79, 131$. Use the result you derived above to obtain the posterior distribution; that is, write down the parameters (call them $a^*,b^*$) of the Gamma distribution representing the posterior distribution of $\theta$.
Plot the prior and the posterior distributions alongside each other.
Now suppose you get one new data point: 200. Using the posterior $\mathit{Gamma}(a^*,b^*)$ as your prior, write down the updated posterior (in terms of the updated parameters of the Gamma distribution) given this new data point. Add the updated posterior to the plot you made above.
### The posterior mean is a weighted mean of the prior mean and the MLE (Poisson-Gamma conjugate case) {#exr:BDAexercisesWeightedMean .unlisted}
The number of times an event happens per unit time can be modeled using a Poisson distribution, whose PMF is:
\begin{equation}
f(x\mid \theta) = \frac{\exp(-\theta) \theta^x}{x!}
\end{equation}
Suppose that we define a $\mathit{Gamma}(a,b)$ prior for the rate parameter $\theta$. It is a fact (see the exercises above) that the posterior of the $\theta$ parameter is a $\mathit{Gamma}(a^*,b^*)$ distribution, where $a^*$ and $b^*$ are the updated parameters given the data: $\theta \sim \mathit{Gamma}(a^*,b^*)$.
- Prove that the posterior mean is a weighted mean of the prior mean and the maximum likelihood estimate (mean) of the Poisson-distributed data, $\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$. Hint: the mean of a Gamma distribution is $\frac{a}{b}$. Specifically, what you have to prove is that:
\begin{equation}
\frac{a^*}{b^*} = \frac{a}{b} \times \frac{w_1}{w_1 + w_2} + \bar{x} \times \frac{w_2}{w_1 + w_2}
(\#eq:weightingpoisga)
\end{equation}
where $w_1 = 1$ and $w_2=\frac{n}{b}$.
- Given equation \@ref(eq:weightingpoisga), make an informal argument showing that as $n$ increases (as sample size goes up), the maximum likelihood estimate $\bar{x}$ dominates in determining the posterior mean, and when $n$ gets smaller and smaller, the prior mean dominates in determining the posterior mean.
- Finally, given that the variance of a Gamma distribution is $\frac{a}{b^2}$, show that as $n$ increases, the posterior variance will get smaller and smaller (the uncertainty on the posterior will go down).
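Before attempting the proof, it can help to confirm the identity numerically with arbitrary values, using the Poisson-Gamma update $a^* = a + \sum x_i$, $b^* = b + n$ from the previous exercise:

```{r}
a <- 4; b <- 2                 ## arbitrary prior parameters
x <- c(98, 110, 105, 93, 120)  ## arbitrary made-up counts
n <- length(x)
a_star <- a + sum(x); b_star <- b + n
w1 <- 1; w2 <- n / b
lhs <- a_star / b_star
rhs <- (a / b) * w1 / (w1 + w2) + mean(x) * w2 / (w1 + w2)
all.equal(lhs, rhs)  ## TRUE
```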
## Computational Bayesian data analysis {#ex:compbda}
### Check for parameter recovery in a linear model using simulated data. {#exr:simulatedlinearmod .unlisted}
Generate some simulated independent and identically distributed data with $n=100$ data points as follows:
```{r}
y <- rnorm(100, mean = 500, sd = 50)
```
Next, fit a simple linear model with a normal likelihood:
\begin{equation}
y_n \sim \mathit{Normal}(\mu,\sigma) (\#eq:simrtlikLM)
\end{equation}
Specify the following priors:
\begin{equation}
\begin{aligned}
\mu &\sim \mathit{Uniform}(0, 60000) \\
\sigma &\sim \mathit{Uniform}(0, 2000)
\end{aligned}
(\#eq:simrtpriors)
\end{equation}
Generate posterior distributions of the parameters and check that the true values of the parameters $\mu=500, \sigma=50$ are recovered by the model. What this means is that you should check whether these true values lie within the range of the posterior distributions of the two parameters. This is a good sanity check for finding out whether a model can in principle recover the true parameter values correctly.
### A simple linear model. {#exr:linearmod .unlisted}
\vspace{-.5cm}
a. Fit the model `fit_press` with just a few iterations, say 50 iterations (set warmup to the default of 25, and use four chains). Does the model converge?
b. Using normal distributions, choose priors that better represent **your** assumptions/beliefs about finger tapping times. To arrive at reasonable priors for $\mu$ and $\sigma$, come up with your own subjective assessment of what a plausible range of values for $\mu$ would be and how much variability there might be. There is no correct answer here; we'll discuss priors in depth in chapter \@ref(ch-priors). Fit this model to the data. Do the posterior distributions change?
### Revisiting the button-pressing example with different priors. {#exr:compbda-biasedpost .unlisted}
\vspace{-.5cm}
a. Can you come up with very informative priors that influence the posterior in a noticeable way (use normal distributions for priors, not uniform priors)? Again, there are no correct answers here; you may have to try several different priors before you can noticeably influence the posterior.
b. Generate and plot prior predictive distributions based on these priors.
c. Generate and plot posterior predictive distributions based on these priors.
### Posterior predictive checks with a log-normal model. {#exr:ppd .unlisted}
\vspace{-.5cm}
a. For the log-normal model `fit_press_ln`, change the prior of $\sigma$ so that it is a log-normal distribution with location ($\mu$) $-2$ and scale ($\sigma$) $0.5$. What does such a prior imply about your beliefs regarding button-pressing times in milliseconds? Is it a good prior? Generate and plot prior predictive distributions. When you fit the model, do the new estimates change compared to the earlier models?
b. For the log-normal model, what is the mean (rather than the median) time it takes to press the space bar, and what is the standard deviation of the finger tapping times in milliseconds?
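The closed-form expressions for the mean and standard deviation of a log-normal are useful here; a sketch with illustrative values ($\mu = 5.2$, $\sigma = 0.3$ on the log scale; these are stand-ins, not estimates from the model):

```{r}
mu <- 5.2; sigma <- 0.3
ln_mean <- exp(mu + sigma^2 / 2)
ln_sd <- sqrt((exp(sigma^2) - 1) * exp(2 * mu + sigma^2))
c(mean = ln_mean, sd = ln_sd)
## Cross-check by simulation:
mean(rlnorm(100000, meanlog = mu, sdlog = sigma))
```

In practice you would apply these formulas to the posterior samples of $\mu$ and $\sigma$ to obtain posterior distributions for the mean and standard deviation.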
### A skew normal distribution. {#exr:skew .unlisted}
Would it make sense to use a "skew normal distribution" instead of the log-normal? The skew normal distribution has three parameters: location $\xi$ (the lower-case version of the Greek letter $\Xi$, pronounced "ksi"), scale $\omega$ (omega), and shape $\alpha$. The distribution is right-skewed if $\alpha > 0$, left-skewed if $\alpha < 0$, and identical to the regular normal distribution if $\alpha = 0$. To fit this in `brms`, change `family` to `skew_normal()`, and add a prior with `class = alpha` (location remains `class = Intercept` and scale, `class = sigma`).
a. Fit this model with a prior that assigns approximately 95% of the prior probability of `alpha` to be between 0 and 10.
b. Generate posterior predictive distributions and compare the posterior distribution of summary statistics of the skew normal with the normal and log-normal.
## Bayesian regression models {#sec-LMexercises}
### A simple linear regression: Power posing and testosterone. {#exr:powerposing .unlisted}
Load the following data set:
```{r}
data("df_powerpose")
head(df_powerpose)
```
The data set, which was originally published in @carney2010power but released in modified form by @FossePowerPose, shows the testosterone levels of 39 different individuals, before and after treatment, where treatment refers to each individual being assigned to a high power pose or a low power pose. In the original paper by @carney2010power, the unit given for testosterone measurement (estimated from saliva samples) was picograms per milliliter (pg/ml). One picogram per milliliter is 0.001 nanogram per milliliter (ng/ml).
The research hypothesis is that on average, assigning a subject a high power pose vs. a low power pose will lead to higher testosterone levels after treatment. Assuming that you know nothing about typical ranges of testosterone using salivary measurement, you can use the default priors in `brms` for the target parameter(s).
Investigate this claim using a linear model with the default priors of `brms`. You will need to create a new variable that encodes the change in testosterone, and then estimate the effect of treatment on this change.
### Another linear regression model: Revisiting attentional load effect on pupil size. {#exr:pupils .unlisted}
Here, we revisit the analysis shown in the chapter, on how attentional load affects pupil size.
a. Our priors for this experiment were quite arbitrary. What do the prior predictive distributions look like? Do they make sense?
b. Is our posterior distribution sensitive to the priors we selected? Perform a sensitivity analysis to find out whether the posterior is affected by the choice of prior for $\sigma$.
c. Our data set also includes a column that indicates the trial number. Could it be that trial also has an effect on pupil size? As in `lm()`, we indicate another main effect with a `+` sign. How would you communicate the new results?
### Log-normal model: Revisiting the effect of trial on finger tapping times. {#exr:lognormalm .unlisted}
We continue considering the effect of trial on finger tapping times.
a. Estimate the slowdown in milliseconds between the last two times the subject pressed the space bar in the experiment.
b. How would you change your model (keeping the log-normal likelihood) so that it includes centered log-transformed trial numbers or square-root-transformed trial numbers (instead of centered trial numbers)? Does the effect in milliseconds change?
### Logistic regression: Revisiting the effect of set size on free recall. {#exr:reg-logistic .unlisted}
Our data set also includes a column called `tested` that indicates the position of the queried word. (In Figure \@ref(fig:oberauer), `tested` would be 3.) Could it be that position also has an effect on recall accuracy? How would you incorporate this in the model? (We indicate another main effect with a `+` sign.)
### Red is the sexiest color. {#exr:red .unlisted}
Load the following data set:
```{r}
data("df_red")
head(df_red)
```
The data set is from a study [@beall2013women] containing information about the color of the clothing worn (red, pink, or either red or pink) and whether the subject (female) was at risk of becoming pregnant (ovulating, self-reported). The broader issue being investigated is whether women wear red more often when they are ovulating (in order to attract a mate). Using logistic regressions, fit three different models to investigate whether ovulating increases the probability of wearing (a) red, (b) pink, or (c) either pink or red. Use priors that you consider reasonable.
## Bayesian hierarchical models {#sec-HLMexercises}
### A hierarchical model (normal likelihood) of cognitive load on pupil size. {#exr:hierarchical-normal .unlisted}
As in section \@ref(sec-pupil), we focus on the effect of cognitive load on pupil size, but this time we look at all the subjects of @wahnPupilSizesScale2016:
```{r}
data("df_pupil_complete")
df_pupil_complete
```
You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects) assuming a normal likelihood. Base your priors on those discussed in section \@ref(sec-pupil).
(a) Examine the effect of load on pupil size, and the average pupil size. What do you conclude?
(b) Do a sensitivity analysis for the prior on the intercept ($\alpha$). What is the estimate of the effect ($\beta$) under different priors?
(c) Is the effect of load consistent across subjects? Investigate this visually.
### Are subject relatives easier to process than object relatives (log-normal likelihood)? {#exr:hierarchical-logn .unlisted}
We begin with a classic question from the psycholinguistics literature: Are subject relatives easier to process than object relatives? The data come from Experiment 1 in a paper by @grodner.
*Scientific question*: Is there a subject relative advantage in reading?
@grodner investigate an old claim in psycholinguistics that \index{Object relative clause} object relative clause (ORC) sentences are more difficult to process than \index{Subject relative clause} subject relative clause (SRC) sentences. One explanation for this predicted difference is that the distance between the relative clause verb (*sent* in the example below) and the head noun phrase of the relative clause (*reporter* in the example below) is longer in ORC vs. SRC. Examples are shown below. The relative clause is shown in square brackets.
(1a) The *reporter* [who the photographer *sent* to the editor] was hoping for a good story. (ORC)
(1b) The *reporter* [who *sent* the photographer to the editor] was hoping for a good story. (SRC)
The underlying explanation has to do with memory processes: Shorter linguistic dependencies are easier to process due to either reduced interference or decay, or both. For implemented computational models that spell this point out, see @lewisvasishth:cogsci05 and @EngelmannJaegerVasishth2019.
In the Grodner and Gibson data, the dependent measure is \index{Reading time} reading time at the relative clause verb, (e.g., *sent*) of different sentences with either ORC or SRC. The dependent variable is in milliseconds and was measured in a self-paced reading task. \index{Self-paced reading} Self-paced reading is a task where subjects read a sentence or a short text word-by-word or phrase-by-phrase, pressing a button to get each word or phrase displayed; the preceding word disappears every time the button is pressed. In \@ref(sec-simpleexamplepriors), we provide a more detailed explanation of this experimental method.
For this experiment, we are expecting longer reading times at the relative clause verbs of ORC sentences in comparison to the relative clause verb of SRC sentences.
```{r open_grodneretal, message = FALSE}
data("df_gg05_rc")
df_gg05_rc
```
You should use sum coding for the predictor. Here, object relative clauses (`"objgap"`) are coded $+1/2$, and subject relative clauses $-1/2$.
```{r}
df_gg05_rc <- df_gg05_rc %>%
  mutate(c_cond = if_else(condition == "objgap", 1 / 2, -1 / 2))
```
You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects and for items) assuming a log-normal likelihood.
(a) Examine the effect of relative clause type (the predictor `c_cond`) on reading times `RT` ($\beta$).
(b) Estimate the median difference between the two relative clause types in milliseconds, and report the mean and 95% CI.
(c) Do a sensitivity analysis. What is the estimate of the effect ($\beta$) under different priors? What is the difference in milliseconds between conditions under different priors?
### Relative clause processing in Mandarin Chinese {#exr:HLMExerciseMandarinRC .unlisted}
Load the following two data sets:
```{r}
data("df_gibsonwu")
data("df_gibsonwu2")
```
The data are taken from two experiments that investigate (inter alia) the effect of relative clause type on reading time in Chinese. The data are from @gibsonwu and @VasishthetalPLoSOne2013 respectively. The second data set is a direct replication attempt of the @gibsonwu experiment.
Chinese relative clauses are interesting theoretically because they are prenominal: the relative clause appears before the head noun. For example, the English relative clauses shown above would appear in the following order in Mandarin. The square brackets mark the relative clause, and REL refers to the Chinese equivalent of the English relative pronoun *who*.
(2a) [The photographer *sent* to the editor] REL the *reporter* was hoping for a good story. (ORC)
(2b) [*sent* the photographer to the editor] REL the *reporter* was hoping for a good story. (SRC)
As discussed in @gibsonwu, a consequence of Chinese relative clauses being prenominal is that the distance between the relative clause verb and the head noun is larger in subject relatives than in object relatives. @hsiao03 were the first to suggest that the larger distance in subject relatives leads to longer reading times at the head noun. Under this view, the prediction is that subject relatives are harder to process than object relatives. If this is true, it is interesting and surprising, because in most other languages that have been studied, subject relatives are easier to process than object relatives; Chinese would then be a very unusual exception cross-linguistically.
The data provided are for the critical region (the head noun; here, *reporter*). The experimental method is self-paced reading, so we have reading times in milliseconds.
The research question is whether the difference in reading times between object and subject relative clauses is negative. For the first data set (`df_gibsonwu`), investigate this question by fitting two "maximal" hierarchical models (correlated varying intercepts and slopes for subjects and items). The dependent variable in both models is the raw reading time in milliseconds. The first model should use a normal likelihood; the second model should use a log-normal likelihood. In both models, use $\pm 0.5$ sum coding to model the effect of relative clause type. You will need to decide on appropriate priors for the various parameters.
(a) Plot the posterior predictive distributions from the two models. What is the difference between the posterior predictive distributions of the two models, and why is there a difference?
(b) Examine the posterior distributions of the effect estimates (in milliseconds) in the two models. Why are these different?
(c) Given the posterior predictive distributions you plotted above, why is the log-normal likelihood model better for carrying out inference and hypothesis testing?
Next, work out a normal approximation of the log-normal model's posterior distribution for the relative clause effect that you obtained from the above data analysis. Then use that normal approximation as an informative prior for the slope parameter when fitting a hierarchical model to the second data set. This is an example of incrementally building up knowledge by successively using a previous study's posterior as a prior for the next study; this is essentially equivalent to pooling both data sets (check that pooling the data and using a Normal(0,1) prior for the effect of interest, with a log-normal likelihood, gives you approximately the same posterior as the informative-prior model fit above).
### \index{Agreement attraction} Agreement attraction in comprehension {#exr:HLMExerciseEnglishAgrmt .unlisted}
Load the following data:
```{r}
data("df_dillonE1")
dillonE1 <- df_dillonE1
head(dillonE1)
```
The data are taken from an experiment that investigated (inter alia) the effect of number similarity between a noun and the auxiliary verb in sentences like the following. There are two levels of a factor called Int(erference): low and high.
(3a) low: The key to the cabinet *are* on the table
(3b) high: The key to the *cabinets* *are* on the table
Here, in (3b), the auxiliary verb *are* is predicted to be read faster than in (3a), because the plural marking on the noun *cabinets* leads the reader to think that the sentence is grammatical. (Both sentences are ungrammatical.) This phenomenon, where the high condition is read faster than the low condition, is called **agreement attraction**.
The data provided are for the critical region (the auxiliary verb *are*). The experiment method is eye-tracking; we have total reading times in milliseconds.
The research question is whether the difference in reading times between high and low conditions is negative.
- First, using a log-normal likelihood, fit a hierarchical model with correlated varying intercept and slopes for subjects and items. You will need to decide on the priors for the model.
- By simply looking at the posterior distribution of the slope parameter $\beta$, what would you conclude about the theoretical claim relating to agreement attraction?
### \index{Attentional blink} Attentional blink (Bernoulli likelihood) {#exr:ab .unlisted}
The attentional blink [AB; first described by @raymond1992temporary; though it had been noticed before, e.g., @Broadbent1987] refers to a temporary reduction in the accuracy of detecting a *probe* (e.g., a letter "X") presented closely after a *target* that has been detected (e.g., a white letter). We will focus on the experimental condition of Experiment 2 of @raymond1992temporary. Subjects are presented with letters in \index{Rapid serial visual presentation} rapid serial visual presentation (RSVP) at the center of the screen at a constant rate and are required to identify the only white letter (target) in the stream of black letters, and then to report whether the letter X (probe) occurred in the subsequent letter stream. The AB is defined as having occurred when the target is reported correctly but the report of the probe is inaccurate at a short *lag* or *target-probe* interval.
The data set `df_ab` is a subset of the data of this paradigm from a replication conducted by @grassi_two_2021. In this subset, the probe was always present and the target was correctly identified. We want to find out how the lag affects the accuracy of the identification of the probe.
```{r}
data("df_ab")
df_ab
```
Fit a logistic regression assuming a linear relationship between `lag` and accuracy (`probe_correct`). Assume a hierarchical structure with correlated varying intercept and slopes for subjects. You will need to decide on the priors for this model.
(a) How is the accuracy of the probe identification affected by the lag? Estimate this in log-odds and percentages.
(b) Is the linear relationship justified? Use posterior predictive checks to verify this.
(c) Can you think about a better relationship between lag and accuracy? Fit a new model and use posterior predictive checks to verify if the fit improved.
### Is there a Stroop effect in accuracy? {#exr:strooplogis-brms .unlisted}
Instead of the response times of the correct answers, we want to find out whether accuracy also changes by condition in the Stroop task. Fit the Stroop data with a hierarchical logistic regression (i.e., a Bernoulli likelihood with a logit link). Use the complete data set, `df_stroop_complete`, which also includes incorrect answers, and subset it by selecting the first 50 subjects.
(a) Fit the model.
(b) Report the Stroop effect in log-odds and accuracy.
### \index{Distributional regression} Distributional regression for the Stroop effect. {#exr:stroop-dist .unlisted}
We will relax some of the assumptions of the Stroop model presented in section \@ref(sec-stroop). We will no longer assume that all subjects share the same variance component, and, in addition, we'll investigate whether the experimental manipulation affects the scale of the response times. A reasonable hypothesis could be that the incongruent condition is noisier than the congruent one.
Assume the following likelihood, and fit the model with sensible priors (recall that our initial prior for $\beta$ wasn't reasonable). (Priors for all the `sigma` parameters require us to set `dpar = sigma`).
\begin{equation}
\begin{aligned}
rt_n &\sim \mathit{LogNormal}(\alpha + u_{subj[n],1} + c\_cond_n \cdot (\beta + u_{subj[n],2}), \sigma_n)\\
\sigma_n &= \exp(\sigma_\alpha + \sigma_{u_{subj[n],1}} + c\_cond_n \cdot (\sigma_\beta + \sigma_{u_{subj[n],2}}) )
\end{aligned}
\end{equation}
In this likelihood $\sigma_n$ has both population- and group-level parameters: $\sigma_\alpha$ and $\sigma_\beta$ are the intercept and slope of the population-level effects respectively, and $\sigma_{u_{subj[n],1}}$ and $\sigma_{u_{subj[n],2}}$ are the intercept and slope of the group-level effects.
(a) Is our hypothesis reasonable in light of the results?
(b) Why is the intercept for the scale negative?
(c) What's the posterior estimate of the scale for congruent and incongruent conditions?
```{r, eval = FALSE, echo = FALSE}
fit_stroop_d <- brm(bf(RT ~ c_cond + (c_cond | subj),
sigma ~ c_cond + (c_cond | subj)),
family = lognormal(),
prior =
c(
prior(normal(6, 1.5), class = Intercept),
prior(normal(0, .1), class = b),
prior(normal(0, 1), class = sd),
prior(lkj(2), class = cor),
prior(normal(-1, 1), class = Intercept, dpar = sigma),
prior(normal(0, 1),
class = b,
dpar = sigma),
prior(normal(0, 1),
class = sd, group = subj,
dpar = sigma)
),
iter = 4000,
data = df_stroop %>% filter(subj < 30)
)
```
### The \index{Grammaticality illusion} grammaticality illusion {#exr:HLMExerciseGramCE .unlisted}
Load the following two data sets:
```{r}
data("df_english")
english <- df_english
data("df_dutch")
dutch <- df_dutch
```
In an offline accuracy rating study on English double center-embedding constructions, @gibsonthomas99 found that grammatical constructions (e.g., example 4a below) were no less acceptable than ungrammatical constructions (e.g., example 4b) where a middle verb phrase (e.g., *was cleaning every week*) was missing.
(4a) The apartment that the maid who the service had sent over was cleaning every week was well decorated.
(4b) *The apartment that the maid who the service had sent over --- was well decorated.
Based on these results from English, @gibsonthomas99 proposed that working-memory overload leads the comprehender to forget the prediction of the upcoming verb phrase (VP), which reduces working-memory load. This came to be known as the \index{VP-forgetting hypothesis} *VP-forgetting hypothesis*. The prediction is that in the word immediately following the final verb, the grammatical condition (which is coded as +1 in the data frames) should be harder to read than the ungrammatical condition (which is coded as -1).
The design shown above is set up to test this hypothesis using self-paced reading for English [@VSLK08], and for Dutch [@FrankEtAl2015]. The data provided are for the critical region (the noun phrase, labeled NP1, following the final verb); this is the region for which the theory predicts differences between the two conditions. We have reading times in log milliseconds.
(a) First, fit a linear model with a full hierarchical structure by subjects and by items for the English data. Because we have data in log milliseconds, we can simply use the normal likelihood (not the log-normal). What scale will the parameters be in: milliseconds or log milliseconds?
(b) Second, using the posterior for the effect of interest from the English data, derive a prior distribution for the effect in the Dutch data. Then fit two linear mixed models: (i) one model with relatively uninformative priors for $\beta$ (for example, $Normal(0,1)$), and (ii) one model with the prior for $\beta$ you derived from the English data. Do the posterior distributions of the Dutch data's effect show any important differences given the two priors? If yes, why; if not, why not?
(c) Finally, just by looking at the English and Dutch posteriors, what can we say about the VP-forgetting hypothesis? Are the posteriors of the effect from these two languages consistent with the hypothesis?
## Contrast coding {#sec-Contrastsexercises}
### Contrast coding for a four-condition design {#exr:ContrastsPersian .unlisted}
Load the following data. These data are from Experiment 1 in a set of reading studies on Persian [@SafaviEtAlFrontiers2016]. This is a self-paced reading study on particle-verb constructions, with a $2\times 2$ design: distance (short, long) and predictability (predictable, unpredictable). The data are from a critical region in the sentence. All the data from the @SafaviEtAlFrontiers2016 paper are available from https://github.com/vasishth/SafaviEtAl2016.
```{r}
library(bcogsci)
data("df_persianE1")
dat1 <- df_persianE1
head(dat1)
```
The four conditions are:
- Distance=short and Predictability=unpredictable
- Distance=short and Predictability=predictable
- Distance=long and Predictability=unpredictable
- Distance=long and Predictability=predictable
The researcher wants to do the following sets of comparisons between condition means:
Compare the condition labeled Distance=short and Predictability=unpredictable with each of the following conditions:
- Distance=short and Predictability=predictable
- Distance=long and Predictability=unpredictable
- Distance=long and Predictability=predictable
Questions:
- Which contrast coding is needed for such a comparison?
- First, define the relevant contrast coding. Hint: You can do it by creating a condition column labeled a,b,c,d and then use a built-in contrast coding function.
- Then, use the `hypr` library function to confirm that your contrast coding actually does the comparison you need.
- Fit a simple linear model with the above contrast coding and display the slopes, which constitute the relevant comparisons.
- Now, compute each of the four conditions' means and check that the slopes from the linear model correspond to the relevant differences between means that you obtained from the data.
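As a further hint, assigning a built-in coding to a condition factor works as sketched below; `contr.treatment()` (shown here on a hypothetical factor) is one built-in scheme worth considering, since it compares a baseline level against each of the other levels:

```{r}
# Sketch: assigning a built-in contrast coding to a factor
# (hypothetical condition labels a-d; a is the baseline).
cond <- factor(c("a", "b", "c", "d"))
contrasts(cond) <- contr.treatment(4)
contrasts(cond)
```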
### \index{Helmert coding} Helmert coding for a six-condition design. {#exr:ContrastsNPIHelmert .unlisted}
This data set is from a psycholinguistics study; although we explain the theoretical background below, one does not need to understand the research questions in depth to be able to define the contrasts.
Load the following data:
```{r}
library(bcogsci)
data("df_polarity")
head(df_polarity)
```
The data come from an eye-tracking study in German reported in @VBLD07. The experiment is a reading study involving six conditions. The example sentences below are in English, but the original design involved German sentences. In German, the word *durchaus* (certainly) is a positive polarity item: in the constructions used in this experiment, *durchaus* cannot have a c-commanding element that is a negative polarity item licensor. By contrast, the German word *jemals* (ever) is a negative polarity item: in the constructions used in this experiment, *jemals* must have a c-commanding element that is a negative polarity item licensor.
Here are the conditions:
- Negative polarity items
(a) Grammatical: No man who had a beard was ever thrifty.
(b) Ungrammatical (Intrusive NPI licensor): A man who had no beard was ever thrifty.
(c) Ungrammatical: A man who had a beard was ever thrifty.
- Positive polarity items
(d) Ungrammatical: No man who had a beard was certainly thrifty.
(e) Grammatical (Intrusive NPI licensor): A man who had no beard was certainly thrifty.
(f) Grammatical: A man who had a beard was certainly thrifty.
We will focus only on re-reading time in this data set. Subset the data so that we only have re-reading times in the data frame:
```{r}
dat2 <- subset(df_polarity, times == "RRT")
head(dat2)
```
The comparisons we are interested in are as follows:
- What is the difference in reading time between negative polarity items and positive polarity items? In other words, we want to compare the mean of conditions (a), (b), (c) with the mean of (d), (e), (f).
- Within negative polarity items, what is the difference between grammatical and ungrammatical conditions? In other words, we want to compare condition (a) with the average of (b) and (c).
- Within negative polarity items, what is the difference between the two ungrammatical conditions? Here, we want to compare conditions (b) and (c).
- Within positive polarity items, what is the difference between grammatical and ungrammatical conditions? Here, we want to compare condition (d) with the average of (e) and (f).
- Within positive polarity items, what is the difference between the two grammatical conditions? Here, the comparison is between (e) and (f).
Use the `hypr` package to specify the comparisons specified above, and then extract the contrast matrix. Finally, specify the contrasts to the condition column in the data frame. Fit a linear model using this contrast specification, and then check that the estimates from the model match the mean differences between the conditions being compared.
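Under the hood, `hypr` turns each comparison (a row of a hypothesis matrix) into a contrast matrix via the generalized inverse. A base-R sketch for the first comparison only (NPI conditions a-c vs. PPI conditions d-f):

```{r}
# Sketch: from hypothesis matrix to contrast matrix via the
# generalized inverse (this is what hypr computes internally).
hyp <- rbind(npi_vs_ppi = c(1, 1, 1, -1, -1, -1) / 3)
contr_mat <- MASS::ginv(hyp)  # 6 conditions x 1 contrast
round(drop(contr_mat), 2)  # 0.5 0.5 0.5 -0.5 -0.5 -0.5
```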
### Number of possible comparisons in a single model. {#exr:ContrastsNcomparisons .unlisted}
- How many comparisons can one make in a single model when there is a single factor with four levels? Why can we not code four comparisons in a single model?
- How many comparisons can one code in a model where there are two factors, one with three levels and one with two levels?
- How about a model for a $2 \times 2 \times 3$ design?
## Contrast coding with two predictor variables {#sec-Contrasts2x2exercises}
### ANOVA coding for a four-condition design. {#exr:ContrastsPersianANOVA .unlisted}
Load the following data. These data are from Experiment 1 in a set of reading studies on Persian [@SafaviEtAlFrontiers2016]; we encountered these data in the preceding chapter's exercises.
```{r}
library(bcogsci)
data("df_persianE1")
dat1 <- df_persianE1
head(dat1)
```
The four conditions are:
- Distance=short and Predictability=unpredictable
- Distance=short and Predictability=predictable
- Distance=long and Predictability=unpredictable
- Distance=long and Predictability=predictable
For the data given above, define an ANOVA-style contrast coding, and compute main effects and interactions. Check with `hypr` what the estimated comparisons are with an ANOVA coding.
### ANOVA and nested comparisons in a $2\times 2\times 2$ design {#exr:Contrasts2x2x2Dillon2013 .unlisted}
Load the following data set. This is a $2\times 2\times 2$ design from @JaegerMertzenVanDykeVasishth2019, with the factors Grammaticality (grammatical vs. ungrammatical), Dependency (Agreement vs.\ Reflexives), and Interference (Interference vs.\ no interference). The experiment is a replication attempt of Experiment 1 reported in @Dillon-EtAl-2013.
```{r}
library(bcogsci)
data("df_dillonrep")
```
- The grammatical conditions are a,b,e,f. The rest of the conditions are ungrammatical.
- The agreement conditions are a,b,c,d. The other conditions are reflexives.
- The interference conditions are a,d,e,h, and the others are the no-interference conditions.
The dependent measure of interest is TFT (total fixation time, in milliseconds).
Using a linear model, carry out a main effects and interactions ANOVA contrast coding, and obtain estimates of the main effects of Grammaticality, Dependency, and Interference, and all interactions. You may find it easier to code the main-effect contrasts as +1 and -1, using `ifelse()` in R to create vectors corresponding to each main effect. This will make the specification of the interactions easy.
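The `ifelse()` hint can be sketched as follows (the `condition` vector here is a hypothetical stand-in for the condition column of `df_dillonrep`):

```{r}
# Sketch: +/-1 main-effect contrasts from condition labels a-h;
# interaction contrasts are products of the main-effect vectors.
condition <- letters[1:8]
gram <- ifelse(condition %in% c("a", "b", "e", "f"), 1, -1)
dep  <- ifelse(condition %in% c("a", "b", "c", "d"), 1, -1)
int  <- ifelse(condition %in% c("a", "d", "e", "h"), 1, -1)
gram_x_dep <- gram * dep              # two-way interaction
gram_x_dep_x_int <- gram * dep * int  # three-way interaction
```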
The researchers had a further research hypothesis: in ungrammatical sentences only, agreement would show an interference effect but reflexives would not. In grammatical sentences, both agreement and reflexives are expected to show interference effects. This kind of research question can be answered with nested contrast coding.
To carry out the relevant nested contrasts, define contrasts that estimate the effects of
- grammaticality
- dependency type
- the interaction between grammaticality and dependency type
- reflexives interference within grammatical conditions
- agreement interference within grammatical conditions
- reflexives interference within ungrammatical conditions
- agreement interference within ungrammatical conditions
Do the estimates match expectations? Check this by computing the condition means and checking that the estimates from the models match the relevant differences between conditions or clusters of conditions.
## Introduction to the probabilistic programming language Stan
### A very simple model. {#exr:first .unlisted}
In this exercise we revisit the model from section \@ref(sec-simplenormal). Assume the following:
1. There is a true underlying time, $\mu$, that it takes the subject to press the space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable given that response times are generally skewed; we fix this assumption later).
That is, the likelihood for each observation $n$ will be:
\begin{equation}
\begin{aligned}
t_n \sim \mathit{Normal}(\mu, \sigma)
\end{aligned}
\end{equation}
a. Decide on appropriate priors and fit this model in Stan. Data can be found in `df_spacebar`.
b. Change the likelihood to a log-normal distribution and change the priors. Fit the model in Stan.
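A skeleton for part (a) might look like the following (priors deliberately left out; choosing them is part of the exercise):

```stan
data {
  int<lower=1> N;
  vector[N] t;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // add your priors for mu and sigma here
  t ~ normal(mu, sigma);
}
```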
### Incorrect Stan model. {#exr:badstan .unlisted}
We want to fit both response times and accuracy with the same model. We simulate the data as follows:
```{r }
N <- 500
df_sim <- tibble(rt = rlnorm(N, mean = 6, sd = .5),
correct = rbern(N, prob = .85))
```
We build the following model:
```{r incorrectstan, echo = FALSE}
incorrect <- system.file("stan_models", "incorrect.stan", package = "bcogsci")
```
```{stan output.var = "incorrect_stan", code = readLines(incorrect), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Why does this model not work?
```{r, error = TRUE}
ls_sim <- list(rt = df_sim$rt,
correct = df_sim$correct)
incorrect <- system.file("stan_models",
"incorrect.stan",
package = "bcogsci")
fit_sim <- stan(incorrect, data = ls_sim)
```
Try to make it run. (Hint: There are several problems.)
### Using Stan documentation. {#exr:skewstan .unlisted}
Edit the simple example with Stan from section \@ref(sec-firststan), and replace the normal distribution with a skew normal distribution. (Don't forget to add a prior to the new parameter, and check the Stan documentation or a statistics textbook for more information about the distribution).
Fit the following data:
```{r yscoresagain3, eval = FALSE}
Y <- rnorm(1000, mean = 3, sd = 10)
```
Does the estimate of the new parameter make sense?
### The probit link function as an alternative to the logit function. {#exr:linkfunction .unlisted}
The probit link function is the inverse of the CDF of the standard normal distribution ($\mathit{Normal}(0,1)$). Since the CDF of the standard normal is usually written using the Greek letter $\Phi$ (Phi), the probit function is written as its inverse, $\Phi^{-1}$. Refit the model presented in section \@ref(sec-logisticstan), replacing the logit link function with the probit link (that is, transforming the regression to a constrained space using `Phi()` in Stan).
You will probably see the following message as the model runs; this is because the probit link is less numerically stable (i.e., more prone to under- and overflow) than the logit link in Stan. Don't worry, it is good enough for this exercise.
```
Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can't start sampling from this initial value.
```
a. Do the results of the coefficients $\alpha$ and $\beta$ change?
b. Do the results in probability space change?
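To build intuition for what changes, the two link functions can be compared directly in R (`qlogis()` is the logit, `qnorm()` is $\Phi^{-1}$, the probit):

```{r}
# Both links map probabilities to the real line and are 0 at
# p = 0.5, but the logit scale is stretched by a factor of
# roughly 1.7 relative to the probit scale.
p <- c(0.1, 0.5, 0.9)
qlogis(p)  # -2.20  0.00  2.20
qnorm(p)   # -1.28  0.00  1.28
```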
### Examining the position of the queued word on recall. {#exr:logisticstan .unlisted}
Refit the model presented in section \@ref(sec-logisticstan) and examine whether set size, trial effects, the position of the queued word (`tested` in the data set), and their interaction affect free recall. (Tip: You can do this exercise without changing the Stan code.)
How does the accuracy change from position one to position two?
### The conjunction fallacy. {#exr:fallacy .unlisted}
@Paolaccietal examined whether the results of some classic experiments differ between a university pool population and subjects recruited from Mechanical Turk. We'll examine whether the results of the conjunction fallacy experiment [or Linda problem: @TverskyKahneman1983] are replicated for both groups.
```{r, message = FALSE}
data("df_fallacy")
df_fallacy
```
The conjunction fallacy shows that people often fail to regard a combination of events as less probable than a single event in the combination [@TverskyKahneman1983]:
*Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.*
*Which is more probable?*
a. *Linda is a bank teller.*
b. *Linda is a bank teller and is active in the feminist movement.*
The majority of those asked chose option (b), even though it's less probable ($\Pr(a \land b)\leq \Pr(b)$). The data set is named `df_fallacy`; `0` indicates option (a) and `1` indicates option (b).
Fit a logistic regression in Stan and report:
a. The estimated overall probability of answering (b) ignoring the group.
b. The estimated overall probability of answering (b) for each group.
## Hierarchical models and reparameterization
### A log-normal model in Stan. {#exr:stroop .unlisted}
Refit the Stroop example from section \@ref(sec-stroop) in Stan (`df_stroop`).
Assume the following likelihood and priors:
\begin{equation}
rt_n \sim \mathit{LogNormal}(\alpha + u_{subj[n],1} + c\_cond_n \cdot (\beta + u_{subj[n],2}), \sigma)
\end{equation}
\begin{equation}
\begin{aligned}
\alpha & \sim \mathit{Normal}(6, 1.5) \\
\beta & \sim \mathit{Normal}(0, .1) \\
\sigma &\sim \mathit{Normal}_+(0, 1)
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\tau_{u_1} &\sim \mathit{Normal}_+(0,1)\\
\tau_{u_2} &\sim \mathit{Normal}_+(0,1)\\
\begin{bmatrix}
1 & \rho_u \\
\rho_u & 1
\end{bmatrix} &\sim \mathit{LKJcorr}(2)
\end{aligned}
\end{equation}
### A by-subjects and by-items hierarchical model with a log-normal likelihood. {#exr:hierarchical-logn-stan .unlisted}
Revisit the question "Are subject relatives easier to process than object relatives?" Fit the model from exercise \@ref(exr:hierarchical-logn) using Stan.
### A hierarchical logistic regression with Stan. {#exr:strooplogis .unlisted}
Revisit the question "Is there a Stroop effect in accuracy?" Fit the model from exercise \@ref(exr:strooplogis-brms) using Stan.
### A distributional regression model of the effect of cloze probability on the N400. {#exr:distr-stan .unlisted}
In section \@ref(sec-distrmodel), we saw how to fit a distributional regression model. We might want to extend this approach to Stan. Fit a hierarchical model to the EEG data with by-subject and by-item varying intercepts and slopes, and in addition assume that the residual standard deviation (the scale of the normal likelihood) can vary by subject.
\begin{equation}
\begin{aligned}
signal_n &\sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\
& c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma_n)\\
\sigma_n &= \exp(\alpha_{\sigma} + u_{\sigma_{subj[n]}})
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\alpha_\sigma &\sim \mathit{Normal}(0,\log(50))\\
u_\sigma &\sim \mathit{Normal}(0, \tau_{u_\sigma}) \\
\tau_{u_\sigma} &\sim \mathit{Normal}_+(0, 5)
\end{aligned}
\end{equation}
To fit this model, take into account that `sigma` is now a vector, and it is a transformed parameter which depends on two parameters: `alpha_sigma` and the vector with `N_subj` elements `u_sigma`. In addition, `u_sigma` depends on the hyperparameter `tau_u_sigma` ($\tau_{u_\sigma}$). (Using the non-centered parameterization for `u_sigma` speeds up the model fit considerably).
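A sketch of the non-centered parameterization for `u_sigma` (the variable names are ours; adapt them to your program):

```stan
parameters {
  vector[N_subj] z_u_sigma;   // standardized group-level effects
  real<lower=0> tau_u_sigma;
}
transformed parameters {
  // non-centered: u_sigma = z * tau, with z ~ Normal(0, 1)
  vector[N_subj] u_sigma = z_u_sigma * tau_u_sigma;
}
model {
  z_u_sigma ~ std_normal();
  tau_u_sigma ~ normal(0, 5);
}
```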
## Custom distributions in Stan {#sec-customexercises}
### Fitting a \index{Shifted log-normal distribution} shifted log-normal distribution. {#exr:shiftedlogn .unlisted}
A random variable $Y$ has a shifted log-normal distribution with shift $\psi$, location $\mu$, and scale $\sigma$, if $Z = Y-\psi$ and $Z \sim \mathit{LogNormal}(\mu,\sigma)$.
1. Implement a `shifted_lognormal_lpdf` function in Stan with three parameters, `mu`, `sigma`, and `psi`. Tip: One can use the regular log-normal distribution and apply a change of variable. In this case the Jacobian adjustment would be
$|\frac{d}{dY} (Y - \psi)|=1$, which in log space is conveniently zero.
2. Verify the correctness of the model by recovering true values (of your choice) of the parameters of the model and by using simulation-based calibration. In order to use simulation-based calibration, you will need to decide on sensible priors; assume that $\psi \sim \mathit{Normal}_+(100,50)$, and choose priors for $\mu$ and $\sigma$ so that the prior predictive distributions are adequate for response times.
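It can help to prototype the density in R before writing the Stan function; a minimal sketch (the function name `dshifted_lnorm` is ours):

```{r}
# Sketch: shifted log-normal density. Because the Jacobian of
# the shift is 1 (zero on the log scale), the density of Y is
# the log-normal density evaluated at Y - psi.
dshifted_lnorm <- function(y, psi, mu, sigma, log = FALSE) {
  dlnorm(y - psi, meanlog = mu, sdlog = sigma, log = log)
}
dshifted_lnorm(300, psi = 100, mu = log(200), sigma = 0.5)
```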
### Fitting a Wald distribution. {#exr:wald .unlisted}
The \index{Wald distribution} Wald distribution (or inverse Gaussian distribution) and its variants have been proposed as another useful distribution for \index{Response time} response times [see for example @heathcote2004fitting].
The probability density function of the Wald distribution is the following.
\begin{equation}
f(x;\mu,\lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}} \exp\left(-\frac{\lambda (x-\mu)^{2}}{2\mu^{2} x}\right)
\end{equation}
1. Implement this distribution in Stan as `wald_lpdf`. In order to do this, you will need to derive the logarithm of the PDF presented above. You can adapt the code of the following R function.
```{r}
dwald <- function(x, lambda, mu, log = FALSE) {
  log_density <- 0.5 * log(lambda / (2 * pi)) -
1.5 * log(x) -
0.5 * lambda * ((x - mu) / (mu * sqrt(x)))^2
if (log == FALSE) {
exp(log_density)
} else {
log_density
}
}
```
2. Verify the correctness of the model by recovering true values (of your choice) of the parameters of the model and by using simulation-based calibration. As with the previous exercise, you will need to decide on sensible priors by deriving prior predictive distributions that are adequate for response times.
## Meta-analysis and measurement error models {#sec-REMAMEexercises}
### Extracting estimates from published papers {#exr:REMAMEExtracting .unlisted}
Researchers often do not release the data that lie behind the published paper. This creates the problem that one has to figure out the estimated effect size and standard error from published statistics. This exercise gives some practice on how to do this by considering some typical cases that are encountered in repeated measures designs.
(a) Suppose that in a repeated measures reading study, the observed t-value from a paired t-test is 2.3 with degrees of freedom 23. Suppose also that the estimated standard deviation is 150 ms. What is the estimated standard error and the estimated effect size?
(b) A repeated measures reading study reports an F-score from an analysis of variance as $F(1,34)=6.1$, with an effect size of 25 ms. What is the estimated standard error?
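The identities needed are $t = \text{effect}/SE$, $SE = sd/\sqrt{n}$ for a paired t-test, and $t = \sqrt{F}$ for an F-test with one numerator degree of freedom. A sketch with made-up numbers (not the ones in the exercise):

```{r}
# Sketch of the back-calculations with made-up numbers.
# Paired t-test with n subjects (df = n - 1): SE = sd / sqrt(n).
t_val <- 2; n <- 20; sd_est <- 100
se <- sd_est / sqrt(n)
effect <- t_val * se   # since t = effect / SE
# One-df F-test: t = sqrt(F), so SE = effect / sqrt(F).
se_F <- 30 / sqrt(4)   # hypothetical effect 30 ms, F = 4
```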
### A meta-analysis of picture-word interference data {#exr:REMAMEBuerki .unlisted}
Load the following data set:
```{r}
data("df_buerki")
head(df_buerki)
df_buerki <- subset(df_buerki, se > 0.60)
```
The data are from @BuerkiEtAl2020. We have a summary of the effect estimates (d) and standard errors (se) of the estimates from 162 published experiments on a phenomenon called *semantic picture-word interference*. We removed implausibly low SEs in the code above, but the results do not change regardless of whether we keep them, because we have data from a lot of studies.
In this experimental paradigm, subjects are asked to name a picture while ignoring a distractor word (which is either related or unrelated to the picture). The word can be printed on the picture itself, or presented auditorily. The dependent measure is the response latency, i.e., the time interval between the presentation of the picture and the onset of the vocal response. Theory says that distractors that come from the same semantic category as the picture to be named lead to a slower response than when the distractor comes from a different semantic category.
Carry out a random effects meta-analysis using `brms` and display the posterior distribution of the effect, along with the posterior of the between study standard deviation.
Choose $\mathit{Normal}(0,100)$ priors for the intercept and between study sd parameters. You can also try vague priors (sensitivity analysis). Examples would be:
- $\mathit{Normal}(0,200)$
- $\mathit{Normal}(0,400)$
### Measurement error model for English VOT data {#exr:REMAMELiEnglish .unlisted}
Load the following data:
```{r}
data("df_VOTenglish")
head(df_VOTenglish)
```
You are given mean voice onset time (VOT) data (with SEs) in milliseconds for English, along with mean vowel durations (with SEs) in milliseconds. Fit a measurement-error model investigating the effect of mean vowel duration on mean VOT duration. First plot the relationship between the two variables; does it look like there is an association between the two?
Then use `brms` with measurement error included in both the dependent and independent variables. Do a sensitivity analysis to check the influence of the priors on the posteriors of the relevant parameters.
## Introduction to model comparison
## Bayes factors
### Is there evidence for differences in the effect of cloze probability among the subjects? {#exr:bysubjects .unlisted}
\vspace{-.5cm}
Use a Bayes factor to compare the log cloze probability model that we examined in section \@ref(sec-BFnonnested) with a similar model that incorporates the strong assumption of no difference between subjects for the effect of cloze ($\tau_{u_2}=0$).
### Is there evidence for the claim that English subject relative clauses are easier to process than object relative clauses? {#exr:bf-logn .unlisted}
Consider again the reading time data coming from Experiment 1 of @grodner presented in exercise \@ref(exr:hierarchical-logn):
```{r open_grodneretal_again, message = FALSE}
data("df_gg05_rc")
df_gg05_rc
```
As in exercise \@ref(exr:hierarchical-logn), you should use sum coding for the predictors. Here, object relative clauses (`"objgap"`) are coded as $+1/2$, and subject relative clauses as $-1/2$.
```{r}
df_gg05_rc <- df_gg05_rc %>%
mutate(c_cond = if_else(condition == "objgap", 1/2, -1/2))
```
Using the Bayes factors function shown in this chapter, quantify the evidence against the null model (no population-level reading time difference between SRC and ORC) relative to the following alternative models: