Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Allow weights to also be specified at group level #1719

Open
bschneidr opened this issue Dec 15, 2024 · 3 comments
Open

Feature request: Allow weights to also be specified at group level #1719

bschneidr opened this issue Dec 15, 2024 · 3 comments
Labels

Comments

@bschneidr
Copy link

bschneidr commented Dec 15, 2024

Summary

This request is to allow the user to specify not only unit-level weights in a brms formula, but group-level weights as well. This would extend the current syntax from this:

# Specify a formula with unit-level weights
y | weights(w) ~ x + (1 | g)

to maybe something like this

# Specify a formula with group-level weights as well
y | weights(w) ~ x + ( 1 | g | group_weights(w_g) )

where $w$ is a unit-level weight variable with potentially $n$ distinct values, and $w_g$ is a group-level weight variable with one value per group.

I don't know what the best syntax for this functionality would be, but I hope this illustrates the idea well enough.

The motivation for this is to make brms able to fit survey-weighted multilevel models: Savitsky and Williams (2022) showed that only weighting the unit-level likelihood doesn't produce reasonable estimates for multilevel models, but adding group-level weights to the random effects' priors works well.

I think would be a valuable addition to brms for many folks who want to analyze data from health and social surveys. I would be happy to work on implementing this if you are open to it and have a preferred way for how this would look.

Example Data

Below is an example dataset of size $n=6$ with $M=3$ groups. The variable w is a unit level weight, and the variable w_g is a group-level weight.

example_data <- data.frame(
  y   = rnorm(n = 6),
  x   = rnorm(n = 6),
  w   = runif(n = 6, min = 0.9, max = 1),
  g   = c('a', 'a',  'b',  'b',  'c',  'c'),
  w_g = c(0.5, 0.5, 0.75, 0.75, 0.25, 0.25)
)

We want to estimate a simple varying-intercepts model of the form y ~ x + (1 | g).

Below is Stan code that would be generated for this model:

brms::make_stancode(
    data    = example_data,
    formula = y | weights(w) ~ x + (1 | g),
    family  = gaussian()
)
Click to show generated Stan code
// generated with brms 2.22.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  vector<lower=0>[N] weights;  // model weights
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  array[M_1] vector[N_1] z_1;  // standardized group-level effects
}
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  real lprior = 0;  // prior contributions to the log posterior
  r_1_1 = (sd_1[1] * (z_1[1]));
  lprior += student_t_lpdf(Intercept | 3, 0.2, 2.5);
  lprior += student_t_lpdf(sigma | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  lprior += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
    }
    for (n in 1:N) {
      target += weights[n] * (normal_lpdf(Y[n] | mu[n], sigma));
    }
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(z_1[1]);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

Details

The context that motivates this request is the situation where the analyst has data from a complex survey sample with sample weights and they want to fit a model with group-level parameters (i.e., random intercepts or other group-varying parameters).

The current formula syntax allows users to specify unit-level weights that go into the model likelihood at the unit level, using the syntax

y | weights(w) ~ x + (1 | g)

The key part of the generated Stan code that uses the weights is highlighted below:

model {
   ...
    for (n in 1:N) {
      target += weights[n] * (normal_lpdf(Y[n] | mu[n], sigma));
    }
  }
  ...
}

When there are group-varying parameters in the model, such as random intercepts, Savitsky and Williams (2022) showed that the above approach (which they call "single weighting") is insufficient: it tends to produce substantial bias in estimation of the variance parameters and the group-varying parameters.

They showed though that reasonable estimates can be obtained with a strategy they called "double weighting", which applies weights at the individual level and also at the group level. In other words, this approach weights the likelihood as well as the prior on the random effects, where the weights for the prior are group-level weights.

Below is what this could look like in the Stan code:

model {
    ...
    for (n in 1:N) {
      target += weights[n] * (normal_lpdf(Y[n] | mu[n], sigma));
    }
  }
  // priors including constants
  target += lprior;
  for (n_1 in 1:N_1) {
    target += group_weights[1][n_1] * std_normal_lpdf(z_1[1][n_1]);
  }
}
@paul-buerkner
Copy link
Owner

I like this idea. My suggestion for syntax would be

y | weights(w) ~ x + ( 1 | gr(g, weights = w_g) )

The gr() function already exists and allows us to customize random effects, so it is a good place for this feature. Before anyone starts working on this feature, could you check if the weights option in the multi-membership terms (?mm) may already cover your use-case?

@bschneidr
Copy link
Author

Thanks, Paul. I think that syntax using gr(g, weights = w_g) is quite clear and would be a good place for this.

I inspected the mm() implementation and unfortunately I don't think it can be used to implement this kind of weighting, for two reasons. The first reason is that the mm() function throws an error message when only one grouping variable is supplied, as it expects to have multiple grouping variables to define the multiple memberships. The second and more important reason is that the weights applied by mm() ultimately don't go into the necessary part of the model: mm() weights go into the linear predictor mu rather than affecting the prior for a random effect. So unfortunately mm() can't be hacked to implement this.

@paul-buerkner
Copy link
Owner

Makes sense, so we need an actual new feature for this. At least the user-interface (something like gr(g, weights = w_g) ) seems easy enough. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants