love.plot() option to remove missing indicators of categorical variables #89

Open
pasahe opened this issue Nov 7, 2024 · 4 comments

@pasahe

pasahe commented Nov 7, 2024

Dear @ngreifer,

I have a data set where some variables have some missing values. When I use the love.plot() function, it adds a variable VAR: <NA> for each categorical variable that has missing values. Is there a way to exclude these missing indicators from the plot? If there's no way to do this, I think it would be a nice feature to add to the function. It's really hard to exclude them afterwards in ggplot.

Thank you very much!

@ngreifer
Owner

ngreifer commented Nov 7, 2024

To remove them, you should first create a bal.tab object and remove the values from the balance table, then call love.plot() on the modified bal.tab object. See below for how to do this.

b <- bal.tab(W, un = TRUE)
b$Balance <- b$Balance[!endsWith(rownames(b$Balance), "<NA>"),]
love.plot(b)

Fortunately it's not too hard (one or two extra lines of code) so I don't plan on adding this as an option. But at least you don't need to modify the ggplot object and can retain all of the options bal.tab() and love.plot() provide. If you have more complicated data (e.g., clustered or multicategory) you will have to do a little more work to exclude the desired values from each balance table.
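For the case with several balance tables, the same row filter can be applied to each table in turn. The following is only a sketch in base R: the thread does not show where the per-cluster tables sit inside a clustered bal.tab object, so the list tabs, the helper make_tab(), the helper drop_na_rows(), and all values are hypothetical stand-ins rather than cobalt structures.

```r
# Mock stand-ins for per-cluster balance tables (hypothetical values,
# not real cobalt output). Row names ending in "<NA>" mimic the
# missingness indicators discussed in this thread.
make_tab <- function(vars) {
  data.frame(Type = "Binary",
             Diff.Un = seq_along(vars) / 10,
             row.names = vars,
             stringsAsFactors = FALSE)
}
tabs <- list(
  A = make_tab(c("age", "race_black", "race:<NA>", "re74:<NA>")),
  B = make_tab(c("age", "race_black", "race:<NA>"))
)

# Hypothetical helper: keep only rows whose names do not end in "<NA>".
# drop = FALSE keeps the result a data frame even if one column remains.
drop_na_rows <- function(tab) {
  tab[!endsWith(rownames(tab), "<NA>"), , drop = FALSE]
}

# Apply the same filter to every table in the list.
tabs_clean <- lapply(tabs, drop_na_rows)
lapply(tabs_clean, rownames)
```

With a real object, the filtered tables would have to be written back into the bal.tab object before calling love.plot(), which is the "little more work" mentioned above.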

@pasahe
Author

pasahe commented Nov 8, 2024

Thanks, that's a nice workaround! Unfortunately, when I try to add the labels in the love plot with var.names, it gives an error because it expects the <NA> indicators. See this behavior in the following reproducible example:

library(cobalt)
v <- data.frame(old = c("age", "educ", "race_black", "race_hispan", 
                        "race_white", "married", "nodegree", "re74", "re75", "distance"),
                new = c("Age", "Years of Education", "Black", 
                        "Hispanic", "White", "Married", "No Degree Earned", 
                        "Earnings 1974", "Earnings 1975", "Propensity Score"))
covs <- subset(lalonde_mis, select = -c(treat, re78, nodegree, married))
b <- bal.tab(treat ~ covs, data = lalonde_mis)
b$Balance <- b$Balance[!endsWith(rownames(b$Balance), "<NA>"),]
love.plot(b, var.names = v)

The last line fails with an error in the assignment old_levels[idx] <- names(new_levels).

@ngreifer
Owner

ngreifer commented Nov 8, 2024

Ah, that's annoying. I'll work on making a fix for this. In the meantime, you can just change the row names of b$Balance to the new names and then omit the var.names argument to love.plot(). That doesn't allow you to use the feature whereby the stem of a set of dummy variables is replaced by a common new stem, but it does allow you to manually change the name of each variable. So for example, I believe you could add

# Look up each balance-table row name in v$old; rows without a match keep their name
idx <- match(rownames(b$Balance), v$old)
rownames(b$Balance)[!is.na(idx)] <- v$new[idx[!is.na(idx)]]

before supplying it to love.plot(). Obviously this is getting into hacky territory, which is why I think an automatic solution would be necessary.

@pasahe
Author

pasahe commented Nov 11, 2024

Yes, that would be a workaround for this reproducible example. It wouldn't work for my particular case when dealing with categorical variables, because in var.names I have the old and new labels for the entire variable, not for all categories. For example, instead of having all the labels for all the categories in race, I only have the label for the variable name:

v <- data.frame(old = c("age", "educ", "race", "married", "nodegree", "re74", "re75", "distance"),
                new = c("age", "years of education", "Race", "married", "no degree", 
                        "Income 1974", "Income 1975", "Propensity Score"))

But don't worry, I'll tweak the code to print the correct labels. It would be great if a solution for this scenario were added to the package in the near future.

Thank you very much!
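One way the stem-level relabeling described in this comment could be done by hand is sketched below in base R. It is not cobalt's var.names mechanism: the helper relabel_stems() is hypothetical, the row names in rn are a mock of what the balance table might contain once the <NA> rows are dropped, and the "_" separator between stem and level is assumed from the race_black style names used earlier in the thread.

```r
# Stem-level label map, a subset of the one in the comment above.
v <- data.frame(old = c("age", "educ", "race", "re74"),
                new = c("age", "years of education", "Race", "Income 1974"),
                stringsAsFactors = FALSE)

# Mock row names (assumed, not taken from real output).
rn <- c("age", "educ", "race_black", "race_hispan", "race_white", "re74")

# Hypothetical helper: replace an old stem with its new label.
# Exact matches are replaced outright; names of the form "<stem>_<level>"
# keep their "_<level>" suffix. Matching is done on the original names (x)
# so that an already-relabeled name is never matched a second time.
relabel_stems <- function(x, map, sep = "_") {
  out <- x
  for (i in seq_len(nrow(map))) {
    old <- map$old[i]
    exact  <- x == old
    prefix <- startsWith(x, paste0(old, sep))
    out[exact]  <- map$new[i]
    out[prefix] <- paste0(map$new[i], substring(x[prefix], nchar(old) + 1))
  }
  out
}

relabel_stems(rn, v)
```

Under these assumptions, the result could be assigned back with rownames(b$Balance) <- relabel_stems(rownames(b$Balance), v) before calling love.plot(b) without var.names.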
