Add a data dictionary #315

Merged
16 commits
8328e9c
Add variable names and reformat README data dictionary for better rea…
jeancochrane Jan 8, 2025
65cd2eb
Add data dict to repo
jeancochrane Jan 8, 2025
46c163b
Add pre-commit hook to check that the data dict is up to date with pa…
jeancochrane Jan 9, 2025
7994e4f
Better error msg and dependencies in check-data-dict.R pre-commit hook
jeancochrane Jan 9, 2025
3f33931
Merge branch '2025-assessment-year' into jeancochrane/300-missing-dat…
jeancochrane Jan 9, 2025
5fb7b56
Update data dict and README to use latest feature info
jeancochrane Jan 9, 2025
69d176e
Merge branch '2025-assessment-year' into jeancochrane/300-missing-dat…
jeancochrane Jan 10, 2025
ffbc73d
Remove corner lot indicator from README and data dict
jeancochrane Jan 10, 2025
2e300ec
Fix descriptions for parcel shape characteristics in docs and data dict
jeancochrane Jan 10, 2025
95db04c
Set up tmate session in pre-commit workflow to debug cache problems
jeancochrane Jan 10, 2025
ad157a1
Revert "Set up tmate session in pre-commit workflow to debug cache pr…
jeancochrane Jan 10, 2025
87ba713
Pin to branch of pre-commit action to test improved R caching
jeancochrane Jan 10, 2025
7a88504
Revert "Pin to branch of pre-commit action to test improved R caching"
jeancochrane Jan 10, 2025
dc1ce6e
Add call to action to error message in check-data-dict pre-commit hook
jeancochrane Jan 13, 2025
e3eca6c
Add links to more useful resources in the Features Used section of th…
jeancochrane Jan 13, 2025
b8684fb
Add encoded values to Features Used table in README and data dict
jeancochrane Jan 13, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ cache/
 *.rds
 *.zip
 *.csv
+!docs/data-dict.csv
 *.xlsx
 *.xlsm
 *.html
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -27,3 +27,10 @@ repos:
         entry: Cannot commit .Rhistory, .RData, .Rds or .rds.
         language: fail
         files: '\.(Rhistory|RData|Rds|rds)$'
+      - id: check-data-dict
+        name: Data dictionary must be up to date with params file
+        entry: Rscript R/hooks/check-data-dict.R
+        files: (^|/)((params\.yaml)|(data-dict\.csv))$
+        language: r
+        additional_dependencies:
+          - yaml
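
Note on the `files` pattern in the new hook: pre-commit treats `files` as a Python regular expression matched against each staged path. The sketch below checks the same pattern with R's `grepl()`, on the assumption that a pattern this simple behaves identically in both engines. The example paths are made up for illustration.

```r
# Pattern copied from the check-data-dict hook's `files` key
pattern <- "(^|/)((params\\.yaml)|(data-dict\\.csv))$"

# Hypothetical staged paths, for illustration only
paths <- c(
  "params.yaml",
  "docs/data-dict.csv",
  "README.Rmd",
  "params.yaml.bak",
  "docs/old-data-dict.csv"
)

# TRUE where a change to the path would trigger the hook
triggers_hook <- grepl(pattern, paths, perl = TRUE)
names(triggers_hook) <- paths
print(triggers_hook)
```

The `(^|/)` prefix anchors the match to a full file name, so files such as `old-data-dict.csv` do not trigger the hook, and the trailing `$` excludes names with extra suffixes.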
34 changes: 34 additions & 0 deletions R/hooks/check-data-dict.R
@@ -0,0 +1,34 @@
+#!/usr/bin/env Rscript
+# Script to check that the data dictionary file is up to date with the
+# latest feature set
+library(yaml)
+
+params_filename <- "params.yaml"
+data_dict_filename <- "docs/data-dict.csv"
+
+params <- read_yaml(params_filename)
+data_dict <- read.csv(data_dict_filename)
+
+symmetric_diff <- c(
+  setdiff(data_dict$variable_name, params$model$predictor$all),
+  setdiff(params$model$predictor$all, data_dict$variable_name)
+)
+symmetric_diff_len <- length(symmetric_diff)
+
+if (symmetric_diff_len > 0) {
+  err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are")
+  err_msg <- paste0(
+    err_msg_prefix,
+    " not present in both ",
+    params_filename,
+    " and ",
+    data_dict_filename,
+    ": ",
+    paste(symmetric_diff, collapse = ", "),
+    ". ",
+    "Did you forget to reknit README.Rmd after updating ",
+    params_filename,
+    "?"
+  )
+  stop(err_msg)
+}
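
Note on the check in this script: it boils down to a symmetric set difference between two lists of variable names. A minimal sketch of that logic on its own, using placeholder names rather than real model features (the actual hook reads the names from `params.yaml` and `docs/data-dict.csv`):

```r
# Placeholder variable names, for illustration only
params_vars <- c("feature_a", "feature_b", "feature_c")
data_dict_vars <- c("feature_a", "feature_b", "feature_d")

# Names present in one source but not the other
symmetric_diff <- c(
  setdiff(data_dict_vars, params_vars),
  setdiff(params_vars, data_dict_vars)
)
print(symmetric_diff)

# An empty result means the two sources agree and the hook would pass
in_sync <- length(symmetric_diff) == 0
```

Because the difference is taken in both directions, the check catches a feature added to the params file without a data dictionary entry as well as a stale dictionary entry for a feature that has been removed.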
72 changes: 53 additions & 19 deletions README.Rmd
@@ -231,10 +231,11 @@ Model accuracy for each parameter combination is measured on a validation set us
 
 ### Features Used
 
-The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`.
+The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.
 
 ```{r feature_guide, message=FALSE, results='asis', echo=FALSE}
 library(dplyr)
+library(readr)
 library(tidyr)
 library(yaml)
 library(jsonlite)
@@ -316,38 +317,71 @@ param_notes <- param_tbl$value %>%
   )) %>%
   unlist()
 
-ccao::vars_dict %>%
-  inner_join(
-    param_tbl %>% mutate(description = param_notes),
-    by = c("var_name_model" = "value")
+param_tbl_fmt <- param_tbl %>%
+  mutate(description = param_notes) %>%
+  left_join(
+    ccao::vars_dict,
+    by = c("value" = "var_name_model")
   ) %>%
   group_by(var_name_pretty) %>%
   mutate(row = paste0("X", row_number())) %>%
   distinct(
-    `Feature Name` = var_name_pretty,
-    Category = var_type,
-    Type = var_data_type,
-    Notes = description,
-    var_value, row
+    feature_name = var_name_pretty,
+    variable_name = value,
+    description,
+    category = var_type,
+    type = var_data_type,
+    var_code, var_value, row
   ) %>%
-  mutate(Category = recode(
-    Category,
+  mutate(category = recode(
+    category,
     char = "Characteristic", acs5 = "ACS5", loc = "Location",
     prox = "Proximity", ind = "Indicator", time = "Time",
-    meta = "Meta", other = "Other", ccao = "Other"
+    meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
   )) %>%
   pivot_wider(
-    id_cols = `Feature Name`:`Notes`,
+    id_cols = `feature_name`:`category`,
     names_from = row,
-    values_from = var_value
+    values_from = c(var_code, var_value)
+  ) %>%
+  unite(
+    "possible_codes",
+    starts_with("var_code_X"),
+    sep = ", ",
+    na.rm = TRUE
+  ) %>%
+  unite(
+    "possible_values",
+    starts_with("var_value_X"),
+    sep = ", ",
+    na.rm = TRUE
+  ) %>%
+  mutate(description = replace_na(description, "")) %>%
+  arrange(category)
+
+# Write machine-readable version of the table to file
+param_tbl_fmt %>%
+  write_csv("docs/data-dict.csv")
Comment on lines +362 to +364

jeancochrane (Contributor Author), Jan 8, 2025:

This seemed to me like the simplest way to keep the data dict up to date: Any time we render the README, we'll save the data dict to the file. If model parameters haven't changed, the data dict file won't change, and there won't be a diff; otherwise, there will be a diff and the code author will be prompted to commit it. Not the most airtight system, but I figure it's probably a good enough starting place. Let me know if you have other ideas!

Perhaps out of scope for now, but we could also consider adding a pre-commit check similar to readme-rmd-rendered that compares the params in params.yaml to the params in this file to make sure they match. I'm happy to take a crack at that now if you think it's a good idea.

Contributor Author:

The pre-commit hook was pretty straightforward so I went ahead and implemented it in 46c163b.

Member:

I don't love that we've ended up with a system where we need to sync four separate things: ccao::vars_dict, params.yaml, docs/data-dict.csv, and the README. I agree this is a good simple solution for now though. Let's roll with it and worry about something better/more long-term once 2025 modeling is finished.

Contributor Author:

I agree, it's confusing and brittle to maintain. I opened #324 to keep track of this work so that we can pick it up once we're done with modeling.
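
Note on the approach discussed in this thread: it relies on the write step being deterministic, so that re-rendering with unchanged inputs leaves the file byte-for-byte the same and git reports no diff. A minimal sketch of that property, assuming a toy table and base R's `write.csv()` in place of the `readr::write_csv()` call used in the PR:

```r
# Toy stand-in for the data dictionary table (placeholder rows)
dict <- data.frame(
  variable_name = c("feature_a", "feature_b"),
  category = c("Characteristic", "Location")
)

path <- tempfile(fileext = ".csv")

# First render: write the table and record the file contents
write.csv(dict, path, row.names = FALSE)
before <- readLines(path)

# Re-render with unchanged inputs: contents are identical, so git sees no diff
write.csv(dict, path, row.names = FALSE)
unchanged <- identical(before, readLines(path))

# Re-render after a feature is added: contents differ, so git shows a diff
dict_new <- rbind(
  dict,
  data.frame(variable_name = "feature_c", category = "Proximity")
)
write.csv(dict_new, path, row.names = FALSE)
changed <- !identical(before, readLines(path))
```

The weak point, as the thread acknowledges, is that nothing forces the author to re-render. That is the gap the `check-data-dict` hook is meant to cover.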


+# Render human-readable version of the table to the doc
+param_tbl_fmt %>%
+  rename(
+    "Feature Name" = "feature_name",
+    "Variable Name" = "variable_name",
+    "Description" = "description",
+    "Category" = "category",
+    "Possible Values (Encoded)" = "possible_codes",
+    "Possible Values (Semantic)" = "possible_values",
   ) %>%
-  unite("Possible Values", starts_with("X"), sep = ", ", na.rm = TRUE) %>%
-  mutate(Notes = replace_na(Notes, "")) %>%
-  arrange(Category) %>%
-  relocate(Notes, .after = everything()) %>%
   knitr::kable(format = "markdown")
 ```
 
+We maintain a few useful resources for working with these features:
+
+- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
+- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [residential model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_card_res_input) which is the source of our training data.
+- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
+
 #### Data Sources
 
 We rely on numerous third-party sources to add new features to our data. These features are used in the primary valuation model and thus need to be high-quality and error-free. A non-exhaustive list of features and their respective sources includes:
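
Note on the first bullet added to the README in this diff (filtering input data down to model features using the data dictionary): a minimal base-R sketch of one way to read that suggestion. It assumes the input data is a wide table with one column per variable, in which case the "join" amounts to keeping the columns whose names appear in `variable_name`. Both tables are toy stand-ins with placeholder names.

```r
# Toy stand-in for docs/data-dict.csv (placeholder variable names)
data_dict <- data.frame(
  variable_name = c("feature_a", "feature_b"),
  feature_name = c("Feature A", "Feature B")
)

# Toy stand-in for the input data, including columns the model does not use
input_data <- data.frame(
  id = 1:3,
  feature_a = c(10, 20, 30),
  feature_b = c("x", "y", "z"),
  unused_col = c(TRUE, FALSE, TRUE)
)

# Keep only the columns listed in the data dictionary
model_cols <- intersect(names(input_data), data_dict$variable_name)
model_features <- input_data[, model_cols, drop = FALSE]
print(names(model_features))
```

In real use, `data_dict` would presumably come from `read.csv("docs/data-dict.csv")` instead of being built inline.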