Add a data dictionary #315

jeancochrane · 2025-01-08T21:34:09Z

This PR turns the "Features Used" table in the README into a proper data dictionary by making two changes:

- Add a new column, Variable Name, that lists the name of the variable as it appears in the model code
- Save a version of the table to CSV in docs/data-dict.csv

We also rename the Notes column to Description for clarity, and move it to the left in the table so that it's more prominent.

Note that this PR does not create historical dictionaries for past models. My expectation is that we will keep docs/data-dict.csv up to date with the most recent version of the parameter file, and then in the future we can back out the data dict that we used for past models by referencing the version of docs/data-dict.csv that existed at the time of the yearly model tag.

If this change looks good, I'll go ahead and copy it to the condo model to address ccao-data/model-condo-avm#72.

Closes #300.

jeancochrane · 2025-01-08T21:36:08Z

README.Rmd

@@ -231,10 +231,13 @@ Model accuracy for each parameter combination is measured on a validation set us

 ### Features Used

-The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`.
+The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.


The only change to this paragraph is removing this line:

> Most of them are in use in the model as of `r Sys.Date()`.

The first half of this line seems inaccurate to me (all of these features are in use in the model) and not particularly helpful (this table represents the most recent version of the parameters, so the date is not useful). Happy to keep one or both of these pieces of info if you think there's a good reason for them, though.

jeancochrane · 2025-01-08T21:37:38Z

README.Rmd

+param_tbl_fmt <- param_tbl %>%
+  mutate(description = param_notes) %>%
+  left_join(
+    ccao::vars_dict,
+    by = c("value" = "var_name_model")


I switched this up so that the params are the left side of the join here, which feels more intuitive to me than vars_dict being the left side. We also use a left join so that we'll preserve all of the parameters, even if one happens to be misdocumented in vars_dict in the future.
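To illustrate the join direction described above, here is a minimal sketch using small made-up tables. The rows and the `new_feature` parameter are hypothetical stand-ins, not the real contents of `param_tbl` or `ccao::vars_dict`.

```r
library(dplyr)

# Hypothetical stand-ins for the parameter table and ccao::vars_dict;
# the real objects have more rows and columns
param_tbl <- data.frame(
  value = c("char_yrblt", "char_bldg_sf", "new_feature")
)
vars_dict <- data.frame(
  var_name_model = c("char_yrblt", "char_bldg_sf"),
  var_name_pretty = c("Year Built", "Building Square Feet")
)

# Params on the left: every parameter survives the join, and any
# parameter missing from the dictionary gets NA in the joined columns
joined <- param_tbl %>%
  left_join(vars_dict, by = c("value" = "var_name_model"))

# Undocumented parameters are easy to spot afterwards
undocumented <- joined %>%
  filter(is.na(var_name_pretty)) %>%
  pull(value)
```

With an inner join, or with the dictionary on the left, `new_feature` would silently drop out of the table instead of showing up with missing metadata.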

jeancochrane · 2025-01-08T21:38:33Z

README.Rmd

+    feature_name = var_name_pretty,
+    variable_name = value,
+    description,
+    category = var_type,
+    type = var_data_type,


I think it's clearer for the CSV to use lowercase and underscored column names, so we start with those and then reformat them when rendering the table to the README.
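A rough sketch of that two-step naming, assuming a one-row table with made-up values. The `to_header()` helper is hypothetical and may not match how the README chunk actually formats its headers.

```r
library(dplyr)

# Hypothetical one-row table using the CSV-style column names from the diff
param_tbl_fmt <- data.frame(
  feature_name = "Year Built",
  variable_name = "char_yrblt",
  description = "Year the property was constructed",
  category = "Characteristic",
  type = "numeric"
)

# Hypothetical helper: split snake_case on underscores, capitalize
# each word, and join the words with spaces
to_header <- function(x) {
  words <- strsplit(x, "_", fixed = TRUE)
  vapply(
    words,
    function(w) {
      paste0(toupper(substring(w, 1, 1)), substring(w, 2), collapse = " ")
    },
    character(1)
  )
}

# Rename for display only; param_tbl_fmt keeps its machine-friendly names
readme_tbl <- param_tbl_fmt %>%
  rename_with(to_header)
```

Keeping the rename as the last step means the CSV and the README table are built from the same underlying data frame.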

jeancochrane · 2025-01-08T21:40:51Z

README.Rmd

+# Write machine-readable version of the table to file
+param_tbl_fmt %>%
+  write_csv("docs/data-dict.csv")


This seemed to me like the simplest way to keep the data dict up to date: Any time we render the README, we'll save the data dict to the file. If model parameters haven't changed, the data dict file won't change, and there won't be a diff; otherwise, there will be a diff and the code author will be prompted to commit it. Not the most airtight system, but I figure it's probably a good enough starting place. Let me know if you have other ideas!
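The "no change, no diff" reasoning can be checked with a small sketch. The table contents and temp-file paths below are hypothetical; only `write_csv()` comes from the diff itself.

```r
library(readr)

# Hypothetical stand-in for the formatted parameter table
param_tbl_fmt <- data.frame(
  feature_name = c("Year Built", "Building Square Feet"),
  variable_name = c("char_yrblt", "char_bldg_sf")
)

# Write the same table twice, as two successive README renders would
first_render <- tempfile(fileext = ".csv")
second_render <- tempfile(fileext = ".csv")
write_csv(param_tbl_fmt, first_render)
write_csv(param_tbl_fmt, second_render)

# Identical checksums mean git would see no diff between renders
unchanged <- unname(tools::md5sum(first_render)) ==
  unname(tools::md5sum(second_render))

# Adding a parameter changes the file, which surfaces as a diff to commit
param_tbl_new <- rbind(
  param_tbl_fmt,
  data.frame(feature_name = "Land Square Feet", variable_name = "char_land_sf")
)
write_csv(param_tbl_new, second_render)
changed <- unname(tools::md5sum(first_render)) !=
  unname(tools::md5sum(second_render))
```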

Perhaps out of scope for now, but we could also consider adding a pre-commit check similar to readme-rmd-rendered that compares the params in params.yaml to the params in this file to make sure they match. I'm happy to take a crack at that now if you think it's a good idea.

The pre-commit hook was pretty straightforward so I went ahead and implemented it in 46c163b.
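The core comparison in a check like this could look roughly like the sketch below. This is not the contents of R/hooks/check-data-dict.R: the function name `check_data_dict()` and the example inputs are hypothetical, and how the feature names are read out of params.yaml and docs/data-dict.csv is left out because it depends on the layout of those files.

```r
# Hypothetical helper: compare the feature names from the params file
# against the variable names in the data dict and describe any mismatch
check_data_dict <- function(param_names, dict_names) {
  missing_from_dict <- setdiff(param_names, dict_names)
  extra_in_dict <- setdiff(dict_names, param_names)

  problems <- character(0)
  if (length(missing_from_dict) > 0) {
    problems <- c(problems, paste(
      "Params missing from data dict:",
      paste(missing_from_dict, collapse = ", ")
    ))
  }
  if (length(extra_in_dict) > 0) {
    problems <- c(problems, paste(
      "Data dict entries not in params:",
      paste(extra_in_dict, collapse = ", ")
    ))
  }
  problems
}

# Made-up inputs; a real hook would load these from the two files
problems <- check_data_dict(
  param_names = c("char_yrblt", "char_bldg_sf"),
  dict_names = c("char_yrblt", "char_land_sf")
)

# A real hook would exit with a non-zero status here so the commit fails
if (length(problems) > 0) {
  message(paste(problems, collapse = "\n"))
}
```

Checking both directions catches a stale dictionary row as well as a newly added parameter that has not been documented yet.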

jeancochrane · 2025-01-08T21:41:36Z

README.md

+For a machine-readable version of this data dictionary, see
+[`docs/data-dict.csv`](./docs/data-dict.csv).
+
+| Feature Name                                                            | Variable Name                                         | Description                                                                                                                                           | Category       | Possible Values                                                      |


Big diff here, mainly because I've rearranged the column order. I double-checked to make sure the parameter count matches params.yaml.

jeancochrane · 2025-01-08T21:42:00Z

README.md

+5.  Run `renv::activate(profile = "default")` if you would like to
+    switch back to the default renv profile


This seems like an unrelated change due to autoformat.

wrridgeway · 2025-01-08T22:41:14Z

This is awesome. Question for anyone - if someone wanted to grab training data and use it to feed a python model, would they have the info they need to recode features/apply variable labels currently?

jeancochrane · 2025-01-09T19:27:15Z

> if someone wanted to grab training data and use it to feed a python model, would they have the info they need to recode features/apply variable labels currently?

Yes, the development version of the Python ccao package has both vars_recode and vars_rename. However, it's under active development and we haven't used it in production yet, so I wouldn't recommend this path to anyone quite yet.

jeancochrane · 2025-01-10T17:34:32Z

.pre-commit-config.yaml

+      - id: check-data-dict
+        name: Data dictionary must be up to date with params file
+        entry: Rscript R/hooks/check-data-dict.R
+        files: (^|/)((params\.yaml)|(data-dict\.csv))$
+        language: r
+        additional_dependencies:
+          - yaml


I also considered adding a hook to make sure that ccao::vars_dict contains every feature in the param file, and that the dbt DAG has a description for every feature. This would help guard against a situation where we neglect to update the table property, but it has the downside of enforcing a commit-level check that we could only resolve by updating external dependencies, which I expect would lower our velocity during modeling. We could instead think about making this a CI check that only runs on tags, but I've skipped that here since it feels out of scope.
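As a side note on the `files` pattern in the hook config quoted at the top of this comment, the sketch below shows which paths it selects. pre-commit matches `files` as a Python regular expression searched against each path; the R call here uses a Perl-compatible engine, which should behave the same for this simple pattern. The example paths are made up.

```r
# The files pattern from the hook config, with backslashes escaped for R
files_pattern <- "(^|/)((params\\.yaml)|(data-dict\\.csv))$"

# Made-up paths to show which ones would trigger the hook
paths <- c(
  "params.yaml",
  "docs/data-dict.csv",
  "README.Rmd",
  "my-params.yaml",
  "docs/data-dict.csv.bak"
)

# TRUE where the hook would run for a staged change to that path
triggers_hook <- grepl(files_pattern, paths, perl = TRUE)
names(triggers_hook) <- paths
```

The `(^|/)` prefix and `$` suffix anchor the match to a whole file name, so a file such as `my-params.yaml` does not trigger the hook.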

jeancochrane · 2025-01-10T17:36:01Z

Currently blocked by ccao-data/data-architecture#704. Once that comes in I'll do a final pass at rendering the README and data dict and re-request review.

jeancochrane · 2025-01-10T22:36:51Z

Now blocked by ccao-data/actions#36.

jeancochrane added 2 commits January 8, 2025 18:15

Add variable names and reformat README data dictionary for better readability

8328e9c

Add data dict to repo

65cd2eb

jeancochrane linked an issue Jan 8, 2025 that may be closed by this pull request

Missing Data Dictionary in readme #300

Open

jeancochrane commented Jan 8, 2025

jeancochrane marked this pull request as ready for review January 8, 2025 21:48

jeancochrane requested review from dfsnow and wrridgeway as code owners January 8, 2025 21:48

Add pre-commit hook to check that the data dict is up to date with params

46c163b

jeancochrane added 4 commits January 9, 2025 19:58

Better error msg and dependencies in check-data-dict.R pre-commit hook

7994e4f

Merge branch '2025-assessment-year' into jeancochrane/300-missing-data-dictionary-in-readme

3f33931

Update data dict and README to use latest feature info

5fb7b56

Merge branch '2025-assessment-year' into jeancochrane/300-missing-data-dictionary-in-readme

69d176e

jeancochrane mentioned this pull request Jan 10, 2025

Clean up descriptions for parcel shape features so they are readable in model feature table ccao-data/data-architecture#704

Merged

Remove corner lot indicator from README and data dict

ffbc73d

jeancochrane commented Jan 10, 2025

jeancochrane marked this pull request as draft January 10, 2025 17:34

jeancochrane added 4 commits January 10, 2025 20:22

Fix descriptions for parcel shape characteristics in docs and data dict

2e300ec

Set up tmate session in pre-commit workflow to debug cache problems

95db04c

Revert "Set up tmate session in pre-commit workflow to debug cache problems"

ad157a1

This reverts commit 95db04c.

Pin to branch of pre-commit action to test improved R caching

87ba713

jeancochrane mentioned this pull request Jan 10, 2025

Update pre-commit action to cache R additional_dependencies ccao-data/actions#36

Open

Revert "Pin to branch of pre-commit action to test improved R caching"

7a88504

This reverts commit 87ba713.
