Add comps map to individual PIN report (#181)
* Add intermediate leaf_node output to interpret pipeline stage

* Add python get_comps function for computing comps from leaf node assignments

* Flesh out comp calculation

* Add Python requirements to renv environment

* Make sure assessment data is loaded in interpret stage when comp_enable is TRUE

* Continue with comps debugging

* Refactor and test get_comps logic

* Clean up comments and extraneous debugging code ahead of testing

* Temporarily set comp_enable=TRUE for the purposes of testing comps

* Satisfy pre-commit

* Remove num_iteration arg from predict() in comp calculation

* Make sure requirements.txt is copied into image before installing R dependencies

* Install python3-venv in Dockerfile

* Pass n=20 to get_comps correctly in 04-interpret.R

* Temporarily slim down training set to test comp calculation

* Wrap get_comps() call in tryCatch in interpret pipeline stage for better error logging

* Test raising an error from python/comps.py

* Remove temporary error in python/comps.py

* Swap arg order in _get_similarity_matrix to confirm numba error message

* Revert "Swap arg order in _get_similarity_matrix to confirm numba error message"

This reverts commit 5beefd5.

* Raise error in interpret stage if get_comps fails

* Revert "Temporarily slim down training set to test comp calculation"

This reverts commit e27581f.

* Try refactoring comps.py for less memory use

* Get comps working locally with less memory intensive algorithm

* Use sales to generate comps

* Instrument python/comps.py with logging and temporarily remove numba decorator

* Instrument interpret comps stage with more logging and skip feature importance for now

* Bump vcpu and memory in build-and-run-model to take full advantage of 10xlarge instance

* Add some logging to try to determine whether record_evals are being saved properly

* Add extra logging to extract_weights function to debug empty weights vector

* Pin lightsnip to jeancochrane/record-evals branch

* Remove debug logs from comps and tree weights extraction functions

* njit _get_top_n_comps

* Revert "Remove debug logs from comps and tree weights extraction functions"

This reverts commit 6d82d5b.

* Print record_evals length in train stage for debugging

* Add some more debug logging to train stage

* Switch to save_tree_error instead of valids arg in lightgbm model definition

* Update lightsnip to latest working version

* More fixes for comps

* Try removing parallelism from _get_top_n_comps

* Enable parallelization for comps algorithm

* Temporarily write comps inputs out to file for testing

* Reduce vcpu/memory in build-and-run-model to see if it provisions smaller instance

* Transpose weights in get_comps and add debug script

* Remove debugging utilities from comps pipeline ahead of final test

* Appease pre-commit

* Add back empty line in 04-interpret.R that got accidentally deleted

* Try jeancochrane/restrict-instance-types-in-build-and-run-batch-job branch for build-and-run-model workflow

* Switch back to m4.10xlarge instance sizing in build-and-run-model

* Add progress logging to comps.py

* Switch back to main branch of build-and-run-batch-job

* Switch to bare iteration rather than vector operations for producing similarity scores in comps.py

* Run comps against binned data to speed up python/comps.py

* Log price ranges in python/comps.py

* Update comps pipeline to work with sales chunking

* Qualify package for rownames_to_column in interpret pipeline stage

* Skip comps bin when no observations are placed in that bin in python/comps.py

* Small cleanup to python/comps.py

* Fix partitioning for comps pipeline

* Fix typo in comps pipeline

* Add comps to individual PIN report

* Cleanup comps map

---------

Co-authored-by: Dan Snow <[email protected]>
jeancochrane and dfsnow authored Jan 28, 2024
1 parent df5b89c commit aed2d12
Showing 5 changed files with 130 additions and 11 deletions.
4 changes: 2 additions & 2 deletions README.Rmd
@@ -790,9 +790,9 @@ Our goal in maintaining multiple lockfiles is to keep the list of dependencies r

### Using Lockfiles for Local Development

-When working on the model locally, you'll typically want to install non-core dependencies _on top of_ the core dependencies. To do this, simply run `renv::restore("<path_to_lockfile>")` to install all dependencies from the lockfile.
+When working on the model locally, you'll typically want to install non-core dependencies _on top of_ the core dependencies. To do this, simply run `renv::restore(lockfile = "<path_to_lockfile>")` to install all dependencies from the lockfile.

-For example, if you're working on the `ingest` stage and want to install all its dependencies, start with the main profile (run `renv::activate()`), then install the `dev` profile dependencies on top of it (run `renv::restore("renv/profiles/dev/renv.lock")`).
+For example, if you're working on the `ingest` stage and want to install all its dependencies, start with the main profile (run `renv::activate()`), then install the `dev` profile dependencies on top of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).

> :warning: WARNING: Installing dependencies from a dev lockfile will **overwrite** any existing version installed by the core one. For example, if `[email protected]` is installed by the core lockfile, and `[email protected]` is installed by the dev lockfile, renv will **overwrite** `[email protected]` with `[email protected]`.
21 changes: 12 additions & 9 deletions README.md
@@ -310,13 +310,16 @@ proportion, including:

| LightGBM Parameter | CV Search Range | Parameter Description |
|:---------------------------------------------------------------------------------------------------|:----------------|:-----------------------------------------------------------------------------------|
-| [num_leaves](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_leaves) | 50 - 2000 | Maximum number of leaves in each tree. Main parameter to control model complexity. |
+| [num_iterations](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_iterations) | 100 - 2500 | Number of boosting iterations (trees) to build. |
+| [learning_rate](https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning_rate) | -3 - -0.4 | Boosting learning rate. The CV search range shown is on a log10 scale. |
+| [max_bin](https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_bin) | 50 - 512 | Maximum number of bins used to bucket continuous features |
+| [num_leaves](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_leaves) | 32 - 2048 | Maximum number of leaves in each tree. Main parameter to control model complexity. |
| [add_to_linked_depth](https://ccao-data.github.io/lightsnip/reference/train_lightgbm.html) | 1 - 7 | Amount to add to `max_depth` if linked to `num_leaves`. See `max_depth`. |
| [feature_fraction](https://lightgbm.readthedocs.io/en/latest/Parameters.html#feature_fraction) | 0.3 - 0.7 | The random subset of features selected for a tree, as a percentage. |
| [min_gain_to_split](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_gain_to_split) | 0.001 - 10000 | The minimum gain needed to create a split. |
-| [min_data_in_leaf](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_in_leaf) | 2 - 300 | The minimum data in a single tree leaf. Important to prevent over-fitting. |
+| [min_data_in_leaf](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_in_leaf) | 2 - 400 | The minimum data in a single tree leaf. Important to prevent over-fitting. |
| [max_cat_threshold](https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_cat_threshold) | 10 - 250 | Maximum number of split points for categorical features |
-| [min_data_per_group](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_per_group) | 4 - 300 | Minimum number of observations per categorical group |
+| [min_data_per_group](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_per_group) | 2 - 400 | Minimum number of observations per categorical group |
| [cat_smooth](https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_smooth) | 10 - 200 | Categorical smoothing. Used to reduce noise. |
| [cat_l2](https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_l2) | 0.001 - 100 | Categorical-specific L2 regularization |
| [lambda_l1](https://lightgbm.readthedocs.io/en/latest/Parameters.html#lambda_l1) | 0.001 - 100 | L1 regularization |
@@ -350,7 +353,7 @@ districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-
and many others. The features in the table below are the ones that made
the cut. They’re the right combination of easy to understand and impute,
powerfully predictive, and well-behaved. Most of them are in use in the
-model as of 2023-12-20.
+model as of 2024-01-22.

| Feature Name | Category | Type | Possible Values | Notes |
|:------------------------------------------------------------------------|:---------------|:------------|:-----------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------|
@@ -1112,8 +1115,7 @@ following stages:
The entire end-to-end pipeline can also be run using
[DVC](https://dvc.org/). DVC will track the dependencies and parameters
required to run each stage, cache intermediate files, and store
-versioned input data on S3. The packages `dvc` and `dvc[s3]` are required
-for the following commands.
+versioned input data on S3.

To pull all the necessary input data based on the information in
`dvc.lock`, run:
@@ -1232,6 +1234,7 @@ Uploaded Parquet files are converted into the following Athena tables:
|:---------------------|:-----------------------------------|:-----------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
| assessment_card | card | year, run_id, township_code, meta_pin, meta_card_num | Assessment results at the card level AKA raw model output |
| assessment_pin | pin | year, run_id, township_code, meta_pin | Assessment results at the PIN level AKA aggregated and cleaned |
+| comp | card | year, run_id, meta_pin, meta_card_num | Comparables for each card (computed using leaf node assignments) |
| feature_importance | predictor | year, run_id, model_predictor_all_name | Feature importance values (gain, cover, and frequency) for the run |
| metadata | model run | year, run_id | Information about each run, including parameters, run ID, git info, etc. |
| parameter_final | model run | year, run_id | Chosen set of hyperparameters for each run |
@@ -1347,13 +1350,13 @@ benefit of a more maintainable model over the long term.

When working on the model locally, you’ll typically want to install
non-core dependencies *on top of* the core dependencies. To do this,
-simply run `renv::restore("<path_to_lockfile>")` to install all
-dependencies from the lockfile.
+simply run `renv::restore(lockfile = "<path_to_lockfile>")` to install
+all dependencies from the lockfile.

For example, if you’re working on the `ingest` stage and want to install
all its dependencies, start with the main profile (run
`renv::activate()`), then install the `dev` profile dependencies on top
-of it (run `renv::restore("renv/profiles/dev/renv.lock")`).
+of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).

> :warning: WARNING: Installing dependencies from a dev lockfile will
> **overwrite** any existing version installed by the core one. For
13 changes: 13 additions & 0 deletions reports/_setup.qmd
@@ -51,6 +51,11 @@ if (!exists("training_data")) {
  training_data <- read_parquet(paths$input$training$local)
}

+# Load assessment set used for this run
+if (!exists("assessment_data")) {
+  assessment_data <- read_parquet(paths$input$assessment$local)
+}
+
# Load Home Improvement Exemption data
if (!exists("hie_data")) {
  hie_data <- read_parquet(paths$input$hie$local)
@@ -159,6 +164,14 @@ if (file.exists(paths$output$shap$local) & metadata$shap_enable) {
  shap_exists <- FALSE
}

+# Load comp data if it exists
+if (file.exists(paths$output$comp$local) & metadata$comp_enable) {
+  comp_df <- read_parquet(paths$output$comp$local)
+  comp_exists <- nrow(comp_df) > 0
+} else {
+  comp_exists <- FALSE
+}
+
# Add colors to re-use across plots
plot_colors <- list(
  "sales" = "#66c2a5",
96 changes: 96 additions & 0 deletions reports/pin/_comp.qmd
@@ -0,0 +1,96 @@
{{< include ../_setup.qmd >}}

## Comparables

This map shows the target parcel alongside the `r metadata$comp_num_comps`
most similar parcels, where similarity is determined by the number of leaf
nodes a parcel shares with the target across the model's trees, weighted by
the relative importance of each tree. See [this
vignette](https://ccao-data.github.io/lightsnip/articles/finding-comps.html)
for more background on the similarity algorithm.
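
For intuition, here is a minimal numpy sketch of that scoring rule. It is an
illustration only, not the repo's `python/comps.py` (which, per the commit
log above, is numba-accelerated, chunks sales, and bins parcels by price to
limit memory use); the function name and signature are hypothetical:

```python
import numpy as np

def get_top_comps(target_leaves, candidate_leaves, tree_weights, n=20):
    """Score candidates against each target by weighted leaf-node overlap.

    target_leaves:    (n_targets, n_trees) leaf index of each target parcel
    candidate_leaves: (n_candidates, n_trees) leaf index of each sold parcel
    tree_weights:     (n_trees,) relative importance of each tree
    """
    # Normalize weights so a candidate that matches the target in every
    # tree scores exactly 1.0
    weights = np.asarray(tree_weights, dtype=np.float64)
    weights = weights / weights.sum()

    top_idx = np.empty((len(target_leaves), n), dtype=np.int64)
    top_scores = np.empty((len(target_leaves), n), dtype=np.float64)
    for i, target in enumerate(target_leaves):
        # Which candidates landed in the same leaf as the target, tree by
        # tree? The weighted share of matching trees is the similarity.
        matches = candidate_leaves == target
        scores = matches.astype(np.float64) @ weights
        best = np.argsort(-scores)[:n]  # indices of the n highest scores
        top_idx[i] = best
        top_scores[i] = scores[best]
    return top_idx, top_scores
```

Because the weights are normalized, a score of 1 means the candidate shares
every leaf node with the target, which is why the popup below can render
`comp_score` as a percentage.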

```{r _comp_map}
# Reshape the wide comps output (comp_pin_1..n, comp_score_1..n) into one
# row per comp for the target parcel
comp_df_filtered <- comp_df %>%
  filter(pin == target_pin) %>%
  tidyr::pivot_longer(starts_with("comp_pin_"), values_to = "comp_pin") %>%
  select(-name, -starts_with("comp_score_")) %>%
  bind_cols(
    comp_df %>%
      filter(pin == target_pin) %>%
      tidyr::pivot_longer(
        starts_with("comp_score_"),
        values_to = "comp_score"
      ) %>%
      select(-name, -starts_with("comp_pin_"), -pin)
  ) %>%
  mutate(type = "Comp.") %>%
  # Attach characteristics and sales for each comp, keeping only the most
  # recent sale per comp
  left_join(
    training_data,
    by = c("comp_pin" = "meta_pin"),
    relationship = "many-to-many"
  ) %>%
  select(
    pin, comp_pin, comp_score, meta_1yr_pri_board_tot,
    meta_sale_date, meta_sale_price,
    loc_latitude, loc_longitude, meta_class,
    char_bldg_sf, char_yrblt, char_ext_wall, type
  ) %>%
  group_by(comp_pin) %>%
  filter(meta_sale_date == max(meta_sale_date)) %>%
  # Add the target parcel itself as a row with a perfect score of 1
  bind_rows(
    tibble::tribble(
      ~pin, ~comp_pin, ~comp_score, ~type,
      target_pin, target_pin, 1, "target"
    ) %>%
      left_join(
        assessment_data %>%
          select(
            meta_pin, meta_class, meta_1yr_pri_board_tot,
            char_bldg_sf, char_yrblt, char_ext_wall,
            loc_latitude, loc_longitude
          ),
        by = c("pin" = "meta_pin")
      ) %>%
      mutate(type = "Target")
  ) %>%
  # Scale Board of Review assessed values up to FMV (residential AV is 10%
  # of market value)
  mutate(meta_1yr_pri_board_tot = meta_1yr_pri_board_tot * 10)

comp_palette <-
  colorFactor(
    palette = "Set1",
    domain = comp_df_filtered$type
  )

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    data = comp_df_filtered,
    ~loc_longitude,
    ~loc_latitude,
    opacity = 1,
    fillOpacity = 1,
    radius = 2,
    color = ~ comp_palette(type),
    popup = ~ paste0(
      type, " PIN: ",
      "<a target='_blank' rel='noopener noreferrer' ",
      "href='https://www.cookcountyassessor.com/pin/", comp_pin,
      "'>", comp_pin, "</a>",
      "<br>Score: ", scales::percent(comp_score, accuracy = 0.01),
      "<br>Class: ", meta_class,
      "<br>BoR FMV: ", scales::dollar(meta_1yr_pri_board_tot, accuracy = 1),
      "<hr>",
      "Sale Date: ", meta_sale_date,
      "<br>Sale Price: ", scales::dollar(meta_sale_price, accuracy = 1),
      "<hr>",
      "Bldg Sqft: ", scales::comma(char_bldg_sf),
      "<br>Year Built: ", char_yrblt,
      "<br>Ext. Wall: ", char_ext_wall
    )
  ) %>%
  setView(
    lng = mean(comp_df_filtered$loc_longitude),
    lat = mean(comp_df_filtered$loc_latitude),
    zoom = 10
  )
```
7 changes: 7 additions & 0 deletions reports/pin/pin.qmd
@@ -160,3 +160,10 @@ if (shap_exists) {
  cat(shap_qmd, sep = "\n")
}
```
+
+```{r, results='asis'}
+if (comp_exists) {
+  comp_qmd <- knitr::knit_child("_comp.qmd", quiet = TRUE)
+  cat(comp_qmd, sep = "\n")
+}
+```
