Add comps map to individual PIN report (#181)
* Add intermediate leaf_node output to interpret pipeline stage

* Add python get_comps function for computing comps from leaf node assignments

* Flesh out comp calculation

* Add Python requirements to renv environment

* Make sure assessment data is loaded in interpret stage when comp_enable is TRUE

* Continue with comps debugging

* Refactor and test get_comps logic

* Clean up comments and extraneous debugging code ahead of testing

* Temporarily set comp_enable=TRUE for the purposes of testing comps

* Satisfy pre-commit

* Remove num_iteration arg from predict() in comp calculation

* Make sure requirements.txt is copied into image before installing R dependencies

* Install python3-venv in Dockerfile

* Pass n=20 to get_comps correctly in 04-interpret.R

* Temporarily slim down training set to test comp calculation

* Wrap get_comps() call in tryCatch in interpret pipeline stage for better error logging

* Test raising an error from python/comps.py

* Remove temporary error in python/comps.py

* Swap arg order in _get_similarity_matrix to confirm numba error message

* Revert "Swap arg order in _get_similarity_matrix to confirm numba error message"

This reverts commit 5beefd5.

* Raise error in interpret stage if get_comps fails

* Revert "Temporarily slim down training set to test comp calculation"

This reverts commit e27581f.

* Try refactoring comps.py for less memory use

* Get comps working locally with less memory intensive algorithm

* Use sales to generate comps

* Instrument python/comps.py with logging and temporarily remove numba decorator

* Instrument interpret comps stage with more logging and skip feature importance for now

* Bump vcpu and memory in build-and-run-model to take full advantage of 10xlarge instance

* Add some logging to try to determine whether record_evals are being saved properly

* Add extra logging to extract_weights function to debug empty weights vector

* Pin lightsnip to jeancochrane/record-evals branch

* Remove debug logs from comps and tree weights extraction functions

* njit _get_top_n_comps

* Revert "Remove debug logs from comps and tree weights extraction functions"

This reverts commit 6d82d5b.

* Print record_evals length in train stage for debugging

* Add some more debug logging to train stage

* Switch to save_tree_error instead of valids arg in lightgbm model definition

* Update lightsnip to latest working version

* More fixes for comps

* Try removing parallelism from _get_top_n_comps

* Enable parallelization for comps algorithm

* Temporarily write comps inputs out to file for testing

* Reduce vcpu/memory in build-and-run-model to see if it provisions smaller instance

* Transpose weights in get_comps and add debug script

* Remove debugging utilities from comps pipeline ahead of final test

* Appease pre-commit

* Add back empty line in 04-interpret.R that got accidentally deleted

* Try jeancochrane/restrict-instance-types-in-build-and-run-batch-job branch for build-and-run-model workflow

* Switch back to m4.10xlarge instance sizing in build-and-run-model

* Add progress logging to comps.py

* Switch back to main branch of build-and-run-batch-job

* Switch to bare iteration rather than vector operations for producing similarity scores in comps.py

* Run comps against binned data to speed up python/comps.py

* Log price ranges in python/comps.py

* Update comps pipeline to work with sales chunking

* Qualify package for rownames_to_column in interpret pipeline stage

* Skip comps bin when no observations are placed in that bin in python/comps.py

* Small cleanup to python/comps.py

* Fix partitioning for comps pipeline

* Fix typo in comps pipeline

* Add comps to individual PIN report

* Cleanup comps map

---------

Co-authored-by: Dan Snow <[email protected]>
jeancochrane and dfsnow authored Jan 28, 2024
1 parent df5b89c commit aed2d12
Showing 5 changed files with 130 additions and 11 deletions.
4 changes: 2 additions & 2 deletions README.Rmd
@@ -790,9 +790,9 @@ Our goal in maintaining multiple lockfiles is to keep the list of dependencies r

### Using Lockfiles for Local Development

-When working on the model locally, you'll typically want to install non-core dependencies _on top of_ the core dependencies. To do this, simply run `renv::restore("<path_to_lockfile>")` to install all dependencies from the lockfile.
+When working on the model locally, you'll typically want to install non-core dependencies _on top of_ the core dependencies. To do this, simply run `renv::restore(lockfile = "<path_to_lockfile>")` to install all dependencies from the lockfile.

-For example, if you're working on the `ingest` stage and want to install all its dependencies, start with the main profile (run `renv::activate()`), then install the `dev` profile dependencies on top of it (run `renv::restore("renv/profiles/dev/renv.lock")`).
+For example, if you're working on the `ingest` stage and want to install all its dependencies, start with the main profile (run `renv::activate()`), then install the `dev` profile dependencies on top of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).

> :warning: WARNING: Installing dependencies from a dev lockfile will **overwrite** any existing version installed by the core one. For example, if `[email protected]` is installed by the core lockfile, and `[email protected]` is installed by the dev lockfile, renv will **overwrite** `[email protected]` with `[email protected]`.
21 changes: 12 additions & 9 deletions README.md
@@ -310,13 +310,16 @@ proportion, including:

| LightGBM Parameter | CV Search Range | Parameter Description |
|:---------------------------------------------------------------------------------------------------|:----------------|:-----------------------------------------------------------------------------------|
-| [num_leaves](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_leaves) | 50 - 2000 | Maximum number of leaves in each tree. Main parameter to control model complexity. |
+| [num_iterations](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_iterations) | 100 - 2500 | Number of boosting iterations (trees) to build. |
+| [learning_rate](https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning_rate) | -3 - -0.4 | Boosting learning rate. The CV search range shown is on a log10 scale. |
+| [max_bin](https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_bin) | 50 - 512 | Maximum number of bins used to bucket continuous features |
+| [num_leaves](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_leaves) | 32 - 2048 | Maximum number of leaves in each tree. Main parameter to control model complexity. |
| [add_to_linked_depth](https://ccao-data.github.io/lightsnip/reference/train_lightgbm.html) | 1 - 7 | Amount to add to `max_depth` if linked to `num_leaves`. See `max_depth`. |
| [feature_fraction](https://lightgbm.readthedocs.io/en/latest/Parameters.html#feature_fraction) | 0.3 - 0.7 | The random subset of features selected for a tree, as a percentage. |
| [min_gain_to_split](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_gain_to_split) | 0.001 - 10000 | The minimum gain needed to create a split. |
-| [min_data_in_leaf](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_in_leaf) | 2 - 300 | The minimum data in a single tree leaf. Important to prevent over-fitting. |
+| [min_data_in_leaf](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_in_leaf) | 2 - 400 | The minimum data in a single tree leaf. Important to prevent over-fitting. |
| [max_cat_threshold](https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_cat_threshold) | 10 - 250 | Maximum number of split points for categorical features |
-| [min_data_per_group](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_per_group) | 4 - 300 | Minimum number of observations per categorical group |
+| [min_data_per_group](https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_per_group) | 2 - 400 | Minimum number of observations per categorical group |
| [cat_smooth](https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_smooth) | 10 - 200 | Categorical smoothing. Used to reduce noise. |
| [cat_l2](https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_l2) | 0.001 - 100 | Categorical-specific L2 regularization |
| [lambda_l1](https://lightgbm.readthedocs.io/en/latest/Parameters.html#lambda_l1) | 0.001 - 100 | L1 regularization |
@@ -350,7 +353,7 @@ districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-
and many others. The features in the table below are the ones that made
the cut. They’re the right combination of easy to understand and impute,
powerfully predictive, and well-behaved. Most of them are in use in the
-model as of 2023-12-20.
+model as of 2024-01-22.

| Feature Name | Category | Type | Possible Values | Notes |
|:------------------------------------------------------------------------|:---------------|:------------|:-----------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------|
@@ -1112,8 +1115,7 @@ following stages:
The entire end-to-end pipeline can also be run using
[DVC](https://dvc.org/). DVC will track the dependencies and parameters
required to run each stage, cache intermediate files, and store
-versioned input data on S3. The packages `dvc` and `dvc[s3]` are required
-for the following commands.
+versioned input data on S3.

To pull all the necessary input data based on the information in
`dvc.lock`, run:
@@ -1232,6 +1234,7 @@ Uploaded Parquet files are converted into the following Athena tables:
|:---------------------|:-----------------------------------|:-----------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
| assessment_card | card | year, run_id, township_code, meta_pin, meta_card_num | Assessment results at the card level AKA raw model output |
| assessment_pin | pin | year, run_id, township_code, meta_pin | Assessment results at the PIN level AKA aggregated and cleaned |
+| comp | card | year, run_id, meta_pin, meta_card_num | Comparables for each card (computed using leaf node assignments) |
| feature_importance | predictor | year, run_id, model_predictor_all_name | Feature importance values (gain, cover, and frequency) for the run |
| metadata | model run | year, run_id | Information about each run, including parameters, run ID, git info, etc. |
| parameter_final | model run | year, run_id | Chosen set of hyperparameters for each run |
@@ -1347,13 +1350,13 @@ benefit of a more maintainable model over the long term.

When working on the model locally, you’ll typically want to install
non-core dependencies *on top of* the core dependencies. To do this,
-simply run `renv::restore("<path_to_lockfile>")` to install all
-dependencies from the lockfile.
+simply run `renv::restore(lockfile = "<path_to_lockfile>")` to install
+all dependencies from the lockfile.

For example, if you’re working on the `ingest` stage and want to install
all its dependencies, start with the main profile (run
`renv::activate()`), then install the `dev` profile dependencies on top
-of it (run `renv::restore("renv/profiles/dev/renv.lock")`).
+of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).

> :warning: WARNING: Installing dependencies from a dev lockfile will
> **overwrite** any existing version installed by the core one. For
13 changes: 13 additions & 0 deletions reports/_setup.qmd
@@ -51,6 +51,11 @@ if (!exists("training_data")) {
  training_data <- read_parquet(paths$input$training$local)
}

+# Load assessment set used for this run
+if (!exists("assessment_data")) {
+  assessment_data <- read_parquet(paths$input$assessment$local)
+}
+
# Load Home Improvement Exemption data
if (!exists("hie_data")) {
  hie_data <- read_parquet(paths$input$hie$local)
@@ -159,6 +164,14 @@ if (file.exists(paths$output$shap$local) & metadata$shap_enable) {
  shap_exists <- FALSE
}

+# Load comp data if it exists
+if (file.exists(paths$output$comp$local) & metadata$comp_enable) {
+  comp_df <- read_parquet(paths$output$comp$local)
+  comp_exists <- nrow(comp_df) > 0
+} else {
+  comp_exists <- FALSE
+}
+
# Add colors to re-use across plots
plot_colors <- list(
  "sales" = "#66c2a5",
96 changes: 96 additions & 0 deletions reports/pin/_comp.qmd
@@ -0,0 +1,96 @@
{{< include ../_setup.qmd >}}

## Comparables

This map shows the target parcel alongside the `r metadata$comp_num_comps`
most similar parcels, where similarity is determined by the number of leaf
nodes a parcel shares with the target across the model's trees, weighted by
the relative importance of each tree. See [this
vignette](https://ccao-data.github.io/lightsnip/articles/finding-comps.html)
for more background on the similarity algorithm.
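
For intuition, here is a minimal numpy sketch of that scoring rule. It is an
illustration only, not the repo's `python/comps.py` (which, per the commit
log above, is numba-accelerated, chunks sales, and bins parcels by price to
limit memory use); the function name and signature are hypothetical:

```python
import numpy as np

def get_top_comps(target_leaves, candidate_leaves, tree_weights, n=20):
    """Score candidates against each target by weighted leaf-node overlap.

    target_leaves:    (n_targets, n_trees) leaf index of each target parcel
    candidate_leaves: (n_candidates, n_trees) leaf index of each sold parcel
    tree_weights:     (n_trees,) relative importance of each tree
    """
    # Normalize weights so a candidate that matches the target in every
    # tree scores exactly 1.0
    weights = np.asarray(tree_weights, dtype=np.float64)
    weights = weights / weights.sum()

    top_idx = np.empty((len(target_leaves), n), dtype=np.int64)
    top_scores = np.empty((len(target_leaves), n), dtype=np.float64)
    for i, target in enumerate(target_leaves):
        # Which candidates landed in the same leaf as the target, tree by
        # tree? The weighted share of matching trees is the similarity.
        matches = candidate_leaves == target
        scores = matches.astype(np.float64) @ weights
        best = np.argsort(-scores)[:n]  # indices of the n highest scores
        top_idx[i] = best
        top_scores[i] = scores[best]
    return top_idx, top_scores
```

Because the weights are normalized, a score of 1 means the candidate shares
every leaf node with the target, which is why the popup below can render
`comp_score` as a percentage.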

```{r _comp_map}
# Reshape the wide comps output (comp_pin_1..n, comp_score_1..n) into one
# row per comp for the target parcel
comp_df_filtered <- comp_df %>%
  filter(pin == target_pin) %>%
  tidyr::pivot_longer(starts_with("comp_pin_"), values_to = "comp_pin") %>%
  select(-name, -starts_with("comp_score_")) %>%
  bind_cols(
    comp_df %>%
      filter(pin == target_pin) %>%
      tidyr::pivot_longer(
        starts_with("comp_score_"),
        values_to = "comp_score"
      ) %>%
      select(-name, -starts_with("comp_pin_"), -pin)
  ) %>%
  mutate(type = "Comp.") %>%
  # Attach characteristics and sales for each comp, keeping only the most
  # recent sale per comp
  left_join(
    training_data,
    by = c("comp_pin" = "meta_pin"),
    relationship = "many-to-many"
  ) %>%
  select(
    pin, comp_pin, comp_score, meta_1yr_pri_board_tot,
    meta_sale_date, meta_sale_price,
    loc_latitude, loc_longitude, meta_class,
    char_bldg_sf, char_yrblt, char_ext_wall, type
  ) %>%
  group_by(comp_pin) %>%
  filter(meta_sale_date == max(meta_sale_date)) %>%
  # Add the target parcel itself as a row with a perfect score of 1
  bind_rows(
    tibble::tribble(
      ~pin, ~comp_pin, ~comp_score, ~type,
      target_pin, target_pin, 1, "target"
    ) %>%
      left_join(
        assessment_data %>%
          select(
            meta_pin, meta_class, meta_1yr_pri_board_tot,
            char_bldg_sf, char_yrblt, char_ext_wall,
            loc_latitude, loc_longitude
          ),
        by = c("pin" = "meta_pin")
      ) %>%
      mutate(type = "Target")
  ) %>%
  # Scale Board of Review assessed values up to FMV (residential AV is 10%
  # of market value)
  mutate(meta_1yr_pri_board_tot = meta_1yr_pri_board_tot * 10)

comp_palette <-
  colorFactor(
    palette = "Set1",
    domain = comp_df_filtered$type
  )

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    data = comp_df_filtered,
    ~loc_longitude,
    ~loc_latitude,
    opacity = 1,
    fillOpacity = 1,
    radius = 2,
    color = ~ comp_palette(type),
    popup = ~ paste0(
      type, " PIN: ",
      "<a target='_blank' rel='noopener noreferrer' ",
      "href='https://www.cookcountyassessor.com/pin/", comp_pin,
      "'>", comp_pin, "</a>",
      "<br>Score: ", scales::percent(comp_score, accuracy = 0.01),
      "<br>Class: ", meta_class,
      "<br>BoR FMV: ", scales::dollar(meta_1yr_pri_board_tot, accuracy = 1),
      "<hr>",
      "Sale Date: ", meta_sale_date,
      "<br>Sale Price: ", scales::dollar(meta_sale_price, accuracy = 1),
      "<hr>",
      "Bldg Sqft: ", scales::comma(char_bldg_sf),
      "<br>Year Built: ", char_yrblt,
      "<br>Ext. Wall: ", char_ext_wall
    )
  ) %>%
  setView(
    lng = mean(comp_df_filtered$loc_longitude),
    lat = mean(comp_df_filtered$loc_latitude),
    zoom = 10
  )
```
7 changes: 7 additions & 0 deletions reports/pin/pin.qmd
@@ -160,3 +160,10 @@ if (shap_exists) {
  cat(shap_qmd, sep = "\n")
}
```
+
+```{r, results='asis'}
+if (comp_exists) {
+  comp_qmd <- knitr::knit_child("_comp.qmd", quiet = TRUE)
+  cat(comp_qmd, sep = "\n")
+}
+```
