-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove price binning from comps and filter targets by tri instead #301
Remove price binning from comps and filter targets by tri instead #301
Conversation
…omps feedback cycle
…eed up comps feedback cycle" This reverts commit 5c7098a.
|
||
# Translate comp indexes to PINs and document numbers | ||
comps[[1]] <- comps[[1]] %>% | ||
mutate( | ||
# Correct for the fact that Python is 0-indexed by incrementing the | ||
# comp indexes by 1, and cast null indicators (-1) to null | ||
across(everything(), ~ ifelse(. == -1, NA, . + 1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous version of this code wasn't properly handling the null indicator for comps (-1). I think we never ran into this problem because we've never had a null comp in production, but if you reduce the number of trees dramatically for testing purposes, you can run into models that can't find a comp for certain parcels.
).values | ||
total_num_possible_comps = len(comparison_df) | ||
chunked_ids, chunked_scores = [], [] | ||
for chunk_num in set(observation_df["chunk"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We iterate over the set of chunks here rather than something like range(1, num_chunks + 1)
because pd.cut
won't produce a bin for each number in the range if the number of targets is less than the number of bins. This should never occur in production, but it can happen if you're testing the algorithm on a very small number of targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this looks great @jeancochrane. I like that it simplifies things a lot. Let's merge #314 and this should be good to go.
* Add comps incremental speedups * Format with ruff
This PR removes price binning from the comps algorithm, and replaces it with a filter that only generates comps for the tri that the model is targeting. We also reduce the number of comps that we calculate for each property from 20 to 5, since our internal analysis suggests that comps after the fifth comp are less informative.
With these changes, the comps pipeline runs in 48 hours for the City tri. See
2024-12-29-compassionate-kyra
for the output of the test run.We also use this PR to merge some speed improvements that @dfsnow added in #314.