Remove price binning from comps and filter targets by tri instead #301

Merged

Conversation

@jeancochrane (Contributor) commented Dec 24, 2024:

This PR removes price binning from the comps algorithm and replaces it with a filter that only generates comps for the tri that the model is targeting. We also reduce the number of comps that we calculate for each property from 20 to 5, since our internal analysis suggests that comps beyond the fifth are less informative.

With these changes, the comps pipeline runs in 48 hours for the City tri. See 2024-12-29-compassionate-kyra for the output of the test run.

We also use this PR to merge some speed improvements that @dfsnow added in #314.
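
To make the new flow concrete, here is a minimal sketch of the idea, not the pipeline's actual code: the `triad_code` column, the `select_comps` helper, and the similarity-score matrix are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

NUM_COMPS = 5  # reduced from 20 in this PR


def select_comps(targets: pd.DataFrame, scores: np.ndarray, tri: str) -> np.ndarray:
    """Keep only targets in the given tri and return the indexes of their
    top NUM_COMPS comparables, highest similarity score first."""
    in_tri = (targets["triad_code"] == tri).to_numpy()
    tri_scores = scores[in_tri]
    # Sort each row descending by score and keep the first NUM_COMPS columns
    return np.argsort(-tri_scores, axis=1)[:, :NUM_COMPS]


# Toy usage: three parcels, two of which are in tri "1", scored against
# eight candidate comparison parcels
targets = pd.DataFrame({"pin": ["a", "b", "c"], "triad_code": ["1", "1", "2"]})
scores = np.random.default_rng(0).random((3, 8))
comp_idx = select_comps(targets, scores, tri="1")  # shape (2, 5)
```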

…eed up comps feedback cycle"

This reverts commit 5c7098a.

# Translate comp indexes to PINs and document numbers
comps[[1]] <- comps[[1]] %>%
  mutate(
    # Correct for the fact that Python is 0-indexed by incrementing the
    # comp indexes by 1, and cast null indicators (-1) to null
    across(everything(), ~ ifelse(. == -1, NA, . + 1)),

@jeancochrane (Contributor Author) commented:

The previous version of this code wasn't properly handling the null indicator for comps (-1). I think we never ran into this problem because we've never had a null comp in production, but if you reduce the number of trees dramatically for testing purposes, you can run into models that can't find a comp for certain parcels.
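
As a hypothetical illustration of the failure mode (a numpy sketch, not the pipeline's code; the actual fix is the R `ifelse` in the diff above): shifting the 0-based indexes by 1 without first masking the -1 null indicator turns "no comp found" into an apparently valid index.

```python
import numpy as np

comp_idx = np.array([3, 0, -1])  # -1 means "no comp found"

# Buggy: -1 + 1 == 0, so a missing comp is no longer recognizable as missing
buggy = comp_idx + 1  # array([4, 1, 0])

# Fixed: mask the null indicator before shifting to 1-based indexes
fixed = np.where(comp_idx == -1, np.nan, comp_idx + 1)  # array([4., 1., nan])
```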

).values
total_num_possible_comps = len(comparison_df)
chunked_ids, chunked_scores = [], []
for chunk_num in set(observation_df["chunk"]):

@jeancochrane (Contributor Author) commented:

We iterate over the set of chunks here rather than something like range(1, num_chunks + 1) because pd.cut won't produce a bin for each number in the range if the number of targets is less than the number of bins. This should never occur in production, but it can happen if you're testing the algorithm on a very small number of targets.
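
A minimal sketch of that edge case, assuming an illustrative `observation_df` (the real pipeline's chunking setup may differ):

```python
import numpy as np
import pandas as pd

observation_df = pd.DataFrame({"pin": ["a", "b", "c"]})
num_chunks = 10  # more chunks requested than there are targets

# pd.cut spreads the 3 rows across 10 equal-width bins, so at most 3 of the
# 10 possible chunk labels actually appear in the column
observation_df["chunk"] = pd.cut(
    np.arange(len(observation_df)), bins=num_chunks, labels=False
)

# Iterating over range(num_chunks) would visit empty chunks; iterating over
# the labels that actually occur avoids that
for chunk_num in set(observation_df["chunk"]):
    chunk = observation_df[observation_df["chunk"] == chunk_num]
```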

@jeancochrane jeancochrane marked this pull request as ready for review January 3, 2025 22:50

@dfsnow (Member) left a comment:

Nice, this looks great @jeancochrane. I like that it simplifies things a lot. Let's merge #314 and this should be good to go.

* Add comps incremental speedups

* Format with ruff
@jeancochrane jeancochrane merged commit aca92c2 into 2025-assessment-year Jan 8, 2025
4 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/remove-binning-for-comps branch January 8, 2025 23:21