Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ferc-ferc plant matching with ccai implementation. #3007

Merged
merged 89 commits into from
Dec 26, 2023

Conversation

zschira
Copy link
Member

@zschira zschira commented Nov 2, 2023

Closes catalyst-cooperative/ccai-entity-matching#109 .

This PR pulls @katie-lamb's CCAI implementation of the FERC-FERC inter-year plant matching process. This new implementation works well, and seems to be running much faster than the old implementation (~2 seconds vs ~8 seconds on the etl_fast dataset).

Remaining work:

  • Verify new model runs in full etl
  • Clean up code after dagster integration
  • Get review of current implementation after model optimizations and dagster integration
  • Run validation tests locally

Sorry, something went wrong.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@katie-lamb
Copy link
Member

This looks good so far! Is it the PCA or the clustering that blows up memory?

@katie-lamb
Copy link
Member

katie-lamb commented Nov 8, 2023

@zschira I made some somewhat confusing but hopefully helpful plots to better understand the distance between clusters. With the fittedsklearn Agglomerative Clustering model one can make a scipy dendrogram chart (see this example). To understand what a dendrogram is, I thought this example was helpful. That being said, I'm still not entirely sure how to interpret these plots, and would love some other ideas if I've missed the mark here.

These plots are run on the small subset of data I've been using (2000 records), and shows the clusters that are p=40 merges from the final merge. The y-axis shows the distance at which two nodes are merged into one, indicated by a bracket connecting them. I recommend ignoring the labels on the x-axis, but it basically represents the size of each node. The red horizontal line indicates the threshold I've been using for two clusters to be merged (currently .5). The merges made above this line didn't happen and aren't represented in the labels. The merges below this bar are represented by the labels and were clusters that were merged to form a new cluster. I progressively zoomed in on the y-axis so we can see what's going on.

There's a big jump from threshold of 10 to threshold of 2. I think this means that 10 is way too large of a cluster distance threshold.
Screen Shot 2023-11-09 at 12 47 51 AM

Zooming in on a y axis of 0 to 1, it still seems like a lot of merging happens at a much smaller distance. Maybe it's an indication that the threshold should actually be lower? Maybe it doesn't matter too much if the threshold is .5 or .1, still thinking about that and not entirely sure what to make of it.
Screen Shot 2023-11-09 at 12 47 58 AM

More merging happening in the <.05 range. But that's also to be expected. There's maybe more clustering of larger nodes (bigger on the x-axis), which is good.
Screen Shot 2023-11-09 at 12 48 05 AM

It's a little harder to visualize results with the model run on the full dataset, but for the most part it seems like results align with the smaller sample dataset. I ran this with p=20, so 20 merges from the final merge, because it was impossible to tell what was going on with p=40.

Screen Shot 2023-11-09 at 12 59 34 AM Screen Shot 2023-11-09 at 12 59 40 AM Screen Shot 2023-11-09 at 12 59 47 AM

@zaneselvans
Copy link
Member

I'm definitely confused here. With several thousand expected clusters (one for each FERC plant) it seems hard to visualize all of them at once.

Have you made histograms of the cluster sizes in the old vs. new systems?

If you randomly select a plant_id_ferc1 out of the dataframe post ID assignment, do the records look like they belong to the same plant?

@zschira
Copy link
Member Author

zschira commented Nov 8, 2023

@zaneselvans I've spot checked a number of plant id's and so far they've all looked like the same plant to me.

I think doing some focused spot checking using the dendrogram as a guide would be interesting. For example, look at clusters that merge just above our threshold at the 0.5-0.6 range and see if they look like the same plant or not, or do the same just below the threshold, and then maybe zoom way in to clusters with very small distances, and see if we can find any that don't seem to be general matches. I guess generally, it would be interesting to find some cases where we're clearly failing (matching plants that shouldn't be matched or not matching plants that should), and see where/why those might have gone wrong.

@katie-lamb
Copy link
Member

@zaneselvans

I'm definitely confused here. With several thousand expected clusters (one for each FERC plant) it seems hard to visualize all of them at once.

You can think of the dendrograms as a sample of the several thousand clusters that are created by the model. In our case, the p=40 parameter is not that meaningful, since I'm instead using a distance threshold to decide when merging should stop. I think the dendrogram is mostly helpful to understand if the distance threshold for merging is appropriate. As @zschira pointed out, the second step for validating is probably to spot check nodes that are merged right below or right above that .5 mark.

This also verifies why the average distance between records in a cluster is always very small (.05 or less) even when I experiment with a distance threshold in the .2-1 range. The vast majority of merges happen between nodes with a distance <.05, so looking at the average distance between records in a cluster isn't a very helpful metric for verifying whether the threshold is good in the .2-1 range.

Have you made histograms of the cluster sizes in the old vs. new systems?

Yes, currently in the notebook in the CCAI repo. Here's a comparison from the smaller sample dataset (2000 records, I didn't have a screenshot of the full model histogram), where the average new cluster size is ~5.8 and the average old plant_id_ferc1 size is ~4.3. This disparity is more pronounced on the whole dataset, where the new average cluster size is ~7.3.

Screen Shot 2023-11-09 at 10 15 54 AM Screen Shot 2023-11-09 at 10 15 24 AM

I'll do some spot checking around the current distance threshold as Zach suggests.

@zaneselvans
Copy link
Member

I think the y-axis label might belong on the x-axis?

It's interesting that there's a bump at the very high end of the length spectrum (like, one record for every possible year) in the old version, but not in the new version. I wonder why that would have happened, and whether we've actually lost some good long time series, or if they were bad for some reason and the new algorithm does a better job of distinguishing them?

@katie-lamb
Copy link
Member

I think the y-axis label might belong on the x-axis?

Oh yep, I did that too fast, but switch the label from y-axis to x-axis.

It's interesting that there's a bump at the very high end of the length spectrum (like, one record for every possible year) in the old version, but not in the new version. I wonder why that would have happened, and whether we've actually lost some good long time series, or if they were bad for some reason and the new algorithm does a better job of distinguishing them?

That's a good point, and that disparity at the high end of the length spectrum happens in the full dataset as well. I'll spot check some of those.

Copy link
Member

@katie-lamb katie-lamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good! I left a few small comment in the cross_year module. As sort of a side note, it might be useful for us to keep track of a couple validation checks besides just if there are duplicate report years. That would probably belong in experiment tracking infrastructure and metrics I'm assuming, and would go in with a later PR.

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved
src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved
src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved
Copy link
Member

@katie-lamb katie-lamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't finish making all comments at once. Here's a few more things.

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved
src/pudl/analysis/record_linkage/classify_plants_ferc1.py Outdated Show resolved Hide resolved
@zaneselvans
Copy link
Member

I think you might need to add an __init__.py to the new record_linkage subpackge to make it importable, and bring in all of the new modules, and add an import record_linkage into src/pudl/analysis/__init__.py to reflect the new subpackage.

@zschira zschira marked this pull request as ready for review December 22, 2023 17:15
@zaneselvans
Copy link
Member

zaneselvans commented Dec 22, 2023

It looks like when the modules under pudl.analysis were rearranged, the new import paths were not added to pudl.analysis.__init__.py so they were not importable by other modules, which was causing the use of pudl.analysis.fuel_by_plant functions to fail in the steam table processing.

I'm running the full ETL locally using the etl_full job and also seeing that EPA CEMS is using all 10 of my CPUs rather than limiting itself to a concurrency of 2. Did something get changed that makes that safe? Is the EPA CEMS concurrency setting not working in the Dagster UI?

@zaneselvans
Copy link
Member

After including the new import paths, I get a failure on the plant parts EIA:

AssertionError: Consistency of all matches across years dipped below 75.0% to 53.5%
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 94, in out_pudl__yearly_assn_eia_ferc1_plant_parts
    connects_ferc1_eia = prettyify_best_matches(
                         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 888, in prettyify_best_matches
    check_match_consistency(
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 998, in check_match_consistency
    raise AssertionError(

@zaneselvans
Copy link
Member

Hmm, attempting to re-run just the EPA CEMS it seems to do them two-by-two. So maybe it was just that the whole EPA CEMS graph job had failed and left the unstarted ghost assets in the UI.

@zaneselvans
Copy link
Member

zaneselvans commented Dec 26, 2023

@zschira The builds passed last night so this could go into dev cleanly now.

Do we know why the test coverage drops by half a percent between this PR and dev? That seems like kind of a lot.

Edit: pytest wasn't running the tests in test/integration/record_linkage.py because the name of the module didn't end with _test which is required for test discovery.

@zaneselvans zaneselvans self-requested a review December 26, 2023 20:05
column_transform_from_key("name_cleaner_transform"),
column_transform_from_key("string_transform"),
],
"weight": 2.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there weights for these first two features, but not the rest? Do they default to 1.0?

@@ -0,0 +1 @@
"""This module impolements models for various forms of record linkage."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were some failures in the steam table processing due to pudl.analysis.fuel_by_plant not being imported in pudl/analysis/__init__.py and we have a lot of places where we just import a whole module, rather than the individual functions or constants within it, so I feel like adding the imports here for now would help avoid some confusion with that pattern breaking on some modules.

@zaneselvans zaneselvans merged commit 5ead5b3 into dev Dec 26, 2023
13 of 15 checks passed
@zaneselvans zaneselvans deleted the entity_matching branch December 26, 2023 20:22
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 record-linkage Issues related to connecting related records / entities that don't have explicit IDs or keys. labels Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 record-linkage Issues related to connecting related records / entities that don't have explicit IDs or keys.
Projects
Archived in project
4 participants