Update ferc-ferc plant matching with ccai implementation. #3007

Merged
merged 89 commits into from
Dec 26, 2023
Changes from 21 commits
6c8a86d
Update ferc-ferc plant matching with ccai implementation.
zschira Nov 2, 2023
a57810a
Update docstring
zschira Nov 2, 2023
688b577
Make distance estimator param name more descriptive
zschira Nov 2, 2023
1878348
Take only report_years to calculate ferc-ferc distance penalty
zschira Nov 9, 2023
e7686c3
Adjust PCA to use less memory
zschira Nov 9, 2023
e526516
Increase ferc-ferc dist threshold
zschira Nov 17, 2023
4b5bea4
Generalize CrossYearLinker interface.
zschira Nov 18, 2023
4ccdcc6
Merge branch 'dev' into entity_matching
zschira Nov 18, 2023
586b973
Update conda-lock.yml and rendered conda environment files.
zschira Nov 18, 2023
f5e4f14
Allow configurable column cleaning in generic inter-year linker
zschira Nov 18, 2023
3026681
Remove old classify_plants_ferc1 module
zschira Nov 18, 2023
be7f28d
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Nov 18, 2023
211ff32
Improve docstring in cross-year-linker
zschira Nov 20, 2023
a77c7d7
Add company name cleaner to ferc-ferc matching
zschira Nov 21, 2023
778d059
Add revert_filled_in_nulls back to ferc-ferc match
zschira Nov 21, 2023
44f2fd0
fix yaml/yml typo in .gitattributes
zaneselvans Nov 21, 2023
b76f2d2
Merge branch 'dev' into entity_matching
zaneselvans Nov 21, 2023
54cb3ca
Update conda-lock.yml and rendered conda environment files.
zaneselvans Nov 21, 2023
ead2c8e
Add __init__ to record_linkage module.
zschira Nov 21, 2023
fdfe1c2
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Nov 21, 2023
4190361
Add integration to perform ferc-ferc matching on synthetic data.
zschira Nov 22, 2023
ecd84ca
Merge dev
zschira Nov 22, 2023
0921557
Fix classify_plants_ferc1 integration test
zschira Nov 22, 2023
305411e
Add comments to ferc-ferc matching test
zschira Nov 22, 2023
8694279
Fix docstring spacing
zschira Nov 22, 2023
43c51e1
Fix import path
zschira Nov 22, 2023
e6df48e
Don't modify utility names during ferc-ferc matching.
zschira Nov 25, 2023
ffcdc1e
Merge branch 'dev' into entity_matching
zschira Nov 27, 2023
fd58293
Make attribute comments compatible with pydantic class
zschira Nov 27, 2023
361162c
Update conda-lock.yml and rendered conda environment files.
zschira Nov 27, 2023
7026784
Merge branch 'dev' into ferc-eia-ccai
zschira Nov 30, 2023
1e6e8e8
Refactor new record linkage interface.
zschira Nov 30, 2023
ab95188
Minor change to ferc-ferc integration test
zschira Nov 30, 2023
8aba02b
Update conda-lock.yml and rendered conda environment files.
zschira Nov 30, 2023
50d1f98
Add more memory efficient PCA option for record linkage models
zschira Dec 1, 2023
6190584
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Dec 1, 2023
7abd6bd
Merge branch 'dev' into entity_matching
zaneselvans Dec 1, 2023
25a533d
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zaneselvans Dec 1, 2023
5e109f6
Add more test records to ferc-ferc matching simulation.
zschira Dec 1, 2023
152e776
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zaneselvans Dec 1, 2023
ea643bd
Fix bad docstring formatting that was breaking docs build
zaneselvans Dec 1, 2023
99ec8ee
Merge branch 'dev' into entity_matching
zaneselvans Dec 1, 2023
7e7affc
Merge branch 'dev' into entity_matching
zaneselvans Dec 1, 2023
5757f44
Add missing combined cycle plant type strings.
zaneselvans Dec 1, 2023
7b356e8
Add more test records to ferc-ferc matching simulation.
zschira Dec 4, 2023
6f2b787
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Dec 5, 2023
c2ea24a
Merge branch 'dev' into entity_matching
zschira Dec 5, 2023
3ec367d
Update conda-lock.yml and rendered conda environment files.
zschira Dec 5, 2023
8647517
Merge branch 'dev' into entity_matching
zaneselvans Dec 6, 2023
61a3c01
Integrate ferc-ferc model with dagster ops
zschira Dec 7, 2023
772c582
Update record linkage integration test.
zschira Dec 7, 2023
50d6126
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Dec 7, 2023
dbe48d2
Fix column name in validate steam ids
zschira Dec 7, 2023
43e7f5f
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zaneselvans Dec 7, 2023
9c6616f
Merge branch 'dev' into entity_matching
zaneselvans Dec 7, 2023
71347c9
Merge branch 'dev' into entity_matching
zaneselvans Dec 8, 2023
dfa163c
Update conda-lock.yml and rendered conda environment files.
zaneselvans Dec 8, 2023
cced81d
Remove redundant logs
zschira Dec 8, 2023
4b11198
Remove files from wrong branch
zschira Dec 8, 2023
c2869c3
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Dec 8, 2023
49cf1a8
Add factory function for creating dataframe embedder
zschira Dec 10, 2023
6bd9a55
Improve comments in cross year record linkage model
zschira Dec 11, 2023
eafb9cb
Remove plant_id_ferc1 form plants_steam_ferc1
zschira Dec 11, 2023
82592b6
Make ferc plant assignment a stand alone intermediate asset
zschira Dec 12, 2023
4ad4813
Improve name in record linkage dataframe embedding
zschira Dec 12, 2023
f435204
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 12, 2023
98e86d9
Merge branch 'dev' into entity_matching
zschira Dec 12, 2023
e1bbea5
Update conda-lock.yml and rendered conda environment files.
zschira Dec 12, 2023
67151b8
Fix name to not clobber import
zschira Dec 12, 2023
b8cf219
Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…
zschira Dec 12, 2023
96aaa36
Use numba to speed up ferc-ferc model
zschira Dec 13, 2023
52b9adb
Merge branch 'dev' into entity_matching
katie-lamb Dec 19, 2023
7edc408
create new migration
katie-lamb Dec 20, 2023
dc9f0bd
Improve fuel fraction test generation
zschira Dec 20, 2023
0faf725
Merge branch 'entity_matching' of https://github.com/catalyst-coopera…
zschira Dec 20, 2023
ec34f1e
Fix typo
zschira Dec 20, 2023
95bb1ca
Refine ferc-ferc model parameters
zschira Dec 20, 2023
b844f80
Add improved docstring
zschira Dec 20, 2023
7c1d6c3
Remove inaccurate docstring
zschira Dec 20, 2023
c6a9496
Add dedicated module for fuel by plant
zschira Dec 20, 2023
9e64c30
Add options to all dataframe embedding steps
zschira Dec 20, 2023
12472bc
Change dict access to get()
zschira Dec 20, 2023
90bdc59
Simplify ferc plant id verification
zschira Dec 20, 2023
171504f
Merge branch 'dev' into entity_matching
zaneselvans Dec 22, 2023
665a805
Add missing module imports.
zaneselvans Dec 22, 2023
39d05ec
Merge branch 'dev' into entity_matching
zaneselvans Dec 24, 2023
1cae6bb
Merge branch 'dev' into entity_matching
zaneselvans Dec 25, 2023
09b7c7b
Merge branch 'dev' into entity_matching
zaneselvans Dec 25, 2023
962fc3d
Rename record linkage test module so pytest actually runs it.
zaneselvans Dec 26, 2023
1 change: 1 addition & 0 deletions src/pudl/analysis/__init__.py
@@ -11,6 +11,7 @@
    ferc1_eia_record_linkage,
    mcoe,
    plant_parts_eia,
    record_linkage,
    service_territory,
    spatial,
    state_demand,
762 changes: 0 additions & 762 deletions src/pudl/analysis/classify_plants_ferc1.py

This file was deleted.

1 change: 1 addition & 0 deletions src/pudl/analysis/record_linkage/__init__.py
@@ -0,0 +1 @@
"""This module implements models for various forms of record linkage."""
Member

The pattern we've adopted in the other subpackages is to have each __init__.py import all of the modules in that subpackage. Should we do that here? Or should we be doing something different everywhere?

Member Author

@zschira zschira Nov 22, 2023

I don't think the imports within __init__.py are strictly necessary, and in my experience it's pretty common practice to have empty __init__.py files in sub-packages. I always thought importing in __init__.py was more for when you want to make code from a module available in the package-level namespace; Python doesn't have a problem finding modules within a package. All that said, I don't really feel like I have a firm grasp on import best practices, and I'm totally fine with sticking with the pattern used elsewhere for consistency.

Member

There were some failures in the steam table processing because pudl.analysis.fuel_by_plant was not imported in pudl/analysis/__init__.py. We have a lot of places where we import a whole module rather than the individual functions or constants within it, so I feel like adding the imports here for now would help avoid confusion from that pattern breaking on some modules.
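
The behavior discussed in this thread can be reproduced in isolation. The sketch below is not PUDL code: it builds two throwaway packages (the names `demo_empty_init`, `demo_importing_init`, and `helpers` are made up for illustration) and checks whether importing only the top-level package makes the submodule reachable as an attribute. With an empty `__init__.py` it is not, which is the kind of failure that `import pudl` followed by `pudl.analysis.some_module` can hit when the subpackage does not import its modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_package(root: Path, name: str, init_source: str) -> None:
    """Write a throwaway package containing one submodule named ``helpers``."""
    pkg_dir = root / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text(init_source)
    (pkg_dir / "helpers.py").write_text("VALUE = 42\n")


def submodule_is_attribute(name: str) -> bool:
    """Import only the top-level package, then look for the submodule on it."""
    package = importlib.import_module(name)
    return hasattr(package, "helpers")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical package with an empty __init__.py.
    build_package(root, "demo_empty_init", "")
    # Hypothetical package whose __init__.py imports its submodule.
    build_package(root, "demo_importing_init", "from . import helpers\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        empty_result = submodule_is_attribute("demo_empty_init")
        importing_result = submodule_is_attribute("demo_importing_init")
    finally:
        sys.path.remove(tmp)

print(empty_result, importing_result)
```

An explicit `import demo_empty_init.helpers` would still work in the first case; the difference only shows up when code relies on attribute access through the parent package.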

359 changes: 359 additions & 0 deletions src/pudl/analysis/record_linkage/classify_plants_ferc1.py
@@ -0,0 +1,359 @@
"""Scikit-Learn classification pipeline for identifying related FERC 1 plant records.

Sadly FERC doesn't provide any kind of real IDs for the plants that report to them --
all we have is their names (a freeform string) and the data that is reported alongside
them. This is often enough information to be able to recognize which records ought to be
associated with each other year to year to create a continuous time series. However, we
want to do that programmatically, which means using some clustering / categorization
tools from scikit-learn.
"""
import re

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer

import pudl
from pudl.analysis.record_linkage.cleaning_steps import CleaningRules
from pudl.analysis.record_linkage.cross_year import ColumnTransform, CrossYearLinker

logger = pudl.logging_helpers.get_logger(__name__)


def fuel_by_plant_ferc1(
    fuel_df: pd.DataFrame, fuel_categories: list[str], thresh: float = 0.5
) -> pd.DataFrame:
    """Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.

    Each fuel record in the FERC Form 1 corresponds to a particular type of fuel. Many
    plants -- especially coal plants -- use more than one fuel, with gas and/or diesel
    serving as startup fuels. In order to be able to classify the type of plant based on
    relative proportions of fuel consumed or fuel costs it is useful to aggregate these
    per-fuel records into a single record for each plant.

    Fuel cost (in nominal dollars) and fuel heat content (in mmBTU) are calculated for
    each fuel based on the cost and heat content per unit, and the number of units
    consumed, and then summed by fuel type (there can be more than one record for a
    given type of fuel in each plant because we are simplifying the fuel categories).
    The per-fuel records are then pivoted to create one column per fuel type. The total
    is summed and stored separately, and the individual fuel costs & heat contents are
    divided by that total, to yield fuel proportions. Based on those proportions and a
    minimum threshold that's passed in, a "primary" fuel type is then assigned to the
    plant-year record and given a string label.

    Args:
        fuel_df: Pandas DataFrame resembling the post-transform
            result for the fuel_ferc1 table.
        fuel_categories: The fuel type labels (values of fuel_type_code_pudl) that
            can be assigned as the "primary" fuel of a plant-year.
        thresh: A value between 0.5 and 1.0 indicating the minimum fraction of
            overall heat content that must have been provided by a fuel in a plant-year
            for it to be considered the "primary" fuel for the plant in that year.
            Default value: 0.5.

    Returns:
        DataFrame with a single record for each plant-year, including the columns
        required to merge it with the plants_steam_ferc1 table/DataFrame (report_year,
        utility_id_ferc1, and plant_name_ferc1) as well as totals for fuel mmbtu
        consumed in that plant-year, and the cost of fuel in that year, the proportions
        of heat content and fuel costs for each fuel in that year, and columns that
        label the plant's primary fuel for that year by heat content and by cost.

    Raises:
        AssertionError: If the DataFrame input does not have the columns required to
            run the function.
    """
    keep_cols = [
        "report_year",  # key
        "utility_id_ferc1",  # key
        "plant_name_ferc1",  # key
        "fuel_type_code_pudl",  # pivot
        "fuel_consumed_units",  # value
        "fuel_mmbtu_per_unit",  # value
        "fuel_cost_per_unit_burned",  # value
    ]

    # Ensure that the dataframe we've gotten has all the information we need:
    for col in keep_cols:
        if col not in fuel_df.columns:
            raise AssertionError(f"Required column {col} not found in input fuel_df.")

    # Calculate per-fuel derived values and add them to the DataFrame
    df = (
        # Really there should *not* be any duplicates here but... there's a
        # bug somewhere that introduces them into the fuel_ferc1 table.
        fuel_df[keep_cols]
        .drop_duplicates()
        # Calculate totals for each record based on per-unit values:
        .assign(fuel_mmbtu=lambda x: x.fuel_consumed_units * x.fuel_mmbtu_per_unit)
        .assign(fuel_cost=lambda x: x.fuel_consumed_units * x.fuel_cost_per_unit_burned)
        # Drop the ratios and heterogeneous fuel "units"
        .drop(
            ["fuel_mmbtu_per_unit", "fuel_cost_per_unit_burned", "fuel_consumed_units"],
            axis=1,
        )
        # Group by the keys and fuel type, and sum:
        .groupby(
            [
                "utility_id_ferc1",
                "plant_name_ferc1",
                "report_year",
                "fuel_type_code_pudl",
            ]
        )
        .sum()
        .reset_index()
        # Set the index to the keys, and pivot to get per-fuel columns:
        .set_index(["utility_id_ferc1", "plant_name_ferc1", "report_year"])
        .pivot(columns="fuel_type_code_pudl")
        .fillna(0.0)
    )

    # Undo pivot. Could refactor this old function
    plant_year_totals = df.stack("fuel_type_code_pudl").groupby(level=[0, 1, 2]).sum()

    # Calculate total heat content burned for each plant, and divide it out
    mmbtu_group = (
        pd.merge(
            # Sum up all the fuel heat content, and divide the individual fuel
            # heat contents by it (they are all contained in single higher
            # level group of columns labeled fuel_mmbtu)
            df.loc[:, "fuel_mmbtu"].div(
                df.loc[:, "fuel_mmbtu"].sum(axis=1), axis="rows"
            ),
            # Merge that same total into the dataframe separately as well.
            plant_year_totals.loc[:, "fuel_mmbtu"],
            right_index=True,
            left_index=True,
        )
        .rename(columns=lambda x: re.sub(r"$", "_fraction_mmbtu", x))
        .rename(columns=lambda x: re.sub(r"_mmbtu_fraction_mmbtu$", "_mmbtu", x))
    )

    # Calculate total fuel cost for each plant, and divide it out
    cost_group = (
        pd.merge(
            # Sum up all the fuel costs, and divide the individual fuel
            # costs by it (they are all contained in single higher
            # level group of columns labeled fuel_cost)
            df.loc[:, "fuel_cost"].div(df.loc[:, "fuel_cost"].sum(axis=1), axis="rows"),
            # Merge that same total into the dataframe separately as well.
            plant_year_totals.loc[:, "fuel_cost"],
            right_index=True,
            left_index=True,
        )
        .rename(columns=lambda x: re.sub(r"$", "_fraction_cost", x))
        .rename(columns=lambda x: re.sub(r"_cost_fraction_cost$", "_cost", x))
    )

    # Re-unify the cost and heat content information:
    df = pd.merge(
        mmbtu_group, cost_group, left_index=True, right_index=True
    ).reset_index()

    # Label each plant-year record by primary fuel:
    for fuel_str in fuel_categories:
        try:
            mmbtu_mask = df[f"{fuel_str}_fraction_mmbtu"] > thresh
            df.loc[mmbtu_mask, "primary_fuel_by_mmbtu"] = fuel_str
        except KeyError:
            pass

        try:
            cost_mask = df[f"{fuel_str}_fraction_cost"] > thresh
            df.loc[cost_mask, "primary_fuel_by_cost"] = fuel_str
        except KeyError:
            pass

    df[["primary_fuel_by_cost", "primary_fuel_by_mmbtu"]] = df[
        ["primary_fuel_by_cost", "primary_fuel_by_mmbtu"]
    ].fillna("")

    return df
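
The aggregation described in the docstring of `fuel_by_plant_ferc1` can be followed by hand for a single plant-year. The sketch below uses plain Python rather than pandas, so it is only an illustration of the arithmetic, not the function's implementation. The sample records, the helper name `primary_fuel_by_mmbtu`, and the zero-total guard are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical per-fuel records for one plant-year:
# (fuel type, units consumed, mmBTU per unit).
records = [
    ("coal", 1000.0, 20.0),
    ("gas", 500.0, 1.0),
    ("coal", 200.0, 22.5),
]


def primary_fuel_by_mmbtu(records, fuel_categories, thresh=0.5):
    """Return (fractions, label) for a single plant-year's fuel records."""
    # Heat content per record is units * mmBTU per unit, summed by fuel type.
    mmbtu_by_fuel = defaultdict(float)
    for fuel, units, mmbtu_per_unit in records:
        mmbtu_by_fuel[fuel] += units * mmbtu_per_unit

    total = sum(mmbtu_by_fuel.values())
    if total == 0:
        # Assumption for this sketch: no heat content means no primary fuel.
        return {}, ""

    # Divide each fuel's heat content by the plant-year total.
    fractions = {fuel: mmbtu / total for fuel, mmbtu in mmbtu_by_fuel.items()}

    # A fuel is "primary" only if its share strictly exceeds the threshold.
    label = ""
    for fuel in fuel_categories:
        if fractions.get(fuel, 0.0) > thresh:
            label = fuel
    return fractions, label


fractions, label = primary_fuel_by_mmbtu(records, ["coal", "gas", "oil"])
print(fractions, label)
```

Because the comparison is strict, a plant-year split exactly 50/50 between two fuels gets no primary fuel at the default threshold and ends up with an empty label.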


def construct_ferc1_plant_matching_model(fuel_cols: list[str]) -> CrossYearLinker:
    """Create a CrossYearLinker configured to match FERC1 plants."""
    return CrossYearLinker(
        **{
            "id_column": "plant_id_ferc1",
            "column_transforms": [
                ColumnTransform(
                    **{
                        "step_name": "plant_name_ferc1",
                        "columns": "plant_name_ferc1",
                        "transformer": "string",
                        "weight": 2.0,
                        "cleaning_ops": [
                            CleaningRules(input_column="plant_name_ferc1")
                        ],
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "plant_type",
                        "columns": ["plant_type"],
                        "transformer": "category",
                        "weight": 2.0,
                        "cleaning_ops": ["null_to_empty_str"],
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "construction_type",
                        "columns": ["construction_type"],
                        "transformer": "category",
                        "cleaning_ops": ["null_to_empty_str"],
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "capacity_mw",
                        "columns": ["capacity_mw"],
                        "transformer": "number",
                        "cleaning_ops": ["null_to_zero"],
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "construction_year",
                        "columns": ["construction_year"],
                        "transformer": "category",
                        "cleaning_ops": ["fix_int_na"],
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "utility_id_ferc1",
                        "columns": ["utility_id_ferc1"],
                        "transformer": "category",
                    }
                ),
                ColumnTransform(
                    **{
                        "step_name": "fuel_fraction_mmbtu",
                        "columns": fuel_cols,
                        "transformer": Pipeline(
                            [("scaler", MinMaxScaler()), ("norm", Normalizer())]
                        ),
                        "cleaning_ops": ["null_to_zero"],
                    }
                ),
            ],
        }
    )
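
The `fuel_fraction_mmbtu` step above chains scikit-learn's `MinMaxScaler` and `Normalizer`. As a rough illustration of what those two steps do with their default settings (per-column rescaling to [0, 1], then per-row L2 normalization), here is a plain-Python sketch. It does not use scikit-learn, the function names and sample rows are made up for this example, and how `CrossYearLinker` combines the result with the other column transforms and their weights is not shown in this diff.

```python
import math


def min_max_scale(rows):
    """Rescale each column to [0, 1], mirroring MinMaxScaler's defaults."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        # A constant column has zero range; dividing by 1.0 maps it to 0.0.
        span = (high - low) or 1.0
        scaled_columns.append([(value - low) / span for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def l2_normalize(rows):
    """Scale each row to unit Euclidean length, leaving all-zero rows alone."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        normalized.append([value / norm for value in row])
    return normalized


# Hypothetical (coal, gas) heat-content fractions for three plant-years.
fuel_fractions = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

embedded = l2_normalize(min_max_scale(fuel_fractions))
print(embedded)
```

After the second step every non-zero row has length 1, so comparisons between rows depend on the mix of fuels rather than on the magnitude of the values.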


def plants_steam_assign_plant_ids(
    ferc1_steam_df: pd.DataFrame,
    ferc1_fuel_df: pd.DataFrame,
    fuel_categories: list[str],
) -> pd.DataFrame:
    """Assign IDs to the large steam plants."""
    ###########################################################################
    # FERC PLANT ID ASSIGNMENT
    ###########################################################################
    # Now we need to assign IDs to the large steam plants, since FERC doesn't
    # do this for us.
    logger.info("Identifying distinct large FERC plants for ID assignment.")

    # Grab fuel consumption proportions for use in assigning plant IDs:
    fuel_fractions = fuel_by_plant_ferc1(ferc1_fuel_df, fuel_categories)
    ffc = list(fuel_fractions.filter(regex=".*_fraction_mmbtu$").columns)

    ferc1_steam_df = ferc1_steam_df.merge(
        fuel_fractions[["utility_id_ferc1", "plant_name_ferc1", "report_year"] + ffc],
        on=["utility_id_ferc1", "plant_name_ferc1", "report_year"],
        how="left",
    )

    fuel_cols = list(ferc1_steam_df.filter(regex=".*_fraction_mmbtu$").columns)

    # Train the classifier using DEFAULT weights, parameters not listed here.
    clf = construct_ferc1_plant_matching_model(fuel_cols)
    ferc1_steam_df = clf.fit_predict(ferc1_steam_df)

    # Set the construction year back to numeric because it is.
    ferc1_steam_df["construction_year"] = pd.to_numeric(
        ferc1_steam_df["construction_year"], errors="coerce"
    )
    # We don't actually want to save the fuel fractions in this table... they
    # were only here to help us match up the plants.
    ferc1_steam_df = ferc1_steam_df.drop(ffc, axis=1)
    ferc1_steam_df = revert_filled_in_string_nulls(ferc1_steam_df)

    return ferc1_steam_df


def revert_filled_in_string_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Revert the filled nulls from string columns.

    Many columns that are used for the classification in
    :func:`plants_steam_assign_plant_ids` have many nulls. The classifier can't handle
    nulls well, so we filled in nulls with empty strings for string columns. This
    function replaces empty strings with null values for specific columns that are known
    to contain empty strings introduced for the classifier.
    """
    for col in [
        "plant_type",
        "construction_type",
        "fuel_type_code_pudl",
        "primary_fuel_by_cost",
        "primary_fuel_by_mmbtu",
    ]:
        if col in df.columns:
            # the replace to_replace={column_name: {"", pd.NA}} mysteriously doesn't work.
            df[col] = df[col].replace(
                to_replace=[""],
                value=pd.NA,
            )
    return df


def revert_filled_in_float_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Revert the filled nulls from float columns.

    Many columns that are used for the classification in
    :func:`plants_steam_assign_plant_ids` have many nulls. The classifier can't handle
    nulls well, so we filled in nulls with zeros for float columns. This function
    replaces zeros with nulls for all float columns.
    """
    float_cols = list(df.select_dtypes(include=[float]))
    if float_cols:
        df.loc[:, float_cols] = df.loc[:, float_cols].replace(0, np.nan)
    return df


def plants_steam_validate_ids(ferc1_steam_df: pd.DataFrame) -> pd.DataFrame:
    """Tests that each plant_id_ferc1 time series has at most one record per year.

    An error is logged for every plant_id_ferc1 that has more than one record in a
    given report_year. No exception is raised.

    Args:
        ferc1_steam_df: A DataFrame of the data from the FERC 1 Steam table.

    Returns:
        The input dataframe, to enable method chaining.
    """
    ##########################################################################
    # FERC PLANT ID ERROR CHECKING STUFF
    ##########################################################################

    # Test to make sure that we don't have any plant_id_ferc1 time series
    # which include more than one record from a given year. Warn the user
    # if we find such cases (which... we do, as of writing)
    year_dupes = (
        ferc1_steam_df.groupby(["plant_id_ferc1", "report_year"])["utility_id_ferc1"]
        .count()
        .reset_index()
        .rename(columns={"utility_id_ferc1": "year_dupes"})
        .query("year_dupes>1")
    )
    if len(year_dupes) > 0:
        for dupe in year_dupes.itertuples():
            logger.error(
                f"Found report_year={dupe.report_year} "
                f"{dupe.year_dupes} times in "
                f"plant_id_ferc1={dupe.plant_id_ferc1}"
            )
    else:
        logger.info("No duplicate years found in any plant_id_ferc1. Hooray!")

    return ferc1_steam_df
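
The check in `plants_steam_validate_ids` amounts to counting records per (plant_id_ferc1, report_year) pair and reporting every pair seen more than once. A minimal stand-alone sketch of that idea, using `collections.Counter` instead of a pandas groupby, might look like the following; the sample pairs and the helper name `find_year_dupes` are hypothetical.

```python
from collections import Counter

# Hypothetical (plant_id_ferc1, report_year) pairs, one per steam plant record.
plant_years = [
    (1, 2019),
    (1, 2020),
    (2, 2019),
    (2, 2019),  # second record for plant 2 in 2019
    (3, 2020),
]


def find_year_dupes(pairs):
    """Return {(plant_id, year): count} for pairs that occur more than once."""
    counts = Counter(pairs)
    return {key: count for key, count in counts.items() if count > 1}


year_dupes = find_year_dupes(plant_years)
for (plant_id, year), count in year_dupes.items():
    print(f"Found report_year={year} {count} times in plant_id_ferc1={plant_id}")
```

An empty result corresponds to the "No duplicate years found" branch in the function above.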