CAISO Wind Energy Forecast

time series plot with labels

This repository uses machine learning to forecast the amount of wind energy in the California electricity grid. With regard to the typical data science workflow, it covers modeling and limited feature engineering.

This project is executed in R, using the Modeltime library for modeling, and MLFlow for tracking experiments.

The motivation for this project is to explore the capabilities (and limits) of the Modeltime time series machine learning library and to get a feeling for the challenges of modelling a part of the electricity grid.

Overview

The picture below gives an overview of the ETL and modeling process. To aid reproducibility, alphanumeric strings in labels (e.g. b2caff6) refer to specific git commits on which these models were generated.

Model Architecture

About the Data

All data have been acquired and aggregated by the author from public sources. Three datasets are available to aid the forecasting effort:

  • db_pull_production_data_raw_20201208.csv – 5-min time series data stating energy in the California grid by energy source, according to CAISO. Columns Time and Wind are relevant for this analysis. Wind is what we are trying to forecast.

  • db_pull_weather_data_raw_20201208.csv – 1-hour weather time series data at 10 key locations for renewable energy generation (wind, solar) in California. The author determined these locations based on geospatial analysis of renewable energy assets in the state, an analysis that is outside the scope of this repository. Columns starting with 0_ through 4_ belong to wind-generating locations, ordered in descending order of generating capacity. This dataset can be used to create features for the model. Weather data have been acquired from Dark Sky.

  • db_pull_feature_gross_production_20201209.csv – 1-hour time series dataset with domain-informed features at 10 key locations (see above). The author generated the wind-related features (0_wind through 4_wind) by combining weather and turbine power curve information. From the weather information above, the author estimated the available wind energy, adjusted for air density and hub height. These estimates were fed through an assumed power curve and multiplied by the assumed total capacity available at that key location. That analysis is outside the scope of this repository, but the generated features can be used for modeling.
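To illustrate how a wind feature of this kind could be derived, the Python sketch below chains the steps described for the third dataset: hub height adjustment, air density adjustment, an assumed power curve, and scaling by capacity. It is not the author's pipeline (that analysis is outside this repository, and the project itself is written in R). All function names, the power-law shear exponent of 1/7, the 10 m and 80 m heights, the cube-root density correction, and the cut-in, rated, and cut-out speeds are assumptions chosen for illustration.

```python
R_DRY_AIR = 287.05  # specific gas constant of dry air, J/(kg*K)
RHO_STD = 1.225     # standard air density at sea level and 15 degC, kg/m^3


def air_density(pressure_pa, temperature_c):
    """Dry-air density from the ideal gas law."""
    return pressure_pa / (R_DRY_AIR * (temperature_c + 273.15))


def hub_height_speed(speed_ref, h_ref=10.0, h_hub=80.0, alpha=1 / 7):
    """Extrapolate wind speed to hub height with a power-law profile (assumed)."""
    return speed_ref * (h_hub / h_ref) ** alpha


def density_adjusted_speed(speed, rho):
    """Scale wind speed so that the cubic power law reflects actual air density."""
    return speed * (rho / RHO_STD) ** (1 / 3)


def power_curve_fraction(speed, cut_in=3.0, rated=12.0, cut_out=25.0):
    """Hypothetical power curve: fraction of rated power at a given wind speed."""
    if speed < cut_in or speed >= cut_out:
        return 0.0
    if speed >= rated:
        return 1.0
    # Cubic ramp between cut-in and rated wind speed.
    return (speed ** 3 - cut_in ** 3) / (rated ** 3 - cut_in ** 3)


def gross_wind_feature(speed_ref, pressure_pa, temperature_c, capacity_mw):
    """Estimated gross production (MW) at one location for one time step."""
    rho = air_density(pressure_pa, temperature_c)
    speed = density_adjusted_speed(hub_height_speed(speed_ref), rho)
    return capacity_mw * power_curve_fraction(speed)


# Example: 6 m/s measured at 10 m, standard pressure, 15 degC, 500 MW capacity.
print(gross_wind_feature(6.0, 101325.0, 15.0, 500.0))
```

The result is bounded by zero and the assumed capacity, which makes the feature directly comparable to the Wind target column.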

All data are fed through a feature engineering pipeline, for which relevant features have been selected using a random forest model.

About the Model

The model is a weighted ensemble of 8 tree-based models. Two of those models are based on Cubist, a rule-based regression model with boosting-like committees (find a great presentation about Cubist by Max Kuhn here). Three models are based on XGBoost, and the remaining three are random forest models. These model types were chosen for their ability to incorporate a set of 188 features and for their good training performance in R.

Each model has been trained on about 143k data points. The model parameters have been selected after hyperparameter tuning, subject to 4-fold time series cross-validation. The number of folds was constrained by training performance.
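For readers unfamiliar with time series cross-validation, the Python sketch below shows one common way to build 4 folds in which every test window lies strictly after its training window. It is an assumption about the general scheme, not a reproduction of the resampling used in this repository; the function name `time_series_folds`, the expanding training window, and the equal test window size are all hypothetical choices.

```python
def time_series_folds(n_obs, n_folds=4, test_size=None):
    """Return (train_indices, test_indices) pairs for expanding-window CV.

    Observations are assumed to be sorted by time. Each fold trains on
    everything before its test window, so no future data leaks into training.
    """
    if test_size is None:
        test_size = n_obs // (n_folds + 1)
    if test_size < 1 or n_folds * test_size >= n_obs:
        raise ValueError("not enough observations for the requested folds")

    folds = []
    for k in range(n_folds):
        # The last fold ends at the most recent observation.
        test_end = n_obs - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        folds.append((range(0, test_start), range(test_start, test_end)))
    return folds


# Roughly the training set size mentioned above (about 143k data points).
for train_idx, test_idx in time_series_folds(143_000, n_folds=4):
    print(len(train_idx), "training rows ->", len(test_idx), "test rows")
```

Because every model is refit once per fold and per hyperparameter candidate, the number of folds multiplies the total training effort, which is why it is a natural lever when runtime is limited.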

A weighted average was chosen for ensembling over potentially more accurate methods like stacking because of performance constraints. The weights were chosen after assessing 20 different weight combinations generated with a Latin hypercube experimental design (find assessment here).
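The weight search can be pictured with the following Python sketch. It draws 20 candidate weight vectors with a simple Latin hypercube sampler, scores the weighted average of the model predictions for each candidate, and keeps the best one. This is a sketch under stated assumptions, not the code used in this repository: the function names, the normalization of weights to sum to 1, and the use of MAE as the selection metric are all hypothetical.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Draw n_samples points in [0, 1)^n_dims, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_samples)]


def normalize(weights):
    """Rescale weights so that they sum to 1 (assumed convention)."""
    total = sum(weights)
    return [w / total for w in weights]


def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one ensemble forecast."""
    n_steps = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(weights, predictions)) for t in range(n_steps)]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_weights(predictions, actual, n_candidates=20, seed=42):
    """Pick the candidate weight vector with the lowest MAE."""
    rng = random.Random(seed)
    design = latin_hypercube(n_candidates, len(predictions), rng)
    candidates = [normalize(row) for row in design]
    return min(candidates, key=lambda w: mae(actual, weighted_average(predictions, w)))


# Toy example with two models and three time steps.
toy_predictions = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
toy_actual = [2.0, 3.0, 4.0]
print(best_weights(toy_predictions, toy_actual))
```

Compared with stacking, this approach only evaluates existing predictions and never fits an additional model, which keeps the cost of the search low.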

In summary, the modeling process was heavily constrained by performance considerations and available project time.

All model runs, including training and hyperparameter optimization, have been recorded in MLFlow. You can boot up a workable MLFlow instance using the associated Dockerfile. All results have also been exported as CSVs in the "static" directory, where every file name corresponds to an MLFlow experiment.

Repository Structure

  • data – Data used or produced in the modeling process
  • docs – Documentation
  • models – Trained models (removed to save storage space)
    • ens_level1_f9e6c40.rds – Final, trained ensemble model (see above)
  • notebooks – Notebooks for experiments and analyses
  • src – Source code for building models
  • util – Utility tools
    • mlflow – Data related to MLFlow
      • mlruns – MLFlow data directory
      • static – Static export of tracked MLFlow experiments
      • Dockerfile – Dockerfile for running MLFlow server
      • environment.yml – Conda environment file to create MLFlow environment

Model Performance

This section summarizes model performance by use case.

Time Series Forecasting Use Case

The table below shows the performance of the models at time series forecasting. Comparing the training and test metrics, it is clear that the Cubist models heavily overfit and that the random forest models overfit to some extent as well.

Cross-validation (CV) has not been performed for the ensemble model due to performance constraints.

| Model Name | CV | folds | mae (train) | rmse (train) | rsq (train) | mae (test) | rmse (test) | rsq (test) | ratio mae train/test |
|---|---|---|---|---|---|---|---|---|---|
| ens_level1_f9e6c40 | ❌ | | | | | 469 | 599 | 0.779 | |
| cubist_level0_b2caff6_1 | ✅ | 4 | 15 | 32 | 0.999 | 493 | 634 | 0.686 | 0.03 |
| cubist_level0_b2caff6_3 | ✅ | 4 | 20 | 39 | 0.999 | 507 | 656 | 0.670 | 0.04 |
| rf_level0_cc50409_1 | ✅ | 4 | 219 | 300 | 0.943 | 439 | 559 | 0.755 | 0.50 |
| rf_level0_cc50409_2 | ✅ | 4 | 231 | 315 | 0.937 | 440 | 560 | 0.753 | 0.53 |
| rf_level0_cc50409_3 | ✅ | 4 | 37 | 57 | 0.998 | 444 | 567 | 0.751 | 0.08 |
| xgboost_level0_9dc6cbe_1 | ✅ | 4 | 580 | 772 | 0.759 | 640 | 799 | 0.678 | 0.91 |
| xgboost_level0_9dc6cbe_2 | ✅ | 4 | 519 | 689 | 0.803 | 598 | 752 | 0.706 | 0.87 |
| xgboost_level0_9dc6cbe_3 | ✅ | 4 | 565 | 752 | 0.763 | 628 | 783 | 0.670 | 0.90 |
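The last column is the training MAE divided by the test MAE; values far below 1 indicate that a model fits the training data much better than unseen data. The short Python sketch below recomputes that ratio from the MAE columns above and ranks the models by it. The variable and function names are illustrative only.

```python
# (mae train, mae test) per model, copied from the table above.
mae_scores = {
    "cubist_level0_b2caff6_1": (15, 493),
    "cubist_level0_b2caff6_3": (20, 507),
    "rf_level0_cc50409_1": (219, 439),
    "rf_level0_cc50409_2": (231, 440),
    "rf_level0_cc50409_3": (37, 444),
    "xgboost_level0_9dc6cbe_1": (580, 640),
    "xgboost_level0_9dc6cbe_2": (519, 598),
    "xgboost_level0_9dc6cbe_3": (565, 628),
}


def train_test_ratio(mae_train, mae_test):
    """Ratio of training to test MAE; smaller values hint at stronger overfitting."""
    return mae_train / mae_test


ratios = {name: round(train_test_ratio(*scores), 2) for name, scores in mae_scores.items()}

# Rank models from the strongest to the weakest sign of overfitting.
ranked = sorted(mae_scores.items(), key=lambda item: train_test_ratio(*item[1]))
for name, _ in ranked:
    print(name, ratios[name])
```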

Time Series Peak Forecasting Use Case

Forecasting the peaks of the wind energy time series in the California grid is a specific use case of this model. The analyze_performance.rmd (HTML | view in browser) notebook investigates this case in detail. In summary, in 75% of all cases, the peaks predicted by the model do not miss the energy in the actual peaks within a 26-hour window by more than 14%.

Contribution

Contributions and feedback are welcome. Please share your comments via the Issues tab on GitHub.

License and Attributions

This project is released under the MIT License.

Please note that raw data as provided in db_pull_production_data_raw_20201208.csv have been generated by the California ISO.

Please also note that weather data as provided in db_pull_weather_data_raw_20201208.csv have been extracted from Dark Sky and are subject to its Terms of Use, allowing use only for "personal, non-commercial purposes".

For the social preview picture, the California bear is from Vecteezy.com. The R Logo is used in its original form under the CC BY-SA 4.0, and is (C) 2016 by The R Foundation. The Modeltime logo is taken from the Modeltime GitHub repository and is subject to the MIT License.