Gap Analysis Step 1b: Estimate the AEP and Its Uncertainty Using Cubico Open Data#

Before getting started, be sure that the reanalysis modifier was used for installing OpenOA (i.e., pip install "openoa[reanalysis]").

This notebook provides an overview and walk-through of the steps taken to produce a plant-level operational energy asssessment (OA) of a wind plant, using the open data available for Kelmarsh and Penmanshiel.

Uncertainty in the annual energy production (AEP) estimate is calculated through a Monte Carlo approach. Specifically, inputs into the OA code as well as intermediate calculations are randomly sampled based on their specified or calculated uncertainties. By performing the OA assessment thousands of times under different combinations of the random sampling, a distribution of AEP values results from which uncertainty can be deduced. Details on the Monte Carlo approach will be provided throughout this notebook.

Note: If the data is not already available on your computer, be sure to install the additionally required packages cdsapi (ERA5 API library) and xarray (data handling for NetCDF data). This can be done via pip install "openoa[weather]" to have have it managed through the OpenOA installation process.

Step 1: Import the data#

Data is downloaded automatically from Zenodo for the wind farm data and from GES DISC and CDS for the long-term reanalysis data. These are then stored in the ‘examples/data’ folder under the asset names. Note this includes all available data, and so for the 2 assets uses about 20 GB of space in total. The first time this notebook is run it could therefore take a signficant amount of time to download the data (e.g. over an hour). Its recommended that you leave it to run, then restart and run-all to clear the download messages. For downloading the GES DISC and CDS long-term reanalysis data registration is also required, see: GES DISC registration and CDS registration.

import pandas as pd
import matplotlib.pyplot as plt

from openoa.analysis.aep import MonteCarloAEP
from openoa.utils import plot

import project_Cubico
WARNING:py.warnings:/Users/rhammond/miniforge3/envs/openoa/lib/python3.11/site-packages/cdsapi/api.py:17: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources

WARNING:py.warnings:/Users/rhammond/miniforge3/envs/openoa/lib/python3.11/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
%load_ext autoreload
%autoreload 2

In the call below, data is loaded using the project_Cubico.prepare() function.

asset = "kelmarsh" # kelmarsh or penmanshiel

project = project_Cubico.prepare(asset=asset)

Step 2: Review the data#

Several Pandas data frames have now been loaded. Histograms showing the distribution of the plant-level metered energy, availability, and curtailment are shown below. Note curtailments are not calculated/identified in the original data.

plot.column_histograms(project.meter.resample("1h").sum(numeric_only=True), columns=["MMTR_SupWh"])
plot.column_histograms(project.curtail.resample("1h").sum(numeric_only=True), columns=["IAVL_DnWh", "IAVL_ExtPwrDnWh"])
../_images/4229e7d9da549b97612ece3e41dd8062569b769ac16d16fe0bdac3156fa80a72.png ../_images/7d2cf03dd1c0213a8bcbd3e9124352793e46e0f26bf5db0c3815bdff3626f739.png

Step 3: Process the data into monthly averages and sums#

The raw plant data can be in different time resolutions (in this case 10-minute periods). The following steps process the data into monthly averages and combine them into a single ‘monthly’ data frame to be used in the OA assessment. The plots below show the monthly metered generation and the losses due to plant availability and curtailments.

project.meter.tail()
MMTR_SupWh
time
2021-06-30 23:10:00 341.260829
2021-06-30 23:20:00 341.260829
2021-06-30 23:30:00 341.260829
2021-06-30 23:40:00 341.260829
2021-06-30 23:50:00 341.260829
project.meter.resample("1ME").sum(numeric_only=True).plot();
../_images/e3c88903ec672df71d4b4bca86385352afd08c56117f6bd16467c4b35e542308.png
project.curtail.resample("1ME").sum(numeric_only=True).plot();
../_images/f767c08694b6d97d262423a7d49884529ea41ab843901639917145b9cb25174a.png

First, we’ll create a MonteCarloAEP object which is used to calculate long-term AEP. The available reanalysis products are specified as arguments.

pa = MonteCarloAEP(project, reanalysis_products = list(project.reanalysis.keys()))

Let’s view the result. Note the extra fields we’ve calculated that we’ll use later for filtering:

  • energy_nan_perc : the percentage of NaN values in the raw revenue meter data used in calculating the monthly sum. If this value is too large, we shouldn’t include this month

  • nan_flag : if too much energy, availability, or curtailment data was missing for a given month, flag the result

  • num_days_expected : number of days in the month (useful for normalizing monthly gross energy later)

  • num_days_actual : actual number of days per month as found in the data (used when trimming monthly data frame)

# View the monthly data frame
pa.aggregate
energy_gwh energy_nan_perc num_days_expected num_days_actual availability_gwh curtailment_gwh gross_energy_gwh availability_pct curtailment_pct avail_nan_perc curt_nan_perc nan_flag availability_typical curtailment_typical combined_loss_valid merra2 era5 era5_monthly merra2_monthly
time
2016-02-01 0.000000 1.000000 29 29 0.789936 0.0 0.789936 1.000000 0.0 0.051125 0.050088 True True True True 8.804539 9.050368 5.520306 9.157150
2016-03-01 0.000000 1.000000 31 31 0.103745 0.0 0.103745 1.000000 0.0 0.011686 0.011686 True True True True 6.952017 7.209454 4.495580 7.259103
2016-04-01 0.000000 1.000000 30 30 0.164337 0.0 0.164337 1.000000 0.0 0.027739 0.027739 True True True True 6.416745 6.602320 4.079608 6.660838
2016-05-01 1.880098 0.078181 31 31 0.036452 0.0 1.916550 0.019019 0.0 0.029122 0.029122 True True True True 6.523691 6.785986 4.245113 6.796216
2016-06-01 1.301968 0.000000 30 30 0.003242 0.0 1.305210 0.002484 0.0 0.000849 0.000694 False True True True 5.084447 5.128154 3.362180 5.297147
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-02-01 2.974716 0.000000 28 28 0.529105 0.0 3.503821 0.151008 0.0 0.000000 0.000000 False True True True 8.613341 8.621293 5.218669 8.942116
2021-03-01 3.203956 0.000000 31 31 0.092464 0.0 3.296420 0.028050 0.0 0.000224 0.000224 False True True True 7.613309 7.721181 4.804443 7.891042
2021-04-01 1.641460 0.000000 30 30 0.008583 0.0 1.650043 0.005202 0.0 0.000000 0.000000 False True True True 5.595019 5.601271 3.549652 5.786864
2021-05-01 2.366911 0.000000 31 31 0.045542 0.0 2.412453 0.018878 0.0 0.000299 0.000000 False True True True 6.717222 6.747532 4.282772 6.966653
2021-06-01 1.148925 0.000000 30 30 0.003906 0.0 1.152832 0.003389 0.0 0.000000 0.000000 False True True True 5.283969 5.381440 3.463406 5.490280

65 rows × 19 columns

Step 4: Review reanalysis data#

Reanalysis data will be used to long-term correct the operational energy over the plant period of operation to the long-term. It is important that we only use reanalysis data that show reasonable trends over time with no noticeable discontinuities. A plot like below, in which normalized monthly wind speeds are shown from 2000 to present, provides a good first look at data quality.

The plot shows that the reanalysis products track each other reasonably well and seem well-suited for the analysis.

Note ‘era5_monthly’ and ‘merra2_monthly’ are taken from monthly aggregated data at ground level, while other sources are at different heights and generally aggregated from hourly data.

pa.plot_normalized_monthly_reanalysis_windspeed(
    return_fig=False,
    #xlim=(datetime(2000, 1, 1), datetime(2022, 12, 31)),
    #ylim=(0.8, 1.2),
)
../_images/a81ee009b47d96024ad6cb215cfb3951a1a9dedade425b422cf8d4391ba69d48.png

Step 5: Review energy and loss data#

It is useful to take a look at the energy data and make sure the values make sense. We begin with scatter plots of gross energy and wind speed for each reanalysis product. We also show a time series of gross energy, as well as availability and curtailment loss.

Let’s start with the scatter plots of gross energy vs wind speed for each reanalysis product. Here we use the ‘Robust Linear Model’ (RLM) module of the Statsmodels package with the default Huber algorithm to produce a regression fit that excludes outliers. Data points with an ‘x’ marker show the outliers, and were excluded based on a Huber sensitivity factor of 3.0.

The plots below reveal that:

  • there are some outliers

  • the reanalysis products are strongly correlated with plant energy

  • the mean wind speed of the reanalaysis products varies considerably, in part because of the different heights they are measured at

If the user would like the RLM outlier detection algorithm to be used in the Monte Carlo analysis, the “outlier_detection” parameter should be set to “True” when calling the plant_analysis class.

pa.plot_reanalysis_gross_energy_data(outlier_threshold=3, plot_kwargs={"s": 60})
../_images/252fbfa225b98e876570a5a34348a97492c9d44140920f0fdbc17b364808c67f.png

Next we show time series plots of the monthly gross energy, availabilty, and curtialment. Note that the availability and curtailment data were estimated based on SCADA data from the plant.

Long-term availability and curtailment losses for the plant are calculated based on average percentage losses for each calendar month. Summing those average values weighted by the fraction of long-term gross energy generated in each month yields the long-term annual estimates. Weighting by monthly long-term gross energy helps account for potential correlation between losses and energy production (e.g., high availability losses in summer months with lower energy production). The long-term losses are calculated in Step 9.

pa.plot_aggregate_plant_data_timeseries()
../_images/ac85ad4201c6c13e3e261ff30e78933851f8bbd08cb9ae67593095cfb3a1e5d5.png

Step 6: Specify availabilty and curtailment data not represenative of actual plant performance#

There may be anomalies in the reported availabilty that shouldn’t be considered representative of actual plant performance. Force majeure events (e.g. lightning) are a good example. Such losses aren’t typically considered in pre-construction AEP estimates; therefore, plant availablity loss reported in an operational AEP analysis should also not include such losses.

The ‘availability_typical’ and ‘curtailment_typical’ fields in the monthly data frame are initially set to True. Below, individual months can be set to ‘False’ if it is deemed those months are unrepresentative of long-term plant losses. By flagging these months as false, they will be omitted when assessing average availabilty and curtailment loss for the plant.

Justification for removing months from assessing average availabilty or curtailment should come from conversations with the owner/operator. For example, if a high-loss month is found, reasons for the high loss should be discussed with the owner/operator to determine if those losses can be considered representative of average plant operation.

# For both assets there are a few months that aren't representative of long-term losses, so these are excluded.

if asset == 'kelmarsh':
    pa.aggregate.loc['2016-02-01',['availability_typical','curtailment_typical']] = False
    pa.aggregate.loc['2016-03-01',['availability_typical','curtailment_typical']] = False
    pa.aggregate.loc['2016-04-01',['availability_typical','curtailment_typical']] = False
    
elif asset == 'penmanshiel':
    pa.aggregate.loc['2018-01-01',['availability_typical','curtailment_typical']] = False
    pa.aggregate.loc['2018-02-01',['availability_typical','curtailment_typical']] = False
    pa.aggregate.loc['2018-03-01',['availability_typical','curtailment_typical']] = False

Step 7: Select reanalysis products to use#

Based on the assessment of reanalysis products above (both long-term trend and relationship with plant energy), we now set which reanalysis products we will include in the OA. For this particular case study, we use all products given the high regression relationships. These can then be compared in the results.

Step 8: Set up Monte Carlo inputs#

The next step is to set up the Monte Carlo framework for the analysis. Specifically, we identify each source of uncertainty in the OA estimate and use that uncertainty to create distributions of the input and intermediate variables from which we can sample for each iteration of the OA code. For input variables, we can create such distributions beforehand. For intermediate variables, we must sample separately for each iteration.

Detailed descriptions of the sampled Monte Carlo inputs, which can be specified when initializing the MonteCarloAEP object if values other than the defaults are desired, are provided below:

  • uncertainty_meter : Revenue meter energy measurements are associated with a measurement uncertainty of around 0.5%. This uncertainty is used to create a distribution centered at 1 (and with standard deviation therefore of 0.005). This column represents random samples from that distribution. For each iteration of the OA code, a value from this column is multiplied by the monthly revenue meter energy data before the data enter the OA code, thereby capturing the 0.5% uncertainty.

  • uncertainty_losses : Reported availability and curtailment losses are estimates and are associated with uncertainty. For now, we assume the reported values are associated with an uncertainty of 5%. Similar to above, we therefore create a distribution centered at 1 (with std of 0.05) from which we sample for each iteration of the OA code. These sampled values are then multiplied by the availability and curtaiment data independently before entering the OA code to capture the 5% uncertainty in the reported values.

  • uncertainty_windiness : This intends to capture the uncertainty associated with the number of historical years an analyst chooses to use in the windiness correction. The industry standard is typically 20 years and is based on the assumption that year-to-year wind speeds are uncorrelated. However, a growing body of research suggests that there is some correlation in year-to-year wind speeds and that there are trends in the resource on the decadal timescale. To capture this uncertainty both in the long-term trend of the resource and the analyst choice, we randomly sample integer values betweeen 10 and 20 as the number of years to use in the windiness correction.

  • uncertainty_loss_max : Due to uncertainty in reported availability and curtailment estimates, months with high combined losses are associated with high uncertainty in the calculated gross energy. It is common to remove such data from analysis. For this analysis, we randomly sample float values between 0.1 and 0.2 (i.e. 10% and 20%) to serve as criteria for the combined availability and curtailment losses. Specifically, months are excluded from analysis if their combined losses exceeds that criteria for the given OA iteration.

  • uncertainty_nan_energy: Threshold for removing days/months based on the fraction of NaNs in the data.

  • outlier_detection: Whether to perform (True) or not (False) outlier detection filtering. The default value is set to False.

  • uncertainty_outlier : Sample values between 1 and 3 which set the Huber algorithm outlier detection parameter (for linear regression; for machine learning regression algorithms, please see example 2b). Varying this threshold accounts for analyst subjectivity on what data points constitute outliers and which do not.

  • reanalyis_product : This captures the uncertainty of using different reanalysis products and, lacking a better method, is a proxy way of capturing uncertainty in the modelled monthly wind speeds. For each iteration of the OA code, one of the reanalysis products that we’ve already determined as valid (see the cells above) is selected.

Finally, we note that the operational data are also randomly resampled at each iteration using bootstrapping to help quantify uncertainty in the results. Consequently, different slope and intercept values for the linear regression model mapping monthly wind speed to energy production are determined each Monte Carlo iteration.

Step 9: Run the OA code#

We’re now ready to run the Monte Carlo based OA code. We repeat the OA process “num_sim” times using different sampling combinations of the input and intermediate variables to produce a distribution of AEP values.

A single line of code here in the notebook performs this step, but below is more detail on what is being done.

Steps in OA process:

  • Set the wind speed and gross energy data to be used in the regression based on i) the reanalysis product to be used (Monte Carlo sampled); ii) randomly resampling which data points are included each Monte Carlo iteration, with replacement (e.g., bootstrapping); iii) the NaN energy data criteria (1%); iv) combined availability and curtailment loss criteria (Monte Carlo sampled); and v) the outlier criteria (Monte Carlo sampled)

  • Normalize gross energy to 30-day months

  • Perform linear regression and determine slope and intercept values, their standard errors, and the covariance between the two

  • Apply the slope and intercept values to the long-term monthly average wind speeds (based on a 10-20 year historical period as determined by the Monte Carlo process) to calculate long-term monthly gross energy

  • ‘Denormalize’ monthly long-term gross energy back to the normal number of days

  • Calculate AEP by subtracting out the long-term avaiability loss (curtailment loss is left in as part of AEP)

  • To account for uncertainty in the AEP for any single year, interannual variability is applied to the AEP estimates from each Monte Carlo iteration based on the standard deviation of the annual AEP values during the historical period

# Run Monte Carlo based OA
pa.run(num_sim=5000, reanalysis_products=list(project.reanalysis.keys()))
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:09<00:00, 523.60it/s]

The key result is shown below: a distribution of AEP values from which uncertainty can be deduced. In this case, uncertainty is around 10%.

# Plot a distribution of AEP values from the Monte Carlo OA method
pa.plot_result_aep_distributions(
    annotate_kwargs={"fontsize": 12},
)

# # Alternative available in the plot module
# plot_results = pa.results.copy()
# plot_results[["avail_pct", "curt_pct"]] = plot_results[["avail_pct", "curt_pct"]] * 100
# plot.plot_distributions(
#     data=plot_results,
#     which=["aep_GWh", "avail_pct", "curt_pct"],
#     xlabels=["AEP (GWh/yr)", "Availability Loss (%)", "Curtailment Loss (%)"],
#     xlim_aep=(8, 18),
#     xlim_availability=(0.7, 1.3),
#     xlim_curtail=(0.04, 0.09),
#     ylim_aep=(0, 0.4),
#     ylim_availability=(0, 9),
#     ylim_curtail=(0, 120),
#     annotate_kwargs={"fontsize": 12},
# )
../_images/45009314ff277304116b12c338a666c31fa624c760cd537a9aae4ce292560f05.png

Step 10: Post-analysis visualization#

Here we show some supplementary results of the Monte Carlo OA approach to help illustrate how it works.

First, it’s worth looking at the Monte Carlo tracker data frame again, now that the slope, intercept, and number of outlier fields have been completed. Note that for transparency, debugging, and analysis purposes, we’ve also included in the tracker data frame the number of data points used in the regression.

# Produce histograms of the various MC-parameters
mc_reg = pd.DataFrame(data={
    'slope': pa._mc_slope.ravel(),
    'intercept': pa._mc_intercept, 
    'num_points': pa._mc_num_points, 
    'metered_energy_fraction': pa.mc_inputs.metered_energy_fraction, 
    'loss_fraction': pa.mc_inputs.loss_fraction,
    'num_years_windiness': pa.mc_inputs.num_years_windiness, 
    'loss_threshold': pa.mc_inputs.loss_threshold,
    'reanalysis_product': pa.mc_inputs.reanalysis_product
})

It’s useful to plot distributions of each variable to show what is happening in the Monte Carlo OA method. Based on the plot below, we observe the following:

  • metered_energy_fraction, and loss_fraction sampling follow a normal distribution as expected.

  • The slope and intercept distributions highlight the use of different reanalysis products, resulting in different regression relationships.

  • We see approximately equal sampling of the num_years_windiness, loss_threshold, and reanalysis_product, as expected.

plot.column_histograms(mc_reg)
../_images/03006506685819ce1170153882e5df903d6ea35019318a0ee86ce34d1b87c480.png

It’s worth highlighting the inverse relationship between slope and intercept values under the Monte Carlo approach. Slope and intercept values are strongly negatively correlated (e.g. slope goes up, intercept goes down), as shown below.

# Produce scatter plots of slope and intercept values. Here we focus on only one reanalysis data set
plt.figure(figsize=(8,6))
plt.plot(
    mc_reg.intercept[mc_reg.reanalysis_product ==list(project.reanalysis.keys())[0]],
    mc_reg.slope[mc_reg.reanalysis_product ==list(project.reanalysis.keys())[0]],
    '.', label="Monte Carlo Regression Values"
)
plt.xlabel('Intercept (GWh)')
plt.ylabel('Slope (GWh / (m/s))')
plt.legend()
plt.show()
../_images/26852ecfd90f597be410953066fffb2b9be11827ae648ffd9dc020f8da9e12f8.png

We can look further at the influence of certain Monte Carlo parameters on the AEP result. For example, let’s see what effect the choice of reanalysis product has on the result:

pa.plot_aep_boxplot(x=mc_reg['reanalysis_product'], xlabel="Reanalysis Product")
../_images/1c7b57c0d9ec770930576581610d9204076f8104a8709b743abf497873878c0d.png

We can also look at the effect on the number of years used in the windiness correction, in the below code block and figure, where we can se that the number of years used in the windiness correction does not significantly impact the AEP estimate.

# Boxplot of AEP based on number of years in windiness correction
# NOTE: This is the same method, but calling the same method through the plot module directly
plot.plot_boxplot(
    y=pa.results.aep_GWh,
    x=mc_reg['num_years_windiness'],
    xlabel="Number of Years in the Windiness Correction",
    ylabel="AEP (GWh/yr)",
    plot_kwargs_box={"flierprops":dict(marker="x", markeredgecolor="tab:blue")}
)
../_images/82dc3e968fcaa56668076b577cba1c543fa0c7cff87e01f675a5612930e786d3.png

In some cases, it may be useful to show more of the background information for a plot, so we can return the figure and add another routine on top

fig, ax, boxes = pa.plot_aep_boxplot(
    x=mc_reg['num_years_windiness'],
    xlabel="Number of Years in the Windiness Correction",
    figure_kwargs=dict(figsize=(12, 6)),
    plot_kwargs_box={
        "flierprops": dict(marker="x", markeredgecolor="tab:blue"),
        "medianprops": dict(linewidth=1.5)
    },
    return_fig=True,
    with_points=True,
    points_label="Individual AEP Estimates",
    plot_kwargs_points=dict(alpha=0.5, s=2),
    legend_kwargs=dict(loc="lower left"),
)
../_images/c972ce3db856a31a608b5be858e3aa3c9b9ca76c4c2c92876e86c58ae1e2ef96.png