La Haute Borne dataset

September 24, 2022

For my work, I mainly use R in combination with dplyr, which is an extremely useful library for data manipulation. However, since Python is nowadays one of the most common programming languages, I thought it’d be a good idea to improve my skills in this language. I’ve already talked about this in my first post on Codility.

The Codility exercises focus mainly on basic Python, whereas my daily work relies on other libraries as well. To that end, I’m going to work with the La Haute Borne wind power data, with the overarching goal of producing calibrated probabilistic forecasts of the wind power. The first notebook will deal with importing, cleaning and preprocessing the raw power measurements.
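
To give a flavour of what "importing, cleaning and preprocessing" could look like in pandas, here is a minimal sketch on a tiny made-up table. It is not the code from the notebook: the column names `Date_time` and `P_avg`, the rated power of 2050 kW, the 10-minute resolution and the plausibility thresholds are all assumptions on my part that should be checked against the data description and the static information.

```python
import numpy as np
import pandas as pd

# Assumed rated power in kW; verify against the static wind farm information.
RATED_POWER_KW = 2050.0


def clean_power(raw: pd.DataFrame, rated_power_kw: float = RATED_POWER_KW) -> pd.DataFrame:
    """Return a regular 10-minute power series with implausible values set to NaN."""
    df = raw.copy()

    # Parse timestamps and convert everything to UTC to avoid daylight-saving jumps.
    df["Date_time"] = pd.to_datetime(df["Date_time"], utc=True)

    # Keep the first record per timestamp and order the series chronologically.
    df = df.drop_duplicates(subset="Date_time").set_index("Date_time").sort_index()

    # Crude plausibility check (an assumption): mask negative power and
    # values more than 5 % above rated power.
    implausible = (df["P_avg"] < 0) | (df["P_avg"] > 1.05 * rated_power_kw)
    df.loc[implausible, "P_avg"] = np.nan

    # Reindex onto a regular 10-minute grid so that gaps show up as explicit NaNs.
    df = df.asfreq("10min")

    # Normalise by rated power, which is convenient for forecasting models.
    df["P_norm"] = df["P_avg"] / rated_power_kw
    return df


# Made-up toy records: unsorted, with one duplicate, one gap and two implausible values.
raw = pd.DataFrame(
    {
        "Date_time": [
            "2013-01-01T00:10:00+00:00",
            "2013-01-01T00:00:00+00:00",
            "2013-01-01T00:10:00+00:00",
            "2013-01-01T00:30:00+00:00",
            "2013-01-01T00:40:00+00:00",
        ],
        "P_avg": [500.0, 1025.0, 500.0, -20.0, 9999.0],
    }
)
clean = clean_power(raw)
print(clean)
```

Masking suspicious values instead of dropping the rows keeps the time grid regular, which makes it easier to build lagged features later on.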

When forecasting wind power more than a few hours into the future, it is important to include numerical weather prediction (NWP) forecasts, which provide estimates of future meteorological parameters such as wind speed. For that, we’ll use forecasts from the Global Forecast System (GFS), maintained by the National Centers for Environmental Prediction, because they are freely available and a convenient API exists. Retrieving them is the topic of the second notebook.

The final notebook will be dedicated to training, validating and testing a selection of forecast models on this real-world dataset!

Some information regarding the dataset and my first effort:

Data for 2013-2016 can be downloaded from Engie. Since the raw dataset is quite large, readers should download it themselves.
A description of the data features can be downloaded here.
The static information, e.g., coordinates or rated power, of the wind farm can be downloaded here.
I’ve already started working on the first task, i.e., importing, cleaning and preprocessing the data. The notebook can be found here (it’s still a work in progress though!).

Stay tuned!