NeurIPS 2023 CityLearn - Forecasting Track Challenge Solutions

Hi everyone! Now that Phase II is complete, I will detail my solution. This will be a bit long, but I felt it could benefit other people, so enjoy the read!

First of all, thanks to the organizers for this competition. I was able to deepen my knowledge of time series forecasting and really take the time to explore many of the techniques involved.

Don’t hesitate to post your solution as well. I am really looking forward to learning how people were able to predict noise, I mean EEP and DHW :smiley: , and how everyone dealt with the inference time.

My approach - The automated path

For this competition I wanted to see how far I could go with an “automated workflow”, i.e. automating experiments that try many models and techniques.

I started by reading a few recent papers on time series forecasting to build a list of ideas to try (see the full list below).

Validation scheme
I saw many validation schemes (TimeSeries split, bootstrapping, hv-blocked CV, purged CV, …) and ended up choosing blocked CV, so that every part of the data is used for both training and testing across the folds.
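
To make the idea concrete, here is a minimal sketch of a blocked split, assuming the plain variant in which the series is cut into contiguous blocks and each block is the test fold exactly once. The function name `blocked_cv_splits` is made up for illustration, and no gap is left around the test block (which is what hv-blocked CV would add).

```python
import numpy as np

def blocked_cv_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each test fold is one contiguous block."""
    indices = np.arange(n_samples)
    # np.array_split keeps the blocks contiguous and handles uneven sizes.
    blocks = np.array_split(indices, n_folds)
    for i, test_idx in enumerate(blocks):
        # Everything outside the current block is used for training.
        train_idx = np.concatenate([b for j, b in enumerate(blocks) if j != i])
        yield train_idx, test_idx

splits = list(blocked_cv_splits(n_samples=10, n_folds=3))
```

Sklearn's `KFold` without shuffling produces the same kind of contiguous folds, so it can be used instead of a hand-written splitter.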

Forecasting strategy
Again, there were many possibilities to choose from (Direct, Recursive, Horizon, Multioutput, DirRec, Average of Direct and Recursive). Direct forecasting looked promising in early results and would give me a lot of flexibility later, so I chose this simple strategy (1 model per timestep, so 48 for us).
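
As an illustration of the direct strategy, the sketch below fits one independent model per forecast step on lagged values. It is a simplified, assumed setup: the helper names are hypothetical, the only features are raw lags, and `LinearRegression` is just a placeholder for whatever model is chosen per step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_direct_dataset(series, n_lags, step):
    """X holds the last n_lags values, y the value `step` steps after them."""
    X, y = [], []
    for t in range(n_lags, len(series) - step + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + step - 1])
    return np.array(X), np.array(y)

def fit_direct_models(series, n_lags, horizon):
    """Fit one model per forecast step (48 models for a 48-step horizon)."""
    models = []
    for step in range(1, horizon + 1):
        X, y = make_direct_dataset(series, n_lags, step)
        models.append(LinearRegression().fit(X, y))
    return models

def predict_direct(models, last_values):
    """Each model predicts its own step from the same most recent lags."""
    x = np.asarray(last_values, dtype=float).reshape(1, -1)
    return np.array([m.predict(x)[0] for m in models])

# Toy usage on a ramp series.
series = np.arange(200.0)
models = fit_direct_models(series, n_lags=3, horizon=48)
forecast = predict_direct(models, series[-3:])
```

Because the models are independent, each step can later get its own model type, hyperparameters and feature set.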

Pre/post processing/data augmentation
From STL/VMD decomposition to ensembling, stacking and range blending, I found many techniques that could potentially be used to improve the forecasting performance.

Baseline model

Generating predictions from the aggregated hourly values was already decent (especially for the solar generation TS), so I decided to use it as the baseline model.
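
A minimal sketch of such a baseline, assuming that “aggregated hourly values” means the historical mean for each hour of the day. That reading, the function names and the 0..23 hour convention are assumptions made for illustration.

```python
import numpy as np

def hourly_profile(values, hours):
    """Mean of `values` for each hour of day 0..23 (every hour must occur)."""
    values = np.asarray(values, dtype=float)
    hours = np.asarray(hours)
    return np.array([values[hours == h].mean() for h in range(24)])

def baseline_forecast(profile, current_hour, horizon=48):
    """Repeat the hourly profile for the next `horizon` hours."""
    future_hours = (current_hour + 1 + np.arange(horizon)) % 24
    return profile[future_hours]

# Toy usage: three days of data where the value is twice the hour.
hours = np.tile(np.arange(24), 3)
profile = hourly_profile(hours * 2.0, hours)
next_steps = baseline_forecast(profile, current_hour=22, horizon=4)
```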

Now, using a direct forecasting methodology means that I can specify my optimal model, parameters and feature set for each {variable, timestep} pair. The objective was to automate that optimization.

Feature generation

I used

  • windowed features (min/max/mean)
  • mathematical operations on the variables (log and squaring)
  • spike features on DHW and EEP (time since last spike of value greater than ε, size of last spike of value greater than ε)
  • an interpolated feature for the variables for which we had forecasts at multiple horizons (6h, 12h, …)

The spiky variables proved difficult to predict. I tried clipping and spike extraction with kernel smoothing (see [19] if interested), but it did not significantly improve the results for me…
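
The two spike features listed above can be sketched as a single pass over the series. This is an assumed formulation: a spike is any value above ε, the current step counts as already observed, and `-1` / `0.0` are placeholder values used before the first spike.

```python
def spike_features(values, eps):
    """Return (time_since_last_spike, size_of_last_spike), one entry per step."""
    time_since, last_size = [], []
    steps, size = -1, 0.0  # -1 means no spike has been seen yet (assumption)
    for v in values:
        if v > eps:
            steps, size = 0, v   # a new spike resets the counter
        elif steps >= 0:
            steps += 1           # count steps since the last spike
        time_since.append(steps)
        last_size.append(size)
    return time_since, last_size

time_since, last_size = spike_features([0, 5, 0, 0, 3, 0], eps=1)
```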

Feature selection

I wanted a robust selection method that would use cross-validation and determine the optimal set based on the results over the folds. I ended up using the following agreement-based approach:
For each fold:

  • perform a forward feature selection and a backward feature selection
  • get the top_k sets from each method.

Then add all the top_k sets to a list (of size 2 * k_fold * top_k) and count the frequency of each unique set in the list (the idea is that a robust optimal set should be among the top-ranking sets over multiple folds):

  • If a set has an agreement (frequency) > min_agreement, return this set.
  • Otherwise, return the best-performing set (over the folds) of minimum length (to follow Occam’s razor).

As this approach is computationally expensive (it has to be run 48 times, once per timestep), I used a simple linear regression as the predictor (Sklearn’s LinearRegression relies on compiled LAPACK routines, which are written in Fortran, and is extremely fast).


I tried playing with NIXTLA/Statsforecast/Darts to test a bunch of statistical and deep learning models, but the interfaces were a pain to work with and early results were disappointing, so I focused on tabular ML models (and on solving my many inference issues).

I tried most of the models in Sklearn + XGB/LightGBM, performing HPO each time with Optuna.
The ExtraTrees model ended up being surprisingly competitive, often matching or even surpassing the performance of LightGBM. I ended up using it for all variables except the carbon intensity, where a simple Ridge gave similar levels of performance.


While direct forecasting gave me a lot of flexibility, it was computationally expensive. Combined with the streaming forecasting evaluation and the need for many features, my first local evaluation ran for … 1h20 :cry:, and it took me nearly a week to significantly reduce the run time and to test and debug the evaluation.

  • Multiprocessing messed up the inference pods somehow so I dropped it (it initially gave me a nice boost)
  • My buffer consisted of simple lists (not a bottleneck). I dropped DataFrames, as they were extremely slow compared to numpy arrays; using the math library on plain Python lists was faster still (I only needed one new value per feature at each timestep). However, dropping DataFrames added complexity: I had to make sure the features given to the model were in the correct order despite having no names…
  • Batch inference for the building variables reduced the inference time as well

After implementing all of these optimizations, my run time fell to around 5 minutes.
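
To illustrate the buffer idea from the list above, here is a minimal sketch of a streaming window that computes one new value per feature at each timestep and pins the feature order in a single place. The class, the feature names and the use of `collections.deque` (instead of a plain list) are assumptions made for illustration.

```python
from collections import deque

# Single source of truth for the column order expected by the model (hypothetical names).
FEATURE_ORDER = ("last", "win_min", "win_max", "win_mean")

class StreamingWindow:
    """Keeps the last `size` observations and emits one feature row per update."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # old values are dropped automatically

    def update(self, value):
        self.values.append(value)
        window = self.values
        feats = {
            "last": value,
            "win_min": min(window),
            "win_max": max(window),
            "win_mean": sum(window) / len(window),
        }
        # Emit a plain list in the pinned order, so no column names are needed downstream.
        return [feats[name] for name in FEATURE_ORDER]

window = StreamingWindow(size=3)
rows = [window.update(v) for v in (1, 2, 3, 4)]
```

Building the row from a dict and then ordering it through `FEATURE_ORDER` costs little and makes it harder to feed features to the model in the wrong order.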

Things I would have liked to try with more time

  • Stacking models (early ensembling/stacking experiments did not look promising)
  • Range blending and other data augmentation techniques (for example looking at the original dataset from US Building Stock)
  • Preprocessing for the variables (e.g. Variational Mode Decomposition, which seems to be used in wind forecasting)
  • Investigating more why I was getting poor results from statistical and deep learning models
  • Trying the SETAR-Tree and SETAR-Forest models (only available in R though…)

Suggestions for the organizers

  • Please make access to the variables easier and normalize the naming, if possible:
    np.array(observations)[0][np.where(np.array(self.observation_names)[0] == 'carbon_intensity')[0][0]] is, in my honest opinion, a horrible way to access the data.
  • If possible, batch Evaluation for the Forecast track
  • Why do you still use Python 3.7?
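
For the variable lookup quoted in the first suggestion, one possible workaround is to build a name-to-index mapping once and reuse it at every step. This is only a sketch: the nested-list layout mirrors the `[0]` indexing in the expression above, and the names and values below are made up.

```python
def build_index(observation_names):
    """Map each observation name to its position, computed once up front."""
    return {name: i for i, name in enumerate(observation_names)}

# Made-up example data with one entry per building.
observation_names = [["hour", "carbon_intensity", "solar_generation"]]
observations = [[13, 0.42, 1.7]]

index = build_index(observation_names[0])
carbon_intensity = observations[0][index["carbon_intensity"]]
```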

Thanks for reading!


[1] Y. In and J.-Y. Jung, ‘Simple averaging of direct and recursive forecasts via partial pooling using machine learning’, International Journal of Forecasting, vol. 38, no. 4, pp. 1386–1399, Oct. 2022, doi: 10.1016/j.ijforecast.2021.11.007.
[2] S. Ben Taieb, A. Sorjamaa, and G. Bontempi, ‘Multiple-output modeling for multi-step-ahead time series forecasting’, Neurocomputing, vol. 73, no. 10, pp. 1950–1957, Jun. 2010, doi: 10.1016/j.neucom.2009.11.030.
[3] G. Bontempi, S. Ben Taieb, and Y.-A. Le Borgne, ‘Machine Learning Strategies for Time Series Forecasting’, in Business Intelligence, M.-A. Aufaure and E. Zimányi, Eds., Lecture Notes in Business Information Processing, vol. 138. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 62–77. doi: 10.1007/978-3-642-36318-4_3.
[4] S. Aras and İ. D. Kocakoç, ‘A new model selection strategy in time series forecasting with artificial neural networks: IHTS’, Neurocomputing, vol. 174, pp. 974–987, Jan. 2016, doi: 10.1016/j.neucom.2015.10.036.
[5] V. Cerqueira, L. Torgo, and I. Mozetič, ‘Evaluating time series forecasting models: an empirical study on performance estimation methods’, Mach Learn, vol. 109, no. 11, pp. 1997–2028, Nov. 2020, doi: 10.1007/s10994-020-05910-7.
[6] A. Gasparin, S. Lukovic, and C. Alippi, ‘Deep learning for time series forecasting: The electric load case’, CAAI Transactions on Intelligence Technology, vol. 7, no. 1, pp. 1–25, 2022, doi: 10.1049/cit2.12060.
[7] J. Duan, P. Wang, W. Ma, S. Fang, and Z. Hou, ‘A novel hybrid model based on nonlinear weighted combination for short-term wind power forecasting’, International Journal of Electrical Power & Energy Systems, vol. 134, p. 107452, Jan. 2022, doi: 10.1016/j.ijepes.2021.107452.
[8] D. Li, F. Jiang, M. Chen, and T. Qian, ‘Multi-step-ahead wind speed forecasting based on a hybrid decomposition method and temporal convolutional networks’, Energy, vol. 238, p. 121981, Jan. 2022, doi: 10.1016/
[9] S. F. Stefenon et al., ‘Time series forecasting using ensemble learning methods for emergency prevention in hydroelectric power plants with dam’, Electric Power Systems Research, vol. 202, p. 107584, Jan. 2022, doi: 10.1016/j.epsr.2021.107584.
[10] W. Yang, S. Sun, Y. Hao, and S. Wang, ‘A novel machine learning-based electricity price forecasting model based on optimal model selection strategy’, Energy, vol. 238, p. 121989, Jan. 2022, doi: 10.1016/
[11] M. H. D. M. Ribeiro, R. G. da Silva, S. R. Moreno, V. C. Mariani, and L. dos S. Coelho, ‘Efficient bootstrap stacking ensemble learning model applied to wind power generation forecasting’, International Journal of Electrical Power & Energy Systems, vol. 136, p. 107712, Mar. 2022, doi: 10.1016/j.ijepes.2021.107712.
[12] C. Lu, S. Li, and Z. Lu, ‘Building energy prediction using artificial neural networks: A literature survey’, Energy and Buildings, vol. 262, p. 111718, May 2022, doi: 10.1016/j.enbuild.2021.111718.
[13] F. Martínez, F. Charte, M. P. Frías, and A. M. Martínez-Rodríguez, ‘Strategies for time series forecasting with generalized regression neural networks’, Neurocomputing, vol. 491, pp. 509–521, Jun. 2022, doi: 10.1016/j.neucom.2021.12.028.
[14] C. S. Bojer, ‘Understanding machine learning-based forecasting methods: A decomposition framework and research opportunities’, International Journal of Forecasting, vol. 38, no. 4, pp. 1555–1561, Oct. 2022, doi: 10.1016/j.ijforecast.2021.11.003.
[15] M. Anderer and F. Li, ‘Hierarchical forecasting with a top-down alignment of independent-level forecasts’, International Journal of Forecasting, vol. 38, no. 4, pp. 1405–1414, Oct. 2022, doi: 10.1016/j.ijforecast.2021.12.015.
[16] K. Bandara, H. Hewamalage, R. Godahewa, and P. Gamakumara, ‘A fast and scalable ensemble of global models with long memory and data partitioning for the M5 forecasting competition’, International Journal of Forecasting, vol. 38, no. 4, pp. 1400–1404, Oct. 2022, doi: 10.1016/j.ijforecast.2021.11.004.
[17] A. D. Lainder and R. D. Wolfinger, ‘Forecasting with gradient boosted trees: augmentation, tuning, and cross-validation strategies: Winning solution to the M5 Uncertainty competition’, International Journal of Forecasting, vol. 38, no. 4, pp. 1426–1433, Oct. 2022, doi: 10.1016/j.ijforecast.2021.12.003.
[18] H. Hewamalage, K. Ackermann, and C. Bergmeir, ‘Forecast evaluation for data scientists: common pitfalls and best practices’, Data Min Knowl Disc, vol. 37, no. 2, pp. 788–832, Mar. 2023, doi: 10.1007/s10618-022-00894-5.
[19] P. Bacher, P. A. de Saint-Aubain, L. E. Christiansen, and H. Madsen, ‘Non-parametric method for separating domestic hot water heating spikes and space heating’, Energy and Buildings, vol. 130, pp. 107–112, Oct. 2016, doi: 10.1016/j.enbuild.2016.08.037.