Since there is currently no stochasticity in the environment, we are reducing the number of episodes for the online evaluation to 1. If your agent has any randomness, this will increase the variance of your scores, so we suggest seeding your agents so that your score is consistent.
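For example, a minimal seeding sketch (the agent class and its methods below are purely illustrative, not the required submission interface):

```python
import random

import numpy as np


def seed_everything(seed: int = 0) -> None:
    """Seed the common sources of randomness so repeated evaluation runs score the same."""
    random.seed(seed)
    np.random.seed(seed)
    # If your agent uses torch, seed it as well:
    # import torch; torch.manual_seed(seed)


class MyAgent:
    """Hypothetical agent used only to illustrate seeding."""

    def __init__(self, seed: int = 0):
        seed_everything(seed)
        # A local generator is safer than global state if other code also draws random numbers.
        self.rng = np.random.default_rng(seed)

    def compute_action(self, observation):
        # Any exploration noise now comes from a seeded generator, so scores are reproducible.
        noise = self.rng.normal(0.0, 0.05)
        return [noise for _ in observation]
```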
The timeout for agents has been updated to 30 minutes: your agent has to complete 1 episode within 30 minutes.
Also, as a participant pointed out, it is possible to use information from the first episode to overfit a solution to the subsequent episodes. This is not in the spirit of the competition, and we request that you set up your agents so that they can run episodes independently.
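One way to structure this is sketched below; the hook names are assumptions for illustration, not the evaluator's actual interface:

```python
class IndependentEpisodeAgent:
    """Sketch of an agent that carries nothing over between episodes."""

    def __init__(self):
        self._episode_memory = None

    def episode_reset(self, observation):
        # Hypothetical per-episode hook: discard anything recorded during the
        # previous episode so each episode is solved from scratch, rather than
        # replaying a memorised first run.
        self._episode_memory = []
        return self.compute_action(observation)

    def compute_action(self, observation):
        # Decide actions using only data observed within the current episode.
        self._episode_memory.append(observation)
        return [0.0 for _ in observation]
```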
Starting from Phase 3, the parallel run setup will change, but the number of episodes will remain 1.
If I may, I’d like to add some proposals for the evaluation before Phase 3 starts:
I see the competition problem in 2 parts:
how to know the future
how to act once you have somehow figured out the future
In my opinion, the ideal test would be an unseen set of buildings at an unseen time in the future. Relating this to our competition: if we were designing it from the beginning with only 1 year of data from 17 buildings, it would be great to, for example, give the first 8 months and 10 buildings for training / public scores, keep the next 4 months and 7 buildings for private scores, and not show the scores/logs for the private part until the end. Of course, only one episode of evaluation should be run.
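A rough sketch of the split I have in mind (the file name and the `building_id` / `timestamp` columns are just placeholders, not the actual dataset schema):

```python
import pandas as pd

# Hypothetical long-format dataset: one row per building per timestep.
df = pd.read_csv("building_data.csv", parse_dates=["timestamp"])

buildings = sorted(df["building_id"].unique())
public_buildings = buildings[:10]                         # 10 buildings for training / public scores
cutoff = df["timestamp"].min() + pd.DateOffset(months=8)  # first 8 months are public

public = df[df["building_id"].isin(public_buildings) & (df["timestamp"] < cutoff)]
private = df[~df["building_id"].isin(public_buildings) & (df["timestamp"] >= cutoff)]
```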
However, since all the future data has already been shown, I think the best design for the 3rd stage is to:
run it for only the first episode: this would eliminate the problem of overfitting/memorising data, and you would not even need to check who is doing what. Besides, in reality you never get to live through the same period twice, so why do it in the evaluation?
do not show scores for the additional 7 buildings and compute the final score only on them (as a separate 7-building community): this would eliminate manual or other solutions that overfit on the training data or on data from the Phase 2 logs. Blocking logs as described in the previous post is also mandatory.
These are only my ideas; I would be glad to see the organisers’ and participants’ responses / proposals.
Indeed, we do have a private leaderboard planned for Phase 3.
Originally, we had planned to completely hide the scores of the 7 buildings that will be added in Phase 3. However, I also like your idea of hiding the last 4 months; we’ll discuss this internally.
@tymur_prorochenko Great suggestions! That said, I’m not sure I agree about splitting the evaluation by time (8 months training/public, 4 months private). In that case, the final objective is to generalize after a warm-up period. However, in reality, the cold start problem is a real concern.
@mt1 sadly it does not make much sense now since we already know what happens during the year. Hiding future months would only be worth discussing if the orgs have more data somewhere))
@tymur_prorochenko Thanks for the quick reply. I think I might be missing something. I see how there is nothing we can do about the training data because it was already provided, but if I’m not mistaken, the evaluation framework prevents us from accessing future data in the validation and test sets. If we only run one episode, then our agents will never know the future data ahead of time.