🏆 Final Results & Next Steps

Context for updates on this post: 🏆 Final Results & Next Steps - #11 by aicrowd_team

Dear Participants,

Thank you to every team that took part in the FlexTrack Challenge. The task was to build models that detect demand response events and estimate their impact by separating normal energy use from intentional shifts. This is a practical problem with real value for flexible power grids, and your contributions move us closer to scalable, sustainable energy systems.

This edition saw 6,458 submissions from 606 participants and 92 teams.

Private leaderboards for the Competition Phase and the Final Round are now live. See them here.

~~Final rankings are based on the average team rank across both phases.~~
Updated: Final rankings are based on the Private Leaderboard of the Final Round.

Here are the top 10 teams:

| Rank | Team & Participants | Normalized MAE | Normalized RMSE | Geometric Mean Score | F1 Score |
|------|---------------------|----------------|-----------------|----------------------|----------|
| :1st_place_medal: 1 | flex_king (@Bob575, @ErenYeager) | 0.698 | 1.104 | 0.651 | 0.624 |
| :2nd_place_medal: 2 | zch (@kzchhk) | 0.706 | 1.107 | 0.651 | 0.629 |
| :3rd_place_medal: 3 | WollongongOrBust (@jack_vandyke, @ryan_sharp) | 0.731 | 1.086 | 0.749 | 0.720 |
| :3rd_place_medal: 4 | ningjia (@ningjia) | 0.779 | 1.052 | 0.743 | 0.702 |
| :3rd_place_medal: 5 | DTU (@AREYA, @hbz, @Remok) | 0.889 | 1.233 | 0.637 | 0.581 |
| 6 | improvers (@improvers) | 0.890 | 1.175 | 0.578 | 0.523 |
| 7 | pluto (@pluto) | 0.915 | 1.229 | 0.711 | 0.629 |
| 8 | liberifatali (@liberifatali) | 0.940 | 1.285 | 0.563 | 0.487 |
| 9 | Phaedrus (@Phaedrus) | 0.989 | 1.359 | 0.819 | 0.457 |
| 10 | danglchris (@danglchris) | 0.991 | 1.223 | 0.618 | 0.532 |

Next steps
Winners will undergo due diligence. Submitted code must reproduce predictions consistent with each team’s best leaderboard submissions. We will contact the top teams directly for this review.

Share your work
We invite all teams to submit a short solution paper. Tell us what you tried, what worked, and what you learned so others can build on it. Please submit via the Google Form link. We will share selected papers with the community.

Thank you again to all participants and to our organisers at the University of Wollongong.

Best regards,
AIcrowd

First off, thanks to the organizers for curating this competition; we learned a lot and enjoyed the process. However, there are some points we would like to raise for consideration in future editions, to make the competition fairer.

  • Submission limits in Phase 2 felt uneven across time zones. Some teams, including ours, had 20 attempts while others had 30. In a phase where the score differences between teams were so small, each attempt mattered.

  • We received the final-round email only one day before the deadline. Of course we checked the website frequently, but communication could have been better.

  • The email said the final ranking would be the average rank across both phases, but the current final rankings do not appear to be average ranks. Are they based on NMAE, or only on rank?

Once again, thanks to the organizers and congrats to the winners!


It appears that the final rankings have undergone several revisions. In light of this, I would like to raise one final question for the competition organizers: why were classification metrics not included in the final ranking criteria?

The competition’s stated objectives are to:

  • Identify when demand response events were activated and for how long.
  • Quantify how much energy consumption increased or decreased during these events compared to normal conditions.

The first objective is inherently a classification problem, while the second is a regression problem. Given this dual focus, it seems logical that both types of metrics should contribute to the final evaluation.

In particular, while the F1 score is generally more informative than the Geometric Mean for assessing classification performance, a balanced combination of both could provide a more comprehensive and fair assessment. I suggest this blended approach be considered for both the competition phase and the final round.
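To make the suggestion concrete, below is a minimal sketch of one possible blend, assuming per-timestep binary event labels and an equal-weight average of F1 and the geometric mean of sensitivity and specificity. The weights and label conventions are my own assumptions for illustration, not anything defined by the organisers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def blended_classification_score(y_true, y_pred, w_f1=0.5):
    """Equal-weight blend of F1 and the geometric mean of sensitivity and
    specificity. Illustrative only: the weighting and label format are
    assumptions, not the official challenge metric."""
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = np.sqrt(sensitivity * specificity)
    return w_f1 * f1 + (1.0 - w_f1) * g_mean

# Toy example: 1 = demand response event active, 0 = normal operation
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
print(round(blended_classification_score(y_true, y_pred), 3))
```

Whether an arithmetic or harmonic mean of the two is preferable is open to discussion; the point is simply that both classification metrics can contribute to the final score.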

I have serious concerns about the decision to use the average rank across both phases as the final ranking.

This ranking approach conflicts with the earlier statement about Phase 2 and does not make sense.

Quote from [Phase 2 — We hear you! Here's how we're updating the plan]:

“Phase 2 builds directly on Phase 1 by expanding the test set rather than replacing it. Because the Phase 2 test set is a superset of the Phase 1 private test set (Site F), the final score will naturally include participant performance on both previously seen and newly added sites. Final ranking will be based on the equally weighted average across all sites within this combined test set, ensuring that prior progress continues to matter while rewarding models that generalise to new contexts.”

The Phase 2 test set includes all the data in the Phase 1 private test set. The final ranking was to be based on “this combined test set”, which is the Phase 2 set. If my understanding is correct, the Final Round score and rank should already include the evaluation results of Phase 1.

Even if the Phase 2 set does not contain the Phase 1 data set completely, a simple average of the Phase 1 and Phase 2 scores or ranks is not ideal, because the Phase 2 data set contains more sites and more data. Your suggested “equally weighted average across all sites” is fair: for instance, Phase 1 has only one private site (Site F) while Phase 2 has X private sites, so the weight of Phase 1 should be 1/(1+X).
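As a concrete illustration of the difference, here is a small sketch comparing the equally weighted per-site average with a naive average of the two phase scores. The site labels, the scores, and X = 3 are invented for the example; only the weighting logic matters.

```python
# Invented numbers for illustration: Phase 1 has one private site (Site F),
# Phase 2 adds X = 3 new private sites on top of it.
phase1_sites = {"F": 0.70}
phase2_extra_sites = {"G": 0.80, "H": 0.75, "I": 0.85}
all_sites = {**phase1_sites, **phase2_extra_sites}

# Equally weighted average across all sites: Site F gets weight 1 / (1 + X) = 0.25
site_weighted = sum(all_sites.values()) / len(all_sites)

# Naive average of the two phase scores: because the Phase 2 test set is a
# superset of Phase 1, Site F effectively gets weight 0.5 + 0.5 / (1 + X) = 0.625
phase1_score = sum(phase1_sites.values()) / len(phase1_sites)
phase2_score = sum(all_sites.values()) / len(all_sites)
naive_phase_average = 0.5 * (phase1_score + phase2_score)

print(f"equal per-site weighting: {site_weighted:.4f}")
print(f"naive phase averaging:    {naive_phase_average:.4f}")
```

The same imbalance applies if ranks rather than scores are averaged per phase.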

Also, as jack_vandyke pointed out, since this is both a classification and a regression problem and the other three metrics are already calculated, focusing solely on NMAE and ignoring the others will benefit unbalanced solutions that overfit NMAE.

Thanks to the organizers and congrats to the winners.


Thank you organizers, and congrats to the winners.

  • I agree with @ningjia - the statement that “Final rankings are based on the average team rank across both phases” not only contradicts previous communication, but it is also unfair that it was not communicated to us until after the competition ended.
  • I believe it is unfair that @ningjia was quietly removed from fourth place in the “final” rankings, as seen in the post’s revision history. These rankings are evidently not final and are subject to change without notice.

On a related note, the private leaderboard for the Competition Phase is not visible to the public. – Edit: it is now visible.

Thank you again.

I agree that changing the ranking methodology after the competition ended and quietly moving ningjia from 4th to 5th place is unfair. It contradicts the earlier communication that the final ranking would be based on the combined Phase 2 test set.

Rather than modifying ranking rules post-competition, I suggest expanding 3rd Place from 2 winners to 3 winners to fairly recognize all deserving teams. This would be more transparent and equitable than retroactive rule changes.

The competition should maintain the integrity established during the actual competition period.

2 Likes

Dear Participants,

We took your feedback to the FlexTrack organizing team and carefully reviewed all previous communications regarding the determination of the final winners for the challenge.

We acknowledge that the plan to take the mean of ranks across the Competition Phase and the Final Round, although intended internally, was not clearly communicated.

In light of this, we collectively agreed that the fairest approach is to treat the Final Round Private Leaderboard as the sole criterion for determining the final rankings of the competition.

The top three positions remain consistent across both methodologies. However, Team @DTU and Team @ningjia swap ranks depending on which approach is used. We want to commend the outstanding performance of both teams. To ensure fairness and recognize their efforts, we will extend an additional AUD $1,000 prize so that both teams are officially acknowledged as winners of the competition.

We have updated the original announcement post to reflect this change.
We sincerely appreciate your constructive feedback and commend the swift action taken by the FlexTrack organizing team to resolve the matter transparently and fairly.

Warm regards,
The FlexTrack 2025 Organizing Team


Thank you for the update. I’m curious to see which models are causal.