Phase 2 — We hear you! Here’s how we’re updating the plan

First, thank you to everyone who took the time to comment on our Important Update and to those who shared constructive criticism in the forum. We heard the concerns around the late introduction of Phase 2, the proposed “single‑prediction per series” constraint, and the perceived shift from the original “back‑cast” framing. Our goal remains the same: a fair, practical challenge that yields models useful in the real world.

Why Phase 2 exists (and how we’re aligning it with the original intent)

The challenge was always about detecting and measuring flexibility so that it can be used and verified in practice. That includes two complementary real‑world needs—both of which your models can serve:

  • Use case 1 — Real‑time operations (causal): Run a model “now” to estimate the demand response flag and capacity at the current moment to support short‑term decisions (e.g., when and how much flexibility to bid). This is directly useful to aggregators and VPP operators making live control and market decisions.

  • Use case 2 — Auditing and settlement (non‑causal): Back‑cast to verify how much flexibility was actually delivered after the fact. This enables transparent payment flows and helps assess device/control behaviour.

The feedback we received made clear that our Oct 7 note unintentionally narrowed the design toward only Use case 1 by enforcing a single prediction per series to avoid look‑ahead. We agree that this would have ruled out legitimate auditing workflows and diverged from the “back‑cast from historic time‑series” description on the challenge page. We’re correcting course.


What’s changing (effective for Phase 2)

1) Evaluation granularity: per‑row predictions (no “single prediction per series” requirement).

Phase 2 submissions will be evaluated on predictions for each row of the Phase 2 dataset. This directly supports both causal (real‑time) and non‑causal (back‑cast) approaches. If you built features with lookbacks or other context in Phase 1, they remain valid—no refactor to a special “window slice” format is required.

2) Both causal and non‑causal models are eligible for prizes.

  • You may submit either type.

  • In your solution document, please self‑declare whether the approach is causal (uses only inputs available at time t₀) or non‑causal (may use future context).

  • We encourage performant causal models because they unlock the operational use case described above, but non‑causal models are equally eligible for prizes.
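To make the causal/non-causal distinction concrete, here is a minimal sketch using pandas. The column names and values are illustrative only, not the official challenge schema; the point is that a causal feature uses only data available at prediction time t₀, while a look-ahead feature reaches into t > t₀.

```python
import pandas as pd

# Hypothetical 15-minute series; "power_kw" is an illustrative column name,
# not part of the official dataset schema.
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2025-01-01", periods=6, freq="15min"),
        "power_kw": [10.0, 12.0, 11.0, 9.0, 8.0, 10.0],
    }
)

# Causal feature: the value one step in the past. This is available at t0,
# so a model using it can self-declare as causal.
df["power_lag_1"] = df["power_kw"].shift(1)

# Non-causal (look-ahead) feature: the value one step in the future (t > t0).
# Legitimate for back-cast / audit-style models, which must self-declare
# as non-causal.
df["power_lead_1"] = df["power_kw"].shift(-1)
```

Note that both features leave a NaN at one edge of the series, which is one practical reason per-row evaluation with a consistent schema keeps Phase 1 pipelines reusable.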

3) Dataset format stays consistent and tests generalisation across buildings.

Phase 2 will use the same schema you’ve seen so far (timestamps at 15‑minute resolution, with the same core columns). The Phase 2 test set will be a superset of the Phase 1 private test set (Site F), with additional new unseen sites added to assess generalisation. This ensures that participants’ progress on Site F continues to count directly in Phase 2 scoring.

4) Scoring and final ranking remain balanced across phases.

Phase 2 builds directly on Phase 1 by expanding the test set rather than replacing it. Because the Phase 2 test set is a superset of the Phase 1 private test set (Site F), the final score will naturally include participant performance on both previously seen and newly added sites. Final ranking will be based on the equally weighted average across all sites within this combined test set, ensuring that prior progress continues to matter while rewarding models that generalise to new contexts.
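For clarity, the equally weighted average described above can be sketched as follows. The site names and per-site scores here are made up for illustration; the official scorer will be published with the Phase 2 release.

```python
# Hypothetical per-site error scores (lower is better for an error metric).
# "Site F" is the Phase 1 private test site; the others stand in for the
# new unseen sites added in Phase 2.
site_scores = {"Site F": 0.12, "Site G": 0.18, "Site H": 0.15}

# Final ranking score: the equally weighted mean across all sites in the
# combined Phase 2 test set, so no single site dominates the ranking.
final_score = sum(site_scores.values()) / len(site_scores)
```

Because each site contributes equally, strong performance on Site F carries over directly, while a model that fails to generalise to the new sites is penalised proportionally.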

During the Phase 2 window (between 20 and 22 October 2025, 23:59 UTC), the leaderboard will show interim ranks using a subset of the test data for transparency and momentum. Once Phase 2 concludes, the leaderboard will update with full‑set scores to determine final rankings.

5) Transparency: scorer & starter assets.

As before, we’ll provide the scoring script and a minimal code sample alongside the Phase 2 release so you can locally reproduce leaderboard scores.


What stays the same

  • Intent: Models that are generalised and context‑aware across buildings and climates.

  • Inputs & resolution: Same variables and 15‑minute cadence as documented.

  • Phase 1 timelines and private‑set integrity: Your Phase 1 work continues to count toward the final ranking as outlined above.


Quick FAQ

Q: Do I need to rebuild my pipeline for Phase 2?

A: No. With per‑row evaluation and a consistent data schema, models and feature pipelines developed for the current phase should apply with minimal changes.

Q: Are non‑causal (back‑cast) models “allowed” now?

A: Yes. Both causal and non‑causal approaches are eligible. Please self‑declare the type in your solution documentation. This harmonises the real‑time and audit/settlement use cases and resolves earlier ambiguity around the term “back‑cast.”

Q: Will there be a Phase 2 public leaderboard?

A: We know this is important, and yes! There will be a public leaderboard so you can see where your team stands as the Phase 2 window progresses.


A note on timing & communication

Several of you expressed concern about introducing a major change close to the deadline. That feedback is fair. By reverting to a per‑row evaluation that matches the current data format—and by supporting both causal and non‑causal models—we aim to preserve your existing work while still delivering a phase that measures generalisation across sites and supports real‑world outcomes. Thank you for holding us to a high bar on clarity and fairness.


Next steps

  • We will publish the updated Phase 2 details (dataset, scorer, and starter assets) with the Phase 2 launch, and we’ll keep all discussion in one pinned thread to make it easy to follow.

On behalf of the organising team—thank you again for the candid feedback, the thoughtful posts, and the time you’re investing in FlexTrack. We’re excited to see what you build next.


Thanks for the clarifications.

Are we allowed to use future context (i.e., context from time t > t₀ to make predictions at time t₀), as per “In your solution document, please self‑declare whether the approach is causal (uses only inputs available at time t₀) or non‑causal (may use future context)”? Basically, are look-ahead features allowed?

I thought the organizers said we are NOT allowed to use future context at any point in time. If that is not the case, it’s almost a given that all high-performing models will use it. Please clarify.


Thanks for the clarifications.

Will prizes for both approaches be included? I agree that models using look-ahead features will probably achieve a lower NMAE.

@Phaedrus: Yes — the use of future context is allowed.
Our earlier announcement aimed to focus the competition on causal models (which do not rely on future information). However, based on community feedback and considering the limited time remaining before the competition ends, we’ve decided to lift that restriction. Both causal and non-causal models will be supported going forward.

@igorkf: The final prizes will be awarded based on the Phase 2 leaderboard, regardless of whether a submission is causal or non-causal.


Thanks for answering. Are we going to use NMAE as the metric again in Phase 2? If so, it seems that people using look-ahead features (i.e., non-causal models) will probably achieve a smaller error.
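For reference, one common NMAE convention can be sketched as below. This normalizes mean absolute error by the mean magnitude of the actuals; the challenge’s official scorer may normalize differently (e.g. by range or installed capacity), so treat this as an assumption until the Phase 2 scoring script is published.

```python
import numpy as np

def nmae(y_true, y_pred):
    """Mean absolute error normalized by the mean absolute value of the
    actuals. One common NMAE convention; the official challenge scorer
    may use a different normalization."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred)) / np.mean(np.abs(y_true))
```

A perfect back-cast gives NMAE = 0, and because non-causal models see future context, they can generally drive this error lower than causal models on the same data.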