💬 Feedback & Suggestions

aicrowd_team · May 19, 2026, 6:52pm

We are constantly trying to improve this challenge for you and would appreciate any feedback you might have!

Please reply to this thread with your suggestions and feedback on making the challenge better for you!

What have been your major pain points so far?
What would you like to see improved?

All The Best!

snehananavati · May 19, 2026, 8:21pm

snehananavati · June 2, 2026, 9:31pm

jamespayor · June 4, 2026, 5:06pm

Is the smoke test flop budget lower than the full submission budget? I’m seeing submissions run out of budget in the smoke test even though they are under the cap locally, so I suspect it may be half of what it should be.

mohanty · June 5, 2026, 1:03am

Good catch, @jamespayor - confirmed and fixed. The smoke test was running under a stale hardcoded budget (1e10) instead of the real grading budget (6.8e10), so estimators spending anywhere between the two were getting wrongly rejected. The fix is live now (smoke uses the exact same per-MLP budget as grading), and we have re-graded the submissions that were affected, so they should show up scored. Thanks for flagging it!
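To make the failure mode concrete, here is a minimal sketch of which submissions were caught by the stale budget. The two budget values come from the post above; the function names and the use of `<=` for the cap are assumptions for illustration, not the grader's actual code.

```python
# Values quoted in the thread.
STALE_SMOKE_BUDGET = 1e10   # hardcoded budget the smoke test used before the fix
GRADING_BUDGET = 6.8e10     # real per-MLP grading budget


def within_budget(flops_used: float, budget: float) -> bool:
    """Hypothetical cap check; the real grader may compare differently."""
    return flops_used <= budget


def wrongly_rejected(flops_used: float) -> bool:
    """True if an estimator failed the stale smoke cap but fits the real budget."""
    return (not within_budget(flops_used, STALE_SMOKE_BUDGET)
            and within_budget(flops_used, GRADING_BUDGET))
```

Under these assumptions, an estimator spending 3e10 flops per MLP would have failed the old smoke test while being valid for grading.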

Affected submissions that were re-evaluated: 309660, 309713, 309714, 309723, 309724, 309738, 309742, 309744, 309745.

jamespayor · June 5, 2026, 1:28am

Nice, thank you for the fix and rerunning!

jamespayor · June 7, 2026, 5:46am

Returning here, it now seems that submissions (mine at least) are getting a much higher walltime penalty than before? Here’s the same code submitted yesterday (AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions #309886) vs today (AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions #309977), same flops and walltimes but much increased penalty.

If this is an intended change I’d appreciate clarification. I could be wrong, but I didn’t think my code was spending much time on anything other than dispatching meaty flopscope ops, yet I’m now paying a large effective compute penalty.

jamespayor · June 7, 2026, 11:31pm

Hm also fyi on the submissions page the “final layer MSE” column seems to be populated with all-layers MSE scores: AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions

miruu · June 8, 2026, 1:06am

It seems there has been a rejudge since then, but the current leaderboard is still using pre-rejudge numbers.

omw · June 14, 2026, 6:24pm

Would it be possible to publish the exact flopscope distribution that the grader uses?

I’ve hit some rejections due to the grader’s runtime environment differing from my local env. The starter kit’s pyproject constraint is flopscope>=0.5.0; this resolves to 0.5.0 in the lockfile, but the installed wheel is 0.5.0+np2.2.6, which apparently has a richer API than whatever the grader’s build uses, leading to grader errors like AttributeError: module 'flopscope' has no attribute 'as_symmetric'.
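A small local check can surface this kind of mismatch before submitting. The sketch below uses only the standard library; the helper names are made up for illustration, and which attributes to check is up to the estimator author. The `+np2.2.6` suffix is a local version label, which is why a lockfile entry of 0.5.0 and an installed 0.5.0+np2.2.6 can refer to different builds.

```python
from importlib import metadata
from typing import Iterable, List, Optional, Tuple


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def split_local_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '0.5.0+np2.2.6' into its public part and its local label."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def missing_attributes(module: object, names: Iterable[str]) -> List[str]:
    """List the names that the given module does not expose."""
    return [name for name in names if not hasattr(module, name)]
```

For example, after `import flopscope`, calling `missing_attributes(flopscope, ["as_symmetric"])` locally shows whether the local build offers that attribute. It cannot tell you what the grader's build offers, so the versions shown for the evaluator remain the reference.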

(I also noticed that plain numpy isn’t importable in the grader env—is that by design?)

mohanty · June 19, 2026, 4:14am

@omw : The evaluators, as we launch Phase 1, currently use: flopscope v0.8.0rc1 and whestbench v0.12.0rc0.

This might change, as we address the feedback we receive over the next week - (more details here)

For every submission, you can also check out the exact versions that the evaluator uses, in the Configuration tab of the Submission Details page.

Quoting omw: (I also noticed that plain numpy isn’t importable in the grader env - is that by design?)

Yes, that is a challenge design choice for whestbench.
We might revisit that decision if we come across strong use cases where using numpy (which counts towards the heavily penalized residual time) is necessary. So we are open to feedback.

Best,

mohanty · June 19, 2026, 4:21am

Regarding the increased walltime penalty reported above: this occurred during an internal update to Flopscope aimed at more accurately distributing time across our three time-accounting buckets. Because the change affected every submission’s score, we re-evaluated all submissions received up to that point.

Quoting jamespayor: Hm also fyi on the submissions page the “final layer MSE” column seems to be populated with all-layers MSE scores: AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions

This is fixed now! Thanks!

Best,

jamespayor · June 19, 2026, 7:47pm

Hi again, running into a grading issue today, see AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions #311067
I tried throwing up some test submissions (to check in on new timing calculations), they’re all running on just 5/100 nets though with the other 95 reported as failures. Idk what’s going wrong exactly but sounds like each of the 5 graders is managing 1 net then the rest are failing? Flagging for review if you’re available!

skelterhelter · June 19, 2026, 10:13pm

I have the same issue as above. AIcrowd | ARC White-Box Estimation Challenge 2026 | Submissions #311088

jamespayor · June 20, 2026, 8:55am

Also, it might be that this is specific to estimators making use of setup(); I’m seeing successes with no setup(), this might just be sporadic though.

andrei_bulzan · June 20, 2026, 9:09am

Just managed to get around the same hurdle, though with only 1 successful submission as evidence so far. It looks like setup() itself is okay, but storing fnp/flopscope arrays created in setup() on self can cause later MLPs on the same worker to fail. Changing setup() to store only plain Python tuples/floats, then materializing fnp.array(…) inside predict(), seems to have fixed the failures.
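The workaround described above could look roughly like this. It is a sketch under assumptions: the class name, the setup()/predict() signatures, and the injected make_array callable are all hypothetical. make_array stands in for fnp.array so the example runs without flopscope installed.

```python
from typing import Any, Callable, Sequence, Tuple


class SketchEstimator:
    """Hypothetical estimator; the real setup()/predict() signatures may differ."""

    def __init__(self, make_array: Callable[[Sequence[float]], Any]):
        # make_array is a stand-in for fnp.array (an assumption for this sketch).
        self._make_array = make_array
        self._weights: Tuple[float, ...] = ()
        self._bias: float = 0.0

    def setup(self, weights: Sequence[float], bias: float) -> None:
        # Keep only plain Python tuples/floats on self, never array objects.
        self._weights = tuple(float(w) for w in weights)
        self._bias = float(bias)

    def predict(self) -> Any:
        # Materialize a fresh array on every call instead of reusing one
        # that was created in setup().
        return self._make_array(self._weights)
```

In a real submission one would pass fnp.array as make_array. The point is only that no array object outlives a single predict() call.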

jamespayor · June 23, 2026, 6:10am

This looks like it’s been fixed fwiw! I observe that my submissions are now back to freely using flopscope in setup and saving arrays without this trouble.

(New bug: sometimes I’ve seen “a + b” i.e. __add__ on flopscope arrays have an odd type error to do with remote dispatch. Simple workaround though as fnp.add(...) succeeds. I’ll try to clarify this elsethread but that’s currently a gotcha for me.)
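A minimal sketch of that workaround as a helper, assuming the failure surfaces as a TypeError (the post only says "an odd type error", so the exception class is a guess). add_fn is a stand-in for fnp.add and the helper name is hypothetical.

```python
from typing import Any, Callable


def add_with_fallback(a: Any, b: Any, add_fn: Callable[[Any, Any], Any]) -> Any:
    """Try the + operator first; fall back to an explicit add function."""
    try:
        return a + b
    except TypeError:
        # Assumed failure mode: __add__ raising TypeError during dispatch.
        return add_fn(a, b)
```

Calling fnp.add(...) directly everywhere is simpler still. The helper is only useful if you want to keep operator syntax where it works.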

mohanty · June 23, 2026, 10:14am

@jamespayor : Just added some additional context on this issue here: Submission validates locally but fails with Evaluation error - #3 by mohanty