Flopscope v0.8.0 Release Candidate Available!

mohanty · June 17, 2026, 5:18pm

Hi everyone,

We have published the first release candidate of flopscope v0.8.0 (v0.8.0rc1) to PyPI, paired with whestbench v0.12.0rc0.

This is the version the evaluators will use for Phase 1. We are sharing the release candidate now so you can build against it, check your estimators, and send feedback before the final v0.8.0 release.

TL;DR
Install the release candidate:
pip install --pre "flopscope>=0.8.0rc1" "whestbench>=0.12.0rc0"
This is the Phase 1 evaluator version.

The cost model is now more consistent: you are billed for computation on values, not data movement.

Contraction costs are unified across matmul, dot, inner, outer, tensordot, vdot, einsum, and relevant linalg operations.

Residual wall-time accounting is fairer: framework overhead is not charged to your estimator.

Weight packing is now pickle-free through flops.Module.

Warm-Up evaluators are unchanged and remain on flopscope v0.5.0 / whestbench v0.10.0.

What’s New in flopscope v0.8.0

1. Computation vs data logistics

flopscope now clarifies one core principle:

You are charged for computation on values, not for moving data around.

Arithmetic, reductions, matrix multiplies, transcendentals, and FFTs cost FLOPs.
Copying, reshaping, stacking, concatenating, slicing, and gathering are free.

Matrix operations dominate many estimator budgets and are now counted in full, so please re-check your estimator against the Phase 1 budget.

2. One cost engine for contractions

matmul, dot, inner, outer, tensordot, vdot, einsum, and relevant linalg contractions now share the same symmetry-aware machinery.

This resolves cases where operations such as fnp.tensordot could previously be undercounted. These operations are now billed consistently through the shared einsum_cost machinery.
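As a rough illustration of why a single engine gives consistent bills, each of these operations can be written as an einsum and priced by the size of its full index space. The sketch below is a textbook dense count (one multiply-add per index combination) that ignores symmetry savings. `dense_contraction_cost` is a hypothetical helper for illustration, not flopscope's `einsum_cost`.

```python
from math import prod


def dense_contraction_cost(subscripts, *shapes):
    """Count one multiply-add per point of the full einsum index space.

    Hypothetical helper: a dense textbook count, not flopscope's actual formula.
    """
    inputs, _, _output = subscripts.partition("->")
    sizes = {}
    for labels, shape in zip(inputs.split(","), shapes):
        if len(labels) != len(shape):
            raise ValueError(f"labels {labels!r} do not match shape {shape}")
        for label, dim in zip(labels, shape):
            # Every use of a label must agree on its dimension.
            if sizes.setdefault(label, dim) != dim:
                raise ValueError(f"inconsistent size for index {label!r}")
    return prod(sizes.values())


matmul_cost = dense_contraction_cost("ij,jk->ik", (4, 8), (8, 3))     # matmul
tensordot_cost = dense_contraction_cost("ij,jk->ik", (4, 8), (8, 3))  # tensordot, axes=1
dot_cost = dense_contraction_cost("i,i->", (8,), (8,))                # dot / vdot / inner
outer_cost = dense_contraction_cost("i,j->ij", (4,), (3,))            # outer
linear_cost = dense_contraction_cost("oi,i->o", (4, 8), (8,))         # Linear layer einsum
```

Because matmul and a single-axis tensordot map to the same subscripts, they get the same count in this sketch, which is the kind of consistency the unified engine is meant to guarantee.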

3. Fairer residual wall-time accounting

Your score accounts for both FLOPs and residual wall time, which is time spent outside tracked operations.

We re-audited what counts as participant residual time. Framework plumbing, including data transport between the flopscope client and server and array unpacking, is now attributed to flopscope overhead rather than your residual wall time.

In short: you are charged residual time for your own code, not for evaluator plumbing.
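A minimal sketch of that accounting, assuming wall time splits cleanly into tracked operations, framework overhead, and participant residual. The function and parameter names are hypothetical and do not come from flopscope or whestbench.

```python
def participant_residual(total_wall_s, tracked_op_s, framework_overhead_s):
    """Wall time left after removing tracked ops and framework plumbing.

    Hypothetical decomposition for illustration; the evaluator's real
    bookkeeping may differ.
    """
    residual = total_wall_s - tracked_op_s - framework_overhead_s
    return max(0.0, residual)  # never report negative residual time


# 10 s total, 6 s inside tracked operations, 1.5 s of client/server transport.
residual_s = participant_residual(10.0, 6.0, 1.5)
```

Under this assumption, only the remaining 2.5 s would count as time spent in your own untracked code.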

4. Pickle-free weight packing and clearer errors

You can now bundle data with your submission through flops.Module. Loading this data is free: it costs 0 FLOPs and is not counted in residual wall time.

The new release also provides clearer errors when an operation is not available in the grading environment.

The full per-operation cost rules are documented in the cost-model.md reference. You can also use budget.summary() to inspect where your own FLOPs are going.

Packing Data Into Your Submission

Define your model as a flops.Module. Array attributes are discovered automatically, and save writes them with a small JSON config instead of using pickle.

# model.py
import flopscope as flops
import flopscope.numpy as fnp

class Linear(flops.Module):
    def __init__(self, n_in, n_out):
        self.W = fnp.zeros((n_out, n_in))     # array state: auto-discovered
        self.b = fnp.zeros(n_out)

    def config(self):                         # non-array config, used to rebuild
        return {"n_in": self.W.shape[1], "n_out": self.W.shape[0]}

    def __call__(self, x):
        return fnp.einsum("oi,i->o", self.W, x) + self.b

if __name__ == "__main__":
    model = Linear(8, 4)
    # ...
    model.save("model.npz")                   # named arrays + JSON config, no pickle

Load the saved module once in your estimator’s setup():

# estimator.py
from pathlib import Path
from whestbench import BaseEstimator
from model import Linear

class Estimator(BaseEstimator):
    def setup(self, ctx):                                                   # runs once
        self.model = Linear.from_file(Path(__file__).parent / "model.npz")  # free load

    def predict(self, mlp, budget):
        ...   # use self.model

The whest CLI bundles everything in your estimator folder, up to 50 MB / 50 files, so model.py and model.npz travel with estimator.py and load for free at grading time.

whest login                                     # once, with your AIcrowd API key
whest submit --estimator estimator.py --watch   # packages, uploads, and follows to a score

Please iterate locally first:

whest run --estimator estimator.py
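Before submitting, it can also help to confirm that the estimator folder fits the 50 MB / 50 files bundle limit mentioned above. The sketch below is a local pre-flight check, not part of the whest CLI. It assumes that 1 MB means 10^6 bytes and that every file under the folder is counted, neither of which is confirmed by the announcement.

```python
from pathlib import Path

MAX_FILES = 50
MAX_BYTES = 50 * 1000 * 1000  # assumption: 1 MB = 10**6 bytes


def bundle_stats(folder):
    """Return (file_count, total_bytes) for all files under `folder`."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def check_bundle(folder, max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """True if the folder stays within both limits (hypothetical helper)."""
    n_files, n_bytes = bundle_stats(folder)
    return n_files <= max_files and n_bytes <= max_bytes
```

For example, `check_bundle(Path("estimator.py").parent)` would check the folder that holds estimator.py, model.py, and model.npz.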

The full estimator and submission walkthrough is available in the starter kit. The examples/12_save_load_mlp.py example includes a multi-layer flops.Module.

Try the Release Candidate

Install the release candidate:

pip install --pre "flopscope>=0.8.0rc1" "whestbench>=0.12.0rc0"
pip show flopscope whestbench   # expect 0.8.0rc1 / 0.12.0rc0

If you are using the starter kit with uv, pin the candidate with:

uv add "flopscope>=0.8.0rc1" "whestbench>=0.12.0rc0"

Then check your estimator against the budget:

import flopscope as flops

with flops.BudgetContext(flop_budget=68_000_000_000) as budget:
    estimator.predict(mlp, budget=68_000_000_000)

print(f"FLOPs used: {budget.flops_used:,}")
print(budget.summary())   # per-operation breakdown

What Happens if the Cost Model Changes Again?

Phase 1 runs on this release candidate, and it is a candidate for a reason.

If community feedback leads to major changes in the final v0.8.0 release, we will re-evaluate every Phase 1 submission received up to that release on the final cost model. You can submit now without worrying that an early submission will be disadvantaged by a later evaluator change.

Send Us Feedback

Please share any issues you find with the updated cost model, flopscope in general, whestbench, or the starter kit. Your feedback is extremely valuable and also counts towards the community contribution prizes of 500-5000 USD each!

Feedback / Discussion: please start a new topic on the challenge forum at: AIcrowd | ARC White-Box Estimation Challenge 2026 | Discussions
Feedback / Discussion on the Cost Model: please use the dedicated thread: Flopscope v0.8.0 Release Candidate Available!
Reproducible bugs: open an issue on flopscope, whestbench, or the starter kit. PRs are welcome as well.
Security issues: email arc-whestbench@aicrowd.com privately rather than opening a public issue.

To avoid any confusion: the Warm-Up evaluators are unchanged and stay on flopscope v0.5.0 / whestbench v0.10.0.

This release candidate is the evaluator version planned for Phase 1, which launches at 00:00 UTC on 18 June 2026 with an independent evaluation environment and a separate leaderboard.

Stay tuned for the official Phase 1 launch announcement, which will cover the updated target architecture, budget changes, leaderboard details, and prize structure.

All the best!

mohanty · June 19, 2026, 3:29am

jamespayor · July 6, 2026, 3:18pm

Hi there! I have a bit of cost-model feedback: currently there are significant savings on the table from manually accounting for input sparsity (i.e. zeros present in the contraction). This is pretty relevant because it’s a factor of 2 for passing samples through the net.

Raising this in case you would like to automatically favor/exclude zero entries in matmuls and similar. This is somewhat realistic for current hardware (at least NVIDIA GPUs have a sparsity speedup), but I don’t know whether you consider it in scope.
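To make the size of the effect concrete, here is a toy count of multiply-adds for a matrix-vector product with and without skipping zero inputs. `matvec_macs` is a hypothetical helper and does not reflect how flopscope bills anything.

```python
def matvec_macs(n_out, x, skip_zeros=False):
    """Multiply-adds for an (n_out x len(x)) matrix times vector x.

    Toy count for illustration only: with skip_zeros=True, columns that
    meet a zero entry of x are not counted.
    """
    active = sum(1 for v in x if v != 0) if skip_zeros else len(x)
    return n_out * active


x = [0.0, 1.5, 0.0, 2.0]          # half of the inputs are zero
dense_macs = matvec_macs(3, x)    # every entry counted
sparse_macs = matvec_macs(3, x, skip_zeros=True)
```

With half of the inputs equal to zero, the zero-skipping count is half the dense one, which matches the factor of 2 described above.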

jamespayor · July 6, 2026, 9:34pm

Okay a different thing that seems more of an exploit: float32/float64 are billed the same, okay fair enough, but so is complex64. So you can get a “free” factor of 2 flop reduction by packing two values together as a “complex” number and performing float64 * complex64 multiplications.

I think the simple answer here would be to bill real * complex ops as 2x more costly than real * real, complex * complex can be 2x or 4x, or whatever seems appropriate.
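To illustrate the packing trick described above in plain Python (not flopscope code): multiplying a real weight by a complex number scales the real and imaginary parts independently, so n complex multiplies carry 2n real products. `packed_scale` is a hypothetical helper.

```python
def packed_scale(w, a, b):
    """Scale two real sequences by w using one complex multiply per pair."""
    packed = [complex(x, y) for x, y in zip(a, b)]  # pack a + i*b
    scaled = [w * z for z in packed]                # n real*complex multiplies
    return [z.real for z in scaled], [z.imag for z in scaled]


scaled_a, scaled_b = packed_scale(2.0, [1.0, 2.0], [3.0, 4.0])
```

If each real * complex multiply is billed like a single real multiply, the two scaled arrays come out at half the bill of doing them separately.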

mohanty · July 6, 2026, 10:33pm

@jamespayor : Yes we are aware of this issue, and are currently working on a proper fix for this. The cost adjustments will land in the next release, and affected submissions will be re-evaluated.

We may or may not include the float32/float64 fix in Phase 1, but the complex one will definitely be addressed.

mohanty · July 6, 2026, 10:37pm

As discussed in the town hall meeting:
Thanks for the feedback. This is indeed a great suggestion, but we will discuss internally a bit more whether to include the sparsity-related savings within the scope of Phase 1 and/or Phase 2.
We will share more details once we have a clear decision on this.

jamespayor · July 6, 2026, 11:12pm

Thanks! Fwiw I think keeping it uniform across float32 vs float64 is appealing from my perspective, as something that simplifies matters (“just use float64”) though I’d of course be comfortable however y’all decide about it.

andrew_epstein · July 7, 2026, 6:54pm

James beat me to reporting it but yes, my most recent two submissions are just flops accounting tricks with a “true” score that should be around 2.5e-7. There is some further trickery you can do with quantization and bitpacking to go beyond just the 2x reduction afforded by the complex type. I’m not at my computer now but I’ll reply in a few hours with the details of that.

RomanChernenko · July 8, 2026, 2:38pm

Hello @mohanty ,

Do you have any expectation of when the FLOP estimation for complex numbers will be fixed? The current situation with the leaderboard, where it is impossible to tell whether scores are real or hacks with number packing, is a little confusing.

mohanty · July 8, 2026, 5:59pm

@RomanChernenko : Fair point, and we acknowledge the uncertainty this creates. We are trying our best to address this before the end of this week, and hopefully sooner. We will make an announcement as soon as this is resolved.

Best,
Mohanty

yangxinyu_xie9 · July 12, 2026, 11:42pm

Thank you so much for this thread. We were very confused by the top of the leaderboard earlier this week too. Hope this fix will come soon.

mliston · July 13, 2026, 3:26pm

Just want to +1 as I’m also looking forward to a fix being implemented for this.

mohanty · July 20, 2026, 1:03am

Here is a PR that is now in the final stages of review and addresses most of the concerns raised in this thread:

github.com/AIcrowd/flopscope

feat(billing): dtype-aware four-factor cost model + reviewer-driven re-tiering

AIcrowd:main ← AIcrowd:claude/cost-model-planning-de1198 · opened 11:12PM - 19 Jul 26 UTC by spMohanty · +14944 -2313

## Summary

This PR replaces flopscope's dtype-blind billing (`charged = flop_cost × weight`) with a **four-factor model** and re-prices the operation catalog per the external cost-model review:

```
charged = int(flop_cost × weight × dtype_rate × complex_factor)
```

It closes a family of budget-bypass loopholes (precision packing, free data movement, free-data side channels), re-tiers 99 operations, adds two newly-metered surfaces (`arr[key]` indexing, `random.sample`), fixes a cost-model liveness bug that could hang the grader, and rewrites `docs/reference/cost-model.md` as the audited single source for all of it.

**Supersedes #144** (same lineage, carried forward through the review + repricing).

## Why

The previous model billed every dtype identically, so `float64` work cost the same as `float32` and packing two real arrays into one complex array halved the bill. Independently, data movement (copies, concatenation, gathers, fills, saves) was free, which left too many ways to do real work without spending budget. The external review of the operation catalog (469 yes / 109 no / 37 unsure across 616 ops) drove a systematic re-tiering; every "no" and "unsure" row was triaged to a decision and implemented here: 153/153 dispositions verified in code, none silently dropped.

## The model

**`dtype_rate`**: priced by the width numpy actually computes in (not just the input): 32-bit and below = 1.0, 64-bit = 2.0, float96 = 3.0, float128 = 4.0, complex64 = 1.0, complex128 = 2.0. Integer reductions that widen to a 64-bit accumulator (`sum`, `prod`, `cumsum`, `mean`, …) bill at the accumulator's rate; comparisons (`max`, `argmin`, …) do not widen. Unknown numeric dtypes fail closed.

**`complex_factor`**: real-FLOP equivalents for complex arithmetic: add 2, multiply 6, divide 11, abs 4, sqrt 10, compare/sort 2, per-op factors for transcendentals and composites; FFT's `5·N·log₂N` already counts complex butterflies so it takes rate only.

**Weights** collapse to a locked tier set `{0, 1, 4, 16}`: views 0, writes/copies 1, selector-deriving access 4 (sorts, set-ops, histograms, random reordering, gathers), transcendentals 16.

## What participants see (headline re-tiering)

Before = `main`, after = this branch. Full per-op table with probe-verified examples: `docs` → cost-model page; a participant-facing before/after summary accompanies this PR.

- **Data movement is no longer free (→ 1 × numel(output))**: creation fills (`ones`, `full`, `eye`, …), `copy`/`reshape`/`ravel`/`require`, `concatenate`/`stack`/`tile`/`repeat`/`roll`/`insert`/`delete`, diagonal & triangular constructors, index generators, `meshgrid`, `fft.fftshift`/`ifftshift`. `zeros`/`empty`/views stay free.
- **Gathers cost 4 × numel(output)**: `take`, `take_along_axis`, `choose`, and fancy `arr[int_idx]`; boolean masks bill `numel(mask) + 4 × selected`; slices stay free. Scatters (`put`, `place`, `putmask`, `fill_diagonal`) bill 1× elements written.
- **Value-ordering ops move 1× → 4×** (formulas unchanged): `sort`/`argsort`/`partition`, `searchsorted`, `unique`/set-ops, histograms/`bincount`/`digitize`, and random reordering (`shuffle`/`permutation`/`choice`).
- **Conditional selection**: 3-arg `where` bills `4 × numel(broadcast output)` (it billed 0 before); `select` bills `numel(output) × len(condlist)` (also billed 0 before); `piecewise` and `apply_along_axis` drop 4× → 1×.
- **I/O**: `save`/`savez`/`savez_compressed` bill `4 × bytes-that-hit-disk`: array data, the shape header (`ndim × 8`), archive member names, and the `__meta__` blob. `load` stays free.
- **Windows**: `hamming`/`hanning` re-derived to `18n` at weight 1 (was `2n` at weight 8); `bartlett`/`blackman`/`kaiser` formulas unchanged (their bills move only via the float64 rate). `real`/`imag` become free (views).

## Exploit closures

- **Precision packing**: complex-packing and int-bit-packing no longer beat the meter (complex factors exceed 2; sub-32-bit floors at the fp32 rate; accumulator rule).
- **Free-data side channels**: `savez` `__meta__` blobs and archive member names are billed as egress; the `.npy` shape header (`ndim × 8` participant-controlled bytes) is billed, closing the zero-element-array channel.
- **Server first-touch**: arrays ingested via `create_from_data` are billed from the first touching op (no free `getitem`/`astype` on fresh ingests).
- **Zero-billed compute**: 3-arg `where`, `select`, `compress`, and the pad compute modes all billed 0 before; `pad` is now writes-consistent (`numel(output)` + mode extras, `mode=<callable>` rejected).
- **Method-surface parity**: `x.copy()/.reshape()/.ravel()/.flatten()/.choose()` bill exactly like their `fnp.*` counterparts.
- **Alias-tier consistency**: module-level `random.choice` billed 1× while the Generator/RandomState surfaces billed 4×; the cheaper alias route is closed (4× on all three surfaces, pinned).

## Correctness & liveness fixes

- **Symmetric-tensor cost hang**: pricing a reduction over a high-rank symmetric tensor re-enumerated a near-budget permutation group once per candidate generator (S₈ hung the grader). Dimino enumeration is now bounded cumulatively per op; oversized groups fall back to dense cost with a `CostFallbackWarning`. Over-charges only, never under.
- **List index keys on the wire**: fancy `x[[0, 1]]` vs element access `x[0, 1]`: the tagged `{"__list__": …}` encoding landed on main via #148; this branch aligns with it and additionally drops the legacy bare-list guessing heuristic (client and server co-deploy) and extends decoder tests (bytes keys, nested tuples).
- Empty-domain contractions/reductions bill ≥ 0 (aligned with #146).

## Documentation

- `docs/reference/cost-model.md` rewritten as the single source: model chapters (dtype, complex, accumulators, worked examples) + per-family formula tables + non-exploitability notes. The doc was then audited claim-by-claim against live billing (~410 probe checks across three passes); all drift found was fixed, so the published formulas now match what the meter charges.
- The published op catalog was audited note-by-note (553 ops, probe-based): stale formulas left over from earlier repricings were refreshed, citation residue scrubbed, and each op's displayed weight now resolves through the same fallback chain the meter uses, so the catalog cannot disagree with billing.
- `ops.json` / generated API docs stay in sync via CI gates (`generate_api_docs.py --check`, `sync_client.py --check`, cost-model coverage locks, view-semantics lock, dtype-conformance probes, production-weight billing test).

## Verification

- Review-sheet audit: **153/153** agreed dispositions implemented (110 locked by exact price pins in `tests/test_triage_price_pins.py`, 42 probed, 1 code-read); the 13 places the implementation deviates from the raw review proposals are recorded decisions.
- Suites: root **5190 passed**, server **254 passed**, client **1357 passed**; ruff and pyright clean.
- Every budget-affecting change carried an adversarial review pass (finder → refuter) before landing; the branch had three whole-branch review rounds converge clean.

## Compatibility & rollout

- **Server-authoritative**: billing changes bind on the grader only after a release and eval-server redeploy. Client and server deploy together; the wire change is not backward compatible by design.
- Local runs and published docs change immediately on merge; participants should re-check budget headroom against the new prices (see the before/after summary).
- Sheet write-back: regenerate the review sheet from code after merge (`scripts/numpy_audit.py` + upload) so the sheet reflects shipped reality.

## Merge checklist

- [x] Reconciled with `origin/main` (#146 merged cleanly; #148 aligned: same `{"__list__"}` tag, this branch keeps the stricter bare-list→tuple rule and adopts the bool-preservation fix; all #148/#146 tests pass)
- [ ] Full CI matrix green + `mergeStateStatus` CLEAN
- [ ] Close #144 as superseded (or retarget)
- [ ] Post-merge: release + eval-server redeploy; regenerate review sheet; publish the participant before/after note
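The headline data-movement prices in the PR description can be written down as small formulas. The sketch below only restates the numbers quoted there. The helper names are hypothetical, and the results are the base charges before any dtype rate or complex factor is applied.

```python
from math import prod


def numel(shape):
    """Number of elements in an array of the given shape."""
    return prod(shape)


def copy_cost(out_shape):
    # copies, concatenations, fills, etc.: 1 x numel(output)
    return 1 * numel(out_shape)


def gather_cost(out_shape):
    # take / fancy integer indexing: 4 x numel(output)
    return 4 * numel(out_shape)


def bool_mask_cost(mask_shape, n_selected):
    # boolean masks: numel(mask) + 4 x selected
    return numel(mask_shape) + 4 * n_selected


def where3_cost(out_shape):
    # 3-arg where: 4 x numel(broadcast output)
    return 4 * numel(out_shape)


def select_cost(out_shape, n_conditions):
    # select: numel(output) x len(condlist)
    return numel(out_shape) * n_conditions


SLICE_COST = 0  # slices and other views stay free
```

For example, copying a (128, 64) array would cost 8,192 under these rules, while gathering 10 rows of 64 elements would cost 2,560.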

The PR introduces dtype-specific billing, including explicit handling of float64 and complex operations. It also includes targeted adjustments to the cost model after reviewing its behaviour across Phase 1 submissions. We identified several operations whose current billing could underestimate the computational work being performed, so the updated model aims to make costs more representative and consistent across different implementation paths.
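As a worked example of the four-factor formula from the PR description, the sketch below prices a few elementwise operations. The rate and factor tables are copied from the PR text. Three things are assumptions made for illustration: a weight of 1 for plain arithmetic, a complex factor of 1 for real dtypes, and billing a real × complex multiply like a complex multiply.

```python
# Values quoted in the PR description.
DTYPE_RATE = {"float32": 1.0, "float64": 2.0, "complex64": 1.0, "complex128": 2.0}
COMPLEX_FACTOR = {"add": 2, "multiply": 6, "divide": 11, "abs": 4, "sqrt": 10}


def charged(flop_cost, weight, dtype, op):
    """charged = int(flop_cost x weight x dtype_rate x complex_factor).

    Assumption: real dtypes use a complex factor of 1.
    """
    is_complex = dtype.startswith("complex")
    complex_factor = COMPLEX_FACTOR[op] if is_complex else 1
    return int(flop_cost * weight * DTYPE_RATE[dtype] * complex_factor)


n = 1000
two_real_products = 2 * charged(n, 1, "float32", "multiply")   # 2n real multiplies
one_packed_product = charged(n, 1, "complex64", "multiply")    # n complex multiplies
float64_product = charged(n, 1, "float64", "multiply")         # 64-bit rate doubles the bill
float64_exp = charged(n, 16, "float64", "exp")                 # transcendental tier, weight 16
```

Under these assumptions, packing two float32 arrays into one complex64 array costs 6,000 instead of 2,000, so the packing trick reported earlier in the thread would no longer reduce the bill.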

We would appreciate any initial feedback on the PR. The PR description contains further details.

We plan to release the updated version on 21 July, deploy it to the evaluation servers, and re-evaluate all Phase 1 submissions received so far. All subsequent Phase 1 evaluations will use the updated version of Flopscope.

We will publish a more detailed announcement before the release, including the final changes and rollout details.