Sharing our Phase-1 approach and a result we think is interesting for the Best Algorithmic Contribution discussion: a clean, provably unbiased estimator of the per-neuron post-ReLU activation means, plus an honest map of which “obvious” improvements actually help at depth 32 — and why most do not.
The estimator (two ingredients, both variance reduction at fixed FLOPs)
- Randomly-shifted rank-1 lattice (RQMC) inputs. Instead of N i.i.d. Gaussian input rows we use a Kronecker lattice g_j = frac(√p_j) with a single uniform Cranley–Patterson shift U (seeded per-MLP), inverse-CDF mapped to Gaussians: x_k = Φ⁻¹(frac(k·g + U)). Low discrepancy ⇒ faster convergence than i.i.d. MC on this smooth-ish integrand.
- Rao-Blackwell layer 1 (exact). z⁽¹⁾ = W⁽¹⁾ᵀX is exactly Gaussian, so E[ReLU(z⁽¹⁾_i)] = s_i/√(2π) in closed form (s_i² = column squared-norm of W⁽¹⁾). We use the exact value for that row.
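For anyone who wants to reproduce the two ingredients, here is a minimal stdlib-only sketch, not our submission code. The function names, the choice of the first `dim` primes for p_j, starting k at 0, and applying the tail-clip as a clamp on the uniforms are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes if p * p <= c):
            primes.append(c)
        c += 1
    return primes


def shifted_kronecker_gaussian(n_points, dim, seed, clip=1e-11):
    """Rows x_k = Phi^{-1}(frac(k*g + U)) with g_j = frac(sqrt(p_j))."""
    rng = random.Random(seed)                    # one seed per MLP
    g = [math.sqrt(p) % 1.0 for p in first_primes(dim)]
    shift = [rng.random() for _ in range(dim)]   # single Cranley-Patterson shift U
    inv_cdf = NormalDist().inv_cdf
    rows = []
    for k in range(n_points):                    # assumption: k starts at 0
        u = [(k * gj + uj) % 1.0 for gj, uj in zip(g, shift)]
        # Clamp away from 0 and 1 so the inverse CDF stays finite.
        rows.append([inv_cdf(min(max(v, clip), 1.0 - clip)) for v in u])
    return rows


def exact_layer1_relu_mean(W):
    """E[ReLU(z_i)] = s_i / sqrt(2*pi), s_i = Euclidean norm of column i of W.

    W is a list of rows with shape (n_in, n_out), so z = W^T x.
    """
    n_in, n_out = len(W), len(W[0])
    means = []
    for i in range(n_out):
        s_i = math.sqrt(sum(W[r][i] ** 2 for r in range(n_in)))
        means.append(s_i / math.sqrt(2.0 * math.pi))
    return means
```

Calling `shifted_kronecker_gaussian(N, n, seed)` gives the N input rows for one MLP, and `exact_layer1_relu_mean(W1)` replaces the sampled layer-1 means with their closed-form values.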
Unbiasedness. Write g_ℓ(x) for neuron ℓ's post-ReLU activation as a function of the input x (not to be confused with the lattice generator g). For a fixed lattice point k, frac(k·g + U) is exactly Unif[0,1)ⁿ under U~Unif, so x_k ~ N(0,I) marginally and E_U[g_ℓ(x_k)] = E_{X~N(0,I)}[g_ℓ(X)] for every k; averaging over k preserves this by linearity of expectation. The random shift removes the deterministic bias a fixed lattice would carry (the standard RQMC argument), so the estimate is unbiased for every N (up to a ~1e-11 tail-clip). Numerically: 60 shifts vs a 4M-sample reference gave 0/64 final-layer neurons with |bias| > 3·SE.
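The 3·SE check is easy to reproduce on a toy case. The sketch below is an illustration under assumptions, not our experiment: it uses a single ReLU neuron with a hypothetical weight vector `w`, whose exact mean ‖w‖/√(2π) stands in for the 4M-sample reference, and all function names are made up for the example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def rqmc_estimate(w, n_points, rng):
    """One randomly shifted Kronecker-lattice estimate of E[ReLU(w . X)]."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19][: len(w)]   # supports len(w) <= 8
    g = [math.sqrt(p) % 1.0 for p in primes]
    shift = [rng.random() for _ in w]                 # fresh shift U per call
    inv_cdf = NormalDist().inv_cdf
    total = 0.0
    for k in range(n_points):
        x = [
            inv_cdf(min(max((k * gj + uj) % 1.0, 1e-11), 1.0 - 1e-11))
            for gj, uj in zip(g, shift)
        ]
        total += max(0.0, sum(wi * xi for wi, xi in zip(w, x)))
    return total / n_points


def bias_check(w, n_points=64, n_shifts=60, seed=0):
    """Return (bias, standard error) of the shift-averaged estimate."""
    rng = random.Random(seed)
    estimates = [rqmc_estimate(w, n_points, rng) for _ in range(n_shifts)]
    exact = math.sqrt(sum(wi * wi for wi in w)) / math.sqrt(2.0 * math.pi)
    bias = mean(estimates) - exact
    se = stdev(estimates) / math.sqrt(n_shifts)
    return bias, se


bias, se = bias_check([0.5, -1.0, 2.0])
print(f"bias = {bias:+.3e}, SE = {se:.3e}, |bias|/SE = {abs(bias) / se:.2f}")
```

A neuron would be flagged when `abs(bias) > 3 * se`; for an unbiased estimator that should happen only at roughly the rate the normal tail predicts.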
Result (public mini split, n=256, L=32, B=2.72e11): adjusted final-layer score ≈ 4.10e-7, 0/100 failures, C/B ≈ 0.42. The whole gain over equal-FLOP MC comes from low-discrepancy sampling + an exact first layer — no fitted surrogate, so no estimation-bias or distribution-shift risk.
The part we think is the real contribution: where the frontier is
We tested the “obvious” upgrades at the grading shape and measured what each does to the final-layer MSE at fixed C/B. Almost nothing helps:
- Linearization / layer-1 control variates (subtract a known-mean linear term built from the expected-gain Jacobian, or from the exact layer-1 law): ≈ ×1.0–1.2 MSE improvement.
- Block-covariance gains for the CV (accurate σ at depth): still ≈ ×1.0.
- Scrambled Sobol’ vs the Kronecker lattice: per-row tied (MSE·N ≈ equal).
- Korobov lattice: much worse (rank-1 quality is violently N-sensitive).
- A learned bias-corrector on the deterministic block-covariance estimate (which runs at the 0.1 compute floor), trained on synthetic MLPs and evaluated leave-MLPs-out: cuts block-cov MSE only ≈ 2.3× — far short of the ~17× it would need to beat the sampler at the floor.
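For readers less familiar with the control-variate bullets, the generic form is: subtract β·(h(x) − E[h]) from each sample, where h has an exactly known mean. The sketch below is a toy illustration only. The two-layer scalar "network", its weights, and the function names are hypothetical; it uses plain i.i.d. Gaussian samples rather than a lattice, and it fits β on a separate pilot sample so that the final estimate stays unbiased.

```python
import random
from statistics import mean


def relu(t):
    return max(0.0, t)


def toy_output(x):
    """Hypothetical 2-layer scalar network standing in for a final-layer neuron."""
    h1 = relu(0.8 * x[0] - 0.3 * x[1])
    h2 = relu(-0.5 * x[0] + 0.9 * x[1])
    return relu(0.7 * h1 - 0.4 * h2 + 0.1)


def linear_cv(x):
    """Linear control variate; its mean under X ~ N(0, I) is exactly 0."""
    return 0.8 * x[0] - 0.3 * x[1]


def fit_beta(samples):
    """Least-squares coefficient cov(f, h) / var(h) from a pilot sample."""
    f = [toy_output(x) for x in samples]
    h = [linear_cv(x) for x in samples]
    fm, hm = mean(f), mean(h)
    cov = sum((a - fm) * (b - hm) for a, b in zip(f, h))
    var = sum((b - hm) ** 2 for b in h)
    return cov / var


def cv_estimate(samples, beta):
    """Mean of f(x) - beta * (h(x) - E[h]) with E[h] = 0."""
    return mean([toy_output(x) - beta * linear_cv(x) for x in samples])


rng = random.Random(0)
draw = lambda n: [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
pilot, main = draw(500), draw(2000)
beta = fit_beta(pilot)                 # fitted on pilot data only
plain = mean([toy_output(x) for x in main])
print(f"beta = {beta:.3f}, plain = {plain:.4f}, with CV = {cv_estimate(main, beta):.4f}")
```

On a shallow toy like this the linear term correlates well with the output, which is exactly what fails at depth 32.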
Why: RQMC and exact-mean control variates are substitutes — both remove the smooth, low-effective-dimension part of Var[ReLU(z⁽ᴸ⁾)]. Once the lattice has taken it, the only exactly-known means available (the input X, the exact layer-1 law) are 31–32 ReLU layers upstream and decorrelate badly from the output; what remains is genuinely high-dimensional “ReLU-kink” variance that no exact-mean linear/quadratic CV reaches. And the learned corrector’s final-layer features cannot recover the whole-network higher cumulants that drive the block-cov error.
So our reading is that, for the post-ReLU mean of a deep random MLP, a randomly-shifted lattice with an exact first layer is at the practical accuracy-per-FLOP frontier among unbiased methods — the residual is irreducible kink variance, not a modelling gap a control variate or analytic corrector can close. Beating it materially seems to require a biased learned corrector, which trades away the unbiasedness guarantee.
Happy to share the unbiasedness derivation, the FLOP accounting against flopscope 0.8, or the negative-result experiments in more detail. Curious whether others found a control variate or lattice that genuinely beats plain RQMC at depth — we couldn’t.