Unbiased randomized-QMC + Rao-Blackwell for post-ReLU activation means — method, unbiasedness proof, and where the frontier is

Sharing our Phase-1 approach and a result we think is interesting for the Best Algorithmic Contribution discussion: a clean, provably unbiased estimator of the per-neuron post-ReLU activation means, plus an honest map of which “obvious” improvements actually help at depth 32 — and why most do not.

The estimator (two ingredients, both variance reduction at fixed FLOPs)

  1. Randomly shifted Kronecker lattice (RQMC) inputs. Instead of N i.i.d. Gaussian input rows we use a Kronecker lattice with generator g_j = frac(√p_j) and a single uniform Cranley–Patterson shift U (seeded per MLP), mapped coordinate-wise through the inverse normal CDF: x_k = Φ⁻¹(frac(k·g + U)). Low discrepancy ⇒ faster convergence than i.i.d. MC on this smooth-ish integrand.
  2. Rao-Blackwell layer 1 (exact). z⁽¹⁾ = W⁽¹⁾ᵀX is exactly Gaussian with z⁽¹⁾_i ~ N(0, s_i²), where s_i² is the squared norm of column i of W⁽¹⁾, so E[ReLU(z⁽¹⁾_i)] = s_i/√(2π) in closed form. We use this exact value for layer 1 in place of the sample average.
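
A minimal NumPy sketch of the two ingredients above, under stated assumptions rather than the submission code: p_j is taken to be the j-th prime, k runs over 0..N−1, the tail-clip is applied to the uniforms before Φ⁻¹, and all function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def first_primes(n):
    """First n primes by trial division (adequate for a few hundred dimensions)."""
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes if p * p <= c):
            primes.append(c)
        c += 1
    return np.array(primes, dtype=np.float64)


def shifted_lattice_gaussians(N, n, rng, clip=1e-11):
    """N Gaussian input rows from a Kronecker lattice with one random shift."""
    g = np.mod(np.sqrt(first_primes(n)), 1.0)      # assumed: p_j = j-th prime
    U = rng.random(n)                              # single Cranley-Patterson shift
    k = np.arange(N, dtype=np.float64)[:, None]    # assumed: k = 0, ..., N-1
    u = np.mod(k * g + U, 1.0)                     # shifted points in [0, 1)^n
    u = np.clip(u, clip, 1.0 - clip)               # assumed placement of the tail clip
    inv_cdf = np.vectorize(NormalDist().inv_cdf, otypes=[float])
    return inv_cdf(u)                              # shape (N, n)


def exact_layer1_relu_mean(W1):
    """E[ReLU(z_i)] for z = W1^T x, x ~ N(0, I): s_i / sqrt(2*pi)."""
    s = np.linalg.norm(W1, axis=0)                 # s_i = norm of column i of W1
    return s / np.sqrt(2.0 * np.pi)


rng = np.random.default_rng(0)
n, N = 8, 4096
W1 = rng.standard_normal((n, n)) / np.sqrt(n)
X = shifted_lattice_gaussians(N, n, rng)
sampled = np.maximum(X @ W1, 0.0).mean(axis=0)     # lattice average of ReLU(z)
exact = exact_layer1_relu_mean(W1)
print(np.abs(sampled - exact).max())               # gap between lattice average and closed form
```

The closed form costs one column-norm pass over W⁽¹⁾, so the exact layer-1 mean needs no samples at all; the lattice average is printed only as a sanity check against it.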

Unbiasedness. For a fixed lattice point k, frac(k·g + U) is exactly Unif[0,1)ⁿ under U ~ Unif[0,1)ⁿ, so x_k ~ N(0,I) marginally and E_U[g_ℓ(x_k)] = E_{X~N(0,I)}[g_ℓ(X)] for every k, where g_ℓ maps an input to the layer-ℓ post-ReLU activations (not the lattice generator g); averaging over k preserves the equality. The random shift removes the deterministic bias a fixed lattice would carry (the standard RQMC argument), so the estimate is unbiased for every N, up to a ~1e-11 tail-clip. Numerically: 60 shifts vs a 4M-sample reference gave 0/64 final-layer neurons with |bias| > 3·SE.
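
The same argument written out as a short LaTeX derivation, as a sketch that ignores the tail-clip. Here \mathbf{g} is the lattice generator, g_\ell is the map from an input to the layer-ℓ post-ReLU activations, and \hat\mu_\ell is shorthand introduced here for the N-point lattice average.

```latex
\begin{align*}
u_k &= \operatorname{frac}(k\,\mathbf{g} + U), \quad U \sim \mathrm{Unif}[0,1)^n
  \;\Longrightarrow\; u_k \sim \mathrm{Unif}[0,1)^n \text{ for each fixed } k, \\
x_k &= \Phi^{-1}(u_k) \sim \mathcal{N}(0, I), \\
\mathbb{E}_U\big[\hat\mu_\ell\big]
  &= \frac{1}{N}\sum_{k} \mathbb{E}_U\big[g_\ell(x_k)\big]
   = \frac{1}{N}\sum_{k} \int_{[0,1)^n} g_\ell\big(\Phi^{-1}(u)\big)\,du
   = \mathbb{E}_{X\sim\mathcal{N}(0,I)}\big[g_\ell(X)\big].
\end{align*}
```

The first line is the only place the shift is used: translation modulo 1 preserves the uniform measure on the unit cube, whatever the fixed offset frac(k·g) is.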

Result (public mini split, n=256, L=32, B=2.72e11): adjusted final-layer score ≈ 4.10e-7, 0/100 failures, C/B ≈ 0.42. The whole gain over equal-FLOP MC comes from low-discrepancy sampling + an exact first layer — no fitted surrogate, so no estimation-bias or distribution-shift risk.

The part we think is the real contribution: where the frontier is

We tested the “obvious” upgrades at the grading shape and measured what each does to the final-layer MSE at fixed C/B. Almost nothing helps:

  • Linearization / layer-1 control variates (subtract a known-mean linear term built from the expected-gain Jacobian, or from the exact layer-1 law): ≈ ×1.0–1.2.
  • Block-covariance gains for the CV (accurate σ at depth): still ≈ ×1.0.
  • Scrambled Sobol’ vs the Kronecker lattice: per-row tied (MSE·N ≈ equal).
  • Korobov lattice: much worse (rank-1 quality is violently N-sensitive).
  • A learned bias-corrector on the deterministic block-covariance estimate (which runs at the 0.1 compute floor), trained on synthetic MLPs and evaluated leave-MLPs-out: cuts block-cov MSE by only ≈ 2.3×, far short of the ~17× it would need to beat the sampler at the floor.

Why: RQMC and exact-mean control variates are substitutes — both remove the smooth, low-effective-dimension part of Var[ReLU(z⁽ᴸ⁾)]. Once the lattice has taken it, the only exactly-known means available (the input X, the exact layer-1 law) are 31–32 ReLU layers upstream and decorrelate badly from the output; what remains is genuinely high-dimensional “ReLU-kink” variance that no exact-mean linear/quadratic CV reaches. And the learned corrector’s final-layer features cannot recover the whole-network higher cumulants that drive the block-cov error.
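
As a yardstick for the "decorrelate badly" point, the textbook i.i.d. result says a single linear control variate with correlation ρ to the target cuts variance by 1/(1−ρ²). The sketch below is a generic one-layer toy with hypothetical function names, not the control variates tested above, and the i.i.d. formula is only a rough guide to what happens under RQMC.

```python
import numpy as np


def cv_estimate(y, c, c_mean, beta):
    """Estimate E[y] as mean(y) - beta * (mean(c) - c_mean).

    With beta fixed in advance (for example from a pilot run) this stays
    unbiased; fitting beta on the same sample adds a small O(1/N) bias.
    """
    return y.mean() - beta * (c.mean() - c_mean)


def optimal_beta(y, c):
    """Variance-minimizing coefficient Cov(y, c) / Var(c), from samples."""
    cov = np.cov(y, c)
    return cov[0, 1] / cov[1, 1]


def cv_variance_ratio(rho):
    """Var(plain mean) / Var(CV mean) at the optimal beta, i.i.d. samples."""
    return 1.0 / (1.0 - rho ** 2)


# Toy example: y = ReLU(z), with control c = z whose mean is known to be 0.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
y = np.maximum(z, 0.0)
beta = optimal_beta(y, z)                 # close to 0.5 for this toy
rho = np.corrcoef(y, z)[0, 1]
print(cv_estimate(y, z, 0.0, beta), cv_variance_ratio(rho))
```

One layer deep the toy control is strongly correlated with the target (ρ is about 0.86 analytically). Under the same formula, a gain of ×1.2 corresponds to |ρ| ≈ 0.41 and ×1.0 to ρ ≈ 0.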

So our reading is that, for the post-ReLU mean of a deep random MLP, a randomly-shifted lattice with an exact first layer is at the practical accuracy-per-FLOP frontier among unbiased methods — the residual is irreducible kink variance, not a modelling gap a control variate or analytic corrector can close. Beating it materially seems to require a biased learned corrector, which trades away the unbiasedness guarantee.

Happy to share the unbiasedness derivation, the FLOP accounting against flopscope 0.8, or the negative-result experiments in more detail. Curious whether others found a control variate or lattice that genuinely beats plain RQMC at depth — we couldn’t.