Possible evaluator change: concurrency now 4 (was 1) — Qwen3/Neuron vLLM outputs corrupted (finish_reason=length, missing <uci_move>)

Hi AIcrowd team — thanks for running the Global Chess Challenge.

I’m seeing what looks like a recent regression for Qwen3 (neuron.model_type=qwen3) on the Neuron/vLLM backend: at higher evaluator concurrency the model often produces garbled output, hits max tokens, and fails to reliably emit <uci_move>...</uci_move>, which causes immediate resignations and extremely high ACPL.

What changed (evidence from evaluation-state logs)

Looking at the config_snapshot field from GET /submissions/<id>/evaluation-state:

  • Submission 305873 (Dec 23): config_snapshot.concurrency = 1

    • finish_reason=stop (100%), reasonable completion lengths, <uci_move> present reliably
    • Overall ACPL ≈ 119
  • Recent submissions (Dec 24) now show config_snapshot.concurrency = 4

    • Example 305972: config_snapshot.concurrency = 4
    • finish_reason=length ~100%, completion tokens always hit the cap, <uci_move> rate ~25%
    • Outputs often look corrupted/garbled (binary-ish text), leading to resignations and ACPL ≈ 864+
    • This submission used the same prompt settings as 305873 (vllm.max_model_len=512, dtype=bfloat16, enforce_eager=true, max_tokens=64).

I also tried explicitly requesting --num-games 1 / --concurrency 1 in aicrowd submit-model, but the resulting evaluation logs still show concurrency=4 (and num_games=4), suggesting these are being overridden by the evaluator (e.g. submission 305974).
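
For transparency, this is roughly how I'm reading those fields. A minimal sketch, assuming you've already fetched GET /submissions/<id>/evaluation-state and saved the JSON body as evaluation_state.json (the filename is arbitrary); the second filter scans the whole document for finish_reason rather than assuming any particular log layout:

    # Dump the evaluator's config snapshot (concurrency, and num_games if present):
    jq '.config_snapshot' evaluation_state.json

    # Tally finish_reason values wherever they appear in the response:
    jq '[.. | .finish_reason? // empty] | group_by(.) | map({(.[0]): length}) | add' evaluation_state.json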

Questions

  1. Did the evaluator concurrency change recently from 1 → 4 for this challenge?
  2. If so, is there a recommended configuration (vLLM/Neuron flags or supported submission fields) to keep Qwen3 stable at concurrency=4?
  3. Is this a known Neuron/vLLM issue/regression for Qwen3 under concurrent request load?

I’m happy to provide additional req/resp snippets (showing finish_reason=length + corrupted outputs) if that helps debugging. If this should be handled privately instead of on the forum, let me know and I can share details via DM/support.
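
For reference, I'm double-checking the <uci_move> rate with nothing fancier than this over the saved response bodies (the responses/ directory and one-completion-per-file layout are just my local setup, not something the evaluator produces):

    # How many saved completions contain the expected move tag at all:
    grep -l '<uci_move>' responses/*.txt | wc -l
    # Total number of saved completions, for the denominator:
    ls responses/*.txt | wc -l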

Thanks again for the challenge and for any guidance here.

3 Likes

+1, I've been running into issues since this morning: requests are timing out even for the base Qwen 3 model (it was working fine earlier).

2 Likes

Even Qwen3 4B times out 100% of the time now. Previously I was running the 8B without any issues.

1 Like

I got unreadable outputs when I submitted the model. Is this the same issue you are experiencing?
Locally:

  • ran local_evaluation.py (vLLM) → worked fine
  • ran my test notebook (Transformers) → worked fine

Submission

[screenshot not reproduced]

Local

[screenshot not reproduced]

Config

Submit the model

aicrowd submit-model \
  --challenge "$CHALLENGE" \
  --hf-repo "$HF_REPO" \
  --hf-repo-tag "$HF_REPO_TAG" \
  --prompt_template_path "$PROMPT_TEMPLATE" \
  --vllm.dtype bfloat16 \
  --vllm-inference.max-tokens 128 \
  --vllm.max-model-len 1024

Yep, same issue here. Local runs look fine, but some Qwen3 submissions on the Neuron/vLLM evaluator produce garbled/unreadable output, often hit finish_reason=length, and then miss <uci_move>. If you are comfortable sharing your submission ID, it may help the organizers correlate the failing runs with the evaluator-side change.

Hello @artist, yes, you're noticing the same underlying issue.

We did recently increase the evaluator concurrency to 4, and we made that change only after validating that our baseline (also Qwen3) was stable at concurrency=4 during testing.

We haven't changed the evaluation setup since those baseline tests, but we are now seeing similar timeout behavior even with the baseline under the current setup. We're actively working with the organizers to identify and fix the root cause.

In the meantime, you can disable concurrency by running with a single game: pass --num-games 1. This setting should now be respected by the evaluator and will effectively run with concurrency=1. If your model supports it, we suggest trying out other concurrency values to speed up the evaluation. We also set the vLLM --max-num-seqs parameter to the same value.
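
For example, adapting the command shared earlier in this thread (substitute your own repo, tag, and prompt template; the vLLM flags shown are just the ones from that post, not required values):

    aicrowd submit-model \
      --challenge "$CHALLENGE" \
      --hf-repo "$HF_REPO" \
      --hf-repo-tag "$HF_REPO_TAG" \
      --prompt_template_path "$PROMPT_TEMPLATE" \
      --vllm.dtype bfloat16 \
      --vllm-inference.max-tokens 128 \
      --vllm.max-model-len 1024 \
      --num-games 1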

Please try submitting again with --num-games 1 and let us know whether the issue persists after that.

1 Like

I tried --num-games 1 and --max-num-seqs 1, and the outputs from Qwen are now readable.
Thank you.

I noticed this issue when running a fine-tuned Qwen3-4B-Instruct model, but I managed to fix it with the parameters mentioned above. Is this also an issue for Llama models?