When you run aicrowd submit-model, the platform spins up a vLLM server for your model. You can pass a handful of --vllm.* flags to control things like max context length, dtype, batching limits, LoRA settings, and a few inference-time parameters.
But there’s one flag you must get right for Neuron hardware:
--neuron.model-type <model-type>
Why this matters (Neuron compilation in one paragraph)
AWS Inferentia/Trainium (Neuron) doesn’t run your PyTorch model “as-is” the way a typical GPU setup might. The model needs to be compiled into a Neuron-compatible artifact before it can run on the accelerator.
Because the compilation path is model-architecture-specific, the submission system needs to know which backend/architecture you’re using. Hence --neuron.model-type.
If you don’t set --neuron.model-type, the submission will default to qwen3.
Supported model types (backends)
The NxD Inference model hub currently supports architectures including: Llama (text), Llama (multimodal), Llama4, Mixtral, DBRX, Qwen2.5, Qwen3, and FLUX.1 (beta).
(The exact string is typically the lowercase form of the architecture name in the list above. If you're unsure, match it to your model family, e.g. Mixtral → mixtral, Qwen3 → qwen3, because picking the wrong type can lead to compile failures or incorrect behavior.)
Supported vLLM server flags
aicrowd submit-model currently supports these vLLM arguments (a short example combining a few of them follows the list):
--vllm.max-model-len
--vllm.dtype
--vllm.kv-cache-dtype
--vllm.quantization
--vllm.load-format
--vllm.rope-theta
--vllm.rope-scaling
--vllm.max-num-batched-tokens
--vllm.max-num-seqs
--vllm.enforce-eager true
--vllm.enable-lora true
--vllm.lora-dtype
--vllm.lora-extra-vocab-size
--vllm.enable-prefix-caching true
--vllm-env.allow-long-max-model-len
# inference-time parameters
--vllm-inference.max-tokens
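For instance, if your model ships LoRA adapters, you might combine the LoRA-related flags like this. The repo and tag are placeholders, and the values shown are illustrative rather than recommendations:

aicrowd submit-model \
  --hf-repo <repo> \
  --hf-repo-tag <branch/tag> \
  --neuron.model-type llama \
  --vllm.enable-lora true \
  --vllm.lora-dtype bfloat16 \
  --vllm.lora-extra-vocab-size 256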
Example: complete submission command
At minimum, set your repo/tag and the Neuron model type:
aicrowd submit-model \
  --hf-repo <repo> \
  --hf-repo-tag <branch/tag> \
  --neuron.model-type llama
Or, for DBRX:
aicrowd submit-model \
  --hf-repo <repo> \
  --hf-repo-tag <branch/tag> \
  --neuron.model-type dbrx
If you need tighter control over serving behavior, add the relevant --vllm.* flags to the same command, for example:
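As a sketch (again, the repo/tag are placeholders and the values are illustrative, not defaults), a submission that pins the context length, dtype, and batching behavior might look like:

aicrowd submit-model \
  --hf-repo <repo> \
  --hf-repo-tag <branch/tag> \
  --neuron.model-type qwen3 \
  --vllm.max-model-len 8192 \
  --vllm.dtype bfloat16 \
  --vllm.max-num-seqs 8 \
  --vllm-inference.max-tokens 512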