Issue
My submissions are getting stuck at “Compiling model for Neuron - Hold tight, this will just take a moment” for 1.5+ hours without completing.
Timeline
2 days ago: Submissions worked fine.
Today: Multiple submissions stuck at the Neuron compilation step.
What Changed
I noticed AIcrowd added new Neuron documentation on Dec 18 with the --neuron.model-type flag. My earlier successful submissions didn’t use any Neuron flags.
What I’ve Tried
Recent submissions with --neuron.model-type llama - stuck
Submissions with --vllm.max-model-len 2048 - also stuck
Submission WITHOUT neuron flags (testing if old approach still works) - stuck
Model Details
Model: Llama-based
Architecture: meta-llama/Llama-3.2-8B-Instruct base
I have the same issue with Llama 3.1 8B. I can confirm that it compiles and evaluates on an AWS trn1.2xlarge instance but gets stuck here. Perhaps it’s due to a config mismatch?
It's frustrating tbh. I wasted 10+ hrs of GPU credits and then it refuses to evaluate, without any apparent reason. It was working before Neuron was introduced. And no one from the team is helping; I even emailed one of them.
@artist @whoamananand it seems like the models are hitting memory limits. Can you share the config params you used to compile the model on trn1.2xlarge so that we can investigate this further?
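For anyone trying to sanity-check the memory theory before posting their config, here is a rough back-of-the-envelope sketch in Python. It only estimates weight and KV-cache size. It is not how the evaluator or the Neuron compiler accounts for memory, and it ignores compiler workspace and host RAM used during compilation. The shape values (32 layers, 8 KV heads, head dim 128) are taken from the public Llama 3.1 8B config, the 8e9 parameter count is rounded, and the 32 GiB device budget is an assumption about trn1.2xlarge that should be checked against the AWS spec sheet. The function names are hypothetical helpers, not part of any library.

```python
# Rough memory estimate for serving a Llama-style model.
# All constants below are assumptions for illustration, not evaluator settings.

GIB = 2**30


def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of the model weights in GiB (2 bytes per param for bf16/fp16)."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> float:
    """Size of the KV cache in GiB for `batch_size` sequences of `seq_len` tokens."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / GIB


if __name__ == "__main__":
    # Assumed Llama 3.1 8B shapes; max model length 2048 as in the flag tried above.
    w = weights_gib(8e9)
    kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=2048)
    device_budget_gib = 32  # assumption: trn1.2xlarge accelerator memory
    print(f"weights:  {w:.2f} GiB")
    print(f"KV cache: {kv:.2f} GiB per sequence")
    print(f"headroom: {device_budget_gib - w - kv:.2f} GiB (before any compiler overhead)")
```

If the numbers leave plenty of headroom on paper but compilation still hangs, that would point more toward a config mismatch (batch size, sequence length, or tensor-parallel degree differing between your local compile and the evaluator) than toward the raw model size.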