I’m using a custom architecture based on the Qwen3 family, and Neuron fails to compile it.
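For reference, this is roughly how I’m launching it (a minimal repro sketch; the model path is a placeholder for my custom Qwen3-based checkpoint, and the values match what shows up in the log below):

```python
# Minimal repro sketch. The path is a placeholder; tensor_parallel_size=2 and
# max_model_len=4096 match the TP degree and bucket shape in the log below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/my-qwen3-variant",  # custom architecture derived from Qwen3
    tensor_parallel_size=2,
    max_model_len=4096,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```

The failure happens during engine-core startup, before any request is served: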
......
(EngineCore_DP0 pid=103) WARNING:Neuron:TP degree (2) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
[the warning above is repeated 17 times]
(EngineCore_DP0 pid=103) INFO:Neuron:Finished loading module context_encoding_model in 0.017525434494018555 seconds
(EngineCore_DP0 pid=103) INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 4096])
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed/parallel_layers/layers.py:532: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
(EngineCore_DP0 pid=103) with torch.cuda.amp.autocast(enabled=False):
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/modules/generation/sampling.py:374: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
(EngineCore_DP0 pid=103) probs_cumsum = cumsum(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/modules/generation/sampling.py:327: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
(EngineCore_DP0 pid=103) probs_cumsum = cumsum(tensor_in=probs, dim=dim, on_cpu=self.neuron_config.on_cpu)
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=1, shape=torch.Size([1, 4096]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=6, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) INFO:Neuron:Finished generating HLO for context_encoding_model in 0.9926915168762207 seconds, input example shape = torch.Size([1, 4096])
(EngineCore_DP0 pid=103) INFO:Neuron:Generating 1 hlos for key: token_generation_model
(EngineCore_DP0 pid=103) INFO:Neuron:Minimal metadata will be added to HLO
(EngineCore_DP0 pid=103) INFO:Neuron:Started loading module token_generation_model
(EngineCore_DP0 pid=103) WARNING:Neuron:TP degree (2) and KV heads (8) are not divisible. Overriding attention sharding strategy to GQA.CONVERT_TO_MHA!
[the warning above is repeated 17 times]
(EngineCore_DP0 pid=103) INFO:Neuron:Finished loading module token_generation_model in 0.016804933547973633 seconds
(EngineCore_DP0 pid=103) INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed/parallel_layers/layers.py:532: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
(EngineCore_DP0 pid=103) with torch.cuda.amp.autocast(enabled=False):
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/modules/generation/sampling.py:374: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
(EngineCore_DP0 pid=103) probs_cumsum = cumsum(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/modules/generation/sampling.py:327: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
(EngineCore_DP0 pid=103) probs_cumsum = cumsum(tensor_in=probs, dim=dim, on_cpu=self.neuron_config.on_cpu)
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:470: UserWarning: Received an input tensor that was unused or used in a non-static way when traced so the tensor will be ignored. (index=6, shape=torch.Size([1]), dtype=torch.int32). The non-static usage could happen when the traced function expects the input tensor's shape to change (i.e., using the shape to do index slicing), which is not allowed by inference trace expecting static input shapes.
(EngineCore_DP0 pid=103) warnings.warn(
(EngineCore_DP0 pid=103) INFO:Neuron:Finished generating HLO for token_generation_model in 0.933382511138916 seconds, input example shape = torch.Size([1, 1])
(EngineCore_DP0 pid=103) INFO:Neuron:Generated all HLOs in 2.0631725788116455 seconds
(EngineCore_DP0 pid=103) INFO:Neuron:Starting compilation for the priority HLO
(EngineCore_DP0 pid=103) INFO:Neuron:'token_generation_model' is the priority model with bucket rank 0
(EngineCore_DP0 pid=103) /opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py:284: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. Refer to documentation of the Python subprocess module for details.
(EngineCore_DP0 pid=103) warnings.warn(SyntaxWarning(
.
(EngineCore_DP0 pid=103) 2026-01-08 17:25:54.000229: 103 [ERROR]: Failed compilation with ['neuronx-cc', 'compile', '--framework=XLA', '/tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_aa4a80fec8aa33a8959e+97c2cc02.hlo_module.pb', '--output', '/tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_aa4a80fec8aa33a8959e+97c2cc02.neff', '--target=trn1', '--auto-cast=none', '--model-type=transformer', '--tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=1 --vectorize-strided-dma ', '--lnc=1', '-O2', '--internal-hlo2tensorizer-options=--verify-hlo=true', '--verbose=35', '--logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt', '--enable-internal-neff-wrapper']:
(EngineCore_DP0 pid=103) 2026-01-08T17:25:54Z invalid input!
(EngineCore_DP0 pid=103) [libneuronxla 2.2.14584.0+06ac23d1]
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self._init_executor()
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self.collective_rpc("load_model")
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] return func(*args, **kwargs)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/vllm-neuron/vllm_neuron/worker/neuron_worker.py", line 86, in load_model
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self.model_runner.load_model()
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/vllm-neuron/vllm_neuron/worker/neuronx_distributed_model_runner.py", line 221, in load_model
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self.model = get_neuron_model(
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/vllm-neuron/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 714, in get_neuron_model
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] model.load_weights(model_name_or_path=model_config.model,
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/vllm-neuron/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 394, in load_weights
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self._compile_and_load_model(model_name_or_path, neuronx_model_cls,
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/vllm-neuron/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 240, in _compile_and_load_model
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] self.model.compile(compiled_path)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/models/application_base.py", line 302, in compile
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] traced_model = self.get_builder(debug).trace(
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 651, in trace
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] neff_bytes, wrapped_neff_bytes = neuron_xla_wlo_compile(module_bytes, compiler_args,
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py", line 297, in neuron_xla_wlo_compile
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] neuron_xla_compile_impl(
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py", line 390, in neuron_xla_compile_impl
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] return compile_cache_entry(output, entry, execution_mode,
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py", line 232, in compile_cache_entry
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] raise (e)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py", line 204, in compile_cache_entry
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ret = call_neuron_compiler(
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] File "/opt/conda/lib/python3.12/site-packages/libneuronxla/neuron_cc_wrapper.py", line 120, in call_neuron_compiler
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] raise subprocess.CalledProcessError(res.returncode, cmd, stderr=error_with_cmd)
(EngineCore_DP0 pid=103) ERROR 01-08 17:25:54 [v1/engine/core.py:708] subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '/tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_aa4a80fec8aa33a8959e+97c2cc02.hlo_module.pb', '--output', '/tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_aa4a80fec8aa33a8959e+97c2cc02.neff', '--target=trn1', '--auto-cast=none', '--model-type=transformer', '--tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=1 --vectorize-strided-dma ', '--lnc=1', '-O2', '--internal-hlo2tensorizer-options=--verify-hlo=true', '--verbose=35', '--logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt', '--enable-internal-neff-wrapper']' returned non-zero exit status 70.
(EngineCore_DP0 pid=103) Process EngineCore_DP0:
(EngineCore_DP0 pid=103) Traceback (most recent call last):
[the EngineCore process then re-raises the identical traceback shown above, ending in the same subprocess.CalledProcessError: neuronx-cc returned non-zero exit status 70]
(APIServer pid=61) Traceback (most recent call last):
(APIServer pid=61) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=61) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 2042, in <module>
(APIServer pid=61) uvloop.run(run_server(args))
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=61) return __asyncio.run(
(APIServer pid=61) ^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=61) return runner.run(main)
(APIServer pid=61) ^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=61) return self._loop.run_until_complete(task)
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=61) return await main
(APIServer pid=61) ^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1973, in run_server
(APIServer pid=61) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1991, in run_server_worker
(APIServer pid=61) async with build_async_engine_client(
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=61) return await anext(self.gen)
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client
(APIServer pid=61) async with build_async_engine_client_from_engine_args(
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=61) return await anext(self.gen)
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 314, in build_async_engine_client_from_engine_args
(APIServer pid=61) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=61) return fn(*args, **kwargs)
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=61) return cls(
(APIServer pid=61) ^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=61) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=61) return AsyncMPClient(*client_args)
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=61) super().__init__(
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=61) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=61) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=61) File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=61) next(self.gen)
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=61) wait_for_engine_startup(
(APIServer pid=61) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=61) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=61) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
The error message is very long, but it doesn't explain why compilation fails: the compiler itself reports only "invalid input!" and exit status 70. I should note that the same model works fine with vLLM locally.
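For anyone hitting something similar, the two things I plan to check next (a sketch; the logfile path comes from the failed command above, and the config key is the standard HF name, which I'm assuming my custom config also uses):

```python
import json

# 1) The failed neuronx-cc command writes a detailed log next to the HLO it
#    rejected; "invalid input!" on stderr says nothing, so the real reason
#    should be in here.
log_path = "/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt"
print(open(log_path).read())

# 2) Sanity-check the arithmetic behind the repeated warning. With
#    num_key_value_heads=8 and TP=2, 8 % 2 == 0, so the "not divisible"
#    wording seems odd to me; verifying the value Neuron actually reads.
cfg = json.load(open("/path/to/my-qwen3-variant/config.json"))  # placeholder path
print(cfg["num_key_value_heads"] % 2 == 0)
```

If the compiler log points at a specific HLO op, that would at least narrow down which part of the custom architecture Neuron can't handle.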