Issues Using LLM and LoRA Adapter on NVIDIA T4

I am trying to tune a model using LoRA with PEFT, but I'm running into issues when using vllm and LoRA together.

Environment:

  • vllm==0.4.2 (or 0.4.3)
  • torch==2.3.0 and higher (compatible with the above versions of vllm)
  • CUDA 12.1.1 (original Docker image: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04)

Python Code:

self.llm = vllm.LLM(
    self.model_name,
    tensor_parallel_size=VLLM_TENSOR_PARALLEL_SIZE, 
    gpu_memory_utilization=VLLM_GPU_MEMORY_UTILIZATION, 
    trust_remote_code=True,
    dtype="half",  # note: bfloat16 is not supported on NVIDIA-T4 GPUs
    enforce_eager=True,
    enable_lora=True
)

Error Message:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

From my testing in Google Colab, this error does not occur with L4 GPUs but does occur with T4 GPUs. Furthermore, the error does not arise when vllm is used without the LoRA adapter (enable_lora=False); it only occurs when enable_lora=True.

Although I understand this might be beyond the scope of what your team can handle, it seems quite likely that the PEFT model cannot be used with these settings. I would appreciate any ideas on possible solutions. Additionally, if the issue cannot be resolved, please consider extending the processing time limits, given the difficulty of running vllm with these configurations.

Thank you for your assistance.

Hello,

We received exactly the same report from other participants. Here is some information that may be helpful.

  1. Reason. It seems that (although I could not find official documentation on this) using LoRA with vllm requires a GPU with compute capability of at least 8.0, which the NVIDIA T4 does not meet (it has compute capability 7.5). This point is mentioned in this issue of vllm. I also cloned your solution to our T4 machine and got the same error. You can confirm your GPU's capability with the small check after this list.

  2. Potential Solution. As a workaround (although I have not tried it myself), it should not be difficult to merge the LoRA weights into the base model weights and get a new model (without LoRA, but with the same parameters as base model + LoRA); vllm should be able to serve that. This may be helpful, and a sketch of the merging is included after this list. In addition, the merging should be done offline, because doing it online would count towards your time limit.
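
For reference, a quick way to check the compute capability of the GPU you are running on (plain PyTorch, nothing competition-specific):

import torch

# (major, minor) compute capability of the current CUDA device.
# A T4 reports (7, 5); an L4 reports (8, 9), which clears the 8.0 requirement.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")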
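
For the merging itself, something along the following lines should work. This is only a minimal sketch, assuming the adapter was trained with PEFT; the model and adapter paths are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("your-base-model", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "your-lora-adapter")

# Fold the LoRA deltas into the base weights and drop the adapter modules.
merged = model.merge_and_unload()

# Save a plain checkpoint that vllm can serve with enable_lora=False.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("your-base-model").save_pretrained("merged-model")

The merged checkpoint is a regular model, so you can then point vllm at "merged-model" without enable_lora at all.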

Hope that helps.

Thank you for your response!
It does indeed seem challenging to use vllm + LoRA on a T4.

I had been looking into how to merge weights but hadn’t found any information, so your help is much appreciated. Thank you very much. I’ll give it a try.

@tomoya_miyazawa @yilun_jin Hello, thank you for your suggestions. I hit the same error when using LoRA + vllm. I wonder, is vllm a mandatory option? And if I merge the weights every time I submit a new model, will the git LFS storage run out?

Hello,

  1. Is VLLM a must? No. However, VLLM accelerates inference greatly, and without it you will have a harder time meeting the time limit.
  2. Managing storage. We recommend that submissions contain only the necessary checkpoints, even though git lfs may support the full repo, because pulling your checkpoints from git lfs takes time that counts towards your total time limit. Merging the LoRA and base model weights also counts towards the time limit, which is why we recommend merging offline.

@yilun_jin I got LoRARequest working on my local machine, but when I push it to the online environment it errors there as well. Did you get it working in the online environment?

https://docs.vllm.ai/en/v0.4.0/models/lora.html
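
For reference, this is roughly the pattern I am following from the docs above (the adapter name and path are placeholders):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="your-base-model", enable_lora=True, dtype="half", enforce_eager=True)

# The LoRA adapter is attached per request; the integer is an arbitrary unique adapter id.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "path/to/lora-adapter"),
)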

I don’t know for sure, but I think it depends heavily on what your local machine is. If it is a powerful GPU like a 3090/4090, it is possible. Our hypothesis is that our T4 GPU is not powerful enough to support LoRA in vllm, so I think you will have to find another way around it.