I wonder if prediction timeouts are common

My team has submitted several times and failed without exception. The error message says “Timeout while making a prediction”, but I’m pretty sure the predictions ran very quickly on my own computer. I also tried quantizing the model to 4 bits, yet I still got a timeout error. No offense intended, but I would like to know the specific reason for the prediction timeouts.


What’s more, I understand that failed submissions increase the burden on the platform, but I have to say this feels unfair. We made many failed submissions because a package version could not be found, and we had to modify requirements.txt over and over again. To be honest, every version of our requirements.txt installs the packages perfectly on our own devices. It seems we have no other way to test our solutions before submission than using the system for debugging over and over again. I find this depressing :broken_heart:.


Hi @chicken_li,

We understand your concerns.

Here are some tips that may help:

  1. As a reference, the baseline Vicuna-7B takes roughly 3 seconds to generate 100 tokens (as implemented in the baseline). We allow 15 seconds per prediction, which is about five times that. If you hit a timeout, you should probably reduce the number of generated tokens and measure locally how long a single prediction takes (see the timing sketch after this list).
  2. Also consider the specific hardware we use for evaluation. If your local machines are A100s, you should factor in the performance gap between the T4 used by the system and the A100.
  3. Try opening a clean environment and building your solution locally with our Dockerfile to see what happens. During our testing, every submission that built successfully locally also went through the system.
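
If it helps, here is a minimal timing sketch for checking per-prediction latency locally. It assumes a Hugging Face transformers model; the checkpoint id, prompt, and token budget are placeholders rather than part of our pipeline, and device_map="auto" assumes accelerate is installed.

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id; substitute whatever model your submission loads.
MODEL_ID = "lmsys/vicuna-7b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "An example question from your local validation set."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)  # keep this small
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s")  # compare to the 15 s budget
```

Keep in mind that your local GPU may be much faster than the evaluation hardware, so leave yourself a comfortable margin.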

If you have further questions, you can reply in this thread and I will see whether I can answer them.

That said, I cannot agree with your comment that this is ‘unfair’: all participants use the same submission system, after all.

Please describe your problems in more detail, and we will see if we can help.

----------------UPDATE-------------------------
I took a look at your requirements.txt and your solution, and here are some comments that may be useful.

  1. Your requirements.txt seems too cumbersome. As far as I can tell, your submission only uses transformers, torch, and a few other packages, yet you listed a long set of requirements. In my opinion, you should keep the requirements to the minimum (which is what we did in the baselines).

  2. Your generation_config.max_new_tokens = 1024 is unreasonably high. From our experience, Vicuna-7B generates 100 new tokens in about 3 seconds, so 1024 tokens would take roughly 30 seconds, clearly beyond the 15-second-per-sample limit (see the snippet after this list).

  3. It seems that you use zero-shot CoT prompting. In our experience that almost surely leads to timeouts, since you call the LLM twice per question and one of those calls generates a long reasoning path.
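
As a sketch of point 2 above, assuming the standard transformers GenerationConfig API; the exact cap below is only an illustration based on the ~3 s per 100 tokens figure, not a required value.

```python
from transformers import GenerationConfig

# Roughly 100 new tokens ≈ 3 s on the evaluation hardware, so a cap in the
# low hundreds keeps one prediction well under the 15 s limit.
generation_config = GenerationConfig(
    max_new_tokens=100,  # instead of 1024
    do_sample=False,     # greedy decoding; sampling settings are up to you
)

# Typical usage (model and inputs defined elsewhere in your solution):
# outputs = model.generate(**inputs, generation_config=generation_config)
```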

requirements.txt: We followed the instructions and used pip freeze > requirements.txt. It’s painful to check availability one by one :broken_heart:.

In my opinion, you should keep requirements.txt to the minimum. That reduces the risk of the build failing on a package you do not actually need, and it shortens the time needed to build your solution.

I also think that, given what your submission does, there is no way it can meet the time limit.

I admit that using zero-shot CoT was unwise, and your advice is helpful. Minimizing the requirements.txt makes sense, but I notice that docs/hardware-and-system-config.md recommends using pip3 freeze >> requirements.txt to export all the packages, which can be confusing. At the time, we also worried that not pinning versions would cause compatibility issues :worried:.

We will revise that doc per your suggestion.

Yet the doc indeed says “if your software runtime is simple, perform: pip freeze …”, while your runtime does not really seem simple… For example, why would you need both the CUDA 11 and CUDA 12 runtimes?

Specifying versions is necessary. I think the following approach works well:

  1. Identify the packages you actually need, such as transformers, torch, accelerate, and so on.
  2. Pin their versions to match what you are using locally.
  3. Put only those packages in requirements.txt, as illustrated below.
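
For illustration only, a trimmed requirements.txt following those three steps; the version pins below are made-up examples and should be replaced with whatever you actually have installed locally.

```
# minimal, pinned requirements (illustrative versions only)
torch==2.1.2
transformers==4.36.2
accelerate==0.25.0
```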

@chicken_li: For Docker-build-related errors, it is much easier to build and test the Docker images locally using the included docker_run.sh file.

If a prediction is skipped before the time limit, e.g. via try…except…, will the event team score that sample as 0 or as -1?
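
For clarity, this is the kind of guard I mean; a rough sketch only, where predict_one and FALLBACK_ANSWER are hypothetical names, and how the platform scores such a skipped sample is exactly what I am asking.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

PER_SAMPLE_TIMEOUT = 14.0  # stay a bit under the platform's 15 s limit
FALLBACK_ANSWER = ""       # hypothetical placeholder returned for skipped samples

def guarded_predict(predict_one, question):
    """Run predict_one(question), falling back if it exceeds the time budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(predict_one, question)
    try:
        return future.result(timeout=PER_SAMPLE_TIMEOUT)
    except TimeoutError:
        # The sample is "skipped"; whether this counts as 0 or -1 is the question.
        return FALLBACK_ANSWER
    finally:
        # Do not block on a stuck worker; note this cannot actually interrupt
        # a generate() call that is already running.
        pool.shutdown(wait=False)
```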