Hi, I am still seeing failures as before: the run fails at the ranker evaluation step and then no logs are shown. Is this happening to anyone else? Maybe it is an error that only occurs now for certain evaluations?
For example, see #204968, which I believe was resubmitted from the host side after the fix.
We’re not providing logs apart from the validation runs, since those could be used to leak data.
The previous issues we had were with fetching valid logs, and that has been fixed.
Your run probably failed because some value that is not present in the test dataset is hardcoded. If you still want the logs, please make a new submission and I can get them for you separately.
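For illustration, the kind of pattern I mean looks roughly like this (the names below are made up, not anyone’s actual code):

```python
# Made-up illustration of the failure mode: a lookup built from the
# validation data that is missing keys which only appear in the test set.
scores = {"val_query_123": 0.7}  # hardcoded from local/validation data

def rank(query_id: str) -> float:
    # Runs fine locally, but raises KeyError on the evaluation server
    # when query_id is an ID that only exists in the hidden test set.
    return scores[query_id]
```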
Thank you, Dipam. Sorry for the confusion. I am resubmitting and will let you know, but in any case it is strange, because the new submission is essentially an old one (which ran successfully and, as far as I can tell, does not hardcode anything) with different model weights.
It has failed again. @dipam, note that it does not even show logs for the validation parts; it does not show any logs at all. It would also be nice to know what is going on with the ranker.
Hi everyone,
I also experienced the same issues as @felipe_b and @rein20:
When testing locally with local_evaluation.py everything seemed fine, but the submission fails at inference and no logs are available. @dipam, please have a look in case there is indeed some issue with the ranker.
Thank you!
I figured it out without looking at the logs; in my case it was a CUDA OOM error.
Decreasing GPU memory usage led to a successful submission. (Note that the server has a T4 with 16 GB, which may be different from other competitions.)
Not being able to look at the logs cost me two submissions on the last day, though.
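For reference, the memory reduction was roughly along these lines (a rough sketch with placeholder names, not my exact submission code; it assumes a transformers ranker with a single-logit head):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "my-ranker-checkpoint"  # placeholder, not the actual checkpoint

# Half precision roughly halves the weight memory so the ranker fits on the T4 (16 GB).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16
).to("cuda").eval()

BATCH_SIZE = 8  # smaller batches keep peak activation memory down

@torch.no_grad()
def score(texts):
    """Score query-passage strings in small batches to stay within GPU memory."""
    out = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = tokenizer(texts[i:i + BATCH_SIZE], padding=True, truncation=True,
                          return_tensors="pt").to("cuda")
        out.extend(model(**batch).logits.squeeze(-1).float().tolist())
    return out
```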
Thank you. In my case it fails at the clariq ranker and does not provide logs from any of the steps. I have run very similar code with weights of the same size before successfully, so, all packages being equal, it should not be OOM.
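In case it helps anyone rule out OOM locally, something like the snippet below can be wrapped around the local run (run_my_ranker is a placeholder for whatever your local_evaluation.py entry point calls):

```python
import torch

def report_peak_gpu_memory(fn, *args, **kwargs):
    """Run fn once and print peak GPU memory, to compare against the server's 16 GB T4."""
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"peak GPU memory: {peak_gb:.2f} GiB of {total_gb:.2f} GiB")
    return result

# Example (placeholder function name):
# report_peak_gpu_memory(run_my_ranker, queries)
```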