- How do we select our final submissions? (in case we believe the private test set is different from the public test set)
- What are the time constraints on submission stages? (training, inference)
- It is stated that “Note: The scores used to compute the current leaderboard are tentative scores computed on 60% of the test datasets. The final scores, computed on the complete test datasets, will be released at the end of the competition.” So will the final standings be based on 40% or 100% of the test set? Is it correct that we currently see scores computed on 60% of the 1473 samples?
Hello team =) Could you please respond or tag a person to whom these questions should be addressed?
-
You do not need to select your final submission. All submissions are also evaluated on the private dataset. For the final leaderboard, the best score (on 100% of the test set) among all your submissions will be considered.
-
During an evaluation, only the inference part of the code is run. We do not run the training part. For the inference part, the current timeout is 15 minutes.
-
The final standings will be based on 100% of the test set. For now, the visible score is computed on 60% of the 1473 test samples (roughly 884 samples).
“For now, the visible score is computed on 60% of the 1473 test samples.”
Can you confirm that participants will only be scored on the hidden 40% of the test set?
Unfortunately, as it stands, the staff has confirmed multiple times that the final score will be computed on 100% of the test set, which includes the public 60%.
Funnily enough, the original poster of this thread seems to be doing exactly what you are describing.
I wish it were otherwise… One suggestion is to shuffle the test set and re-run our submissions.
Yeah, unfortunately the current rules encourage two things:
A) Fine-tuning to the Public LB.
B) Submitting as many models as possible with as much variance as you can get away with while maintaining a relatively high score, because ALL of your models are scored for the Private LB, so all you need is for one of your 200 submissions to get lucky. Each submission is basically a lottery ticket (see the sketch below).
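To make B) concrete, here is a minimal Monte Carlo sketch of the effect. All numbers are assumptions for illustration (a true skill of 0.80 and per-submission noise of 0.02 on the private split); the point is only that "best score among all submissions" scoring rewards volume by itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed numbers, purely for illustration: every submission has the same
# underlying quality, plus zero-mean noise on the private split.
true_skill = 0.80   # hypothetical "honest" score of each model
noise_std = 0.02    # hypothetical per-submission variance on the private set
n_trials = 10_000   # Monte Carlo repetitions

for n_submissions in (1, 10, 50, 200):
    # Draw n noisy private scores per trial and keep the best one,
    # mimicking "the best score among all your submissions".
    best = rng.normal(true_skill, noise_std,
                      size=(n_trials, n_submissions)).max(axis=1)
    print(f"{n_submissions:>3} submissions -> expected best private score: "
          f"{best.mean():.4f}")
```

Under these assumptions the expected best score climbs from about 0.80 for a single submission to roughly 0.85 for 200, with zero modelling improvement.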
Agree; however, I’d say A) can be split into two categories:
A1) Fine-tuning/overfitting to the public LB.
A2) Manually modifying predictions by probing the LB (see the sketch below).
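For context on what A2) can look like, here is a minimal sketch of the classic probing trick. It assumes, purely for illustration, a binary task scored with log loss (not necessarily this competition’s metric): submit one constant probability for every row, then invert the reported public score to recover the positive-class rate of the public split.

```python
import math

def positive_rate_from_logloss(ll_public: float, p_const: float) -> float:
    """Invert the public log loss of an all-constant submission.

    If every row is predicted as p_const, the binary log loss is
        LL = -(r * ln(p_const) + (1 - r) * ln(1 - p_const)),
    which is linear in the positive-class rate r, so r can be solved for.
    Pick p_const != 0.5, otherwise the loss does not depend on r.
    """
    lp, l1p = math.log(p_const), math.log(1.0 - p_const)
    return (ll_public + l1p) / (l1p - lp)

# Hypothetical numbers: a constant prediction of 0.3 came back with a
# public log loss of 0.62 -> the public split is ~31% positives.
print(positive_rate_from_logloss(0.62, 0.3))
```

A handful of probes like this can pin down test-set statistics that are then hard-coded into the predictions, which is why probing is generally considered against the spirit (and often the rules) of a competition.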
Wow. Should we go for all 3 options @branden_murray, @michael_bordeleau?
A2) seems to be the surest way to reach #1.
If the competition had gone on for another week without any clarification, I might have been tempted to join the dark side of the force.
However, A2) has now been rightly addressed in this post.