Do you trust your Leaderboard Score?

Hi all,

Do you trust your Leaderboard Score? This is a simple but fundamental question. Personally, I have not found a good correlation between my local CV score and the public leaderboard score. To be more precise, I see a good correlation for models that score in the high 0.6xx to low 0.7xx range; however, for my best-performing models (low 0.6xx to high 0.5xx on local CV) the correlation seems completely broken.

My gut feeling is to trust my local CV more and to expect a huge shake-up on the final leaderboard, but I’m interested in your experience. It may very well be that I simply haven’t found a good validation scheme yet.
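As a rough sanity check (all score pairs below are made-up toy numbers, not real submissions, and assume lower = better as with log loss), a rank correlation between CV and LB scores over past submissions makes the "broken correlation" concrete:

```python
# Hypothetical sketch: how well does local CV track the public LB
# across a submission history? Scores below are invented examples.

def spearman(xs, ys):
    """Spearman rank correlation (assumes no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

cv_scores = [0.712, 0.695, 0.661, 0.634, 0.612, 0.598]  # local CV (lower = better)
lb_scores = [0.718, 0.701, 0.655, 0.649, 0.662, 0.671]  # public LB, same submissions

print(f"Spearman rho = {spearman(cv_scores, lb_scores):.2f}")  # → 0.49
```

Note how in the toy numbers the two best CV scores (0.612, 0.598) get the two worst LB scores, which is exactly the pattern described above: decent agreement overall, broken at the top.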


Thanks for bringing that up. :ok_hand:

I feel the same way and it’s been bothering me for multiple weeks now.

I have developed clever features and modeling approaches that show a boost in local validation, only to perform worse on the public Leaderboard.

Strangely enough, a more basic model with only low-grade features returns a better score on the public Leaderboard. :confounded:

On the other hand, I see participants such as @Li-Der consistently improving his score in a pattern that makes me envious. He clearly found something that he has been optimizing slowly but surely. Can’t wait to hear more about his solution once this is over.


Back to your topic: I do trust that my models are good, but unfortunately I’m not as sure as you are that there will be a shake-up on the final leaderboard.


I suspect there’s something going on within the data itself. :thinking:
I tried to hint at this in my earlier post here.


Could we have been provided a mix of clock images coming from different sources?

  • Pen&Paper drawn, then scanned? :memo:
  • Electronically drawn? :computer_mouse:
  • Different clinical tests with slightly different test protocols? :newspaper:
  • Pre-drawn circle vs empty page
  • … dreaded Synthetic data? :desktop_computer: :skull:

If the train data is more heavily “contaminated” with this mix of sources than the test data, it could lead to what we are experiencing.
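One way to probe this mixed-sources hypothesis would be to compare the distribution of some extracted feature between train and test. A full adversarial-validation model (train a classifier to distinguish train from test rows) would be stronger, but even a crude standardized mean difference flags a shift. The sketch below runs on synthetic numbers; the feature, counts, and distributions are all assumptions for illustration:

```python
import random
import statistics

random.seed(0)
# Hypothetical toy data for one extracted feature (say, stroke count):
# train mixes two sources, test comes from only one of them.
train = [random.gauss(10, 2) for _ in range(500)] + \
        [random.gauss(15, 2) for _ in range(500)]
test = [random.gauss(10, 2) for _ in range(300)]

def smd(a, b):
    """Standardized mean difference between two samples."""
    pooled = statistics.pstdev(a + b)
    return abs(statistics.mean(a) - statistics.mean(b)) / pooled

# A large value (rule of thumb: > 0.25) hints that train and test
# are not drawn from the same distribution for this feature.
print(f"SMD train vs test: {smd(train, test):.2f}")
```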


Hi @michael_bordeleau thanks for your reply! :smile:

I completely agree with you. Shake-up or not, there will be a lot to learn from the winning solutions.

My main doubt is that the composition of the test set used for the public leaderboard does not reward well-rounded models. The pre_alzheimer class is the most difficult to predict, but probably the one with the highest impact from a social point of view. I’m indeed focusing on this aspect, yet models with better performance on this class seem to perform worse on the public leaderboard.

How many models are simply ignoring this class? If we look at the F1 scores at the top of the leaderboard, they are all below 0.5. This is the main reason I feel the current leaderboard scores will not be reflective of the final ones.
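For anyone who wants to check this on their own out-of-fold predictions, per-class F1 is easy to compute by hand. The labels and predictions below are invented to mimic a model that mostly ignores the minority class; the class names are the ones from this thread:

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class, computed from scratch."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the model predicts pre_alzheimer almost never.
y_true = ["normal"] * 80 + ["post_alzheimer"] * 15 + ["pre_alzheimer"] * 5
y_pred = ["normal"] * 90 + ["post_alzheimer"] * 9 + ["pre_alzheimer"] * 1

print(f1_per_class(y_true, y_pred, "pre_alzheimer"))  # → 0.333…
```

Here precision on the rare class is perfect (the single prediction is right) but recall is only 0.2, so F1 lands well below 0.5, which is the shape of the scores visible on the leaderboard.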


Interesting discussion!

I was working on a notebook that shares some thoughts similar to yours, @etnemelc. I am convinced that most of the time I managed to improve my leaderboard score (from 0.610 to 0.606), I was overfitting this specific dataset. The differences are so small that they could clearly be noise, especially on “small” datasets like these.
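A quick bootstrap makes the noise argument concrete: on a 1,500-sample leaderboard set, how much does the metric move from resampling alone? All numbers below are assumptions (and a simple accuracy-style metric stands in for the real competition metric), but the order of magnitude is the point:

```python
import random

random.seed(42)
n = 1500          # size of the public leaderboard set mentioned above
true_acc = 0.62   # hypothetical per-sample success rate of a model

# Simulate per-sample outcomes, then bootstrap the aggregate score
# to estimate how much it wiggles from sampling variation alone.
outcomes = [1 if random.random() < true_acc else 0 for _ in range(n)]

scores = []
for _ in range(2000):
    sample = [outcomes[random.randrange(n)] for _ in range(n)]
    scores.append(sum(sample) / n)
scores.sort()

lo, hi = scores[50], scores[1949]  # ~95% interval
print(f"95% of bootstrap scores fall in [{lo:.3f}, {hi:.3f}]")
```

The resulting interval is a few hundredths wide, so a 0.610 → 0.606 "improvement" sits comfortably inside pure sampling noise.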

I agree with you @michael_bordeleau, I wonder whether all observations came from the same source. I find the split in the number of observations (33,000 for train, 362 for test, and 1,500 for the leaderboard) really intriguing … is it just to make the competition harder, or is there a reason behind it? I’m curious to find out.


Useful Discussion!

The same thing happens to me as well. Whenever I get a good boost to the local CV score by adding interesting and useful features, they fall short on the leaderboard. I’m unable to decide whether to trust the local score or the leaderboard score.


Hello @demarsylvain.

Your notebook looks great! Any idea how to run it in the VM?


Thanks. Don’t hesitate to like it :upside_down_face:, this will increase my chances for the Community Prize, even though I joined late.

For the R notebook, you have to install R in order to see the “R kernel” option at the top right of your notebook.


Upvoted. BTW, if I were in the money thanks to your notebook, I would not share it with you :wink:


I hope you’ll invite me for a game if you win the playstation :wink:


Indeed, why not?

But you’ll need something to play on with me, so I hope you get a prize as well.

It doesn’t matter which one you trust (CV or LB): half the data is just noise due to the very bad algorithm used for feature extraction. If we add the unknown distribution of the private LB on top of that, the top-10 winners are simply those with the best luck.

If you want to increase your chances of winning, just stick to one solution and submit it with all possible combinations of class weights. With 10 submissions per day, you may end up overfitting the private LB and winning a prize.
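For what it’s worth, the brute-force weight sweep described above amounts to something like the following sketch. The weight grid is an arbitrary assumption, the class names are taken from this thread, and each combination would become one submission:

```python
from itertools import product

# Hypothetical grid of per-class weights to sweep over.
weights = [1.0, 2.0, 4.0]
classes = ["normal", "post_alzheimer", "pre_alzheimer"]

# One dict of class weights per submission.
combos = [dict(zip(classes, w)) for w in product(weights, repeat=len(classes))]

print(len(combos))  # 27 combinations → about 3 days at 10 submissions per day
```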

I have a NN that scored 0.613 on the LB. After some small improvements to my CV, the LB score is now 0.627!!


Hi @moto ,

Just curious: your best submission (the one kept for the private LB), is it Python or R? :wink:


Hello @demarsylvain. It is in Python from my teammate.
