Any idea why this might be? Is anyone else getting similarly low validation results (0.08 AP) but a much higher Testing (40%) score (0.37) that still underperforms your local results (0.47 in my case)?
A few specific debug questions:
Is there a good way to retrieve my predicted annotations JSON for the validation set produced by the GitLab CI?
Is the validation set used for this run exactly the same as the 945-image set provided in the resources section? If not, could I have the validation set used by the CI for a local run?
Are the annotations exactly the same? (sha256sum of 92d5307a1fea44c1095f958a4c7629569436924a1343e2db91bd8e6f20892b62)
If there have been any changes to the data or code inputs in the CI, would it be possible to run the top submission again (submitted one month ago, before v2.1 was released)? If the score is exactly the same, then we know the problem isn’t an inconsistency introduced into the environment in v2.1.
Another weird thing I noticed is that the “Testing (40%) Metrics” score gradually increases, starting at around 0.05 AP when about 10% of the images have been evaluated and working its way up to 0.37 AP when it’s 100% complete. Log:
AP @ IoU 0.50:0.95 scores as the aicrowd GitLab bot edits the comment:
Completed Percentage: 8.51%
Testing (40%) Metrics: 0.0952
Testing (100% + 40%) Metrics: 0.0404
Completed Percentage: 12.77%
Testing (40%) Metrics: 0.1322
Testing (100% + 40%) Metrics: 0.0580
Completed Percentage: 21.20%
Testing (40%) Metrics: 0.2113
Testing (100% + 40%) Metrics: 0.0863
Completed Percentage: 49.25%
Testing (40%) Metrics: 0.2944
Testing (100% + 40%) Metrics: 0.1923
Completed Percentage: 76.06%
Testing (40%) Metrics: 0.3362
Testing (100% + 40%) Metrics: 0.2684
Completed Percentage: 100%
Testing (40%) Metrics: 0.3702
Testing (100% + 40%) Metrics: 0.3430
Does everyone else experience this increasing AP behavior, or do your numbers bounce around above and below as they converge on final AP?
This makes me think that somehow I’m getting an AP of 0 for the first few images, after which it works normally, resulting in a gradual rise in the average.
Yes, I get the same results. I assume the evaluation runs across several machines, or at least several GPUs, so rather than waiting for all images to be processed and computing mAP at the end, the organizers apparently decided to compute the score during inference. My guess is that when, say, 10% of the images have been processed, the answers for the other 90% are set to zero.
Thus, your score keeps improving as more and more photos are evaluated (fewer zeros remain).
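If you want to check this hypothesis locally, here is a minimal sketch with pycocotools (the file names are placeholders, and the IoU type depends on the task): score a predictions file that covers only part of the images once against all ground-truth images and once against only the predicted ones. The first number should come out much lower, which would explain the climbing live score.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("validation_annotations.json")          # full ground truth (placeholder name)
coco_dt = coco_gt.loadRes("partial_predictions.json")  # predictions for only some images

def ap(img_ids=None):
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")    # or "segm", depending on the task
    if img_ids is not None:
        ev.params.imgIds = img_ids                     # restrict evaluation to these images
    ev.evaluate()
    ev.accumulate()
    ev.summarize()                                     # first line is AP @ IoU=0.50:0.95
    return ev.stats[0]

ap_all = ap()                                          # unpredicted images count as pure misses
predicted_ids = sorted({d["image_id"] for d in coco_dt.anns.values()})
ap_predicted_only = ap(predicted_ids)
print(ap_all, ap_predicted_only)                       # expect ap_all << ap_predicted_only
```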
@Mykola_Lavreniuk do you have the same result from score.py locally as you do for the validation section on GitLab for your submission? Or do you see something similar to my results, with 0.47 AP locally and 0.09 AP in the submission?
The validation results you see in GitLab are just a model sanity check. I think they have nothing to do with what you see on your local machine; they only check whether your submission is worth assigning pods to for evaluation.
Your local validation results are overfitted for the very obvious reason that the ~950 validation images are the same as those in the train set. If you have already removed these, you will run into a class-imbalance problem, with some classes not represented at all. I created a new validation set for myself and I can say it works best: relative to the result I see on my local machine, the test score comes out roughly +6.0% higher (positive variance).
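For anyone who wants to try something similar, a rough sketch (placeholders only, not the exact procedure used above) for carving a small, class-covering validation split out of standard COCO-format training annotations could look like this; you would then exclude these image ids from your local training:

```python
import json
import random
from collections import defaultdict

random.seed(0)
with open("train_annotations.json") as f:              # placeholder file name
    train = json.load(f)

# Collect the training images that contain each category.
images_per_category = defaultdict(set)
for ann in train["annotations"]:
    images_per_category[ann["category_id"]].add(ann["image_id"])

# Pick a handful of images per category so no class ends up empty.
val_ids = set()
for cat_id, img_ids in images_per_category.items():
    candidates = list(img_ids - val_ids)
    random.shuffle(candidates)
    val_ids.update(candidates[:5])                     # ~5 images per category; tune as needed

val = {
    "categories": train["categories"],
    "images": [im for im in train["images"] if im["id"] in val_ids],
    "annotations": [a for a in train["annotations"] if a["image_id"] in val_ids],
}
with open("my_local_validation.json", "w") as f:
    json.dump(val, f)
```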
Thanks for opening this discussion, and it is indeed an important one.
I am happy that we can all brainstorm together and improve the pipeline for all participants.
The scores are currently calculated using the complete annotation files, as @Mykola_Lavreniuk suspected, while the predictions made by your codebases get added to the calculation on the go. The intention has always been to provide real-time feedback and show you approximate scores as soon as possible.
Issue #1: Score gradually increases
The flaw
We run COCOeval using the complete annotations BUT with only the subset of images for which predictions have been generated so far, due to which the score starts at a much lower value and grows towards your final score.
Future update
Based on this thread, we will update the pipeline to dynamically create a subset of the annotations file and calculate the live-reported scores against that.
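For what it’s worth, a dynamic subset like that could be as simple as the following sketch (assuming standard COCO-format keys; file names are placeholders and this is not necessarily how the pipeline will implement it):

```python
import json

def subset_annotations(full_ann_path, predictions_path, out_path):
    """Keep only the images (and their annotations) that already have predictions,
    so the live score is computed on the subset the model has actually processed."""
    with open(predictions_path) as f:
        predicted_ids = {p["image_id"] for p in json.load(f)}
    with open(full_ann_path) as f:
        gt = json.load(f)
    gt["images"] = [im for im in gt["images"] if im["id"] in predicted_ids]
    gt["annotations"] = [a for a in gt["annotations"] if a["image_id"] in predicted_ids]
    with open(out_path, "w") as f:
        json.dump(gt, f)
```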
Issue #2: Validation scores don’t make sense
The flaw
We were using only 100 images in the validation phase because it served as an integration test. We missed providing a proper scores table for the validation phase that you could use to compare local vs. online performance, which is our bad. The scores are calculated against the annotation file for the complete validation set, but only 100 images are present in the validation phase (as in Issue #1), causing much lower scores.
Future update
We will continue to use a tiny subset to limit the compute used in the validation phase, and we believe the score on 100 images should be reliable enough to compare local vs. online scoring. The list of these ~100 images will be shared in the resources section in the future so you can reproduce the scores locally. (Meanwhile, the current scores were quite low, e.g. 0.09 AP while the full evaluation gives a lot more; that was happening due to Issue #1.)
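Once that image list is published, reproducing the online validation score locally should be roughly this (file names and the list format are assumptions; score.py may already do the equivalent):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("validation_annotations.json")           # placeholder name
coco_dt = coco_gt.loadRes("my_validation_predictions.json")

with open("validation_subset_image_ids.txt") as f:      # hypothetical list of the ~100 ids
    subset_ids = [int(line) for line in f if line.strip()]

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")          # or "segm", depending on the task
ev.params.imgIds = subset_ids
ev.evaluate()
ev.accumulate()
ev.summarize()                                           # AP @ IoU=0.50:0.95 is the first line
```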