Local Run Produces Different AP

I’ve placed the 945-image validation set images and annotations json in data/.

I am seeing a very different result from the AIcrowd GitLab CI (mAP of 0.08) compared to running score.py in a local docker container (mAP of 0.47):

aicrowd@1a39dca5fdaa:~$ python score.py 
Loading Ground Truth Annotations...
...
DONE (t=10.70s).
Accumulating evaluation results....
Accumulating evaluation results...
DONE (t=3.14s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.609
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.513
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.354
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.745
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.640
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.709
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.062
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.353
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.745
Scores Computed!!
 Mean Average Precision : 0.47119927160832803
 Mean Average Recall : 0.7093107905825435
 All Scores: {'average_precision_iou_all_area_all_max_dets_100': 0.47119927160832803, 'average_precision_iou_5_area_all_max_dets_100': 0.6092357382102722, 'average_precision_iou_75_area_all_max_dets_100': 0.5131199283891409, 'average_precision_iou_all_area_small_max_dets_100': 0.0012939793545111333, 'average_precision_iou_all_area_medium_max_dets_100': 0.35413810611830415, 'average_precision_iou_all_area_large_max_dets_100': 0.7449979372937293, 'average_recall_iou_all_area_all_max_dets_1': 0.6404105832615801, 'average_recall_iou_all_area_all_max_dets_10': 0.7033399947300033, 'average_recall_iou_all_area_all_max_dets_100': 0.7093107905825435, 'average_recall_iou_all_area_small_max_dets_1': 0.0625, 'average_recall_iou_all_area_medium_max_dets_1': 0.35323361823361826, 'average_recall_iou_all_area_large_max_dets_1': 0.74471351133707}

[Screenshot: Validation Metrics]

Any idea why this might be? Is anyone else getting similarly low results for validation (0.08 AP), but much higher results (0.37) on Testing (40%) that still fall short of your local score (0.47 in my case)?

A few specific debug questions:

  • Is there a good way to retrieve my predicted annotations json for the validation set produced by the Gitlab CI?
  • Is the validation set used for this run exactly the same as the 945 image set provided in the resources section? If not, could I have the validation set used for a local run?
  • Is the score.py used by the GitLab CI different from score.py at master in AIcrowd/food-recognition-benchmark-starter-kit on GitHub?
  • Are the annotations exactly the same? My local annotations json has a sha256sum of 92d5307a1fea44c1095f958a4c7629569436924a1343e2db91bd8e6f20892b62 (computed as in the snippet below).
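
For reference, this is roughly how I compute that checksum locally. A minimal sketch; the exact filename under data/ is an assumption, so substitute whatever your annotations json is called:

```python
import hashlib

# Compute the sha256 of the annotations json so copies can be compared.
# NOTE: the exact path under data/ is an assumption.
def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("data/annotations.json"))
```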

If there have been any changes to the data or code inputs in the CI, would it be possible to re-run the top submission (submitted one month ago, before v2.1 was released)? If the score comes out exactly the same, then we know the problem isn’t an inconsistency in the environments introduced in 2.1.

2 Likes

Another weird thing I noticed is that the “Testing (40%) Metrics” score gradually increases: it starts at around 0.05 AP when roughly 10% of the images are done evaluating and works its way up to 0.37 AP when it’s 100% complete. Log:

AP @ IoU=0.50:0.95 score as the aicrowd gitlab bot edits the comment:

Completed Percentage: 8.51%

  • Testing (40%) Metrics: 0.0952
  • Testing (100% + 40%) Metrics: 0.0404

Completed Percentage: 12.77%

  • Testing (40%) Metrics: 0.1322
  • Testing (100% + 40%) Metrics: 0.0580

Completed Percentage: 21.20%

  • Testing (40%) Metrics: 0.2113
  • Testing (100% + 40%) Metrics: 0.0863

Completed Percentage: 49.25%

  • Testing (40%) Metrics: 0.2944
  • Testing (100% + 40%) Metrics: 0.1923

Completed Percentage: 76.06%

  • Testing (40%) Metrics: 0.3362
  • Testing (100% + 40%) Metrics: 0.2684

Completed Percentage: 100%

  • Testing (40%) Metrics: 0.3702
  • Testing (100% + 40%) Metrics: 0.3430

Does everyone else experience this increasing AP behavior, or do your numbers bounce around above and below the final AP as they converge?

This makes me think that somehow I’m getting an AP of 0 for the first few images and then it works normally, resulting in a gradual rise in the average.

1 Like

Yes, I see the same results. I assume the evaluation runs on several machines, or at least several GPUs, so the organizers don’t wait until all of the images have been processed before calculating mAP. They probably decided to compute the score during inference, and I assume that when, for example, 10% of the images have been processed, the other 90% of the answers are counted as 0.
Thus, your score keeps improving as more and more photos are evaluated (fewer zeros remain).

1 Like

That seems plausible, though I still question why my local run has a much better score than the gitlab CI run.

@shivam do you have any thoughts on this, and my debug questions in the original post?

@Mykola_Lavreniuk do you have the same result from score.py locally as you do for the validation section on gitlab for your submission? Or do you see something similar to my results with 0.47AP locally and 0.09AP in the submission?

My theory:

  1. The validation results that you see in Gitlab are just a model sanity check. I think they have nothing to do with what you see on your local machine. That phase just checks whether your submission is worth assigning evaluation pods to.
  2. Your local validation results are inflated for the very obvious reason that those ~950 validation images are also part of the training set. If you remove them, you instead run into a class-imbalance problem, with some classes not represented at all. I created a new validation set for myself and I can say it works best: the result I see on my local machine is about +6.0% (positive variance) above the test score. (A rough sketch of carving out such a split follows this list.)
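
This isn’t exactly what I did, but a minimal sketch of the idea: split the COCO-format training annotations into a new train/val pair before training, so the local validation images are never seen during training. File names and the 10% ratio are assumptions, and nothing clever is done about class balance here:

```python
import json
import random

# Carve a held-out validation split out of the COCO-format training
# annotations (file names and the 10% ratio are assumptions).
with open("data/train/annotations.json") as f:
    coco = json.load(f)

random.seed(0)
images = list(coco["images"])
random.shuffle(images)
n_val = int(0.1 * len(images))
splits = {"val_split": images[:n_val], "train_split": images[n_val:]}

for name, imgs in splits.items():
    ids = {img["id"] for img in imgs}
    subset = {
        "categories": coco["categories"],
        "images": imgs,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
    }
    with open(f"data/{name}.json", "w") as f:
        json.dump(subset, f)

# Train only on data/train_split.json and validate on data/val_split.json.
```
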
3 Likes

Oh wow, you’re right, I missed the part about the validation set being part of the training set. I wish I had created a separate validation set.

My 41 AP, 46 AP, and 48 AP models all have the same AP of ~37.0 on the test set, indicating it converged early on.

Thanks!

Hi all,

Thanks for opening this discussion, and it is indeed an important one.
I am happy that we all can brainstorm and improve the pipeline for all the participants together. :muscle:

The scores are currently calculated using the complete annotation files, as @Mykola_Lavreniuk suspected, while the predictions made by your codebases get added to the calculation on the go. The intention has always been to provide real-time feedback and show you approximate scores as soon as possible.


Issue #1: Score gradually increases

The flaw
We run COCOeval using the complete annotations BUT with predictions for only a subset of the images (the ones generated so far), which is why the score starts at a much lower value and grows towards your final score.

Future update
Based on this thread, we will update the pipeline to dynamically create a subset of the annotations file and calculate the live-reported scores against that.
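
To make the planned change concrete, here is a minimal sketch using the pycocotools API (the file names are placeholders, and this is not the exact CI code). Restricting params.imgIds to the images that already have predictions has the same effect as subsetting the annotations file, and keeps the partial score comparable to the final one:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: full ground truth plus the predictions generated so far.
coco_gt = COCO("annotations/ground_truth.json")
coco_dt = coco_gt.loadRes("predictions_so_far.json")

# Image ids that actually have predictions at this point of the evaluation.
evaluated_ids = sorted({ann["image_id"] for ann in coco_dt.loadAnns(coco_dt.getAnnIds())})

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")  # or "bbox"

# Current behaviour (Issue #1): score against ALL ground-truth images, so every
# image without predictions yet counts as missed detections and the running AP
# starts far below the final value.
# coco_eval.params.imgIds = coco_gt.getImgIds()

# Planned behaviour: score only the images processed so far.
coco_eval.params.imgIds = evaluated_ids

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("AP@[0.50:0.95] =", coco_eval.stats[0])
```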


Issue #2: Validation scores don’t make sense

The flaw
We were using only 100 images in the validation phase because it works as an integration test. We missed providing a proper scores table for the validation phase that you can use to compare local vs. online performance; that is our bad. The scores get calculated against the annotation file for the complete validation set, but only 100 images are present in the validation phase (as in Issue #1), causing much lower scores.

Future update
We will continue to use a tiny subset to limit the compute used in the validation phase, and we believe the score on 100 images should be reliable enough to compare local vs. online scoring. The list of these ~100 images will be shared in the resources section in the future so you can reproduce the scores locally. (The current scores were quite low, for example 0.09 AP while the full evaluation gives a lot more; that was happening due to Issue #1.)
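
Once that list is published, reproducing the online validation score locally should be straightforward: filter the full annotations json down to those image ids and point score.py at the result. A hedged sketch (the val_subset_image_ids.json filename is hypothetical, since the list is not released yet):

```python
import json

# Hypothetical inputs: the full 945-image validation annotations plus the
# list of ~100 image ids used online (not yet published).
with open("data/annotations.json") as f:
    coco = json.load(f)
with open("data/val_subset_image_ids.json") as f:
    subset_ids = set(json.load(f))

coco["images"] = [img for img in coco["images"] if img["id"] in subset_ids]
coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in subset_ids]

# Point score.py at this file to mimic the online validation phase.
with open("data/annotations_val_subset.json", "w") as f:
    json.dump(coco, f)
```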

1 Like