About the evaluation metric

We received feedback about the evaluation metric, and some clarification is needed:
We are aware that the evaluation calculation is not the same as the official PASCAL VOC mAP@0.5IoU. However, to be consistent with previous editions of the object detection challenges in ImageCLEF, we chose to keep this script and naming for this edition. This decision will be re-evaluated for future editions.

Dear Dimitry,

I have to disagree with you. Other ImageCLEF challenges, such as ImageCLEF-Coral, use mAP@0.5 as the evaluation metric. The same metric was also used last year in the same challenge.

All the big datasets use mAP as their main metric. I don’t see any reasonable argument for using anything different. The metric you are using does not make sense for your long-tail “class presence” distribution: it will ignore the least frequent classes.
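To illustrate the long-tail concern, here is a minimal sketch comparing a score pooled over all boxes with a score averaged per class. The class names and counts are made up for illustration, and recall stands in for the full average-precision computation to keep the example short.

```python
from collections import Counter

def pooled_and_per_class_recall(gt_labels, hit_labels):
    """gt_labels: one class label per ground-truth box.
    hit_labels: one class label per correctly detected ground-truth box."""
    gt = Counter(gt_labels)
    hits = Counter(hit_labels)
    # Pooled: every box counts equally, so frequent classes dominate.
    pooled = sum(hits.values()) / sum(gt.values())
    # Per-class average: every class counts equally, as in mAP.
    per_class = sum(hits[c] / gt[c] for c in gt) / len(gt)
    return pooled, per_class

# Hypothetical long-tail data: 98 "button" boxes, 2 "slider" boxes.
gt = ["button"] * 98 + ["slider"] * 2
hits = ["button"] * 98  # the rare class is never detected
pooled, per_class = pooled_and_per_class_recall(gt, hits)
print(pooled, per_class)
```

In this toy case the pooled score stays close to 1 even though one of the two classes is missed entirely, while the per-class average drops to one half.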


Dear Lukas,
The code for the evaluator is the same as for the CORAL task.
This also affects this task and its previous editions.
We know this is not desirable, but we are in the middle of the evaluation phase and it is difficult to change the metric now. Together with the organizers, we decided to keep it while communicating the issue transparently.
Thank you for your understanding.

Dear Dimitry,

Again, I don’t agree with you. Looking at last year’s paper (http://ceur-ws.org/Vol-2380/paper_200.pdf), you can see that the metric was different. They must have used a different script.

With the proposed metric, the competition could look really amateurish, and related articles probably won’t be well accepted by the CV/ML community.

Furthermore, it looks like your script ignores classes when calculating IoU. Can you confirm or deny that?

PS: Considering that the competition has been running for 2 months already, changing the final metric two weeks before the deadline is a bit unfair.


Dear Luka,
I am one of the organisers of the ImageCLEFcoral task.

I can confirm that the DrawnUI task is using the same evaluation script as the Coral task. We also use the same methodology as past ImageCLEF annotation tasks.

I can also confirm that the evaluation does NOT ignore classes; you can check the line that reads “if predictions[image_key].get(widget_key):”. Therefore, the overlap is only calculated if the class is correctly identified.

If you wish to calculate only the precision/recall, ignoring the box overlap, you can change “if iou>0.5” to “if iou>=0.0” or remove the conditional.
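For readers without the script at hand, the following is a minimal sketch of a class-aware overlap check of the kind described above. It is not the task's evaluation script: the nested dictionary layout (image, then class, then list of boxes), the box format (x_min, y_min, x_max, y_max), and the names `iou`, `count_hits` and `min_iou` are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_hits(predictions, ground_truth, min_iou=0.5):
    """Count predicted boxes that share a class with a ground-truth box and
    overlap it with IoU > min_iou. Pass min_iou=None to skip the overlap check.
    Note: this sketch does not enforce one-to-one matching."""
    hits = 0
    for image_key, gt_by_class in ground_truth.items():
        for widget_key, gt_boxes in gt_by_class.items():
            # Only predictions of the same class are considered at all.
            pred_boxes = predictions.get(image_key, {}).get(widget_key)
            if not pred_boxes:
                continue
            for p in pred_boxes:
                if min_iou is None or any(iou(p, g) > min_iou for g in gt_boxes):
                    hits += 1
    return hits

# Hypothetical data: one ground-truth "button"; the "label" prediction has the wrong class.
ground_truth = {"img1": {"button": [(0, 0, 10, 10)]}}
predictions = {"img1": {"button": [(0, 0, 10, 6), (0, 0, 10, 4)],
                        "label": [(0, 0, 10, 10)]}}
strict_hits = count_hits(predictions, ground_truth)               # IoU 0.6 passes, 0.4 fails
class_only_hits = count_hits(predictions, ground_truth, None)     # overlap ignored
print(strict_hits, class_only_hits)
```

With the overlap check disabled, only the class lookup decides whether a prediction counts, which corresponds to removing the conditional.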

We appreciate that a system can be evaluated in many ways. Unfortunately, AIcrowd only supports two metrics at the moment. The given evaluation produces an overall score over all the images and classes.
Hopefully the resources provided by this and other tasks are useful for you and the CV/ML community; they have been created with a lot of voluntary effort (from Dimitri and many other people). If you plan to work further on this task and submit the results to a journal, I personally encourage you to provide a further analysis of the results. As you saw in the ImageCLEFcoral 2019 paper, further analysis was also done there to identify the accuracy by class (unfortunately, this needs to happen outside AIcrowd, as it only supports two metrics).

Any constructive feedback is always welcome and, indeed, encouraged. So thanks a lot for taking the time to check the script.

Finally, as you are a participant in the DrawnUI task, I would like to encourage you to participate in the coral task as well. We made sure that both tasks share the same submission format and evaluation, to make it easier for participants to submit results to both. Time is very limited, but if you already have your work ready for the DrawnUI task, it would be “easy” to train it on a different image collection (the coral one), and it would indeed be very interesting to see how the image collection affects the approach.



Dear Alba & Dimitry,

To prove my point, I made a submission today. I achieved an overall Precision of 0.997 and a Recall of 10.276. Note that the maximum value for both Precision and Recall is 1.
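As a minimal sketch of one way a recall above 1 can arise in general: if every prediction that overlaps a ground-truth box is counted as a true positive, duplicates inflate the numerator. Whether this is what happens in the task's script is an assumption, not something established here; the function names and data below are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_counting_duplicates(preds, gts, thr=0.5):
    # Every overlapping prediction counts, even if its ground-truth box
    # was already matched by an earlier prediction.
    tp = sum(1 for p in preds if any(iou(p, g) > thr for g in gts))
    return tp / len(gts)

def recall_one_to_one(preds, gts, thr=0.5):
    # Each ground-truth box can be matched at most once, so recall <= 1.
    matched = set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) > thr:
                matched.add(i)
                break
    return len(matched) / len(gts)

gts = [(0, 0, 10, 10)]           # a single ground-truth box
preds = [(0, 0, 10, 10)] * 10    # the same box submitted ten times
inflated = recall_counting_duplicates(preds, gts)
bounded = recall_one_to_one(preds, gts)
print(inflated, bounded)
```

With one-to-one matching, as used in PASCAL VOC style evaluation, the duplicates would count as false positives instead.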

I’m begging you, PLEASE change the evaluation metric. Without the change, it will end badly.

Kind Regards,

