Although the validation set contains representative recordings for two of the final four recording sites of the test set, there are large differences in the cmap and rmap values. Our model, for example, achieves a cmap value of 0.1480 (rmap: 0.2220) on the validation set, in contrast to a cmap score of 0.000212 (rmap: 0.025) in the official evaluation on the test set. Does anyone have an explanation for these discrepancies?
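For reference, here is a minimal sketch of how the two metrics are commonly computed, assuming cmap denotes class-wise mean average precision and rmap the per-recording (sample-wise) variant; both the interpretation of the names and the sklearn-based implementation are my assumptions, not the official scoring code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def cmap(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean of per-class average precision; inputs are (n_segments, n_classes)."""
    ap_per_class = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()  # skip classes without positives to avoid undefined AP
    ]
    return float(np.mean(ap_per_class))

def rmap(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean of per-segment average precision over segments with at least one label."""
    ap_per_segment = [
        average_precision_score(y_true[i], y_score[i])
        for i in range(y_true.shape[0])
        if y_true[i].any()  # segments without any label are skipped here
    ]
    return float(np.mean(ap_per_segment))
```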
I am currently investigating this - I was wondering about the scores, too. We tested the submission system before we launched the challenge and it seemed to work just fine. I downloaded all submissions and I am running some additional tests. I will provide you with updated scores if anything changes. I will also evaluate each submission per location and provide you with these scores as well.
I investigated the submissions and decided to revert the scoring system to the one used for the validation data. This improved the scores across the board (see topic “Official results”). The reason why we chose to apply a more restrictive measure lies in the test data: the validation dataset had bird sounds in every sound file, the test data did not. The additional measure was intended to apply a penalty for detections in segments that did not contain a bird vocalization. It turned out that this measure was too restrictive and diminished the scores more than I anticipated. I will update the scoring system here on AICrowd for the second round; each new submission should be scored accordingly.
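To illustrate the effect described above, here is a rough sketch of one way such a penalty could be applied, assuming a simple multiplicative scheme with a fixed detection threshold; the function name, the threshold, and the penalty form are hypothetical and not taken from the actual scoring code:

```python
import numpy as np

def penalized_score(base_score: float,
                    empty_segment_scores: np.ndarray,
                    threshold: float = 0.5) -> float:
    """Scale a base metric by the fraction of bird-free segments with no
    detection above `threshold` (hypothetical penalty scheme)."""
    if empty_segment_scores.size == 0:
        return base_score
    # A bird-free segment counts as a false detection if any class score
    # exceeds the threshold.
    false_detections = empty_segment_scores.max(axis=1) > threshold
    penalty = 1.0 - false_detections.mean()
    return base_score * penalty
```

Under a scheme like this, if nearly every bird-free test segment contains at least one confident false positive, the penalty factor approaches zero, which would pull the overall score down toward the near-zero values reported above.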