LifeCLEF 2022 Plant Evaluation

“in order to compensate for species that would be underrepresented in the test set, we will use a macro-averaged version of the MRR (average MRR per species)”

The test dataset appears to have only 26,868 observations, but the above statement would imply that there are at least 80,000 observations, since you would need at least one observation of each species to be able to compute the MRR per species.

  1. Is the “macro-averaged version of the MRR” the mean of the MRRs of each species?
  2. Is every species represented in the test set?

Thank you for your comment and request for clarification:

  1. Yes, the macro-averaged version of the MRR is the average of the MRRs of each species.
  2. No, not all species are represented in the test set.

In other words, the MA-MRR is calculated only over the species actually present in the test set. Also, since there are several observations per species, there are fewer than 26,868 species in the test set.
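
To make the definition concrete, here is a minimal Python sketch of a macro-averaged MRR. It is not the official evaluation script: the function name `macro_averaged_mrr`, the input layout (one ranked list of species labels per observation), and the choice to score an observation as 0 when the true species is missing from its ranked list are all assumptions made for illustration.

```python
from collections import defaultdict


def macro_averaged_mrr(true_species, ranked_predictions):
    """Average the per-species MRR over the species present in the ground truth.

    true_species: list with the true species label of each test observation.
    ranked_predictions: list of ranked label lists (best guess first), one per observation.
    """
    reciprocal_ranks = defaultdict(list)
    for truth, ranking in zip(true_species, ranked_predictions):
        rr = 0.0  # assumption: 0 if the true species is not in the ranked list
        for rank, candidate in enumerate(ranking, start=1):
            if candidate == truth:
                rr = 1.0 / rank
                break
        reciprocal_ranks[truth].append(rr)

    if not reciprocal_ranks:
        raise ValueError("no observations to evaluate")

    # MRR per species, then an unweighted mean over the species seen in the test set.
    per_species_mrr = [sum(v) / len(v) for v in reciprocal_ranks.values()]
    return sum(per_species_mrr) / len(per_species_mrr)


# Toy example with hypothetical labels: species "a" has two observations, "b" has one.
truth = ["a", "a", "b"]
predictions = [["a", "b"], ["b", "a"], ["c", "b"]]
score = macro_averaged_mrr(truth, predictions)
print(score)
```

In this toy example the reciprocal ranks are 1, 0.5 and 0.5, so the per-species MRRs are 0.75 for "a" and 0.5 for "b", and the macro-average is 0.625, whereas the plain MRR over all observations would be about 0.667. Species "c" is predicted but never appears as a true label, so it does not enter the average.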

We would have liked to evaluate all species in the training set, but it was difficult to collect enough data validated by expert botanists at such a scale.
We could have retrieved complementary data published on GBIF for the test set, but because we allow participants to use external complementary training data, there would have been a risk that test images were included in that external training data, which would have biased the results of the challenge.

Finally, we can add that having fewer species in the test set corresponds to a realistic scenario faced by automatic identification systems such as Pl@ntNet or iNaturalist: these systems must be able to recognize as many species as possible without knowing in advance which species will be requested most frequently and which will never be requested.