External datasets used by participants

As required by the competition rules, I am sharing the external data I used for my competition entries. Other participants may want to share their data in this thread as well.

I have used the following external datasets:

iNaturalist 2021 dataset:

A custom subset of iNaturalist images, covering many mosquito species as well as other species, was downloaded from inaturalist-open-data:

A CSV file containing path, URL and species name:

For downloading, use a download manager such as aria2. Be careful: a lot of space is required (440 GB), which is why I’m sharing the links rather than re-uploading the data.
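
As a rough sketch of how the CSV could be fed to aria2: aria2c accepts an input file (`-i`) in which each URL is followed by indented option lines such as `out=`. The column names `path` and `url` and the sample rows below are assumptions for illustration, so adjust them to the actual CSV header.

```python
import csv
import io


def csv_to_aria2_input(csv_text, url_col="url", path_col="path"):
    """Turn CSV rows into the text of an aria2 input file (for `aria2c -i`)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(row[url_col])
        # An indented "out=" line sets the output file name for the URL above it;
        # aria2 resolves it relative to the directory given with --dir.
        lines.append(f"  out={row[path_col]}")
    return "\n".join(lines) + "\n"


# Made-up sample rows (hypothetical URLs), not taken from the real CSV.
sample = (
    "path,url,species\n"
    "aedes_aegypti/1.jpg,https://example.org/photos/1/medium.jpg,Aedes aegypti\n"
    "culex_pipiens/2.jpg,https://example.org/photos/2/medium.jpg,Culex pipiens\n"
)
print(csv_to_aria2_input(sample))
```

Saving the result as, say, `downloads.txt` and running `aria2c -i downloads.txt -d images` should then fetch each file into the `images` directory.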

The license is image-specific but is generally either public domain or some form of Creative Commons. The bulk of the images have CC BY-NC and CC BY-SA licenses. I’m not a lawyer, but I assume using them for non-commercial machine learning models is fair use.

Justification for selection:

To address the small number of samples in some minority classes and to learn more discriminative features for insect classification.

Here are the external datasets we used for our competition entries. Each dataset is organized in the ImageFolder format; 5 GB of space is required.

  1. Mosquito dataset from Kaggle
    https://drive.google.com/file/d/1aXVaowHDaoDRK4PeqJN25lFcujQMJHUx/view?usp=drive_link

  2. Mosquito dataset from an iNaturalist subset
    https://drive.google.com/file/d/1xrz2qMmzd2ut12g_EXkOSxzXGUSRSh8s/view?usp=drive_link

The licenses are the same as for @tfriedel’s dataset.

Justification for selection:
To extend the samples in the minority classes (aegypti and anopheles).
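
To see how much a download actually adds to each class, the ImageFolder layout (one subfolder per class, images inside) can be tallied with the standard library alone. This is only a sketch: the folder and file names in the demo are made up, and the extension list is an assumption.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; extend it if the archives use other formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Count image files in each class subfolder of an ImageFolder-style tree."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if not class_dir.is_dir():
            continue  # stray files next to the class folders are ignored
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


# Tiny demo on a throwaway directory with hypothetical class folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("aegypti/a.jpg", "aegypti/b.JPG", "anopheles/c.png"):
        path = Path(tmp, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_counts = count_images_per_class(tmp)
    print(dict(demo_counts))
```

Running the same function on the provided train data and on the extracted archives gives a quick before/after comparison for the minority classes.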

Hi guys,

Thanks for sharing these links! No external data so far for me but I plan to try some of the ones you’ve shared soon.

@MPWARE really? That’s impressive if you got such a high score with only the provided data! You’ve got to tell us how you achieved that after the competition!

Yes, really, but I’m quite sure I’m overfitting the public LB.

@OverWhelmingFit @tfriedel Are we sure that your external data does not contain any subset of Mosquito Alert (outside the data provided in the train dataset)? That is forbidden by the rules.

@MPWARE
iNaturalist is a separate app from Mosquito Alert, so any image taken directly with the app will differ from images taken with the Mosquito Alert app. That said, it’s also possible to upload files you have stored locally, probably in both apps. It would be difficult to rule out images that users have uploaded to both apps in this way, especially if you don’t have the Mosquito Alert image dataset. I guess if this happens rarely, it will not be a big deal.
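
One partial check that is possible with only the provided train data is to look for byte-identical files between it and the external images. The sketch below assumes two local directories (the names are up to you) and compares SHA-256 hashes of the raw bytes. It only catches exact copies: images that were resized or re-encoded by either platform would slip through, so an empty result proves little.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(reference_dir, candidate_dir):
    """List files in candidate_dir whose bytes match some file in reference_dir."""
    reference = {
        file_digest(p) for p in Path(reference_dir).rglob("*") if p.is_file()
    }
    return sorted(
        p
        for p in Path(candidate_dir).rglob("*")
        if p.is_file() and file_digest(p) in reference
    )
```

Here `reference_dir` would be the provided train images and `candidate_dir` the external download; any path returned is a candidate for removal.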

I’m now using the external datasets above (iNaturalist subset + Kaggle) shared by other participants.

I’m using the datasets shared by @OverWhelmingFit, including the iNaturalist subset and the Kaggle dataset. Huge thanks for making these available!

Damn! I missed this thread. I have been trying too hard without it, which is why the gap between my score and the top of the LB is so large.

@tfriedel could you please share the source code?