Colab for this competition

Dear Ibrahim,

To be honest, we do not have any experience running this on Google Colab. I believe you need to download the dataset to your local machine and upload it to your Google Drive; then you should be able to access the data from Google Colab. To submit your solution, you will need to upload your end-to-end system to our GitLab repository, as we have a private test set.

Hope it helped.

Best,
Lukas

Thank you for your reply, picekl.

The train file is about 60 GB, so downloading and re-uploading won’t really be feasible. I saw a previous post where someone used the dataset on Kaggle successfully, and another tutorial running a different competition on Colab, so hopefully there is a way around it.

Best regards

Hi @ibrahim_sherif,

Yes, AIcrowd supports dataset downloads via its CLI.
You can use it to download the dataset directly in Colab/Notebooks.
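In a Colab cell (prefix each command with !), a minimal sketch looks like this; the exact flags and the challenge slug are my assumptions, so check aicrowd --help and the challenge page for the current syntax:

pip install aicrowd-cli
aicrowd login
aicrowd dataset download -c snakeclef2021

The login step will ask for the API key from your AIcrowd profile page.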

I have added an example notebook here:

You can also click on “Open in Colab” button to get started right away!
Let us know in case you face any difficulties.

Cheers,
Shivam

Hi @ibrahim_sherif,

I did a quick implementation, and this might be the end-to-end example you are looking for. :wink:

Please note that in this example I only worried about dataset integration in Colab/Notebooks via the CLI, and have not done end-to-end testing of the baseline.

The baseline needs some love, and you can even share your fixed/improved version in the Notebooks section! :muscle:
I am pretty sure other participants would be interested in it too. :smiley:

Thank you for your help, shivam.

Hello @shivam

The dataset is quite large when downloaded to Colab this way, and I can’t unzip it due to disk constraints. Any advice? Is there a way to download the dataset to Colab already extracted, or perhaps a download link that TensorFlow could fetch from directly, since it can download from a URL and extract on the fly?
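For example, something that streams the download straight into extraction, so the archive itself never sits on disk (the URL here is just a placeholder):

wget -qO- https://example.com/snakeclef2021_train.tar.gz | tar -xzf -

That way only the extracted files would touch the disk.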

Thanks in advance

Hi @ibrahim_sherif,

As we include the original-size images, the whole dataset could be a bit “spacy”.

Would a resized version of the dataset help? I believe I could reduce it to 1/4 of its original size.

Best,
Lukas

Hello picekl,

It depends on the image sizes in the dataset. I think anything lower than 256x256 would degrade the quality. What are the current sizes of the images?

Thanks

Hi @ibrahim_sherif,

It is highly diverse. A lot of them are 12 MP, e.g., 4000x3000. I can start by resizing just the big images; resizing them to 1/4 of their original size should not have a negative impact.
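For anyone who wants to do the same locally, here is a sketch using ImageMagick; the folder path and the 2000 px threshold are only illustrative:

find SnakeCLEF2021/train -name '*.jpg' -exec mogrify -resize '2000x2000>' {} +

The '>' flag makes mogrify skip images that are already smaller, and note that mogrify overwrites files in place.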

LP

If so, that would help greatly.

Thank you

Hello @picekl

Hope you are doing well. Any updates regarding the dataset size issue?

Thanks in advance

Hi Ibrahim,

Just finished the resizing.
Unfortunately, the new archive is 40 GB. Is that useful?

Best,
Lukas

Hello picekl

It is better than the 60 GB. I think it would be easier to handle if you uploaded it in chunks, like two 20 GB files or four 10 GB files, if you have the time.
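For reference, GNU split can produce chunks like that; the archive name and chunk size here are placeholders:

split -b 10G dataset.tar.gz dataset.tar.gz.part

This cuts the archive into 10 GB pieces named dataset.tar.gz.partaa, dataset.tar.gz.partab, and so on.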

Thanks

Hi Ibrahim,

Feel free to download the data from my personal folder: http://ptak.felk.cvut.cz/personal/picekluk/SnakeCLEF2021/
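If you want to grab everything in one go, a wget mirror along these lines should work (the exact file names depend on the folder listing):

wget -r -np -nH --cut-dirs=2 http://ptak.felk.cvut.cz/personal/picekluk/SnakeCLEF2021/

Here -np keeps wget from climbing above the folder, while -nH and --cut-dirs drop the host and path prefix from the local copy.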

Cheers,
Lukas

Hello picekl,

The link doesn’t seem to work.

Thank you so much for your help and response.

The proxy is probably down. Just try again later :slight_smile:
It will work sooner or later.

LP

FYI - It is UP :slight_smile:

Lukas

Thank you, picekl, for your great help. I am currently downloading the dataset and will update you when I make progress. Hopefully I can make my first submission.

Hello @picekl

I downloaded the dataset chunks that you split. Any idea how to combine and extract them? All my attempts failed with a checksum error on the first chunk.

Thanks in advance

Hi Ibrahim,

After splitting a tar archive (or any large file) into chunks on Linux, you can join the pieces back together with the cat command; concatenating the parts in order reproduces the original file byte for byte. The checksum error you saw is expected if you try to extract a single chunk on its own, since the parts are only valid once joined.

To join all the chunks back into one archive:

cat home.tar.bz2.parta* > home.tar.bz2
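If disk space is tight, you can also pipe the joined stream straight into tar so the combined archive never has to be written out; this assumes the parts form a bzip2-compressed tar, as the names above suggest:

cat home.tar.bz2.parta* | tar -xjf -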

Best,
Lukas
