Colab for this competition

picekl · March 11, 2021, 9:17am

Dear Ibrahim,

To be honest, we do not have any experience running it on Google Colab. I believe that you need to download the data to your local machine and upload it to your Google Drive. Then you should be able to access the data from Google Colab. To submit your solution, you will need to upload your end-to-end system to our local GitLab repository, as we have a private test set.

Hope it helped.

Best,
Lukas

ibrahim_sherif · March 11, 2021, 9:32am

Thank you for your reply, picekl.

The train file is about 60GB, so downloading and re-uploading it isn't really feasible. I saw a previous post where someone used the dataset on Kaggle successfully, and a tutorial for a different competition that used Colab, so hopefully there is a way around it.

Best regards

shivam · March 11, 2021, 9:32am

Hi @ibrahim_sherif,

Yes, AIcrowd supports dataset download via its CLI.
You can use it to download the dataset directly in Colab or other notebooks.

I have added an example notebook here:

You can also click the “Open in Colab” button to get started right away!
Let us know in case you face any difficulties.

Cheers,
Shivam

shivam · March 11, 2021, 9:59am

Hi @ibrahim_sherif,

I did a quick implementation, and this might be the end-to-end example you are looking for.

Please note that this example only covers dataset integration on Colab/Notebook via the CLI; I have not tested the baseline end to end.

The baseline needs some love, and you can even share your fixed or improved version in the Notebooks section!
I am pretty sure other participants would be interested in it too.

ibrahim_sherif · March 16, 2021, 10:25pm

Thank you for your help, shivam.

ibrahim_sherif · March 17, 2021, 10:06am

Hello @shivam

The dataset is quite large when downloaded to Colab this way, and I can't unzip it due to disk constraints. Any advice? Is there a way to download the dataset to Colab already unzipped, or perhaps a download link that I could use with TensorFlow, which can download from a URL and extract the archive directly?
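
One way to reduce disk usage is to stream the archive straight into the extractor, so the compressed file and the extracted files never sit on disk at the same time. Below is a minimal sketch, assuming the archive is a gzip-compressed tar file reachable through a direct URL; neither is confirmed in this thread, and the function name `stream_extract` is hypothetical.

```shell
# Stream a remote archive into tar so the archive itself is never
# written to disk; only the extracted files take up space.
# Assumption: the archive is a gzip-compressed tar file (hence -z).
stream_extract() {
    url="$1"     # direct download URL (placeholder, supply your own)
    dest="$2"    # directory to extract into
    mkdir -p "$dest"
    # -f fails on HTTP errors, -sS hides progress but keeps error
    # messages, -L follows redirects.
    curl -fsSL "$url" | tar -xzf - -C "$dest"
}
```

In a Colab cell this could be called as `stream_extract "<archive URL>" /content/data`. The extracted images still have to fit on the disk, so this only helps when the archive and its contents do not fit together.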

Thanks in advance

picekl · March 17, 2021, 10:20am

Hi @ibrahim_sherif,

As we include the original size images, the whole dataset could be a bit “spacy”.

Would a resized version of the dataset help? I believe I could reduce the size to 1/4 of its original.

Best,
Lukas

ibrahim_sherif · March 17, 2021, 1:06pm

Hello picekl,

It depends on the image size of the dataset. I think anything lower than 256x256 would degrade the quality. What are the current sizes of the images?

Thanks

picekl · March 17, 2021, 1:45pm

Hi @ibrahim_sherif,

The sizes are highly diverse. Many images are 12MP, e.g., 4000x3000. I can start by resizing just the big images. Resizing them to 1/4 of their original size should not have a negative impact.

LP

ibrahim_sherif · March 17, 2021, 1:51pm

If so, that would help greatly.

Thank you

ibrahim_sherif · March 28, 2021, 9:22pm

Hello @picekl

Hope you are doing well. Are there any updates regarding the dataset size problem?

Thanks in advance

picekl · March 30, 2021, 7:32pm

Hi Ibrahim,

Just finished the resizing.
Unfortunately, the new archive has 40G. Is it useful?

Best,
Lukas

ibrahim_sherif · March 31, 2021, 9:33am

Hello picekl

It is better than the 60G. I think it would be easier to handle if you uploaded it in chunks, like two 20G files or four 10G files, if you have the time.
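
For reference, chunking of this kind could be done along the following lines. This is only a sketch under assumptions: the helper names are hypothetical, the `10G` size suffix relies on GNU `split`, and `sha256sum` is the GNU coreutils tool. It is not necessarily how the organizers produced their files.

```shell
# Split a large file into fixed-size chunks named
# <file>.partaa, <file>.partab, ... (the default split suffixes).
split_archive() {
    file="$1"          # archive to split
    chunk_size="$2"    # size per chunk, e.g. 10G with GNU split
    split -b "$chunk_size" "$file" "$file.part"
}

# Print one SHA-256 checksum line per chunk, so that each downloaded
# chunk can be verified before the chunks are joined again.
checksum_chunks() {
    sha256sum "$@"
}
```

For example, `split_archive train.tar.gz 10G` followed by `checksum_chunks train.tar.gz.part* > SHA256SUMS` (the file name `train.tar.gz` is a placeholder). Whoever downloads the chunks can then run `sha256sum -c SHA256SUMS` to spot a damaged chunk.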

Thanks

picekl · March 31, 2021, 11:59am

Hi Ibrahim,

Feel free to download the data from my personal folder: http://ptak.felk.cvut.cz/personal/picekluk/SnakeCLEF2021/

Cheers,
Lukas

ibrahim_sherif · March 31, 2021, 3:01pm

Hello picekl,

The link doesn’t seem to work.

Thank you so much for your help and response.

picekl · March 31, 2021, 3:51pm

The proxy is probably down. Just try again later.
It will work sooner or later.

LP

picekl · April 1, 2021, 8:01am

FYI - It is UP

Lukas

ibrahim_sherif · April 1, 2021, 10:41am

Thank you, picekl, for your great help. I am currently downloading the dataset and will update you when I have made some progress. Hopefully I can make my first submission.

ibrahim_sherif · April 11, 2021, 1:16pm

Hello @picekl

I downloaded the dataset chunks that you split. Any idea how to combine and extract them? All my attempts failed with a checksum-related error in the first chunk.

Thanks in advance

picekl · April 11, 2021, 4:26pm

Hi Ibrahim,

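After a tar archive (or any other large file) has been split on Linux, you can join the pieces again with the cat command. It is the simplest and most reliable way to do it, as long as the chunks are concatenated in their original order. The shell expands the * wildcard in sorted order, so a single wildcard usually takes care of that.

To join all the chunks back together, run the command below, replacing the example file names with the actual chunk names:

cat home.tar.bz2.parta* > home.tar.bz2.joined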
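
A checksum error after joining often points to a chunk that is missing, was cut short by an interrupted download, or was concatenated out of order. The sketch below joins the chunks, checks that the result is readable, and alternatively extracts straight from the chunks without writing the joined archive to disk. It assumes a gzip-compressed tar archive (use `-j` instead of `-z` for bzip2), and the helper names are hypothetical.

```shell
# Join chunks into a single archive. Pass the chunks in their original
# order; a glob such as data.tar.gz.part* is expanded in sorted order.
join_chunks() {
    out="$1"    # path of the joined archive to create
    shift       # remaining arguments are the chunk files
    cat "$@" > "$out"
}

# List the archive contents without extracting anything. A non-zero
# exit status means the joined archive is damaged or incomplete.
verify_archive() {
    tar -tzf "$1" > /dev/null
}

# Extract directly from the chunks, skipping the joined copy entirely.
# This saves disk space equal to the size of the whole archive.
extract_chunks() {
    dest="$1"   # directory to extract into
    shift       # remaining arguments are the chunk files
    mkdir -p "$dest"
    cat "$@" | tar -xzf - -C "$dest"
}
```

For example, `join_chunks data.tar.gz data.tar.gz.part*` followed by `verify_archive data.tar.gz`, or `extract_chunks data data.tar.gz.part*` when disk space is tight (the file names are placeholders). If the check fails, comparing the size of each downloaded chunk with the size shown on the server is a quick way to find the incomplete one.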

Best,
Lukas