According to the round 3 announcement, I understand that all the vocabulary items we should predict are in round3-vocabulary.txt. (So we don't need to predict all 109 vocabulary items, is that right?)
I checked round3-vocabulary.txt, but I found two strange lines: dty and hyacinthvegetable. Are these correct vocabulary items? They are not included in vocabulary.txt.
It seems that dty is a typo of dry, and that hyacinthvegetable is hyacinth and vegetable joined together.
@hjuinj: train.csv is something we have used consistently across the previous rounds, so changing it could create confusion.
If filtering out the invalid vocab items for round 3 is adding friction, we would be happy to upload a filtered version.
We will update this thread as soon as it's done.
Thanks for the swift reply. It's okay, I can do the filtering myself.
I just thought it made sense to exclude labels outside the actual vocab, but I understand your reasoning.
Question: have you made sure that in round 3, with the reduced vocab set, the out-of-vocab labels have also been removed from the test set's reference answers?
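The filtering itself is straightforward. A minimal sketch, assuming labels are plain strings and the allowed vocabulary has been loaded into a set (the helper name is hypothetical, not from the starter kit):

```python
def filter_to_vocab(labels, vocab):
    """Drop any label that is not in the allowed vocabulary,
    preserving the original order of the remaining labels."""
    return [label for label in labels if label in vocab]
```

The same function can be applied to both the training labels and, if needed, the reference answers, so predictions and references are scored against an identical label set.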
There seems to be a limit on the size of the model files that I can push. What is the upper limit? Is it a limit on the total size, or a limit per model file?
Yes, in this challenge only the testing phase is run online; you need to do the training offline and push your models.
Yes, the normal maximum file size is 50 MB. For uploading larger files, you can submit files up to several GB in size via git lfs. In case you are not familiar with it, you can check the quick help doc here: How to upload large files (size) to your submission
Can I follow up on the first point: what is the issue with doing the training and testing online? Is it because I would not be able to write the trained model inside the container?
Running the training phase online would mean imposing limitations on participants, in terms of the time available to train, resource requirements for this challenge, and so on.
In this challenge we are not concerned with the time taken or other factors in the training phase, which is why it is kept offline, giving participants the flexibility to play around with the data, build up their models, and so on in their familiar environment.
In case you do not want to train on your own system, you can make use of the free compute available via Google Colab and submit directly from Colab.
Meanwhile, this challenge has some awesome community-contributed notebooks which can give you a pre-set-up environment too!