I am trying out the submission pipeline and submitted successfully with debug set to true in the JSON file in this submission:
However, once I switch debug mode off, I get the following error in the next submission:
The prediction vocabulary size is 5. The prediction vocabulary has to be >= 60 words in size.
I am not quite sure if there is a bug or if this is a problem with my code. Could you please take a look? Thank you.
Your generated prediction file has a total vocabulary size of 5, as mentioned in the error.
I assume it is happening because it is returning the default prediction generated from your function
We don’t check the prediction vocabulary in debug submissions, because the test data is smaller there. This is why you didn’t get the error in your debug run.
I would suggest letting your code fail in the debug run rather than returning default_prediction from a wildcard
try/except block; this may help reveal the error it is failing on. Another alternative is to catch all the exceptions in your codebase, add them to a list, and at the end of the code run (in debug)
if len(errors)>1, print them and
Let me know if this helps. We can help you debug further accordingly.
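The suggestion above (collect exceptions in a list and surface them in debug runs) could be sketched roughly as follows. This is only an illustration: `predict_all`, `predict_one`, the `DEBUG` flag, and the stand-in `default_prediction` are hypothetical names, not the starter kit's actual API, and the sketch raises as soon as at least one error was collected.

```python
import traceback

DEBUG = True  # assumption: set this to match the debug flag of your submission


def default_prediction():
    # Hypothetical fallback, standing in for the starter kit's default prediction.
    return "odor_a,odor_b,odor_c"


def predict_all(smiles_list, predict_one, debug=DEBUG):
    """Predict for every molecule, remembering every failure along the way."""
    predictions, errors = [], []
    for smiles in smiles_list:
        try:
            predictions.append(predict_one(smiles))
        except Exception as exc:
            # Record the failure instead of hiding it behind the fallback.
            errors.append((smiles, repr(exc), traceback.format_exc()))
            predictions.append(default_prediction())
    if debug and errors:
        # In debug runs, print what went wrong and fail loudly.
        for smiles, summary, _trace in errors:
            print(f"failed on {smiles!r}: {summary}")
        raise RuntimeError(
            f"{len(errors)} molecule(s) fell back to the default prediction"
        )
    return predictions, errors
```

With `debug=False` the function behaves like the wildcard fallback but still returns the list of errors, so they can be logged at the end of the run.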
Thank you for the swift reply. I am just using the code from the starter kit to get things running at the moment, and this was the code from the starter kit. I had a look at the submission file and I am not quite sure what
"The prediction vocabulary size is 5." really means; there are definitely more than 5 different words being used in my submission.
What does it mean that
"The prediction vocabulary has to be >= 60 words in size."? That the prediction for each molecule has to be more than 60 words? That does not make sense…
I can quickly check the starter kit’s code for any fault.
It means that, overall, at least 60 words need to be present in your submission’s vocabulary, i.e. across all the molecules (not for each molecule).
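To check this locally, one could count the distinct words across the whole prediction file. The sketch below is an assumption-laden illustration: the column name `PREDICTIONS`, the `;` separator between sentences, and the `,` separator between words are guesses about the submission format, and `prediction_vocabulary` is a hypothetical helper, not the evaluator's code.

```python
import csv
import io


def prediction_vocabulary(csv_text, column="PREDICTIONS"):
    """Return the set of distinct words used across all rows of the file."""
    vocab = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed layout: sentences separated by ';', words separated by ','.
        for sentence in row[column].split(";"):
            for word in sentence.split(","):
                word = word.strip()
                if word:
                    vocab.add(word)
    return vocab


# Tiny made-up sample standing in for a real submission.csv.
sample = (
    "SMILES,PREDICTIONS\n"
    'CCO,"fruity,sweet,floral;woody,green,herbal"\n'
    'CCN,"fruity,sweet,floral;fishy,green,herbal"\n'
)
vocab = prediction_vocabulary(sample)
print(len(vocab), sorted(vocab))
```

The count is taken over the union of all rows, so a file in which every molecule gets the same few words will have a small vocabulary no matter how many molecules it contains.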
The local version of my submission.csv file supposedly contains all 75 words in the vocabulary. But I guess the testing for the leaderboard only uses a subset of all test molecules.
Yup. Along with that, the approach in the fingerprint baseline depends on the molecule being present in the PubChem database, which may or may not be the case.
You can read about the fingerprint approach here:
thanks, do you reproduce the same issue when running the
Yes, I can confirm it to be an issue with the current
FingerprintPrediction predictions. It looks like that approach is no longer valid in Round 3 without some more work!
I don’t quite understand where the issue lies. Is it because there are additional molecules in the test set that are not shown in test.csv, so that the lookup of PubChem fingerprints from a static file no longer works?
I will check how much information (if any) we can share, based on publicly available information, about Round 3’s
test.csv and get back!
Thanks. For my part, I would basically like to know whether using pre-calculated features for each molecule in train.csv and test.csv would still work in the submission.
@hjuinj: the test set for this round contains numerous molecules which are not present in the test sets of the previous round. Hence precomputed features will not work here.
You will have to ensure you can do feature computation on the fly before you make the predictions.
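One way to structure on-the-fly feature computation is to treat any precomputed table as a cache only and compute features whenever a molecule is missing from it. The sketch below is a minimal illustration under stated assumptions: `PRECOMPUTED`, `compute_features`, and `get_features` are hypothetical names, and the character-count featurizer is a toy stand-in for a real one (for example a fingerprint computed by a cheminformatics library).

```python
from collections import Counter

# Hypothetical cache of features computed offline for already-known molecules.
PRECOMPUTED = {"CCO": {"C": 2, "O": 1}}


def compute_features(smiles):
    # Toy stand-in featurizer: counts of each character in the SMILES string.
    # Replace with the real feature computation used by your model.
    return dict(Counter(smiles))


def get_features(smiles, cache=PRECOMPUTED):
    """Return cached features if available, otherwise compute and cache them."""
    features = cache.get(smiles)
    if features is None:
        features = compute_features(smiles)
        cache[smiles] = features
    return features


print(get_features("CCO"))  # served from the cache
print(get_features("CCN"))  # unseen molecule, computed on the fly
```

The important property is that an unseen molecule never raises a lookup error: the cache only saves time, and correctness does not depend on it.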
Hello, I have a related problem with my submission in non-debug mode:
using my code I can process all the SMILES strings of the train set, but once the program runs on the set of the non-debug submission, the code fails to process some of the strings.
I suspect that the error is due to some special character in the test set SMILES (like a backslash). In any case, I find it very strange that this error doesn’t show up for any of the SMILES in the train set but shows up multiple times during the test.
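A quick way to test the special-character hypothesis is to list, for each incoming string, the characters that never occur in the training SMILES. The sketch below assumes you can log from inside the run; `find_unseen_characters` and the example strings are hypothetical, not taken from the challenge data.

```python
def find_unseen_characters(train_smiles, test_smiles):
    """Map each test string to the characters that never occur in training."""
    known = set("".join(train_smiles))
    report = {}
    for smiles in test_smiles:
        extra = set(smiles) - known
        if extra:
            report[smiles] = sorted(extra)
    return report


# Made-up examples; the raw string keeps the backslash literal.
train = ["CCO", "C/C=C/C", "c1ccccc1"]
test = ["CCN", r"F/C=C\F", "CCO"]
for smiles, chars in find_unseen_characters(train, test).items():
    # repr() makes backslashes and other special characters visible in logs.
    print(repr(smiles), chars)
```

If the report stays empty for the failing strings, the cause is more likely in how the strings are parsed or escaped than in their character set.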
Hi, I’ve got an image build problem with submission #118160.
When I run the command below on my local machine, it works fine.
aicrowd-repo2docker \
  --no-run \
  --user-id 1001 \
  --user-name aicrowd \
  --image-name sample_aicrowd_build_9f63e \
  --debug \
  .
Does anyone have an idea how to fix this?