However, once I switch debug mode off, I get the following error on the next submission: "The prediction vocabulary size is 5. The prediction vocabulary has to be >= 60 words in size."
Your generated prediction file has a total vocabulary size of 5, as the error states.
I assume this is happening because it is returning the default prediction generated by your function def default_prediction.
We don’t check the prediction vocabulary on debug submissions because of the smaller test data size. This is why you didn’t get the error upfront in your debug run.
I would suggest letting your code fail in the debug run instead of falling back to default_prediction: removing the wildcard try/except block may help you see the actual error it is failing on. Another alternative is to catch all the exceptions in your codebase and add them to a list, and at the end of the code run (in debug), if len(errors) > 0, print them and raise an Exception.
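The error-collection pattern could look something like the sketch below. The names `molecules`, `predict_one`, and `default_prediction` are hypothetical stand-ins for the corresponding pieces of the starter-kit code:

```python
def run_predictions(molecules, predict_one, default_prediction, debug=False):
    """Collect per-molecule failures instead of silently swallowing them."""
    errors = []
    predictions = []
    for smiles in molecules:
        try:
            predictions.append(predict_one(smiles))
        except Exception as exc:
            # Record which molecule failed and why, rather than only
            # substituting the default prediction.
            errors.append((smiles, repr(exc)))
            predictions.append(default_prediction())
    if debug and errors:
        for smiles, err in errors:
            print(f"Failed on {smiles}: {err}")
        raise RuntimeError(f"{len(errors)} molecules failed; see log above")
    return predictions
```

In a debug run this surfaces every failing molecule at once; in a non-debug run it still falls back to the default so the submission completes.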
Let me know if this helps. We can help you debug further accordingly.
Thank you for the swift reply. I am just using the code from the starter kit to get things running at the moment, and this was the code from the starter kit. I had a look at the submission file and I am not quite sure what "The prediction vocabulary size is 5." really means; there are definitely more than 5 different vocabulary words being used in my submission.
What does "The prediction vocabulary has to be >= 60 words in size." mean? That the prediction for each molecule has to be more than 60 words? That does not make sense…
The local version of my submission.csv file supposedly contains all 75 words in the vocabulary, but I guess the leaderboard testing only uses a subset of all test molecules.
Yes, I can confirm it is an issue with the current FingerprintPrediction predictions; it looks like that approach is no longer valid in Round 3 without some more work!
I don’t quite understand where the issue lies. Is it because there are additional molecules in the test set that are not shown in test.csv, so the lookup of PubChem fingerprints from a static file no longer works?
Thanks. For me, I basically would like to know whether using pre-calculated features for each molecule in train.csv and test.csv would still work in the submission.
@hjuinj: the test set for this round contains numerous molecules which are not present in the test sets of previous rounds. Hence precomputed features will not work here.
You will have to ensure you can compute features on the fly before you make the predictions.
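The difference can be sketched as below. The toy `char_count_features` function is a hypothetical stand-in for a real fingerprint routine (e.g. an RDKit or PubChem fingerprint call); the point is that features are computed per molecule at prediction time rather than looked up in a static file:

```python
from collections import Counter

# Toy featurizer: counts of selected SMILES characters. A real pipeline
# would call a proper fingerprint routine here instead.
ALPHABET = "CNOSPFclBrI=#()[]123456789"

def char_count_features(smiles, alphabet=ALPHABET):
    counts = Counter(smiles)
    return [counts.get(ch, 0) for ch in alphabet]

def featurize_on_the_fly(test_smiles):
    # No static lookup table: every molecule, including ones unseen in
    # previous rounds, gets its features computed right here.
    return {s: char_count_features(s) for s in test_smiles}
```

Because nothing is precomputed, a previously unseen molecule can never trigger a missing-key failure the way a static lookup table can.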
Hello, I have a related problem with my submission in non-debug mode:
Using my code I can process all the SMILES strings of the train set, but once the program runs on the non-debug submission set, the code fails to process some of the strings.
I suspect the error is probably due to some special character in the test-set SMILES (like a backslash). In any case, I find it very strange that this error doesn't show up for any of the SMILES in the train set, yet occurs multiple times during the test.
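One quick way to confirm this suspicion is to compare character sets between the two splits. Backslashes are legitimate in SMILES (directional bonds such as F/C=C\F), but they can also be mangled by escape handling when files are read. The sketch below (with illustrative SMILES lists, not the actual challenge data) reports every character that appears in a test string but never in the train set:

```python
def unseen_chars(train_smiles, test_smiles):
    """Map each test SMILES to the characters it uses that the train set never does."""
    train_chars = set().union(*map(set, train_smiles)) if train_smiles else set()
    return {s: set(s) - train_chars
            for s in test_smiles
            if set(s) - train_chars}

# Illustrative data: the train strings have no stereo-bond characters.
train = ["CCO", "c1ccccc1", "CC(=O)O"]
test = ["F/C=C\\F", "CCN"]
print(unseen_chars(train, test))
```

Any string flagged here is a good candidate for the failures you are seeing; if a backslash shows up, it is also worth checking whether your file-reading code applies escape processing to it.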