Cryptic submission failed message

gaetan_ramet · May 3, 2021, 9:27pm

Hi,

My latest submissions (133671) are being rejected, but I can’t find any sensible reason why. They fail at the “generate predictions” step, and these are the logs I get:

Selecting runtime language: python
[NbConvertApp] Converting notebook predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 17040 bytes to predict.nbconvert.ipynb

Any idea what went wrong? Is it a timeout or an out-of-memory error of some kind?

jyotish · May 3, 2021, 9:34pm

Hello @gaetan_ramet

Shared the traceback with you privately.

gaetan_ramet · May 4, 2021, 8:52am

Hi @jyotish and thank you for your help!

For anyone interested, it seems to be an out-of-memory issue: the evaluation pods have only 4 GB of RAM.

I think it would make sense to have the same hardware on the training and evaluation machines, since you need to be able to load your models and assets on both, right? It feels odd to be allowed to develop with 14 GB but to test with only 4 GB. What do you think?
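To catch this kind of failure before submitting, one option is to measure the peak memory of the prediction code after loading models and assets and compare it with the evaluation pod's limit. The sketch below is a hypothetical helper, not part of the platform: it assumes a Unix-like runtime (the standard `resource` module is unavailable on Windows), uses the 4 GB figure mentioned above as the budget, and applies an arbitrary 80% headroom factor.

```python
import resource
import sys

# Hypothetical budget: the evaluation pod RAM limit mentioned in this thread (4 GB).
EVAL_RAM_LIMIT_BYTES = 4 * 1024**3


def peak_rss_bytes() -> int:
    """Return the peak resident set size of the current process in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def fits_budget(limit_bytes: int = EVAL_RAM_LIMIT_BYTES, headroom: float = 0.8) -> bool:
    """True if peak usage so far stays below `headroom` times the limit (headroom is an assumption)."""
    return peak_rss_bytes() < headroom * limit_bytes


if __name__ == "__main__":
    # Stand-in for loading your models/assets (~100 MB buffer).
    assets = bytearray(100 * 1024**2)
    print(f"peak RSS: {peak_rss_bytes() / 1024**3:.2f} GB, fits: {fits_budget()}")
```

Running a check like this at the end of a local `predict.ipynb` run gives a rough indication of whether the notebook would be killed on a smaller pod. Peak RSS is only an approximation of what the pod's memory limit enforces.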

jyotish · May 4, 2021, 8:55am

Hello @gaetan_ramet

We have upgraded the evaluation pods to 4 vCPUs and 16 GB of RAM to match the specs of the AD workbench. We are re-evaluating the submissions that may have failed because of this.

If anyone’s submission failed with the message “Inference failed” and the logs say something like “kernel died”, please let us know.

gaetan_ramet · May 4, 2021, 8:57am

Great news! Thanks for acting so quickly!