Submissions get killed without any error message

On my latest attempts to get things running on the evaluation server, I ran into a number of issues, some of which I managed to fix, but the current error baffles me. Things now get killed without any interpretable error message (at least none that is visible to me in the agent log). Everything works fine on the training server, but the agent log on the evaluation server says:
```
2019-12-31T23:58:12.255439177Z /home/aicrowd/run.sh: line 9: 11 Killed Rscript predict.R
```

Is this a memory limit issue like the one described in another thread, @shivam, or something else? Any help getting this working would be appreciated.

PS: It would be good to make an official announcement about the changes to the AICROWD_TRAIN_DATA_PATH environment variable on the evaluation server. It would also have been good to warn people about the changes to the training data file there: it seems to no longer match the training server, since read_csv now gives different results than both before the latest changes and in the training environment. Those changes took me completely by surprise; I had expected the only difference from before to be a scrambled/randomized row_id column. It's quite tedious to figure such things out via debug submissions (it took me 8 submissions, roughly 2 hours, given the slowness of the submission process).
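In case it helps others, reading the path from the environment variable and pinning the column types, rather than relying on read_csv's type guessing, at least makes such data changes fail loudly instead of silently parsing differently. A minimal sketch, assuming the readr package; the fallback path and the row_id type are assumptions, not the actual schema:

```r
library(readr)

# Take the path from the evaluator's environment variable, with a local
# fallback for the workspace environment (the fallback path is a guess).
train_path <- Sys.getenv("AICROWD_TRAIN_DATA_PATH", unset = "data/train.csv")

# Pin the column types instead of letting read_csv guess them, so a changed
# file warns or errors rather than quietly parsing differently.
train <- read_csv(
  train_path,
  col_types = cols(
    row_id   = col_character(),  # assumed type, adjust to the real schema
    .default = col_guess()
  )
)
```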

@shivam, any advice?

@shivam, @kelleni2 any updates? I have not heard anything back despite reaching out via multiple channels, and I have not received any information on the GitLab issue, either. Unless this gets resolved, the feature engineering part of our pipeline will not run on the evaluation server, and you will get a solution with that part done on the training server instead.

Hi @bjoern.holzhauer,

Sorry for missing your query earlier.

Yes, the Killed message refers to an OOMKill of your submission. It happens when your code breaches the memory limit, i.e. ~5.5 GB, during evaluation.

The training data on the evaluation server is the same as in the workspace except for the row_id column, which was changed; I announced this on the day of the change.

Can you share your latest submission ID, the one you think is only getting stuck due to the OOM issue? I can debug it for you and point out the part of the code that is causing the high memory usage.

Hi @shivam, the one linked above is the latest one. I've profiled my code and it definitely exceeds 5.5 GB by far in several places (various string processing steps and some full_joins). Is there any possibility to get more RAM? The same amount as on the training server would be the obvious choice.
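For anyone wanting to check this in their own code, a minimal sketch of one way to locate such peaks with base R's gc() and its "max used" counters; clean_strings and join_features are placeholder names, not my actual functions:

```r
# Report the peak memory used since the last reset, then reset the counters.
report_peak <- function(label) {
  g <- gc()
  # The last column of gc()'s output is the "max used" figure in Mb.
  message(sprintf("%s: peak ~%.1f GB", label, sum(g[, ncol(g)]) / 1024))
  invisible(gc(reset = TRUE))  # reset the counters for the next step
}

gc(reset = TRUE)
train <- clean_strings(train)   # placeholder for a string processing step
report_peak("string processing")
train <- join_features(train)   # placeholder for a full_join step
report_peak("full_join step")
```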

We are using the Kubernetes cluster from the organisers, which has 8 GB base machines, and AKS has quite a hard eviction policy, due to which it kills code as soon as it reaches 5.5 GB.

The best option might be to see whether your RAM usage can be reduced by downcasting variables.
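To illustrate what I mean by downcasting, a minimal sketch in base R; df and the heuristics are placeholders, and the savings depend entirely on your data:

```r
# Downcast columns where it is safe: doubles that only hold whole numbers
# become integers (8 -> 4 bytes per value), and low-cardinality character
# columns become factors (integer codes instead of full character vectors).
downcast <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.double(x) &&
        all(abs(x) < .Machine$integer.max, na.rm = TRUE) &&
        all(x == trunc(x), na.rm = TRUE)) {
      df[[nm]] <- as.integer(x)
    } else if (is.character(x) && length(unique(x)) < length(x) / 2) {
      df[[nm]] <- factor(x)
    }
  }
  df
}

print(object.size(df), units = "GB")  # before
df <- downcast(df)
print(object.size(df), units = "GB")  # after
```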

Meanwhile, @kelleni2, @laurashishodia, is it possible to change the underlying nodes in the AKS cluster from Standard_F4s_v2 to nodes with more RAM? I am seeing OOM issues for multiple teams (at least 3-4).