Yes, the row_id values (5, 6, 7 in the test data provided to you on the workspace) can be anything, say 1000123, 1001010, 100001 (and in random order), in the test data present on the server going forward; this is how we know predictions are actually being computed during evaluation.
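For example, a minimal sketch (test.csv, model, and X_test are hypothetical stand-ins for your own code) that takes the row_id values from the test file as given, rather than assuming fixed values or order:

>>> import pandas as pd
>>> test = pd.read_csv('test.csv')  # hypothetical file name
>>> # carry row_id through as-is; do not assume 5, 6, 7 or any particular ordering
>>> submission = pd.DataFrame({'row_id': test['row_id'],
...                            'prediction': model.predict(X_test)})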
To use nltk for the evaluation, you need to provide an nltk_data folder in your repository root, which can be done as follows (current working directory: your repository root):
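For example, a minimal sketch using NLTK's downloader (punkt is just a placeholder; substitute whichever corpora your code actually needs):

>>> import nltk
>>> nltk.download('punkt', download_dir='nltk_data')  # writes into ./nltk_data

Depending on how paths are resolved at runtime, your code may also need to point NLTK at this folder, e.g. nltk.data.path.append('nltk_data').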
One more clarification: even though the row_id of the test data may differ on the evaluation cluster, we hope the rest of the content is intact, as made available at the shared path testing_data_full.
This demonstrates what is happening in your submission, along with alternatives you can use (the variable name has been changed so that no information about the feature used becomes public):
(suggested way #1, decently memory efficient)
>>> something_stack_2 = pd.get_dummies(something_stack)
>>> something_stack_2.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 38981 entries, (0, 0) to (8690, 2)
Columns: 4589 entries, to <removed>
dtypes: uint8(4589)
memory usage: 170.7 MB
(suggested way #2, most memory efficient, slower than #1)
>>> something_stack_2 = pd.get_dummies(something_stack, sparse=True)
>>> something_stack_2.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 38981 entries, (0, 0) to (8690, 2)
Columns: 4589 entries, to <removed>
dtypes: Sparse[uint8, 0](4589)
memory usage: 304.8 KB
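As a side note, if a downstream library expects a scipy matrix, the sparse frame from #2 can be handed over without densifying it; a small sketch (assumes scipy is installed):

>>> # returns a scipy.sparse COO matrix; no dense array is materialized
>>> X = something_stack_2.sparse.to_coo()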
(what your submission is doing -- ~5 GB of memory was available at this time)
>>> something_stack_2 = something_stack.str.get_dummies()
Killed
NOTE: The only difference between the two approaches is that Series.str.get_dummies uses “|” as the separator by default. In case you were relying on that, you can do something like below:
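For instance, a sketch that reproduces the “|”-splitting behaviour via the memory-friendly pd.get_dummies route (assuming the same two-level stacked index as in the examples above):

>>> # split on "|" first, put one token per row, then one-hot encode
>>> exploded = something_stack.str.split('|').explode()
>>> # collapse the duplicated index rows back to one row per original entry
>>> something_stack_2 = pd.get_dummies(exploded).groupby(level=[0, 1], sort=False).max()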
Let us know in case the problem continues after changing this (here and anywhere else it is used in your codebase); we will be happy to debug further.
Is there any way for users to see the cause of the error by themselves? It feels odd to have to post here to learn the reason for an error every single time, without any clue from a traceback.
Sorry for the sed issue; we were trying to apply an automated patch to user code for the row_id change, which went wrong. I have undone this and requeued all the submissions affected by it.