Submission format for the NLP feature engineering challenge

Hi!
I tried submitting a notebook to the challenge, but it showed an error suggesting a zip file was expected. Uploading the zip file using the command in the starter kit also gave an error. Could you please clarify the submission format? The submission section has very few details.

Thanks.

1 Like

It looks like you were able to make a submission using the baseline. But here are the things you need to take care of -

  • The markdown headings such as Install packages, Define preprocessing code :computer:, and Prediction phase are the most important ones. They should not be removed under any circumstances.


As mentioned above, you will only have access to the internet in the Install packages :card_file_box: markdown section.

After that, only the code under **Define preprocessing code** and **Prediction phase** will be executed to make the predictions. These predictions should be saved in the assets folder as submission.csv.
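As a minimal sketch of that last step (the DataFrame contents here are placeholders; follow the column format the starter kit actually expects):

```python
import os

import pandas as pd

# Make sure the assets folder exists, then write the predictions
# into assets/submission.csv -- the file the evaluator picks up.
os.makedirs("assets", exist_ok=True)
predictions = pd.DataFrame({"feature": ["0.1,0.2", "0.3,0.4"]})  # placeholder rows
predictions.to_csv(os.path.join("assets", "submission.csv"), index=False)
```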

It would also help if you kept the Setup AIcrowd Utilities :hammer_and_wrench: and AIcrowd Runtime Configuration sections as they are, because after you submit your notebook, those environment variables will be used in many different places.

In the starter kit, sections such as Training phase will never be executed! So make sure that your code will still generate the predictions. You can save any pre-trained model in the assets folder, which will be accessible in the Prediction phase.
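One common pattern for this (a sketch, assuming a pickle-able model; the dict below is a stand-in for whatever you actually train):

```python
import os
import pickle

os.makedirs("assets", exist_ok=True)

# Training phase (never executed during evaluation): save whatever
# you trained into assets/ so it travels with the submission.
model = {"weights": [0.1, 0.2, 0.3]}  # stand-in for a real trained model
with open("assets/model.pkl", "wb") as f:
    pickle.dump(model, f)

# Prediction phase (executed during evaluation): assets/ is available,
# so load the model back from there instead of retraining.
with open("assets/model.pkl", "rb") as f:
    loaded = pickle.load(f)
```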

Now, let’s talk about the reasons you got the following errors -

  • DockerBuildError: Failed to install packages. View the submission for more details.

    This error occurred because the required markdown headings were missing from the submitted notebook. Again, markdown headings like Install packages, Define preprocessing code, and **Prediction phase** (in h1) are critical for the submission to work.

  • FileFormatError: The submission file should be a zip file. View the submission for more details.

    You submitted a .ipynb instead of a zip file; the starter kit should create a zip file and submit it automatically. Did you make any changes to the starter kit while submitting?

And thanks for your feedback. I will update the submission section ASAP to have more detailed information :slight_smile:

3 Likes

Thank you so much for the information. It’s clear now.
Yes, those were my mistakes. I was not running it on Colab at first, hence the different format, which may have led to changes in the markdown.
Noted, I’ll keep that unchanged and submit.
Thanks again.

2 Likes

Hi Shubham,
I am still getting the error message DockerBuildError. I am running your starter notebook as it is but still getting the error mentioned.
Thanks,
Akash

I looked into the issue and it seems like this block of code is causing the error -

from google.colab import drive
drive.mount('/content/drive') 

The error is expected because the google.colab library is only available in Google Colab. To make sure your submission goes through without any error, either remove that block of code or wrap it in a try/except like this -

try:
    # google.colab only exists inside the Colab runtime; mounting
    # Drive is skipped everywhere else (including during evaluation)
    from google.colab import drive
    drive.mount('/content/drive')
except Exception:
    pass

Also, if you do access some model files from your Google Drive, save them in the assets folder and read them from the assets folder in the Prediction phase. Let me know if you have any other issues or doubts :slight_smile:

Cheers

Shubhamai

2 Likes

Hi Shubham,
Thanks! Got it now … Damn, the code structure is too rigid :sweat_smile:. Also, I am trying to understand the problem statement -

  1. We have train data with the input feature being text and the output feature being labels. In your starter notebook, you consider the emotion detection dataset as train data and the corresponding label as target. My question is - “Is the emotion detection dataset the train dataset for our problem?”
    I am getting confused because in the descriptions section, it is written -
    " Working on the same Research Paper Dataset you used in the multi-class problem, you will be building a model using the word2vec approach using Tensorflow."

The train, test, and validation datasets are not clear for this problem, to be honest.

  2. We need to find the embeddings in such a way that the F1 score is increased on the test dataset. I can see that datasets.csv is the test data having just 10 observations. Is this the complete data or there are some hidden data for us to generalize our solution?

  3. Also I can see - " Each vector is should only contain 512 elements" in the description. Is it so that we can’t use any SOTA model embeddings (like SBERT) here, which may have more than 512 elements?

  4. Are you going to use any other models, like a Decision Tree Classifier, in the “train_model” function? I mean, how exactly is the F1 score computed on the leaderboard?

Hahaha, actually our main goal with this challenge was to see how well participants can generate features for a piece of text ( embeddings ). If you can extract the meaning of a text really well, in theory, you should get a perfect F1 on the leaderboard.

We have train data with the input feature being text and the output feature being labels. In your starter notebook, you consider the emotion detection dataset as train data and the corresponding label as target. My question is - “Is the emotion detection dataset the train dataset for our problem?”

The Emotion Detection dataset was used to teach different techniques for generating embeddings, such as word2vec or Count Vectorization. We chose it specifically because it is a really simple dataset with binary classification. Basically, we used the Emotion Detection dataset to show how to use different techniques to convert text into embeddings. You can even use the Research Paper Classification dataset if you want.
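As a toy sketch of one of the techniques mentioned above (Count Vectorization; the example sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn short texts into bag-of-words count vectors -- one row per
# text, one column per vocabulary word.
texts = ["i am so happy today", "this news makes me sad"]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts).toarray()
```

The same `fit_transform` / `transform` pattern applies to the real dataset; word2vec-style models would replace the count vectors with dense learned embeddings.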

I am getting confused because in the descriptions section, it is written -
" Working on the same Research Paper Dataset you used in the multi-class problem, you will be building a model using the word2vec approach using Tensorflow."

That seems to be a mistake on our side, we will fix it soon!

  2. We need to find the embeddings in such a way that the F1 score is increased on the test dataset. I can see that datasets.csv is the test data having just 10 observations. Is this the complete data or there are some hidden data for us to generalize our solution?

The data.csv is only provided for testing your notebook. We didn’t want to share the complete dataset because then it would have been quite easy to hard-code the feature generation process.

  3. Also I can see - " Each vector is should only contain 512 elements" in the description. Is it so that we can’t use any SOTA model embeddings (like SBERT) here, which may have more than 512 elements?

You can use BERT. In fact, the reason we kept it to 512 elements is to make the problem a bit harder, because now you will need to generate the features and pack them into only 512 elements. Another reason is that we use a machine learning model to train on the features your code generates against the actual labels. You can read more about the evaluation process, or even see sample evaluation code, here.
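One simple way to meet the 512-element limit when a model outputs more values (for example, a 768-dimensional sentence embedding) is a fixed random projection. This is just an illustrative sketch; the dimensions and seed here are assumptions, and PCA or truncation would also work:

```python
import numpy as np

# A fixed seed gives the same projection on every run, so every text
# is mapped consistently from 768 to 512 dimensions.
rng = np.random.default_rng(0)
projection = rng.normal(size=(768, 512)) / np.sqrt(512)

embedding_768 = rng.normal(size=(768,))   # stand-in for a real embedding
embedding_512 = embedding_768 @ projection
```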

  4. Are you going to use any other models, like a Decision Tree Classifier, in the “train_model” function? I mean, how exactly is the F1 score computed on the leaderboard?

Yes, you can use any other model you like to generate the features. And for the evaluation and how we generate the F1 Score, again, I think this will help a lot.

Let me know if you have any other doubts or questions :slight_smile:

Have Fun!

Shubhamai

2 Likes

Now things are clear. I think out of all the problems given, this one is the most challenging of all. :+1: :+1:

2 Likes

Can you clarify where we should save our models - in ASSETS_DIR or assets?

You will need to save and read your model from the assets directory.

1 Like

I have saved my model in the assets directory, but I am still getting a timeout error and a local evaluation error. Here is the trace:

Using notebook: /content/drive/MyDrive/Colab Notebooks/Copy of AI BLITZ 9 Community🤘.ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
[NbConvertApp] Converting notebook /content/submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] Writing 3113 bytes to /content/submission/install.nbconvert.ipynb
Executing predict.ipynb...
[NbConvertApp] Converting notebook /content/submission/predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
2021-06-16 04:58:14.653518: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[NbConvertApp] ERROR | Timeout waiting for execute reply (30s).
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-nbconvert", line 8, in <module>
sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/nbconvertapp.py", line 338, in start
self.convert_notebooks()
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/nbconvertapp.py", line 508, in convert_notebooks
self.convert_single_notebook(notebook_filename)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/nbconvertapp.py", line 479, in convert_single_notebook
output, resources = self.export_single_notebook(notebook_filename, resources, input_buffer=input_buffer)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/nbconvertapp.py", line 408, in export_single_notebook
output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/exporters/exporter.py", line 179, in from_filename
return self.from_file(f, resources=resources, **kw)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/exporters/exporter.py", line 197, in from_file
return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/exporters/notebook.py", line 32, in from_notebook_node
nb_copy, resources = super(NotebookExporter, self).from_notebook_node(nb, resources, **kw)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/exporters/exporter.py", line 139, in from_notebook_node
nb_copy, resources = self._preprocess(nb_copy, resources)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/exporters/exporter.py", line 316, in _preprocess
nbc, resc = preprocessor(nbc, resc)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/base.py", line 47, in __call__
return self.preprocess(nb, resources)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 381, in preprocess
nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/base.py", line 69, in preprocess
nb.cells[index], resources = self.preprocess_cell(cell, resources, index)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 414, in preprocess_cell
reply, outputs = self.run_cell(cell, cell_index)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 491, in run_cell
exec_reply = self._wait_for_reply(parent_msg_id, cell)
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 483, in _wait_for_reply
raise TimeoutError("Cell execution timed out")
RuntimeError: Cell execution timed out
Local Evaluation Error Error: predict.ipynb failed to execute

How much time does your code take to make predictions on 10 samples? Do you have any iter/sec numbers?

Let me check and tell you.

Here are the logs:

CPU times: user 23.5 s, sys: 887 ms, total: 24.4 s
Wall time: 24.2 s

Also now I get another error,

Assets Directory Error Error: Assets directory should be a direct part of the current directory

I can’t understand this - my assets directory path is /content/assets, directly under the current directory, so why do I get this error!?

This challenge only provides 15 minutes for the prediction phase. It seems your notebook will take far too much time for 30k samples if it’s taking ~24 seconds for 10 samples.

Can you share the whole error and the code/command you ran ?

Can I share it with you in a DM!?

This is the error

Sure, here’s my discord Shubhamai#3553

1 Like