🚀 Round 2 Launched!

mohanty · November 23, 2020, 7:14am

Round 2 is live!

There are some important changes in this Round and we are excited to see how you tackle them!

1. Round 2 submissions are code based
2. You can choose your own subset of the vocabulary now

Code-based Submissions

Unlike Round 1, which was CSV-based, in this round you will have to submit your full code, which will be run on our evaluation infrastructure.

Each submission will have access to the following resources during evaluation:

4 CPU cores
16 GB RAM
1 NVIDIA K80 (optional, needs to be enabled in aicrowd.json)

All submissions will have a 10-minute setup time for loading their models and doing any preprocessing they need. After that, they are expected to make a single prediction in less than 1 second (per SMILES string).
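A minimal sketch of how one might check both time budgets locally before submitting. The `DummyPredictor` class and its `setup`/`predict` methods are hypothetical stand-ins, not the starter kit's actual interface; only the 10-minute and 1-second limits come from the announcement.

```python
import time

SETUP_BUDGET_S = 10 * 60   # 10-minute setup budget from the announcement
PREDICT_BUDGET_S = 1.0     # less than 1 second per SMILES string


class DummyPredictor:
    """Hypothetical stand-in for a real model wrapper."""

    def setup(self):
        # A real submission would load models and run preprocessing here.
        self.words = ["fruity", "woody", "sweet"]

    def predict(self, smiles):
        # A real model would use the SMILES string; this one ignores it.
        return ",".join(self.words)


def check_time_budgets(predictor, smiles_list):
    """Return (setup_seconds, slowest_prediction_seconds).

    Raises RuntimeError as soon as either budget is exceeded.
    """
    start = time.perf_counter()
    predictor.setup()
    setup_s = time.perf_counter() - start
    if setup_s > SETUP_BUDGET_S:
        raise RuntimeError(f"setup took {setup_s:.1f}s")

    slowest = 0.0
    for smiles in smiles_list:
        start = time.perf_counter()
        predictor.predict(smiles)
        elapsed = time.perf_counter() - start
        if elapsed >= PREDICT_BUDGET_S:
            raise RuntimeError(f"prediction for {smiles} took {elapsed:.2f}s")
        slowest = max(slowest, elapsed)
    return setup_s, slowest


setup_s, slowest = check_time_budgets(DummyPredictor(), ["CCO", "c1ccccc1"])
print(f"setup: {setup_s:.4f}s, slowest prediction: {slowest:.4f}s")
```

Timings measured on a local machine will differ from those on the evaluation hardware, so it seems wise to leave some headroom.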

Check out this starter kit to get started, and make your first submission!

Choose your own vocabulary

For Round 2, you can choose a subset of the whole vocabulary (composed of 109 smell words) and create your own, if you believe that it improves your accuracy.

Read on to understand how it works.

Let's define:

voc_gt (the ground truth vocabulary): the set of smell words in the actual challenge dataset, i.e. the 109 distinct smell words present in the training set and test set of Round 1.
voc_x (the submission vocabulary): a subset of voc_gt on which participants choose to train their models and from which they sample their predictions. voc_x has to be composed of at least 60 distinct smell words. It is estimated as the set of all distinct smell words used across all the predictions made by the model.
model_compression: we define the model compression as
1 - [len(voc_x) / len(voc_gt)].
For every 1% of model compression, we expect an improvement in accuracy of at least 0.5%.
top_5_TSS_voc_x, top_2_TSS_voc_x: the top_5_TSS and top_2_TSS computed using the vocabulary used by the participants. When computing these metrics, any smell word which is not present in voc_x is removed from the ground truth sentences.
- top_5_TSS: the Jaccard Index computed using the top-5 sentences in comparison to the ground truth (as described for Round 1 above)
- top_2_TSS: the Jaccard Index computed using the top-2 sentences in comparison to the ground truth (as opposed to top 5 for top_5_TSS)
top_5_TSS_voc_gt, top_2_TSS_voc_gt: the top_5_TSS and top_2_TSS computed using the vocabulary present in the ground truth data. These are exactly the same as top_5_TSS and top_2_TSS.
adjusted_top_5_TSS, adjusted_top_2_TSS: finally, the adjusted scores, which are computed like this:

if (top_5_TSS_voc_x - top_5_TSS_voc_gt) >= 0.5 * model_compression:
    adjusted_top_5_TSS = top_5_TSS_voc_x
    adjusted_top_2_TSS = top_2_TSS_voc_x
else:
    adjusted_top_5_TSS = top_5_TSS_voc_gt
    adjusted_top_2_TSS = top_2_TSS_voc_gt

So, if the improvement in top-5 accuracy between voc_x and voc_gt is at least the expected 0.5 * model_compression, then we use the improved voc_x scores; otherwise we use the original voc_gt scores.
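To make the difference between the voc_x and voc_gt variants of the metric concrete, here is a rough sketch. It assumes that the score for one molecule is the best Jaccard index among the first k predicted sentences, and that two empty word sets count as a perfect match; both are assumptions of this sketch, and the Round 1 description remains the authoritative definition. The smell words and predictions are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index between two collections of smell words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(a & b) / len(a | b)


def top_k_tss(predicted_sentences, ground_truth, vocabulary=None, k=5):
    """Score one molecule against its ground truth sentence.

    Assumption: the score is the best Jaccard index among the first k
    predicted sentences. If a vocabulary is given, ground truth words
    outside it are removed first, as described for the *_voc_x metrics.
    """
    if vocabulary is not None:
        ground_truth = [w for w in ground_truth if w in vocabulary]
    return max(jaccard(s, ground_truth) for s in predicted_sentences[:k])


def submission_vocabulary(all_predictions):
    """Estimate voc_x as every distinct word used across all predictions."""
    return {
        word
        for sentences in all_predictions
        for sentence in sentences
        for word in sentence
    }


# Illustrative data: one molecule, two predicted sentences.
ground_truth = ["fruity", "sweet", "musty"]
predictions = [["fruity", "sweet"], ["woody"]]

voc_x = submission_vocabulary([predictions])
score_gt = top_k_tss(predictions, ground_truth)                   # full vocabulary
score_x = top_k_tss(predictions, ground_truth, vocabulary=voc_x)  # restricted

print(sorted(voc_x), round(score_gt, 3), score_x)
```

In this toy case the model never predicts "musty", so that word is dropped from the ground truth under voc_x and the score rises from 2/3 to 1.0.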

The leaderboard is sorted based on adjusted_top_5_TSS as the primary score, and the adjusted_top_2_TSS as the secondary score.
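As a worked illustration of the adjustment rule and the leaderboard ordering, here is a small sketch. The team names and all scores are invented; only the formula for model_compression, the 0.5 factor, the 109-word vocabulary, and the primary/secondary sort order come from the announcement. It assumes higher scores rank first.

```python
VOC_GT_SIZE = 109  # number of smell words in the ground truth vocabulary


def model_compression(voc_x_size, voc_gt_size=VOC_GT_SIZE):
    """1 - len(voc_x) / len(voc_gt), as defined in the announcement."""
    return 1 - voc_x_size / voc_gt_size


def adjusted_scores(top5_x, top2_x, top5_gt, top2_gt, compression):
    """The top-5 improvement decides which pair of scores is used."""
    if (top5_x - top5_gt) >= 0.5 * compression:
        return top5_x, top2_x
    return top5_gt, top2_gt


# Invented entries: (team, voc_x size, top5_x, top2_x, top5_gt, top2_gt)
entries = [
    ("team_a", 109, 0.40, 0.30, 0.40, 0.30),  # full vocabulary, no compression
    ("team_b", 60, 0.65, 0.35, 0.40, 0.28),   # gain 0.25 >= required ~0.225
    ("team_c", 80, 0.50, 0.36, 0.40, 0.32),   # gain 0.10 < required ~0.133
]

rows = []
for team, size, top5_x, top2_x, top5_gt, top2_gt in entries:
    compression = model_compression(size)
    adj5, adj2 = adjusted_scores(top5_x, top2_x, top5_gt, top2_gt, compression)
    rows.append({"team": team, "adjusted_top_5_TSS": adj5, "adjusted_top_2_TSS": adj2})

# Primary key: adjusted_top_5_TSS; secondary key: adjusted_top_2_TSS.
leaderboard = sorted(
    rows,
    key=lambda r: (r["adjusted_top_5_TSS"], r["adjusted_top_2_TSS"]),
    reverse=True,
)

for row in leaderboard:
    print(row)
```

Note that with the smallest allowed vocabulary of 60 words the compression is 1 - 60/109, roughly 0.45, so the top-5 score on voc_x has to beat the voc_gt score by about 0.225 before the adjusted score switches over.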

During the course of Round-2, all the scores are based on 60% of the whole test data, and the final leaderboards on the whole test data will be released at the end of Round-2.

Cheers!


contrebande · November 23, 2020, 3:53pm

So, to be clear, one’s “chosen vocabulary” can only be a subset of Round 1’s pre-computed subset? And it’s optional, which then makes Round 2 exactly the same as Round 1, except for the submission format (private git repo2docker with no public explainer required)? And you mention a guide for advanced submission, but will you accept submissions with a custom Dockerfile bypassing your entry point (run.sh) and boilerplate code (predict.py)? Or using Java instead of Python 3, for instance?

mohanty · November 23, 2020, 4:22pm

@contrebande: The goal of this round is to build on the results of the previous round and nudge the community in the direction of model compression.
If participants decide to continue using the whole 109-word vocabulary, then indeed the problem statement will be the same as Round 1 (until someone else manages to get an adjusted score boost by cleverly using a subset of the provided vocabulary).

And, in terms of code-based submissions, if you are submitting by packaging your runtime as a Dockerfile, then in principle submitting in Java is indeed possible. Unfortunately it is not supported at the moment, but if that is a use case that would be beneficial for you, I am sure we can put together simple bindings which will let you also submit your code in Java.
One approach would be for us to rewrite the whole plumbing of the evaluator in Java; another would be to simply wrap your Java objects using something like Py4J and then use the already existing Python boilerplate code to submit.

Looping in @shivam for any advice (and follow up) on this request.

Cheers,
Mohanty

contrebande · November 23, 2020, 5:45pm

Hi @mohanty,

Please do not rewrite your submission code. I was asking, not making a request. My biggest issue with the challenge as it is currently defined, is that I misinterpreted it as open source. Or hoped it would remain so, like round 1 was. The mandatory explainer was a strong incentive for open discussion. And that’s why I was interested in it, despite the proprietary data factor that was bothering me. But round 2 removed any requirement for open sourcing code or discussing strategies in the forum. In fact, even in round 1, there were only half a dozen explainer notebooks published, there was not much discussion in the forum around them and a lot of questions (mine here, for example) remain unanswered. I don’t think there is enough motive for paper-worthy ideas to be born (or even circulating) here in this way. Which is unfortunate, because it’s obviously very much a paper-worthy problem. And with some proper community fostering we could have a lot of fun too. Anyways, I will continue to follow this challenge, but as it stands now, I don’t think I will be participating.

shivam · November 23, 2020, 7:24pm

Hi @contrebande,

Thanks for sharing your use case on Java submission.
We try to keep our submission process as open as we can, so it doesn’t affect any of the participants. Meanwhile, some structure is introduced for each challenge to make sure we can evaluate codebases efficiently.

To make onboarding Java submissions simpler, I have added a new predictor, java_random_predict.py, in the starter kit here. (complete diff: 04200de5)

This class makes use of Py4J to interface with the Java codebase present in assets/JavaCodebase/[...].

How does it work?

Install the py4j Python package with pip install py4j.
The py4j0.[..].jar needs to be used when compiling your codebase.
You can make your codebase available in Python by exposing the class instance through a GatewayServer, example here.
In the Python codebase, gateway.entry_point is then effectively your Java class’s interface, example here.
In predict.py you need to start the gateway server, example here.

You can use the above example to port your existing codebase into the starter kit, or alternatively, change the RandomPredictor.java.
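For orientation, the Python side of such a setup might look roughly like the sketch below. `JavaGateway` and `gateway.entry_point` are real Py4J API; everything else is an assumption of this sketch, including the wrapper class, the idea that the Java entry point exposes a `predict(String)` method returning a string, and the assumption that a JVM with a `GatewayServer` is already running on Py4J's default port. The actual java_random_predict.py in the starter kit may be organised differently.

```python
class JavaBackedPredictor:
    """Thin Python wrapper around a Java object exposed through Py4J."""

    def __init__(self, entry_point):
        # In real use this is gateway.entry_point. Any object with a
        # predict(smiles) method works, which keeps the wrapper testable.
        self.entry_point = entry_point

    def predict(self, smiles):
        # Assumption: the Java side exposes predict(String) returning a String.
        return str(self.entry_point.predict(smiles))


def connect_to_running_jvm():
    """Connect to a GatewayServer that is assumed to be running already."""
    # Imported lazily so the wrapper above can be used without py4j installed.
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()  # connects to the default Py4J port
    return gateway, JavaBackedPredictor(gateway.entry_point)


class FakeEntryPoint:
    """Hypothetical stand-in for the Java object, for local dry runs."""

    def predict(self, smiles):
        return "fruity,sweet"


predictor = JavaBackedPredictor(FakeEntryPoint())
print(predictor.predict("CCO"))
```

With a real JVM running, `gateway, predictor = connect_to_running_jvm()` would replace the fake entry point, and `gateway.shutdown()` can be called once predictions are finished.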

Let us know in case you face any issues while making the submission.

contrebande · November 23, 2020, 9:54pm

Hi @shivam,

Well, thanks. But, like I said to Mohanty, I was merely asking about the constraints of not supporting custom Dockerfiles and I gave Java only as an example. But also, when using Java (or any other compiled language), one will surely need pre-compiled dependencies too, no? And if pre-compiled dependencies are allowed, one can then also submit pre-compiled binaries (pre-packaged jars) with only the superficial hooks for setSeed, readVocabulary and predict. And if someone can submit pre-compiled binaries, then what’s the point of this round? Are contestants still required (by the challenge rules) to release the code to their solutions under an open source license of their choice? Also, since there is a performance criterion (less than one second per prediction), why not have all the hooks be command line and not penalize non-Python languages with non-native Python wrappers?