Please consider releasing the datasets and using a submission file for evaluation

yzhounvs · November 19, 2019, 4:44am

We joined the competition to contribute to a data science challenge, but the current setup forces us to spend a lot of energy dealing with the constraints imposed by the Aridhia environment and with the unknowns of the test dataset.

Here are a few examples of how this setup limits efficiency and creativity:

Reduces efficiency. We have our own mature data analysis environment at work. We can use tools such as Spotfire to visualize and explore relationships much more quickly, whereas the current environment significantly reduces productivity. Anaconda does not cover everything we need for machine learning. I am not sure how many people feel comfortable compiling and installing the tools we use daily in our work. We cannot recreate on Aridhia the efficient work environment we rely on every day.
Reduces creativity. We are considering a deep learning approach, but having no GPU access kills that idea. Andrew Lo mentioned that one could imagine taking advantage of the chemical space, but without being able to connect to the commercial chemical databases Novartis has licensed, how can one retrieve those drug structures? Even if we retype the drug names and search for them on our work computers, we cannot do that for the test dataset, as the test set is not made available. So this setup kills ideas that would be feasible in a real-life setting.
Derails the main goal: why should we spend time learning the Informa API? Why not make the data available in a standard table format? I assume the purpose of this competition is not to check who can understand a third-party API and get it working fastest. The members of my team cannot work on the competition full time in the next few weeks; this is a part-time activity for us, and we still have demanding daily responsibilities. A high hurdle to data access means we have even less time to spend on understanding the problem itself.
Does not model real life: why hide the test dataset? In real life, we have access to the test dataset (not the outcome field, of course), which enables us to examine the distribution of variables. We would be able to cluster training and test records together to discover new relationships. When we predict records in 2018, we have all records from 2017 available. That is the access we have in real life, and it is not a data leak. By not being able to see the test set, we not only have access to less information than we would in real life, but are also forced to rely on assumptions that may fail, and we receive zero feedback (as it is run in a different environment), whereas such cases can be avoided entirely in real life.

So I would kindly suggest that the organizers consider releasing the training and test sets, so that we can use the computational platforms and analysis tools of our own choice and spend our limited time on the most creative part of the problem at hand. It is not too late to do that.

Thank you for your consideration.

ngewkokyew · November 19, 2019, 9:20am

Also, the delay in accessing the workspace environment and data creates unnecessary challenges for us. I am very concerned about whether my team can deliver results, given that there are 3 weeks left and our access to Aridhia is still pending.

kelleni2 · November 19, 2019, 3:17pm

Hi and thanks for your suggestions.

The data restrictions will be discussed with the core team.
a) If you need a GPU, that can be arranged. Since one does not automatically need a GPU for deep learning, as TensorFlow and similar frameworks can run on a CPU, we defaulted to CPUs. I can easily imagine that a GPU might prove necessary when considering new data types of larger sizes.
b) We have a team who will be uploading all of our chemical assay data, linked to the core data set where possible - right now for phase 2. This required both legal approval and “wrangling”. We can certainly provide a linkage table for the compounds in the test set.
You are not required to use the API. This was provided out of goodwill by Informa on a trial basis for the challenge. Many teams have data engineers and wranglers who have displayed quite some interest, and there will be a walk-through. It was also the proposed solution from Informa to get the linkage for the compound data. See next point regarding raw data.
I will confirm with the core team, but you do or should shortly have access to the raw data for which we have a subscription. You can choose whether to participate in the leaderboard aspect; if so, it is your responsibility not to leak information if going back to the raw data. You can also simply proceed to the insights section and ignore the leaderboard. The leaderboard is a methods comparison, where additional data will be available centrally to all participants, including on the evaluation cluster, where you can link it in the same manner as you did with the training data. The test split is meant to mimic post-phase 2 investment and conservatively avoid information leaks. For practical reasons we chose a single date. Other options are a rolling window, or re-training a model for each prediction based on all information time-stamped earlier than the decision point, which was not feasible given the specifics of the data. Feel free to do as you feel appropriate for the insights section, including phase 3 design, etc.
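
To make the difference between the single-date split and the re-training alternative concrete, here is a minimal Python sketch. The record layout (a `trial_id` and a `date` field), the function names, and the sample dates are all assumptions for illustration; this is not the challenge's actual schema or evaluation code.

```python
from datetime import date


def split_by_cutoff(records, cutoff):
    """Single-date split: train on records before the cutoff, test on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


def expanding_window(records, decision_points):
    """For each decision point, train only on records time-stamped earlier.

    The test fold holds the records from that decision point up to
    (but not including) the next one.
    """
    points = sorted(decision_points)
    for i, point in enumerate(points):
        upper = points[i + 1] if i + 1 < len(points) else None
        train = [r for r in records if r["date"] < point]
        test = [
            r for r in records
            if r["date"] >= point and (upper is None or r["date"] < upper)
        ]
        yield point, train, test


# Hypothetical records, for illustration only.
records = [
    {"trial_id": "T1", "date": date(2016, 5, 1)},
    {"trial_id": "T2", "date": date(2017, 3, 15)},
    {"trial_id": "T3", "date": date(2017, 11, 2)},
    {"trial_id": "T4", "date": date(2018, 6, 20)},
]

train, test = split_by_cutoff(records, date(2018, 1, 1))
print([r["trial_id"] for r in train], [r["trial_id"] for r in test])

for point, tr, te in expanding_window(records, [date(2017, 1, 1), date(2018, 1, 1)]):
    print(point, [r["trial_id"] for r in tr], [r["trial_id"] for r in te])
```

In both variants the training fold never contains a record dated on or after the decision point, which is the property that avoids the information leak; the expanding-window variant simply repeats that for several decision points and requires a re-trained model for each.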

kelleni2 · November 19, 2019, 3:19pm

Regarding access, I will send a communication today. Due to latency issues in Hyderabad, another hub was stood up by Aridhia. The loss of time will be compensated for where possible.

yzhounvs · November 19, 2019, 4:18pm

Hi, Kelley

Your reply is greatly appreciated and very helpful. Just to mention that even if the chemical data is made available, the tools to use it will be missing (e.g., how do you calculate the similarity between two structures?). We cannot re-establish our work environment in Aridhia. Thanks.
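
For reference, one common way to score the similarity between two structures is the Tanimoto coefficient on binary fingerprints. The sketch below is plain Python and assumes the fingerprints have already been computed as sets of "on" bit indices; the bit values and variable names are made up for illustration. Generating real fingerprints from structures would still need a cheminformatics toolkit, which this sketch does not replace.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints (indices of bits that are set), not real compounds.
fp_drug_1 = {1, 4, 9, 16, 25}
fp_drug_2 = {1, 4, 9, 36}

print(tanimoto(fp_drug_1, fp_drug_2))  # 3 shared bits / 6 distinct bits = 0.5
```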