We joined the competition to contribute to a data science challenge, while the current setup forces us to spend lots of energy dealing with the constrains imposed by the Aridhia environment and to deal with the unknowns of the test dataset.
Here are a few examples how this setup limits efficiency and creativity:
Reduce efficiency. We have our own matured data analysis environment at work. We can use tools such as Spotfire to visualize and explore relationships much quicker, while the current environment significantly reduces productivity. Anaconda is not everything we need for machine learning. I am not sure how many people feel comfortable in compiling and installing those tools we use daily in our work. We cannot recreate the efficient work environment we rely on daily on Aridhia.
Reduce creativity. We are considering some deep learning approach, no GPU access kills that idea. Andrew Lo mentioned one could imagine taking advantage of the chemical space, but without being able to connect to commercial chemical database Novartis has licensed, how can one retrieve those drug structures? Even we retype those drug names and search them in our work computer, we cannot do that for the test dataset, as test set is not made available. So this setup kills ideas that could have become possible in real life settings.
Derail the main goal: why should we spend time learning the Informa API? Why not make the data available in standard table format? I assume it is probably not the purpose of this competition to check who can understand a 3rd party API and is able to get a API to work quicker. Members in my team are not able to work on the competition full time in the next few weeks, this is our part-time activity, while we still have demanding daily responsibilities. A high hurdle in data access means we have even less time to spend on understanding the problem itself.
Does not model real life: why hide the test dataset? In real life, we have access to the test dataset (of course not the outcome field), which enables us to examine the distribution of variables. We will be able to cluster training and test records together to discover new relationships. When we predict records in 2018, we have all records in 2017 available. That is the real access we have in real life and is not data leak. By not being able to see the test set, we not only access less information than what we can in real life, but also force have to rely on assumptions that may fail and receive zero feedback (as it is run is a different environment), while such case can totally be avoided in real life.
So I would kindly suggest organizers to consider releasing training and test sets, so we can use the computational platforms and analysis tools of our own choice, and let us spend the limited time on the most creative part of the problem at hand. It is not too late to do that.
Thank you for your consideration.