/shared_data does not seem to be visible on the evaluation cluster; how can we access the training data files during evaluation? Some feature engineering ideas rely on combining the training data and the test data. Thanks!
Training data is not available during evaluation.
We only have the test dataset available as of now. For the time being, it might be best to train your model from the /shared_data training data and add the computed model to your git repository.
Please do not assume the test data alone is sufficient for prediction. For example, if we create a new feature counting how many indications a drug has previously been tried for, computing it requires looking up the training set. That is just a trivial example; there are features that can only be calculated by combining the test set with the training set (and if we start caching training data for this, we just end up storing the training dataset anyway). As I mentioned previously, the best option is to make the test set available; there is not much point in hiding the test set for this exercise. In any case, please make the training dataset available. Thanks!
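A minimal sketch of that kind of cross-dataset feature, using hypothetical column names ("drug", "indication") and toy data, since the real schema isn't shown in this thread:

```python
import pandas as pd

# Toy stand-ins for the real datasets; column names are assumptions.
train = pd.DataFrame({
    "drug": ["A", "A", "B"],
    "indication": ["flu", "pain", "flu"],
})
test = pd.DataFrame({
    "drug": ["A", "C"],
    "indication": ["fever", "flu"],
})

# For each test record, count how many distinct indications its drug was
# previously tried for -- this lookup is only possible with the training set.
prior_counts = train.groupby("drug")["indication"].nunique()
test["prior_indications"] = test["drug"].map(prior_counts).fillna(0).astype(int)
print(test["prior_indications"].tolist())  # [2, 0]
```

Drug "C" never appears in training, so its count falls back to 0; that fallback is exactly the kind of decision you can only make once the training data is reachable from the evaluation cluster.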
I will follow up with you regarding which paths the evaluation cluster should have access to.
But yes, for starters, I see no issue making the training data available in the evaluation cluster, and we will do so asap.
Regarding test data - I have discussed with the core team and will reach out directly with feedback and questions.
Having the training data available would really be useful; the test data being unavailable is okay (otherwise it may get too tempting for people to look up the outcomes and tune their models against the test data).
To be clear, the test data certainly does not include the outcome column; by "test data" I mean the feature variables, not the target.
If an engineered feature relies only on the test record itself, there is no issue. But some features may rely on other test records and on training records, and those will waste time to troubleshoot.
As an update on this request: the organising team is working on it, and training data will be available during evaluation soon. We will make an announcement once it is available.
Just to provide a trivial example of why test features should be available: the Terminated Reason column seems to use a somewhat controlled vocabulary, but is it really? If not, could the person who entered the data have introduced a typo? If, for example, we one-hot encode that column for our model and then see a new value or a typo in the test set, it could break the model. Knowing such things in advance allows us to save time and handle the columns correctly.
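A small sketch of that failure mode and one defensive fix, using made-up values for a Terminated Reason-style column (the real vocabulary is not known here): aligning the test encoding to the training columns so an unseen value or typo becomes an all-zero row instead of breaking the feature matrix.

```python
import pandas as pd

# Hypothetical vocabulary; "Saftey" in the test set is a deliberate typo.
train = pd.DataFrame({"terminated_reason": ["Safety", "Funding", "Safety"]})
test = pd.DataFrame({"terminated_reason": ["Funding", "Saftey"]})

train_ohe = pd.get_dummies(train["terminated_reason"], dtype=int)

# Reindex the test encoding onto the training columns: unseen values
# (typos included) yield all-zero rows rather than extra columns or errors.
test_ohe = pd.get_dummies(test["terminated_reason"], dtype=int).reindex(
    columns=train_ohe.columns, fill_value=0
)
print(test_ohe.values.tolist())  # [[1, 0], [0, 0]]
```

Without the reindex step, the test matrix would gain a "Saftey" column the model has never seen, and a model expecting a fixed column order would misbehave or crash.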
Hi all - quick update: we fully support making the training data available to the evaluation cluster. As Shivam mentioned, the training data will be visible to the evaluation cluster today. There were a few steps involved.
@yzhounvs - thanks for your concrete example of “data cleaning” - it is useful
I am traveling today but I would like to speak with you Monday if possible.
Raw data will be available soon, but due to its size I was unable to find a quick solution while traveling.
We are sorry that the announcement didn't go through for this change. The training data is now available during evaluation, and the starter kit has been updated with a demonstration example.
It can be accessed via the environment variable AICROWD_TRAIN_DATA_PATH, which points to a directory with the same structure as /shared_data/data/training_data/, i.e. one containing all the training-related files.
Example usage:

```python
import os
import pandas as pd

AICROWD_TEST_DATA_PATH = os.getenv("AICROWD_TEST_DATA_PATH", "/shared_data/data/testing_data/to_be_added_in_workspace.csv")
# [...]
# Note: AICROWD_TRAIN_DATA_PATH is a directory, not a file.
AICROWD_TRAIN_DATA_PATH = os.getenv("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
train_df = pd.read_csv(os.path.join(AICROWD_TRAIN_DATA_PATH, "training_data_2015_split_on_outcome.csv"))
```
Please let us know in case there is any follow up question.
Thanks for that. Before others waste their time: AICROWD_TRAIN_DATA_PATH is a folder, not a file (as one might think from looking only at the example code without reading the text…).
It took me way too long to figure that out…
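A quick sanity check along those lines; this sketch assumes only the AICROWD_TRAIN_DATA_PATH variable and the default directory from the announcement above, and simply lists the CSVs inside it so you learn the exact file names before building full paths:

```python
import os

# AICROWD_TRAIN_DATA_PATH is a directory (not a file), so inspect its
# contents first rather than passing it straight to pd.read_csv.
train_dir = os.getenv("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
if os.path.isdir(train_dir):
    csv_files = sorted(f for f in os.listdir(train_dir) if f.endswith(".csv"))
else:
    csv_files = []  # path not present on this machine
print(csv_files)
```

On the evaluation cluster this should print the available training CSVs; anywhere the path does not exist, it prints an empty list instead of raising.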