/shared_data does not seem to be visible on the evaluation cluster; how can we access the training data files during evaluation? Some feature engineering ideas rely on combining the training data and the test data. Thanks!
Training data is not available during evaluation.
We only have the test dataset available as of now. For the time being, it might be best to train your model from the /shared_data training data and add the computed model to your git repository.
Please do not assume the test data alone is sufficient for prediction. For example, if we create a new feature counting how many indications a drug has previously been tried for, computing it requires looking up the training set. That is just a trivial example; there are features that can only be calculated by combining the test set with the training set (and if we start caching training data for this, we just end up storing the training dataset anyway). As I mentioned previously, the best option is to make the test set available; there is not much point in hiding the test set for this exercise. In any case, please make the training dataset available. Thanks!
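A minimal sketch of that kind of cross-dataset feature, using hypothetical column names ("drug", "indication") and toy data, since the real schema isn't shown in this thread:

```python
import pandas as pd

# Toy stand-ins for the real datasets; column names are assumptions.
train = pd.DataFrame({
    "drug": ["A", "A", "B"],
    "indication": ["flu", "pain", "flu"],
})
test = pd.DataFrame({
    "drug": ["A", "C"],
    "indication": ["fever", "flu"],
})

# For each test record, count how many distinct indications its drug was
# previously tried for -- this lookup is only possible with the training set.
prior_counts = train.groupby("drug")["indication"].nunique()
test["prior_indications"] = test["drug"].map(prior_counts).fillna(0).astype(int)
print(test["prior_indications"].tolist())  # [2, 0]
```

Drug "C" never appears in training, so its count falls back to 0; that fallback is exactly the kind of decision you can only make once the training data is reachable from the evaluation cluster.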
I will follow up with you regarding which paths the evaluation cluster should have access to.
But yes, for starters, I see no issue making the training data available in the evaluation cluster, and we will do so asap.
Regarding test data - I have discussed with the core team and will reach out directly with feedback and questions.
Having the training data available would really be useful; the test data being unavailable is okay (otherwise it may get too tempting for people to look up the outcomes and tune their models against the test data).
To be clear, the test data certainly does not include the outcome column; by "test data" I mean the feature variables, not the target.
If an engineered feature relies only on the test record itself, there is no issue. But some features may rely on other test records and on training records, and those will waste time to troubleshoot.
As an update on this request: the organising team is working on it, and training data will be available during evaluation soon. We will make an announcement once it is available.
Just to provide a trivial example of why test features should be available: the Terminated Reason column seems to use a somewhat controlled vocabulary, but is it really? If not, could the person who entered the data have introduced a typo? If, for example, we one-hot encode that column for our model and then see a new value or a typo in the test set, it could break the model. Knowing such things in advance allows us to save time and handle the columns correctly.
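A small sketch of that failure mode and one defensive fix, using made-up values for a Terminated Reason-style column (the real vocabulary is not known here): aligning the test encoding to the training columns so an unseen value or typo becomes an all-zero row instead of breaking the feature matrix.

```python
import pandas as pd

# Hypothetical vocabulary; "Saftey" in the test set is a deliberate typo.
train = pd.DataFrame({"terminated_reason": ["Safety", "Funding", "Safety"]})
test = pd.DataFrame({"terminated_reason": ["Funding", "Saftey"]})

train_ohe = pd.get_dummies(train["terminated_reason"], dtype=int)

# Reindex the test encoding onto the training columns: unseen values
# (typos included) yield all-zero rows rather than extra columns or errors.
test_ohe = pd.get_dummies(test["terminated_reason"], dtype=int).reindex(
    columns=train_ohe.columns, fill_value=0
)
print(test_ohe.values.tolist())  # [[1, 0], [0, 0]]
```

Without the reindex step, the test matrix would gain a "Saftey" column the model has never seen, and a model expecting a fixed column order would misbehave or crash.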
Hi all - quick update: we fully support making the training data available to the evaluation cluster. As Shivam mentioned, the training data will be visible to the evaluation cluster today. There were a few steps involved.
@yzhounvs - thanks for your concrete example of “data cleaning” - it is useful
I am traveling today but I would like to speak with you Monday if possible.
Raw data will be available soon, but due to its size I was unable to find a quick solution while traveling.
We are sorry that the announcement didn't go through for this change. The training data is now available during evaluation, and the starter kit has been updated with a demonstration example.
It can be accessed via the environment variable AICROWD_TRAIN_DATA_PATH, which points to a directory with the same structure as /shared_data/data/training_data/, i.e. one containing all the training-related files.
Example usage:

```python
import os
import pandas as pd

AICROWD_TEST_DATA_PATH = os.getenv("AICROWD_TEST_DATA_PATH", "/shared_data/data/testing_data/to_be_added_in_workspace.csv")
# [...]
# Note: AICROWD_TRAIN_DATA_PATH is a directory, not a file.
AICROWD_TRAIN_DATA_PATH = os.getenv("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
train_df = pd.read_csv(os.path.join(AICROWD_TRAIN_DATA_PATH, "training_data_2015_split_on_outcome.csv"))
```
Please let us know in case there is any follow up question.
Thanks for that. Before others waste their time: AICROWD_TRAIN_DATA_PATH is a folder, not a file (as one might think from looking only at the example code without reading the text…).
It took me way too long to figure that out…
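A quick sanity check along those lines; this sketch assumes only the AICROWD_TRAIN_DATA_PATH variable and the default directory from the announcement above, and simply lists the CSVs inside it so you learn the exact file names before building full paths:

```python
import os

# AICROWD_TRAIN_DATA_PATH is a directory (not a file), so inspect its
# contents first rather than passing it straight to pd.read_csv.
train_dir = os.getenv("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
if os.path.isdir(train_dir):
    csv_files = sorted(f for f in os.listdir(train_dir) if f.endswith(".csv"))
else:
    csv_files = []  # path not present on this machine
print(csv_files)
```

On the evaluation cluster this should print the available training CSVs; anywhere the path does not exist, it prints an empty list instead of raising.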