Original Datasets for Train and Test

shravankoninti · November 27, 2019, 7:18pm

Hi team,

Where can I find the original datasets for train and test?

There must be training_phase_2.csv and testing_phase_2_release.csv…

What I see in the starter kit is just a 100-row subset of the data. Can you please share the file paths?

shravankoninti · November 28, 2019, 3:58am

@shivam Can you please reply here?

shivam · November 28, 2019, 4:40am

Hi @shravankoninti,

Are you referring to the workspace or the evaluator?

In the workspace, those files are present in /shared_data/data/, while in the evaluator you can access them using the environment variable AICROWD_TEST_DATA_PATH.
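As a minimal sketch of handling both locations in one script, something like the following could work. The function name `resolve_test_data_path` and its `workspace_fallback` parameter are hypothetical, and whether AICROWD_TEST_DATA_PATH points to a file or a directory is an assumption you should check on the evaluator.

```python
import os

# Workspace data directory mentioned in this thread.
WORKSPACE_DATA_DIR = "/shared_data/data/"


def resolve_test_data_path(workspace_fallback, env=None):
    """Pick the test data location (hypothetical helper).

    Returns the value of AICROWD_TEST_DATA_PATH when it is set and
    non-empty (the evaluator case), otherwise `workspace_fallback`
    (the workspace case).
    """
    env = os.environ if env is None else env
    return env.get("AICROWD_TEST_DATA_PATH") or workspace_fallback
```

On the workspace you would pass a path under `WORKSPACE_DATA_DIR` as the fallback; on the evaluator the environment variable takes precedence.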

shravankoninti · November 28, 2019, 5:03am

@shivam Thanks for the reply. Yes, I am looking at /shared_data/data/.

I see there are 3 files:

1. random_number_join.csv
2. training_data_2015_split_on_outcome.csv
3. training_data_2015_split_on_outcome.xls

Questions:
1. What is the use of file 1 (random_number_join.csv)?
2. Files 2 and 3 (training_data_2015_split_on_outcome.csv and .xls) have 8,691 records with 72 columns. Please confirm. Do we need to work only on this data as the training set?
3. In the README file (starter kit), you mention a file named training_phase2.csv with 1600649 records. What is this file? Which file is our training dataset, the one with 8,691 records? Please let me know.

shivam · November 28, 2019, 7:58am

Hi,

Consider the files present in /shared_data/data/ on the workspace as the latest version, and their record counts as correct. The README in the starter kit contains numbers from a previous dataset version and can be wrong.
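If you want to check the counts yourself rather than rely on the README, here is a minimal sketch using only the standard library. The helper name `csv_shape` is hypothetical, and it assumes the CSV has a single header row and is readable with the default text encoding.

```python
import csv


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file (hypothetical helper).

    Assumes the first row is a header, so it is not counted as a record.
    The column count is taken from the header row.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty file
            return 0, 0
        rows = sum(1 for _ in reader)
    return rows, len(header)
```

For example, calling `csv_shape("/shared_data/data/training_data_2015_split_on_outcome.csv")` on the workspace lets you compare the result against the 8,691 records and 72 columns reported above.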

I am not sure about random_number_join.csv. @kelleni2 might be aware of it.

kelleni2 · November 28, 2019, 8:48am

The test data was originally not intended to be visible, other than a sample file showing column names and format.

However, for various logistical reasons, we plan to make the test data available to those who feel they need it. I will create a separate post on that topic.