Original Datasets for Train and Test

#1

Hi team,

Where can I find the original datasets for train and test?

There must be training_phase_2.csv and testing_phase_2_release.csv…

What I see in the starter kit is just a 100-row subset of the data. Can you please share the file paths?

#2

@shivam Can you please reply here?

#3

Hi @shravankoninti,

Are you referring to the workspace or the evaluator?

In the workspace, those files are present in /shared_data/data/, while in the evaluator you can access them using the environment variable AICROWD_TEST_DATA_PATH.
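As a rough sketch of how a script could handle both cases, the snippet below reads AICROWD_TEST_DATA_PATH when it is set and otherwise falls back to a file under /shared_data/data/. The helper name `resolve_test_path` and the fallback file name are made up for illustration; substitute whatever file is actually present in your workspace.

```python
import os

# Hypothetical fallback for the workspace; the real file name may differ.
DEFAULT_TEST_PATH = "/shared_data/data/your_test_file.csv"


def resolve_test_path(env=os.environ, default=DEFAULT_TEST_PATH):
    """Return the path from AICROWD_TEST_DATA_PATH if set, else the default."""
    return env.get("AICROWD_TEST_DATA_PATH", default)


# Example use inside a prediction script:
#   test_path = resolve_test_path()
```

Passing the environment as an argument is only a convenience so the lookup can be tried with a plain dict, without changing real environment variables.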

#4

@shivam Thanks for the reply. Yes, I am looking at /shared_data/data/.

I see there are 3 files:

  1. random_number_join.csv
  2. training_data_2015_split_on_outcome.csv
  3. training_data_2015_split_on_outcome.xls

Questions:
What is the purpose of file 1?
Files 2 and 3 have 8,691 records with 72 columns. Please confirm. Should we use only this data as the training set?
However, the README in the starter kit mentions a file named training_phase2.csv with 1600649 records. What is this file? Which file is our training dataset: the one with 8,691 records? Please let me know.

#5

Hi,

Consider the files present in /shared_data/data/ on the workspace as the latest version, and their record counts as correct. The README in the starter kit contains numbers from a previous dataset version and may be wrong.
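If you want to check the counts yourself, here is a minimal sketch using only the standard library. It assumes the CSV has a single header row; the function name `csv_shape` is just illustrative.

```python
import csv


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file with one header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])       # first row holds the column names
        rows = sum(1 for _ in reader)   # remaining rows are records
    return rows, len(header)


# Example use in the workspace:
#   csv_shape("/shared_data/data/training_data_2015_split_on_outcome.csv")
```

Post #4 reports 8,691 records and 72 columns for that file, so a matching result would suggest you are looking at the same version.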

I am not sure about random_number_join.csv. @kelleni2 might be aware of it?

#6

The test data was originally not intended to be visible, apart from a sample file showing the column names and format.

However, for various logistical reasons, we plan to make the test data available to those who feel they need it. I will create a separate post on that topic.
