Accessing the train file and test file in the same predict.py?

Hi Team,
@kelleni2 @shivam

Can we access and read the training files in the shared folder from predict.py, where the original test data is laid out and accessed through the production environment path?

I want to read both datasets, i.e. train and test, like this:

train_df = pd.read_csv('/shared_data/data/training_data/training_data_2015_split_on_outcome.csv')

The above path refers to the local path.

test_df = pd.read_csv(AICROWD_TEST_DATA_PATH,index_col=0)

The above path refers to the production environment path.

Please clarify this, because I want to train the model inside predict.py itself, using both the train and test files in the same place.

Regards
Shravan

Hi @shravankoninti,

Yes, you can access all the files at the same time during evaluation.

The starter kit has all the information about the environment variables, but let me clarify here as well which ones are available during evaluation.

  • AICROWD_TEST_DATA_PATH: Refers to the testing_phase2_release.csv file, which the evaluator uses to judge your models in the testing phase (soon to be made public)
  • AICROWD_TRAIN_DATA_PATH: Refers to /shared_data/data/training_data/, in which all of the training-related files are present
  • AICROWD_PREDICTIONS_OUTPUT_PATH: Refers to the path at which your code is expected to write the final predictions

In your codebase, you can do something like the following to load both files:

import os
import pandas as pd

# The second argument is a fallback used only when the variable is not set (local testing)
AICROWD_TRAIN_DATA_PATH = os.getenv("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
AICROWD_TEST_DATA_PATH = os.getenv("AICROWD_TEST_DATA_PATH", "/shared_data/data/testing_data/to_be_added_in_workspace.csv")
AICROWD_PREDICTIONS_OUTPUT_PATH = os.getenv("AICROWD_PREDICTIONS_OUTPUT_PATH", "random_prediction.csv")

# AICROWD_TRAIN_DATA_PATH is a directory, so join it with the file name
train_df = pd.read_csv(os.path.join(AICROWD_TRAIN_DATA_PATH, 'training_data_2015_split_on_outcome.csv'))
# Do pre-processing, etc
[...]
test_df = pd.read_csv(AICROWD_TEST_DATA_PATH, index_col=0)
# Make predictions
[...]
# Submit your answer
prediction_df.to_csv(AICROWD_PREDICTIONS_OUTPUT_PATH, index=False)

I hope the example clarifies your doubt.
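
As a rough sketch of the same idea, the three variables can be resolved in one place and the input files checked before any training starts. The helper names `resolve_paths` and `missing_inputs` are hypothetical and not part of the starter kit; the defaults are the ones quoted above.

```python
import os

# File name taken from the question above; the helper names are illustrative only.
TRAIN_FILE_NAME = "training_data_2015_split_on_outcome.csv"


def resolve_paths(env=os.environ):
    """Return the train, test and output paths, using workspace defaults when unset."""
    train_dir = env.get("AICROWD_TRAIN_DATA_PATH", "/shared_data/data/training_data/")
    test_path = env.get(
        "AICROWD_TEST_DATA_PATH",
        "/shared_data/data/testing_data/to_be_added_in_workspace.csv",
    )
    output_path = env.get("AICROWD_PREDICTIONS_OUTPUT_PATH", "random_prediction.csv")
    return {
        # The train variable points at a directory, so the file name is appended.
        "train": os.path.join(train_dir, TRAIN_FILE_NAME),
        "test": test_path,
        "output": output_path,
    }


def missing_inputs(paths):
    """List which of the input files ('train', 'test') do not exist on disk."""
    return [name for name in ("train", "test") if not os.path.isfile(paths[name])]
```

Calling `missing_inputs(resolve_paths())` at the top of predict.py and stopping when the list is non-empty gives a clearer error than a failed `pd.read_csv` later on.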

Thanks very much! This is really helpful.

I hope you will update the test data path and file name ASAP.

Shravan

Sure. Can you point us to the file/link where you found the wrong path?

No, no. As of now it is good.

AICROWD_TEST_DATA_PATH = os.getenv("AICROWD_TEST_DATA_PATH", "/shared_data/data/testing_data/to_be_added_in_workspace.csv")

You mentioned "to_be_added_in_workspace.csv" here. This needs to be replaced with testing_phase2_release.csv, right?

Let me know if this is right.

Yes, this is correct.

Hi Shivam,

Is ‘AICROWD_PREDICTIONS_OUTPUT_PATH’ customizable? What will be the absolute path of this predictions file? Could you please explain with an example?

Hi @maruthi0506,

Your codebase needs to read this environment variable, which holds an absolute path, and simply write the final predictions at that location. An example is already in the starter kit, as well as in the comment above.
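
A minimal sketch of that step, assuming nothing beyond what is stated in this thread: read the variable, turn it into an absolute path, and make sure the parent directory exists before writing. The helper name `prepare_output_path` is hypothetical, and whether the evaluator creates the directory in advance is not stated here, so the `os.makedirs` call is only a precaution.

```python
import os


def prepare_output_path(default="random_prediction.csv"):
    """Return the absolute predictions path, creating its parent directory if needed."""
    # During evaluation the variable is set; the default only applies to local runs.
    path = os.path.abspath(os.getenv("AICROWD_PREDICTIONS_OUTPUT_PATH", default))
    # Precaution (an assumption): the target directory may not exist yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path
```

The result can then be passed straight to the write call from the example above, e.g. `prediction_df.to_csv(prepare_output_path(), index=False)`.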

It just says AICROWD_PREDICTIONS_OUTPUT_PATH = os.getenv("AICROWD_PREDICTIONS_OUTPUT_PATH", "random_prediction.csv"). But what is the default path for that file? For example, does it need to be in the shared data folder, in a personal folder that I created, or in any directory? Is the output file expected to have a predefined name? Do we need to give the complete path, e.g. /x/y/z/predictions.csv?

Hi,

The default path can be anything you prefer, e.g. a path in your workspace that you use for testing.

During evaluation, this environment variable will always be set, and the default value will not be used.

“During evaluation, this environment variable will always be set, and the default value will not be used.” What does this line mean? Does it write to some other server for evaluation?

Yes, the evaluations run on separate servers from your workspaces.
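
To illustrate the point about defaults, here is a small sketch of how `os.getenv` behaves in the two situations. The evaluator path used below is made up for the demonstration; the real value is whatever the evaluation server sets.

```python
import os


def predictions_path(default="random_prediction.csv"):
    """Return the output path: the environment value if set, else the default."""
    return os.getenv("AICROWD_PREDICTIONS_OUTPUT_PATH", default)


# Workspace run: the variable is not set, so the default is returned.
os.environ.pop("AICROWD_PREDICTIONS_OUTPUT_PATH", None)
local_path = predictions_path()

# Simulated evaluation run: the variable is set (hypothetical value),
# so the default is ignored.
os.environ["AICROWD_PREDICTIONS_OUTPUT_PATH"] = "/hypothetical/evaluator/predictions.csv"
evaluation_path = predictions_path()

print(local_path)
print(evaluation_path)
```

The same predict.py therefore runs unchanged in both places: locally it writes to the default you chose, and on the evaluation server it writes wherever the variable points.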