Clarification Questions (Python Starter)

nlevin · April 26, 2021, 5:30pm

Excited to get started on this challenge. As I was working through the Python starter notebook, a couple of questions came up for me.

The starter file generates 362 predictions for what appears to be the validation data. Shouldn’t the predictions submitted be for the test data for which there are 1473 observations or am I mistaken?
What should be the form of our predictions?
In the overview the Output is 2 columns - “ID” and “Predicted Class”, but in the starter notebook the generated predictions are 4 columns - “row_id”, “normal_diagnosis_probability”, “post_alz…probability”, “pre_alz…probability”.

Thanks!

jyotish · April 26, 2021, 5:39pm

Hello @nlevin

Great to have you here!

Yes, you are right! The starter notebook generated predictions on the validation data. During evaluation (when you submit the notebook to AIcrowd), we use test data (which is not exposed to the participants) and re-run the notebook. The test data that is used during the evaluation will have 1473 observations.
The output file should have the row ID (row_id) and probabilities for normal, pre and post-diagnosis (normal_diagnosis_probability, pre_alzheimer_diagnosis_probability, post_alzheimer_diagnosis_probability). The sum of the values of the three diagnoses should not be greater than one, that is,

normal_diagnosis_probability + pre_alzheimer_diagnosis_probability + post_alzheimer_diagnosis_probability <= 1

You can also have a look at the predictions.csv generated by the notebook to get a feel for what the output should look like.
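As a rough illustration of the format described above, here is a minimal sketch that writes rows with those four columns and checks the sum constraint. The column order, the tolerance, the helper names (`check_row`, `to_csv`) and the example values are assumptions made for illustration; they are not taken from the starter notebook or the evaluator.

```python
import csv
import io

# Column names as given in this thread; the order is an assumption.
COLUMNS = [
    "row_id",
    "normal_diagnosis_probability",
    "pre_alzheimer_diagnosis_probability",
    "post_alzheimer_diagnosis_probability",
]


def check_row(row, tol=1e-6):
    """Return True if each probability is in [0, 1] and the three sum to at most 1 (within tol)."""
    probs = [float(row[name]) for name in COLUMNS[1:]]
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and sum(probs) <= 1.0 + tol


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example rows, only to show the shape of the output.
example_rows = [
    {
        "row_id": 0,
        "normal_diagnosis_probability": 0.7,
        "pre_alzheimer_diagnosis_probability": 0.2,
        "post_alzheimer_diagnosis_probability": 0.1,
    },
    {
        "row_id": 1,
        "normal_diagnosis_probability": 0.5,
        "pre_alzheimer_diagnosis_probability": 0.25,
        "post_alzheimer_diagnosis_probability": 0.25,
    },
]

assert all(check_row(row) for row in example_rows)
csv_text = to_csv(example_rows)
```

The small tolerance is there because floating-point sums such as 0.7 + 0.2 + 0.1 may not come out as exactly 1.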

Hope this clears the confusion.

jyotish · April 26, 2021, 5:45pm

Also, thanks for pointing this out. We updated the table as well.

thanish · April 29, 2021, 12:38pm

"The sum of the values of the three diagnoses should not be greater than one

normal_diagnosis_probability + pre_alzheimer_diagnosis_probability + post_alzheimer_diagnosis_probability <= 1"

This just raised a question for me. According to the above statement, does it mean the sum of the 3 class probabilities does not always have to equal 1?
If it’s a multiclass problem, the sum of the probabilities will be equal to 1. However, if it is a multilabel problem, the sum of the 3 class probabilities could be greater than 1 or less than 1. So which one applies here?

mohanty · April 29, 2021, 1:45pm

@thanish: this is a multiclass classification problem, so the sum of the probabilities should be equal to 1.

The description specifies that the sum is less than or equal to 1 in order to stay true to the implementation details of the validation strategies in place. It is also meant to communicate that this is a probability distribution after all, and so cannot sum to more than 1.
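For anyone whose model outputs raw non-negative scores rather than probabilities, one simple way to get rows that sum to 1 is to divide each score by the row total. This is only a sketch under that assumption; the helper name `normalize` is hypothetical and is not part of the starter notebook.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


# Made-up scores for the normal, pre and post classes of a single row.
probabilities = normalize([2.0, 1.0, 1.0])
assert abs(sum(probabilities) - 1.0) < 1e-9
```

Models that already output a softmax or `predict_proba`-style distribution should not need this step.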

Hope this answers your question.