RAW DATA now available in shared folder

Dear all,

The team has now placed the raw MIT/Informa tables in the /shared_data folders under “raw” for our data wranglers to explore. We can start a thread for our data wranglers if there are questions about the data.

As of now, for a variety of reasons explained earlier, this data is not available to the evaluation engine in this raw format.

Since we will now be providing the test data, you will still be able to join raw data onto the test data and pass along additional columns to the evaluator.
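A minimal sketch of attaching raw columns to the test data, assuming both are loaded as lists of dicts that share a key. The key name `trial_id` in the usage note is hypothetical and is not taken from the actual tables; this is only one possible way to do the join.

```python
def add_raw_columns(test_rows, raw_rows, key, extra_columns):
    """Left-join selected raw columns onto the test rows by a shared key."""
    # Index raw rows by key; if a key repeats, the first occurrence wins.
    raw_by_key = {}
    for row in raw_rows:
        raw_by_key.setdefault(row[key], row)

    joined = []
    for row in test_rows:
        match = raw_by_key.get(row[key], {})
        enriched = dict(row)  # copy so the original test row is untouched
        for col in extra_columns:
            # Rows without a match get None so every row has the same columns.
            enriched[col] = match.get(col)
        joined.append(enriched)
    return joined
```

For example, `add_raw_columns(test_rows, raw_rows, "trial_id", ["strStudyDesign"])` would return the test rows with one extra column each, leaving the row count and order unchanged.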

NOTICE: It is the responsibility of the team to ensure that any additional data:

  • DOES NOT CONTAIN INFORMATION AFTER 2015
  • DOES NOT CONTAIN INFORMATION ABOUT THE PHASE 3 TRIAL FOR A LEADERBOARD PREDICTION. This obviously includes, but is not limited to, the outcome of the trial itself. (Note that using such information is actually encouraged for some of the challenge insights questions!)
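The two rules above could be checked before submission with something like the sketch below. It is only an illustration: the column names `record_date` and `phase3_outcome` used in the usage note are hypothetical, and the cutoff assumes "after 2015" means dated 2016-01-01 or later, which should be confirmed with the organizers.

```python
from datetime import date

# Assumption: "after 2015" is read as "later than 2015-12-31".
CUTOFF = date(2015, 12, 31)


def find_leaks(rows, date_column, forbidden_columns):
    """Return a list of human-readable problems found in the additional data."""
    problems = []
    for i, row in enumerate(rows):
        value = row.get(date_column)
        # Rows without a date are not flagged here; they need manual review.
        if value is not None and value > CUTOFF:
            problems.append(f"row {i}: {date_column}={value.isoformat()} is after 2015")
        for col in forbidden_columns:
            if col in row:
                problems.append(f"row {i}: forbidden column {col!r} present")
    return problems
```

Calling `find_leaks(rows, "record_date", ["phase3_outcome"])` returns an empty list when nothing is flagged. A check like this catches only the obvious cases; it cannot detect outcome information hidden inside free-text fields.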

Please Consider: Solutions and predictions by teams choosing to add additional data will be under an additional layer of oversight in both code and model generalizability to ensure no information has been leaked.

If you have observed a performance increase from specific data and would like to pass it into the evaluation cluster, which makes it available to other participants and gives you better validation that it is not leaking information, please email me or post on the forums. This sharing is highly encouraged and would be recognized, especially if it is demonstrated to enable the competition.

Thanks for making the raw data available in this more approachable manner. Is there a way to share the data dictionary for the raw data, or instructions on how to get the data dictionary through the Informa API?

I ask because the master data dictionary provided under “Resources” is not complete. Many variables’ descriptions are missing or not informative. Examples:

  • “strTerminationReason”: what does it mean when this variable is “” (empty string)?

  • “TherapyDescription”, “strStudyDesign”, “strPrimaryEndpoint”, “strPatientPopulation”, “DrugDeliveryRouteDescription”: all are missing variable descriptions.

It’d be helpful if we could know how these variables are collected: are they free text from some trial registry, or, hopefully, more structured metadata from which we could engineer features out of the long text values? For example, we want to create a new endpoint category: biomarker, survival, etc.
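To make the request concrete, this is roughly the kind of feature we have in mind, sketched as keyword matching over a free-text value such as “strPrimaryEndpoint”. The keyword lists are made up for illustration and would need domain review; structured metadata would make this far more reliable.

```python
# Hypothetical keyword lists, for illustration only.
ENDPOINT_KEYWORDS = {
    "survival": ("survival", "mortality", "death"),
    "biomarker": ("biomarker", "hba1c", "ldl", "serum"),
}


def categorize_endpoint(text):
    """Map a free-text endpoint description to a coarse category."""
    if not text:
        return "unknown"  # empty string or None: nothing to classify
    lowered = text.lower()
    # Categories are tried in the order they are listed above.
    for category, keywords in ENDPOINT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"
```

Because this is plain substring matching, a description that mentions several endpoint types falls into whichever category is listed first.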

  • “intpriorapproval”: missing description. What do “[]” and “[0]” mean?

  • “drugCountryName”: missing description, not in the script documentation, and in every row the value is one long string separated by “|” (see screenshot below)
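For now, a “|”-separated value like the one in “drugCountryName” can at least be split into a list, as in the sketch below. It assumes that surrounding whitespace, empty segments, and repeated entries carry no meaning, which is a guess in the absence of a description.

```python
def split_pipe_list(value):
    """Split a "|"-separated string into a sorted list of unique, non-empty items."""
    if not value:
        return []  # empty string or None
    items = {part.strip() for part in value.split("|")}
    items.discard("")  # drop empty segments such as a trailing "|"
    return sorted(items)
```

If the order or the number of repeats turns out to be meaningful, the set should be replaced with a plain list.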