I noticed a few datapoints in the training set with intphaseendyear 1900. Is this a mistake? Perhaps these values are supposed to be missing?
Hard to say without seeing the entire row.
While significant cleaning was done on the data, self-reported fields can still contain mistakes and might require further filtering rules, depending on each team's approach.
This might be something that could be cross-checked quickly. I’d be curious to know the results.
Which row number was it?
There are 14 rows:
X = 49, 240, 506, 1344, 1843, 2037, 2038, 2229, 2882, 3749, 7242, 7243, 7544, 7896
If I had access to the Informa clinical dataset, I could look up the protocolid and match it against other databases to confirm.
Immuno-oncology drugs seem rather unlikely in 1900…
Thanks for the specifics. I have asked the Informa team and am awaiting their response. I assume this is a known issue, and I will relay what they say.
From Imran at Informa:
“Usually when you see a date of 1900-01-01 it means that we do not know the date.”
He also said he can investigate, but we will have the opportunity to ask many questions tomorrow.
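For anyone who wants to handle this before modelling, here is a minimal sketch of treating 1900 as "unknown" rather than as a real year. It assumes the rows are loaded as a list of dicts keyed by column name; the function name `mask_sentinel_years` and the toy rows are hypothetical, and only the column name intphaseendyear comes from the dataset. Whether to set the value to missing or drop the row is a team decision.

```python
# Assumption: per Informa's reply, a 1900 date means "date not known".
SENTINEL_YEAR = 1900


def mask_sentinel_years(rows, column="intphaseendyear", sentinel=SENTINEL_YEAR):
    """Return (cleaned_rows, flagged_indices).

    cleaned_rows are copies of the input rows with the sentinel year
    replaced by None; flagged_indices lists the positions that were changed.
    """
    cleaned = []
    flagged = []
    for index, row in enumerate(rows):
        row = dict(row)  # copy so the caller's data is left untouched
        if row.get(column) == sentinel:
            row[column] = None
            flagged.append(index)
        cleaned.append(row)
    return cleaned, flagged


# Toy rows for illustration only, not taken from the real training set.
sample_rows = [
    {"intphaseendyear": 2012},
    {"intphaseendyear": 1900},
    {"intphaseendyear": 2016},
]

cleaned_rows, flagged_indices = mask_sentinel_years(sample_rows)
print(flagged_indices)  # positions whose year was treated as missing
```

Keeping the flagged positions makes it easy to compare against the 14 rows listed above, or to add a "year was missing" indicator feature instead of discarding the rows.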