Data from the future

johnpateha · July 20, 2020, 9:57am

I just started analyzing the competition data and realized that a large part of it is just straight lines.
So if you find the coordinates and time difference for the two distant endpoints of a line made up of a hundred other points from one aircraft, you can easily locate all the points between them, since in most cases the aircraft's speed is near-constant.
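
As a rough illustration of the straight-line idea, here is a minimal sketch that linearly interpolates positions between two known endpoints by timestamp. It assumes constant speed on a straight segment and treats latitude/longitude as planar, which is only reasonable over short distances. The function name `interpolate_track` and the sample coordinates are hypothetical and not taken from the competition data.

```python
def interpolate_track(t0, p0, t1, p1, times):
    """Estimate (lat, lon) at each time in `times`, assuming the aircraft
    moves at constant speed on a straight line from p0 (at t0) to p1 (at t1).

    Assumption: lat/lon are treated as planar coordinates, which is only a
    reasonable approximation for short segments.
    """
    if t1 == t0:
        raise ValueError("endpoints must have different timestamps")
    positions = []
    for t in times:
        # Fraction of the segment covered at time t (0 at t0, 1 at t1).
        frac = (t - t0) / (t1 - t0)
        lat = p0[0] + frac * (p1[0] - p0[0])
        lon = p0[1] + frac * (p1[1] - p0[1])
        positions.append((lat, lon))
    return positions


# Hypothetical endpoints 100 seconds apart; query two intermediate times.
points = interpolate_track(0.0, (47.0, 8.0), 100.0, (48.0, 10.0), [25.0, 50.0])
```

Note that this uses the later endpoint, so it is an offline method: it relies on exactly the "future" data discussed in this thread.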

And the main question: can data from future test points be used when we predict coordinates for other points? I guess future data could significantly improve results.
In real life this is not possible, but I didn’t find any restrictions in the competition rules.
I hope the organizers will clarify this point.
I think it would be better if the winners’ solutions were useful in practice.

sconfina · July 20, 2020, 1:46pm

Waiting for @masorx's answer… I think the solution must be useful in real applications, so it is better not to use future data.
Ciao
Mauro

vitaly_bondar · July 20, 2020, 2:46pm

I agree that models need to be practical and useful
It depends on what counts as practical. For example, if your task is anti-spoofing, then looking at the whole airplane track is the right idea.
It would be wrong to add significant details to the rules in the final stage of the competition.
The train data that we have includes “data from the future”, and obviously everyone uses it.
It would be practical if the model were limited to the data provided by the organizers in the data archive.

masorx · July 20, 2020, 4:43pm

Hi John,

Excellent question, and one we discussed internally before the competition. In short, we are well aware of this. It is fine for this round to use all given data without restrictions and, as mentioned, it wouldn’t be fair to change the rules after several weeks. However, we are definitely considering future rounds with such a premise.

There are use cases where we’d analyse historical data just like the data given and can use everything available: @vitaly_bondar has named one, and OSINT would be another. Thus, live tracking is not everything and there is utility in doing it after the fact. On the other hand, live tracking is very important in other fields, and applicable solutions do have higher value there.

Scientifically, it will be interesting to analyse the different approaches and see which ones are best for which use case. This is why we require the code to be open sourced for the award recipients. Then we can compare and evaluate.

Finally, part of what you write is actually applicable for live tracking, too: the best prediction for the next data point most of the time would be that aircraft heading & speed are staying constant.
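
The constant heading and speed prediction mentioned above can be sketched as a simple extrapolation from the last two observed points. This is a minimal illustration, not a competition baseline: the function name `predict_next` and the sample values are hypothetical, and latitude/longitude are again treated as planar over the short prediction horizon.

```python
def predict_next(t_prev, p_prev, t_last, p_last, t_next):
    """Predict (lat, lon) at t_next, assuming the aircraft keeps the velocity
    observed between its two most recent points.

    Uses only past observations, so it is applicable to live tracking.
    Assumption: lat/lon are treated as planar over the short horizon.
    """
    dt = t_last - t_prev
    if dt <= 0:
        raise ValueError("observations must be in increasing time order")
    # Velocity in degrees per second along each axis.
    v_lat = (p_last[0] - p_prev[0]) / dt
    v_lon = (p_last[1] - p_prev[1]) / dt
    horizon = t_next - t_last
    return (p_last[0] + v_lat * horizon, p_last[1] + v_lon * horizon)


# Hypothetical observations 10 seconds apart; predict 10 seconds ahead.
prediction = predict_next(0.0, (47.0, 8.0), 10.0, (47.1, 8.2), 20.0)
```

Unlike interpolation between two endpoints, this needs nothing from the future, at the cost of growing error whenever the aircraft turns or changes speed.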

Best,
Martin

johnpateha · July 21, 2020, 7:45am

Thank you for the clarification.

I participated in some Kaggle competitions where the winners used data leaks from the future. I'm not sure the organizers were happy with top-scoring solutions that relied on data leaks.

I hope this time you will get something useful, but for me, tasks with leaks are not so attractive.

Best,
Evgeny

masorx · July 22, 2020, 10:15pm

Hi Evgeny,

That’s totally understandable! Hope to have you onboard in a future round/competition!

Out of interest, since in this setting it is impossible to withhold future data in a meaningful way (tracking, as opposed to point-by-point prediction, is explicitly desired), would you be happy with a simple rule against its use? We can only check after the fact whether contestants adhered to such a rule.

RomanChernenko · July 23, 2020, 8:56am

Hello @masorx

Just adding a rule that prohibits the use of future data is not enough in general. It is always possible to implement a method that solves the offline version of the problem with all the future data, and then fine-tune the official online method using the predictions of that hidden offline implementation as “ground truth”.
If you really want to solve an online tracking problem, you need to devise some way to strictly hide future data, as in the Flatland challenge. But that would require something like kernel competitions on Kaggle.

johnpateha · July 26, 2020, 7:39am

Hi masorx!

I understand the problem that it is impossible to hide future data, and I think that rules against using future data could be a good idea. Of course, you would need to spend more time investigating the top solutions, but it increases the chances of getting a practical solution.
A code competition with a hidden test set would also help: it could not completely prevent the use of future data, but it would at least make it harder to analyze the test data in depth.

I hope you will implement some restrictions for the next competitions. We are all interested in the practical implementation of DS solutions after competitions.

masorx · July 26, 2020, 9:22am

Since we have a lot of test data, we will certainly test the provided solutions for this round on them (although this won’t change the winners), but we will consider implementing hidden tests for the next rounds.