Multiple dead submissions

joe_booth · March 29, 2019, 9:01pm

@mohanty - I think I may have overloaded the system last night - I was trying to trigger multiple submissions so that they would all queue up. Two show as Failed, and eight still show as ImageBuildStarted after 15 hrs.

I’m not sure if there is anything I can do on my end.

mohanty · March 29, 2019, 9:32pm

Hey, please do not worry about that.
We will do a cleanup of the stale submissions soon.

joe_booth · March 29, 2019, 9:34pm

OK - thanks! As a rule, should I try to limit the number of concurrent submissions?

I know statistically that my current agent can score 10 on each seed, so if I submit it enough times, at some point it should score that.

mohanty · March 30, 2019, 3:22pm

@joe_booth: The evaluator is set up to properly queue the evaluation jobs as they come in.

But making multiple submissions with the same agent in the hope of squeezing out a bit more score wouldn't exactly be in the spirit of the competition. It would also cause a lot of inconvenience to other participants, whose jobs would be stuck in the queue.

We originally allowed only 5 submissions per day, to limit attempts by participants to overfit on this particular seed set.
But we eventually increased it to 20, as participants needed more submissions to debug their submissions, etc.

That said, we should see the current leaderboard state as only a “tentative” ordering, as each evaluation uses only 5 episodes. If we treat the overall rewards (and hence the scores) as random variables with a particular distribution, then it's easy to see that with just 5 episodes per evaluation, our confidence in the estimate of each score is quite low. That, in turn, limits (to some extent) our confidence in the current ordering of the participants on the leaderboard.
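As a rough illustration of that point (a sketch only, not the actual evaluator), the snippet below simulates two hypothetical agents whose true mean scores are close together and counts how often the weaker one comes out ahead. The true means, the spread, and the Gaussian per-episode reward are all made-up assumptions, as are the function names.

```python
import random


def mean_score(true_mean, spread, episodes, rng):
    """Average reward over `episodes` simulated episodes.

    Gaussian per-episode rewards are an assumption for illustration only.
    """
    return sum(rng.gauss(true_mean, spread) for _ in range(episodes)) / episodes


def misorder_rate(episodes, trials=1000, seed=0):
    """Fraction of trials in which the weaker agent gets the higher mean score."""
    rng = random.Random(seed)
    strong, weak, spread = 8.5, 8.0, 2.0  # hypothetical values, not real leaderboard data
    wrong = 0
    for _ in range(trials):
        weak_score = mean_score(weak, spread, episodes, rng)
        strong_score = mean_score(strong, spread, episodes, rng)
        if weak_score > strong_score:
            wrong += 1
    return wrong / trials


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(f"{n:>3} episodes per evaluation: mis-ordered in {misorder_rate(n):.1%} of trials")
```

Under these assumptions, the two agents are frequently swapped at 5 episodes per evaluation, and the swaps largely disappear as the episode count grows.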

Now, before the final prizes are decided, we will rerun the evaluations using a much larger number of episodes per evaluation to get a better estimate of the distribution of that random variable. We could, in principle, keep increasing the number of episodes per evaluation until we get a statistically significant difference between the top-N players. The point is that increasing the number of submissions you make now would not help any further, as we will be doing the same multi-runs offline later anyway.
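To make that concrete, here is a minimal sketch (again, not the real evaluation pipeline) comparing the best score out of many 5-episode submissions of the same agent against a single evaluation with many more episodes. The true mean of 8.0, the spread of 2.0, the Gaussian rewards, and the 1000-episode final run are assumptions chosen only for illustration.

```python
import random


def evaluate(true_mean, spread, episodes, rng):
    """Mean reward over `episodes` simulated episodes (Gaussian noise is an assumption)."""
    return sum(rng.gauss(true_mean, spread) for _ in range(episodes)) / episodes


def best_of(submissions, episodes=5, true_mean=8.0, spread=2.0, seed=0):
    """Best score seen when the same agent is submitted `submissions` times."""
    rng = random.Random(seed)
    return max(evaluate(true_mean, spread, episodes, rng) for _ in range(submissions))


if __name__ == "__main__":
    # Resubmitting picks out the luckiest short evaluation.
    print("best of  1 submission :", round(best_of(1), 2))
    print("best of 20 submissions:", round(best_of(20), 2))
    # A long offline evaluation lands close to the agent's true mean instead.
    final = evaluate(8.0, 2.0, 1000, random.Random(1))
    print("1000-episode evaluation:", round(final, 2))
```

The best-of-20 number only reflects luck across short runs. The long evaluation converges towards the agent's true mean, so the extra submissions do not carry over to the final ranking.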

joe_booth · April 1, 2019, 6:19am

Hi @mohanty - thanks for pointing this out - sorry, I didn’t see this until this evening and so had continued to push, but I will back off per your points.

I think it’s great that you extended the timeline - it gives us the time to take more risks with experimental approaches.