Was it known in advance that we should select one of our existing submissions for the final evaluation?

Hi everyone,

I’m wondering if I’m the only one who has just learned that we should select one of our previous submissions for the final evaluation. I cannot find any official statement about this, and the only clue I can find now is this answer, which I had read before without paying much attention to the word “existing”. That was a mistake on my part, but I humbly don’t think an answer in the forum should count as a formal statement.

It’s really frustrating to learn this at this point, as none of my previous solutions was prepared for the final evaluation. I thought the challenge was to find a good solution, but in the end I found myself trapped in a word game. I don’t mean to complain, since I am certainly responsible for the mistake above. However, if anyone feels the same way, please say something. Maybe together we can make the game more interesting.

Hey @the_raven_chaser

I understand your frustration, especially after coming this far in the competition and feeling disappointed :confused:

However, participants were made aware that they could only make submissions for the Warm-up Round, Round 1, and Round 2. I am pasting the relevant content from the challenge overview page below.

ROUND 2 (FINALS)

Round 2 will evaluate submissions on the 16 public Procgen environments, as well as on the 4 private test environments. Participants’ final score will be a weighted average of the normalized return across these 20 environments, with the private test environments contributing the same weight as the 16 public environments.

This round will evaluate agents on both sample efficiency and generalization.

Sample efficiency will be measured as before, with agents restricted to training for 8M timesteps.

Generalization will be measured by restricting agents to 200 levels from each environment during training (as well as 8M total timesteps). In both cases, agents will be evaluated on the full distribution of levels. We will have separate winners for the categories of sample efficiency and generalization.
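As an aside, the level restriction for the generalization track can be expressed with the public procgen package roughly like this (an illustrative snippet, not the competition’s actual training harness; the environment choice and distribution_mode are arbitrary):

```python
# Illustrative only: restricting training to 200 levels of a Procgen environment.
import gym

env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=200,            # training only ever samples 200 distinct levels
    start_level=0,
    distribution_mode="easy",  # assumption; not part of the rule above
)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```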

Because significant computation is required to train and evaluate agents in this final round, only the top 50 submissions from Round 1 will be eligible to submit solutions for Round 2. The leaderboard will report performance on a subset of all environments, specifically on 4 public environments and 1 private test environment.

The top 10 submissions will be subject to a more thorough evaluation, with their performance being averaged over 3 separate training runs. The final winners will be determined by this evaluation.
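To make the weighting above concrete, here is a rough sketch of how such a final score could be computed. This is illustrative only: the environment names and return values are made up, and it assumes the 4 private test environments together carry the same total weight as the 16 public environments.

```python
# Illustrative sketch only -- one possible reading of the Round 2 scoring rule.
# `normalized_return[env]` is assumed to already be normalized per environment.

def final_score(normalized_return, public_envs, private_envs):
    # Assumption: the private test environments as a group carry the same
    # total weight as the public environments, i.e. each group contributes 50%.
    public_avg = sum(normalized_return[e] for e in public_envs) / len(public_envs)
    private_avg = sum(normalized_return[e] for e in private_envs) / len(private_envs)
    return 0.5 * public_avg + 0.5 * private_avg

# Hypothetical usage with 2 public and 1 private environment:
returns = {"coinrun": 0.80, "bigfish": 0.55, "private_env_1": 0.60}
print(final_score(returns, ["coinrun", "bigfish"], ["private_env_1"]))
```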

Round 2 marks the end of the competition in terms of submissions. The final evaluations that we run are only to make sure that the results are consistent across different runs.

Normally, we would pick the top 10 entries from the leaderboard and use them for the final evaluations.
Based on the discussion in the thread Round 2 is open for submissions 🚀, we decided to let each participant choose their best submission (which would probably be the best-scoring one on the leaderboard) instead of us taking the best-scoring submission.

I hope this helps.


Hi @vrv

Thank you for the response. Yeah, after thoroughly reviewing the overview page and the answer I linked before, I know that was my bad. However, we have not always followed these rules exactly, right? For example, in the second round we used 6 public and 4 private test environments instead of the 4 and 1 described on the overview page. Also, this answer said we would get to pick 3 submissions, but at the end of the day we only pick one.

Maybe I should have asked this earlier instead of wishfully thinking a new submission would be allowed. At this point, I don’t know which submission I should use because, as I said before, none of them was made with the final evaluation in mind.

The reason I posted this was to see whether anyone else is facing a similar situation. If I’m the only one, I’ll accept it.

Although some of the above may sound like complaining, that is not my intention. I’ve learned a lot during the competition and received a lot of help from you all. Thank you.


Hi @the_raven_chaser

I’m just curious as to why you think none of your submissions was prepared for the final evaluation? … Full disclosure: we did not try to tune our submissions to the remaining 10 environments either (though we knew the final evaluation would be done on 20 envs).

Hi @dipam_chakraborty

The final evaluation measures generalization, but I did not use any regularization such as batch normalization or data augmentation in my previous submissions. Also, in my last few submissions, I chose to experiment with a newly introduced hyperparameter instead of using the value that had performed well on my local machine.
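For context, this is roughly the kind of regularization I mean. A minimal, illustrative PyTorch sketch (not code from any of my submissions; the layer sizes, names, and crop padding are arbitrary):

```python
# Illustrative sketch of two common regularizers for pixel-based RL:
# batch normalization in the encoder and random-crop data augmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallEncoder(nn.Module):
    """Tiny conv encoder with batch normalization after each conv layer."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=3, stride=2)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2)
        self.bn2 = nn.BatchNorm2d(64)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return x.flatten(start_dim=1)

def random_crop(obs, pad=4):
    """Pad-and-random-crop augmentation (RAD/DrQ-style).
    Applies a single crop offset to the whole batch for simplicity.
    obs: float tensor of shape (batch, channels, height, width)."""
    b, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, :, top:top + h, left:left + w]

# Hypothetical usage on a dummy batch of 64x64 RGB observations:
obs = torch.rand(8, 3, 64, 64)
features = SmallEncoder()(random_crop(obs))
print(features.shape)
```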