Different results between training, debug and evaluation

We are seeing very different results in evaluation compared to what we get locally and in debug mode.

Locally and in debug mode we get the same expected score, while in evaluation the performance is unexpectedly low, sometimes a lot lower.
For example, in episode 1 we always stop at level 5 in evaluation. According to our stats we have a 99% success rate across the training seeds at level 5, and indeed we never fail at level 5 locally or in debug mode.

I understand that the evaluation seeds are different, but we cannot see how that alone could cause such a large difference. We tried changing the model used at level 5, but the behavior is the same: fine locally and in debug mode, failing in evaluation.

Any ideas?

For the admins, this is one debug test, for instance:
This is one evaluation:

It would be helpful to have some information on episode 1 (a video?) so we can understand how the agent dies.

Could it be that you overfit to the 100 seeds available in the shared binary of the OT environment? AFAIK the evaluation uses seeds outside these 100.
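
One way to check for this locally might be to hold out some of the 100 seeds, train only on the rest, and compare the scores on the two groups. Below is a minimal sketch of that idea, not anything taken from the challenge code. It assumes the seeds are numbered 0 to 99, and `run_episode` is a hypothetical function you would write yourself that runs your agent on one seed and returns the number of floors reached.

```python
import random
from statistics import mean


def split_seeds(seeds, holdout_fraction=0.2, rng_seed=0):
    """Shuffle the seeds reproducibly and split them into (train, held_out)."""
    shuffled = list(seeds)
    random.Random(rng_seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return sorted(shuffled[n_holdout:]), sorted(shuffled[:n_holdout])


def generalization_gap(run_episode, train_seeds, held_out_seeds):
    """Return (train_mean, held_out_mean, gap) of the per-seed scores.

    run_episode is a hypothetical callable (seed -> floors reached);
    it is an assumption of this sketch, not part of any library.
    """
    train_mean = mean([run_episode(s) for s in train_seeds])
    held_out_mean = mean([run_episode(s) for s in held_out_seeds])
    return train_mean, held_out_mean, train_mean - held_out_mean
```

For example, `train_seeds, held_out_seeds = split_seeds(range(100))` gives 80 seeds to train on and 20 that the agent never sees during training. A large positive gap on the held-out seeds would suggest the agent memorized the training seeds rather than learned something that transfers.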

We saw something similar: training the model further did not make the results any more reliable in evaluation.

Thanks. We now think overfitting is the problem as well.
I did not expect to overfit on such a complex environment, but we probably did.
On the training seeds we average above 20 over 5 runs, but the evaluation seeds seem to be a totally different story.