We are constantly trying to make this challenge better for everyone and would really appreciate your feedback.
Feel free to reply to this thread with your suggestions and feedback on improving the challenge for you!
- What have been your major pain points so far?
- What would you like to see improved?
Things have been pretty smooth so far, though aicrowd-gym has had some growing pains (issues with Pillow and with iterating over Dict spaces in aicrowd-gym). As long as those issues can be worked around, though, things are cool!
I just gave some feedback about tight time limits in this comment.
I agree with some of your points and have shared them with the NetHack team. We'll discuss it and may increase the time limit if they agree. Thanks for the feedback.
I would like to see continued improvement in returning log information to submitters. We’re seeing more than we were before a recent bugfix, but here’s the log section from Panic submission 46
This challenge is super awesome! Thank you so much! I love it!
One thing for improvement: I would have liked to see role, race, gender and alignment in the Observation (blstats).
I really appreciated that you provided the starter kit: it let me get good scores without a heavy hardware setup. My laptop is very old.
One thing I do not like at all: the heavy score variance. With one and the same model, my median score ranges from 400 to 500 between evaluations, and the mean score also differs widely from one evaluation to the next. I hope you increase the final evaluation to more than 2048 episodes. Alternatively, you could use a fixed distribution of roles, e.g. 20 Tourist episodes, 20 Healer episodes, and so on.
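To make the "fixed distribution of roles" idea concrete, here is a minimal sketch of a balanced (stratified) schedule that assigns roles round-robin instead of at random. It assumes the 13 standard NetHack roles and that an evaluator could choose the role per episode; `balanced_role_schedule` is a hypothetical helper, not part of aicrowd-gym or the NLE API.

```python
from collections import Counter

# The 13 playable NetHack roles (assumption: the evaluator can pick one per episode).
ROLES = ["Archeologist", "Barbarian", "Caveman", "Healer", "Knight", "Monk",
         "Priest", "Ranger", "Rogue", "Samurai", "Tourist", "Valkyrie", "Wizard"]


def balanced_role_schedule(n_episodes, roles=ROLES):
    """Hypothetical helper: assign roles round-robin so every role gets
    n_episodes // len(roles) episodes, and the remainder goes one each
    to the first roles in the list."""
    return [roles[i % len(roles)] for i in range(n_episodes)]


schedule = balanced_role_schedule(2048)
counts = Counter(schedule)
# 2048 = 13 * 157 + 7, so seven roles get 158 episodes and six get 157.
print(min(counts.values()), max(counts.values()))  # 157 158
```

With random role draws, the per-role counts in 2048 episodes fluctuate from run to run, and since some roles score far higher than others, that fluctuation feeds directly into the variance of the final score. A fixed schedule removes that source of noise.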
I believe the finals will use 4096 runs.
I really like the stable distribution idea – maybe next year? One of the things I have considered doing is writing a function to normalize our local scores to an even role distribution, but it has never moved to the top of my priority list.
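A function like that could look roughly like the sketch below: aggregate scores within each role first, then average the per-role values so every role carries equal weight no matter how often it was drawn locally. Using the median within each role (to mirror the median-based leaderboard) and a plain mean across roles are both assumptions; `role_normalized_score` and the sample results are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def role_normalized_score(results, agg=median):
    """Hypothetical helper. results: iterable of (role, score) pairs.
    Aggregates scores within each role with `agg`, then takes the
    unweighted mean of the per-role values (equal weight per role)."""
    by_role = defaultdict(list)
    for role, score in results:
        by_role[role].append(score)
    per_role = {role: agg(scores) for role, scores in by_role.items()}
    return mean(per_role.values()), per_role


# Made-up local results where Valkyrie happens to be over-represented.
results = [("Valkyrie", 1200), ("Valkyrie", 1000), ("Valkyrie", 1100),
           ("Tourist", 100), ("Healer", 300)]
overall, per_role = role_normalized_score(results)
print(per_role)  # {'Valkyrie': 1100, 'Tourist': 100, 'Healer': 300}
print(overall)   # 500
```

For comparison, the plain median over all five episodes is 1000, inflated by the three Valkyrie runs; the role-normalized figure (500 in this example) does not depend on how many episodes each role happened to get.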