I think the current evaluation prompt is too strict: it pushes everyone to answer ‘I don’t know’ frequently just to avoid negative scores. In reality, many answers could be considered partially correct, and human evaluators would presumably take that into account. Under the current setup, though, the top-10 models don’t even attempt partially correct answers, and their overly cautious strategies might actually perform worse in human evaluation than strategies that score below 0. Yet those below-zero strategies never reach human review at all. I suggest the organizers relax the evaluation prompt at least enough to allow some score differentiation.
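To make the suggestion concrete, here is a minimal sketch in Python of how a partial-credit rubric could change the incentives. The labels and point values below are my own illustrative assumptions, not the competition’s actual rubric.

```python
# Hypothetical partial-credit scoring, contrasted with a strict binary rule.
# Labels and point values are illustrative assumptions, not the official rubric.

STRICT_SCORES = {
    "correct": 1.0,
    "abstain": 0.0,         # "I don't know" is the safe play
    "incorrect": -1.0,      # any imperfect attempt is punished
}

PARTIAL_SCORES = {
    "correct": 1.0,
    "mostly_correct": 0.6,  # attempts with minor errors still earn credit
    "partially_correct": 0.3,
    "abstain": 0.0,
    "incorrect": -0.5,
}

def total_score(labels, table):
    """Sum per-answer scores; labels missing from a rubric count as incorrect."""
    return sum(table.get(label, table["incorrect"]) for label in labels)

# Example: a model that attempts answers vs. one that mostly abstains.
attempting = ["correct", "mostly_correct", "partially_correct", "incorrect"]
abstaining = ["correct", "abstain", "abstain", "abstain"]

print(total_score(attempting, STRICT_SCORES))   # -2.0: attempting loses badly
print(total_score(abstaining, STRICT_SCORES))   #  1.0
print(total_score(attempting, PARTIAL_SCORES))  #  1.4: attempting now wins
print(total_score(abstaining, PARTIAL_SCORES))  #  1.0
```

Under the strict rule, abstaining dominates; with even modest partial credit, attempting answers becomes the better strategy, which seems closer to how human evaluators would judge the same responses.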
Moreover, I believe the evaluation prompt should be made public. Anyone who wants to ‘hack’ the prompt doesn’t actually need to know its exact wording; keeping it secret only widens the gap between local testing and server-side evaluation results.