We are updating the metrics used to compute the leaderboards during the challenge.
Noting that the final prizes and standings will be decided based on the outcomes of the human evaluations, we explored how closely the current metrics correlated with the overall scores from the human evaluations.
After a thorough investigation, we noticed that Word-Level F1 and 4-gram BLEU did not accurately reflect the performance of the submitted models.
In light of this observation, and to provide teams with more accurate feedback on their performance, we have decided to incorporate three new metrics: CPDScore, USEScore, and BERTScore. CPDScore and USEScore have demonstrated superior accuracy compared to the previously used metrics. Moving forward, CPDScore will be the primary metric for leaderboard rankings, while BERTScore will serve as an additional metric for reference due to its widespread use in automatic evaluation benchmarks.
CPDScore is an LLM-based metric that uses a prompt similar to G-EVAL. The metric focuses on “Humanness”, whose criteria are described in the prompt. For Round 2, it employs GPT-3.5-turbo-0125, while for the final leaderboard we may use a stronger model.
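For clarity, here is a minimal sketch of what an LLM-judge metric of this kind can look like. The prompt wording, the 0 to 5 scale (mentioned later in this thread), and the client code are illustrative assumptions, not the actual CPDScore implementation.

```python
# Illustrative sketch of a G-EVAL-style LLM judge for "Humanness".
# The prompt text and 0-5 scale are assumptions, not the official CPDScore prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You will be given a dialogue history and a candidate response.
Rate the Humanness of the response on a scale from 0 to 5, where 5 means it
reads like a natural, engaged human speaker. Reply with a single number only.

Dialogue history:
{history}

Candidate response:
{response}

Humanness score:"""

def humanness_score(history: str, response: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(history=history, response=response)}],
    )
    return float(completion.choices[0].message.content.strip())
```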
USEScore calculates similarity using the Universal Sentence Encoder.
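As a reference point, computing a USE-based similarity between a generated response and the reference can be as simple as the sketch below; the hub URL is the public Universal Sentence Encoder release, and the rest is an assumption about how USEScore might be computed.

```python
# Illustrative sketch: cosine similarity between Universal Sentence Encoder
# embeddings of a candidate response and the reference response.
import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_similarity(candidate: str, reference: str) -> float:
    emb = use([candidate, reference]).numpy()  # two 512-dim embeddings
    a, b = emb[0], emb[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```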
BERTScore is a metric commonly used in automatic evaluation benchmarks.
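BERTScore can be reproduced locally with the public bert-score package; the snippet below uses the package defaults (e.g. its standard English model), which may differ from the exact configuration used on the leaderboard.

```python
# Illustrative sketch using the public bert-score package with its defaults.
from bert_score import score

candidates = ["I love hiking on weekends, it clears my head."]
references = ["On weekends I usually go hiking to relax."]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```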
We have added the scores to all submissions of Round 2 - Task 1.
Thank you for the update @dipam; the new metrics do seem more reasonable. I have a few questions regarding the human evaluation, though.
My teammate and I are creating a new persona-based chat dataset that is more diverse in terms of conversation settings and the number of persona facts, and it will likely produce diverse, factually sensible responses that are a bit further from the ground truth. Our questions are:
Is there a ranking cutoff for submissions selected for human evaluation?
What do the human evaluation metrics / rubric consist of?
Do we need to include the dataset in the GitLab submission, or will we be contacted for the dataset, documentation, etc. if our submission is selected for human evaluation?
We were also seeing almost zero correlation between the leaderboard and our offline scores when looking at F1 or BLEU, so this is a welcome change. Thanks for this.
However, since the API track uses GPT-3.5, won't the GPT-3.5-based scores be naturally higher for this track? At this point in time it's clearly visible on the leaderboard as well: all GPU track = false submissions are high scoring and all GPU track = true submissions are low scoring.
@saidinesh_pola Yes, the GPT-3.5 score is GPT-3.5-turbo generating a score based on a modified prompt similar to G-EVAL. It scores every utterance with the conversation history as context.
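To make the "every utterance with the conversation history as context" part concrete, the aggregation might look like the sketch below; the `llm_judge` callable stands in for the GPT-3.5 call, and averaging over utterances is an assumption about how per-utterance scores are combined.

```python
# Illustrative aggregation: score each system utterance given the preceding
# conversation, then average. The judging function and the averaging step
# are assumptions; only "per utterance, with history as context" is stated above.
from typing import Callable, List, Tuple

def conversation_score(
    turns: List[Tuple[str, str]],            # (speaker, utterance) pairs, in order
    llm_judge: Callable[[str, str], float],  # returns a score for (history, utterance)
) -> float:
    scores: List[float] = []
    history: List[str] = []
    for speaker, utterance in turns:
        if speaker == "system":
            scores.append(llm_judge("\n".join(history), utterance))
        history.append(f"{speaker}: {utterance}")
    return sum(scores) / len(scores) if scores else 0.0
```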
@unnikrishnan.r I don't see why GPT-3.5 would naturally score prompt track submissions higher. Yes, that is what is currently occurring, but there is no inherent reason for it. The metrics were decided based on actual human evaluations done on a blindly selected subset of Round 1 conversations, and the GPT-3.5 scores were the most correlated with the human evaluations.
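For anyone curious what "most correlated with the human evaluations" means in practice, the usual check is a rank correlation between each automatic metric and the human scores over the same set of conversations, along the lines of the sketch below (the numbers are made up; only the method is the point).

```python
# Illustrative correlation check between automatic metrics and human scores.
# All score values here are invented for the example.
from scipy.stats import spearmanr

human   = [4.2, 3.1, 4.8, 2.5, 3.9]       # hypothetical human evaluation scores
gpt_35  = [4.0, 3.3, 4.6, 2.8, 3.7]       # hypothetical GPT-3.5-based scores
word_f1 = [0.21, 0.24, 0.19, 0.22, 0.20]  # hypothetical word-level F1 scores

for name, metric in [("GPT-3.5 score", gpt_35), ("Word-level F1", word_f1)]:
    rho, p = spearmanr(human, metric)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```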
Hi @dipam,
Could you specify the exact number or range of submissions that will undergo human evaluation? The guidelines state that “only the top several systems…will be judged via human evaluation”, but this is unclear.
Thank you.
@dipam Using ChatGPT, you are producing a score between 0 and 5, but is there a backup pipeline if the generated value is a random text string or some out-of-bounds number? For human evaluation, it would also be preferable to select the best leaderboard submissions from both the GPU and PE tracks.
@saidinesh_pola I can't share all the details about the GPT metric, but we do handle cases where the output is not a valid score. For the final submissions, each team will get to select any 2 successful submissions, regardless of whether they are on the GPU track or not.
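The exact handling hasn't been shared, but one common guard looks like the sketch below: extract a number from the model's reply, clamp it to the valid range, and signal a retry or default when nothing parseable comes back.

```python
# Illustrative guard for LLM-judge outputs: parse, clamp to [0, 5], or give up.
# This is not the organizers' pipeline, just one way such cases can be handled.
import re
from typing import Optional

def parse_llm_score(raw: str, low: float = 0.0, high: float = 5.0) -> Optional[float]:
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return None  # caller can retry the request or substitute a default
    return min(max(float(match.group()), low), high)

print(parse_llm_score("Score: 4"))   # 4.0
print(parse_llm_score("seven-ish"))  # None -> retry or default
print(parse_llm_score("12"))         # clamped to 5.0
```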
Yes, at least five teams from the final leaderboard will be selected for human evaluation. There is a possibility that the number of teams selected may increase, which will be communicated at a later date.
We are currently reviewing the criteria for human evaluation. Although we intend to adhere to the existing criteria, we may make adjustments if we encounter any difficulties during the review process. Further details are explained over here: Task 1: Commonsense Dialogue Response Generation.
Yes, if your submission ranks at the top of the leaderboard by the end of the challenge, we will contact you via email to collect any datasets you have gathered or created for training your model.
The only problem is that the leaderboard is dominated by the ChatGPT-based PE track, yet the PeaCoK paper's human evaluation does not show the same result for ChatGPT/GPT-4. It is as if their experimental findings can be overturned simply by using prompt engineering. Could someone please clarify this?
In the human evaluation, we find that facts generated by COMET-BART receive a high acceptance rate by crowdworkers for plausibility, slightly beating few-shot GPT-3. We also find that the zero-shot GPT-3.5 model, although more advanced than the GPT-3 baseline model, scores, on average, ∼15.3% and ∼9.3% lower than COMET-BART in terms of automatic metrics and human acceptance, respectively.