Which Dataset and Model should I use?

master-pc9845x · June 11, 2024, 3:59am

I'm a tad confused: the main overview says that I need to use an external dataset and model, but the starter kit uses llama-7b with data/development.json as the dataset. Which dataset and model should I actually use?

yilun_jin · June 11, 2024, 6:12am

In short:

Model: You can use any model that is publicly available/accessible, or you can train your own. If you train your own, you should make your model publicly available after the competition.
Data: You can use any data that is publicly available/accessible, or you can build your own dataset. Similarly, if you build your own, you should make your dataset publicly available after the competition.

Other than that, the choice of model and data is yours.

master-pc9845x · June 11, 2024, 6:32am

Thanks. One more thing: if I build my own dataset, should I include the ground truths (labels) together with the prompt and MCQ flag?

yilun_jin · June 11, 2024, 7:31am

It’s simply your choice.