Which Dataset and Model should I use?

I'm a bit confused: the main overview says I need to use an external dataset and model, but the starter kit uses llama-7b with data/development.json as the dataset. Which dataset and model should I actually use?

In short:

  1. Model: You can use any model that is publicly available or accessible, or you can train your own. If you train your own, you should make it publicly available after the competition.
  2. Data: You can use any data that is publicly available or accessible, or you can build your own dataset. Likewise, if you build your own, you should make it publicly available after the competition.

Other than that, the choice is yours.

Thanks. One more thing: if I build my own dataset, should I include the ground truths (labels) together with the prompt and MCQ flag?

That is also up to you.