A tad confused: the main overview says I need to use an external dataset and model, but the starter kit uses llama-7b with data/development.json as the dataset. Which dataset and model should I actually use?
In short:
- Model: You can use any model that is publicly available or accessible, or you can train your own. If you train your own, you should make it publicly available after the competition.
- Data: You can use any data that is publicly available or accessible, or you can build your own dataset. If you build your own, you should likewise make it publicly available after the competition.
Other than that, the choice is yours.
Thanks. One more thing: if I build my own dataset, should I include the ground truths (labels) together with the prompt and MCQ flag?
It’s simply your choice.
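For anyone building their own dataset, here is a minimal sketch of how entries with an optional label could be stored. The field names (`prompt`, `is_mcq`, `label`) and the helper `make_entry` are hypothetical, made up for illustration only; they are not the starter kit's schema, so check data/development.json for the actual format.

```python
import json

# Hypothetical field names for illustration; not the competition's official schema.
def make_entry(prompt, is_mcq, label=None):
    """Build one dataset record; the label key is added only when a label is given."""
    entry = {"prompt": prompt, "is_mcq": is_mcq}
    if label is not None:
        entry["label"] = label
    return entry

entries = [
    # MCQ entry that carries its ground truth.
    make_entry(
        "Which planet is known as the Red Planet? A) Venus B) Mars C) Jupiter",
        True,
        label="B",
    ),
    # Open-ended entry stored without a label.
    make_entry("Summarize the water cycle in one sentence.", False),
]

serialized = json.dumps(entries, indent=2)
print(serialized)
```

Keeping the label optional means the same file layout works whether or not you decide to ship ground truths.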