Can we use other LLMs at the training stage?

Hi Organizers,

I would like to understand whether we can use other LLMs (outside the Llama 2 family) during the training stage, specifically for RLHF and data generation.

Below is the original model requirement from the challenge description:

This KDD Cup requires participants to use Llama models to build their RAG solution. Specially, participants can use or fine-tune the following 4 Llama 2 models from

  • llama-2-7b
  • llama-2-7b-chat
  • llama-2-70b
  • llama-2-70b-chat


Hello wufanyou,

Thank you for the question! Yes, this is ok as long as the process follows the ToS/license of the other LLMs.

Best Regards,
The CRAG Team

Does this mean we can use other LLMs to generate the answers, or only to generate data for training? Also, is it allowed to use proprietary model APIs such as GPT-4 or Claude to generate training data?

Same question: what are the exact constraints on the models we can use?

The exact constraints have been specified in the challenge overview and rules.

Below is a copy of “USE OF EXTERNAL RESOURCES” from the challenge overview:

By only providing a small development set, we encourage participants to exploit public resources to build their solutions. However, participants should ensure that the used datasets or models are publicly available and equally accessible to use by all participants. Such a constraint rules out proprietary datasets and models by large corporations. Participants are allowed to re-formulate existing datasets (e.g., adding additional data/labels manually or with Llama models), but award winners are required to make them publicly available after the competition.

For more details, please refer to the challenge rules on AIcrowd (Meta Comprehensive RAG Benchmark: KDD Cup 2024, Challenge Rules).

Best Regards,
The CRAG Team