Can we use a dataset generated from scratch with an LLM?

The explanation states: “We would like to highlight that all solutions submitted to this challenge should be based on resources (e.g. datasets and models) that are publicly available. Submissions should not contain proprietary data or model checkpoints. Participants can paraphrase or extend upon existing datasets (e.g. manual labeling, or labeling/generation with GPT), but should make their extended datasets available after the competition.”

I have a question about this.

Do we necessarily have to base our solutions on publicly available data? Is it acceptable to use synthetic data created from scratch with LLMs such as Mistral? If not, would it be acceptable to use such data if we agree to make it publicly available after the competition?

  1. Is it acceptable to use synthetic data created from scratch with LLMs?
    Yes, if you agree to make it publicly available after the competition.

  2. Would it be acceptable to use such data if we agree to make it publicly available after the competition?
    Yes, if you agree to make it publicly available after the competition.
