I train a model on the training part of MUSDB18, but evaluate it on the test part after each epoch to select the optimal number of epochs. So the training script uses both parts of MUSDB18. Can I submit a model trained this way to leaderboard A?
for systems that were solely trained on the training part of MUSDB18HQ
Yes, that seems fair because your model was trained on the training part of the MUSDB18HQ dataset.
Update: see the clarification from @StefanUhlich in the response below.
your model should go to “Leaderboard B”. Although you are not using the test set directly for training, you use it for early stopping and, hence, you are using it to obtain your model.
For MUSDB18, there is already a split of the “train” songs into “train” and “validation”, which you can find here: https://github.com/sigsep/sigsep-mus-db/blob/master/musdb/configs/mus.yaml#L44. This split works quite well, and you could use it for your early stopping instead of the test set of MUSDB18.
Can we tune hyper-parameters by analyzing the SDRs measured by AICrowd? For example, what if we choose the best checkpoint after submitting top-3 checkpoints? Should it be submitted to A or B? Or is this not allowed?
The challenge rules say, “one for systems that were solely trained on the training part of MUSDB18HQ.”
However, as far as we know, there are two versions of the split in MUSDB18: (86 training, 14 validation, 50 test) and (100 training, 50 test). I am not sure whether this means we must use the predefined (86 training, 14 validation, 50 test) split for leaderboard A. Can we use a customized split of the 100 training tracks to submit a model to leaderboard A, for example 95 training items and five validation items?
thanks for your questions - here are the answers:
Yes, this is fine - this is why we split MDXDB21 (the hidden test set) into three parts, and during the challenge you will only see your scores on 2/3 of this dataset. If the model is trained on MUSDB18 train only, then it doesn’t use any other data and can go to leaderboard A.
Yes, you can split MUSDB18 train in any way that is suitable for you and e.g., train on 95 songs and use 5 songs for validation. The only restriction for leaderboard A is that the training songs as well as the validation songs should come from the 100 songs of MUSDB18(HQ) train.
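For illustration, a custom 95/5 split like the one described above could be sketched as follows. This is only a minimal example, not something prescribed by the challenge: the helper `split_train_valid` and the placeholder track names are hypothetical, and in practice the list would hold the names of the 100 MUSDB18(HQ) train songs.

```python
import random

def split_train_valid(track_names, n_valid=5, seed=0):
    """Return (train, valid) lists drawn from the same pool of tracks."""
    if not 0 < n_valid < len(track_names):
        raise ValueError("n_valid must be between 1 and len(track_names) - 1")
    rng = random.Random(seed)        # local RNG, so the split is reproducible
    shuffled = sorted(track_names)   # sort first so the input order does not matter
    rng.shuffle(shuffled)
    return shuffled[n_valid:], shuffled[:n_valid]

# Hypothetical placeholder names standing in for the 100 MUSDB18(HQ) train songs.
tracks = [f"train_track_{i:03d}" for i in range(100)]
train_tracks, valid_tracks = split_train_valid(tracks, n_valid=5, seed=42)
```

Because both lists come from the same 100 train songs, a model trained and validated this way stays within the leaderboard A restriction.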
Thank you very much!
Is a participant obligated to use early stopping? Can a participant arbitrarily choose the number of epochs? If so, does a participant have to justify their choice?
I don’t have much experience with machine learning. Using the test part seems like a tempting idea to me. Compare these ways:
Training on 86 with validation on 14 and early stopping, then re-training on 100 with the optimal number of epochs (or fine-tuning on 14?)
Training on 100 with validation on 50 and early stopping
The second way is faster and can probably lead to a slightly better model.
Cheating scenario: a participant trains a model with validation on the test part, then removes the validation step from the training script, submits the model to Leaderboard A, and says that the number of epochs was chosen based on intuition / experience / leaderboard.
This is probably not a very important issue. I just share my thoughts.
I agree with @_lyghter.
I think it’s not stated very clearly in the challenge rules. All of my previous submissions were trained in this way, but I didn’t set
true. I believe many participants have also done this without knowing it.
Should these submissions be disqualified from leaderboard A? If so, how?
sorry for my late answer.
To be eligible for leaderboard A, you should only use songs from MUSDB18 train.
We decided on this in order to be fair to other publicly available models which use MUSDB18 test for evaluation (e.g., all models that were submitted to SiSEC 2018). These models can’t use songs from MUSDB18 test and thus would have a disadvantage.
The first one is fine and can go with leaderboard A. The second one uses the test set of MUSDB18 and thus should set the flag that external data was used.
One remark about scheme 1: From experience, the epoch where you should stop your training is different from run to run and it is always a good choice to use (even a small) validation set. As @woosung_choi said, you could store the best five models on this small validation set and submit these models to see which one performed best. Otherwise, you need to make sure to seed everything correctly - but due to multiprocessed dataloading even in this case the best epoch can vary.
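The idea of storing the best five models on a small validation set could be sketched like this. It is only a rough illustration under assumptions: the class `TopKCheckpoints`, the checkpoint file names, and the validation SDR values are all made up for the example and are not taken from any challenge code.

```python
import heapq

class TopKCheckpoints:
    """Keep the k checkpoints with the highest validation score."""

    def __init__(self, k=5):
        self.k = k
        self._heap = []  # min-heap of (score, epoch, path); worst kept entry first

    def update(self, score, epoch, path):
        """Record a checkpoint; return the path that fell out of the top k, if any."""
        entry = (score, epoch, path)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return None
        # Push the new entry and pop the worst one (which may be the new entry itself).
        dropped = heapq.heappushpop(self._heap, entry)
        return dropped[2]

    def best(self):
        """Return the kept (score, epoch, path) entries, best first."""
        return sorted(self._heap, reverse=True)

# Hypothetical validation SDR (dB) per epoch, measured on held-out train songs.
val_sdr = [4.1, 5.0, 5.6, 5.4, 5.9, 6.0, 5.8, 6.1]
keeper = TopKCheckpoints(k=5)
for epoch, sdr in enumerate(val_sdr, start=1):
    keeper.update(sdr, epoch, f"ckpt_epoch{epoch}.pt")
kept_epochs = [epoch for _, epoch, _ in keeper.best()]
```

The path returned by `update` tells the training loop which checkpoint file can be deleted, so at most five files are kept on disk.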
@StefanUhlich, I will train a model on the training part of MUSDB18-HQ without early stopping and submit checkpoints to leaderboard A every day. Is this acceptable?
Yes, that’s fine - you are only using MUSDB18 train and therefore you are eligible for leaderboard A.
@StefanUhlich, you allowed the participants to train models on the full training part of MUSDB18 without early stopping and to set the number of epochs arbitrarily, relying on leaderboard scores.
I think that validation on the test part of MUSDB18 should also be allowed, because the participants can do it secretly and the organizers cannot detect it. I understand that it is not allowed because the models submitted to this challenge should be comparable with other publicly available models. But they are already incomparable because MDX-models can use MDXDB21 to tune hyperparameters.
thanks a lot for your message and your thoughts.
However, for the current competition, we would like to keep it as discussed (only MUSDB18 train can be used for submissions to leaderboard A). This allows these submissions to be compared with other open-sourced models.