Post-challenge discussion

Hey guys,

Since the challenge has ended, I'm just opening this thread to see if anyone is interested in some post-challenge discussion of any thoughts, findings, or things to share from the challenge.

As a starter, let me share some info and thoughts on behalf of our team “JusperLee”. We are from Tencent AI Lab, and this is Yi Luo posting this thread. We participated in both tracks, and here is some info on the systems we submitted.

Model arch: we use, and only use, BSRNN in both MDX and CDX, as this is the model we proposed and we would like to see how it performs compared with all the other possible systems in the wild. We did make some modifications compared with our original version, and we will describe them in future papers.

Data for MDX: we follow the pipeline mentioned in our BSRNN paper, which uses only the 100 training songs in MUSDB18HQ plus 1750 additional unlabeled songs for semi-supervised finetuning.

Data for CDX: things are a little bit tricky here. For Leaderboard A (DnR-only track), we found that the music and effect data sometimes contain speech, which can greatly harm the training of the model, so we used the MDX model above to preprocess the DnR data and remove the “vocal” part from music and effect. We actually did not know whether this was permitted: the rules said that “only DnR data can be used to train the systems”, but we did not find any specific rule clarifying this case. So we simply went through the DnR data, trained the CDX model on the preprocessed DnR-only data, and found a great performance improvement. For Leaderboard B, we added ~10 hrs of cinematic sound effects and ~100 hrs of cinematic BGM (both internal data) to the preprocessed DnR data.
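The preprocessing above amounts to subtracting an estimated vocal signal from each non-dialog stem. A minimal sketch of that idea, where `separate_vocals` is a hypothetical callable wrapping a pretrained vocal-separation model (standing in for the MDX model; the name and interface are assumptions, not the team's actual code):

```python
import numpy as np

def preprocess_stems(stems: dict, separate_vocals) -> dict:
    """Remove residual speech from the 'music' and 'effect' stems.

    `stems` maps stem names to waveform arrays. `separate_vocals` is a
    hypothetical function mapping audio to an estimate of its vocal content.
    The dialog stem is left untouched.
    """
    cleaned = dict(stems)
    for key in ("music", "effect"):
        vocals = separate_vocals(stems[key])  # estimated leaked speech
        cleaned[key] = stems[key] - vocals    # subtract it from the stem
    return cleaned
```

In practice the separation model would run on chunks of audio rather than whole stems, but the subtraction step is the same.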

Some observations and guesses: the interesting thing in the CDX challenge is that, when we used a pretty strong speech enhancement model on our side, the SDR score for the “dialog” track was always below 13 dB. We listened to the model outputs on real dramas and movies and thought the quality was actually pretty good, so we struggled for quite a while over what we should do. One day we randomly tried using our MDX model to extract the “vocal” part to serve as “dialog”, and suddenly the SDR score went to ~15 dB. We know that our MDX model may fail to remove some sound effects or noises when directly applied to speech, but the much better SDR scores made us assume that the “dialog” tracks in the hidden test data, which presumably are collected directly from real movies, contain some noise, as they might have been recorded on the film set instead of in a recording studio (I guess the “dialog” tracks in the demo audio clips were recorded in a studio?). That might explain why our enhancement model is worse than the MDX model here but far better on almost all our other internal test sets. I personally do hope that the organizers can share some information about the evaluation dataset, particularly on whether the tracks are clean (in terms of environmental sounds) or not.

The things we enjoyed:

  • This was the first time we participated in such a source separation challenge, and it was a really action-packed competition, especially in the last week, when we and many other participants were trying our best to improve our scores on the leaderboard.
  • It was good to learn how our models perform on real-world evaluation recordings (real stems or cinematic tracks), which could shed light on future directions for improving our systems.

The things we were confused about:

  • I received an email from the organizers asking whether we could provide an implementation of BSRNN to serve as a baseline for the challenge. We did submit one system and made the entry publicly available (submission #209291), but it seems that the organizers still have not marked it as a baseline like several other baseline models. We also have not received any follow-up on whether the info about this system has been shared with any participants since we submitted it.
  • We have four members in our team and each of us had one account. Although the system stated that each team was able to make 5 submissions per day, we found that each of us could only make 1 submission per day, so 4 in total. We thought that this might be some misinfo in the system, but in the last (extended) week of the challenge we found that many other teams seemed to have up to 10 successful submissions per day (Submission Times - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum). We also found one thread where the AIcrowd team mentioned that “The submission quotas are checked against the number of submissions made by your (or any of your team members) in the last 24 hour window” (No Submission Slots Remaining - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum), and another saying that “Hence we’ll be increasing the number of submissions per day from 5 to 10, starting from Monday - April 3rd onwards. This increase will only be valid for a week, and the submission slots will be reduced back to 5 per day from April 10th onwards” (Phase 1 scores for new submissions - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum). We were pretty confused about how many submissions each team actually had throughout the challenge: the quota for our team was always 4 in the final month, but it seems that different teams did have different quotas, at least given the information on the submission page (AIcrowd | Cinematic Sound Demixing Track - CDX’23 | Submissions, where one can easily count how many successful submissions a team made in the last 24 hour window).
The response we got from the organizers is that “it is possible that the higher number of submissions occurred during the one-week period when the submission quota was temporarily increased”, but that actually contradicts the announcement above, which says the submission quota went back to 5 after that one-week period, while there could still be more than 5 successful submissions within a 24 hour slot in the final week of the challenge. I don’t know if this is an AIcrowd issue or something else.
  • Our result on the CDX final Leaderboard A has been removed (the others are still there). According to the response from the organizers, this is because “the use of pretrained models is strictly prohibited in the challenge as they may have been trained on datasets not specified in the competition guidelines”, so I think maybe this was indeed not allowed in such limited-data tracks. We would like to apologize if this is common knowledge in challenges, as we indeed do not have much experience with them, but we also hope that it could be clearly stated in the challenge rules. As our result for CDX Leaderboard A is still there, maybe the organizers can also remove that if necessary.

@XavierJ thanks for the discussion. I can respond to one thing, as I was directly involved with it.

I received an email from the organizers asking whether we could provide an implementation of BSRNN to serve as a baseline for the challenge. We did submit one system and made the entry publicly available (Tomasyu / sdx-2023-music-demixing-track-starter-kit · GitLab, submission #209291), but it seems that the organizers still have not marked it as a baseline like several other baseline models. We also have not received any follow-up on whether the info about this system has been shared with any participants since we submitted it.

I was in contact with Jianwei Yu from your team via mail in February. I think by the time the last issues with the submission had been resolved, we had already entered Phase II and you didn’t resubmit (which means the submission disappeared). I then forgot to ping AIcrowd again. So I’m sorry for that.

I will make sure that we rerun the submission so that it appears on the final leaderboard and is marked as a baseline with appropriate links. Is that ok?

Hello @XavierJ. Do you plan to release the best weights for your model for Leaderboard C?

Thanks for sharing your insights @XavierJ.

Some of my observations for the CDX DnR only challenge.

  • Dialog: The dialog stem of the test set seems noisy, even adding the scaled mixture to the outputs of a ‘good’ speech enhancement model improves the score. So ensembling a not-so-good model and a good model performed relatively OK.

  • Music: The Music stem in the DnR dataset feels quite unnatural because of the abrupt endings and complete silences in between. I guess in a realistic movie clip, usually the background music will start and end with some fading.

  • Effects: The Effects stem itself seems fine in the DnR dataset, however, I think in most movies effects are almost always overlapped with some background music and rarely occur in complete silence.

  • Local Validation: This was by far the hardest thing to get right, since the validation score on the DnR validation set was completely out of tune with the test set here. Especially for the dialog stem: it was so baffling to see really good validation scores and perceptually such good speech enhancement, yet not even reach 5 dB SDR on the leaderboard. I had almost given up when I happened to submit a worse validation model (with audible interferences) and got a better score.
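The “adding the scaled mixture” trick from the Dialog bullet above can be sketched as a simple blend. Here `ensemble_dialog` and the weight `alpha` are hypothetical names, not values from the post; the observation is just that a slightly “noisier” estimate can score better under SDR than an aggressively enhanced one:

```python
import numpy as np

def ensemble_dialog(enhanced: np.ndarray, mixture: np.ndarray,
                    alpha: float = 0.1) -> np.ndarray:
    """Blend a small amount of the raw mixture back into the enhanced dialog.

    `alpha` controls how much of the mixture leaks back in; alpha=0 returns
    the enhanced signal unchanged.
    """
    return enhanced + alpha * mixture
```

The same idea extends to ensembling two models: average (or weight) the outputs of a conservative model and an aggressive one.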

To this end, I removed the absolute silences in the “Music” stem in the dataset and merged the segments with cross-fading. I also removed silences between the dialogs, and left the effects stem as it is. So when I create training mixes, the effects will almost always overlap with dialog or music, and only a very small percentage of the time be completely on their own.

This strategy gave the best score (3.466) for the Effects stem on leaderboard A.
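Merging the non-silent segments with cross-fades, as described above, might look like the following NumPy sketch. The linear fade and the `fade_len` parameter are assumptions for illustration; silence detection is assumed to have happened already:

```python
import numpy as np

def crossfade_concat(segments, fade_len: int = 1000) -> np.ndarray:
    """Concatenate audio segments with a linear cross-fade.

    Each adjacent pair overlaps by `fade_len` samples: the outgoing segment
    fades out while the incoming one fades in, avoiding the abrupt endings
    the raw DnR music stem has.
    """
    out = segments[0].astype(np.float64)
    fade_out = np.linspace(1.0, 0.0, fade_len)
    fade_in = 1.0 - fade_out
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        # overlap-add the fade region, then append the rest of the segment
        out[-fade_len:] = out[-fade_len:] * fade_out + seg[:fade_len] * fade_in
        out = np.concatenate([out, seg[fade_len:]])
    return out
```

An equal-power (square-root) fade is often preferred over a linear one for uncorrelated material, but the structure is the same.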

Thanks @XavierJ for opening the thread.
I’m Chin-Yun Yu, the team leader of aim-less from C4DM at the Queen Mary University of London.
We are currently organising our codebase and will make it publicly available soon.

Some of our key findings throughout the challenge:

Leaderboard A, CDX

Since the aim of the challenge is hacking the evaluation metric, we used the negative SDR as the loss function, which significantly improves the score by at least 1 dB compared to L1 loss. We also found that the speech seems to always be panned to the centre in the test set: after we used the average of the two predicted channels, the score improved slightly. Our final model consists of an HDemucsV3 and a BandSplitRNN (we implemented it based on the paper and would love to compare it with @XavierJ’s version) that only predicts the music, and the predictions are the average of the two.
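The negative-SDR loss mentioned above is just the evaluation metric with its sign flipped, so minimising it maximises the score directly. A NumPy sketch (the actual training loss would be the differentiable PyTorch equivalent; `eps` is a hypothetical stabiliser against division by zero):

```python
import numpy as np

def neg_sdr_loss(estimate: np.ndarray, reference: np.ndarray,
                 eps: float = 1e-8) -> float:
    """Negative signal-to-distortion ratio.

    SDR = 10 * log10(||ref||^2 / ||ref - est||^2); a perfect estimate gives
    a very large SDR, hence a very negative loss.
    """
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    sdr = 10.0 * np.log10((signal_power + eps) / (error_power + eps))
    return -sdr
```

For example, an estimate at half the reference amplitude leaves an error of half the reference, giving an SDR of 10·log10(4) ≈ 6.02 dB.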

Leaderboard A, MDX

The strategy we used in the end was listening through all the label-noise tracks and labelling the clean tracks for training. We trained an HDemucsV3 with data augmentation on those clean tracks. My teammate @christhetree would probably want to add more details on this. We missed the team-up deadline, but I was in charge of all the submissions in the last two weeks, so we didn’t exploit any extra quota.

Hello @ZFTurbo, unfortunately due to the use of internal data we are not able to release the model weights. We do have a smaller version of the model trained only on MUSDB18HQ, mentioned in the original post above, and @faroit has already coordinated with the organizers to mark it as a baseline on the leaderboard (thanks for the help!). Maybe that could serve as a starting point for others to use their own internal data, either labeled or unlabeled, to work on the model.