Hey guys,
Since the challenge has ended, I'm opening this thread to see if anyone is interested in some post-challenge discussion of thoughts, findings, or other things to share from throughout the challenge.
As a starter, let me share some info and thoughts on behalf of our team “JusperLee”. We are from Tencent AI Lab, and this is Yi Luo posting this thread. We participated in both tracks, and here is some info on the systems we submitted.
Model arch: we use, and only use, BSRNN in both MDX and CDX, as this is the model we proposed and we would like to see how it performs compared with all the other possible systems in the wild. We did make some modifications compared with our original version, and we will describe them in future papers.
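For anyone not familiar with BSRNN, here is a very simplified PyTorch sketch of the core band-split + dual-path idea from the original paper. The band layout, normalization details, the per-band mask estimator, and all of our challenge-specific modifications are omitted, so please treat this as an illustration rather than our actual system:

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Split a complex spectrogram into predefined subbands and project each
    band (real + imaginary parts) to a shared feature dimension."""
    def __init__(self, band_widths, feature_dim):
        super().__init__()
        self.band_widths = band_widths
        self.proj = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(2 * w), nn.Linear(2 * w, feature_dim))
            for w in band_widths
        ])

    def forward(self, spec):                                  # spec: (B, F, T), complex
        out, start = [], 0
        for w, proj in zip(self.band_widths, self.proj):
            band = spec[:, start:start + w]                   # (B, w, T)
            band = torch.cat([band.real, band.imag], dim=1)   # (B, 2w, T)
            out.append(proj(band.transpose(1, 2)))            # (B, T, N)
            start += w
        return torch.stack(out, dim=1)                        # (B, K bands, T, N)

class DualPathBlock(nn.Module):
    """One modeling block: a BLSTM across time within each band, then a BLSTM
    across bands within each frame, each with a residual connection.
    (The full model adds per-band normalization and stacks many such blocks.)"""
    def __init__(self, feature_dim, hidden_dim):
        super().__init__()
        self.time_rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.time_fc = nn.Linear(2 * hidden_dim, feature_dim)
        self.band_rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.band_fc = nn.Linear(2 * hidden_dim, feature_dim)

    def forward(self, x):                                     # x: (B, K, T, N)
        B, K, T, N = x.shape
        t = x.reshape(B * K, T, N)
        x = x + self.time_fc(self.time_rnn(t)[0]).reshape(B, K, T, N)
        b = x.transpose(1, 2).reshape(B * T, K, N)
        x = x + self.band_fc(self.band_rnn(b)[0]).reshape(B, T, K, N).transpose(1, 2)
        return x

# toy usage: 10 bands of 48 bins each on a 480-bin spectrogram
spec = torch.randn(2, 480, 100, dtype=torch.complex64)
feats = BandSplit([48] * 10, feature_dim=128)(spec)           # (2, 10, 100, 128)
feats = DualPathBlock(128, hidden_dim=256)(feats)             # same shape
```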
Data for MDX: we follow the pipeline described in our BSRNN paper, which uses only the 100 training songs in MUSDB18HQ plus 1750 additional unlabeled songs for semi-supervised finetuning.
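Very roughly, the flavor of the semi-supervised step is to run the current model over the unlabeled songs, keep only the segments whose pseudo-stems look reliable, and finetune on those. Below is a deliberately generic sketch of that idea; the separator and the reliability check are placeholder callables, not our actual implementation, so please refer to the BSRNN paper for the real pipeline:

```python
def build_pseudo_labeled_set(songs, separate, is_reliable, segment_len):
    """Run a pretrained separator over unlabeled songs and keep only segments
    whose pseudo-stems pass a reliability check.

    songs       : iterable of 1-D waveform arrays (mixtures)
    separate    : callable, mixture -> {stem_name: estimated waveform}
    is_reliable : callable, (mixture segment, estimated segments) -> bool
    segment_len : segment length in samples
    """
    pseudo = []
    for song in songs:
        estimates = separate(song)
        for start in range(0, len(song) - segment_len + 1, segment_len):
            stop = start + segment_len
            seg_est = {name: est[start:stop] for name, est in estimates.items()}
            if is_reliable(song[start:stop], seg_est):
                pseudo.append((song[start:stop], seg_est))
    return pseudo  # (mixture, pseudo-stems) pairs used for finetuning
```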
Data for CDX: things are a little bit tricky here. For Leaderboard A (the DnR-only track), we found that the music and effect data sometimes contain speech, which can greatly harm the training of the model, so we used the MDX model above to preprocess the DnR data and remove the “vocal” part from the music and effect stems. We actually did not know whether this was permitted, as the rules said that “only DnR data can be used to train the systems”, but we did not find specific rules clarifying this. So we simply went through the DnR data, trained the CDX model on the preprocessed DnR-only data, and found a large performance improvement. For Leaderboard B, we added ~10 hrs of cinematic sound effects and ~100 hrs of cinematic BGM (both internal data) to the preprocessed DnR data.
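In case it helps others, here is a minimal sketch of that preprocessing step, assuming a `separate_vocals` callable that wraps the MDX model; the function names and file layout are placeholders, not our actual code:

```python
import soundfile as sf
from pathlib import Path

def clean_stem(stem_path, separate_vocals, out_path):
    """Remove residual speech from a DnR 'music' or 'effect' stem by
    subtracting the vocal estimate of a music separation model.

    separate_vocals is a placeholder: (waveform, sample_rate) -> vocal
    estimate with the same shape as the input waveform."""
    audio, sr = sf.read(stem_path)
    vocals = separate_vocals(audio, sr)
    sf.write(out_path, audio - vocals, sr)

def clean_dnr_split(dnr_root, separate_vocals, stem_names=("music.wav", "sfx.wav")):
    """Apply the cleaning step to every music/effect stem under a DnR split.
    The stem file names here are placeholders for however the data is laid out."""
    for stem_name in stem_names:
        for path in Path(dnr_root).rglob(stem_name):
            clean_stem(path, separate_vocals, path.with_name("clean_" + path.name))
```

One could equivalently keep the separator's non-vocal output instead of subtracting the vocal estimate from the stem.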
Some observations and guesses: the interesting thing about the CDX challenge is that when we used a pretty strong speech enhancement model on our side, the SDR score for the “dialog” track was always below 13 dB. We listened to the model outputs on real dramas and movies and thought the quality was actually pretty good, so we struggled for quite a while over what we should do. One day we randomly tried using our MDX model to extract the “vocal” part to serve as “dialog”, and suddenly the SDR score went up to ~15 dB. We know that our MDX model may fail to remove some sound effects or noises when applied directly to speech, but the much better SDR scores made us assume that the “dialog” tracks in the hidden test data, which presumably are collected directly from real movies, contain some noise, as they might have been recorded directly on the film set instead of in a recording studio (I guess the “dialog” tracks in the demo audio clips were recorded in a studio?). That might explain why our enhancement model is worse than the MDX model here but far better on almost all of our other internal test sets. I personally do hope that the organizers can share some information about the evaluation dataset, particularly on whether the dialog stems are clean (in terms of environmental sounds) or not.
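To make the reasoning concrete: as far as I understand, the challenge scores with an SNR-style global SDR (10·log10 of reference power over error power); please treat the exact metric definition here as my assumption rather than an official statement. If the reference “dialog” itself contains on-set noise, an estimate that keeps a similar amount of that noise matches the reference more closely, and therefore scores higher, than a heavily denoised one, as this toy example shows:

```python
import numpy as np

def global_sdr(reference, estimate, eps=1e-8):
    """SNR-style SDR: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10 * np.log10((num + eps) / (den + eps))

# toy illustration: a "dialog" reference that still contains on-set noise
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
set_noise = 0.3 * rng.standard_normal(16000)
reference = speech + set_noise            # what the hidden test stems may look like

denoised = speech                         # "perfect" enhancement output
vocal_like = speech + 0.8 * set_noise     # keeps most of the background

print(global_sdr(reference, denoised))    # penalized for removing the noise
print(global_sdr(reference, vocal_like))  # scores higher against the noisy reference
```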
The things we enjoyed:
- This was the first time we participated in such a source separation challenge, and it was a really action-packed competition, especially in the last week when we and many other participants were trying our best to improve our scores on the leaderboard.
- It is good to know how our models perform on real-world evaluation recordings (real stems or cinematic tracks), which could shed light on future directions for improving our systems.
The things we were confused about:
- I received an email from the organizers asking whether we could provide an implementation of BSRNN to serve as a baseline for the challenge. We did submit one system and made the entry publicly available (https://gitlab.aicrowd.com/Tomasyu/sdx-2023-music-demixing-track-starter-kit, submission #209291), but so far it seems that the organizers have not marked it as a baseline like the several other baseline models. We also have not received any follow-up about whether the info about this system has been shared with any participants since we submitted it.
- We have four members in our team and each of us had one account. Although the system mentioned that each team was able to make 5 submissions per day, we found that each of us could only make 1 submission per day, so only 4 in total. We thought this might be some misinfo in the system, but in the last (extended) week of the challenges we found that many other teams might have had up to 10 successful submissions per day (Submission Times - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum). We also found one thread where the AIcrowd team mentioned that “The submission quotas are checked against the number of submissions made by your (or any of your team members) in the last 24 hour window” (No Submission Slots Remaining - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum), and another thread saying that “Hence we’ll be increasing the number of submissions per day from 5 to 10, starting from Monday - April 3rd onwards. This increase will only be valid for a week, and the submission slots will be reduced back to 5 per day from April 10th onwards” (Phase 1 scores for new submissions - Cinematic Sound Demixing Track - CDX’23 - AIcrowd Forum). We were pretty confused about how many submissions each team actually had throughout the challenges, as the quota for our team was always 4 in the final month, but it seems that different teams did have different quotas, at least given the information on the submission page (AIcrowd | Cinematic Sound Demixing Track - CDX’23 | Submissions; one can easily count how many successful submissions a team made in the last 24 hour window). The response we got from the organizers was that “it is possible that the higher number of submissions occurred during the one-week period when the submission quota was temporarily increased”, but that actually contradicts the announcement above, which says the submission quota went back to 5 after that one-week period, while there were still more than 5 successful submissions within a 24 hour window in the final week of the challenge. I don’t know if this is an AIcrowd issue or something else.
- Our result on the CDX final Leaderboard A has been removed (the others are still there). According to the response from the organizers, it is because “the use of pretrained models is strictly prohibited in the challenge as they may have been trained on datasets not specified in the competition guidelines”, so maybe this was indeed not allowed in such limited-data tracks. We would like to apologize if this is common sense in challenges, as we indeed do not have much experience with them, but we also hope that it could be clearly clarified in the challenge rules. As our result for CDX Leaderboard A is still here, maybe the organizers can also remove that if necessary.