As I am currently on vacation, I cannot attend the town hall in person. I made a short video presenting the key novelties of my submission to the challenge. Please ask any questions here; I will try to answer as soon as possible, given that I am still on vacation.
Hi @defossez It’s ok if you can’t answer all of these questions.
How much extra data did you use for leaderboard B?
Could you make Demucs better if there was a longer time limit for predictions?
Did you use model blending for your submissions?
Are you going to keep working on Demucs to try and make it even better?
When can we train our own models on the new Demucs and when can we read your paper about it?
Congrats on the impressive performance in both leaderboards!
Surpassing the IRM means that you’re essentially surpassing the limitations of time-frequency uncertainty in the STFT, right?
For leaderboard B, I used 150 extra tracks. At the end of the competition I realised it was fine to use the test set from MusDB as well, so I fine-tuned on that too (keeping only the validation set held out).
If there were no time limit, I could only marginally improve Demucs, maybe by 0.1 dB. Also, the limiting factor at the moment, especially for the hybrid model, is not runtime but GPU memory during training.
I used a blend of 4 models for my final submissions. For track A, it was a mix of hybrid and non-hybrid models trained with different seeds. For track B, it was all hybrid models trained with different seeds.
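For readers unfamiliar with blending: in its simplest form it just means averaging the per-source waveform estimates of several models. A minimal NumPy sketch (the shapes, the `blend` helper, and the uniform weighting are illustrative assumptions, not the actual submission code):

```python
import numpy as np

def blend(predictions, weights=None):
    """Average per-source waveform estimates from several models.

    predictions: list of arrays, each shaped (sources, channels, samples).
    weights: optional per-model weights; uniform averaging if None.
    """
    preds = np.stack(predictions)  # (models, sources, channels, samples)
    if weights is None:
        weights = np.ones(len(predictions)) / len(predictions)
    w = np.asarray(weights, dtype=preds.dtype).reshape(-1, 1, 1, 1)
    return (w * preds).sum(axis=0)

# Toy usage: two "models" predicting 4 sources of stereo audio.
a = np.zeros((4, 2, 100))
b = np.ones((4, 2, 100))
out = blend([a, b])  # element-wise mean of the two predictions
```

Averaging predictions from identically trained models with different seeds typically reduces estimation variance, which is the main reason blending helps on the leaderboard metric.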
I will probably keep working on Demucs in the longer term, but in the immediate future I will take a break from source separation and mostly work on other deep learning problems. Also, while for the MDX challenge I mostly built on the existing Demucs architecture, I am not sure whether the next iteration will be the same or something completely different.
I will write the paper in September, and most likely release everything in October.
Actually, time-frequency uncertainty is not the issue here. In particular, a complex spectrogram model can also surpass the Ideal Ratio Mask oracle, since any possible output can be represented as a complex spectrogram. The limitation of the IRM is that it masks the input spectrogram, and in particular always reuses the complex phase of the input. This is bad for some instruments, like percussive sounds, where getting the phase wrong makes transients (the attack of a drum, for instance) sound hollow or empty. The phase will be wrong if, for instance, other instruments overlap in the frequency domain (which is likely, because percussive sounds cover all frequencies during the attack), since the input phase is then some blend of the two instruments. Waveform models, or complex spectrogram models (but not masking ones), can actually predict the right phase and overcome the IRM oracle.
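The phase-reuse limitation is easy to see numerically. Below is a toy single-frame sketch in plain NumPy (a single FFT frame rather than a full STFT, and one common squared-magnitude definition of the IRM; none of this is the challenge's actual oracle code): the masked estimate keeps exactly the mixture's phase, whatever the clean source's phase was.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)

# Two overlapping sources: a tone and broadband noise (drum-like spectrum).
s1 = np.sin(2 * np.pi * 0.05 * t)
s2 = rng.standard_normal(n)
mix = s1 + s2

S1, S2, MIX = np.fft.rfft(s1), np.fft.rfft(s2), np.fft.rfft(mix)

# Ideal Ratio Mask for source 1 (squared-magnitude variant).
irm = np.abs(S1) ** 2 / (np.abs(S1) ** 2 + np.abs(S2) ** 2 + 1e-12)

# Masking scales the mixture spectrum by a non-negative real number,
# so the magnitude changes but the phase is untouched.
est = irm * MIX

# est has the *mixture* phase in every bin, not the clean source phase.
active = irm > 1e-8
```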