In the dynamic world of audio research, Team Aim-less, based at the Centre for Digital Music, Queen Mary University of London, has carved out a niche of its own. Led by Chin-Yun Yu, the team's participation in AIcrowd's Sound Demixing Challenge was a remarkable blend of skill and creative thinking. This blog delves into their journey and the insights they gained en route to clinching the top spot on Leaderboard A in the Cinematic Sound Demixing Track. Here, we'll explore their motivations for participating, the challenges they faced, and the creative solutions they implemented. We'll also unpack the technical intricacies of their winning strategy, offering a detailed glimpse into how their solution was developed.
Discovering the Challenge
The team stumbled upon this opportunity at ISMIR 2022, recognizing it as the perfect arena to hone and apply their diverse skills in music and signal processing. Not all members were audio source separation specialists, but this diversity became their strength, allowing them to blend a multitude of ideas from various domains. The team’s collective goal was clear: to explore new methodologies in audio separation and enrich their research portfolio.
The Challenge: Beyond Music to Cinematic Sound
The Sound Demixing (SDX) Challenge, a sequel to the Music Demixing (MDX) challenge organized by giants like Sony and Meta, offered two distinct tracks: Music Demixing (MDX) and Cinematic Sound Demixing (CDX). Participants were tasked with dissecting complex audio signals into their individual components, a problem with applications ranging from karaoke systems to up-mixing old movies for spatial audio.
Technical Mastery: The Aim-less Approach
Team Aim-less developed their models with PyTorch and PyTorch Lightning, leveraging LightningCLI so that training different models was easily configurable from the command line and YAML config files. To streamline their process, they kept their training code in a repository separate from their submission repository on AIcrowd GitLab. This strategic separation allowed them to concentrate exclusively on model development.
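For readers unfamiliar with LightningCLI, a minimal sketch of such an entry point might look like the following. The module and class names here are illustrative stand-ins, not the team's actual code:

```python
# train.py -- minimal LightningCLI entry point (illustrative only)
from pytorch_lightning.cli import LightningCLI

from models import SeparationLitModule  # hypothetical LightningModule wrapper
from data import DnRDataModule          # hypothetical LightningDataModule

def main():
    # LightningCLI exposes every constructor argument of the model and data
    # module as a CLI/YAML option, so switching architectures or
    # hyperparameters only requires a different config file, e.g.:
    #   python train.py fit --config configs/hdemucs.yaml
    LightningCLI(SeparationLitModule, DnRDataModule)

if __name__ == "__main__":
    main()
```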
Understanding the need for efficient evaluation, they packaged a subset of their training code as an installable Python package. Once added to the environment requirements, this package let the submission runner load the model checkpoints directly for evaluation. This approach streamlined their workflow and kept the training and submission code in sync.
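Concretely, once the training package is pip-installed into the submission environment, loading a checkpoint reduces to a standard PyTorch Lightning call. The package and class names below are hypothetical:

```python
# The training code is installed as a package in the submission environment
# (e.g. listed in requirements.txt), so the same model class used during
# training is importable at evaluation time. Names are hypothetical.
from aimless_train.models import SeparationLitModule

# Lightning checkpoints carry their hyperparameters, so the model can be
# reconstructed from the checkpoint file alone:
model = SeparationLitModule.load_from_checkpoint("checkpoints/hdemucs.ckpt")
model.eval()
```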
Overcoming the Stereo-Mono Disparity
In the Cinematic Sound Demixing (CDX) challenge, the test data was stereo, while the training data from the DnR dataset was mono. Team Aim-less initially experimented with basic data augmentation methods, such as random panning and widening, to turn the mono data into stereo. However, they found these methods performed no better than simply training a mono model and running source separation on each channel independently.
A key observation the team made was that dialogue in the test data predominantly appeared centered, so the side signal, obtained by subtracting the right channel from the left, contained minimal dialogue. This insight was crucial in shaping their solution strategy.
They trained a Hybrid Demucs model on the DnR dataset and applied it to several views of the test data: the left and right channels, the mid channel (the average of the left and right channels), and the side signal. The final prediction was a weighted linear combination of these mono separations together with the side signal of the input mixture, and the team tuned the combination coefficients against their performance on Leaderboard A.
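To make this concrete, here is a rough sketch of the pipeline, assuming a mono separator that maps a single channel to per-source estimates. The function name and all weights are placeholders (the real coefficients were tuned on Leaderboard A), and the way the raw side signal re-enters the output is one plausible reading of the description, not the team's published code:

```python
import torch

def separate_stereo(model, mixture, w=(0.25, 0.25, 0.5, 0.0), side_gain=1.0):
    """Sketch of the stereo strategy described above.

    model: mono separator mapping (num_samples,) -> (num_sources, num_samples)
    mixture: stereo tensor of shape (2, num_samples)
    """
    left, right = mixture[0], mixture[1]
    mid = 0.5 * (left + right)   # centered content, i.e. most of the dialogue
    side = 0.5 * (left - right)  # contains minimal dialogue (one common
                                 # mid/side convention)

    # Mono separation of each view of the stereo mixture
    sep_l, sep_r = model(left), model(right)
    sep_m, sep_s = model(mid), model(side)

    # Weighted linear combination of the mono separations per output channel...
    base_l = w[0] * sep_l + w[2] * sep_m + w[3] * sep_s
    base_r = w[1] * sep_r + w[2] * sep_m + w[3] * sep_s

    # ...plus/minus the raw side signal of the input mixture to restore
    # stereo width (broadcast over the source dimension).
    out_l = base_l + side_gain * side
    out_r = base_r - side_gain * side
    return torch.stack([out_l, out_r], dim=1)  # (num_sources, 2, num_samples)
```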
Furthermore, Team Aim-less enhanced their solution by training a BandSplitRNN model, specifically targeting the separation of background music, a move aimed at boosting the Signal-to-Distortion Ratio (SDR) score. The ultimate prediction for the background music was derived from averaging the outputs from both the Hybrid Demucs and BandSplitRNN models. This multi-faceted approach underscored their technical ingenuity and adaptability in addressing the challenge’s complexities.
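The ensemble step for the music stem is then a simple equal-weight average; with hypothetical per-model output dictionaries, it is a one-liner:

```python
# Equal-weight ensemble of the two models' music estimates
# (the output dictionaries and key are hypothetical)
music_pred = 0.5 * (demucs_out["music"] + bandsplit_out["music"])
```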
The Final Symphony
In tackling the MDX Label Noise challenge, Team Aim-less made a critical discovery within the provided corrupted dataset: a subset of it was, in fact, clean. They manually labeled this clean subset and trained a Hybrid Demucs model exclusively on it, a selective approach that was crucial in preserving the quality and integrity of their training process.
To further enhance the diversity and robustness of their training data, they employed several techniques: randomly mixing stems drawn from different segments of different songs, applying random effects augmentation, and using gradient accumulation. The first two diversified the training data, while gradient accumulation increased the effective batch size, which is pivotal for stable model training (see the sketch below).
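A minimal sketch of the random-mixing idea, under the assumption that each source type has its own pool of stem waveforms, might look like this (the team's exact pipeline is not shown in the post):

```python
import random
import torch

def random_mix(stem_pools, segment_len):
    """Random-mixing augmentation: draw each stem from a different song at a
    random offset and sum them into a new mixture. `stem_pools` is assumed to
    be one list of waveform tensors per source (e.g. vocals, bass, drums,
    other), each track longer than `segment_len`. A sketch, not the team's
    exact code."""
    stems = []
    for pool in stem_pools:
        track = random.choice(pool)  # shape: (channels, num_samples)
        start = random.randint(0, track.shape[-1] - segment_len)
        stems.append(track[..., start:start + segment_len])
    stems = torch.stack(stems)   # (num_sources, channels, segment_len)
    mixture = stems.sum(dim=0)   # the drawn stems themselves are the targets
    return mixture, stems

# Gradient accumulation, the other trick mentioned above, is a single flag in
# PyTorch Lightning (import pytorch_lightning as pl) and multiplies the
# effective batch size without extra GPU memory:
#   trainer = pl.Trainer(accumulate_grad_batches=8)
```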
In hindsight, however, the team acknowledged that they should have started training this model earlier in the process. The evaluation score kept improving throughout the challenge, suggesting the model was still underfitted and had room for further optimization. This reflection highlights the team's ongoing commitment to learning and adapting their strategies for better performance.
Reflecting on the Journey
The journey of Aim-less in the Sound Demixing Challenge is a narrative of innovation, collaboration, technical proficiency, and the relentless pursuit of excellence. Their story stands as an inspiration to the AIcrowd community, showcasing the power of diverse expertise coming together to tackle complex problems.
As we look forward to more such challenges and innovations, we invite you to be a part of this exciting journey. Try the Commonsense Persona-Grounded Dialogue Challenge 2023, which pushes the boundaries of natural conversation understanding with AI. Whether you are a student, a researcher, or a professional, AIcrowd offers a platform to test your skills, grow your knowledge, and be part of a community driving technological innovation. Join us in these explorations, where your skills and creativity can help advance the field of AI.