Nabarun Goswami’s work in AI and audio research reflects sustained commitment and an inventive approach. From his early projects at Sony to his graduate research at The University of Tokyo, he has made notable contributions, especially to music and speech separation.
The Road to AI Mastery
Nabarun’s path in AI began at Sony, with projects ranging from image captioning to symbolic music generation. His transfer to the audio department prompted a deeper dive into AI/ML research. A significant step was the SiSEC MUS 2018 challenge, where he achieved top results and co-authored three papers. He then moved to The University of Tokyo to pursue a Master’s and a Ph.D. in Machine Intelligence, focusing on self-supervised representation learning for speech and audio.
Stepping into the Sound Demixing Arena
For Nabarun, entering the AIcrowd Sound Demixing Challenge was a natural next step. The challenge’s focus on audio source separation matched his expertise and research interests, and it offered him a chance to test his skills in a competitive setting while contributing to a community that had played a crucial role in his professional growth.
Technical Mastery: Crafting the Winning Solution
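Nabarun’s approach to the Music Demixing (MDX) and Cinematic Demixing (CDX) leaderboards combined strategic planning with technical skill: a small set of complementary model architectures, careful data cleaning, and weighted blending of model outputs.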
On MDX Leaderboard A, Nabarun trained two distinct models: Wavelet-HTDemucs with Noise Robust Loss V1 and DWT-Transformer-UNet with Noise Robust Loss V2. To clean the Labelnoise dataset, he scored each stem by its Signal-to-Distortion Ratio (SDR), selected the stems scoring above 9 dB, and manually verified a subset of them for quality. Lightweight BSRNN models trained on this refined dataset played a crucial role, particularly in improving the vocal stems.
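The SDR-based filtering step can be illustrated with a short sketch. This is not Nabarun’s code: it assumes each stem is scored by the SDR between the provided (possibly noisy) stem and a separation model’s estimate of it, using the simple global SDR form 10 log10(Σs² / Σ(s − ŝ)²) with a small epsilon, and it represents audio as plain lists of floats. The helper names `sdr` and `select_clean_stems` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sdr(reference: Sequence[float], estimate: Sequence[float], eps: float = 1e-7) -> float:
    """Signal-to-Distortion Ratio in dB: 10*log10((sum ref^2 + eps) / (sum (ref - est)^2 + eps))."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def select_clean_stems(
    stems: Dict[str, Sequence[float]],
    estimates: Dict[str, Sequence[float]],
    threshold_db: float = 9.0,
) -> List[str]:
    """Return the names of stems whose SDR against a model estimate exceeds threshold_db.

    Assumption (hypothetical scoring scheme): a high SDR means the provided stem
    agrees with the model's estimate, so its label is less likely to be noisy.
    """
    return sorted(
        name for name in stems if sdr(stems[name], estimates[name]) > threshold_db
    )
```

Stems that pass the threshold would then be candidates for the manual spot-check described above.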
For MDX Leaderboard B, he again used the Wavelet-HTDemucs and DWT-Transformer-UNet models. The final submission for this leaderboard was a weighted blend of the two models’ outputs.
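A weighted blend of this kind is commonly a per-sample weighted average of the models’ estimates. The sketch below assumes both models return time-aligned waveforms of equal length for the same stem; the blog does not give the actual weights, so any values passed in are hypothetical, and `blend` is an illustrative helper name.

```python
from typing import List, Sequence


def blend(outputs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Per-sample weighted average of several models' estimates of the same stem."""
    if not outputs or len(outputs) != len(weights):
        raise ValueError("need one weight per model output")
    length = len(outputs[0])
    if any(len(out) != length for out in outputs):
        raise ValueError("all outputs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight sum so the blend stays on the same scale as the inputs.
    return [
        sum(w * out[i] for w, out in zip(weights, outputs)) / total
        for i in range(length)
    ]
```

Normalizing by the weight sum means the weights need not add up to one. In practice they could differ per stem and be tuned on a validation set, though the source does not say how Nabarun chose his.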
On CDX Leaderboard A, Nabarun’s dataset preprocessing stood out. He removed silent regions from the dialogue and music stems, recombined the remaining segments with cross-fading, and then trained two models: DWT-Transformer-UNet with L1 loss and BSRNN with L1 loss. The final submission blended these models’ outputs, with particular attention to dialogue clarity.
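The silence-removal step can be sketched as follows, under stated assumptions. The blog does not say how silence was detected, so this version uses a simple frame-level RMS threshold, then joins the surviving segments with a linear cross-fade. The frame length, threshold, fade length, and the helper names `non_silent_segments` and `crossfade_concat` are all hypothetical.

```python
import math
from typing import List, Sequence


def non_silent_segments(
    signal: Sequence[float], frame_len: int, threshold_db: float
) -> List[List[float]]:
    """Split a signal into runs of consecutive frames whose RMS level exceeds threshold_db (dBFS)."""
    segments: List[List[float]] = []
    current: List[float] = []
    for start in range(0, len(signal), frame_len):
        frame = list(signal[start:start + frame_len])
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level_db = 20.0 * math.log10(rms + 1e-12)  # epsilon avoids log10(0) on digital silence
        if level_db > threshold_db:
            current.extend(frame)  # loud frame: extend the open segment
        elif current:
            segments.append(current)  # silent frame: close the open segment
            current = []
    if current:
        segments.append(current)
    return segments


def crossfade_concat(segments: Sequence[Sequence[float]], fade_len: int) -> List[float]:
    """Join segments end to end, overlapping each pair by up to fade_len samples with a linear cross-fade."""
    if not segments:
        return []
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(fade_len, len(out), len(seg))
        for i in range(n):
            fade_in = (i + 1) / (n + 1)  # ramps from near 0 to near 1 across the overlap
            j = len(out) - n + i
            out[j] = out[j] * (1.0 - fade_in) + seg[i] * fade_in
        out.extend(seg[n:])
    return out
```

Cross-fading the joins avoids the clicks that abrupt cuts between segments would otherwise introduce.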
Overcoming Data Challenges
A significant hurdle on Open Leaderboard C of the music demixing challenge was acquiring high-quality training data. Nabarun combined private stems with open data for training, a strategy that earned a respectable seventh place but also highlighted the need to start model training early and to train extensively.
Reflecting on the Journey
Nabarun Goswami’s journey in the Sound Demixing Challenge is a story of persistence, innovation, and technical proficiency, and an inspiration for the AIcrowd community.
Explore the knowledge shared by our winners! Their released models and source code are listed in the “Notes” section of our MDX track and CDX track papers, and the teams’ model announcements on the discussion forum offer further insights. These resources are valuable learning material for anyone eager to understand what goes into a winning solution.
As we continue to unveil new challenges like the Commonsense Persona-Grounded Dialogue Challenge 2023, we invite you to join us. This challenge is a playground for testing and expanding your skills in natural conversation understanding using AI. Join us in this exciting exploration of AI possibilities, where your contributions can help shape the future of technology.