I have a question regarding the class label definitions of the sound effects and music classes for movies. I am aware that defining what counts as music, for example, is very hard and can be vague, but I guess some rules were used to create the hidden test set, and I think it would be helpful if these rules could be revealed to all participants (if available). Basically, I would find two clarifications helpful:
1. Discriminating effects from music: for the DnR dataset, speech and musical instrument clips are filtered out from the effects class to avoid confusion. Was this also done for the hidden test set?
2. Discriminating music from dialogue: for the DnR dataset, music with singing voice was filtered out from the music class to avoid confusion. Was this also done for the hidden test set?
Hi Fabian,
I’m Masato Hirano - one of the organizers of track 2, CDX’23.
Thank you for asking about the discrimination of effects and music.
We manually checked all test samples one by one and filtered out those containing speech (or singing voice) inside non-dialogue stems. Thus, human voice appears only in the dialogue stem (questions 1 and 2).
As for musical instruments in the effect track (question 1), there are no “acoustic” musical instruments inside the effect stem. As in the case of speech, we dropped obvious musical instruments from the effect stem.
I should note one point: although some “tonal synthetic” sounds (e.g., a high-frequency beep) appear inside the effect stem (and some may feel they sound like an instrument), they should be categorized as effects.