Improving singing voice transcription generalization with AI-generated accompaniments

Published in Proceedings of the 31st Conference of Multimedia Modeling, 2025

Recommended citation: Perez, Miguel & Kirchhoff, Holger & Grosche, Peter & Serra, Xavier (2025) "Improving singing voice transcription generalization with AI-generated accompaniments," Proceedings of the 31st Conference of Multimedia Modeling, Nara (Japan). https://link.springer.com/chapter/10.1007/978-981-96-2061-6_9

Singing voice transcription is a popular task in music information retrieval (MIR) that consists of estimating the notes sung in a given excerpt of music audio. Data is essential for state-of-the-art methods, but annotating it is labor-intensive and requires musical expertise. Moreover, in genres such as pop music, sharing such data is even more challenging due to copyright and data distribution rights. In this paper, we present a data augmentation technique that leverages AI-generated music audio to alleviate these data-related difficulties. Specifically, we generate the music accompanying vocals for which the target notes are already known. In this way, we create new mixes that preserve the harmony of the original piece while providing large variations in the audio content. We conducted a set of cross-dataset experiments and found that our proposed approach with harmonically matching mixes yields the best results compared to other augmentation strategies. The tools employed for this augmentation are open-source and ready to be used by the research community.
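The core idea — remixing an annotated vocal stem with a newly generated accompaniment so the note labels stay valid — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the target-SNR mixing strategy, and the synthetic stand-in signals are all assumptions for demonstration.

```python
import numpy as np

def mix_with_accompaniment(vocals: np.ndarray, accompaniment: np.ndarray,
                           snr_db: float) -> np.ndarray:
    """Mix a vocal stem with a generated accompaniment at a target SNR.

    The vocal note annotations remain valid because only the background
    (the accompaniment) changes between augmented mixes.
    """
    # Match the accompaniment length to the vocal stem.
    n = len(vocals)
    acc = np.resize(accompaniment, n)

    # Scale the accompaniment so the vocals sit at the requested SNR (dB).
    v_pow = np.mean(vocals ** 2) + 1e-12
    a_pow = np.mean(acc ** 2) + 1e-12
    gain = np.sqrt(v_pow / (a_pow * 10.0 ** (snr_db / 10.0)))
    mix = vocals + gain * acc

    # Peak-normalize only if the sum would clip.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Example with synthetic tones standing in for real stems.
sr = 16000
t = np.arange(sr) / sr
vocals = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # stand-in annotated vocal stem
accomp = 0.3 * np.sin(2 * np.pi * 110.0 * t)   # stand-in generated accompaniment
mix = mix_with_accompaniment(vocals, accomp, snr_db=0.0)
```

In practice, each annotated vocal stem could be mixed with several different generated accompaniments (and at varying levels) to multiply the effective training data while keeping the ground-truth notes unchanged.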