Singing Voice Accompaniment Data Augmentation with Generative Models
Published in 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 2025
Recommended citation: Perez, Miguel; Kirchhoff, Holger; Grosche, Peter; Serra, Xavier (2025). "Singing Voice Accompaniment Data Augmentation with Generative Models," 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Hyderabad, India, pp. 1-5. https://ieeexplore.ieee.org/document/11011167
Singing voice transcription is a key task in Music Information Retrieval (MIR) that focuses on identifying the sung notes within a music audio segment. Advancing state-of-the-art methods in this area relies heavily on high-quality annotated data, yet producing such annotations is resource-intensive and requires musical expertise. In genres like pop music, data sharing is further complicated by copyright and distribution restrictions. In this paper, we refine a recently proposed data augmentation technique that leverages AI-generated music audio to address these data-related challenges. Specifically, we create musical accompaniments for vocals with known target notes, enabling the generation of new mixes that retain the original piece's harmony while introducing substantial audio variation. Our cross-dataset experiments show that harmony-matched mixes improve generalization, though performance remains below that achieved by training with additional real data.
