Lyrics transcription of polyphonic music is challenging as the background
music affects lyrics intelligibility. Typically, lyrics transcription is
performed by a two-step pipeline, i.e., a singing vocal extraction frontend
followed by a lyrics transcriber backend, where the frontend and backend are
trained separately. Such a two-step pipeline suffers from both imperfect vocal
extraction and a mismatch between the frontend and backend. In this work, we propose
a novel end-to-end integrated training framework, which we call PoLyScriber, to
globally optimize the vocal extractor frontend and lyrics transcriber backend
for lyrics transcription in polyphonic music. The experimental results show
that our proposed integrated training model achieves substantial improvements
over the existing approaches on publicly available test datasets.