PSYCHOACOUSTIC OPTIMIZATION OF THE VQ-VAE AND TRANSFORMER ARCHITECTURES FOR HUMAN-LIKE AUDITORY PERCEPTION IN MUSIC INFORMATION RETRIEVAL AND GENERATION TASKS
Despite remarkable advances in learning-based (AI) architectures in the natural
language and image domains, their applicability to music has remained
limited. In fact, the performance of state-of-the-art Automated
Music Transcription (AMT) systems has seen only marginal improvements from
novel AI architectures. Moreover, the importance of psychoacoustic perception and
its incorporation into music information retrieval (MIR) systems has mostly remained
unaddressed, leading to shortcomings in current approaches. This thesis provides an
overview of music processing and novel neural architectures, investigates the reasons
behind their subpar performance on MIR tasks, and proposes several adjustments to
both the music (data-related) pre-processing pipelines and a
psychoacoustically adjusted transformer-based model to improve the
performance on MIR and AMT tasks. In particular, a new music transformer architecture
is proposed, and various music pre-processing algorithms for psychoacoustic
optimization are implemented, along with several adaptive models aimed at
addressing the missing factor of human music perception modeling. Preliminary
performance results are promising, warranting continued investigation
of transformer architectures for music information retrieval applications.
Several intriguing insights uncovered during the research are also presented.
The thesis concludes by outlining a set of promising future research directions,
paving the way for further advancements in the field of music information
retrieval and generation using the proposed architectures.