PSYCHOACOUSTIC OPTIMIZATION OF THE VQ-VAE AND TRANSFORMER ARCHITECTURES FOR HUMAN-LIKE AUDITORY PERCEPTION IN MUSIC INFORMATION RETRIEVAL AND GENERATION TASKS

Abstract

Despite remarkable advances in the use of learning-based (AI) architectures in the natural language and image domains, their applicability to music has remained limited. In fact, the performance of state-of-the-art Automatic Music Transcription (AMT) systems has seen only marginal improvements from novel AI architectures. Moreover, the importance of psychoacoustic perception and its incorporation into MIR systems has mostly remained unaddressed, leading to shortcomings in current approaches. This thesis provides an overview of music processing and novel neural architectures, investigates the reasons behind the subpar performance achieved when they are applied to music information retrieval (MIR) tasks, and proposes adjustments to both the music (data-related) pre-processing pipelines and a psychoacoustically adjusted transformer-based model to improve performance on MIR and AMT tasks. In particular, a new music transformer architecture is proposed, and several music pre-processing algorithms for psychoacoustic optimization are implemented, along with adaptive models aimed at addressing the missing factor of modeling human music perception. Preliminary performance results are promising, warranting continued investigation of transformer architectures for music information retrieval applications. Several intriguing insights uncovered during the research are discussed and presented. The thesis concludes by delineating a set of promising future research directions, paving the way for further advancements in music information retrieval and generation using the proposed architectures.