Singing voice separation with deep U-Net convolutional networks
The decomposition of a music audio signal into its vocal and backing track components is analogous to image-to-image translation, where a mixed spectrogram is transformed into its constituent sources. We propose a novel application of the U-Net architecture, initially developed for medical imaging, to the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction. Through both quantitative evaluation and subjective assessment, experiments demonstrate that the proposed algorithm achieves state-of-the-art performance.
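The core idea, treating separation as predicting a soft mask over the mixture spectrogram, can be illustrated with a minimal numpy sketch. The mask here is an oracle ratio mask computed from known sources; in the paper the U-Net learns to predict such a mask from the mixture alone, and the shapes and data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms (freq_bins x time_frames) for two sources.
vocals = rng.random((513, 100))
backing = rng.random((513, 100))
mixture = vocals + backing

# A soft "ratio mask" in [0, 1]; the U-Net would *predict* this from the mixture.
mask = vocals / (mixture + 1e-8)

# Applying the mask to the mixture yields the vocal estimate.
vocals_est = mask * mixture

# With an oracle mask the reconstruction is numerically exact.
error = np.max(np.abs(vocals_est - vocals))
```

The learned mask is then applied to the mixture's complex STFT before inverting back to audio, which is why preserving fine spectral detail matters.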
Predicting and Composing a Top Ten Billboard Hot 100 Single with Descriptive Analytics and Classification
In late 20th- and early 21st-century Western popular music, there are cyclical structures, sounds, and themes that come and go with historical trends. Not only do the production techniques reflect technological advancements (the Yamaha DX7, the Roland 808, etc.), the art form also reflects contemporary cultural attitudes through lyrics and stylistic choices. Through this lens, pop songs can serve as historical artifacts for their unique ability to captivate listeners based on their generally acceptable and familiar elements, both upon release and with future audiences. This raises the questions: "Can a chronological analysis of artistic choices reveal trends in songwriting and popular music composition?"; "Based on the collected analysis, could forecast data suggest criteria that a future hit song may fit?"; and "How could the next 'hit song' sound, based on the criteria calculated from trend analysis and forecasting techniques?" By manually listening to and analyzing Billboard songs from each of the last 50 years and employing an assortment of feature selection and classification techniques, a random forest model predicts some of the significant characteristics of a potential future hit song. This prediction provided the framework for an original composition.
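A random forest of this kind can be sketched in a few lines of scikit-learn. The feature names and the synthetic labels below are hypothetical stand-ins, not the study's actual annotated Billboard data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical per-song features: tempo, duration, loudness, danceability.
n_songs = 400
X = rng.random((n_songs, 4))
# Synthetic "hit" label: here a song hits when loudness and danceability are high.
y = ((X[:, 2] > 0.4) & (X[:, 3] > 0.5)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Feature importances hint at which characteristics drive the prediction,
# which is the kind of output that can guide an original composition.
importances = model.feature_importances_
```

In this toy setup the importances concentrate on the two features that define the label, mirroring how the study reads significant characteristics off the trained forest.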
Final Research Report on Auto-Tagging of Music
The deliverable D4.7 concerns the work achieved by IRCAM until M36 on the "auto-tagging of music". The deliverable is a research report. The software libraries resulting from the research have been integrated into the Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5.
The research work on auto-tagging has concentrated on four aspects:
1) Further improving IRCAM's machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by integrating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train HearDis! "soft" features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3.
2) Developing two sets of "hard" features (i.e. features related to musical or musicological concepts) as specified by HearDis! (for integration into the Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key, or chord succession) or obtained by developing new signal processing algorithms (such as HPSS or main melody estimation). This is described in Part 4.
3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or the decade of a recording, and serves to ensure that playlists contain tracks with similar audio quality. This is described in Part 5.
4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various blind audio source separation algorithms and convolutional neural networks) have been developed for singing voice separation, singing voice segmentation, music structure boundary estimation, and DJ cue-region estimation. This is described in Part 6. EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC D
Repertoire-Specific Vocal Pitch Data Generation for Improved Melodic Analysis of Carnatic Music
Deep Learning methods achieve state-of-the-art results in many tasks, including vocal pitch extraction. However, these methods rely on the availability of error-free pitch track annotations, which are scarce and expensive to obtain for Carnatic Music. Here we identify the tradition-related challenges and propose tailored solutions to generate a novel, large, and open dataset, the Saraga-Carnatic-Melody-Synth (SCMS), comprising audio mixtures and time-aligned vocal pitch annotations. Through a cross-cultural evaluation leveraging this novel dataset, we show improvements in the performance of Deep Learning vocal pitch extraction methods on Indian Art Music recordings. Additional experiments show that the trained models outperform the currently used heuristic-based pitch extraction solutions for the computational melodic analysis of Carnatic Music, and that this improvement leads to better results in the musicologically relevant task of repeated melodic pattern discovery when evaluated using expert annotations. The code and annotations are made available for reproducibility. The novel dataset and trained models are also integrated into the Python package compIAM, which allows them to be used out of the box.
An automatic annotation system for audio data containing music
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (leaves 51-53). By Janet Marques, S.B. and M.Eng.
Vocal Detection: An evaluation between general versus focused models
This thesis presents a technique for improving current vocal detection methods. One of the most popular methods employs a statistical approach in which vocal signals are distinguished automatically by first training a model on both vocal and non-vocal example data, then using this model to classify audio signals as vocal or non-vocal. The problem with this method is that the trained model is typically very general, doing its best to classify many different types of data. Since the audio signals containing vocals that we care about are songs, we propose to improve vocal detection accuracy by creating focused models targeted at predicting vocal segments according to song artist and artist gender. Useful information such as artist name is often overlooked; this restricts opportunities to process songs in ways specific to their type and hinders potential success. Experimental results with several models built according to artist and artist gender reveal improvements of up to 17% compared to the general approach. With such improvements, applications such as automatic real-time lyric synchronization to vocal segments may become achievable with greater accuracy.
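The general-versus-focused contrast can be demonstrated with a toy scikit-learn sketch. The features, groups, and decision rule below are synthetic stand-ins (the "group" plays the role of artist gender), chosen only so that one general model underfits where per-group models do not:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic frame features; "group" stands in for artist or artist gender.
n = 600
group = rng.integers(0, 2, n)
x = rng.normal(size=(n, 2))
# The vocal/non-vocal boundary differs by group, so one model fits both poorly.
y = np.where(group == 0, x[:, 0] > 0.5, x[:, 0] < -0.5).astype(int)

# General model: trained on everything at once.
general_acc = LogisticRegression().fit(x, y).score(x, y)

# Focused models: one per group, mirroring the thesis's artist-specific idea.
focused_acc = np.mean([
    LogisticRegression().fit(x[group == g], y[group == g])
                        .score(x[group == g], y[group == g])
    for g in (0, 1)
])
```

Because the two groups have opposite decision boundaries, the single general classifier cannot satisfy both, while each focused model fits its own group well, which is the effect the thesis exploits.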
Gendering the Virtual Space: Sonic Femininities and Masculinities in Contemporary Top 40 Music
This dissertation analyzes vocal placement (the apparent location of a voice in the virtual space created by a recording) and its relationship to gender. When listening to a piece of recorded music through headphones or stereo speakers, one hears various sound sources as though they were located in a virtual space (Clarke 2013). For instance, a specific vocal performance, once manipulated by various technologies in a recording studio, might evoke a concert hall, an intimate setting, or an otherworldly space. The placement of the voice within this space is one of the central musical parameters through which listeners ascribe cultural meanings to popular music.
I develop an original methodology for analyzing vocal placement in recorded popular music. Combining close listening with music information retrieval tools, I precisely locate a voice's placement in virtual space according to five parameters: (1) Width, (2) Pitch Height, (3) Prominence, (4) Environment, and (5) Layering. I use the methodology to conduct close and distant readings of vocal placement in twenty-first-century Anglo-American popular music. First, an analysis of "Love the Way You Lie" (2010), by Eminem feat. Rihanna, showcases how the methodology can be used to support close readings of individual songs. Through my analysis, I suggest that Rihanna's wide vocal placement evokes a nexus of conflicting emotions in the wake of domestic violence. Eminem's narrow placement, conversely, expresses anger, frustration, and violence. Second, I use the analytical methodology to conduct a larger-scale study of vocal placement in a corpus of 113 post-2008 Billboard chart-topping collaborations between two or more artists. By stepping away from close readings of individual songs, I show how gender stereotypes are engineered en masse in the popular music industry. I show that women artists are generally assigned vocal placements that are wider, more layered, and more reverberated than those of men. This vocal placement configuration, exemplified in "Love the Way You Lie", creates a sonic contrast that presents women's voices as ornamental and diffuse, and men's voices as direct and relatable. I argue that these contrasting vocal placements sonically construct a gender binary, exemplifying one of the ways in which dichotomous conceptions of gender are reinforced through the sound of popular music.
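Of the five parameters, Width is the most directly measurable with MIR tools. A side-to-mid energy ratio is one plausible proxy for it, sketched below on toy stereo signals; this is an illustrative assumption, not the dissertation's exact procedure:

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr
rng = np.random.default_rng(0)

# Two toy stereo "vocals": one centered (narrow), one decorrelated (wide).
tone = np.sin(2 * np.pi * 220 * t)
narrow = np.stack([tone, tone])  # identical left/right channels
wide = np.stack([tone, 0.3 * tone + 0.95 * rng.standard_normal(sr)])

def width(stereo):
    """Side-to-mid energy ratio: 0 for a mono (centered) source, and
    larger as the left and right channels decorrelate."""
    left, right = stereo
    mid, side = (left + right) / 2, (left - right) / 2
    return np.sum(side**2) / (np.sum(mid**2) + 1e-12)

narrow_width = width(narrow)
wide_width = width(wide)
```

A perfectly centered voice has zero side energy, while layering, stereo reverb, and double-tracking all raise the ratio, which is why width, layering, and environment tend to co-vary in the corpus findings.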
Autoethnographic and qualitative research on popular music: Exploring the blues, jazz, grime, John Cage, live performance, SoundCloud and the masculinities of metal
This special edition of Riffs focuses on autoethnography and qualitative research in relation to popular music. The journal publication is twinned with a forthcoming book entitled Popular Music Ethnographies: practice, place, identity. The intention of these studies is to uphold the principle that "music is good to think with" (Chambers 1981: 38). Riffs was founded in 2015 to promote experimental writing on popular music, with a strong DiY ethos and space to offer flexibility and diversity of outputs by challenging interdisciplinary boundaries. At the same time, there is a degree of similarity with specialist popular music magazines, including Mojo, fRoots (1979-2019), Rolling Stone, Record Collector, Prog, Mixmag, and Uncut, through a focus on visuals and creative images. This suggests that there has been increasing growth at the "popular" end of biographical and autoethnographic writing within popular music. Critically, popular music autoethnographies work across and within the disciplinary boundaries of anthropology, social anthropology, cultural studies, sociology, and popular music studies.
Musical source separation with deep learning and large-scale datasets
Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019.
In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis.
We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-negative Matrix Factorization, and pitch-informed methods where the separation algorithm is guided by pitch information that is known a priori. We briefly discuss user-guided methods, before conducting a thorough review of deep learning-based source separation, including recurrent, convolutional, deep clustering-based, and Generative Adversarial Network approaches.
We then proceed to describe common evaluation metrics and training datasets. Finally, we list a number of challenges and drawbacks of current systems.
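Of the matrix decomposition methods surveyed, Non-negative Matrix Factorization is compact enough to sketch directly: a non-negative spectrogram V is approximated as W @ H, with W holding spectral templates and H their activations over time. The toy data and the classic Lee-Seung multiplicative updates below are illustrative, not any specific separation system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "spectrogram" V (freq_bins x time_frames).
V = rng.random((64, 40)) + 1e-3
k = 5  # number of spectral templates (sources/components)
W = rng.random((64, k))
H = rng.random((k, 40))

err_before = np.linalg.norm(V - W @ H)

# Lee-Seung multiplicative updates for the Euclidean objective; these
# keep W and H non-negative and monotonically reduce the error.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err_after = np.linalg.norm(V - W @ H)
```

Separation then proceeds by grouping templates per source and reconstructing each source's spectrogram from its share of W @ H.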
Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs.
In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study that showed the high importance of vocals, relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, and subjectively using crowdsourced human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation.
In the introduction above we proposed joint learning to optimize source separation and other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state-of-the-art pitch salience estimation model, exploring several ways of combining the two. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model.
Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: a time-domain loss, a magnitude-only frequency-domain loss, and a joint time- and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking.
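The motivation for complex masking can be seen in how the loss functions treat phase. The numpy sketch below, on toy complex STFT bins, shows that a magnitude-only loss is blind to phase errors while a complex-domain loss penalizes them; the data and loss forms are illustrative, not the thesis's exact objectives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex STFT bins for a target source and a model estimate.
shape = (257, 50)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
estimate = target + 0.1 * (rng.standard_normal(shape)
                           + 1j * rng.standard_normal(shape))

# Magnitude-only loss ignores phase; the complex-domain loss does not.
mag_loss = np.mean(np.abs(np.abs(estimate) - np.abs(target)))
complex_loss = np.mean(np.abs(estimate - target))

# Rotating the phase leaves magnitudes untouched, so only the
# complex-domain loss registers the (audible) error.
phase_rotated = target * np.exp(1j * 0.5)
mag_loss_rot = np.mean(np.abs(np.abs(phase_rotated) - np.abs(target)))
complex_loss_rot = np.mean(np.abs(phase_rotated - target))
```

Since phase errors are exactly what cause the artifacts described in Chapter 4, operating (and training) in the complex domain gives the model a gradient signal that magnitude-only training cannot provide.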
Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several future directions of research.