351 research outputs found

    Modeling and predicting emotion in music

    With the explosion of vast and easily accessible digital music libraries over the past decade, there has been a rapid expansion of research towards automated systems for searching and organizing music and related data. Online retailers now offer vast collections of music, spanning tens of millions of songs, available for immediate download. While these online stores present a drastically different dynamic than the record stores of the past, consumers still arrive with the same request: recommendation of music that is similar to their tastes. For both recommendation and curation, the vast digital music libraries of today necessarily require powerful automated tools. The medium of music has evolved specifically for the expression of emotions, and it is natural for us to organize music in terms of its emotional associations. But while such organization is a natural process for humans, quantifying it empirically proves to be a very difficult task. Myriad features, such as harmony, timbre, interpretation, and lyrics, affect emotion, and the mood of a piece may also change over its duration. Furthermore, in developing automated systems to organize music in terms of emotional content, we are faced with a problem that oftentimes lacks a well-defined answer; there may be considerable disagreement regarding the perception and interpretation of the emotions of a song or even ambiguity within the piece itself. Automatic identification of musical mood is a topic still in its early stages, though it has received increasing attention in recent years. Such work offers potential not just to revolutionize how we buy and listen to our music, but to provide deeper insight into the understanding of human emotions in general. This work seeks to relate core concepts from psychology to those of signal processing to understand how to extract information relevant to musical emotion from an acoustic signal.
The methods discussed here survey existing features using psychology studies and develop new features using basis functions learned directly from magnitude spectra. Furthermore, this work presents a wide breadth of approaches in developing functional mappings between acoustic data and emotion space parameters. Using these models, a framework is constructed for content-based modeling and prediction of musical emotion. Ph.D., Electrical Engineering -- Drexel University, 201

    Computational Modeling and Analysis of Multi-timbral Musical Instrument Mixtures

    In the audio domain, the disciplines of signal processing, machine learning, psychoacoustics, information theory and library science have merged into the field of Music Information Retrieval (Music-IR). Music-IR researchers attempt to extract high level information from music like pitch, meter, genre, rhythm and timbre directly from audio signals as well as semantic meta-data over a wide variety of sources. This information is then used to organize and process data for large scale retrieval and novel interfaces. For creating musical content, access to hardware and software tools for producing music has become commonplace in the digital landscape. While the means to produce music have become widely available, significant time must be invested to attain professional results. Mixing multi-channel audio requires techniques and training far beyond the knowledge of the average music software user. As a result, there is significant growth and development in intelligent signal processing for audio, an emergent field combining audio signal processing and machine learning for producing music. This work focuses on methods for modeling and analyzing multi-timbral musical instrument mixtures and performing automated processing techniques to improve audio quality based on quantitative and qualitative measures. The main contributions of the work involve training models to predict mixing parameters for multi-channel audio sources and developing new methods to model the component interactions of individual timbres to an overall mixture. Linear dynamical systems (LDS) are shown to be capable of learning the relative contributions of individual instruments to re-create a commercial recording based on acoustic features extracted directly from audio. Variations in the model topology are explored to make it applicable to a more diverse range of input sources and improve performance. An exploration of relevant features for modeling timbre and identifying instruments is performed. 
Using various basis decomposition techniques, audio examples are reconstructed and analyzed in a perceptual listening test to evaluate their ability to capture salient aspects of timbre. These tests show that a 2-D decomposition is able to capture much more perceptually relevant information with regard to the temporal evolution of the frequency spectrum of a set of audio examples. The results indicate that joint modeling of frequencies and their evolution is essential for capturing higher level concepts in audio that we desire to leverage in automated systems. Ph.D., Electrical Engineering -- Drexel University, 201

    Modeling Emotion Dynamics in Song Lyrics with State Space Models

    Most previous work in music emotion recognition assumes a single label or a few song-level labels for the whole song. While it is known that different emotions can vary in intensity within a song, annotated data for this setup is scarce and difficult to obtain. In this work, we propose a method to predict emotion dynamics in song lyrics without song-level supervision. We frame each song as a time series and employ a State Space Model (SSM), combining a sentence-level emotion predictor with an Expectation-Maximization (EM) procedure to generate the full emotion dynamics. Our experiments show that applying our method consistently improves the performance of sentence-level baselines without requiring any annotated songs, making it ideal for limited training data scenarios. Further analysis through case studies shows the benefits of our method while also indicating its limitations and pointing to future directions.
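The idea of treating per-sentence emotion scores as a noisy time series can be illustrated with a minimal sketch. This is not the authors' method: it assumes a one-dimensional local-level state space model with fixed, hand-picked noise variances (`q`, `r`), and it omits the EM step that would estimate those variances from data. The function name `kalman_smooth` and all parameter values are hypothetical.

```python
from typing import List


def kalman_smooth(obs: List[float], q: float = 0.05, r: float = 0.5,
                  m0: float = 0.0, p0: float = 1.0) -> List[float]:
    """Smooth noisy sentence-level scores with a local-level model.

    Assumed model (illustrative only):
        x_t = x_{t-1} + w_t,  w_t ~ N(0, q)   latent emotion intensity
        y_t = x_t + v_t,      v_t ~ N(0, r)   sentence-level prediction
    """
    filt_m, filt_p = [], []   # filtered means and variances
    pred_m, pred_p = [], []   # one-step-ahead predicted means and variances
    m, p = m0, p0
    # Forward pass: Kalman filter.
    for y in obs:
        m_pred, p_pred = m, p + q          # predict
        k = p_pred / (p_pred + r)          # Kalman gain
        m = m_pred + k * (y - m_pred)      # update mean
        p = (1.0 - k) * p_pred             # update variance
        pred_m.append(m_pred)
        pred_p.append(p_pred)
        filt_m.append(m)
        filt_p.append(p)
    # Backward pass: Rauch-Tung-Striebel smoother.
    smoothed = filt_m[:]
    for t in range(len(obs) - 2, -1, -1):
        g = filt_p[t] / pred_p[t + 1]      # smoother gain
        smoothed[t] = filt_m[t] + g * (smoothed[t + 1] - pred_m[t + 1])
    return smoothed


# Hypothetical per-sentence scores from some sentence-level predictor.
raw_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
song_dynamics = kalman_smooth(raw_scores)
```

A larger `r` relative to `q` expresses less trust in the individual sentence-level predictions and yields a smoother trajectory across the song.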

    The role of artist and genre on music emotion recognition

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. The goal of this study is to classify a dataset of songs according to their emotion and to understand the impact that the artist and genre have on the accuracy of the classification model. This will help market players such as Spotify and Apple Music to retrieve useful songs in the right context. This analysis was performed by extracting audio and non-audio features from the DEAM dataset and classifying them. The correlation between artist, song genre and other audio features was also analyzed. Furthermore, the classification performance of different machine learning algorithms was evaluated and compared, e.g., Support Vector Machines (SVM), Decision Trees, Naive Bayes and K-Nearest Neighbors. We found that Support Vector Machines attained the highest performance when using either only Audio features or a combination of Audio features and Genre. Namely, F-measures of 0.46 and 0.45 were achieved, respectively. We concluded that the Artist variable was not impactful to the emotion of the songs. Therefore, by using Support Vector Machines with the combination of Audio and Genre variables, we analyzed the results and created a dashboard to visualize the incorrectly classified songs. This information helped to understand whether these variables are useful for improving the emotion classification model developed and what the relationships between them and other audio and non-audio features were.

    Latent variable methods for visualization through time


    Models and Analysis of Vocal Emissions for Biomedical Applications

    The proceedings of the MAVEBA Workshop, held on a biannual basis, collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies.

    Multimodal Video Analysis and Modeling

    From recalling long-forgotten experiences based on a familiar scent or on a piece of music, to lip-reading-aided conversation in noisy environments or travel sickness caused by a mismatch of the signals from vision and the vestibular system, human perception manifests countless examples of subtle and effortless joint adoption of the multiple senses provided to us by evolution. Emulating such multisensory (or multimodal, i.e., comprising multiple types of input modes or modalities) processing computationally offers tools for more effective, efficient, or robust accomplishment of many multimedia tasks using evidence from the multiple input modalities. Information from the modalities can also be analyzed for patterns and connections across them, opening up interesting applications not feasible with a single modality, such as prediction of some aspects of one modality based on another. In this dissertation, multimodal analysis techniques are applied to selected video tasks with accompanying modalities. More specifically, all the tasks involve some type of analysis of videos recorded by non-professional videographers using mobile devices. Fusion of information from multiple modalities is applied to recording environment classification from video and audio as well as to sport type classification from a set of multi-device videos, corresponding audio, and recording device motion sensor data. The environment classification combines support vector machine (SVM) classifiers trained on various global visual low-level features with audio event histogram based environment classification using k-nearest neighbors (k-NN). Rule-based fusion schemes with genetic algorithm (GA)-optimized modality weights are compared to training an SVM classifier to perform the multimodal fusion. A comprehensive selection of fusion strategies is compared for the task of classifying the sport type of a set of recordings from a common event.
These include fusion prior to, simultaneously with, and after classification; various approaches for using modality quality estimates; and fusing soft confidence scores as well as crisp single-class predictions. Additionally, different strategies are examined for aggregating the decisions of single videos to a collective prediction from the set of videos recorded concurrently with multiple devices. In both tasks, multimodal analysis shows a clear advantage over separate classification of the modalities. Another part of the work investigates cross-modal pattern analysis and audio-based video editing. This study examines the feasibility of automatically timing shot cuts of multi-camera concert recordings according to music-related cutting patterns learnt from professional concert videos. Cut timing is a crucial part of automated creation of multicamera mashups, where shots from multiple recording devices from a common event are alternated with the aim of mimicking a professionally produced video. In the framework, separate statistical models are formed for typical patterns of beat-quantized cuts in short segments, differences in beats between consecutive cuts, and relative deviation of cuts from exact beat times. Based on music meter and audio change point analysis of a new recording, the models can be used for synthesizing cut times. In a user study, the proposed framework clearly outperforms a baseline automatic method with comparably advanced audio analysis and wins 48.2% of comparisons against hand-edited videos.
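The rule-based fusion of soft confidence scores with per-modality weights, mentioned in the abstract above, can be sketched as a weighted average over classifier outputs. This is an illustrative assumption rather than the dissertation's implementation: the modality names, scores, and weights below are made up, and the weights are fixed by hand instead of being optimized by a genetic algorithm.

```python
from typing import Dict, List


def fuse_scores(scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Late fusion: weighted average of per-class confidence scores.

    scores  maps a modality name to its per-class confidence scores
    weights maps a modality name to a non-negative weight
    """
    total = sum(weights[m] for m in scores)
    n_classes = len(next(iter(scores.values())))
    return [
        sum(weights[m] * scores[m][c] for m in scores) / total
        for c in range(n_classes)
    ]


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical two-class example: the modalities disagree on the class.
scores = {"video": [0.6, 0.4], "audio": [0.2, 0.8]}

equal = fuse_scores(scores, {"video": 1.0, "audio": 1.0})
video_heavy = fuse_scores(scores, {"video": 0.8, "audio": 0.2})

print(predict(equal), predict(video_heavy))
```

With equal weights the confident audio classifier decides the outcome; shifting weight toward video flips the decision, which is why the choice of modality weights matters in this kind of scheme.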