
    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue, however, is that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static, fixed-length input. Moreover, even for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which describes a document as a histogram of the terms it contains. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while preserving the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and that their fusion improves the performance of a machine listening system.
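    The core BoAW pipeline can be illustrated with a minimal sketch: frame-level LLDs are quantised against a codebook and each recording is summarised as a normalised term histogram. The snippet below assumes LLD frames are already available as NumPy arrays and uses k-means with an arbitrary codebook size for illustration; it is not the exact openXBOW implementation.

    # Minimal bag-of-audio-words sketch (Python): quantise frame-level LLDs
    # against a learned codebook and summarise a recording as a histogram.
    import numpy as np
    from sklearn.cluster import KMeans

    def learn_codebook(lld_frames: np.ndarray, n_words: int = 500) -> KMeans:
        """Learn an audio-word codebook from pooled LLD frames (n_frames x n_llds)."""
        return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(lld_frames)

    def boaw_histogram(lld_frames: np.ndarray, codebook: KMeans) -> np.ndarray:
        """Assign each frame to its nearest audio word; return a normalised term histogram."""
        words = codebook.predict(lld_frames)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)  # term-frequency normalisation

    # Usage (llds_of_recording is a hypothetical helper): learn the codebook on
    # pooled training frames, then map each recording to one fixed-length vector
    # that a static classifier can consume.
    # train_pool = np.vstack([llds_of_recording(r) for r in training_recordings])
    # codebook = learn_codebook(train_pool)
    # x = boaw_histogram(llds_of_recording(test_recording), codebook)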

    A Cross-Cultural Analysis of Music Structure

    Music signal analysis is a research field concerned with the extraction of meaningful information from musical audio signals. This thesis analyses music signals from the note level to the song level in a bottom-up manner and situates the research in two music information retrieval (MIR) problems: audio onset detection (AOD) and music structural segmentation (MSS). Most MIR tools are developed for and evaluated on Western music, with specific musical knowledge encoded. This thesis approaches the investigated tasks from a cross-cultural perspective by developing audio features and algorithms applicable to both Western and non-Western genres. Two Chinese Jingju databases are collected to facilitate the AOD and MSS tasks, respectively. New features and algorithms for AOD are presented, relying on fusion techniques. We show that fusion can significantly improve the performance of the constituent baseline AOD algorithms. A large-scale parameter analysis is carried out to identify the relations between system configurations and the musical properties of different music types. Novel audio features are developed to summarise music timbre, harmony and rhythm for structural description. The new features serve as effective alternatives to commonly used ones, showing comparable performance on existing datasets and surpassing them on the Jingju dataset. A new segmentation algorithm is presented which effectively captures the structural characteristics of Jingju. By evaluating the presented audio features and different segmentation algorithms incorporating different structural principles for the investigated music types, this thesis also identifies the underlying relations between audio features, segmentation methods and music genres in the scenario of music structural analysis.
    Funding: China Scholarship Council; EPSRC C4DM Travel Funding; EPSRC Fusing Semantic and Audio Technologies for Intelligent Music Production and Consumption (EP/L019981/1); EPSRC Platform Grant on Digital Music (EP/K009559/1); European Research Council project CompMusic; International Society for Music Information Retrieval Student Grant; QMUL Postgraduate Research Fund; QMUL-BUPT Joint Programme Funding; Women in Music Information Retrieval Grant
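    As an illustration of the fusion idea for onset detection, the sketch below performs a simple late fusion of two onset detection functions computed with librosa (the default spectral-flux envelope and a SuperFlux-style variant), averages the normalised curves and peak-picks on the result. This is a generic fusion baseline under those assumptions, not the specific algorithm developed in the thesis.

    # Late fusion of two onset detection functions (illustrative baseline).
    import numpy as np
    import librosa

    def fused_onsets(y: np.ndarray, sr: int) -> np.ndarray:
        # Constituent detection functions: default spectral-flux onset strength
        # and a SuperFlux-style variant (temporal lag + local max filtering).
        odf_a = librosa.onset.onset_strength(y=y, sr=sr)
        odf_b = librosa.onset.onset_strength(y=y, sr=sr, lag=2, max_size=3)

        # Normalise each curve so neither dominates, then average them.
        norm = lambda d: d / (d.max() + 1e-9)
        n = min(len(odf_a), len(odf_b))
        fused = 0.5 * (norm(odf_a)[:n] + norm(odf_b)[:n])

        # Peak-pick on the fused curve to obtain onset frame indices.
        return librosa.onset.onset_detect(onset_envelope=fused, sr=sr)

    # Usage: y, sr = librosa.load("clip.wav"); frames = fused_onsets(y, sr)
    # times = librosa.frames_to_time(frames, sr=sr)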

    Open Listener: Cross-Cultural Experience and Identity in American Music


    A Study in Violinist Identification using Short-term Note Features

    The perception of musical expression and emotion is greatly influenced by the performer's individual interpretation; modelling a performer's style is therefore important for music understanding, style transfer, music education and characteristic music generation. This thesis proposes approaches for modelling and identifying musical instrumentalists, using violinist identification as a case study. In violin performance, vibrato and timbre play important roles in the player's emotional expression, and they are key factors of playing style whose execution varies greatly between players. To validate that these two factors are effective for modelling violinists, we design and extract note-level vibrato and timbre features from isolated concerto music notes, then present a violinist identification method based on the similarity of feature distributions, using single features as well as fused features. The results show that vibrato features are helpful for violinist identification, and that some timbre features perform better than vibrato features. In addition, the accuracy obtained from fused features is higher than from any single feature. However, apart from the performer, timbre is also determined by the musical instrument, recording conditions and other factors. Furthermore, the common scenario for violinist identification is based on short music clips rather than isolated notes. To address these two problems, we further examine the method using note-level timbre features to recognise violinists from segmented solo music clips, then use it to identify master players from concerto fragments. The results show that the designed features and method work well for both types of music. Another experiment examines the influence of the instrument on the features; the results suggest that the selected timbre features can model performers' individual playing reasonably and objectively, regardless of the instrument they play. Expressive timing is another key factor reflecting individual playing style. This thesis develops a novel onset time deviation feature, which is used to model and identify master violinists on concerto fragment data; it performs better than the timbre features on this dataset. To generalise the violinist identification method and further improve the results, deep learning methods are proposed and investigated. We present a transfer learning approach to violinist identification based on pre-trained music auto-tagging neural networks and singer identification models: we transfer the pre-trained weights, fine-tune the models on violin datasets, and obtain violinist identification results. Compared with state-of-the-art work, our model performs better on our two datasets.
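    A minimal sketch of identification by distribution similarity, under the assumption that note-level feature vectors (e.g. vibrato rate and extent, spectral timbre descriptors) have already been extracted: each player's notes are modelled as a diagonal Gaussian, and a query clip is assigned to the player with the smallest symmetrised KL divergence. The feature extraction and the specific distance are illustrative, not necessarily those used in the thesis.

    # Distribution-similarity identification sketch (Python, NumPy only).
    import numpy as np

    def fit_diag_gaussian(notes: np.ndarray):
        """notes: (n_notes, n_features) note-level features of one player -> (mean, variance)."""
        return notes.mean(axis=0), notes.var(axis=0) + 1e-6

    def sym_kl(p, q):
        """Symmetrised KL divergence between two diagonal Gaussians given as (mean, variance)."""
        (m1, v1), (m2, v2) = p, q
        kl = lambda ma, va, mb, vb: 0.5 * np.sum(np.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
        return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

    def identify(query_notes: np.ndarray, player_models: dict) -> str:
        """Return the player whose note-feature distribution is closest to the query."""
        q = fit_diag_gaussian(query_notes)
        return min(player_models, key=lambda name: sym_kl(q, player_models[name]))

    # Usage (extract_note_features is a hypothetical feature extractor):
    # player_models = {name: fit_diag_gaussian(feats) for name, feats in training_notes.items()}
    # predicted = identify(extract_note_features(query_clip), player_models)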

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, whereas Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired techniques such as swarm intelligence, as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for the optimal solving of various applications, demonstrating its wide reach and relevance. Combining optimization methods with data mining strategies yields a strong and reliable prediction tool for handling real-life applications.

    Analyzing and enhancing music mood classification : an empirical study

    In the computer age, managing large data repositories is a common challenge, especially for music data. Categorizing, manipulating, and refining music tracks are among the most complex tasks in Music Information Retrieval (MIR). Classification is one of the core functions in MIR, classifying music data from different perspectives, from genre to instrument to mood. The primary focus of this study is music mood classification. Mood is a subjective phenomenon in MIR which involves different considerations, such as psychology, musicology, culture, and social behavior. One of the most significant prerequisites in music mood classification is answering these questions: What combination of acoustic features helps to improve classification accuracy in this area? What types of classifiers are appropriate for music mood classification? How can we increase the accuracy of music mood classification by using several classifiers? To find the answers, we empirically explored different acoustic features and classification schemes for mood classification in music data. We also propose two approaches that use several classifiers simultaneously to classify music tracks by mood labels automatically. These approaches rely on two voting procedures, namely Plurality Voting and Borda Count, and fall under ensemble techniques, which combine a group of classifiers to reach better accuracy. The proposed ensemble methods are implemented and verified through empirical experiments. The results show that these approaches can improve the accuracy of music mood classification.
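    The two voting procedures named above can be summarised in a short sketch; the mood labels and classifier outputs shown are purely illustrative.

    # Plurality Voting and Borda Count over the outputs of several classifiers.
    from collections import Counter

    def plurality_vote(predictions):
        """predictions: one predicted mood label per classifier; the most frequent label wins."""
        return Counter(predictions).most_common(1)[0][0]

    def borda_count(rankings):
        """rankings: per classifier, mood labels ordered from most to least likely.
        Higher ranks earn more points; the label with the highest total wins."""
        scores = Counter()
        for ranking in rankings:
            for rank, label in enumerate(ranking):
                scores[label] += len(ranking) - 1 - rank
        return scores.most_common(1)[0][0]

    # Three classifiers, four mood classes:
    print(plurality_vote(["happy", "calm", "happy"]))            # -> happy
    print(borda_count([["happy", "calm", "sad", "angry"],
                       ["calm", "happy", "sad", "angry"],
                       ["calm", "sad", "happy", "angry"]]))      # -> calm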

    2D to 3D non photo realistic character transformation and morphing (computer animation)

    This research concerns the transformation and morphing between full-body 2D and 3D animated characters. This practice-based research examines both technical and aesthetic techniques for enhancing the morphing of animated characters: stylized character transformations from A to B and from B to A, where details such as facial expression, body motion and texture are transformed expressively and aesthetically within a narrated story. Currently it is hard to separate 2D and 3D animation in mixed-media usage. If we analyse and break down these graphical components, we can find a distinction in how 2D and 3D elements increase the information level and complexity of storytelling. From a character animation perspective, however, instant transformation of a digital character from 2D to 3D is not possible without post-production techniques, pre-defined 3D information such as blend shapes or complex geometry data, and mathematical calculation. There are two main elements to this investigation. The primary element is the design system for such stylized characters in 2D and 3D. Many current design systems (morphing software) are based on photorealistic artefacts, such as Fanta Morph, Morph Buster, Morpheus and Fun Morph; this investigation focuses on non-photorealistic character morphing. In seeking to define the targeted non-photorealistic, illustrated, stylized 2D and 3D characters, I examine the advantages and disadvantages of a number of 2D illustrated characters with respect to 3D morphing. This investigation also helps to analyse the efficiency and limitations of such non-photorealistic character design and transformation, for which broader techniques will be explored. The secondary element is a theoretical investigation relating how such artistic and technical morphing ideas have been used in past and present films and games. A narrated story contains a character who acts upon a starting question or situation and reacts to events; the gap between his aim and the result of his actions, and the gap between his vision and his personality, create the dramatic tension. I intend to identify a transitional process of voice between narrator and morphing character, while also illustrating, through visual terminology, the varying fluctuations between two speaking agents. I intend to demonstrate, with samples, that morphing is not just visually important but has a direct impact on storytelling.

    Analyzing Image Tweets in Microblogs
