Towards a (better) Definition of Annotated MIR Corpora
Today, annotated MIR corpora are provided by various research labs and companies, each using its own annotation methodology, concept definitions, and formats. This is not an issue as such. However, the lack of descriptions of the methodology used (how the corpus was actually annotated, and by whom) and of the annotated concepts (what is actually described) is a problem with respect to the sustainability, usability, and sharing of the corpora. Experience shows that it is essential to define precisely how annotations are supplied and described. We propose here a survey and consolidation report on the nature of the annotated corpora used and shared in MIR, with proposals for the axes against which corpora can be described so as to enable effective comparison, and a discussion of the inherent influence this has on the tasks performed using them.
DEEP-RHYTHM FOR TEMPO ESTIMATION AND RHYTHM PATTERN RECOGNITION
It has been shown that the harmonic series at the tempo frequency of the onset-strength-function of an audio signal accurately describes its rhythm pattern and can be used to perform tempo or rhythm pattern estimation. Recently, in the case of multi-pitch estimation, the depth of the input layer of a convolutional network has been used to represent the harmonic series of pitch candidates. We use a similar idea here to represent the harmonic series of tempo candidates. We propose the Harmonic-Constant-Q-Modulation, which represents, using a 4D tensor, the harmonic series of modulation frequencies (considered as tempo frequencies) in several acoustic frequency bands over time. This representation is used as input to a convolutional network which is trained to estimate tempo or rhythm pattern classes. Using a large number of datasets, we evaluate the performance of our approach and compare it with previous approaches. We show that it slightly increases Accuracy-1 for tempo estimation but not the average-mean-Recall for rhythm pattern recognition.
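The harmonic-series idea behind this representation can be sketched in a simplified 1D form: score each tempo candidate by summing the onset-strength spectrum's magnitude at integer multiples of the candidate's frequency. The function name and the synthetic onset-strength function below are illustrative only; the paper's HCQM additionally stacks such harmonic series per acoustic frequency band over time into a 4D tensor.

```python
import numpy as np

def tempo_harmonic_score(osf, sr_osf, bpm_grid, n_harmonics=4):
    """Score each tempo candidate (BPM) by the spectral magnitude of the
    onset-strength-function (OSF) at harmonics of the candidate's frequency.
    A simplified 1D sketch of the harmonic-series idea, not the paper's HCQM."""
    spec = np.abs(np.fft.rfft(osf * np.hanning(len(osf))))
    freqs = np.fft.rfftfreq(len(osf), d=1.0 / sr_osf)
    scores = []
    for bpm in bpm_grid:
        f0 = bpm / 60.0  # tempo frequency in Hz
        s = 0.0
        for h in range(1, n_harmonics + 1):
            # pick the spectral bin nearest to each harmonic of the candidate
            idx = np.argmin(np.abs(freqs - h * f0))
            s += spec[idx]
        scores.append(s)
    return np.asarray(scores)

# synthetic OSF: a pulse train at 120 BPM (2 Hz), sampled at 100 Hz
sr_osf = 100
t = np.arange(0, 30, 1 / sr_osf)
osf = (np.sin(2 * np.pi * 2.0 * t) > 0.99).astype(float)
bpm_grid = np.arange(60, 181)
best_bpm = bpm_grid[np.argmax(tempo_harmonic_score(osf, sr_osf, bpm_grid))]
print(best_bpm)
```

The harmonic sum disambiguates octave errors: a 60 BPM candidate only collects energy at two of its four harmonics for this signal, while 120 BPM collects all four.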
Addressing Tempo Estimation Octave Errors in Electronic Music by Incorporating Style Information Extracted From Wikipedia
(Abstract to follow)
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications. This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One of the challenges in automatic spoken language assessment is giving candidates feedback on particular aspects, or views, of their spoken language proficiency, in addition to the overall holistic score normally provided. Another is detecting pronunciation and other types of errors at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different views and, as examiners are often trained to give holistic grades, single-view scores can suffer issues of consistency. Conversely, holistic scores are available for various standard assessment tasks such as Linguaskill. An investigation is thus conducted into whether assessment scores linked to particular views of the speaker's ability can be obtained from systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single views, specifically pronunciation, rhythm, intonation, and text. By training each system on large quantities of candidate data, it should be possible to extract individual-view information. The relationships between the predictions of each system are evaluated to examine whether they are, in fact, extracting different information about the speaker. Three methods of combining the systems to predict holistic score are investigated, namely averaging their predictions and concatenating and attending over their intermediate representations. The combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. An approach to these tasks is presented by distinguishing between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.
Evaluating the spectral clustering segmentation algorithm for describing diverse music collections
This paper presents an evaluation of the spectral clustering segmentation algorithm used for automating the description of musical structure within a song. This study differs from the standard evaluation in that it accounts for the effects of variability in genre, class, tempo, song duration, and time signature on evaluation metrics. The study applies standard metrics for segment boundary placement accuracy and labeling accuracy against these song metadata. It reveals that song duration, tempo, class, and genre have a significant effect on evaluation scores. This study demonstrates how the algorithm may be evaluated to predict its performance for a given collection where these variables are known. The possible causes and implications of these effects on evaluation scores are explored based on the construction of the spectral clustering algorithm and its potential for use in describing diverse music collections. Master of Science in Library Science.
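The standard boundary-placement metric referred to above scores an estimated boundary as a hit when it falls within a fixed tolerance window of an unmatched reference boundary. Below is a minimal sketch of that metric; the function name and the greedy matching strategy are my own simplification, not the paper's evaluation code.

```python
def boundary_f_measure(reference, estimated, window=0.5):
    """Boundary hit-rate: an estimated boundary (in seconds) counts as a
    hit if it lies within `window` seconds of an unmatched reference
    boundary. Returns (precision, recall, F-measure)."""
    ref = sorted(reference)
    est = sorted(estimated)
    matched_ref = set()
    hits = 0
    for e in est:
        # greedily match to the nearest unmatched reference boundary
        candidates = [(abs(e - r), i) for i, r in enumerate(ref)
                      if i not in matched_ref and abs(e - r) <= window]
        if candidates:
            _, i = min(candidates)
            matched_ref.add(i)
            hits += 1
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# toy example: 2 of 3 estimates land within 0.5 s of a reference boundary
p, r, f = boundary_f_measure([0.0, 10.0, 20.0, 30.0], [0.2, 10.6, 20.4])
print(p, r, f)
```

Varying `window` (commonly 0.5 s or 3 s) is one way such a study can probe how tolerant the evaluation is to tempo- and duration-dependent boundary drift.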
Identifying Speaker State from Multimodal Cues
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in various real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research that addresses these limitations and challenges associated with speaker state identification by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from speech, text, and visual modalities.
The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research recently, with extensive emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations emotion changes continuously and exists in all spoken languages. To address this mismatch, we propose a deep neural network model that recognizes continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Because more textual sentiment models than speech models exist for low-resource languages, we also propose a method to bootstrap sentiment labels from text transcripts and use these labels to train a sentiment classifier for speech. Utilizing the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource languages to low-resource languages. Moreover, using the natural verse-level alignment in audio Bibles across different languages, we also explore cross-lingual and cross-modality sentiment transfer.
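The idea of fusing a raw-waveform branch with a spectrogram branch can be sketched as follows. The feature extractors and the linear head below are untrained toy stand-ins (a smoothing filter and a mean-pooled log spectrogram), not the thesis's deep neural network; they only illustrate the two-branch, concatenate-then-predict structure.

```python
import numpy as np

def spectrogram_features(wave, n_fft=256, hop=128):
    """Spectrogram branch: log-magnitude STFT, mean-pooled over time."""
    frames = [wave[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wave) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log1p(mags).mean(axis=0)          # shape (n_fft // 2 + 1,)

def waveform_features(wave, kernel=None):
    """Raw-waveform branch: 1D convolution + global average pooling."""
    if kernel is None:
        kernel = np.ones(32) / 32.0             # fixed smoothing filter
    return np.array([np.convolve(np.abs(wave), kernel, mode="valid").mean()])

def predict_valence(wave, w, b=0.0):
    """Fuse both branches by concatenation, then apply a linear head."""
    feats = np.concatenate([waveform_features(wave),
                            spectrogram_features(wave)])
    return float(feats @ w + b)

rng = np.random.default_rng(1)
wave = rng.normal(size=8000)                    # 1 s of noise at 8 kHz
w = rng.normal(size=1 + 129) * 0.01             # untrained head weights
print(round(predict_valence(wave, w), 4))
```

In the real model both branches would be learned jointly, so each can specialize in cues the other misses (fine temporal structure vs. spectral shape).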
In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making it hard to obtain reliable labels. This results in a lack of data annotated for humor, so we propose two different methods to automatically and reliably label humor. First, we develop a framework for generating humor labels on videos by learning from extensive user-generated comments. We collect and analyze 100 videos and build multimodal humor detection models using speech, text, and visual features, achieving an F1-score of 0.76. In addition to humorous videos, we also develop a framework for generating humor labels on social media posts by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models that detect humor with performance comparable to human labelers.
The third part of the thesis focuses on charisma, a commonly found but less studied speaker state with unique challenges: the definition of charisma varies widely among perceivers, and the perception of charisma also varies with speakers' and perceivers' demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the raters as well as their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend the work to politicians' speech by collecting speaker trait ratings on representative speech segments of politicians and studying how the genre, gender, and the rater's political stance influence the charisma ratings of the segments.
Separation and Estimation of the Number of Audio Signal Sources with Overlap in Time and Frequency
Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"), highlighting the need for new representations and more powerful models.
To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNN). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that not only improves separation for overlapped and modulated sources in unison mixtures but also improves vocal and accompaniment separation when used as input to a DNN model.
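The common-fate idea can be sketched as a second Fourier analysis over time within spectrogram patches, which exposes shared amplitude and frequency modulations on an explicit modulation-frequency axis. Note this is a simplified sketch that transforms along the time axis only; the actual Common Fate Transform applies a 2D transform over time-frequency patches, and the parameter values below are arbitrary.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Plain magnitude-ready STFT with a Hann window."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)      # (frames, bins)

def common_fate_sketch(x, patch_frames=16, n_fft=256, hop=128):
    """Second Fourier analysis along time within spectrogram patches,
    yielding a (patch, modulation-bin, frequency-bin) tensor."""
    X = np.abs(stft(x, n_fft, hop))                   # magnitude spectrogram
    n_patches = X.shape[0] // patch_frames
    patches = X[:n_patches * patch_frames].reshape(
        n_patches, patch_frames, X.shape[1])
    # FFT along each patch's time axis -> modulation spectrum per freq bin
    return np.abs(np.fft.rfft(patches, axis=1))

sr = 8000
t = np.arange(0, 2, 1 / sr)
# a 440 Hz tone with 6 Hz amplitude modulation (vibrato-like common fate)
x = (1 + 0.8 * np.sin(2 * np.pi * 6 * t)) * np.sin(2 * np.pi * 440 * t)
C = common_fate_sketch(x)
print(C.shape)
```

Sources that share a modulation rate concentrate in the same modulation bins, so two instruments at the same pitch but with different vibrato rates become separable along the new axis even though they fully overlap in time-frequency.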
Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments confirming that humans can correctly estimate the number of sources only up to four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task.
In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important for computationally mimicking human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to find modulation-like intermediate features.
- âŠ