Analysis and classification of phonation modes in singing
Phonation mode is an expressive aspect of the singing voice and can be described using the four categories neutral, breathy, pressed and flow. Previous attempts at automatically classifying the phonation mode on a dataset containing vowels sung by a female professional have lacked accuracy or have not sufficiently investigated the characteristic features of the different phonation modes that enable successful classification. In this paper, we extract a large range of features from this dataset, including specialised descriptors of pressedness and breathiness, to analyse their explanatory power and robustness against changes of pitch and vowel. We train and optimise a feed-forward neural network (NN) with one hidden layer on all features using cross-validation, achieving a mean F-measure above 0.85 and improved performance compared to previous work. Applying feature selection based on mutual information and retaining the nine highest-ranked features as input to a NN results in a mean F-measure of 0.78, demonstrating the suitability of these features for discriminating between phonation modes. Training and pruning a decision tree yields a simple rule set based only on cepstral peak prominence (CPP), temporal flatness and average energy that correctly categorises 78% of the recordings.
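The pruned decision tree described above amounts to a handful of threshold rules on three acoustic descriptors. A minimal sketch of such a rule set is shown below; the thresholds and rule ordering are illustrative placeholders, not the values learned in the paper.

```python
def classify_phonation(cpp_db, temporal_flatness, avg_energy_db):
    """Toy threshold rule set in the spirit of the pruned decision tree.

    Thresholds below are hypothetical, chosen only to illustrate how a
    three-feature rule set can separate the four phonation modes.
    """
    if cpp_db < 10.0:               # low cepstral peak prominence: noisy source
        return "breathy"
    if temporal_flatness < 0.5:     # irregular amplitude envelope
        return "pressed"
    if avg_energy_db > -20.0:       # high energy with regular, clear voicing
        return "flow"
    return "neutral"
```

A real rule set would be read off the trained tree, but the structure, a short cascade of scalar comparisons, is the same.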
Analysis and Detection of Pathological Voice using Glottal Source Features
Automatic detection of voice pathology enables objective assessment and
earlier intervention in the diagnosis. This study provides a systematic
analysis of glottal source features and investigates their effectiveness in
voice pathology detection. Glottal source features are extracted using glottal
flows estimated with the quasi-closed phase (QCP) glottal inverse filtering
method, using approximate glottal source signals computed with the zero
frequency filtering (ZFF) method, and using acoustic voice signals directly. In
addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from
the glottal source waveforms computed by QCP and ZFF to effectively capture the
variations in glottal source spectra of pathological voice. Experiments were
carried out using two databases, the Hospital Universitario Principe de
Asturias (HUPA) database and the Saarbrucken Voice Disorders (SVD) database.
Analysis of features revealed that the glottal source contains information that
discriminates normal and pathological voice. Pathology detection experiments
were carried out using support vector machine (SVM). From the detection
experiments it was observed that the performance achieved with the studied
glottal source features is comparable to or better than that of conventional MFCCs
and perceptual linear prediction (PLP) features. The best detection performance
was achieved when the glottal source features were combined with the
conventional MFCCs and PLP features, which indicates the complementary nature
of the features.
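The detection pipeline described above reduces to a standard binary classification: per-utterance feature vectors (glottal-source MFCCs or similar) fed to an SVM. The sketch below uses randomly generated stand-in features rather than real QCP/ZFF glottal-source features, so it only illustrates the classifier side of the pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in feature matrices: one row per utterance, 13 columns mimicking
# glottal-source MFCCs. Real features would come from QCP/ZFF inverse filtering.
normal = rng.normal(0.0, 1.0, size=(100, 13))
pathological = rng.normal(0.8, 1.0, size=(100, 13))
X = np.vstack([normal, pathological])
y = np.array([0] * 100 + [1] * 100)          # 0 = normal, 1 = pathological

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Standardise features, then fit an RBF-kernel SVM.
scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(scaler.transform(X_tr), y_tr)
accuracy = clf.score(scaler.transform(X_te), y_te)
```

Combining feature streams, as in the paper's best system, would simply mean concatenating the glottal-source features with conventional MFCC/PLP columns before scaling.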
On the use of voice descriptors for glottal source shape parameter estimation
This paper summarizes the results of our investigations into estimating the shape of the glottal excitation source from speech signals. We employ the Liljencrants-Fant (LF) model describing the glottal flow and its derivative. The one-dimensional glottal source shape parameter Rd describes the transition in voice quality from a tense to a breathy voice. The parameter Rd has been derived from a statistical regression of the R waveshape parameters which parameterize the LF model. First, we introduce a variant of our recently proposed adaptation and range extension of the Rd parameter regression. Secondly, we discuss in detail the aspects of estimating the glottal source shape parameter Rd using the phase minimization paradigm. Based on the analysis of a large number of speech signals we describe the major conditions that are likely to result in erroneous Rd estimates, and building on these findings we investigate means to increase the robustness of the Rd parameter estimation. We use Viterbi smoothing to suppress unnatural jumps of the estimated Rd parameter contours within short time segments. Additionally, we propose to steer the Viterbi algorithm by exploiting the covariation of other voice descriptors to improve Viterbi smoothing. The novel Viterbi steering is based on a Gaussian Mixture Model (GMM) that represents the joint density of the voice descriptors and the Open Quotient (OQ) estimated from corresponding electroglottographic (EGG) signals. A conversion function derived from the mixture model predicts OQ from the voice descriptors. Converted to Rd, it defines an additional prior probability that adapts the partial probabilities of the Viterbi algorithm accordingly.
Finally, we evaluate the performance of the phase-minimization-based methods using both variants to adapt and extend the Rd regression on a synthetic test set, as well as in combination with Viterbi smoothing and each variant of the novel Viterbi steering on a test set of natural speech. The experimental findings show improvements for both Viterbi approaches.
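The Viterbi smoothing step above can be sketched as a small dynamic program: each frame has a cost for every candidate Rd value, and transitions between frames pay a penalty proportional to the size of the Rd jump. The candidate grid, costs, and penalty weight below are hypothetical; the paper's steering prior (from the GMM-predicted OQ) would enter as an extra per-frame cost term.

```python
import numpy as np

def viterbi_smooth(frame_costs, candidates, jump_penalty=5.0):
    """Pick one candidate Rd per frame by dynamic programming, trading the
    per-frame estimation cost against a penalty on |Rd| jumps between
    consecutive frames (suppresses unnatural jumps in the contour)."""
    T, K = frame_costs.shape
    trans = jump_penalty * np.abs(candidates[:, None] - candidates[None, :])
    total = frame_costs[0].astype(float).copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = total[:, None] + trans      # cost of prev state i -> state j
        back[t] = np.argmin(scores, axis=0)
        total = scores.min(axis=0) + frame_costs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(total))
    for t in range(T - 1, 0, -1):            # backtrack the optimal path
        path[t - 1] = back[t, path[t]]
    return candidates[path]

candidates = np.array([0.5, 1.0, 2.0])       # hypothetical Rd grid
frame_costs = np.array([[0., 1., 1.],
                        [0., 1., 1.],
                        [1., 1., 0.],        # outlier frame prefers Rd = 2.0
                        [0., 1., 1.]])
smoothed = viterbi_smooth(frame_costs, candidates)
```

With the jump penalty active, the outlier frame is pulled back to the neighbouring Rd value instead of producing a one-frame spike.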
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced in a multimodal manner, through the usage of words
(text), gestures (vision) and prosodic cues (acoustic). Understanding humor
from these three modalities falls within the boundaries of multimodal language, a
recent research trend in natural language processing that models natural
language as it happens in face-to-face communication. Although humor detection
is an established research area in NLP, in a multimodal context it is an
understudied area. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding multimodal language used in
expressing humor. The dataset and accompanying studies present a framework for
multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research.
Speaker identification from glottal information in linguistic units
In this study we aim to obtain glottal information that characterises speakers according to their voice quality. The first step is to obtain the glottal pulse from the speech signal, already segmented into phonemes; to this end we explain and choose among different algorithms with the criterion of making our system as robust and stable as possible. For this task we draw on contributions from various researchers. Estimating the glottal pulse from spoken language involves a series of complex algorithms which, if not carried out properly, will degrade the extraction of glottal features.
We then extract certain parameters from the glottal signal following the LF (Liljencrants-Fant) model. The next step consists of an inter/intra-speaker variability analysis with 200 speakers from the TIMIT database, in order to propose a vector containing the mean and variance of the most suitable LF parameters.
Finally, with the GMM-UBM-MAP scoring engine applied to the 630 speakers of TIMIT, we can quantify whether these glottal features sufficiently distinguish speakers from one another.
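The GMM-UBM-MAP scoring mentioned above follows a standard recipe: fit a universal background model (UBM) on pooled data, MAP-adapt its component means toward one speaker's data, and score test frames with the log-likelihood ratio between the two models. The sketch below uses random stand-in feature vectors in place of LF-parameter statistics, and a mean-only MAP adaptation with an illustrative relevance factor.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Stand-in feature vectors (e.g. per-phoneme LF-parameter statistics).
background = rng.normal(0.0, 1.0, size=(500, 4))
speaker = rng.normal(0.7, 1.0, size=(80, 4))

# 1) Universal background model (UBM).
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(background)

# 2) MAP adaptation of the UBM means toward the speaker's data.
resp = ubm.predict_proba(speaker)                 # (N, K) responsibilities
n_k = resp.sum(axis=0) + 1e-9
speaker_means = (resp.T @ speaker) / n_k[:, None]
r = 16.0                                          # relevance factor (typical value)
alpha = (n_k / (n_k + r))[:, None]

adapted = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
adapted.weights_ = ubm.weights_                   # weights/covariances kept from UBM
adapted.covariances_ = ubm.covariances_
adapted.precisions_cholesky_ = ubm.precisions_cholesky_
adapted.means_ = alpha * speaker_means + (1 - alpha) * ubm.means_

# 3) Log-likelihood-ratio score: positive values favour the claimed speaker.
test = rng.normal(0.7, 1.0, size=(40, 4))
score = adapted.score(test) - ubm.score(test)
```

A per-speaker threshold on this score would then decide acceptance; in the study, discriminability across TIMIT speakers is what is being quantified.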
Models and Analysis of Vocal Emissions for Biomedical Applications
The Models and Analysis of Vocal Emissions with Biomedical Applications (MAVEBA) workshop was established in 1999 out of the strongly felt need to share know-how, objectives and results between areas that until then had seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial topics have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years, always in Firenze, Italy.
Memory Fusion Network for Multi-view Sequential Learning
Multi-view sequential learning is a fundamental problem in machine learning
dealing with multi-view sequences. In a multi-view sequence, there exist two
forms of interactions between different views: view-specific interactions and
cross-view interactions. In this paper, we present a new neural architecture
for multi-view sequential learning called the Memory Fusion Network (MFN) that
explicitly accounts for both interactions in a neural architecture and
continuously models them through time. The first component of the MFN is called
the System of LSTMs, where view-specific interactions are learned in isolation
through assigning an LSTM function to each view. The cross-view interactions
are then identified using a special attention mechanism called the Delta-memory
Attention Network (DMAN) and summarized through time with a Multi-view Gated
Memory. Through extensive experimentation, MFN is compared to various proposed
approaches for multi-view sequential learning on multiple publicly available
benchmark datasets. MFN outperforms all the existing multi-view approaches.
Furthermore, MFN outperforms all current state-of-the-art models, setting new
state-of-the-art results for these multi-view datasets. (AAAI 2018 oral presentation)
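The architecture described above, view-specific recurrences plus a cross-view attention feeding a gated memory, can be sketched in a few lines. The toy version below uses plain tanh recurrences in place of LSTMs and a single softmax attention over two consecutive hidden states as a stand-in for the Delta-memory Attention Network; all dimensions and weight initialisations are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

T, d = 5, 4                                        # sequence length, hidden size
views = [rng.normal(size=(T, d)) for _ in range(2)]  # two toy input views

# View-specific recurrences (stand-ins for the System of LSTMs).
W = [rng.normal(scale=0.3, size=(d, d)) for _ in range(2)]
U = [rng.normal(scale=0.3, size=(d, d)) for _ in range(2)]

# Toy delta-attention and gated multi-view memory parameters.
Wa = rng.normal(scale=0.3, size=(4 * d, 4 * d))    # attention over [h_prev; h_now]
Wg = rng.normal(scale=0.3, size=(4 * d, d))        # memory gate
Wu = rng.normal(scale=0.3, size=(4 * d, d))        # memory update proposal

h = [np.zeros(d), np.zeros(d)]
memory = np.zeros(d)
for t in range(T):
    h_prev = np.concatenate(h)
    # View-specific interactions: each view updates its own hidden state.
    h = [np.tanh(views[v][t] @ W[v] + h[v] @ U[v]) for v in range(2)]
    # Cross-view interactions over two consecutive time steps (delta memory).
    delta = np.concatenate([h_prev, np.concatenate(h)])
    attn = np.exp(delta @ Wa)
    attn /= attn.sum()                             # softmax attention weights
    attended = attn * delta
    # Gated multi-view memory, carried through time.
    gate = sigmoid(attended @ Wg)
    memory = (1 - gate) * memory + gate * np.tanh(attended @ Wu)
```

After the loop, `memory` (together with the final view states) would feed a task-specific prediction head.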