65 research outputs found

    Continuous Emotion Prediction from Speech: Modelling Ambiguity in Emotion

    Full text link
    There is growing interest in emotion research in modelling perceived emotion labelled as intensities along affect dimensions such as arousal and valence. These labels are typically obtained from multiple annotators, each with their own individualistic perception of emotional speech. Consequently, emotion prediction models that incorporate variation in individual perceptions as ambiguity in the emotional state would be more realistic. This thesis develops the modelling framework necessary to achieve continuous prediction of ambiguous emotional states from speech. Besides emotion labels, feature space distribution and encoding are an integral part of the prediction system. The first part of this thesis examines the limitations of current low-level feature distributions and their minimalistic statistical descriptions. Specifically, front-end paralinguistic acoustic features are reflective of speech production mechanisms, whereas discriminatively learnt features have frequently outperformed acoustic features in emotion prediction tasks while providing no insight into their physical significance. One of the contributions of this thesis is the development of a framework that can modify the acoustic feature representation based on emotion label information. Another investigation in this thesis indicates that emotion perception is language-dependent, which in turn helped develop a framework for cross-language emotion prediction. Furthermore, this investigation supported the hypothesis that emotion perception is highly individualistic and is better modelled as a distribution rather than a point estimate, to encode information about the ambiguity in the perceived emotion. Following this observation, the thesis proposes measures to quantify the appropriateness of distribution types in modelling ambiguity in dimensional emotion labels, which are then employed to compare well-known bounded parametric distributions. These analyses led to the conclusion that the beta distribution is the most appropriate parametric model of ambiguity in emotion labels. Finally, the thesis focuses on developing a deep learning framework for continuous emotion prediction as a temporal series of beta distributions, examining various parameterizations of the beta distribution as well as loss functions. Furthermore, distributions over the parameter space are examined, and priors from kernel density estimation are employed to shape the posteriors over the parameter space, which significantly improved valence ambiguity predictions. The proposed frameworks and methods have been extensively evaluated on multiple state-of-the-art databases, and the results demonstrate both the viability of predicting ambiguous emotion states and the validity of the proposed systems.
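
    A minimal sketch of the core idea, assuming per-instance valence ratings from several annotators on a [-1, 1] scale: the ratings are rescaled to the unit interval and a beta distribution is fitted (here by method of moments) so that its spread encodes the ambiguity of the perceived emotion. The data and the fitting choice are illustrative assumptions, not the thesis's exact pipeline.

```python
# Sketch: modelling the spread of annotator ratings as a beta distribution
# (illustrative only; not the thesis's exact pipeline).
import numpy as np
from scipy import stats

ratings = np.array([-0.2, 0.1, 0.3, -0.1, 0.4])   # hypothetical annotator values on [-1, 1]
x = (ratings + 1.0) / 2.0                          # rescale to the (0, 1) support of the beta

# Method-of-moments estimate of the beta parameters from the rating spread.
m, v = x.mean(), x.var(ddof=0)
common = m * (1.0 - m) / v - 1.0
alpha, beta = m * common, (1.0 - m) * common

# The fitted density encodes ambiguity: a flat beta means high disagreement,
# a peaked one means annotators largely agree.
nll = -stats.beta.logpdf(x, alpha, beta).sum()     # fit quality as negative log-likelihood
print(f"alpha={alpha:.2f}, beta={beta:.2f}, NLL={nll:.2f}")
```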

    Speech-based recognition of self-reported and observed emotion in a dimensional space

    Get PDF
    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and examining how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus, which contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and which includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings, which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that self-reported emotion is much harder to recognize than observed emotion, and that averaging ratings from multiple observers improves performance.
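
    As an illustration of this setup, here is a minimal sketch assuming pre-computed acoustic/textual feature vectors and averaged observer ratings for arousal and valence; the data below are synthetic stand-ins, and scikit-learn's SVR is used as one reasonable Support Vector Regression implementation rather than the paper's exact configuration.

```python
# Sketch: predicting points in the arousal-valence space with Support Vector
# Regression (hypothetical features and labels; the corpus is not public here).
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))            # stand-in acoustic/textual feature vectors
y = rng.uniform(-1, 1, size=(200, 2))     # columns: arousal, valence (averaged observer ratings)

# One SVR per affect dimension; averaging several observers' ratings into y is
# what the abstract reports as improving performance.
model = MultiOutputRegressor(make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)))
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])             # predicted (arousal, valence) points
print(pred[:3])
```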

    Emotion-aware cross-modal domain adaptation in video sequences

    Get PDF

    Bag-of-words representations for computer audition

    Get PDF
    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning that takes the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static and fixed-length input. Moreover, even for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which forms a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and that their fusion improves the performance of a machine listening system.
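
    To make the representation concrete, here is a minimal sketch of the BoAW idea, assuming frame-level LLD matrices per utterance: a k-means codebook is learned over all frames and each utterance becomes a normalised histogram of codeword counts. The thesis itself uses the openXBOW toolkit; the scikit-learn code below is only an illustration of the principle, with made-up data and codebook size.

```python
# Sketch of the bag-of-audio-words idea: quantise frame-level LLDs against a
# learned codebook and represent each utterance as a histogram of codeword counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in LLD matrices: one (n_frames x n_lld) array per utterance.
utterances = [rng.normal(size=(rng.integers(80, 120), 13)) for _ in range(20)]

codebook_size = 64
kmeans = KMeans(n_clusters=codebook_size, n_init=5, random_state=1)
kmeans.fit(np.vstack(utterances))            # learn the "audio words" from all frames

def boaw(frames: np.ndarray) -> np.ndarray:
    """Histogram of codeword assignments, normalised to a fixed-length vector."""
    words = kmeans.predict(frames)
    hist = np.bincount(words, minlength=codebook_size).astype(float)
    return hist / hist.sum()                 # term-frequency normalisation

X = np.stack([boaw(u) for u in utterances])  # fixed-length inputs for any static classifier
print(X.shape)                               # (20, 64)
```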

    Development of human-robot interaction based on multimodal emotion recognition

    Get PDF
    The electronic version of this thesis does not include the publications. Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing. Its main applications are in human-computer interaction. The systems developed for this purpose consider combinations of different modalities based on vocal and visual cues. This thesis takes both modalities into account in order to develop an automatic multimodal emotion recognition system. More specifically, it takes advantage of the information extracted from speech and face signals. From speech signals, Mel-frequency cepstral coefficients, filter-bank energies and prosodic features are extracted. Moreover, two different strategies are considered for analyzing the facial data. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames, which are fed to a convolutional neural network to visually discriminate between the emotions. Afterward, the output confidence values of all the classifiers from both modalities (one acoustic, two visual) are used to define a new feature space, and these values are learned for the final emotion label prediction in a late fusion. The experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05 and RML datasets. The results show significant performance improvements by the proposed system in comparison to the existing alternatives, defining the current state of the art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature. The aim of this review is to help identify possible future research directions for enhancing the performance of the proposed system. More specifically, we suggest that incorporating data representing gestures, which constitute another major component of the visual modality, can result in a more effective framework.
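
    A minimal sketch of the late-fusion stage described above, assuming per-class confidence scores from one acoustic and two visual classifiers; the data, the class count and the choice of logistic regression as the final-stage learner are illustrative assumptions, not the thesis's exact configuration.

```python
# Sketch of late fusion: concatenate the confidence outputs of the three base
# classifiers into a new feature space and train a final classifier on it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_samples, n_classes = 300, 6                 # e.g. six emotion categories (assumed)

# Hypothetical per-class confidence scores from the three base classifiers.
acoustic_conf = rng.dirichlet(np.ones(n_classes), size=n_samples)
landmark_conf = rng.dirichlet(np.ones(n_classes), size=n_samples)
keyframe_cnn_conf = rng.dirichlet(np.ones(n_classes), size=n_samples)
labels = rng.integers(0, n_classes, size=n_samples)

# New feature space: concatenated confidences (3 * n_classes dimensions).
fused = np.hstack([acoustic_conf, landmark_conf, keyframe_cnn_conf])

fusion_clf = LogisticRegression(max_iter=1000)
fusion_clf.fit(fused[:250], labels[:250])
print(fusion_clf.predict(fused[250:255]))     # final emotion label predictions
```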

    Deep Learning applied to Visual Speech Recognition

    Get PDF
    Visual Speech Recognition (VSR), or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DB), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech based solely on the visual recognition of the speaker’s lip movements. However, producing a large DB requires large resources, unavailable to the majority of ALR researchers, impairing larger-scale evolution. This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of state-of-the-art (SOTA) performance. As DL only shows SOTA performance if trained on a large DB, whose production is beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective. A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After being pre-trained on the SOTA DB, the new model is fine-tuned on the new DB. For LusaPt’s validation, the performance of the new model and of the SOTA model are compared. Results reveal that, if the same video is recurrently subjected to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model without further training on the new DB, to over 95% when inferring with the new model. Besides showing a “powerful belief” of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that the transfer learning process is efficient in learning a new language and therefore new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
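
    As a small illustration of the evaluation metric used above, here is a sketch of the word recognition rate (WRR) on a handful of hypothetical LusaPt-style utterances; the word lists and model outputs are made up purely to mirror the 0% versus over-95% contrast reported in the dissertation.

```python
# Sketch: the word recognition rate (WRR) used to compare the pre-trained SOTA
# model with the model fine-tuned on LusaPt (hypothetical predictions/labels).
def word_recognition_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of utterances whose predicted word matches the reference."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical outputs for five LusaPt-style utterances.
references = ["um", "dois", "trĂȘs", "quatro", "cinco"]
sota_only  = ["ah", "does", "tree", "quarter", "sink"]   # no fine-tuning: WRR of 0
fine_tuned = ["um", "dois", "trĂȘs", "quatro", "cinco"]   # after fine-tuning: WRR of 1

print(word_recognition_rate(sota_only, references))      # 0.0
print(word_recognition_rate(fine_tuned, references))     # 1.0
```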
    • 
