
    Machine learning for automatic analysis of affective behaviour

    The automated analysis of affect has gained rapidly increasing attention from researchers over the past two decades, as it constitutes a fundamental step towards achieving next-generation computing technologies and integrating them into everyday life (e.g. via affect-aware, user-adaptive interfaces, medical imaging, health assessment, ambient intelligence etc.). The work presented in this thesis focuses on several fundamental problems that manifest on the path towards reliable, accurate and robust affect sensing systems. In more detail, the motivation behind this work lies in recent developments in the field, namely (i) the creation of large, audiovisual databases for affect analysis in the so-called "Big Data" era, along with (ii) the need to deploy systems under demanding, real-world conditions. These developments led to the requirement to analyse emotion expressions continuously in time, instead of merely processing static images, thus unveiling to researchers the wide range of temporal dynamics related to human behaviour. The latter entails another deviation from the traditional line of research in the field: instead of focusing on predicting posed, discrete basic emotions (happiness, surprise etc.), it became necessary to focus on spontaneous, naturalistic expressions captured under settings closer to real-world conditions, utilising more expressive emotion descriptions than a set of discrete labels. To this end, the main motivation of this thesis is to address the challenges arising from the adoption of continuous dimensional emotion descriptions under naturalistic scenarios, which are considered to capture a much wider spectrum of expressive variability than basic emotions and, most importantly, to model emotional states that humans commonly express in everyday life.

    In the first part of this thesis, we attempt to demystify the largely unexplored problem of predicting continuous emotional dimensions. This work is amongst the first to explore the prediction of emotion dimensions via multi-modal fusion, utilising facial expressions, auditory cues and shoulder gestures. A major contribution of this thesis lies in proposing the utilisation of the various relationships exhibited by emotion dimensions in order to improve the prediction accuracy of machine learning methods, an idea which has since been taken up by other researchers in the field. To evaluate this experimentally, we extend methods such as Long Short-Term Memory neural networks (LSTM), the Relevance Vector Machine (RVM) and Canonical Correlation Analysis (CCA) to exploit output relationships in learning. As shown, this increases the accuracy of machine learning models applied to this task.

    The annotation of continuous dimensional emotions is a tedious task, highly prone to the influence of various types of noise. Performed in real time by several annotators (usually experts), the annotation process can be heavily biased by factors such as subjective interpretations of the emotional states observed, the inherent ambiguity of labels related to human behaviour, and the varying reaction lags exhibited by each annotator, as well as by input device noise and annotation errors. In effect, the annotations manifest a strong spatio-temporal, annotator-specific bias. Failing to deal properly with annotation bias and noise leads to an inaccurate ground truth, and therefore to ill-generalisable machine learning models. This makes the proper fusion of multiple annotations, and the inference of a clean, corrected version of the "ground truth", one of the most significant challenges in the area. A highly important contribution of this thesis lies in the introduction of Dynamic Probabilistic Canonical Correlation Analysis (DPCCA), a method aimed at fusing noisy continuous annotations. By adopting a private-shared space model, we isolate the individual characteristics that are annotator-specific and not shared, while, most importantly, we model the common, underlying annotation which is shared by all annotators (i.e., the derived ground truth). By further learning temporal dynamics and incorporating a time-warping process, we are able to derive a clean version of the ground truth given multiple annotations, eliminating temporal discrepancies and other nuisances.

    The integration of the temporal alignment process within the proposed private-shared space model makes DPCCA suitable for the problem of temporally aligning human behaviour; that is, given temporally unsynchronised sequences (e.g., videos of two persons smiling), the goal is to generate temporally synchronised sequences (e.g., such that the smile apices co-occur in the videos). Temporal alignment is an important problem for many applications where multiple datasets need to be aligned in time. It is particularly suitable for the analysis of facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. A highly challenging scenario arises when the observations are perturbed by gross, non-Gaussian noise (e.g., occlusions), as is often the case when analysing data acquired under real-world conditions. To account for non-Gaussian noise, a robust variant of Canonical Correlation Analysis (RCCA) for robust fusion and temporal alignment is proposed. The model captures the shared, low-rank subspace of the observations, isolating the gross noise in a sparse noise term. RCCA is amongst the first robust variants of CCA proposed in the literature and, as we show in related experiments, outperforms other state-of-the-art methods on related tasks such as the fusion of multiple modalities under gross noise.

    Beyond private-shared space models, Component Analysis (CA) is an integral component of most computer vision systems, particularly in terms of reducing the usually high-dimensional input spaces in a manner meaningful to the task at hand (e.g., prediction, clustering). A final, significant contribution of this thesis lies in proposing the first unifying framework for probabilistic component analysis. The proposed framework covers most well-known CA methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locality Preserving Projections (LPP) and Slow Feature Analysis (SFA), providing further theoretical insights into the workings of CA. Moreover, the framework is highly flexible, enabling novel CA methods to be generated simply by manipulating the connectivity of latent variables (i.e. the latent neighbourhood). As shown experimentally, methods derived via the proposed framework outperform their equivalents in several problems related to affect sensing and facial expression analysis, while providing advantages such as reduced complexity and explicit variance modelling.
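    To make the low-rank-plus-sparse idea behind RCCA concrete, a schematic objective consistent with the description above (though not necessarily the exact formulation proposed in the thesis) can be written as:

```latex
\min_{Z,\,E}\; \|Z\|_{*} + \lambda \|E\|_{1}
\quad \text{s.t.} \quad
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = Z + E,
```

    where $X_1$ and $X_2$ are observation matrices from two views (e.g., two modalities or two annotators), the nuclear norm $\|Z\|_{*}$ promotes a shared low-rank subspace across the views, the $\ell_1$ norm confines gross corruptions to the sparse term $E$, and $\lambda$ balances the two.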

    Fast speaker independent large vocabulary continuous speech recognition [online]


    A Computational Model of Auditory Feature Extraction and Sound Classification

    This thesis introduces a computer model that incorporates responses similar to those found in the cochlea, in sub-cortical auditory processing, and in auditory cortex. The principal aim of this work is to show that this can form the basis for a biologically plausible mechanism of auditory stimulus classification. We show that this classification is robust to stimulus variation and time compression. In addition, the response of the system is shown to support multiple, concurrent, behaviourally relevant classifications of natural stimuli (speech). The model incorporates transient enhancement, an ensemble of spectro-temporal filters, and a simple measure analogous to the idea of visual salience, to produce a quasi-static description of the stimulus suitable either for classification with an analogue artificial neural network or, using appropriate rate coding, with a classifier based on artificial spiking neurons. We also show that the spectro-temporal ensemble can be derived from a limited class of 'formative' stimuli, consistent with a developmental interpretation of ensemble formation. In addition, ensembles chosen on information-theoretic grounds consist of filters with relatively simple geometries, which is consistent with reports of responses in mammalian thalamus and auditory cortex. A powerful feature of this approach is that the ensemble response, from which salient auditory events are identified, amounts to a stimulus-ensemble-driven method of segmentation which respects the envelope of the stimulus, and leads to a quasi-static representation of auditory events that is suitable for spike rate coding. We also present evidence that the encoded auditory events may form the basis of a representation of similarity, or second-order isomorphism, which implies a representational space that respects similarity relationships between stimuli, including novel stimuli.
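    As a rough illustration of the filter-ensemble idea, the sketch below convolves a spectrogram with a small bank of spectro-temporal filters and reads out a simple salience-like measure per time frame. The Gabor filter shapes, sizes and the salience definition are placeholder assumptions, not the model described above.

```python
# Illustrative sketch only: a toy spectro-temporal filter ensemble.
import numpy as np
from scipy.signal import convolve2d

def gabor_st_filter(n_freq=9, n_time=9, orient=0.0, freq=0.25):
    """A 2-D Gabor patch as a stand-in spectro-temporal filter."""
    f, t = np.meshgrid(np.arange(n_freq) - n_freq // 2,
                       np.arange(n_time) - n_time // 2, indexing="ij")
    u = f * np.cos(orient) + t * np.sin(orient)
    envelope = np.exp(-(f**2 + t**2) / (2 * (n_freq / 3) ** 2))
    return envelope * np.cos(2 * np.pi * freq * u)

def ensemble_response(spectrogram, orientations=(0.0, np.pi / 4, np.pi / 2)):
    """Stack of filter responses; one channel per ensemble member."""
    return np.stack([convolve2d(spectrogram, gabor_st_filter(orient=o),
                                mode="same") for o in orientations])

def salience(responses):
    """Crude salience: summed rectified response energy per time frame."""
    return np.maximum(responses, 0.0).sum(axis=(0, 1))

# Usage with a random stand-in "spectrogram" (64 bands x 200 frames)
S = np.abs(np.random.randn(64, 200))
events = salience(ensemble_response(S))  # peaks mark candidate auditory events
```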

    An Electrode Stimulation Strategy for Cochlear Implants Based on a Model of Human Hearing

    Cochlear implants (CIs), combined with professional rehabilitation, have enabled several hundred thousand hearing-impaired individuals to re-enter the world of verbal communication. Though very successful, current CI systems seem to have reached their peak potential. The fact that most recipients claim not to enjoy listening to music and are not capable of carrying on a conversation in noisy or reverberant environments shows that there is still room for improvement.

    This dissertation presents a new cochlear implant signal processing strategy called Stimulation based on Auditory Modeling (SAM), which is completely based on a computational model of the human peripheral auditory system. SAM has been evaluated in three ways: through simplified models of CI listeners, with five cochlear implant users, and with 27 normal-hearing subjects using an acoustic model of CI perception. Results have always been compared to those acquired using the Advanced Combination Encoder (ACE), which is today's most prevalent CI strategy.

    First simulations showed that the speech intelligibility of CI users fitted with SAM should be just as good as that of CI listeners fitted with ACE. Furthermore, it was shown that SAM provides more accurate binaural cues, which can potentially enhance the sound source localization ability of bilaterally fitted implantees. Simulations also revealed an increased amount of temporal pitch information provided by SAM. The subsequent pilot study with five CI users revealed several benefits of using SAM. First, there was a significant improvement in the pitch discrimination of pure tones and sung vowels. Second, CI users fitted with a contralateral hearing aid reported a more natural sound for both speech and music. Third, all subjects became accustomed to SAM within a very short period of time (on the order of 10 to 30 minutes), which is particularly important given that a successful CI strategy change typically takes weeks to months. The additional test with the 27 normal-hearing listeners delivered further evidence of improved pitch discrimination with SAM as compared to ACE.

    Although SAM is not yet a market-ready alternative, it strives to pave the way for future strategies based on auditory models, and it is a promising candidate for further research and investigation.
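    For context on the evaluation methodology, acoustic models of CI perception for normal-hearing listeners are commonly implemented as noise vocoders. The sketch below is a generic, textbook-style stand-in, not the dissertation's model; the channel count, filter design and envelope cutoff are illustrative assumptions.

```python
# Generic noise-vocoder sketch of acoustic CI simulation (illustrative only).
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def band_sos(lo, hi, fs, order=4):
    return butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")

def noise_vocoder(x, fs, n_channels=8, f_lo=100.0, f_hi=7000.0, env_cut=50.0):
    """x: float waveform sampled at fs (with fs/2 > f_hi)."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)  # log-spaced channel edges
    env_sos = butter(2, env_cut, btype="lowpass", fs=fs, output="sos")
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = band_sos(lo, hi, fs)
        band = sosfilt(sos, x)                            # analysis band
        env = sosfiltfilt(env_sos, np.abs(band))          # slow amplitude envelope
        carrier = sosfilt(sos, np.random.randn(len(x)))   # band-limited noise
        out += np.maximum(env, 0.0) * carrier             # re-modulated channel
    return out / (np.max(np.abs(out)) + 1e-12)
```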

    Articulatory features for conversational speech recognition


    Exploration and Optimization of Noise Reduction Algorithms for Speech Recognition in Embedded Devices

    Environmental noise present in real-life applications substantially degrades the performance of speech recognition systems. An example is an in-car scenario, where a speech recognition system has to support the man-machine interface. Several sources of noise, coming from the engine, wipers, wheels, etc., interact with the speech. An open-window scenario poses a special challenge, where traffic noise, parking noise, etc., must also be considered. The main goal of this thesis is to improve the performance of a speech recognition system based on a state-of-the-art hidden Markov model (HMM) using noise reduction methods. The performance is measured with respect to word error rate and by means of mutual information. The noise reduction methods are based on weighting rules. Least-squares weighting rules in the frequency domain have been developed, both to enable continuous development based on the existing system and to guarantee low complexity and a small footprint for applications in embedded devices. The weighting-rule parameters are optimized for this multidimensional task using a Monte Carlo method followed by a compass search. Root compression and cepstral smoothing methods have also been implemented to boost recognition performance. The additional complexity and memory requirements of the proposed system are minimal. The performance of the proposed system was compared to the European Telecommunications Standards Institute (ETSI) standardized system. The proposed system outperforms the ETSI system by up to 8.6 % relative increase in word accuracy, and achieves up to 35.1 % relative increase in word accuracy compared to the existing baseline system on the ETSI Aurora 3 German task. A relative increase of up to 18 % in word accuracy over the existing baseline system is also obtained with the proposed weighting rules on large-vocabulary databases. An entropy-based feature-vector analysis method has also been developed to assess the quality of feature vectors. The entropy estimation is based on the histogram approach. The method has the advantage of objectively assessing feature-vector quality regardless of the acoustic modeling assumptions used in the speech recognition system.
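    As a rough illustration of the two ingredients named above, the sketch below pairs a generic Wiener-type spectral gain (standing in for the thesis's least-squares weighting rules, whose exact form is not given here) with a histogram-based entropy estimate for feature-vector analysis. All function names and constants are assumptions.

```python
# Minimal sketch: frequency-domain weighting + histogram-entropy analysis.
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=0.1):
    """Per-bin gain from an a-posteriori SNR estimate, with a gain floor."""
    snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    return np.maximum(snr / (1.0 + snr), floor)

def enhance_frame(noisy_spectrum, noise_psd):
    """Apply the weighting rule to one complex STFT frame."""
    gain = wiener_gain(np.abs(noisy_spectrum) ** 2, noise_psd)
    return gain * noisy_spectrum

def histogram_entropy(features, bins=32):
    """Entropy (bits) of one feature dimension via a histogram estimate."""
    counts, _ = np.histogram(features, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return -(p * np.log2(p)).sum()
```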

    Practical Speech Enhancement for Diverse Environments with Varying Noise Characteristics

    University of Tsukuba