
    Continuous Interaction with a Virtual Human

    Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking – and modify its communicative behavior on the fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that have been released for public access.

    Speech-based recognition of self-reported and observed emotion in a dimensional space

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and by examining how differences between these two types of ratings affect the development and performance of automatic emotion recognizers. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus, which contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings, which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a two-dimensional arousal-valence space. The results of these recognizers show that self-reported emotion is much harder to recognize than observed emotion, and that averaging ratings from multiple observers improves performance.
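A minimal sketch of the recognition setup described above, assuming scikit-learn and synthetic stand-in features (the TNO-Gaming Corpus, its feature extraction, and the exact rating scales are not reproduced here): one Support Vector Regression model per emotion dimension, whose outputs together form a point in the arousal-valence plane.

```python
# Illustrative sketch, not the authors' code: SVR regressors for arousal and
# valence over placeholder acoustic/textual features.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))          # stand-in acoustic + textual features per utterance
y_arousal = rng.uniform(-1, 1, 200)     # ratings on a continuous arousal scale (synthetic)
y_valence = rng.uniform(-1, 1, 200)     # ratings on a continuous valence scale (synthetic)

# One regressor per dimension; their outputs jointly give a point in the 2-D space.
arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
arousal_model.fit(X[:150], y_arousal[:150])
valence_model.fit(X[:150], y_valence[:150])

pred = np.column_stack([arousal_model.predict(X[150:]),
                        valence_model.predict(X[150:])])
print("MAE arousal:", mean_absolute_error(y_arousal[150:], pred[:, 0]))
print("MAE valence:", mean_absolute_error(y_valence[150:], pred[:, 1]))
```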

    Characterizing and recognizing spoken corrections in human-computer dialog

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 103-106). Miscommunication in human-computer spoken language systems is unavoidable. Recognition failures on the part of the system necessitate frequent correction attempts by the user. Unfortunately and counterintuitively, users' attempts to speak more clearly in the face of recognition errors actually lead to decreased recognition accuracy. The difficulty of correcting these errors, in turn, leads to user frustration and poor assessments of system quality. Most current approaches to identifying corrections rely on detecting violations of task or belief models, an approach that is ineffective where such constraints are weak and recognition results are inaccurate or unavailable. In contrast, the approach pursued in this thesis uses the acoustic contrasts between original inputs and repeat corrections to identify corrections in a more content- and context-independent fashion. This thesis quantifies and builds upon the observation that suprasegmental features, such as duration, pause, and pitch, play a crucial role in distinguishing corrections from other forms of input to spoken language systems. These features can also be used to identify spoken corrections and explain reductions in recognition accuracy for these utterances. By providing a detailed characterization of acoustic-prosodic changes in corrections relative to original inputs in a voice-only system, this thesis contributes to natural language processing and spoken language understanding. We present a treatment of systematic acoustic variability in speech recognizer input as a source of new information for interpreting the speaker's corrective intent, rather than simply as noise or user error. We demonstrate the application of a machine-learning technique, decision trees, for identifying spoken corrections and achieve accuracy rates close to human levels of performance for corrections of misrecognition errors, using acoustic-prosodic information. This process is simple and local and depends neither on perfect transcription of the recognition string nor on complex reasoning based on the full conversation. We further extend the conventional analysis of speaking styles beyond a 'read' versus 'conversational' contrast to extreme clear speech, describing divergence from phonological and durational models for words in this style. By Gina-Anne Levow. Ph.D.
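To make the decision-tree idea concrete, here is a hedged sketch using scikit-learn over synthetic prosodic-contrast features (changes in duration, pause, and pitch between an original utterance and its repetition); the thesis' actual feature set, data, and labels are not reproduced.

```python
# Sketch of the general technique: classify repeated inputs as corrections
# vs. non-corrections from acoustic-prosodic contrast features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300
# Each row: [duration change (s), pause change (s), pitch-range change (Hz)]
# between a repeated utterance and the original input (all synthetic).
X = rng.normal(size=(n, 3))
# Label 1 = spoken correction, 0 = other repeated input (synthetic labels).
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```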

    Recognizing emotions in spoken dialogue with acoustic and lexical cues

    Automatic emotion recognition has long been a focus of Affective Computing. It has become increasingly apparent that awareness of human emotions in Human-Computer Interaction (HCI) is crucial for advancing related technologies, such as dialogue systems. However, the performance of current automatic emotion recognition is disappointing compared to human performance. Current research on emotion recognition in spoken dialogue focuses on identifying better feature representations and recognition models from a data-driven point of view. The goal of this thesis is to explore how incorporating prior knowledge of human emotion recognition into the automatic model can improve state-of-the-art performance of automatic emotion recognition in spoken dialogue. Specifically, we study this by proposing knowledge-inspired features representing occurrences of disfluency and non-verbal vocalisation in speech, and by building a multimodal recognition model that combines acoustic and lexical features in a knowledge-inspired hierarchical structure. In our study, emotions are represented with the Arousal, Expectancy, Power, and Valence emotion dimensions. We build unimodal and multimodal emotion recognition models to study the proposed features and modelling approach, and perform emotion recognition on both spontaneous and acted dialogue. Psycholinguistic studies have suggested that DISfluency and Non-verbal Vocalisation (DIS-NV) in dialogue are related to emotions. However, these affective cues in spoken dialogue are overlooked by current automatic emotion recognition research. Thus, we propose features for recognizing emotions in spoken dialogue which describe five types of DIS-NV in utterances, namely filled pause, filler, stutter, laughter, and audible breath. Our experiments show that this small set of features is predictive of emotions. Our DIS-NV features achieve better performance than benchmark acoustic and lexical features for recognizing all emotion dimensions in spontaneous dialogue. Consistent with psycholinguistic studies, the DIS-NV features are especially predictive of the Expectancy dimension of emotion, which relates to speaker uncertainty. Our study illustrates the relationship between DIS-NVs and emotions in dialogue, which contributes to the psycholinguistic understanding of them as well. Note that our DIS-NV features are based on manual annotations, yet our long-term goal is to apply our emotion recognition model to HCI systems. Thus, we conduct preliminary experiments on automatic detection of DIS-NVs, and on using automatically detected DIS-NV features for emotion recognition. Our results show that DIS-NVs can be automatically detected from speech with stable accuracy, and that auto-detected DIS-NV features remain predictive of emotions in spontaneous dialogue. This suggests that our emotion recognition model can be applied to a fully automatic system in the future, and holds the potential to improve the quality of emotional interaction in current HCI systems. To study the robustness of the DIS-NV features, we conduct cross-corpora experiments on both spontaneous and acted dialogue. We identify how dialogue type influences the performance of DIS-NV features and emotion recognition models. DIS-NVs contain additional information beyond acoustic characteristics or lexical content. Thus, we study the gain of modality fusion for emotion recognition with the DIS-NV features.
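As an illustration of the DIS-NV features just described, the sketch below computes, for one utterance, the fraction of tokens labelled with each of the five cue types. The token and annotation format (word/label pairs) is an assumption for illustration, not the thesis' actual annotation schema.

```python
# Hedged sketch of DIS-NV-style features: per-utterance proportion of tokens
# carrying each of the five cue labels named above.
from collections import Counter

DIS_NV_TYPES = ["filled_pause", "filler", "stutter", "laughter", "audible_breath"]

def disnv_features(tokens):
    """tokens: list of (word, disnv_label_or_None) pairs for one utterance."""
    counts = Counter(label for _, label in tokens if label in DIS_NV_TYPES)
    total = max(len(tokens), 1)
    # One feature per DIS-NV type: fraction of tokens carrying that label.
    return [counts[t] / total for t in DIS_NV_TYPES]

utterance = [("well", "filler"), ("uh", "filled_pause"), ("I", None),
             ("I", "stutter"), ("think", None), ("so", None)]
print(dict(zip(DIS_NV_TYPES, disnv_features(utterance))))
```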
Previous work combines different feature sets by fusing modalities at the same level using two types of fusion strategies: Feature-Level (FL) fusion, which concatenates feature sets before recognition; and Decision-Level (DL) fusion, which makes the final decision based on the outputs of all unimodal models. However, features from different modalities may describe data at different time scales or levels of abstraction. Moreover, Cognitive Science research indicates that when perceiving emotions, humans make use of information from different modalities at different cognitive levels and time steps. Therefore, we propose a HierarchicaL (HL) fusion strategy for multimodal emotion recognition, which places features that describe data at a longer time interval, or that are more abstract, at higher levels of its knowledge-inspired hierarchy. Compared to FL and DL fusion, HL fusion incorporates both inter- and intra-modality differences. Our experiments show that HL fusion consistently outperforms FL and DL fusion on multimodal emotion recognition in both spontaneous and acted dialogue. The HL model combining our DIS-NV features with benchmark acoustic and lexical features improves current performance of multimodal emotion recognition in spoken dialogue. To study how other emotion-related tasks of spoken dialogue can benefit from the proposed approaches, we apply the DIS-NV features and the HL fusion strategy to recognize movie-induced emotions. Our experiments show that although designed for recognizing emotions in spoken dialogue, the DIS-NV features and HL fusion remain effective for recognizing movie-induced emotions. This suggests that other emotion-related tasks can also benefit from the proposed features and model structure.
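The contrast between the three fusion strategies can be sketched as follows, using generic scikit-learn regressors and synthetic features; the particular assignment of modalities to hierarchy levels is an illustrative assumption rather than the thesis' exact architecture.

```python
# Minimal sketch of FL, DL, and HL fusion for one emotion dimension.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
n = 200
X_acoustic = rng.normal(size=(n, 20))   # stand-in frame-level acoustic statistics
X_lexical  = rng.normal(size=(n, 10))   # stand-in lexical / DIS-NV features
y = rng.uniform(-1, 1, n)               # one emotion dimension, e.g. Arousal

# Feature-Level (FL) fusion: concatenate all features, train a single model.
fl_model = SVR().fit(np.hstack([X_acoustic, X_lexical]), y)

# Decision-Level (DL) fusion: independent unimodal models, average their outputs.
m_acoustic = SVR().fit(X_acoustic, y)
m_lexical = SVR().fit(X_lexical, y)
dl_pred = (m_acoustic.predict(X_acoustic) + m_lexical.predict(X_lexical)) / 2

# Hierarchical (HL) fusion: the lower-level (acoustic) prediction is fed, together
# with the more abstract lexical features, into a higher-level model.
lower_pred = m_acoustic.predict(X_acoustic).reshape(-1, 1)
hl_model = SVR().fit(np.hstack([lower_pred, X_lexical]), y)

print("FL:", fl_model.predict(np.hstack([X_acoustic, X_lexical]))[:3])
print("DL:", dl_pred[:3])
print("HL:", hl_model.predict(np.hstack([lower_pred, X_lexical]))[:3])
```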

    Combining representations for improved sketch recognition

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 89-96). Sketching is a common means of conveying, representing, and preserving information, and it has become a subject of research as a method for human-computer interaction, specifically in the area of computer-aided design. Digitally collected sketches contain both spatial and temporal information; additionally, they may contain a conceptual structure of shapes and subshapes. These multiple aspects suggest several ways of representing sketches, each with advantages and disadvantages for recognition. Most existing sketch recognition systems are based on a single representation and do not use all available information. We propose combining several representations and systems as a way to improve recognition accuracy. This thesis presents two methods for combining recognition systems. The first improves recognition by improving segmentation, while the second seeks to predict how well systems will recognize a given domain or symbol and combines their outputs accordingly. We show that combining several recognition systems based on different representations can improve the accuracy of existing recognition methods. By Sonya J. Cates. Ph.D.
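A minimal sketch of the second combination method described above, i.e. weighting each recognizer's output by an estimate of how well it handles the symbol in question; the recognizers, symbol labels, and reliability estimates are hypothetical.

```python
# Reliability-weighted combination of several sketch recognizers' outputs.
def combine(recognizer_outputs, reliability):
    """
    recognizer_outputs: list of dicts mapping symbol label -> confidence,
                        one dict per recognition system.
    reliability: list of dicts (one per system) mapping symbol label ->
                 expected accuracy for that symbol.
    Returns the label with the highest reliability-weighted confidence.
    """
    scores = {}
    for system, output in enumerate(recognizer_outputs):
        for label, conf in output.items():
            weight = reliability[system].get(label, 0.5)  # default: neutral weight
            scores[label] = scores.get(label, 0.0) + weight * conf
    return max(scores, key=scores.get)

outputs = [{"resistor": 0.7, "capacitor": 0.3}, {"resistor": 0.4, "capacitor": 0.6}]
weights = [{"resistor": 0.9, "capacitor": 0.5}, {"resistor": 0.6, "capacitor": 0.8}]
print(combine(outputs, weights))  # -> "resistor"
```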

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn-distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour. Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
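A minimal sketch of the TAMA idea under stated assumptions: a prosodic feature (here, pitch) is averaged over overlapping, time-aligned frames for each speaker, and the two smoothed series are then correlated as a rough proxy for accommodation. The frame length, step size, and correlation measure are illustrative choices, not the thesis' exact parameters.

```python
# Time Aligned Moving Average (TAMA)-style smoothing and cross-speaker correlation.
import numpy as np

def tama(times, values, frame_len=20.0, step=10.0, duration=300.0):
    """Average (time, value) samples for one speaker over overlapping frames
    of frame_len seconds, advanced by step seconds, time-aligned to t=0."""
    starts = np.arange(0.0, duration - frame_len + step, step)
    series = []
    for s in starts:
        mask = (times >= s) & (times < s + frame_len)
        series.append(values[mask].mean() if mask.any() else np.nan)
    return np.array(series)

rng = np.random.default_rng(3)
# Synthetic pitch samples (time, Hz) for two speakers over a 300 s dialogue.
t_a, t_b = np.sort(rng.uniform(0, 300, 400)), np.sort(rng.uniform(0, 300, 400))
pitch_a, pitch_b = rng.normal(120, 15, 400), rng.normal(200, 20, 400)

series_a, series_b = tama(t_a, pitch_a), tama(t_b, pitch_b)
valid = ~np.isnan(series_a) & ~np.isnan(series_b)
print("frame-wise pitch correlation:",
      np.corrcoef(series_a[valid], series_b[valid])[0, 1])
```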

    Students’ Major Problems in Learning Speaking Skill at Jimma Teachers' College

    Speaking English was a challenge for most Ethiopian students. This study was designed to investigate students' major problems in learning speaking skills. All students attending the second year of the English language department in the 0/2005 academic year and all instructors of Jimma Teachers' College were taken as the subjects of the study. The required data for the study were collected from the subjects through interviews and a questionnaire. Moreover, classroom observation was employed as a supplementary instrument. The results of the interviews were used to crosscheck the students' responses to the questionnaires, while the results of the observation were used to confirm how learning and teaching of speaking skills were practised in the actual classroom. The collected data were organized and presented in tables and paragraphs. The obtained data showed that students had problems in learning speaking skills. Therefore, the study showed how to set up speaking activities that could lead students to participate and to interact with one another. It gave realistic solutions to overcome students' major problems in learning speaking skills. It indicated how students can manage their problems by applying communication strategies and using the elements of speaking that are embedded in speaking skill. Finally, the overall results of all the instruments were triangulated to give conclusions and recommendations. Jimma University

    Unfamiliar facial identity registration and recognition performance enhancement

    The work in this thesis aims at studying the problems related to the robustness of a face recognition system, with specific attention given to the issues of handling image variation complexity and the inherently limited Unique Characteristic Information (UCI) within the scope of unfamiliar identity recognition. These issues form the main themes in developing a mutual understanding of extraction and classification strategies, and are carried out as two interdependent but related blocks of research work. Naturally, the complexity of the image variation problem is built up from factors including the viewing geometry, illumination, occlusion and other kinds of intrinsic and extrinsic image variation. Ideally, recognition performance increases whenever the variation is reduced and/or the UCI is increased. However, variation reduction on 2D facial images may result in the loss of important clues or UCI data for a particular face; alternatively, increasing the UCI may also increase the image variation. To reduce the loss of information while reducing or compensating for the variation complexity, a hybrid technique is proposed in this thesis. The technique is derived from three conventional approaches to the variation compensation and feature extraction tasks. In the first research block, transformation, modelling and compensation approaches are combined to deal with the variation complexity. The ultimate aim of this combination is to represent (transformation) the UCI without losing the important features (modelling) and to discard or reduce (compensation) the level of the variation complexity of a given face image. Experimental results have shown that discarding certain obvious variation enhances the desired information rather than losing the UCI of interest. The modelling and compensation stages benefit both variation reduction and UCI enhancement. Colour, grey-level and edge image information are used to manipulate the UCI, involving analysis of skin colour, facial texture and feature measurements respectively. The Derivative Linear Binary Transformation (DLBT) technique is proposed for consistency of the feature measurements. Prior knowledge of the input image's symmetrical properties, its informative regions and the consistency of some features is fully utilized in preserving the UCI feature information. As a result, the similarity and dissimilarity representations for identity parameters or classes are obtained from the selected UCI representation, which involves derivative feature size and distance measurement, facial texture and skin colour. These are mainly used to accommodate the strategy of unfamiliar identity classification in the second block of the research work. Since all faces share a similar structure, a classification technique should be able to increase the similarities within a class while increasing the dissimilarity between classes. Furthermore, a smaller class results in less burden on the identification or recognition processes. The collateral classification strategy of identity representation introduced in this thesis manipulates the availability of collateral UCI to classify the identity parameters of regional appearance, gender and age classes. In this regard, the registration of collateral UCIs has been made in such a way as to collect more identity information. As a result, the performance of unfamiliar identity recognition is positively improved with respect to the special UCI for class recognition, and possibly with the small size of the class. The experiments used data from our own database and an open database comprising three different regional appearances, two different age groups and two different genders, incorporating pose and illumination image variations.

    The perception of English word-final /l/ by Brazilian learners

    Dissertation (Master's) - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente. Very little research exists on Brazilians concerning English word-final /l/ beyond noting that they generally produce [u] (Baptista, 2001) or [w] (Avery & Ehrlich, 1992). Perception of this word-final consonant is also little researched. To address these gaps in the literature, this study investigated Brazilian ESL students' perception of English word-final /l/ (dark /l/). Two groups of 20 Brazilian learners of English (intermediate and advanced) and one group of native speakers of English participated in the experiment. Three pairs of tests - two Categorial Discrimination Tests, two discrimination tasks, and two identification tests - examined perception of word-final /l/. The first test of each pair assessed word-final contrasts in both Portuguese and English; the second examined English-only contrasts. All results were analyzed by overall error rate, error rate per vowel context and error rate per test. Demographic data and total error rate were explored for correlations. No significant differences were found between the two groups of Brazilian students. Only for the vowel contexts /o/ and /?/ did native speakers perform significantly better than Brazilians. Native and non-native error rates were very low for the vowel contexts /a?/ and /e?/ and quite high for /a?/.