    Recognition of Emotion from Speech: A Review

    Word learning in the first year of life

    In the first part of this thesis, we ask whether 4-month-old infants can represent objects and movements after a short exposure in such a way that they recognize either a repeated object or a repeated movement when it is presented simultaneously with a new object or a new movement. If they do, we ask whether the way they observe the visual input changes when auditory input is presented. We investigate whether infants react to the familiarization labels and to novel labels in the same manner. If the labels as well as the referents are matched for saliency, any difference should be due to processes that are not limited to sensory perception. We hypothesize that infants will, if they map words to the objects or movements, change their looking behavior depending on whether they hear a familiar label, a novel label, or no label at all.

    In the second part of this thesis, we approach the problem of word learning from a different perspective. If infants reason about possible label-referent pairs and are able to make inferences about novel pairs, are the same processes involved in all intermodal learning? We compared the task of learning to associate auditory regularities with visual stimuli (reinforcers) to the word-learning task. We hypothesized that even if infants succeed in learning more than one label during a single event, learning the intermodal connection between auditory and visual regularities might present a more demanding task for them.

    The third part of this thesis addresses the role of associative learning in word learning. In recent decades, it has repeatedly been suggested that co-occurrence probabilities can play an important role in word segmentation. However, the vast majority of studies test infants with artificial streams that do not resemble natural input: most studies use words of equal length and with unambiguous syllable sequences within words, where the only point of variability is at the word boundaries (Aslin et al., 1998; Saffran, Johnson, Aslin, & Newport, 1999; Saffran et al., 1996; Thiessen et al., 2005; Thiessen & Saffran, 2003). Even when the input is modified to resemble natural input more faithfully, the words with which infants are tested are always unambiguous – within words, each syllable predicts its adjacent syllable with a probability of 1.0 (Pelucchi, Hay, & Saffran, 2009; Thiessen et al., 2005). We therefore tested 6-month-old infants with statistically ambiguous words. Before doing so, we also verified on a large sample of languages whether statistical information in the natural input, where the majority of words are statistically ambiguous, is indeed useful for segmenting words. Our motivation was partly due to the fact that studies modeling the segmentation process with natural language input have often yielded ambivalent results about the usefulness of such computations (Batchelder, 2002; Gambell & Yang, 2006; Swingley, 2005). A minimal sketch of this transitional-probability computation follows this abstract.

    We conclude this introduction with a small remark about the term word. It will be used throughout this thesis without questioning its descriptive value: the common-sense meaning of the term word is unambiguous enough, since everyone knows what we are referring to when we say or think of the term word. However, the term word is not unambiguous at all (Di Sciullo & Williams, 1987). To mention only some of the classical examples: (1) Do jump and jumped, or go and went, count as one word or as two? This example might seem trivial, especially in languages with weak overt morphology such as English, but in some languages each basic form of a word has tens of inflected variants. (2) A similar question arises for all the words that are morphological derivations of other words, such as evict and eviction, examine and reexamine, unhappy and happily, and so on. (3) Finally, each language contains many phrases and idioms: do air conditioner and give up count as one word or two? Statistical word segmentation studies generally neglect the issue of the definition of words, assuming that phrases and idioms have strong internal statistics and will therefore be selected as one word (Cutler, 2012). But because compounds and phrases are usually composed of smaller meaningful chunks, it is unclear how infants would extract these smaller units of speech if they relied predominantly on statistical information. We address the problem of over-segmentation briefly in the third part of the thesis.
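    The sketch below illustrates, in Python, the transitional-probability (TP) computation discussed in the abstract: TPs are estimated from syllable bigram counts, and word boundaries are posited at local TP dips. The toy stream (built from the four words of Saffran et al., 1996) and the dip heuristic are illustrative assumptions, not the thesis's materials or model.

    ```python
    # Sketch of Saffran-style segmentation by syllable transitional
    # probabilities (TPs). Toy input; not the corpus used in the thesis.
    from collections import Counter
    from itertools import pairwise  # Python 3.10+

    def transitional_probabilities(syllables):
        """TP(x -> y) = count(xy) / count(x)."""
        unigrams = Counter(syllables)
        bigrams = Counter(pairwise(syllables))
        return {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}

    def segment(syllables, tps):
        """Posit a word boundary at every local TP minimum (a 'dip')."""
        tp_seq = [tps[pair] for pair in pairwise(syllables)]
        words, current = [], [syllables[0]]
        for i, syl in enumerate(syllables[1:], start=1):
            prev_tp = tp_seq[i - 1]                       # TP into this syllable
            left = tp_seq[i - 2] if i >= 2 else float("inf")
            right = tp_seq[i] if i < len(tp_seq) else float("inf")
            if prev_tp < left and prev_tp < right:        # dip -> boundary
                words.append("".join(current))
                current = [syl]
            else:
                current.append(syl)
        words.append("".join(current))
        return words

    # Familiarization stream of the kind used by Saffran et al. (1996):
    stream = "tupirogolabubidakupadotigolabutupirobidaku"
    sylls = [stream[i:i + 2] for i in range(0, len(stream), 2)]
    print(segment(sylls, transitional_probabilities(sylls)))
    ```

    Note that on such a short stream a word occurring only once has a within-stream TP of 1.0 at its edge, so the dip heuristic can fail to posit a boundary there; this is exactly the kind of statistical ambiguity at issue in the third part of the thesis.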

    Emerging Linguistic Functions in Early Infancy

    This paper presents results from experimental studies on early language acquisition in infants and attempts to interpret the experimental results within the framework of the Ecological Theory of Language Acquisition (ETLA) recently proposed by Lacerda et al. (2004a). From this perspective, the infant’s first steps in the acquisition of the ambient language are seen as a consequence of the infant’s general capacity to represent sensory input and of the infant’s interaction with other actors in its immediate ecological environment. On the basis of available experimental evidence, it will be argued that ETLA offers a productive alternative to traditional descriptive views of the language acquisition process by presenting an operative model of how early linguistic function may emerge through interaction.

    An audio-visual corpus for multimodal automatic speech recognition

    CLASS - A Study of methods for coarse phonetic classification

    The objective of this thesis was to examine computer techniques for classifying speech signals into four coarse phonetic classes: vowel-like, strong fricative, weak fricative, and silence. The study compared classification results from the K-means clustering algorithm using Euclidean distance measurements with classification using a multivariate maximum-likelihood distance measure. In addition to the comparison of statistical methods, this study compared classification using several tree-structured decision-making processes. The system was trained on ten speakers and evaluated on 98 utterances from both known and unknown speakers. Results showed very little difference between the Euclidean distance and maximum-likelihood measures; however, introducing the tree structure into both systems had a positive influence on their performance. A minimal sketch of the two distance measures is given below.
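    As a concrete illustration of the two measures compared in this study, here is a minimal sketch assuming per-class mean and covariance estimates from labeled training frames. The class list comes from the abstract; the function names, feature dimensionality, and training-data layout are hypothetical.

    ```python
    # Sketch: classify a feature frame into four coarse phonetic classes
    # using (a) Euclidean distance to class means and (b) a multivariate
    # Gaussian maximum-likelihood score. Assumes enough training frames
    # per class for a full-rank covariance estimate.
    import numpy as np

    CLASSES = ["vowel-like", "strong fricative", "weak fricative", "silence"]

    def fit_class_models(frames_by_class):
        """Estimate a (mean, covariance) pair per class from training frames."""
        models = {}
        for label, frames in frames_by_class.items():
            x = np.asarray(frames)                      # shape (n_frames, dim)
            models[label] = (x.mean(axis=0), np.cov(x, rowvar=False))
        return models

    def classify_euclidean(frame, models):
        """Nearest class mean under the Euclidean distance."""
        return min(models, key=lambda c: np.linalg.norm(frame - models[c][0]))

    def classify_max_likelihood(frame, models):
        """Highest multivariate Gaussian log-likelihood (equal priors)."""
        def log_lik(c):
            mean, cov = models[c]
            diff = frame - mean
            # -0.5 * (log|Sigma| + diff^T Sigma^-1 diff), up to a constant
            _sign, logdet = np.linalg.slogdet(cov)
            return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))
        return max(models, key=log_lik)
    ```

    One reading of the "very little difference" result: with equal priors and a shared identity covariance, the maximum-likelihood rule reduces exactly to nearest-mean classification under Euclidean distance, so the two measures diverge only to the extent that the class covariances differ.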

    Speech data analysis for semantic indexing of video of simulated medical crises.

    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and with what emotion. Only audio information is extracted and analyzed, because the quality of the image recording is low and the visual environment is largely static.

    Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing consists of first extracting the audio component from the video recording and then extracting various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system detects speaker change points for the purpose of segmenting the audio stream. We propose two fusion methods for this task.

    The speaker identification and emotion recognition components of our system are designed to give users the capability to browse the video and retrieve shots that identify "who spoke, when, and with what emotion" for further analysis. For this component, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector with fixed dimensions. The first is based on soft bag-of-words (BoW) feature representations; in particular, we define three types of BoW based on crisp, fuzzy, and possibilistic voting (a minimal sketch follows this abstract). The second feature representation is a generalization of the BoW and is based on the Fisher Vector (FV). FV uses the Fisher kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning and assumes that a large collection of labeled training data is available; within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and Naive Bayes. The second framework is based on semi-supervised learning, where only a limited amount of labeled training samples is available; here we use an approach based on label propagation.

    Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation video and provides the user with a visual summary of the different speech segments. It also allows the user to browse videos and retrieve scenes that provide answers to semantic queries such as: who spoke and when? who interrupted whom? and what was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video, for example: for how long did each person speak? What is the longest uninterrupted speech segment? Is there an unusually large number of pauses within the speech segment of a given speaker?
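    Below is a minimal sketch of the three soft BoW votes named in the abstract, mapping a variable-length sequence of audio feature frames onto a fixed-size histogram over a codebook. The k-means codebook, the Gaussian kernel, and all parameter values are assumptions for illustration; the dissertation's actual features and voting definitions are not specified here.

    ```python
    # Sketch: crisp / fuzzy / possibilistic bag-of-words (BoW) votes over
    # a learned codebook, producing a fixed-length vector for an audio
    # segment of arbitrary length. Codebook size and bandwidth are
    # illustrative placeholders.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(all_training_frames, n_words=64, seed=0):
        """Quantize the feature space into n_words codewords via k-means."""
        return KMeans(n_clusters=n_words, random_state=seed).fit(all_training_frames)

    def bow_histogram(frames, codebook, mode="crisp", bandwidth=1.0):
        centers = codebook.cluster_centers_
        hist = np.zeros(len(centers))
        for frame in frames:
            d = np.linalg.norm(centers - frame, axis=1)
            if mode == "crisp":
                hist[np.argmin(d)] += 1.0          # hard vote: nearest codeword
            else:
                w = np.exp(-(d ** 2) / (2 * bandwidth ** 2))
                if mode == "fuzzy":
                    w /= w.sum() + 1e-12           # memberships sum to 1 per frame
                hist += w                          # possibilistic: left unnormalized
        return hist / max(len(frames), 1)          # fixed length for any duration
    ```

    Whatever the segment duration, the output length equals the codebook size, which is what lets fixed-input classifiers such as K-NN, SVM, or Naive Bayes consume audio segments of arbitrary length.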