3,331 research outputs found

    Confusion modelling for lip-reading

    Get PDF
    Lip-reading is mostly used as a means of communication by people with hearing difficulties. Recent work has explored the automation of this process, with the aim of building a speech recognition system entirely driven by lip movements. However, this work has so far produced poor results because of factors such as high variability of speaker features, difficulties in mapping from visual features to speech sounds, and high co-articulation of visual features. The motivation for the work in this thesis is inspired by previous work in dysarthric speech recognition [Morales, 2009]. Dysarthric speakers have poor control over their articulators, often leading to a reduced phonemic repertoire. The premise of this thesis is that recognition of the visual speech signal is a similar problem to recognition of dysarthric speech, in that some information about the speech signal has been lost in both cases, and this brings about a systematic pattern of errors in the decoded output. This work attempts to exploit the systematic nature of these errors by modelling them in the framework of a weighted finite-state transducer cascade. Results indicate that the technique can achieve slightly lower error rates than the conventional approach. In addition, the thesis explores some more general questions for automated lip-reading.
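    The thesis implements this idea as a weighted finite-state transducer cascade; the sketch below only approximates the same noisy-channel intuition with plain Python dictionaries. The viseme confusion table, the tiny lexicon and all probabilities are invented for illustration.

```python
# Minimal sketch of confusion-based decoding (illustrative only).
# A confusion model P(observed viseme | intended phoneme) is combined with
# a word prior, mirroring how systematic lip-reading errors can be
# modelled and resolved in a transducer cascade.

import math

# Visually similar phonemes (e.g. /p/, /b/, /m/) collapse to one viseme class.
confusion = {
    "p": {"bilabial": 0.9, "other": 0.1},
    "b": {"bilabial": 0.9, "other": 0.1},
    "m": {"bilabial": 0.8, "other": 0.2},
    "t": {"alveolar": 0.85, "other": 0.15},
    "a": {"open": 0.95, "other": 0.05},
}

# Tiny lexicon with unigram priors P(word).
lexicon = {
    "pat": (["p", "a", "t"], 0.4),
    "bat": (["b", "a", "t"], 0.35),
    "mat": (["m", "a", "t"], 0.25),
}

def decode(observed_visemes):
    """Return the word maximising P(word) * prod P(viseme | phoneme)."""
    best_word, best_logp = None, -math.inf
    for word, (phones, prior) in lexicon.items():
        if len(phones) != len(observed_visemes):
            continue
        logp = math.log(prior)
        for ph, vis in zip(phones, observed_visemes):
            logp += math.log(confusion[ph].get(vis, 1e-6))
        if logp > best_logp:
            best_word, best_logp = word, logp
    return best_word, best_logp

# "bilabial open alveolar" is ambiguous between pat/bat/mat; the prior
# breaks the tie, which is the role the language-model side of the
# cascade plays against systematic visual confusions.
print(decode(["bilabial", "open", "alveolar"]))
```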

    Features of hearing: applications of machine learning to uncover the building blocks of hearing

    Get PDF
    Recent advances in machine learning have instigated a renewed interest in using machine learning approaches to better understand human sensory processing. This line of research is particularly interesting for speech research since speech comprehension is uniquely human, which complicates obtaining detailed neural recordings. In this thesis, I explore how machine learning can be used to uncover new knowledge about the auditory system, with a focus on discovering robust auditory features. The resulting increased understanding of the noise robustness of human hearing may help to better assist those with hearing loss and improve Automatic Speech Recognition (ASR) systems. First, I show how computational neuroscience and machine learning can be combined to generate hypotheses about auditory features. I introduce a neural feature detection model with a modest number of parameters that is compatible with auditory physiology. By testing feature detector variants in a speech classification task, I confirm the importance of both well-studied and lesser-known auditory features. Second, I investigate whether ASR software is a good candidate model of the human auditory system. By comparing several state-of-the-art ASR systems to the results from humans on a range of psychometric experiments, I show that these ASR systems diverge markedly from humans in at least some psychometric tests. This implies that none of these systems act as a strong proxy for human speech recognition, although some may be useful when asking more narrowly defined questions. For neuroscientists, this thesis exemplifies how machine learning can be used to generate new hypotheses about human hearing, while also highlighting the caveats of investigating systems that may work fundamentally differently from the human brain. For machine learning engineers, I point to tangible directions for improving ASR systems. To motivate the continued cross-fertilization between these fields, a toolbox that allows researchers to assess new ASR systems has been released.
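    The thesis releases a toolbox for assessing ASR systems against human psychometric results; the snippet below is only a generic sketch of that kind of comparison, not the released toolbox. The recognise() stub, the human scores and the SNR grid are hypothetical placeholders.

```python
# Generic sketch: measure an ASR system's word accuracy on speech-in-noise
# stimuli across SNRs and compare it to human listeners' scores. Everything
# below (the ASR stub, the human data) is an illustrative assumption.

def recognise(waveform, snr_db):
    """Placeholder for a real ASR call; should return a transcript string."""
    raise NotImplementedError("plug in an actual ASR system here")

def word_accuracy(ref, hyp):
    """Fraction of reference words matched position-by-position."""
    ref_words, hyp_words = ref.lower().split(), hyp.lower().split()
    hits = sum(r == h for r, h in zip(ref_words, hyp_words))
    return hits / max(len(ref_words), 1)

# Hypothetical human word-recognition scores (fraction correct) per SNR.
human_scores = {-6: 0.35, 0: 0.70, 6: 0.92, 12: 0.98}

def asr_psychometric_curve(stimuli, snrs):
    """stimuli: list of (waveform, reference_transcript) pairs."""
    curve = {}
    for snr in snrs:
        accs = [word_accuracy(ref, recognise(wav, snr)) for wav, ref in stimuli]
        curve[snr] = sum(accs) / len(accs)
    return curve

# A large gap between curve[snr] and human_scores[snr] at low SNR is the
# kind of divergence between ASR systems and listeners the thesis reports.
```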

    An investigation of signs for median crossovers

    Get PDF
    “This paper describes a study of advance warning signs for median crossovers on divided highways. Candidate crossover signs were identified from a literature review, a survey of current State practices, and discussions with FHWA personnel. Seven of these signs were selected for further testing in a laboratory study for legibility, understanding, and driver preference. Sixty subjects representing a cross-section of drivers took part in the study, thirty at the Turner-Fairbank Highway Research Center in McLean, Virginia and thirty at the University of Missouri--Rolla in Rolla, Missouri. Two of the seven signs were word messages and five were symbolic signs. The results from both groups of subjects showed that the most appropriate word message sign would appear to be 'Median Crossover'. This sign was understood best by the subjects to whom it was shown, and 'Crossover' was the word the majority of subjects thought best conveyed the intended meaning. The symbolic sign found to be the best of those tested was one showing two median noses. It did well in legibility and understanding tests and was least confused with other signs. It was also the symbolic sign most preferred by the subjects and was the simplest of the symbolic designs. Legibility of the symbolic signs was much greater than that of the word messages, and this symbolic design is the sign recommended to identify median crossovers”--Abstract, page ii


    Android HIV: A Study of Repackaging Malware for Evading Machine-Learning Detection

    Full text link
    Machine learning based solutions have been successfully employed for automatic detection of malware in Android applications. However, machine learning models are known to lack robustness against inputs crafted by an adversary. So far, adversarial examples could only deceive Android malware detectors that rely on syntactic features, and the perturbations could only be implemented by simply modifying the Android manifest. As recent Android malware detectors rely more on semantic features from Dalvik bytecode than on the manifest, existing attack/defense methods are no longer effective. In this paper, we introduce a new, highly effective attack that generates adversarial examples of Android malware and evades detection by current models. To this end, we propose a method of applying optimal perturbations to an Android APK using a substitute model. Based on the transferability concept, perturbations that successfully deceive the substitute model are likely to deceive the original models as well. We develop an automated tool that generates the adversarial examples and applies the attacks without human intervention. In contrast to existing works, the adversarial examples crafted by our method can also deceive recent machine learning based detectors that rely on semantic features such as the control-flow graph. The perturbations can also be implemented directly on the APK's Dalvik bytecode rather than the Android manifest to evade recent detectors. We evaluated the proposed manipulation methods for adversarial examples using the same datasets that Drebin and MaMaDroid used (5,879 malware samples). Our results show that the malware detection rate decreased from 96% to 1% for MaMaDroid, and from 97% to 1% for Drebin, with only a small distortion generated by our adversarial example manipulation method. Comment: 15 pages, 11 figures
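    To make the substitute-model/transferability idea concrete, here is a simplified, self-contained sketch of gradient-guided feature addition against a differentiable substitute. It is not the paper's actual tool or APK manipulation method: the logistic-regression substitute, the binary feature vector and the modification budget are illustrative assumptions.

```python
# Craft a perturbation against a differentiable substitute model by only
# ADDING features (0 -> 1), so a real app would keep its functionality,
# then rely on transferability for the perturbation to also fool the
# target detector. Illustrative sketch, not the paper's tool.

import numpy as np

rng = np.random.default_rng(0)
n_features = 100                       # e.g. API-call / CFG-derived features
w = rng.normal(size=n_features)        # substitute model weights (trained elsewhere)
b = 0.0

def p_malware(x):
    """Substitute model: probability that feature vector x is malware."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def craft(x, budget=10):
    """Greedily add the absent features whose gradient pushes the malware
    score down the most, up to `budget` modifications."""
    x = x.copy()
    for _ in range(budget):
        p = p_malware(x)
        if p < 0.5:                     # already classified benign
            break
        grad = p * (1.0 - p) * w        # d p_malware / d x for the logistic substitute
        candidates = np.where((x == 0) & (grad < 0))[0]
        if candidates.size == 0:
            break
        best = candidates[np.argmin(grad[candidates])]
        x[best] = 1.0                   # only additions: functionality preserved
    return x

x_mal = (rng.random(n_features) < 0.3).astype(float)   # toy malware feature vector
x_adv = craft(x_mal)
print(f"substitute score before: {p_malware(x_mal):.2f}, after: {p_malware(x_adv):.2f}")
```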

    Evaluation of a context-aware voice interface for Ambient Assisted Living: qualitative user study vs. quantitative system evaluation

    No full text
    This paper presents an experiment with seniors and people with visual impairment in a voice-controlled smart home using the SWEET-HOME system. The experiment reveals some weaknesses in automatic speech recognition that must be addressed, as well as the need for better adaptation to the user and the environment. Indeed, users were disturbed by the rigid structure of the grammar and were eager to adapt it to their own preferences. Surprisingly, although no humanoid aspect was introduced into the system, the senior participants were inclined to embody the system. Despite these areas for improvement, the system was assessed favourably as diminishing most of the participants' fears related to the loss of autonomy.

    The perception and cognition of emotion from motion

    Get PDF
    Emotional expression has been intensively researched in the past; however, this research was normally conducted on facial expressions and only seldom on dynamic stimuli. We have been interested in better understanding the perception and cognition of emotion from human motion. To this end, 11 experiments were conducted that spanned the perception and representation of emotion, the role spatial and temporal cues play in the perception of emotion, and finally high-level cognitive features in the categorisation of emotion. The stimuli we employed were point-light displays of human arm movements recorded as actors portrayed ordinary actions with emotion. To create them we used motion capture technology and computer animation techniques. Results from the first two experiments showed basic human competence in the recognition of emotion and that the representation of emotion lies along two dimensions. These dimensions resembled arousal and valence, and the psychological space resembled that found for both facial expression and experienced affect. In a search for possible stimulus properties that could act as correlates of the dimensions, it emerged that arousal could be accounted for by movement speed, while valence was related to phase relations between joints in the displays. In the third experiment we manipulated the dimension of arousal and showed that, through a modulation of duration, the perception of angry, sad and neutral movements could be modulated. In experiments 4-7 the contribution of spatial cues to the perception of emotion was explored, and in the final set of experiments (8-11) the perception of emotion was examined from a cognitive perspective. Through the course of the research a number of interesting findings emerged that suggested three primary directions for future research: the possible relationship between attributions of animacy and emotion to animate and inanimate non-humans; the phase or timing relationships between elements in a display as a categorical cue to valence; and the unexplored relationship between cues to emotion from movements and faces.
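    The two stimulus correlates named above (movement speed for arousal, inter-joint phase relations for valence) can be computed directly from joint trajectories. The sketch below uses synthetic stand-in trajectories rather than real point-light motion-capture data, and the frame rate and cross-correlation phase estimate are illustrative analysis choices rather than the thesis's exact procedure.

```python
# Compute simple proxies for the two reported correlates:
# arousal ~ mean movement speed, valence ~ phase lag between joints.

import numpy as np

fps = 60.0
t = np.arange(0, 2.0, 1.0 / fps)                 # 2 s of motion

# Synthetic 1-D angle trajectories for two arm joints (e.g. elbow, wrist);
# the wrist lags the elbow by roughly 0.1 s.
elbow = np.sin(2 * np.pi * 1.5 * t)
wrist = np.sin(2 * np.pi * 1.5 * (t - 0.1))

def mean_speed(trajectory):
    """Average absolute frame-to-frame change (proxy for movement speed)."""
    return np.mean(np.abs(np.diff(trajectory))) * fps

def phase_lag(a, b):
    """Lag (in seconds) of b relative to a, via peak cross-correlation."""
    a = a - a.mean()
    b = b - b.mean()
    corr = np.correlate(b, a, mode="full")
    lag_frames = np.argmax(corr) - (len(a) - 1)
    return lag_frames / fps

print(f"mean speed (arousal correlate): {mean_speed(elbow):.2f} units/s")
print(f"elbow-wrist phase lag (valence correlate): {phase_lag(elbow, wrist):.3f} s")
```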

    Towards multi-domain speech understanding with flexible and dynamic vocabulary

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 201-208). In developing telephone-based conversational systems, we foresee future systems capable of supporting multiple domains and a flexible vocabulary. Users can pursue several topics of interest within a single telephone call, and the system is able to switch transparently among domains within a single dialog. The system is able to detect the presence of any out-of-vocabulary (OOV) words and automatically hypothesizes their pronunciation, spelling and meaning. These can be confirmed with the user, and the new words are subsequently incorporated into the recognizer lexicon for future use. This thesis describes our work towards realizing such a vision, using a multi-stage architecture. Our work is focused on organizing the application of linguistic constraints in order to accommodate multiple domain topics and a dynamic vocabulary at the spoken input. The philosophy is to apply exclusively below-word-level linguistic knowledge at the initial stage. Such knowledge is domain-independent and general to all of the English language; hence, it is broad enough to support any unknown words that may appear at the input, as well as input from several topic domains. At the same time, the initial pass narrows the search space for the next stage, where domain-specific knowledge that resides at the word level or above is applied. In the second stage, we envision several parallel recognizers, each with higher-order language models tailored specifically to its domain. A final decision algorithm selects a final hypothesis from the set of parallel recognizers. Part of our contribution is the development of a novel first stage which attempts to maximize linguistic constraints, using only below-word-level information. The goals are to prevent sequences of unknown words from being pruned away prematurely while maintaining performance on in-vocabulary items, as well as to reduce the search space for later stages. Our solution coordinates the application of various subword-level knowledge sources. The recognizer lexicon is implemented with an inventory of linguistically motivated units called morphs, which are syllables augmented with spelling and word position. This first stage is designed to output a phonetic network so that we are not committed to the initial hypotheses. This adds robustness, as later stages can propose words directly from phones. To maximize performance of the first stage, much of our focus has centered on the integration of a set of hierarchical sublexical models into this first pass. To do this, we utilize the ANGIE framework, which supports a trainable context-free grammar and is designed to acquire subword-level and phonological information statistically. Its models can generalize knowledge about word structure, learned from in-vocabulary data, to previously unseen words. We explore methods for collapsing the ANGIE models into a finite-state transducer (FST) representation which enables these complex models to be efficiently integrated into recognition. The ANGIE-FST needs to encapsulate the hierarchical knowledge of ANGIE and replicate ANGIE's ability to support previously unobserved phonetic sequences ... by Grace Chung. Ph.D.
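    The second-stage idea (parallel domain-specific recognizers followed by a final decision step) can be pictured with a toy sketch. The domains, the scoring fields and the "pick the best total log-score" rule below are illustrative assumptions, not the decision algorithm actually developed in the thesis.

```python
# Toy sketch: several parallel, domain-specific recognizers rescore the
# shared first-stage phonetic network; a final decision step selects one
# hypothesis across domains.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    domain: str
    text: str
    acoustic_logp: float   # from the shared first-stage phonetic network
    lm_logp: float         # from the domain-specific language model

def decide(hypotheses, lm_weight=1.0):
    """Select the hypothesis with the best combined log-score."""
    return max(hypotheses, key=lambda h: h.acoustic_logp + lm_weight * h.lm_logp)

# Hypothetical outputs of three domain recognizers for one utterance.
candidates = [
    Hypothesis("weather", "what is the forecast for boston", -120.4, -18.2),
    Hypothesis("flights", "what is the fare cast for boston", -121.0, -25.7),
    Hypothesis("traffic", "what is the forecast for austin", -123.9, -19.5),
]

best = decide(candidates)
print(best.domain, "->", best.text)
```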