1,822 research outputs found

    SCREEN: Learning a Flat Syntactic and Semantic Spoken Language Analysis Using Artificial Neural Networks

    In this paper, we describe a so-called screening approach for learning robust processing of spontaneously spoken language. A screening approach is a flat analysis which uses shallow sequences of category representations for analyzing an utterance at various syntactic, semantic and dialog levels. Rather than using a deeply structured symbolic analysis, we use a flat connectionist analysis. This screening approach aims at supporting speech and language processing by using (1) data-driven learning and (2) robustness of connectionist networks. In order to test this approach, we have developed the SCREEN system which is based on this new robust, learned and flat analysis. In this paper, we focus on a detailed description of SCREEN's architecture, the flat syntactic and semantic analysis, the interaction with a speech recognizer, and a detailed evaluation analysis of the robustness under the influence of noisy or incomplete input. The main result of this paper is that flat representations allow more robust processing of spontaneous spoken language than deeply structured representations. In particular, we show how the fault-tolerance and learning capability of connectionist networks can support a flat analysis for providing more robust spoken-language processing within an overall hybrid symbolic/connectionist framework. Comment: 51 pages, Postscript. To be published in Journal of Artificial Intelligence Research 6(1), 199

    Zero-shot keyword spotting for visual speech recognition in-the-wild

    Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a practical, real-world setting that has so far received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods. Comment: Accepted at ECCV-201

    Symbolic inductive bias for visually grounded learning of spoken language

    A widespread approach to processing spoken language is to first automatically transcribe it into text. An alternative is to use an end-to-end approach: recent works have proposed to learn semantic embeddings of spoken language from images with spoken captions, without an intermediate transcription step. We propose to use multitask learning to exploit existing transcribed speech within the end-to-end setting. We describe a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images. We show that the addition of the speech/text task leads to substantial performance improvements on image retrieval when compared to training the speech/image task in isolation. We conjecture that this is due to a strong inductive bias transcribed speech provides to the model, and offer supporting evidence for this. Comment: ACL 201

    Are developmental disorders like cases of adult brain damage? Implications from connectionist modelling

    It is often assumed that similar domain-specific behavioural impairments found in cases of adult brain damage and developmental disorders correspond to similar underlying causes, and can serve as convergent evidence for the modular structure of the normal adult cognitive system. We argue that this correspondence is contingent on an unsupported assumption that atypical development can produce selective deficits while the rest of the system develops normally (Residual Normality), and that this assumption tends to bias data collection in the field. Based on a review of connectionist models of acquired and developmental disorders in the domains of reading and past tense, as well as on new simulations, we explore the computational viability of Residual Normality and the potential role of development in producing behavioural deficits. Simulations demonstrate that damage to a developmental model can produce very different effects depending on whether it occurs prior to or following the training process. Because developmental disorders typically involve damage prior to learning, we conclude that the developmental process is a key component of the explanation of endstate impairments in such disorders. Further simulations demonstrate that in simple connectionist learning systems, the assumption of Residual Normality is undermined by processes of compensation or alteration elsewhere in the system. We outline the precise computational conditions required for Residual Normality to hold in development, and suggest that in many cases it is an unlikely hypothesis. We conclude that in developmental disorders, inferences from behavioural deficits to underlying structure crucially depend on developmental conditions, and that the process of ontogenetic development cannot be ignored in constructing models of developmental disorders.

    Models of atypical development must also be models of normal development

    Functional magnetic resonance imaging studies of developmental disorders and normal cognition that include children are becoming increasingly common and represent part of a newly expanding field of developmental cognitive neuroscience. These studies have illustrated the importance of the process of development in understanding brain mechanisms underlying cognition and including children in the study of the etiology of developmental disorders.

    Connectionist modelling of lexical segmentation and vocabulary acquisition

    Adults typically hear sentences in their native language as a sequence of separate words, and we might therefore assume that words in speech are physically separated in the way that they are perceived. However, when listening to an unfamiliar language we no longer experience sequences of discrete words, but rather hear a continuous stream of speech with boundaries separating individual sentences or utterances. Theories of how adult listeners segment the speech stream into words emphasise the role that knowledge of individual words plays in the segmentation of speech. However, since words cannot be learnt until the speech stream can be segmented, it seems unlikely that infants will be able to use word recognition to segment connected speech. For this reason, researchers have proposed a variety of strategies and cues that infants could use to identify word boundaries without being able to recognise the words that these boundaries delimit. This chapter describes some computational simulations proposing ways in which these cues and strategies for the acquisition of lexical segmentation can be integrated with the infants’ acquisition of the meanings of words. The simulations reported here describe simple computational mechanisms and knowledge sources that may support these different aspects of language acquisition.

    Deep learning for speech enhancement : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand

    Speech enhancement, aiming at improving the intelligibility and overall perceptual quality of a contaminated speech signal, is an effective way to improve speech communications. In this thesis, we propose three novel deep learning methods to improve speech enhancement performance. Firstly, we propose an adversarial latent representation learning for latent space exploration of generative adversarial network based speech enhancement. Based on adversarial feature learning, this method employs an extra encoder to learn an inverse mapping from the generated data distribution to the latent space. The encoder establishes an inner connection with the generator and contributes to latent information learning. Secondly, we propose an adversarial multi-task learning with inverse mappings method for effective speech representation. This speech enhancement method focuses on enhancing the generator's capability of speech information capture and representation learning. To implement this method, two extra networks are developed to learn the inverse mappings from the generated distribution to the input data domains. Thirdly, we propose a self-supervised learning based phone-fortified method to improve specific speech characteristics learning for speech enhancement. This method explicitly imports phonetic characteristics into a deep complex convolutional network via a contrastive predictive coding model pre-trained with self-supervised learning. The experimental results demonstrate that the proposed methods outperform previous speech enhancement methods and achieve state-of-the-art performance in terms of speech intelligibility and overall perceptual quality.