19 research outputs found

    Deliberation Model Based Two-Pass End-to-End Speech Recognition

    End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves a 12% relative WER reduction compared to LAS rescoring on Google Voice Search (VS) tasks, and a 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% better in relative terms on VS. In terms of computational complexity, the deliberation decoder is larger than the LAS decoder, and hence requires more computation in second-pass decoding.
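
    As a rough illustration of the deliberation idea described above, the sketch below (PyTorch; layer sizes and module names are illustrative assumptions, not the paper's exact architecture) shows a decoder step that attends separately to acoustic encoder outputs and to bidirectionally encoded first-pass hypotheses, then combines the two context vectors. Attending to the hypothesis encoding in addition to the acoustics is what distinguishes this style of second pass from plain LAS rescoring.

        # Minimal sketch of a deliberation-style decoder step that attends to both
        # acoustic encoder outputs and encoded first-pass hypotheses, then combines
        # the two context vectors. Names and sizes are illustrative only.
        import torch
        import torch.nn as nn

        class DeliberationDecoderStep(nn.Module):
            def __init__(self, d_model: int = 256, n_heads: int = 4):
                super().__init__()
                # Bidirectional encoder over embedded first-pass hypothesis tokens.
                self.hyp_encoder = nn.LSTM(d_model, d_model // 2,
                                           bidirectional=True, batch_first=True)
                # Separate attention over acoustics and over hypotheses.
                self.attn_acoustic = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.attn_hypothesis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.combine = nn.Linear(2 * d_model, d_model)

            def forward(self, decoder_state, acoustic_enc, hyp_embeddings):
                # decoder_state:  (B, 1, d_model)   current decoder query
                # acoustic_enc:   (B, T_a, d_model) shared acoustic encoder outputs
                # hyp_embeddings: (B, T_h, d_model) embedded first-pass hypothesis tokens
                hyp_enc, _ = self.hyp_encoder(hyp_embeddings)
                ctx_a, _ = self.attn_acoustic(decoder_state, acoustic_enc, acoustic_enc)
                ctx_h, _ = self.attn_hypothesis(decoder_state, hyp_enc, hyp_enc)
                # Fuse acoustic and hypothesis context for the next prediction step.
                return self.combine(torch.cat([ctx_a, ctx_h], dim=-1))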

    Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

    Artificial Neural Networks (ANNs) were inspired by the neural networks of the human brain and have been widely applied in speech processing. Application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Since speech processing in humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed deep-learning-based ANNs equipped with some mechanism intended to mirror the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into these deep learning algorithms or of their relation to human auditory attention. We therefore consider it necessary to review the different attention-inspired ANN approaches, so that both academic and industry experts can see the models available for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) its hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed and their strengths and weaknesses determined.
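
    For readers unfamiliar with the mechanism most of the reviewed architectures build on, the following minimal sketch (NumPy; shapes and names are illustrative and not tied to any specific paper in the review) shows scaled dot-product attention, the weighted-sum operation that many of these models use to focus on relevant parts of a speech signal.

        # Minimal sketch of scaled dot-product attention; shapes are illustrative.
        import numpy as np

        def scaled_dot_product_attention(queries, keys, values):
            """queries: (T_q, d), keys/values: (T_k, d) -> context: (T_q, d)."""
            d = queries.shape[-1]
            scores = queries @ keys.T / np.sqrt(d)           # similarity of each query to each key
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
            return weights @ values                          # weighted sum = attended context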

    Viseme-based Lip-Reading using Deep Learning

    Research in automated lip reading is an incredibly rich discipline with many facets that have been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced and up-to-date lip-reading systems can predict entire sentences with thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and covering an ever-expanding vocabulary with as few classes as possible remains a challenge. The work in this thesis contributes to the area of classification schemas by proposing an automated lip-reading model that predicts sentences using visemes as the classification schema, an alternative to the ASCII characters conventionally used to predict sentences. This thesis reviews current trends in deep learning-based automated lip reading and identifies a gap in the research by contributing to work on classification schemas. A new line of research is opened up in which an alternative way to do lip reading is explored, and in doing so, lip-reading results for predicting sentences from a benchmark dataset are attained that improve upon the current state-of-the-art. In this thesis, a neural network-based lip-reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip-read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences in a two-stage procedure, with visemes recognised in the first stage and words classified in the second stage. The second stage therefore has to overcome both the one-to-many mapping problem posed in lip reading, where one set of visemes can map to several words, and the problem of visemes being confused or misclassified in the first place; a toy example of this mapping problem is sketched below. To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective in their conversion performance of predicting words and robust to the possibility of viseme confusion or misclassification. The initial system has been evaluated on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art work in lip reading sentences reported at the time, the system achieved a significantly improved performance. The lip-reading system is further improved by using a language model that has been demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes. An improved performance in predicting spoken sentences from the LRS2 dataset is obtained, with a word accuracy rate of 79.6%, which is better than another lip-reading system trained and evaluated on the same dataset that attained a word accuracy rate of 77.4%, and is, to the best of our knowledge, the next best result observed on LRS2.
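
    The one-to-many viseme-to-word mapping mentioned above can be illustrated with a toy example; the viseme labels, candidate words, and language-model scores below are invented for illustration and are not taken from the thesis.

        # Toy illustration of the one-to-many viseme-to-word mapping problem:
        # several words (homophemes) share the same viseme sequence, so a second
        # stage must disambiguate them. All labels and scores here are made up.
        VISEME_TO_WORDS = {
            "p-ae-t": ["pat", "bat", "mat"],   # /p/, /b/, /m/ look alike on the lips
            "f-ao-r": ["four", "for"],
        }

        UNIGRAM_LOGPROB = {"pat": -9.1, "bat": -8.7, "mat": -8.9,
                           "four": -6.2, "for": -4.1}

        def visemes_to_word(viseme_seq: str) -> str:
            """Resolve a viseme sequence to the most likely word with a unigram LM."""
            candidates = VISEME_TO_WORDS.get(viseme_seq, [])
            return max(candidates, key=lambda w: UNIGRAM_LOGPROB.get(w, float("-inf")))

        print(visemes_to_word("p-ae-t"))  # -> "bat" under these illustrative scores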

    Representation Analysis Methods to Model Context for Speech Technology

    Speech technology has developed to levels equivalent to human parity through the use of deep neural networks. However, it is unclear how the dependencies learned within these networks contribute to metrics such as recognition performance. This research focuses on strategies to interpret and exploit these learned context dependencies to improve speech recognition models. Context dependency analysis had not previously been explored for speech recognition networks. In order to highlight and observe dependent representations within speech recognition models, a novel analysis framework is proposed. This framework uses statistical correlation indexes to measure the degree of correlation between neural representations. By comparing these correlations between models built with different approaches, it is possible to observe specific context dependencies within network layers. These insights on context dependencies can then be used to adapt modelling approaches so that they become more computationally efficient and achieve better recognition performance. Here, the performance of End-to-End speech recognition models is analysed, providing insights into acoustic and language modelling context dependencies. The modelling approach for a speaker recognition task is adapted to exploit acoustic context dependencies and reaches performance comparable with state-of-the-art methods, achieving a 2.89% equal error rate on the Voxceleb1 training and test sets with 50% of the parameters. Furthermore, empirical analysis of the role of acoustic context in speech emotion recognition modelling revealed that emotion cues are presented as a distributed event. These analyses and results for speech recognition applications aim to provide objective direction for the future development of automatic speech recognition systems.
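
    The abstract does not name the specific correlation index used, but one widely used measure of similarity between two layers' activations is linear centered kernel alignment (CKA); the sketch below (NumPy) is therefore only an illustrative stand-in for the framework's statistical correlation indexes, not the thesis's exact method.

        # Illustrative correlation index between two layers' activations (linear CKA).
        # The thesis's actual statistical index is not specified in the abstract.
        import numpy as np

        def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
            """x: (n_samples, d1), y: (n_samples, d2) activations for the same inputs."""
            x = x - x.mean(axis=0)                      # center features
            y = y - y.mean(axis=0)
            hsic = np.linalg.norm(y.T @ x, "fro") ** 2  # cross-covariance strength
            norm_x = np.linalg.norm(x.T @ x, "fro")
            norm_y = np.linalg.norm(y.T @ y, "fro")
            return hsic / (norm_x * norm_y)             # 1.0 = identical up to rotation/scale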

    Semantic Communications for Speech Transmission

    Wireless communication systems have undergone vigorous advancement from the first generation (1G) to the fifth generation (5G) over the past few decades, with numerous coding algorithms and channel models developed to recover sources accurately at the bit level. However, in recent years, the flourishing of artificial intelligence (AI) has revolutionised various industries and incubated multifarious intelligent tasks, increasing the amount of transmitted data to the zettabyte level and requiring massive machine connectivity with low transmission latency and energy consumption. In this context, conventional communication systems face severe challenges imposed by ubiquitous AI tasks, making it necessary to develop a new communication paradigm. Semantic communications have been proposed to address these challenges by extracting the semantic information inherent in source data while omitting irrelevant redundant information, thereby reducing the transmitted data, lowering communication resource requirements, and facilitating transmission with high semantic fidelity. Nevertheless, the exploration of semantic communications went through decades of stagnation after the concept was first identified, because of the inadequacy of mathematical models for semantic information. Inspired by the thriving of AI, deep learning (DL)-enabled semantic communications have been investigated as promising solutions to the bottlenecks of conventional communications, leveraging the learning and fitting capabilities of neural networks to bypass mathematical models for semantic extraction and representation. To this end, this thesis explores DL-enabled semantic communications for speech transmission to tackle technical problems in conventional speech communication networks, including semantic-agnostic coding algorithms, unreliable speech transmission in complicated channel environments, system output limited to the speech modality, and speech quality susceptible to external interference. Specifically, a general semantic communication system for speech transmission over single-input single-output (SISO) channels, named DeepSC-S, is first developed to reconstruct speech information by transmitting global semantic features. In addition, the system output is extended to multimodal data across different languages by introducing a task-oriented semantic communication framework for speech transmission, named DeepSC-ST, to perform various downstream intelligent tasks, including speech recognition, speech synthesis, speech-to-text translation (S2TT), and speech-to-speech translation (S2ST). Moreover, semantic communications for speech transmission over multiple-input multiple-output (MIMO) channels are investigated to address practical communication scenarios, and a semantic-aware network is devised to improve the performance of intelligent tasks. Furthermore, realistic scenarios involving speech input corrupted by external interference are considered by establishing a semantic impairment suppression mechanism that compensates for impaired semantics in the corrupted speech and facilitates robust end-to-end (E2E) semantic communications for speech-to-text translation. The proposed DeepSC-S and its variants investigated in this thesis demonstrate high proficiency in semantic communications for speech transmission by substantially reducing the transmitted data, performing diverse semantic tasks, providing superior system performance, and tolerating dynamic channel effects.
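
    As a conceptual illustration of the kind of pipeline DeepSC-S represents, the sketch below (PyTorch) strings together a semantic encoder, a channel encoder, an additive white Gaussian noise channel, and the corresponding decoders; the layer sizes, the AWGN channel model, and the class name are assumptions for illustration rather than the thesis architecture. Such a system would typically be trained end-to-end with a reconstruction loss between the input and recovered speech frames.

        # Conceptual sketch of an end-to-end semantic communication pipeline for
        # speech: semantic encoder -> channel encoder -> AWGN channel -> channel
        # decoder -> semantic decoder. All sizes and the channel model are
        # illustrative assumptions, not the DeepSC-S design.
        import torch
        import torch.nn as nn

        class SemanticSpeechSystem(nn.Module):
            def __init__(self, frame_dim: int = 128, sem_dim: int = 32, sym_dim: int = 16):
                super().__init__()
                self.semantic_encoder = nn.Sequential(nn.Linear(frame_dim, sem_dim), nn.ReLU())
                self.channel_encoder = nn.Linear(sem_dim, sym_dim)
                self.channel_decoder = nn.Linear(sym_dim, sem_dim)
                self.semantic_decoder = nn.Sequential(nn.ReLU(), nn.Linear(sem_dim, frame_dim))

            def forward(self, speech_frames: torch.Tensor, snr_db: float = 10.0):
                semantics = self.semantic_encoder(speech_frames)   # extract semantic features
                symbols = self.channel_encoder(semantics)          # map to channel symbols
                # Corrupt the symbols with AWGN at the requested signal-to-noise ratio.
                noise_power = symbols.pow(2).mean() / (10 ** (snr_db / 10))
                received = symbols + noise_power.sqrt() * torch.randn_like(symbols)
                return self.semantic_decoder(self.channel_decoder(received))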