
    The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes

    This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems, along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline, reducing the word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech recorded directly in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating estimated signal properties with typical system performance on a per-session and per-utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.
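
    A minimal sketch of the kind of per-utterance analysis described above: correlating estimated signal-to-noise ratio with word error rate. This is not the challenge's analysis code; the data below is synthetic, and both arrays are hypothetical stand-ins for measured values.

        import numpy as np
        from scipy.stats import pearsonr

        rng = np.random.default_rng(0)
        # Hypothetical per-utterance SNR estimates (dB) and WERs (%); in the
        # paper these would come from the recorded evaluation material and the
        # outputs of a submitted system.
        snr_db = rng.uniform(-5, 15, size=200)
        utt_wer = np.clip(40 - 1.5 * snr_db + rng.normal(0, 8, size=200), 0, 100)

        r, p = pearsonr(snr_db, utt_wer)
        print(f"Pearson r = {r:.2f} (p = {p:.3g})")  # expect a negative correlation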

    Distant Speech Recognition of Natural Spontaneous Multi-party Conversations

    Distant speech recognition (DSR) has gained wide interest in recent years. While deep networks continue to improve ASR overall, a performance gap remains between close-talking and distant recordings. The work in this thesis therefore aims to provide insights towards further improving DSR performance. The investigation starts with the collection of the first multi-microphone and multi-media corpus of natural spontaneous multi-party conversations in native English with tracked speaker locations: the Sheffield Wargame Corpus (SWC). State-of-the-art recognition systems, with acoustic models both trained standalone and adapted, show word error rates (WERs) above 40% on headset recordings and above 70% on distant recordings. A comparison between the SWC and the AMI corpus suggests a few properties unique to real natural spontaneous conversations, e.g. very short utterances and emotional speech. Further experimental analysis based on simulated and real data quantifies the impact of such influence factors on DSR performance, and illustrates the complex interaction among multiple factors, which makes the treatment of each individual factor much more difficult. The reverberation factor is studied further. It is shown that the reverberation effect on speech features can be accurately modelled as a temporal convolution in the complex spectrogram domain. Based on this, a polynomial reverberation score is proposed to measure the distortion level of short utterances. Compared to existing reverberation metrics such as C50, it avoids a rigid early/late reverberation partition without compromising performance in ranking the reverberation level of recording environments and channels. Furthermore, existing reverberation measurements are signal-independent and thus unable to accurately estimate the reverberation distortion level in short recordings. Inspired by a phonetic analysis of reverberation distortion via self-masking and overlap-masking, a novel partition of reverberation distortion into intra-phone smearing and inter-phone smearing is proposed, so that the distortion level is first estimated for each part and then combined.
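
    A minimal numpy sketch of the convolutive model mentioned above: approximating a reverberant complex spectrogram, per frequency bin, as a temporal convolution of the clean spectrogram with a short complex filter (the convolutive transfer function approximation). All sizes and signals below are illustrative placeholders, not the thesis's data or estimation procedure.

        import numpy as np

        def ctf_reverb(X, H):
            # Convolutive transfer function model: for each frequency bin f,
            # Y[f, t] = sum_tau H[f, tau] * X[f, t - tau]
            F, T = X.shape
            Y = np.zeros_like(X)
            for f in range(F):
                Y[f] = np.convolve(X[f], H[f])[:T]
            return Y

        rng = np.random.default_rng(0)
        F, T, L = 257, 100, 8  # freq bins, frames, filter length (assumed sizes)
        X = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))  # "clean" STFT
        # A decaying complex filter standing in for a per-bin room response
        H = (rng.normal(size=(F, L)) + 1j * rng.normal(size=(F, L))) * np.exp(-np.arange(L) / 3.0)
        Y = ctf_reverb(X, H)  # "reverberant" spectrogram under the CTF model
        print(Y.shape)        # (257, 100)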

    Child Speech Recognition in Human-Robot Interaction: Evaluations and Recommendations

    An increasing number of human-robot interaction (HRI) studies are now taking place in applied settings with children. These interactions often hinge on verbal communication to achieve their goals effectively. Great advances have been made in adult speech recognition, and it is often assumed that these advances will carry over to the HRI domain and to interactions with children. In this paper, we evaluate a number of automatic speech recognition (ASR) engines under a variety of conditions inspired by real-world social HRI settings. Using the collected data, we demonstrate that there is still much work to be done in ASR for child speech, with interactions that rely solely on this modality still out of reach. However, we also make recommendations for child-robot interaction design in order to make the most of the capability that does currently exist.
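
    Evaluations like the one above typically compare ASR engines by word error rate. Below is a minimal sketch of the standard WER metric (word-level Levenshtein distance normalised by reference length); it is a generic illustration, not the authors' evaluation code.

        def wer(reference: str, hypothesis: str) -> float:
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = edit distance between ref[:i] and hyp[:j]
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        print(wer("the robot asked a question", "the robot asked question"))  # 0.2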

    Automatic speech recognition for European Portuguese

    Master's dissertation in Informatics Engineering. The process of Automatic Speech Recognition (ASR) opens doors to a vast number of possible improvements in customer experience. The use of this type of technology has increased significantly in recent years, a change driven by the recent evolution of ASR systems. The opportunities to use ASR are vast, covering several areas such as medicine, industry and business, among others. We must emphasise the use of these voice recognition systems in telecommunications companies, namely in the automation of customer assistance, allowing a call to be routed to a specialised operator automatically by recognising the subject matter from the spoken utterances. In recent years we have seen great technological breakthroughs in ASR, achieving unprecedented accuracy results that are comparable to humans. We are also seeing a move from what is known as the traditional approach to ASR systems, based on Hidden Markov Models (HMMs), to newer end-to-end ASR systems that benefit from the use of deep neural networks (DNNs), large amounts of data and process parallelisation. The literature review showed us that the focus of previous work has been almost exclusively on the English and Chinese languages, with little effort devoted to the development of other languages, as is the case with Portuguese. In the research carried out, we did not find a model for the European Portuguese (EP) dialect that is freely available for general use. Focused on this problem, this work describes the development of an end-to-end ASR system for EP. To achieve this goal, a set of procedures was followed that allowed us to present the concepts, characteristics and all the steps inherent to the construction of these types of systems. Furthermore, since the transcribed speech needed to accomplish our goal is very limited for EP, we also describe the process of collecting and formatting data from a variety of different sources, most of them freely available to the public. To further improve our results, a variety of data augmentation techniques were implemented and tested. The obtained models are based on a PyTorch implementation of the Deep Speech 2 model. Our best model achieved a word error rate (WER) of 40.5% on our main test corpus, slightly better than the results obtained by commercial systems on the same data. Around 150 hours of transcribed EP speech was collected so that it can be used to train other ASR systems or models in different areas of investigation. We gathered a series of interesting results on the use of different batch sizes, as well as on the improvements provided by a large variety of data augmentation techniques. Nevertheless, the field of ASR is vast, and there is still a variety of methods and interesting concepts that we could research in order to improve on the achieved results.
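
    A minimal sketch of two common audio data augmentation techniques of the kind the dissertation reports testing (the exact set used there is not listed above): additive noise at a target SNR and speed perturbation by resampling. The waveform here is a synthetic placeholder.

        import numpy as np

        def add_noise(x, snr_db, rng):
            # Scale white noise so that 10*log10(P_signal / P_noise) == snr_db
            noise = rng.normal(size=x.shape)
            scale = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
            return x + scale * noise

        def speed_perturb(x, factor):
            # Linear-interpolation resampling; factor > 1 speeds the audio up
            idx = np.arange(0, len(x) - 1, factor)
            return np.interp(idx, np.arange(len(x)), x)

        rng = np.random.default_rng(0)
        audio = rng.normal(size=16000)          # placeholder for a 1 s waveform
        noisy = add_noise(audio, snr_db=10, rng=rng)
        faster = speed_perturb(audio, 1.1)
        print(len(audio), len(faster))          # the perturbed copy is shorter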

    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and track their individual motion in an acoustic environment. This ability makes humans context-aware and enables them to interact naturally with their surroundings. Developing similar methods for machines would provide an automatic description of the social and human activities around them and enable machines to be context-aware in a similar way. Such methods can be employed to help the hearing-impaired visualize sounds, for robot navigation, and to monitor biodiversity, the home, and cities. A real-life acoustic scene is complex, with multiple sound events that overlap in time and space, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class exhibits considerable variability; for example, different cars have different horns, and even for the same car model, the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, this thesis investigates the SELDT task with a data-driven approach using deep neural networks (DNNs). The sound event detection (SED) task requires detecting the onset and offset times of individual sound events and their corresponding labels. We propose to use spatial and perceptual features extracted from multichannel audio for SED, using two different DNNs: recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that multichannel audio features improve SED performance for overlapping sound events in comparison to traditional single-channel features. The proposed features and methods produced state-of-the-art performance on the real-life SED task and won the IEEE AASP DCASE challenge in both 2016 and 2017. Sound event localization is the task of estimating the spatial position of individual sound events. Traditionally, this has been approached with parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization over the full azimuth and elevation space. In contrast to parametric methods, which require the number of active sources to be known, the proposed method learns this information directly from the input data and estimates the respective spatial locations. The proposed CRNN is also shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a CRNN that additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method, and it is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets representing anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events, both spatially stationary and moving.
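
    A minimal PyTorch sketch of a CRNN of the general shape described above: convolution over a spectrogram-like input, recurrence over time, and per-frame multi-label sigmoid outputs for SED. All layer sizes and the class count are illustrative assumptions, not the thesis architecture.

        import torch
        import torch.nn as nn

        class CRNN(nn.Module):
            def __init__(self, n_mels=40, n_classes=11):  # illustrative sizes
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d((1, 4)),   # pool over frequency, keep time
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d((1, 5)),
                )
                self.gru = nn.GRU(32 * (n_mels // 20), 64, batch_first=True,
                                  bidirectional=True)
                self.head = nn.Linear(128, n_classes)

            def forward(self, x):                     # x: (batch, time, mels)
                z = self.conv(x.unsqueeze(1))         # (batch, ch, time, mels')
                z = z.permute(0, 2, 1, 3).flatten(2)  # (batch, time, ch * mels')
                z, _ = self.gru(z)
                return torch.sigmoid(self.head(z))    # per-frame class activities

        x = torch.randn(2, 100, 40)   # batch of 2, 100 frames, 40 mel bands
        print(CRNN()(x).shape)        # torch.Size([2, 100, 11])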

    Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2

    Speech synthesis aims to generate human-like speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset from the target speaker is available. However, it is often necessary to adapt to a target speaker for whom only a few training samples exist. Speaker adaptation with limited data is a difficult problem because of the scarcity of training samples; issues can appear with a limited speaker dataset, such as the irregular realisation of linguistic tokens (i.e., some speech sounds are left out of the synthesized speech). To build lightweight systems, determining the minimum number of data samples and training epochs needed to reach reasonable quality is crucial. We conducted detailed experiments with four target speakers for adaptive speaker text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder with an English dataset across several amounts of training data and training lengths. According to our objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers with 100 sentences of data (pairs of text and audio) and a relatively short training time.
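
    A hedged sketch of the adaptation recipe studied above: start from a pretrained model and continue training on the target speaker's few samples. The `adapt` helper, the stand-in model, and the data pipeline are hypothetical placeholders; the actual experiments use Tacotron2 and WaveGlow implementations.

        import torch

        def adapt(model: torch.nn.Module, adaptation_batches, epochs=50, lr=1e-4):
            """Fine-tune all weights on a small target-speaker set
            (e.g. on the order of 100 sentences, as in the paper)."""
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            model.train()
            for _ in range(epochs):
                for text, mel_target in adaptation_batches:
                    mel_pred = model(text)  # placeholder forward pass
                    loss = torch.nn.functional.mse_loss(mel_pred, mel_target)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
            return model

        # Tiny usage example with a stand-in model and random "mel" targets:
        dummy = torch.nn.Linear(8, 8)
        batches = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(3)]
        adapt(dummy, batches, epochs=2)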

    Towards Implementing a Software Tester for Benchmarking MAP-T Devices

    Several IPv6 transition technologies have been designed and developed over the past few years to accelerate the full adoption of IPv6. To bring order to this space, the Benchmarking Working Group of the IETF has standardized a comprehensive benchmarking methodology for these technologies in RFC 8219. Mapping of Address and Port using Translation (MAP-T) is one of the most important transition technologies in the double-translation category of RFC 8219. This paper presents our progress towards implementing the world’s first RFC 8219 compliant Tester for MAP-T devices, specifically the MAP-T Customer Edge (CE) and the MAP-T Border Relay (BR). We present a typical design for the Tester, followed by a discussion of the operational requirements, the scope of the measurements, and some design considerations. We then set up a testbed for one of the MAP-T implementations, Jool, and report its results. Finally, we briefly describe the MAP-T test program and its configuration parameters for testing the BR device.
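
    A minimal sketch of the binary-search throughput procedure that RFC 8219 inherits from RFC 2544, as a Tester like the one described would apply it to a MAP-T CE or BR. `run_trial` is a hypothetical hook that would transmit frames at `rate` frames per second for `duration` seconds through the device under test and report whether all frames were received.

        def measure_throughput(run_trial, max_rate, duration=60, tolerance=1):
            lo, hi = 0, max_rate
            while hi - lo > tolerance:
                rate = (lo + hi) // 2
                if run_trial(rate, duration):  # zero frame loss at this rate?
                    lo = rate                  # passed: try a higher rate
                else:
                    hi = rate                  # failed: back off
            return lo                          # highest zero-loss rate found

        # Stand-in device that silently drops frames above 7,300 fps:
        print(measure_throughput(lambda r, d: r <= 7300, max_rate=10_000))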

    Design of reservoir computing systems for the recognition of noise corrupted speech and handwriting
