
    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Towards Natural Human Control and Navigation of Autonomous Wheelchairs

    Approximately 2.2 million people in the United States depend on a wheelchair to assist with their mobility. Oftentimes, the wheelchair user can maneuver using a conventional joystick. However, visual impairment or conditions that restrict hand mobility, such as stroke, arthritis, limb injury, Parkinson’s disease, cerebral palsy, or multiple sclerosis, prevent many patients from using traditional joystick controls. The resulting mobility limitations force these patients to rely on caretakers to perform everyday tasks, minimizing the independence of the wheelchair user. Modern speech recognition systems can be used to enhance user experiences with electronic devices. By expanding the motorized wheelchair control interface to include the detection of user speech commands, independence is given back to the mobility impaired. A speech recognition interface was developed for a smart wheelchair. By integrating navigation commands with a map of the wheelchair’s surroundings, the wheelchair interface is more natural and intuitive to use. Complex speech patterns are interpreted so that users can command the smart wheelchair to navigate to specified locations within the map. Pocketsphinx, a speech toolkit, is used to interpret the vocal commands. A language model and dictionary were generated based on a set of possible commands and locations supplied to the speech recognition interface. The commands fall under the categories of speed, directional, or destination commands. Speed commands modify the relative speed of the wheelchair. Directional commands modify the relative direction of the wheelchair. Destination commands require a known location on the map to navigate to. The completion of the speech input processor and the connection between wheelchair components via the Robot Operating System make map navigation possible.
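    The three command categories described in the abstract (speed, directional, destination) suggest a simple dispatch step after recognition. The following is an illustrative sketch only, not the paper's implementation; all command words, location names, and function names here are hypothetical:

    ```python
    # Hypothetical routing of a recognized utterance into the three command
    # categories the abstract describes: speed, directional, destination.
    SPEED_CMDS = {"faster", "slower", "stop"}
    DIRECTION_CMDS = {"forward", "backward", "left", "right"}
    KNOWN_LOCATIONS = {"kitchen", "bedroom", "office"}  # assumed map waypoints

    def classify_command(utterance: str) -> tuple[str, str]:
        """Return (category, argument) for a recognized utterance."""
        for word in utterance.lower().split():
            if word in SPEED_CMDS:
                return ("speed", word)
            if word in DIRECTION_CMDS:
                return ("direction", word)
            if word in KNOWN_LOCATIONS:
                return ("destination", word)
        return ("unknown", utterance)
    ```

    In a real system the vocabulary would come from the same word list used to build the Pocketsphinx language model and dictionary, so the recognizer can only emit words this router understands.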

    Efficient training algorithms for HMMs using incremental estimation

    Typically, parameter estimation for a hidden Markov model (HMM) is performed using an expectation-maximization (EM) algorithm with the maximum-likelihood (ML) criterion. The EM algorithm is an iterative scheme that is well-defined and numerically stable, but convergence may require a large number of iterations. For speech recognition systems utilizing large amounts of training material, this results in long training times. This paper presents an incremental estimation approach to speed up the training of HMMs without any loss of recognition performance. The algorithm selects a subset of data from the training set, updates the model parameters based on the subset, and then iterates the process until convergence of the parameters. The advantage of this approach is a substantial increase in the number of iterations of the EM algorithm per training token, which leads to faster training. In order to achieve reliable estimation from a small fraction of the complete data set at each iteration, two training criteria are studied: ML and maximum a posteriori (MAP) estimation. Experimental results show that the training of the incremental algorithms is substantially faster than the conventional (batch) method and suffers no loss of recognition performance. Furthermore, the incremental MAP-based training algorithm improves performance over the batch version.
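    The select-subset / E-step / M-step loop generalizes beyond HMMs. As a sketch of the idea (not the paper's algorithm), the same incremental ML scheme is shown here on a one-dimensional two-component Gaussian mixture, which needs far less machinery than HMM re-estimation while preserving the loop structure:

    ```python
    import numpy as np

    # Illustrative only: the paper applies incremental EM to HMMs; a 1-D
    # two-component Gaussian mixture stands in here, since it exhibits the
    # same select-subset -> E-step -> M-step iteration.
    rng = np.random.default_rng(0)

    def incremental_em(data, n_iters=100, batch_size=200):
        mu = np.array([data.min(), data.max()], dtype=float)  # crude init
        sigma = np.array([data.std(), data.std()])
        pi = np.array([0.5, 0.5])
        for _ in range(n_iters):
            # Select a subset of the training set (the "incremental" step).
            batch = rng.choice(data, size=min(batch_size, len(data)), replace=False)
            # E-step on the subset: posterior responsibilities per component.
            d = batch[:, None] - mu[None, :]
            log_p = -0.5 * (d / sigma) ** 2 - np.log(sigma) + np.log(pi)
            r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            # M-step: ML re-estimation from subset statistics only.
            nk = r.sum(axis=0)
            mu = (r * batch[:, None]).sum(axis=0) / nk
            sigma = np.sqrt((r * (batch[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
            pi = nk / nk.sum()
        return mu, sigma, pi
    ```

    Because each pass touches only `batch_size` samples, the parameters are updated many more times per unit of data seen than in batch EM, which is the source of the speed-up the abstract reports.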

    A weighted MVDR beamformer based on SVM learning for sound source localization

    A weighted minimum variance distortionless response (WMVDR) algorithm for near-field sound localization in a reverberant environment is presented. The steered response power computation of the WMVDR is based on a machine learning component which improves the incoherent frequency fusion of the narrowband power maps. A support vector machine (SVM) classifier is adopted to select the components of the fusion. The skewness measure of the narrowband power map marginal distribution is shown to be an effective feature for the supervised learning of the power map selection. Experiments with both simulated and real data demonstrate the improvement of the WMVDR beamformer localization accuracy with respect to other state-of-the-art techniques. Salvati, Daniele; Drioli, Carlo; Foresti, Gian Luca
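    The skewness feature and the selective fusion can be sketched as follows. This is a minimal illustration under assumed shapes, not the authors' code; a simple callable threshold stands in for the trained SVM classifier:

    ```python
    import numpy as np

    def skewness(power_map):
        """Sample skewness of a narrowband power map's marginal distribution."""
        x = np.asarray(power_map, dtype=float).ravel()
        m, s = x.mean(), x.std()
        return ((x - m) ** 3).mean() / (s ** 3 + 1e-12)

    def fuse_selected(power_maps, selector):
        """Incoherently fuse only the narrowband maps the selector accepts.

        power_maps: (n_freqs, n_points) steered-response power per frequency
                    bin (assumed layout).
        selector:   callable on the skewness value, standing in for the SVM
                    classifier used in the paper.
        """
        keep = [pm for pm in power_maps if selector(skewness(pm))]
        if not keep:
            return np.mean(power_maps, axis=0)  # fall back to plain fusion
        return np.mean(keep, axis=0)
    ```

    The intuition is that a narrowband map dominated by a single sharp source peak has a strongly right-skewed value distribution, while diffuse reverberant energy yields a flatter, low-skewness map that only blurs the fused result.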

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The goal is for participants to devise a single system that can generalize across different array geometries and use cases with no a priori information. Another departure from earlier CHiME iterations is that participants are allowed to use open-source pre-trained models and datasets. In this paper, we describe the challenge design, motivation, and fundamental research questions in detail. We also present the baseline system, which is fully array-topology agnostic and features multi-channel diarization, channel selection, guided source separation, and a robust ASR model that leverages self-supervised speech representations (SSLR).

    Developmental Robots - A New Paradigm

    It has proved extremely challenging for humans to program a robot to such a sufficient degree that it acts properly in a typical unknown human environment. This is especially true for a humanoid robot, due to the very large number of redundant degrees of freedom and the large number of sensors required for a humanoid to work safely and effectively in the human environment. How can we address this fundamental problem? Motivated by human mental development from infancy to adulthood, we present a theory, an architecture, and some experimental results showing how to enable a robot to develop its mind automatically, through online, real-time interactions with its environment. Humans mentally “raise” the robot through “robot sitting” and “robot schools” instead of task-specific robot programming.

    Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach

    The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities at each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources.