15 research outputs found

    Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

    We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech. Accepted to ICASSP 2018.
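    One ingredient of such unit discovery is turning untranscribed audio into a sequence of discrete pseudo-phone labels. The sketch below illustrates this with a toy k-means baseline over frame-level features; the chirp signal and cluster count are illustrative assumptions, not the workshop's actual multi-modal pipelines.

    # A minimal sketch of acoustic unit discovery: cluster frame-level
    # features from raw audio (no transcription) into discrete labels.
    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    sr = 16000
    y = librosa.chirp(fmin=100, fmax=4000, sr=sr, duration=2.0)  # stand-in audio

    # Frame-level spectral features computed directly from the waveform.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T         # (frames, 13)

    # Cluster frames into K discrete units; runs of one label act as
    # subword-like segments that downstream steps could refine using
    # images or translated text, as in the workshop's setting.
    K = 8
    units = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(mfcc)
    print("discovered unit sequence (first 30 frames):", units[:30])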

    The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

    The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16, the result of collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Considerable effort was devoted to two major challenges, namely unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents the shared view of the sixteen research groups on the recent advances, major paradigm shifts, and common tool chains in speaker recognition as witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.
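    A common way to fuse such an ensemble is linear logistic-regression fusion of per-trial sub-system scores, learned on a development set. The sketch below shows the idea under stated assumptions: the scores, labels, and system count are synthetic placeholders, not the I4U recipe.

    # A minimal sketch of score-level fusion for a speaker-verification
    # ensemble: learn one weight per sub-system plus a bias on dev trials.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Hypothetical dev-set scores: one column per sub-system, one row per trial.
    n_trials, n_systems = 1000, 5
    labels = rng.integers(0, 2, size=n_trials)            # 1 = target, 0 = non-target
    scores = rng.normal(size=(n_trials, n_systems)) + 1.5 * labels[:, None]

    # Linear logistic regression: the common calibration/fusion model
    # in SRE submissions.
    fuser = LogisticRegression()
    fuser.fit(scores, labels)

    # Fused score for a new trial: weighted sum of sub-system scores plus
    # bias, interpretable as a calibrated log-likelihood ratio.
    new_trial = rng.normal(size=(1, n_systems))
    fused_llr = fuser.decision_function(new_trial)
    print("fusion weights:", fuser.coef_.ravel(), "fused score:", fused_llr)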

    An Overview of Indian Spoken Language Recognition from Machine Learning Perspective

    Automatic spoken language identification (LID) is an important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps improve the performance of many speech-based applications in multilingual scenarios. India is a populous country with diverse cultures and languages, and the majority of the Indian population needs to use their respective native languages for verbal interaction with machines. The development of efficient Indian spoken language recognition systems is therefore useful for bringing smart technologies to every section of Indian society. The field of Indian LID has gained momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has been made in this field, to the best of our knowledge there have been few attempts to review it analytically and collectively. In this work, we present one of the first comprehensive reviews of the Indian spoken language recognition research field. In-depth analysis emphasizes the unique challenges of low-resource conditions and mutual language influences for developing LID systems in the Indian context. Several essential aspects of Indian LID research are discussed: a detailed description of the available speech corpora; the major research contributions, from earlier attempts based on statistical modeling to recent approaches based on different neural network architectures; and future research trends. This review will help active researchers and research enthusiasts from related fields assess the current state of Indian LID research.
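    For orientation, the sketch below shows the simplest form of an LID front end of the kind surveyed: fixed cepstral features pooled over time, then a plain classifier. The "utterances" here are synthetic noise with a language-dependent spectral tilt, a stand-in assumption so the example runs without corpora; real Indian LID systems use labeled speech and far richer models.

    # A minimal utterance-level LID sketch: pooled MFCCs + logistic regression.
    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression

    def utterance_embedding(y, sr=16000, n_mfcc=20):
        """Mean-and-std pooled MFCCs: a crude fixed-length utterance vector."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Stand-in "languages": noise bursts shaped with different spectral tilts.
    rng = np.random.default_rng(0)
    sr = 16000
    X, y = [], []
    for lang_id, tilt in enumerate([0.2, 0.5, 0.9]):            # 3 mock languages
        for _ in range(10):
            noise = rng.normal(size=sr * 2)                     # 2 s of noise
            wav = np.convolve(noise, [1.0, -tilt], mode="same") # shape spectrum
            X.append(utterance_embedding(wav.astype(np.float32), sr))
            y.append(lang_id)

    clf = LogisticRegression(max_iter=1000).fit(np.stack(X), np.array(y))
    print("predicted language ids:", clf.predict(np.stack(X)[:3]))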

    Learning Feature Representation for Automatic Speech Recognition

    Feature extraction in automatic speech recognition (ASR) can be regarded as learning representations from lower-level to more abstract, higher-level features. Lower-level features can be viewed as features from the signal domain, such as perceptual linear predictive (PLP) and Mel-frequency cepstral coefficient (MFCC) features. Higher-level feature representations can be considered bottleneck features (BNFs) learned using deep neural networks (DNNs). In this thesis, we focus on improving feature extraction at different levels, mainly for ASR.

    The first part of this thesis focuses on learning features from the signal domain that help ASR. Hand-crafted spectral and cepstral features such as MFCCs are the main features used in most conventional ASR systems; all are inspired by physiological models of the human auditory system. However, some aspects of the signal, such as pitch, cannot be easily extracted from spectral features yet are found to be useful for ASR. We explore a new algorithm to extract a pitch feature directly from the signal and show that this feature, appended to the other features, gives consistent improvements in various languages, especially tonal languages. We then investigate replacing the conventional features with features learned jointly with the acoustic model, using time-domain and frequency-domain approaches. The results show that our time-domain joint feature learning setup matches the state-of-the-art performance of MFCCs, while our frequency-domain setup outperforms them on various datasets. Joint feature extraction, however, learns data- or language-dependent filter banks, which can degrade performance in unseen noise and channel conditions or in other languages. To tackle this, we investigate joint universal feature learning across different languages using the proposed direct-from-signal setups. We then analyze the filter banks learned in this setup and propose a new set of features as an extension to conventional Mel filter banks. The results show consistent word error rate (WER) improvements, especially in clean conditions.

    The second part of this thesis focuses on learning higher-level feature embeddings. We investigate learning and transferring deep feature representations across different domains using multi-task learning and weight-transfer approaches, adopted to explicitly learn intermediate-level features that are useful for several different tasks.
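    The "append a pitch feature to spectral features" idea can be sketched with off-the-shelf estimators, assuming librosa's pYIN tracker in place of the thesis's own pitch algorithm (which is not reproduced here); the 220 Hz tone is a stand-in for speech.

    # A minimal sketch: stack a frame-level pitch track onto MFCC features.
    import numpy as np
    import librosa

    sr = 16000
    y = librosa.tone(220.0, sr=sr, duration=2.0)            # stand-in for speech

    # Lower-level spectral features: 13 MFCCs per analysis frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, T1)

    # Frame-level pitch track (pYIN); NaN in unvoiced frames, so zero-fill.
    f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)                                  # (T2,)

    # Align frame counts and stack pitch (and voicing) as extra feature rows,
    # giving the appended representation the thesis reports helps tonal languages.
    T = min(mfcc.shape[1], f0.shape[0])
    features = np.vstack([mfcc[:, :T], f0[None, :T], voiced[None, :T]])
    print("feature matrix shape:", features.shape)          # (15, T)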