
    Environmentally robust ASR front-end for deep neural network acoustic models

    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant talking situations, where acoustic environmental distortion degrades the recognition performance. Training of a DNN-based acoustic model consists of generation of state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to the speech quality than the alignments and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00

    Impact of single-microphone dereverberation on DNN-based meeting transcription systems

    Over the past few decades, a range of front-end techniques have been proposed to improve the robustness of automatic speech recognition systems against environmental distortion. While these techniques are effective for small tasks consisting of carefully designed data sets, especially when used with a classical acoustic model, there has been limited evidence that they are useful for a state-of-the-art system with large scale realistic data. This paper focuses on reverberation as a type of distortion and investigates the degree to which dereverberation processing can improve the performance of various forms of acoustic models based on deep neural networks (DNNs) in a challenging meeting transcription task using a single distant microphone. Experimental results show that dereverberation improves the recognition performance regardless of the acoustic model structure and the type of the feature vectors input into the neural networks, providing additional relative improvements of 4.7% and 4.1% to our best configured speaker-independent and speaker-adaptive DNN-based systems, respectively. Xie Chen was funded by Toshiba Research Europe Ltd, Cambridge Research Lab. This is the accepted manuscript of a paper published in the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, written by Yoshioka, T.; Xie Chen; Gales, M.J.F.

    Robust excitation-based features for Automatic Speech Recognition

    In this paper we investigate the use of noise-robust features characterizing the speech excitation signal as complementary features to the usually considered vocal tract based features for automatic speech recognition (ASR). The features are tested in a state-of-the-art Deep Neural Network (DNN) based hybrid acoustic model for speech recognition. The suggested excitation features expand the set of excitation features previously considered for ASR, with the expectation that these features help to discriminate better between the broad phonetic classes (e.g., fricatives, nasals, vowels, etc.). Relative improvements in the word error rate are observed in the AMI meeting transcription system, with greater gains (about 5%) if PLP features are combined with the suggested excitation features. For Aurora 4, significant improvements are observed as well. Combining the suggested excitation features with filter banks, a word error rate of 9.96% is achieved. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ICASSP.2015.717885

    Blind Normalization of Speech From Different Channels

    We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. This is done by blindly rescaling the cepstral time series by a non-linear function, with the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel-independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberations were required (unlike spectral subtraction). Comment: 25 pages, 7 figures
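
For orientation, the cepstral mean normalization baseline that the abstract above compares against can be sketched as follows. This is a minimal illustration of the standard baseline only, not the paper's blind rescaling method; the function names and toy values are hypothetical, and the channel is assumed to be a noise-free, time-invariant linear filter, which appears as a constant additive offset in the cepstral domain.

```python
from typing import List

Frames = List[List[float]]  # one inner list of cepstral coefficients per frame


def cepstral_mean_normalize(cepstra: Frames) -> Frames:
    """Subtract the per-coefficient mean taken over all frames of one utterance."""
    if not cepstra:
        return []
    n_frames = len(cepstra)
    n_coeffs = len(cepstra[0])
    means = [sum(frame[k] for frame in cepstra) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in cepstra]


def apply_channel_offset(cepstra: Frames, offset: List[float]) -> Frames:
    """Simulate an assumed time-invariant channel as a constant cepstral offset."""
    return [[c + o for c, o in zip(frame, offset)] for frame in cepstra]


# Toy data (hypothetical values): three frames, two coefficients each.
clean = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
distorted = apply_channel_offset(clean, [0.5, -1.5])

norm_clean = cepstral_mean_normalize(clean)
norm_distorted = cepstral_mean_normalize(distorted)
```

Under these assumptions the two normalized sequences coincide, because the constant offset is absorbed into the mean. Additive noise and reverberation longer than the analysis window break this simple picture, which is the case the abstract's method targets.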

    Clinical and Genetic Analysis of Children with Kartagener Syndrome

    Primary ciliary dyskinesia (PCD) is a rare autosomal recessive disorder characterized by dysfunction of motile cilia causing ineffective mucus clearance and organ laterality defects. In this study, two unrelated Portuguese children with strong PCD suspicion underwent extensive clinical and genetic assessments by whole-exome sequencing (WES), as well as ultrastructural analysis of cilia by transmission electron microscopy (TEM) to identify their genetic etiology. These analyses confirmed the diagnosis of Kartagener syndrome (KS) (PCD with situs inversus). Patient-1 showed a predominance of the absence of the inner dynein arms with two disease-causing variants in the CCDC40 gene. Patient-2 showed the absence of both dynein arms and WES disclosed two novel high impact variants in the DNAH5 gene and two missense variants in the DNAH7 gene, all possibly deleterious. Moreover, in Patient-2, functional data revealed a reduction of gene expression and protein mislocalization in both genes' products. Our work draws researchers' attention to the complexity of PCD and to the possibility of gene interactions modelling the PCD phenotype. Further, it is demonstrated that even for well-known PCD genes, novel pathogenic variants could have importance for a PCD/KS diagnosis, reinforcing the difficulty of providing genetic counselling and prenatal diagnosis to families. RP was funded by a PhD grant from the National Foundation for Science and Technology (FCT) (Ref.: PD/BD/105767/2014). This work was also supported by the Institutions of the authors and in part by FCT/UMIB (Pest-OE/SAU/UI0215/2014). info:eu-repo/semantics/publishedVersio

    The RCK2 domain of the human BKCa channel is a calcium sensor

    Large conductance voltage and Ca2+-dependent K+ channels (BKCa) are activated by both membrane depolarization and intracellular Ca2+. Recent studies on bacterial channels have proposed that a Ca2+-induced conformational change within specialized regulators of K+ conductance (RCK) domains is responsible for channel gating. Each pore-forming α subunit of the homotetrameric BKCa channel is expected to contain two intracellular RCK domains. The first RCK domain in BKCa channels (RCK1) has been shown to contain residues critical for Ca2+ sensitivity, possibly participating in the formation of a Ca2+-binding site. The location and structure of the second RCK domain in the BKCa channel (RCK2) are still being examined, and the presence of a high-affinity Ca2+-binding site within this region is not yet established. Here, we present a structure-based alignment of the C terminus of BKCa and prokaryotic RCK domains that reveals the location of a second RCK domain in human BKCa channels (hSloRCK2). hSloRCK2 includes a high-affinity Ca2+-binding site (Ca bowl) and contains secondary structural elements similar to those of the bacterial RCK domains. Using CD spectroscopy, we provide evidence that hSloRCK2 undergoes a Ca2+-induced change in conformation, associated with an α-to-β structural transition. We also show that the Ca bowl is an essential element for the Ca2+-induced rearrangement of hSloRCK2. We speculate that the molecular rearrangements of RCK2 likely underlie the Ca2+-dependent gating mechanism of BKCa channels. A structural model of the heterodimeric complex of hSloRCK1 and hSloRCK2 domains is discussed.

    On the extraction of weak transition strengths via the (3He,t) reaction at 420 MeV

    Differential cross sections for transitions of known weak strength were measured with the (3He,t) reaction at 420 MeV on targets of 12C, 13C, 18O, 26Mg, 58Ni, 60Ni, 90Zr, 118Sn, 120Sn and 208Pb. Using these data, it is shown that the proportionalities between strengths and cross sections for this probe follow simple trends as a function of mass number. These trends can be used to confidently determine Gamow-Teller strength distributions in nuclei for which the proportionality cannot be calibrated via beta-decay strengths. Although theoretical calculations in distorted-wave Born approximation overestimate the data, they allow one to understand the main experimental features and to predict deviations from the simple trends observed in some of the transitions. Comment: 4 pages, 2 figures

    The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition

    This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered the whole range of seven weeks of BBC TV output across four channels, resulting in about 1,600 hours of broadcast audio. In addition, several hundred million words of BBC subtitle text were provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting, i.e. recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered the opportunity for systems to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.