
    Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

    In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model for use in the speech recognition system. For the student, both the multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using an L2 loss. We find that pre-training improves the WER by 10.7% compared to a multi-channel model whose front end is directly initialized with beamformer and mel-filter bank coefficients. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.
    Comment: To appear in ICASSP 202
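    The abstract gives no implementation details, but the two objectives it describes are standard; below is a minimal sketch in PyTorch, where all function and tensor names are assumptions made for illustration, not the paper's code.

        import torch.nn.functional as F

        def teacher_student_loss(student_logits, teacher_logits, temperature=1.0):
            # The student's classification outputs are trained against soft
            # targets derived from the offline teacher's logits.
            t = temperature
            soft_targets = F.softmax(teacher_logits / t, dim=-1)
            log_probs = F.log_softmax(student_logits / t, dim=-1)
            return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)

        def frontend_pretraining_loss(frontend_lfbe, beamformed_lfbe):
            # L2 loss pushing the learnable multi-channel front end to
            # reproduce beamformed log-mel filter bank energies (LFBE).
            return F.mse_loss(frontend_lfbe, beamformed_lfbe)

    Per the abstract, the front end would first be pre-trained with the L2 objective, after which the full student (front end plus classification layers) is trained with the teacher-student objective on untranscribed data.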

    Capacity and Coding for 2D Channels

    Consider a piece of information printed on paper and scanned in the form of an image. The printer, scanner, and paper naturally form a communication channel, where the printer is equivalent to the sender, the scanner is equivalent to the receiver, and the paper is the medium of communication. The channel created in this way is quite complicated, mapping 2D input patterns to 2D output patterns. Inter-symbol interference is introduced in the channel as a result of printing and scanning: during printing, ink from neighboring pixels can spread out, and the scanning process can introduce interference because each pixel has finite size and the scanner does not have infinite resolution. Other degradations in the process can be modeled as noise in the system. The scanner may also introduce some spherical aberration due to the lensing effect. Finally, when the image is scanned, it might not be aligned exactly below the scanner, which may lead to rotation and translation of the image. In this work, we present a coding scheme for the channel and possible solutions for a few of the distortions stated above. Our solution consists of the code structure with its encoding and decoding schemes, a scheme to undo the rotational distortion, and an equalization method. The motivation behind this is the question: what is the information capacity of paper? The purpose is to find out how much data can be printed out and retrieved successfully. Of course, this question has potential practical impact on the design of 2D bar codes, which is why encodability is a desired feature; there are, however, a number of other useful applications. We successfully decoded 41.435 kB of data printed on a 6.7 × 6.7 inch page using a Xerox Phaser 550 printer and a Canon CanoScan LiDE200 scanner. As described in the last chapter, the capacity of paper under this channel is therefore greater than 0.9230 kB per square inch. The main contribution of the thesis lies in constructing the entire system and testing its performance. Since the focus is on encodable and practically implementable schemes, the proposed encoding method is compared with another well-known and easily encodable code, namely the repeat-accumulate code.
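    The reported density follows directly from the decoded payload and the page area; a quick back-of-the-envelope check (not code from the thesis):

        # 41.435 kB successfully decoded from a 6.7 in x 6.7 in page
        payload_kb = 41.435
        area_sq_in = 6.7 * 6.7             # 44.89 square inches
        density = payload_kb / area_sq_in  # ~0.9230 kB per square inch
        print(f"{density:.4f} kB/in^2")    # matches the stated lower bound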

    Cross-utterance ASR Rescoring with Graph-based Label Propagation

    We propose a novel approach to ASR N-best hypothesis rescoring with graph-based label propagation, leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances rather than individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
    Comment: To appear in IEEE ICASSP 202
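    The abstract does not spell out the propagation rule, but graph-based label propagation over a similarity graph commonly takes the random-walk form sketched below; the embedding source, similarity kernel, and hyperparameters are all assumptions for illustration, not the paper's design.

        import numpy as np

        def propagate_scores(embeddings, initial_scores, alpha=0.8, iters=50):
            # embeddings: (n, d) acoustic embeddings, one per hypothesis
            # initial_scores: (n,) first-pass ASR hypothesis scores
            sim = embeddings @ embeddings.T           # pairwise acoustic similarity
            np.fill_diagonal(sim, 0.0)                # drop self-loops
            row_sums = sim.sum(axis=1, keepdims=True)
            s = sim / np.maximum(row_sums, 1e-8)      # row-normalized transition matrix
            f = initial_scores.astype(float)
            for _ in range(iters):
                # Each score is smoothed toward its acoustic neighbors while
                # staying anchored to its first-pass score.
                f = alpha * (s @ f) + (1 - alpha) * initial_scores
            return f

    Rescoring would then select, per utterance, the hypothesis with the highest propagated score, letting acoustically similar utterances inform each other's decisions collaboratively.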

    ASR-Aware End-to-end Neural Diarization

    We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word, and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of the Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.
    Comment: To appear in ICASSP 202
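    As a sketch of the third modification only (multi-task training), the combined objective might look like the following; the real EEND diarization loss is permutation-invariant over speakers, which this illustration omits, and the auxiliary loss weight is an assumed hyperparameter.

        import torch.nn.functional as F

        def eend_multitask_loss(diar_logits, diar_labels,
                                asr_logits, asr_labels, aux_weight=0.1):
            # Per-frame, per-speaker activity loss (PIT omitted for brevity);
            # diar_logits and diar_labels share shape (frames, speakers).
            diar_loss = F.binary_cross_entropy_with_logits(diar_logits, diar_labels)
            # Auxiliary classification of ASR-derived features, e.g. per-frame
            # position-in-word labels: asr_logits (frames, classes), asr_labels (frames,).
            aux_loss = F.cross_entropy(asr_logits, asr_labels)
            return diar_loss + aux_weight * aux_loss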
