Search CORE

24 research outputs found

Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

Author: Khare Aparna
Kumatani Kenichi
Sundaram Shiva
Wager Sanna
Wu Minhua
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 31/01/2020
Field of study

In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word-error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.Comment: To appear in ICASSP 202

arXiv.org e-Print Archive

Crossref

Capacity and Coding for 2D Channels

Author: Khare Aparna
Publication venue
Publication date
Field of study

Consider a piece of information printed on paper and scanned in the form of an image. The printer, scanner, and the paper naturally form a communication channel, where the printer is equivalent to the sender, scanner is equivalent to the receiver, and the paper is the medium of communication. The channel created in this way is quite complicated and it maps 2D input patterns to 2D output patterns. Inter-symbol interference is introduced in the channel as a result of printing and scanning. During printing, ink from the neighboring pixels can spread out. The scanning process can introduce interference in the data obtained because of the finite size of each pixel and the fact that the scanner doesn't have infinite resolution. Other degradations in the process can be modeled as noise in the system. The scanner may also introduce some spherical aberration due to the lensing effect. Finally, when the image is scanned, it might not be aligned exactly below the scanner, which may lead to rotation and translation of the image. In this work, we present a coding scheme for the channel, and possible solutions for a few of the distortions stated above. Our solution consists of the structure, encoding and decoding scheme for the code, a scheme to undo the rotational distortion, and an equalization method. The motivation behind this is the question: What is the information capacity of paper. The purpose is to find out how much data can be printed out and retrieved successfully. Of course, this question has potential practical impact on the design of 2D bar codes, which is why encodability is a desired feature. There are also a number of other useful applications however. We could successfully decode 41.435 kB of data printed on a paper of size 6.7 X 6.7 inches using a Xerox Phasor 550 printer and a Canon CanoScan LiDE200 scanner. As described in the last chapter, the capacity of the paper using this channel is clearly greater than 0.9230 kB per square inch. The main contribution of the thesis lies in constructing the entire system and testing its performance. Since the focus is on encodable and practically implementable schemes, the proposed encoding method is compared with another well known and easily encodable code, namely the repeat accumulate code

OAKTrust Digital Repository (Texas A&M Univ)

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Author: Chan David
Dheram Pranav
Ghosh Shalini
Jain Yash
Khare Aparna
Ravichandran Venkatesh
Shonibare Olabanji
Publication venue
Publication date: 28/03/2024
Field of study

Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.Comment: Accepted in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluatio

arXiv.org e-Print Archive

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Author: Chandak Chander
Chen Long
Deng Qianli
Khare Aparna
Maas Roland
Raju Anirudh
Ravichandran Venkatesh
Stolcke Andreas
Tankasala Srinath
Publication venue
Publication date: 27/03/2023
Field of study

We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.Comment: To appear in IEEE ICASSP 202

arXiv.org e-Print Archive

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Author: Chen Long
Dheram Pranav
He Di
Khare Aparna
Raju Anirudh
Ravichandran Venkatesh
Stolcke Andreas
Wang Jinhan
Wu Minhua
Publication venue
Publication date: 26/01/2024
Field of study

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.Comment: To appear in IEEE ICASSP 202

arXiv.org e-Print Archive

Recommended from our members

Characterizing Long COVID in Children and Adolescents

Author: Akshoomoff Natascha
Aschner Judy L
Aupperle Robin
Baker Fiona C
Balaraman Venkataraman
Bhattacharjee Rakesh
Bind Marie-Abele C
Bogie Amanda
Bukulmez Hulya
Carmilani Megan
Chan James
Chibnik Lori B
Coombs K
Cottrell Lesley A
Cowan Kelly
D'Sa Viren A
Dozor Allen J
Dreyer Benard P
Eckrich Daniel
Elliott Amy J
Evans Danielle N
Farkas Jonathan S
Faustino E Vincent S
Fiks Alexander G
Fischer Laura
Fitzgerald Megan L
Flaherman Valerie J
Foulkes Andrea S
Gallagher Richard
Gaur Sunanda
Gennaro Maria L
Gross Rachel S
Guan Zoe
Güthe Nick
Harahsheh Ashraf S
Hasan Uzma N
Hasson Denise C
Hornig Mady
Hsia Daniel S
Huerta-Montanez Gredia
Hummel Kathy D
Irby Katherine
Jernigan Terry L
Kadish Matt P
Kaelber David C
Karlson Elizabeth W
Katz Stuart D
Khare Manaswitha
Kinser Patricia A
Kleinman Lawrence C
Kosut Jessica S
Krishnamoorthy Aparna
Krishnan Sankaran
Lamendola-Essel Michelle F
Landeo Guttierrez Jeremy
Larrabee Jerry
Letts Rebecca J
Lim Peter Paul C
McCulloh Russell J
Metz Torri D
Michelow Ian C
Milner Joshua D
Mohandas Sindhu
Morse Richard E
Narang Shalu
Ness- Cochinwala Manette
Newburger Jane W
Nolan Sheila
Oliveira Carlos R
Palumbo Paul
Pant Deepti B
Peddie Aimee K
Raissy Hengameh
Reeder Harrison T
Rhee Kyung E
Rosario-Pabon Zaira
Rosenzweig Erika B
Ross Judith L
Ryu Julie
Salazar Juan C
Salisbury Amy L
Sato Alice I
Selvarangan Rangaraj
Snowden Jessica N
Stein Cheryl R
Stevenson Michelle D
Stockwell Melissa S
Talavera-Barber Maria M
Tantisira Kelan G
Taylor Brittany D
Teufel Ronald J
Thaweethai Tanayott
Truong Dongngan T
Warburton David
Weakley Kathryn E
Werzberger Alan
Wood John C
Yin Shonna
Zempsky William T
Zimmerman Emily
Publication venue: eScholarship, University of California
Publication date: 01/08/2024
Field of study

ImportanceMost research to understand postacute sequelae of SARS-CoV-2 infection (PASC), or long COVID, has focused on adults, with less known about this complex condition in children. Research is needed to characterize pediatric PASC to enable studies of underlying mechanisms that will guide future treatment.ObjectiveTo identify the most common prolonged symptoms experienced by children (aged 6 to 17 years) after SARS-CoV-2 infection, how these symptoms differ by age (school-age [6-11 years] vs adolescents [12-17 years]), how they cluster into distinct phenotypes, and what symptoms in combination could be used as an empirically derived index to assist researchers to study the likely presence of PASC.Design, setting, and participantsMulticenter longitudinal observational cohort study with participants recruited from more than 60 US health care and community settings between March 2022 and December 2023, including school-age children and adolescents with and without SARS-CoV-2 infection history.ExposureSARS-CoV-2 infection.Main outcomes and measuresPASC and 89 prolonged symptoms across 9 symptom domains.ResultsA total of 898 school-age children (751 with previous SARS-CoV-2 infection [referred to as infected] and 147 without [referred to as uninfected]; mean age, 8.6 years; 49% female; 11% were Black or African American, 34% were Hispanic, Latino, or Spanish, and 60% were White) and 4469 adolescents (3109 infected and 1360 uninfected; mean age, 14.8 years; 48% female; 13% were Black or African American, 21% were Hispanic, Latino, or Spanish, and 73% were White) were included. Median time between first infection and symptom survey was 506 days for school-age children and 556 days for adolescents. In models adjusted for sex and race and ethnicity, 14 symptoms in both school-age children and adolescents were more common in those with SARS-CoV-2 infection history compared with those without infection history, with 4 additional symptoms in school-age children only and 3 in adolescents only. These symptoms affected almost every organ system. Combinations of symptoms most associated with infection history were identified to form a PASC research index for each age group; these indices correlated with poorer overall health and quality of life. The index emphasizes neurocognitive, pain, and gastrointestinal symptoms in school-age children but change or loss in smell or taste, pain, and fatigue/malaise-related symptoms in adolescents. Clustering analyses identified 4 PASC symptom phenotypes in school-age children and 3 in adolescents.Conclusions and relevanceThis study developed research indices for characterizing PASC in children and adolescents. Symptom patterns were similar but distinguishable between the 2 groups, highlighting the importance of characterizing PASC separately for these age ranges

eScholarship - University of California