Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Automatic recognition of dysarthric speech remains a highly challenging task
to date. Neuro-motor conditions and co-occurring physical disabilities create
difficulty in large-scale data collection for ASR system development. Adapting
SSL pre-trained ASR models to limited dysarthric speech via data-intensive
parameter fine-tuning leads to poor generalization. To this end, this paper
presents an extensive comparative study of various data augmentation approaches
to improve the robustness of pre-trained ASR model fine-tuning to dysarthric
speech. These include: a) conventional speaker-independent perturbation of
impaired speech; b) speaker-dependent speed perturbation, or GAN-based
adversarial perturbation of normal, control speech based on their time
alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based
adversarial data augmentation operating on non-parallel data. Experiments
conducted on the UASpeech corpus suggest GAN-based data augmentation
consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data
augmentation and speed perturbation across different data expansion operating
points by statistically significant word error rate (WER) reductions up to
2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the
UASpeech test set of 16 dysarthric speakers. After cross-system outputs
rescoring, the best system produced the lowest published WER of 16.53% (46.47%
on very low intelligibility) on UASpeech. Comment: To appear at IEEE ICASSP 202
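The abstract above reports WER reductions in both absolute and relative terms. As a minimal sketch of how the two figures relate, the following computes WER as word-level edit distance divided by reference length, then derives both kinds of reduction. The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, improved_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction against a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    absolute = baseline_wer - improved_wer
    relative = absolute / baseline_wer
    return absolute, relative


# Hypothetical numbers: a drop from 20% to 18% WER is 2 points
# absolute, which is 10% relative to the baseline.
absolute, relative = wer_reduction(0.20, 0.18)
```

The relative figure always depends on the baseline, so the same absolute gain looks larger against a stronger baseline system.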
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
Rapid population aging has stimulated the development of assistive
devices that provide personalized medical support to people in need who suffer from
various etiologies. One prominent clinical application is a computer-assisted
speech training system which enables personalized speech therapy to patients
impaired by communicative disorders in the patient's home environment. Such a
system relies on robust automatic speech recognition (ASR) technology to be
able to provide accurate articulation feedback. With the long-term aim of
developing off-the-shelf ASR systems that can be incorporated in a clinical
context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between the dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures. Comment: to appear in Computer Speech & Language -
https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial
text overlap with arXiv:1807.1094
Automatic Transcription of Northern Prinmi Oral Art: Approaches and Challenges to Automatic Speech Recognition for Language Documentation
One significant issue facing language documentation efforts is the transcription bottleneck: each documented recording must be transcribed and annotated, and these tasks are extremely labor intensive (Ćavar et al., 2016). Researchers have sought to accelerate these tasks with partial automation via forced alignment, natural language processing, and automatic speech recognition (ASR) (Neubig et al., 2020). Neural network—especially transformer-based—approaches have enabled large advances in ASR over the last decade. Models like XLSR-53 promise improved performance on under-resourced languages by leveraging massive data sets from many different languages (Conneau et al., 2020). This project extends these efforts to a novel context, applying XLSR-53 to Northern Prinmi, a Tibeto-Burman Qiangic language spoken in Southwest China (Daudey & Pincuo, 2020).
Specifically, this thesis aims to answer two questions. First, is the XLSR-53 ASR model useful for first-pass transcription of oral art recordings from Northern Prinmi, an under-resourced tonal language? Second, does preprocessing target transcripts to combine grapheme clusters (multi-character representations of lexical tones and characters with modifying diacritics) into more phonologically salient units improve the model's predictions? Results indicate that, with substantial adaptations, XLSR-53 will be useful for this task, and that preprocessing to combine grapheme clusters does improve model performance.
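The grapheme-cluster preprocessing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's actual pipeline: it assumes diacritics appear as Unicode combining marks and that tones are written as runs of Chao tone letters (U+02E5 to U+02E9). The real Northern Prinmi transcription conventions may differ, and `to_units` is a hypothetical helper name.

```python
import unicodedata

# Assumption: tones are written with Chao tone letters (U+02E5..U+02E9).
# The orthography actually used in the thesis may differ.
TONE_LETTERS = set("\u02e5\u02e6\u02e7\u02e8\u02e9")


def to_units(text: str) -> list[str]:
    """Group a transcript into units: base character plus its combining
    diacritics, and consecutive tone letters as a single tone unit."""
    units: list[str] = []
    # NFD splits precomposed characters into base + combining marks,
    # so both spellings of the same character group identically.
    for ch in unicodedata.normalize("NFD", text):
        if units and unicodedata.combining(ch):
            # Attach a combining diacritic to the preceding unit.
            units[-1] += ch
        elif units and ch in TONE_LETTERS and units[-1][-1] in TONE_LETTERS:
            # Extend a run of tone letters into one tone unit.
            units[-1] += ch
        else:
            units.append(ch)
    return units


# "a" + combining tilde + a two-letter falling tone becomes two units.
units = to_units("a\u0303\u02e5\u02e9")
```

Each resulting unit could then be mapped to a single vocabulary entry for the model, so that a tone contour or a diacritic-bearing vowel is predicted as one symbol rather than several.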
Follow-up question handling in the IMIX and Ritel systems: A comparative study
One of the basic topics of question answering (QA) dialogue systems is how follow-up questions should be interpreted by a QA system. In this paper, we shall discuss our experience with the IMIX and Ritel systems, for both of which a follow-up question handling scheme has been developed, and corpora have been collected. These two systems are each other's opposites in many respects: IMIX is multimodal, non-factoid, black-box QA, while Ritel is speech, factoid, keyword-based QA. Nevertheless, we will show that they are quite comparable, and that it is fruitful to examine the similarities and differences. We shall look at how the systems are composed, and how real, non-expert users interact with the systems. We shall also provide comparisons with systems from the literature where possible, and indicate where open issues lie and in what areas existing systems may be improved. We conclude that most systems have a common architecture with a set of common subtasks, in particular detecting follow-up questions and finding referents for them. We characterise these tasks using the typical techniques used for performing them, and data from our corpora. We also identify a special type of follow-up question, the discourse question, which is asked when the user is trying to understand an answer, and propose some basic methods for handling it.
The Lowlands team at TRECVID 2007
In this report we summarize our methods and results for the search tasks in
TRECVID 2007. We employ two different kinds of search: purely ASR-based and
purely concept-based search. However, there is no significant difference in
performance between the two systems. Using neighboring shots for the combination of
two concepts seems to be beneficial. General preprocessing of queries increased
the performance and choosing detector sources helped. However, for all automatic
search components we need to perform further investigations.
Integrated urban water management in Texas: a review to inform a one water approach for the future
Texas has considerable experience grappling with historic droughts as well as flooding
associated with tropical storms and hurricanes, yet the State’s water management challenges
are projected to increase. Urban densification, increased frequency and severity of droughts
and floods, aging infrastructure, and a management system that is not reflective of the true
cost of water all influence water risk. Integrated urban water management strategies, like ‘One
Water’, represent an emerging management paradigm that emphasizes the interconnectedness
of water throughout the water cycle and capitalizes on opportunities that arise from this
holistic viewpoint. Here, we review water management practices in five Texas cities and
examine how the One Water approach could represent a viable framework to maintain a
reliable, sustainable, and affordable water supply for the future. We also examine financial and
business models that establish a foundational pathway towards the ‘utility of the future’ and
the One Water paradigm more broadly.
Piggybacking on an Autonomous Hauler: Business Models Enabling a System-of-Systems Approach to Mapping an Underground Mine
With ever-increasing productivity targets in mining operations, there is a
growing interest in mining automation. In future mines, remote-controlled and
autonomous haulers will operate underground guided by LiDAR sensors. We
envision reusing LiDAR measurements to maintain accurate mine maps that would
contribute to both safety and productivity. Extrapolating from a pilot project
on reliable wireless communication in Boliden's Kankberg mine, we propose
establishing a system-of-systems (SoS) with LiDAR-equipped haulers and existing
mapping solutions as constituent systems. SoS requirements engineering
inevitably adds a political layer, as independent actors are stakeholders both
on the system and SoS levels. We present four SoS scenarios representing
different business models, discussing how development and operations could be
distributed among Boliden and external stakeholders, e.g., the vehicle
suppliers, the hauling company, and the developers of the mapping software.
Based on eight key variation points, we compare the four scenarios from both
technical and business perspectives. Finally, we validate our findings in a
seminar with participants from the relevant stakeholders. We conclude that to
determine which scenario is the most promising for Boliden, trade-offs
regarding control, costs, risks, and innovation must be carefully evaluated. Comment: Preprint of industry track paper accepted for the 25th IEEE
International Conference on Requirements Engineering (RE'17)
Vocabulary size influences spontaneous speech in native language users: Validating the use of automatic speech recognition in individual differences research
Previous research has shown that vocabulary size affects performance on laboratory word production tasks. Individuals who know many words show faster lexical access and retrieve more words belonging to pre-specified categories than individuals who know fewer words. The present study examined the relationship between receptive vocabulary size and speaking skills as assessed in a natural sentence production task. We asked whether measures derived from spontaneous responses to everyday questions correlate with the size of participants’ vocabulary. Moreover, we assessed the suitability of automatic speech recognition for the analysis of participants’ responses in complex language production data. We found that vocabulary size predicted indices of spontaneous speech: Individuals with a larger vocabulary produced more words and had a higher speech-silence ratio compared to individuals with a smaller vocabulary. Importantly, these relationships were reliably identified using manual and automated transcription methods. Taken together, our results suggest that spontaneous speech elicitation is a useful method to investigate natural language production and that automatic speech recognition can alleviate the burden of labor-intensive speech transcription.