137 research outputs found

Articulatory feature classification using convolutional neural networks

Author: Merkx D.
Scharenborg O.
Publication venue: 'International Speech Communication Association'
Publication date: 01/10/2018

The ultimate goal of our research is to improve an existing speech-based computational model of human speech recognition on the task of simulating the role of fine-grained phonetic information in human speech processing. As part of this work we are investigating articulatory feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. Articulatory feature (AF) modelling of speech has received a considerable amount of attention in automatic speech recognition research. Different approaches have been used to build AF classifiers, most notably multi-layer perceptrons. Recently, deep neural networks have been applied to the task of AF classification. This paper aims to improve AF classification by investigating two different approaches: 1) investigating the usefulness of a deep convolutional neural network (CNN) for AF classification; 2) integrating the Mel filtering operation into the CNN architecture. The results showed a remarkable improvement in classification accuracy of the CNNs over state-of-the-art AF classification results for Dutch, most notably in the minority classes. Integrating the Mel filtering operation into the CNN architecture did not further improve classification performance.
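The Mel filtering operation mentioned in this abstract can be viewed as a fixed linear layer applied to a power spectrum, which is what makes it possible to fold it into a network. The following is a minimal sketch of that idea, not the authors' implementation. The Hz-to-mel formula used here (2595 * log10(1 + f / 700)) is one common convention and is an assumption, as are the function names and the filter, bin, and sample-rate values.

```python
import math


def hz_to_mel(hz):
    # One common mel-scale convention (an assumption, not taken from the paper).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft_bins, sample_rate):
    """Build n_filters triangular filters over n_fft_bins linear-frequency bins."""
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge points, equally spaced on the mel scale.
    hz_points = [
        mel_to_hz(i * top_mel / (n_filters + 1)) for i in range(n_filters + 2)
    ]
    # Centre frequency of each linear-frequency bin, from 0 Hz to Nyquist.
    bin_freqs = [k * nyquist / (n_fft_bins - 1) for k in range(n_fft_bins)]

    filters = []
    for j in range(1, n_filters + 1):
        left, center, right = hz_points[j - 1], hz_points[j], hz_points[j + 1]
        row = []
        for f in bin_freqs:
            if left < f <= center:
                weight = (f - left) / (center - left)    # rising slope
            elif center < f < right:
                weight = (right - f) / (right - center)  # falling slope
            else:
                weight = 0.0
            row.append(weight)
        filters.append(row)
    return filters


def apply_filterbank(filters, power_spectrum):
    # Equivalent to a matrix-vector product: one output per mel filter.
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in filters]


fb = mel_filterbank(n_filters=4, n_fft_bins=33, sample_rate=16000)
mel_energies = apply_filterbank(fb, [1.0] * 33)
```

Because `apply_filterbank` is just a weighted sum per filter, the weights could in principle be treated as the initial values of a trainable layer rather than a fixed preprocessing step.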

Crossref

MPG.PuRe

WERd: Using Social Text Spelling Variants for Evaluating Dialectal Speech Recognition

Author: Ali Ahmed
Bell Peter
Nakov Preslav
Renals Steve
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/09/2017

We study the problem of evaluating automatic speech recognition (ASR) systems that target dialectal speech input. A major challenge in this case is that the orthography of dialects is typically not standardized. From an ASR evaluation perspective, this means that there is no clear gold standard for the expected output, and several possible outputs could be considered correct according to different human annotators, which makes standard word error rate (WER) inadequate as an evaluation metric. Such a situation is typical for machine translation (MT), and thus we borrow ideas from an MT evaluation metric, namely TERp, an extension of translation error rate which is closely related to WER. In particular, in the process of comparing a hypothesis to a reference, we make use of spelling variants for words and phrases, which we mine from Twitter in an unsupervised fashion. Our experiments with evaluating ASR output for Egyptian Arabic, and further manual analysis, show that the resulting WERd (i.e., WER for dialects) metric, a variant of TERp, is more adequate than WER for evaluating dialectal ASR. Comment: ASRU-201
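To illustrate why spelling variants matter for scoring, here is a minimal sketch of a word-level WER in which a substitution costs nothing when the hypothesis word is a listed variant of the reference word. This is a simplification and not the WERd metric itself, which the abstract describes as a TERp variant that also handles phrases. The function name, the variant table, and the toy English spelling pairs (standing in for dialectal variants) are all assumptions made for illustration.

```python
def wer_with_variants(reference, hypothesis, variants=None):
    """Word error rate where listed spelling variants count as matches.

    `variants` maps a reference word to a set of accepted alternative
    spellings (hypothetical structure, for illustration only).
    """
    variants = variants or {}
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    def matches(ref_word, hyp_word):
        return ref_word == hyp_word or hyp_word in variants.get(ref_word, set())

    # Standard Levenshtein dynamic programme over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if matches(ref_word, hyp_word) else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


reference = "the colour of the centre"
hypothesis = "the color of the center"
toy_variants = {"colour": {"color"}, "centre": {"center"}}

plain_wer = wer_with_variants(reference, hypothesis)
variant_wer = wer_with_variants(reference, hypothesis, toy_variants)
```

In this toy case the plain score penalises two of five words even though both spellings would be acceptable, while the variant-aware score does not.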

arXiv.org e-Print Archive

Edinburgh Research Explorer

Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

Author: Bordel García German
Peñagarikano Badiola Mikel
Rodríguez Fuentes Luis Javier
Varona Fernández Amparo
Publication venue: MDPI
Publication date: 28/07/2023

In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions with frequent code-switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, involving five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) reduces from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER reduces on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively.
As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme and a fully bilingual trigram language model. This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under the general support program to research groups (IT-1704-22).
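The relative WER reductions quoted in this abstract follow from the absolute WER figures by simple arithmetic. The short sketch below recomputes them; the function name is an assumption, and the numbers are the ones reported in the abstract.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when a rate drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# WER values (in %) as reported in the abstract.
all_segments = [16.57, 4.41, 4.02]   # baseline, first iteration, second iteration
basque_only = [16.57, 5.51, 5.13]

all_reductions = [
    round(relative_reduction(a, b), 1)
    for a, b in zip(all_segments, all_segments[1:])
]
basque_reductions = [
    round(relative_reduction(a, b), 1)
    for a, b in zip(basque_only, basque_only[1:])
]
```

Each reduction is measured against the preceding iteration, not against the baseline, which is why the second-iteration gains look small next to the first.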

Archivo Digital para la Docencia y la Investigación

Deep Maxout Networks applied to Noise-Robust Speech Recognition

Author: Calle Silos Fernando de la
Gallardo Antolín Ascensión
Peláez Moreno Carmen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Proceedings of: IberSPEECH 2014 "VIII Jornadas en Tecnologías del Habla" and "IV Iberian SLTech Workshop". Las Palmas de Gran Canaria, Spain, November 19-21, 2014. Deep Neural Networks (DNN) have become very popular for acoustic modeling due to the improvements found over traditional Gaussian Mixture Models (GMM). However, not many works have addressed the robustness of these systems under noisy conditions. Recently, the machine learning community has proposed new methods to improve the accuracy of DNNs by using techniques such as dropout and maxout. In this paper, we investigate Deep Maxout Networks (DMN) for acoustic modeling in a noisy automatic speech recognition environment. Experiments show that DMNs substantially improve recognition accuracy over DNNs and other traditional techniques in both clean and noisy conditions on the TIMIT dataset. This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT project 2011-26807/TEC. Publicad
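For readers unfamiliar with maxout, the activation replaces a fixed nonlinearity with a maximum taken over small groups of linear units. The sketch below shows only that grouping step on a list of pre-activations; it is a minimal illustration, and the function name, group size, and input values are assumptions rather than details of the networks evaluated in the paper.

```python
def maxout(pre_activations, group_size):
    """Return the maximum of each consecutive group of linear-unit outputs."""
    if group_size <= 0 or len(pre_activations) % group_size != 0:
        raise ValueError("number of units must be a positive multiple of group_size")
    return [
        max(pre_activations[start:start + group_size])
        for start in range(0, len(pre_activations), group_size)
    ]


# Six linear units grouped in pairs give three maxout outputs.
hidden = maxout([0.5, -1.0, 2.0, 0.1, -3.0, -0.2], group_size=2)
```

Note that an output can be negative (the last group here), unlike with a rectifier, because the unit simply passes through the largest value in its group.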

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

A cross-lingual adaptation approach for rapid development of speech recognizers for learning disabled users

Author: D Imseng
D-L Choi
DP Córdova Lucero
Ed Joode
F Rudzicz
GE Lancioni
I Kraljevski
J Borg
J Nouza
J Sigafoos
J Zhang
J-P Hosom
Jan Nouza
KF McCoy
L Besacier
M Bohac
MA Neerincx
Marek Bohac
Michaela Kucharova
MJF Gales
MS Hawley
O Chia Ai
O Saz
P Lal
P Xu
P Červa
Petr Červa
RA Wagner
SA Borrie
T Schultz
TH Falk
WK Seong
WR Rodríguez
Zoraida Callejas
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches target a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users’ atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users. This research was supported by the project ‘Favoreciendo la vida autónoma de discapacitados intelectuales con problemas de comunicación oral mediante interfaces personalizados de reconocimiento automático del habla’, financed by the Centre of Initiatives for Development Cooperation (Centro de Iniciativas de Cooperación al Desarrollo, CICODE), University of Granada, Spain. This research was supported by the Student Grant Scheme 2014 (SGS) at the Technical University of Liberec.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Repositorio Institucional Universidad de Granada

DSpace@TUL

On Distant Speech Recognition for Home Automation

Author: A Baba
B Lecouteux
B Vlasenko
D Istrate
F Mäyrä
F Portet
G Filho
J Barker
J Fozard
JM Valin
K McCoy
K Reidel
L Baeckman
L Lines
M Chan
M Hamill
M Vacher
M Wölfel
MK Wolters
N Takeda
P Chahuara
P Mueller
P Nocera
R López-Cózar
RC Vipperla
S Bouakaz
S Katz
T Koskela
T Pellegrini
W Edwards
W Ryan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/02/2015

The official version of this draft is available at Springer via http://dx.doi.org/10.1007/978-3-319-16226-3_7 . In the framework of Ambient Assisted Living, home automation may be a solution for helping elderly people living alone at home. This study is part of the Sweet-Home project, which aims at developing a new home automation system based on voice command to improve the support and well-being of people with loss of autonomy. The goal of the study is vocal order recognition with a focus on two aspects: distant speech recognition and sentence spotting. Several ASR techniques were evaluated on a realistic corpus acquired in a 4-room flat equipped with microphones set in the ceiling. This distant speech French corpus was recorded with 21 speakers who acted out scenarios of activities of daily living. Techniques acting at the decoding stage, such as our novel approach called Driven Decoding Algorithm (DDA), gave better speech recognition results than the baseline and other approaches. This solution, which uses the two best SNR channels and a priori knowledge (voice commands and distress sentences), has demonstrated an increase in recognition rate without introducing false alarms.
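The channel-selection step mentioned in this abstract (keeping the two microphones with the best SNR) can be sketched as a simple ranking. The abstract does not say how SNR is estimated, so the sketch below assumes per-channel signal and noise power estimates are already available; the function names and the power values are hypothetical.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def best_channels(channel_powers, k=2):
    """Return the indices of the k channels with the highest SNR.

    `channel_powers` is a list of (signal_power, noise_power) pairs,
    one per microphone channel (assumed to be estimated elsewhere).
    """
    scored = [
        (snr_db(signal, noise), index)
        for index, (signal, noise) in enumerate(channel_powers)
    ]
    scored.sort(reverse=True)  # highest SNR first
    return [index for _, index in scored[:k]]


# Hypothetical power estimates for four ceiling microphones.
powers = [(4.0, 1.0), (10.0, 1.0), (2.0, 1.0), (8.0, 1.0)]
selected = best_channels(powers)
```

Only the audio from the selected channels would then be passed on to decoding.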

Crossref

Hal - Université Grenoble Alpes

COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks

Author: Bahdanau Dzmitry
Diederik
Hakkani-Tür Dilek
Ioffe Sergey
Liang Chen
McCulloh Ian
Rocktäschel Tim
Sarikaya R.
Sutskever Ilya
van der Maaten Laurens
Zhang Xiang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/07/2018

For a company looking to provide delightful user experiences, it is of paramount importance to take care of any customer issues. This paper proposes COTA, a system to improve speed and reliability of customer support for end users through automated ticket classification and answer selection for support representatives. Two machine learning and natural language processing techniques are demonstrated: one relying on feature engineering (COTA v1) and the other exploiting raw signals through deep learning architectures (COTA v2). COTA v1 employs a new approach that converts the multi-classification task into a ranking problem, demonstrating significantly better performance in the case of thousands of classes. For COTA v2, we propose an Encoder-Combiner-Decoder, a novel deep learning architecture that allows for heterogeneous input and output feature types and injection of prior knowledge through network architecture choices. This paper compares these models and their variants on the task of ticket classification and answer selection, showing that COTA v2 outperforms COTA v1, and analyzes their inner workings and shortcomings. Finally, an A/B test is conducted in a production setting, validating the real-world impact of COTA in reducing issue resolution time by 10 percent without reducing customer satisfaction.

arXiv.org e-Print Archive

Crossref