
Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset

Author: A Prudnikov
I Kipyatkova
I Medennikov
N Markovnikov
V Bataev
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/07/2020

This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian language data set -- OpenSTT. We evaluate several existing end-to-end approaches: joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves a word error rate (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system achieves 33.5%, 20.9%, and 18.6% WER. Comment: Accepted by SPECOM 202
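The WER figures quoted in this abstract follow the usual definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below illustrates that definition only; the function name `word_error_rate` and whitespace tokenization are assumptions made for illustration, not the scoring pipeline used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.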

arXiv.org e-Print Archive

Crossref

KNOWLEDGE TRANSFER FOR RUSSIAN CONVERSATIONAL TELEPHONE AUTOMATIC SPEECH RECOGNITION

Author: A. N. Romanenko
W. Minker
Y. N. Matveev
Publication venue: 'ITMO University'
Publication date: 01/03/2018

This paper describes a method of knowledge transfer from an ensemble of neural network acoustic models to a student network. The method is used to reduce computational costs and improve the quality of the speech recognition system. The experiments consider two ways of generating class labels from the ensemble of models: interpolation with alignment, and posterior probabilities. The quality of the models was also studied as a function of the smoothing coefficient. This coefficient was built into the output log-linear classifier of the neural network (the softmax layer) and was used both in the ensemble and in the student network. Additionally, the initial and final learning rates were analyzed. We established a relationship between the use of the smoothing coefficient for generating the posterior probabilities and the learning-rate parameters. Finally, applying knowledge transfer to automatic recognition of Russian conversational telephone speech reduced the WER (Word Error Rate) by 2.49% compared with the model trained on alignment from the ensemble of neural networks.
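One common way to build a smoothing coefficient into a softmax layer is to divide the logits by a temperature before normalizing. The sketch below shows that form only; it assumes the coefficient acts as a temperature divisor, which the abstract does not confirm, and the name `softmax_with_temperature` is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature.

    Assumption: the "smoothing coefficient" is modelled here as a
    temperature divisor; the abstract does not specify its exact form.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With a temperature above 1 the distribution becomes flatter, so a student network sees more of the probability mass the ensemble assigns to non-top classes.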

Directory of Open Access Journals

Variants of Deep Artificial Neural Networks for Speech Recognition Systems

Author: Карпов Алексей Анатольевич
Кипяткова Ирина Сергеевна
Publication venue: СПб ФИЦ РАН
Publication date: 15/12/2016

This paper presents a survey of the basic methods for developing acoustic and language models based on artificial neural networks for automatic speech recognition systems. The hybrid and tandem approaches to combining hidden Markov models and artificial neural networks for acoustic modelling are reviewed. The construction of language models using feedforward and recurrent neural networks is described. The survey of research in this field shows that applying artificial neural networks at both the acoustic and the language modelling stages reduces the word error rate.

Информатика и автоматизация

Computational Intelligence and Human-Computer Interaction: Modern Methods and Applications

Author
Publication venue: 'MDPI AG'
Publication date: 21/06/2022

The present book contains all of the articles that were accepted and published in the Special Issue of MDPI’s journal Mathematics titled "Computational Intelligence and Human–Computer Interaction: Modern Methods and Applications". This Special Issue covered a wide range of topics connected to the theory and application of different computational intelligence techniques to the domain of human–computer interaction, such as automatic speech recognition, speech processing and analysis, virtual reality, emotion-aware applications, digital storytelling, natural language processing, smart cars and devices, and online learning. We hope that this book will be interesting and useful for those working in various areas of artificial intelligence, human–computer interaction, and software engineering as well as for those who are interested in how these domains are connected in real-life situations

Directory of Open Access Books (DOAB)

Recognition and cortical haemodynamics of vocal emotions - an fNIRS perspective

Author: Moffat Ryssa Evelyn
Publication venue: 'University of Groningen Press'
Publication date: 01/01/2022

Normal-hearing (NH) listeners rely heavily on variations in the fundamental frequency (F0) of speech to identify vocal emotions. Without reliable F0 cues, as is the case for cochlear implant users, listeners’ ability to extract emotional meaning from speech is reduced. This thesis describes the development of an objective measure of vocal emotion recognition. The program of three experiments investigates: 1) NH listeners’ abilities to use F0, intensity, and speech-rate cues to recognise emotions; 2) cortical activity associated with individual vocal emotions, assessed using functional near-infrared spectroscopy (fNIRS); 3) cortical activity evoked by vocal emotions in natural speech and in speech with uninformative F0, also assessed using fNIRS.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

Author: Vu Ngoc Thang
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014

This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling, and language modeling. On the application side, the thesis also includes research on non-native and code-switching speech.

KITopen

Learning representations for speech recognition using artificial neural networks

Author: Swietojanski Paweł
Publication venue: The University of Edinburgh
Publication date: 29/11/2016

Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found that the proposed adaptation techniques possess many desirable properties: they are relatively low-dimensional, do not overfit and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios.
In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources such as lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC decreases WERRs by an additional 13.6%, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of the available transcribed acoustic data in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of the comparable models trained on beamform-enhanced signals.
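The LHUC idea in this abstract, learning a per-unit coefficient that rescales each hidden unit, can be sketched as an element-wise scaling of a hidden layer's activations. The sketch below is an illustration under assumptions: the amplitude 2 * sigmoid(r), which constrains each scale to the range (0, 2), and the name `lhuc_scale` are not stated in the abstract, and the learning of `r` from adaptation data is omitted.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def lhuc_scale(hidden, r):
    """Rescale each hidden activation by a per-unit amplitude.

    Assumption: the amplitude is 2 * sigmoid(r_k), so r_k = 0 leaves the
    unit unchanged. In adaptation, only r would be trained (for example,
    per speaker) while the network weights stay fixed; that training
    loop is not shown here.
    """
    if len(hidden) != len(r):
        raise ValueError("hidden and r must have the same length")
    return [2.0 * _sigmoid(rk) * hk for hk, rk in zip(hidden, r)]
```

Since there is one parameter per hidden unit, the adaptation transform stays small relative to the network itself, in line with the low-dimensional property the abstract mentions.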

Edinburgh Research Archive

Planning in Cold War Europe

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 10/02/2021

This volume aims to enlarge our understanding of planning, engineering and, more generally, regulation ideas by looking at possible forms of mutual interest and the circulation of ideas and models in a “pan-European” perspective. Contributions emphasize the role played by actors from both sides of Europe at the micro as well as the macro level, and highlight the role played by international organizations as fora and platforms where ideas and know-how were exchanged. The volume also looks at development projects in developing countries as a field where European conceptions of planned development were competing but also converging, especially in the eyes of the countries that benefited from these policies.

Directory of Open Access Books (DOAB)

Assessing speech production in English as a foreign language: an analysis of international proficiency tests and guidelines

Author: Kuerten Anna Belavina
Publication venue
Publication date: 25/10/2012

Dissertation (master's) - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente, Florianópolis, 2010. This study investigated the components of speaking ability that are addressed in the speaking scales of two tests of proficiency in English as a foreign language (TOEFL and IELTS) and two guidelines for teaching, learning, and testing (ACTFL and CEFR). To achieve the study's objective, each speaking scale was first analyzed using the checklist and rating instrument for communicative language ability proposed by Bachman (1995). This analysis revealed the degree of involvement of each component of communicative language ability in all of the speaking scales. The speaking scales were also analyzed using the framework for describing the speaking construct proposed by Fulcher (2003). The analyses showed that the ability components of the TOEFL and IELTS scales are similar, while those of the ACTFL and CEFR scales are also highly comparable. In addition, the IELTS speaking scale is more comparable to the ACTFL and CEFR speaking scales than to the TOEFL speaking scale. The main results of this study may help teachers and students better understand the components of speaking ability present in international English proficiency exams and in international guidelines for teaching, learning, and testing.

Repositório Institucional da UFSC

Planning in Cold War Europe

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date

OAPEN Library