40 research outputs found

    Comparative analysis of speech recognition algorithms in UAV voice control system

    Get PDF
    The article presents a comparative analysis of algorithms for processing voice-control signals for an unmanned aerial vehicle (UAV) that can be implemented on processors with low computing power using online, real-time processing. These approaches are shown to be effective in improving speech recognition accuracy in the presence of various types of noise and a sound-reflecting control environment, which is an important problem in UAV voice control systems. An algorithm for computing the mel-frequency cepstral coefficients (MFCCs), which serve as the main speech recognition features, is presented. Two methods of extracting informative speech recognition features in the UAV voice control system are compared: mel-frequency cepstral coefficients and coefficients obtained with a linear prediction algorithm. As a result of the experiment, conducted under specified noise conditions, it was concluded that MFCCs are the optimal feature extraction method for these tasks, since they score best on the absolute criterion of speech recognition quality. The expediency of using the proposed cepstral-analysis-based system for recognizing UAV voice commands is substantiated and experimentally proved. The experimental results support the further practical application of the developed system for recognizing voice commands for UAV control on the basis of cepstral analysis.
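    The MFCC extraction the article relies on follows the standard pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT). Below is a minimal NumPy sketch of that pipeline, assuming a mono input signal; the frame sizes, 26 mel bands, and 13 output coefficients are common defaults rather than values taken from the article.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(y, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: frame -> window -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Pre-emphasis boosts high frequencies, which carry much consonant energy.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank: filters equally spaced on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT-II decorrelates them into cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```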

    Studies on noise robust automatic speech recognition

    Get PDF
    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches to noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise-robust automatic speech recognition (course code T-61.6060) held at TKK.

    Genetic Algorithm for Combined Speaker and Speech Recognition using Deep Neural Networks, Journal of Telecommunications and Information Technology, 2018, nr 2

    Get PDF
    Huge growth is observed in the speech and speaker recognition field due to the many artificial intelligence algorithms being applied. Speech conveys messages via the language spoken, as well as emotion, gender and speaker identity. Many real applications in healthcare are based on speech and speaker recognition, e.g. a voice-controlled wheelchair. In this paper, we use a genetic algorithm (GA) for combined speaker and speech recognition, relying on optimized Mel Frequency Cepstral Coefficient (MFCC) speech features, with classification performed by a Deep Neural Network (DNN). In the first phase, features are extracted using MFCC and then optimized using the GA. In the second phase, training is conducted using the DNN. The proposed model is evaluated and validated in a real environment, and efficiency is calculated on the basis of parameters such as accuracy, precision rate, recall rate, sensitivity, and specificity. The paper also evaluates feature extraction methods, namely linear predictive coding coefficients (LPCC), perceptual linear prediction (PLP), mel frequency cepstral coefficients (MFCC) and relative spectra filtering (RASTA), for combined speaker and speech recognition systems, and compares the different methods in both clean and noisy environments.
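    As a rough illustration of the GA-optimized MFCC pipeline described above, the sketch below evolves a binary feature mask whose fitness is the cross-validated accuracy of a small neural network. The synthetic data, fitness function, and GA parameters are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: X is an (utterances x MFCC-features) matrix, y the labels.
X = rng.normal(size=(200, 39))    # e.g. 13 MFCCs + deltas + delta-deltas
y = rng.integers(0, 4, size=200)  # e.g. 4 speaker/command classes

def fitness(mask):
    """Fitness of a binary feature mask: cross-validated accuracy of a small DNN."""
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(n_feat, pop_size=20, n_gen=15, p_mut=0.05):
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # Tournament selection: the better of two random individuals is a parent.
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # Single-point crossover between consecutive parents.
        cut = rng.integers(1, n_feat, size=pop_size)
        children = np.array([np.concatenate([parents[i][:c],
                                             parents[(i + 1) % pop_size][c:]])
                             for i, c in enumerate(cut)])
        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()]

best_mask = ga_select(X.shape[1])  # features to feed the DNN classifier
```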

    Development of Bilingual ASR System for MediaParl Corpus

    Get PDF
    The development of an Automatic Speech Recognition (ASR) system for the bilingual MediaParl corpus is challenging for several reasons: (1) reverberant recordings, (2) accented speech, and (3) no prior information about the language. In that context, we employ frequency domain linear prediction-based (FDLP) features to reduce the effect of reverberation, exploit bilingual deep neural networks applied in Tandem and hybrid acoustic modeling approaches to significantly improve ASR for accented speech, and develop a fully bilingual ASR system using entropy-based decoding-graph selection. Our experiments indicate that the proposed bilingual ASR system performs similarly to a language-specific ASR system when approximately five seconds of speech are available.
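    A minimal sketch of what entropy-based decoding-graph selection could look like, assuming frame-level DNN posteriors whose output classes are grouped per language; the layout, threshold, and graph names below are assumptions for illustration, not details of the MediaParl system.

```python
import numpy as np

def select_decoding_graph(posteriors, graphs=("french", "german")):
    """Pick a decoding graph from frame-level DNN posteriors.

    `posteriors` is a (frames x classes) matrix whose output classes are
    assumed grouped per language (first half French, second half German).
    """
    half = posteriors.shape[1] // 2
    # Probability mass assigned to each language, averaged over frames.
    lang_mass = np.stack([posteriors[:, :half].sum(axis=1),
                          posteriors[:, half:].sum(axis=1)], axis=1)
    p = lang_mass.mean(axis=0)
    p /= p.sum()
    # Entropy of the language posterior: low entropy = confident decision.
    entropy = -np.sum(p * np.log2(p + 1e-12))
    if entropy < 0.5:                   # assumed confidence threshold
        return graphs[int(p.argmax())]  # use the language-specific graph
    return "bilingual"                  # fall back to the merged graph
```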

    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Full text link
    Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts. As the semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use them separately as input to train multiple speech recognition models. Empirical results demonstrate that layers closer to the input retain more phonetic information, while subsequent layers exhibit greater invariance to domain shift. Moreover, while most previous studies include speech recognition training data when training the feature extractor, our grounding models are not trained on any of those data, indicating more universal applicability to new domains. Comment: Accepted to Interspeech 2019. 4 pages, 2 figures.
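    A minimal sketch of the layer-wise distillation setup, with a hypothetical stand-in encoder: intermediate activations of a pretrained speech branch are tapped with a forward hook and reused as input features for an ASR model. The architecture and layer indices are illustrative, not the paper's actual grounding model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the speech branch of an audio-visual grounding
# model; the real model is trained on image/speech pairs, not defined here.
audio_encoder = nn.Sequential(
    nn.Conv1d(40, 128, 5, padding=2), nn.ReLU(),   # layer 0: close to input
    nn.Conv1d(128, 256, 5, padding=2), nn.ReLU(),  # layer 2: more abstract
    nn.Conv1d(256, 512, 5, padding=2), nn.ReLU(),  # layer 4: most invariant
)

def distill_features(mel_frames, layer_idx):
    """Return one encoder layer's activations as input features for ASR."""
    feats = {}
    hook = audio_encoder[layer_idx].register_forward_hook(
        lambda module, inp, out: feats.update(h=out.detach()))
    with torch.no_grad():
        audio_encoder(mel_frames)  # (batch, 40 mel bins, time)
    hook.remove()
    # (batch, channels, time) -> (batch, time, channels) for a recognizer
    return feats["h"].transpose(1, 2)

x = torch.randn(8, 40, 200)       # dummy batch of log-mel spectrograms
shallow = distill_features(x, 0)  # retains more phonetic detail
deep = distill_features(x, 4)     # more domain-invariant, per the paper's finding
```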

    Single channel overlapped-speech detection and separation of spontaneous conversations

    Get PDF
    PhD Thesis. The thesis considers spontaneous conversation containing both speech mixture and speech dialogue. Speech mixture refers to speakers speaking simultaneously (i.e. overlapped speech); speech dialogue refers to one speaker actively speaking while the other is silent. The input conversation is first processed by overlapped-speech detection, and the two output signals are segregated into dialogue and mixture. The dialogue is processed by speaker diarization, whose outputs are the individual speech of each speaker. The mixture is processed by speech separation, whose outputs are independent separated speech signals of the speakers. When the separation input contains only the mixture, a blind speech separation approach is used; when the separation is assisted by the outputs of the speaker diarization, it is informed speech separation. The research presents a novel overlapped-speech detection algorithm and two novel speech separation algorithms. The proposed overlapped-speech detection algorithm estimates the switching instants of the input. An optimization loop, based on principles of pattern recognition and k-means clustering, is adapted to adopt the best encapsulated audio features and to avoid the worst. Over 300 simulated conversations, the average false-alarm error is 1.9%, the missed-speech error is 0.4%, and the overlap-speaker error is 1%; these errors approximately equal those of the best recent reliable speaker diarization corpora. The proposed blind speech separation algorithm consists of four sequential techniques: filter-bank analysis, Non-negative Matrix Factorization (NMF), speaker clustering, and filter-bank synthesis. Instead of the speaker segmentation normally required, an effective standard framing is contributed. The average objective scores (SAR, SDR and SIR) over 51 simulated conversations are 5.06 dB, 4.87 dB and 12.47 dB respectively. For the proposed informed speech separation algorithm, the outputs of the speaker diarization form a generated database, which assists the separation by creating virtual targeted-speech and mixture signals. The contributed virtual signals are trained to facilitate the separation by homogenising them with the NMF-matrix elements of the real mixture, and a contributed masking step optimizes the resulting speech. The average SAR, SDR and SIR over 341 simulated conversations are 9.55 dB, 1.12 dB and 2.97 dB respectively. By these objective tests, the two speech separation algorithms are in the mid-range of well-known NMF-based audio and speech separation methods.
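    A minimal sketch of the NMF-plus-clustering separation step described above, assuming a non-negative magnitude spectrogram of a two-speaker mixture; the component count, k-means grouping of basis spectra, and soft masking are illustrative choices, not the thesis's exact configuration (which also includes filter-bank analysis and synthesis).

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def nmf_separate(mag, n_components=20, n_speakers=2):
    """Separate a two-speaker magnitude spectrogram via NMF and clustering.

    mag: (freq_bins x time_frames) non-negative magnitude spectrogram.
    Returns one masked spectrogram per estimated speaker.
    """
    model = NMF(n_components=n_components, init="nndsvd", max_iter=400)
    W = model.fit_transform(mag)   # (freq x components) spectral bases
    H = model.components_          # (components x time) activations
    # Group basis vectors by speaker via k-means on their normalized spectra.
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(
        (W / (W.sum(axis=0, keepdims=True) + 1e-10)).T)
    # Partial reconstruction from each speaker's bases and activations.
    sources = [W[:, labels == s] @ H[labels == s] for s in range(n_speakers)]
    # Wiener-style soft masks so the estimates sum back to the mixture.
    total = sum(sources) + 1e-10
    return [mag * (est / total) for est in sources]
```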