40 research outputs found

    Comparative analysis of speech recognition algorithms in UAV voice control system

    Get PDF
    The article presents a comparative analysis of algorithms for processing voice-control signals for an unmanned aerial vehicle (UAV) that can be implemented on processors with low computing power using online, real-time processing. These approaches are shown to be effective in improving speech recognition accuracy in the presence of various types of noise and a sound-reflecting control environment, which is an important problem in UAV voice control systems. An algorithm for computing the mel-frequency cepstral coefficients (MFCCs), which serve as the main speech recognition features, is presented. Two methods of extracting informative speech recognition features in the UAV voice control system are compared: mel-frequency cepstral coefficients and coefficients obtained with a linear prediction algorithm. As a result of the experiment, conducted under specified noise conditions, it was concluded that MFCCs are the optimal feature extraction method for these tasks, since they score best on the absolute criterion of speech recognition quality. The expediency of using the proposed cepstral-analysis-based system for recognizing UAV voice commands is substantiated and experimentally proved. The experimental results support the further practical application of the developed system for recognizing voice commands for UAV control on the basis of cepstral analysis.
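    The MFCC extraction the article relies on follows the standard pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT). Below is a minimal NumPy sketch of that pipeline, assuming a mono input signal; the frame sizes, 26 mel bands, and 13 output coefficients are common defaults rather than values taken from the article.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(y, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: frame -> window -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Pre-emphasis boosts high frequencies, which carry much consonant energy.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank: filters equally spaced on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT-II decorrelates them into cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```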

    Studies on noise robust automatic speech recognition

    Get PDF
    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches to noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise-robust automatic speech recognition (course code T-61.6060) held at TKK.

    Genetic Algorithm for Combined Speaker and Speech Recognition using Deep Neural Networks, Journal of Telecommunications and Information Technology, 2018, nr 2

    Get PDF
    Huge growth is observed in the speech and speaker recognition field due to the many artificial intelligence algorithms being applied. Speech conveys messages via the language spoken, as well as emotion, gender and speaker identity. Many real applications in healthcare are based on speech and speaker recognition, e.g. a voice-controlled wheelchair. In this paper, we use a genetic algorithm (GA) for combined speaker and speech recognition, relying on optimized Mel Frequency Cepstral Coefficient (MFCC) speech features, with classification performed by a Deep Neural Network (DNN). In the first phase, features are extracted using MFCC and then optimized using the GA. In the second phase, training is conducted using the DNN. The proposed model is evaluated and validated in a real environment, and efficiency is calculated on the basis of parameters such as accuracy, precision rate, recall rate, sensitivity, and specificity. The paper also evaluates feature extraction methods, namely linear predictive coding coefficients (LPCC), perceptual linear prediction (PLP), mel frequency cepstral coefficients (MFCC) and relative spectra filtering (RASTA), for combined speaker and speech recognition systems, and compares the different methods in both clean and noisy environments.
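    As a rough illustration of the GA-optimized MFCC pipeline described above, the sketch below evolves a binary feature mask whose fitness is the cross-validated accuracy of a small neural network. The synthetic data, fitness function, and GA parameters are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: X is an (utterances x MFCC-features) matrix, y the labels.
X = rng.normal(size=(200, 39))    # e.g. 13 MFCCs + deltas + delta-deltas
y = rng.integers(0, 4, size=200)  # e.g. 4 speaker/command classes

def fitness(mask):
    """Fitness of a binary feature mask: cross-validated accuracy of a small DNN."""
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(n_feat, pop_size=20, n_gen=15, p_mut=0.05):
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # Tournament selection: the better of two random individuals is a parent.
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # Single-point crossover between consecutive parents.
        cut = rng.integers(1, n_feat, size=pop_size)
        children = np.array([np.concatenate([parents[i][:c],
                                             parents[(i + 1) % pop_size][c:]])
                             for i, c in enumerate(cut)])
        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()]

best_mask = ga_select(X.shape[1])  # features to feed the DNN classifier
```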

    Development of Bilingual ASR System for MediaParl Corpus

    Get PDF
    The development of an Automatic Speech Recognition (ASR) system for the bilingual MediaParl corpus is challenging for several reasons: (1) reverberant recordings, (2) accented speech, and (3) no prior information about the language. In that context, we employ frequency domain linear prediction-based (FDLP) features to reduce the effect of reverberation, exploit bilingual deep neural networks applied in Tandem and hybrid acoustic modeling approaches to significantly improve ASR for accented speech, and develop a fully bilingual ASR system using entropy-based decoding-graph selection. Our experiments indicate that the proposed bilingual ASR system performs similarly to a language-specific ASR system when approximately five seconds of speech are available.
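    A minimal sketch of what entropy-based decoding-graph selection could look like, assuming frame-level DNN posteriors whose output classes are grouped per language; the layout, threshold, and graph names below are assumptions for illustration, not details of the MediaParl system.

```python
import numpy as np

def select_decoding_graph(posteriors, graphs=("french", "german")):
    """Pick a decoding graph from frame-level DNN posteriors.

    `posteriors` is a (frames x classes) matrix whose output classes are
    assumed grouped per language (first half French, second half German).
    """
    half = posteriors.shape[1] // 2
    # Probability mass assigned to each language, averaged over frames.
    lang_mass = np.stack([posteriors[:, :half].sum(axis=1),
                          posteriors[:, half:].sum(axis=1)], axis=1)
    p = lang_mass.mean(axis=0)
    p /= p.sum()
    # Entropy of the language posterior: low entropy = confident decision.
    entropy = -np.sum(p * np.log2(p + 1e-12))
    if entropy < 0.5:                   # assumed confidence threshold
        return graphs[int(p.argmax())]  # use the language-specific graph
    return "bilingual"                  # fall back to the merged graph
```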

    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Full text link
    Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts. As the semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use them separately as input to train multiple speech recognition models. Empirical results demonstrate that layers closer to the input retain more phonetic information, while subsequent layers exhibit greater invariance to domain shift. Moreover, while most previous studies include speech recognition training data when training the feature extractor, our grounding models are not trained on any of those data, indicating more universal applicability to new domains. Comment: Accepted to Interspeech 2019. 4 pages, 2 figures.
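    A minimal sketch of the layer-wise distillation setup, with a hypothetical stand-in encoder: intermediate activations of a pretrained speech branch are tapped with a forward hook and reused as input features for an ASR model. The architecture and layer indices are illustrative, not the paper's actual grounding model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the speech branch of an audio-visual grounding
# model; the real model is trained on image/speech pairs, not defined here.
audio_encoder = nn.Sequential(
    nn.Conv1d(40, 128, 5, padding=2), nn.ReLU(),   # layer 0: close to input
    nn.Conv1d(128, 256, 5, padding=2), nn.ReLU(),  # layer 2: more abstract
    nn.Conv1d(256, 512, 5, padding=2), nn.ReLU(),  # layer 4: most invariant
)

def distill_features(mel_frames, layer_idx):
    """Return one encoder layer's activations as input features for ASR."""
    feats = {}
    hook = audio_encoder[layer_idx].register_forward_hook(
        lambda module, inp, out: feats.update(h=out.detach()))
    with torch.no_grad():
        audio_encoder(mel_frames)  # (batch, 40 mel bins, time)
    hook.remove()
    # (batch, channels, time) -> (batch, time, channels) for a recognizer
    return feats["h"].transpose(1, 2)

x = torch.randn(8, 40, 200)       # dummy batch of log-mel spectrograms
shallow = distill_features(x, 0)  # retains more phonetic detail
deep = distill_features(x, 4)     # more domain-invariant, per the paper's finding
```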

    Single channel overlapped-speech detection and separation of spontaneous conversations

    Get PDF
    PhD Thesis. The thesis considers spontaneous conversation containing both speech mixture and speech dialogue. Speech mixture refers to speakers speaking simultaneously (i.e. overlapped speech); speech dialogue refers to one speaker actively speaking while the other is silent. The input conversation is first processed by overlapped-speech detection, and the two output signals are segregated into dialogue and mixture. The dialogue is processed by speaker diarization, whose outputs are the individual speech of each speaker. The mixture is processed by speech separation, whose outputs are independent separated speech signals of the speakers. When the separation input contains only the mixture, a blind speech separation approach is used; when the separation is assisted by the outputs of the speaker diarization, it is informed speech separation. The research presents a novel overlapped-speech detection algorithm and two novel speech separation algorithms. The proposed overlapped-speech detection algorithm estimates the switching instants of the input. An optimization loop, based on principles of pattern recognition and k-means clustering, is adapted to adopt the best encapsulated audio features and to avoid the worst. Over 300 simulated conversations, the average false-alarm error is 1.9%, the missed-speech error is 0.4%, and the overlap-speaker error is 1%; these errors approximately equal those of the best recent reliable speaker diarization corpora. The proposed blind speech separation algorithm consists of four sequential techniques: filter-bank analysis, Non-negative Matrix Factorization (NMF), speaker clustering, and filter-bank synthesis. Instead of the speaker segmentation normally required, an effective standard framing is contributed. The average objective scores (SAR, SDR and SIR) over 51 simulated conversations are 5.06 dB, 4.87 dB and 12.47 dB respectively. For the proposed informed speech separation algorithm, the outputs of the speaker diarization form a generated database, which assists the separation by creating virtual targeted-speech and mixture signals. The contributed virtual signals are trained to facilitate the separation by homogenising them with the NMF-matrix elements of the real mixture, and a contributed masking step optimizes the resulting speech. The average SAR, SDR and SIR over 341 simulated conversations are 9.55 dB, 1.12 dB and 2.97 dB respectively. By these objective tests, the two speech separation algorithms are in the mid-range of well-known NMF-based audio and speech separation methods.
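    A minimal sketch of the NMF-plus-clustering separation step described above, assuming a non-negative magnitude spectrogram of a two-speaker mixture; the component count, k-means grouping of basis spectra, and soft masking are illustrative choices, not the thesis's exact configuration (which also includes filter-bank analysis and synthesis).

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def nmf_separate(mag, n_components=20, n_speakers=2):
    """Separate a two-speaker magnitude spectrogram via NMF and clustering.

    mag: (freq_bins x time_frames) non-negative magnitude spectrogram.
    Returns one masked spectrogram per estimated speaker.
    """
    model = NMF(n_components=n_components, init="nndsvd", max_iter=400)
    W = model.fit_transform(mag)   # (freq x components) spectral bases
    H = model.components_          # (components x time) activations
    # Group basis vectors by speaker via k-means on their normalized spectra.
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(
        (W / (W.sum(axis=0, keepdims=True) + 1e-10)).T)
    # Partial reconstruction from each speaker's bases and activations.
    sources = [W[:, labels == s] @ H[labels == s] for s in range(n_speakers)]
    # Wiener-style soft masks so the estimates sum back to the mixture.
    total = sum(sources) + 1e-10
    return [mag * (est / total) for est in sources]
```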