
    Word recognition in continuous speech using linear prediction analysis

    A promising method of automatic word recognition in continuous speech, recently designated as word spotting, has been demonstrated. The method uses error-residual ratios from LPC (Linear Predictive Coding) vocoder analysis for waveform comparison and a dynamic programming procedure for time registration between the incoming speech and a template of the key word. Using a similarity threshold, the incoming speech is compared with several templates to account for variability in spectral shape. This system can work in real time using a real-time vocoder. The multiple templates are used in such a way that a small number of templates, three or four, is made to look like several hundred or more. This is accomplished by dynamically constructing a composite template from parts of each single template as part of the processing of the incoming speech; thus, a particular composite template is constructed for each word being recognized. An accuracy of 99 percent with no false alarms was achieved using 205 key words, five different speakers, and approximately ten minutes of speech text. Performance in the presence of additive white Gaussian noise at approximately 11 dB signal-to-noise ratio was 66 percent. When the speech was processed to account for the noise, results improved to 85 to 90 percent accuracy. Finally, a digit recognition experiment was performed using over 1200 digits spoken by ten different people, with a resultant accuracy of 97 percent.
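
    The dynamic-programming time registration described above can be sketched as follows. This is a minimal DTW illustration, assuming scalar per-frame features compared by absolute difference; the paper's system would instead compare LPC error-residual ratios frame by frame.

```python
def dtw_distance(template, incoming):
    """Return the minimal cumulative alignment cost between two sequences
    of per-frame features, allowing frames to be stretched or compressed."""
    n, m = len(template), len(incoming)
    INF = float("inf")
    # cost[i][j] = best cost aligning template[:i] with incoming[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - incoming[j - 1])  # frame distance
            cost[i][j] = d + min(cost[i - 1][j],        # template frame stretched
                                 cost[i][j - 1],        # incoming frame stretched
                                 cost[i - 1][j - 1])    # frames matched one-to-one
    return cost[n][m]
```

    A word would then be declared present when the alignment cost falls below the similarity threshold.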

    Fast Keyword Spotting in Telephone Speech

    In this paper, we present a system designed for detecting keywords in telephone speech. We focus not only on achieving high accuracy but also on very short processing time. The keyword spotting system can run in three modes: a) an off-line mode requiring less than 0.1xRT, b) an on-line mode with minimum (2 s) latency, and c) a repeated spotting mode, in which pre-computed values allow for additional acceleration. Its performance is evaluated on recordings of Czech spontaneous telephone speech using rather large and complex keyword lists.

    Study to determine potential flight applications and human factors design guidelines for voice recognition and synthesis systems

    A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis. First, a survey of voice recognition and synthesis technology was undertaken to develop a working knowledge base. Then, numerous potential aircraft and simulator flight deck voice applications were identified, and each proposed application was rated on a number of criteria to arrive at an overall payoff rating. The potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control. The ratings of the first three categories showed the most promise of being beneficial to flight deck operations. Possible applications of voice synthesis systems were categorized as automatic or pilot selectable, and many were rated as being potentially beneficial. In addition, voice system implementation guidelines and pertinent performance criteria are proposed. Finally, the findings of this study are compared with those made in a recent NASA study of a 1995 transport concept.

    Speech Recognition of Isolated Arabic Words Using Wavelet Transformation and Fuzzy Neural Network

    In this paper, two new feature-extraction methods for speech recognition are presented. The first method uses a combination of the linear predictive coding (LPC) technique and a skewness equation. The second (WLPCC) uses a combination of the LPC technique, the discrete wavelet transform (DWT), and cepstrum analysis; the objective is to enhance performance by introducing more features from the signal. A Neural Network (NN) and a Neuro-Fuzzy Network are used in the proposed methods for classification. Test results show that the WLPCC method for feature extraction, together with the neuro-fuzzy network for classification, gave the highest recognition rate for both trained and non-trained data. The proposed system was built using MATLAB, and the data comprise ten isolated Arabic words (الله، محمد، خديجة، ياسين، يتكلم، الشارقة، لندن، يسار، يمين، أحزان) spoken by fifteen male speakers. The recognition rate is 97.8% for trained data and 81.1% for non-trained data. Keywords: Speech Recognition, Feature Extraction, Linear Predictive Coding (LPC), Neural Network, Fuzzy Network
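
    The LPC step shared by both feature-extraction methods can be sketched as follows: a minimal Levinson-Durbin recursion over a frame's autocorrelation. Pre-emphasis, windowing, and the model order are illustrative assumptions; the abstract gives no implementation details.

```python
def autocorr(frame, lag):
    """Autocorrelation of a frame at a given lag."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

def lpc(frame, order):
    """Return `order` LPC predictor coefficients for one speech frame,
    computed by the Levinson-Durbin recursion."""
    r = [autocorr(frame, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)  # a[0] is implicitly 1 in the predictor
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)                # prediction error shrinks each order
    return a[1:]
```

    For a noiseless first-order autoregressive signal, the order-1 coefficient recovers the generating pole, which is a quick sanity check on the recursion.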

    Progress in Speech Recognition for Romanian Language


    Improving Label-Deficient Keyword Spotting Using Self-Supervised Pretraining

    In recent years, the development of accurate deep keyword spotting (KWS) models has resulted in KWS technology being embedded in a number of technologies such as voice assistants. Many of these models rely on large amounts of labelled data to achieve good performance. As a result, their use is restricted to applications for which a large labelled speech data set can be obtained. Self-supervised learning seeks to mitigate the need for large labelled data sets by leveraging unlabelled data, which is easier to obtain in large amounts. However, most self-supervised methods have only been investigated for very large models, whereas KWS models are desired to be small. In this paper, we investigate the use of self-supervised pretraining for smaller KWS models in a label-deficient scenario. We pretrain the Keyword Transformer model using the self-supervised framework Data2Vec and carry out experiments on a label-deficient setup of the Google Speech Commands data set. It is found that the pretrained models greatly outperform the models without pretraining, showing that Data2Vec pretraining can increase the performance of KWS models in label-deficient scenarios. The source code is made publicly available. Comment: 8 pages, 3 figures, 4 tables, Submitted to Northern Lights Deep Learning Conference 202

    Mispronunciation Detection in Children's Reading of Sentences

    This work proposes an approach to automatically parse children’s reading of sentences by detecting word pronunciations and extra content, and to classify words as correctly or incorrectly pronounced. This approach can be directly helpful for automatic assessment of reading level or for automatic reading tutors, where a correct reading must be identified. We propose a first segmentation stage to locate candidate word pronunciations based on allowing repetitions and false starts of a word’s syllables. A decoding grammar based solely on syllables allows silence to appear during a word pronunciation. At a second stage, word candidates are classified as mispronounced or not. The feature that best classifies mispronunciations is found to be the log-likelihood ratio between a free phone loop and a word spotting model in the very close vicinity of the candidate segmentation. Additional features are combined in multi-feature models to further improve classification, including: normalizations of the log-likelihood ratio, derivations from phone likelihoods, and Levenshtein distances between the correct pronunciation and recognized phonemes through two phoneme recognition approaches. Results show that most extra events were detected (close to 2% word error rate achieved) and that using automatic segmentation for mispronunciation classification approaches the performance of manual segmentation. Although the log-likelihood ratio from a spotting approach is already a good metric to classify word pronunciations, the combination of additional features provides a relative reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from 35.58% to 29.35% using automatic segmentation, at a constant 5% false alarm rate).
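
    One of the combined features, the Levenshtein distance between the correct pronunciation and the recognized phoneme string, can be sketched as follows. This is a minimal edit-distance implementation; phoneme symbols are arbitrary list elements, and the phoneme inventory is not specified by the abstract.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    turning the reference phoneme sequence into the hypothesis."""
    prev = list(range(len(hyp) + 1))      # distances for the empty prefix of ref
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion from ref
                           cur[j - 1] + 1,           # insertion into ref
                           prev[j - 1] + (r != h)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]
```

    A large distance between the expected and recognized phoneme strings is then evidence of a mispronunciation, alongside the log-likelihood ratio feature.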

    Execution of a Voice-Based Attendance System

    Speech recognition is the process of automatically recognizing a word spoken by a particular speaker on the basis of individual information carried in the speech waveform. It makes it possible to use the speaker's voice to verify his or her identity and to grant controlled access to services such as voice-based biometrics, database access services, voice-based dialling, voicemail, and remote access to computers. The speech-processing front end that extracts the feature set is a critical stage in any voice recognition system. Despite the extensive efforts of researchers, the ideal feature set has still not been settled. There are many kinds of features, derived in different ways, that strongly affect the recognition rate. This project presents one strategy for extracting features from a voice signal for use in a speech recognition system. The key is to convert the speech waveform into a parametric representation (at a considerably lower data rate) for further analysis and processing; this is often known as the voice-processing front end. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, for example Mel-Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Coding (LPC). MFCC is perhaps the best known and most widely used, and it is adopted in this project. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. Another key characteristic of speech is that it is quasi-stationary, i.e. short-time stationary, and it is therefore studied and analysed using short-time, frequency-domain analysis.
In this project work, I have built a simple yet complete and representative automatic speaker recognition (ASR) framework, applied to a voice-based attendance system, i.e., a speech-based access control system. To achieve this, I first made a comparative study of the MFCC approach and the time-domain approach by simulating both strategies in MATLAB 7.0 and examining the consistency of recognition with each. The voice-based attendance system is based on isolated, or single-word, recognition. A particular speaker utters the password once in the training session so that the features of the access word are learned and stored. In the testing session the speaker utters the password again, and recognition is achieved if there is a match. The feature vectors unique to that speaker are obtained in the training stage and are later used to grant authentication to the same speaker when he or she utters the same word in the testing stage. At this stage an intruder can also probe the system, testing its inherent security feature by uttering the same word.
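
    The filter spacing behind MFCCs, linear at low frequencies and logarithmic at high frequencies, follows the mel scale. Below is a minimal sketch of the mel conversion and triangular filter-edge placement; the conversion formula is the standard one, while parameters such as an 8 kHz upper edge and 26 filters are illustrative assumptions, not values from the project.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (standard MFCC formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(f_low, f_high, n_filters):
    """Edge/centre frequencies (Hz) for `n_filters` triangular filters,
    spaced uniformly on the mel scale between f_low and f_high."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]
```

    Uniform spacing in mels yields edges that are nearly linear below about 1 kHz and roughly logarithmic above it, which is exactly the filter layout the MFCC front end relies on.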