
    Speech Recognition of Isolated Arabic words via using Wavelet Transformation and Fuzzy Neural Network

    In this paper two new feature extraction methods are presented for speech recognition. The first method uses a combination of linear predictive coding (LPC) and a skewness equation. The second (WLPCC) uses a combination of LPC, the discrete wavelet transform (DWT), and cepstrum analysis; the objective is to enhance performance by extracting more features from the signal. A Neural Network (NN) and a Neuro-Fuzzy Network are used for classification. Test results show that the WLPCC method for feature extraction, together with the neuro-fuzzy network for classification, gave the highest recognition rate for both trained and untrained data. The proposed system was built using MATLAB, and the data comprise ten isolated Arabic words (الله، محمد، خديجة، ياسين، يتكلم، الشارقة، لندن، يسار، يمين، أحزان) spoken by fifteen male speakers. The recognition rate is 97.8% for trained data and 81.1% for untrained data. Keywords: Speech Recognition, Feature Extraction, Linear Predictive Coding (LPC), Neural Network, Fuzzy Network
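
    The WLPCC pipeline described above chains three standard steps: a DWT to split the signal into sub-bands, an LPC fit per band, and a cepstrum computed from each LPC model. The sketch below illustrates one plausible reading of that chain; the wavelet ('db4'), LPC order, decomposition level, and the FFT-based LPC-cepstrum recipe are illustrative assumptions, not the paper's exact settings.

    ```python
    # Hedged sketch of a WLPCC-style feature extractor: DWT sub-bands,
    # per-band LPC, then cepstral coefficients of each LPC spectrum.
    import numpy as np
    import pywt
    import librosa

    def wlpcc_features(signal, lpc_order=12, wavelet="db4", level=3, n_ceps=12):
        # 1. Split the signal into approximation/detail sub-bands.
        bands = pywt.wavedec(signal.astype(float), wavelet, level=level)
        features = []
        for band in bands:
            # 2. Fit an LPC model to the sub-band.
            a = librosa.lpc(band, order=lpc_order)
            # 3. Cepstrum of the LPC model spectrum 1/|A(w)| (a common
            #    FFT-based LPC-cepstrum recipe; assumed, not the paper's).
            spectrum = np.abs(np.fft.fft(a, 512)) + 1e-10
            ceps = np.fft.ifft(-np.log(spectrum)).real[:n_ceps]
            features.append(ceps)
        return np.concatenate(features)

    # Usage: one fixed-length vector per isolated word, fed to the
    # NN / neuro-fuzzy classifier.
    # feat = wlpcc_features(word_signal)
    ```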

    Recognition of Isolated Marathi words from Side Pose for multi-pose Audio Visual Speech Recognition

    Abstract: This paper presents a new multi-pose audio-visual speech recognition system based on the fusion of side-pose visual features and acoustic signals. The proposed method improves robustness over conventional multimodal speech recognition systems. The work was implemented on the ‘vVISWA’ (Visual Vocabulary of Independent Standard Words) dataset, comprising full-frontal, 45-degree, and side-pose visual streams. Visual features for the side pose were extracted using the 2D Stationary Wavelet Transform (2D-SWT) and acoustic features using Linear Predictive Coding (LPC); the two feature sets were fused and classified using the KNN algorithm, resulting in 90% accuracy. This work supports automatic recognition of isolated words from the side pose in the multi-pose audio-visual speech recognition domain, where only partial visual features of the face exist. Keywords: Side-pose face detection, stationary wavelet transform, linear predictive analysis, feature-level fusion, KNN classifier
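
    The fusion described above is feature-level: visual and acoustic vectors are concatenated before classification. Below is a hedged sketch, assuming pywt's swt2 for the 2D-SWT (image sides divisible by 2**level), librosa for LPC, and scikit-learn's KNN; the sub-band energy summary and all parameters are illustrative, not the paper's configuration.

    ```python
    # Hedged sketch of feature-level fusion: 2D-SWT sub-band energies from
    # a side-pose mouth image concatenated with LPC coefficients from audio.
    import numpy as np
    import pywt
    import librosa
    from sklearn.neighbors import KNeighborsClassifier

    def visual_features(gray, wavelet="haar", level=1):
        # gray: 2-D float array; both sides must be divisible by 2**level.
        coeffs = pywt.swt2(gray.astype(float), wavelet, level=level)
        feats = []
        for approx, (h, v, d) in coeffs:
            feats += [float(np.mean(b ** 2)) for b in (approx, h, v, d)]
        return np.array(feats)

    def fused_features(gray, audio, lpc_order=10):
        acoustic = librosa.lpc(audio.astype(float), order=lpc_order)[1:]
        return np.concatenate([visual_features(gray), acoustic])

    # Usage: stack fused vectors for the training words, then classify.
    # X = np.stack([fused_features(img, sig) for img, sig in train_samples])
    # knn = KNeighborsClassifier(n_neighbors=3).fit(X, train_labels)
    ```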

    Arabic Isolated Word Speaker Dependent Recognition System

    In this thesis we designed a new Arabic isolated-word, speaker-dependent recognition system based on a combination of several feature extraction and classification techniques, where the system combines the method outputs using a voting rule. The system is implemented with a graphical user interface under MATLAB on a G62 laptop with a Core i3 2.26 GHz processor. The dataset includes 40 Arabic words recorded in a calm environment with 5 different speakers using the laptop microphone. Each speaker read each word 8 times; 5 repetitions are used for training and the remaining 3 for testing. First, in the preprocessing step, we used an endpoint detection technique based on energy and zero-crossing rates to identify the start and end of each word and remove silences, then a discrete wavelet transform to remove noise from the signal. To accelerate the system and reduce execution time, the system first recognizes the speaker and loads only that user's reference model. We compared 5 different methods, pairwise Euclidean distance with Mel-frequency cepstral coefficients (MFCC), Dynamic Time Warping (DTW) with formant features, Gaussian Mixture Model (GMM) with MFCC, MFCC+DTW, and Itakura distance with Linear Predictive Coding (LPC) features, obtaining recognition rates of 85.23%, 57%, 87%, 90%, and 83% respectively. To improve the accuracy of the system, we tested several combinations of these 5 methods and found that the best is MFCC | Euclidean + Formant | DTW + MFCC | DTW + LPC | Itakura, with an accuracy of 94.39% but a large computation time of 2.9 seconds. To reduce the computation time of this hybrid, we compared several sub-combinations of it and found that the best trade-off between accuracy and computation time is to first combine MFCC | Euclidean + LPC | Itakura and, only when the two methods disagree, add Formant | DTW + MFCC | DTW to the combination; the average computation time is halved to 1.56 seconds and the accuracy improves to 94.56%. The proposed system is thus competitive with previous research.
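
    The endpoint detection mentioned in the preprocessing step is a classic energy / zero-crossing-rate technique. The following is a minimal sketch of that idea; frame sizes, thresholds, and the simplified decision rule are assumptions for illustration, not the thesis's tuned values.

    ```python
    # Hedged sketch of energy + zero-crossing-rate endpoint detection:
    # mark frames as speech, then return the first/last speech sample.
    import numpy as np

    def detect_endpoints(x, frame=256, hop=128, e_ratio=0.02, z_thr=0.25):
        n = 1 + (len(x) - frame) // hop
        idx = np.arange(frame) + hop * np.arange(n)[:, None]
        frames = x[idx]                              # (n_frames, frame)
        energy = (frames ** 2).sum(axis=1)
        zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
        # Speech: clearly energetic frames, or moderately energetic frames
        # with a high ZCR (unvoiced consonants). Simplified rule.
        speech = (energy > e_ratio * energy.max()) | \
                 ((energy > 0.25 * e_ratio * energy.max()) & (zcr > z_thr))
        active = np.flatnonzero(speech)
        if active.size == 0:
            return None
        return active[0] * hop, active[-1] * hop + frame
    ```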

    Wavelet Based Feature Extraction for The Indonesian CV Syllables Sound

    This paper proposes combining the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected feature vector of Indonesian syllables. This research aims to find the most effective and efficient properties for performing feature extraction of each syllable sound, for application in speech recognition systems. The proposed approach, which builds on a previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in future research, along with recommendations on the most effective and efficient WT for performing syllable sound recognition.
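
    One plausible reading of the WT + ED estimation described above is that each syllable recording is reduced to a fixed-length wavelet vector, and the class's "expected" vector is the sample with the smallest total Euclidean distance to the others (a medoid). The sketch below follows that reading; the wavelet, level, vector length, and the medoid interpretation itself are assumptions.

    ```python
    # Hedged sketch: wavelet feature vectors per recording, then the
    # medoid under Euclidean distance as the expected feature vector.
    import numpy as np
    import pywt

    def wavelet_vector(x, wavelet="db2", level=4, size=64):
        approx = pywt.wavedec(x.astype(float), wavelet, level=level)[0]
        # Interpolate to a fixed length so vectors are comparable.
        return np.interp(np.linspace(0, len(approx) - 1, size),
                         np.arange(len(approx)), approx)

    def expected_vector(recordings):
        vecs = np.stack([wavelet_vector(r) for r in recordings])
        dist = np.linalg.norm(vecs[:, None] - vecs[None, :], axis=2)
        return vecs[dist.sum(axis=1).argmin()]
    ```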

    Continuous kannada speech segmentation and speech recognition based on threshold using MFCC And VQ

    Continuous speech segmentation and recognition play an important role in natural language processing. Continuous context-based Kannada speech segmentation depends on the context, grammar, and semantic rules present in the Kannada language, and extracting significant features of the Kannada speech signal for a recognition system is a challenging problem for researchers. The proposed method is divided into two parts. The first part segments the continuous Kannada speech signal with respect to context by computing the average short-term energy and the spectral centroid coefficients of the speech signal within a specified window; the segmented outputs are meaningful segmentations for different scenarios, with low segmentation error. The second part performs speech recognition by extracting a small number of Mel-frequency cepstral coefficients and using vector quantization with a small number of codebooks. Recognition is based entirely on a threshold value; setting this threshold is a challenging task, but a simple method is used to achieve a good recognition rate. The experimental results show more efficient and effective segmentation, with a higher recognition rate for continuous context-based Kannada speech with different accents for male and female speakers than existing methods, while using minimal feature dimensions for the training data.
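
    The segmentation half of the method rests on two per-window quantities, average short-term energy and the spectral centroid. A minimal sketch of that computation follows; window sizes and the fractional thresholds are illustrative assumptions, not the paper's values.

    ```python
    # Hedged sketch: frame-wise short-term energy and spectral centroid,
    # thresholded to produce (start, end) sample ranges of speech segments.
    import numpy as np

    def segment(x, sr, frame=400, hop=200, e_frac=0.05, c_frac=0.05):
        n = 1 + (len(x) - frame) // hop
        idx = np.arange(frame) + hop * np.arange(n)[:, None]
        frames = x[idx] * np.hamming(frame)
        energy = (frames ** 2).mean(axis=1)
        spec = np.abs(np.fft.rfft(frames, axis=1))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-10)
        mask = (energy > e_frac * energy.max()) & \
               (centroid > c_frac * centroid.max())
        # Turn the boolean frame mask into contiguous segments.
        edges = np.flatnonzero(np.diff(np.r_[0, mask.astype(int), 0]))
        return [(s * hop, (e - 1) * hop + frame)
                for s, e in zip(edges[::2], edges[1::2])]
    ```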

    A motion-based approach for audio-visual automatic speech recognition

    The research work presented in this thesis introduces novel approaches for both visual region-of-interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest: the difference in luminance between successive images, block-matching-based motion vectors, and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance compared with the commonly used appearance-based methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or the discrete wavelet transform representation of the speaker's mouth region. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data preservation viewpoint. The main finding is that audio-visual automatic speech recognition systems using the new features, extracted from frequency bands selected according to their discriminatory ability, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches gave superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.
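
    Of the three motion representations named above, the simplest is the luminance difference between successive frames. The sketch below shows one way such differences could be pooled into a per-frame descriptor; the grid pooling and its size are illustrative assumptions, not the thesis's exact features.

    ```python
    # Hedged sketch: per-frame motion descriptors from luminance
    # differences, pooled over a coarse grid of the mouth region.
    import numpy as np

    def motion_features(frames, grid=(4, 4)):
        # frames: (T, H, W) grayscale mouth-region sequence.
        diffs = np.abs(np.diff(frames.astype(float), axis=0))
        t, h, w = diffs.shape
        gh, gw = grid
        # Crop so the grid divides evenly, then average the motion
        # energy inside each cell: one gh*gw vector per transition.
        hp, wp = h - h % gh, w - w % gw
        d = diffs[:, :hp, :wp].reshape(t, gh, hp // gh, gw, wp // gw)
        return d.mean(axis=(2, 4)).reshape(t, -1)
    ```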

    Identification of Transient Speech Using Wavelet Transforms

    It is generally believed that abrupt stimulus changes, which in speech may be time-varying frequency edges associated with consonants, transitions between consonants and vowels, and transitions within vowels, are critical to the perception of speech by humans and to speech recognition by machines. Noise affects speech transitions more than it affects quasi-steady-state speech. I believe that identifying and selectively amplifying speech transitions may enhance the intelligibility of speech in noisy conditions. The purpose of this study is to evaluate the use of wavelet transforms to identify speech transitions; wavelet transforms may be computationally efficient and allow real-time applications. The discrete wavelet transform (DWT), the stationary wavelet transform (SWT), and wavelet packets (WP) are evaluated. Wavelet analysis is combined with variable frame rate processing to improve the identification process: variable frame rate analysis can identify time segments when speech feature vectors are changing rapidly and when they are relatively stationary. Energy profiles for words, which show the energy in each node of a speech signal decomposed using wavelets, are used to identify nodes containing predominantly transient information and nodes containing predominantly quasi-steady-state information, and these are used to synthesize transient and quasi-steady-state speech components. These components are estimates of the tonal and non-tonal speech components that Yoo et al. identified using time-varying band-pass filters. Comparison of spectra, a listening test, and mean-squared errors between the transient components synthesized using wavelets and Yoo's non-tonal components indicated that wavelet packets gave the best estimates of Yoo's components. An algorithm that incorporates variable frame rate analysis into wavelet packet analysis is proposed. Developing this algorithm involves choosing a wavelet function and a decomposition level. The algorithm itself has four steps: wavelet packet decomposition; classification of terminal nodes; incorporation of variable frame rate processing; and synthesis of speech components. Combining wavelet analysis with variable frame rate analysis provides the best estimates of Yoo's speech components.
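
    The energy profiles described above can be computed directly from a wavelet packet tree. A hedged sketch follows, using pywt's WaveletPacket; the wavelet ('db4') and decomposition depth are assumptions, not the values chosen in the thesis.

    ```python
    # Hedged sketch: normalized energy per terminal wavelet-packet node,
    # the quantity used to label nodes as transient vs quasi-steady-state.
    import numpy as np
    import pywt

    def node_energy_profile(x, wavelet="db4", level=4):
        wp = pywt.WaveletPacket(data=x.astype(float), wavelet=wavelet,
                                maxlevel=level)
        nodes = wp.get_level(level, order="freq")   # terminal nodes
        energy = np.array([float(np.sum(n.data ** 2)) for n in nodes])
        return {n.path: e for n, e in zip(nodes, energy / energy.sum())}

    # A pruned tree (keeping only the selected nodes) can be turned back
    # into a waveform with WaveletPacket.reconstruct(), which is how the
    # transient / quasi-steady-state components would be synthesized.
    ```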