    Experiments with time-frequency inversions

    Analysis and Implementation of Speech Recognition System using ARM7 Processor

    This paper introduces the implementation and analysis of a speech recognition system. Speech recognition is the process of automatically recognizing a word spoken by a particular speaker, based on individual information included in the speech waves. This paper presents one technique for extracting a feature set from a speech signal for use in speech recognition systems, and an analysis study has been performed. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. Studies and experiments show that MFCC provides better results than LPC. Here, vector quantization is used to increase speech recognition accuracy. Experiments show that accuracy improves as the number of MFCC coefficients increases, and that codebook size also affects accuracy. The MFCC and VQ algorithms for speech recognition have been implemented in MATLAB 7.7 (R2008b) on the Windows 7 platform. The control circuitry has been implemented in Keil µVision3; the supporting hardware setup is being implemented. Keywords: Speech Recognition; MFCC; Vector Quantization; LP
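    The pipeline described above (MFCC features quantized against per-word codebooks) can be illustrated with a short sketch. The paper's implementation is in MATLAB; the following Python version, using librosa for MFCC extraction and scipy's k-means for codebook training, is only an illustrative reconstruction, and all function names and parameter values are assumptions.

    # Illustrative sketch of the MFCC + vector-quantization pipeline above.
    # Library choices (librosa, scipy) are stand-ins; the paper used MATLAB.
    import numpy as np
    import librosa
    from scipy.cluster.vq import kmeans, vq

    def train_codebook(wav_path, n_mfcc=13, codebook_size=32, sr=8000):
        """Extract MFCC frames from training audio and build a VQ codebook."""
        y, sr = librosa.load(wav_path, sr=sr)
        # MFCC matrix has shape (n_mfcc, n_frames); transpose to frames-as-rows
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
        codebook, _ = kmeans(mfcc.astype(float), codebook_size)
        return codebook

    def vq_distortion(wav_path, codebook, n_mfcc=13, sr=8000):
        """Average quantization distortion of an utterance against a codebook."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
        _, dists = vq(mfcc.astype(float), codebook)
        return float(np.mean(dists))

    # Recognition: pick the word whose codebook gives the lowest distortion, e.g.
    # codebooks = {word: train_codebook(path) for word, path in training_set}
    # best = min(codebooks, key=lambda w: vq_distortion("test.wav", codebooks[w]))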

    Development of a sensory substitution API

    Sensory substitution, the practice of mapping information from one sensory modality to another, has been shown to be a viable technique for non-invasive sensory replacement and augmentation. With the rise in popularity, ubiquity, and capability of mobile devices and wearable electronics, sensory substitution research has seen a resurgence in recent years. Because standard features of mobile and wearable electronics such as Bluetooth, multicore processing, and audio recording allow these devices to drive sensory substitution systems, there is a need for a flexible, extensible software package capable of performing the required real-time data processing for sensory substitution on modern mobile devices. The primary contribution of this thesis is the development and release of an open-source Application Programming Interface (API) capable of managing an audio stream from the source of sound to a sensory stimulus interface on the body. The API (named Tactile Waves) is written in the Java programming language and packaged as both a Java library (JAR) and an Android library (AAR). The development and design of the library are presented, and its primary functions are explained. Implementation details for each primary function are discussed. Performance evaluation of all processing routines is performed to ensure real-time capability, and the results are summarized. Finally, future improvements to the library and additional applications of sensory substitution are proposed.
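    The abstract does not reproduce the Tactile Waves API itself, but the overall pattern it describes (an audio stream processed in real time and mapped to a stimulus interface on the body) can be sketched generically. The following Python fragment is a hypothetical illustration, not the library's actual interface; the channel count, band edges, and function names are invented for the example.

    # Hypothetical sketch (NOT the Tactile Waves API) of the generic sensory
    # substitution pattern: audio frame in, feature extraction, mapping to
    # per-channel stimulus intensities for tactors on the body.
    import numpy as np

    def band_energies(frame, sr, n_channels=8):
        """Split one audio frame's spectrum into log-spaced bands, one per tactor."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        edges = np.geomspace(100.0, sr / 2.0, n_channels + 1)
        return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in zip(edges[:-1], edges[1:])])

    def to_stimulus(energies, floor=1e-10):
        """Compress band energies to [0, 1] drive levels for a stimulus interface."""
        db = 10.0 * np.log10(np.maximum(energies, floor))
        return np.clip((db - db.min()) / max(np.ptp(db), 1e-9), 0.0, 1.0)

    # Real-time loop (schematic): for each ~10 ms frame from the microphone,
    # levels = to_stimulus(band_energies(frame, sr)); send levels to the tactors.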

    Source-filter Separation of Speech Signal in the Phase Domain

    Deconvolution of the speech excitation (source) and vocal tract (filter) components through log-magnitude spectral processing is well established and has led to the well-known cepstral features used in a multitude of speech processing tasks. This paper presents a novel source-filter decomposition based on processing in the phase domain. We show that separation between source and filter in the log-magnitude spectra is far from perfect, leading to loss of vital vocal tract information. It is demonstrated that the same task can be better performed by trend and fluctuation analysis of the phase spectrum of the minimum-phase component of speech, which can be computed via the Hilbert transform. Trend and fluctuation can be separated through low-pass filtering of the phase, using the additivity of vocal tract and source in the phase domain. This results in separated signals which have a clear relation to the vocal tract and excitation components. The effectiveness of the method is put to the test in a speech recognition task: the vocal tract component extracted in this way is used as the basis of a feature extraction algorithm for speech recognition on the Aurora-2 database. The recognition results show up to 8.5% absolute improvement on average (0-20 dB) in comparison with MFCC features.
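    A minimal sketch may make the decomposition concrete: the phase of the minimum-phase component can be obtained as the negative Hilbert transform of the log-magnitude spectrum, and low-pass filtering along the frequency axis then splits it into a trend (vocal tract) and a fluctuation (source). The smoothing length and FFT size below are assumptions, not the paper's settings.

    # Sketch of phase-domain source-filter separation as described above.
    import numpy as np
    from scipy.signal import hilbert
    from scipy.ndimage import uniform_filter1d

    def phase_trend_fluctuation(frame, n_fft=512, smooth_bins=15):
        log_mag = np.log(np.abs(np.fft.fft(frame, n_fft)) + 1e-12)
        # Minimum-phase response: phase = -Hilbert{log|X|}
        min_phase = -np.imag(hilbert(log_mag))
        # Low-pass filter along frequency: slow trend ~ vocal tract,
        # residual fluctuation ~ excitation (additivity in the phase domain)
        trend = uniform_filter1d(min_phase, size=smooth_bins)
        fluctuation = min_phase - trend
        return trend, fluctuation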

    Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis

    Audio-Visual Automatic Speech Recognition (AV-ASR) has become a most promising research area for cases where the audio signal is corrupted by noise. The main objective of this paper is to select important, discriminative audio and visual speech features for recognizing audio-visual speech. This paper proposes the Pseudo Zernike Moment (PZM) and a feature selection method for audio-visual speech recognition. Visual information is captured from the lip contour, and moments are computed for lip reading. We extract 19th-order Mel Frequency Cepstral Coefficients (MFCC) as speech features from the audio. Since not all 19 speech features are equally important, feature selection algorithms are used to select the most effective ones. Statistical tests such as Analysis of Variance (ANOVA), Kruskal-Wallis, and the Friedman test are employed to analyze the significance of the features, along with an Incremental Feature Selection (IFS) technique: statistical analysis assesses the significance of the speech features, after which IFS selects the speech feature subset, as sketched below. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naive Bayes (NB) machine learning techniques are used to recognize speech in both the audio and visual modalities, and a combined decision is taken from the two individual recognition systems based on their recognition rates. This paper compares the results achieved by the proposed model and existing models for both audio and visual speech recognition. Zernike Moment (ZM) is compared with PZM, showing that the proposed model using PZM extracts more discriminative features for visual speech recognition. The study also shows that audio feature selection using statistical analysis outperforms methods without any feature selection technique.
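    The ranking-plus-IFS procedure follows a simple loop: score every feature, sort by significance, then grow the subset one feature at a time and keep the size that maximizes recognition accuracy. The Python sketch below uses ANOVA F-scores and an SVM as stand-ins for the paper's exact tests and classifiers; the cross-validation setup is an assumption.

    # Illustrative sketch of statistical ranking + incremental feature selection.
    import numpy as np
    from sklearn.feature_selection import f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def incremental_feature_selection(X, y):
        """Rank features by ANOVA F-score, then grow the subset while it helps."""
        f_scores, _ = f_classif(X, y)
        order = np.argsort(f_scores)[::-1]          # most significant first
        best_acc, best_subset = 0.0, order[:1]
        for k in range(1, X.shape[1] + 1):
            subset = order[:k]
            acc = cross_val_score(SVC(), X[:, subset], y, cv=5).mean()
            if acc > best_acc:
                best_acc, best_subset = acc, subset
        return best_subset, best_acc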

    New time-frequency domain pitch estimation methods for speech signals under low levels of SNR

    The major objective of this research is to develop novel pitch estimation methods capable of handling speech signals in practical situations where only noise-corrupted speech observations are available. With this objective in mind, the estimation task is carried out in two different approaches. In the first approach, the noisy speech observations are directly employed to develop two new time-frequency domain pitch estimation methods, based on extracting a pitch-harmonic and finding the corresponding harmonic number required for pitch estimation. Considering that voiced speech is the output of a vocal tract system driven by a sequence of pulses separated by the pitch period, in the second approach, instead of using the noisy speech directly for pitch estimation, an excitation-like signal (ELS) is first generated from the noisy speech or its noise-reduced version.

    In the first approach, a harmonic cosine autocorrelation (HCAC) model of clean speech in terms of its pitch-harmonics is first introduced. In order to extract a pitch-harmonic, we propose an optimization technique based on least-squares fitting of the autocorrelation function (ACF) of the noisy speech to the HCAC model. By exploiting the extracted pitch-harmonic along with the fast Fourier transform (FFT) based power spectrum of the noisy speech, we then deduce a harmonic measure and a harmonic-to-noise-power ratio (HNPR) to determine the desired harmonic number of the extracted pitch-harmonic. In the proposed optimization, an initial estimate of the pitch-harmonic is obtained from the maximum peak of the smoothed FFT power spectrum. In addition to the HCAC model, where the cross-product terms of different harmonics are neglected, we derive a compact yet accurate harmonic sinusoidal autocorrelation (HSAC) model for the clean speech signal. The new HSAC model is then used in the least-squares model-fitting optimization technique to extract a pitch-harmonic.

    In the second approach, we first develop a pitch estimation method using an excitation-like signal (ELS) generated from the noisy speech. To this end, a technique based on the principle of homomorphic deconvolution is proposed for extracting the vocal-tract system (VTS) parameters from the noisy speech, which are utilized to perform inverse-filtering of the noisy speech to produce a residual signal (RS). In order to reduce the effect of noise on the RS, a noise-compensation scheme is introduced in the autocorrelation domain. The noise-compensated ACF of the RS is then employed to generate a squared Hilbert envelope (SHE) as the ELS of the voiced speech. To further overcome the adverse effect of noise on the ELS, a new symmetric normalized magnitude difference function of the ELS is proposed for the eventual pitch estimation.

    The cepstrum has been widely used in speech signal processing but has limited capability in handling noise. One potential solution is the introduction of a noise reduction block prior to pitch estimation based on the conventional cepstrum, a framework already available in many practical applications, such as mobile communication and hearing aids. Motivated by the advantages of the existing framework, and considering the superiority of our ELS to the speech itself in providing clues to pitch information, we develop a cepstrum-based pitch estimation method using the ELS obtained from the noise-reduced speech. For this purpose, we propose a noise subtraction scheme in the frequency domain, which takes into account the possible cross-correlation between speech and noise and has the advantage that the noise estimate is updated with time and adjusted at each frame. The enhanced speech thus obtained is utilized to extract the vocal-tract system (VTS) parameters via the homomorphic deconvolution technique. A residual signal (RS) is then produced by inverse-filtering the enhanced speech with the extracted VTS parameters. It is found that, unlike in the previous ELS-based method, the squared Hilbert envelope (SHE) computed from the RS of the enhanced speech without noise compensation is sufficient to represent the ELS. Finally, in order to tackle the undesirable effect of noise on the ELS at very low SNR, and to overcome the limitation of the conventional cepstrum in handling different types of noise, a time-frequency domain pseudo-cepstrum of the ELS of the enhanced speech, incorporating information from both the magnitude and phase spectra of the ELS, is proposed for pitch estimation. (Abstract shortened by UMI.)
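    The ELS pipeline of the second approach can be sketched end to end: homomorphic deconvolution (low-quefrency liftering of the real cepstrum) estimates the vocal-tract log-spectrum, inverse filtering yields a residual, the squared Hilbert envelope of the residual serves as the ELS, and pitch is read from the periodicity of the ELS. The Python sketch below is an illustrative reconstruction only; it omits the thesis's noise compensation and pseudo-cepstrum stages, and all cutoffs and search ranges are assumptions.

    # Sketch: ELS via homomorphic deconvolution, then a cepstral pitch pick.
    import numpy as np
    from scipy.signal import hilbert

    def pitch_from_els(frame, sr, lifter_ms=2.0, f0_min=60.0, f0_max=400.0):
        n = len(frame)
        spec = np.fft.fft(frame * np.hanning(n))
        log_mag = np.log(np.abs(spec) + 1e-12)
        ceps = np.fft.ifft(log_mag).real
        # Homomorphic deconvolution: low quefrencies -> vocal-tract log-spectrum
        cut = int(lifter_ms * 1e-3 * sr)
        lifter = np.zeros(n)
        lifter[:cut] = 1.0
        lifter[-cut + 1:] = 1.0
        vt_log_mag = np.fft.fft(ceps * lifter).real
        # Inverse filtering in the frequency domain -> residual signal
        residual = np.fft.ifft(spec / np.exp(vt_log_mag)).real
        # Excitation-like signal: squared Hilbert envelope of the residual
        els = np.abs(hilbert(residual)) ** 2
        # Pitch from the cepstrum of the ELS over a plausible lag range
        els_ceps = np.fft.ifft(np.log(np.abs(np.fft.fft(els)) + 1e-12)).real
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        lag = lo + np.argmax(els_ceps[lo:hi])
        return sr / lag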