170 research outputs found
Speech Detection Using Gammatone Features And One-class Support Vector Machine
A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, this says little about the signal being passed. More sophisticated methods employ machine learning algorithms, but train on specific noises intended for a target environment. Validating speech under a variety of unknown conditions must be possible, as must differentiating between speech and non-speech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean-speech model for detection. Through the use of Gammatone filter bank processing, the cepstrum and several frequency-domain features are used to train a One-Class Support Vector Machine which provides a clean-speech model irrespective of environmental noise. A Wiener filter is used to improve operation in harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech, with approximately 70% accuracy at SNRs as low as 5 dB
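The cepstral portion of the feature set can be illustrated with a short sketch. Below is a minimal numpy computation of the real cepstrum of one windowed frame; the abstract's full pipeline feeds Gammatone-filtered sub-bands and further frequency-domain features to a One-Class SVM, and the 512-sample frame, 8 kHz rate and Hann window here are illustrative assumptions, not the thesis settings:

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one speech frame: IFFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag)

# A crude voiced-speech-like frame: 100 Hz tone modulated by a 700 Hz component.
fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 100 * t) * np.cos(2 * np.pi * 700 * t)
ceps = real_cepstrum(frame)
```

Low-index cepstral coefficients summarise the spectral envelope, which is what a clean-speech model of this kind would be trained on.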
Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions
ABSTRACT: In recent years there has been great progress in automatic speech recognition. The challenge now is not only to recognize the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and the personality of the speaker. This research work aims at developing a methodology for automatic emotion recognition from speech signals under non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for this purpose
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach towards speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF-NN), with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to the classical Mel-Frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage- and time-efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear.
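As a rough illustration of the LPC route to vowel formants, the sketch below estimates resonance frequencies from the roots of an autocorrelation-method LPC polynomial. The 8 kHz rate, order 4 and the synthetic damped resonance standing in for a vowel are assumptions for illustration; the thesis's actual extraction settings are not given in the abstract:

```python
import numpy as np

def lpc_formants(x, fs, order=8):
    """Estimate formant frequencies via autocorrelation-method LPC.

    Solves the normal equations R a = r for the predictor coefficients,
    then converts the angles of the LPC polynomial roots to Hz.
    """
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])           # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))    # A(z) = 1 - sum a_k z^-k
    # keep one root per conjugate pair, close enough to the unit circle
    roots = roots[(roots.imag > 0.01) & (np.abs(roots) > 0.8)]
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs)

# A damped resonance at 700 Hz stands in for a vowel formant.
fs = 8000
n = np.arange(1024)
x = np.exp(-0.002 * n) * np.cos(2 * np.pi * 700 * n / fs)
formants = lpc_formants(x, fs, order=4)
```

Storing only a handful of such frequencies per speaker is what makes the scheme byte-level compact.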
Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. The result is in line with the distinct nature of the two modalities, which lose themselves when combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy
Speaker recognition: current state and experiment
In this thesis the operation of speaker recognition systems is described and the state of the art of the main working blocks is studied. All the research papers reviewed can be found in the References. As voice is unique to the individual, it has emerged as a viable authentication method. Several problems must be considered, such as the presence of noise in the environment and changes in the speakers' voices, for example due to sickness. These systems combine knowledge from signal processing for the feature extraction part and signal modelling for the classification and decision part. There are several techniques for the feature extraction and pattern matching blocks, so it is quite tricky to establish a unique and optimum solution. MFCC and DTW are the most common techniques for each block, respectively. They are discussed in this document, with special emphasis on their drawbacks, which motivate new techniques that are also presented here. A search through the Internet was carried out to find commercial working implementations, which are quite rare; then a basic introduction to Praat is presented. Finally, some intra-speaker and inter-speaker tests are done using this software
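Since MFCC and DTW are singled out as the most common feature-extraction and pattern-matching blocks, a minimal DTW sketch may help. It aligns scalar sequences for brevity; a real system would align sequences of MFCC vectors using a per-frame Euclidean distance:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed by warping
```

This tolerance to local stretching and compression of the time axis is exactly why DTW suits utterances spoken at different rates.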
Stress and emotion recognition in natural speech in the work and family environments
Speech stress and emotion recognition and classification technology has the potential to provide significant benefits to national and international industry and society in general. The accuracy of automatic speech emotion recognition relies heavily on the discrimination power of the characteristic features. This work introduced and examined a number of new linear and nonlinear feature extraction methods for the automatic detection of stress and emotion in speech. The proposed linear feature extraction methods included features derived from the speech spectrograms (SS-CB/BARK/ERB-AE, SS-AF-CB/BARK/ERB-AE, SS-LGF-OFS, SS-ALGF-OFS, SS-SP-ALGF-OFS and SS-sigma-pi), wavelet packets (WP-ALGF-OFS) and the empirical mode decomposition (EMD-AER). The proposed nonlinear feature extraction methods were based on the results of recent laryngological studies and nonlinear modelling of the phonation process. The proposed nonlinear features included the area under the TEO autocorrelation envelope based on different spectral decompositions (TEO-DWT, TEO-WP, TEO-PWP-S and TEO-PWP-G), as well as features representing the spectral energy distribution of speech (AUSEES) and of the glottal waveform (AUSEEG). The proposed features were compared with features based on the classical linear model of speech production, including F0, formants, MFCC and glottal time/frequency parameters. Two classifiers, GMM and KNN, were tested for consistency. The experiments used speech under actual stress from the SUSAS database (7 speakers; 3 female and 4 male) and speech with five naturally expressed emotions (neutral, anger, anxious, dysphoric and happy) from the ORI corpora (71 speakers; 27 female and 44 male). The nonlinear features clearly outperformed all the linear features.
The classification results demonstrated consistency with the nonlinear model of the phonation process, indicating that the harmonic structure and the spectral distribution of the glottal energy provide the most important cues for stress and emotion recognition in speech. The study also investigated whether automatic emotion recognition can determine differences in emotion expression between parents of depressed adolescents and parents of non-depressed adolescents, and whether there are differences in emotion expression between mothers and fathers in general. The experimental results indicated that parents of depressed adolescents produce stronger, more exaggerated expressions of affect than parents of non-depressed children, and that females in general provide expressions of affect that are easier to discriminate (more exaggerated) than those of males
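All the TEO-based features above start from the discrete Teager energy operator, which is simple enough to sketch directly (the sub-band decompositions, autocorrelation envelopes and area features of the thesis are not reproduced here):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n) the operator equals the constant A^2 * sin(w)^2,
# so stress- or emotion-related modulation shows up as deviation from flatness.
n = np.arange(1000)
tone = 2.0 * np.cos(0.3 * n)
psi = teager_energy(tone)
```

The operator tracks the product of amplitude and frequency, which is why it is sensitive to the nonlinear phonation effects the abstract describes.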
Multibiometric security in wireless communication systems
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University, 05/08/2010. This thesis explores an application of multibiometrics to secured wireless communications. The media of study for this purpose included Wi-Fi, 3G, and WiMAX, over which simulations and experimental studies were carried out to assess performance. Specifically, restriction of access to authorized users only is provided by a technique referred to hereafter as a multibiometric cryptosystem. In brief, the system is built upon a complete challenge/response methodology in order to obtain a high level of security, on the basis of user identification by fingerprint and further confirmation by verification of the user through text-dependent speaker recognition.
First is the enrolment phase, in which the database of watermarked fingerprints with memorable texts, along with voice features based on the same texts, is created by sending them to the server through the wireless channel. Later is the verification stage, at which claimed users, those who claim to be genuine, are verified against the database; it consists of five steps. Initially, at the identification level, one is asked to present one's fingerprint and a memorable word, the former watermarked into the latter, in order for the system to authenticate the fingerprint and verify its validity by retrieving the challenge for the accepted user. The following three steps then involve speaker recognition: the user responding to the challenge with text-dependent voice, the server authenticating the response, and finally the server accepting or rejecting the user.
In order to implement fingerprint watermarking, i.e. incorporating the memorable word as a watermark message into the fingerprint image, an algorithm of five steps has been developed. The first three novel steps, concerned with fingerprint image enhancement (CLAHE with 'Clip Limit', standard deviation analysis and sliding neighborhood), are followed by two further steps for embedding and extracting the watermark into/from the enhanced fingerprint image utilising the Discrete Wavelet Transform (DWT).
In the speaker recognition stage, the limitations of this technique in wireless communication have been addressed by sending voice features (cepstral coefficients) instead of raw samples. This scheme reaps the advantages of reduced transmission time and reduced dependency of the data on the communication channel, together with no loss of packets. Finally, the obtained results have verified the claims
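The DWT embed/extract steps can be illustrated with a toy 1-D version. The sketch below uses a one-level orthonormal Haar DWT and a simple quantization-style embedding into the detail band; the quantization scheme, step size and 1-D signal are illustrative assumptions, whereas the thesis embeds into a 2-D fingerprint image after its three enhancement steps:

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: approximation and detail bands."""
    x = np.asarray(x, dtype=float)
    s2 = np.sqrt(2.0)
    return (x[0::2] + x[1::2]) / s2, (x[0::2] - x[1::2]) / s2

def haar_idwt(approx, detail):
    s2 = np.sqrt(2.0)
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / s2
    out[1::2] = (approx - detail) / s2
    return out

def embed_bits(coeffs, bits, delta=1.0):
    """Push each coefficient to the upper or lower half of a step of size delta."""
    base = np.floor(coeffs / delta) * delta
    return base + np.where(bits, 0.75 * delta, 0.25 * delta)

def extract_bits(coeffs, delta=1.0):
    return (coeffs / delta - np.floor(coeffs / delta)) > 0.5

rng = np.random.default_rng(0)
signal = rng.normal(size=16)            # stand-in for one row of a fingerprint image
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
approx, detail = haar_dwt(signal)
marked = haar_idwt(approx, embed_bits(detail, bits))
recovered = extract_bits(haar_dwt(marked)[1])
```

Because the transform is orthonormal, the embedded detail coefficients survive the inverse/forward round trip exactly, so the bits extract cleanly.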
A motion-based approach for audio-visual automatic speech recognition
The research work presented in this thesis introduces novel approaches for both visual
region of interest extraction and visual feature extraction for use in audio-visual
automatic speech recognition. In particular, the speakerâs movement that occurs
during speech is used to isolate the mouth region in video sequences and motionbased
features obtained from this region are used to provide new visual features for
audio-visual automatic speech recognition. The mouth region extraction approach
proposed in this work is shown to give superior performance compared with existing
colour-based lip segmentation methods. The new features are obtained from three
separate representations of motion in the region of interest, namely the difference in
luminance between successive images, block matching based motion vectors and
optical flow. The new visual features are found to improve visual-only and audiovisual
speech recognition performance when compared with the commonly-used
appearance feature-based methods.
In addition, a novel approach is proposed for visual feature extraction from either the
discrete cosine transform or discrete wavelet transform representations of the mouth
region of the speaker. In this work, the image transform is explored from a new
viewpoint of data discrimination; in contrast to the more conventional data
preservation viewpoint. The main findings of this work are that audio-visual
automatic speech recognition systems using the new features extracted from the
frequency bands selected according to their discriminatory abilities generally
outperform those using features designed for data preservation.
To establish the noise robustness of the new features proposed in this work, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems
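Of the three motion representations, the luminance difference between successive images is the simplest to sketch. A minimal numpy version over an assumed rectangular mouth region of interest (the block-matching and optical-flow features are not reproduced here):

```python
import numpy as np

def luminance_difference_features(frames, roi):
    """Mean absolute luminance difference between successive frames over a ROI.

    frames: array of shape (T, H, W); roi: (top, bottom, left, right) bounds
    of the mouth region. Returns one motion value per frame transition.
    """
    t0, t1, l0, l1 = roi
    region = frames[:, t0:t1, l0:l1].astype(float)
    return np.abs(np.diff(region, axis=0)).mean(axis=(1, 2))

# A static background with a bright block that shifts right one pixel per frame.
frames = np.zeros((4, 32, 32))
for t in range(4):
    frames[t, 10:20, 5 + t:15 + t] = 255.0
motion = luminance_difference_features(frames, (8, 22, 0, 32))
```

Static content contributes nothing to these features, which is what makes them complementary to appearance-based descriptions of the same region.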
Automatic speaker recognition: modelling, feature extraction and effects of clinical environment
Speaker recognition is the task of establishing the identity of an individual based on his or her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. The speaker recognition task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation Maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the features' sensitivity to changes due to aging of speakers, use of alcohol and drugs, changing health conditions and mental state. The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the expectation maximization (EM) method. It was demonstrated that features based on the nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed. The deterioration was demonstrated using conventional (MFCC) features. The thesis also showed that when replacing the MFCC features with features based on the nonlinear model of speech production (TEO-based features), the detrimental effect of clinical depression on speaker verification rates can be reduced
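For reference, the classical GMM/EM baseline that EM-ITVQ is compared against can be sketched in one dimension. A minimal EM fit of a two-component mixture follows; real systems model multi-dimensional MFCC or TEO feature vectors with many more components, so the 1-D, two-component setting is an illustrative assumption:

```python
import numpy as np

def fit_gmm_em(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture model."""
    # initialise the two means from the lower/upper quartiles of the data
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = w / np.sqrt(2 * np.pi * var) * np.exp(
            -0.5 * (x[:, None] - mu) ** 2 / var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
w, mu, var = fit_gmm_em(x)
```

The slow convergence this loop can exhibit on less separable data is one of the drawbacks the abstract's EM-ITVQ algorithm is proposed to address.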
Single-trial classification of an EEG-based brain computer interface using the wavelet packet decomposition and cepstral analysis
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009.
ENGLISH ABSTRACT: A Brain-Computer Interface (BCI) monitors brain activity using signals such as EEG, ECoG, and MEG, and attempts to bridge the gap between thoughts and actions by providing control of physical devices ranging from wheelchairs to computers. A crucial process for a BCI system is feature extraction, and many studies have been undertaken to find relevant information from a set of input signals.
This thesis investigated feature extraction from EEG signals using two different approaches. Wavelet packet decomposition was used to extract information from the signals in the frequency domain, and cepstral analysis was used to search for relevant information in the cepstral domain. A BCI was implemented to evaluate the two approaches, and three classification techniques contributed to finding the effectiveness of each feature type.
Data containing two-class motor imagery was used for testing, and the BCI was compared to some of the other systems currently available. Results indicate that both approaches investigated were effective in producing separable features, and, with further work, can be used for the classification of trials
based on a paradigm exploiting motor imagery as a means of control.
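The wavelet-packet route to frequency-domain features can be sketched with a Haar filter pair. Unlike the plain DWT, the packet decomposition re-splits both the low- and high-pass halves at every level, and the leaf-band energies serve as features; the Haar filters, three levels and white-noise stand-in for an EEG epoch are illustrative assumptions:

```python
import numpy as np

def haar_step(x):
    """One Haar split into a low-pass (sum) and a high-pass (difference) half."""
    s2 = np.sqrt(2.0)
    return (x[0::2] + x[1::2]) / s2, (x[0::2] - x[1::2]) / s2

def wavelet_packet_energies(x, levels):
    """Full wavelet packet decomposition; returns the energy of each leaf band."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        # unlike the plain DWT, *both* halves are split again at every level
        bands = [half for band in bands for half in haar_step(band)]
    return np.array([np.sum(b ** 2) for b in bands])

rng = np.random.default_rng(2)
eeg = rng.normal(size=256)            # stand-in for one EEG channel epoch
energies = wavelet_packet_energies(eeg, levels=3)
```

Because the Haar pair is orthonormal, the leaf energies partition the total signal energy exactly, giving a compact per-band feature vector for the classifiers to separate.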
- …