170 research outputs found

    Speech Detection Using Gammatone Features And One-class Support Vector Machine

    A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packet containing speech data cannot be verified and can provide a means of covertly transporting malicious code or sensitive data. One solution to this problem is Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, this reveals little about the signal being passed. More sophisticated methods employ machine learning algorithms but train on specific noises intended for a target environment. Validating speech under a variety of unknown conditions must be possible, as must differentiating between speech and non-speech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean-speech model for detection. Through Gammatone filter bank processing, the cepstrum and several frequency-domain features are used to train a One-Class Support Vector Machine, which provides a clean-speech model irrespective of environmental noise. A Wiener filter provides improved operation in harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech, with approximately 70% accuracy for SNR as low as 5 dB
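    The core idea above, training only on clean speech and flagging anything outside the learned boundary, can be sketched with scikit-learn's OneClassSVM. This is a minimal sketch, not the paper's implementation: the random feature vectors stand in for the Gammatone/cepstral features per frame, and `is_speech` is an illustrative helper name.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in "clean speech" feature vectors; in the described system these
# would be Gammatone filter bank / cepstral features computed per frame.
rng = np.random.default_rng(0)
clean_features = rng.normal(loc=0.0, scale=1.0, size=(200, 8))

# Train only on clean speech: the one-class SVM learns a boundary around
# it, so no noise or non-speech examples are needed at training time.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
detector.fit(clean_features)

def is_speech(frame_features):
    # predict() returns +1 inside the clean-speech boundary, -1 outside
    return detector.predict(frame_features.reshape(1, -1))[0] == 1
```

Because the model never sees noise examples, any frame far from the clean-speech manifold is rejected, which is what makes the scheme environment-independent.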

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

    ABSTRACT: In recent years there has been great progress in automatic speech recognition. The challenge now is not only to recognize the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and the personality of the speaker. This research work aims to develop a methodology for automatic emotion recognition from speech signals under non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for that purpose
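    As a concrete illustration of the wavelet-based feature family mentioned above, the sketch below computes per-level Haar detail energies of a signal, one simple way to summarize how signal energy is distributed across scales. The function names are illustrative, not taken from the paper.

```python
def haar_step(x):
    # one level of the orthonormal Haar transform
    approx = [(x[i] + x[i + 1]) / 2 ** 0.5 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / 2 ** 0.5 for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_energy_features(x, levels=3):
    # energy of the detail coefficients at each level, plus the final
    # approximation energy: a crude scale-energy signature of the signal
    feats = []
    for _ in range(levels):
        x, detail = haar_step(x)
        feats.append(sum(v * v for v in detail))
    feats.append(sum(v * v for v in x))
    return feats
```

Since the Haar transform here is orthonormal, the feature vector redistributes, rather than loses, the signal's energy across scales.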

    Speaker recognition: current state and experiment

    In this thesis the operation of speaker recognition systems is described and the state of the art of the main working blocks is studied. All the research papers consulted can be found in the References. As voice is unique to the individual, it has emerged as a viable authentication method. Several problems must be considered, such as the presence of noise in the environment and changes in the speakers' voices due, for example, to sickness. These systems combine knowledge from signal processing for the feature extraction part and signal modelling for the classification and decision part. There are several techniques for the feature extraction and pattern matching blocks, so it is difficult to establish a unique, optimal solution. MFCC and DTW are the most common techniques for each block, respectively. They are discussed in this document, with special emphasis on their drawbacks, which motivate new techniques that are also presented here. An Internet search is conducted for commercial working implementations, which are quite rare; then a basic introduction to Praat is presented. Finally, some intra-speaker and inter-speaker tests are performed using this software
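    DTW, named above as the most common pattern-matching block, aligns two feature sequences of different lengths by minimising a cumulative frame-to-frame cost. A minimal sketch over 1-D sequences follows; real speaker recognition systems compare MFCC vectors per frame rather than scalars.

```python
def dtw_distance(a, b):
    # dynamic time warping with absolute-difference local cost
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

A sequence and a time-stretched copy of it get distance zero, which is exactly the speaking-rate invariance that makes DTW attractive for template matching.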

    Stress and emotion recognition in natural speech in the work and family environments

    The speech stress and emotion recognition and classification technology has the potential to provide significant benefits to national and international industry and to society in general. The accuracy of automatic speech stress and emotion recognition relies heavily on the discriminative power of the characteristic features. This work introduced and examined a number of new linear and nonlinear feature extraction methods for the automatic detection of stress and emotion in speech. The proposed linear feature extraction methods included features derived from speech spectrograms (SS-CB/BARK/ERB-AE, SS-AF-CB/BARK/ERB-AE, SS-LGF-OFS, SS-ALGF-OFS, SS-SP-ALGF-OFS and SS-sigma-pi), wavelet packets (WP-ALGF-OFS) and the empirical mode decomposition (EMD-AER). The proposed nonlinear feature extraction methods were based on the results of recent laryngological studies and nonlinear modelling of the phonation process. The proposed nonlinear features included the area under the TEO autocorrelation envelope based on different spectral decompositions (TEO-DWT, TEO-WP, TEO-PWP-S and TEO-PWP-G), as well as features representing the spectral energy distribution of speech (AUSEES) and of the glottal waveform (AUSEEG). The proposed features were compared with features based on the classical linear model of speech production, including F0, formants, MFCC and glottal time/frequency parameters. Two classifiers, GMM and KNN, were tested for consistency. The experiments used speech under actual stress from the SUSAS database (7 speakers; 3 female and 4 male) and speech with five naturally expressed emotions (neutral, anger, anxious, dysphoric and happy) from the ORI corpora (71 speakers; 27 female and 44 male). The nonlinear features clearly outperformed all the linear features.
    The classification results were consistent with the nonlinear model of the phonation process, indicating that the harmonic structure and the spectral distribution of the glottal energy provide the most important cues for stress and emotion recognition in speech. The study also investigated whether automatic emotion recognition can determine differences in emotion expression between parents of depressed adolescents and parents of non-depressed adolescents, and whether there are differences in emotion expression between mothers and fathers in general. The results indicated that parents of depressed adolescents produce stronger, more exaggerated expressions of affect than parents of non-depressed children, and that females in general provide more easily discriminated (more exaggerated) expressions of affect than males
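    The TEO-based features above build on the discrete Teager Energy Operator, Ψ[x(n)] = x²(n) − x(n−1)·x(n+1), which for a sinusoid tracks amplitude and frequency jointly rather than amplitude alone. A minimal sketch of the operator itself follows; the autocorrelation-envelope and spectral-decomposition stages of the proposed features are omitted.

```python
import math

def teager_energy(x):
    # discrete Teager Energy Operator: x(n)^2 - x(n-1)*x(n+1)
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a pure sinusoid sin(omega*n) the operator is exactly constant:
# sin^2(omega*n) - sin(omega*(n-1))*sin(omega*(n+1)) = sin^2(omega)
signal = [math.sin(0.2 * n) for n in range(100)]
teo = teager_energy(signal)
```

That exact constancy on a single sinusoid is what makes deviations of the TEO output a sensitive indicator of the nonlinear, multi-component excitation associated with stressed phonation.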

    Multibiometric security in wireless communication systems

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University, 05/08/2010. This thesis aimed to explore an application of multibiometrics to secured wireless communications. The media of study for this purpose included Wi-Fi, 3G, and WiMAX, over which simulations and experimental studies were carried out to assess performance. Specifically, restriction of access to authorized users only is provided by a technique referred to hereafter as a multibiometric cryptosystem. In brief, the system is built upon a complete challenge/response methodology in order to obtain a high level of security, on the basis of user identification by fingerprint and further confirmation by verification of the user through text-dependent speaker recognition. First is the enrolment phase, in which a database of fingerprints watermarked with memorable texts, along with voice features based on the same texts, is created by sending them to the server through the wireless channel. Later is the verification stage, at which claimant users are verified against the database; it consists of five steps. At the identification level, the user first presents a fingerprint and a memorable word, with the word watermarked into the fingerprint, for the system to authenticate the fingerprint, verify its validity, and retrieve the challenge for the accepted user. The following three steps then involve speaker recognition: the user responds to the challenge with text-dependent voice, the server authenticates the response, and finally the server accepts or rejects the user. In order to implement fingerprint watermarking, i.e. incorporating the memorable word as a watermark message into the fingerprint image, a five-step algorithm has been developed.
    The first three novel steps, concerned with fingerprint image enhancement (CLAHE with 'Clip Limit', standard deviation analysis and sliding neighborhood), are followed by two further steps for embedding and extracting the watermark in the enhanced fingerprint image using the Discrete Wavelet Transform (DWT). In the speaker recognition stage, the limitations of this technique in wireless communication are addressed by sending voice features (cepstral coefficients) instead of raw samples. This scheme reaps the advantages of reduced transmission time and reduced dependency of the data on the communication channel, together with no packet loss. Finally, the obtained results have verified these claims
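    The embed/extract pair can be illustrated with a one-level 2-D Haar transform, a simple instance of the DWT the thesis uses. The sign-based embedding rule below is a common textbook scheme chosen for brevity, not necessarily the thesis's own algorithm.

```python
import numpy as np

def haar2(img):
    # one-level 2-D Haar transform of an even-sized image
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def ihaar2(LL, LH, HL, HH):
    # exact inverse of haar2
    h, w = LL.shape
    img = np.empty((2 * h, 2 * w))
    img[0::2, 0::2] = (LL + LH + HL + HH) / 2
    img[0::2, 1::2] = (LL - LH + HL - HH) / 2
    img[1::2, 0::2] = (LL + LH - HL - HH) / 2
    img[1::2, 1::2] = (LL - LH - HL + HH) / 2
    return img

def embed(img, bits, alpha=2.0):
    # encode each bit in the sign of a diagonal-detail (HH) coefficient
    LL, LH, HL, HH = haar2(img)
    flat = HH.flatten()
    for k, bit in enumerate(bits):
        mag = abs(flat[k]) + alpha
        flat[k] = mag if bit else -mag
    return ihaar2(LL, LH, HL, flat.reshape(HH.shape))

def extract(img, n_bits):
    HH = haar2(img)[3]
    return [int(v >= 0) for v in HH.flatten()[:n_bits]]
```

Embedding in the detail sub-band keeps the approximation (LL) coefficients, and hence the gross appearance of the fingerprint image, untouched.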

    A motion-based approach for audio-visual automatic speech recognition

    The research work presented in this thesis introduces novel approaches to both visual region-of-interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest: the difference in luminance between successive images, block-matching-based motion vectors, and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance compared with the commonly used appearance-feature-based methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or discrete wavelet transform representations of the mouth region of the speaker. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data-preservation viewpoint. The main finding is that audio-visual automatic speech recognition systems using the new features, extracted from frequency bands selected according to their discriminatory abilities, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features, their performance has been studied in the presence of a range of different types of noise at various signal-to-noise ratios.
    In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems
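    Of the three motion representations listed above, block matching is the easiest to sketch: for a block in the current frame, search a small window of the previous frame for the best match; the winning offset is the motion vector. A minimal sum-of-absolute-differences version, illustrative rather than the thesis implementation:

```python
import numpy as np

def block_match(prev, curr, by, bx, bsize=4, search=2):
    # find the offset (dy, dx) into the previous frame that best matches
    # the block at (by, bx) in the current frame (sum of absolute differences)
    block = curr[by:by + bsize, bx:bx + bsize]
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > prev.shape[0] or x + bsize > prev.shape[1]:
                continue
            cost = np.abs(prev[y:y + bsize, x:x + bsize] - block).sum()
            if cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best

rng = np.random.default_rng(1)
prev = rng.random((8, 8))
curr = np.roll(prev, 1, axis=0)  # whole frame shifted down by one pixel
```

Collecting such vectors over the mouth region yields a motion field whose statistics can serve directly as visual speech features.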

    Automatic speaker recognition: modelling, feature extraction and effects of clinical environment

    Speaker recognition is the task of establishing the identity of an individual based on his or her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. The speaker recognition task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation-Maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the features' sensitivity to changes due to aging of speakers, use of alcohol and drugs, and changing health conditions and mental state. The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the expectation-maximization (EM) method. It was demonstrated that features based on the nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed.
    The deterioration process was demonstrated using conventional (MFCC) features. The thesis also showed that replacing the MFCC features with features based on the nonlinear model of speech production (TEO-based features) can reduce the detrimental effect of clinical depression on speaker verification rates
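    The GMM/EM baseline described above can be sketched in one dimension: EM alternates between computing responsibilities (E-step) and re-estimating weights, means, and variances (M-step). A minimal two-component version follows; real speaker models use many multivariate components over MFCC vectors, and this sketch says nothing about the EM-ITVQ variant the thesis proposes.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # fit a two-component 1-D Gaussian mixture with plain EM
    mu = np.array([x.min(), x.max()], dtype=float)   # spread-out init
    var = np.array([x.var(), x.var()], dtype=float)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        p = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(10.0, 1.0, 500)])
w, mu, var = em_gmm_1d(x)
```

The slow-convergence drawback the thesis targets is visible even here: each EM sweep touches every sample for every component.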

    Content-based music structure analysis

    Ph.D. (Doctor of Philosophy)

    Single-trial classification of an EEG-based brain computer interface using the wavelet packet decomposition and cepstral analysis

    Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. A Brain-Computer Interface (BCI) monitors brain activity using signals such as EEG, ECoG, and MEG, and attempts to bridge the gap between thoughts and actions by providing control of physical devices ranging from wheelchairs to computers. A crucial process for a BCI system is feature extraction, and many studies have been undertaken to find relevant information in a set of input signals. This thesis investigated feature extraction from EEG signals using two different approaches. Wavelet packet decomposition was used to extract information from the signals in the frequency domain, and cepstral analysis was used to search for relevant information in the cepstral domain. A BCI was implemented to evaluate the two approaches, and three classification techniques contributed to determining the effectiveness of each feature type. Data containing two-class motor imagery was used for testing, and the BCI was compared with some of the other systems currently available. Results indicate that both approaches investigated were effective in producing separable features and, with further work, can be used for the classification of trials based on a paradigm exploiting motor imagery as a means of control
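    The cepstral-domain idea above can be sketched with the real cepstrum, the inverse FFT of the log magnitude spectrum; periodic structure in the log spectrum (an echo, or harmonics of an oscillatory rhythm) appears as a peak at the corresponding quefrency. A minimal NumPy version, illustrative rather than the thesis's feature pipeline:

```python
import numpy as np

def real_cepstrum(x):
    # inverse FFT of the log magnitude spectrum
    spectrum = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

# Sanity check: a unit impulse has a flat magnitude spectrum, so its
# log spectrum, and therefore its cepstrum, is essentially zero.
impulse = np.zeros(16)
impulse[0] = 1.0
cep = real_cepstrum(impulse)
```

The small additive constant guards the logarithm against zero-magnitude bins, a standard precaution when cepstra are computed on real recordings.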
