Query by Example of Speaker Audio Signals using Power Spectrum and MFCCs
"Search engine" is the popular term for an information retrieval (IR) system. Typically, a search engine is based on full-text indexing. Moving from text to multimedia data types, such as images or sounds in large databases, makes the retrieval process more complex. This paper introduces the use of language- and text-independent speech as input queries to a large sound database by means of a speaker identification algorithm. The method consists of two main processing steps: first, vocal and non-vocal segments are separated; then the vocal segments are used for speaker identification, enabling audio query by speaker voice. For speaker identification and audio query, we estimate the similarity between the example signal and the samples in the queried database by calculating the Euclidean distance between their Mel-frequency cepstral coefficients (MFCCs) and energy spectrum acoustic features. The simulations show good performance at a sustainable computational cost, with an average accuracy rate of more than 90%
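The distance-based matching described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors are hand-made placeholders standing in for per-recording MFCC and energy features, and the names `euclidean`, `rank_by_similarity`, `database` and `query` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query, database):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = [(name, euclidean(query, feats)) for name, feats in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical feature vectors (assumed: one fixed-length vector per recording).
database = {
    "speaker_a": [1.0, 2.0, 3.0],
    "speaker_b": [4.0, 6.0, 3.0],
}
query = [1.0, 2.0, 4.0]

ranking = rank_by_similarity(query, database)
best_match = ranking[0][0]  # the closest database entry to the query
```

In practice each recording yields a sequence of MFCC frames, so some pooling step (for example, averaging over frames) is assumed before vectors can be compared this way.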
Robust text independent closed set speaker identification systems and their evaluation
PhD Thesis
This thesis focuses upon text independent closed set speaker
identification. The contributions relate to evaluation studies in the
presence of various types of noise and handset effects. Extensive
evaluations are performed on four databases.
The first contribution is in the context of the use of the Gaussian
Mixture Model-Universal Background Model (GMM-UBM) with
original speech recordings from only the TIMIT database. Four main
simulations for Speaker Identification Accuracy (SIA) are presented
including different fusion strategies: late fusion (score based), early
fusion (feature based) and early-late fusion (combination of feature and
score based), late fusion using concatenated static and dynamic
features (features with temporal derivatives such as first order
derivative delta and second order derivative delta-delta features,
namely acceleration features), and finally fusion of statistically
independent normalized scores.
The second contribution is again based on the GMM-UBM
approach. Comprehensive evaluations of the effect of Additive White
Gaussian Noise (AWGN), and Non-Stationary Noise (NSN) (with and
without a G.712 type handset) upon identification performance are
undertaken. In particular, three NSN types with varying Signal to
Noise Ratios (SNRs) were tested corresponding to: street traffic, a bus
interior and a crowded talking environment. The performance
evaluation also considered the effect of late fusion techniques based on
score fusion, namely mean, maximum, and linear weighted sum fusion.
The databases employed were: TIMIT, SITW, and NIST 2008; and 120
speakers were selected from each database to yield 3,600 speech
utterances.
The third contribution is based on the use of the I-vector; four
combinations of I-vectors with 100 and 200 dimensions were employed.
Then, various fusion techniques using maximum, mean, weighted sum
and cumulative fusion with the same I-vector dimension were used to
improve the SIA. Similarly, both interleaving and concatenated I-vector
fusion were exploited to produce 200 and 400 I-vector dimensions. The
system was evaluated with four different databases using 120 speakers
from each database. TIMIT, SITW and NIST 2008 databases were
evaluated for various types of NSN, namely street-traffic NSN,
bus-interior NSN and crowd talking NSN; and the G.712 type handset
at 16 kHz was also applied.
As recommendations from the study in terms of the GMM-UBM
approach, mean fusion is found to yield overall best performance in terms
of the SIA with noisy speech, whereas linear weighted sum fusion is
overall best for original database recordings. However, in the I-vector
approach the best SIA was obtained from the weighted sum and the
concatenated fusion.
Ministry of Higher Education
and Scientific Research (MoHESR), and the Iraqi Cultural Attaché,
Al-Mustansiriya University, Al-Mustansiriya University College of
Engineering in Iraq for supporting my PhD scholarship
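The late (score-level) fusion rules named in this abstract, namely mean, maximum and linear weighted sum, can be illustrated with a short sketch. It assumes each subsystem has already produced one normalized score per enrolled speaker. The scores, the weights and the function names `fuse_scores` and `identify` are hypothetical and are not taken from the thesis.

```python
def fuse_scores(score_sets, rule="mean", weights=None):
    """Fuse per-speaker scores from several subsystems into one score per speaker.

    score_sets: list of dicts mapping speaker id -> normalized score,
                one dict per subsystem, all sharing the same speaker ids.
    """
    fused = {}
    for spk in score_sets[0]:
        values = [scores[spk] for scores in score_sets]
        if rule == "mean":
            fused[spk] = sum(values) / len(values)
        elif rule == "max":
            fused[spk] = max(values)
        elif rule == "weighted_sum":
            if weights is None or len(weights) != len(values):
                raise ValueError("weighted_sum needs one weight per subsystem")
            fused[spk] = sum(w * v for w, v in zip(weights, values))
        else:
            raise ValueError(f"unknown fusion rule: {rule}")
    return fused


def identify(fused):
    """Closed-set decision: pick the speaker with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical normalized scores from two subsystems.
system_1 = {"spk1": 0.6, "spk2": 0.4}
system_2 = {"spk1": 0.2, "spk2": 0.9}

by_mean = identify(fuse_scores([system_1, system_2], rule="mean"))
by_max = identify(fuse_scores([system_1, system_2], rule="max"))
by_weighted = identify(
    fuse_scores([system_1, system_2], rule="weighted_sum", weights=[0.9, 0.1])
)
```

With these made-up numbers the weighted sum favours the first subsystem and reaches a different decision from the mean and maximum rules, which shows why the choice of rule can matter.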
Robust speaker recognition in presence of non-trivial environmental noise (toward greater biometric security)
The aim of this thesis is to investigate speaker recognition in the presence of environmental noise, and to develop a robust speaker recognition method. Recently, speaker recognition has been the object of considerable research due to its wide use in various areas. Despite major developments in this field, there are still many limitations and challenges. Environmental noises and their variations are high on the list of challenges, since it is impossible to provide a noise-free environment. A novel approach is proposed to address the issue of performance degradation in environmental noise. This approach is based on estimating the signal-to-noise ratio (SNR) and detecting the ambient noise in the recognition signal, in order to re-train the reference model for the claimed speaker and generate a new, noise-adapted model that reduces the noise mismatch with the recognition utterances. This approach is termed "Training on the fly" for robustness of speaker recognition in noisy environments. To detect the noise in the recognition signal, two different techniques are proposed. The first technique generates an emulated noise from the estimated power spectrum of the original noise, using a 1/3-octave band filter bank and a white noise signal. This emulated noise is close enough to the original noise contained in the input signal (recognition signal). The second technique extracts the noise from the input signal using a speech enhancement algorithm with spectral subtraction. The training-on-the-fly approach (using both techniques) has been examined using two feature approaches and two different kinds of artificial clean and noisy speech databases collected in different environments. Furthermore, the speech samples were text independent. The training-on-the-fly approach yields a significant improvement in performance compared with conventional speaker recognition (based on clean reference models).
Moreover, training on the fly based on noise extraction showed the best results for all types of noisy data
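One ingredient of such an approach, adding noise to clean reference speech so that it matches an estimated SNR, can be sketched as below. This is only an assumed illustration of the standard SNR definition, 10·log10(P_signal / P_noise). It does not reproduce the thesis's noise emulation or extraction steps, and the signals and function names are hypothetical.

```python
import math


def power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))


def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` so that clean + noise has the requested SNR.

    Returns the noisy signal and the scaled noise.
    """
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (target_snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


# Hypothetical toy signals of equal length.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

noisy, scaled_noise = mix_at_snr(clean, noise, target_snr_db=10.0)
achieved = snr_db(clean, scaled_noise)  # approximately 10 dB
```

A reference model re-trained on `noisy` would then see roughly the same noise level as the recognition utterance, which is the mismatch reduction the abstract describes.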
A proof-of-proximity framework for device pairing in ubiquitous computing environments
Ad hoc interactions between devices over wireless networks in ubiquitous
computing environments present a security problem: the generation of shared secrets
to initialize secure communication over a medium that is inherently vulnerable to
various attacks. However, these ad hoc scenarios also offer the potential for physical
security of spaces and the use of protocols in which users must visibly demonstrate
their presence and/or involvement to generate an association. As a consequence,
recently secure device pairing has had significant attention from a wide community of
academic as well as industrial researchers and a plethora of schemes and protocols
have been proposed, which use various forms of out-of-band exchange to form an
association between two unassociated devices. These protocols and schemes have
different strengths and weaknesses – often in hardware requirements, strength against
various attacks or usability in particular scenarios. From an ordinary user's point of
view, the problem then becomes which to choose or which is the best possible scheme
in a particular scenario.
We advocate that in a world of modern heterogeneous devices and
requirements, there is a need for mechanisms that allow automated selection of the
best protocols without requiring the user to have an in-depth knowledge of the
minutiae of the underlying technologies. Towards this, the main argument forming the
basis of this dissertation is that the integration of a discovery mechanism and several
pairing schemes into a single system is more efficient from a usability point of view
as well as security point of view in terms of dynamic choice of pairing schemes. In
pursuit of this, we have proposed a generic system for secure device pairing by
demonstration of physical proximity. Our main contribution is the design and
prototype implementation of Proof-of-Proximity framework along with a novel Co-
Location protocol. Other contributions include a detailed analysis of existing device
pairing schemes, a simple device discovery mechanism, a protocol selection
mechanism that is used to find out the best possible scheme to demonstrate the
physical proximity of the devices according to the scenario, and a usability study of
eight pairing schemes and the proposed system
VOICE BIOMETRICS UNDER MISMATCHED NOISE CONDITIONS
This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can in turn lead to a mismatch between the corresponding test and reference material from the same speaker. This is found to adversely affect the performance of speaker recognition in terms of accuracy.
To address the above problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations, based on the two main classes of speaker recognition (i.e. verification/ open-set identification), it is shown that the proposed approach can significantly improve the performance accuracy under mismatched noise conditions.
In order to further improve the recognition accuracy in severe mismatch conditions, an approach to enhancing the above stated method is proposed. This, which involves providing a closer adjustment of the reference speaker models to the noise condition in the test utterance, is shown to considerably increase the accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with the use of the enhanced approach with open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated.
The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes
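For orientation, conventional Test-Normalisation can be sketched as follows: the raw score of the test utterance against the claimed speaker's model is normalised by the mean and standard deviation of the scores the same utterance obtains against a cohort of other speakers' models. This shows standard T-Norm only, not the new form proposed in the thesis. The cohort scores are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def t_norm(raw_score, cohort_scores):
    """Standard T-Norm: (score - cohort mean) / cohort standard deviation."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mu) / sigma


# Hypothetical scores of one test utterance against two cohort models.
cohort = [1.0, 3.0]
normalised = t_norm(4.0, cohort)
```

Because the cohort statistics are computed from the test utterance itself, noise in that utterance shifts the raw score and the cohort scores together, which is what makes this kind of normalisation attractive under mismatched conditions.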
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications involving
mobile communication devices and hearing assistive devices. Due to the
re-emergence of machine learning techniques, known today as deep learning, the
challenges involved with such algorithms might be overcome. In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct
in-depth empirical analysis of the generalization capability of modern deep
learning-based single-microphone speech enhancement algorithms. We show that
performance of such algorithms is closely linked to the training data, and good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD Thesis. 233 page
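The core idea of utterance-level permutation invariant training (uPIT) can be sketched as a loss function. The error between network outputs and reference sources is computed over the whole utterance for every possible output-to-source assignment, and the smallest value is used. This is a simplified illustration on plain lists rather than spectrogram frames, with hypothetical names and toy values; it is not the thesis's implementation.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, targets):
    """Return (loss, permutation) for the best output-to-target assignment.

    estimates[i] is compared with targets[perm[i]]; each sequence is assumed
    to cover a whole utterance, so one assignment is chosen per utterance.
    """
    n = len(targets)
    best_loss, best_perm = None, None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Hypothetical two-speaker example where the outputs come out in swapped order.
estimates = [[0.0, 0.1], [1.0, 0.9]]
targets = [[1.0, 1.0], [0.0, 0.0]]

loss, perm = upit_loss(estimates, targets)
```

Enumerating all permutations grows factorially with the number of speakers, which is acceptable for the two- or three-talker mixtures typical of this task.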