376 research outputs found
New Features Using Robust MVDR Spectrum of Filtered Autocorrelation Sequence for Robust Speech Recognition
This paper presents a novel noise-robust feature
extraction method for speech recognition using the robust perceptual minimum variance distortionless response (MVDR) spectrum of temporally filtered autocorrelation sequence. The perceptual
MVDR spectrum of the filtered short-time autocorrelation
sequence can reduce the effects of residue of the nonstationary
additive noise which remains after filtering the autocorrelation.
To achieve a more robust front-end, we also modify the robust
distortionless constraint of the MVDR spectral estimation method
via revised weighting of the subband power spectrum values
based on the sub-band signal to noise ratios (SNRs), which adjusts
it to the new proposed approach. This new function allows the
components of the input signal at the frequencies least affected by
noise to pass with larger weights and attenuates more effectively
the noisy and undesired components. This modification results
in reduction of the noise residuals of the estimated spectrum
from the filtered autocorrelation sequence, thereby leading to
a more robust algorithm. Our proposed method, when evaluated
on the Aurora 2 task for recognition purposes, outperformed the Mel frequency cepstral coefficient (MFCC) baseline, relative autocorrelation sequence MFCC (RAS-MFCC), and MVDR-based features in several different noisy conditions.
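As a rough illustration of the MVDR spectral estimate the abstract builds on, the sketch below evaluates P(w) = 1 / (v^H R^-1 v) directly from a short autocorrelation sequence. It is a minimal sketch under stated assumptions: the function names are hypothetical, and the perceptual warping, temporal filtering of the autocorrelation, and SNR-based subband weighting described in the paper are not included.

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """Biased short-time autocorrelation estimate r[0..max_lag]."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[complex(v) for v in row] + [complex(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mvdr_spectrum(r, omega):
    """Plain MVDR power estimate at angular frequency omega (radians/sample).

    r holds autocorrelation lags 0..M; R is the (M+1)x(M+1) Toeplitz matrix
    built from them and v is the steering vector [1, e^{jw}, ..., e^{jMw}].
    """
    size = len(r)
    big_r = [[r[abs(i - j)] for j in range(size)] for i in range(size)]
    v = [cmath.exp(1j * omega * k) for k in range(size)]
    y = solve_linear(big_r, v)                      # y = R^-1 v
    denom = sum(vk.conjugate() * yk for vk, yk in zip(v, y))
    return 1.0 / denom.real
```

For a white-noise autocorrelation (r = [1, 0, 0]) the estimate is flat at 1/3, while a sinusoid in noise produces a peak at the sinusoid's frequency.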
Investigation of the impact of high frequency transmitted speech on speaker recognition
Thesis (MScEng)--Stellenbosch University, 2002. Some digitised pages may appear illegible due to the condition of the original hard copy. ENGLISH ABSTRACT: Speaker recognition systems have evolved to a point where near perfect performance can be
obtained under ideal conditions, even if the system must distinguish between a large number
of speakers. Under adverse conditions, such as when high noise levels are present or when the
transmission channel deforms the speech, the performance is often less than satisfying.
This project investigated the performance of a popular speaker recognition system, which uses
Gaussian mixture models, on speech transmitted over a high frequency channel. Initial experiments
demonstrated very unsatisfactory results for the baseline system.
We investigated a number of robust techniques. We implemented and applied some of them in
an attempt to improve the performance of the speaker recognition systems. The techniques we
tested showed only slight improvements.
We also investigated the effects of a high frequency channel and single sideband modulation on
the speech features of speech processing systems. The effects that can deform the features, and
therefore reduce the performance of speech systems, were identified.
One of the effects that can greatly affect the performance of a speech processing system is
noise. We investigated some speech enhancement techniques and as a result we developed a
new statistically based speech enhancement technique that employs hidden Markov models to
represent the clean speech process.
AFRIKAANS SUMMARY (translated): Speaker recognition systems have reached a point where near-perfect results can be expected
under ideal conditions, even if the system must distinguish between a large number of speakers. When
non-ideal conditions are present, such as high noise levels or a transmission channel that distorts the
speech, the results are usually unsatisfactory.
The project investigates the performance of a popular speaker recognition system, which uses
Gaussian mixture models, on speech sent over a high frequency transmission
channel. Initial experiments using a basic system did not yield
good results.
We investigated a number of robust techniques and implemented and tested a few of them
in an attempt to improve the results of the speaker recognition system. The techniques
we tested showed only slight improvement.
The study also investigated the effects of the high frequency channel and single sideband modulation on
speech feature vectors. The effects that can distort the speech feature vectors, and
thus lower the performance of speech systems, were identified.
One of the effects that has a large influence on the performance of speech systems is noise.
We investigated speech enhancement methods, which led to the development of a
statistically based speech enhancement technique that uses hidden Markov models
to represent the clean speech process.
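The Gaussian mixture model scoring that such a speaker recognition system relies on can be sketched as follows. This is a minimal illustration, assuming diagonal covariances and already-trained models; the function names and the toy model parameters are hypothetical and are not taken from the thesis.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of the frames under one GMM."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)


def identify_speaker(frames, models):
    """Return the name of the model with the highest average log-likelihood.

    models maps a speaker name to a (weights, means, variances) tuple.
    """
    return max(models, key=lambda name: gmm_log_likelihood(frames, *models[name]))


# Toy usage with made-up two-dimensional "features".
toy_models = {
    "speaker_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "speaker_b": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
toy_frames = [[0.1, -0.2], [0.9, 1.1]]
best_match = identify_speaker(toy_frames, toy_models)
```

Channel distortion and noise shift the feature vectors away from the regions where the trained mixtures place their mass, which is one way to see why the scores degrade over a high frequency channel.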
Modelling and extraction of fundamental frequency in speech signals
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. One of the most important parameters of speech is the fundamental frequency of vibration of voiced sounds. The audio sensation of the fundamental frequency is known as the pitch. Depending on the tonal/non-tonal category of language, the fundamental frequency conveys intonation, pragmatics and meaning. In addition, the fundamental frequency and intonation carry speaker gender, age, identity, speaking style and emotional state. Accurate estimation of the fundamental frequency is critically important for the functioning of speech processing applications such as speech coding, speech recognition, speech synthesis and voice morphing. This thesis makes contributions to the development of accurate pitch estimation research in three distinct ways: (1) an investigation of the impact of the window length on pitch estimation error, (2) an investigation of the use of the higher order moments and (3) an investigation of an analysis-synthesis method for selection of the best pitch value among N proposed candidates. Experimental evaluations show that the length of the speech window has a major impact on the accuracy of pitch estimation. Depending on the similarity criteria and the order of the statistical moment, a window length of 37 to 80 ms gives the least error. In order to avoid excessive delay as a consequence of using a longer window, a method is proposed
where the current short window is concatenated with the previous frames to form a longer signal window for pitch extraction. The use of second order and higher order moments, and the magnitude difference function, as the similarity criteria were explored and compared. A novel method of calculation of moments is introduced where the signal is split, i.e. rectified, into positive and negative valued samples. The moments for the positive and negative parts of the signal are computed separately and combined. The new method of calculation of moments from positive and negative parts and the higher order criteria provide competitive results. A challenging issue in pitch estimation is the determination of the best candidate from N extrema of the similarity criteria. The analysis-synthesis method proposed in this thesis selects the pitch candidate that provides the best reproduction (synthesis) of the harmonic spectrum of the original speech. The synthesis method must be such that the distortion increases with the increasing error in the estimate of the fundamental frequency. To this end a new method of spectral synthesis is proposed using an estimate of the spectral envelope and harmonically spaced asymmetric Gaussian pulses as excitation. The N-best method provides consistent reduction in pitch estimation error. The methods described in this thesis result in a significant improvement in the pitch accuracy and outperform the benchmark YIN method.
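One of the similarity criteria named in the abstract, the magnitude difference function, can be sketched as a simple pitch estimator. This is a minimal sketch, not the thesis's method: the search range of 60 to 400 Hz and the function names are assumptions, and the moment-based criteria and analysis-synthesis candidate selection are not shown.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and a lagged copy."""
    n = len(frame) - lag
    return sum(abs(frame[t] - frame[t + lag]) for t in range(n)) / n


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Pick the lag with the smallest AMDF inside the allowed pitch range."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag


# Usage: an 80 ms window of a 100 Hz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
window = [math.sin(2.0 * math.pi * 100.0 * t / sample_rate) for t in range(640)]
f0 = estimate_f0(window, sample_rate)
```

A longer window gives each lag more samples to average over, which is consistent with the abstract's finding that window length strongly affects the estimation error.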
Generalized Hidden Filter Markov Models Applied to Speaker Recognition
Classification of time series has wide Air Force, DoD and commercial interest, from automatic target recognition systems on munitions to recognition of speakers in diverse environments. The ability to effectively model the temporal information contained in a sequence is of paramount importance. Toward this goal, this research develops theoretical extensions to a class of stochastic models and demonstrates their effectiveness on the problem of text-independent (language constrained) speaker recognition. Specifically within the hidden Markov model architecture, additional constraints are implemented which better incorporate observation correlations and context, where standard approaches fail. Two methods of modeling correlations are developed, and their mathematical properties of convergence and reestimation are analyzed. These differ in modeling correlation present in the time samples and those present in the processed features, such as Mel frequency cepstral coefficients. The system models speaker dependent phonemes, making use of word dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding. Both closed set identification and speaker verification using cohorts are performed on the YOHO database. YOHO is the only large scale, multiple-session, high-quality speech database for speaker authentication and contains over one hundred speakers stating combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated. A critical error analysis using a hypothesis test formulation provides the maximum number of errors observable while still meeting the goal error rates of 1% False Reject and 0.1% False Accept. Our system achieves this goal
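The equal error rate quoted above is the operating point where the false reject and false accept rates coincide. A minimal sketch of how it can be read off from verification scores follows; the score lists are made up for illustration and the function names are hypothetical.

```python
def error_rates(genuine, impostor, threshold):
    """False reject and false accept rates when accepting scores >= threshold."""
    false_reject = sum(s < threshold for s in genuine) / len(genuine)
    false_accept = sum(s >= threshold for s in impostor) / len(impostor)
    return false_reject, false_accept


def equal_error_rate(genuine, impostor):
    """Scan candidate thresholds and return (eer, threshold) where FR is closest to FA."""
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        fr, fa = error_rates(genuine, impostor, threshold)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, (fr + fa) / 2.0, threshold)
    return best[1], best[2]


# Made-up scores: higher means "more likely the claimed speaker".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
```

With real systems the score lists are large, so the two error curves cross much more smoothly than in this four-sample toy case.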
Wavelet-based techniques for speech recognition
In this thesis, new wavelet-based techniques have been developed for the
extraction of features from speech signals for the purpose of automatic speech
recognition (ASR). One of the advantages of the wavelet transform over the short
time Fourier transform (STFT) is its capability to process non-stationary signals.
Since speech signals are not strictly stationary the wavelet transform is a better
choice for time-frequency transformation of these signals. In addition it has
compactly supported basis functions, thereby reducing the amount of
computation as opposed to STFT where an overlapping window is needed. [Continues.]
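To make the idea of wavelet-based feature extraction concrete, the sketch below computes log energies of the subbands produced by a multi-level discrete wavelet decomposition. It assumes the Haar wavelet purely because it is the simplest compactly supported basis; the abstract does not say which wavelets or features the thesis uses, and all names here are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Repeatedly split the approximation band; details are returned finest first."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def subband_log_energies(signal, levels):
    """Log energy of each detail band plus the final approximation band."""
    approx, details = haar_decompose(signal, levels)
    return [math.log(sum(c * c for c in band) + 1e-12) for band in details + [approx]]
```

Each output coefficient depends on only two neighbouring inputs, so no overlapping analysis window is required at this stage.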
Voice signature based Speaker Recognition
Magister Scientiae - MSc (Computer Science) Personal identification and the protection of data are important issues because of the ubiquitousness of computing and these have thus become interesting areas of research in the field of computer science. Previously people have used a variety of ways to identify an individual and protect themselves, their property and their information.
Recent Advances in Signal Processing
The signal processing task is a very critical issue in the majority of new technological inventions and challenges in a variety of applications in both science and engineering fields. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward both students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories are ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity
Voice-signature-based Speaker Recognition
Magister Scientiae - MSc (Computer Science) Personal identification and the protection of data are important issues because of the ubiquitousness of computing and these have thus become interesting areas of research in the field of computer science. Previously people have used a variety of ways to identify an individual and protect themselves, their property and their information. This they did mostly by means of locks, passwords, smartcards and biometrics. Verifying individuals by using their physical or behavioural features is more secure than using other data such as passwords or smartcards, because everyone has unique features which distinguish him or her from others. Furthermore, the biometrics of a person are difficult to imitate or steal. Biometric technologies represent a significant component of a comprehensive digital identity solution and play an important role in security. The technologies that support identification and authentication of individuals are based on either their physiological or their behavioural characteristics. Live-data, in this instance the human voice, is the topic of this research. The aim is to recognize a person's voice and to identify the user by verifying that his/her voice is the same as a record of his/her voice-signature in a system's database. To address the main research question, "What is the best way to identify a person by his/her voice signature?", design science research was employed. This methodology is used to develop an artefact for solving a problem. Initially a pilot study was conducted using visual representation of voice signatures, to check if it is possible to identify speakers without using feature extraction or matching methods. Subsequently, experiments were conducted with 6300 data sets derived from the Texas Instruments and Massachusetts Institute of Technology audio database. Two methods of feature extraction and classification were considered (mel frequency cepstrum coefficient and linear prediction cepstral coefficient feature extraction), and for classification the Support Vector Machines method was used. The three methods were compared in terms of their effectiveness and it was found that the system using the mel frequency cepstrum coefficient for feature extraction gave marginally better results for speaker recognition.
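The mel frequency cepstrum coefficient features mentioned above rest on a perceptual frequency warping. The sketch below shows only that first ingredient: the commonly used mel scale formula and the centre frequencies of a mel-spaced filterbank. The filter count and band edges are illustrative assumptions, and the remaining MFCC steps (windowing, FFT, log filterbank energies, DCT) and the Support Vector Machine classifier are not shown.

```python
import math


def hz_to_mel(freq_hz):
    """Common mel scale formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Centre frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]


# Usage with assumed values: 10 filters covering 0 to 4000 Hz.
centers = mel_center_frequencies(10, 0.0, 4000.0)
```

The centres crowd together at low frequencies and spread out at high frequencies, which mirrors the ear's finer resolution at the low end.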
Speech Detection Using Gammatone Features And One-class Support Vector Machine
A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is through Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, this says little about the signal being passed. More sophisticated methods employ machine learning algorithms, but train on specific noises intended for a target environment. Validating speech under a variety of unknown conditions must be possible, as well as differentiating between speech and nonspeech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean speech model for detection. Through the use of Gammatone filter bank processing, the Cepstrum and several frequency domain features are used to train a One-Class Support Vector Machine which provides a clean-speech model irrespective of environmental noise. A Wiener filter is used to provide improved operation for harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech with approximately 70% accuracy for SNR as low as 5d
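The kind of simple time-domain threshold detector that the abstract contrasts itself with can be sketched in a few lines. This is a minimal sketch of that baseline, not of the proposed Gammatone and One-Class SVM method; the frame length, the -30 dB threshold relative to the loudest frame, and the function names are all assumptions.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def energy_vad(samples, frame_len, threshold_db=-30.0):
    """Mark a frame active when its energy is within threshold_db of the loudest frame."""
    energies = frame_energies(samples, frame_len)
    peak = max(energies, default=0.0)
    if peak <= 0.0:
        return [False] * len(energies)
    decisions = []
    for energy in energies:
        level_db = 10.0 * math.log10(energy / peak) if energy > 0.0 else float("-inf")
        decisions.append(level_db > threshold_db)
    return decisions


# Usage: silence, a 200 Hz tone, then silence, at an assumed 8 kHz sample rate.
silence = [0.0] * 160
tone = [0.5 * math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(160)]
activity = energy_vad(silence + tone + silence, 160)
```

A detector like this flags any sufficiently loud frame, whether it holds speech, noise, or arbitrary embedded data, which illustrates the limitation the abstract points out.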