Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
In this study, we propose the global context guided channel and
time-frequency transformations to model the long-range, non-local
time-frequency dependencies and channel variances in speaker representations.
We use the global context information to enhance important channels and
recalibrate salient time-frequency locations by computing the similarity
between the global context and local features. The proposed modules, together
with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset,
which is a large scale speaker verification corpus collected in the wild. This
lightweight block can be easily incorporated into a CNN model with little
additional computational cost and effectively improves the speaker
verification performance compared to the baseline ResNet-LDE model and the
Squeeze&Excitation block by a large margin. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed modules. We find that by employing the proposed L2-tf-GTFC
transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a
relative 32.68% reduction, with a relative 27.28% improvement in terms of the
DCF score. The results indicate that our proposed global context guided
transformation modules can efficiently improve the learned speaker
representations by achieving time-frequency and channel-wise feature
recalibration. Comment: Accepted to Interspeech 202
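The relative Equal Error Rate reduction quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only uses the two EER values stated in the abstract; the function name `relative_reduction` is a hypothetical helper, not something taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


# EER values quoted in the abstract: 4.56% (baseline) and 3.07% (proposed block).
eer_drop = relative_reduction(4.56, 3.07)
print(f"{eer_drop:.2f}%")  # prints 32.68%
```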
Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition
Learning an effective speaker representation is crucial for achieving
reliable performance in speaker verification tasks. Speech signals are
high-dimensional, long, and variable-length sequences that entail a complex
hierarchical structure. Signals may contain diverse information at each
time-frequency (TF) location. For example, it may be more beneficial to focus
on high-energy parts for phoneme classes such as fricatives. The standard
convolutional layer that operates on neighboring local regions cannot capture
the complex TF global context information. In this study, a general global
time-frequency context modeling framework is proposed to leverage the context
information specifically for speaker representation modeling. First, a
data-driven attention-based context model is introduced to capture the
long-range and non-local relationship across different time-frequency
locations. Second, a data-independent 2D-DCT based context model is proposed to
improve model interpretability. A multi-DCT attention mechanism is presented to
improve modeling power with alternate DCT base forms. Finally, the global
context information is used to recalibrate salient time-frequency locations by
computing the similarity between the global context and local features. The
proposed lightweight blocks can be easily incorporated into a speaker model
with little additional computational cost and effectively improve the speaker
verification performance compared to the standard ResNet model and
Squeeze&Excitation block by a large margin. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed individual modules. Results from experiments show that the proposed
global context modeling framework can efficiently improve the learned speaker
representations by achieving channel-wise and time-frequency feature
recalibration.
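To make the two ideas in this abstract concrete, here is a minimal NumPy sketch of a data-independent 2D-DCT global context followed by similarity-based time-frequency recalibration. It is an illustration under stated assumptions, not the authors' implementation: the function names, the (channels, frequency, time) layout, the dot-product similarity, and the sigmoid gate are all assumptions, and the learned projections a real model would use are left out.

```python
import numpy as np


def dct_basis(height: int, width: int, u: int, v: int) -> np.ndarray:
    """2D DCT-II basis function of frequency index (u, v) on a height x width grid."""
    i = np.arange(height)[:, None]
    j = np.arange(width)[None, :]
    return (np.cos(np.pi * (i + 0.5) * u / height)
            * np.cos(np.pi * (j + 0.5) * v / width))


def global_context(features: np.ndarray, u: int = 0, v: int = 0) -> np.ndarray:
    """Project a (C, F, T) feature map onto one DCT basis: one value per channel.

    With u = v = 0 the basis is all ones, so this reduces to global average pooling.
    """
    _, n_freq, n_time = features.shape
    basis = dct_basis(n_freq, n_time, u, v)
    return (features * basis).sum(axis=(1, 2)) / (n_freq * n_time)


def recalibrate(features: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Scale each time-frequency location by its similarity to the global context.

    Assumption: similarity is a scaled dot product between the context vector (C,)
    and the local feature vector at each (f, t), squashed by a sigmoid.
    """
    n_channels = features.shape[0]
    similarity = np.einsum("c,cft->ft", context, features) / np.sqrt(n_channels)
    gate = 1.0 / (1.0 + np.exp(-similarity))
    return features * gate[None, :, :]


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 16, 20))  # hypothetical (C, F, T) tensor
context = global_context(feature_map)
recalibrated = recalibrate(feature_map, context)
```

Choosing other `(u, v)` pairs gives alternative fixed contexts, which is roughly what a multi-DCT scheme would combine; how the paper weights them is not specified in the abstract.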
Research in nonlinear structural and solid mechanics
Recent and projected advances in applied mechanics, numerical analysis, computer hardware and engineering software, and their impact on modeling and solution techniques in nonlinear structural and solid mechanics are discussed. The fields covered are rapidly changing and are strongly impacted by current and projected advances in computer hardware. To foster effective development of the technology, perceptions on computing systems and nonlinear analysis software systems are presented.
Multibiometric security in wireless communication systems
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University, 05/08/2010. This thesis explores an application of multibiometrics to secured wireless communications. The media studied for this purpose were Wi-Fi, 3G, and
WiMAX, over which simulations and experimental studies were carried out to assess performance. Specifically, access is restricted to authorized users by a technique referred to hereafter as a multibiometric cryptosystem. In brief, the system is built upon a complete challenge/response methodology in order to obtain a high level of security, based on user identification by fingerprint and further confirmation through text-dependent speaker recognition.
First is the enrolment phase, in which the database of watermarked fingerprints with
memorable texts, along with the voice features based on the same texts, is created by sending them to the server through the wireless channel.
Next is the verification stage, in which claimants (users who claim to be genuine) are verified against the database; it consists of five steps. At the identification level, the user is first asked to present a fingerprint and a memorable word, the latter being watermarked into the former, so that the system can authenticate the fingerprint and verify its validity by retrieving the challenge for the accepted user.
The following three steps then involve speaker recognition: the user
responds to the challenge with text-dependent speech, the server authenticates the response, and finally the server accepts or rejects the user.
To implement fingerprint watermarking, i.e. incorporating the memorable word as a watermark message into the fingerprint image, an algorithm of five steps has been developed. The first three steps, which are novel, concern fingerprint
image enhancement (CLAHE with 'Clip Limit', standard deviation analysis and
sliding neighborhood). The remaining two steps embed the watermark into, and
extract it from, the enhanced fingerprint image using the Discrete
Wavelet Transform (DWT).
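The embedding and extraction steps can be sketched as follows. This is only an illustration under assumptions: the abstract does not say which wavelet, subband, or embedding rule the thesis uses, so the sketch swaps in a single-level Haar transform written by hand and quantization-index modulation of the diagonal (HH) coefficients. All function names and the `step` parameter are hypothetical, and the enhancement steps (CLAHE and so on) are omitted.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """Single-level 2D Haar transform of an image with even height and width."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh) -> np.ndarray:
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x


def embed_word(image: np.ndarray, word: str, step: float = 4.0) -> np.ndarray:
    """Hide the bits of `word` in the parity of quantized HH coefficients."""
    bits = np.unpackbits(np.frombuffer(word.encode("utf-8"), dtype=np.uint8))
    ll, lh, hl, hh = haar_dwt2(image.astype(float))
    flat = hh.ravel().copy()
    if bits.size > flat.size:
        raise ValueError("word is too long for this image")
    q = np.rint(flat[:bits.size] / step).astype(int)
    q = q + (q % 2 != bits).astype(int)   # bump the index when its parity is wrong
    flat[:bits.size] = q * step
    # The result is kept as float; rounding to 8-bit pixels could disturb the bits.
    return haar_idwt2(ll, lh, hl, flat.reshape(hh.shape))


def extract_word(image: np.ndarray, n_chars: int, step: float = 4.0) -> str:
    """Read back `n_chars` bytes from the parity of quantized HH coefficients."""
    hh = haar_dwt2(image.astype(float))[3]
    q = np.rint(hh.ravel()[: 8 * n_chars] / step).astype(int)
    return np.packbits((q % 2).astype(np.uint8)).tobytes().decode("utf-8")


rng = np.random.default_rng(1)
fingerprint = rng.integers(0, 256, size=(32, 32))   # stand-in for a fingerprint image
marked = embed_word(fingerprint, "secret")
print(extract_word(marked, n_chars=6))  # prints "secret"
```

A larger `step` makes the hidden bits more tolerant of noise at the price of a more visible change to the image; the thesis may balance this differently.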
In the speaker recognition stage, the limitations of this technique in wireless
communication have been addressed by sending voice features (cepstral coefficients)
instead of raw samples. This scheme is intended to reduce the
transmission time and the dependency of the data on the communication channel,
with no packet loss. Finally, the obtained results have verified the claims.
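As a rough illustration of why sending cepstral coefficients is more compact than sending raw samples, the sketch below computes a plain real cepstrum per frame with NumPy. The abstract does not state which cepstral features the thesis uses, so the frame length, hop size, Hamming window, and the choice of 13 coefficients are all assumptions, as is the function name.

```python
import numpy as np


def cepstral_features(signal: np.ndarray, frame_len: int = 256,
                      hop: int = 128, n_coeffs: int = 13) -> np.ndarray:
    """Real cepstrum of each windowed frame, truncated to the first n_coeffs values.

    Assumes len(signal) >= frame_len. Returns an array of shape (n_frames, n_coeffs).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((n_frames, n_coeffs))
    for k in range(n_frames):
        frame = signal[k * hop: k * hop + frame_len] * window
        # Small constant avoids log(0) on silent frames.
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        feats[k] = np.fft.irfft(log_mag, n=frame_len)[:n_coeffs]
    return feats


rng = np.random.default_rng(2)
speech = rng.standard_normal(8000)      # stand-in for one second of 8 kHz audio
features = cepstral_features(speech)
print(speech.size, features.size)       # number of values before and after
```

In this toy setting far fewer values are sent than raw samples; the actual saving in the thesis depends on its own framing and coding choices.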
Detection and handling of overlapping speech for speaker diarization
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken
language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings,
compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also
due to the presence of overlapping speech.
Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a
substantial portion of errors of the conventional speaker diarization systems can be ascribed to speaker overlaps, since usually
only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually
lead to corrupt single-speaker models and thus to a worse segmentation.
This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker
diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on
distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component
analysis, linear discriminant analysis, or by a multi-layer perceptron.
In addition, we also investigate the possibility of employing long-term prosodic information. The most suitable subset from a set
of candidate prosodic features is determined in two steps. Firstly, a ranking according to the mRMR criterion is obtained, and then,
a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features.
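The two-step selection described above can be sketched as a ranking followed by a greedy search over how many ranked features to keep. This is a simplified stand-in, not the thesis's procedure: absolute Pearson correlation replaces the mutual information normally used in mRMR, the wrapper scores subsets with a toy nearest-centroid classifier on the training data itself, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def mrmr_rank(X: np.ndarray, y: np.ndarray) -> list:
    """Greedy max-relevance, min-redundancy ranking (correlation as an MI proxy)."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    redundancy = np.abs(np.corrcoef(X, rowvar=False))
    ranked = [int(np.argmax(relevance))]
    remaining = set(range(n_features)) - set(ranked)
    while remaining:
        candidates = sorted(remaining)
        scores = [relevance[j] - redundancy[j, ranked].mean() for j in candidates]
        best = candidates[int(np.argmax(scores))]
        ranked.append(best)
        remaining.remove(best)
    return ranked


def centroid_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Toy wrapper score: nearest-centroid accuracy on the same data (no hold-out)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dist.argmin(axis=1)] == y).mean())


def hill_climb(X: np.ndarray, y: np.ndarray, ranked: list,
               score=centroid_accuracy) -> list:
    """Grow the ranked prefix one feature at a time; stop when the score stops rising."""
    best_k, best_score = 1, score(X[:, ranked[:1]], y)
    for k in range(2, len(ranked) + 1):
        current = score(X[:, ranked[:k]], y)
        if current <= best_score:
            break
        best_k, best_score = k, current
    return ranked[:best_k]


# Synthetic data: feature 0 is informative, feature 1 is a noisy copy, feature 2 is noise.
rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=400)
f0 = labels + 0.1 * rng.standard_normal(400)
f1 = f0 + 0.5 * rng.standard_normal(400)
f2 = rng.standard_normal(400)
data = np.column_stack([f0, f1, f2])

ranking = mrmr_rank(data, labels)
selected = hill_climb(data, labels, ranking)
```

A real wrapper would score each candidate subset with the overlap detector itself on held-out data rather than with this toy classifier.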
The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in
the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the
detection of overlapping speech, especially on data originating from a single recording site.
In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments
are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of
the diarization algorithm. During the system development it was discovered that it is favorable to do an independent
optimization of overlap exclusion and labeling with respect to the overlap detection system.
We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments
with NIST RT data show DER improvement on the RT '09 meeting recordings as well.
The addition of beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the
clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the
overlap exclusion behavior reveals large differences in improvement between individual meeting recordings as well as between
various settings of the overlap detection operating point. However, high performance variability across different recordings is
also typical of the baseline diarization system, without any overlap handling.