Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords
In this paper, we present a latent variable (LV) framework to identify all
the speakers and their keywords given a multi-speaker mixture signal. We
introduce two separate LVs to denote active speakers and the keywords uttered.
The dependency of a spoken keyword on the speaker is modeled through a
conditional probability mass function. The distribution of the mixture signal
is expressed in terms of the LV mass functions and speaker-specific-keyword
models. The proposed framework admits stochastic models, representing the
probability density function of the observation vectors given that a particular
speaker uttered a specific keyword, as speaker-specific-keyword models. The LV
mass functions are estimated in a Maximum Likelihood framework using the
Expectation Maximization (EM) algorithm. The active speakers and their keywords
are detected as modes of the joint distribution of the two LVs. In mixture
signals containing two speakers uttering keywords simultaneously, the
proposed framework achieves an accuracy of 82% for detecting both the speakers
and their respective keywords, using Student's-t mixture models as
speaker-specific-keyword models. Comment: 6 pages, 2 figures. Submitted to: IEEE Signal Processing Letter
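The estimation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the speaker-specific-keyword models are already trained and supply a per-frame likelihood array `lik[t, s, k]`, and that each frame is explained by a single (speaker, keyword) pair. The names `em_joint_pmf` and `top_pairs` are hypothetical.

```python
import numpy as np


def em_joint_pmf(lik, n_iter=50):
    """Estimate the joint mass function P(speaker, keyword) by EM.

    lik[t, s, k] is assumed to hold p(x_t | speaker s, keyword k) from
    fixed, pre-trained models (an assumption of this sketch).
    """
    _, n_speakers, n_keywords = lik.shape
    pmf = np.full((n_speakers, n_keywords), 1.0 / (n_speakers * n_keywords))
    for _ in range(n_iter):
        # E-step: posterior of each (speaker, keyword) pair for every frame.
        post = lik * pmf
        post /= post.sum(axis=(1, 2), keepdims=True)
        # M-step: the new mass function is the average posterior over frames.
        pmf = post.mean(axis=0)
    return pmf


def top_pairs(pmf, n=2):
    """Return the n (speaker, keyword) index pairs with the largest mass."""
    flat = np.argsort(pmf, axis=None)[::-1][:n]
    return [tuple(int(i) for i in np.unravel_index(f, pmf.shape)) for f in flat]
```

With two active speakers, calling `top_pairs(em_joint_pmf(lik), 2)` returns the two most probable pairs, which corresponds to reading off the modes of the joint distribution.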
Scale Selective Extended Local Binary Pattern for Texture Classification
In this paper, we propose a new texture descriptor, scale selective extended
local binary pattern (SSELBP), to characterize texture images with scale
variations. We first utilize multi-scale extended local binary patterns (ELBP)
with rotation-invariant and uniform mappings to capture robust local micro- and
macro-features. Then, we build a scale space using Gaussian filters and
calculate the histogram of multi-scale ELBPs for the image at each scale.
Finally, we select the maximum values from the corresponding bins of
multi-scale ELBP histograms at different scales as scale-invariant features. A
comprehensive evaluation on public texture databases (KTH-TIPS and UMD) shows
that the proposed SSELBP has high accuracy comparable to state-of-the-art
texture descriptors on gray-scale-, rotation-, and scale-invariant texture
classification but uses only one-third of the feature dimension. Comment: IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 201
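The scale-selection step (taking the per-bin maximum of histograms computed over a Gaussian scale space) can be sketched as follows. This is a simplified illustration, not the SSELBP descriptor itself: it substitutes a basic 8-neighbour LBP for the multi-scale ELBP with rotation-invariant uniform mapping, and the sigma values and function names are assumptions.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes."""
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= centre).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()


def scale_selective_features(img, sigmas=(1.0, 1.5, 2.0)):
    """Per-bin maximum of LBP histograms over the original and blurred images."""
    hists = [lbp_histogram(img)]
    hists += [lbp_histogram(gaussian_blur(img, s)) for s in sigmas]
    return np.max(np.stack(hists), axis=0)
```

Because the maximum is taken bin by bin, the feature length stays at one histogram (256 bins here) regardless of how many scales are used.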
Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement
This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and the mitigation of the processing artefact known as ‘musical noise’ or ‘musical tones’. The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on a spectral amplitude estimation, (b) formant tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames. The HNM parameters for the excitation signal comprise the voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes and the variance of the noise component of the excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. For each speech frame, several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters. The proposed method is used to deconstruct noisy speech, de-noise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages.
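The Kalman filtering of a formant trajectory across frames, stage (c) above, can be illustrated with a minimal scalar sketch. It assumes a random-walk state model for the formant frequency. The variance values and the function name `kalman_filter_track` are illustrative assumptions and are not taken from the paper.

```python
def kalman_filter_track(obs, process_var=25.0, obs_var=1.0e4):
    """Filter a noisy per-frame formant track (in Hz) with a scalar Kalman filter.

    Assumed state model: f[t] = f[t-1] + w, with w ~ N(0, process_var).
    Assumed observation model: z[t] = f[t] + v, with v ~ N(0, obs_var).
    """
    estimate = float(obs[0])
    variance = obs_var
    track = [estimate]
    for z in obs[1:]:
        # Predict: a random walk keeps the mean and inflates the variance.
        predicted_var = variance + process_var
        # Update: blend the prediction and the new observation via the gain.
        gain = predicted_var / (predicted_var + obs_var)
        estimate = estimate + gain * (z - estimate)
        variance = (1.0 - gain) * predicted_var
        track.append(estimate)
    return track
```

A small `process_var` relative to `obs_var` gives a smoother trajectory that reacts more slowly to genuine formant movement. The same recursion could be applied to each harmonic amplitude track of the excitation.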
A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction
Picking up objects requested by a human user is a common task in human-robot
interaction. When multiple objects match the user's verbal description, the
robot needs to clarify which object the user is referring to before executing
the action. Previous research has focused on perceiving the user's multimodal
behaviour to complement verbal commands or minimising the number of follow-up
questions to reduce task time. In this paper, we propose a system for reference
disambiguation based on visualisation and compare three methods to disambiguate
natural language instructions. In a controlled experiment with a YuMi robot, we
investigated real-time augmentations of the workspace in three conditions --
mixed reality, augmented reality, and a monitor as the baseline -- using
objective measures such as time and accuracy, and subjective measures like
engagement, immersion, and display interference. Significant differences were
found in accuracy and engagement between the conditions, but no differences
were found in task time. Despite the higher error rates in the mixed reality
condition, participants found that modality more engaging than the other two,
but overall showed preference for the augmented reality condition over the
monitor and mixed reality conditions.
Deep Feature-based Face Detection on Mobile Devices
We propose a deep feature-based face detector for mobile devices to detect
the user's face captured by the front-facing camera. The proposed method is able to
detect faces in images containing extreme pose and illumination variations as
well as partial faces. The main challenge in developing deep feature-based
algorithms for mobile devices is the constrained nature of the mobile platform
and the non-availability of CUDA-enabled GPUs on such devices. Our
implementation takes into account the special nature of the images captured by
the front-facing camera of mobile devices and exploits the GPUs present in
mobile devices without CUDA-based frameworks to meet these challenges. Comment: ISBA 201