Search CORE

8,271 research outputs found

Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords

Author: Sreenivas Thippur V.
Sundar Harshavardhan
Publication venue
Publication date: 29/04/2015
Field of study

In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. The proposed framework admits stochastic models, representing the probability density function of the observation vectors given that a particular speaker uttered a specific keyword, as speaker-specific-keyword models. The LV mass functions are estimated in a Maximum Likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. In mixture signals, containing two speakers uttering the keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords, using Student's-t mixture models as speaker-specific-keyword models.Comment: 6 pages, 2 figures Submitted to : IEEE Signal Processing Letter

arXiv.org e-Print Archive

Crossref

Scale Selective Extended Local Binary Pattern for Texture Classification

Author: AlRegib Ghassan
Hu Yuting
Long Zhiling
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/12/2018
Field of study

In this paper, we propose a new texture descriptor, scale selective extended local binary pattern (SSELBP), to characterize texture images with scale variations. We first utilize multi-scale extended local binary patterns (ELBP) with rotation-invariant and uniform mappings to capture robust local micro- and macro-features. Then, we build a scale space using Gaussian filters and calculate the histogram of multi-scale ELBPs for the image at each scale. Finally, we select the maximum values from the corresponding bins of multi-scale ELBP histograms at different scales as scale-invariant features. A comprehensive evaluation on public texture databases (KTH-TIPS and UMD) shows that the proposed SSELBP has high accuracy comparable to state-of-the-art texture descriptors on gray-scale-, rotation-, and scale-invariant texture classification but uses only one-third of the feature dimension.Comment: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 201

arXiv.org e-Print Archive

Crossref

Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

Author: Ben Milner
Boll
Chen
Deller
Ephraim
Ephraim
Ephraim
Ephraim
Esfandiar Zavarehei
Friedman
Griffin
Hansen
Ioannis Andrianakis
Jonathan Darch
Kalman
Lim
Lim
Paul White
Qin Yan
Rentzos
Saeed Vaseghi
Sameti
Secrest
Seltzer
Stylianou
Stylianou
Tucker
Turunen
Vaseghi
Weber
Yan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008
Field of study

This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and a mitigation of processing artefact known as the ‘musical noise’ or ‘musical tones’.The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on a spectral amplitude estimation, (b) formant-tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames.The HNM parameters for the excitation signal comprise; voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes and the variance of the noise component of excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal to noise ratios (SNRs) at the harmonics. For each speech frame several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters.The proposed method is used to deconstruct noisy speech, de-noise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages

Crossref

Southampton (e-Prints Soton)

University of East Anglia digital repository

A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction

Author: Gustafson Joakim
Karaoguz Hakan
Kontogiorgos Dimosthenis
Kragic Danica
Leite Iolanda
Nykvist Olov
Sibirtseva Elena
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/01/2018
Field of study

Picking up objects requested by a human user is a common task in human-robot interaction. When multiple objects match the user's verbal description, the robot needs to clarify which object the user is referring to before executing the action. Previous research has focused on perceiving user's multimodal behaviour to complement verbal commands or minimising the number of follow up questions to reduce task time. In this paper, we propose a system for reference disambiguation based on visualisation and compare three methods to disambiguate natural language instructions. In a controlled experiment with a YuMi robot, we investigated real-time augmentations of the workspace in three conditions -- mixed reality, augmented reality, and a monitor as the baseline -- using objective measures such as time and accuracy, and subjective measures like engagement, immersion, and display interference. Significant differences were found in accuracy and engagement between the conditions, but no differences were found in task time. Despite the higher error rates in the mixed reality condition, participants found that modality more engaging than the other two, but overall showed preference for the augmented reality condition over the monitor and mixed reality conditions

arXiv.org e-Print Archive

Crossref

Deep Feature-based Face Detection on Mobile Devices

Author: Chellappa Rama
Patel Vishal M.
Sarkar Sayantan
Publication venue
Publication date: 15/02/2016
Field of study

We propose a deep feature-based face detector for mobile devices to detect user's face acquired by the front facing camera. The proposed method is able to detect faces in images containing extreme pose and illumination variations as well as partial faces. The main challenge in developing deep feature-based algorithms for mobile devices is the constrained nature of the mobile platform and the non-availability of CUDA enabled GPUs on such devices. Our implementation takes into account the special nature of the images captured by the front-facing camera of mobile devices and exploits the GPUs present in mobile devices without CUDA-based frameorks, to meet these challenges.Comment: ISBA 201

arXiv.org e-Print Archive

Crossref