Search CORE

6,020 research outputs found

Robust Speech Detection for Noisy Environments

Author: Hernández Luis A.
San Segundo Hernández Rubén
Varela Serrano Oscar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

This paper presents a robust voice activity detector (VAD) based on hidden Markov models (HMM) to improve speech recognition systems in stationary and non-stationary noise environments: inside motor vehicles (like cars or planes) or inside buildings close to high traffic places (like in a control tower for air traffic control (ATC)). In these environments, there is a high stationary noise level caused by vehicle motors and additionally, there could be people speaking at certain distance from the main speaker producing non-stationary noise. The VAD presented in this paper is characterized by a new front-end and a noise level adaptation process that increases significantly the VAD robustness for different signal to noise ratios (SNRs). The feature vector used by the VAD includes the most relevant Mel Frequency Cepstral Coefficients (MFCC), normalized log energy and delta log energy. The proposed VAD has been evaluated and compared to other well-known VADs using three databases containing different noise conditions: speech in clean environments (SNRs mayor que 20 dB), speech recorded in stationary noise environments (inside or close to motor vehicles), and finally, speech in non stationary environments (including noise from bars, television and far-field speakers). In the three cases, the detection error obtained with the proposed VAD is the lowest for all SNRs compared to Acero¿s VAD (reference of this work) and other well-known VADs like AMR, AURORA or G729 annex b

Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models

Author: Martinelli E
Mencattini A
Natale CD
Ringeval F
Schuller B
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/02/2016
Field of study

Automatic emotion recognition from speech has been recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity found in the definition of a gold standard from a pool of raters and the issue of data scarcity in training models. In this work, we introduce a novel emotion recognition system, based on ensemble of single-speaker-regression-models (SSRMs). The estimation of emotion is provided by combining a subset of the initial pool of SSRMs selecting those that are most concordance among them. The proposed approach allows the addition or removal of speakers from the ensemble without the necessity to re-build the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility assured by the modular architecture, and the promising results obtained on the RECOLA database highlight the potential implications of the proposed method in a real-life scenario and in particular in WEB-based applications

OPUS Augsburg

Hal - Université Grenoble Alpes

Spiral - Imperial College Digital Repository

A hidden Markov model-based acoustic cicada detector for crowdsourced smartphone biodiversity monitoring

Author: Merrett Geoff
Parson Oliver
Rogers Alex
Zilli Davide
Publication venue
Publication date: 01/08/2013
Field of study

In recent years, the field of computational sustainability has striven to apply artificial intelligence techniques to solve ecological and environmental problems. In ecology, a key issue for the safeguarding of our planet is the monitoring of biodiversity. Automated acoustic recognition of species aims to provide a cost-effective method for biodiversity monitoring. This is particularly appealing for detecting endangered animals with a distinctive call, such as the New Forest cicada. To this end, we pursue a crowdsourcing approach, whereby the millions of visitors to the New Forest, where this insect was historically found, will help to monitor its presence by means of a smartphone app that can detect its mating call. Existing research in the field of acoustic insect detection has typically focused upon the classification of recordings collected from fixed field microphones. Such approaches segment a lengthy audio recording into individual segments of insect activity, which are independently classified using cepstral coefficients extracted from the recording as features. This paper reports on a contrasting approach, whereby we use crowdsourcing to collect recordings via a smartphone app, and present an immediate feedback to the users as to whether an insect has been found. Our classification approach does not remove silent parts of the recording via segmentation, but instead uses the temporal patterns throughout each recording to classify the insects present. We show that our approach can successfully discriminate between the call of the New Forest cicada and similar insects found in the New Forest, and is robust to common types of environment noise. A large scale trial deployment of our smartphone app collected over 6000 reports of insect activity from over 1000 users. Despite the cicada not having been rediscovered in the New Forest, the effectiveness of this approach was confirmed for both the detection algorithm, which successfully identified the same cicada through the app in countries where the same species is still present, and of the crowdsourcing methodology, which collected a vast number of recordings and involved thousands of contributors.</p

Southampton (e-Prints Soton)

Oxford University Research Archive

Dublin City University video track experiments for TREC 2002

Author: Browne Paul
Czirjék Csaba
Gurrin Cathal
Jarina Roman
Lee Hyowon
Marlow Seán
McDonald Kieran
Murphy Noel
O'Connor Noel E.
Smeaton Alan F.
Ye Jiamin
Publication venue: 'University of Aden - Faculty of Economics and Administration'
Publication date: 01/11/2002
Field of study

Dublin City University participated in the Feature Extraction task and the Search task of the TREC-2002 Video Track. In the Feature Extraction task, we submitted 3 features: Face, Speech, and Music. In the Search task, we developed an interactive video retrieval system, which incorporated the 40 hours of the video search test collection and supported user searching using our own feature extraction data along with the donated feature data and ASR transcript from other Video Track groups. This video retrieval system allows a user to specify a query based on the 10 features and ASR transcript, and the query result is a ranked list of videos that can be further browsed at the shot level. To evaluate the usefulness of the feature-based query, we have developed a second system interface that provides only ASR transcript-based querying, and we conducted an experiment with 12 test users to compare these 2 systems. Results were submitted to NIST and we are currently conducting further analysis of user performance with these 2 systems

Voice activity detection using audio, video and depth information

Author: Θερμός Σπυρίδων Κ.
Publication venue
Publication date: 01/01/2015
Field of study