I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

Author: Batliner A
Hantke S
Kurle R
Mousa AELD
Ringeval F
Schuller B
Weninger F
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 14/04/2016

We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features and by higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
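The evaluation protocol named in this abstract can be illustrated with a minimal sketch. It assumes that "average recall" means the unweighted mean of per-class recalls, which the abstract does not spell out. The function names and the toy speaker and label lists are hypothetical, and no classifier is trained here; only the speaker-wise splitting and the metric are shown.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker.

    Every utterance of the held-out speaker goes to the test fold, so no
    speaker appears in both the training and the test data.
    """
    for held_out in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data (hypothetical): three speakers with two utterances each.
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(leave_one_speaker_out(speakers))

# Toy labels (hypothetical): recall is 2/3 for "eat" and 1/1 for "no".
uar = unweighted_average_recall(
    ["eat", "eat", "eat", "no"],
    ["eat", "eat", "no", "no"],
)
```

In a full pipeline, a classifier would be fitted on the `train` indices of each fold and scored on the `test` indices; splitting by speaker rather than by utterance keeps speaker identity from leaking into the score.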

Directory of Open Access Journals

Spiral - Imperial College Digital Repository

Blind Normalization of Speech From Different Channels

Author: Boll S. F.
Cox R. V.
David N. Levin
Gales M. J. F.
Levin D. N.
Roweis S. T.
Tenenbaum J. B.
Young S.
Publication venue: 'Acoustical Society of America (ASA)'
Publication date: 02/04/2002

We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. This is done by blindly rescaling the cepstral time series by a non-linear function, with the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel-independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberations were required (unlike spectral subtraction). Comment: 25 pages, 7 figures
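For orientation, the baseline that this abstract compares against, cepstral mean normalization, can be sketched in a few lines. This is not the paper's non-linear rescaling method. It relies on the standard observation that a noise-free, time-invariant linear channel adds an approximately constant vector to every cepstral frame; the function name and the toy frames are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over time from every cepstral frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across all frames of the utterance.
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


# Toy cepstra (hypothetical): three frames with two coefficients each.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Simulated channel: a constant offset added to every frame.
offset = [0.5, -1.0]
distorted = [[c + o for c, o in zip(frame, offset)] for frame in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_distorted = cepstral_mean_normalize(distorted)
```

After normalization the clean and distorted sequences coincide, because the constant offset shifts the mean by the same amount. Additive noise and other non-linear distortions are not removed this way, which is the gap the abstract's method targets.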

arXiv.org e-Print Archive

Crossref

The Speech-Language Interface in the Spoken Language Translator

Author: Carter David
Rayner Manny
Publication date: 01/01/1994

The Spoken Language Translator is a prototype for practically useful systems capable of translating continuous spoken language within restricted domains. The prototype system translates air travel (ATIS) queries from spoken English to spoken Swedish and to French. It is constructed, with as few modifications as possible, from existing pieces of speech and language processing software. The speech recognizer and language understander are connected by a fairly conventional pipelined N-best interface. This paper focuses on the ways in which the language processor makes intelligent use of the sentence hypotheses delivered by the recognizer. These ways include (1) producing modified hypotheses to reflect the possible presence of repairs in the uttered word sequence; (2) fast parsing with a version of the grammar automatically specialized to the more frequent constructions in the training corpus; and (3) allowing syntactic and semantic factors to interact with acoustic ones in the choice of a meaning structure for translation, so that the acoustically preferred hypothesis is not always selected even if it is within linguistic coverage. Comment: 9 pages, LaTeX. Published: Proceedings of TWLT-8, December 199

arXiv.org e-Print Archive

CiteSeerX

Machine Understanding of Human Behavior

Author: Huang Thomas
Nijholt Anton
Pantic Maja
Pentland Alex
Publication venue: University of Twente, Centre for Telematics and Information Technology (CTIT)
Publication date: 01/01/2007

A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next generation computing, which we will call human computing, should be about anticipatory user interfaces that should be human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far we are from enabling computers to understand human behavior.

University of Twente Research Information

The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language

Author: Assoc Int Speech Commun
Baird Alice
Batliner Anton
Burgoon Judee K
Coutinho Eduardo
Elkins Aaron
Evanini Keelan
Hirschberg Julia
Schuller Bjoern
Steidl Stefan
Zhang Yue
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2016

University of Liverpool Repository

OPUS Augsburg

English Speaking and Listening Assessment Project - Baseline. Bangladesh

Author: Cullen Jane
Mathew Rama
McCormick Robert
Payler Jane
Power Tom
Woodward Clare
Publication venue: The British Council
Publication date: 01/03/2016

This study seeks to understand the current practices of English Language Teaching (ELT) and assessment at the secondary school level in Bangladesh, with specific focus on speaking and listening skills. The study draws upon prior research on general ELT practices, English language proficiencies and exploration of assessment practices, in Bangladesh. The study aims to provide some baseline evidence about the way speaking and listening are taught currently, whether these skills are assessed informally, and if so, how this is done. The study addresses two research questions: 1. How ready are English Language Teachers in government-funded secondary schools in Bangladesh to implement continuous assessment of speaking and listening skills? 2. Are there identifiable contextual factors that promote or inhibit the development of effective assessment of listening and speaking in English? These were assessed with a mixed-methods design, drawing upon prior quantitative research and new qualitative fieldwork in 22 secondary schools across three divisions (Dhaka, Sylhet and Chittagong). At the suggestion of DESHE, the sample also included 2 of the ‘highest performing’ schools from Dhaka city. There are some signs of readiness for effective school-based assessment of speaking and listening skills: teachers, students and community members alike are enthusiastic for a greater emphasis on speaking and listening skills, which are highly valued. Teachers and students are now speaking mostly in English and most teachers also attempt to organise some student talk in pairs or groups, at least briefly. Yet several factors limit students’ opportunities to develop skills at the level of CEFR A1 or A2. Firstly, teachers generally do not yet have sufficient confidence, understanding or competence to introduce effective teaching or assessment practices at CEFR A1-A2. In English lessons, students generally make short, predictable utterances or recite texts. 
No lessons were observed in which students had an opportunity to develop or demonstrate language functions at CEFR A1-A2. Secondly, teachers acknowledge a washback effect from final examinations, agreeing that inclusion of marks for speaking and listening would ensure teachers and students took these skills more seriously during lesson time. Thirdly, almost two thirds of secondary students achieve no CEFR level, suggesting many enter and some leave secondary education with limited communicative English language skills. One possible contributor to this may be that almost half (43%) of the ELT population are only at the target level for students (CEFR A2) themselves, whilst approximately one in ten teachers (12%) do not achieve the student target (being at A1 or below). Fourthly, the Bangladesh curriculum student competency statements are generic and broad, providing little support to the development of teaching or assessment practices. The introduction and development of effective teaching and assessment strategies at CEFR A1-A2 requires a profound shift in teachers’ understanding and practice. We recommend that: 1. Future sector-wide programmes provide sustained support to develop teachers' competence in teaching and assessment of speaking and listening skills at CEFR A1-A2; 2. Options are explored for introducing assessment of these skills in terminal examinations; 3. Mechanisms are identified for improving teachers' own speaking and listening skills; 4. Student competency statements within the Bangladesh curriculum are revised to provide more guidance to teachers and students.

Open Research Online (The Open University)

Visual to Sound: Generating Natural Sound for Videos in the Wild

Author: Berg Tamara L.
Bui Trung
Fang Chen
Wang Zhaowen
Zhou Yipin
Publication date: 01/06/2018

As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs. Comment: Project page: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm

arXiv.org e-Print Archive

Crossref