I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and both read and spontaneous speech; the database is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We then propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that binary prediction of the eating condition (i.e., eating or not eating) can easily be solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% coefficient of determination.
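A minimal Python sketch of the leave-one-speaker-out SVM evaluation described above, using scikit-learn. The feature matrix, labels, and speaker IDs below are random placeholders rather than the actual iHEARu-EAT features, and "average recall" is computed as the unweighted average recall (UAR) over the seven classes.

```python
# Sketch of leave-one-speaker-out SVM classification with UAR scoring.
# All data below is a random placeholder, not the iHEARu-EAT features.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1600, 128))           # placeholder acoustic features
y = rng.integers(0, 7, size=1600)          # 6 food types + "not eating"
speakers = rng.integers(0, 30, size=1600)  # 30 subjects

y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    clf.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = clf.predict(X[test_idx])

# "Average recall" here is the unweighted average recall (UAR) over classes.
uar = recall_score(y, y_pred, average="macro")
print(f"UAR: {uar:.3f}")
```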
Blind Normalization of Speech From Different Channels
We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. This is done by blindly rescaling the cepstral time series by a non-linear function, with the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel-independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberations were required (unlike spectral subtraction).
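For illustration only (this is not the paper's actual scale function): one simple blind, non-linear rescaling with the stated invariance property maps each cepstral coefficient through its empirical CDF, which is unchanged by any monotonic, time-independent, invertible distortion of that coefficient. A small Python sketch:

```python
# Illustrative sketch: rescale each cepstral dimension through its empirical
# CDF (histogram equalization). The result is invariant under any monotonic
# invertible distortion of that dimension, which conveys the flavor of the
# channel-invariant representation; the paper derives its scale function
# differently.
import numpy as np

def blind_rescale(cepstra: np.ndarray) -> np.ndarray:
    """Map each cepstral dimension to its empirical CDF value in (0, 1).

    cepstra: array of shape (num_frames, num_coeffs) from one channel.
    """
    ranks = cepstra.argsort(axis=0).argsort(axis=0)  # per-dimension ranks
    return (ranks + 0.5) / cepstra.shape[0]

# A monotonic channel distortion leaves the rescaled series unchanged:
x = np.random.randn(1000, 13)
distorted = 2.0 * x**3 + 5.0  # invertible, time-independent, monotonic
assert np.allclose(blind_rescale(x), blind_rescale(distorted))
```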
The Speech-Language Interface in the Spoken Language Translator
The Spoken Language Translator is a prototype for practically useful systems capable of translating continuous spoken language within restricted domains. The prototype system translates air travel (ATIS) queries from spoken English to spoken Swedish and to French. It is constructed, with as few modifications as possible, from existing pieces of speech and language processing software. The speech recognizer and language understander are connected by a fairly conventional pipelined N-best interface. This paper focuses on the ways in which the language processor makes intelligent use of the sentence hypotheses delivered by the recognizer. These include (1) producing modified hypotheses to reflect the possible presence of repairs in the uttered word sequence; (2) fast parsing with a version of the grammar automatically specialized to the more frequent constructions in the training corpus; and (3) allowing syntactic and semantic factors to interact with acoustic ones in the choice of a meaning structure for translation, so that the acoustically preferred hypothesis is not always selected even if it is within linguistic coverage.
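A hedged Python sketch of the kind of N-best rescoring that point (3) describes: combining the recognizer's acoustic score with syntactic and semantic scores so that the acoustically preferred hypothesis need not win. The Hypothesis structure, scores, and weights here are illustrative placeholders, not the SLT system's actual interface.

```python
# Sketch of N-best hypothesis selection mixing acoustic and linguistic
# evidence. All fields and weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: str
    acoustic_score: float   # log-likelihood from the recognizer
    syntax_score: float     # e.g., within-grammar-coverage preference
    semantic_score: float   # e.g., plausibility of the meaning structure

def select_hypothesis(nbest, w_ac=1.0, w_syn=0.5, w_sem=0.5):
    """Pick the hypothesis maximizing the weighted combined score."""
    return max(nbest, key=lambda h: (w_ac * h.acoustic_score
                                     + w_syn * h.syntax_score
                                     + w_sem * h.semantic_score))

nbest = [
    Hypothesis("show me flights to boston", -10.2, 1.0, 1.0),
    Hypothesis("show me flight to boston",  -10.0, 0.2, 0.5),  # acoustically best
]
print(select_hypothesis(nbest).words)  # linguistic scores override acoustics
```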
Machine Understanding of Human Behavior
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next-generation computing, which we will call human computing, should be about anticipatory user interfaces that are human-centered: built for humans, based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions, including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, from enabling computers to understand human behavior.
The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language
English Speaking and Listening Assessment Project - Baseline. Bangladesh
This study seeks to understand current practices of English Language Teaching (ELT) and assessment at the secondary school level in Bangladesh, with a specific focus on speaking and listening skills. It draws upon prior research in Bangladesh on general ELT practices, English language proficiencies, and assessment practices. The study aims to provide baseline evidence about how speaking and listening are currently taught, whether these skills are assessed informally, and if so, how this is done. The study addresses two research questions:
1. How ready are English Language Teachers in government-funded secondary schools in Bangladesh to implement continuous assessment of speaking and listening skills?
2. Are there identifiable contextual factors that promote or inhibit the development of effective assessment of listening and speaking in English?
These questions were addressed with a mixed-methods design, drawing upon prior quantitative research and new qualitative fieldwork in 22 secondary schools across three divisions (Dhaka, Sylhet and Chittagong). At the suggestion of DESHE, the sample also included two of the ‘highest performing’ schools from Dhaka city.
There are some signs of readiness for effective school-based assessment of speaking and listening skills: teachers, students and community members alike are enthusiastic about a greater emphasis on speaking and listening skills, which are highly valued. Teachers and students are now speaking mostly in English, and most teachers also attempt to organise some student talk in pairs or groups, at least briefly. Yet several factors limit students’ opportunities to develop skills at the level of CEFR A1 or A2.
Firstly, teachers generally do not yet have sufficient confidence, understanding or competence to introduce effective teaching or assessment practices at CEFR A1-A2. In English lessons, students generally make short, predictable utterances or recite texts; no lessons were observed in which students had an opportunity to develop or demonstrate language functions at CEFR A1-A2. Secondly, teachers acknowledge a washback effect from final examinations, agreeing that the inclusion of marks for speaking and listening would ensure teachers and students took these skills more seriously during lesson time. Thirdly, almost two thirds of secondary students achieve no CEFR level, suggesting that many enter, and some leave, secondary education with limited communicative English language skills. One possible contributor is that almost half (43%) of the ELT population are themselves only at the target level for students (CEFR A2), whilst approximately one in ten teachers (12%) do not achieve even the student target (being at A1 or below). Fourthly, the student competency statements in the Bangladesh curriculum are generic and broad, providing little support for the development of teaching or assessment practices.
The introduction and development of effective teaching and assessment strategies at CEFR A1-A2 requires a profound shift in teachers’ understanding and practice. We recommend that:
1. Future sector-wide programmes provide sustained support to develop teachers' competence in teaching and assessing speaking and listening skills at CEFR A1-A2
2. Options are explored for introducing assessment of these skills in terminal examinations
3. Mechanisms are identified for improving teachers' own speaking and listening skills
4. Student competency statements within the Bangladesh curriculum are revised to provide more guidance to teachers and students
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs. Project page: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
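As a purely schematic illustration of the task setup (not the authors' architecture): a toy PyTorch model that conditions on per-frame visual features and emits a fixed number of raw waveform samples per frame. All shapes, layer choices, and the frame rate below are assumptions.

```python
# Schematic sketch of frame-conditioned waveform generation.
# This is an assumed toy architecture, not the paper's model.
import torch
import torch.nn as nn

class Frames2Wave(nn.Module):
    def __init__(self, feat_dim=512, samples_per_frame=735):  # ~16 kHz at ~21.8 fps (assumed)
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)   # temporal context across frames
        self.head = nn.Linear(256, samples_per_frame)        # samples emitted per frame

    def forward(self, frame_feats):            # (B, T, feat_dim) visual features
        h, _ = self.rnn(frame_feats)           # (B, T, 256)
        chunks = torch.tanh(self.head(h))      # (B, T, samples_per_frame), in [-1, 1]
        return chunks.flatten(1)               # (B, T * samples_per_frame) raw waveform

model = Frames2Wave()
video = torch.randn(2, 45, 512)                # 2 clips, 45 frames of placeholder features
waveform = model(video)                        # (2, 33075) waveform samples
print(waveform.shape)
```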