The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility may be compromised, at least for some listeners, by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, in which the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings on human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. The review thus provides a roadmap for future work on improving the robustness of speech output.
Developmental and cultural factors of audiovisual speech perception in noise
The aim of this project is two-fold: 1) to investigate developmental differences in intelligibility gains from visual cues in speech perception in noise, and 2) to examine how different types of maskers modulate visual enhancement across age groups. A secondary aim is to investigate whether bilingualism differentially modulates audiovisual integration during speech-in-noise tasks. To that end, child and adult, monolingual and bilingual participants completed speech-perception-in-noise tasks with three within-subject variables: (1) masker type: pink noise or two-talker babble; (2) modality: audio-only (AO) or audiovisual (AV); and (3) signal-to-noise ratio (SNR): 0, -4, -8, -12, and -16 dB. The findings revealed that, although both children and adults benefited from visual cues in speech-in-noise tasks, adults showed greater benefit at lower SNRs. Moreover, although monolingual and bilingual children performed comparably across all conditions, monolingual adults outperformed simultaneous bilingual adults. These results may indicate that the divergent use of visual cues in speech perception between bilingual and monolingual speakers emerges later in development.
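The SNR manipulation above amounts to a stimulus-mixing step: the masker is rescaled so that the speech-to-masker level difference equals the target SNR before the two are summed. A minimal sketch of that step is shown below; the helper name and its details are illustrative assumptions, since the abstract does not describe the actual stimulus pipeline.

```python
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    """Scale the masker so the speech-to-masker ratio equals snr_db, then mix.
    Hypothetical helper; illustrates the SNR conditions (0 to -16 dB), not the
    study's actual stimulus-generation code."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    # Required masker RMS for the target SNR: snr_db = 20*log10(rms_speech / rms_masker)
    target_masker_rms = rms(speech) / (10 ** (snr_db / 20))
    scaled_masker = masker * (target_masker_rms / rms(masker))
    return speech + scaled_masker
```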
Navigating the bilingual cocktail party: Interference from background speakers in listeners with varying L1/L2 proficiency
Cocktail party environments require listeners to tune in to a target voice while ignoring surrounding speakers (maskers), which can present unique challenges for bilingual listeners. Our study recruited English-French bilinguals to listen to a male target speaking French or English, masked by two female voices speaking French, English, or Tamil, or by speech-shaped noise. Listeners performed better with first-language (L1) than second-language (L2) targets, and relative L1/L2 proficiency acted like a categorical rather than a continuous variable with respect to speech reception threshold (SRT) averaged over maskers. Further, listeners struggled most with L1 maskers and least with Tamil maskers. The results suggest that balanced bilinguals have a slight disadvantage with L1 targets but compensate with a larger advantage with L2 targets, compared to unbalanced bilinguals. This positive net result supports the idea that being a balanced bilingual is helpful in speech-on-speech perception tasks in environments that offer substantial exposure to the L2.
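Speech reception thresholds like those compared above are commonly estimated with an adaptive track that converges on the SNR yielding about 50% correct responses. The one-up/one-down sketch below illustrates the general idea only; the abstract does not state which procedure this study used, and the function name and parameters are assumptions.

```python
import numpy as np

def staircase_srt(respond, start_db=0.0, step_db=2.0, n_trials=30):
    """One-up/one-down adaptive track converging on the SNR for ~50% correct,
    a common way to estimate a speech reception threshold (illustrative, not
    necessarily this study's procedure). `respond(snr_db)` returns True on a
    correct trial."""
    snr = start_db
    track = []
    for _ in range(n_trials):
        track.append(snr)
        # Correct response -> make the task harder; incorrect -> easier.
        snr = snr - step_db if respond(snr) else snr + step_db
    # Crude SRT estimate: mean SNR over the second half of the track.
    return float(np.mean(track[n_trials // 2:]))
```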
The effects of adverse conditions on speech recognition by non-native listeners: Electrophysiological and behavioural evidence
This thesis investigated speech recognition by native (L1) and non-native (L2) listeners (i.e., native English and Korean speakers) in diverse adverse conditions using electroencephalography (EEG) and behavioural measures. Study 1 investigated speech recognition in noise for read and casually produced, spontaneous speech using behavioural measures. The results showed that the detrimental effect of casual speech was greater for L2 than L1 listeners, demonstrating real-life L2 speech recognition problems caused by casual speech. Intelligibility also decreased when the accents of the talker and listener did not match, for casual as well as read speech. Study 2 set out to develop EEG methods to measure L2 speech processing difficulties for natural, continuous speech. This study thus examined neural entrainment to the amplitude envelope of speech (i.e., slow amplitude fluctuations in speech) while subjects listened to their L1, their L2, and a language they did not understand. The results demonstrated that neural entrainment to the speech envelope is not modulated by whether listeners understand the language, contrary to previously reported positive relationships between speech entrainment and intelligibility. Study 3 investigated speech processing in a two-talker situation using measures of neural entrainment and the N400, combined with a behavioural speech recognition task. L2 listeners had greater entrainment to target talkers than did L1 listeners, likely because their difficulty with L2 speech comprehension caused them to focus greater attention on the speech signal. L2 listeners also showed a greater degree of lexical processing (i.e., a larger N400) for highly predictable words than did native listeners, while native listeners showed greater lexical processing when listening to foreign-accented speech. The results suggest that the increased listening effort experienced by L2 listeners during speech recognition modulates their auditory and lexical processing.
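Entrainment analyses of the kind described in Study 2 require an estimate of the slow amplitude envelope of the speech signal. A minimal numpy-only sketch (full-wave rectification followed by moving-average smoothing) is given below; the window length is an illustrative assumption, and the thesis may well have used a different envelope-extraction method such as Hilbert-based filtering.

```python
import numpy as np

def amplitude_envelope(signal, fs, win_ms=50.0):
    """Crude speech-envelope estimate: full-wave rectification followed by a
    moving-average smoother that keeps only slow amplitude fluctuations.
    Parameters are illustrative, not the thesis's."""
    win = max(1, int(fs * win_ms / 1000))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")
```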
Exploiting correlogram structure for robust speech recognition with multiple speech sources
This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delays that correspond to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a 'speech fragment decoder', which applies 'missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches the model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported, which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
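The correlogram at the heart of the first stage is, per frequency channel, a short-time autocorrelation of the channel signal: peaks at lags equal to multiples of the pitch period produce the tree-like structures described above. The single-channel sketch below illustrates that principle only; the paper operates on the output of a cochlear filterbank, and the pitch-range bounds here are assumptions.

```python
import numpy as np

def frame_autocorrelation(frame, max_lag):
    """One row of a correlogram: normalised autocorrelation of a frame over
    lags 0..max_lag. In the paper this is computed per cochlear channel; here
    a single wideband frame illustrates the pitch-related peaks."""
    frame = frame - frame.mean()
    ac = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                   for k in range(max_lag + 1)])
    return ac / ac[0]

def pitch_lag(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch period as the autocorrelation peak within a
    plausible pitch range (illustrative bounds, not the paper's)."""
    max_lag = int(fs / fmin)
    ac = frame_autocorrelation(frame, max_lag)
    lo = int(fs / fmax)
    return lo + int(np.argmax(ac[lo:]))
```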
Cognitive performance in open-plan office acoustic simulations: Effects of room acoustics and semantics but not spatial separation of sound sources
The irrelevant sound effect (ISE) characterizes short-term memory performance impairment during irrelevant sounds relative to quiet. Irrelevant sound presentation in most laboratory-based ISE studies has been too limited to represent complex scenarios such as open-plan offices (OPOs), and few studies have considered serial recall of heard information. This paper investigates the ISE using an auditory-verbal serial recall task, wherein performance was evaluated for factors relevant to simulating OPO acoustics: the irrelevant sounds, including the semanticity of speech; reproduction methods over headphones; and room acoustics. Results (Experiments 1 and 2) show that the ISE was exhibited in most conditions with anechoic (irrelevant) nonspeech sounds with or without speech, but the effect was substantially larger with meaningful speech than with foreign speech, suggesting a semantic effect. Performance differences between diotic and binaural reproductions were not statistically robust, suggesting a limited role for spatial separation of sources. In Experiment 3, a statistically robust ISE was exhibited for binaural room acoustic conditions with mid-frequency reverberation times T30 = 0.4, 0.8, and 1.1 s, suggesting cognitive impairment regardless of sound absorption representative of OPOs. Performance differences between the T30 = 0.4 s condition and the T30 = 0.8 and 1.1 s conditions were statistically robust. This emphasizes the benefits for cognitive performance of increased sound absorption, reinforcing extant room acoustic design recommendations. Performance differences between T30 = 0.8 s and 1.1 s were not statistically robust. Collectively, these results suggest that certain findings from ISE studies with idiosyncratic acoustics may not translate well to complex OPO acoustic environments.
Intelligibility model optimisation approaches for speech pre-enhancement
The goal of improving the intelligibility of broadcast speech is being met by a new direction in speech enhancement: near-end intelligibility enhancement. In contrast to the conventional speech enhancement approach, which processes the corrupted speech at the receiver side of the communication chain, the near-end intelligibility enhancement approach pre-processes the clean speech at the transmitter side, i.e., before it is played into the environmental noise. In this work, we describe an optimisation-based approach to near-end intelligibility enhancement that uses models of speech intelligibility to improve the intelligibility of speech in noise.
This thesis first presents a survey of speech intelligibility models and how adverse acoustic conditions affect the intelligibility of speech. The purpose of this survey is to identify models that we can adopt in the design of the pre-enhancement system. Then, we investigate the strategies humans use to increase speech intelligibility in noise, and relate these human strategies to existing algorithms for near-end intelligibility enhancement. A closed-loop feedback approach to near-end intelligibility enhancement is then introduced. In this framework, speech modifications are guided by a model of intelligibility. For the closed-loop system to work, we develop a simple spectral modification strategy that modifies the first few coefficients of an auditory cepstral representation so as to maximise an intelligibility measure. We experiment with two contrasting measures of objective intelligibility. The first, as a baseline, is an audibility measure named 'glimpse proportion', computed as the proportion of the spectro-temporal representation of the speech signal that is free from masking.
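The glimpse proportion can be sketched directly from its definition: the fraction of spectro-temporal cells whose local SNR exceeds a threshold. In the sketch below, the power-spectrogram inputs and the 3 dB criterion follow common usage in the glimpsing literature and are assumptions, not necessarily the thesis's exact formulation.

```python
import numpy as np

def glimpse_proportion(speech_spec, noise_spec, threshold_db=3.0):
    """Glimpse proportion: fraction of spectro-temporal cells where speech
    energy exceeds noise energy by a local SNR threshold. Inputs are power
    spectrograms (freq x time); the 3 dB criterion is a common choice."""
    local_snr_db = 10 * np.log10(speech_spec / noise_spec)
    return float(np.mean(local_snr_db > threshold_db))
```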
We then propose a discriminative intelligibility model, building on the principles of missing data speech recognition, to model the likelihood of specific phonetic confusions that may occur when speech is presented in noise. The discriminative intelligibility measure is computed using a statistical model of speech from the speaker that is to be enhanced.
Interim results showed that, unlike the glimpse-proportion-based system, the discriminative system did not improve intelligibility. We investigated the reason and found that the discriminative system was unable to target phonetic confusions with a fixed spectral shaping. To address this, we introduce a time-varying spectral modification. We also propose to perform the optimisation on a segment-by-segment basis, which yields a solution robust to fluctuating noise. We further combine our system with a noise-independent enhancement technique, dynamic range compression. We found a significant improvement in the non-stationary noise condition, but no significant differences from the state-of-the-art system (spectral shaping and dynamic range compression) were found in the stationary noise condition.
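The dynamic range compression combined with spectral shaping here can be sketched as a static gain curve applied to a short-time level estimate: segments above a threshold are attenuated, reducing the level variation of the speech. All parameters below (threshold, ratio, window) are illustrative assumptions, not those of the evaluated system.

```python
import numpy as np

def dynamic_range_compression(x, fs, threshold_db=-20.0, ratio=4.0, win_ms=10.0):
    """Simple static-curve compressor: estimate a short-time level and
    attenuate samples whose level exceeds the threshold. Illustrative
    parameters, not the system evaluated in the thesis."""
    win = max(1, int(fs * win_ms / 1000))
    level = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    level_db = 10 * np.log10(level + 1e-12)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    # Above threshold, output level rises at 1/ratio the input rate.
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return x * 10 ** (gain_db / 20)
```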