3,010 research outputs found
Recommended from our members
Real-time decoding of question-and-answer speech dialogue using human cortical activity.
Natural communication often occurs in dialogue, differentially engaging auditory and sensorimotor brain regions during listening and speaking. However, previous attempts to decode speech directly from the human brain typically consider listening or speaking tasks in isolation. Here, human participants listened to questions and responded aloud with answers while we used high-density electrocorticography (ECoG) recordings to detect when they heard or said an utterance and to then decode the utterance's identity. Because certain answers were only plausible responses to certain questions, we could dynamically update the prior probabilities of each answer using the decoded question likelihoods as context. We decode produced and perceived utterances with accuracy rates as high as 61% and 76%, respectively (chance is 7% and 20%). Contextual integration of decoded question likelihoods significantly improves answer decoding. These results demonstrate real-time decoding of speech in an interactive, conversational setting, which has important implications for patients who are unable to communicate
Language Identification Using Visual Features
Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to iden- tify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments), or impossible (no audio signal is available). Research in this field is also beneficial in the related field of automatic lip-reading. This paper introduces several methods for visual language identification (VLID). They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate languages. We show that VLID is possible in a speaker-dependent mode by discrimi- nating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin-tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error-rate of < 10% in discriminating between Arabic and English on 19 speakers and using about 30s of visual speech
Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition
In this paper, we present several adaptation methods for non-native speech
recognition. We have tested pronunciation modelling, MLLR and MAP non-native
pronunciation adaptation and HMM models retraining on the HIWIRE foreign
accented English speech database. The ``phonetic confusion'' scheme we have
developed consists in associating to each spoken phone several sequences of
confused phones. In our experiments, we have used different combinations of
acoustic models representing the canonical and the foreign pronunciations:
spoken and native models, models adapted to the non-native accent with MAP and
MLLR. The joint use of pronunciation modelling and acoustic adaptation led to
further improvements in recognition accuracy. The best combination of the above
mentioned techniques resulted in a relative word error reduction ranging from
46% to 71%
Visual units and confusion modelling for automatic lip-reading
Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not but because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy
Learning to Understand Child-directed and Adult-directed Speech
Speech directed to children differs from adult-directed speech in linguistic
aspects such as repetition, word choice, and sentence length, as well as in
aspects of the speech signal itself, such as prosodic and phonemic variation.
Human language acquisition research indicates that child-directed speech helps
language learners. This study explores the effect of child-directed speech when
learning to extract semantic information from speech directly. We compare the
task performance of models trained on adult-directed speech (ADS) and
child-directed speech (CDS). We find indications that CDS helps in the initial
stages of learning, but eventually, models trained on ADS reach comparable task
performance, and generalize better. The results suggest that this is at least
partially due to linguistic rather than acoustic properties of the two
registers, as we see the same pattern when looking at models trained on
acoustically comparable synthetic speech.Comment: Authors found an error in preprocessing of transcriptions before they
were fed to SBERT. After correction, the experiments were rerun. The updated
results can be found in this version. Importantly, - Most scores were
affected to a small degree (performance was slightly worse). - The effect was
consistent across conditions. Therefore, the general patterns remain the sam
- …