Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information
In conversational speech, the acoustic signal provides cues that help
listeners disambiguate difficult parses. For automatically parsing spoken
utterances, we introduce a model that integrates transcribed text and
acoustic-prosodic features using a convolutional neural network over energy and
pitch trajectories coupled with an attention-based recurrent neural network
that accepts text and prosodic features. We find that different types of
acoustic-prosodic features are individually helpful, and together give
statistically significant improvements in parse and disfluency detection F1
scores over a strong text-only baseline. For this study with known sentence
boundaries, error analyses show that the main benefit of acoustic-prosodic
features lies in sentences with disfluencies, that attachment decisions are the most
improved, and that transcription errors obscure gains from prosody.
Comment: Accepted at NAACL HLT 2018
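The fusion idea above can be sketched in a few lines. This is an illustrative toy only, not the paper's model: a hand-rolled 1-D convolution with max-pooling stands in for the CNN over energy and pitch trajectories, and a raw token-embedding vector stands in for the attention-based RNN text encoder. All names, the kernel values, and the inputs are assumptions for demonstration.

```python
# Toy sketch: pool convolved prosodic trajectories, concatenate with text features.

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (cross-correlation) over a trajectory."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def prosody_features(energy, pitch, kernel):
    """Convolve each trajectory, then max-pool to a fixed-size summary."""
    return [max(conv1d(energy, kernel)), max(conv1d(pitch, kernel))]

def fuse(token_embedding, energy, pitch, kernel=(0.25, 0.5, 0.25)):
    """Concatenate lexical features with pooled acoustic-prosodic features."""
    return list(token_embedding) + prosody_features(energy, pitch, kernel)

# A 2-d "text" vector plus one pooled feature each for energy and pitch.
features = fuse([0.1, -0.3], energy=[0.2, 0.9, 0.4, 0.1], pitch=[110, 180, 150, 120])
```

In the paper's actual architecture the two streams are learned jointly; the fixed smoothing kernel here only illustrates how trajectory-level cues can be reduced to features a sentence-level model can consume.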
Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme Discovery
Infants acquire words and phonemes from unsegmented speech signals using
segmentation cues, such as distributional, prosodic, and co-occurrence cues.
Many existing computational models of this process focus
on distributional or prosodic cues. This paper proposes a nonparametric
Bayesian probabilistic generative model called the prosodic hierarchical
Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM,
an extension of HDP-HLM, considers both prosodic and distributional cues within
a single integrative generative model. We conducted three experiments on
different types of datasets and demonstrated the validity of the proposed
method. The results show that the Prosodic DAA (Double Articulation Analyzer)
successfully uses prosodic cues and outperforms a method that uses
distributional cues alone. The main
contributions of this study are as follows: 1) We develop a probabilistic
generative model for time series data including prosody that potentially has a
double articulation structure; 2) We propose the Prosodic DAA by deriving the
inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can
discover words directly from continuous human speech signals using statistical
information and prosodic information in an unsupervised manner; 3) We show that
prosodic cues contribute more to word segmentation when word frequencies are
naturally distributed, i.e., follow Zipf's law.
Comment: 11 pages, Submitted to IEEE Transactions on Cognitive and Developmental Systems
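The core intuition of combining the two cue types can be shown with a toy score. This is loosely in the spirit of the integrative model above, but it is not the Prosodic HDP-HLM or its inference procedure; the weights and inputs are invented for demonstration.

```python
# Toy: merge distributional and prosodic evidence into one boundary score.

def combine_cues(distributional, prosodic, w=0.6):
    """Weighted mix of normalized cue strengths at each candidate boundary."""
    dmax, pmax = max(distributional), max(prosodic)
    return [w * d / dmax + (1 - w) * p / pmax
            for d, p in zip(distributional, prosodic)]

# Per-boundary evidence: e.g. a dip in transitional probability (distributional)
# and a pause or pitch reset (prosodic), both strongest at the second position.
scores = combine_cues([0.1, 0.9, 0.2], [0.05, 0.4, 0.1])
best = scores.index(max(scores))
```

Where the two cues agree, as at the second position here, the combined score is reinforced; in the actual model this combination is learned jointly inside a single generative process rather than fixed by hand.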
The cross-linguistic performance of word segmentation models over time
We select three word segmentation models with psycholinguistic foundations - transitional probabilities, the diphone-based segmenter, and PUDDLE - which track phoneme co-occurrence and positional frequencies in input strings, and, in the case of PUDDLE, build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 million words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation by fitting regression models to performance scores with linguistic properties that capture lexico-phonological characteristics of the input: word length, utterance length, lexical diversity, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. Together these properties explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables that make the segmentation task difficult.
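The first of the three models can be sketched in a few lines. This is a toy in the spirit of the transitional-probability segmenter, not the evaluated implementation: for simplicity it places a boundary wherever the forward TP drops below a fixed threshold (published TP models more often use local minima), and it operates on character strings standing in for phoneme sequences.

```python
# Minimal transitional-probability word segmenter.
from collections import Counter

def train_tp(utterances):
    """Estimate forward transitional probabilities P(b | a) from bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

def segment(utt, tp, threshold=0.5):
    """Split the utterance at every transition whose TP falls below threshold."""
    words, start = [], 0
    for i, pair in enumerate(zip(utt, utt[1:])):
        if tp.get(pair, 0.0) < threshold:
            words.append(utt[start:i + 1])
            start = i + 1
    words.append(utt[start:])
    return words

# Within-word transitions recur; cross-word transitions are rarer, so TP dips there.
tp = train_tp(["tupirogolabu", "golabutupiro"])
words = segment("tupirogolabu", tp)  # -> ['tupiro', 'golabu']
```

The diphone-based segmenter and PUDDLE add positional diphone statistics and incremental lexicon building on top of this kind of co-occurrence tracking.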
ProsAudit, a prosodic benchmark for self-supervised speech models
We present ProsAudit, a benchmark in English to assess structural prosodic
knowledge in self-supervised learning (SSL) speech models. It consists of two
subtasks, their corresponding metrics, and an evaluation dataset. In the
protosyntax task, the model must correctly identify strong versus weak prosodic
boundaries. In the lexical task, the model needs to correctly distinguish
between pauses inserted between words and within words. We also provide human
evaluation scores on this benchmark. We evaluated a series of SSL models and
found that they were all able to perform above chance on both tasks, even when
evaluated on an unseen language. However, non-native models performed
significantly worse than native ones on the lexical task, highlighting the
importance of lexical knowledge in this task. We also found a clear effect of
size, with models trained on more data performing better on the two subtasks.
Comment: Accepted at Interspeech 2023. 4 pages + references, 1 figure
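Scoring in forced-choice benchmarks of this kind typically reduces to a pairwise comparison: the model should assign a higher (pseudo-)likelihood to the prosodically natural member of each stimulus pair than to the manipulated one, with chance at 0.5. The sketch below illustrates that scheme; the pair format and tie handling are assumptions, not ProsAudit's exact protocol.

```python
# Pairwise forced-choice accuracy over (natural, manipulated) model scores.

def pairwise_accuracy(pairs):
    """pairs: iterable of (score_natural, score_manipulated); ties count 0.5."""
    pairs = list(pairs)
    correct = sum(1.0 if nat > man else 0.5 if nat == man else 0.0
                  for nat, man in pairs)
    return correct / len(pairs)

# Hypothetical log-likelihoods: three decisive pairs (two right, one wrong)
# and one tie give accuracy above the 0.5 chance level.
acc = pairwise_accuracy([(-3.1, -4.0), (-2.5, -2.9), (-5.0, -4.2), (-1.0, -1.0)])
```

"Above chance" in the abstract corresponds to this accuracy exceeding 0.5 on the benchmark's stimulus pairs.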