202 research outputs found

    Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

    Modelling the process by which a listener derives the words intended by a speaker requires a hypothesis about how lexical items are stored in memory. This work aims to develop a system that imitates human word identification in running speech and, in this way, to provide a framework for better understanding human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) has developed a speech recognition system for English based on this approach. Italian is the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as sets of hierarchically arranged distinctive features, and to reveal which of the underlying mechanisms may be language-independent. This paper also introduces a new Lexical Access corpus, the LaMIT database, created and labeled specifically for this work, which will be provided freely to the speech research community. Future work will test the hypothesis that specific acoustic discontinuities, called landmarks, that serve as cues to features are language-independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech.
    Comment: Submitted to Language and Speech, 202
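
    The feature-based lexical representation described in this abstract can be sketched as a small data structure. The Python sketch below is illustrative only: the feature names, the partial-matching rule, and the sample entry are assumptions, not the LaMIT inventory or the actual recognizer.

    from dataclasses import dataclass, field

    # A minimal sketch of a feature-based lexical entry, loosely following
    # Stevens (2002): each word is a sequence of segments, and each segment is
    # a bundle of distinctive feature values (+1 / -1). Feature names and the
    # sample Italian entry are assumptions for illustration only.

    @dataclass
    class Segment:
        label: str                                      # phonemic label, for readability
        features: dict = field(default_factory=dict)    # e.g. {"sonorant": +1, "nasal": +1}

    @dataclass
    class LexicalEntry:
        word: str
        segments: list

    def is_consistent(entry, detected_bundles):
        """True if a sequence of detected feature bundles (cues found near
        landmarks) is consistent with the stored entry; features not specified
        in the entry are treated as unconstrained."""
        if len(detected_bundles) != len(entry.segments):
            return False
        return all(
            all(seg.features.get(f, v) == v for f, v in obs.items())
            for seg, obs in zip(entry.segments, detected_bundles)
        )

    # Hypothetical entry for Italian "no":
    no = LexicalEntry("no", [
        Segment("n", {"sonorant": +1, "nasal": +1}),
        Segment("o", {"sonorant": +1, "syllabic": +1}),
    ])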

    Optimization of acoustic feature extraction from dysarthric speech

    Thesis (Ph.D.)--Harvard-MIT Division of Health Sciences and Technology, February 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 171-180).
    Dysarthria is a motor speech disorder characterized by weak or uncoordinated movements of the speech musculature. While unfamiliar listeners struggle to understand speakers with severe dysarthria, familiar listeners are often able to comprehend with high accuracy. This observation implies that although the speech produced by an individual with dysarthria may appear distorted and unintelligible to the untrained listener, there must be a set of consistent acoustic cues that the familiar communication partner is able to interpret. While dysarthric speech has been characterized both acoustically and perceptually, most accounts compare dysarthric productions to those of healthy controls rather than identify the set of reliable and consistently controlled segmental cues. This work aimed to elucidate possible recognition strategies used by familiar listeners by optimizing a model of human speech recognition, Stevens' Lexical Access from Features (LAFF) framework, for ten individual speakers with dysarthria (SWDs). The LAFF model is rooted in distinctive feature theory, with acoustic landmarks indicating changes in the manner of articulation. The acoustic correlates manifested around landmarks cue the identity of articulator-free (manner) and articulator-bound (place) features. SWDs created weaker consonantal landmarks, likely due to an inability to form complete closures in the vocal tract and to fully release consonantal constrictions. Identification of speaker-optimized acoustic correlate sets improved discrimination of each speaker's productions, evidenced by increased sensitivity and specificity. While there was overlap between the types of correlates identified for healthy and dysarthric speakers, using the optimal sets of correlates identified for SWDs impaired discrimination of healthy speech. These results suggest that the combinations of correlates identified for SWDs were specific to the individual and different from the segmental cues used by healthy individuals. Application of the LAFF model to dysarthric speech has potential clinical utility as a diagnostic tool, highlighting the fine-grained components of speech production that require intervention and quantifying the degree of impairment.
    by Thomas M. DiCicco, Jr., Ph.D.
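
    The speaker-optimized correlate sets above are evaluated by how well they discriminate productions, reported as sensitivity and specificity. The sketch below shows one way such scores could be computed from a simple threshold-based detector; the thresholding scheme and names are assumptions, not the thesis procedure.

    def sensitivity_specificity(scores, labels, threshold=0.0):
        """Score a correlate-based detector: `scores` are per-token detector
        outputs, `labels` mark tokens where the target feature is truly present,
        and `threshold` is an assumed decision boundary."""
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
        tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        return sensitivity, specificity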

    Compression Effects in English

    This paper reports the results of an experiment on vowel shortening in English in different contexts. The data concern compression effects, whereby each segment in a syllable with more segments is shorter than it would be in a syllable with fewer segments. The experiment demonstrates that the amount of vowel compression found in English monosyllabic words depends in part on which consonants occur adjacent to the vowel in that word, how many consonants occur, and in which position they occur. Consonant clusters drive more vowel shortening than singletons when they involve liquids, but not when they involve only obstruents. Clusters involving nasals drive shortening relative to singletons only in onset position. We suggest that the results cannot be reduced to general principles of gestural overlap and coordination between consonants and vowels, but instead require a theory with overt representation of auditory duration.

    Automated nasal feature detection for the lexical access from features project

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (leaves 150-151).
    The focus of this thesis was the design, implementation, and evaluation of a set of automated algorithms to detect nasal consonants from the speech waveform in a distinctive feature-based speech recognition system. The study used a VCV database of over 450 utterances recorded from three speakers, two male and one female. The first stage of processing for each speech waveform included automated 'pivot' estimation using the Consonant Landmark Detector; these 'pivots' were treated as possible sonorant closures and releases in further analyses. Estimated pivots were analyzed acoustically for nasal murmur and vowel-nasal boundary characteristics. For nasal murmur, the analyzed cues included the presence of a low-frequency resonance in the short-time spectra, stability in the signal energy, and a characteristic spectral tilt. The acoustic cues for the nasal boundary measured the change in the energy of the first harmonic and the net energy change of the 0-350 Hz and 350-1000 Hz frequency bands around the pivot time. The results of the acoustic analyses were translated into a simple set of general acoustic criteria that detected 98% of true nasal pivots. The high detection rate was partially offset by a relatively large number of false positives: 16% of all non-nasal pivots were also detected as showing characteristics of the nasal murmur and nasal boundary. The advantage of the presented algorithms is their consistency and accuracy across speakers and contexts, and their applicability to spontaneous speech.
    by Neira Hajro, M.Eng.
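
    One of the boundary cues above, the net energy change in the 0-350 Hz and 350-1000 Hz bands around a pivot, can be sketched as follows. The window length, hop, and dB comparison are assumptions for illustration, not the thesis parameters.

    import numpy as np

    def band_energy_db(frame, sr, lo, hi):
        # power spectrum of a Hann-windowed frame, summed over the band [lo, hi) Hz
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        band = spec[(freqs >= lo) & (freqs < hi)]
        return 10.0 * np.log10(band.sum() + 1e-12)

    def band_energy_change(signal, sr, pivot_time, win=0.025):
        """dB change in each band from just before to just after the pivot."""
        n = int(win * sr)
        i = int(pivot_time * sr)
        before, after = signal[max(i - n, 0):i], signal[i:i + n]
        return {
            (lo, hi): band_energy_db(after, sr, lo, hi) - band_energy_db(before, sr, lo, hi)
            for lo, hi in [(0, 350), (350, 1000)]
        }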

    Consonant landmark detection for speech recognition

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 191-197).
    This thesis focuses on the detection of abrupt acoustic discontinuities in the speech signal, which constitute landmarks for consonant sounds. Because a large amount of phonetic information is concentrated near acoustic discontinuities, more focused speech analysis and recognition can be performed based on the landmarks. Three types of consonant landmarks are defined according to their characteristics -- glottal vibration, turbulence noise, and sonorant consonant -- so that the appropriate analysis method for each landmark point can be determined. A probabilistic knowledge-based algorithm is developed in three steps. First, landmark candidates are detected and their landmark types are classified based on changes in spectral amplitude. Next, a bigram model describing the physiologically feasible sequences of consonant landmarks is proposed, so that the most likely landmark sequence among the candidates can be found. Finally, it has been observed that certain landmarks are ambiguous in certain sets of phonetic and prosodic contexts, while they can be reliably detected in other contexts. A method to represent the regions where the landmarks are reliably detected versus where they are ambiguous is presented. On the TIMIT test set, 91% of all consonant landmarks and 95% of obstruent landmarks are located as landmark candidates. The bigram-based process for determining the most likely landmark sequences yields 12% deletion and substitution rates and a 15% insertion rate. An alternative representation that distinguishes reliable and ambiguous regions detects 92% of the landmarks, and 40% of the landmarks are judged to be reliable; the deletion rate within reliable regions is as low as 5%. The resulting landmark sequences form a basis for a knowledge-based speech recognition system, since the landmarks imply broad phonetic classes of the speech signal and indicate the points of focus for estimating detailed phonetic information. In addition, because the reliable regions generally correspond to lexical stresses and word boundaries, it is expected that the landmarks can guide the focus of attention not only at the phoneme level, but at the phrase level as well.
    by Chiyoun Park, Ph.D.
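
    The second step above, selecting the most likely landmark sequence among candidates under a bigram model of landmark types, can be sketched as a small Viterbi search. The candidate scores, label set, and smoothing constant below are assumptions, not the thesis implementation.

    import math

    def best_landmark_sequence(candidates, bigram, initial):
        """candidates: time-ordered list of {label: probability} dicts, one per
        landmark candidate; initial[label] and bigram[(prev, cur)] are start and
        transition probabilities. Returns the most likely label sequence."""
        # best[label] = (log-probability of the best path ending in label, path)
        best = {lab: (math.log(initial[lab] * p), [lab]) for lab, p in candidates[0].items()}
        for obs in candidates[1:]:
            new_best = {}
            for lab, p in obs.items():
                prev, (score, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + math.log(bigram.get((kv[0], lab), 1e-9)),
                )
                new_best[lab] = (score + math.log(bigram.get((prev, lab), 1e-9) * p), path + [lab])
            best = new_best
        return max(best.values(), key=lambda v: v[0])[1]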

    Phonation Types in Marathi: An Acoustic Investigation

    This dissertation presents a comprehensive instrumental acoustic analysis of phonation type distinctions in Marathi, an Indic language with numerous breathy voiced sonorants and obstruents. Important new facts about breathy voiced sonorants, which are crosslinguistically rare, are established: male and female speakers cue breathy phonation in sonorants differently, there is an abundance of trading relations, and--critically--phonation type distinctions are not cued as well by sonorants as by obstruents. Ten native speakers (five male, five female) were recorded producing Marathi words embedded in a carrier sentence. Tokens included plain and breathy voiced stops, affricates, nasals, laterals, rhotics, and approximants before the vowels [a] and [e]. Measures reported for consonants and subsequent vowels include duration, F0, Cepstral Peak Prominence (CPP), and corrected H1-H2*, H1-A1*, H1-A2*, and H1-A3* values. As expected, breathy voice is associated with decreased CPP and increased values of the spectral measures. A strong gender difference is revealed: low-frequency measures like H1-H2* cue breathy phonation more reliably in male speech, while CPP--which provides information about the aspiration noise in the signal--is a more reliable cue in female speech. Trading relations are also reported: time and again, where one cue is weak or absent, another cue is strong or present, underscoring the importance of including both genders and multiple vowel contexts when testing phonation type differences. Overall, the cues that are present for obstruents are not necessarily mirrored by sonorants. These findings are interpreted with reference to Dispersion Theory (Flemming 1995; Liljencrants & Lindblom 1972; Lindblom 1986, 1990). While various incarnations of Dispersion Theory focus on different aspects of perceptual and auditory distinctiveness, a basic claim is that phonological contrasts must be perceptually distinct: contrasts that are subject to great confusability are phonologically disfavored. The proposal, then, is that the typology of breathy voiced sonorants is due in part to the fact that they are not well differentiated acoustically. Breathy voiced sonorants are crosslinguistically rare because they do not make for strong phonemic contrasts.
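
    One of the spectral measures above, H1-H2 (the amplitude difference between the first two harmonics), can be sketched as follows. This is the uncorrected version; the formant-based correction that yields H1*-H2*, the windowing, and the harmonic search tolerance are assumptions for illustration.

    import numpy as np

    def h1_h2(frame, sr, f0, tol=0.15):
        """Return uncorrected H1-H2 in dB for one analysis frame, given an
        F0 estimate in Hz (F0 estimation itself is assumed to be done elsewhere)."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)

        def harmonic_amp(k):
            # peak magnitude within +/- tol * F0 of the k-th harmonic
            mask = np.abs(freqs - k * f0) <= tol * f0
            return 20.0 * np.log10(spec[mask].max() + 1e-12)

        return harmonic_amp(1) - harmonic_amp(2)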

    Categories, words and rules in language acquisition

    Acquiring language requires learning a set of words (i.e. the lexicon) and the abstract rules that combine them to form sentences (i.e. syntax). In this thesis, we show that infants acquiring their mother tongue rely on different speech categories to extract words and to abstract regularities. We address this issue with a study that investigates how young infants use consonants and vowels, showing that certain computations are tuned to one or the other of these speech categories.

    A model of sonority based on pitch intelligibility

    Synopsis: Sonority is a central notion in phonetics and phonology, essential for generalizations related to syllabic organization. To date, however, there is no clear consensus on the phonetic basis of sonority, in either perception or production. The widely used Sonority Sequencing Principle (SSP) represents the speech signal as a sequence of discrete units, where phonological processes are modeled as symbol-manipulating rules that lack a temporal dimension and are devoid of inherent links to perceptual, motoric or cognitive processes. The current work aims to change this by outlining a novel approach for the extraction of continuous entities from acoustic space in order to model dynamic aspects of phonological perception. It is used here to advance a functional understanding of sonority as a universal aspect of prosody that requires pitch-bearing syllables as the building blocks of speech. This book argues that sonority is best understood as a measurement of pitch intelligibility in perception, which is closely linked to periodic energy in acoustics. It presents a novel principle for sonority-based determinations of well-formedness – the Nucleus Attraction Principle (NAP). Two complementary NAP models independently account for symbolic and continuous representations, and they mostly outperform SSP-based models, as demonstrated here with experimental perception studies and with a corpus study of Modern Hebrew nouns. This work also includes a description of ProPer (Prosodic Analysis with Periodic Energy). The ProPer toolbox further exploits the proposal that periodic energy reflects sonority in order to cover major topics in prosodic research, such as prominence, intonation and speech rate. The book concludes with brief discussions of selected topics: (i) the phonotactic division of labor with respect to /s/-stop clusters; (ii) the debate about the universality of sonority; and (iii) the fate of the classic phonetics–phonology dichotomy as it relates to continuity and dynamics in phonology.
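
    The link drawn above between sonority and periodic energy can be illustrated with a simple per-frame measure: frame energy weighted by a periodicity estimate taken from the autocorrelation peak in the pitch range. The estimator and all parameters below are assumptions for illustration, not the ProPer implementation.

    import numpy as np

    def periodic_energy(signal, sr, frame=0.03, hop=0.01, fmin=75, fmax=400):
        n, h = int(frame * sr), int(hop * sr)
        lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for plausible pitch periods
        out = []
        for start in range(0, len(signal) - n, h):
            x = signal[start:start + n] * np.hanning(n)
            energy = float(np.sum(x ** 2))
            ac = np.correlate(x, x, mode="full")[n - 1:]        # autocorrelation, lags >= 0
            periodicity = ac[lo:hi].max() / (ac[0] + 1e-12)     # 0 (aperiodic) .. 1 (periodic)
            out.append(energy * max(periodicity, 0.0))
        return np.array(out)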

    Tonal placement in Tashlhiyt: How an intonation system accommodates to adverse phonological environments

    In most languages, words contain vowels, elements of high intensity with rich harmonic structure, enabling the perceptual retrieval of pitch. By contrast, in Tashlhiyt, a Berber language, words can be composed entirely of voiceless segments. When an utterance consists of such words, the phonetic opportunity for the execution of intonational pitch movements is exceptionally limited. This book explores, in a series of production and perception experiments, how these typologically rare phonotactic patterns interact with intonational aspects of linguistic structure. It turns out that Tashlhiyt allows for a tremendously flexible placement of tonal events. Observed intonational structures can be conceived of as different solutions to a functional dilemma: the requirement to realise meaningful pitch movements in certain positions versus the extent to which segments lend themselves to a clear manifestation of these pitch movements.

    Temporal articulatory stability, phonological variation, and lexical contrast preservation in diaspora Tibetan

    This dissertation examines how lexical tone can be represented with articulatory gestures, and the ways a gestural perspective can inform synchronic and diachronic analysis of the phonology and phonetics of a language. Tibetan is chosen as an example of a language with interacting laryngeal and tonal phonology, a history of tonogenesis and dialect diversification, and recent contact-induced realignment of the tonal and consonantal systems. Despite variation in voice onset time (VOT) and the presence/absence of the lexical tone contrast, speakers retain a consistent relative timing of consonant and vowel gestures. Recent research has attempted to integrate tone into the framework of Articulatory Phonology through the addition of tone gestures. Unlike other theories of phonetics-phonology, Articulatory Phonology uniquely incorporates relative timing as a key parameter. This allows the system to represent contrasts instantiated not just in the presence or absence of gestures, but also in how gestures are timed with each other. Building on the different predictions of various timing relations, along with the historical developments in the language, hypotheses are generated and tested with acoustic and articulatory experiments. Following an overview of relevant theory, the second chapter surveys past literature on the history of sound change and the present phonological diversity of Tibetic dialects. Whereas Old Tibetan lacked lexical tone, contrasted voiced and voiceless obstruents, and exhibited complex clusters, a series of overlapping sound changes has led to some modern varieties that have tone, lack clusters, and vary in the expression of voicing and aspiration. Furthermore, speakers in the Tibetan diaspora use a variety that has grown out of the contact between diverse Tibetic dialects. The state of the language and the dynamics of diaspora have created a situation ripe for sound change, including the recombination of elements from different dialects and, potentially, the loss of tone contrasts. The nature of diaspora Tibetan is investigated through an acoustic corpus study. Recordings made in Kathmandu, Nepal, are being transcribed and forced-aligned into an audio corpus. Speakers in the corpus come from diverse backgrounds across and outside traditional Tibetan-speaking regions, but the analysis presented here focuses on speakers who grew up in diaspora, with a mixed input of Standard Tibetan (spyi skad) and other Tibetan varieties. Especially notable among these speakers is the high variability of VOT and its interaction with tone. An analysis of these data in terms of the relative timing of oral, laryngeal, and tone gestures leads to the generation of hypotheses for testing with articulatory data. The articulatory study is conducted using electromagnetic articulography (EMA) with six Tibetan-speaking participants. The key finding is that the relative timing of consonant and vowel gestures is consistent across phonological categories and across speakers who do and do not contrast tone. This result leads to the conclusion that the relative timing of speech gestures is conserved and acquired independently. Speakers acquire and generalize a limited inventory of timing patterns, and can use timing patterns even when the conditioning environment for the development of those patterns, namely tone, has been lost.
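
    Two of the measures discussed above, voice onset time and the relative timing (lag) between consonant and vowel gestures, can be sketched from event times as follows. The field names, units, and the normalization option are assumptions for illustration, not the dissertation's analysis code.

    def vot_ms(burst_time_s, voicing_onset_s):
        """VOT in ms: positive for long-lag/voiceless stops, negative for prevoiced stops."""
        return (voicing_onset_s - burst_time_s) * 1000.0

    def cv_lag(c_gesture, v_gesture, normalize_by=None):
        """Lag between consonant and vowel gesture onsets (s), optionally
        normalized by a duration (e.g. the consonant plateau) to compare speakers."""
        lag = v_gesture["onset"] - c_gesture["onset"]
        return lag / normalize_by if normalize_by else lag

    # e.g. vot_ms(0.512, 0.575) -> approximately 63 ms (a long-lag stop)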