2,030 research outputs found

    Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

    Full text link
    Modelling the process by which a listener derives the words intended by a speaker requires a hypothesis about how lexical items are stored in memory. This work aims at developing a system that imitates humans in identifying words in running speech and, in this way, at providing a framework to better understand human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach. Italian is the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as sets of hierarchically arranged distinctive features, and to reveal which of the underlying mechanisms may be language-independent. This paper also introduces a new lexical access corpus, the LaMIT database, created and labeled specifically for this work, which will be provided freely to the speech research community. Future developments will test the hypothesis that the specific acoustic discontinuities, called landmarks, that serve as cues to features are language-independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech. Comment: Submitted to Language and Speech, 202
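    The landmark-based approach rests on detecting abrupt acoustic discontinuities from changes in band energies. As a rough illustration of that first step (a minimal sketch, not the LaMIT or MIT system; the band limits, window sizes, and 9 dB threshold below are assumptions), one can flag frames where band energy rises or falls sharply:

```python
# Sketch of landmark-candidate detection via band-energy rate of rise.
import numpy as np
from scipy.signal import stft, find_peaks

def landmark_candidates(x, fs, band=(800.0, 8000.0), win_ms=25, hop_ms=10,
                        span_hops=5, thresh_db=9.0):
    """Return times (s) of abrupt energy changes in one band (assumed settings)."""
    nperseg = int(fs * win_ms / 1000)
    noverlap = nperseg - int(fs * hop_ms / 1000)
    f, t, Z = stft(x, fs, nperseg=nperseg, noverlap=noverlap)
    rows = (f >= band[0]) & (f <= band[1])
    energy_db = 10 * np.log10(np.sum(np.abs(Z[rows]) ** 2, axis=0) + 1e-12)
    # Rate of rise: dB change across ~span_hops * hop_ms milliseconds.
    ror = energy_db[span_hops:] - energy_db[:-span_hops]
    peaks, _ = find_peaks(np.abs(ror), height=thresh_db)
    return t[peaks + span_hops // 2]
```

    Real landmark detectors operate on several frequency bands and classify the direction and band pattern of each change; this sketch only localizes candidates.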

    Behavioral sentiment analysis of depressive states

    Get PDF
    The need to deliver accurate and incontrovertible diagnoses of depression has fueled the search for new methodologies to obtain more reliable measurements than the commonly adopted questionnaires. In this context, research has sought to identify unbiased measures derived from analyses of behavioral data such as voice and language. For this purpose, sentiment analysis techniques were developed, initially based on linguistic characteristics extracted from texts and gradually becoming more sophisticated through the addition of tools for analyzing voice and visual data (such as facial expressions and movements). This work summarizes the behavioral features used to detect depressive states and the sentiment analysis tools developed to extract them from text, audio, and video recordings.

    Dealing with linguistic mismatches for automatic speech recognition

    Get PDF
    Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data remains a significant, unsolved challenge. In the monolingual setting, it is well known that the performance of ASR systems degrades significantly when presented with speech from speakers with accents, dialects, and speaking styles different from those encountered during system training. In the multilingual setting, ASR systems trained on a source language perform even worse when tested on another target language because of mismatches in the number of phonemes, lexical ambiguity, and the power of the phonotactic constraints provided by phone-level n-grams. To address these linguistic mismatches in current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories from acoustics and articulatory phonetics that may transfer across a dialect continuum, from local dialects to a standardized language, are revisited. Experiments demonstrate the potential of acoustic correlates in the vicinity of landmarks to bridge mismatches across different local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on the connectionist temporal classification (CTC) loss and propose to link the training of acoustics and accent together, in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also boosted the accuracy of accent identification in comparison to separately trained models.
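    To make the joint acoustics-accent idea concrete, the following is a hedged sketch (assuming PyTorch; the layer sizes, token and accent inventories, and 0.3 loss weight are illustrative, not the dissertation's configuration) of a shared encoder with a CTC head for transcription and an utterance-level head for accent identification:

```python
# Sketch of a jointly trained CTC + accent-identification acoustic model.
import torch
import torch.nn as nn

class JointCTCAccent(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_tokens=32, n_accents=8):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, n_tokens)      # frame-wise token logits
        self.accent_head = nn.Linear(2 * hidden, n_accents)  # utterance-level logits

    def forward(self, feats):                                # (batch, time, n_feats)
        enc, _ = self.encoder(feats)
        return self.ctc_head(enc), self.accent_head(enc.mean(dim=1))

ctc_loss, ce_loss = nn.CTCLoss(blank=0), nn.CrossEntropyLoss()

def joint_loss(ctc_logits, accent_logits, tokens, in_lens, tok_lens, accents, w=0.3):
    # nn.CTCLoss expects (time, batch, tokens) log-probabilities.
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    return ctc_loss(log_probs, tokens, in_lens, tok_lens) + w * ce_loss(accent_logits, accents)
```

    Sharing the encoder forces the acoustic representation to carry accent information, which is the mechanism by which joint training can help both tasks.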

    The analysis of breathing and rhythm in speech

    Get PDF
    Speech rhythm can be described as the temporal patterning by which speech events, such as vocalic onsets, occur. Despite efforts to quantify and model speech rhythm across languages, it remains a scientifically enigmatic aspect of prosody. For instance, one challenge lies in determining how to best quantify and analyse speech rhythm. Techniques range from manual phonetic annotation to the automatic extraction of acoustic features. It is currently unclear how closely these differing approaches correspond to one another. Moreover, the primary means of speech rhythm research has been the analysis of the acoustic signal only. Investigations of speech rhythm may instead benefit from a range of complementary measures, including physiological recordings, such as of respiratory effort. This thesis therefore combines acoustic recording with inductive plethysmography (breath belts) to capture temporal characteristics of speech and speech breathing rhythms. The first part examines the performance of existing phonetic and algorithmic techniques for acoustic prosodic analysis in a new corpus of rhythmically diverse English and Mandarin speech. The second part addresses the need for an automatic speech breathing annotation technique by developing a novel function that is robust to the noisy plethysmography typical of spontaneous, naturalistic speech production. These methods are then applied in the following section to the analysis of English speech and speech breathing in a second, larger corpus. Finally, behavioural experiments were conducted to investigate listeners' perception of speech breathing using a novel gap detection task. The thesis establishes the feasibility, as well as limits, of automatic methods in comparison to manual annotation. In the speech breathing corpus analysis, they help show that speakers maintain a normative, yet contextually adaptive breathing style during speech. The perception experiments in turn demonstrate that listeners are sensitive to the violation of these speech breathing norms, even if unconsciously so. The thesis concludes by underscoring breathing as a necessary, yet often overlooked, component in speech rhythm planning and production
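    As an illustration of the automatic annotation problem (a minimal sketch under assumed parameters, not the thesis's function), inhalation onsets in a breath-belt trace can be approximated by smoothing the signal and locating its troughs, since chest circumference is near a local minimum when an inhalation begins:

```python
# Sketch of inhalation-onset detection from a respiratory (breath-belt) trace.
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def inhalation_onsets(belt, fs, smooth_s=0.5, min_gap_s=1.0, min_prominence=0.1):
    """belt: 1-D plethysmography samples; returns onset times in seconds."""
    win = int(fs * smooth_s) | 1                 # Savitzky-Golay needs an odd window
    smoothed = savgol_filter(belt, win, polyorder=3)
    troughs, _ = find_peaks(-smoothed, distance=int(fs * min_gap_s),
                            prominence=min_prominence)
    return troughs / fs
```

    Spontaneous speech makes the trace far noisier than this sketch assumes, which is exactly the robustness problem the thesis addresses.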

    Models and Analysis of Vocal Emissions for Biomedical Applications

    Get PDF
    The MAVEBA Workshop proceedings, published every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.

    Speech Communication

    Get PDF
    Contains table of contents for Part IV, table of contents for Section 1, an introduction, reports on seven research projects, and a list of publications. Funding: C.J. Lebel Fellowship; Dennis Klatt Memorial Fund; National Institutes of Health Grants T32-DC00005, R01-DC00075, F32-DC00015, R01-DC00266, P01-DC00361, and R01-DC00776; National Science Foundation Grants IRI 89-10561, IRI 88-05680, and INT 90-2471.

    Neural mechanisms of foreign language phoneme acquisition in early adulthood: MEG study

    Get PDF
    The aim of this study is to examine the learning mechanisms and acquisition of non-native phoneme contrasts in young adults using neurophysiological and behavioral methods. According to the traditional view, acquiring novel phonemes after the sensitive periods of early childhood is very difficult. However, later findings have shown that foreign phoneme contrasts can be learned at a later age, too. Acquiring new phonemic categories requires neuroplastic changes in the brain. Neurophysiological studies have examined the brain's ability to differentiate between closely related phonemic categories at the early stage of spoken language processing by measuring, for example, event-related mismatch negativity (MMN) responses. The MMN, or its magnetic equivalent MMNm, is elicited when the brain registers a difference in a repetitive sensory stimulus. Studies have shown that even a moderate amount of auditory training with closely related foreign phonemes improves the brain's ability to discriminate between them, resulting in enhanced MMN or MMNm responses. In this experiment, the neural mechanisms of foreign language phoneme acquisition and the learning-related neuroplastic changes were studied using magnetoencephalography (MEG) and neuromagnetic evoked responses (the MMNm in particular). Twenty Finnish subjects were measured; their task was to learn to differentiate between the acoustically similar Russian fricatives Ш /ʂ/ and Щ /ɕ(ː)/. The subjects' discrimination skills were first tested in a behavioral task in which Russian pseudoword minimal pairs, varying in their first phoneme, were presented auditorily and the subjects reported whether they heard a difference between the words. The same stimuli were then presented in a passive MEG task that tested the brain's change-detection responses in an unattended situation while the subjects watched a silent film. After the measurement, the subjects practiced discriminating the phonemes at home for approximately one week using a computer-based learning game, after which they were measured again. Structural magnetic resonance images of the subjects' brains were also acquired for MEG source localization. Behavioral discrimination of the experimental phonemes was considerably worse than that of the familiar control phonemes. Discrimination seemed to improve slightly with training, but the difference was not statistically significant. Contrary to the hypotheses, statistically significant MMNm responses were found neither before nor after training, and no statistically significant differences between the measurement sessions were found in the other auditory MEG responses or in the strength or distribution of their neural source currents. However, individual differences in learning were sizeable. The subjects whose performance in the behavioral task improved with training showed a modest training-related boost in the auditory responses, consistent with the hypotheses. Although this effect was very small and statistically insignificant, it was reversed for the control stimuli and absent in the non-learner group, suggesting some change in neural processing in the learner group. This study was thus unable to replicate the findings of various previous studies on phoneme acquisition in adulthood. Although certain methodological limitations (e.g. the small number of stimulus repetitions in the MEG task and the challenging stimuli) likely affected the significance of the results, this study calls into question the generalizability of some of the previous findings.
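    For reference, the mismatch response at the heart of the design is computed as a difference wave: the trial-averaged response to deviants minus the trial-averaged response to standards. A minimal numpy sketch (array shapes are assumptions, not the study's pipeline):

```python
# Sketch of the deviant-minus-standard difference wave used to estimate MMN/MMNm.
import numpy as np

def mismatch_response(standard_trials, deviant_trials):
    """Inputs: arrays of shape (n_trials, n_channels, n_times)."""
    std_evoked = standard_trials.mean(axis=0)    # averaging suppresses trial noise
    dev_evoked = deviant_trials.mean(axis=0)
    return dev_evoked - std_evoked               # MMN(m) typically peaks ~100-250 ms
```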

    The benefits of acoustic perceptual information for speech processing systems

    Get PDF
    The frame-synchronized framework has dominated many speech processing systems, such as automatic speech recognition (ASR) and acoustic event detection (AED) targeting human speech activities. These systems give little consideration to the science behind speech and treat the task as simple statistical classification. The framework also assumes each feature vector to be equally important to the task. However, through preliminary experiments, this study has found evidence that concepts defined in speech perception theories, such as auditory roughness and acoustic landmarks, can act as heuristics for these systems and benefit them in multiple ways. Findings on acoustic landmarks suggest that treating each frame equally might not be optimal. In some cases, landmark information can improve system accuracy by highlighting the more significant frames, or improve acoustic model accuracy through multi-task learning (MTL). Further investigation found experimental evidence that acoustic landmark information can also benefit end-to-end acoustic models trained with the connectionist temporal classification (CTC) loss: with the help of acoustic landmarks, CTC models converge with less training data and achieve a lower error rate. For the first time, positive results for acoustic landmarks were collected on a mid-size ASR corpus (WSJ). The results indicate that auditory perception information can benefit a broad range of audio processing systems.
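    One simple way to realize the "highlight the more significant frames" idea (an illustrative construction assuming PyTorch, not the dissertation's exact formulation) is to upweight the per-frame loss near detected landmarks:

```python
# Sketch of a landmark-weighted frame-level classification loss.
import torch
import torch.nn.functional as F

def landmark_weighted_loss(frame_logits, frame_targets, landmark_frames,
                           base_w=1.0, lm_w=2.0, radius=3):
    """frame_logits: (n_frames, n_classes); landmark_frames: frame indices."""
    n_frames = frame_logits.shape[0]
    weights = torch.full((n_frames,), base_w)
    for lm in landmark_frames:                   # boost frames near each landmark
        weights[max(0, lm - radius):min(n_frames, lm + radius + 1)] = lm_w
    losses = F.cross_entropy(frame_logits, frame_targets, reduction="none")
    return (weights * losses).sum() / weights.sum()
```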

    Acoustic characterization of the glides /j/ and /w/ in American English

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 141-145). Acoustic analyses were conducted to identify the characteristics that differentiate the glides /j,w/ from adjacent vowels. These analyses were performed on a recorded database of intervocalic glides, produced naturally by two male and two female speakers in controlled vocalic and prosodic contexts. Glides were found to differ significantly from adjacent vowels through RMS amplitude reduction, first formant frequency reduction, open quotient increase, harmonics-to-noise ratio reduction, and fundamental frequency reduction. The acoustic data suggest that glides differ from their cognate high vowels /i,u/ in that the glides are produced with a greater degree of constriction in the vocal tract. The narrower constriction causes an increase in oral pressure, which produces aerodynamic effects on the glottal voicing source. This interaction between the vocal tract filter and its excitation source results in skewing of the glottal waveform, increasing its open quotient and decreasing the amplitude of voicing. A listening experiment with synthetic tokens was performed to isolate and compare the perceptual salience of acoustic cues to the glottal source effects of glides and to the vocal tract configuration itself. Voicing amplitude (representing source effects) and first formant frequency (representing filter configuration) were manipulated in cooperating and conflicting patterns to create percepts of /V#V/ or /V#GV/ sequences, where Vs were high vowels and Gs were their cognate glides. In the responses of ten naïve subjects, voicing amplitude had a greater effect on the detection of glides than first formant frequency, suggesting that glottal source effects are more important to the distinction between glides and high vowels. The results of the acoustic and perceptual studies provide evidence for an articulatory-acoustic mapping defining the glide category. It is suggested that glides are differentiated from high vowels and fricatives by articulatory-acoustic boundaries related to the aerodynamic consequences of different degrees of vocal tract constriction. The supraglottal constriction target for glides is sufficiently narrow to produce a non-vocalic oral pressure drop, but not sufficiently narrow to produce a significant frication noise source. This mapping is consistent with the theory that articulator-free features are defined by aero-mechanical interactions. Implications for phonological classification systems and speech technology applications are discussed. by Elisabeth Hon Hunt. Ph.D.
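    As an example of one of the measurements involved (a minimal sketch with assumed, hand-labeled segment boundaries, not the thesis's procedure), the RMS-amplitude reduction of a glide relative to its flanking vowels can be computed directly from the waveform:

```python
# Sketch of measuring a glide's RMS-amplitude dip relative to flanking vowels.
import numpy as np

def rms_db(x):
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def glide_amplitude_dip(signal, fs, vowel1, glide, vowel2):
    """Each region is a (start_s, end_s) pair; returns the dip in dB."""
    cut = lambda a, b: signal[int(a * fs):int(b * fs)]
    vowel_level = 0.5 * (rms_db(cut(*vowel1)) + rms_db(cut(*vowel2)))
    return vowel_level - rms_db(cut(*glide))     # positive: glide is quieter
```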

    Consonant landmark detection for speech recognition

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 191-197). This thesis focuses on the detection of abrupt acoustic discontinuities in the speech signal, which constitute landmarks for consonant sounds. Because a large amount of phonetic information is concentrated near acoustic discontinuities, more focused speech analysis and recognition can be performed based on the landmarks. Three types of consonant landmarks are defined according to their characteristics -- glottal vibration, turbulence noise, and sonorant consonant -- so that the appropriate analysis method for each landmark point can be determined. A probabilistic knowledge-based algorithm is developed in three steps. First, landmark candidates are detected and their landmark types are classified based on changes in spectral amplitude. Next, a bigram model describing the physiologically feasible sequences of consonant landmarks is proposed, so that the most likely landmark sequence among the candidates can be found. Finally, it has been observed that certain landmarks are ambiguous in certain sets of phonetic and prosodic contexts, while they can be reliably detected in other contexts. A method to represent the regions where the landmarks are reliably detected versus where they are ambiguous is presented. On the TIMIT test set, 91% of all the consonant landmarks and 95% of obstruent landmarks are located as landmark candidates. The bigram-based process for determining the most likely landmark sequences yields 12% deletion and substitution rates and a 15% insertion rate. An alternative representation that distinguishes reliable and ambiguous regions can detect 92% of the landmarks, and 40% of the landmarks are judged to be reliable. The deletion rate within reliable regions is as low as 5%. The resulting landmark sequences form a basis for a knowledge-based speech recognition system, since the landmarks imply broad phonetic classes of the speech signal and indicate the points of focus for estimating detailed phonetic information. In addition, because the reliable regions generally correspond to lexical stresses and word boundaries, it is expected that the landmarks can guide the focus of attention not only at the phoneme level but at the phrase level as well. by Chiyoun Park. Ph.D.
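    The second step of the algorithm, choosing the most likely landmark sequence from scored candidates under a bigram model, amounts to a standard Viterbi search. A hedged sketch follows (the type labels, probability interface, and smoothing floor are assumptions, not the thesis's trained model):

```python
# Sketch of bigram Viterbi decoding over consonant-landmark candidates.
import math

def viterbi_landmarks(candidates, bigram, types=("g", "b", "s")):
    """candidates: per-time dicts of P(type | acoustics); bigram[p][c]: P(c | p)."""
    best = {t: math.log(candidates[0].get(t, 1e-9)) for t in types}
    backptrs = []
    for cand in candidates[1:]:
        new_best, ptr = {}, {}
        for cur in types:
            prev = max(types, key=lambda p: best[p] + math.log(bigram[p][cur]))
            new_best[cur] = (best[prev] + math.log(bigram[prev][cur])
                             + math.log(cand.get(cur, 1e-9)))
            ptr[cur] = prev
        best, backptrs = new_best, backptrs + [ptr]
    cur = max(types, key=lambda t: best[t])      # best final type, then trace back
    path = [cur]
    for ptr in reversed(backptrs):
        cur = ptr[cur]
        path.append(cur)
    return path[::-1]
```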