206 research outputs found

    Machine Learning of Probabilistic Phonological Pronunciation Rules from the Italian CLIPS Corpus

    A blending of phonological concepts and technical analysis is proposed to yield a better modeling and understanding of phonological processes. Based on the manual segmentation and labeling of the Italian CLIPS corpus, we automatically derive a probabilistic set of phonological pronunciation rules: a new alignment technique is used to map the phonological form of spontaneous sentences onto the phonetic surface form. A machine-learning algorithm then calculates a set of phonological replacement rules together with their conditional probabilities. A critical analysis of the resulting probabilistic rule set is presented and discussed with regard to regional Italian accents. The rule set presented here is also applied in the newly published web service WebMAUS, which allows a user to segment and phonetically label Italian speech via a simple web interface.
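    The abstract does not say how the rule probabilities are computed, so the following is only a minimal sketch under stated assumptions: the canonical and surface forms are already aligned symbol by symbol, each rule is conditioned on the immediate left and right canonical neighbours, and probabilities are plain relative frequencies. The function name, the `#` boundary symbol, and the toy data are hypothetical and are not taken from the paper or from the CLIPS corpus.

```python
from collections import Counter


def estimate_rule_probabilities(aligned_pairs):
    """Estimate P(surface symbol | left, canonical symbol, right).

    aligned_pairs: iterable of (canonical, surface) symbol sequences that
    have already been aligned to equal length (an assumption of this sketch).
    Returns a dict mapping ((left, canonical, right), surface) -> probability.
    """
    context_counts = Counter()
    rule_counts = Counter()
    for canonical, surface in aligned_pairs:
        if len(canonical) != len(surface):
            raise ValueError("sequences must be aligned to equal length")
        # Pad with a hypothetical boundary symbol so edge symbols have context.
        padded = ["#"] + list(canonical) + ["#"]
        for i, (c, s) in enumerate(zip(canonical, surface)):
            context = (padded[i], c, padded[i + 2])
            context_counts[context] += 1
            rule_counts[(context, s)] += 1
    # Relative frequency of each surface realisation within its context.
    return {
        (context, s): n / context_counts[context]
        for (context, s), n in rule_counts.items()
    }


# Toy, made-up data: the same canonical word realised in two ways.
pairs = [
    (["k", "a", "s", "a"], ["k", "a", "z", "a"]),
    (["k", "a", "s", "a"], ["k", "a", "s", "a"]),
]
probs = estimate_rule_probabilities(pairs)
print(probs[(("a", "s", "a"), "z")])
```

    In this toy setting the rule "s becomes z between two a symbols" receives the share of aligned tokens in which that replacement was observed; a real system would also need to represent insertions and deletions, which this sketch leaves out.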

    An automatically built named entity lexicon for Arabic

    We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall), depending on the value of a threshold.

    D6.1: Technologies and Tools for Lexical Acquisition

    This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).

    Using graphone models in automatic speech recognition

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 87-90). This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown words. For these experiments, graphone models are integrated into the SUMMIT speech recognition framework. First, graphones are applied to automatically generate pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words. Next, a graphone hybrid recognizer is built and tested for searching song lyrics by voice, as well as transcribing spoken lectures in an open-vocabulary scenario. These experiments demonstrate significant improvement over traditional word-only speech recognizers. Modifications to the flat hybrid model, such as reducing the graphone set size, are also considered. Finally, a hierarchical hybrid model is built and compared with the flat hybrid model on the lecture transcription task. By Stanley Xinlei Wang. M.Eng.

    Linguistic Resources and Technologies for Romanian Language

    This paper reviews notions related to Language Resources and Technologies (LRT), including a brief overview of some resources developed worldwide, with a special focus on the Romanian language. It then describes a joint Romanian, Moldavian, and English initiative aimed at developing electronically coded resources for the Romanian language, tools for their maintenance and usage, and applications based on these resources.

    Max Planck Institute for Psycholinguistics: Annual report 1996

    No full text

    Quantifying Speech Rhythms: Perception and Production Data in the Case of Spanish, Portuguese, and English

    This dissertation addresses the methodology used in classifying speech rhythms in order to resolve a long-standing linguistic conundrum about whether languages differ rhythmically. There is a widespread perception, both among linguists and the general population, that some languages are stress-timed and others are syllable-timed. Stress-timed languages are described as having less regular rhythms, as syllable durations vary according to the placement of stress in the phrase. Meanwhile, syllable-timed languages are described as displaying less variation in rhythm, with syllable durations being more regular. This dissertation quantitatively evaluates these described rhythmic differences in Spanish, Portuguese, and English. The first chapter introduces speech rhythms and reviews past literature on their perception and production. The second chapter evaluates a widely used metric of speech rhythms, the PVI, and determines that it is not effective in distinguishing between two dialects of Spanish. The third chapter compares the speech rhythms of Mexican and Chicano Spanish. This chapter concludes that Chicano Spanish is more restricted in its vowel duration variability, while Mexican Spanish employs both highly variable durations (i.e. stress-timed) and highly uniform durations (i.e. syllable-timed). The fourth chapter describes a perception study used to compare the speech rhythms of Spanish, English, and Portuguese, and shows that these languages' rhythms do not always group according to language. In the fifth chapter, I describe a study of the production of the same utterances initially used in the perception experiment; this allows an analysis of what prompts the perceptual differences in speech rhythm described in Chapter Four. The sixth and final chapter discusses the implications and applications of these findings and gives direction for further investigation.
    Although both production and perception studies of speech rhythms have been performed in the past, my dissertation expands these methodologies by combining production and perception data in a single analysis. I use perception data to classify the rhythms of utterances relative to one another through low-pass speech filtering, then analyze the production of these data computationally to provide a more complete perspective of what prompts differences in speech rhythms and how Spanish, Portuguese, and English data relate rhythmically. Thus, my dissertation is thorough, while still addressing traditional rhythm metrics and employing current computational methodology. It seeks to challenge linguists' methodologies in quantitatively addressing speech rhythms, and to further clarify the position of Spanish, Portuguese, and English on the speech rhythm continuum.
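    The abstract names the PVI (Pairwise Variability Index) without saying which variant the dissertation uses. As a rough illustration only, the sketch below implements the two commonly cited forms: the raw PVI (mean absolute difference between successive interval durations) and the normalized PVI (each difference divided by the mean of the pair, scaled by 100). The function names and the duration values are assumptions made for this example, not data from the dissertation.

```python
def raw_pvi(durations):
    """Mean absolute difference between successive durations (e.g. in ms)."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(durations):
    """Normalized PVI: each successive difference is divided by the mean
    of the two durations, which reduces the influence of speech rate."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    terms = [
        abs(a - b) / ((a + b) / 2)
        for a, b in zip(durations, durations[1:])
    ]
    return 100 * sum(terms) / len(terms)


# Made-up vowel durations in milliseconds for two hypothetical utterances.
uniform = [100, 100, 100, 100]   # perfectly regular intervals
alternating = [100, 300]         # one long-short contrast

print(normalized_pvi(uniform))
print(normalized_pvi(alternating))
```

    Under this formulation, perfectly uniform durations give a score of zero and larger contrasts between neighbouring intervals give higher scores, which is the sense in which the index is used to place utterances between the syllable-timed and stress-timed descriptions.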