518 research outputs found

    On the use of probabilistic grammars in speech annotation and segmentation tasks

    The present paper explores the issue of corpus prosodic parsing in terms of prosodic words. This question is of importance in both speech processing and corpus annotation studies. We propose a method grounded in both statistical and symbolic (phonological) representations of tonal phenomena, and we make use of probabilistic grammars, within which we implement a minimal prosodic hierarchical structure. Both stages, building the probabilistic grammar and testing it in prediction, are explored and evaluated quantitatively and qualitatively.

    Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

    Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-employment of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.

    Introduction to the special issue on cross-language algorithms and applications

    With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.

    The pervasiveness of language contact: Evidence from negative existentials in Romeyka/Turkish code-switching

    This paper investigates the morpho-syntactic features of language contact in the endangered Greek dialect Romeyka with Turkish. We analyze the use of the borrowed negative existential jok to (a) determine its role in Romeyka’s negation patterns, (b) examine the effects of contact in Romeyka through cross-linguistic comparisons of jok with Turkish and forms of the dialect as spoken in Greece, and (c) apply the identified grammatical patterns of jok to Myers-Scotton’s linguistic explanations for the code-switching phenomena in the Matrix Language Turnover Hypothesis. The analysis demonstrates the pervasive influence of Turkish on the morpho-syntax of Romeyka through the incorporation of Turkish grammatical structures. We observe changes in the fundamental predicate grammar that are aligned with Turkish and that are inconsistent with Pontic’s existential constructions, where the verb indicating existence is used. The patterns of contact confirm the Matrix Language hypothesis and provide evidence indicating that Romeyka may be undergoing language turnover. Our findings are relevant to further understanding code switching among speakers of minority languages and assessing the vitality of Romeyka in Turkey.

    Normalization of Dutch user-generated content

    This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system's robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.