25 research outputs found

    Creating a morphological and syntactic tagged corpus for the Uzbek language

    Creating tagged corpora has become one of the most important tasks in Natural Language Processing (NLP). There are not enough tagged corpora to build machine learning models for Uzbek, a low-resource language. In this paper, we try to fill that gap by developing a novel part-of-speech (POS) and syntactic tagset for building a morphologically and syntactically tagged corpus of the Uzbek language. The work also describes and presents a web-based application for tagging. Based on the developed annotation tool and software, we share the results of the first stage of the tagged corpus creation.

    Building a New Sentiment Analysis Dataset for Uzbek Language and Creating Baseline Models

    Making natural language processing technologies available for low-resource languages is an important goal for improving access to technology in their communities of speakers. In this paper, we provide the first annotated corpora for polarity classification in the Uzbek language. Our methodology involves collecting a medium-size manually annotated dataset and a larger dataset automatically translated from existing resources. We then use these datasets to train sentiment analysis models for Uzbek, using both traditional machine learning techniques and recent deep learning models.
    Funding: Ministerio de Economía y Empresa; TIN2017-85160-C2-1-R. Xunta de Galicia; ED431B 2017/0

    Syntax inside the grammar

    This volume collects novel contributions to comparative generative linguistics that “rethink” existing approaches to an extensive range of phenomena, domains, and architectural questions in linguistic theory. At the heart of the contributions is the tension between descriptive and explanatory adequacy which has long animated generative linguistics and which continues to grow thanks to the increasing amount and diversity of data available to us. The chapters address research questions on the relation of syntax to other aspects of grammar and linguistics more generally, including studies on language acquisition, variation and change, and syntactic interfaces. Many of these contributions show the influence of research by Ian Roberts and collaborators and give the reader a sense of the lively nature of current discussion of topics in synchronic and diachronic comparative syntax ranging from the core verbal domain to higher, propositional domains

    Syntactic architecture and its consequences I

    This volume collects novel contributions to comparative generative linguistics that “rethink” existing approaches to an extensive range of phenomena, domains, and architectural questions in linguistic theory. At the heart of the contributions is the tension between descriptive and explanatory adequacy which has long animated generative linguistics and which continues to grow thanks to the increasing amount and diversity of data available to us. The chapters address research questions on the relation of syntax to other aspects of grammar and linguistics more generally, including studies on language acquisition, variation and change, and syntactic interfaces. Many of these contributions show the influence of research by Ian Roberts and collaborators and give the reader a sense of the lively nature of current discussion of topics in synchronic and diachronic comparative syntax ranging from the core verbal domain to higher, propositional domains. This book is complemented by volume II available at https://langsci-press.org/catalog/book/276 and volume III available at https://langsci-press.org/catalog/book/277

    Bamberger Orientstudien

    This volume brings together fourteen contributions on diverse fields of research by members of the Institute for Near and Middle Eastern Studies of the Otto-Friedrich University, Bamberg. The institute includes seven departments: General Linguistics (Geoffrey Haig), Arabic Studies (Lale Behzadi), Iranian Studies (Birgitt Hoffmann), Islamic Art and Archaeology (Lorenz Korn), Islamic Studies (Patrick Franke), Jewish Studies (Susanne Talabardon) and Turkish Studies (Christoph Herzog). The contributions to the volume represent original research in the following fields: the linguistic geography of Eastern Anatolia (Geoffrey Haig), Turkish (Patrick Bartsch), Persian (Roxane Haag-Higuchi) and Arabic (Lale Behzadi) literature, aspects of religious studies (Patrick Franke, Johannes Rosenbaum, Susanne Talabardon), Near and Middle Eastern history (Birgitt Hoffmann, Nana Kharebava, Andreas Wilde, Barbara Henning, Christoph Herzog), and Islamic architecture (Lorenz Korn, Mustafa Tupev).

    Systematic Comparison Of Cross-Lingual Projection Techniques For Low-Density Nlp Under Strict Resource Constraints

    The field of low-density NLP is often approached from an engineering perspective, and evaluations are typically haphazard, considering different architectures, different languages, and different available resources, without a systematic comparison. The resulting architectures are then tested only on the corpus and language for which the approach was designed. This makes it difficult to evaluate which approach is truly the best, or which approaches are best for a given language. In this dissertation, several state-of-the-art architectures and approaches to part-of-speech tagging for low-density languages are reimplemented; all of these techniques exploit a relationship between a high-density (HD) language and a low-density (LD) language. As a novel contribution, a testbed is created using a representative sample of seven (HD - LD) language pairs, all drawn from the same massively parallel corpus, Europarl, and selected for their particular linguistic features. With this testbed in place, never-before-possible comparisons are conducted to evaluate which broad approach performs best for particular language pairs, and to investigate whether particular language features should suggest a particular NLP approach. A survey of the field suggested some unexplored approaches with the potential to yield better performance, be quicker to implement, and require less intensive linguistic resources. Under strict resource limitations, which are typical for low-density NLP environments, these characteristics are important. The approaches investigated in this dissertation are each a form of language-ifier, which modifies an LD corpus to be more like an HD corpus, or alternatively modifies an HD corpus to be more like an LD corpus, prior to supervised training.
    Four variations of language-ifier designs, each relying on relatively few linguistic resources, have been implemented and evaluated in this dissertation: lexical replacement, affix replacement, cognate replacement, and exemplar replacement. Based on linguistic properties of the languages drawn from the Europarl corpus, predictions were made about which prior and novel approaches would be most effective for languages with specific linguistic properties, and these predictions were tested through systematic evaluations with the testbed of languages. The results of this dissertation serve as guidance for future researchers who must select an appropriate cross-lingual projection approach (and a high-density language from which to project) for a given low-density language. Finally, all the languages drawn from the Europarl corpus are actually HD, but for the sake of the evaluation testbed in this dissertation, certain languages are treated as if they were LD (ignoring any available HD resources). To evaluate how the various approaches perform on an actual LD language, a case study was conducted in which part-of-speech taggers were implemented for Tajiki, harnessing linguistic resources from a related HD language, Farsi, using all of the prior and novel approaches investigated in this dissertation. Findings from this case study were documented so that future researchers can anticipate what their experience might be in implementing NLP tools for an LD language under the strict resource limitations considered in this dissertation.
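
The lexical-replacement idea described in the abstract above can be illustrated with a rough sketch. This is not the dissertation's implementation: the function names, the toy unigram tagger, the placeholder tokens (`ldbook`, `ldgood`), and the fallback tag are all assumptions made for illustration. The sketch rewrites LD tokens into HD tokens through a small bilingual dictionary, then tags them with a tagger trained only on HD data.

```python
from collections import Counter, defaultdict


def lexical_replace(tokens, ld_to_hd):
    """Replace each LD token with its HD dictionary equivalent, if one exists."""
    return [ld_to_hd.get(tok.lower(), tok) for tok in tokens]


def train_unigram_tagger(tagged_sentences):
    """Map each HD word to its most frequent tag in the HD training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_ld_sentence(tokens, ld_to_hd, hd_tagger, default_tag="NOUN"):
    """Tag LD tokens by looking up their HD-like forms in the HD tagger."""
    hd_like = lexical_replace(tokens, ld_to_hd)
    return [
        (tok, hd_tagger.get(hd.lower(), default_tag))
        for tok, hd in zip(tokens, hd_like)
    ]


# Toy data, invented for illustration only (not real Tajiki or Farsi).
hd_train = [[("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")]]
ld_to_hd = {"ldbook": "book", "ldgood": "good"}

hd_tagger = train_unigram_tagger(hd_train)
result = tag_ld_sentence(["ldbook", "ldgood", "xyz"], ld_to_hd, hd_tagger)
print(result)
```

A token missing from the dictionary passes through unchanged and receives the fallback tag. A practical system would presumably replace the unigram lookup with a properly trained supervised tagger.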

    Faculty Publications and Creative Works 2004

    Faculty Publications & Creative Works is an annual compendium of scholarly and creative activities of University of New Mexico faculty during the noted calendar year. Published by the Office of the Vice President for Research and Economic Development, it serves to illustrate the robust and active intellectual pursuits conducted by the faculty in support of teaching and research at UNM