44 research outputs found

    Statistical Parsing by Machine Learning from a Classical Arabic Treebank

    Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art in hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation because a large body of existing work provides detailed syntactic analysis of it. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.

    GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora

    Ground-truth datasets are essential for the training and evaluation of any automated algorithm. As such, gold-standard annotated corpora underlie most advances in natural language processing (NLP). However, only a few relatively small (geo-)annotated datasets are available for geoparsing, i.e., the automatic recognition and geolocation of place references in unstructured text. The creation of geoparsing corpora that include both the recognition of place names in text and the matching of those names to toponyms in a geographic gazetteer (a process we call geo-annotation) is a laborious, time-consuming and expensive task. The field lacks efficient geo-annotation tools to support corpus building and lacks design guidelines for the development of such tools. Here, we present the iterative design of GeoAnnotator, a web-based, semi-automatic and collaborative visual analytics platform for geo-annotation. GeoAnnotator facilitates collaborative, multi-annotator creation of large corpora of geo-annotated text by producing computationally generated pre-annotations that human annotators can then improve. The resulting corpora can be used in improving and benchmarking geoparsing algorithms as well as various other spatial language-related methods. Further, the iterative design process and the resulting design decisions can inform annotation platforms tailored to other application domains of NLP.

    Collaborative Research Practices and Shared Infrastructures for Humanities Computing

    The volume collects the proceedings of the 2nd Annual Conference of the Italian Association for Digital Humanities (Aiucd 2013), which took place at the Department of Information Engineering of the University of Padua, 11-12 December 2013. The general theme of Aiucd 2013 was “Collaborative Research Practices and Shared Infrastructures for Humanities Computing”, so we particularly welcomed submissions on interdisciplinary work and new developments in the field, encouraging proposals relating to the theme of the conference, or more specifically: interdisciplinarity and multidisciplinarity, legal and economic issues, tools and collaborative methodologies, measurement and impact of collaborative methodologies, sharing and collaboration methods and approaches, cultural institutions and collaborative facilities, infrastructures and digital libraries as collaborative environments, and the sharing of data resources and technologies.

    Arabic dialect identification in the context of bivalency and code-switching

    In this paper we present a novel approach to Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans because words are used interchangeably between dialects. The task of automatically identifying dialect is harder still, and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, so we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that the classification accuracy of our new methods can reach more than 76%, and that they score well (66%) when tested on completely unseen data.

    Vielfalt und Integration - diversitá ed integrazione - diversité et intégration: Sprache(n) in sozialen und digitalen Räumen: Eine Festschrift für Elisabeth Burr

    This Festschrift for Elisabeth Burr focuses on diversity and integration in linguistics and the digital humanities. The contributions address central questions in Burr's work: How can language and its variation, as shaped by social and geographical factors, be described adequately? How can computational and digital approaches be used to do so? These questions are linked to topics that are both current and important to her, drawn from sociolinguistics, gender linguistics and corpus linguistics, dialectology and linguistic geography, and the digital humanities. The contributors include Stefania Spina, Thomas Krefeld, Annette Gerstenberg, Lazslo Hinyadi, Carol Chiodo and Lauren Tilton, Manuel Burghardt, Øyvind Eide, Jürgen Hermes, Andreas Witt, Ray Siemens, Arianna Ciula, Alejandro Bía and Rob Evans.

    AIUCD2018 - Book of Abstracts

    This volume collects the abstracts of the papers presented at the Seventh Annual AIUCD Conference, AIUCD 2018 (Bari, 31 January – 2 February 2018), entitled "Patrimoni culturali nell’era digitale. Memorie, culture umanistiche e tecnologia" (Cultural Heritage in the Digital Age. Memory, Humanities and Technologies). The abstracts published in this volume were approved by expert reviewers through an anonymous, double-blind peer review process under the responsibility of the AIUCD Scientific Committee. The programme of the AIUCD 2018 conference is available online at http://www.aiucd2018.uniba.it/