137 research outputs found

    The Taming of the Shrew - non-standard text processing in the Digital Humanities

    Get PDF
    Natural language processing (NLP) has focused on the automatic processing of newspaper texts for many years. With the growing importance of text analysis in various areas such as spoken language understanding, social media processing and the interpretation of text material from the humanities, techniques and methodologies have to be reviewed and redefined since so called non-standard texts pose challenges on the lexical and syntactic level especially for machine-learning-based approaches. Automatic processing tools developed on the basis of newspaper texts show a decreased performance for texts with divergent characteristics. Digital Humanities (DH) as a field that has risen to prominence in the last decades, holds a variety of examples for this kind of texts. Thus, the computational analysis of the relationships of Shakespeare’s dramatic characters requires the adjustment of processing tools to English texts from the 16th-century in dramatic form. Likewise, the investigation of narrative perspective in Goethe’s ballads calls for methods that can handle German verse from the 18th century. In this dissertation, we put forward a methodology for NLP in a DH environment. We investigate how an interdisciplinary context in combination with specific goals within projects influences the general NLP approach. We suggest thoughtful collaboration and increased attention to the easy applicability of resulting tools as a solution for differences in the store of knowledge between project partners. Projects in DH are not only constituted by the automatic processing of texts but are usually framed by the investigation of a research question from the humanities. As a consequence, time limitations complicate the successful implementation of analysis techniques especially since the diversity of texts impairs the transferability and reusability of tools beyond a specific project. We answer to this with modular and thus easily adjustable project workflows and system architectures. Several instances serve as examples for our methodology on different levels. We discuss modular architectures that balance time-saving solutions and problem-specific implementations on the example of automatic postcorrection of the output text from an optical character recognition system. We address the problem of data diversity and low resource situations by investigating different approaches towards non-standard text processing. We examine two main techniques: text normalization and tool adjustment. Text normalization aims at the transformation of non-standard text in order to assimilate it to the standard whereas tool adjustment concentrates on the contrary direction of enabling tools to successfully handle a specific kind of text. We focus on the task of part-of-speech tagging to illustrate various approaches toward the processing of historical texts as an instance for non-standard texts. We discuss how the level of deviation from a standard form influences the performance of different methods. Our approaches shed light on the importance of data quality and quantity and emphasize the indispensability of annotations for effective machine learning. In addition, we highlight the advantages of problem-driven approaches where the purpose of a tool is clearly formulated through the research question. Another significant finding to emerge from this work is a summary of the experiences and increased knowledge through collaborative projects between computer scientists and humanists. We reflect on various aspects of the elaboration and formalization of research questions in the DH and assess the limitations and possibilities of the computational modeling of humanistic research questions. An emphasis is placed on the interplay of expert knowledge with respect to a subject of investigation and the implementation of tools for that purpose and the thereof resulting advantages such as the targeted improvement of digital methods through purposeful manual correction and error analysis. We show obstacles and chances and give prospects and directions for future development in this realm of interdisciplinary research

    Ensemble Morphosyntactic Analyser for Classical Arabic

    Get PDF
    Classical Arabic (CA) is an influential language for Muslim lives around the world. It is the language of two sources of Islamic laws: the Quran and the Sunnah, the collection of traditions and sayings attributed to the prophet Mohammed. However, classical Arabic in general, and the Sunnah, in particular, is underexplored and under-resourced in the field of computational linguistics. This study examines the possible directions for adapting existing tools, specifically morphological analysers, designed for modern standard Arabic (MSA) to classical Arabic. Morphological analysers of CA are limited, as well as the data for evaluating them. In this study, we adapt existing analysers and create a validation data-set from the Sunnah books. Inspired by the advances in deep learning and the promising results of ensemble methods, we developed a systematic method for transferring morphological analysis that is capable of handling different labelling systems and various sequence lengths. In this study, we handpicked the best four open access MSA morphological analysers. Data generated from these analysers are evaluated before and after adaptation through the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows: first, it is feasible to analyse under-resourced languages using existing comparable language resources given a small sufficient set of annotated text. Second, analysers typically generate different errors and this could be exploited. Third, an explicit alignment of sequences and the mapping of labels is not necessary to achieve comparable accuracies given a sufficient size of training dataset. Adapting existing tools is easier than creating tools from scratch. The resulting quality is dependent on training data size and number and quality of input taggers. Pipeline architecture performs less well than the End-to-End neural network architecture due to error propagation and limitation on the output format. A valuable tool and data for annotating classical Arabic is made freely available

    Automatic text filtering using limited supervision learning for epidemic intelligence

    Get PDF
    [no abstract

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Get PDF
    Peer reviewe

    Proceedings of the Eighth Italian Conference on Computational Linguistics CliC-it 2021

    Get PDF
    The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the edition of 2020, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 represented the first moment for the Italian research community of Computational Linguistics to meet in person after more than one year of full/partial lockdown
    • …
    corecore