
    Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus

    In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building a large corpus of English as a 2L, classified according to CEFR proficiency levels and designed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand, it will be used as a data resource for the development of automatic text classification systems; on the other, it has been used as a vehicle for teaching innovation techniques.
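    As a sketch of the kind of proficiency classifier such a corpus could be used to train, the toy example below derives two simple lexical features from an essay and assigns the nearest CEFR-level centroid. The features, centroid values and labels are purely illustrative assumptions, not those of the CLEC project.

```python
def extract_features(text):
    """Two simple lexical features often used in proficiency assessment:
    average word length and type-token ratio."""
    words = text.lower().split()
    if not words:
        return (0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(set(words)) / len(words)
    return (avg_word_len, type_token_ratio)

def nearest_centroid(features, centroids):
    """Assign the CEFR label whose feature centroid is closest (Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Illustrative centroids; in practice a labelled training corpus supplies them.
centroids = {"A2": (3.8, 0.55), "B2": (4.6, 0.70), "C1": (5.2, 0.80)}

essay = "The committee subsequently ratified an unprecedented resolution."
print(nearest_centroid(extract_features(essay), centroids))  # → C1
```

    A real system would replace the hand-picked centroids with a classifier fitted on CEFR-annotated texts, and the two features with a much richer feature set.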

    Using Natural Language Processing to Categorize Fictional Literature in an Unsupervised Manner

    When following a plot in a story, categorization is something that humans do without even thinking, whether simple classification like “This is science fiction” or more complex trope recognition like spotting a Chekhov’s gun or a rags-to-riches storyline: humans group stories with other similar stories. On the literary side, research has been done to categorize basic plots and acknowledge common story tropes; however, there is no formula or established method for determining these plots in a storyline automatically. This paper explores multiple natural language processing techniques in an attempt to automatically compare and cluster fictional stories into categories in an unsupervised manner. The aim is to mimic how a human might look deeper into a plot and find similar concepts, such as particular words being used, the types of words used (an adventure book may have more verbs, for example), as well as the sentiment of the sentences, in order to group books into similar clusters.
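    The grouping idea described above can be sketched as a toy pipeline: represent each story by crude features (verb density as a stand-in for part-of-speech information, plus lexicon-based sentiment) and cluster the stories with a minimal k-means. The word lists and stories below are invented for illustration and are not from the paper.

```python
# Tiny illustrative stand-ins for a real POS tagger and sentiment lexicon.
VERBS = {"ran", "fought", "escaped", "chased", "sailed"}
POSITIVE = {"joy", "love", "happy", "triumph"}
NEGATIVE = {"grim", "loss", "fear", "dark"}

def story_features(text):
    """Map a story to (verb density, net sentiment per token)."""
    words = text.lower().split()
    n = len(words) or 1
    verb_ratio = sum(w in VERBS for w in words) / n
    sentiment = (sum(w in POSITIVE for w in words)
                 - sum(w in NEGATIVE for w in words)) / n
    return (verb_ratio, sentiment)

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=2, iters=10):
    """Minimal k-means; returns a cluster index for each point."""
    centroids = list(points[:k])  # naive initialisation: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: squared_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels

stories = [
    "he ran and fought and escaped with joy",
    "she sailed and chased with triumph and joy",
    "the grim dark loss filled the town",
    "fear and loss in the dark grim night",
]
labels = kmeans([story_features(s) for s in stories])
# The two action-heavy, upbeat stories end up in one cluster,
# the two gloomy ones in the other.
```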

    Language and Linguistics in a Complex World: Data, Interdisciplinarity, Transfer, and the Next Generation. ICAME41 Extended Book of Abstracts

    This is a collection of papers, work-in-progress reports, and other contributions that were part of the ICAME41 digital conference

    Close and Distant Reading Visualizations for the Comparative Analysis of Digital Humanities Data

    Traditionally, humanities scholars carrying out research on one or more literary works are interested in the analysis of related texts or text passages. But the digital age has opened up possibilities for scholars to enhance their traditional workflows. Enabled by digitization projects, humanities scholars can nowadays reach a large number of digitized texts through web portals such as Google Books or the Internet Archive. Digital editions also exist for ancient texts; notable examples are PHI Latin Texts and the Perseus Digital Library. This shift from reading a single book “on paper” to browsing many digital texts is one of the origins and principal pillars of the digital humanities, a domain that develops solutions for handling vast amounts of cultural heritage data, with text as the main data type. In contrast to traditional methods, the digital humanities allow scholars to pose new research questions on cultural heritage datasets. Some of these questions can be answered with existing algorithms and tools provided by computer science, but for other humanities questions scholars need to formulate new methods in collaboration with computer scientists. Having emerged in the late 1980s, the digital humanities primarily focused on designing standards to represent cultural heritage data, such as the Text Encoding Initiative (TEI) for texts, and on aggregating, digitizing and delivering data. In recent years, visualization techniques have gained more and more importance for analyzing such data. For example, Saito introduced her 2010 digital humanities conference paper with: “In recent years, people have tended to be overwhelmed by a vast amount of information in various contexts. Therefore, arguments about ’Information Visualization’ as a method to make information easy to comprehend are more than understandable.” A major impulse for this trend was given by Franco Moretti.
In 2005, he published the book “Graphs, Maps, Trees”, in which he proposes so-called distant reading approaches for textual data that steer the traditional way of approaching literature in a completely new direction. Instead of reading texts in the traditional way, so-called close reading, he invites us to count, to graph and to map them; in other words, to visualize them. This dissertation presents novel close and distant reading visualization techniques for hitherto unsolved problems. Appropriate visualization techniques have been applied to support basic tasks, e.g., visualizing geospatial metadata to analyze the geographical distribution of cultural heritage items or using tag clouds to illustrate textual statistics of a historical corpus. In contrast, this dissertation focuses on developing information visualization and visual analytics methods that support research questions requiring the comparative analysis of various digital humanities datasets. We first survey the state of the art in close and distant reading visualizations developed to support humanities scholars working with literary texts, providing a taxonomy of visualization methods applied to show various aspects of the underlying digital humanities data. We point out open challenges, and we present our visualizations designed to support humanities scholars in comparatively analyzing historical datasets. In short, we present (1) GeoTemCo for the comparative visualization of geospatial-temporal data, (2) the two tag cloud designs TagPies and TagSpheres, which comparatively visualize faceted textual summaries, (3) TextReuseGrid and TextReuseBrowser for exploring re-used text passages among the texts of a corpus, (4) TRAViz for the visualization of textual variation between multiple text editions, and (5) the visual analytics system MusikerProfiling for detecting musicians similar to a given musician of interest.
Finally, we summarize our own collaboration experiences and those of other visualization researchers to emphasize the ingredients required for a successful project in the digital humanities, and we take a look at future challenges in this research field.

    Does the medium matter? Digital vs. paper reading for leisure and foreign language learning

    In the last two decades, a growing body of research has examined the impact of new technology on our everyday life, our learning processes and our reading activities, with digitization becoming an umbrella term for this change. Numerous studies on reading have set out to prove empirically the disadvantages or negative effects of digital devices compared to their analogue counterparts. This dissertation is a compilation of publications that seek to contribute to a more fact-based debate on future directions in research on digital reading and learning. The experiments reported in this work study the phenomenon of literary reading from different angles, not focusing only on the analogue-digital divide but looking at reading as a complex phenomenon embedded in a complex society. Their goal is to give evidence-based advice on how to read and learn in today’s society. The present work is divided into two sections. In the first part, two studies on literary (e-)reading for recreational purposes are presented. The first experiment investigates whether readers’ attributions of literary value are affected by the reading medium (paper vs. digital), in order to explore whether the paper book still carries social prestige in the digital society. The second experiment investigates the factor of age in relation to digital vs. paper reading: starting from the metaphor “digital natives/digital immigrants” coined in 2001 by Mark Prensky, a study was conducted to test its validity, with particular attention to the reading habits and inclinations of young and elderly people with respect to literary reading on paper vs. on screen.
The second part of this thesis moves to the educational context and explores literary reading in a foreign language (here, English) and the use of paper vs. digital dictionaries to learn new words through reading literature, focusing on students’ dictionary-using habits and on vocabulary acquisition while reading long literary texts in a foreign language. The experiments reported in this work, both on reading for pleasure and on reading in didactic contexts, give evidence that familiarity with the medium and established reading habits were more decisive for successful reading and learning than the print vs. digital format so intensely discussed in public. The results of the studies collected here thus contribute to a more evidence-based version of an often overheated debate and offer practical guidance for supporting reading and learning in the digital society.

    Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics

    A growing number of studies report interesting insights gained from existing data resources. Among them are analyses of textual data, giving reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions. This thesis aims to shed some light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility of repurposing existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods. The thesis first introduces strategies and methods that have already been used on language data, discusses how they can assist corpus linguistic analysis, and points to available toolkits and software as well as to state-of-the-art research and further references. The introduced methodological toolset is then applied in two differently shaped corpus studies that utilize readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings on student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous work in the field.
While both studies yield linguistic insights that integrate into the current understanding of the investigated phenomena in the German language, they also systematically test the methodological toolset introduced beforehand, allowing a detailed discussion, at the end of the thesis, of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics.
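    The step of interpreting a prediction model to answer a linguistic question can be illustrated schematically (this is not the thesis’s actual pipeline): the toy example below trains a simple perceptron to separate two invented writer groups and then reads off which feature carries the signal. The features, data and labels are assumptions made for the example.

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of (feature_vector, label) pairs with label in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: nudge the decision boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented per-text features: (emoticons per 100 tokens, mean word length).
# Label +1 = "younger writer", -1 = "older writer" (toy data).
data = [
    ((8.0, 3.9), 1), ((6.0, 4.1), 1), ((7.0, 4.0), 1),
    ((0.0, 5.2), -1), ((1.0, 5.0), -1), ((0.0, 5.4), -1),
]
w, b = train_perceptron(data)
# Inspecting the learned weights is the "interpretation" step: a positive
# weight on the emoticon feature links that feature to the "younger" class.
```

    Real studies of this kind would use regularized models, proper feature scaling and dedicated interpretation techniques rather than raw perceptron weights.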

    Analyzing Text Complexity and Text Simplification: Connecting Linguistics, Processing and Educational Applications

    Reading plays an important role in the process of learning and knowledge acquisition for both children and adults. However, not all texts are accessible to every prospective reader. Reading difficulties can arise when there is a mismatch between a reader’s language proficiency and the linguistic complexity of the text they read. In such cases, simplifying the text in its linguistic form while retaining all the content could aid reader comprehension. In this thesis, we study text complexity and simplification from a computational linguistic perspective. We propose a new approach to automatically predict text complexity using a wide range of word-level and syntactic features of the text. We show that this approach results in accurate, generalizable models of text readability that work across multiple corpora, genres and reading scales. Moving from documents to sentences, we show that our text complexity features also accurately distinguish different versions of the same sentence in terms of the degree of simplification performed. This is useful for evaluating the quality of simplification performed by a human expert or by a machine, and for choosing targets to simplify in a difficult text. Through an eye-tracking experiment, we also show the effect of text complexity on readers’ performance outcomes and cognitive processing. Turning from analyzing text complexity and identifying sentential simplifications to generating simplified text, one can view automatic text simplification as a process of translation from English to simple English. In this thesis, we propose a statistical machine translation based approach to text simplification, exploring the role of focused training data and language models in the process. Extending the linguistic complexity analysis further, we show that our text complexity features can also be useful in assessing the language proficiency of English learners.
Finally, we analyze German school textbooks in terms of their linguistic complexity, across grade levels, school types and publishers, by applying a pre-existing set of text complexity features developed for German.
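    As a sketch, the snippet below computes two of the simplest word-level readability features of the kind such models combine (average sentence length and average word length). The actual models described above use far richer lexical and syntactic features together with trained classifiers; this is only an illustration.

```python
import re

def complexity_features(text):
    """Two crude word-level readability features for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
    }

simple = "The cat sat. It was warm."
hard = ("Notwithstanding unprecedented macroeconomic volatility, "
        "the committee promulgated comprehensive regulatory frameworks.")
# The harder text scores higher on both crude features.
```

    In practice these raw features would feed a regression or ranking model trained on graded reference texts rather than being compared directly.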