5,467 research outputs found

    Detecting and Analyzing Text Reuse with BLAST

    Get PDF
    In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biomedical sequences, such as DNA and protein sequences. I explain the original BLAST algorithm in depth by going through it step-by-step. I also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST text reuse detection method by comparing it against the previous state-of-the-art and show that the proposed method beats it by a large margin. I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been turned into text using OCR (Optical Character Recognition) software. I categorize the results from the method into three categories: every day text reuse, long term reuse and viral news. I describe them and provide examples of them as well as propose a new, novel method of calculating a virality score for the clusters

    09051 Abstracts Collection -- Knowledge representation for intelligent music processing

    Get PDF
    From the twenty-fifth to the thirtieth of January, 2009, the Dagstuhl Seminar 09051 on ``Knowledge representation for intelligent music processing\u27\u27 was held in Schloss Dagstuhl~--~Leibniz Centre for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations and demos given during the seminar as well as plenary presentations, reports of workshop discussions, results and ideas are put together in this paper. The first section describes the seminar topics and goals in general, followed by plenary `stimulus\u27 papers, followed by reports and abstracts arranged by workshop followed finally by some concluding materials providing views of both the seminar itself and also forward to the longer-term goals of the discipline. Links to extended abstracts, full papers and supporting materials are provided, if available. The organisers thank David Lewis for editing these proceedings

    Optimizing a Data Science System for Text Reuse Analysis

    Full text link
    Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that re-appear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns, impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks. In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. The system is built upon billions of instances of text reuses from large digitized corpora of 18th-century texts. Its main functionality is to perform downstream text reuse analysis tasks, such as finding reuses that stem from a given article or identifying the most reused quotes from a set of documents, with each task expressed as a database query. For the purposes of the paper, we discuss the related design choices including various database normalization levels and query execution frameworks, such as distributed data processing (Apache Spark), indexed row store engine (MariaDB Aria), and compressed column store engine (MariaDB Columnstore). Moreover, we present an extensive evaluation with various metrics of interest (latency, storage size, and computing costs) for varying workloads, and we offer insights from the trade-offs we observed and the choices that emerged as optimal in our setting. In summary, our results show that (1) for the workloads that are most relevant to text-reuse analysis, the MariaDB Aria framework emerges as the overall optimal choice, (2) big data processing (Apache Spark) is irreplaceable for all processing stages of the system's pipeline.Comment: Early Draf

    Predicting the Law Area and Decisions of French Supreme Court Cases

    Get PDF
    In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made over the textual form of the case description and the extent to which it is necessary to mask the judge's motivation for a ruling to emulate a real-world test scenario. We report results of 96% f1 score in predicting a case ruling, 90% f1 score in predicting the law area of a case, and 75.9% f1 score in estimating the time span when a ruling has been issued using a linear Support Vector Machine (SVM) classifier trained on lexical features.Comment: RANLP 201

    Plotting Poetry 3. Conference report

    Get PDF
    Plotting Poetry 3. Conference repor

    Unlocking environmental narratives: towards understanding human environment interactions through computational text analysis

    Full text link
    Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities

    New perspectives on cohesion and coherence: Implications for translation

    Get PDF
    The contributions to this volume investigate relations of cohesion and coherence as well as instantiations of discourse phenomena and their interaction with information structure in multilingual contexts. Some contributions concentrate on procedures to analyze cohesion and coherence from a corpus-linguistic perspective. Others have a particular focus on textual cohesion in parallel corpora that include both originals and translated texts. Additionally, the papers in the volume discuss the nature of cohesion and coherence with implications for human and machine translation.The contributors are experts on discourse phenomena and textuality who address these issues from an empirical perspective. The chapters in this volume are grounded in the latest research making this book useful to both experts of discourse studies and computational linguistics, as well as advanced students with an interest in these disciplines. We hope that this volume will serve as a catalyst to other researchers and will facilitate further advances in the development of cost-effective annotation procedures, the application of statistical techniques for the analysis of linguistic phenomena and the elaboration of new methods for data interpretation in multilingual corpus linguistics and machine translation

    New perspectives on cohesion and coherence: Implications for translation

    Get PDF
    The contributions to this volume investigate relations of cohesion and coherence as well as instantiations of discourse phenomena and their interaction with information structure in multilingual contexts. Some contributions concentrate on procedures to analyze cohesion and coherence from a corpus-linguistic perspective. Others have a particular focus on textual cohesion in parallel corpora that include both originals and translated texts. Additionally, the papers in the volume discuss the nature of cohesion and coherence with implications for human and machine translation.The contributors are experts on discourse phenomena and textuality who address these issues from an empirical perspective. The chapters in this volume are grounded in the latest research making this book useful to both experts of discourse studies and computational linguistics, as well as advanced students with an interest in these disciplines. We hope that this volume will serve as a catalyst to other researchers and will facilitate further advances in the development of cost-effective annotation procedures, the application of statistical techniques for the analysis of linguistic phenomena and the elaboration of new methods for data interpretation in multilingual corpus linguistics and machine translation

    Unlocking Environmental Narratives

    Get PDF
    Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities
    • …
    corecore