Detecting and Analyzing Text Reuse with BLAST
In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biomedical sequences, such as DNA and protein sequences.
I explain the original BLAST algorithm in depth by going through it step-by-step. I also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST text reuse detection method by comparing it against the previous state-of-the-art and show that the proposed method beats it by a large margin.
I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been turned into text using OCR (Optical Character Recognition) software. I sort the results into three categories: everyday text reuse, long-term reuse and viral news. I describe each category with examples and propose a novel method of calculating a virality score for the clusters.
09051 Abstracts Collection -- Knowledge representation for intelligent music processing
From the twenty-fifth to the thirtieth of January, 2009, the
Dagstuhl Seminar 09051 on "Knowledge representation for intelligent music
processing" was held in Schloss Dagstuhl – Leibniz Centre for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts
of the presentations and demos given during the seminar as well as
plenary presentations, reports of workshop discussions, results and
ideas are put together in this paper. The first section describes the
seminar topics and goals in general, followed by plenary 'stimulus'
papers, followed by reports and abstracts arranged by workshop,
followed finally by some concluding materials providing views of both
the seminar itself and also forward to the longer-term goals of the
discipline. Links to extended abstracts, full papers and supporting
materials are provided, if available.
The organisers thank David Lewis for editing these proceedings.
Optimizing a Data Science System for Text Reuse Analysis
Text reuse is a methodological element of fundamental importance in
humanities research: pieces of text that re-appear across different documents,
verbatim or paraphrased, provide invaluable information about the historical
spread and evolution of ideas. Large modern digitized corpora enable the joint
analysis of text collections that span entire centuries and the detection of
large-scale patterns, impossible to detect with traditional small-scale
analysis. For this opportunity to materialize, it is necessary to develop
efficient data science systems that perform the corresponding analysis tasks.
In this paper, we share insights from ReceptionReader, a system for analyzing
text reuse in large historical corpora. The system is built upon billions of
instances of text reuses from large digitized corpora of 18th-century texts.
Its main functionality is to perform downstream text reuse analysis tasks, such
as finding reuses that stem from a given article or identifying the most reused
quotes from a set of documents, with each task expressed as a database query.
For the purposes of the paper, we discuss the related design choices including
various database normalization levels and query execution frameworks, such as
distributed data processing (Apache Spark), indexed row store engine (MariaDB
Aria), and compressed column store engine (MariaDB Columnstore). Moreover, we
present an extensive evaluation with various metrics of interest (latency,
storage size, and computing costs) for varying workloads, and we offer insights
from the trade-offs we observed and the choices that emerged as optimal in our
setting. In summary, our results show that (1) for the workloads that are most
relevant to text-reuse analysis, the MariaDB Aria framework emerges as the
overall optimal choice, and (2) big data processing (Apache Spark) is irreplaceable
for all processing stages of the system's pipeline. Comment: Early Draft
Predicting the Law Area and Decisions of French Supreme Court Cases
In this paper, we investigate the application of text classification methods
to predict the law area and the decision of cases judged by the French Supreme
Court. We also investigate the influence of the time period in which a ruling
was made over the textual form of the case description and the extent to which
it is necessary to mask the judge's motivation for a ruling to emulate a
real-world test scenario. We report results of 96% F1 score in predicting a
case ruling, 90% F1 score in predicting the law area of a case, and 75.9% F1
score in estimating the time span in which a ruling was issued, using a linear
Support Vector Machine (SVM) classifier trained on lexical features. Comment: RANLP 201
Plotting Poetry 3. Conference report
Plotting Poetry 3. Conference report
Unlocking environmental narratives: towards understanding human environment interactions through computational text analysis
Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities
New perspectives on cohesion and coherence: Implications for translation
The contributions to this volume investigate relations of cohesion and coherence as well as instantiations of discourse phenomena and their interaction with information structure in multilingual contexts. Some contributions concentrate on procedures to analyze cohesion and coherence from a corpus-linguistic perspective. Others have a particular focus on textual cohesion in parallel corpora that include both originals and translated texts. Additionally, the papers in the volume discuss the nature of cohesion and coherence with implications for human and machine translation. The contributors are experts on discourse phenomena and textuality who address these issues from an empirical perspective. The chapters in this volume are grounded in the latest research, making this book useful both to experts in discourse studies and computational linguistics and to advanced students with an interest in these disciplines. We hope that this volume will serve as a catalyst to other researchers and will facilitate further advances in the development of cost-effective annotation procedures, the application of statistical techniques for the analysis of linguistic phenomena and the elaboration of new methods for data interpretation in multilingual corpus linguistics and machine translation.
Unlocking Environmental Narratives
Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities