133 research outputs found
Automatic Detection of Reuses and Citations in Literary Texts
For more than forty years now, modern theories of literature (Compagnon, 1979) have insisted on the role of paraphrases, rewritings, citations, reciprocal borrowings, and mutual contributions of all kinds. The notions of intertextuality, transtextuality, and hypertextuality/hypotextuality were introduced in the seventies and eighties to approach these phenomena. The careful analysis of these references is of particular interest in evaluating the distance that a creator voluntarily introduces with respect to his or her masters. Phoebus is a collaborative project in which computer scientists from the University Pierre and Marie Curie (LIP6-UPMC) work with the literary teams of Paris-Sorbonne University, with the aim of developing efficient tools for literary studies that take advantage of modern computer science techniques. In this context, we have developed a piece of software that automatically detects and explores networks of textual reuse in classical literature. This paper describes the principles on which this program is based, the significant results that have already been obtained, and the perspectives for the near future.
Ancient Greek Historians in the Digital Age
A contribution to Digital History 2023: Digital Methods in Historical Scholarly Practice: Disciplinary Transformations and their Epistemological Consequences, Berlin, 23-26 May 2023.
Abstract: This paper presents results of ongoing digital projects on ancient Greek historians. The research question is the analysis of the language used by ancient sources to refer to historians and cite their works, with particular reference to lost historians (the so-called fragmentary authors). While a great deal of scholarship has been devoted to collecting fragments of many different genres and trying to reconstruct the texts from which they were taken, less effort has been spent on collecting data pertaining to the language used by ancient authors to refer to these historians and their works. The paper discusses the use of computational linguistics techniques and Named Entity Recognition to extract and annotate information about ancient Greek historians and their works from the sources in which they are preserved. Moreover, the paper describes a new catalog of ancient Greek authors and works based on the extraction and annotation of references to them in ancient sources.
Predicting the Law Area and Decisions of French Supreme Court Cases
In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made on the textual form of the case description, and the extent to which it is necessary to mask the judge's motivation for a ruling in order to emulate a real-world test scenario. We report an F1 score of 96% in predicting a case ruling, 90% in predicting the law area of a case, and 75.9% in estimating the time span in which a ruling was issued, using a linear Support Vector Machine (SVM) classifier trained on lexical features. Comment: RANLP 201
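The setup described in this abstract, a linear SVM over lexical features, can be sketched roughly as follows. This is a minimal illustration using scikit-learn with toy stand-in data, not the paper's actual corpus, features, or configuration.

```python
# Minimal sketch: linear SVM on TF-IDF lexical features for text
# classification. Texts, labels, and pipeline settings are illustrative
# stand-ins, not the paper's actual data or hyperparameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for case descriptions labeled with a law area.
texts = [
    "the tenant disputes the termination of the lease",
    "the employee contests the dismissal by the employer",
    "the landlord claims unpaid rent from the tenant",
    "the worker seeks compensation for unfair dismissal",
]
labels = ["housing", "labour", "housing", "labour"]

# TF-IDF turns each text into a sparse lexical feature vector;
# LinearSVC fits a linear separating hyperplane over those features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["the tenant refuses to pay the rent"])[0])
```

The same pipeline shape applies to any of the three tasks (ruling, law area, time span); only the label set changes.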
Towards a data model for (inter)textual relationships. Connecting Ancient Egyptian texts and understanding scribal practices
The goal of this lecture is theory-oriented: we propose a conceptual data model that allows us to deal with complex textual relationships. It is empirically grounded in our experience of the digital annotation of Ancient Egyptian texts. This paper initially grew out of the practical need to annotate and link together hundreds of textual witnesses in the framework of the Ramses project (Polis et al. 2013; Polis & Winand 2013), the aim of which is to build and publish online (http://ramses.ulg.ac.be) the first richly annotated corpus of Late Egyptian texts (c. 1350-900 BCE).
Big Data in the Digital Humanities. New Conversations in the Global Academic Context
After analysing the meaning of the expression “Big Data”, this article highlights the cultural nature of data and defends the validity of theories, models and hypotheses for carrying out scientific research. Lastly, it discusses the dialectic between privacy and control. In a sense, this issue escapes the traditional field of the humanities, but it also deserves our attention as twenty-first-century citizens interested in the cultural practices of the present. Humanists no doubt have much to contribute to ethical and epistemological debates on the use of the data generated by citizens, recalling the “captured” and cultural nature of data, and bringing their experience to analysing particular cases bearing in mind the general context.
Plagiarism detection for Indonesian texts
As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most existing work deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity globally. As a result, the systems produced by this research are incapable of pointing to the exact locations of ``similar passage'' pairs. Moreover, no public, standard corpus has been available for evaluating PDS on Indonesian texts.
To address the weaknesses of previous research, this thesis develops a plagiarism detection system that executes the various stages of plagiarism detection in a workflow system. In the retrieval stage, a novel document feature coined the ``phraseword'' is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, aims to address the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms, in order to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created, partly through simulation by human writers and partly by algorithmic random generation.
Using this corpus, the performance of the proposed methods was evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimal recall rate of 1.0. In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at four levels of measurement: character, passage, document, and case. The experimental results showed that methods using tokens as seeds scored higher than Alvi's algorithm at all four levels, for both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaked, and paraphrased passages, although Alvi's recognition rate on summarized passages is only marginally higher than that of our system. The same tendency was observed in the third experimental scenario, except that the precision of Alvi's algorithm at the character and paragraph levels is higher than that of our system. The higher Plagdet scores produced by some methods in our system compared to Alvi's show that this study has fulfilled its objective of implementing a competitive, state-of-the-art algorithm for detecting plagiarism in Indonesian texts.
When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when tested on the PAN'14 corpus. Thus, this study contributes a standard evaluation corpus for assessing PDS on Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies, one of which applies the local word weighting used in the text summarization field to select seeds both for discriminating paragraph-pair candidates and for the matching process. The proposed detection algorithm results in almost no multiple detections, which contributes to its strength.
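The source retrieval idea described above, using character n-grams as one of the document features for finding candidate sources, can be illustrated with a small sketch. The function names, the similarity measure (Jaccard), and the threshold below are hypothetical illustrations of the general technique, not the thesis's actual implementation (which also uses phrasewords and word unigrams).

```python
# Illustrative sketch: rank candidate source documents by character
# 5-gram overlap with a suspicious document. Names, threshold, and the
# Jaccard measure are hypothetical, not the thesis's actual algorithm.

def char_ngrams(text, n=5):
    """Return the set of character n-grams of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_sources(suspicious, candidates, n=5, threshold=0.1):
    """Rank candidate sources by n-gram similarity to the suspicious text."""
    query = char_ngrams(suspicious, n)
    scored = [(doc_id, jaccard(query, char_ngrams(doc, n)))
              for doc_id, doc in candidates.items()]
    return sorted([(d, s) for d, s in scored if s >= threshold],
                  key=lambda pair: pair[1], reverse=True)

candidates = {
    "src1": "plagiarism detection compares documents at passage level",
    "src2": "weather forecasting relies on numerical models",
}
suspicious = "detection of plagiarism compares documents passage by passage"
print(retrieve_sources(suspicious, candidates))
```

Character n-grams are robust to small edits and word reordering, which is why they help retrieve sources that were partially copied or lightly obfuscated; passage-level alignment then locates the matching spans within the retrieved documents.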
A Sheep in Wolff's Clothing: Émilie du Châtelet and the Encyclopédie
This article explores the use of Émilie Du Châtelet's Institutions de physique as both an acknowledged and an unacknowledged source for the Encyclopédie of Diderot and d'Alembert, and argues for Du Châtelet's inclusion as a full participant in the philosophical conversations the Encyclopédie enacts. Although she has been widely considered a minor voice who entered the Encyclopédie solely through the mediation of Samuel Formey, a largely forgotten and conflicted encyclopédiste, new evidence generated using techniques developed in the digital humanities suggests that Du Châtelet was a much more central figure in the Encyclopédie's engagement with the metaphysics of Leibniz and Wolff than previously thought.