    Automatic Detection of Reuses and Citations in Literary Texts

    For more than forty years, modern theories of literature (Compagnon, 1979) have insisted on the role of paraphrases, rewritings, citations, reciprocal borrowings and mutual contributions of all kinds. The notions of intertextuality, transtextuality and hypertextuality/hypotextuality were introduced in the seventies and eighties to approach these phenomena. The careful analysis of these references is of particular interest for evaluating the distance that a creator deliberately introduces from his or her masters. Phoebus is a collaborative project that brings together computer scientists from the University Pierre and Marie Curie (LIP6-UPMC) and literary teams from Paris-Sorbonne University, with the aim of developing efficient tools for literary studies that take advantage of modern computer science techniques. In this context, we have developed a piece of software that automatically detects and explores networks of textual reuses in classical literature. This paper describes the principles on which this program is based, the significant results that have already been obtained, and the perspectives for the near future.
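
    The abstract does not spell out the matching method, but a common baseline for this kind of text-reuse detection is word n-gram shingling with a set-similarity score. The sketch below illustrates that general idea in Python; the function names, the shingle size and the similarity threshold are illustrative assumptions, not the Phoebus implementation.

    # Minimal sketch of n-gram ("shingle") based text-reuse detection.
    # This illustrates the general technique only; it is not the Phoebus
    # algorithm, and the threshold value is an arbitrary assumption.

    def word_ngrams(text: str, n: int = 5) -> set:
        """Return the set of word n-grams (shingles) of a text."""
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a: set, b: set) -> float:
        """Jaccard similarity between two shingle sets."""
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def is_reuse_candidate(passage_a: str, passage_b: str, threshold: float = 0.2) -> bool:
        """Flag a candidate reuse when shingle overlap exceeds the threshold."""
        return jaccard(word_ngrams(passage_a), word_ngrams(passage_b)) >= threshold

    # Two near-identical passages share most of their 5-grams and are flagged.
    a = "longtemps je me suis couche de bonne heure"
    b = "longtemps je me suis couche de tres bonne heure"
    print(is_reuse_candidate(a, b))  # True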

    The Logic of the Big Data Turn in Digital Literary Studies

    Ancient Greek Historians in the Digital Age

    A contribution to Digital History 2023: Digitale Methoden in der geschichtswissenschaftlichen Praxis: Fachliche Transformationen und ihre epistemologischen Konsequenzen (Digital Methods in Historical Practice: Disciplinary Transformations and their Epistemological Consequences), Berlin, 23-26 May 2023. Abstract: This paper presents results of ongoing digital projects on ancient Greek historians. The research question concerns the language used by ancient sources to refer to historians and to cite their works, with particular reference to lost historians (the so-called fragmentary authors). While much scholarship has been devoted to collecting fragments of many different genres and to reconstructing the texts from which they were taken, less effort has been spent on collecting data about the language used by ancient authors to refer to these historians and their works. The paper discusses the use of computational linguistics techniques and Named Entity Recognition to extract and annotate information about ancient Greek historians and their works from the sources in which they are preserved. Moreover, the paper describes a new catalog of ancient Greek authors and works based on the extraction and annotation of references to them in ancient sources.
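
    As a rough illustration of the Named Entity Recognition step the paper describes, the sketch below extracts candidate references to persons and works with spaCy. The choice of a general-purpose English model applied to a translated passage is an assumption made for the example; the project itself would need models and annotation schemes adapted to ancient Greek sources.

    # Minimal sketch: extracting candidate references to historians with spaCy NER.
    # Assumption: a general-purpose English model on a translated passage; the
    # actual project would require resources adapted to ancient Greek.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def candidate_references(passage: str) -> list:
        """Return PERSON / WORK_OF_ART entities as candidate citations of historians."""
        doc = nlp(passage)
        return [
            {"text": ent.text, "label": ent.label_,
             "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
            if ent.label_ in {"PERSON", "WORK_OF_ART"}
        ]

    print(candidate_references(
        "As Hellanicus writes in his Atthis, the festival was founded much earlier."
    ))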

    Predicting the Law Area and Decisions of French Supreme Court Cases

    In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made on the textual form of the case description, and the extent to which it is necessary to mask the judge's motivation for a ruling in order to emulate a real-world test scenario. Using a linear Support Vector Machine (SVM) classifier trained on lexical features, we report an F1 score of 96% in predicting a case ruling, 90% in predicting the law area of a case, and 75.9% in estimating the time span in which a ruling was issued.
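
    As a rough illustration of the pipeline named in the abstract (lexical features feeding a linear SVM), the sketch below uses scikit-learn. The TF-IDF configuration and the toy case descriptions and labels are assumptions for illustration only, not the authors' feature set or data.

    # Minimal sketch of a linear-SVM text classifier over lexical (TF-IDF) features.
    # The feature configuration and the toy data are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy stand-ins for case descriptions and their law-area labels.
    texts = [
        "Dispute over the termination of an employment contract",
        "Appeal concerning the custody of a minor child",
        "Claim for damages after a commercial lease was breached",
        "Divorce proceedings and division of marital property",
    ]
    labels = ["labour", "family", "commercial", "family"]

    # TF-IDF word n-grams feed a linear SVM.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LinearSVC(),
    )
    model.fit(texts, labels)

    print(model.predict(["Request to annul the dismissal of an employee"]))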

    Towards a data model for (inter)textual relationships. Connecting Ancient Egyptian texts and understanding scribal practices

    The goal of this lecture is theory-oriented: we propose a conceptual data model that allows us to deal with complex textual relationships. It is empirically grounded in our experience of the digital annotation of Ancient Egyptian texts. The paper grew out of the practical need to annotate and link together hundreds of textual witnesses within the framework of the Ramses project (Polis et al. 2013; Polis & Winand 2013), whose aim is to build and publish online (http://ramses.ulg.ac.be) the first richly annotated corpus of Late Egyptian texts (c. 1350-900 BCE).
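
    The conceptual model itself is developed in the paper; purely as an illustration of what a minimal machine-readable form of typed (inter)textual relationships might look like, the sketch below uses Python dataclasses. All class names, fields and relation kinds here are assumptions, not the Ramses schema.

    # Illustrative sketch of typed (inter)textual relationships as a small graph.
    # All names (Witness, TextRelation, the relation kinds) are assumptions; this
    # is not the Ramses project's actual data model.
    from dataclasses import dataclass, field
    from enum import Enum

    class RelationKind(Enum):
        QUOTATION = "quotation"
        PARAPHRASE = "paraphrase"
        COPY = "copy"          # one witness copied from another
        ALLUSION = "allusion"

    @dataclass
    class Witness:
        identifier: str        # e.g. a manuscript or ostracon siglum (hypothetical)
        label: str

    @dataclass
    class TextRelation:
        source: Witness        # the hypotext / model
        target: Witness        # the hypertext / derived text
        kind: RelationKind
        note: str = ""

    @dataclass
    class Corpus:
        witnesses: list = field(default_factory=list)
        relations: list = field(default_factory=list)

    # Usage: record that witness B paraphrases a passage of witness A.
    a = Witness("witness-A", "hypothetical source text")
    b = Witness("witness-B", "hypothetical derived copy")
    corpus = Corpus(witnesses=[a, b],
                    relations=[TextRelation(a, b, RelationKind.PARAPHRASE)])
    print(len(corpus.relations))  # 1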

    Big Data in the Digital Humanities. New Conversations in the Global Academic Context

    After analysing the meaning of the expression “Big Data”, this article highlights the cultural nature of data and defends the validity of theories, models and hypotheses for carrying out scientific research. Lastly, it discusses the dialectic between privacy and control. In a sense, this issue escapes the traditional field of the humanities, but it also deserves our attention as twenty-first-century citizens interested in the cultural practices of the present. Humanists no doubt have much to contribute to ethical and epistemological debates on the use of the data generated by citizens, recalling the “captured” and cultural nature of data, and bringing their experience to analysing particular cases bearing in mind the general context

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for an automatic plagiarism checker is becoming more real. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most of it deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity globally. Systems resulting from this research are therefore incapable of pointing to the exact locations of "similar passage" pairs. In addition, no public, standard corpora have been available to evaluate PDS for Indonesian texts.

    To address the weaknesses of earlier work, this thesis develops a plagiarism detection system that executes the various stages of plagiarism detection in a workflow system. In the retrieval stage, a novel document feature coined "phraseword" is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are copied partially or in obfuscated form into a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed at detecting and locating source-obfuscated passage pairs. The seeds for matching such pairs are based on locally weighted significant terms, so as to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created, partly through simulation by human writers and partly by algorithmic random generation.

    Using this corpus, the performance of the proposed methods was evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall of 1. In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm at four levels of measurement: character, passage, document, and case. The results showed that methods using tokens as seeds scored higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaked, and paraphrased passages; however, Alvi's recognition rate on summarized passages is only insignificantly higher than ours. The third scenario showed the same tendency, except that the precision of Alvi's algorithm at the character and paragraph levels is higher than that of our system.

    The higher Plagdet scores produced by some of our methods compared with Alvi's show that this study has fulfilled its objective of implementing a competitive, state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus; the study has thus also contributed a standard evaluation corpus for assessing PDS for Indonesian documents. Further contributions are a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of these is to apply the local word weighting used in the field of text summarization to select seeds both for discriminating paragraph-pair candidates and for the matching process. The proposed detection algorithm yields almost no multiple detections, which contributes to its strength.
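
    As an illustration of the general idea of seed-based paragraph matching sketched above, the short Python example below selects locally weighted terms per paragraph and pairs up paragraphs that share enough seeds. The weighting scheme, the seed count and the sharing threshold are assumptions, not the thesis's phraseword features or its two-step algorithm.

    # Minimal sketch of seed-based paragraph matching: keep the locally most
    # frequent terms of each paragraph as seeds, then pair paragraphs that share
    # at least `min_shared` seeds. Parameters here are illustrative assumptions.
    from collections import Counter

    def seed_terms(paragraph: str, top_k: int = 5) -> set:
        """Weight terms by in-paragraph frequency and keep the top_k as seeds."""
        tokens = [t.strip(".,;:!?").lower() for t in paragraph.split()]
        counts = Counter(t for t in tokens if len(t) > 3)  # crude stopword proxy
        return {term for term, _ in counts.most_common(top_k)}

    def candidate_pairs(suspicious: list, source: list, min_shared: int = 2):
        """Yield (i, j) indices of paragraph pairs sharing at least min_shared seeds."""
        source_seeds = [seed_terms(p) for p in source]
        for i, susp_paragraph in enumerate(suspicious):
            susp_seeds = seed_terms(susp_paragraph)
            for j, seeds in enumerate(source_seeds):
                if len(susp_seeds & seeds) >= min_shared:
                    yield i, j

    suspicious_doc = ["Plagiarism detection compares suspicious passages with source passages."]
    source_doc = ["Detection of plagiarism compares passages of a suspicious document "
                  "against passages of candidate source documents."]
    print(list(candidate_pairs(suspicious_doc, source_doc)))  # [(0, 0)]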

    A Sheep in Wolff's Clothing: Émilie du Châtelet and the Encyclopédie

    This article explores the use of Émilie Du Châtelet's Institutions de physique as both an acknowledged and an unacknowledged source for the Encyclopédie of Diderot and d'Alembert, and argues for Du Châtelet's inclusion as a full participant in the philosophical conversations the Encyclopédie enacts. Although she has widely been considered a minor voice who entered the Encyclopédie solely through the mediation of Samuel Formey, a largely forgotten and conflicted encyclopédiste, new evidence generated using techniques developed in the digital humanities suggests that Du Châtelet was a much more central figure in the Encyclopédie's engagement with the metaphysics of Leibniz and Wolff than previously thought.