27 research outputs found

    Plagiarism Detection in arXiv

    We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger. Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200

    Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

    Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
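
The idea of spotting templates by their frequency of occurrence can be illustrated with a minimal sketch. It is not the authors' method: the function names, the whitespace and case normalization, and the `min_fraction` threshold are all assumptions chosen for illustration. A line is treated as boilerplate if it appears in at least a given fraction of the documents.

```python
from collections import Counter


def normalize(line):
    # Assumed normalization: lowercase and collapse runs of whitespace,
    # so manually retyped variants of a template line compare equal.
    return " ".join(line.lower().split())


def find_boilerplate_lines(documents, min_fraction=0.5):
    """Return normalized lines occurring in at least min_fraction of documents."""
    doc_frequency = Counter()
    for text in documents:
        # Count each distinct line at most once per document.
        unique_lines = {normalize(line) for line in text.splitlines()}
        unique_lines.discard("")
        doc_frequency.update(unique_lines)
    threshold = min_fraction * len(documents)
    return {line for line, count in doc_frequency.items() if count >= threshold}


def strip_boilerplate(text, boilerplate):
    """Drop every line of text whose normalized form is in the boilerplate set."""
    kept = [line for line in text.splitlines() if normalize(line) not in boilerplate]
    return "\n".join(kept)


# Hypothetical mini-corpus with a shared preamble and epilogue.
docs = [
    "*** START OF EBOOK ***\nCall me Ishmael.\n*** END OF EBOOK ***",
    "*** Start of eBook ***\nIt was the best of times.\n*** END OF EBOOK ***",
    "***  START OF EBOOK ***\nHappy families are all alike.\n*** END OF EBOOK ***",
]
boilerplate = find_boilerplate_lines(docs)
cleaned = [strip_boilerplate(d, boilerplate) for d in docs]
```

A purely frequency-based filter like this would miss preambles that are reworded rather than retyped, which is consistent with the abstract's remark that some documents require knowledge of English.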

    MODELS OF INTEGRATION OF INFORMATION SYSTEMS IN HIGHER EDUCATION INSTITUTIONS

    At present, many automated systems are being developed and implemented to support educational and research processes in universities. These systems often duplicate functions and databases, and they suffer from compatibility problems. The most common educational systems include systems for creating electronic libraries, providing access to scientific and educational information, detecting plagiarism, and testing knowledge. This article examines models and solutions for integrating two such systems: the information library system (ILS) and the anti-plagiarism system. The integration is based on the compatibility of databases or, more precisely, of the metadata of different information models. Cloud technologies are also used: a data processing approach in which computing resources are provided to the user of the integrated system as an online service. The ILS creates an e-library of graduation papers and dissertations on the main server. The electronic catalog is created using the MARC21 communication format. Database development is distributed across departments. The anti-plagiarism subsystem analyzes the full-text database for similarity between texts (dissertations, diploma works and others). It also identifies the percentage of coincidence and creates a table of statistical information on the coincidence of texts for each author and division, indicating similar fields. The integrated system was developed and tested at the Tashkent University of Information Technologies (TUIT) to work in the corporate mode of various departments (faculties, departments, TUIT branches).

    The impact factor's Matthew effect: a natural experiment in bibliometrics

    Since the publication of Robert K. Merton's theory of cumulative advantage in science (Matthew Effect), several empirical studies have tried to measure its presence at the level of papers, individual researchers, institutions or countries. However, these studies seldom control for the intrinsic "quality" of papers or of researchers: "better" (however defined) papers or researchers could receive higher citation rates because they are indeed of better quality. Using an original method for controlling for the intrinsic value of papers (identical duplicate papers published in different journals with different impact factors), this paper shows that the journal in which papers are published has a strong influence on their citation rates, as duplicate papers published in high-impact journals obtain, on average, twice as many citations as their identical counterparts published in journals with lower impact factors. The intrinsic value of a paper is thus not the only reason a given paper gets cited or not; there is a specific Matthew effect attached to journals, and this gives papers published there an added value over and above their intrinsic quality. Comment: 7 pages, 2 table

    On the prevalence and scientific impact of duplicate publications in different scientific fields (1980-2007)

    The issue of duplicate publications has received a lot of attention in the medical literature, but much less in the information science community. This paper aims at analyzing the prevalence and scientific impact of duplicate publications across all fields of research between 1980 and 2007, using a definition of duplicate papers based on their metadata. It shows that in all fields combined, the prevalence of duplicates is one out of two thousand papers, but is higher in the natural and medical sciences than in the social sciences and humanities. A very high proportion (>85%) of these papers are published the same year or one year apart, which suggests that most duplicate papers were submitted simultaneously. Furthermore, duplicate papers are generally published in journals with impact factors below the average of their field and obtain a lower number of citations. This paper provides clear evidence that the prevalence of duplicate papers is low and, more importantly, that the scientific impact of such papers is below average. Comment: 13 pages, 7 figure

    Text-Based Plagiarism in Scientific Publishing: Issues, Developments and Education

    Text-based plagiarism, or copying language from sources, has recently become an issue of growing concern in scientific publishing. Use of CrossCheck (a computational text-matching tool) by journals has sometimes exposed an unexpected amount of textual similarity between submissions and databases of scholarly literature. In this paper I provide an overview of the relevant literature, to examine how journal gatekeepers perceive textual appropriation, and how automated plagiarism-screening tools have been developed to detect text matching, with the technique now available for self-check of manuscripts before submission; I also discuss issues around English as an additional language (EAL) authors and in particular EAL novices being the typical offenders of textual borrowing. The final section of the paper proposes a few educational directions to take in tackling text-based plagiarism, highlighting the roles of the publishing industry, senior authors and English for academic purposes professionals. © 2012 The Author(s). Springer Open Choice, 28 May 201

    Text-Based Plagiarism in Scientific Writing: What Chinese Supervisors Think About Copying and How to Reduce it in Students' Writing

    Text-based plagiarism, or textual copying, typically in the form of replicating or patchwriting sentences in a row from sources, seems to be an issue of growing concern among scientific journal editors. Editors have emphasized that senior authors (typically supervisors of science students) should take the responsibility for educating novices against text-based plagiarism. To address a research gap in the literature as to how scientist supervisors perceive the issue of textual copying and what they do in educating their students, this paper reports an interview study with 14 supervisors at a research-oriented Chinese university. The study throws light on the potential of senior authors mentoring novices in English as an Additional Language (EAL) contexts and has implications for the efforts that can be made in the wider scientific community to support scientists in writing against text-based plagiarism. © 2011 The Author(s). Springer Open Choice, 28 May 201