4,346 research outputs found

Two-Step Cluster Based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection

Author: Wahono R. S. (Romi)
Wijaya A. (Adi)
Publication venue: None
Publication date: 01/01/2015

Intrinsic plagiarism detection is the task of analyzing a document for undeclared changes in writing style, which are treated as outliers. Naive Bayes is often used for outlier detection. However, Naive Bayes assumes that the values of continuous features are normally distributed; when this assumption is strongly violated, classification performance is low. Discretization of continuous features can improve the performance of Naive Bayes. In this study, feature discretization based on Two-Step Cluster is proposed for Naive Bayes. The proposed method uses tf-idf and a query language model as feature creators, together with a False Positive/False Negative (FP/FN) threshold that aims to improve accuracy, and is evaluated using the PAN PC 2009 dataset. The results indicate that the proposed method with discrete features outperforms the continuous-feature variant on all evaluation measures, such as recall, precision, f-measure and accuracy. The use of the FP/FN threshold also affects the result, since it can decrease FP and FN and thus increase all evaluation
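
The idea of discretizing a continuous feature before applying Naive Bayes can be sketched as follows. This is only an illustration, not the authors' method: it swaps in plain equal-width binning for the Two-Step Cluster discretization named in the abstract, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def equal_width_bins(values, n_bins):
    """Build a function mapping a continuous value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def to_bin(value):
        index = int((value - lo) / width)
        return max(0, min(n_bins - 1, index))  # clamp values outside the training range

    return to_bin


def train_discrete_nb(bins, labels, n_bins):
    """Train a one-feature categorical Naive Bayes model with add-one smoothing."""
    class_counts = Counter(labels)
    bin_counts = defaultdict(Counter)
    for b, y in zip(bins, labels):
        bin_counts[y][b] += 1
    total = len(labels)

    def predict(b):
        scores = {}
        for y, n_y in class_counts.items():
            log_prior = math.log(n_y / total)
            log_likelihood = math.log((bin_counts[y][b] + 1) / (n_y + n_bins))
            scores[y] = log_prior + log_likelihood
        return max(scores, key=scores.get)

    return predict


# Hypothetical style-feature values: low values are "normal", high are "outlier".
values = [0.10, 0.20, 0.15, 0.90, 0.95, 0.85]
labels = ["normal", "normal", "normal", "outlier", "outlier", "outlier"]
to_bin = equal_width_bins(values, n_bins=2)
predict = train_discrete_nb([to_bin(v) for v in values], labels, n_bins=2)
print(predict(to_bin(0.12)), predict(to_bin(0.88)))
```

Because the model only sees bin indices, it makes no assumption about the shape of the feature's distribution within a class, which is the motivation the abstract gives for discretization.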

Neliti

A Machine Learning Approach for Plagiarism Detection

Author: Al-Sallal Muna
Publication date: 01/01/2016

Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis has developed two novel approaches to address both of these methods. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in the stylometric features (most common words) in order to characterise the document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author has written a particular book being tested. The proposed method has successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method relies on the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship.
The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset up into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out-cross-validation method. Results indicated that by integrating deep analysis (LSA) and Stylometric analysis, hidden changes can be identified whether or not a reference collection exists
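
As a rough illustration of the most-common-word features mentioned in this abstract, the sketch below computes relative MCW frequencies per text and their deviation across texts. It is a simplification under stated assumptions: the tokenizer, the function names and the toy texts are hypothetical, and the LSA step described in the thesis is not included.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(texts, k):
    """Return the k most frequent words over a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def mcw_profile(text, mcws):
    """Relative frequency of each most common word within one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in mcws]


def profile_deviation(profiles):
    """Population standard deviation of each word's frequency across texts."""
    return [statistics.pstdev(column) for column in zip(*profiles)]


# Hypothetical stand-ins for books by a single author.
texts = ["the cat and the dog", "the bird and the fish and the frog"]
mcws = most_common_words(texts, k=2)
profiles = [mcw_profile(text, mcws) for text in texts]
print(mcws, profiles, profile_deviation(profiles))
```

A passage whose profile lies far from the author's typical frequencies, relative to the measured deviation, would be a candidate for a change in style.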

Coventry University Pure Portal

Composing Measures for Computing Text Similarity

Author: Bär Daniel
Gurevych Iryna
Zesch Torsten
Publication date: 26/01/2015

We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups
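
The notion of combining several text similarity measures into one feature vector can be sketched as below. This is a minimal illustration, not the DKPro Similarity API: the two measures (word-level and character-trigram Jaccard overlap) and all function names are assumptions chosen for brevity, and the classifier that would consume the features is omitted.

```python
def word_jaccard(a, b):
    """Jaccard overlap between the word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two texts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_features(a, b):
    """Feature vector of individual similarity scores for a text pair."""
    return [word_jaccard(a, b), ngram_jaccard(a, b)]


print(similarity_features("the cat sat", "the cat sat down"))
```

Each score captures a different text dimension, so a downstream model can weight them rather than relying on a single measure.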

TUbiblio

tuprints

Impact Factor: outdated artefact or stepping-stone to journal certification?

Author: Jerome K. Vanclay
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

A review of Garfield's journal impact factor and its specific implementation as the Thomson Reuters Impact Factor reveals several weaknesses in this commonly-used indicator of journal standing. Key limitations include the mismatch between citing and cited documents, the deceptive display of three decimals that belies the real precision, and the absence of confidence intervals. These are minor issues that are easily amended and should be corrected, but more substantive improvements are needed. There are indications that the scientific community seeks and needs better certification of journal procedures to improve the quality of published science. Comprehensive certification of editorial and review procedures could help ensure adequate procedures to detect duplicate and fraudulent submissions. Comment: 25 pages, 12 figures, 6 table

arXiv.org e-Print Archive

ePublications@SCU

Crossref

On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

Author: Barrón Cedeño Luis Alberto
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 08/06/2012

Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012 Palanci

RiuNet

Normalized Information Distance

Author: Balbach Frank J.
Cilibrasi Rudi L.
Li Ming
Vitanyi Paul M. B.
Publication date: 01/01/2008

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation. Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appea
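
The compression-based approximation mentioned in this abstract is commonly written as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A minimal sketch follows, assuming zlib as the compressor; the compressor choice, the function names and the sample strings are illustrative assumptions, not taken from the chapter.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: one repeated sentence and an unrelated one.
a = b"the quick brown fox jumps over the lazy dog " * 20
c = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
print(ncd(a, a), ncd(a, c))
```

Values near 0 suggest the two inputs share most of their information, while values near 1 suggest they share little; real compressors only approximate the ideal, so results depend on the compressor and on input length.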

arXiv.org e-Print Archive

CiteSeerX

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Closing the loop: assisting archival appraisal and information retrieval in one sweep

Author: Kim Y.
Ross S.
Publication date: 01/01/2013

In this article, we examine the similarities between the concept of appraisal, a process that takes place within the archives, and the concept of relevance judgement, a process fundamental to the evaluation of information retrieval systems. More specifically, we revisit selection criteria proposed as a result of archival research and work within the digital curation communities, and compare them to relevance criteria as discussed within information retrieval's literature-based discovery. We illustrate how closely these criteria relate to each other and discuss how understanding the relationships between these disciplines could form a basis for proposing automated selection for archival processes and initiating multi-objective learning with respect to information retrieval

Crossref

Enlighten

Intelligent Plagiarism Detection for Electronic Documents

Author: Al-Bayed Mohran H. J.
Publication date: 01/01/2017

Plagiarism detection is the process of finding similarities among electronic documents. This process has recently become highly necessary because of the large number of documents available on the internet and the ability to copy and paste the text of relevant documents with simple Control+C and Control+V commands. The proposed solution is to investigate and develop an easy, fast, multi-language plagiarism detector that can check a document for plagiarism with a single click. This will be done with the support of an intelligent system that can learn, change and adapt to the input document, run a fast cross-search for the content in the local repository and the online repository, and link the content of the file with the matching content wherever it is found. Furthermore, the supported document types are Word, text and, in some cases, PDF files (where the text can be extracted from them); this is made possible by using the DLL file from the Word application that Microsoft provides on the OS. Using the DLL means we are not constrained in how we get the text from files; it will help us load the file in our Delphi project, walk through our methodology and read the file word by word to guarantee the best working scenarios for the calculation. As a result, this process will help raise the quality of documents, enhance writers' experience of their work, and protect the copyright of the official writer of a document by providing a new alternative tool for the plagiarism detection problem that is easy, fast and free for the concerned institutions to use

PhilPapers

Taxonomy of Academic Plagiarism Methods (TAKSONOMIJA METODA AKADEMSKOG PLAGIRANJA)

Author: Ana Meštrović
Tedo Vrbanec
Publication venue: 'Polytechnic of Rijeka University'
Publication date: 01/01/2021

The article gives an overview of the plagiarism domain, with a focus on academic plagiarism. The article defines plagiarism and explains the origin of the term, as well as plagiarism-related terms. It identifies the extent of the plagiarism domain and then focuses on the plagiarism subdomain of text documents, for which it gives an overview of current classifications and taxonomies and then proposes a more comprehensive classification according to several criteria: origin and purpose, technical implementation, consequence, complexity of detection, and the number of linguistic sources. The article suggests a new classification of academic plagiarism and describes sorts and methods of plagiarism, types and categories, approaches and phases of plagiarism detection, and the classification of methods and algorithms for plagiarism detection. The title of the article explicitly targets the academic community, but it is sufficiently general and interdisciplinary that it can be useful for many other professionals, such as software developers, linguists and librarians.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Detecting plagiarism in the forensic linguistics turn

Author: Sousa Silva Rui

This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Data in the form of a survey were obtained from two higher education institutions in Portugal and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups Univariate Analysis of Variance (ANOVA) to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive and, if so, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist’s intentionality.
Furthermore, the findings show that translingual plagiarism can be detected by using the method proposed, and that plagiarism detection software can be improved using existing computer tools

Aston Publications Explorer