1,459 research outputs found

A Machine Learning Approach for Plagiarism Detection

Author: Al-Sallal Muna
Publication date: 01/01/2016

Plagiarism detection is gaining importance because of requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches that address both methods. First, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques: Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry, and Support Vector Machines (SVM). The LSA application was fine-tuned to take in stylometric features (the most common words) in order to characterise document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. SVM-based algorithms were used to perform the classification, predicting which author wrote the book being tested. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Second, the intrinsic detection method relies on the statistical properties of the most common words (MCWs). LSA was applied to a group of MCWs to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. The model’s generation procedure focuses on just one author in an attempt to summarise aspects of that author’s style in a definitive and clear-cut manner. The thesis also proposes a novel experimental methodology for testing the performance of both the extrinsic and intrinsic methods for plagiarism detection. This methodology relies on the CEN (Corpus of English Novels) dataset, but divides it into training and test datasets in a novel manner. Both approaches were evaluated using the well-known leave-one-out cross-validation method. Results indicated that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
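
The abstract combines most-common-word (MCW) features with leave-one-out cross-validation. The sketch below illustrates only that evaluation loop, under loud assumptions: it swaps the thesis's LSA and SVM stages for a plain cosine nearest-neighbour rule, and every function name and the default `k=50` are hypothetical, not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(texts, k):
    """Return the k most frequent words over a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def mcw_profile(text, vocabulary):
    """Relative frequency of each vocabulary word within one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leave_one_out_accuracy(samples, k=50):
    """samples: list of (author, text) pairs.

    Each text is held out in turn and assigned to the author of the most
    similar remaining text. Assumption: nearest neighbour stands in for
    the LSA + SVM pipeline described in the abstract.
    """
    correct = 0
    for i, (author, text) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        # The vocabulary is rebuilt without the held-out text.
        vocabulary = most_common_words([t for _, t in rest], k)
        held_out = mcw_profile(text, vocabulary)
        predicted = max(
            rest,
            key=lambda s: cosine(held_out, mcw_profile(s[1], vocabulary)),
        )[0]
        correct += predicted == author
    return correct / len(samples)
```

Rebuilding the vocabulary inside the loop keeps the held-out text from influencing the features it is later scored against.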

Coventry University Pure Portal

Two-Step Cluster Based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection

Author: Wahono R. S. (Romi)
Wijaya A. (Adi)
Publication date: 01/01/2015

Intrinsic plagiarism detection is the task of analyzing a document for undeclared changes in writing style, which are treated as outliers. Naive Bayes is often used for outlier detection. However, Naive Bayes assumes that the values of continuous features are normally distributed; this assumption is often strongly violated, which causes low classification performance. Discretization of continuous features can improve the performance of Naive Bayes. In this study, feature discretization based on Two-Step Cluster for Naive Bayes is proposed. The proposed method uses tf-idf and a query language model as feature creators, applies a False Positive/False Negative (FP/FN) threshold that aims to improve accuracy, and is evaluated on the PAN PC 2009 dataset. The results indicate that the proposed method with discrete features outperforms the same method with continuous features on all evaluation measures, namely recall, precision, F-measure, and accuracy. Using the FP/FN threshold also affects the results, since it can decrease FP and FN and thus improve all evaluation measures.
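
To make the idea of discretizing continuous features before Naive Bayes concrete, here is a minimal sketch. It is not the paper's method: equal-width binning is swapped in for Two-Step Cluster, the FP/FN threshold is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def equal_width_bins(values, n_bins):
    """Return a function that maps a continuous value to a bin index.

    Assumption: equal-width binning replaces the Two-Step Cluster
    discretization used in the paper.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant data

    def to_bin(x):
        index = int((x - low) / width)
        return min(max(index, 0), n_bins - 1)  # clip values outside the range

    return to_bin


def train_naive_bayes(rows, labels, n_bins):
    """Count classes and per-class bin occurrences for discretized rows."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> bin counts
    for row, label in zip(rows, labels):
        for j, bin_index in enumerate(row):
            feature_counts[(label, j)][bin_index] += 1
    return class_counts, feature_counts, n_bins


def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    class_counts, feature_counts, n_bins = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for j, bin_index in enumerate(row):
            # Laplace smoothing so an unseen bin does not zero the product.
            seen = feature_counts[(label, j)][bin_index]
            score += math.log((seen + 1) / (count + n_bins))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because each feature becomes a small set of bins, the classifier estimates a frequency table per class instead of fitting a normal distribution, which is the motivation the abstract gives for discretization.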

Neliti

Impact Factor: outdated artefact or stepping-stone to journal certification?

Author: Jerome K. Vanclay
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

A review of Garfield's journal impact factor and its specific implementation as the Thomson Reuters Impact Factor reveals several weaknesses in this commonly-used indicator of journal standing. Key limitations include the mismatch between citing and cited documents, the deceptive display of three decimals that belies the real precision, and the absence of confidence intervals. These are minor issues that are easily amended and should be corrected, but more substantive improvements are needed. There are indications that the scientific community seeks and needs better certification of journal procedures to improve the quality of published science. Comprehensive certification of editorial and review procedures could help ensure adequate procedures to detect duplicate and fraudulent submissions. Comment: 25 pages, 12 figures, 6 tables
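
The two-year impact factor is the number of citations a journal receives in a given year to items it published in the two preceding years, divided by the number of citable items from those two years. The sketch below shows how a rough interval could accompany that ratio. The interval is an assumption made here for illustration, not a procedure from the paper: it treats the citation count as a Poisson variable and uses a normal approximation, and the function names are hypothetical.

```python
import math


def impact_factor(citations, citable_items):
    """Two-year impact factor: citations divided by citable items."""
    return citations / citable_items


def approximate_interval(citations, citable_items, z=1.96):
    """Rough 95% interval around the impact factor.

    Assumption: the citation count is Poisson distributed, so its
    standard deviation is sqrt(citations); z=1.96 gives the usual
    normal-approximation 95% interval.
    """
    centre = citations / citable_items
    half_width = z * math.sqrt(citations) / citable_items
    return centre - half_width, centre + half_width
```

For a journal with 250 citations to 100 citable items, the ratio is 2.5 and the approximate interval spans roughly 2.19 to 2.81, which suggests why reporting three decimals overstates the precision.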

arXiv.org e-Print Archive

ePublications@SCU

Crossref

Identifying Machine-Paraphrased Plagiarism

Author: Foltýnek Tomáš
Gipp Bela
Meuschke Norman
Ruas Terry
Wahle Jan Philip
Publication date: 29/11/2021

Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99% (F1=99.68% for SpinBot and F1=71.64% for SpinnerChief cases), while human evaluators achieved F1=78.4% for SpinBot and F1=65.6% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan. To facilitate future research, all data, code, and two web applications showcasing our contributions are openly available.

arXiv.org e-Print Archive

Similarity Continua and Criteria in Memetic Theory and Analysis

Author: Jan Steven
Publication venue: Music Council of Australia
Publication date: 01/01/2014

University of Huddersfield Repository

Huddersfield Research Portal

The benefits from publicly funded research

Author: Ben Martin
Puay Tang

Research, Technological change, Government Policy

Research Papers in Economics

Source code authorship attribution

Author: Burrows S
Publication venue: RMIT University
Publication date: 01/01/2010

To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis. We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field. We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections. The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. In the final evaluation, we show that the n-gram approaches are leading the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem.
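
The retrieval pipeline described above (tokenise, form n-grams, index, rank with Okapi BM25, take the author of the top-ranked sample) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the crude regex tokeniser, the trigram size, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and all names are choices made here.

```python
import math
import re
from collections import Counter


def token_ngrams(source, n=3):
    """Split source code into crude tokens and join each run of n tokens.

    Assumption: a simple regex tokeniser (identifiers, numbers, single
    symbols) stands in for the feature extraction used in the thesis.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class BM25Index:
    """Tiny in-memory index that ranks samples with Okapi BM25."""

    def __init__(self, documents, k1=1.2, b=0.75):
        # documents: list of (author, list of n-gram terms)
        self.k1, self.b = k1, b
        self.authors = [author for author, _ in documents]
        self.term_counts = [Counter(terms) for _, terms in documents]
        self.lengths = [len(terms) for _, terms in documents]
        self.avg_length = (sum(self.lengths) / len(documents)) or 1.0
        self.doc_freq = Counter()
        for counts in self.term_counts:
            self.doc_freq.update(counts.keys())

    def score(self, query_terms, i):
        """BM25 score of indexed sample i for the given query terms."""
        counts, length = self.term_counts[i], self.lengths[i]
        n_docs = len(self.authors)
        total = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = self.doc_freq[term]
            # Smoothed IDF variant that stays positive for common terms.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_length)
            total += idf * tf * (self.k1 + 1) / norm
        return total

    def attribute(self, query_terms):
        """Return the author of the top-ranked indexed sample."""
        best = max(range(len(self.authors)),
                   key=lambda i: self.score(query_terms, i))
        return self.authors[best]
```

Accuracy in the sense of the abstract would then be the fraction of query samples for which `attribute` returns the true author.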

RMIT Research Repository

On the Nature and Types of Anomalies: A Review

Author: Foorthuis Ralph
Publication date: 27/12/2020

Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies. Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review comments will be appreciated. Improvements in version 2: Explicit mention of fifth anomaly dimension; Added section on explainable anomaly detection; Added section on variations on the anomaly concept; Various minor additions and improvements

arXiv.org e-Print Archive