1,070 research outputs found
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied on clean texts, but they do not necessarily deliver good performance
against noisy texts. Challenges with paraphrase detection on user generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus
TAKSONOMIJA METODA AKADEMSKOG PLAGIRANJA
The article gives an overview of the plagiarism domain, with focus on academic plagiarism. The
article defines plagiarism, explains the origin of the term, as well as plagiarism related terms. It
identifies the extent of the plagiarism domain and then focuses on the plagiarism subdomain of text
documents, for which it gives an overview of current classifications and taxonomies and then proposes a
more comprehensive classification according to several criteria: their origin and purpose, technical
implementation, consequence, complexity of detection and according to the number of linguistic sources.
The article suggests the new classification of academic plagiarism, describes sorts and methods of
plagiarism, types and categories, approaches and phases of plagiarism detection, the classification
of methods and algorithms for plagiarism detection. The title of the article explicitly targets the
academic community, but it is sufficiently general and interdisciplinary, so it can be useful for
many other professionals like software developers, linguists and librarians.Rad daje pregled domene plagiranja tekstnih dokumenata. Opisuje porijeklo pojma plagijata, daje prikaz
definicija te objašnjava plagijatu srodne pojmove. Ukazuje na širinu domene plagiranja, a za tekstne
dokumenate daje pregled dosadašnjih taksonomija i predlaže sveobuhvatniju taksonomiju prema više kriterija:
porijeklu i namjeni, tehničkoj provedbi plagiranja, posljedicama plagiranja, složenosti otkrivanja i
(više)jezičnom porijeklu. Rad predlaže novu klasifikaciju akademskog plagiranja, prikazuje vrste i
metode plagiranja, tipove i kategorije plagijata, pristupe i faze otkrivanja plagiranja. Potom opisuje
klasifikaciju metoda i algoritama otkrivanja plagijata. Iako cilja na akademskog čitatelja, može biti
od koristi u interdisciplinarnim područjima te razvijateljima softvera, lingvistima i knjižničarima
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations
Identifying academic plagiarism is a pressing task for educational and
research institutions, publishers, and funding agencies. Current plagiarism
detection systems reliably find instances of copied and moderately reworded
text. However, reliably detecting concealed plagiarism, such as strong
paraphrases, translations, and the reuse of nontextual content and ideas is an
open research problem. In this paper, we extend our prior research on analyzing
mathematical content and academic citations. Both are promising approaches for
improving the detection of concealed academic plagiarism primarily in Science,
Technology, Engineering and Mathematics (STEM). We make the following
contributions: i) We present a two-stage detection process that combines
similarity assessments of mathematical content, academic citations, and text.
ii) We introduce new similarity measures that consider the order of
mathematical features and outperform the measures in our prior research. iii)
We compare the effectiveness of the math-based, citation-based, and text-based
detection approaches using confirmed cases of academic plagiarism. iv) We
demonstrate that the combined analysis of math-based and citation-based content
features allows identifying potentially suspicious cases in a collection of
102K STEM documents. Overall, we show that analyzing the similarity of
mathematical content and academic citations is a striking supplement for
conventional text-based detection approaches for academic literature in the
STEM disciplines.Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL) 2019. The data and code of our study are openly available at
https://purl.org/hybridP
Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase
[EN] The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of available textual data in several languages over the Internet. Plagiarism occurs in different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, designed algorithms to detect plagiarism should be robust to the diverse languages and different type of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment among suspicious and original documents. By comparing representations of sentences in source and suspicious documents, pair sentences with the highest similarity are considered as the candidates or seeds of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and merging threshold, are tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, regulates a unique set of parameters for all types of plagiarism by several trials on the training corpus. Experiments show improvements in performance by considering obfuscation type during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. By employing the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method on available datasets in English, Persian and Arabic languages on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach can achieve
considerable performance on the different datasets in various languages. Our online threshold tuning approach without any training datasets works as well as, or even in some cases better than, the training-base method.The work of Paolo Rosso was partially funded by the Spanish MICINN under the research Project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-yS1059310607321
CroLSSim: Cross‐language software similarity detector using hybrid approach of LSA‐based AST‐MDrep features and CNN‐LSTM model
Software similarity in different programming codes is a rapidly evolving field because of its numerous applications in software development, software cloning, software plagiarism, and software forensics. Currently, software researchers and developers search cross-language open-source repositories for similar applications for a variety of reasons, such as reusing programming code, analyzing different implementations, and looking for a better application. However, it is a challenging task because each programming language has a unique syntax and semantic structure. In this paper, a novel tool called Cross-Language Software Similarity (CroLSSim) is designed to detect similar software applications written in different programming codes. First, the Abstract Syntax Tree (AST) features are collected from different programming codes. These are high-quality features that can show the abstract view of each program. Then, Methods Description (MDrep) in combination with AST is used to examine the relationship among different method calls. Second, the Term Frequency Inverse Document Frequency approach is used to retrieve the local and global weights from AST-MDrep features. Third, the Latent Semantic Analysis-based features extraction and selection method is proposed to extract the semantic anchors in reduced dimensional space. Fourth, the Convolution Neural Network (CNN)-based features extraction method is proposed to mine the deep features. Finally, a hybrid deep learning model of CNN-Long-Short-Term Memory is designed to detect semantically similar software applications from these latent variables. The data set contains approximately 9.5K Java, 8.8K C#, and 7.4K C++ software applications obtained from GitHub. The proposed approach outperforms as compared with the state-of-the-art methods
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books was
probed. Statistical analyzes performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve its global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks
- …