76 research outputs found
ASE@DPIL-FIRE2016: Hindi Paraphrase Detection using Natural Language Processing Techniques & Semantic Similarity Computations
ABSTRACT The paper reports the approaches used and the results achieved by our system in the FIRE-2016 shared task on paraphrase identification in Indian languages (DPIL). Since Indian languages have a complex inherent nature, paraphrase identification in these languages is a challenging task. In the DPIL task, the challenge is to detect whether a given sentence pair is paraphrased or not. In the proposed work, natural language processing with semantic concept extraction is explored for paraphrase detection in Hindi. Stop word removal, stemming and part-of-speech tagging are employed. Similarity between the sentence pairs is then computed by extracting semantic concepts using the WordNet lexical database. Initially, the proposed approach is evaluated on the given training sets using different machine learning classifiers. The testing phase is then used to predict the classes using the proposed features. The results are promising, which shows the potency of natural language processing techniques and semantic concept extraction in detecting paraphrases. CCS Concepts: Computing methodologies - Natural language processing; Information systems - Document analysis and feature selection; Near-duplicate and paraphrase detectio
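The preprocessing-then-similarity pipeline described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract relies on WordNet concepts for Hindi, whereas this sketch substitutes plain token overlap (Jaccard similarity) on English text. The stop-word list, the `crude_stem` helper and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, tiny English stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on", "to", "and"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence: str) -> set:
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Token-set overlap after preprocessing (an assumed, simplified measure)."""
    a, b = preprocess(sentence_a), preprocess(sentence_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A score of this kind could serve as one input feature to a classifier; replacing the token sets with sets of WordNet concepts would bring the sketch closer to what the abstract describes.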
A study on plagiarism detection and plagiarism direction identification using natural language processing techniques
Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of texts are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on the topic of utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and statistical, shallow and deep linguistic techniques using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful.
This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn based on traits of texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse. EThOS - Electronic Theses Online Service, GB, United Kingdo
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations
Identifying academic plagiarism is a pressing task for educational and
research institutions, publishers, and funding agencies. Current plagiarism
detection systems reliably find instances of copied and moderately reworded
text. However, reliably detecting concealed plagiarism, such as strong
paraphrases, translations, and the reuse of nontextual content and ideas is an
open research problem. In this paper, we extend our prior research on analyzing
mathematical content and academic citations. Both are promising approaches for
improving the detection of concealed academic plagiarism primarily in Science,
Technology, Engineering and Mathematics (STEM). We make the following
contributions: i) We present a two-stage detection process that combines
similarity assessments of mathematical content, academic citations, and text.
ii) We introduce new similarity measures that consider the order of
mathematical features and outperform the measures in our prior research. iii)
We compare the effectiveness of the math-based, citation-based, and text-based
detection approaches using confirmed cases of academic plagiarism. iv) We
demonstrate that the combined analysis of math-based and citation-based content
features allows identifying potentially suspicious cases in a collection of
102K STEM documents. Overall, we show that analyzing the similarity of
mathematical content and academic citations is a striking supplement for
conventional text-based detection approaches for academic literature in the
STEM disciplines. Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL) 2019. The data and code of our study are openly available at
https://purl.org/hybridP
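The abstract above mentions similarity measures that consider the order of mathematical features but does not define them. As a rough illustration of what an order-aware measure can look like, the sketch below scores two feature sequences by their longest common subsequence (LCS). The choice of LCS, the normalisation by the longer sequence and the function names are assumptions made for this example, not the measures from the paper.

```python
def lcs_length(seq_a, seq_b) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(seq_b) + 1)
    for x in seq_a:
        curr = [0]
        for j, y in enumerate(seq_b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_aware_similarity(seq_a, seq_b) -> float:
    """LCS length normalised by the longer sequence (an assumed normalisation)."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))


# Hypothetical identifier sequences extracted from two documents.
doc_a = ["x", "y", "z", "E"]
doc_b = ["x", "z", "y", "E"]
score = order_aware_similarity(doc_a, doc_b)
```

Unlike a set-based overlap, which would rate these two sequences as identical, the LCS-based score drops when shared features appear in a different order.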
Automatic text summarisation using linguistic knowledge-based semantics
Text summarisation is reducing a text document to a short substitute summary. Since the commencement of the field, almost all summarisation research works implemented to this date involve identification and extraction of the most important document/cluster segments, called extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both text semantic analysis and equipping computers with an external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variation in CatVar, to improve the summary quality. Such improvements are accomplished through sentence level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness while using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with a background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005, 2006 (DUC 2002, DUC 2005, DUC 2006) Corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers’ performances. Results of our systems showed their effectiveness as compared to related state-of-the-art summarisation methods and baselines. 
Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance
Integrating State-of-the-art NLP Tools into Existing Methods to Address Current Challenges in Plagiarism Detection
Paraphrase plagiarism occurs when text is deliberately obfuscated to evade detection; deliberate alteration increases the complexity of plagiarism and the difficulty of detecting it. In paraphrase plagiarism, copied texts often contain few or no matching words, and conventional plagiarism detectors, most of which are designed to detect matching strings, are ineffective under such conditions. The problem of plagiarism detection has been widely researched in recent years, with significant progress made particularly in the Pan@Clef competition on plagiarism detection. However, further research is required, specifically in the area of paraphrase and translation (obfuscation) plagiarism detection, as studies show that the state of the art is unsatisfactory. A rational solution to the problem is to apply models that detect plagiarism using semantic features in texts, rather than matching strings. Deep contextualised learning models (DCLMs) have the ability to learn deep textual features that can be used to compare texts for semantic similarity. They have been remarkably effective in many natural language processing (NLP) tasks, but have not yet been tested in paraphrase plagiarism detection. The second problem facing conventional plagiarism detection is translation plagiarism, which occurs when copied text is translated to a different language, sometimes paraphrased, and used without acknowledging the original sources. The most common method used for detecting cross-lingual plagiarism (CLP) requires internet translation services, which limits the detection process in many ways. A rational solution to the problem is to use detection models that do not utilise internet translation services. In this thesis we addressed these ongoing challenges facing conventional plagiarism detection by applying some of the most advanced methods in NLP, which include contextualised and non-contextualised deep learning models.
To address the problem of paraphrased plagiarism, we proposed a novel paraphrase plagiarism detector that integrates deep contextualised learning (DCL) into a generic plagiarism detection framework. Evaluation results revealed that our proposed paraphrase detector outperformed a state-of-the-art model and a number of standard baselines in the task of paraphrase plagiarism detection. With respect to CLP detection, we propose a novel multilingual translation model (MTM) based on the Word2Vec (word embedding) model that can effectively translate text across a number of languages; it is independent of the internet and performs comparably to, and in many cases better than, a common cross-lingual plagiarism detection model that relies on an online machine translator. The MTM does not require parallel or comparable corpora; it is therefore designed to resolve the problem of CLPD in low-resource languages. The solutions provided in this research advance the state of the art and contribute to the existing body of knowledge in plagiarism detection, and would also have a positive impact on academic integrity, which has been under threat for a while by plagiarism
MATHEMATICAL LANGUAGE PROCESSING: DEEP LEARNING REPRESENTATIONS AND INFERENCE OVER MATHEMATICAL TEXT
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism
Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012 Palanci
Reformulation and Decomposition: Multitask learning approaches to Long Document Problems
Recent advances in Natural Language Processing (NLP) have led to success across a wide range of tasks including machine translation, summarization, and classification. Yet, the field still faces major challenges. This thesis addresses two key under-researched areas: the absence of general multitask learning capabilities, and the inability to scale to long, complex documents. Firstly, this thesis explores a form of multitasking where NLP tasks are reformulated as question answering problems. I examine existing models and measure their robustness to paraphrasing of their input. I contribute an annotated dataset which enables detailed analysis of model failures as well as evaluation of methods for improving model robustness. Secondly, a set of long document tasks, MuLD, is introduced, which forms a benchmark for evaluating the performance of models on large inputs with long-range dependencies. I show that this is a challenging task for baseline models. I then design an approach using task-decomposition to provide an interpretable solution which easily allows for multitask learning. I then explore how these themes of task reformulation for multitask learning and task-decomposition for long inputs can be applied to other modalities. I show how visual modelling, a visual analogue of language modelling, can be used to predict missing frames from videos of simple physics simulations, and probe what knowledge about the physical world this induces in such models. Finally, I demonstrate how this task can be used to unite vision and NLP using the same framework, describing how task-reformulation and task-decomposition can be used for this purpose
Short Answer Assessment in Context: The Role of Information Structure
Short Answer Assessment (SAA), the computational task of judging the
appropriateness of an answer to a question, has received much attention in recent
years (cf., e.g., Dzikovska et al. 2013; Burrows et al. 2015). Most researchers
have approached the problem as one similar to paraphrase recognition (cf.,
e.g., Brockett & Dolan 2005) or textual entailment (Dagan et al., 2006), where
the answer to be evaluated is aligned to another available utterance, such as a
target answer, in a sufficiently abstract way to capture form variation. While
this is a reasonable strategy, it fails to take the explicit context of an answer
into account: the question.
In this thesis, we present an attempt to change this situation by investigating
the role of Information Structure (IS, cf., e.g., Krifka 2007) in SAA. The basic
assumption adapted from IS here will be that the content of a linguistic
expression is structured in a non-arbitrary way depending on its context (here:
the question), and thus it is possible to predetermine to some extent which
part of the expression’s content is relevant. In particular, we will adopt the
Question Under Discussion (QUD) approach advanced by Roberts (2012) where
the information structure of an answer is determined by an explicit or implicit
question in the discourse.
We proceed by first introducing the reader to the necessary prerequisites
in chapters 2 and 3. Since this is a computational linguistics thesis which
is inspired by theoretical linguistic research, we will provide an overview of
relevant work in both areas, discussing SAA and Information Structure (IS) in
sufficient detail, as well as existing attempts at annotating Information Structure
in corpora. After providing the reader with enough background to understand
the remainder of the thesis, we launch into a discussion of which IS notions and
dimensions are most relevant to our goal. We compare the given/new distinction
(information status) to the focus/background distinction and conclude that the
latter is better suited to our needs, as it captures requested information, which
can be either given or new in the context.
In chapter 4, we introduce the empirical basis of this work, the Corpus of
Reading Comprehension Exercises in German (CREG, Ott, Ziai & Meurers
2012). We outline how as a task-based corpus, CREG is particularly suited to
the analysis of language in context, and how it thus forms the basis of our
efforts in SAA and focus detection. Complementing this empirical basis, we
present the SAA system CoMiC in chapter 5, which is used to integrate focus
into SAA in chapter 8.
Chapter 6 then delves into the creation of a gold standard for automatic
focus detection. We describe what the desiderata for such a gold standard are
and how a subset of the CREG corpus is chosen for manual focus annotation.
Having determined these prerequisites, we proceed in detail to our novel
annotation scheme for focus, and its intrinsic evaluation in terms of inter-
annotator agreement. We also discuss explorations of using crowd-sourcing for
focus annotation.
After establishing the data basis, we turn to the task of automatic focus
detection in short answers in chapter 7. We first define the computational
task as classifying whether a given word of an answer is focused or not. We
experiment with several groups of features and explain in detail the motivation
for each: syntax and lexis of the question and the answer, positional
features and givenness features, taking into account both question and answer
properties. Using the adjudicated gold standard we established in chapter 6, we
show that focus can be detected robustly using these features in a word-based
classifier in comparison to several baselines.
In chapter 8, we describe the integration of focus information into SAA,
which is both an extrinsic testbed for focus annotation and detection per se and
the computational task we originally set out to advance. We show that there
are several possible ways of integrating focus information into an alignment-
based SAA system, and discuss each one’s advantages and disadvantages.
We also experiment with using focus vs. using givenness in alignment before
concluding that a combination of both yields superior overall performance.
Finally, chapter 9 presents a summary of our main research findings along
with the contributions of this thesis. We conclude that analyzing focus in
authentic data is not only possible but necessary for a) developing context-
aware SAA approaches and b) grounding and testing linguistic theory. We give
an outlook on where future research needs to go and what particular avenues
could be explored.
Short Answer Assessment (SAA), the computational linguistics task of judging
the appropriateness of an answer to a question, has been studied extensively in
recent years (see, e.g., Dzikovska et al. 2013; Burrows et al. 2015). The problem
is mostly treated analogously to paraphrase recognition (see, e.g., Brockett &
Dolan 2005) or textual entailment (Dagan et al., 2006), by comparing the answer
to be assessed with a reference answer. This is in principle a sensible approach,
but it disregards the explicit context of an answer: the question.
This thesis presents an approach to changing this state of research by
investigating the role of Information Structure (IS, see, e.g., Krifka 2007) in
SAA. The approach is based on the fundamental assumption of IS that the content
of a linguistic expression is structured in a particular way by its context
(here: the question), and that one can therefore predict to a certain degree
which part of the expression's content is relevant. In particular, the Question
Under Discussion (QUD) approach (Roberts, 2012) is adopted, in which the
information structure of an answer is determined by an explicit or implicit
question in the discourse.
Chapters 2 and 3 first introduce the reader to the research areas relevant to
this dissertation. Since this is a computational linguistics thesis inspired by
theoretical linguistic research, both SAA and IS are discussed in sufficient
depth for the thesis, and an overview of current approaches to annotating IS
categories is given. It is then discussed which notions and distinctions of IS
are central to the goals of this thesis: a comparison of the given/new
distinction and the focus/background distinction shows that the latter is the
more relevant criterion, since it captures requested information, which can be
either given or new in the context.
Chapter 4 introduces the empirical basis of this thesis, the Corpus of Reading
Comprehension Exercises in German (CREG, Ott, Ziai & Meurers 2012). It works out
why a task-based corpus such as CREG is particularly suited to the linguistic
analysis of language in context, and that it therefore forms the basis for the
investigations of SAA and focus analysis presented in this thesis. Chapter 5
presents the SAA system CoMiC (Meurers, Ziai, Ott & Kopp, 2011b), which is used
for the integration of focus into SAA in chapter 8.
Chapter 6 deals with the annotation of a corpus with the aim of manual and
automatic focus analysis. It discusses the criteria on which an approach to
focus annotation can sensibly build, before a new annotation scheme is presented
and applied to a part of CREG. The annotation approach is successfully validated
intrinsically, and besides expert annotation, a crowdsourcing experiment on
focus annotation is also described.
After the data basis has been established, chapter 7 turns to automatic focus
detection in answers. Following an overview of previous work, it first discusses
which relevant properties of questions and answers can be used in an automatic
approach. This is followed by the description of a word-based model for focus
detection, which incorporates features of the syntax and lexis of question and
answer and clearly outperforms several baselines in classification accuracy.
Chapter 8 presents the integration of focus information into SAA using the
CoMiC system, which serves both as an extrinsic validation of manual and
automatic focus analysis and as the computational linguistics task to which this
thesis contributes. Focus is integrated as a filter for the alignment of learner
and target answers in CoMiC, and this configuration is used to examine the
influence of manual and automatic focus annotation, which leads to positive
results. It is also shown that, given reliable focus information, a combination
of focus and givenness yields better results than either category can achieve in
isolation.
Finally, chapter 9 again gives an overview of the content of the thesis and
highlights the main contributions. The conclusion is that focus analysis in
authentic data is both possible and necessary in order to a) incorporate context
into SAA and b) validate and test linguistic theories of IS. Based on the
results, several possible directions for future research are pointed out
- …