A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied on clean texts, but they do not necessarily deliver good performance
against noisy texts. Challenges with paraphrase detection on user generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus.
Detecting Machine-obfuscated Plagiarism
Related dataset is at https://doi.org/10.7302/bewj-qx93 and also listed in the dc.relation field of the full item record. Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available. Peer Reviewed. https://deepblue.lib.umich.edu/bitstream/2027.42/152346/1/Foltynek2020_Paraphrase_Detection.pdf
An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts
Twitter, the most popular microblogging platform, is gaining rapid prominence as a source of
information sharing and social awareness due to its popularity and massive user-generated
content. Its applications include tailoring advertisement campaigns, event
detection, trend analysis, and prediction of micro-populations. These
applications are generally conducted through cluster analysis of tweets to generate a more
concise and organized representation of the massive raw tweets. However, current approaches
perform traditional cluster analysis using conventional proximity measures, such as Euclidean
distance, and the sheer volume, noise, and dynamism of Twitter impose challenges that
hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within
microblogging posts. The research presented in this thesis sets out to design and develop a
novel short text semantic similarity (STSS) measure, named TREASURE, which captures the
semantic and structural features of microblogging posts for intelligently predicting the
similarities. TREASURE is utilised in the development of an innovative semantic-based
cluster analysis algorithm (SBCA) that contributes to generating more accurate and
meaningful granularities within microblogging posts. The integrated semantic-based
framework incorporating TREASURE and the SBCA algorithm both tackles the problem of
microblogging cluster analysis and contributes to the success of a variety of natural language
processing (NLP) and computational intelligence research.
TREASURE utilises word embedding neural network (NN) models to capture the semantic
relationships between words based on their co-occurrences in a corpus. Moreover,
TREASURE analyses the morphological and lexical structure of tweets to predict the syntactic
similarities. An intrinsic evaluation of TREASURE was performed with reference to a reliable
similarity benchmark generated through an experiment to gather human ratings on a Twitter
political dataset. A further evaluation was performed with reference to the SemEval-2014
similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic
evaluation and statistical analysis demonstrated a strong positive linear correlation between
TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a
significantly higher correlation coefficient compared to existing state-of-the-art STSS
measures.
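
The correlation check described above can be illustrated with a minimal sketch. The abstract does not name the coefficient used; Pearson's r is assumed here because the text speaks of a linear correlation. The function name `pearson` and any input scores are hypothetical and do not come from the thesis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists.

    Assumption: xs are similarity scores from a measure and ys are
    averaged human ratings for the same sentence pairs.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would indicate that the measure ranks and spaces sentence pairs much as the human raters do.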
The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike
conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised
and dynamically determines the number of clusters rather than requiring it beforehand. Subjective evaluation criteria
were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity
benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class
benchmark on the European Referendum political domain, which was also utilised to evaluate
the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm
undertakes highly accurate combining and separation decisions and can generate pure clusters
from microblogging posts.
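
To make the idea of clustering without a preset number of clusters concrete, here is a small sketch. It is not the SBCA algorithm, whose details the abstract does not give. It is a generic greedy threshold scheme, and the token-overlap function `jaccard` merely stands in for TREASURE. All names and the threshold value are assumptions for illustration.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a semantic measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def threshold_cluster(posts, similarity, threshold=0.5):
    """Greedy clustering: the number of clusters emerges from the data.

    Each post joins the existing cluster with the highest average
    similarity to its members, provided that average reaches the
    threshold; otherwise the post starts a new cluster.
    """
    clusters = []
    for post in posts:
        best, best_score = None, threshold
        for cluster in clusters:
            score = sum(similarity(post, m) for m in cluster) / len(cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([post])
        else:
            best.append(post)
    return clusters
```

Passing the similarity function as a parameter keeps the clustering loop independent of the proximity measure, so a semantic measure could replace the token overlap without changing the loop.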
The contributions of this thesis to knowledge are mainly demonstrated as: 1) Development
of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new
SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3)
Generation of a pre-trained word embedding model learned from a large corpus of political
tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class
benchmark in the domain of politics.
Identifying Machine-Paraphrased Plagiarism
Employing paraphrasing tools to conceal plagiarized text is a severe threat
to academic integrity. To enable the detection of machine-paraphrased text, we
evaluate the effectiveness of five pre-trained word embedding models combined
with machine learning classifiers and state-of-the-art neural language models.
We analyze preprints of research papers, graduation theses, and Wikipedia
articles, which we paraphrased using different configurations of the tools
SpinBot and SpinnerChief. The best performing technique, Longformer, achieved
an average F1 score of 80.99% (F1=99.68% for SpinBot and F1=71.64% for
SpinnerChief cases), while human evaluators achieved F1=78.4% for SpinBot and
F1=65.6% for SpinnerChief cases. We show that the automated classification
alleviates shortcomings of widely-used text-matching systems, such as Turnitin
and PlagScan. To facilitate future research, all data, code, and two web
applications showcasing our contributions are openly available.
Extracting News Events from Microblogs
The Twitter stream has become a large source of information for many people, but
the magnitude of tweets and the noisy nature of its content have made
harvesting the knowledge from Twitter a challenging task for researchers for a
long time. Aiming at overcoming some of the main challenges of extracting the
hidden information from tweet streams, this work proposes a new approach for
real-time detection of news events from the Twitter stream. We divide our
approach into three steps. The first step is to use a neural network or deep
learning to detect news-relevant tweets from the stream. The second step is to
apply a novel streaming data clustering algorithm to the detected news tweets
to form news events. The third and final step is to rank the detected events
based on the size of the event clusters and growth speed of the tweet
frequencies. We evaluate the proposed system on a large, publicly available
corpus of annotated news events from Twitter. As part of the evaluation, we
compare our approach with a related state-of-the-art solution. Overall, our
experiments and user-based evaluation show that our approach to detecting
current (real) news events delivers state-of-the-art performance.
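
The third step, ranking events by cluster size and growth speed, can be sketched as follows. The abstract does not state how the two signals are combined. The additive score, the `growth_weight` parameter, and the `Event` class are all assumptions made for this example, not the authors' formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    label: str
    counts: List[int]  # tweets per time window, oldest first

    @property
    def size(self) -> int:
        """Total number of tweets in the event cluster."""
        return sum(self.counts)

    @property
    def growth(self) -> float:
        """Average change in tweet count per window (0 for one window)."""
        if len(self.counts) < 2:
            return 0.0
        return (self.counts[-1] - self.counts[0]) / (len(self.counts) - 1)


def rank_events(events, growth_weight=1.0):
    """Order events by size plus weighted growth, highest score first."""
    return sorted(
        events,
        key=lambda e: e.size + growth_weight * e.growth,
        reverse=True,
    )
```

With `growth_weight=0` the ranking falls back to cluster size alone, which makes it easy to see how much the growth term changes the order.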
An Empirical Performance Evaluation of Semantic-Based Similarity Measures in Microblogging Social Media
Measuring textual semantic similarity has been a subject of intense discussion in NLP and AI for many years. A new area of research has emerged that applies semantic similarity measures within Twitter. However, the development of these measures for the semantic analysis of tweets imposes fundamental challenges. The sparsity, ambiguity, and informality present in social media are hampering the performance of traditional textual similarity measures as “tweets” have special syntactic and semantic characteristics. This paper reviews and evaluates the performance of topological, statistical, and hybrid similarity measures, in the context of Twitter analysis. Furthermore, the performance of each measure is compared against a naïve keyword-based similarity computation method to assess the significance of semantic computation in capturing the meaning in tweets. An experiment is designed and conducted to evaluate the different measures through examining various metrics, including correlation, error rates, and statistical tests on a benchmark dataset. The potential weaknesses of semantic similarity measures in relation to Twitter applications of textual similarity assessment and the research contributions are discussed. This research highlights challenges and potential improvement areas for the semantic similarity of tweets, a resource for researchers and practitioners.
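
A keyword-based baseline and an error-rate metric of the kind mentioned above might look like the sketch below. The paper's exact baseline is not specified in the abstract. Cosine similarity over raw term counts and mean absolute error are assumed here as common choices, and both function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def keyword_cosine(a, b):
    """Cosine similarity over whitespace-token counts (no semantics)."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_absolute_error(predicted, gold):
    """Average absolute gap between predicted scores and gold ratings."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)
```

A pair such as "great film" and "good movie" shares no tokens, so this baseline scores it as 0 even though a reader would call the two phrases similar. That gap is the kind of weakness a comparison against semantic measures is meant to expose.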