Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science
Our current knowledge of scholarly plagiarism is largely based on the
similarity between full-text research articles. In this paper, we propose a
novel conceptualization of scholarly plagiarism in the form of
reuse of explicit citation sentences in scientific research articles. Note that
while full-text plagiarism is an indicator of a gross-level behavior, copying
of citation sentences is a more nuanced micro-scale phenomenon observed even
for well-known researchers. The current work poses several interesting
questions and attempts to answer them by empirically investigating a large
bibliographic text dataset from computer science containing millions of lines
of citation sentences. In particular, we report evidence of massive copying
behavior. We also present several striking real examples throughout the paper
to showcase widespread adoption of this undesirable practice. Contrary to
popular perception, we find that the tendency to copy increases as an author
matures. Copying behavior exists in all fields of computer
science; however, the theoretical fields show more copying than the applied
fields.
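The abstract above does not say how reused citation sentences were matched. As a minimal sketch only, assuming a simple word-trigram Jaccard overlap (a stand-in technique, not the authors' method), near-duplicate citation sentences could be flagged as follows. The function names, the letters-only tokenizer (which ignores numeric citation markers such as "[3]"), and the 0.8 threshold are all hypothetical choices.

```python
import re


def shingles(sentence, n=3):
    """Return the set of word n-grams in a sentence.

    Assumption: only alphabetic tokens are kept, so numeric citation
    markers such as "[3]" do not affect the comparison.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_reused(sentences, threshold=0.8):
    """Return index pairs (i, j), i < j, whose shingle overlap meets the threshold."""
    sets = [shingles(s) for s in sentences]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


sentences = [
    "Smith et al. [3] proposed a fast parser for dependency trees.",
    "Smith et al. [7] proposed a fast parser for dependency trees.",
    "We evaluate on a new corpus of tweets.",
]
print(find_reused(sentences))  # [(0, 1)]
```

The pairwise loop is quadratic in the number of sentences, so a dataset with millions of citation sentences would need an indexing step (for example, hashing shingles into buckets) before comparison.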
Automated Crowdturfing Attacks and Defenses in Online Review Systems
Malicious crowdsourcing forums are gaining traction as sources of spreading
misinformation online, but are limited by the costs of hiring and managing
human workers. In this paper, we identify a new class of attacks that leverage
deep learning language models (Recurrent Neural Networks or RNNs) to automate
the generation of fake online reviews for products and services. Not only are
these attacks cheap and therefore more scalable, but they can also control the
rate of content output to eliminate the signature burstiness that makes
crowdsourced campaigns easy to detect.
Using Yelp as an example platform, we show how a two-phase review
generation and customization attack can produce reviews that
state-of-the-art statistical detectors cannot distinguish from real ones. We
conduct a survey-based user study to show that these reviews not only evade
human detection, but are also rated highly for "usefulness" by users. Finally, we develop novel
automated defenses against these attacks, by leveraging the lossy
transformation introduced by the RNN training and generation cycle. We consider
countermeasures against our mechanisms, show that they produce unattractive
cost-benefit tradeoffs for attackers, and that they can be further curtailed by
simple constraints imposed by online service providers.
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence-level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep neural networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence-level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical-level semantics in
order to model sentence-level semantics. These findings demonstrate the
importance of visual information in semantics.
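Retrieval in a common embedding space, as described above, amounts to ranking candidates by their similarity to a query vector. The following is a minimal sketch, assuming cosine similarity as the ranking score and using hand-written toy vectors in place of real encoder outputs. The function names and the example vectors are hypothetical, and the trained image and sentence encoders themselves are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for encoder outputs: one image embedding, three caption embeddings.
image_embedding = [1.0, 0.0]
caption_embeddings = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(retrieve(image_embedding, caption_embeddings))  # [1, 0, 2]
```

The same ranking function serves both directions: pass a caption embedding as the query and image embeddings as candidates to retrieve images for a caption.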
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied to clean texts, but they do not necessarily deliver good performance
on noisy texts. Challenges with paraphrase detection on user-generated
short texts, such as Twitter posts, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus.
Didactic evolution of similarity detection software: the example of Compilatio
Since 2005, Compilatio has been offering tools to help detect and prevent plagiarism. Users of similarity detection software were initially attracted by the ability to track down cheaters. They are now more aware of the tools and services offered to create an environment that encourages the adoption of integrity and citizenship values, especially digital ones. They are aware that plagiarism is not a passing evil to be eradicated, but a deep-seated temptation that each individual must learn to overcome. The technology used to help teachers spot cheating has also evolved. The approach was initially syntactic, comparing texts formally to detect similarities. It then became semantic, using so-called artificial intelligence techniques to find similarities between different words with the same meaning. The issues related to plagiarism prevention illustrate how technology and pedagogy can be used together to train individuals for their future professional and civic life.