UNIDECOR: A Unified Deception Corpus for Cross-Corpus Deception Detection
Verbal deception has been studied in psychology, forensics, and computational
linguistics for a variety of reasons, like understanding behaviour patterns,
identifying false testimonies, and detecting deception in online communication.
Varying motivations across research fields lead to differences in the domain
choices to study and in the conceptualization of deception, making it hard to
compare models and build robust deception detection systems for a given
language. With this paper, we improve this situation by surveying available
English deception datasets which include domains like social media reviews,
court testimonials, opinion statements on specific topics, and deceptive
dialogues from online strategy games. We consolidate these datasets into a
single unified corpus. Based on this resource, we conduct a correlation
analysis of linguistic cues of deception across datasets to understand the
differences and perform cross-corpus modeling experiments which show that
cross-domain generalization is challenging to achieve. The unified deception
corpus (UNIDECOR) can be obtained from
https://www.ims.uni-stuttgart.de/data/unidecor
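The cross-corpus experiments described above follow a leave-one-corpus-out scheme: train on all datasets but one, then evaluate on the held-out domain. A minimal sketch of that loop, using toy corpora and a majority-class baseline in place of a real classifier (all names and data here are illustrative, not from UNIDECOR):

```python
# Leave-one-corpus-out evaluation, sketched with a majority-class
# baseline. Corpus names and (text, label) pairs are illustrative.
from collections import Counter

def majority_baseline(train_labels):
    """Classifier that always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda text: majority

def cross_corpus_accuracy(corpora):
    """For each corpus, train on the others and test on the held-out one."""
    results = {}
    for held_out, test_data in corpora.items():
        train_labels = [label for name, data in corpora.items()
                        if name != held_out for _, label in data]
        clf = majority_baseline(train_labels)
        correct = sum(clf(text) == label for text, label in test_data)
        results[held_out] = correct / len(test_data)
    return results

# Toy corpora standing in for the unified datasets.
corpora = {
    "reviews": [("great product", "deceptive"),
                ("arrived late", "truthful"),
                ("best purchase ever", "deceptive")],
    "court":   [("i was home all night", "deceptive"),
                ("i saw him leave", "truthful"),
                ("the door was open", "truthful")],
}
print(cross_corpus_accuracy(corpora))
```

The low held-out accuracies of even a trivial baseline on mismatched label distributions hint at why cross-domain generalization is hard.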
Creating and detecting fake reviews of online products
Customers increasingly rely on reviews for product information. However, the usefulness of online reviews is impeded by fake reviews that give an untruthful picture of product quality. Therefore, detection of fake reviews is needed. Unfortunately, so far, automatic detection has only had partial success in this challenging task. In this research, we address the creation and detection of fake reviews. First, we experiment with two language models, ULMFiT and GPT-2, to generate fake product reviews based on an Amazon e-commerce dataset. Using the better model, GPT-2, we create a dataset for a classification task of fake review detection. We show that a machine classifier can accomplish this goal near-perfectly, whereas human raters exhibit significantly lower accuracy and agreement than the tested algorithms. The model was also effective at detecting human-generated fake reviews. The results imply that, while fake review detection is challenging for humans, “machines can fight machines” in the task of detecting fake reviews. Our findings have implications for consumer protection, defense of firms from unfair competition, and responsibility of review platforms.
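The “machines can fight machines” idea reduces to a supervised text classifier trained on labeled real vs. generated reviews. As a hedged sketch (not the paper's actual GPT-2-based pipeline), a bag-of-words perceptron on toy data illustrates the setup:

```python
# Toy binary classifier for machine-generated (+1) vs. human (-1)
# reviews. The data and features are illustrative only; the paper's
# detector is far more capable.
def tokenize(text):
    return text.lower().split()

def train_perceptron(data, epochs=10):
    """Train a bag-of-words perceptron on (text, label) pairs."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in data:
            score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
            if score * label <= 0:  # misclassified: nudge weights toward label
                for t in tokenize(text):
                    weights[t] = weights.get(t, 0.0) + label
                bias += label
    return weights, bias

def predict(model, text):
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return +1 if score > 0 else -1

data = [
    ("this product is amazing and perfect and amazing", +1),
    ("i really love this great product overall", +1),
    ("battery died after two weeks, disappointed", -1),
    ("shipping was slow but the item works fine", -1),
]
model = train_perceptron(data)
```

In practice the labeled dataset comes from pairing real reviews with generated ones, exactly the construction the paper describes.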
Detecting and Monitoring Hate Speech in Twitter
Social Media are sensors in the real world that can be used to measure the pulse of societies.
However, the massive and unfiltered feed of messages posted in social media is a phenomenon that
nowadays raises social alarms, especially when these messages contain hate speech targeted to a
specific individual or group. In this context, governments and non-governmental organizations
(NGOs) are concerned about the possible negative impact that these messages can have on individuals
or on the society. In this paper, we present HaterNet, an intelligent system currently being used by
the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that
identifies and monitors the evolution of hate speech in Twitter. The contributions of this research
are fourfold: (1) It introduces the first intelligent system that monitors and visualizes, using social
network analysis techniques, hate speech in Social Media. (2) It introduces a novel public dataset on
hate speech in Spanish consisting of 6000 expert-labeled tweets. (3) It compares several classification
approaches based on different document representation strategies and text classification models. (4)
The best approach consists of a combination of an LSTM+MLP neural network that takes as input the
tweet’s word, emoji, and expression tokens’ embeddings enriched by tf-idf, and obtains an area
under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the
literature.

The work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union’s Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledged.
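The reported AUC of 0.828 is the area under the ROC curve, which equals the probability that a randomly chosen positive (hate-speech) example is scored higher than a randomly chosen negative one, with ties counting half. A small sketch of that pairwise computation (the scores and labels below are illustrative, not HaterNet outputs):

```python
# AUC as the probability that a positive outranks a negative;
# tied scores count as half a win.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.828 therefore means the model ranks a hate-speech tweet above a non-hate tweet about 83% of the time.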
Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods
Machine generated text is increasingly difficult to distinguish from human
authored text. Powerful open-source models are freely available, and
user-friendly tools that democratize access to generative models are
proliferating. ChatGPT, which was released shortly after the first preprint of
this survey, epitomizes these trends. The great potential of state-of-the-art
natural language generation (NLG) systems is tempered by the multitude of
avenues for abuse. Detection of machine generated text is a key countermeasure
for reducing abuse of NLG models, with significant technical challenges and
numerous open problems. We provide a survey that includes both 1) an extensive
analysis of threat models posed by contemporary NLG systems, and 2) the most
complete review of machine generated text detection methods to date. This
survey places machine generated text within its cybersecurity and social
context, and provides strong guidance for future work addressing the most
critical threat models, and ensuring detection systems themselves demonstrate
trustworthiness through fairness, robustness, and accountability.

Comment: Manuscript submitted to ACM Special Session on Trustworthy AI.
DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection
Graph Anomaly Detection (GAD) has recently become a hot research spot due to
its practicability and theoretical value. Since GAD emphasizes applications
and anomalous samples are rare, enriching the variety of its datasets is
fundamental work. Thus, this paper presents DGraph, a real-world dynamic graph
in the finance domain. DGraph overcomes many limitations of current GAD
datasets. It contains about 3M nodes, 4M dynamic edges, and 1M ground-truth
nodes. We provide a comprehensive observation of DGraph, revealing that
anomalous nodes and normal nodes generally have different structures, neighbor
distributions, and temporal dynamics. Moreover, the analysis suggests that unlabeled
nodes are also essential for detecting fraudsters. Furthermore, we conduct
extensive experiments on DGraph. Observation and experiments demonstrate that
DGraph can propel GAD research and enable in-depth exploration of
anomalous nodes.
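One of the observations above, that anomalous and normal nodes differ in structure and neighbor distributions, can be probed with simple degree statistics. A toy sketch on a made-up edge list (node ids and labels are illustrative, not DGraph data):

```python
# Compare mean out-degree of anomalous vs. normal nodes in a
# directed toy graph; edge list and labels are made up.
from collections import defaultdict

def mean_degree_by_label(edges, labels):
    """Mean out-degree per node label over a directed edge list."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
    by_label = defaultdict(list)
    for node, label in labels.items():
        by_label[label].append(degree[node])
    return {lab: sum(d) / len(d) for lab, d in by_label.items()}

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (4, 0)]
labels = {0: "anomalous", 1: "normal", 2: "normal",
          3: "normal", 4: "anomalous"}
print(mean_degree_by_label(edges, labels))
```

On real GAD data, such per-label summaries of degree, neighbor composition, and edge timestamps are a common first step before training a detector.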