42 research outputs found
Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective
Two interlocking research questions of growing interest and importance in
privacy research are Authorship Attribution (AA) and Authorship Obfuscation
(AO). Given an artifact, especially a text t in question, an AA solution aims
to accurately attribute t to its true author out of many candidate authors
while an AO solution aims to modify t to hide its true authorship.
Traditionally, the notion of authorship and its accompanying privacy concerns
applied only to human authors. However, in recent years, due to the explosive
advancements in Neural Text Generation (NTG) techniques in NLP, capable of
synthesizing human-quality open-ended texts (so-called "neural texts"), one has
to now consider authorships by humans, machines, or their combination. Due to
the implications and potential threats of neural texts when used maliciously,
it has become critical to understand the limitations of traditional AA/AO
solutions and develop novel AA/AO solutions in dealing with neural texts. In
this survey, therefore, we make a comprehensive review of recent literature on
the attribution and obfuscation of neural text authorship from a Data Mining
perspective, and share our view on their limitations and promising research
directions.
Comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 202
The Stylometric Processing of Sensory Open Source Data
This research project's ultimate focus is the Lone Wolf Terrorist.
The project uses an exploratory approach to the
self-radicalisation problem by creating a stylistic fingerprint
of a person's personality, or self, from subtle characteristics
hidden in a person's writing style. It separates the identity of
one person from another based on their writing style. It also
separates the writings of suicide attackers from 'normal'
bloggers by critical slowing down; a dynamical property used to
develop early warning signs of tipping points. It identifies
changes in a person's moods, or shifts from one state to another,
that might indicate a tipping point for self-radicalisation.
Research into authorship identity using personality is a
relatively new area in the field of neurolinguistics. There are
very few methods that model how an individual's cognitive
functions present themselves in writing. Here, we develop a
novel algorithm, RPAS, which draws on cognitive functions such as
aging, sensory processing, abstract or concrete thinking through
referential activity emotional experiences, and a person's
internal gender for identity. We use well-known techniques such
as Principal Component Analysis, Linear Discriminant Analysis,
and the Vector Space Method to cluster multiple
anonymous-authored works. Here we use a new approach, using
seriation with noise to separate subtle features in individuals.
We conduct time series analysis using modified variants of 1-lag
autocorrelation and the coefficient of skewness, two statistical
metrics that change near a tipping point, to track serious life
events in an individual through cognitive linguistic markers.
In our journey of discovery, we uncover secrets about the
Elizabethan playwrights hidden for over 400 years. We uncover
markers for depression and anxiety in modern-day writers and
identify linguistic cues for Alzheimer's disease much earlier
than other studies using sensory processing. In using these
techniques on the Lone Wolf, we can separate their writing style
used before their attacks that differs from other writing
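
The two early-warning statistics named in the abstract above, 1-lag autocorrelation and the coefficient of skewness, can be illustrated with a minimal sketch. The thesis uses modified variants that the abstract does not specify, so the code below shows only the standard, unmodified definitions computed over a sliding window. The function names, the window size, and the idea of feeding in one numeric linguistic marker per document are assumptions made for illustration, not the author's RPAS implementation.

```python
from statistics import fmean


def lag1_autocorrelation(xs):
    """Standard sample lag-1 autocorrelation (not the thesis's modified variant)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = fmean(xs)
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # constant series: treat autocorrelation as 0 by convention
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / denom


def skewness(xs):
    """Population coefficient of skewness: third central moment / variance^1.5."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def rolling(xs, window, stat):
    """Apply `stat` to each consecutive window of length `window`."""
    return [stat(xs[i:i + window]) for i in range(len(xs) - window + 1)]


if __name__ == "__main__":
    # Hypothetical marker values, one per dated text by the same writer.
    marker = [0.2, 0.1, 0.3, 0.2, 0.4, 0.5, 0.5, 0.7, 0.8, 0.9]
    print(rolling(marker, 5, lag1_autocorrelation))
    print(rolling(marker, 5, skewness))
```

In the critical-slowing-down literature, a sustained rise in windowed lag-1 autocorrelation or a drift in skewness is read as a possible sign of an approaching tipping point; how the thesis thresholds or modifies these quantities is not stated in the abstract.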
Detecting deceptive behaviour in the wild: text mining for online child protection in the presence of noisy and adversarial social media communications
A real-life application of text mining research "in the wild", i.e. in online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied in the wild typically has no control over the dataset size. Hence, the system has to be robust towards limited data availability, a variable number of samples across users and a highly skewed dataset. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise. Finally, it has to be robust towards deceptive behaviour or adversaries. This thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of methodological design of a state-of-the-art text mining approach is presented to assess its scalability towards a large, imbalanced and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility to (i) identify a social network user's age group and gender based on textual information found in only one single message; (ii) aggregate predictions on the message level to the user level without neglecting potential clues of deception and detect false user profiles on social networks; and (iii) identify child sexual abuse media among thousands of legal other media, including adult pornography, based on their filename. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders' grooming behaviour.
The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identifying grooming behaviour and child sexual abuse media identification.
Computational behavioral analytics: estimating psychological traits in foreign languages.
The rise of technology proliferating into the workplace has increased the threat of loss of intellectual property, classified, and proprietary information for companies, governments, and academics. This can cause economic damage to the creators of new IP, companies, and whole economies. This technology proliferation has also assisted terror groups and lone wolf actors in pushing their message to a larger audience or finding similar tribal groups that share common, sometimes flawed, beliefs across various social media platforms. These types of challenges have created numerous studies in psycholinguistics, as well as commercial tools, that look to assist in identifying potential threats before they have an opportunity to conduct malicious acts. This has led to an area of study that this dissertation defines as "Computational Behavioral Analytics". A common practice espoused in various Natural Language Processing studies (both commercial and academic) conducted on foreign language text is the use of Machine Translation (MT) systems before conducting NLP tasks. In this dissertation, we explore the estimation of three psycholinguistic traits from foreign language text. We explore the effects (and failures) of MT systems in these types of psycholinguistic tasks in order to help push the field of study into a direction that will greatly improve the efficacy of such systems. Given the results of the experimentation in this dissertation, it is highly recommended to avoid the use of translations whenever the greatest levels of accuracy are necessary, such as for National Security and Law Enforcement purposes. If translations must be used for any reason, scientists should conduct a full analysis of the impact of their chosen translation system on their estimates to determine which traits are more significantly affected. This will help ensure that analysts and scientists are better informed of the potential inaccuracies and change any resulting decisions from the data accordingly.
This dissertation introduces psycholinguistics and the benefits of using Machine Learning technologies in estimating various psychological traits, and provides a brief discussion on the potential privacy and legal issues that should be addressed in order to avoid the abuse of such systems in Chapter I. Chapter II outlines the datasets that are used during the experimentation and evaluation of the algorithms. Chapter III discusses each of the various implementations of the algorithms used in the three psycholinguistic tasks - Affect Analysis, Authorship Attribution, and Personality Estimation. Chapter IV discusses the experiments that were run in order to understand the effects of MT on the psycholinguistic tasks, and to understand how these tasks can be accomplished in the face of MT limitations, including rationale on the selection of the MT system used in this study. The dissertation concludes with Chapter V, providing a discussion and speculating on the findings and future experimentation that should be done.
Psychographic Traits Identification based on political ideology: An author analysis study on Spanish politicians' tweets posted in 2020
In general, people are usually more reluctant to follow advice and directions from politicians who do not share their ideology. In extreme cases, people can be heavily biased in favour of a political party at the same time that they are in sharp disagreement with others, which may lead to irrational decision making and can put people's lives at risk by ignoring certain recommendations from the authorities. Therefore, considering political ideology as a psychographic trait can improve political micro-targeting by helping public authorities and local governments to adopt better communication policies during crises. In this work, we explore the reliability of determining psychographic traits concerning political ideology. Our contribution is twofold. On the one hand, we release the PoliCorpus-2020, a dataset composed of Spanish politicians' tweets posted in 2020. On the other hand, we conduct two authorship analysis tasks with the aforementioned dataset: an author profiling task to extract demographic and psychographic traits, and an authorship attribution task to determine the author of an anonymous text in the political domain. Both experiments are evaluated with several neural network architectures grounded on explainable linguistic features, statistical features, and state-of-the-art transformers. In addition, we test whether the neural network models can be transferred to detect the political ideology of citizens. Our results indicate that the linguistic features are good indicators for identifying fine-grained political affiliation, they boost the performance of neural network models when combined with embedding-based features, and they preserve relevant information when the models are tested with ordinary citizens. Besides, we found that lexical and morphosyntactic features are more effective on author profiling, whereas stylometric features are more effective in authorship attribution.
Automatic Image Captioning with Style
This thesis connects two core topics in machine learning, vision
and language. The problem of choice is image caption generation:
automatically constructing natural language descriptions of image
content. Previous research into image caption generation has
focused on generating purely descriptive captions; I focus on
generating visually relevant captions with a distinct linguistic
style. Captions with style have the potential to ease
communication and add a new layer of personalisation.
First, I consider naming variations in image captions, and
propose a method for predicting context-dependent names that
takes into account visual and linguistic information. This method
makes use of a large-scale image caption dataset, which I also
use to explore naming conventions and report naming conventions
for hundreds of animal classes. Next I propose the SentiCap
model, which relies on recent advances in artificial neural
networks to generate visually relevant image captions with
positive or negative sentiment. To balance descriptiveness and
sentiment, the SentiCap model dynamically switches between two
recurrent neural networks, one tuned for descriptive words and
one for sentiment words. As the first published model for
generating captions with sentiment, SentiCap has influenced a
number of subsequent works. I then investigate the sub-task of
modelling styled sentences without images. The specific task
chosen is sentence simplification: rewriting news article
sentences to make them easier to understand.
For this task I design a neural sequence-to-sequence model that
can work with limited training data, using novel adaptations for
word copying and sharing word embeddings. Finally, I present
SemStyle, a system for generating visually relevant image
captions in the style of an arbitrary text corpus. A shared term
space allows a neural network for vision and content planning to
communicate with a network for styled language generation.
SemStyle achieves competitive results in human and automatic
evaluations of descriptiveness and style.
As a whole, this thesis presents two complete systems for styled
caption generation that are first of their kind and demonstrate,
for the first time, that automatic style transfer for image
captions is achievable. Contributions also include novel ideas
for object naming and sentence simplification. This thesis opens
up inquiries into highly personalised image captions; large scale
visually grounded concept naming; and more generally, styled text
generation with content control.
Combating Misinformation in the Age of LLMs: Opportunities and Challenges
Misinformation such as fake news and rumors is a serious threat to
information ecosystems and public trust. The emergence of Large Language Models
(LLMs) has great potential to reshape the landscape of combating
misinformation. Generally, LLMs can be a double-edged sword in the fight. On
the one hand, LLMs bring promising opportunities for combating misinformation
due to their profound world knowledge and strong reasoning abilities. Thus, one
emergent question is: how to utilize LLMs to combat misinformation? On the
other hand, the critical challenge is that LLMs can be easily leveraged to
generate deceptive misinformation at scale. Then, another important question
is: how to combat LLM-generated misinformation? In this paper, we first
systematically review the history of combating misinformation before the advent
of LLMs. Then we illustrate the current efforts and present an outlook for
these two fundamental questions respectively. The goal of this survey paper is
to facilitate the progress of utilizing LLMs for fighting misinformation and
call for interdisciplinary efforts from different stakeholders for combating
LLM-generated misinformation.
Comment: 9 pages for the main paper, 35 pages including 656 references; more
resources on "LLMs Meet Misinformation" are on the website:
https://llm-misinformation.github.io