Adapting Cross-Genre Author Profiling to Language and Corpus: Notebook for PAN at CLEF 2016
This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration at each stage differed across languages.
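As a rough illustration of four of the feature representations named in this abstract (binary, raw frequency, normalized frequency, tf-idf), the sketch below weights the same character n-gram counts in each way. It is not the authors' implementation: it uses plain rather than typed character n-grams, omits SOA and the transition point technique, and assumes one common tf-idf variant (raw count times log(N/df)). The names `char_ngrams` and `represent` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def represent(docs, n=3):
    """Build four weightings of the same n-gram counts for each document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    out = []
    for c in counts:
        total = sum(c.values())
        out.append({
            "binary": {g: 1 for g in c},
            "raw": dict(c),
            "normalized": {g: v / total for g, v in c.items()},
            # Assumed tf-idf variant: raw count * log(N / df).
            "tfidf": {g: v * math.log(n_docs / df[g]) for g, v in c.items()},
        })
    return out


docs = ["abcab", "abd"]
reps = represent(docs, n=2)
```

In this toy example the bigram "ab" occurs in both documents, so its tf-idf weight is zero, while its raw and normalized frequencies in the first document are 2 and 0.5.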
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Multilingual Cross-domain Perspectives on Online Hate Speech
In this report, we present a study of eight corpora of online hate speech, by
demonstrating the NLP techniques that we used to collect and analyze the
jihadist, extremist, racist, and sexist content. Analysis of the multilingual
corpora shows that the different contexts share certain characteristics in
their hateful rhetoric. To expose the main features, we have focused on text
classification, text profiling, keyword and collocation extraction, along with
manual annotation and qualitative study. Comment: 24 pages
Adapting Language Models for Non-Parallel Author-Stylized Rewriting
Given the recent progress in language modeling using Transformer-based neural
models and an active interest in generating stylized text, we present an
approach to leverage the generalization capabilities of a language model to
rewrite an input text in a target author's style. Our proposed approach adapts
a pre-trained language model to generate author-stylized text by fine-tuning on
the author-specific corpus using a denoising autoencoder (DAE) loss in a
cascaded encoder-decoder framework. Optimizing over DAE loss allows our model
to learn the nuances of an author's style without relying on parallel data,
which has been a severe limitation of the previous related works in this space.
To evaluate the efficacy of our approach, we propose a linguistically-motivated
framework to quantify stylistic alignment of the generated text to the target
author at lexical, syntactic and surface levels. The evaluation framework is
both interpretable as it leads to several insights about the model, and
self-contained as it does not rely on external classifiers, e.g. sentiment or
formality classifiers. Qualitative and quantitative assessment indicates that
the proposed approach rewrites the input text with better alignment to the
target style while preserving the original content better than state-of-the-art
baselines. Comment: Accepted for publication in Main Technical Track at AAAI 2
Stylometric Literariness Classification: the Case of Stephen King
This paper applies stylometry to quantify the literariness of 73 novels and novellas by American author Stephen King, chosen as an extraordinary case of a writer who has been dubbed both “high” and “low” in literariness in critical reception. We operationalize literariness using a measure of stylistic distance (Cosine Delta) based on the 1000 most frequent words in two bespoke comparison corpora used as proxies for literariness: one of popular genre fiction, another of National Book Award-winning authors. We report that a supervised model is highly effective in distinguishing the two categories, with 94.6% macro average in a binary classification. We define two subsets of texts by King—“high” and “low” literariness works as suggested by critics and ourselves—and find that a predictive model does identify King’s Dark Tower series and novels such as Dolores Claiborne as among his most “literary” texts, consistent with critical reception, which has also ascribed postmodern qualities to the Dark Tower novels. Our results demonstrate the efficacy of Cosine Delta-based stylometry in quantifying the literariness of texts, while also highlighting the methodological challenges of literariness, especially in the case of Stephen King. The code and data to reproduce our results are available at https://github.com/andreasvc/kingli
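For readers unfamiliar with the measure, Cosine Delta is commonly computed by z-scoring the relative frequencies of the most frequent words across a corpus and then taking the cosine distance between the resulting vectors. The sketch below illustrates that general definition only; it is not the authors' code (which lives in their repository), and the function names, the whitespace tokenization, and the handling of zero vectors are assumptions.

```python
import math
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def cosine_delta(docs, n_mfw=1000):
    """Pairwise Cosine Delta distances for a list of token lists (>= 2 docs)."""
    # Vocabulary: the n_mfw most frequent words in the whole corpus.
    corpus_counts = Counter(t for d in docs for t in d)
    vocab = [w for w, _ in corpus_counts.most_common(n_mfw)]
    freqs = [relative_freqs(d, vocab) for d in docs]

    # z-score each word (column) across documents, using the sample std.
    cols = list(zip(*freqs))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    z = [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in freqs
    ]

    def dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        if nu == 0 or nv == 0:
            return 1.0  # Undefined case; treated as orthogonal (an assumption).
        return 1.0 - dot / (nu * nv)

    return [[dist(u, v) for v in z] for u in z]


docs = [
    "the cat sat on the mat".split(),
    "the cat sat on the mat".split(),
    "a dog ran in a park".split(),
]
d = cosine_delta(docs)
```

In this toy corpus the two identical documents have a distance of about 0, and the third, which shares no words with them, sits at the maximum distance of about 2.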
Bot and gender detection of Twitter accounts using distortion and LSA: notebook for PAN at CLEF 2019
In this work, we present our approach to the Author Profiling task of PAN 2019. The task is divided into two sub-problems, bot detection and gender detection, for two languages: English and Spanish. We address each sub-problem and each language differently. For bot detection, we use an ensemble architecture for accounts that write in English and a single SVM for those that write in Spanish. For gender detection, we use a single SVM architecture for both languages, but we pre-process the tweets in a different way. Our final models achieve accuracy above 90% in the bot detection task and, in the gender detection task, accuracies of 84.17% and 77.61% for English and Spanish, respectively.
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After surveying the state of the art during the first year of CHORUS and establishing the existing landscape in
multimedia search engines, we identified and analyzed gaps within the European research effort during our second year.
In this period we focused on three directions: technological issues; user-centred issues and use cases; and
socio-economic and legal aspects. These were assessed through two central studies: first, a concerted vision of the
functional breakdown of a generic multimedia search engine, and second, representative use-case descriptions with a
related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and
consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several
meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project
coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of
gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical
research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as
emerging legal challenges.