139 research outputs found
Abusive Language Detection in Online Conversations by Combining Content- and Graph-based Features
In recent years, online social networks have allowed worldwide users to meet
and discuss. As guarantors of these communities, the administrators of these
platforms must prevent users from adopting inappropriate behaviors. This
verification task, mainly done by humans, is increasingly difficult due to the
ever-growing number of messages to check. Methods have been proposed to
automate this moderation process, mainly by providing approaches based on the
textual content of the exchanged messages. Recent work has also shown that
characteristics derived from the structure of conversations, in the form of
conversational graphs, can help detect these abusive messages. In this
paper, we take advantage of both sources of information by proposing
fusion methods integrating content- and graph-based features. Our experiments on
raw chat logs show that the content of the messages and their dynamics
within a conversation contain partially complementary information, allowing
performance improvements on an abusive message classification task, with a final
F-measure of 93.26%.
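As a rough illustration of what combining content- and graph-based features can look like, the sketch below shows two generic fusion patterns plus an F-measure computation. The abstract does not specify the fusion methods or the exact F-measure variant, so everything here is an assumption: the function names are hypothetical, the weighted average is only one possible way to merge classifier scores, and the balanced F1 form is assumed.

```python
from typing import List, Sequence


def early_fusion(content_features: Sequence[float],
                 graph_features: Sequence[float]) -> List[float]:
    """Concatenate both feature vectors so a single classifier sees them all."""
    return list(content_features) + list(graph_features)


def late_fusion(content_score: float, graph_score: float,
                content_weight: float = 0.5) -> float:
    """Merge the scores of two separate classifiers by weighted average.

    Assumption: a weighted average is used purely for illustration.
    """
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must lie in [0, 1]")
    return content_weight * content_score + (1.0 - content_weight) * graph_score


def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0  # no positive predictions and no positive labels
    return 2 * true_pos / denominator
```

For instance, `early_fusion([0.1, 0.2], [3.0])` yields one three-dimensional vector, while `late_fusion` keeps the two classifiers independent and only merges their outputs.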
Extraction and Analysis of Dynamic Conversational Networks from TV Series
Identifying and characterizing the dynamics of modern TV series subplots is
an open problem. One way is to study the underlying social network of
interactions between the characters. Standard dynamic network extraction
methods rely on temporal integration, either over the whole considered period,
or as a sequence of several time-slices. However, they turn out to be
inappropriate in the case of TV series, because the scenes shown onscreen
alternately focus on parallel storylines, and do not necessarily respect a
traditional chronology. In this article, we introduce Narrative Smoothing, a
novel network extraction method taking advantage of the plot properties to
solve some of their limitations. We apply our method to a corpus of 3 popular
series, and compare it to both standard approaches. Narrative Smoothing leads
to more relevant observations when it comes to the characterization of the
protagonists and their relationships, confirming its appropriateness to model
the intertwined storylines constituting the plots.
Comment: arXiv admin note: substantial text overlap with arXiv:1602.0781
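The two standard extraction schemes mentioned in the abstract above (integration over the whole period, and a sequence of time-slices) can be sketched as follows. This is only an illustration of those baselines, not of Narrative Smoothing, which the abstract does not describe in detail. The input format (scene index plus two character names) and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Interaction = Tuple[int, str, str]  # (scene index, character A, character B)
Edge = Tuple[str, str]


def _edge(a: str, b: str) -> Edge:
    """Return an undirected edge key with the names in sorted order."""
    return (a, b) if a <= b else (b, a)


def cumulative_network(interactions: Iterable[Interaction]) -> Counter:
    """Integrate all interactions over the whole period into one weighted graph."""
    weights: Counter = Counter()
    for _, a, b in interactions:
        weights[_edge(a, b)] += 1
    return weights


def time_sliced_networks(interactions: Iterable[Interaction],
                         slice_size: int) -> Dict[int, Counter]:
    """Build one weighted graph per window of `slice_size` consecutive scenes."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    slices: Dict[int, Counter] = {}
    for scene, a, b in interactions:
        slices.setdefault(scene // slice_size, Counter())[_edge(a, b)] += 1
    return slices
```

With scenes that alternate between parallel storylines, a pair of characters can be absent from several consecutive slices even though their relationship is ongoing in the plot, which is the kind of limitation the abstract points to.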
Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations
In recent decades, encoder-decoders, or autoencoders (AE), have received great interest from researchers due to their capability to construct robust representations of documents in a low-dimensional subspace. Nonetheless, autoencoders reveal little about the internal structure of a spoken document, since they consider the words or topics contained in the document only as isolated basic elements, and they tend to overfit on small corpora of documents. Therefore, Quaternion Multi-layer Perceptrons (QMLP) have been introduced to capture such internal latent dependencies, whereas denoising autoencoders (DAE) are combined with different stochastic noises to better process small sets of documents. This paper presents a novel autoencoder, called "Quaternion denoising encoder-decoder" (QDAE), based on both the previously proposed DAE (to manage small corpora) and the QMLP (to consider internal latent structures). Moreover, the paper defines an original angular Gaussian noise adapted to the specificity of hyper-complex algebra. The experiments, conducted on a theme identification task of spoken dialogues from the DECODA framework, show that the QDAE obtains promising gains of 3% and 1.5% compared to the standard real-valued denoising autoencoder and the QMLP, respectively.
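For readers unfamiliar with the hyper-complex algebra mentioned above, the sketch below implements the Hamilton product, the standard multiplication rule for quaternions. The abstract does not give the QDAE equations, so this is not the authors' model; the assumption is only that quaternion layers rely on this product, and the function name is hypothetical.

```python
from typing import Tuple

# A quaternion a + b*i + c*j + d*k stored as the tuple (a, b, c, d).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Multiply two quaternions; the operation is not commutative."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )
```

Because each output component mixes all four input components, a quaternion-valued weight couples the four features packed into one quaternion, which is one way to read the "internal latent dependencies" the abstract refers to.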
Automatic Text Summarization Approaches to Speed up Topic Model Learning Process
The number of documents available on the Internet grows every day. For this
reason, processing this amount of information effectively and expressibly
becomes a major concern for companies and scientists. Methods that represent a
textual document by a topic representation are widely used in Information
Retrieval (IR) to process big data such as Wikipedia articles. One of the main
difficulties in using topic models on huge data collections is related to the
material resources (CPU time and memory) required for model estimation. To deal
with this issue, we propose to build topic spaces from summarized documents. In
this paper, we present a study of topic space representation in the context of
big data. The topic space representation behavior is analyzed on different
languages. Experiments show that topic spaces estimated from text summaries are
as relevant as those estimated from the complete documents. The real advantage
of such an approach is the processing time gain: we showed that the processing
time can be drastically reduced using summarized documents (more than 60% in
general). This study finally points out the differences between thematic
representations of documents depending on the targeted languages such as
English or Latin languages.
Comment: 16 pages, 4 tables, 8 figures