Learning discrete word embeddings to achieve better interpretability and processing efficiency
The ubiquitous use of word embeddings in Natural Language Processing is proof of their usefulness and adaptivity to a multitude of tasks. However, their continuous nature is prohibitive in terms of computation, storage and interpretation. In this work, we propose a method of learning discrete word embeddings directly. The model is an adaptation of a novel database searching method using state-of-the-art natural language processing techniques such as Transformers and LSTMs. On top of obtaining embeddings requiring a fraction of the resources to store and process, our experiments strongly suggest that our representations learn basic units of meaning in latent space akin to lexical morphemes. We call these units sememes, i.e., semantic morphemes. We demonstrate that our model has great generalization potential and outputs representations showing strong semantic and conceptual relations between related words.
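The abstract does not describe the discretisation mechanism, so the following is only a minimal, hypothetical sketch of one common way to obtain discrete word codes: quantising each group of a continuous embedding against a small per-group codebook, product-quantization style. The sizes, names, and random codebooks here are illustrative assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a continuous embedding is split into G groups,
# and each group is snapped to its nearest codebook entry. The tuple
# of integer codes is the discrete embedding.
G, D, K = 4, 16, 8              # groups, dims per group, codebook size
codebooks = rng.normal(size=(G, K, D))

def discretize(vec):
    """Map a continuous vector of shape (G*D,) to G integer codes."""
    codes = []
    for g, chunk in enumerate(vec.reshape(G, D)):
        dists = np.linalg.norm(codebooks[g] - chunk, axis=1)
        codes.append(int(dists.argmin()))
    return tuple(codes)

word_vec = rng.normal(size=G * D)
codes = discretize(word_vec)
print(codes)   # G small integers instead of G*D floats
```

Storing `G` integers in `[0, K)` instead of `G*D` floats is what makes such representations cheap to store and index, and each code position can, in principle, specialise into a reusable unit of meaning.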
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)
Social media mental health analysis framework through applied computational approaches
Studies have shown that mental illness burdens not only public health and productivity but also established market economies throughout the world. However, mental disorders are difficult to diagnose and monitor through traditional methods, which heavily rely on interviews, questionnaires and surveys, resulting in high under-diagnosis and under-treatment rates. The increasing use of online social media, such as Facebook and Twitter, is now a common part of people’s everyday life. The continuous and real-time user-generated content often reflects feelings, opinions, social status and behaviours of individuals, creating an unprecedented wealth of person-specific information. With advances in data science, social media has already been increasingly employed in population health monitoring and more recently mental health applications to understand mental disorders as well as to develop online screening and intervention tools. However, existing research efforts are still in their infancy, primarily aimed at highlighting the potential of employing social media in mental health research. The majority of work is developed on ad hoc datasets and lacks a systematic research pipeline. [Continues.]
Event summarization on social media stream: retrospective and prospective tweet summarization
User-generated content on social media, such as Twitter, provides, in many cases, the latest news before traditional media,
which allows having a retrospective summary of events and being updated in a timely fashion whenever a new development
occurs. However, social media, while being a valuable source of information, can also be overwhelming given the volume and
the velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts,
retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary
tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that
capture "what happened". In the latter, systems monitor the live posts stream and push relevant and novel notifications as
soon as possible.
Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest, helping users efficiently acquire information and follow the development of long ongoing events on social media. Nevertheless, the tweet summarization task faces many challenges that stem from, on the one hand, the high volume, velocity and variety of the published information and, on the other hand, the quality of tweets, which can vary significantly.
In prospective notification, the core task is real-time relevance and novelty detection. Regarding timeliness, a system may push new updates immediately or may trade timeliness for higher notification quality. Our contributions address these levels. First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of word embeddings: we use word similarity instead of traditional weighting techniques, thereby mitigating the shortness and word mismatch issues in tweets. The intuition behind our proposal is that the context-aware similarity measure of word2vec can match different words carrying the same meaning, offsetting the vocabulary mismatch when computing the similarity between a tweet and a topic. Second, we propose to compute the novelty score of an incoming tweet against all words of the tweets already pushed to the user, instead of using pairwise tweet-to-tweet comparison; this novelty detection method scales better and reduces execution time, which suits real-time tweet filtering. Third, we propose an adaptive Learning to Filter approach that leverages social signals as well as query-dependent features. To overcome the issue of setting a relevance threshold, we use a binary classifier that predicts the relevance of each incoming tweet. In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) with low latency, whereas state-of-the-art approaches tend to trade latency for higher quality.
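The vocabulary-based novelty idea described above can be sketched in a few lines. The scoring rule below (fraction of a tweet's terms not yet seen in any pushed tweet) is an illustrative assumption; the thesis may use a different measure, but the key point, a single vocabulary set replacing pairwise comparisons, is the same.

```python
# Sketch: novelty as the fraction of an incoming tweet's terms not yet
# seen in any pushed tweet. One vocabulary set makes each check
# O(tweet length), independent of how many tweets were already pushed.
pushed_vocab = set()

def novelty(tweet_terms):
    """Return the fraction of terms unseen in previously pushed tweets."""
    if not tweet_terms:
        return 0.0
    unseen = sum(1 for t in tweet_terms if t not in pushed_vocab)
    return unseen / len(tweet_terms)

def push(tweet_terms):
    """Record a pushed tweet's terms in the shared vocabulary."""
    pushed_vocab.update(tweet_terms)

push(["earthquake", "hits", "city"])
print(novelty(["earthquake", "magnitude", "reported"]))  # 2 of 3 terms unseen
```

Compared with comparing the incoming tweet against every pushed tweet, the per-tweet cost no longer grows with the number of notifications already sent, which is what makes the method fit a live stream.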
This thesis also explores a novel approach to generating a retrospective summary, following a different paradigm from the majority of state-of-the-art methods. We consider summary generation as an optimization problem that takes into account topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types, namely topical clusters based on content similarity and temporal clusters based on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is to be maximized, and the constraints ensure that at most one post per cluster is selected, subject to a predefined summary length limit.
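The integer-program formulation above can be illustrated on a toy instance. Since the abstract gives no objective weights, the scores and cluster assignments below are made up, and a brute-force search over the binary variables stands in for a real ILP solver; it only demonstrates the constraint structure (at most one tweet per topical and per temporal cluster, at most L tweets overall, maximise total score).

```python
from itertools import product

# Hypothetical instance: each tweet has a relevance score, a topical
# cluster and a temporal cluster.
scores   = [0.9, 0.7, 0.8, 0.4]
topical  = [0, 0, 1, 2]
temporal = [0, 1, 1, 2]
L = 2                        # summary length limit

def feasible(x):
    """Check the ILP constraints for a binary selection vector x."""
    for clusters in (topical, temporal):
        seen = set()
        for xi, c in zip(x, clusters):
            if xi:
                if c in seen:          # two tweets in the same cluster
                    return False
                seen.add(c)
    return sum(x) <= L                 # summary length limit

# Brute force over all binary assignments (a solver would do this at scale).
best = max((x for x in product([0, 1], repeat=len(scores)) if feasible(x)),
           key=lambda x: sum(s * xi for s, xi in zip(scores, x)))
print(best)  # (1, 0, 1, 0): tweets 0 and 2 cover distinct clusters
```

Here tweets 0 and 1 share a topical cluster and tweets 1 and 2 share a temporal one, so the optimum picks tweets 0 and 2, exactly the kind of redundancy the cluster constraints are meant to enforce.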
Context-Aware Message-Level Rumour Detection with Weak Supervision
Social media has become the main source of all sorts of information, well beyond a communication medium. Its intrinsic nature allows a continuous and massive flow of misinformation to make a severe impact worldwide. In particular, rumours emerge unexpectedly and spread quickly, and it is challenging to track down their origins and stop their propagation. One of the most promising solutions is to identify rumour-mongering messages as early as possible, which is commonly referred to as "Early Rumour Detection (ERD)". This dissertation focuses on ERD on social media by exploiting weak supervision and contextual information. Weak supervision is a branch of machine learning in which noisy and less precise sources (e.g. data patterns) are leveraged to compensate for limited high-quality labelled data (Ratner et al., 2017), reducing the cost and increasing the efficiency of hand-labelling large-scale data. This thesis aims to study whether identifying rumours before they go viral is possible and to develop an architecture for ERD at the individual post level. To this end, it first explores major bottlenecks of current ERD. It also uncovers a research gap between system design and real-world application, which has received less attention from the ERD research community. One bottleneck is limited labelled data: weakly supervised methods to augment limited labelled training data for ERD are introduced. The other bottleneck is the enormous amount of noisy data: a framework unifying burst detection based on temporal signals and burst summarisation is investigated to identify potential rumours (i.e. input to rumour detection models) by filtering out uninformative messages. Finally, a novel method that jointly learns from rumour sources and their contexts (i.e. conversational threads) for ERD is proposed. An extensive evaluation setting for ERD systems is also introduced
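In the weak-supervision setting the abstract refers to (Ratner et al., 2017), noisy heuristic sources such as text patterns each vote on a label or abstain, and the votes are aggregated into training labels. The labeling functions and the majority-vote aggregation below are a simplified, hypothetical sketch of that idea, not the dissertation's actual heuristics.

```python
# Hypothetical labeling functions: each returns 1 (rumour), 0 (not a
# rumour), or None (abstain). Patterns here are illustrative only.
def lf_hedging(post):
    return 1 if any(w in post.lower() for w in ("unconfirmed", "allegedly")) else None

def lf_source_cited(post):
    return 0 if "according to" in post.lower() else None

def lf_question(post):
    return 1 if post.strip().endswith("?") else None

def weak_label(post, lfs=(lf_hedging, lf_source_cited, lf_question)):
    """Aggregate non-abstaining votes by simple majority."""
    votes = [v for lf in lfs if (v := lf(post)) is not None]
    if not votes:
        return None            # all functions abstained: leave unlabelled
    return 1 if sum(votes) * 2 > len(votes) else 0

print(weak_label("Allegedly, the bridge has collapsed?"))   # two rumour votes
```

Real systems typically replace the majority vote with a generative label model that estimates each source's accuracy, but the pipeline shape, many noisy voters producing one training label per post, is the same.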
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges