IRIT at TREC Microblog 2015
This paper presents the participation of the IRIT laboratory (University of Toulouse) in the Microblog Track of TREC 2015. This track consists of a real-time filtering task that aims to monitor a stream of social media posts in accordance with a user's interest profile. In this context, our team proposes three approaches: (a) a novel selective summarization approach that decides whether to select or ignore each tweet based on novelty and redundancy factors, without using external knowledge, (b) a processing workflow that indexes tweets in real time, enhanced by a notification and digest method guided by diversity and user personalization, and (c) a step-by-step stream selection method focused on speed that takes into account tweet similarity as well as several features covering content, entities, and user-related aspects. For all these approaches, we discuss the results obtained during the experimental evaluation.
Report on the Second International Workshop on the Evaluation on Collaborative Information Seeking and Retrieval (ECol'2017 @ CHIIR)
The 2nd Workshop on the Evaluation of Collaborative Information Retrieval and Seeking (ECol) was held in conjunction with the ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR) in Oslo, Norway. The workshop focused on the challenges and difficulties of researching and studying collaborative information retrieval and seeking (CIR/CIS). After an introductory, scene-setting overview of developments in CIR/CIS, participants were challenged to devise a range of possible CIR/CIS tasks that could be used for evaluation purposes. Through the brainstorming and discussions, valuable insights regarding the evaluation of CIR/CIS tasks became apparent: for particular tasks, efficiency and/or effectiveness matter most; for the majority of tasks, however, the success and quality of outcomes, along with knowledge sharing and sense-making, matter most, and these latter attributes are much more difficult to measure and evaluate. Thus the major challenge for CIR/CIS research is to develop methods, measures, and methodologies to evaluate these higher-order attributes.
On Relevance Filtering for Real-Time Tweet Summarization
Real-time tweet summarization (RTS) systems require mechanisms for capturing relevant, novel, and timely tweets. In this thesis, we tackle the RTS problem with a main focus on relevance filtering. We experimented with different traditional retrieval models.
Additionally, we propose two extensions to alleviate the sparsity and topic-drift challenges that affect relevance filtering. For sparsity, we propose leveraging word embeddings in Vector Space Model (VSM) term weighting, enabling the system to use semantic similarity alongside lexical matching. To mitigate the effect of topic drift, we exploit explicit relevance feedback to enhance the profile representation so that it copes with the topic's development in the stream over time.
We conducted extensive experiments over three standard English TREC test collections that were built specifically for RTS. Although the extensions do not generally exhibit better performance, they are comparable to the baselines used.
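The embedding-augmented matching described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy vectors, function names, and the max-similarity fallback are all assumptions. Each query term scores 1.0 on an exact lexical match and otherwise falls back to its best embedding similarity against the tweet's terms.

```python
import math

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def soft_match_score(query_terms, tweet_terms, emb):
    # average per-query-term match: exact lexical match scores 1.0,
    # otherwise take the best embedding similarity to any tweet term
    if not query_terms:
        return 0.0
    total = 0.0
    for q in query_terms:
        if q in tweet_terms:
            total += 1.0
        elif q in emb:
            total += max((cosine(emb[q], emb[t]) for t in tweet_terms if t in emb),
                         default=0.0)
    return total / len(query_terms)

# toy 2-d embeddings, invented for illustration
emb = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}
soft_match_score(["car"], ["automobile"], emb)  # high despite no lexical overlap
```

The point of the sketch is that a semantically close term ("automobile") scores almost as high as an exact match, while an unrelated term ("banana") contributes nothing.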
Moreover, we extended EveTAR, an Arabic tweets test collection for event detection, to support tasks that require novelty in the system's output. We collected novelty judgments using in-house annotators and used the collection to test our RTS system. We report preliminary results on EveTAR using different models of the RTS system. This work was made possible by NPRP grants #NPRP 7-1313-1-245 and #NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation).
Event summarization on social media stream: retrospective and prospective tweet summarization
User-generated content on social media, such as Twitter, often provides the latest news before traditional media,
which allows users to get a retrospective summary of events and to be updated in a timely fashion whenever a new development occurs. However, social media, while a valuable source of information, can also be overwhelming given the volume and velocity of published information. To shield users from irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary information-seeking tasks on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened". In the latter, systems monitor the live post stream and push relevant and novel notifications as soon as possible.
Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims to provide summaries that capture the key aspects of the event of interest, helping users efficiently acquire information and follow the development of long, ongoing events on social media. Nevertheless, the tweet summarization task faces many challenges that stem, on the one hand, from the high volume, velocity, and variety of the published information and, on the other hand, from the quality of tweets, which can vary significantly.
In prospective notification, the core task is real-time relevance and novelty detection. For timeliness, a system may choose to push new updates in real time, or to trade timeliness for higher notification quality. Our contributions address these levels. First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of word embeddings. We use word similarity instead of traditional term-weighting techniques, which helps overcome the shortness and word-mismatch issues in tweets. The intuition behind our proposition is that the context-aware similarity measure of word2vec can match different words with the same meaning, and hence offsets the word-mismatch issue when computing the similarity between a tweet and a topic. Second, we propose to compute the novelty score of an incoming tweet against the words of all tweets already pushed to the user, instead of using pairwise comparison. The proposed novelty detection method scales better and reduces execution time, which fits real-time tweet filtering. Third, we propose an adaptive learning-to-filter approach that leverages social signals as well as query-dependent features. To overcome the issue of setting a relevance threshold, we use a binary classifier that predicts the relevance of the incoming tweet. In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) with a low latency cost, whereas state-of-the-art approaches tend to trade latency for higher quality.
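The word-level novelty scoring described above, which compares an incoming tweet to the accumulated vocabulary of already-pushed tweets rather than to each pushed tweet pairwise, can be sketched as follows. The function names and the 0.5 threshold are illustrative assumptions, not the thesis's actual parameters.

```python
def novelty_score(tweet_terms, pushed_vocab):
    # fraction of the tweet's terms not yet seen in any pushed tweet
    if not tweet_terms:
        return 0.0
    unseen = sum(1 for t in tweet_terms if t not in pushed_vocab)
    return unseen / len(tweet_terms)

def maybe_push(tweet_terms, pushed_vocab, threshold=0.5):
    # push only sufficiently novel tweets, then grow the shared vocabulary:
    # one set-membership pass per tweet instead of pairwise comparisons
    if novelty_score(tweet_terms, pushed_vocab) >= threshold:
        pushed_vocab.update(tweet_terms)
        return True
    return False

vocab = set()
maybe_push(["earthquake", "chile"], vocab)  # True: everything is new
maybe_push(["earthquake", "chile"], vocab)  # False: fully redundant now
```

Because the state is a single word set, the cost per incoming tweet is linear in its length, independent of how many tweets were already pushed, which is what makes the method fit real-time filtering.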
This thesis also explores a novel approach to generating a retrospective summary that follows a different paradigm from the majority of state-of-the-art methods. We cast summary generation as an optimization problem that takes into account topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types: topical clusters based on content similarity, and temporal clusters based on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is maximized, and the constraints ensure that at most one post per cluster is selected, within the predefined summary length limit.
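For a small candidate pool, the integer linear program above (binary selection variables, maximize total score, at most one tweet per cluster, bounded summary length) can be solved by brute-force enumeration. This sketch is for illustration only; the tuple layout and scores are invented, and a real system would use an ILP solver rather than enumeration.

```python
from itertools import combinations

def best_summary(candidates, max_len):
    """Brute-force the small ILP: maximize the total relevance score subject to
    at most one tweet per cluster and at most max_len tweets overall."""
    best, best_score = (), -1.0
    for k in range(1, max_len + 1):
        for combo in combinations(candidates, k):
            clusters = [c for _, c, _ in combo]
            if len(set(clusters)) != len(clusters):
                continue  # violates the one-tweet-per-cluster constraint
            score = sum(s for _, _, s in combo)
            if score > best_score:
                best, best_score = combo, score
    return list(best)

# (tweet_id, cluster_id, relevance_score) -- invented toy data
tweets = [("t1", "c1", 0.9), ("t2", "c1", 0.4), ("t3", "c2", 0.7), ("t4", "c3", 0.2)]
best_summary(tweets, 2)  # picks t1 and t3: one per cluster, highest total score
```

Enumeration is exponential in the pool size, which is exactly why the thesis formulates the problem as an ILP and hands it to a solver; the sketch only makes the constraints concrete.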
BroDyn'18: Workshop on analysis of broad dynamic topics over social media
This book constitutes the refereed proceedings of the 40th European Conference on IR Research, ECIR 2018, held in Grenoble, France, in March 2018.
The 39 full papers and 39 short papers, presented together with 6 demos, 5 workshops, and 3 tutorials, were carefully reviewed and selected from 303 submissions. Accepted papers cover the state of the art in information retrieval, including topics such as topic modeling, deep learning, evaluation, user behavior, document representation, recommendation systems, retrieval methods, learning and classification, and micro-blogs.
Définition et évaluation de modèles d'agrégation pour l'estimation de la pertinence multidimensionnelle en recherche d'information
The main research topic of this document revolves around the information retrieval (IR) field. Traditional IR models rank documents by computing single scores separately with respect to one objective criterion. Recently, a growing number of IR studies has triggered a resurgence of interest in redefining the algorithmic estimation of relevance, implying a shift from topical to multidimensional relevance assessment. In our work, we specifically address the multidimensional relevance assessment and evaluation problems. To tackle this challenge, state-of-the-art approaches are often based on linear combination mechanisms. However, these methods rely on the unrealistic hypothesis of additivity and independence of the relevance dimensions, which makes them unsuitable in many real situations where criteria are correlated. Other techniques from the machine learning area have also been proposed; they learn a model from example inputs and generalize it to combine the different criteria. Nonetheless, these methods tend to offer only limited insight into how to weigh the importance of, and the interaction between, the criteria. Beyond the sensitivity of the parameters used within these algorithms, it is quite difficult to understand why one criterion is preferred over another. To address this problem, we propose a model based on a multi-criteria aggregation operator that overcomes the additivity problem. Our model is based on a fuzzy measure that offers semantic interpretations of the correlations and interactions between the criteria. We adapted this model to multidimensional relevance estimation in two scenarios: (i) a tweet search task and (ii) two personalized IR settings. The second line of research focuses on integrating the temporal factor into the aggregation process, in order to account for the changes of document collections over time.
To do so, we proposed a time-aware IR model that combines the temporal relevance criterion with the topical one. We then performed a time-series analysis to identify the temporal nature of queries, and we proposed an evaluation framework within a time-aware IR setting.
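Aggregation with respect to a fuzzy measure, as described in the abstract above, is commonly realized with a Choquet integral. The following is a minimal sketch under that assumption; the criteria names and measure values are invented for illustration and are not the thesis's actual model.

```python
def choquet(scores, mu):
    """Aggregate criterion scores with a Choquet integral w.r.t. fuzzy measure mu.
    mu maps frozensets of criteria to [0, 1], with mu(empty set) = 0 and
    mu(all criteria) = 1; non-additive mu captures criterion interactions."""
    remaining = set(scores)
    prev = 0.0
    total = 0.0
    # walk criteria in increasing order of their scores
    for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
        total += (value - prev) * mu[frozenset(remaining)]
        prev = value
        remaining.discard(name)
    return total

# toy additive measure: here the Choquet integral reduces to a weighted average;
# choosing mu({"topical", "temporal"}) != 0.4 + 0.6 would model an interaction
mu = {frozenset(): 0.0,
      frozenset({"topical"}): 0.4,
      frozenset({"temporal"}): 0.6,
      frozenset({"topical", "temporal"}): 1.0}
choquet({"topical": 0.2, "temporal": 0.5}, mu)  # 0.38 = 0.4*0.2 + 0.6*0.5
```

Setting the measure of the full set above or below the sum of the singletons is what lets the operator reward or penalize criteria that are satisfied jointly, which a linear combination cannot express.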
Extraction de Localisations dans les MicroBlogs
Information circulates faster and faster. Applications such as WhatsApp or Twitter make it possible to exchange information about events almost instantly. They are valuable resources from which event information (time, location, or the entities involved) can be extracted. Here we focus on the location aspect, which has many applications, both in geospatial tools and in personalized recommendation. In the microblog context, standard natural language processing tools are not sufficient given the form of the messages; for example, tweets are not linguistically well-formed. The large number of messages to process is a further challenge. In this article, we present a model that predicts whether a microblog post (tweet) contains a location, and we show that this prediction improves the effectiveness of location extraction from tweets.
A Method for Short Message Contextualization: Experiments at CLEF/INEX
This paper presents the approach we developed for automatic multi-document summarization applied to short-message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting, and sentence quality measurement. In contrast to previous research, we introduce a smoothing algorithm based on the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF Tweet Contextualization track; we provide evaluation results over the four years of the track. The method was also adapted to snippet retrieval and query expansion. The evaluation results indicate good performance of the approach.
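The graph-based sentence reordering mentioned above can be illustrated with a greedy sketch: sentences are nodes, edges are weighted by a similarity measure, and the output order chains each sentence to its most similar unused neighbor. The Jaccard word-overlap similarity and the greedy traversal are assumptions for illustration, not the paper's actual algorithm.

```python
def word_overlap(a, b):
    # Jaccard similarity over lowercased word sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def reorder(sentences):
    # start from the first sentence, then greedily append the unused
    # sentence most similar to the previous one, so adjacent sentences
    # in the output tend to share vocabulary
    if not sentences:
        return []
    ordered = [sentences[0]]
    remaining = list(sentences[1:])
    while remaining:
        prev = ordered[-1]
        nxt = max(remaining, key=lambda s: word_overlap(prev, s))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered

reorder(["the cat sat", "a dog barked loudly", "the cat purred"])
# groups the two cat sentences before the dog sentence
```

Greedy chaining is only a local heuristic; a full graph formulation could instead search for the highest-weight path through the sentence graph.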