
    IRIT at TREC Microblog 2015

    This paper presents the participation of the IRIT laboratory (University of Toulouse) in the Microblog Track of TREC 2015. This track consists of a real-time filtering task that aims to monitor a stream of social media posts according to a user's interest profile. In this context, our team proposes three approaches: (a) a novel selective summarization approach that decides whether to select or ignore each tweet based on novelty and redundancy factors, without using external knowledge; (b) a processing workflow that indexes tweets in real time, enhanced by a notification and digest method guided by diversity and user personalization; and (c) a step-by-step stream selection method focused on speed that takes into account tweet similarity as well as several features covering content, entities and user-related aspects. For all these approaches, we discuss the results obtained during the experimental evaluation.
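
    As a rough illustration of the select/ignore decision in approach (a), the sketch below pushes a tweet only when it is sufficiently relevant to the profile and not redundant with tweets already selected. The Jaccard similarity and both thresholds are assumptions made for the example, not the parameters of the IRIT runs.

        # Illustrative sketch (not the IRIT systems' exact logic): decide whether to
        # select or ignore an incoming tweet from relevance and redundancy factors.
        def jaccard(a: set, b: set) -> float:
            return len(a & b) / len(a | b) if a | b else 0.0

        def decide(tweet_terms: set, profile_terms: set, selected: list,
                   rel_threshold: float = 0.2, red_threshold: float = 0.6) -> bool:
            relevance = jaccard(tweet_terms, profile_terms)          # match to profile
            redundancy = max((jaccard(tweet_terms, s) for s in selected), default=0.0)
            return relevance >= rel_threshold and redundancy < red_threshold

        selected: list[set] = []
        for tweet in [{"storm", "toulouse", "damage"}, {"storm", "toulouse"}]:
            if decide(tweet, {"storm", "toulouse"}, selected):
                selected.append(tweet)
        print(len(selected), "tweet(s) selected")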

    Report on the Second International Workshop on the Evaluation on Collaborative Information Seeking and Retrieval (ECol'2017 @ CHIIR)

    The 2nd Workshop on the Evaluation of Collaborative Information Retrieval and Seeking (ECol) was held in conjunction with the ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR) in Oslo, Norway. The workshop focused on discussing the challenges and difficulties of researching and studying collaborative information retrieval and seeking (CIR/CIS). After an introductory, scene-setting overview of developments in CIR/CIS, participants were challenged to devise a range of possible CIR/CIS tasks that could be used for evaluation purposes. Through the brainstorming and discussions, valuable insights regarding the evaluation of CIR/CIS tasks became apparent: for particular tasks, efficiency and/or effectiveness is most important, but for the majority of tasks the success and quality of outcomes, along with knowledge sharing and sense-making, mattered most; these latter attributes are much more difficult to measure and evaluate. Thus the major challenge for CIR/CIS research is to develop methods, measures and methodologies to evaluate these higher-order attributes.

    ON RELEVANCE FILTERING FOR REAL-TIME TWEET SUMMARIZATION

    Real-time tweet summarization (RTS) systems require mechanisms for capturing relevant tweets, identifying novel ones, and doing so in a timely fashion. In this thesis, we tackle the RTS problem with a main focus on relevance filtering. We experimented with different traditional retrieval models. Additionally, we propose two extensions to alleviate the sparsity and topic-drift challenges that affect relevance filtering. For sparsity, we propose leveraging word embeddings in Vector Space Model (VSM) term weighting so that the system can use semantic similarity alongside lexical matching. To mitigate the effect of topic drift, we exploit explicit relevance feedback to enhance the profile representation and cope with its evolution in the stream over time. We conducted extensive experiments over three standard English TREC test collections that were built specifically for RTS. Although the extensions do not generally exhibit better performance, they are comparable to the baselines used. Moreover, we extended EveTAR, an Arabic tweet test collection for event detection, to support tasks that require novelty in the system's output. We collected novelty judgments using in-house annotators and used the collection to test our RTS system. We report preliminary results on EveTAR using different models of the RTS system. This work was made possible by NPRP grants # NPRP 7-1313-1-245 and # NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation).
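
    As a rough sketch of the first extension, the toy example below combines lexical overlap with embedding-based term similarity when scoring a tweet against a profile. The tiny hand-made word vectors and the interpolation weight alpha are illustrative assumptions; the thesis integrates embeddings into VSM term weighting rather than through this simple interpolation.

        # Minimal sketch: semantic similarity alongside lexical matching.
        import numpy as np

        EMB = {  # hypothetical 3-d word vectors, for illustration only
            "earthquake": np.array([0.90, 0.10, 0.00]),
            "quake":      np.array([0.85, 0.15, 0.05]),
            "chile":      np.array([0.10, 0.90, 0.20]),
        }

        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

        def score(profile_terms, tweet_terms, alpha=0.5):
            # lexical component: fraction of profile terms matched exactly
            lexical = len(set(profile_terms) & set(tweet_terms)) / len(set(profile_terms))
            # semantic component: best embedding similarity for each profile term
            semantic = np.mean([
                max((cos(EMB[p], EMB[t]) for t in tweet_terms if t in EMB and p in EMB),
                    default=0.0)
                for p in profile_terms
            ])
            return alpha * lexical + (1 - alpha) * semantic

        print(score(["earthquake", "chile"], ["quake", "chile"]))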

    Event summarization on social media stream: retrospective and prospective tweet summarization

    User-generated content on social media such as Twitter often carries the latest news before traditional media, which makes it possible to build a retrospective summary of an event and to stay updated in a timely fashion whenever a new development occurs. However, while social media is a valuable source of information, it can also be overwhelming given the volume and velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened". In the latter, systems monitor the live post stream and push relevant and novel notifications as soon as possible. Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest, to help users efficiently acquire information and follow the development of long-running events on social media. Nevertheless, the tweet summarization task faces many challenges that stem, on the one hand, from the high volume, velocity and variety of the published information and, on the other hand, from the quality of tweets, which can vary significantly. In prospective notification, the core task is relevance and novelty detection in real time. A system may choose to push new updates in real time or may trade timeliness for higher notification quality. Our contributions address these different levels. First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of word embeddings. We use word similarity instead of traditional weighting techniques and thereby address the shortness and word-mismatch issues in tweets. The intuition behind our proposal is that the context-aware similarity measure provided by word2vec can match different words with the same meaning and hence offsets the word-mismatch issue when computing the similarity between a tweet and a topic. Second, we propose to compute the novelty score of an incoming tweet against the words of all tweets already pushed to the user, instead of using pairwise tweet-to-tweet comparison. The proposed novelty detection method scales better and reduces execution time, which suits real-time tweet filtering. Third, we propose an adaptive Learning to Filter approach that leverages social signals as well as query-dependent features. To overcome the issue of setting a relevance threshold, we use a binary classifier that predicts the relevance of the incoming tweet.
In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) at a low latency cost, whereas state-of-the-art approaches tend to trade latency for higher quality. This thesis also explores a novel approach to generating a retrospective summary that follows a different paradigm than the majority of state-of-the-art methods. We consider summary generation as an optimization problem that takes into account topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types, namely topical clusters based on content similarity and temporal clusters based on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is maximized, and the constraints ensure that at most one post per cluster is selected while respecting the defined summary length limit.
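
    The retrospective formulation above can be illustrated as a small integer linear program: binary selection variables, a maximized relevance objective, at most one tweet per topical or temporal cluster, and a cap on summary length. The sketch below uses PuLP and toy scores/clusters as assumptions for the example; it is not the thesis implementation.

        # Sketch of the ILP-based summary selection described in the abstract.
        import pulp

        scores = [0.9, 0.7, 0.6, 0.4]          # relevance score of each candidate tweet
        topical_clusters = [[0, 1], [2, 3]]     # indices grouped by content similarity
        temporal_clusters = [[0, 2], [1, 3]]    # indices grouped by publication time
        max_len = 2                             # summary length limit (in tweets)

        prob = pulp.LpProblem("retrospective_summary", pulp.LpMaximize)
        x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(len(scores))]

        prob += pulp.lpSum(scores[i] * x[i] for i in range(len(scores)))  # objective
        for cluster in topical_clusters + temporal_clusters:
            prob += pulp.lpSum(x[i] for i in cluster) <= 1                # one per cluster
        prob += pulp.lpSum(x) <= max_len                                  # length limit

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        summary = [i for i in range(len(scores)) if x[i].value() == 1]
        print("selected tweet indices:", summary)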

    BroDyn’18: Workshop on analysis of broad dynamic topics over social media

    This book constitutes the refereed proceedings of the 40th European Conference on IR Research, ECIR 2018, held in Grenoble, France, in March 2018. The 39 full papers and 39 short papers, presented together with 6 demos, 5 workshops and 3 tutorials, were carefully reviewed and selected from 303 submissions. Accepted papers cover the state of the art in information retrieval, including topics such as topic modeling, deep learning, evaluation, user behavior, document representation, recommendation systems, retrieval methods, learning and classification, and microblogs.

    Définition et évaluation de modÚles d'agrégation pour l'estimation de la pertinence multidimensionnelle en recherche d'information

    The main research topic of this document revolves around the field of information retrieval (IR). Traditional IR models rank documents by computing single scores separately with respect to one single objective criterion. Recently, a growing number of IR studies have triggered a resurgence of interest in redefining the algorithmic estimation of relevance, which implies a shift from topical to multidimensional relevance assessment. In our work, we specifically address the multidimensional relevance assessment and evaluation problems. To tackle this challenge, state-of-the-art approaches are often based on linear combination mechanisms. However, these methods rely on the unrealistic hypothesis of additivity and independence of the relevance dimensions, which makes them unsuitable in many real situations where criteria are correlated. Other techniques from the machine learning area have also been proposed; they learn a model from example inputs and generalize it to combine the different criteria. Nonetheless, these methods tend to offer only limited insight into how to account for the importance of, and the interaction between, the criteria. In addition to the sensitivity of the parameters used within these algorithms, it is quite difficult to understand why one criterion is preferred over another. To address this problem, we proposed a model based on a multi-criteria aggregation operator that overcomes the additivity problem. Our model is based on a fuzzy measure that offers semantic interpretations of the correlations and interactions between the criteria. We adapted this model to multidimensional relevance estimation in two scenarios: (i) a tweet search task and (ii) two personalized IR settings. The second line of research focuses on the integration of the temporal factor in the aggregation process, in order to account for changes in document collections over time. To do so, we proposed a time-aware IR model that combines the temporal relevance criterion with the topical one. We then performed a time series analysis to identify the temporal nature of queries, and we proposed an evaluation framework within a time-aware IR setting.

    Extraction de Localisations dans les MicroBlogs

    Information now circulates faster than ever. Applications such as WhatsApp or Twitter make it possible to exchange information about events almost instantly. They are valuable resources from which information about events (time, location or entity involved) can be extracted. Here we focus on the location aspect, which has many applications both in geospatial tools and in personalized recommendation. In the context of microblogs, the tools developed for natural language processing are not sufficient given the form of the messages; for example, tweets are not linguistically well-formed. Moreover, the large number of messages to process is also a challenge. In this article, we present a model for predicting whether a microblog post (tweet) contains a location or not, and we show that this prediction improves the effectiveness of location extraction from tweets.
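
    A minimal sketch of the filtering idea: a binary classifier predicts whether a tweet mentions a location, and the more expensive location extraction step runs only on tweets predicted as positive. The TF-IDF features, logistic regression model and toy data below are assumptions for the example, not the features or classifier of the paper.

        # Sketch: filter tweets before running location extraction.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        train_tweets = [
            "Flooding reported near Toulouse this morning",
            "Traffic jam on the A61 close to Carcassonne",
            "I love this song so much",
            "Just finished my homework, finally",
        ]
        has_location = [1, 1, 0, 0]   # toy labels: does the tweet mention a location?

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        clf.fit(train_tweets, has_location)

        for tweet in ["Fire spotted in downtown Lyon", "So tired today"]:
            if clf.predict([tweet])[0] == 1:
                print("run location extraction on:", tweet)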

    A Method for Short Message Contextualization: Experiments at CLEF/INEX

    This paper presents the approach we developed for automatic multi-document summarization applied to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measurement. In contrast to previous research, we introduce an algorithm for smoothing from the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF Tweet Contextualization track, and we provide the evaluation results over the four years of the track. The method was also adapted to snippet retrieval and query expansion. The evaluation results indicate good performance of the approach.
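
    As a rough illustration of the sentence-scoring ingredients mentioned above, the sketch below weights candidate sentences by part-of-speech category with a bonus for named entities; the weight table and the pre-tagged toy sentences are assumptions, and the actual method additionally measures sentence quality, smooths scores from the local context and reorders sentences with a graph-based algorithm.

        # Sketch: score candidate sentences by POS weights plus a named-entity bonus.
        POS_WEIGHTS = {"NOUN": 1.0, "PROPN": 1.2, "VERB": 0.6, "ADJ": 0.4, "OTHER": 0.1}
        NE_BONUS = 0.5

        def sentence_score(tokens):
            """tokens: list of (word, pos_tag, is_named_entity) triples."""
            score = 0.0
            for _, pos, is_ne in tokens:
                score += POS_WEIGHTS.get(pos, POS_WEIGHTS["OTHER"])
                if is_ne:
                    score += NE_BONUS
            return score / max(len(tokens), 1)   # normalize by sentence length

        sentences = {
            "The earthquake struck Chile on Saturday": [
                ("The", "OTHER", False), ("earthquake", "NOUN", False),
                ("struck", "VERB", False), ("Chile", "PROPN", True),
                ("on", "OTHER", False), ("Saturday", "PROPN", True)],
            "It was really quite something": [
                ("It", "OTHER", False), ("was", "VERB", False),
                ("really", "OTHER", False), ("quite", "OTHER", False),
                ("something", "NOUN", False)],
        }
        ranked = sorted(sentences, key=lambda s: sentence_score(sentences[s]), reverse=True)
        print("best contextualizing sentence:", ranked[0])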