12 research outputs found

    Explicit diversification of event aspects for temporal summarization

    During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but are semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent works in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the amount of redundant and off-topic snippets returned, while also increasing summary timeliness
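
    As a rough illustration of the explicit diversification idea in the abstract above, the following Python sketch shows a greedy selection loop that trades off snippet relevance against coverage of event aspects not yet covered. The abstract does not name the framework it extends, so the xQuAD-style scoring formula, the `lam` trade-off parameter, and every aspect name and probability here are assumptions made for illustration only.

```python
def select_snippets(candidates, aspect_weights, k, lam=0.5):
    """Greedy xQuAD-style selection (an illustrative assumption, not the paper's exact method).

    candidates: dict snippet_id -> (relevance, {aspect: coverage in [0, 1]})
    aspect_weights: dict aspect -> importance of the aspect for the event type
    """
    selected = []
    # not_covered[a] = estimated probability that aspect a is still uncovered
    not_covered = {a: 1.0 for a in aspect_weights}
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(item):
            rel, cov = item[1]
            diversity = sum(
                aspect_weights[a] * cov.get(a, 0.0) * not_covered[a]
                for a in aspect_weights
            )
            return (1 - lam) * rel + lam * diversity

        best_id, (rel, cov) = max(remaining.items(), key=score)
        selected.append(best_id)
        # Aspects covered by the chosen snippet contribute less in later rounds
        for a in aspect_weights:
            not_covered[a] *= 1.0 - cov.get(a, 0.0)
        del remaining[best_id]
    return selected


# Hypothetical snippets for an earthquake event
candidates = {
    "s1": (0.9, {"casualties": 0.9}),
    "s2": (0.85, {"casualties": 0.9}),
    "s3": (0.6, {"response": 0.8}),
}
aspect_weights = {"casualties": 0.5, "response": 0.5}
summary = select_snippets(candidates, aspect_weights, k=2)
```

    In this toy input, `s2` is more relevant than `s3` but repeats the aspect already covered by `s1`, so the second pick goes to `s3`. A purely novelty-based selector would have no such aspect signal to work with.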

    Détection d'informations vitales pour la mise à jour de bases de connaissances

    Updating a knowledge base is a current problem that follows the constant evolution of the web of linked data. Many approaches have been proposed to extract, from text documents, the knowledge to be updated. These approaches are reaching maturity but rely on the assumption that a suitable corpus has already been assembled. In most cases, the documents to consider are selected manually, which makes an exhaustive update of the base difficult. In this article, we propose an original approach that automatically identifies, in a stream of web documents, the items likely to bring new knowledge about instances already represented in a base.

    Accelerating the update of knowledge base instances by detecting vital information from a document stream

    In this paper we aim at filtering documents containing timely relevant information about an entity (e.g., a person, a place, an organization) from a document stream. These documents, which we call vital documents, provide relevant and fresh information about the entity. The approach we propose leverages the temporal information reflected by the temporal expressions in the document in order to infer its vitality. Experiments carried out on the 2013 TREC Knowledge Base Acceleration (KBA) collection show the effectiveness of our approach compared to state-of-the-art ones.
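
    The abstract above does not say how freshness is computed from temporal expressions, so the following Python sketch is only one possible reading. It assumes dates appear in ISO `YYYY-MM-DD` form and are well-formed; the `freshness_score` function, the `window_days` parameter, and the 0.5 decision threshold are all hypothetical.

```python
import re
from datetime import date

# Assumption: temporal expressions are plain ISO dates; a real system would
# need a temporal tagger to resolve expressions such as "last Tuesday".
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def freshness_score(text, published, window_days=3):
    """Share of date mentions that fall within `window_days` of publication."""
    mentions = [date(int(y), int(m), int(d)) for y, m, d in ISO_DATE.findall(text)]
    if not mentions:
        return 0.0
    fresh = sum(1 for t in mentions if abs((published - t).days) <= window_days)
    return fresh / len(mentions)


doc = "The mayor resigned on 2013-03-02. She was first elected on 2009-11-15."
score = freshness_score(doc, published=date(2013, 3, 3))
is_vital = score >= 0.5  # hypothetical decision threshold
```

    Here one of the two date mentions is close to the publication date, so the document sits right at the assumed threshold. A document that only mentions old dates would score 0.0.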

    Filtering News from Document Streams: Evaluation Aspects and Modeled Stream Utility

    Events like hurricanes, earthquakes, or accidents can impact a large number of people. Not only are people in the immediate vicinity of the event affected, but concerns about their well-being are shared by the local government and well-wishers across the world. The latest information about news events could be of use to government and aid agencies in order to make informed decisions on providing necessary support, security and relief. The general public avails of news updates via dedicated news feeds or broadcasts, and lately, via social media services like Facebook or Twitter. Retrieving the latest information about newsworthy events from the world-wide web is thus of importance to a large section of society. As new content on a multitude of topics is continuously being published on the web, specific event-related information needs to be filtered from the resulting stream of documents. In this thesis, we present a user-centric evaluation measure for evaluating systems that filter news-related information from document streams. Our proposed evaluation measure, Modeled Stream Utility (MSU), models users accessing information from a stream of sentences produced by a news update filtering system. The user model allows for simulating a large number of users with different characteristic stream browsing behavior. Through simulation, MSU estimates the utility of a system for an average user browsing a stream of sentences. Our results show that system performance is sensitive to a user population's stream browsing behavior and that existing evaluation metrics correspond to very specific types of user behavior. To evaluate systems that filter sentences from a document stream, we need a set of judged sentences. This judged set is a subset of all the sentences returned by all systems, and is typically constructed by pooling together the highest quality sentences, as determined by respective system-assigned scores for each sentence.
Sentences in the pool are manually assessed and the resulting set of judged sentences is then used to compute system performance metrics. In this thesis, we investigate the effect on system performance evaluation of including duplicates of judged sentences in the judged set. We also develop an alternative pooling methodology that, given the MSU user model, selects sentences for pooling based on the probability of a sentence being read by modeled users. Our research lays the foundation for interesting future work on utilizing user models in different aspects of the evaluation of stream filtering systems. The MSU measure enables incorporation of different user models. Furthermore, the applicability of MSU could be extended through calibration based on user behavior.

    Design and Evaluation of Temporal Summarization Systems

    Temporal Summarization (TS) is a new track introduced as part of the Text REtrieval Conference (TREC) in 2013. This track aims to develop systems which can return important updates related to an event over time. In TREC 2013, the TS track specifically used disaster related events such as earthquake, hurricane, bombing, etc. This thesis mainly focuses on building an effective TS system by using a combination of Information Retrieval techniques. The developed TS system returns updates related to disaster related events in a timely manner. By participating in TREC 2013 and with experiments conducted after TREC, we examine the effectiveness of techniques such as distributional similarity for term expansion, which can be employed in building TS systems. Also, this thesis describes the effectiveness of other techniques such as stemming, adaptive sentence selection over time and de-duplication in our system, by comparing it with other baseline systems. The second part of the thesis examines the current methodology used for evaluating TS systems. We propose a modified evaluation method which could reduce the manual effort of assessors, and also correlates well with the official track’s evaluation. We also propose a supervised learning based evaluation method, which correlates well with the official track’s evaluation of systems and could save the assessor’s time by as much as 80%

    Filtrage et agrégation d'informations vitales relatives à des entités

    Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources to access information on a wide variety of entities (an entity is a thing that can be distinctly identified, such as a person, an organization, a product, an event, etc.). However, the update of these sources with new information related to a given entity is done manually by contributors, with significant latency, particularly if that entity is not popular. A system that analyzes documents when they are published on the Web to filter important information about entities will probably accelerate the update of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, concerning the entities. We aim at answering the following two issues: (1) How to detect if a document is vital (i.e., it provides timely relevant information) to an entity? and (2) How to extract vital information from these documents to build a temporal summary about the entity that can be seen as a reference for updating the corresponding knowledge base entry? Regarding the first issue, we proposed two methods. The first proposal is fully supervised. It is based on a vitality language model. The second proposal measures the freshness of temporal expressions in a document to decide its vitality. Concerning the second issue, we proposed a method that selects the sentences based on the presence of trigger words automatically retrieved from the knowledge already represented in the knowledge base (such as the description of similar entities). We carried out our experiments on the TREC Stream corpus 2013 and 2014 with 1.2 billion documents and different types of entities (persons, organizations, facilities and events). For vital document filtering approaches, we conducted our experiments in the context of the "Knowledge Base Acceleration (KBA)" task for the years 2013 and 2014.
Our method based on leveraging the temporal expressions in the document obtained good results, outperforming the best participant system in the KBA 2013 task. In addition, we showed the importance of our generated temporal summaries to accelerate the update of knowledge bases. This work falls within information retrieval but also aims to enrich knowledge engineering techniques by helping to select the information to be processed. To evaluate the contributions concerning the extraction of vital information about entities, we relied on the experimental framework of the "Temporal Summarization (TS)" task, and we showed that our approach helps minimize the latency of knowledge base updates.

    Supervised extractive summarisation of news events

    This thesis investigates whether the summarisation of news-worthy events can be improved by using evidence about entities (i.e., people, places, and organisations) involved in the events. More effective event summaries, that better assist people with their news-based information access requirements, can help to reduce information overload in today's 24-hour news culture. Summaries are based on sentences extracted verbatim from news articles about the events. Within a supervised machine learning framework, we propose a series of entity-focused event summarisation features. Computed over multiple news articles discussing a given event, such entity-focused evidence estimates: the importance of entities within events; the significance of interactions between entities within events; and the topical relevance of entities to events. The statement of this research work is that by augmenting supervised summarisation models, which are trained on discriminative multi-document newswire summarisation features, with evidence about the named entities involved in the events, through the integration of entity-focused event summarisation features, we will obtain more effective summaries of news-worthy events. The proposed entity-focused event summarisation features are thoroughly evaluated over two multi-document newswire summarisation scenarios. The first scenario is used to evaluate the retrospective event summarisation task, where the goal is to summarise an event to-date, based on a static set of news articles discussing the event. The second scenario is used to evaluate the temporal event summarisation task, where the goal is to summarise the changes in an ongoing event, based on a time-stamped stream of news articles discussing the event. The contributions of this thesis are two-fold.
First, this thesis investigates the utility of entity-focused event evidence for identifying important and salient event summary sentences, and as a means to perform anti-redundancy filtering to control the volume of content emitted as a summary of an evolving event. Second, this thesis also investigates the validity of automatic summarisation evaluation metrics, the effectiveness of standard summarisation baselines, and the effective training of supervised machine learned summarisation models

    A Time-Aware Approach to Improving Ad-hoc Information Retrieval from Microblogs

    There is an immense number of short-text documents produced as the result of microblogging. The content produced is growing as the number of microbloggers grows, and as active microbloggers continue to post millions of updates. The range of topics discussed is so vast, that microblogs provide an abundance of useful information. In this work, the problem of retrieving the most relevant information in microblogs is addressed. Interesting temporal patterns were found in the initial analysis of the study. Therefore the focus of the current work is to first exploit a temporal variable in order to see how effectively it can be used to predict the relevance of the tweets and, then, to include it in a retrieval weighting model along with other tweet-specific features. Generalized Linear Mixed-effect Models (GLMMs) are used to analyze the features and to propose two re-ranking models. These two models were developed through an exploratory process on a training set and then were evaluated on a test set

    Combining heterogeneous sources in an interactive multimedia content retrieval model

    Interactive multimodal information retrieval systems (IMIR) increase the capabilities of traditional search systems, by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval. This model includes formal and widespread definitions of each component of an IMIR system. A use case that focuses on information retrieval regarding sports validates the model, by developing a prototype that implements a subset of the features of the model. Adaptive techniques applied to the retrieval functionality of IMIR systems have been defined by analysing past interactions using decision trees, neural networks, and clustering techniques. This model includes a strategy for selecting sources and combining the results obtained from every source. After modifying the strategy of the prototype for selecting sources, the system is reevaluated using classification techniques. This work was partially supported by eGovernAbility-Access project (TIN2014-52665-C2-2-R).

    Event summarization on social media stream: retrospective and prospective tweet summarization

    User-generated content on social media, such as Twitter, often provides the latest news before traditional media, which makes it possible to obtain a retrospective summary of events and to be updated in a timely fashion whenever a new development occurs. However, social media, while being a valuable source of information, can also be overwhelming given the volume and the velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened". In the latter, systems monitor the live post stream and push relevant and novel notifications as soon as possible. Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest to help users efficiently acquire information and follow the development of long ongoing events from social media. Nevertheless, the tweet summarization task faces many challenges that stem from, on one hand, the high volume, the velocity and the variety of the published information and, on the other hand, the quality of tweets, which can vary significantly. In prospective notification, the core task is relevance and novelty detection in real time. For timeliness, a system may choose to push new updates in real time or may choose to trade timeliness for higher notification quality.
Our contributions address these levels: First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of a word embedding model. We used word similarity instead of the traditional weighting techniques. By doing this, we overcome the shortness and word mismatch issues in tweets. The intuition behind our proposition is that the context-aware similarity measure in word2vec is able to consider different words with the same semantic meaning and hence allows offsetting the word mismatch issue when calculating the similarity between a tweet and a topic. Second, we propose to compute the novelty score of the incoming tweet with respect to all words of the tweets already pushed to the user instead of using pairwise comparison. The proposed novelty detection method scales better and reduces the execution time, which fits real-time tweet filtering. Third, we propose an adaptive Learning to Filter approach that leverages social signals as well as query-dependent features. To overcome the issue of relevance threshold setting, we use a binary classifier that predicts the relevance of the incoming tweet. In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and we show that the proposed approach achieves a promising performance in terms of quality (relevance and novelty) at a low latency cost, whereas the state-of-the-art approaches tend to trade latency for higher quality. This thesis also explores a novel approach to generate a retrospective summary that follows a different paradigm than the majority of state-of-the-art methods. We consider the summary generation as an optimization problem that takes into account the topical and the temporal diversity.
Tweets are filtered and incrementally clustered into two cluster types, namely topical clusters based on content similarity and temporal clusters that depend on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is to be maximized, and constraints ensure that at most one post per cluster is selected with respect to the defined summary length limit.
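
    To make the integer linear formulation in the abstract above more concrete, the following Python sketch solves a tiny instance of the 0/1 selection problem by exhaustive search rather than with an ILP solver. It is a sketch under stated assumptions: the objective is taken to be the sum of per-tweet scores, the length limit is counted in tweets, and the `best_summary` function, tweet ids, scores, and cluster labels are all hypothetical.

```python
from itertools import combinations


def best_summary(tweets, max_len):
    """Exhaustively solve the 0/1 selection problem for a small candidate set.

    tweets: list of (tweet_id, score, topical_cluster, temporal_cluster)
    Constraints: at most one tweet per topical cluster, at most one per
    temporal cluster, and at most `max_len` tweets in total.
    """
    best, best_score = (), 0.0
    for size in range(1, max_len + 1):
        for subset in combinations(tweets, size):
            topical = {t[2] for t in subset}
            temporal = {t[3] for t in subset}
            if len(topical) < size or len(temporal) < size:
                continue  # some cluster would contribute more than one tweet
            total = sum(t[1] for t in subset)
            if total > best_score:
                best, best_score = subset, total
    return [t[0] for t in best], best_score


# Hypothetical candidates: (id, score, topical cluster, temporal cluster)
tweets = [
    ("t1", 0.9, "A", "h1"),
    ("t2", 0.8, "A", "h2"),
    ("t3", 0.7, "B", "h1"),
    ("t4", 0.4, "B", "h2"),
]
ids, total = best_summary(tweets, max_len=2)
```

    In this toy instance, taking the highest-scoring tweet `t1` first would rule out both `t2` (same topical cluster) and `t3` (same temporal cluster), which is why an exact formulation can beat a greedy pass. Exhaustive search is only practical for very small candidate sets; a real system would hand the same constraints to an ILP solver.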