7 research outputs found

    Explicit diversification of event aspects for temporal summarization

    During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of events. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned, while also increasing summary timeliness.
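
    The explicit diversification step described above can be illustrated with a greedy, xQuAD-style selection loop that trades event relevance against coverage of aspects not yet represented in the summary. The sketch below is only an illustration of that general idea under assumed inputs (the `relevance` and `coverage` scores, the `lam` trade-off, and the function names are hypothetical), not the paper's exact framework.

```python
# Hypothetical greedy, aspect-aware snippet selection (xQuAD-style sketch).
# `relevance[s]` and `coverage[s][a]` are assumed to come from upstream
# relevance estimation and aspect-classification components.

def diversify(snippets, relevance, coverage, aspects, k, lam=0.5):
    """Greedily pick k snippets, balancing relevance and aspect coverage."""
    selected = []
    # Probability that each aspect is still NOT covered by the summary so far.
    uncovered = {a: 1.0 for a in aspects}
    candidates = set(snippets)
    while candidates and len(selected) < k:
        def gain(s):
            diversity = sum(coverage[s][a] * uncovered[a] for a in aspects)
            return (1 - lam) * relevance[s] + lam * diversity
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
        for a in aspects:
            uncovered[a] *= 1.0 - coverage[best][a]
    return selected
```

    Each pick multiplies down the "still uncovered" mass of the aspects it covers, so later picks are steered toward aspects the summary has not yet addressed.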

    Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval

    Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking documents, or short passages, in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms, such as a person's name or a product model number, not seen during training, in order to avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections, such as the document index of a commercial Web search engine, containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. (PhD thesis, University College London, 2020.)
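
    As a toy illustration of the lexical-versus-semantic trade-off discussed above (not an architecture from the thesis), a ranker can mix an exact-match score, which lets rare terms such as model numbers contribute even when unseen during training, with an embedding-based cosine similarity. The function names and the `alpha` mixing weight are assumptions.

```python
import numpy as np

def lexical_score(query_terms, doc_terms):
    """Fraction of query terms that appear verbatim in the document."""
    doc = set(doc_terms)
    return sum(t in doc for t in query_terms) / max(len(query_terms), 1)

def semantic_score(query_vec, doc_vec):
    """Cosine similarity between dense query and document representations."""
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
    return float(query_vec @ doc_vec / denom) if denom else 0.0

def hybrid_score(query_terms, doc_terms, query_vec, doc_vec, alpha=0.5):
    # alpha is a hypothetical weight between lexical and semantic evidence.
    return (alpha * lexical_score(query_terms, doc_terms)
            + (1 - alpha) * semantic_score(query_vec, doc_vec))
```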

    Supervised extractive summarisation of news events

    This thesis investigates whether the summarisation of news-worthy events can be improved by using evidence about entities (i.e., people, places, and organisations) involved in the events. More effective event summaries, which better assist people with their news-based information access requirements, can help to reduce information overload in today's 24-hour news culture. Summaries are based on sentences extracted verbatim from news articles about the events. Within a supervised machine learning framework, we propose a series of entity-focused event summarisation features. Computed over multiple news articles discussing a given event, such entity-focused evidence estimates: the importance of entities within events; the significance of interactions between entities within events; and the topical relevance of entities to events. The statement of this research is that by augmenting supervised summarisation models, which are trained on discriminative multi-document newswire summarisation features, with evidence about the named entities involved in the events, through the integration of entity-focused event summarisation features, we will obtain more effective summaries of news-worthy events. The proposed entity-focused event summarisation features are thoroughly evaluated over two multi-document newswire summarisation scenarios. The first scenario is used to evaluate the retrospective event summarisation task, where the goal is to summarise an event to date, based on a static set of news articles discussing the event. The second scenario is used to evaluate the temporal event summarisation task, where the goal is to summarise the changes in an ongoing event, based on a time-stamped stream of news articles discussing the event. The contributions of this thesis are two-fold. First, it investigates the utility of entity-focused event evidence for identifying important and salient event summary sentences, and as a means to perform anti-redundancy filtering to control the volume of content emitted as a summary of an evolving event. Second, it investigates the validity of automatic summarisation evaluation metrics, the effectiveness of standard summarisation baselines, and the effective training of supervised machine-learned summarisation models.
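
    To make the entity-focused evidence more concrete, the following is a minimal, assumed sketch of how two such features might be computed for a candidate sentence: the importance of its entities across all articles about the event, and the number of entity interactions (co-occurring pairs) within the sentence. The feature definitions and names are illustrative, not the thesis's exact formulations, and entity extraction is assumed to be done upstream by a named-entity recogniser.

```python
from collections import Counter

def entity_features(sentence_entities, event_entity_counts, total_mentions):
    """sentence_entities: set of entities found in the candidate sentence.
    event_entity_counts: Counter of entity mentions across the event's articles."""
    if not sentence_entities:
        return {"entity_importance": 0.0, "entity_interaction": 0.0}
    # Importance: how often the sentence's entities are mentioned event-wide.
    importance = (sum(event_entity_counts[e] for e in sentence_entities)
                  / max(total_mentions, 1))
    # Interaction: number of co-occurring entity pairs inside the sentence.
    pairs = len(sentence_entities) * (len(sentence_entities) - 1) / 2
    return {"entity_importance": importance, "entity_interaction": pairs}

# Hypothetical usage over a toy event.
event_counts = Counter({"Obama": 12, "FEMA": 5, "New Orleans": 8})
feats = entity_features({"Obama", "FEMA"}, event_counts,
                        total_mentions=sum(event_counts.values()))
```

    Features of this kind would sit alongside the standard multi-document newswire summarisation features when training the supervised sentence-selection model.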

    Personalized information retrieval based on time-sensitive user profile

    Search engines have recently become the main source of information for many users and are widely used in different fields. However, Information Retrieval Systems (IRS) face new challenges due to the growth and diversity of available data. An IRS analyses the query submitted by the user and explores collections of unstructured or semi-structured data (e.g. text, images, video, Web pages) in order to deliver items that best match the user's intent and interests. To achieve this goal, IRSs no longer consider query-document matching alone but also take the user's context into account. Indeed, the user profile has been considered in the literature as the most important contextual element for improving search accuracy; it is integrated into the information retrieval process to improve the user experience while searching for specific information. As the time factor has gained increasing importance in recent years, temporal dynamics have been introduced to study the evolution of the user profile, which consists mainly in capturing changes in the user's behavior, interests, and preferences over time and updating the profile accordingly. Prior work distinguishes between two types of user profile: short-term and long-term. The first is limited to interests related to the user's current activities, while the second represents the user's persistent interests extracted from earlier activities, excluding recent ones. However, for users who are not very active, whose activities are few and spread out over time, a short-term profile can eliminate relevant results that are closely related to their personal interests. For very active users, aggregating recent activities without discarding older interests is valuable, because this kind of profile usually evolves over time. Unlike those approaches, we propose in this thesis a generic, time-sensitive user profile that is implicitly constructed as a vector of weighted terms in order to find a trade-off by unifying both current and recurrent interests. User profile information can be extracted from multiple sources; among the most promising, we propose to use, on the one hand, the search history and, on the other hand, social media. Search-history data can be extracted implicitly, without any effort from the user, and includes issued queries, their corresponding results, reformulated queries, and click-through data with relevance-feedback potential. The popularity of social media, in turn, makes it an invaluable source of data that users employ to express, share, and mark as favorites the content that interests them. First, we modeled the user profile not only according to the content of the user's activities but also according to their freshness, under the assumption that terms used recently in the user's activities convey new interests, preferences, and thoughts and should therefore carry more weight than older interests; indeed, much prior work has shown that user interest decreases over time. We built this time-sensitive user profile from a dataset collected from Twitter (a social networking and microblogging service) and integrated it into a re-ranking process in order to personalize the standard retrieved results according to the user's interests. Second, we studied temporal dynamics within session search, where recently submitted queries carry additional information that better explains the user's intent and indicates that the user has not found the information sought with previous queries. We therefore integrated current and recurrent interactions within a single session model, giving more weight to terms appearing in recently submitted queries and their clicked results. Experiments based on the TREC 2013 Session track and the ClueWeb12 collection showed the effectiveness of our approach compared to state-of-the-art ones. Overall, through these contributions and experiments, we show that our generic time-sensitive user profile model ensures better personalization performance and helps to analyze user behavior in both session-search and social-media contexts.
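
    A minimal sketch of the general idea of a time-sensitive profile built as a vector of weighted terms, assuming an exponential decay of term weight with activity age and a simple linear interpolation for re-ranking. The decay function, `half_life_days`, and `beta` are assumptions for illustration, not the thesis's actual parameterization.

```python
import math
from collections import defaultdict

def build_profile(activities, now, half_life_days=30.0):
    """activities: iterable of (timestamp_in_days, list_of_terms) pairs."""
    profile = defaultdict(float)
    for timestamp, terms in activities:
        age = now - timestamp
        # Recent activities contribute more weight than old ones.
        decay = math.exp(-math.log(2) * age / half_life_days)
        for term in terms:
            profile[term] += decay
    return profile

def rerank(results, profile, beta=0.5):
    """results: list of (doc_id, original_score, doc_terms) tuples."""
    def personalized(item):
        doc_id, score, terms = item
        overlap = sum(profile.get(t, 0.0) for t in terms)
        return beta * score + (1 - beta) * overlap
    return [doc_id for doc_id, _, _ in sorted(results, key=personalized, reverse=True)]
```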

    Event summarization on social media stream: retrospective and prospective tweet summarization

    Le contenu gĂ©nĂ©rĂ© dans les mĂ©dias sociaux comme Twitter permet aux utilisateurs d'avoir un aperçu rĂ©trospectif d'Ă©vĂšnement et de suivre les nouveaux dĂ©veloppements dĂšs qu'ils se produisent. Cependant, bien que Twitter soit une source d'information importante, il est caractĂ©risĂ© par le volume et la vĂ©locitĂ© des informations publiĂ©es qui rendent difficile le suivi de l'Ă©volution des Ă©vĂšnements. Pour permettre de mieux tirer profit de ce nouveau vecteur d'information, deux tĂąches complĂ©mentaires de recherche d'information dans les mĂ©dias sociaux ont Ă©tĂ© introduites : la gĂ©nĂ©ration de rĂ©sumĂ© rĂ©trospectif qui vise Ă  sĂ©lectionner les tweets pertinents et non redondant rĂ©capitulant "ce qui s'est passĂ©" et l'envoi des notifications prospectives dĂšs qu'une nouvelle information pertinente est dĂ©tectĂ©e. Notre travail s'inscrit dans ce cadre. L'objectif de cette thĂšse est de faciliter le suivi d'Ă©vĂ©nement, en fournissant des outils de gĂ©nĂ©ration de synthĂšse adaptĂ©s Ă  ce vecteur d'information. Les dĂ©fis majeurs sous-jacents Ă  notre problĂ©matique dĂ©coulent d'une part du volume, de la vĂ©locitĂ© et de la variĂ©tĂ© des contenus publiĂ©s et, d'autre part, de la qualitĂ© des tweets qui peut varier d'une maniĂšre considĂ©rable. La tĂąche principale dans la notification prospective est l'identification en temps rĂ©el des tweets pertinents et non redondants. Le systĂšme peut choisir de retourner les nouveaux tweets dĂšs leurs dĂ©tections oĂč bien de diffĂ©rer leur envoi afin de s'assurer de leur qualitĂ©. Dans ce contexte, nos contributions se situent Ă  ces diffĂ©rents niveaux : PremiĂšrement, nous introduisons Word Similarity Extended Boolean Model (WSEBM), un modĂšle d'estimation de la pertinence qui exploite la similaritĂ© entre les termes basĂ©e sur le word embedding et qui n'utilise pas les statistiques de flux. L'intuition sous- jacente Ă  notre proposition est que la mesure de similaritĂ© Ă  base de word embedding est capable de considĂ©rer des mots diffĂ©rents ayant la mĂȘme sĂ©mantique ce qui permet de compenser le non-appariement des termes lors du calcul de la pertinence. DeuxiĂšmement, l'estimation de nouveautĂ© d'un tweet entrant est basĂ©e sur la comparaison de ses termes avec les termes des tweets dĂ©jĂ  envoyĂ©s au lieu d'utiliser la comparaison tweet Ă  tweet. Cette mĂ©thode offre un meilleur passage Ă  l'Ă©chelle et permet de rĂ©duire le temps d'exĂ©cution. TroisiĂšmement, pour contourner le problĂšme du seuillage de pertinence, nous utilisons un classificateur binaire qui prĂ©dit la pertinence. L'approche proposĂ©e est basĂ©e sur l'apprentissage supervisĂ© adaptatif dans laquelle les signes sociaux sont combinĂ©s avec les autres facteurs de pertinence dĂ©pendants de la requĂȘte. De plus, le retour des jugements de pertinence est exploitĂ© pour re-entrainer le modĂšle de classification. Enfin, nous montrons que l'approche proposĂ©e, qui envoie les notifications en temps rĂ©el, permet d'obtenir des performances prometteuses en termes de qualitĂ© (pertinence et nouveautĂ©) avec une faible latence alors que les approches de l'Ă©tat de l'art tendent Ă  favoriser la qualitĂ© au dĂ©triment de la latence. Cette thĂšse explore Ă©galement une nouvelle approche de gĂ©nĂ©ration du rĂ©sumĂ© rĂ©trospectif qui suit un paradigme diffĂ©rent de la majoritĂ© des mĂ©thodes de l'Ă©tat de l'art. 
Nous proposons de modĂ©liser le processus de gĂ©nĂ©ration de synthĂšse sous forme d'un problĂšme d'optimisation linĂ©aire qui prend en compte la diversitĂ© temporelle des tweets. Les tweets sont filtrĂ©s et regroupĂ©s d'une maniĂšre incrĂ©mentale en deux partitions basĂ©es respectivement sur la similaritĂ© du contenu et le temps de publication. Nous formulons la gĂ©nĂ©ration du rĂ©sumĂ© comme Ă©tant un problĂšme linĂ©aire entier dans lequel les variables inconnues sont binaires, la fonction objective est Ă  maximiser et les contraintes assurent qu'au maximum un tweet par cluster est sĂ©lectionnĂ© dans la limite de la longueur du rĂ©sumĂ© fixĂ©e prĂ©alablement.User-generated content on social media, such as Twitter, provides in many cases, the latest news before traditional media, which allows having a retrospective summary of events and being updated in a timely fashion whenever a new development occurs. However, social media, while being a valuable source of information, can be also overwhelming given the volume and the velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened". In the latter, systems monitor the live posts stream and push relevant and novel notifications as soon as possible. Our work falls within these frameworks and focuses on developing a tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest to help users to efficiently acquire information and follow the development of long ongoing events from social media. Nevertheless, tweet summarization task faces many challenges that stem from, on one hand, the high volume, the velocity and the variety of the published information and, on the other hand, the quality of tweets, which can vary significantly. In the prospective notification, the core task is the relevancy and the novelty detection in real-time. For timeliness, a system may choose to push new updates in real-time or may choose to trade timeliness for higher notification quality. Our contributions address these levels: First, we introduce Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of word embedding model. We used word similarity instead of the traditional weighting techniques. By doing this, we overcome the shortness and word mismatch issues in tweets. The intuition behind our proposition is that context-aware similarity measure in word2vec is able to consider different words with the same semantic meaning and hence allows offsetting the word mismatch issue when calculating the similarity between a tweet and a topic. Second, we propose to compute the novelty score of the incoming tweet regarding all words of tweets already pushed to the user instead of using the pairwise comparison. The proposed novelty detection method scales better and reduces the execution time, which fits real-time tweet filtering. Third, we propose an adaptive Learning to Filter approach that leverages social signals as well as query-dependent features. To overcome the issue of relevance threshold setting, we use a binary classifier that predicts the relevance of the incoming tweet. 
In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and we show that the proposed approach achieves a promising performance in terms of quality (relevance and novelty) with low cost of latency whereas the state-of-the-art approaches tend to trade latency for higher quality. This thesis also explores a novel approach to generate a retrospective summary that follows a different paradigm than the majority of state-of-the-art methods. We consider the summary generation as an optimization problem that takes into account the topical and the temporal diversity. Tweets are filtered and are incrementally clustered in two cluster types, namely topical clusters based on content similarity and temporal clusters that depends on publication time. Summary generation is formulated as integer linear problem in which unknowns variables are binaries, the objective function is to be maximized and constraints ensure that at most one post per cluster is selected with respect to the defined summary length limit
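
    The retrospective-summary formulation described above maps naturally onto an off-the-shelf ILP solver. The following PuLP sketch encodes the stated constraints (binary selection variables, at most one tweet per cluster, and a summary length budget); the scoring function, the cluster assignments, and the use of PuLP itself are assumptions made for illustration, not the thesis's implementation.

```python
import pulp

def summarize(tweets, score, cluster, length, max_length):
    """tweets: list of tweet ids; score/cluster/length: dicts keyed by tweet id."""
    prob = pulp.LpProblem("retrospective_summary", pulp.LpMaximize)
    x = {t: pulp.LpVariable(f"x_{t}", cat=pulp.LpBinary) for t in tweets}
    # Objective: total importance of the selected tweets.
    prob += pulp.lpSum(score[t] * x[t] for t in tweets)
    # At most one tweet per (topical or temporal) cluster.
    for c in set(cluster.values()):
        prob += pulp.lpSum(x[t] for t in tweets if cluster[t] == c) <= 1
    # Overall summary length budget (e.g. total number of words).
    prob += pulp.lpSum(length[t] * x[t] for t in tweets) <= max_length
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [t for t in tweets if x[t].value() and x[t].value() > 0.5]
```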