27 research outputs found

    Real Time Web Search Framework for Performing Efficient Retrieval of Data

    Get PDF
    With the rapidly growing amount of information on the internet, real-time system is one of the key strategies to cope with the information overload and to help users in finding highly relevant information. Real-time events and domain-specific information are important knowledge base references on the Web that frequently accessed by millions of users. Real-time system is a vital to product and a technique must resolve the context of challenges to be more reliable, e.g. short data life-cycles, heterogeneous user interests, strict time constraints, and context-dependent article relevance. Since real-time data have only a short time to live, real-time models have to be continuously adapted, ensuring that real-time data are always up-to-date. The focal point of this manuscript is for designing a real-time web search approach that aggregates several web search algorithms at query time to tune search results for relevancy. We learn a context-aware delegation algorithm that allows choosing the best real-time algorithms for each query request. The evaluation showed that the proposed approach outperforms the traditional models, in which it allows us to adapt the specific properties of the considered real-time resources. In the experiments, we found that it is highly relevant for most recently searched queries, consistent in its performance, and resilient to the drawbacks faced by other algorithms

    Explicit diversification of event aspects for temporal summarization

    Get PDF
    During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but are semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent works in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the amount of redundant and off-topic snippets returned, while also increasing summary timeliness

    Filtering News from Document Streams: Evaluation Aspects and Modeled Stream Utility

    Get PDF
    Events like hurricanes, earthquakes, or accidents can impact a large number of people. Not only are people in the immediate vicinity of the event affected, but concerns about their well-being are shared by the local government and well-wishers across the world. The latest information about news events could be of use to government and aid agencies in order to make informed decisions on providing necessary support, security and relief. The general public avails of news updates via dedicated news feeds or broadcasts, and lately, via social media services like Facebook or Twitter. Retrieving the latest information about newsworthy events from the world-wide web is thus of importance to a large section of society. As new content on a multitude of topics is continuously being published on the web, specific event related information needs to be filtered from the resulting stream of documents. We present in this thesis, a user-centric evaluation measure for evaluating systems that filter news related information from document streams. Our proposed evaluation measure, Modeled Stream Utility (MSU), models users accessing information from a stream of sentences produced by a news update filtering system. The user model allows for simulating a large number of users with different characteristic stream browsing behavior. Through simulation, MSU estimates the utility of a system for an average user browsing a stream of sentences. Our results show that system performance is sensitive to a user population's stream browsing behavior and that existing evaluation metrics correspond to very specific types of user behavior. To evaluate systems that filter sentences from a document stream, we need a set of judged sentences. This judged set is a subset of all the sentences returned by all systems, and is typically constructed by pooling together the highest quality sentences, as determined by respective system assigned scores for each sentence. Sentences in the pool are manually assessed and the resulting set of judged sentences is then used to compute system performance metrics. In this thesis, we investigate the effect of including duplicates of judged sentences, into the judged set, on system performance evaluation. We also develop an alternative pooling methodology, that given the MSU user model, selects sentences for pooling based on the probability of a sentences being read by modeled users. Our research lays the foundation for interesting future work for utilizing user-models in different aspects of evaluation of stream filtering systems. The MSU measure enables incorporation of different user models. Furthermore, the applicability of MSU could be extended through calibration based on user behavior

    Misinformation Retrieval

    Get PDF
    This work introduces the task of misinformation retrieval, identifying all documents containing misinformation for a given topic, and proposes a pipeline for misinformation retrieval on tweets. As part of the work, I curated 50 COVID-19 misinformation topics used in the TREC 2020 Health Misinformation track. In addition, I annotated a test set of tweets using the TREC COVID-19 misinformation on social media. Misinformation on social media has proven highly detrimental to communities by encouraging harmful and often life-threatening behavior. The chaos caused by COVID-19 misinformation has created an urgent need for misinformation detection methods to moderate social media platforms. Drawing upon previous work in misinformation detection and the TREC 2020 Health Misinformation Track, I focused on the task of misinformation retrieval on social media. I extended the COVID-Lies data set created to detect COVID-19 misinformation in tweets by rephrasing the misconceptions accompanying each tweet. I also created 50 COVID-19 related topics for the TREC 2020 Health Misinformation track used for evaluation purposes. I propose a natural language inference (NLI) based approach using CT-BERT to identify tweets that contradict a given fact, used to score documents utilizing the model’s classification probability. The model was trained using a combination of NLI data sets to find the best approach. Tweets were labeled for the TREC 2020 Health Misinformation Track topics to create a test set on which the best model achieves an AUC of 0.81. I conducted several experiments which show that domain adaptation significantly improved the ability to detect misinformation. A combination of a large NLI corpus, such as SNLI, and an in-domain, such as the COVID-Lies, data set achieves the best performance on our test set. The pipelines retrieved and ranked tweets based on misinformation for 7 TREC topics from the COVID-19 Twitter stream. The top 20 unique tweets were analyzed using Precision@20 to evaluate the pipeline

    When Diversity Is Needed... But Not Expected!

    Get PDF
    International audienceRecent studies have highlighted the correlation between users' satisfaction and diversity within recommenders, especially the fact that diversity increases users' confidence when choosing an item. Understanding the reasons of this positive impact on recommenders is now becoming crucial. Based on this assumption, we designed a user study that focuses on the utility of this new dimension, as well as its perceived qualities. This study has been conducted on 250 users and it compared 5 recommendation approaches, based on collaborative filtering, content-based filtering and popularity, along with various degrees of diversity. Results show that, when recommendations are made explicit, diversity may reduce users' acceptance rate. However, it helps increasing users' satisfaction. Moreover, this study highlights the need to build users' preference models that are diverse enough, so as to generate good recommendations

    Methods for ranking user-generated text streams: a case study in blog feed retrieval

    Get PDF
    User generated content are one of the main sources of information on the Web nowadays. With the huge amount of this type of data being generated everyday, having an efficient and effective retrieval system is essential. The goal of such a retrieval system is to enable users to search through this data and retrieve documents relevant to their information needs. Among the different retrieval tasks of user generated content, retrieving and ranking streams is one of the important ones that has various applications. The goal of this task is to rank streams, as collections of documents with chronological order, in response to a user query. This is different than traditional retrieval tasks where the goal is to rank single documents and temporal properties are less important in the ranking. In this thesis we investigate the problem of ranking user-generated streams with a case study in blog feed retrieval. Blogs, like all other user generated streams, have specific properties and require new considerations in the retrieval methods. Blog feed retrieval can be defined as retrieving blogs with a recurrent interest in the topic of the given query. We define three different properties of blog feed retrieval each of which introduces new challenges in the ranking task. These properties include: 1) term mismatch in blog retrieval, 2) evolution of topics in blogs and 3) diversity of blog posts. For each of these properties, we investigate its corresponding challenges and propose solutions to overcome those challenges. We further analyze the effect of our solutions on the performance of a retrieval system. We show that taking the new properties into account for developing the retrieval system can help us to improve state of the art retrieval methods. In all the proposed methods, we specifically pay attention to temporal properties that we believe are important information in any type of streams. We show that when combined with content-based information, temporal information can be useful in different situations. Although we apply our methods to blog feed retrieval, they are mostly general methods that are applicable to similar stream ranking problems like ranking experts or ranking twitter users

    Design and Evaluation of Temporal Summarization Systems

    Get PDF
    Temporal Summarization (TS) is a new track introduced as part of the Text REtrieval Conference (TREC) in 2013. This track aims to develop systems which can return important updates related to an event over time. In TREC 2013, the TS track specifically used disaster related events such as earthquake, hurricane, bombing, etc. This thesis mainly focuses on building an effective TS system by using a combination of Information Retrieval techniques. The developed TS system returns updates related to disaster related events in a timely manner. By participating in TREC 2013 and with experiments conducted after TREC, we examine the effectiveness of techniques such as distributional similarity for term expansion, which can be employed in building TS systems. Also, this thesis describes the effectiveness of other techniques such as stemming, adaptive sentence selection over time and de-duplication in our system, by comparing it with other baseline systems. The second part of the thesis examines the current methodology used for evaluating TS systems. We propose a modified evaluation method which could reduce the manual effort of assessors, and also correlates well with the official track’s evaluation. We also propose a supervised learning based evaluation method, which correlates well with the official track’s evaluation of systems and could save the assessor’s time by as much as 80%
    corecore