9 research outputs found

    Distance matters! Cumulative proximity expansions for ranking documents

    In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, which is used to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models, since it does not require any co-occurrence statistics, obviates the need to tune additional parameters, and has a retrieval speed close to that of competing models. We show that this approach is more robust than existing models on both Web and newswire corpora, and on average performs as well as or better than existing proximity models across collections.
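    The idea of extending a unigram score with an accumulated proximity score can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the model from the paper: the function names, the raw term-count unigram score, the restriction to pairs of distinct query terms, and the 1/distance weighting are all hypothetical choices.

```python
from itertools import combinations


def term_positions(tokens, query_terms):
    """Map each query term to the token positions where it occurs."""
    positions = {term: [] for term in query_terms}
    for index, token in enumerate(tokens):
        if token in positions:
            positions[token].append(index)
    return positions


def unigram_score(tokens, query_terms):
    """Toy independent-term score: number of query term occurrences."""
    wanted = set(query_terms)
    return float(sum(1 for token in tokens if token in wanted))


def proximity_score(tokens, query_terms):
    """Accumulate a score over all co-occurring pairs of distinct query terms.

    Assumption: each pair of occurrences contributes 1 / distance, so
    closer occurrences contribute more. The paper's actual function is
    not given in the abstract.
    """
    positions = term_positions(tokens, set(query_terms))
    score = 0.0
    for term_a, term_b in combinations(sorted(positions), 2):
        for pos_a in positions[term_a]:
            for pos_b in positions[term_b]:
                score += 1.0 / abs(pos_a - pos_b)
    return score


def ranking_score(tokens, query_terms):
    """Unigram score extended with the accumulated proximity score."""
    return unigram_score(tokens, query_terms) + proximity_score(tokens, query_terms)
```

    In this sketch, two documents with the same term counts receive different scores when the query terms sit closer together in one of them, and no collection-level co-occurrence statistics or extra tuning parameters are involved.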

    CWI and TU Delft at the TREC 2015 Temporal Summarization Track

    Internet users are turning more frequently to online news as a replacement for traditional media sources such as newspapers or television shows. Still, discovering news events online and following them as they develop can be a difficult task. In previous work, we presented a novel approach to extract sentences from an online stream of news articles that summarizes the most important news facts for a given ad-hoc information need, which compared to existing systems obtained relatively high-precision results and a comparable recall [9]. In this track, we experiment with this approach to improve the recall of retrieved results.

    CWI and TU Delft at TREC 2013: Contextual Suggestion, Federated Web Search, KBA, and Web Tracks

    This paper provides an overview of the work done at the Centrum Wiskunde & Informatica (CWI) and Delft University of Technology (TU Delft) for different tracks of TREC 2013. We participated in the Contextual Suggestion Track, the Federated Web Search Track, the Knowledge Base Acceleration (KBA) Track, and the Web Ad-hoc Track. In the Contextual Suggestion Track, we focused on filtering the entire ClueWeb12 collection to generate recommendations according to the provided user profiles and contexts. For the Federated Web Search Track, we exploited both categories from ODP and document relevance to merge result lists. In the KBA Track, we focused on the Cumulative Citation Recommendation task, where we supplied different features to two classification algorithms. For the Web Track, we extended an ad-hoc baseline with a proximity model that promotes documents in which the query terms are positioned closer together.

    Proximity of Terms, Texts and Semantic Vectors in Information Retrieval

    Information Retrieval (IR) is finding content of an unstructured nature with respect to an information need. A retrieval system typically uses a retrieval model to rank the available content by its estimated relevance to an information need. For decades, state-of-the-art retrieval models have used the assumption that terms appear independently in text documents. Chapter 1 of this thesis describes how the relevance likelihood of a document changes by the observed distance between co-occurring query terms in its text. Nowadays, news is abundantly available online, allowing users to discover and follow news events. However, online news is often very redundant; most sources base their stories on previously published works and add only limited new information. Thus, a user often ends up spending a significant amount of effort re-reading the same parts of a story before finding relevant and novel information. In Chapter 2 and Chapter 3, we present a novel approach to construct an online news summary for a given topic. Salient sentences are identified by clustering the sentences in the news stream based on the relative proximity of the sentences and the temporal proximity of their publication times. To improve the coherence of a long summary that describes a news topic, we propose to automatically cluster sentences by subtopics in Chapter 4. In Chapter 5, we show how new topics can be detected in the news stream using the same clustering technique. In real-life decision making, people are often faced with an overload of choices. A recommender system aids the user by reducing the available choices to a shortlist of items that are of interest to the user. In Chapter 6, we learn high-dimensional representations for movies that allow us to effectively recommend movies based on a user’s most recently rated movies. SIKS Dissertation Series No. 2017-19. The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. Multimedia Computin

    Obtaining High-Quality Relevance Judgments Using Crowdsourcing

    The performance of information retrieval (IR) systems is commonly evaluated using a test set with known relevance. Crowdsourcing is one method for learning which documents are relevant to each query in the test set. However, the quality of relevance judgments learned through crowdsourcing can be questionable, because it relies on workers of unknown quality with possible spammers among them. To detect spammers, the authors' algorithm compares judgments between workers; they evaluate their approach by comparing the consistency of crowdsourced ground truth to that obtained from expert annotators and conclude that crowdsourcing can match the quality obtained from the latter.
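    Comparing judgments between workers can be sketched as a pairwise agreement rate. This is only an illustration under stated assumptions, not the authors' algorithm: the function names, the `(worker, item, label)` input format, and the fixed `threshold` are all hypothetical.

```python
from collections import defaultdict


def agreement_rates(judgments):
    """Return each worker's rate of agreement with co-workers.

    judgments: iterable of (worker, item, label) tuples (assumed format).
    A worker's rate is the fraction of pairwise comparisons, over items
    also judged by someone else, in which the two labels match.
    """
    labels_by_item = defaultdict(dict)
    for worker, item, label in judgments:
        labels_by_item[item][worker] = label

    agreements = defaultdict(int)
    comparisons = defaultdict(int)
    for labels in labels_by_item.values():
        for worker, label in labels.items():
            for other, other_label in labels.items():
                if other == worker:
                    continue
                comparisons[worker] += 1
                agreements[worker] += int(label == other_label)

    return {w: agreements[w] / comparisons[w] for w in comparisons}


def flag_low_agreement(judgments, threshold=0.5):
    """List workers whose agreement rate falls below the (assumed) threshold."""
    rates = agreement_rates(judgments)
    return sorted(w for w, rate in rates.items() if rate < threshold)
```

    A worker who labels at random or always disagrees with the others ends up with a low rate and is flagged. Workers who never share an item with anyone get no rate and are never flagged in this sketch.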

    First Story Detection using Multiple Nearest Neighbors


    Exploring Deep Space: Learning Personalized Ranking in a Semantic Space
