
    Contextual compositionality detection with external knowledge bases and word embeddings

    When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multiword phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can notably hurt system effectiveness. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase "green card" is compositional when referring to a green-colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase, and allows its compositionality to be detected in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model notably outperforms state-of-the-art baselines on detecting phrase compositionality.
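
    The core test can be pictured in a few lines: compare the phrase's in-context embedding against a composition of its constituent word embeddings, and treat low similarity as evidence of non-compositional usage. The additive composition and the threshold below are illustrative assumptions, not the paper's context-enriched model.

```python
# Hypothetical sketch: flag a phrase occurrence as non-compositional when its
# in-context embedding diverges from the composition of its word embeddings.
# Additive composition and the 0.5 threshold are illustrative placeholders.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def is_compositional(phrase_vec, word_vecs, threshold=0.5):
    """phrase_vec: embedding of the phrase in its usage context;
    word_vecs: embeddings of its constituent words."""
    composed = np.mean(word_vecs, axis=0)  # simple additive composition
    return cosine(phrase_vec, composed) >= threshold
```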

    Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval

    Document coherence describes how much sense text makes in terms of its logical organisation and discourse flow. Even though coherence is a relatively difficult notion to quantify precisely, it can be approximated automatically. This type of coherence modelling is not only interesting in itself, but also useful for a number of other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this, we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both keyword-based standard ranking and by spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
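
    To make the bipartite idea concrete, here is one illustrative coherence signal computed directly on the sentence-entity structure without a one-mode projection: the fraction of adjacent sentence pairs linked through a shared entity. This is a stand-in for intuition, not one of the paper's three metrics.

```python
# Illustrative bipartite coherence signal (not one of the paper's metrics):
# the share of adjacent sentence pairs that are connected through at least
# one common discourse entity, read straight off the bipartite structure.
def bipartite_coherence(sentence_entities):
    """sentence_entities: one set of discourse entities per sentence, in order."""
    if len(sentence_entities) < 2:
        return 1.0
    pairs = zip(sentence_entities, sentence_entities[1:])
    linked = sum(1 for a, b in pairs if a & b)
    return linked / (len(sentence_entities) - 1)

# Every adjacent pair shares an entity, so this document scores 1.0.
print(bipartite_coherence([{"obama"}, {"obama", "senate"}, {"senate"}]))
```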

    Investigating the statistical properties of user-generated documents

    The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the different services that take place online. In this work, we analyze the properties of these online user-generated documents for some of the established services over the Internet (Kongregate, Twitter, Myspace, and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (from the Wall Street Journal, Associated Press, and Financial Times, as part of the TREC ad-hoc collection). We investigate features such as document similarity, term burstiness, emoticons, and Part-Of-Speech analysis, highlighting the applicability and limits of traditional content analysis and indexing techniques used in information retrieval when applied to the new online user-generated documents.
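
    As a flavour of the kind of features studied, term burstiness can be operationalised very simply: the average within-document frequency of a term over the documents that contain it. This particular measure is an assumption for illustration, not necessarily the one used in the paper.

```python
# Hypothetical sketch of one feature class from the study: term burstiness,
# here the mean within-document frequency of a term over the documents in
# which it occurs (a common, simple operationalisation).
from collections import Counter

def burstiness(term, documents):
    """documents: list of token lists."""
    tfs = [Counter(doc)[term] for doc in documents]
    tfs = [tf for tf in tfs if tf > 0]
    return sum(tfs) / len(tfs) if tfs else 0.0

docs = [["lol", "lol", "great", "game"], ["nice", "game"], ["lol"]]
print(burstiness("lol", docs))  # 1.5: bursty terms recur within documents
```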

    On clustering and polyrepresentation

    Polyrepresentation is one of the most prominent principles in a cognitive approach to interactive information seeking and retrieval. When it comes to interactive retrieval, clustering is another method for accessing information. While polyrepresentation has been explored and validated in a scenario where a system returns a ranking of documents, so far there are no insights into whether and how polyrepresentation and clustering can be combined. In this paper, we discuss how the two are related and present an approach to integrate polyrepresentation into clustering. We further report some initial evaluation results.
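
    One way to picture such an integration is to cluster documents over several representations at once, combining per-representation distances into a single matrix. The representations, weights, and clustering setup below are assumptions for illustration, not the paper's concrete approach.

```python
# Illustrative sketch: polyrepresentative clustering via a weighted sum of
# per-representation cosine distance matrices. Weights and the agglomerative
# configuration are placeholder choices, not the paper's method.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

def polyrep_clusters(reps, weights, n_clusters=2):
    """reps: list of (n_docs, dim_i) feature matrices, one per representation."""
    combined = sum(w * pairwise_distances(r, metric="cosine")
                   for r, w in zip(reps, weights))
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(combined)
```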

    Personalized social query expansion using social annotations

    Query expansion is a query pre-processing technique that adds to a given query terms that are likely to occur in relevant documents, in order to improve information retrieval accuracy. A key problem to solve is: how to identify the terms to be added to a query? Considering social tagging systems as a data source, we propose an approach that selects terms based on (i) the semantic similarity between tags composing a query, (ii) a social proximity between the query and the user for a personalized expansion, and (iii) a strategy for expanding user queries on the fly. We demonstrate the effectiveness of our approach by an intensive evaluation on three large public datasets crawled from Delicious, Flickr, and CiteULike. We show that the expanded queries built by our method provide more accurate results than the initial queries, increasing MAP by 10 to 16% on the three datasets. We also compare our method to three state-of-the-art baselines and show that our query expansion method yields significant improvements in MAP, with a boost ranging from 5 to 18%.
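
    The two signals named above suggest a simple scoring scheme: rank candidate tags by a mix of their semantic similarity to the query tags and the social proximity of the querying user to the users who assigned them. The mixing weight and the scoring callables below are illustrative placeholders, not the paper's exact formulas.

```python
# Hedged sketch of personalized social query expansion: candidates are
# scored by tag-tag similarity plus user-tag social proximity; `alpha`
# and both scoring functions are assumptions, not the paper's model.
def expand_query(query_tags, candidates, sim, proximity, user, alpha=0.7, k=5):
    """sim(t, q): semantic similarity of tags t and q;
    proximity(user, t): social closeness of `user` to the users of tag t."""
    def score(t):
        semantic = max(sim(t, q) for q in query_tags)
        return alpha * semantic + (1 - alpha) * proximity(user, t)
    ranked = sorted((c for c in candidates if c not in query_tags),
                    key=score, reverse=True)
    return ranked[:k]
```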

    Report on ECIR 2016: 38th European Conference on Information Retrieval

    The 38th European Conference on Information Retrieval took place from the 20th to the 23rd of March 2016 in Padua, Italy. This report summarizes the conference in terms of the presented keynotes, scientific and social programme, industry day, tutorials, workshops, and student support.

    A study of factuality, objectivity and relevance: three desiderata in large-scale information retrieval?

    Much of the information processed by Information Retrieval (IR) systems is unreliable, biased, and generally untrustworthy [1], [2], [3]. Yet, factuality & objectivity detection is not a standard component of IR systems, even though it has been possible in Natural Language Processing (NLP) for the last decade. Motivated by this, we ask if and how factuality & objectivity detection may benefit IR. We answer this in two parts. First, we use state-of-the-art NLP to compute the probability of document factuality & objectivity in two TREC collections, and analyse its relation to document relevance. We find that factuality is strongly and positively correlated with document relevance, but objectivity is not. Second, we study the impact of factuality & objectivity on retrieval effectiveness by treating them as query-independent features that we combine with a competitive language modelling baseline. Experiments with 450 TREC queries show that factuality improves precision >10% over strong baselines, especially for uncurated data used in web search; objectivity gives mixed results. An overall clear trend is that document factuality & objectivity are much more beneficial to IR when searching uncurated (e.g., web) documents than curated ones (e.g., state documentation and newswire articles). To our knowledge, this is the first study of factuality & objectivity for back-end IR, contributing novel findings about the relation between relevance and factuality/objectivity, and statistically significant gains in retrieval effectiveness in the competitive web search task.
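
    The query-independent combination described above can be pictured as folding a per-document factuality prior into the retrieval score. The log-linear interpolation and the weight `beta` below are assumptions for illustration, not the paper's exact feature combination.

```python
# Minimal sketch: re-rank language-model retrieval scores with a
# query-independent factuality prior. `beta` and the log-linear form are
# illustrative assumptions, not the paper's combination.
import math

def rerank(lm_scores, factuality, beta=0.1):
    """lm_scores: {doc_id: LM log-score}; factuality: {doc_id: P(factual)}."""
    def combined(d):
        return lm_scores[d] + beta * math.log(factuality.get(d, 0.5) + 1e-9)
    return sorted(lm_scores, key=combined, reverse=True)

# d1's higher factuality outweighs its slightly lower LM score.
print(rerank({"d1": -4.05, "d2": -4.0}, {"d1": 0.9, "d2": 0.2}))
```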

    CLEF 2004: Ad Hoc Track Overview and Results Analysis

    We describe the objectives and organization of the CLEF 2004 ad hoc track and discuss the main characteristics of the experiments. The results are analyzed and commented upon, and their statistical significance is investigated. The paper concludes with some observations on the impact of the CLEF campaign on the state of the art in cross-language information retrieval.