13 research outputs found
Contextual compositionality detection with external knowledge bases and word embeddings
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multiword phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can notably hurt system effectiveness. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase "green card" is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase, and allows its compositionality to be detected in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model notably outperforms state-of-the-art baselines at detecting phrase compositionality.
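The intuition behind embedding-based compositionality detection can be sketched as follows: compose the vectors of the individual words and compare the result with the phrase vector observed in context; low similarity suggests non-compositionality. This is a minimal illustrative sketch with made-up toy vectors, not the paper's actual model (which additionally uses collocates, usage scenarios, and knowledge bases).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings (hypothetical values, for illustration only).
vec = {
    "green": [0.9, 0.1, 0.0],
    "card":  [0.1, 0.9, 0.0],
    # Phrase vector as used in the "permanent residency" context:
    "green card (residency)": [0.1, 0.2, 0.95],
    # Phrase vector as used in the "green colored card" context:
    "green card (colour)": [0.55, 0.5, 0.05],
}

def compositionality(phrase_vec, word_vecs):
    # Compose word vectors by averaging, then compare with the phrase
    # vector observed in context; a low score suggests the phrase is
    # non-compositional in that context.
    composed = [sum(xs) / len(xs) for xs in zip(*word_vecs)]
    return cosine(phrase_vec, composed)

words = [vec["green"], vec["card"]]
print(compositionality(vec["green card (colour)"], words))     # high similarity
print(compositionality(vec["green card (residency)"], words))  # low similarity
```

A contextual detector would then threshold or classify this score per occurrence of the phrase, rather than assigning one label per phrase type.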
Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval
Document coherence describes how much sense text makes in terms of its logical organisation and discourse flow. Even though coherence is a relatively difficult notion to quantify precisely, it can be approximated automatically. This type of coherence modelling is not only interesting in itself, but also useful for a number of other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both keyword-based standard ranking and by spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
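The bipartite structure in question can be illustrated with a toy document: each sentence is linked to the discourse entities it mentions. The sketch below is illustrative only (the metric shown, bipartite density, is a stand-in and not one of the paper's three metrics); it also shows why a one-mode projection loses information.

```python
# Hypothetical document: sentence index -> discourse entities it mentions.
doc = {
    0: {"Microsoft", "market"},
    1: {"Microsoft", "Netscape"},
    2: {"Netscape", "browser", "market"},
}

def bipartite_density(sent_entities):
    # Density of the bipartite sentence-entity graph: observed edges
    # divided by all possible sentence-entity pairs. Computed directly
    # on the bipartite graph, with no projection step.
    entities = set().union(*sent_entities.values())
    edges = sum(len(es) for es in sent_entities.values())
    return edges / (len(sent_entities) * len(entities))

def one_mode_projection(sent_entities):
    # Project onto sentences: connect two sentences iff they share at
    # least one entity. The projection records only *that* sentences
    # share entities, not *which* or *how many* -- information the
    # original bipartite graph keeps.
    sents = list(sent_entities)
    return {(a, b) for i, a in enumerate(sents) for b in sents[i + 1:]
            if sent_entities[a] & sent_entities[b]}

print(bipartite_density(doc))
print(one_mode_projection(doc))  # all three sentence pairs are connected
```

In this example the projection is a complete graph over the three sentences, so the distinct entity overlaps (Microsoft, market, Netscape) become indistinguishable after projection.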
Investigating the statistical properties of user-generated documents
The importance of the Internet as a communication medium is reflected in the large amount of documents being generated every day by users of the different services that take place online. In this work we analyze the properties of these online user-generated documents for some of the established services over the Internet (Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (from the Wall Street Journal, Associated Press and Financial Times, as part of the TREC ad-hoc collection). We investigate features such as document similarity, term burstiness, emoticons and Part-Of-Speech analysis, highlighting the applicability and limits of traditional content analysis and indexing techniques used in information retrieval to the new online user-generated documents.
On clustering and polyrepresentation
Polyrepresentation is one of the most prominent principles in a cognitive approach to interactive information seeking and retrieval. When it comes to interactive retrieval, clustering is another method for accessing information. While polyrepresentation has been explored and validated in a scenario where a system returns a ranking of documents, so far there are no insights into whether and how polyrepresentation and clustering can be combined. In this paper we discuss how both are related and present an approach to integrate polyrepresentation into clustering. We further report some initial evaluation results.
Personalized social query expansion using social annotations
© 2019, Springer-Verlag GmbH Germany, part of Springer Nature. Query expansion is a query pre-processing technique that adds to a given query terms that are likely to occur in relevant documents, in order to improve information retrieval accuracy. A key problem to solve is "how to identify the terms to be added to a query?" Considering social tagging systems as a data source, we propose an approach that selects terms based on (i) the semantic similarity between tags composing a query, (ii) a social proximity between the query and the user for a personalized expansion, and (iii) a strategy for expanding user queries on the fly. We demonstrate the effectiveness of our approach through an extensive evaluation on three large public datasets crawled from delicious, Flickr, and CiteULike. We show that the expanded queries built by our method provide more accurate results than the initial queries, increasing MAP by between 10% and 16% on the three datasets. We also compare our method to three state-of-the-art baselines, and show that our query expansion method yields significant improvements in MAP, with a boost of between 5% and 18%.
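The core idea of socially personalized expansion can be sketched with toy data: candidate expansion terms are tags that co-occur with the query tag, weighted by how close the tagging user is to the querying user. The data, weighting scheme, and helper names below are all hypothetical simplifications of the paper's approach, which uses proper semantic-similarity and social-proximity measures.

```python
from collections import Counter

# Hypothetical tagging data: (user, document, tags) -- illustration only.
bookmarks = [
    ("u1", "d1", ["python", "programming", "tutorial"]),
    ("u1", "d2", ["python", "scripting"]),
    ("u2", "d3", ["python", "snake", "biology"]),
    ("u2", "d4", ["snake", "reptile"]),
]

def expand(query_tag, user, top_k=2):
    # Score candidate tags by co-occurrence with the query tag, weighting
    # the querying user's own bookmarks double -- a crude stand-in for the
    # paper's social-proximity component, which personalizes the expansion.
    scores = Counter()
    for u, _doc, tags in bookmarks:
        if query_tag in tags:
            weight = 2 if u == user else 1
            for t in tags:
                if t != query_tag:
                    scores[t] += weight
    return [t for t, _ in scores.most_common(top_k)]

print(expand("python", "u1"))  # programming-flavoured expansion
print(expand("python", "u2"))  # biology-flavoured expansion
```

The same ambiguous tag ("python") expands differently for the two users, which is the personalization effect the paper targets.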
Report on ECIR 2016: 38th European Conference on Information Retrieval
The 38th European Conference on Information Retrieval took place from the 20th to the 23rd of March 2016 in Padua, Italy. This report summarizes the conference in terms of the presented keynotes, scientific and social programme, industry day, tutorials, workshops and student support.
A study of factuality, objectivity and relevance: three desiderata in large-scale information retrieval?
Much of the information processed by Information Retrieval (IR) systems is unreliable, biased, and generally untrustworthy [1], [2], [3]. Yet, factuality & objectivity detection is not a standard component of IR systems, even though it has been possible in Natural Language Processing (NLP) for the last decade. Motivated by this, we ask if and how factuality & objectivity detection may benefit IR. We answer this in two parts. First, we use state-of-the-art NLP to compute the probability of document factuality & objectivity in two TREC collections, and analyse its relation to document relevance. We find that factuality is strongly and positively correlated with document relevance, but objectivity is not. Second, we study the impact of factuality & objectivity on retrieval effectiveness by treating them as query-independent features that we combine with a competitive language modelling baseline. Experiments with 450 TREC queries show that factuality improves precision by >10% over strong baselines, especially for uncurated data used in web search; objectivity gives mixed results. An overall clear trend is that document factuality & objectivity is much more beneficial to IR when searching uncurated (e.g. web) documents vs. curated (e.g. state documentation and newswire articles). To our knowledge, this is the first study of factuality & objectivity for back-end IR, contributing novel findings about the relation between relevance and factuality/objectivity, and statistically significant gains to retrieval effectiveness in the competitive web search task.
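Combining a query-independent feature with a retrieval score is commonly done by interpolation. The sketch below is a generic illustration of that idea, not the paper's exact combination method; the mixing weight and scores are hypothetical.

```python
import math

def combined_score(lm_score, p_factual, lam=0.1):
    # Interpolate a (log-domain) language-model relevance score with a
    # query-independent factuality probability. lam is a hypothetical
    # mixing weight; taking the log of the probability keeps both terms
    # on a comparable log scale, so the prior does not dominate.
    return (1 - lam) * lm_score + lam * math.log(max(p_factual, 1e-9))

# All else equal, a higher factuality probability raises the final score.
print(combined_score(-12.0, 0.9))
print(combined_score(-12.0, 0.2))
```

In a reranking setting, documents would be reordered by this combined score instead of the language-model score alone.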
CLEF 2004: Ad Hoc Track Overview and Results Analysis
We describe the objectives and organization of the CLEF 2004 ad hoc track and discuss the main characteristics of the experiments. The results are analyzed and commented on, and their statistical significance is investigated. The paper concludes with some observations on the impact of the CLEF campaign on the state of the art in cross-language information retrieval.