49 research outputs found

    A bi-directional unified Model for information retrieval

    Relevance matching between two information objects such as a document and query or a user and product (e.g. movie) is an important problem in information retrieval systems. The most common and most successful way to approach this problem is by probabilistically modelling the relevance between information objects, and computing their relevance matching as the probability of relevance. The objective of a probabilistic relevance retrieval model is to compute the probability of relevance between a given information object pair using all the available information about the individual objects (e.g., document and query), the existing relevance information on both objects and all the information available on other information objects (other documents, queries in the collection and the relevance information on them). The probabilistic retrieval models developed to date are not capable of utilising all available information due to the lack of a unified theory for relevance modelling. More than three decades ago, the notion of simultaneously utilising the relevance information about individual user needs and individual documents to come to a retrieval decision was formalised as the problem of a unified relevance model for Information Retrieval (IR). Since the inception of the unified model, a number of unsuccessful attempts have been made to develop a formal probabilistic relevance model to solve the problem. This thesis provides a new theory and a probabilistic relevance framework that not only solves the problem of the original unified relevance model but also provides the capability to utilise any available information about the information objects in computing the probability of relevance. In this thesis, we consider information matching between two objects (e.g. documents and queries) to be bi-directional preference matching and the relevance between them is thus established and estimated on top of the bi-directional relationship. 
A key benefit of this bi-directional approach is that the resulting probabilistic bi-directional unified model not only solves the original problem of a unified model in information retrieval but can also incorporate all of the available information on the information objects (documents and queries) into a single model while computing the probability of relevance. Theoretically, we demonstrate the effectiveness of applying our single framework by deriving relevance ranking functions for popular retrieval scenarios such as collaborative filtering (recommendation), group recommendation and ad-hoc retrieval. In the past, relevance matching in each of these retrieval scenarios was approached with a different solution or framework, partly because of the kind of information available to the retrieval system for computing the probability of relevance. However, the underlying problem of information matching is the same in all scenarios, and a solution to the problem of a unified model should be applicable to all of them. One interesting aspect of our new theory and model, when applied to a collaborative filtering scenario, is that it computes the probability of relevance between a given user and a given item without applying any dimensionality reduction technique or computing an explicit similarity between users or items, in contrast to state-of-the-art collaborative filtering/recommender models (e.g. Matrix Factorisation methods, neighbourhood-based methods). This property allows the retrieval model to model users and items independently with their own features, rather than forcing it to use a common feature space (e.g., common hidden factor-features between a user-item pair of objects or a common vocabulary space between a document-query pair of objects).
The effectiveness of this theoretical framework is demonstrated in various real-world applications by experimenting on datasets in collaborative filtering, group recommendation and ad-hoc retrieval tasks. For collaborative filtering and group recommendation, the model convincingly outperforms various state-of-the-art recommender models (or frameworks). For ad-hoc retrieval, the model also outperforms state-of-the-art information retrieval models when it is restricted to the same information used by those models. The bi-directional unified model allows both search and personalisation/recommender (or collaborative filtering) systems to be built from a single model, which has not been possible before with the existing probabilistic relevance models. Finally, our theory and its framework have been adopted by some large companies in gaming, venture-capital matching, retail and media, and deployed on their web systems to match their customers, often in the tens of millions, with relevant content.

    Probability models for information retrieval based on divergence from randomness

    This thesis devises a novel methodology based on probability theory, suitable for the construction of term-weighting models of Information Retrieval. Our term-weighting functions are created within a general framework made up of three components. Each of the three components is built independently of the others. We obtain the term-weighting functions from the general model in a purely theoretical way, instantiating each component with different probability distribution forms. The thesis begins by investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference, and we present and employ the derived estimation techniques in the context of Information Retrieval. We initially pay great attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is extensively discussed and used throughout the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach. In revisiting the main information retrieval models in the literature, we show that even the language modelling approach can be exploited to assign term-frequency normalization to the models of divergence from randomness. We finally introduce a novel framework for query expansion. This framework is based on the models of divergence from randomness and can be applied to arbitrary models of IR, including divergence-based, language modelling and probabilistic models. We have carried out a very large number of experiments, and the results show that the framework generates highly effective Information Retrieval models.

    Using Learning to Rank Approach to Promoting Diversity for Biomedical Information Retrieval with Wikipedia

    Most traditional information retrieval (IR) models adopt the independent relevance assumption, which assumes that the relevance of a document is independent of other documents. The pitfall of this assumption is the high redundancy and low diversity of the retrieval results. This has been seen in many scenarios, especially in biomedical IR, where the information need of one query may refer to different aspects. Promoting diversity in IR takes the relationship between documents into account. Unlike previous studies, we tackle this problem from a learning-to-rank perspective. The main challenges are how to find salient features for biomedical data and how to integrate dynamic features into the ranking model. To address these challenges, Wikipedia is used to detect the topics of documents in order to generate diversity-biased features. A combined model is proposed and studied to learn a diversified ranking result. Experimental results show that the proposed method outperforms baseline models.

    Search beyond traditional probabilistic information retrieval

    This thesis focuses on search beyond probabilistic information retrieval. Three approaches are proposed that go beyond traditional probabilistic modelling. First, term association is examined in depth. Term association considers term dependency using a factor-analysis-based model, instead of treating each term independently. Latent factors, considered the same as the hidden variables of "eliteness" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named "EntityCube", which has been released by Microsoft for public use. A summarization page summarizes the entity information over multiple documents, so that truly relevant entities are more likely to be found across multiple documents by integrating the local relevance contributed by proximity with the global enhancement contributed by a topic model. Third, multi-source fusion sets up a meta-search engine to combine the "knowledge" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, namely reciprocal, CombMNZ and CombSUM, with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.

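The fusion methods named in the abstract above can be illustrated with a minimal Python sketch. It shows only the textbook forms of CombSUM and CombMNZ plus a plain reciprocal-rank fusion; the thesis's modified and expanded versions are not described in the abstract and are not reproduced here. The function names and the assumption that each run is a dictionary of comparable (already normalised) scores are illustrative choices, not taken from the source.

```python
from collections import defaultdict


def comb_sum(runs):
    """CombSUM: add up a document's scores over all runs that retrieved it."""
    fused = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] += score
    return dict(fused)


def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc_id in run:
            hits[doc_id] += 1
    return {doc_id: total * hits[doc_id] for doc_id, total in sums.items()}


def reciprocal_rank(runs):
    """Reciprocal-rank fusion: sum 1/rank over the runs that retrieved the document."""
    fused = defaultdict(float)
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / rank
    return dict(fused)


def ranking(fused):
    """Order document ids by fused score, best first."""
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical runs from two sources (assumed to be on the same score scale).
runs = [{"d1": 0.5, "d2": 0.25}, {"d2": 0.5, "d3": 0.25}]
print(ranking(comb_sum(runs)))
print(ranking(comb_mnz(runs)))
print(ranking(reciprocal_rank(runs)))
```

CombMNZ rewards documents retrieved by several sources more strongly than CombSUM does, which is why the two are usually compared side by side.
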
    Semantic enrichment of knowledge sources supported by domain ontologies

    This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the formalization and representation of document contents, where most existing approaches are limited and only take into account the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched by incorporating implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof-of-concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which makes it possible to validate the ideas established under the scope of this thesis; (vi) publication of several scientific articles and support for four master's dissertations carried out in the Department of Electrical and Computer Engineering at FCT/UNL.
It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to build on approaches that are considered scientifically sound and coherent and to avoid “reinventing the wheel”. European research projects: CoSpaces (IST-5-034245), CRESCENDO (FP7-234344) and MobiS (FP7-318452)

    Explicit web search result diversification

    Queries submitted to a web search engine are typically short and often ambiguous. With the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. In order to overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased. In this thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD (Explicit Query Aspect Diversification), a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty, and hence penalise redundancy, in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis. The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature.
In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search. As demonstrated throughout this thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, and across a range of experimental conditions, comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, as well as queries with multiple levels of ambiguity. Altogether, these results corroborate the state-of-the-art diversification performance of xQuAD
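
The dimensions listed in the xQuAD abstract above (relevance, sub-query importance, coverage, novelty, and a relevance/diversity trade-off) can be sketched as a greedy re-ranker. The scoring rule below follows the commonly published form of the xQuAD objective, (1 - λ)·P(d|q) + λ·Σ_i P(q_i|q)·P(d|q_i)·Π_{d_j in S}(1 - P(d_j|q_i)). The function name, the dictionary-based inputs and the fixed `lam` parameter are assumptions made for illustration, and the probability estimates are taken as given rather than learned as in the thesis.

```python
def xquad_rerank(query_rel, subquery_weight, subquery_rel, lam=0.5, k=10):
    """Greedy xQuAD-style re-ranking (illustrative sketch).

    query_rel:       {doc: P(d|q)}, relevance to the original query
    subquery_weight: {sub: P(q_i|q)}, importance of each sub-query
    subquery_rel:    {sub: {doc: P(d|q_i)}}, coverage of each sub-query
    lam:             trade-off between relevance (0.0) and diversity (1.0)
    """
    selected = []
    remaining = set(query_rel)
    # For each sub-query, the probability that no selected document covers it yet.
    not_covered = {sub: 1.0 for sub in subquery_weight}

    def score(doc):
        diversity = sum(
            weight * subquery_rel.get(sub, {}).get(doc, 0.0) * not_covered[sub]
            for sub, weight in subquery_weight.items()
        )
        return (1.0 - lam) * query_rel[doc] + lam * diversity

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical by doc id).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        # Novelty update: sub-queries covered by `best` count less from now on.
        for sub in subquery_weight:
            not_covered[sub] *= 1.0 - subquery_rel.get(sub, {}).get(best, 0.0)
    return selected


# Hypothetical toy input: "a" and "b" cover the same aspect, "c" covers another.
query_rel = {"a": 0.9, "b": 0.8, "c": 0.5}
subquery_weight = {"s1": 0.5, "s2": 0.5}
subquery_rel = {"s1": {"a": 1.0, "b": 1.0}, "s2": {"c": 1.0}}
print(xquad_rerank(query_rel, subquery_weight, subquery_rel, lam=0.5))
print(xquad_rerank(query_rel, subquery_weight, subquery_rel, lam=0.0))
```

With `lam=0.0` the ranking reduces to plain relevance order; raising `lam` promotes documents that cover sub-queries not yet satisfied by earlier results.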

    The role of news values in the discursive construction of the Brexit referendum in the UK press

    The main aim of this study is to explore the news discourse of the Brexit referendum campaign from a corpus-assisted discourse studies (CADS) perspective. More specifically, I analyse how different topics and debates related to Brexit were discursively constructed in the British quality press coverage of the referendum campaign. I also investigate the ideological differences in the discursive construction of this coverage along political affiliations (left-right) and ideological stances toward Brexit (Leave-Remain). To do so, a corpus of four major British broadsheets (The Guardian, The Independent, The Times, and Daily Telegraph) was collected using the Nexis UK news databases. The search word used for data retrieval was Brexit. The results were down-sampled by limiting the search timespan [22 February to 23 June 2016] and the news type [articles], and by removing duplicates (i.e., articles repeated in digital and paper editions). The corpus was analysed using a combination of corpus linguistics tools for quantitative analysis and the Discourse of News Values Analysis (Bednarek and Caple, 2017) for qualitative analysis. Thus, the methodological design followed Baker et al.’s (2008) model of corpus-assisted critical discourse analysis. Using cluster analysis, the most frequent words of the corpus were extracted, and five major areas of discussion were identified, namely Brexit, Economy, Immigration, UK vs EU, and People and Public. Next, concordance analyses were performed for each word in the four sub-corpora, and the news values employed in each field were coded and their frequency computed.
Then, normalised distributions of the relative frequency of news values were calculated for each area and for all four outlets. In addition, a number of excerpts (62 pieces) were selected for further in-depth qualitative analysis of how news values were used to discursively construct Brexit and other related topics in the campaign coverage. The study results showed that, in general, the stance toward Brexit had more explanatory value than the traditional left-right divide in how different news values were used across the corpus. In addition, in many cases, different news values were used together to construct certain narratives and discourses in line with the outlets’ stance toward Brexit. The data suggest that certain news values were used hierarchically and synergistically, with important discursive and ideological implications. From the analysed data it can be concluded that the pro-Remain newspapers, in general, tended to construct a negative discourse about the consequences of Brexit by combining the news value of Negativity with Impact, Eliteness, Superlativeness and Proximity. On the other hand, the pro-Leave newspaper tried to downgrade such negative outcomes of a possible Brexit by systematically separating Negativity from other news values, thus devaluing and minimizing their potential impact. In other cases, the pro-Leave newspaper combined Positivity with Impact and Eliteness to enhance and elaborate some specific representations in its discourse. In general, the use of news values in the discursive construction of Brexit and its related semantic fields analysed in this dissertation can be considered a discursive practice highly charged with social, ideological, and political considerations.

    Query routing in cooperative semi-structured peer-to-peer information retrieval networks

    Conventional web search engines are centralised in that a single entity crawls and indexes the documents selected for future retrieval, and also controls the relevance models used to determine which documents are relevant to a given user query. As a result, these search engines suffer from several technical drawbacks, such as handling scale, timeliness and reliability, in addition to ethical concerns such as commercial manipulation and information censorship. To alleviate the need to rely entirely on a single entity, Peer-to-Peer (P2P) Information Retrieval (IR) has been proposed as a solution, as it distributes the functional components of a web search engine (from crawling and indexing documents to query processing) across the network of users (or peers) who use the search engine. This strategy for constructing an IR system poses several efficiency and effectiveness challenges which have been identified in past work. Accordingly, this thesis makes several contributions towards advancing the state of the art in P2P-IR effectiveness by improving the query processing and relevance scoring aspects of P2P web search. Federated search systems are a form of distributed information retrieval model that routes the user’s information need, formulated as a query, to distributed resources and merges the retrieved result lists into a final list. P2P-IR networks are one form of federated search, routing queries and merging results among participating peers. The query is propagated through the network nodes to reach the peers that are most likely to contain relevant documents, and the retrieved result lists are then merged at different points along the path from the relevant peers to the query initiator (i.e., the customer).
However, query routing is considered one of the major challenges and a critical part of P2P-IR networks, because relevant peers might be missed by low-quality peer selection during query routing, which inevitably leads to less effective retrieval results. This motivates this thesis to study and propose query routing techniques to improve retrieval quality in such networks. Cluster-based semi-structured P2P-IR networks exploit the cluster hypothesis to organise the peers into semantically similar clusters, where each such semantic cluster is managed by super-peers. In this thesis, I construct three semi-structured P2P-IR models and examine their retrieval effectiveness. I also leverage the cluster centroids at the super-peer level, as content representations gathered from cooperative peers, to propose a query routing approach called Inverted PeerCluster Index (IPI) that simulates the conventional inverted index of a centralised corpus to organise the statistics of peers’ terms. The results show competitive retrieval quality in comparison to baseline approaches. Furthermore, I study the applicability of conventional Information Retrieval models as peer selection approaches, where each peer can be considered a big document made up of documents. The experimental evaluation shows comparable and significant results, and indicates that document retrieval methods are very effective for peer selection, which supports the analogy between documents and peers. Additionally, Learning to Rank (LtR) algorithms are exploited to build a learned classifier for peer ranking at the super-peer level. The experiments show significant results compared with state-of-the-art resource selection methods and competitive results compared with corresponding classification-based approaches. Finally, I propose reputation-based query routing approaches that exploit the idea of providing feedback on a specific item in social community networks and managing it for future decision-making.
The system monitors users’ behaviour when they click on or download documents from the final ranked list as implicit feedback, and mines this information to build a reputation-based data structure. The data structure is used to score peers and then rank them for query routing. I conduct a set of experiments covering various scenarios, including noisy feedback information (i.e., positive feedback on non-relevant documents), to examine the robustness of reputation-based approaches. The empirical evaluation shows significant results in almost all measurement metrics, with an approximate improvement of more than 56% compared to baseline approaches. Thus, based on the results, if one were to choose one technique, reputation-based approaches are clearly the natural choice, and they can also be deployed on any P2P network.
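
The "peer as a big document" idea mentioned in this abstract can be sketched in a few lines of Python: each peer's documents are merged into one term-frequency profile, and the peers are then ranked against the query with an ordinary document-scoring function. The TF-IDF-style weighting, the whitespace tokeniser and all names below are assumptions chosen for brevity; the abstract does not say which retrieval models or parameters the thesis uses.

```python
import math
from collections import Counter


def build_peer_profiles(peers):
    """Merge each peer's documents into one big bag of words.

    peers: {peer_id: [document text, ...]}
    returns: {peer_id: Counter mapping term -> frequency}
    """
    return {
        peer_id: Counter(term for doc in docs for term in doc.lower().split())
        for peer_id, docs in peers.items()
    }


def select_peers(query, profiles, top_k=2):
    """Rank peers for a query, treating every peer profile as a single document."""
    n_peers = len(profiles)
    scores = {}
    for peer_id, tf in profiles.items():
        length = sum(tf.values()) or 1
        score = 0.0
        for term in query.lower().split():
            # Peer frequency: how many peers hold the term at all.
            pf = sum(1 for profile in profiles.values() if term in profile)
            if pf == 0 or tf[term] == 0:
                continue
            idf = math.log(1.0 + n_peers / pf)
            score += (tf[term] / length) * idf
        scores[peer_id] = score
    # Highest score first; ties broken by peer id for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_k]


# Hypothetical peers and their local collections.
peers = {
    "p1": ["protein gene expression", "gene therapy"],
    "p2": ["football match report"],
    "p3": ["gene football"],
}
profiles = build_peer_profiles(peers)
print(select_peers("gene", profiles))
print(select_peers("football match", profiles))
```

A query would then be routed only to the returned peers, and their result lists merged afterwards.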