11 research outputs found

    LTRo: Learning to Route Queries in Clustered P2P IR

    Query routing is a critical step in P2P Information Retrieval. In this paper, we consider learning-to-rank approaches for query routing in the clustered P2P IR architecture. Our formulation, LTRo, scores resources based on the number of relevant documents for each training query, and uses that information to build a model that then ranks promising peers for a new query. Our empirical analysis over a variety of P2P IR testbeds illustrates the superiority of our method over state-of-the-art methods for query routing.
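The training signal described in this abstract (resources scored by the number of relevant documents they hold for each training query) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the input formats `qrels` and `doc_to_peer` and both function names are hypothetical, and the learned ranking model itself is omitted.

```python
from collections import defaultdict


def peer_relevance_labels(qrels, doc_to_peer):
    """Count, per training query, the relevant documents held by each peer.

    qrels: dict mapping query id -> set of relevant document ids (assumed format).
    doc_to_peer: dict mapping document id -> peer id (assumed format).
    """
    labels = defaultdict(lambda: defaultdict(int))
    for query_id, relevant_docs in qrels.items():
        for doc_id in relevant_docs:
            peer = doc_to_peer.get(doc_id)
            if peer is not None:  # skip documents that no known peer holds
                labels[query_id][peer] += 1
    return {q: dict(p) for q, p in labels.items()}


def rank_peers(peer_scores):
    """Order peers by descending score, breaking ties by peer id."""
    return sorted(peer_scores, key=lambda p: (-peer_scores[p], p))
```

The per-query counts would serve as graded targets for a learning-to-rank model; `rank_peers` only shows how scores, whether counts or model outputs, become a routing order.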

    Transferring Learning To Rank Models for Web Search

    Learning to rank techniques provide mechanisms for combining document feature values into learned models that produce effective rankings. However, issues concerning the transferability of learned models between different corpora or subsets of the same corpus are not yet well understood. For instance, is the importance of different feature sets consistent between subsets of a corpus, and does a learned model obtained on a small subset of the corpus transfer effectively to the larger corpus? By formulating our experiments around two null hypotheses, in this work, we apply a full-factorial experiment design to empirically investigate these questions using the ClueWeb09 and ClueWeb12 corpora, combined with queries from the TREC Web track. Among other observations, our experiments reveal that ClueWeb09 remains an effective choice of training corpus for learning effective models for ClueWeb12, and also that the importance of query-independent features varies between the ClueWeb09 and ClueWeb12 corpora. In doing so, this work contributes an important study into the transferability of learning to rank models, as well as empirically derived best practices for effective retrieval on the ClueWeb12 corpus.

    Tackling Biased Baselines in the Risk-Sensitive Evaluation of Retrieval Systems

    The aim of optimising information retrieval (IR) systems using a risk-sensitive evaluation methodology is to minimise the risk of performing any particular topic less effectively than a given baseline system. Baseline systems in this context determine the reference effectiveness for topics, relative to which the effectiveness of a given IR system in minimising the risk will be measured. However, the comparative risk-sensitive evaluation of a set of diverse IR systems – as attempted by the TREC 2013 Web track – is challenging, as the different systems under evaluation may be based upon a variety of different (base) retrieval models, such as learning to rank or language models. Hence, a question arises about how to properly measure the risk exhibited by each system. In this paper, we argue that no model of information retrieval alone is representative enough in this respect to be a true reference for the models available in the current state-of-the-art, and demonstrate, using the TREC 2012 Web track data, that as the baseline system changes, the resulting risk-based ranking of the systems changes significantly. Instead of using a particular system’s effectiveness as the reference effectiveness for topics, we propose several remedies including the use of mean within-topic system effectiveness as a baseline, which is shown to enable unbiased measurements of the risk-sensitive effectiveness of IR systems.
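To make the proposed baseline concrete, the sketch below computes a mean within-topic baseline and scores a system against it. It assumes the risk measure is URisk, as used in the TREC 2013 Web track: the mean over topics of the wins minus (1 + alpha) times the losses relative to the baseline. The abstract does not name the measure, so that choice, the function names and the data layout are all assumptions.

```python
def mean_within_topic_baseline(scores):
    """Average effectiveness of all systems on each topic.

    scores: dict mapping system name -> dict mapping topic id -> effectiveness
    (assumed layout; every system is assumed to cover the same topics).
    """
    topics = next(iter(scores.values())).keys()
    return {t: sum(s[t] for s in scores.values()) / len(scores) for t in topics}


def urisk(system_scores, baseline, alpha=1.0):
    """URisk: wins minus (1 + alpha) * losses, averaged over topics."""
    wins = 0.0
    losses = 0.0
    for topic, base in baseline.items():
        delta = system_scores[topic] - base
        if delta > 0:
            wins += delta
        else:
            losses += -delta
    return (wins - (1.0 + alpha) * losses) / len(baseline)
```

With `alpha=0` the measure reduces to the mean difference from the baseline; larger values of `alpha` penalise per-topic losses more heavily.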

    The relationship between retrievability bias and retrieval performance

    A long-standing problem in the domain of Information Retrieval (IR) has been the influence of biases within an IR system on the ranked results presented to a user. Retrievability is an IR evaluation measure which provides a means to assess the level of bias present in a system by evaluating how easily documents in the collection can be found by the IR system in place. Retrievability is intrinsically related to retrieval performance because a document needs to be retrieved before it can be judged relevant. It is therefore reasonable to expect that lowering the level of bias present within a system could lead to improvements in retrieval performance. In this thesis, we undertake an investigation of the nature of the relationship between classical retrieval performance and retrievability bias. We explore the interplay between the two as we alter different aspects of the IR system in an attempt to investigate the Fairness Hypothesis: that a system which is fairer (i.e. exerts the least amount of retrievability bias) performs better. To investigate the relationship between retrievability bias and retrieval performance, we utilise a set of 6 standard TREC collections (3 news and 3 web) and a suite of standard retrieval models. We investigate this relationship by looking at four main aspects of the retrieval process, using this set of TREC collections to also explore how generalisable the findings are. We begin by investigating how the retrieval model used relates to both bias and performance by issuing a large set of queries to a set of common retrieval models. We find a general trend where using a retrieval model that is evaluated to be more fair (i.e. less biased) leads to improved performance over less fair systems, hinting that providing documents with a more equal opportunity for access can lead to better retrieval performance.
Following on from our first study, we investigate how bias and performance are affected by tuning the length normalisation of several parameterised retrieval models. We explore the space of the length normalisation parameters of BM25, PL2 and Language Modelling. We find that tuning these parameters often leads to a trade-off between performance and bias, such that minimising bias will often not equate to maximising performance when traditional TREC performance measures are used. However, we find that measures which account for document length and users' stopping strategies tend to evaluate the least biased settings to also be the maximum (or near maximum) performing parameter, indicating that the Fairness Hypothesis holds. Following this, we investigate the impact that query length has on retrievability bias. We issue various automatically generated query sets to the system to see if longer or shorter queries tend to influence the level of bias associated with the system. We find that longer queries tend to reduce bias, possibly because longer queries will often lead to more documents being retrieved, but the reductions in bias show diminishing returns. Our studies show that after issuing two terms, each additional term reduces bias by significantly less. Finally, we build on our work by employing some fielded retrieval models. We look at typical fielding, where the field relevance scores are computed individually then combined, and compare it with an enhanced version of fielding, where fields are weighted and combined then scored. We see that there are inherent biases against particular documents in the former model, especially in cases where a field is empty, and as such the latter tends both to perform better and to lower bias when compared with the former. In this thesis, we have examined several different ways in which performance and bias can be related.
We conclude that while the Fairness Hypothesis has its merits, it is not a universally applicable idea. We further add to this by noting that the method used to compute bias does not distinguish between positive and negative biases, and this influences our results. We do however support the idea that reducing the bias of a system by eliminating biases that are known to be negative should result in improvements in system performance.
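As an illustration of how retrievability bias is often quantified, the sketch below counts how many queries retrieve each document within a rank cutoff and summarises the inequality of those counts with the Gini coefficient. The abstract does not state which bias measure or retrievability function the thesis uses, so the cumulative (cutoff-based) count, the Gini summary and all names here are assumptions.

```python
from collections import Counter


def retrievability(ranked_lists, doc_ids, cutoff=100):
    """Cumulative retrievability: number of queries retrieving each document
    within the top `cutoff` ranks.

    ranked_lists: iterable of rankings, one list of document ids per query.
    doc_ids: every document in the collection, so unretrieved ones score 0.
    """
    counts = Counter()
    for ranking in ranked_lists:
        counts.update(set(ranking[:cutoff]))  # count a document once per query
    return {d: counts.get(d, 0) for d in doc_ids}


def gini(values):
    """Gini coefficient: 0 means all values are equal; values near 1 mean a
    few items hold almost all of the total."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)
```

A lower Gini value over the retrievability scores corresponds to a fairer system in the sense of the Fairness Hypothesis. Note that this summary does not say whether the favoured documents are relevant, which matches the caveat about positive and negative biases.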

    Query routing in cooperative semi-structured peer-to-peer information retrieval networks

    Conventional web search engines are centralised in that a single entity crawls and indexes the documents selected for future retrieval, and controls the relevance models used to determine which documents are relevant to a given user query. As a result, these search engines suffer from several technical drawbacks, such as handling scale, timeliness and reliability, in addition to ethical concerns such as commercial manipulation and information censorship. To alleviate the need to rely entirely on a single entity, Peer-to-Peer (P2P) Information Retrieval (IR) has been proposed as a solution, as it distributes the functional components of a web search engine (from crawling and indexing documents to query processing) across the network of users (or peers) who use the search engine. This strategy for constructing an IR system poses several efficiency and effectiveness challenges which have been identified in past work. Accordingly, this thesis makes several contributions towards advancing the state of the art in P2P-IR effectiveness by improving the query processing and relevance scoring aspects of P2P web search. Federated search systems are a form of distributed information retrieval model that route the user’s information need, formulated as a query, to distributed resources and merge the retrieved result lists into a final list. P2P-IR networks are one form of federated search, routing queries and merging results among participating peers. The query is propagated through the network to reach the peers that are most likely to contain relevant documents, then the retrieved result lists are merged at different points along the path from the relevant peers to the query initiator (that is, the customer).
However, query routing is considered one of the major challenges and a critical part of P2P-IR networks, as relevant peers might be lost through low-quality peer selection while executing the query routing, which inevitably leads to less effective retrieval results. This motivates this thesis to study and propose query routing techniques to improve retrieval quality in such networks. Cluster-based semi-structured P2P-IR networks exploit the cluster hypothesis to organise the peers into similar semantic clusters, where each such semantic cluster is managed by super-peers. In this thesis, I construct three semi-structured P2P-IR models and examine their retrieval effectiveness. I also leverage the cluster centroids at the super-peer level as content representations gathered from cooperative peers to propose a query routing approach called Inverted PeerCluster Index (IPI) that simulates the conventional inverted index of the centralised corpus to organise the statistics of peers’ terms. The results show a competitive retrieval quality in comparison to baseline approaches. Furthermore, I study the applicability of using conventional Information Retrieval models as peer selection approaches, where each peer can be considered as a big document of documents. The experimental evaluation shows comparable and significant results and indicates that document retrieval methods are very effective for peer selection, which supports the analogy between documents and peers. Additionally, Learning to Rank (LtR) algorithms are exploited to build a learned classifier for peer ranking at the super-peer level. The experiments show significant results compared with state-of-the-art resource selection methods and results competitive with corresponding classification-based approaches. Finally, I propose reputation-based query routing approaches that exploit the idea of providing feedback on a specific item in social community networks and managing it for future decision-making.
The system monitors users’ behaviours when they click or download documents from the final ranked list as implicit feedback, and mines the given information to build a reputation-based data structure. The data structure is used to score peers and then rank them for query routing. I conduct a set of experiments to cover various scenarios, including noisy feedback information (i.e., providing positive feedback on non-relevant documents), to examine the robustness of reputation-based approaches. The empirical evaluation shows significant results in almost all measurement metrics, with an approximate improvement of more than 56% compared to baseline approaches. Thus, based on the results, if one were to choose one technique, reputation-based approaches are clearly the natural choice, and they can also be deployed on any P2P network.

    IRRA at TREC 2009: Index term weighting based on divergence from independence model

    18th Text REtrieval Conference, TREC 2009 -- 17 November 2009 through 20 November 2009 -- Gaithersburg, MD -- 95384 [No abstract available]

    Avaliação de algoritmos para ordenação de documentos digitais recuperados em busca

    Undergraduate monograph, Universidade de Brasília, Faculdade UnB Gama, Software Engineering program, 2013. Information retrieval began in libraries, where what the user needed was retrieved through queries using, for example, catalog cards that categorized books by title, author, year or publisher. With the advance of technology, this process was automated, so that this kind of task came to be performed by computer. However, with the large volume of information available, it is not always easy to find what one is looking for effectively, which makes searching tiresome and laborious. To address this problem, there are studies and implementations concerning the ranking of information obtained through information retrieval. It is also of interest to adopt techniques for performing personalized queries according to characteristics pre-established by the users of the engine. Through studies of dynamic, term-based and static ranking algorithms for information retrieval, the objective of this work is to analyze the existing approaches to ranking in information retrieval. Together with the insertion of profiles to achieve a degree of query personalization in conjunction with an open source search engine, the work analyzes which of the algorithms is most accurate, using precision versus recall metrics; which algorithms allow queries with a degree of personalization, accepting profiles for the four engineering courses of the Universidade de Brasília, Faculdade do Gama; and how software engineering can contribute to ranking in information retrieval.

    Multiple ionization in strong laser fields

    With the ultrashort laser pulses available today, intensities which exceed the binding electrical field of an atom by several orders of magnitude are routinely achieved. As a consequence, it is possible to remove (ionize) one electron or several electrons from an atom within one pulse. The intensity dependence of laser-induced ionization is highly nonlinear and is mostly studied with chemically inert noble gases, using pulses with frequencies in the visible or near-infrared range. For intensities above 10^14 W/cm^2 and femtosecond pulse durations, single ionization (A->A+) can be described very well as a tunneling process with subsequent classical motion of the electron in the laser field. Ionization of two electrons can be expressed in terms of two independent single ionization steps (sequential double ionization, A->A+->A2+) if the intensity is high enough (e.g. I>10^15 W/cm^2 for neon). However, for smaller intensities, the measured A2+ ion yields are several orders of magnitude larger than those expected from the sequential mechanism, and the transition to the sequential regime leads to a characteristic knee structure in the intensity dependence of the yield. The ionization pathway responsible for the increased production of A2+ ions, i.e. the simultaneous ejection of two electrons (A->A2+), is called nonsequential double ionization (NSDI). For the description of this process, a semiclassical rescattering mechanism has proved successful. According to the rescattering mechanism, an electron tunnels from the atomic potential, is accelerated by the laser field and driven back to the ion where, in an inelastic collision, a second electron is released. With respect to the final momenta of the ionized electrons, the rescattering mechanism also allows for quantitative predictions which are in good agreement with experimental results.
The mechanisms of double ionization can be generalized to ionization of an arbitrary number of electrons, with all pathways deviating from the sequential one being referred to as nonsequential multiple ionization. An understanding of triple ionization is of special interest since it is the first case for which several competing nonsequential pathways exist, i.e. simultaneous ionization of three electrons described by the rescattering mechanism (I: A->A3+) and the two combinations of single ionization with NSDI by rescattering (II: A->A+->A3+ and III: A->A2+->A3+). Considering the nonlinear dependence of the tunneling probability on the ionization energies of the participating charge states, one expects that only the pathways I and II contribute significantly to the A3+ yield in the nonsequential intensity regime (e.g. IA+->A2+->A3+), respectively. Based on the predictions of the rescattering mechanism, these transitions should also manifest themselves in the momentum distributions of the A3+ ions. Since experiments could only partially confirm the above expectations, a detailed theoretical investigation of triple ionization is desirable. In this work, quantum mechanical simulations of triple ionization with laser pulses of visible and near-infrared frequencies are presented. To allow for efficient numerical calculations, the motion of the electrons is restricted to a three-dimensional subspace of the full configuration space. This modeling approach has already proved successful in the qualitative investigation of double ionization. From the quantum mechanical wave function of the model, several quantities are calculated which can also be measured experimentally (ion yields, electron and ion momentum distributions) and their dependence on the laser parameters (intensity, frequency, pulse duration) is studied. The main goal of this work is to understand the pathways and mechanisms of triple ionization in the different intensity regimes. 
For this purpose, we first study the ion yields as a function of intensity. Using one- and two-electron approximations, the yields of the pathways II - IV can be written as products of the yields of the intermediate charge states. This way, it is possible to quantitatively understand the A3+ yields in a wide range of intensities. To quantify the remaining pathway I, rescattering of an electron is analyzed classically (by performing trajectory studies) and quantum mechanically (by considering the time-dependent probability flux). Finally, the insights gained from the product yields and the rescattering analysis are used to interpret the A3+ ion momentum distributions, which reflect the change of the prevalent ionization pathway more clearly than the yields. A major result of this work is the importance of classical thresholds for simultaneous multiple ionization. For example, the onset of the regime where the intensity-dependent A3+/A+ yield ratio is approximately constant can be identified with the threshold intensity of simultaneous triple ionization, where the energy of the rescattered electron is equal to the sum of the two ionization energies of the A+ ion. Furthermore, the investigation of the A3+ yields indicates that the pathway III plays a much more important role for triple ionization in the nonsequential intensity regime than previously thought. Finally, one has to emphasize the ability of the model to qualitatively reproduce the essential experimental observations on triple ionization.
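The classical threshold mentioned above can be estimated with a short calculation. The sketch assumes the standard semiclassical result that a rescattered electron returns with at most about 3.17 times the ponderomotive energy U_p, and the practical formula U_p [eV] ≈ 9.33e-14 × I [W/cm^2] × λ^2 [μm^2]. Neither formula appears in the abstract, and the thesis works with a reduced-dimensional model whose ionization energies may differ from those of real atoms, so the numbers are only indicative.

```python
# Practical coefficient for U_p in eV with I in W/cm^2 and wavelength in micrometres.
UP_COEFF_EV = 9.33e-14
# Maximum classical return energy of the rescattered electron, in units of U_p.
MAX_RETURN_FACTOR = 3.17


def ponderomotive_energy_ev(intensity_w_cm2, wavelength_um):
    """Cycle-averaged quiver energy of a free electron in the laser field."""
    return UP_COEFF_EV * intensity_w_cm2 * wavelength_um ** 2


def threshold_intensity_w_cm2(required_energy_ev, wavelength_um):
    """Intensity at which the maximum return energy equals the required energy."""
    return required_energy_ev / (MAX_RETURN_FACTOR * UP_COEFF_EV * wavelength_um ** 2)


# Illustrative input: roughly 41.0 eV + 63.4 eV, the approximate second and
# third ionization energies of neon, at an assumed wavelength of 800 nm.
example_threshold = threshold_intensity_w_cm2(41.0 + 63.4, 0.8)
```

Under these assumptions the threshold for simultaneous triple ionization of neon at 800 nm falls in the mid-10^14 W/cm^2 range, below the intensity of about 10^15 W/cm^2 that the abstract cites for sequential double ionization of neon.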