72 research outputs found

    Simple Yet Effective Pseudo Relevance Feedback with Rocchio’s Technique and Text Classification

    With the continuous growth of the Internet and the availability of large-scale collections, assisting users in locating the information they need becomes a necessity. Generally, an information retrieval system processes an input query and provides a ranked list of results. However, this process can be challenging due to the "vocabulary mismatch" between input queries and passages. A well-known technique to address this issue is "query expansion", which reformulates the given query by selecting and adding more relevant terms. Relevance feedback, as a form of query expansion, collects users' opinions on candidate passages and expands the query with terms from the relevant ones. Pseudo relevance feedback assumes that the top documents in the initial retrieval are relevant and rebuilds queries without any user interaction. In this thesis, we discuss two implementations of pseudo relevance feedback: the decades-old Rocchio's Technique and the more recent text classification. As the reader might notice, neither technique is "novel" anymore, e.g., Rocchio dates back to the 1960s. Both were proposed and studied before the neural age, when texts were still mostly stored as bag-of-words representations. Today, transformers have been shown to advance information retrieval, and searching with transformer-based dense representations outperforms traditional bag-of-words searching on many challenging and complex ranking tasks. This motivates us to ask the following three research questions: RQ1: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback with Rocchio's Technique still perform effectively with both sparse and dense representations? RQ2: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback via text classification still perform effectively with both sparse and dense representations? 
RQ3: Does applying pseudo relevance feedback with text classification on top of Rocchio's Technique result in further improvements? To answer RQ1, we have implemented Rocchio's Technique with sparse representations based on the Anserini and Pyserini toolkits. Building on a previous implementation of Rocchio's Technique with dense representations in the Pyserini toolkit, we can easily evaluate and compare the impact of Rocchio's Technique on effectiveness with both sparse and dense representations. By applying Rocchio's Technique to MS MARCO Passage and Document TREC Deep Learning topics, we achieve an increase of about 0.03-0.04 in average precision. It is no surprise that Rocchio's Technique outperforms the BM25 baseline, but it is impressive to find that it is competitive with or even superior to RM3, a more common strong baseline, under most circumstances. Hence, we propose switching to Rocchio's Technique as a more robust and general baseline in future studies. To our knowledge, pseudo relevance feedback via text classification using both positive and negative labels had not been well studied before our work. To answer RQ2, we have verified the effectiveness of pseudo relevance feedback via text classification with both sparse and dense representations. Three classifiers (LR, SVM, KNN) are trained, and all enhance effectiveness. We also observe that pseudo relevance feedback via text classification yields greater improvement with dense representations than with sparse ones. However, when we compare text classification to Rocchio's Technique, we find that Rocchio's Technique is superior to pseudo relevance feedback via text classification under all circumstances. In RQ3, the success of pseudo relevance feedback via text classification on BM25 + RM3 across four newswire collections in our previous paper motivates us to study the impact of pseudo relevance feedback via text classification on top of another query expansion result, Rocchio's Technique. 
However, unlike with RM3, we did not observe much difference in the two evaluation metrics after applying pseudo relevance feedback via text classification on top of Rocchio's Technique. This work aims to explore some simple yet effective techniques which might be overlooked in light of deep learning transformers. Instead of pursuing "more", we are aiming to find out something "less". We demonstrate the robustness and effectiveness of some "out-of-date" methods in the age of neural networks.
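
As a rough illustration of the pseudo relevance feedback idea summarized in the abstract above, the following sketch applies the classic Rocchio update to a query vector, treating the top-ranked documents as relevant and, optionally, the bottom-ranked ones as non-relevant. It is not the thesis's Anserini or Pyserini implementation: the function names, the parameters `top_k` and `bottom_k`, and the default weights `alpha`, `beta`, and `gamma` are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]], dim: int) -> Vector:
    """Component-wise mean; returns the zero vector for an empty set."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_prf(
    query: Sequence[float],
    ranked_docs: Sequence[Sequence[float]],
    top_k: int = 3,        # assumed number of pseudo-relevant documents
    bottom_k: int = 0,     # assumed number of pseudo-non-relevant documents
    alpha: float = 1.0,    # weight of the original query (illustrative default)
    beta: float = 0.75,    # weight of the pseudo-relevant centroid (illustrative)
    gamma: float = 0.0,    # weight of the pseudo-non-relevant centroid (illustrative)
) -> Vector:
    """Return alpha*q + beta*mean(top docs) - gamma*mean(bottom docs).

    `ranked_docs` holds document vectors in initial-retrieval order, best first.
    The vectors may be sparse term weights or dense embeddings, as long as
    they share the query's dimensionality.
    """
    dim = len(query)
    positives = ranked_docs[:top_k]
    negatives = ranked_docs[len(ranked_docs) - bottom_k:] if bottom_k > 0 else []
    pos_centroid = mean_vector(positives, dim)
    neg_centroid = mean_vector(negatives, dim)
    return [
        alpha * q + beta * p - gamma * n
        for q, p, n in zip(query, pos_centroid, neg_centroid)
    ]
```

The reformulated vector would then be used for a second retrieval pass; how that pass is run depends on the retrieval toolkit and is not shown here.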

    Information retrieval models for recommender systems

    Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] Information retrieval addresses the information needs of users by delivering relevant pieces of information but requires users to convey their information needs explicitly. In contrast, recommender systems offer personalized suggestions of items automatically. Ultimately, both fields help users cope with information overload by providing them with relevant items of information. This thesis aims to explore the connections between information retrieval and recommender systems. Our objective is to devise recommendation models inspired by information retrieval techniques. We begin by borrowing ideas from the information retrieval evaluation literature to analyze evaluation metrics in recommender systems. Second, we study the applicability of pseudo-relevance feedback models to different recommendation tasks. We investigate the conventional top-N recommendation task, but we also explore the recently formulated user-item group formation problem and propose a novel task based on the liquidation of long tail items. Third, we exploit ad hoc retrieval models to compute neighborhoods in a collaborative filtering scenario. Fourth, we explore the opposite direction by adapting an effective recommendation framework to pseudo-relevance feedback. Finally, we discuss the results and present our conclusions. In summary, this doctoral thesis adapts a series of information retrieval models to recommender systems. Our investigation shows that many retrieval models can be adapted to deal with different recommendation tasks. Moreover, we find that taking the opposite path is also possible. Exhaustive experimentation confirms that the proposed models are competitive. Finally, we also perform a theoretical analysis of some models to explain their effectiveness. The abstract is also provided in Spanish ([Resumen]) and Galician ([Resumo]).

    Personalizing type-based facet ranking using BERT embeddings

    In Faceted Search Systems (FSS), users navigate the information space through facets, which are attributes or metadata that describe the underlying content of the collection. Type-based facets (aka t-facets) help explore the categories associated with the searched objects in a structured information space. This work investigates how personalizing t-facet ranking can minimize the user effort needed to reach the intended search target. We propose a lightweight personalization method based on the Vector Space Model (VSM) for ranking the t-facet hierarchy in two steps. The first step scores each individual leaf-node t-facet by computing the similarity between the t-facet BERT embedding and the user profile vector. In this model, the user's profile is expressed in a category space through vectors that capture the user's past preferences. In the second step, this score is used to re-order and select the sub-tree to present to the user. The final ranked tree reflects the t-facet relevance both to the query and to the user profile. Through the use of embeddings, the proposed method effectively handles unseen facets without adding extra processing to the FSS. The effectiveness of the proposed approach is measured by the user effort required to retrieve the sought item when using the ranked facets. The approach outperformed existing personalization baselines.
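
The first step described in the abstract above, scoring each leaf-node t-facet against the user profile, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, so cosine similarity is assumed here; the toy vectors stand in for BERT embeddings; and the names `cosine` and `rank_leaf_facets` are hypothetical. The second step (re-ordering and selecting the sub-tree) is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_leaf_facets(
    user_profile: Sequence[float],
    facet_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every leaf facet against the profile and sort, best first.

    Assumes the profile vector and the facet embeddings live in the same
    vector space and have the same dimensionality.
    """
    scored = [
        (name, cosine(user_profile, vector))
        for name, vector in facet_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because scoring needs only an embedding for each facet, a facet never seen in the user's history can still be ranked, which is consistent with the abstract's remark about unseen facets.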

    Conceptual, Impact-Based Publications Recommendations

    CiteSeerx is a digital library for scientific publications by computer science researchers. It also functions as a search engine with several features including autonomous citation indexing, automatic metadata extraction, full-text indexing and reference linking. Users are able to retrieve relevant documents from the CiteSeerx database directly using search queries and will further benefit if the system suggests document recommendations to the user based on their preferences and search history. Therefore, recommender systems were initially developed and continue to evolve to recommend more relevant documents to CiteSeerx users. In this thesis, we introduce the Conceptual, Impact-Based Recommender (CIBR), a hybrid recommender system derived from the previously implemented conceptual recommender system in CiteSeerx. The conceptual recommender system used the user's top weighted concepts to recommend relevant documents. Our hybrid recommender system, CIBR, considers the impact factor in addition to the top weighted concepts when generating recommendations for the user. The impact factor of a document is determined using the h-index of the publication's author. A survey was conducted to evaluate the efficiency of our hybrid system, and this study shows that the CIBR system generates more relevant documents than those recommended by the conceptual recommender system.

    RCV1: A new benchmark collection for text categorization research

    Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well a

    Automatic text categorization for information filtering.

    Ho Chao Yang. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 157-163). Abstract also in Chinese.
    Abstract
    Acknowledgment
    List of Figures
    List of Tables
    Chapter 1: Introduction
      1.1 Automatic Document Categorization
      1.2 Information Filtering
      1.3 Contributions
      1.4 Organization of the Thesis
    Chapter 2: Related Work
      2.1 Existing Automatic Document Categorization Approaches
        2.1.1 Rule-Based Approach
        2.1.2 Similarity-Based Approach
      2.2 Existing Information Filtering Approaches
        2.2.1 Information Filtering Systems
        2.2.2 Filtering in TREC
    Chapter 3: Document Pre-Processing
      3.1 Document Representation
      3.2 Classification Scheme Learning Strategy
    Chapter 4: A New Approach - IBRI
      4.1 Overview of Our New IBRI Approach
      4.2 The IBRI Representation and Definitions
      4.3 The IBRI Learning Algorithm
    Chapter 5: IBRI Experiments
      5.1 Experimental Setup
      5.2 Evaluation Metric
      5.3 Results
    Chapter 6: A New Approach - GIS
      6.1 Motivation of GIS
      6.2 Similarity-Based Learning
      6.3 The Generalized Instance Set Algorithm (GIS)
      6.4 Using GIS Classifiers for Classification
      6.5 Time Complexity
    Chapter 7: GIS Experiments
      7.1 Experimental Setup
      7.2 Results
    Chapter 8: A New Information Filtering Approach Based on GIS
      8.1 Information Filtering Systems
      8.2 GIS-Based Information Filtering
    Chapter 9: Experiments on GIS-based Information Filtering
      9.1 Experimental Setup
      9.2 Results
    Chapter 10: Conclusions and Future Work
      10.1 Conclusions
      10.2 Future Work
    Chapter A: Sample Documents in the corpora
    Chapter B: Details of Experimental Results of GIS
    Chapter C: Computational Time of Reuters-21578 Experiments