72 research outputs found
Simple Yet Effective Pseudo Relevance Feedback with Rocchio’s Technique and Text Classification
With the continuous growth of the Internet and the availability of large-scale collections, assisting users in locating the information they need has become a necessity. Generally, an information retrieval system processes an input query and returns a ranked list of results. However, this process can be challenging due to the "vocabulary mismatch" issue between input queries and passages. A well-known technique to address this issue is "query expansion", which reformulates the given query by selecting and adding more relevant terms. Relevance feedback, as a form of query expansion, collects users' opinions on candidate passages and expands the query with terms from the relevant ones. Pseudo relevance feedback assumes that the top documents in the initial retrieval are relevant and rebuilds queries without any user interaction.
In this thesis, we discuss two implementations of pseudo relevance feedback: the decades-old Rocchio's Technique and the more recent text classification approach. As the reader might notice, neither technique is "novel" anymore; Rocchio's Technique, for example, dates back to the 1960s. Both were proposed and studied before the neural age, when texts were still mostly stored as bag-of-words representations. Today, transformers have been shown to advance information retrieval, and searching with transformer-based dense representations outperforms traditional bag-of-words searching on many challenging and complex ranking tasks.
This motivates us to ask the following three research questions:
RQ1: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback with Rocchio's Technique still perform effectively with both sparse and dense representations?
RQ2: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback via text classification still perform effectively with both sparse and dense representations?
RQ3: Does applying pseudo relevance feedback with text classification on top of Rocchio's Technique result in further improvements?
To answer RQ1, we have implemented Rocchio's Technique with sparse representations based on the Anserini and Pyserini toolkits. Building on a previous implementation of Rocchio's Technique with dense representations in the Pyserini toolkit, we can easily evaluate and compare the impact of Rocchio's Technique on effectiveness with both sparse and dense representations. By applying Rocchio's Technique to MS MARCO Passage and Document TREC Deep Learning topics, we achieve an increase of about 0.03-0.04 in average precision. It is no surprise that Rocchio's Technique outperforms the BM25 baseline, but it is impressive to find that it is competitive with or even superior to RM3, a more common strong baseline, under most circumstances. Hence, we propose switching to Rocchio's Technique as a more robust and general baseline in future studies.
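To make the idea concrete, here is a minimal sketch of positive-only Rocchio expansion over sparse bag-of-words vectors. It is not the Anserini or Pyserini implementation: the function name `rocchio_expand`, the parameter defaults (`alpha`, `beta`, `top_k`, `max_terms`), and the dictionary-based vectors are all assumptions chosen for illustration.

```python
from collections import defaultdict


def rocchio_expand(query_vec, ranked_doc_vecs, top_k=10,
                   alpha=1.0, beta=0.75, max_terms=20):
    """Expand a sparse query vector with the centroid of the top-k documents.

    query_vec and each entry of ranked_doc_vecs map term -> weight.
    ranked_doc_vecs is assumed to be sorted by the initial retrieval score,
    so its first top_k entries are treated as pseudo-relevant.
    """
    feedback = ranked_doc_vecs[:top_k]
    expanded = defaultdict(float)
    # Keep the original query terms, scaled by alpha.
    for term, weight in query_vec.items():
        expanded[term] += alpha * weight
    # Add the mean vector of the pseudo-relevant documents, scaled by beta.
    for doc_vec in feedback:
        for term, weight in doc_vec.items():
            expanded[term] += beta * weight / len(feedback)
    # Keep only the highest-weighted terms (ties broken alphabetically).
    best = sorted(expanded.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return dict(best)
```

The expanded vector would then be issued as a second query against the same index. With dense representations the same update applies to embedding vectors instead of term weights.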
To our knowledge, pseudo relevance feedback via text classification using both positive and negative labels was not well studied before our work. To answer RQ2, we have verified the effectiveness of pseudo relevance feedback via text classification with both sparse and dense representations. Three classifiers (LR, SVM, KNN) are trained, and all enhance effectiveness. We also observe that pseudo relevance feedback via text classification yields greater improvement with dense representations than with sparse ones. However, when we compare text classification to Rocchio's Technique, we find that Rocchio's Technique is superior to pseudo relevance feedback via text classification under all circumstances.
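As an illustration only, the following sketch shows one way pseudo relevance feedback via text classification could look with a KNN classifier. The abstract does not specify how labels are chosen or how scores are combined, so everything here is an assumption: the top-ranked documents are labeled positive, the bottom-ranked ones negative, and the classifier output is linearly interpolated with the min-max normalized retrieval score. The names `knn_feedback_rerank`, `n_pos`, `n_neg`, `k`, and `mix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_feedback_rerank(ranked, n_pos=2, n_neg=2, k=3, mix=0.5):
    """Rerank (doc_id, score, vector) triples sorted by descending score.

    Assumes len(ranked) >= n_pos + n_neg so the pseudo-labels do not overlap.
    """
    # Pseudo-labels: top documents are positive, bottom documents negative.
    train = [(vec, 1.0) for _, _, vec in ranked[:n_pos]]
    train += [(vec, 0.0) for _, _, vec in ranked[-n_neg:]]
    lo = min(score for _, score, _ in ranked)
    hi = max(score for _, score, _ in ranked)
    span = (hi - lo) or 1.0
    rescored = []
    for doc_id, score, vec in ranked:
        # KNN vote: fraction of positive labels among the k nearest examples.
        nearest = sorted(train, key=lambda ex: cosine(vec, ex[0]), reverse=True)[:k]
        clf = sum(label for _, label in nearest) / len(nearest)
        # Interpolate the classifier score with the normalized retrieval score.
        rescored.append((doc_id, mix * clf + (1 - mix) * (score - lo) / span))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

Logistic regression or an SVM would replace the KNN vote with a learned decision function, leaving the labeling and interpolation steps unchanged.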
In RQ3, the success of pseudo relevance feedback via text classification on BM25 + RM3 across four newswire collections in our previous paper motivates us to study the impact of pseudo relevance feedback via text classification on top of another query expansion result, Rocchio's Technique. However, unlike RM3, we could not observe much difference in the two evaluation metrics after applying pseudo relevance feedback via text classification on top of Rocchio's Technique.
This work aims to explore some simple yet effective techniques that might be overlooked in light of deep learning transformers. Instead of pursuing "more", we aim to find out something "less". We demonstrate the robustness and effectiveness of some "out-of-date" methods in the age of neural networks.
Information retrieval models for recommender systems
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
Information retrieval addresses the information needs of users by delivering
relevant pieces of information but requires users to convey their
information needs explicitly. In contrast, recommender systems offer personalized
suggestions of items automatically. Ultimately, both fields help
users cope with information overload by providing them with relevant
items of information.
This thesis aims to explore the connections between information retrieval
and recommender systems. Our objective is to devise recommendation
models inspired by information retrieval techniques. We begin by
borrowing ideas from the information retrieval evaluation literature to analyze
evaluation metrics in recommender systems. Second, we study the
applicability of pseudo-relevance feedback models to different recommendation
tasks. We investigate the conventional top-N recommendation
task, but we also explore the recently formulated user-item group formation
problem and propose a novel task based on the liquidation of long-tail
items. Third, we exploit ad hoc retrieval models to compute neighborhoods
in a collaborative filtering scenario. Fourth, we explore the
opposite direction by adapting an effective recommendation framework
to pseudo-relevance feedback. Finally, we discuss the results and present
our conclusions.
In summary, this doctoral thesis adapts a series of information retrieval
models to recommender systems. Our investigation shows that many
retrieval models can be accommodated to deal with different recommendation
tasks. Moreover, we find that taking the opposite path is also
possible. Exhaustive experimentation confirms that the proposed models
are competitive. Finally, we also perform a theoretical analysis of some
models to explain their effectiveness.
Personalizing type-based facet ranking using BERT embeddings
In Faceted Search Systems (FSS), users navigate the information space through facets, which are attributes or meta-data that describe the underlying content of the collection.
Type-based facets (aka t-facets) help explore the categories associated with the searched objects in structured information space.
This work investigates how personalizing t-facet ranking can minimize user effort to reach the intended search target.
We propose a lightweight personalization method based on the Vector Space Model (VSM) for ranking the t-facet hierarchy in two steps.
The first step scores each individual leaf-node t-facet by computing the similarity between the t-facet BERT embedding and the user profile vector.
In this model, the user's profile is expressed in a category space through vectors that capture the users' past preferences.
In the second step, this score is used to re-order and select the sub-tree to present to the user.
The final ranked tree reflects the t-facet relevance both to the query and the user profile.
Through the use of embeddings, the proposed method effectively handles unseen facets without adding extra processing to the FSS.
The effectiveness of the proposed approach is measured by the user effort required to retrieve the sought item when using the ranked facets.
The approach outperformed existing personalization baselines.
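The two-step ranking described above can be sketched as follows. This is a hedged illustration rather than the authors' method: plain lists of floats stand in for BERT embeddings, the parent score is assumed to be the maximum leaf score (the abstract does not say how leaf scores are aggregated), and the query-relevance component is left out. The names `rank_facet_tree` and `top_parents` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_facet_tree(tree, embeddings, user_profile, top_parents=2):
    """Rank a two-level t-facet tree against a user profile vector.

    tree maps a parent facet to its list of leaf facets; embeddings maps
    each leaf facet to its vector (a stand-in for a BERT embedding).
    """
    ranked = []
    for parent, leaves in tree.items():
        # Step 1: score each leaf by its similarity to the user profile.
        ordered = sorted(leaves,
                         key=lambda leaf: cosine(embeddings[leaf], user_profile),
                         reverse=True)
        # Assumed aggregation: a parent is as good as its best leaf.
        best = cosine(embeddings[ordered[0]], user_profile) if ordered else 0.0
        ranked.append((best, parent, ordered))
    # Step 2: reorder the sub-trees and keep only the top ones.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [(parent, leaves) for _, parent, leaves in ranked[:top_parents]]
```

Because scoring depends only on embeddings, a facet never seen in the user's history can still be ranked as long as an embedding can be computed for it.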
Conceptual, Impact-Based Publications Recommendations
CiteSeerx is a digital library for scientific publications by computer science researchers. It also functions as a search engine with several features including autonomous citation indexing, automatic metadata extraction, full-text indexing and reference linking. Users are able to retrieve relevant documents from the CiteSeerx database directly using search queries and will further benefit if the system suggests document recommendations to the user based on their preferences and search history. Therefore, recommender systems were initially developed and continue to evolve to recommend more relevant documents to the CiteSeerx users. In this thesis, we introduce the Conceptual, Impact-Based Recommender (CIBR), a hybrid recommender system, derived from the previously implemented conceptual recommender system in CiteSeerx. The Conceptual recommender system utilized the user's top weighted concepts to recommend relevant documents to the users. Our hybrid recommender system, CIBR, considers the impact factor in addition to the top weighted concepts for generating recommendations for the user. The impact factor of a document is determined by using the author's h-index of the publication. A survey was conducted to evaluate the efficiency of our hybrid system and this study shows that the CIBR system generates more relevant documents as compared to those recommended by the conceptual recommender system.
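For readers unfamiliar with the h-index, the sketch below computes it from a list of per-paper citation counts using the standard definition: the largest h such that h papers each have at least h citations. The `hybrid_score` function is purely hypothetical. The abstract does not state how CIBR combines concept weights with impact, so the linear blend, the `weight` parameter, and the `h_cap` normalization are assumptions made for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def hybrid_score(concept_score, author_h, weight=0.3, h_cap=50):
    """Hypothetical blend of a concept-match score in [0, 1] with author impact.

    The h-index is capped at h_cap and scaled to [0, 1] before mixing.
    """
    impact = min(author_h, h_cap) / h_cap
    return (1 - weight) * concept_score + weight * impact
```

Under this assumed blend, two documents with equal concept scores would be ordered by the h-index of their authors.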
RCV1: A new benchmark collection for text categorization research
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well a
Automatic text categorization for information filtering.
Ho Chao Yang. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 157-163). Abstract also in Chinese.
Abstract
Acknowledgment
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Automatic Document Categorization
  1.2 Information Filtering
  1.3 Contributions
  1.4 Organization of the Thesis
Chapter 2: Related Work
  2.1 Existing Automatic Document Categorization Approaches
    2.1.1 Rule-Based Approach
    2.1.2 Similarity-Based Approach
  2.2 Existing Information Filtering Approaches
    2.2.1 Information Filtering Systems
    2.2.2 Filtering in TREC
Chapter 3: Document Pre-Processing
  3.1 Document Representation
  3.2 Classification Scheme Learning Strategy
Chapter 4: A New Approach - IBRI
  4.1 Overview of Our New IBRI Approach
  4.2 The IBRI Representation and Definitions
  4.3 The IBRI Learning Algorithm
Chapter 5: IBRI Experiments
  5.1 Experimental Setup
  5.2 Evaluation Metric
  5.3 Results
Chapter 6: A New Approach - GIS
  6.1 Motivation of GIS
  6.2 Similarity-Based Learning
  6.3 The Generalized Instance Set Algorithm (GIS)
  6.4 Using GIS Classifiers for Classification
  6.5 Time Complexity
Chapter 7: GIS Experiments
  7.1 Experimental Setup
  7.2 Results
Chapter 8: A New Information Filtering Approach Based on GIS
  8.1 Information Filtering Systems
  8.2 GIS-Based Information Filtering
Chapter 9: Experiments on GIS-based Information Filtering
  9.1 Experimental Setup
  9.2 Results
Chapter 10: Conclusions and Future Work
  10.1 Conclusions
  10.2 Future Work
Chapter A: Sample Documents in the corpora
Chapter B: Details of Experimental Results of GIS
Chapter C: Computational Time of Reuters-21578 Experiments