11 research outputs found

    DCU-TCD@LogCLEF 2010: re-ranking document collections and query performance estimation

    This paper describes the collaborative participation of Dublin City University and Trinity College Dublin in LogCLEF 2010. Two sets of experiments were conducted. First, different aspects of the TEL query logs were analysed after extracting user sessions of consecutive queries on a topic. Within each session, the relation between queries, their length (number of terms) and their position (first query or later reformulation) was examined with respect to query performance estimators such as query scope, IDF-based measures, simplified query clarity score, and average inverse document collection frequency. Results of this analysis suggest that only some estimator values show a correlation with query length or position in the TEL logs (e.g. similarity score between collection and query). Second, the relation between three attributes was investigated: the user's country (detected from IP address), the query language, and the interface language. The investigation aimed to explore the influence of the three attributes on the user's collection selection. Moreover, the investigation involved assigning different weights to the three attributes in a scoring function that was used to re-rank the collections displayed to the user according to the language and country. The results of the collection re-ranking show a significant improvement in Mean Average Precision (MAP) over the original collection ranking of TEL. The results also indicate that the query language and interface language have more influence than the user's country on the collections selected by the users.
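    Two of the pre-retrieval estimators named above can be sketched as follows. This is a minimal illustration using common textbook formulations (average IDF as the mean of log(N/df), query scope as the negative log of the fraction of documents matching the query); the abstract does not give the exact definitions used in the paper, and the function and parameter names are hypothetical.

```python
import math

def avg_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency over query terms that occur in the collection."""
    idfs = [math.log(num_docs / doc_freq[t])
            for t in query_terms if doc_freq.get(t, 0) > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0

def query_scope(num_matching_docs, num_docs):
    """Negative log of the fraction of documents containing at least one query term."""
    if num_matching_docs == 0:
        return math.inf  # assumption: an unmatched query is treated as maximally specific
    return -math.log(num_matching_docs / num_docs)
```

    Both values need only collection statistics, so they can be computed before any retrieval run.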

    What Makes a Top-Performing Precision Medicine Search Engine? Tracing Main System Features in a Systematic Way

    From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task on precision medicine using documents from medical publications (PubMed) and clinical trials. Despite the many performance measurements carried out in these evaluation campaigns, the scientific community is still uncertain about the impact that individual system features and their weights have on overall system performance. To close this explanatory gap, we first determined optimal feature configurations using the Sequential Model-based Algorithm Configuration (SMAC) program and applied its output to a BM25-based search engine. We then ran an ablation study to systematically assess the individual contributions of relevant system features: BM25 parameters, query type and weighting schema, query expansion, stop word filtering, and keyword boosting. For evaluation, we employed the gold standard data from the three TREC-PM installments and the commonly shared infNDCG metric. Comment: Accepted for SIGIR 2020, 10 pages.
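    For readers unfamiliar with the BM25 parameters mentioned above, the following is a minimal sketch of one common BM25 variant. The IDF form (with the added 1 inside the logarithm) and the defaults k1=1.2 and b=0.75 are assumptions for illustration, not the configuration found by the paper.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenised document against a tokenised query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)  # document frequency of the term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        # k1 controls term-frequency saturation, b controls length normalisation
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score
```

    Tuning in this setting amounts to searching over k1 and b (and the other listed features) for the combination that maximises the evaluation metric.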

    Novel approach for quantitative and qualitative authors research profiling using feature fusion and tree-based learning approach

    Article citation creates a link between the cited and citing articles and is used as a basis for several measures of scientific achievement, such as author and journal impact factor, H-index, and i10 index. Citations also include self-citation, in which authors cite their own articles. Self-citation is important for evaluating an author's research profile and has gained popularity recently. Although different criteria are found in the literature regarding appropriate self-citation, self-citation does have a huge impact on a researcher's scientific profile. This study examines two cases in this regard. In case 1, the qualitative aspect of the author's profile is analyzed using hand-crafted feature engineering techniques. The sentiments conveyed through citations are integral in assessing research quality, as they can signify appreciation, critique, or serve as a foundation for further research. Analyzing sentiments within in-text citations remains a formidable challenge, even with the use of automated sentiment annotations. For this purpose, this study employs machine learning models using term frequency (TF) and term frequency-inverse document frequency (TF-IDF). Random forest using TF with Synthetic Minority Oversampling Technique (SMOTE) achieved an accuracy of 0.9727. Case 2 deals with quantitative analysis and investigates direct and indirect self-citation. In this study, the top 2% of researchers in 2020 are considered as a baseline. For this purpose, the data of the top 25 Pakistani researchers are manually retrieved from this dataset, in addition to the citation information from the Web of Science (WoS). The self-citation is estimated using the proposed model and results are compared with those obtained from WoS. Experimental results show a substantial difference between the two, as the ratio of self-citation from the proposed approach is higher than WoS. It is observed that the citations from the WoS for authors are overstated. For a comprehensive evaluation of the researcher's profile, both direct and indirect self-citation must be included.
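    The two text representations named in the abstract, TF and TF-IDF, differ as sketched below. This is a plain-Python illustration with an unsmoothed IDF (log of N over document frequency); library implementations often use smoothed variants, and the classifier and SMOTE steps of the study are not shown. Function names are hypothetical.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Raw term counts for each tokenised document."""
    return [Counter(doc) for doc in docs]

def tfidf_vectors(docs):
    """Term counts weighted by log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
```

    Under TF-IDF a term that appears in every citation sentence receives weight zero, whereas under TF it keeps its raw count.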

    A Constraint to Automatically Regulate Document-Length Normalisation

    Retrieval functions in information retrieval (IR) are fundamental to the effectiveness of search systems. However, considerable parameter tuning is often needed to increase the effectiveness of the retrieval. Document length normalisation is one such aspect that requires tuning on a per-query and per-collection basis for many retrieval functions. In this paper, we develop an approach that regularises the level of normalisation to apply on a per-query basis. We formally describe the interaction between query-terms and document length normalisation using a constraint. We then develop a general pre-retrieval approach to adapt a number of state-of-the-art ranking functions so that they adhere to the constraint. Finally, we empirically demonstrate that the adapted retrieval functions outperform default versions of the original retrieval functions, and perform at least comparably to tuned versions of the original functions, on a number of datasets. Essentially this regulates the normalisation parameter in a number of retrieval functions on a per-query basis in a principled manner.
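    The kind of normalisation parameter being regulated can be illustrated with pivoted length normalisation, as used in BM25-style functions. The sketch below only shows how the parameter b changes a term frequency; the paper's constraint and its per-query adaptation are not described in this abstract and are not reproduced here.

```python
def normalised_tf(tf, doc_len, avg_doc_len, b):
    """Pivoted length normalisation: b=0 applies none, b=1 applies it fully."""
    return tf / (1 - b + b * doc_len / avg_doc_len)
```

    For a document twice the average length, b=1 halves the term frequency while b=0 leaves it unchanged, which is why the best value can differ between queries and collections.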

    A systematic approach to normalization in probabilistic models

    Open access funding provided by Austrian Science Fund (FWF). This research was partly supported by the Austrian Science Fund (FWF) Project Number P25905-N23 (ADmIRE). This work has been supported by the Self-Optimizer project (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union

    Index compression for information retrieval systems

    Given the increasing amount of information that is available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means identifying accurately which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge of IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It suggests several novel strategies that can render IR systems more efficient by reducing the index size of IR systems, referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both of these approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the document identifiers of the collection in the index, or by selectively discarding information that is less relevant to the retrieval process by pruning the index. The index compression strategies proposed in this thesis can be grouped into two categories: (i) strategies which extend the state of the art in the field of efficiency methods in novel ways; (ii) strategies which are derived from properties pertaining to the effectiveness of IR systems; these are novel strategies, because they are derived from effectiveness as opposed to efficiency principles, and also because they show that efficiency and effectiveness can be successfully combined for retrieval. The main contributions of this work are in indicating principled extensions of the state of the art in index compression, and also in suggesting novel theoretically-driven index compression techniques which are derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straightforward comparison with the state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.
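    Why reassigning document identifiers can shrink an index is easiest to see with gap encoding. The sketch below assumes posting lists are stored as gaps between sorted identifiers and compressed with a simple variable-byte code; this is a standard textbook scheme chosen for illustration, not necessarily the encoding used in the thesis, and static pruning is not shown.

```python
def d_gaps(doc_ids):
    """Turn a sorted posting list into gaps between consecutive identifiers."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit marks the last byte of a number
    return bytes(out)

def compressed_size(doc_ids):
    """Bytes needed to store a posting list as variable-byte encoded gaps."""
    return len(vbyte_encode(d_gaps(sorted(doc_ids))))
```

    When documents that share terms receive nearby identifiers, the gaps become small and fit in fewer bytes: in this scheme four consecutive identifiers starting at 1000 take 5 bytes, while four identifiers spaced 1000 apart take 8.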

    Recherche d'information et contexte

    My research is in the field of Information Retrieval (IR), whose objective is to enable users to find information that meets their needs within a large volume of information. Work in IR was at first system-oriented: it long focused on matching, to assess the correspondence between queries and documents, and on indexing documents and queries to obtain representations that support this matching. This led to theoretical IR models such as the vector space model and the probabilistic model. The initial aim was an IR model with the most effective overall behaviour possible, resting on simplifying assumptions: a single type of search, the same processing applied to every query, and no account taken of the context in which the search takes place. The scope of IR has steadily widened, notably with the rise of the Internet. The growing volume of information, combined with the widespread adoption of information retrieval systems (IRS), has produced a diversity of situations that makes it harder to identify the information matching each user's need, marking the limits of existing IR approaches. In response, proposals emerged to bring the user closer to the system, such as relevance feedback and user profiles, and the field of contextual IR has recently emerged to unify this work and to offer IRS that respond more precisely to users' needs. The principle is to differentiate searches at the level of IR models by integrating contextual factors likely to influence IRS effectiveness. The notion of context is broad and refers to all knowledge related to the search conducted by a user querying an IRS. My research has been directed toward three contextual factors: the domain of the information, the structure of the information, and the user. The first three directions of my work propose models that incorporate each of these elements of context, and a fourth direction explores how to adapt the retrieval process to each search according to its context. Various European and national projects have provided application frameworks for this research and have allowed us to validate our proposals. This research has also led to the development of various prototypes and to the supervision of PhD theses and research internships.