A Relevance Feedback-Based System For Quickly Narrowing Biomedical Literature Search Result
Online literature is an important source of information. Its rapid growth makes manually searching for the most relevant information very time-consuming, because users must sift through many results to find the relevant ones. Existing search engines and online databases return a list of results that satisfy the user's search criteria. The list is often too long for the user to go through every hit if he/she does not know exactly what he/she wants and/or does not have time to review the hits one by one. My focus is on how to find biomedical literature in the fastest way. In this dissertation, I developed a biomedical literature search system that uses a relevance feedback mechanism, fuzzy logic, text mining techniques and the Unified Medical Language System (UMLS). The system extracts and decodes information from online biomedical documents and uses the extracted information first to filter out unwanted documents and then to rank the related ones based on user preferences. I used text mining techniques to extract PDF document features and used these features to filter unwanted documents with the help of fuzzy logic. The system extracts meaning and semantic relations between texts and calculates the similarity between documents using these relations. Moreover, I developed a fuzzy literature ranking method that uses fuzzy logic, text mining techniques and the Unified Medical Language System. The ranking process is based on fuzzy logic and Unified Medical Language System knowledge resources. The fuzzy ranking method uses semantic type and meaning concepts to map the relations between texts in documents. The relevance feedback-based biomedical literature search system is evaluated using a real biomedical data set created using dobutamine (a drug name). The data set contains 1,099 original documents. To obtain coherent and reliable evaluation results, two physicians were involved in the system evaluation.
Using "30-day mortality" as a specific query, the precision of the retrieved results improves by 87.7% in three rounds, which shows the effectiveness of using relevance feedback, fuzzy logic and UMLS in the search process. Moreover, the fuzzy-based ranking method is evaluated in terms of ranking the biomedical search results. Experiments show that the fuzzy-based ranking method improves the average ranking order accuracy by 3.35% and 29.55% as compared with the UMLS meaning and semantic type methods, respectively.
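The abstract above does not spell out how the feedback rounds update the search, and its method relies on fuzzy logic and UMLS. As a rough illustration of relevance feedback in general, the sketch below uses the classic Rocchio update instead, which is a swapped-in technique and not the dissertation's method. The function name, the weights `alpha`, `beta`, `gamma`, and the toy term vectors are all assumptions made for illustration.

```python
from collections import Counter


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector (dict: term -> weight).

    query: dict of term weights for the current query.
    relevant / non_relevant: lists of term-weight dicts for documents
    the user marked as relevant or not relevant in this round.
    """
    updated = Counter()
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Move the query toward the centroid of the relevant documents.
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    # Move the query away from the centroid of the non-relevant documents.
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms whose weight is not positive are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0}
```

Each feedback round would call `rocchio` with the documents judged so far and re-run the search with the returned vector.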
Information Retrieval: Recent Advances and Beyond
In this paper, we provide a detailed overview of the models used for
information retrieval in the first and second stages of the typical processing
chain. We discuss the current state-of-the-art models, including methods based
on terms, semantic retrieval, and neural approaches. Additionally, we delve into the key
topics related to the learning process of these models. This way, this survey
offers a comprehensive understanding of the field and is of interest for
researchers and practitioners entering/working in the information retrieval
domain.
SEARCHING BASED ON QUERY DOCUMENTS
Searches can start with query documents, where search queries are formulated based on document-level descriptions. This type of search is more common in domain-specific search environments. For example, in patent retrieval, one major search task is finding relevant information for new (query) patents, and search queries are generated from the query patents. One unique characteristic of this search is that the search process can take longer and be more comprehensive, compared to general web search. As an example, to complete a single patent retrieval task, a typical user may generate 15 queries and examine more than 100 retrieved documents. In these search environments, searchers need to formulate multiple queries based on query documents that are typically complex and difficult to understand. In this work, we describe methods for automatically generating queries and diversifying search results based on query documents, which can be used for query suggestion and for improving the quality of retrieval results. In particular, we focus on resolving three main issues related to query document-based searches: (1) query generation, (2) query suggestion and formulation, and (3) search result diversification. Automatic query generation helps users by reducing the burden of formulating queries from query documents. Using generated queries as suggestions is investigated as a method of presenting alternative queries. Search result diversification is important in domain-specific search because of the nature of the query documents. Since query documents generally contain long complex descriptions, diverse query topics can be identified, and a range of relevant documents can be found that are related to these diverse topics. The proposed methods we study in this thesis explicitly address these three issues.
To solve the query generation issue, we use binary decision trees to generate effective Boolean queries and labeling propagation to formulate more effective phrasal-concept queries. In order to diversify search results, we propose two different approaches: query-side and result-level diversification. To generate diverse queries, we identify important topics from query documents and generate queries based on the identified topics. For result-level diversification, we extract query topics from query documents and apply state-of-the-art diversification algorithms based on the extracted topics. In addition, we devise query suggestion techniques for each query generation method. To demonstrate the effectiveness of our approach, we conduct experiments for various domain-specific search tasks and devise appropriate evaluation measures for domain-specific search environments.
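The abstract does not name the diversification algorithms it applies. To make result-level diversification concrete, the sketch below uses Maximal Marginal Relevance (MMR) as a stand-in, which is not necessarily the thesis's choice. The function name, the `lam` trade-off parameter, and the caller-supplied `relevance` and `similarity` inputs are assumptions for illustration.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k documents that balance relevance and novelty.

    candidates: iterable of document ids.
    relevance: dict mapping document id -> relevance score.
    similarity: function (doc_a, doc_b) -> similarity in [0, 1].
    lam: weight on relevance; (1 - lam) is the weight on redundancy.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc):
            # Redundancy is the similarity to the closest already-selected doc.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With topic-based similarity, a slightly less relevant document on a new topic can outrank a near-duplicate of one already selected, which is the behaviour diversification is after.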
Recuperação multimodal e interativa de informação orientada por diversidade (Multimodal and interactive diversity-oriented information retrieval)
Advisor: Ricardo da Silva Torres. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Information retrieval methods, especially considering multimedia data, have evolved towards the integration of multiple sources of evidence in the analysis of the relevance of items considering a given user search task.
In this context, for attenuating the semantic gap between low-level features extracted from the content of the digital objects and high-level semantic concepts (objects, categories, etc.) and making the systems adaptive to different user needs, interactive models have brought the user closer to the retrieval loop, allowing user-system interaction mainly through implicit or explicit relevance feedback. Analogously, diversity promotion has emerged as an alternative for tackling ambiguous or underspecified queries. Additionally, several works have addressed the issue of minimizing the required user effort on providing relevance assessments while keeping an acceptable overall effectiveness. This thesis discusses, proposes, and experimentally analyzes multimodal and interactive diversity-oriented information retrieval methods. This work comprehensively covers the interactive information retrieval literature and also discusses recent advances, the great research challenges, and promising research opportunities. We have proposed and evaluated two relevance-diversity trade-off enhancement workflows, which integrate multiple sources of information from images, such as visual features, textual metadata, geographic information, and user credibility descriptors. In turn, as an integration of interactive retrieval and diversity promotion techniques, for maximizing the coverage of multiple query interpretations/aspects and speeding up the information transfer between the user and the system, we have proposed and evaluated a multimodal learning-to-rank method trained with relevance feedback over diversified results. Our experimental analysis shows that the joint usage of multiple information sources positively impacted the relevance-diversity balancing algorithms. Our results also suggest that the integration of multimodal-relevance-based filtering and reranking was effective in improving result relevance and also boosted diversity promotion methods.
Beyond that, through a thorough experimental analysis, we have investigated several research questions related to the possibility of improving result diversity while keeping or even improving relevance in interactive search sessions. Moreover, we analyze how much the diversification effort affects overall search session results and how different diversification approaches behave for the different data modalities. By analyzing the overall and per-feedback-iteration effectiveness, we show that introducing diversity may harm initial results, whereas it significantly enhances the overall session effectiveness, considering not only the relevance and diversity but also how early the user is exposed to the same amount of relevant items and diversity.
Degree: Doctorate in Computer Science (Doutor em Ciência da Computação). Grant numbers: P-4388/2010, 140977/2012-0. Funding: CAPES, CNP
Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR)
systems, such as search engines, have integrated themselves into our daily
lives. These systems also serve as components of dialogue, question-answering,
and recommender systems. The trajectory of IR has evolved dynamically from its
origins in term-based methods to its integration with advanced neural models.
While the neural models excel at capturing complex contextual signals and
semantic nuances, thereby reshaping the IR landscape, they still face
challenges such as data scarcity, interpretability, and the generation of
contextually plausible yet potentially inaccurate responses. This evolution
requires a combination of both traditional methods (such as term-based sparse
retrieval methods with rapid response) and modern neural architectures (such as
language models with powerful language understanding capacity). Meanwhile, the
emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has
revolutionized natural language processing due to their remarkable language
understanding, generation, generalization, and reasoning abilities.
Consequently, recent research has sought to leverage LLMs to improve IR
systems. Given the rapid evolution of this research trajectory, it is necessary
to consolidate existing methodologies and provide nuanced insights through a
comprehensive overview. In this survey, we delve into the confluence of LLMs
and IR systems, including crucial aspects such as query rewriters, retrievers,
rerankers, and readers. Additionally, we explore promising directions within
this expanding field.
Neural Representations of Concepts and Texts for Biomedical Information Retrieval
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., "what diseases are associated with EGFR mutations?"). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities.
In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., "COVID-19 symptoms") to more complex implicitly fielded queries (e.g., "chemotherapy regimens for stage IV lung cancer patients with ALK mutations"). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR.
Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels), and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods.
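To illustrate the keyword-based matching described above, here is a minimal inverted-index sketch. It assumes whitespace tokenization and lowercase normalization, which are simplifications; the function names and toy documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it.

    docs: dict mapping document id -> text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the ids of documents containing every query token."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up one posting set per term; unknown terms yield an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A query for "cancer" finds only documents containing that exact token, so a document that says "tumor" instead is missed. This is the synonymy limitation the paragraph points out.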
This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
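The bilinear map mentioned for the third effort can be read as a score of the form q^T M a, where q and a are vector representations of the question and the candidate answer sentence and M is a learned matrix. The sketch below only computes that score for given inputs. How the dissertation encodes q and a, and how M is trained, is not stated in the abstract, so the vectors and matrix here are placeholder assumptions.

```python
def bilinear_score(q, M, a):
    """Compute the bilinear form q^T M a.

    q: question vector (list of floats, length m).
    M: matrix as a list of m rows, each of length n.
    a: answer-sentence vector (list of floats, length n).
    """
    # First compute the matrix-vector product M a, one entry per row of M.
    Ma = [sum(m_ij * a_j for m_ij, a_j in zip(row, a)) for row in M]
    # Then take the dot product of q with M a.
    return sum(q_i * v for q_i, v in zip(q, Ma))
```

With M set to the identity matrix the score reduces to a plain dot product; a learned M lets the model weight interactions between different dimensions of the two vectors.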
Dense Text Retrieval based on Pretrained Language Models: A Survey
Text retrieval is a long-standing research topic on information seeking,
where a system is required to return relevant information resources to users'
queries in natural language. From classic retrieval methods to learning-based
ranking functions, the underlying retrieval models have continually
evolved with ongoing technical innovation. To design effective
retrieval models, a key point lies in how to learn the text representation and
model the relevance matching. The recent success of pretrained language models
(PLMs) sheds light on developing more capable text retrieval approaches by
leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can
effectively learn the representations of queries and texts in the latent
representation space, and further construct the semantic matching function
between the dense vectors for relevance modeling. Such a retrieval approach is
referred to as dense retrieval, since it employs dense vectors (a.k.a.,
embeddings) to represent the texts. Considering the rapid progress on dense
retrieval, in this survey, we systematically review the recent advances on
PLM-based dense retrieval. Different from previous surveys on dense retrieval,
we take a new perspective to organize the related work by four major aspects,
including architecture, training, indexing and integration, and summarize the
mainstream techniques for each aspect. We thoroughly survey the literature, and
include 300+ related reference papers on dense retrieval. To support our
survey, we create a website for providing useful resources, and release a code
repository and toolkit for implementing dense retrieval models. This survey aims
to provide a comprehensive, practical reference focused on the major progress
for dense text retrieval.
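As a rough illustration of the dense retrieval idea summarized above, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. The embeddings are hand-written toy vectors; in practice a pretrained language model would produce them, and an approximate nearest-neighbor index would replace the brute-force scan. The function names are hypothetical and do not come from the survey's toolkit.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def dense_search(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query.

    doc_vecs: dict mapping document id -> embedding vector.
    """
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]
```

Because matching happens in the embedding space, two texts with no words in common can still score highly if their vectors point in similar directions.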