
    Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS

    Background: Finding relevant articles in PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking them according to a learned relevance function. However, learning and ranking are usually done offline, without being integrated with the keyword queries, and users have to provide a large number of training documents to reach a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad hoc keyword queries and multi-level relevance feedback in real time on PubMed. Results: RefMed supports multi-level relevance feedback by using RankSVM as the learning method, and thus achieves higher accuracy with less feedback. RefMed "tightly" integrates RankSVM into an RDBMS to support both keyword queries and multi-level relevance feedback in real time; the tight coupling of RankSVM and the DBMS substantially improves processing time. An efficient parameter selection method for RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves high learning accuracy in real time without a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions: RefMed is the first multi-level relevance feedback system for PubMed, and it achieves high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently evaluates the function to return relevant articles in real time.
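    RankSVM reduces rank learning to classification over pairwise feature differences: any two documents given different feedback levels form one training pair. Below is a minimal sketch of that reduction in Python with scikit-learn; the features, labels, and fixed C value are illustrative, not RefMed's actual in-database implementation or its validation-free parameter selection.

```python
# Minimal sketch of RankSVM-style learning from multi-level relevance
# feedback, in the spirit of RefMed (not the authors' implementation).
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    """Turn multi-level relevance labels into pairwise preference examples."""
    diffs, signs = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue  # only pairs with different relevance levels constrain the ranking
        diffs.append(X[i] - X[j])
        signs.append(1 if y[i] > y[j] else -1)
    return np.array(diffs), np.array(signs)

# Toy data: 6 documents, 4 features, relevance levels 0 (irrelevant) to 2 (highly relevant)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([2, 1, 0, 2, 0, 1])

X_pair, y_pair = pairwise_transform(X, y)
svm = LinearSVC(C=1.0).fit(X_pair, y_pair)  # C is fixed here; RefMed tunes it without validation

scores = X @ svm.coef_.ravel()              # rank documents by w·x
print(np.argsort(-scores))                  # indices from most to least relevant
```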

    PubMed and beyond: a survey of web tools for searching biomedical literature

    The past decade has witnessed modern advances in high-throughput technology and rapid growth in the research capacity for producing large-scale biological data, both concomitant with an exponential growth of the biomedical literature. This wealth of scholarly knowledge is of significant importance to researchers making scientific discoveries and to healthcare professionals managing health-related matters. However, acquiring such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service. Meanwhile, different entities have devoted themselves to developing Web tools that help users quickly and efficiently search and retrieve relevant publications. These practices, together with the maturation of text mining, have led to an increase in the number and quality of Web tools offering literature search services comparable to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and to one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools suited to their needs, and serves providers and developers in keeping current with the field.

    Prediction of Relevant Biomedical Documents: a Human Microbiome Case Study

    Background: Retrieving relevant biomedical literature has become increasingly difficult due to the large volume and rapid growth of biomedical publications. A query to a biomedical retrieval system often returns hundreds of results. Since the searcher will not likely consider all of these documents, ranking them is important. Ranking by recency, as PubMed does, takes into account only one factor indicating potential relevance. This study explores the use of the searcher's relevance feedback judgments to support relevance ranking based on features more general than recency. Results: The researcher's relevance judgments could be used to accurately predict the relevance of additional documents, both under tenfold cross-validation and by training on publications from 2008–2010 and testing on documents from 2011. Conclusions: This case study has shown the promise of relevance feedback for improving biomedical document retrieval. A researcher's judgments as to which initially retrieved documents are relevant, or not, can be leveraged to predict additional relevant documents.
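    A minimal sketch of the study's idea: train a text classifier on a searcher's relevance judgments, estimate its accuracy by cross-validation, and score further retrieved documents. The documents, labels, and choice of logistic regression over TF-IDF are placeholders rather than the paper's actual features or learner (cv is set to 2 only because the toy set is tiny; the paper uses tenfold).

```python
# Sketch: predict relevance of additional documents from a searcher's judgments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

judged_docs = ["gut microbiome diversity in healthy adults",
               "unrelated cardiac imaging study",
               "16S rRNA sequencing of stool samples",
               "cardiology outcomes after stenting"]
labels = [1, 0, 1, 0]   # 1 = judged relevant, 0 = judged not relevant

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, judged_docs, labels, cv=2).mean())  # paper: tenfold

model.fit(judged_docs, labels)
unjudged = ["metagenomic analysis of the human gut"]
print(model.predict_proba(unjudged)[:, 1])  # predicted probability of relevance
```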

    A Relevance Feedback-Based System For Quickly Narrowing Biomedical Literature Search Result

    The online literature is an important source that helps people find information. The rapid increase in online literature makes manually searching for the most relevant information very time-consuming and forces the reader to sift through many results to find the relevant ones. Existing search engines and online databases return a list of results that satisfy the user's search criteria. The list is often too long for the user to go through every hit if he/she does not know exactly what he/she wants and/or does not have time to review them one by one. My focus is on how to find biomedical literature as quickly as possible. In this dissertation, I developed a biomedical literature search system that uses a relevance feedback mechanism, fuzzy logic, text mining techniques, and the Unified Medical Language System (UMLS). The system extracts and decodes information from online biomedical documents and uses the extracted information to first filter out unwanted documents and then rank the related ones based on the user's preferences. I used text mining techniques to extract PDF document features and used these features to filter unwanted documents with the help of fuzzy logic. The system extracts meanings and semantic relations between texts and calculates the similarity between documents using these relations. Moreover, I developed a fuzzy literature ranking method that uses fuzzy logic, text mining techniques, and UMLS. The ranking process is based on fuzzy logic and UMLS knowledge resources; it uses semantic-type and meaning concepts to map the relations between texts in documents. The relevance feedback-based biomedical literature search system is evaluated using a real biomedical data set built around the drug dobutamine. The data set contains 1,099 original documents. To obtain coherent and reliable evaluation results, two physicians were involved in the system evaluation. Using "30-day mortality" as a specific query, the precision of the retrieved results improves by 87.7% over three feedback rounds, which shows the effectiveness of using relevance feedback, fuzzy logic, and UMLS in the search process. Moreover, the fuzzy-based ranking method is evaluated in terms of ranking the biomedical search results. Experiments show that it improves the average ranking-order accuracy by 3.35% and 29.55% compared with the UMLS meaning and semantic-type methods, respectively.
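    The core fuzzy step can be pictured as mapping each similarity feature to a membership degree and combining degrees with a fuzzy AND (min) and OR (max). The sketch below illustrates that pattern; the membership shapes, weights, and two-feature rule are assumptions, not the dissertation's actual rule base or its UMLS concept mapping.

```python
# Illustrative fuzzy-logic ranking: fuzzify two similarity features and
# aggregate them into a single relevance score per document.
def triangular(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_rank_score(concept_sim, semantic_type_sim):
    # Degree to which each similarity is "high" (assumed membership shapes)
    concept_high = triangular(concept_sim, 0.3, 1.0, 1.7)
    semtype_high = triangular(semantic_type_sim, 0.3, 1.0, 1.7)
    # Fuzzy AND (min): relevant if both similarities are high
    and_score = min(concept_high, semtype_high)
    # Fuzzy OR (max): one strong signal can still promote a document
    or_score = max(concept_high, semtype_high)
    return 0.7 * and_score + 0.3 * or_score   # assumed blend weights

docs = {"doc1": (0.9, 0.8), "doc2": (0.4, 0.9), "doc3": (0.2, 0.1)}
ranked = sorted(docs, key=lambda d: fuzzy_rank_score(*docs[d]), reverse=True)
print(ranked)
```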

    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs, to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. (Comment: SIGIR 2023 resource paper, 13 pages)
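    As a toy illustration of the kind of analysis such a log enables, the sketch below counts queries per search provider from a JSONL file. The record layout (a "provider" field per line) is hypothetical; the actual AQL schema should be taken from its documentation.

```python
# Count queries per search provider in a (hypothetical) JSONL query log.
import json
from collections import Counter

def queries_per_provider(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)         # assumed one JSON record per line
            counts[record["provider"]] += 1   # assumed field name
    return counts

# Example usage:
# counts = queries_per_provider("aql_sample.jsonl")
# print(counts.most_common(10))
```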

    Multimodal Information Retrieval in Medical Imaging Repositories

    The proliferation of digital medical imaging modalities in hospitals and other diagnostic facilities has created huge repositories of valuable data, often not fully explored. Moreover, the past few years show a growing trend of data production. As such, studying new ways to index, process, and retrieve medical images becomes an important subject to be addressed by the wider community of radiologists, scientists, and engineers. Content-based image retrieval, which encompasses various methods, can exploit the visual information of a medical imaging archive and is known to be beneficial to practitioners and researchers. However, the integration of the latest systems for medical image retrieval into clinical workflows is still rare, and their effectiveness still shows room for improvement. This thesis proposes solutions and methods for multimodal information retrieval in the context of medical imaging repositories. The major contributions are a search engine for medical imaging studies supporting multimodal queries in an extensible archive; a framework for automated labeling of medical images for content discovery; and an assessment and proposal of feature learning techniques for concept detection from medical images, exhibiting greater potential than the feature extraction algorithms pertinently used in similar tasks. These contributions, each in their own dimension, seek to narrow the scientific and technical gap towards the development and adoption of novel multimodal medical image retrieval systems, so that they ultimately become part of the workflows of medical practitioners, teachers, and researchers in healthcare.
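    Query-by-example retrieval over an imaging archive reduces, at its core, to nearest-neighbor search over fixed-size feature vectors. The sketch below shows that skeleton, with random vectors standing in for learned representations; it illustrates the general technique, not the thesis's search engine or its feature-learning models.

```python
# Content-based retrieval skeleton: index image feature vectors, answer
# query-by-example with nearest neighbors under cosine distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
archive_features = rng.normal(size=(1000, 128))   # 1000 images, 128-d features
study_ids = [f"study-{i:04d}" for i in range(1000)]

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(archive_features)

query_feature = rng.normal(size=(1, 128))          # features of the query image
distances, neighbors = index.kneighbors(query_feature)
for dist, idx in zip(distances[0], neighbors[0]):
    print(f"{study_ids[idx]}  cosine distance {dist:.3f}")
```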

    Predicting Rules for Cancer Subtype Classification using Grammar-Based Genetic Programming on various Genomic Data Types

    With the advent of high-throughput methods, more genomic data than ever has been generated during the past decade. As these technologies remain cost-intensive and are not worthwhile for every research group, databases such as TCGA and Firebrowse emerged. While these databases enable fast and free access to massive amounts of genomic data, they also pose new challenges to the research community. This study investigates methods to obtain, normalize, and process genomic data for computer-aided decision making in the field of cancer subtype discovery. A new software package, termed FirebrowseR, is introduced, allowing the direct download of genomic data sets into the R programming environment. To pre-process the obtained data, a set of methods is introduced, enabling data-type-specific normalization. As a proof of principle, the Web-TCGA software is created, enabling fast data analysis. To explore cancer subtypes, a statistical model, the EDL, is introduced. The newly developed method is designed to provide highly precise, yet interpretable, models. The EDL is tested on well-established data sets, and its performance is compared to state-of-the-art machine learning algorithms. As a proof of principle, the EDL was run on a cohort of 1,000 breast cancer patients, where it reliably re-identified the known subtypes and automatically selected the corresponding marker genes by which the subtypes are defined. In addition, novel patterns of alterations in well-known marker genes could be identified that distinguish primary and mCRPC samples. The findings suggest that mCRPC is characterized by a unique amplification of the Androgen Receptor, while a significant fraction of primary samples is described by a loss of heterozygosity in TP53 and NCOR1.
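    To make the notion of an interpretable, rule-producing model concrete, the sketch below trains a shallow decision tree on binary alteration calls and prints its rules. The tree is only a stand-in for the grammar-based EDL, and the gene features and labels are illustrative, not the study's cohort data.

```python
# Interpretable rule model sketch: a shallow decision tree over binary
# alteration calls, printed as human-readable rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

genes = ["AR_amplification", "TP53_loh", "NCOR1_loh"]
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 3)).astype(float)   # binary alteration calls
# Toy label echoing the paper's finding: AR amplification marks mCRPC
y = (X[:, 0] == 1).astype(int)                         # 1 = mCRPC, 0 = primary

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=genes))          # the learned rules
```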

    Query Based Sampling and Multi-Layered Semantic Analysis to find Robust Network of Drug-Disease Associations

    This thesis presents the design and implementation of a system, called DDNet, that discovers semantically related networks of drug-disease associations from the medical literature. A fully functional DDNet can be transformative in the identification of drug targets and may open new avenues for drug repositioning in clinical and translational research. In particular, a Local Latent Semantic Analysis (LLSA) is introduced to implement a system that is efficient, scalable, and relatively free from systemic bias. In addition, query-based sampling is introduced to find representative samples from the ocean of data, building a model that is relatively free from the garbage-in, garbage-out syndrome. The concept of ontology mapping is adopted to determine the relevant results, and reverse ontology mapping is used to create a network of associations. A web service application was also developed to query the system and visualize the computed network of associations in a form that is easy to interact with. A pilot study was conducted to evaluate the performance of the system using both subjective and objective measures. PharmGKB was used as the gold standard, and a precision-recall (PR) curve was obtained from a large number of queries at different recall points. Empirical analyses suggest that DDNet is robust, relatively stable, and scalable compared with a traditional global LSA model.
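    The standard LSA recipe, and hence a plausible skeleton for the Local LSA step, is a truncated SVD over a TF-IDF matrix built from only the query-sampled documents. The sketch below shows that pipeline; the sampled abstracts, component count, and similarity readout are placeholders, not DDNet's actual configuration.

```python
# Local LSA sketch: latent semantic analysis over a query-sampled subset
# of the literature rather than the whole corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sampled_abstracts = [
    "metformin reduces hepatic glucose production in type 2 diabetes",
    "statin therapy and cardiovascular disease risk",
    "metformin repurposing in polycystic ovary syndrome",
    "aspirin for secondary prevention of myocardial infarction",
]

tfidf = TfidfVectorizer().fit_transform(sampled_abstracts)
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Documents close in the latent space suggest candidate drug-disease links
print(cosine_similarity(latent[0:1], latent).round(2))
```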

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains. (Comment: Knowledge-Based Systems)
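    A bare-bones instance of the Enterprise-level extraction pattern the survey classifies: fetch a page and pull structured fields out of its markup. The URL and CSS selectors below are placeholders for whatever schema a real target site uses.

```python
# Minimal web data extraction: fetch a page and extract structured records.
import requests
from bs4 import BeautifulSoup

def extract_products(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.product"):        # selector is site-specific
        name = item.select_one("h2.name")
        price = item.select_one("span.price")
        if name and price:
            rows.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})
    return rows

# Example usage:
# print(extract_products("https://example.com/catalog"))
```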