Search CORE

Public Library of Science (PLOS)

Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?

Author: A Névéol
A Rzhetsky
Andrey Rzhetsky
BM Fonseca
C Jacquemin
C Manning
CD Manning
D Beeferman
D Rebholz-Schuhmann
D Shotton
D Shotton
D Srikrishna
D Trieschnigg
Devabhaktuni Srikrishna
DR Hunter
GF Cooper
J Evans
J Lin
JPA Ionnidis
M Muin
M Weeber
Marc A. Coram
MH MacRoberts
MJ Schuemie
N Tran
O Bodenreider
P Srinivasan
PL Elkin
Q He
Q Li
R Islamaj Dogan
R Schifanella
RA DiGiacomo
S Bird
T Rindflesch
T Wachter
V Sintchenko
W Kim
Y Huang
Z Lu
Z Sun
Publication venue: Public Library of Science
Publication date: 14/09/2011
Field of study

Author-supplied citations are a fraction of the related literature for a paper. The “related citations” on PubMed is typically dozens or hundreds of results long, and does not offer hints why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to more transparently navigate to PubMed updates through search terms that can associate a paper with its citations. The algorithm to generate these search terms involved automatically extracting noun phrases from the paper using natural language processing tools, and ranking them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by same authors as the paper itself, we define it as CV-S and different authors is defined as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D versus 65% for the top 20 PubMed “related citations.” We hypothesize these quantities computed for the 20 million papers on PubMed to differ within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, and 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) are on the order of ten per paper – many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper

Directory of Open Access Journals

Literature Based Discovery (LBD): Towards Hypothesis Generation and Knowledge Discovery in Biomedical Text Mining

Author: Bhasuran Balu
Murugesan Gurusamy
Natarajan Jeyakumar
Publication venue
Publication date: 03/10/2023
Field of study

Biomedical knowledge is growing in an astounding pace with a majority of this knowledge is represented as scientific publications. Text mining tools and methods represents automatic approaches for extracting hidden patterns and trends from this semi structured and unstructured data. In Biomedical Text mining, Literature Based Discovery (LBD) is the process of automatically discovering novel associations between medical terms otherwise mentioned in disjoint literature sets. LBD approaches proven to be successfully reducing the discovery time of potential associations that are hidden in the vast amount of scientific literature. The process focuses on creating concept profiles for medical terms such as a disease or symptom and connecting it with a drug and treatment based on the statistical significance of the shared profiles. This knowledge discovery approach introduced in 1989 still remains as a core task in text mining. Currently the ABC principle based two approaches namely open discovery and closed discovery are mostly explored in LBD process. This review starts with general introduction about text mining followed by biomedical text mining and introduces various literature resources such as MEDLINE, UMLS, MESH, and SemMedDB. This is followed by brief introduction of the core ABC principle and its associated two approaches open discovery and closed discovery in LBD process. This review also discusses the deep learning applications in LBD by reviewing the role of transformer models and neural networks based LBD models and its future aspects. Finally, reviews the key biomedical discoveries generated through LBD approaches in biomedicine and conclude with the current limitations and future directions of LBD.Comment: 43 Pages, 5 Figures, 4 Table

iASiS Open Data Graph: Automated Semantic Integration of Disease-Specific Knowledge

Author: Bougiatiotis Konstantinos
Krithara Anastasia
Nentidis Anastasios
Paliouras Georgios
Publication venue
Publication date: 02/06/2020
Field of study

In biomedical research, unified access to up-to-date domain-specific knowledge is crucial, as such knowledge is continuously accumulated in scientific literature and structured resources. Identifying and extracting specific information is a challenging task and computational analysis of knowledge bases can be valuable in this direction. However, for disease-specific analyses researchers often need to compile their own datasets, integrating knowledge from different resources, or reuse existing datasets, that can be out-of-date. In this study, we propose a framework to automatically retrieve and integrate disease-specific knowledge into an up-to-date semantic graph, the iASiS Open Data Graph. This disease-specific semantic graph provides access to knowledge relevant to specific concepts and their individual aspects, in the form of concept relations and attributes. The proposed approach is implemented as an open-source framework and applied to three diseases (Lung Cancer, Dementia, and Duchenne Muscular Dystrophy). Exemplary queries are presented, investigating the potential of this automatically generated semantic graph as a basis for retrieval and analysis of disease-specific knowledge.Comment: 6 pages, 2 figures, accepted in IEEE 33rd International Symposium on Computer Based Medical Systems (CBMS2020

MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Author: Chen Qingyu
Comeau Donald C.
Jin Qiao
Kim Won
Lu Zhiyong
Wilbur W. John
Yeganova Lana
Publication venue
Publication date: 03/10/2023
Field of study

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.Comment: The MedCPT code and API are available at https://github.com/ncbi/MedCP

Factors affecting the effectiveness of biomedical document indexing and retrieval based on terminologies

Author: Boubekeur Fatiha
Dinh Duy
Tamine Lynda
Publication venue: 'Elsevier BV'
Publication date: 01/02/2013
Field of study

International audienceThe aim of this work is to evaluate a set of indexing and retrieval strategies based on the integration of several biomedical terminologies on the available TREC Genomics collections for an ad hoc information retrieval (IR) task.Materials and methodsWe propose a multi-terminology based concept extraction approach to selecting best concepts from free text by means of voting techniques. We instantiate this general approach on four terminologies (MeSH, SNOMED, ICD-10 and GO). We particularly focus on the effect of integrating terminologies into a biomedical IR process, and the utility of using voting techniques for combining the extracted concepts from each document in order to provide a list of unique concepts.ResultsExperimental studies conducted on the TREC Genomics collections show that our multi-terminology IR approach based on voting techniques are statistically significant compared to the baseline. For example, tested on the 2005 TREC Genomics collection, our multi-terminology based IR approach provides an improvement rate of +6.98% in terms of MAP (mean average precision) (p < 0.05) compared to the baseline. In addition, our experimental results show that document expansion using preferred terms in combination with query expansion using terms from top ranked expanded documents improve the biomedical IR effectiveness.ConclusionWe have evaluated several voting models for combining concepts issued from multiple terminologies. Through this study, we presented many factors affecting the effectiveness of biomedical IR system including term weighting, query expansion, and document expansion models. The appropriate combination of those factors could be useful to improve the IR performance

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

BioNLP Shared Task - The Bacteria Track

Author: Alphonse Erick
Bessières Philippe
Bossy Robert
Jourde Julien
Manine Alain-Pierre
Nédellec Claire
van de Guchte Maarten
Veber Philippe
Publication venue: BioMed Central
Publication date: 01/01/2012
Field of study

Background: We present the BioNLP 2011 Shared Task Bacteria Track, the first Information Extraction challenge entirely dedicated to bacteria. It includes three tasks that cover different levels of biological knowledge. The Bacteria Gene Renaming supporting task is aimed at extracting gene renaming and gene name synonymy in PubMed abstracts. The Bacteria Gene Interaction is a gene/protein interaction extraction task from individual sentences. The interactions have been categorized into ten different sub-types, thus giving a detailed account of genetic regulations at the molecular level. Finally, the Bacteria Biotopes task focuses on the localization and environment of bacteria mentioned in textbook articles. We describe the process of creation for the three corpora, including document acquisition and manual annotation, as well as the metrics used to evaluate the participants' submissions. Results: Three teams submitted to the Bacteria Gene Renaming task; the best team achieved an F-score of 87%. For the Bacteria Gene Interaction task, the only participant's score had reached a global F-score of 77%, although the system efficiency varies significantly from one sub-type to another. Three teams submitted to the Bacteria Biotopes task with very different approaches; the best team achieved an F-score of 45%. However, the detailed study of the participating systems efficiency reveals the strengths and weaknesses of each participating system. Conclusions: The three tasks of the Bacteria Track offer participants a chance to address a wide range of issues in Information Extraction, including entity recognition, semantic typing and coreference resolution. We found commond trends in the most efficient systems: the systematic use of syntactic dependencies and machine learning. Nevertheless, the originality of the Bacteria Biotopes task encouraged the use of interesting novel methods and techniques, such as term compositionality, scopes wider than the sentence

Springer - Publisher Connector

ProdInra

Micropublications: a Semantic Model for Claims, Evidence, Arguments and Annotations in Biomedical Communications

Author: Ciccarese Paolo N.
Clark Tim
Goble Carole A.
Publication venue
Publication date: 01/01/2014
Field of study

The Micropublications semantic model for scientific claims, evidence, argumentation and annotation in biomedical publications, is a metadata model of scientific argumentation, designed to support several key requirements for exchange and value-addition of semantic metadata across the biomedical publications ecosystem. Micropublications allow formalizing the argument structure of scientific publications so that (a) their internal structure is semantically clear and computable; (b) citation networks can be easily constructed across large corpora; (c) statements can be formalized in multiple useful abstraction models; (d) statements in one work may cite statements in another, individually; (e) support, similarity and challenge of assertions can be modelled across corpora; (f) scientific assertions, particularly in review articles, may be transitively closed to supporting evidence and methods. The model supports natural language statements; data; methods and materials specifications; discussion and commentary; as well as challenge and disagreement. A detailed analysis of nine use cases is provided, along with an implementation in OWL 2 and SWRL, with several example instantiations in RDF.Comment: Version 4. Minor revision

Springer - Publisher Connector

Harvard University - DASH