55 research outputs found
MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Information retrieval (IR) is essential in biomedical knowledge acquisition
and clinical decision support. While recent progress has shown that language
model encoders perform better semantic retrieval, training such models requires
abundant query-article annotations that are difficult to obtain in biomedicine.
As a result, most biomedical IR systems only conduct lexical matching. In
response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained
Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we
collected an unprecedented scale of 255 million user click logs from PubMed.
With such data, we use contrastive learning to train a pair of
closely-integrated retriever and re-ranker. Experimental results show that
MedCPT sets new state-of-the-art performance on six biomedical IR tasks,
outperforming various baselines including much larger models such as
GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical
article and sentence representations for semantic evaluations. As such, MedCPT
can be readily applied to various real-world biomedical IR tasks.
Comment: The MedCPT code and API are available at
https://github.com/ncbi/MedCP
Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health
ChatGPT has drawn considerable attention from both the general public and
domain experts with its remarkable text generation capabilities. This has
subsequently led to the emergence of diverse applications in the field of
biomedicine and health. In this work, we examine the diverse applications of
large language models (LLMs), such as ChatGPT, in biomedicine and health.
Specifically, we explore the areas of biomedical information retrieval, question
answering, medical text summarization, information extraction, and medical
education, and investigate whether LLMs possess the transformative power to
revolutionize these tasks or whether the distinct complexities of the biomedical
domain present unique challenges. Following an extensive literature survey, we
find that significant advances have been made in the field of text generation
tasks, surpassing the previous state-of-the-art methods. For other
applications, the advances have been modest. Overall, LLMs have not yet
revolutionized biomedicine, but recent rapid progress indicates that such
methods hold great potential to provide valuable means for accelerating
discovery and improving health. We also find that the use of LLMs, like
ChatGPT, in the fields of biomedicine and health entails various risks and
challenges, including fabricated information in their generated responses, as
well as legal and privacy concerns associated with sensitive patient data. We
believe this first-of-its-kind survey can provide a comprehensive overview to
biomedical researchers and healthcare practitioners on the opportunities and
challenges associated with using ChatGPT and other LLMs for transforming
biomedicine and health.
Abbreviation definition identification based on automatic precision estimates
Background: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation.
Results: On the Medstract corpus our algorithm produced 97% precision and 85% recall which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well known Schwartz and Hearst algorithm.
Conclusion: We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.
BioC: a minimalist approach to interoperability for biomedical text processing
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net
Differential Effects of MYH9 and APOL1 Risk Variants on FRMD3 Association with Diabetic ESRD in African Americans
Single nucleotide polymorphisms (SNPs) in MYH9 and APOL1 on chromosome 22 (c22) are powerfully associated with non-diabetic end-stage renal disease (ESRD) in African Americans (AAs). Many AAs diagnosed with type 2 diabetic nephropathy (T2DN) have non-diabetic kidney disease, potentially masking detection of DN genes. Therefore, genome-wide association analyses were performed using the Affymetrix SNP Array 6.0 in 966 AA with T2DN and 1,032 non-diabetic, non-nephropathy (NDNN) controls, with and without adjustment for c22 nephropathy risk variants. No associations were seen between FRMD3 SNPs and T2DN before adjusting for c22 variants. However, logistic regression analysis revealed seven FRMD3 SNPs significantly interacting with MYH9, a finding replicated in 640 additional AA T2DN cases and 683 NDNN controls. Contrasting all 1,592 T2DN cases with all 1,671 NDNN controls, FRMD3 SNPs appeared to interact with the MYH9 E1 haplotype (e.g., rs942280 interaction p-value = 9.3E−7 additive; odds ratio [OR] 0.67). FRMD3 alleles were associated with increased risk of T2DN only in subjects lacking two MYH9 E1 risk haplotypes (rs942280 OR = 1.28), not in MYH9 E1 risk allele homozygotes (rs942280 OR = 0.80; homogeneity p-value = 4.3E−4). Effects were weaker stratifying on APOL1. FRMD3 SNPs were associated with T2DN, not type 2 diabetes per se, comparing AAs with T2DN to those with diabetes lacking nephropathy. T2DN-associated FRMD3 SNPs were detectable in AAs only after accounting for MYH9, with differential effects for APOL1. These analyses reveal a role for FRMD3 in AA T2DN susceptibility and accounting for c22 nephropathy risk variants can assist in detecting DN susceptibility genes.
Genome-Wide Association and Trans-ethnic Meta-Analysis for Advanced Diabetic Kidney Disease: Family Investigation of Nephropathy and Diabetes (FIND)
Diabetic kidney disease (DKD) is the most common etiology of chronic kidney disease (CKD) in the industrialized world and accounts for much of the excess mortality in patients with diabetes mellitus. Approximately 45% of U.S. patients with incident end-stage kidney disease (ESKD) have DKD. Independent of glycemic control, DKD aggregates in families and has higher incidence rates in African, Mexican, and American Indian ancestral groups relative to European populations. The Family Investigation of Nephropathy and Diabetes (FIND) performed a genome-wide association study (GWAS) contrasting 6,197 unrelated individuals with advanced DKD with healthy and diabetic individuals lacking nephropathy of European American, African American, Mexican American, or American Indian ancestry. A large-scale replication and trans-ethnic meta-analysis included 7,539 additional European American, African American and American Indian DKD cases and non-nephropathy controls. Within-ethnic-group meta-analysis of discovery GWAS and replication set results identified genome-wide significant evidence for association between DKD and rs12523822 on chromosome 6q25.2 in American Indians (P = 5.74 × 10⁻⁹). The strongest signal of association in the trans-ethnic meta-analysis was with a SNP in strong linkage disequilibrium with rs12523822 (rs955333; P = 1.31 × 10⁻⁸), with directionally consistent results across ethnic groups. These 6q25.2 SNPs are located between the SCAF8 and CNKSR3 genes, a region with DKD relevant changes in gene expression and an eQTL with IPCEF1, a gene co-translated with CNKSR3. Several other SNPs demonstrated suggestive evidence of association with DKD, within and across populations. These data identify a novel DKD susceptibility locus with consistent directions of effect across diverse ancestral groups and provide insight into the genetic architecture of DKD.
Type 2 Diabetes Variants Disrupt Function of SLC16A11 through Two Distinct Mechanisms
Type 2 diabetes (T2D) affects Latinos at twice the rate seen in populations of European descent. We recently identified a risk haplotype spanning SLC16A11 that explains ∼20% of the increased T2D prevalence in Mexico. Here, through genetic fine-mapping, we define a set of tightly linked variants likely to contain the causal allele(s). We show that variants on the T2D-associated haplotype have two distinct effects: (1) decreasing SLC16A11 expression in liver and (2) disrupting a key interaction with basigin, thereby reducing cell-surface localization. Both independent mechanisms reduce SLC16A11 function and suggest SLC16A11 is the causal gene at this locus. To gain insight into how SLC16A11 disruption impacts T2D risk, we demonstrate that SLC16A11 is a proton-coupled monocarboxylate transporter and that genetic perturbation of SLC16A11 induces changes in fatty acid and lipid metabolism that are associated with increased T2D risk. Our findings suggest that increasing SLC16A11 function could be therapeutically beneficial for T2D.
Keywords: type 2 diabetes (T2D); genetics; disease mechanism; SLC16A11; MCT11; solute carrier (SLC); monocarboxylates; fatty acid metabolism; lipid metabolism; precision medicine