MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Author: Chen Qingyu
Comeau Donald C.
Jin Qiao
Kim Won
Lu Zhiyong
Wilbur W. John
Yeganova Lana
Publication date: 03/10/2023

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks. Comment: The MedCPT code and API are available at https://github.com/ncbi/MedCP

arXiv.org e-Print Archive

Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health

Author: Chen Qingyu
Chen Xiuying
Comeau Donald C.
Gao Xin
Islamaj Rezarta
Jin Qiao
Kapoor Aadit
Kim Won
Lai Po-Ting
Lu Zhiyong
Tian Shubo
Yang Yifan
Yeganova Lana
Zhu Qingqing
Publication date: 15/06/2023

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in their generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this first-of-its-kind survey can provide a comprehensive overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

arXiv.org e-Print Archive

Machine learning with naturally labeled data for identifying abbreviation definitions

Author: A Schwartz
C Kuo
D Nadeau
Donald C Comeau
H Liu
H Yu
J Pustejovsky
L Smith
Lana Yeganova
N Okazaki
R Islamaj
S Sohn
T Zhang
W John Wilbur
W Zhou
Y Park
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Abbreviation definition identification based on automatic precision estimates

Author: A Aronson
A Schwartz
C Fauquet
C Federiuk
C Friedman
Donald C Comeau
H Liu
H Yu
J Pustejovsky
JT Chang
K Fukuda
L Smith
M Yoshida
Sunghwan Sohn
T Cheng
W John Wilbur
W Zhou
Won Kim
Y Park
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. Results: On the Medstract corpus our algorithm produced 97% precision and 85% recall, which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well-known Schwartz and Hearst algorithm. Conclusion: We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

BioC: a minimalist approach to interoperability for biomedical text processing

Author: Ciccarese Paolo
Cohen Kevin Bretonnel
Comeau Donald C.
Islamaj Doğan Rezarta
Krallinger Martin
Leitner Florian
Lu Zhiyong
Peng Yifan
Rinaldi Fabio
Torii Manabu
Valencia Alfonso
Verspoor Karin
Wiegers Thomas C.
Wilbur W. John
Wu Cathy H.
Publication venue: Oxford University Press (OUP)
Publication date: 11/03/2014

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net

Harvard University - DASH

Differential Effects of MYH9 and APOL1 Risk Variants on FRMD3 Association with Diabetic ESRD in African Americans

Author: A Kottgen
AH Chishti
AJ Baines
AM Reeves-Daniel
Barry I. Freedman
BI Freedman
C Pattaro
Caitrin W. McDonough
Carl D. Langefeld
Cheryl A. Winkler
CW McDonough
Donald W. Bowden
G Genovese
George W. Nelson
GW Nelson
H Tang
I Gopalakrishnan
J Divers
Jasmin Divers
JB Kopp
Jeffrey B. Kopp
Jessica N. Cooke
KB Hoover
Lingyi Lu
M Murea
M Nunez
M Ramez
Mark I. McCarthy
Mary E. Comeau
MC Frame
Meredith A. Bostrom
MG Pezzolesi
Nicholette D. Palmer
Pamela J. Hicks
Randall C. Johnson
S Maeda
S Tzur
WH Kao
WW Piegorsch
X Ni
Publication venue: Public Library of Science
Publication date: 01/06/2011

Single nucleotide polymorphisms (SNPs) in MYH9 and APOL1 on chromosome 22 (c22) are powerfully associated with non-diabetic end-stage renal disease (ESRD) in African Americans (AAs). Many AAs diagnosed with type 2 diabetic nephropathy (T2DN) have non-diabetic kidney disease, potentially masking detection of DN genes. Therefore, genome-wide association analyses were performed using the Affymetrix SNP Array 6.0 in 966 AAs with T2DN and 1,032 non-diabetic, non-nephropathy (NDNN) controls, with and without adjustment for c22 nephropathy risk variants. No associations were seen between FRMD3 SNPs and T2DN before adjusting for c22 variants. However, logistic regression analysis revealed seven FRMD3 SNPs significantly interacting with MYH9, a finding replicated in 640 additional AA T2DN cases and 683 NDNN controls. Contrasting all 1,592 T2DN cases with all 1,671 NDNN controls, FRMD3 SNPs appeared to interact with the MYH9 E1 haplotype (e.g., rs942280 interaction p-value = 9.3E−7 additive; odds ratio [OR] 0.67). FRMD3 alleles were associated with increased risk of T2DN only in subjects lacking two MYH9 E1 risk haplotypes (rs942280 OR = 1.28), not in MYH9 E1 risk allele homozygotes (rs942280 OR = 0.80; homogeneity p-value = 4.3E−4). Effects were weaker stratifying on APOL1. FRMD3 SNPs were associated with T2DN, not type 2 diabetes per se, comparing AAs with T2DN to those with diabetes lacking nephropathy. T2DN-associated FRMD3 SNPs were detectable in AAs only after accounting for MYH9, with differential effects for APOL1. These analyses reveal a role for FRMD3 in AA T2DN susceptibility, and accounting for c22 nephropathy risk variants can assist in detecting DN susceptibility genes.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Genome-Wide Association and Trans-ethnic Meta-Analysis for Advanced Diabetic Kidney Disease: Family Investigation of Nephropathy and Diabetes (FIND)

Author: Abboud Hanna E.
Adler Sharon G.
Best Lyle G.
Bowden Donald W.
Burlock Allison
Chen Yii-Der Ida
Cole Shelley A.
Comeau Mary E.
Curtis Jeffrey M.
Divers Jasmin
Drechsler Christiane
Duggirala Ravi
Elston Robert C.
Family Investigation of Nephropathy and Diabetes (FIND) Research Group
Freedman Barry I.
Guo Xiuqing
Hanson Robert L.
Hoffmann Michael M.
Howard Barbara V.
Huang Huateng
Igo Robert P., Jr.
Ipp Eli
Iyengar Sudha K.
Kao W. H. Linda
Keller Benjamin J.
Kimmel Paul L.
Klag Michael J.
Knowler William C.
Kohn Orly F.
Kretzler Matthias
Langefeld Carl D.
Leak Tennille S.
Leehey David J.
Li Man
Malhotra Alka
Marz Winfried
Nair Viji
Nelson Robert G.
Nicholas Susanne B.
O'Brien Stephen J.
Pahl Madeleine V.
Parekh Rulan S.
Pezzolesi Marcus G.
Rasooly Rebekah S.
Rotimi Charles N.
Rotter Jerome I.
Schelling Jeffrey R.
Sedor John R.
Seldin Michael F.
Shah Vallabh O.
Smiles Adam M.
Smith Michael W.
Taylor Kent D.
Thameem Farook
Thornley-Brown Denyse P.
Truitt Barbara J.
Wanner Christoph
Weil E. Jennifer
Winkler Cheryl
Zager Philip G.
Publication venue: NSUWorks
Publication date: 01/01/2015

Diabetic kidney disease (DKD) is the most common etiology of chronic kidney disease (CKD) in the industrialized world and accounts for much of the excess mortality in patients with diabetes mellitus. Approximately 45% of U.S. patients with incident end-stage kidney disease (ESKD) have DKD. Independent of glycemic control, DKD aggregates in families and has higher incidence rates in African, Mexican, and American Indian ancestral groups relative to European populations. The Family Investigation of Nephropathy and Diabetes (FIND) performed a genome-wide association study (GWAS) contrasting 6,197 unrelated individuals with advanced DKD with healthy and diabetic individuals lacking nephropathy of European American, African American, Mexican American, or American Indian ancestry. A large-scale replication and trans-ethnic meta-analysis included 7,539 additional European American, African American and American Indian DKD cases and non-nephropathy controls. Within ethnic group meta-analysis of discovery GWAS and replication set results identified genome-wide significant evidence for association between DKD and rs12523822 on chromosome 6q25.2 in American Indians (P = 5.74 × 10⁻⁹). The strongest signal of association in the trans-ethnic meta-analysis was with a SNP in strong linkage disequilibrium with rs12523822 (rs955333; P = 1.31 × 10⁻⁸), with directionally consistent results across ethnic groups. These 6q25.2 SNPs are located between the SCAF8 and CNKSR3 genes, a region with DKD relevant changes in gene expression and an eQTL with IPCEF1, a gene co-translated with CNKSR3. Several other SNPs demonstrated suggestive evidence of association with DKD, within and across populations. These data identify a novel DKD susceptibility locus with consistent directions of effect across diverse ancestral groups and provide insight into the genetic architecture of DKD.

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Online-Publikations-Server der Universität Würzburg

NSU Works

FigShare

Type 2 Diabetes Variants Disrupt Function of SLC16A11 through Two Distinct Mechanisms

Author: Adeyemo Adebowale
Aguilar-Salinas Carlos A.
Almeda-Valdés Paloma
Altshuler David
Altshuler David M
Alvirde Ulices
An Ping
Arellano-Campos Olimpia
Armstrong Loren L.
Barajas-Olmos Francisco Martin
Becker Diane M.
Bielak Lawrence F.
Bielinski Suzette J.
Blot William J.
Boerwinkle Eric
Borecki Ingrid B.
Bottinger Erwin P.
Bowden Donald W.
Burtt Noël P.
Cai Qiuyin
Carr Steven A.
Caulkins Lizz
Centeno-Cruz Federico
Chen Brian H.
Chen Guanjie
Chen Wei-Min
Chen Y-D Ida
Clish Clary B.
Comeau Mary E.
Contreras-Cubas Cecilia
Correa Adolfo
Cortes Maria L.
Couper David
Crawford Dana C.
Cruz-Bautista Ivette
Cummings Steven R.
Córdova Emilio
Deik Amy A.
Dennis Courtney
DeRan Michael
Doumatey Ayo
Evans Daniel S.
Evans Michele K.
Flannick Jason
Florez Jose C.
Fontanillas Pierre
Fornage Myriam
Freedman Barry I.
García-Ortiz Humberto
González-Villalpando Clicerio
González-Villalpando María Elena
Goodarzi Mark O.
Gottesman Omri
Grundberg Elin
Guo Xiuqing
Guzman Gaelen
Gymrek Melissa
Gómez Donají
Haiman Christopher A.
Hartigan Christina R.
Hayes M. Geoffrey
Hoch Eitan
Hsueh Wen-Chi
Huerta-Chagoya Alicia
Igo Robert P.
Islas-Andrade Sergio
Iyengar Sudha K.
Jacobs Suzanne B.R.
Jensen Richard A.
Kabagambe Edmond K.
Kao W.H. Linda
Keene Keith L.
Kolonel Laurence
Kraja Aldi
Lander Eric Steven
Langefeld Carl D.
Le Marchand Loic
Li Jiang
Liu Jiankang
Liu Simin
Liu Yongmei
Long Jirong
Loos Ruth J.F.
Lowe William L.
Lu Yingchang
Manning Alisa
Martínez-Hernández Angélica
Mathias Rasika A.
McKnight Barbara
Mendoza-Caamal Elvia
Mercader Josep M.
Monroe Kristine
Moreno-Macías Hortensia
Mudgal Poorva
Muñoz-Hernandez Linda Liliana
Mychaleckyj Josyf C.
Nalls Michael A.
Nayak Uma
Ng Maggie C.Y.
Ordóñez-Sánchez Maria L.
Orozco Lorena
Pacheco Jennifer A.
Palmer Nicholette D.
Pankow James S.
Patel Sanjay R.
Patterson Nick
Peyser Patricia A.
Pierce Kerry A.
Province Michael A.
Psaty Bruce M.
Raffel Leslie J.
Rasmussen-Torvik Laura J.
Revilla-Monsalve Cristina
Rice Kenneth
Rich Stephen S.
Rodríguez-Guillén Rosario
Rodríguez-Torres Maribel
Rotimi Charles N.
Rotter Jerome I.
Rusu Victor
Sale Michèle M.
Schenone Monica
Schreiber Stuart L.
Sedor John R.
Segura-Kato Yayoi
Shriner Daniel
Shu Xiao-Ou
Sims Mario
Singleton Andrew B.
Siscovick David S.
Snively Beverly M.
Soberón Xavier
Spooner Alexandra
Sun Yan V.
Sáenz Tamara
Taylor Herman A.
Tenen Danielle E.
Tusié-Luna Teresa
Vaidya Dhananjay
von Grotthuss Marcin
Wagenknecht Lynne
Wagner Bridget K.
Wilkens Lynne
Wilson James G.
Yanek Lisa R.
Yang Lingyao
Zerrweck Carlos
Zhao Wei
Zheng Wei
Zonderman Alan B.
Publication venue: Elsevier BV
Publication date: 01/06/2017

Type 2 diabetes (T2D) affects Latinos at twice the rate seen in populations of European descent. We recently identified a risk haplotype spanning SLC16A11 that explains ∼20% of the increased T2D prevalence in Mexico. Here, through genetic fine-mapping, we define a set of tightly linked variants likely to contain the causal allele(s). We show that variants on the T2D-associated haplotype have two distinct effects: (1) decreasing SLC16A11 expression in liver and (2) disrupting a key interaction with basigin, thereby reducing cell-surface localization. Both independent mechanisms reduce SLC16A11 function and suggest SLC16A11 is the causal gene at this locus. To gain insight into how SLC16A11 disruption impacts T2D risk, we demonstrate that SLC16A11 is a proton-coupled monocarboxylate transporter and that genetic perturbation of SLC16A11 induces changes in fatty acid and lipid metabolism that are associated with increased T2D risk. Our findings suggest that increasing SLC16A11 function could be therapeutically beneficial for T2D. Keywords: type 2 diabetes (T2D); genetics; disease mechanism; SLC16A11; MCT11; solute carrier (SLC); monocarboxylates; fatty acid metabolism; lipid metabolism; precision medicine

DSpace@MIT

Crossref