11 research outputs found

Biomedical knowledge graph-enhanced prompt generation for large language models

Author: Akbas Rabia E
Baranzini Sergio E
Cerono Gabriel
Huang Sui
Israni Sharat
Morris John H
Nelson Charlotte A
Peetoom Braian
Rizk-Jackson Angela
Rose Peter W
Shi Yongmei
Smith Brett
Soman Karthik
Villouta-Reyes Catalina
Publication venue
Publication date: 28/11/2023
Field of study

Large Language Models (LLMs) have been driving progress in AI at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, and the latter requires domain expertise. External knowledge infusion is task-specific and requires model training. Here, we introduce a task-agnostic Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging the massive biomedical KG SPOKE with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge. KG-RAG consistently enhanced the performance of LLMs across various prompt types, including one-hop and two-hop prompts, drug repurposing queries, biomedical true/false questions, and multiple-choice questions (MCQ). Notably, KG-RAG provides a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5, which exhibited improvement over GPT-4 in context utilization on MCQ data. Our approach was also able to address drug repurposing questions, returning meaningful repurposing suggestions. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM, respectively, in an optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a unified framework. Comment: 28 pages, 5 figures, 2 tables, 1 supplementary file.

arXiv.org e-Print Archive

Recommended from our members

Biomedical knowledge graph-optimized prompt generation for large language models

Author: Akbas Rabia E
Baranzini Sergio E
Cerono Gabriel
Huang Sui
Israni Sharat
Morris John H
Nelson Charlotte A
Peetoom Braian
Rizk-Jackson Angela
Rose Peter W
Shi Yongmei
Smith Brett
Soman Karthik
Villouta-Reyes Catalina
Publication venue: eScholarship, University of California
Publication date: 01/01/2024
Field of study

Motivation: Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, requiring further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge. Results: Compared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in more than 50% reduction in token consumption without compromising the accuracy, making a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4.
In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM in a token-optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion. Availability and implementation: SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository. Supplementary information: Supplementary data are available at Bioinformatics online.
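
The embedding-based context pruning mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a toy bag-of-words embedding in place of a real sentence-embedding model. The function names (`embed`, `cosine`, `prune_context`), the similarity threshold, and the example statements are illustrative assumptions; none of them are taken from the KG-RAG code or from SPOKE.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_context(question, statements, min_similarity=0.2, max_statements=2):
    """Keep only the statements most similar to the question.

    Statements below `min_similarity` are dropped; the rest are ranked by
    similarity and truncated to `max_statements`, which shortens the prompt.
    """
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(s)), s) for s in statements]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:max_statements]]


# Hypothetical graph statements, written out as text (not real SPOKE output).
statements = [
    "Disease multiple sclerosis associates Gene HLA-DRB1",
    "Compound aspirin treats Disease headache",
    "Disease multiple sclerosis presents Symptom fatigue",
]
question = "Which gene associates with multiple sclerosis"
pruned = prune_context(question, statements)
print(pruned)
```

In this sketch the unrelated aspirin statement is dropped before the prompt is assembled, which is the kind of reduction in context size that lowers token consumption.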

eScholarship - University of California

Locus for severity implicates CNS resilience in progression of multiple sclerosis

Author: Alfredsson Lars
Alikhani Katayoun
Amezcua Lilyana
Andlauer Till F. M.
Ban Maria
Baranzini Sergio
Baranzini Sergio E.
Barcellos Lisa
Barizzone Nadia
Beecham Ashley
Berge Tone
Berthele Achim
Bittner Stefan
Blanco Yolanda
Bos Steffan
Briggs Farren B. S.
Caillier Stacy
Calabresi Peter
Caputo Domenico
Carmona-Burgos David
Cavalla Paola
Celius Elisabeth
Cerono Gabriel
Chinea Angel
Chitnis Tanuja
Clarelli Ferdinando
Comabella Manuel
Comi Giancarlo
Compston Alastair
Cotsapas Chris
Cree Bruce C. A.
D'Alfonso Sandra
Dardiotis Efthimios
De Jager Philip
Delgado Silvia
Dubois Benedicte
Engel Sinah
Engelenburg Hendrik
Esposito Federica
Fabis-Pedrini Marzena
Filippi Massimo
Fitzgerald Kathryn
Gasperi Christiane
Gomez Lissette
Gomez Refujia
Goris An
Hadjigeorgiou Georgios
Hafler David
Hamann Joerg
Harbo Hanne
Harbo Hanne F.
Harroud Adil
Hauser Stephen
Held Friederike
Hemmer Bernhard
Henry Roland
Hillert Jan
Huang Jesse
Huitinga Inge
Islam Talat
Isobe Noriko
Jagodic Maja
Jonsdottir Ingileif
Kermode Allan L.
Khalil Michael
Kilpatrick Trevor
Kockum Ingrid
Konidari Ioanna
Kreft Karim
Lechner-Scott Jeannette
Leone Maurizio
Llufriu Sara
Luessi Felix
Madireddy Lohith
Malhotra Sunny
Manouchehrinia Ali
Manrique Clara
Martinelli-Boneschi Filippo
Martinez Andrea
Martinez-Maldonado Viviana
Mascia Elisabetta
McCauley Jacob H.
Metz Luanne
Midaglia Luciana
Montalban Xavier
Oksenberg Jorge
Olsson Tomas
Oturai Annette
Paakkonen Kimmo
Parnell Grant P.
Patsopoulos Nikolaos
Pericak-Vance Margaret
Piehl Fredrik
Rubio Justin
Saarela Janna
Saiz Albert
Santaniello Adam
Santoro Silvia
Sawcer Stephen
Sawcer Stephen J.
Schaefer Catherine
Sellebjerg Finn
Shams Hengameh
Shchetynsky Klementy
Silva Claudia
Siokas Vasileios
Smolders Joost
Sondergaard Helle
Sorosina Melissa
Stefansson Kari
Stewart Graeme
Stridh Pernilla J.
Taylor Bruce
van den Bosch Aletta M. R.
Vandebergh Marijne
Vasileiou Elena
Vecchio Domizia
Villoslada Pablo
Voortman Margarete
Weiner Howard
Wever Dennis
Yong V. Wee
Zipp Frauke
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2023
Field of study

Multiple sclerosis (MS) is an autoimmune disease of the central nervous system (CNS) that results in significant neurodegeneration in the majority of those affected and is a common cause of chronic neurological disability in young adults (1,2). Here, to provide insight into the potential mechanisms involved in progression, we conducted a genome-wide association study of the age-related MS severity score in 12,584 cases and replicated our findings in a further 9,805 cases. We identified a significant association with rs10191329 in the DYSF-ZNF638 locus, the risk allele of which is associated with a median shortening of 3.7 years in the time to requiring a walking aid in homozygous carriers and with increased brainstem and cortical pathology in brain tissue. We also identified suggestive association with rs149097173 in the DNM3-PIGC locus and significant heritability enrichment in CNS tissues. Mendelian randomization analyses suggested a potential protective role for higher educational attainment. In contrast to immune-driven susceptibility (3), these findings suggest a key role for CNS resilience and potentially neurocognitive reserve in determining outcome in MS.

Open Access LMU ( Ludwig-Maximilians-Univ. München)

Ensemble machine learning reveals key features for diabetes duration from electronic health records

Author: Davide Chicco
Gabriel Cerono
Publication venue: PeerJ Inc.
Publication date: 01/02/2024
Field of study

Diabetes is a metabolic disorder that affects more than 420 million people worldwide, and it is caused by the presence of a high level of sugar in the blood over a long period. Diabetes can have serious long-term health consequences, such as cardiovascular diseases, strokes, chronic kidney diseases, foot ulcers, retinopathy, and others. Even though it is common, this disease is hard to spot, because it often comes with no symptoms. Especially for type 2 diabetes, which occurs mainly in adults, knowing how long a patient has had diabetes can have a strong impact on the treatment they can receive. This information, although pivotal, might be absent: for some patients, the year when they received the diabetes diagnosis might be well known, but the year of disease onset might be unknown. In this context, machine learning applied to electronic health records can be an effective tool to predict the past duration of diabetes for a patient. In this study, we applied a regression analysis based on several computational intelligence methods to a dataset of electronic health records of 73 patients with type 1 diabetes with 20 variables and another dataset of records of 400 patients with type 2 diabetes with 49 variables. Among the algorithms applied, Random Forests outperformed the others and efficiently predicted diabetes duration for both cohorts, with regression performance measured through the coefficient of determination R². Afterwards, we applied the same method for feature ranking, and we detected the most relevant factors of the clinical records correlated with past diabetes duration: age, insulin intake, and body-mass index. Our study's discoveries can have a profound impact on clinical practice: when information about the duration of a patient's diabetes is missing, medical doctors can use our tool and focus on age, insulin intake, and body-mass index to infer this important aspect.
Regarding limitations, we were unfortunately unable to find an additional dataset of EHRs of patients with diabetes having the same variables as the two analyzed here, so we could not verify our findings on a validation cohort.
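
The coefficient of determination R² named in the abstract above is computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that calculation. The function name `r_squared` and the duration values are hypothetical and are not taken from the study's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical diabetes durations in years (illustrative values only).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
score = r_squared(observed, predicted)
print(round(score, 2))
```

A score of 1 means the predictions match the observed durations exactly, while a model that always predicts the mean duration scores 0.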

Directory of Open Access Journals

Time-aware Embeddings of Clinical Data using a Knowledge Graph

Author: Baranzini Sergio E
Cerono Gabriel
Nelson Charlotte A
Soman Karthik
Publication venue: eScholarship, University of California
Publication date: 01/11/2022
Field of study

Meaningful representation of clinical data using embedding vectors is a pivotal step to invoke any machine learning (ML) algorithm for data inference. In this article, we propose a time-aware embedding approach of electronic health records onto a biomedical knowledge graph for creating machine-readable patient representations. This approach not only captures the temporal dynamics of patient clinical trajectories, but also enriches them with additional biological information from the knowledge graph. To gauge the predictivity of this approach, we propose an ML pipeline called TANDEM (Temporal and Non-temporal Dynamics Embedded Model) and apply it to the early detection of Parkinson’s disease. TANDEM results in a classification AUC score of 0.85 on an unseen test dataset. These predictions are further explained by providing a biological insight using the knowledge graph. Taken together, we show that temporal embeddings of clinical data could be a meaningful predictive representation for downstream ML pipelines in clinical decision-making.
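
The classification AUC reported in the abstract above is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following minimal sketch computes it by pairwise comparison. The function name `roc_auc`, the labels, and the scores are hypothetical and unrelated to the TANDEM data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Each (positive, negative) pair counts 1 if the positive scores higher,
    0.5 on a tie, and 0 otherwise; the AUC is the mean over all pairs.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs (1 = case, 0 = control).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; rank-based implementations are the usual choice for large cohorts.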

PubMed Central

eScholarship - University of California

Early detection of Parkinson’s disease through enriching the electronic health record using a biomedical knowledge graph

Author: Baranzini Sergio E
Brown Ethan G
Cerono Gabriel
Goldman Samuel M
Nelson Charlotte A
Soman Karthik
Publication venue: eScholarship, University of California
Publication date: 01/01/2023
Field of study

Introduction: Early diagnosis of Parkinson's disease (PD) is important to identify treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests, and these may be coded as diagnoses in the electronic health record (EHR). Methods: To predict PD diagnosis, we embedded EHR data of patients onto a biomedical knowledge graph called Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and a non-PD group of 457,197. Results: The classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification. Discussion: The proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence.

eScholarship - University of California

Leveraging electronic health records and knowledge networks for Alzheimer’s disease prediction and sex-specific biological insights

Author: Aghaeepour Nima
Allen Isabel E
Baranzini Sergio
Bove Riley
Cerono Gabriel
Glymour Maria
Lee Albert
Li Yaqiao
Miller Zachary
Mills Hunter
Miramontes Silvia
Nelson Charlotte
Oskotsky Tomiko T
Rankin Katherine P
Roger Jacquelyn
Sanders Stephan J
Sirota Marina
Soman Karthik
Tang Alice S
Woldemariam Sarah
Zeng Billy
Publication venue: eScholarship, University of California
Publication date: 01/03/2024
Field of study

Identification of Alzheimer's disease (AD) onset risk can facilitate interventions before irreversible disease progression. We demonstrate that electronic health records from the University of California, San Francisco, followed by knowledge networks (for example, SPOKE) allow for (1) prediction of AD onset, (2) prioritization of biological hypotheses, and (3) contextualization of sex dimorphism. We trained random forest models and predicted AD onset on a cohort of 749 individuals with AD and 250,545 controls with a mean area under the receiver operating characteristic of 0.72 (7 years prior) to 0.81 (1 day prior). We further harnessed matched cohort models to identify conditions with predictive power before AD onset. Knowledge networks highlight shared genes between multiple top predictors and AD (for example, APOE, ACTB, IL6 and INS). Genetic colocalization analysis supports AD association with hyperlipidemia at the APOE locus, as well as a stronger female AD association with osteoporosis at a locus near MS4A6A. We therefore show how clinical data can be utilized for early AD prediction and identification of personalized biological hypotheses.

eScholarship - University of California

The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information

Author: Akbas Rabia E
Baranzini Sergio E
Bharat Krish
Cerono Gabriel
Chakraborty Arjun
Costes Sylvain V
Hardi Josef
Harroud Adil
Huang Conrad C
Huang Sui
Israni Sharat
Keiser Michael
Mardirossian Taline
Meng Elaine C
Morris John H
Musen Mark
Nelson Charlotte A
Pico Alexander R
Rizk-Jackson Angela
Rose Peter W
Sanders Lauren
Schenk Gundolf
Shi Yongmei
Smith Brett
Soman Karthik
Tang Alice
Zhou Xiaoyuan
Publication venue: eScholarship, University of California
Publication date: 03/02/2023
Field of study

Motivation: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KG presents a challenge due to the complexity, size and heterogeneity of the underlying information. Results: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. Availability and implementation: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

eScholarship - University of California

The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information.

Author: Akbas Rabia E
Baranzini Sergio E
Bharat Krish
Cerono Gabriel
Chakraborty Arjun
Costes Sylvain V
Hardi Josef
Harroud Adil
Huang Conrad C
Huang Sui
Israni Sharat
Keiser Michael
Mardirossian Taline
Meng Elaine C
Morris John H
Musen Mark
Nelson Charlotte A
Pico Alexander R
Rizk-Jackson Angela
Rose Peter W
Sanders Lauren
Schenk Gundolf
Shi Yongmei
Smith Brett
Soman Karthik
Tang Alice
Zhou Xiaoyuan
Publication venue: Providence St. Joseph Health Digital Commons
Publication date: 03/02/2023
Field of study

MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KG presents a challenge due to the complexity, size and heterogeneity of the underlying information. RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

eScholarship - University of California

Providence St. Joseph Health Digital Commons