11 research outputs found
Biomedical knowledge graph-enhanced prompt generation for large language models
Large Language Models (LLMs) have been driving progress in AI at an
unprecedented rate, yet still face challenges in knowledge-intensive domains
like biomedicine. Solutions such as pre-training and domain-specific
fine-tuning add substantial computational overhead, and the latter require
domain-expertise. External knowledge infusion is task-specific and requires
model training. Here, we introduce a task-agnostic Knowledge Graph-based
Retrieval Augmented Generation (KG-RAG) framework by leveraging the massive
biomedical KG SPOKE with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to
generate meaningful biomedical text rooted in established knowledge. KG-RAG
consistently enhanced the performance of LLMs across various prompt types,
including one-hop and two-hop prompts, drug repurposing queries, biomedical
true/false questions, and multiple-choice questions (MCQ). Notably, KG-RAG
provides a remarkable 71% boost in the performance of the Llama-2 model on the
challenging MCQ dataset, demonstrating the framework's capacity to empower
open-source models with fewer parameters for domain-specific questions.
Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as
GPT-3.5 which exhibited improvement over GPT-4 in context utilization on MCQ
data. Our approach was also able to address drug repurposing questions,
returning meaningful repurposing suggestions. In summary, the proposed
framework combines explicit and implicit knowledge of KG and LLM, respectively,
in an optimized fashion, thus enhancing the adaptability of general-purpose
LLMs to tackle domain-specific questions in a unified framework.Comment: 28 pages, 5 figures, 2 tables, 1 supplementary fil
Recommended from our members
Biomedical knowledge graph-optimized prompt generation for large language models
MotivationLarge Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, requiring further domain-expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge.ResultsCompared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in more than 50% reduction in token consumption without compromising the accuracy, making a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM in a token optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion.Availability and implementationSPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository.Supplementary informationSupplementary data are available at Bioinformatics online
Locus for severity implicates CNS resilience in progression of multiple sclerosis
Multiple sclerosis (MS) is an autoimmune disease of the central nervous system (CNS) that results in significant neurodegeneration in the majority of those affected and is a common cause of chronic neurological disability in young adults(1,2). Here, to provide insight into the potential mechanisms involved in progression, we conducted a genome-wide association study of the age-related MS severity score in 12,584 cases and replicated our findings in a further 9,805 cases. We identified a significant association with rs10191329 in the DYSF-ZNF638 locus, the risk allele of which is associated with a shortening in the median time to requiring a walking aid of a median of 3.7 years in homozygous carriers and with increased brainstem and cortical pathology in brain tissue. We also identified suggestive association with rs149097173 in the DNM3-PIGC locus and significant heritability enrichment in CNS tissues. Mendelian randomization analyses suggested a potential protective role for higher educational attainment. In contrast to immune-driven susceptibility(3), these findings suggest a key role for CNS resilience and potentially neurocognitive reserve in determining outcome in MS
Ensemble machine learning reveals key features for diabetes duration from electronic health records
Diabetes is a metabolic disorder that affects more than 420 million of people worldwide, and it is caused by the presence of a high level of sugar in blood for a long period. Diabetes can have serious long-term health consequences, such as cardiovascular diseases, strokes, chronic kidney diseases, foot ulcers, retinopathy, and others. Even if common, this disease is uneasy to spot, because it often comes with no symptoms. Especially for diabetes type 2, that happens mainly in the adults, knowing how long the diabetes has been present for a patient can have a strong impact on the treatment they can receive. This information, although pivotal, might be absent: for some patients, in fact, the year when they received the diabetes diagnosis might be well-known, but the year of the disease unset might be unknown. In this context, machine learning applied to electronic health records can be an effective tool to predict the past duration of diabetes for a patient. In this study, we applied a regression analysis based on several computational intelligence methods to a dataset of electronic health records of 73 patients with diabetes type 1 with 20 variables and another dataset of records of 400 patients of diabetes type 2 with 49 variables. Among the algorithms applied, Random Forests was able to outperform the other ones and to efficiently predict diabetes duration for both the cohorts, with the regression performances measured through the coefficient of determination R2. Afterwards, we applied the same method for feature ranking, and we detected the most relevant factors of the clinical records correlated with past diabetes duration: age, insulin intake, and body-mass index. Our study discoveries can have profound impact on clinical practice: when the information about the duration of diabetes of patient is missing, medical doctors can use our tool and focus on age, insulin intake, and body-mass index to infer this important aspect. Regarding limitations, unfortunately we were unable to find additional dataset of EHRs of patients with diabetes having the same variables of the two analyzed here, so we could not verify our findings on a validation cohort
Time-aware Embeddings of Clinical Data using a Knowledge Graph
Meaningful representations of clinical data using embedding vectors is a pivotal step to invoke any machine learning (ML) algorithm for data inference. In this article, we propose a time-aware embedding approach of electronic health records onto a biomedical knowledge graph for creating machine readable patient representations. This approach not only captures the temporal dynamics of patient clinical trajectories, but also enriches it with additional biological information from the knowledge graph. To gauge the predictivity of this approach, we propose an ML pipeline called TANDEM (Temporal and Non-temporal Dynamics Embedded Model) and apply it on the early detection of Parkinson’s disease. TANDEM results in a classification AUC score of 0.85 on unseen test dataset. These predictions are further explained by providing a biological insight using the knowledge graph. Taken together, we show that temporal embeddings of clinical data could be a meaningful predictive representation for downstream ML pipelines in clinical decision-making
Recommended from our members
Early detection of Parkinson’s disease through enriching the electronic health record using a biomedical knowledge graph
IntroductionEarly diagnosis of Parkinson's disease (PD) is important to identify treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests and may be coded as diagnoses in the electronic health record (EHR).MethodsTo predict PD diagnosis, we embedded EHR data of patients onto a biomedical knowledge graph called Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and 457,197 non-PD group.ResultsThe classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification.DiscussionThe proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence
Recommended from our members
Leveraging electronic health records and knowledge networks for Alzheimer’s disease prediction and sex-specific biological insights
Identification of Alzheimer's disease (AD) onset risk can facilitate interventions before irreversible disease progression. We demonstrate that electronic health records from the University of California, San Francisco, followed by knowledge networks (for example, SPOKE) allow for (1) prediction of AD onset and (2) prioritization of biological hypotheses, and (3) contextualization of sex dimorphism. We trained random forest models and predicted AD onset on a cohort of 749 individuals with AD and 250,545 controls with a mean area under the receiver operating characteristic of 0.72 (7 years prior) to 0.81 (1 day prior). We further harnessed matched cohort models to identify conditions with predictive power before AD onset. Knowledge networks highlight shared genes between multiple top predictors and AD (for example, APOE, ACTB, IL6 and INS). Genetic colocalization analysis supports AD association with hyperlipidemia at the APOE locus, as well as a stronger female AD association with osteoporosis at a locus near MS4A6A. We therefore show how clinical data can be utilized for early AD prediction and identification of personalized biological hypotheses
Recommended from our members
The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information
MotivationKnowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KG presents a challenge due to the complexity, size and heterogeneity of the underlying information.ResultsIn this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts.Availability and implementationThe SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu.Supplementary informationSupplementary data are available at Bioinformatics online
Recommended from our members
The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information.
MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KG presents a challenge due to the complexity, size and heterogeneity of the underlying information.
RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by python scripts which download each resource, check for integrity and completeness, and then create a \u27parent table\u27 of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts.
AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online