2 research outputs found

    Harnessing large language models (LLMs) for candidate gene prioritization and selection.

    BACKGROUND: Feature selection is a critical step for translating advances afforded by systems-scale molecular profiling into actionable clinical insights. While data-driven methods are commonly utilized for selecting candidate genes, knowledge-driven methods must contend with the challenge of efficiently sifting through extensive volumes of biomedical information. This work aimed to assess the utility of large language models (LLMs) for knowledge-driven gene prioritization and selection.
    METHODS: In this proof of concept, we focused on 11 blood transcriptional modules associated with an Erythroid cells signature. We evaluated four leading LLMs across multiple tasks. Next, we established a workflow leveraging LLMs. The steps consisted of: (1) Selecting one of the 11 modules; (2) Identifying functional convergences among constituent genes using the LLMs; (3) Scoring candidate genes across six criteria capturing the gene's biological and clinical relevance; (4) Prioritizing candidate genes and summarizing justifications; (5) Fact-checking justifications and identifying supporting references; (6) Selecting a top candidate gene based on validated scoring justifications; and (7) Factoring in transcriptome profiling data to finalize the selection of the top candidate gene.
    RESULTS: Of the four LLMs evaluated, OpenAI's GPT-4 and Anthropic's Claude demonstrated the best performance and were chosen for the implementation of the candidate gene prioritization and selection workflow. This workflow was run in parallel for each of the 11 erythroid cell modules by participants in a data mining workshop. Module M9.2 served as an illustrative use case. The 30 candidate genes forming this module were assessed, and the top five scoring genes were identified as BCL2L1, ALAS2, SLC4A1, CA1, and FECH. Researchers carefully fact-checked the summarized scoring justifications, after which the LLMs were prompted to select a top candidate based on this information. GPT-4 initially chose BCL2L1, while Claude selected ALAS2. When transcriptional profiling data from three reference datasets were provided for additional context, GPT-4 revised its initial choice to ALAS2, whereas Claude reaffirmed its original selection for this module.
    CONCLUSIONS: Taken together, our findings highlight the ability of LLMs to prioritize candidate genes with minimal human intervention. This suggests the potential of this technology to boost productivity, especially for tasks that require leveraging extensive biomedical knowledge.
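
    Steps (3) and (4) of the workflow, scoring each candidate gene on six criteria and then ranking the candidates, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the six criteria or say how scores were combined, so the criterion labels are placeholders, the combination rule is an unweighted sum, and the gene names and scores are dummy values, not results from the study.

```python
from dataclasses import dataclass

# Placeholder labels: the abstract states that six criteria were used
# but does not name them.
CRITERIA = tuple(f"criterion_{i}" for i in range(1, 7))


@dataclass
class GeneScore:
    """Scores assigned to one candidate gene, keyed by criterion label."""
    gene: str
    scores: dict

    def total(self) -> float:
        # Assumption: an unweighted sum over the six criteria.
        return sum(self.scores[c] for c in CRITERIA)


def prioritize(candidates, top_n=5):
    """Return the top_n candidates, highest total first.

    Ties are broken alphabetically by gene name so the order is deterministic.
    """
    ranked = sorted(candidates, key=lambda g: (-g.total(), g.gene))
    return ranked[:top_n]


# Dummy candidates with made-up scores (not data from the study).
placeholder = [
    GeneScore("GENE_A", dict.fromkeys(CRITERIA, 5.0)),
    GeneScore("GENE_B", dict.fromkeys(CRITERIA, 8.0)),
    GeneScore("GENE_C", dict.fromkeys(CRITERIA, 5.0)),
]

top = prioritize(placeholder, top_n=2)
print([g.gene for g in top])
```

    In the workflow described above, the scores would come from the LLMs and be fact-checked by researchers before any ranking is acted on; the sketch covers only the ranking step.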

    Assessing GPX4 as a Blood-Based Biomarker via a LLM-engaged Gene Prioritization and Characterization Workflow

    Analyzing blood transcripts to study disease development is an established approach, and next-generation sequencing allows for a comprehensive genomic assessment. However, clinical applications often require selecting specific genes for targeted panels. This study therefore focused on gathering information on the GPX4 gene from the literature and from transcriptome datasets, aided by large language models (LLMs). Using rigorous methodology, we systematically extracted and integrated data, identifying GPX4's link to various diseases and its potential as a biomarker. Our review revealed notable differences in GPX4 expression levels across clinical conditions, including neurological disorders, liver diseases, diabetes, and lung ailments. These differences were visualized using interactive plots and further confirmed by blood transcriptome datasets. The routine clinical application of GPX4 is emerging, particularly in oncology, where it correlates with adverse outcomes. This study highlights the promise of targeting GPX4 in blood to manage diseases related to oxidative stress and cancer, opening avenues for understanding and potential interventions, and setting the stage for future research and clinical applications.