Viral Proteins Acquired from a Host Converge to Simplified Domain Architectures
The infection cycle of viruses creates many opportunities for the exchange of genetic material with the host. Many viruses integrate their sequences into the genome of their host for replication. These processes may lead the virus to acquire host sequences. Such sequences are prone to accumulation of mutations and deletions. However, in rare instances, sequences acquired from a host become beneficial for the virus. We searched for unexpected sequence similarity among the 900,000 viral proteins and all proteins from cellular organisms. Here, we focus on viruses that infect metazoa. The high-conservation analysis yielded 187 instances of highly similar viral-host sequences. Only a small number of them represent viruses that hijacked host sequences. The low-conservation sequence analysis utilizes the Pfam family collection. About 5% of the 12,000 statistical models archived in Pfam are composed of viral-metazoan proteins. In about half of these Pfam families, we provide indirect support for the directionality from the host to the virus. The other families are either wrongly annotated or reflect an extensive sequence exchange between the viruses and their hosts. In about 75% of cross-taxa Pfam families, the viral proteins are significantly shorter than their metazoan counterparts. The tendency for shorter viral proteins relative to their related host proteins reflects the acquisition of only a fragment of the host gene, the elimination of an internal domain, and shortening of the linkers between domains. We conclude that, along viral evolution, the host-originated sequences accommodate simplified domain compositions. We postulate that the trimmed proteins act by interfering with fundamental functions of the host, including intracellular signaling, post-translational modification, protein-protein interaction networks and cellular trafficking. We compiled a collection of hijacked protein sequences. These sequences are attractive targets for manipulation of viral infection.
Subpopulation-Specific Synthetic EHR for Better Mortality Prediction
Electronic health records (EHR) often contain different rates of
representation of certain subpopulations (SP). Factors like patient
demographics, clinical condition prevalence, and medical center type contribute
to this underrepresentation. Consequently, when training machine learning
models on such datasets, the models struggle to generalize well and perform
poorly on underrepresented SPs. To address this issue, we propose a novel
ensemble framework that utilizes generative models. Specifically, we train a
GAN-based synthetic data generator for each SP and incorporate synthetic
samples into each SP training set. Ultimately, we train SP-specific prediction
models. To properly evaluate this method, we design an evaluation pipeline with
2 real-world use case datasets, queried from the MIMIC database. Our approach
shows increased model performance over underrepresented SPs. Our code and
models are given as supplementary and will be made available on a public
repository.
Comment: 10 pages, 4 figures, submitted to AIME 202
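The per-subpopulation workflow described in this abstract (one generator per SP, synthetic samples added to each SP training set, one predictor per SP) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names `train_sp_ensemble` and `predict` are hypothetical, the generator is a simple bootstrap resampler standing in for a GAN, and the predictor is a constant base-rate model standing in for a real classifier.

```python
import random
from collections import defaultdict

def train_sp_ensemble(records, sp_of, fit_generator, fit_model, n_synthetic):
    """Train one augmented model per subpopulation (SP)."""
    by_sp = defaultdict(list)
    for record in records:
        by_sp[sp_of(record)].append(record)

    models = {}
    for sp, real in by_sp.items():
        generate = fit_generator(real)            # per-SP generator
        augmented = real + generate(n_synthetic)  # real + synthetic samples
        models[sp] = fit_model(augmented)         # SP-specific predictor
    return models

def predict(models, sp_of, record):
    """Route a record to the model trained for its SP."""
    return models[sp_of(record)](record)

# --- Toy stand-ins (assumptions, not the paper's components) ---
rng = random.Random(0)

def bootstrap_generator(real):
    # Placeholder for a GAN: resamples real records with replacement.
    return lambda n: [rng.choice(real) for _ in range(n)]

def base_rate_model(train):
    # Placeholder predictor: constant mortality rate of its training set.
    rate = sum(r["died"] for r in train) / len(train)
    return lambda record: rate

def get_sp(record):
    return record["sp"]

# Hypothetical records; "sp" and "died" are made-up field names.
records = [
    {"sp": "A", "died": 0}, {"sp": "A", "died": 1}, {"sp": "A", "died": 0},
    {"sp": "B", "died": 1}, {"sp": "B", "died": 1},
]
models = train_sp_ensemble(records, get_sp, bootstrap_generator,
                           base_rate_model, n_synthetic=4)
risk_a = predict(models, get_sp, {"sp": "A"})
risk_b = predict(models, get_sp, {"sp": "B"})
print(sorted(models), risk_a, risk_b)
```

Passing the generator and model fitters in as arguments keeps the routing logic independent of any particular generative model or classifier.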
Functional inference by ProtoNet family tree: the uncharacterized proteome of Daphnia pulex
BACKGROUND: Daphnia pulex (water flea) is the first crustacean with a fully sequenced genome. The crustaceans and insects have diverged from a common ancestor. D. pulex is a model organism for studying the molecular makeup for coping with environmental challenges. In the complete proteome, there are 30,550 putative proteins. However, about 10,000 of them have no known homologues. Currently, UniProtKB reports 95% of the Daphnia proteins as putative and uncharacterized. RESULTS: We have applied ProtoNet, an unsupervised hierarchical protein clustering method that covers about 10 million sequences, for automatic annotation of the Daphnia proteome. 98.7% (26,625) of the Daphnia full-length proteins were successfully mapped to 13,880 ProtoNet stable clusters, and only 1.3% remained unmapped. We compared the properties of the Daphnia protein families with those of the mouse and the fruit fly proteomes. Functional annotations were successfully assigned for 86% of the proteins. Most proteins (61%) were mapped to only 2953 clusters that contain Daphnia duplicated genes. We focused on the functionality of maximally amplified paralogs. Cuticle structure components and a variety of ion channel protein families were associated with a maximal level of gene amplification. We focused on gene amplification as a leading strategy of Daphnia in coping with environmental toxicity. CONCLUSIONS: Automatic inference is achieved through mapping of sequences to the protein family tree of ProtoNet 6.0. Applying a careful inference protocol resulted in functional assignments for over 86% of the complete proteome. We conclude that the scaffold of ProtoNet can be used as an alignment-free protocol for large-scale annotation tasks of uncharacterized proteomes.
CPLLM: Clinical Prediction with Large Language Models
We present Clinical Prediction with Large Language Models (CPLLM), a method
that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical
disease prediction. We utilized quantization and fine-tuned the LLM using
prompts, with the task of predicting whether patients will be diagnosed with a
target disease during their next visit or in the subsequent diagnosis,
leveraging their historical diagnosis records. We compared our results versus
various baselines, including Logistic Regression, RETAIN, and Med-BERT, which
is the current state-of-the-art model for disease prediction using structured
EHR data. Our experiments have shown that CPLLM surpasses all the tested models
in terms of both PR-AUC and ROC-AUC metrics, displaying noteworthy enhancements
compared to the baseline models.
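For readers unfamiliar with the two metrics named in this abstract, the following is a small, dependency-free sketch of how ROC-AUC and PR-AUC are commonly computed from binary labels and model scores. It is not taken from the CPLLM code. PR-AUC is approximated here by average precision, which is one common estimator; the labels and scores are made up, and both functions assume that each class occurs at least once and that scores are untied.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

# Made-up example: 1 = diagnosed with the target disease at the next visit.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}")
```

With imbalanced outcomes such as rare diagnoses, the precision-recall view is usually the more informative of the two, since ROC-AUC is not affected by the positive rate.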
Tracing diagnosis trajectories over millions of patients reveal an unexpected risk in schizophrenia.
The identification of novel disease associations using big data for patient care has had limited success. In this study, we created a longitudinal disease network of traced readmissions (disease trajectories), merging data from over 10.4 million inpatients through the Healthcare Cost and Utilization Project, which allowed the representation of disease progression mapping over 300 diseases. From these disease trajectories, we discovered an interesting association between schizophrenia and rhabdomyolysis, a rare muscle disease (incidence < 1E-04) (relative risk, 2.21 [1.80-2.71, confidence interval = 0.95], P-value 9.54E-15). We validated this association by using independent electronic medical records from over 830,000 patients at the University of California, San Francisco (UCSF) medical center. A case review of 29 rhabdomyolysis incidents in schizophrenia patients at UCSF demonstrated that 62% are idiopathic, without the use of any drug known to lead to this adverse event, suggesting a warning to physicians to watch for this unexpected risk of schizophrenia. Large-scale analysis of disease trajectories can help physicians understand potential sequential events in their patients.
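As background for the relative risk and 95% confidence interval reported in this abstract, here is a minimal sketch of how such a figure is typically derived from a 2x2 table, using the standard log-scale (Wald) interval. The counts below are hypothetical and are not the study's data, and the abstract does not state which interval method the authors used.

```python
import math

def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, z=1.96):
    """Relative risk with a Wald confidence interval on the log scale."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts: 20 events among 10,000 exposed patients versus
# 100 events among 100,000 unexposed patients.
rr, lower, upper = relative_risk(20, 10_000, 100, 100_000)
print(f"RR = {rr:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval whose lower bound stays above 1, as in the abstract's 1.80-2.71, indicates that the elevated risk is unlikely to be a sampling artifact under this model.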
A model and test for coordinated polygenic epistasis in complex traits
Interactions between genetic variants (epistasis) are pervasive in model systems and can profoundly impact evolutionary adaptation, population disease dynamics, genetic mapping, and precision medicine efforts. In this work, we develop a model for structured polygenic epistasis, called coordinated epistasis (CE), and prove that several recent theories of genetic architecture fall under the formal umbrella of CE. Unlike standard epistasis models that assume epistasis and main effects are independent, CE captures systematic correlations between epistasis and main effects that result from pathway-level epistasis, on balance skewing the penetrance of genetic effects. To test for the existence of CE, we propose the even-odd (EO) test and prove it is calibrated in a range of realistic biological models. Applying the EO test in the UK Biobank, we find evidence of CE in 18 of 26 traits spanning disease, anthropometric, and blood categories. Finally, we extend the EO test to tissue-specific enrichment and identify several plausible tissue–trait pairs. Overall, CE is a dimension of genetic architecture that can capture structured, systemic forms of epistasis in complex human traits.
PatientExploreR: an extensible application for dynamic visualization of patient clinical history from electronic health records in the OMOP common data model.
Motivation: Electronic health records (EHRs) are quickly becoming omnipresent in healthcare, but interoperability issues and technical demands limit their use for biomedical and clinical research. Interactive and flexible software that interfaces directly with EHR data structured around a common data model (CDM) could accelerate more EHR-based research by making the data more accessible to researchers who lack computational expertise and/or domain knowledge. Results: We present PatientExploreR, an extensible application built on the R/Shiny framework that interfaces with a relational database of EHR data in the Observational Medical Outcomes Partnership CDM format. PatientExploreR produces patient-level interactive and dynamic reports and facilitates visualization of clinical data without any programming required. It allows researchers to easily construct and export patient cohorts from the EHR for analysis with other software. This application could enable easier exploration of patient-level data for physicians and researchers. PatientExploreR can incorporate EHR data from any institution that employs the CDM for users with approved access. The software code is free and open source under the MIT license, enabling institutions to install and users to expand and modify the application for their own purposes. Availability and implementation: PatientExploreR can be freely obtained from GitHub: https://github.com/BenGlicksberg/PatientExploreR. We provide instructions for how researchers with approved access to their institutional EHR can use this package. We also release an open sandbox server of synthesized patient data for users without EHR access to explore: http://patientexplorer.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
An expanded evaluation of protein function prediction methods shows an improvement in accuracy
Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization