23 research outputs found
Multi-omics analysis reveals drivers of loss of β-cell function after newly diagnosed autoimmune type 1 diabetes: An INNODIA multicenter study
Aims: Heterogeneity in the rate of beta-cell loss in newly diagnosed type 1 diabetes patients is poorly understood and creates a barrier to designing and interpreting disease-modifying clinical trials. Integrative analyses of baseline multi-omics data obtained after the diagnosis of type 1 diabetes may provide mechanistic insight into the diverse rates of disease progression after type 1 diabetes diagnosis. Methods: We collected samples in a pan-European consortium that enabled the concerted analysis of five different omics modalities in data from 97 newly diagnosed patients. In this study, we used Multi-Omics Factor Analysis to identify molecular signatures correlating with post-diagnosis decline in beta-cell mass measured as fasting C-peptide. Results: Two molecular signatures were significantly correlated with fasting C-peptide levels. One signature showed a correlation to neutrophil degranulation, cytokine signalling, lymphoid and non-lymphoid cell interactions and G-protein coupled receptor signalling events that were inversely associated with a rapid decline in beta-cell function. The second signature, which was related to translation and viral infection, was inversely associated with change in beta-cell function. In addition, the immunomics data revealed a Natural Killer cell signature associated with rapid beta-cell decline. Conclusions: Features that differ between individuals with slow and rapid decline in beta-cell mass could be valuable in staging and prediction of the rate of disease progression and thus enable smarter (shorter and smaller) trial designs for disease-modifying therapies as well as offering biomarkers of therapeutic effect.
Spectrum of Protein Location in Proteomes Captures Evolutionary Relationship Between Species
The native subcellular location (also referred to as localization or cellular compartment) of a protein is the one in which it acts most frequently; it is one aspect of protein function. Do ten eukaryotic model organisms differ in their location spectrum, i.e., the fraction of each proteome in each of seven major cellular compartments? As experimental annotations of locations remain biased and incomplete, we need prediction methods to answer this question. After systematic bias corrections, the complete but faulty prediction methods appeared to be more appropriate to compare location spectra between species than the incomplete but more accurate experimental data. This work compared the location spectra for ten eukaryotes: Homo sapiens (human), Gorilla gorilla (gorilla), Pan troglodytes (chimpanzee), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit/vinegar fly), Anopheles gambiae (African malaria mosquito), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (baker's yeast), and Schizosaccharomyces pombe (fission yeast). The two largest classes were predicted to be the nucleus and the cytoplasm, together accounting for 47–62% of all proteins, while 7–21% of the proteins were predicted in the plasma membrane and 4–15% to be secreted. Overall, the predicted location spectra were largely similar. However, in detail, the differences sufficed to plot trees (UPGMA) and 2D (PCA) maps relating the ten organisms using a simple Euclidean distance in seven states (location classes). The relations based on the simple predicted location spectra captured aspects of cross-species comparisons usually revealed only by much more detailed evolutionary comparisons. Most interestingly, known phylogenetic relations were reproduced better by paralog-only than by ortholog-only trees. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00239-021-10022-4
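The distance step described in the abstract above can be illustrated with a minimal sketch. It assumes each species is represented by a seven-element vector of proteome fractions (one per location class) and computes pairwise Euclidean distances, which could then feed a UPGMA or PCA routine. The species labels and all numeric values are invented placeholders, not results from the study, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical location spectra: fraction of a proteome in each of seven
# location classes. These values are invented placeholders for illustration
# only; they are NOT taken from the study.
spectra = {
    "species_A": [0.30, 0.25, 0.15, 0.10, 0.08, 0.07, 0.05],
    "species_B": [0.28, 0.26, 0.16, 0.10, 0.08, 0.07, 0.05],
    "species_C": [0.20, 0.30, 0.10, 0.15, 0.10, 0.10, 0.05],
}


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(spectra):
    """Map each unordered pair of species names to the distance between their spectra."""
    return {
        (a, b): euclidean(spectra[a], spectra[b])
        for a, b in combinations(sorted(spectra), 2)
    }


distances = pairwise_distances(spectra)

# The pair with the smallest distance is the first to be merged by an
# agglomerative method such as UPGMA.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 4))
```

With these placeholder vectors, species_A and species_B have the most similar spectra, so a clustering method would join them first.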
SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects
The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases on the model assessment, but expanding those repercussions to the model development.
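The general idea of similarity-aware splitting can be illustrated with a small sketch. This is not SpanSeq's algorithm, which the abstract does not describe. It is a generic stand-in that assumes a toy k-mer Jaccard similarity, groups sequences whose similarity meets a threshold, and assigns whole groups to partitions so that similar sequences never end up on opposite sides of a split. All function names, the threshold, and the example sequences are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Toy similarity: Jaccard index of the two k-mer sets (an assumption)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def cluster_by_similarity(seqs, threshold=0.5, k=3):
    """Single-linkage grouping via union-find: link pairs at or above the threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(seqs[i], seqs[j], k) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, n_parts=2):
    """Greedily place each whole cluster into the currently smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts


# Made-up example sequences (not real data).
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "GATTACAGAT"]
clusters = cluster_by_similarity(seqs)
parts = split_clusters(clusters, n_parts=2)
print(clusters)
print(parts)
```

The all-pairs comparison here is quadratic in the number of sequences, so it only suits small examples. A method meant to scale to genomes would need a faster similarity search.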