Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data.
The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profile of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the three-dimensional (3-D) structure of the cells. Consequently, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the Single-Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: (a) to identify the top 60, 40, and 20 genes of the Drosophila melanogaster embryo that contain the most spatial information and (b) to reconstruct the 3-D arrangement of the embryo using information from those genes. We developed two independent techniques, based on the least absolute shrinkage and selection operator (Lasso) and on deep neural networks (NNs), which are applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, utilizes the Lasso and ranking statistics and allows a user to define the specific number of features they are interested in. The NN approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, individually for both techniques, that we are able to identify a user-defined number of important and stable genes containing the most spatial information. The results from both techniques achieve high performance when reconstructing spatial information in D. melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel D. melanogaster genes that carry important positional information and were not previously suspected of doing so. We also show how the indirect use of the full datasets’ information can lead to data leakage and to an overestimate of the model’s performance. 
Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/
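As a rough illustration of how a Lasso-based selector can return a user-defined number of features, the sketch below ranks features by how often the Lasso keeps them across random subsamples and penalty values, then returns the top X. This is a generic selection-frequency scheme, not the authors' Lasso.TopX implementation (which is available at the repository linked above). The function names, the subsampling scheme, and the penalty grid are all assumptions made for the example, and the features are assumed to be centered so that no intercept is needed.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            if new_w != w[j]:
                delta = new_w - w[j]
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def top_x_features(X, y, x, alphas, n_rounds=20, subsample=0.8, seed=0):
    """Hypothetical helper: rank features by how often the Lasso keeps them."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        rows = rng.sample(range(n), int(subsample * n))
        Xs = [X[i] for i in rows]
        ys = [y[i] for i in rows]
        for alpha in alphas:
            w = lasso_coordinate_descent(Xs, ys, alpha)
            for j, coef in enumerate(w):
                if coef != 0.0:
                    counts[j] += 1
    # Most frequently selected features first; return the x best indices
    ranked = sorted(range(p), key=lambda j: counts[j], reverse=True)
    return ranked[:x]
```

Counting selections over subsamples favors features that are kept consistently, which is one simple way to obtain both a ranking and a fixed-size, reasonably stable feature set.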
Reproducibility Efforts as a Teaching Tool: A Pilot Study
The replication crisis is a methodological problem in which many scientific research findings have been difficult or impossible to replicate. Because the reproducibility of empirical results is an essential aspect of the scientific method, such failures endanger the credibility of theories based on them and possibly significant portions of scientific knowledge. An instance of the replication crisis, analytic replication, pertains to reproducing published results through computational reanalysis of the authors' original data. However, direct replications are costly, time-consuming, and unrewarded in today's publishing standards. We propose that bioinformatics and computational biology students replicate recent discoveries as part of their curriculum. Considering the above, we performed a pilot study in one of the graduate-level courses we developed and taught at our University. The course is entitled Intro to R Programming and is meant for students in our Master's and PhD programs who have little to no programming skills. As the course emphasized real-world data analysis, we thought it would be an appropriate setting to carry out this study. The primary objective was to expose the students to real biological data analysis problems. These include locating and downloading the needed datasets, understanding any underlying conventions and annotations, understanding the analytical methods, and regenerating multiple graphs from their assigned article. The secondary goal was to determine whether the assigned articles contained sufficient information for a graduate-level student to replicate its figures. Overall, the students successfully reproduced 39% of the figures. The main obstacles were the need for more advanced programming skills and the incomplete documentation of the applied methods. Students were engaged, enthusiastic, and focused throughout the semester. 
We believe that this teaching approach will allow students to make fundamental scientific contributions under appropriate supervision. It will teach them about the scientific process, the importance of reporting standards, and the importance of openness.
Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients
Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stages. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significantly correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.
STATegra: Multi-Omics Data Integration - A Conceptual Scheme With a Bioinformatics Pipeline
Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. Therefore, there is an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), intended to be as generic as possible for multi-omics analysis, combining available multi-omic analysis tools (machine learning component analysis, non-parametric data combination, and a multi-omics exploratory analysis) in a step-wise manner. While we have previously combined those integrative tools in several studies, here we provide a systematic description of the STATegra framework and its validation using two case studies from The Cancer Genome Atlas (TCGA). For both the Glioblastoma and the Skin Cutaneous Melanoma (SKCM) cases, we demonstrate an enhanced capacity of the framework (beyond that of the individual tools) to identify features and pathways compared to single-omics analysis. Such an integrative multi-omics analysis framework for identifying features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step-by-step as OpenSource in the STATegRa Bioconductor package.
FAIR+E pathogen data for surveillance and research: lessons from COVID-19
The COVID-19 pandemic has exemplified the importance of interoperable and equitable data sharing for global surveillance and to support research. While many challenges could be overcome, at least in some countries, many hurdles within the organizational, scientific, technical and cultural realms still remain to be tackled to be prepared for future threats. We propose to (i) continue supporting global efforts that have proven to be efficient and trustworthy toward addressing challenges in pathogen molecular data sharing; (ii) establish a distributed network of Pathogen Data Platforms to (a) ensure high quality data, metadata standardization and data analysis, (b) perform data brokering on behalf of data providers both for research and surveillance, (c) foster capacity building and continuous improvements, also for pandemic preparedness; (iii) establish an International One Health Pathogens Portal, connecting pathogen data isolated from various sources (human, animal, food, environment), in a truly One Health approach and following FAIR principles. To address these challenging endeavors, we have started an ELIXIR Focus Group where we invite all interested experts to join in a concerted, expert-driven effort toward sustaining and ensuring high-quality data for global surveillance and research.
Gene selection for optimal prediction of cell position in tissues from single-cell transcriptomics data.
Single-cell RNA-sequencing (scRNAseq) technologies are rapidly evolving. Although very informative, in standard scRNAseq experiments, the spatial organization of the cells in the tissue of origin is lost. Conversely, spatial RNA-seq technologies designed to maintain cell localization have limited throughput and gene coverage. Mapping scRNAseq to genes with spatial information increases coverage while providing spatial location. However, methods to perform such mapping have not yet been benchmarked. To fill this gap, we organized the DREAM Single-Cell Transcriptomics challenge focused on the spatial reconstruction of cells from the Drosophila embryo from scRNAseq data, leveraging, as a silver standard, genes with in situ hybridization data from the Berkeley Drosophila Transcription Network Project reference atlas. The 34 participating teams used diverse algorithms for gene selection and location prediction, and were able to correctly localize clusters of cells. Selection of predictor genes was essential for this task. Predictor genes showed a relatively high expression entropy and high spatial clustering, and included prominent developmental genes such as gap and pair-rule genes and tissue markers. Application of the top 10 methods to a zebrafish embryo dataset yielded performance and statistical properties of the selected genes similar to those in the Drosophila data. This suggests that methods developed in this challenge are able to extract generalizable properties of genes that are useful to accurately reconstruct the spatial arrangement of cells in tissues.
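To make the notion of "expression entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a gene's on/off pattern across cells. The binarization threshold and the function name are assumptions for illustration only. The abstract does not specify how entropy was computed in the challenge analysis.

```python
import math


def expression_entropy(expression, threshold=0.0):
    """Shannon entropy (in bits) of a gene's on/off pattern across cells.

    A cell counts as "on" when its expression exceeds the (assumed) threshold.
    """
    on = sum(1 for value in expression if value > threshold)
    p_on = on / len(expression)
    entropy = 0.0
    for p in (p_on, 1.0 - p_on):
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy
```

Under this definition a gene expressed in about half of the cells has the maximum entropy of 1 bit, while a gene that is on (or off) in every cell has entropy 0 and cannot help distinguish positions.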
Study of cancer-related microRNA-mRNA interactions
MicroRNAs (miRNAs) are small non-coding RNA molecules that bind to the 3' untranslated region (3'UTR) of their target mRNAs and regulate protein synthesis by causing translational repression and/or mRNA degradation. A large number of miRNAs have been associated with cancer because they are often found to be located within cancer-associated genomic regions (CAGRs/FRA), to target cancer-related genes, or to be differentially expressed in tumor compared to normal tissues. Previous work in Dr. Poirazi's Computational Biology lab identified four new putative precursor miRNA genes located within CAGRs. However, their mature molecules and their exact association with cancer were unknown. This thesis focuses on resolving these two issues, using a combination of theoretical and experimental techniques. 
The specific aims of this work are: (i) the development of a mature miRNA prediction algorithm (Chapters II and III); (ii) the identification of the mature miRNA molecules of the four newly identified miRNA genes via a combination of computational and experimental methods (Chapter IV); and (iii) the use of a target prediction algorithm to predict and experimentally verify interactions between the mature molecules and cancer-related genes (Chapter IV).
MiRduplexSVM: A High-Performing MiRNA-Duplex Prediction and Evaluation Methodology
We address the problem of predicting the position of a miRNA duplex on a microRNA hairpin via the development and application of a novel SVM-based methodology. Our method combines a unique problem representation and an unbiased optimization protocol to learn an accurate predictive model, termed MiRduplexSVM, from miRBase 19.0. This is the first model that provides precise information about all four ends of the miRNA duplex. We show that (a) our method outperforms four state-of-the-art tools, namely MaturePred, MiRPara, MatureBayes, and MiRdup, as well as a Simple Geometric Locator, when applied to the same training datasets employed for each tool and evaluated on a common blind test set; (b) in all comparisons, MiRduplexSVM shows superior performance, achieving up to a 60% increase in prediction accuracy for mammalian hairpins, and generalizes very well to plant hairpins without any special optimization; and (c) the tool has a number of important applications, such as the ability to accurately predict the miRNA or the miRNA* given the opposite strand of a duplex. Its performance on this task is superior to the 2-nt overhang rule commonly used in computational studies and similar to that of a comparative genomic approach, without the need for prior knowledge or the complexity of performing multiple alignments. Finally, it is able to evaluate novel, potential miRNAs found either computationally or experimentally. In relation to recent confidence evaluation methods used in miRBase, MiRduplexSVM was successful in identifying high-confidence potential miRNAs.
omicsNPC: Applying the Non-Parametric Combination Methodology to the Integrative Analysis of Heterogeneous Omics Data
Background: The advance of omics technologies has made it possible to measure several data modalities on a system of interest. In this work, we illustrate how the Non-Parametric Combination methodology, namely NPC, can be used for simultaneously assessing the association of different molecular quantities with an outcome of interest. We argue that NPC methods have several potential applications in integrating heterogeneous omics technologies, for example identifying genes whose methylation and transcriptional levels are jointly deregulated, or finding proteins whose abundance shows the same trends as the expression of their encoding genes.
Results: We implemented the NPC methodology within “omicsNPC”, an R function specifically tailored for the characteristics of omics data. We compare omicsNPC against a range of alternative methods on simulated as well as on real data. Comparisons on simulated data point out that omicsNPC produces unbiased / calibrated p-values and performs equally or significantly better than the other methods included in the study; furthermore, the analysis of real data shows that omicsNPC (a) exhibits higher statistical power than other methods, (b) is easily applicable in a number of different scenarios, and (c) yields results with improved biological interpretability.
Conclusions: The omicsNPC function behaves competitively in all comparisons conducted in this study. Taking into account that the method (i) requires minimal assumptions, (ii) can be used on different study designs and (iii) captures the dependences among heterogeneous data modalities, omicsNPC provides a flexible and statistically powerful solution for the integrative analysis of different omics data.
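The general idea behind non-parametric combination can be sketched in a few lines. The example below is written in Python for illustration and is not the omicsNPC R function. It assumes a two-group design, the absolute difference in group means as the per-modality statistic, and Fisher's combining function. All function names and parameters are hypothetical. The same label permutation is applied to every modality, which is what lets the procedure account for dependence among data types.

```python
import math
import random


def mean_diff(values, labels):
    """Absolute difference in group means for one data modality."""
    g1 = [v for v, g in zip(values, labels) if g == 1]
    g0 = [v for v, g in zip(values, labels) if g == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def npc_fisher(modalities, labels, n_perm=999, seed=0):
    """Return (partial p-values, combined p-value) for one gene.

    modalities: list of per-sample measurement lists, one per data type,
    all measured on the same samples. labels: 0/1 group label per sample.
    """
    rng = random.Random(seed)
    # Row 0 holds the observed statistics; rows 1..n_perm hold permuted ones.
    stats = [[mean_diff(m, labels) for m in modalities]]
    for _ in range(n_perm):
        perm = list(labels)
        rng.shuffle(perm)  # one permutation shared by all modalities
        stats.append([mean_diff(m, perm) for m in modalities])

    n_rows = len(stats)
    n_mod = len(modalities)

    def partial_p(row, col):
        # Fraction of rows (this row included) at least as extreme as this row
        return sum(s[col] >= stats[row][col] for s in stats) / n_rows

    # Fisher's combining function, evaluated for the observed and permuted rows
    combined = [
        -2.0 * sum(math.log(partial_p(r, c)) for c in range(n_mod))
        for r in range(n_rows)
    ]
    partial = [partial_p(0, c) for c in range(n_mod)]
    global_p = sum(value >= combined[0] for value in combined) / n_rows
    return partial, global_p
```

Because the combined statistic is itself referred to its permutation distribution, the global p-value needs no assumption about how the modalities are correlated. Other combining functions (for example Liptak or Tippett) could be substituted for Fisher's in the same structure.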
Identification of high confidence miRNAs.
As shown in the figure, MiRduplexSVM assigned higher scores to 554 high-confidence miRNAs (blue bars, median = 0.53 and mean = 0.44) than to 554 randomly selected miRNAs (red bars, median = -0.24 and mean = -0.14), with the observed differences being statistically significant (ranksum: p = 8.3084e-47; t-test: p = 3.9577e-50). The x axis shows MiRduplexSVM’s scores and the y axis shows the percentage of hairpins assigned the respective scores.