10 research outputs found

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

    Modern sequencing technology has revolutionized genomic research and triggered explosive growth in the number of DNA, RNA, and protein sequences. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps to analyze sequences: feature extraction, dimensionality reduction, feature selection, and predictive model construction based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. This dissertation also applies machine learning algorithms to electronic medical record (EMR) data. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR built using Random Forest regression. The dataset includes demographic, clinical, and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the more narrow aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment. High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high precision quantification and further analytical tasks. Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure.
Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance

    Model based approaches to characterize heterogeneity in gene regulation across cells and disease types

    Access to large genome-wide biological datasets has now enabled computational researchers to tackle long-standing questions in biomedicine through the lens of machine learning (ML) and artificial intelligence (AI). The potential benefits of such computational approaches to biological research are immense. For example, efficient yet interpretable machine learning models of disease, drug response, or phenotype can affect our lives at both the personal and the social level. However, heterogeneity is found at multiple scales in biology, manifested as the context-specificity of biological processes. This context-specific heterogeneity poses a major challenge to ML models. Even though context-specific models are often trained, this is mostly done without the benefit of mechanistic insights about the biological processes being modeled, and such models therefore do not help improve our biological understanding. This dissertation addresses these challenges by: a) designing appropriate features and ML models motivated by the biological hypothesis at hand, b) building pipelines to analyze multiple context-specific models together, and c) developing data integration and imputation methods to address the problems of insufficient and missing data. The first project studies loss of methylation, or hypo-methylation, in large blocks, which causes aberrant gene activity and is a well-known phenomenon in cancer. To find the associated markers, I designed a classification model of hypo-methylated block boundaries and non-boundaries in colon cancer. The second project models the binding of transcription factors (TFs) to specific DNA elements in the genome, one of the principal components of gene regulation. Since the condition specificity of TF binding is not yet well understood, this dissertation examines a design of cell type-specific models for TF binding using ChIP-seq data.
A meta-analysis pipeline, called TRISECT, is applied to multiple TF binding models to understand the heterogeneity of cell specificity across those models. Next, models of breast cancer metastasis using gene expression data are discussed. In breast cancer metastasis, the affinity towards distant tissues, called secondary tissues, is not yet understood. Therefore, going beyond mere discriminatory models, I propose another meta-analysis pipeline, MONTAGE, intended to understand the organotropism of breast cancer metastasis across secondary tissues. Building ML models can be hindered by limited data size, especially for rare diseases. Therefore, by necessity, molecular data are merged across multiple studies and across multiple technical platforms, which makes them vulnerable to so-called batch effects that dilute the actual biological signal. Existing methods are not capable of removing multivariate confounding artifacts, which leads to inaccurate models. To circumvent this issue, this dissertation examines a deep learning-based technique (deepSavior) that ‘translates’ the gene expression profile of samples from one technical platform to another. To summarize, this dissertation makes three distinct contributions: a) designing an effective ML model to explore the determinants of cancer-associated hypomethylation, b) designing meta-analysis pipelines to compare multiple related but context-specific ML models to understand heterogeneous relations among biological processes, and c) developing a new method to overcome data integration and imputation challenges

    Identification of transcriptomic signature in cellular senescence and characterization of circulating small non-coding RNA during human aging

    Accumulation of cellular senescence foreshadows the onset of aging and cancer. Senescence is an irreversible process that leads to cell cycle arrest, while senescent cells retain metabolic viability and thereby affect tissue homeostasis. Senescent cells not only accelerate the aging of the individual, they also drive age-related diseases such as cancer, osteoarthritis, atherosclerosis, and Alzheimer’s disease. The senescent phenotype is heterogeneous across cell lines and triggers, and traits that overlap with those of non-senescent cells make it difficult to identify senescence precisely. It is therefore necessary to identify robust shared markers, senescence-specific pathways, and biological processes across different senescence models. The role of non-coding RNA in senescence and aging has recently attracted attention because of its ability to control the cell cycle at the post-transcriptional level. The highly active secretome of senescent cells, termed the senescence-associated secretory phenotype (SASP), can drive age-related processes through intercellular communication; however, only a limited number of factors have been identified, in very specific scenarios, and the role of secreted extracellular RNAs (exRNAs) is not well understood. Detection of exRNAs protected by the extracellular vesicle (EV) membrane revealed that most extracellular mRNAs are fragments, found along with small non-coding RNAs (sncRNAs) such as miRNAs, piRNAs, and tRNA fragments. It is therefore promising to investigate the role of extracellular sncRNAs in aging-related dysfunction during cell-cell interaction. To better understand the nature of cellular senescence and the corresponding human aging process at the transcriptome level, RNA sequencing data from different cell types and senescence inductions were collected, and significantly shared gene markers and pathways among multiple senescence models were determined through meta-analysis and machine learning-based logistic regression methods.
In addition, the function of the identified senescence-associated long non-coding RNAs (lncRNAs) in the cell cycle was verified through short interfering RNA (siRNA) knock-down in lung fibroblasts (IMR-90). In parallel, the abundance of extracellular sncRNAs from healthy people aged 20 to 99 was quantified using 446 small RNA sequencing datasets. The expression trend of each sncRNA subspecies with age was determined, and an sncRNA-based age predictor was established using a high-performance ensemble machine learning strategy

    Molecular Science for Drug Development and Biomedicine

    With the avalanche of biological sequences generated in the postgenomic age, molecular science is facing an unprecedented challenge: how to use the huge amount of data in a timely manner to benefit human beings. Stimulated by this challenge, molecular science has developed rapidly, particularly in the areas associated with drug development and biomedicine, both experimental and theoretical. The current thematic issue was launched with a focus on the topic of “Molecular Science for Drug Development and Biomedicine”, in the hope of stimulating more useful techniques and findings from the various approaches of molecular science for drug development and biomedicine

    Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function

    This compilation focuses on recent advances in the molecular and cellular understanding of neuromuscular biology, and the treatment of neuromuscular disease. These advances are at the forefront of modern molecular methodologies, often integrating across wet-lab cell and tissue models, dry-lab computational approaches, and clinical studies. The continuing development and application of multiomics methods offer particular challenges and opportunities in the field, not least in the potential for personalized medicine

    Comparative Genomics and Metagenomics of Bacteria (Génomique et métagénomique comparatives des bactéries)

    The fields of genomics and metagenomics have provided immeasurable support to the advancement of our knowledge of bacterial genetics. Pathogenic bacteria are now routinely sequenced and analyzed to identify the factors causing their virulence or antibiotic resistance, as well as their ability to transmit these clinically relevant genetic elements. Commensal bacteria are increasingly associated with human health and are studied using metagenomics to work around the difficulties of culturing them, given their wide range of metabolic needs. Next-generation sequencing makes it possible to mass-produce these DNA sequences for characterization and comparison, in order to answer questions often related to human health. Advances in genomics and metagenomics require bioinformatics software that can manage and adapt to the massive and growing quantity of biological data. The first two hypotheses of this thesis concern the development of efficient and flexible methods for the analysis of bacterial genomes and metagenomes. Several bioinformatics analysis methods were explored and led to the implementation of two software tools to support the research hypotheses: Ray Surveyor and kAAmer.
The first research hypothesis was to test whether genomes could be compared from their DNA k-mer content alone, with results analogous to standard genomic comparisons such as average nucleotide identity or phylogenetic trees, but without the need for sequence alignments. Using Ray Surveyor and several bacterial genomic and metagenomic analyses, we demonstrated that such comparisons can be obtained using k-mers from DNA sequences. In the study that presented the results for this hypothesis, we also estimated the genotypic propensity of several bacterial species for clinically relevant phenotypes using specialized gene databases. The second hypothesis was to test whether software for protein sequence identification based on amino acid k-mers could be developed that would outperform existing software, specifically for the identification of proteins with a high degree of homology. The work led to the implementation of kAAmer, a software tool for creating protein databases in which sequences are searched by exact k-mer matching, while still supporting sequence alignment. kAAmer proved very efficient for protein sequence search, with performance surpassing even the fastest sequence aligners in most scenarios. kAAmer also offers other useful features, such as the ability to host a database permanently as a service. Finally, the third and last research hypothesis aimed to test whether the two software tools developed during the PhD project (Ray Surveyor and kAAmer) would produce viable results in a metagenomic analysis of the gut microbiota in relation to obesity. Taxonomic and functional profiling were performed with kAAmer, and the de novo comparison of metagenomes was carried out with Ray Surveyor.
The results were significant and showed, among other things, a trend towards a higher relative abundance of the Bacteroidetes phylum and a lower relative abundance of the Firmicutes and Acinetobacteria phyla in obese subjects. Many metabolic functions were also found to differ significantly between the normal and obese conditions, notably those related to the metabolism of short-chain fatty acids (SCFAs), which are known to be associated with obesity
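
To illustrate the idea of comparing sequences from their k-mer content without alignment, here is a minimal Python sketch. It is not the Ray Surveyor or kAAmer implementation: the function names (`kmer_set`, `jaccard`, `kmer_similarity`), the choice of Jaccard similarity as the comparison measure, and the toy sequences are all assumptions made for this example.

```python
from __future__ import annotations


def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 1.0 by convention if both are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def kmer_similarity(seq1: str, seq2: str, k: int = 4) -> float:
    """Alignment-free similarity of two sequences based on shared k-mers."""
    return jaccard(kmer_set(seq1, k), kmer_set(seq2, k))


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    genome_a = "ACGTACGTGACCTGA"
    genome_b = "ACGTACGTGACTTGA"
    print(kmer_similarity(genome_a, genome_b, k=4))
```

The same set-based approach works for amino acid k-mers by passing protein sequences instead of DNA. Real tools use much larger k values and compact k-mer indexes rather than in-memory Python sets.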