1,103 research outputs found

    Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

    Background: The accurate prediction of translation initiation in mRNA sequences is an important task in genome annotation. Obtaining an accurate prediction is not always simple, however. The task can be modeled as a classification problem between positive sequences (protein coding) and negative sequences (non-coding). The problem is highly imbalanced because each mRNA molecule has a single translation initiation site (TIS) and many other candidate sites that are not initiators. This study therefore approaches the problem from the perspective of class balancing and presents M-clus, an undersampling balancing method based on clustering. The method also adds features to the sequences and improves classifier performance through the inclusion of knowledge obtained by the model, called InAKnow. Results: With this methodology, the performance measures used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for Mus musculus and Rattus norvegicus, and vary between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Nasonia vitripennis. Precision increases significantly, by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, precision increases by between 37.10% and 59.49%. Including certain features during training, for example the presence of ATG in the region upstream of the TIS, improves sensitivity by approximately 7%.
Using the M-clus balancing method significantly increases sensitivity, from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus). Conclusions: The results indicate that the proposed methodology is adequate for the TIS prediction problem, particularly when using the concept of acquired knowledge, which increased accuracy in all databases evaluated.
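
The class imbalance and the upstream-ATG feature described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' M-clus or InAKnow implementation: the function names, the toy sequence and the choice of index 10 as the true TIS are all hypothetical, chosen only to show how one mRNA yields a single positive candidate and several negatives.

```python
def candidate_sites(mrna):
    """Return the 0-based start index of every ATG in the sequence."""
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabet
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def has_upstream_atg(mrna, site):
    """True if a complete ATG lies entirely upstream of the candidate site."""
    seq = mrna.upper().replace("U", "T")
    return "ATG" in seq[:site]


def build_examples(mrna, true_tis):
    """Label each candidate ATG and attach the upstream-ATG feature."""
    return [
        {"site": i, "is_tis": i == true_tis, "upstream_atg": has_upstream_atg(mrna, i)}
        for i in candidate_sites(mrna)
    ]


# Toy sequence (hypothetical); index 10 is assumed to be the annotated TIS.
mrna = "GGATGCCACCATGGCTTAAATGA"
examples = build_examples(mrna, true_tis=10)
positives = sum(e["is_tis"] for e in examples)
negatives = len(examples) - positives
print(examples)
print(positives, "positive vs", negatives, "negative candidates")
```

On real transcripts the ratio of negative to positive candidates is typically far more skewed than in this toy case, which is the motivation for undersampling the negative class.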

    Machine learning for the prediction of protein-protein interactions

    The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in bioinformatics and systems biology, because most essential cellular processes are mediated by these kinds of interactions. In this thesis we focused on the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs that are members of the same protein complex. Although high-throughput methods for the direct identification of PPI have been developed in recent years, it has been demonstrated that the data obtained by these methods are often incomplete and suffer from high false-positive and false-negative rates. To deal with this technology-driven problem, several machine learning techniques have been employed to improve the accuracy and reliability of predicted interacting protein pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has commonly been viewed as a binary classification problem. However, the nature of the data creates two major problems. First, the classes are imbalanced, because the number of positive examples (pairs of proteins that really interact) is much smaller than the number of negative ones. Second, the selection of negative examples is based on unreliable assumptions that could introduce bias into the classification results. The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods for PPI prediction. OCC methods use examples of just one class to generate a predictive model, which is consequently independent of the kind of negative examples selected; additionally, these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task.
We also undertook a comparative performance evaluation with several conventional learning techniques. Furthermore, we examined a new potential drawback that appears to affect the performance of PPI prediction. It is associated with the composition of the positive gold standard set, which contains a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in over-optimistic performance estimates. The prediction of non-ribosomal PPI is a much more difficult task. We investigated some strategies to improve performance on this subtask, integrating new kinds of data as well as combining diverse classification models generated from different sets of data. We also undertook a preliminary validation study of the new PPI predicted by the OCC methods. To this end, we focused on three main aspects: searching the literature for biological evidence that supports the new predictions; analysing the properties of the predicted PPI networks; and identifying highly interconnected groups of proteins that may correspond to new protein complexes. Finally, this thesis explores a slightly different area, the prediction of PPI types. This concerns the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) database according to their function and binding affinity. Given the relatively small number of crystallized protein complexes available, it is not yet possible to link these results with those obtained previously for the prediction of PPI complexes. However, this could become possible in the near future as more PPI structures become available
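
The core idea of one-class classification described above, learning from positive examples only, can be sketched as follows. This is a deliberately simple centroid-and-radius model, not one of the OCC methods evaluated in the thesis; the feature vectors, the `quantile` parameter and the function names are hypothetical.

```python
import math


def fit_one_class(positives, quantile=0.95):
    """Fit a centroid and an acceptance radius using positive examples only."""
    dim = len(positives[0])
    centroid = [sum(x[d] for x in positives) / len(positives) for d in range(dim)]
    # Sorted distances of the training positives to their own centroid.
    dists = sorted(math.dist(x, centroid) for x in positives)
    # The radius is the distance at the chosen quantile of the training set.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]


def predict(model, x):
    """Accept x as 'interacting' if it falls inside the learned radius."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius


# Hypothetical two-dimensional feature vectors for known interacting pairs.
known_interactions = [(0.9, 0.8), (1.0, 1.0), (1.1, 0.9), (0.8, 1.1)]
model = fit_one_class(known_interactions)

print(predict(model, (1.0, 0.9)))  # close to the positive examples
print(predict(model, (0.1, 0.2)))  # far from the positive examples
```

Because no negative examples enter the fit, the model does not depend on how non-interacting pairs are chosen, which is the property the abstract highlights.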

    Predictive design of sigma factor-specific promoters

    To engineer synthetic gene circuits, molecular building blocks are developed which can modulate gene expression without interference, either mutually or with the host cell's machinery. As the complexity of gene circuits increases, automated design tools and tailored building blocks are required to ensure perfect tuning of all components in the network. Despite efforts to develop prediction tools that allow forward engineering of promoter transcription initiation frequency (TIF), such a tool is still lacking. Here, we use promoter libraries of E. coli sigma factor 70 (σ70)- and B. subtilis σB-, σF- and σW-dependent promoters to construct prediction models capable of predicting both promoter TIF and the orthogonality of the σ-specific promoters. This is achieved by training a convolutional neural network (CNN) with high-throughput DNA sequencing (HTS) data from fluorescence-activated cell sorted (FACS) promoter libraries. This model forms the basis of the online promoter design tool (ProD), providing tailored promoters for tailored genetic systems. Automated design tools and tailored subunits are beneficial in fine-tuning all components of a complex genetic circuit. Here the authors create E. coli and B. subtilis promoter libraries using FACS and HTS, from which an online promoter design tool has been developed using a CNN

    Mechanism-driven hypothesis generation support for a predictive adverse effect in colorectal cancer treatment

    This bioinformatics thesis describes hypothesis generation in tumor biology, in particular in the context of colorectal cancer. The background of the study is an observation from clinical practice. Various authors report that, upon treatment with inhibitors of the Epidermal Growth Factor Receptor (EGFR), in particular the therapeutic antibody Cetuximab, a minority of patients show the common adverse effect of skin toxicity either not at all or in a clearly reduced form. For these patients, a reduced efficacy of the therapy is described at the same time. The absence of the adverse effect is therefore used as a phenotypic biomarker to prompt a switch of therapy where appropriate. However, preventive skin care during treatment, which counteracts the biomarker signal, and the need to start Cetuximab therapy before any information on efficacy can be gained appear unfavorable. Because the underlying molecular mechanism is unknown, predictions ahead of treatment, e.g. by a clinical test, are not yet possible.
In the presented work, the aim was to generate hypotheses about which proteins and cellular signaling pathways might be causal for the differing responses of the patient groups. Starting from the assumption that naturally occurring germline variants functionally discriminate individuals in the context of the treatment, the thesis builds on a small dataset of 23 exomes from participants in clinical studies. These sequencing data were processed into genomic variants and analyzed for their potential influence at the genetic and mechanistic level. Targeted restrictions were introduced by modeling the biomedical context of the use case in order to enrich the sparse data with further information. The resulting candidate genes, which must be validated in subsequent practical work, are described and evaluated individually. Methodologically, the result of the thesis is the "Molecular Systems Map", a network structure modeled in Cytoscape that interactively visualizes the functional interactions between proteins and simultaneously filters the called variants based on the biological context. The aim is to support biomedically trained domain experts in generating hypotheses by presenting the results of the sequence analysis in that functional context, in contrast to the tabular views commonly encountered. In addition, this enables the application of graph algorithms and the integration of further data, e.g. from complementary 'omics experiments

    Developing statistical and bioinformatic analysis of genomic data from tumours

    Previous prognostic signatures for melanoma based on tumour transcriptomic data were developed predominantly on cohorts of AJCC (American Joint Committee on Cancer) stage III and IV melanoma. Since 92% of melanoma patients are diagnosed at AJCC stages I and II, there is an urgent need for better prognostic biomarkers to allow patient stratification for early adjuvant therapies. This study uses genome-wide tumour gene expression levels and clinico-histopathological characteristics of patients from the Leeds Melanoma Cohort (LMC). Several unsupervised and supervised classification approaches were applied to the transcriptomic data, to identify biological classes of melanoma and to develop prognostic classification models, respectively. Unsupervised clustering identified six biologically distinct primary melanoma classes (LMC classes). Unlike previous molecular classes of melanoma, the LMC classes were prognostic both in the whole LMC dataset and in stage I tumours. The prognostic value of the LMC classes was replicated in an independent dataset, but insufficient data were available to replicate it in an AJCC stage I subset. Supervised classification using the Random Forest (RF) approach performed better when adjustments were made to deal with class imbalance, whereas such adjustments did not improve the performance of the Support Vector Machine (SVM). However, RF and SVM had similar results overall, with RF only marginally better. Combining clinical and transcriptomic information in the RF further improved the performance of the prediction model in comparison to using clinical information alone. Finally, the agnostically derived LMC classes and the supervised RF model showed convergence in their association with outcome in some groups of patients, but not in others.
In conclusion, this study reports six molecular classes of primary melanoma with prognostic value in stage I disease and overall, and a prognostic classification model that predicts outcome in primary melanoma

    MACHINE LEARNING AND BIOINFORMATIC INSIGHTS INTO KEY ENZYMES FOR A BIO-BASED CIRCULAR ECONOMY

    The world is presently faced with a sustainability crisis; it is becoming increasingly difficult to meet the energy and material needs of a growing global population without depleting and polluting our planet. Greenhouse gases released from the continuous combustion of fossil fuels accelerate climate change, and plastic waste accumulates in the environment. There is a need for a circular economy, where energy and materials are renewably derived from waste items rather than by consuming limited resources. Deconstruction of the recalcitrant linkages in natural and synthetic polymers is crucial for a circular economy, as deconstructed monomers can be used to manufacture new products. In nature, organisms utilize enzymes for the efficient depolymerization and conversion of macromolecules. Consequently, by employing enzymes industrially, biotechnology holds great promise for energy- and cost-efficient conversion of materials for a circular economy. However, there is a need for enhanced molecular-level understanding of enzymes to enable economically viable technologies that can be applied on a global scale. This work is a computational study of key enzymes that catalyze important reactions that can be utilized for a bio-based circular economy. Specifically, bioinformatics and data-mining approaches were employed to study family 7 glycoside hydrolases (GH7s), which are the principal enzymes in nature for deconstructing cellulose to simple sugars; a cytochrome P450 enzyme (GcoA) that catalyzes the demethylation of lignin subunits; and MHETase, a tannase-family enzyme utilized by the bacterium Ideonella sakaiensis in the degradation and assimilation of polyethylene terephthalate (PET).
Since enzyme function is fundamentally dependent on the primary amino-acid sequence, we hypothesize that machine-learning algorithms can be trained on an ensemble of functionally related enzymes to reveal functional patterns in the enzyme family, and to map the primary sequence to enzyme function such that functional properties can be predicted for a new enzyme sequence with significant accuracy. We find that supervised machine learning identifies important residues for processivity and accurately predicts functional subtypes and domain architectures in GH7s. Bioinformatic analyses revealed conserved active-site residues in GcoA and informed protein engineering that enabled expanded enzyme specificity and improved activity. Similarly, bioinformatic studies and phylogenetic analysis provided evolutionary context and identified crucial residues for MHET-hydrolase activity in a tannase-family enzyme (MHETase). Lastly, we developed machine-learning models to predict enzyme thermostability, allowing for high-throughput screening of enzymes that can catalyze reactions at elevated temperatures. Altogether, this work provides a solid basis for a computational data-driven approach to understanding, identifying, and engineering enzymes for biotechnological applications towards a more sustainable world

    Systematic approaches to mine, predict and visualize biological functions

    With advances in high-throughput technologies and next-generation sequencing, the amount of genomic and proteomic data is increasing dramatically in the post-genomic era. One of the biggest challenges that has arisen is connecting sequences to their activities and understanding their cellular functions and interactions. In this dissertation, I present three different strategies for mining, predicting and visualizing biological functions. In the first part, I present the COMputational Bridges to Experiments (COMBREX) project, which facilitates the functional annotation of microbial proteins by leveraging the power of the scientific community. The goal is to bring computational biologists and biochemists together to expand our knowledge. A database-driven web portal has been built to serve as a hub for the community. Predicted annotations will be deposited into the database, and the recommendation system will guide biologists to the predictions whose experimental validation would be most beneficial to our knowledge of microbial proteins. In addition, by taking advantage of the rich content, we develop a web service to help community members enrich their genome annotations. In the second part, I focus on identifying the genes for enzyme activities that lack genetic details in the major biological databases. Protein sequences are unknown for about one-third of the characterized enzyme activities listed in the EC system, the so-called orphan enzymes. Our approach considers the similarities between enzyme activities, enabling us to deal with broad types of orphan enzymes in eukaryotes. I apply our framework to human orphan enzymes and show that we can successfully fill the knowledge gaps in the human metabolic network. In the last part, I construct a platform for visually analyzing the eco-system level metabolic network. Most microbes live in a multiple-species environment.
The underlying nutrient exchange can be seen as a dynamic eco-system level metabolic network. The complexity of the network poses new visualization challenges. Using the data predicted by Computation Of Microbial Ecosystems in Time and Space (COMETS), I demonstrate that our platform is a powerful tool for investigating the interactions of the microbial community. We apply it to the exploration of a simulated microbial eco-system in the human gut. The result reflects both existing knowledge and novel mutualistic interactions, such as the nutrient exchanges between E. coli, C. difficile and L. acidophilus

    Statistical methods for clinical genome interpretation with specific application to inherited cardiac conditions

    Background: While next-generation sequencing has enabled us to rapidly identify sequence variants, clinical application is limited by our ability to determine which rare variants impact disease risk. Aim: To develop computational methods to identify clinically important variants. Methods and Results: (1) I built a disease-specific variant classifier for inherited cardiac conditions (ICCs), which outperforms genome-wide tools across a wide range of benchmarks. It discriminates pathogenic variants from benign variants with global accuracy improved by 4-24% over existing tools. Variants classified with >90% confidence are significantly associated with both disease status and clinical outcomes. (2) To better interpret missense variants, I examined evolutionarily equivalent residues across protein domain families to identify positions intolerant of variation. Homologous residue constraint is a strong predictor of variant pathogenicity. It can identify a subset of de novo missense variants with an impact on developmental disorders comparable to that of protein-truncating variants. Independent of existing approaches, it can also improve the prioritisation of disease-relevant genes for both developmental disorders and inherited hypertrophic cardiomyopathy. (3) TTN-truncating variants are known to cause dilated cardiomyopathy (DCM), but the effect of missense variants is poorly understood. Using the approach in (2), I studied the role of TTN missense variants in DCM. Our prioritised residues are enriched with known pathogenic variants, including the two known to cause DCM and others involved in skeletal myopathies. I also found a significant association between constrained variants of TTN I-set domains and DCM in a case-control burden test of Caucasian samples (OR=3.2, 95%CI=1.3-9.4). Within subsets of DCM, the association is replicated in alcoholic cardiomyopathy. (4) Finally, I also developed a tool to annotate 5’UTR variants creating or disrupting upstream open reading frames (uORF).
Its utility is demonstrated by detecting high-impact uORF-disturbing variants in ClinVar, gnomAD and Genomics England data. Conclusion: These studies established broadly applicable methods and improved understanding of ICCs.
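
The uORF annotation in (4) can be illustrated with a minimal sketch that checks whether a single-nucleotide variant in a 5'UTR creates or removes an upstream ATG. It is not the tool described above: it ignores stop codons, reading frame and Kozak context, and the sequence, positions and function names are hypothetical.

```python
def start_codon_positions(utr):
    """Return the set of 0-based start indices of every ATG in the 5'UTR."""
    seq = utr.upper()
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def classify_snv(utr, pos, alt):
    """Report upstream ATGs gained or lost by substituting `alt` at `pos`."""
    ref_seq = utr.upper()
    alt_seq = ref_seq[:pos] + alt.upper() + ref_seq[pos + 1:]
    before = start_codon_positions(ref_seq)
    after = start_codon_positions(alt_seq)
    return {"gained": sorted(after - before), "lost": sorted(before - after)}


# Hypothetical 5'UTR containing a single upstream ATG at index 8.
utr = "GCCACGGCATGCCTAAGG"

print(classify_snv(utr, 4, "T"))  # C>T at index 4 turns ACG into ATG
print(classify_snv(utr, 9, "C"))  # T>C at index 9 destroys the existing ATG
print(classify_snv(utr, 0, "A"))  # no start codon affected
```

A fuller annotation would also need to track in-frame stop codons and whether a new uORF overlaps the main coding sequence.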