19 research outputs found

    Identifying Cancer Subtypes Using Unsupervised Deep Learning

    Glioblastoma multiforme (GBM) is the most lethal malignant brain tumor, with a very poor prognosis and a median survival of around one year. Numerous studies have reported tumor subtypes that reflect different characteristics of individual patients, which may play important roles in determining survival rates in GBM. In this study, we present a pathway-based clustering method using a Restricted Boltzmann Machine (RBM), called R-PathCluster, for identifying unknown subtypes with pathway markers of gene expression. To assess the performance of R-PathCluster, we conducted experiments with several clustering methods, such as k-means, hierarchical clustering, and RBM models with different input data. R-PathCluster showed the best performance in separating long-term and short-term survivors, although its clustering score was not the highest among the methods tested. R-PathCluster also makes the model interpretable in a biological sense, since it takes as input pathway markers that represent the biological processes of pathways. We discuss how our findings from R-PathCluster are supported by the biological literature. Keywords: Glioblastoma multiforme, tumor subtypes, clustering, Restricted Boltzmann Machine

    Preventing premature convergence and proving the optimality in evolutionary algorithms

    http://ea2013.inria.fr//proceedings.pdf International audience. Evolutionary Algorithms (EA) usually explore the search space efficiently, but often get trapped in local minima and do not prove the optimality of the solution. Interval-based techniques, on the other hand, yield a numerical proof of optimality of the solution. However, they may fail to converge within a reasonable time because of their exponential complexity and their inability to quickly compute a good approximation of the global minimum. The contribution of this paper is a hybrid algorithm called Charibde, in which a particular EA, Differential Evolution, cooperates with a Branch and Bound algorithm endowed with interval propagation techniques. It prevents premature convergence toward local optima and outperforms existing deterministic and stochastic approaches. We demonstrate its efficiency on a benchmark of highly multimodal problems, for which we provide previously unknown global minima and certification of optimality
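
The abstract above names Differential Evolution as the stochastic half of the hybrid. As a rough illustration of that component only, here is a minimal sketch of the common DE/rand/1/bin scheme. It is not Charibde and contains no Branch and Bound or interval propagation; the function name, parameter defaults and the `sphere` test function are assumptions chosen for the example.

```python
import random


def differential_evolution(f, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=200, seed=0):
    """Minimise f over a box with DE/rand/1/bin (illustrative sketch).

    bounds is a list of (low, high) pairs, one per dimension.
    Returns (best_point, best_value, history), where history holds the
    best value of the initial population and of every later generation.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    history = [min(fit)]

    for _ in range(generations):
        next_pop, next_fit = list(pop), list(fit)
        for i in range(pop_size):
            # Mutation: pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            # Binomial crossover: at least one coordinate comes from the mutant.
            j_rand = rng.randrange(dim)
            trial = []
            for j in range(dim):
                if j == j_rand or rng.random() < CR:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip back into the box
                else:
                    v = pop[i][j]
                trial.append(v)
            # Greedy selection: the trial replaces its parent only if not worse.
            f_trial = f(trial)
            if f_trial <= fit[i]:
                next_pop[i], next_fit[i] = trial, f_trial
        pop, fit = next_pop, next_fit
        history.append(min(fit))

    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best], history


def sphere(x):
    """Simple convex test function with global minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f, history = differential_evolution(sphere, [(-5.0, 5.0)] * 2)
```

Because selection is greedy, the best value never gets worse from one generation to the next. That gives no guarantee the global minimum has been reached, which is the gap the abstract says the interval-based side is meant to close.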

    Analysis of microarray and next generation sequencing data for classification and biomarker discovery in relation to complex diseases

    PhD. This thesis presents an investigation into gene expression profiling, using microarray and next generation sequencing (NGS) datasets, in relation to multi-category diseases such as cancer. It has been established that if the sequence of a gene is mutated, it can result in the unscheduled production of protein, leading to cancer. However, identifying the molecular signature of different cancers amongst thousands of genes is complex. This thesis investigates tools that can aid the study of gene expression to infer useful information towards personalised medicine. For microarray data analysis, this study proposes two new techniques to increase the accuracy of cancer classification. In the first method, a novel optimisation algorithm, COA-GA, was developed by synchronising the Cuckoo Optimisation Algorithm and the Genetic Algorithm for data clustering in a shuffle setup, to choose the most informative genes for classification purposes. Support Vector Machine (SVM) and Multilayer Perceptron (MLP) artificial neural networks are utilised for the classification step. Results suggest this method can significantly increase classification accuracy compared to other methods. An additional method involving a two-stage gene selection process was developed. In this method, a subset of the most informative genes is first selected by the Minimum Redundancy Maximum Relevance (MRMR) method. In the second stage, optimisation algorithms are used in a wrapper setup with SVM to minimise the number of selected genes whilst maximising the accuracy of classification. A comparative performance assessment suggests that the proposed algorithm significantly outperforms other methods at selecting fewer genes that are highly relevant to the cancer type, while maintaining a high classification accuracy. 
In the case of NGS, a state-of-the-art pipeline for the analysis of RNA-Seq data is investigated to discover differentially expressed genes and differential exon usage between normal and AIP positive Drosophila datasets, which were produced in house at Queen Mary, University of London. The functional genomics of the differentially expressed genes was examined and found to be relevant to the case study under investigation. Finally, after normalising the RNA-Seq data, machine learning approaches similar to those used for microarray data were successfully implemented for these datasets
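
The first stage of the two-stage process described above is MRMR gene selection. The following is a minimal sketch of the usual greedy form of that idea, assuming discretised expression values, mutual information estimated from raw counts, and the "relevance minus mean redundancy" criterion. It is not the thesis's implementation, and the gene names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


def mrmr_select(features, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy."""
    relevance = [mutual_information(f, labels) for f in features]
    remaining = list(range(len(features)))
    selected = []

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(features[j], features[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: two independent binary "genes" define four classes,
# and a third feature is an exact copy of the first (fully redundant).
gene_a = [0, 0, 1, 1, 0, 0, 1, 1]
gene_a_copy = list(gene_a)
gene_b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * a + b for a, b in zip(gene_a, gene_b)]

selected = mrmr_select([gene_a, gene_a_copy, gene_b], labels, k=2)
```

On this toy data all three features are equally relevant to the labels, so a purely relevance-based filter could keep both copies of `gene_a`. The redundancy term is what steers the second pick to `gene_b`.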

    Evolutionary Computation, Optimization and Learning Algorithms for Data Science

    A large number of engineering, science and computational problems have yet to be solved in a computationally efficient way. One of the emerging challenges is how evolving technologies grow towards autonomy and intelligent decision making. This leads to the collection of large amounts of data from various sensing and measurement technologies, e.g., cameras, smart phones, health sensors, smart electricity meters, and environment sensors. Hence, it is imperative to develop efficient algorithms for the generation, analysis, classification, and illustration of data. Meanwhile, data is structured purposefully through different representations, such as large-scale networks and graphs. We focus on data science as a crucial area, and specifically on the curse of dimensionality (CoD), which arises from the large amount of generated, sensed and collected data. This motivates researchers to think about optimization and to apply nature-inspired algorithms, such as evolutionary algorithms (EAs), to solve optimization problems. Although these algorithms are non-deterministic, they are robust enough to reach an optimal solution. Researchers tend to adopt evolutionary algorithms only when they face a problem that gets stuck in a local optimum rather than reaching the global optimum. In this chapter, we first develop a clear and formal definition of the CoD problem; next, we focus on feature extraction techniques and categories; then we provide a general overview of meta-heuristic algorithms, their terminology, and the desirable properties of evolutionary algorithms

    Exploring the functional interactions between geminivirus and host

    Geminiviruses are plant viruses with circular single-stranded DNA genomes that infect numerous species of agronomic interest worldwide, causing heavy losses that can reach 100% of the harvest. The genomes of these viruses are highly reduced and encode only 6 or 8 proteins, depending on the species. This genomic reduction makes the virus dependent on cellular factors for the development of the infection and the completion of its life cycle, including the replication and trafficking phases within the cell or throughout the plant. Since geminiviruses require proteins from the host plant, identifying those proteins would be an important step towards understanding the infection process, which could ultimately make an important contribution to the fight against the disease. The aim of this work is to explore the functional interactions between geminiviruses and their hosts. The first chapter studies the importance of the geminivirus C2 protein in jasmonate (JA) signalling and the plant defence response. Previous work in our group showed that expression of C2 proteins can interfere with the ubiquitination pathway in Arabidopsis plants. To study the effect of C2 expression in Arabidopsis in more detail, a microarray analysis was performed. Transcriptomic analysis of transgenic plants expressing C2 from Tomato yellow leaf curl Sardinia virus (TYLCSV) revealed that C2 alters multiple cellular processes. Among the most prominent are the repression of the JA response and of secondary metabolism. 
In addition, transcriptomic analysis of transgenic Arabidopsis plants expressing C2 and treated with exogenous JA indicated that the repression caused by C2 acts through the specific response of JA-induced genes; it is therefore doubtful that this inhibition occurs through inhibition of the E3 ligase SCFCOI1. Furthermore, we observed that the C2 protein interacts, in yeast and in planta, with JAZ8, a repressor protein of the JA response. We have therefore proposed that the geminivirus C2 protein could be interfering with the JA response at several levels. The second chapter aimed to identify host genes involved in geminivirus infection using a reverse genetics approach. To this end, we used transgenic Nicotiana benthamiana plants named 2IRGFP. These plants overexpress GFP in a manner dependent on the activity of the TYLCSV viral protein Rep, which triggers the formation of mGFP replicons. GFP accumulation acts as a marker of TYLCSV replication, allowing this process to be detected and monitored in a fast, simple, semi-quantitative and real-time manner. Moreover, in combination with a silencing technique such as virus-induced gene silencing (VIGS), 2IRGFP plants are a powerful tool for reverse genetics studies aimed at identifying plant genes required for viral infection. Following this concept, 37 candidate genes were silenced in 2IRGFP plants, which were infected with TYLCSV in parallel. 
According to the effect of their silencing on TYLCSV infection, measured as the time of appearance and the intensity of GFP expression, the host genes were grouped into three classes: those whose silencing caused no change in expression (group A), those whose silencing brought GFP expression forward (group B), and those whose silencing delayed or abolished it (group C). In total, we identified 18 genes involved in various cellular processes whose silencing alters TYLCSV infection. Notably, 15 of these genes are described for the first time as factors involved in viral infections. Our results therefore provide new insights into the possible molecular mechanisms underlying geminivirus infections, and at the same time reveal the 2IRGFP/VIGS system as a powerful tool for functional reverse genetics studies. As a third goal, we set out to analyse the role of retrograde vesicle trafficking during geminivirus infection. This chapter was prompted by the surprising finding from the second chapter of this work, where we observed that silencing the gene encoding the delta subunit of the COPI coatomer complex (δ-COP) completely prevents TYLCSV infection in N. benthamiana plants. To obtain more information on the role of retrograde trafficking in geminivirus infection, we decided to silence another gene involved in this pathway: ADP-ribosylation factor 1 (ARF1), the specific GTPase that drives the formation of COPI vesicles. We observed that the geminiviruses TYLCSV and Beet curly top virus (BCTV) do not infect N. benthamiana plants in which the δ-COP or ARF1 genes were silenced. We could also conclude that this effect is specific to geminiviruses, because no effect was observed upon infection with other pathogens, such as RNA viruses and Pseudomonas syringae pv tomato. 
In summary, geminiviruses require an active retrograde transport system. Gene silencing of two of the main components of this pathway negatively affects infection by TYLCSV and BCTV, but does not alter the interaction with other plant pathogens

    Computational Design and Experimental Validation of Functional Ribonucleic Acid Nanostructures

    In living cells, two major classes of ribonucleic acid (RNA) molecules can be found. The first class, messenger RNA (mRNA), carries the genetic information that the ribosome reads and translates into proteins. The second class, non-coding RNA (ncRNA), does not code for proteins and is involved in key cellular processes, such as gene expression regulation, splicing, differentiation, and development. NcRNAs fold into an ensemble of thermodynamically stable secondary structures, which eventually lead the molecule to fold into a specific 3D structure. It is widely known that ncRNAs carry out their functions via their 3D structures as well as their molecular composition. The secondary structure of ncRNAs is composed of different types of structural elements (motifs), such as stacking base pairs, internal loops, hairpin loops and pseudoknots. Pseudoknots are particularly difficult to model, are abundant in nature and are known to stabilize the functional form of the molecule. Due to the diverse range of functions of ncRNAs, their computational design and analysis have numerous applications in nano-technology, therapeutics, synthetic biology, and materials engineering. The RNA design problem is to find novel RNA sequences that are predicted to fold into target structure(s) while satisfying specific qualitative characteristics and constraints. RNA design can be modeled as a combinatorial optimization problem (COP) and is known to be computationally challenging or, more precisely, NP-hard. Numerous algorithms for the RNA design problem have been developed over the past two decades; however, most ignore pseudoknots and are therefore applicable to only a slice of real-world modeling and design problems. Moreover, the few existing pseudoknot design methods, which were developed only recently, do not provide any evidence of the applicability of their proposed design methodology in biological contexts. 
The two objectives of this thesis are set to address these two shortcomings. First, we are interested in developing an efficient computational method for the design of RNA secondary structures including pseudoknots that shows significantly better in-silico quality characteristics than the state of the art. Second, we are interested in showing the real-world worthiness of the proposed method by validating it experimentally. More precisely, our aim is to design instances of certain types of RNA enzymes (i.e. ribozymes) and demonstrate that they are functionally active. This would likely only happen if their predicted folding matched their actual folding in the in-vitro experiments. In this thesis, we present four contributions. First, we propose a novel adaptive defect weighted sampling algorithm to efficiently solve the RNA secondary structure design problem where pseudoknots are included. We compare the performance of our design algorithm with the state of the art and show that our method generates molecules that are thermodynamically more stable and less defective than those generated by state-of-the-art methods. Moreover, we show that when the effect of fitness evaluation is decoupled from the search and optimization process, our optimization method converges faster than the non-dominated sorting genetic algorithm (NSGA II) and the ant colony optimization (ACO) algorithm do. Second, we use our algorithmic development to implement an RNA design pipeline called Enzymer and make it available as an open source package useful for wet lab practitioners and RNA bioinformaticians. Enzymer uses multiple sequence alignment (MSA) data to generate initial design templates for further optimization. Our design pipeline can then be used to re-engineer naturally occurring RNA enzymes such as ribozymes and riboswitches. Our first and second contributions are published in the RNA section of the journal Frontiers in Genetics. 
Third, we use Enzymer to reengineer three different species of pseudoknotted ribozymes: a hammerhead ribozyme from the mouse gut metagenome, a hammerhead ribozyme from Yarrowia lipolytica and a glmS ribozyme from Thermoanaerobacter tengcongensis. We designed a total of 18 ribozyme sequences and showed that 16 of them were active in-vitro. Our experimental results have been submitted to the RNA journal and strongly suggest that Enzymer is a reliable tool to design pseudoknotted ncRNAs with a desired secondary structure. Finally, we propose a novel architecture for a new ribozyme-based gene regulatory network in which a hammerhead ribozyme modulates expression of a reporter gene when the external stimulus IPTG is present. Our in-vivo experiments show the expected results in 7 out of 12 cases
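
The design problem described in this abstract asks for sequences that can fold into a target secondary structure, including pseudoknots. As a small illustration of the most basic constraint such a designer has to respect, the sketch below parses a target structure and checks whether a candidate sequence can form every required base pair. It assumes the common extended dot-bracket notation, in which a second bracket type such as `[` `]` marks the crossing pairs of a pseudoknot, and it allows Watson-Crick and G-U wobble pairs only. It is not part of Enzymer, predicts no folding and computes no energies; all names are hypothetical.

```python
# Pairs treated as allowed here: Watson-Crick plus G-U wobble (an assumption).
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}

# Closing bracket -> matching opening bracket, one entry per bracket family.
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_pairs(structure):
    """Return the sorted list of (i, j) base pairs in a dot-bracket string."""
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = []
    for i, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(i)
        elif ch in BRACKETS:
            stack = stacks[BRACKETS[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if any(stacks.values()):
        raise ValueError("unclosed opening bracket")
    return sorted(pairs)


def has_pseudoknot(pairs):
    """True if any two pairs cross, i.e. i < k < j < l."""
    return any(i < k < j < l for i, j in pairs for k, l in pairs)


def is_compatible(sequence, structure):
    """True if every pair required by the structure is an allowed base pair."""
    if len(sequence) != len(structure):
        return False
    return all(
        (sequence[i], sequence[j]) in ALLOWED_PAIRS
        for i, j in parse_pairs(structure)
    )


target = "((..[[..))]]"          # hypothetical 12-nt pseudoknotted target
candidate = "GCAAGCAAGCGC"       # hypothetical candidate sequence
pairs = parse_pairs(target)
```

Compatibility is only a necessary condition. A compatible sequence may still prefer a different fold, which is why the abstract evaluates designs by thermodynamic stability and defect, and then experimentally.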