19 research outputs found

    Improving the efficiency of Bayesian Network Based EDAs and their application in Bioinformatics

    Estimation of distribution algorithms (EDAs) are a relatively new class of stochastic optimizers that have received a lot of attention during the last decade. In each generation, EDAs build probabilistic models of promising solutions of an optimization problem to guide the search process. New sets of solutions are obtained by sampling the corresponding probability distributions. Using this approach, EDAs are able to provide the user with a set of models that reveal the dependencies between the variables of the optimization problems while solving them. In order to solve a complex problem, it is necessary to use a probabilistic model that is able to capture these dependencies. Bayesian networks are usually used for modeling multiple dependencies between variables. Learning Bayesian networks, especially for large problems with a high degree of dependency among their variables, is computationally very expensive, which makes it the bottleneck of EDAs. Introducing efficient Bayesian network learning algorithms into EDAs therefore seems necessary in order to use them for large problems. In this dissertation, after comparing several Bayesian network learning algorithms, we propose an algorithm, called CMSS-BOA, which uses a recently introduced heuristic called max-min parents and children (MMPC) to constrain the model search space. This algorithm does not impose a small, fixed upper bound on the order of interaction between variables and is able to solve problems with large numbers of variables efficiently. We compare the efficiency of CMSS-BOA with the standard Bayesian network based EDA on several benchmark problems, and finally we use it to build a predictor of glycation sites in mammalian proteins.
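
The model-then-sample loop described in this abstract can be illustrated with a minimal sketch. It is not CMSS-BOA and does not learn a Bayesian network: it assumes the simplest possible model, one independent probability per bit (a univariate EDA, often called UMDA), and a toy OneMax objective. The function name `umda_onemax` and all parameter values are illustrative assumptions.

```python
import random


def umda_onemax(n_bits=20, pop_size=100, n_select=50, generations=50, seed=0):
    """Univariate EDA maximizing the number of ones in a bit string (OneMax)."""
    rng = random.Random(seed)
    # Model: one independent Bernoulli probability per variable (assumption:
    # no dependencies are modelled, unlike a Bayesian network based EDA).
    probs = [0.5] * n_bits
    best = None
    for _ in range(generations):
        # 1. Sample a new population from the current probabilistic model.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        # 2. Select the promising solutions (truncation selection by fitness).
        population.sort(key=sum, reverse=True)
        selected = population[:n_select]
        if best is None or sum(selected[0]) > sum(best):
            best = selected[0]
        # 3. Re-estimate the model from the selected solutions, keeping each
        #    probability away from 0 and 1 so that sampling never freezes.
        for i in range(n_bits):
            freq = sum(ind[i] for ind in selected) / n_select
            probs[i] = min(max(freq, 1 / n_bits), 1 - 1 / n_bits)
    return best, probs


best, probs = umda_onemax()
print(sum(best), "ones out of", len(best))
```

A Bayesian network based EDA replaces step 3 with structure and parameter learning, which is the expensive step the abstract identifies as the bottleneck.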

    Biclustering of Gene Expression Data by Correlation-Based Scatter Search

    BACKGROUND: The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes that are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as merit function, but patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. It is nevertheless important to discover this type of pattern, since genes commonly behave similarly even though their expression levels vary in different ranges or magnitudes. METHODS: Scatter Search is an evolutionary technique based on the evolution of a small set of solutions that are chosen according to quality and diversity criteria. This paper presents a Scatter Search that aims to find biclusters in gene expression data. In this algorithm the proposed fitness function is based on the linear correlation among genes, in order to detect shifting and scaling patterns, and an improvement method is included in order to select only positively correlated genes. RESULTS: The proposed algorithm has been tested on three real data sets, the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset and the Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
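
The contrast drawn here between the Mean Squared Residue and a correlation-based measure can be checked on toy data. The sketch below is an illustration only, not the paper's fitness function: it assumes the standard Cheng and Church definition of the residue and uses the mean pairwise Pearson correlation between genes (rows) as a stand-in for a correlation-based criterion. All names and the toy expression values are assumptions.

```python
from itertools import combinations
from math import sqrt


def mean_squared_residue(rows):
    """Mean Squared Residue of a bicluster given as a list of gene rows."""
    n, m = len(rows), len(rows[0])
    row_means = [sum(r) / m for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    total_mean = sum(row_means) / n
    return sum(
        (rows[i][j] - row_means[i] - col_means[j] + total_mean) ** 2
        for i in range(n)
        for j in range(m)
    ) / (n * m)


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(rows):
    """Average Pearson correlation over all pairs of genes in the bicluster."""
    pairs = list(combinations(rows, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


base = [1.0, 3.0, 2.0, 5.0]                       # hypothetical expression profile
shifted = [[v + s for v in base] for s in (0.0, 2.0, 7.0)]   # shifting pattern
scaled = [[v * k for v in base] for k in (1.0, 2.0, 4.0)]    # scaling pattern

print(mean_squared_residue(shifted), mean_pairwise_correlation(shifted))
print(mean_squared_residue(scaled), mean_pairwise_correlation(scaled))
```

On the shifted rows the residue is zero, but on the scaled rows it is clearly positive even though the genes are perfectly correlated, which is the limitation the abstract points to.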

    Identifying gene regulatory networks common to multiple plant stress responses

    Stress responses in plants can be defined as a change that affects the homeostasis of pathways, resulting in a phenotype that may or may not be visible to the human eye and that affects the fitness of the plant. Crosstalk is understood as the sharing of components between pathways or networks, and is widespread in plants, as shown by examples of crosstalk between transcriptional regulation pathways and hormone signalling. Crosstalk between stress responses is believed to exist, particularly within the responses to biotic stress and within the responses to abiotic stress. Certain hormone pathways are known to be involved in the crosstalk between the responses to both biotic and abiotic stresses, and can confer immunity or tolerance of Arabidopsis thaliana to these stresses. Transcriptional regulation has also been identified as an important factor in controlling tolerance and resistance to stresses. In this thesis, networks of regulation mediating the response to multiple stresses are studied. Firstly, co-regulation was predicted for genes differentially expressed in two or more stresses through the development of a novel multi-clustering approach, Wigwams Identifies Genes Working Across Multiple Stresses (Wigwams). This approach finds groups of genes whose expression is correlated within stresses, and also identifies a strong statistical link between subsets of stresses. Wigwams identifies the known co-expression of genes encoding enzymes of metabolic and flavonoid biosynthesis pathways, and predicts novel clusters of co-expressed genes. Under the hypothesis that co-expression may imply co-regulation, promoter motif analysis and modelling provide information on potential upstream regulators. The context-free regulation of groups of co-expressed genes, or potential regulons, was explored using modelling techniques, in order to generate a quantitative model of transcriptional regulation during the response to B. cinerea, P. syringae pv. tomato DC3000 and senescence. This model was subsequently validated and extended experimentally, using Yeast 1-Hybrid to investigate the protein-DNA interactions, and also microarrays. Gene expression analysis of mutants and of plants overexpressing a predicted regulator, Rap2.6L, identified a number of potential regulon members as downstream targets. Rap2.6L was identified as an indirect regulator of the transcription factor members of three potential regulons co-expressed in the stresses B. cinerea, P. syringae pv. tomato DC3000 and long day senescence, allowing the confirmation of a predicted gene regulatory network operating in multiple stress responses.

    Multi-Class Clustering of Cancer Subtypes through SVM Based Ensemble of Pareto-Optimal Solutions for Gene Marker Identification

    With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized as samples-by-genes matrices, are being used to classify tissue samples as benign or malignant, or into their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in the successful diagnosis of particular cancer types. In this article, we present an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. A real-coded encoding of the cluster centers is used, and cluster compactness and separation are optimized simultaneously. The resulting near-Pareto-optimal set contains a number of non-dominated solutions. A novel approach is proposed that combines the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier. The final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed method and are demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are promising and may have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
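
The notion of non-dominated solutions used in this abstract can be made concrete with a small sketch. It assumes each candidate clustering is summarized by two objective values, both written so that smaller is better: a compactness score, and the negated separation (so that maximizing separation becomes minimization). The function names and the numeric scores are hypothetical and are not taken from the article.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in
    at least one (all objectives are to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Return the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (compactness, -separation) scores of four candidate clusterings.
candidates = [(1.0, -2.0), (2.0, -3.0), (2.5, -2.5), (3.0, -1.0)]
front = non_dominated(candidates)
print(front)  # → [(1.0, -2.0), (2.0, -3.0)]
```

Neither point on the front is better than the other in both objectives, which is why a further step, such as the SVM-based consensus the abstract describes, is needed to arrive at a single final clustering.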

    Biclustering of gene expression data based on scatter search (Biclustering sobre datos de expresión génica basado en búsqueda dispersa)

    Gene expression data, with their particular nature and importance, motivate not only the development of new techniques but also the formulation of new problems, such as the biclustering problem. Biclustering is an unsupervised learning technique that groups both genes and conditions. This double grouping distinguishes it from traditional clustering on this type of data, which groups either genes or conditions only. This thesis presents a new biclustering algorithm that allows different search criteria to be studied. The algorithm uses a scatter search scheme, which decouples the search mechanism from the criterion employed. Three different search criteria have been studied, and they motivate the three main contributions of the thesis. First, the linear correlation between genes is studied and integrated as part of the objective function used by the biclustering algorithm. Linear correlation makes it possible to find biclusters with shifting and scaling patterns, which improves on previous proposals. Second, motivated by the biological meaning of activation-inhibition patterns between genes, the linear correlation is modified so that these patterns are taken into account. Finally, the information about genes available in public repositories, such as the Gene Ontology (GO), has been taken into account and is incorporated as part of the search criterion. An extra term is added that reflects, for each bicluster evaluated, the quality of that group of genes according to the information stored about them in GO. Two options for this biological-information term are studied and compared with each other, and the results are shown to be better when biological information is used in the biclustering algorithm. The three contributions described, together with a series of intermediate steps, have led to results published in journals and in national and international conferences.

    Preventing premature convergence and proving the optimality in evolutionary algorithms

    http://ea2013.inria.fr//proceedings.pdf
    Evolutionary Algorithms (EAs) usually explore the search space efficiently, but they often get trapped in local minima and do not prove the optimality of the solution. Interval-based techniques, on the other hand, yield a numerical proof of optimality of the solution. However, they may fail to converge within a reasonable time because of their exponential complexity and their inability to quickly compute a good approximation of the global minimum. The contribution of this paper is a hybrid algorithm called Charibde, in which a particular EA, Differential Evolution, cooperates with a Branch and Bound algorithm endowed with interval propagation techniques. It prevents premature convergence toward local optima and outperforms both deterministic and stochastic existing approaches. We demonstrate its efficiency on a benchmark of highly multimodal problems, for which we provide previously unknown global minima and certification of optimality.
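
For readers unfamiliar with the evolutionary half of such a hybrid, the sketch below shows a plain Differential Evolution loop (the common DE/rand/1/bin variant) on the multimodal Rastrigin function. It is not Charibde: there is no interval Branch and Bound and no proof of optimality, and the test function, parameter values and names are assumptions chosen for illustration.

```python
import math
import random


def rastrigin(x):
    """Highly multimodal test function with global minimum 0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)


def differential_evolution(f, bounds, pop_size=30, generations=200,
                           F=0.7, CR=0.9, seed=0):
    """Minimize f over the box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo), hi)  # clip back into the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            # Greedy replacement: keep the trial only if it is not worse.
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


bounds = [(-5.12, 5.12)] * 2
x_best, f_best = differential_evolution(rastrigin, bounds)
print(x_best, f_best)
```

A run like this returns a good point but no guarantee that it is the global minimum, which is the gap the interval-based component in the abstract is meant to close.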

    Contributions on evolutionary computation for statistical inference

    Evolutionary Computation (EC) techniques were introduced in the 1960s for dealing with complex situations. One example is an optimization problem that has no analytical solution or is computationally intractable; in many cases such methods, named Evolutionary Algorithms (EAs), have been successfully implemented. In statistics there are many situations where complex problems arise, in particular concerning optimization. A general example is when the statistician needs to select, from a prohibitively large discrete set, just one element, which could be a model, a partition, an experiment, or similar: this is the case in model selection, cluster analysis or design of experiments. In other situations there may be an intractable function of the data, such as a likelihood, which needs to be maximized, as happens in model parameter estimation. These kinds of problems are naturally well suited to EAs, and in the last 20 years a large number of papers have been concerned with applications of EAs to statistical issues. The present dissertation belongs to this part of the literature, as it reports several implementations of EAs in statistics, although it focuses mainly on statistical inference problems. Original results are proposed, as well as overviews and surveys on several topics. EAs are employed and analyzed from various statistical points of view, showing and confirming their efficiency and flexibility. The first proposal is devoted to parametric estimation problems. When EAs are employed in such analyses, a novel form of variability related to their stochastic elements is introduced. We analyze both the variability due to sampling, associated with the selected estimator, and the variability due to the EA. This analysis is set in the framework of the statistical and computational tradeoff, which is crucial in present-day problems, by introducing cost functions related to both data acquisition and EA iterations. The proposed method is illustrated by means of model building examples. The subsequent chapter is concerned with EAs employed in Markov Chain Monte Carlo (MCMC) sampling. When sampling from multimodal or highly correlated distributions, one possible strategy is to run several chains in parallel in order to improve their mixing. If these chains are allowed to interact with each other, then many analogies with EC techniques can be observed, and this has led to research in many fields. The chapter reviews various methods found in the literature that combine EC techniques and MCMC sampling, in order to identify specific and common procedures and to unify them in an EC framework. In the last proposal we present a complex time series model and an identification procedure based on Genetic Algorithms (GAs). The model is capable of dealing with seasonality, through Periodic AutoRegressive (PAR) modelling, and with structural changes in time, leading to a nonstationary structure. Given the very large number of parameters and possible change points, GAs are appropriate for identifying such a model. The effectiveness of the procedure is shown on both simulated data and real examples, the latter consisting of river flow data in hydrology. The thesis concludes with some final remarks, including directions for future work.
