215 research outputs found


    Biclustering has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters. In the first part of the presentation, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric [1,2]. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. The performance of OREO is tested on several important data matrices arising in systems biology to validate the ability of the proposed method and compare it to existing biclustering and clustering methods. In the second part of the talk, we will focus on novel methods for clustering of data matrices that are very sparse [3]. These types of data matrices arise in drug discovery where the x- and y-axes of a data matrix can correspond to different functional groups for two distinct substituent sites on a molecular scaffold. Each possible x and y pair corresponds to a single molecule which can be synthesized and tested for a certain property, such as percent inhibition of a protein function. Even for moderate-size matrices, synthesizing and testing a small fraction of the molecules is labor intensive and not economically feasible. Thus, it is of paramount importance to have a reliable method for guiding the synthesis process to select molecules that have a high probability of success. In the second part of the presentation, we introduce a new strategy to enable efficient substituent reordering and descriptor-free property estimation.
Our approach casts substituent reordering as a special high-dimensional rearrangement clustering problem, eliminating the need for functional approximation and enhancing computational efficiency [4, 5]. Deterministic optimization approaches based on mixed-integer linear programming can provide guaranteed convergence to the optimal substituent ordering. The proposed approach is demonstrated on a sparse data matrix (about 29% dense) of inhibition values for 14,043 unknown compounds provided by Pfizer Inc. It is shown that an iterative synthesis strategy is able to uncover a significant percentage of the lead molecules while using only a fraction of the total compound library, even when starting from a mere 3% of the total library space. In the third part of the presentation, we combine the strengths of integer linear optimization and machine learning to predict in vivo toxicities for a library of pesticide chemicals using only in vitro data. Our approach utilizes a biclustering method based on iterative optimal re-ordering [1,2] to identify biclusters corresponding to subsets of chemicals that have similar responses over distinct subsets of the in vitro assays. This enables us to determine subsets of experimental assays that are most likely to be correlated with toxicity, according to the in vivo data set. An optimal method based on integer linear optimization (ILP) for re-ordering sparse data matrices [3] is also applied to the in vivo data set (21.3% sparse) in order to cluster endpoints that have similar lowest effect level (LEL) values, where it is observed that endpoints are grouped according to similar physiological attributes. Based upon the clustering results of the in vitro and in vivo data sets, logistic regression is then utilized to (a) learn the correlation between the subsets of in vitro data and the in vivo responses, and (b) subsequently predict the toxicity signatures of the chemicals.
Our approach aims to find the highest prediction accuracy using the minimum number of in vitro descriptors.
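
The re-ordering idea in the abstract above can be illustrated on a toy matrix. The following is a minimal sketch, not the OREO formulation itself: it assumes a squared-difference dissimilarity between neighbouring rows and searches all row permutations exhaustively, whereas the abstract describes network flow and traveling salesman models solved to global optimality. The function names and the data are hypothetical.

```python
from itertools import permutations


def adjacent_dissimilarity(matrix, order):
    """Sum of squared differences between each pair of neighbouring rows."""
    total = 0.0
    for a, b in zip(order, order[1:]):
        total += sum((x - y) ** 2 for x, y in zip(matrix[a], matrix[b]))
    return total


def best_row_order(matrix):
    """Exhaustive search over row orders (assumption: only viable for tiny matrices)."""
    rows = range(len(matrix))
    return min(permutations(rows),
               key=lambda order: adjacent_dissimilarity(matrix, order))


# Hypothetical data: rows 0 and 2 behave alike, as do rows 1 and 3.
data = [
    [9.0, 8.5, 0.2],
    [0.1, 0.3, 7.9],
    [8.8, 9.1, 0.0],
    [0.0, 0.2, 8.1],
]

order = best_row_order(data)
print("row order:", order)
print("total adjacent dissimilarity:", adjacent_dissimilarity(data, order))
```

In the returned order, similar rows end up next to each other, which is what makes cluster boundaries visible after re-ordering. Exhaustive search grows factorially with the number of rows, which is why a formulation such as the traveling salesman model mentioned in the abstract is needed for realistic matrices.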

    Designing Energy-Efficient Heat Recovery Networks using Mixed-Integer Nonlinear Optimisation

    Many industrial processes involve heating and cooling liquids: a quarter of the EU 2012 energy consumption came from industry, and industry uses 73% of this energy on heating and cooling. We discuss mixed-integer nonlinear optimisation and its applications to energy efficiency. Our particular emphasis is on algorithms and solution techniques enabling optimisation for large-scale industrial networks. As a first application, optimising heat exchanger networks may increase efficiency in industrial plants. We develop deterministic global optimisation algorithms for a mixed-integer nonlinear optimisation model that simultaneously incorporates utility cost, equipment area, and hot/cold stream matches. We automatically recognise and exploit special mathematical structures common in heat recovery. We also computationally demonstrate the impact on the global optimisation solver ANTIGONE and benchmark large-scale test cases against heuristic approaches. As a second application, we discuss special structure in nonconvex quadratically-constrained optimisation problems, particularly through the lens of stream mixing and intermediate blending on process systems engineering networks. We take a parametric approach to uncovering topological structure and sparsity of the standard pooling problem in its p-formulation. We show that the sparse patterns of active topological structure are associated with a piecewise objective function. Finally, the presentation explains the conditions under which sparsity vanishes and where the combinatorial complexity emerges to cross over the P/NP boundary. We formally present the results obtained and their derivations for various specialised instances.
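
Stream mixing and blending give rise to bilinear terms (a flow multiplied by a quality), which is what makes the pooling problem nonconvex. As a hedged illustration only, the sketch below evaluates the standard McCormick envelope of a single bilinear term w = x*y over a box. The abstract does not state which relaxations are used in this work, so this is an assumption about one common technique and not a description of the authors' algorithm; the function name and the bounds are hypothetical.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Return (lower, upper) bounds on x*y implied by the McCormick envelope.

    Valid for x in [x_lo, x_hi] and y in [y_lo, y_hi].
    """
    # Two linear underestimators of x*y; the tighter one is the larger value.
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,
                x_hi * y + x * y_hi - x_hi * y_hi)
    # Two linear overestimators of x*y; the tighter one is the smaller value.
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,
                x_lo * y + x * y_hi - x_lo * y_hi)
    return lower, upper


# Hypothetical flow x in [0, 10] and quality y in [1, 3].
lo, hi = mccormick_bounds(5.0, 2.0, 0.0, 10.0, 1.0, 3.0)
print("envelope:", lo, "<= x*y = 10.0 <=", hi)
```

The envelope is exact at the corners of the box and loosest in its interior, so tightening variable bounds tightens the relaxation.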

    Microarray data mining: A novel optimization-based approach to uncover biologically coherent structures

    Background: DNA microarray technology allows for the measurement of genome-wide expression patterns. Within the resultant mass of data lies the problem of analyzing and presenting information on this genomic scale, and a first step towards the rapid and comprehensive interpretation of this data is gene clustering with respect to the expression patterns. Classifying genes into clusters can lead to interesting biological insights. In this study, we describe an iterative clustering approach to uncover biologically coherent structures from DNA microarray data based on a novel clustering algorithm EP_GOS_Clust. Results: We apply our proposed iterative algorithm to three sets of experimental DNA microarray data from experiments with the yeast Saccharomyces cerevisiae and show that the proposed iterative approach improves biological coherence. Comparison with other clustering techniques suggests that our iterative algorithm provides superior performance with regard to biological coherence. An important consequence of our approach is that an increasing proportion of genes find membership in clusters of high biological coherence and that the average cluster specificity improves. Conclusion: The results from these clustering experiments provide a robust basis for extracting motifs and trans-acting factors that determine particular patterns of expression. In addition, the biological coherence of the clusters is iteratively assessed independently of the clustering. Thus, this method will not be severely impacted by functional annotations that are missing, inaccurate, or sparse.

    A New Robust Optimization Approach for Scheduling Under Uncertainty:

    The problem of scheduling under bounded uncertainty is addressed. We propose a novel robust optimization methodology which, when applied to mixed-integer linear programming (MILP) problems, produces "robust" solutions that are in a sense immune against bounded uncertainty. Uncertainty in the coefficients of the objective function, the left-hand-side parameters, and the right-hand-side parameters of the inequalities is considered. Robust optimization techniques are developed for two types of uncertain data: bounded uncertainty and bounded and symmetric uncertainty. By introducing a small number of auxiliary variables and constraints, a deterministic robust counterpart problem is formulated to determine the optimal solution given the (relative) magnitude of uncertain data, feasibility tolerance, and "reliability level" when a probabilistic measurement is applied. The robust optimization approach is then applied to the scheduling under uncertainty problem. Based on a novel and effective continuous-time short-term scheduling model proposed by Floudas and coworkers [Ind. Eng. Chem. Res. 37 (1998a
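
To make the notion of bounded uncertainty concrete, consider a single inequality sum_j a_j x_j <= b whose true coefficients may deviate from their nominal values by at most a relative amount eps, that is, |a~_j - a_j| <= eps * |a_j|. The worst case of the left-hand side is then the nominal value plus eps * sum_j |a_j| * |x_j|. The sketch below checks a candidate solution against that worst case. It is a minimal illustration under stated assumptions, not the paper's full robust counterpart: the relaxation of the right-hand side by delta * max(1, |b|) is an assumed form of the feasibility tolerance, and the function names and numbers are hypothetical.

```python
def worst_case_lhs(coeffs, x, eps):
    """Largest value of sum(a~_j * x_j) when |a~_j - a_j| <= eps * |a_j|."""
    nominal = sum(a * xi for a, xi in zip(coeffs, x))
    margin = eps * sum(abs(a) * abs(xi) for a, xi in zip(coeffs, x))
    return nominal + margin


def is_robust_feasible(coeffs, x, rhs, eps, delta=0.0):
    """Check the worst case against a right-hand side relaxed by a tolerance.

    Assumption: the tolerance enters as delta * max(1, |rhs|).
    """
    return worst_case_lhs(coeffs, x, eps) <= rhs + delta * max(1.0, abs(rhs))


# Hypothetical constraint 2*x1 + 3*x2 <= 8.5 with 10% coefficient uncertainty.
coeffs, x, rhs = [2.0, 3.0], [1.0, 2.0], 8.5
print("worst-case LHS:", worst_case_lhs(coeffs, x, eps=0.1))
print("feasible, no tolerance:", is_robust_feasible(coeffs, x, rhs, eps=0.1))
print("feasible, delta = 0.05:", is_robust_feasible(coeffs, x, rhs, eps=0.1, delta=0.05))
```

In an optimization model the absolute values |x_j| are not evaluated directly; they are typically replaced by auxiliary variables u_j with -u_j <= x_j <= u_j, which is consistent with the abstract's remark that only a small number of auxiliary variables and constraints is introduced.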

    Highly Accurate Structure-Based Prediction of HIV-1 Coreceptor Usage Suggests Intermolecular Interactions Driving Tropism

    HIV-1 entry into host cells is mediated by interactions between the V3-loop of viral glycoprotein gp120 and chemokine receptor CCR5 or CXCR4, collectively known as HIV-1 coreceptors. Accurate genotypic prediction of coreceptor usage is of significant clinical interest, and determination of the factors driving tropism has been the focus of extensive study. We have developed a method based on nonlinear support vector machines to elucidate the interacting residue pairs driving coreceptor usage and provide highly accurate coreceptor usage predictions. Our models utilize centroid-centroid interaction energies from computationally derived structures of the V3-loop:coreceptor complexes as primary features, while additional features based on established rules regarding V3-loop sequences are also investigated. We tested our method on 2455 V3-loop sequences of various lengths and subtypes, and obtained a median area under the receiver operating characteristic curve of 0.977 based on 500 runs of 10-fold cross validation. Our study is the first to elucidate a small set of specific interacting residue pairs between the V3-loop and coreceptors capable of predicting coreceptor usage with high accuracy across major HIV-1 subtypes. The developed method has been implemented as a web tool named CRUSH, CoReceptor USage prediction for HIV-1, which is available at http://ares.tamu.edu/CRUSH/
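
The reported figure of merit, the median area under the receiver operating characteristic curve (AUC) over repeated cross-validation runs, can be sketched as follows. This is only an illustration of the metric, assuming binary labels (1 for one coreceptor class, 0 for the other) and real-valued classifier scores. It does not reproduce the support vector machine models or the study's data; all names and numbers are hypothetical.

```python
from statistics import median


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out predictions from one cross-validation run.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print("AUC for this run:", roc_auc(labels, scores))

# Hypothetical per-run AUCs; a study would collect one value per run.
run_aucs = [0.970, 0.980, 0.975]
print("median AUC:", median(run_aucs))
```

Reporting the median over many runs makes the summary less sensitive to a few unusually easy or hard random splits than a single cross-validation estimate would be.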