86 research outputs found

    Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

    The explosive growth in the rate of data generation in recent years threatens to outpace the growth in computer power, motivating the need for new, scalable algorithms and big data analytic techniques. No field may be more emblematic of this data deluge than the life sciences, where technologies such as high-throughput mRNA arrays and next generation genome sequencing are routinely used to generate datasets of extreme scale. Data from experiments in genomics, transcriptomics, metabolomics and proteomics are continuously being added to existing repositories. A goal of exploratory analysis of such omics data is to illuminate the functions and relationships of biomolecules within an organism. This dissertation describes the design, implementation and application of graph algorithms, with the goal of seeking dense structure in data derived from omics experiments in order to detect latent associations between often heterogeneous entities, such as genes, diseases and phenotypes. Exact combinatorial solutions are developed and implemented, rather than relying on approximations or heuristics, even when problems are exceedingly large and/or difficult. Datasets on which the algorithms are applied include time series transcriptomic data from an experiment on the developing mouse cerebellum, gene expression data measuring acute ethanol response in the prefrontal cortex, and the analysis of a predicted protein-protein interaction network. A bipartite graph model is used to integrate heterogeneous data types, such as genes with phenotypes and microbes with mouse strains. The techniques are then extended to a multipartite algorithm to enumerate dense substructure in multipartite graphs, constructed using data from three or more heterogeneous sources, with applications to functional genomics. Several new theoretical results are given regarding multipartite graphs and the multipartite enumeration algorithm. 
In all cases, practical implementations are demonstrated to expand the frontier of computational feasibility.
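As a rough illustration of exact enumeration of dense structure in a bipartite graph, the sketch below lists every maximal biclique of a tiny gene-phenotype graph by brute force. The abstract does not name the dense structure it seeks, so the choice of maximal bicliques is an assumption here, as are the function name `maximal_bicliques` and the toy data. The enumeration is exponential in the number of left vertices and is meant only for very small inputs; it is not the dissertation's algorithm.

```python
from itertools import combinations

def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a bipartite graph (brute force).

    adj maps each left vertex to the set of right vertices it is adjacent to.
    Returns a set of (left_side, right_side) pairs of frozensets.
    """
    left = sorted(adj)
    found = set()
    for k in range(1, len(left) + 1):
        for subset in combinations(left, k):
            # Right vertices adjacent to every left vertex in the subset.
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Grow the left side to every vertex adjacent to all of `common`;
            # the resulting pair cannot be extended on either side.
            closure = frozenset(u for u in left if common <= adj[u])
            found.add((closure, frozenset(common)))
    return found

# Hypothetical toy data: genes on the left, phenotypes on the right.
genes_to_phenotypes = {
    "g1": {"p1", "p2"},
    "g2": {"p1", "p2"},
    "g3": {"p2", "p3"},
}
for genes, phenotypes in sorted(maximal_bicliques(genes_to_phenotypes), key=lambda b: sorted(b[0])):
    print(sorted(genes), sorted(phenotypes))
```

Each reported pair is a set of genes that all share the same set of phenotypes, with neither side extendable.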

    Near-Optimal Motion Planning Algorithms Via A Topological and Geometric Perspective

    Motion planning is a fundamental problem in robotics, which involves finding a path for an autonomous system, such as a robot, from a given source to a destination while avoiding collisions with obstacles. The properties of the planning space heavily influence the performance of existing motion planning algorithms, which can pose significant challenges in handling complex regions, such as narrow passages or cluttered environments, even for simple objects. The problem of motion planning becomes deterministic if the details of the space are fully known, which is often difficult to achieve in constantly changing environments. Sampling-based algorithms are widely used among motion planning paradigms because they capture the topology of the space in a roadmap. These planners have successfully solved high-dimensional planning problems with a probabilistic completeness guarantee, i.e., the probability of finding a path, if one exists, approaches one as the number of vertices goes to infinity. Despite their progress, these methods have failed to optimize the sub-region information of the environment for reuse by other planners. This results in re-planning overhead at each execution, which increases both computation time and memory usage. In this research, we address the problem by focusing on the theoretical foundation of an algorithmic approach that leverages the strengths of sampling-based motion planners and Topological Data Analysis methods to extract intricate properties of the environment. The work contributes a novel algorithm to overcome the performance shortcomings of existing motion planners by capturing and preserving the essential topological and geometric features to generate a homotopy-equivalent roadmap of the environment. This roadmap provides a mathematically rich representation of the environment, including an approximate measure of the collision-free space. 
In addition, the roadmap graph vertices sampled close to the obstacles exhibit advantages when navigating through narrow passages and cluttered environments, making obstacle-avoidance path planning significantly more efficient. The application of the proposed algorithms solves motion planning problems, such as sub-optimal planning, diverse path planning, and fault-tolerant planning, by demonstrating the improvement in computational performance and path quality. Furthermore, we explore the potential of these algorithms in solving computational biology problems, particularly in finding optimal binding positions for protein-ligand or protein-protein interactions. Overall, our work contributes a new way to classify routes in higher dimensional space and shows promising results for high-dimensional robots, such as articulated linkage robots. The findings of this research provide a comprehensive solution to motion planning problems and offer a new perspective on solving computational biology problems

    New Techniques for Clustering Complex Objects

    The tremendous amount of data produced nowadays in various application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which is the task of partitioning points of a data set into distinct groups (clusters) such that two points from one cluster are similar to each other whereas two points from distinct clusters are not. Due to modern database technology, e.g. object-relational databases, a huge number of complex objects from scientific, engineering or multimedia applications are stored in database systems. Modelling such complex data often results in very high-dimensional vector data ("feature vectors"). In the context of clustering, this causes a lot of fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because in such high-dimensional feature spaces data does not cluster anymore. But usually, there are clusters embedded in lower dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of features is regarded for clustering. The subset of features may even be different for varying clusters. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-connected Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces of the feature space. Two subspace selection techniques called RIS (Ranking Interesting Subspaces) and SURFING (SUbspaces Relevant For clusterING) are proposed. 
They do not compute the subspace clusters directly, but generate a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C searches for groups of objects which exhibit an arbitrary but uniform correlation. Often, the traditional approach of modelling data as high-dimensional feature vectors is no longer able to capture the intuitive notion of similarity between complex objects. Thus, objects like chemical compounds, CAD drawings, XML data or color images are often modelled by using more complex representations like graphs or trees. If a metric distance function like the edit distance for graphs and trees is used as similarity measure, traditional clustering approaches like density-based clustering are applicable to those data. However, we face the problem that a single distance calculation can be very expensive. As clustering performs a lot of distance calculations, approaches like filter and refinement and metric indices become important. The second part of this thesis deals with special approaches for clustering in application domains with complex similarity models. We show how appropriate filters can be used to enhance the performance of query processing and, thus, clustering of hierarchical objects. Furthermore, we describe how the two paradigms of filtering and metric indexing can be combined. As complex objects can often be represented by using different similarity models, a new clustering approach is presented that is able to cluster objects that provide several different complex representations.
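To make the subspace-clustering idea concrete, the sketch below implements a plain DBSCAN and then runs it naively on every axis-parallel subspace, which is the exhaustive baseline whose output the abstract says SUBCLU reproduces more efficiently. This is not SUBCLU itself: the function names, the parameters `eps` and `min_pts`, and the toy data are illustrative assumptions, and the neighbourhood search is a simple linear scan.

```python
from itertools import combinations
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point, so expand
                queue.extend(j_neighbours)
    return labels

def dbscan_all_subspaces(points, eps, min_pts):
    """Naive baseline: run DBSCAN on every non-empty subset of dimensions."""
    n_dims = len(points[0])
    result = {}
    for k in range(1, n_dims + 1):
        for subspace in combinations(range(n_dims), k):
            projected = [tuple(p[d] for d in subspace) for p in points]
            result[subspace] = dbscan(projected, eps, min_pts)
    return result

# Toy data: two groups are visible in dimension 0 only; dimension 1 is spread out.
data = [(0.0, 0), (0.1, 10), (0.2, 20), (5.0, 30), (5.1, 40), (5.2, 50)]
for subspace, labels in dbscan_all_subspaces(data, eps=0.5, min_pts=3).items():
    print(subspace, labels)
```

In this toy data the full two-dimensional space yields only noise, while the one-dimensional subspace `(0,)` yields two clusters, which is the situation the abstract describes for high-dimensional feature vectors.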

    visone - Software for the Analysis and Visualization of Social Networks

    We present the software tool visone, which combines graph-theoretic methods for the analysis of social networks with tailored means of visualization. Our main contribution is the design of novel graph-layout algorithms which accurately reflect computed analysis results in well-arranged drawings of the networks under consideration. Besides this, we give a detailed description of the design of the software tool and the provided analysis methods.

    Measuring and improving the readability of network visualizations

    Network data structures have been used extensively for modeling entities and their ties across such diverse disciplines as Computer Science, Sociology, Bioinformatics, Urban Planning, and Archeology. Analyzing networks involves understanding the complex relationships between entities as well as any attributes, statistics, or groupings associated with them. The widely used node-link visualization excels at showing the topology, attributes, and groupings simultaneously. However, many existing node-link visualizations are difficult to extract meaning from because of (1) the inherent complexity of the relationships, (2) the number of items designers try to render in limited screen space, and (3) the many potentially unintelligible or even misleading visualizations that exist for every network. Automated layout algorithms have helped, but frequently generate ineffective visualizations even when used by expert analysts. Past work, including my own described herein, has shown there can be vast improvements in network visualizations, but no one can yet produce readable and meaningful visualizations for all networks. Since there is no single way to visualize all networks effectively, in this dissertation I investigate three complementary strategies. First, I introduce a technique called motif simplification that leverages the repeating patterns or motifs in a network to reduce visual complexity. I replace common, high-payoff motifs with easily understandable glyphs that require less screen space, can reveal otherwise hidden relationships, and improve user performance on many network analysis tasks. Next, I present new Group-in-a-Box layouts that subdivide large, dense networks using attribute- or topology-based groupings. These layouts take group membership into account to more clearly show the ties within groups as well as the aggregate relationships between groups. 
Finally, I develop a set of readability metrics to measure visualization effectiveness and localize areas needing improvement. I detail optimization recommendations for specific user tasks, in addition to leveraging the readability metrics in a user-assisted layout optimization technique. This dissertation contributes an understanding of why some node-link visualizations are difficult to read, what measures of readability could help guide designers and users, and several promising strategies for improving readability which demonstrate that progress is possible. This work also opens several avenues of research, both technical and in user education

    Systems Toxicology: Mining chemical-toxicity signaling paths to enable network medicine

    Systems toxicology, a branch of toxicology that studies chemical effects on biological systems, presents exciting knowledge discovery challenges for the information researcher. The exponential increase in availability of genomic and proteomic data in this domain needs to be matched with increasingly sophisticated network analysis approaches. Improved ability to mine complex gene and protein interaction networks may eventually lead to discovery of drugs that target biological sub-networks (‘network medicine’) instead of individual proteins. In this thesis, we have proposed and investigated the use of a maximal edge centrality criterion to discover drug-toxicity signaling paths inside a human protein interaction network. The signaling path detection approach utilizes drug and toxicity information along with two novel edge weighting measures, one based on edge centrality for detected paths and another using differential gene expression between tissues treated with toxicity-inducing drugs and a control set. Drugs known to induce non-immune Neutropenia were analyzed as a test case and common path proteins on discovered signaling paths were evaluated for toxicological significance. In addition to investigating the value of topological connectivity for identification of toxicity biomarkers, the gene expression-based measure led to identification of a proposed biomarker panel for screening new drug candidates. Comparative evaluation of findings from the drug-toxicity signaling path (DTSP) approach with a standard microarray analysis method showed clear improvements in various performance measures including true positive rate, positive predictive value, negative predictive value and overall accuracy. Comparison of non-immune Neutropenia signaling paths with those discovered for a control set showed increased transcript-level activation of discovered signaling paths for toxicity-inducing drugs. 
We have demonstrated the scientific value of a systems-based approach for identifying toxicity-related proteins inside complex biological networks. The algorithm should be useful for biomarker identification for any toxicity, assuming availability of relevant drug and drug-induced toxicity information. Ph.D., Information Studies -- Drexel University, 201

    Computational Approaches to Generating Diverse Enzyme Panels

    Ph.D. Thesis
    Motivation: Enzymes are complex macromolecules crucial to life on earth. From bacteria to human beings, all organisms use enzymes to catalyse the many thousands of chemical reactions occurring in their cells. Enzyme functions are so diverse that the use of enzymes as "biocatalysts" in industries like pharmaceuticals and agriculture has gained popularity over recent years. Unfortunately, the confident laboratory-based characterisation of enzyme function has lagged behind a massive increase in sequencing data, slowing down initiatives that look to use biocatalysts as part of their chemical processes. Computational methods for identifying biocatalysts do exist, but often falter due to the complexity of enzymes and sequence bias, leaving much of the catalytic space of enzymes and their families undiscovered. This thesis has two major themes: the development of in silico approaches for curating diverse panels of novel enzyme sequences for experimental characterisation, and of tooling that integrates in silico panel creation and in vitro enzyme characterisation into a unified and iterative framework.
    Contributions of this thesis: The contributions of this thesis can be divided into the two larger themes, starting with the diverse panel selection of sequences from an enzyme family:
    • A novel type of protein network based on patterns of coevolving residues that can be used to identify functionally-interesting groupings in enzyme families.
    • The automatic sampling of functionally diverse subsets of enzyme sequences by solving the maximum diversity problem.
    • A study into the viability of artificially increasing enzyme family diversity through neural networks-based generation of synthetic sequences.
    The second theme deals with tools built for bridging the gap between the in silico and in vitro sides of enzyme family exploration:
    • A platform that integrates the panel selection process and resulting characterisation data to promote an iterative approach to exploring enzyme families.
    • A repository for storing the metadata generated by the major steps of characterisation assays in the lab.
    EPSRC and Prozomix Limite
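The maximum diversity problem mentioned in the contributions can be stated as: from n items with pairwise distances, pick k items whose summed pairwise distance is as large as possible. The sketch below solves it exactly by exhaustive search on a tiny, made-up distance matrix. It only illustrates the problem statement; the function name `max_diversity_subset` and the data are assumptions, and nothing here reflects how the thesis actually solves the problem or measures distance between enzyme sequences.

```python
from itertools import combinations

def max_diversity_subset(dist, k):
    """Return the k indices maximising the sum of pairwise distances.

    dist is a symmetric n x n matrix (list of lists). Exhaustive search,
    so only suitable for very small n.
    """
    n = len(dist)

    def spread(subset):
        # Total pairwise distance within the candidate subset.
        return sum(dist[i][j] for i, j in combinations(subset, 2))

    return max(combinations(range(n), k), key=spread)

# Hypothetical pairwise distances between four sequences.
distances = [
    [0, 1, 9, 4],
    [1, 0, 8, 3],
    [9, 8, 0, 7],
    [4, 3, 7, 0],
]
print(max_diversity_subset(distances, 2))
print(max_diversity_subset(distances, 3))
```

Sequences 0 and 1 are nearly identical in this toy matrix, so a diverse panel of three keeps only one of them.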

    Geometric, Feature-based and Graph-based Approaches for the Structural Analysis of Protein Binding Sites : Novel Methods and Computational Analysis

    This thesis considers protein binding sites. To extract information from the space of protein binding sites, the binding sites must first be mapped onto a mathematical space, for example as vectors, graphs or point clouds. Imposing structure on that space requires a distance measure, which is introduced in this thesis. This distance measure can then be used to extract information by means of data mining techniques.

    Optimization methods for side-chain positioning and macromolecular docking

    This dissertation proposes new optimization algorithms targeting protein-protein docking which is an important class of problems in computational structural biology. The ultimate goal of docking methods is to predict the 3-dimensional structure of a stable protein-protein complex. We study two specific problems encountered in predictive docking of proteins. The first problem is Side-Chain Positioning (SCP), a central component of homology modeling and computational protein docking methods. We formulate SCP as a Maximum Weighted Independent Set (MWIS) problem on an appropriately constructed graph. Our formulation also considers the significant special structure of proteins that SCP exhibits for docking. We develop an approximate algorithm that solves a relaxation of MWIS and employ randomized estimation heuristics to obtain high-quality feasible solutions to the problem. The algorithm is fully distributed and can be implemented on multi-processor architectures. Our computational results on a benchmark set of protein complexes show that the accuracy of our approximate MWIS-based algorithm predictions is comparable with the results achieved by a state-of-the-art method that finds an exact solution to SCP. The second problem we target in this work is protein docking refinement. We propose two different methods to solve the refinement problem. The first approach is based on a Monte Carlo-Minimization (MCM) search to optimize rigid-body and side-chain conformations for binding. In particular, we study the impact of optimally positioning the side-chains in the interface region between two proteins in the process of binding. We report computational results showing that incorporating side-chain flexibility in docking provides substantial improvement in the quality of docked predictions compared to the rigid-body approaches. 
Further, we demonstrate that the inclusion of unbound side-chain conformers in the side-chain search introduces significant improvement in the performance of the docking refinement protocols. In the second approach, we propose a novel stochastic optimization algorithm based on Subspace Semi-Definite programming-based Underestimation (SSDU), which aims to solve protein docking and protein structure prediction. SSDU is based on underestimating the binding energy function in a permissive subspace of the space of rigid-body motions. We apply Principal Component Analysis (PCA) to determine the permissive subspace and reduce the dimensionality of the conformational search space. We consider the general class of convex polynomial underestimators, and formulate the problem of finding such underestimators as a Semi-Definite Programming (SDP) problem. Using these underestimators, we perform a biased sampling in the vicinity of the conformational regions where the energy function is at its global minimum. Moreover, we develop an exploration procedure based on density-based clustering to detect the near-native regions even when there are many local minima residing far from each other. We also incorporate a Model Selection procedure into SSDU to pick a predictive conformation. Testing our algorithm over a benchmark of protein complexes indicates that SSDU substantially improves the quality of docking refinement compared with existing methods
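For readers unfamiliar with the Maximum Weighted Independent Set (MWIS) problem used above to formulate side-chain positioning, the sketch below solves MWIS exactly by brute force on a three-node graph. It only shows what the problem asks for: a set of mutually non-adjacent nodes with the largest total weight. It is not the relaxation-based distributed algorithm the abstract describes, and the node names, weights and edges are hypothetical.

```python
from itertools import combinations

def max_weight_independent_set(weights, edges):
    """Exact MWIS by exhaustive search (exponential; tiny graphs only).

    weights maps node -> positive weight; edges is an iterable of node pairs
    that may not both be selected.
    """
    nodes = list(weights)
    edge_set = {frozenset(e) for e in edges}
    best, best_weight = (), 0.0
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            # Skip any subset containing two adjacent (conflicting) nodes.
            if any(frozenset(pair) in edge_set for pair in combinations(subset, 2)):
                continue
            total = sum(weights[v] for v in subset)
            if total > best_weight:
                best, best_weight = subset, total
    return set(best), best_weight

# Hypothetical path graph a - b - c: picking b excludes both neighbours.
weights = {"a": 3.0, "b": 5.0, "c": 3.0}
edges = [("a", "b"), ("b", "c")]
print(max_weight_independent_set(weights, edges))
```

The heaviest single node is not part of the optimum here, which is why greedy selection by weight alone can fail on this problem.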

    New Approaches to Protein Structure Prediction

    Protein structure prediction is concerned with the prediction of a protein's three-dimensional structure from its amino acid sequence. Such predictions are commonly performed by searching the possible structures and evaluating each structure by using some scoring function. If it is assumed that the target protein structure resembles the structure of a known protein, the search space can be significantly reduced. Such an approach is referred to as comparative structure prediction. When such an assumption is not made, the approach is known as ab initio structure prediction. There are several difficulties in devising efficient searches or in computing the scoring function. Many of these problems have ready solutions from known mathematical methods. However, the problems that are yet unsolved have hindered structure prediction methods from achieving better predictions. The objective of this study is to present a complete framework for ab initio protein structure prediction. To achieve this, a new search strategy is proposed, and better techniques are devised for computing the known scoring functions. Some of the remaining problems in protein structure prediction are revisited. Several of them are shown to be intractable. In many of these cases, approximation methods are suggested as alternative solutions. The primary issues addressed in this thesis are concerned with local structure prediction, structure assembly or sampling, side chain packing, model comparison, and structural alignment. For brevity, we do not elaborate on these problems here; a concise introduction is given in the first section of this thesis. Results from these studies prompted the development of several programs, forming a utility suite for ab initio protein structure prediction. Due to the general usefulness of these programs, some of them are released with open source licenses to benefit the community.