
    Optimization-Based Network Analysis with Applications in Clustering and Data Mining

    In this research we develop theoretical foundations and efficient solution methods for two classes of cluster-detection problems from an optimization point of view. In particular, the s-club model and the biclique model are considered, motivated by their diverse application areas. An analytical review of the optimization problems is followed by theoretical results and algorithmic solution methods developed in this research. The maximum s-club problem has applications in graph-based data mining and robust network design, where high reachability is often considered a critical property. The massive size of real-life instances makes it necessary to devise a scalable solution method for practical purposes. Moreover, the lack of the heredity property in s-clubs poses challenges in the design of optimization algorithms. Motivated by these properties, a sufficient condition for checking the maximality, by inclusion, of a given s-club is proposed. This sufficient condition can be employed in the design of optimization algorithms to reduce the computational effort. A variable neighborhood search algorithm is proposed for the maximum s-club problem to facilitate the solution of large instances with reasonable computational effort. In addition, a hybrid exact algorithm has been developed for the problem. Inspired by the wide use of bipartite graphs in modeling and data mining, we consider three classes of the maximum biclique problem: the maximum edge biclique, the maximum vertex biclique, and the maximum balanced biclique problems. Asymptotic lower and upper bounds on the size of these structures in uniform random graphs are developed. These bounds offer insight into the evolution and growth rate of bicliques in large-scale graphs. To overcome the computational difficulty of solving large instances, a scale-reduction technique for the maximum vertex and maximum edge biclique problems in general graphs is proposed. The procedure shrinks the underlying network by identifying and removing edges that cannot be in the optimal solution, thus enabling exact solution methods to solve large-scale sparse instances to optimality. Also, a combinatorial branch-and-bound algorithm is developed that is best suited to dense instances, where the scale-reduction method might be less effective. The proposed algorithms are flexible and, with small modifications, can solve the weighted versions of the problems.
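
    To make the s-club model concrete, here is a minimal sketch (an illustration under an assumed adjacency-set graph encoding, not the dissertation's algorithm) of the defining property: a vertex set S is an s-club if the subgraph it induces has diameter at most s.

        from collections import deque

        def is_s_club(graph, s_set, s):
            """Check whether s_set induces an s-club in graph, i.e. the
            subgraph induced by s_set has diameter at most s.
            graph: dict mapping vertex -> set of neighbours."""
            members = set(s_set)
            for source in members:
                # BFS restricted to the induced subgraph G[S].
                dist = {source: 0}
                queue = deque([source])
                while queue:
                    u = queue.popleft()
                    for v in graph[u]:
                        if v in members and v not in dist:
                            dist[v] = dist[u] + 1
                            queue.append(v)
                # Every other member must lie within s hops inside G[S].
                if any(w not in dist or dist[w] > s for w in members):
                    return False
            return True

        # Toy example: a path a-b-c-d; {a, b, c} is a 2-club but not a 1-club.
        g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
        print(is_s_club(g, {"a", "b", "c"}, 2))  # True
        print(is_s_club(g, {"a", "b", "c"}, 1))  # False

    Note that, unlike cliques, a subset of an s-club need not itself be an s-club, which is exactly the missing heredity property the abstract refers to.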

    The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems

    We present a compendium of numerical simulation techniques, based on tensor network methods, that aim to address problems of many-body quantum mechanics on a classical computer. The core setting of this anthology is lattice problems in low spatial dimension at finite size, a physical scenario where tensor network methods, both the Density Matrix Renormalization Group and beyond, have long proven to be winning strategies. Here we explore in detail the numerical frameworks and methods employed to deal with low-dimensional physical setups from a computational physics perspective. We focus on symmetries and closed-system simulations in arbitrary boundary conditions, while discussing the numerical data structures and linear algebra manipulation routines involved, which form the core libraries of any tensor network code. At a higher level, we put the spotlight on loop-free network geometries, discussing their advantages and presenting in detail algorithms to simulate low-energy equilibrium states. Accompanied by discussions of data structures, numerical techniques, and performance, this anthology serves as a programmer's companion, as well as a self-contained introduction and review of the basic and selected advanced concepts in tensor networks, including examples of their applications.
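
    As a taste of the contraction routines at the core of such a library, here is a minimal sketch (not from the anthology) that contracts an open-boundary matrix product state with its conjugate to obtain the norm <psi|psi>; the tensor layout (left bond, physical, right bond) is an assumed convention.

        import numpy as np

        def mps_norm(tensors):
            """Compute <psi|psi> for an open-boundary MPS.
            tensors: list of arrays of shape (left_bond, physical, right_bond)."""
            # env accumulates the bra-ket contraction from the left.
            env = np.ones((1, 1))
            for A in tensors:
                # Absorb the ket tensor: env[a,b] * A[a,p,c] -> tmp[b,p,c]
                tmp = np.einsum("ab,apc->bpc", env, A)
                # Absorb the conjugated bra tensor: tmp[b,p,c] * conj(A)[b,p,d] -> env[c,d]
                env = np.einsum("bpc,bpd->cd", tmp, A.conj())
            return float(env.real.squeeze())

        # Toy example: random 4-site MPS, bond dimension 3, physical dimension 2.
        rng = np.random.default_rng(0)
        dims = [1, 3, 3, 3, 1]
        mps = [rng.standard_normal((dims[i], 2, dims[i + 1])) for i in range(4)]
        print(mps_norm(mps))

    Sweeping a single environment tensor along the chain like this keeps the cost polynomial in the bond dimension, which is the basic reason loop-free geometries are cheap to contract.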

    Computational haplotyping : theory and practice

    Genomics has paved a new way to comprehend life and its evolution, and also to investigate the causes of diseases and their treatment. One of the important problems in genomic analyses is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third-generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single-molecule sequencing protocols employed. Combinatorial formulations help to correct these errors and thereby solve the haplotyping problem. Various computational techniques, such as dynamic programming, parameterized algorithms, and graph algorithms, are used to solve this problem. This thesis presents several contributions to the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed that provides approximation guarantees for phasing a single individual. Second, an integrative approach is introduced that combines multiple sequencing datasets to generate complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel, efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and another that provides long reads.
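
    A common combinatorial formulation behind such phasing approaches is minimum error correction (MEC): partition the reads between the two haplotypes so that the number of allele flips needed to make each side internally consistent is minimized. The sketch below (an illustration, not the thesis's algorithm; the fragment and site encodings are assumptions) scores a given bipartition.

        def mec_score(fragments, partition):
            """MEC score of a bipartition of reads over heterozygous sites.
            fragments: list of dicts mapping site index -> allele (0 or 1);
            partition: list of 0/1 labels assigning each fragment to a haplotype.
            Per haplotype and site, the cheapest consensus flips the minority
            alleles, so the column cost is min(#0s, #1s)."""
            score = 0
            for side in (0, 1):
                counts = {}  # site -> [number of 0s, number of 1s]
                for frag, label in zip(fragments, partition):
                    if label != side:
                        continue
                    for site, allele in frag.items():
                        counts.setdefault(site, [0, 0])[allele] += 1
                score += sum(min(c0, c1) for c0, c1 in counts.values())
            return score

        # Toy instance: three reads over four heterozygous sites.
        reads = [
            {0: 0, 1: 0, 2: 1},  # supports haplotype 0011
            {1: 1, 2: 0, 3: 0},  # supports the complement 1100
            {0: 0, 2: 1, 3: 1},  # agrees with the first read
        ]
        print(mec_score(reads, [0, 1, 0]))  # 0: this bipartition is conflict-free
        print(mec_score(reads, [0, 0, 0]))  # 3: conflicts at sites 1, 2 and 3

    Finding the bipartition that minimizes this score is NP-hard in general, which is why dynamic programming over read overlaps and parameterized algorithms, as used in the thesis, are natural tools.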

    Phylogeny-Aware Placement and Alignment Methods for Short Reads

    In recent years, bioinformatics has entered a new phase: new sequencing methods, generally referred to as Next Generation Sequencing (NGS), have become widely available. This thesis introduces algorithms for the phylogeny-aware analysis of short sequence reads, as generated by NGS methods in the context of metagenomic studies. A considerable part of this work focuses on the technical (w.r.t. performance) challenges of these new algorithms, which have been developed specifically to exploit parallelism.
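
    The core operation in phylogeny-aware placement is to score a short read against every candidate branch of a fixed reference tree and report the best fit. The toy sketch below is not the thesis's likelihood-based method: it substitutes a simple match-fraction score for the per-branch likelihood, and all names and profiles are hypothetical.

        def place_read(read, branch_profiles):
            """Toy phylogeny-aware placement: score a short read against a
            per-branch reference profile and return the best-scoring branch.
            branch_profiles: dict mapping branch id -> reference sequence
            (plain strings of equal length; gaps '-' are skipped). This stands
            in for the likelihood a real placement tool computes against
            ancestral state vectors."""
            def score(ref):
                pairs = [(r, q) for r, q in zip(ref, read) if q != "-" and r != "-"]
                return sum(r == q for r, q in pairs) / max(len(pairs), 1)
            return max(branch_profiles, key=lambda b: score(branch_profiles[b]))

        # Hypothetical profiles for three branches of a small reference tree.
        profiles = {
            "branch_1": "ACGTACGT",
            "branch_2": "ACGAACGA",
            "branch_3": "TTGTACGG",
        }
        print(place_read("ACG-ACGT", profiles))  # branch_1

    Because each read is placed independently against the fixed reference, the loop over reads is trivially parallel, which is the kind of parallelism the thesis's algorithms are designed to exploit.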

    High-Quality Hypergraph Partitioning

    This dissertation focuses on computing high-quality solutions for the NP-hard balanced hypergraph partitioning problem: given a hypergraph and an integer k, partition its vertex set into k disjoint blocks of bounded size, while minimizing an objective function over the hyperedges. Here, we consider the two most commonly used objectives: the cut-net metric and the connectivity metric. Since the problem is computationally intractable, heuristics are used in practice, the most prominent being the three-phase multi-level paradigm: during coarsening, the hypergraph is successively contracted to obtain a hierarchy of smaller instances. After applying an initial partitioning algorithm to the smallest hypergraph, the contraction is undone and, at each level, refinement algorithms try to improve the current solution. With this work, we give a brief overview of the field and present several algorithmic improvements to the multi-level paradigm. Instead of using a logarithmic number of levels like traditional algorithms, we present two coarsening algorithms that create a hierarchy of (nearly) n levels, where n is the number of vertices. This makes consecutive levels as similar as possible and provides many opportunities for refinement algorithms to improve the partition. This approach is made feasible in practice by tailoring all algorithms and data structures to the n-level paradigm, and by developing lazy-evaluation techniques, caching mechanisms, and early-stopping criteria to speed up the partitioning process. Furthermore, we propose a sparsification algorithm based on locality-sensitive hashing that improves the running time for hypergraphs with large hyperedges, and show that incorporating global information about the community structure into the coarsening process improves quality. Moreover, we present a portfolio-based initial partitioning approach and propose three refinement algorithms. Two are based on the Fiduccia-Mattheyses (FM) heuristic, but perform a highly localized search at each level. While one is designed for two-way partitioning, the other is the first FM-style algorithm that can be efficiently employed in the multi-level setting to directly improve k-way partitions. The third algorithm uses max-flow computations on pairs of blocks to refine k-way partitions. Finally, we present the first memetic multi-level hypergraph partitioning algorithm for an extensive exploration of the global solution space. All contributions are made available through our open-source framework KaHyPar. In a comprehensive experimental study, we compare KaHyPar with hMETIS, PaToH, Mondriaan, Zoltan-AlgD, and HYPE on a wide range of hypergraphs from several application areas. Our results indicate that KaHyPar, even without the memetic component, computes better solutions than all competing algorithms for both the cut-net and the connectivity metric, while being faster than Zoltan-AlgD and as fast as hMETIS. Moreover, KaHyPar compares favorably with the current best graph partitioning system, KaFFPa, both in terms of solution quality and running time.
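
    For reference, the two objectives have simple combinatorial definitions: a net is cut if it spans more than one block, and the connectivity metric charges each net w(e) * (lambda(e) - 1), where lambda(e) is the number of blocks the net connects. The sketch below (illustrative code, not KaHyPar's implementation) evaluates both for a given partition.

        def partition_objectives(hyperedges, block_of, weight=None):
            """Cut-net and connectivity objectives of a k-way partition.
            hyperedges: list of vertex lists; block_of: dict vertex -> block id;
            weight: optional per-net weights (defaults to 1)."""
            weight = weight or [1] * len(hyperedges)
            cut_net = connectivity = 0
            for net, w in zip(hyperedges, weight):
                # lambda(e): number of distinct blocks the net touches.
                lam = len({block_of[v] for v in net})
                if lam > 1:
                    cut_net += w
                    connectivity += w * (lam - 1)
            return cut_net, connectivity

        # Toy hypergraph with 6 vertices split into 3 blocks.
        nets = [[0, 1, 2], [2, 3], [0, 3, 5], [4, 5]]
        blocks = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
        print(partition_objectives(nets, blocks))  # (3, 4)

    The two metrics coincide for nets spanning exactly two blocks; the connectivity metric additionally penalizes nets that are fragmented across three or more blocks, as the third net above is.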

    Algorithms for phylogenetic tree correction in species and cancer evolution

    Reconstructing evolutionary trees, also known as phylogenies, from molecular sequence data is a fundamental problem in computational biology. Classically, evolutionary trees have been estimated over a set of species, where leaves correspond to extant species and internal nodes correspond to ancestral species. This type of phylogeny is colloquially thought of as the “Tree of Life” and assembling it has been designated as a Grand Challenge by the National Science Foundation Advisory Committee for Cyberinfrastructure. However, processes other than speciation are also shaped by evolution. One notable example is the development of a malignant tumor; tumor cells rapidly grow and divide, acquiring new mutations with each subsequent generation. Tumor cells then compete for resources, often resulting in selection for more aggressive cell types. Recent advancements in sequencing technology have rapidly increased the amount of sequencing data taken from tumor biopsies. This development has allowed researchers to attempt reconstructing evolutionary histories for individual patient tumors, improving our understanding of cancer and laying the groundwork for precision therapy. Despite algorithmic improvements in the estimation of both species and tumor phylogenies from molecular sequence data, current approaches still suffer from a number of limitations. Incomplete sampling and estimation error can lead to missing leaves and low-support branches in the estimated phylogenies. Moreover, commonly posed optimization problems are often under-determined given the limited amount and low quality of input data, leading to large solution spaces of equally plausible phylogenies. In this dissertation, we explore current limitations in both species and tumor phylogeny estimation, connecting similarities and highlighting key differences. We then put forward four new methods that improve phylogeny estimation by incorporating auxiliary information: OCTAL, TRACTION, PhySigs, and RECAP. For each method, we present theoretical results (e.g., optimization problem complexity, algorithmic correctness, running time analysis) as well as empirical results on simulated and real datasets. Collectively, these methods show we can significantly improve the accuracy of leading phylogeny estimation methods by leveraging additional signal in distinct, but related, datasets.
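
    One simple operation underlying such correction pipelines is contracting low-support branches, so that downstream methods work with a multifurcating tree that only asserts well-supported relationships. Here is a minimal sketch under an assumed dict-based tree encoding; it illustrates the preprocessing idea, not any of the four methods above.

        def collapse_low_support(node, threshold):
            """Contract internal branches with support below threshold,
            yielding a (possibly multifurcating) tree; leaves are kept.
            node: dict with keys 'name', 'support' (of the branch above
            the node) and 'children' (empty list for leaves)."""
            new_children = []
            for child in node["children"]:
                collapse_low_support(child, threshold)
                if child["children"] and child["support"] < threshold:
                    # Contract the branch: lift the grandchildren one level up.
                    new_children.extend(child["children"])
                else:
                    new_children.append(child)
            node["children"] = new_children
            return node

        # Toy tree ((a,b)90,(c,d)40)root with one weakly supported branch.
        tree = {"name": "root", "support": 100, "children": [
            {"name": "ab", "support": 90, "children": [
                {"name": "a", "support": 100, "children": []},
                {"name": "b", "support": 100, "children": []}]},
            {"name": "cd", "support": 40, "children": [
                {"name": "c", "support": 100, "children": []},
                {"name": "d", "support": 100, "children": []}]},
        ]}
        collapse_low_support(tree, 70)
        print([c["name"] for c in tree["children"]])  # ['ab', 'c', 'd']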