60 research outputs found

    Evaluation of ILP-based approaches for partitioning into colorful components

    The NP-hard Colorful Components problem is a graph partitioning problem on vertex-colored graphs. We identify a new application of Colorful Components in the correction of Wikipedia interlanguage links, and describe and compare three exact and two heuristic approaches. In particular, we devise two ILP formulations, one based on Hitting Set and one based on Clique Partition. Furthermore, we use the recently proposed implicit hitting set framework [Karp, JCSS 2011; Chandrasekaran et al., SODA 2011] to solve Colorful Components. Finally, we study a move-based and a merge-based heuristic for Colorful Components. We can optimally solve Colorful Components for Wikipedia link correction data; while the Clique Partition-based ILP outperforms the other two exact approaches, the implicit hitting set approach is a simple and competitive alternative. The merge-based heuristic is very accurate and outperforms the move-based one. The above results for Wikipedia data are confirmed by experiments with synthetic instances.

    Algorithmic and Hardness Results for the Colorful Components Problems

    In this paper we investigate the colorful components framework, motivated by applications emerging from comparative genomics. The general goal is to remove a collection of edges from an undirected vertex-colored graph $G$ such that in the resulting graph $G'$ all the connected components are colorful (i.e., any two vertices of the same color belong to different connected components). We want $G'$ to optimize an objective function, the selection of this function being specific to each problem in the framework. We analyze three objective functions, and thus three different problems, which are believed to be relevant for the biological applications: minimizing the number of singleton vertices, maximizing the number of edges in the transitive closure, and minimizing the number of connected components. Our main result is a polynomial time algorithm for the first problem. This result disproves the conjecture of Zheng et al. that the problem is $NP$-hard (assuming $P \neq NP$). Then, we show that the second problem is $APX$-hard, thus proving and strengthening the conjecture of Zheng et al. that the problem is $NP$-hard. Finally, we show that the third problem does not admit polynomial time approximation within a factor of $|V|^{1/14 - \epsilon}$ for any $\epsilon > 0$, assuming $P \neq NP$ (or within a factor of $|V|^{1/2 - \epsilon}$, assuming $ZPP \neq NP$). Comment: 18 pages, 3 figures.
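The definitions in the abstract above can be made concrete with a small sketch. It checks whether every connected component of a vertex-colored graph is colorful and evaluates the three objective functions the abstract names. It is not the paper's algorithm; the function names and the toy graph are assumptions chosen for illustration.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Iterative depth-first search from each unvisited vertex.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def is_colorful(vertices, edges, color):
    """True if no component contains two vertices of the same color."""
    for component in connected_components(vertices, edges):
        colors = [color[v] for v in component]
        if len(colors) != len(set(colors)):
            return False
    return True


def objectives(vertices, edges):
    """Evaluate the three objective functions named in the abstract."""
    components = connected_components(vertices, edges)
    return {
        # Objective 1 (to minimize): number of singleton vertices.
        "singletons": sum(1 for c in components if len(c) == 1),
        # Objective 2 (to maximize): edges in the transitive closure,
        # i.e. one edge per vertex pair inside each component.
        "transitive_closure_edges": sum(len(c) * (len(c) - 1) // 2 for c in components),
        # Objective 3 (to minimize): number of connected components.
        "components": len(components),
    }


# Hypothetical toy instance: a path a-b-c-d with alternating colors.
vertices = ["a", "b", "c", "d"]
color = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_colorful(vertices, path, color))                      # the full path repeats colors
print(is_colorful(vertices, [("a", "b"), ("c", "d")], color))  # after removing edge (b, c)
print(objectives(vertices, [("a", "b"), ("c", "d")]))
```

On this toy instance, removing the middle edge and removing the two outer edges both yield colorful graphs, but they score differently under the three objectives, which is why the framework treats them as separate problems.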

    A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

    Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets in three languages: French, Portuguese and Spanish. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences. Comment: Preprint version.

    Machine Learning for Instance Segmentation

    Volumetric Electron Microscopy images can be used for connectomics, the study of brain connectivity at the cellular level. A prerequisite for this inquiry is the automatic identification of neural cells, which requires machine learning algorithms and in particular efficient image segmentation algorithms. In this thesis, we develop new algorithms for this task. In the first part we provide, for the first time in this field, a method for training a neural network to predict optimal input data for a watershed algorithm. We demonstrate its superior performance compared to other segmentation methods of its category. In the second part, we develop an efficient watershed-based algorithm for weighted graph partitioning, the Mutex Watershed, which uses negative edge weights for the first time. We show that it is intimately related to the multicut and has cutting-edge performance on a connectomics challenge. Our algorithm is currently used by the leaders of two connectomics challenges. Finally, motivated by inpainting neural networks, we create a method to learn the graph weights without any supervision.

    Multi-resolution region-preserving segmentation for color images of natural scene

    Master's (Master of Science)

    Computational methods for large-scale single-cell RNA-seq and multimodal data

    Emerging single-cell genomics technologies such as single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq provide new opportunities for discovery of previously unknown cell types, facilitating the study of biological processes such as tumor progression, and delineating molecular mechanism differences between species. Due to the high dimensionality of the data produced by these technologies, computation and mathematics have been the cornerstone in decoding meaningful information from the data. Computational models have been challenged by the exponential growth of the data thanks to the continuing decrease in sequencing costs and growth of large-scale genomic projects such as the Human Cell Atlas. In addition, recent single-cell technologies have enabled us to measure multiple modalities such as transcriptome, proteome, and epigenome in the same cell. This requires us to establish new computational methods which can cope with multiple layers of the data. To address these challenges, the main goal of this thesis was to develop computational methods and mathematical models for analyzing large-scale scRNA-seq and multimodal omics data. In particular, I have focused on fundamental single-cell analysis such as clustering and visualization. The most common task in scRNA-seq data analysis is the identification of cell types. Numerous methods have been proposed for this problem with a current focus on methods for the analysis of large-scale scRNA-seq data. I developed Specter, a computational method that utilizes recent algorithmic advances in fast spectral clustering and ensemble learning. Specter achieves a substantial improvement in accuracy over existing methods and identifies rare cell types with high sensitivity. Specter allows us to process a dataset comprising 2 million cells in just 26 minutes.
Moreover, the analysis of CITE-seq data, which simultaneously provides gene expression and protein levels, showed that Specter is able to incorporate multimodal omics measurements to resolve subtle transcriptomic differences between subpopulations of cells. We have effectively handled big data for clustering analysis using Specter. The question is how to cope with big data for other downstream analyses such as trajectory inference and data integration. The simplest scheme is to shrink the data by selecting a subset of cells (the sketch) that best represents the full data set. Therefore I developed an algorithm called Sphetcher that makes use of the thresholding technique to efficiently pick representative cells that evenly cover the transcriptomic space occupied by the original data set. I showed that the sketch computed by Sphetcher constitutes a more accurate representation of the original transcriptomic landscape than existing methods, which leads to a more balanced composition of cell types and a large fraction of rare cell types in the sketch. Sphetcher bridges the gap between the scalability of computational methods and the volume of the data. Moreover, I demonstrated that Sphetcher can incorporate prior information (e.g. cell labels) to inform the inference of the trajectory of human skeletal muscle myoblast differentiation. Biological processes such as development, differentiation, and cell cycle can be monitored by performing single-cell sequencing at different time points, each corresponding to a snapshot of the process. A class of computational methods called trajectory inference aims to reconstruct the developmental trajectories from these snapshots. Trajectory inference (TI) methods such as Monocle can computationally infer a pseudotime variable which serves as a proxy for developmental time. In order to compare two trajectories inferred by TI methods, we need to align the pseudotime between two trajectories.
Current methods for aligning trajectories are based on the concept of dynamic time warping, which is limited to simple linear trajectories. Since complex trajectories are common in developmental processes, I adopted arboreal matchings to compare and align complex trajectories with multiple branch points diverting cells into alternative fates. Arboreal matchings were originally proposed in the context of phylogenetic trees and I theoretically linked them to dynamic time warping. A suite of exact and heuristic algorithms for aligning complex trajectories was implemented in the software Trajan. When aligning single-cell trajectories describing human muscle differentiation and myogenic reprogramming, Trajan automatically identifies the core paths from which we are able to reproduce recently reported barriers to reprogramming. In a perturbation experiment, I showed that Trajan correctly maps identical cells in a global view of trajectories, as opposed to a pairwise application of dynamic time warping. Visualization using dimensionality reduction techniques such as t-SNE and UMAP is a fundamental step in the analysis of high-dimensional data. Visualization has played a pivotal role in discovering the dynamic trends in single-cell genomics data. I developed j-SNE and j-UMAP as their generalizations to the joint visualization of multimodal omics data, e.g., CITE-seq data. The approach automatically learns the relative importance of each modality in order to obtain a concise representation of the data. In comparison with conventional approaches, I demonstrated that j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.

    Reasoning-Driven Question-Answering For Natural Language Understanding

    Natural language understanding (NLU) of text is a fundamental challenge in AI, and it has received significant attention throughout the history of NLP research. This primary goal has been studied under different tasks, such as Question Answering (QA) and Textual Entailment (TE). In this thesis, we investigate the NLU problem through the QA task and focus on the aspects that make it a challenge for the current state-of-the-art technology. This thesis is organized into three main parts. In the first part, we explore multiple formalisms to improve existing machine comprehension systems. We propose a formulation for abductive reasoning in natural language and show its effectiveness, especially in domains with limited training data. Additionally, to help reasoning systems cope with irrelevant or redundant information, we create a supervised approach to learn and detect the essential terms in questions. In the second part, we propose two new challenge datasets of natural language questions: (i) the first requires reasoning over multiple sentences; (ii) the second requires temporal common sense reasoning. We hope that the two proposed datasets will motivate the field to address more complex problems. In the final part, we present the first formal framework for multi-step reasoning algorithms, in the presence of a few important properties of language use, such as incompleteness and ambiguity. We apply this framework to prove fundamental limitations for reasoning algorithms. These theoretical results provide extra intuition into the existing empirical evidence in the field.

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    Availability Constrained Routing And Wavelength Assignment And Survivability In Optical Wdm Networks

    Thesis (PhD) -- İstanbul Technical University, Institute of Science and Technology, 2009. In this study, we propose availability-aware routing and wavelength assignment schemes for optical networks under different survivability policies. The proposed techniques are evaluated by simulation. First, we propose heuristic and optimization-driven connection provisioning schemes under shared backup path protection in a resource-plentiful environment. Then, the proposed schemes are modified to work in a resource-limited environment where connections arrive with differentiated availability requirements. The proposed techniques are compared to a conventional reliable connection provisioning algorithm. The simulation results show that the proposed techniques lead to lower connection blocking probability and better connection availability. It is also shown that the proposed techniques keep the resource overbuild due to protection in a feasible range. Moreover, the experimental results show that the optimization-driven technique leads to a decreased resource overbuild in a resource-limited environment for connection arrivals with differentiated availability requirements. The last part of this work deals with shared segment protection. Since there is no specific availability analysis method for shared segment protection, an availability analysis method for this protection scheme is proposed and validated by simulation. Based on this analysis, availability-aware connection provisioning schemes are constructed, their performance is evaluated in resource-plentiful and resource-scarce environments, and the applicability of the schemes is determined in terms of environmental constraints. Degree: Doctorate (PhD).

    Sketching as a Tool for Efficient Networked Systems

    Today, computer systems need to cope with the explosive growth of data in the world. For instance, in data-center networks, monitoring systems are used to measure traffic statistics at high speed; and in financial technology companies, distributed processing systems are deployed to support graph analytics. To fulfill the requirements of handling such large datasets, we build efficient networked systems in a distributed manner most of the time. Ideally, we expect the systems to meet service-level objectives (SLOs) using the fewest resources. However, existing systems constructed with conventional in-memory algorithms face the following challenges: (1) excessive resource requirements (e.g., CPU, ASIC, and memory) with high cost; (2) infeasibility at a larger scale; (3) processing the data too slowly to meet the objectives. To address these challenges, we propose sketching techniques as a tool to build more efficient networked systems. Sketching algorithms aim to process the data with one or several passes in an online, streaming fashion (e.g., a stream of network packets), and compute highly accurate results. With sketching, we only maintain a compact summary of the entire data and provide theoretical guarantees on error bounds. This dissertation argues for a sketching-based design for large-scale networked systems, and demonstrates the benefits in three application contexts: (i) Network monitoring: we build generic monitoring frameworks that support a range of applications on both software and hardware with universal sketches. (ii) Graph pattern mining: we develop a swift, approximate graph pattern miner that scales to very large graphs by leveraging graph sketching techniques. (iii) Halo finding in N-body simulations: we design scalable halo finders on CPU and GPU by leveraging sketch-based heavy hitter algorithms.
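A Count-Min sketch is one simple way to illustrate the idea of keeping a compact summary of a stream. It is a standard textbook structure chosen here as an assumption, not the universal sketch or any specific system from the dissertation; the class name, parameters, and flow labels are hypothetical. Each item is hashed into one counter per row, and the estimate is the minimum over rows, so it never underestimates the true count.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates are always >= the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters: memory is independent of stream length.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Use the row number as a salt so each row has its own hash function.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        """Record `count` occurrences of `item` in a single pass."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return an upper-biased estimate of how often `item` was seen."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical packet stream keyed by flow label.
sketch = CountMinSketch(width=64, depth=3)
for flow in ["flow-a"] * 5 + ["flow-b"] * 2:
    sketch.update(flow)
print(sketch.estimate("flow-a"), sketch.estimate("flow-b"))
```

Widening the rows lowers the chance of collisions and adding rows lowers the chance that every row collides at once. This is the kind of trade-off between memory and error bound that the abstract refers to.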