
    The minimum-entropy set cover problem

    We consider the minimum entropy principle for learning data generated by a random source and observed with random noise. In our setting we have a sequence of observations of objects drawn uniformly at random from a population. Each object in the population belongs to one class. We perform an observation for each object which determines that it belongs to one of a given set of classes. Given these observations, we are interested in assigning the most likely class to each of the objects. This scenario is a very natural one that appears in many real-life situations. We show that under reasonable assumptions, finding the most likely assignment is equivalent to the following variant of the set cover problem. Given a universe U and a collection S = (S1, …, St) of subsets of U, we wish to find an assignment f : U → S such that u ∈ f(u) and the entropy of the distribution defined by the values |f⁻¹(Si)| is minimized. We show that this problem is NP-hard and that the greedy set cover algorithm approximates the optimal cover to within an additive constant error. This sheds new light on the behavior of the greedy set cover algorithm. We further enhance the greedy algorithm and show that the problem admits a polynomial-time approximation scheme (PTAS). Finally, we demonstrate how this model and the greedy algorithm can be useful in real-life scenarios, in particular in problems arising naturally in computational biology.
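    The objective and the greedy heuristic can be made concrete with a small sketch (the set names and the example instance are invented for illustration; this is a didactic version, not the authors' enhanced algorithm or the PTAS):

```python
import math
from collections import defaultdict

def greedy_min_entropy_cover(universe, sets):
    """Greedy cover: repeatedly assign all still-uncovered elements
    to the set that covers the most of them."""
    uncovered = set(universe)
    assignment = {}
    while uncovered:
        # pick the set covering the most uncovered elements
        name, members = max(sets.items(), key=lambda kv: len(kv[1] & uncovered))
        newly = members & uncovered
        for u in newly:
            assignment[u] = name
        uncovered -= newly
    return assignment

def cover_entropy(assignment):
    """Entropy (in bits) of the distribution of part sizes |f^-1(Si)|."""
    counts = defaultdict(int)
    for s in assignment.values():
        counts[s] += 1
    n = len(assignment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy instance: greedy assigns {1,2,3,4} to A, then {5,6} to C,
# giving part sizes (4, 2) and entropy log2(3) - 2/3 ≈ 0.918 bits.
f = greedy_min_entropy_cover({1, 2, 3, 4, 5, 6},
                             {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}})
```

    The abstract's additive-error guarantee says the entropy of the greedy assignment exceeds the minimum-entropy assignment's by at most a constant number of bits.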

    Minimum Entropy Orientations

    We study graph orientations that minimize the entropy of the in-degree sequence. The problem of finding such an orientation is an interesting special case of the minimum entropy set cover problem previously studied by Halperin and Karp [Theoret. Comput. Sci., 2005] and by the current authors [Algorithmica, to appear]. We prove that the minimum entropy orientation problem is NP-hard even if the graph is planar, and that there exists a simple linear-time algorithm that returns an approximate solution with an additive error guarantee of 1 bit. This improves on the only previously known algorithm, which has an additive error guarantee of log_2 e bits (approx. 1.4427 bits).
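    The quantity being minimized can be sketched as follows (a toy helper under my own conventions, not the paper's linear-time algorithm): the in-degrees of an orientation sum to the number of edges m, so they induce a distribution p_v = d_v / m whose entropy is the objective.

```python
import math
from collections import Counter

def indegree_entropy(orientation):
    """Entropy (bits) of the in-degree distribution of an orientation,
    given as a list of directed edges (u, v) meaning u -> v."""
    indeg = Counter(v for _, v in orientation)
    m = len(orientation)  # in-degrees sum to the number of edges
    return -sum((d / m) * math.log2(d / m) for d in indeg.values())

# Orienting the path a-b-c toward b concentrates in-degree (entropy 0),
# while a -> b -> c splits it evenly over two vertices (entropy 1 bit).
h_concentrated = indegree_entropy([("a", "b"), ("c", "b")])
h_split = indegree_entropy([("a", "b"), ("b", "c")])
```

    Minimizing this entropy favors orientations that concentrate in-degree on few vertices, which is the link to minimum entropy set cover.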

    Schema-agnostic entity retrieval in highly heterogeneous semi-structured environments

    [no abstract]

    Set covering with our eyes closed

    Given a universe U of n elements and a weighted collection S of m subsets of U, the universal set cover problem is to a priori map each element u ∈ U to a set S(u) ∈ S containing u, such that any set X ⊆ U is covered by S(X) = ∪_{u∈X} S(u). The aim is to find a mapping such that the cost of S(X) is as close as possible to the optimal set cover cost for X. (Such problems are also called oblivious or a priori optimization problems.) Unfortunately, for every universal mapping, the cost of S(X) can be Ω(√n) times larger than optimal if the set X is adversarially chosen. In this paper we study the performance on average, when X is a set of randomly chosen elements from the universe: we show how to efficiently find a universal map whose expected cost is O(log mn) times the expected optimal cost. In fact, we give a slightly improved analysis and show that this is the best possible. We generalize these ideas to weighted set cover and show similar guarantees for (nonmetric) facility location, where we have to balance the facility opening cost with the cost of connecting clients to the facilities. We show applications of our results to universal multicut and disc-covering problems, and show how all these universal mappings give us algorithms for the stochastic online variants of the problems with the same competitive factors.
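    The cost model for a fixed universal mapping can be sketched directly (hypothetical helper names; the hard part, constructing a good mapping, is the paper's contribution and is not attempted here): once every element is pre-assigned a set, covering a query X means paying once for each distinct set used, and the expected cost over random X can be estimated by sampling.

```python
import random

def universal_cover_cost(mapping, weights, X):
    """Cost of the a priori cover S(X): each distinct mapped set's
    weight is paid exactly once, regardless of how many elements of X
    it covers."""
    return sum(weights[s] for s in {mapping[u] for u in X})

def expected_cost(mapping, weights, universe, k, trials=1000, seed=0):
    """Monte Carlo estimate of the expected cover cost when X is a
    uniformly random k-subset of the universe."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        X = rng.sample(sorted(universe), k)
        total += universal_cover_cost(mapping, weights, X)
    return total / trials

# Elements 1 and 2 share set A, so querying {1, 2} pays A's weight once.
mapping = {1: "A", 2: "A", 3: "B"}
weights = {"A": 2.0, "B": 5.0}
```

    The abstract's guarantee concerns exactly this expectation: a good universal mapping keeps it within an O(log mn) factor of the expected optimal cover cost.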

    Brotli: A General-Purpose Data Compressor

    Brotli is an open source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub and its data format was submitted as RFC 7932 in July 2016. Brotli is based on the Lempel-Ziv compression scheme and planned as a generic replacement of Gzip and ZLib. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This article is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software engineering problems that deserve further attention from the scientific community.
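    Since Brotli builds on the Lempel-Ziv scheme, a toy LZ77 round trip illustrates the core back-reference idea (a didactic sketch only, not the Brotli format: real Brotli additionally uses a static dictionary, context modeling, and entropy coding, and is exposed in Python by the third-party `brotli` package):

```python
def lz77_compress(data, window=32):
    """Toy LZ77: emit (offset, length, next_char) triples, scanning a
    bounded back-window for the longest match at each position."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # matches may overlap the current position, as in real LZ77
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Rebuild the text by copying `length` chars from `offset` back,
    then appending the literal next_char."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])
        buf.append(ch)
    return "".join(buf)
```

    Decoding is just sequential copying, which reflects the design goal quoted above: make decompression cheap even if the encoder works hard to find good matches.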

    Algorithms for Viral Population Analysis

    The genetic structure of an intra-host viral population has an effect on many clinically important phenotypic traits such as escape from vaccine-induced immunity, virulence, and response to antiviral therapies. Next-generation sequencing provides read coverage sufficient for genomic reconstruction of a heterogeneous, yet highly similar, viral population, and more specifically, for the detection of rare variants. Admittedly, while depth is less of an issue for modern sequencers, the short length of generated reads complicates viral population assembly. This task is worsened by the presence of both random and systematic sequencing errors in huge amounts of data. In this dissertation I present completed work for reconstructing a viral population given next-generation sequencing data. Several algorithms are described for solving this problem under the error-free amplicon (or sliding-window) model. In order for these methods to handle actual real-world data, an error-correction method is proposed. A formal derivation of its likelihood model along with optimization steps for an EM algorithm are presented. Although these methods perform well, they cannot take into account paired-end sequencing data. In order to address this, a new method is detailed that works under the error-free paired-end case along with maximum a posteriori estimation of the model parameters.