
    Approximating Weighted Duo-Preservation in Comparative Genomics

    Motivated by comparative genomics, Chen et al. [9] introduced the Maximum Duo-preservation String Mapping (MDSM) problem, in which we are given two strings s_1 and s_2 over the same alphabet and the goal is to find a mapping π between them so as to maximize the number of preserved duos. A duo is any two consecutive characters in a string, and it is preserved by the mapping if its two consecutive characters in s_1 are mapped to the same two consecutive characters in s_2. The MDSM problem is known to be NP-hard and there are approximation algorithms for it [3, 5, 13], but all of them consider only the "unweighted" version of the problem, in the sense that a duo from s_1 is preserved by mapping it to any identical duo in s_2 regardless of their positions in the respective strings. In comparative genomics, however, it is desirable to find mappings that preserve duos that are "closer" to each other under some distance measure [19]. In this paper, we introduce a generalized version of the problem, called the Maximum-Weight Duo-preservation String Mapping (MWDSM) problem, that captures both duo preservation and duo distance, in the sense that mapping a duo from s_1 to a preserved duo in s_2 carries a weight indicating the "closeness" of the two duos. The objective of the MWDSM problem is to find a mapping that maximizes the total weight of preserved duos. We give a polynomial-time 6-approximation algorithm for this problem.
    Comment: appeared in the proceedings of the 23rd International Computing and Combinatorics Conference (COCOON 2017).
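    To make the objective concrete, here is a minimal sketch (not the paper's 6-approximation) that evaluates a candidate mapping: given s_1, s_2, a position mapping, and a weight function, it sums the weights of the duos the mapping preserves. The inverse-distance weight shown is a hypothetical choice of "closeness" measure, purely for illustration.

```python
def preserved_duo_weight(s1, s2, pi, w):
    """Total weight of duos preserved by the mapping pi.

    pi[i] is the position in s2 that position i of s1 maps to.  The duo
    (i, i+1) of s1 is preserved when it is mapped to consecutive, equal
    characters (j, j+1) in s2; each preserved duo contributes w(i, j).
    """
    total = 0.0
    for i in range(len(s1) - 1):
        j = pi[i]
        if pi[i + 1] == j + 1 and s1[i] == s2[j] and s1[i + 1] == s2[j + 1]:
            total += w(i, j)  # with w = lambda i, j: 1 this counts duos (MDSM)
    return total

# Hypothetical "closeness" weight: duos mapped to nearby positions weigh more.
inverse_distance = lambda i, j: 1.0 / (1 + abs(i - j))

s1, s2 = "abcab", "ababc"
pi = [2, 3, 4, 0, 1]  # a valid character-preserving bijection
print(preserved_duo_weight(s1, s2, pi, inverse_distance))  # 1/3 + 1/3 + 1/4
```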

    Fast Matching-based Approximations for Maximum Duo-Preservation String Mapping and its Weighted Variant


    The Maximum Duo-Preservation String Mapping Problem with Bounded Alphabet

    Given two strings A and B such that B is a permutation of A, the max duo-preservation string mapping (MPSM) problem asks for a mapping π between them that preserves a maximum number of duos. A duo is any pair of consecutive characters in a string, and it is preserved by π if its two consecutive characters in A are mapped to the same two consecutive characters in B. This problem has received growing attention in recent years, partly as an alternative way to produce approximation algorithms for its minimization counterpart, min common string partition, a widely studied problem due to its applications in comparative genomics. Considering this favored field of application, where alphabets are small, it is surprising that MPSM^ℓ, the variant of MPSM with an alphabet of bounded size ℓ, has received so little attention, with a single yet impressive work that provides a 2.67-approximation in O(n) time [Brubach, 2018], where n = |A| = |B|. Our work focuses on MPSM^ℓ, and our main contribution is the demonstration that this problem admits a Polynomial Time Approximation Scheme (PTAS) when ℓ = O(1). We also provide an alternative, somewhat simpler proof of NP-hardness for this problem compared with the one presented in [Haitao Jiang et al., 2012].
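    On instances small enough to enumerate, the optimum itself is easy to state as a brute-force search over all character-preserving bijections; the exhaustive sketch below (unrelated to the PTAS, and exponential in n) only serves to pin down the objective:

```python
from itertools import permutations

def max_duo_preservation(a, b):
    """Exact MPSM by exhaustive search -- exponential, tiny inputs only.

    Tries every bijection from positions of a to positions of b that maps
    equal characters onto each other, and counts duos mapped to consecutive
    positions (character equality of the duo then holds automatically).
    """
    best = 0
    for perm in permutations(range(len(a))):
        if any(a[i] != b[perm[i]] for i in range(len(a))):
            continue  # not character-preserving
        preserved = sum(1 for i in range(len(a) - 1) if perm[i + 1] == perm[i] + 1)
        best = max(best, preserved)
    return best

print(max_duo_preservation("abcab", "ababc"))  # 3
```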

    A Family of Approximation Algorithms for the Maximum Duo-Preservation String Mapping Problem

    In the Maximum Duo-Preservation String Mapping problem we are given two strings and wish to map the letters of the former to the letters of the latter so as to maximise the number of preserved duos. A duo is a pair of consecutive letters; it is preserved if it is mapped to a pair of consecutive letters in the same order. This is complementary to the well-studied Minimum Common String Partition problem, where the goal is to partition the former string into blocks that can be permuted and concatenated to obtain the latter string. Maximum Duo-Preservation String Mapping is APX-hard. After a series of improvements, Brubach [WABI 2016] showed a polynomial-time 3.25-approximation algorithm. Our main contribution is that, for any ε > 0, there exists a polynomial-time (2+ε)-approximation algorithm. Similarly to a previous solution by Boria et al. [CPM 2016], our algorithm uses the local search technique. However, local search is applied only after a certain preliminary greedy procedure, which gives us more structure and makes a more general local search possible. We complement this with a specialised version of the algorithm that achieves a 2.67-approximation in quadratic time.
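    As a rough illustration of the greedy flavor of such algorithms (a crude baseline, not the (2+ε)-approximation or the paper's preliminary procedure), one can match each duo of the first string to the first unused identical duo of the second, never reusing a position of either string:

```python
def greedy_duo_matching(a, b):
    """Greedily match each duo of a to the first unused identical duo of b.

    Returns pairs (i, j) meaning duo a[i:i+2] is matched to b[j:j+2], with
    no position of either string used twice, so the matched positions form
    a consistent partial mapping.  Illustration only, with no guarantee.
    """
    used_a, used_b = set(), set()
    matches = []
    for i in range(len(a) - 1):
        if i in used_a or i + 1 in used_a:
            continue
        for j in range(len(b) - 1):
            if j in used_b or j + 1 in used_b:
                continue
            if a[i:i + 2] == b[j:j + 2]:
                matches.append((i, j))
                used_a.update((i, i + 1))
                used_b.update((j, j + 1))
                break
    return matches

print(greedy_duo_matching("abcab", "ababc"))  # [(0, 0), (3, 2)]
```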

    Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping

    In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in time 4^k * n^{O(1)}, and a deterministic algorithm that solves this problem in time 6.855^k * n^{O(1)}. The previous best known (deterministic) algorithm for this problem has running time (8e)^{2k+o(k)} * n^{O(1)} [Beretta et al., Theor. Comput. Sci. 2016]. We also show that Max-Duo PSM admits a problem kernel of size O(k^3), improving upon the previous best known problem kernel of size O(k^6).

    Markets, Elections, and Microbes: Data-driven Algorithms from Theory to Practice

    Many modern problems in algorithms and optimization are driven by data, which often carries with it an element of uncertainty. In this work, we conduct an investigation into algorithmic foundations and applications across three main areas.

    The first area is online matching algorithms for e-commerce applications such as online sales and advertising. The importance of e-commerce in modern business cannot be overstated, and even minor algorithmic improvements can have huge impacts. In online matching problems, we generally have a known offline set of goods or advertisements, while users arrive online and allocations must be made immediately and irrevocably when a user arrives. In the real world, however, there is also uncertainty about a user's true interests, and this can be modeled by considering matching problems in a graph with stochastic edges that only have a probability of existing. These edges can represent the probability of a user purchasing a product or clicking on an ad. Thus, we optimize over data which only provides an estimate of what types of users will arrive and what they will prefer. We survey a broad landscape of problems in this area, gain a deeper understanding of the algorithmic challenges, and present algorithms with improved worst-case performance.

    The second area is constrained clustering, where we explore classical clustering problems with additional constraints on which data points should be clustered together. Utilizing these constraints is important for many clustering problems because they can be used to ensure fairness, exploit expert advice, or capture natural properties of the data. In the simplest case, some pairs of points have "must-link" constraints requiring that they be clustered together (see the sketch after this abstract). Moving into stochastic settings, we can describe more general pairwise constraints, such as bounding the probability that two points are separated into different clusters. This lets us introduce a new notion of fairness for clustering and address stochastic problems such as semi-supervised learning with advice from imperfect experts. Here, we introduce new models of constrained clustering, including new notions of fairness for clustering applications. Since these problems are NP-hard, we give approximation algorithms and in some cases conduct experiments to explore how the algorithms perform in practice. Finally, we look closely at the particular clustering problem of drawing election districts and show how constraining the clusters based on past voting data can interact with voter incentives.

    The third area is string algorithms for bioinformatics, and metagenomics specifically, where the data deluge from next-generation sequencing drives the necessity for new algorithms that are both fast and accurate. For metagenomic analysis, we present a tool for clustering a microbial marker gene, the 16S ribosomal RNA gene. On the more theoretical side, we present a succinct application of the Method of the Four Russians to edit distance computation, as well as new algorithms and bounds for the maximum duo-preservation string mapping (MPSM) problem.
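    As one concrete example from the clustering part above, must-link constraints are commonly handled by a folklore preprocessing step (a sketch of the standard technique, not of the thesis's algorithms): merge constrained points with a union-find structure, then cluster the merged groups as single units.

```python
class UnionFind:
    """Merge must-link points so each group can be clustered as one unit."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Points 0..5 with must-link pairs: {0,1}, {1,2} and {4,5} must stay together.
uf = UnionFind(6)
for x, y in [(0, 1), (1, 2), (4, 5)]:
    uf.union(x, y)
groups = {}
for p in range(6):
    groups.setdefault(uf.find(p), []).append(p)
print(list(groups.values()))  # [[0, 1, 2], [3], [4, 5]]
```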

    28th Annual Symposium on Combinatorial Pattern Matching: CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    Compression, Modeling, and Real-Time Rendering of Realistic Materials and Objects

    The realism of a scene essentially depends on the quality of the geometry, the illumination, and the materials that are used. Whereas many sources for the creation of three-dimensional geometry exist and numerous algorithms for the approximation of global illumination have been presented, the acquisition and rendering of realistic materials remains a challenging problem. Realistic materials are very important in computer graphics because they describe the reflectance properties of surfaces, which are based on the interaction of light and matter. In the real world, an enormous diversity of materials can be found, comprising very different properties. One important objective in computer graphics is to understand these processes, to formalize them, and finally to simulate them. Various analytical models already exist for this purpose, but their parameterization remains difficult, as the number of parameters is usually very high. They also fail for very complex materials that occur in the real world. Measured materials, on the other hand, are prone to long acquisition times and huge input data sizes. Although very efficient statistical compression algorithms have been presented, most of them do not allow for editability, such as altering the diffuse color or mesostructure.

    In this thesis, a material representation is introduced that makes it possible to edit these features. This makes it possible to re-use the acquisition results in order to easily and quickly create deviations of the original material. These deviations may be subtle but also substantial, allowing for a wide spectrum of material appearances. The approach presented in this thesis is not based on compression, but on a decomposition of the surface into several materials with different reflection properties. Based on a microfacet model, the light-matter interaction is represented by a function that can be stored in an ordinary two-dimensional texture. Additionally, depth information, local rotations, and the diffuse color are stored in these textures. As a result of the decomposition, some of the original information is inevitably lost; therefore, an algorithm for the efficient simulation of subsurface scattering is presented as well.

    Another contribution of this work is a novel perception-based simplification metric that includes the material of an object. This metric comprises features of the human visual system, for example trichromatic color perception or reduced resolution. The proposed metric allows for a more aggressive simplification in regions where geometric metrics do not simplify.
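    For background on the microfacet model mentioned above, the sketch below evaluates a generic Cook-Torrance specular term (GGX distribution, Schlick-GGX visibility, Schlick Fresnel). This is textbook material, not the decomposition developed in the thesis, and the parameter values are arbitrary.

```python
import math

def ggx_specular(n_dot_l, n_dot_v, n_dot_h, v_dot_h, roughness, f0):
    """Generic Cook-Torrance specular term: D * G * F / (4 cosL cosV)."""
    a2 = roughness ** 4                       # alpha = roughness^2, squared
    d = a2 / (math.pi * (n_dot_h ** 2 * (a2 - 1) + 1) ** 2)  # GGX NDF
    k = (roughness + 1) ** 2 / 8              # Schlick-GGX geometry constant
    g = (n_dot_l / (n_dot_l * (1 - k) + k)) * (n_dot_v / (n_dot_v * (1 - k) + k))
    f = f0 + (1 - f0) * (1 - v_dot_h) ** 5    # Schlick Fresnel
    return d * g * f / (4 * n_dot_l * n_dot_v)

# Arbitrary example angles for a slightly rough dielectric (f0 = 0.04).
print(ggx_specular(0.8, 0.9, 0.95, 0.7, roughness=0.3, f0=0.04))
```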

    Secure and Efficient Comparisons between Untrusted Parties

    A vast number of online services are based on users contributing their personal information. Examples are manifold, including social networks, electronic commerce, sharing websites, lodging platforms, and genealogy. In all cases, user privacy depends on a collective trust in all involved intermediaries, such as service providers, operators, administrators, or even help-desk staff. A single adversarial party in the whole chain of trust voids user privacy, and the number of intermediaries is ever growing. Thus, user privacy must be preserved at every stage, independent of the intrinsic goals of any involved party. Furthermore, alongside these new services, traditional offline analytic systems are being replaced by online services run in large data centers. Centralized processing of electronic medical records, genomic data, and other health-related information is anticipated due to advances in medical research, better analytic results based on large amounts of medical information, and lowered costs. In these scenarios, privacy is of utmost concern due to the large amount of personal information contained within the centralized data.

    We focus on the challenge of privacy-preserving processing of genomic data, specifically comparing genomic sequences. The problem that arises is how to efficiently compare the private sequences of two parties while preserving the confidentiality of the compared data. It follows that the privacy of the data owner must be preserved, which means that as little information as possible must be leaked to any party participating in the comparison. Leakage can happen at several points during a comparison: the secured inputs for the comparing party might leak some information about the original input, or the output might leak information about the inputs. In the latter case, the results of several comparisons can be combined to infer information about the confidential input of the party under observation. Genomic sequences serve as a use case, but the proposed solutions are more general and can be applied to the generic field of privacy-preserving comparison of sequences. The solution should be efficient, such that performing a comparison yields runtimes linear in the length of the input sequences and thus acceptable costs for a typical use case.

    To tackle the problem of efficient, privacy-preserving sequence comparison, we propose a framework consisting of three main parts. a) The basic protocol presents an efficient sequence comparison algorithm, which transforms a sequence into a set representation, allowing distance measures over input sequences to be approximated by distance measures over sets. The sets are then represented by an efficient data structure (the Bloom filter), which allows certain set operations to be evaluated without storing the actual elements of the possibly large set. This representation yields low distortion when comparing similar sequences. Operations on the set representation are carried out using efficient, partially homomorphic cryptographic systems to preserve the confidentiality of the inputs. The output can be adjusted to return either the actual approximated distance or the result of an in-range check of the approximated distance.

    b) Building upon this efficient basic protocol, we introduce the first mechanism to reduce the success of inference attacks by detecting and rejecting similar queries in a privacy-preserving way. This is achieved by generating generalized commitments for inputs. The generalization is done by treating inputs as messages received from a noisy channel, to which error correction from coding theory is applied. In this way, two inputs are considered similar if the Hamming distance between their generalizations is below a certain predefined threshold. We present a protocol to perform a zero-knowledge proof assessing whether the generalized input is indeed a generalization of the actual input. Furthermore, we generalize a very efficient inference attack on privacy-preserving sequence comparison protocols and use it to evaluate our inference-control mechanism.

    c) The third part of the framework lightens the computational load of the client taking part in the comparison protocol by presenting a compression mechanism for partially homomorphic cryptographic schemes. It reduces the transmission and storage overhead induced by the semantically secure homomorphic encryption schemes, as well as encryption latency. The compression is achieved by constructing an asymmetric stream cipher such that the generated ciphertext can be converted into a ciphertext of an associated homomorphic encryption scheme without revealing any information about the plaintext. This is the first compression scheme available for partially homomorphic encryption schemes. Compression schemes for fully homomorphic encryption are several orders of magnitude slower at converting the transmission ciphertext into the homomorphically encrypted ciphertext; our compression scheme achieves optimal conversion performance. It further allows keystreams to be generated offline and thus supports offloading to trusted devices, improving transmission, storage, and power efficiency.

    We give security proofs for all relevant parts of the proposed protocols and algorithms. A performance evaluation of the core components, comprising a theoretical analysis and practical experiments, demonstrates the practicability of the proposed solutions and shows the accuracy and efficiency of the approximations and probabilistic algorithms. Several variations and configurations for detecting similar inputs are studied in an in-depth discussion of the inference-control mechanism. A human mitochondrial genome database is used for the practical evaluation to compare genomic sequences and detect similar inputs, as described by the use case. In summary, we show that it is indeed possible to construct an efficient and privacy-preserving (genomic) sequence comparison while controlling the amount of information that leaves the comparison. To the best of our knowledge, we also contribute to the field by proposing the first efficient privacy-preserving inference detection and control mechanism, as well as the first ciphertext compression system for partially homomorphic cryptographic systems.
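    The set representation in part a) can be illustrated in the clear, without the homomorphic layer: decompose each sequence into q-grams, insert the q-grams into a Bloom filter, and approximate a set-similarity coefficient from the filters' bit vectors. The parameters below (q = 2, a 256-bit filter, 4 hash functions) are illustrative assumptions, not the values used in the thesis.

```python
import hashlib

def bloom(seq, q=2, m=256, k=4):
    """Insert the q-gram set of seq into an m-bit Bloom filter with k hashes."""
    bits = [0] * m
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        for j in range(k):
            h = hashlib.sha256(f"{j}:{gram}".encode()).digest()
            bits[int.from_bytes(h[:4], "big") % m] = 1
    return bits

def dice(b1, b2):
    """Approximate Dice similarity of the underlying q-gram sets."""
    inter = sum(x & y for x, y in zip(b1, b2))
    return 2 * inter / (sum(b1) + sum(b2))

print(dice(bloom("ACGTACGT"), bloom("ACGTACGA")))  # similar sequences -> near 1
print(dice(bloom("ACGTACGT"), bloom("TTTTCCCC")))  # dissimilar -> near 0
```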