236 research outputs found

    Pivot Selection for Median String Problem

    Full text link
    The Median String Problem is W[1]-Hard under the Levenshtein distance, thus, approximation heuristics are used. Perturbation-based heuristics have been proved to be very competitive as regards the ratio approximation accuracy/convergence speed. However, the computational burden increase with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e. pivots, that are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 88\% with approximation accuracy as good as the state of the art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014−631400742014-63140074 through a Ph.D. Scholarship; Universidad Cat\'{o}lica de la Sant\'{i}sima Concepci\'{o}n through the research project DIN-01/2016; European Union's Horizon 2020 under the Marie Sk\l odowska-Curie grant agreement 690941690941; Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant number 11704971170497; and for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58

    An Efficient Rank Based Approach for Closest String and Closest Substring

    Get PDF
    This paper aims to present a new genetic approach that uses rank distance for solving two known NP-hard problems, and to compare rank distance with other distance measures for strings. The two NP-hard problems we are trying to solve are closest string and closest substring. For each problem we build a genetic algorithm and we describe the genetic operations involved. Both genetic algorithms use a fitness function based on rank distance. We compare our algorithms with other genetic algorithms that use different distance measures, such as Hamming distance or Levenshtein distance, on real DNA sequences. Our experiments show that the genetic algorithms based on rank distance have the best results

    Boosting Perturbation-Based Iterative Algorithms to Compute the Median String

    Get PDF
    [Abstract] The most competitive heuristics for calculating the median string are those that use perturbation-based iterative algorithms. Given the complexity of this problem, which under many formulations is NP-hard, the computational cost involved in the exact solution is not affordable. In this work, the heuristic algorithms that solve this problem are addressed, emphasizing its initialization and the policy to order possible editing operations. Both factors have a significant weight in the solution of this problem. Initial string selection influences the algorithm’s speed of convergence, as does the criterion chosen to select the modification to be made in each iteration of the algorithm. To obtain the initial string, we use the median of a subset of the original dataset; to obtain this subset, we employ the Half Space Proximal (HSP) test to the median of the dataset. This test provides sufficient diversity within the members of the subset while at the same time fulfilling the centrality criterion. Similarly, we provide an analysis of the stop condition of the algorithm, improving its performance without substantially damaging the quality of the solution. To analyze the results of our experiments, we computed the execution time of each proposed modification of the algorithms, the number of computed editing distances, and the quality of the solution obtained. With these experiments, we empirically validated our proposal.This work was supported in part by the Comisión Nacional de Investigación Científica y Tecnológica - Programa de Formación de Capital Humano Avanzado (CONICYT-PCHA)/Doctorado Nacional/2014-63140074 through the Ph.D. Scholarship, in part by the European Union's Horizon 2020 under the Marie Sklodowska-Curie under Grant 690941, in part by the Millennium Institute for Foundational Research on Data (IMFD), and in part by the FONDECYT-CONICYT under Grant 1170497. The work of ÓSCAR PEDREIRA was supported in part by the Xunta de Galicia/FEDER-UE refs under Grant CSI ED431G/01 and Grant GRC: ED431C 2017/58, in part by the Office of the Vice President for Research and Postgraduate Studies of the Universidad Católica de Temuco, VIPUCT Project 2020EM-PS-08, and in part by the FEQUIP 2019-INRN-03 of the Universidad Católica de TemucoXunta de Galicia; ED431G/01Xunta de Galicia; ED431C 2017/58Chile. Comisión Nacional de Investigación Científica y Tecnológica; 2014-63140074Chile. Comisión Nacional de Investigación Científica y Tecnológica; 1170497Universidad Católica de Temuco (Chile); 2020EM-PS-08Universidad Católica de Temuco (Chile); 2019-INRN-0

    Approximate Trace Reconstruction via Median String (In Average-Case)

    Get PDF
    We consider an \emph{approximate} version of the trace reconstruction problem, where the goal is to recover an unknown string s∈{0,1}ns\in\{0,1\}^n from mm traces (each trace is generated independently by passing ss through a probabilistic insertion-deletion channel with rate pp). We present a deterministic near-linear time algorithm for the average-case model, where ss is random, that uses only \emph{three} traces. It runs in near-linear time O~(n)\tilde O(n) and with high probability reports a string within edit distance O(ϵpn)O(\epsilon p n) from ss for ϵ=O~(p)\epsilon=\tilde O(p), which significantly improves over the straightforward bound of O(pn)O(pn). Technically, our algorithm computes a (1+ϵ)(1+\epsilon)-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown ss. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time O(n3)O(n^3)

    Markets, Elections, and Microbes: Data-driven Algorithms from Theory to Practice

    Get PDF
    Many modern problems in algorithms and optimization are driven by data which often carries with it an element of uncertainty. In this work, we conduct an investigation into algorithmic foundations and applications across three main areas. The first area is online matching algorithms for e-commerce applications such as online sales and advertising. The importance of e-commerce in modern business cannot be overstated and even minor algorithmic improvements can have huge impacts. In online matching problems, we generally have a known offline set of goods or advertisements while users arrive online and allocations must be made immediately and irrevocably when a user arrives. However, in the real world, there is also uncertainty about a user's true interests and this can be modeled by considering matching problems in a graph with stochastic edges that only have a probability of existing. These edges can represent the probability of a user purchasing a product or clicking on an ad. Thus, we optimize over data which only provides an estimate of what types of users will arrive and what they will prefer. We survey a broad landscape of problems in this area, gain a deeper understanding of the algorithmic challenges, and present algorithms with improved worst case performance The second area is constrained clustering where we explore classical clustering problems with additional constraints on which data points should be clustered together. Utilizing these constraints is important for many clustering problems because they can be used to ensure fairness, exploit expert advice, or capture natural properties of the data. In simplest case, this can mean some pairs of points have ``must-link'' constraints requiring that that they must be clustered together. Moving into stochastic settings, we can describe more general pairwise constraints such as bounding the probability that two points are separated into different clusters. This lets us introduce a new notion of fairness for clustering and address stochastic problems such as semi-supervised learning with advice from imperfect experts. Here, we introduce new models of constrained clustering including new notions of fairness for clustering applications. Since these problems are NP-hard, we give approximation algorithms and in some cases conduct experiments to explore how the algorithms perform in practice. Finally, we look closely at the particular clustering problem of drawing election districts and show how constraining the clusters based on past voting data can interact with voter incentives. The third area is string algorithms for bioinformatics and metagenomics specifically where the data deluge from next generation sequencing drives the necessity for new algorithms that are both fast and accurate. For metagenomic analysis, we present a tool for clustering a microbial marker gene, the 16S ribosomal RNA gene. On the more theoretical side, we present a succinct application of the Method of the Four Russians to edit distance computation as well as new algorithms and bounds for the maximum duo-preservation string mapping (MPSM) problem

    Matching records in multiple databases using a hybridization of several technologies.

    Get PDF
    A major problem with integrating information from multiple databases is that the same data objects can exist in inconsistent data formats across databases and a variety of attribute variations, making it difficult to identify matching objects using exact string matching. In this research, a variety of models and methods have been developed and tested to alleviate this problem. A major motivation for this research is that the lack of efficient tools for patient record matching still exists for health care providers. This research is focused on the approximate matching of patient records with third party payer databases. This is a major need for all medical treatment facilities and hospitals that try to match patient treatment records with records of insurance companies, Medicare, Medicaid and the veteran\u27s administration. Therefore, the main objectives of this research effort are to provide an approximate matching framework that can draw upon multiple input service databases, construct an identity, and match to third party payers with the highest possible accuracy in object identification and minimal user interactions. This research describes the object identification system framework that has been developed from a hybridization of several technologies, which compares the object\u27s shared attributes in order to identify matching object. Methodologies and techniques from other fields, such as information retrieval, text correction, and data mining, are integrated to develop a framework to address the patient record matching problem. This research defines the quality of a match in multiple databases by using quality metrics, such as Precision, Recall, and F-measure etc, which are commonly used in Information Retrieval. The performance of resulting decision models are evaluated through extensive experiments and found to perform very well. The matching quality performance metrics, such as precision, recall, F-measure, and accuracy, are over 99%, ROC index are over 99.50% and mismatching rates are less than 0.18% for each model generated based on different data sets. This research also includes a discussion of the problems in patient records matching; an overview of relevant literature for the record matching problem and extensive experimental evaluation of the methodologies, such as string similarity functions and machine learning that are utilized. Finally, potential improvements and extensions to this work are also presented

    Searching, clustering and evaluating biological sequences

    Get PDF
    The latest generation of biological sequencing technologies have made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponential, and so far, has outpaced the rate of improvement of computer speed and capacity. This rate of growth, however, makes analysis of new datasets increasingly difficult, and highlights the need for efficient, scalable and modular software tools. Fortunately most types of analysis of sequence data involve a few fundamental operations. Here we study three such problems, namely searching for local alignments between two sets of sequences, clustering sequences, and evaluating the assemblies made from sequence fragments. We present simple and efficient heuristic algorithms for these problems, as well as open source software tools which implement these algorithms. First, we present approximate seeds; a new type of seed for local alignment search. Approximate seeds are a generalization of exact seeds and spaced seeds, in that they allow for insertions and deletions within the seed. We prove that approximate seeds are completely sensitive. We also show how to efficiently find approximate seeds using a suffix array index of the sequences. Next, we present DNACLUST; a tool for clustering millions of DNA sequence fragments. Although DNACLUST has been primarily made for clustering 16S ribosomal RNA sequences, it can be used for other tasks, such as removing duplicate or near duplicate sequences from a dataset. Finally, we present a framework for comparing (two or more) assemblies built from the same set of reads. Our evaluation requires the set of reads and the assemblies only, and does not require the true genome sequence. Therefore our method can be used in de novo assembly projects, where the true genome is not known. Our score is based on probability theory, and the true genome is expected to obtain the maximum score
    • …
    corecore