
    The Complexity of Problems in P Given Correlated Instances

    Instances of computational problems do not exist in isolation. Rather, multiple and correlated instances of the same problem arise naturally in the real world. The challenge is how to gain computationally from correlations when they can be found. [DGH, ITCS 2015] showed that significant computational gains can be made by having access to auxiliary instances which are correlated to the primary problem instance via the solution space. They demonstrated this for constraint satisfaction problems, which are NP-hard in their general worst-case form. Here, we set out to study the impact of having access to correlated instances on the complexity of polynomial-time problems. Namely, for a problem P that is conjectured to require time n^c for some c > 0, we ask whether access to a few instances of P that are correlated in some natural way can be used to solve P on one of them (the designated "primary instance") faster than the conjectured lower bound of n^c. We focus our attention on a number of problems: the Longest Common Subsequence (LCS), the minimum Edit Distance between sequences, and the Dynamic Time Warping Distance (DTWD) of curves, for all of which the best known algorithms achieve O(n^2/polylog(n)) runtime via dynamic programming. These problems form an interesting case in point to study, as it has been shown that an O(n^(2-epsilon)) time algorithm for a worst-case instance would imply improved algorithms for a host of other problems, as well as disprove complexity hypotheses such as the Strong Exponential Time Hypothesis. We show how to use access to a logarithmic number of auxiliary correlated instances to design novel o(n^2) time algorithms for LCS, EDIT, DTWD, and, more generally, improved algorithms for computing any tuple-based similarity measure, a generalization which we define on strings. For the multiple sequence alignment problem on k strings, this yields an O(nk log n) algorithm, in contrast with the classical O(n^k) dynamic programming. Our results hold for several correlation models between the primary and the auxiliary instances. In the most general correlation model we address, we assume that the primary instance is a worst-case instance and the auxiliary instances are chosen with uniform distribution subject to the constraint that their alignments are epsilon-correlated with the optimal alignment of the primary instance. We emphasize that optimal solutions for the auxiliary instances will not generally coincide with optimal solutions for the worst-case primary instance. We view our work as pointing out a new avenue for finding significant improvements for sequence alignment problems and computing similarity measures: taking advantage of access to sequences which are correlated through natural generating processes. In this first work we show how to take advantage of mathematically inspired, simple, clean models of correlation; the intriguing question, looking forward, is to find correlation models which coincide with evolutionary models and other relationships, and for which our approach to multiple sequence alignment gives provable guarantees.
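
    To make the quadratic baseline concrete, the following is a minimal Python sketch of the textbook O(n^2) dynamic program for edit distance, i.e., the bound that the correlated-instance algorithms above aim to beat. It illustrates only the standard recurrence, not the paper's algorithm.

    # Classic O(n*m) edit distance via dynamic programming, keeping two
    # rows of the DP table at a time. This is the quadratic baseline,
    # not the paper's correlated-instance algorithm.
    def edit_distance(a: str, b: str) -> int:
        n, m = len(a), len(b)
        prev = list(range(m + 1))  # row 0: turning "" into b[:j] costs j
        for i in range(1, n + 1):
            curr = [i] + [0] * m   # column 0: turning a[:i] into "" costs i
            for j in range(1, m + 1):
                sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # match/substitute
                ins = curr[j - 1] + 1                       # insert b[j-1]
                dele = prev[j] + 1                          # delete a[i-1]
                curr[j] = min(sub, ins, dele)
            prev = curr
        return prev[m]

    assert edit_distance("kitten", "sitting") == 3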

    Theoretical analysis of edit distance algorithms: an applied perspective

    Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick for designing novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis; worst-case analysis parameterized by edit distance, entropy, or compressibility; average-case analysis; semi-random models; and advice-based models. We find that the track record is mixed. On the one hand, two algorithms widely used in practice were born out of theoretical analysis, and their empirical performance is captured well by theoretical predictions. On the other hand, none of the algorithms developed since then using theoretical analysis as a yardstick has had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
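
    One family of analyses the survey mentions parameterizes running time by the edit distance d itself. The Python sketch below shows a generic banded dynamic program with band doubling, in the spirit of such O(n*d) algorithms; it is an illustration of the idea, not a reproduction of any specific surveyed implementation.

    # Edit distance restricted to a diagonal band of half-width d: any
    # alignment of cost <= d stays within |i - j| <= d, so each attempt
    # costs O(n*d). Doubling d on failure keeps total work O(n*d).
    def banded_edit_distance(a: str, b: str) -> int:
        n, m = len(a), len(b)
        d = max(1, abs(n - m))
        while True:
            INF = d + 1  # sentinel meaning "more than d"; exact value irrelevant
            prev = [j if j <= d else INF for j in range(m + 1)]
            for i in range(1, n + 1):
                curr = [INF] * (m + 1)
                if i <= d:
                    curr[0] = i
                for j in range(max(1, i - d), min(m, i + d) + 1):
                    sub = prev[j - 1] + (a[i - 1] != b[j - 1])
                    curr[j] = min(sub, curr[j - 1] + 1, prev[j] + 1, INF)
                prev = curr
            if prev[m] <= d:
                return prev[m]
            d *= 2  # true distance exceeds the band; widen and retry

    assert banded_edit_distance("kitten", "sitting") == 3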

    Developing bioinformatics approaches for the analysis of influenza virus whole genome sequence data

    Background: Influenza viruses represent a major public health burden worldwide, resulting in an estimated 500,000 deaths per year, with potential for devastating pandemics. Considerable effort is expended in the surveillance of influenza, including major World Health Organization (WHO) initiatives such as the Global Influenza Surveillance and Response System (GISRS). To this end, whole-genome sequencing (WGS), and corresponding bioinformatics pipelines, have emerged as powerful tools. However, due to the inherent diversity of influenza genomes, circulation in several different host species, and noise in short-read data, several pitfalls can appear during bioinformatics processing and analysis. Results: Conventional mapping approaches can be insufficient when a sub-optimal reference strain is chosen. For short-read datasets simulated from human-origin influenza H1N1 HA sequences, read recovery after single-reference mapping was routinely as low as 90% for human-origin influenza sequences, and often lower than 10% for those from avian hosts. To address this, I developed software using de Bruijn Graphs (DBGs) for classification of influenza WGS datasets: VAPOR. In real-data benchmarking using 257 WGS read sets with corresponding de novo assemblies, VAPOR provided classifications for all samples with a mean of >99.8% identity to assembled contigs. This resulted in an increase in the number of mapped reads of 6.8% on average, up to a maximum of 13.3%. Additionally, using simulations, I demonstrated that classification from reads may be applied to the detection of reassorted strains. Conclusions: The approach used in this study has the potential to simplify bioinformatics pipelines for surveillance, providing a novel method for detection of influenza strains of human and non-human origin directly from reads, minimizing the potential data loss and bias associated with conventional mapping, and facilitating alignments that would otherwise require slow de novo assembly. Whilst with expertise and time these pitfalls can largely be avoided, with pre-classification they are remedied in a single step. Furthermore, this algorithm could be adapted in future to the surveillance of other RNA viruses. VAPOR is available at https://github.com/connor-lab/vapor. Lastly, VAPOR could be improved by future implementation in C++, and should employ more efficient methods for DBG representation.
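
    As a rough illustration of the read-classification idea (choosing a reference by shared k-mer content with the read set before mapping), here is a minimal Python sketch. The k-mer length and function names are illustrative assumptions; VAPOR's actual de Bruijn graph construction and scoring are more sophisticated (see the repository above).

    # Minimal sketch: score candidate references by shared k-mer content
    # with the read set, then choose the best-matching reference for
    # downstream mapping. Illustrative only -- not VAPOR's implementation.
    from typing import Dict, Iterable, Set

    K = 21  # k-mer length; an assumed illustrative default

    def kmers(seq: str, k: int = K) -> Set[str]:
        """All length-k substrings of seq."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def classify(reads: Iterable[str], references: Dict[str, str]) -> str:
        """Name of the reference sharing the most k-mers with the reads."""
        read_kmers: Set[str] = set()
        for r in reads:
            read_kmers |= kmers(r)
        return max(references,
                   key=lambda name: len(kmers(references[name]) & read_kmers))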