    Subsequences and Supersequences of Strings

    Stringology - the study of strings - is a branch of algorithmics which has been the subject of mounting interest in recent years. Very recently, two books [M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1995] and [G. Stephen, String Searching Algorithms, World Scientific, 1994] have been published on the subject, and at least two others are known to be in preparation. Problems on strings arise in information retrieval, version control, automatic spelling correction, and many other domains. However, the greatest motivation for recent work in stringology has come from the field of molecular biology. String problems occur, for example, in genetic sequence construction, genetic sequence comparison, and phylogenetic tree construction. In this thesis we study a variety of string problems from a theoretical perspective. In particular, we focus on problems involving subsequences and supersequences of strings.
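
    The two central relations studied here can be made concrete in a few lines of code. The sketch below (function name illustrative, not taken from the thesis) checks whether one string is a subsequence of another, which is the same as checking that the second is a supersequence of the first.

    def is_subsequence(s: str, t: str) -> bool:
        """True if s is a subsequence of t, i.e. t is a supersequence of s:
        the characters of s appear in t in order, not necessarily contiguously."""
        it = iter(t)
        return all(ch in it for ch in s)

    assert is_subsequence("ace", "abcde")       # a, c, e appear in order
    assert not is_subsequence("aec", "abcde")   # order matters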

    Towards a better solution to the shortest common supersequence problem: the deposition and reduction algorithm

    BACKGROUND: The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequence analysis. The SCS problem is well known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences. RESULTS: In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: a deposition process and a reduction process. The deposition process is responsible for generating a small set of common supersequences; the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences shows that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, especially on large instances. CONCLUSION: Our DR algorithm provides a partial answer to the open problem of designing an efficient heuristic algorithm for the SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient in both running time and space complexity, and our evaluation shows that it is practical even for SCS problems on many long sequences.
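
    As a rough illustration of the two phases described above (a simplified sketch, not the paper's DR algorithm), the code below first deposits a common supersequence with a simple majority-merge rule and then reduces it by deleting characters whenever the result is still a supersequence of every input. All names are illustrative.

    from collections import Counter

    def is_supersequence(t, s):
        it = iter(t)
        return all(ch in it for ch in s)

    def deposit(seqs):
        """Deposition (simplified): repeatedly append the character heading the most
        sequences, then advance every sequence whose front matches it."""
        fronts = [list(s) for s in seqs]
        out = []
        while any(fronts):
            c = Counter(f[0] for f in fronts if f).most_common(1)[0][0]
            out.append(c)
            for f in fronts:
                if f and f[0] == c:
                    f.pop(0)
        return "".join(out)

    def reduce_sup(sup, seqs):
        """Reduction: drop a character only if the string stays a common supersequence."""
        i = 0
        while i < len(sup):
            cand = sup[:i] + sup[i + 1:]
            if all(is_supersequence(cand, s) for s in seqs):
                sup = cand
            else:
                i += 1
        return sup

    seqs = ["ACGT", "AGT", "CGA"]
    scs = reduce_sup(deposit(seqs), seqs)
    assert all(is_supersequence(scs, s) for s in seqs)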

    Average-case analysis via incompressibility

    Expected length of longest common subsequences

    A longest common subsequence of two sequences is a sequence that is a subsequence of both the given sequences and has largest possible length. It is known that the expected length of a longest common subsequence is proportional to the length of the given sequences. The proportion, denoted by γk, depends on the alphabet size k, and the exact value of this proportion is not known even for a binary alphabet. To obtain lower bounds for the constants γk, finite state machines computing a common subsequence of the inputs are built. Analysing the behaviour of the machines for random inputs, we get lower bounds for the constants γk. The analysis of the machines is based on the theory of Markov chains. An algorithm for automated production of lower bounds is described. To obtain upper bounds for the constants γk, collations - pairs of sequences with a marked common subsequence - are defined. Upper bounds for the number of collations of ‘small size’ can be easily transformed to upper bounds for the constants γk. Combinatorial analysis is used to bound the number of collations. The methods used for producing bounds on the expected length of a common subsequence of two sequences are also used for other problems, namely a longest common subsequence of several sequences, a shortest common supersequence and a maximal adaptability.
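
    The proportionality claim is easy to probe empirically. The sketch below (a crude Monte Carlo illustration, not the thesis's machinery of finite state machines and collations) estimates γk by averaging the LCS length of random length-n strings over a k-letter alphabet, using the standard dynamic program.

    import random

    def lcs_length(a, b):
        """Standard O(|a|*|b|) dynamic program for the length of an LCS."""
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0]
            for j, y in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def estimate_gamma(k=2, n=500, trials=20, seed=0):
        """Monte Carlo estimate of gamma_k: average LCS length of two uniformly
        random length-n strings over a k-letter alphabet, divided by n."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            a = [rng.randrange(k) for _ in range(n)]
            b = [rng.randrange(k) for _ in range(n)]
            total += lcs_length(a, b)
        return total / (trials * n)

    print(estimate_gamma(k=2))  # for the binary alphabet this comes out near 0.8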

    Algorithms for the Analysis of Spatio-Temporal Data from Team Sports

    Modern object tracking systems are able to simultaneously record trajectories—sequences of time-stamped location points—for large numbers of objects with high frequency and accuracy. The availability of trajectory datasets has resulted in a consequent demand for algorithms and tools to extract information from these data. In this thesis, we present several contributions intended to do this, and in particular, to extract information from trajectories tracking football (soccer) players during matches. Football player trajectories have particular properties that both facilitate and present challenges for the algorithmic approaches to information extraction. The key property that we look to exploit is that the movement of the players reveals information about their objectives through cooperative and adversarial coordinated behaviour, and this, in turn, reveals the tactics and strategies employed to achieve the objectives. While the approaches presented here naturally deal with the application-specific properties of football player trajectories, they also apply to other domains where objects are tracked, for example behavioural ecology, traffic and urban planning.
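
    For concreteness, a trajectory in the sense used above can be represented as below; the type names and units are illustrative assumptions, not taken from the thesis.

    import math
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Point:
        t: float  # timestamp in seconds
        x: float  # pitch coordinates in metres
        y: float

    Trajectory = List[Point]

    def distance_covered(traj: Trajectory) -> float:
        """Total distance travelled along a trajectory (sum of straight-line steps)."""
        return sum(math.hypot(b.x - a.x, b.y - a.y) for a, b in zip(traj, traj[1:]))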

    Website Fingerprinting: Attacks and Defenses

    Website fingerprinting attacks allow a local, passive eavesdropper to determine a client's web activity by leveraging features from her packet sequence. These attacks break the privacy expected by users of privacy technologies, including low-latency anonymity networks such as proxies, VPNs, or Tor. As a discipline, website fingerprinting is an application of machine learning techniques to the diverse field of privacy. To perform a website fingerprinting attack, the eavesdropping attacker passively records the time, direction, and size of the client's packets. Then, he uses a machine learning algorithm to classify the packet sequence so as to determine the web page it came from. In this work we construct and evaluate three new website fingerprinting attacks: Wa-OSAD, an attack using a modified edit distance as the kernel of a Support Vector Machine, achieving greater accuracy than attacks before it; Wa-FLev, an attack that quickly approximates an edit distance computation, allowing a low-resource attacker to deanonymize many clients at once; and Wa-kNN, the current state-of-the-art attack, which is effective and fast, with a very low false positive rate in the open-world scenario. While our new attacks perform well in theoretical scenarios, there are significant differences between the situation in the wild and in the laboratory. Specifically, we tackle concerns regarding the freshness of the training set, splitting packet sequences so that each part corresponds to one web page access (for easy classification), and removing misleading noise from the packet sequence. To defend ourselves against such attacks, we need defenses that are both efficient and provable. We rigorously define and motivate the notion of a provable defense in this work, and we present three new provable defenses: Tamaraw, which is a relatively efficient way to flood the channel with fixed-rate packet scheduling; Supersequence, which uses smallest common supersequences to save on bandwidth overhead; and Walkie-Talkie, which uses half-duplex communication to significantly reduce both bandwidth and time overhead, allowing a truly efficient yet provable defense.
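
    The raw input to the attacks described above is a packet sequence of (time, direction, size) records. The sketch below shows the general shape of such a trace and a few simple summary features for a generic classifier; this feature set is an assumption for illustration and is not the feature set of Wa-OSAD, Wa-FLev, or Wa-kNN.

    from typing import List, Tuple

    # (timestamp in seconds, direction: +1 outgoing / -1 incoming, size in bytes)
    Packet = Tuple[float, int, int]

    def extract_features(trace: List[Packet]) -> List[float]:
        """Summarise a packet sequence into simple features for a generic classifier."""
        outgoing = sum(1 for _, d, _ in trace if d > 0)
        incoming = len(trace) - outgoing
        total_bytes = sum(s for _, _, s in trace)
        duration = trace[-1][0] - trace[0][0] if trace else 0.0
        return [len(trace), outgoing, incoming, total_bytes, duration]

    trace = [(0.00, +1, 565), (0.05, -1, 1448), (0.06, -1, 1448), (0.31, +1, 80)]
    print(extract_features(trace))  # [4, 2, 2, 3541, 0.31]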

    Solution Biases and Pheromone Representation Selection in Ant Colony Optimisation.

    Combinatorial optimisation problems (COPs) pervade human society: scheduling, design, layout, distribution, timetabling, resource allocation and project management all feature problems where the solution is some combination of elements, the overall value of which needs to be either maximised or minimised (i.e., optimised), typically subject to a number of constraints. Thus, techniques to efficiently solve such problems are an important area of research. A popular group of optimisation algorithms are the metaheuristics, approaches that specify how to search the space of solutions in a problem-independent way so that high quality solutions are likely to result in a reasonable amount of computational time. Although metaheuristic algorithms are specified in a problem-independent manner, they must be tailored to suit each particular problem to which they are applied. This thesis investigates a number of aspects of the application of the relatively new Ant Colony Optimisation (ACO) metaheuristic to different COPs. The standard ACO metaheuristic is a constructive algorithm loosely based on the foraging behaviour of ant colonies, which are able to find the shortest path to a food source by indirect communication through pheromones. ACO's artificial pheromone represents a model of the solution components that its artificial ants use to construct solutions. Developing an appropriate pheromone representation is a key aspect of the application of ACO to a problem. An examination of existing ACO applications and the constructive approach more generally reveals how the metaheuristic can be applied more systematically across a range of COPs. The two main issues addressed in this thesis are biases inherent in the constructive process and the systematic selection of pheromone representations. The systematisation of ACO should lead to more consistently high performance of the algorithm across different problems. Additionally, it supports the creation of a generalised ACO system, capable of adapting itself to suit many different combinatorial problems without the need for manual intervention.
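
    A generic, textbook-style illustration of the constructive process and an edge-based pheromone representation is sketched below for a tiny symmetric TSP; it is not the systematised ACO proposed in the thesis, and the parameter values are illustrative.

    import random

    def aco_tsp(dist, n_ants=10, n_iters=50, alpha=1.0, beta=2.0, rho=0.1, seed=0):
        """Minimal ACO for a symmetric TSP: pheromone lives on city-to-city edges."""
        rng = random.Random(seed)
        n = len(dist)
        tau = [[1.0] * n for _ in range(n)]  # pheromone representation: one value per edge
        best_tour, best_len = None, float("inf")

        def tour_length(tour):
            return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

        for _ in range(n_iters):
            tours = []
            for _ in range(n_ants):
                tour = [rng.randrange(n)]
                unvisited = set(range(n)) - {tour[0]}
                while unvisited:
                    i = tour[-1]
                    # weight each candidate city by pheromone strength and inverse distance
                    weights = [(j, (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta))
                               for j in unvisited]
                    r = rng.random() * sum(w for _, w in weights)
                    for j, w in weights:
                        r -= w
                        if r <= 0:
                            break
                    tour.append(j)
                    unvisited.remove(j)
                tours.append(tour)
            # evaporation, then reinforcement: shorter tours deposit more pheromone
            for i in range(n):
                for j in range(n):
                    tau[i][j] *= (1.0 - rho)
            for tour in tours:
                length = tour_length(tour)
                if length < best_len:
                    best_tour, best_len = tour, length
                for i in range(n):
                    a, b = tour[i], tour[(i + 1) % n]
                    tau[a][b] += 1.0 / length
                    tau[b][a] += 1.0 / length
        return best_tour, best_len

    dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
    print(aco_tsp(dist))  # finds a shortest tour of length 18 for this small instance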

    Beam search for the longest common subsequence problem

    The longest common subsequence problem is a classical string problem that concerns finding the common part of a set of strings. It has several important applications, for example in pattern recognition or computational biology. Most research efforts up to now have focused on solving this problem optimally. In comparison, only a few works exist that deal with heuristic approaches. In this work we present a deterministic beam search algorithm. The results show that our algorithm outperforms classical approaches as well as recent metaheuristic approaches.
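
    A generic beam search for this problem can be sketched as follows; the beam width and the optimistic scoring rule (letters matched so far plus the shortest remaining suffix) are illustrative choices, not necessarily those of the paper.

    def lcs_beam(strings, beam_width=50):
        alphabet = set().union(*strings)
        # a state is (positions consumed in each string, subsequence built so far)
        beam = [((0,) * len(strings), "")]
        best = ""
        while beam:
            children = []
            for pos, sub in beam:
                for c in alphabet:
                    nxt = []
                    for s, p in zip(strings, pos):
                        i = s.find(c, p)
                        if i == -1:
                            break
                        nxt.append(i + 1)
                    else:
                        children.append((tuple(nxt), sub + c))
            if not children:
                break
            # optimistic score: letters already matched plus shortest remaining suffix
            children.sort(key=lambda st: len(st[1]) +
                          min(len(s) - p for s, p in zip(strings, st[0])),
                          reverse=True)
            beam = children[:beam_width]
            best = max(best, beam[0][1], key=len)
        return best

    print(lcs_beam(["ACCGGTCG", "GTCGTCGGA", "ATCGTCG"]))  # a common subsequence, e.g. "CGTCG"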