13,523 research outputs found

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    PDF-Malware Detection: A Survey and Taxonomy of Current Techniques

    Get PDF
    Portable Document Format, more commonly known as PDF, has become, in the last 20 years, a standard for document exchange and dissemination due its portable nature and widespread adoption. The flexibility and power of this format are not only leveraged by benign users, but from hackers as well who have been working to exploit various types of vulnerabilities, overcome security restrictions, and then transform the PDF format in one among the leading malicious code spread vectors. Analyzing the content of malicious PDF files to extract the main features that characterize the malware identity and behavior, is a fundamental task for modern threat intelligence platforms that need to learn how to automatically identify new attacks. This paper surveys existing state of the art about systems for the detection of malicious PDF files and organizes them in a taxonomy that separately considers the used approaches and the data analyzed to detect the presence of malicious code. © Springer International Publishing AG, part of Springer Nature 2018

    Incremental construction of minimal acyclic finite-state automata

    Get PDF
    In this paper, we describe a new method for constructing minimal, deterministic, acyclic finite-state automata from a set of strings. Traditional methods consist of two phases: the first to construct a trie, the second one to minimize it. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly. We present a general algorithm as well as a specialization that relies upon the lexicographical ordering of the input strings.Comment: 14 pages, 7 figure

    Algorithms to Compute the Lyndon Array

    Get PDF
    We first describe three algorithms for computing the Lyndon array that have been suggested in the literature, but for which no structured exposition has been given. Two of these algorithms execute in quadratic time in the worst case, the third achieves linear time, but at the expense of prior computation of both the suffix array and the inverse suffix array of x. We then go on to describe two variants of a new algorithm that avoids prior computation of global data structures and executes in worst-case n log n time. Experimental evidence suggests that all but one of these five algorithms require only linear execution time in practice, with the two new algorithms faster by a small factor. We conjecture that there exists a fast and worst-case linear-time algorithm to compute the Lyndon array that is also elementary (making no use of global data structures such as the suffix array)

    A taxonomy of sublinear multiple keyword pattern matching algorithms

    Get PDF
    AbstractThis article presents a taxonomy of sublinear keyword pattern matching algorithms related to the Boyer-Moore algorithm [3] and the Commentz-Walter algorithm [5, 6]. The taxonomy includes, amongst others, the multiple keyword generalization of the single keyword Boyer-Moore algorithm and an algorithm by Fan and Su [9, 10]. The corresponding precomputation algorithms are presented as well. The taxonomy is based on the idea of ordering algorithms according to their essential problem and algorithm details, and deriving all algorithms from a common starting point by successively adding these details in a correctness preserving way. This way of presentation not only provides a complete correctness argument of each algorithm, but also makes very clear what algorithms have in common (the details of their nearest common ancestor) and where they differ (the details added after their nearest common ancestor). Introduction of the notion of safe shift distances proves to be essential in the derivation and classification of the algorithms. Moreover, the article provides a common derivation for and a uniform presentation of the precomputation algorithms, not yet found in the literature
    corecore