366 research outputs found

    DNA-inspired online behavioral modeling and its application to spambot detection

    We propose a strikingly novel, simple, and effective approach to model online user behavior: we extract and analyze digital DNA sequences from user online actions and we use Twitter as a benchmark to test our proposal. We obtain an incisive and compact DNA-inspired characterization of user actions. Then, we apply standard DNA analysis techniques to discriminate between genuine and spambot accounts on Twitter. An experimental campaign supports our proposal, showing its effectiveness and viability. To the best of our knowledge, we are the first to identify and adapt DNA-inspired techniques to online user behavioral modeling. While Twitter spambot detection is a specific use case on a specific social media platform, our proposed methodology is platform- and technology-agnostic, hence paving the way for diverse behavioral characterization tasks.
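
    To make the idea of a digital DNA sequence concrete, here is a minimal sketch. The three-letter alphabet (A for tweet, C for reply, T for retweet) and the function names are assumptions made for illustration; the abstract does not specify an encoding. Longest common substring stands in here as one standard sequence-comparison technique, not necessarily the one the authors apply.

```python
# Hypothetical action-to-base mapping; the abstract does not fix an alphabet.
ACTION_TO_BASE = {"tweet": "A", "reply": "C", "retweet": "T"}


def encode_actions(actions):
    """Encode a chronological list of action names as a digital DNA string."""
    return "".join(ACTION_TO_BASE[a] for a in actions)


def longest_common_substring(s, t):
    """Return one longest substring shared by s and t (O(len(s) * len(t)) DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common run ending at s[i-2], t[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]
```

    Under these assumptions, accounts whose encoded timelines share an unusually long common substring behave alike, which is the kind of signal a similarity-based detector could threshold on.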

    Fast Arc-Annotated Subsequence Matching in Linear Space

    An arc-annotated string is a string of characters, called bases, augmented with a set of pairs, called arcs, each connecting two bases. Given arc-annotated strings P and Q, the arc-preserving subsequence problem is to determine if P can be obtained from Q by deleting bases from Q. Whenever a base is deleted, any arc with an endpoint in that base is also deleted. Arc-annotated strings where the arcs are "nested" are a natural model of RNA molecules that captures both the primary and secondary structure of these. The arc-preserving subsequence problem for nested arc-annotated strings is a basic primitive for investigating the function of RNA molecules. Gramm et al. [ACM Trans. Algorithms 2006] gave an algorithm for this problem using O(nm) time and space, where m and n are the lengths of P and Q, respectively. In this paper we present a new algorithm using O(nm) time and O(n + m) space, thereby matching the previous time bound while significantly reducing the space from a quadratic term to linear. This is essential to process large RNA molecules where the space is likely to be a bottleneck. To obtain our result we introduce several novel ideas which may be of independent interest for related problems on arc-annotated strings. Comment: To appear in Algoritmic
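
    The abstract does not define "nested". The sketch below assumes the usual reading: no two arcs share an endpoint and no two arcs cross. The representation of arcs as index pairs and the name `is_nested` are assumptions for illustration; this only checks the input structure and is not the subsequence algorithm from the paper.

```python
def is_nested(arcs):
    """Return True if no two arcs share an endpoint or cross each other.

    arcs is an iterable of (i, j) index pairs into the base string.
    """
    opener = {}        # right endpoint -> matching left endpoint
    endpoints = set()
    for i, j in arcs:
        left, right = min(i, j), max(i, j)
        # An arc must join two distinct bases, and endpoints may not be reused.
        if left == right or left in endpoints or right in endpoints:
            return False
        endpoints.update((left, right))
        opener[right] = left

    # Scan left to right like a bracket matcher: a right endpoint must close
    # the most recently opened arc, otherwise two arcs cross.
    stack = []
    for pos in sorted(endpoints):
        if pos in opener:
            if not stack or stack.pop() != opener[pos]:
                return False
        else:
            stack.append(pos)
    return not stack
```

    Nested arcs behave like balanced parentheses, which is why a single stack suffices for the check.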

    Palindrome Recognition In The Streaming Model

    In the Palindrome Problem one tries to find all palindromes (palindromic substrings) in a given string. A palindrome is defined as a string which reads forwards the same as backwards, e.g., the string "racecar". A related problem is the Longest Palindromic Substring Problem, in which finding an arbitrary one of the longest palindromes in the given string suffices. We consider the streaming versions of both problems. In the streaming model the input arrives over time and at every point in time we are only allowed to use sublinear space. The main algorithms in this paper are the following: The first one is a one-pass randomized algorithm that solves the Palindrome Problem. It has an additive error and uses O(\sqrt{n}) space. The second algorithm is a two-pass algorithm which determines the exact locations of all longest palindromes. It uses the first algorithm as the first pass. The third algorithm is again a one-pass randomized algorithm, which solves the Longest Palindromic Substring Problem. It has a multiplicative error and uses only O(\log n) space. We also give two variants of the first algorithm which solve other related practical problems.
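
    For contrast with the streaming setting, here is a minimal non-streaming baseline for the Longest Palindromic Substring Problem. It is the classic expand-around-center method, not any of the algorithms from the paper: it keeps the whole input in memory (linear space) and runs in O(n^2) time. The function name is an assumption for illustration.

```python
def longest_palindromic_substring(text):
    """Return one longest palindromic substring of text (O(n^2) time)."""
    best_start, best_len = 0, 0
    n = len(text)
    # There are 2n - 1 centers: n on characters and n - 1 between characters.
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow outwards while the two ends match.
        while left >= 0 and right < n and text[left] == text[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length > best_len:
            best_start, best_len = left + 1, length
    return text[best_start:best_start + best_len]
```

    The exact answer here requires random access to the stored input; the streaming algorithms described above give that up in exchange for sublinear space and accept an additive or multiplicative error.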

    Prospects and limitations of full-text index structures in genome analysis

    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.
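
    As one concrete example of a full-text index, the sketch below builds a suffix array and uses it to locate a pattern by binary search. The abstract does not name the structures it surveys, so the choice of suffix array is an assumption, and the construction is deliberately naive (it sorts whole suffixes) rather than one of the space- or time-efficient variants such a survey would compare.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction: sorting materializes suffixes, so this is only
    suitable for short inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)

    def boundary(strict):
        # First index whose length-m suffix prefix is >= pattern
        # (or > pattern when strict is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = boundary(strict=False)
    last = boundary(strict=True)
    return sorted(suffix_array[first:last])
```

    All suffixes that start with the pattern are adjacent in sorted order, so two binary searches bound the full set of matches. The memory-time trade-off is already visible here: the index stores one integer per text position in addition to the text itself.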

    When Can You Fold a Map?

    We explore the following problem: given a collection of creases on a piece of paper, each assigned a folding direction of mountain or valley, is there a flat folding by a sequence of simple folds? There are several models of simple folds; the simplest one-layer simple fold rotates a portion of paper about a crease in the paper by ±180 degrees. We first consider the analogous questions in one dimension lower -- bending a segment into a flat object -- which lead to interesting problems on strings. We develop efficient algorithms for the recognition of simply foldable 1D crease patterns, and reconstruction of a sequence of simple folds. Indeed, we prove that a 1D crease pattern is flat-foldable by any means precisely if it is by a sequence of one-layer simple folds. Next we explore simple foldability in two dimensions, and find a surprising contrast: "map" folding and variants are polynomial, but slight generalizations are NP-complete. Specifically, we develop a linear-time algorithm for deciding foldability of an orthogonal crease pattern on a rectangular piece of paper, and prove that it is (weakly) NP-complete to decide foldability of (1) an orthogonal crease pattern on an orthogonal piece of paper, (2) a crease pattern of axis-parallel and diagonal (45-degree) creases on a square piece of paper, and (3) crease patterns without a mountain/valley assignment. Comment: 24 pages, 19 figures. Version 3 includes several improvements thanks to referees, including formal definitions of simple folds, more figures, table summarizing results, new open problems, and additional reference

    Two-dimensional prefix string matching and covering on square matrices

    Two linear-time algorithms are presented: one for determining, for every position in a given square matrix, the longest prefix of a given pattern (also a square matrix) that occurs at that position, and one for computing all square covers of a given two-dimensional square matrix.
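
    To illustrate the first problem, here is a brute-force sketch. It assumes that a "prefix" of a square pattern means its top-left k x k submatrix, which the abstract does not spell out, and it represents matrices as lists of equal-length strings. It is a direct check that makes the problem statement concrete, not the linear-time algorithm from the paper; the function name is hypothetical.

```python
def longest_prefix_at_each_position(text, pattern):
    """For each cell (r, c) of the square matrix text, return the largest k
    such that the top-left k x k block of pattern occurs with its top-left
    corner at (r, c)."""
    n, m = len(text), len(pattern)
    result = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            k = 0
            # Grow the matched block from k x k to (k+1) x (k+1) by checking
            # only the newly added bottom row and right column.
            while k < m and r + k < n and c + k < n:
                row_ok = all(text[r + k][c + j] == pattern[k][j]
                             for j in range(k + 1))
                col_ok = all(text[r + i][c + k] == pattern[i][k]
                             for i in range(k))
                if not (row_ok and col_ok):
                    break
                k += 1
            result[r][c] = k
    return result
```

    Each cell may cost up to O(m^2) character comparisons in this version, which is the gap a linear-time algorithm closes.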