58 research outputs found

    A framework of dynamic data structures for string processing

    In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that these theoretical predictions are closely matched in practice. To conclude, we turn our attention to applications. We compare the performance of five recently published compression algorithms implemented using DYNAMIC with that of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.

    Optimal rank and select queries on dictionary-compressed text

    We study the problem of supporting queries on a string S of length n within a space bounded by the size γ of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/γ) / log log n) time within O(γ polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
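    For readers unfamiliar with the two queries named in this abstract, the following is a minimal sketch of what rank and select compute, written over an uncompressed Python string. It only illustrates the semantics; it is not the attractor-based structure of the paper and has none of its space or time guarantees. The function names and the 1-based position convention are assumptions made for this example.

```python
def rank(s, c, i):
    """Number of occurrences of character c among the first i characters of s."""
    return s[:i].count(c)


def select(s, c, j):
    """1-based position of the j-th occurrence of c in s, or None if there is none."""
    if j < 1:
        raise ValueError("j must be at least 1")
    pos = -1
    for _ in range(j):
        # Continue searching just after the previous occurrence.
        pos = s.find(c, pos + 1)
        if pos == -1:
            return None
    return pos + 1


text = "abracadabra"
third_a = select(text, "a", 3)        # position of the third 'a'
count_up_to = rank(text, "a", third_a)  # occurrences of 'a' up to that position
```

    The two queries are inverse to each other in the sense that rank(s, c, select(s, c, j)) equals j whenever the j-th occurrence exists.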

    Special Issue on Algorithms and Data-Structures for Compressed Computation

    As the production of massive data has outpaced Moore’s law in many scientific areas, the very notion of algorithms is transforming [...

    Adaptive learning of compressible strings

    Suppose an oracle knows a string S that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is s a substring of S?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle Σn/4 − O(n) queries in order to be able to reconstruct the hidden string, where Σ is the size of the alphabet of S and n its length, and gave an algorithm that spends (Σ − 1)n + O(Σ√n) queries to reconstruct S. The main contribution of our paper is to improve the above upper bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to τ bits, performs q = O(τ) substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length n over an integer alphabet of size Σ with rle runs can be reconstructed with q = O(rle(Σ + log(n/rle))) substring queries in linear time and space. We then present an algorithm that spends q ∈ O(Σ g log n) substring queries and runs in O(n(log n + log Σ) + q) time using linear space, where g is the size of a smallest straight-line program generating the string. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
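    To make the query model concrete, here is a minimal sketch of a naive baseline that reconstructs a hidden string using only substring queries. It is not one of the algorithms from the paper and does not exploit compressibility: it greedily extends a known substring to the right until no character fits (at which point it must be a suffix), then extends it to the left. The names `reconstruct`, `is_substring` and `oracle` are hypothetical, and the alphabet is assumed to be known in advance.

```python
def reconstruct(is_substring, alphabet):
    """Recover the hidden string using only 'is s a substring?' queries."""
    w = ""
    # Phase 1 appends characters on the right, phase 2 prepends on the left.
    for extend in (lambda cur, c: cur + c, lambda cur, c: c + cur):
        grew = True
        while grew:
            grew = False
            for c in alphabet:
                candidate = extend(w, c)
                if is_substring(candidate):
                    w = candidate
                    grew = True
                    break
    return w


hidden = "abracadabra"
queries = 0


def oracle(s):
    """Simulated oracle that also counts how many queries were asked."""
    global queries
    queries += 1
    return s in hidden


recovered = reconstruct(oracle, sorted(set(hidden)))
```

    Each successful extension costs at most one query per alphabet symbol, and each phase ends with one failing round, so this baseline asks at most Σ(n + 2) queries, which is the regime the compressibility-aware bounds in the abstract improve upon.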

    String attractors: Verification and optimization

    String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set γ ⊆ [1..n] is a k-attractor for a string S ∈ Σ^n if and only if every distinct substring of S of length at most k has an occurrence crossing at least one of the positions in γ. Finding the smallest k-attractor is NP-hard for k ≥ 3, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the k-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a k-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in O(n) time when |Σ| ∈ O((log n)^(1/(3+ϵ))) for some constant ϵ > 0, despite the problem being NP-hard for large Σ. © Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. Peer reviewed
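    The k-attractor definition in this abstract can be checked directly by brute force. The sketch below enumerates every occurrence of every substring of length at most k and records which distinct substrings have at least one occurrence that crosses a chosen position. It is far slower than the near-optimal verification the paper describes and is meant only to illustrate the definition; the function name `is_k_attractor` is an assumption, and positions are 1-based as in the abstract.

```python
def is_k_attractor(s, positions, k):
    """Brute-force test of the k-attractor definition (1-based positions)."""
    positions = set(positions)
    n = len(s)
    seen = set()     # every distinct substring of length <= k
    covered = set()  # substrings with some occurrence crossing a position
    for start in range(n):  # 0-based start of an occurrence
        for length in range(1, min(k, n - start) + 1):
            sub = s[start:start + length]
            seen.add(sub)
            # This occurrence spans 1-based positions start+1 .. start+length.
            if any(start + 1 <= p <= start + length for p in positions):
                covered.add(sub)
    return seen == covered


ok = is_k_attractor("abab", {2, 3}, 4)
not_ok = is_k_attractor("abab", {2}, 1)  # no 'a' sits at position 2
```

    In the second call the only attractor position holds a "b", so the substring "a" has no occurrence crossing it and the check fails.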

    Gsufsort: Constructing suffix arrays, LCP arrays and BWTs for string collections

    Background: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result: In this paper we introduce gsufsort, an open-source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
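    As a rough illustration of the three structures named in this abstract, the sketch below builds a suffix array, an LCP array and a BWT by plain sorting. It is quadratic or worse and is unrelated to the linear-time gSACA-K algorithm or to gsufsort's actual interface. The function names, and the convention of ending each string with a "$" that sorts before every other character, are assumptions made for this example.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] = length of the longest common prefix of suffixes sa[r-1] and sa[r]."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        lcp[r] = length
    return lcp


def bwt(text, sa):
    """Character preceding each sorted suffix; index -1 wraps to the last character."""
    return "".join(text[i - 1] for i in sa)


text = "banana$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
transformed = bwt(text, sa)
```

    A collection can be handled in this toy setting by concatenating its strings with a separator after each one, for example "ab$a$" for the strings "ab" and "a", and sorting the suffixes of the concatenation.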

    Indexing k-mers in linear space for quality value compression.

    Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff

    Anxiety and self-esteem as mediators of the relation between family communication and indecisiveness in adolescence

    In this study, we explored the unique and common contributions of anxiety, self-esteem, and family communication to indecisiveness among adolescents. Three hundred and fifty pupils from 13 to 16 years of age completed self-report measures of indecisiveness, quality of family communication, trait anxiety, and self-esteem. The findings showed that students' indecisiveness is predicted by family communication, mediated by anxiety and self-esteem. These results have important implications for practice, as they stress the importance of anxiety and self-esteem. Nevertheless, counselors could also focus on enhancing relationship-building skills by introducing the adolescents' career formation as an adolescent–parent joint project.