    Composite repetition-aware data structures

    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend on only one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal. Comment: (the name of the third co-author was inadvertently omitted from the previous version)
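To make the "number of BWT runs" measure concrete, here is a minimal Python sketch that builds the BWT by naively sorting rotations and then run-length encodes it. The function names `bwt` and `run_length_encode` and the `$` sentinel are illustrative assumptions, not taken from the paper, and the quadratic construction is only suitable for toy inputs.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Illustration only: this uses quadratic space, unlike practical builders.
    The sentinel is assumed to be absent from text and to sort before every
    other character (true for '$' versus ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
print(transformed, runs, len(runs))
```

The length of `runs` is the quantity that RLBWT space is proportional to; on repetitive inputs it tends to be much smaller than the text length.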

    Computing LZ77 in Run-Compressed Space

    In this paper, we show that the LZ77 factorization of a text $T \in \Sigma^n$ can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing.
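For readers unfamiliar with the parsing whose size is z, the following is a minimal sketch of one common LZ77 variant: each factor is either the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), or a single previously unseen character. This is a naive quadratic-time illustration, not the run-compressed-space algorithm of the paper, and the name `lz77_factorize` is an assumption.

```python
def lz77_factorize(text: str) -> list[str]:
    """Greedy LZ77 parsing with self-overlapping sources allowed."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        # Restricting the search window to end at i + length guarantees that
        # any match found starts at a position <= i - 1.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        if length == 0:
            factors.append(text[i])  # fresh character, never seen before
            i += 1
        else:
            factors.append(text[i:i + length])
            i += length
    return factors


factors = lz77_factorize("abababab")
print(factors, len(factors))
```

The number of factors returned is z for this variant; other definitions of LZ77 (for example, ones that append an explicit mismatch character to each factor) give slightly different counts.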

    Moving Toward Non-transcription Based Discourse Analysis in Stable and Progressive Aphasia

    Measurement of communication ability at the discourse level holds promise for predicting how well persons with stable (e.g., stroke-induced), or progressive aphasia navigate everyday communicative interactions. However, barriers to the clinical utilization of discourse measures have persisted. Recent advancements in the standardization of elicitation protocols and the existence of large databases for development of normative references have begun to address some of these barriers. Still, time remains a consistently reported barrier by clinicians. Non-transcription based discourse measurement would reduce the time required for discourse analysis, making clinical utilization a reality. The purpose of this article is to present evidence regarding discourse measures (main concept analysis, core lexicon, and derived efficiency scores) that are well suited to non-transcription based analysis. Combined with previous research, our results suggest that these measures are sensitive to changes following stroke or neurodegenerative disease. Given the evidence, further research specifically assessing the reliability of these measures in clinical implementation is warranted

    Fast Label Extraction in the CDAWG

    The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m\log{\log{n}}+\mathtt{occ})$ to $O(m+\mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k\log{\log{n}})$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG. Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864
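As a reminder of what matching statistics are, here is a minimal sketch that computes them directly from the definition: for each position i of the query, the length of the longest prefix of the query suffix starting at i that occurs somewhere in the text. It relies on plain substring search rather than a CDAWG or suffix tree, so it does not achieve the linear-time bound discussed in the abstract; the name `matching_statistics` is an assumption.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """MS[i] = length of the longest prefix of query[i:] occurring in text."""
    ms = []
    for i in range(len(query)):
        # If query[i-1 : i-1+L] occurs in text, then so does its suffix
        # query[i : i-1+L], hence MS[i] >= MS[i-1] - 1 and we can start there.
        length = max(0, ms[-1] - 1) if ms else 0
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("nanb", "banana")
print(ms)
```

Index-based solutions replace the repeated substring searches with a single left-to-right walk over the index, which is where the per-character cost discussed above comes in.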

    Common aetiology for diverse language skills in 4½-year-old twins

    Multivariate genetic analysis was used to examine the genetic and environmental aetiology of the interrelationships of diverse linguistic skills. This study used data from a large sample of 4½-year-old twins who were tested on measures assessing articulation, phonology, grammar, vocabulary, and verbal memory. Phenotypic analysis suggested two latent factors: articulation (2 measures) and general language (the remaining 7), and a genetic model incorporating these factors provided a good fit to the data. Almost all genetic and shared environmental influences on the 9 measures acted through the two latent factors. There was also substantial aetiological overlap between the two latent factors, with a genetic correlation of 0·64 and shared environment correlation of 1·00. We conclude that to a large extent, the same genetic and environmental factors underlie the development of individual differences in a wide range of linguistic skills.