Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT also enables a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
Comment: (the name of the third co-author was inadvertently omitted from
previous version)
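As a rough illustration of the "number of BWT runs" measure mentioned in this abstract, the following sketch builds the Burrows-Wheeler transform of a short string by sorting its rotations and then run-length encodes it. This is a naive construction for toy inputs only, not the data structure described in the paper; the function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in the text and sorts before
    every other character (true for "$" and lowercase ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(char, len(list(group))) for char, group in groupby(s)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
print(transformed)   # BWT of "banana$"
print(runs)          # one (character, length) pair per run
print(len(runs))     # number of BWT runs
```

On repetitive inputs the number of pairs in `runs` tends to stay small relative to the text length, which is the property a run-length encoded BWT exploits.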
Computing LZ77 in Run-Compressed Space
In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n
can be computed in O(R log n) bits of working space and O(n log R) time, R
being the number of runs in the Burrows-Wheeler transform of T reversed. For
extremely repetitive inputs, the working space can be as low as O(log n) bits:
exponentially smaller than the text itself. As a direct consequence of our
result, we show that a class of repetition-aware self-indexes based on a
combination of run-length encoded BWT and LZ77 can be built in asymptotically
optimal O(R + z) words of working space, z being the size of the LZ77 parsing.
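To make the quantity z concrete, here is a minimal sketch of a greedy LZ77 factorization. It uses quadratic-time brute force and ordinary linear space, so it is not the run-compressed algorithm of the paper; it only shows what is being counted. The variant assumed here lets a factor overlap its earlier occurrence and emits a single character when no earlier occurrence exists; other definitions of LZ77 differ in these details. The function name `lz77_factors` is hypothetical.

```python
def lz77_factors(text: str) -> list[str]:
    """Greedy LZ77 parsing by brute force (assumed variant, see lead-in).

    At each position i, take the longest prefix of text[i:] that also
    starts at some earlier position j < i (overlap allowed); if there
    is none, the factor is the single new character text[i].
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        for j in range(i):
            length = 0
            # j + length < i + length < n, so both indices stay in range.
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)
        factors.append(text[i:i + step])
        i += step
    return factors


parsing = lz77_factors("abababab")
print(parsing)        # the factors, in order
print(len(parsing))   # z, the size of the parsing
```

For a highly periodic string such as "abababab" the parsing stays short because one factor can copy a long, self-overlapping stretch of earlier text.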
A Psychogenetic Algorithm for Behavioral Sequence Learning
This work presents an original algorithmic model of some essential features of psychogenetic theory, as proposed by J. Piaget. Specifically, we modeled some elements of cognitive structure learning in children from 0 to 4 months of life. We are in fact convinced that the study of well-established cognitive models of human learning can suggest new, interesting approaches to problems so far not satisfactorily solved in the field of machine learning. Further, we discussed the possible parallels between our model and subsymbolic machine learning and neuroscience. The model was implemented and tested in some simple experimental settings, with reference to the task of learning sensorimotor sequences.
Moving Toward Non-transcription Based Discourse Analysis in Stable and Progressive Aphasia
Measurement of communication ability at the discourse level holds promise for predicting how well persons with stable (e.g., stroke-induced) or progressive aphasia navigate everyday communicative interactions. However, barriers to the clinical utilization of discourse measures have persisted. Recent advancements in the standardization of elicitation protocols and the existence of large databases for development of normative references have begun to address some of these barriers. Still, time remains a consistently reported barrier by clinicians. Non-transcription based discourse measurement would reduce the time required for discourse analysis, making clinical utilization a reality. The purpose of this article is to present evidence regarding discourse measures (main concept analysis, core lexicon, and derived efficiency scores) that are well suited to non-transcription based analysis. Combined with previous research, our results suggest that these measures are sensitive to changes following stroke or neurodegenerative disease. Given the evidence, further research specifically assessing the reliability of these measures in clinical implementation is warranted.
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string T of length n
takes space proportional just to the number e of right extensions of the
maximal repeats of T, and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which e grows
significantly more slowly than n. We reduce from O(m log log n) to O(m)
the time needed to count the number of occurrences of a pattern of
length m, using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
O(m log log n + occ) to O(m + occ) in the time needed to
locate all the occ occurrences of the pattern. We also reduce from
O(k log log n) to O(k) the time needed to read the k characters of the
label of an edge of the suffix tree of T, and we reduce from
O(m log log n) to O(m) the time needed to compute the matching
statistics between a query of length m and T, using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.
Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
Narrating the archive and archiving narrative: the electronic book and the logic of the index
The creation of my hypermedia work Index of Love, which narrates a love story as an archive of moments, images and objects recollected, also articulated for me the potential of the book as electronic text. The book has always existed as both narrative and archive. Tables of contents and indexes allow the book to function simultaneously as linear narrative and non-linear, searchable database. The book therefore has more in common with the so-called 'new media' of the 21st century than it does with the dominant 20th century media of film, video and audiotape, whose logic and mode of distribution are resolutely linear. My thesis is that the non-linear logic of new media brings to the fore an aspect of the book - the index - whose potential for the production of narrative is only just beginning to be explored. When a reader/user accesses an electronic work, such as a website, via its menu, they simultaneously experience it as narrative and archive. The narrative journey taken is created through the menu choices made. Within the electronic book, therefore, the index (or menu) has the potential to function as more than just an analytical or navigational tool. It has the potential to become a creative, structuring device. This opens up new possibilities for the book, particularly as, in its paper-based form, the book indexes factual work, but not fiction. In the electronic book, however, the index offers as rich a potential for fictional narratives as it does for factual volumes.