Suffix Sorting via Matching Statistics
Funding Information: Academy of Finland grants 339070 and 351150. Publisher Copyright: © Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest. Peer reviewed.
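To make the term "matching statistics" concrete, here is a minimal quadratic-time sketch, assuming the usual definition: for each position i of a string, the length of the longest prefix of the suffix starting at i that occurs somewhere in the reference. The function name `matching_statistics` is hypothetical, and this is neither the compressed representation nor the heuristic described in the abstract.

```python
def matching_statistics(s: str, ref: str) -> list[int]:
    """Naive matching statistics of s with respect to ref.

    ms[i] is the length of the longest prefix of s[i:] that occurs
    as a substring of ref. Illustrative only: quadratic or worse.
    """
    ms = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs in ref.
        while i + length < len(s) and s[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("banana", "ananas"))
```

Consecutive values can drop by at most one (ms[i+1] >= ms[i] - 1), which is the kind of regularity a compact representation can exploit when the strings are highly similar to the reference.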
Pattern Discovery in Colored Strings
In this paper, we consider the problem of identifying patterns of interest in
colored strings. A colored string is a string where each position is assigned
one of a finite set of colors. Our task is to find substrings of the colored
string that always occur followed by the same color at the same distance. The
problem is motivated by applications in embedded systems verification, in
particular, assertion mining. The goal there is to automatically find
properties of the embedded system from the analysis of its simulation traces.
We show that, in our setting, the number of patterns of interest is
upper-bounded by O(n^2), where n is the length of the string. We
introduce a baseline algorithm, running in O(n^2) time, which
identifies all patterns of interest satisfying certain minimality conditions,
for all colors in the string. For the case where one is interested in patterns
related to one color only, we also provide a second algorithm which runs in
O(n^2 log n) time in the worst case but is faster than the baseline
algorithm in practice. Both solutions use suffix trees, and the second
algorithm also uses an appropriately defined priority queue, which allows us to
reduce the number of computations. We performed an experimental evaluation of
the proposed approaches over both synthetic and real-world datasets, and found
that the second algorithm outperforms the first algorithm on all simulated
data, while on the real-world data, the performance varies between a slight
slowdown (on half of the datasets) and a speedup by a factor of up to 11.
Comment: 22 pages, 5 figures, 2 tables, published in ACM Journal of
Experimental Algorithmics. This is the journal version of the paper with the
same title at SEA 2020 (18th Symposium on Experimental Algorithms, Catania,
Italy, June 16-18, 2020).
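The problem statement in this abstract can be illustrated with a brute-force sketch. It is not either of the suffix-tree algorithms from the paper and it ignores the minimality conditions. It rests on assumptions of this sketch: colors are given as a sequence parallel to the text, the distance d >= 1 is measured from the last character of an occurrence, and an occurrence too close to the end of the string to have a position at distance d disqualifies the pair. The function name `patterns_followed_by_color` is hypothetical.

```python
from collections import defaultdict


def patterns_followed_by_color(text: str, colors: str, target: str) -> set[tuple[str, int]]:
    """Return all (pattern, distance) pairs such that every occurrence of
    pattern in text is followed, exactly `distance` positions after its
    last character, by a position colored `target`.

    Exhaustive enumeration of substrings; for illustration only.
    """
    n = len(text)
    if len(colors) != n:
        raise ValueError("text and colors must have the same length")

    # Map each distinct substring to the end positions of all its occurrences.
    ends_of = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            ends_of[text[i:j]].append(j - 1)

    result = set()
    for pattern, ends in ends_of.items():
        for d in range(1, n):
            # Every occurrence must have an in-range position at distance d
            # carrying the target color (assumption of this sketch).
            if all(e + d < n and colors[e + d] == target for e in ends):
                result.add((pattern, d))
    return result


if __name__ == "__main__":
    # Toy trace: positions 2 and 5 carry color "R", all others "x".
    for pattern, d in sorted(patterns_followed_by_color("abcabd", "xxRxxR", "R")):
        print(pattern, d)
```

In the toy trace, "ab" qualifies at distance 1 because both of its occurrences are immediately followed by an "R"-colored position, while "d" never qualifies because nothing follows it.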
On Compressing Collections of Substring Samples
Publisher Copyright: © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Given a string X = X[1..n] of length n, and integers m and s, such that n > m ≥ 2s > 0, we consider the problem of compressing the string S formed by concatenating the substrings of X of length m starting at positions i ≡ 1 (mod s). In particular, we provide an upper bound of (2n − m)/s + 2z + (m − s) on the size of the Lempel-Ziv (LZ77) parsing of S, where z is the size of the parsing of X. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of S. If X is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing. Peer reviewed.
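The sampling process and the quantities in the bound can be sketched as follows. Two assumptions are made here that the abstract does not fix: only full-length substrings are sampled, and the LZ77 variant is a greedy parse in which each phrase is either a single new character or the longest prefix with an earlier, possibly overlapping, occurrence. The paper's exact LZ77 definition may differ, so phrase counts from this sketch are illustrative. All function names are hypothetical.

```python
def sample_substrings(x: str, m: int, s: int) -> list[str]:
    """Length-m substrings of x starting at 1-based positions i ≡ 1 (mod s).

    In 0-based terms these are positions 0, s, 2s, ...; only full-length
    substrings are kept (assumption of this sketch).
    """
    return [x[i:i + m] for i in range(0, len(x) - m + 1, s)]


def lz77_phrase_count(t: str) -> int:
    """Number of phrases in a greedy LZ77-style parse of t.

    Each phrase is the longest prefix of the unparsed suffix that also
    occurs starting at an earlier position (overlap allowed), or a single
    character if no such prefix exists. Naive and slow; illustration only.
    """
    i, phrases, n = 0, 0, len(t)
    while i < n:
        length = 0
        # A copy of t[i:i+length+1] starting before i must lie within t[:i+length].
        while i + length < n and t.find(t[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)
        phrases += 1
    return phrases


def stated_upper_bound(n: int, m: int, s: int, z: int) -> float:
    """The bound quoted in the abstract: (2n - m)/s + 2z + (m - s)."""
    return (2 * n - m) / s + 2 * z + (m - s)


if __name__ == "__main__":
    x, m, s = "abcdefgh", 4, 2  # satisfies n > m >= 2s > 0
    big_s = "".join(sample_substrings(x, m, s))
    z = lz77_phrase_count(x)
    print(big_s, lz77_phrase_count(big_s), stated_upper_bound(len(x), m, s, z))
```

With s much smaller than m, consecutive samples overlap heavily, which is why the parse of S can be related to the parse of X rather than growing with the total length of S.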
Contrast and island sensitivity in clausal ellipsis
Theoretical and Experimental Linguistics
On the Interaction between Verb Movement and Ellipsis: New Evidence from Hungarian
Theoretical and Experimental Linguistics
What sluicing can do, what it can’t and in which language: on the cross-linguistic syntax of ellipsis
Theoretical and Experimental Linguistics
Reprise fragments in English and Hungarian: further evidence for an in-situ Q-equivalence approach to clausal ellipsis
Theoretical and Experimental Linguistics
Dutch preposition stranding and ellipsis: 'Merchant's Wrinkle' ironed out
This paper provides an explanation for the unexpected ban on preposition stranding by wh-R-pronouns under sluicing in Dutch. After showing that previous prosodic and syntactic explanations are untenable, we propose that the observed ban is a by-product of an EPP condition that applies in the PP domain in Dutch. Our analysis revolves around the idea that ellipsis bleeds EPP-driven movement, an idea that already has empirical support from independent patterns of ellipsis found in English and in other structural domains in Dutch. Our claim is that: (1) R-pronominalization involves a pronominal argument of P moving to the periphery of its extended PP domain (PlaceP) in order to satisfy a PP-internal EPP condition, (2) this EPP-driven movement is bled under sluicing, and (3) because SpecPlaceP is the 'escape hatch' through which R-pronouns must move in order to exit the PP domain to form preposition stranding configurations, bleeding the EPP-driven movement of R-pronouns to SpecPlaceP precludes R-pronouns from undergoing the wh-movement required to form a sluicing configuration.
Theoretical and Experimental Linguistics
A new study of the spectroscopic binary 7 Vul with a Be star primary
We confirmed the binary nature of the Be star 7 Vul, derived a more accurate
spectroscopic orbit with an orbital period of (69.4212 ± 0.0034) d, and
improved the knowledge of the basic physical elements of the system. Analyzing
available photometry and the strength of the Hα emission, we also document the
long-term spectral variations of the Be primary. In addition, we confirmed
rapid light changes with a period of 0.5592 d, which is comparable to the
expected rotational period of the Be primary, but note that its amplitude and
possibly its period vary with time. We were able to disentangle only the He I
6678 Å line of the secondary, which could support our tentative conclusion that
the secondary appears to be a hot subdwarf. A search for this object in
high-dispersion far-UV spectra could provide confirmation. Probable masses of
the binary components are () Mnom and () Mnom. If the
presence of a hot subdwarf is firmly confirmed, 7 Vul might be identified as a
rare object with a B4-B5 primary; all Be + hot subdwarf systems found so far
contain B0-B3 primaries.
Comment: 17 pages, 23 figures, accepted for publication in Astronomy and
Astrophysics.