    Reverse-Safe Data Structures for Text Indexing

    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.

    Palindromic Length of Words with Many Periodic Palindromes

    The palindromic length $\text{PL}(v)$ of a finite word $v$ is the minimal number of palindromes whose concatenation is equal to $v$. In 2013, Frid, Puzynina, and Zamboni conjectured that if $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$, then $w$ is ultimately periodic. Suppose that $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$. Let $\Omega(w,k)$ be the set of all factors $u$ of $w$ that have more than $\sqrt[k]{k^{-1}\vert u\vert}$ palindromic prefixes. We show that $\Omega(w,k)$ is an infinite set and we show that for each positive integer $j$ there are palindromes $a,b$ and a word $u\in \Omega(w,k)$ such that $(ab)^j$ is a factor of $u$ and $b$ is nonempty. Note that $(ab)^j$ is a periodic word and $(ab)^ia$ is a palindrome for each $i\leq j$. These results justify the following question: What is the palindromic length of a concatenation of a suffix of $b$ and a periodic word $(ab)^j$ with "many" periodic palindromes? It is known that $\lvert\text{PL}(uv)-\text{PL}(u)\rvert\leq \text{PL}(v)$, where $u$ and $v$ are nonempty words. The main result of our article shows that if $a,b$ are palindromes, $b$ is nonempty, $u$ is a nonempty suffix of $b$, $\vert ab\vert$ is the minimal period of $aba$, and $j$ is a positive integer with $j\geq3\text{PL}(u)$, then $\text{PL}(u(ab)^j)-\text{PL}(u)\geq 0$.
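
As a small illustration of the definition of palindromic length, the following Python sketch computes PL(v) for a finite word by quadratic dynamic programming. It is not taken from the article: the function name `palindromic_length` and the convention that the empty word has palindromic length 0 are assumptions made for this example.

```python
def palindromic_length(v: str) -> int:
    """Minimal number of palindromes whose concatenation equals v.

    Assumption for this sketch: the empty word has palindromic length 0.
    """
    n = len(v)
    # is_pal[i][j] is True when the factor v[i..j] (inclusive) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if v[i] == v[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
    # pl[j] is the palindromic length of the prefix v[:j].
    pl = [0] + [n] * n
    for j in range(1, n + 1):
        for i in range(j):
            if is_pal[i][j - 1]:
                pl[j] = min(pl[j], pl[i] + 1)
    return pl[n]


if __name__ == "__main__":
    u, v = "ab", "ba"
    # Illustrates |PL(uv) - PL(u)| <= PL(v) on one pair of words.
    print(palindromic_length(u + v), palindromic_length(u), palindromic_length(v))
```

The table-based approach uses quadratic time and space, which is enough to experiment with short words such as prefixes of $u(ab)^j$.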

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function $S_T$ counting the cardinalities of the sets of substrings of each length of $T$, also known as the substring complexity. This new measure is defined as $\delta= \sup\{S_T(k)/k, k\geq 1\}$ and lower bounds all the measures previously considered. In particular, $\delta\leq \gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Omega(n)$ working space. Kociumaka et al. showed that if $\delta$ is given, one can construct an $\mathcal{O}(\delta \log \frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? It is straightforward to show that any algorithm computing $\delta$ using $\mathcal{O}(b)$ space requires $\Omega(n^{2-o(1)}/b)$ time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an $\mathcal{O}(n^3/b^2)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b\in[1,n]$; and an $\tilde{\mathcal{O}}(n^2/b)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b\in[n^{2/3},n]$.
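
To make the definition of $\delta$ concrete, here is a naive Python sketch that evaluates $S_T(k)$ for every $k$ by collecting substrings in a set and then takes the maximum of $S_T(k)/k$. This is only an illustration of the definition, not one of the space-efficient algorithms from the abstract; the function name `substring_complexity` is hypothetical and the text is assumed to be nonempty.

```python
from fractions import Fraction


def substring_complexity(t: str):
    """Return (S, delta) for a nonempty string t.

    S maps each k in [1, len(t)] to the number of distinct substrings of
    length k, and delta = max over k of S[k] / k. For k > len(t) there are
    no substrings, so the supremum is attained for some k <= len(t).
    """
    n = len(t)
    s = {
        k: len({t[i:i + k] for i in range(n - k + 1)})
        for k in range(1, n + 1)
    }
    # Exact rational arithmetic avoids floating-point comparisons.
    delta = max(Fraction(s[k], k) for k in s)
    return s, delta


if __name__ == "__main__":
    counts, delta = substring_complexity("abab")
    print(counts, delta)
```

This brute-force version stores all substrings of each length, so its working space is far from sublinear; it is only useful for checking small cases by hand.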

    Substring Complexity in Sublinear Space

    Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:
    - O((n^3 log b)/b^2) time using O(b) space, for any b ∈ [1,n], in the comparison model.
    - Õ(n^2/b) time using Õ(b) space, for any b ∈ [√n,n], in the word RAM model.
    This gives an Õ(n^{1+ε})-time and Õ(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.

    Constructing Antidictionaries of Long Texts in Output-Sensitive Space

    A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{\{y_1,\ldots,y_k\}}$ of minimal absent words of length at most $\ell$ of the collection $\{y_1, \ldots, y_k\}$. The set $M^{\ell}_{\{y_1,\ldots,y_k\}}$ contains all the words $x$ such that $x$ is absent from all the words of the collection while there exist $i,j$ such that the maximal proper suffix of $x$ is a factor of $y_i$ and the maximal proper prefix of $x$ is a factor of $y_j$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. Indeed, the set $M^{\ell}_{y}$ of minimal absent words of a word $y$ is equal to $M^{\ell}_{\{y_1,\ldots,y_k\}}$ for any decomposition of $y$ into a collection of words $y_1, \ldots, y_k$ such that there is an overlap of length at least $\ell - 1$ between any two consecutive words in the collection. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\lVert M^{\ell}_{\{y_1,\ldots,y_N\}}\rVert = o(n)$, for all $N \in [1,k]$, where $\lVert S\rVert$ denotes the sum of the lengths of words in set $S$. For instance, in the human genome, $n \approx 3 \times 10^9$ but $\lVert M^{12}_{\{y_1,\ldots,y_k\}}\rVert \approx 10^6$. We consider a constant-sized alphabet for stating our results. We show that all $M^{\ell}_{y_1},\ldots,M^{\ell}_{\{y_1,\ldots,y_k\}}$ can be computed in $O(kn+\sum_{N=1}^{k}\lVert M^{\ell}_{\{y_1,\ldots,y_N\}}\rVert)$ total time using $O(\text{MaxIn}+\text{MaxOut})$ space, where MaxIn is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\text{MaxOut}=\max\{\lVert M^{\ell}_{\{y_1,\ldots,y_N\}}\rVert : N\in[1,k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
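
The definition of minimal absent words of a collection can be checked on small inputs with a brute-force Python sketch. It is only an illustration of the definition and does not reflect the output-sensitive technique of the abstract; the function name `minimal_absent_words` and its parameters are assumptions chosen for this example.

```python
def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length at most max_len.

    A word x qualifies when x occurs in no word of the collection, while
    its maximal proper prefix and its maximal proper suffix each occur in
    some word of the collection (possibly different ones).
    """
    # All factors of length at most max_len, plus the empty word.
    factors = {""}
    for y in words:
        for i in range(len(y)):
            for j in range(i + 1, min(len(y), i + max_len) + 1):
                factors.add(y[i:j])
    maws = set()
    # Every candidate x is a factor u (its maximal proper prefix) plus a letter.
    for u in factors:
        if len(u) >= max_len:
            continue
        for c in alphabet:
            x = u + c
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words(["abaab"], "ab", 4)))
    print(sorted(minimal_absent_words(["ab", "ba"], "ab", 3)))
```

The second call shows the collection semantics: `aba` is reported because its prefix `ab` occurs in the first word and its suffix `ba` occurs in the second. The sketch stores every short factor explicitly, so it is suitable only for toy inputs.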

    POSTURE AND POSTUROLOGY, ANATOMICAL AND PHYSIOLOGICAL PROFILES: OVERVIEW AND CURRENT STATE OF ART

    Background and aim of work: Posture is the position of the body in space, and is controlled by a set of anatomical structures. The maintenance and the control of posture result from a set of interactions between the musculoskeletal, visual, vestibular, and skin systems. Lately, numerous studies have correlated the musculoskeletal system and the maintenance of posture. In particular, the correction of defects and obstruction of temporomandibular disorders seems to have an impact on posture. The aim of this work is to collect information in the literature on posture and the influence of the stomatognathic system on the postural system. Methods: Comparison of the literature on posture and posturology by consulting books and scientific sites. Results: In the results obtained from the comparison of the literature on posture and posturology, some studies support the correlation between the stomatognathic system and posture, while others do not support such a correlation. Conclusions: Further studies are necessary to be able to confirm one or the other argument. (www.actabiomedica.it)

    A Characterization of Bispecial Sturmian Words

    A finite Sturmian word w over the alphabet {a,b} is left special (resp. right special) if aw and bw (resp. wa and wb) are both Sturmian words. A bispecial Sturmian word is a Sturmian word that is both left and right special. We show as a main result that bispecial Sturmian words are exactly the maximal internal factors of Christoffel words, that are words coding the digital approximations of segments in the Euclidean plane. This result is an extension of the known relation between central words and primitive Christoffel words. Our characterization allows us to give an enumerative formula for bispecial Sturmian words. We also investigate the minimal forbidden words for the set of Sturmian words. Comment: Accepted to MFCS 201

    Words with the Maximum Number of Abelian Squares

    An abelian square is the concatenation of two words that are anagrams of one another. A word of length $n$ can contain $\Theta(n^2)$ distinct factors that are abelian squares. We study infinite words such that the number of abelian square factors of length $n$ grows quadratically with $n$. Comment: To appear in the proceedings of WORDS 201
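
For readers who want to experiment with the definition, the following Python sketch enumerates the distinct abelian square factors of a finite word by comparing letter counts of the two halves. It is a naive illustration, not a method from the paper; the function name `abelian_square_factors` is hypothetical.

```python
from collections import Counter


def abelian_square_factors(w: str) -> set:
    """Return the set of distinct nonempty factors of w that are abelian squares.

    A factor uv with |u| = |v| is an abelian square when u and v are
    anagrams, i.e. they have the same letter counts.
    """
    n = len(w)
    found = set()
    for i in range(n):
        # half is the length of each of the two halves starting at position i.
        for half in range(1, (n - i) // 2 + 1):
            left = w[i:i + half]
            right = w[i + half:i + 2 * half]
            if Counter(left) == Counter(right):
                found.add(left + right)
    return found


if __name__ == "__main__":
    print(sorted(abelian_square_factors("abba")))
```

Recomputing the letter counts for every pair of halves is wasteful; a sliding-window update of the counts would be faster, but the direct version keeps the definition visible.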

    Minimal Absent Words in Rooted and Unrooted Trees

    We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

    Single-cell NGS-based analysis of copy number alterations reveals new insights in circulating tumor cells persistence in early-stage breast cancer

    Circulating tumor cells (CTCs) are a rare population of cells representing a key player in the metastatic cascade. They are recognized as a validated tool for the identification of patients with a higher risk of relapse, including those diagnosed with breast cancer (BC). However, CTCs are characterized by high levels of heterogeneity that also involve copy number alterations (CNAs), structural variations associated with gene dosage changes. In this study, single CTCs were isolated from the peripheral blood of 11 early-stage BC patients at different time points. A label-free enrichment of CTCs was performed using OncoQuick, and single CTCs were isolated using DEPArray. Libraries were prepared from single CTCs and DNA extracted from matched tumor tissues for a whole-genome low-coverage next-generation sequencing (NGS) analysis using the Ion Torrent S5 System. The analysis of the CNA burden highlighted that CTCs had different degrees of aberration based on the time point and subtype. CTCs were found even six months after surgery and shared CNAs with matched tumor tissue. Tumor-associated CNAs that were recurrent in CTCs were patient-specific, and some alterations involved regions associated with BC and survival (i.e., gains at 1q21-23 and 5p15.33). The enrichment analysis emphasized the involvement of aberrations of terms, associated in particular with interferon (IFN) signaling. Collectively, our findings reveal that these aberrations may contribute to understanding the molecular mechanisms involving CTC-related processes and their survival ability in occult niches, supporting the goal of exploiting their application in patients’ surveillance and follow-up