2,444 research outputs found

    Longest Common Extensions in Sublinear Space

    The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper we show that for any trade-off parameter $1 \leq \tau \leq n$, the problem can be solved in $O(\frac{n}{\tau})$ space and $O(\tau)$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound. Comment: An extended abstract of this paper has been accepted to CPM 201
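To make the query in the abstract above concrete, here is a minimal Python sketch of what LCE$(i,j)$ returns. It is not the paper's data structure: it is the naive baseline that stores nothing beyond the text and scans character by character, so a query can take time linear in $n$. The function name `lce` and the 0-based indexing are assumptions made for illustration.

```python
def lce(text: str, i: int, j: int) -> int:
    """Naive longest common extension (illustrative baseline only).

    Returns the length of the longest common prefix of the suffixes
    text[i:] and text[j:], using 0-based positions (an assumption here).
    """
    n = len(text)
    length = 0
    # Extend the match while both positions stay inside the text
    # and the characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    t = "abracadabra"
    print(lce(t, 0, 7))  # suffixes "abracadabra" and "abra"
    print(lce(t, 0, 3))  # suffixes "abracadabra" and "acadabra"
```

The trade-off described in the abstract sits between this scan (no extra space, slow queries) and the classic linear-space, constant-time solution.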

    Music Retrieval System Using Query-by-Humming

    Music Information Retrieval (MIR) is a research area of great interest because there are various strategies for retrieving music. To retrieve music, it is important to measure the similarity between the input query and candidate pieces of music. Several solutions have been proposed and are currently used in applications, such as Query-by-Example (QBE), which takes a sample of an audio recording playing in the background and retrieves the matching result. However, there is no efficient approach to this problem in a Query-by-Humming (QBH) application, where the aim is to efficiently retrieve the music most similar to a hummed query. In this paper, I discuss the different music information retrieval techniques and their system architectures. Moreover, I discuss the Query-by-Humming approach and the various techniques that allow for a novel method of music retrieval. Lastly, I conclude that the proposed system was effective when combined with the MIDI dataset and custom hummed queries recorded from a sample of people. Although the mean reciprocal rank (MRR) was measured at 0.82–0.90 for only 100 songs in the database, the retrieval time was very high. Therefore, improving the retrieval time and exploring deep learning approaches are suggested for future work.
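The abstract above reports retrieval quality as an MRR score. As a hedged illustration of how such a score is commonly computed, the sketch below averages the reciprocal of the rank at which the correct song appears for each query. The function name, the input layout, and the toy song names are assumptions for illustration and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_results, targets):
    """Mean reciprocal rank over a set of queries (illustrative sketch).

    ranked_results: one ranked list of candidate items per query.
    targets: the single correct item for each query.
    A query whose target is missing from its list contributes 0.
    """
    if not targets or len(ranked_results) != len(targets):
        raise ValueError("need one non-empty, equally sized entry per query")
    total = 0.0
    for results, target in zip(ranked_results, targets):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank  # first (and only) hit for this query
                break
    return total / len(targets)


if __name__ == "__main__":
    # Hypothetical rankings for three hummed queries.
    rankings = [
        ["song_a", "song_b", "song_c"],
        ["song_b", "song_a", "song_c"],
        ["song_c", "song_b", "song_a"],
    ]
    print(mean_reciprocal_rank(rankings, ["song_a", "song_a", "song_a"]))
```

A score close to 1 means the correct song is usually ranked first; the measure says nothing about retrieval time.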

    Universal Compressed Text Indexing

    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comment
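The string attractor property mentioned in the abstract above can be checked by brute force on small inputs. The sketch below assumes the usual reading of the definition: a set of positions is an attractor if every distinct substring has at least one occurrence that covers one of those positions. The function name and 0-based positions are assumptions, and the code only illustrates the property. It is not part of the index described in the abstract and is far too slow for real texts.

```python
def is_string_attractor(text: str, positions) -> bool:
    """Brute-force check of the string attractor property (illustration only).

    positions: 0-based text positions (an assumption of this sketch).
    Returns True if every distinct substring of text has an occurrence
    text[i..j] with i <= p <= j for some p in positions.
    """
    n = len(text)
    # Substrings that have an occurrence crossing some attractor position.
    covered = set()
    for p in set(positions):
        if not 0 <= p < n:
            raise ValueError("position outside the text")
        for i in range(p + 1):
            for j in range(p, n):
                covered.add(text[i:j + 1])
    # All distinct non-empty substrings of the text.
    all_substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    return all_substrings <= covered


if __name__ == "__main__":
    print(is_string_attractor("abab", {0, 1}))  # both letters are covered
    print(is_string_attractor("abab", {0}))     # "b" never crosses position 0
```

Highly repetitive texts admit small attractors, which is why the space bound in the abstract is stated in terms of $\gamma$ rather than $n$.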

    Time-space trade-offs for Lempel-Ziv compressed indexing

    Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + occ \lg\lg n)$ time using $O(z\lg(n/z)\lg\lg z)$ space, or (ii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z\lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + occ(\lg\lg n + \lg^{\epsilon} z))$ time using $O(z(\lg(n/z) + \lg^{\epsilon} z))$ space, where $n$ and $m$ are the lengths of the input string and query string respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $occ$ is the number of occurrences of the query in the input, and $\epsilon > 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m\lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg\lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1+\frac{\lg^{\epsilon} z}{\lg (n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$, for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time.
    Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
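All bounds in the abstract above are expressed in terms of $z$, the number of phrases in the LZ77 parse. The sketch below shows how $z$ can be obtained for a small string under one common variant of the parse, which is an assumption here since the abstract does not fix the variant: each phrase is the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), or a single character if there is no such prefix. The quadratic-time scan and the function name are for illustration only and are unrelated to the index itself.

```python
def lz77_phrases(text: str):
    """Greedy LZ77-style parse (one common variant, brute force).

    Each phrase is the longest prefix of text[i:] that also occurs
    starting at some earlier position (self-overlap allowed), or a
    single fresh character if no earlier occurrence exists.
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        longest = 0
        for start in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and text[start + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a new character forms a phrase of length 1
        phrases.append(text[i:i + step])
        i += step
    return phrases


if __name__ == "__main__":
    parse = lz77_phrases("abracadabra")
    print(parse, "z =", len(parse))
```

On repetitive inputs $z$ is much smaller than $n$, which is the regime where space bounds such as $O(z\lg(n/z))$ pay off.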