1,685 research outputs found
Extending Yioop! With Geographical Location Local Search
It is often useful when doing an internet search to get results based on our current location. For example, we might want such results when we search for restaurants, car service centers, or hospitals. Current open source search engines like those based on Nutch do not provide this facility. Commercial engines like Google and Yahoo! provide this facility, so it would be useful to incorporate it in an open source alternative. The goal of this project is to include location-aware search in Yioop! (Pollett, 2012) by using geographical data from OpenStreetMap ("Open Street map wiki", 2012) and the hostip.info ("DMOZ", n.d.) database to geolocate IP addresses.
Compressed Text Indexes: From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they lacked a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology.
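To illustrate how a full-text self-index can answer count queries without scanning the text, the sketch below runs backward search over a Burrows-Wheeler transform, the mechanism used by the FM-index family of self-indexes. It is a minimal, uncompressed sketch under stated assumptions and not the Pizza&Chili API: the names `build_index` and `count` are hypothetical, the suffix sorting is naive, and the occurrence tables are stored in full, so no space is saved here.

```python
def build_index(text):
    """Build a toy BWT-based index (hypothetical helper, not a real library API)."""
    s = text + "\0"  # sentinel that sorts before every other character
    suffix_order = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix sort
    bwt = "".join(s[i - 1] for i in suffix_order)

    # first[c] = number of characters in s that are strictly smaller than c
    freq = {}
    for ch in bwt:
        freq[ch] = freq.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(freq):
        first[ch] = total
        total += freq[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in freq}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ, len(bwt)


def count(index, pattern):
    """Count occurrences of pattern by backward search, one step per character."""
    first, occ, n = index
    lo, hi = 0, n  # current range of sorted suffixes, half-open
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("abracadabra")
print(count(index, "abra"))
```

The query cost depends on the pattern length rather than the text length; practical self-indexes replace the full `occ` tables with compressed rank structures.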
TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these were not so far linked to general purpose, signature file based,
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state of the art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings and from the theoretical perspective it positions the file
signatures model in the class of Vector Space retrieval models.
Comment: 12 pages, 8 figures, CIKM 201
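For readers unfamiliar with document signatures, the sketch below shows one generic way to build them: each term is mapped to a pseudo-random vector of +1/-1 values, the vectors are summed with term-frequency weights, and the sign of each component becomes one bit of the signature. This is a plain random-projection sketch, not TopSig's actual construction; the function names, the 64-bit width, and the use of SHA-256 as the source of pseudo-randomness are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def term_vector(term, width):
    """Deterministic pseudo-random +1/-1 vector for a term (assumed scheme)."""
    values, counter = [], 0
    while len(values) < width:
        digest = hashlib.sha256(f"{term}:{counter}".encode()).digest()
        for byte in digest:
            for k in range(8):
                values.append(1 if (byte >> k) & 1 else -1)
        counter += 1
    return values[:width]


def signature(text, width=64):
    """Sum frequency-weighted term vectors and keep only the sign of each sum."""
    sums = [0] * width
    for term, freq in Counter(text.lower().split()).items():
        for i, sign in enumerate(term_vector(term, width)):
            sums[i] += freq * sign
    bits = 0
    for i, value in enumerate(sums):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


doc = signature("signature files for text retrieval")
query = signature("text retrieval with signature files")
print(hamming(doc, query))
```

Ranking then reduces to comparing fixed-width bit strings by Hamming distance, which is why signatures of this kind can be searched with simple bitwise operations.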
Fast and Tiny Structural Self-Indexes for XML
XML document markup is highly repetitive and therefore well compressible
using dictionary-based methods such as DAGs or grammars. In the context of
selectivity estimation, grammar-compressed trees were used before as synopsis
for structural XPath queries. Here a fully-fledged index over such grammars is
presented. The index allows executing arbitrary tree algorithms with a
slow-down that is comparable to the space improvement. More interestingly,
certain algorithms execute much faster over the index (because no decompression
occurs). E.g., for structural XPath count queries, evaluating over the index is
faster than previous XPath implementations, often by two orders of magnitude.
The index also allows serializing XML results (including texts) faster than
previous systems, by a factor of ca. 2-3. This is due to efficient copy
handling of grammar repetitions, and because materialization is totally
avoided. In order to compare with twig join implementations, we implemented a
materializer which writes out pre-order numbers of result nodes, and show its
competitiveness.
Comment: 13 page
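The DAG form of dictionary-based tree compression mentioned in the abstract above can be sketched as follows: identical subtrees are stored once and shared, and a count query is answered with one pass over the shared nodes instead of over the expanded tree. This is a simplified illustration, not the paper's grammar-based index; the tuple encoding `(label, [children])` and the names `build_dag`, `tree_size`, and `count_label` are hypothetical.

```python
def build_dag(tree):
    """Share identical subtrees; a tree is a (label, [children]) tuple (assumed encoding)."""
    ids = {}     # (label, child ids) -> node id
    nodes = []   # node id -> (label, child ids)

    def visit(node):
        label, children = node
        key = (label, tuple(visit(child) for child in children))
        if key not in ids:
            ids[key] = len(nodes)
            nodes.append(key)
        return ids[key]

    root = visit(tree)
    return root, nodes


def tree_size(tree):
    """Number of nodes in the uncompressed tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


def count_label(root, nodes, label):
    """Count nodes with a given label in the unfolded tree without unfolding it."""
    counts = [0] * len(nodes)
    # Children are created before their parents, so they always have smaller ids.
    for i, (node_label, children) in enumerate(nodes):
        counts[i] = (node_label == label) + sum(counts[c] for c in children)
    return counts[root]


book = ("book", [("title", []), ("author", [])])
library = ("library", [book, book, book])
root, nodes = build_dag(library)
print(tree_size(library), len(nodes), count_label(root, nodes, "title"))
```

Each shared node is visited once, so the work is proportional to the size of the compressed structure rather than the size of the document tree.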
Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Large volumes of data are generated from a plethora of sources. Most of this data is stored in files without a fixed schema, so the files are well suited to semi-structured document formats. Several such formats, including XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy of the original data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for later processing.
Since semi-structured document formats favor readability and verbosity, they need redundant space to organize and maintain a document. Although general-purpose compression schemes are widely used to compact such documents, applying them hinders later handling of the data, because the internal structure is lost.
Succinct data structures are widely studied in theory: they answer queries while the encoded data occupies space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core strengths of this representation are its compactness and query-ability, derived from the rich functionality of succinct data structures. The representation combines (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice, and show by experiments that constructing it decreases disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, so that a larger corpus can be processed even in a constrained environment.
In parallel to establishing this compact semi-structured document representation, we also improve some existing compression schemes. We first suggest a way to encode an array of integers that is not necessarily sorted. This scheme improves upon existing universal code systems with the assistance of a succinct bit vector structure. We show that the suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array.
We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which gives better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that the CPU parallel processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to existing bitmap index compression schemes.
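The abstract above names bit vectors as a basic succinct data structure. As a rough illustration of the two queries such a structure usually supports, the sketch below answers rank (how many 1 bits precede a position) from precomputed per-block counts and select (where the k-th 1 bit is) by binary search over rank. It is a teaching sketch under stated assumptions, not the dissertation's implementation: the class name `RankBitVector`, the block size of 64, and the plain Python list storage are hypothetical, and a real succinct structure would pack the bits and use far less auxiliary space.

```python
class RankBitVector:
    """Toy bit vector with block-sampled rank counts (hypothetical design)."""

    BLOCK = 64

    def __init__(self, bits):
        self.bits = list(bits)
        # block_ranks[j] = number of 1 bits before block j
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1 bits in bits[:i]."""
        block, offset = divmod(i, self.BLOCK)
        start = block * self.BLOCK
        return self.block_ranks[block] + sum(self.bits[start:start + offset])

    def select1(self, k):
        """Position of the k-th 1 bit, counting from k = 1."""
        if k < 1 or k > self.block_ranks[-1]:
            raise ValueError("k is out of range")
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo


vector = RankBitVector([1, 0, 1, 1, 0] * 40)
print(vector.rank1(65), vector.select1(4))
```

Rank and select are the primitives that let a tree stored as a bit sequence be navigated in place, which is how succinct ordinal trees stay queryable without being expanded.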
Chapter 1 Introduction
1.1 Contribution
1.2 Organization
Chapter 2 Background
2.1 Model of Computation
2.2 Succinct Data Structures
Chapter 3 Space-efficient Representation of Integer Arrays
3.1 Introduction
3.2 Preliminaries
3.2.1 Universal Code System
3.2.2 Bit Vector
3.3 Algorithm Description
3.3.1 Main Principle
3.3.2 Optimization in the Implementation
3.4 Experimental Results
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
4.1 Introduction
4.2 Related Work
4.2.1 Byte-aligned Bitmap Code (BBC)
4.2.2 Word-Aligned Hybrid (WAH)
4.2.3 WAH-derived Algorithms
4.2.4 GPU-based WAH Algorithms
4.2.5 Super Byte-aligned Hybrid (SBH)
4.3 Parallelizing SBH
4.3.1 CPU Parallelism
4.3.2 GPU Parallelism
4.4 Experimental Results
4.4.1 Plain Version
4.4.2 Parallelized Version
4.4.3 Summary
Chapter 5 Space-efficient Representation of Semi-structured Document Formats
5.1 Preliminaries
5.1.1 Semi-structured Document Formats
5.1.2 Resource Description Framework
5.1.3 Succinct Ordinal Tree Representations
5.1.4 String Compression Schemes
5.2 Representation
5.2.1 Bit String Indexed Array
5.2.2 Main Structure
5.2.3 Single Document as a Collection of Chunks
5.2.4 Supporting Queries
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Construction Time
5.3.3 RAM Usage during Construction
5.3.4 Disk Usage and Serialization Time
5.3.5 Chunk Division
5.3.6 String Compression
5.3.7 Query Time
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
Docto
Query-friendly Compression and Indexing of Recurring Structures in XML Documents
XML documents are by design self-describing. In order to accomplish this, the XML data is highly verbose and very repetitious. Although techniques already exist to compress XML and text in general, most do not keep the data in a form that is useful to users. We present a technique that makes use of recurring structures within an XML document to compress the file in a way that can achieve better compression than other query-friendly compression techniques while still maintaining the data in a form that allows for both querying and indexing. Further, we present an example implementation of the technique, complete with an index-building mechanism and query processing capabilities.