2,901 research outputs found

    The DIGMAP geo-temporal web gazetteer service

    This paper presents the DIGMAP geo-temporal Web gazetteer service, a system providing access to names of places, historical periods, and associated geo-temporal information. Within the DIGMAP project, this gazetteer serves as the unified repository of geographic and temporal information, assisting in the recognition and disambiguation of geo-temporal expressions in text, as well as in resource searching and indexing. We describe the data integration methodology, the handling of temporal information, and some of the applications that use the gazetteer. Initial evaluation results show that the proposed system can adequately support several tasks related to geo-temporal information extraction and retrieval.

    Fast and Tiny Structural Self-Indexes for XML

    XML document markup is highly repetitive and therefore compresses well with dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees have previously been used as synopses for structural XPath queries. Here a fully-fledged index over such grammars is presented. The index allows arbitrary tree algorithms to be executed with a slow-down that is comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index, because no decompression occurs. For example, for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also allows XML results (including texts) to be serialized faster than previous systems, by a factor of about 2-3. This is due to efficient copy handling of grammar repetitions, and because materialization is avoided entirely. To compare with twig join implementations, we implemented a materializer which writes out pre-order numbers of result nodes, and show its competitiveness. Comment: 13 pages
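
The point about repetitive markup and queries that avoid decompression can be illustrated with a small sketch. This is not the grammar-based index from the paper: it only shows the simpler DAG idea, under the assumption that a document is given as nested `(label, children)` tuples. The helper names `build_dag` and `count_label` are hypothetical.

```python
def build_dag(tree, table):
    """Share identical subtrees; return the id of the node for `tree`.

    `tree` is (label, [children]); `table` maps (label, child_ids) -> node id.
    """
    label, children = tree
    key = (label, tuple(build_dag(child, table) for child in children))
    if key not in table:
        table[key] = len(table)
    return table[key]


def count_label(root_id, table, label):
    """Count nodes carrying `label` in the uncompressed tree.

    Each distinct DAG node is evaluated once, so nothing is decompressed.
    """
    nodes = {node_id: key for key, node_id in table.items()}
    memo = {}

    def count(node_id):
        if node_id not in memo:
            node_label, child_ids = nodes[node_id]
            memo[node_id] = (node_label == label) + sum(count(c) for c in child_ids)
        return memo[node_id]

    return count(root_id)


# A toy document: a catalog with three structurally identical items.
item = ("item", [("name", []), ("price", [])])
doc = ("catalog", [item, item, item])

table = {}
root = build_dag(doc, table)
print(len(table))                        # 4 distinct subtrees for a 10-node tree
print(count_label(root, table, "name"))  # 3
```

The count is computed once per shared subtree and reused for every repetition, which is the intuition behind evaluating count queries directly on a compressed structure.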

    TopSig: Topology Preserving Document Signatures

    Performance comparisons between file signatures and inverted files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as language and probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from the theoretical perspective they position the file signatures model in the class of vector space retrieval models. Comment: 12 pages, 8 figures, CIKM 201

    Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures

    Doctoral thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Srinivasa Rao Satti.

    Vast amounts of data are generated from a plethora of sources. Most of the data stored as files have no fixed schema, so the files are well suited to being maintained in semi-structured document formats. A number of such formats, including XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy of the original data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for later processing. Since semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain a document. Although general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinders later handling of the data, owing to the loss of internal structure. The area of succinct data structures is widely investigated in theory; such structures answer queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.

    In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core features of this representation are its compactness and queryability, which derive from the rich functionality of succinct data structures. The compact representation is engineered by combining (a) bit-string-indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice and show by experiments that it decreases disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, so that larger corpora can be processed even in a constrained environment.

    In parallel with establishing this compact semi-structured document representation, we improve some existing compression schemes. We first propose a way to encode an array of integers that is not necessarily sorted. This scheme improves upon existing universal code systems with the assistance of a succinct bit vector structure. We show that the proposed algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array. We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is the use of an intermediate super-bucket during operations, which gives better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits performed during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that the CPU parallel optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to previously existing bitmap index compression schemes.

    Contents:
    Chapter 1 Introduction: 1.1 Contribution; 1.2 Organization
    Chapter 2 Background: 2.1 Model of Computation; 2.2 Succinct Data Structures
    Chapter 3 Space-efficient Representation of Integer Arrays: 3.1 Introduction; 3.2 Preliminaries (3.2.1 Universal Code System, 3.2.2 Bit Vector); 3.3 Algorithm Description (3.3.1 Main Principle, 3.3.2 Optimization in the Implementation); 3.4 Experimental Results
    Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing: 4.1 Introduction; 4.2 Related Work (4.2.1 Byte-aligned Bitmap Code (BBC), 4.2.2 Word-Aligned Hybrid (WAH), 4.2.3 WAH-derived Algorithms, 4.2.4 GPU-based WAH Algorithms, 4.2.5 Super Byte-aligned Hybrid (SBH)); 4.3 Parallelizing SBH (4.3.1 CPU Parallelism, 4.3.2 GPU Parallelism); 4.4 Experimental Results (4.4.1 Plain Version, 4.4.2 Parallelized Version, 4.4.3 Summary)
    Chapter 5 Space-efficient Representation of Semi-structured Document Formats: 5.1 Preliminaries (5.1.1 Semi-structured Document Formats, 5.1.2 Resource Description Framework, 5.1.3 Succinct Ordinal Tree Representations, 5.1.4 String Compression Schemes); 5.2 Representation (5.2.1 Bit String Indexed Array, 5.2.2 Main Structure, 5.2.3 Single Document as a Collection of Chunks, 5.2.4 Supporting Queries); 5.3 Experimental Results (5.3.1 Datasets, 5.3.2 Construction Time, 5.3.3 RAM Usage during Construction, 5.3.4 Disk Usage and Serialization Time, 5.3.5 Chunk Division, 5.3.6 String Compression, 5.3.7 Query Time)
    Chapter 6 Conclusion
    Bibliography; Abstract (in Korean); Acknowledgements

    Optimized Indexes for Data Structured Retrieval

    The aim of this work is to present a novel index structure based on suffix arrays and ternary search trees, combined with rank and select succinct data structures. Suffix arrays were originally developed to reduce memory consumption compared to suffix trees, and ternary search trees combine the time efficiency of digital tries with the space efficiency of binary search trees. The rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol's occurrences. These operations are widely used in information retrieval and management, being the basis of several data structures and algorithms for text collections, graphs, trees, etc. The resulting structure is faster than hashing for many typical search problems, and supports a broader range of useful problems and operations. We therefore implement a path index based on those data structures, which is shown to be highly efficient when dealing with digital collections consisting of structured documents. We describe how the index architecture works, compare the search algorithms with others, and finally present experiments showing that the index outperforms earlier approaches.
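
The rank and select operations defined in this abstract can be sketched as follows. This is a deliberately naive illustration that stores a sorted position list per symbol; it is not a succinct structure and not the paper's index. The class name `RankSelect` and the convention that `rank` counts occurrences in the prefix of length `i` are assumptions made for the example.

```python
from bisect import bisect_left


class RankSelect:
    """Naive rank/select over a sequence, using one sorted position list per symbol."""

    def __init__(self, seq):
        self.positions = {}
        for i, sym in enumerate(seq):
            self.positions.setdefault(sym, []).append(i)

    def rank(self, sym, i):
        """Number of occurrences of `sym` in seq[0:i], the prefix of length i."""
        return bisect_left(self.positions.get(sym, []), i)

    def select(self, sym, k):
        """0-based position of the k-th occurrence of `sym` (k starts at 1)."""
        occurrences = self.positions.get(sym, [])
        if not 1 <= k <= len(occurrences):
            raise ValueError("no such occurrence")
        return occurrences[k - 1]


rs = RankSelect("abracadabra")
print(rs.rank("a", 8))    # 4: "abracada" contains four occurrences of "a"
print(rs.select("a", 3))  # 5: the third "a" sits at index 5
```

With these conventions, `rs.rank(sym, rs.select(sym, k))` equals `k - 1`, which is the sense in which select inverts rank.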

    Configurable indexing and ranking for XML information retrieval

    Indexing and ranking are two key factors for efficient and effective XML information retrieval. Inappropriate indexing may result in false negatives and false positives, and improper ranking may lead to low precision. In this paper, we propose a configurable XML information retrieval system, in which users can configure appropriate index types for XML tags and text contents. Based on users' index configurations, the system transforms XML structures into a compact tree representation, Ctree, and indexes XML text contents. To support XML ranking, we propose the concepts of "weighted term frequency" and "inverted element frequency," where the weight of a term depends on its frequency and location within an XML element as well as its popularity among similar elements in an XML dataset. We evaluate the effectiveness of our system through extensive experiments on the INEX 03 dataset and 30 content and structure (CAS) topics. The experimental results reveal that our system has significantly high precision at low recall regions and achieves the highest average precision (0.3309) as compared with 38 official INEX 03 submissions using the strict evaluation metric.

    Data sharing in DHT based P2P systems

    The evolution of peer-to-peer (P2P) systems triggered the building of large-scale distributed applications. The main application domain is data sharing across a very large number of highly autonomous participants. Building such data sharing systems is particularly challenging because of the "extreme" characteristics of P2P infrastructures: massive distribution, high churn rate, no global control, potentially untrusted participants, and so on. This article focuses on declarative querying support, query optimization, and data privacy in a major class of P2P systems, those based on Distributed Hash Tables (P2P DHT). The usual approaches and algorithms used by classic distributed systems and databases for providing data privacy and querying services are not well suited to P2P DHT systems. A considerable amount of work was required to adapt them to the new challenges such systems present. This paper describes the most important solutions found. It also identifies important future research trends in data management in P2P DHT systems.