25 research outputs found

    Investigation into Indexing XML Data Techniques

    The rapid development of XML technology has improved the WWW, since XML data has many advantages and has become a common technology for transferring data across the Internet. The objective of this research is therefore to investigate XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. The research considers the most common XML indexing techniques, compares them, and uses that comparison to argue where their limitations lie. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes: all the indexes become large in order to perform well, and none of them is suitable for all users' requirements. However, each of these techniques has some advantages

    Accelerating data retrieval steps in XML documents


    Efficient data representation for XML in peer-based systems

    Purpose - New directions in the provision of end-user computing experiences mean that the best way to share data between small mobile computing devices needs to be determined. Partitioning large structures so that they can be shared efficiently provides a basis for data-intensive applications on such platforms. The partitioned structure can be compressed using dictionary-based approaches and then directly queried without firstly decompressing the whole structure.
    Design/methodology/approach - The paper describes an architecture for partitioning XML into structural and dictionary elements and the subsequent manipulation of the dictionary elements to make the best use of available space.
    Findings - The results indicate that considerable savings are available by removing duplicate dictionaries. The paper also identifies the most effective strategy for defining dictionary scope.
    Research limitations/implications - This evaluation is based on a range of benchmark XML structures and the approach to minimising dictionary size shows benefit in the majority of these. Where structures are small and regular, the benefits of efficient dictionary representation are lost. The authors' future research now focuses on heuristics for further partitioning of structural elements.
    Practical implications - Mobile applications that need access to large data collections will benefit from the findings of this research. Traditional client/server architectures are not suited to dealing with high volume demands from a multitude of small mobile devices. Peer data sharing provides a more scalable solution and the experiments that the paper describes demonstrate the most effective way of sharing data in this context.
    Social implications - Many services are available via smartphone devices but users are wary of exploiting the full potential because of the need to conserve battery power. The approach mitigates this challenge and consequently expands the potential for users to benefit from mobile information systems. This will have impact in areas such as advertising, entertainment and education but will depend on the acceptability of file sharing being extended from the desktop to the mobile environment.
    Originality/value - The original work characterises the most effective way of sharing large data sets between small mobile devices. This will save battery power on devices such as smartphones, thus providing benefits to users of such devices
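The idea of splitting XML into a structural part and a dictionary part can be illustrated with a minimal sketch. This is not the paper's architecture: the function name `split_structure`, the use of `-1` as a close marker, and the single document-wide dictionary are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def split_structure(xml_text: str) -> Tuple[List[int], List[str]]:
    """Split a document into a structure stream and a tag dictionary.

    The structure stream holds a dictionary id for every opening tag and
    -1 for every closing tag (a hypothetical encoding chosen for this sketch).
    """
    dictionary: List[str] = []   # id -> tag name, each name stored once
    ids = {}                     # tag name -> id
    structure: List[int] = []

    def visit(node: ET.Element) -> None:
        if node.tag not in ids:
            ids[node.tag] = len(dictionary)
            dictionary.append(node.tag)
        structure.append(ids[node.tag])   # opening tag
        for child in node:
            visit(child)
        structure.append(-1)              # closing tag

    visit(ET.fromstring(xml_text))
    return structure, dictionary


structure, dictionary = split_structure(
    "<lib><book><title/></book><book><title/></book></lib>"
)
print(dictionary)  # ['lib', 'book', 'title']
print(structure)   # [0, 1, 2, -1, -1, 1, 2, -1, -1, -1]
```

Repeated tag names cost one small integer each in the structure stream, while the names themselves are stored once. Two partitions that end up with identical dictionaries could share a single copy, which is the kind of duplicate removal the findings refer to.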

    Multidimensional XML File: A New XML File Structure


    Space-Efficient Representation of Semi-Structured Document Formats Using Succinct Data Structures

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti. Numerous big data are generated from a plethora of sources. Most of the data stored as files have no fixed schema, so the files are suitable to be maintained in semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora of data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on the semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinders future handling of the corpora, owing to the loss of internal structure. The area of succinct data structures is widely investigated in theory; such structures answer queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply the idea of succinct data structures to represent semi-structured documents in a space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core features of this representation are its compactness and query-ability, derived from the enriched functions of succinct data structures. The compact representation is engineered by combining (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, so that a larger corpus of big data can be processed even in a constrained environment. In parallel with establishing the aforementioned compact semi-structured document representation, we provide and reinforce some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems with the assistance of a succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of an intermediate super-bucket during operations, giving better performance on querying through a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that could be utilized on both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previously existing bitmap index compression schemes.
    Contents:
    Chapter 1 Introduction: 1.1 Contribution; 1.2 Organization
    Chapter 2 Background: 2.1 Model of Computation; 2.2 Succinct Data Structures
    Chapter 3 Space-efficient Representation of Integer Arrays: 3.1 Introduction; 3.2 Preliminaries (3.2.1 Universal Code System; 3.2.2 Bit Vector); 3.3 Algorithm Description (3.3.1 Main Principle; 3.3.2 Optimization in the Implementation); 3.4 Experimental Results
    Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing: 4.1 Introduction; 4.2 Related Work (4.2.1 Byte-aligned Bitmap Code (BBC); 4.2.2 Word-Aligned Hybrid (WAH); 4.2.3 WAH-derived Algorithms; 4.2.4 GPU-based WAH Algorithms; 4.2.5 Super Byte-aligned Hybrid (SBH)); 4.3 Parallelizing SBH (4.3.1 CPU Parallelism; 4.3.2 GPU Parallelism); 4.4 Experimental Results (4.4.1 Plain Version; 4.4.2 Parallelized Version; 4.4.3 Summary)
    Chapter 5 Space-efficient Representation of Semi-structured Document Formats: 5.1 Preliminaries (5.1.1 Semi-structured Document Formats; 5.1.2 Resource Description Framework; 5.1.3 Succinct Ordinal Tree Representations; 5.1.4 String Compression Schemes); 5.2 Representation (5.2.1 Bit String Indexed Array; 5.2.2 Main Structure; 5.2.3 Single Document as a Collection of Chunks; 5.2.4 Supporting Queries); 5.3 Experimental Results (5.3.1 Datasets; 5.3.2 Construction Time; 5.3.3 RAM Usage during Construction; 5.3.4 Disk Usage and Serialization Time; 5.3.5 Chunk Division; 5.3.6 String Compression; 5.3.7 Query Time)
    Chapter 6 Conclusion
    Bibliography; Abstract (in Korean); Acknowledgements
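The role a bit vector can play in indexing a packed array may be easier to see with a small sketch. It is not the dissertation's representation: the class name `RankBitVector`, the block size, and the sparse-array usage are assumptions, and Python lists are used where a real succinct structure would use packed words with a small auxiliary directory.

```python
from typing import List, Optional, Sequence


class RankBitVector:
    """Bit vector with block-wise prefix counts to speed up rank queries."""

    def __init__(self, bits: Sequence[int], block: int = 8) -> None:
        self.bits = list(bits)
        self.block = block
        # prefix[k] = number of 1 bits in the first k blocks
        self.prefix: List[int] = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.prefix.append(self.prefix[-1] + ones)

    def rank1(self, i: int) -> int:
        """Return the number of 1 bits in bits[0:i]."""
        if not 0 <= i <= len(self.bits):
            raise IndexError(i)
        b, r = divmod(i, self.block)
        base = b * self.block
        return self.prefix[b] + sum(self.bits[base:base + r])


def sparse_get(present: RankBitVector, values: List[str], i: int) -> Optional[str]:
    """Look up slot i of a sparse array whose non-empty slots are packed in `values`."""
    if not present.bits[i]:
        return None
    return values[present.rank1(i)]


present = RankBitVector([1, 0, 0, 1, 1, 0, 1, 0, 0, 1], block=4)
values = ["a", "b", "c", "d", "e"]   # one entry per 1 bit, in order
print(present.rank1(5))              # 3
print(sparse_get(present, values, 6))  # d
```

The rank of a set bit gives its position in the packed array, so empty slots cost one bit each rather than a full entry.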

    Clustering-based Labelling Scheme - A Hybrid Approach for Efficient Querying and Updating XML Documents

    Extensible Markup Language (XML) has become a dominant technology for transferring data through the worldwide web. XML labelling schemes play a key role in handling XML data efficiently and robustly, and many such schemes have been proposed. However, these labelling schemes have limitations and shortcomings. The aim of this research was therefore to investigate the existing XML labelling schemes and their limitations in order to address the efficiency of XML query performance. This thesis investigated the existing labelling schemes and classified them into three categories based on certain criteria, in order to identify their limitations and challenges. Based on the outcomes of this investigation, this thesis proposed a state-of-the-art labelling scheme, called the clustering-based labelling scheme, to resolve or improve key limitations such as the efficiency of XML query processing, the labelling of XML nodes, and the cost of XML updates. This thesis argued that using certain existing labelling schemes to label nodes, combined with clustering-based techniques, can improve query and node-labelling efficiency. Theoretically, the proposed scheme is based on dividing the nodes of an XML document into clusters. Two existing labelling schemes, the Dewey and LLS labelling schemes, were selected for labelling these clusters and their nodes. Subsequently, the proposed scheme was designed and implemented. In addition, the Dewey and LLS labelling schemes were implemented for the purpose of evaluating the proposed scheme. Four experiments were then designed to test the proposed scheme against the Dewey and LLS labelling schemes. The results of these experiments suggest that the proposed scheme achieved better results than the Dewey and LLS schemes. Consequently, the research hypothesis was accepted overall with few exceptions, and the proposed scheme showed an improvement in performance and in all the targeted features and aspects
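For readers unfamiliar with Dewey labels, a minimal sketch follows. It shows plain Dewey labelling only, not the clustering-based scheme or LLS; the function names and the choice to label the root as `(1,)` are assumptions. Each node's label is its parent's label extended by the node's position among its siblings, so structural relationships reduce to prefix tests.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

Label = Tuple[int, ...]


def dewey_labels(root: ET.Element) -> Dict[Label, str]:
    """Assign a Dewey label to every element and map it to the element's tag."""
    labels: Dict[Label, str] = {}

    def visit(node: ET.Element, label: Label) -> None:
        labels[label] = node.tag
        for position, child in enumerate(node, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(a: Label, d: Label) -> bool:
    """a is an ancestor of d when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p: Label, c: Label) -> bool:
    """p is the parent of c when c is p plus exactly one more component."""
    return len(c) == len(p) + 1 and c[:-1] == p


doc = ET.fromstring("<book><title/><chapter><section/><section/></chapter></book>")
labels = dewey_labels(doc)
for label in sorted(labels):          # sorted tuples follow document order
    print(".".join(map(str, label)), labels[label])
print(is_ancestor((1,), (1, 2, 2)))   # True
print(is_parent((1,), (1, 2, 2)))     # False
```

The well-known weakness this exposes is update cost: inserting a new first child shifts the position of every following sibling, so their labels and the labels of all their descendants must change.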

    Semantics and efficient evaluation of partial tree-pattern queries on XML

    Current applications export and exchange XML data on the web. Usually, XML data are queried using keyword queries or using the standard structured query language XQuery, the core of which consists of the navigational query language XPath. In this context, one major challenge is the querying of the data when the structure of the data sources is complex or not fully known to the user. Another challenge is the integration of multiple data sources that export data with structural differences and irregularities. In this dissertation, a query language for XML called Partial Tree-Pattern Query (PTPQ) language is considered. PTPQs generalize and strictly contain Tree-Pattern Queries (TPQs) and can express a broad structural fragment of XPath. Because of their expressive power and flexibility, they are useful for querying XML documents whose structure is complex or not fully known to the user, and for integrating XML data sources with different structures. The dissertation focuses on three issues. The first one is the design of efficient non-main-memory evaluation methods for PTPQs. The second one is the assignment of semantics to PTPQs so that they return meaningful answers. The third one is the development of techniques for answering TPQs using materialized views. Non-main-memory XML query evaluation can be done in two modes (which also define two evaluation models). In the first mode, data are preprocessed and indexes, called inverted lists, are built for them. In the second mode, data are unindexed and arrive continuously in the form of a stream. Existing algorithms cannot be used directly or indirectly to efficiently compute PTPQs in either mode. Initially, the problem of efficiently evaluating partial path queries in the inverted lists model has been addressed. Partial path queries form a subclass of PTPQs which is not contained in the class of TPQs. Three novel algorithms for evaluating partial path queries, including a holistic one, have been designed. The analytical and experimental results show that the holistic algorithm outperforms the other two. These results have been extended into holistic and non-holistic approaches for PTPQs in the inverted lists model. The experiments again show the superiority of the holistic approach. The dissertation has also addressed the problem of evaluating PTPQs in the streaming model, and two original efficient streaming algorithms for PTPQs have been designed. Compared to the only known streaming algorithm that supports an extension of TPQs, the experimental results show that the proposed algorithms perform better by orders of magnitude while consuming a much smaller fraction of memory space. An original approach for assigning semantics to PTPQs has also been devised. The novel semantics seamlessly applies to keyword queries and to queries with structural restrictions. In contrast to previous approaches that operate locally on data, the proposed approach operates globally on structural summaries of data to extract tree patterns. Compared to previous approaches, an experimental evaluation shows that our approach has perfect recall both for XML documents with complete data and for those with incomplete data. It also shows better precision compared to approaches with similar recall. Finally, the dissertation has addressed the problem of answering XML queries using exclusively materialized views. An original approach for materializing views in the context of the inverted lists model has been suggested. Necessary and sufficient conditions have been provided for tree-pattern query answerability in terms of view-to-query homomorphisms. A time- and space-efficient algorithm was designed for deciding query answerability, and a technique for computing queries over view materializations using stack-based holistic algorithms was developed. Further, optimizations were developed which (a) minimize the storage space and avoid redundancy by materializing views as bitmaps, and (b) optimize the evaluation of the queries over the views by applying bitwise operations on view materializations. The experimental results show that the proposed approach obtains considerably higher hit rates than previous approaches, significantly speeds up the evaluation of queries compared with evaluation without views, and scales very smoothly in terms of storage space and computational overhead
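Materializing a view as a bitmap over an inverted list, and combining views with bitwise operations, can be sketched as follows. This is a generic illustration and not the dissertation's algorithm: the node identifiers, the helper names `to_bitmap` and `from_bitmap`, and the use of Python integers as bitmaps are all assumptions.

```python
from typing import Iterable, List


def to_bitmap(inverted_list: List[int], selected: Iterable[int]) -> int:
    """Encode a view as a bitmap: bit k is set when inverted_list[k] is in the view."""
    chosen = set(selected)
    bitmap = 0
    for pos, node in enumerate(inverted_list):
        if node in chosen:
            bitmap |= 1 << pos
    return bitmap


def from_bitmap(inverted_list: List[int], bitmap: int) -> List[int]:
    """Decode a bitmap back into the node identifiers it selects."""
    return [node for pos, node in enumerate(inverted_list) if (bitmap >> pos) & 1]


# Hypothetical inverted list: identifiers of all nodes with one element label.
authors = [4, 9, 15, 22, 31]

view_a = to_bitmap(authors, [4, 15, 31])   # nodes selected by one view
view_b = to_bitmap(authors, [9, 15, 31])   # nodes selected by another view

print(from_bitmap(authors, view_a & view_b))  # [15, 31]
print(from_bitmap(authors, view_a | view_b))  # [4, 9, 15, 31]
```

Each view costs one bit per entry of the shared inverted list instead of a copy of the selected nodes, and intersecting or merging views becomes a single bitwise operation.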

    Enhancing the Usability of XML keyword Search

    Ph.D. (Doctor of Philosophy)

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other, more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders. Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
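As background, plain Variable-Byte coding of an inverted list can be sketched in a few lines. This shows only the baseline encoder, not the partitioned representation the paper proposes. The function names are hypothetical, and the byte layout (7 data bits per byte, least significant group first, high bit set when more bytes follow) is one common convention among several.

```python
from typing import Iterable, List


def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer in 7-bit groups, low group first."""
    if n < 0:
        raise ValueError("Variable-Byte encodes non-negative integers only")
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)   # high bit set: more bytes follow
        n >>= 7
    out.append(n)                        # high bit clear: last byte
    return bytes(out)


def vbyte_decode_all(data: bytes) -> List[int]:
    """Decode a concatenation of Variable-Byte codes."""
    values: List[int] = []
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n = shift = 0
    return values


def encode_postings(sorted_ids: Iterable[int]) -> bytes:
    """Store a sorted list as gaps between neighbours, each gap Variable-Byte coded."""
    out = bytearray()
    previous = 0
    for doc_id in sorted_ids:
        out += vbyte_encode(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Rebuild the sorted list by summing the decoded gaps."""
    ids: List[int] = []
    total = 0
    for gap in vbyte_decode_all(data):
        total += gap
        ids.append(total)
    return ids


encoded = encode_postings([3, 130, 131, 1000])
print(len(encoded))              # 5
print(decode_postings(encoded))  # [3, 130, 131, 1000]
```

Every value costs at least one whole byte, even a gap of 1, which is the space inefficiency on small integers that the abstract refers to.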