Optimizing XML Compression
The eXtensible Markup Language (XML) provides a powerful and flexible means
of encoding and exchanging data. As it turns out, its main advantage as an
encoding format (namely, the requirement that all opening and closing markup
tags be present and properly balanced) also yields one of its main
disadvantages: verbosity. XML-conscious compression techniques seek to
overcome this drawback.
Many of these techniques first separate XML structure from the document
content, and then compress each independently. Further compression gains can be
realized by identifying and compressing together document content that is
highly similar, thereby amortizing the storage costs of auxiliary information
required by the chosen compression algorithm. Additionally, the proper choice
of compression algorithm is an important factor not only for the achievable
compression gain, but also for access performance. Hence, choosing a
compression configuration that optimizes compression gain requires one to
determine (1) a partitioning strategy for document content, and (2) the best
available compression algorithm to apply to each set within this partition. In
this paper, we show that finding an optimal compression configuration with
respect to compression gain is an NP-hard optimization problem. This problem
remains intractable even if one considers a single compression algorithm for
all content. We also describe an approximation algorithm, based on the
branch-and-bound paradigm, for selecting a partitioning strategy for document
content.
Comment: 16 pages, extended version of paper accepted for XSym 200
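The structure/content separation step these techniques share can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a well-formed document parseable by the standard library, and uses zlib as a stand-in for whichever compressor the configuration assigns to each stream.

```python
import zlib
import xml.etree.ElementTree as ET

def separate(xml_text):
    """Split a document into a structure stream (the balanced tags)
    and a content stream (the text values), so that each stream can
    be compressed independently."""
    structure, content = [], []
    def walk(elem):
        structure.append("<%s>" % elem.tag)
        if elem.text and elem.text.strip():
            content.append(elem.text.strip())
        for child in elem:
            walk(child)
        structure.append("</%s>" % elem.tag)
    walk(ET.fromstring(xml_text))
    return "\n".join(structure), "\n".join(content)

doc = "<books><book><title>XML</title><price>10</price></book></books>"
structure, content = separate(doc)
# Each stream now compresses independently; grouping similar content
# (e.g. all <title> values together) amortizes the compressor's
# per-stream overhead, which is the partitioning problem studied here.
packed = [zlib.compress(s.encode()) for s in (structure, content)]
```

The hard part the paper addresses is not this separation but what follows it: partitioning the content stream and assigning a compression algorithm to each part.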
Optimizing XML Compression in XQueC
We present our approach to the problem of optimizing compression choices in the context of the XQueC compressed XML database system. In XQueC, data items are aggregated into containers, which are further grouped to be compressed together. This way, XQueC is able to exploit data commonalities and to perform query evaluation in the compressed domain, with the aim of improving both compression and querying performance. However, different compression
algorithms have different performance and support different sets of operations in the compressed domain. Therefore, choosing how to group containers and which compression algorithm to apply to each group is a challenging issue. We address this problem through an appropriate cost model and a suitable blend of heuristics which, based on a given query workload, are capable of driving
appropriate compression choices.
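The container-grouping decision can be illustrated with a toy gain measure. This is a deliberate simplification, not XQueC's cost model: zlib stands in for the candidate compressors, the container data are invented, and the workload-awareness is omitted.

```python
import zlib

def joint_gain(a, b):
    """Bytes saved by compressing two containers together rather than
    separately; a positive gain signals exploitable commonalities."""
    apart = len(zlib.compress(a)) + len(zlib.compress(b))
    together = len(zlib.compress(a + b))
    return apart - together

# Hypothetical containers: two columns of similar names, one of prices.
names_a = b"Alice Smith\nAlan Smithee\n" * 50
names_b = b"Alice Jones\nAlan Johnson\n" * 50
prices = b"19.99\n4.50\n120.00\n" * 50

# A greedy heuristic (a stand-in for XQueC's workload-driven choice)
# would merge the pair of containers with the largest joint gain.
gains = {"names": joint_gain(names_a, names_b),
         "mixed": joint_gain(names_a, prices)}
```

In the actual system the choice must also account for which operations each algorithm supports in the compressed domain, since that determines how much query evaluation can avoid decompression.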
How much is involved in DB publishing?
XML has been intensively investigated lately, accompanied by the claim that "XML is (or has become) the standard format for data publishing", especially in the database area. That is, there is an assumption that newly published data mostly take the form of XML documents, particularly when databases are involved. This presumption seems to be the reason for the heavy investment in researching the handling, querying, and compression of XML documents.

We check these assumptions by investigating the documents accessible on the Internet, possibly going beneath the surface into the "deep Web". The investigation involves analyzing large scientific databases, but commercial data stored in the "deep Web" are handled as well. We used the technique of randomly generated IP addresses to investigate the "deep Web", i.e. the part of the Internet not indexed by the search engines. For the part of the Web that is indexed by the large search engines, we used the random-walk technique to collect uniformly distributed samples.

We found that XML has not (yet) become the standard of Web publishing, but it is strongly represented on the Web. We add a simple new evaluation method to the known uniform-sampling processes. These investigations can be repeated in the future in order to obtain a dynamic picture of the growth rate of the number of XML documents present on the Web.
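The deep-Web part of the sampling can be sketched as drawing IPv4 addresses uniformly at random. This is only the sampling step under that assumption; the survey's subsequent probing of each address for a web server and the classification of served documents are omitted.

```python
import random
import socket
import struct

def random_ipv4():
    """Draw one IPv4 address uniformly from the 32-bit address space,
    the sampling step behind the deep-Web portion of the survey."""
    return socket.inet_ntoa(struct.pack("!I", random.getrandbits(32)))

# In the survey, each sampled address would then be probed for a web
# server and any served documents classified (XML vs. other formats).
sample = [random_ipv4() for _ in range(5)]
```

Because every 32-bit address is equally likely, the fraction of probed hosts serving XML is an unbiased estimate for the address space as a whole, which is what makes repeating the measurement over time meaningful.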