Optimizing XML Compression
The eXtensible Markup Language (XML) provides a powerful and flexible means
of encoding and exchanging data. As it turns out, its main advantage as an
encoding format (namely, the requirement that all opening and closing markup
tags be present and properly balanced) also yields one of its main
disadvantages: verbosity. XML-conscious compression techniques seek to
overcome this drawback.
Many of these techniques first separate XML structure from the document
content, and then compress each independently. Further compression gains can be
realized by identifying and compressing together document content that is
highly similar, thereby amortizing the storage costs of auxiliary information
required by the chosen compression algorithm. Additionally, the proper choice
of compression algorithm is an important factor not only for the achievable
compression gain, but also for access performance. Hence, choosing a
compression configuration that optimizes compression gain requires one to
determine (1) a partitioning strategy for document content, and (2) the best
available compression algorithm to apply to each set within this partition. In
this paper, we show that finding an optimal compression configuration with
respect to compression gain is an NP-hard optimization problem. This problem
remains intractable even if one considers a single compression algorithm for
all content. We also describe an approximation algorithm, based on the
branch-and-bound paradigm, for selecting a partitioning strategy for document
content.
Comment: 16 pages, extended version of paper accepted for XSym 200
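The structure/content separation step these techniques share can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a well-formed document parseable by the standard library, and uses zlib as a stand-in for whichever compressor the configuration assigns to each stream.

```python
import zlib
import xml.etree.ElementTree as ET

def separate(xml_text):
    """Split a document into a structure stream (the balanced tags)
    and a content stream (the text values), so that each stream can
    be compressed independently."""
    structure, content = [], []
    def walk(elem):
        structure.append("<%s>" % elem.tag)
        if elem.text and elem.text.strip():
            content.append(elem.text.strip())
        for child in elem:
            walk(child)
        structure.append("</%s>" % elem.tag)
    walk(ET.fromstring(xml_text))
    return "\n".join(structure), "\n".join(content)

doc = "<books><book><title>XML</title><price>10</price></book></books>"
structure, content = separate(doc)
# Each stream now compresses independently; grouping similar content
# (e.g. all <title> values together) amortizes the compressor's
# per-stream overhead, which is the partitioning problem studied here.
packed = [zlib.compress(s.encode()) for s in (structure, content)]
```

The hard part the paper addresses is not this separation but what follows it: partitioning the content stream and assigning a compression algorithm to each part.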
Optimizing XML Compression in XQueC
We present our approach to the problem of optimizing compression choices in the context of the XQueC compressed XML database system. In XQueC, data items are aggregated into containers, which are further grouped to be compressed together. This way, XQueC is able to exploit data commonalities and to perform query evaluation in the compressed domain, with the aim of improving both compression and querying performance. However, different compression
algorithms have different performance and support different sets of operations in the compressed domain. Therefore, choosing how to group containers and which compression algorithm to apply to each group is a challenging issue. We address this problem through an appropriate cost model and a suitable blend of heuristics which, based on a given query workload, are capable of driving
appropriate compression choices.
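The container-grouping decision can be illustrated with a toy gain measure. This is a deliberate simplification, not XQueC's cost model: zlib stands in for the candidate compressors, the container data are invented, and the workload-awareness is omitted.

```python
import zlib

def joint_gain(a, b):
    """Bytes saved by compressing two containers together rather than
    separately; a positive gain signals exploitable commonalities."""
    apart = len(zlib.compress(a)) + len(zlib.compress(b))
    together = len(zlib.compress(a + b))
    return apart - together

# Hypothetical containers: two columns of similar names, one of prices.
names_a = b"Alice Smith\nAlan Smithee\n" * 50
names_b = b"Alice Jones\nAlan Johnson\n" * 50
prices = b"19.99\n4.50\n120.00\n" * 50

# A greedy heuristic (a stand-in for XQueC's workload-driven choice)
# would merge the pair of containers with the largest joint gain.
gains = {"names": joint_gain(names_a, names_b),
         "mixed": joint_gain(names_a, prices)}
```

In the actual system the choice must also account for which operations each algorithm supports in the compressed domain, since that determines how much query evaluation can avoid decompression.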
How much is involved in DB publishing?
XML has been intensively investigated lately, accompanied by the claim that "XML is (or has become) the standard format for data publishing", especially in the database area. That is, there is an assumption that newly published data mostly take the form of XML documents, particularly when databases are involved. This presumption seems to be the reason for the heavy investment in researching the handling, querying, and compression of XML documents.

We check these assumptions by investigating the documents accessible on the Internet, possibly going beneath the surface into the "deep Web". The investigation involves analyzing large scientific databases, but commercial data stored in the "deep Web" are handled as well. We used the technique of randomly generated IP addresses to investigate the "deep Web", i.e. the part of the Internet not indexed by the search engines. For the part of the Web that is indexed by the large search engines, we used the random-walk technique to collect uniformly distributed samples.

We found that XML has not (yet) become the standard of Web publishing, but it is strongly represented on the Web. We add a simple new evaluation method to the known uniform-sampling processes. These investigations can be repeated in the future in order to obtain a dynamic picture of the growth rate of the number of XML documents present on the Web.
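The deep-Web part of the sampling can be sketched as drawing IPv4 addresses uniformly at random. This is only the sampling step under that assumption; the survey's subsequent probing of each address for a web server and the classification of served documents are omitted.

```python
import random
import socket
import struct

def random_ipv4():
    """Draw one IPv4 address uniformly from the 32-bit address space,
    the sampling step behind the deep-Web portion of the survey."""
    return socket.inet_ntoa(struct.pack("!I", random.getrandbits(32)))

# In the survey, each sampled address would then be probed for a web
# server and any served documents classified (XML vs. other formats).
sample = [random_ipv4() for _ in range(5)]
```

Because every 32-bit address is equally likely, the fraction of probed hosts serving XML is an unbiased estimate for the address space as a whole, which is what makes repeating the measurement over time meaningful.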