
Space efficient in-memory representation of XML documents

By O'Neil Davion Delpratt

Abstract

Extensible Markup Language (XML) is a multi-purpose text-based format, used for storage, transmission and manipulation of data. XML documents are often held in main memory and processed via standard interfaces such as the Document Object Model (DOM). However, XML is inherently verbose, and the in-memory representation of XML documents by existing DOM implementations is up to ten times larger than the file size. This is a problem for machines with limited memory, such as mobile devices, where processing even moderately sized XML documents requires more memory than is available. We focus on in-memory representations of XML documents for situations where space is limited and where rapid processing time is important. We propose a compact representation of XML documents that uses succinct or highly space-efficient data structures, and that allows XML processing to be executed efficiently.

Succinct data structures use space that approaches the information-theoretic lower bound on the space that is required to represent the data, and support operations upon the representation in constant time. In the context of XML documents, we study and improve succinct representations for ordinal trees by adding features that make them more suitable for use in XML documents. We explore fast and space-efficient representations of the textual data of XML documents. Our basic approach is to concatenate all the textual data in the XML document into a single string, and extract individual textual values by computing the appropriate substring of the concatenated string. Computing the substring requires us to store offsets into the text. The storage of the offsets is surprisingly expensive if they are stored naively (as 32- or 64-bit integer values).

We give a succinct representation and provide data-aware representations (adapted from work on inverted indices in information retrieval), and show their close connection.

We describe Succinct DOM (SDOM), which is a DOM implementation that has low, stable and predictable memory usage. We show, via an experimental evaluation, that SDOM is extremely fast. A variant, SDOM-CT, applies BZip-based compression to textual and attribute data, and its space usage is comparable with that of “query-friendly” XML compressors. Some of these compressors support navigation and/or querying (e.g. subpath queries) of the compressed file. SDOM-CT does not support querying directly, but remains extremely fast: it is several orders of magnitude faster for navigation than query-friendly XML compressors that support navigation (and only a few times slower than popular DOM implementations such as the Apache Foundation’s Xerces-C).
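
As an illustration of the text-storage idea in the abstract (concatenate all textual values and recover each one as a substring between stored offsets), the following is a minimal Python sketch. It is not the thesis's implementation: the function names are hypothetical, and the variable-byte gap coding is a generic technique from inverted-index compression, used here only to suggest why a data-aware encoding of offsets can be much smaller than fixed 32- or 64-bit integers.

```python
from itertools import accumulate


def build_text_store(values):
    """Concatenate text values and record the start offset of each one.

    Returns (blob, offsets); offsets has len(values) + 1 entries so that
    value i is blob[offsets[i]:offsets[i + 1]].
    """
    blob = "".join(values)
    offsets = [0] + list(accumulate(len(v) for v in values))
    return blob, offsets


def get_text(blob, offsets, i):
    """Extract the i-th text value as a substring of the concatenation."""
    return blob[offsets[i]:offsets[i + 1]]


def vbyte_encode_gaps(offsets):
    """Encode the gaps between consecutive offsets with variable-byte codes.

    Each gap is split into 7-bit groups, low-order group first; the high
    bit is set on the final byte of each gap. (Illustrative assumption,
    not the encoding used in the thesis.)
    """
    out = bytearray()
    for prev, cur in zip(offsets, offsets[1:]):
        gap = cur - prev
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)
    return bytes(out)


def vbyte_decode_offsets(data):
    """Rebuild the full offset list from variable-byte coded gaps."""
    offsets = [0]
    gap, shift = 0, 0
    for b in data:
        if b & 0x80:  # final byte of this gap
            gap |= (b & 0x7F) << shift
            offsets.append(offsets[-1] + gap)
            gap, shift = 0, 0
        else:
            gap |= b << shift
            shift += 7
    return offsets


if __name__ == "__main__":
    blob, offsets = build_text_store(["Succinct", "DOM", "XML"])
    encoded = vbyte_encode_gaps(offsets)
    print(get_text(blob, offsets, 1))
    print(len(encoded), "bytes for gaps vs", 4 * len(offsets), "bytes as 32-bit offsets")
```

Note that this sketch decodes the gaps sequentially, so it trades access speed for space; the representations described in the abstract are intended to keep access fast, which this simplified version does not attempt.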

Publisher: University of Leicester
Year: 2009
OAI identifier: oai:lra.le.ac.uk:2381/4805

