19,486 research outputs found

The Importance of Sibling Clustering for Efficient Bulkload of XML Document Trees

Author: Kanne Carl-Christian
Moerkotte Guido
Publication date: 01/01/2005

In an XML Data Store (XDS), importing documents from external sources is a very frequent operation. Since a document import consists of a large number of individual node inserts, it is essentially a small bulkload operation. Hence, efficient bulkload support is crucial for XDSs. Essentially, XML bulkload is the transformation of an XML parser's output into the XDS's persistent storage structures. This involves two major subtasks: (1) Partitioning the documents' logical tree structure into subtrees smaller than a disk page in a way that is both space-efficient and suitable for later processing. (2) Mapping the subtrees to the XDS's internal page representation. In enterprise-scale environments with very large documents and/or very many parallel bulkloads, task (1) is particularly challenging, as not only disk space consumption, but also CPU and main-memory usage are important factors. In this article, we (1) discuss requirements for an XML bulkload module, (2) examine existing algorithms for tree partitioning with respect to their applicability as XML bulkload algorithms, (3) derive a new tree partitioning algorithm, (4) present the design and implementation of the bulkload module used in our Natix XDS, and (5) evaluate the implementation.
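To make subtask (1) concrete, here is a minimal sketch of a simple greedy bottom-up partitioning heuristic. It is not the algorithm derived in the article and it does not attempt sibling clustering; the `Node` class, the `partition` function, the node weights, and the capacity of 10 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: int                      # assumed storage cost of this node
    children: list = field(default_factory=list)


def partition(root, capacity):
    """Split a tree into connected parts whose total weight is <= capacity.

    Greedy bottom-up heuristic (an illustration, not the article's method):
    whenever a node plus the parts still attached to it is too heavy,
    the heaviest attached child part is cut off as its own partition.
    """
    partitions = []

    def visit(node):
        if node.weight > capacity:
            raise ValueError(f"node {node.name!r} does not fit on a page")
        # Each entry is (weight, names) of the part still attached to a child.
        parts = sorted((visit(c) for c in node.children), key=lambda p: p[0])
        total = node.weight + sum(w for w, _ in parts)
        while total > capacity:
            w, names = parts.pop()   # cut off the heaviest remaining part
            partitions.append(names)
            total -= w
        names = [node.name]
        for _, child_names in parts:
            names.extend(child_names)
        return total, names

    _, root_names = visit(root)
    partitions.append(root_names)    # whatever stays attached to the root
    return partitions


tree = Node("root", 2, [
    Node("a", 3, [Node("a1", 4), Node("a2", 4)]),
    Node("b", 5),
])
pages = partition(tree, capacity=10)
```

In this toy input the subtree under `a` weighs 11, so one of its children is split off first; the remaining part of `a` is then cut from the root for the same reason.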

CiteSeerX

MAnnheim DOCument Server

Path Queries on Compressed XML

Author: Abiteboul
Ailamaki
Batory
Bryant
Buneman
Burch
Chan
Deutsch
Fernandez
Florescu
Frick
Goldman
Gottlob
Liefke
McMillan
Milo
Neumüller
Shanmugasundaram
Tolani
Ziv
Publication date: 01/01/2003

Central to any XML query language is a path language such as XPath which operates on the tree structure of the XML document. We demonstrate in this paper that the tree structure can be effectively compressed and manipulated using techniques derived from symbolic model checking. Specifically, we show first that succinct representations of document tree structures based on sharing subtrees are highly effective. Second, we show that compressed structures can be queried directly and efficiently through a process of manipulating selections of nodes and partial decompression.
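The idea of sharing subtrees can be illustrated with a short sketch that stores each distinct subtree only once. This shows the general technique only, not the paper's actual representation; `Node`, `share_subtrees`, and the sample document are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def share_subtrees(root):
    """Return (root_id, table); table[i] is (tag, tuple of child ids).

    Structurally identical subtrees receive the same id, so the table
    describes a DAG in which every distinct subtree appears exactly once.
    """
    ids = {}     # (tag, child ids) -> id
    table = []   # id -> (tag, child ids)

    def visit(node):
        key = (node.tag, tuple(visit(c) for c in node.children))
        if key not in ids:
            ids[key] = len(table)
            table.append(key)
        return ids[key]

    return visit(root), table


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)


def book():
    return Node("book", [Node("title"), Node("author")])


doc = Node("bib", [book(), book(), book()])
root_id, table = share_subtrees(doc)
```

For this sample the tree has 10 nodes while the shared table has 4 entries, because the three `book` subtrees collapse into one.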

CiteSeerX

Crossref

Edinburgh Research Explorer

Tree Compression with Top Trees Revisited

Author: F Wang
G Busatto
JI Munro
M Charikar
M Hirakawa
M Lohrey
NJ Larsson
P Ferragina
PJ Downey
S Alstrup
S Gog
S Maneth
S Maruyama
Publication date: 01/01/2015

We revisit tree compression with top trees (Bille et al, ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012). Comment: SEA 201

arXiv.org e-Print Archive

Crossref

KITopen

Leicester Research Archive

XML documents clustering using a tensor space model

Author: Kutty Sangeetha
Li Yuefeng
Nayak Richi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information

CiteSeerX

Queensland University of Technology ePrints Archive

Designing a resource-efficient data structure for mobile data systems

Author: Gourlay Richard Scott
Publication date: 01/07/2006

Designing data structures for use in mobile devices requires attention to optimising data volumes, with associated benefits for data transmission, storage space and battery use. For semi-structured data, tree summarisation techniques can be used to reduce the volume of structured elements while dictionary compression can efficiently deal with value-based predicates. This project seeks to investigate and evaluate an integration of the two approaches. The key strength of this technique is that both structural and value predicates could be resolved within one graph while further allowing for compression of the resulting data structure. As the current trend is towards working with larger semi-structured data sets, this work would allow for the utilisation of much larger data sets whilst reducing requirements on bandwidth and minimising the memory necessary both for the storage and querying of the data.

University of Strathclyde Institutional Repository

Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective

Author: Luo Yuan
Szolovits Peter
Publication date: 14/11/2018

This paper presents a Lisp architecture for a portable NLP system, termed LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard, customized and in-house developed NLP tools. Our system facilitates portability across different institutions and data systems by incorporating an enriched Common Data Model (CDM) to standardize necessary data elements. It utilizes UMLS to perform domain adaptation when integrating generic domain NLP tools. It also features stand-off annotations that are specified by positional reference to the original document. We built an interval tree based search engine to efficiently query and retrieve the stand-off annotations by specifying positional requirements. We also developed a utility to convert an inline annotation format to stand-off annotations to enable the reuse of clinical text datasets with inline annotations. We experimented with our system on several NLP facilitated tasks including computational phenotyping for lymphoma patients and semantic relation extraction for clinical notes. These experiments showcased the broader applicability and utility of LAPNLP. Comment: 6 pages, accepted by IEEE BIBM 2018 as regular paper
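A positional query over stand-off annotations can be sketched with a small static interval index. This is an illustrative Python sketch, not LAPNLP's Lisp implementation; the `Annotation` and `IntervalIndex` names, the half-open offset convention, and the sample note are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # inclusive character offset into the source text
    end: int     # exclusive character offset
    label: str


class IntervalIndex:
    """Implicit balanced tree over annotations sorted by start offset.

    max_end[i] holds the largest end offset in the subtree rooted at i,
    which lets a query skip subtrees that end before the query begins.
    """

    def __init__(self, annotations):
        self.items = sorted(annotations, key=lambda a: (a.start, a.end))
        self.max_end = [0] * len(self.items)
        self._build(0, len(self.items))

    def _build(self, lo, hi):
        if lo >= hi:
            return -1
        mid = (lo + hi) // 2
        best = max(self.items[mid].end,
                   self._build(lo, mid),
                   self._build(mid + 1, hi))
        self.max_end[mid] = best
        return best

    def overlapping(self, start, end):
        """Return annotations that overlap the half-open range [start, end)."""
        out = []
        self._query(0, len(self.items), start, end, out)
        return out

    def _query(self, lo, hi, start, end, out):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self.max_end[mid] <= start:
            return                     # everything here ends too early
        self._query(lo, mid, start, end, out)
        item = self.items[mid]
        if item.start < end and item.end > start:
            out.append(item)
        if item.start < end:           # later starts can only be further right
            self._query(mid + 1, hi, start, end, out)


note = "Patient denies fever but reports cough."
index = IntervalIndex([
    Annotation(15, 20, "symptom"),     # "fever"
    Annotation(33, 38, "symptom"),     # "cough"
    Annotation(8, 20, "negated"),      # "denies fever"
])
hits = index.overlapping(16, 18)
```

Because the annotations only carry offsets, the source text stays untouched and several annotation layers can refer to the same span.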

arXiv.org e-Print Archive

DSpace@MIT

Crossref

Querying XML data streams from wireless sensor networks: an evaluation of query engines

Author: Conroy Kenneth
Moyna Niall
O'Connor Martin F.
Roantree Mark
Smeaton Alan F.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/04/2009

As the deployment of wireless sensor networks increases and their application domain widens, the opportunity for effective use of XML filtering and streaming query engines is ever more present. XML filtering engines aim to provide efficient real-time querying of streaming XML encoded data. This paper provides a detailed analysis of several such engines, focusing on the technology involved, their capabilities, their support for XPath and their performance. Our experimental evaluation identifies which filtering engine is best suited to process a given query based on its properties. Such metrics are important in establishing the best approach to filtering XML streams on-the-fly.
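As a rough illustration of filtering an XML stream on the fly, the sketch below matches a simple absolute child path while parsing incrementally with Python's standard `xml.etree.ElementTree.iterparse`. It is not one of the engines evaluated in the paper and covers only a tiny fragment of XPath; the `stream_filter` function and the sensor document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_filter(source, path):
    """Yield the text of every element whose tag path from the root equals `path`.

    Only plain absolute paths such as /a/b/c are handled: no predicates,
    wildcards, or descendant axes.
    """
    target = path.strip("/").split("/")
    stack = []                                   # tags of currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()                         # discard content already processed


data = (b"<readings>"
        b"<sensor id='1'><temp>21.5</temp></sensor>"
        b"<sensor id='2'><temp>19.0</temp><hum>40</hum></sensor>"
        b"</readings>")
temps = list(stream_filter(io.BytesIO(data), "/readings/sensor/temp"))
```

The stack of open tags is the only state kept between events, which is what lets this style of filter work on input that never fits in memory as a whole.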

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

AMaχoS—Abstract Machine for Xcerpt

Author: Bry François
Furche Tim
Linse Benedikt
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Web query languages promise convenient and efficient access to Web data such as XML, RDF, or Topic Maps. Xcerpt is one such Web query language with strong emphasis on novel high-level constructs for effective and convenient query authoring, particularly tailored to versatile access to data in different Web formats such as XML or RDF. However, so far it lacks an efficient implementation to supplement the convenient language features. AMaχoS is an abstract machine implementation for Xcerpt that aims at efficiency and ease of deployment. It strictly separates compilation and execution of queries: Queries are compiled once to abstract machine code that consists in (1) a code segment with instructions for evaluating each rule and (2) a hint segment that provides the abstract machine with optimization hints derived by the query compilation. This article summarizes the motivation and principles behind AMaχoS and discusses how its current architecture realizes these principles

Open Access LMU