Search CORE

21,232 research outputs found

Investigation into Indexing XML Data Techniques

Author: Joan Lu
Klaib Alhadi
Publication venue
Publication date: 21/07/2014
Field of study

The rapid development of XML technology improves the WWW, since the XML data has many advantages and has become a common technology for transferring data cross the internet. Therefore, the objective of this research is to investigate and study the XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers most common XML indexing techniques and performs a comparison between them. Subsequently, this work makes an argument to find out these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes. So, all the indexes become large in order to perform well, and none of them is suitable for all users’ requirements. However, each one of these techniques has some advantages in somehow

University of Huddersfield Repository

Recommended from our members

Okapi-based XML indexing

Author: Lu W.
MacFarlane A.
Venuti F.
Publication venue: 'Emerald'
Publication date: 18/09/2009
Field of study

Purpose – Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing. Clear use cases for structured search in XML have been established. However, most of the research in the area is either based on relational database systems or specialized semi‐structured data management systems. This paper aims to propose a method for XML indexing based on the information retrieval (IR) system Okapi. Design/methodology/approach – First, the paper reviews the structure of inverted files and gives an overview of the issues of why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. Then the paper explores a revised method implemented on Okapi using path indexing structures. The paper evaluates these index structures through the metrics of indexing run time, path search run time and space costs using the INEX and Reuters RVC1 collections. Findings – Initial results on the INEX collections show that there is a substantial overhead in space costs for the method, but this increase does not affect run time adversely. Indexing results on differing sized Reuters RVC1 sub‐collections show that the increase in space costs with increasing the size of a collection is significant, but in terms of run time the increase is linear. Path search results show sub‐millisecond run times, demonstrating minimal overhead for XML search. Practical implications – Overall, the results show the method implemented to support XML search in a traditional IR system such as Okapi is viable. Originality/value – The paper provides useful information on a method for XML indexing based on the IR system Okapi

City Research Online

Crossref

Indexing XML Documents

Author: Šestáková Eliška
Publication venue: Czech Technical University in Prague. Computing and Information Centre.
Publication date: 09/06/2015
Field of study

Výzkum v oblasti indexování řetězců má již mnoho prezentovaných výsledků, což však neplatí pro ostatní datové struktury, jakými jsou například stromy. Tato práce obsahuje v prvé řadě shrnutí metod pro indexování řetězců a stromů. Dále se podrobně zabývá rešerší existujících řešení indexování XML dokumentů. Představena je zde nová jednoduchá metoda využívající deterministický konečný automat, jež umožňuje efektivně zpracovat XPath dotazy skládající se z libovolné kombinace child (/) a descendant-or-self (//) os, sloužících k navigaci v XML dokumentu. Spolu s touto metodou byly dále navrženy dva další konečné automaty na podporu jednodušších dotazů obsahujících vždy pouze jednu z uvedených os. Ke konstrukci indexu pro daný XML dokument D s n elementy je využit odpovídající XML stromový model T. Zpracování dotazu Q o m elementech proběhne v čase O(m) nezávislém na n. Výsledkem dotazu je poté množina elementů splňujících dané požadavky. Ačkoli automat podporující všechny dotazy s // osou indexuje až O(2^n) různých dotazů, počet stavů vlastního deterministického automatu je O(h^k), kde h je výška XML stromového modelu T a k je počet listů T. Pro běžné XML dokumenty lze navíc tuto mez triviálně snížit až na O(h.2^k).The theory of text indexing is very well-researched, which does not hold for theories of indexing other data structures, such as trees for example. In this thesis we review existing techniques for indexing texts and trees and study state-of-the-art methods for indexing XML documents. We show that automata can be used effectively for the purpose of indexing XML documents. A new and simple method for indexing XML documents using deterministic finite automaton is introduced. The presented method supports a significant fragment of XPath queries which may use any combination of child (/) and descendant-or-self (//) axes. We also propose another two indexing techniques based on finite automata, aimed to assist in evaluating paths queries with either / or // axis only. Given a subject XML document D and its corresponding XML tree model T with n nodes, the tree is preprocessed and the index is constructed. The searching phase uses the index, reads an input query Q of size m and computes the list of positions of all occurrences of target nodes of Q in T. All the proposed automata performed the searching in time O(m) and do not depend on n. Although the automaton that supports all linear XPath queries where just // axis is used evaluates O(2^n) distinct queries, number of states of the deterministic automaton is O(h^k), where h is the height of T and k is the number of its leaf nodes. Moreover, we discuss that in case of indexing a common XML document the number of state in the deterministic finite automaton is at most O(h.2^k)

Digital Library of the Czech Technical University in Prague

Non-hierarchical Structures: How to Model and Index Overlaps?

Author: Bratsberg Svein Erik
Hasibi Faegheh
Publication venue
Publication date: 08/10/2016
Field of study

Overlap is a common phenomenon seen when structural components of a digital object are neither disjoint nor nested inside each other. Overlapping components resist reduction to a structural hierarchy, and tree-based indexing and query processing techniques cannot be used for them. Our solution to this data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a novel extension of the XML data model for non-hierarchical structures. We introduce an algorithm for constructing TGSA from annotated documents; the algorithm can efficiently process non-hierarchical structures and is associated with formal proofs, ensuring that transformation of the document to the data model is valid. To enable high performance query analysis in large data repositories, we further introduce an extension of XML pre-post indexing for non-hierarchical structures, which can process both reachability and overlapping relationships.Comment: The paper has been accepted at the Balisage 2014 conferenc

arXiv.org e-Print Archive

CiteSeerX

Configurable indexing and ranking for XML information retrieval

Author: Qinghua Zou
Shaorong Liu
Wesley W. Chu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2004
Field of study

Indexing and ranking are two key factors for efficient and effective XML information retrieval. Inappropriate indexing may result in false negatives and false positives, and improper ranking may lead to low precisions. In this paper, we propose a configurable XML information retrieval system, in which users can configure appropriate index types for XML tags and text contents. Based on users ’ index configurations, the system transforms XML structures into a compact tree representation, Ctree, and indexes XML text contents. To support XML ranking, we propose the concepts of “weighted term frequency ” and “inverted element frequency, ” where the weight of a term depends on its frequency and location within an XML element as well as its popularity among similar elements in an XML dataset. We evaluate the effectiveness of our system through extensive experiments on the INEX 03 dataset and 30 content and structure (CAS) topics. The experimental results reveal that our system has significantly high precision at low recall regions and achieves the highest average precision (0.3309) as compared with 38 official INEX 03 submissions using the strict evaluation metric

CiteSeerX

Crossref

Content-Aware DataGuides for Indexing Large Collections of XML Documents

Author: Bry François
Meuss Holger
Schulz Klaus U.
Weigel Felix
Publication venue
Publication date: 01/01/2003
Field of study

XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the wellknown DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various sorts of query and document, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents

CiteSeerX

Open Access LMU

Indexing Temporal XML documents

Author: A MENDELZON
A VAISMAN
F RIZZOLO
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007
Field of study

Crossref