Investigation into Indexing XML Data Techniques
The rapid development of XML technology has improved the WWW, since XML data has many advantages and has become a common technology for transferring data across the Internet. The objective of this research is therefore to investigate XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers the most common XML indexing techniques, compares them, and argues from that comparison to identify their limitations. To conclude, the main problem of all XML indexing techniques is the trade-off between the size and the efficiency of the indexes: all the indexes become large in order to perform well, and none of them suits all users' requirements. However, each of these techniques has advantages in some respects.
Term-Specific Eigenvector-Centrality in Multi-Relation Networks
Fuzzy matching and ranking are two information retrieval techniques widely used in web search. Their application to structured data, however, remains an open problem. This article investigates how eigenvector-centrality can be used for approximate matching in multi-relation graphs, that is, graphs where connections of many different types may exist. Based on an extension of the PageRank matrix, eigenvectors representing the distribution of a term after propagating term weights between related data items are computed. The result is an index which takes the document structure into account and can be used with standard document retrieval techniques. As the scheme takes the shape of an index transformation, all necessary calculations are performed during index time
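The propagation idea in this abstract can be illustrated with a minimal sketch. It is not the article's actual matrix extension: it assumes a plain personalised-PageRank power iteration in which one term's normalised weights serve as both the start vector and the teleport vector, and in which each relation type carries a fixed weight. The function name `term_centrality`, the edge-triple input format, the per-relation weights, and the damping value are all assumptions made for illustration.

```python
from collections import defaultdict


def term_centrality(edges, relation_weight, term_weight,
                    damping=0.85, iterations=100):
    """Propagate one term's weights over a typed graph by power iteration.

    edges           -- list of (source, target, relation) triples (assumed format)
    relation_weight -- dict mapping relation name to a non-negative weight (assumed)
    term_weight     -- dict mapping node to the term's initial weight in that node
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)} | set(term_weight))
    total = sum(term_weight.values())
    if total <= 0:
        raise ValueError("the term must occur in at least one node")
    # Normalised term distribution: start vector and teleport vector.
    base = {n: term_weight.get(n, 0.0) / total for n in nodes}

    # Sum relation weights per (source, target) pair, then row-normalise
    # so that every node passes on its full score.
    out = defaultdict(dict)
    for s, t, rel in edges:
        out[s][t] = out[s].get(t, 0.0) + relation_weight[rel]
    for s in list(out):
        z = sum(out[s].values())
        out[s] = {t: w / z for t, w in out[s].items()} if z > 0 else {}

    score = dict(base)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * base[n] for n in nodes}
        for s in nodes:
            targets = out.get(s)
            if targets:
                for t, w in targets.items():
                    nxt[t] += damping * score[s] * w
            else:
                # Dangling node: hand its score back to the term distribution.
                for n in nodes:
                    nxt[n] += damping * score[s] * base[n]
        score = nxt
    return score
```

Running this once per term at index time would yield one score vector per term, which could then stand in for raw term weights in a conventional inverted index. That matches the "index transformation" framing of the abstract only loosely.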
A survey on tree matching and XML retrieval
With the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether in an implicit or explicit way. Although retrieving XML documents can be considered as a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed by graph theory. In this paper, we aim at studying the theoretical approaches proposed in the literature for tree matching and at seeing how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study will allow us to highlight theoretical aspects of graph theory that have not yet been explored in XML retrieval.
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual properties of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.
Creating Structured PDF Files Using XML Templates
This paper describes a tool for recombining the logical structure from an XML document with the typeset appearance of the corresponding PDF document. The tool uses the XML representation as a template for the insertion of the logical structure into the existing PDF document, thereby creating a Structured/Tagged PDF. The addition of logical structure adds value to the PDF in three ways: the accessibility is improved (PDF screen readers for visually impaired users perform better), media options are enhanced (the ability to reflow PDF documents, using structure as a guide, makes PDF viable for use on hand-held devices) and the re-usability of the PDF documents benefits greatly from the presence of an XML-like structure tree to guide the process of text retrieval in reading order (e.g. when interfacing to XML applications and databases).
Development of Use Cases, Part I
For determining requirements and constructs appropriate for a Web query language, or in fact any language, use cases are of the essence. The W3C has published two sets of use cases for XML and RDF query languages. In this article, solutions for these use cases are presented using Xcerpt, a novel Web and Semantic Web query language that combines access to standard Web data such as XML documents with access to Semantic Web metadata such as RDF resource descriptions, together with reasoning abilities and rules familiar from logic programming. To the best knowledge of the authors, this is the first in-depth study of how to solve use cases for accessing XML and RDF in a single language. Integrated access to data and metadata has been recognized by industry and academia as one of the key challenges in data processing for the next decade. This article is a contribution towards addressing this challenge by demonstrating, along practical and recognized use cases, the usefulness of reasoning abilities, rules, and semistructured query languages for accessing both data (XML) and metadata (RDF).
Measuring the similarity of PML documents with RFID-based sensors
The Electronic Product Code (EPC) Network is an important part of the Internet of Things. The Physical Mark-Up Language (PML) is used to represent and describe data related to objects in the EPC Network. The PML documents that each component of an EPC Network system uses to exchange data are XML documents based on the PML Core schema. To manage the huge number of PML documents for tags captured by radio frequency identification (RFID) readers, high-performance technology is needed, for example for filtering and integrating the tag data. In this paper, we propose an approach for measuring the similarity of PML documents based on a Bayesian network of several sensors. With respect to the features of PML, when measuring similarity we first remove redundant data other than the EPC information. On this basis, a Bayesian network model derived from the structure of the PML documents being compared is constructed.
Comment: International Journal of Ad Hoc and Ubiquitous Computing
Structurally Tractable Uncertain Data
Many data management applications must deal with data which is uncertain,
incomplete, or noisy. However, on existing uncertain data representations, we
cannot tractably perform the important query evaluation tasks of determining
query possibility, certainty, or probability: these problems are hard on
arbitrary uncertain input instances. We thus ask whether we could restrict the
structure of uncertain data so as to guarantee the tractability of exact query
evaluation. We present our tractability results for tree and tree-like
uncertain data, and a vision for probabilistic rule reasoning. We also study
uncertainty about order, proposing a suitable representation, and study
uncertain data conditioned by additional observations.
Comment: 11 pages, 1 figure, 1 table. To appear in SIGMOD/PODS PhD Symposium 201
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.