XML Matchers: approaches and challenges
Schema Matching, i.e. the process of discovering semantic correspondences
between concepts adopted in different data source schemas, has been a key topic
in the Database and Artificial Intelligence research areas for many years. In the
past, it was investigated mainly for classical database models
(e.g., E/R schemas, relational databases, etc.). However, in recent years,
the widespread adoption of XML across disparate application fields has pushed
a growing number of researchers to design XML-specific Schema Matching
approaches, called XML Matchers, aiming at finding semantic matchings between
concepts defined in DTDs and XSDs. XML Matchers do not just take well-known
techniques originally designed for other data models and apply them to
DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical
structure of a DTD/XSD) to improve the performance of the Schema Matching
process. The design of XML Matchers is currently a well-established research
area. The main goal of this paper is to provide a detailed description and
classification of XML Matchers. We first describe to what extent the
specificities of DTDs/XSDs impact the Schema Matching task. Then we
introduce a template, called XML Matcher Template, that describes the main
components of an XML Matcher, their role and behavior. We illustrate how each
of these components has been implemented in some popular XML Matchers. We
consider our XML Matcher Template as the baseline for objectively comparing
approaches that, at first glance, might appear unrelated. The introduction
of this template can be useful in the design of future XML Matchers. Finally,
we analyze commercial tools implementing XML Matchers and introduce two
challenging issues strictly related to this topic, namely XML source clustering
and uncertainty management in XML Matchers.
Comment: 34 pages, 8 tables, 7 figures
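As a concrete, if greatly simplified, illustration of the kind of hybrid matcher surveyed here, the sketch below combines element-name similarity with similarity of the hierarchical label paths. It is not the paper's XML Matcher Template: all names are hypothetical, and for self-containment it works on small XML instance fragments rather than actual DTDs/XSDs.

# Toy hybrid matcher: name similarity + label-path similarity (illustrative only).
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def element_paths(xml_text):
    """Map each element's root-to-node label path to its tag name."""
    root = ET.fromstring(xml_text)
    paths = {}

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] = node.tag
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def name_similarity(a, b):
    """Linguistic component: string similarity of the element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structural_similarity(path_a, path_b):
    """Structural component: similarity of the ancestor label paths."""
    return SequenceMatcher(None, path_a.lower(), path_b.lower()).ratio()


def match_schemas(xml_a, xml_b, threshold=0.6):
    """Combine both scores with equal weight; keep pairs above the threshold."""
    paths_a, paths_b = element_paths(xml_a), element_paths(xml_b)
    matches = []
    for pa, ta in paths_a.items():
        for pb, tb in paths_b.items():
            score = 0.5 * name_similarity(ta, tb) + 0.5 * structural_similarity(pa, pb)
            if score >= threshold:
                matches.append((pa, pb, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])


if __name__ == "__main__":
    a = "<order><customer><name/></customer><item><price/></item></order>"
    b = "<purchase><client><fullName/></client><item><cost/></item></purchase>"
    for pa, pb, s in match_schemas(a, b):
        print(f"{pa}  <->  {pb}  ({s})")

A full XML Matcher would parse the DTD/XSD itself and typically folds in further evidence, such as thesauri and constraint information, before combining the scores.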
Designing visual analytics methods for massive collections of movement data
Exploration and analysis of large data sets cannot be carried out using purely visual means but require the involvement of database technologies, computerized data processing, and computational analysis methods. An appropriate combination of these technologies and methods with visualization may facilitate synergetic work of computer and human whereby the unique capabilities of each “partner” can be utilized. We suggest a systematic approach to defining what methods and techniques, and what ways of linking them, can appropriately support such a work. The main idea is that software tools prepare and visualize the data so that the human analyst can detect various types of patterns by looking at the visual displays. To facilitate the detection of patterns, we must understand what types of patterns may exist in the data (or, more exactly, in the underlying phenomenon). This study focuses on data describing movements of multiple discrete entities that change their positions in space while preserving their integrity and identity. We define the possible types of patterns in such movement data on the basis of an abstract model of the data as a mathematical function that maps entities and times onto spatial positions. Then, we look for data transformations, computations, and visualization techniques that can facilitate the detection of these types of patterns and are suitable for very large data sets – possibly too large for a computer's memory. Under such constraints, visualization is applied to data that have previously been aggregated and generalized by means of database operations and/or computational techniques
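As a rough sketch of the aggregate-then-visualize idea (hypothetical file path, column names, and bin sizes; not the authors' toolset), the code below makes a single streaming pass over movement records and counts them per space-time bin, so a display can be driven by the compact aggregates instead of raw data that may not fit in memory.

# Streaming aggregation of movement records (entity, time, x, y) into space-time bins.
import csv
from collections import Counter


def aggregate_movement(csv_path, cell_size=1000.0, time_bin=3600):
    """One pass over the data: count position records per (spatial cell, time bin)."""
    bins = Counter()
    with open(csv_path, newline="") as f:
        for rec in csv.DictReader(f):  # expects columns: entity, time, x, y
            cell = (int(float(rec["x"]) // cell_size),
                    int(float(rec["y"]) // cell_size))
            slot = int(rec["time"]) // time_bin
            bins[(cell, slot)] += 1
    # The result is small enough to visualize, e.g. as a density map per time slot.
    return bins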
Reasoning & Querying – State of the Art
Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF
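A minimal illustration of the idea behind keyword querying of semi-structured data (not any specific system covered by the survey): return the deepest XML elements whose subtrees contain all the keywords, so the user needs neither the schema nor a structured language such as XPath.

# Schema-free keyword search over an XML fragment (illustrative sketch only).
import xml.etree.ElementTree as ET


def keyword_query(xml_text, keywords):
    """Return the deepest elements whose subtree text contains every keyword."""
    root = ET.fromstring(xml_text)
    keywords = [k.lower() for k in keywords]
    hits = []
    for elem in root.iter():
        subtree_text = " ".join(elem.itertext()).lower()
        if all(k in subtree_text for k in keywords):
            hits.append(elem)
    # Keep only the deepest matches: drop an element if a descendant also matched.
    return [e for e in hits
            if not any(c in hits and c is not e for c in e.iter())]


if __name__ == "__main__":
    doc = ("<bib><book><title>Semantic Web Primer</title>"
           "<author>Antoniou</author></book>"
           "<book><title>XML Querying</title><author>Lee</author></book></bib>")
    for e in keyword_query(doc, ["semantic", "antoniou"]):
        print(e.tag)  # -> book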
DescribeX: A Framework for Exploring and Querying XML Web Collections
This thesis introduces DescribeX, a powerful framework that is capable of
describing arbitrarily complex XML summaries of web collections, providing
support for more efficient evaluation of XPath workloads. DescribeX permits the
declarative description of document structure using all axes and language
constructs in XPath, and generalizes many of the XML indexing and summarization
approaches in the literature. DescribeX supports the construction of
heterogeneous summaries where different document elements sharing a common
structure can be declaratively defined and refined by means of path regular
expressions on axes, or axis path regular expression (AxPREs). DescribeX can
significantly help in the understanding of both the structure of complex,
heterogeneous XML collections and the behaviour of XPath queries evaluated on
them.
Experimental results demonstrate the scalability of DescribeX summary
refinements and stabilizations (the key enablers for tailoring summaries) with
multi-gigabyte web collections. A comparative study suggests that using a
DescribeX summary created from a given workload can produce query evaluation
times orders of magnitude better than using existing summaries. DescribeX's
light-weight approach of combining summaries with a file-at-a-time XPath
processor can be a very competitive alternative, in terms of performance, to
conventional fully-fledged XML query engines that provide DB-like functionality
such as security, transaction processing, and native storage.
Comment: PhD thesis, University of Toronto, 2008, 163 pages
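For readers unfamiliar with structural summaries, the following is a deliberately simplified sketch of the general idea (a plain label-path summary with extent counts, far weaker than DescribeX's AxPRE-defined summaries); a summary-aware XPath evaluator can consult such a structure to skip parts of a collection that cannot contribute to a query.

# Label-path summary: partition elements by root-to-node path and count extents.
import xml.etree.ElementTree as ET
from collections import defaultdict


def path_summary(xml_text):
    """Group elements by their root-to-node label path; return path -> extent size."""
    root = ET.fromstring(xml_text)
    extents = defaultdict(int)

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        extents[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return dict(extents)


if __name__ == "__main__":
    doc = ("<articles><article><title/><author/><author/></article>"
           "<article><title/></article></articles>")
    for path, count in sorted(path_summary(doc).items()):
        print(f"{path}: {count}")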
Evolution of Wikipedia's Category Structure
Wikipedia, as a social phenomenon of collaborative knowledge creation, has
been studied extensively from various points of view. The category system of
Wikipedia, introduced in 2004, has attracted relatively little attention. In
this study, we focus on the documentation of knowledge, and the transformation
of this documentation with time. We take Wikipedia as a proxy for knowledge in
general and its category system as an aspect of the structure of this
knowledge. We investigate the evolution of the category structure of the
English Wikipedia from its birth in 2004 to 2008. We treat the category system
as if it were a hierarchical Knowledge Organization System, capturing the changes
in the distributions of the top categories. We investigate how the clustering
of articles, defined by the category system, matches the direct link network
between the articles and show how it changes over time. We find the Wikipedia
category network mostly stable, but with occasional reorganization. We show
that the clustering matches the link structure quite well, except for short
periods preceding the reorganizations.
Comment: Preprint of an article submitted for consideration in Advances in Complex Systems (2012), http://www.worldscinet.com/acs/, 19 pages, 7 figures
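One simple, hypothetical way to quantify how well a category-based clustering of articles matches the direct link network (not necessarily the measure used in the paper) is the fraction of links whose endpoints share at least one category:

# Agreement between a category assignment and a link network (hypothetical data).
def link_category_agreement(links, article_categories):
    """links: iterable of (article_a, article_b); article_categories: article -> set of categories."""
    total = matched = 0
    for a, b in links:
        total += 1
        if article_categories.get(a, set()) & article_categories.get(b, set()):
            matched += 1
    return matched / total if total else 0.0


if __name__ == "__main__":
    links = [("Cat", "Felidae"), ("Cat", "Dog"), ("Dog", "Canidae")]
    cats = {"Cat": {"Felines"}, "Felidae": {"Felines"},
            "Dog": {"Canines"}, "Canidae": {"Canines"}}
    print(link_category_agreement(links, cats))  # 2 of 3 links stay within a category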
Patterns of Learning Object Reuse in the Connexions Repository
Doctoral Dissertation abstract: Since the term learning object was first published, there has been either an explicit or implicit expectation of reuse. There has also been much speculation about why learning objects are, or are not, reused. This study quantitatively examined the actual amount and type of learning object use, including reuse, modification, and translation, within a single open educational resource repository, Connexions. The results indicate that about a quarter of used objects are subsequently reused, modified, or translated. While these results are repository specific, they represent an important first step in providing an empirical evaluation of the frequency of, and some reasons for, reuse, as well as establishing metrics and terminology for future studies
Ergatis: a web interface and scalable software system for bioinformatics workflows
Motivation: The growth of sequence data has been accompanied by an increasing need to analyze data on distributed computer clusters. The use of these systems for routine analysis requires scalable and robust software for the management of large datasets. Software is also needed to simplify data management and make large-scale bioinformatics analysis accessible and reproducible for a wide class of target users
Capture-based Automated Test Input Generation
Testing object-oriented software is critical because object-oriented languages have been commonly used in developing modern software systems. Many efficient test input generation techniques for object-oriented software have been proposed; however, state-of-the-art algorithms yield very low code coverage (e.g., less than 50%) on large-scale software. Therefore, one important and yet challenging problem is to generate desirable input objects for receivers and arguments that can achieve high code coverage (such as branch coverage) or help reveal bugs. Desirable objects help tests exercise the new parts of the code. However, generating desirable objects has been a significant challenge for automated test input generation tools, partly because the search space for such desirable objects is huge.
To address this significant challenge, we propose a novel approach called Capture-based Automated Test Input Generation for Object-Oriented Unit Testing (CAPTIG). The contributions of this research are the following.
First, CAPTIG enhances method-sequence generation techniques. Our approach introduces a set of new algorithms for guided input and method selection that increase code coverage. In addition, CAPTIG efficiently reduces the amount of generated input.
Second, CAPTIG captures objects dynamically from program execution during either system testing or real use. These captured inputs can support existing automated test input generation tools, such as the random testing tool Randoop, in achieving higher code coverage.
Third, CAPTIG statically analyzes observed branches that have not been covered and attempts to exercise them by mutating existing inputs, based on weakest-precondition analysis. This technique also contributes to achieving higher code coverage.
Fourth, CAPTIG can be used to reproduce software crashes based on a crash stack trace. This feature can considerably reduce the cost of analyzing and removing the causes of crashes.
In addition, each CAPTIG technique can be applied independently to complement existing testing techniques. We anticipate that our approach can achieve higher code coverage in less time and with fewer test inputs. To evaluate this new approach, we performed experiments with well-known, large-scale open-source software and found that our approach helps achieve higher code coverage in less time and with fewer test inputs
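As a hedged sketch of the general idea behind method-sequence generation (in the spirit of random testing tools such as Randoop rather than CAPTIG's own algorithms; the class under test and all parameters are hypothetical), the code below randomly extends sequences of calls on a small class and records which sequences complete and which raise exceptions:

# Random method-sequence generation over a toy class under test.
import random


class BoundedStack:  # hypothetical class under test
    def __init__(self, capacity):
        self.capacity, self.items = capacity, []

    def push(self, x):
        if len(self.items) >= self.capacity:
            raise OverflowError("stack full")
        self.items.append(x)

    def pop(self):
        return self.items.pop()


def generate_sequences(n_sequences=100, max_calls=5, seed=0):
    """Build random call sequences; separate those that finish from those that raise."""
    rng = random.Random(seed)
    passing, failing = [], []
    for _ in range(n_sequences):
        calls = []
        obj = BoundedStack(rng.randint(0, 3))
        try:
            for _ in range(rng.randint(1, max_calls)):
                if rng.random() < 0.7:
                    arg = rng.randint(-5, 5)
                    obj.push(arg)
                    calls.append(f"push({arg})")
                else:
                    obj.pop()
                    calls.append("pop()")
            passing.append(calls)
        except (OverflowError, IndexError):
            failing.append(calls)  # a sequence exposing a boundary behaviour
    return passing, failing


if __name__ == "__main__":
    ok, bad = generate_sequences()
    print(len(ok), "passing sequences,", len(bad), "exception-raising sequences")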