IMPrECISE: Good-is-good-enough data integration
IMPrECISE is an XQuery module that adds probabilistic XML functionality to an existing XML DBMS, in our case MonetDB/XQuery. We demonstrate the probabilistic XML and data integration functionality of IMPrECISE. The prototype is configurable with domain knowledge such that the amount of uncertainty arising during data integration is reduced to an acceptable level, thus achieving "good is good enough" data integration with minimal human effort.
BioCloud Search EnGene: Surfing Biological Data on the Cloud
The massive production and spread of biomedical data around the web introduce new challenges related to identifying computational approaches for providing quality search and browsing of web resources. This paper presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching and integration of the many layers of biological information offered by public large-scale genomic repositories. Grounded in the concept of dataspaces, BSE is built on top of a cloud platform that largely mitigates issues associated with scalability and performance. Like popular online gene portals, BSE adopts a gene-centric approach: researchers can find their information of interest by means of a simple “Google-like” query interface that accepts standard gene identifiers as keywords. We present the BSE architecture and functionality and discuss how our strategies contribute to successfully tackling big data problems in querying gene-based web resources. BSE is publicly available at: http://biocloud-unica.appspot.com/
Impliance: A Next Generation Information Management Appliance
ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data, unstructured as well as structured, in a uniform way; (c) achieving scale-out by exploiting simple, massive parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.
Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US.
Cross-organisation dataspace (COD) - architecture and implementation
With the rapid development of information and communication technologies, the need to share information to improve efficiency in large enterprises is also increasing rapidly. For a large enterprise the information can come from many different sources and in different formats. There is a real requirement to manage the vast amount and diverse sources of data in a convenient and integrated way, so that repositories of information can be built up with little additional effort and the information can be easily accessed globally. This paper presents the design and implementation of a prototype, called COD (Cross-Organisation Dataspace), that addresses the above challenges. COD, in the context of an enterprise involving multiple organisations, allows users from different geographical locations to contribute information and to search and access information easily. The information can come in many different forms, e.g. text files, reports, drawings and databases.
Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
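The core mechanism behind the "global data space" the abstract describes is that independently published datasets reuse the same URIs as identifiers, so merging them yields one connected graph. A minimal sketch (the URIs and property names below are hypothetical, not taken from any real dataset):

```python
# Minimal sketch of the Linked Data idea: data as triples
# (subject, predicate, object) whose URIs act as global identifiers.
# All URIs below are hypothetical examples.

# Dataset published by source A
source_a = {
    ("http://example.org/resource/Berlin", "http://example.org/prop/country",
     "http://example.org/resource/Germany"),
    ("http://example.org/resource/Berlin", "http://example.org/prop/label",
     "Berlin"),
}

# Dataset published independently by source B, reusing the same URI for Berlin
source_b = {
    ("http://example.org/resource/Berlin", "http://example.org/prop/population",
     "3644826"),
}

# Because both sources use the same URI, a plain union already yields a
# connected "global data space" that can be queried as one graph.
web_of_data = source_a | source_b

def describe(uri, graph):
    """Return all (predicate, object) pairs known about a subject URI."""
    return {(p, o) for (s, p, o) in graph if s == uri}

print(describe("http://example.org/resource/Berlin", web_of_data))
```

In practice the same effect is achieved with RDF graphs and dereferenceable HTTP URIs rather than Python sets, but the merge-by-shared-identifier principle is the same.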
Indeterministic Handling of Uncertain Decisions in Duplicate Detection
In current research, duplicate detection is usually considered a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity. In deterministic approaches, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
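The semi-indeterministic idea can be sketched as follows. The thresholds and the similarity-based weighting here are illustrative assumptions, not the authors' actual model; the point is only that a hard decision is replaced by weighted possible worlds inside an uncertain band:

```python
def classify_pair(similarity, low=0.4, high=0.8):
    """Semi-indeterministic duplicate decision (illustrative thresholds).

    Outside the uncertain band (low, high) we decide deterministically,
    which keeps cost down; inside it we keep both possible worlds, each
    with a weight, instead of forcing a possibly false hard decision.
    Returns a list of (decision, probability) pairs.
    """
    if similarity >= high:
        return [("duplicate", 1.0)]
    if similarity <= low:
        return [("distinct", 1.0)]
    # Uncertain band: model both possible worlds in the resulting data.
    return [("duplicate", similarity), ("distinct", 1.0 - similarity)]

print(classify_pair(0.9))  # clear duplicate: one world
print(classify_pair(0.6))  # uncertain: two weighted worlds
print(classify_pair(0.2))  # clearly distinct: one world
```

Narrowing the band toward a single threshold recovers the classical deterministic approach; widening it toward [0, 1] approaches the fully indeterministic (and impractically expensive) extreme the abstract mentions.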
Relative Expressive Power of Navigational Querying on Graphs
Motivated by both established and new applications, we study navigational query languages for graphs (binary relations). The simplest language has only the two operators union and composition, together with the identity relation. We make more powerful languages by adding any of the following operators: intersection; set difference; projection; coprojection; converse; and the diversity relation. All these operators map binary relations to binary relations. We compare the expressive power of all resulting languages. We do this not only for general path queries (queries where the result may be any binary relation) but also for boolean or yes/no queries (expressed by the nonemptiness of an expression). For both cases, we present the complete Hasse diagram of relative expressiveness. In particular, the Hasse diagram for boolean queries contains some nontrivial separations and a few surprising collapses.
Comment: An extended abstract announcing the results of this paper was presented at the 14th International Conference on Database Theory, Uppsala, Sweden, March 2011.
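The operators the abstract lists all map binary relations to binary relations, which makes them easy to sketch over relations represented as sets of pairs. This is an illustrative rendering under standard definitions, not the paper's formal calculus:

```python
# The navigational operators, sketched over binary relations represented
# as Python sets of (a, b) pairs. Union, intersection, and set difference
# are the built-in set operations |, &, and -.

def compose(r, s):
    """R ; S = {(a, c) | exists b with (a, b) in R and (b, c) in S}."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def converse(r):
    """Reverse every pair: {(b, a) | (a, b) in R}."""
    return {(b, a) for (a, b) in r}

def identity(domain):
    """The identity relation on a set of nodes."""
    return {(x, x) for x in domain}

def diversity(domain):
    """All pairs of distinct nodes."""
    return {(x, y) for x in domain for y in domain if x != y}

def projection(r):
    """First projection as a partial identity: nodes with an outgoing edge."""
    return {(a, a) for (a, _) in r}

def coprojection(r, domain):
    """Nodes with NO outgoing R-edge, as a partial identity relation."""
    return {(x, x) for x in domain if all(a != x for (a, _) in r)}

# Example: composition of an edge relation with itself gives length-2 paths.
edges = {(1, 2), (2, 3)}
print(compose(edges, edges))  # paths of length two: {(1, 3)}
```

A "path query" in this setting is any expression built from these operators (its value is a binary relation), while a boolean query asks only whether that relation is nonempty, which is exactly the distinction the two Hasse diagrams capture.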
Dataspaces: Concepts, Architectures and Initiatives
Despite not being a new concept, dataspaces have become a prominent topic due to the increasing availability of data and the need for efficient management and utilization of diverse data sources. In simple terms, a dataspace refers to an environment where data from various sources, formats, and domains can be integrated, shared, and analyzed. It aims to provide a unified view of heterogeneous data by bridging the gap between different data silos, enabling interoperability. The concept of dataspaces promotes the idea that data should be treated as a cohesive entity, rather than being fragmented across different systems and applications. Dataspaces often involve the integration of structured and unstructured data, including databases, documents, sensor data, social media feeds, and more. The goal is to enable organizations to harness the full potential of their data assets by facilitating data discovery, access, and analysis. By bringing together diverse data sources, dataspaces can offer new insights, support decision-making processes, and drive innovation.
In the context of European Commission-funded research projects, dataspaces are often explored as part of initiatives focused on data management, data sharing, and the development of data-driven technologies. These projects aim to address challenges related to data integration, data privacy, data governance, and scalability. The goal is to advance the state of the art in data management and enable organizations to leverage data more effectively for societal, economic, and scientific advancements.
It is important to note that while dataspaces offer potential benefits, they also come with challenges. These include data quality assurance, data privacy and security, semantic interoperability, scalability, and the need for appropriate data governance frameworks. Overall, dataspaces represent an approach to managing and utilizing data that emphasizes integration, interoperability, and accessibility. The concept is being explored and researched to develop innovative solutions that can unlock the value of data in various domains and sectors.
A Survey of the State of Dataspaces
Published in International Journal of Computer and Information Technology. This paper presents a survey of the state of dataspaces. With dataspaces becoming the modern technique of systems integration, the achievement of complete dataspace development is a critical issue. This has led to the design and implementation of dataspace systems using various approaches. Dataspaces are data integration approaches that target data coexistence in the spatial domain. Unlike traditional data integration techniques, they do not require up-front semantic integration of data. In this paper, we outline and compare the properties and implementations of dataspaces, including approaches to optimizing dataspace development. We finally present actual dataspace development recommendations to provide a global overview of this significant research topic.