Instance-Independent View Serializability for Semistructured Databases
Semistructured databases require tailor-made concurrency control mechanisms
since traditional solutions for the relational model have been shown to be
inadequate. Such mechanisms need to take full advantage of the hierarchical
structure of semistructured data, for instance allowing concurrent updates of
subtrees of, or even individual elements in, XML documents. We present an
approach for concurrency control which is document-independent in the sense
that two schedules of semistructured transactions are considered equivalent if
they are equivalent on all possible documents. We prove that it is decidable in
polynomial time whether two given schedules in this framework are equivalent.
This also solves the view serializability problem for semistructured schedules in time polynomial in the size of the schedule and exponential in the number of transactions.
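The abstract does not spell out the decision procedure; as a rough illustration of the classical notion it generalizes, the following Python sketch checks view equivalence of two schedules on one fixed document, using flat read/write operations on named nodes. The operation encoding and function names are our own assumptions; the paper's instance-independent notion additionally quantifies over all possible documents.

from collections import defaultdict

def view_info(schedule):
    """Reads-from relation and final writes of a schedule.

    schedule: list of (txn, op, node) triples with op in {"r", "w"}.
    A read is keyed by (txn, node, k), the k-th read of that node by
    that transaction, so the key does not depend on the interleaving.
    """
    last_writer = {}                       # node -> txn of last write
    read_count = defaultdict(int)          # (txn, node) -> reads seen
    reads_from = {}
    for txn, op, node in schedule:
        if op == "r":
            k = read_count[(txn, node)]
            read_count[(txn, node)] += 1
            # None means the read saw the initial document value
            reads_from[(txn, node, k)] = last_writer.get(node)
        else:
            last_writer[node] = txn
    return reads_from, dict(last_writer)   # final writes

def view_equivalent(s1, s2):
    """View equivalent on this document: same reads-from, same final writes."""
    return view_info(s1) == view_info(s2)

s1 = [("T1", "w", "a"), ("T2", "r", "a")]
s2 = [("T2", "r", "a"), ("T1", "w", "a")]
print(view_equivalent(s1, s2))  # False: T2 reads "a" from T1 vs. the initial value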
Isolation in XML Bases
The eXtensible Markup Language (XML) is well accepted in many different application areas. As a consequence, there is an increasing need for persistently storing XML documents. As soon as many users and applications work concurrently on the same collection of XML documents, i.e. an XML base, isolating the accesses and modifications of different transactions becomes an important issue. We discuss six different core protocols for synchronizing access to and modifications of XML document collections. These core protocols synchronize structure traversals and modifications. They are meant to be integrated into a native XML base management system (XBMS). Four of the six core protocols are based on two-phase locking, one uses timestamps, and the last one uses a novel dynamic commit-ordering approach. The latter two protocols achieve a higher degree of concurrency through a novel implicit representation of multiple versions. We also discuss extensions of these core protocols to full-fledged protocols. Further, we show how the two-phase-locking-based protocols can achieve a higher degree of concurrency by exploiting the semantics expressed in Document Type Definitions (DTDs).
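As a rough illustration of the two-phase-locking flavor of these core protocols, the Python sketch below grants an exclusive lock on an XML subtree after placing intention (IX) locks on its ancestors. The lock modes, compatibility table, and path-based node addressing are our own assumptions, not the paper's protocols; waiting, deadlock handling, and release at commit are omitted.

import threading
from collections import defaultdict

COMPATIBLE = {                     # (requested, held) -> compatible?
    ("IX", "IX"): True, ("IX", "X"): False,
    ("X", "IX"): False, ("X", "X"): False,
}

class TreeLockManager:
    def __init__(self):
        self.held = defaultdict(list)      # path -> [(txn, mode), ...]
        self.mutex = threading.Lock()

    def _compatible(self, path, txn, mode):
        return all(t == txn or COMPATIBLE[(mode, m)]
                   for t, m in self.held[path])

    def lock_subtree(self, txn, path):
        """Exclusive lock on the subtree rooted at `path`, with IX
        locks on all proper ancestors.

        path: tuple of element names from the document root, e.g.
        ("doc", "chapter", "section"). Returns False if the caller
        must wait and retry.
        """
        with self.mutex:
            prefixes = [path[:i] for i in range(1, len(path))]
            if not all(self._compatible(p, txn, "IX") for p in prefixes):
                return False
            if not self._compatible(path, txn, "X"):
                return False
            for p in prefixes:                 # announce intent above
                self.held[p].append((txn, "IX"))
            self.held[path].append((txn, "X"))  # own the whole subtree
            return True

Because two IX locks are compatible, two transactions can update disjoint subtrees of the same document concurrently, which is exactly the fine-grained behavior such protocols aim for.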
Impliance: A Next Generation Information Management Appliance
"While the database management system (DBMS) has been remarkably successful in building a large market and adapting to the changes of the
last three decades, its impact on the broader market of information management
is surprisingly limited. If we were to design an information management system
from scratch, based upon today's requirements and hardware capabilities, would
it look anything like today's database systems?" In this paper, we introduce
Impliance, a next-generation information management system consisting of
hardware and software components integrated to form an easy-to-administer
appliance that can store, retrieve, and analyze all types of structured,
semi-structured, and unstructured information. We first summarize the trends
that will shape information management for the foreseeable future. Those trends
imply three major requirements for Impliance: (1) to be able to store, manage,
and uniformly query all data, not just structured records; (2) to be able to
scale out as the volume of this data grows; and (3) to be simple and robust in
operation. We then describe four key ideas that are uniquely combined in
Impliance to address these requirements, namely the ideas of: (a) integrating
software and off-the-shelf hardware into a generic information appliance; (b)
automatically discovering, organizing, and managing all data - unstructured as
well as structured - in a uniform way; (c) achieving scale-out by exploiting
simple, massively parallel processing; and (d) virtualizing compute and storage
resources to unify, simplify, and streamline the management of Impliance.
Impliance is an ambitious, long-term effort to define simpler, more robust, and
more scalable information systems for tomorrow's enterprises.
Comment: This article is published under a Creative Commons License Agreement
(http://creativecommons.org/licenses/by/2.5/). You may copy, distribute,
display, and perform the work, make derivative works and make commercial use
of the work, but you must attribute the work to the author and CIDR 2007.
3rd Biennial Conference on Innovative Data Systems Research (CIDR), January
7-10, 2007, Asilomar, California, USA
Memory aware query scheduling in a database cluster
Query throughput is one of the primary optimization goals in interactive web-based information systems in order to achieve the performance necessary to serve large user communities. Queries in this application domain differ significantly from those in traditional database applications: they are of lower complexity and almost exclusively read-only. The architecture we propose here is specifically tailored to take advantage of these query characteristics. It is based on a large parallel shared-nothing database cluster where each node runs a separate server with a fully replicated copy of the database. A query is assigned to and entirely executed on a single node, avoiding network contention or synchronization effects. However, the actual key to enhanced throughput is resource-efficient scheduling of the arriving queries. We develop a simple and robust scheduling scheme that takes the data currently memory-resident at each server into account and trades off memory reuse against execution time, reordering queries as necessary. Our experimental evaluation demonstrates the effectiveness of the scheme when scaling the system beyond hundreds of nodes, showing super-linear speedup.
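A minimal Python sketch of the affinity idea, assuming a set-based model of each node's memory-resident data and a simple load penalty; the scoring function and its weight are our own illustration, not the paper's actual scheme.

class Node:
    def __init__(self, name):
        self.name = name
        self.resident = set()    # ids of tables/partitions in memory
        self.queue_len = 0       # queries currently queued here

def schedule(query_footprint, nodes, load_weight=1.0):
    """Route a query to the replica maximizing cache reuse minus load."""
    def score(node):
        reuse = len(query_footprint & node.resident)
        return reuse - load_weight * node.queue_len
    best = max(nodes, key=score)
    best.queue_len += 1
    best.resident |= query_footprint  # data faulted in on execution; no eviction modeled
    return best

The trade-off is visible in the score: a node with warm caches wins the query only as long as its backlog does not outweigh the I/O it would save.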
Native RFL Factors in Quartz
In this project, our team focused on the database migration of the CDO (Collateralized Debt Obligation) Square Calibration tool, an in-house proprietary software tool used to price and analyze this credit derivative. We defined the new data structure of CDO Square, which is fully compatible with Quartz, and provided web services so that CDO Square objects are accessible to other applications on the Quartz platform.
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as issues of data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
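For readers unfamiliar with the programming model, here is a minimal single-process Python sketch of the MapReduce interface (the canonical word count); a real framework would partition the input across machines, shuffle intermediate pairs by key, and re-execute failed tasks.

from collections import defaultdict

def map_fn(_, line):                 # map: (key, value) -> [(k2, v2)]
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):         # reduce: (k2, [v2]) -> [(k2, v3)]
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)       # the "shuffle": group values by key
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [out for k, vs in sorted(groups.items())
                for out in reduce_fn(k, vs)]

print(run_mapreduce(enumerate(["to be or not to be"]),
                    map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]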
Container-Managed ETL Applications for Integrating Data in Near Real-Time
As the analytical capabilities and applications of e-business systems expand, providing real-time access to critical business performance indicators to improve the speed and effectiveness of business operations has become crucial. The monitoring of business activities requires focused yet incremental enterprise application integration (EAI) efforts and a balance between real-time information requirements and historical perspectives. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source system to the data warehouse in a timely manner. In this paper, we present an architecture for a container-based ETL (extraction, transformation, loading) environment that supports continual near-real-time data integration, with the aim of decreasing the time it takes to make business decisions and of minimizing the latency between the cause and effect of a business decision. Instead of using vendor-proprietary ETL solutions, we use an ETL container that manages ETLets (pronounced "et-lets") for the ETL processing tasks. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near-real-time ETL environment. We have fully implemented the proposed architecture. Furthermore, we compare the ETL container to alternative continuous data integration approaches.
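The paper realizes ETLets as container-managed J2EE components; purely to illustrate the kind of contract such a container might enforce, here is a hypothetical Python sketch (all names are our own, not the paper's API).

from abc import ABC, abstractmethod

class ETLet(ABC):
    """One ETL processing task, deployed into and driven by the container."""
    @abstractmethod
    def extract(self):            # pull a micro-batch from the source system
        ...
    @abstractmethod
    def transform(self, rows):    # cleanse, derive, and map to target schema
        ...
    @abstractmethod
    def load(self, rows):         # append the batch to the warehouse
        ...

class ETLContainer:
    """Invokes registered ETLets continually for near-real-time integration."""
    def __init__(self):
        self.etlets = []
    def register(self, etlet):
        self.etlets.append(etlet)
    def run_once(self):
        for etlet in self.etlets:
            rows = etlet.extract()
            if rows:
                etlet.load(etlet.transform(rows))

The point of the container is the inversion of control: scheduling, lifecycle, and resource management live in one reusable place instead of in each vendor-specific ETL job.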
Anatomy of a Native XML Base Management System
Several alternatives for managing large XML document collections exist, ranging from file systems through relational or other database systems to specifically tailored XML repositories. In this paper we give a tour of Natix, a database management system designed from scratch for storing and processing XML data. Contrary to the common belief that management of XML data is just another application for traditional databases like relational systems, we illustrate how almost every component in a database system is affected in terms of adequacy and performance. We show how to design and optimize areas such as storage, transaction management (comprising recovery and multi-user synchronisation) as well as query processing for XML.
- …