1,781 research outputs found
Database Systems - Present and Future
The database systems have nowadays an increasingly important role in the knowledge-based society, in which computers have penetrated all fields of activity and the Internet tends to develop worldwide. In the current informatics context, the development of the applications with databases is the work of the specialists. Using databases, reach a database from various applications, and also some of related concepts, have become accessible to all categories of IT users. This paper aims to summarize the curricular area regarding the fundamental database systems issues, which are necessary in order to train specialists in economic informatics higher education. The database systems integrate and interfere with several informatics technologies and therefore are more difficult to understand and use. Thus, students should know already a set of minimum, mandatory concepts and their practical implementation: computer systems, programming techniques, programming languages, data structures. The article also presents the actual trends in the evolution of the database systems, in the context of economic informatics.database systems - DBS, database management systems – DBMS, database – DB, programming languages, data models, database design, relational database, object-oriented systems, distributed systems, advanced database systems
Deductive Optimization of Relational Data Storage
Optimizing the physical data storage and retrieval of data are two key
database management problems. In this paper, we propose a language that can
express a wide range of physical database layouts, going well beyond the row-
and column-based methods that are widely used in database management systems.
We use deductive synthesis to turn a high-level relational representation of a
database query into a highly optimized low-level implementation which operates
on a specialized layout of the dataset. We build a compiler for this language
and conduct experiments using a popular database benchmark, which shows that
the performance of these specialized queries is competitive with a
state-of-the-art in memory compiled database system
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.Comment: 44 page
Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID
Evaluating the performance of scientific data processing systems is a
difficult task considering the plethora of application-specific solutions
available in this landscape and the lack of a generally-accepted benchmark. The
dual structure of scientific data coupled with the complex nature of processing
complicate the evaluation procedure further. SS-DB is the first attempt to
define a general benchmark for complex scientific processing over raw and
derived data. It fails to draw sufficient attention though because of the
ambiguous plain language specification and the extraordinary SciDB results. In
this paper, we remedy the shortcomings of the original SS-DB specification by
providing a formal representation in terms of ArrayQL algebra operators and
ArrayQL/SciQL constructs. These are the first formal representations of the
SS-DB benchmark. Starting from the formal representation, we give a reference
implementation and present benchmark results in EXTASCID, a novel system for
scientific data processing. EXTASCID is complete in providing native support
both for array and relational data and extensible in executing any user code
inside the system by the means of a configurable metaoperator. These features
result in an order of magnitude improvement over SciDB at data loading,
extracting derived data, and operations over derived data.Comment: 32 pages, 3 figure
Scalable mining for classification rules in relational databases
doi:10.1214/lnms/1196285404Data mining is a process of discovering useful patterns (knowledge)
hidden in extremely large datasets. Classification is a fundamental data mining
function, and some other functions can be reduced to it. In this paper we
propose a novel classification algorithm (classifier) called MIND (MINing in
Databases). MIND can be phrased in such a way that its implementation is
very easy using the extended relational calculus SQL, and this in turn allows
the classifier to be built into a relational database system directly. MIND is
truly scalable with respect to I/O efficiency, which is important since scalability
is a key requirement for any data mining algorithm.
We have built a prototype of MIND in the relational database management
system DB2 and have benchmarked its performance. We describe the working
prototype and report the measured performance with respect to the previous
method of choice. MIND scales not only with the size of datasets but also
with the number of processors on an IBM SP2 computer system. Even on
uniprocessors, MIND scales well beyond dataset sizes previously published for
classifiers.We also give some insights that may have an impact on the evolution
of the extended relational calculus SQL
Just-in-time Analytics Over Heterogeneous Data and Hardware
Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors â CPUs and GPGPUs. We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered by a custom system that has been built on demand to serve their specific use case
Query Flattening and the Nested Data Parallelism Paradigm
This work is based on the observation that languages for two seemingly distant
domains are closely related. Orthogonal query languages based on comprehension
syntax admit various forms of query nesting to construct nested query results
and express complex predicates. Languages for nested data parallelism allow to
nest parallel iterators and thereby admit the parallel evaluation of
computations that are themselves parallel. Both kinds of languages center around
the application of side-effect-free functions to each element of a collection.
The motivation for this work is the seamless integration of relational database
queries with programming languages. In frameworks for language-integrated
database queries, a host language's native collection-programming API is used
to express queries. To mediate between native collection programming and
relational queries, we define an expressive, orthogonal query calculus that
supports nesting and order. The challenge of query flattening is to translate
this calculus to bundles of efficient relational queries restricted to flat,
unordered multisets. Prior approaches to query flattening either support only
query languages that lack in expressiveness or employ a complex, monolithic
translation that is hard to comprehend and generates inefficient code that is
hard to optimize.
To improve on those approaches, we draw on the similarity to nested data
parallelism. Blelloch's flattening transformation is a static program
transformation that translates nested data parallelism to flat data parallel
programs over flat arrays. Based on the flattening transformation, we describe a
pipeline of small, comprehensible lowering steps that translates our nested
query calculus to a bundle of relational queries. The pipeline is based on a
number of well-defined intermediate languages. Our translation adopts the key
concepts of the flattening transformation but is designed with specifics of
relational query processing in mind.
Based on this translation, we revisit all aspects of query flattening. Our
translation is fully compositional and can translate any term of the input
language. Like prior work, the translation by itself produces inefficient code
due to compositionality that is not fit for execution without optimization. In
contrast to prior work, we show that query optimization is orthogonal to
flattening and can be performed before flattening. We employ well-known work on
logical query optimization for nested query languages and demonstrate that this
body of work integrates well with our approach.
Furthermore, we describe an improved encoding of ordered and nested collections
in terms of flat, unordered multisets. Our approach emits idiomatic relational
queries in which the effort required to maintain the non-relational semantics of
the source language (order and nesting) is minimized.
A set of experiments provides evidence that our approach to query flattening can
handle complex, list-based queries with nested results and nested intermediate
data well. We apply our approach to a number of flat and nested benchmark
queries and compare their runtime with hand-written SQL queries. In these
experiments, our SQL code generated from a list-based nested query language
usually performs as well as hand-written queries
MIL primitives for querying a fragmented world
In query-intensive database application areas, like decision support and data mining, systems that use vertical fragmentation have a significant performance advantage. In order to support relational or object oriented applications on top of such a fragmented data model, a flexible yet powerful intermediate language is needed. This problem has been successfully tackled in Monet, a modern extensible database kernel developed by our group. We focus on the design choices made in the Monet Interpreter Language (MIL), its algebraic query language, and outline how its concept of tactical optimization enhances and simplifies the optimization of complex queries. Finally, we summarize the experience gained in Monet by creating a highly efficient implementation of MIL
The {RDF}-3X Engine for Scalable Management of {RDF} Data
RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The ``pay-as-you-go'' nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, manyway star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude
- …