Enhancing Query Processing on Stock Market Cloud-based Database
Cloud computing is rapidly expanding because it allows users to save development and implementation time in their work. It also reduces the maintenance and operational costs of the systems used. Furthermore, it enables the elastic use of resources rather than relying on workload estimates, which may be inaccurate, and database systems can benefit from this trend. In this paper, we propose an algorithm that allocates materialized views over cloud-based replica sets to enhance the performance of a stock market database system using a Peer-to-Peer architecture. The results show that the proposed model improves query processing time and network transfer cost by distributing the materialized views over cloud-based replica sets. It also has a significant effect on decision-making and on achieving economic returns.
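As a rough illustration of the allocation idea described above, the sketch below greedily assigns each materialized view to the replica node with the lowest estimated network transfer cost; the names, the cost formula, and the sample data are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: greedy placement of materialized views onto replica nodes,
# minimizing an estimated network transfer cost. Names and cost model are illustrative.
from dataclasses import dataclass

@dataclass
class View:
    name: str
    size_mb: float
    access_freq: dict  # node id -> queries per hour issued from that node

def place_views(views, nodes, link_cost):
    """Assign each view to the node with the lowest estimated transfer cost.

    link_cost[host][src] is the per-MB cost of shipping the view from `host`
    to a querying node `src`.
    """
    placement = {}
    for v in views:
        def cost(host):
            return sum(freq * v.size_mb * link_cost[host][src]
                       for src, freq in v.access_freq.items())
        placement[v.name] = min(nodes, key=cost)
    return placement

if __name__ == "__main__":
    nodes = ["n1", "n2"]
    link_cost = {"n1": {"n1": 0.0, "n2": 1.0}, "n2": {"n1": 1.0, "n2": 0.0}}
    views = [View("daily_prices", 120.0, {"n1": 50, "n2": 5})]
    print(place_views(views, nodes, link_cost))  # -> {'daily_prices': 'n1'}
```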
Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)
Database management systems (DBMSs) carefully optimize complex multi-join
queries to avoid expensive disk I/O. As servers today feature tens or hundreds
of gigabytes of RAM, a significant fraction of many analytic databases becomes
memory-resident. Even after careful tuning for an in-memory environment, a
linear disk I/O model such as the one implemented in PostgreSQL may choose a
multi-join query plan whose response time is up to 2X slower than that of the
optimal plan over memory-resident data. This paper introduces a memory I/O cost
model to identify good evaluation strategies for complex query plans with
multiple hash-based equi-joins over memory-resident data. The proposed cost
model is carefully validated for accuracy using three different systems,
including an Amazon EC2 instance, to control for hardware-specific differences.
Prior work in parallel query evaluation has advocated right-deep and bushy
trees for multi-join queries due to their greater parallelization and
pipelining potential. A surprising finding is that the conventional wisdom from
shared-nothing disk-based systems does not directly apply to the modern
shared-everything memory hierarchy. As corroborated by our model, the
performance gap between the optimal left-deep and right-deep query plan can
grow to about 10X as the number of joins in the query increases.
Comment: 15 pages, 8 figures, extended version of the paper to appear in SoCC'1
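To make the kind of reasoning behind a memory I/O cost model concrete, here is a minimal Python sketch that charges each hash equi-join for the bytes it moves through the memory hierarchy and sums that cost over a left-deep pipeline; the tuple widths, selectivity, and formulas are illustrative assumptions, not the model introduced in the paper.

```python
# Minimal sketch of a memory-I/O style cost estimate for a pipeline of hash equi-joins.
# Widths, selectivity, and the cost formulas are illustrative, not the paper's model.

def hash_join_cost(build_rows, probe_rows, build_width, probe_width):
    """Approximate bytes moved for one hash join: the build side is written into
    a hash table and later read during probing; the probe side is streamed once."""
    build_cost = build_rows * build_width * 2
    probe_cost = probe_rows * probe_width
    return build_cost + probe_cost

def left_deep_pipeline_cost(table_rows, tuple_widths, selectivity=0.5):
    """Cost of a left-deep plan: the running intermediate result probes the
    hash table built on each successive table."""
    total = 0
    rows, width = table_rows[0], tuple_widths[0]
    for build_rows, build_width in zip(table_rows[1:], tuple_widths[1:]):
        total += hash_join_cost(build_rows, rows, build_width, width)
        rows = max(1, int(rows * selectivity))  # toy cardinality estimate
        width += build_width                    # joined tuples carry both sides
    return total

if __name__ == "__main__":
    # three-table join: 1M x 500K x 200K rows with 32/16/16-byte tuples
    print(left_deep_pipeline_cost([1_000_000, 500_000, 200_000], [32, 16, 16]))
```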
An Efficient Query Optimizer with Materialized Intermediate Views in Distributed and Cloud Environment
In a cloud computing environment, the hardware resources required to execute queries on a distributed relational database system are scaled up or down according to query workload performance. Complex queries require large amounts of resources to complete their execution efficiently. This resource requirement can be reduced by minimizing query execution time, which maximizes resource utilization and decreases the payment overhead of customers. Complex or batch queries contain some common subexpressions. If these common subexpressions are evaluated once and their results cached, they can be reused when executing further queries. In this research, we propose an algorithm for query optimization that stores the intermediate results of queries and uses these by-products for the execution of future queries. Extensive experiments have been carried out with the help of a simulation model to test the algorithm's efficiency.
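A minimal sketch of the reuse idea, assuming a naive textual canonicalization of subexpressions and a pluggable executor (both hypothetical, not the paper's algorithm):

```python
# Illustrative sketch of reusing materialized intermediate results for common
# subexpressions across queries; the cache key and executor are hypothetical.

class IntermediateViewCache:
    def __init__(self, execute):
        self.execute = execute  # callable that actually evaluates a subexpression
        self.cache = {}         # canonical subexpression -> materialized rows

    def evaluate(self, subexpression):
        key = " ".join(subexpression.lower().split())  # naive canonical form
        if key not in self.cache:
            self.cache[key] = self.execute(subexpression)
        return self.cache[key]

if __name__ == "__main__":
    calls = []
    def fake_execute(sql):
        calls.append(sql)
        return [("row",)]
    cache = IntermediateViewCache(fake_execute)
    cache.evaluate("SELECT * FROM trades WHERE day = '2023-01-02'")
    cache.evaluate("select * from trades  where day = '2023-01-02'")
    print(len(calls))  # 1: the second query reused the cached intermediate result
```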
Data Warehousing Modernization: Big Data Technology Implementation
Considering the challenges posed by Big Data, the cost of scaling traditional data warehouses is high and their performance would be inadequate to meet the growing volume, variety, and velocity of data. The Hadoop ecosystem addresses both of these shortcomings. Hadoop can store and analyze large data sets in parallel in a distributed environment, but it cannot replace existing data warehouses and RDBMS systems due to its own limitations, explained in this paper. In this paper, I identify the reasons why many enterprises fail and struggle to adapt to Big Data technologies. A brief outline of two different technologies for handling Big Data is presented: IBM's PureData System for Analytics (Netezza), typically used for reporting, and Hadoop with Hive, used for analytics. This paper also covers the enterprise architecture, built around Hadoop, that successful companies are adopting to analyze, filter, process, and store data alongside a massively parallel processing data warehouse. Despite having the technology to support and process Big Data, industries still struggle to meet their goals due to the lack of skilled personnel to study and analyze the data, in short, data scientists and data statisticians.
Database Systems - Present and Future
Database systems nowadays play an increasingly important role in the knowledge-based society, in which computers have penetrated all fields of activity and the Internet continues to spread worldwide. In the current informatics context, developing applications with databases is the work of specialists. Using databases, accessing a database from various applications, and related concepts have become accessible to all categories of IT users. This paper aims to summarize the curricular area covering fundamental database systems issues, which is necessary in order to train specialists in economic informatics higher education. Database systems integrate and interact with several informatics technologies and are therefore more difficult to understand and use. Thus, students should already know a minimum set of mandatory concepts and their practical implementation: computer systems, programming techniques, programming languages, and data structures. The article also presents current trends in the evolution of database systems in the context of economic informatics.
Keywords: database systems (DBS), database management systems (DBMS), database (DB), programming languages, data models, database design, relational database, object-oriented systems, distributed systems, advanced database systems
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. To achieve fair
experimental results, we use Apache Spark as a common parallel computing
framework, rewriting the algorithms concerned with the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability, and
scalability, which are essential in such systems. Our different implementations
aim to highlight the fundamental, implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and, to some extent, query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.
Comment: 16 pages, 3 figures
The Data Lakehouse: Data Warehousing and More
Relational Database Management Systems designed for Online Analytical
Processing (RDBMS-OLAP) have been foundational to democratizing data and
enabling analytical use cases such as business intelligence and reporting for
many years. However, RDBMS-OLAP systems present some well-known challenges.
They are primarily optimized only for relational workloads, lead to
proliferation of data copies which can become unmanageable, and since the data
is stored in proprietary formats, it can lead to vendor lock-in, restricting
access to engines, tools, and capabilities beyond what the vendor offers. As
the demand for data-driven decision making surges, the need for a more robust
data architecture to address these challenges becomes ever more critical. Cloud
data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but
they present their own set of challenges. More recently, organizations have
often followed a two-tier architectural approach to take advantage of both
these platforms, leveraging both cloud data lakes and RDBMS-OLAP systems.
However, this approach brings additional challenges, complexities, and
overhead. This paper discusses how a data lakehouse, a new architectural
approach, achieves the same benefits of an RDBMS-OLAP and cloud data lake
combined, while also providing additional advantages. We take today's data
warehousing and break it down into implementation-independent components,
capabilities, and practices. We then take these aspects and show how a
lakehouse architecture satisfies them. Then, we go a step further and discuss
what additional capabilities and benefits a lakehouse architecture provides
over an RDBMS-OLAP system.
UnifyDR: A Generic Framework for Unifying Data and Replica Placement
The advent of (big) data management applications operating at Cloud scale has led to extensive research on the data placement problem. The key objective of data placement is to obtain a partitioning (possibly allowing for replicas) of a set of data items into distributed nodes that minimizes the overall network communication cost. Although replication is intrinsic to data placement, it has seldom been studied in combination with the latter. On the contrary, most existing solutions treat them as two independent problems and employ a two-phase approach: (1) data placement, followed by (2) replica placement. We address this by proposing a new paradigm, CDR, with the objective of combining data and replica placement as a single joint optimization problem. Specifically, we study two variants of the CDR problem: (1) CDR-Single, where the objective is to minimize the communication cost alone, and (2) CDR-Multi, which performs a multi-objective optimization to also minimize traffic and storage costs. To unify data and replica placement, we propose a generic framework called UnifyDR, which leverages overlapping correlation clustering to assign a data item to multiple nodes, thereby facilitating data and replica placement to be performed jointly. We establish the generic nature of UnifyDR by portraying its ability to address the CDR problem in two real-world use cases: join-intensive online analytical processing (OLAP) queries and a location-based online social network (OSN) service. The effectiveness and scalability of UnifyDR are showcased by experiments performed on data generated using the TPC-DS benchmark and a trace of the Gowalla OSN for the OLAP-query and OSN-service use cases, respectively. Empirically, the presented approach obtains an improvement of approximately 35% in terms of the evaluated metrics and a speed-up of 8 times in comparison to state-of-the-art techniques.
This work was supported by the Agentschap Innoveren & Ondernemen (VLAIO) Strategic Fundamental Research (SBO) under Grant 150038 (DiSSeCt).
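For intuition only, the toy sketch below treats data and replica placement as one assignment step: each data item may be mapped to up to a fixed number of nodes, chosen greedily by co-access affinity with items already placed there. The scoring, capacity handling, and names are assumptions for illustration and do not reproduce the overlapping-correlation-clustering formulation used by UnifyDR.

```python
# Toy joint data-and-replica placement: every item can be assigned to up to
# `max_copies` nodes in a single pass, ranked by co-access affinity.
# The greedy heuristic and all names are illustrative, not UnifyDR itself.
from collections import defaultdict

def joint_placement(items, co_access, num_nodes, max_copies=2, capacity=None):
    """items: list of item ids; co_access[(a, b)]: how often a and b
    appear together in the same query."""
    capacity = capacity or (len(items) * max_copies) // num_nodes + 1
    nodes = defaultdict(set)

    def affinity(node, item):
        # benefit of co-locating `item` with what the node already holds
        return sum(co_access.get((item, other), 0) + co_access.get((other, item), 0)
                   for other in nodes[node])

    for item in items:
        ranked = sorted(range(num_nodes), key=lambda n: affinity(n, item), reverse=True)
        copies = 0
        for node in ranked:
            if copies == max_copies:
                break
            if len(nodes[node]) < capacity:
                nodes[node].add(item)
                copies += 1
    return {n: sorted(members) for n, members in nodes.items()}

if __name__ == "__main__":
    co = {("orders", "lineitem"): 10, ("customer", "orders"): 4}
    print(joint_placement(["orders", "lineitem", "customer"], co, num_nodes=3))
```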