Designing and Implementing a Distributed Database for a Small Multi-Outlet Business
Data is a fundamental and necessary element for businesses. During their operations, businesses generate data that they need to capture, store, and later retrieve when required. Databases provide the means to store and effectively retrieve that data. A well-designed database can help a business improve its services, be more competitive, and ultimately increase its profits. In this paper, the system requirements of a distributed database are researched for a movie rental and sale store that has at least two outlets in different locations besides the main one. This project investigates the different stages of building such a database, namely planning, analysis, decision, implementation, and testing.
The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
the network is in the same ballpark as the bandwidth of one memory channel, and it
increases even further with the most recent EDR standard. Moreover, with the
increasing advances of RDMA, the latency improves similarly fast. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved.
PerfXplain: Debugging MapReduce Job Performance
While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions. We formally define the notion of an explanation together with three
metrics, relevance, precision, and generality, that measure explanation
quality. We present the explanation-generation algorithm based on techniques
related to decision-tree building. We evaluate the approach on a log of past
executions on Amazon EC2, and show that our approach can generate quality
explanations, outperforming two naive explanation-generation methods.
Comment: VLDB201
Scalable mining for classification rules in relational databases
doi:10.1214/lnms/1196285404
Data mining is a process of discovering useful patterns (knowledge)
hidden in extremely large datasets. Classification is a fundamental data mining
function, and some other functions can be reduced to it. In this paper we
propose a novel classification algorithm (classifier) called MIND (MINing in
Databases). MIND can be phrased in such a way that its implementation is
very easy using the extended relational calculus SQL, and this in turn allows
the classifier to be built into a relational database system directly. MIND is
truly scalable with respect to I/O efficiency, which is important since scalability
is a key requirement for any data mining algorithm.
We have built a prototype of MIND in the relational database management
system DB2 and have benchmarked its performance. We describe the working
prototype and report the measured performance with respect to the previous
method of choice. MIND scales not only with the size of datasets but also
with the number of processors on an IBM SP2 computer system. Even on
uniprocessors, MIND scales well beyond dataset sizes previously published for
classifiers. We also give some insights that may have an impact on the evolution
of the extended relational calculus SQL.
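The idea of phrasing a classifier's core work in SQL can be sketched as follows. This is a minimal illustration and not the MIND algorithm itself: it assumes a decision-tree style split evaluation based on Gini impurity (the abstract does not name the split criterion), uses SQLite in place of DB2, and the `rentals` table, its columns, and all function names are hypothetical. The point is only that the per-attribute class counts a tree builder needs can come from a single `GROUP BY` query, so the database does the scan.

```python
import sqlite3


def gini(counts):
    """Gini impurity of a list of per-class row counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def class_counts_by_value(conn, table, attribute, label):
    """Count rows per (attribute value, class label) with one GROUP BY query."""
    # Identifiers are interpolated directly into the SQL string; this is
    # acceptable only because the names in this toy example are trusted.
    sql = (
        f"SELECT {attribute}, {label}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {label}"
    )
    counts = {}
    for value, cls, n in conn.execute(sql):
        counts.setdefault(value, {})[cls] = n
    return counts


def weighted_gini(counts):
    """Impurity of partitioning the rows by attribute value, weighted by size."""
    total = sum(sum(per_class.values()) for per_class in counts.values())
    return sum(
        sum(per_class.values()) / total * gini(list(per_class.values()))
        for per_class in counts.values()
    )


# Hypothetical toy data: does a rental come back late?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (genre TEXT, member TEXT, late TEXT)")
conn.executemany(
    "INSERT INTO rentals VALUES (?, ?, ?)",
    [
        ("action", "yes", "no"),
        ("action", "no", "yes"),
        ("drama", "yes", "no"),
        ("drama", "no", "yes"),
    ],
)

# Score each candidate attribute; the lowest weighted impurity is the best split.
scores = {
    attr: weighted_gini(class_counts_by_value(conn, "rentals", attr, "late"))
    for attr in ("genre", "member")
}
best = min(scores, key=scores.get)
print(best, scores)
```

In this toy table, `member` separates the two classes perfectly while `genre` does not separate them at all, so `member` is chosen as the split attribute.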
Business Analytics in (a) Blink
The Blink project’s ambitious goal is to answer all Business Intelligence (BI) queries in mere seconds,
regardless of the database size, with an extremely low total cost of ownership. Blink is a new DBMS
aimed primarily at read-mostly BI query processing that exploits scale-out of commodity multi-core
processors and cheap DRAM to retain a (copy of a) data mart completely in main memory. Additionally,
it exploits proprietary compression technology and cache-conscious algorithms that reduce memory
bandwidth consumption and allow most SQL query processing to be performed on the compressed data.
Blink always scans (portions of) the data mart in parallel on all nodes, without using any indexes or
materialized views, and without any query optimizer to choose among them. The Blink technology has
thus far been incorp
- …