775 research outputs found

Towards Large-Scale Knowledge Discovery in Databases (KDD) by Exploiting Parallelism in Generic KDD Primitives

Author: Freitas Alex A.
Publication date: 01/07/1997

The End of Slow Networks: It's Time for a Redesign

Author: Binnig Carsten, Crotty Andrew, Galakatos Alex, Kraska Tim, Zamanian Erfan
Publication date: 19/12/2015

Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the increasing advances of RDMA, the latency improves similarly fast. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved

arXiv.org e-Print Archive

TUbiblio

PerfXplain: Debugging MapReduce Job Performance

Author: Balazinska Magdalena, Khoussainova Nodira, Suciu Dan
Publication date: 01/01/2012

While users today have access to many tools that assist in performing large scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult. We present PerfXplain, a system that enables users to ask questions about the relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain provides a new query language for articulating performance queries and an algorithm for generating explanations from a log of past MapReduce job executions. We formally define the notion of an explanation together with three metrics, relevance, precision, and generality, that measure explanation quality. We present the explanation-generation algorithm based on techniques related to decision-tree building. We evaluate the approach on a log of past executions on Amazon EC2, and show that our approach can generate quality explanations, outperforming two naive explanation-generation methods. Comment: VLDB201

arXiv.org e-Print Archive

CiteSeerX

Scalable mining for classification rules in relational databases

Author: Iyer Bala, Vitter Jeffrey Scott, Wang Min
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2004

doi:10.1214/lnms/1196285404

Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL

Crossref

KU ScholarWorks

Designing and Implementing a Distributed Database for a Small Multi-Outlet Business

Author: Grech Joseph
Publication venue: ePublications at Regis University
Publication date: 26/08/2009

Data is a fundamental and necessary element for businesses. During their operations they generate a certain amount of data that they need to capture, store, and later on retrieve when required. Databases provide the means to store and effectively retrieve data. Such a database can help a business improve its services, be more competitive, and ultimately increase its profits. In this paper, the system requirements of a distributed database are researched for a movie rental and sale store that has at least two outlets in different locations besides the main one. This project investigates the different stages of such a database, namely, the planning, analysis, decision, implementation and testing

ePublications at Regis University

XML-based Execution Plan Format (XEP)

Author: Christoph Koch
Publication venue: RonPub
Publication date: 01/01/2016

Execution plan analysis is one of the most common SQL tuning tasks performed by relational database administrators and developers. Currently each database management system (DBMS) provides its own execution plan format, which supports system-specific details for execution plans and contains inherent plan operators. This makes SQL tuning a challenging issue. Firstly, administrators and developers often work with more than one DBMS and thus have to rethink among different plan formats. In addition, the analysis tools of execution plans only support single DBMSs, or they have to implement separate logic to handle each specific plan format of different DBMSs. To address these problems, this paper proposes an XML-based Execution Plan format (XEP), aiming to standardize the representation of execution plans of relational DBMSs. Two approaches are developed for transforming DBMS-specific execution plans into XEP format. They have been successfully evaluated for IBM DB2, Oracle Database and Microsoft SQL

RonPub -- Research Online Publishing

Analytical response time estimation in parallel relational database systems

Author: Broughton P., Burger A., Dempster E., King P. J. B., Taylor H., Tomov N., Williams Howard
Publication venue: Elsevier BV
Publication date: 01/02/2004

Techniques for performance estimation in parallel database systems are well established for parameters such as throughput, bottlenecks and resource utilisation. However, response time is difficult to predict, and its estimation is a complex activity that has attracted research for a number of years. Simulation is one option for predicting response time, but this is a costly process. Analytical modelling is a less expensive option, but it requires approximations and assumptions about the queueing networks built up in real parallel database machines, and these are often questionable. Few of the papers on analytical approaches are backed by results from validation against real machines. This paper describes a new analytical approach for response time estimation that is based on a detailed study of different approaches and assumptions. The approach has been validated against two commercial parallel DBMSs running on actual parallel machines and is shown to produce acceptable accuracy

Heriot Watt Pure

Abertay Research Portal

Geoprocessing Optimization in Grids

Author: Liu Shuo
Publication date: 30/09/2005

Geoprocessing is commonly used in solving problems across disciplines which feature geospatial data and/or phenomena. Geoprocessing requires specialized algorithms and more recently, due to large volumes of geospatial databases and complex geoprocessing operations, it has become data- and/or compute-intensive. The conventional approach, which is predominately based on centralized computing solutions, is unable to handle geoprocessing efficiently. To that end, there is a need for developing distributed geoprocessing solutions by taking advantage of existing and emerging advanced techniques and high-performance computing and communications resources. As an emerging new computing paradigm, grid computing offers a novel approach for integrating distributed computing resources and supporting collaboration across networks, making it suitable for geoprocessing. Although there have been research efforts applying grid computing in the geospatial domain, there is currently a void in the literature for a general geoprocessing optimization. In this research, a new optimization technique for geoprocessing in grid systems, Geoprocessing Optimization in Grids (GOG), is designed and developed. The objective of GOG is to reduce overall response time with a reasonable cost. To meet this objective, GOG contains a set of algorithms, including a resource selection algorithm and a parallelism processing algorithm, to speed up query execution. GOG is validated by comparing its optimization time and estimated costs of generated execution plans with two existing optimization techniques. A proof of concept based on an application in air quality control is developed to demonstrate the advantages of GOG

D-Scholarship@Pitt