Search CORE

1,676 research outputs found

Set-oriented data mining in relational databases

Author: Houtsma Maurice
Swami Arun
Publication venue: North Holland
Publication date: 01/01/1995
Field of study

Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. Query optimization technology can then be used for efficient processing.\ud \ud In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. It is easily parallelized and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily and its performance makes it feasible to build interactive data mining tools for large databases

CiteSeerX

University of Twente Research Information

Efficient Cross-Device Query Processing

Author: Pirk H. (Holger)
Publication venue
Publication date: 01/01/2012
Field of study

The increasing diversity of hardware within a single system promises large performance gains but also poses a challenge for data management systems. Strategies for the efficient use of hardware with large performance differences are still lacking. For example, existing research on GPU supported data management largely handles the GPU in isolation from the system’s CPU — The GPU is considered the central processor and the CPU used only to mitigate the GPU’s weaknesses where necessary. To make efficient use of all available devices, we developed a processing strategy that lets unequal devices like GPU and CPU combine their strengths rather than work in isolation. To this end, we decompose relational data into individual bits and place the resulting partitions on the appropriate devices. Operations are processed in phases, each phase executed on one device. This way, we achieve significant performance gains and good load distribution among the available devices in a limited real-life use case. To grow this idea into a generic system, we identify challenges as well as potential hardware configurations and applications that can benefit from this approach

CWI's Institutional Repository

Saber: window-based hybrid stream processing for heterogeneous architectures

Author: Costa P
Fernandez R
Koliousis A
Pietzuch P
Weidlich M
Wolf A
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/09/2015
Field of study

Modern servers have become heterogeneous, often combining multicore CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling. We describe SABER, a hybrid high-performance relational stream processing engine for CPUs and GPGPUs. SABER executes windowbased streaming SQL queries in a data-parallel fashion using all available CPU and GPGPU cores. Instead of statically assigning query operators to heterogeneous processors, SABER employs a new adaptive heterogeneous lookahead scheduling strategy, which increases the share of queries executing on the processor that yields the highest performance. To hide data movement costs, SABER pipelines the transfer of stream data between different memory types and the CPU/GPGPU. Our experimental comparison against state-ofthe-art engines shows that SABER increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with small and large windows sizes

Spiral - Imperial College Digital Repository

Relational queries with a tensor processing unit

Author: Holanda P.T. (Pedro)
Mühleisen H.F. (Hannes)
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

Tensor Processing Units are specialized hardware devices built to train and apply Machine Learning models at high speed through high-bandwidth memory and massive instruction parallelism. In this short paper, we investigate how relational operations can be translated to those devices. We present mapping of relational operators to TPU-supported TensorFlow operations and experimental results comparing with GPU and CPU implementations. Results show that while raw speeds are enticing, TPUs are unlikely to improve relational query processing for now due to a variety of issues

Crossref

CWI's Institutional Repository

X-Device Query Processing by Bitwise Distribution

Author: Kersten M.L. (Martin)
Manegold S. (Stefan)
Pirk H. (Holger)
Sellam T.H.J. (Thibault)
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 01/05/2012
Field of study

The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For exam- ple, existing approaches to CPU/GPU co-processing distribute individual relational operators to the “most appropriate” device. While pleasantly simple, this strategy has a number of problems: it may leave the “inappropriate” devices idle while overloading the “appropriate” device and putting a high pressure on the PCI bus. To address these issues we distribute data among the devices by par- tially decomposing relations at the granularity of individual bits. Each of the resulting bit-partitions is stored and processed on one of the available devices. Using this strategy, we implemented a processor for spatial range queries that makes efficient use of all available devices. The performance gains achieved indicate that bitwise distribution makes a good cross-device processing strategy

CWI's Institutional Repository

Runtime Adaptive Hybrid Query Engine based on FPGAs

Author: Christopher Blochwitz
Dennis Heinrich
Stefan Werner
Sven Groppe
Thilo Pionteck
Publication venue: RonPub
Publication date: 01/01/2016
Field of study

This paper presents the fully integrated hardware-accelerated query engine for large-scale datasets in the context of Semantic Web databases. As queries are typically unknown at design time, a static approach is not feasible and not flexible to cover a wide range of queries at system runtime. Therefore, we introduce a runtime reconfigurable accelerator based on a Field Programmable Gate Array (FPGA), which transparently incorporates with the freely available Semantic Web database LUPOSDATE. At system runtime, the proposed approach dynamically generates an optimized hardware accelerator in terms of an FPGA configuration for each individual query and transparently retrieves the query result to be displayed to the user. During hardware-accelerated execution the host supplies triple data to the FPGA and retrieves the results from the FPGA via PCIe interface. The benefits and limitations are evaluated on large-scale synthetic datasets with up to 260 million triples as well as the widely known Billion Triples Challenge

RonPub -- Research Online Publishing

Recommended from our members

A Generalization of Band Joins and the Merge-Purge Problem

Author: Hernandez Mauricio A.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1995
Field of study

The problem of merging multiple databases of information about common entities is frequently encountered in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data always have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain dependent matching process. We have developed a system for accomplishing this task for lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting. The system provides a rule programming module that is easy to program and quite good at finding duplicates especially in an environment with massive amounts of data

Columbia University Academic Commons