Towards Large-Scale Knowledge Discovery in Databases (KDD) by Exploiting Parallelism in Generic KDD Primitives

Author: Freitas Alex A.
Publication date: 01/07/1997

Scalable mining for classification rules in relational databases

Author: Iyer Bala
Vitter Jeffrey Scott
Wang Min
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2004

doi:10.1214/lnms/1196285404

Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL.
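The abstract does not spell out MIND's SQL formulation, so the following is only a minimal sketch of the general idea of pushing classifier statistics into a relational engine: a `GROUP BY` query gathers class counts per attribute value, and a split-quality score is computed from those counts. The table layout, the column names (`age`, `income`, `label`), the use of SQLite in place of DB2, and the choice of Gini impurity as the score are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema for illustration only; not taken from the paper.
ATTRIBUTES = ("age", "income")


def class_counts_by_value(rows, attribute):
    """Return {(attribute_value, class_label): count} via a SQL GROUP BY."""
    if attribute not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE training (age TEXT, income TEXT, label TEXT)")
        conn.executemany("INSERT INTO training VALUES (?, ?, ?)", rows)
        # The attribute name is whitelisted above, so interpolation is safe here.
        query = (
            f"SELECT {attribute}, label, COUNT(*) "
            f"FROM training GROUP BY {attribute}, label"
        )
        return {(value, label): n for value, label, n in conn.execute(query)}
    finally:
        conn.close()


def gini_of_split(counts):
    """Weighted Gini impurity of the partitions induced by each attribute value."""
    totals = {}
    for (value, _label), n in counts.items():
        totals[value] = totals.get(value, 0) + n
    grand_total = sum(totals.values())
    weighted = 0.0
    for value, total in totals.items():
        impurity = 1.0 - sum(
            (n / total) ** 2 for (v, _label), n in counts.items() if v == value
        )
        weighted += (total / grand_total) * impurity
    return weighted
```

With four toy rows in which `age` separates the classes perfectly and `income` does not, `gini_of_split` gives 0.0 for `age` and 0.5 for `income`, so a tree builder would prefer the `age` split. Only the counting runs inside the database; the scoring here is done in the client for brevity.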

Crossref

KU ScholarWorks

Design and implementation data flow analysis of jobs in IBM DataStage for Manta project

Author: Vladyslav Zavirskyy
Publication venue: Czech Technical University in Prague. Computing and Information Centre.
Publication date: 13/06/2019

This work aims to design and implement a functional prototype of a module that performs syntactic and semantic analysis of jobs in IBM InfoSphere DataStage. The module is used to analyze data flows and to generate a graph that represents them. The design and implementation support trouble-free connection of the module to the Manta project. The work contains an in-depth analysis of the IBM InfoSphere DataStage tool, design documentation, the implemented module prototype, and tests that ensure the module's functionality.

Digital Library of the Czech Technical University in Prague

AT-GIS: highly parallel spatial query processing with associative transducers

Author: Ogden
Pietzuch P
Thomas D
Publication venue: Association for Computing Machinery (ACM)
Publication date: 11/11/2015

Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMS, which entail expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute, in parallel, spatial query operators on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10× for aggregation queries.

Spiral - Imperial College Digital Repository

PerfXplain: Debugging MapReduce Job Performance

Author: Balazinska Magdalena
Khoussainova Nodira
Suciu Dan
Publication date: 01/01/2012

While users today have access to many tools that assist in performing large-scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult. We present PerfXplain, a system that enables users to ask questions about the relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain provides a new query language for articulating performance queries and an algorithm for generating explanations from a log of past MapReduce job executions. We formally define the notion of an explanation together with three metrics (relevance, precision, and generality) that measure explanation quality. We present the explanation-generation algorithm based on techniques related to decision-tree building. We evaluate the approach on a log of past executions on Amazon EC2, and show that our approach can generate quality explanations, outperforming two naive explanation-generation methods. Comment: VLDB201

arXiv.org e-Print Archive

CiteSeerX

Land Cover/Land Use Mapping Using Soft Computing Techniques with Optimized Features

Author: Nisia T. Gladima
Rajesh Selvaraj
Publication venue: IntechOpen
Publication date: 26/02/2020

The chapter discusses soft computing techniques for solving complex computational tasks. It highlights techniques such as fuzzy logic, genetic algorithms, artificial neural networks, and machine learning. Classifying remotely sensed images is a tedious task, so we explain how these soft computing techniques can be used for image classification. Image classification depends mainly on the feature extraction process: features extracted efficiently improve classification accuracy. Hence, the different kinds of features and the different methods for extracting them are explained. The best extracted features are selected using a genetic algorithm. Various algorithms are shown and comparisons are made. Finally, the results are verified using a hypothetical case study.

IntechOpen

Crossref

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture

Author: K Bernhard Schiefer
Robert L Grossman
Sadique Syed
Sivakumar Harinath
Xun Xue
Publication date: 06/03/2020

DB2 Universal Database Enterprise-Extended Edition (DB2 UDB EEE) is a parallel relational database management system using a shared-nothing architecture. DB2 UDB EEE uses multiple nodes connected by an interconnect and partitions data across these nodes. The communication protocol used between nodes of DB2 UDB EEE has historically been Transmission Control Protocol (TCP) / Internet Protocol (IP) but has now been extended to include the Virtual Interface (VI) Architecture. This paper discusses a new protocol termed Virtual Interface Protocol (VIP), built on top of the primitives provided by the VI Architecture. DB2 UDB EEE with VIP on a fast interconnect has shown significant improvement in reducing the elapsed time of queries when compared with TCP/IP over Fast Ethernet. This paper discusses the implementation and performance results on a Transaction Processing Performance Council Decision Support (TPC-D) database.
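To make the shared-nothing idea concrete, here is a minimal sketch of spreading rows across nodes so that each node owns a disjoint subset. The abstract only says that data is partitioned across nodes; hash partitioning on a key column, the CRC32 hash, and the helper names `node_for_key` and `partition_rows` are assumptions for illustration, not DB2's actual mechanism.

```python
import zlib


def node_for_key(key, num_nodes):
    """Map a partitioning key to a node index with a deterministic hash."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition_rows(rows, key_index, num_nodes):
    """Split rows into one list per node, based on the key column."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for_key(row[key_index], num_nodes)].append(row)
    return partitions
```

Rows with the same key always land on the same node, so an equality lookup on the key touches a single node, while a query that needs rows from several partitions must move data over the interconnect. That inter-node traffic is what the protocol described in the abstract carries.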

CiteSeerX

Business Analytics in (a) Blink

Author: Barber R.
Bendel P.
Czech M.
Draese O.
Ho F.
Hrle N.
Idreos S. (Stratos)
Kim M
Koeth O
Lee J. (Jae-Gil)
Lohman G.
Morfonios K.
Mueller R. (René)
Murthy K
Pandis I.
Qiao L.
Raman V. (Vijayshankar)
Sidle R.
Stolze K
Szabo S.
Tim Li T. (Tianchao)
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/03/2012

The Blink project’s ambitious goal is to answer all Business Intelligence (BI) queries in mere seconds, regardless of the database size, with an extremely low total cost of ownership. Blink is a new DBMS aimed primarily at read-mostly BI query processing that exploits scale-out of commodity multi-core processors and cheap DRAM to retain a (copy of a) data mart completely in main memory. Additionally, it exploits proprietary compression technology and cache-conscious algorithms that reduce memory bandwidth consumption and allow most SQL query processing to be performed on the compressed data. Blink always scans (portions of) the data mart in parallel on all nodes, without using any indexes or materialized views, and without any query optimizer to choose among them. The Blink technology has thus far been incorp

CWI's Institutional Repository