779 research outputs found
Scalable mining for classification rules in relational databases
doi:10.1214/lnms/1196285404
Data mining is a process of discovering useful patterns (knowledge)
hidden in extremely large datasets. Classification is a fundamental data mining
function, and some other functions can be reduced to it. In this paper we
propose a novel classification algorithm (classifier) called MIND (MINing in
Databases). MIND can be phrased in such a way that its implementation is
very easy using the extended relational calculus SQL, and this in turn allows
the classifier to be built into a relational database system directly. MIND is
truly scalable with respect to I/O efficiency, which is important since scalability
is a key requirement for any data mining algorithm.
We have built a prototype of MIND in the relational database management
system DB2 and have benchmarked its performance. We describe the working
prototype and report the measured performance with respect to the previous
method of choice. MIND scales not only with the size of datasets but also
with the number of processors on an IBM SP2 computer system. Even on
uniprocessors, MIND scales well beyond dataset sizes previously published for
classifiers. We also give some insights that may have an impact on the evolution
of the extended relational calculus SQL
Design and implementation data flow analysis of jobs in IBM DataStage for Manta project
This work aims to design and implement a functional prototype of a module that performs syntactic and semantic analysis of jobs in IBM InfoSphere DataStage. The module is used to analyze data flows and to generate a graph that represents them. The design and implementation support seamless integration of the module into the Manta project. The work contains an in-depth analysis of the IBM InfoSphere DataStage tool, design documentation, the implemented module prototype, and tests that verify the module's functionality
AT-GIS: highly parallel spatial query processing with associative transducers
Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing rely either on extensions to RDBMSs, which entail expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute spatial query operators in parallel on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10× for aggregation queries
PerfXplain: Debugging MapReduce Job Performance
While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions. We formally define the notion of an explanation together with three
metrics, relevance, precision, and generality, that measure explanation
quality. We present the explanation-generation algorithm based on techniques
related to decision-tree building. We evaluate the approach on a log of past
executions on Amazon EC2, and show that our approach can generate quality
explanations, outperforming two naive explanation-generation methods. Comment: VLDB201
Land Cover/Land Use Mapping Using Soft Computing Techniques with Optimized Features
The chapter discusses soft computing techniques for solving complex computational tasks. It highlights several of these techniques, including fuzzy logic, genetic algorithms, artificial neural networks, and machine learning. Classifying remotely sensed images is a tedious task, so the chapter explains how these soft computing techniques can be used for image classification. Image classification depends mainly on the feature extraction process: features that are extracted efficiently improve classification accuracy. The chapter therefore explains the different kinds of features and the different methods for extracting them. The best extracted features are selected using a genetic algorithm. Various algorithms are presented and compared. Finally, the results are verified using a hypothetical case study
Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture
DB2 Universal Database Enterprise-Extended Edition (DB2 UDB EEE) is a parallel relational database management system using a shared-nothing architecture. DB2 UDB EEE uses multiple nodes connected by an interconnect and partitions data across these nodes. The communication protocol used between nodes of DB2 UDB EEE has historically been Transmission Control Protocol (TCP) / Internet Protocol (IP) but has now been extended to include the Virtual Interface (VI) Architecture. This paper discusses a new protocol termed Virtual Interface Protocol (VIP), built on top of the primitives provided by the VI Architecture. DB2 UDB EEE with VIP on a fast interconnect has shown a significant reduction in the elapsed time of queries when compared with TCP/IP over Fast Ethernet. This paper discusses the implementation and performance results on a Transaction Processing Performance Council decision support (TPC-D) database
Business Analytics in (a) Blink
The Blink project's ambitious goal is to answer all Business Intelligence (BI) queries in mere seconds,
regardless of the database size, with an extremely low total cost of ownership. Blink is a new DBMS
aimed primarily at read-mostly BI query processing that exploits scale-out of commodity multi-core
processors and cheap DRAM to retain a (copy of a) data mart completely in main memory. Additionally,
it exploits proprietary compression technology and cache-conscious algorithms that reduce memory
bandwidth consumption and allow most SQL query processing to be performed on the compressed data.
Blink always scans (portions of) the data mart in parallel on all nodes, without using any indexes or
materialized views, and without any query optimizer to choose among them. The Blink technology has
thus far been incorp
- …