    Scalable mining for classification rules in relational databases

    doi:10.1214/lnms/1196285404
    Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL.

    Design and implementation data flow analysis of jobs in IBM DataStage for Manta project

    This work aims to design and implement a functional prototype of a module that performs syntactic and semantic analysis of jobs in IBM InfoSphere DataStage. The module analyzes data flows and generates a graph that represents them. The design and implementation allow the module to be connected seamlessly to the Manta project. The work contains an in-depth analysis of the IBM InfoSphere DataStage tool, design documentation, the implemented module prototype, and tests that verify the module's functionality.

    AT-GIS: highly parallel spatial query processing with associative transducers

    Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMS, which entails expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute, in parallel, spatial query operators on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10× for aggregation queries.

    PerfXplain: Debugging MapReduce Job Performance

    While users today have access to many tools that assist in performing large scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult. We present PerfXplain, a system that enables users to ask questions about the relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain provides a new query language for articulating performance queries and an algorithm for generating explanations from a log of past MapReduce job executions. We formally define the notion of an explanation together with three metrics (relevance, precision, and generality) that measure explanation quality. We present the explanation-generation algorithm based on techniques related to decision-tree building. We evaluate the approach on a log of past executions on Amazon EC2, and show that our approach can generate quality explanations, outperforming two naive explanation-generation methods.
    Comment: VLDB201

    Land Cover/Land Use Mapping Using Soft Computing Techniques with Optimized Features

    The chapter discusses soft computing techniques for solving complex computational tasks. It highlights several of them, including fuzzy logic, genetic algorithms, artificial neural networks, and machine learning. Classifying remotely sensed images is a tedious task, so we explain how these soft computing techniques can be used for image classification. Image classification depends mainly on the feature extraction process: features extracted efficiently improve classification accuracy. Hence, the different kinds of features and the different methods for extracting them are explained. The best extracted features are selected using a genetic algorithm. Various algorithms are shown and comparisons are made. Finally, the results are verified using a hypothetical case study.

    Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture

    DB2 Universal Database Enterprise-Extended Edition (DB2 UDB EEE) is a parallel relational database management system using a shared-nothing architecture. DB2 UDB EEE uses multiple nodes connected by an interconnect and partitions data across these nodes. The communication protocol used between nodes of DB2 UDB EEE has historically been Transmission Control Protocol (TCP) / Internet Protocol (IP) but has now been extended to include the Virtual Interface (VI) Architecture. This paper discusses a new protocol termed Virtual Interface Protocol (VIP), built on top of the primitives provided by the VI Architecture. DB2 UDB EEE with VIP on a fast interconnect has shown significant improvement in reducing the elapsed time of queries when compared with TCP/IP over Fast Ethernet. This paper discusses the implementation and performance results on a Transaction Processing Performance Council decision support (TPC-D) benchmark database.

    Business Analytics in (a) Blink

    The Blink project’s ambitious goal is to answer all Business Intelligence (BI) queries in mere seconds, regardless of the database size, with an extremely low total cost of ownership. Blink is a new DBMS aimed primarily at read-mostly BI query processing that exploits scale-out of commodity multi-core processors and cheap DRAM to retain a (copy of a) data mart completely in main memory. Additionally, it exploits proprietary compression technology and cache-conscious algorithms that reduce memory bandwidth consumption and allow most SQL query processing to be performed on the compressed data. Blink always scans (portions of) the data mart in parallel on all nodes, without using any indexes or materialized views, and without any query optimizer to choose among them. The Blink technology has thus far been incorp