189 research outputs found

    Scalable mining for classification rules in relational databases

    Get PDF
    doi:10.1214/lnms/1196285404Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers.We also give some insights that may have an impact on the evolution of the extended relational calculus SQL

    A comparison of statistical machine learning methods in heartbeat detection and classification

    Get PDF
    In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital as some heartbeat irregularities are time consuming to detect. Therefore, analysis of electro-cardiogram (ECG) signals is an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval and amplitude based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms focused especially on a type of arrhythmia known as the ventricular ectopic fibrillation (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and choice of the classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contribution is the evaluation of existing classifiers over a range sampling rates, recommendation of a detection methodology to employ in a practical setting, and extend the notion of a mixture of experts to a larger class of algorithms

    Graph Processing in Main-Memory Column Stores

    Get PDF
    Evermore, novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To the worse, graph algorithms expose a tremendous variety in structure and functionality caused by their often domain-specific implementations and therefore can be hardly integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications exclusively run on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead for keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries. A basic ingredient of graph queries and algorithms are traversal operations and are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. The integration of graph traversals as an operator into a database management system requires a tight integration into the existing database environment and a development of new components, such as a graph topology-aware optimizer and accompanying graph statistics, graph-specific secondary index structures to speedup traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system as part of an RDBMS allowing to seamlessly combine processing of graph data with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine solely based on set operations and graph traversals. Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes

    Scalable aggregation predictive analytics: a query-driven machine learning approach

    Get PDF
    We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark showing its superior performance compared to the Spark’s COUNT method

    Efficient Algorithms to Compute Hierarchical Summaries from Big Data Streams

    Full text link
    Many data stream applications have hierarchical data; containing time, geographic locations, product information, clickstreams, server logs, IP addresses. A hierarchical summary of such volumous data offers multiple advantages including compactness, quick understanding, and abstraction. The goal of this thesis is to design algorithmic approaches for summarizing hierarchical data streams. First, this thesis provides a theoretical analysis of the benchmark hierarchical heavy hitters' algorithms and uncovers their shortcomings such as requiring high theoretical memory, updates and coverage problem. To address these shortcomings, this thesis proposes efficient algorithms which offer deterministic estimation accuracy using O(η/ε) worst-case memory and O(η) worst-case time complexity per item, where ε ∈ [0,1] is a user defined parameter and η is a small constant derived from the data. The proposed hierarchical heavy hitters' algorithms are shown to have improved significantly over existing algorithms both theoretically as well as empirically. Next, this thesis introduces a new concept called hierarchically correlated heavy hitters, which is different from existing hierarchical summarization techniques. The thesis provides a formal definition of the proposed concept and compares it with existing hierarchical summarization approaches both at definition level and empirically. It also proposes an efficient hierarchy-aware algorithm for computing hierarchically correlated heavy hitters. The proposed algorithm offers deterministic estimation accuracy using O(η / (ε_p * ε_s )) worst-case memory and O(η) worst-case time complexity per item, where η is as defined previously, and ε_p ∈ [0,1], ε_s ∈ [0,1] are other user defined parameters. Finally, the thesis proposes a special hierarchical data structure and algorithm to summarize spatiotemporal data. It can be used to extract interesting and useful patterns from high-speed spatiotemporal data streams at multiple spatial and temporal granularities. Theoretical and empirical analysis are provided, which show that the proposed data structure is very efficient concerning data storage and response to queries. It updates a single item in O(1) time and responds to a point query in O(1) time. Importantly, the memory requirement of the proposed data structure is independent of the size of the data and only depends on user-supplied parameters ψ ⃗ and φ ⃗. In summary, this thesis provides a general framework consisting of a set of algorithms and data structures to compute hierarchical summaries of the big data streams. All of the proposed algorithms exploit a lattice structure built from the hierarchical attributes of the data to compute different hierarchical summaries, which can be used to address various data analytic issues in many emerging applications

    Analytical study and computational modeling of statistical methods for data mining

    Get PDF
    Today, there is tremendous increase of the information available on electronic form. Day by day it is increasing massively. There are enough opportunities for research to retrieve knowledge from the data available in this information. Data mining and app
    corecore