Platform Dependent Verification: On Engineering Verification Tools for 21st Century
The paper overviews recent developments in platform-dependent explicit-state
LTL model checking. Comment: In Proceedings PDMC 2011, arXiv:1111.006
Graph Processing in Main-Memory Column Stores
Increasingly, novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access.
Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms exhibit a tremendous variety in structure and functionality caused by their often domain-specific implementations and therefore can hardly be integrated into a database management system other than with custom coding. For the second approach, since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries.
Traversal operations are a basic ingredient of graph queries and algorithms and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer and accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.
In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system that is part of an RDBMS and allows graph processing to be combined seamlessly with relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE, we propose an execution engine based solely on set operations and graph traversals.
Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.
Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, security, and user management. This makes it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
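The combination of a columnar edge representation, a secondary index for neighborhood expansion, and a traversal operator described in this abstract can be illustrated with a minimal sketch. This is not GRAPHITE's implementation: the column names `edge_source` and `edge_target`, the dictionary-based index, and the function names are all assumptions made for illustration, and a plain breadth-first search stands in for the thesis's traversal operators.

```python
from collections import deque

# Hypothetical columnar edge table: one array per column, rows aligned by position.
edge_source = [0, 0, 1, 2, 3]
edge_target = [1, 2, 3, 3, 4]


def build_adjacency_index(sources, targets):
    """Build an illustrative secondary index: source vertex -> outgoing neighbors."""
    index = {}
    for src, dst in zip(sources, targets):
        index.setdefault(src, []).append(dst)
    return index


def bfs_levels(index, start):
    """Breadth-first traversal; returns the hop distance of each reachable vertex."""
    distance = {start: 0}
    frontier = deque([start])
    while frontier:
        vertex = frontier.popleft()
        # Neighborhood expansion is a single index lookup instead of a column scan.
        for neighbor in index.get(vertex, []):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                frontier.append(neighbor)
    return distance


adjacency = build_adjacency_index(edge_source, edge_target)
levels = bfs_levels(adjacency, 0)
print(levels)
```

Without the index, each expansion step would have to scan the whole `edge_source` column, which is the cost that a graph-specific secondary index is meant to avoid.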
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature reports results on much smaller graphs, and the ones for
the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS). Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
Efficient Race Detection with Futures
This paper addresses the problem of provably efficient and practically good
on-the-fly determinacy race detection in task parallel programs that use
futures. Prior work on determinacy race detection has mostly focused on either
task parallel programs that follow a series-parallel dependence structure or
ones with unrestricted use of futures that generate arbitrary dependences. In
this work, we consider a restricted use of futures and show that it can be race
detected more efficiently than general use of futures.
Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags
targets programs that use futures in a restricted fashion and runs in time
$O(T_1 \alpha(m,n))$, where $T_1$ is the sequential running time of the
program, $\alpha$ is the inverse Ackermann function, $m$ is the total number
of memory accesses, and $n$ is the dynamic count of places at which parallelism is
created. Since $\alpha$ is a very slowly growing function (upper bounded by $4$
for all practical purposes), it can be treated as a close-to-constant overhead.
MultiBags+ is an extension of MultiBags that targets programs with general use of
futures. It runs in time $O((T_1 + k^2)\alpha(m,n))$, where $T_1$, $\alpha$, $m$,
and $n$ are defined as before, and $k$ is the number of future operations in
the computation. We implemented both algorithms and empirically demonstrate
their efficiency.
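An inverse Ackermann factor in a running time is characteristic of a disjoint-set (union-find) structure with path compression and union by rank. The abstract does not say how MultiBags obtains this factor, so the following is only a generic sketch of that data structure, offered as a plausible source of the bound. The class and method names are illustrative and not taken from the paper.

```python
class DisjointSets:
    """Union-find with path compression and union by rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, item):
        """Create a singleton set for item if it is not already tracked."""
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0

    def find(self, item):
        """Return the representative of item's set, compressing the path."""
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        """Merge the sets containing a and b, attaching the shallower tree."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1
        return root_a


sets = DisjointSets()
for task in ["a", "b", "c", "d"]:
    sets.make_set(task)
sets.union("a", "b")
sets.union("c", "d")
print(sets.find("a") == sets.find("b"))
print(sets.find("a") == sets.find("c"))
```

With both heuristics, a sequence of operations costs amortized near-constant time per operation, which is why such a factor can be treated as close to constant in practice.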
Plant-Wide Diagnosis: Cause-and-Effect Analysis Using Process Connectivity and Directionality Information
Production plants used in modern process industry must produce products that meet stringent
environmental, quality and profitability constraints. In such integrated plants, non-linearity and
strong process dynamic interactions among process units complicate root-cause diagnosis of
plant-wide disturbances because disturbances may propagate to units at some distance away
from the primary source of the upset. Similarly, implemented advanced process control
strategies, backup and recovery systems, use of recycle streams and heat integration may
hamper detection and diagnostic efforts.
It is important to track down the root-cause of a plant-wide disturbance because once
corrective action is taken at the source, secondary propagated effects can be quickly eliminated
with minimum effort and reduced down time with the resultant positive impact on process
efficiency, productivity and profitability.
In order to diagnose the root-cause of disturbances that manifest plant-wide, it is crucial to
incorporate and utilize knowledge about the overall process topology or interrelated physical
structure of the plant, such as is contained in Piping and Instrumentation Diagrams (P&IDs).
Traditionally, process control engineers have intuitively referred to the physical structure of
the plant by visual inspection and manual tracing of fault propagation paths within the process
structures, such as the process drawings on printed P&IDs, in order to make logical
conclusions based on the results from data-driven analysis. This manual approach, however, is
prone to various sources of errors and can quickly become complicated in real processes.
The aim of this thesis, therefore, is to establish innovative techniques for the electronic
capture and manipulation of process schematic information from large plants such as
refineries in order to provide an automated means of diagnosing plant-wide performance
problems. This report also describes the design and implementation of a computer application
program that integrates: (i) process connectivity and directionality information from intelligent
P&IDs, (ii) results from data-driven cause-and-effect analysis of process measurements, and (iii)
process know-how to help process control engineers and plant operators gain process insight.
This work explored intelligent P&IDs, created with AVEVA® P&ID, a Computer
Aided Design (CAD) tool, and exported as an ISO 15926 compliant, platform- and
vendor-independent, text-based XML description of the plant. The XML output was processed by a
software tool developed in the Microsoft® .NET environment in this research project to
computationally generate a connectivity matrix that shows plant items and their connections.
The connectivity matrix can be exported to the Excel® spreadsheet application as a basis
for other applications and has served as a precursor to other research work. The final version of
the developed software tool links statistical results of cause-and-effect analysis of process data
with the connectivity matrix, using the connectivity information to simplify the cause-and-effect
analysis and gain insight into it. Process know-how and understanding are incorporated to
generate logical conclusions.
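The idea of a connectivity matrix that records which plant items feed which others, and of using it to trace where a disturbance can propagate, can be sketched as follows. This is not the thesis's .NET tool and does not parse ISO 15926 XML: the item names, the list of directed connections, and both function names are hypothetical, chosen only to illustrate the technique.

```python
def build_connectivity_matrix(items, connections):
    """Return a square 0/1 matrix; entry [i][j] is 1 if item i feeds item j."""
    position = {name: i for i, name in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for upstream, downstream in connections:
        matrix[position[upstream]][position[downstream]] = 1
    return matrix


def downstream_items(items, matrix, source):
    """List every item reachable from source by following flow direction."""
    start = items.index(source)
    seen = {start}
    stack = [start]
    reached = []
    while stack:
        row = stack.pop()
        for col, connected in enumerate(matrix[row]):
            if connected and col not in seen:
                seen.add(col)
                reached.append(items[col])
                stack.append(col)
    return reached


# Hypothetical plant items and directed connections (upstream, downstream).
items = ["feed_pump", "heater", "column", "condenser"]
connections = [
    ("feed_pump", "heater"),
    ("heater", "column"),
    ("column", "condenser"),
]

matrix = build_connectivity_matrix(items, connections)
affected = downstream_items(items, matrix, "heater")
print(affected)
```

A root-cause candidate proposed by data-driven analysis is physically plausible only if the units showing the disturbance appear among its downstream items, which is the kind of cross-check the abstract describes. Recycle streams would appear as cycles in the matrix; the `seen` set keeps the search from looping on them.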
The thesis presents a case study in an atmospheric crude heating unit as an illustrative example
to drive home key concepts and also describes an industrial case study involving refinery
operations. In the industrial case study, in addition to confirming the root-cause candidate, the
developed software tool was set the task to determine the physical sequence of fault
propagation path within the plant.
This was then compared with the hypothesis about the disturbance propagation sequence
generated by the purely data-driven method. The results show a high degree of overlap, which helps
to validate the statistical data-driven technique and to identify any spurious results from the
data-driven multivariable analysis. This significantly increases control engineers' confidence in
the data-driven method being used for root-cause diagnosis.
The thesis concludes with a discussion of the approach and presents ideas for further
development of the methods.
Coping with new Challenges in Clustering and Biomedical Imaging
Recent years have seen a tremendous increase in data acquisition in scientific fields such as molecular biology, bioinformatics, and biomedicine. Novel methods are therefore needed for the automatic processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity. In contrast to unsupervised learning such as clustering, classification is a supervised learning problem that aims to predict the group membership of data objects on the basis of rules learned from a training set in which the group membership is known.
Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with the problems of conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle, which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way, the search space is explored more effectively.
Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects that minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can be integrated directly into data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications.
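The skyline operator mentioned above can be made concrete with a small sketch. It uses the standard dominance definition for minimization: a point belongs to the skyline if no other point is at least as good in every attribute and strictly better in at least one. The function names and sample points are illustrative, and the sketch covers only the skyline itself, not the SkyDist measure, whose definition the abstract does not give.

```python
def dominates(p, q):
    """True if p is no worse than q in every attribute and better in at least one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points):
    """Return the points not dominated by any other point (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objects with two attributes to minimize, e.g. (price, distance).
points = [(1, 9), (3, 3), (4, 4), (9, 1), (5, 8)]
result = skyline(points)
print(result)
```

Here `(4, 4)` and `(5, 8)` are both dominated by `(3, 3)`, so they drop out regardless of how the two attributes are weighted. This quadratic comparison is the simplest formulation; database systems typically use more efficient algorithms.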
In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images that combines the data mining steps of feature selection, clustering, and classification. As a result, a set of highly selective features discriminating patients with Alzheimer's disease from healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. We therefore developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis. In another study, we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.