6,434 research outputs found
Compiler and runtime support for shared memory parallelization of data mining algorithms
Abstract. Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets. In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using apriori association mining and k-means clustering.
The Signal Data Explorer: A high performance Grid based signal search tool for use in distributed diagnostic applications
We describe a high performance Grid based signal search tool for distributed diagnostic applications developed in conjunction with Rolls-Royce plc for civil aero engine condition monitoring applications. With the introduction of advanced monitoring technology into engineering systems, healthcare, etc., the associated diagnostic processes are increasingly required to handle and consider vast amounts of data. An exemplar of such a diagnosis process was developed during the DAME project, which built a proof of concept demonstrator to assist in the enhanced diagnosis and prognosis of aero-engine conditions. In particular it has shown the utility of an interactive viewing and high performance distributed search tool (the Signal Data Explorer) in the aero-engine diagnostic process. The viewing and search techniques are equally applicable to other domains. The Signal Data Explorer and search services have been demonstrated on the Worldwide Universities Network to search distributed databases of electrocardiograph data
MalStone: Towards A Benchmark for Analytics on Large Data Clouds
Developing data mining algorithms that are suitable for cloud computing
platforms is currently an active area of research, as is developing cloud
computing platforms appropriate for data mining. Currently, the most common
benchmark for cloud computing is the Terasort (and related) benchmarks.
Although the Terasort Benchmark is quite useful, it was not designed for data
mining per se. In this paper, we introduce a benchmark called MalStone that is
specifically designed to measure the performance of cloud computing middleware
that supports the type of data intensive computing common when building data
mining models. We also introduce MalGen, which is a utility for generating data
on clouds that can be used with MalStone
- …