792 research outputs found

    An ontology enhanced parallel SVM for scalable spam filter training

    Get PDF
    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

    Improved K-means clustering on Hadoop

    Get PDF
    Clustering is the portioning method in which we grouped similar attribute items. Recently data grows rapidly so data analysis using clustering getting difficult. K-means is traditional clustering method. K-means is easy to implement and scalable but it suffers from local minima and sensitive to initial cluster centroids. Particle swarm optimization is mimic behavior based clustering algorithm based on particle’s velocity but it suffers from number of iterations. So we use PSO for finding initial cluster center and then use this centroids for K-means clustering which is running parallel on Hadoop. Hadoop is used for large database. We try to find global clusters in limited iterations

    Mining of Frequent Item with BSW Chunking

    Get PDF
    Apriori is an algorithm for finding the frequent patterns in transactional databases is considered as one of the most important data mining problems. Apriori algorithm is a masterpiece algorithm of association rule mining. This algorithm somehow has constraint and thus, giving the opportunity to do this research. Increased availability of the Multicore processors is forcing us to re-design algorithms and applications so as to accomplishment the computational power from multiple cores finding frequent item sets is more expensive in terms of computing resources utilization and CPU power. Thus superiority of parallel apriori algorithms effect on parallelizing the process of frequent item set find. The parallel frequent item sets mining algorithms gives the direction to solve the issue of distributing the candidates among processors. Efficient algorithm to discover frequent patterns is important in data mining research Lots of algorithms for mining association rules and their mutations are proposed on basis of Apriori algorithm, but traditional algorithms are not efficient. The objective of the Apriori Algorithm is to find associations between different sets of data. It is occasionally referred to as "Market Basket Analysis". Every several set of data has a number of items and is called a transaction. The achievement of Apriori is sets of rules that tell us how often items are contained in sets of data. In order to find more valuable rules, our basic aim is to implement apriori algorithm using multithreading approach which can utilization our system hardware power to improved algorithm is reasonable and effective, can extract more value information

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer.</p> <p>Results</p> <p>We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization.</p> <p>Conclusions</p> <p>Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.</p

    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Full text link
    Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets.Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 page

    Pando: Personal Volunteer Computing in Browsers

    Full text link
    The large penetration and continued growth in ownership of personal electronic devices represents a freely available and largely untapped source of computing power. To leverage those, we present Pando, a new volunteer computing tool based on a declarative concurrent programming model and implemented using JavaScript, WebRTC, and WebSockets. This tool enables a dynamically varying number of failure-prone personal devices contributed by volunteers to parallelize the application of a function on a stream of values, by using the devices' browsers. We show that Pando can provide throughput improvements compared to a single personal device, on a variety of compute-bound applications including animation rendering and image processing. We also show the flexibility of our approach by deploying Pando on personal devices connected over a local network, on Grid5000, a French-wide computing grid in a virtual private network, and seven PlanetLab nodes distributed in a wide area network over Europe.Comment: 14 pages, 12 figures, 2 table

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Get PDF
    Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as Bioinformatics, Cheminformatics, Social Networks, and Web etc. Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities. Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main memory-based methods were proposed, which proved to be inefficient as the data size grew exponentially over time. In the past few years several research groups have attempted to handle the frequent subgraph mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphic Processing Units (GPUs) which has multi-fold improvement over in-memory while dealing with large datasets. Later, Google\u27s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative style model could not help much for FSM domain since subgraph mining process is an iterative approach. In recent years, Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability. This is a right fit solution for iterative style of programming as well. In this work, we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results
    corecore