1,048 research outputs found

    An ontology enhanced parallel SVM for scalable spam filter training

    Get PDF
    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

    Jeeva: Enterprise Grid-enabled Web Portal for Protein Secondary Structure Prediction

    Get PDF
    This paper presents a Grid portal for protein secondary structure prediction developed by using services of Aneka, a .NET-based enterprise Grid technology. The portal is used by research scientists to discover new prediction structures in a parallel manner. An SVM (Support Vector Machine)-based prediction algorithm is used with 64 sample protein sequences as a case study to demonstrate the potential of enterprise Grids.Comment: 7 page

    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Full text link
    Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets.Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 page

    COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

    Full text link
    COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more

    MRCRAIG: MapReduce and Ensemble Classifiers for Parallelizing Data Classification Problems

    Get PDF
    In this paper, a novel technique for parallelizing data-classification problems is applied to finding genes in sequences of DNA. The technique involves various ensem- ble classification methods such as Bagging and Select Best. It then distributes the classifier training and prediction using MapReduce. A novel sequence classification voting algorithm is evaluated in the Bagging method, as well as compared against the Select Best method
    corecore