
    Apache Mahout’s k-Means vs. fuzzy k-Means performance evaluation

    (c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
    The emergence of Big Data as a disruptive technology for the next generation of intelligent systems has raised many questions about how to extract and use the knowledge obtained from data in short times, on limited budgets, and under high rates of data generation. The foremost challenge identified here is data processing, especially mining and analysis for knowledge extraction. As the 'old' data mining frameworks were designed without Big Data requirements, a new generation of such frameworks is being developed and fully implemented on Cloud platforms. One such framework is Apache Mahout, which aims to leverage fast processing and analysis of Big Data. The performance of such new data mining frameworks is yet to be evaluated, and potential limitations are yet to be revealed. In this paper we analyse the performance of Apache Mahout using large real data sets from the Twitter stream. We exemplify the analysis for the case of two clustering algorithms, namely k-Means and Fuzzy k-Means, using a Hadoop cluster infrastructure for the experimental study.
    Peer Reviewed. Postprint (author's final draft)

    PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

    This paper describes PlinyCompute, a system for development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database style optimization to figure out how to stage distributed computations. However, in the small, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM), and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to use the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark.
    Comment: 48 pages, including references and Appendi

    CARMA Large Area Star Formation Survey: Project Overview with Analysis of Dense Gas Structure and Kinematics in Barnard 1

    We present details of the CARMA Large Area Star Formation Survey (CLASSy), while focusing on observations of Barnard 1. CLASSy is a CARMA Key Project that spectrally imaged N2H+, HCO+, and HCN (J=1-0 transitions) across over 800 square arcminutes of the Perseus and Serpens Molecular Clouds. The observations have angular resolution near 7" and spectral resolution near 0.16 km/s. We imaged ~150 square arcminutes of Barnard 1, focusing on the main core, and the B1 Ridge and clumps to its southwest. N2H+ shows the strongest emission, with morphology similar to cool dust in the region, while HCO+ and HCN trace several molecular outflows from a collection of protostars in the main core. We identify a range of kinematic complexity, with N2H+ velocity dispersions ranging from ~0.05-0.50 km/s across the field. Simultaneous continuum mapping at 3 mm reveals six compact object detections, three of which are new detections. A new non-binary dendrogram algorithm is used to analyze dense gas structures in the N2H+ position-position-velocity (PPV) cube. The projected sizes of dendrogram-identified structures range from about 0.01-0.34 pc. Size-linewidth relations using those structures show that non-thermal line-of-sight velocity dispersion varies weakly with projected size, while rms variation in the centroid velocity rises steeply with projected size. Comparing these relations, we propose that all dense gas structures in Barnard 1 have comparable depths into the sky, around 0.1-0.2 pc; this suggests that over-dense, parsec-scale regions within molecular clouds are better described as flattened structures rather than spherical collections of gas. Science-ready PPV cubes for Barnard 1 molecular emission are available for download.
    Comment: Accepted to The Astrophysical Journal (ApJ), 51 pages, 27 figures (some with reduced resolution in this preprint); Project website is at http://carma.astro.umd.edu/class