Apache Mahout’s k-Means vs. fuzzy k-Means performance evaluation
(c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
The emergence of Big Data as a disruptive technology for the next generation of intelligent systems has raised many issues of how to extract and make use of the knowledge obtained from the data within short times, on a limited budget, and under high rates of data generation. The foremost challenge identified here is data processing, especially mining and analysis for knowledge extraction. As the 'old' data mining frameworks were designed without Big Data requirements, a new generation of such frameworks is being developed, fully implemented on Cloud platforms. One such framework is Apache Mahout, which aims to leverage fast processing and analysis of Big Data. The performance of such new data mining frameworks is yet to be evaluated, and potential limitations are yet to be revealed. In this paper we analyse the performance of Apache Mahout using large real data sets from the Twitter stream. We exemplify the analysis for the case of two clustering algorithms, namely k-Means and Fuzzy k-Means, using a Hadoop cluster infrastructure for the experimental study.
Peer Reviewed. Postprint (author's final draft)
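For readers unfamiliar with the two algorithms compared above, the following is a minimal NumPy sketch of how their assignment steps differ: k-Means gives each point a single hard label, while fuzzy k-Means gives each point a degree of membership in every cluster. This is not Mahout's implementation or API; the function names, the fuzziness parameter `m`, and the toy data are assumptions made for illustration, and the membership formula is the standard fuzzy c-means one.

```python
import numpy as np

def hard_assignments(points, centers):
    """k-Means style: each point gets the index of its nearest center."""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def fuzzy_memberships(points, centers, m=2.0, eps=1e-12):
    """Fuzzy k-Means style: each point gets a weight for every center.

    Uses the standard fuzzy c-means rule
    u_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1)), with m > 1.
    """
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, eps)  # avoid division by zero when a point sits on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Toy data (hypothetical, not from the paper): three 2-D points, two centers.
points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])

labels = hard_assignments(points, centers)
memberships = fuzzy_memberships(points, centers)
```

Each row of `memberships` sums to 1. The middle point, which k-Means assigns wholly to the first cluster, keeps a small weight in the second cluster under the fuzzy rule.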
PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development
This paper describes PlinyCompute, a system for development of
high-performance, data-intensive, distributed computing tools and libraries. In
the large, PlinyCompute presents the programmer with a very high-level,
declarative interface, relying on automatic, relational-database style
optimization to figure out how to stage distributed computations. However, in
the small, PlinyCompute presents the capable systems programmer with a
persistent object data model and API (the "PC object model") and associated
memory management system that has been designed from the ground-up for high
performance, distributed, data-intensive computing. This contrasts with most
other Big Data systems, which are constructed on top of the Java Virtual
Machine (JVM), and hence must at least partially cede performance-critical
concerns such as memory management (including layout and de/allocation) and
virtual method/function dispatch to the JVM. This hybrid approach---declarative
in the large, trusting the programmer's ability to utilize the PC object model
efficiently in the small---results in a system that is ideal for the
development of reusable, data-intensive tools and libraries. Through extensive
benchmarking, we show that implementing complex object manipulation and
non-trivial, library-style computations on top of PlinyCompute can result in a
speedup of 2x to more than 50x compared to equivalent implementations
on Spark.
Comment: 48 pages, including references and Appendi
CARMA Large Area Star Formation Survey: Project Overview with Analysis of Dense Gas Structure and Kinematics in Barnard 1
We present details of the CARMA Large Area Star Formation Survey (CLASSy),
while focusing on observations of Barnard 1. CLASSy is a CARMA Key Project that
spectrally imaged N2H+, HCO+, and HCN (J=1-0 transitions) across over 800
square arcminutes of the Perseus and Serpens Molecular Clouds. The observations
have angular resolution near 7" and spectral resolution near 0.16 km/s. We
imaged ~150 square arcminutes of Barnard 1, focusing on the main core, and the
B1 Ridge and clumps to its southwest. N2H+ shows the strongest emission, with
morphology similar to cool dust in the region, while HCO+ and HCN trace several
molecular outflows from a collection of protostars in the main core. We
identify a range of kinematic complexity, with N2H+ velocity dispersions
ranging from ~0.05 to 0.50 km/s across the field. Simultaneous continuum mapping
at 3 mm reveals six compact object detections, three of which are new
detections. A new non-binary dendrogram algorithm is used to analyze dense gas
structures in the N2H+ position-position-velocity (PPV) cube. The projected
sizes of dendrogram-identified structures range from about 0.01 to 0.34 pc.
Size-linewidth relations using those structures show that non-thermal
line-of-sight velocity dispersion varies weakly with projected size, while rms
variation in the centroid velocity rises steeply with projected size. Comparing
these relations, we propose that all dense gas structures in Barnard 1 have
comparable depths into the sky, around 0.1-0.2 pc; this suggests that
over-dense, parsec-scale regions within molecular clouds are better described
as flattened structures than as spherical collections of gas. Science-ready
PPV cubes for Barnard 1 molecular emission are available for download.
Comment: Accepted to The Astrophysical Journal (ApJ), 51 pages, 27 figures
(some with reduced resolution in this preprint); Project website is at
http://carma.astro.umd.edu/class
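The size-linewidth relations mentioned in this abstract are commonly modelled as a power law, sigma = a * R^b, which becomes a straight line in log-log space. The sketch below shows one way such a fit might be done with an ordinary least-squares line. It is not the authors' procedure, and the input values are synthetic numbers generated from an assumed power law rather than measurements from the paper; only the size range (about 0.01 to 0.34 pc) is borrowed from the text.

```python
import numpy as np

def fit_power_law(size_pc, dispersion_kms):
    """Fit dispersion = a * size**b by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log10(size_pc), np.log10(dispersion_kms), 1)
    return 10.0 ** intercept, slope  # (a, b)

# Synthetic, noise-free example data (hypothetical; NOT values from the survey).
sizes = np.array([0.01, 0.03, 0.1, 0.34])   # projected size in pc
synthetic = 0.2 * sizes ** 0.5               # assumed a = 0.2, b = 0.5
a, b = fit_power_law(sizes, synthetic)
```

A slope `b` near zero would correspond to a dispersion that "varies weakly" with size, while a larger `b` would correspond to a quantity that "rises steeply", in the sense the abstract describes.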