    BigFCM: Fast, Precise and Scalable FCM on Hadoop

    Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining process implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology and networking require high-performance parallel and distributed computation to solve real-world problems. Although some clustering methods have already been adapted to run on big data platforms, their execution time increases sharply for large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering algorithm named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the map-reduce programming model, it exploits several mechanisms, including an efficient caching design, to achieve a reduction in execution time of several orders of magnitude. Extensive evaluation over multi-gigabyte datasets shows that BigFCM is scalable while preserving the quality of clustering.
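
To make "belonging to more than one cluster to some degree" concrete, the following is a minimal single-machine sketch of the textbook Fuzzy C-Means update rules. It is not the BigFCM implementation and uses no Hadoop or map-reduce code. The function names `memberships` and `fcm`, the fuzzifier `m=2.0`, the Euclidean distance, and the random choice of initial centers are all assumptions made for illustration.

```python
import math
import random


def memberships(points, centers, m=2.0):
    """Return one row per point; row[j] is the degree of membership in cluster j."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on one or more centers: share membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            # Standard FCM rule: u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)).
            rows.append([1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists])
    return rows


def fcm(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain Fuzzy C-Means over a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: initial centers are drawn at random from the data points.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(n_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(n_clusters):
            # Each center is the mean of all points, weighted by membership^m.
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Recompute memberships so they match the returned centers.
    return centers, memberships(points, centers, m)
```

Calling `fcm(points, 2)` on a list of tuples returns the cluster centers together with a membership row for each point. Each row sums to 1, which is the "soft" assignment that distinguishes FCM from hard-assignment methods such as K-Means.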

    Clustering big urban data sets

    Cities are producing and collecting massive amounts of data from various sources such as transportation networks, the energy sector, smart homes, tax records, surveys, LIDAR data, mobile phone sensors, etc. All of the aforementioned data, when connected via the Internet, fall under the Internet of Things (IoT) category. To use such a large volume of data for potential scientific computing benefits, it is important to store and analyze it using efficient computing resources and algorithms. However, this can be problematic due to many challenges. This article explores some of these challenges and tests the performance of two partitional algorithms for clustering big urban datasets: K-Means and Fuzzy C-Means (FCM). Clustering big urban data into a compact format that represents the information of the whole dataset can help researchers work with the reorganized data much more efficiently. Our experiments conclude that FCM outperformed K-Means on this type of dataset; however, the latter is lighter on hardware utilisation.

    Characterizing and Subsetting Big Data Workloads

    Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/. Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterization