141 research outputs found

    Apache Mahout’s k-Means vs. fuzzy k-Means performance evaluation

    The emergence of Big Data as a disruptive technology for the next generation of intelligent systems has raised many issues about how to extract and make use of the knowledge obtained from data within short times, on a limited budget, and under high rates of data generation. The foremost challenge identified here is data processing, and especially mining and analysis for knowledge extraction. As the 'old' data mining frameworks were designed without Big Data requirements, a new generation of such frameworks, fully implemented on Cloud platforms, is being developed. One such framework is Apache Mahout, aimed at fast processing and analysis of Big Data. The performance of these new data mining frameworks is yet to be evaluated and their potential limitations are yet to be revealed. In this paper we analyse the performance of Apache Mahout using large real data sets from the Twitter stream. We exemplify the analysis for the case of two clustering algorithms, namely k-Means and Fuzzy k-Means, using a Hadoop cluster infrastructure for the experimental study.
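    The performance gap the paper measures traces back to the assignment step of each algorithm: k-Means commits every point to a single nearest centroid, while Fuzzy k-Means computes a membership degree of every point in every cluster on each iteration. The self-contained Java sketch below contrasts the two update rules; it is illustrative only, not Mahout's implementation, and the points and centroids are made up.

```java
import java.util.Arrays;

/**
 * Toy, single-machine illustration of the difference being benchmarked:
 * k-Means assigns each point to exactly one cluster, while Fuzzy k-Means
 * computes a graded membership in every cluster, which is why its
 * iterations cost more. Not Mahout's code; data here is hypothetical.
 */
public class HardVsFuzzyAssignment {
  // Hard assignment: index of the nearest centroid.
  static int hardAssign(double[] point, double[][] centroids) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++) {
      if (dist(point, centroids[c]) < dist(point, centroids[best])) best = c;
    }
    return best;
  }

  // Fuzzy membership u_c = 1 / sum_k (d_c / d_k)^(2/(m-1)), fuzziness m > 1.
  // Assumes the point coincides with no centroid (all distances nonzero).
  static double[] fuzzyAssign(double[] point, double[][] centroids, double m) {
    double[] u = new double[centroids.length];
    for (int c = 0; c < centroids.length; c++) {
      double sum = 0;
      for (int k = 0; k < centroids.length; k++) {
        sum += Math.pow(dist(point, centroids[c]) / dist(point, centroids[k]),
                        2.0 / (m - 1));
      }
      u[c] = 1.0 / sum;
    }
    return u;
  }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[][] centroids = {{0, 0}, {10, 0}};
    double[] point = {3, 0};
    System.out.println("hard:  cluster " + hardAssign(point, centroids));
    System.out.println("fuzzy: " + Arrays.toString(fuzzyAssign(point, centroids, 2.0)));
  }
}
```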

    Classification of Clinical Tweets Using Apache Mahout

    Thesis (M.S.), School of Computing and Engineering, University of Missouri--Kansas City, 2015. Thesis advisor: Praveen R. Rao. Includes bibliographic references (pages 54-58). There is an increasing amount of healthcare-related data available on Twitter. Due to Twitter's popularity, a large number of clinical tweets are posted on this microblogging platform every day. One interesting problem we face today is the classification of clinical tweets so that the classified tweets can be readily consumed by new healthcare applications. While there are several tools available to classify small datasets, the size of Twitter data demands new tools and techniques for fast and accurate classification. Motivated by these reasons, we propose a new tool called Clinical Tweets Classifier (CTC) to enable scalable classification of clinical content on Twitter. CTC uses Apache Mahout, and in addition to keywords and hashtags in the tweets, it leverages the SNOMED CT clinical terminology and a new tweet influence scoring scheme to construct high-accuracy models for classification. CTC uses the Naïve Bayes algorithm. We trained four models based on different feature sets such as hashtags, keywords, and clinical terms from SNOMED CT. We selected the training and test datasets based on the influence score of the tweets and validated the accuracy of these models using a large number of tweets. Our results show that using SNOMED CT terms and a training dataset with more influential tweets yields the most accurate model for classification. We also tested the scalability of CTC using 100 million tweets on a small cluster. Contents: Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work.
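    For intuition, the classification step reduces to multinomial Naïve Bayes over tweet tokens. Below is a minimal single-machine sketch with made-up labels and tweets; CTC's actual pipeline runs on Apache Mahout and adds SNOMED CT terms and influence-based tweet selection, neither of which is reproduced here.

```java
import java.util.*;

/**
 * Minimal multinomial Naive Bayes over tweet tokens, sketching the kind of
 * model the thesis trains. Laplace smoothing; hypothetical training data.
 */
public class TweetNaiveBayes {
  final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>(); // label -> token -> count
  final Map<String, Integer> labelTokens = new HashMap<>();              // label -> total tokens
  final Map<String, Integer> labelDocs = new HashMap<>();                // label -> #tweets
  final Set<String> vocab = new HashSet<>();
  int totalDocs = 0;

  void train(String label, String tweet) {
    totalDocs++;
    labelDocs.merge(label, 1, Integer::sum);
    for (String tok : tweet.toLowerCase().split("\\s+")) {
      vocab.add(tok);
      tokenCounts.computeIfAbsent(label, l -> new HashMap<>()).merge(tok, 1, Integer::sum);
      labelTokens.merge(label, 1, Integer::sum);
    }
  }

  String classify(String tweet) {
    String best = null;
    double bestLp = Double.NEGATIVE_INFINITY;
    for (String label : labelDocs.keySet()) {
      double lp = Math.log(labelDocs.get(label) / (double) totalDocs); // log prior
      for (String tok : tweet.toLowerCase().split("\\s+")) {
        int c = tokenCounts.get(label).getOrDefault(tok, 0);
        lp += Math.log((c + 1.0) / (labelTokens.get(label) + vocab.size())); // smoothed likelihood
      }
      if (lp > bestLp) { bestLp = lp; best = label; }
    }
    return best;
  }

  public static void main(String[] args) {
    TweetNaiveBayes nb = new TweetNaiveBayes();
    nb.train("clinical", "new study on diabetes treatment #health");
    nb.train("clinical", "flu vaccine now available at the clinic");
    nb.train("other", "great game last night #sports");
    System.out.println(nb.classify("diabetes clinic opens #health")); // expect "clinical"
  }
}
```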

    Performance Evaluation of an Independent Time Optimized Infrastructure for Big Data Analytics that Maintains Symmetry

    Traditional data analytics tools are designed to deal with asymmetrical types of data, i.e., structured, semi-structured, and unstructured. The diverse behavior of data produced by different sources requires the selection of suitable tools. The restriction of resources when dealing with a huge volume of data is a challenge for these tools and affects their execution time. In the present paper, we therefore propose a time-optimization model that shares a common HDFS (Hadoop Distributed File System) among three Name-nodes (Master nodes), three Data-nodes, and one Client-node. These nodes work under a DeMilitarized Zone (DMZ) to maintain symmetry. Machine learning jobs are explored from an independent platform to realize this model. On the first node (Name-node 1), Mahout is installed with all machine learning libraries through the Maven repositories. On the second node (Name-node 2), R connected to Hadoop runs through Shiny Server. Splunk is configured on the third node (Name-node 3) and is used to analyze the logs. Experiments are performed on the proposed and legacy models to evaluate response time, execution time, and throughput. K-means clustering, Naïve Bayes, and recommender algorithms are run on three different data sets, i.e., movie-rating, newsgroup, and spam-SMS data sets, representing structured, semi-structured, and unstructured data, respectively. The selection of tools defines data independence, e.g., the newsgroup data set runs on Mahout as the other tools are not compatible with this data. The outcome of the experiments supports the hypothesis that the proposed model overcomes the resource limitations of the legacy model. In addition, the proposed model can process any kind of algorithm on different data sets, which reside in their native formats.
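    The three quantities compared (response time, execution time, and throughput) can be instrumented around any job as in the sketch below; the workload and record count are placeholders, not the paper's measurement code.

```java
/**
 * Sketch of the three metrics the paper compares, measured around an
 * arbitrary analytics job. The job body is a stand-in for a Mahout, R,
 * or Splunk task; hypothetical, not the paper's harness.
 */
public class JobMetrics {
  public static void main(String[] args) {
    long submitted = System.nanoTime();
    long firstOutput = System.nanoTime(); // first result visible: response time
    long records = runJob();              // job completes: execution time
    long finished = System.nanoTime();

    double responseMs  = (firstOutput - submitted) / 1e6;
    double executionMs = (finished - submitted) / 1e6;
    double throughput  = records / ((finished - submitted) / 1e9); // records per second
    System.out.printf("response=%.2f ms, execution=%.2f ms, throughput=%.0f rec/s%n",
        responseMs, executionMs, throughput);
  }

  // Placeholder workload standing in for a distributed analytics job.
  static long runJob() {
    long n = 0;
    for (int i = 0; i < 1_000_000; i++) n++;
    return n;
  }
}
```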

    Performance evaluation of Apache Mahout for mining large datasets

    The main purpose of this project is to evaluate the performance of the Apache Mahout library, which provides data mining algorithms for data processing, using a Twitter dataset. Performance is evaluated in terms of processing time, memory usage, I/O performance, and algorithmic accuracy.
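    Two of these metrics, processing time and memory usage, can be approximated in-process as in the sketch below; JVM heap readings are rough, and the workload is a placeholder rather than a Mahout job.

```java
/**
 * Rough in-process probe for elapsed time and heap growth around a
 * workload. The array-filling loop is a hypothetical stand-in for a
 * Mahout algorithm run.
 */
public class PerfProbe {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    rt.gc(); // encourage a collection so the baseline reading is meaningful
    long memBefore = rt.totalMemory() - rt.freeMemory();
    long t0 = System.nanoTime();

    double[] data = new double[5_000_000]; // placeholder workload
    for (int i = 0; i < data.length; i++) data[i] = Math.sqrt(i);

    long t1 = System.nanoTime();
    long memAfter = rt.totalMemory() - rt.freeMemory();
    System.out.println("last value: " + data[data.length - 1]); // keep the array live
    System.out.printf("time=%.1f ms, approx heap growth=%.1f MiB%n",
        (t1 - t0) / 1e6, (memAfter - memBefore) / (1024.0 * 1024));
  }
}
```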

    Personalized large-scale classification of public tenders on Hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders based on state-of-the-art Hadoop big data technology. The objective was to identify with high recall the public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) feature-selection algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can readily take advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
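    BNS scores a term by the gap between the normal-quantile-transformed true-positive and false-positive rates, BNS = |F^-1(tpr) - F^-1(fpr)| (Forman, 2003). The sketch below uses Apache Commons Math for the inverse normal CDF; the counts are invented, and the clipping constants follow common practice rather than anything stated in the source.

```java
import org.apache.commons.math3.distribution.NormalDistribution;

/**
 * Bi-Normal Separation (BNS) feature score:
 *   BNS = |F^-1(tpr) - F^-1(fpr)|
 * where F^-1 is the inverse standard normal CDF. Rates are clipped away
 * from 0 and 1 so the quantile stays finite. Example counts are made up.
 */
public class BnsScore {
  private static final NormalDistribution N = new NormalDistribution(); // mean 0, sd 1

  static double bns(int tp, int fn, int fp, int tn) {
    double tpr = clip(tp / (double) (tp + fn));
    double fpr = clip(fp / (double) (fp + tn));
    return Math.abs(N.inverseCumulativeProbability(tpr)
                  - N.inverseCumulativeProbability(fpr));
  }

  static double clip(double r) {
    return Math.min(1 - 0.0005, Math.max(0.0005, r));
  }

  public static void main(String[] args) {
    // A term in 80 of 100 relevant tenders but only 5 of 900 others gets a
    // high BNS score, so it would be kept as a classification feature.
    System.out.printf("BNS = %.3f%n", bns(80, 20, 5, 895));
  }
}
```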

    Data Mining Applications in Big Data

    Data mining is a process of extracting hidden, unknown, but potentially useful information from massive data. Big Data has a great impact on scientific discoveries and value creation. This paper introduces methods in data mining and technologies in Big Data. Challenges of data mining, and of data mining with Big Data, are discussed, and some technological progress in both areas is also presented.

    A Scalable Machine Learning Online Service for Big Data Real-Time Analysis

    Proceedings of: IEEE Symposium Series on Computational Intelligence (SSCI 2014), Orlando, FL, USA, December 9-12, 2014. This work describes a proposal for developing and testing a scalable machine learning architecture able to provide real-time predictions or analytics as a service over domain-independent big data, working on top of the Hadoop ecosystem and exposing real-time analytics through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding, and reacting to their data, either in batch or stream fashion; they could become a valuable asset for improving business performance and a key market differentiator in this fast-paced environment. In order to validate the proposed architecture, two systems were developed, each providing classical machine-learning services in a different domain: the first involves a recommender system for web advertising, while the second consists of a prediction system that learns from gamers' behavior and tries to predict future events such as purchases or churning. An evaluation was carried out on these systems, and the results show that both services provide fast responses even under a number of concurrent requests; in the particular case of the second system, the results clearly show that the computed predictions significantly outperform random guessing. This research work is part of the Memento Data Analysis project, co-funded by the Spanish Ministry of Industry, Energy and Tourism under identifier TSI-020601-2012-99.
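    As a shape for such a service, a minimal prediction endpoint can be sketched with the JDK's built-in HTTP server; the route, port, and hard-coded response below are placeholders, not the paper's system.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/**
 * Minimal predictions-as-a-service endpoint in the spirit of the proposed
 * architecture. The "model" is a hard-coded stub; a real deployment would
 * score requests against a model trained on the Hadoop side.
 */
public class PredictionService {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/predict", exchange -> {
      // Stub scoring: return a fixed churn probability as JSON.
      byte[] body = "{\"churnProbability\": 0.17}".getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().set("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
    });
    server.setExecutor(null); // default executor; a thread pool would serve concurrent load
    server.start();
    System.out.println("Listening on http://localhost:8080/predict");
  }
}
```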

    Exploring the meaning behind Twitter hashtags through clustering

    Social networks generate large amounts of data produced by users, who are not limited with respect to the content of the information they exchange. The generated data can be a good indicator of trends and topic preferences among users. In our paper we focus on analyzing and representing hashtags by the corpus in which they appear. We cluster a large set of hashtags using K-means on MapReduce in order to process the data in a distributed manner. Our intention is to retrieve connections that might exist between different hashtags and their textual representation, and to grasp their semantics through the main topics they occur with.
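    The representation step can be sketched as follows: each hashtag is described by the words of the tweets it appears in, so hashtags used in similar contexts get similar vectors. The tweets below are invented, and cosine similarity stands in for the distance K-means would apply over such vectors.

```java
import java.util.*;

/**
 * Sketch of representing a hashtag by the term frequencies of its
 * surrounding corpus. Hashtags with similar word distributions score high
 * under cosine similarity; the paper clusters such vectors with K-means
 * on MapReduce. Hypothetical tweets.
 */
public class HashtagVectors {
  static Map<String, Integer> vectorFor(String tag, List<String> tweets) {
    Map<String, Integer> tf = new HashMap<>();
    for (String tweet : tweets) {
      if (!tweet.contains(tag)) continue;
      for (String tok : tweet.toLowerCase().split("\\s+"))
        if (!tok.startsWith("#")) tf.merge(tok, 1, Integer::sum); // keep plain words only
    }
    return tf;
  }

  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, na = 0, nb = 0;
    for (Map.Entry<String, Integer> e : a.entrySet())
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
    for (int v : a.values()) na += v * v;
    for (int v : b.values()) nb += v * v;
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    List<String> tweets = Arrays.asList(
        "great match tonight #football",
        "what a goal #football #soccer",
        "amazing goal and match #soccer",
        "new phone released today #tech");
    Map<String, Integer> football = vectorFor("#football", tweets);
    Map<String, Integer> soccer = vectorFor("#soccer", tweets);
    Map<String, Integer> tech = vectorFor("#tech", tweets);
    System.out.printf("sim(#football,#soccer)=%.2f, sim(#football,#tech)=%.2f%n",
        cosine(football, soccer), cosine(football, tech));
  }
}
```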