150,831 research outputs found

    An Analysis of Clustering Algorithms for Big Data

    Clustering is an important data mining tool for analyzing big data. Applying clustering techniques to big data is difficult because of the new challenges that massive data sets raise. Because big data involves terabytes and petabytes of information and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data and obtain results in a reasonable time. This study reviews the trend and progress of clustering algorithms in coping with big data challenges, from the first proposed algorithms to modern novel solutions. The algorithms and the challenges targeted in producing improved clustering algorithms are introduced and analyzed, and a possible future path toward more advanced algorithms is then discussed on the basis of computational complexity. In this paper we discuss clustering algorithms and big data applications in real-world settings

    Interaction of descriptive and predictive analytics with product networks: The case of Sam's club

    Because massive amounts of data are available all around the world, big data analytics has become an extremely important phenomenon in many disciplines. As the data grow, so does the need for businesses to make more reliable and accurate data-driven management decisions and to create value with big data applications. This is why big data analytics has become a primary technology priority today. In this thesis, we first used a two-stage clustering algorithm in a customer segmentation setting. After the clustering stage, the customer lifetime value (CLV) of each cluster was calculated from the purchasing behavior of its customers in order to reveal managerial insights and develop marketing strategies for each segment. In the second stage, we used the HITS algorithm in product network analysis to gain insights from the generated patterns, with the aim of discovering cross-selling effects and identifying recurring purchasing patterns and trigger products within the networks. This matters to practitioners in real-life applications because it highlights the relatively important transactions by ranking them with their corresponding item sets. From a practical point of view, we foresee that our proposed methodology is adaptable and applicable to other similar businesses throughout the world, providing a road map for the potential application
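    The abstract above names the HITS algorithm but gives no details of how the product network is built. As a minimal sketch of standard HITS only, assuming a directed graph given as a list of (source, target) pairs, the hub and authority scores can be computed by repeated updates with normalization. The function name `hits`, the fixed iteration count, and the product names are hypothetical illustrations, not the thesis's setup or data.

```python
def hits(edges, iterations=50):
    """Standard HITS by repeated updates; returns (hub, authority) score dicts."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of n: sum of hub scores of the nodes that point to n.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of n: sum of authority scores of the nodes that n points to.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Hypothetical toy network: an edge (a, b) is assumed to mean "a leads to b".
toy_edges = [("bread", "butter"), ("milk", "butter"), ("bread", "jam")]
hub_scores, auth_scores = hits(toy_edges)
top_authority = max(auth_scores, key=auth_scores.get)
top_hub = max(hub_scores, key=hub_scores.get)
```

    In this toy network the node with the most incoming links from strong hubs receives the highest authority score, which is the kind of ranking the abstract refers to.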

    Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities

    New technologies such as sensor networks have been incorporated into the management of buildings for organizations and cities. Sensor networks have led to an exponential increase in the volume of data available in recent years, which can be used to extract consumption patterns for the purposes of energy and monetary savings. For this reason, new approaches and strategies are needed to analyze information in big data environments. This paper proposes a methodology to extract electric energy consumption patterns from big data time series, so that valuable conclusions can be drawn for managers and governments. The methodology is based on the study of four clustering validity indices in their parallelized versions along with the application of a clustering technique. In particular, this work uses a voting system to choose an optimal number of clusters from the results of the indices, and applies the distributed version of the k-means algorithm included in Apache Spark’s Machine Learning Library. The results, using electricity consumption for the years 2011–2017 for eight buildings of a public university, are presented and discussed. In addition, the performance of the proposed methodology is evaluated using synthetic big data, which can represent thousands of buildings in a smart city. Finally, policies derived from the patterns discovered are proposed to optimize energy usage across the university campus. Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-R; Junta de Andalucía P12-TIC-172
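    The voting step described above can be sketched as a simple majority vote over the number of clusters each validity index recommends. This is only an illustration under stated assumptions: the abstract does not name the four indices or say how ties are resolved, so the index names, the recommended values, and the smallest-k tie-break below are all hypothetical.

```python
from collections import Counter


def vote_on_k(index_choices):
    """Return the cluster count recommended by the most indices.

    index_choices maps an index name to the k that index prefers.
    Ties go to the smallest k (an assumption, not taken from the paper).
    """
    counts = Counter(index_choices.values())
    most_votes = max(counts.values())
    return min(k for k, votes in counts.items() if votes == most_votes)


# Hypothetical recommendations from four validity indices.
recommendations = {"index_a": 4, "index_b": 5, "index_c": 4, "index_d": 6}
chosen_k = vote_on_k(recommendations)
```

    The chosen value would then be passed as the number of clusters to whichever k-means implementation is in use.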

    Advances in Big Data Analytics: Algorithmic Stability and Data Cleansing

    Analysis of what has come to be called “big data” presents a number of challenges as data continues to grow in size, complexity and heterogeneity. To help address these challenges, we study a pair of foundational issues in algorithmic stability (robustness and tuning), with application to clustering in high-throughput computational biology, and an issue in data cleansing (outlier detection), with application to pre-processing in streaming meteorological measurement. These issues highlight major ongoing research aspects of modern big data analytics. First, a new metric, robustness, is proposed in the setting of biological data clustering to measure an algorithm’s tendency to maintain output coherence over a range of parameter settings. It is well known that different algorithms tend to produce different clusters, and that the choice of algorithm is often driven by factors such as data size and type, similarity measure(s) employed, and the sort of clusters desired. Even within the context of a single algorithm, clusters often vary drastically depending on parameter settings. Empirical comparisons performed over a variety of algorithms and settings show highly differential performance on transcriptomic data and demonstrate that many popular methods actually perform poorly. Second, tuning strategies are studied for maximizing biological fidelity when using the well-known paraclique algorithm. Three initialization strategies are compared, using ontological enrichment as a proxy for cluster quality. Although extant paraclique codes begin by simply employing the first maximum clique found, results indicate that by generating all maximum cliques and then choosing one of highest average edge weight, one can produce a small but statistically significant expected improvement in overall cluster quality. Third, a novel outlier detection method is described that helps cleanse data by combining Pearson correlation coefficients, K-means clustering, and Singular Spectrum Analysis in a coherent framework that detects instrument failures and extreme weather events in Atmospheric Radiation Measurement sensor data. The framework is tested and found to produce more accurate results than do traditional approaches that rely on a hand-annotated database
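    The initialization strategy mentioned above, choosing a maximum clique of highest average edge weight, can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the maximum cliques have already been enumerated, that each clique has at least two vertices, and that edge weights are stored in a dictionary keyed by sorted vertex pairs. The function names, vertex labels, and weights are hypothetical.

```python
from itertools import combinations


def average_edge_weight(clique, weights):
    """Mean weight over all vertex pairs in a clique (needs >= 2 vertices)."""
    pairs = list(combinations(sorted(clique), 2))
    return sum(weights[pair] for pair in pairs) / len(pairs)


def pick_seed_clique(cliques, weights):
    """Return the clique whose edges have the highest average weight."""
    return max(cliques, key=lambda clique: average_edge_weight(clique, weights))


# Hypothetical graph with two maximum cliques of size three.
edge_weights = {
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "c"): 0.70,
    ("b", "d"): 0.95, ("c", "d"): 0.90,
}
maximum_cliques = [("a", "b", "c"), ("b", "c", "d")]
seed = pick_seed_clique(maximum_cliques, edge_weights)
```

    The baseline the abstract compares against would instead take the first clique in the list, regardless of its edge weights.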

    A Review of Text Corpus-Based Tourism Big Data Mining

    With the massive growth of the Internet, text data has become one of the main formats of tourism big data. Because text is an effective means of expressing tourists’ opinions, mining such data has great potential to inspire innovations for tourism practitioners. In the past decade, a variety of text mining techniques have been proposed and applied to tourism analysis to develop tourism value analysis models, build tourism recommendation systems, create tourist profiles, and make policies for supervising tourism markets. The successes of these techniques have been further boosted by the progress of natural language processing (NLP), machine learning, and deep learning. Given the complexity that arises from this diverse set of techniques and tourism text data sources, this work attempts to provide a detailed and up-to-date review of text mining techniques that have been, or have the potential to be, applied to modern tourism big data analysis. We summarize and discuss different text representation strategies, text-based NLP techniques for topic extraction, text classification, sentiment analysis, and text clustering in the context of tourism text mining, and their applications in tourist profiling, destination image analysis, market demand, etc. Our work also provides guidelines for constructing new tourism big data applications and outlines promising research areas in this field for the coming years
