583 research outputs found

    MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!

    Full text link
    Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is "good enough", and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective I reflect on the current state and future of "big data" research

    Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

    Full text link
    © 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed that it can better parallelize the SNNQGAR to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces

    Multiple Relevant Feature Ensemble Selection Based on Multilayer Co-Evolutionary Consensus MapReduce

    Full text link
    IEEE Although feature selection for large data has been intensively investigated in data mining, machine learning, and pattern recognition, the challenges are not just to invent new algorithms to handle noisy and uncertain large data in applications, but rather to link the multiple relevant feature sources, structured, or unstructured, to develop an effective feature reduction method. In this paper, we propose a multiple relevant feature ensemble selection (MRFES) algorithm based on multilayer co-evolutionary consensus MapReduce (MCCM). We construct an effective MCCM model to handle feature ensemble selection of large-scale datasets with multiple relevant feature sources, and explore the unified consistency aggregation between the local solutions and global dominance solutions achieved by the co-evolutionary memeplexes, which participate in the cooperative feature ensemble selection process. This model attempts to reach a mutual decision agreement among co-evolutionary memeplexes, which calls for the need for mechanisms to detect some noncooperative co-evolutionary behaviors and achieve better Nash equilibrium resolutions. Extensive experimental comparative studies substantiate the effectiveness of MRFES to solve large-scale dataset problems with the complex noise and multiple relevant feature sources on some well-known benchmark datasets. The algorithm can greatly facilitate the selection of relevant feature subsets coming from the original feature space with better accuracy, efficiency, and interpretability. Moreover, we apply MRFES to human cerebral cortex-based classification prediction. Such successful applications are expected to significantly scale up classification prediction for large-scale and complex brain data in terms of efficiency and feasibility

    Load Forecasting Based Distribution System Network Reconfiguration-A Distributed Data-Driven Approach

    Full text link
    In this paper, a short-term load forecasting approach based network reconfiguration is proposed in a parallel manner. Specifically, a support vector regression (SVR) based short-term load forecasting approach is designed to provide an accurate load prediction and benefit the network reconfiguration. Because of the nonconvexity of the three-phase balanced optimal power flow, a second-order cone program (SOCP) based approach is used to relax the optimal power flow problem. Then, the alternating direction method of multipliers (ADMM) is used to compute the optimal power flow in distributed manner. Considering the limited number of the switches and the increasing computation capability, the proposed network reconfiguration is solved in a parallel way. The numerical results demonstrate the feasible and effectiveness of the proposed approach.Comment: 5 pages, preprint for Asilomar Conference on Signals, Systems, and Computers 201

    Deep Learning in the Automotive Industry: Applications and Tools

    Full text link
    Deep Learning refers to a set of machine learning techniques that utilize neural networks with many hidden layers for tasks, such as image classification, speech recognition, language understanding. Deep learning has been proven to be very effective in these domains and is pervasively used by many Internet services. In this paper, we describe different automotive uses cases for deep learning in particular in the domain of computer vision. We surveys the current state-of-the-art in libraries, tools and infrastructures (e.\,g.\ GPUs and clouds) for implementing, training and deploying deep neural networks. We particularly focus on convolutional neural networks and computer vision use cases, such as the visual inspection process in manufacturing plants and the analysis of social media data. To train neural networks, curated and labeled datasets are essential. In particular, both the availability and scope of such datasets is typically very limited. A main contribution of this paper is the creation of an automotive dataset, that allows us to learn and automatically recognize different vehicle properties. We describe an end-to-end deep learning application utilizing a mobile app for data collection and process support, and an Amazon-based cloud backend for storage and training. For training we evaluate the use of cloud and on-premises infrastructures (including multiple GPUs) in conjunction with different neural network architectures and frameworks. We assess both the training times as well as the accuracy of the classifier. Finally, we demonstrate the effectiveness of the trained classifier in a real world setting during manufacturing process.Comment: 10 page

    HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

    Full text link
    Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.Comment: 22 pages, 10 figure
    • …
    corecore