
    Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey

    The next-generation astronomy digital archives will cover most of the universe at fine resolution in many wavelengths, from X-rays to ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), will create a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes, defining a space of 100+ dimensions. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by a multidimensional spatial index and other indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes speed up frequent searches. Splitting the data among multiple servers enables parallel, scalable I/O and applies parallel processing to the data. Hashing techniques allow efficient clustering and pair-wise comparison algorithms that parallelize nicely. Randomly sampled subsets allow debugging otherwise large queries at the desktop. Central servers will operate a data pump that supports sweeping searches that touch most of the data. The anticipated queries require special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges. Comment: 9 pages, original at research.microsoft.com/~gray/papers/MS_TR_99_30_Sloan_Digital_Sky_Survey.do
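
    The angular-distance operator mentioned in the abstract above can be illustrated with a minimal sketch. It assumes positions are given as right ascension and declination in degrees and uses the haversine form of the great-circle formula; the function names and the brute-force neighbor scan are hypothetical, chosen for illustration, and are not taken from the SDSS archive design (which relies on a spatial index instead of a linear scan).

```python
import math


def angular_separation_deg(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle angle in degrees between two sky positions (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    h = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to [0, 1] to guard against floating-point drift before asin.
    h = min(1.0, max(0.0, h))
    return math.degrees(2 * math.asin(math.sqrt(h)))


def neighbors_within(objects, ra_deg, dec_deg, radius_deg):
    """Linear scan over (name, ra_deg, dec_deg) tuples; hypothetical helper."""
    return [
        obj for obj in objects
        if angular_separation_deg(obj[1], obj[2], ra_deg, dec_deg) <= radius_deg
    ]


if __name__ == "__main__":
    catalog = [("a", 10.0, 20.0), ("b", 10.01, 20.0), ("c", 50.0, -5.0)]
    print(neighbors_within(catalog, 10.0, 20.0, 0.05))
```

    A linear scan like this touches every object, which is why an archive of this size would pair such an operator with a multidimensional spatial index.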

    Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

    During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we are able to achieve more than two orders of magnitude improvement in performance (110x) on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. This paper discusses the optimization techniques for the various operators in DLRM and which components of the systems are stressed by these different operators. The presented techniques are applicable to a broader set of DL workloads that pose the same scaling challenges/characteristics as DLRM.

    On Big Data Benchmarking

    Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most of the state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that considering the complexity, diversity, and rapid evolution of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads. Given this motivation, in this paper, we first propose the key requirements and challenges in developing big data benchmarks from the perspectives of generating data with 4V properties (i.e. volume, velocity, variety and veracity) of big data, as well as generating tests with comprehensive workloads for big data systems. We then present the methodology on big data benchmarking designed to address these challenges. Next, the state-of-the-art benchmarks are summarized and compared, followed by our vision for future research directions. Comment: 7 pages, 4 figures, 2 tables, accepted in BPOE-04 (http://prof.ict.ac.cn/bpoe_4_asplos/

    Role of Apache Software Foundation in Big Data Projects

    With the increase in the amount of Big Data being generated each year, the tools and technologies developed and used for storing, processing and analyzing Big Data have also improved. Open-source software has been an important factor in the success and innovation in the field of Big Data, and the Apache Software Foundation (ASF) has played a crucial role in this success and innovation by providing a number of state-of-the-art projects, free and open to the public. ASF has classified its projects into different categories. In this report, projects listed under the Big Data category are analyzed and discussed in depth with reference to one of the seven sub-categories defined. Our investigation has shown that many of the Apache Big Data projects are autonomous, but some are built on other Apache projects and some work in conjunction with other projects to improve and ease development in the Big Data space.

    Snap ML: A Hierarchical Framework for Machine Learning

    We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing systems. We prove theoretically that such a hierarchical system can accelerate training in distributed environments where intra-node communication is cheaper than inter-node communication. Additionally, we provide a review of the implementation of Snap ML in terms of GPU acceleration, pipelining, communication patterns and software architecture, highlighting aspects that were critical for achieving high performance. We evaluate the performance of Snap ML in both single-node and multi-node environments, quantifying the benefit of the hierarchical scheme and the data streaming functionality, and comparing with other widely-used machine learning software frameworks. Finally, we present a logistic regression benchmark on the Criteo Terabyte Click Logs dataset and show that Snap ML achieves the same test loss an order of magnitude faster than any of the previously reported results, including those obtained using TensorFlow and scikit-learn. Comment: in Proceedings of the Thirty-Second Conference on Neural Information Processing Systems (NeurIPS 2018

    FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

    Graph analysis performs many random reads and writes; thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss. We do so by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism. Our semi-external memory graph engine, called FlashGraph, stores vertex state in memory and edge lists on SSDs. It hides latency by overlapping computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge lists requested by applications from SSDs; to increase I/O throughput and reduce CPU overhead for I/O, it conservatively merges I/O requests. These designs maximize performance for applications with different I/O characteristics. FlashGraph exposes a general and flexible vertex-centric programming interface that can express a wide variety of graph algorithms and their optimizations. We demonstrate that FlashGraph in semi-external memory performs many algorithms with performance up to 80% of its in-memory implementation and significantly outperforms PowerGraph, a popular distributed in-memory graph engine. Comment: published in FAST'1

    Data Mining

    Research interests: data mining in massive graphs, with an emphasis on bridging graph mining and systems techniques for extremely scalable data analysis. Specifically: distributed mining and managing billion-scale graphs, graph indexing, graph compression, spectral graph analysis, tensors, anomaly detection, modeling evolution, and inference in graphs.

    Learning to fail: Predicting fracture evolution in brittle material models using recurrent graph convolutional neural networks

    We propose a machine learning approach to address a key challenge in materials science: predicting how fractures propagate in brittle materials under stress, and how these materials ultimately fail. Our methods use deep learning and train on simulation data from high-fidelity models, emulating the results of these models while avoiding the overwhelming computational demands associated with running a statistically significant sample of simulations. We employ a graph convolutional network that recognizes features of the fracturing material and a recurrent neural network that models the evolution of these features, along with a novel form of data augmentation that compensates for the modest size of our training data. We simultaneously generate predictions for qualitatively distinct material properties. Results on fracture damage and length are within 3% of their simulated values, and results on time to material failure, which is notoriously difficult to predict even with high-fidelity models, are within approximately 15% of simulated values. Once trained, our neural networks generate predictions within seconds, rather than the hours needed to run a single simulation

    Accelerating Recommendation System Training by Leveraging Popular Choices

    Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever more data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper examines the semantics of training data in depth and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
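
    The skewed-access idea in the abstract above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not FAE's actual algorithm: `split_hot_cold`, `is_hot_batch`, and the fixed `hot_budget` (standing in for the number of entries that fit in GPU memory) are hypothetical names, and ranking entries by raw access count is assumed here for simplicity.

```python
from collections import Counter


def split_hot_cold(access_log, hot_budget):
    """Rank embedding indices by access count; keep the top `hot_budget` as hot.

    `access_log` is an iterable of embedding indices seen in training inputs.
    `hot_budget` is an assumed stand-in for how many entries fit on the GPU.
    """
    counts = Counter(access_log)
    hot = {idx for idx, _ in counts.most_common(hot_budget)}
    cold = set(counts) - hot
    return hot, cold


def is_hot_batch(batch, hot):
    """A batch can run without CPU-side lookups only if every index is hot."""
    return all(idx in hot for idx in batch)


if __name__ == "__main__":
    log = [1, 1, 1, 1, 2, 2, 2, 3, 4]
    hot, cold = split_hot_cold(log, hot_budget=2)
    print(sorted(hot), sorted(cold))
    print(is_hot_batch([1, 2, 1], hot), is_hot_batch([1, 3], hot))
```

    Under this assumption, batches that touch only hot entries avoid CPU-to-GPU embedding transfers, which is the effect the abstract attributes to the hot-embedding aware layout.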

    Real-time Text Analytics Pipeline Using Open-source Big Data Tools

    Real-time text processing systems are required in many domains to quickly identify patterns, trends, sentiments, and insights. Nowadays, social networks, e-commerce stores, blogs, scientific experiments, and server logs are the main sources generating huge text data. However, processing huge text data in real time requires building a data processing pipeline. The main challenge in building such a pipeline is to minimize the latency of processing high-throughput data. In this paper, we explain and evaluate our proposed real-time text processing pipeline using open-source big data tools, which minimize the latency of processing data streams. Our proposed data processing pipeline is based on Apache Kafka for data ingestion, Apache Spark for in-memory data processing, Apache Cassandra for storing processed results, and the D3 JavaScript library for visualization. We evaluate the effectiveness of the proposed pipeline under varying deployment scenarios to perform sentiment analysis using a Twitter dataset. Our experimental evaluations show less than a minute latency to process 466,700 Tweets in 10.7 minutes when three virtual machines are allocated to the proposed pipeline.