Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey
The next-generation astronomy digital archives will cover most of the
universe at fine resolution in many wavelengths, from X-rays to ultraviolet,
optical, and infrared. The archives will be stored at diverse geographical
locations. One of the first of these projects, the Sloan Digital Sky Survey
(SDSS), will create a 5-wavelength catalog over 10,000 square degrees of the sky
(see http://www.sdss.org/). The 200 million objects in the multi-terabyte
database will have mostly numerical attributes, defining a space of 100+
dimensions. Points in this space have highly correlated distributions.
The archive will enable astronomers to explore the data interactively. Data
access will be aided by a multidimensional spatial index and other indices. The
data will be partitioned in many ways. Small tag objects consisting of the most
popular attributes speed up frequent searches. Splitting the data among
multiple servers enables parallel, scalable I/O and applies parallel processing
to the data. Hashing techniques allow efficient clustering and pair-wise
comparison algorithms that parallelize nicely. Randomly sampled subsets allow
debugging of otherwise large queries at the desktop. Central servers will operate
a data pump that supports sweeping searches that touch most of the data. The
anticipated queries require special operators related to angular distances and
complex similarity tests of object properties, like shapes, colors, velocity
vectors, or temporal behaviors. These issues pose interesting data management
challenges.
Comment: 9 pages, original at research.microsoft.com/~gray/papers/MS_TR_99_30_Sloan_Digital_Sky_Survey.do
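As an illustration of the angular-distance operators such queries need, the sketch below (not SDSS code; the function and array names are hypothetical) computes great-circle separations between sky positions given in RA/Dec:

```python
import numpy as np

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between sky positions given
    in (RA, Dec) degrees, using the Vincenty formula for numerical
    stability at small angles."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    dra = ra2 - ra1
    num = np.hypot(np.cos(dec2) * np.sin(dra),
                   np.cos(dec1) * np.sin(dec2)
                   - np.sin(dec1) * np.cos(dec2) * np.cos(dra))
    den = (np.sin(dec1) * np.sin(dec2)
           + np.cos(dec1) * np.cos(dec2) * np.cos(dra))
    return np.degrees(np.arctan2(num, den))

# e.g. select catalog objects within 1 arcminute of a target position
# (ra, dec are hypothetical columns from the object catalog):
# mask = angular_separation(ra, dec, target_ra, target_dec) < 1.0 / 60.0
```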
Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures
During the last two years, the goal of many researchers has been to squeeze
the last bit of performance out of HPC systems for AI tasks. Often this
discussion is held in the context of how fast ResNet50 can be trained.
Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus,
we focus on Recommender Systems which account for most of the AI cycles in
cloud computing centers. More specifically, we focus on Facebook's DLRM
benchmark. By enabling it to run on the latest CPU hardware and software
tailored for HPC, we are able to achieve an improvement of more than two orders
of magnitude (110x) in performance on a single socket compared to the reference
CPU implementation, and high scaling efficiency up to 64 sockets, while fitting
ultra-large datasets. This paper discusses the optimization techniques for the
various operators in DLRM and which components of the system are stressed by
these different operators. The presented techniques are applicable to a broader
set of DL workloads that pose the same scaling challenges and characteristics as
DLRM.
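To make the memory-bound nature of DLRM's sparse operators concrete, here is a minimal NumPy sketch (not the DLRM code itself; names and shapes are illustrative) of the sum-pooled embedding lookup that dominates the model's memory traffic:

```python
import numpy as np

def embedding_bag_sum(table, indices, offsets):
    """For each sample, gather a variable-length bag of rows from an
    embedding table and sum-pool them. This is memory-bound: performance
    hinges on gather bandwidth, not FLOPs.
    table:   (num_rows, dim) embedding table
    indices: row ids for all bags, concatenated
    offsets: start position of each bag within `indices`"""
    bounds = list(offsets) + [len(indices)]
    return np.stack([table[indices[s:e]].sum(axis=0)
                     for s, e in zip(bounds[:-1], bounds[1:])])

# hypothetical shapes: 1M-row table, 3 samples with bags of size 2, 1, 3
table = np.random.rand(1_000_000, 64).astype(np.float32)
out = embedding_bag_sum(table,
                        np.array([5, 17, 999_999, 0, 3, 42]),
                        np.array([0, 2, 3]))   # out.shape == (3, 64)
```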
On Big Data Benchmarking
Big data systems address the challenges of capturing, storing, managing,
analyzing, and visualizing big data. Within this context, developing benchmarks
to evaluate and compare big data systems has become an active topic for both
research and industry communities. To date, most of the state-of-the-art big
data benchmarks are designed for specific types of systems. Based on our
experience, however, we argue that considering the complexity, diversity, and
rapid evolution of big data systems, for the sake of fairness, big data
benchmarks must include a diversity of data and workloads. Given this motivation,
in this paper, we first propose the key requirements and challenges in
developing big data benchmarks from the perspectives of generating data with 4V
properties (i.e. volume, velocity, variety and veracity) of big data, as well
as generating tests with comprehensive workloads for big data systems. We then
present the methodology on big data benchmarking designed to address these
challenges. Next, the state of the art is summarized and compared, followed
by our vision for future research directions.
Comment: 7 pages, 4 figures, 2 tables, accepted in BPOE-04 (http://prof.ict.ac.cn/bpoe_4_asplos/)
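As a loose illustration (entirely hypothetical, not from the paper) of what a 4V-parameterized data generator could look like, consider a toy stream with one knob per "V":

```python
import random
import time

def generate_stream(volume, velocity, variety, veracity):
    """Toy generator with one knob per 'V' (all names hypothetical):
    volume   - total number of records to emit
    velocity - records emitted per second
    variety  - number of distinct record schemas to mix
    veracity - fraction of records that stay clean; the rest get nulls"""
    for i in range(volume):
        schema = i % variety                       # mix record types
        record = {"schema": schema, "id": i, "value": random.random()}
        if random.random() < 1.0 - veracity:       # inject dirty data
            record["value"] = None
        yield record
        time.sleep(1.0 / velocity)                 # pace the stream

# e.g. a small, fast, fairly dirty stream for a smoke test:
# for rec in generate_stream(volume=100, velocity=50, variety=3, veracity=0.9):
#     process(rec)
```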
Role of Apache Software Foundation in Big Data Projects
With the increase in the amount of Big Data being generated each year, the
tools and technologies developed and used for storing, processing, and
analyzing Big Data have also improved. Open-Source software has been an
important factor in the success and innovation in the field of Big Data while
Apache Software Foundation (ASF) has played a crucial role in this success and
innovation by providing a number of state-of-the-art projects, free and open to
the public. ASF has classified its projects into different categories. In this
report, the projects listed under the Big Data category are analyzed in depth
and discussed with reference to one of the seven sub-categories defined. Our
investigation has shown that many of the Apache Big Data projects are
autonomous, while some are built on top of other Apache projects and some work
in conjunction with other projects to improve and ease development in the Big
Data space.
Snap ML: A Hierarchical Framework for Machine Learning
We describe a new software framework for fast training of generalized linear
models. The framework, named Snap Machine Learning (Snap ML), combines recent
advances in machine learning systems and algorithms in a nested manner to
reflect the hierarchical architecture of modern computing systems. We prove
theoretically that such a hierarchical system can accelerate training in
distributed environments where intra-node communication is cheaper than
inter-node communication. Additionally, we provide a review of the
implementation of Snap ML in terms of GPU acceleration, pipelining,
communication patterns and software architecture, highlighting aspects that
were critical for achieving high performance. We evaluate the performance of
Snap ML in both single-node and multi-node environments, quantifying the
benefit of the hierarchical scheme and the data streaming functionality, and
comparing with other widely-used machine learning software frameworks. Finally,
we present a logistic regression benchmark on the Criteo Terabyte Click Logs
dataset and show that Snap ML achieves the same test loss an order of magnitude
faster than any of the previously reported results, including those obtained
using TensorFlow and scikit-learn.
Comment: in Proceedings of the Thirty-Second Conference on Neural Information Processing Systems (NeurIPS 2018)
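A minimal simulation of the hierarchical idea (not Snap ML's actual algorithm or API; all names are hypothetical): replicas on the same node synchronize frequently over the cheap intra-node link, while nodes synchronize rarely over the expensive inter-node link:

```python
import numpy as np

def hierarchical_sgd(grads_fn, w, nodes=4, gpus_per_node=2,
                     local_steps=8, rounds=10, lr=0.1):
    """Two-level training sketch: each 'GPU' replica takes local SGD
    steps, replicas on one node average often (cheap intra-node link),
    and nodes average once per round (expensive inter-node link)."""
    replicas = np.tile(w, (nodes, gpus_per_node, 1)).astype(np.float64)
    for _ in range(rounds):
        for n in range(nodes):
            for g in range(gpus_per_node):
                for _ in range(local_steps):        # local work
                    replicas[n, g] -= lr * grads_fn(replicas[n, g])
            replicas[n] = replicas[n].mean(axis=0)  # intra-node sync
        replicas[:] = replicas.mean(axis=(0, 1))    # inter-node sync
    return replicas[0, 0]

# e.g. toy quadratic objective 0.5 * ||w - target||^2:
# target = np.ones(16)
# w_final = hierarchical_sgd(lambda w: w - target, np.zeros(16))
```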
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
Graph analysis performs many random reads and writes; thus, these workloads
are typically run in memory. Traditionally, analyzing large graphs
requires a cluster of machines so the aggregate memory exceeds the graph size.
We demonstrate that a multicore server can process graphs with billions of
vertices and hundreds of billions of edges, utilizing commodity SSDs with
minimal performance loss. We do so by implementing a graph-processing engine on
top of a user-space SSD file system designed for high IOPS and extreme
parallelism. Our semi-external memory graph engine called FlashGraph stores
vertex state in memory and edge lists on SSDs. It hides latency by overlapping
computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge
lists requested by applications from SSDs; to increase I/O throughput and
reduce CPU overhead for I/O, it conservatively merges I/O requests. These
designs maximize performance for applications with different I/O
characteristics. FlashGraph exposes a general and flexible vertex-centric
programming interface that can express a wide variety of graph algorithms and
their optimizations. We demonstrate that FlashGraph in semi-external memory
performs many algorithms with performance up to 80% of its in-memory
implementation and significantly outperforms PowerGraph, a popular distributed
in-memory graph engine.
Comment: published in FAST'15
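The semi-external-memory split can be sketched in a few lines (an illustration, not FlashGraph's engine; `read_edge_list` is a hypothetical callback standing in for the SSD read path): per-vertex BFS levels live in memory, and edge lists are fetched only for vertices in the current frontier:

```python
def semi_external_bfs(num_vertices, read_edge_list, source):
    """Semi-external-memory BFS sketch: vertex state in RAM, edge
    lists on SSD, fetched on demand via read_edge_list(v)."""
    level = [-1] * num_vertices        # in-memory vertex state
    level[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:             # only active vertices touch the SSD
            for u in read_edge_list(v):
                if level[u] == -1:
                    level[u] = level[v] + 1
                    next_frontier.append(u)
        frontier = next_frontier
    return level
```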
Data Mining
RESEARCH INTERESTS: Data mining in massive graphs, with an emphasis on bridging graph mining and systems techniques for extremely scalable data analysis. Specifically: distributed mining and management of billion-scale graphs, graph indexing, graph compression, spectral graph analysis, tensor analysis, anomaly detection, modeling evolution, and inference in graphs.
Learning to fail: Predicting fracture evolution in brittle material models using recurrent graph convolutional neural networks
We propose a machine learning approach to address a key challenge in
materials science: predicting how fractures propagate in brittle materials
under stress, and how these materials ultimately fail. Our methods use deep
learning and train on simulation data from high-fidelity models, emulating the
results of these models while avoiding the overwhelming computational demands
associated with running a statistically significant sample of simulations. We
employ a graph convolutional network that recognizes features of the fracturing
material and a recurrent neural network that models the evolution of these
features, along with a novel form of data augmentation that compensates for the
modest size of our training data. We simultaneously generate predictions for
qualitatively distinct material properties. Results on fracture damage and
length are within 3% of their simulated values, and results on time to material
failure, which is notoriously difficult to predict even with high-fidelity
models, are within approximately 15% of simulated values. Once trained, our
neural networks generate predictions within seconds, rather than the hours
needed to run a single simulation.
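A toy NumPy sketch of the combination described, with hypothetical shapes and a plain tanh cell standing in for the paper's recurrent network: one graph-convolution step aggregates neighbor features, and a recurrent update evolves per-node hidden state across time steps:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: mean-aggregate each node's
    neighborhood (including itself), then a linear map + ReLU."""
    deg = adj.sum(axis=1, keepdims=True) + 1.0
    agg = (adj @ feats + feats) / deg          # mean over self + neighbors
    return np.maximum(agg @ weight, 0.0)

def recurrent_step(hidden, feats, w_h, w_x):
    """Minimal tanh RNN cell evolving per-node state over time."""
    return np.tanh(hidden @ w_h + feats @ w_x)

# hypothetical toy sizes: 5-node fracture graph, 8-dim node features
rng = np.random.default_rng(0)
adj = (rng.random((5, 5)) < 0.4).astype(float)
np.fill_diagonal(adj, 0)
h = np.zeros((5, 8))
for _ in range(3):                             # three simulated time steps
    x = gcn_layer(adj, rng.random((5, 8)), rng.random((8, 8)))
    h = recurrent_step(h, x, rng.random((8, 8)), rng.random((8, 8)))
```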
Accelerating Recommendation System Training by Leveraging Popular Choices
Recommender models are commonly used to suggest relevant items to a user for
e-commerce and online advertisement-based applications. These models use
massive embedding tables to store numerical representations of items' and users'
categorical variables (memory intensive) and employ neural networks (compute
intensive) to generate final recommendations. Training these large-scale
recommendation models is evolving to require increasing data and compute
resources. The highly parallel neural-network portion of these models can
benefit from GPU acceleration; however, large embedding tables often cannot fit
in the limited-capacity GPU device memory. Hence, this paper dives deep into
the semantics of training data and obtains insights about the feature access,
transfer, and usage patterns of these models. We observe that, due to the
popularity of certain inputs, accesses to the embeddings are highly skewed,
with a few embedding entries being accessed up to 10,000x more often than others. This paper
leverages this asymmetrical access pattern to offer a framework, called FAE,
and proposes a hot-embedding aware data layout for training recommender models.
This layout utilizes the scarce GPU memory for storing the highly accessed
embeddings, thus reducing data transfers from CPU to GPU. At the same time,
FAE engages the GPU to accelerate the executions of these hot embedding
entries. Experiments on production-scale recommendation models with real
datasets show that FAE reduces the overall training time by 2.3x and 1.52x in
comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while
maintaining baseline accuracy.
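The hot-embedding layout can be illustrated with a short sketch (not FAE's implementation; names are hypothetical): rank embedding rows by observed access counts, pin the hottest ones in scarce GPU memory, and route lookups accordingly:

```python
import numpy as np

def partition_by_popularity(access_counts, gpu_capacity_rows):
    """Rank embedding rows by training-trace access frequency; keep
    the hottest rows in GPU memory and the rest on the CPU."""
    order = np.argsort(access_counts)[::-1]        # most-accessed first
    hot = set(order[:gpu_capacity_rows].tolist())
    cold = set(order[gpu_capacity_rows:].tolist())
    return hot, cold

def lookup(row, hot, gpu_table, cpu_table):
    """Route a lookup: hot rows hit GPU-resident memory; cold rows
    pay the CPU-to-GPU transfer."""
    return gpu_table[row] if row in hot else cpu_table[row]
```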
Real-time Text Analytics Pipeline Using Open-source Big Data Tools
Real-time text processing systems are required in many domains to quickly
identify patterns, trends, sentiments, and insights. Nowadays, social networks,
e-commerce stores, blogs, scientific experiments, and server logs are the main
sources generating huge volumes of text data. However, processing huge text
data in real time requires building a data processing pipeline. The main
challenge in building such a pipeline is minimizing the latency of processing
high-throughput data.
In this paper, we explain and evaluate our proposed real-time text processing
pipeline, built with open-source big data tools, that minimizes the latency to process
data streams. Our proposed data processing pipeline is based on Apache Kafka
for data ingestion, Apache Spark for in-memory data processing, Apache
Cassandra for storing processed results, and D3 JavaScript library for
visualization. We evaluate the effectiveness of the proposed pipeline under
varying deployment scenarios to perform sentiment analysis using a Twitter
dataset. Our experimental evaluations show sub-minute latency to process
Tweets when three virtual machines are allocated to the proposed pipeline.
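A condensed sketch of such a pipeline in PySpark Structured Streaming (illustrative only: the broker address, topic, keyspace, and table names are assumptions, the sentiment scorer is a stand-in, and the Cassandra sink requires the DataStax Spark-Cassandra connector on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()

# Kafka ingestion: subscribe to a hypothetical "tweets" topic
tweets = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

def score(batch_df, batch_id):
    # toy stand-in for the sentiment model; persist results to Cassandra
    scored = batch_df.withColumn("sentiment", col("text").contains("good"))
    (scored.write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="sentiments")
     .mode("append").save())

# in-memory micro-batch processing in Spark, results stored in Cassandra
tweets.writeStream.foreachBatch(score).start().awaitTermination()
```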