EOS: Automatic In-vivo Evolution of Kernel Policies for Better Performance
Today's monolithic kernels often implement a small, fixed set of policies
such as disk I/O scheduling policies, while exposing many parameters to let
users select a policy or adjust the specific setting of the policy. Ideally,
the parameters exposed should be flexible enough for users to tune for good
performance, but in practice, users lack domain knowledge of the parameters and
are often stuck with bad, default parameter settings.
We present EOS, a system that bridges the knowledge gap between kernel
developers and users by automatically evolving the policies and parameters in
vivo on users' real, production workloads. It provides a simple policy
specification API for kernel developers to programmatically describe how the
policies and parameters should be tuned, a policy cache to make in-vivo tuning
easy and fast by memorizing good parameter settings for past workloads, and a
hierarchical search engine to effectively search the parameter space.
Evaluation of EOS on four main Linux subsystems shows that it is easy to use
and effectively improves each subsystem's performance.
Comment: 14 pages, technical report
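The policy-cache idea lends itself to a compact illustration. The sketch below is in Python rather than kernel C, and its names (WorkloadSignature, search_parameters) are invented for illustration: it shows only the fast path (reuse a memorized setting for a recognized workload) and the slow path (fall back to the parameter search). EOS's real specification API and hierarchical search engine are kernel-level and considerably richer.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class WorkloadSignature:
        read_ratio_bucket: int  # coarse bucket of the read/write mix
        io_size_bucket: int     # coarse bucket of the mean request size

    @dataclass
    class PolicyCache:
        entries: dict = field(default_factory=dict)

        def lookup(self, sig):
            return self.entries.get(sig)   # hit: reuse a memorized setting

        def store(self, sig, params):
            self.entries[sig] = params

    def tune(sig, cache, search_parameters):
        cached = cache.lookup(sig)
        if cached is not None:
            return cached                  # fast path: past workload seen again
        params = search_parameters(sig)    # slow path: run the parameter search
        cache.store(sig, params)
        return params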
Technical Report: KNN Joins Using a Hybrid Approach: Exploiting CPU/GPU Workload Characteristics
This paper studies finding the K nearest neighbors (KNN) of all points in a
dataset. Typical solutions to KNN searches use indexing to prune the search,
which reduces the number of candidate points that may be within the set of the
nearest points of each query point. In high dimensionality, index searches
degrade, making the KNN self-join a prohibitively expensive operation in some
scenarios. Furthermore, there are a significant number of distance calculations
needed to determine which points are nearest to each query point. To address
these challenges, we propose a hybrid CPU/GPU approach. Since the CPU and GPU
are considerably different architectures that are best exploited using
different algorithms, we advocate for splitting the work between both
architectures based on the characteristic workloads defined by the query points
in the dataset. As such, we assign dense regions to the GPU, and sparse regions
to the CPU to most efficiently exploit the relative strengths of each
architecture. Critically, we find that the relative performance gains over the
reference implementation across four real-world datasets are a function of the
data properties (size, dimensionality, distribution), and number of neighbors,
K.
Comment: 30 pages, 10 figures, 6 tables
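As a rough illustration of the work-splitting idea, the sketch below estimates each point's local density with a coarse grid and routes dense regions to a GPU batch and sparse regions to a CPU batch. The grid-count heuristic and the dense_threshold value are assumptions for illustration, not the paper's actual partitioning rule.

    import numpy as np

    def split_by_density(points, cells_per_dim=16, dense_threshold=32):
        lo, hi = points.min(axis=0), points.max(axis=0)
        # Map each point to an integer grid cell.
        cell = ((points - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int)
        cell = np.clip(cell, 0, cells_per_dim - 1)
        keys = np.ravel_multi_index(cell.T, (cells_per_dim,) * points.shape[1])
        # A point's density estimate is the population of its cell.
        _, inv, counts = np.unique(keys, return_inverse=True, return_counts=True)
        dense = counts[inv] >= dense_threshold
        return points[dense], points[~dense]   # (GPU batch, CPU batch)

    pts = np.random.rand(100000, 2)
    gpu_pts, cpu_pts = split_by_density(pts)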
GPU Accelerated Self-join for the Distance Similarity Metric
The self-join finds all objects in a dataset within a threshold of each other
defined by a similarity metric. As such, the self-join is a building block for
the field of databases and data mining, and is employed in Big Data
applications. In this paper, we advance a GPU-efficient algorithm for the
similarity self-join that uses the Euclidean distance metric. The
search-and-refine strategy is an efficient approach for low dimensionality
datasets, as index searches degrade with increasing dimension (i.e., the curse
of dimensionality). Thus, we target the low dimensionality problem, and compare
our GPU self-join to a search-and-refine implementation, and a state-of-the-art
parallel algorithm. In low dimensionality, there are several unique challenges
associated with efficiently solving the self-join problem on the GPU. Low
dimensional data often results in higher data densities, causing a significant
number of distance calculations and a large result set. As dimensionality
increases, index searches become increasingly exhaustive, forming a performance
bottleneck. We advance several techniques to overcome these challenges using
the GPU. The techniques we propose include a GPU-efficient index that employs a
bounded search, a batching scheme to accommodate large result set sizes, and a
reduction in distance calculations through duplicate search removal. Our GPU
self-join outperforms both search-and-refine and state-of-the-art algorithms.
Comment: Accepted for publication in the 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing. To appear in the Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium Workshops
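A sequential Python sketch of the indexed self-join the paper accelerates on the GPU is given below: a grid of epsilon-sized cells bounds each point's search to its 3^d adjacent cells, and emitting only pairs with i < j stands in for the paper's duplicate-search removal. The GPU index, batching scheme, and kernel layout are the paper's actual contributions and are not reproduced here.

    import numpy as np
    from collections import defaultdict
    from itertools import product

    def self_join(points, eps):
        d = points.shape[1]
        grid = defaultdict(list)            # epsilon-sized cell -> point ids
        for i, p in enumerate(points):
            grid[tuple((p // eps).astype(int))].append(i)
        pairs = []
        for cell, ids in grid.items():
            # Bounded search: any neighbor within eps lies in the 3^d adjacent cells.
            for off in product((-1, 0, 1), repeat=d):
                neigh = tuple(c + o for c, o in zip(cell, off))
                for i in ids:
                    for j in grid.get(neigh, ()):
                        # i < j avoids finding each pair from both endpoints.
                        if i < j and np.linalg.norm(points[i] - points[j]) <= eps:
                            pairs.append((i, j))
        return pairs

    result = self_join(np.random.rand(2000, 3), eps=0.05)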
Application-Driven Near-Data Processing for Similarity Search
Similarity search is a key to a variety of applications including
content-based search for images and video, recommendation systems, data
deduplication, natural language processing, computer vision, databases,
computational biology, and computer graphics. At its core, similarity search
manifests as k-nearest neighbors (kNN), a computationally simple primitive
consisting of highly parallel distance calculations and a global top-k sort.
However, kNN is poorly supported by today's architectures because of its high
memory bandwidth requirements.
This paper proposes an application-driven near-data processing accelerator
for similarity search: the Similarity Search Associative Memory (SSAM). By
instantiating compute units close to memory, SSAM benefits from the higher
memory bandwidth and density exposed by emerging memory technologies. We
evaluate the SSAM design down to layout on top of the Micron hybrid memory cube
(HMC), and show that SSAM can achieve up to two orders of magnitude
area-normalized throughput and energy efficiency improvement over multicore
CPUs; we also show SSAM is faster and more energy efficient than competing GPUs
and FPGAs. Finally, we show that SSAM is also useful for other data intensive
tasks like kNN index construction, and can be generalized to semantically
function as a high capacity content addressable memory.
Comment: 15 pages, 8 figures, 7 tables
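For reference, the kNN primitive described above -- highly parallel distance calculations followed by a global top-k selection -- can be written out in a few lines of NumPy. This is the software baseline that SSAM's near-data compute units accelerate, not a model of the hardware itself.

    import numpy as np

    def knn(queries, corpus, k):
        # Pairwise squared distances via ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2:
        # the embarrassingly parallel part of the primitive.
        d2 = (np.sum(queries**2, axis=1)[:, None]
              - 2.0 * queries @ corpus.T
              + np.sum(corpus**2, axis=1)[None, :])
        # Global top-k selection per query: the sort part of the primitive.
        idx = np.argpartition(d2, k, axis=1)[:, :k]
        order = np.take_along_axis(d2, idx, axis=1).argsort(axis=1)
        return np.take_along_axis(idx, order, axis=1)

    q = np.random.rand(8, 64)        # 8 queries in 64 dimensions
    c = np.random.rand(10000, 64)    # 10,000 corpus points
    neighbors = knn(q, c, k=5)       # shape (8, 5), nearest first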
HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud
Eliminating duplicate data in primary storage of clouds increases the
cost-efficiency of cloud service providers as well as reduces the cost of users
for using cloud services. Existing primary deduplication techniques either use
inline caching to exploit locality in primary workloads or use post-processing
deduplication running in system idle time to avoid the negative impact on I/O
performance. However, neither works well in cloud servers running multiple services or applications, for two reasons. First, the temporal locality of duplicate data writes may not exist in some primary storage workloads, so inline caching often fails to achieve a good deduplication ratio. Second, post-processing deduplication allows duplicate data to be written to disk, and therefore provides no I/O deduplication benefit while requiring high peak storage capacity. This paper presents HPDedup, a Hybrid Prioritized data Deduplication mechanism that handles storage systems shared by applications running in co-located virtual machines or containers by fusing an inline and a post-processing process for exact deduplication. In the inline deduplication phase, HPDedup provides a fingerprint caching mechanism that estimates the temporal locality of duplicates in data streams from different VMs or applications and prioritizes cache allocation for these streams based on the estimation. HPDedup also allows different deduplication thresholds for streams based on their spatial locality to reduce disk fragmentation. The post-processing phase then removes from disk those duplicates whose fingerprints could not be cached due to weak temporal locality. Our experimental results show that HPDedup clearly outperforms state-of-the-art primary storage deduplication techniques in terms of inline cache efficiency and primary deduplication efficiency.
Comment: 14 pages, 11 figures, submitted to MSST2017
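A heavily simplified sketch of the inline phase follows: each stream's fingerprint cache quota is set in proportion to an estimate of its temporal locality, and blocks whose fingerprints miss the cache fall through to be handled later by post-processing. The locality estimator, quota rule, and names here (StreamCache, allocate_quotas) are illustrative assumptions, not HPDedup's actual design.

    import hashlib
    from collections import OrderedDict

    class StreamCache:
        """LRU fingerprint cache for one stream's inline deduplication."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.fp = OrderedDict()          # fingerprint -> block address

        def dedup(self, block, addr):
            f = hashlib.sha1(block).hexdigest()
            if f in self.fp:
                self.fp.move_to_end(f)
                return self.fp[f]            # duplicate caught inline
            if len(self.fp) >= self.capacity:
                self.fp.popitem(last=False)  # evict the LRU fingerprint
            self.fp[f] = addr
            return None                      # write it; post-processing may dedup later

    def allocate_quotas(total_slots, locality_scores):
        # Give more cache slots to streams whose recent writes showed more
        # duplicate hits, i.e., higher estimated temporal locality.
        s = sum(locality_scores.values()) or 1.0
        return {sid: max(1, int(total_slots * v / s))
                for sid, v in locality_scores.items()}

    quotas = allocate_quotas(4096, {"vm1": 0.6, "vm2": 0.1, "vm3": 0.3})
    caches = {sid: StreamCache(q) for sid, q in quotas.items()}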
Cuttlefish: A Lightweight Primitive for Adaptive Query Processing
Modern data processing applications execute increasingly sophisticated
analysis that requires operations beyond traditional relational algebra. As a
result, operators in query plans grow in diversity and complexity. Designing
query optimizer rules and cost models to choose physical operators for all of
these novel logical operators is impractical. To address this challenge, we
develop Cuttlefish, a new primitive for adaptively processing online query
plans that explores candidate physical operator instances during query
execution and exploits the fastest ones using multi-armed bandit reinforcement
learning techniques. We prototype Cuttlefish in Apache Spark and adaptively
choose operators for image convolution, regular expression matching, and
relational joins. Our experiments show Cuttlefish-based adaptive convolution
and regular expression operators can reach 72-99% of the throughput of an
all-knowing oracle that always selects the optimal algorithm, even when
individual physical operators are up to 105x slower than the optimal.
Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x
compared with Spark SQL's query optimizer.
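The explore/exploit loop can be illustrated with a plain multi-armed bandit: treat each candidate physical operator as an arm, use observed throughput as the reward, and select arms with UCB1. Cuttlefish's actual bandit machinery and its Spark integration are more sophisticated; the UCB1 rule and the toy operators below are stand-ins.

    import math
    import time

    def ucb1_pick(stats, t):
        for name, (n, _) in stats.items():
            if n == 0:
                return name                  # try every arm at least once
        return max(stats, key=lambda a: stats[a][1]
                   + math.sqrt(2 * math.log(t) / stats[a][0]))

    def run_adaptive(batches, operators):
        stats = {name: (0, 0.0) for name in operators}
        for t, batch in enumerate(batches, start=1):
            name = ucb1_pick(stats, t)               # explore or exploit
            start = time.perf_counter()
            operators[name](batch)                   # run the chosen operator
            reward = len(batch) / (time.perf_counter() - start)  # throughput
            n, mean = stats[name]
            stats[name] = (n + 1, mean + (reward - mean) / (n + 1))
        return stats

    ops = {"fast": lambda b: [x * 2 for x in b],
           "slow": lambda b: [x * 2 for x in b for _ in range(50)][:len(b)]}
    print(run_adaptive([list(range(1000))] * 200, ops))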
Analysis of the Distributed System Load with Multifractal Input Data Flows
The paper proposes a solution to a pressing scientific problem related to load balancing and efficient utilization of the resources of a distributed system. The proposed method is based on calculating the CPU, memory, and bandwidth load imposed by flows of different service classes, for each server and for the distributed system as a whole, while taking into account the multifractal properties of the input data flows. Weighting factors are introduced to determine the significance of each server characteristic relative to the others. The method thus allows calculating the imbalance across all system servers as well as overall system utilization. Simulations of the proposed method were conducted for different multifractal parameters of the input flows, and showed that the characteristics of multifractal traffic have an appreciable effect on system imbalance. Using the proposed method, requests can be distributed across servers so that the deviation of each server's load from the average is minimal, which yields higher system performance metrics and faster flow processing.
Comment: 5 pages
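Following the abstract's description, a minimal version of the imbalance measure combines each server's CPU, memory, and bandwidth utilization with weighting factors and takes the deviation of per-server load from the mean; the exact formula in the paper may differ.

    import numpy as np

    def system_imbalance(loads, weights):
        # loads: (servers, 3) array of (cpu, memory, bandwidth) utilization in [0, 1];
        # weights: significance of each characteristic relative to the others.
        combined = loads @ weights                       # weighted load per server
        return float(np.mean(np.abs(combined - combined.mean())))

    loads = np.array([[0.9, 0.4, 0.7],
                      [0.2, 0.3, 0.1],
                      [0.5, 0.6, 0.4]])
    weights = np.array([0.5, 0.3, 0.2])
    print(system_imbalance(loads, weights))              # 0.0 means perfectly balanced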
Machine Learning in Compiler Optimisation
In the last decade, machine learning based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe
the relationship between machine learning and compiler optimisation and
introduce the main concepts of features, models, training and deployment. We
then provide a comprehensive survey and a road map for the wide variety
of different research areas. We conclude with a discussion on open issues in
the area and potential research directions. This paper provides both an
accessible introduction to the fast moving area of machine learning based
compilation and a detailed bibliography of its main achievements.
Comment: Accepted to be published at Proceedings of the IEEE
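The four concepts the article introduces (features, models, training, deployment) fit in a toy end-to-end example: extract numeric features per loop, train a model offline on measured-best settings, and query it at compile time. The features, labels, and data below are invented purely for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Features per loop: (trip_count, body_size_in_ir_ops, has_branch);
    # label: unroll factor that was fastest on a benchmark run (hypothetical data).
    X = [[4, 10, 0], [1000, 4, 0], [1000, 60, 1], [16, 8, 0], [2000, 5, 0]]
    y = [1, 8, 1, 4, 8]

    model = DecisionTreeClassifier().fit(X, y)   # offline training

    def choose_unroll(trip_count, body_size, has_branch):
        # Deployment: the compiler queries the model for each loop it visits.
        return int(model.predict([[trip_count, body_size, has_branch]])[0])

    print(choose_unroll(512, 6, 0))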
A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism
There is a large body of recent work applying machine learning (ML)
techniques to query optimization and query performance prediction in relational
database management systems (RDBMSs). However, these works typically ignore the effect of intra-parallelism -- a key component used to boost the performance of OLAP queries in practice -- on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of tuning the degree of parallelism (DOP) via ML techniques in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores.
In our study, we cast the problem of DOP tuning as a regression task,
and examine how several popular ML models can help with query performance
prediction in a multi-core setting. We explore the design space and perform an
extensive experimental study comparing different models against a list of
performance metrics, testing how well they generalize in different settings:
to queries from the same template, to queries from a new template,
to instances of different scale, and to different instances and
queries. Our experimental results show that a simple featurization of the input
query plan that ignores cost model estimations can accurately predict query
performance, capture the speedup trend with respect to the available
parallelism, as well as help with automatically choosing an optimal per-query
DOP.
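The regression framing can be sketched directly: learn a model from (plan features, DOP) to elapsed time, then pick the candidate DOP with the lowest predicted time for a given plan. The feature set and training data below are stand-ins; the study itself compares several ML models over much richer plan featurizations from SQL Server.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical training rows: [estimated_rows, n_operators, dop] -> seconds.
    X = np.array([[1e6, 12, 1], [1e6, 12, 8], [1e6, 12, 32],
                  [1e4,  5, 1], [1e4,  5, 8], [1e4,  5, 32]])
    y = np.array([40.0, 7.0, 5.5, 0.9, 0.8, 1.2])   # more DOP can backfire

    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    def choose_dop(plan_features, candidates=(1, 2, 4, 8, 16, 32, 64)):
        # Predict elapsed time at each candidate DOP and take the argmin.
        rows = np.array([list(plan_features) + [d] for d in candidates])
        return candidates[int(np.argmin(model.predict(rows)))]

    print(choose_dop([1e6, 12]))   # predicted-optimal DOP for this plan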
GRACOS: Scalable and Load Balanced P3M Cosmological N-body Code
We present a parallel implementation of the particle-particle/particle-mesh
(P3M) algorithm for distributed memory clusters. The GRACOS (GRAvitational
COSmology) code uses a hybrid method for both computation and domain
decomposition. Long-range forces are computed using a Fourier transform gravity
solver on a regular mesh; the mesh is distributed across parallel processes
using a static one-dimensional slab domain decomposition. Short-range forces
are computed by direct summation of close pairs; particles are distributed
using a dynamic domain decomposition based on a space-filling Hilbert curve. A
nearly-optimal method was devised to dynamically repartition the particle
distribution so as to maintain load balance even for extremely inhomogeneous
mass distributions. Tests using simulations on a 40-processor Beowulf cluster showed good load balance and scalability up to 80 processes. We discuss
the limits on scalability imposed by communication and extreme clustering and
suggest how they may be removed by extending our algorithm to include adaptive
mesh refinement.
Comment: to be submitted to ApJ
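The dynamic decomposition can be illustrated compactly: key each particle by a space-filling curve and cut the sorted key order into contiguous chunks per process. A Morton (Z-order) key is used below as a simpler stand-in for GRACOS's Hilbert curve, and equal particle counts stand in for its work-based repartitioning.

    import numpy as np

    def morton3d(coords, bits=10):
        # Quantize positions in [0, 1)^3 to a 2^bits grid, then interleave
        # the x, y, z bits into a single space-filling-curve key.
        q = np.minimum((coords * (1 << bits)).astype(np.uint64), (1 << bits) - 1)
        keys = np.zeros(len(coords), dtype=np.uint64)
        for b in range(bits):
            for axis in range(3):
                bit = (q[:, axis] >> np.uint64(b)) & np.uint64(1)
                keys |= bit << np.uint64(3 * b + axis)
        return keys

    def decompose(positions, n_procs):
        # Contiguous curve segments keep each process's particles spatially compact.
        order = np.argsort(morton3d(positions))
        return np.array_split(order, n_procs)

    parts = decompose(np.random.rand(100000, 3), n_procs=40)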