3,924 research outputs found
ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads
ARM processors have dominated the mobile device market in the last decade due
to their favorable computing to energy ratio. In this age of Cloud data centers
and Big Data analytics, the focus is increasingly on power efficient
processing, rather than just high throughput computing. ARM's first commodity
server-grade processor is the recent AMD A1100-series processor, based on a
64-bit ARM Cortex A57 architecture. In this paper, we study the performance and
energy efficiency of a server based on this ARM64 CPU, relative to a comparable
server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads.
Specifically, we study these for Intel's HiBench suite of web, query and
machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed
setup, for data sizes up to files, web pages and tuples. Our
results show that the ARM64 server's runtime performance is comparable to the
x64 server for integer-based workloads like Sort and Hive queries, and only
lags behind for floating-point intensive benchmarks like PageRank, when they do
not exploit data parallelism adequately. We also see that the ARM64 server
takes the energy, and has an Energy Delay Product (EDP) that
is lower than the x64 server. These results hold promise for ARM64
data centers hosting Big Data workloads to reduce their operational costs,
while opening up opportunities for further analysis.Comment: Accepted for publication in the Proceedings of the 24th IEEE
International Conference on High Performance Computing, Data, and Analytics
(HiPC), 201
Recommended from our members
SoC-Based In-Storage Processing: Bringing Flexibility and Efficiency to Near-Data Processing
Data are among the most valuable assets in the modern world, and they have caused a revolutionary stage in human life. Nowadays, companies make knowledge-based decisions by analyzing a huge volume of data, super-scale data centers are used to process customers’ data to suggest products to them, government services rely on the data people provide to them, and there are many similar cases wherein data are used as an important asset. Data are originally stored in storage systems. To process data, application servers need to fetch the data from storage units, which imposes the cost of moving the data to the system. This cost has a direct relationship to the distance of the processing engines from the data, and this is the key motivation for the emergence of distributed processing platforms such as Hadoop, which bring the process closer to the data.In-storage processing (ISP) pushes the “bring the process to data” paradigm to its ultimate boundaries by utilizing processing engines inside the storage units to process data. The architecture of modern solid-state drives (SSDs) provides a suitable environment for implementing such technology. Thus, this dissertation focuses on SSD architectures that are able to run user applications in-place, which are called computational storage devices (CSDs). In this dissertation, we propose CSD architectures and investigate the benefits of deploying CSDs for running different applications. This research uses a practical approach that includes building fully functional prototypes of the proposed CSD architectures, developing storage systems equipped with the CSDs, and running different benchmarks to investigate the benefits of deploying the CSDs in the systems. This research proposes two different CSD architectures, namely CompStor and Catalina.These are the first CSDs to be equipped with a dedicated ISP engine for running user applications in-place that includes a quad-core ARM Cortex-A53 processor together with FPGA- and application-specific integrated circuit (ASIC) based accelerators. The proposed architectures run a full-fledged operating system inside, which provides a flexible environment for running a wide range of user applications in-place. The system-on-chip (SOC) based architecture of Catalina CSD, together with a software stack developed for seamless deployment of the CSD, makes it a platform for the implementation of different ISP concepts and ideas.To the best of our knowledge, Catalina is the only ISP platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and message passing interface (MPI) based applications in-place without any modifications to the underlying distributed processing framework. We performed extensive experimental tests using several datasets on both CompStor and Catalina CSDs. The experimental results show up to 2.2x and 4.3x improvements in performance and energy consumption, respectively, for running Hadoop MapReduce benchmarks using Catalina CSDs and up to 5.4x and 8.9x improvements for running 1-, 2-, and 3-dimensional DFT algorithms due to the Neon SIMD engines inside Catalina. Additionally, using FPGA-based accelerators, Catalina CSDs can improve the performance and energy consumption of a highly demanding image similarity search application up to 11x and 7x, respectively
CRAID: Online RAID upgrades using dynamic hot data reorganization
Current algorithms used to upgrade RAID arrays typically require large amounts of data to be migrated, even those that move only the minimum amount of data required to keep a balanced data load. This paper presents CRAID, a self-optimizing RAID array that performs an online block reorganization of frequently used, long-term accessed data in order to reduce this migration even further. To achieve this objective, CRAID tracks frequently used, long-term data blocks and copies them to a dedicated partition spread across all the disks in the array. When new disks are added, CRAID only needs to extend this process to the new devices to redistribute this partition, thus greatly reducing the overhead of the upgrade process. In addition, the reorganized access patterns within this partition improve the array’s performance, amortizing the copy overhead and allowing CRAID to offer a performance competitive with traditional RAIDs.
We describe CRAID’s motivation and design and we evaluate it by replaying seven real-world workloads including a file server, a web server and a user share. Our experiments show that CRAID can successfully detect hot data variations and begin using new disks as soon as they are added to the array. Also, the usage of a dedicated
partition improves the sequentiality of relevant data access, which amortizes the cost of reorganizations. Finally, we prove that a full-HDD CRAID array with a small distributed partition (<1.28% per disk) can compete in performance with an ideally restriped RAID-5 and a hybrid RAID-5 with a small SSD cache.Peer ReviewedPostprint (published version
FastDepth: Fast Monocular Depth Estimation on Embedded Systems
Depth sensing is a critical function for robotic tasks such as localization,
mapping and obstacle detection. There has been a significant and growing
interest in depth estimation from a single RGB image, due to the relatively low
cost and size of monocular cameras. However, state-of-the-art single-view depth
estimation algorithms are based on fairly complex deep neural networks that are
too slow for real-time inference on an embedded platform, for instance, mounted
on a micro aerial vehicle. In this paper, we address the problem of fast depth
estimation on embedded systems. We propose an efficient and lightweight
encoder-decoder network architecture and apply network pruning to further
reduce computational complexity and latency. In particular, we focus on the
design of a low-latency decoder. Our methodology demonstrates that it is
possible to achieve similar accuracy as prior work on depth estimation, but at
inference speeds that are an order of magnitude faster. Our proposed network,
FastDepth, runs at 178 fps on an NVIDIA Jetson TX2 GPU and at 27 fps when using
only the TX2 CPU, with active power consumption under 10 W. FastDepth achieves
close to state-of-the-art accuracy on the NYU Depth v2 dataset. To the best of
the authors' knowledge, this paper demonstrates real-time monocular depth
estimation using a deep neural network with the lowest latency and highest
throughput on an embedded platform that can be carried by a micro aerial
vehicle.Comment: Accepted for presentation at ICRA 2019. 8 pages, 6 figures, 7 table
Don't Thrash: How to Cache Your Hash on Flash
This paper presents new alternatives to the well-known Bloom filter data
structure. The Bloom filter, a compact data structure supporting set insertion
and membership queries, has found wide application in databases, storage
systems, and networks. Because the Bloom filter performs frequent random reads
and writes, it is used almost exclusively in RAM, limiting the size of the sets
it can represent. This paper first describes the quotient filter, which
supports the basic operations of the Bloom filter, achieving roughly comparable
performance in terms of space and time, but with better data locality.
Operations on the quotient filter require only a small number of contiguous
accesses. The quotient filter has other advantages over the Bloom filter: it
supports deletions, it can be dynamically resized, and two quotient filters can
be efficiently merged. The paper then gives two data structures, the buffered
quotient filter and the cascade filter, which exploit the quotient filter
advantages and thus serve as SSD-optimized alternatives to the Bloom filter.
The cascade filter has better asymptotic I/O performance than the buffered
quotient filter, but the buffered quotient filter outperforms the cascade
filter on small to medium data sets. Both data structures significantly
outperform recently-proposed SSD-optimized Bloom filter variants, such as the
elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom
filter. In experiments, the cascade filter and buffered quotient filter
performed insertions 8.6-11 times faster than the fastest Bloom filter variant
and performed lookups 0.94-2.56 times faster.Comment: VLDB201
- …