Characterizing Deep-Learning I/O Workloads in TensorFlow

Author: Chien Steven W. D.
Herman Pawel
Laure Erwin
Markidis Stefano
Narasimhamurthy Sai
Santos Luis
Sishtla Chaitanya Prasad
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 06/10/2018

The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. During training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on overall performance. The use of a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment. Comment: Accepted for publication at pdsw-DISCS 201
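The burst-buffer idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BurstBufferCheckpointer`, the file naming, and the use of two temporary directories to stand in for the fast and slow storage tiers are all assumptions made for illustration. The point it shows is that only the write to the fast tier sits on the caller's critical path, while the copy to the slow tier runs in a background thread.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class BurstBufferCheckpointer:
    """Hypothetical helper: write to fast storage, drain to slow storage in the background."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = Path(fast_dir)
        self.slow_dir = Path(slow_dir)
        # A single worker keeps copies ordered and off the calling thread.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def save(self, step, payload):
        # Synchronous part: only the fast tier is on the critical path.
        fast_path = self.fast_dir / f"ckpt-{step}.bin"
        fast_path.write_bytes(payload)
        # Asynchronous part: copy to the slow tier without blocking the caller.
        target = self.slow_dir / fast_path.name
        self._pending.append(self._pool.submit(shutil.copy2, fast_path, target))
        return fast_path

    def flush(self):
        # Wait for outstanding copies; result() re-raises any copy error.
        for future in self._pending:
            future.result()
        self._pending.clear()

    def close(self):
        self.flush()
        self._pool.shutdown()


def demo():
    # Two temporary directories stand in for the fast and slow tiers (assumption).
    with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
        ckpt = BurstBufferCheckpointer(fast, slow)
        for step in range(3):
            ckpt.save(step, f"weights at step {step}".encode())
        ckpt.close()
        return sorted(p.name for p in Path(slow).iterdir())
```

Calling `demo()` writes three checkpoints and returns the file names that reached the slow tier after the background copies have drained. A real setup would also need a policy for evicting old checkpoints from the small fast tier, which this sketch leaves out.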

arXiv.org e-Print Archive

Crossref

Circuit-switch architecture for a 30/20-GHz FDMA/TDM geostationary satellite communications network

Author: Ivancic William D.

A circuit switching architecture is described for a 30/20 GHz frequency division multiple access uplink/time division multiplexed downlink (FDMA/TDM) geostationary satellite communications network. Critical subsystems and problem areas are identified and addressed. Work was concentrated primarily on the space segment; however, the ground segment was considered concurrently to ensure cost efficiency and realistic operational constraints.

NASA Technical Reports Server

Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

Author: Gioiosa Roberto
Kestor Gokcen
Laure Erwin
Markidis Stefano
Peng Ivy Bo
Publication date: 03/08/2017

Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement. Comment: The 46th International Conference on Parallel Processing (ICPP-2017)
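As a rough analogy for the decoupling strategy described above, the sketch below separates a reduce operation from the compute work and connects the two groups through a stream. It uses Python threads and a queue rather than MPI processes, so it is not the authors' implementation; the function name `run_decoupled` and its parameters are hypothetical. It only shows the dataflow shape: the compute group emits partial results as they become available, and a dedicated reducer consumes them concurrently instead of every worker taking part in the reduction.

```python
import queue
import threading


def run_decoupled(num_workers=4, items_per_worker=5):
    """Hypothetical sketch: compute group streams values to a separate reduce group."""
    channel = queue.Queue()
    totals = []

    def compute_worker(rank):
        # Compute group: produce partial results and stream them out immediately.
        for i in range(items_per_worker):
            channel.put(rank * items_per_worker + i)
        channel.put(None)  # sentinel: this worker has finished

    def reduce_worker():
        # Reduce group: consume the stream while computation is still running.
        total, finished = 0, 0
        while finished < num_workers:
            item = channel.get()
            if item is None:
                finished += 1
            else:
                total += item
        totals.append(total)

    reducer = threading.Thread(target=reduce_worker)
    reducer.start()
    workers = [
        threading.Thread(target=compute_worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    reducer.join()
    return totals[0]
```

With the default arguments the workers emit the integers 0 through 19, so the reducer's total equals `sum(range(20))`. In an MPI setting the queue would be replaced by messages between two process groups, which is outside the scope of this sketch.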

arXiv.org e-Print Archive

Crossref

Destination directed packet switch architecture for a 30/20 GHz FDMA/TDM geostationary communication satellite network

Author: Ivancic William D.
Shalkhauser Mary Jo

Emphasis is on a destination directed packet switching architecture for a 30/20 GHz frequency division multiple access/time division multiplex (FDMA/TDM) geostationary satellite communication network. Critical subsystems and problem areas are identified and addressed. Efforts have concentrated heavily on the space segment; however, the ground segment was considered concurrently to ensure cost efficiency and realistic operational constraints.

NASA Technical Reports Server

Destination-directed, packet-switching architecture for 30/20-GHz FDMA/TDM geostationary communications satellite network

Author: Ivancic William D.
Shalkhauser Mary Jo

A destination-directed packet switching architecture for a 30/20-GHz frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary satellite communications network is discussed. Critical subsystems and problem areas are identified and addressed. Efforts have concentrated heavily on the space segment; however, the ground segment has been considered concurrently to ensure cost efficiency and realistic operational constraints.

NASA Technical Reports Server

Extended Bit-Plane Compression for Convolutional Neural Network Accelerators

Author: Benini Luca
Cavigelli Lukas
Publication date: 01/10/2018

After the tremendous success of convolutional neural networks in image classification, object detection, speech recognition, etc., there is now rising demand for deployment of these compute-intensive ML models on tightly power-constrained embedded and mobile systems at low cost, as well as for pushing the throughput in data centers. This has triggered a wave of research towards specialized hardware accelerators. Their performance is often constrained by I/O bandwidth, and the energy consumption is dominated by I/O transfers to off-chip memory. We introduce and evaluate a novel, hardware-friendly compression scheme for the feature maps present within convolutional neural networks. We show that an average compression ratio of 4.4x relative to uncompressed data and a gain of 60% over existing methods can be achieved for ResNet-34 with a compression block requiring <300 bits of sequential cells and minimal combinational logic.
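The abstract does not spell out the compression scheme, so the following is only a generic sketch of the bit-plane view that the title refers to, not the authors' method. The function names `to_bit_planes` and `from_bit_planes`, the 8-bit width, and the sample block are assumptions. It shows why a bit-plane layout can help for feature maps with many zeros and small magnitudes: the high-order planes become all-zero words, which are cheap to encode.

```python
def to_bit_planes(values, width=8):
    """Transpose a block of unsigned ints: bit j of plane i is bit i of values[j]."""
    planes = []
    for i in range(width):
        plane = 0
        for j, value in enumerate(values):
            plane |= ((value >> i) & 1) << j
        planes.append(plane)
    return planes


def from_bit_planes(planes, count):
    """Inverse of to_bit_planes for a block of `count` values."""
    values = []
    for j in range(count):
        value = 0
        for i, plane in enumerate(planes):
            value |= ((plane >> j) & 1) << i
        values.append(value)
    return values


# Hypothetical block of sparse, small-magnitude activations.
block = [0, 0, 3, 0, 1, 0, 0, 2]
planes = to_bit_planes(block)
zero_planes = sum(1 for plane in planes if plane == 0)
```

For this sample block only the two lowest planes are non-zero, so `zero_planes` is 6 out of 8, and `from_bit_planes(planes, len(block))` restores the original values. A practical encoder would follow the transposition with an entropy or run-length stage, which is not shown here.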

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Design and Implementation of an RNS-based 2D DWT Processor

Author: Lai Edmund M-K.
Liu Y.
Publication venue: Massey University.
Publication date: 01/02/2004

No abstract available

Massey Research Online