6,923 research outputs found
A performance model of speculative prefetching in distributed information systems
Previous studies in speculative prefetching focus on building and evaluating access models for the purpose of access prediction. This paper investigates a complementary area which has been largely ignored, that of performance modelling. We use improvement in access time as the performance metric, for which we derive a formula in terms of resource parameters (time available and time required for prefetching) and speculative parameters (probabilities for next access). The performance maximization problem is expressed as a stretch knapsack problem. We develop an algorithm to maximize the improvement in access time by solving the stretch knapsack problem, using theoretically proven apparatus to reduce the search space. Integration between speculative prefetching and caching is also investigated, albeit under the assumption of equal item sizes.
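The abstract's objective can be sketched concretely. A minimal illustration, with names and the greedy selection rule chosen by us (the paper solves a stretch knapsack problem exactly; a density-ordered greedy pass merely stands in for it): each candidate item has an access probability, a prefetch cost, and an access-time saving, and we maximize the expected improvement under a time budget.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (names are ours, not the paper's): candidate item i
 * has access probability p[i], prefetch cost t[i] (time required), and
 * saving s[i] (access time avoided on a hit). Expected improvement of a
 * chosen set = sum of p[i]*s[i] over chosen i, subject to the total
 * prefetch time staying within the budget (time available). */
double expected_improvement(const double *p, const double *s,
                            const int *chosen, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        if (chosen[i]) total += p[i] * s[i];
    return total;
}

/* Greedy stand-in for the paper's stretch-knapsack solver: repeatedly
 * pick the feasible item with the highest p*s/t density. */
double greedy_prefetch(const double *p, const double *s, const double *t,
                       int *chosen, size_t n, double budget) {
    for (size_t i = 0; i < n; i++) chosen[i] = 0;
    double used = 0.0;
    for (;;) {
        int best = -1;
        double best_density = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (chosen[i] || used + t[i] > budget) continue;
            double d = p[i] * s[i] / t[i];
            if (d > best_density) { best_density = d; best = (int)i; }
        }
        if (best < 0) break;
        chosen[best] = 1;
        used += t[best];
    }
    return expected_improvement(p, s, chosen, n);
}
```

The greedy pass is not optimal for knapsack in general, which is exactly why the paper develops a dedicated solver with search-space pruning.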
Optimization of Lattice QCD codes for the AMD Opteron processor
We report our experience of optimizing the lattice QCD codes for the new Opteron cluster at DESY Hamburg, including benchmarks. Details of the optimization using SSE/SSE2 instructions and the effective use of prefetch instructions are discussed.
Comment: 5 pages, 4 figures, espcrc2.cls, Proceedings of X International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2005), DESY Zeuthen, Germany, May 22-27, 2005
Bulk extractor windows prefetch decoder
scan_winprefetch is a thread-safe C++ Windows prefetch scanner for the bulk extractor framework that decodes prefetch files. The decoder analyzes disk images for Windows prefetch files; after analyzing each prefetch file found on the disk image, it creates a text file containing XML output detailing all found prefetch files.
Approved for public release; distribution is unlimited.
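The first step such a decoder performs can be illustrated in a few lines: a Windows prefetch (.pf) file carries a format-version word at offset 0 and the ASCII signature "SCCA" at offset 4. A minimal header check, assuming only that layout (the real scanner decodes far more fields):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch of a prefetch-file header check: offset 0 holds the
 * format version, offset 4 holds the ASCII signature "SCCA". This only
 * validates the header; scan_winprefetch itself decodes full records. */
int looks_like_prefetch(const uint8_t *buf, size_t len) {
    if (len < 8) return 0;                 /* header too short */
    return memcmp(buf + 4, "SCCA", 4) == 0;
}
```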
Software prefetching for software pipelined loops
The paper investigates the interaction between software pipelining and different software prefetching techniques for VLIW machines. It is shown that processor stalls due to memory dependencies have a great impact on execution time. A novel heuristic is proposed and shown to outperform previous proposals.
Peer Reviewed. Postprint (published version)
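The basic idea of software prefetching in a loop can be sketched as follows, with GCC/Clang's `__builtin_prefetch` playing the role of a prefetch instruction; the prefetch distance here is our choice, not the paper's (the paper's heuristic tunes where prefetches land within the software-pipelined schedule):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of software prefetching ahead of the use point. PF_DIST is the
 * iteration distance between the prefetch and the load it covers; it is
 * an illustrative value, not a tuned one. __builtin_prefetch is a hint
 * and a no-op where unsupported, so results are unchanged either way. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* keep in cache */);
        sum += a[i];
    }
    return sum;
}
```

On a VLIW machine the compiler would additionally overlap these prefetches with computation across pipelined iterations, which is the interaction the paper studies.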
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016
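The core mechanism can be sketched in a CPU-only toy: a layer's feature map is offloaded from GPU memory to host memory once the forward pass no longer needs it, and prefetched back before the backward pass does. This is a heavily simplified stand-in (vDNN overlaps `cudaMemcpyAsync` transfers on a separate stream with computation; plain buffers and `memcpy` substitute here):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for vDNN's offload/prefetch of a layer's feature map.
 * "gpu" is an ordinary heap buffer standing in for device memory,
 * "host" for the pinned host backing store. */
typedef struct {
    float *gpu;    /* stand-in for the device-side buffer           */
    float *host;   /* host backing store for the offloaded copy     */
    size_t n;      /* number of elements in the feature map         */
    int resident;  /* 1 if the feature map is currently in "GPU" memory */
} FeatureMap;

/* After the forward pass: copy out and release the "GPU" buffer. */
void offload(FeatureMap *f) {
    memcpy(f->host, f->gpu, f->n * sizeof(float));
    free(f->gpu);
    f->gpu = NULL;
    f->resident = 0;
}

/* Before the backward pass needs it: reallocate and copy back. */
void prefetch_back(FeatureMap *f) {
    f->gpu = malloc(f->n * sizeof(float));
    memcpy(f->gpu, f->host, f->n * sizeof(float));
    f->resident = 1;
}
```

The memory saving comes from many layers sharing the scarce "GPU" space while their feature maps sit in the larger host pool, at the cost of the transfer latency the paper measures as the 18% slowdown.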
Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi
Graph500 is a data-intensive application for high performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. Thus, the Xeon Phi allows for more efficient parallelization of Graph500 that is combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi.
The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557). Peer Reviewed. Postprint (published version)
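The irregular access pattern in question looks like the gather loop below: values are fetched through an index array, which defeats ordinary unit-stride vector loads. On the Xeon Phi the inner load can become a vector gather instruction; here, a portable sketch adds a software prefetch of upcoming gather targets (the distance of 8 is our illustrative choice), which the paper combines with vectorization:

```c
#include <assert.h>
#include <stddef.h>

/* Portable sketch of the indexed "gather" pattern at the heart of
 * Graph500's BFS. The prefetch covers the target of a gather 8
 * iterations ahead; __builtin_prefetch is a hint, so the computed
 * result is identical with or without it. */
long gather_sum(const long *values, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&values[idx[i + 8]], 0 /* read */, 1 /* low locality */);
        sum += values[idx[i]];
    }
    return sum;
}
```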
Evaluation of the Cedar memory system: Configuration of 16 by 16
Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16-processor, 16-memory-bank system configuration show the behavior of the Cedar system under different modes of operation.
Improving Mobile Video Streaming with Mobility Prediction and Prefetching in Integrated Cellular-WiFi Networks
We present and evaluate a procedure that utilizes mobility and throughput
prediction to prefetch video streaming data in integrated cellular and WiFi
networks. The effective integration of such heterogeneous wireless technologies
will be significant for supporting high performance and energy efficient video
streaming in ubiquitous networking environments. Our evaluation is based on
trace-driven simulation considering empirical measurements and shows how
various system parameters influence the performance, in terms of the number of
paused video frames and the energy consumption; these parameters include the
number of video streams, the mobile, WiFi, and ADSL backhaul throughput, and
the number of WiFi hotspots. Also, we assess the procedure's robustness to time
and throughput variability. Finally, we present our initial prototype that
implements the proposed approach.
Comment: 7 pages, 15 figures
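The prefetching decision described above reduces to simple arithmetic; a back-of-envelope sketch, with all names and constants ours rather than the paper's: given a predicted dwell time at the next WiFi hotspot and the predicted WiFi throughput, how many seconds of video can be fetched there, capped by the client's playout buffer?

```c
#include <assert.h>

/* Illustrative planning rule (our simplification of the paper's
 * procedure): seconds of video prefetchable at a hotspot =
 * dwell time * (WiFi throughput / video bitrate), capped by the
 * buffer. More prefetched seconds mean fewer paused frames and less
 * time on the costlier cellular link. */
double prefetchable_seconds(double dwell_s, double wifi_mbps,
                            double video_mbps, double buffer_cap_s) {
    double secs = dwell_s * (wifi_mbps / video_mbps);
    return secs > buffer_cap_s ? buffer_cap_s : secs;
}
```

The trace-driven evaluation in the paper effectively sweeps these parameters (throughput of the mobile, WiFi, and ADSL backhaul links, number of hotspots and streams) and measures paused frames and energy rather than this single quantity.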
