6,923 research outputs found
A performance model of speculative prefetching in distributed information systems
Previous studies in speculative prefetching focus on building and evaluating access models for the purpose of access prediction. This paper investigates a complementary area which has been largely ignored, that of performance modelling. We use improvement in access time as the performance metric, for which we derive a formula in terms of resource parameters (time available and time required for prefetching) and speculative parameters (probabilities for next access). The performance maximization problem is expressed as a stretch knapsack problem. We develop an algorithm to maximize the improvement in access time by solving the stretch knapsack problem, using theoretically proven apparatus to reduce the search space. Integration between speculative prefetching and caching is also investigated, albeit under the assumption of equal item sizes.
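The abstract's objective can be sketched concretely. A minimal illustration, with names and the greedy selection rule chosen by us (the paper solves a stretch knapsack problem exactly; a density-ordered greedy pass merely stands in for it): each candidate item has an access probability, a prefetch cost, and an access-time saving, and we maximize the expected improvement under a time budget.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (names are ours, not the paper's): candidate item i
 * has access probability p[i], prefetch cost t[i] (time required), and
 * saving s[i] (access time avoided on a hit). Expected improvement of a
 * chosen set = sum of p[i]*s[i] over chosen i, subject to the total
 * prefetch time staying within the budget (time available). */
double expected_improvement(const double *p, const double *s,
                            const int *chosen, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        if (chosen[i]) total += p[i] * s[i];
    return total;
}

/* Greedy stand-in for the paper's stretch-knapsack solver: repeatedly
 * pick the feasible item with the highest p*s/t density. */
double greedy_prefetch(const double *p, const double *s, const double *t,
                       int *chosen, size_t n, double budget) {
    for (size_t i = 0; i < n; i++) chosen[i] = 0;
    double used = 0.0;
    for (;;) {
        int best = -1;
        double best_density = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (chosen[i] || used + t[i] > budget) continue;
            double d = p[i] * s[i] / t[i];
            if (d > best_density) { best_density = d; best = (int)i; }
        }
        if (best < 0) break;
        chosen[best] = 1;
        used += t[best];
    }
    return expected_improvement(p, s, chosen, n);
}
```

The greedy pass is not optimal for knapsack in general, which is exactly why the paper develops a dedicated solver with search-space pruning.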
Optimization of Lattice QCD codes for the AMD Opteron processor
We report our experience of optimizing the lattice QCD codes for the new Opteron cluster at DESY Hamburg, including benchmarks. Details of the optimization using SSE/SSE2 instructions and the effective use of prefetch instructions are discussed.
Comment: 5 pages, 4 figures, espcrc2.cls, Proceedings of X International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2005), DESY Zeuthen, Germany, May 22-27, 2005
Bulk extractor windows prefetch decoder
scan_winprefetch is a thread-safe C++ Windows prefetch scanner for the bulk extractor framework that decodes prefetch files. The decoder analyzes disk images for Windows prefetch files; after analyzing each prefetch file found on the disk image, it creates a text file containing XML output detailing all found prefetch files.
Approved for public release; distribution is unlimited.
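The first step such a decoder performs can be illustrated in a few lines: a Windows prefetch (.pf) file carries a format-version word at offset 0 and the ASCII signature "SCCA" at offset 4. A minimal header check, assuming only that layout (the real scanner decodes far more fields):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch of a prefetch-file header check: offset 0 holds the
 * format version, offset 4 holds the ASCII signature "SCCA". This only
 * validates the header; scan_winprefetch itself decodes full records. */
int looks_like_prefetch(const uint8_t *buf, size_t len) {
    if (len < 8) return 0;                 /* header too short */
    return memcmp(buf + 4, "SCCA", 4) == 0;
}
```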
Software prefetching for software pipelined loops
The paper investigates the interaction between software pipelining and different software prefetching techniques for VLIW machines. It is shown that processor stalls due to memory dependencies have a great impact on execution time. A novel heuristic is proposed and shown to outperform previous proposals.
Peer Reviewed. Postprint (published version)
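The basic idea of software prefetching in a loop can be sketched as follows, with GCC/Clang's `__builtin_prefetch` playing the role of a prefetch instruction; the prefetch distance here is our choice, not the paper's (the paper's heuristic tunes where prefetches land within the software-pipelined schedule):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of software prefetching ahead of the use point. PF_DIST is the
 * iteration distance between the prefetch and the load it covers; it is
 * an illustrative value, not a tuned one. __builtin_prefetch is a hint
 * and a no-op where unsupported, so results are unchanged either way. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* keep in cache */);
        sum += a[i];
    }
    return sum;
}
```

On a VLIW machine the compiler would additionally overlap these prefetches with computation across pipelined iterations, which is the interaction the paper studies.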
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016
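The core mechanism can be sketched in a CPU-only toy: a layer's feature map is offloaded from GPU memory to host memory once the forward pass no longer needs it, and prefetched back before the backward pass does. This is a heavily simplified stand-in (vDNN overlaps `cudaMemcpyAsync` transfers on a separate stream with computation; plain buffers and `memcpy` substitute here):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for vDNN's offload/prefetch of a layer's feature map.
 * "gpu" is an ordinary heap buffer standing in for device memory,
 * "host" for the pinned host backing store. */
typedef struct {
    float *gpu;    /* stand-in for the device-side buffer           */
    float *host;   /* host backing store for the offloaded copy     */
    size_t n;      /* number of elements in the feature map         */
    int resident;  /* 1 if the feature map is currently in "GPU" memory */
} FeatureMap;

/* After the forward pass: copy out and release the "GPU" buffer. */
void offload(FeatureMap *f) {
    memcpy(f->host, f->gpu, f->n * sizeof(float));
    free(f->gpu);
    f->gpu = NULL;
    f->resident = 0;
}

/* Before the backward pass needs it: reallocate and copy back. */
void prefetch_back(FeatureMap *f) {
    f->gpu = malloc(f->n * sizeof(float));
    memcpy(f->gpu, f->host, f->n * sizeof(float));
    f->resident = 1;
}
```

The memory saving comes from many layers sharing the scarce "GPU" space while their feature maps sit in the larger host pool, at the cost of the transfer latency the paper measures as the 18% slowdown.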
Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi
Graph500 is a data-intensive application for high performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. Thus, the Xeon Phi allows for more efficient parallelization of Graph500 that is combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi.
The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557). Peer Reviewed. Postprint (published version)
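The irregular access pattern in question looks like the gather loop below: values are fetched through an index array, which defeats ordinary unit-stride vector loads. On the Xeon Phi the inner load can become a vector gather instruction; here, a portable sketch adds a software prefetch of upcoming gather targets (the distance of 8 is our illustrative choice), which the paper combines with vectorization:

```c
#include <assert.h>
#include <stddef.h>

/* Portable sketch of the indexed "gather" pattern at the heart of
 * Graph500's BFS. The prefetch covers the target of a gather 8
 * iterations ahead; __builtin_prefetch is a hint, so the computed
 * result is identical with or without it. */
long gather_sum(const long *values, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&values[idx[i + 8]], 0 /* read */, 1 /* low locality */);
        sum += values[idx[i]];
    }
    return sum;
}
```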
Evaluation of the Cedar memory system: Configuration of 16 by 16
Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16-processor, 16-memory-bank system configuration show the behavior of the Cedar system under different modes of operation.
Improving Mobile Video Streaming with Mobility Prediction and Prefetching in Integrated Cellular-WiFi Networks
We present and evaluate a procedure that utilizes mobility and throughput
prediction to prefetch video streaming data in integrated cellular and WiFi
networks. The effective integration of such heterogeneous wireless technologies
will be significant for supporting high performance and energy efficient video
streaming in ubiquitous networking environments. Our evaluation is based on
trace-driven simulation considering empirical measurements and shows how
various system parameters influence the performance, in terms of the number of
paused video frames and the energy consumption; these parameters include the
number of video streams, the mobile, WiFi, and ADSL backhaul throughput, and
the number of WiFi hotspots. Also, we assess the procedure's robustness to time
and throughput variability. Finally, we present our initial prototype that
implements the proposed approach.
Comment: 7 pages, 15 figures
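The prefetching decision described above reduces to simple arithmetic; a back-of-envelope sketch, with all names and constants ours rather than the paper's: given a predicted dwell time at the next WiFi hotspot and the predicted WiFi throughput, how many seconds of video can be fetched there, capped by the client's playout buffer?

```c
#include <assert.h>

/* Illustrative planning rule (our simplification of the paper's
 * procedure): seconds of video prefetchable at a hotspot =
 * dwell time * (WiFi throughput / video bitrate), capped by the
 * buffer. More prefetched seconds mean fewer paused frames and less
 * time on the costlier cellular link. */
double prefetchable_seconds(double dwell_s, double wifi_mbps,
                            double video_mbps, double buffer_cap_s) {
    double secs = dwell_s * (wifi_mbps / video_mbps);
    return secs > buffer_cap_s ? buffer_cap_s : secs;
}
```

The trace-driven evaluation in the paper effectively sweeps these parameters (throughput of the mobile, WiFi, and ADSL backhaul links, number of hotspots and streams) and measures paused frames and energy rather than this single quantity.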
