Web Caching and Prefetching with Cyclic Model Analysis of Web Object Sequences
Web caching is the process in which web objects are temporarily stored to reduce bandwidth consumption, server load and latency. Web prefetching is the process of fetching web objects from the server before they are actually requested by the client. Integrating caching and prefetching can be very beneficial, as the two techniques can support each other. By implementing this integrated scheme in a client-side proxy, the perceived latency can be reduced for not one but many users. In this paper, we propose a new integrated caching and prefetching policy called WCP-CMA, which uses a profit-driven caching policy that takes into account the periodicity and cyclic behaviour of web access sequences for deriving prefetching rules. Our experimental results show a 10%-15% increase in the hit ratio of cached objects and a 5%-10% decrease in delay compared to the existing schemes.
Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware.
This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model.
This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP -- an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms.
Prefetching techniques for client server object-oriented database systems
The performance of many object-oriented database applications suffers from page fetch latency, which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate the delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high, the total elapsed time of an application can be reduced significantly; if it is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks degrades overall system performance. Access patterns of object-oriented databases are often complex and usually hard to predict accurately. The ..
Exploring the potential for accelerating sparse matrix-vector product on a Processing-in-Memory architecture
As the impact of memory access delays on performance has mushroomed over the past few decades, researchers have begun exploring Processing-in-Memory (PIM) technology, which offers higher memory bandwidth, lower memory latency, and lower power consumption. In this study, we investigate whether an emerging PIM design from Sandia National Laboratories can boost performance for sparse matrix-vector product (SMVP). While SMVP is, in the best case, bandwidth-bound, factors related to matrix structure and representation also limit performance. We analyze SMVP in the context of both an AMD Opteron processor and the Sandia PIM, exploring the performance limiters for each and the degree to which these can be ameliorated by data and code transformations. Over a range of sparse matrices, SMVP on the PIM outperformed the Opteron by a factor of 1.82. On the PIM, computational kernel and data structure transformations improved performance by almost 40% over conventional implementations using compressed sparse row format.
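As a point of reference for the compressed sparse row (CSR) format mentioned above, here is a minimal sketch of a conventional CSR matrix-vector product. It is not the kernel evaluated in the study; the function name `csr_matvec` and the array names `values`, `col_idx`, and `row_ptr` are illustrative assumptions.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    (All names are illustrative, not taken from the study.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The indirect read x[col_idx[k]] is the irregular access that
        # depends on the matrix structure.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = csr_matvec(values, col_idx, row_ptr, x)
```

Each nonzero contributes one multiply-add but requires reading a value, a column index, and an entry of `x`, which is one way to see why the kernel tends to be limited by memory traffic rather than arithmetic.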
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
Exploiting Data Skew for Improved Query Performance
Analytic queries enable sophisticated large-scale data analysis within many
commercial, scientific and medical domains today. Data skew is a ubiquitous
feature of these real-world domains. In a retail database, some products are
typically much more popular than others. In a text database, word frequencies
follow a Zipf distribution with a small number of very common words, and a long
tail of infrequent words. In a geographic database, some regions have much
higher populations (and data measurements) than others. Current systems do not
make the most of caches for exploiting skew. In particular, a whole cache line
may remain cache resident even though only a small part of the cache line
corresponds to a popular data item. In this paper, we propose a novel index
structure for repositioning data items to concentrate popular items into the
same cache lines. The net result is better spatial locality, and better
utilization of limited cache resources. We develop a theoretical model for
analyzing the cache behavior, and implement database operators that are
efficient in the presence of skew. Our experiments on real and synthetic data
show that exploiting skew can significantly improve in-memory query
performance. In some cases, our techniques can speed up queries by over an
order of magnitude
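The effect of concentrating popular items into the same cache lines can be illustrated with a toy model. This is only a sketch under stated assumptions, not the index structure proposed in the paper: access frequencies follow a Zipf distribution with exponent 1, a cache line holds 8 items, and the helper names `zipf_weights` and `lines_needed` are hypothetical.

```python
def zipf_weights(n, s=1.0):
    """Normalized Zipf access probabilities for items ranked 1..n."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def lines_needed(layout, weights, items_per_line, coverage):
    """Fewest cache lines whose items account for `coverage` of accesses.

    layout[slot] is the item stored at that slot; consecutive groups of
    `items_per_line` slots share one cache line.
    """
    n_lines = (len(layout) + items_per_line - 1) // items_per_line
    line_weight = [0.0] * n_lines
    for slot, item in enumerate(layout):
        line_weight[slot // items_per_line] += weights[item]
    covered, count = 0.0, 0
    for w in sorted(line_weight, reverse=True):
        if covered >= coverage:
            break
        covered += w
        count += 1
    return count


n_items, items_per_line = 1024, 8
n_lines = n_items // items_per_line
weights = zipf_weights(n_items)

# Clustered layout: items stored in popularity order, so the most
# popular items share cache lines.
clustered = list(range(n_items))

# Scattered layout: each of the most popular items lands in a
# different cache line, surrounded by unpopular items.
scattered = [(slot % items_per_line) * n_lines + slot // items_per_line
             for slot in range(n_items)]

hot_lines_clustered = lines_needed(clustered, weights, items_per_line, 0.5)
hot_lines_scattered = lines_needed(scattered, weights, items_per_line, 0.5)
```

In this model the clustered layout covers half of all accesses with far fewer cache lines than the scattered one, which is the spatial-locality benefit the abstract describes.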
A Survey of Techniques for Architecting TLBs
The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used
in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently
and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance
and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and
managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and
distinctions. We believe that this paper will be useful for chip designers, computer architects and system
engineers.
Mixed Speculative Multithreaded Execution Models
Institute for Computing Systems Architecture
The current trend toward chip multiprocessor architectures has placed great pressure
on programmers and compilers to generate thread-parallel programs. Improved execution
performance can no longer be obtained via traditional single-thread instruction
level parallelism (ILP), but, instead, via multithreaded execution. One notable technique
that facilitates the extraction of parallel threads from sequential applications is
thread-level speculation (TLS). This technique allows programmers/compilers to generate
threads without checking for inter-thread data and control dependences, which
are then transparently enforced by the hardware. Most prior work on TLS has concentrated
on thread selection and mechanisms to efficiently support the main TLS operations,
such as squashes, data versioning, and commits.
This thesis seeks to enhance TLS functionality by combining it with other speculative
multithreaded execution models. The main idea is that TLS already requires
extensive hardware support, which when slightly augmented can accommodate other
speculative multithreaded techniques. Recognizing that for different applications, or
even program phases, the application bottlenecks may be different, it is reasonable to
assume that the more versatile a system is, the more efficiently it will be able to execute
the given program.
As mentioned above, generating thread-parallel programs is hard and TLS has
been suggested as an execution model that can speculatively exploit thread-level parallelism
(TLP) even when thread independence cannot be guaranteed by the programmer/
compiler. Alternatively, the helper threads (HT) execution model has been proposed
where subordinate threads are executed in parallel with a main thread in order to
improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model,
runahead execution (RA), has also been proposed where subordinate versions of the
main thread are dynamically created especially to cope with long-latency operations,
again with the aim of improving the execution efficiency of the main thread (ILP).
Each one of these multithreaded execution models works best for different applications
and application phases. We combine these three models into a single execution
model and single hardware infrastructure such that the system can dynamically adapt
to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads
(i.e., TLP) is possible and the system can seamlessly transition at run-time to the other
models otherwise. In order to understand the tradeoffs involved, we also develop a performance
model that allows one to quantitatively attribute overall performance gains
to either TLP or ILP in such a combined multithreaded execution model.
Experimental results show that our combined execution model achieves speedups
of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system
and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead
execution for a subset of the SPEC2000 Integer benchmark suite.
We then investigate how a common ILP-enhancing microarchitectural feature, namely
branch prediction, interacts with TLS. We show that branch prediction for TLS is even
more important than it is for single core machines. Unfortunately, branch prediction for
TLS systems is also inherently harder. Code partitioning and re-executions of squashed
threads pollute the branch history making it harder for predictors to be accurate.
We thus propose to augment the hardware, so as to accommodate Multi-Path (MP)
execution within the existing TLS protocol. Under the MP execution model, all paths
following a number of hard-to-predict conditional branches are followed. MP execution
thus removes branches that would otherwise have been mispredicted, which helps
the processor exploit more ILP. We show that with only minimal hardware
support, one can combine these two execution models into a unified one, which can
achieve far better performance than both TLS and MP execution.
Experimental results show that our combined execution model achieves speedups of
up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and
speedups of up to 125%, with an average of 29.0%, when compared with multi-path
execution for a subset of the SPEC2000 Integer benchmark suite.
Finally, since systems that support speculative multithreading usually treat all
threads equally, they are energy-inefficient. This inefficiency stems from the fact that
speculation occasionally fails and, thus, power is spent on threads that will have to
be discarded. We propose a profitability-based power allocation scheme, where we
“steal” power from non-profitable threads and use it to speed up more useful ones. We
evaluate our techniques for a state-of-the-art TLS system and show that, with minimal hardware support, we achieve improvements in ED of up to 25.5% with an average of
18.9%, for a subset of the SPEC 2000 Integer benchmark suite.
An accurate prefetching policy for object oriented systems
PhD Thesis
In the latest high-performance computers, there is a growing requirement for
accurate prefetching (AP) methodologies for advanced object management schemes
in virtual memory and migration systems. The major issue in achieving this goal is
finding a simple way of accurately predicting the objects that will be referenced in
the near future and grouping them so that they can be fetched at the same time. The
basic notion of AP involves building a relationship for logically grouping related
objects and prefetching them, rather than using their physical grouping and relying on
demand fetching, as is done in existing restructuring or grouping schemes. By this,
AP tries to overcome some of the shortcomings posed by physical grouping methods.
Prefetching also makes use of the properties of object-oriented languages to
build inter- and intra-object relationships as a means of logical grouping. This thesis
describes how this relationship can be established at compile time and how it can be
used for accurate object prefetching in virtual memory systems. In addition, AP
performs control flow and data dependency analysis to reinforce the relationships and
to find the dependencies of a program. The user program is decomposed into
prefetching blocks which contain all the information needed for block prefetching such
as long branches and function calls at major branch points.
The proposed prefetching scheme is implemented by extending a C++
compiler and evaluated on a virtual memory simulator. The results show a significant
reduction in both the number of page faults and memory pollution. In particular, AP
can suppress many page faults that occur during transition phases, which are
unmanageable by other ways of fetching. AP can be applied to local and distributed
virtual memory systems to reduce the fault rate by fetching groups of objects at the
same time, consequently lessening operating system overheads.
British Council
Space-Efficient Predictive Block Management
With growing disk and storage capacities, tracking the metadata required for all blocks in a system becomes a daunting task by itself. In previous work, we have demonstrated a system software effort in the area of predictive data grouping for reducing power and latency on hard disks. The structures used, very similar to prior efforts in prefetching and prefetch caching, track access successor information at the block level, keeping a fixed number of immediate successors per block. While these structures provide powerful predictive expansion capabilities and need less metadata than many previous strategies, there remains a growing concern about how much data is actually required. In this paper, we present a novel method of storing equivalent information, SESH, a Space Efficient Storage of Heredity. This method utilizes the high amount of block-level predictability observed in a number of workload trace sets to reduce the overall metadata storage by up to 99% without any loss of information. As a result, we are able to provide a predictive tool that is adaptive, accurate, and robust in the face of workload noise, for a tiny fraction of the metadata cost previously anticipated; in some cases, reducing the required size from 12 gigabytes to less than 150 megabytes.
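The baseline structure described above, a fixed number of immediate successors tracked per block, can be sketched as follows. This is not SESH itself, whose encoding the abstract does not describe; the class name `SuccessorTable`, the default of two successors per block, and the least-recently-seen replacement policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class SuccessorTable:
    """Track up to `max_successors` immediate successors per block.

    Hypothetical sketch: the oldest successor is evicted when a block's
    list is full (a replacement policy assumed here, not taken from the
    paper).
    """

    def __init__(self, max_successors=2):
        self.max_successors = max_successors
        self.successors = {}  # block -> OrderedDict, most recent last
        self.prev = None      # previously accessed block, if any

    def record(self, block):
        """Record an access and update the previous block's successors."""
        if self.prev is not None:
            succ = self.successors.setdefault(self.prev, OrderedDict())
            succ.pop(block, None)   # re-inserting moves it to the end
            succ[block] = True
            while len(succ) > self.max_successors:
                succ.popitem(last=False)  # evict the oldest successor
        self.prev = block

    def predict(self, block):
        """Return the known successors of `block`, most recent first."""
        return list(reversed(self.successors.get(block, ())))


table = SuccessorTable(max_successors=2)
for blk in [1, 2, 3, 1, 2, 4, 1, 5]:
    table.record(blk)
after_first_trace = table.predict(1)

for blk in [1, 6]:
    table.record(blk)
after_second_trace = table.predict(1)
```

With one small table entry per block, metadata grows linearly with the number of blocks tracked, which is the cost the paper sets out to reduce.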