Reducing Cache Contention On GPUs
The use of Graphics Processing Units (GPUs) as application accelerators has become increasingly popular because, compared to traditional CPUs, they are more cost-effective, their highly parallel nature complements a CPU, and they are more energy efficient. With the popularity of GPUs, many GPU-based compute-intensive applications (a.k.a. GPGPUs) show significant performance improvements over traditional CPU-based implementations. Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance. However, in many cases the benefit of caches on GPUs is not significant, and in some cases caches are even detrimental. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention is the column-strided memory access pattern that many data-intensive GPU applications commonly generate. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not GPU hardware friendly, resulting in serialized access and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For the cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristic of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce the reuse frequency, of data. In a GPU, the pollution caused by data with a large reuse distance is significant. Memory request stalls are another contention factor. A stalled Load/Store (LDST) unit does not execute memory requests from any ready warps in the issue stage. This stall denies the ready warps their potential cache hits.
This dissertation proposes three novel architectural modifications to reduce the contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by column-strided access patterns, calculates the contending cache sets and locality information, and then caches selectively; 2) locality-aware selective caching dynamically calculates the reuse frequency with efficient hardware and caches based on that reuse frequency; and 3) memory request scheduling queues the memory requests from the warp issue stage, relieves the LDST unit stall, and schedules items from the queue to the LDST unit by probing the cache multiple times. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of these techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture.
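To illustrate why column-strided access patterns concentrate pressure on a few cache sets, the following minimal sketch maps the addresses issued by one warp to cache sets. The cache geometry (128-byte lines, 32 sets), the 32-thread warp, the 4-byte element size, and all function names are assumptions chosen for illustration; they are not taken from the dissertation.

```python
# Hypothetical cache geometry and warp size, chosen only for illustration.
LINE_BYTES = 128   # bytes per cache line (assumed)
NUM_SETS = 32      # number of cache sets (assumed)
WARP_SIZE = 32     # threads per hardware thread group (assumed)
ELEM_BYTES = 4     # size of one array element (assumed)


def cache_set(addr):
    """Set index of a byte address under simple modulo set indexing."""
    return (addr // LINE_BYTES) % NUM_SETS


def warp_footprint(base, stride_elems):
    """Return (distinct lines, distinct sets) touched by one warp-wide load.

    Thread t of the warp reads the element at index t * stride_elems.
    """
    addrs = [base + t * stride_elems * ELEM_BYTES for t in range(WARP_SIZE)]
    lines = {a // LINE_BYTES for a in addrs}
    sets = {cache_set(a) for a in addrs}
    return len(lines), len(sets)


# Unit stride: neighbouring threads read neighbouring elements.
row_major = warp_footprint(0, 1)
# Column stride: each thread reads the next row of a 1024-element-wide matrix.
col_strided = warp_footprint(0, 1024)
```

Under these assumed parameters the unit-stride load touches a single line, while the column-strided load touches 32 distinct lines that all map to the same set, so the lines compete for that set's few ways.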
Standard cell optimization and physical design in advanced technology nodes
Integrated circuits (ICs) are at the heart of modern electronics, which rely heavily on state-of-the-art semiconductor manufacturing technology. The key to pushing semiconductor technology forward is IC feature-size miniaturization. However, this brings ever-increasing design complexities and manufacturing challenges to the $340 billion semiconductor industry. The manufacturing of two-dimensional layouts on high-density metal layers depends on complex design-for-manufacturing techniques and sophisticated empirical optimizations, which introduce long turnaround times and substantial yield loss in advanced technology nodes. Our study reveals that unidirectional layout design, which is increasingly adopted in the semiconductor industry [61, 89], can significantly reduce manufacturing complexity and improve yield. The lithography printing of a unidirectional layout can be tightly controlled using advanced patterning techniques, such as self-aligned double and quadruple patterning. Despite the manufacturing benefits, a unidirectional layout leads to a more restrictive solution space and significantly affects the IC design automation flow for routing closure. Notably, unidirectional routing limits standard cell pin accessibility, which further exacerbates resource competition during routing. Moreover, for post-routing optimization, traditional redundant-via insertion has become obsolete under the unidirectional routing style, which makes the yield enhancement task extremely challenging. Regardless of complex multiple patterning and design-for-manufacturing approaches, mask optimization through resolution enhancement techniques remains the key strategy for improving the yield of semiconductor manufacturing processes. Among them, Sub-Resolution Assist Feature (SRAF) generation is a very important method for improving lithographic process windows.
Model-based SRAF generation has been widely used to achieve high accuracy, but it is time-consuming and struggles to produce consistent SRAFs. This dissertation proposes novel CAD algorithms and methodologies for standard cell optimization and physical design in advanced technology nodes, which ultimately reduce the design cycle and manufacturing cost of IC design. First, a standard cell pin access optimization engine is proposed to evaluate the pin accessibility of a given standard cell library. We further propose novel pin access planning techniques and concurrent pin access optimizations to efficiently resolve routing resource competition, which generate much better routing solutions than state-of-the-art, manufacturing-friendly routers. To systematically improve the manufacturing yield in the post-routing stage, a global optimization engine is introduced for redundant local-loop insertion that considers advanced manufacturing constraints. Finally, we propose the first machine learning-based framework for fast yet consistent SRAF generation with high-quality results.
Electrical and Computer Engineerin
Area-power-delay trade-off in logic synthesis
This thesis introduces new concepts for performing area-power-delay trade-offs in a logic synthesis system. To achieve this, a new delay model is presented that gives accurate delay estimates for arbitrary sets of Boolean expressions. This allows the delay model to be used from the very first steps of logic synthesis. Furthermore, new algorithms are presented for a number of different optimization tasks within logic synthesis: creating prime irredundant Boolean expressions, performing technology mapping for use with standard cell generators, and performing gate sizing. To demonstrate the validity of the presented ideas, benchmark results are given throughout the thesis.
Digital watermark technology in security applications
With the rising emphasis on security and the growing number of fraud-related crimes around the world, authorities are looking for new technologies to tighten identity security. Among many modern electronic technologies, digital watermarking has unique advantages for enhancing document authenticity. At their current stage of development, digital watermarking technologies are not as mature as competing technologies for supporting identity authentication systems. This work presents performance improvements for two classes of digital watermarking techniques and investigates the issue of watermark synchronisation.
Optimal performance can be obtained if the spreading sequences are designed to be orthogonal to the cover vector. In this thesis, two classes of orthogonalisation methods that generate binary sequences quasi-orthogonal to the cover vector are presented. One method, "Sorting and Cancelling", generates sequences that have a high level of orthogonality to the cover vector. The Hadamard-matrix-based orthogonalisation method, "Hadamard Matrix Search", is able to realise overlapped embedding, so the watermarking capacity and image fidelity can be improved compared with using short watermark sequences. The results are compared with traditional pseudo-randomly generated binary sequences. The advantages of both classes of orthogonalisation methods are significant.
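
The benefit of a spreading sequence that is nearly orthogonal to the cover vector can be sketched as follows. In additive spread-spectrum embedding, the detector's correlation is the sum of a cover interference term and a watermark term, so a sequence whose correlation with the cover is close to zero leaves mostly the watermark term. The greedy sign assignment below is only an illustrative heuristic written for this example; it is not claimed to be the thesis's "Sorting and Cancelling" algorithm, and the function names, cover values and embedding strength `alpha` are assumptions.

```python
def correlate(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def greedy_quasi_orthogonal(cover):
    """Assign +/-1 chips so the correlation with the cover stays near zero.

    Illustrative heuristic (an assumption, not the thesis's method): visit
    cover samples from largest to smallest magnitude and pick the sign that
    pushes the running correlation back toward zero.
    """
    order = sorted(range(len(cover)), key=lambda i: abs(cover[i]), reverse=True)
    seq = [0] * len(cover)
    running = 0
    for i in order:
        chip = -1 if running * cover[i] > 0 else 1
        seq[i] = chip
        running += chip * cover[i]
    return seq


def embed(cover, seq, bit, alpha):
    """Additive embedding of one bit (+1 or -1) with strength alpha."""
    return [x + alpha * bit * s for x, s in zip(cover, seq)]


def detect(received, seq):
    """Correlation detector: the sign of the correlation is the decoded bit."""
    return 1 if correlate(received, seq) >= 0 else -1


# Hypothetical cover samples (for example, eight pixel intensities).
cover = [52, 55, 61, 66, 70, 61, 64, 73]
quasi = greedy_quasi_orthogonal(cover)
naive = [1] * len(cover)  # a sequence strongly correlated with the cover

decoded_quasi = detect(embed(cover, quasi, -1, alpha=1), quasi)
decoded_naive = detect(embed(cover, naive, -1, alpha=1), naive)
```

With this cover, the quasi-orthogonal sequence lets the detector recover the embedded bit -1 at low embedding strength, whereas the all-ones sequence is swamped by the cover interference and decodes +1.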
Another watermarking method introduced in the thesis is based on writing-on-dirty-paper theory. The method is presented with biorthogonal codes, which have the best robustness. The advantages and trade-offs of using biorthogonal codes with this watermark coding method are analysed comprehensively. Comparisons between the orthogonal and non-orthogonal codes used in this watermarking method are also made. It is found that fidelity and robustness conflict and cannot be optimised simultaneously.
Comparisons are also made between all proposed methods. They focus on three major performance criteria: fidelity, capacity and robustness. The conclusions differ between two viewpoints. From the fidelity-centric viewpoint, the dirty-paper coding method using biorthogonal codes has a very strong advantage in preserving image fidelity, and its capacity advantage is also significant. From the power-ratio point of view, however, the orthogonalisation methods demonstrate a significant advantage in capacity and robustness. The conclusions are contradictory, but together they summarise the performance produced by different design considerations.
Watermark synchronisation is first provided by high-contrast frames around the watermarked image. Edge detection filters are used to detect the high-contrast borders of the captured image. By scanning the pixels from the border to the centre, the locations of detected edges are stored. The optimal linear regression algorithm is used to estimate the watermarked image frames. Estimation of the regression function provides the rotation angle as the slope of the rotated frames. The scaling is corrected by re-sampling the upright image to the original size. A theoretically studied method that is able to synchronise the captured image to sub-pixel accuracy is also presented. By using invariant transforms and the "symmetric phase only matched filter", the captured image can be corrected accurately to its original geometric size. The method uses repeating watermarks to form an array in the spatial domain of the watermarked image; the locations of the array's elements reveal the rotation, translation and scaling with two filtering processes.
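
The regression step described above, in which the slope of a fitted frame edge gives the rotation angle, can be sketched with an ordinary least-squares line fit. This is a minimal illustration under stated assumptions: the edge points are taken to be (x, y) pixel coordinates along a near-horizontal frame border, and the function names are hypothetical rather than taken from the thesis.

```python
import math


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rotation_degrees(edge_points):
    """Rotation of a near-horizontal frame edge, read off the fitted slope."""
    slope, _ = fit_line(edge_points)
    return math.degrees(math.atan(slope))


# Synthetic edge points along a border rotated by 5 degrees (assumed data).
true_angle = 5.0
edge = [(x, 10 + math.tan(math.radians(true_angle)) * x) for x in range(0, 100, 10)]
estimated = rotation_degrees(edge)
```

For a near-vertical border the roles of x and y would need to be swapped before fitting, since the slope of a vertical line is unbounded.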
A combinatorial approach to orthogonal placement problems
We consider two families of NP-hard orthogonal placement problems from the field of information visualization, from both a theoretical and a practical point of view. This thesis presents a common combinatorial framework for compaction problems from orthogonal graph drawing and point-set labeling problems from computer cartography. The compaction problems ask for a given dimensionless description of the orthogonal shape of a graph to be transformed into an orthogonal grid drawing with short edges and small area. The labeling problems ask for a given set of rectangular labels to be placed so that a legible map results. In a classical application, the points represent, for example, cities on a map, and the labels contain the names of the cities. We present new combinatorial formulations for these problems, using a path- and cycle-based graph-theoretic property in an associated problem-specific pair of constraint graphs. The reformulation enables us to develop exact algorithms for the original problems. Extensive experimental studies with real-world benchmark instances show that our algorithms, which are based on linear programming, are able to solve large instances of the placement problems to provable optimality in short computation time. Furthermore, we combine the formulations for compaction and labeling problems and present an exact algorithmic approach for a graph labeling problem. In many cases, our new algorithms are the first exact algorithms for the respective problem variant.