394 research outputs found
Recommended from our members
Bridging the gap between mobile CPU design and user satisfaction via crowdsourcing
This report aims to provide an understanding of how mobile CPU designs have evolved and how that evolution has influenced end-user satisfaction. To that end, a quantitative performance analysis is conducted across ten cutting-edge mobile CPU designs studied within top-selling off-the-shelf smartphones released over the past seven years. This analysis is then used to guide a large-scale user study spanning over 25,000 participants via crowdsourcing on the Amazon Mechanical Turk service. The user study asks participants to assess the responsiveness of interactive application use cases for a set of current-generation applications (e.g. Angry Birds and Facebook) and next-generation applications (i.e. face recognition and augmented reality) relative to the performance capabilities of the devices studied. This framework allows us to quantitatively link the mobile CPU designs studied to end-user satisfaction. The study results indicate that mobile CPU designs have exhibited significant performance improvements through aggressive core scaling techniques prevalent in desktop CPUs. Just as was observed in desktop CPU design, these same techniques have led to excessive mobile CPU power consumption. However, from an end-user perspective this power consumption was not without success. Mobile CPUs have evolved to provide satisfactory experiences for the studied current-generation applications. The reason is that many of these applications rely heavily on single-threaded performance. Other, more recent applications actually multi-thread user-critical parts of the application, which also demonstrates that multi-core mobile CPUs are an important design consideration, contrary to conventional wisdom. However, looking ahead, the same mobile CPUs were not able to provide satisfactory experiences for many of the next-generation applications studied, questioning the sustainability of these power-hungry design techniques in future mobile CPU designs. Electrical and Computer Engineerin
On the automated compilation of UML notation to a VLIW chip multiprocessor
With the availability of more and more cores within architectures the process of extracting implicit and explicit parallelism in applications to fully utilise these cores is becoming complex. Implicit parallelism extraction is performed through the inclusion of intelligent software and hardware sections of tool chains although these reach their theoretical limit rather quickly.
For this reason, a method that allows explicit parallelism to be performed as fast as possible has been investigated. This method enables application developers to create and synchronise parallel sections of an application at a finer-grained level than previously possible, resulting in smaller sections of code being executed in parallel while still reducing overall execution time.
Alongside explicit parallelism, a concept of high level design of applications destined for multicore systems was also investigated. As systems are getting larger it is becoming more difficult to design and track the full life-cycle of development. One method used to ease this process is to use a graphical design process to visualise the high level designs of such systems.
One drawback of graphical design is the explicit nature in which systems are required to be generated. This was investigated and, using concepts already in use in text-based programming languages, the generation of platform-independent models that can be specialised to multiple hardware architectures was developed.
The explicit parallelism was performed using hardware elements to perform thread management; this resulted in speed-ups of over 13 times compared with threading libraries executed in software on commercially available processors. This allowed applications with large data-dependent sections to be parallelised in small sections within the code, resulting in a decrease in overall execution time.
The modelling concepts saved between 40% and 50% of the time and effort required to generate platform-specific models, while only incurring an overhead of up to 15% of the execution cycles of models designed for specific architectures.
Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.
Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have a low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.
Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.
In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. Dissertation/Thesis Doctoral Dissertation Computer Science 201
An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor
Over the years, the von Neumann model of computing has undergone many enhancements. These changes include an improved memory hierarchy, multiple instruction issue and branch prediction. Since the model's introduction, the performance of processors has increased at a much greater rate than that of memory. Several modifications to hide this ever-widening gap in performance are being examined in current research. A very promising one is the Simultaneous Multithreaded processor. This architecture strives to further reduce the effects of long-latency instructions, such as memory accesses, by allowing multiple threads of execution to be active in the processor at the same time. With the introduction of multiple active threads in a single processor, several new aspects of processor operation can have a sizeable effect on performance. One such aspect is how to choose the thread from which to fetch instructions during the next cycle. For this project, three different classes of fetch scheduling mechanisms were defined and examples of each were either studied or proposed. The proposed mechanisms were then tested using a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading simulator based on the SimpleScalar tool set from the University of Wisconsin-Madison. With the proper configuration, each of the proposed mechanisms improved the performance of the simulated architecture. However, the best increase in performance was produced by the Event History Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling mechanism only 0.070% of the time.
An FPGA implementation of an investigative many-core processor, Fynbos: in support of a Fortran autoparallelising software pipeline
Includes bibliographical references. In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near-threshold voltage operation. Performance is therefore dependent on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather than to the programmer or to an application-specific means of control simplification. A legacy tool chain already exists that is capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core. This work implements a many-core architecture to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow.
Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement static-scheduling mitigation tactics due to a lack of support for them in the compiler.
Optimizing the computing performance per area of processor architectures through reconfiguration of functional units
Many embedded systems, such as smartphones, PDAs, MP3 players and numerous others, are increasingly manufactured as a system-on-a-chip, that is, on a single piece of silicon, for the sake of miniaturisation, cost reduction and greater robustness. Such systems contain processors and memory as well as a wide variety of other peripheral units that take on specialised tasks of the system's application domain. Some of these units, however, are not in continuous use, for example a GSM modulator in a smartphone or a hardware MPEG decoder in a PDA. Because of the flexibility required and the simpler design process, it is becoming increasingly popular to implement systems-on-a-chip with Field Programmable Gate Arrays (FPGAs), freely programmable logic devices. Current devices allow dynamic partial reconfiguration: they can replace parts of their logic while other parts remain in operation. The resources of inactive units of the system can therefore be used dynamically for other purposes. This work proposes a processor architecture whose computing performance adapts by acquiring and releasing available programmable-logic resources over time. To achieve this, additional resources are occupied by additional functional units for the processor. These units are integrated into the computation through instructions that can be executed in parallel and that follow the principles of Explicitly Parallel Instruction Computing. When the resources occupied by the processor are needed elsewhere, functional units are given up step by step until a minimum level of processor performance is reached. This approach puts the temporarily unused resources of the processor to good use.
In addition, the proposed architecture is able to adapt itself to the computations to be executed and thus execute them faster. The aim of this work is to define such a class of new processors, to quantify their possible benefit and to demonstrate their technical feasibility. The possible speed-up from such an architecture is determined by simulating the assignment of instructions from potential program traces to functional units. The technical feasibility of the approach is shown by prototype implementations of the critical elements of the architecture, above all in the area of partial reconfiguration of FPGAs.
Measuring program similarity for efficient benchmarking and performance analysis of computer systems
Computer benchmarking involves running a set of benchmark programs to measure the performance of a computer system. Modern benchmarks are developed from real applications. Applications are becoming complex and hence modern benchmarks run for a very long time. These benchmarks are also used for performance evaluation in the early design phase of microprocessors. Due to the size of benchmarks and the increase in complexity of microprocessor design, the effort required for performance evaluation has increased significantly. This dissertation proposes methodologies to reduce the effort of benchmarking and performance evaluation of computer systems. Identifying a set of programs that can be used in the process of benchmarking can be very challenging. A solution to this problem can start by identifying similarity between programs to capture the diversity in their behavior before they can be considered for benchmarking. The aim of this methodology is to identify redundancy in the set of benchmarks and find a subset of representative benchmarks with the least possible loss of information. This dissertation proposes the use of program characteristics which capture the performance behavior of programs and identifies representative benchmarks applicable over a wide range of system configurations. The use of benchmark subsetting has not been restricted to academic research. Recently, the SPEC CPU subcommittee used the information derived from measuring similarity based on program behavior characteristics between different benchmark candidates as one of the criteria for selecting the SPEC CPU2006 benchmarks. The information on similarity between programs can also be used to predict the performance of an application when it is difficult to port the application to different platforms. This is a common problem when a customer wants to buy the best computer system for a given application.
Performance of a customer's application on a particular system can be predicted using the performance scores of the standard benchmarks on that system and the similarity information between the application and the benchmarks. Similarity between programs is quantified by the distance between them in the space of the measured characteristics, and is used to predict the performance of a new application from the performance scores of its neighbors in the workload space. Electrical and Computer Engineerin
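The neighbor-based prediction described in this abstract can be illustrated with a minimal sketch. It is not the dissertation's actual method: the Euclidean metric, the inverse-distance weighting, the choice of `k`, and the names `euclidean` and `predict_score` are all assumptions made for illustration, and the feature vectors are assumed to be already normalised to comparable scales.

```python
import math


def euclidean(a, b):
    """Distance between two programs in the space of measured characteristics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_score(app_features, benchmarks, k=2):
    """Predict an application's score from its k nearest benchmarks.

    benchmarks is a list of (feature_vector, measured_score) pairs.
    The inverse-distance weighting is an illustrative assumption.
    """
    ranked = sorted(benchmarks, key=lambda b: euclidean(app_features, b[0]))
    nearest = ranked[:k]
    # A small epsilon avoids division by zero when the application
    # coincides with a benchmark in the workload space.
    weights = [1.0 / (euclidean(app_features, f) + 1e-9) for f, _ in nearest]
    total = sum(weights)
    return sum(w * score for w, (_, score) in zip(weights, nearest)) / total


# Hypothetical data: two characteristics per program, one score per benchmark.
benchmarks = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0), ((10.0, 10.0), 100.0)]
prediction = predict_score((1.0, 0.0), benchmarks, k=2)
```

In this made-up example the application sits midway between the first two benchmarks, so the prediction is the average of their scores and the distant third benchmark is ignored.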
Optimal Global Instruction Scheduling for the Itanium® Processor Architecture
On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel's Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. 
The results show a significant speedup over the original code.
Interferometric Methods
Future radio telescopes promise great advances in resolution and sensitivity. These include the Square Kilometer Array (SKA), a two-array instrument in South Africa and Australia. Similarly, the next generation Very Large Array (ngVLA) is being designed for construction in North America. These arrays all promise exceptional advances in sensitivity, angular resolution, and survey speed. The SKA and ngVLA are both specified to have sensitivities at the level of $\mu$Jy. The SKA-Low instrument will consist of a huge number of dipole antennas in Australia, which is pushing the bounds of current FX correlator technology with $O(N^2)$ scaling, where $N$ is the number of antennas. The design proposals for these instruments include a dense core of antennas, necessitating advances in imaging methods for these very dense cores versus more traditionally sparse instruments.
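As a rough illustration of why correlator cost becomes a concern for large arrays, the sketch below counts the antenna pairs (baselines) that a cross-correlating design must process. It assumes that cost is dominated by the number of pairs, $N(N-1)/2$; the function name `baseline_count` and the antenna counts are hypothetical and are not taken from any instrument specification.

```python
def baseline_count(n_antennas):
    """Number of distinct antenna pairs among n_antennas (autocorrelations excluded)."""
    return n_antennas * (n_antennas - 1) // 2


# Hypothetical array sizes, chosen only to show the quadratic growth.
for n in (10, 100, 1000):
    print(n, baseline_count(n))
```

Under this assumption, each tenfold increase in antenna count raises the number of pairs roughly a hundredfold.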
Another ambitious experiment is the Hydrogen Epoch of Reionisation Array (HERA) in South Africa, which hopes to make the first direct detection of the Epoch of Reionisation through the red-shifted H I signal, which is considerably smaller than the thermal-like noise.
In this thesis, these problems are tackled by re-examining the underlying principles of interferometry. The first working example of a direct imaging correlator is presented, which allows images to be formed directly from the voltages off each antenna in a dense array, without the expensive cross-correlation operation that is typically required. A detailed discussion is given of how standard steps in interferometric imaging, including calibration, differ in this new scheme. Additionally, the first wide-field direct imaging correlator is presented, which allows the problems of non-coplanarity to be dealt with for both sparse and dense arrays in a very efficient manner on modern GPU compute hardware. These are, to the best of the author's knowledge, the only working implementations of a direct imaging correlator for generic arrays with no restrictions on the geometry of the array or the homogeneity of constituent receiver elements. These new approaches have been published in the scientific literature, as discussed in the Declaration.
Moving on from this, the closure phase bispectrum is presented as a way of uncovering the cosmological Epoch of Reionisation signal from the H I line. This uses the HERA telescope, which consists of a dense core of parabolic antennas in a highly redundant layout. A data reduction and processing pipeline for the HERA telescope is constructed and presented, for use with the bispectrum. Initial results towards a cosmological limit are reported.
The HERA telescope relies on redundancy in its antenna elements for its calibration and measurement strategy. The bispectrum, with its unique mathematical properties, in combination with forward modelling, is shown to be a potent tool for probing departures from the assumed redundancy. It is shown, through this method, that HERA suffers significant direction-dependent non-redundancies in the dataset used for our analysis, which are extremely difficult to calibrate out.
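One of the mathematical properties that makes the bispectrum attractive can be shown with a short textbook sketch; it is not the HERA pipeline. It assumes the standard measurement model in which each measured visibility is the true visibility multiplied by one complex gain per antenna, $V_{ij}^{\rm meas} = g_i g_j^* V_{ij}$. The gains, visibilities, and the function names `measured` and `closure_phase` are hypothetical.

```python
import cmath


def measured(true_vis, gain_i, gain_j):
    """Apply antenna-based complex gains to a true visibility V_ij."""
    return gain_i * gain_j.conjugate() * true_vis


def closure_phase(v12, v23, v31):
    """Phase of the bispectrum V_12 * V_23 * V_31 around a triangle of antennas."""
    return cmath.phase(v12 * v23 * v31)


# Hypothetical true visibilities on the three baselines of one triangle.
t12 = cmath.rect(1.0, 0.3)
t23 = cmath.rect(0.5, -0.5)
t31 = cmath.rect(2.0, 0.9)

# Hypothetical, uncalibrated antenna gains (amplitude, phase error).
g1 = cmath.rect(1.0, 0.4)
g2 = cmath.rect(1.2, -1.1)
g3 = cmath.rect(0.8, 2.0)

m12 = measured(t12, g1, g2)
m23 = measured(t23, g2, g3)
m31 = measured(t31, g3, g1)

true_closure = closure_phase(t12, t23, t31)
measured_closure = closure_phase(m12, m23, m31)
```

Each individual measured phase is corrupted by the gain errors, but in the triple product every gain appears once with its conjugate, so the gain phases cancel and `measured_closure` equals `true_closure`. Errors that do not factor into one gain per antenna, such as the direction-dependent effects mentioned above, do not cancel in this way.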
Finally, the problem of wide-field imaging in next generation arrays is tackled through the development and implementation of a new scheme of wide-field imaging. This uses a new method of parallelising the problem of wide-field imaging, and is intended for use with the very large datasets that will be produced by upcoming instruments. Two schemes are introduced: $w$-towers and Improved $w$-towers. The latter generalises the former in combination with advances in optimal convolution theory for the radio astronomy "gridding" problem. The theory behind this approach is explored, and a high performance implementation is presented for $w$-towers and Improved $w$-stacking within Improved $w$-towers. ARM Ltd iCase Sponsorshi