Search CORE

394 research outputs found

Recommended from our members

Bridging the gap between mobile CPU design and user satisfaction via crowdsourcing

Author: Halpern Matthew Franklin
Publication venue
Publication date: 14/10/2016
Field of study

This report aims to provide an understanding of how the mobile CPU designs have evolved and its influence on end-user satisfaction. To that end, a quantitative performance analysis is conducted across ten cutting-edge mobile CPU designs studied within top-selling off-the-shelf smartphones released over the past seven years. This analysis is then used to guide a large-scale user study spanning over 25,000 participants via crowdsourcing on the Amazon Mechanical Turk service. The user study asks participants to assess the responsiveness of interactive application use cases for a set of current-generation applications (e.g. Angry Birds and FaceBook) and next-generation applications (i.e. face recognition and augmented reality) relative to the performance capabilities of the devices studied. This framework allows us to quantitatively link how the mobile CPU designs studied impacted end-user satisfaction. The study results indicate that mobile CPU designs have exhibited signifiant performance improvements through aggressive core scaling techniques prevalent in desktop CPUs. Just as was observed in desktop CPU design, these same techniques have lead to excessive mobile CPU power consumption. However, from an end-user perspective this power consumption was not without success. Mobile CPUs have evolved to provide satisfactory experiences for the studied current- generation applications. The reason is that many of these applications rely heavily on single-threaded performance. Other, more recent applications, actually multi-thread user-critical parts of the applications, which also demonstrates that multi- core mobile CPUs are an important design consideration – contrary to conventional wisdom. However, looking ahead, the same mobile CPUs where not able to provide satisfactory experiences for many of the next-generation applications studied, questioning the sustainability of these power-hungry design techniques in future mobile CPU designs.Electrical and Computer Engineerin

Texas ScholarWorks

On the automated compilation of UML notation to a VLIW chip multiprocessor

Author: David Stevens (4254274)
Publication venue
Publication date: 01/01/2013
Field of study

With the availability of more and more cores within architectures the process of extracting implicit and explicit parallelism in applications to fully utilise these cores is becoming complex. Implicit parallelism extraction is performed through the inclusion of intelligent software and hardware sections of tool chains although these reach their theoretical limit rather quickly. Due to this the concept of a method of allowing explicit parallelism to be performed as fast a possible has been investigated. This method enables application developers to perform creation and synchronisation of parallel sections of an application at a finer-grained level than previously possible, resulting in smaller sections of code being executed in parallel while still reducing overall execution time. Alongside explicit parallelism, a concept of high level design of applications destined for multicore systems was also investigated. As systems are getting larger it is becoming more difficult to design and track the full life-cycle of development. One method used to ease this process is to use a graphical design process to visualise the high level designs of such systems. One drawback in graphical design is the explicit nature in which systems are required to be generated, this was investigated, and using concepts already in use in text based programming languages, the generation of platform-independent models which are able to be specialised to multiple hardware architectures was developed. The explicit parallelism was performed using hardware elements to perform thread management, this resulted in speed ups of over 13 times when compared to threading libraries executed in software on commercially available processors. This allowed applications with large data dependent sections to be parallelised in small sections within the code resulting in a decrease of overall execution time. The modelling concepts resulted in the saving of between 40-50% of the time and effort required to generate platform-specific models while only incurring an overhead of up to 15% the execution cycles of these models designed for specific architectures

Loughborough University Institutional Repository

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

Author
Publication venue
Publication date: 01/01/2018
Field of study

abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository

An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor

Author: Zajac Ralph, Jr
Publication venue: RIT Scholar Works
Publication date: 01/09/2000
Field of study

Over the years, the von Neumann model of computing has undergone many enhancements. These changes include an improved memory hierarchy, multiple instruction issue and branch predic tion. Since the model\u27s introduction, the performance of processors has increased at a much greater rate than that of memory. Several modifications to hide this ever widening gap in performance are being examined in current research. A very promising one is the Simultaneous Multithreaded processor. This architecture strives to further reduce the effects of long latency instructions, such as memory accesses, by allowing multiple threads of execution to be active in the processor at the same time. With the introduction of multiple active threads in a single processor, several new aspects of processor operation can have a sizeable effect on performance. One such aspect is how to choose from which thread to fetch instructions during the next cycle. For this project, three different classes of fetch scheduling mechanisms were defined and exam ples of each were either studied or proposed. The proposed mechanisms were then tested using a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading sim ulator based on the Simple Scalar tool set from the University of Wisconsin-Madison. With the proper configuration, each of the proposed mechanisms improved the performance of the simulated architecture. However, the best increase in performance was produced by the Event History Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling mechanism only 0.070% of the time

RIT Scholar Works

An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

Author: Wyngaard Janet Ruth
Publication venue: Department of Electrical Engineering
Publication date: 01/01/2014
Field of study

Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler

Cape Town University OpenUCT

Optimierung der Rechenleistung pro Fläche von Prozessorarchitekturen durch Rekonfiguration von Funktionseinheiten

Author: Scholz Rainer
Publication venue: Universität der Bundeswehr München, Fakultät für Informatik
Publication date: 01/01/2011
Field of study

Viele eingebettete Systeme, wie Smartphones, PDAs, MP3-Player und zahlreiche weitere, werden zur Miniaturisierung, Kostenreduktion und Steigerung der Robustheit zunehmend als System-on-a-Chip, also auf nur einem Stück Silizium, gefertigt. In solchen Systemen arbeiten sowohl Prozessoren und Speicher, wie auch mannigfaltige andere Peripherieeinheiten, welche spezialisierte Aufgaben des jeweiligen Einsatzgebietes des Systems übernehmen. Einige dieser Einheiten sind jedoch nicht durchgängig im Einsatz, wie beispielsweise ein GSM-Modulator bei Smartphones oder ein Hardware MPEG-Dekoder im PDA. Aufgrund der benötigten Flexibilität und des einfacheren Entwurfsprozesses wird es zunehmend populärer, Systems-on-a-Chip mit Field Programmable Gate Arrays (FPGAs), frei programmierbaren Logikbausteinen, zu realisieren. Aktuelle Bausteine erlauben dynamische partielle Rekonfiguration. Sie können also Teile ihrer Logik ersetzen, während andere weiter in Betrieb bleiben. Die Ressourcen nicht aktiver Einheiten des Systems können somit dynamisch für andere Zwecke benutzt werden. Diese Arbeit schlägt eine Prozessorarchitektur vor, deren Rechenleistung sich durch zeitlich variable Hinzunahme und Abgabe von zur Verfügung stehenden Ressourcen der programmierbaren Logik anpasst. Zusätzliche Ressourcen werden, um dies zu erreichen, durch zusätzliche Funktionseinheiten für den Prozessor belegt. Deren Einbindung in die Berechnungen wird durch parallel ausführbare, den Prinzipien des Explicitly Parallel Instruction Computings genügende Instruktionen erreicht. Werden die belegten Ressourcen des Prozessors an anderer Stelle wieder benötigt, werden schrittweise Funktionseinheiten abgetreten, bis ein Minimum an Rechenleistung des Prozessors erreicht ist. Durch diesen Ansatz werden die zeitweise ungenutzten Ressourcen des Prozessors sinnvoll verwendet. Zudem bietet die vorgeschlagene Architektur die Fähigkeit, sich selbst an die auszuführenden Berechnungen anzupassen und sie somit schneller auszuführen. Ziel dieser Arbeit ist es, eine solche Klasse neuer Prozessoren zu definieren, ihren möglichen Nutzen zu quantifizieren und ihre technische Umsetzbarkeit nachzuweisen. Die mögliche Beschleunigung durch eine solche Architektur wird durch simulative Zuordnung von Befehlen potentieller Traces von Programmen auf Funktionseinheiten ermittelt. Die technische Machbarkeit des Ansatzes wird durch prototypische Implementierungen der kritischen Elemente der Architektur, vor allem im Bereich der partiellen Rekonfiguration von FPGAs, gezeigt

Universität der Bundeswehr München: AtheneForschung

Recommended from our members

Measuring program similarity for efficient benchmarking and performance analysis of computer systems

Author: Phansalkar Aashish S.
Publication venue
Publication date: 01/05/2007
Field of study

textComputer benchmarking involves running a set of benchmark programs to measure performance of a computer system. Modern benchmarks are developed from real applications. Applications are becoming complex and hence modern benchmarks run for a very long time. These benchmarks are also used for performance evaluation in the early design phase of microprocessors. Due to the size of benchmarks and increase in complexity of microprocessor design, the effort required for performance evaluation has increased significantly. This dissertation proposes methodologies to reduce the effort of benchmarking and performance evaluation of computer systems. Identifying a set of programs that can be used in the process of benchmarking can be very challenging. A solution to this problem can start by identifying similarity between programs to capture the diversity in their behavior before they can be considered for benchmarking. The aim of this methodology is to identify redundancy in the set of benchmarks and find a subset of representative benchmarks with the least possible loss of information. This dissertation proposes the use of program characteristics which capture the performance behavior of programs and identifies representative benchmarks applicable over a wide range of system configurations. The use of benchmark subsetting has not been restricted to academic research. Recently, the SPEC CPU subcommittee used the information derived from measuring similarity based on program behavior characteristics between different benchmark candidates as one of the criteria for selecting the SPEC CPU2006 benchmarks. The information of similarity between programs can also be used to predict performance of an application when it is difficult to port the application on different platforms. This is a common problem when a customer wants to buy the best computer system for his application. Performance of a customer's application on a particular system can be predicted using the performance scores of the standard benchmarks on that system and the similarity information between the application and the benchmarks. Similarity between programs is quantified by the distance between them in the space of the measured characteristics, and is appropriately used to predict performance of a new application using the performance scores of its neighbors in the workload space.Electrical and Computer Engineerin

Texas ScholarWorks

Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

Author: Winkel Sebastian
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/2004
Field of study

On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel�s Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen Einfluß auf die Leistung und stellt dabei gleichzeitig eine Herausforderung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechselseitig voneinander abhängigen Entscheidungen verbunden, die für Heuristiken nur schwer zu beherrschen sind.Wir lösen diesesNP-vollständige Problem mit ganzzahliger linearer Programmierung (ILP), einer suchbasierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht neben schnellerem Code auch Einblicke in das Potential der Itanium- Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschiebungen mit Kompensationscode, Prädikation und Itanium-spezifische Techniken wie Kontroll- und Datenspekulation. Bei ganzzahliger linearer Programmierung sind wohlstrukturierte Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lösungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig, kann das ILP in Polynomialzeit gelöst werden. Wir definieren zwei Teilprobleme globaler Instruktionsanordnung durch Auslassung bestimmter Klassen von Nebenbedingungen und beweisen, daß die korrespondierenden Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller Größe sind. Dies untermauert die hohe Effizienz des gefundenen Modells, die auch durch moderate Lösungszeiten bestätigt wird. Das ILP-Modell wird um weitere Transformationen wie zyklische Codeverschiebung erweitert; letztere bezeichnet das Verschieben von Befehlen aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärtskanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequenzen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dynamischer Programmierung erzeugt. Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Angewandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint 2000 erreicht der Optimierer eine signifikante Beschleunigung gegenüber dem Originalcode

Universaar

MPG.PuRe

Acronym

Recommended from our members

Interferometric Methods

Author: Kent James
Publication venue: University of Cambridge
Publication date: 01/07/2020
Field of study

Future radio telescopes promise great advances in resolution and sensitivity. These include the Square Kilometer Array, a two array instrument, in South Africa and Australia. Similarly, the next generation Very Large Array (ngVLA) is being designed for construction in North America. These arrays all promise exceptional advances in sensitivity, angular resolution, and survey speed. The SKA and ngVLA are both specified to have sensitivities at the level of

\mu

Jy's. The SKA-Low instrument will consist of a huge number of dipoles antennas in Australia which is pushing the bounds of current FX correlator technology with

\mathbb{O}(n^2)

scaling, where

n

is the number of antennas. The design proposals for these instruments include a dense core of antennas, necessitating advances in imaging methods for these very dense cores versus more traditionally sparse instruments. Another ambitious experiment is the Hydrogen Epoch of Reionisation Array (HERA) in South Africa which hopes to make the first direct detection of the Epoch of Reionisation through the red-shifted H{\sc i} signal which is a factor of

10^{5}

smaller than the thermal-like noise. In this thesis, these problems are tackled by re-examining the underlying principles of interferometry. The first working example of a direct imaging correlator is presented which allows images to be formed directly from the voltages off each antenna in a dense array, without an expensive cross-correlation operation as is typically required. A detailed discussion is given of how standard steps in interferometric imaging differ in this new scheme, including calibration. Additionally the first wide field direct imaging correlator is presented, which allows the problems of non-coplanarity to be dealt with for both sparse and dense arrays in a very efficient manner on modern GPU compute hardware. These are, to the best of the authors knowledge, the only working implementations of a direct imaging correlator for generic arrays with no restrictions on the geometry of the array or homogeneity of constituent receiver elements. These new approaches have been published in the scientific literature as discussed in the Declaration. Moving on from this, the closure phase bispectrum is presented as a way of uncovering the cosmological Epoch of Reionisation signal from the H{\sc i} line. This is using the HERA telescope, which consists of a dense core of parabolic antennas in a highly redundant layout. A data reduction and processing pipeline for the HERA telescope is constructed and presented, for use with the bispectrum. Initial results towards a cosmologial limit are reported. The HERA telescope relies on redundancy in its antenna elements for its calibration and measurement strategy. The bispectrum with its unique mathematical propeties, in combination with forward modelling, is shown to be a potent tool for probing departures from the assumed reudundancy. It is shown, through this method, that HERA suffers significant direction-dependent non-redundancies in the dataset used for our analysis, which are extremely difficult to calibrate out. Finally, the problem of wide-field imaging in next generation arrays is tackled through the development and implementation of a new scheme of wide field imaging. This uses a new method of parallelising the problem of wide-field imaging, and is intended for use with the very large datasets that will be produced by upcoming instruments. Two schemes are introduced:

w

-towers, and Improved

w

-towers. The latter generalises the former in combination with advances in optimal convolution theory for the radio astronomy ``gridding'' problem. The theory behind this approach is explored, and a high performance implementation is presented for