12 research outputs found
Mixed-length SIMD code generation for VLIW architectures with multiple native vector-widths
The degree of data-level parallelism (DLP) in applications is not fixed; it varies with each application's computational characteristics. Most processors today, however, include only single-width SIMD (vector) hardware to exploit DLP. Single-width SIMD architectures may serve applications with varying DLP poorly and cause performance and energy inefficiency. We propose using VLIW processors with multiple native vector widths to better serve applications with changing DLP. SHAVE is one such VLIW processor; it provides hardware support for native 32-bit and 128-bit wide vector operations. This paper develops and implements mixed-length SIMD code generation support for the SHAVE processor. More specifically, we target generating 32-bit and 128/64-bit SIMD code for the native 32-bit and 128-bit wide vector units of the SHAVE processor. In this way, we improve the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing SIMD hardware utilization. Experimental results demonstrate that our methodology, implemented in the compiler, improves the performance of synthetic benchmarks by up to 47%.
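To illustrate the idea (a generic sketch using GCC/Clang vector extensions, not the paper's SHAVE code generator; the type and function names are ours), mixed-width vectorization lets the bulk of a loop run on 128-bit vectors while the remainder uses a 32-bit vector instead of falling back to scalar code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Two vector types mirroring the two native widths described in the paper.
typedef uint8_t v16u8 __attribute__((vector_size(16)));  // 128-bit
typedef uint8_t v4u8  __attribute__((vector_size(4)));   // 32-bit

// Add two byte arrays: wide 128-bit ops for the bulk, narrow 32-bit ops
// for the tail, and scalar code only for the last < 4 bytes.
void add_mixed(const uint8_t* a, const uint8_t* b, uint8_t* out, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {           // 128-bit lane
        v16u8 va, vb;
        std::memcpy(&va, a + i, 16);
        std::memcpy(&vb, b + i, 16);
        v16u8 vc = va + vb;
        std::memcpy(out + i, &vc, 16);
    }
    for (; i + 4 <= n; i += 4) {             // 32-bit lane for the remainder
        v4u8 va, vb;
        std::memcpy(&va, a + i, 4);
        std::memcpy(&vb, b + i, 4);
        v4u8 vc = va + vb;
        std::memcpy(out + i, &vc, 4);
    }
    for (; i < n; ++i) out[i] = a[i] + b[i]; // scalar tail
}
```

The reduction in overhead operations the paper reports comes from exactly this pattern: the narrow vector unit absorbs work that would otherwise be scalar epilogue code.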
Libra: Achieving Efficient Instruction- and Data-Parallel Execution for Mobile Applications
Mobile computing, as exemplified by the smart phone, has become an integral part of our daily lives. The next generation of these devices will be driven by richer user experiences and compelling capabilities: higher-definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But energy budgets are increasing at a much lower rate, so fundamental improvements in computing efficiency must be garnered. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip-count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required, one able to deploy resources as needed: exploit DLP on SIMD when it is available, but fall back to ILP on the same hardware when necessary.
  In this thesis, we first focus on compiler solutions to the inefficiency problems of both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, the SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification. CGRA Express effectively accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages stream-graph modulo scheduling for scheduling multiple code regions on multi-core accelerators.
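As a rough illustration of the hidden DLP that subgraph identification targets (a hand-written before/after sketch, not the SIMD Defragmenter algorithm itself), consider isomorphic scalar dataflow subgraphs, identical operation sequences on independent data, that can be packed into a single vector subgraph:

```cpp
#include <cstdint>
#include <cstring>

typedef int32_t v4i32 __attribute__((vector_size(16)));  // GCC/Clang extension

// Before: four isomorphic scalar subgraphs (mul-by-3 then add) that a scalar
// compiler would issue as separate operations across the VLIW lanes.
void scale_add_scalar(const int32_t* x, const int32_t* y, int32_t* out) {
    out[0] = x[0] * 3 + y[0];
    out[1] = x[1] * 3 + y[1];
    out[2] = x[2] * 3 + y[2];
    out[3] = x[3] * 3 + y[3];
}

// After: the four subgraphs packed into one SIMD dataflow graph,
// occupying a single vector lane instead of four scalar ones.
void scale_add_packed(const int32_t* x, const int32_t* y, int32_t* out) {
    v4i32 vx, vy;
    std::memcpy(&vx, x, 16);
    std::memcpy(&vy, y, 16);
    const v4i32 three = {3, 3, 3, 3};
    v4i32 vo = vx * three + vy;
    std::memcpy(out, &vo, 16);
}
```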
  Second, we propose a new scalable multicore accelerator, referred to as Libra, for mobile systems, which can execute code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry demands higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet performance and power-efficiency requirements.
PhD thesis, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd
Computer Maintenance and Repair (Întreţinerea şi repararea calculatoarelor)
Beginner's guide to computer maintenance and troubleshooting, with an introduction to computer concepts, hardware, software (including operating systems), and Internet security.
A general-purpose computer has four main components: the arithmetic logic unit (ALU), the control unit, memory, and input and output devices (collectively called I/O). These parts are interconnected by buses, often made of groups of wires.
The defining characteristic of modern computers, which distinguishes them from all other machines, is that they can be programmed: a list of instructions (a program) can be stored in the computer, and the computer will process them. Modern computers, based on the von Neumann architecture, often have machine code in the form of an imperative programming language.
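A minimal sketch of the stored-program idea (a toy machine with invented opcodes, not any real instruction set): program and data share one memory, and a fetch-decode-execute loop plays the role of the control unit and ALU:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Program and data share one memory (the stored-program idea).
    // Invented opcodes: 1 = load immediate into accumulator,
    // 2 = add immediate, 3 = store accumulator to address, 0 = halt.
    uint8_t mem[16] = {1, 5,  2, 7,  3, 14,  0};
    uint8_t acc = 0, pc = 0;

    for (;;) {                       // the control unit's fetch loop
        uint8_t op = mem[pc++];      // fetch and decode
        if      (op == 1) acc  = mem[pc++];   // execute on the ALU
        else if (op == 2) acc += mem[pc++];
        else if (op == 3) mem[mem[pc++]] = acc;
        else break;                  // halt
    }
    std::printf("result at mem[14] = %d\n", mem[14]);  // prints 12
    return 0;
}
```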
Real-Time Processing of Forward Error Correction Coding for High-Datarate Free-Space Optical Satellite Communication on a Commercial Off-The-Shelf System on Chip
Free-space optical communication is a promising technology for the next generation of very-high-data-rate links between satellites and the Earth, as well as between satellites themselves. The OSIRIS program of the German Aerospace Center's (DLR) Institute of Communications and Navigation aims to develop, test, and commercialize optical communication technologies for small satellites used in Earth observation and mega-constellations. Building on the success of DLR's OSIRIS4CubeSat (O4C), the world's smallest laser terminal, DLR is enhancing the O4C terminal to enable bidirectional optical communication links between CubeSats in the CubeISL project.
To improve optical link performance, forward error correction (FEC) schemes are used to correct bit and burst errors introduced, for example, by the atmosphere. Processing these FEC codes, however, is computationally demanding. While FEC coding was performed in advance of a link in O4C, additional near-real-time processing is planned for CubeISL, requiring high-throughput FEC coding on the inherently limited computational resources of the CubeSat. This thesis evaluates and optimizes the achievable FEC processing performance on the commercial off-the-shelf System-on-Chip (SoC) deployed on the CubeISL satellite. The FEC encoding and decoding schemes were integrated, profiled, and optimized on the SoC. A more than sevenfold increase in throughput is demonstrated, factors to consider for near-real-time processing are discussed, and a concept for further improvements is proposed.
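The abstract does not name CubeISL's FEC scheme, so as a generic illustration of forward error correction, here is a Hamming(7,4) codec, which corrects any single bit error per 7-bit block:

```cpp
#include <cstdint>
#include <cstdio>

// Hamming(7,4): positions 1..7, parity bits at positions 1, 2, 4,
// data bits d0..d3 at positions 3, 5, 6, 7.
uint8_t encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   // covers positions 1,3,5,7
    uint8_t p2 = d0 ^ d2 ^ d3;   // covers positions 2,3,6,7
    uint8_t p4 = d1 ^ d2 ^ d3;   // covers positions 4,5,6,7
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

uint8_t decode(uint8_t c) {      // returns the corrected 4 data bits
    auto bit = [&](int i) { return (c >> (i - 1)) & 1; };
    // Recompute each parity check; the 3-bit syndrome is the error position.
    int s = (bit(1) ^ bit(3) ^ bit(5) ^ bit(7))
          | ((bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1)
          | ((bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2);
    if (s) c ^= 1 << (s - 1);    // flip the erroneous bit
    return bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3);
}

int main() {
    uint8_t code = encode(0b1011);
    code ^= 1 << 4;              // inject a single-bit channel error
    std::printf("decoded = %d\n", decode(code));  // prints 11 (0b1011)
    return 0;
}
```

Real deployments use far stronger codes, and throughput optimization on an SoC then hinges on bit-parallel arithmetic, table lookups, and offloading, but the encode/correct structure is the same.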
Compilation and Code Optimization for Data Analytics
The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance.
The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation. As a result, the cost of high-level languages for performance-critical systems may seem prohibitive.
The vision of abstraction without regret argues that it is possible to use high-level languages to build performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is compilation: the goal is to compile away expensive language features, compiling high-level code down to efficient low-level code.
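As a small illustration of compiling away abstraction (our own example, not one of the systems built in the thesis), C++ templates let a generic, composable pipeline specialize and inline into a single fused loop with no indirection or temporaries:

```cpp
#include <cstddef>
#include <cstdio>

// High-level, composable combinators: abstraction at the source level.
template <typename F, typename G>
auto compose(F f, G g) { return [=](auto x) { return g(f(x)); }; }

// A generic fold; `fn` is a template parameter, so the compiler
// specializes this function per pipeline instead of calling indirectly.
template <typename Fn>
long sum_mapped(const int* xs, size_t n, Fn fn) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i) acc += fn(xs[i]);
    return acc;
}

int main() {
    int xs[] = {1, 2, 3, 4};
    auto pipeline = compose([](int x) { return x * x; },
                            [](int x) { return x + 1; });
    // After inlining, the whole pipeline becomes: acc += xs[i]*xs[i] + 1;
    std::printf("%ld\n", sum_mapped(xs, 4, pipeline));  // prints 34
    return 0;
}
```

The same principle scales up: query engines and data analytics systems compile whole operator pipelines into such fused loops rather than interpreting them operator by operator.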
Online scheduling for real-time multitasking on reconfigurable hardware devices
The ever-increasing algorithmic complexity of embedded applications requires designers to turn towards heterogeneous, highly integrated systems known as SoCs (Systems-on-a-Chip). These architectures may embed CPU-based processors, dedicated datapaths, as well as reconfigurable units. However, embedded SoCs are subject to stringent requirements in terms of speed, size, cost, power consumption, and throughput. Therefore, new computing paradigms are required to fulfil the constraints of the applications and the requirements of the architecture. Reconfigurable computing is a promising paradigm that arguably provides the best trade-off between these requirements and constraints. Dynamically reconfigurable architectures are its key enabling technology: they let the hardware adapt to the application at runtime. However, these architectures raise new challenges in SoC design. On one hand, designing a system that takes advantage of dynamic reconfiguration is still very time-consuming because of the lack of design methodologies and tools. On the other hand, scheduling hardware tasks differs from classical software-task scheduling on uniprocessor or multiprocessor systems, as it carries an additional, complicated placement problem.
This thesis deals with the problem of scheduling online real-time hardware tasks on Dynamically Reconfigurable Hardware Devices (DRHWs). The problem is addressed from two angles: (i) investigating novel algorithms for online real-time scheduling/placement on DRHWs, and (ii) building a library of scheduling/placement algorithms for RTOS-driven Design Space Exploration (DSE). Regarding the first point, the thesis proposes two main runtime-aware scheduling and placement techniques and assesses their suitability for online real-time scenarios. The first explores the impact of synthesizing, at design time, several shapes and/or sizes per hardware task (a multi-shape task) in order to ease online scheduling. The second combines a look-ahead scheduling approach with slot-based management of the reconfigurable area that relies on 1D placement. The results show that both techniques improve scheduling and placement quality without significantly increasing algorithmic time complexity. Regarding the second point, when designing SoCs embedding reconfigurable parts, new design paradigms aim to explore and validate the architectural design space as early as possible at the system level, so that the RTOS (Real-Time Operating System) services managing the reconfigurable parts of the SoC can be refined. In this context, a library gathering numerous hardware-task scheduling and placement algorithms with various complexity-versus-performance trade-offs is required. In this thesis, the proposed algorithms, in addition to some existing ones, are implemented in C++ in order to ensure compatibility with any C++/SystemC-based SoC design methodology.
EThOS - Electronic Theses Online Service, GB, United Kingdom
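For a flavor of the problem (a first-fit/EDF toy of our own, not the thesis's multi-shape or look-ahead techniques), the sketch below models the reconfigurable area as a row of columns and combines earliest-deadline-first selection with 1D first-fit placement:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// A hardware task occupies a contiguous range of columns for `exec` time
// units and must finish by `deadline`. All fields are illustrative.
struct Task { int id, width, exec, deadline; };

struct Scheduler {
    std::vector<int> free_at;                 // time each column becomes free
    explicit Scheduler(int columns) : free_at(columns, 0) {}

    // Try to start `t` at time `now`: first-fit over column ranges,
    // rejecting placements that would miss the deadline.
    bool place(const Task& t, int now) {
        int cols = (int)free_at.size();
        for (int c = 0; c + t.width <= cols; ++c) {
            bool fits = true;
            for (int k = c; k < c + t.width; ++k)
                if (free_at[k] > now) { fits = false; break; }
            if (fits && now + t.exec <= t.deadline) {
                for (int k = c; k < c + t.width; ++k) free_at[k] = now + t.exec;
                std::printf("task %d -> columns [%d,%d) at t=%d\n",
                            t.id, c, c + t.width, now);
                return true;
            }
        }
        return false;  // rejected now; a real scheduler would retry later
    }
};

int main() {
    Scheduler s(8);
    std::vector<Task> ready = {{1, 4, 10, 20}, {2, 3, 5, 8}, {3, 2, 7, 30}};
    std::sort(ready.begin(), ready.end(),     // EDF: tightest deadline first
              [](const Task& a, const Task& b) { return a.deadline < b.deadline; });
    for (const Task& t : ready)
        if (!s.place(t, 0)) std::printf("task %d rejected\n", t.id);
    return 0;
}
```

Even this toy exposes the coupling the thesis studies: a scheduling decision is only feasible if a placement exists, which is what makes hardware-task scheduling harder than its software counterpart.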
