Search CORE

220 research outputs found

Parallel and Distributed Computing

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

Directory of Open Access Books (DOAB)

A Survey on Thread-Level Speculation Techniques

Author: Estébanez López Álvaro
González Escribano Arturo
Llanos Ferraris Diego Rafael
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016
Field of study

Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)

Repositorio Documental de la Universidad de Valladolid

The effect of an optical network on-chip on the performance of chip multiprocessors

Author: Van Laer Anouk
Publication venue: UCL (University College London)
Publication date: 28/04/2018
Field of study

Optical networks on-chip (ONoC) have been proposed to reduce power consumption and increase bandwidth density in high performance chip multiprocessors (CMP), compared to electrical NoCs. However, as buffering in an ONoC is not viable, the end-to-end message path needs to be acquired in advance during which the message is buffered at the network ingress. This waiting latency is therefore a combination of path setup latency and contention and forms a significant part of the total message latency. Many proposed ONoCs, such as Single Writer, Multiple Reader (SWMR), avoid path setup latency at the expense of increased optical components. In contrast, this thesis investigates a simple circuit-switched ONoC with lower component count where nodes need to request a channel before transmission. To hide the path setup latency, a coherence-based message predictor is proposed, to setup circuits before message arrival. Firstly, the effect of latency and bandwidth on application performance is thoroughly investigated using full-system simulations of shared memory CMPs. It is shown that the latency of an ideal NoC affects the CMP performance more than the NoC bandwidth. Increasing the number of wavelengths per channel decreases the serialisation latency and improves the performance of both ONoC types. With 2 or more wavelengths modulating at 25 Gbit=s , the ONoCs will outperform a conventional electrical mesh (maximal speedup of 20%). The SWMR ONoC outperforms the circuit-switched ONoC. Next coherence-based prediction techniques are proposed to reduce the waiting latency. The ideal coherence-based predictor reduces the waiting latency by 42%. A more streamlined predictor (smaller than a L1 cache) reduces the waiting latency by 31%. Without prediction, the message latency in the circuit-switched ONoC is 11% larger than in the SWMR ONoC. Applying the realistic predictor reverses this: the message latency in the SWMR ONoC is now 18% larger than the predictive circuitswitched ONoC

UCL Discovery

Castell: a heterogeneous cmp architecture scalable to hundreds of processors

Author: Cabarcas Jaramillo Felipe
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2011
Field of study

Technology improvements and power constrains have taken multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high-programming effort. We propose Castell a scalable chip multiprocessor architecture that can be programmed as uniprocessors, and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to provide programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of application for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models. ii

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Secretaría de Estado de Cultura

CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs

Author: Lhuillier Yves
Palatin Pierre
Temam Olivier
Publication venue: HAL CCSD
Publication date: 11/12/2006
Field of study

Since processor performance scalability will now mostly be achieved through thread-level parallelism, there is a strong incen- tive to parallelize a broad range of applications, including those with complex control ﬂow and data structures. And writing par- allel programs is a notoriously difﬁcult task. Beyond processor performance, the architect can help by facilitating the task of the programmer, especially by simplifying the model exposed to the programmer. In this article, among the many issues associated with writing par- allel programs, we focus on ﬁnding the appropriate parallelism granularity, and efﬁciently mapping tasks with complex control and data ﬂow to threads. We propose to relieve the user and com- piler of both tasks by delegating the parallelization decision to the architecture at run-time, through a combination of hardware and software support and a tight dialogue between both. For the software support, we leverage an increasingly popular approach in software engineering, called component-based pro- gramming; the component contract assumes tight encapsulation of code and data for easy manipulation. Previous research works have shown that it is possible to augment components with the ability to split/spawn, providing a simple and ﬁtting approach for programming parallel applications with complex control and data structures. However, such environments still require the program- mer to determine the appropriate granularity of parallelism, and spawning incurs signiﬁcant overheads due to software run-time system management. For that purpose, we provide an environment with the ability to spawn conditionally depending on available hardware resources, and we delegate spawning decisions and actions to the architec- ture. This conditional spawning is implemented through frequent hardware resource probing by the program. This, in turn, enables rapid adaptation to varying workload conditions, data sets and hardware resources. Furthermore, thanks to appropriate com- bined hardware and compiler support, the probing has no signiﬁ- cant overhead on program performance. We demonstrate this approach on an 8-context SMT, sev- eral non-trivial algorithms and re-engineered SPEC CINT2000 benchmarks, written using component syntax processed by our toolchain. We achieve speedups ranging from 1.1 to 3.0 on our test suite

INRIA a CCSD electronic archive server

Hardware-Assisted Dependable Systems

Author: Kuvaiskii Dmitrii
Publication venue
Publication date: 22/01/2018
Field of study

Unpredictable hardware faults and software bugs lead to application crashes, incorrect computations, unavailability of internet services, data losses, malfunctioning components, and consequently financial losses or even death of people. In particular, faults in microprocessors (CPUs) and memory corruption bugs are among the major unresolved issues of today. CPU faults may result in benign crashes and, more problematically, in silent data corruptions that can lead to catastrophic consequences, silently propagating from component to component and finally shutting down the whole system. Similarly, memory corruption bugs (memory-safety vulnerabilities) may result in a benign application crash but may also be exploited by a malicious hacker to gain control over the system or leak confidential data. Both these classes of errors are notoriously hard to detect and tolerate. Usual mitigation strategy is to apply ad-hoc local patches: checksums to protect specific computations against hardware faults and bug fixes to protect programs against known vulnerabilities. This strategy is unsatisfactory since it is prone to errors, requires significant manual effort, and protects only against anticipated faults. On the other extreme, Byzantine Fault Tolerance solutions defend against all kinds of hardware and software errors, but are inadequately expensive in terms of resources and performance overhead. In this thesis, we examine and propose five techniques to protect against hardware CPU faults and software memory-corruption bugs. All these techniques are hardware-assisted: they use recent advancements in CPU designs and modern CPU extensions. Three of these techniques target hardware CPU faults and rely on specific CPU features: ∆-encoding efficiently utilizes instruction-level parallelism of modern CPUs, Elzar re-purposes Intel AVX extensions, and HAFT builds on Intel TSX instructions. The rest two target software bugs: SGXBounds detects vulnerabilities inside Intel SGX enclaves, and “MPX Explained” analyzes the recent Intel MPX extension to protect against buffer overflow bugs. Our techniques achieve three goals: transparency, practicality, and efficiency. All our systems are implemented as compiler passes which transparently harden unmodified applications against hardware faults and software bugs. They are practical since they rely on commodity CPUs and require no specialized hardware or operating system support. Finally, they are efficient because they use hardware assistance in the form of CPU extensions to lower performance overhead

Technische Universität Dresden: Qucosa

Recommended from our members

Predicting Fuel Salt Composition via Linear Optimization in Molten Salt Reactors

Author: Wooten Daniel D
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Molten salt reactors (MSRs) are a class of nuclear reactor which uses a molten ionic liquidas either the coolant or also as the fuel. While a 8 MWth MSR was successfully operatedin the 1960s it was not until the early 2000s that MSRs gained widespread attention. Sincethen MSRs have enjoyed plentiful research support. Despite such support only one generalMSR fuel cycle analysis tool is available for use by the research community and even thisthis tool lacks features necessary for modelling a MSR, and so possibly providing answers ofa lower quality.In this work a method is proposed and implemented within the SERPENT 2 reactorphysics Monte-Carlo code. This method, named ADER - the Advanced Depletion Extensionfor Reprocessing - is a seamlessly incorporated source code modification to the SERPENT2 base code which allows the user to define arbitrary collections of elements, isotopes, andchemicals as well as relationships among them.Furthermore ADER allows the user to specify a variety of mass flows subject to theconstraints as defined through the collections of elements, isotopes, and chemicals the userhas structured. Along with support for constraints involving both corrosion and nuclearcontrol concerns, ADER allows the user to optimize the solution against a quantity of interest- e.g, total uranium fed into the system.Through these structures much of the complex chemistry, corrosion modelling, and nuclear concerns of operating a MSR can be linearized and solved against an optimizationtarget, which is necessary given the large number of constraints and variables in such aproblem space. Linearization reduces the problem complexity and eliminates concerns overlocal versus global optimization targets. Within the typically narrow operating parametersof MSRs linearization is an appropriate approximation to the higher dimensional equationsrepresenting the phenomenon involved.2This linear system and optimization target may then be passed to a linear optimizationsolver, in this work the CLP library as part of the COIN-OR package, from which an optimized system material composition and material flows solution may be found. ADER thenuses this solution to create a brand new depletion matrix which SERPENT 2 then solvesusing the CRAM approximation method.From this algorithm a more accurate modelling of MSR fuel cycles and physics maybe arrived at through the consideration of chemistry driven limitations, corrosion drivenlimitations, nuclear driven limitations, and operator driven limitations. Results from thisimplemented method indicate that ADER drives the MSR fuel cycle simulations towardsa more physically representative result. Unfortunately, as detailed later in this work, anunderlying and pernicious numerical instability issue was uncovered within the linear optimization library selected for this work. Any future work on this method must begin with theadoption of a quadruple-precision floating-point linear optimization library over the currentimplementation of CLP as used in ADER.In the following chapters an introduction to MSRs and their fuel cycle modelling isgiven. Following this the theory behind ADER and its implementation within SERPENT 2is discussed after which the results from one of the less numerically unstable simulations ispresented after which concluding remarks are given

eScholarship - University of California

Fundamental Approaches to Software Engineering

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/04/2022
Field of study

This open access book constitutes the proceedings of the 25th International Conference on Fundamental Approaches to Software Engineering, FASE 2022, which was held during April 4-5, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 17 regular papers presented in this volume were carefully reviewed and selected from 64 submissions. The proceedings also contain 3 contributions from the Test-Comp Competition. The papers deal with the foundations on which software engineering is built, including topics like software engineering as an engineering discipline, requirements engineering, software architectures, software quality, model-driven development, software processes, software evolution, AI-based software engineering, and the specification, design, and implementation of particular classes of systems, such as (self-)adaptive, collaborative, AI, embedded, distributed, mobile, pervasive, cyber-physical, or service-oriented applications

Directory of Open Access Books (DOAB)

Parallel architectures and runtime systems co-design for task-based programming models

Author: Castillo Villar Emilio
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cómputo modernos ha provocado la necesidad de una visión holística en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programación y las aplicaciones. Hoy en día el diseño de los computadores consiste en diferentes capas de abstracción con una interfaz bien definida entre ellas. Las limitaciones de esta aproximación junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayoría de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducción del tamaño del canal del transistor, lo cual permite chips más rápidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnología actual está alcanzando limitaciones físicas donde no será posible reducir el tamaño de los transistores motivando así un cambio de paradigma en la construcción de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecución del modelo de programación sean capaces de intercambiar información para alcanzar una meta común: La mejora del rendimiento y la reducción del consumo energético. Haciendo que la arquitectura sea consciente de la información disponible en el modelo de programación, como puede ser el grafo de dependencias entre tareas en los modelos de programación dataflow, es posible reducir el consumo energético explotando el camino critico del grafo. Además, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecución para realizar una mejor planificación de las comunicaciones y creando nuevas oportunidades de solapamiento entre cómputo y comunicación que no eran posibles anteriormente. Esta tesis aporta una evaluación de todas estas propuestas, así como una metodología para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

A data dependency recovery system for a heterogeneous multicore processor

Author: Kainth Haresh S.
Publication venue: University of Derby
Publication date: 01/01/2014
Field of study

Multicore processors often increase the performance of applications. However, with their deeper pipelining, they have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively utilised chip-multicore processors. Existing research has utilised a very common technique known as thread-level speculation. This technique attempts to compute results before the actual result is known. However, thread-level speculation impacts operation latency, circuit timing, confounds data cache behaviour and code generation in the compiler. We describe an software framework codenamed Lyuba that handles low-level data hazards and automatically recovers the application from data hazards without programmer and speculation intervention for an asymmetric chip-multicore processor. The problem of determining correct execution of multiple threads when data hazards occur on conventional symmetrical chip-multicore processors is a significant and on-going challenge. However, there has been very little focus on the use of asymmetrical (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) define the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present an optimal software control of hardware for distributed processing and recovery from violations;(iii) provides performance results of five applications using three datasets. Applications with a small dataset showed an improvement of 17% and a larger dataset showed an improvement of 16% giving overall 11% improvement in performance

UDORA - University of Derby Online Research Archive