
2,172 research outputs found

Acceleration of stereo-matching on multi-core CPU and GPU

Author: Cockshott Paul
Oehler Susanne
Tian Xu
Publication date: 01/01/2014

This paper presents an accelerated version of a dense stereo-correspondence algorithm for two parallelism-enabled architectures: multi-core CPU and GPU. The algorithm is part of the vision system developed for a binocular robot head in the context of the CloPeMa research project, which focuses on the design of a new clothes-folding robot whose vision system has real-time and high-resolution requirements. The performance analysis shows that the parallelised stereo-matching algorithm achieves speed-ups of 12x on multi-core CPU and 176x on GPU compared with a non-SIMD, single-threaded CPU implementation. To analyse the origin of the speed-up and to better understand the choice of optimal hardware, the algorithm was broken into key sub-tasks and its performance was tested on four different hardware architectures.

CiteSeerX

Enlighten

One machine, one minute, three billion tetrahedra

Author: Marot Célestin
Pellerin Jeanne
Remacle Jean-François
Publication venue: Wiley
Publication date: 01/01/2018

This paper presents a new scalable parallelization scheme to generate the 3D Delaunay triangulation of a given set of points. Our first contribution is an efficient serial implementation of the incremental Delaunay insertion algorithm. A simple dedicated data structure, an efficient sorting of the points and the optimization of the insertion algorithm make it three times faster than reference implementations. Our second contribution is a multi-threaded version of the Delaunay kernel that is able to insert vertices concurrently. Moore curve coordinates are used to partition the point set, avoiding heavy synchronization overheads. Conflicts are managed by modifying the partitions with a simple rescaling of the space-filling curve. The performance of our implementation was measured on three different processors, an Intel Core i7, an Intel Xeon Phi and an AMD EPYC, on which we were able to compute 3 billion tetrahedra in 53 seconds. This corresponds to a generation rate of over 55 million tetrahedra per second. Finally, we show how this very efficient parallel Delaunay triangulation can be integrated in a Delaunay refinement mesh generator which takes as input the triangulated surface boundary of the volume to mesh.
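The idea of partitioning a point set along a space-filling curve can be illustrated with a minimal sketch. It is not the paper's implementation: a Morton (Z-order) key is swapped in for the Moore curve because it is simpler to write down, the points are assumed to lie in the unit cube, and the names `interleave_bits`, `curve_key` and `partition_points` are hypothetical. The conflict handling by rescaling the curve is not shown.

```python
def interleave_bits(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def curve_key(point, bits=10):
    """Map a point of the unit cube [0, 1)^3 to its position along the curve."""
    cells = 1 << bits
    # Quantise each coordinate to an integer grid cell, clamped to the grid.
    x, y, z = (min(cells - 1, max(0, int(c * cells))) for c in point)
    return interleave_bits(x, y, z, bits)


def partition_points(points, n_parts, bits=10):
    """Sort points along the curve and cut the order into contiguous chunks.

    Points that are close on the curve tend to be close in space, so each
    chunk (one per thread, in this assumed setup) covers a compact region.
    """
    ordered = sorted(points, key=lambda p: curve_key(p, bits))
    size, extra = divmod(len(ordered), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(ordered[start:end])
        start = end
    return parts
```

Each thread would then insert only the points of its own chunk, which is what keeps synchronization between threads low in a scheme of this kind.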

arXiv.org e-Print Archive

DIAL UCLouvain

Design of non conventional Synchronous Reluctance machine

Author: GAMBA MATTEO
Publication venue: country:Italy
Publication date: 01/01/2017

Synchronous reluctance (SyR) and permanent-magnet synchronous reluctance (PM-SyR) machines represent an answer to the growing emphasis on higher efficiency, higher torque density and overload capability of AC machines for variable-speed applications. Their high performance is particularly attractive in electric traction and industry applications. SyR technology is a convenient way to obtain high-efficiency machines at reduced cost and with high reliability. The manufacturing costs are comparable to those of other existing technologies such as induction motors. Several SyR and PM-SyR machines with different ratings and applications were designed for comparison with induction motors of the same frame size. An accurate comparison between induction motors, SyR and PM-SyR machines is reported, with reference to the IE4 and IE5 efficiency specifications that could become mandatory in the coming years. Three studies are classified under the term "Non-Conventional" machines:

Line-Start SyR motor: a special SyR machine designed for line-supplied, constant-speed applications. The rotor flux barriers are filled with aluminum to obtain a squirrel cage that resembles that of an induction motor. The manufacturing costs are comparable to those of the induction motor, and the efficiency is higher. Two prototypes were built and tested.

FSW-SyR: tooth-wound coils and fractional slot-per-pole combinations were investigated. They are of interest because they permit a simpler and more highly automated manufacturing process. However, FSW-SyR machines are known for their high torque ripple, low specific torque and low power factor. The number of slots per pole was optimized to maximize the torque density. To address the torque ripple, a lumped-parameter model was used together with optimization in SyRE. A design with minimized ripple was obtained, comparable to a distributed-winding machine in this respect. This design was prototyped and tested.

Mild Overlapped SyR: this study presents a new winding configuration applied to SyR and PM-SyR machines. The proposed configuration aims at a hybrid solution between distributed-winding and tooth-winding motors that reduces cost and improves performance. One limitation of this solution is that only pole-pair numbers of five or higher are feasible, which reduces its applicability to classical industry applications, where one to three pole pairs are normally used.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Lace: non-blocking split deque for work-stealing

Author: D. Hendler
K.F. Faxén
N.S. Arora
R.D. Blumofe
S. Olivier
Publication venue: Springer International Publishing
Publication date: 01/01/2014

Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations.

In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion.

We present Lace, an implementation of work-stealing based on this deque, with an interface similar to the work-stealing library Wool, and an evaluation of Lace based on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace have low overhead and high scalability similar to those of Wool.
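The split-point bookkeeping described above can be modelled in a few lines. The following is a sequential, single-threaded sketch only: it uses no atomics and is not the Lace implementation. The class `SplitDeque`, its method names, the "move half of the tasks" policy and the `fences` counter (which merely marks where a real implementation would need a memory fence) are all assumptions made for illustration.

```python
class SplitDeque:
    """Sequential model of a split deque.

    Indices [tail, split) form the shared portion that thieves may take from;
    indices [split, head) form the private portion that only the owner touches.
    """

    def __init__(self):
        self.tasks = []   # backing storage; len(tasks) always equals head
        self.tail = 0     # next task a thief would steal
        self.split = 0    # boundary between shared and private portions
        self.head = 0     # next free slot for the owner
        self.fences = 0   # operations that would need a fence in a real deque

    def push(self, task):
        # Owner-only operation on the private portion: no fence is modelled.
        self.tasks.append(task)
        self.head += 1

    def grow_shared(self):
        # Owner publishes half (rounded up) of its private tasks to thieves.
        private = self.head - self.split
        self.split += (private + 1) // 2

    def shrink_shared(self):
        # Owner reclaims half (rounded up) of the unstolen shared tasks.
        # This is the one owner operation the abstract says needs a fence.
        self.fences += 1
        shared = self.split - self.tail
        if shared == 0:
            return False
        self.split -= (shared + 1) // 2
        return True

    def pop(self):
        # Owner takes the newest task; it shrinks the shared portion only
        # when the private portion is empty.
        if self.head == self.split and not self.shrink_shared():
            return None
        self.head -= 1
        return self.tasks.pop()

    def steal(self):
        # Thief takes the oldest task, but only from the shared portion.
        if self.tail >= self.split:
            return None
        task = self.tasks[self.tail]
        self.tail += 1
        return task
```

In this model, pushes and pops that stay within the private portion never increment `fences`, which mirrors the claim that fences are needed only when the shared portion shrinks.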

Crossref

University of Twente Research Information

Project Final Report: HPC-Colony II

Publication venue: Office of Scientific and Technical Information (OSTI)

Crossref

Architectural support for task dependence management with flexible software scheduling

Author: Beivide Palacio Ramon
Bosque Jose L.
Casas Marc
Castillo Emilio
Moreto Planas Miquel
Valero Cortés Mateo
Vallejo Enrique
Álvarez Martí Lluc
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU while still performing task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x lower area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.

This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.

Peer Reviewed. Postprint (author's final draft)
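To make concrete what "dependence tracking" in a task runtime involves, here is a minimal software sketch of the bookkeeping that a mechanism like TDM would offload to hardware. It is an illustration under assumptions, not the paper's runtime or the DMU interface: the class `DependenceTracker`, its `add_task` method and the use of string addresses are hypothetical.

```python
from collections import defaultdict


class DependenceTracker:
    """Derive task predecessors from declared read and write addresses."""

    def __init__(self):
        self.last_writer = {}              # address -> task that last wrote it
        self.readers = defaultdict(list)   # address -> readers since that write
        self.deps = {}                     # task -> set of predecessor tasks

    def add_task(self, task, reads=(), writes=()):
        preds = set()
        for addr in reads:
            # Read-after-write: wait for the last writer of this address.
            if addr in self.last_writer:
                preds.add(self.last_writer[addr])
        for addr in writes:
            # Write-after-write: wait for the previous writer.
            if addr in self.last_writer:
                preds.add(self.last_writer[addr])
            # Write-after-read: wait for every reader of the old value.
            preds.update(self.readers[addr])
        # Register this task's accesses for the tasks that follow it.
        for addr in reads:
            self.readers[addr].append(task)
        for addr in writes:
            self.last_writer[addr] = task
            self.readers[addr] = []
        preds.discard(task)
        self.deps[task] = preds
        return preds
```

A scheduler, still in software, would run a task once all of its predecessors have finished. The per-address lookups above are the kind of work whose cost grows as tasks become finer-grained.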

Crossref

UPCommons. Portal del coneixement obert de la UPC

Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent

Author: Bäckström Karl
Publication date: 01/01/2021

The emergence of big data in recent years, due to vast societal digitalization and large-scale sensor deployment, has created significant interest in machine learning methods that enable automatic data analytics. In the majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset in order to converge to a solution of sufficient quality.

In order to cope with increasing data volumes, and to facilitate accelerated processing on contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature because they scale better thanks to less coordination and, consequently, less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state.

In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and to develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods together with established baselines, focusing on the prominent application of deep learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.
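The effect of staleness on SGD, and the general idea of adapting the step size to it, can be shown with a small deterministic simulation. This is a sketch under stated assumptions and not MindTheStep-AsyncSGD itself: the rule `base_lr / (1 + staleness)` is a generic damping heuristic chosen for illustration, the objective is the toy function f(x) = x²/2, and the names `adaptive_step` and `run_async_sgd` are hypothetical.

```python
def adaptive_step(base_lr, staleness):
    """Shrink the step size as the gradient gets staler (assumed heuristic)."""
    return base_lr / (1 + staleness)


def run_async_sgd(x0, base_lr, delays, adaptive=True):
    """Minimise f(x) = x**2 / 2 with delayed gradients.

    The update at step t uses the gradient evaluated at the iterate from
    delays[t] steps earlier, mimicking a worker that read a stale copy of
    the shared state. Returns the full list of iterates.
    """
    history = [x0]
    for t, tau in enumerate(delays):
        tau = min(tau, t)                # cannot be staler than the run is long
        stale_x = history[t - tau]       # the view the "worker" computed on
        grad = stale_x                   # gradient of x**2 / 2 at the stale view
        lr = adaptive_step(base_lr, tau) if adaptive else base_lr
        history.append(history[-1] - lr * grad)
    return history
```

With a constant delay of 4 steps and `base_lr = 0.9`, the fixed-step run oscillates with growing amplitude, while the staleness-scaled run uses an effective step of 0.18 and settles towards the minimum at 0.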

Chalmers Research