27 research outputs found

    PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems pose two significant optimization challenges: data transfer bottlenecks, and splitting and scheduling the problem across multiple workers (GPUs) with distinct memories. We demonstrate that current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed with current scheduler-based approaches: determining which devices should be used for a given routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs over a large and diverse dataset, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state of the art, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
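    The device-selection decision the abstract highlights can be sketched with a toy performance model: estimate each candidate configuration's time as compute time plus transfer time, and pick the device subset with the lowest predicted time. All names, numbers, and the roofline-style model below are illustrative assumptions, not PARALiA's actual model.

    ```python
    # Hypothetical sketch of model-based device selection for a multi-GPU GEMM.
    # The cost model and GPU parameters are invented for illustration.

    def predicted_time(flops, bytes_moved, gpu):
        """Roofline-style estimate: compute time + host-device transfer time."""
        return flops / gpu["peak_flops"] + bytes_moved / gpu["link_bw"]

    def select_devices(flops, bytes_moved, gpus):
        """Pick the GPU subset with the lowest predicted time.

        Splitting across k GPUs divides both compute and transfers here,
        but a GPU on a slower link can still dominate the critical path."""
        best = None
        for k in range(1, len(gpus) + 1):
            subset = gpus[:k]  # assume GPUs pre-sorted by link bandwidth
            t = max(predicted_time(flops / k, bytes_moved / k, g) for g in subset)
            if best is None or t < best[1]:
                best = (subset, t)
        return best

    gpus = [
        {"name": "gpu0", "peak_flops": 7e12, "link_bw": 12e9},
        {"name": "gpu1", "peak_flops": 7e12, "link_bw": 12e9},
        {"name": "gpu2", "peak_flops": 7e12, "link_bw": 6e9},  # slower link
    ]
    # Square GEMM, n = 8192, double precision: 2n^3 flops, 3n^2 * 8 bytes moved.
    devices, t = select_devices(flops=2 * 8192**3, bytes_moved=3 * 8192**2 * 8, gpus=gpus)
    ```

    With these made-up parameters the model decides that even the GPU behind the slower link is worth using; change the link bandwidths and the chosen subset shrinks, which is exactly the routine-invocation-level decision the paper argues schedulers alone cannot make.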

    Parallelization of General Linkage Analysis Problems

    We describe a parallel implementation of a genetic linkage analysis program that achieves good speedups, even for analyses on a single pedigree and with a single starting recombination fraction vector. Our parallel implementation has been run on three different platforms: an Ethernet network of workstations, a higher-bandwidth Asynchronous Transfer Mode (ATM) network of workstations, and a shared-memory multiprocessor. The same program, written in a shared-memory programming style, is used on all platforms. On the workstation networks, the hardware does not provide shared memory, so the program executes on a distributed shared memory system that implements shared memory in software. These three platforms represent different points on the price/performance scale. Ethernet networks are cheap and omnipresent. ATM networks are an emerging technology that offers higher bandwidth, and shared-memory multiprocessors offer the best performance because communication is implemented entirely in hardware. On 8 processors and for the longer runs, we achieve speedups between 3.5 and 5 on the Ethernet network and between 4.8 and 6 on the ATM network. On the shared-memory multiprocessor, we achieve speedups in the 5.5 to 6.5 range for all runs.
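    The reported speedups translate directly into parallel efficiency (speedup divided by processor count), which makes the platform comparison concrete. A minimal worked example using only the figures quoted in the abstract:

    ```python
    # Parallel efficiency implied by the reported speedups on 8 processors.
    # efficiency = speedup / processor count; ranges are (low, high) from the text.

    def efficiency(speedup, nprocs):
        return speedup / nprocs

    reported = {"ethernet": (3.5, 5.0), "atm": (4.8, 6.0), "smp": (5.5, 6.5)}
    eff = {net: (efficiency(lo, 8), efficiency(hi, 8))
           for net, (lo, hi) in reported.items()}
    # The shared-memory runs reach roughly 69-81% efficiency, the Ethernet
    # runs as low as ~44%, reflecting the cost of shared memory in software.
    ```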

    Exploiting the Hard-Working DWARF: Trojan and Exploit Techniques Without Native Executable Code

    The study of vulnerabilities and exploitation is one of finding mechanisms affecting the flow of computation and of finding new means to perform unexpected computation. In this paper we show the extent to which exception handling mechanisms, as implemented and used by GCC, can be used to control program execution. We show that the data structures used to store exception handling information on UNIX-like systems actually contain Turing-complete bytecode, which is executed by a virtual machine during the course of exception unwinding and handling. We discuss how a malicious attacker could gain control over these structures and how such an attacker could utilize them once control has been achieved.
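    The "bytecode executed by a virtual machine" claim becomes tangible once you see that DWARF expressions are programs for a stack machine. The toy interpreter below handles a four-opcode subset; the opcode values follow the DWARF specification, but the interpreter itself is a deliberately minimal sketch, not GCC's unwinder.

    ```python
    # Toy interpreter for a small subset of DWARF expression opcodes, to
    # illustrate why unwinding metadata amounts to bytecode for a stack VM.
    # Opcode numbers are the real DWARF values; everything else is a sketch.

    DW_OP_const1u, DW_OP_dup, DW_OP_plus, DW_OP_mul = 0x08, 0x12, 0x22, 0x1E

    def eval_dwarf_expr(code):
        stack, pc = [], 0
        while pc < len(code):
            op = code[pc]; pc += 1
            if op == DW_OP_const1u:        # push one unsigned literal byte
                stack.append(code[pc]); pc += 1
            elif op == DW_OP_dup:          # duplicate top of stack
                stack.append(stack[-1])
            elif op == DW_OP_plus:         # pop two operands, push their sum
                b, a = stack.pop(), stack.pop(); stack.append(a + b)
            elif op == DW_OP_mul:          # pop two operands, push their product
                b, a = stack.pop(), stack.pop(); stack.append(a * b)
            else:
                raise ValueError(f"unhandled opcode {op:#x}")
        return stack[-1]

    # (3 * 3) + 4 encoded as DWARF-style expression bytecode:
    expr = bytes([DW_OP_const1u, 3, DW_OP_dup, DW_OP_mul, DW_OP_const1u, 4, DW_OP_plus])
    ```

    The full opcode set includes loads, register reads, and conditional branches, which is what pushes the language to Turing-completeness and makes attacker-controlled unwind tables dangerous.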

    Policy network assisted Monte Carlo Tree search for intelligent service function chain deployment

    This is the author accepted manuscript; the final version is available from IEEE via the DOI in this record. Network function virtualization (NFV) simplifies the configuration and management of security services by migrating network security functions from dedicated hardware devices to software middleboxes that run on commodity servers. Under the NFV paradigm, the service function chain (SFC), consisting of a series of ordered virtual network security functions, is becoming a mainstream form for carrying network security services. Allocating the underlying physical network resources to the demands of SFCs under given constraints over time is known as the SFC deployment problem, and it is a crucial issue for infrastructure providers. However, SFC deployment faces new challenges in trading off between pursuing a high revenue-to-cost ratio and making decisions in an online manner. In this paper, we investigate the use of reinforcement learning to guide online deployment decisions for SFC requests and propose a Policy network Assisted Monte Carlo Tree search approach named PACT to address this challenge, aiming to maximize the average revenue-to-cost ratio. PACT combines the strengths of the policy network, which evaluates the placement potential of physical servers, and Monte Carlo Tree Search, which can tackle problems with large state spaces. Extensive experimental results demonstrate that PACT achieves the best performance, outperforming other algorithms by up to 30% and 23.8% in average revenue-to-cost ratio and acceptance rate, respectively. Funding: Major Special Program for Technical Innovation & Application Development of Chongqing Science & Technology Commission; National NSFC; Chongqing Research Program of Basic Research and Frontier Technology; Natural Science Foundation of Jiangsu; Leading Technology of Jiangsu Basic Research Plan; European Union Horizon 2020.
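    The way a policy network "assists" tree search is usually through prior-weighted node selection: the policy's score for each candidate server biases which branch the search expands next. The PUCT-style scoring below is one standard way to do this; the constant, the node fields, and the server names are illustrative assumptions, not PACT's actual formulation.

    ```python
    # Minimal sketch of policy-prior-guided node selection (PUCT-style), the
    # mechanism by which a policy network focuses Monte Carlo Tree Search on
    # placements it rates highly. All names and values are illustrative.
    import math

    def puct_score(child, parent_visits, c=1.4):
        # Exploitation: mean value of simulations through this child.
        q = child["value"] / child["visits"] if child["visits"] else 0.0
        # Exploration: policy prior, decayed as the child accumulates visits.
        u = c * child["prior"] * math.sqrt(parent_visits) / (1 + child["visits"])
        return q + u

    def select_child(children, parent_visits):
        return max(children, key=lambda ch: puct_score(ch, parent_visits))

    # Three candidate physical servers, scored by a hypothetical policy network:
    children = [
        {"server": "s1", "prior": 0.6, "visits": 10, "value": 4.0},
        {"server": "s2", "prior": 0.3, "visits": 2,  "value": 1.2},
        {"server": "s3", "prior": 0.1, "visits": 0,  "value": 0.0},
    ]
    best = select_child(children, parent_visits=12)
    ```

    Note how the already-well-explored s1 loses to s2 despite its higher prior: the visit count in the denominator is what lets the search revise the policy network's initial ranking.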

    MLIP: using multiple processors to compute the posterior probability of linkage

    Background: Localization of complex traits by genetic linkage analysis may involve exploration of a vast multidimensional parameter space. The posterior probability of linkage (PPL), a class of statistics for complex trait genetic mapping in humans, is designed to model the trait model complexity represented by the multidimensional parameter space in a mathematically rigorous fashion. However, the method requires the evaluation of integrals with no functional form, making it difficult to compute, and thus to further test, develop and apply. This paper describes MLIP, a multiprocessor two-point genetic linkage analysis system that supports statistical calculations, such as the PPL, based on the full parameter space implicit in the linkage likelihood.
    Results: The fundamental question we address here is whether the use of additional processors effectively reduces total computation time for a PPL calculation. We use a variety of data, both simulated and real, to explore the question "how close can we get?" to linear speedup. Empirical results of our study show that MLIP does significantly speed up two-point log-likelihood ratio calculations over a grid space of model parameters.
    Conclusion: Observed performance of the program is dependent on characteristics of the data, including granularity of the parameter grid space being explored and pedigree size and structure. While work continues to further optimize performance, the current version of the program can already be used to efficiently compute the PPL. Thanks to MLIP, full multidimensional genome scans are now routinely being completed at our centers with runtimes on the order of days, not months or years.
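    The parallelization strategy the abstract describes is naturally data-parallel: each point of the trait-model parameter grid can be evaluated independently and the grid partitioned across processors. The sketch below illustrates that structure with a placeholder likelihood; the real pedigree likelihood, grid axes, and any MLIP interfaces are not reproduced here.

    ```python
    # Sketch of grid-parallel likelihood evaluation in the spirit of MLIP.
    # The "likelihood" is a stand-in function; only the parallel structure
    # (independent grid points fanned out to worker processes) is the point.
    from itertools import product
    from multiprocessing import Pool

    def log_likelihood(point):
        theta, penetrance = point  # placeholder for a real pedigree likelihood
        return -((theta - 0.1) ** 2 + (penetrance - 0.8) ** 2)

    def scan_grid(thetas, penetrances, nprocs=4):
        grid = list(product(thetas, penetrances))
        with Pool(nprocs) as pool:             # embarrassingly parallel map
            scores = pool.map(log_likelihood, grid)
        return max(zip(scores, grid))          # best (score, parameter point)

    if __name__ == "__main__":
        best_score, best_point = scan_grid(
            thetas=[i / 20 for i in range(11)],       # recombination fraction
            penetrances=[i / 10 for i in range(11)],  # penetrance
        )
    ```

    Because grid points are independent, speedup is limited mainly by load imbalance (pedigree likelihoods vary in cost per point), which matches the abstract's observation that performance depends on grid granularity and pedigree structure.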

    Branch and Bound pattern in FastFlow

    The thesis describes the development of a FastFlow skeleton implementing a Branch and Bound parallel framework that can be specialized by the user to solve any problem amenable to a Branch and Bound approach. To this end, the skeleton leaves to the user the task of specifying the application-specific code, while providing all the objects needed to implement the Branch and Bound framework.
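    The division of labor the abstract describes can be sketched language-neutrally: the skeleton owns the branch-and-bound loop (frontier, incumbent, pruning) and the user plugs in branch/bound/value callbacks. FastFlow itself is a C++ library; the sketch below is an illustrative sequential rendering with a tiny 0/1 knapsack as the user-supplied problem, not the thesis's actual API.

    ```python
    # Skeleton side: the framework owns the search loop and pruning logic.
    def branch_and_bound(root, branch, bound, value, is_leaf):
        best_val, best_node, frontier = float("-inf"), None, [root]
        while frontier:
            node = frontier.pop()
            if bound(node) <= best_val:
                continue                      # prune: cannot beat the incumbent
            if is_leaf(node):
                if value(node) > best_val:
                    best_val, best_node = value(node), node
            else:
                frontier.extend(branch(node))
        return best_val, best_node

    # User side: a tiny 0/1 knapsack, items are (weight, profit), capacity 5.
    items, cap = [(2, 3), (3, 4), (4, 5)], 5
    # A node is (next_item_index, weight_used, profit_so_far).
    def branch(n):
        i, w, p = n
        kids = [(i + 1, w, p)]                                    # skip item i
        if w + items[i][0] <= cap:
            kids.append((i + 1, w + items[i][0], p + items[i][1]))  # take item i
        return kids
    bound = lambda n: n[2] + sum(pr for _, pr in items[n[0]:])  # optimistic bound
    value = lambda n: n[2]
    is_leaf = lambda n: n[0] == len(items)

    best, _ = branch_and_bound((0, 0, 0), branch, bound, value, is_leaf)
    ```

    In the FastFlow version the frontier would be fed to a farm of workers, but the user-visible contract, supplying the problem-specific branch and bound functions, is the same.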

    X-linked mental retardation in S.E. Scotland


    A simulation framework for traffic information dissemination in ubiquitous vehicular ad hoc networks

    The ongoing efforts to apply advanced technologies to help solve transportation problems have advanced the growing trend of integrating mobile wireless communications into transportation systems. In particular, vehicular ad hoc networks (VANETs) allow vehicles to constitute a decentralized traffic information system on roadways and to share their own information. This research focused on the development of an integrated transportation and communication simulation framework to build a more realistic environment with which to study VANETs, as compared to previous studies. This research implemented a VANET-based information model in an integrated transportation and communication simulation framework in which these independent simulation tools were tightly coupled and finely synchronized. A traffic information system was built as a VANET application and demonstrated on the simulation framework developed in this research. In this system, vehicles record their own travel time data, share these data via an ad hoc network, and reroute at split sections based on stored travel time data. Dissemination speeds of traffic information broadcast over a real roadway network were obtained. In this research, traffic information propagated at speeds ranging approximately from the road speed limit at low traffic density, where updates were mostly carried by vehicles traveling in the opposite direction, up to half the transmission range (250/2 meters) per second at high traffic density, where updates were carried by vehicles traveling in the same direction. Successful dynamic routing based on stored travel time data was demonstrated with and without an incident in this framework. In both cases, the benefits of dynamic routing were shown even at low market penetration.
    It is believed that a wide range of VANET applications can be designed and assessed using methodologies influenced by, and contributed to by, the simulation framework and other methods developed in this dissertation.
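    The rerouting step described above, comparing stored travel times along alternative routes at a split section, reduces to a shortest-path query over link travel times. The sketch below uses Dijkstra's algorithm on a toy network with an incident on one link; the graph, travel times, and node names are illustrative, not the dissertation's road network.

    ```python
    # Sketch of VANET-assisted rerouting: pick the route with the lowest
    # stored travel time. Link times are the data vehicles share; the
    # network and values below are invented for illustration.
    import heapq

    def shortest_time(graph, src, dst):
        """Dijkstra over directed link travel times (seconds)."""
        dist, heap = {src: 0.0}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > dist.get(u, float("inf")):
                continue                      # stale heap entry
            for v, t in graph.get(u, []):
                nd = d + t
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")

    # Shared travel times; an incident reported via the VANET slows link B->D.
    graph = {
        "A": [("B", 60.0), ("C", 90.0)],
        "B": [("D", 300.0)],   # incident: normally fast, now 300 s
        "C": [("D", 80.0)],
    }
    t = shortest_time(graph, "A", "D")   # reroutes via C: 90 + 80 = 170 s
    ```

    Without the shared incident report the vehicle would take A→B→D; with it, the split-section decision at A switches to the C route, which is the behavior the framework demonstrates with and without an incident.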