
    A Fast Causal Profiler for Task Parallel Programs

    This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF uses hardware performance counters to perform fine-grained measurements, which minimizes perturbation, and its profile execution runs in parallel on multicores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized, even before concrete optimizations are known. We have used TASKPROF to isolate parallelism bottlenecks in twenty-three applications that use the Intel Threading Building Blocks library. Using TASKPROF, we have designed parallelization techniques in five applications that increase parallelism by an order of magnitude. Our user study indicates that developers can isolate performance bottlenecks with ease using TASKPROF.
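
    The quantities behind such a profile are the total work and the critical-path length (span) of the task tree: parallelism is their ratio, and a causal "what-if" query rescales the work attributed to one region to predict parallelism after a hypothetical optimization. Below is a minimal sketch of that estimate in Python; the task tree, region names, and work values are invented for illustration, and this is not TASKPROF's implementation.

        # Estimate parallelism = total work / span on a task tree, and answer
        # "what if region X were k times faster?" by rescaling its work.
        class Task:
            def __init__(self, region, work, children=None):
                self.region = region            # source region the work is attributed to
                self.work = work                # measured work, e.g. cycles from a counter
                self.children = children or []  # tasks spawned in parallel

        def work_span(task, speedup):
            w = task.work / speedup.get(task.region, 1.0)
            child = [work_span(c, speedup) for c in task.children]
            total = w + sum(cw for cw, _ in child)
            # parallel children: the span follows the longest child path
            span = w + (max(cs for _, cs in child) if child else 0.0)
            return total, span

        def parallelism(task, speedup=None):
            total, span = work_span(task, speedup or {})
            return total / span

        root = Task("A", 10, [Task("B", 100), Task("C", 20)])
        print(parallelism(root))               # baseline parallelism
        print(parallelism(root, {"B": 10.0}))  # if region B were 10x faster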

    Exploring the impact of user involvement on health and social care services for cancer in the UK.

    This report presents the findings from a study of cancer network partnership groups in the UK. Cancer network partnership groups are regional organisations set up to enable joint working between people affected by cancer and health professionals, with the aim of improving cancer care.

    Foo's To Blame: Techniques For Mapping Performance Data To Program Variables

    Traditional methods of performance analysis offer a code-centric view, presenting performance data in terms of blocks of contiguous code (statement, basic block, loop, function, etc.). Existing data-centric techniques allow various program properties to be mapped directly to variables. Our approach extends these data-centric mappings. Just as code-centric techniques allow lower-level objects like source lines to be mapped up to functions, our inclusive technique allows low-level data-centric operations, like computations on scalars, to be mapped up to complex data structures like those found in scientific frameworks. Our system uses static analysis to collect information about the program that can be combined with runtime information to perform data-centric program analysis. By pushing most of the analysis to pre-run and post-mortem phases, we minimize the amount of data collected at runtime. This allows us to perform less instrumentation, minimizes program perturbation, and lets us collect information that would not be possible with existing techniques. We present two applications of this analysis. The first maps performance data to high-level data structures with multiple levels of abstraction: we create extended data-centric mappings, which we call variable blame, that relate data-centric information to these variables. The second is a method for mapping cache-miss information to variables. Existing approaches for this analysis rely on explicit hardware support and extensive program instrumentation; by applying our analysis together with software heuristics, we lessen those requirements. We apply both analyses to applications, show what performance information our analysis can provide that cannot currently be determined, and discuss how that information can be used to improve program performance.
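
    One building block of such a mapping is resolving each sampled memory address to the variable or data structure that owns it. The sketch below illustrates that step with a sorted interval map; the variable names, address ranges, and samples are hypothetical, and the real system derives its ranges from static analysis combined with tracked allocations.

        # Attribute sampled addresses (e.g., cache-miss samples) to variables
        # by looking them up in sorted, non-overlapping allocation ranges.
        import bisect

        class VariableMap:
            def __init__(self):
                self.starts, self.ranges = [], []   # sorted starts; (start, end, name)

            def add(self, name, start, size):
                i = bisect.bisect_left(self.starts, start)
                self.starts.insert(i, start)
                self.ranges.insert(i, (start, start + size, name))

            def blame(self, addr):
                i = bisect.bisect_right(self.starts, addr) - 1
                if i >= 0:
                    start, end, name = self.ranges[i]
                    if start <= addr < end:
                        return name
                return "<unknown>"

        vmap = VariableMap()
        vmap.add("mesh.cells", 0x10000, 0x8000)     # hypothetical framework structures
        vmap.add("mesh.faces", 0x20000, 0x4000)

        counts = {}
        for addr, n in [(0x10010, 3), (0x20fff, 1), (0x90000, 2)]:
            name = vmap.blame(addr)                 # aggregate samples per variable
            counts[name] = counts.get(name, 0) + n
        print(counts)   # {'mesh.cells': 3, 'mesh.faces': 1, '<unknown>': 2}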

    Architecting Data Centers for High Efficiency and Low Latency

    Modern data centers, housing remarkably powerful computational capacity, are built at massive scale and consume a huge amount of energy. The energy consumption of data centers has mushroomed from virtually nothing to about three percent of the global electricity supply in the last decade, and it will continue to grow. Unfortunately, a significant fraction of this energy is wasted due to the inefficiency of current data center architectures, and one of the key reasons behind this inefficiency is the stringent response latency requirements of the user-facing services hosted in these data centers, such as web search and social networks. To deliver such low response latency, data center operators often have to overprovision resources to handle high peaks in user load and unexpected load spikes, resulting in low efficiency. This dissertation investigates data center architecture designs that reconcile high system efficiency with low response latency. To increase efficiency, we propose techniques that understand both microarchitectural-level resource sharing and system-level resource usage dynamics to enable highly efficient co-location of latency-critical services and low-priority batch workloads. We investigate resource sharing on real-system simultaneous multithreading (SMT) processors to enable SMT co-locations by precisely predicting performance interference. We then leverage historical resource usage patterns to further optimize the task scheduling algorithm and data placement policy and improve the efficiency of workload co-locations. Moreover, we introduce methodologies to better manage response latency by automatically attributing the sources of tail latency to low-level architectural and system configurations, in both offline load-testing and online production environments. We design and develop a response latency evaluation framework with microsecond-level precision for data center applications, with which we construct statistical inference procedures to attribute the sources of tail latency. Finally, we present an approach that proactively enacts carefully designed causal-inference micro-experiments to diagnose the root causes of response latency anomalies and automatically corrects them to reduce response latency.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144144/1/yunqi_1.pd
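
    A simple way to see the flavor of tail-latency attribution is to compare how often each value of a candidate configuration attribute appears among tail requests (at or above the p99 latency) versus among all requests: a value overrepresented in the tail is a candidate root cause. The sketch below illustrates that idea only; it is not the dissertation's framework, and the request records are invented.

        # Rank attribute values by overrepresentation in the latency tail.
        from collections import Counter

        def percentile(xs, p):
            xs = sorted(xs)
            return xs[min(len(xs) - 1, int(p / 100.0 * len(xs)))]

        def tail_attribution(records, attr, p=99.0):
            cut = percentile([r["latency_us"] for r in records], p)
            tail = [r for r in records if r["latency_us"] >= cut]
            overall = Counter(r[attr] for r in records)
            in_tail = Counter(r[attr] for r in tail)
            # overrepresentation ratio: share among tail requests / share overall
            return {v: (in_tail[v] / len(tail)) / (overall[v] / len(records))
                    for v in overall}

        records = (
            [{"latency_us": 150, "colocated": True} for _ in range(20)] +
            [{"latency_us": 100, "colocated": True} for _ in range(180)] +
            [{"latency_us": 100, "colocated": False} for _ in range(800)]
        )
        print(tail_attribution(records, "colocated"))
        # co-located requests dominate the tail, so the ratio for True is >> 1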

    Precise event sampling on AMD versus Intel: quantitative and qualitative comparison

    Precise event sampling is a profiling feature in commodity processors that can sample hardware events and accurately locate the instructions that trigger them. This feature has been used in a large number of tools to detect application performance issues. Although precise event sampling is readily supported in modern multicore architectures, the vendors' implementations exhibit great differences that affect their accuracy, stability, overhead, and functionality. This work presents the most comprehensive study to date benchmarking the event sampling features of Intel PEBS and AMD IBS, performing in-depth analysis of the key differences through a series of microbenchmarks. Our qualitative and quantitative analysis shows that PEBS allows finer-grained and more accurate sampling of hardware events, while IBS offers a richer set of information at each sample but suffers from lower accuracy and stability. Moreover, OS signal delivery, a common method used by profiling software, adds significant time overhead on top of that incurred by the hardware mechanisms in both PEBS and IBS. We also found that both PEBS and IBS exhibit bias when sampling events across different locations in code. Lastly, we demonstrate that our microbenchmark findings under different thread counts hold for a full-fledged profiling tool running on state-of-the-art Intel and AMD machines. Overall, our detailed comparisons serve as a reference and provide valuable information for hardware designers and profiling tool developers.
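
    On Linux, both mechanisms are reachable through the perf subsystem: the :pp event modifier requests precise (PEBS-backed) sampling on Intel, while AMD IBS is typically exposed as a separate PMU (for example ibs_op; exact event names vary by kernel and are an assumption here). The sketch below drives perf from Python and histograms the sampled instruction pointers, the kind of data used to measure sampling bias across code locations; the ./microbench binary is hypothetical.

        # Record precise samples with perf, then histogram sampled IPs.
        import subprocess
        from collections import Counter

        def sample_ips(cmd, event="cycles:pp", out="perf.data"):
            # e.g. event="ibs_op//" on AMD kernels that expose the IBS PMU (assumption)
            subprocess.run(["perf", "record", "-e", event, "-o", out, "--"] + cmd,
                           check=True)
            script = subprocess.run(["perf", "script", "-i", out, "-F", "ip"],
                                    capture_output=True, text=True, check=True)
            return Counter(line.strip() for line in script.stdout.splitlines()
                           if line.strip())

        hist = sample_ips(["./microbench"])     # hypothetical benchmark binary
        for ip, n in hist.most_common(5):       # hottest sampled addresses
            print(ip, n)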

    Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models

    Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to exploit those hardware features is crucial for obtaining the best performance. Several highly parallel programming models in active development allow programmers to write efficient code for those architectures, and performance profiling is an important technique for achieving the best performance during development. In this dissertation, I propose a new performance measurement and mapping technique that associates performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented profilers for PGAS and CUDA. For PGAS, I developed ChplBlamer, which supports both single-node and multi-node Chapel programs and provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, a feature not found in any previous CUDA profiler. Directed by the insights from these tools, I optimized several widely studied benchmarks and significantly improved program performance, by a factor of up to 4x for Chapel and 47x for CUDA kernels.
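
    Data-centric inter-node load imbalance can be summarized with a simple per-variable metric: the maximum time any node spends on a variable divided by the mean across nodes, where a ratio well above 1 flags unevenly distributed work. The sketch below illustrates that metric; the variable names and timings are invented, and this is not ChplBlamer's actual reporting code.

        # Per-variable load-imbalance ratio: max node time / mean node time.
        def imbalance(per_node_time):
            """per_node_time: {variable: [time on node 0, time on node 1, ...]}"""
            report = {}
            for var, times in per_node_time.items():
                mean = sum(times) / len(times)
                report[var] = max(times) / mean if mean > 0 else 1.0
            return report

        profile = {
            "A":   [10.1, 9.8, 10.0, 10.2],   # balanced: ratio near 1.0
            "idx": [2.0, 2.1, 38.5, 2.2],     # one node does most of the work
        }
        for var, ratio in sorted(imbalance(profile).items(), key=lambda kv: -kv[1]):
            print(f"{var}: imbalance {ratio:.2f}")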

    Polynomial-Time Reasoning Support for Design and Maintenance of Large-Scale Biomedical Ontologies

    Description Logics (DLs) belong to a successful family of knowledge representation formalisms with two key assets: a formally well-defined semantics, which allows knowledge to be represented unambiguously, and automated reasoning, which allows implicit knowledge to be inferred from what is stated explicitly. This thesis investigates various reasoning techniques for tractable DLs in the EL family, which have been implemented in the CEL system. It suggests that the use of lightweight DLs, in which reasoning is tractable, is beneficial for ontology design and maintenance in terms of both expressivity and scalability. The claim is supported by a case study on the renowned medical ontology SNOMED CT and by extensive empirical evaluation on several large-scale biomedical ontologies.
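
    Tractable reasoning in EL rests on a polynomial-time completion algorithm: the TBox is normalized to a few axiom shapes, and simple rules are applied to a fixpoint to compute every concept name's subsumers. The sketch below shows that fixpoint computation on a toy ontology; the concept and role names are illustrative, and this is a simplification of what a reasoner like CEL implements.

        def classify(concepts, subs, conj, ex_rhs, ex_lhs):
            """Completion over normalized EL axioms:
               subs   = {(A, B)}       for A <= B
               conj   = {(A1, A2, B)}  for A1 and A2 <= B
               ex_rhs = {(A, r, B)}    for A <= exists r.B
               ex_lhs = {(r, A, B)}    for exists r.A <= B
            """
            S = {A: {A, "Top"} for A in concepts}   # subsumers of each concept
            R = set()                               # (r, A, B): entailed A <= exists r.B
            changed = True
            while changed:                          # apply rules to a fixpoint
                changed = False
                for A in concepts:
                    new = {B for (X, B) in subs if X in S[A]}
                    new |= {B for (X, Y, B) in conj if X in S[A] and Y in S[A]}
                    for (r, X, B) in list(R):       # propagate through role edges
                        if X == A:
                            new |= {C for (rr, Y, C) in ex_lhs
                                    if rr == r and Y in S[B]}
                    if not new <= S[A]:
                        S[A] |= new
                        changed = True
                    for (X, r, B) in ex_rhs:        # derive new role edges
                        if X in S[A] and (r, A, B) not in R:
                            R.add((r, A, B))
                            changed = True
            return S

        # Toy ontology: Pericarditis <= Inflammation; Inflammation <= exists locatedIn.Heart;
        # exists locatedIn.Heart <= HeartDisease  =>  Pericarditis <= HeartDisease
        concepts = {"Pericarditis", "Inflammation", "Heart", "HeartDisease"}
        S = classify(concepts,
                     subs={("Pericarditis", "Inflammation")},
                     conj=set(),
                     ex_rhs={("Inflammation", "locatedIn", "Heart")},
                     ex_lhs={("locatedIn", "Heart", "HeartDisease")})
        print("HeartDisease" in S["Pericarditis"])   # True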