Graph prefetching using data structure knowledge
Searches on large graphs are heavily memory latency bound,
as a result of many high latency DRAM accesses. Due to
the highly irregular nature of the access patterns involved,
caches and prefetchers, both hardware and software, perform
poorly on graph workloads. This leads to CPU stalling for
the majority of the time. However, in many cases the data
access pattern is well defined and predictable in advance,
with many traversals falling into a small set of simple patterns. Although
existing implicit prefetchers cannot bring significant benefit,
a prefetcher armed with knowledge of the data structures
and access patterns could accurately anticipate applications'
traversals to bring in the appropriate data.
This paper presents a design of an explicitly configured
prefetcher to improve performance for breadth-first searches
and sequential iteration on the efficient and commonly-used
compressed sparse row graph format. By snooping L1 cache
accesses from the core and reacting to data returned from its
own prefetches, the prefetcher can schedule timely loads of
data in advance of the application needing it. For a range of
applications and graph sizes, our prefetcher achieves average
speedups of 2.3x, and up to 3.3x, with little impact on
memory bandwidth requirements.
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd. This is the author accepted manuscript. The final version is available from ACM at http://dx.doi.org/10.1145/2925426.2926254
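The access chain the prefetcher exploits can be seen in a minimal sketch of BFS over the compressed sparse row (CSR) format (a generic illustration, not the paper's hardware design): each step chains indirect loads through the work list, the row-offset array, the edge array, and the visited flags, which is exactly the well-defined pattern a structure-aware prefetcher can follow.

```python
from collections import deque

def bfs_csr(row_offsets, col_indices, source):
    """BFS over a graph stored in CSR form.

    row_offsets[v]..row_offsets[v+1] delimits vertex v's edges in
    col_indices; the indirect loads (offsets -> edges -> visited)
    form the predictable chain a configured prefetcher can track.
    """
    n = len(row_offsets) - 1
    visited = [False] * n
    visited[source] = True
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for e in range(row_offsets[v], row_offsets[v + 1]):
            w = col_indices[e]
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order
```

For a diamond graph (0→1, 0→2, 1→3, 2→3) with `row_offsets=[0,2,3,4,4]` and `col_indices=[1,2,3,3]`, `bfs_csr(..., 0)` visits vertices in the order `[0, 1, 2, 3]`.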
A pattern language for parallelizing irregular algorithms
Dissertation presented at the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Mestre em Engenharia Informática. In irregular algorithms, a data set's dependences and distributions cannot be statically predicted.
This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control in multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature.
This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms.
Patterns capture formal representations of a tangible solution to a problem that arises in a well defined context within a specific domain.
We document these patterns in a pattern language, i.e., a set of inter-dependent patterns that compose well-documented template solutions which can be reused whenever a certain problem arises in a well-known context.
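The dynamic nature described above is often captured by a worklist pattern, which frameworks such as Galois build on. The following sketch (an illustration, not Galois code) shows why such algorithms resist static parallelization: applying the operator to one active element may activate new elements, so the available work only becomes known at runtime.

```python
def worklist_run(initial_items, operator):
    """Generic worklist pattern for irregular algorithms: the
    operator applied to an active element may return newly
    activated elements, so the work set evolves dynamically."""
    worklist = list(initial_items)
    processed = 0
    while worklist:
        item = worklist.pop()
        worklist.extend(operator(item))  # operator may create new work
        processed += 1
    return processed

# Toy operator standing in for data-dependent work generation:
# each item n activates item n-1 until reaching zero.
countdown = lambda n: [n - 1] if n > 0 else []
```

Running `worklist_run([3], countdown)` processes 3, 2, 1, 0 — four applications of the operator, a number that could not be known before execution in a genuinely data-dependent algorithm.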
Intelligent cloud-based digital imaging medical system solution
This research started with a simple fact: the global needs in medical care, and in medical imaging specifically, are increasing. This is mainly due to a population that is getting older and hence more likely to be exposed to diseases, while wishing to keep a high quality of life. To cope with these challenges, many systems, innovations and programs have been created and developed. Among them is the Picture Archiving and Communication System, or PACS. Although this filmless system has shown a great deal of advantages when onsite - such as the capability to access medical data at different locations - these benefits seem to be outbalanced by the high initial costs, potential risk of data loss and the complexity of data sharing. Therefore, the aim of this research is to suggest a potential betterment of the onsite medical system by introducing cloud and Computer Aided Diagnosis (CAD) aspects to it. Lausanne Hospital has been used as a benchmark to evaluate the proposed solution in terms of cost efficiency, diagnosis accuracy, users' productivity, medical data sharing opportunities, data accessibility, procedure when upgrading systems, reporting process, workflow performed for handling technical issues, and teleradiology benefits. Investigating the potential impact of merging Cloud, PACS and CAD as one intelligent cloud-based digital imaging medical system solution has resulted in the following discovery: the proposed medical technology appears to be more profitable for its potential users than the current option. In point of fact, the proposed solution minimises initial costs, as a result of offsite hosting. Moreover, the suggested system eases offsite medical data viewing and sharing, which strengthens opportunities for teleradiology and collaboration between medical experts.
This system also allows its potential users to centre their focus on their core area of expertise, as the system provider becomes the sole manager responsible for the software. Regarding the integration of CAD, the analysis has shown that utilising this software presumably adds greater value to the cloud-based medical system, as CAD engenders higher efficiency and productivity during diagnosis and reporting processes.
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
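A first-cut version of the compute-bound vs. memory-bound distinction underlying such characterizations can be sketched roofline-style (a generic illustration, not DAMOV's actual multi-metric methodology): a kernel whose operational intensity falls below the machine's balance point is capped by memory bandwidth rather than compute throughput.

```python
def classify_bottleneck(flops, dram_bytes, peak_flops_per_s, peak_bw_bytes_per_s):
    """Roofline-style first cut at spotting data-movement-bound code.

    Operational intensity = useful flops per byte moved to/from DRAM.
    Machine balance       = peak compute rate / peak memory bandwidth.
    Below the balance point, memory bandwidth is the performance cap.
    """
    intensity = flops / dram_bytes
    machine_balance = peak_flops_per_s / peak_bw_bytes_per_s
    return "memory-bound" if intensity < machine_balance else "compute-bound"
```

For a hypothetical machine with 1 TFLOP/s peak compute and 100 GB/s DRAM bandwidth (balance point 10 flops/byte), a streaming kernel doing 1 flop per 8 bytes is memory-bound, while a dense kernel doing 1000 flops/byte is compute-bound.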
Doctor of Philosophy dissertation.
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which require runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms.
The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts.
Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods.
Vectorization widths can have a significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing.
It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
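The core idea of in-place dynamic updates via controlled memory segmentation can be sketched as follows (a simplified illustration, not the dissertation's exact DCSR design): reserving slack space per row segment lets new nonzeros be appended in place, avoiding the full rebuild that a classic packed CSR layout would require.

```python
class PaddedCSR:
    """Sparse rows stored in fixed-size segments with slack space,
    so nonzeros can be inserted in place; a packed CSR layout would
    need its offset and value arrays rebuilt on every insertion."""

    def __init__(self, n_rows, slots_per_row):
        self.slots = slots_per_row
        self.cols = [0] * (n_rows * slots_per_row)
        self.vals = [0.0] * (n_rows * slots_per_row)
        self.row_len = [0] * n_rows       # nonzeros used in each row

    def insert(self, row, col, val):
        k = self.row_len[row]
        if k == self.slots:
            raise MemoryError("row segment full; would need realloc/defrag")
        base = row * self.slots
        self.cols[base + k] = col
        self.vals[base + k] = val
        self.row_len[row] = k + 1

    def row(self, r):
        base = r * self.slots
        k = self.row_len[r]
        return list(zip(self.cols[base:base + k], self.vals[base:base + k]))
```

The trade-off is the classic one: slack space costs memory and leaves rows unsorted, but each streaming update touches only its own row segment, which is what makes concurrent in-place updates practical.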
Static timing analysis tool validation in the presence of timing anomalies
The validation of the timing behavior of a safety-critical embedded software system requires both safe and precise worst-case execution time bounds for the tasks of that system. Such bounds need to be safe to ensure that each component of the software system performs its job in time. Furthermore, the execution time bounds are required to be precise to ensure the (provable) schedulability of the software system. When trying to achieve both safe and precise bounds, timing anomalies are one of the greatest challenges to overcome. Almost every modern hardware architecture shows timing anomalies, which also greatly impacts the analyzability of such architectures with respect to timing.
Intuitively speaking, a timing anomaly is a counterintuitive behavior of a hardware architecture, where a good event (e.g., a cache hit) leads to an overall longer execution, whereas the corresponding bad event (in this case, a cache miss) leads to a globally shorter execution time. In the presence of such anomalies, the local worst case is not always a safe assumption in static timing analysis. To compute safe timing guarantees, any (static) timing analysis has to consider all possible executions.
In this thesis we investigate the sources of timing anomalies in modern architectures and study instances of timing anomalies found in rather simple hardware architectures. Furthermore, we discuss the impact of timing anomalies on static timing analysis. Finally, we provide means to validate the results of static timing analysis for such architectures through trace validation.
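The counterintuitive effect can be reproduced with a toy in-order machine model (a hypothetical two-instruction example, not taken from the thesis): one shared execution unit served in readiness order, where instruction A waits on a load and instruction B feeds a long dependent chain.

```python
def makespan(load_latency):
    """Toy timing-anomaly model: a cache hit makes A ready early,
    so A grabs the shared unit first and delays B, whose dependent
    chain is the real critical path; a cache miss lets B go first."""
    A_READY = load_latency                 # hit -> 1 cycle, miss -> 10
    B_READY, A_EXEC, B_EXEC, B_CHAIN = 2, 10, 3, 20
    if A_READY <= B_READY:                 # A seizes the unit first
        a_end = A_READY + A_EXEC
        b_end = max(B_READY, a_end) + B_EXEC
    else:                                  # B slips in while A waits on memory
        b_end = B_READY + B_EXEC
        a_end = max(A_READY, b_end) + A_EXEC
    return max(a_end, b_end + B_CHAIN)     # B's chain dominates total time
```

Here `makespan(1)` (cache hit) gives 34 cycles while `makespan(10)` (cache miss) gives 25: the locally good event produces the globally longer execution, which is exactly why a static analysis cannot simply assume the local worst case.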
Elixir: synthesis of parallel irregular algorithms
Algorithms in new application areas like machine learning and data analytics usually operate on unstructured sparse graphs. Writing efficient parallel code to implement these algorithms is very challenging for a number of reasons.
First, there may be many algorithms to solve a problem and each algorithm may have many implementations. Second, synchronization, which is necessary for correct parallel execution, introduces potential problems such as data-races and deadlocks. These issues interact in subtle ways, making the best solution dependent both on the parallel platform and on properties of the input graph. Consequently, implementing and selecting the best parallel solution can be a daunting task for non-experts, since we have few performance models for predicting the performance of parallel sparse graph programs on parallel hardware.
This dissertation presents a synthesis methodology and a system, Elixir, that addresses these problems by (i) allowing programmers to specify solutions at a high level of abstraction, and (ii) generating many parallel implementations automatically and using search to find the best one. An Elixir specification consists of a set of operators capturing the main algorithm logic and a schedule specifying how to efficiently apply the operators. Elixir employs sophisticated automated reasoning to merge these two components, and uses techniques based on automated planning to insert synchronization and synthesize efficient parallel code.
Experimental evaluation of our approach demonstrates that the performance of the Elixir-generated code is competitive with, and can even outperform, hand-optimized code written by expert programmers for many interesting graph benchmarks.
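The operator/schedule separation at the heart of this approach can be illustrated with single-source shortest paths (a sketch in the spirit of Elixir, not its actual specification syntax): the relax operator fixes *what* is computed, while the schedule fixes only the *order* in which active nodes are processed, leaving the final result unchanged.

```python
import heapq
from collections import deque

def sssp(graph, source, schedule="priority"):
    """SSSP with one relax operator and two interchangeable schedules:
    a priority queue (Dijkstra-like order) or a FIFO worklist
    (Bellman-Ford-like order). Both converge to the same distances."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    if schedule == "priority":
        work = [(0, source)]
        pop = lambda: heapq.heappop(work)[1]
        push = lambda v: heapq.heappush(work, (dist[v], v))
    else:  # FIFO
        work = deque([source])
        pop = work.popleft
        push = work.append
    while work:
        u = pop()
        for v, w in graph[u]:              # relax operator
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                push(v)
    return dist
```

The two schedules differ only in how much redundant work they perform, which is precisely the kind of implementation-space dimension a synthesis system can search over automatically.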