Graph prefetching using data structure knowledge
Searches on large graphs are heavily memory latency bound,
as a result of many high latency DRAM accesses. Due to
the highly irregular nature of the access patterns involved,
caches and prefetchers, both hardware and software, perform
poorly on graph workloads. This leads to CPU stalling for
the majority of the time. However, in many cases the data
access pattern is well defined and predictable in advance,
with many traversals falling into a small set of simple patterns. Although
existing implicit prefetchers cannot bring significant benefit,
a prefetcher armed with knowledge of the data structures
and access patterns could accurately anticipate applications'
traversals to bring in the appropriate data.
This paper presents a design of an explicitly configured
prefetcher to improve performance for breadth-first searches
and sequential iteration on the efficient and commonly-used
compressed sparse row graph format. By snooping L1 cache
accesses from the core and reacting to data returned from its
own prefetches, the prefetcher can schedule timely loads of
data in advance of the application needing it. For a range of
applications and graph sizes, our prefetcher achieves average
speedups of 2.3x, and up to 3.3x, with little impact on
memory bandwidth requirements.
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd. This is the author accepted manuscript. The final version is available from ACM at http://dx.doi.org/10.1145/2925426.2926254
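The access chain the prefetcher exploits can be seen in a minimal sketch of BFS over the compressed sparse row (CSR) format (a generic illustration, not the paper's hardware design): each step chains indirect loads through the work list, the row-offset array, the edge array, and the visited flags, which is exactly the well-defined pattern a structure-aware prefetcher can follow.

```python
from collections import deque

def bfs_csr(row_offsets, col_indices, source):
    """BFS over a graph stored in CSR form.

    row_offsets[v]..row_offsets[v+1] delimits vertex v's edges in
    col_indices; the indirect loads (offsets -> edges -> visited)
    form the predictable chain a configured prefetcher can track.
    """
    n = len(row_offsets) - 1
    visited = [False] * n
    visited[source] = True
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for e in range(row_offsets[v], row_offsets[v + 1]):
            w = col_indices[e]
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order
```

For a diamond graph (0→1, 0→2, 1→3, 2→3) with `row_offsets=[0,2,3,4,4]` and `col_indices=[1,2,3,3]`, `bfs_csr(..., 0)` visits vertices in the order `[0, 1, 2, 3]`.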
A pattern language for parallelizing irregular algorithms
Dissertation presented at the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Mestre em Engenharia Informática. In irregular algorithms, a data set's dependences and distributions cannot be statically predicted.
This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control in multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature.
This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms.
Patterns capture formal representations of a tangible solution to a problem that arises in a well defined context within a specific domain.
We document these patterns in a pattern language, i.e., a set of inter-dependent patterns that compose well-documented template solutions which can be reused whenever a certain problem arises in a well-known context.
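The dynamic nature described above is often captured by a worklist pattern, which frameworks such as Galois build on. The following sketch (an illustration, not Galois code) shows why such algorithms resist static parallelization: applying the operator to one active element may activate new elements, so the available work only becomes known at runtime.

```python
def worklist_run(initial_items, operator):
    """Generic worklist pattern for irregular algorithms: the
    operator applied to an active element may return newly
    activated elements, so the work set evolves dynamically."""
    worklist = list(initial_items)
    processed = 0
    while worklist:
        item = worklist.pop()
        worklist.extend(operator(item))  # operator may create new work
        processed += 1
    return processed

# Toy operator standing in for data-dependent work generation:
# each item n activates item n-1 until reaching zero.
countdown = lambda n: [n - 1] if n > 0 else []
```

Running `worklist_run([3], countdown)` processes 3, 2, 1, 0 — four applications of the operator, a number that could not be known before execution in a genuinely data-dependent algorithm.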
Intelligent cloud-based digital imaging medical system solution
This research started with a simple fact: the global needs in medical care, and in medical imaging specifically, are increasing. This is mainly due to a population that is getting older and hence more likely to be exposed to diseases, while wishing to keep a high quality of life. To cope with these challenges, many systems, innovations and programs have been created and developed. Among them is the Picture Archiving and Communication System, or PACS. Although this filmless system has shown a great deal of advantages when onsite - such as the capability to access medical data at different locations - these benefits seem to be outbalanced by the high initial costs, potential risk of data loss and the complexity of data sharing. Therefore, the aim of this research is to suggest a potential betterment of the onsite medical system by introducing cloud and Computer Aided Diagnosis (CAD) aspects to it. Lausanne Hospital has been used as a benchmark to evaluate the proposed solution in terms of cost efficiency, diagnosis accuracy, users' productivity, medical data sharing opportunities, data accessibility, procedure when upgrading systems, reporting process, workflow performed for handling technical issues, and teleradiology benefits. Investigating the potential impact of merging Cloud, PACS and CAD as one intelligent cloud-based digital imaging medical system solution has resulted in the following discovery: the proposed medical technology appears to be more profitable for its potential users than the current option. In point of fact, the proposed solution minimises initial costs, as a result of offsite hosting. Moreover, the suggested system eases offsite medical data viewing and sharing, which strengthens opportunities for teleradiology and collaboration between medical experts.
This system also allows its potential users to centre their focus on their core area of expertise, as the system provider becomes the sole manager responsible for the software. Regarding the integration of CAD, the analysis has shown that utilising this software presumably adds greater value to the cloud-based medical system, as CAD engenders higher efficiency and productivity during diagnosis and reporting processes.
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
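A first-cut version of the compute-bound vs. memory-bound distinction underlying such characterizations can be sketched roofline-style (a generic illustration, not DAMOV's actual multi-metric methodology): a kernel whose operational intensity falls below the machine's balance point is capped by memory bandwidth rather than compute throughput.

```python
def classify_bottleneck(flops, dram_bytes, peak_flops_per_s, peak_bw_bytes_per_s):
    """Roofline-style first cut at spotting data-movement-bound code.

    Operational intensity = useful flops per byte moved to/from DRAM.
    Machine balance       = peak compute rate / peak memory bandwidth.
    Below the balance point, memory bandwidth is the performance cap.
    """
    intensity = flops / dram_bytes
    machine_balance = peak_flops_per_s / peak_bw_bytes_per_s
    return "memory-bound" if intensity < machine_balance else "compute-bound"
```

For a hypothetical machine with 1 TFLOP/s peak compute and 100 GB/s DRAM bandwidth (balance point 10 flops/byte), a streaming kernel doing 1 flop per 8 bytes is memory-bound, while a dense kernel doing 1000 flops/byte is compute-bound.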
Doctor of Philosophy dissertation.
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which require runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms.
The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts.
Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods.
Vectorization widths can have a significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing.
It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
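The core idea of in-place dynamic updates via controlled memory segmentation can be sketched as follows (a simplified illustration, not the dissertation's exact DCSR design): reserving slack space per row segment lets new nonzeros be appended in place, avoiding the full rebuild that a classic packed CSR layout would require.

```python
class PaddedCSR:
    """Sparse rows stored in fixed-size segments with slack space,
    so nonzeros can be inserted in place; a packed CSR layout would
    need its offset and value arrays rebuilt on every insertion."""

    def __init__(self, n_rows, slots_per_row):
        self.slots = slots_per_row
        self.cols = [0] * (n_rows * slots_per_row)
        self.vals = [0.0] * (n_rows * slots_per_row)
        self.row_len = [0] * n_rows       # nonzeros used in each row

    def insert(self, row, col, val):
        k = self.row_len[row]
        if k == self.slots:
            raise MemoryError("row segment full; would need realloc/defrag")
        base = row * self.slots
        self.cols[base + k] = col
        self.vals[base + k] = val
        self.row_len[row] = k + 1

    def row(self, r):
        base = r * self.slots
        k = self.row_len[r]
        return list(zip(self.cols[base:base + k], self.vals[base:base + k]))
```

The trade-off is the classic one: slack space costs memory and leaves rows unsorted, but each streaming update touches only its own row segment, which is what makes concurrent in-place updates practical.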
Static timing analysis tool validation in the presence of timing anomalies
The validation of the timing behavior of a safety-critical embedded software system requires both safe and precise worst-case execution time bounds for the tasks of that system. Such bounds need to be safe to ensure that each component of the software system performs its job in time. Furthermore, the execution time bounds are required to be precise to ensure the (provable) schedulability of the software system. When trying to achieve both safe and precise bounds, timing anomalies are one of the greatest challenges to overcome. Almost every modern hardware architecture shows timing anomalies, which also greatly impacts the analyzability of such architectures with respect to timing.
Intuitively speaking, a timing anomaly is a counterintuitive behavior of a hardware architecture, where a good event (e.g., a cache hit) leads to an overall longer execution, whereas the corresponding bad event (in this case, a cache miss) leads to a globally shorter execution time. In the presence of such anomalies, the local worst case is not always a safe assumption in static timing analysis. To compute safe timing guarantees, any (static) timing analysis has to consider all possible executions.
In this thesis we investigate the sources of timing anomalies in modern architectures and study instances of timing anomalies found in rather simple hardware architectures. Furthermore, we discuss the impact of timing anomalies on static timing analysis. Finally, we provide means to validate the results of static timing analysis for such architectures through trace validation.
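The counterintuitive effect can be reproduced with a toy in-order machine model (a hypothetical two-instruction example, not taken from the thesis): one shared execution unit served in readiness order, where instruction A waits on a load and instruction B feeds a long dependent chain.

```python
def makespan(load_latency):
    """Toy timing-anomaly model: a cache hit makes A ready early,
    so A grabs the shared unit first and delays B, whose dependent
    chain is the real critical path; a cache miss lets B go first."""
    A_READY = load_latency                 # hit -> 1 cycle, miss -> 10
    B_READY, A_EXEC, B_EXEC, B_CHAIN = 2, 10, 3, 20
    if A_READY <= B_READY:                 # A seizes the unit first
        a_end = A_READY + A_EXEC
        b_end = max(B_READY, a_end) + B_EXEC
    else:                                  # B slips in while A waits on memory
        b_end = B_READY + B_EXEC
        a_end = max(A_READY, b_end) + A_EXEC
    return max(a_end, b_end + B_CHAIN)     # B's chain dominates total time
```

Here `makespan(1)` (cache hit) gives 34 cycles while `makespan(10)` (cache miss) gives 25: the locally good event produces the globally longer execution, which is exactly why a static analysis cannot simply assume the local worst case.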
Elixir: synthesis of parallel irregular algorithms
Algorithms in new application areas like machine learning and data analytics usually operate on unstructured sparse graphs. Writing efficient parallel code to implement these algorithms is very challenging for a number of reasons.
First, there may be many algorithms to solve a problem and each algorithm may have many implementations. Second, synchronization, which is necessary for correct parallel execution, introduces potential problems such as data-races and deadlocks. These issues interact in subtle ways, making the best solution dependent both on the parallel platform and on properties of the input graph. Consequently, implementing and selecting the best parallel solution can be a daunting task for non-experts, since we have few performance models for predicting the performance of parallel sparse graph programs on parallel hardware.
This dissertation presents a synthesis methodology and a system, Elixir, that addresses these problems by (i) allowing programmers to specify solutions at a high level of abstraction, and (ii) generating many parallel implementations automatically and using search to find the best one. An Elixir specification consists of a set of operators capturing the main algorithm logic and a schedule specifying how to efficiently apply the operators. Elixir employs sophisticated automated reasoning to merge these two components, and uses techniques based on automated planning to insert synchronization and synthesize efficient parallel code.
Experimental evaluation of our approach demonstrates that the performance of the Elixir-generated code is competitive with, and can even outperform, hand-optimized code written by expert programmers for many interesting graph benchmarks.
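The operator/schedule separation at the heart of this approach can be illustrated with single-source shortest paths (a sketch in the spirit of Elixir, not its actual specification syntax): the relax operator fixes *what* is computed, while the schedule fixes only the *order* in which active nodes are processed, leaving the final result unchanged.

```python
import heapq
from collections import deque

def sssp(graph, source, schedule="priority"):
    """SSSP with one relax operator and two interchangeable schedules:
    a priority queue (Dijkstra-like order) or a FIFO worklist
    (Bellman-Ford-like order). Both converge to the same distances."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    if schedule == "priority":
        work = [(0, source)]
        pop = lambda: heapq.heappop(work)[1]
        push = lambda v: heapq.heappush(work, (dist[v], v))
    else:  # FIFO
        work = deque([source])
        pop = work.popleft
        push = work.append
    while work:
        u = pop()
        for v, w in graph[u]:              # relax operator
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                push(v)
    return dist
```

The two schedules differ only in how much redundant work they perform, which is precisely the kind of implementation-space dimension a synthesis system can search over automatically.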