
    A pattern language for parallelizing irregular algorithms

    Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa for the degree of Mestre em Engenharia Informática. In irregular algorithms, data dependences and distributions cannot be statically predicted. This class of algorithms tends to organize computations around data locality rather than parallelizing control flow across multiple threads. Opportunities for exploiting parallelism therefore vary dynamically, according to how the algorithm changes its data dependences, and effective parallelization of such algorithms requires new approaches that account for this dynamic nature. This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by extracting, analyzing, and documenting the patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms. A pattern is a formal representation of a tangible solution to a problem that arises in a well-defined context within a specific domain. We document these patterns as a pattern language, i.e., a set of interdependent patterns that compose well-documented template solutions that can be reused whenever a given problem arises in a known context.
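
    The worklist-driven execution model that these patterns document is easier to see in code than in prose. Below is a minimal sequential sketch (Python, with illustrative names; it is not code from the dissertation or from Galois) of the core trait the abstract describes: each operator application may activate new elements, so the available parallelism is only discoverable at run time.

```python
from collections import deque

def worklist_run(graph, sources):
    """Relax edges from a dynamically growing worklist.

    graph maps each node to a list of (neighbor, weight) pairs.
    Processing one active node may activate others, so the shape of
    the computation depends on the data, not on static control flow.
    """
    dist = {v: float("inf") for v in graph}
    work = deque(sources)
    for s in sources:
        dist[s] = 0
    while work:                       # the worklist shrinks and grows as we go
        u = work.popleft()
        for v, w in graph[u]:         # the "operator": relax outgoing edges
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                work.append(v)        # newly activated element
    return dist

# Tiny example: shortest distances from "a" in a weighted digraph.
g = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(worklist_run(g, ["a"]))         # {'a': 0, 'b': 1, 'c': 2}
```

    Galois parallelizes exactly this kind of loop by letting many threads pull from the worklist and detecting conflicting activations at run time, which is why the dissertation extracts its patterns from that framework.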

    Intelligent cloud-based digital imaging medical system solution

    This research started from a simple fact: global needs in medical care, and in medical imaging specifically, are increasing. This is mainly due to a population that is getting older and hence more likely to be exposed to disease, yet wishes to keep a high quality of life. To cope with these challenges, many systems, innovations, and programs have been created and developed, among them the Picture Archiving and Communication System (PACS). Although this filmless system has shown a great deal of advantages when deployed onsite, such as the capability to access medical data at different locations, these benefits seem to be outweighed by high initial costs, the potential risk of data loss, and the complexity of data sharing. The aim of this research is therefore to suggest a potential improvement of the onsite medical system by introducing cloud and Computer-Aided Diagnosis (CAD) aspects to it. Lausanne Hospital has been used as a benchmark to evaluate the proposed solution in terms of cost efficiency, diagnosis accuracy, users' productivity, medical data sharing opportunities, data accessibility, system upgrade procedures, the reporting process, the workflow for handling technical issues, and teleradiology benefits. Investigating the potential impact of merging Cloud, PACS, and CAD into one intelligent cloud-based digital imaging medical system led to the following finding: the proposed medical technology appears to be more profitable for its potential users than the current option. The proposed solution minimises initial costs as a result of offsite hosting. Moreover, the suggested system eases offsite medical data viewing and sharing, which strengthens opportunities for teleradiology and collaboration between medical experts. The system also allows its users to focus on their core area of expertise, as the system provider becomes the sole manager responsible for the software. Regarding the integration of CAD, the analysis has shown that this software presumably adds value to the cloud-based medical system, as CAD engenders higher efficiency and productivity during the diagnosis and reporting processes.

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce data movement overheads, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best technique to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
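
    The methodology's key step, bucketing functions by the memory-related metrics that drive their bottlenecks, can be sketched compactly. The classifier below is a toy illustration in that spirit: the metric names (arithmetic intensity in FLOP/byte, last-level-cache misses per kilo-instruction) are standard measures, but the thresholds, classes, and profile values are placeholder assumptions, not DAMOV's actual decision procedure or data.

```python
# Toy bottleneck classifier in the spirit of DAMOV's methodology.
# Thresholds are illustrative placeholders, not values from the paper.

def classify(arith_intensity: float, llc_mpki: float) -> str:
    """arith_intensity: FLOP per byte of DRAM traffic;
    llc_mpki: last-level-cache misses per kilo-instruction."""
    if arith_intensity >= 10.0:
        return "compute-bound: data movement is not the main bottleneck"
    if llc_mpki >= 10.0:
        return "DRAM-bound: a candidate for near-data processing (NDP)"
    if llc_mpki >= 1.0:
        return "cache-sensitive: prefetching or larger caches may help"
    return "balanced or latency-bound: needs finer-grained analysis"

# Hypothetical per-function profiles: (arithmetic intensity, LLC MPKI).
profiles = {
    "spmv_kernel": (0.25, 35.0),   # streaming access, little reuse
    "dense_gemm":  (40.0,  0.3),   # heavy reuse, caches work well
    "hash_join":   ( 0.8,  4.2),   # irregular pointer chasing
}
for name, (ai, mpki) in profiles.items():
    print(f"{name:12s} -> {classify(ai, mpki)}")
```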

    Doctor of Philosophy

    Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory transaction; noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which require runtime information to be evaluated; compile-time methods for evaluating parallelism, such as static dependence graphs, cannot handle them. The goals of this dissertation are to study irregularities in the context of unstructured mesh and sparse matrix problems, to analyze the impact of vectorization widths on irregularities, and to present data-centric methods that improve control flow and memory access regularity in those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching; this novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have a significant impact on the performance of vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation, and significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D discontinuous Galerkin (dG) postprocessing. It is difficult to perform dynamic updates efficiently on traditional sparse matrix formats; explicitly controlling memory segmentation, however, allows for in-place dynamic updates, and updating the matrix dynamically without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows dynamic streaming updates to a sparse matrix, along with a new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates.
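
    The core idea behind DCSR, leaving explicitly managed slack in each row's storage so that updates land in place, can be illustrated with a deliberately simplified single-segment version. The class below is a sketch under that simplification (the dissertation's DCSR manages multiple segments per row and parallel streaming updates; names and capacities here are illustrative):

```python
import numpy as np

class DynamicCSRSketch:
    """Each row owns a fixed-capacity segment with slack, so new
    nonzeros are appended in place instead of rebuilding the matrix."""

    def __init__(self, n_rows: int, capacity_per_row: int = 8):
        self.cap = capacity_per_row
        self.cols = np.zeros(n_rows * self.cap, dtype=np.int64)
        self.vals = np.zeros(n_rows * self.cap, dtype=np.float64)
        self.row_len = np.zeros(n_rows, dtype=np.int64)

    def add(self, r: int, c: int, v: float) -> None:
        """In-place streaming update: no sorting, no global rebuild."""
        k = self.row_len[r]
        if k == self.cap:
            # The real format chains an extra segment here; the sketch stops.
            raise MemoryError("row segment full")
        self.cols[r * self.cap + k] = c
        self.vals[r * self.cap + k] = v
        self.row_len[r] = k + 1

    def row(self, r: int):
        s, k = r * self.cap, self.row_len[r]
        return self.cols[s:s + k], self.vals[s:s + k]

A = DynamicCSRSketch(3)
A.add(0, 2, 1.5); A.add(2, 0, -4.0); A.add(0, 1, 2.0)   # streaming updates
print(A.row(0))   # (array([2, 1]), array([1.5, 2. ]))
```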

    Static timing analysis tool validation in the presence of timing anomalies

    The validation of the timing behavior of a safety-critical embedded software system requires both safe and precise worst-case execution time bounds for the tasks of that system. Such bounds need to be safe to ensure that each component of the software system performs its job in time, and they need to be precise to ensure the (provable) schedulability of the software system. When trying to achieve both safe and precise bounds, timing anomalies are one of the greatest challenges to overcome. Almost every modern hardware architecture exhibits timing anomalies, which greatly impairs the timing analyzability of such architectures. Intuitively speaking, a timing anomaly is a counterintuitive behavior of a hardware architecture in which a locally good event (e.g., a cache hit) leads to an overall longer execution time, whereas the corresponding bad event (in this case, a cache miss) leads to a globally shorter one. In the presence of such anomalies, the local worst case is not always a safe assumption in static timing analysis; to compute safe timing guarantees, a (static) timing analysis has to consider all possible executions. In this thesis we investigate the sources of timing anomalies in modern architectures, study instances of timing anomalies found in rather simple hardware architectures, and discuss the impact of timing anomalies on static timing analysis. Finally, we provide means to validate the results of static timing analysis for such architectures through trace validation.
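
    The counterintuitive behavior described above can be reproduced with a tiny scheduling model. The script below is a constructed illustration, not an instance from the thesis: two functional units with greedy in-order dispatch, where a faster memory access (a cache hit) lets a short dependent instruction grab the ALU first, pushing a long independent instruction, and the memory operation that depends on it, later, so the hit case finishes after the miss case.

```python
# A constructed scheduling timing anomaly: the cache hit yields the
# LONGER schedule. Machine model and task set are illustrative only.

def list_schedule(tasks, units=("ALU", "MEM"), horizon=50):
    """Greedy list scheduling: whenever a unit is idle, it starts the
    first not-yet-started task in program order that targets it and
    whose dependences have all completed."""
    start, finish = {}, {}
    busy_until = {u: 0 for u in units}
    for t in range(horizon):
        for name, unit, dur, deps in tasks:
            if name in start:
                continue
            if busy_until[unit] <= t and all(finish.get(d, horizon) <= t
                                             for d in deps):
                start[name], finish[name] = t, t + dur
                busy_until[unit] = t + dur
        if len(start) == len(tasks):
            break
    return max(finish.values())

def program(mem_latency):
    # (name, unit, duration, dependences), listed in program order
    return [
        ("E", "ALU", 1, []),            # short ALU op occupying the unit
        ("A", "MEM", mem_latency, []),  # the access: 1-cycle hit, 2-cycle miss
        ("C", "ALU", 1, ["A"]),         # short op waiting on the access
        ("B", "ALU", 3, []),            # long independent ALU op
        ("D", "MEM", 3, ["B"]),         # memory op on the critical path via B
    ]

print("cache hit :", list_schedule(program(1)), "cycles")  # prints 8
print("cache miss:", list_schedule(program(2)), "cycles")  # prints 7
```

    On a hit, C becomes ready just as E finishes and is dispatched ahead of B; B then delays D, and the program takes 8 cycles. On a miss, C is not yet ready, so B starts earlier and the program takes only 7 cycles, which is exactly why a static analysis cannot simply assume the local worst case.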