
    Castell: a heterogeneous CMP architecture scalable to hundreds of processors

    Technology improvements and power constraints have led multicore architectures to dominate microprocessor design over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but at the cost of high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds and schedules parallel tasks and adds hardware-specific features to them. One of these features is the use of DMA transfers to overlap computation with data movement, a technique known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system with a focus on memory bandwidth. In addition to providing programmability and designing the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module was added because application performance degrades sharply when synchronization latencies are large. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been used successfully to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and ClustalW). It has also been used to explore a number of architecture optimizations, such as enhanced DMA controllers and architecture support for task-based programming models.
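
    The double buffering mentioned in this abstract can be illustrated with a minimal software sketch. It is not Castell's runtime or hardware interface: the names `fetch`, `compute`, and `double_buffered_sum` are hypothetical, and a background thread merely stands in for a DMA engine. The point is only that the transfer of the next chunk is started before the current chunk is processed, so the two can overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(chunks, i):
    """Hypothetical stand-in for a DMA transfer: copy chunk i into a local buffer."""
    return list(chunks[i])


def compute(buf):
    """Hypothetical stand-in for a task's computation on a local buffer."""
    return sum(x * x for x in buf)


def double_buffered_sum(chunks):
    """Process chunks in order while the next chunk is fetched in the background."""
    if not chunks:
        return 0
    total = 0
    # One worker thread plays the role of the DMA engine (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, chunks, 0)  # prefetch the first chunk
        for i in range(len(chunks)):
            current = pending.result()  # wait until the current buffer is ready
            if i + 1 < len(chunks):
                # Start transferring the next chunk before computing on this one.
                pending = dma.submit(fetch, chunks, i + 1)
            total += compute(current)  # may overlap with the transfer started above
    return total


data = [[1, 2, 3], [4, 5], [6]]
result = double_buffered_sum(data)
```

    Only two buffers are live at any time (the one being computed on and the one being filled), which is why the scheme trades a small amount of local storage for tolerance of memory latency.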

    Trace-based Performance Analysis for Hardware Accelerators

    This thesis shows how performance data from hardware accelerators can be included in event logs, extending trace-based performance analysis so that it also monitors and records data from this new layer of parallelism. Growing awareness of the power consumption of computing devices has led to increased interest in hybrid computing architectures. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intensive, parallel tasks while retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so the scalar unit can resume other work while the hardware accelerator is busy. The performance analysis tools provided by hardware accelerator vendors cover the case of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach suitable for any API-based acceleration paradigm, the thesis derives a proposal for a generic performance API for hardware accelerators and describes its implementation with NVIDIA CUPTI. Next, it discusses the visualization of event logs containing data from execution streams on different levels of parallelism. To overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This approach uses program states to display similarities and differences between a potentially very large number of event streams and thus offers a fast way to spot load imbalances. The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the performance analysis methods developed here.

    Scalability of parallel video decoding on heterogeneous manycore architectures

    This paper presents an analysis of the scalability of parallel video decoding on heterogeneous manycore architectures. As a benchmark, we use a highly parallel H.264/AVC video decoder that generates a large number of independent tasks. In order to translate task-level parallelism into performance gains, both the video decoder and the architecture have been optimized. First, the video decoder was modified to exploit coarse-grain frame-level parallelism in the entropy decoding kernel, which has been considered the main bottleneck. Second, a heterogeneous combination of cores is evaluated for executing different types of tasks. Finally, the memory requirements of the whole system have been evaluated. Experiments conducted using a trace-driven simulation methodology show that the evaluated system exhibits good parallel scalability up to 68 cores. At this point the parallel video decoder is able to decode more than 200 HD frames per second using simple low-power processors.

    WAVOS: a MATLAB toolkit for wavelet analysis and visualization of oscillatory systems

    Background: Wavelets have proven to be a powerful technique for the analysis of periodic data, such as those that arise in the analysis of circadian oscillators. While many implementations of both continuous and discrete wavelet transforms are available, we are aware of no software that has been designed with the nontechnical end user in mind. By developing a toolkit that makes these analyses accessible to end users without significant programming experience, we hope to promote the more widespread use of wavelet analysis. Findings: We have developed the WAVOS toolkit for wavelet analysis and visualization of oscillatory systems. WAVOS features both the continuous (Morlet) and discrete (Daubechies) wavelet transforms, with a simple, user-friendly graphical user interface within MATLAB. The interface allows data to be imported from a number of standard file formats, visualized, processed and analyzed, and exported without use of the command line. Our work has been motivated by the challenges of circadian data, so default settings appropriate to the analysis of such data have been pre-selected to minimize the need for fine-tuning. The toolkit is nevertheless flexible enough to deal with a wide range of oscillatory signals and may be used in more general contexts. Conclusions: We have presented WAVOS, a comprehensive wavelet-based MATLAB toolkit that allows for easy visualization, exploration, and analysis of oscillatory data. WAVOS includes both the Morlet continuous wavelet transform and the Daubechies discrete wavelet transform. We have illustrated the use of WAVOS and demonstrated its utility for circadian analysis on both bioluminescence and wheel-running data. WAVOS is freely available at http://sourceforge.net/projects/wavos/files/
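
    To make the continuous (Morlet) transform named in this abstract concrete, here is a minimal NumPy sketch of how a dominant period can be read off an oscillatory signal. It is not WAVOS code (WAVOS is a MATLAB toolkit with a graphical interface); the function name `morlet_cwt`, the parameter `w0 = 6`, the amplitude (1/scale) normalization, and the synthetic 24-hour test signal are all assumptions chosen for illustration.

```python
import numpy as np


def morlet_cwt(signal, dt, periods, w0=6.0):
    """Return |CWT| of `signal`, shape (len(periods), len(signal)).

    Uses a simplified complex Morlet wavelet (no admissibility correction term,
    which is negligible for w0 = 6) and 1/scale normalization, so a pure
    sinusoid peaks at the row matching its own period.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()  # remove the constant offset
    n = len(signal)
    amplitude = np.empty((len(periods), n))
    for k, period in enumerate(periods):
        # Scale whose carrier oscillation has exactly this period.
        scale = w0 * period / (2.0 * np.pi)
        # Truncate the wavelet at about 4 scale lengths, never longer than the signal.
        half = int(min(4.0 * scale / dt, (n - 1) // 2))
        t = np.arange(-half, half + 1) * dt
        wavelet = np.exp(1j * w0 * t / scale) * np.exp(-0.5 * (t / scale) ** 2)
        # Correlating with the conjugate wavelet equals convolving with its reversal.
        kernel = np.conj(wavelet)[::-1]
        coeffs = np.convolve(signal, kernel, mode="same") * dt / scale
        amplitude[k] = np.abs(coeffs)
    return amplitude


# Synthetic circadian-like signal: hourly samples over 10 days, 24 h period.
t = np.arange(240.0)
signal = np.cos(2.0 * np.pi * t / 24.0)
periods = np.arange(16.0, 33.0)  # candidate periods in hours
amplitude = morlet_cwt(signal, dt=1.0, periods=periods)
# Read the dominant period at the middle of the record, away from edge effects.
dominant_period = periods[np.argmax(amplitude[:, len(signal) // 2])]
```

    Reading the estimate at the centre of the record avoids the edge region where the wavelet extends past the data; a real analysis would also handle detrending and noise, which this sketch leaves out.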

    Homologs of ancestral CNNM proteins affect magnesium homeostasis and circadian rhythmicity in a model eukaryotic cell

    Biological rhythms are ubiquitous across organisms and coordinate key cellular processes. Oscillations of Mg2+ levels in cells are now well established and, given the critical roles of Mg2+ in cell metabolism, they are potentially fundamental for the circadian control of cellular activity. The identity of the transport proteins responsible for sustaining Mg2+ levels in eukaryotic cells remains hotly debated, and several are restricted to specific groups of higher eukaryotes. Here, using the minimal model eukaryotic cell Ostreococcus tauri, we report two homologs, of common descent, of the Cyclin M (CNNM)/CorC protein family. Overexpression of these proteins leads to a reduction in the overall magnesium content of cells and a lengthening of the period of circadian gene expression rhythms. However, we observed a paradoxical increase in the magnesium content of the organelle fraction. Chemical inhibition of Mg2+ transport has a synergistic effect on circadian period lengthening upon overexpression of one CNNM homolog, but not the other. Finally, both homologs rescue the deleterious effect of low extracellular magnesium on cell proliferation rates. Overall, we identified two CNNM proteins that directly affect Mg2+ homeostasis and cellular rhythms.

    A Genome-Scale Resource for the Functional Characterization of Arabidopsis Transcription Factors

    Summary: Extensive transcriptional networks play major roles in cellular and organismal functions. Transcript levels are in part determined by the combinatorial and overlapping functions of multiple transcription factors (TFs) bound to gene promoters. Thus, TF-promoter interactions provide the basic molecular wiring of transcriptional regulatory networks. In plants, discovery of the functional roles of TFs is limited by the increased complexity of network circuitry that results from a significant expansion of TF families. Here, we present the construction of a comprehensive collection of Arabidopsis TF clones, created to provide a versatile resource for uncovering TF biological functions. We leveraged this collection by implementing a high-throughput DNA binding assay and identified direct regulators of a key clock gene (CCA1) that provide molecular links between different signaling modules and the circadian clock. The resources introduced in this work will contribute significantly to a better understanding of the transcriptional regulatory landscape of plant genomes.