307 research outputs found

    Aspects of Code Generation and Data Transfer Techniques for Modern Parallel Architectures

    In the field of processor architectures, the focus of new developments has shifted from ever higher clock frequencies to ever more cores on a chip. A high core count makes it possible to offer cores with differing performance characteristics, and even dedicated cores with special instruction sets. Developing for such heterogeneous platforms is challenging and requires corresponding support from development tools such as compilers. Besides their heterogeneous core structure, there is a second dimension that makes development for such architectures demanding: their memory structure. Maintaining global cache coherence makes it harder to reach high core counts. Hardware-based cache-coherence protocols either scale poorly, or are complicated and cause problems for execution time and energy efficiency. A radical solution to this problem is to abolish global cache coherence. However, it is difficult to map existing programming models efficiently onto such a hardware architecture with weak guarantees.
    The first part of this dissertation deals with data transfer techniques for non-cache-coherent shared-memory architectures. These architectures offer a shared physical address space but do not implement hardware-based coherence between all caches in the system. Logically partitioning the shared memory makes it possible to program such a platform safely. In general, this creates the need to copy data between memory partitions. We study compilation for invasive architectures, a family of non-cache-coherent many-core architectures. We consider the efficient implementation of data transfers of both simple and complex data structures on invasive architectures. In particular, we propose a novel technique for copying complex pointer-based data structures that requires no serialization. To this end, we generalize the object-cloning approach with compiler-directed automatic software-based coherence so that it also works in the context of non-coherent caches. We present implementations of several data transfer techniques within an existing compiler and its runtime system, and evaluate these implementations extensively on an FPGA-based prototype of an invasive architecture. Finally, we propose adding hardware support for range-based cache operations, and we describe and evaluate possible implementations and their costs.
    The second part of this dissertation deals with accelerating the shuffle code that arises during register allocation by using permutation instructions. The task of register allocation during compilation is to map program variables to machine registers. During register allocation, the compiler generates shuffle code, consisting of copy and swap instructions, to transfer values between registers. Depending on the quality of the register allocation and the number of available registers, a large amount of shuffle code can be generated. We propose to accelerate the execution of shuffle code with the help of novel permutation instructions that arbitrarily permute the contents of a small set of registers in a single clock cycle.
    To demonstrate the feasibility of this idea, we first extend an existing RISC instruction format with permutation instructions. We then describe how the proposed permutation instructions can be implemented in an existing RISC architecture. Next, we develop two code-generation approaches that exploit the permutation instructions to speed up shuffle code: a fast heuristic and an optimal approach based on dynamic programming. We prove quality and correctness properties of both approaches and show the optimality of the second. We then implement both code-generation approaches in a compiler and examine and compare their code quality in detail using standardized benchmarks. First we measure the exact number of dynamically executed instructions, which we subsequently validate by measuring program run times on an FPGA-based prototype implementation of the RISC architecture extended with permutation instructions. Finally, we argue that permutation instructions can be implemented with little effort on modern out-of-order processor architectures that already support register renaming.
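    The shuffle-code idea is easy to make concrete. The Python sketch below is purely illustrative (it is not the dissertation's code generator, and the cost model is an assumption): it decomposes a register permutation into cycles. Classic shuffle code spends one swap per cycle element minus one, while a hypothetical permutation instruction acting on up to k registers per clock cycle retires an n-cycle in ceil((n-1)/(k-1)) instructions, since any n-cycle factors into that many cycles of length at most k.

```python
def cycles(perm):
    """Cycle decomposition of a register permutation.
    perm[i] is the register whose value must end up in register i."""
    seen, out = set(), []
    for start in range(len(perm)):
        if start in seen or perm[start] == start:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = perm[i]
        out.append(cycle)
    return out

def shuffle_cost(perm, k=None):
    """Instruction count for realizing `perm`.
    k=None : classic shuffle code (one swap per cycle element minus one).
    k=int  : hypothetical k-register permutation instruction; an n-cycle
             needs ceil((n-1)/(k-1)) such instructions."""
    total = 0
    for c in cycles(perm):
        total += len(c) - 1 if k is None else -(-(len(c) - 1) // (k - 1))
    return total

# Values rotate within {r0, r1, r2} and swap within {r3, r4}:
perm = [1, 2, 0, 4, 3]
assert shuffle_cost(perm) == 3        # three swaps
assert shuffle_cost(perm, k=4) == 2   # two permutation instructions
```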

    Spatial software pipelining on distributed architectures for sparse matrix codes

    Thesis (M. Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 101-103). Wire delays and communication time are forcing processors to become decentralized modules communicating through a fast, scalable interconnect. For scalability, every portion of the processor must be decentralized, including the memory system. Compilers that can take a sequential program as input and parallelize it (including the memory) across the new processors are necessary. Much research has gone towards the ensuing problem of optimal data layout in memory and instruction placement, but the problem is so large that some aspects have yet to be addressed. This thesis presents spatial software pipelining, a new mechanism for doing data layout and instruction placement for loops. Spatial software pipelining places instructions and memory to avoid communication cycles, decreases the dependencies of tiles on each other, allows the bodies of loops to be pipelined across tiles, allows branch conditions to be pipelined along with data, and reduces the execution time of loops across multiple iterations. This thesis additionally presents the algorithms used to effect spatial software pipelining. Results show that spatial software pipelining performs 2.14x better than traditional assignment and scheduling techniques for a sparse matrix benchmark, and that spatial software pipelining can improve the execution time of certain loops by over a factor of three. by Michelle Duvall. M.Eng. and S.B.
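    The central invariant, that instruction and data placement must not induce communication cycles among tiles, can be checked mechanically. The Python sketch below is an illustration under assumed names, not the thesis's assignment algorithm: it projects instruction-level dependences onto an instruction-to-tile placement and verifies that the induced tile graph is acyclic, which is what allows a loop body to stream across tiles.

```python
from collections import defaultdict

def tile_graph(dep_edges, placement):
    """Project instruction-level dependences (u -> v) onto tiles,
    keeping only cross-tile edges, i.e. actual communication."""
    g = defaultdict(set)
    for u, v in dep_edges:
        if placement[u] != placement[v]:
            g[placement[u]].add(placement[v])
    return g

def is_pipelineable(dep_edges, placement):
    """The loop body can stream across tiles iff the induced tile-level
    communication graph is acyclic (checked with Kahn's algorithm)."""
    g = tile_graph(dep_edges, placement)
    nodes = set(g) | {t for succs in g.values() for t in succs}
    indeg = {t: 0 for t in nodes}
    for succs in g.values():
        for t in succs:
            indeg[t] += 1
    ready = [t for t in nodes if indeg[t] == 0]
    seen = 0
    while ready:
        t = ready.pop()
        seen += 1
        for s in g.get(t, ()):
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return seen == len(nodes)

deps = [("load", "mul"), ("mul", "store")]
good = {"load": 0, "mul": 1, "store": 2}   # straight pipeline across tiles
bad = {"load": 0, "mul": 1, "store": 0}    # tile 0 -> 1 -> 0 cycle
assert is_pipelineable(deps, good)
assert not is_pipelineable(deps, bad)
```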

    A new approach to reversible computing with applications to speculative parallel simulation

    In this thesis, we propose an innovative approach to reversible computing that shifts the focus from the operations to the memory outcome of a generic program. This choice allows us to overcome some typical challenges of "plain" reversible computing. Our methodology is to instrument a generic application with the help of an instrumentation tool, namely Hijacker, which we have redesigned and developed for this purpose. Through compile-time instrumentation, we enhance the program's code to keep track of the memory trace it produces until the end of its execution. Regardless of the complexity behind the generation of each computational step of the program, we can build inverse machine instructions just by inspecting the instruction that is attempting to write some value to memory. From this information, we craft an ad-hoc instruction that carries the old value and the knowledge of where to restore it. This instruction becomes part of a more comprehensive structure, namely the reverse window. Through this structure, we have sufficient information to cancel all the updates performed by the generic program during its execution. In this thesis, we discuss the structure of the reverse window as the building block of the whole reversing framework we designed and realized. Although we situate our solution in the specific context of parallel discrete event simulation (PDES) adopting the Time Warp synchronization protocol, this framework paves the way for further general-purpose development and employment. We also present two additional innovative contributions stemming from our reversibility approach, both of which still embrace the traditional state-saving-based rollback strategy. The first contribution aims to harness the advantages of both approaches: we implement the rollback operation by combining state saving with our reversible support, guided by a mathematical model. This model enables the system to autonomously choose the best rollback strategy according to the mutable runtime dynamics of programs. The second contribution explores an orthogonal direction, still related to reversible computing: the problem of reversing shared libraries. Indeed, by their nature, shared objects are visible to the whole system, and so is every modification of their code. As a consequence, it is not possible to instrument them without affecting other, unaware applications. We propose a different method to deal with the instrumentation of shared objects. All our proposals have been assessed using the latest generation of the open-source ROOT-Sim PDES platform, into which we integrated our solutions. ROOT-Sim is a C-based package implementing a general-purpose simulation environment based on the Time Warp synchronization protocol.
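    The reverse-window mechanism can be modeled in a few lines. Below is a minimal executable Python model (the real framework records machine-level inverse instructions via compile-time instrumentation with Hijacker; the class and method names here are invented for illustration): every instrumented write first captures the old value at the target address, and a rollback replays those before-images in reverse order.

```python
class ReverseWindow:
    """Toy model of the reverse-window idea: before every write, capture
    (address, old value); undoing replays the log backwards."""

    def __init__(self, memory):
        self.memory = memory          # dict: address -> value
        self.log = []                 # before-images, in program order

    def write(self, addr, value):
        # Instrumented store: save the old value, then perform the write.
        self.log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value

    def rollback(self):
        # Cancel all updates by applying before-images in reverse order.
        while self.log:
            addr, old = self.log.pop()
            if old is None:
                self.memory.pop(addr, None)   # address did not exist before
            else:
                self.memory[addr] = old

mem = {0x10: 7}
rw = ReverseWindow(mem)
rw.write(0x10, 42)
rw.write(0x20, 1)
rw.rollback()
assert mem == {0x10: 7}   # memory restored, as after a Time Warp rollback
```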

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born, with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy-efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load latency tolerant design that is both energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include the formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance's dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance techniques like dynamic voltage and frequency scaling (DVFS).
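    The partition at the heart of load latency tolerance, separating miss-dependent instructions (the slice to re-execute later) from miss-independent ones (which keep running under the miss), can be sketched as simple register poisoning. The Python model below is an illustration only, not the dissertation's microarchitecture; instructions are (destination, sources) pairs and the miss is a load writing `miss_dest`.

```python
def split_slice(instrs, miss_dest):
    """Toy model: defer every instruction transitively dependent on the
    missed load into a slice buffer for later re-execution; everything
    else continues to execute under the miss."""
    poisoned = {miss_dest}            # registers carrying miss-dependent values
    slice_buf, independent = [], []
    for dest, srcs in instrs:
        if poisoned & set(srcs):
            poisoned.add(dest)        # result is miss-dependent too
            slice_buf.append((dest, srcs))
        else:
            poisoned.discard(dest)    # overwritten by an independent result
            independent.append((dest, srcs))
    return independent, slice_buf

# r1 <- missed load; r2 depends on r1, r3 does not, r4 depends on r2.
stream = [("r2", ["r1"]), ("r3", ["r5"]), ("r4", ["r2"])]
run_now, re_exec = split_slice(stream, "r1")
assert run_now == [("r3", ["r5"])]
assert re_exec == [("r2", ["r1"]), ("r4", ["r2"])]
```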

    Phenomenology and difference: the body, architecture and race

    The aim of the thesis is to consider the position of phenomenology in contemporary thought in order to argue that only on its terms can a political ontology of difference be thought. To inaugurate this project I begin by questioning Heidegger's relation to phenomenology. I take issue with the way that Heidegger privileges time over space in "Being and Time". In this way, the task of the thesis is clarified as the need to elaborate a spatio-temporal phenomenology. After re-situating Heidegger's failure in this respect within a Kantian background, I suggest that the phenomenological grounding of difference must work through the body. I contend that the body is the ontological site of both the subject and the object. I use Whitehead and Merleau-Ponty to explore the ramifications of this thesis. I suggest first of all that architecture should be grounded ontologically in the body, and as such avoids being a 'master discourse'. Secondly, by theorising the body and world as reciprocally transformative, my reading of Merleau-Ponty emphasises the ways in which his thinking opens up a phenomenology of embodied difference. It is on the basis of these themes that I develop this thinking in the direction of race, exploring the dialectics of visibility and invisibility in the work of Frantz Fanon and James Baldwin. I argue that embodied difference attests to variations in the agent's freedom to act in the world. If freedom is understood through Merleau-Ponty as being the embodied ground of historicity, we must ask after unfreedom. I suggest that the "flesh" ontology of a pre-thetic community should be rethought as a regulative ideal, the ideal of a justice that can never be given. In this light, phenomenology becomes as much a poetics. Beyond being thought of as conservative, phenomenology henceforth unleashes the possibility of thinking a transformative embodied agency.

    DYRK1A in cancer: good or evil? : Defining properties of DYRK1A kinase as a novel tumor driver

    DYRKs (dual-specificity tyrosine-regulated kinases) are an evolutionarily conserved family of protein kinases involved in the regulation of cellular processes, such as proliferation and survival, which play a pivotal role in tumor development. In this thesis work, the potential role of DYRK genes as tumor drivers has been explored through an extensive analysis of The Cancer Genome Atlas data. DYRKs were found to be altered in tumor samples at different levels. In particular, the dosage-sensitive member DYRK1A emerged as a potential tumor driver. Functional screens on DYRK1A cancer somatic mutations showed that most of the mutants analyzed have a strong impact on the catalytic activity and/or stability of the protein, suggesting that DYRK1A loss of function is positively selected in cancer. Reversion, by a CRISPR/Cas9 strategy, of a DYRK1A mutant allele to wild type in an endometrial cancer cell line strongly impaired tumor cell growth. The tumor-suppressive role of DYRK1A was confirmed in vivo using mouse tumor xenografts. Finally, an integrated transcriptomic, proteomic and phospho-proteomic analysis has uncovered potential molecular mechanisms underlying the DYRK1A-mediated tumor driver function.

    Acceleration of image processing algorithms for single-particle analysis with electron microscopy

    Unpublished doctoral thesis, co-supervised by Masaryk University (Czech Republic) and the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 24-10-2022.
    Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM is its computational complexity. Modern electron microscopes can produce terabytes of data per single session, from which hundreds of thousands of particles must be extracted and processed to obtain a near-atomic resolution of the original sample. Many existing software solutions use High-Performance Computing (HPC) techniques to bring these computations into the realm of practical usability. The common approach to acceleration is parallelization of the processing, but in practice we face many complications, such as problem decomposition, data distribution, load scheduling, balancing, and synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous hardware brings additional caveats, for example limited portability, under-utilization due to synchronization, and sub-optimal code performance due to missing specialization.
    This dissertation, structured as a compendium of articles, aims to improve the algorithms used in Cryo-EM, especially in Single Particle Analysis (SPA). We focus on single-node performance optimizations, using techniques either available or developed in the HPC field, such as heterogeneous computing or autotuning, which potentially requires the formulation of novel algorithms. The secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since the Cryo-EM pipeline consists of multiple distinct steps targeting different types of data, there is no single bottleneck to be solved. As such, the presented articles take a holistic approach to performance optimization.
    First, we give details on the GPU acceleration of specific programs. The achieved speedup is due to the higher performance of the GPU, adjustments of the original algorithms to it, and the application of novel algorithms. More specifically, we provide implementation details of programs for movie alignment, 2D classification, and 3D reconstruction that have been sped up by an order of magnitude compared to their original multi-CPU implementations, or sufficiently to be used on-the-fly. In addition to these three programs, multiple other programs from XMIPP, an actively used open-source software package, have been accelerated and improved.
    Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA, together with the introduction of complex dynamic autotuning to the KTT tool.
    Third, we propose Umpalumpa, an image processing framework that combines a task-based runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for writing complex workflows that automatically use available hardware resources and adjust to different hardware and data, while remaining easy to maintain.
    The project that gave rise to these results received the support of a fellowship from the "la Caixa" Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 71367

    Multi-Objective Design Automation for Reconfigurable Multi-Processor Systems

    Ph.D., Doctor of Philosophy