Instruction prefetching techniques for ultra low-power multicore architectures
As the gap between processor and memory speeds widens, memory latency has become a critical bottleneck for computing performance, and designers have been working on techniques to hide it. Embedded processors, on the other hand, are typically designed for low cost and low power consumption, so techniques that satisfy these constraints are the ones desirable in embedded domains. While out-of-order execution, aggressive speculation, and complex branch prediction algorithms can help hide memory access latency in high-performance systems, they cost a heavy power budget and are not suitable for embedded systems. Prefetching is another popular method for hiding memory access latency and has been studied extensively for high-performance processors; for embedded processors with strict power requirements, however, the applicability of complex prefetching techniques is greatly limited, and a low-power, low-energy solution is therefore desired in this context.
In this work, we focus on instruction prefetching for ultra-low-power processing architectures and aim to reduce the energy overhead of this operation by proposing a combination of simple, low-cost, and energy-efficient prefetching techniques. We study a wide range of applications, from cryptography to computer vision, and show that our proposed mechanisms can effectively improve the hit rate of almost all of them to above 95%, achieving an average performance improvement of more than 2X. Furthermore, by synthesizing our designs in state-of-the-art technologies, we show that the prefetchers increase system power consumption by less than 15% and total silicon area by less than 1%. Altogether, the proposed schemes achieve a total energy reduction of 1.9X, enabling significantly longer battery life.
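The abstract above does not spell out the prefetching schemes themselves, but the flavour of a simple, low-cost instruction prefetcher can be conveyed with a toy trace-driven simulation. The cache geometry, trace generator, and next-line policy below are illustrative assumptions, not the paper's design:

```python
import random

LINE_BYTES = 16   # bytes per instruction-cache line (hypothetical)
SETS = 64         # direct-mapped cache sets (hypothetical)

def hit_rate(trace, prefetch):
    cache = [None] * SETS
    hits = 0
    for pc in trace:
        line = pc // LINE_BYTES
        if cache[line % SETS] == line:
            hits += 1
        else:
            cache[line % SETS] = line   # demand fill on a miss
        if prefetch:
            # Next-line prefetching: instruction streams are mostly
            # sequential, so fetch the following line ahead of time.
            cache[(line + 1) % SETS] = line + 1
    return hits / len(trace)

# Synthetic instruction trace: sequential execution with a 5% chance of
# a taken branch to a random target (all parameters made up).
random.seed(0)
trace, pc = [], 0
for _ in range(100_000):
    trace.append(pc)
    pc = pc + 4 if random.random() > 0.05 else random.randrange(0, 1 << 20) * 4

print(f"hit rate without prefetching: {hit_rate(trace, False):.2%}")
print(f"hit rate with prefetching:    {hit_rate(trace, True):.2%}")
```

On a mostly sequential stream, the next-line policy lifts the hit rate from roughly 75% toward the high 90s, the same regime the paper reports for its benchmarks.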
ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction
The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide programmers to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.
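A back-of-the-envelope cost model can be assembled from the figures quoted in the abstract alone (66 cycles for a null thread, 4 to 5 times that for a typical thread, data cost linear in block size); the per-word slope below is a placeholder, not a measured value:

```python
# Migration cost model using only the figures quoted in the abstract.
# NULL_THREAD_CYCLES and TYPICAL_FACTOR come from the text;
# CYCLES_PER_WORD is a hypothetical slope for the linear data cost.

NULL_THREAD_CYCLES = 66     # from the abstract
TYPICAL_FACTOR = 4.5        # midpoint of the quoted "4 to 5 times"
CYCLES_PER_WORD = 2         # placeholder, not from the thesis

def thread_migration_cycles(typical=True):
    return NULL_THREAD_CYCLES * (TYPICAL_FACTOR if typical else 1)

def data_migration_cycles(block_words):
    # Fixed setup cost plus a linear per-word transfer cost.
    return NULL_THREAD_CYCLES + CYCLES_PER_WORD * block_words

print(thread_migration_cycles())    # 297.0 cycles for a typical thread
print(data_migration_cycles(256))   # 578 cycles for a 256-word block
```

At a few hundred cycles, a typical migration is indeed on the same order as an L2 miss, which is the comparison the abstract uses to justify omitting data caches.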
A GPU performance estimation model based on micro-benchmarks and black-box kernel profiling
Over the last decade, GPUs have been established in the high-performance computing sector as compute accelerators. The primary characteristics that justify this modern trend are their exceptionally high compute throughput and remarkable power efficiency. However, GPU performance is highly sensitive to many factors, e.g. the type of memory access patterns, branch divergence, the degree of parallelism, and potential latencies. Consequently, the execution time of a kernel on a GPU is difficult to predict. Unless the kernel is latency bound, a rough estimate of the execution time on a particular GPU can be provided by applying the roofline model, which maps the program's operational intensity to the peak expected performance of a particular processor.
Though this approach is straightforward, it cannot provide accurate prediction results. In this thesis, after validating the roofline principle on GPUs with a micro-benchmark, an analytical, throughput-oriented performance model is proposed. In particular, the model improves on the roofline model by following a quantitative approach, and a completely automated GPU performance prediction technique is presented. In this respect, the proposed model utilizes micro-benchmarking and profiling in a “black-box” fashion, as no inspection of source or binary code is required. The proposed model combines GPU and kernel parameters in order to characterize the performance-limiting factor and to predict the execution time on the target hardware, taking into account the efficiency of beneficial computational instructions. In addition, the “quadrant-split” visual representation is proposed, which captures the characteristics of multiple processors in relation to a particular kernel.
The experimental evaluation combines test executions on stencil computations (red/black SOR, LMSOR), matrix multiplication (SGEMM), and a total of 28 kernels of the Rodinia benchmark suite, all applied on six CUDA GPUs. The observed absolute error in predictions was 27.66% in the average case. Special cases of mispredicted results were investigated and justified. Moreover, the aforementioned micro-benchmark was used as a subject for performance prediction, and the exhibited results were very accurate. Furthermore, the performance model was also examined in a cross-vendor configuration by applying the prediction method to the same kernel source codes through the HIP programming environment supported on the AMD ROCm platform. Prediction errors were comparable to those of the CUDA experiments, despite the significant architectural differences between different vendors' GPUs.
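The roofline bound that the thesis starts from is compact enough to state directly: attainable throughput is the minimum of the compute peak and the memory bandwidth scaled by the kernel's operational intensity. A minimal sketch, with placeholder peak figures rather than numbers from the thesis:

```python
# Classic roofline estimate: attainable throughput is capped either by the
# compute peak or by memory bandwidth times operational intensity.
# The peak figures are placeholders, not measurements from the thesis.

PEAK_GFLOPS = 4000.0   # hypothetical compute peak of the target GPU
PEAK_BW_GBS = 500.0    # hypothetical DRAM bandwidth in GB/s

def roofline_gflops(intensity_flop_per_byte):
    return min(PEAK_GFLOPS, PEAK_BW_GBS * intensity_flop_per_byte)

def predicted_time_s(total_flops, intensity_flop_per_byte):
    # Rough execution-time estimate for a kernel that is not latency bound.
    return total_flops / (roofline_gflops(intensity_flop_per_byte) * 1e9)

# A kernel performing 2 GFLOP at 4 flop/byte is bandwidth bound here:
print(roofline_gflops(4.0))        # 2000.0 GFLOP/s, below the compute peak
print(predicted_time_s(2e9, 4.0))  # ~0.001 s
```

The thesis's contribution is precisely to refine this bound with micro-benchmarked efficiencies, since the raw roofline, as noted above, is too coarse for accurate prediction.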
Exploiting data locality in cache-coherent NUMA systems
The end of Dennard scaling has caused a stagnation of clock frequency in computers. To overcome this issue, over the last two decades vendors have been integrating larger numbers of processing elements into their systems: interconnecting many nodes, including multiple chips per node, and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors; it is much slower, and there is a need to provide more total bandwidth to the processors, especially as the number of cores and chips grows.
While still keeping a shared address space, where all processors can access the whole memory, solutions have come from integrating more memories: using newer technologies like high-bandwidth memory (HBM) and non-volatile memory (NVM), giving groups of cores (sockets, for example) faster access to some subset of the DRAM, or combining several of these solutions. This has introduced heterogeneity in the access speed to main memory, which depends on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours.
Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes made to memory by one CPU must be visible to the other CPUs, transparently to the programmer.
These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems.
The first contribution shows that considering hardware prefetching simultaneously with thread and data placement in NUMA systems can uncover configurations with better performance than considering these aspects separately. The performance results, combined with performance counters, are then used to build a performance model that predicts, both offline and online, the best configuration for new applications not in the model. The evaluation uses two different high-performance NUMA systems, and the performance counters collected on one machine are used to predict the best configurations on the other machine.
The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers at a low hardware cost while giving very good results. The evaluation uses a cycle-accurate architectural simulator and provides detailed results on performance, data-transfer reduction, and energy costs.
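The abstract does not describe the scheme's internals, so the sketch below only illustrates the general idea of making a hardware prefetcher NUMA-aware: prefetches that would cross into a remote NUMA region are issued more conservatively than local ones. The region mapping and throttling policy here are assumptions, not the thesis's mechanism:

```python
# Illustrative NUMA-aware prefetch filter: issue all local-region
# prefetch candidates, but only a budgeted number of remote-region ones.
# The region map and the policy are assumptions for illustration.

REGION_BITS = 2  # hypothetical: top address bits select the NUMA region

def numa_region(addr, addr_bits=48):
    return addr >> (addr_bits - REGION_BITS)

def filter_prefetches(candidates, cpu_region, remote_budget=1):
    """Keep local-region prefetches; throttle remote-region ones."""
    issued, remote_used = [], 0
    for addr in candidates:
        if numa_region(addr) == cpu_region:
            issued.append(addr)
        elif remote_used < remote_budget:
            issued.append(addr)
            remote_used += 1
    return issued

# Example: a stride prefetcher proposed 4 candidate lines; the CPU is in
# region 0, so remote candidates beyond the budget are dropped.
cands = [0x0000_1000_0000, 0x4000_0000_0000, 0x4000_0000_0040, 0x0000_1000_0040]
print([hex(a) for a in filter_prefetches(cands, cpu_region=0)])
```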
Finally, the third and last contribution consists of scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications on parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application, in which the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between tasks. The proposed scheduling algorithms use graph partitioning techniques and produce a schedule for the tasks in the TDG that minimises data transfers between the different NUMA regions of the system. The results have been evaluated on real ccNUMA systems with multiple NUMA regions.
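As a stand-in for the graph-partitioning schedulers, here is a minimal greedy sketch of the objective: place each task of the TDG in the NUMA region where most of its predecessors' data already lives, so that few dependency edges cross regions. The thesis uses proper graph partitioning techniques; everything below is illustrative:

```python
# Greedy TDG placement: put each task where most of its already-placed
# predecessors live (their output data is local there), breaking ties
# toward the least-loaded region. A stand-in for real graph partitioning.

from collections import defaultdict

def schedule(tasks, deps, n_regions):
    """tasks: iterable in topological order; deps: {task: [predecessors]}."""
    placement, load = {}, defaultdict(int)
    for t in tasks:
        votes = defaultdict(int)
        for p in deps.get(t, []):
            votes[placement[p]] += 1          # prefer data already nearby
        if votes:
            region = max(votes, key=lambda r: (votes[r], -load[r]))
        else:
            region = min(range(n_regions), key=lambda r: load[r])
        placement[t] = region
        load[region] += 1
    return placement

# Tiny diamond-shaped TDG: a -> b, a -> c, (b, c) -> d
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(schedule(["a", "b", "c", "d"], deps, n_regions=2))
```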
An Architecture Evaluation and Implementation of a Soft GPGPU for FPGAs
Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now ship with a simple processor and an FPGA, but without more energy-hungry components such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture that allows GPGPU code to execute on an FPGA without recompiling the FPGA design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU workloads, and it supports direct CUDA compilation to a binary executable on the FPGA-based GPGPU. The architecture is customizable, providing the FPGA designer with a selection of GPGPU cores that trade off performance against area.
This dissertation describes the FlexGrip architecture in detail and demonstrates its benefits by evaluating the design on a collection of five standard CUDA benchmarks compiled using standard GPGPU compilation tools. Speedups of 23x on average versus a MicroBlaze microprocessor are achieved for designs that take advantage of the conditional execution capabilities offered by FlexGrip. We also show that FlexGrip achieves an 80% average reduction in dynamic energy versus the MicroBlaze microprocessor.
The dissertation then explores application-customized versions of the soft GPGPU, exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimize away features that are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a design with two GPGPU processors, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.
To complete this thesis, we augmented a cycle-accurate GPGPU simulator to emulate FlexGrip and evaluate different levels of the cache design space. We show performance increases for select benchmarks; however, we also show that 64% and 45% of benchmarks exhibited performance decreases when the L1D cache was enabled for the 1 SMP and 2 SMP configurations, respectively, and only one benchmark showed performance improvement when the L2 cache was enabled.
Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPUs), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap the full potential of pure software solutions in taking advantage of these two key hardware features.

Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of solutions to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.

In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform sharing on multithreaded applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study of the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between software development and the underlying cache-sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvements. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations.
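The dissertation's thread- and data-relocation transformations are more general than anything that fits here, but the core intuition, regrouping work so that every SIMD group takes a single control-flow path, can be shown with a small NumPy sketch; the predicate, warp-width bookkeeping, and data are made up for illustration:

```python
# Core intuition behind data relocation for divergence removal: reorder
# elements by their branch predicate so that each SIMD-width group of
# threads takes a single path. The data below is synthetic; G-Streamline's
# actual transformations are far more general.

import numpy as np

WARP = 32  # SIMD group width

def divergent_warps(pred):
    """Count warps whose threads disagree on the branch predicate."""
    groups = pred[: len(pred) // WARP * WARP].reshape(-1, WARP)
    return int(np.sum(groups.any(axis=1) & ~groups.all(axis=1)))

rng = np.random.default_rng(0)
pred = rng.random(4096) < 0.5         # per-thread branch outcome

order = np.argsort(~pred)             # data relocation: group same-path work
print("divergent warps before:", divergent_warps(pred))         # ~128
print("divergent warps after: ", divergent_warps(pred[order]))  # 0 or 1
```

After the reorder, at most the single boundary warp still diverges, which is the convergence-within-every-SIMD-group property the transformations aim for.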