    Enabling the use of embedded and mobile technologies for high-performance computing

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chip (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in HPC environments. Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.

    Scientific Application Acceleration Utilizing Heterogeneous Architectures

    Within the past decade, there have been substantial leaps in computer architectures to exploit the parallelism that is inherently present in many applications. The scientific community has benefited from the emergence of not only multi-core processors, but also other, less traditional architectures, including general-purpose graphics processing units (GPGPUs), field-programmable gate arrays (FPGAs), and Intel's Many Integrated Core (MIC) architecture (i.e., Xeon Phi). The popularity of the GPGPU has increased rapidly because of its ability to perform massive amounts of parallel computation quickly and at low cost, with relative ease of programming. Also, with the addition of high-level programming interfaces for these devices, technical and non-technical individuals can interface with the device and rapidly obtain improved performance for many algorithms. Many applications can take advantage of the parallelism present in distributed computing and multithreading to achieve higher levels of performance for the computationally intensive parts of the application. The work presented in this thesis implements three applications for use in a performance study of the GPGPU architecture and multi-GPGPU systems. The first application is a K-Means clustering algorithm that assigns each data point to the closest cluster. The second is a spiking neural network algorithm that is used as a computational model for machine learning. The third, and final, study is the longest common subsequences problem, which attempts to enumerate comparisons between sequences (namely, DNA sequences). The results for these applications with varying problem sizes and architectural configurations are presented and discussed in this thesis. The K-Means clustering algorithm achieved approximately 97x speedup when utilizing an architecture consisting of 32 CPU/GPGPU pairs. To achieve this substantial speedup, up to 750,000 data points were used with up to 30,000 centroids (means). The spiking neural network algorithm resulted in speedups of about 33x for the entire algorithm and 160x for each iteration with a two-level network with 1000 total neurons (800 excitatory and 200 inhibitory neurons). The longest common subsequences problem achieved a speedup of greater than 10x with 100 random sequences up to 500 characters in length. The maximum speedup values for each application were achieved by utilizing the GPGPU as well as multi-core devices simultaneously. The computations were scattered over multiple CPU/GPGPU pairs, with the computationally intensive pieces of the algorithms offloaded onto the GPGPU device. The research in this thesis illustrates the ability to scale a heterogeneous cluster (i.e., CPUs and GPUs working collaboratively) for large-scale scientific application performance improvements. Each algorithm demonstrates slightly different types of computations and communications, which can be compared to other algorithms to predict how they would perform on an accelerator. The results show that substantial speedups can be achieved for scientific applications when utilizing the GPGPU and multi-core architectures.
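
    As a rough illustration of the K-Means computation described above, the following is a minimal serial sketch in plain Python. It is not the thesis's GPGPU implementation, which the abstract does not show; the function names (`assign_to_nearest`, `update_centroids`, `kmeans`) and the use of squared Euclidean distance are assumptions made for this example. The assignment step treats every data point independently, which is the kind of work that is commonly offloaded to an accelerator.

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid.

    Distance is squared Euclidean (an assumption of this sketch).
    Each point is handled independently of all the others.
    """
    labels = []
    for p in points:
        best_idx, best_dist = 0, math.inf
        for idx, c in enumerate(centroids):
            dist = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        labels.append(best_idx)
    return labels


def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it.

    A centroid with no assigned points is left unchanged.
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dims):
            sums[label][d] += p[d]
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = None
    for _ in range(max_iters):
        new_labels = assign_to_nearest(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return labels, centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

    In this sketch the inner loop over centroids dominates the cost, so the work per iteration grows with the product of the number of points and the number of centroids, which is why problem sizes like those quoted above are computationally demanding.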

    Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

    In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest-generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. To scale and fully use resources on these and next-generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future-generation architectures. This work was completed by the author in August 2010 and reflects the analysis and results of that time.

    Knowledge is power: Quantum chemistry on novel computer architectures

    In the first chapter of this thesis, a background of fundamental quantum chemistry concepts is provided. Chapter two contains an analysis of the performance and energy efficiency of various modern computer processor architectures while performing computational chemistry calculations. In chapter three, the processor architectural study is expanded to include parallel computational chemistry algorithms executed across multi-node computer clusters. Chapter four describes a novel computational implementation of the fundamental Hartree-Fock method which significantly reduces computer memory requirements. In chapter five, a case study of quantum chemistry two-electron integral code interoperability is described. The final chapters of this work discuss applications of quantum chemistry. In chapter six, an investigation of the esterification of acetic acid on acid-functionalized silica is presented. In chapter seven, the application of ab initio molecular dynamics to study the photoisomerization and photocyclization of stilbene is discussed. Concluding remarks are given in chapter eight.

    Simulating Nonlinear Neutrino Oscillations on Next-Generation Many-Core Architectures

    In this work, an astrophysical simulation code, XFLAT, is developed to study neutrino oscillations in supernovae. XFLAT is a hybrid modular code designed to utilize multiple levels of parallelism through MPI, OpenMP, and SIMD instructions (vectorization). It can run on both the CPU and the Xeon Phi co-processor, the latter of which is based on the Intel Many Integrated Core (MIC) architecture. The performance of XFLAT on various system configurations and physics scenarios has been analyzed. In addition, the impact of I/O and of the multi-node configuration on Xeon Phi-equipped heterogeneous supercomputers, such as Stampede at the Texas Advanced Computing Center (TACC), was investigated.

    An Analysis of Storage Virtualization

    Investigating technologies and writing expansive documentation on their capabilities is like hitting a moving target. Technology evolves and expands what it can do every day, which makes it very difficult to draw a line and compare competing technologies. Storage virtualization is one of those moving targets. Large corporations develop software and hardware solutions that try to one-up the competition by releasing firmware and patch updates that include their latest developments. Some of their latest innovations include differing RAID levels, virtualized storage, data compression, data deduplication, file deduplication, thin provisioning, new file system types, tiered storage, solid-state disks, and software updates that align these technologies with their applicable hardware. Even data center environmental considerations like reusable energy, data center environmental characteristics, and geographic location are being used by companies both small and large to reduce operating costs and limit environmental impacts. Companies are even moving to entirely cloud-based setups to limit their environmental impact, as it can be cost-prohibitive to maintain their own corporate infrastructure. The trifecta of integrating smart storage architectures that include storage virtualization technologies, reducing footprint to promote energy savings, and migrating to cloud-based services will ensure a long-term sustainable storage subsystem.

    A Server Consolidation Solution

    Advances in server architecture have enabled corporations to strategically redesign their data centers in order to realign the system infrastructure to business needs. The architectural design of physically and logically consolidating servers into fewer and smaller hardware platforms can reduce data center overhead costs while improving quality of service. In order for the organization to take advantage of this architectural opportunity, a server consolidation project was proposed that utilized blade technology coupled with the virtualization of servers. Physical consolidation reduced the data center facility requirements, while server virtualization reduced the number of required hardware platforms. With the constant threat of outsourcing, coupled with the explosive growth of the organization, the IT managers were challenged to provide increased system services and functionality to a larger user community while maintaining the same head count. One means of reducing overhead costs associated with the in-house data center was to reduce the required facility and hardware resources. The reduction in the data center footprint required less real estate, electricity, fire suppression infrastructure, and HVAC utilities. In addition, since the numerous standalone servers were consolidated onto a standard platform, system administration became more responsive to business opportunities.