99 research outputs found

    A Message-Passing, Thread-Migrating Operating System for a Non-Cache-Coherent Many-Core Architecture

    Get PDF
    The difference between emerging many-core architectures and their multi-core predecessors goes beyond just the number of cores incorporated on a chip. Current technologies for maintaining cache coherency are not scalable beyond a few dozen cores, and a lack of coherency presents a new paradigm for software developers to work with. While shared memory multithreading has been a viable and popular programming technique for multi-cores, the distributed nature of many-cores is more amenable to a model of share-nothing, message-passing threads. This model places different demands on a many-core operating system, and this thesis aims to understand and accommodate those demands. We introduce Xipx, a port of the lightweight Embedded Xinu operating system to the many-core Intel Single-chip Cloud Computer (SCC). The SCC is a 48-core x86 architecture that lacks cache coherency. It features a fast mesh network-on-chip (NoC) and on-die message passing buffers to facilitate message-passing communications between cores. Running as a separate instance per core, Xipx takes advantage of this hardware in its implementation of a message-passing device. The device multiplexes the message passing hardware, thereby allowing multiple concurrent threads to share the hardware without interfering with each other. Xipx also features a limited framework for transparent thread migration. This achievement required fundamental modifications to the kernel, including incorporation of a new type of thread. Additionally, a minimalistic framework for bare-metal development on the SCC has been produced as a pragmatic offshoot of the work on Xipx. This thesis discusses the design and implementation of the many-core extensions described above. While Xipx serves as a foundation for continued research on many-core operating systems, test results show good performance from both message passing and thread migration suggesting that, as it stands, Xipx is an effective platform for exploration of many-core development at the application level as well

    Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime

    Full text link
    A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simula-tion. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to op-timize both the application and the underlying SMP runtime. Hi-erarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge Na-tional Laboratory, both with and without PME full electrostatics, achieving 93 % parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory. 1

    Optimizing an MPI weather forecasting model via processor virtualization

    Full text link
    Abstractā€”Weather forecasting models are computationally intensive applications. These models are typically executed in parallel machines and a major obstacle for their scalability is load imbalance. The causes of such imbalance are either static (e.g. topography) or dynamic (e.g. shortwave radiation, moving thunderstorms). Various techniques, often embedded in the applicationā€™s source code, have been used to address both sources. However, these techniques are inflexible and hard to use in legacy codes. In this paper, we demonstrate the effectiveness of processor virtualization for dynamically balancing the load in BRAMS, a mesoscale weather forecasting model based on MPI paral-lelization. We use the Charm++ infrastructure, with its over-decomposition and object-migration capabilities, to move sub-domains across processors during execution of the model. Pro-cessor virtualization enables better overlap between computation and communication and improved cache efficiency. Furthermore, by employing an appropriate load balancer, we achieve better processor utilization while requiring minimal changes to the modelā€™s code. I

    Automatic MPI to AMPI Program Transformation Using Photran

    Full text link
    Abstract. Adaptive MPI, or AMPI, is an implementation of the Mes-sage Passing Interface (MPI) standard. AMPI benefits MPI applications with features such as dynamic load balancing, virtualization, and check-pointing. Because AMPI uses multiple user-level threads per physical core, global variables become an obstacle. It is thus necessary to con-vert MPI programs to AMPI by eliminating global variables. Manually removing the global variables in the program is tedious and error-prone. In this paper, we present a Photran-based tool that automates this task with a source-to-source transformation that supports Fortran. We eval-uate our tool on the multi-zone NAS Benchmarks with AMPI. We also demonstrate the tool on a real-world large-scale FLASH code and present preliminary results of running FLASH on AMPI. Both results show sig-nificant performance improvement using AMPI. This demonstrates that the tool makes using AMPI easier and more productive.

    A study of System Interface Sets (SIS) for the host, target and integration environments of the Space Station Program (SSP)

    Get PDF
    System interface sets (SIS) for large, complex, non-stop, distributed systems are examined. The SIS of the Space Station Program (SSP) was selected as the focus of this study because an appropriate virtual interface specification of the SIS is believed to have the most potential to free the project from four life cycle tyrannies which are rooted in a dependance on either a proprietary or particular instance of: operating systems, data management systems, communications systems, and instruction set architectures. The static perspective of the common Ada programming support environment interface set (CAIS) and the portable common execution environment (PCEE) activities are discussed. Also, the dynamic perspective of the PCEE is addressed

    PicoGrid: A Web-Based Distributed Computing Framework for Heterogeneous Networks Using Java

    Get PDF
    We propose a framework for distributed computing applications in heterogeneous networks. The system is simple to deploy and can run on any operating systems that support the Java Virtual Machine. Using our developed system, idle computing power in an organization can be harvested for performing computing tasks. Agent computers can enter and leave the computation at any time which makes our system very flexible and easily scalable. Our system also does not affect the normal use of client machines to guarantee satisfactory user experience. System tests show that the system has comparable performance to the theoretical case and the computation time is significantly reduced by utilizing multiple computers on the network

    TALUS: Reinforcing TEE Confidentiality with Cryptographic Coprocessors (Technical Report)

    Full text link
    Platforms are nowadays typically equipped with tristed execution environments (TEES), such as Intel SGX and ARM TrustZone. However, recent microarchitectural attacks on TEEs repeatedly broke their confidentiality guarantees, including the leakage of long-term cryptographic secrets. These systems are typically also equipped with a cryptographic coprocessor, such as a TPM or Google Titan. These coprocessors offer a unique set of security features focused on safeguarding cryptographic secrets. Still, despite their simultaneous availability, the integration between these technologies is practically nonexistent, which prevents them from benefitting from each other's strengths. In this paper, we propose TALUS, a general design and a set of three main requirements for a secure symbiosis between TEEs and cryptographic coprocessors. We implement a proof-of-concept of TALUS based on Intel SGX and a hardware TPM. We show that with TALUS, the long-term secrets used in the SGX life cycle can be moved to the TPM. We demonstrate that our design is robust even in the presence of transient execution attacks, preventing an entire class of attacks due to the reduced attack surface on the shared hardware.Comment: In proceedings of Financial Cryptography 2023. This is the technical report of the published pape
    • ā€¦
    corecore