31 research outputs found

    Applied (Meta)-Heuristic in Intelligent Systems

    Engineering and business problems are becoming increasingly difficult to solve due to the new economics triggered by big data, artificial intelligence, and the internet of things. Exact algorithms and heuristics are insufficient for solving such large and unstructured problems; instead, metaheuristic algorithms have emerged as the prevailing methods. A generic metaheuristic framework guides the course of search trajectories beyond local optimality, thus overcoming the limitations of traditional computation methods. The application of modern metaheuristics ranges from unmanned aerial and ground surface vehicles, unmanned factories, resource-constrained production, and humanoids to green logistics, renewable energy, circular economy, agricultural technology, environmental protection, finance technology, and the entertainment industry. This Special Issue presents high-quality papers proposing modern metaheuristics in intelligent systems.

    Lossy and Lossless Compression Techniques to Improve the Utilization of Memory Bandwidth and Capacity

    Main memory is a critical resource in modern computer systems and is in increasing demand. An increasing number of on-chip cores and specialized accelerators improves the potential processing throughput but also calls for higher data rates and greater memory capacity. In addition, emerging data-intensive applications further increase memory traffic and footprint. On the other hand, memory bandwidth is pin limited and power constrained and is therefore more difficult to scale. Memory capacity is limited by cost and energy considerations. This thesis proposes a variety of memory compression techniques as a means to reduce the memory bottleneck. These techniques target two separate problems in the memory hierarchy: memory bandwidth and memory capacity. To reduce transferred data volumes, lossy compression is applied, which can reach more aggressive compression ratios. A reduction of off-chip memory traffic leads to reduced memory latency, which in turn improves the performance and energy efficiency of the system. To improve memory capacity, a novel approach to memory compaction is presented. The first part of this thesis introduces Approximate Value Reconstruction (AVR), which combines a low-complexity downsampling compressor with an LLC design able to co-locate compressed and uncompressed data. Two separate thresholds limit the error introduced by approximation. For applications that tolerate aggressive approximation in large fractions of their data, in a system with 1GB of 1600MHz DDR4 per core and 1MB of LLC space per core, AVR reduces memory traffic by up to 70%, execution time by up to 55%, and energy costs by up to 20%, while introducing at most 1.2% error in the application output. The second part of this thesis proposes Memory Squeeze (MemSZ), introducing a parallelized implementation of the more advanced Squeeze (SZ) compression method. Furthermore, MemSZ improves on the error-limiting capability of AVR by keeping track of lifetime accumulated error. An alternate memory compression architecture is also proposed, which utilizes 3D-stacked DRAM as a last-level cache. In a system with 1GB of 800MHz DDR4 per core and 1MB of LLC space per core, MemSZ improves execution time, energy, and memory traffic over AVR by up to 15%, 9%, and 64%, respectively. The third part of the thesis describes L2C, a hybrid lossy and lossless memory compression scheme. L2C applies lossy compression to approximable data and falls back to lossless compression if an error threshold is exceeded. In a system with 4GB of 800MHz DDR4 per core and 1MB of LLC space per core, L2C improves on the performance of MemSZ by 9% and on its energy consumption by 3%. The fourth and final contribution is FlatPack, a novel memory compaction scheme. FlatPack reduces the traffic overhead compared to other memory compaction systems, thus retaining the bandwidth benefits of compression. Furthermore, FlatPack is flexible to changes in block compressibility both over time and between adjacent blocks. When available memory corresponds to 50% of the application footprint, in a system with 4GB of 800MHz DDR4 per core and 1MB of LLC space per core, FlatPack increases system performance compared to current state-of-the-art designs by 36%, while reducing system energy consumption by 12%.
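
The hybrid idea attributed to L2C above (try lossy compression first, fall back to lossless when an error threshold is exceeded) can be illustrated with a minimal software sketch. This is not the thesis's hardware design: the pairwise-mean lossy stage, the relative-error metric, the 1% threshold, and every function name below are assumptions chosen only for illustration.

```python
import struct
import zlib


def lossy_compress(block):
    """Hypothetical lossy stage: keep only the mean of each adjacent pair."""
    return [(block[i] + block[i + 1]) / 2.0 for i in range(0, len(block), 2)]


def lossy_decompress(summary):
    """Reconstruct by repeating each stored mean twice."""
    return [value for value in summary for _ in range(2)]


def max_relative_error(original, approx):
    """Largest per-value relative error (absolute error when the value is 0)."""
    worst = 0.0
    for o, a in zip(original, approx):
        denom = abs(o) if o != 0 else 1.0
        worst = max(worst, abs(o - a) / denom)
    return worst


def compress_block(block, threshold=0.01):
    """Try the lossy path; fall back to lossless if the error is too large."""
    if len(block) % 2 != 0:
        raise ValueError("this sketch assumes an even number of values")
    summary = lossy_compress(block)
    if max_relative_error(block, lossy_decompress(summary)) <= threshold:
        return ("lossy", summary)
    raw = struct.pack(f"{len(block)}d", *block)  # exact 64-bit floats
    return ("lossless", zlib.compress(raw))


def decompress_block(tagged):
    """Undo compress_block using the tag stored with the payload."""
    kind, payload = tagged
    if kind == "lossy":
        return lossy_decompress(payload)
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"{len(raw) // 8}d", raw))
```

In this sketch a smooth block such as `[1.0, 1.001, 2.0, 2.001]` stays on the lossy path, while a block with large jumps between neighbours is stored losslessly and round-trips exactly.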

    Holistic Performance Analysis and Optimization of Unified Virtual Memory

    The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that paradoxically grows for codes where these technologies are most useful. While these technologies are increasingly adopted for use in modern HPC frameworks and applications, the performance cost reduces the efficiency of these systems and turns some developers away from adoption entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis. In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that allows for reduced system overhead and improved application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management.

    Hyperscale Data Processing With Network-Centric Designs

    Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. Enabling and expanding with these workloads are highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimize these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network-awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications and codesign data processing systems to achieve optimal query processing performance. To transform the network to future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at datacenter scale with machine learning approximation. These systems demonstrate that our ideas in network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.

    Enabling aggressive compiler optimization for the mobile environment

    Aggressive code optimization in the mobile environment is a difficult endeavor. Billions of users rely on mobile devices for their daily computing tasks. Yet, they mostly run poorly optimized code, under-utilizing their already limited processing and energy resources. Existing optimization approaches, like iterative compilation, perform well in other domains but fall short in the mobile environment. They either rely on representative inputs that are hard to reconstruct, or expose users to slowdowns and crashes. An ideal solution must be able to perform an optimization search by repeatedly evaluating different optimization decisions on the same input. That input should be representative of actual user usage without requiring developers to artificially create it. Finally, users should never be exposed to slow or crashing evaluations, a quite common side effect of iterative compilation. This thesis presents a novel approach with all of the above in mind, bringing aggressive code optimization to the mobile environment. With a transparent capture mechanism, real user inputs can be stored. This mechanism is infrequently invoked and remains unnoticeable to users. A single capture is enough to enable offline, input-driven code optimization. It supports C functions as well as code regions of interactive Android applications. It allows controlling the timing and frequency of captures, bails out on imminent high-impact runtime events, and excludes some immutable data from captures. A replay-based evaluation mechanism is able to repeatedly restore a captured input while changing the underlying code. For C programs, it employs compile-time and link-time strategies to work consistently despite code transformations. For Android apps, a novel mechanism was developed that can replay using different code types: the original Android-compiled code, interpretation, and LLVM-generated code. Additionally, it works well even in the presence of memory-shuffling security mechanisms. Capture and replay are fused into an iterative compilation system that uses offline, replay-based evaluations. Initially, real inputs are captured online, without noticeably affecting users. For C programs and interactive apps, captures required on average 2ms and 15ms, respectively. Then, an optimization search is performed by repeatedly replaying the inputs using different code transformations. As this happens offline, crashing or erroneous executions do not affect users. C programs became 29% faster using a random search, while interactive apps became 44% faster using a genetic algorithm and a novel Android backend based on LLVM. Finally, with crowd-sourcing, the offline evaluation effort was significantly accelerated. Specifically, for the user with the highest workload, the search was accelerated by 7 times.
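
The offline, replay-based search described above can be pictured with a toy random search. Everything here is a stand-in: the flag names, the speedup factors, the simulated crash, and the `replay` function are hypothetical and only mimic the shape of the loop (replay the same captured input under different code variants, discard crashing variants, keep the fastest); they are not the thesis's capture or replay mechanism.

```python
import random

# Hypothetical optimization flags and made-up speedup factors.
FLAGS = ("inline", "unroll", "vectorize", "licm")
SPEEDUP = {"inline": 0.90, "unroll": 0.95, "vectorize": 0.80, "licm": 0.97}


def replay(config, captured_input):
    """Simulated replay of one captured input under a given flag set.

    Returns a simulated runtime, or None to model a crashing variant.
    """
    if "unroll" in config and "vectorize" in config:
        return None  # assumed bad interaction, for illustration only
    cost = float(sum(captured_input))
    for flag in config:
        cost *= SPEEDUP[flag]
    return cost


def random_search(captured_input, trials=50, seed=0):
    """Sample random flag sets offline and keep the fastest non-crashing one."""
    rng = random.Random(seed)
    best_config = frozenset()  # baseline: no optimization flags
    best_cost = replay(best_config, captured_input)
    for _ in range(trials):
        config = frozenset(f for f in FLAGS if rng.random() < 0.5)
        cost = replay(config, captured_input)
        if cost is None:
            continue  # crashes happen offline, so no user ever sees them
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

Because every candidate is evaluated on the same captured input, the costs are directly comparable, which is the property the abstract asks of an ideal solution.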

    Developing Trustworthy Hardware with Security-Driven Design and Verification

    Over the past several decades, computing hardware has evolved to become smaller, yet more performant and energy-efficient. Unfortunately, these advancements have come at a cost of increased complexity, both physically and functionally. Physically, the nanometer-scale transistors used to construct Integrated Circuits (ICs) have become astronomically expensive to fabricate. Functionally, ICs have become increasingly dense and feature rich to optimize application-specific tasks. To cope with these trends, IC designers outsource both fabrication and portions of Register-Transfer Level (RTL) design. Outsourcing, combined with the increased complexity of modern ICs, presents a security risk: we must trust our ICs have been designed and fabricated to specification, i.e., they do not contain any hardware Trojans. Working in a bottom-up fashion, I initially study the threat of outsourcing fabrication. While prior work demonstrates fabrication-time attacks (modifications) on IC layouts, it is unclear what makes a layout vulnerable to attack. To answer this, in my IC Attack Surface (ICAS) work, I develop a framework that quantifies the security of IC layouts. Using ICAS, I show that modern ICs leave a plethora of both placement and routing resources available for attackers to exploit. Next, to plug these gaps, I construct the first routing-centric defense (T-TER) against fabrication-time Trojans. T-TER wraps security-critical interconnects in IC layouts with tamper-evident guard wires to prevent foundry-side attackers from modifying a design. After hardening layouts against fabrication-time attacks, outsourced designs become the most critical threat. To address this, I develop a dynamic verification technique (Bomberman) to vet untrusted third-party RTL hardware for Ticking Timebomb Trojans (TTTs). By targeting a specific type of Trojan behavior, Bomberman does not suffer from false negatives (missed TTTs), and therefore systematically reduces the overall design-time attack surface. Lastly, to generalize the Bomberman approach to automatically discover other behaviorally-defined classes of malicious logic, I adapt coverage-guided software fuzzers to the RTL verification domain. Leveraging software fuzzers for RTL verification enables IC design engineers to optimize test coverage of third-party designs without intimate implementation knowledge. Overall, this dissertation aims to make security a first-class design objective, alongside power, performance, and area, throughout the hardware development process. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169761/1/trippel_1.pd

    Extending Memory Capacity in Consumer Devices with Emerging Non-Volatile Memory: An Experimental Study

    The number and diversity of consumer devices are growing rapidly, alongside their target applications' memory consumption. Unfortunately, DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity of consumer devices by augmenting or replacing DRAM. Since entirely replacing DRAM with NVM in consumer devices imposes large system integration and design challenges, recent works propose extending the total main memory space available to applications by using NVM as swap space for DRAM. However, no prior work analyzes the implications of enabling a real NVM-based swap space in real consumer devices. In this work, we provide the first analysis of the impact of extending the main memory space of consumer devices using off-the-shelf NVMs. We extensively examine system performance and energy consumption when the NVM device is used as swap space for DRAM main memory to effectively extend the main memory capacity. For our analyses, we equip real web-based Chromebook computers with the Intel Optane SSD, which is a state-of-the-art low-latency NVM-based SSD device. We compare the performance and energy consumption of interactive workloads running on our Chromebook with NVM-based swap space, where the Intel Optane SSD capacity is used as swap space to extend main memory capacity, against two state-of-the-art systems: (i) a baseline system with double the amount of DRAM of the system with the NVM-based swap space; and (ii) a system where the Intel Optane SSD is naively replaced with a state-of-the-art (yet slower) off-the-shelf NAND-flash-based SSD, which we use as a swap space of the same size as the NVM-based swap space.

    Virtualization of Micro-architectural Components Using Software Solutions

    Cloud computing has become a dominant computing paradigm in the information technology industry due to its flexibility and efficiency in resource sharing and management. The key technology that enables cloud computing is virtualization. Essential requirements in a virtualized system where several virtual machines (VMs) run on the same physical machine include performance isolation and predictability. To enforce these properties, the virtualization software (called the hypervisor) must find a way to divide physical resources (e.g., physical memory, processor time) of the system and allocate them to VMs with respect to the amount of virtual resources defined for each VM. However, modern hardware has a complex architecture, and some microarchitectural-level resources such as processor caches, memory controllers, and interconnects cannot be divided and allocated to VMs. They are globally shared among all VMs, which compete for their use, leading to contention. Therefore, performance isolation and predictability are compromised. In this thesis, we propose software solutions for preventing unpredictability in performance due to micro-architectural components. The first contribution is called Kyoto, a solution to the cache contention issue, inspired by the polluter-pays principle. A VM is said to pollute the cache if it provokes significant cache replacements which impact the performance of other VMs. Using the Kyoto system, the provider can encourage cloud users to book pollution permits for their VMs. The second contribution addresses the problem of efficiently virtualizing NUMA machines. The major challenge comes from the fact that the hypervisor regularly reconfigures the placement of a VM over the NUMA topology. However, neither guest operating systems (OSs) nor system runtime libraries (e.g., HotSpot) are designed to consider NUMA topology changes at runtime, leading to unpredictable performance for end-user applications. We present eXtended Para-Virtualization (XPV), a new principle to efficiently virtualize a NUMA architecture. XPV revisits the interface between the hypervisor and the guest OS, and between the guest OS and system runtime libraries, so that they can dynamically take NUMA topology changes into account.

    A New System Architecture for Heterogeneous Compute Units

    The ongoing trend toward more heterogeneous systems forces us to rethink the design of systems. In this work, I study a new system design that considers heterogeneous compute units (general-purpose cores with different instruction sets, DSPs, FPGAs, fixed-function accelerators, etc.) from the beginning instead of as an afterthought. The goal is to treat all compute units (CUs) as first-class citizens, enabling (1) isolation and secure communication between all types of CUs, (2) a direct interaction of all CUs, removing the conventional CPU from the critical path, and (3) access to operating system (OS) services such as file systems and network stacks for all CUs. To study this system design, I am using a hardware/software co-design based on two key ideas: (1) introduce a new hardware component next to each CU, used by the OS as the CUs' common interface, and (2) let the OS kernel control applications remotely from a different CU. The hardware component is called the data transfer unit (DTU) and offers the minimal set of features needed to reach the stated goals: secure message passing and memory access. The OS is called M³; it runs its kernel on a dedicated CU and runs the OS services and applications on the remaining CUs. The kernel is responsible for establishing DTU-based communication channels between services and applications. After a channel has been set up, services and applications communicate directly without involving the kernel. This approach supports arbitrary CUs as first-class citizens, ranging from fixed-function accelerators to complex general-purpose cores.