
    Efficient caching algorithms for memory management in computer systems

    As disk performance continues to lag behind that of memory systems and processors, fully utilizing memory to reduce disk accesses is a highly effective way to improve whole-system performance. Furthermore, to serve the applications running on a computer in a distributed system, not only the local memory but also the memory on remote servers must be managed effectively to minimize I/O operations. The critical challenges in effective memory cache management include: (1) insightfully understanding and quantifying the locality inherent in memory access requests; (2) effectively utilizing the locality information in replacement algorithms; (3) intelligently placing and replacing data in the multi-level caches of a distributed system; and (4) ensuring that the overheads of the proposed schemes are acceptable.

    This dissertation provides solutions and makes unique and novel contributions in application locality quantification, general replacement algorithms, a low-cost replacement policy, thrashing protection, and multi-level cache management in a distributed system. First, the dissertation proposes a new method to quantify locality strength and to accurately identify data with strong locality. It also provides a new replacement algorithm that significantly outperforms existing algorithms. Second, considering the extremely low cost required of replacement policies in virtual memory management, the dissertation proposes a policy that meets this requirement while considerably exceeding the performance of existing policies. Third, the dissertation provides an effective scheme to protect the system from thrashing when running memory-intensive applications. Finally, the dissertation provides a multi-level block placement and replacement protocol for a distributed client-server environment that exploits the non-uniform locality strengths in I/O access requests.

    The methodology used in this study includes careful application behavior characterization, system requirement analysis, algorithm design, trace-driven simulation, and system implementation. A main conclusion of the work is that there is still much room for innovation and significant performance improvement in the seemingly mature and stable policies that have been broadly used in current operating system design.
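
    The locality-quantification step can be illustrated with reuse distance: the number of distinct blocks referenced between two consecutive accesses to the same block. The sketch below is a minimal, hypothetical illustration of this general idea (the function names and threshold are invented); it is not the dissertation's actual algorithm.

```python
# Minimal sketch: quantify the locality strength of each block in an
# access trace by its reuse distance, i.e. the number of distinct
# blocks touched between consecutive accesses to the same block.
# Blocks with consistently small reuse distances have strong locality.
# Names and threshold are hypothetical; not the dissertation's algorithm.

def reuse_distances(trace):
    """Map each block to the list of its observed reuse distances."""
    last_pos, distances = {}, {}
    for i, block in enumerate(trace):
        if block in last_pos:
            # Distinct blocks seen since this block's previous access.
            window = set(trace[last_pos[block] + 1 : i])
            distances.setdefault(block, []).append(len(window))
        last_pos[block] = i
    return distances

def strong_locality_blocks(trace, threshold=4):
    """Blocks whose average reuse distance stays below the threshold."""
    return {b for b, ds in reuse_distances(trace).items()
            if sum(ds) / len(ds) < threshold}

trace = ["a", "b", "a", "c", "a", "d", "e", "f", "g", "b"]
print(strong_locality_blocks(trace))   # {'a'}: short reuse distances
```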

    Centaur: Host-Side SSD Caching for Storage Performance Control


    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    This dissertation presents three design solutions that address several key system-on-chip (SoC) issues to achieve low power and high performance: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) an efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) a data processing architecture for low-power SoCs in distributed sensor systems (DSS) and its implementation. The first part includes a low-power embedded low-density parity-check (LDPC)-H.264 joint decoding architecture that lowers the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also presented. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) designs and low-latency on-chip network designs for MPSoCs are proposed. This part contains WaveSync, a low-latency-focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs, and a simultaneous dual-path routing (SDPR) scheme utilizing the path diversity present in typical mesh-topology networks-on-chip. SDPR is akin to having a higher link width, but without the significant hardware overhead associated with simple bus-width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above the petabit-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200 mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs; it consumes 12.35 mW in 0.167 mm² of area at 333 MHz.
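
    The path diversity that a dual-path scheme such as SDPR exploits can be illustrated with the two canonical dimension-ordered routes in a 2-D mesh: XY routing (horizontal first) and YX routing (vertical first), which are link-disjoint whenever source and destination differ in both coordinates. The sketch below is an illustrative assumption about that general idea, not the SDPR hardware design.

```python
# Sketch: path diversity in a 2-D mesh NoC. XY routing (X dimension
# first) and YX routing (Y first) give two link-disjoint paths between
# any two nodes that differ in both coordinates; this is the kind of
# diversity a dual-path scheme can split traffic across. Illustrative
# assumption only, not the SDPR hardware design.

def xy_route(src, dst):
    """Dimension-ordered route: resolve the X offset first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """Y-first route, obtained by routing in the mirrored mesh."""
    mirrored = xy_route((src[1], src[0]), (dst[1], dst[0]))
    return [(x, y) for (y, x) in mirrored]

print(xy_route((0, 0), (2, 2)))  # [(0,0), (1,0), (2,0), (2,1), (2,2)]
print(yx_route((0, 0), (2, 2)))  # [(0,0), (0,1), (0,2), (1,2), (2,2)]
```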

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the effect of each building block separately. Reported optimizations range from memory savings of up to 10,000x to energy reductions of up to 33x, giving chip designers an overview of the design choices available for implementing efficient low-power neural network accelerators.
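
    As a concrete example of one such building block, weight quantization trades numeric precision for memory: storing 8-bit integers instead of 32-bit floats cuts weight storage by 4x before any further compression. A minimal sketch with invented values follows; the surveyed accelerators use many different and more sophisticated schemes.

```python
import numpy as np

# Minimal sketch of one optimization building block: symmetric linear
# quantization of float32 weights to int8, a 4x memory reduction before
# any further compression. Hypothetical values; the surveyed works use
# many different (and more sophisticated) schemes.

def quantize_int8(weights):
    """Map float32 weights to int8 with a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes / q.nbytes)                  # 4.0x memory saving
print(np.abs(w - dequantize(q, s)).max())   # worst-case rounding error
```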

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    IF-MANET: Interoperable framework for heterogeneous mobile ad hoc networks

    The advances in low-power microprocessors, wireless networks, and embedded systems have raised the need to utilize the significant resources of mobile devices. These devices, for example smartphones, tablets, laptops, wearables, and sensors, are gaining enormous processing power, storage capacity, and wireless bandwidth. In addition, the advancement in wireless mobile technology has created a new communication paradigm in which a wireless network can be created without any a priori infrastructure, called a mobile ad hoc network (MANET). While progress is being made towards improving the efficiency of mobile devices and the reliability of wireless mobile networks, mobile technology continuously faces the challenges of unpredictable disconnections, dynamic mobility, and the heterogeneity of routing protocols. Hence, traditional wired and wireless routing protocols are not suitable for MANETs due to their unique dynamic ad hoc nature. For this reason, the research community has developed, and continues to develop, routing protocols that cope with the challenges of MANETs. However, no single generic ad hoc routing protocol available so far can address all of the basic MANET challenges mentioned above. This diverse and ever-growing range of routing protocols has thus created barriers that prevent mobile nodes of different MANET taxonomies from intercommunicating, wasting a huge amount of valuable resources. To provide interaction between heterogeneous MANETs, the routing protocols require conversion of packets, meta-models, and their behavioural capabilities. Here, the fundamental challenge is to understand the packet-level message format, meta-model, and behaviour of different routing protocols, which differ significantly across MANET taxonomies. To overcome these issues, this thesis proposes an Interoperable Framework for heterogeneous MANETs called IF-MANET. The framework hides the complexities of heterogeneous routing protocols and provides a homogeneous layer for seamless communication between them. The framework creates a unique Ontology for MANET routing protocols and a Message Translator that semantically compares packets and generates missing fields using the rules defined in the Ontology. Hence, translation between existing as well as newly arriving routing protocols is achieved dynamically and on the fly. To discover a route for the delivery of packets across heterogeneous MANET taxonomies, IF-MANET creates a special Gateway node that provides cluster-based inter-domain routing. The IF-MANET framework can be used to develop different middleware applications: for example, mobile grid computing that could potentially utilise huge amounts of aggregated data collected from heterogeneous mobile devices, or disaster and crisis management applications that provide on-the-fly, infrastructure-less emergency communication across organisations by utilising different MANET taxonomies.
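
    The role of a message translator can be sketched as rule-driven field mapping: a packet expressed in one protocol's format is re-expressed in another's, with missing fields generated from rules. The toy sketch below assumes simplified, AODV-like field names and rules; it is not the IF-MANET Ontology or Message Translator.

```python
# Toy sketch of rule-driven packet translation between heterogeneous
# MANET routing protocols, in the spirit of a message translator:
# map fields that exist in both formats, and synthesize missing ones
# from rules. The field layouts and rules here are simplified
# assumptions, not the actual IF-MANET ontology.

FIELD_MAP = {            # source field -> target field
    "orig_addr": "src",
    "dest_addr": "dst",
    "hop_count": "hops",
}

MISSING_FIELD_RULES = {  # target field -> rule for a missing value
    "ttl": lambda pkt: 64 - pkt.get("hop_count", 0),
    "seq": lambda pkt: 0,        # unknown: start a fresh sequence
}

def translate(packet):
    """Translate one packet dict into the target protocol's format."""
    out = {dst: packet[src] for src, dst in FIELD_MAP.items()
           if src in packet}
    for field, rule in MISSING_FIELD_RULES.items():
        if field not in out:
            out[field] = rule(packet)   # generate the missing field
    return out

aodv_like = {"orig_addr": "10.0.0.1", "dest_addr": "10.0.0.7",
             "hop_count": 3}
print(translate(aodv_like))
# {'src': '10.0.0.1', 'dst': '10.0.0.7', 'hops': 3, 'ttl': 61, 'seq': 0}
```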

    On I/O Performance and Cost Efficiency of Cloud Storage: A Client's Perspective

    Cloud storage has gained increasing popularity in the past few years. In cloud storage, data are stored in the service provider's data centers; users access data via the network and pay fees based on their service usage. For such a new storage model, our prior wisdom and optimization schemes for conventional storage may not remain valid or applicable to the emerging cloud storage. In this dissertation, we focus on understanding and optimizing the I/O performance and cost efficiency of cloud storage from a client's perspective. We first conduct a comprehensive study to gain insight into the I/O performance behaviors of cloud storage from the client side. Through extensive experiments, we have obtained several critical findings and useful implications for system optimization. We then design a client cache framework, called Pacaca, to further improve the end-to-end performance of cloud storage. Pacaca seamlessly integrates parallelized prefetching and cost-aware caching by utilizing the parallelism potential and object correlations of cloud storage. In addition to improving system performance, we have also made efforts to reduce the monetary cost of using cloud storage services by proposing a latency- and cost-aware client caching scheme, called GDS-LC, which achieves two optimization goals: low access latency and low monetary cost. Our experimental results show that our proposed client-side solutions significantly outperform traditional methods. Our study contributes to inspiring the community to reconsider system optimization methods in the cloud environment, especially for the purpose of integrating cloud storage into the current storage stack as a primary storage layer.
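
    GDS-LC builds on the GreedyDual-Size family of caching policies, in which each object is (re)inserted with priority L + cost/size, where L is an aging value raised on every eviction. The sketch below illustrates that general family; folding latency and monetary cost into a single cost term is an assumption made here for illustration, not the published GDS-LC formula.

```python
import heapq

# Minimal sketch of a GreedyDual-Size style policy, the family that
# GDS-LC builds on: each object is (re)inserted with priority
# L + cost/size, and L is raised to the priority of each evicted
# object so that recently touched objects outrank stale ones.
# Combining latency and monetary cost into one "cost" term is an
# assumption here, not the published GDS-LC formula.

class GreedyDualCache:
    def __init__(self, capacity):
        self.capacity, self.used, self.L = capacity, 0, 0.0
        self.entries = {}        # key -> (priority, size)
        self.heap = []           # (priority, key), with lazy deletion

    def _evict_one(self):
        while self.heap:
            prio, key = heapq.heappop(self.heap)
            if self.entries.get(key, (None, 0))[0] == prio:  # not stale
                self.L = prio                # age the remaining objects
                self.used -= self.entries.pop(key)[1]
                return

    def access(self, key, size, latency, money):
        """Reference an object; evict lowest-priority objects to fit."""
        if key in self.entries:              # hit: drop the old entry
            self.used -= self.entries.pop(key)[1]
        while self.used + size > self.capacity and self.entries:
            self._evict_one()
        prio = self.L + (latency + money) / size
        self.entries[key] = (prio, size)
        self.used += size
        heapq.heappush(self.heap, (prio, key))

cache = GreedyDualCache(capacity=100)
cache.access("a", size=40, latency=0.120, money=0.0004)  # remote object
cache.access("b", size=40, latency=0.003, money=0.0)     # nearby object
```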

    Improving Caches in Consolidated Environments

    Memory (cache, DRAM, and disk) is in charge of providing data and instructions to a computer's processor. To maximize performance, the speeds of the memory and the processor should be equal; however, memory that always matches the speed of the processor is prohibitively expensive. Computer hardware designers have managed to drastically lower the cost of the system through memory caches, sacrificing some performance. A cache is a small piece of fast memory that stores popular data so it can be accessed faster. Modern computers have evolved into a hierarchy of caches, where each memory level is the cache for a larger and slower memory level immediately below it. Thus, by using caches, manufacturers are able to store terabytes of data at the cost of the cheapest memory while achieving speeds close to that of the fastest level.

    The most important decision in managing a cache is what data to store in it. Failing to make good decisions can lead to performance overheads and over-provisioning. Surprisingly, caches choose data to store based on policies that have not changed in principle for decades. However, computing paradigms have changed radically, leading to two noticeably different trends. First, caches are now consolidated across hundreds or even thousands of processes. Second, caching is being employed at new levels of the storage hierarchy due to the availability of high-performance flash-based persistent media. This brings four problems. First, as the number of workloads sharing a cache increases, it becomes more likely that they contain duplicated data. Second, consolidation creates contention for caches, and if not managed carefully, this translates into wasted space and sub-optimal performance. Third, as contended caches are shared by more workloads, administrators need to carefully estimate specific per-workload requirements across the entire memory hierarchy in order to meet per-workload performance goals. And finally, current cache write policies are unable to simultaneously provide performance and consistency guarantees for the new levels of the storage hierarchy.

    We addressed these problems by modeling their impact and by proposing solutions for each of them. First, we measured and modeled the amount of duplication at the buffer cache level and the contention in real production systems. Second, we created a unified model of workload cache usage under contention, to be used by administrators for provisioning or by process schedulers to decide which processes to run together. Third, we proposed methods to remove cache duplication and to eliminate the space wasted due to contention. And finally, we proposed a technique that improves the consistency guarantees of write-back caches while preserving their performance benefits.
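
    The duplication problem can be illustrated with a content-addressed buffer cache: pages with identical contents, e.g. the same file cached by two consolidated workloads, are detected by a content hash and stored once. The sketch below is illustrative only and does not reproduce the dissertation's mechanisms; the class and method names are invented.

```python
import hashlib

# Sketch of content-based deduplication at the buffer-cache level:
# identical page contents are detected via a content digest and the
# payload is stored once, while each (workload, block) pair keeps its
# own index entry. Illustrative only; not the dissertation's design.

class DedupCache:
    def __init__(self):
        self.by_hash = {}     # content digest -> page data
        self.index = {}       # (workload, block) -> content digest

    def put(self, workload, block, data):
        digest = hashlib.sha256(data).hexdigest()
        self.by_hash.setdefault(digest, data)   # store payload once
        self.index[(workload, block)] = digest

    def get(self, workload, block):
        return self.by_hash[self.index[(workload, block)]]

    def unique_bytes(self):
        return sum(len(d) for d in self.by_hash.values())

cache = DedupCache()
cache.put("vm1", 7, b"x" * 4096)
cache.put("vm2", 3, b"x" * 4096)     # duplicate content, stored once
print(cache.unique_bytes())          # 4096, not 8192
```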

    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Institute for Computing Systems Architecture

    Despite the progress made in developing more advanced compilers for embedded systems, programming embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more so for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the programming style employed. Parallelisation is hampered by the complex multiple-address-space memory architecture found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well known in the field of scientific computing. Iterative, feedback-driven search for "good" transformation sequences is investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other. The proposed methodology is evaluated against two benchmark suites (DSPstone and UTDSP) and four different high-performance DSPs, one of which is part of a commercial four-processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across the four DSP architectures. The parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.
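
    One example of the kind of high-level transformation that idiom recovery enables is loop tiling (blocking) for data locality. The sketch below shows tiling on a matrix multiply written in plain Python purely for illustration; the thesis applies such transformations to low-level C DSP codes, and the tile size here is hypothetical.

```python
# Sketch of one high-level locality transformation: loop tiling
# (blocking). Each T x T tile's working set fits in a fast, local
# memory level, reducing traffic to slower levels. Illustrative only;
# the thesis applies such transformations to low-level C DSP codes.

def matmul_tiled(A, B, n, T=32):
    """C = A @ B on n x n lists-of-lists, processed in T x T tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for kk in range(0, n, T):
            for jj in range(0, n, T):
                # Work on one tile at a time to keep reuse distances short.
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1.0 if i == j else 0.0 for j in range(64)] for i in range(64)]
print(matmul_tiled(A, A, 64, T=16)[0][0])   # 1.0: identity squared
```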