93 research outputs found

    ENERGY-AWARE OPTIMIZATION FOR EMBEDDED SYSTEMS WITH CHIP MULTIPROCESSOR AND PHASE-CHANGE MEMORY

    Get PDF
    Over the last two decades, functions of the embedded systems have evolved from simple real-time control and monitoring to more complicated services. Embedded systems equipped with powerful chips can provide the performance that computationally demanding information processing applications need. However, due to the power issue, the easy way to gain increasing performance by scaling up chip frequencies is no longer feasible. Recently, low-power architecture designs have been the main trend in embedded system designs. In this dissertation, we present our approaches to attack the energy-related issues in embedded system designs, such as thermal issues in the 3D chip multiprocessor (CMP), the endurance issue in the phase-change memory(PCM), the battery issue in the embedded system designs, the impact of inaccurate information in embedded system, and the cloud computing to move the workload to remote cloud computing facilities. We propose a real-time constrained task scheduling method to reduce peak temperature on a 3D CMP, including an online 3D CMP temperature prediction model and a set of algorithm for scheduling tasks to different cores in order to minimize the peak temperature on chip. To address the challenging issues in applying PCM in embedded systems, we propose a PCM main memory optimization mechanism through the utilization of the scratch pad memory (SPM). Furthermore, we propose an MLC/SLC configuration optimization algorithm to enhance the efficiency of the hybrid DRAM + PCM memory. We also propose an energy-aware task scheduling algorithm for parallel computing in mobile systems powered by batteries. When scheduling tasks in embedded systems, we make the scheduling decisions based on information, such as estimated execution time of tasks. Therefore, we design an evaluation method for impacts of inaccurate information on the resource allocation in embedded systems. Finally, in order to move workload from embedded systems to remote cloud computing facility, we present a resource optimization mechanism in heterogeneous federated multi-cloud systems. And we also propose two online dynamic algorithms for resource allocation and task scheduling. We consider the resource contention in the task scheduling

    GPU Array Access Auto-Tuning

    Get PDF
    GPUs have been used for years in compute intensive applications. Their massive parallel processing capabilities can speedup calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece to achieve high performance. Nearly as important as suitable algorithms is the actual implementation and the usage of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time consuming task that requires deep understanding of the algorithms and the underlying hardware. Unlike CPUs, the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore it does not suffice to optimize the code once during the development, but it has to be optimized for each new GPU generation that is released. To efficiently (re-)optimize code towards the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA -- the leading manufacturer for GPUs today -- applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive objective for an auto-tuner. In this thesis we introduce the MATOG auto-tuner that automatically optimizes array access for NVIDIA CUDA applications. In order to achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This allows to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime. To show MATOG's capabilities, we evaluated it on a variety of different applications, ranging from simple algorithms up to complex applications on the last four hardware generations, with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution

    Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002

    Get PDF
    The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve to level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design

    Parallel simulation techniques for telecommunication network modelling

    Get PDF
    In this thesis, we consider the application of parallel simulation to the performance modelling of telecommunication networks. A largely automated approach was first explored using a parallelizing compiler to speed up the simulation of simple models of circuit-switched networks. This yielded reasonable results for relatively little effort compared with other approaches. However, more complex simulation models of packet- and cell-based telecommunication networks, requiring the use of discrete event techniques, need an alternative approach. A critical review of parallel discrete event simulation indicated that a distributed model components approach using conservative or optimistic synchronization would be worth exploring. Experiments were therefore conducted using simulation models of queuing networks and Asynchronous Transfer Mode (ATM) networks to explore the potential speed-up possible using this approach. Specifically, it is shown that these techniques can be used successfully to speed-up the execution of useful telecommunication network simulations. A detailed investigation has demonstrated that conservative synchronization performs very well for applications with good look ahead properties and sufficient message traffic density and, given such properties, will significantly outperform optimistic synchronization. Optimistic synchronization, however, gives reasonable speed-up for models with a wider range of such properties and can be optimized for speed-up and memory usage at run time. Thus, it is confirmed as being more generally applicable particularly as model development is somewhat easier than for conservative synchronization. This has to be balanced against the more difficult task of developing and debugging an optimistic synchronization kernel and the application models

    Qualität und Nutzen - Über den Gebrauch von Zeit-Wert-Funktionen zur Integration qualitäts- und zeit-flexibler Aspekte in einer dynamischen Echtzeit-Einplanungsumgebung

    Get PDF
    Scheduling methodologies for real-time applications have been of keen interest to diverse research communities for several decades. Depending on the application area, algorithms have been developed that are tailored to specific requirements with respect to both the individual components of which an application is made up and the computational platform on which it is to be executed. Many real-time scheduling algorithms base their decisions solely or partly on timing constraints expressed by deadlines which must be met even under worst-case conditions. The increasing complexity of computing hardware means that worst-case execution time analysis becomes increasingly pessimistic. Scheduling hard real-time computations according to their worst-case execution times (which is common practice) will thus result, on average, in an increasing amount of spare capacity. The main goal of flexible real-time scheduling is to exploit this otherwise wasted capacity. Flexible scheduling schemes have been proposed to increase the ability of a real-time system to adapt to changing requirements and nondeterminism in the application behaviour. These models can be categorised as those whose source of flexibility is the quality of computations and those which are flexible regarding their timing constraints. This work describes a novel model which allows to specify both flexible timing constraints and quality profiles for an application. Furthermore, it demonstrates the applicability of this specification method to real-world examples and suggests a set of feasible scheduling algorithms for the proposed problem class.Einplanungsverfahren für Echtzeitanwendungen stehen seit Jahrzehnten im Interesse verschiedener Forschungsgruppen. Abhängig vom Anwendungsgebiet wurden Algorithmen entwickelt, welche an die spezifischen Anforderungen sowohl hinsichtlich der einzelnen Komponenten, aus welchen eine Anwendung besteht, als auch an die Rechnerplattform, auf der diese ausgeführt werden sollen, angepasst sind. Viele Echtzeit-Einplanungsverfahren gründen ihre Entscheidungen ausschließlich oder teilweise auf Zeitbedingungen, welche auch bei Auftreten maximaler Ausführungszeiten eingehalten werden müssen. Die zunehmende Komplexität von Rechner-Hardware bedeutet, dass die Worst-Case-Analyse in steigendem Maße pessimistisch wird. Die Einplanung harter Echtzeit-Berechnungen anhand ihrer maximalen Ausführungszeiten (was die gängige Praxis darstellt) resultiert daher im Regelfall in einer frei verfügbaren Rechenkapazität in steigender Höhe. Das Hauptziel flexibler Echtzeit-Einplanungsverfahren ist es, diese ansonsten verschwendete Kapazität auszunutzen. Flexible Einplanungsverfahren wurden vorgeschlagen, welche die Fähigkeit eines Echtzeitsystems erhöhen, sich an veränderte Anforderungen und Nichtdeterminismus im Verhalten der Anwendung anzupassen. Diese Modelle können unterteilt werden in solche, deren Quelle der Flexibilität die Qualität der Berechnungen ist, und jene, welche flexibel hinsichtlich ihrer Zeitbedingungen sind. Diese Arbeit beschreibt ein neuartiges Modell, welches es erlaubt, sowohl flexible Zeitbedingungen als auch Qualitätsprofile für eine Anwendung anzugeben. Außerdem demonstriert sie die Anwendbarkeit dieser Spezifikationsmethode auf reale Beispiele und schlägt eine Reihe von Einplanungsalgorithmen für die vorgestellte Problemklasse vor

    Studies of inspection algorithms and associated microprogrammable hardware implementations

    Get PDF
    This work is concerned with the design and development of real-time algorithms for industrial inspection applications. Rather than implement algorithms in dedicated hardware, microprogrammable machines were considered essential in order to maintain flexibility. After a survey of image pattern recognition where algorithms applicable to real-time use are cited, this thesis presents industrial inspection algorithms that locate and scrutinise actual manufactured products. These are fast and robust - a necessary requirement in industrial environments. The National Physical Laboratory have developed a Linear Array Processor (LAP) specifically designed for industrial recognition work. As with most array processors, the LAP has a greater performance than conventional processors, yet is strictly limited to parallel algorithms for optimum performance. It was therefore necessary to incorporate sequentialism into the design of a multiprocessor system. A microcoded bit-slice Sequential Image Processor (SIP) has been designed and built at RHBNC in conjunction with the NPL. This was primarily intended as a post-processor for the LAP based on the VMEbus but in fact has proved its usefulness as a stand-alone processor. This is described along with an assembler written for SIP which translates assembly language mnemonics to microcode. This work, which includes a review of current architectures, leads to the specification of a hybrid (SIMD/NIMD) architecture consisting of multiple autonomous sequential processors. This involves an analysis of various configurations and entails an investigation of the source of bottlenecks within each design. Such systems require a significant amount of interprocessor communication: methods for achieving this are discussed, some of which have only become practical with the decrease incost of electronic components. This eventually leads to a system for which algorithm execution speed increases approximately linearly with the number of processors. The algorithms described in earlier chapters are examined on the system and the practicalities of such a design are analysed in detail. Overall, this thesis has arrived at designs of programmable real-time inspection systems, and has obtained guidelines which will help with the implementation of future inspection systems.<p

    Analysis, Modeling, and Algorithms for Scalable Web Crawling

    Get PDF
    This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well-known, yet without accurate models of produced data volume. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable and high-performance external sorting methods are used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, spam avoidance, all being monumental tasks, especially when limited to the resources of a single-machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem – sorting arbitrarily large-scale data using limited resources and accurately capturing the required cost (e.g., time and disk usage). To solve these problems, we present an accurate model of uniqueness probability the probability to encounter previous unseen data and use that to analyze the amount of intermediate data generated the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary based on the input properties (e.g., frequency distribution), hardware configuration (e.g., main memory size, CPU and disk speed) and the choice of sorting method, and that our proposed models accurately capture such variation. Furthermore, we propose a novel hash-based method for replacement selection sort and its model in case of duplicate data, where existing literature is limited to random or mostly-unique data. Note that the classic replacement selection method has the ability to increase the length of sorted runs and reduce their number, both directly benefiting the merge step of external sorting and . But because of a priority queue-assisted sort operation that is inherently slow, the application of replacement selection was limited. Our hash-based design solves this problem by making the sort phase significantly faster compared to existing methods, making this method a preferred choice. The presented models also enable exact analysis of Least-Recently-Used (LRU) and Random Replacement caches (i.e., their hit rate) that are used as part of the algorithms presented here. These cache models are more accurate than the ones in existing literature, since the existing ones mostly assume infinite stream of data, while our models work accurately on finite streams (e.g., sampled web graphs, click stream) as well. In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of crawl experience based on the graph properties (e.g., degree distribution). All these models are presented under a unified umbrella to analyze a set of large-scale information processing algorithms that are streamlined for high performance and scalability

    Programming Languages and Systems

    Get PDF
    This open access book constitutes the proceedings of the 30th European Symposium on Programming, ESOP 2021, which was held during March 27 until April 1, 2021, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021. The conference was planned to take place in Luxembourg and changed to an online format due to the COVID-19 pandemic. The 24 papers included in this volume were carefully reviewed and selected from 79 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems

    Research into software executives for space operations support

    Get PDF
    Research concepts pertaining to a software (workstation) executive which will support a distributed processing command and control system characterized by high-performance graphics workstations used as computing nodes are presented. Although a workstation-based distributed processing environment offers many advantages, it also introduces a number of new concerns. In order to solve these problems, allow the environment to function as an integrated system, and present a functional development environment to application programmers, it is necessary to develop an additional layer of software. This 'executive' software integrates the system, provides real-time capabilities, and provides the tools necessary to support the application requirements
    • …
    corecore