24 research outputs found

    Compiler-driven data layout transformations for network applications

    Get PDF
    This work approaches the little studied topic of compiler optimisations directed to network applications. It starts by investigating if there exist any fundamental differences between application domains that justify the development and tuning of domain-specific compiler optimisations. It shows an automated approach that is capable of identifying domain-specific workload characterisations and presenting them in a readily interpretable format based on decision trees. The generated workload profiles summarise key resource utilisation issues and enable compiler engineers to address the highlighted bottlenecks. By applying this methodology to data intensive network infrastructure application it shows that data organisation is the key obstacle to overcome in order to achieve high performance. It therefore proposes and evaluates three specialised data transformations (structure splitting, array regrouping, and software caching) against the industrial EEMBC networking benchmarks and real-world data sets. It also demonstrates on one hand that speedups of up to 2.62 can be achieved, but on the other that no single solution performs equally well across different network traffic scenarios. Hence, to address this issue, an adaptive software caching scheme for high frequency route lookup operations is introduced and its effectiveness evaluated one more time against EEMBC networking benchmarks and real-world data sets achieving speedups of up to 3.30 and 2.27. The results clearly demonstrate that adaptive data organisation schemes are necessary to ensure optimal performance under varying network loads. Finally this research addresses another issue introduced by data transformations such as array regrouping and software caching, i.e. the need for static analysis to allow efficient resource allocation. This thesis proposes a static code analyser that allows the automatic resource analysis of source code containing lists and tree structures. The tool applies a combination of amortised analysis and separation logic methodology to real code and is able to evaluate type and resource usage of existing data structures, which can be used to compute global resource consumption values for full data intensive network applications

    A Structured Design Methodology for High Performance VLSI Arrays

    Get PDF
    abstract: The geometric growth in the integrated circuit technology due to transistor scaling also with system-on-chip design strategy, the complexity of the integrated circuit has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over the time to cope with this but the high manual labor in custom and statistic design in ASIC are still causes of concern. This work proposes a new circuit design strategy that focuses mostly on arrayed structures like TLB, RF, Cache, IPCAM etc. that reduces the manual effort to a great extent and also makes the design regular, repetitive still achieving high performance. The method proposes making the complete design custom schematic but using the standard cells. This requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once schematic is finalized, the designer places these standard cells in a spreadsheet, placing closely the cells in the critical paths. A Perl script then generates Cadence Encounter compatible placement file. The design is then routed in Encounter. Since designer is the best judge of the circuit architecture, placement by the designer will allow achieve most optimal design. Several designs like IPCAM, issue logic, TLB, RF and Cache designs were carried out and the performance were compared against the fully custom and ASIC flow. The TLB, RF and Cache were the part of the HEMES microprocessor.Dissertation/ThesisPh.D. Electrical Engineering 201

    An efficient design space exploration framework to optimize power-efficient heterogeneous many-core multi-threading embedded processor architectures

    Get PDF
    By the middle of this decade, uniprocessor architecture performance had hit a roadblock due to a combination of factors, such as excessive power dissipation due to high operating frequencies, growing memory access latencies, diminishing returns on deeper instruction pipelines, and a saturation of available instruction level parallelism in applications. An attractive and viable alternative embraced by all the processor vendors was multi-core architectures where throughput is improved by using micro-architectural features such as multiple processor cores, interconnects and low latency shared caches integrated on a single chip. The individual cores are often simpler than uniprocessor counterparts, use hardware multi-threading to exploit thread-level parallelism and latency hiding and typically achieve better performance-power figures. The overwhelming success of the multi-core microprocessors in both high performance and embedded computing platforms motivated chip architects to dramatically scale the multi-core processors to many-cores which will include hundreds of cores on-chip to further improve throughput. With such complex large scale architectures however, several key design issues need to be addressed. First, a wide range of micro- architectural parameters such as L1 caches, load/store queues, shared cache structures and interconnection topologies and non-linear interactions between them define a vast non-linear multi-variate micro-architectural design space of many-core processors; the traditional method of using extensive in-loop simulation to explore the design space is simply not practical. Second, to accurately evaluate the performance (measured in terms of cycles per instruction (CPI)) of a candidate design, the contention at the shared cache must be accounted in addition to cycle-by-cycle behavior of the large number of cores which superlinearly increases the number of simulation cycles per iteration of the design exploration. Third, single thread performance does not scale linearly with number of hardware threads per core and number of cores due to memory wall effect. This means that at every step of the design process designers must ensure that single thread performance is not unacceptably slowed down while increasing overall throughput. While all these factors affect design decisions in both high performance and embedded many-core processors, the design of embedded processors required for complex embedded applications such as networking, smart power grids, battlefield decision-making, consumer electronics and biomedical devices to name a few, is fundamentally different from its high performance counterpart because of the need to consider (i) low power and (ii) real-time operations. This implies the design objective for embedded many-core processors cannot be to simply maximize performance, but improve it in such a way that overall power dissipation is minimized and all real-time constraints are met. This necessitates additional power estimation models right at the design stage to accurately measure the cost and reliability of all the candidate designs during the exploration phase. In this dissertation, a statistical machine learning (SML) based design exploration framework is presented which employs an execution-driven cycle- accurate simulator to accurately measure power and performance of embedded many-core processors. The embedded many-core processor domain is Network Processors (NePs) used to processed network IP packets. Future generation NePs required to operate at terabits per second network speeds captures all the aspects of a complex embedded application consisting of shared data structures, large volume of compute-intensive and data-intensive real-time bound tasks and a high level of task (packet) level parallelism. Statistical machine learning (SML) is used to efficiently model performance and power of candidate designs in terms of wide ranges of micro-architectural parameters. The method inherently minimizes number of in-loop simulations in the exploration framework and also efficiently captures the non-linear interactions between the micro-architectural design parameters. To ensure scalability, the design space is partitioned into (i) core-level micro-architectural parameters to optimize single core architectures subject to the real-time constraints and (ii) shared memory level micro- architectural parameters to explore the shared interconnection network and shared cache memory architectures and achieves overall optimality. The cost function of our exploration algorithm is the total power dissipation which is minimized, subject to the constraints of real-time throughput (as determined from the terabit optical network router line-speed) required in IP packet processing embedded application

    Secure Virtualization of Latency-Constrained Systems

    Get PDF
    Virtualization is a mature technology in server and desktop environments where multiple systems are consolidate onto a single physical hardware platform, increasing the utilization of todays multi-core systems as well as saving resources such as energy, space and costs compared to multiple single systems. Looking at embedded environments reveals that many systems use multiple separate computing systems inside, including requirements for real-time and isolation properties. For example, modern high-comfort cars use up to a hundred embedded computing systems. Consolidating such diverse configurations promises to save resources such as energy and weight. In my work I propose a secure software architecture that allows consolidating multiple embedded software systems with timing constraints. The base of the architecture builds a microkernel-based operating system that supports a variety of different virtualization approaches through a generic interface, supporting hardware-assisted virtualization and paravirtualization as well as multiple architectures. Studying guest systems with latency constraints with regards to virtualization showed that standard techniques such as high-frequency time-slicing are not a viable approach. Generally, guest systems are a combination of best-effort and real-time work and thus form a mixed-criticality system. Further analysis showed that such systems need to export relevant internal scheduling information to the hypervisor to support multiple guests with latency constraints. I propose a mechanism to export those relevant events that is secure, flexible, has good performance and is easy to use. The thesis concludes with an evaluation covering the virtualization approach on the ARM and x86 architectures and two guest operating systems, Linux and FreeRTOS, as well as evaluating the export mechanism

    Operating system security for mobile devices

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 75-78).This thesis presents the design and implementation of a port of the Asbestos operating system to the ARM processor. The port to the ARM allows Asbestos to run on mobile devices such as cell phones and personal digital assistants. These mobile, wireless-enabled devices are at risk for data attacks because they store private data but often roam in public networks. The Asbestos operating system is designed to prevent disclosure of such data. The port includes a file system and a network driver, which together enable future development of Asbestos applications on the ARM platform. This thesis evaluates the port with a performance comparison between Asbestos running on an HP iPAQ hand held computer and the original x86 Asbestos.by Martijn Stevenson.M.Eng

    Extending the reach of microprocessors : column and curious caching

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 162-167).by Derek T. Chiou.Ph.D

    MMP

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (p. 129-135).Reliability and security are quickly becoming users' biggest concern due to the increasing reliance on computers in all areas of society. Hardware-enforced, fine-grained memory protection can increase the reliability and security of computer systems, but will be adopted only if the protection mechanism does not compromise performance, and if the hardware mechanism can be used easily by existing software. Mondriaan memory protection (MMP) provides fine-grained memory protection for a linear address space, while supporting an efficient hardware implementation. MMP's use of linear addressing makes it compatible with current software programming models and program binaries, and it is also backwards compatible with current operating systems and instruction sets. MMP can be implemented efficiently because it separates protection information from program data, allowing protection information to be compressed and cached efficiently. This organization is similar to paging hardware, where the translation information for a page of data bytes is compressed to a single translation value and cached in the TLB. MMP stores protection information in tables in protected system memory, just as paging hardware stores translation information in page tables. MMP is well suited to improve the robustness of modern software. Modern software development favors modules (or plugins) as a way to structure and provide extensibility for large systems, like operating systems, web servers and web clients. Protection between modules written in unsafe languages is currently provided only by programmer convention, reducing system stability.(cont.) Device drivers, which are implemented as loadable modules, are now the most frequent source of operating system crashes (e.g., 85% of Windows XP crashes in one study [SBL03]). MMP provides a mechanism to enforce module boundaries, increasing system robustness by isolating modules from each other and making all memory sharing explicit. We implement the MMP hardware in a simulator and modify a version of the Linux 2.4.19 operating system to use it. Linux loads its device drivers as kernel module extensions, and MMP enforces the module boundaries, only allowing the device drivers access to the memory they need to function. The memory isolation provided by MMP increases Linux's resistance to programmer error, and exposed two kernel bugs in common, heavily-tested drivers. Experiments with several benchmarks where MMP was used extensively indicate the space taken by the MMP data structures is less than 11% of the memory used by the kernel, and the kernel's runtime, according to a simple performance model, increases less than 12% (relative to an unmodified kernel).by Emmett Jethro Witchel.Ph.D

    CROSS-LAYER CUSTOMIZATION PLATFORM FOR LOW-POWER AND REAL-TIME EMBEDDED APPLICATIONS

    Get PDF
    Modern embedded applications have become increasingly complex and diverse in their functionalities and requirements. Data processing, communication and multimedia signal processing, real-time control and various other functionalities can often need to be implemented on the same System-on-Chip(SOC) platform. The significant power constraints and real-time guarantee requirements of these applications have become significant obstacles for the traditional embedded system design methodologies. The general-purpose computing microarchitectures of these platforms are designed to achieve good performance on average, which is far from optimal for any particular application. The system must always assume worst-case scenarios, which results in significant power inefficiencies and resource under-utilization. This dissertation introduces a cross-layer application-customizable embedded platform, which dynamically exploits application information and fine-tunes system components at system software and hardware layers. This is achieved with the close cooperation and seamless integration of the compiler, the operating system, and the hardware architecture. The compiler is responsible for extracting application regularities through static and profile-based analysis. The relevant application knowledge is propagated and utilized at run-time across the system layers through the judiciously introduced reconfigurability at both OS and hardware layers. The introduced framework comprehensively covers the fundamental subsystems of memory management and multi-tasking execution control

    CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

    Get PDF
    Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core based approaches in recent years. Such architectures provide potentials for sustainable performance scaling into future peta-scale/exa-scale computing platforms, at affordable power budget, design complexity, and verification efforts. To date, multi-core processor products have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and super computers. However, many issues still remain with multi-core processor architectures that need to be addressed before their potentials could be fully realized. People in both academia and industry research community are still seeking proper ways to make efficient and effective use of these processors. The issues involve hardware architecture trade-offs, the system software service, the run-time management, and user application design, which demand more research effort into this field. Due to the architectural specialties with multi-core based computers, a Cross-Layer Customization framework is proposed in this work, which combines application specific information and system platform features, along with necessary operating system service support, to achieve exceptional power and performance efficiency for targeted multi-core platforms. Several topics are covered with specific optimization goals, including snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations. Analysis of benchmark program execution with conventional mechanisms is made to reveal the overheads in terms of power and performance. Specific customizations are proposed to eliminate such overheads with support from hardware, system software, compiler, and user applications. Experiments show significant improvement on system performance and power efficiency

    Scheduler de sucesión binaria con cambio de contexto

    Get PDF
    En un sistema embebido es primordial la optimización de los recursos, por lo que es importante el uso de técnicas multitareas que permitan a los diseñadores de software asignar los recursos de manera eficiente. El scheduler o planificador es la base para lograr un sistema multitareas, porque administra el orden de ejecución de las tareas y distribuye la carga del procesador. En este trabajo se presenta la implementación de un scheduler de sucesión binaria con la capacidad de configurar un número variable de tareas, su frecuencia de ejecución y el desface para evitar colisiones. El scheduler tiene un mecanismo de cambio de contexto que complementa el proceso automático de guardado y restaurado de los registros del procesador cuando una interrupción ocurre, lo que evita la perdida de información de las tareas. Con el uso del modelo de programación para ARM, se desarrolló código en lenguaje c el scheduler con bloques de instrucciones en ensamblador para implementar el cambio de contexto; además se utilizaron dos punteros a la pila para separar la memoria el scheduler de las tareas, excepciones del sistema con la finalidad de dar menor prioridad en los procesos del scheduler que las interrupciones externas invocadas por periféricos y el reloj del sistema definiendo los tiempos de ejecución de las tareas. El cambio de contexto en un scheduler puede llegar a ser una debilidad importante en el sistema, con lo que se compromete el tiempo de respuesta de un sistema multitareas, la selección de una mala estrategia pudiera afectar hasta ≈ 350µsec en un procesador de 200MHz.Consejo Nacional de Ciencia y Tecnologí
    corecore