38 research outputs found

    The Design of a System Architecture for Mobile Multimedia Computers

    Get PDF
    This chapter discusses the system architecture of a portable computer, called Mobile Digital Companion, which provides support for handling multimedia applications energy efficiently. Because battery life is limited and battery weight is an important factor for the size and the weight of the Mobile Digital Companion, energy management plays a crucial role in the architecture. As the Companion must remain usable in a variety of environments, it has to be flexible and adaptable to various operating conditions. The Mobile Digital Companion has an unconventional architecture that saves energy by using system decomposition at different levels of the architecture and exploits locality of reference with dedicated, optimised modules. The approach is based on dedicated functionality and the extensive use of energy reduction techniques at all levels of system design. The system has an architecture with a general-purpose processor accompanied by a set of heterogeneous autonomous programmable modules, each providing an energy efficient implementation of dedicated tasks. A reconfigurable internal communication network switch exploits locality of reference and eliminates wasteful data copies

    Vector-thread architecture and implementation

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 181-186).This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow. The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements cache refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm2, and it runs at 260 MHz while consuming 0.4-1.1 W. This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication.(cont.) Larger applications also include a JPEG image encoder and an IEEE 802.11 la wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3-11 compute operations per cycle. Unlike other architectures which improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.by Ronny Meir Krashinsky.Ph.D

    Automated design of domain-specific custom instructions

    Get PDF

    Überblick zur Softwareentwicklung in Wissenschaftlichen Anwendungen

    Get PDF
    Viele wissenschaftliche Disziplinen mĂŒssen heute immer komplexer werdende numerische Probleme lösen. Die KomplexitĂ€t der benutzten wissenschaftlichen Software steigt dabei kontinuierlich an. Diese KomplexitĂ€tssteigerung wird durch eine ganze Reihe sich Ă€ndernder Anforderungen verursacht: Die Betrachtung gekoppelter PhĂ€nomene gewinnt Aufmerksamkeit und gleichzeitig mĂŒssen neue Technologien wie das Grid-Computing oder neue Multiprozessorarchitekturen genutzt werden, um weiterhin in angemessener Zeit zu Berechnungsergebnissen zu kommen. Diese FĂŒlle an neuen Anforderungen kann nicht mehr von kleinen spezialisierten Wissenschaftlergruppen in Isolation bewĂ€ltigt werden. Die Entwicklung wissenschaftlicher Software muss vielmehr in interdisziplinĂ€ren Gruppen geschehen, was neue Herausforderungen in der Softwareentwicklung induziert. Ein Paradigmenwechsel zu einer stĂ€rkeren Separation von Verantwortlichkeiten innerhalb interdisziplinĂ€rer Entwicklergruppen ist bis jetzt in vielen FĂ€llen nur in AnsĂ€tzen erkennbar. Die Kopplung partitioniert durchgefĂŒhrter Simulationen physikalischer PhĂ€nomene ist ein wichtiges Beispiel fĂŒr softwaretechnisch herausfordernde Aufgaben im Gebiet des wissenschaftlichen Rechnens. In diesem Kontext modellieren verschiedene Simulationsprogramme unterschiedliche Teile eines komplexeren gekoppelten Systems. Die vorliegende Arbeit gibt einen Überblick ĂŒber Paradigmen, die darauf abzielen Softwareentwicklung fĂŒr Berechnungsprogramme verlĂ€sslicher und weniger abhĂ€ngig voneinander zu machen. Ein spezielles Augenmerk liegt auf der Entwicklung gekoppelter Simulationen.Fields of modern science and engineering are in need of solving more and more complex numerical problems. The complexity of scientiïŹc software thereby rises continuously. This growth is caused by a number of changing requirements. Coupled phenomena gain importance and new technologies like the computational Grid, graphical and heterogeneous multi-core processors have to be used to achieve high-performance. The amount of additional complexity can not be handled by small groups of specialised scientists. The interdiciplinary nature of scientiïŹc software thereby presents new challanges for software engineering. A paradigm shift towards a stronger separation of concerns becomes necessary in the development of future scientiïŹc software. The coupling of independently simulated physical phenomena is an important example for a software-engineering concern in the domain of computational science. In this context, different simulation-programs model only a part of a more complex coupled system. The present work gives overview on paradigms which aim at making software-development in computational sciences more reliable and less interdependent. A special focus is put on the development of coupled simulations

    Performance and power optimizations in chip multiprocessors for throughput-aware computation

    Get PDF
    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly-parallel applications and operating systems capable of scheduling them correctly. At hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace ---phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty to lower operating voltages sufficiently every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach ---referred to as the "processor-in-regfile" (PIR) strategy--- overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly-parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly harder to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes with a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency.El excesivo consumo de potencia de los procesadores actuales ha desacelerado el incremento en la frecuencia operativa de los mismos para dar lugar a la era de los procesadores con mĂșltiples nĂșcleos y mĂșltiples hilos de ejecuciĂłn. Por ejemplo, el procesador POWER7 de IBM, lanzado al mercado en 2010, incorpora ocho nĂșcleos en el mismo chip, con cuatro hilos de ejecuciĂłn por nĂșcleo. Esto da lugar a nuevas oportunidades y desafĂ­os para los arquitectos de software y hardware. A nivel de software, las aplicaciones pueden beneficiarse del abundante nĂșmero de nĂșcleos e hilos de ejecuciĂłn para aumentar el rendimiento. Pero esto obliga a los programadores a crear aplicaciones altamente paralelas y sistemas operativos capaces de planificar correctamente la ejecuciĂłn de las mismas. A nivel de hardware, el creciente nĂșmero de nĂșcleos e hilos de ejecuciĂłn ejerce presiĂłn sobre la interfaz de memoria, ya que el ancho de banda de memoria crece a un ritmo mĂĄs lento. AdemĂĄs de los problemas de ancho de banda de memoria, el consumo de energĂ­a del chip se eleva debido a la dificultad de los fabricantes para reducir suficientemente los voltajes de operaciĂłn entre generaciones de procesadores. Esta tesis presenta innovaciones para mejorar el ancho de banda y consumo de energĂ­a en procesadores multinĂșcleo en el ĂĄmbito de la computaciĂłn orientada a rendimiento ("throughput-aware computation"): una memoria cachĂ© de Ășltimo nivel ("last-level cache" o LLC) optimizada para ancho de banda, un banco de registros vectorial optimizado para ancho de banda, y una heurĂ­stica para planificar la ejecuciĂłn de aplicaciones paralelas orientada a mejorar la eficiencia del consumo de potencia y desempeño. En contraste con los diseños de LLC de Ășltima generaciĂłn, nuestra organizaciĂłn evita la duplicaciĂłn de datos y, por tanto, no requiere de tĂ©cnicas de coherencia. El espacio de direcciones de memoria se distribuye estĂĄticamente en la LLC con un entrelazado de grano fino. La ausencia de replicaciĂłn de datos aumenta la capacidad efectiva de la memoria cachĂ©, lo que se traduce en mejores tasas de acierto y mayor ancho de banda en comparaciĂłn con una LLC coherente. Utilizamos la tĂ©cnica de "doble buffering" para ocultar la latencia adicional necesaria para acceder a datos remotos. El banco de registros vectorial propuesto se compone de miles de registros y se organiza como una agregaciĂłn de bancos. Incorporamos a cada banco una pequeña unidad de cĂłmputo de propĂłsito especial ("local computation element" o LCE). Este enfoque ---que llamamos "computaciĂłn en banco de registros"--- permite superar el nĂșmero limitado de puertos en el banco de registros. Debido a que cada LCE es una unidad de cĂłmputo con soporte SIMD ("single instruction, multiple data") y todas ellas pueden proceder de forma concurrente, la estrategia de "computaciĂłn en banco de registros" constituye un dispositivo SIMD altamente paralelo. Por Ășltimo, presentamos una heurĂ­stica para planificar la ejecuciĂłn de aplicaciones paralelas orientada a reducir el consumo de energĂ­a del chip, colocando dinĂĄmicamente los hilos de ejecuciĂłn a nivel de software entre los hilos de ejecuciĂłn a nivel de hardware. La heurĂ­stica obtiene, en tiempo de ejecuciĂłn, informaciĂłn de consumo de potencia y desempeño del chip para inferir las caracterĂ­sticas de las aplicaciones. Por ejemplo, si los hilos de ejecuciĂłn a nivel de software comparten datos significativamente, la heurĂ­stica puede decidir colocarlos en un menor nĂșmero de nĂșcleos para favorecer el intercambio de datos entre ellos. En tal caso, los nĂșcleos no utilizados se pueden apagar para ahorrar energĂ­a. Cada vez es mĂĄs difĂ­cil encontrar soluciones de arquitectura "a prueba de balas" para resolver las limitaciones de escalabilidad de los procesadores actuales. En consecuencia, creemos que los arquitectos deben atacar dichos problemas desde diferentes flancos simultĂĄneamente, con innovaciones complementarias
    corecore