6 research outputs found

    Affordable kilo-instruction processors

    Several factors explain the slowdown in the development of high-performance single-thread processors. On the one hand, aggressive techniques such as superpipelining and out-of-order execution have a considerable impact on power consumption and design complexity. On the other hand, the increase in processor frequencies has led to a large disparity between processor speed and memory access time.
Although cache memories considerably reduce the number of accesses to main memory, the remaining accesses introduce latencies large enough to considerably decrease performance. Conventional techniques such as out-of-order execution, while effective at hiding L2 cache accesses, cannot hide latencies this large: queues with hundreds of entries and thousands of registers would be needed to keep execution from stalling on an L2 cache miss. Unfortunately, current technology cannot implement such structures efficiently as monolithic blocks, since access latency, power consumption, and area would all increase considerably. In this thesis we studied techniques that allow the processor to continue processing instructions while main memory accesses are outstanding. For such a processor to be implementable, it should be built from structures of conventional size and feature simple control logic. The challenge lies in designing a distributed processor with simple control. We approached the design by analyzing the behavior of a processor with infinite resources and observed that execution follows a distinctive pattern based on execution locality. In numerical codes, over 70% of all instructions do not depend on main memory accesses. This is important because it shows that there is always a large portion of instructions that can be executed shortly after decode. This allows us to propose a new kind of processor with two execution units. The first unit, the Cache Processor, processes memory-independent instructions at high speed. The second unit, the Memory Processor, processes instructions that depend on main memory accesses using relaxed scheduling logic, which allows it to scale to thousands of in-flight instructions. This proposal, named the Decoupled KILO-Instruction Processor (D-KIP), has several advantages.
On the one hand, it allows a kilo-instruction processor to be built from conventional structures; on the other, it simplifies the design, since the interaction between the two execution units is minimal. This thesis presents two implementations of this kind of processor: the original D-KIP and the Flexible Heterogeneous MultiCore (FMC). Their performance is analyzed and compared with other techniques that increase memory-level parallelism, such as prefetching and runahead execution. The FMC processor is observed to perform at the same level as a conventional processor with a window of around 1500 instructions. The integration of the FMC processor into multicore/multiprogrammed environments is then studied. The thesis concludes by proposing a two-level Load/Store Queue for this kind of processor.
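The split between the Cache Processor and the Memory Processor hinges on knowing which instructions depend, directly or transitively, on a main-memory access. The sketch below illustrates that classification with register-level taint propagation over a toy trace. It is a hypothetical illustration, not the thesis's mechanism: the `Inst` record, the `l2_miss` oracle flag, and `split_by_memory_dependence` are invented names, and dependences through memory (a load reading a value stored by a slow instruction) are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Inst:
    """One decoded instruction in a (hypothetical) trace."""
    op: str                  # mnemonic, for readability only
    dst: Optional[str]       # destination register, or None (e.g. a store)
    srcs: Tuple[str, ...]    # source registers
    l2_miss: bool = False    # assumed oracle: this load misses in the L2 cache


def split_by_memory_dependence(trace: List[Inst]) -> List[str]:
    """Route each instruction to the "cache" or "memory" execution unit.

    An instruction is memory-dependent if it is a load that misses in L2, or if
    any of its source registers was (transitively) produced by such an
    instruction. Dependences through memory are deliberately ignored here.
    """
    slow_regs = set()  # registers whose current value depends on a main-memory access
    routes = []
    for inst in trace:
        slow = inst.l2_miss or any(reg in slow_regs for reg in inst.srcs)
        if inst.dst is not None:
            if slow:
                slow_regs.add(inst.dst)
            else:
                slow_regs.discard(inst.dst)  # overwritten with a fast value
        routes.append("memory" if slow else "cache")
    return routes


trace = [
    Inst("load", "r1", ("r0",), l2_miss=True),  # misses in L2
    Inst("add", "r2", ("r1", "r3")),            # consumes the missing load
    Inst("add", "r4", ("r3", "r5")),            # independent of the miss
    Inst("mul", "r6", ("r4", "r4")),            # independent of the miss
    Inst("store", None, ("r2", "r0")),          # stores a miss-dependent value
]
routes = split_by_memory_dependence(trace)
print(routes)                               # ['memory', 'memory', 'cache', 'cache', 'memory']
print(routes.count("cache") / len(routes))  # 0.4
```

In this toy trace only 40% of the instructions are memory-independent; the figure of over 70% comes from the thesis's measurements on numerical codes, not from this example.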

    Novel Operation Modes of Accelerated Neuromorphic Hardware

    The hybrid operation mode relies on a combination of conventional computing resources and a neuromorphic, beyond-von-Neumann system to perform a joint real-time experiment. The interactive operation mode provides prompt feedback to the user and benefits from high experiment throughput. The performance of a custom transport-layer protocol connecting the accelerated neuromorphic system and the computer cluster is evaluated. Wire-speed performance is achieved between a host and eight FPGAs ((846.7 ± 1.2) MiB/s, 94% of wire speed), and between two hosts using 10-Gigabit Ethernet (10GbE, > 99%) as well as 40GbE (> 99%) to explore scaling behavior. The software architecture for processing neuronal network experiments at high rates is presented, together with measurements of its key performance indicators. During hybrid operation, the tight coupling between both resources requires low-latency communication. Using a custom-developed software framework, the average one-way latency is found to be (2.4 ± 0.2) μs between two host computers connected via 10GbE and (8.5 ± 0.4) μs to the neuromorphic system. A hybrid experiment is designed to demonstrate the hardware infrastructure and software framework. Starting from a conventional neuronal network simulation, the experiment is gradually migrated into a time-continuous experiment in which a host computer and the neuromorphic system interact in real time. Results of the intermediate steps and of the final hybrid operation are evaluated.
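The abstract reports one-way latencies but not how they were measured. A common approach when the two endpoints do not share a synchronized clock is a ping-pong test: bounce a small message off the remote side and halve the round-trip time. The sketch below applies that idea over UDP on the loopback interface using only Python's standard library. It is not the custom framework from the thesis; the function names are hypothetical, and halving the round-trip time assumes both directions take equally long.

```python
import socket
import statistics
import threading
import time
from typing import List, Tuple


def _echo(sock: socket.socket, count: int) -> None:
    """'Pong' side: reflect `count` datagrams back to their sender."""
    for _ in range(count):
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)


def ping_pong_rtts(count: int = 100) -> List[float]:
    """Measure `count` round-trip times, in microseconds, over UDP loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # let the OS choose a free port
    server.settimeout(1.0)         # so the echo thread cannot hang forever
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(1.0)
    echo = threading.Thread(target=_echo, args=(server, count), daemon=True)
    echo.start()
    rtts = []
    try:
        for i in range(count):
            payload = i.to_bytes(4, "little")
            start = time.perf_counter_ns()
            client.sendto(payload, server.getsockname())
            reply, _ = client.recvfrom(64)
            rtts.append((time.perf_counter_ns() - start) / 1000.0)  # ns -> us
            if reply != payload:
                raise RuntimeError("unexpected echo payload")
    finally:
        echo.join(timeout=2.0)
        client.close()
        server.close()
    return rtts


def one_way_latency(rtts_us: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of RTT/2 (assumes a symmetric path)."""
    halves = [rtt / 2 for rtt in rtts_us]
    return statistics.mean(halves), statistics.stdev(halves)


if __name__ == "__main__":
    mean_us, std_us = one_way_latency(ping_pong_rtts(1000))
    print(f"one-way latency: ({mean_us:.1f} ± {std_us:.1f}) us")
```

Numbers from this loopback, Python-level sketch are dominated by interpreter and kernel overhead and are not comparable to the figures in the abstract; the sketch only shows the measurement method.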

    Realizing High IPC Through a Scalable Memory-Latency Tolerant Multipath Microarchitecture

    A microarchitecture is described that achieves high performance on conventional single-threaded program codes without compiler assistance. To obtain high instructions per clock (IPC) for inherently sequential programs (e.g., SpecInt-2000), a large number of instructions must be in flight simultaneously. However, several problems are associated with such microarchitectures, including scalability, control-flow issues, and memory latency.
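Why so many instructions must be in flight can be estimated with Little's law: with in-order retirement, every instruction fetched while the oldest one waits on a cache miss stays in the window until that miss resolves. To keep fetching at IPC instructions per cycle across a miss of latency $L_{\text{mem}}$ cycles, the window therefore needs roughly the size below. The values used (a target IPC of 4 and a 400-cycle miss latency) are illustrative assumptions, not figures from the paper.

```latex
N_{\text{window}} \approx \mathrm{IPC} \times L_{\text{mem}}
  = 4~\tfrac{\text{instructions}}{\text{cycle}} \times 400~\text{cycles}
  = 1600~\text{instructions}
```

That is, a window on the order of a thousand instructions.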

    Collaborative energy efficiency on mobile devices (Yhteisöllinen energiatehokkuus mobiililaitteilla)

    We have created a mobile energy measurement application, gathered energy measurement data from over 725,000 devices running over 300,000 applications in heterogeneous environments, and constructed models of what is normal in each context for each application. We have used this data to find energy abnormalities in the wild and to give users of our application advice on how to deal with them. These abnormalities cannot be discovered in laboratory conditions because of the rich interaction between a smartphone and its operating environment. Employing a collaborative mobile energy awareness application with thousands of users allows us to gather a large amount of data in a short time. Such a large and diverse dataset has helped us answer many research questions. Our work is the first collaborative approach in the area of mobile energy debugging. Information received from each device running our application improves the advice given to other users running the same applications. The author has developed a context data gathering hub for smartphones, identified the need for a common API that unifies network connectivity, energy awareness, and user experience, and investigated the impact of mobile collaborative energy awareness applications in finding previously unknown energy bugs on smartphones and in improving users' knowledge of smartphone energy behavior.
In recent years, smartphone hardware has become increasingly powerful, but battery technology has not advanced as quickly. This has created a need to improve the energy efficiency of both hardware and software. Optimizing smartphone energy efficiency is challenging because the operating environment is diverse: it comprises not only the hardware and its settings but also the applications that use the hardware's functions. This dissertation focuses on finding and fixing energy problems and anomalies on mobile devices.
The dissertation examines the use of a collaborative method for finding and fixing energy-consumption inefficiencies on mobile devices. This method was applied to mobile devices for the first time in the Carat project, which is associated with the dissertation work. The project created an energy measurement application for mobile devices and collected energy measurements from over 725,000 devices and 300,000 applications in diverse environments. From these, models of applications' normal energy consumption in different contexts were built. Using the data and models, energy anomalies were found on devices in everyday use, and users of the application were given advice on fixing them. The large and diverse dataset collected during the dissertation work has helped answer many questions about smartphone energy consumption in everyday use. Not all anomalies can be found in laboratory conditions, since a mobile device's environment strongly affects its operation. The presented method is the first to apply a collaborative approach to finding energy problems on mobile devices. The author has developed a context data gathering solution for smartphones and identified the need for a system that combines the mobile device's situation, usage, energy efficiency, and user experience. The work developed a new method for analyzing energy anomalies based on collaboratively collected measurements and studied the impact of energy efficiency applications on different mobile devices. These have revealed previously unknown energy problems in smartphones and improved users' understanding of smartphone energy behavior.
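The abstract describes building per-application, per-context models of normal energy use from community data and using them to flag abnormal devices, but it does not give the statistical test. As a purely hypothetical illustration (not Carat's actual method), the sketch below models each (application, context) pair by the mean and standard deviation of community drain rates and flags a device whose rate lies more than `z_threshold` standard deviations above that mean. The function names, the drain-rate unit (% battery per hour), and the threshold are assumptions.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (application, context), e.g. ("maps", "wifi")


def build_models(samples: Iterable[Tuple[str, str, float]]) -> Dict[Key, Tuple[float, float]]:
    """Summarize community drain rates (% battery/hour) per (app, context).

    Returns {(app, context): (mean, population std dev)}; groups with fewer
    than two samples are skipped because their spread is undefined.
    """
    groups: Dict[Key, List[float]] = defaultdict(list)
    for app, context, rate in samples:
        groups[(app, context)].append(rate)
    return {
        key: (statistics.mean(rates), statistics.pstdev(rates))
        for key, rates in groups.items()
        if len(rates) >= 2
    }


def find_anomalies(
    models: Dict[Key, Tuple[float, float]],
    device_samples: Iterable[Tuple[str, str, float]],
    z_threshold: float = 3.0,
) -> List[Key]:
    """Return the (app, context) pairs where this device drains abnormally fast."""
    flagged: List[Key] = []
    for app, context, rate in device_samples:
        key = (app, context)
        if key not in models:
            continue  # no community baseline for this pair yet
        mean, std = models[key]
        if std > 0 and (rate - mean) / std > z_threshold and key not in flagged:
            flagged.append(key)
    return flagged


community = [
    ("maps", "wifi", 1.0), ("maps", "wifi", 1.2),
    ("maps", "wifi", 0.8), ("maps", "wifi", 1.0),
    ("mail", "4g", 0.5), ("mail", "4g", 0.7),
]
models = build_models(community)
print(find_anomalies(models, [("maps", "wifi", 5.0), ("mail", "4g", 0.6)]))  # [('maps', 'wifi')]
```

Keying the model on context as well as application mirrors the abstract's point that what counts as "normal" depends on the device's situation; a simple z-score is used here only because it is easy to read.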

    XXIII Congreso Argentino de Ciencias de la Computación - CACIC 2017: Proceedings

    Papers presented at the XXIII Congreso Argentino de Ciencias de la Computación (CACIC), held in the city of La Plata from 9 to 13 October 2017 and organized by the Red de Universidades con Carreras en Informática (RedUNCI) and the Facultad de Informática of the Universidad Nacional de La Plata (UNLP).