    Research in the effective implementation of guidance computers with large scale arrays Interim report

    Functional logic character implementation in breadboard design of NASA modular compute

    Concepção e realização de um framework para sistemas embarcados baseados em FPGA aplicado a um classificador Floresta de Caminhos Ótimos

    Advisors: Eurípedes Guilherme de Oliveira Nóbrega, Isabelle Fantoni-Coichot, Vincent Frémont. Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Mecânica; Université de Technologie de Compiègne. Abstract: Many modern applications rely on Artificial Intelligence methods such as automatic classification. However, the computational cost of these techniques limits their use on resource-constrained embedded platforms. Large amounts of data can exceed the computational power available in such environments, which makes designing them a challenging task. Common processing pipelines use many computationally expensive functions, which creates the need to combine high computational capacity with energy efficiency. One strategy to overcome this limitation and provide sufficient computational power at low energy consumption is to use specialized hardware such as FPGAs. This class of devices is widely known for its good performance-to-consumption ratio, making it an interesting alternative for building capable and efficient embedded systems. This thesis proposes an FPGA-based framework for accelerating a classification algorithm to be implemented in an embedded system. Acceleration is achieved with a SIMD-based parallelization scheme that takes advantage of the fine-grained parallelism of FPGAs. The proposed system is implemented and tested on actual FPGA hardware. To validate the architecture, a graph-based classifier, the Optimum-Path Forest (OPF), is evaluated in a proposed application and afterward implemented on the proposed architecture.
The OPF study led to the proposal of a new learning algorithm for the classifier, using evolutionary computation concepts and aiming to reduce classification processing time. Combined with the hardware implementation, this offers sufficient performance acceleration to be applied in a variety of embedded systems.
Doctorate. Area: Solid Mechanics and Mechanical Design. Degree: Doctor of Mechanical Engineering. 3077/2013-09 CAPE
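
The abstract above does not spell out the classification step that the framework accelerates. As a rough illustration only, the sketch below follows the commonly published OPF rule: a test sample takes the label of the training node that minimizes the larger of that node's path cost and its distance to the sample. The data layout, the Euclidean distance, the names `opf_classify` and `TRAINED`, and the toy values are all assumptions made for this example; they are not taken from the thesis. Each candidate cost depends on a single training node, which is the kind of independent work a SIMD scheme can spread across parallel lanes.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def opf_classify(sample, nodes):
    """Label `sample` with the usual OPF rule (illustrative, not the thesis code).

    `nodes` is a hypothetical layout for a trained forest:
    a list of (features, path_cost, label) tuples.
    """
    # Each candidate cost depends on one node only, so these
    # computations are independent and could run in parallel lanes.
    candidates = [max(cost, euclidean(sample, feats)) for feats, cost, _ in nodes]
    best = min(range(len(nodes)), key=candidates.__getitem__)
    return nodes[best][2], candidates[best]


# Toy forest (made-up values): prototypes have path cost 0.
TRAINED = [
    ((0.0, 0.0), 0.0, "a"),
    ((1.0, 0.0), 1.0, "a"),
    ((5.0, 5.0), 0.0, "b"),
    ((5.0, 4.0), 1.0, "b"),
]

label, cost = opf_classify((4.0, 4.0), TRAINED)
print(label, cost)
```

In this toy forest the sample (4.0, 4.0) is closest to the class "b" nodes, so it receives label "b". A hardware version would replace the list comprehension with parallel distance units followed by a reduction to find the minimum.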

    Applications of reprogrammability in algorithm acceleration

    This doctoral thesis consists of an introductory part and eight appended publications, which deal with hardware-based reprogrammability in algorithm acceleration, with a specific emphasis on the possibilities offered by modern large-scale Field Programmable Gate Arrays (FPGAs) in computationally demanding applications. The historical evolution of both the theoretical and technological paths culminating in the introduction of reprogrammable logic devices is first outlined. The terms commonly used in the thesis are then defined. The reprogrammable logic market is surveyed, and the architectural structures and the technological reasoning behind them are described in detail. As reprogrammable logic lies between Application Specific Integrated Circuits (ASICs) and general-purpose microprocessors in the implementation spectrum of electronic systems, special attention has been paid to differentiating these three implementation approaches. This has been done to emphasize that reprogrammable logic offers much more than just a low-volume replacement for ASICs. Design systems for reprogrammable logic are investigated, as the learning curve associated with them is the main hurdle that keeps software-oriented designers from using reprogrammable logic devices. The theoretically important topic of partial reprogrammability is described in detail, but it is concluded that the practical problems in designing viable development platforms for partially reprogrammable systems will hinder its widespread adoption. The main technical, design-oriented, and economic applicability factors of reprogrammable logic are laid out. The main advantages of reprogrammable logic are its suitability for fine-grained, bit-level parallelizable computing, a short time-to-market, and low upfront costs. It is also concluded that the main opportunities for reprogrammable logic lie in the potential of high-level design systems and the ever-growing ASIC design gap.
On the other hand, most power-conscious mass-market portable products do not seem to offer major new market potential for reprogrammable logic. The appended publications are examined and compared to contemporaneous research at other research institutions. The conclusion is that for relatively wide classes of well-defined computation problems, reprogrammable logic offers a more efficient solution than a software-centered approach, with a much shorter production cycle than is the case with ASICs.

    Reconfigurable microarchitectures at the programmable logic interface

    Diseño digital utilizando lógica programable : aplicaciones a la enseñanza

    Programmable logic devices are integrated circuits containing a large number of basic cells, specifically gates and registers, whose interconnections can be configured by the user to produce a given design. These devices have become essential components of any digital electronic design, largely displacing discrete components. Programmable logic technology has brought a paradigm shift in electronic design: a circuit that can be modified through software, offering many advantages and possibilities. This paradigm shift in how designs are made has also produced important changes in how design is taught. This thesis presents a series of innovative experiences in teaching electronic engineering using programmable logic. It studies a technological topic, programmable logic, and takes as its field of application the use of this technology in teaching digital electronic design. It begins by examining different recommendations for engineering education, with a deeper look at practical, design, and laboratory aspects. It then gives an in-depth, up-to-date review of programmable logic, including the devices, the design process, and the tools used. Finally, it presents a series of experiences and studies on methodologies for teaching digital electronic design. These experiences fall into two groups: first, the improvement of the introductory logic design course with a new and innovative laboratory methodology, and second, the development of reconfigurable platforms built by advanced students as final-year projects. In both cases the results obtained are presented, from both the educational and the technological point of view.

    The hardware track finder processor in CMS at CERN

    The work covers the design of the Track Finder Processor in the high-energy experiment CMS (Compact Muon Solenoid, planned for 2005) at CERN/Geneva. The task of this processor is to identify muons and measure their transverse momentum. The track finder processor makes it possible to determine the physical relevance of each high-energy collision and to forward only interesting data to the data analysis units. Data from more than two hundred thousand detector cells are used to determine the location of muons and measure their transverse momentum. A new data set is generated every 25 ns. Measurement of the location and transverse momentum of the muons can be completed within 350 ns by using an ASIC (Application Specific Integrated Circuit). A pipeline architecture processes new data sets at the required data rate of 40 MHz to ensure dead-time-free operation. In the framework of this study, the specifications and the overall concept of the track finder processor were worked out in detail. Simulations were performed in order to select the most appropriate measurement method and implementation technology. Existing systems were evaluated and their specifications were compared with those of the track finder processor. The classic method in high-energy physics experiments is to search for predefined tracks or bit patterns in the measurement data and to determine their properties. The predefined patterns are compared to the found patterns. The high number of data channels of the track finder processor and the complex requirements on spatial detector resolution do not permit a pattern comparison method. A so-called track following algorithm was designed, which is able to assemble complete tracks through the whole detector starting from single track segments. Instead of storing a high number of track patterns, an algorithm for track finding and momentum measurement is employed directly.
This enables a hardware implementation within the requirements given by the experiment. The algorithm was translated to the level of digital electronics. Comprehensive simulations, employing the hardware description language VHDL, were conducted in order to optimize the algorithm and its hardware implementation. An FPGA (field programmable gate array) prototype and a test system were designed. A feasibility study on implementing the track finder processor with ASICs was conducted. The study shows that the track finder processor can be implemented using today's technology.
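
The timing figures quoted in this abstract can be cross-checked with a short calculation. The sketch below uses only the 25 ns data-set spacing and the 350 ns measurement time stated above. The helper names are made up for this example, and the "data sets in flight" figure is a lower bound that assumes one data set is accepted per 25 ns clock. The abstract does not give the actual number of pipeline stages in the design.

```python
# Figures quoted in the abstract.
DATA_SET_SPACING_NS = 25   # a new data set is generated every 25 ns
LATENCY_NS = 350           # one measurement completes within 350 ns


def data_rate_mhz(period_ns):
    """Convert a repetition period in nanoseconds to a rate in MHz."""
    return 1_000.0 / period_ns


def data_sets_in_flight(latency_ns, period_ns):
    """Data sets a dead-time-free pipeline holds at once (ceiling division)."""
    return -(-latency_ns // period_ns)


rate = data_rate_mhz(DATA_SET_SPACING_NS)
in_flight = data_sets_in_flight(LATENCY_NS, DATA_SET_SPACING_NS)
print(f"{rate} MHz, {in_flight} data sets in flight")
```

A 25 ns spacing corresponds to the 40 MHz rate quoted in the abstract. With a 350 ns latency, at least 14 data sets are being processed at any moment, which is why a pipelined rather than a sequential design is needed to avoid dead time.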

    Techniques for Efficient Implementation of FIR and Particle Filtering

    Three-Dimensional Processing-In-Memory-Architectures: A Holistic Tool For Modeling And Simulation

    The steadily widening performance gap between processor and memory architectures, commonly known as the Memory Wall, requires novel concepts to achieve further scaling of processing performance. As memories were identified as the limitation within a von Neumann architecture, this work addresses this constraint. Although three-dimensional memories alleviate the effects of the Memory Wall, relying on such memories alone would be insufficient. Because of its higher efficiency, integrating processing capacity into memory (Processing-In-Memory, PIM) is a promising alternative. However, PIM simulation models are still lacking. Consequently, a flexible simulation tool for three-dimensional stacked memories was created and then extended to model three-dimensional PIM architectures. This tool can simulate stacked memories such as the Hybrid Memory Cube in a standard-compliant manner and at the same time offers high accuracy by modeling at the level of elementary data packets (FLITs) in combination with the hardware-validated BOBSim simulator. In addition, a specifically designed simulation clock tree enables fast simulation execution. Measurements show a 100x speedup in functional mode and a 2x speedup in clock-cycle-accurate mode.
With the aid of a specifically implemented, binary-compatible GPU accelerator and the tool described above, the modeling of a complete three-dimensional PIM architecture is demonstrated in this work. The maximum hardware resources were constrained to match a PIM accelerator from the literature. A representative, memory-bound geophysical imaging algorithm was used to evaluate both the standalone GPU model and the combined PIM simulation model. On its own, the GPU simulation model shows significantly improved simulation speed, with a deviation of 6% compared to a Verilator model. Subsequently, various configurations of the integrated PIM accelerator were evaluated. Depending on the chosen configuration, the algorithm achieves up to 140 GFLOPS of actual processing performance or a maximum computing efficiency of 30% (synthetic) or 24.5% (real). The latter is a 2x improvement over the state of the art. A concluding discussion examines the results in depth.