6 research outputs found

    Neural network computing using on-chip accelerators

    The use of neural networks, machine learning, and artificial intelligence, in their broadest and most controversial sense, has seen a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation, such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application but a ubiquitous component of applications. This view necessitates a different approach to deploying machine learning computation, one that spans not only the hardware design of accelerator architectures but also the user and supervisor software needed to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a backend accelerator for inference and learning, decoupled from the hardware and software for managing neural network transactions, can be implemented with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model in delivering energy efficiency gains and improving overall accelerator throughput for machine learning applications.
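
    To make the multi-transaction idea more tangible, the following is a minimal software-side sketch of what a transaction descriptor and a supervisor-managed dispatch queue might look like. All names and fields (NNTransaction, asid, max_inflight, and so on) are hypothetical illustrations invented for this summary, not the dissertation's actual interface.

        from dataclasses import dataclass
        from collections import deque

        @dataclass
        class NNTransaction:
            """Hypothetical descriptor for one neural network request."""
            asid: int               # address-space ID of the requesting process
            nn_config: int          # handle to the network configuration to run
            inputs: list[float]     # input activations
            learning: bool = False  # inference-only or learning transaction

        class AcceleratorManager:
            """Toy supervisor-side queue: admits transactions from multiple
            address spaces and dispatches them to the backend accelerator."""

            def __init__(self, max_inflight: int = 4):
                self.pending: deque[NNTransaction] = deque()
                self.inflight: list[NNTransaction] = []
                self.max_inflight = max_inflight

            def submit(self, tx: NNTransaction) -> None:
                self.pending.append(tx)

            def dispatch(self) -> None:
                # Several transactions may be in flight simultaneously;
                # isolation between address spaces is enforced by tagging
                # each transaction with its asid.
                while self.pending and len(self.inflight) < self.max_inflight:
                    self.inflight.append(self.pending.popleft())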

    Design of a CMOS-Memristive Mixed-Signal Neuromorphic System with Energy and Area Efficiency in System Level Applications

    The von Neumann architecture has been the backbone of modern computers for decades. This computational framework is popular because it defines an easy, simple, and cheap design for the processing unit and memory. Unfortunately, this architecture now faces a serious bottleneck: the growing complexity of computations demands increased parallelism, and the architecture is not efficient at parallel processing. Moreover, the post-Moore's law era brings a constant demand for energy-efficient computing with fewer resources and less area. Hence, researchers are interested in establishing alternatives to the von Neumann architecture, and neuromorphic computing is one of the few aspiring computing architectures that contributes effectively to this effort. Initially, neuromorphic computing attracted attention because of the parallelism found in bio-inspired networks, and researchers were interested in leveraging this advantage on a single chip. The need for speed in real-time performance also escalated the popularity of neuromorphic computing, and different research groups started working on hardware implementations of neural networks. In addition, neuroscience is steadily building a better understanding of biological networks, which provides opportunities for bridging the gap between biological neuronal activity and artificial neural networks. As a consequence, the idea behind neuromorphic computing has continued to gain popularity. In this research, a memristive neuromorphic system for improved power and area efficiency is presented. This implementation introduces a mixed-signal platform for realizing neural networks in a synchronous way. In addition to the mixed-signal design, a nano-scale memristive device is introduced that provides power and area efficiency for the overall system. The system design also includes synchronous digital long-term plasticity (DLTP), an online learning methodology that trains the neural network during the operation phase, improving learning efficiency with respect to power consumption and area overhead. This research also proposes a stochastic neuron design with a sigmoidal firing rate. The design introduces variability in the membrane capacitance so that the neuron reaches different membrane potentials, leading to a variable, stochastic firing rate.
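
    To illustrate the stochastic-neuron idea in behavioral terms, here is a minimal sketch (not the dissertation's actual circuit) of an integrate-and-fire neuron whose membrane capacitance is randomly varied each step, so the membrane potential, and hence the sigmoidal firing probability, becomes stochastic. All constants are illustrative assumptions.

        import math
        import random

        def sigmoid(x: float) -> float:
            return 1.0 / (1.0 + math.exp(-x))

        def stochastic_neuron_step(v_mem: float, i_in: float,
                                   dt: float = 1e-3) -> tuple[float, bool]:
            """One update of a toy integrate-and-fire neuron. The membrane
            capacitance is redrawn every step, mimicking the device
            variability that yields a sigmoidal, stochastic firing rate."""
            c_mem = random.uniform(0.8e-9, 1.2e-9)  # illustrative capacitance range (F)
            v_mem += (i_in / c_mem) * dt            # integrate the input current
            v_th, gain = 1.0, 8.0                   # illustrative threshold and sigmoid gain
            fired = random.random() < sigmoid(gain * (v_mem - v_th))
            if fired:
                v_mem = 0.0                         # reset after a spike
            return v_mem, fired

        # Estimate the firing rate for a constant input current.
        v, spikes = 0.0, 0
        for _ in range(10_000):
            v, fired = stochastic_neuron_step(v, i_in=1e-6)
            spikes += fired
        print(f"empirical firing rate: {spikes / 10_000:.3f} spikes/step")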

    6G Enabled Smart Infrastructure for Sustainable Society: Opportunities, Challenges, and Research Roadmap

    The 5G wireless communication network currently faces the challenge of limited data speed, exacerbated by the proliferation of billions of data-intensive applications. To address this problem, researchers are developing cutting-edge technologies for the envisioned 6G wireless communication standards to satisfy the escalating demand for wireless services. Though some of the candidate technologies in the 5G standards will apply to 6G wireless networks, key disruptive technologies that will guarantee the desired quality of physical experience and achieve ubiquitous wireless connectivity are expected in 6G. This article first provides a foundational background on the evolution of different wireless communication standards to give proper insight into the vision and requirements of 6G. Second, we provide a panoramic view of the enabling technologies proposed to facilitate 6G and introduce emerging 6G applications such as multi-sensory extended reality, digital replicas, and more. Next, the technology-driven challenges as well as the social, psychological, health, and commercialization issues posed by actualizing 6G, together with probable solutions to these challenges, are discussed extensively. Additionally, we present new use cases of 6G technology in agriculture, education, media and entertainment, logistics and transportation, and tourism. Furthermore, we discuss the multi-faceted communication capabilities of 6G that will contribute significantly to global sustainability and how 6G will bring about a dramatic change in the business arena. Finally, we highlight the research trends, open research issues, and key take-away lessons for future research exploration in 6G wireless communication.

    Three-Dimensional Processing-In-Memory-Architectures: A Holistic Tool For Modeling And Simulation

    The steadily widening performance gap between processor and memory architectures, commonly known as the Memory Wall, requires novel concepts to achieve further scaling of processing performance. As memory was identified as the limitation within a von Neumann architecture, this work addresses that constraint. Although three-dimensional memories alleviate the effects of the Memory Wall, such memories alone would be insufficient for future scaling. Due to higher efficiencies, the integration of processing capacity into memory (Processing-In-Memory, PIM) represents a promising alternative; however, PIM simulation models remain scarce. Consequently, a flexible simulation tool for three-dimensional stacked memories was established and extended for modeling three-dimensional PIM architectures. This tool can simulate stacked memories such as the Hybrid Memory Cube in a standard-compliant fashion and simultaneously offers high accuracy by modeling elementary data packets (FLITs) in combination with the hardware-validated BOBSim simulator. A specifically designed simulation clock tree additionally enables rapid simulation: a 100x speedup is measured in the functional mode, while a 2x speedup is achieved in clock-cycle-accurate mode. With the aid of a specifically implemented, binary-compatible GPU accelerator, the modeling of a holistic three-dimensional PIM architecture is demonstrated; its maximum hardware resources are constrained to match a PIM accelerator from the literature. A representative, memory-bound geophysical imaging algorithm is used to evaluate both the GPU model on its own and the compound PIM simulation model. The GPU simulation model alone shows significantly improved simulation performance, with a deviation of 6% compared to a Verilator model. Subsequently, various configurations of the integrated PIM accelerator are evaluated. Depending on the chosen configuration, the utilized algorithm either reaches up to 140 GFLOPS of actual processing performance or achieves a maximum computing efficiency of 30% on synthetic workloads and 24.5% in the real application; the latter doubles the state of the art. A concluding discussion examines the results in depth.
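
    One plausible intuition for the functional-mode speedup is that an event-driven clock tree lets the simulator jump directly to the next scheduled event instead of ticking every component on every cycle. The sketch below is a generic discrete-event skeleton written under that assumption; it is not BOBSim's or the thesis tool's actual implementation.

        import heapq
        from typing import Callable

        class ClockTreeSim:
            """Toy event-driven simulator: components schedule wake-ups and
            the simulator advances time straight to the next event, which is
            where a functional-mode speedup over per-cycle ticking comes from."""

            def __init__(self):
                self.now = 0
                self._events: list[tuple[int, int, Callable[[], None]]] = []
                self._seq = 0  # tie-breaker for events in the same cycle

            def schedule(self, delay_cycles: int, action: Callable[[], None]) -> None:
                heapq.heappush(self._events, (self.now + delay_cycles, self._seq, action))
                self._seq += 1

            def run(self, until: int) -> None:
                while self._events and self._events[0][0] <= until:
                    self.now, _, action = heapq.heappop(self._events)
                    action()

        sim = ClockTreeSim()
        # A memory read request that completes 42 cycles later:
        sim.schedule(42, lambda: print(f"read data returned at cycle {sim.now}"))
        sim.run(until=1000)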

    Advances and Novel Approaches in Discrete Optimization

    Discrete optimization is an important area of applied mathematics with a broad spectrum of applications in many fields. This book results from a Special Issue of the journal Mathematics entitled ‘Advances and Novel Approaches in Discrete Optimization’. It contains 17 articles, selected from 43 submitted papers after a thorough refereeing process, covering a wide range of subjects. Among other topics, it includes seven articles dealing with scheduling problems, e.g., online scheduling, batching, dual and inverse scheduling problems, and uncertain scheduling problems. Other subjects are graphs and applications, evacuation planning, the max-cut problem, capacitated lot-sizing, and packing algorithms.

    Exact Integer Programming Approaches to Sequential Instruction Scheduling and Offset Assignment

    The dissertation at hand presents the main concepts and results derived when studying the optimal solution of two NP-hard compiler optimization problems, namely instruction scheduling and offset assignment, by means of integer programming. It is the outcome of several years of research as an assistant at Michael Jünger's computer science chair in Cologne, with the particular aim of applying exact mathematical optimization techniques to real-world problems arising in the domain of technical computer science. The two problems studied are rather unrelated, apart from the fact that both take place during the machine code generation phase of a compiler and deal with the handling of limited resources. Instruction scheduling is the assignment of issue clock cycles to instructions in the presence of precedence, latency, and resource constraints such that the total time needed to execute all instructions is minimized. Offset assignment deals with storage layouts of program variables and the efficient use of address registers for accesses to these variables; the objective is to employ specialized instructions in order to minimize the overhead caused by address computations. While instruction scheduling needs to be carried out by almost every modern compiler irrespective of the processor architecture, the offset assignment problem occurs mainly in compilers for highly specialized processor designs. Instruction scheduling is a well-studied field in which several exact and heuristic approaches have been developed and experimentally evaluated in the past. In this thesis, we concentrate on the basic-block instruction scheduling problem for single-issue processors. Basic blocks are program fragments with no side entrances or exits, i.e., every instruction of a basic block needs to be executed before the control flow may leave it and enter another basic block. Single-issue processors are capable of starting the execution of exactly one instruction per clock cycle. A number of techniques to preprocess instances of the basic-block instruction scheduling problem have been proposed in the literature and are thoroughly reviewed in this thesis, with emphasis on the more recent ones that arose since the year 2000. They finally led to a constraint programming approach in 2006 that was shown to solve about 350,000 instances to optimality, some of them comprising up to about 2,500 instructions. The last attempt to tackle the problem using integer programming, however, dates to a time prior to the publication of the latest preprocessing advances. While successful on a set of instances that impose very restrictive latency constraints, it was shown to be unable to solve hundreds of instances from the aforementioned benchmark set, which also comprises large and varying latencies. In addition, the previous integer programming models were almost all based on so-called time-indexed formulations, in which decision variables model an explicit assignment of instructions to clock cycles. In this thesis, a completely different and novel approach is taken, based on the linear ordering problem, a well-studied combinatorial optimization problem. The new models lead to alternative characterizations of the feasible solutions of the basic-block instruction scheduling problem. These facilitate the employment of advanced integer programming methodologies, in particular the design of branch-and-cut algorithms that can handle larger instances.
    The formulations are further extended by additional inequalities that can be used as cutting planes. Combined with the preprocessing routines, which are partially extended and improved as well, the respective solver implementation eventually turned out to be competitive with the constraint programming method. Reaching this point took several years, and this thesis presents not only the derived models but also several ideas and byproducts that arose along the way and that can help and inspire researchers even if they aim at different solution methodologies. The starting point regarding the offset assignment problem was different, because exact solution approaches in particular were rare prior to the models presented in this thesis. The offset assignment problem arose in the 1990s and is considered in several variants that are of theoretical and practical interest. In the simplest variant, a processor is assumed to provide only a single address register and only very restricted possibilities to avoid address computation overhead. However, even this simplest variant, which may serve as a building block for the more complex ones, is already NP-hard and has been studied mainly from a heuristic point of view. The few existing exact solution approaches were not capable of solving moderately sized instances, so the quality of heuristic solutions relative to the optimum was hardly known at all. Again, the inspection of the combinatorial structure of the various problem variants turned out to be the key to designing branch-and-cut implementations that profit from knowledge about related combinatorial optimization problems. The implementation targeting the simple problem variant was the first capable of optimally solving the majority of the roughly 3,000 instances collected in a standard benchmark set. The method could then be generalized in two steps. First, in a collaboration with Roberto Castañeda Lozano, additional techniques were incorporated into the approach in order to handle multiple address registers. Fortunately, the methods could then be further extended to also deal with more flexible addressing capabilities. In this way, the thesis at hand not only answers the question of how large the address computation overhead can be when using heuristics, but also presents the first results that allow an analysis of the impact of the mentioned increased addressing capabilities on the runtime performance and size of real-world programs.
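
    As a toy illustration of the basic-block instruction scheduling problem for single-issue processors, the sketch below implements a simple greedy list scheduler over a dependence DAG with latencies. It only illustrates the problem's constraints; the thesis's actual contribution, a linear-ordering-based integer programming model solved with branch-and-cut, computes provably optimal schedules instead.

        from collections import defaultdict

        def list_schedule(n: int, edges: list[tuple[int, int, int]]) -> list[int]:
            """Greedy schedule for a dependence DAG: returns the start cycle
            of each instruction. An edge (i, j, lat) means j may start at
            least `lat` cycles after i starts; single-issue means at most
            one instruction starts per cycle."""
            succ = defaultdict(list)
            indeg = [0] * n
            for i, j, lat in edges:
                succ[i].append((j, lat))
                indeg[j] += 1

            earliest = [0] * n
            ready = {i for i in range(n) if indeg[i] == 0}
            start, cycle = [-1] * n, 0
            while ready:
                # Pick any ready instruction whose earliest start has arrived;
                # a real scheduler would use a priority such as critical path.
                issuable = [i for i in ready if earliest[i] <= cycle]
                if issuable:
                    i = min(issuable)
                    start[i], cycle = cycle, cycle + 1
                    ready.remove(i)
                    for j, lat in succ[i]:
                        earliest[j] = max(earliest[j], start[i] + lat)
                        indeg[j] -= 1
                        if indeg[j] == 0:
                            ready.add(j)
                else:
                    cycle += 1  # stall: all ready instructions wait on latencies
            return start

        # 4 instructions; instruction 3 depends on 1 and 2 with latency 2.
        # Prints [0, 1, 2, 4]: one stall cycle before instruction 3 issues.
        print(list_schedule(4, [(0, 1, 1), (0, 2, 1), (1, 3, 2), (2, 3, 2)]))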
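
    Similarly, the simplest offset assignment variant can be stated compactly: choose memory offsets for the variables so that as many consecutive accesses in a fixed access sequence as possible differ by at most one offset, since such steps are free via auto-increment/decrement of the single address register, while every other transition costs an explicit address computation. The brute-force enumeration below is a sketch for tiny instances only, under these assumptions; it stands in for the exact branch-and-cut method developed in the thesis.

        from itertools import permutations

        def soa_cost(offset: dict[str, int], accesses: list[str]) -> int:
            """Number of explicit address computations: a transition between
            consecutive accesses is free if the offsets differ by at most 1
            (covered by auto-increment/decrement), and costs 1 otherwise."""
            return sum(
                1
                for a, b in zip(accesses, accesses[1:])
                if abs(offset[a] - offset[b]) > 1
            )

        def optimal_layout(accesses: list[str]) -> tuple[int, dict[str, int]]:
            """Exact solver for tiny instances only: enumerates every
            permutation of the variables in memory and keeps the cheapest."""
            variables = sorted(set(accesses))
            best_cost, best = len(accesses), {}
            for perm in permutations(variables):
                layout = {v: i for i, v in enumerate(perm)}
                cost = soa_cost(layout, accesses)
                if cost < best_cost:
                    best_cost, best = cost, layout
            return best_cost, best

        # Access sequence over four variables and one address register;
        # prints the minimal cost together with one optimal layout.
        print(optimal_layout(["a", "b", "a", "c", "d", "c", "b"]))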