
    Vector support for multicore processors with major emphasis on configurable multiprocessors

    It has recently become increasingly difficult to build higher-speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity ruled out the exploitation of more instruction-level parallelism (ILP). To continue raising performance, processor designers turned to thread-level parallelism (TLP) as a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable of increasing performance and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support, which could yield significant speedups for array operations while maintaining area/power efficiency. This dissertation proposes and presents the realization of an FPGA-based (Field-Programmable Gate Array) prototype of a multicore architecture with a shared vector unit (MCwSV). The idea is that rather than improving only scalar or TLP performance, some of the hardware budget could be used to realize a vector unit that greatly speeds up applications abundant in data-level parallelism (DLP). Realistically, limited by the parallelism in the application itself and by the compiler's vectorizing abilities, most general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach also keeps the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips. The design, implementation, and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled, multi-banked structure, which provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide a significant performance increase compared to a multicore system without a vector unit. However, an MCwSV system with two MicroBlazes and a shared vector unit is not always the optimal configuration for applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability, to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be exploited to create optimized MCwSV systems for various applications. The work therefore eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption, and execution latency are the three metrics used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value.
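
    The closing cost metric lends itself to a small worked example. The sketch below ranks hypothetical MCwSV configurations by the resource x power x latency product described above; the configuration names, LUT counts, power figures, and latencies are invented assumptions standing in for synthesis reports and benchmark runs, not results from the dissertation.

```python
# Hypothetical design-space sweep using the resource x power x latency
# product metric described above. All numbers are invented for
# illustration; real values would come from FPGA synthesis reports and
# benchmark runs.

candidates = {
    # name: (LUTs used, power in watts, execution latency in ms)
    "2xMicroBlaze, no vector unit": (11_000, 1.2, 48.0),
    "2xMicroBlaze + shared vector": (19_000, 1.6, 9.5),
    "4xMicroBlaze + shared vector": (31_000, 2.3, 8.8),
}

def cost(resources, power, latency):
    """Smaller is better: the product penalizes any single bad metric."""
    return resources * power * latency

best = min(candidates, key=lambda name: cost(*candidates[name]))
for name, metrics in candidates.items():
    print(f"{name:30s} cost = {cost(*metrics):>12,.0f}")
print("chosen configuration:", best)
```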

    A process for improving early life failure response

    Thesis (M.B.A.)--Massachusetts Institute of Technology, Sloan School of Management; and (S.M.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering; in conjunction with the Leaders for Manufacturing Program at MIT, 2003. Includes bibliographical references. By Jason T. Chen. S.M.; M.B.A.

    An integrated soft- and hard-programmable multithreaded architecture


    Transmitter Architectures Based on Near-Field Direct Antenna Modulation

    A near-field direct antenna modulation (NFDAM) technique is introduced, in which the radiated far-field signal is modulated by time-varying changes in the antenna's near-field electromagnetic (EM) boundary conditions. This enables the transmitter to send data in a direction-dependent fashion, producing a secure communication link. NFDAM can be performed using either switches or varactors. Two fully integrated proof-of-concept NFDAM transmitters operating at 60 GHz, one using switches and one using varactors, are demonstrated in silicon, proving the feasibility of this approach.
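
    As a rough intuition for why the modulation is direction dependent, consider a toy phasor model: the received symbol is the sum of a direct path and a reflector path whose phase depends on both the observation angle and a near-field switch state. The sketch below rests on invented geometry and coefficients and does not model the actual chip; it merely shows the transmitted symbol pair changing with direction.

```python
# Toy phasor model of direction-dependent modulation: received symbol =
# direct path + reflector path whose phase shift depends on the angle
# theta and on a near-field switch state. Geometry and coefficients are
# invented for illustration only.
import cmath
import math

def received(theta_deg, switch_on):
    direct = 1.0 + 0.0j
    # Reflector contribution: direction-dependent phase, flipped by the switch.
    phase = 2 * math.pi * math.sin(math.radians(theta_deg))
    reflected = 0.6 * cmath.exp(1j * (phase + (math.pi if switch_on else 0.0)))
    return direct + reflected

# The same switch toggle yields different symbol pairs in different
# directions, which is what makes the link direction dependent.
for theta in (0, 30):
    s_off, s_on = received(theta, False), received(theta, True)
    print(f"theta={theta:2d} deg: symbols {s_off:.2f} vs {s_on:.2f}")
```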

    High level compilation for gate reconfigurable architectures

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 205-215). A continuing exponential increase in the number of programmable elements is turning management of gate-reconfigurable architectures as "glue logic" into an intractable problem; it is past time to raise this abstraction level. The physical hardware in gate-reconfigurable architectures is all low level - individual wires, bit-level functions, and single-bit registers - hence one should look to the fetch-decode-execute machinery of traditional computers for higher-level abstractions. Ordinary computers have machine-level architectural mechanisms that interpret instructions - instructions that are generated by a high-level compiler. Efficiently moving up to the next abstraction level requires leveraging these mechanisms without introducing the overhead of machine-level interpretation. In this dissertation, I solve this fundamental problem by specializing architectural mechanisms with respect to input programs. This solution is the key to efficient compilation of high-level programs to gate-reconfigurable architectures. My approach to specialization includes several novel techniques. I develop, with others, extensive bitwidth analyses that apply to registers, pointers, and arrays. I use pointer analysis and memory disambiguation to target devices with blocks of embedded memory. My approach to memory parallelization generates a spatial hierarchy that enables easier-to-synthesize logic state machines with smaller circuits and no long wires. My space-time scheduling approach integrates the techniques of high-level synthesis with the static routing concepts developed for single-chip multiprocessors. Using DeepC, a prototype compiler demonstrating my thesis, I compile a new benchmark suite to Xilinx Virtex FPGAs. The resulting performance is comparable to a custom MIPS processor, with smaller area (40 percent on average), higher evaluation speeds (2.4x), and lower energy (18x) and energy-delay (45x). Specialization of advanced mechanisms results in additional speedup, scaling with hardware area, at the expense of power. For comparison, I also target IBM's standard-cell SA-27E process and the RAW microprocessor. Results include a sensitivity analysis of the different mechanisms specialized and a grand comparison between alternate targets. By Jonathan William Babb. Ph.D.
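
    As a minimal flavor of the bitwidth analysis mentioned above, the sketch below infers the register width a loop counter needs from its value range, so synthesized registers and wires can be narrowed. It is a toy for constant, non-negative ranges only; the compiler's real analyses, which also cover pointers and arrays, are far more general.

```python
# Minimal flavor of bitwidth analysis: infer the unsigned register width
# a variable needs from its value range. Toy version for constant,
# non-negative ranges only.

def bits_needed(lo: int, hi: int) -> int:
    """Unsigned width required to represent every value in [lo, hi]."""
    assert 0 <= lo <= hi, "toy version assumes a non-negative range"
    return max(1, hi.bit_length())

# for (i = 0; i < 320; i++) pixel[i] = ...;  ->  i fits in 9 bits,
# not a default 32-bit register.
print(bits_needed(0, 319))  # 9
```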

    High speed simulation of microprocessor systems using LTU dynamic binary translation

    This thesis presents new simulation techniques designed to speed up the simulation of microprocessor systems. The techniques apply to the class of simulators that employ dynamic binary translation as their underlying technology. This research supports the hypothesis that faster simulation speeds can be realized by translating larger sections of the target program at runtime. The primary motivation for this research was to help facilitate comprehensive design-space exploration and hardware/software co-design of novel processor architectures by reducing the time required to run simulations. Instruction set simulators are used to design and verify new system architectures, and to develop software in parallel with hardware. However, compromises must often be made when performing these tasks due to time constraints. This is particularly true in the embedded systems domain, where time-to-market is short. The processing demands placed on simulation platforms are exacerbated further by the need to simulate the increasingly complex, multi-core processors of tomorrow. High-speed simulators are therefore essential to reducing the time required to design and test advanced microprocessors, enabling new systems to be released ahead of the competition. Simulators based on dynamic binary translation typically translate small sections of the target program at runtime. This research considers the translation of larger units of code (LTUs, large translation units) in order to increase simulation speed. The new simulation techniques identify large sections of program code suitable for translation after analyzing a profile of the target program's execution path built up during simulation. The average instruction-level simulation speed for the EEMBC benchmark suite is shown to be at least 63% faster with the new techniques than with basic-block dynamic binary translation, and 14.8 times faster than interpretive simulation. The average cycle-approximate simulation speed is shown to be at least 32% faster with the new techniques than with basic-block dynamic binary translation, and 8.37 times faster than cycle-accurate interpretive simulation.
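
    A minimal sketch of how large translation units might be grown from an execution profile, in the spirit described above, follows. The control-flow graph, block addresses, and hotness threshold are invented for illustration and do not reproduce the thesis's actual selection heuristics.

```python
# Sketch of large-translation-unit (LTU) selection from an execution
# profile. The control-flow graph, addresses, and threshold are invented.
from collections import defaultdict

exec_count = defaultdict(int)  # per-basic-block execution counter

successors = {  # toy control-flow graph: block address -> successor blocks
    0x100: [0x120],
    0x120: [0x140, 0x180],
    0x140: [0x120],
    0x180: [],
}

def record_block(pc):
    """Called by the interpretive front end each time a block executes."""
    exec_count[pc] += 1

def hot_region(seed_pc, threshold=1000):
    """Grow an LTU from a hot seed block along hot CFG edges."""
    region, stack = set(), [seed_pc]
    while stack:
        pc = stack.pop()
        if pc in region or exec_count[pc] < threshold:
            continue
        region.add(pc)
        stack.extend(successors.get(pc, []))
    return region

# Simulate observing a hot loop between 0x120 and 0x140 during simulation.
record_block(0x100)
for _ in range(5000):
    record_block(0x120)
    record_block(0x140)

print(sorted(hex(pc) for pc in hot_region(0x120)))  # one LTU: ['0x120', '0x140']
```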

    Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance.

    The use of a network of shared, heterogeneous workstations, each harboring a Reconfigurable Computing (RC) system, offers high-performance users an inexpensive platform for a wide range of computationally demanding problems. However, effectively using the full potential of these systems can be challenging without knowledge of the system's performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of Reconfigurable Computing systems. This dissertation develops and validates an analytic performance modeling methodology for a class of fork-join algorithms executing on a High Performance Reconfigurable Computing (HPRC) platform. The model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message-passing communication, and processor heterogeneity. Three applications from the fork-join class (a Boolean satisfiability solver, a matrix-vector multiplication algorithm, and an Advanced Encryption Standard algorithm) are used to validate the model with homogeneous and simulated heterogeneous workstations. A synthetic load is used to validate the model under various loading conditions, including simulated heterogeneity in which background loading makes some workstations appear slower than others. The performance modeling methodology proves to be accurate in characterizing the effects of reconfigurable devices, application load imbalance, background user load, and heterogeneity for applications running on shared, homogeneous and heterogeneous HPRC resources. The model error in all cases was found to be less than five percent for application runtimes greater than thirty seconds and less than fifteen percent for runtimes less than thirty seconds. The performance modeling methodology enables us to characterize applications running on shared HPRC resources. Cost functions are used to impose system usage policies, and the results of the modeling methodology are utilized to find the optimal (or near-optimal) set of workstations to use for a given application. The usage policies investigated include determining the computational costs for the workstations and balancing the priority of the background user load with the parallel application. The applications studied fall within the Master-Worker paradigm and are well suited to a grid computing approach. A method for using NetSolve, a grid middleware, with the model and cost functions is introduced, whereby users can produce optimal workstation sets and schedules for Master-Worker applications running on shared HPRC resources.
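
    A toy instance of the fork-join prediction idea reads as follows: runtime is bounded by the slowest worker, and each worker's effective speed is degraded by its background load. The speeds, loads, and communication overhead below are invented placeholders; the dissertation's full model also captures the reconfigurable device and message-passing costs in detail.

```python
# Toy fork-join runtime prediction: the join completes when the slowest
# worker finishes, and background load degrades each worker's effective
# speed. All numbers are invented placeholders.

def predict_runtime(work_units, speeds, bg_load, comm_overhead):
    """Estimate fork-join runtime on shared, heterogeneous workstations."""
    finish_times = [
        w / (s / (1.0 + load))  # a worker shares its CPU with 'load' competing jobs
        for w, s, load in zip(work_units, speeds, bg_load)
    ]
    return comm_overhead + max(finish_times)

# Three workstations; the third appears slower due to background loading.
work   = [100.0, 100.0, 100.0]  # work units assigned per worker
speeds = [10.0, 10.0, 10.0]     # work units per second when idle
load   = [0.0, 0.1, 1.0]        # average competing background jobs

print(f"predicted runtime: {predict_runtime(work, speeds, load, 0.5):.1f} s")  # 20.5 s
```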

    Applying the Engineering Statechart Formalism to the evaluation of soft real-time in operating systems : a use case tailored modeling and analysis technique

    Multimedia applications that have emerged in recent years impose unique requirements on an underlying general purpose operating system (GPOS). The suitability of a GPOS for multimedia processing is judged by its soft real-time capabilities. To date, the question of how these capabilities can be assessed has scarcely been addressed: this is a gap in GPOS research. By answering questions on the impact of the Interrupt Handling Facility (IHF) on the overall soft real-time capabilities of a GPOS, this thesis contributes to filling this gap. The Engineering Statechart Formalism (ESF), a use-case-tailored formal method for modeling real-world operating systems based on the classical statechart formalism, is syntactically and semantically defined. Models of the IHF of selected real-world operating systems are then created by means of this technique. As no appropriate real-time concept fitting the goals of this thesis exists as yet, a suitable definition is constructed. By projecting this system-wide notion onto the interrupt subsystem, specific indicators for this subsystem are derived. These indicators are then evaluated by applying formal techniques such as graph-based analysis and temporal logic model checking to the ESF models. Finally, the assertions derived from this evaluation are interpreted with respect to their impact on real-time multimedia processing in different general purpose operating systems.
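
    As a toy flavor of the graph-based analysis applied to such models, the sketch below computes a worst-case transition count from interrupt arrival to handler completion in a small, invented state graph; a real ESF model and the indicators derived from it are far more detailed.

```python
# Toy graph-based indicator: worst-case number of transitions from
# interrupt arrival to handler completion in a small, invented state
# graph (assumed acyclic along the paths of interest).

states = {
    "idle":         ["irq_raised"],
    "irq_raised":   ["masked", "dispatch"],
    "masked":       ["dispatch"],  # unmasked after a critical section
    "dispatch":     ["handler_done"],
    "handler_done": [],
}

def longest_path(src, dst, seen=()):
    """Longest transition count from src to dst, ignoring revisits."""
    if src == dst:
        return 0
    best = None
    for nxt in states[src]:
        if nxt in seen:
            continue
        sub = longest_path(nxt, dst, seen + (src,))
        if sub is not None:
            best = max(best if best is not None else 0, 1 + sub)
    return best

print(longest_path("idle", "handler_done"))  # worst case: 4 transitions
```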

    Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

    The demand for processing power is increasing steadily. In the past, single-processor architectures clearly dominated the markets. As instruction-level parallelism is limited in most applications, significant performance gains can only be achieved in the future by exploiting parallelism at the higher levels of thread or process parallelism. As a consequence, modern “processors” incorporate multiple processor cores that form a single shared memory multiprocessor. In such systems, high-performance devices like network interface controllers are connected to processors and memory like every other input/output device, over a hierarchy of peripheral interconnects. Thus, one goal must be to couple coprocessors physically closer to main memory and to the processors of a computing node, removing the overhead of today's peripheral interconnect structures. One such step is the direct connection of HyperTransport (HT) devices to Opteron processors, which is presented in this thesis. This work also analyzes how communication from a device to processors can be optimized at the protocol level. As today's computing nodes are shared memory systems, the cache coherence protocol is the central protocol for data exchange between processors and devices. Consequently, the analysis extends to classes of devices that are aware of the cache coherence protocol. The concept of a transfer cache is also proposed in this thesis, which reduces latency significantly even for non-coherent devices. The trend toward the exploitation of process- and thread-level parallelism leads to a steady increase in system sizes. Networks used in such large systems are very susceptible to both hard and transient faults. Most transient fault rates are constant per bit that is stored or transmitted, so with increasing system sizes and higher clock frequencies, the number of faults in time increases drastically. In the end, the error rate may rise to a level where high-level error recovery becomes too costly unless lower layers perform error correction that is transparent to the layers above. The second part of this thesis describes a direct interconnection network that provides a reliable transport service even without the use of end-to-end protocols. A novel hardware-based solution for intermediate routing is also developed in this thesis, which allows efficient, deadlock-free routing around faulty links.
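
    The scaling argument about fault rates admits a back-of-the-envelope illustration: if the transient fault rate per stored or transmitted bit is constant, total faults in time grow linearly with system size. The per-bit rate and per-node memory size below are invented placeholders, not measured values.

```python
# Back-of-the-envelope scaling of transient fault rates: constant rate
# per bit implies faults in time (FIT, failures per 10^9 hours) grow
# linearly with system size. The per-bit rate and memory size are
# invented placeholders.

FIT_PER_BIT = 1e-6  # assumed transient faults per bit per 10^9 hours

def system_fit(nodes, bits_per_node):
    return nodes * bits_per_node * FIT_PER_BIT

for nodes in (1, 1_000, 100_000):
    fit = system_fit(nodes, bits_per_node=8 * 2**30)  # 1 GiB of state per node
    mtbf_hours = 1e9 / fit
    print(f"{nodes:>7} nodes: {fit:14,.0f} FIT, MTBF ~ {mtbf_hours:12,.1f} h")
```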