141 research outputs found
Tiled microprocessors
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D
Recommended from our members
Standard cell optimization and physical design in advanced technology nodes
Integrated circuits (ICs) are at the heart of modern electronics, which rely heavily on the state-of-the-art semiconductor manufacturing technology. The key to pushing forward semiconductor technology is IC feature-size miniaturization. However, this brings ever-increasing design complexities and manufacturing challenges to the $340 billion semiconductor industry. The manufacturing of two-dimensional layout on high-density metal layers depends on complex design-for-manufacturing techniques and sophisticated empirical optimizations, which introduces huge amounts of turnaround time and yield loss in advanced technology nodes. Our study reveals that unidirectional layout design can significantly reduce the manufacturing complexities and improve the yield, which is becoming increasingly adopted in semiconductor industry [61, 89]. The lithography printing of unidirectional layout can be tightly controlled using advanced patterning techniques, such as self-aligned double and quadruple patterning. Despite the manufacturing benefits, unidirectional layout leads to more restrictive solution space and brings significant impacts on the IC design automation ow for routing closure. Notably, unidirectional routing limits the standard cell pin accessibility, which further exacerbates the resource competitions during routing. Moreover, for post-routing optimization, traditional redundant-via insertion has become obsolete under unidirectional routing style, which makes the yield enhancement task extremely challenging. Regardless of complex multiple patterning and design-for-manufacturing approaches, mask optimization through resolution enhancement techniques remains as the key strategy to improve the yield of the semiconductor manufacturing processes. Among them, Sub-Resolution Assist Feature (SRAF) generation is a very important method to improve lithographic process windows. Model-based SRAF generation has been widely used to achieve high accuracy but it is time-consuming and hard to obtain consistent SRAFs. This dissertation proposes novel CAD algorithms and methodologies for standard cell optimization and physical design in advanced technology nodes, which ultimately reduces the design cycle and manufacturing cost of IC design. First, a standard cell pin access optimization engine is proposed to evaluate the pin accessibility of a given standard cell library. We further propose novel pin access planning techniques and concurrent pin access optimizations to efficiently resolve the routing resource competitions, which generates much better routing solutions than state-of-the-art, manufacturing-friendly routers. To systematically improve the manufacturing yield in the post-routing stage, a global optimization engine has been introduced for redundant local-loop insertion considering advanced manufacturing constraints. Finally, we propose the first machine learning-based framework for fast yet consistent SRAF generation with the high quality of results.Electrical and Computer Engineerin
Tile size selection for low-power tile-based architectures
In this paper, we investigate the power implications of tile size selection for tile-based processors. We refer to this investigation as a tile granularity study. This is accomplished by distilling the architectural cost of tiles with different computational widths into a system metric we call the Granularity Indicator (GI). The GI is then compared against the communications exposed when algorithms are partitioned across multiple tiles. Through this comparison, the tile granularity that best fits a given set of algorithms can be determined, reducing the system power for that set of algorithms. When the GI analysis is applied to the Synchroscalar tile architecture[1], we find that Synchroscalar\u27s already low power consumption can be further reduced by 14% when customized for execution of the 802.11a receiver. In addition, the GI can also be a used to evaluate tile size when considering multiple applications simultaneously, providing a convenient platform for hardware-software co-design
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a
Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on
each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND
architecture
Resource allocation and scalability in dynamic wavelength-routed optical networks.
This thesis investigates the potential benefits of dynamic operation of wavelength-routed optical networks (WRONs) compared to the static approach. It is widely believed that dynamic operation of WRONs would overcome the inefficiencies of the static allocation in improving resource use. By rapidly allocating resources only when and where required, dynamic networks could potentially provide the same service that static networks but at decreased cost, very attractive to network operators. This hypothesis, however, has not been verified. It is therefore the focus of this thesis to investigate whether dynamic operation of WRONs can save significant number of wavelengths compared to the static approach whilst maintaining acceptable levels of delay and scalability. Firstly, the wavelength-routed optical-burst-switching (WR-OBS) network architecture is selected as the dynamic architecture to be studied, due to its feasibility of implementation and its improved network performance. Then, the wavelength requirements of dynamic WR-OBS are evaluated by means of novel analysis and simulation and compared to that of static networks for uniform and non-uniform traffic demand. It is shown that dynamic WR-OBS saves wavelengths with respect to the static approach only at low loads and especially for sparsely connected networks and that wavelength conversion is a key capability to significantly increase the benefits of dynamic operation. The mean delay introduced by dynamic operation of WR-OBS is then assessed. The results show that the extra delay is not significant as to violate end-to-end limits of time-sensitive applications. Finally, the limiting scalability of WR-OBS as a function of the lightpath allocation algorithm computational complexity is studied. The trade-off between the request processing time and blocking probability is investigated and a new low-blocking and scalable lightpath allocation algorithm which improves the mentioned trade-off is proposed. The presented algorithms and results can be used in the analysis and design of dynamic WRONs
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
Flip: Data-Centric Edge CGRA Accelerator
Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators
due to the outstanding balance in flexibility, performance, and energy
efficiency. Classic CGRAs statically map compute operations onto the processing
elements (PE) and route the data dependencies among the operations through the
Network-on-Chip. However, CGRAs are designed for fine-grained static
instruction-level parallelism and struggle to accelerate applications with
dynamic and irregular data-level parallelism, such as graph processing. To
address this limitation, we present Flip, a novel accelerator that enhances
traditional CGRA architectures to boost the performance of graph applications.
Flip retains the classic CGRA execution model while introducing a special
data-centric mode for efficient graph processing. Specifically, it exploits the
natural data parallelism of graph algorithms by mapping graph vertices onto
processing elements (PEs) rather than the operations, and supporting dynamic
routing of temporary data according to the runtime evolution of the graph
frontier. Experimental results demonstrate that Flip achieves up to 36
speedup with merely 19% more area compared to classic CGRAs. Compared to
state-of-the-art large-scale graph processors, Flip has similar energy
efficiency and 2.2 better area efficiency at a much-reduced power/area
budget
Performance, scalability, and flexibility in the RAW network router
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (p. 46).Conventional high speed Internet routers are built using custom designed microprocessors, dubbed network processors, to efficiently handle the task of packet routing. While capable of meeting the performance demanded of them, these custom network processors generally lack the flexibility to incorporate new features and do not scale well beyond that for which they were designed. Furthermore, they tend to suffer from long and costly development cycles, since each new generation must be redesigned to support new features and fabricated anew in hardware. This thesis presents a new design for a network processor, one implemented entirely in software, on a tiled, general purpose microprocessor. The network processor is implemented on the Raw microprocessor, a general purpose microchip developed by the Computer Architecture Group at MIT. The Raw chip consists of sixteen identical processing tiles arranged in a four by four matrix and connected by four inter-tile communication networks; the Raw chip is designed to be able to scale up merely by adding more tiles to the matrix. By taking advantage of the parallelism inherent in the task of packet forwarding on this inherently parallel microprocessor, the Raw network processor is able to achieve performance that matches or exceeds that of commercially available custom designed network processors. At the same time, it maintains the flexibility to incorporate new features since it is implemented entirely in software, as well as the scalability to handle more ports by simply adding more tiles to the microprocessor.by Anthony M. DeGangi.M.Eng
- …