214 research outputs found
An A-FPGA architecture for relative timing based asynchronous designs
pre-printThis paper presents an asynchronous FPGA architecture that is capable of implementing relative timing based asynchronous designs. The architecture uses the Xilinx 7-Series architecture as a starting point and proposes modifications that would make it asynchronous design capable while keeping it fully functional for synchronous designs. Even though the architecture requires additional components, it is observed when implemented on the 64-nm node, the area of the slice was increases marginally by 7%. The architecture leaves configurable routing structures untouched and does not compromise on performance of the synchronous architecture
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for todayâs many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the networkâs switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each outputâs own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
A cell set for self-timed design using actel FPGAs
technical reportAsynchronous or self-timed systems that do not rely on a global clock to keep system components synchronized can offer significant advantages over traditional clocked circuits in a variety of applications. However, these systems require that suitable self-timed circuit primitives are available for building the system. This report describes a cell set designed for building self-timed circuits and systems using Actel field programmable gate arrays (FPGAs). The cells use a two-phase transition signalling protocol for control signals and a bundled protocol for data signals. This library of macro cells is designed to be used with the Workmen tool suite from VIEWlogic and the Action Logic System (ALS) from Actel
SpinLink: An interconnection system for the SpiNNaker biologically inspired multi-computer
SpiNNaker is a large-scale biologically-inspired multi-computer designed to model very heavily distributed problems, with the flagship application being the simulation of large neural networks. The project goal is to have one million processors included in a single machine, which consequently span many thousands of circuit boards. A computer of this scale imposes large communication requirements between these boards, and requires an extensible method of connecting to external equipment such as sensors, actuators and visualisation systems. This paper describes two systems that can address each of these problems.Firstly, SpinLink is a proposed method of connecting the SpiNNaker boards by using time-division multiplexing (TDM) to allow eight SpiNNaker links to run at maximum bandwidth between two boards. SpinLink will be deployed on Spartan-6 FPGAs and uses a locally generated clock that can be paused while the asynchronous links from SpiNNaker are sending data, thus ensuring a fast and glitch-free response. Secondly, SpiNNterceptor is a separate system, currently in the early stages of design, that will build upon SpinLink to address the important external I/O issues faced by SpiNNaker. Specifically, spare resources in the FPGAs will be used to implement the debugging and I/O interfacing features of SpiNNterceptor
Petri nets based components within globally asynchronous locally synchronous systems
Dissertação apresentada na Faculdade de CiĂȘncias e Tecnologias da Universidade Nova de Lisboa para a obtenção do grau de Mestre em Engenharia ElectrotĂ©cnica e ComputadoresThe main goal is to develop a solution for the interconnection of components constituent of a GALS - Globally Asynchronous, Locally Synchronous â system. The components are implemented in parallel obtained as a result of the partition of a model expressed a Petri net (PN), performed using the PNs editor SNOOPY-IOPT in conjunction with the Split tool and the tools to automatically generate the VHDL code from the representations of the PNML models resulting from the partition (these tools were developed under the project FORDESIGN and are available at http://www.uninova.pt/FORDESIGN). Typical solutions will be analyzed to ensure proper communication between components of the GALS system, as well as characterized and developed an appropriate solution for the interconnection of the components associated with the PN sub-models. The final goal (not attained with this thesis) would be to acquire a tool that allows generation of code for the interconnection solution from the associated components, considering a specific application. The solution proposed for componentes interconnection was coded in VHDL and the implementation platforms used for testing include the Xilinx FPGA Spartan-3 and Virtex-II
Pipelined Asynchronous High Level Synthesis for General Programs
High-level synthesis (HLS) translates algorithms from software programming language into hardware. We use the dataflow HLS methodology to translate programs into asynchronous circuits by implementing programs using asynchronous dataflow elements as hardware building blocks. We extend the prior work in dataflow synthesis in the following aspects:i) we propose Fluid to synthesize pipelined dataflow circuits for real-world programs with complex control flows, which are not supported in the previous work; ii) we propose PipeLink to permit pipelined access to shared resources in the dataflow circuit. Dataflow circuit results in distributed control and an implicitly pipelined implementation. However, resource sharing in the presence of pipelining is challenging in this context due to the absence of a global scheduler. Traditional solutions to this problem impose restrictions on pipelining to guarantee mutually exclusive access to the shared resource, but PipeLink removes such restrictions and can generate pipelined asynchronous dataflow circuits for shared function calls, pipelined memory accesses and function pointers; iii) we apply several dataflow optimizations to improve the quality of the synthesized dataflow circuits; iv) we implement our system (Fluid + PipeLink) on the LLVM compiler framework, which allows us to take advantage of the optimization efforts from the compiler community; v) we compare our system with a widely-used academic HLS tool and two commercial HLS tools. Compared to commercial (academic) HLS tools, our system results in 12X (20X) reduction in energy, 1.29X (1.64X) improvement in throughput, 1.27X (1.61X) improvement in latency at a cost of 2.4X (1.61X) increase in the area
Dynamically reconfigurable asynchronous processor
The main design requirements for today's mobile applications are:
· high throughput performance.
· high energy efficiency.
· high programmability.
Until now, the choice of platform has often been limited to Application-Specific
Integrated Circuits (ASICs), due to their best-of-breed performance and power
consumption. The economies of scale possible with these high-volume markets have
traditionally been able to hide the high Non-Recurring Engineering (NRE) costs
required for designing and fabricating new ASICs. However, with the NREs and
design time escalating with each generation of mobile applications, this practice may
be reaching its limit.
Designers today are looking at programmable solutions, so that they can respond
more rapidly to changes in the market and spread costs over several generations of
mobile applications. However, there have been few feasible alternatives to ASICs:
Digital Signals Processors (DSPs) and microprocessors cannot meet the throughput
requirements, whereas Field-Programmable Gate Arrays (FPGAs) require too much
area and power.
Coarse-grained dynamically reconfigurable architectures offer better solutions for
high throughput applications, when power and area considerations are taken into
account. One promising example is the Reconfigurable Instruction Cell Array
(RICA). RICA consists of an array of cells with an interconnect that can be
dynamically reconfigured on every cycle. This allows quite complex datapaths to be
rendered onto the fabric and executed in a single configuration - making these
architectures particularly suitable to stream processing. Furthermore, RICA can be
programmed from C, making it a good fit with existing design methodologies.
However the RICA architecture has a drawback: poor scalability in terms of area and
power. As the core gets bigger, the number of sequential elements in the array must
be increased significantly to maintain the ability to achieve high throughputs through
pipelining. As a result, a larger clock tree is required to synchronise the increased
number of sequential elements. The clock tree therefore takes up a larger percentage
of the area and power consumption of the core.
This thesis presents a novel Dynamically Reconfigurable Asynchronous Processor
(DRAP), aimed at high-throughput mobile applications. DRAP is based on the RICA
architecture, but uses asynchronous design techniques - methods of designing digital
systems without clocks. The absence of a global clock signal makes DRAP more
scalable in terms of power and area overhead than its synchronous counterpart.
The DRAP architecture maintains most of the benefits of custom asynchronous
design, whilst also providing programmability via conventional high-level languages.
Results show that the DRAP processor delivers considerably lower power
consumption when compared to a market-leading Very Long Instruction Word
(VLIW) processor and a low-power ARM processor. For example, DRAP resulted in
a reduction in power consumption of 20 times compared to the ARM7 processor, and
29 times compared to the TIC64x VLIW, when running the same benchmark capped
to the same throughput and for the same process technology (0.13ÎŒm). When
compared to an equivalent RICA design, DRAP was up to 22% larger than RICA but
resulted in a power reduction of up to 1.9 times. It was also capable of achieving up
to 2.8 times higher throughputs than RICA for the same benchmarks
Design and Implementation of FPGA Configuration Logic Block Using Asynchronous Static NCL
This paper proposes the design of a FPGA configurable logic block (CLB) using asynchronous static NULL convention logic (NCL) Library. The proposed design uses three static LUT\u27s for implementing NCL logic functions. Each LUT can be configured to function as any one of the 27 fundamental NCL Static gates. The proposed CLB supports 10 inputs and three different outputs, each with resettable and inverting variations. The CLB has two modes: Configuration mode and operation mode. The static NCL FPGA CLB is simulated at the transistor level using the 1.8 V, 180 nm TSMC CMOS process
- âŠ