High performance computing with FPGAs
Field-programmable gate arrays represent an army of logical units which can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates new challenges in finding the right processing paradigm, one that takes into account the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. In this paper, the use of an FPGA as a multiprocessor on a chip is first compared with its use as a highly functional coprocessor, and the programming tools for hardware/software codesign are discussed. Next, a number of techniques are presented to maximize the parallelism and optimize the data locality in nested loops. These include unimodular transformations, data-locality-improving loop transformations and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples.
The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors.
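As a rough illustration of the unimodular loop transformations named in this abstract (not the paper's own code), the Python sketch below reorders a two-deep iteration space by an integer matrix with determinant +1 or -1. The function name `transformed_order`, the constants `IDENTITY` and `INTERCHANGE`, and the 2 x 3 iteration space are assumptions chosen for illustration. The sketch does not check whether the new order respects the loop's data dependences.

```python
def transformed_order(n_i, n_j, T):
    """Return the original iterations (i, j) in the order induced by the
    2x2 integer matrix T, i.e. sorted lexicographically by T applied to (i, j).

    Hypothetical helper: it only checks that T is unimodular, not that the
    transformation is legal for any particular loop body.
    """
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    if det not in (1, -1):
        raise ValueError("T must be unimodular (determinant +1 or -1)")

    def new_index(point):
        i, j = point
        return (T[0][0] * i + T[0][1] * j, T[1][0] * i + T[1][1] * j)

    points = [(i, j) for i in range(n_i) for j in range(n_j)]
    return sorted(points, key=new_index)


IDENTITY = [[1, 0], [0, 1]]      # original loop order: i outer, j inner
INTERCHANGE = [[0, 1], [1, 0]]   # loop interchange: j outer, i inner

row_major = transformed_order(2, 3, IDENTITY)
column_major = transformed_order(2, 3, INTERCHANGE)
```

Interchange is the simplest case: if an array is stored column by column, visiting `(0, 0), (1, 0), (0, 1), ...` touches consecutive memory locations, which is the kind of locality gain the abstract refers to.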
A proposition of a manufactronic network approach for intelligent and flexible manufacturing systems
The XPRESS project introduces a completely new scalable concept of a manufactronic networked factory, which is composed of a co-ordinated team of specialized autonomous objects (Manufactrons), each knowing how to perform a certain process optimally. This knowledge-based concept integrates the complete chain: production configuration (decrease of ramp-up time of at least 50%), multi-variant production lines (varying types and volumes on a single line) and 100% quality monitoring. The manufactronic networked architecture allows continuous process improvement, and will be able to anticipate and respond to rapidly changing consumer needs, producing high-quality products in adequate quantities while reducing costs. This concept is demonstrated in the automotive, aeronautics and electrical industries but can be transferred to nearly all production processes.
Second year technical report on-board processing for future satellite communications systems
Advanced baseband and microwave switching techniques for large domestic communications satellites operating in the 30/20 GHz frequency bands are discussed. The nominal baseband processor throughput is one million packets per second (1.6 Gb/s) from one thousand T1 carrier rate customer premises terminals. A frequency reuse factor of sixteen is assumed by using 16 spot antenna beams with the same 100 MHz bandwidth per beam and a modulation with a one b/s per Hz bandwidth efficiency. Eight of the beams are fixed on major metropolitan areas and eight are scanning beams which periodically cover the remainder of the U.S. under dynamic control. User signals are regenerated (demodulated/remodulated) and message packages are reformatted on board. Frequency division multiple access and time division multiplex are employed on the uplinks and downlinks, respectively, for terminals within the coverage area and dwell interval of a scanning beam. Link establishment and packet routing protocols are defined. Also described is a detailed design of a separate 100 x 100 microwave switch capable of handling nonregenerated signals occupying the remaining 2.4 GHz bandwidth with 60 dB of isolation, at an estimated weight and power consumption of approximately 400 kg and 100 W, respectively
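The throughput figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is our own consistency check, not part of the report. The T1 line rate of 1.544 Mb/s is the standard value and is an assumption here, since the abstract does not state it; the implied packet size assumes the full 1.6 Gb/s is carried as packets.

```python
# Figures quoted in the abstract.
BEAMS = 16                       # frequency reuse factor (spot beams)
BANDWIDTH_PER_BEAM_HZ = 100e6    # 100 MHz per beam
SPECTRAL_EFFICIENCY = 1.0        # one b/s per Hz
PACKETS_PER_SECOND = 1e6         # nominal processor throughput
TERMINALS = 1000                 # customer premises terminals

# Assumption (not in the abstract): standard T1 line rate.
T1_RATE_BPS = 1.544e6

# Aggregate capacity across all beams.
throughput_bps = BEAMS * BANDWIDTH_PER_BEAM_HZ * SPECTRAL_EFFICIENCY

# Average packet size implied by the two quoted throughput figures.
bits_per_packet = throughput_bps / PACKETS_PER_SECOND

# Offered load if every terminal transmits at the full T1 rate.
offered_bps = TERMINALS * T1_RATE_BPS
utilization = offered_bps / throughput_bps
```

Under these assumptions the sixteen beams give the quoted 1.6 Gb/s, and one thousand T1 terminals would fill most, but not all, of that capacity.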
Interconnect-aware scheduling and resource allocation for high-level synthesis
High-level architectural synthesis can be described as the process of transforming a behavioral description into a structural description. Scheduling, processor allocation, and register binding are the most important tasks in high-level synthesis. In the past, it has been possible to focus simply on the delays of the processing units in high-level synthesis and neglect the wire delays, since the overall delay of a digital system was dominated by the delay of the logic gates. However, with process technology being scaled down to the deep-submicron region, the global interconnect delays can no longer be neglected in VLSI designs. It is, therefore, imperative to include in high-level synthesis the delays on the wires and buses used to communicate data between the processing units, i.e., the inter-processor communication delays. Furthermore, the way register binding is performed also has an impact on the complexity of the interconnect paths required to transfer data between the processing units. Hence, register binding can no longer ignore its effect on the wiring complexity of the resulting designs. The objective of this thesis is to develop techniques for interconnect-aware high-level synthesis. Under this common theme, the thesis has two distinct focuses. The first is on developing a new high-level synthesis framework that takes the inter-processor communication delay into consideration. The second is on developing a technique to carry out register binding and a scheme to reduce the number of registers while taking the complexity of the interconnects into consideration. A novel scheduling and processor allocation technique that takes the inter-processor communication delay into consideration is presented.
In the proposed technique, the communication delay between a pair of nodes of different types is treated as a non-computing node, whereas that between a pair of nodes of the same type is taken into account by re-adjusting the firing times of the appropriate nodes of the data flow graph (DFG). Another technique is developed that integrates the placement process into scheduling and processor allocation in order to determine the actual positions of the processing units in the placement space. The proposed technique makes use of a hybrid library of functional units, which includes both operation-specific and reconfigurable multiple-operation functional units, to maximize local data transfer. A technique for register binding that results in a reduced number of registers and interconnects is developed by appropriately dividing the lifetime of a token into multiple segments and then binding those having the same source and/or destination into a single register. A node regeneration scheme, in which the idle processing units are utilized to generate multiple copies of the nodes in a given DFG, is devised to reduce the number of registers and interconnects even further. The techniques and schemes developed in this thesis are applied to the synthesis of architectures for a number of benchmark DSP problems and compared with various other commonly used synthesis methods in order to assess their effectiveness. It is shown that the proposed techniques provide superior performance in terms of the iteration period, placement area, and the numbers of processing units, registers and interconnects in the synthesized architectures.
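To make the idea of communication-aware firing times concrete, the sketch below computes as-soon-as-possible start times for a small data flow graph and adds a fixed delay on every edge whose endpoints are bound to different processing units. This is a deliberately simplified stand-in, not the thesis's algorithm: the function name `asap_with_comm_delay`, the single uniform `comm_delay`, and the example graph are all assumptions, and resource conflicts between nodes sharing a unit are not modelled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def asap_with_comm_delay(exec_time, unit_of, edges, comm_delay):
    """Earliest firing time of each DFG node.

    exec_time:  node -> execution time in cycles
    unit_of:    node -> name of the processing unit it is bound to
    edges:      iterable of (producer, consumer) pairs
    comm_delay: cycles added when producer and consumer sit on different units
    """
    preds = {node: set() for node in exec_time}
    for src, dst in edges:
        preds[dst].add(src)

    start = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        earliest = 0
        for p in preds[node]:
            ready = start[p] + exec_time[p]
            if unit_of[p] != unit_of[node]:
                ready += comm_delay  # data must cross the interconnect
            earliest = max(earliest, ready)
        start[node] = earliest
    return start


# Hypothetical three-node graph: a and b both feed m.
exec_time = {"a": 1, "b": 1, "m": 2}
unit_of = {"a": "add0", "b": "mul0", "m": "mul0"}
edges = [("a", "m"), ("b", "m")]

without_delay = asap_with_comm_delay(exec_time, unit_of, edges, comm_delay=0)
with_delay = asap_with_comm_delay(exec_time, unit_of, edges, comm_delay=1)
```

In the example, node `m` fires one cycle later once the transfer from `add0` to `mul0` is charged, which is the effect that a delay-unaware schedule would miss.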
A Study of Dynamic Optimization Techniques: Lessons and Directions in Kernel Design
The Synthesis kernel [21,22,23,27,28] showed that dynamic code generation, software feedback, and fine-grain modular kernel organization are useful implementation techniques for improving the performance of operating system kernels. In addition, and perhaps more importantly, we discovered that there are strong interactions between the techniques. Hence, a careful and systematic combination of the techniques can be very powerful even though each one by itself may have serious limitations. By identifying these interactions, we illustrate the problems of applying each technique in isolation to existing kernels. We also highlight the important common underpinnings of the Synthesis experience and present our ideas on future operating system design and implementation. Finally, we outline a more uniform approach to dynamic optimizations called incremental partial evaluation.
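A toy example may help readers unfamiliar with dynamic code generation and partial evaluation. The sketch below specializes a generic power routine for a value of `n` that becomes known only at run time, generating a loop-free function on the fly. It is an analogy in Python, not the Synthesis kernel's mechanism; the name `specialize_power` is hypothetical.

```python
def specialize_power(n):
    """Generate, at run time, a function that computes x**n for a fixed n >= 0.

    The exponent is "partially evaluated" away: the generated function
    contains no loop and no reference to n, only a chain of multiplications.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    body = " * ".join(["x"] * n) if n > 0 else "1"
    source = f"def power_{n}(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the specialized source text
    return namespace[f"power_{n}"]


cube = specialize_power(3)   # generated body: return x * x * x
```

The trade-off is the one the abstract hints at: generating code costs time up front, so specialization pays off only when the specialized routine is invoked often enough.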
Performance Debugging Frameworks for FPGA High-Level Synthesis
Using high-level synthesis (HLS) tools for field-programmable gate array (FPGA) design is becoming an increasingly popular choice because HLS tools can generate a high-quality design in a short development time. However, current HLS tools still cannot adequately support users in understanding and fixing the performance issues of a design. That is, current HLS tools lack performance debugging capability. Previous work on performance debugging automates the process of inserting hardware monitors in low-level register-transfer level (RTL) languages, which limits the comprehensibility of the obtained result. Instead, our HLS-based flows offer analysis at the function or loop level and provide more intuitive feedback that can be used to pinpoint the performance bottleneck of a design. In this dissertation, we present a collection of HLS-based debugging frameworks for various purposes and characteristics of the design. First, we address the problem in the HLS synthesis step, where an inaccurate cycle estimation is provided if the program has input-dependent behavior. We propose a new performance estimator that automatically instruments code that models the hardware execution behavior and interprets the information from the HLS software simulation. However, the performance estimation result of this flow may not be accurate for designs that cannot be simulated correctly by existing HLS software simulators. To handle such cases, we propose a new software simulator that provides cycle-accurate results based on the HLS scheduling information. If the input dataset is not available for software simulation, or high-level models do not exist for all components of the FPGA design, we also present an on-board monitoring flow for automated cycle extraction and stall analysis. Finally, we address the need of HLS programmers to automatically find the best set of directives for FPGA designs.
We propose a design space exploration (DSE) framework to optimize applications with variable loop bounds in the PolyBench benchmark suite. A quantitative comparison among the proposed frameworks is shown using the sparse matrix-vector multiplication benchmark.
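Sparse matrix-vector multiplication is a convenient example of the input-dependent behavior discussed in this abstract: the inner loop's trip count depends on how many non-zeros each row holds, so a static cycle estimate cannot know it. The sketch below instruments a CSR kernel to count inner-loop iterations and feeds the counts into a very crude cycle model. The model (a fixed `loop_overhead` per row plus an initiation interval `ii` per non-zero) and all names are assumptions for illustration, not the dissertation's estimator.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for a CSR matrix, also returning the inner-loop trip count."""
    y = []
    inner_iterations = 0
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Trip count depends on the data: row_ptr[row + 1] - row_ptr[row].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
            inner_iterations += 1
        y.append(acc)
    return y, inner_iterations


def estimate_cycles(row_ptr, ii=1, loop_overhead=2):
    """Hypothetical cycle model: per-row overhead plus ii cycles per non-zero."""
    rows = len(row_ptr) - 1
    nonzeros = row_ptr[-1] - row_ptr[0]
    return rows * loop_overhead + ii * nonzeros


# 3x3 example matrix [[1, 0, 2], [0, 0, 0], [0, 3, 0]] in CSR form.
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]

y, trips = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
cycles = estimate_cycles(row_ptr)
```

Two matrices of the same dimensions but different sparsity would give different `cycles`, which is why an estimator needs information from simulation or on-board monitoring rather than loop bounds alone.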
The ALICE TPC, a large 3-dimensional tracking device with fast readout for ultra-high multiplicity events
The design, construction, and commissioning of the ALICE Time-Projection
Chamber (TPC) is described. It is the main device for pattern recognition,
tracking, and identification of charged particles in the ALICE experiment at
the CERN LHC. The TPC is cylindrical in shape with a volume close to 90 m^3 and
is operated in a 0.5 T solenoidal magnetic field parallel to its axis.
In this paper we describe in detail the design considerations for this
detector for operation in the extreme multiplicity environment of central
Pb--Pb collisions at LHC energy. The implementation of the resulting
requirements into hardware (field cage, read-out chambers, electronics),
infrastructure (gas and cooling system, laser-calibration system), and software
led to many technical innovations which are described along with a presentation
of all the major components of the detector, as currently realized. We also
report on the performance achieved after completion of the first round of
stand-alone calibration runs and demonstrate results close to those specified
in the TPC Technical Design Report.
Comment: 55 pages, 82 figures
Fully-photonic digital radio over fibre for future super-broadband access network applications
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. In this thesis, a Fully-Photonic DRoF (FP-DRoF) system is proposed for the deployment of future super-broadband access networks. Digital Radio over Fibre (DRoF) is less dependent on fibre network impairments and on fibre length than an Analogue Radio over Fibre (ARoF) link. To enable a fully optical deployment of the signal conversion techniques in the FP-DRoF architecture, two key data-conversion components, an Analogue-to-Digital Converter (ADC) and a Digital-to-Analogue Converter (DAC), are designed; their performance is investigated and their physical functionality is evaluated. The system simulation results of the proposed pipelined Photonic ADC (PADC) show that the PADC has a 10 GHz bandwidth around a 60 GHz sampling rate. Furthermore, changing the bandwidth of the optical bandpass filter to switch to another sampling-frequency band provides an optimised performance condition for the PADC. The PADC shows little change in its Effective Number of Bits (ENOB) response for analogue RF inputs from 1 GHz up to 22 GHz at a 60 GHz sampling frequency. The ENOB of the proposed 8-bit pipelined PADC, evaluated at 60 Gigasample/s, is about 4.1. Recently, researchers have reported different methods of implementing Photonic DACs (PDACs), but their aim was to convert digital electrical signals to the corresponding analogue signal with the assistance of optical techniques. In this thesis, a Binary Weighted PDAC (BW-PDAC) is proposed, in which optical digital signals are converted fully optically to an analogue signal. The spurious-free dynamic range at the output of the PDAC in a back-to-back deployment of the PADC and the PDAC was 26.6 dBc. To improve system performance further, a 3R (Retiming, Reshaping and Reamplifying) regeneration system is proposed. Simulation results show that, for an ultrashort RZ pulse with a 5% duty cycle at 65 Gbit/s, using the proposed 3R regeneration system on a link reduces rms timing jitter by 90% while the eye opening height of the regenerated pulse is improved by 65%. Finally, the functionality of the proposed FP-DRoF is evaluated and its performance is investigated over dedicated and shared fibre links. The simulation results show that, at low signal-to-noise ratio and over a dedicated fibre link, the FP-DRoF has better BER performance than the ARoF, in the order of 10^-20. Furthermore, to realize a BER of about 10^-25 for the ARoF, the power penalty is about 4 dB higher than for the FP-DRoF link. The simulation results demonstrate that, assuming the 0.2 dB/km attenuation of a standard single-mode fibre, the dedicated fibre length for the FP-DRoF link can be about 20 km longer than for the ARoF link. Moreover, in a shared fibre link, the BER of the FP-DRoF link is lower than that of the ARoF link by a magnitude of about 10^-10 for -19 dBm of power launched into the fibre, and the power penalty of the ARoF system is 10 dB more than that of the FP-DRoF link. This is significant for increasing the fibre link length of the FP-DRoF access network over common infrastructure. In addition, the simulation results demonstrate that the FP-DRoF with non-uniform Wavelength Division Multiplexing (WDM) is more robust against four-wave mixing impairment than the conventional WDM technique with uniform wavelength allocation and has better BER performance. It is clearly verified that the launched power penalty at the CS for a DRoF link with the uniform WDM technique is about 2 dB higher than with the non-uniform WDM technique. Furthermore, the uniform WDM method requires more bandwidth than the non-uniform scheme, which depends on the total number of channels and the channel spacing.
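Two of the figures in this abstract can be related with standard back-of-the-envelope formulas. The sketch below is our own reading, not a calculation taken from the thesis. It uses the textbook relation SINAD = 6.02 x ENOB + 1.76 dB for an ideal converter with a full-scale sine input, and it treats the roughly 4 dB power-penalty difference as a relative margin that can be spent on fibre attenuation. The function names are hypothetical.

```python
def enob_to_sinad_db(enob):
    """Signal-to-noise-and-distortion ratio implied by an effective number
    of bits, using the standard relation SINAD = 6.02 * ENOB + 1.76 dB."""
    return 6.02 * enob + 1.76


def extra_reach_km(power_margin_db, attenuation_db_per_km):
    """Additional fibre length that a given power margin can absorb,
    counting attenuation only (no dispersion or nonlinear effects)."""
    return power_margin_db / attenuation_db_per_km


# ENOB of about 4.1 quoted for the 8-bit PADC at 60 Gigasample/s.
sinad_db = enob_to_sinad_db(4.1)

# About 4 dB of margin over fibre with 0.2 dB/km attenuation.
reach_km = extra_reach_km(4.0, 0.2)
```

Under these assumptions the 4 dB margin corresponds to the roughly 20 km of additional reach quoted in the abstract, and an ENOB of 4.1 corresponds to a SINAD of a little over 26 dB.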
A novel parallel algorithm for surface editing and its FPGA implementation
A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Surface modelling and editing is one of the important subjects in computer graphics. Decades of research in computer graphics have been carried out on both low-level, hardware-related algorithms and high-level, abstract software. The success of computer graphics has been seen in many application areas, such as multimedia, visualisation, virtual reality and the Internet. However, the hardware realisation of the OpenGL architecture on an FPGA (field-programmable gate array) is beyond the scope of most computer graphics research. It is an uncultivated research area in which the OpenGL pipeline, from hardware through the whole embedded system (ES) up to applications, is implemented in an FPGA chip.
This research proposes a hybrid approach to investigating both software and hardware methods. It aims at bridging the gap between methods of software and hardware, and enhancing the overall performance for computer graphics. It consists of four parts, the construction of an FPGA-based ES, Mesa-OpenGL implementation for FPGA-based ESs, parallel processing, and a novel algorithm for surface modelling and editing.
The FPGA-based ES is built up. In addition to the Nios II soft processor and DDR SDRAM memory, it consists of the LCD display device, frame buffers, video pipeline, and algorithm-specified module to support the graphics processing.
Since there is no implementation of OpenGL ES available for FPGA-based ESs, a specific OpenGL implementation based on Mesa is carried out. Because of the limited FPGA resources, the implementation adopts fixed-point arithmetic, which offers faster computation and lower storage than floating-point arithmetic, with accuracy that satisfies the needs of 3D rendering. Moreover, the implementation includes Bézier-spline curve and surface algorithms to support surface modelling and editing.
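The combination of fixed-point arithmetic and Bézier evaluation described above can be sketched in a few lines. The example below evaluates one coordinate of a Bézier curve with de Casteljau's algorithm using a Q16.16 format, in which a value is stored as an integer scaled by 2^16. Both the Q16.16 format and the choice of de Casteljau's algorithm are assumptions for illustration; the abstract does not say which format or evaluation scheme the thesis uses.

```python
FRAC_BITS = 16            # assumed Q16.16 format: 16 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(value):
    """Convert a float to Q16.16."""
    return int(round(value * ONE))


def to_float(fixed):
    """Convert a Q16.16 value back to a float."""
    return fixed / ONE


def fx_mul(a, b):
    """Multiply two Q16.16 values; the shift discards the extra scale factor."""
    return (a * b) >> FRAC_BITS


def bezier_point_fixed(control, t):
    """Evaluate one coordinate of a Bezier curve at parameter t.

    control: Q16.16 control values, t: Q16.16 parameter in [0, ONE].
    Uses de Casteljau's repeated linear interpolation.
    """
    points = list(control)
    one_minus_t = ONE - t
    while len(points) > 1:
        points = [fx_mul(one_minus_t, p) + fx_mul(t, q)
                  for p, q in zip(points, points[1:])]
    return points[0]


# Cubic curve with control values 0, 1, 1, 0, evaluated at t = 0.5.
control = [to_fixed(v) for v in (0.0, 1.0, 1.0, 0.0)]
midpoint = to_float(bezier_point_fixed(control, to_fixed(0.5)))
```

Only integer multiplies, adds and shifts are needed, which is what makes this style of arithmetic attractive when FPGA resources are limited; the price is a bounded rounding error of a few units of 2^-16 per interpolation step.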
Pipelined parallelism and co-processors are used to accelerate graphics processing in this research. These two parallelism methods extend the traditional computation parallelism to fine-grained parallel tasks in the FPGA-based ESs.
The novel algorithm for surface modelling and editing, called the Progressive and Mixing Algorithm (PAMA), is proposed and implemented on FPGA-based ESs. Compared with the two main surface editing methods, subdivision and deformation, the PAMA can eliminate the large storage requirement and computing cost of intermediate processes. With four independent shape parameters, the PAMA can be used to freely model and edit the shape of an open or closed surface that globally keeps zero-order geometric continuity. The PAMA can be applied independently not only to FPGA-based ESs but also to other platforms.
With parallel processing, small size, and low costs of computing, storage and power, the FPGA-based ES provides an effective hybrid solution to surface modelling and editing.