47 research outputs found
An empirical evaluation of High-Level Synthesis languages and tools for database acceleration
High Level Synthesis (HLS) languages and tools are emerging as the most promising technique to make FPGAs more accessible to software developers. Nevertheless, picking the most suitable HLS for a certain class of algorithms depends on requirements such as area and throughput, as well as on programmer experience. In this paper, we explore the different trade-offs present when using a representative set of HLS tools in the context of Database Management Systems (DBMS) acceleration. More specifically, we conduct an empirical analysis of four representative frameworks (Bluespec SystemVerilog, Altera OpenCL, LegUp and Chisel) that we utilize to accelerate commonly-used database algorithms such as sorting, the median operator, and hash joins. Through our implementation experience and empirical results for database acceleration, we conclude that the selection of the most suitable HLS depends on a set of orthogonal characteristics, which we highlight for each HLS framework.Peer ReviewedPostprint (author’s final draft
Embedded Machine Learning: Emphasis on Hardware Accelerators and Approximate Computing for Tactile Data Processing
Machine Learning (ML) a subset of Artificial Intelligence (AI) is driving the industrial
and technological revolution of the present and future. We envision a world with smart
devices that are able to mimic human behavior (sense, process, and act) and perform
tasks that at one time we thought could only be carried out by humans. The vision
is to achieve such a level of intelligence with affordable, power-efficient, and fast hardware
platforms. However, embedding machine learning algorithms in many application domains
such as the internet of things (IoT), prostheses, robotics, and wearable devices is an ongoing
challenge. A challenge that is controlled by the computational complexity of ML algorithms,
the performance/availability of hardware platforms, and the application\u2019s budget (power
constraint, real-time operation, etc.). In this dissertation, we focus on the design and
implementation of efficient ML algorithms to handle the aforementioned challenges. First, we
apply Approximate Computing Techniques (ACTs) to reduce the computational complexity of
ML algorithms. Then, we design custom Hardware Accelerators to improve the performance
of the implementation within a specified budget. Finally, a tactile data processing application
is adopted for the validation of the proposed exact and approximate embedded machine
learning accelerators.
The dissertation starts with the introduction of the various ML algorithms used for
tactile data processing. These algorithms are assessed in terms of their computational
complexity and the available hardware platforms which could be used for implementation.
Afterward, a survey on the existing approximate computing techniques and hardware
accelerators design methodologies is presented. Based on the findings of the survey, an
approach for applying algorithmic-level ACTs on machine learning algorithms is provided.
Then three novel hardware accelerators are proposed: (1) k-Nearest Neighbor (kNN) based
on a selection-based sorter, (2) Tensorial Support Vector Machine (TSVM) based on Shallow
Neural Networks, and (3) Hybrid Precision Binary Convolution Neural Network (BCNN).
The three accelerators offer a real-time classification with monumental reductions in the
hardware resources and power consumption compared to existing implementations targeting
the same tactile data processing application on FPGA. Moreover, the approximate accelerators
maintain a high classification accuracy with a loss of at most 5%
Hardware acceleration of the trace transform for vision applications
Computer Vision is a rapidly developing field in which machines process visual data to extract meaningful information. Digitised images in their pixels and bits serve no purpose of their own. It is only by interpreting the data, and extracting higher level information that a scene can be understood. The algorithms that enable this process are often complex, and data-intensive, limiting the processing rate when implemented in software. Hardware-accelerated implementations provide a significant performance boost that can enable real- time processing. The Trace Transform is a newly proposed algorithm that has been proven effective in image categorisation and recognition tasks. It is flexibly defined allowing the mathematical details to be tailored to the target application. However, it is highly computationally intensive, which limits its applications. Modern heterogeneous FPGAs provide an ideal platform for accelerating the Trace transform for real-time performance, while also allowing an element of flexibility, which highly suits the generality of the Trace transform. This thesis details the implementation of an extensible Trace transform architecture for vision applications, before extending this architecture to a full flexible platform suited to the exploration of Trace transform applications. As part of the work presented, a general set of architectures for large-windowed median and weighted median filters are presented as required for a number of Trace transform implementations. Finally an acceleration of Pseudo 2-Dimensional Hidden Markov Model decoding, usable in a person detection system, is presented. Such a system can be used to extract frames of interest from a video sequence, to be subsequently processed by the Trace transform. All these architectures emphasise the need for considered, platform-driven design in achieving maximum performance through hardware acceleration
High-dimensional decoy-state quantum key distribution over 0.3 km of multicore telecommunication optical fibers
Multiplexing is a strategy to augment the transmission capacity of a
communication system. It consists of combining multiple signals over the same
data channel and it has been very successful in classical communications.
However, the use of enhanced channels has only reached limited practicality in
quantum communications (QC) as it requires the complex manipulation of quantum
systems of higher dimensions. Considerable effort is being made towards QC
using high-dimensional quantum systems encoded into the transverse momentum of
single photons but, so far, no approach has been proven to be fully compatible
with the existing telecommunication infrastructure. Here, we overcome such a
technological challenge and demonstrate a stable and secure high-dimensional
decoy-state quantum key distribution session over a 0.3 km long multicore
optical fiber. The high-dimensional quantum states are defined in terms of the
multiple core modes available for the photon transmission over the fiber, and
the decoy-state analysis demonstrates that our technique enables a positive
secret key generation rate up to 25 km of fiber propagation. Finally, we show
how our results build up towards a high-dimensional quantum network composed of
free-space and fiber based linksComment: Please see the complementary work arXiv:1610.01812 (2016
Doctor of Philosophy
dissertationIn-memory big data applications are growing in popularity, including in-memory versions of the MapReduce framework. The move away from disk-based datasets shifts the performance bottleneck from slow disk accesses to memory bandwidth. MapReduce is a data-parallel application, and is therefore amenable to being executed on as many parallel processors as possible, with each processor requiring high amounts of memory bandwidth. We propose using Near Data Computing (NDC) as a means to develop systems that are optimized for in-memory MapReduce workloads, offering high compute parallelism and even higher memory bandwidth. This dissertation explores three different implementations and styles of NDC to improve MapReduce execution. First, we use 3D-stacked memory+logic devices to process the Map phase on compute elements in close proximity to database splits. Second, we attempt to replicate the performance characteristics of the 3D-stacked NDC using only commodity memory and inexpensive processors to improve performance of both Map and Reduce phases. Finally, we incorporate fixed-function hardware accelerators to improve sorting performance within the Map phase. This dissertation shows that it is possible to improve in-memory MapReduce performance by potentially two orders of magnitude by designing system and memory architectures that are specifically tailored to that end
Estimation of power generation potential of nonwoody biomass species
In view of high energy potentials in non-woody biomass species and an increasing interest in their utilization for power generation, an attempt has been made in this study to assess the proximate analysis and energy content of different components of Sida rhombifolia, Xanthium strumarium, Anisomeles lamiaceae and Eupatorium coelestinum biomass species (both non-woody), and their impact on power generation and land requirement for energy plantations. The net energy content in Sida is the highest. Xanthium biomass species appears to have slightly higher calorific values in its components than those of Anisomeles and Eupatorium. The pattern of variation of calorific value in the components like stump, branch, leaf and bark is not identical for all the presently studied biomass species. In all these studied biomass species, the calorific values of leaves, in general, are after stump and branch. The data for proximate and ultimate analysis of the components of these species are very close to each other and hence it is very difficult to draw a concrete conclusion. However, it appears from the present work that Eupatorium and Xanthium biomass species have the highest fixed carbon and lowest volatile matter contents in their stumps than the stumps of the others. As for ash fusion temperature The Sida biomass species has the highest values of IDT, ST, HT and FT (7860- 1490˚C) for its ash, followed by Anisomeles (740-1441˚C). Xanthium and Eupatorium have lower values for IDT, ST, HT and FT (670- 1244˚C) for their ashes. The results have shown that approximately 4, 7, 6 and 2 hectares of land are required to generate 20,000 kWh/day electricity from Sida rhombifolia, Eupatorium coelestinum, Xanthium strumarium and Anisomeles lamiaceae biomass species. Coal samples, obtained from six different local mines, were also examined for their qualities and the results were compared with those of studied biomass materials. This comparison reveals much higher power output with negligible emission of suspended particulate matters (SPM) from biomass materials
A low power selective median filter design
A selective median filter which consumes less power has been designed and different logics for majority bit evaluation has been applied and simulated in VHDL .It is rightly called as selective because an edge pixel detector [2] has been used to select those pixels which are to be processed through median filter. As for median value calculation; sorting of 3 x 3 window’s pixel values has been done using majority bit circuit [4].Different majority bit calculation method has been implemented and the result sorting circuit has been analyzed for power analysis. In this work a general median filter which uses binary sorting method known as Majority Voting Circuit (MVC) has been designed using VHDL and optimized using SYNOPSIS which has used 0.13μm CMOS technology .The digital design of sorting circuit saves approximately 60% of power comprising of cell leakage and dynamic power comparing to a mixed signal design of Floating gate based Majority bit median filter [4]. Before operating median filter on each pixel double derivative filter [2] has been applied to check whether it is an edge pixel or not. Overall this is a digital design of a mixed filter which preserves edges and removes noises as well.Low power techniques at logic level and algorithmic level have been embedded into this work. In our work we have also designed a small microprocessor using VHDL code. Later a memory (for the purpose of image storing) based Control Unit for single median value evaluation has been designed and simulated in XILINX. Here for sorting circuit a common logic based circuit (component) has been put forward. The power, latency or delay, area of whole design has been compared and tested with other designs