
    Window size and round-trip-time in a network transmission session

    A transmission session in a network is the period during which data are transported from one communicating node to another. A transmission session is always established end-to-end and involves many network resources. Previous studies of smooth data flow across a network reveal that the maximum amount of data in an optimal transmission session is associated with the window size. Problems remain concerning the rate at which data move in a transmission session and the required window size, both of which should be controlled dynamically and automatically. This research investigates the effect of window size and round-trip time (RTT) on a transmission session. Packet data were collected for many network transmission sessions; the raw data were normalized, and the Naïve Bayes technique was used for the analytical evaluation. Examining the effect of window size and RTT on a transmission session reveals that the rate at which data move can be controlled dynamically to a considerably high degree of accuracy, so that no network node is overwhelmed when the window size is adjusted to the required size.
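The abstract names Gaussian-style Naïve Bayes over normalized window-size/RTT features but gives no dataset or code. A minimal sketch of that technique, with entirely hypothetical, normalized (window_size, rtt) samples and made-up class labels:

```python
import math

def fit_gaussian_nb(samples, labels):
    """Estimate class prior and per-feature mean/variance (Gaussian Naive Bayes)."""
    model = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        n = len(rows)
        stats = []
        for f in range(len(rows[0])):
            col = [r[f] for r in rows]
            mean = sum(col) / n
            var = sum((x - mean) ** 2 for x in col) / n + 1e-9  # variance smoothing
            stats.append((mean, var))
        model[c] = (n / len(samples), stats)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    best, best_lp = None, -math.inf
    for c, (prior, stats) in model.items():
        lp = math.log(prior)
        for xi, (mean, var) in zip(x, stats):
            lp += -0.5 * math.log(2 * math.pi * var) - (xi - mean) ** 2 / (2 * var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Hypothetical normalized (window_size, rtt) samples, labeled by whether the
# session sustained smooth flow ("ok") or stalled ("congested").
train_x = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.25), (0.3, 0.8), (0.2, 0.9), (0.25, 0.7)]
train_y = ["ok", "ok", "ok", "congested", "congested", "congested"]
nb = fit_gaussian_nb(train_x, train_y)
print(predict(nb, (0.85, 0.22)))  # a large window with low RTT
```

The classifier only illustrates the general approach; the paper's actual feature set and preprocessing are not specified in the abstract.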

    Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

    Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs) and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To help determine which embedded platform is most suitable for a given application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study covers three commonly used hardware accelerators for embedded vision applications: the ARM57 CPU, the Jetson TX2 GPU, and the ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks, and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× over the others for simple kernels, while for more complicated kernels and complete vision pipelines the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. We also observe that the FPGA performs increasingly better as a vision application's pipeline complexity grows.
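The energy/frame metric the benchmark reports reduces to average power divided by throughput. A small sketch of that arithmetic, using invented numbers rather than the paper's measurements:

```python
def energy_per_frame_mj(avg_power_w, fps):
    """Energy per frame in millijoules: average power divided by throughput."""
    return avg_power_w / fps * 1000.0

def reduction_ratio(baseline_mj, candidate_mj):
    """How many times less energy the candidate spends per frame."""
    return baseline_mj / candidate_mj

# Hypothetical numbers, not from the paper: a CPU kernel drawing 5 W at 30 fps
# versus an FPGA implementation drawing 4 W at 60 fps.
cpu = energy_per_frame_mj(5.0, 30)    # ~166.7 mJ/frame
fpga = energy_per_frame_mj(4.0, 60)   # ~66.7 mJ/frame
print(round(reduction_ratio(cpu, fpga), 2))  # 2.5x in the FPGA's favor
```

The example shows why a slower but lower-power device can still lose on energy/frame: throughput enters the metric as strongly as power does.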

    Boosting the hardware-efficiency of cascade support vector machines for embedded classification applications

    Support Vector Machines (SVMs) are considered a state-of-the-art classification algorithm capable of high accuracy rates across a diverse range of applications. When arranged in a cascade structure, SVMs can efficiently handle problems where the majority of data belongs to one of the two classes, such as image object classification, and hence can provide speedups over monolithic (single) SVM classifiers. However, the SVM classification process is still computationally demanding due to the number of support vectors. Consequently, in this paper we propose a hardware architecture optimized for cascaded SVM processing to boost performance and hardware efficiency, along with a hardware reduction method that reduces the overheads from implementing additional stages in the cascade, leading to significant resource and power savings. The architecture was evaluated for the application of object detection on 800×600-resolution images on a Spartan-6 Industrial Video Processing FPGA platform, achieving over 30 frames per second. Moreover, by utilizing the proposed hardware reduction method we were able to reduce the utilization of FPGA custom-logic resources by ∼30%, and simultaneously observed ∼20% peak power reduction compared to a baseline implementation.
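The speedup of a cascade comes from early rejection: cheap initial stages discard most majority-class (background) candidates, so the expensive later stages run rarely. A minimal software sketch of that control flow with linear stages and made-up weights (the paper's stages and kernel are not specified here):

```python
def linear_svm_score(weights, bias, x):
    """Decision value of a linear SVM: w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def cascade_classify(stages, x):
    """Run SVM stages in order; reject as soon as any stage scores negative.
    Only candidates that pass every stage are labeled positive, so the cheap
    early stages filter out most of the majority (background) class quickly."""
    for weights, bias in stages:
        if linear_svm_score(weights, bias, x) < 0:
            return 0  # rejected early; later stages never evaluated
    return 1

# Hypothetical two-stage cascade over 2-D feature vectors.
stages = [
    ([1.0, 0.0], -0.2),   # cheap first stage: accept if x[0] > 0.2
    ([0.5, 0.5], -0.6),   # stricter second stage
]
print(cascade_classify(stages, [0.9, 0.8]))  # passes both stages -> 1
print(cascade_classify(stages, [0.1, 0.9]))  # rejected by stage 1 -> 0
```

In the paper's hardware setting the same idea holds, but the proposed reduction method additionally shares resources between stages rather than instantiating each one fully.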

    Reconfigurable Devices in Image Processing – Application to Lane Detection

    The Field-Programmable Gate Array (FPGA) has been increasingly explored and investigated as a platform for prototyping and system implementation in a wide range of areas, including image processing and computer vision, since its massively parallel architecture provides numerous benefits in performance, cost, and power consumption compared to traditional processors. The main objective of this dissertation is to develop and implement an FPGA-based lane detection application that can identify different types of lane lines, namely the lines of the lane in which the car is located, the lines of the potential lanes, and the vertical lane-transition lines when the car moves from one lane to another. The intention is to create an application that covers all involved processes: video data acquisition, the processing of the data itself, and the presentation of the final result on a monitor. The developed methods and functions were validated using the C and C++ programming languages, which remain among the most popular. Several C libraries were also used, the Video library of the Vivado HLS development environment being the most used; its video processing functions are compatible with existing OpenCV functions. For the implementation of the system, a Digilent Embedded Vision Bundle was used, consisting of a Zybo Z7-20 board together with a Pcam 5C image module.
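Lane-line detection pipelines of this kind are typically built from per-pixel streaming kernels that HLS tools pipeline efficiently. As an illustration only (not the dissertation's HLS code; the frame is synthetic), here is a pure-Python horizontal Sobel filter, a common first stage for finding vertical lane markings:

```python
def sobel_x(img):
    """Horizontal-gradient Sobel filter: the kind of 3x3 per-pixel kernel
    that maps well onto an FPGA streaming pipeline."""
    h, w = len(img), len(img[0])
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):          # skip the 1-pixel border
        for x in range(1, w - 1):
            out[y][x] = sum(kx[dy + 1][dx + 1] * img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

# Synthetic 5x5 frame with a bright vertical "lane line" in column 2.
frame = [[255 if x == 2 else 0 for x in range(5)] for _ in range(5)]
grad = sobel_x(frame)
print(grad[2])  # strong opposite-sign responses on either side of the line
```

In an HLS flow, the same kernel would be expressed with line buffers and window registers (as the Vivado HLS Video library does) so that one output pixel is produced per clock cycle.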

    Fast and Scalable Architectures and Algorithms for the Computation of the Forward and Inverse Discrete Periodic Radon Transform with Applications to 2D Convolutions and Cross-Correlations

    The Discrete Radon Transform (DRT) is an essential component of a wide range of applications in image processing, e.g. image denoising, image restoration, texture analysis, line detection, encryption, compressive sensing, and reconstructing objects from projections in computed tomography and magnetic resonance imaging. A popular method to obtain the DRT, or its inverse, involves the use of the Fast Fourier Transform, with the inherent approximation/rounding errors and increased hardware complexity due to the need for floating-point arithmetic implementations. An alternative implementation of the DRT is through the use of the Discrete Periodic Radon Transform (DPRT). The DPRT also exhibits discrete versions of the properties of the continuous-space Radon Transform, including the Fourier Slice Theorem and the convolution property. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions, O(N^3), and the need for a large number of memory accesses. This PhD dissertation introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: (i) a parallel array of fixed-point adder trees, (ii) circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees, and (iii) an image block-based approach to DPRT computation that can fit the proposed architecture to available resources; as a result, for an NxN image (N prime), the proposed approach can compute up to N^2 additions per clock cycle. Compared to previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For the fastest case, I introduce optimized architectures that can compute the DPRT and its inverse in just 2N + ceil(log2 N) + 1 and 2N + 3*ceil(log2 N) + B + 2 clock cycles respectively, where B is the number of bits used to represent each input pixel.
In comparison, the prior state-of-the-art method required N^2 + N + 1 clock cycles to compute the forward DPRT. For systems with limited resources, the resource usage can be reduced to O(N) with a running time of ceil(N/2)(N + 9) + N + 2 for the forward DPRT and ceil(N/2)(N + 2) + 3*ceil(log2 N) + B + 4 for the inverse. The results also have important applications in the computation of fast convolutions and cross-correlations for large and non-separable kernels. For this purpose, I introduce fast algorithms and scalable architectures to compute 2-D linear convolutions/cross-correlations using the convolution property of the DPRT and fixed-point arithmetic to reduce the 2-D problem to a 1-D problem. An alternative system is also proposed for non-separable kernels with low rank, using the LU decomposition. As a result, for implementations with sufficient resources, for an image and convolution kernel of size PxP, linear convolutions/cross-correlations can be computed in just 6N + 4*log2(N) + 17 clock cycles for N = 2P-1. Finally, I also propose parallel algorithms to compute the forward and inverse DPRT using Graphics Processing Units (GPUs) and multi-core CPUs. The proposed algorithms are implemented on an Nvidia Maxwell GM204 GPU with 2048 cores @ 1367 MHz, 348 KB of L1 cache (24 KB per multiprocessor), 2048 KB of L2 cache (512 KB per memory controller), and 4 GB of device memory, and compared against a serial implementation on an Intel Xeon E5-2630 CPU with 8 physical cores (16 logical processors via hyper-threading) @ 3.2 GHz, 512 KB of L1 cache (32 KB instruction cache and 32 KB data cache per core), 2 MB of L2 cache (256 KB per core), 20 MB of L3 cache (shared among all cores), and 32 GB of system memory. For the CPU, there is a tenfold speedup using 16 logical cores versus a single-core serial implementation. For the GPU, there is a 715-fold speedup compared to the serial implementation. For real-time applications, on a 1021x1021 image the forward DPRT takes 11.5 ms and the inverse 11.4 ms.
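To make the object of all this hardware concrete, here is a minimal software sketch of one common forward-DPRT convention for prime N (wrapped line sums plus one extra projection; this is an illustration of the transform's definition, not the dissertation's fixed-point architecture):

```python
def dprt(img):
    """Forward Discrete Periodic Radon Transform of an N x N image, N prime.
    One common convention: projection m sums the image along the wrapped
    lines i = (d + m*j) mod N, plus one extra row-sum projection, giving
    N + 1 projections of N entries each."""
    n = len(img)
    proj = [[0] * n for _ in range(n + 1)]
    for m in range(n):
        for d in range(n):
            proj[m][d] = sum(img[(d + m * j) % n][j] for j in range(n))
    for d in range(n):
        proj[n][d] = sum(img[d][j] for j in range(n))  # the extra projection
    return proj

# Tiny 3x3 example (N = 3 is prime).
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
r = dprt(img)
total = sum(map(sum, img))  # 45
# Each projection visits every pixel exactly once, so all N+1 rows sum to 45.
print(all(sum(row) == total for row in r))  # True
```

The triple loop makes the O(N^3) addition count visible; the dissertation's contribution is an adder-tree/shift-register architecture that performs up to N^2 of those additions per clock cycle.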