3,618 research outputs found

    Background subtraction and transient timing with Bayesian Blocks

    Full text link
    Aims: To incorporate background subtraction into the Bayesian Blocks algorithm so that transient events can be timed accurately and precisely even in the presence of a substantial, rapidly variable background. Methods: We developed several modifications to the algorithm and tested them on a simulated XMM-Newton observation of a bursting and eclipsing object. Results: We found that bursts can be timed to good precision by almost all background subtraction methods, but eclipse ingresses and egresses present problems for most of them. We found one method that recovered these events with precision comparable to the interval between individual photons: both source and background region photons are combined into a single list and weighted according to the exposure area. We also found that adjusting the Bayesian Blocks change points nearer to blocks with higher count rate removes a systematic bias towards blocks of low count rate. Comment: 10 pages, 13 figures, 1 table
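
    A minimal sketch of the combined-list weighting described above (our own illustration, not the paper's code; down-weighting background photons by the negative area ratio is our assumption of one plausible convention):

```python
import numpy as np

def merge_weighted_photons(t_src, t_bkg, area_src, area_bkg):
    """Merge source- and background-region photon arrival times into one
    time-ordered list; background photons carry negative weight scaled by
    the extraction-area ratio, so a block's weighted sum estimates the net
    (background-subtracted) source counts."""
    times = np.concatenate([t_src, t_bkg])
    weights = np.concatenate([
        np.ones(len(t_src)),                           # source-region photons
        -(area_src / area_bkg) * np.ones(len(t_bkg)),  # scaled background
    ])
    order = np.argsort(times)
    return times[order], weights[order]
```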

    An FPGA-based infant monitoring system

    Get PDF
    We have designed an automated visual surveillance system for monitoring sleeping infants. The low-level image processing is implemented on an embedded Xilinx Virtex-II XC2V6000 FPGA and quantifies the level of scene activity using a specially designed background subtraction algorithm. We present our algorithm and show how we have optimised it for this platform.
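
    For context, a generic running-average background subtraction in Python (a stand-in sketch; the paper's specially designed, FPGA-optimised algorithm is not reproduced in the abstract):

```python
import numpy as np

def scene_activity(frame, background, alpha=0.05, thresh=25):
    """Flag pixels that deviate from a running-average background model and
    report the fraction of changed pixels as the scene activity level."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    foreground = diff > thresh                    # pixels that changed
    # Slowly adapt the background model toward the current frame.
    background[:] = ((1 - alpha) * background + alpha * frame).astype(np.uint8)
    return foreground.mean(), foreground
```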

    Accelerating radio transient detection using the Bispectrum algorithm and GPGPU

    Get PDF
    Modern radio interferometers such as those in the Square Kilometre Array (SKA) project are powerful tools for discovering completely new classes of astronomical phenomena. Among these phenomena are radio transients: bursts of electromagnetic radiation. Transients are an exciting area of research because localizing pulsars (transient emitters) allows physicists to test and formulate theories of strong gravitational fields. Current methods for detecting transients require an image of the sky to be produced at every time step; as interferometers grow, the data volumes they deliver make this imaging computationally infeasible. Law and Bower (2012) formulated a different approach using a closure quantity known as the "bispectrum": the product of visibilities around a closed loop of antennas. The proposed algorithm has been shown to be easily parallelized and well suited to graphics processing units (GPUs). Recent advances in many-core technology such as GPUs have brought significant performance gains to many scientific applications, but a GPU implementation of the bispectrum algorithm had yet to be explored. In this thesis, we present a number of implementations of the bispectrum algorithm exploiting both instruction-level and data-level parallelism. First, a multi-threaded CPU version is developed in C++ using OpenMP and then compared to a GPU version developed using the Compute Unified Device Architecture (CUDA). To verify the validity of the implementations, they were first run on simulated data created with MeqTrees, a simulation tool developed within the SKA project; data from the Karl G. Jansky Very Large Array (JVLA) containing the B0355+54 pulsar was then used to test them on real data. This research concludes that the bispectrum algorithm is well suited to both CPU and GPU implementations: we achieved a 3.2x speed-up with a 4-core multi-threaded CPU implementation over a single-threaded implementation, and the GPU implementation on a GTX 670 achieved roughly a 20x speed-up over the multi-threaded CPU implementation. These results show that the bispectrum algorithm opens the door to efficient transient surveys suitable for modern data-intensive radio interferometers.
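
    A small sketch of the bispectrum statistic itself (our illustration, not the thesis code; `vis` is an assumed antenna-by-antenna matrix of complex visibilities for one time step):

```python
import numpy as np
from itertools import combinations

def mean_bispectrum(vis, n_ant):
    """Average the closure product over all antenna triplets; a bright
    point-source transient raises this statistic without any imaging."""
    total = 0.0 + 0.0j
    triplets = list(combinations(range(n_ant), 3))
    for i, j, k in triplets:
        # Closed loop i -> j -> k -> i; the conjugate closes the loop.
        total += vis[i, j] * vis[j, k] * np.conj(vis[i, k])
    return total / len(triplets)
```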

    System-on-a-Chip (SoC)-Based Hardware Acceleration for Human Action Recognition with Core Components

    Get PDF
    Today, the implementation of machine vision algorithms on embedded platforms or in portable systems is growing rapidly due to the demand for machine vision in daily human life. Among the applications of machine vision, human action and activity recognition has become an active research area, and market demand for integrated smart security systems is growing rapidly. Among the available approaches, embedded vision is in the top tier; however, current embedded platforms may not be able to fully exploit the potential performance of machine vision algorithms, especially in terms of low power consumption. Complex algorithms can impose immense computation and communication demands, especially action recognition algorithms, which require various stages of preprocessing, processing and machine learning blocks that need to operate concurrently. The market demands embedded platforms that operate with a power consumption of only a few watts. Attempts have been made to improve the performance of traditional embedded approaches by adding more powerful processors; this solution may solve the computation problem but increases the power consumption. System-on-a-chip field-programmable gate arrays (SoC-FPGAs) have emerged as a major architectural approach for improving power efficiency while increasing computational performance. In a SoC-FPGA, an embedded processor and an FPGA serving as an accelerator are fabricated in the same die to simultaneously improve power consumption and performance. Still, current SoC-FPGA-based vision implementations either shy away from supporting complex and adaptive vision algorithms or operate at very limited resolutions due to the immense communication and computation demands. The aim of this research is to develop a SoC-based hardware acceleration workflow for the realization of advanced vision algorithms. Hardware acceleration can improve performance for highly complex mathematical calculations or repeated functions, so the performance of a SoC system can be improved by accelerating the elements that incur the highest performance overhead. The outcome of this research could be used for the implementation of various vision algorithms, such as face recognition, object detection or object tracking, on embedded platforms. The contributions of SoC-based hardware acceleration for hardware-software codesign platforms include the following: (1) development of frameworks for complex human action recognition in both 2D and 3D; (2) realization of a framework with four main implemented IPs, namely, foreground and background subtraction (foreground probability), human detection, 2D/3D point-of-interest detection and feature extraction, and OS-ELM as the machine learning algorithm for action identification; (3) use of an FPGA-based hardware acceleration method to resolve system bottlenecks and improve system performance; and (4) measurement and analysis of system specifications, such as the acceleration factor, power consumption, and resource utilization. Experimental results show that the proposed SoC-based hardware acceleration approach provides better performance in terms of the acceleration factor, resource utilization and power consumption than recent comparable works. In addition, a comparison of the accuracy of the framework running on the proposed embedded platform (SoC-FPGA) with that of other PC-based frameworks shows that the proposed approach outperforms most of them.
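
    A compact sketch of OS-ELM, the classifier named in contribution (2), in its standard textbook form (layer sizes and names are ours, not the thesis implementation):

```python
import numpy as np

class OSELM:
    """Online Sequential Extreme Learning Machine: a random fixed hidden
    layer with output weights updated by recursive least squares, so new
    data chunks are learned without retraining on old data."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_hidden))  # fixed input weights
        self.b = rng.standard_normal(n_hidden)          # fixed biases
        self.beta = np.zeros((n_hidden, n_out))         # learned output weights
        self.P = None                                   # inverse correlation matrix

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)             # random feature map

    def init_batch(self, X0, T0):
        # The initial batch needs at least n_hidden samples so that
        # H^T H is invertible.
        H = self._hidden(X0)
        self.P = np.linalg.inv(H.T @ H)
        self.beta = self.P @ H.T @ T0

    def update(self, X, T):
        # Sequential chunk update; previously seen data is never revisited.
        H = self._hidden(X)
        S = np.linalg.inv(np.eye(len(X)) + H @ self.P @ H.T)
        self.P -= self.P @ H.T @ S @ H @ self.P
        self.beta += self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```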

    CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

    Full text link
    Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite the positive effects of normalization in other domains. We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the necessity of target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for two different distributions, as present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks.
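
    A rough sketch of the cross-normalization idea as we read it from the abstract (not the authors' code; the mixture weight `alpha` is our assumption):

```python
import numpy as np

def cross_norm(feat_on, feat_off, alpha=0.5, eps=1e-5):
    """Normalize both feature batches with shared statistics computed over
    a mixture of on-policy and off-policy (replay buffer) transitions.

    feat_on:  features of states from the current policy's rollouts
    feat_off: features of states sampled from the replay buffer
    """
    mean = alpha * feat_on.mean(axis=0) + (1 - alpha) * feat_off.mean(axis=0)
    var = alpha * feat_on.var(axis=0) + (1 - alpha) * feat_off.var(axis=0)
    scale = 1.0 / np.sqrt(var + eps)   # re-center and re-scale both batches
    return (feat_on - mean) * scale, (feat_off - mean) * scale
```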

    Modeling and Mapping of Optimized Schedules for Embedded Signal Processing Systems

    Get PDF
    The demand for Digital Signal Processing (DSP) in embedded systems has been increasing rapidly due to the proliferation of multimedia- and communication-intensive devices such as pervasive tablets and smart phones. Efficient implementation of embedded DSP systems requires integration of diverse hardware and software components, as well as dynamic workload distribution across heterogeneous computational resources. The former implies increased complexity of application modeling and analysis, but also brings enhanced potential for achieving improved energy consumption, cost or performance. The latter results from the increased use of dynamic behavior in embedded DSP applications. Furthermore, parallel programming is highly relevant in many embedded DSP areas due to the development and use of Multiprocessor System-On-Chip (MPSoC) technology. The need for efficient cooperation among different devices supporting diverse parallel embedded computations motivates high-level modeling that expresses dynamic signal processing behaviors and supports efficient task scheduling and hardware mapping. Starting with dynamic modeling, this thesis develops a systematic design methodology that supports functional simulation and hardware mapping of dynamic reconfiguration based on Parameterized Synchronous Dataflow (PSDF) graphs. By building on the DIF (Dataflow Interchange Format), which is a design language and associated software package for developing and experimenting with dataflow-based design techniques for signal processing systems, we have developed a novel tool for functional simulation of PSDF specifications. This simulation tool allows designers to model applications in PSDF and simulate their functionality, including use of the dynamic parameter reconfiguration capabilities offered by PSDF. With the help of this simulation tool, our design methodology helps to map PSDF specifications into efficient implementations on field programmable gate arrays (FPGAs). Furthermore, valid schedules can be derived from the PSDF models at runtime to adapt hardware configurations based on changing data characteristics or operational requirements. Under certain conditions, efficient quasi-static schedules can be applied to reduce overhead and enhance predictability in the scheduling process. Motivated by the fact that scheduling is critical to performance and to efficient use of dynamic reconfiguration, we have focused on a methodology for schedule design, which complements the emphasis on automated schedule construction in the existing literature on dataflow-based design and implementation. In particular, we have proposed a dataflow-based schedule design framework called the dataflow schedule graph (DSG), which provides a graphical framework for schedule construction based on dataflow semantics, and can also be used as an intermediate representation target for automated schedule generation. Our approach to applying the DSG in this thesis emphasizes schedule construction as a design process rather than an outcome of the synthesis process. Our approach employs dataflow graphs for representing both application models and schedules that are derived from them. By providing a dataflow-integrated framework for unambiguously representing, analyzing, manipulating, and interchanging schedules, the DSG facilitates effective codesign of dataflow-based application models and schedules for execution of these models. 
As multicore processors are deployed in an increasing variety of embedded image processing systems, effective utilization of resources such as multiprocessor system-on-chip (MPSoC) devices, and effective handling of implementation concerns such as memory management and I/O, become critical to developing efficient embedded implementations. However, the diversity and complexity of applications and architectures in embedded image processing systems make the mapping of applications onto MPSoCs difficult. We help to address this challenge through a structured design methodology that is built upon the DSG modeling framework. We refer to this methodology as the DEIPS methodology (DSG-based design and implementation of Embedded Image Processing Systems). The DEIPS methodology provides a unified framework for joint consideration of DSG structures and the application graphs from which they are derived, which allows designers to integrate considerations of parallelization and resource constraints together with the application modeling process. We demonstrate the DEIPS methodology through case studies on practical embedded image processing systems.
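
To make the scheduling vocabulary concrete, here is a toy synchronous-dataflow example (our own sketch, unrelated to the DIF/DSG tooling; the actors and token rates are hypothetical):

```python
from collections import deque

# Edges of a tiny dataflow graph A -> B -> C, modeled as FIFO buffers.
buffers = {"A->B": deque(), "B->C": deque()}

def fire_A():
    buffers["A->B"].append(1)          # A produces one token per firing

def fire_B():
    x = buffers["A->B"].popleft()      # B consumes one token...
    buffers["B->C"].extend([x, x])     # ...and produces two

def fire_C():
    print(buffers["B->C"].popleft())   # C consumes one token per firing

# A valid static schedule balances production and consumption on every
# edge: one firing of A and B, then two of C, repeated as needed.
schedule = [fire_A, fire_B, fire_C, fire_C]
for actor in schedule * 2:
    actor()
```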

    An efficient sparse conjugate gradient solver using a Beneš permutation network

    Get PDF
    The conjugate gradient (CG) method is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity of the memory access pattern. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to compute efficiently both sparse matrix-vector multiplication and other dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristic for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach.
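
    For reference, the textbook CG iteration (not the paper's FPGA design): each iteration is dominated by one sparse matrix-vector product, the irregular-access step the proposed architecture targets.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                 # the irregular-memory-access SpMV step
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # conjugate direction update
        rs = rs_new
    return x
```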

    Performance Optimization of Memory Intensive Applications on FPGA Accelerator

    Get PDF
    The abstract is in the attachment.

    DSpot: Test Amplification for Automatic Assessment of Computational Diversity

    Full text link
    Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but exhibit behavioral differences under certain conditions, is essential for fault tolerance and security. Objective: We aim to propose an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test suite) and (iii) exhibiting observable differences when they run outside the specified input space. Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain. Results: We run our experiments on 472 variants of 7 classes from large, thoroughly tested open-source Java projects. Our test amplification multiplies the number of input points in the test suite by ten and is effective at detecting software diversity. Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; and the behavioral diversity that we observe originates from areas of the code characterized by their flexibility (caching, checking, formatting, etc.). Comment: 12 pages
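
    A toy Python sketch of the input-exploration idea (DSpot itself transforms Java test cases; `observe` and this literal-perturbation transform are hypothetical stand-ins):

```python
import ast
import random

def amplify_numeric_literals(test_source, seed=0):
    """Return a variant of a test with its integer literals perturbed,
    pushing inputs outside the originally specified input domain."""
    rng = random.Random(seed)
    tree = ast.parse(test_source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, int)
                and not isinstance(node.value, bool)):
            node.value += rng.choice([-10, -1, 1, 10])
    return ast.unparse(tree)  # requires Python 3.9+

# The amplified variant is then executed while its observable outputs are
# recorded, rather than reusing the original assertions.
print(amplify_numeric_literals("observe(add(2, 3))"))
```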

    Accelerating noninvasive transmural electrophysiological imaging with CUDA

    Get PDF
    The human heart is a vital muscle of the body, and abnormalities in the heart can disrupt its normal operation. One such abnormality, affecting the middle layer of the heart wall (the myocardium), is myocardial scarring. As with any tissue in the body, damage to healthy tissue triggers scar tissue to form. Normally this scar tissue is benign, but myocardial scars can disrupt the heart's normal operation by changing the electrical properties of the myocardium; they are the most common cause of ventricular arrhythmia and sudden cardiac death. Leading-edge research has developed a technique called Noninvasive Transmural Electrophysiological Imaging (NTEPI) to help diagnose myocardial scars. However, NTEPI is hindered by its high computational requirements. Due to the parallel nature of NTEPI, graphics processing units (GPUs) equipped with the Compute Unified Device Architecture (CUDA) by Nvidia can be leveraged to accelerate it. GPUs were chosen over other alternatives because they are ubiquitous in the hospitals and medical offices where NTEPI will be used. This project accelerated NTEPI with CUDA. First, NTEPI was profiled to determine where most of the time was spent; this information was used to choose which functions to accelerate with CUDA. The accelerated NTEPI algorithm was tested for accuracy by comparing the outputs of the baseline CPU version to those of the CUDA version. Lastly, the CUDA-accelerated NTEPI algorithm was profiled on three GPUs with different costs and features, both to determine whether any bottlenecks existed in the accelerated algorithm and to identify the CUDA specifications that achieve the highest NTEPI performance with and without cost as a factor.