
    Memory system support for image processing

    Processor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor, introducing an extra level of address translation in the memory controller. Data that is sparse in memory can be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefetching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system.
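    The core idea above can be illustrated with a minimal Python sketch (illustrative only, not the Impulse API): a sparse strided access, such as reading one column of a row-major matrix, is gathered into a dense stream, which is what an Impulse-style controller does in hardware so the CPU sees fully utilized cache lines.

```python
# Illustrative sketch (not the Impulse API): remapping a strided column
# access into a dense stream so each cache line would be fully used.

def gather_column(matrix, ncols, col, nrows):
    """Gather one column of a flat row-major matrix into a dense list.

    On Impulse-like hardware the memory controller performs this gather,
    so the processor issues one dense, cache-friendly access pattern
    instead of nrows widely spaced loads.
    """
    return [matrix[r * ncols + col] for r in range(nrows)]

# 4x4 row-major matrix stored as a flat list
flat = list(range(16))
dense = gather_column(flat, ncols=4, col=2, nrows=4)
print(dense)  # column 2: [2, 6, 10, 14]
```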

    Concurrent Image Processing Executive (CIPE)

    The design and implementation of a Concurrent Image Processing Executive (CIPE), which is intended to become the support system software for a prototype high performance science analysis workstation, are discussed. The target machine for this software is a JPL/Caltech Mark IIIfp Hypercube hosted by either a MASSCOMP 5600 or a Sun-3 or Sun-4 workstation; however, the design will accommodate other concurrent machines of similar architecture, i.e., local-memory, multiple-instruction-multiple-data (MIMD) machines. The CIPE system provides both a multimode user interface and an applications programmer interface, and has been designed around four loosely coupled modules: (1) user interface, (2) host-resident executive, (3) hypercube-resident executive, and (4) application functions. The loose coupling between modules allows modification of a particular module without significantly affecting the other modules in the system. In order to enhance hypercube memory utilization and to allow expansion of image processing capabilities, a specialized program management method, incremental loading, was devised. To minimize data transfer between host and hypercube, a data management method which distributes, redistributes, and tracks data set information was implemented.

    Concurrent Image Processing Executive (CIPE). Volume 1: Design overview

    The design and implementation of a Concurrent Image Processing Executive (CIPE), which is intended to become the support system software for a prototype high performance science analysis workstation, are described. The target machine for this software is a JPL/Caltech Mark 3fp Hypercube hosted by either a MASSCOMP 5600 or a Sun-3 or Sun-4 workstation; however, the design will accommodate other concurrent machines of similar architecture, i.e., local-memory, multiple-instruction-multiple-data (MIMD) machines. The CIPE system provides both a multimode user interface and an applications programmer interface, and has been designed around four loosely coupled modules: user interface, host-resident executive, hypercube-resident executive, and application functions. The loose coupling between modules allows modification of a particular module without significantly affecting the other modules in the system. In order to enhance hypercube memory utilization and to allow expansion of image processing capabilities, a specialized program management method, incremental loading, was devised. To minimize data transfer between host and hypercube, a data management method which distributes, redistributes, and tracks data set information was implemented. The data management also allows data sharing among application programs. The CIPE software architecture provides a flexible environment for scientific analysis of complex remote sensing image data, such as planetary data and imaging spectrometry, utilizing state-of-the-art concurrent computation capabilities.
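    The incremental-loading idea described above can be sketched in a few lines of Python (a hypothetical analogue, not CIPE code): application functions are brought into scarce node memory only on first use, rather than being linked into one monolithic hypercube executable.

```python
# Hypothetical sketch of "incremental loading": application functions are
# materialized in (scarce) node memory only when first requested, instead
# of loading every image processing function up front.

class IncrementalLoader:
    def __init__(self, registry):
        self.registry = registry   # name -> factory that builds the function
        self.loaded = {}           # functions currently resident in memory

    def call(self, name, *args):
        if name not in self.loaded:              # load on first use only
            self.loaded[name] = self.registry[name]()
        return self.loaded[name](*args)

loader = IncrementalLoader(
    {"negate": lambda: (lambda img: [255 - p for p in img])}
)
print(loader.call("negate", [0, 128, 255]))  # [255, 127, 0]
```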

    A PC-based multispectral scanner data evaluation workstation: Application to Daedalus scanners

    In late 1989, a personal computer (PC)-based data evaluation workstation was developed to support post-flight processing of Multispectral Atmospheric Mapping Sensor (MAMS) data. The MAMS Quick View System (QVS) is an image analysis and display system designed to provide the capability to evaluate Daedalus scanner data immediately after an aircraft flight. Even in its original form, the QVS offered the portability of a personal computer with the advanced analysis and display features of a mainframe image analysis system. It was recognized, however, that the original QVS had its limitations, both in speed and in processing of MAMS data. Recent efforts are presented that focus on overcoming earlier limitations and adapting the system to a new data tape structure. In doing so, the enhanced Quick View System (QVS2) will accommodate data from any of the four spectrometers used with the Daedalus scanner on the NASA ER2 platform. The QVS2 is designed around an AST 486/33 MHz CPU personal computer and comes with 10 EISA expansion slots, a keyboard, and 4.0 Mbytes of memory. Specialized PC-McIDAS software provides the main image analysis and display capability for the system, and image analysis and display of the digital scanner data is accomplished with this software.

    Microcomputer-based artificial vision support system for real-time image processing for camera-driven visual prostheses

    It is difficult to predict exactly what blind subjects with camera-driven visual prostheses (e.g., retinal implants) can perceive. Thus, it is prudent to offer them a wide variety of image processing filters and the capability to engage these filters repeatedly in any user-defined order to enhance their visual perception. To attain true portability, we employ a commercial off-the-shelf battery-powered general purpose Linux microprocessor platform to create the microcomputer-based artificial vision support system (µAVS^2) for real-time image processing. Truly standalone, µAVS^2 is smaller than a deck of playing cards, lightweight, fast, and equipped with USB, RS-232 and Ethernet interfaces. Image processing filters on µAVS^2 operate in a user-defined linear sequential-loop fashion, resulting in vastly reduced memory and CPU requirements during execution. µAVS^2 imports raw video frames from a USB or IP camera, performs image processing, and issues the processed data over an outbound Internet TCP/IP or RS-232 connection to the visual prosthesis system. Hence, µAVS^2 affords users of current and future visual prostheses independent mobility and the capability to customize the visual perception generated. Additionally, µAVS^2 can easily be reconfigured for other prosthetic systems. Testing of µAVS^2 with actual retinal implant carriers is envisioned in the near future.
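    The sequential-loop filter architecture described above can be sketched as follows (an illustrative Python analogue with made-up filters, not the µAVS^2 implementation): each frame passes through the user-chosen filters strictly in order, so only one working frame needs to be held in memory at a time.

```python
# Illustrative sketch of a user-defined sequential filter loop: each frame
# flows through the selected filters in order, so only a single working
# frame buffer is needed (the property that keeps memory/CPU needs low).

def threshold(frame, t=128):
    """Binarize: map pixels >= t to white, the rest to black."""
    return [255 if p >= t else 0 for p in frame]

def invert(frame):
    """Photographic negative of an 8-bit grayscale frame."""
    return [255 - p for p in frame]

def run_pipeline(frame, filters):
    for f in filters:          # strictly sequential, user-defined order
        frame = f(frame)
    return frame

out = run_pipeline([10, 200, 130], [threshold, invert])
print(out)  # [255, 0, 0]
```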

    Designing Mixed Criticality Applications on Modern Heterogeneous MPSoC Platforms

    Multiprocessor Systems-on-Chip (MPSoC) integrating hard processing cores with programmable logic (PL) are becoming increasingly common. While these platforms were originally designed for high performance computing applications, their rich feature set can be exploited to efficiently implement mixed criticality domains serving both critical hard real-time tasks and soft real-time tasks. In this paper, we take a deep look at commercially available heterogeneous MPSoCs that incorporate PL and a multicore processor. We show how one can tailor these processors to support a mixed criticality system, where cores are strictly isolated to avoid contention on shared resources such as the Last-Level Cache (LLC) and main memory. In order to avoid conflicts in the last-level cache, we propose the use of cache coloring, implemented in the Jailhouse hypervisor. In addition, we employ ScratchPad Memory (SPM) inside the PL to support a multi-phase execution model for real-time tasks that avoids conflicts in shared memory. We provide a full-stack, working implementation on a latest-generation MPSoC platform, and show results based on both a set of data-intensive tasks and a case study based on an image processing benchmark application.
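    Cache coloring, the isolation technique proposed above, can be sketched with a small arithmetic example (the cache geometry below is an assumption for illustration, not the specific SoC from the paper): the LLC set-index bits that lie above the page offset define a page's "color", and giving each core pages of disjoint colors prevents LLC set conflicts between cores.

```python
# Minimal page-coloring sketch (assumed cache geometry, not a specific SoC):
# the LLC set-index bits above the page offset define a page's "color".

def num_colors(cache_bytes, ways, line_bytes, page_bytes):
    """Number of distinct page colors for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)   # total cache sets
    sets_per_page = page_bytes // line_bytes    # sets one page spans
    return sets // sets_per_page

def page_color(phys_addr, page_bytes, colors):
    """Color of the physical page containing phys_addr."""
    return (phys_addr // page_bytes) % colors

# Example: 1 MiB, 16-way LLC with 64-byte lines and 4 KiB pages
colors = num_colors(cache_bytes=1 << 20, ways=16, line_bytes=64,
                    page_bytes=4096)
print(colors)                              # 16 available colors
print(page_color(0x12000, 4096, colors))   # color of page at 0x12000
```

A hypervisor-based colorer (as Jailhouse's implementation does in principle) would then hand each partition only physical pages whose color falls in that partition's assigned set.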

    Efficient image processing framework on embedded systems for unmanned aerial vehicles

    Department of Mechanical Engineering. Unmanned aerial vehicles (UAVs) are widely used in areas such as exploration, transportation, and rescue due to their light weight, low cost, high mobility, and intelligence. These intelligent systems consist of highly integrated embedded systems built around a microprocessor that performs specific tasks by computing algorithms or processing data. In particular, image processing is one of the core technologies for handling tasks such as target tracking, positioning, and visual servoing with a vision system. However, it often imposes a heavy computational burden, and a separate micro PC alongside the flight computer is typically required to process image data; the performance of such a controller is limited by constraints on power, size, and weight. Therefore, efficient image processing techniques that account for computing load and hardware resources are needed for real-time operation on embedded systems. The objective of this thesis research is to develop an efficient image processing framework on embedded systems that utilizes neural networks and various optimized computation techniques to balance computing speed, resource usage, and accuracy. Image processing techniques have been proposed and tested for managing computing resources and performing high-performance missions on embedded systems. Commercially available graphics processing units (GPUs) can be used for parallel computing to accelerate processing, and multiple cores within central processing units (CPUs) are used for multi-threading during data transfers between the CPU and the GPU. To minimize computing load, several methods have been proposed. The first is visualization of a convolutional neural network (CNN) that can perform both localization and detection simultaneously.
    The second is region proposal for the CNN input area through simple image processing, which allows the algorithm to avoid full-frame processing. Finally, surplus computing resources can be saved by controlling transient performance, for example by limiting the frame rate (FPS). These optimization methods have been experimentally applied to a ground vehicle and quadrotor UAVs, verifying that they enable processing in embedded environments while saving CPU and memory resources. In addition, they support tasks such as object detection, path planning, and obstacle avoidance. Through these optimizations and algorithms, the embedded system shows a number of improvements over existing approaches. Because the methods were developed with the characteristics of embedded systems in mind, they can be further applied to a variety of practical applications.
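    The FPS-limiting idea mentioned above can be sketched in a few lines (a generic double-duty frame-rate limiter, with illustrative parameter values, not the thesis code): after processing each frame, the remainder of the frame period is slept away, returning the surplus CPU time to other onboard tasks.

```python
import time

# Hedged sketch of FPS limitation: sleep out the remainder of each frame
# period so surplus CPU time is freed for other flight tasks.
# (Parameter values are illustrative.)

def run_limited(process_frame, frames, max_fps):
    period = 1.0 / max_fps
    results = []
    for frame in frames:
        start = time.monotonic()
        results.append(process_frame(frame))
        elapsed = time.monotonic() - start
        if elapsed < period:           # give back the surplus time
            time.sleep(period - elapsed)
    return results

out = run_limited(lambda f: f * 2, [1, 2, 3], max_fps=100)
print(out)  # [2, 4, 6]
```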

    An Energy-Efficient Reconfigurable Mobile Memory Interface for Computing Systems

    The critical need for higher power efficiency and bandwidth in transceiver design has significantly increased as mobile devices, such as smartphones, laptops, tablets, and ultra-portable personal digital assistants, continue to be constructed from heterogeneous intellectual property blocks such as central processing units (CPUs), graphics processing units (GPUs), digital signal processors, dynamic random-access memories (DRAMs), sensors, and graphics/image processing units, and to gain enhanced graphics computing and video processing capabilities. However, current mobile interface technologies that support CPU-to-memory communication (e.g., baseband-only signaling) have critical limitations, particularly super-linear energy consumption, limited bandwidth, and non-reconfigurable data access. As a consequence, there is a critical need to improve both energy efficiency and bandwidth for future mobile devices. The primary goal of this study is to design an energy-efficient reconfigurable mobile memory interface for mobile computing systems in order to dramatically enhance circuit and system bandwidth and power efficiency. The proposed interface, which utilizes advanced base-band (BB) signaling together with RF-band signaling, is capable of simultaneous bi-directional communication and reconfigurable data access. It also increases power efficiency and bandwidth between mobile CPUs and memory subsystems on a single-ended shared transmission line. Moreover, because multiple data streams share a single-ended transmission line, the number of transmission lines between the mobile CPU and memories is considerably reduced, enabling significant technological innovations (e.g., more compact devices and low-cost packaging for the mobile communication interface) and establishing the principles and feasibility of technologies for future mobile system applications.
    The operation and performance of the proposed transceiver are analyzed and its circuit implementation is discussed in detail. A chip prototype of the transceiver was implemented in a 65nm CMOS process technology. In measurements, the transceiver exhibits higher aggregate data throughput and better energy efficiency compared to prior works.
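    The figure of merit behind "better energy efficiency" is commonly energy per bit; a back-of-the-envelope helper makes the trade-off concrete (the numbers below are illustrative, not measurements from this prototype):

```python
# Energy efficiency of a link in pJ/bit = power consumed / aggregate data
# rate. Illustrative numbers only, not measurements from the prototype.

def pj_per_bit(power_mw, rate_gbps):
    """Convert link power (mW) and data rate (Gb/s) to pJ/bit."""
    joules_per_bit = (power_mw * 1e-3) / (rate_gbps * 1e9)
    return joules_per_bit * 1e12   # J -> pJ

# e.g. a 10 mW transceiver moving 20 Gb/s spends 0.5 pJ per bit
print(pj_per_bit(power_mw=10.0, rate_gbps=20.0))
```

Halving the energy per bit at the same data rate directly halves the interface's contribution to the mobile device's power budget, which is why both quantities are reported together.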

    Improving Data Locality in Distributed Processing of Multi-Channel Remote Sensing Data with Potentially Large Stencils

    Distributing multi-channel remote sensing data processing with potentially large stencils is a difficult challenge. The goal of this master thesis was to evaluate and investigate the performance impact of such processing on a distributed system, and whether it is possible to improve the total execution time by exploiting data locality or memory alignment. The thesis also gives a brief overview of the current state of the art in distributed remote sensing data processing and points out why distributed computing will become more important for it in the future. For the experimental part of this thesis, an application to process huge arrays on a distributed system was implemented with DASH, a C++ template library for distributed data structures with support for hierarchical locality for High Performance Computing and data-driven science. On the basis of the first results, an optimization model was developed with the goal of reducing network traffic while initializing a distributed data structure and executing computations on it with potentially large stencils. Furthermore, software to estimate the memory layouts with the least network communication cost for a given multi-channel remote sensing data processing workflow was implemented. The results of this optimization were then executed and evaluated. The results show that it is possible to improve the initialization speed of a large image by 25% by considering brick locality. The optimization model also generates valid decisions for the initialization of the PGAS memory layouts. However, for a real implementation the optimization model has to be modified to reflect implementation-dependent sources of overhead. This thesis presented some approaches towards solving challenges of the distributed computing world that can be used for real-world remote sensing imaging applications, and contributed towards solving the challenges of the modern Big Data world for future scientific data exploitation.
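    Why brick (block) locality reduces network traffic for stencils can be shown with a small halo-volume estimate (an illustrative model with assumed sizes, corners ignored, not the thesis's optimization model): for a stencil of radius r, a node must fetch a halo of remote cells around its partition, and a compact 2-D block has a much smaller boundary than a full-width row slab.

```python
# Illustrative halo-volume estimate for a stencil of radius r on a
# width x height image (assumed sizes; corner cells ignored).

def halo_rows(width, r):
    # 1-D row-slab decomposition: each interior slab fetches 2*r full rows
    return 2 * r * width

def halo_blocks(width, height, per_side, r):
    # 2-D block ("brick") decomposition on a per_side x per_side node grid:
    # top/bottom halos plus left/right halos of one block
    bw, bh = width // per_side, height // per_side
    return 2 * r * (bw + bh)

# 4096x4096 image, 16 nodes, stencil radius 2
print(halo_rows(4096, r=2))                         # 16384 remote cells
print(halo_blocks(4096, 4096, per_side=4, r=2))     # 8192 remote cells
```

Under this simple model the blocked layout halves the per-node halo traffic, which is the effect the thesis's layout optimization exploits.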

    A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications

    This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus-interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively support 32 DSPs, in contrast to a maximum of 4 DSPs supported by existing DSP-based bus-interconnected systems. This significant enhancement is achieved by introducing two small programmable fast memories (twins) between the processor and the shared bus interconnect. While one memory is transferring data from/to the shared memory, the other is supplying the core processor with data. The elimination of the traditional direct linkage of the shared bus and processor data bus makes a wider shared bus feasible, i.e., the shared bus width becomes independent of the data bus width of the processors. The fast prefetching memories and the wider shared bus provide additional bus bandwidth, which eliminates large memory latencies; such latencies constitute the major drawback for the performance of shared-memory multiprocessors. Furthermore, in contrast to existing DSP-based uniprocessor or multiprocessor systems, the proposed architecture does not require all data to be placed in expensive fast on-chip or off-chip memory in order to reach or maintain peak performance, and it can maintain peak performance regardless of whether the processed image is small or large. The performance of the proposed architecture has been extensively investigated by executing computationally intensive applications such as real-time high-resolution image processing, and the effect of a wide variety of hardware design parameters on performance has been examined. More specifically, tables and graphs comprehensively analyze the performance of 1-, 2-, 4-, 8-, 16-, 32- and 64-DSP systems for a wide variety of shared data interconnect widths: 32, 64, 128, 256 and 512 bits.
    In addition, the effect of the wide variance of temporal and spatial locality (present in different applications) on the multiprocessor's execution time is investigated and analyzed. Finally, the prefetching cache size was varied from a few kilobytes to 4 Mbytes and the corresponding effect on the execution time was investigated. Our performance analysis has clearly shown that the execution time converges to a shallow minimum, i.e., it is not sensitive to the size of the prefetching cache. The significance of this observation is that near-optimum performance can be achieved with a small (16 to 300 Kbytes) amount of prefetching cache.
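    The twin-memory mechanism described above is, at heart, double buffering: while one small memory is filled from the shared bus, the other feeds the processor, and the two swap roles each block. A minimal Python sketch of the pattern (a conceptual analogue using a one-slot queue, not the hardware design):

```python
import threading
import queue

# Conceptual sketch of the "twin" prefetching memories: one buffer is
# filled from shared memory while the other feeds the processor; a
# one-slot queue models the pair swapping roles every block.

def process_blocks(blocks, work):
    fetched = queue.Queue(maxsize=1)   # at most one block staged ahead

    def prefetcher():
        for b in blocks:               # stands in for shared-bus transfers
            fetched.put(b)
        fetched.put(None)              # end-of-stream marker

    threading.Thread(target=prefetcher, daemon=True).start()

    results = []
    while (b := fetched.get()) is not None:
        results.append(work(b))        # compute overlaps the next fetch
    return results

print(process_blocks([[1, 2], [3, 4]], sum))  # [3, 7]
```

Because compute on the current block overlaps the fetch of the next one, the processor never stalls on shared-bus latency as long as each fetch finishes within one compute interval, which is exactly the condition the twin memories are sized to meet.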