
    Analysis of 3D Cone-Beam CT Image Reconstruction Performance on a FPGA

    Efficient and accurate tomographic image reconstruction has been an intensive topic of research due to its increasing everyday use in areas such as radiology, biology, and materials science. Computed tomography (CT) scans are used to analyze internal structures through the capture of x-ray images. Cone-beam CT scans project a cone-shaped x-ray beam from a single focal point that rotates around the object to capture 2D projection data. CT scans are prone to multiple artifacts, including motion blur, streaks, and pixel irregularities, and therefore must be run through image reconstruction software to reduce these visual artifacts. The most common algorithm used is the Feldkamp, Davis, and Kress (FDK) backprojection algorithm. The algorithm is computationally intensive due to its O(n⁴) backprojection step: it runs slowly with large CT data sets on CPUs, but exceptionally well on GPUs because of its parallel nature. This thesis analyzes the performance of 3D cone-beam CT image reconstruction implemented in OpenCL on an FPGA embedded into a Power System.
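The O(n⁴) cost arises because each of the O(n³) voxels accumulates a contribution from every one of the O(n) projections, and each voxel can be processed independently, which is what makes the algorithm map well onto GPUs and FPGAs. The following is a minimal voxel-driven backprojection sketch in Python/NumPy for illustration only; the flat-detector geometry, the `backproject` name, and the omission of the FDK weighting and ramp filtering are assumptions, not the thesis implementation.

```python
import numpy as np

def backproject(projections, angles, dso, dsd, det_spacing, vol_shape, voxel_size):
    """Naive voxel-driven cone-beam backprojection sketch (nearest-neighbour
    sampling; FDK cosine weighting and ramp filtering are assumed to have been
    applied to `projections` already).

    projections : (n_angles, n_rows, n_cols) filtered projection data
    angles      : projection angles in radians
    dso, dsd    : source-to-isocenter and source-to-detector distances
    """
    nz, ny, nx = vol_shape
    vol = np.zeros(vol_shape, dtype=np.float32)
    # Voxel center coordinates, with the isocenter at the volume center.
    zs = (np.arange(nz) - nz / 2 + 0.5) * voxel_size
    ys = (np.arange(ny) - ny / 2 + 0.5) * voxel_size
    xs = (np.arange(nx) - nx / 2 + 0.5) * voxel_size
    Z, Y, X = np.meshgrid(zs, ys, xs, indexing="ij")

    n_rows, n_cols = projections.shape[1:]
    for p, theta in zip(projections, angles):      # O(n) projections ...
        # Rotate the voxel grid into the source frame for this angle.
        s = X * np.cos(theta) + Y * np.sin(theta)  # axis towards the detector
        t = -X * np.sin(theta) + Y * np.cos(theta)
        mag = dsd / (dso + s)                      # cone-beam magnification
        u = t * mag / det_spacing + n_cols / 2     # detector column
        v = Z * mag / det_spacing + n_rows / 2     # detector row
        iu = np.clip(u.astype(np.int32), 0, n_cols - 1)
        iv = np.clip(v.astype(np.int32), 0, n_rows - 1)
        vol += p[iv, iu] / mag**2                  # ... accumulated into O(n^3) voxels
    return vol
```

On an accelerator, the array arithmetic inside the loop would be flattened into one work-item or thread per voxel, which is the parallelism the abstract refers to.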

    Novel high performance techniques for high definition computer aided tomography

    International Mention in the doctoral degree. Medical image processing is an interdisciplinary field in which multiple research areas are involved: image acquisition, scanner design, image reconstruction algorithms, visualization, etc. X-Ray Computed Tomography (CT) is a medical imaging modality based on the attenuation suffered by the X-rays as they pass through the body. Intrinsic differences in the attenuation properties of bone, air, and soft tissue result in high-contrast images of anatomical structures. The main objective of CT is to obtain tomographic images from radiographs acquired using X-Ray scanners. The process of building a 3D image or volume from the 2D radiographs is known as reconstruction. One of the latest trends in CT is the reduction of the radiation dose delivered to patients through a decrease in the amount of acquired data. This reduction results in artefacts in the final images if conventional reconstruction methods are used, making it advisable to employ iterative reconstruction algorithms. Among the numerous reconstruction algorithms available, two types stand out: traditional algorithms, which are fast but cannot produce high-quality images in situations of limited data; and iterative algorithms, which are slower but more reliable when traditional methods do not reach the required quality standards. One of the priorities of reconstruction is to obtain the final images in near real time, in order to reduce the time spent in diagnosis. To accomplish this objective, new high-performance techniques and methods for accelerating these types of algorithms are needed. This thesis addresses the challenges of both traditional and iterative reconstruction algorithms, regarding acceleration and image quality. One common approach for accelerating these algorithms is the use of shared-memory and heterogeneous architectures. In this thesis, we propose a novel simulation/reconstruction framework, namely FUX-Sim. This framework follows the hypothesis that the development of new flexible X-ray systems can benefit from computer simulations, which may also enable performance to be checked before expensive real systems are implemented. Its modular design abstracts the complexities of programming for accelerated devices to facilitate the development and evaluation of the different configurations and geometries available. In order to obtain near real-time execution times, low-level optimizations for the main components of the framework are provided for Graphics Processing Unit (GPU) architectures. Another alternative tackled in this thesis is the acceleration of iterative reconstruction algorithms by using distributed-memory architectures. We present a novel architecture that unifies the two most important computing paradigms for scientific computing nowadays: High Performance Computing (HPC) and Big Data. The proposed architecture combines Big Data frameworks with the advantages of accelerated computing. The methods presented in this thesis provide more flexible scanner configurations while offering an accelerated solution. Regarding performance, our approach is as competitive as the solutions found in the literature.
Additionally, we demonstrate that our solution scales with the size of the problem, enabling the reconstruction of high-resolution images.
This work has been mainly funded by an FPU fellowship (FPU14/03875) from the Spanish Ministry of Education. It has also been partially supported by other grants:
• DPI2016-79075-R, "Nuevos escenarios de tomografía por rayos X", from the Spanish Ministry of Economy and Competitiveness.
• TIN2016-79637-P, "Towards Unification of HPC and Big Data Paradigms", from the Spanish Ministry of Economy and Competitiveness.
• Short-term scientific mission (STSM) grant from the NESUS COST Action IC1305.
• TIN2013-41350-P, "Scalable Data Management Techniques for High-End Computing Systems", from the Spanish Ministry of Economy and Competitiveness.
• RTC-2014-3028-1 NECRA, "Nuevos escenarios clinicos con radiología avanzada", from the Spanish Ministry of Economy and Competitiveness.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática. President: José Daniel García Sánchez. Secretary: Katzlin Olcoz Herrero. Committee member: Domenico Tali
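The abstract's contrast between fast analytical methods and slower but more robust iterative methods can be made concrete with a minimal iterative update. The sketch below shows a SIRT-style iteration in Python/NumPy against a generic system matrix; it is a toy stand-in under stated assumptions, not FUX-Sim's GPU projectors or the distributed HPC/Big Data implementation.

```python
import numpy as np

def sirt(A, b, n_iters=50):
    """Minimal SIRT iteration: x <- x + C A^T R (b - A x),
    where R and C hold the inverse row and column sums of A."""
    row_sums = A.sum(axis=1)
    col_sums = A.sum(axis=0)
    R = np.where(row_sums > 0, 1.0 / row_sums, 0.0)   # 1 / row sums
    C = np.where(col_sums > 0, 1.0 / col_sums, 0.0)   # 1 / column sums
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        residual = b - A @ x             # compare measured and simulated projections
        x += C * (A.T @ (R * residual))  # backproject the weighted residual
        np.clip(x, 0.0, None, out=x)     # non-negativity constraint
    return x

# Toy example: a 4-pixel "image" observed through 3 ray sums (underdetermined,
# the limited-data situation in which iterative methods are preferred).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])
x_true = np.array([1.0, 2.0, 3.0, 4.0])
print(sirt(A, A @ x_true))
```

Each iteration costs one forward projection and one backprojection, which is why acceleration on GPUs or distributed architectures matters so much more here than for single-pass analytical methods.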

    OpenCL acceleration on FPGA vs CUDA on GPU


    Fast algorithm for real-time rings reconstruction

    The GAP project is dedicated to studying the application of GPUs in several contexts in which real-time response is important for decision making. The definition of real time depends on the application under study, ranging from response times of a few microseconds up to several hours in the case of very computationally intensive tasks. During this conference we presented our work on low-level triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, and on specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from the study of dedicated solutions to decrease the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we show an original algorithm developed for trigger applications, to accelerate ring reconstruction in RICH detectors when it is not possible to obtain seeds for the reconstruction from external trackers.
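The abstract does not spell out the ring-finding algorithm itself, so the following is only a generic illustration of seed-less ring reconstruction: a toy circle Hough transform in Python/NumPy for a known ring radius. The function name, the fixed-radius assumption, and the grid parameters are invented for the example and are not the trigger algorithm presented in the contribution.

```python
import numpy as np

def hough_circles(hits, radius, grid, bins=64):
    """Toy seed-less ring finder: accumulate votes for candidate circle centers
    at a known radius and return the most-voted center."""
    lo, hi = grid
    acc = np.zeros((bins, bins))
    centers = np.linspace(lo, hi, bins)
    cx, cy = np.meshgrid(centers, centers, indexing="ij")
    for x, y in hits:
        # Every candidate center at distance ~radius from the hit gets a vote.
        d = np.hypot(cx - x, cy - y)
        acc += np.abs(d - radius) < (hi - lo) / bins
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return centers[i], centers[j], acc[i, j]

# Toy event: photon hits on a ring of radius 5 centered at (1.0, -2.0), plus noise.
rng = np.random.default_rng(0)
phi = rng.uniform(0, 2 * np.pi, 40)
hits = np.column_stack([1.0 + 5 * np.cos(phi), -2.0 + 5 * np.sin(phi)])
hits += rng.normal(scale=0.1, size=hits.shape)
print(hough_circles(hits, radius=5.0, grid=(-10, 10)))
```

The vote accumulation over hits is embarrassingly parallel, which is the property that makes this class of algorithm attractive for GPU-based triggers.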

    Efficient architectures of heterogeneous FPGA-GPU for 3-D medical image compression

    The advent of three-dimensional (3-D) imaging modalities has generated a massive amount of volumetric data in 3-D images such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US). Existing surveys reveal a considerable gap for further research in exploiting reconfigurable computing for 3-D medical image compression. This research proposes an FPGA-based co-processing solution to accelerate such a medical imaging system. The HWT block is implemented on the sbRIO-9632 FPGA board, a Spartan-3 (XC3S2000) prototyping board. Analysis and performance evaluation of the 3-D images were conducted. Furthermore, a novel architecture is proposed for the context-based adaptive binary arithmetic coder (CABAC), the advanced entropy coding tool employed by the main and higher profiles of H.264/AVC. This research focuses on a GPU implementation of CABAC and on a comparative study of 3-D medical image compression systems with and without the discrete wavelet transform (DWT). Implementation results on MRI and CT images show the GPU significantly outperforming a single-threaded CPU implementation. Overall, the CT and MRI modalities with DWT outperform the versions without DWT in terms of compression ratio, peak signal-to-noise ratio (PSNR), and latency. For heterogeneous computing, MRI images of various sizes and formats, such as JPEG and DICOM, were used. Evaluation results show that, for each memory iteration, transfers from GPU to CPU consume more bandwidth or throughput. For a JPEG image of 786,486 bytes, the bandwidth consumed in both directions tends to balance. Bandwidth is relative to the transfer size: larger transfers incur more latency and throughput. Next, an OpenCL implementation for concurrent tasks on a dedicated FPGA is presented. The findings reveal that OpenCL in batch processing mode with AOC techniques offers substantial results, where the amount of logic, area, registers, and memory increases proportionally to the number of batches; this is because the kernel block is replicated according to the batch number, so the memory banks grow accordingly. A comparative study found that the balanced-tree and unrolled-loop architecture performs better in terms of local memory, latency, and throughput.
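As background to the DWT-versus-no-DWT comparison above, the sketch below shows one level of a separable 3-D Haar wavelet transform in Python/NumPy. It is a generic illustration only; the function name and sub-band layout are assumptions, and it does not reproduce the thesis's HWT hardware block or the GPU CABAC stage.

```python
import numpy as np

def haar3d(volume):
    """One level of a separable 3-D Haar wavelet transform.
    Returns the 8 sub-bands packed into an array of the same shape
    (low-pass halves first along each axis)."""
    v = volume.astype(np.float64)
    for axis in range(3):                      # transform each axis in turn
        v = np.moveaxis(v, axis, 0)
        even, odd = v[0::2], v[1::2]
        lo = (even + odd) / np.sqrt(2)         # approximation coefficients
        hi = (even - odd) / np.sqrt(2)         # detail coefficients
        v = np.moveaxis(np.concatenate([lo, hi], axis=0), 0, axis)
    return v

# Example: a 64^3 test volume; most of the energy ends up in the LLL sub-band,
# which is what makes the subsequent entropy coding (e.g. CABAC) effective.
vol = np.random.default_rng(0).normal(size=(64, 64, 64)) + 10.0
coeffs = haar3d(vol)
lll = coeffs[:32, :32, :32]
print(f"energy in LLL sub-band: {np.sum(lll**2) / np.sum(coeffs**2):.3f}")
```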

    Image-based Control and Automation of High-speed X-ray Imaging Experiments

    Modern X-ray imaging provides insight into the internal structure of objects made from a wide variety of materials. The success of such measurements depends decisively on a suitable choice of acquisition conditions, on the mechanical instrumentation, and on the properties of the sample or of the process under investigation itself. So far, no procedure for autonomous data acquisition is known that allows control via image-based feedback across very different X-ray imaging experiments. This thesis aims to close this gap by specifically addressing and solving the problems that arise: the selection of the initial experimental parameters, fast processing of the acquired data, and automatic feedback to correct the running measurement procedure. To determine the most suitable experimental conditions, we start from the fundamentals of image formation and develop a framework for its simulation. This allows us to carry out a wide range of virtual X-ray imaging experiments, taking into account the decisive physical processes along the path of the X-rays from the source to the detector. In addition, we consider different sample shapes and motions, which enables the simulation of experiments such as 4D (time-resolved) tomography. We also develop an autonomous data acquisition procedure that readjusts the initial conditions of the experiment during the already running measurement on the basis of fast image analysis and can also control other parameters of the experiment. We pay particular attention to high-speed experiments, which place high demands on the speed of data processing, especially when the control is based on computationally intensive algorithms such as the tomographic 3D reconstruction of the sample. To implement an efficient algorithm for this, we use a highly parallelized framework. Its output can then be used to compute various image metrics that provide quantitative information about the acquired data. These form the basis for decision making in a closed control loop in which the data acquisition hardware is operated. We demonstrate the accuracy of the developed simulation framework by comparing virtual and real experiments based on grating interferometry, which uses special optical elements for contrast formation. We also investigate in detail the influence of the imaging conditions on the accuracy of the implemented filtered backprojection algorithm, and to what extent the experimental conditions can be optimized when this influence is taken into account. We demonstrate the capabilities of our autonomous data acquisition system with an in-situ tomography experiment in which, based on 3D reconstruction, it optimizes the camera frame rate and thereby ensures that the recorded data sets can be reconstructed without artifacts. We also use our system to carry out a high-throughput tomography experiment in which many similar biological samples are scanned: for each of them the tomographic rotation axis is determined automatically, and a complete 3D volume is reconstructed already during the measurement to ensure data quality. Furthermore, we perform an in-situ laminography experiment that investigates crack formation in a material sample; here our system carries out the data acquisition and reconstructs a centrally located cross-section through the sample to verify its correct alignment and the quality of the data. Based on highly accurate simulations, our work enables the choice of the most suitable initial conditions of an experiment, their fine-tuning during a real experiment, and finally its automatic control based on fast analysis of the data just recorded. Such an approach to data acquisition enables novel in-vivo and in-situ high-speed experiments that, owing to the high data rates, could no longer be handled by a human operator.
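To illustrate the closed-loop idea described above (fast image metrics driving the acquisition hardware), here is a deliberately simplified feedback loop in Python/NumPy. The hardware stand-ins (`acquire_slice`, `set_frame_rate`), the SNR-style metric, and the frame-rate adjustment rule are all hypothetical; the actual system, its metrics, and its control policy are those described in the thesis, not shown here.

```python
import numpy as np

def acquisition_loop(acquire_slice, set_frame_rate, fps=100.0, target_snr=20.0,
                     max_steps=20):
    """Generic image-based feedback loop: after each acquisition a quick quality
    metric is computed on a reconstructed slice and the camera frame rate is
    nudged towards a value that meets the quality target."""
    for _ in range(max_steps):
        slice_ = acquire_slice(fps)                    # reconstructed central slice
        snr = slice_.mean() / (slice_.std() + 1e-12)   # cheap quality metric
        if abs(snr - target_snr) < 0.5:
            break                                      # quality target reached
        # Lower frame rate -> longer exposure -> higher SNR (and vice versa).
        fps *= 0.8 if snr < target_snr else 1.2
        set_frame_rate(fps)
    return fps

# Hypothetical stand-ins for the beamline hardware / reconstruction pipeline.
rng = np.random.default_rng(0)
def acquire_slice(fps):
    return rng.normal(loc=100.0, scale=np.sqrt(fps), size=(256, 256))
def set_frame_rate(fps):
    print(f"camera frame rate set to {fps:.1f} fps")

print("final fps:", acquisition_loop(acquire_slice, set_frame_rate))
```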

    FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis

    The abstract is available in the attachment.

    PERFORMANCE ANALYSIS AND FITNESS OF GPGPU AND MULTICORE ARCHITECTURES FOR SCIENTIFIC APPLICATIONS

    Recent trends in computing architecture development have focused on exploiting task- and data-level parallelism from applications. Major hardware vendors are experimenting with novel parallel architectures, such as the Many Integrated Core (MIC) from Intel that integrates 50 or more x86 processors on a single chip and the Accelerated Processing Unit from AMD that integrates a multicore x86 processor with a graphical processing unit (GPU), and many other initiatives from other hardware vendors are underway. Therefore, various types of architectures are available to developers for accelerating an application. A performance model that predicts the suitability of an architecture for accelerating an application would be very helpful prior to implementation. Thus, in this research, a Fitness model that ranks the potential performance of accelerators for an application is proposed. The Fitness model is then extended using statistical multiple regression to model both the runtime performance of accelerators and the impact of programming models on accelerator performance with a high degree of accuracy. We have validated both performance models for all the case studies. The error rate of these models, calculated using the experimental performance data, is tolerable for the high-performance computing field. In this research, to develop and validate the two performance models, we have also analyzed the performance of several multicore CPU and GPGPU architectures and the corresponding programming models using multiple case studies. The first case study used in this research is a matrix-matrix multiplication algorithm. By varying the matrix size from small to very large, the performance of the multicore and GPGPU architectures is studied. The second case study is a biological spiking neural network (SNN), implemented with four neuron models that have varying requirements for communication and computation, making them useful for performance analysis of the hardware platforms. We report and analyze the performance variation of four popular accelerators (Intel Xeon, AMD Opteron, Nvidia Fermi, and IBM PS3) and four advanced CPU architectures (Intel 32-core, AMD 32-core, IBM 16-core, and SUN 32-core) with problem size (matrix and network size) scaling, available optimization techniques, and execution configuration. This thorough analysis provides insight into how the performance of an accelerator is affected by problem size, optimization techniques, and accelerator configuration. We have analyzed the performance impact of four popular multicore parallel programming models, POSIX threading, Open Multi-Processing (OpenMP), Open Computing Language (OpenCL), and Concurrency Runtime, on an Intel i7 multicore architecture, and of two GPGPU programming models, Compute Unified Device Architecture (CUDA) and OpenCL, on an NVIDIA GPGPU. From this broad study, conducted using a wide range of application complexity, multiple optimizations, and varying problem sizes, it was found that, according to their achievable performance, the programming models for the x86 processor cannot be ranked consistently across all applications, whereas the programming models for the GPGPU can be ranked conclusively. We have also qualitatively and quantitatively ranked all six programming models in terms of their perceived programming effort.
The results and analysis in this research, supported by the proposed performance models, indicate that for a given hardware system the best performance for an application is obtained with a proper match of programming model and architecture.
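The abstract does not state the exact form of the regression model, so the following Python/NumPy sketch only illustrates the general idea of ranking accelerators with a fitted multiple-regression runtime model; the feature set, the numbers, and the per-accelerator models are invented for the example.

```python
import numpy as np

# Hypothetical training data: per-run features of an application
# (problem size, arithmetic intensity, fraction of parallel work) and the
# measured runtime on one accelerator. All values are invented for illustration.
X = np.array([[1024,  8.0, 0.90],
              [2048, 16.0, 0.92],
              [4096, 16.0, 0.95],
              [8192, 32.0, 0.97]], dtype=float)
runtimes = np.array([0.12, 0.35, 1.10, 3.90])        # seconds, invented

def fit_runtime_model(X, y):
    """Multiple linear regression y ~ b0 + b1*x1 + ... fitted by least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add intercept column
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coeffs

def predict_runtime(coeffs, features):
    return coeffs[0] + features @ coeffs[1:]

# One fitted model per accelerator; the "fitness" ranking is simply the
# predicted runtimes sorted in ascending order.
models = {"gpu": fit_runtime_model(X, runtimes),
          "cpu_32core": fit_runtime_model(X, runtimes * 4.0)}   # invented data
app = np.array([4096, 24.0, 0.94])
ranking = sorted(models, key=lambda m: predict_runtime(models[m], app))
print("predicted ranking (best first):", ranking)
```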

    Nākotnes procesoru arhitektūru pielietojums precīzu daļiņu paātrinātāju modelēšanā (Use of future processor architectures in the precise modelling of particle accelerators)

    Emerging processor architectures such as graphical processing units (GPUs) and Intel Many Integrated Cores (MICs) provide a huge performance potential for high-performance computing. However, developing software that uses these hardware accelerators introduces additional challenges for the developer. These challenges may include exposing increased parallelism, handling different hardware designs, and using multiple development frameworks in order to utilize devices from different vendors. The Dynamic Kernel Scheduler (DKS) is being developed in order to provide a software layer between the host application and different hardware accelerators. DKS handles the communication between the host and the device, schedules task execution, and provides a library of built-in algorithms. Algorithms available in the DKS library are written in CUDA, OpenCL, and OpenMP. Depending on the available hardware, DKS can select the appropriate implementation of the algorithm. DKS was used to enable co-processor usage in applications such as OPAL (Object-oriented Particle Accelerator Library), musrfit, and a PET (Positron Emission Tomography) image reconstruction application. These applications are developed at Paul Scherrer Institut and ETH Zurich for particle accelerator modeling and experimental data analysis, and are used by a worldwide user community. The achieved results show that substantial speedups in application execution times can be obtained using co-processors compared to CPUs, and that DKS simplifies the integration and maintenance of GPU and Intel MIC support in existing applications. The potential of the new hardware architectures is further demonstrated by porting to CUDA the multibunch tracking application (mbtrack) developed at SOLEIL (the French national synchrotron facility). This application is used at PSI for detailed studies of coupled-bunch instabilities and transient beam loading. Using the computational power of GPUs, these simulations can be moved from larger CPU clusters to a simpler system consisting of a CPU augmented with a single graphics card. Keywords: hardware acceleration, GPU computing, Intel MIC, CUDA, OpenCL, OpenMP
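The abstract does not show the DKS programming interface, so the sketch below only illustrates the general dispatch idea in Python: algorithms are registered per backend and the scheduler picks an implementation according to the hardware detected at run time. The class name, method names, and backend labels are invented for the illustration; the real DKS provides CUDA, OpenCL, and OpenMP implementations behind a host-side library interface, as described above.

```python
from typing import Callable, Dict, List
import numpy as np

class KernelScheduler:
    """Toy dispatcher illustrating the DKS idea: one host-side call, multiple
    backend implementations, selection by available hardware.
    (Names and interface are invented; not the actual DKS API.)"""

    def __init__(self, available_backends: List[str]):
        self.available = available_backends
        self.registry: Dict[str, Dict[str, Callable]] = {}

    def register(self, algorithm: str, backend: str, fn: Callable) -> None:
        self.registry.setdefault(algorithm, {})[backend] = fn

    def run(self, algorithm: str, *args):
        impls = self.registry[algorithm]
        # Prefer backends in the order they were detected on this machine.
        for backend in self.available:
            if backend in impls:
                return impls[backend](*args)
        raise RuntimeError(f"no implementation of {algorithm} for {self.available}")

# Host code stays the same regardless of which backend actually runs.
dks = KernelScheduler(available_backends=["openmp"])        # e.g. no GPU found
dks.register("fft", "cuda",   lambda x: np.fft.fft(x))      # stand-in kernels
dks.register("fft", "openmp", lambda x: np.fft.fft(x))
print(dks.run("fft", np.ones(8)))
```

The value of such a layer is that the host application never branches on the device type itself, which is what keeps the integration of new accelerators maintainable.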