42 research outputs found
Analysis of 3D Cone-Beam CT Image Reconstruction Performance on a FPGA
Efficient and accurate tomographic image reconstruction has been an intensive topic of research due to its increasing everyday usage in areas such as radiology, biology, and materials science. Computed tomography (CT) scans are used to analyze internal structures through the capture of X-ray images. Cone-beam CT scans project a cone-shaped X-ray beam from a single focal point to capture 2D image data while rotating around the object. CT scans are prone to multiple artifacts, including motion blur, streaks, and pixel irregularities, and must therefore be run through image reconstruction software to reduce visual artifacts. The most common algorithm used is the Feldkamp, Davis, and Kress (FDK) backprojection algorithm. The algorithm is computationally intensive due to its O(n⁴) backprojection step, running slowly with large CT data files on CPUs but exceptionally well on GPUs due to the parallel nature of the algorithm. This thesis will analyze the performance of 3D cone-beam CT image reconstruction implemented in OpenCL on an FPGA embedded into a Power System
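As background for the cost claim above: the O(n⁴) complexity comes from updating on the order of n³ voxels for each of roughly n projection angles, and every voxel update is independent, which is what makes the algorithm map so well to GPUs and to FPGA pipelines. A minimal voxel-driven backprojection kernel might look like the following CUDA sketch; the geometry, weighting, and interpolation are heavily simplified, and all names are illustrative rather than taken from the thesis.

    #include <cuda_runtime.h>

    // Accumulate one cone-beam projection into the volume (voxel-driven FDK
    // backprojection, simplified: ideal geometry, nearest-neighbour sampling,
    // no ramp filtering or cosine pre-weighting).
    __global__ void backproject(float* vol, const float* proj,
                                int n,                   // volume is n x n x n voxels
                                int nu, int nv,          // detector size in pixels
                                float sinA, float cosA,  // gantry angle
                                float sdd, float sod) {  // source-detector / source-object distances
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x >= n || y >= n || z >= n) return;

        // Voxel position in a frame rotated by the gantry angle (unit voxel pitch).
        float px =  (x - n / 2) * cosA + (y - n / 2) * sinA;
        float py = -(x - n / 2) * sinA + (y - n / 2) * cosA;
        float pz =   (float)(z - n / 2);

        // Perspective projection onto the detector; w is the magnification term.
        float w = sdd / (sod - py);
        int u = (int)(px * w) + nu / 2;
        int v = (int)(pz * w) + nv / 2;
        if (u >= 0 && u < nu && v >= 0 && v < nv)
            vol[((size_t)z * n + y) * n + x] += proj[v * nu + u] * w * w;  // FDK-style distance weight
    }

Launching this kernel once per projection gives n³ work per angle, hence the O(n⁴) total; on an FPGA the same loop nest is typically expressed as a deep pipeline instead of a thread grid.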
Novel high performance techniques for high definition computer aided tomography
Medical image processing is an interdisciplinary field in which multiple research areas are involved:
image acquisition, scanner design, image reconstruction algorithms, visualization, etc.
X-Ray Computed Tomography (CT) is a medical imaging modality based on the attenuation
suffered by the X-rays as they pass through the body. Intrinsic differences in attenuation properties
of bone, air, and soft tissue result in high-contrast images of anatomical structures. The
main objective of CT is to obtain tomographic images from radiographs acquired using X-ray
scanners. The process of building a 3D image or volume from the 2D radiographs is known as
reconstruction. One of the latest trends in CT is the reduction of the radiation dose delivered
to patients by decreasing the amount of acquired data. This reduction results in artefacts
in the final images if conventional reconstruction methods are used, making it advisable to
employ iterative reconstruction algorithms.
There are numerous reconstruction algorithms available, among which two specific
types stand out: traditional algorithms, which are fast but cannot produce high-quality
images in situations of limited data; and iterative algorithms, which are slower but more reliable
when traditional methods do not meet the quality requirements. One of the priorities
of reconstruction is obtaining the final images in near real time, in order to reduce the
time spent on diagnosis. To accomplish this objective, new high-performance techniques and methods
for accelerating these types of algorithms are needed. This thesis addresses the challenges
of both traditional and iterative reconstruction algorithms, regarding acceleration and image
quality. One common approach for accelerating these algorithms is the use of shared-memory
and heterogeneous architectures. In this thesis, we propose a novel simulation/reconstruction
framework, namely FUX-Sim. This framework follows the hypothesis that the development of
new flexible X-ray systems can benefit from computer simulations, which may also enable performance
to be checked before expensive real systems are implemented. Its modular design
abstracts the complexities of programming for accelerated devices to facilitate the development
and evaluation of the different configurations and geometries available. To obtain near real-time
execution, low-level optimizations for the main components of the framework are
provided for Graphics Processing Unit (GPU) architectures.
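As a concrete example of the iterative family targeted here (a generic SIRT-style step, not FUX-Sim's actual code), each iteration forward-projects the current volume, forms the residual against the measured data, backprojects it, and then nudges every voxel independently; this last step parallelizes exactly like FDK backprojection. A minimal sketch of the voxel update, with all names illustrative:

    // One SIRT-style correction: move each voxel along the backprojected
    // residual, scaled by a relaxation factor and a per-voxel normalisation.
    // `bp_residual` holds the backprojection of (measured - forward-projected)
    // data and is computed with the same projector kernels used elsewhere.
    __global__ void sirt_update(float* vol, const float* bp_residual,
                                const float* norm,  // backprojected system-matrix row sums
                                int nvox, float lambda) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nvox && norm[i] > 0.0f)
            vol[i] += lambda * bp_residual[i] / norm[i];
    }

Each iteration therefore costs roughly one forward plus one backprojection, which is why iterative methods are slower than a single analytical pass yet benefit from the same GPU optimizations.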
Another alternative tackled in this thesis is the acceleration of iterative reconstruction algorithms
by using distributed-memory architectures. We present a novel architecture that unifies
the two most important paradigms for scientific computing nowadays: High Performance
Computing (HPC) and Big Data. The proposed architecture combines Big Data frameworks with the
advantages of accelerated computing.
The methods proposed in this thesis provide more flexible scanner configurations
while offering an accelerated solution. Regarding performance, our approach is as competitive as
the solutions found in the literature. Additionally, we demonstrate that our solution scales with
the size of the problem, enabling the reconstruction of high resolution images.
This work has been mainly funded thanks to an FPU fellowship (FPU14/03875) from the Spanish
Ministry of Education.
It has also been partially supported by other grants:
• DPI2016-79075-R. “Nuevos escenarios de tomografía por rayos X”, from the Spanish Ministry
of Economy and Competitiveness.
• TIN2016-79637-P. “Towards unification of HPC and Big Data Paradigms”, from the Spanish
Ministry of Economy and Competitiveness.
• Short-term scientific missions (STSM) grant from NESUS COST Action IC1305.
• TIN2013-41350-P. “Scalable Data Management Techniques for High-End Computing Systems”, from
the Spanish Ministry of Economy and Competitiveness.
• RTC-2014-3028-1. “NECRA: Nuevos escenarios clínicos con radiología avanzada”, from the
Spanish Ministry of Economy and Competitiveness.
Fast algorithm for real-time rings reconstruction
The GAP project is dedicated to studying the application of GPUs in several contexts in which
real-time response is important for decision making. The definition of real time depends on
the application under study, ranging from answer times of microseconds up to several hours in the case
of very computation-intensive tasks. During this conference we presented our work on low-level
triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, and on
specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6].
Apart from the study of dedicated solutions to decrease the latency due to data transport
and preparation, the computing algorithms play an essential role in any GPU application.
In this contribution, we show an original algorithm developed for trigger applications, which
accelerates ring reconstruction in RICH detectors when it is not possible to obtain seeds
for reconstruction from external trackers.
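The seedless trigger algorithm itself is described in the cited work; as a generic illustration of why ring finding can run at trigger rates, note that an algebraic (Kåsa-style) least-squares circle fit reduces each event to a handful of sums and a 3x3 linear solve, small enough for one GPU thread per event. A hedged sketch with invented names:

    #include <math.h>

    // Kåsa circle fit: hits (x_i, y_i) are modelled by x^2 + y^2 = A*x + B*y + C,
    // giving centre (A/2, B/2) and r^2 = C + A^2/4 + B^2/4. One thread fits one
    // event's hit list (hits for event e live in [offset[e], offset[e+1])).
    // Illustrative only; not the published seedless algorithm.
    __global__ void fit_rings(const float* xs, const float* ys,
                              const int* offset, int nEvents,
                              float* cx, float* cy, float* r) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEvents) return;
        float Sx=0, Sy=0, Sxx=0, Syy=0, Sxy=0, Sz=0, Sxz=0, Syz=0;
        float n = (float)(offset[e + 1] - offset[e]);
        for (int i = offset[e]; i < offset[e + 1]; ++i) {
            float x = xs[i], y = ys[i], z = x * x + y * y;
            Sx += x; Sy += y; Sxx += x * x; Syy += y * y; Sxy += x * y;
            Sz += z; Sxz += x * z; Syz += y * z;
        }
        // Cramer's rule on the normal equations
        // [Sxx Sxy Sx; Sxy Syy Sy; Sx Sy n] * (A, B, C)' = (Sxz, Syz, Sz)'
        float d = Sxx*(Syy*n - Sy*Sy) - Sxy*(Sxy*n - Sy*Sx) + Sx*(Sxy*Sy - Syy*Sx);
        if (fabsf(d) < 1e-12f) return;  // degenerate hit pattern, skip event
        float A = (Sxz*(Syy*n - Sy*Sy) - Sxy*(Syz*n - Sy*Sz) + Sx*(Syz*Sy - Syy*Sz)) / d;
        float B = (Sxx*(Syz*n - Sy*Sz) - Sxz*(Sxy*n - Sy*Sx) + Sx*(Sxy*Sz - Syz*Sx)) / d;
        float C = (Sxx*(Syy*Sz - Syz*Sy) - Sxy*(Sxy*Sz - Syz*Sx) + Sxz*(Sxy*Sy - Syy*Sx)) / d;
        cx[e] = 0.5f * A; cy[e] = 0.5f * B;
        r[e]  = sqrtf(C + cx[e]*cx[e] + cy[e]*cy[e]);
    }

A fit of this form needs no seed from an external tracker, which is the constraint the contribution above addresses.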
Efficient architectures of heterogeneous FPGA-GPU for 3-D medical image compression
The advent of development in three-dimensional (3-D) imaging modalities has generated a massive amount of volumetric data in 3-D images such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US). An existing survey reveals a huge gap for further research in exploiting reconfigurable computing for 3-D medical image compression. This research proposes an FPGA-based co-processing solution to accelerate the mentioned medical imaging system. The HWT block is implemented on the sbRIO-9632 FPGA prototyping board, which carries a Spartan-3 (XC3S2000) chip. Analysis and performance evaluation of the 3-D images were conducted. Furthermore, a novel architecture is presented for the context-based adaptive binary arithmetic coder (CABAC), the advanced entropy coding tool employed by the main and higher profiles of H.264/AVC. This research focuses on a GPU implementation of CABAC and a comparative study of 3-D medical image compression systems with and without the discrete wavelet transform (DWT). Implementation results on MRI and CT images show the GPU significantly outperforming a single-threaded CPU implementation. Overall, CT and MRI modalities with DWT outperform images without the DWT process in terms of compression ratio, peak signal-to-noise ratio (PSNR), and latency. For heterogeneous computing, MRI images of various sizes and formats, such as JPEG and DICOM, were used. Evaluation results show that, for each memory iteration, transfers from GPU to CPU consume more bandwidth and throughput. For a JPEG image of 786,486 bytes, the bandwidth consumed in the two directions tends to balance. Bandwidth is relative to the transfer size: larger transfers incur more latency and higher throughput. Next, an OpenCL implementation for concurrent tasks on a dedicated FPGA is presented. Findings reveal that OpenCL in batch-processing mode with AOC techniques offers substantial results, where the amount of logic, area, registers, and memory increases proportionally with the number of batches. This is because the kernel block is replicated according to the batch number, so memory banks increase correspondingly with the kernel blocks. A comparative study found that the balanced-tree and unrolled-loop architecture provides better results in terms of local memory, latency, and throughput.
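As a rough illustration of the transform being accelerated (not the thesis' HWT/sbRIO design; all names invented), one level of a 1-D Haar wavelet transform maps each sample pair to an approximation and a detail coefficient, which is trivially parallel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // One level of a 1-D Haar wavelet transform: each thread consumes one
    // sample pair, writing approximation coefficients to the first half of
    // `out` and detail coefficients to the second half.
    __global__ void haar1d(const float* in, float* out, int half) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < half) {
            float a = in[2 * i], b = in[2 * i + 1];
            out[i]        = (a + b) * 0.70710678f;  // approximation (low-pass)
            out[half + i] = (a - b) * 0.70710678f;  // detail (high-pass)
        }
    }

    int main() {
        const int n = 1 << 20, half = n / 2;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = (float)(i % 256);  // stand-in image row
        haar1d<<<(half + 255) / 256, 256>>>(in, out, half);
        cudaDeviceSynchronize();
        printf("first approximation coefficient: %f\n", out[0]);
        cudaFree(in); cudaFree(out);
        return 0;
    }

Repeating the kernel on the approximation half (and along each axis) yields the multi-level 3-D DWT whose coefficients an entropy coder such as CABAC then compresses.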
Image-based Control and Automation of High-speed X-ray Imaging Experiments
Modern X-ray imaging provides insight into the internal structure of objects made of a wide variety of materials. The success of such measurements depends crucially on a suitable choice of acquisition conditions, on the mechanical instrumentation, and on the properties of the sample or of the process under investigation. So far, no procedure for autonomous data acquisition is known that enables control via image-based feedback across very different X-ray imaging experiments. This thesis aims to close this gap by specifically addressing and solving the problems that arise: the selection of the initial experimental parameters, fast processing of the acquired data, and automatic feedback to correct the running measurement procedure.
To determine the most suitable experimental conditions, we start from the fundamentals of image formation and develop a framework for its simulation. This enables us to conduct a wide range of virtual X-ray imaging experiments, taking into account the decisive physical processes along the path of the X-rays from the source to the detector. In addition, we consider different sample shapes and motions, which allows us to simulate experiments such as 4D (time-resolved) tomography.
Furthermore, we develop an autonomous data acquisition procedure which, based on fast image analysis, readjusts the initial conditions of the experiment during the already running measurement and can also control other experimental parameters. We pay particular attention to high-speed experiments, which place high demands on the speed of data processing, especially when the control is based on computationally intensive algorithms such as the tomographic 3D reconstruction of the sample. To implement an efficient algorithm for this purpose, we use a highly parallelized framework. Its output can then be used to compute various image metrics that provide quantitative information about the acquired data. These form the basis for decision making in a closed control loop in which the data acquisition hardware is operated.
We demonstrate the accuracy of the developed simulation framework by comparing virtual and real experiments based on grating interferometry, which employs special optical elements for contrast formation. We also investigate in detail the influence of the imaging conditions on the accuracy of the implemented filtered backprojection algorithm, and to what extent the experimental conditions can be optimized with this influence taken into account.
We demonstrate the capabilities of our autonomous data acquisition system with an in-situ tomography experiment in which it optimizes the camera frame rate based on 3D reconstruction, thereby ensuring that the recorded data sets can be reconstructed without artifacts. We also use our system to conduct a high-throughput tomography experiment in which many similar biological samples are scanned: for each of them, the tomographic rotation axis is determined automatically and, to assure quality, a complete 3D volume is reconstructed already during the measurement. In addition, we perform an in-situ laminography experiment investigating crack formation in a material sample; here, our system carries out the data acquisition and reconstructs a centrally located cross-section of the sample to verify its correct alignment and the quality of the data.
Based on highly accurate simulations, our work enables the selection of the most suitable initial conditions of an experiment, their fine-tuning during a real experiment, and finally its automatic control based on fast analysis of the data just acquired. Such an approach to data acquisition enables novel in-vivo and in-situ high-speed experiments which, due to the high data rates, could no longer be handled by a human operator.
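The thesis' metric code is not reproduced here; as a generic illustration of the kind of fast quantitative feedback such a closed loop needs, the mean and variance of a reconstructed slice can serve as a cheap quality measure and reduce to a standard parallel reduction. A minimal sketch with illustrative names (launch with 256 threads per block):

    // Per-block partial sums and sums of squares over a reconstructed slice;
    // the host adds the partials and forms mean and variance, which a control
    // loop can threshold before adjusting, e.g., the camera frame rate.
    __global__ void sum_sq(const float* img, int n, float* psum, float* psq) {
        __shared__ float s[256], q[256];
        int tid = threadIdx.x, i = blockIdx.x * blockDim.x + tid;
        float v = (i < n) ? img[i] : 0.0f;
        s[tid] = v; q[tid] = v * v;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) { s[tid] += s[tid + stride]; q[tid] += q[tid + stride]; }
            __syncthreads();
        }
        if (tid == 0) { psum[blockIdx.x] = s[0]; psq[blockIdx.x] = q[0]; }
    }

With the partial results combined on the host, variance = E[x²] − (E[x])²; the decision logic then closes the loop by reconfiguring the acquisition hardware.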
FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis
The abstract is in the attachment.
PERFORMANCE ANALYSIS AND FITNESS OF GPGPU AND MULTICORE ARCHITECTURES FOR SCIENTIFIC APPLICATIONS
Recent trends in computing architecture development have focused on exploiting task- and data-level parallelism from applications. Major hardware vendors are experimenting with novel parallel architectures, such as the Many Integrated Core (MIC) from Intel that integrates 50 or more x86 processors on a single chip, the Accelerated Processing Unit from AMD that integrates a multicore x86 processor with a graphical processing unit (GPU), and many other initiatives from other hardware vendors that are underway. Therefore, various types of architectures are available to developers for accelerating an application. A performance model that predicts the suitability of an architecture for accelerating an application would be very helpful prior to implementation. Thus, in this research, a Fitness model that ranks the potential performance of accelerators for an application is proposed. The Fitness model is then extended using statistical multiple regression to model both the runtime performance of accelerators and the impact of programming models on accelerator performance with a high degree of accuracy. We have validated both performance models for all the case studies. The error rate of these models, calculated using the experimental performance data, is tolerable in the high-performance computing field. In this research, to develop and validate the two performance models, we have also analyzed the performance of several multicore CPUs and GPGPU architectures and the corresponding programming models using multiple case studies. The first case study used in this research is a matrix-matrix multiplication algorithm. By varying the size of the matrix from small to very large, the performance of the multicore and GPGPU architectures is studied. The second case study is a biological spiking neural network (SNN), implemented with four neuron models that have varying requirements for communication and computation, making them useful for performance analysis of the hardware platforms. We report and analyze the performance variation of four popular accelerators (Intel Xeon, AMD Opteron, Nvidia Fermi, and IBM PS3) and four advanced CPU architectures (Intel 32-core, AMD 32-core, IBM 16-core, and SUN 32-core) with problem size (matrix and network size) scaling, available optimization techniques, and execution configuration. This thorough analysis provides insight regarding how the performance of an accelerator is affected by problem size, optimization techniques, and accelerator configuration. We have analyzed the performance impact of four popular multicore parallel programming models, POSIX threading, Open Multi-Processing (OpenMP), Open Computing Language (OpenCL), and Concurrency Runtime, on an Intel i7 multicore architecture, and of two GPGPU programming models, Compute Unified Device Architecture (CUDA) and OpenCL, on an NVIDIA GPGPU. With the broad study conducted using a wide range of application complexity, multiple optimizations, and varying problem size, it was found that, according to their achievable performance, the programming models for the x86 processor cannot be ranked across all applications, whereas the programming models for the GPGPU can be ranked conclusively. We have also qualitatively and quantitatively ranked all six programming models in terms of their perceived programming effort.
The results and analysis in this research, supported by the proposed performance models, indicate that for a given hardware system, the best performance for an application is obtained with a proper match of programming model and architecture.
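The first case study above, matrix-matrix multiplication, is the classic benchmark for this kind of analysis; a typical GPGPU implementation whose performance depends visibly on problem size and launch configuration, two of the factors the study varies, is shared-memory tiling. A minimal CUDA sketch:

    #define TILE 16

    // Tiled matrix multiply C = A * B for n x n row-major matrices. Staging
    // TILE x TILE blocks in shared memory cuts global-memory traffic; the
    // benefit grows with n, which is what size-scaling benchmarks expose.
    __global__ void matmul(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n; t += TILE) {
            // Load one tile of A and of B, guarding the matrix edges.
            As[threadIdx.y][threadIdx.x] =
                (row < n && t + threadIdx.x < n) ? A[row * n + t + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (t + threadIdx.y < n && col < n) ? B[(t + threadIdx.y) * n + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < n && col < n) C[row * n + col] = acc;
    }

Varying TILE and n while timing this kernel reproduces, in miniature, the optimization-and-scaling experiments the study performs across architectures.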
Application of Future Processor Architectures in the Modelling of Precise Particle Accelerators
Emerging processor architectures such as graphical processing units (GPUs) and Intel Many Integrated Cores (MICs) provide a huge performance potential for high performance computing. However, developing software that uses these hardware accelerators introduces additional challenges for the developer. These challenges may include exposing increased parallelism, handling different hardware designs, and using multiple development frameworks in order to utilize devices from different vendors. The Dynamic Kernel Scheduler (DKS) is being developed in order to provide a software layer between the host application and different hardware accelerators. DKS handles the communication between the host and the device, schedules task execution, and provides a library of built-in algorithms. Algorithms available in the DKS library will be written in CUDA, OpenCL, and OpenMP. Depending on the available hardware, the DKS can select the appropriate implementation of the algorithm. The DKS was used to enable co-processor usage in applications such as OPAL (Object-oriented Particle Accelerator Library), musrfit, and a PET (Positron Emission Tomography) image reconstruction application.
These applications are developed at Paul Scherrer Institut and ETH Zurich for particle accelerator modeling and experimental data analysis, and are used by a worldwide user community. The achieved results show that substantial speedups in application execution times can be achieved using co-processors compared to CPUs, and with the help of DKS the process of integrating new processors into existing applications is simplified and more maintainable. The potential of the new hardware architectures is further demonstrated by porting to CUDA the multibunch tracking application (mbtrack) developed at SOLEIL (the French national synchrotron facility). This application is used at PSI for a detailed study of coupled-bunch instabilities and transient beam loading. By using the computational power of GPUs, the necessary simulations can be done on the GPU instead of the larger computing cluster that would be required otherwise. Keywords: hardware acceleration, GPU computing, Intel MIC, CUDA, OpenCL, OpenMP
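DKS's actual API lives in the cited work; the backend-selection idea it describes can be pictured as a thin dispatch layer that probes the hardware and routes a task to the matching implementation. A hypothetical sketch (all names invented, CUDA versus a host OpenMP fallback only):

    #include <cuda_runtime.h>

    __global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    static void saxpy_cuda(int n, float a, const float* x, float* y) {
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float)); cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx); cudaFree(dy);
    }

    static void saxpy_host(int n, float a, const float* x, float* y) {
        #pragma omp parallel for  // CPU fallback in the OpenMP style
        for (int i = 0; i < n; ++i) y[i] += a * x[i];
    }

    // DKS-like dispatch: offload when an accelerator is present, else stay on the CPU.
    void saxpy_dispatch(int n, float a, const float* x, float* y) {
        int ndev = 0;
        if (cudaGetDeviceCount(&ndev) == cudaSuccess && ndev > 0)
            saxpy_cuda(n, a, x, y);
        else
            saxpy_host(n, a, x, y);
    }

A real scheduler like DKS additionally manages device memory across calls and exposes a library of tuned kernels, but the routing decision has this shape.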