Optimization of high-throughput real-time processes in physics reconstruction
This thesis was developed in collaboration between the
Universidad de Sevilla and the European Organization for
Nuclear Research, CERN.
The LHCb detector is one of the four large detectors
located at the Large Hadron Collider, LHC. At LHCb,
particles are collided at high energies to understand the
difference between matter and antimatter. Because of the
enormous amount of data generated by the detector, the data
must be filtered in real time, based on the current
knowledge captured in the Standard Model of particle
physics. This filtering, also known as the High Level
Trigger, must process a throughput of 40 Tb/s of data
and perform a selection of approximately 1 000:1, reducing
the throughput to about 40 Gb/s of output, which is stored for
later analysis.
The High Level Trigger process is in turn subdivided into
two stages: High Level Trigger 1 (HLT1) and High Level Trigger
2 (HLT2). HLT1 runs in real time and reduces the data by
approximately 30:1. HLT1 consists of a series of software
processes that reconstruct what happened in the particle
collision. The HLT1 reconstruction analyzes only the
trajectories of the particles produced by the collision, a
problem known as track reconstruction, to decide whether a
collision is of interest. By contrast, HLT2 is a finer process
that takes more time to run and reconstructs all the
subdetectors that make up LHCb.
Towards 2020, the LHCb detector and all components of the
data acquisition system will be upgraded in line with the
latest technical developments. As part of the data
acquisition system, the servers that process HLT1 and
HLT2 will also be upgraded. At the same time, the LHC
accelerator will also be upgraded, so that the amount of
data generated in each bunch crossing will increase roughly
fivefold. Because of the upgrades to both the accelerator
and the detector, the amount of data that the HLT as a
whole must process is expected to be about 40 times larger
than at present.
The projection of how the current software would scale to 2020
underestimated the resources needed to cope with the increase
in throughput. This prompted a study of all HLT1 and HLT2
algorithms, together with an update of the code to newer
standards, to improve its performance and make it capable of
processing the expected amount of data.
In this thesis, several algorithms of the LHCb reconstruction
are explored. The track reconstruction problem is analyzed
in depth and new algorithms are proposed to solve it.
Since the problems analyzed exhibit massive parallelism,
these algorithms are implemented in languages
specialized for modern graphics cards (GPUs), given their
inherently parallel architecture. Two track reconstruction
algorithms are designed in this work. In addition, four
decoding algorithms and one clustering algorithm are designed,
problems that also arise in HLT1. Furthermore, an algorithm
for Kalman filtering is designed, which can be used in both stages.
The algorithms developed meet the requirements expected
by the LHCb collaboration for 2020. To run the algorithms
efficiently on graphics cards, a framework specialized for
GPUs is developed, which allows reconstruction sequences
to be executed in parallel on GPUs.
Combining the developed algorithms with the framework, an
execution sequence is completed that lays the foundations for
an HLT1 executable on GPUs.
During the research carried out in this thesis, thanks to
the developments mentioned above and the collaboration of
a small team coordinated by the author, an HLT1 executable
on GPUs is completed. The performance obtained on GPUs as
a result of this thesis makes it possible to meet the
challenge of running a reconstruction sequence in real time
under the upgraded LHCb conditions foreseen for 2020.
Likewise, a High Level Trigger that runs solely on GPUs is
completed for the first time for any LHC experiment.
Finally, several possible configurations for including
graphics cards in the LHCb data acquisition system are
detailed.
This thesis has been developed in collaboration between
Universidad de Sevilla and the European Organization for Nuclear
Research, CERN.
The LHCb detector is one of four big detectors placed alongside
the Large Hadron Collider, LHC. In LHCb, particles are
collided at high energies in order to understand the difference
between matter and antimatter. Due to the massive quantity
of data generated by the detector, it is necessary to filter data
in real time. The filtering, also known as the High Level Trigger,
processes a throughput of 40 Tb/s of data and performs a selection
of approximately 1 000:1. The throughput is thus reduced
to roughly 40 Gb/s of data output, which is then stored for
later analysis.
The High Level Trigger process is subdivided into two stages:
High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2).
HLT1 runs in real time and reduces the data by approximately
30:1. HLT1 consists of a series of software processes
that reconstruct particle collisions. The HLT1 reconstruction only
analyzes the trajectories of particles produced in the collision,
solving a problem known as track reconstruction, which determines
whether the collision data is kept or discarded. In contrast,
HLT2 is a finer process, which requires more time to execute
and reconstructs all subdetectors composing LHCb.
Towards 2020, the LHCb detector and all components
of the data acquisition system will be upgraded. As
part of the data acquisition system, the servers that process
HLT1 and HLT2 will also be upgraded. In addition, the LHC
accelerator will be upgraded, increasing the data generated in
every bunch crossing roughly fivefold. Due to the accelerator
and detector upgrades, the amount of data that the HLT will
have to process is expected to increase 40-fold.
The foreseen scalability of the software through 2020 underestimated
the required resources to face the increase in data
throughput. As a consequence, studies of all algorithms composing
HLT1 and HLT2 and code modernizations were carried
out, in order to obtain a better performance and increase the
processing capability of the foreseen hardware resources in the
upgrade.
In this thesis, several algorithms of the LHCb reconstruction
are explored. The track reconstruction problem is analyzed
in depth, and new algorithms are proposed. Since the analyzed
problems are massively parallel, these algorithms are implemented
in specialized languages for modern graphics cards
(GPUs), due to their inherently parallel architecture. Two
track reconstruction algorithms are designed in this work.
Furthermore, four decoding algorithms and a clustering algorithm,
which are also part of HLT1, have been designed
and implemented. In addition, a parallel Kalman filter
algorithm has been designed
and implemented, which can be used in both HLT stages.
The developed algorithms satisfy the requirements of the
LHCb collaboration for the LHCb upgrade. In order to execute
the algorithms efficiently on GPUs, a software framework specialized
for GPUs is developed, which allows executing GPU
reconstruction sequences in parallel. Combining the developed
algorithms with the framework, an execution sequence is completed
as the foundations of a GPU HLT1.
During the research carried out in this thesis, the aforementioned
developments and a small group of collaborators coordinated
by the author led to the completion of a full GPU
HLT1 sequence. The performance obtained on GPUs allows
executing a reconstruction sequence in real time, under LHCb
upgrade conditions. The developed GPU HLT1 constitutes the
first GPU high level trigger ever developed for an LHC experiment.
Finally, various possible ways of integrating the GPU HLT1
into a production GPU-equipped data acquisition system
are detailed.
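The data rates quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above (40 Tb/s input, about 1 000:1 overall, about 30:1 for HLT1). The implied HLT2 factor is an inference that assumes the two stage factors simply multiply; it is not a number given by the thesis. All names are illustrative.

```python
# Figures quoted in the abstract; variable names are illustrative.
INPUT_TBPS = 40.0          # data entering the High Level Trigger, Tb/s
TOTAL_REDUCTION = 1000.0   # overall selection, approximately 1 000:1
HLT1_REDUCTION = 30.0      # HLT1 stage alone, approximately 30:1


def stage_rates(input_tbps, total_reduction, hlt1_reduction):
    """Return (rate after HLT1 in Gb/s, stored rate in Gb/s, implied HLT2 factor)."""
    input_gbps = input_tbps * 1000.0              # 1 Tb/s = 1000 Gb/s
    after_hlt1 = input_gbps / hlt1_reduction
    stored = input_gbps / total_reduction
    # Assumption: overall factor = HLT1 factor * HLT2 factor.
    implied_hlt2 = total_reduction / hlt1_reduction
    return after_hlt1, stored, implied_hlt2


after_hlt1, stored, hlt2_factor = stage_rates(
    INPUT_TBPS, TOTAL_REDUCTION, HLT1_REDUCTION
)
print(f"after HLT1: {after_hlt1:.0f} Gb/s")
print(f"stored output: {stored:.0f} Gb/s")
print(f"implied HLT2 reduction: {hlt2_factor:.1f}:1")
```

Under that assumption, HLT2 would need to reduce the data by roughly a further factor of 33 to reach the quoted 40 Gb/s.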
Novel high performance techniques for high definition computer aided tomography
International Mention in the doctoral degree
Medical image processing is an interdisciplinary field in which multiple research areas are involved:
image acquisition, scanner design, image reconstruction algorithms, visualization, etc.
X-Ray Computed Tomography (CT) is a medical imaging modality based on the attenuation
suffered by the X-rays as they pass through the body. Intrinsic differences in attenuation properties
of bone, air, and soft tissue result in high-contrast images of anatomical structures. The
main objective of CT is to obtain tomographic images from radiographs acquired using X-Ray
scanners. The process of building a 3D image or volume from the 2D radiographs is known as
reconstruction. One of the latest trends in CT is the reduction of the radiation dose delivered
to patients through the decrease of the amount of acquired data. This reduction results in artefacts
in the final images if conventional reconstruction methods are used, making it advisable to
employ iterative reconstruction algorithms.
There are numerous reconstruction algorithms available, among which we can highlight two
specific types: traditional algorithms, which are fast but cannot produce high
quality images in situations of limited data; and iterative algorithms, slower but more reliable
when traditional methods do not meet the quality requirements. One of the priorities
of reconstruction is obtaining the final images in near real time, in order to reduce the
time spent in diagnosis. To accomplish this objective, new high performance techniques and methods
for accelerating these types of algorithms are needed. This thesis addresses the challenges
of both traditional and iterative reconstruction algorithms, regarding acceleration and image
quality. One common approach for accelerating these algorithms is the usage of shared-memory
and heterogeneous architectures. In this thesis, we propose a novel simulation/reconstruction
framework, namely FUX-Sim. This framework follows the hypothesis that the development of
new flexible X-ray systems can benefit from computer simulations, which may also enable performance
to be checked before expensive real systems are implemented. Its modular design
abstracts the complexities of programming for accelerated devices to facilitate the development
and evaluation of the different configurations and geometries available. In order to obtain near
real execution times, low-level optimizations for the main components of the framework are
provided for Graphics Processing Unit (GPU) architectures.
Another alternative addressed in this thesis is the acceleration of iterative reconstruction algorithms
by using distributed memory architectures. We present a novel architecture that unifies
the two most important computing paradigms for scientific computing nowadays: High Performance
Computing (HPC) and Big Data. The proposed architecture combines Big Data frameworks with the
advantages of accelerated computing.
The methods proposed in this thesis provide more flexible scanner configurations
and offer an accelerated solution. Regarding performance, our approach is as competitive as
the solutions found in the literature. Additionally, we demonstrate that our solution scales with
the size of the problem, enabling the reconstruction of high resolution images.
Medical image processing is an interdisciplinary field involving multiple
research areas, such as image acquisition, scanner design, image
reconstruction algorithms, visualization, etc. X-ray computed tomography (CT) is
a medical imaging modality based on computing the attenuation suffered by X-rays as
they pass through the body being scanned. Intrinsic differences in the attenuation of bone,
air, and soft tissue result in high-contrast images of these anatomical structures.
The main objective of CT is to obtain tomographic images from the radiographs
acquired with X-ray scanners. The process of building a 3D image or volume from
the 2D radiographs is known as reconstruction. One of the latest trends in
computed tomography is reducing the radiation dose delivered to patients
by reducing the amount of acquired data. This reduction produces
artefacts in the final images if conventional reconstruction methods are used, so
it is advisable to employ iterative reconstruction algorithms.
Numerous reconstruction algorithms are available, among which we can
highlight two categories: traditional algorithms, which are fast but cannot produce
high-quality images when data are limited; and iterative algorithms, which are slower
but more stable in situations where traditional methods do not meet the
image quality requirements. One of the priorities of reconstruction is obtaining
the final images in near real time, in order to reduce diagnosis time. To
achieve this goal, new high-performance techniques and methods are needed to accelerate
these algorithms.
This thesis addresses the challenges of traditional and iterative reconstruction algorithms
with respect to acceleration and image quality. A common approach to accelerating these
algorithms is the use of shared-memory and heterogeneous architectures. In this thesis,
we propose a new simulation/reconstruction system called FUX-Sim. This system is
built around the hypothesis that the development of new flexible X-ray systems
can benefit from computer simulations, which also make it possible to
check the performance of new systems before they are physically
implemented. Its modular design abstracts the complexities of programming for accelerators in
order to ease the development and evaluation of the different configurations and geometries
available. To obtain near-real-time execution, low-level optimizations are
provided for the main components of the system on GPU architectures.
Another alternative addressed in this thesis is the acceleration of iterative reconstruction
algorithms using distributed-memory architectures. We present a novel
architecture that unifies the two most important computing paradigms today:
high-performance computing (HPC) and Big Data. The proposed architecture combines
Big Data systems with the advantages of accelerator devices.
The methods proposed in this thesis provide more flexible scanner
configurations and offer an accelerated solution. Regarding performance, our approach is as
competitive as the solutions found in the literature. In addition, we demonstrate that our
solution scales with the size of the problem, which enables the reconstruction of
high-resolution images.
This work has been mainly funded by an FPU fellowship (FPU14/03875) from the Spanish
Ministry of Education.
It has also been partially supported by other grants:
• DPI2016-79075-R. “Nuevos escenarios de tomografía por rayos X”, from the Spanish Ministry
of Economy and Competitiveness.
• TIN2016-79637-P Towards unification of HPC and Big Data Paradigms from the Spanish
Ministry of Economy and Competitiveness.
• Short-term scientific missions (STSM) grant from NESUS COST Action IC1305.
• TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems
from the Spanish Ministry of Economy and Competitiveness.
• RTC-2014-3028-1 NECRA Nuevos escenarios clinicos con radiología avanzada from the
Spanish Ministry of Economy and Competitiveness.
Official Doctoral Programme in Computer Science and Technology
Chair: José Daniel García Sánchez. Secretary: Katzlin Olcoz Herrero. Member: Domenico Tali
Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of Massive-Parallel Trajectory Calculations (MPTRAC) v2.6
Lagrangian particle dispersion models are indispensable tools for the study of atmospheric transport processes. However, Lagrangian transport simulations can become numerically expensive when large numbers of air parcels are involved. To accelerate these simulations, we made considerable efforts to port the Massive-Parallel Trajectory Calculations (MPTRAC) model to graphics processing units (GPUs). Here we discuss performance optimizations of the major bottleneck of the GPU code of MPTRAC, the advection kernel. Timeline, roofline, and memory analyses of the baseline GPU code revealed that the application is memory-bound, and performance suffers from near-random memory access patterns. By changing the data structure of the horizontal wind and vertical velocity fields of the global meteorological data driving the simulations from structure of arrays (SoAs) to array of structures (AoSs) and by introducing a sorting method for better memory alignment of the particle data, performance was greatly improved. We evaluated the performance on NVIDIA A100 GPUs of the Jülich Wizard for European Leadership Science (JUWELS) Booster module at the Jülich Supercomputing Center, Germany. For our largest test case, transport simulations with 10^8 particles driven by the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis, we found that the runtime for the full set of physics computations was reduced by 75 %, including a reduction of 85 % for the advection kernel. In addition to demonstrating the benefits of code optimization for GPUs, we show that the runtime of central processing unit (CPU-)only simulations is also improved. For our largest test case, we found a runtime reduction of 34 % for the physics computations, including a reduction of 65 % for the advection kernel.
The code optimizations discussed here bring the MPTRAC model closer to applications on upcoming exascale high-performance computing systems and will also be of interest for optimizing the performance of other models using particle methods.
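The two optimizations named in this abstract, switching the wind fields from a structure of arrays to an array of structures and sorting the particle data, can be illustrated with a small index calculation. The sketch below is not MPTRAC code. It assumes three wind components (u, v, w) per grid cell stored in one flat buffer, and every function name is hypothetical.

```python
def soa_offsets(cell, n_cells):
    """Flat offsets of (u, v, w) for one cell when each field is its own array."""
    return (cell, n_cells + cell, 2 * n_cells + cell)


def aos_offsets(cell):
    """Flat offsets of (u, v, w) when the three values of a cell sit together."""
    return (3 * cell, 3 * cell + 1, 3 * cell + 2)


def span(offsets):
    """Distance between the first and last element touched by one lookup."""
    return max(offsets) - min(offsets)


def sort_particles_by_cell(particle_cells):
    """Return particle indices ordered by the grid cell each particle is in."""
    return sorted(range(len(particle_cells)), key=lambda p: particle_cells[p])


n_cells = 1_000_000
cell = 123_456
print(span(soa_offsets(cell, n_cells)))   # elements spanned by one SoA lookup
print(span(aos_offsets(cell)))            # elements spanned by one AoS lookup

particle_cells = [42, 7, 42, 3, 7]        # hypothetical cell index per particle
order = sort_particles_by_cell(particle_cells)
print([particle_cells[p] for p in order])
```

With the array-of-structures layout, one wind lookup touches three adjacent elements instead of three locations that are n_cells apart. After sorting, neighbouring particles read neighbouring cells, which is the kind of access pattern a memory-bound kernel benefits from.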
Methodology for complex dataflow application development
This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete or take much longer than originally estimated by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist.
The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications and both achieve state-of-the-art performance.
The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms.
The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the amount of data parallelism available is significantly
higher than the small number of processing units enabled by commodity systems
(e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM
architectures, respectively.
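To make the quadratic cost of sDTW mentioned in this abstract concrete, here is a minimal software sketch of the textbook subsequence DTW recurrence. It is not the MATSA design and does not model in-memory computation. The absolute difference used as the local cost and the function name sdtw_distance are assumptions made for illustration.

```python
import math


def sdtw_distance(query, series):
    """Smallest DTW cost of aligning `query` to any contiguous part of `series`."""
    if not query or not series:
        raise ValueError("query and series must be non-empty")
    m = len(series)
    # Row 0 (empty query) is all zeros, so a match may start anywhere in the series.
    prev = [0.0] * (m + 1)
    for q in query:
        # Column 0 is infinite: a non-empty query cannot match an empty prefix.
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(q - series[j - 1])          # assumed local cost
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may also end anywhere, so take the minimum of the last row.
    return min(prev[1:])


print(sdtw_distance([1, 2, 3], [0, 0, 1, 2, 3, 0]))
print(sdtw_distance([1, 2, 3], [5, 5, 5]))
```

The nested loops visit every (query element, series element) pair once, which is where the quadratic growth comes from. Each cell depends only on three neighbours, so cells on the same anti-diagonal are independent of one another, which is one source of the data parallelism the abstract refers to.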