Biofunctionalized all-polymer photonic lab on a chip with integrated solid-state light emitter
A photonic lab on a chip (PhLOC), comprising a solid-state light emitter (SSLE) aligned with a biofunctionalized optofluidic multiple internal reflection (MIR) system, is presented. The SSLE is obtained by filling a microfluidic structure with a phenyltrimethoxysilane (PhTMOS) aqueous sol solution containing an organic fluorophore dye. After curing, the resulting solid xerogel structure retains the emitting properties of the fluorophore, which is evenly distributed in the xerogel matrix. Photostability studies demonstrate that after a total dose (at λ = 365 nm) greater than 24 J/cm², the xerogel emission decays by only 4.1%. To redirect the emitted light, the SSLE includes two sets of air mirrors that surround the xerogel. Emission mapping of the SSLE demonstrates that alignment variations of 150 µm (between the SSLE and the external pumping light source) produce fluctuations in emitted light smaller than 5%. After this verification, the SSLE is monolithically integrated with a MIR, forming the PhLOC. Its performance is assessed by measuring quinoline yellow, obtaining a limit of detection (LOD) of (0.60 ± 0.01) mM. Finally, the MIR is selectively biofunctionalized with horseradish peroxidase (HRP) for the detection of the target analyte hydrogen peroxide (H2O2), obtaining a LOD of (0.7 ± 0.1) mM for H2O2 and confirming, for the first time, that solid-state xerogel-based emitters can be massively implemented in biofunctionalized PhLOCs.
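The abstract does not state how its LODs were derived; a widespread convention in optical sensing is the 3-sigma criterion, LOD = 3·σ_blank / slope, with the slope taken from a linear calibration fit. A minimal C++ sketch of that convention, using entirely hypothetical calibration numbers:

```cpp
// Illustrative only: the paper's LOD method is not given in the abstract.
// Assumes the common 3-sigma convention: LOD = 3 * sigma_blank / slope,
// with the slope from an ordinary-least-squares linear calibration fit.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical calibration: concentration (mM) vs. optical signal (a.u.).
    std::vector<double> c = {0.5, 1.0, 2.0, 4.0, 8.0};
    std::vector<double> s = {0.021, 0.043, 0.088, 0.170, 0.345};

    double n = c.size(), sc = 0.0, ss = 0.0, scc = 0.0, scs = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        sc += c[i]; ss += s[i]; scc += c[i] * c[i]; scs += c[i] * s[i];
    }
    double slope = (n * scs - sc * ss) / (n * scc - sc * sc);  // OLS slope

    double sigma_blank = 0.004;  // hypothetical std. dev. of blank readings

    printf("slope = %.4f a.u./mM, LOD = %.2f mM\n",
           slope, 3.0 * sigma_blank / slope);
}
```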
An Investigation into the Performance Evaluation of Connected Vehicle Applications: From Real-World Experiment to Parallel Simulation Paradigm
A novel system was developed that provides drivers with lane-merge advisories, using vehicle trajectories obtained through Dedicated Short Range Communication (DSRC). It was successfully tested on a freeway using three vehicles and then targeted for further testing via simulation. The failure of contemporary simulators to effectively model large, complex urban transportation networks then motivated further research into distributed and parallel traffic simulation. An architecture for a closed-loop parallel simulator was devised, using a new algorithm that accounts for boundary nodes, traffic signals, intersections, road lengths, traffic density, and lane counts; it partitions a sample Tennessee road network more efficiently than tools like METIS, which increase interprocess communication (IPC) overhead by partitioning more transportation corridors. The simulator uses logarithmic accumulation to synchronize parallel simulations, further reducing IPC, as the sketch below illustrates. Analyses suggest this eliminates up to one-third of the IPC overhead incurred by a linear accumulation model.
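The paper's synchronization code is not reproduced in the abstract; the following sketch only illustrates why logarithmic accumulation cuts synchronization cost: combining p per-partition states pairwise takes ceil(log2 p) rounds rather than the p-1 sequential steps of a linear accumulation. The BoundaryState type and its contents are hypothetical.

```cpp
// Hypothetical per-partition state exchanged at partition boundaries.
#include <cstdio>
#include <vector>

struct BoundaryState { int vehicles_crossing; };

BoundaryState combine(const BoundaryState& a, const BoundaryState& b) {
    return {a.vehicles_crossing + b.vehicles_crossing};
}

int main() {
    // Eight partitions of the road network, each reporting boundary traffic.
    std::vector<BoundaryState> states = {{3}, {1}, {4}, {1}, {5}, {9}, {2}, {6}};
    int rounds = 0;
    while (states.size() > 1) {  // each round halves the number of states
        std::vector<BoundaryState> next;
        for (std::size_t i = 0; i + 1 < states.size(); i += 2)
            next.push_back(combine(states[i], states[i + 1]));
        if (states.size() % 2 != 0) next.push_back(states.back());
        states = next;
        ++rounds;
    }
    // 8 partitions: 3 pairwise rounds vs. 7 sequential linear combine steps.
    printf("total = %d after %d rounds\n", states[0].vehicles_crossing, rounds);
}
```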
Parallel source code transformation techniques using design patterns
International Mention in the doctoral degree.
In recent years, traditional approaches for improving performance, such as increasing the clock frequency, have come to a dead end. To tackle this issue, parallel architectures, such as multi-/many-core processors, have been envisioned to increase performance by providing greater processing capabilities. However, programming efficiently for these architectures demands significant effort to transform sequential applications into parallel ones and to optimize them. Compared to sequential programming, designing and implementing parallel applications that operate on modern hardware poses a number of new challenges to developers, such as data races, deadlocks, and load imbalance.
To pave the way, parallel design patterns provide a means to encapsulate algorithmic aspects, allowing users to implement robust, readable, and portable solutions through such high-level abstractions. Basically, these patterns instantiate parallelism while hiding the complexity of concurrency mechanisms, such as thread management, synchronization, or data sharing. Nonetheless, frameworks following this philosophy do not share the same interface, and users must understand different libraries and their capabilities, not only to decide which fits their purposes best but also to leverage them properly. Furthermore, in order to parallelize these applications, it is necessary to analyze the sequential code to detect the regions that can be parallelized, which is a time-consuming and complex task. Additionally, different libraries targeted at specific devices provide implementations of some algorithms that are already parallel and highly tuned. In these situations, it is also necessary to analyze and determine which routine implementation is the most suitable for a given problem.
To tackle these issues, this thesis aims at simplifying and minimizing the effort necessary to transform sequential applications into parallel ones. This way, the resulting codes improve their performance by fully exploiting the available resources, while development effort is considerably reduced. Basically, this thesis makes the following contributions. First, we propose a technique to detect potential parallel patterns in sequential code. Second, we provide a novel generic C++ interface for parallel patterns which acts as a switch among existing frameworks. Third, we implement a framework that is able to transform sequential code into parallel code using the proposed pattern discovery technique and pattern interface. Finally, we propose mechanisms that are able to select the most suitable device and routine implementation to solve a given problem, based on previous performance information. The evaluation demonstrates that the proposed techniques can minimize refactoring and optimization time while improving the performance of the resulting applications with respect to the original code.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática (Official Doctoral Programme in Computer Science and Technology). Committee: President, David Expósito Singh; Secretary, Rafael Asenjo Plaza; Member, Marco Aldinucci.
Deep Learning in the Automotive Industry: Applications and Tools
Deep Learning refers to a set of machine learning techniques that utilize neural networks with many hidden layers for tasks such as image classification, speech recognition, and language understanding. Deep learning has been proven to be very effective in these domains and is pervasively used by many Internet services. In this paper, we describe different automotive use cases for deep learning, in particular in the domain of computer vision. We survey the current state of the art in libraries, tools, and infrastructures (e.g., GPUs and clouds) for implementing, training, and deploying deep neural networks. We particularly focus on convolutional neural networks and computer vision use cases, such as the visual inspection process in manufacturing plants and the analysis of social media data. To train neural networks, curated and labeled datasets are essential. In particular, both the availability and scope of such datasets are typically very limited. A main contribution of this paper is the creation of an automotive dataset that allows us to learn and automatically recognize different vehicle properties. We describe an end-to-end deep learning application utilizing a mobile app for data collection and process support, and an Amazon-based cloud backend for storage and training. For training, we evaluate the use of cloud and on-premises infrastructures (including multiple GPUs) in conjunction with different neural network architectures and frameworks. We assess both the training times and the accuracy of the classifier. Finally, we demonstrate the effectiveness of the trained classifier in a real-world setting during the manufacturing process.
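As a concrete anchor for the convolutional networks the paper focuses on, the following toy C++ sketch shows the one operation a convolutional layer repeats: a valid-padding 2D convolution of a single channel with a single filter. The data and kernel are toy values, unrelated to the paper's models.

```cpp
#include <cstdio>

int main() {
    // Toy 5x5 single-channel "image" and a 3x3 vertical-edge-style kernel.
    const int H = 5, W = 5, K = 3;
    float img[H][W] = {{1, 1, 1, 0, 0},
                       {1, 1, 1, 0, 0},
                       {1, 1, 1, 0, 0},
                       {1, 1, 1, 0, 0},
                       {1, 1, 1, 0, 0}};
    float ker[K][K] = {{1, 0, -1},
                       {1, 0, -1},
                       {1, 0, -1}};
    // Valid padding: the (H-K+1) x (W-K+1) map of sliding dot products.
    for (int i = 0; i <= H - K; ++i) {
        for (int j = 0; j <= W - K; ++j) {
            float acc = 0.0f;
            for (int ki = 0; ki < K; ++ki)
                for (int kj = 0; kj < K; ++kj)
                    acc += img[i + ki][j + kj] * ker[ki][kj];
            printf("%6.1f", acc);
        }
        printf("\n");
    }
}
```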
TinyML: Tools, Applications, Challenges, and Future Research Directions
In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have gained significant interest from both industry and academia. Notably, conventional ML techniques require enormous amounts of power to meet the desired accuracy, which has limited their use mainly to high-capability devices such as network nodes. However, with many advancements in technologies such as the Internet of Things (IoT) and edge computing, it is desirable to incorporate ML techniques into resource-constrained embedded devices for distributed and ubiquitous intelligence. This has motivated the emergence of the TinyML paradigm, an embedded ML technique that enables ML applications on multiple cheap, resource- and power-constrained devices. However, during this transition towards appropriate implementation of the TinyML technology, multiple challenges, such as processing capacity optimization, improved reliability, and maintenance of learning models' accuracy, require timely solutions. In this article, various avenues available for TinyML implementation are reviewed. First, a background of TinyML is provided, followed by detailed discussions on various tools supporting TinyML. Then, state-of-the-art applications of TinyML using advanced technologies are detailed. Lastly, various research challenges and future directions are identified.
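TensorFlow Lite is one of the tools TinyML surveys typically cover (deeply embedded deployments often use its Micro variant, whose API differs). A minimal sketch of the standard TensorFlow Lite C++ inference flow, assuming a hypothetical model.tflite with one float input and one float output tensor:

```cpp
#include <cstdio>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Hypothetical model file; any small .tflite classifier would do.
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
    if (!model) return 1;

    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    interpreter->AllocateTensors();

    // Fill the input tensor (zeros as a stand-in for real sensor data).
    float* input = interpreter->typed_input_tensor<float>(0);
    input[0] = 0.0f;

    interpreter->Invoke();  // run on-device inference

    float* output = interpreter->typed_output_tensor<float>(0);
    printf("score[0] = %f\n", output[0]);
}
```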
Optimization of deep learning algorithms for an autonomous RC vehicle
Master's dissertation in Informatics Engineering.
This dissertation aims to evaluate and improve the performance of deep learning (DL) algorithms to autonomously drive a vehicle, using a remote-controlled (RC) car as testbed. The RC vehicle was built from a 1:10-scale remote-controlled car and fitted with an embedded system and a video camera to capture and process real-time image data. Two different embedded systems were comparatively evaluated: a homogeneous system, a Raspberry Pi 4, and a heterogeneous system, an NVidia Jetson Nano. The Raspberry Pi 4, with an advanced 4-core ARM device, supports multiprocessing, while the Jetson Nano, also with a 4-core ARM device, has an integrated accelerator, a 128-CUDA-core NVidia GPU.
The captured video is processed with convolutional neural networks (CNNs), which interpret image data of the vehicle's surroundings and predict critical data, such as lane view and steering angle, providing the mechanisms for the vehicle to drive on its own, following a predefined path.
To improve the driving performance of the RC vehicle, this work analysed the programmed
DL algorithms, namely different computer vision approaches for object detection and image
classification, aiming to explore DL techniques and improve their performance at the inference
phase.
The work also analysed the computational efficiency of the control software, while running
intense and complex deep learning tasks in the embedded devices, and fully explored the
advanced characteristics and instructions provided by the two embedded systems in the
vehicle.
Different machine learning (ML) libraries and frameworks were analysed and evaluated: TensorFlow, TensorFlow Lite, Arm NN, PyArmNN, and TensorRT. They play a key role in deploying the relevant algorithms and in fully engaging the hardware capabilities.
The original algorithm was successfully optimized, and both embedded systems could comfortably handle this workload. To understand the computational limits of both devices, an additional, heavier DL algorithm was developed, aimed at detecting traffic signs. The homogeneous system, the Raspberry Pi 4, could not deliver feasible low-latency values, hence the detection of traffic signs was not possible in real time. However, a great performance improvement was achieved with the heterogeneous system, the Jetson Nano, by enabling its CUDA cores to process the additional workload.
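The dissertation's code is not included in the abstract; its real-time feasibility argument boils down to a latency budget: per-frame inference must fit within the camera period, e.g. about 33 ms at 30 FPS. A self-contained C++ sketch of that check, with a placeholder run_inference() standing in for the CNN:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder for CNN inference on one camera frame (hypothetical timing).
void run_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(12));
}

int main() {
    const double budget_ms = 1000.0 / 30.0;  // 30 FPS camera period
    auto t0 = std::chrono::steady_clock::now();
    run_inference();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("inference: %.1f ms (budget %.1f ms) -> %s\n",
           ms, budget_ms, ms <= budget_ms ? "real-time OK" : "too slow");
}
```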
Towards automatic parallelization of stream processing applications
Parallelizing and optimizing code for recent multi-/many-core processors is recognized to be a complex task. For this reason, strategies to automatically transform sequential code into parallel code and to discover optimization opportunities are crucial to relieve the burden on developers. In this paper, we present a compile-time framework to (semi-)automatically find parallel patterns (Pipeline and Farm) and transform sequential streaming applications into parallel ones using GrPPI, a generic parallel pattern interface. This framework uses a novel pipeline stage-balancing technique which provides the code generator module with the information necessary to produce balanced pipelines. The evaluation, using a synthetic video benchmark and a real-world computer vision application, demonstrates that the presented framework is capable of producing parallel and optimized versions of the application. A comparison study under several thread-core oversubscribed conditions reveals that the framework delivers performance comparable to the Intel TBB programming framework.
This work was supported in part by the Spanish Ministerio de Economía y Competitividad through the project Toward Unification of HPC and Big Data Paradigms under Grant TIN2016-79637-P, and in part by the EU project RePhrase: REfactoring Parallel Heterogeneous Resource-Aware Applications under Grant ICT 644235.
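GrPPI's exact API is not shown in the abstract, so the following sketch uses only standard C++ threads to illustrate the Pipeline pattern the framework detects and generates: a producer stage feeding a consumer stage through a synchronized queue.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

int main() {
    std::thread producer([] {            // stage 1: frame source
        for (int i = 0; i < 10; ++i) {
            std::lock_guard<std::mutex> lk(m);
            q.push(i);
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });
    std::thread consumer([] {            // stage 2: per-frame processing
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty() && done) break;
            int frame = q.front(); q.pop();
            lk.unlock();
            printf("processed frame %d\n", frame);
        }
    });
    producer.join();
    consumer.join();
}
```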