Automatic Loop Nest Parallelization for the Predictable Execution Model
Currently, embedded real-time systems still widely use single-core processors. A major challenge in the adoption of multicore processors is the presence of shared hardware resources such as main memory. Contention between threads executing on different cores for access to such resources makes it difficult to tightly estimate the Worst-Case Execution Time (WCET) of applications. To safely employ multicore processors in real-time systems, previous work has introduced a PRedictable Execution Model (PREM) for embedded Multi-Processor Systems-on-a-Chip (MPSoCs). Under PREM, each thread is divided into memory phases, where the code and data required by the thread are moved from main memory to a local memory (cache or scratchpad) or vice versa, and execution phases, where the thread computes based on the code and data available in local memory. Memory phases are then scheduled by the Operating System (OS) to avoid contention among threads, thus resulting in tight WCET bounds. The main challenge in applying the model is to automatically generate optimized PREM-compliant code instead of rewriting programs manually. Note that many programs of interest, such as emerging AI and neural network kernels, comprise both compute-intensive and memory-intensive deeply nested loops. Hence, PREM code generation and optimization should be applicable to nested loop structures and consider whether performance is constrained by computation or memory transfers.
In this thesis, we address the problem of automatically parallelizing and optimizing programs with nested loop structures by presenting a workflow that automatically generates optimized PREM-compliant code. To correctly model the structure of nested loop programs, we leverage existing polyhedral compilation tools that analyze the original program and generate optimized executables. Two main techniques are adopted for optimization: loop tiling and parallelization. We build a timing model to estimate the length of execution and memory phases, and then construct a Directed Acyclic Graph (DAG) of program phases to estimate the program's makespan. During this process, our framework searches for the combination of tile sizes and thread counts that minimizes the makespan; given the complexity of the optimization problem, we design a heuristic algorithm to find solutions close to the optimum. Finally, to show its usefulness, we evaluate our technique on the Gem5 architectural simulator using computational kernels from the PolyBench-NN benchmark.
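The tile-size and thread-count search described above can be sketched as follows. The timing model (a tile's memory phase proportional to its footprint, its execution phase proportional to its operation count) and the PREM scheduling rule (memory phases serialized across cores, execution phases overlapped) are simplified assumptions for illustration, not the thesis's actual model:

```python
from itertools import product

def phase_lengths(tile, dram_bw, flop_rate):
    """Toy timing model: a tile's memory phase moves tile**2 words,
    its execution phase performs tile**2 operations."""
    mem = tile * tile / dram_bw      # time to load/unload one tile
    comp = tile * tile / flop_rate   # time to compute on one tile
    return mem, comp

def makespan(n, tile, threads, dram_bw=8.0, flop_rate=4.0):
    """Estimate the makespan of a PREM-style schedule: memory phases
    never overlap (one core accesses DRAM at a time), while execution
    phases of different threads run in parallel."""
    tiles = (n // tile) ** 2                  # tiles in an n x n loop nest
    mem, comp = phase_lengths(tile, dram_bw, flop_rate)
    total_mem = tiles * mem                   # serialized memory phases
    comp_per_thread = tiles * comp / threads  # parallel execution phases
    return max(total_mem, mem + comp_per_thread)

def search(n, tile_sizes, thread_counts):
    """Exhaustive search over the (tile, threads) space; a heuristic
    would prune this space instead of enumerating it."""
    best = min(product(tile_sizes, thread_counts),
               key=lambda cfg: makespan(n, *cfg))
    return best, makespan(n, *best)
```

For a memory-bound kernel the serialized memory phases dominate, so adding threads stops helping once the total memory time becomes the bottleneck, which is exactly the trade-off the search has to navigate.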
Quality of experience and access network traffic management of HTTP adaptive video streaming
The thesis focuses on Quality of Experience (QoE) of HTTP adaptive video streaming (HAS) and traffic management in access networks to improve the QoE of HAS. First, the QoE impact of adaptation parameters and time on layer was investigated with subjective crowdsourcing studies. The results were used to compute a QoE-optimal adaptation strategy for given video and network conditions. This allows video service providers to develop and benchmark improved adaptation logics for HAS. Furthermore, the thesis investigated concepts to monitor video QoE on application and network layer, which can be used by network providers in the QoE-aware traffic management cycle. Moreover, an analytic and simulative performance evaluation of QoE-aware traffic management on a bottleneck link was conducted. Finally, the thesis investigated socially-aware traffic management for HAS via Wi-Fi offloading of mobile HAS flows. A model for the distribution of public Wi-Fi hotspots and a platform for socially-aware traffic management on private home routers was presented. A simulative performance evaluation investigated the impact of Wi-Fi offloading on the QoE and energy consumption of mobile HAS.
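The computation of a QoE-optimal adaptation strategy for given network conditions can be illustrated with a small dynamic program. The QoE function used here (a per-layer quality reward minus a penalty for switching layers) and the per-segment bandwidth constraint are illustrative assumptions, not the thesis's actual QoE model:

```python
def optimal_adaptation(bandwidth, qualities, switch_penalty=1.0):
    """Dynamic program over segments: pick a quality layer for each
    segment to maximize total reward minus switching penalties,
    choosing only layers whose bitrate fits that segment's bandwidth.
    bandwidth: available bitrate per segment.
    qualities: list of (bitrate, reward) per layer."""
    INF = float("-inf")
    n_layers = len(qualities)
    best = [0.0] * n_layers  # best QoE so far, ending in each layer
    for bw in bandwidth:
        nxt = [INF] * n_layers
        for l, (bitrate, reward) in enumerate(qualities):
            if bitrate > bw:          # layer not sustainable this segment
                continue
            for prev in range(n_layers):
                cost = switch_penalty if prev != l else 0.0
                nxt[l] = max(nxt[l], best[prev] + reward - cost)
        best = nxt
    return max(best)
```

With two layers of rewards 1.0 and 3.0 and a bandwidth drop in the last segment, the optimum plays the high layer while it can and pays one switch to downgrade, rather than staying low throughout.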
Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures
With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources such as the cache are usually shared to maximize utilization, but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches. However, microarchitectural optimizations, which were long assumed to be fundamentally secure, can be used in side-channel attacks to exploit secrets such as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes three adaptive microarchitectural resource management optimizations to improve security and/or performance of multi-core architectures, and a systematization of knowledge of timing-based side-channel attacks. We observe that to achieve high-performance cache partitioning in a multi-core system, three requirements need to be met: i) fine granularity of partitions, ii) locality-aware placement, and iii) frequent changes. These requirements lead to high overheads for current centralized partitioning solutions, especially as the number of cores in the system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The allocations occur through core-to-core challenges, where applications with larger performance benefit gain cache capacity.
The solution is implementable in hardware, due to its low computational complexity, and can scale to large core counts. According to our analysis, better performance can be achieved by coordinating multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but this is challenging due to the increased number of possible allocations that need to be evaluated. Based on these observations, we present a solution (CBP) for coordinated management of three optimizations: cache partitioning, bandwidth partitioning, and prefetching. Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space. The continuously growing number of side-channel attacks leveraging microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify four root causes of timing-based side-channel attacks: determinism, sharing, access violation, and information flow. Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection. Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities. Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance. To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness and provides quantifiable, information-theoretic security guarantees using differential privacy. It closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications.
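The core-to-core challenge idea behind the distributed allocation can be sketched as follows. The ring of challengers, the single-way stakes, and the per-core marginal-benefit estimates are illustrative assumptions, not the thesis's DELTA implementation:

```python
def challenge_round(ways, benefit):
    """One allocation round of a DELTA-style distributed partitioner
    (sketch): each core challenges its neighbour on a ring, and one
    way of cache capacity moves to the core with the larger estimated
    marginal benefit per extra way.
    ways: core -> allocated cache ways (at least 1 each).
    benefit: core -> estimated benefit of one additional way."""
    cores = list(ways)
    for i in range(len(cores)):
        a, b = cores[i], cores[(i + 1) % len(cores)]  # ring of challenges
        if benefit[a] > benefit[b] and ways[b] > 1:
            ways[a] += 1; ways[b] -= 1    # challenger wins a way
        elif benefit[b] > benefit[a] and ways[a] > 1:
            ways[b] += 1; ways[a] -= 1    # defender wins a way
    return ways
```

Because each challenge only involves two cores and moves one way, no central arbiter has to evaluate the global allocation space, which is what makes the scheme scale with core count.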
Genomic epidemiology and antimicrobial resistance of Klebsiella pneumoniae in the Comunitat Valenciana
Antimicrobial resistance (AMR) is a major threat to public health worldwide. The misuse of antibiotics has led to the emergence and spread of antibiotic-resistant infections. Klebsiella pneumoniae is among the pathogens causing the most deaths associated with bacterial AMR and one of the most concerning. In fact, in 2019 K. pneumoniae ranked as the second leading cause of deaths attributable to AMR. Among the acquired resistances in K. pneumoniae, the most worrying are strains that have developed resistance to third-generation cephalosporins (3GC) and carbapenems (CRKp). This thesis investigates the epidemiology of 3GC- and carbapenem-resistant K. pneumoniae using the genomic information collected in the Surveillance of Klebsiella pneumoniae in the Comunitat Valenciana (SKPCV) project.
Under the SKPCV project, nearly 2,200 ESBL- and/or carbapenemase-producing K. pneumoniae isolates were collected over three years (2017-2019), and their whole genomes were subsequently sequenced using second-generation (Illumina) and third-generation (Pacific Biosciences and Oxford Nanopore) technologies.
To provide context and establish a collection that would let us elucidate the relationships between the SKPCV K. pneumoniae isolates and those from Spanish hospitals and worldwide, we included isolates previously collected in some NLSAR hospitals, as well as external data from three different databases: RefSeq, GenBank, and ENA.
Using these data, we gathered more than 13,000 genomes. Working with large datasets while guaranteeing data quality can be challenging, so we created a quality-control filter with hierarchical steps that assessed taxonomic assignment and inter-species contamination, assembly quality, intra-species contamination, and finally the genomic similarity of the whole collection. Using this quality filter, we obtained a large collection of 1,604 SKPCV genomes, 395 retrospective isolates collected in three NLSAR hospitals, and more than 10,000 global genomes available in public databases.
Finally, we found that the lineage compositions of the SKPCV and NLSAR were very diverse, yet similar to those of the Spanish genomes deposited in the databases. Indeed, most NLSAR isolates were related to isolates collected in other regions of Spain, suggesting similar evolutionary histories. Our analysis revealed that a single lineage, ST307, was responsible for most 3GC- and carbapenem-resistant infections, as well as for inter-hospital transmissions. We also found that the determinants of 3GC and carbapenem resistance, together with their carrier lineages, were distributed differently across hospitals and that, except for ST307 carrying blaCTX-M-15, most lineages and combinations of AMR determinants were largely confined to a single hospital. Indeed, the hospital populations differed from one another. Our findings suggest that the burden of AMR and K. pneumoniae in this region resulted from a diversity of factors, including unique lineages that likely originated in the community or in the patients' prior microbiota, as well as a complex interplay between inter-hospital lineage transmission and the local proliferation of problematic clones within each hospital.
Our findings show that the initial emergence of carbapenem resistance and its dissemination in the university hospital group (HGUV) occurred over a short one-year period and were highly complex. We found six different lineages comprising most of the CRKp population in the HGUV, disseminating different resistance mechanisms (AmpC, OXA-48, NDM-1, and NDM-23) on different plasmid variants. These lineages underwent local clonal expansion, with several cases of likely direct transmission within the hospital.
Finally, we used genomic epidemiology to describe the emergence and spread across several hospitals of a new carbapenem-resistance gene, named blaNDM-23. We were able to elucidate the phenotypic effect and the genetic environment of the gene. The gene was carried on a multidrug-resistant plasmid with 18 additional antibiotic-resistance genes, producing a multidrug-resistance phenotype. The gene and the plasmid were found in an ST437 strain. We discovered that the plasmid was not mobilizable, so the dissemination of blaNDM-23 occurred through clonal expansion. The spread of this blaNDM-23-carrying ST437 lineage affected at least four different hospitals in the Comunitat Valenciana from 2016 until at least 2019, when our sampling concluded.
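A hierarchical quality-control filter of the kind described above can be sketched as an ordered sequence of checks, where each genome is discarded at the first step it fails. The step names, thresholds, and genome fields below are illustrative assumptions, not the thesis's actual criteria:

```python
def qc_filter(genomes, checks):
    """Apply ordered QC checks to each genome; record the first
    failing step for discarded genomes, keep the rest."""
    passed, discarded = [], {}
    for genome in genomes:
        for name, check in checks:
            if not check(genome):
                discarded[genome["id"]] = name  # first failing step
                break
        else:
            passed.append(genome)
    return passed, discarded

# Ordered checks mirroring a hierarchical pipeline: taxonomic
# assignment, contamination, then assembly quality (all thresholds
# are placeholders).
CHECKS = [
    ("taxonomy", lambda g: g["species"] == "Klebsiella pneumoniae"),
    ("contamination", lambda g: g["contam_frac"] < 0.05),
    ("assembly", lambda g: g["n50"] >= 50_000),
]
```

Ordering the checks from cheapest to most expensive lets the filter reject most low-quality genomes before the costly whole-collection similarity comparisons are ever run.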
Durability of Wireless Charging Systems Embedded Into Concrete Pavements for Electric Vehicles
Point clouds are widely used in various applications such as 3D modeling, geospatial analysis, robotics, and more. One of the key advantages of 3D point cloud data is that, unlike other data formats such as textures, it is independent of viewing angle, surface type, and parameterization. Since each point in a point cloud is independent of the others, point clouds are well suited for tasks like object recognition, scene segmentation, and reconstruction. At the same time, point clouds are complex and verbose due to the numerous attributes they contain, many of which are not always necessary for rendering, which makes retrieving and parsing them a heavy task.
As sensors become more precise and popular, effectively streaming, processing, and rendering the data also becomes more challenging. In a hierarchical continuous LOD system, previously fetched and rendered data for a region may become unavailable when revisiting it. To address this, we use a non-persistent cache based on a hash map that stores the parsed point attributes. This still has limitations: the dataset must be refetched and reprocessed if the tab or browser is closed and reopened, which can be addressed by persistent caching. On the web, persistent caching typically involves storing data in server memory or in an intermediate caching server like Redis. This is not suitable for point cloud data, where large volumes of parsed and processed point data must be stored, which has left point cloud visualization relying only on non-persistent caching.
The thesis aims to contribute toward better performance and suitability of point cloud rendering on the web by reducing the number of read requests to the remote file. We achieve this with a client-side LRU cache and Private File Open Space, combining persistent and non-persistent caching of data. We use a cloud-optimized data format, which is better suited for the web and for streaming hierarchical data structures. Our focus is to improve rendering performance using WebGPU by reducing access time and minimizing the amount of data loaded on the GPU.
Preliminary results indicate that our approach significantly improves rendering performance and reduces network requests compared to traditional caching methods using WebGPU.
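The combined caching idea can be sketched as a bounded in-memory LRU in front of a persistent key-value store (standing in for the browser's private file storage). The class name, capacity, and lookup order are illustrative assumptions, not the thesis's implementation:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Bounded non-persistent LRU backed by a persistent store: the
    LRU is lost when the 'tab' closes, the persistent store is not."""
    def __init__(self, capacity, persistent):
        self.capacity = capacity
        self.lru = OrderedDict()      # node id -> parsed point attributes
        self.persistent = persistent  # survives closing the tab

    def get(self, key, fetch):
        if key in self.lru:                  # fast path: in memory
            self.lru.move_to_end(key)
            return self.lru[key]
        if key in self.persistent:           # warm path: persistent store
            value = self.persistent[key]
        else:                                # cold path: network fetch
            value = fetch(key)
            self.persistent[key] = value
        self.lru[key] = value
        if len(self.lru) > self.capacity:    # evict least recently used
            self.lru.popitem(last=False)
        return value
```

A remote read request is issued only on the cold path, so revisiting a region after eviction (or after reopening the browser) is served from the persistent level instead of the network.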
MTrainS: Improving DLRM training efficiency using heterogeneous memories
Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, the model size and complexity grow over time, which requires additional training data to avoid overfitting. This model growth demands a large number of resources in data centers. Hence, training efficiency is becoming considerably more important to keep the data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models. We observe that the bandwidth requirement is not uniform across different tables and that embedding tables show high temporal locality. We then design MTrainS, which leverages heterogeneous memory, including byte- and block-addressable Storage Class Memory, for DLRM hierarchically. MTrainS allows for higher memory capacity per node and increases training efficiency by lowering the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of nodes for training by 4-8X, saving power and cost of training while meeting our target training performance.
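Exploiting the paper's observation that bandwidth demand is not uniform across tables, a hierarchical placement can be sketched as a greedy assignment of the hottest tables to the fastest memory tier. The tier names, capacities, and the greedy policy are illustrative assumptions, not MTrainS's actual algorithm:

```python
def place_tables(tables, tiers):
    """Sort embedding tables by bandwidth demand and greedily place
    each in the fastest memory tier with enough remaining capacity.
    tables: name -> (size_gb, bandwidth_demand).
    tiers: list of (tier_name, capacity_gb), fastest first."""
    placement = {}
    free = {name: cap for name, cap in tiers}
    order = sorted(tables, key=lambda t: tables[t][1], reverse=True)
    for t in order:                      # hottest tables first
        size = tables[t][0]
        for tier_name, _ in tiers:       # fastest tier with room wins
            if free[tier_name] >= size:
                placement[t] = tier_name
                free[tier_name] -= size
                break
    return placement
```

High-bandwidth tables land in fast memory while large, rarely touched tables overflow into block-addressable Storage Class Memory, which is what lets a single node hold a model that would otherwise force a scale-out.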
Vehicle as a Service (VaaS): Leverage Vehicles to Build Service Networks and Capabilities for Smart Cities
Smart cities demand resources for rich immersive sensing, ubiquitous communications, powerful computing, large storage, and high intelligence (SCCSI) to support various kinds of applications, such as public safety, connected and autonomous driving, smart and connected health, and smart living. At the same time, it is widely recognized that vehicles such as autonomous cars, equipped with significantly powerful SCCSI capabilities, will become ubiquitous in future smart cities. Observing the convergence of these two trends, this article advocates the use of vehicles to build a cost-effective service network, called the Vehicle as a Service (VaaS) paradigm, where vehicles empowered with SCCSI capability form a web of mobile servers and communicators to provide SCCSI services in smart cities. Towards this direction, we first examine the potential use cases in smart cities and the possible upgrades required for the transition from traditional vehicular ad hoc networks (VANETs) to VaaS. Then, we introduce the system architecture of the VaaS paradigm and discuss how it can provide SCCSI services in future smart cities. Finally, we identify the open problems of this paradigm and future research directions, including architectural design, service provisioning, incentive design, and security & privacy. We expect that this paper paves the way towards developing a cost-effective and sustainable approach for building smart cities.
Comment: 32 pages, 11 figures
Measurement of Triple-Differential Z+Jet Cross Sections with the CMS Detector at 13 TeV and Modelling of Large-Scale Distributed Computing Systems
The achievable precision in the calculation of predictions for observables measured at the LHC experiments depends on the amount of invested computing power and the precision of the input parameters that enter the calculation. Currently, no theory exists that can derive the input parameter values for perturbative calculations from first principles. Instead, they have to be derived from measurements in dedicated analyses that determine observables sensitive to the input parameters with high precision. Such an analysis is presented: it measures the production cross section of oppositely charged muon pairs with an invariant mass close to the mass of the Z boson, produced in association with jets, in a phase space divided into bins of the transverse momentum of the dimuon system and of two observables constructed from the rapidities of the dimuon system and of the jet with the highest transverse momentum. To achieve the highest statistical precision in this triple-differential measurement, the full data recorded by the CMS experiment at a center-of-mass energy of 13 TeV in the years 2016 to 2018 is combined. The measured cross sections are compared to theoretical predictions approximating full NNLO accuracy in perturbative QCD. Deviations from these predictions are observed, rendering further studies at full NNLO accuracy necessary.
To obtain the measured results, large amounts of data are processed and analysed on distributed computing infrastructures. Theoretical calculations pose similar computing demands. Consequently, substantial amounts of storage and processing resources are required by the LHC collaborations. These requirements are met in large part by the resources of the WLCG, a complex federation of globally distributed computer centres. With the upgrade of the LHC and the experiments in the HL-LHC era, the computing demands are expected to increase substantially. Therefore, the prevailing computing models need to be updated to cope with the unprecedented demands. To support the design of future adaptations of HEP workflow execution on such infrastructures, a simulation model is developed and an implementation is tested on infrastructure design candidates inspired by a proposal of the German HEP computing community. The presented study of these infrastructure candidates showcases the applicability of the simulation tool in the strategic development of a future computing infrastructure for HEP in the HL-LHC context.
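The kind of infrastructure simulation described above can be illustrated with a minimal discrete-event sketch: jobs are dispatched to the earliest-free execution slot across the simulated sites, and the makespan falls out of the event queue. The FIFO dispatch policy and the flat pool of slots are simplifying assumptions, not the thesis's simulation model:

```python
import heapq

def simulate(jobs, slots):
    """Minimal discrete-event simulation of a distributed workload:
    jobs is a list of runtimes, slots the number of concurrent
    execution slots across all simulated sites. Returns the makespan."""
    finish = [0.0] * slots  # next free time of each slot (a min-heap)
    heapq.heapify(finish)
    for runtime in jobs:    # FIFO dispatch to the earliest-free slot
        start = heapq.heappop(finish)
        heapq.heappush(finish, start + runtime)
    return max(finish)
```

Even this toy version shows the simulation workflow: vary the slot counts (an infrastructure design candidate) and compare makespans, rather than deploying hardware to find out.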