153 research outputs found

    A Survey of Performance Optimization for Mobile Applications

    Get PDF
    Nowadays there is a mobile application for almost everything a user may think of, ranging from paying bills and gathering information to playing games and watching movies. In order to ensure user satisfaction and success of applications, it is important to provide high performant applications. This is particularly important for resource constraint systems such as mobile devices. Thereby, non-functional performance characteristics, such as energy and memory consumption, play an important role for user satisfaction. This paper provides a comprehensive survey of non-functional performance optimization for Android applications. We collected 155 unique publications, published between 2008 and 2020, that focus on the optimization of non-functional performance of mobile applications. We target our search at four performance characteristics, in particular: responsiveness, launch time, memory and energy consumption. For each performance characteristic, we categorize optimization approaches based on the method used in the corresponding publications. Furthermore, we identify research gaps in the literature for future work

    An Android application for the trustees of a distributed, end-to-end verifiable, internet voting system

    Get PDF
    Τα συστήματα ηλεκτρονικής ψηφοφορίας αποτελούν μια ισχυρή τεχνολογία η οποία βελτιώνει την εκλογική διαδικασία αυξάνοντας σημαντικά την ταχύτητα καταμέτρησης ψήφων και μειώνοντας το κόστος διεξαγωγής εκλογών. Tα διαδικτυακά συστήματα ηλεκτρονικής ψηφοφορίας επιτρέπουν στους ψηφοφόρους να ψηφίζουν γρήγορα και απομακρυσμένα, προωθώντας δυνητικά τη δημοκρατία μέσω της παροχής αρκετά βελτιωμένης προσβασιμότητας για άτομα με σωματικά εμπόδια, αυξάνοντας, επομένως, τη συνολική συμμετοχή των ψηφοφόρων. Επιπροσθέτως, υπάρχουν συστήματα ηλεκτρονικής ψηφοφορίας τα οποία επιτρέπουν στους ψηφοφόρους να επαληθεύσουν άμεσα ολόκληρη την εκλογική διαδικασία, επιβεβαιώνοντας, έτσι, ότι καμία οντότητα δεν έχει αλλοιώσει το εκλογικό αποτέλεσμα με κάποιο τρόπο. Η παρούσα πτυχιακή εργασία περιγράφει το σχεδιασμό και την υλοποίηση μιας εφαρμογής για το λειτουργικό σύστημα Android η οποία αυτοματοποιεί τη διαδικασία που οι trustees ενός καινούριου, κατανεμημένου, από-άκρο-σε-άκρο επαληθεύσιμου, ηλεκτρονικού συστήματος ψηφοφορίας πρέπει να ακολουθήσουν. Ο ευρύτερος στόχος είναι η σκιαγράφηση της αρχιτεκτονικής και η παροχή μιας αναφορικής υλοποίησης για μια αποδοτική, σταθερή και ασφαλή εφαρμογή Android ηλεκτρονικής ψηφοφορίας, όχι μόνο για ένα trustee, αλλά πιθανώς για οποιαδήποτε εκλογική οντότητα, δείχνοντας ότι οι σημερινές συνηθισμένες κινητές συσκευές είναι αρκετά ισχυρές για να χρησιμοποιηθούν αποδοτικά για τέτοιες εργασίες.E-voting systems comprise a powerful technology that improves the election procedure by significantly increasing tallying speed and reducing election costs. Internet voting systems allow voters to cast their votes quickly and remotely, potentially promoting democracy by providing considerably improved accessibility for individuals that are physically hindered, thus increasing overall voter participation. On top of that, there exist internet voting systems that allow voters to directly verify the entire election procedure, thereby confirming that absolutely no entities have altered the election result in some way. The present thesis describes the design and implementation of an application for the Android operating system that automates the procedure which the trustees of a new, distributed, end-to-end verifiable, internet voting system have to follow. The broader goal is to outline the architecture and provide a reference implementation for an efficient, robust and secure e-voting Android application, not just for a trustee, but possibly for any voting entity in general, showing that today’s ordinary mobile devices are powerful enough to effectively be used for such tasks

    Inviwo -- A Visualization System with Usage Abstraction Levels

    Full text link
    The complexity of today's visualization applications demands specific visualization systems tailored for the development of these applications. Frequently, such systems utilize levels of abstraction to improve the application development process, for instance by providing a data flow network editor. Unfortunately, these abstractions result in several issues, which need to be circumvented through an abstraction-centered system design. Often, a high level of abstraction hides low level details, which makes it difficult to directly access the underlying computing platform, which would be important to achieve an optimal performance. Therefore, we propose a layer structure developed for modern and sustainable visualization systems allowing developers to interact with all contained abstraction levels. We refer to this interaction capabilities as usage abstraction levels, since we target application developers with various levels of experience. We formulate the requirements for such a system, derive the desired architecture, and present how the concepts have been exemplary realized within the Inviwo visualization system. Furthermore, we address several specific challenges that arise during the realization of such a layered architecture, such as communication between different computing platforms, performance centered encapsulation, as well as layer-independent development by supporting cross layer documentation and debugging capabilities

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP.Postprint (published version

    Service Abstractions for Scalable Deep Learning Inference at the Edge

    Get PDF
    Deep learning driven intelligent edge has already become a reality, where millions of mobile, wearable, and IoT devices analyze real-time data and transform those into actionable insights on-device. Typical approaches for optimizing deep learning inference mostly focus on accelerating the execution of individual inference tasks, without considering the contextual correlation unique to edge environments and the statistical nature of learning-based computation. Specifically, they treat inference workloads as individual black boxes and apply canonical system optimization techniques, developed over the last few decades, to handle them as yet another type of computation-intensive applications. As a result, deep learning inference on edge devices still face the ever increasing challenges of customization to edge device heterogeneity, fuzzy computation redundancy between inference tasks, and end-to-end deployment at scale. In this thesis, we propose the first framework that automates and scales the end-to-end process of deploying efficient deep learning inference from the cloud to heterogeneous edge devices. The framework consists of a series of service abstractions that handle DNN model tailoring, model indexing and query, and computation reuse for runtime inference respectively. Together, these services bridge the gap between deep learning training and inference, eliminate computation redundancy during inference execution, and further lower the barrier for deep learning algorithm and system co-optimization. To build efficient and scalable services, we take a unique algorithmic approach of harnessing the semantic correlation between the learning-based computation. Rather than viewing individual tasks as isolated black boxes, we optimize them collectively in a white box approach, proposing primitives to formulate the semantics of the deep learning workloads, algorithms to assess their hidden correlation (in terms of the input data, the neural network models, and the deployment trials) and merge common processing steps to minimize redundancy

    Power Consumption Analysis, Measurement, Management, and Issues:A State-of-the-Art Review of Smartphone Battery and Energy Usage

    Get PDF
    The advancement and popularity of smartphones have made it an essential and all-purpose device. But lack of advancement in battery technology has held back its optimum potential. Therefore, considering its scarcity, optimal use and efficient management of energy are crucial in a smartphone. For that, a fair understanding of a smartphone's energy consumption factors is necessary for both users and device manufacturers, along with other stakeholders in the smartphone ecosystem. It is important to assess how much of the device's energy is consumed by which components and under what circumstances. This paper provides a generalized, but detailed analysis of the power consumption causes (internal and external) of a smartphone and also offers suggestive measures to minimize the consumption for each factor. The main contribution of this paper is four comprehensive literature reviews on: 1) smartphone's power consumption assessment and estimation (including power consumption analysis and modelling); 2) power consumption management for smartphones (including energy-saving methods and techniques); 3) state-of-the-art of the research and commercial developments of smartphone batteries (including alternative power sources); and 4) mitigating the hazardous issues of smartphones' batteries (with a details explanation of the issues). The research works are further subcategorized based on different research and solution approaches. A good number of recent empirical research works are considered for this comprehensive review, and each of them is succinctly analysed and discussed
    corecore