17 research outputs found
Understanding Performance Inefficiencies In Native And Managed Languages
Production software packages have become increasingly complex with millions of lines of code, sophisticated control and data flow, and references to a hierarchy of external libraries. This complexity often introduces performance inefficiencies across software stacks, making it practically impossible for users to pinpoint them manually. Performance profiling tools (a.k.a. profilers) abound in the tools community to aid software developers in understanding program behavior. Classical profiling techniques focus on identifying hotspots. The hotspot analysis is indispensable; however, it can hardly diagnose whether a resource is being used in a productive manner that contributes to the overall efficiency of a program. Consequently, a significant burden is on developers to make a judgment call on whether there is scope to optimize a hotspot. Derived metrics, e.g., cache miss ratio, offer slightly better intuition into hotspots but are still not panaceas. Hence, there is a need for profilers that investigate resource wastage instead of usage. To overcome the critical missing pieces in prior work and complement existing profilers, we propose novel fine- and coarse-grained profilers to pinpoint varieties of performance inefficiencies and provide optimization guidance for a wide range of software covering benchmarks, enterprise applications, and large-scale parallel applications running on supercomputers and data centers. Fine-grained profilers are indispensable to understand performance inefficiencies comprehensively. We propose a whole-program profiler called LoadSpy, which works on binary executables to detect and quantify wasteful memory operations in their context and scope. Our observation, which is justified by myriad case studies, is that wasteful memory operations are often an indicator of various forms of performance inefficiencies, such as suboptimal choices of algorithms or data structures, missed compiler optimizations, and developers’ inattention to performance. Guided by LoadSpy, we are able to optimize a large number of well-known benchmarks and real-world applications, yielding significant speedups. Despite deep performance insights offered by fine-grained profilers, the high overhead keeps them away from widespread adoption, particularly in production. By contrast, coarse-grained profilers introduce low overhead at the cost of poor performance insights. Hence, another research topic is how we benefit from both, that is, the combination of deep insights of fine-grained profilers and low overhead of coarse-grained ones. The first effort to do so is proposing a lightweight profiler called JXPerf. It abandons heavyweight instrumentation by combining hardware performance monitoring units and debug registers available in commodity CPUs to detect wasteful memory operations. Compared with LoadSpy, JXPerf reduces the runtime overhead from 10x to 7% on average. The lightweight nature makes it useful in production. Another effort is proposing a lightweight profiler called FVSampler, the first nonintrusive profiler to study function execution variance
Recommended from our members
Efficient tagged memory
We characterize the cache behavior of an in-memory tag table and
demonstrate that an optimized implementation can typically achieve a near-zero memory traffic overhead. Both industry and academia have repeatedly demonstrated tagged memory as a key mechanism to enable enforcement of powerful security invariants, including capabilities pointer integrity, watchpoints, and information-flow tracking. A single-bit tag shadowspace is the most commonly proposed requirement, as one bit is the minimum metadata needed to distinguish between an untyped data word and any number of new hardware-enforced types. We survey various tag shadowspace approaches and identify their common requirements and positive features of their implementations. To avoid non-standard memory widths, we identify the most practical implementation for tag storage to be an in-memory table managed next to the DRAM controller. We characterize the caching performance of such a tag table and demonstrate a DRAM traffic overhead below 5\% for the vast majority of applications. We identify spatial locality on a page scale as the primary factor that enables surprisingly high table cache-ability. We then demonstrate tag-table compression for a set of common applications. A hierarchical structure with elegantly simple optimizations reduces DRAM traffic overhead to below 1\% for most applications. These insights and optimizations pave the way for commercial applications making use of single-bit tags stored in commodity memory
Exploiting task-based programming models for resilience
Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events.
Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations.
This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems.
The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique.
The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact.
The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC.
This provides a useful and wide range of trade-offs between redundancy and error rates.
The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados.
Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia.
Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización.
Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución.
La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada
permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos.
La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa.
Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa.
Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica,
a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte.
El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un
programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos.Postprint (published version
FAT-DBT engine (framework for application-tailorcd, co-designcd dynamic binary translation enginc)
Tese de Doutoramento em Engenharia Eletrónica e de Computadores (PDEEC)Dynamic binary translation (DBT) has emerged as an execution engine that monitors,
modifies and possibly optimizes running applications for specific purposes.
DBT is deployed as an execution layer between the application binary and the operating
system or host-machine, which creates opportunities for collecting runtime
information. Initially, DBT supported binary-level compatibility, but based on the
collected runtime information, it also became popular for code instrumentation,
ISA-virtualization and dynamic-optimization purposes.
Building a DBT system brings many challenges, as it involves complex components
integration and requires deep architectural level knowledge. Moreover, DBT incurs
in significant overheads, mainly due to code decoding and translation, as well as
execution along with general functionalities emulation. While initially conceived
bearing in mind high-end architectures for performance demanding applications,
such challenges become even more evident when directing DBT to embedded systems.
The latter makes an effective deployment very challenging due to its complexity,
tight constraints on memory, and limited performance and power. Legacy
support and binary compatibility is a topic of relevant interest in such systems,
due to their broad dissemination among industrial environments and wide utilization
in sensing and monitoring processes, from yearly times, with considerable
maintenance and replacement costs.
To address such issues, this thesis intents to contribute with a solution that leverages
an optimized and accelerated dynamic binary translator targeting resourceconstrained
embedded systems while supporting legacy systems.
The developed work allows to: (1) evaluate the potential of DBT for legacy support
purposes on the resource-constrained embedded systems; (2) achieve a configurable
DBT architecture specialized for resource-constrained embedded systems;
(3) address DBT translation, execution and emulation overheads through the combination
of software and hardware; and (4) promote DBT utilization as a legacy
support tool for the industry as a end-product.A tradução binária dinâmica (TBD) emergiu como um motor de execução que
permite a modificação e possível optimização de código executável para um determinado
propósito. A TBD é integrada nos sistemas como uma camada de execução
entre o código binário executável e o sistema operativo ou a máquina hospedeira
alvo, o que origina oportunidades de recolha de informação de execução.
A criação de um sistema de TBD traz consigo diversos desafios, uma vez que envolve
a integração de componentes complexos e conhecimentos aprofundados das
arquitecturas de processadores envolvidas. Ademais, a utilização de TBD gera diversos
custos computacionais indirectos, maioritariamente devido à descodificação
e tradução de código, bem como emulação de funcionalidades em geral. Considerando
que a TBD foi inicialmente pensada para sistemas de gama alta, os
desafios mencionados tornam-se ainda mais evidentes quando a mesma é aplicada
em sistemas embebidos. Nesta área os limitados recursos de memória e os exigentes
requisitos de desempenho e consumo energético,tornam uma implementação eficiente
de TBD muito difícil de obter. Compatibilidade binária e suporte a código
de legado são tópicos de interesse em sistemas embebidos, justificado pela ampla
disseminação dos mesmos no meio industrial para tarefas de sensorização e monitorização
ao longo dos tempos, reforçado pelos custos de manutenção adjacentes
à sua utilização.
Para endereçar os desafios descritos, nesta tese propõe-se uma solução para potencializar
a tradução binária dinâmica, optimizada e com aceleração, para suporte a
código de legado em sistemas embebidos de baixa gama.
O trabalho permitiu (1) avaliar o potencial da TBD quando aplicada ao suporte
a código de legado em sistemas embebidos de baixa gama; (2) a obtenção de
uma arquitectura de TBD configurável e especializada para este tipo de sistemas;
(3) reduzir os custos computacionais associados à tradução, execução e emulação,
através do uso combinado de software e hardware; (4) e promover a utilização na
industria de TBD como uma ferramenta de suporte a código de legado.This thesis was supported by a PhD scholarship from Fundação para a Ciência e
Tecnologia, SFRH/BD/81681/201
Segurança de computadores por meio de autenticação intrínseca de hardware
Orientadores: Guido Costa Souza de Araújo, Mario Lúcio Côrtes e Diego de Freitas AranhaTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Neste trabalho apresentamos Computer Security by Hardware-Intrinsic Authentication (CSHIA), uma arquitetura de computadores segura para sistemas embarcados que tem como objetivo prover autenticidade e integridade para código e dados. Este trabalho está divido em três fases: Projeto da Arquitetura, sua Implementação, e sua Avaliação de Segurança. Durante a fase de projeto, determinamos como integridade e autenticidade seriam garantidas através do uso de Funções Fisicamente Não Clonáveis (PUFs) e propusemos um algoritmo de extração de chaves criptográficas de memórias cache de processadores. Durante a implementação, flexibilizamos o projeto da arquitetura para fornecer diferentes possibilidades de configurações sem comprometimento da segurança. Então, avaliamos seu desempenho levando em consideração o incremento em área de chip, aumento de consumo de energia e memória adicional para diferentes configurações. Por fim, analisamos a segurança de PUFs e desenvolvemos um novo ataque de canal lateral que circunvê a propriedade de unicidade de PUFs por meio de seus elementos de construçãoAbstract: This work presents Computer Security by Hardware-Intrinsic Authentication (CSHIA), a secure computer architecture for embedded systems that aims at providing authenticity and integrity for code and data. The work encompassed three phases: Design, Implementation, and Security Evaluation. In design, we laid out the basic ideas behind CSHIA, namely, how integrity and authenticity are employed through the use of Physical Unclonable Functions (PUFs), and we proposed an algorithm to extract cryptographic keys from the intrinsic memories of processors. In implementation, we made CSHIA¿s design more flexible, allowing different configurations without compromising security. Then, we evaluated CSHIA¿s performance and overheads, such as area, energy, and memory, for multiple configurations. Finally, we evaluated security of PUFs, which led us to develop a new side-channel-based attack that enabled us to circumvent PUFs¿ uniqueness property through their architectural elementsDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação2015/06829-2; 2016/25532-3147614/2014-7FAPESPCNP
Exploiting task-based programming models for resilience
Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events.
Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations.
This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems.
The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique.
The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact.
The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC.
This provides a useful and wide range of trade-offs between redundancy and error rates.
The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados.
Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia.
Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización.
Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución.
La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada
permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos.
La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa.
Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa.
Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica,
a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte.
El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un
programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos
Recommended from our members
Utilizing Runtime Information for Accurate Root Cause Identification in Performance Diagnosis
This dissertation highlights that existing performance diagnostic tools often become less effective due to their inherent inaccuracies in modern software. To overcome these inaccuracies and effectively identify the root causes of performance issues, it is necessary to incorporate supplementary runtime information into these tools. Within this context, the dissertation integrates specific runtime information into two typical performance diagnostic tools: profilers and causal tracing tools.
The integration yields a substantial enhancement in the effectiveness of performance diagnosis. Among these tools, gprof stands out as a representative profiler for performance diagnosis. Nonetheless, its effectiveness diminishes as the time cost calculated based on CPU sampling fails to accurately and adequately pinpoint the root causes of performance issues in complex software. To tackle this challenge, the dissertation introduces an innovative methodology called value-assisted cost profiling (vProf). This approach incorporates variable values observed during runtime into the profiling process.
By continuously sampling variable values from both normal and problematic executions, vProf refines function cost estimates, identifies anomalies in value distributions, and highlights potentially problematic code areas that could be the actual sources of performance is- sues. The effectiveness of vProf is validated through the diagnosis of 18 real-world performance is- sues in four widely-used applications. Remarkably, vProf outperforms other state-of-the-art tools, successfully diagnosing all issues, including three that had remained unresolved for over four years.
Causal tracing tools reveal the root causes of performance issues in complex software by generating tracing graphs. However, these graphs often suffer from inherent inaccuracies, characterized by superfluous (over-connected) and missed (under-connected) edges. These inaccuracies arise from the diversity of programming paradigms. To mitigate the inaccuracies, the dissertation proposes an approach to derive strong and weak edges in tracing graphs based on the vertices’ semantics collected during runtime. By leveraging these edge types, a beam-search-based diagnostic algorithm is employed to identify the most probable causal paths. Causal paths from normal and buggy executions are differentiated to provide key insights into the root causes of performance issues. To validate this approach, a causal tracing tool named Argus is developed and tested across multiple versions of macOS. It is evaluated on 12 well-known spinning pinwheel issues in popular macOS applications. Notably, Argus successfully diagnoses the root causes of all identified issues, including 10 issues that had remained unresolved for several years.
The results from both tools exemplify a substantial enhancement of performance diagnostic tools achieved by harnessing runtime information. The integration can effectively mitigate inherent inaccuracies, lend support to inaccuracy-tolerant diagnostic algorithms, and provide key insights to pinpoint the root causes
Lightweight Protocols and Applications for Memory-Based Intrinsic Physically Unclonable Functions on Commercial Off-The-Shelve Devices
We are currently living in the era in which through the ever-increasing dissemination of inter-connected embedded devices, the Internet-of-Things manifests. Although such end-point devices are commonly labeled as ``smart gadgets'' and hence they suggest to implement some sort of intelligence, from a cyber-security point of view, more then often the opposite holds. The market force in the branch of commercial embedded devices leads to minimizing production costs and time-to-market. This widespread trend has a direct, disastrous impact on the security properties of such devices. The majority of currently used devices or those that will be produced in the future do not implement any or insufficient security mechanisms. Foremost the lack of secure hardware components often mitigates the application of secure protocols and applications.
This work is dedicated to a fundamental solution statement, which allows to retroactively secure commercial off-the-shelf devices, which otherwise are exposed to various attacks due to the lack of secure hardware components. In particular, we leverage the concept of Physically Unclonable Functions (PUFs), to create hardware-based security anchors in standard hardware components. For this purpose, we exploit manufacturing variations in Static Random-Access Memory (SRAM) and Dynamic Random-Access Memory modules to extract intrinsic memory-based PUF instances and building on that, to develop secure and lightweight protocols and applications.
For this purpose, we empirically evaluate selected and representative device types towards their PUF characteristics. In a further step, we use those device types, which qualify due to the existence of desired PUF instances for subsequent development of security applications and protocols. Subsequently, we present various software-based security solutions which are specially tailored towards to the characteristic properties of embedded devices. More precisely, the proposed solutions comprise a secure boot architecture as well as an approach to protect the integrity of the firmware by binding it to the underlying hardware. Furthermore, we present a lightweight authentication protocol which leverages a novel DRAM-based PUF type. Finally, we propose a protocol, which allows to securely verify the software state of remote embedded devices
Uniparallel Execution and its Uses.
We introduce uniparallelism: a new style of execution that allows
multithreaded applications to benefit from the simplicity of
uniprocessor execution while scaling performance with increasing
processors.
A uniparallel execution consists of a thread-parallel execution, where
each thread runs on its own processor, and an epoch-parallel
execution, where multiple time intervals (epochs) of the program run
concurrently. The epoch-parallel execution runs all threads of a
given epoch on a single processor; this enables the use of techniques
that are effective on a uniprocessor. To scale performance with
increasing cores, a thread-parallel execution runs ahead of the
epoch-parallel execution and generates speculative checkpoints from
which to start future epochs. If these checkpoints match the program
state produced by the epoch-parallel execution at the end of each
epoch, the speculation is committed and output externalized; if they
mismatch, recovery can be safely initiated as no speculative state has
been externalized.
We use uniparallelism to build two novel systems: DoublePlay and
Frost. DoublePlay benefits from the efficiency of logging the
epoch-parallel execution (as threads in an epoch are constrained to a
single processor, only infrequent thread context-switches need to be
logged to recreate the order of shared-memory accesses), allowing it
to outperform all prior systems that guarantee deterministic replay on
commodity multiprocessors.
While traditional methods detect data races by analyzing the events
executed by a program, Frost introduces a new, substantially faster
method called outcome-based race detection to detect the effects of a
data race by comparing the program state of replicas for divergences.
Unlike DoublePlay, which runs a single epoch-parallel execution of the
program, Frost runs multiple epoch-parallel replicas with
complementary schedules, which are a set of thread schedules crafted
to ensure that replicas diverge only if a data race occurs and to make
it very likely that harmful data races cause divergences. Frost
detects divergences by comparing the outputs and memory states of
replicas at the end of each epoch. Upon detecting a divergence, Frost
analyzes the replica outcomes to diagnose the data race bug and
selects an appropriate recovery strategy that masks the failure.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/89677/1/kaushikv_1.pd
Developing an Effective Detection Framework for Targeted Ransomware Attacks in Brownfield Industrial Internet of Things
The Industrial Internet of Things (IIoT) is being interconnected with many critical industrial activities, creating major cyber security concerns. The key concern is with edge systems of Brownfield IIoT, where new devices and technologies are deployed to interoperate with legacy industrial control systems and leverage the benefits of IoT. These edge devices, such as edge gateways, have opened the way to advanced attacks such as targeted ransomware. Various pre-existing security solutions can detect and mitigate such attacks but are often ineffective due to the heterogeneous nature of the IIoT devices and protocols and their interoperability demands. Consequently, developing new detection solutions is essential. The key challenges in developing detection solutions for targeted ransomware attacks in IIoT systems include 1) understanding attacks and their behaviour, 2) designing accurate IIoT system models to test attacks, 3) obtaining realistic data representing IIoT systems' activities and connectivities, and 4) identifying attacks.
This thesis provides important contributions to the research focusing on investigating targeted ransomware attacks against IIoT edge systems and developing a new detection framework. The first contribution is developing the world's first example of ransomware, specifically targeting IIoT edge gateways. The experiments' results demonstrate that such an attack is now possible on edge gateways. Also, the kernel-related activity parameters appear to be significant indicators of the crypto-ransomware attacks' behaviour, much more so than for similar attacks in workstations. The second contribution is developing a new holistic end-to-end IIoT security testbed (i.e., Brown-IIoTbed) that can be easily reproduced and reconfigured to support new processes and security scenarios. The results prove that Brown-IIoTbed operates efficiently in terms of its functions and security testing.
The third contribution is generating a first-of-its-kind dataset tailored for IIoT systems covering targeted ransomware attacks and their activities, called X-IIoTID. The dataset includes connectivity- and device-agnostic features collected from various data sources. The final contribution is developing a new asynchronous peer-to-peer federated deep learning framework tailored for IIoT edge gateways for detecting targeted ransomware attacks. The framework's effectiveness has been evaluated against pre-existing datasets and the newly developed X-IIoTID dataset