347 research outputs found
Improving Energy and Area Scalability of the Cache Hierarchy in CMPs
As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of
larger core counts. Directory-based protocols constitute the most scalable alternative.
A conventional directory, however, suffers from an inefficient use of storage and energy.
First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single
core. Second, although increasing directory size and associativity could boost system
performance by reducing the coverage misses, it would come at the expense of area and
energy consumption.
This thesis focuses on and exploits the important behavioral differences between private
and shared blocks from the directory's point of view. These differences call for separate
management of both types of blocks at the directory. First, we propose the PS-Directory,
a two-level directory cache that keeps the reduced number of frequently accessed shared
entries in a small and fast first-level cache, namely Shared Directory Cache, and uses
a larger and slower second-level Private Directory Cache to track the large amount of
private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption.
In this thesis we also show that the shared/private ratio of entries in the directory varies
across applications and across different execution phases within the applications, which
encourages us to propose Dynamic Way Partitioning (DWP) Directory. DWP-Directory
reduces the number of ways with storage for shared blocks and it allows this storage to be
powered off or on at run-time according to the dynamic requirements of the applications
following a repartitioning algorithm. Results show performance similar to that of a traditional
directory with high associativity, and area requirements similar to those of recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power
consumption savings.
This dissertation also deals with the scalability issues in terms of power found
in processor caches. A significant fraction of the total power budget is consumed by
on-chip caches which are usually deployed with a high associativity degree (even L1
caches are being implemented with eight ways) to enhance the system performance. On
a cache access, each way in the corresponding set is accessed in parallel, which is costly
in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient
cache design that reduces the number of accessed ways without hurting the performance.
The PS-Cache takes advantage of the private-shared knowledge of the referenced block
to reduce energy by accessing only those ways holding the kind of block looked up.
Results show significant dynamic power consumption savings.
Finally, we propose an energy-efficient architectural design that can be effectively applied
to any kind of set-associative cache memory, not only to processor caches. The proposed
approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target
cache set, and just a few ways are searched in the tag and data arrays. This allows the
approach to reduce the dynamic energy consumption of caches without hurting their
access time. For this purpose, the proposed architecture holds the X least significant
bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter out
the ways where the least significant bits of the tag do not match the bits in the
X-bit array. Experimental results show that this filtering mechanism achieves energy
consumption in set-associative caches similar to that of direct-mapped ones.
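The filtering step described above can be sketched in a few lines. This is only an illustration of the idea, not the thesis's implementation: the function names, the use of Python integers for tags, and `None` for invalid ways are all assumptions made for the example.

```python
def build_filter_array(set_tags, x_bits):
    """Keep only the X least significant bits of each stored tag.

    set_tags: list of integer tags for one cache set (None = invalid way).
    This models the small auxiliary X-bit-wide array (hypothetical layout).
    """
    mask = (1 << x_bits) - 1
    return [None if tag is None else tag & mask for tag in set_tags]


def tag_filter_lookup(set_tags, filter_array, lookup_tag, x_bits):
    """Return (hit_way, candidate_ways) for one lookup in one set.

    Only ways whose X-bit entry matches the low bits of the lookup tag
    are searched in the full tag array; the rest are filtered out.
    """
    mask = (1 << x_bits) - 1
    low_bits = lookup_tag & mask
    candidates = [way for way, bits in enumerate(filter_array) if bits == low_bits]
    hit_way = None
    for way in candidates:
        if set_tags[way] == lookup_tag:  # full tag comparison on survivors only
            hit_way = way
            break
    return hit_way, candidates


# Example: a 4-way set with one invalid way and X = 4 filter bits.
tags = [0x1A3, 0x2B7, None, 0x4C3]
farr = build_filter_array(tags, 4)
hit, searched = tag_filter_lookup(tags, farr, 0x4C3, 4)
```

In this example only two of the four ways are searched for the hit. A larger X filters more ways at the cost of a wider auxiliary array.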
Experimental results show that the proposals presented in this thesis offer a good tradeoff
among these three major design axes.
Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551
Development of Lifting-based VLSI Architectures for Two-Dimensional Discrete Wavelet Transform
Two-dimensional discrete wavelet transform (2-D DWT) has evolved as an essential
part of a modern compression system. It offers superior compression with good image
quality and overcomes the disadvantage of the discrete cosine transform, which suffers
from blocking artifacts that reduce the quality of the image. The amount of
computation involved in 2-D DWT is enormous and cannot be handled by general-purpose
processors when real-time processing is required. Therefore, a high-speed and
low-power VLSI architecture that computes 2-D DWT effectively is needed. In this
research, several VLSI architectures have been developed that meet real-time
requirements for 2-D DWT applications. This research initially started off by
implementing a software simulation program that decorrelates the original image and
reconstructs the original image from the decorrelated image. Then, based on the
information gained from implementing the simulation program, a new approach for
designing lifting-based VLSI architectures for 2-D forward DWT is introduced. As a
result, two high-performance VLSI architectures that perform 2-D DWT for 5/3 and
9/7 filters are developed based on overlapped and nonoverlapped scan methods. Then,
the intermediate architecture is developed, which aims at reducing the power
consumption of the overlapped areas without using the expensive line buffer. In order
to best meet real-time applications of 2-D DWT with demanding requirements in
terms of speed and throughput, parallelism is explored. The single pipelined
intermediate and overlapped architectures are extended to 2-, 3-, and 4-parallel
architectures to achieve speed factors of 2, 3, and 4, respectively. To further
demonstrate the effectiveness of the approach, single and parallel VLSI architectures
for 2-D inverse discrete wavelet transform (2-D IDWT) are developed. Furthermore,
2-D DWT memory architectures, which have been overlooked in the literature, are
also developed. Finally, to show that the architectural models developed for 2-D DWT are
simple to control, the control algorithms for the 4-parallel architecture based on the first
scan method are developed. To validate the architectures developed in this work, five
architectures are implemented and simulated on an Altera FPGA.
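For readers unfamiliar with lifting, the following is a minimal software sketch of one level of the reversible 5/3 lifting transform in one dimension, in the spirit of the simulation program mentioned above. It is not the thesis's code: the function names, the even-length restriction, and the symmetric boundary extension are assumptions chosen for the example.

```python
def fwd_53(x):
    """One level of the integer 5/3 lifting transform on an even-length list.

    Returns (s, d): low-pass (approximation) and high-pass (detail) halves.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch assumes an even-length signal"
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the floor-average of its even
    # neighbours (symmetric extension at the right edge).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update step: each even sample plus a rounded quarter of the two
    # neighbouring details (symmetric extension at the left edge).
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def inv_53(s, d):
    """Invert fwd_53 by undoing the update step, then the predict step."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


row = [10, 12, 14, 13, 9, 8, 8, 9]
low, high = fwd_53(row)
restored = inv_53(low, high)
```

Because each lifting step is undone exactly by its mirror step, reconstruction is lossless with integer arithmetic. A 2-D transform is obtained by applying the 1-D transform along the rows and then along the columns.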
In compliance with the terms of the Copyright Act 1987 and the IP Policy of the
university, the copyright of this thesis has been reassigned by the author to the legal
entity of the university,
Institute of Technology PETRONAS Sdn bhd.
Due acknowledgement shall always be made of the use of any material contained
in, or derived from, this thesis
Experimental and computational analysis of biased agonism on full-length and a C-terminally truncated adenosine A2A receptor
Biased agonism, the ability of agonists to differentially activate downstream signaling pathways by stabilizing specific receptor conformations, is a key issue for G protein-coupled receptor (GPCR) signaling. The C-terminal domain might influence this functional selectivity of GPCRs as it engages G proteins, GPCR kinases, β-arrestins, and several other proteins. Thus, the aim of this paper is to compare the agonist-dependent selectivity for intracellular pathways in a heterologous system expressing the full-length (A2AR) and a C-tail truncated (A2AΔ40R lacking the last 40 amino acids) adenosine A2A receptor, a GPCR that is already targeted in Parkinson's disease using a first-in-class drug. Experimental data such as ligand binding, cAMP production, β-arrestin recruitment, ERK1/2 phosphorylation and dynamic mass redistribution assays, which correspond to different aspects of signal transduction, were measured upon the action of structurally diverse compounds (the agonists adenosine, NECA, CGS-21680, PSB-0777 and LUF-5834 and the SCH-58261 antagonist) in cells expressing A2AR and A2AΔ40R. The results show that taking cAMP levels and the endogenous adenosine agonist as references, the main difference in bias was obtained with PSB-0777 and LUF-5834. The C-terminus is dispensable for both G-protein and β-arrestin recruitment and also for MAPK activation. Unrestrained molecular dynamics simulations, at the μs timescale, were used to understand the structural arrangements of the binding cavity, triggered by these chemically different agonists, facilitating G protein binding with different efficacy
Experimental and computational analysis of biased agonism on full-length and a C-terminally truncated adenosine A2A receptor
Funding: This work was partially supported by grants from the Spanish Ministry of Economy and Competitiveness (BFU2015-64405-R, SAF2017-84117-R, RTI2018-094204-B-I00 and PID2019-109240RB-I00; they may include FEDER funds), the Alzheimer's Association (AARFD-17-503612) and by a grant from Fundacio "la Marato" de TV3 (201413-30).
Biased agonism, the ability of agonists to differentially activate downstream signaling pathways by stabilizing specific receptor conformations, is a key issue for G protein-coupled receptor (GPCR) signaling. The C-terminal domain might influence this functional selectivity of GPCRs as it engages G proteins, GPCR kinases, β-arrestins, and several other proteins. Thus, the aim of this paper is to compare the agonist-dependent selectivity for intracellular pathways in a heterologous system expressing the full-length (A2AR) and a C-tail truncated (A2AΔ40R lacking the last 40 amino acids) adenosine A2A receptor, a GPCR that is already targeted in Parkinson's disease using a first-in-class drug. Experimental data such as ligand binding, cAMP production, β-arrestin recruitment, ERK1/2 phosphorylation and dynamic mass redistribution assays, which correspond to different aspects of signal transduction, were measured upon the action of structurally diverse compounds (the agonists adenosine, NECA, CGS-21680, PSB-0777 and LUF-5834 and the SCH-58261 antagonist) in cells expressing A2AR and A2AΔ40R. The results show that taking cAMP levels and the endogenous adenosine agonist as references, the main difference in bias was obtained with PSB-0777 and LUF-5834. The C-terminus is dispensable for both G-protein and β-arrestin recruitment and also for MAPK activation. Unrestrained molecular dynamics simulations, at the μs timescale, were used to understand the structural arrangements of the binding cavity, triggered by these chemically different agonists, facilitating G protein binding with different efficacy
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%. Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
Selective Dynamic Analysis of Virtualized Whole-System Guest Environments
Dynamic binary analysis is a prevalent and indispensable technique in program analysis. While several dynamic binary analysis tools and frameworks have been proposed, all suffer from one or more of: prohibitive performance degradation, a semantic gap between the analysis code and the execution under analysis, architecture/OS specificity, being user-mode only, and lacking flexibility and extendability. This dissertation describes the design of the Dynamic Executable Code Analysis Framework (DECAF), a virtual machine-based, multi-target, whole-system dynamic binary analysis framework. In short, DECAF seeks to address the shortcomings of existing whole-system dynamic analysis tools and extend the state of the art by utilizing a combination of novel techniques to provide rich analysis functionality without crippling amounts of execution overhead. DECAF extends the mature QEMU whole-system emulator, a type-2 hypervisor capable of emulating every instruction that executes within a complete guest system environment.
DECAF provides a novel, hardware event-based method of just-in-time virtual machine introspection (VMI) to address the semantic gap problem. It also implements a novel instruction-level taint tracking engine at bitwise level of granularity, ensuring that taint propagation is sound and highly precise throughout the guest environment. A formal analysis of the taint propagation rules is provided to verify that most instructions introduce neither false positives nor false negatives. DECAF’s design also provides a plugin architecture with a simple-to-use, event-driven programming interface that makes it both flexible and extendable for a variety of analysis tasks.
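To make "bitwise level of granularity" concrete, here is a small sketch of bit-precise taint propagation for three logic operations. These are commonly used textbook rules, shown only as an assumption-laden illustration. They are not taken from DECAF's source, and the function names are hypothetical. Each operand is a pair of integers: a value and a taint mask whose set bits mark tainted positions.

```python
def taint_and(v1, t1, v2, t2):
    """Taint mask of (v1 & v2).

    A result bit is tainted if both input bits are tainted, or if one is
    tainted and the other is 1 (an untainted 0 forces the result to 0).
    """
    return (t1 & t2) | (t1 & v2) | (t2 & v1)


def taint_or(v1, t1, v2, t2):
    """Taint mask of (v1 | v2).

    Mirror image of AND: an untainted 1 forces the result to 1, so a
    tainted bit only matters where the other input bit is 0 or tainted.
    """
    return (t1 & t2) | (t1 & ~v2) | (t2 & ~v1)


def taint_xor(v1, t1, v2, t2):
    """Taint mask of (v1 ^ v2): any tainted input bit taints the result."""
    return t1 | t2


# Example: an untainted constant 0b1100 combined with a fully tainted nibble.
and_mask = taint_and(0b1100, 0b0000, 0b1010, 0b1111)
or_mask = taint_or(0b1100, 0b0000, 0b1010, 0b1111)
xor_mask = taint_xor(0b1100, 0b0000, 0b1010, 0b1111)
```

A byte-level or word-level tracker would mark the whole result as tainted in all three cases. The bitwise rules keep untainted the bits that the attacker-controlled input cannot influence.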
The implementation of DECAF consists of 9550 lines of C++ code and 10270 lines of C code. Its performance is evaluated using SPEC CPU2006 benchmarks, which show an average overhead of 605% for system-wide tainting and 12% for VMI. Three platform-neutral DECAF plugins - Instruction Tracer, Keylogger Detector, and API Tracer - are described and evaluated in this dissertation to demonstrate the ease of use and effectiveness of DECAF in writing cross-platform and system-wide analysis tools.
This dissertation also presents the Virtual Device Fuzzer (VDF), a scalable fuzz testing framework for discovering bugs within the virtual devices implemented as part of QEMU. Such bugs could be used by malicious software executing within a guest under analysis by DECAF, so the discovery, reproduction, and diagnosis of such bugs helps to protect DECAF against attack while improving QEMU and any analysis platforms built upon QEMU. VDF uses selective instrumentation to perform targeted fuzz testing, which explores only the branches of execution belonging to virtual devices under analysis. By leveraging record and replay of memory-mapped I/O activity, VDF quickly cycles virtual devices through an arbitrarily large number of states without requiring a guest OS to be booted or present. Once a test case is discovered that triggers a bug, VDF reduces the test case to the minimum number of reads/writes required to trigger the bug and generates source code suitable for reproducing the bug during debugging and analysis.
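The test-case reduction step can be pictured with a simple greedy loop. This sketch is an assumption, not VDF's actual algorithm: it removes one recorded read/write at a time and keeps the removal whenever the bug still triggers. That yields a sequence from which no single operation can be dropped, though not necessarily the global minimum. The operation tuples and the `triggers_bug` predicate are hypothetical stand-ins for replaying memory-mapped I/O against a virtual device.

```python
def minimize(ops, triggers_bug):
    """Greedily drop operations while the bug still reproduces."""
    current = list(ops)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if triggers_bug(candidate):
            current = candidate      # operation i was not needed
        else:
            i += 1                   # operation i is required; keep it
    return current


def example_bug(seq):
    """Hypothetical bug: a write of 1 to 0x10 followed later by a read of 0x14."""
    arm = ("w", 0x10, 1)
    if arm not in seq:
        return False
    return ("r", 0x14) in seq[seq.index(arm) + 1:]


recorded = [("w", 0x00, 7), ("w", 0x10, 1), ("r", 0x04),
            ("w", 0x08, 3), ("r", 0x14), ("r", 0x00)]
reduced = minimize(recorded, example_bug)
```

Each call to the predicate corresponds to one replay of the candidate sequence, so the loop needs a number of replays that grows with the length of the recorded sequence.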
VDF is evaluated by fuzz testing eighteen QEMU virtual devices, generating 1014 crash or hang test cases that reveal bugs in six of the tested devices. Over 80% of the crashes and hangs were discovered within the first day of testing. VDF covered an average of 62.32% of virtual device branches during testing, and the average test case was minimized to a reproduction test case only 18.57% of its original size
- …