581 research outputs found

    Performance and power optimizations in chip multiprocessors for throughput-aware computation

    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly-parallel applications and operating systems capable of scheduling them correctly. At hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace ---phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty to lower operating voltages sufficiently every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. 
The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach, referred to as the "processor-in-regfile" (PIR) strategy, overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly-parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such a case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly hard to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency.
The excessive power consumption of current processors has slowed the growth of their operating frequency, giving rise to the era of multi-core, multi-threaded processors. 
For example, the IBM POWER7 processor, released in 2010, incorporates eight cores on the same chip, with four execution threads per core. This gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of cores and execution threads to increase throughput. But this forces programmers to create highly parallel applications and operating systems capable of scheduling their execution correctly. At the hardware level, the growing number of cores and execution threads puts pressure on the memory interface, since memory bandwidth grows at a slower pace. In addition to memory bandwidth problems, chip energy consumption rises because manufacturers have difficulty lowering operating voltages sufficiently between processor generations. This thesis presents innovations to improve bandwidth and energy consumption in multi-core processors for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a heuristic for scheduling the execution of parallel applications that aims to improve power and performance efficiency. In contrast to state-of-the-art LLC designs, our organization avoids data duplication and therefore does not require coherence techniques. The memory address space is statically distributed across the LLC with fine-grained interleaving. The absence of data replication increases the effective capacity of the cache, which translates into better hit rates and higher bandwidth compared with a coherent LLC. 
We use double buffering to hide the additional latency needed to access remote data. The proposed vector register file is composed of thousands of registers and is organized as an aggregation of banks. We attach to each bank a small special-purpose computation unit ("local computation element" or LCE). This approach, which we call "computation in the register file", overcomes the limited number of ports in the register file. Because each LCE is a computation unit with SIMD ("single instruction, multiple data") support and all of them can proceed concurrently, the "computation in the register file" strategy constitutes a highly parallel SIMD device. Finally, we present a heuristic for scheduling the execution of parallel applications that aims to reduce chip energy consumption by dynamically placing software-level execution threads across hardware-level execution threads. The heuristic gathers, at runtime, power consumption and performance information from the chip to infer the characteristics of the applications. For example, if the software-level execution threads share data significantly, the heuristic may decide to place them on fewer cores to favor data exchange among them. In that case, the unused cores can be switched off to save energy. It is increasingly difficult to find "bulletproof" architectural solutions to the scalability limitations of current processors. Consequently, we believe that architects should attack these problems from different flanks simultaneously, with complementary innovations.
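
The static, fine-grained interleaving of the address space across the LLC that this abstract describes can be illustrated with a minimal sketch. The abstract gives no concrete parameters, so the function name `llc_bank`, the 64-byte line size, the 8-bank count, and the modulo mapping are all assumptions chosen for illustration, not the thesis's actual design.

```python
def llc_bank(address: int, line_bytes: int = 64, num_banks: int = 8) -> int:
    """Return the LLC bank that statically owns the cache line holding `address`.

    Hypothetical mapping: line size and bank count are illustrative assumptions.
    """
    line_index = address // line_bytes  # which cache line the byte falls in
    return line_index % num_banks       # consecutive lines rotate across banks


if __name__ == "__main__":
    # Neighbouring cache lines land in different banks, so a streaming access
    # pattern spreads its requests over all banks instead of a single one.
    for addr in (0x0000, 0x0040, 0x0080, 0x0200):
        print(hex(addr), "-> bank", llc_bank(addr))
```

Because every line has exactly one home bank under such a mapping, no copy of it exists elsewhere in the LLC, which is why no coherence among LLC banks is needed.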

    Analysis of opportunities for cache coherence in heterogeneous embedded systems

    [ES, translated] In the context of heterogeneous embedded systems, new needs and challenges arise. This work focuses on coherence in these systems in order to analyze the possibility of applying techniques that better fit those needs. Before the analysis, it presents what the coherence problem consists of and which solutions are currently proposed for it.
[EN] New challenges arise in the context of embedded heterogeneous systems. This work is focused on the coherence of those systems in order to analyze the possibility of applying techniques that best cope with such challenges. Prior to that, we will offer an explanation of what the coherency problem is and what the currently proposed solutions to that problem are.
Esteve García, A. (2012). Analysis of opportunities for cache coherence in heterogeneous embedded systems. http://hdl.handle.net/10251/29846

    Edge Computing Platforms and Protocols

    Cloud computing has created a radical shift in expanding the reach of application usage and has emerged as a de facto method of providing low-cost and highly scalable computing services to its users. Existing cloud infrastructure is a composition of large-scale networks of datacenters spread across the globe. These datacenters are carefully installed in isolated locations and are heavily managed by cloud providers to ensure reliable performance for their users. In recent years, novel applications, such as the Internet of Things, augmented reality, and autonomous vehicles, have proliferated on the Internet. The majority of such applications are time-critical and enforce strict computational delay requirements for acceptable performance. Traditional cloud offloading techniques are inefficient for handling such applications because of the additional network delay incurred while uploading prerequisite data to distant datacenters. Furthermore, as computations involving such applications often rely on sensor data from multiple sources, simultaneous data upload to the cloud also results in significant congestion in the network. Edge computing is a new cloud paradigm which aims to bring existing cloud services and utilities near end users. Also termed edge clouds, the central objective behind this upcoming cloud platform is to reduce the network load on the cloud by utilizing compute resources in the vicinity of users and IoT sensors. Dense geographical deployment of edge clouds in an area not only allows for optimal operation of delay-sensitive applications but also provides support for mobility, context awareness and data aggregation in computations. However, the added functionality of edge clouds comes at the cost of incompatibility with existing cloud infrastructure. 
For example, while data center servers are closely monitored by the cloud providers to ensure reliability and security, edge servers aim to operate in unmanaged publicly-shared environments. Moreover, several edge cloud approaches aim to incorporate crowdsourced compute resources, such as smartphones, desktops, and tablets, near the location of end users to support stringent latency demands. The resulting infrastructure is an amalgamation of heterogeneous, resource-constrained and unreliable compute-capable devices that aims to replicate cloud-like performance. This thesis provides a comprehensive collection of novel protocols and platforms for integrating edge computing in the existing cloud infrastructure. At its foundation lies an all-inclusive edge cloud architecture which allows for unification of several co-existing edge cloud approaches in a single logically classified platform. This thesis further addresses several open problems for three core categories of edge computing: hardware, infrastructure and platform. For hardware, this thesis contributes a deployment framework which enables interested cloud providers to effectively identify optimal locations for deploying edge servers in any geographical region. For infrastructure, the thesis proposes several protocols and techniques for efficient task allocation, data management and network utilization in edge clouds with the end-objective of maximizing the operability of the platform as a whole. Finally, the thesis presents a virtualization-dependent platform for application owners to transparently utilize the underlying distributed infrastructure of edge clouds, in conjunction with other co-existing cloud environments, without much management overhead.
Cloud computing has brought about a major change in the reach of applications and has thereby become an almost default way of delivering low-cost and scalable computing services to users. 
Existing cloud infrastructure is a collection of large-scale datacenters around the world. The datacenters are located in carefully chosen geographical locations from which cloud operators can guarantee good performance to their users. New application areas that have emerged in recent years, such as the Internet of Things (IoT), augmented reality (AR), and self-driving cars, have become widespread on the Internet. Most of these application areas are time-critical and impose a strict delay margin on computation, which must be met for the application to perform acceptably. Traditional methods of delegating computation to cloud services are unsuitable for time-critical applications, because the network delay caused by transferring the data associated with the computation is too large. Many of these new application areas use sensor data collected from several different sources. Simultaneous data connections, in turn, cause significant congestion in the network. Edge computing is a new cloud paradigm that aims to bring existing services and resources closer to the end user. The central goal of this paradigm, also known as the edge cloud, is to reduce network traffic to the cloud by performing the computation an application requires on resources located closer to the end user. Dense geographical placement of edge clouds not only helps minimize data transfer delay for time-critical applications but also supports application mobility, context awareness, and data aggregation for computation. However, these new capabilities offered by the edge cloud are not compatible with existing cloud infrastructures. Datacenters operate in closely monitored environments to guarantee service, whereas edge clouds operate in unmanaged, public environments. 
Several proposals for implementing the edge cloud also exploit the potential compute capacity of user devices, which today is abundant in, among others, smartphones, laptops, and tablets. The edge cloud infrastructure is thus a challenging combination of heterogeneous, resource-constrained, and unreliable yet compute-capable devices that together strive to perform cloud computing. This dissertation offers a collection of novel protocols and platforms for integrating edge computing into the existing cloud infrastructure. Its foundation is a comprehensive edge cloud architecture that seeks to unify several parallel architecture proposals into a single logical cloud platform. The dissertation also addresses several open problems in three areas of edge computing: resources, infrastructure, and service platform. Regarding resources, the dissertation offers a deployment framework with which service providers can efficiently determine optimal geographical locations for edge servers. Regarding infrastructure, the dissertation presents several protocols and techniques for efficient task allocation, data management, and network utilization in the edge cloud, whose common goal is to maximize the operability of the platform as a whole. Finally, the dissertation describes a virtualization-based platform with which a user can transparently make use of the surrounding edge cloud alongside traditional cloud infrastructures without a large management burden.

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device size causes more frequent hard faults, which become a serious burden to processor design and yield management. This problem is particularly pronounced in the on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error-correcting codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. Our design framework provides important architectural evaluation metrics such as yield, area, and performance based on low-level defect and process-variation parameters. Processor architects can quickly evaluate their designs' characteristics in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including L1 cache, L2 cache and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for excess caches. Evaluation results show that excess caches significantly improve overall performance of CMPs.

    Virtualization services: scalable methods for virtualizing multicore systems

    Multi-core technology is bringing parallel processing capabilities from servers to laptops and even handheld devices. At the same time, platform support for system virtualization is making it easier to consolidate server and client resources, when and as needed by applications. This consolidation is achieved by dynamically mapping the virtual machines on which applications run to underlying physical machines and their processing cores. Low-cost processor and I/O virtualization methods efficiently scaled to different numbers of processing cores and I/O devices are key enablers of such consolidation. This dissertation develops and evaluates new methods for scaling virtualization functionality to multi-core and future many-core systems. Specifically, it re-architects virtualization functionality to improve scalability and better exploit multi-core system resources. Results from this work include a self-virtualized I/O abstraction, which virtualizes I/O so as to flexibly use different platforms' processing and I/O resources. Flexibility affords improved performance and resource usage and, most importantly, better scalability than that offered by current I/O virtualization solutions. Further, by describing system virtualization as a service provided to virtual machines and the underlying computing platform, this service can be enhanced to provide new and innovative functionality. For example, a virtual device may provide obfuscated data to guest operating systems to maintain data privacy; it could mask differences in device APIs or properties to deal with heterogeneous underlying resources; or it could control access to data based on the "trust" properties of the guest VM. This thesis demonstrates that extended virtualization services are superior to existing operating system or user-level implementations of such functionality, for multiple reasons. 
First, this solution technique makes more efficient use of the key performance-limiting resources in multi-core systems, namely memory and I/O bandwidth. Second, this solution technique better exploits the parallelism inherent in multi-core architectures and exhibits good scalability properties, in part because at the hypervisor level there is greater control over precisely which resources are used, and how, to realize extended virtualization services. Improved control over resource usage makes it possible to provide value-added functionalities for both guest VMs and the platform. Specific instances of virtualization services described in this thesis are the network virtualization service that exploits heterogeneous processing cores, a storage virtualization service that provides location-transparent access to block devices by extending the functionality provided by the network virtualization service, a multimedia virtualization service that allows efficient media device sharing based on semantic information, and an object-based storage service with enhanced access control.
Ph.D.
Committee Chair: Schwan, Karsten; Committee Member: Ahamad, Mustaq; Committee Member: Fujimoto, Richard; Committee Member: Gavrilovska, Ada; Committee Member: Owen, Henry; Committee Member: Xenidis, Jim

    Hardware-conscious query processing for the many-core era

    Making optimal use of modern hardware to accelerate database queries is no trivial task. Many DBMS and DSMS of the past decades are based on assumptions that hardly hold today. One example is today's server systems, whose main memory can reach several terabytes and which have thus paved the way for main-memory databases. One of the major recent hardware trends is toward processors with a large number of cores, the so-called many-core CPUs. These allow programs a high degree of parallelism through multithreading and vectorization (SIMD), which, however, significantly increases the demands on memory bandwidth. So-called high-bandwidth memory (HBM) attempts to close this gap but, like many-core CPUs, can negate any performance advantage if used carelessly. This thesis presents the many-core CPU architecture together with HBM for accelerating database and data stream queries. It shows that a hardware-conscious cost model combined with a calibration approach can reliably predict the performance of various query operators. This enables both an adaptive partitioning and merge strategy for parallelizing data stream queries and an ideal configuration of join operations in a DBMS. Nevertheless, not every operation and application is suited to a many-core CPU and HBM. Data stream queries are often also bound to low latency and fast response times, which hardly benefit from higher memory bandwidth. In addition, the high core count of these CPUs usually comes with lower clock rates, as well as drawbacks for shared data structures, such as maintaining cache coherence and synchronizing parallel thread accesses. 
From the results of this thesis, one can derive which parallel data structures are particularly suited to HBM. Furthermore, various techniques for parallelizing and synchronizing data structures are presented, and their efficiency is demonstrated using a multiway data stream join.
Exploiting the opportunities given by modern hardware for accelerating query processing speed is no trivial task. Many DBMS and also DSMS from past decades are based on fundamentals that have changed over time, e.g., servers of today with terabytes of main memory capacity allow complete avoidance of spilling data to disk, which has prepared the ground some time ago for main memory databases. One of the recent trends in hardware is many-core processors with hundreds of logical cores on a single CPU, providing an intense degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has led to the further development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs as well as HBM have many pitfalls that can nullify any performance gain with ease. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations. Based on that information, we can arrive at an adaptive partitioning and merging strategy for stream query parallelization and find an ideal configuration of parameters for one of the most common tasks in the history of DBMS, join processing. However, not all operations and applications can exploit a many-core processor or HBM. 
Stream queries optimized for low latency and quick individual responses usually do not benefit well from more bandwidth and suffer from penalties like low clock frequencies of many-core CPUs as well. Shared data structures between cores also lead to problems with cache coherence as well as high contention. Based on our insights, we give a rule of thumb which data structures are suitable to parallelize with focus on HBM usage. In addition, different parallelization schemas and synchronization techniques are evaluated, based on the example of a multiway stream join operation

    Towards Scalable Synchronization on Multi-Cores

    The shift of commodity hardware from single- to multi-core processors in the early 2000s compelled software developers to take advantage of the available parallelism of multi-cores. Unfortunately, only few---so-called embarrassingly parallel---applications can leverage this available parallelism in a straightforward manner. The remaining---non-embarrassingly parallel---applications require that their processes coordinate their possibly interleaved executions to ensure overall correctness---they require synchronization. Synchronization is achieved by constraining or even prohibiting parallel execution. Thus, per Amdahl's law, synchronization limits software scalability. In this dissertation, we explore how to minimize the effects of synchronization on software scalability. We show that scalability of synchronization is mainly a property of the underlying hardware. This means that synchronization directly hampers the cross-platform performance portability of concurrent software. Nevertheless, we can achieve portability without sacrificing performance, by creating design patterns and abstractions, which implicitly leverage hardware details without exposing them to software developers. We first perform an exhaustive analysis of the performance behavior of synchronization on several modern platforms. This analysis clearly shows that the performance and scalability of synchronization are highly dependent on the characteristics of the underlying platform. We then focus on lock-based synchronization and analyze the energy/performance trade-offs of various waiting techniques. We show that the performance and the energy efficiency of locks go hand in hand on modern x86 multi-cores. This correlation is again due to the characteristics of the hardware that does not provide practical tools for reducing the power consumption of locks without sacrificing throughput. 
We then propose two approaches for developing portable and scalable concurrent software, hence hiding the limitations that the underlying multi-cores impose. First, we introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. We illustrate the power of our OPTIK pattern by devising five new algorithms and by optimizing four state-of-the-art algorithms for linked lists, skip lists, hash tables, and queues. Second, we introduce MCTOP, a multi-core topology abstraction which includes low-level information, such as memory bandwidths. MCTOP enables developers to accurately and portably define high-level optimization policies. We illustrate several such policies through four examples, including automated backoff schemes for locks, and illustrate the performance and portability of these policies on five platforms
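
The abstract's appeal to Amdahl's law can be made concrete with the standard formula. The symbols $p$ (the fraction of execution time that can run in parallel) and $n$ (the number of cores) are notation introduced here for illustration; the abstract itself does not define them.

```latex
% Amdahl's law: speedup on n cores when a fraction p of the work is parallel
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For instance, if synchronization serializes 5% of the execution ($p = 0.95$), the speedup can never exceed $1 / 0.05 = 20$, no matter how many cores are added.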

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Virtualization techniques for memory resource exploitation

    Cloud infrastructures have become indispensable in our daily lives with the rise of cloud-based services offered by companies like Facebook, Google, Amazon and many others. These cloud infrastructures use a large number of servers provisioned with their own computing resources. Each of these servers uses a piece of software, called the Hypervisor ("HV"), that allows it to create multiple virtual instances of the server's physical computing resources and abstract them into "Virtual Machines" (VMs). A VM runs an Operating System, which in turn runs the applications. The VMs within the servers generate varying memory demand behavior. When the demand increases, costly operations such as (virtual) disk accesses and/or VM migrations can occur. As a result, it is necessary to optimize the utilization of the local memory resources within a single computing server. However, pressure on the memory resources can still increase, making it necessary to migrate the VM to a different server with larger memory or add more memory to the same server. At this point, it is important to consider that some of the servers in the cloud infrastructure might have memory resources that they are not using. Considering the possibility to make memory available to the server, new architectures have been introduced that provide hardware support to enable servers to share their memory capacity. This thesis presents multiple contributions to the memory management problem. First, it addresses the problem of optimizing memory resources in a virtualized server through different types of memory abstractions. Two full contributions are presented for managing memory within a single server called SmarTmem and CARLEMM. In this respect, a third contribution is also presented, called CAVMem, that works as the foundation for CARLEMM. 
Second, this thesis presents two contributions for memory capacity aggregation across multiple servers, offering two mechanisms called GV-Tmem and vMCA, the latter being based on GV-Tmem but with significant enhancements. These mechanisms distribute the server's total memory within a single server and globally across computing servers using a user-space process with high-level memory management policies.
Cloud infrastructures have become indispensable in our daily lives with the proliferation of services offered by companies such as Facebook, Google, and Amazon, among others. These infrastructures use a large number of servers provisioned with their own computing resources. Each of these servers uses a piece of software, called the Hypervisor ("HV"), that allows it to create multiple virtual instances of the server's physical computing resources and abstract them into "Virtual Machines" (VMs). A VM runs an Operating System (OS), which in turn runs applications. The VMs within the servers exhibit varying memory demand. When memory demand increases, costly operations such as (virtual) disk accesses and/or VM migrations can occur. As a result, it is necessary to optimize the utilization of local memory resources within the server. However, memory demand can keep increasing, making it necessary to migrate the VM to another server or to add more memory to the server. At this point, it is important to consider that some servers may have memory resources they are not using. To make more memory available to the servers that need it, new server architectures have been introduced that provide the hardware support necessary for servers to share their memory capacity. This thesis presents multiple contributions to the memory management problem. 
Primero, se enfoca en el problema de optimizar los recursos de memoria en un servidor virtualizado a través de distintos tipos de abstracciones de memoria. Dos contribuciones son presentadas para administrar memoria de manera automática dentro de un servidor virtualizado, llamadas SmarTmem y CARLEMM. En este contexto, una tercera contribución es presentada, llamada CAVMem, que proporciona los fundamentos para el desarrollo de CARLEMM. Segundo, la tesis presenta dos contribuciones enfocadas en la agregación de capacidad de memoria a través de múltiples servidores, ofreciendo dos mecanismos llamados GV-Tmem y vMCA, siendo este último basado en GV-Tmem pero con mejoras significativas. Estos mecanismos administran la memoria total de un servidor a nivel local y de manera global a lo largo de los servidores de la infraestructura de nube utilizando un proceso de usuario que implementa políticas de manejo de ..

    Virtualization techniques for memory resource exploitation

    Cloud infrastructures have become indispensable in our daily lives with the rise of cloud-based services offered by companies such as Facebook, Google, Amazon and many others. These cloud infrastructures use large numbers of servers, each provisioned with its own computing resources. Each server runs a piece of software called the hypervisor ("HV"), which creates multiple virtual instances of the server's physical computing resources and abstracts them into "Virtual Machines" (VMs). A VM runs an Operating System (OS), which in turn runs the applications. The VMs within a server generate varying memory demand. When demand increases, costly operations such as (virtual) disk accesses and/or VM migrations can occur. As a result, it is necessary to optimize the utilization of the local memory resources within a single computing server. However, pressure on the memory resources can still increase, making it necessary to migrate the VM to a different server with larger memory or to add more memory to the same server. At this point, it is important to consider that some servers in the cloud infrastructure might have memory resources that they are not using. To make that memory available to the servers that need it, new server architectures have been introduced that provide hardware support for servers to share their memory capacity. This thesis presents multiple contributions to the memory management problem. First, it addresses the problem of optimizing memory resources in a virtualized server through different types of memory abstractions. Two full contributions, SmarTmem and CARLEMM, are presented for managing memory automatically within a single virtualized server. A third contribution, CAVMem, provides the foundation for CARLEMM. Second, this thesis presents two contributions for memory capacity aggregation across multiple servers, offering two mechanisms called GV-Tmem and vMCA, the latter being based on GV-Tmem but with significant enhancements. These mechanisms manage a server's total memory locally and globally across the servers of the cloud infrastructure, using a user-space process that implements high-level memory management policies.
    Postprint (published version)