234 research outputs found

    Partial Replica Location And Selection For Spatial Datasets

    As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica location and selection problems. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we improve performance by designing an R-tree structure in the backend database and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas. The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the solution found by our algorithm achieves, on average, at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
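
    As an illustration of the spatial-locality idea mentioned above, the sketch below (not the authors' code; the grid resolution and the replica record layout are assumptions made for this example) orders partial-replica bounding boxes by Morton (Z-order) code before an R-tree is bulk-loaded, so that replicas that are close in space tend to land in the same tree nodes.

```python
# Illustrative sketch: Morton (Z-order) keys for partial-replica bounding boxes,
# used to sort replicas before bulk-loading an R-tree so that spatially nearby
# replicas are packed into the same nodes.

def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_key(bbox, extent, bits: int = 16) -> int:
    """Map a bounding box (xmin, ymin, xmax, ymax) to the Morton code of its
    centroid, normalised to the dataset extent (an assumed 2^bits x 2^bits grid)."""
    xmin, ymin, xmax, ymax = bbox
    exmin, eymin, exmax, eymax = extent
    scale = (1 << bits) - 1
    cx = int(((xmin + xmax) / 2 - exmin) / (exmax - exmin) * scale)
    cy = int(((ymin + ymax) / 2 - eymin) / (eymax - eymin) * scale)
    return interleave_bits(cx, cy, bits)

def order_replicas_for_rtree(replicas, extent):
    """Sort (replica_id, bbox) pairs by Morton code; an R-tree bulk-loaded in
    this order keeps spatially nearby replicas close together in the tree."""
    return sorted(replicas, key=lambda r: morton_key(r[1], extent))
```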

    Pervasive Data Access in Wireless and Mobile Computing Environments

    The rapid advance of wireless and portable computing technology has brought a lot of research interest and momentum to the area of mobile computing. One research focus is pervasive data access: with wireless connections, users can access information at any place and at any time. However, various constraints such as limited client capability, limited bandwidth, weak connectivity, and client mobility impose many challenging technical issues. In the past years, tremendous research effort has been put forth to address the issues related to pervasive data access, and a number of interesting research results have been reported in the literature. This survey paper reviews important work in two key dimensions of pervasive data access: data broadcast and client caching. In addition, data access techniques aimed at various application requirements (such as time, location, semantics and reliability) are covered.
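
    As a minimal illustration of the client-caching dimension surveyed here (not taken from the paper; fetch_from_server is a hypothetical stand-in for the wireless or broadcast retrieval path), a mobile client can keep a small LRU cache so that repeated reads are served locally instead of over the constrained wireless link.

```python
# Minimal illustration of client-side caching for pervasive data access:
# a small LRU cache on the mobile client absorbs repeated reads.
from collections import OrderedDict

class ClientCache:
    def __init__(self, capacity: int, fetch_from_server):
        self.capacity = capacity
        self.fetch = fetch_from_server   # hypothetical wireless/broadcast retrieval
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)      # cache hit: refresh recency
            return self.items[key]
        value = self.fetch(key)              # cache miss: use the wireless link
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used item
        return value
```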

    Multimedia

    The ubiquitous and effortless digital data capture and processing capabilities offered by the majority of today's devices have led to an unprecedented penetration of multimedia content into our everyday lives. To make the most of this phenomenon, the rapidly increasing volume and usage of digitised content require constant re-evaluation and adaptation of multimedia methodologies, in order to meet the relentless change of requirements from both the user and system perspectives. Advances in Multimedia provides readers with an overview of the ever-growing field of multimedia by bringing together various research studies and surveys from different subfields that highlight these important aspects. Some of the main topics that this book deals with include: multimedia management in peer-to-peer structures and wireless networks, security characteristics in multimedia, semantic gap bridging for multimedia content, and novel multimedia applications.

    Adaptive Prefetching and Cache Partitioning for Multicore Processors

    Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete for the limited off-chip bandwidth, aggravating even more the so-called memory wall. Several techniques have been applied to deal with the core-memory performance gap, the most prominent being prefetching and hierarchical caching. Hierarchical caches leverage the temporal and spatial locality of the accessed data, mitigating the huge main memory access latencies. To limit the number of accesses to the off-chip DRAM memory, current processors feature large last-level caches (LLCs). These caches are shared between all the cores to improve the utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, presents an important shortcoming: the interference between applications. Prefetching, on the other hand, brings data blocks into the caches before they are requested, hiding the main memory latency. Unfortunately, since prefetching is a speculative technique, inaccurate prefetches may pollute the cache with blocks that will not be used. In addition, prefetches interfere with regular memory requests, both those of the application running on the core that issued the prefetches and those of the other cores. This thesis focuses on reducing inter-application interference, both in the shared cache and in the access to main memory. To reduce inter-application interference in the access to main memory, the proposed approach regulates the aggressiveness of each core's prefetcher, selectively activating or deactivating some of them depending on their individual performance and the main memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications whose progress is diminished by inter-application interference. The first cache partitioning proposal requires dedicated hardware not available in commercial processors, so it has been evaluated using a simulation framework. The second proposal presents a family of partitioning policies that overcome the limitations in the number of partitions and the number of available cache ways by grouping applications into clusters and overlapping cache partitions, so that multiple applications share the same ways. Since it has been implemented using the cache partitioning features of modern Intel processors, it has been evaluated on a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main memory requests by 20%, which translates into improvements in system fairness, performance, and energy consumption. Regarding the proposed partitioning schemes, compared to a system with no partitioning, both proposals reduce system unfairness by more than 25% on average, regardless of the number of applications running on the multicore, and this reduction in unfairness does not negatively affect performance.
    Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423
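
    The selective prefetching idea summarized above can be sketched as a simple interval-based policy. The sketch below is illustrative only (the thresholds, counter names, and decision rule are assumptions, not the thesis' actual mechanism): each interval, a core's prefetcher is disabled when main memory bandwidth is saturated and its prefetches are mostly inaccurate, and re-enabled otherwise.

```python
# Hedged sketch of interval-based selective prefetcher throttling: cores whose
# prefetchers are inaccurate are switched off while the other cores need the
# main-memory bandwidth. Thresholds and counter names are illustrative.

ACCURACY_THRESHOLD = 0.5   # useful prefetches / issued prefetches (assumed)
BANDWIDTH_HIGH = 0.8       # fraction of peak DRAM bandwidth treated as saturated (assumed)

def adjust_prefetchers(cores, dram_utilization):
    """cores: list of dicts with 'useful_prefetches', 'issued_prefetches',
    and 'prefetch_enabled'. Returns the updated per-core enable decisions."""
    for core in cores:
        issued = core['issued_prefetches']
        accuracy = core['useful_prefetches'] / issued if issued else 0.0
        if dram_utilization > BANDWIDTH_HIGH and accuracy < ACCURACY_THRESHOLD:
            core['prefetch_enabled'] = False   # these prefetches mostly waste shared bandwidth
        else:
            core['prefetch_enabled'] = True    # bandwidth is available or the prefetches pay off
    return cores
```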

    State Management for Efficient Event Pattern Detection

    Event stream processing (ESP) systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. The challenge is that query processing is stateful: the systems maintain partial matches, whose number grows exponentially in the number of processed events. State management is further complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems must resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: fetching the remote data interrupts the stream processing, yet without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by means of a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model determines when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data, and show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency.
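
    As a rough illustration of the latency-bound load shedding described above (a sketch under simplifying assumptions, not the dissertation's algorithm), the function below drops the lowest-utility queued elements, whether input events or partial matches, until the remaining work fits within the latency bound.

```python
# Illustrative latency-bound load shedding: keep only as many queued elements as
# can be processed within the latency bound, preferring the high-utility ones so
# the loss in result quality stays small. Utilities and the per-element cost are
# assumed inputs.

def shed(queue, latency_bound, cost_per_element):
    """queue: list of (utility, element) pairs awaiting processing.
    Returns (kept, dropped) element lists such that processing the kept
    elements fits within latency_bound."""
    budget = int(latency_bound / cost_per_element)   # elements we can still process in time
    if len(queue) <= budget:
        return [e for _, e in queue], []
    ranked = sorted(queue, key=lambda p: p[0], reverse=True)   # highest utility first
    kept = [e for _, e in ranked[:budget]]
    dropped = [e for _, e in ranked[budget:]]
    return kept, dropped
```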

    Data Structures and Algorithms for Scalable NDN Forwarding

    Named Data Networking (NDN) is a recently proposed general-purpose network architecture that aims to address the limitations of the Internet Protocol (IP) while maintaining its strengths. NDN takes an information-centric approach, focusing on named data rather than computer addresses. In NDN, content is identified by its name, and each NDN packet carries a name that specifies the content it is fetching or delivering. Since there are no source and destination addresses in an NDN packet, it is forwarded based on a lookup of its name in the forwarding plane, which consists of the Forwarding Information Base (FIB), Pending Interest Table (PIT), and Content Store (CS). In addition, as an in-network caching element, a scalable Repository (Repo) design is needed to provide large-scale, long-term content storage in NDN networks. Scalable NDN forwarding is a challenge. Compared to the well-understood approaches to IP forwarding, NDN forwarding performs lookups on packet names, which have variable and unbounded lengths, increasing the lookup complexity. The lookup tables are larger than in IP, requiring more memory space. Moreover, NDN forwarding has a read-write data plane, requiring per-packet updates at line rates. Designing and evaluating a scalable NDN forwarding node architecture is a major effort within the overall NDN research agenda. The goal of this dissertation is to demonstrate that scalable NDN forwarding is feasible with the proposed data structures and algorithms. First, we propose a FIB lookup design based on binary search of hash tables that provides a reliable longest name prefix lookup performance baseline for future NDN research. We demonstrate 10 Gbps forwarding throughput with 256-byte packets and one billion synthetic forwarding rules, each containing up to seven name components. Second, we explore data structures and algorithms to optimize the FIB design based on the specific characteristics of real-world forwarding datasets. Third, we propose a fingerprint-only PIT design that reduces the memory requirements in core routers. Lastly, we discuss Content Store design issues and demonstrate that an NDN Repo implementation can leverage many existing databases and storage systems to improve performance.
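
    The following sketch illustrates the flavor of FIB longest name prefix lookup with one hash table per name-component count. It is a simplification: the dissertation's design binary-searches the per-length tables (with auxiliary entries to keep that search correct) and hashes fixed-size fingerprints, whereas this version simply probes from the longest prefix down using Python dicts. The names and faces in the usage example are made up.

```python
# Simplified longest name-prefix match for an NDN FIB: one hash table per
# component count, probed from the longest candidate prefix down.

class Fib:
    def __init__(self, max_components: int = 7):
        self.max_components = max_components
        # tables[k] maps a k-component name prefix to a next-hop face
        self.tables = [dict() for _ in range(max_components + 1)]

    def insert(self, name_prefix: str, next_hop: str):
        components = tuple(c for c in name_prefix.split('/') if c)
        self.tables[len(components)][components] = next_hop

    def longest_prefix_match(self, name: str):
        components = tuple(c for c in name.split('/') if c)
        limit = min(len(components), self.max_components)
        for length in range(limit, 0, -1):          # probe longest prefixes first
            next_hop = self.tables[length].get(components[:length])
            if next_hop is not None:
                return next_hop
        return None

# Hypothetical usage:
fib = Fib()
fib.insert('/edu/wustl', 'face1')
fib.insert('/edu/wustl/video', 'face3')
print(fib.longest_prefix_match('/edu/wustl/video/frame7'))  # -> 'face3'
```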

    Improving address translation performance in virtualized multi-tenant systems

    With the explosive growth in dataset sizes, application memory footprints commonly reach hundreds of GBs. Such huge datasets pressure the TLBs, resulting in frequent misses that must be resolved through a page walk, a long-latency pointer chase through multiple levels of the in-memory radix-tree-based page table. Page walk latency is particularly high under virtualization, where address translation mandates traversing two radix-tree page tables in a process called a nested page walk, performing up to 24 memory accesses. Page walk latency can also be amplified by the colocation of applications on the same server, which is used in an attempt to increase utilization. Under colocation, cache contention makes cache misses during a nested page walk more frequent, piling up page walk latency. Both virtualization and colocation are widely adopted in cloud platforms such as Amazon Web Services and Google Cloud Engine. As a result, in cloud environments, page walk latency can reach hundreds of cycles, significantly reducing overall application performance. This thesis addresses the problem of high page walk latency by (1) identifying the sources of high page walk latency under virtualization and/or colocation, and (2) proposing hardware and software techniques that accelerate page walks by means of new memory allocation strategies for the page table and data, which can be easily adopted by existing systems. Firstly, we quantify how dataset size growth, virtualization, and colocation affect page walk latency. We also study how high page walk latency affects performance. Due to the lack of dedicated tools for evaluating address translation overhead on modern processors, we design a methodology to vary the page walk latency experienced by an application running on real hardware. To quantify the performance impact of address translation, we measure the application's execution time while varying the page walk latency. We find that under virtualization, address translation considerably limits performance: an application can waste up to 68% of execution time due to stalls originating from page walks. In addition, we investigate which accesses from a nested page walk are most significant for the overall page walk latency by examining where in the memory hierarchy these accesses are served from. We find that accesses to the deeper levels of the page table radix tree are responsible for most of the overall page walk latency. Based on these observations, we introduce two address translation acceleration techniques that can be applied to any ISA that employs radix-tree page tables and nested page walks. The first of these techniques is Prefetched Address Translation (ASAP), a new software-hardware approach for mitigating the high page walk latency caused by virtualization and/or application colocation. At the heart of ASAP is a lightweight technique for directly indexing individual levels of the page table radix tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page table without first accessing the preceding levels, thus lowering the page walk latency. ASAP is fully compatible with the existing radix-tree-based page table and requires only incremental and isolated changes to the memory subsystem. The second technique is PTEMagnet, a new software-only approach for reducing address translation latency under virtualization and application colocation.
    Initially, we identify a new address translation bottleneck caused by memory fragmentation stemming from the interaction of virtualization, application colocation, and the Linux memory allocator. The fragmentation results in the effective cache footprint of the host page table being larger than that of the guest page table. The bloated footprint of the host page table leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMagnet. PTEMagnet prevents memory fragmentation through fine-grained, reservation-based memory allocation in the guest OS. PTEMagnet is fully legacy-preserving, requiring no modifications to either user code or the mechanisms for address translation and virtualization. In summary, this thesis proposes non-disruptive upgrades to the virtual memory subsystem for reducing page walk latency in virtualized deployments. In doing so, it evaluates the impact of page walk latency on application performance, identifies the bottlenecks of the existing address translation mechanism caused by virtualization, application colocation, and the Linux memory allocator, and proposes software-hardware and software-only solutions for eliminating these bottlenecks.
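
    To make the latency problem concrete, the sketch below (an illustration with assumed structures, not the thesis' implementation) shows why a radix-tree page walk is a chain of dependent memory accesses, and how directly indexing a level, assuming that level's nodes are allocated contiguously, lets an entry's location be computed from the virtual address alone, which is the idea behind ASAP's deeper-level fetches.

```python
# Illustrative radix-tree page walk: each level's entry must be read before the
# next level's node address is known, so the walk is a chain of dependent loads.
# Under nested (virtualized) translation, each of these reads is itself
# translated, which is how a nested walk grows to as many as 24 accesses.

LEVEL_SHIFTS = (39, 30, 21, 12)   # standard x86-64-style 4-level index positions
INDEX_MASK = 0x1FF                # 9 index bits per level

def walk(page_table_root: int, virtual_address: int, memory: dict) -> int:
    """Sequential page walk: four dependent loads, one per radix-tree level.
    `memory` is a stand-in for physical memory, keyed by (node, index)."""
    node = page_table_root
    for shift in LEVEL_SHIFTS:
        index = (virtual_address >> shift) & INDEX_MASK
        node = memory[(node, index)]          # dependent load: needs the previous level's result
    return node | (virtual_address & 0xFFF)   # physical frame + page offset

def direct_index(level_base: int, level: int, virtual_address: int) -> int:
    """Direct-indexing idea: if all nodes of a given level are allocated
    contiguously starting at level_base, the entry's address follows from the
    virtual address alone, so it can be fetched without walking earlier levels.
    The 8-byte entry size and contiguous layout are assumptions."""
    cumulative_index = virtual_address >> LEVEL_SHIFTS[level]
    return level_base + cumulative_index * 8
```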