252 research outputs found

    Enhancing HPC on Virtual Systems in Clouds through Optimizing Virtual Overlay Networks

    Get PDF
    Virtual Ethernet overlay provides a powerful model for realizing virtual distributed and parallel computing systems with strong isolation, portability, and recoverability properties. However, in extremely high throughput and low latency networks, such overlays can suffer from bandwidth and latency limitations, which is of particular concern in HPC environments. Through a careful and quantitative analysis, I iden- tify three core issues limiting performance: delayed and excessive virtual interrupt delivery into guests, copies between host and guest data buffers during encapsulation, and the semantic gap between virtual Ethernet features and underlying physical network features. I propose three novel optimizations in response: optimistic timer- free virtual interrupt injection, zero-copy cut-through data forwarding, and virtual TCP offload. These optimizations improve the latency and bandwidth of the overlay network on 10 Gbps Ethernet and InfiniBand interconnects, resulting in near-native performance for a wide range of microbenchmarks and MPI application benchmarks

    E-EON : Energy-Efficient and Optimized Networks for Hadoop

    Get PDF
    Energy efficiency and performance improvements have been two of the major concerns of current Data Centers. With the advent of Big Data, more information is generated year after year, and even the most aggressive predictions of the largest network equipment manufacturer have been surpassed due to the non-stop growing network traffic generated by current Big Data frameworks. As, currently, one of the most famous and discussed frameworks designed to store, retrieve and process the information that is being consistently generated by users and machines, Hadoop has gained a lot of attention from the industry in recent years and presently its name describes a whole ecosystem designed to tackle the most varied requirements of today’s cloud applications. This thesis relates to Hadoop clusters, mainly focused on their interconnects, which is commonly considered to be the bottleneck of such ecosystem. We conducted research focusing on energy efficiency and also on performance optimizations as improvements on cluster throughput and network latency. Regarding the energy consumption, a significant proportion of a data center's energy consumption is caused by the network, which stands for 12% of the total system power at full load. With the non-stop growing network traffic, it is desired by industry and academic community that network energy consumption should be proportional to its utilization. Considering cluster performance, although Hadoop is a network throughput-sensitive workload with less stringent requirements for network latency, there is an increasing interest in running batch and interactive workloads concurrently on the same cluster. Doing so maximizes system utilization, to obtain the greatest benefits from the capital and operational expenditures. For this to happen, cluster throughput should not be impacted when network latency is minimized. The two biggest challenges faced during the development of this thesis were related to achieving near proportional energy consumption for the interconnects and also improving the network latency found on Hadoop clusters, while having virtually no loss on cluster throughput. Such challenges led to comparable sized opportunity: proposing new techniques that must solve such problems from the current generation of Hadoop clusters. We named E-EON the set of techniques presented in this work, which stands for Energy Efficient and Optimized Networks for Hadoop. E-EON can be used to reduce the network energy consumption and yet, to reduce network latency while cluster throughput is improved at the same time. Furthermore, such techniques are not exclusive to Hadoop and they are also expected to have similar benefits if applied to any other Big Data framework infrastructure that fits the problem characterization we presented throughout this thesis. With E-EON we were able to reduce the energy consumption by up to 80% compared to the state-of-the art technique. We were also able to reduce network latency by up to 85% and in some cases, even improve cluster throughput by 10%. Although these were the two major accomplishment from this thesis, we also present minor benefits which translate to easier configuration compared to the stat-of-the-art techniques. Finally, we enrich the discussions found in this thesis with recommendations targeting network administrators and network equipment manufacturers.La eficiencia energética y las mejoras de rendimiento han sido dos de las principales preocupaciones de los Data Centers actuales. Con el arribo del Big Data, se genera más información año con año, incluso las predicciones más agresivas de parte del mayor fabricante de dispositivos de red se han superado debido al continuo tráfico de red generado por los sistemas de Big Data. Actualmente, uno de los más famosos y discutidos frameworks desarrollado para almacenar, recuperar y procesar la información generada consistentemente por usuarios y máquinas, Hadoop acaparó la atención de la industria en los últimos años y actualmente su nombre describe a todo un ecosistema diseñado para abordar los requisitos más variados de las aplicaciones actuales de Cloud Computing. Esta tesis profundiza sobre los clusters Hadoop, principalmente enfocada a sus interconexiones, que comúnmente se consideran el cuello de botella de dicho ecosistema. Realizamos investigaciones centradas en la eficiencia energética y también en optimizaciones de rendimiento como mejoras en el throughput de la infraestructura y de latencia de la red. En cuanto al consumo de energía, una porción significativa de un Data Center es causada por la red, representada por el 12 % de la potencia total del sistema a plena carga. Con el tráfico constantemente creciente de la red, la industria y la comunidad académica busca que el consumo energético sea proporcional a su uso. Considerando las prestaciones del cluster, a pesar de que Hadoop mantiene una carga de trabajo sensible al rendimiento de red aunque con requisitos menos estrictos sobre la latencia de la misma, existe un interés creciente en ejecutar aplicaciones interactivas y secuenciales de manera simultánea sobre dicha infraestructura. Al hacerlo, se maximiza la utilización del sistema para obtener los mayores beneficios al capital y gastos operativos. Para que esto suceda, el rendimiento del sistema no puede verse afectado cuando se minimiza la latencia de la red. Los dos mayores desafíos enfrentados durante el desarrollo de esta tesis estuvieron relacionados con lograr un consumo energético cercano a la cantidad de interconexiones y también a mejorar la latencia de red encontrada en los clusters Hadoop al tiempo que la perdida del rendimiento de la infraestructura es casi nula. Dichos desafíos llevaron a una oportunidad de tamaño semejante: proponer técnicas novedosas que resuelven dichos problemas a partir de la generación actual de clusters Hadoop. Llamamos a E-EON (Energy Efficient and Optimized Networks) al conjunto de técnicas presentadas en este trabajo. E-EON se puede utilizar para reducir el consumo de energía y la latencia de la red al mismo tiempo que el rendimiento del cluster se mejora. Además tales técnicas no son exclusivas de Hadoop y también se espera que tengan beneficios similares si se aplican a cualquier otra infraestructura de Big Data que se ajuste a la caracterización del problema que presentamos a lo largo de esta tesis. Con E-EON pudimos reducir el consumo de energía hasta en un 80% en comparación con las técnicas encontradas en la literatura actual. También pudimos reducir la latencia de la red hasta en un 85% y, en algunos casos, incluso mejorar el rendimiento del cluster en un 10%. Aunque estos fueron los dos principales logros de esta tesis, también presentamos beneficios menores que se traducen en una configuración más sencilla en comparación con las técnicas más avanzadas. Finalmente, enriquecimos las discusiones encontradas en esta tesis con recomendaciones dirigidas a los administradores de red y a los fabricantes de dispositivos de red

    Passive available bandwidth: Applying self -induced congestion analysis of application-generated traffic

    Get PDF
    Monitoring end-to-end available bandwidth is critical in helping applications and users efficiently use network resources. Because the performance of distributed systems is intrinsically linked to the performance of the network, applications that have knowledge of the available bandwidth can adapt to changing network conditions and optimize their performance. A well-designed available bandwidth tool should be easily deployable and non-intrusive. While several tools have been created to actively measure the end-to-end available bandwidth of a network path, they require instrumentation at both ends of the path, and the traffic injected by these tools may affect the performance of other applications on the path.;We propose a new passive monitoring system that accurately measures available bandwidth by applying self-induced congestion analysis to traces of application-generated traffic. The Watching Resources from the Edge of the Network (Wren) system transparently provides available bandwidth information to applications without having to modify the applications to make the measurements and with negligible impact on the performance of applications. Wren produces a series of real-time available bandwidth measurements that can be used by applications to adapt their runtime behavior to optimize performance or that can be sent to a central monitoring system for use by other or future applications.;Most active bandwidth tools rely on adjustments to the sending rate of packets to infer the available bandwidth. The major obstacle with using passive kernel-level traces of TCP traffic is that we have no control over the traffic pattern. We demonstrate that there is enough natural variability in the sending rates of TCP traffic that techniques used by active tools can be applied to traces of application-generated traffic to yield accurate available bandwidth measurements.;Wren uses kernel-level instrumentation to collect traces of application traffic and analyzes the traces in the user-level to achieve the necessary accuracy and avoid intrusiveness. We introduce new passive bandwidth algorithms based on the principles of the active tools to measure available bandwidth, investigate the effectiveness of these new algorithms, implement a real-time system capable of efficiently monitoring available bandwidth, and demonstrate that applications can use Wren measurements to adapt their runtime decisions

    Distributed Object Tracking Using a Cluster-Based Kalman Filter in Wireless Camera Networks

    Get PDF
    Local data aggregation is an effective means to save sensor node energy and prolong the lifespan of wireless sensor networks. However, when a sensor network is used to track moving objects, the task of local data aggregation in the network presents a new set of challenges, such as the necessity to estimate, usually in real time, the constantly changing state of the target based on information acquired by the nodes at different time instants. To address these issues, we propose a distributed object tracking system which employs a cluster-based Kalman filter in a network of wireless cameras. When a target is detected, cameras that can observe the same target interact with one another to form a cluster and elect a cluster head. Local measurements of the target acquired by members of the cluster are sent to the cluster head, which then estimates the target position via Kalman filtering and periodically transmits this information to a base station. The underlying clustering protocol allows the current state and uncertainty of the target position to be easily handed off among clusters as the object is being tracked. This allows Kalman filter-based object tracking to be carried out in a distributed manner. An extended Kalman filter is necessary since measurements acquired by the cameras are related to the actual position of the target by nonlinear transformations. In addition, in order to take into consideration the time uncertainty in the measurements acquired by the different cameras, it is necessary to introduce nonlinearity in the system dynamics. Our object tracking protocol requires the transmission of significantly fewer messages than a centralized tracker that naively transmits all of the local measurements to the base station. It is also more accurate than a decentralized tracker that employs linear interpolation for local data aggregation. Besides, the protocol is able to perform real-time estimation because our implementation takes into consideration the sparsit- - y of the matrices involved in the problem. The experimental results show that our distributed object tracking protocol is able to achieve tracking accuracy comparable to the centralized tracking method, while requiring a significantly smaller number of message transmissions in the network

    Low-Latency Hard Real-Time Communication over Switched Ethernet

    Get PDF
    With the upsurge in the demand for high-bandwidth networked real-time applications in cost-sensitive environments, a key issue is to take advantage of developments of commodity components that offer a multiple of the throughput of classical real-time solutions. It was the starting hypothesis of this dissertation that with fine grained traffic shaping as the only means of node cooperation, it should be possible to achieve lower guaranteed delays and higher bandwidth utilization than with traditional approaches, even though Switched Ethernet does not support policing in the switches as other network architectures do. This thesis presents the application of traffic shaping to Switched Ethernet and validates the hypothesis. It shows, both theoretically and practically, how commodity Switched Ethernet technology can be used for low-latency hard real-time communication, and what operating-system support is needed for an efficient implementation

    E-EON : Energy-Efficient and Optimized Networks for Hadoop

    Get PDF
    Energy efficiency and performance improvements have been two of the major concerns of current Data Centers. With the advent of Big Data, more information is generated year after year, and even the most aggressive predictions of the largest network equipment manufacturer have been surpassed due to the non-stop growing network traffic generated by current Big Data frameworks. As, currently, one of the most famous and discussed frameworks designed to store, retrieve and process the information that is being consistently generated by users and machines, Hadoop has gained a lot of attention from the industry in recent years and presently its name describes a whole ecosystem designed to tackle the most varied requirements of today’s cloud applications. This thesis relates to Hadoop clusters, mainly focused on their interconnects, which is commonly considered to be the bottleneck of such ecosystem. We conducted research focusing on energy efficiency and also on performance optimizations as improvements on cluster throughput and network latency. Regarding the energy consumption, a significant proportion of a data center's energy consumption is caused by the network, which stands for 12% of the total system power at full load. With the non-stop growing network traffic, it is desired by industry and academic community that network energy consumption should be proportional to its utilization. Considering cluster performance, although Hadoop is a network throughput-sensitive workload with less stringent requirements for network latency, there is an increasing interest in running batch and interactive workloads concurrently on the same cluster. Doing so maximizes system utilization, to obtain the greatest benefits from the capital and operational expenditures. For this to happen, cluster throughput should not be impacted when network latency is minimized. The two biggest challenges faced during the development of this thesis were related to achieving near proportional energy consumption for the interconnects and also improving the network latency found on Hadoop clusters, while having virtually no loss on cluster throughput. Such challenges led to comparable sized opportunity: proposing new techniques that must solve such problems from the current generation of Hadoop clusters. We named E-EON the set of techniques presented in this work, which stands for Energy Efficient and Optimized Networks for Hadoop. E-EON can be used to reduce the network energy consumption and yet, to reduce network latency while cluster throughput is improved at the same time. Furthermore, such techniques are not exclusive to Hadoop and they are also expected to have similar benefits if applied to any other Big Data framework infrastructure that fits the problem characterization we presented throughout this thesis. With E-EON we were able to reduce the energy consumption by up to 80% compared to the state-of-the art technique. We were also able to reduce network latency by up to 85% and in some cases, even improve cluster throughput by 10%. Although these were the two major accomplishment from this thesis, we also present minor benefits which translate to easier configuration compared to the stat-of-the-art techniques. Finally, we enrich the discussions found in this thesis with recommendations targeting network administrators and network equipment manufacturers.La eficiencia energética y las mejoras de rendimiento han sido dos de las principales preocupaciones de los Data Centers actuales. Con el arribo del Big Data, se genera más información año con año, incluso las predicciones más agresivas de parte del mayor fabricante de dispositivos de red se han superado debido al continuo tráfico de red generado por los sistemas de Big Data. Actualmente, uno de los más famosos y discutidos frameworks desarrollado para almacenar, recuperar y procesar la información generada consistentemente por usuarios y máquinas, Hadoop acaparó la atención de la industria en los últimos años y actualmente su nombre describe a todo un ecosistema diseñado para abordar los requisitos más variados de las aplicaciones actuales de Cloud Computing. Esta tesis profundiza sobre los clusters Hadoop, principalmente enfocada a sus interconexiones, que comúnmente se consideran el cuello de botella de dicho ecosistema. Realizamos investigaciones centradas en la eficiencia energética y también en optimizaciones de rendimiento como mejoras en el throughput de la infraestructura y de latencia de la red. En cuanto al consumo de energía, una porción significativa de un Data Center es causada por la red, representada por el 12 % de la potencia total del sistema a plena carga. Con el tráfico constantemente creciente de la red, la industria y la comunidad académica busca que el consumo energético sea proporcional a su uso. Considerando las prestaciones del cluster, a pesar de que Hadoop mantiene una carga de trabajo sensible al rendimiento de red aunque con requisitos menos estrictos sobre la latencia de la misma, existe un interés creciente en ejecutar aplicaciones interactivas y secuenciales de manera simultánea sobre dicha infraestructura. Al hacerlo, se maximiza la utilización del sistema para obtener los mayores beneficios al capital y gastos operativos. Para que esto suceda, el rendimiento del sistema no puede verse afectado cuando se minimiza la latencia de la red. Los dos mayores desafíos enfrentados durante el desarrollo de esta tesis estuvieron relacionados con lograr un consumo energético cercano a la cantidad de interconexiones y también a mejorar la latencia de red encontrada en los clusters Hadoop al tiempo que la perdida del rendimiento de la infraestructura es casi nula. Dichos desafíos llevaron a una oportunidad de tamaño semejante: proponer técnicas novedosas que resuelven dichos problemas a partir de la generación actual de clusters Hadoop. Llamamos a E-EON (Energy Efficient and Optimized Networks) al conjunto de técnicas presentadas en este trabajo. E-EON se puede utilizar para reducir el consumo de energía y la latencia de la red al mismo tiempo que el rendimiento del cluster se mejora. Además tales técnicas no son exclusivas de Hadoop y también se espera que tengan beneficios similares si se aplican a cualquier otra infraestructura de Big Data que se ajuste a la caracterización del problema que presentamos a lo largo de esta tesis. Con E-EON pudimos reducir el consumo de energía hasta en un 80% en comparación con las técnicas encontradas en la literatura actual. También pudimos reducir la latencia de la red hasta en un 85% y, en algunos casos, incluso mejorar el rendimiento del cluster en un 10%. Aunque estos fueron los dos principales logros de esta tesis, también presentamos beneficios menores que se traducen en una configuración más sencilla en comparación con las técnicas más avanzadas. Finalmente, enriquecimos las discusiones encontradas en esta tesis con recomendaciones dirigidas a los administradores de red y a los fabricantes de dispositivos de red.Postprint (published version

    Analysis of Gemini interconnect recovery mechanisms: methods and observations

    Get PDF
    This thesis focuses on the resilience of network components, and recovery capabilities of extreme-scale high-performance computing (HPC) systems, specifically petaflop-level supercomputers, aimed at solving complex science, engineering, and business problems that require high bandwidth, enhanced networking, and high compute capabilities. The resilience of the network is critical for ensuring successful execution of the applications and overall system availability. Failure of interconnect components such as links, routers, power supply, etc. pose a threat to the resilience of the interconnect network, causing application failures and, in the worst case, system-wide failure. An extreme-scale system is designed to manage these failures and automatically recover from such failures to ensure successful application execution and avoid system-wide failure. Thus, in this thesis, we characterize the success probability of the recovery procedures as well as the impact of the recovery procedures on the applications. We developed an interconnect recovery mechanisms analysis tool (I-RAT), a plugin built on top of LogDiver to characterize and assess the impact of recovery mechanisms. The tool was used to analyze more than two years of network/system logs from Blue Waters, a supercomputer operated by the NCSA at the University of Illinois. Our analyses show that recovery mechanisms are frequently triggered (in as little as 36 hours for link failovers) that can fail with relatively high probability (as much as 0.25 for link failover). Furthermore, the analyses show that system resilience does not equate to application resilience since executing applications can fail with non-negligible probability during (or just after) a successful recovery. Our analyses show that interconnect recovery mechanisms are frequently triggered (the mean time between triggers is as short as 36 hours for link failovers), and the initiated recovery fails with relatively high probability (as much as 0.25 for link failover). We also show that as many as 20\% of the executing applications fail during the recovery phase

    Overhead in available bandwidth estimation tools: Evaluation and analysis

    Get PDF
    Current Available Bandwidth Estimation Tools (ABET) insert into the network probing packets to perform a single estimation. The utilization of these packets makes ABET intrusive and prone to errors since they consume part of the available bandwidth they are measuring. This paper presents a comparative of Overhead Estimation Tools (OET) analysis of representative ABET: Abing, Diettopp, Pathload, PathChirp, Traceband, IGI, PTR, Assolo, and Wbest. By using Internet traffic, the study shows that the insertion of probing packets is a factor that affects two metrics associated to the estimation. First, it is shown that the accuracy is affected proportionally to the amount of probing traffic. Secondly, the Estimation Time (ET) is increased in high congested end-to-end links when auto-induced congestion tools are use
    corecore