34 research outputs found

    Performance scalability analysis of JavaScript applications with web workers

    Web applications are getting closer to the performance of native applications by taking advantage of new standards-based technologies. The recent HTML5 standard includes, among others, the Web Workers API, which allows JavaScript applications to execute on multiple threads, or workers. However, the internals of the browser's JavaScript virtual machine do not expose a direct relation between workers, the threads running in the browser, and the utilization of logical cores in the processor. As a result, developers do not know how performance actually scales in different environments, and therefore what the optimal number of workers is for parallel JavaScript code. This paper presents the first performance scalability analysis of parallel web apps with multiple workers. We focus on two case studies representative of different worker execution models. Our analyses show performance scaling on different parallel processor microarchitectures and on three major web browsers on the market. In addition, we study the impact of co-running applications on web app performance. The results provide insights for future approaches that automatically find the optimal number of workers, that is, the number that provides the best tradeoff between performance and resource usage while preserving system responsiveness and user experience, especially in environments with unexpected changes in system workload. Peer Reviewed. Postprint (author's final draft).

    Analysis and architectural support for parallel stateful packet processing

    The evolution of network services is closely related to trends in network technology. Originally, network nodes forwarded packets from a source to a destination by executing lightweight packet processing, or even negligible workloads. As links provide more complex services, packet processing demands the execution of more computationally intensive applications. Complex network applications deal with both the packet header and the payload (i.e. packet contents) to provide upper layer network services, such as enhanced security, system utilization policies, and video on demand management. Applications that provide complex network services require two key capabilities that distinguish them from low layer network applications: a) deep packet inspection, which examines the packet payload, typically searching for a matching string or regular expression, and b) stateful processing, which keeps track of information from previous packet processing, unlike other applications that keep no data about the processing of other packets. In most cases, deep packet inspection also integrates stateful processing. Computer architecture research aims to maximize system throughput to sustain the required network processing performance as well as other demands, such as memory and I/O bandwidth. In fact, there are different processor architectures depending on the degree to which hardware resources are shared among streams (i.e. hardware contexts). Multicore architectures present multiple processing engines within a single chip that share cache levels of the memory hierarchy and the interconnection network. Multithreaded architectures integrate multiple streams in a single processing engine, sharing functional units, the register file, the fetch unit, and the inner levels of the cache hierarchy. Scalable multicore multithreaded architectures emerge as a solution to meet the requirements of high throughput systems. 
We use the term massively multithreaded architectures for architectures that comprise tens to hundreds of streams distributed across multiple cores on a chip. Nevertheless, the efficient utilization of these architectures depends on the application characteristics. On the one hand, emerging network applications show large computational workloads with significant variations in packet processing behavior. It is therefore important to analyze the behavior of each packet's processing in order to optimally assign packets to threads (i.e. software contexts) and reduce any negative interaction among them. On the other hand, network applications present Packet Level Parallelism (PLP), in which several packets can be processed in parallel. As in other paradigms, dependencies among packets limit the amount of PLP. Lower network layer applications show negligible packet dependencies. In contrast, complex upper network applications show dependencies among packets that reduce the amount of PLP. In this thesis, we address the limitations of parallelism in stateful network applications to maximize the throughput of advanced network devices. This dissertation comprises three complementary sets of contributions focused on network analysis, workload characterization, and an architectural proposal. The network analysis evaluates the impact of network traffic on stateful network applications. In particular, we study the impact of network traffic aggregation on memory hierarchy performance. We categorize and characterize network applications according to their data management. The results indicate that stateful processing presents reduced instruction level parallelism and a high rate of long latency memory accesses. Our analysis reveals that stateful applications expose a variety of levels of parallelism related to stateful data categories. Thus, we propose MultiLayer Processing (MLP) as an execution model to exploit multiple levels of parallelism. 
MLP is a thread migration based mechanism that increases the synergy among streams in the memory hierarchy and alleviates contention in critical sections of parallel stateful workloads.

    Global concept map as a tool for an overall view of an operating system (original title: Mapa conceptual global como herramienta para la visión de conjunto de un sistema operativo)

    Many courses are built on a syllabus whose topics are fully interrelated. By the end of the course, students should have acquired the knowledge of each topic but, more importantly, they should know how the different topics interact with one another so as to obtain a global view of the course. However, students often focus on the topics separately, in part because we do not offer them tools that help them relate the different parts of the course. In this work we present the use of a Global Concept Map (Mapa Conceptual Global, MCG) of a course as a teaching resource that helps students obtain an overall view of the whole syllabus. The experience was carried out as a complement to an active learning class in an Operating Systems course, but we believe it can easily be applied to other courses. Peer Reviewed.

    Dynamic web worker pool management for highly parallel javascript web applications

    JavaScript web applications are improving in performance mainly thanks to the new standards included in HTML5. Among others, the Web Workers API allows multithreaded JavaScript web apps to exploit parallel processors. However, developers have difficulty determining the minimum number of web workers that provides the highest performance. Even if developers find this optimal number, it is a static value configured at the beginning of the execution. Because users tend to run other applications in the background, the estimated number of web workers may turn out to be non-optimal, since it may overload or underutilize the system. In this paper, we propose a solution for highly parallel web apps that dynamically adapts the number of running web workers to the resources actually available, avoiding the hassle of estimating a static optimal number of threads. The solution consists of including a web worker pool and a simple management algorithm in the web app. The results show that, even when there are co-running applications, our approach dynamically enables a number of web workers close to the optimal. Our proposal, which is independent of the web browser, overcomes the lack of knowledge of the underlying processor architecture as well as dynamic changes in resource availability. Peer Reviewed. Postprint (author's final draft).
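The abstract above does not describe its management algorithm, so the following is only a minimal sketch of one way a pool size could be adapted at run time. It uses a simple hill-climbing rule, written in Python to keep the control logic language-neutral. The function name `next_pool_size`, the throughput metric, and the 5% tolerance are all hypothetical choices for illustration, not the authors' method.

```python
def next_pool_size(active, prev_throughput, curr_throughput,
                   direction, max_workers, tolerance=0.05):
    """Return (new_active, new_direction) for one control step.

    Hypothetical hill-climbing rule (an assumption, not the paper's
    algorithm): keep moving in the same direction while throughput
    improves by more than `tolerance`, reverse when it degrades by more
    than `tolerance`, and otherwise hold the current pool size.
    """
    if prev_throughput <= 0:
        # No baseline yet: probe by moving in the current direction.
        change = direction
    else:
        gain = (curr_throughput - prev_throughput) / prev_throughput
        if gain > tolerance:
            change = direction       # last move helped, keep going
        elif gain < -tolerance:
            direction = -direction   # last move hurt, back off
            change = direction
        else:
            change = 0               # plateau: hold steady
    # Never drop below one worker or exceed the size of the pool.
    new_active = min(max(active + change, 1), max_workers)
    return new_active, direction
```

In a web app, a rule of this kind would be evaluated periodically by the main thread, with throughput measured as, for example, tasks completed per interval. Because it reacts only to observed throughput, it needs no knowledge of the core count or of co-running applications.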

    Performance analysis of a new packet trace compressor based on TCP flow clustering

    In this paper we study the properties of a new packet trace compression method based on clustering of TCP flows. With our proposed method, the compression ratio that we achieve is around 3%, reducing the file size, for instance, from 100 MB to 3 MB. Although this specification defines a lossy compressed data format, it preserves important statistical properties present in the original trace. To validate the method, memory performance studies were carried out with the Radix Tree algorithm executing a trace generated by our method. To support these studies, memory accesses and the cache miss ratio were measured. So far, the results show that our proposed method provides a good solution for packet trace compression. Peer Reviewed. Postprint (published version).

    Overhead of the spin-lock loop in UltraSPARC T2

    Spin locks are a task synchronization mechanism used to provide mutual exclusion to shared software resources. Spin locks perform well compared with other synchronization mechanisms in several situations, namely when tasks wait only a short time on average to obtain the lock, when the probability of getting the lock is high, or when no other synchronization mechanism is available. In this paper we study the effect that the execution of spin locks has in multithreaded processors. Besides the move to multicore architectures, recent industry trends show a big move toward hardware multithreaded processors. Intel P4, IBM POWER5 and POWER6, and Sun's UltraSPARC T1 and T2 all implement multithreading to various degrees. Sharing more processor resources can increase system performance, but at the same time it increases the impact that simultaneously executing processes have on each other. Postprint (published version).

    Understanding the overhead of the spin-lock loop in CMT architectures

    Spin locks are a synchronization mechanism used to provide mutual exclusion to shared software resources. Spin locks are preferred over other synchronization mechanisms in several situations, such as when the average waiting time to obtain the lock is short, in which case the probability of getting the lock is high, or when it is not possible to use other synchronization mechanisms. In this paper, we study the effect that the execution of the Linux spin-lock loop on the Sun UltraSPARC T1 and T2 processors has on other running tasks, especially in the worst case scenario where the workload shows high contention on a lock. For this purpose, we create a task that continuously executes the spin-lock loop and execute several instances of this task together with other active tasks. Our results show that, when the spin-lock tasks run with other applications on the same core of a T1 or a T2 processor, they introduce a significant overhead on the other applications: on average, 31% in T1 and 42% in T2. For the T1 and T2 processors, we identify the fetch bandwidth as the main source of interaction between active threads and the spin-lock threads. We propose four variants of the Linux spin-lock loop that require less fetch bandwidth. Our proposal reduces the overhead of the spin-lock tasks on the other applications to 3.5% and 1.5% on average, in T1 and T2 respectively. This is a reduction of 28 percentage points with respect to the Linux spin-lock loop for T1. For T2 the reduction is about 40 percentage points. Peer Reviewed. Preprint.
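As a quick check of the figures quoted in the abstract above, the short Python sketch below recomputes the reductions from the reported averages. The helper names are illustrative only. The relative reduction is an extra derived quantity that the abstract does not state.

```python
# Average overheads reported in the abstract, in percent.
linux_loop = {"T1": 31.0, "T2": 42.0}
proposed_loop = {"T1": 3.5, "T2": 1.5}


def reduction_pp(before, after):
    """Absolute reduction in percentage points."""
    return before - after


def reduction_relative(before, after):
    """Relative reduction as a fraction of the original overhead."""
    return (before - after) / before


for cpu in ("T1", "T2"):
    pp = reduction_pp(linux_loop[cpu], proposed_loop[cpu])
    rel = reduction_relative(linux_loop[cpu], proposed_loop[cpu])
    print(f"{cpu}: {pp:.1f} percentage points, {rel:.0%} relative")
```

The exact differences are 27.5 and 40.5 percentage points, consistent with the rounded values of 28 and about 40 given in the abstract.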

    Measuring Operating System Overhead on CMT Processors

    Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massively multithreaded processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor. Peer Reviewed.

    Measuring operating system overhead on Sun UltraSparc T1 processor

    Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine OS noise for High Performance Computing, especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this study, we analyze sources of OS noise on a massively multithreaded processor, the Sun UltraSPARC T1. We compare results measured in Linux and Solaris with the results provided by a low-overhead runtime environment that introduces almost no overhead in applications' execution time. Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor. Postprint (published version).

    A cost-efficient QoS-aware analytical model of future software content delivery networks

    Freelance, part-time, work-at-home, and other flexible jobs are changing the concept of the workplace and bringing information and content exchange problems to companies. Geographically dispersed corporations may use remote distribution of software and data to meet employees' demands by exploiting emerging delivery technologies. In this context, cost-efficient software distribution is crucial to allow business evolution and make IT infrastructures more agile. At the same time, container based virtualization technology is shaping new trends in software deployment and infrastructure design. We envision current and future enterprise IT management trends evolving towards container based software delivery over Hybrid CDNs. This paper presents a novel cost-efficient QoS aware analytical model and a Hybrid CDN-P2P architecture for enterprise software distribution. The model would allow delivery cost minimization for a wide range of companies, from big multinationals to SMEs, using CDN-P2P distribution under various hypothetical industrial scenarios. Model constraints guarantee acceptable deployment times and keep the amount of content exchanged below the network bandwidth and storage limits in our scenarios. Key model parameters account for network bandwidth, storage limits and rental prices, which are determined empirically from the values offered by the commercial delivery networks KeyCDN, MaxCDN, CDN77 and BunnyCDN. This preliminary study indicates that MaxCDN offers the best cost-QoS trade-off. The model is implemented in the network simulation tool PeerSim, and then applied to diverse testing scenarios by varying company types, the number and profile (either technical or administrative) of employees, and the number and size of content requests. 
Hybrid simulation results show overall economic savings between 5% and 20%, compared to just hiring resources from a commercial CDN, while guaranteeing satisfactory QoS levels in terms of deployment times and number of served requests. This work was partially supported by Generalitat de Catalunya under the SGR Program (2017-SGR-962) and the RIS3CAT DRAC Project (001-P-001723). We have also received funding from the Ministry of Science and Innovation (Spain) under the project EQC2019-005653-P. Peer Reviewed. Postprint (author's final draft).