
    A model-based approach for automatic recovery from memory leaks in enterprise applications

    Large-scale distributed computing systems such as data centers are hosted on heterogeneous, networked servers that execute in a dynamic and uncertain operating environment, shaped by factors such as time-varying user workload and various failures. Achieving stringent quality-of-service goals is therefore a challenging task that requires a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach for fault management that integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts to (1) enable systems to adapt to environmental variations, (2) maintain the availability and reliability of the system, and (3) facilitate system recovery from failures. This thesis focuses on memory leak errors. A characterization function is designed to detect memory leaks; an LLC scheme is then applied so that the computing system can adapt efficiently to workload variations and recover from memory leaks while maintaining functionality.
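
    To illustrate the kind of detection the abstract alludes to (the thesis's actual characterization function and LLC formulation are not reproduced here), the sketch below flags a leak when a process's resident memory shows a sustained upward regression slope over a sliding window and then invokes a recovery callback; the sampling callback, window size, growth threshold, and recovery action are all assumptions.

```python
import time
from collections import deque
from typing import Callable


def slope(samples) -> float:
    """Least-squares slope of the samples against their index (bytes per sample)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den if den else 0.0


def watch_for_leak(read_rss: Callable[[], int],
                   recover: Callable[[], None],
                   window: int = 30,
                   min_growth_bytes_per_sample: float = 1_000_000,
                   interval_s: float = 10.0) -> None:
    """Sample resident memory periodically; if the regression slope over the
    last `window` samples exceeds the growth threshold, treat it as a leak
    and invoke the recovery action (e.g. restart the leaking component)."""
    samples: deque = deque(maxlen=window)
    while True:
        samples.append(read_rss())
        if len(samples) == window and slope(samples) > min_growth_bytes_per_sample:
            recover()
            samples.clear()
        time.sleep(interval_s)
```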

    Monitoring and analysis system for performance troubleshooting in data centers

    Not long ago, on Christmas Eve 2012, a troubleshooting battle began in Amazon's data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELBs, suffered an extensive streaming outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The incident ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention.
    As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time-consuming. To address this challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope, so that data center operators are notified when performance anomalies occur. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations with which operators can analyze those interactions to find out which components are relevant to a performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation it introduces over 400% less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
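
    The abstract does not spell out VScope's detection algorithms, so purely as an illustration of online anomaly detection over per-VM metric streams gathered by a monitoring overlay, the hedged sketch below applies a rolling z-score test; the window length, threshold, and metric values are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag a metric sample as anomalous when it deviates from the rolling mean
    of the last `window` samples by more than `z_threshold` standard deviations."""

    def __init__(self, window: int = 120, z_threshold: float = 3.0) -> None:
        self.window = window
        self.z_threshold = z_threshold
        self.history: deque = deque(maxlen=window)

    def update(self, value: float) -> bool:
        anomalous = False
        if len(self.history) == self.window:
            mu, sigma = mean(self.history), pstdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.z_threshold * sigma
        self.history.append(value)          # deque(maxlen=...) drops the oldest sample
        return anomalous


# Example: one detector per (VM, metric) pair fed by the monitoring overlay.
detector = RollingZScoreDetector(window=60)
latencies = [5.0 + 0.1 * (i % 4) for i in range(60)] + [50.0]   # synthetic samples (ms)
for sample in latencies:
    if detector.update(sample):
        print("anomaly:", sample)
```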

    Measuring And Improving Internet Video Quality Of Experience

    Streaming multimedia content over the IP network is poised to be the dominant Internet traffic for the coming decade, predicted to account for more than 91% of all consumer traffic in the coming years. Streaming multimedia content includes Internet television (IPTV), video on demand (VoD), peer-to-peer streaming, and 3D television over IP, to name a few. Widespread acceptance, growth, and subscriber retention are contingent upon network providers assuring superior Quality of Experience (QoE) on top of today's Internet. This work presents the first empirical understanding of the Internet's video-QoE capabilities, together with tools and protocols to efficiently infer and improve them. To infer video-QoE at arbitrary nodes in the Internet, we design and implement MintMOS: a lightweight, real-time, no-reference framework for capturing perceptual quality. We demonstrate that MintMOS's projections closely match subjective surveys in assessing perceptual quality. We use MintMOS to characterize Internet video-QoE both at the link level and at the end-to-end path level. As input to our study, we use extensive measurements from a large number of Internet paths obtained from various measurement overlays deployed on PlanetLab. Link-level degradations of intra- and inter-ISP Internet links are studied to create an empirical understanding of their shortcomings and ways to overcome them. Our studies show that intra-ISP links are often poorly engineered compared to peering links, and that degradations are induced by transient network load imbalance within an ISP. Initial results also indicate that overlay networks could be a promising way to avoid such ISPs in times of degradation. A large number of end-to-end Internet paths are probed, and we measure delay, jitter, and loss rates. The measurement data is analyzed offline to identify ways to enable a source to select alternate paths in an overlay network to improve video-QoE, without the need for background monitoring or a priori knowledge of path characteristics. We establish that for any unstructured overlay of N nodes, it is sufficient to reroute key frames using a random subset of k nodes in the overlay, where k is bounded by O(ln N). We analyze various properties of such random subsets to derive a simple, scalable, and efficient path-selection strategy that results in a k-fold increase in path options for any source-destination pair; options that consistently outperform Internet path selection. Finally, we design a prototype called source-initiated frame restoration (SIFR) that employs random subsets to derive alternate paths and demonstrate its effectiveness in improving Internet video-QoE.
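
    The O(ln N) random-subset result lends itself to a small sketch. The code below (an illustration only, with a hypothetical path-quality probe) samples k = ceil(c ln N) overlay nodes as candidate relays for key frames and picks the best-scoring one, mirroring the idea that a logarithmic number of random candidates already yields many alternative path options.

```python
import math
import random
from typing import Callable, Sequence


def pick_relay_candidates(overlay_nodes: Sequence[str], c: float = 2.0) -> list:
    """Sample k = ceil(c * ln N) overlay nodes uniformly at random as candidate
    one-hop relays for rerouting key frames; k grows only as O(ln N)."""
    n = len(overlay_nodes)
    k = min(n, max(1, math.ceil(c * math.log(n))))
    return random.sample(list(overlay_nodes), k)


def best_relay(candidates: Sequence[str], path_quality: Callable[[str], float]) -> str:
    """Choose the candidate whose source->relay->destination path scores highest;
    in practice `path_quality` would come from lightweight delay/jitter/loss probes."""
    return max(candidates, key=path_quality)


# Usage with a synthetic quality function (higher is better).
nodes = [f"node{i}" for i in range(1000)]
candidates = pick_relay_candidates(nodes)             # about 14 candidates for N = 1000
relay = best_relay(candidates, path_quality=lambda node: random.random())
print(relay)
```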

    Network coding for reliable wireless sensor networks

    Wireless sensor networks are used in many applications and are now a key element of the ever-growing Internet of Things. These networks are composed of small nodes with wireless communication modules and, in most cases, are able to autonomously configure themselves into networks to ensure the delivery of sensed data. As more and more sensor nodes and networks join the Internet of Things, collaboration between geographically distributed systems is expected. Peer-to-peer overlay networks can assist in the federation of these systems so that they can collaborate. Since participating peers/proxies contribute storage and processing, no specific servers carry the burden and bandwidth bottlenecks are avoided. Network coding can be used to improve the performance of wireless sensor networks. The idea is for data from multiple links to be combined at intermediate encoding nodes before further transmission. This technique has proved to have great potential in a wide range of applications. In the particular case of sensor networks, network-coding-based protocols and algorithms try to strike a balance between low packet error rate and energy consumption. For network-coding-based constrained networks to be federated using peer-to-peer overlays, such distributed storage systems must be able to store encoding vectors and coded data. Packets can arrive at the overlay through any gateway/proxy (peers in the overlay), and lost packets can be recovered by the overlay (or client) using the original and coded data that have been stored. The decoding process requires a decoding service at the overlay network. Such an architecture, which is the focus of this thesis, will allow constrained networks to reduce packet error rate in an energy-efficient way while benefiting from an effective distributed storage solution for their federation. This will serve as a basis for proposing mathematical models and algorithms that determine the most effective routing trees for packet forwarding toward sink/gateway nodes and the best number and placement of encoding nodes.
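
    As a minimal illustration of storing encoding vectors plus coded payloads and decoding them at the overlay, the sketch below uses random linear network coding over GF(2) (plain XOR) with Gauss-Jordan elimination for decoding; practical schemes often work over larger fields such as GF(2^8), and the thesis's routing-tree and encoder-placement models are not reproduced here.

```python
import random


def encode(packets: list) -> tuple:
    """Combine equal-length source packets into one coded packet over GF(2):
    draw a random 0/1 coefficient per packet and XOR the selected payloads.
    Returns (encoding vector, coded payload), both of which would be stored."""
    coeffs = [random.randint(0, 1) for _ in packets]
    if not any(coeffs):                                   # avoid the useless all-zero vector
        coeffs[random.randrange(len(coeffs))] = 1
    payload = bytearray(len(packets[0]))
    for c, pkt in zip(coeffs, packets):
        if c:
            payload = bytearray(a ^ b for a, b in zip(payload, pkt))
    return coeffs, bytes(payload)


def decode(coded: list, n: int):
    """Recover the n source packets from >= n (vector, payload) pairs by
    Gauss-Jordan elimination over GF(2); returns None if not full rank."""
    rows = [(list(vec), bytearray(payload)) for vec, payload in coded]
    for col in range(n):
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                rows[r] = ([a ^ b for a, b in zip(rows[r][0], rows[col][0])],
                           bytearray(a ^ b for a, b in zip(rows[r][1], rows[col][1])))
    return [bytes(rows[i][1]) for i in range(n)]


# Example: three independent coded packets stored in the overlay are enough to
# rebuild three source packets lost on the way to the gateway.
source = [b"AAAA", b"BBBB", b"CCCC"]
stored = [encode(source) for _ in range(6)]               # redundancy against losses
print(decode(stored[:3], len(source)) or "need more coded packets (rank deficient)")
```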

    Monitoring and Failure Recovery of Cloud-Managed Digital Signage

    Digital signage is widely used in various fields such as transport systems, retail outlets, entertainment, and others, to display information in the form of images, videos, and text. The reliability of these resources, the availability of required services, and security measures play a key role in the adoption of such systems. Efficient management of a digital signage system is a challenging task for service providers. Many causes can lead to malfunctions, such as faulty displays and network, hardware, or software failures, which are quite repetitive. The traditional process of recovering from such failures often involves tedious and cumbersome diagnosis. In many cases, technicians need to visit the site physically, thereby increasing maintenance costs and recovery time. In this thesis, we propose a solution that monitors, diagnoses, and recovers from known failures by connecting the displays to a cloud. A cloud-based remote and autonomous server configures the content of remote displays and updates them dynamically. Each display tracks the running process and periodically sends its trace and system logs to the server. These logs, stored at the server using a customized log management module for efficiency, are analysed for failures. In addition, the displays incorporate self-recovery procedures to deal with failures when they are unable to establish a connection to the cloud. The proposed solution is implemented on a Linux system and evaluated by deploying the server on the Amazon Web Services (AWS) cloud. The main result of the thesis is a collection of techniques for resolving display system failures remotely.
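
    A display-side loop of the kind described (periodic log shipping plus local self-recovery when the cloud is unreachable) could look like the sketch below; the ingestion URL, log path, service name, and thresholds are hypothetical placeholders, not the thesis's implementation.

```python
import subprocess
import time
import urllib.request

CLOUD_URL = "https://signage.example.invalid/logs"   # hypothetical ingestion endpoint
HEARTBEAT_S = 60
MAX_MISSED = 5                                       # self-recover after ~5 minutes offline


def ship_logs(log_path: str = "/var/log/signage/player.log") -> bool:
    """Send the tail of the player log to the cloud; True on success."""
    try:
        with open(log_path, "rb") as f:
            chunk = f.read()[-65536:]                # last 64 KB of trace/system logs
        req = urllib.request.Request(CLOUD_URL, data=chunk, method="POST")
        urllib.request.urlopen(req, timeout=10)
        return True
    except OSError:                                  # covers URLError, timeouts, missing file
        return False


def self_recover() -> None:
    """Local fallback when the cloud is unreachable: restart the (hypothetical)
    player service so the display keeps showing content."""
    subprocess.run(["systemctl", "restart", "signage-player.service"], check=False)


def main() -> None:
    missed = 0
    while True:
        missed = 0 if ship_logs() else missed + 1
        if missed >= MAX_MISSED:
            self_recover()
            missed = 0
        time.sleep(HEARTBEAT_S)


if __name__ == "__main__":
    main()
```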

    Pervasive computing reference architecture from a software engineering perspective (PervCompRA-SE)

    Pervasive computing (PervComp) is one of the most challenging research topics nowadays. Its complexity exceeds that of the outdated mainframe and client-server computation models. Its systems are highly volatile, mobile, and resource-limited, and they stream a lot of data from different sensors. In spite of these challenges, the field entails, by default, a lengthy list of desired quality features such as context sensitivity, adaptable behavior, concurrency, service omnipresence, and invisibility. Fortunately, device manufacturers have improved the enabling technology, such as sensors, network bandwidth, and batteries, paving the road for pervasive systems with high capabilities. The domain has also gained an enormous amount of attention from researchers since it was first introduced in the early 1990s. Yet such systems are still regarded as visionary systems that are expected to be woven into people's daily lives. At present, PervComp systems still have no unified architecture, have a limited scope of context sensitivity and adaptability, and many essential quality features are insufficiently addressed in PervComp architectures. The reference architecture (RA) proposed in this research, PervCompRA-SE, addresses these problems by providing a comprehensive and innovative pair of business and technical architectural reference models. Both models are based on deep analytical activities and were evaluated using different qualitative and quantitative methods. In this thesis we surveyed a wide range of research projects in PervComp across various subdomain areas to specify our methodological approach and identify the quality features most commonly found in the PervComp domain. The thesis presents a novel approach that utilizes theories from sociology, psychology, and process engineering. It analyzes the business and architectural problems in two separate chapters covering the business reference architecture (BRA) and the technical reference architecture (TRA), and the solutions to these problems are introduced in the same chapters. We devised an associated comprehensive ontology with semantic meanings and measurement scales. Both the BRA and the TRA were validated throughout the course of the research and evaluated as a whole using traceability, benchmark, survey, and simulation methods. The thesis introduces a new reference architecture in the PervComp domain, developed using a novel requirements engineering method, as well as a novel statistical method for trade-off analysis and conflict resolution between requirements. The adaptation of activity theory, human perception theory, and process re-engineering methods to develop the BRA and the TRA proved very successful. Our approach of reusing the ontological dictionary to monitor system performance was also innovative. Finally, the evaluation methods of the thesis represent a role model for researchers on how to use both qualitative and quantitative methods to evaluate a reference architecture. Our results show that the requirements engineering process, along with the trade-off analysis, was essential to delivering PervCompRA-SE. We found that the invisibility feature, which was one of the envisioned quality features for PervComp, is demolished, and that the qualitative evaluation methods were just as important as the quantitative ones for recognizing the overall quality of the RA by machines as well as by human beings.

    Fault diagnosis for IP-based network with real-time conditions

    BACKGROUND: Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas and serve different purposes: obtaining a representation model of the network for fault localization, selecting optimal probe sets for monitoring network devices, reducing fault detection time, and detecting faulty components in the network. Although several solutions for diagnosing network faults exist, challenges remain: a fault diagnosis solution needs to be always available and capable of processing data in a timely manner, because stale results degrade the quality and speed of informed decision-making. In addition, there is no non-invasive technique that continuously diagnoses network symptoms without leaving the system vulnerable to failures, nor one resilient to the network's dynamic changes, which can cause new failures with different symptoms.
    AIMS: This thesis aims to propose a model for the continuous and timely diagnosis of IP-based network faults, independent of the network structure and based on data analytics techniques.
    METHOD(S): This research's point of departure was the hypothesis of a fault propagation phenomenon that allows failure symptoms to be observed at a higher network level than the fault origin. For the model's construction, monitoring data was collected from an extensive campus network in which impact link failures were induced at different instants of time and with different durations. These data correspond to parameters widely used in the actual management of a network. The collected data allowed us to understand the faults' behavior and how they manifest at a peripheral level. Based on this understanding and a data analytics process, the first three modules of our model, named PALADIN, were proposed (Identify, Collection, and Structuring); they define peripheral data collection and the pre-processing necessary to obtain a description of the network's state at a given moment. These modules give the model the ability to structure the data while accounting for the delays of the multiple responses that the network delivers to a single monitoring probe and for the multiple network interfaces that a peripheral device may have. The result is a structured data stream that is ready to be analyzed. For this analysis, it was necessary to implement an incremental learning framework that respects the dynamic nature of networks. It comprises three elements: an incremental learning algorithm, a data-rebalancing strategy, and a concept-drift detector. This framework is the fourth module of the PALADIN model, named Diagnosis. To evaluate the PALADIN model, the Diagnosis module was implemented with 25 different incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. In addition, a dataset (the SOFI dataset) was built with the first modules of the PALADIN model; these data form the incoming stream used to evaluate the Diagnosis module's performance. The PALADIN Diagnosis module performs online classification of network failures, so it is a learning model that must be evaluated in a stream context. Prequential evaluation is the most widely used method for this task, so we adopt it to evaluate the model's performance over time through several stream evaluation metrics.
    RESULTS: This research first evidences the phenomenon of impact fault propagation, making it possible to detect fault symptoms at the monitored network's peripheral level; this translates into non-invasive monitoring of the network. Second, the PALADIN model is the major contribution in the fault detection context because it covers two aspects: an online learning model that continuously processes network symptoms and detects internal failures, and concept-drift detection and data-rebalancing components that make resilience to dynamic network changes possible. Third, it is well known that the number of real-world datasets available for imbalanced stream classification is still small, and smaller still for the networking context. The SOFI dataset obtained with the first modules of the PALADIN model adds to that number and encourages work on imbalanced data streams and on network fault diagnosis.
    CONCLUSIONS: The proposed model contains the elements necessary for the continuous and timely diagnosis of IP-based network faults; it introduces the idea of periodically monitoring peripheral network elements and uses data analytics techniques to process the collected data. Based on the analysis, processing, and classification of peripherally collected data, it can be concluded that PALADIN achieves its objective. The results indicate that peripheral monitoring allows faults in the internal network to be diagnosed and that the diagnosis process needs incremental learning, concept-drift detection, and a rebalancing strategy. The latter was verified with 25 different incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. This research clearly illustrates that it is unnecessary to monitor all internal network elements to detect a network's failures; it is enough to choose which peripheral elements to monitor. Furthermore, with proper processing of the collected status and traffic descriptors, it is possible to learn from the arriving data using incremental learning in cooperation with data rebalancing and concept-drift approaches. This proposal continuously diagnoses network symptoms without leaving the system vulnerable to failures while remaining resilient to the network's dynamic changes.
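
    Since the abstract leans on prequential evaluation, the sketch below shows the test-then-train protocol in plain Python with a deliberately naive stand-in learner; PALADIN's 25 incremental algorithms, ADWIN, and the streaming SMOTE variant are not reproduced, and the toy stream merely shows why a learner that ignores drift stalls at chance level after the distribution shifts.

```python
from typing import Dict, Iterable, Protocol, Tuple


class IncrementalClassifier(Protocol):
    def predict(self, x: dict) -> int: ...
    def learn(self, x: dict, y: int) -> None: ...


def prequential_accuracy(stream: Iterable[Tuple[dict, int]],
                         model: IncrementalClassifier) -> float:
    """Prequential (test-then-train) evaluation: every arriving sample is first
    used to test the current model and only then to update it, so the score
    reflects performance over time on the evolving stream."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)   # test first ...
        model.learn(x, y)                       # ... then train on the same sample
        total += 1
    return correct / total if total else 0.0


class MajorityClass:
    """Trivial stand-in learner (predicts the most frequent label seen so far).
    In PALADIN this slot is filled by real incremental algorithms, combined with
    ADWIN for drift detection and a streaming SMOTE variant for rebalancing."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}

    def predict(self, x: dict) -> int:
        return max(self.counts, key=self.counts.get) if self.counts else 0

    def learn(self, x: dict, y: int) -> None:
        self.counts[y] = self.counts.get(y, 0) + 1


# Toy stream: the label distribution drifts halfway through.
stream = [({"rtt_ms": i % 7}, 0) for i in range(500)] + \
         [({"rtt_ms": i % 7}, 1) for i in range(500)]
print(prequential_accuracy(stream, MajorityClass()))   # ~0.5: the static learner never adapts
```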

    Mesh-Mon: a Monitoring and Management System for Wireless Mesh Networks

    A mesh network is a network of wireless routers that employ multi-hop routing and can be used to provide network access for mobile clients. Mobile mesh networks can be deployed rapidly to provide an alternative communication infrastructure for emergency response operations in areas with limited or damaged infrastructure. In this dissertation, we present Dart-Mesh: a Linux-based, layer-3, dual-radio, two-tiered mesh network that provides complete 802.11b coverage in the Sudikoff Lab for Computer Science at Dartmouth College. We faced several challenges in building, testing, monitoring, and managing this network. These challenges motivated us to design and implement Mesh-Mon, a network monitoring system that aids system administrators in managing a mobile mesh network. Mesh-Mon is a scalable, distributed, and decentralized management system in which mesh nodes cooperate proactively to help detect, diagnose, and resolve network problems automatically. Mesh-Mon is independent of the routing protocol used by the mesh routing layer and can function even if the routing protocol fails. We demonstrate this feature by running Mesh-Mon on two versions of Dart-Mesh, one running AODV (a reactive mesh routing protocol) and the other running OLSR (a proactive mesh routing protocol), in separate experiments. Mobility can cause links to break, leading to disconnected partitions. We identify critical nodes in the network, whose failure may cause a partition. We introduce two new metrics based on social-network analysis, the Localized Bridging Centrality (LBC) metric and the Localized Load-aware Bridging Centrality (LLBC) metric, which can identify critical nodes efficiently and in a fully distributed manner. We also run a monitoring component, called Mesh-Mon-Ami, on client nodes; it assists Mesh-Mon nodes in disseminating management information between physically disconnected partitions by acting as a carrier for management data. We conclude, from our experimental evaluation on our 16-node Dart-Mesh testbed, that our system solves several management challenges in a scalable manner and is a useful and effective tool for monitoring and managing real-world mesh networks.
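
    The LBC formula itself is not restated in the abstract; following the formulation published in the authors' related work (treat this as an approximation rather than the dissertation's exact definition), LBC multiplies a node's egocentric betweenness, computed only on its 1-hop ego network, by a bridging coefficient derived from neighbor degrees. The sketch below computes it with networkx on a toy topology; the load-aware LLBC variant, which additionally weights by traffic load, is omitted.

```python
import networkx as nx


def bridging_coefficient(G: nx.Graph, v) -> float:
    """(1/deg(v)) / sum(1/deg(i) for neighbors i): large when v's neighbors are
    better connected than v itself, i.e. v sits between denser regions."""
    neighbors = list(G.neighbors(v))
    if not neighbors:
        return 0.0
    return (1.0 / G.degree(v)) / sum(1.0 / G.degree(i) for i in neighbors)


def localized_bridging_centrality(G: nx.Graph, v) -> float:
    """LBC(v): betweenness of v computed only inside its 1-hop ego network,
    multiplied by the bridging coefficient. Both terms need information about
    v and its immediate neighbours only, so each node can compute its own score."""
    ego = nx.ego_graph(G, v)
    return nx.betweenness_centrality(ego)[v] * bridging_coefficient(G, v)


# Toy topology: two triangles joined only through node "g".
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "g"),
              ("g", "d"), ("d", "e"), ("e", "f"), ("d", "f")])
scores = {v: localized_bridging_centrality(G, v) for v in G}
print(max(scores, key=scores.get))   # "g": its failure would partition the network
```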

    NASA SBIR abstracts of 1990 phase 1 projects

    The research objectives of the 280 projects placed under contract in the National Aeronautics and Space Administration (NASA) 1990 Small Business Innovation Research (SBIR) Phase 1 program are described. The basic document consists of edited, non-proprietary abstracts of the winning proposals submitted by small businesses in response to NASA's 1990 SBIR Phase 1 Program Solicitation. The abstracts are presented under the 15 technical topics within which Phase 1 proposals were solicited. Each project was assigned a sequential identifying number from 001 to 280, in order of its appearance in the body of the report. The document also includes appendixes that provide additional information about the SBIR program and permit cross-referencing of the 1990 Phase 1 projects by company name, location by state, principal investigator, NASA field center responsible for managing each project, and NASA contract number.