1,581 research outputs found

    Doctor of Philosophy

    The next generation mobile network (i.e., the 5G network) is expected to host emerging use cases with a wide range of requirements, from Internet of Things (IoT) devices that prefer a low-overhead, scalable network to remote machine operation or remote healthcare services that require reliable end-to-end communication. Improving scalability and reliability is among the most important challenges in designing the next generation mobile architecture. The current (4G) mobile core network relies heavily on hardware-based proprietary components. These core networks are expensive and are therefore available in only a limited number of locations in the country. This leads to high end-to-end latency, due to the long distance between base stations and the mobile core, and it limits innovation and the network's ability to evolve. Moreover, at the protocol level the current mobile network architecture was designed for a limited number of smartphones streaming large amounts of high-quality traffic, not for a massive number of low-capability devices sending small, sporadic traffic. The result is high-overhead control and data planes in the mobile core that are unsuitable for a massive number of future IoT devices. In terms of reliability, network operators already deploy multiple monitoring systems to detect service disruptions and fix problems when they occur. However, detecting all service disruptions is challenging. First, there is a complex relationship between network status and user-perceived service experience. Second, service disruptions can happen for reasons beyond the network itself. With advances in Software-Defined Networking (SDN) and Network Function Virtualization (NFV), the next generation mobile network is expected to be NFV-based and deployed on NFV platforms. However, in contrast to telecom-grade hardware with built-in redundancy, the commodity off-the-shelf (COTS) hardware in NFV platforms is often not comparable in terms of reliability. Telecom-grade mobile core hardware typically offers 99.999% availability (i.e., "five-9s"), while most NFV platforms only guarantee "three-9s" availability, roughly two orders of magnitude more downtime. Therefore, an NFV-based mobile core network needs extra mechanisms to guarantee its availability. This Ph.D. dissertation focuses on using SDN/NFV, data analytics, and distributed systems techniques to enhance the scalability and reliability of the next generation mobile core network. It makes the following contributions. First, it presents SMORE, a practical offloading architecture that reduces end-to-end latency and enables new functionality in mobile networks. It then presents SIMECA, a lightweight and scalable mobile core network designed for a massive number of future IoT devices. Second, it presents ABSENCE, a passive service monitoring system that uses customer usage data and data analytics to detect silent failures in an operational mobile network. Lastly, it presents ECHO, a distributed mobile core network architecture that improves the availability of NFV-based mobile core networks in public clouds.
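
    To make the "five-9s" versus "three-9s" gap concrete, a quick back-of-the-envelope calculation (a sketch; only the two availability figures come from the abstract) converts each availability level into expected downtime per year:

        # Downtime comparison between "five-9s" telecom-grade hardware and a
        # "three-9s" NFV platform, using the availability figures cited above.
        MINUTES_PER_YEAR = 365 * 24 * 60

        for label, availability in [("five-9s (99.999%)", 0.99999),
                                    ("three-9s (99.9%)", 0.999)]:
            downtime = (1 - availability) * MINUTES_PER_YEAR
            print(f"{label}: ~{downtime:.1f} minutes of downtime per year")

        # five-9s  -> ~5.3 minutes/year
        # three-9s -> ~525.6 minutes/year, i.e. about 100x (two orders of
        # magnitude) more downtime, which is the gap ECHO aims to close.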

    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis, and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These systems are composed of various complex software and hardware components, and when a performance or correctness problem occurs, developers struggle to understand its root cause and fix it in a timely manner. My thesis addresses these three aspects of performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers, developing techniques to localize a performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations, can create up to millions of parallel tasks that run on different machines and communicate using the message-passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which first creates a logical progress dependency graph of the tasks to highlight how a problem spreads through the system and manifests as a system-wide performance issue, then uses this graph to identify the task where the problem originated, and finally pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that detects performance anomalies using machine-learning techniques while achieving a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data-collection framework that gathers performance-related metrics from code regions at different granularities, an offline model-creation and prediction-error characterization technique, and a threshold-based anomaly-detection engine for production runs. The system requires few training runs and handles unknown inputs and parameter combinations by dynamically calibrating the anomaly-detection threshold according to the characteristics of the input data and of the models' prediction error. Third, we developed a performance-problem mitigation scheme for erasure-coded distributed storage systems. Repairing failed blocks in an erasure-coded distributed storage system takes a long time in network-constrained data centers: during a repair, data from multiple nodes is gathered onto a single node, where a mathematical operation reconstructs the missing part, severely congesting the links toward the destination that will host the newly recreated data. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes, eliminating the network bottleneck and greatly speeding up the repair process. Fourth, we study how, for a class of applications, performance can be improved (or performance problems mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases can severely degrade the quality of the results, approximating the computation in later phases has very little impact on the final quality. Based on this observation, we developed an optimization framework that, for a given quality-loss budget, finds the best approximation settings for each phase of the execution.
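
    The intuition behind PPR can be sketched in a few lines. The sketch below uses XOR as a stand-in for the real erasure-code arithmetic (the actual repair combines blocks with Galois-field coefficients), and the function names are illustrative, not from the thesis; it contrasts the conventional all-to-one repair with pairwise partial repairs combined over rounds:

        # Minimal sketch of the Partial-Parallel-Repair (PPR) idea, assuming the
        # repair computation is an associative combination of surviving blocks
        # (XOR here stands in for the real erasure-code arithmetic).
        from functools import reduce

        def centralized_repair(blocks):
            # Conventional repair: every surviving block travels to one node,
            # so the link into that node carries len(blocks) transfers.
            result = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
            return result, len(blocks)

        def ppr_repair(blocks):
            # PPR-style repair: nodes combine partial results pairwise in rounds,
            # so no single link carries more than one block per round and the
            # repair finishes in ceil(log2(n)) rounds.
            rounds = 0
            while len(blocks) > 1:
                nxt = [bytes(x ^ y for x, y in zip(blocks[i], blocks[i + 1]))
                       for i in range(0, len(blocks) - 1, 2)]
                if len(blocks) % 2:
                    nxt.append(blocks[-1])   # odd block carries into next round
                blocks = nxt
                rounds += 1
            return blocks[0], rounds

        blocks = [bytes([i] * 8) for i in range(1, 7)]
        assert centralized_repair(blocks)[0] == ppr_repair(blocks)[0]
        print(centralized_repair(blocks)[1], "transfers into one node vs",
              ppr_repair(blocks)[1], "parallel rounds")

    Because XOR (like the real repair arithmetic) is associative and commutative, both paths reconstruct the same block, but the parallel variant spreads the traffic across many links instead of one.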

    A model-based approach for automatic recovery from memory leaks in enterprise applications

    Large-scale distributed computing systems such as data centers are hosted on heterogeneous, networked servers that execute in a dynamic and uncertain operating environment, shaped by factors such as time-varying user workload and various failures. Achieving stringent quality-of-service goals in such an environment is therefore challenging and requires a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach to fault management that integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts to: (1) enable systems to adapt to environment variations, (2) maintain the availability and reliability of the system, and (3) facilitate system recovery from failures. This thesis focuses on memory-leak errors. A characterization function is designed to detect memory leaks; an LLC scheme is then applied to let the computing system adapt efficiently to variations in the workload, recover from memory leaks, and maintain functionality.
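
    As a rough illustration of what a memory-leak characterization function might look like, the sketch below flags a leak when memory usage shows a sustained upward trend over a window of samples. This is only an assumed trend-based detector; the thesis's actual characterization function, thresholds, and sample values are not specified here:

        # Hypothetical leak characterization: fit a least-squares slope to recent
        # memory-usage samples (MB) and flag sustained growth as a candidate leak.
        from statistics import mean

        def leak_suspected(samples, min_slope=0.5):
            n = len(samples)
            xs = range(n)
            x_bar, y_bar = mean(xs), mean(samples)
            slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
                     / sum((x - x_bar) ** 2 for x in xs))
            return slope > min_slope  # persistent upward drift -> suspect a leak

        healthy = [510, 505, 512, 508, 511, 507, 509, 510]   # illustrative data
        leaking = [500, 540, 575, 620, 655, 700, 730, 770]
        print(leak_suspected(healthy), leak_suspected(leaking))  # False True

    A detector of this shape gives the LLC layer a boolean (or scored) signal it can act on, e.g., by scheduling a restart of the leaking component before memory is exhausted.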

    Supporting Quality of Service in Scientific Workflows

    While workflow management systems have been used in enterprises to support business processes for almost two decades, the use of workflows in scientific environments was fairly uncommon until recently. Nowadays, scientists use workflow systems to conduct scientific experiments, simulations, and distributed computations. However, most scientific workflow management systems have not been built on existing workflow technology; rather, they have been designed and developed from scratch. Due to the lack of generality of early scientific workflow systems, many domain-specific workflow systems have been developed. Generally speaking, those domain-specific approaches lack common acceptance and tool support and offer lower robustness compared to business workflow systems. In this thesis, the industry standard BPEL, a workflow language for modeling business processes, is proposed for modeling and executing scientific workflows. Due to the widespread use of BPEL in enterprises, a number of stable and mature software products exist. The language is expressive (Turing-complete) and not restricted to specific applications. BPEL is well suited to modeling scientific workflows, but existing implementations of the standard lack important features that are necessary for executing them. This work presents components that extend an existing implementation of the BPEL standard and eliminate the identified weaknesses, thus providing the technical basis for the use of BPEL in academia. The particular focus is on so-called non-functional (Quality of Service) requirements: scalability, reliability (fault tolerance), data security, and the cost of executing a workflow. From a technical perspective, the workflow system must be able to interface with the middleware systems commonly used by the scientific workflow community to access heterogeneous, distributed resources (especially Grid and Cloud resources). The major components cover exactly these requirements:
    Cloud Resource Provisioner: Scalability is achieved by automatically adding additional (Cloud) resources to the workflow system's resource pool when the system is heavily loaded.
    Fault Tolerance Module: High reliability is achieved via continuous monitoring of workflow execution and corrective interventions, such as re-executing a failed workflow step or replacing the faulty resource.
    Cost-Aware, Data-Flow-Aware Scheduler: Most scientific workflow systems consider only the performance and utilization of resources when scheduling workflow steps. The presented system goes beyond that: by defining preference values weighting cost against anticipated workflow execution time, workflow users may influence resource selection. The developed multiobjective scheduling algorithm respects the defined weighting and makes efficient, advantageous decisions using a heuristic approach (see the sketch after this list).
    Security Extensions: Because it supports various encryption, signature, and authentication mechanisms (e.g., the Grid Security Infrastructure), the workflow system guarantees data security when transferring workflow data.
    Furthermore, this work identifies the need to equip workflow developers with workflow modeling tools that can be used intuitively, and presents two modeling tools that support users with different needs. The first, DAVO (Domain-Adaptable Visual BPEL Orchestrator), operates at a low level of abstraction and allows users with knowledge of BPEL to use the full extent of the language; it is extensible and customizable for different application domains. These features are used in the implementation of the second tool, SimpleBPEL Composer, which is aimed at users with little or no background in computer science and allows quick, intuitive development of BPEL workflows from predefined components.
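
    One plausible shape for the cost/time preference weighting is a normalized weighted sum, sketched below. The resource names, prices, and runtimes are invented for illustration, and the thesis's actual multiobjective heuristic may score candidates differently:

        # Hypothetical weighted-sum resource selection: the user's cost_weight
        # trades normalized price against normalized predicted runtime.
        def select_resource(candidates, cost_weight=0.5):
            # candidates: list of (name, price_per_hour, predicted_runtime_hours)
            max_price = max(p for _, p, _ in candidates) or 1.0  # avoid /0
            max_time = max(t for _, _, t in candidates) or 1.0

            def score(entry):
                _, price, t = entry
                return (cost_weight * price / max_price
                        + (1 - cost_weight) * t / max_time)

            return min(candidates, key=score)[0]

        grid_and_cloud = [("local-grid", 0.00, 6.0),
                          ("cloud-small", 0.10, 4.0),
                          ("cloud-large", 0.45, 1.5)]
        print(select_resource(grid_and_cloud, cost_weight=0.8))  # local-grid
        print(select_resource(grid_and_cloud, cost_weight=0.1))  # cloud-large

    Shifting the weight toward cost steers workflow steps onto free Grid resources; shifting it toward time provisions faster (but billed) Cloud resources, which is exactly the trade-off the scheduler exposes to users.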

    Automated IT Service Fault Diagnosis Based on Event Correlation Techniques

    In recent years, a paradigm shift in the area of IT service management could be witnessed. IT management no longer deals only with networks, end systems, or applications, but is more and more concerned with IT services. This is driven by organizations' need to monitor the efficiency of internal IT departments and to subscribe to IT services from external providers. The trend has raised new challenges in IT service management, especially with respect to service level agreements (SLAs) laying down the quality of service a provider must guarantee. Fault management also faces new challenges related to ensuring compliance with these SLAs. For example, high utilization of network links in the infrastructure can imply increased delay in the delivery of services with respect to agreed time constraints. Such relationships have to be detected and treated in a service-oriented fault diagnosis, which therefore deals not with faults in a narrow sense but with service-quality degradations. This thesis provides a concept for service fault diagnosis, an important part of IT service fault management. First, the need for further work on this issue is motivated through an analysis of services offered by a large IT service provider. A generalization of this scenario forms the basis for a set of requirements, which are used to review related research work and commercial products. Even though solutions for particular challenges already exist, a general approach to service fault diagnosis is still missing. To address this, the main part of the thesis presents a framework whose central component performs event correlation. Event correlation techniques that have been successfully applied to fault management in network and systems management are adapted and extended accordingly. Guidelines for applying the framework to a given scenario are then provided, and to show their feasibility in a real-world setting, they are applied to both of the example services referenced earlier.
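
    At its simplest, event correlation of the kind described links a service-quality event to resource events that precede it within a time window. The sketch below is an assumed minimal rule of this form, with invented timestamps and event descriptions; the thesis's correlation component is considerably richer:

        # Hypothetical time-window correlation: explain each service-level event
        # by resource-level events that occurred shortly before it.
        from datetime import datetime, timedelta

        events = [  # (timestamp, layer, description) -- illustrative only
            (datetime(2024, 1, 1, 10, 0), "resource", "link-7 utilization > 90%"),
            (datetime(2024, 1, 1, 10, 2), "service", "web hosting delivery-delay SLA breach"),
            (datetime(2024, 1, 1, 14, 0), "service", "e-mail delivery-delay SLA breach"),
        ]

        def correlate(events, window=timedelta(minutes=5)):
            for ts, layer, desc in events:
                if layer != "service":
                    continue
                causes = [d for t, l, d in events
                          if l == "resource" and timedelta(0) <= ts - t <= window]
                print(desc, "->", causes or "no correlated resource event (escalate)")

        correlate(events)

    The first SLA breach is explained by the preceding link-utilization alarm; the second has no candidate cause inside the window, the case where a diagnosis must look beyond the network itself.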

    Quality of Service in the IoT: The Fog Layer

    The Internet of Things (IoT) can be defined as a combination of technological push and human pull, an effect that results in ever more connectivity among objects and humans in their surrounding environments [1]. With the recent growth of the IoT, the risk of real-time failures has increased as well. Failures are often detected at certain points of vulnerability in the system; narrowing down to the root causes identifies the points of failure, which in turn leads to the measures required to overcome them. This creates the need for IoT systems to have a proper Quality of Service (QoS) architecture, and QoS is becoming a crucial issue with the democratization of the IoT. QoS is the description or measurement of the overall performance of a service, such as a telephony or computer network or a cloud computing service, particularly the performance seen by the users of the network. In this study, we propose methods for enforcing QoS in IoT platforms. We highlight the challenges and recurrent issues faced by all IoT platforms, which inspired us to build a generic tool that overcomes these challenges by enforcing QoS in any IoT platform with an easy-to-use setup. The main focus of this study is to enable QoS features in the Fog layer of the IoT architecture. Existing platforms and systems that enable QoS features in the Fog layer are also reviewed. Finally, we validate the proposed model by implementing it on our AMI-LAB platform.

    SAVCBS 2004 Specification and Verification of Component-Based Systems: Workshop Proceedings

    This is the proceedings of the 2004 SAVCBS workshop. The workshop is concerned with how formal (i.e., mathematical) techniques can be or should be used to establish a suitable foundation for the specification and verification of component-based systems. Component-based systems are a growing concern for the software engineering community. Specification and reasoning techniques are urgently needed to permit composition of systems from components. Component-based specification and verification is also vital for scaling advanced verification techniques such as extended static analysis and model checking to the size of real systems. The workshop considers formalization of both functional and non-functional behavior, such as performance or reliability.

    Smart Wireless Sensor Networks

    Recent developments in communication and sensor technology have given rise to a new, attractive, and challenging area: wireless sensor networks (WSNs). A wireless sensor network consisting of a large number of sensor nodes is deployed in the field to serve various applications. Equipped with wireless communication and intelligent computation, these nodes become smart sensors that not only perceive ambient physical parameters but are also able to process information, cooperate with each other, and self-organize into a network. These features help the sensor nodes, and the network as a whole, operate more efficiently in terms of both data acquisition and energy consumption. The special purposes of their applications require WSNs to be designed and operated differently from conventional networks such as the Internet: the design must take into account the objectives of the specific application and the nature of the deployment environment, and the limited resources of sensor nodes, such as memory, computational ability, communication bandwidth, and energy, are key design challenges. A smart wireless sensor network must deal with these constraints while guaranteeing the connectivity, coverage, reliability, and security of the network's operation over a maximized lifetime. This book discusses various aspects of designing such smart wireless sensor networks. Main topics include: design methodologies, network protocols and algorithms, quality-of-service management, coverage optimization, time synchronization, and security techniques for sensor networks.