
    An extensible and scalable Pilot-MapReduce framework for data intensive applications on distributed cyberinfrastructure

    The volume and complexity of the data that must be analyzed in scientific applications are increasing exponentially. Often, this data is distributed; thus, the ability to analyze it by localizing it yields limited returns. Efficient processing of large distributed datasets is therefore required, ideally without introducing fundamentally new programming models or methods. For example, it is desirable to extend MapReduce, a proven and effective programming model for processing large datasets, to work more effectively on distributed data and on different infrastructure (such as non-Hadoop, general-purpose clusters). We posit that this can be achieved with an effective and efficient runtime environment and without refactoring MapReduce itself. MapReduce on distributed data requires effective distributed coordination of computation (map and reduce) and data, as well as distributed data management (in particular, the transfer of intermediate data units). To address these requirements, we design and implement Pilot-MapReduce (PMR), a flexible, infrastructure-independent runtime environment for MapReduce. PMR is based on Pilot abstractions for both compute (Pilot-Jobs) and data (Pilot-Data): it uses Pilot-Jobs to couple the map-phase computation to the nearby source data, and Pilot-Data to move intermediate data to the reduce phase using parallel data transfers. We analyze the effectiveness of PMR on applications with different characteristics (e.g., different volumes of intermediate and output data). Our experimental evaluations show that the Pilot abstraction for data movement across multiple clusters is promising and can lower the time span of the entire MapReduce execution. We also investigate the performance of PMR on distributed data using a word count and a genome sequencing application over different MapReduce configurations. By comparing and contrasting PMR with the similar capabilities of Seqal and Crossbow, two Next Generation Sequencing (NGS) Hadoop MapReduce applications, we find that PMR is a viable tool to support distributed NGS analytics. Our experiments show that PMR provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics and achieve optimal performance, both locally and over wide-area multiple clusters.
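
    To make the pattern concrete, the following is a minimal Python sketch of the coordination flow described above, with threads standing in for pilots: map tasks run against their local chunks, intermediate partitions are moved in parallel, and a reduce step merges them. The helper names are hypothetical illustrations, not the PMR API.

```python
# Minimal sketch of the Pilot-MapReduce coordination pattern described
# above. map_task, transfer, and reduce_task are hypothetical stand-ins
# (the real PMR builds on Pilot-Job/Pilot-Data abstractions); threads
# emulate pilots running on remote clusters, for illustration only.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    """Map phase: word count over one locally held data chunk."""
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return counts  # intermediate data: one partition per "pilot"

def transfer(partition):
    """Stand-in for a parallel Pilot-Data transfer of intermediate data."""
    return dict(partition)

def reduce_task(partitions):
    """Reduce phase: merge intermediate partitions on the reduce pilot."""
    totals = defaultdict(int)
    for part in partitions:
        for word, n in part.items():
            totals[word] += n
    return dict(totals)

# Each "pilot" runs the map close to its source data chunk ...
chunks = ["to be or not to be", "be here now", "now or never"]
with ThreadPoolExecutor() as pool:
    intermediates = list(pool.map(map_task, chunks))
    # ... then intermediate partitions are moved in parallel to reduce.
    moved = list(pool.map(transfer, intermediates))

print(reduce_task(moved))  # e.g. {'to': 2, 'be': 3, ...}
```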

    Cloud Computing in VANETs: Architecture, Taxonomy, and Challenges

    Cloud Computing in VANETs (CC-V) has been investigated under two major research themes: Vehicular Cloud Computing (VCC) and Vehicle using Cloud (VuC). VCC is the realization of an autonomous cloud among vehicles so that they can share their abundant resources. VuC is the efficient use of the conventional cloud by on-road vehicles via a reliable Internet connection. Recently, a number of advances have been made to address the issues and challenges in VCC and VuC. This paper qualitatively reviews CC-V with an emphasis on layered architecture, network components, taxonomy, and future challenges. Specifically, a four-layered architecture for CC-V is proposed, comprising the perception, coordination, artificial intelligence, and smart application layers. Three network components of CC-V, namely vehicle, connection, and computation, are explored along with their cooperative roles. A taxonomy for CC-V is presented covering the major research themes in the area: design of architecture, data dissemination, security, and applications. The related literature on each theme is critically investigated with a comparative assessment of recent advances. Finally, some open research challenges, derived from this critical and qualitative assessment of the literature on CC-V, are identified as future issues.
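
    As a rough illustration of the proposed layering, the toy Python sketch below passes one vehicle reading through the four layers; all function names and the threshold rule are hypothetical, not interfaces from the paper.

```python
# Minimal sketch of how a sensor reading might flow through the four
# CC-V layers proposed above (perception -> coordination -> AI -> smart
# application). Every layer function here is a made-up illustration.
def perception_layer(raw):
    """Collect raw on-board sensor data from the vehicle."""
    return {"speed_kmh": raw}

def coordination_layer(reading):
    """Aggregate and forward readings over V2V/V2I connections."""
    return {"vehicle_id": "v42", **reading}

def ai_layer(message):
    """Infer a condition from the aggregated data (toy threshold rule)."""
    message["congestion"] = message["speed_kmh"] < 20
    return message

def smart_application_layer(state):
    """Expose the inference to an end-user service."""
    return "reroute advised" if state["congestion"] else "route OK"

print(smart_application_layer(ai_layer(coordination_layer(perception_layer(12)))))
```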

    Autonomic computing architecture for SCADA cyber security

    Cognitive computing refers to intelligent computing platforms based on artificial intelligence, machine learning, and other innovative technologies. These technologies can be used to design systems that mimic the human brain, learn about their environment, and autonomously predict impending anomalous situations. IBM first used the term ‘Autonomic Computing’ in 2001 to combat the looming complexity crisis (Ganek and Corbi, 2003). The concept is inspired by the human biological autonomic system: an autonomic system is self-healing, self-regulating, self-optimising and self-protecting (Ganek and Corbi, 2003). Such a system should therefore be able to protect itself against both malicious attacks and unintended operator mistakes.
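
    The self-* behaviour described above is commonly structured as a monitor-analyse-plan-execute loop; the sketch below is a toy Python illustration of such a loop for a single SCADA process variable, with a made-up sensor feed and a made-up anomaly rule.

```python
# Toy sketch of a self-protecting control loop, loosely following the
# monitor-analyse-plan-execute pattern from autonomic computing. The
# sensor feed and the anomaly threshold below are hypothetical.
import random

def monitor():
    """Sample a SCADA process variable (toy random feed)."""
    return random.gauss(50.0, 5.0)

def analyse(value, history):
    """Flag values far from the recent mean as anomalous."""
    if not history:
        return False
    mean = sum(history) / len(history)
    return abs(value - mean) > 15.0

def plan_and_execute():
    """Self-protecting action: isolate the channel, alert the operator."""
    print("anomaly detected: isolating channel and alerting operator")

history = []
for _ in range(100):
    v = monitor()
    if analyse(v, history):
        plan_and_execute()
    history = (history + [v])[-20:]  # keep a sliding window of readings
```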

    LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

    LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain an order of magnitude in energy savings from the edge to the converged cloud/HPC.
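
    As a loose illustration of the task-based style the project builds on, the Python sketch below declares two dependent tasks tagged with preferred devices; the decorator, device tags, and scheduling are hypothetical, since real task-based runtimes perform this mapping in the compiler and runtime.

```python
# Toy sketch of task-based programming for heterogeneous hardware:
# work is declared as tasks with data dependencies, and a runtime maps
# them to devices. The @task decorator and device tags are made up.
from concurrent.futures import ThreadPoolExecutor

def task(device):
    """Tag a function with a preferred execution device."""
    def wrap(fn):
        fn.device = device
        return fn
    return wrap

@task("fpga")
def preprocess(x):
    return [v * 2 for v in x]

@task("gpu")
def reduce_sum(x):
    return sum(x)

with ThreadPoolExecutor() as runtime:
    # A real runtime would pick the most energy-efficient device per
    # task; here we only chain futures to show the dependency structure.
    f1 = runtime.submit(preprocess, [1, 2, 3])
    f2 = runtime.submit(reduce_sum, f1.result())
    print(f2.result(), "computed on", reduce_sum.device)
```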

    Timely processing of big data in collaborative large-scale distributed systems

    Today’s Big Data phenomenon, characterized by huge volumes of data produced at very high rates by heterogeneous and geographically dispersed sources, is fostering the employment of large-scale distributed systems that leverage parallelism, fault tolerance and locality awareness to deliver suitable performance. Among the several areas where Big Data is gaining significance, the protection of critical infrastructure is one of the most strategic, since it affects the stability and safety of entire countries. Intrusion detection mechanisms can benefit greatly from novel Big Data technologies, which allow much more information to be exploited to sharpen the accuracy of threat discovery. A key way to further increase the amount of data available for detection is collaboration (meant as information sharing) among distinct actors that share the common goal of recognizing malicious activities as early as possible. Indeed, if an agreement can be found to share their data, they can all markedly improve their cyber defenses. The abstraction of the Semantic Room (SR) allows interested parties to form trusted and contractually regulated federations, the Semantic Rooms, for the sake of secure information sharing and processing. Another crucial point for the effectiveness of cyber protection mechanisms is the timeliness of detection: the sooner a threat is identified, the faster proper countermeasures can be put in place to confine any damage. Within this context, the contributions reported in this thesis are threefold:
    * As a case study to show how collaboration can enhance the efficacy of security tools, we developed a novel algorithm for the detection of stealthy port scans, named R-SYN (Ranked SYN port scan detection); a toy sketch of the windowed ranking idea follows this list. We implemented it in three distinct technologies, all integrated within an SR-compliant architecture that allows for collaboration through information sharing: (i) in a centralized Complex Event Processing (CEP) engine (Esper), (ii) in a framework for distributed event processing (Storm), and (iii) in Agilis, a novel platform for batch-oriented processing which leverages the Hadoop framework and RAM-based storage for fast data access. Regardless of the employed technology, all the evaluations showed that increasing the number of participants (that is, increasing the amount of available input data) improves the detection accuracy. The experiments also made clear that a distributed approach achieves lower detection latency and keeps up with higher input throughput than a centralized one.
    * Distributing the computation over a set of physical nodes raises the issue of how available resources are assigned to the elaboration tasks, with the aim of minimizing the time the computation takes to complete. We investigated this aspect in Storm by developing two distinct scheduling algorithms, both aimed at decreasing the average elaboration time of a single input event by reducing inter-node traffic. Experimental evaluations showed that these two algorithms can improve performance by up to 30%.
    * Computations in online processing platforms (like Esper and Storm) run continuously, and the need to refine running computations or add new ones, together with the need to cope with input variability, requires the ability to adapt the resource allocation at runtime, which entails a set of additional problems. The most relevant concern is how to handle incoming data and processing state while the topology is being reconfigured, along with the temporary performance degradation this causes. To this end, we also explored the alternative approach of running the computation periodically on batches of input data: although it incurs a latency penalty, it eliminates the considerable complexity of dynamic reconfiguration. We chose Hadoop as the batch-oriented processing framework and developed strategies specifically for computations based on time windows, which are commonly used for pattern recognition, as in intrusion detection. Our evaluations compared these strategies and showed the kind of performance this approach can provide.
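
    The following toy Python sketch illustrates the windowed, rank-based idea behind SYN port scan detection referenced in the first contribution; it is not the actual R-SYN algorithm, and the window length and threshold are made-up parameters.

```python
# Toy sketch of windowed, rank-based SYN scan detection (not the actual
# R-SYN algorithm): rank sources by the number of distinct ports probed
# with unanswered SYNs inside a time window, and report the top ones.
from collections import defaultdict

WINDOW = 60.0   # seconds; assumed window length
THRESHOLD = 3   # assumed suspicion threshold (distinct probed ports)

def rank_scanners(events, now):
    """events: iterable of (timestamp, src_ip, dst_port, completed)."""
    half_open = defaultdict(set)
    for ts, src, port, completed in events:
        if now - ts <= WINDOW and not completed:
            half_open[src].add(port)          # unanswered SYN = suspect
    ranked = sorted(half_open.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(src, len(ports)) for src, ports in ranked if len(ports) >= THRESHOLD]

events = [(0.0, "10.0.0.9", p, False) for p in (22, 23, 80, 443)] \
       + [(1.0, "10.0.0.5", 443, True)]
print(rank_scanners(events, now=30.0))   # [('10.0.0.9', 4)]
```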

    A cloud-based smart metering infrastructure for distribution grid services and automation

    The evolution of power systems towards the smart grid paradigm is strictly dependent on the modernization of distribution grids. To achieve this target, new infrastructures, technologies and applications are increasingly required. This paper presents a smart metering infrastructure that unlocks a large set of possible services aimed at the automation and management of distribution grids. The proposed architecture is based on a cloud solution, which handles the communication with the smart meters on one side and provides the needed interfaces to the distribution grid services on the other. While a large number of applications can be designed on top of the cloud, this paper focuses on a real-time distributed state estimation algorithm that enables the automatic reconfiguration of the grid. The paper presents the key role of the cloud solution in obtaining scalability, interoperability and flexibility, and in enabling the integration of different services for the automation of the distribution system. The distributed state estimation algorithm and the automatic network reconfiguration are presented as an example of coordinated operation of different distribution grid services through the cloud.
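
    For intuition, the sketch below shows the classic weighted-least-squares step at the core of most state estimators; the paper's distributed, real-time variant is not reproduced here, and the measurement matrix, readings and weights are made-up numbers.

```python
# Minimal sketch of the weighted-least-squares (WLS) step used by most
# power system state estimators: measurement model z = H x + e, with
# weights W from meter accuracies. All numbers below are illustrative.
import numpy as np

H = np.array([[1.0,  0.0],     # sensitivity of each measurement
              [0.0,  1.0],     # to the two state variables
              [1.0, -1.0]])
z = np.array([1.02, 0.97, 0.06])                      # meter readings
W = np.diag([1 / 0.01**2, 1 / 0.01**2, 1 / 0.02**2])  # 1/sigma^2 weights

# Normal equations: x_hat = (H^T W H)^-1 H^T W z
G = H.T @ W @ H                          # gain matrix
x_hat = np.linalg.solve(G, H.T @ W @ z)
print("estimated state:", x_hat)
```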