25 research outputs found

    In-Transit Molecular Dynamics Analysis with Apache Flink

    In this paper, an on-line parallel analytics framework is proposed to process and store in transit all the data generated by a Molecular Dynamics (MD) simulation, using staging nodes in the same cluster that executes the simulation. Implementing and deploying such a parallel workflow with standard HPC tools, while managing problems such as data partitioning and load balancing, can be a hard task for scientists. We propose to leverage Apache Flink, a scalable stream processing engine from the Big Data domain, in this HPC context. Flink makes it possible to program analyses within a simple window-based map/reduce model, while the runtime takes care of deployment, load balancing, and fault tolerance. We build a complete in-transit analytics workflow, connecting an MD simulation to Apache Flink and to a distributed database, Apache HBase, to persist all the desired data. To demonstrate the expressivity of this programming model and its suitability for HPC scientific environments, two analytics common in the MD field have been implemented. We assessed the performance of this framework, concluding that it can handle simulations of the sizes used in the literature while providing an effective and versatile tool for scientists to easily incorporate on-line parallel analytics into their current workflows.
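To make the window-based map/reduce model concrete, here is a minimal sketch in the PyFlink DataStream API (assuming a recent PyFlink release). The per-atom tuples and the coordinate-summing reduce are illustrative stand-ins for the paper's actual MD analytics; a real deployment would read from the simulation's staging nodes and sink to HBase rather than using a bounded collection and stdout.

```python
# Sketch only: window-based map/reduce over a stream of MD frames.
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in source: each record is (atom_id, x, y, z) for one frame.
# In the paper the stream arrives from the simulation via staging nodes.
frames = env.from_collection([
    (1, 0.00, 0.00, 0.00),
    (1, 0.12, 0.01, 0.00),
    (2, 5.00, 5.00, 5.00),
    (2, 5.03, 5.20, 5.01),
])

# Key by atom, then reduce positions inside each tumbling window
# (summing coordinates stands in for a real per-atom MD analytic).
(frames
    .key_by(lambda r: r[0])
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(lambda a, b: (a[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]))
    .print())  # a real deployment would sink to HBase instead of stdout

env.execute("md_in_transit_analytics_sketch")
```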

    X-Composer: Enabling Cross-Environments In-Situ Workflows between HPC and Cloud

    As large-scale scientific simulations and big data analyses become more popular, it is increasingly expensive to store huge amounts of raw simulation results for post-analysis. To minimize the expensive data I/O, "in-situ" analysis is a promising approach, in which data analysis applications analyze the simulation-generated data on the fly without storing it first. However, it is challenging to organize, transform, and transport data at scale between two semantically different ecosystems, owing to their distinct software and hardware differences. To tackle these challenges, we design and implement the X-Composer framework. X-Composer connects cross-ecosystem applications to form an "in-situ" scientific workflow, and provides a unified approach and recipe for supporting such hybrid in-situ workflows on distributed heterogeneous resources. X-Composer reorganizes simulation data as continuous data streams and feeds them seamlessly into Cloud-based stream processing services to minimize I/O overheads. For evaluation, we use X-Composer to set up and execute a cross-ecosystem workflow consisting of a parallel Computational Fluid Dynamics simulation running on HPC and a distributed Dynamic Mode Decomposition analysis application running on Cloud. Our experimental results show that X-Composer can seamlessly couple HPC and Big Data jobs in their own native environments, achieve good scalability, and provide high-fidelity analytics for ongoing simulations in real time.
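The analysis half of the evaluated workflow is Dynamic Mode Decomposition (DMD). As a point of reference, the following is a minimal single-node NumPy sketch of standard exact DMD on a snapshot matrix; the distributed, stream-fed variant that X-Composer runs on Cloud services is more involved, and the toy data below is illustrative.

```python
import numpy as np

def dmd(snapshots: np.ndarray, rank: int):
    """Exact DMD on a matrix whose columns are successive states x_0..x_m."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]        # time-shifted pairs
    U, s, Vh = np.linalg.svd(X, full_matrices=False)  # thin SVD of X
    Ur, sr, Vr = U[:, :rank], s[:rank], Vh[:rank].conj().T
    # Reduced operator A_tilde = Ur^H Y Vr Sigma_r^{-1}, an r x r proxy
    # for the full linear map advancing the state one step in time.
    A_tilde = Ur.conj().T @ Y @ Vr @ np.diag(1.0 / sr)
    eigvals, W = np.linalg.eig(A_tilde)               # DMD eigenvalues
    modes = Y @ Vr @ np.diag(1.0 / sr) @ W            # exact DMD modes
    return eigvals, modes

# Toy usage: a 64-dimensional state observed over 50 time steps.
rng = np.random.default_rng(0)
eigvals, modes = dmd(rng.standard_normal((64, 50)), rank=5)
print(eigvals)  # each eigenvalue encodes one mode's growth/decay and frequency
```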

    Mobility mining for time-dependent urban network modeling

    Mobility planning, monitoring, and analysis in an ecosystem as complex as a city are very challenging. Our contributions are expected to be a small step towards a more integrated vision of mobility management. The main hypothesis behind this thesis is that the transportation offer and the mobility demand are strongly coupled, and thus both need to be thoroughly and consistently represented in a digital manner so as to enable good-quality data-driven advanced analysis. Data-driven analytics solutions rely on measurements. However, sensors only provide a measure of movements that have already occurred (and associated magnitudes, such as vehicles per hour). For a movement to happen there are two main requirements: i) the demand (the need or interest) and ii) the offer (the feasibility and resources). In addition, for a good measurement, the sensor needs to be placed at an adequate location and be able to collect data at the right moment. All this information needs to be digitalised accordingly in order to apply advanced data-analytic methods and take advantage of a good digital representation of transportation resources. Our main contributions, focused on mobility data mining over urban transportation networks, can be summarised in three groups. The first group consists of a comprehensive description of a digital multimodal transport infrastructure representation from global and local perspectives. The second group is oriented towards matching diverse sensor data onto the transportation network representation, including a quantitative analysis of map-matching algorithms. The final group of contributions covers the prediction of short-term demand based on various measures of urban mobility.
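As background for the map-matching analysis mentioned in the second group, the simplest geometric baseline snaps each GPS fix to the nearest road segment by point-to-segment projection. A minimal sketch follows (planar coordinates assumed, toy network invented); production map-matchers additionally exploit network topology and sequence models such as HMMs.

```python
import math

def point_to_segment(p, a, b):
    """Euclidean distance from point p to segment ab (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0.0 and dy == 0.0:  # degenerate segment
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the foot stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def map_match(fix, segments):
    """Greedy baseline: snap one GPS fix to its nearest road segment."""
    return min(segments, key=lambda s: point_to_segment(fix, s[0], s[1]))

# Toy network: one horizontal and one vertical street meeting at the origin.
roads = [((0.0, 0.0), (2.0, 0.0)), ((0.0, 0.0), (0.0, 2.0))]
print(map_match((1.0, 0.2), roads))  # -> ((0.0, 0.0), (2.0, 0.0))
```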

    An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management

    Smart urban transportation management can be considered a multifaceted big data challenge. It relies strongly on the information collected in multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack (from platform to services and applications) Information and Communications Technology (ICT) solutions need to be specifically adopted to address smart-city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper focuses on the City Administration Dashboard, a public transport analytics application developed on top of the EUBra-BIGSEA platform and used by the municipal stakeholders of Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The solution proposed in this paper joins together a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, and security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid, and transparent development of smart-city applications. This work was supported by the European Commission through the Cooperation Programme under EUBra-BIGSEA Horizon 2020 Grant 690116 [this project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Brazilian Ministry of Science, Technology and Innovation (MCTI)]. Fiore, S.; Elia, D.; Pires, C. E.; Mestre, D. G.; Cappiello, C.; Vitali, M.; Andrade, N.; et al. (2019). An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management. IEEE Access, 7, 117652-117677. https://doi.org/10.1109/ACCESS.2019.2936941
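Among the platform components above, entity matching is the most self-contained to illustrate. The following token-based Jaccard matcher is a common baseline for linking records that describe the same real-world entity (for example, the same bus stop spelled differently across sources); it is a sketch for illustration, not the project's actual implementation, and the sample stop names are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two record descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def match_entities(left, right, threshold=0.5):
    """Pair records from two sources whose similarity clears the threshold."""
    return [(l, r, round(jaccard(l, r), 2))
            for l in left for r in right
            if jaccard(l, r) >= threshold]

# Invented sample data: the same stops described by two different sources.
stops_a = ["Av. Sete de Setembro 100", "Rua XV de Novembro 25"]
stops_b = ["avenida sete de setembro 100", "rua xv de novembro 25"]
print(match_entities(stops_a, stops_b))
```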

    High-Performance Modelling and Simulation for Big Data Applications

    This open access book was prepared as the final publication of the COST Action IC1406 "High-Performance Modelling and Simulation for Big Data Applications" (cHiPSet) project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predicting and analysing natural and complex systems in science and engineering. As their level of abstraction rises to afford a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. High Performance Computing, on the other hand, typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication, and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    Multimodal Approach for Big Data Analytics and Applications

    The thesis presents multimodal conceptual frameworks and their applications in improving the robustness and the performance of big data analytics through cross-modal interaction or integration. A joint interpretation of several knowledge renderings, such as stream, batch, linguistics, visuals, and metadata, creates a unified view that can provide a more accurate and holistic approach to data analytics than a single standalone knowledge base. Novel approaches in the thesis involve integrating the multimodal framework with state-of-the-art computational models for big data, cloud computing, natural language processing, image processing, video processing, and contextual metadata. The integration of these disparate fields has the potential to improve computational tools and techniques dramatically. Thus, the contributions place multimodality at the forefront of big data analytics; the research aims at mapping and understanding multimodal correspondence between different modalities. The primary contribution of the thesis is the Multimodal Analytics Framework (MAF), a collaborative ensemble framework for stream and batch processing, along with cues from multiple input modalities like language, visuals, and metadata, that combines the benefits of both low latency and high throughput. The framework is a five-step process:

    1. Data ingestion. As a first step towards Big Data analytics, a high-velocity, fault-tolerant streaming data acquisition pipeline is proposed through a distributed big data setup, followed by mining and searching patterns in the data while it is still in transit. The data ingestion methods are demonstrated using Hadoop-ecosystem tools like Kafka and Flume as sample implementations (see the sketch after this list).

    2. Decision making on the ingested data to use the best-fit tools and methods. In Big Data analytics, a primary challenge lies in processing heterogeneous data pools with a one-method-fits-all approach. The research introduces a decision-making system to select the best-fit solutions for the incoming data stream. This is the second step towards building the data processing pipeline presented in the thesis. The decision-making system introduces a fuzzy graph-based method to provide real-time and offline decision-making.

    3. Lifelong incremental machine learning. In the third step, the thesis describes a lifelong learning model at the processing layer of the analytical pipeline, following the data acquisition and decision making of the earlier steps. Lifelong learning iteratively increments the training model using the proposed Multi-agent Lambda Architecture (MALA), a collaborative ensemble architecture between stream and batch data. As part of the proposed MAF, MALA is one of the primary contributions of the research. The work introduces a general-purpose and comprehensive approach to hybrid learning over batch and stream processing to achieve lifelong learning objectives.

    4. Improving machine learning results through ensemble learning. As an extension of the lifelong learning model, the thesis proposes a boosting-based ensemble method as the fourth step of the framework, improving lifelong learning results by reducing the learning error in each iteration of a streaming window. The strategy is to incrementally boost the learning accuracy on each iterated mini-batch, enabling the model to accumulate knowledge faster. The base learners adapt more quickly in smaller intervals of a sliding window, improving the machine learning accuracy rate by countering concept drift.

    5. Cross-modal integration between text, image, video, and metadata for more comprehensive data coverage than a text-only dataset. The final contribution of this thesis is a new multimodal method in which three different modalities (text, visuals, and metadata) are intertwined along with real-time and batch data for more comprehensive input data coverage than text-only data. The model is validated through a detailed case study on the contemporary and relevant topic of the COVID-19 pandemic. While the remainder of the thesis deals with text-only input, the COVID-19 case study analyses textual and visual information in integration. As an extension of the current framework, multimodal machine learning is identified as a future research direction.
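Step one above names Kafka as a sample implementation of the ingestion pipeline. A minimal sketch of publishing records into such a pipeline with the kafka-python client follows; the broker address, topic name, and event schema are illustrative assumptions, not the thesis's actual configuration.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Broker address, topic name, and event schema are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish records into the pipeline; downstream stages (decision making,
# lifelong learning) consume the same topic while the data is in transit.
for i in range(10):
    producer.send("maf.ingest", {"id": i, "ts": time.time(), "text": f"sample-{i}"})

producer.flush()  # block until every queued record has been delivered
```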

    Evaluating and Enabling Scalable High Performance Computing Workloads on Commercial Clouds

    Performance, usability, and accessibility are critical components of high performance computing (HPC). Usability and performance are especially important to academic researchers, as they generally have little time to learn a new technology and demand a certain level of performance to ensure the quality and quantity of their research results. We have observed that while not all workloads run well in the cloud, some perform well. We have also observed that although commercial cloud adoption by industry has been growing at a rapid pace, its use by academic researchers has not grown as quickly. We aim to help close this gap and enable researchers to utilize the commercial cloud more efficiently and effectively. We present our results on architecting and benchmarking an HPC environment on Amazon Web Services (AWS), where we observe that particular types of applications are and are not suited for the commercial cloud. Then, we present our results on architecting and building a provisioning and workflow management tool (PAW), an application that enables a user to launch an HPC environment in the cloud, execute a customizable workflow, and automatically delete the HPC environment after the workflow has completed. We then present our results on the scalability of PAW and the commercial cloud for compute-intensive workloads by deploying a 1.1 million vCPU cluster. We then discuss our research into the feasibility of utilizing commercial cloud infrastructure to help tackle the large spikes and data-intensive characteristics of Transportation Cyberphysical Systems (TCPS) workloads. Then, we present our research on utilizing the commercial cloud for urgent HPC applications by deploying a 1.5 million vCPU cluster to process 211 TB of traffic video data for use by first responders during an evacuation. Lastly, we present the contributions and conclusions drawn from this work.
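PAW's own interface is not reproduced here, but the launch-run-teardown pattern it automates can be sketched with boto3, the AWS SDK for Python. The AMI ID, instance type, node count, and the workflow placeholder below are assumptions for illustration, not values from the work.

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

ec2 = boto3.client("ec2", region_name="us-east-1")

def run_workflow(instance_ids):
    """Placeholder: configure the cluster and submit the user's jobs here."""

# 1. Launch the compute nodes of a disposable HPC environment.
#    The AMI ID, instance type, and count are placeholders, not real values.
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="c5.18xlarge",
    MinCount=4,
    MaxCount=4,
)
ids = [inst["InstanceId"] for inst in resp["Instances"]]

# 2. Wait for the nodes to come up, then hand off to the user's workflow.
ec2.get_waiter("instance_running").wait(InstanceIds=ids)
run_workflow(ids)

# 3. Tear the environment down once the workflow has completed.
ec2.terminate_instances(InstanceIds=ids)
```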

    30th International Conference on Condition Monitoring and Diagnostic Engineering Management (COMADEM 2017)

    Proceedings of COMADEM 2017