112,499 research outputs found

    On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research

    Full text link
    Scientific research requires access, analysis, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. The bootstrapping of this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. a lazy ETL process loads only the metadata, but still eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, due to the availability of the entire data beforehand. In this paper, we propose a novel ETL approach for scientific data integration, as a hybrid of eager and lazy ETL approaches, and applied both to data as well as metadata. This way, Hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach, to enhance the hybrid ETL, with selective data integration driven by the user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Obidos, and evaluate it in the context of data sharing for medical research. Obidos outperforms both the eager ETL and lazy ETL approaches, for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.Comment: Pre-print Submitted to the DMAH Special Issue of the Springer DAPD Journa

    CoRide: Joint Order Dispatching and Fleet Management for Multi-Scale Ride-Hailing Platforms

    Get PDF
    How to optimally dispatch orders to vehicles and how to tradeoff between immediate and future returns are fundamental questions for a typical ride-hailing platform. We model ride-hailing as a large-scale parallel ranking problem and study the joint decision-making task of order dispatching and fleet management in online ride-hailing platforms. This task brings unique challenges in the following four aspects. First, to facilitate a huge number of vehicles to act and learn efficiently and robustly, we treat each region cell as an agent and build a multi-agent reinforcement learning framework. Second, to coordinate the agents from different regions to achieve long-term benefits, we leverage the geographical hierarchy of the region grids to perform hierarchical reinforcement learning. Third, to deal with the heterogeneous and variant action space for joint order dispatching and fleet management, we design the action as the ranking weight vector to rank and select the specific order or the fleet management destination in a unified formulation. Fourth, to achieve the multi-scale ride-hailing platform, we conduct the decision-making process in a hierarchical way where a multi-head attention mechanism is utilized to incorporate the impacts of neighbor agents and capture the key agent in each scale. The whole novel framework is named as CoRide. Extensive experiments based on multiple cities real-world data as well as analytic synthetic data demonstrate that CoRide provides superior performance in terms of platform revenue and user experience in the task of city-wide hybrid order dispatching and fleet management over strong baselines.Comment: CIKM 201
    corecore