    Big Automotive Data Preprocessing: A Three Stages Approach

    The automotive industry generates large datasets of various formats, uncertainties and frequencies. To exploit Automotive Big Data, the data needs to be connected, fused and preprocessed into quality datasets before being used for production and business processes. Data preprocessing tasks are typically expensive, tightly coupled with their intended AI algorithms, and done manually by domain experts. Hence there is a need to automate data preprocessing to seamlessly generate cleaner data. We intend to introduce a generic data preprocessing framework that handles vehicle-to-everything (V2X) data streams and dynamic updates. We intend to decentralize and automate data preprocessing by leveraging edge computing, with the objective of progressively improving the quality of the dataflow within edge components (vehicles) and onto the cloud.

    AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing

    Distributed Stream Processing Systems (DSPSs) are currently among the most rapidly emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The major market players in this domain are clearly represented by Apache Spark and Flink, which provide a variety of frontend APIs for SQL, statistical inference, machine learning, stream processing, and many others. Yet rather few details are reported on the integration of these engines into the underlying High-Performance Computing (HPC) infrastructure and the communication protocols they use. Spark and Flink, for example, are implemented in Java and still rely on a dedicated master node for managing their control flow among the worker nodes in a compute cluster. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI) and pthreads for multithreading, and is directly deployed on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as "Asynchronous Iterative Routing"), which facilitates direct and asynchronous communication among all client nodes and thereby completely avoids the overhead induced by the control flow with a master node that may otherwise form a performance bottleneck. Our experiments over a variety of benchmark settings confirm that AIR outperforms Spark and Flink in terms of latency and throughput by a factor of up to 15; moreover, we demonstrate that AIR scales out much better than existing DSPSs to clusters consisting of up to 8 nodes and 224 cores.
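
    A rough C++/MPI sketch of the routing idea (our own illustration, not AIR's code; the key ranges and key-to-node hash are invented) is shown below: every rank shards its locally produced tuples by key and ships them directly to the owning peers with non-blocking sends, so no master node mediates the exchange.

```cpp
// Direct, asynchronous peer-to-peer sharding among MPI ranks (no master).
// Illustrative only; launch e.g. with a SLURM-style `srun -n 4 ./demo`.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank produces tuples and decides their destination locally by
    // hashing the key -- no coordinator is consulted.
    std::vector<std::vector<int>> outbox(size);
    for (int key = rank * 8; key < rank * 8 + 8; ++key)
        outbox[key % size].push_back(key);

    // Post non-blocking sends to all peers (the asynchronous routing step).
    std::vector<MPI_Request> reqs(size);
    for (int dst = 0; dst < size; ++dst)
        MPI_Isend(outbox[dst].data(), (int)outbox[dst].size(), MPI_INT,
                  dst, 0, MPI_COMM_WORLD, &reqs[dst]);

    // Receive one shard from every peer; a real engine iterates this
    // send/receive cycle per dataflow operator.
    long local_sum = 0;
    for (int src = 0; src < size; ++src) {
        MPI_Status st;
        MPI_Probe(src, 0, MPI_COMM_WORLD, &st);
        int n;
        MPI_Get_count(&st, MPI_INT, &n);
        std::vector<int> buf(n);
        MPI_Recv(buf.data(), n, MPI_INT, src, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int v : buf) local_sum += v;
    }
    MPI_Waitall(size, reqs.data(), MPI_STATUSES_IGNORE);
    std::printf("rank %d: received keys summing to %ld\n", rank, local_sum);
    MPI_Finalize();
    return 0;
}
```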

    Transforming Data Preprocessing: A Holistic, Normalized and Distributed Approach

    Substantial volumes of data are generated at the edge as a result of an exponential increase in the number of Internet of Things (IoT) applications. IoT data are generated at edge components and, in most cases, transmitted to central or cloud infrastructures via the network. Distributing data preprocessing to the edge and closer to the data sources would address issues found in the data early in the pipeline. Thus, distribution prevents error propagation, removes redundancies, minimizes privacy leakage and optimally summarizes the information contained in the data prior to transmission. This, in turn, prevents wasting valuable yet limited resources at the edge, which would otherwise be used for transmitting data that may contain anomalies and redundancies. New legal requirements such as the GDPR, along with ethical responsibilities, render data preprocessing that addresses these emerging topics urgent, especially at the edge, before the data leaves the premises of data owners. This PhD dissertation is divided into two parts that focus on two main directions within data preprocessing. The first part focuses on structuring and normalizing the data preprocessing design phase for AI applications. This involved an extensive and comprehensive survey of data preprocessing techniques coupled with an empirical analysis. From the survey, we introduced a holistic and normalized definition and scope of data preprocessing. We also identified the means of generalizing data preprocessing by abstracting preprocessing techniques into categories and sub-categories. Our survey and empirical analysis highlighted dependencies and relationships between the different categories and sub-categories, which determine the order of execution within preprocessing pipelines. The identified categories, sub-categories and their dependencies were assembled into a novel data preprocessing design tool that serves as a template from which application- and dataset-specific preprocessing plans and pipelines are derived. The design tool is agnostic to datasets and applications and is a crucial step towards normalizing, regulating and structuring the design of data preprocessing pipelines. The tool helps practitioners and researchers apply a modern take on data preprocessing that enhances the reproducibility of preprocessed datasets and addresses a broader spectrum of issues in the data. The second part of the dissertation focuses on leveraging edge computing within an IoT context to distribute data preprocessing at the edge. We empirically evaluated the feasibility of distributing data preprocessing techniques from different categories and assessed the impact of the distribution, including on the consumption of different resources such as time, storage, bandwidth and energy. To perform the distribution, we proposed a collaborative edge-cloud framework dedicated to data preprocessing with two main mechanisms that achieve synchronization and coordination. The synchronization mechanism is an Over-The-Air (OTA) updating mechanism that remotely pushes updated preprocessing plans to the different edge components in response to changes in user requirements or the evolution of data characteristics. The coordination mechanism is a resilient and progressive execution mechanism that leverages a Directed Acyclic Graph (DAG) to represent the data preprocessing plans. Distributed preprocessing plans are shared between different cloud and edge components and are progressively executed while adhering to the topological order dictated by the DAG representation. To empirically test our proposed solutions, we developed a prototype of our edge-cloud collaborative data preprocessing framework, named DeltaWing, which consists of three stages: one central stage and two edge stages. A use case was also designed based on a dataset obtained from Honda Research Institute US. Using DeltaWing and the use case, we simulated an Automotive IoT application to evaluate our proposed solutions. Our empirical results highlight the effectiveness and positive impact of our framework in reducing the consumption of valuable resources (e.g., ≈ 57% reduction in bandwidth usage) at the edge while retaining information (prediction accuracy) and maintaining operational integrity. The two parts of the dissertation are interconnected yet can exist independently. Combined, their contributions constitute a generic toolset for the optimization of the data preprocessing phase.
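
    To make the coordination mechanism concrete, here is a minimal C++ sketch (not DeltaWing's actual code; the step names are hypothetical) that models a preprocessing plan as a DAG and derives an execution order with Kahn's topological sort, the same ordering constraint the abstract describes.

```cpp
// A preprocessing plan as a DAG of abstract steps; Kahn's algorithm yields
// a topological execution order. Compile with C++17.
#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Plan {
    std::map<std::string, std::vector<std::string>> edges;  // step -> successors
    std::map<std::string, int> indeg;                       // step -> in-degree

    void add(const std::string& from, const std::string& to) {
        edges[from].push_back(to);
        ++indeg[to];
        (void)indeg[from];  // ensure 'from' is tracked (inserts 0 if absent)
    }

    std::vector<std::string> topo_order() {
        std::map<std::string, int> deg = indeg;
        std::queue<std::string> ready;
        for (const auto& [step, d] : deg)
            if (d == 0) ready.push(step);
        std::vector<std::string> order;
        while (!ready.empty()) {
            std::string s = ready.front();
            ready.pop();
            order.push_back(s);
            for (const auto& nxt : edges[s])
                if (--deg[nxt] == 0) ready.push(nxt);
        }
        return order;  // shorter than indeg.size() iff the plan has a cycle
    }
};

int main() {
    Plan plan;  // hypothetical steps for an automotive data stream
    plan.add("drop_duplicates", "impute_missing");
    plan.add("impute_missing", "normalize");
    plan.add("drop_duplicates", "filter_outliers");
    plan.add("filter_outliers", "normalize");
    for (const auto& step : plan.topo_order())
        std::printf("execute: %s\n", step.c_str());
    return 0;
}
```

    In a distributed setting, each stage would execute only the plan nodes assigned to it, in this order, before handing the dataflow to the next stage.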

    The Impact Of Distributed Data Preprocessing On Automotive Data Streams

    Vehicles have transformed into sophisticated computing machines that not only serve the objective of transportation from point A to point B but also serve other objectives, including an improved experience, a safer journey, and automated, more efficient and sustainable transportation. With such sophistication come complex applications and enormous volumes of data generated from diverse types of vehicle sensors and components. Automotive data is not sedentary but moves from the edge (the vehicle) to the cloud (e.g., infrastructure of the vehicle manufacturers, national highway agencies, insurance companies, etc.). The exponential increase in data volume and variety generated in modern vehicles far exceeds the rate of infrastructure scaling and expansion. To mitigate this challenge, the computational and storage capacities of vehicle components can be leveraged to perform in-vehicle operations on the data to either prepare and transform (preprocess) the data or extract information from (process) the data. This paper focuses on distributing data preprocessing to the vehicle and highlights the benefits and impact of the distribution, including on the consumption of resources (e.g., energy).
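
    As a hedged illustration of such in-vehicle preprocessing (our own sketch; the window size and signal values are invented), the following C++ snippet condenses raw sensor readings into per-window means before transmission, so only a fraction of the raw volume leaves the vehicle.

```cpp
// Windowed in-vehicle summarization: transmit one mean per window instead
// of every raw sample. Illustrative numbers only.
#include <algorithm>
#include <cstdio>
#include <vector>

// Reduce a raw signal to one mean per fixed-size window; the cloud then
// receives roughly len/window values instead of len raw samples.
std::vector<double> summarize(const std::vector<double>& raw, std::size_t window) {
    std::vector<double> out;
    for (std::size_t i = 0; i < raw.size(); i += window) {
        std::size_t n = std::min(window, raw.size() - i);
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j) sum += raw[i + j];
        out.push_back(sum / n);
    }
    return out;
}

int main() {
    std::vector<double> speed(1000, 13.9);  // e.g. 1000 raw speed readings
    auto summary = summarize(speed, 100);   // 10 values leave the vehicle
    std::printf("transmit %zu values instead of %zu\n",
                summary.size(), speed.size());
    return 0;
}
```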

    Transforming IoT Data Preprocessing: A Holistic, Normalized and Distributed Approach

    Data preprocessing is an integral part of Artificial Intelligence (AI) pipelines. It transforms raw data into input data that fulfill algorithmic criteria and improve prediction accuracy. As the adoption of the Internet of Things (IoT) gains more momentum, the data volume generated at the edge is increasing exponentially, far exceeding any expansion of infrastructure. Social responsibilities and regulations (e.g., GDPR) must also be adhered to when handling IoT data. In addition, we are currently witnessing a shift towards distributing AI to the edge. The aforementioned reasons render the distribution of data preprocessing to the edge an urgent requirement. In this paper, we introduce a modern data preprocessing framework that consists of two main parts. The first part is a design tool that reduces the complexity and costs of the data preprocessing phase for AI via generalization and normalization. The design tool is a standard template that maps specific techniques into abstract categories and highlights dependencies between them. In addition, it presents a holistic notion of data preprocessing that is not limited to data cleaning. The second part is an IoT tool that adopts the edge-cloud collaboration model to progressively improve the quality of the data. It includes a synchronization mechanism that ensures adaptation to changes in data characteristics and a coordination mechanism that ensures correct and complete execution of preprocessing plans between the cloud and the edge. The paper includes an empirical analysis of the framework using a developed prototype and an automotive use case. Our results demonstrate reductions in resource consumption (e.g., energy, bandwidth) while maintaining the value and integrity of the data.
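
    The synchronization mechanism can be pictured with the following C++ sketch (the names, versions and fetch call are our assumptions, not the paper's API): an edge component adopts an updated preprocessing plan only when the plan version published by the cloud advances.

```cpp
// Version-gated plan update at an edge component (assumed OTA-style flow).
// Compile with C++17.
#include <cstdio>
#include <string>
#include <vector>

struct PreprocessingPlan {
    int version = 0;
    std::vector<std::string> steps;  // abstract step names, in plan order
};

// Stand-in for a network call; a real system would fetch this from the cloud.
PreprocessingPlan fetch_plan_from_cloud() {
    return {2, {"drop_duplicates", "impute_missing", "normalize"}};
}

int main() {
    PreprocessingPlan active{1, {"drop_duplicates", "normalize"}};
    PreprocessingPlan latest = fetch_plan_from_cloud();
    if (latest.version > active.version) {
        active = latest;  // adopt the remotely pushed update
        std::printf("synchronized to plan v%d with %zu steps\n",
                    active.version, active.steps.size());
    }
    return 0;
}
```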

    Synchronized Preprocessing of Sensor Data

    Sensor data, whether collected for machine learning, deep learning or other applications, must be preprocessed to fit input requirements or to improve performance and accuracy. Data preparation is an expensive, resource-consuming and complex phase, often performed centrally on raw data for a specific application. The dataflow between the edge and the cloud can be enhanced in terms of efficiency, reliability and lineage by preprocessing the datasets closer to their data sources. We propose a dedicated data preprocessing framework that distributes preprocessing tasks between a cloud stage and two edge stages to create a dataflow with progressively improving quality. The framework handles heterogeneous data and dynamic preprocessing plans while simultaneously targeting diverse applications and use cases from different domains. Each stage autonomously executes sensor-specific preprocessing plans in parallel while synchronizing the progressive execution and dynamic updates of the preprocessing plans with the other stages. Our approach minimizes the workload on central infrastructures and reduces the resources used for transferring raw data from the edge. We also demonstrate that preprocessing data can be sensor-specific rather than application-specific and thus can be performed prior to knowing a specific application.
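
    A minimal C++ sketch of sensor-specific, parallel plan execution (illustrative only; the sensors and steps are invented) might look as follows: each sensor type carries its own ordered plan, and one thread per sensor runs it independently of any downstream application.

```cpp
// One thread per sensor executes that sensor's own preprocessing plan.
// Compile with C++17 and -pthread.
#include <cstdio>
#include <map>
#include <string>
#include <thread>
#include <vector>

using Plan = std::vector<std::string>;  // ordered, sensor-specific steps

void run_plan(const std::string& sensor, const Plan& plan) {
    for (const auto& step : plan)
        std::printf("[%s] %s\n", sensor.c_str(), step.c_str());
}

int main() {
    // Hypothetical sensor-to-plan mapping; plans differ per sensor, not per app.
    std::map<std::string, Plan> plans = {
        {"gps",   {"drop_outliers", "interpolate_gaps"}},
        {"lidar", {"denoise", "downsample"}},
        {"imu",   {"calibrate", "resample"}},
    };
    std::vector<std::thread> workers;
    for (const auto& [sensor, plan] : plans)
        workers.emplace_back(run_plan, sensor, plan);
    for (auto& w : workers) w.join();
    return 0;
}
```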

    AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing

    Distributed Stream Processing Engines (DSPEs) are currently among the most rapidly emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI) and pthreads for multithreading, and is directly deployed on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as “Asynchronous Iterative Routing”), which facilitates direct and asynchronous communication among all worker nodes and thereby completely avoids any additional communication overhead with a dedicated master node. With its unique design, AIR fills the gap between the prevalent scale-out (but Java-based) architectures like Apache Spark and Flink, on the one hand, and recent scale-up (and C++-based) prototypes such as StreamBox and PiCo, on the other hand. Our experiments over various benchmark settings confirm that AIR performs as well as the best scale-up SPEs in a single-node setup, while it outperforms existing scale-out DSPEs in terms of processing latency and sustainable throughput by a factor of up to 15 in a distributed setting.