1,316 research outputs found

    Data science applications to connected vehicles: Key barriers to overcome

    Get PDF
    The connected vehicles will generate huge amount of pervasive and real time data, at very high frequencies. This poses new challenges for Data science. How to analyse these data and how to address short-term and long-term storage are some of the key barriers to overcome.JRC.C.6-Economics of Climate Change, Energy and Transpor

    On-the-fly tracing for data-centric computing : parallelization, workflow and applications

    Get PDF
    As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is now a known programming model for data-centric computing where parallelization is completely replaced by partitioning the computing task through data, not all programs particularly those using statistical computing and data mining algorithms with interdependence can be re-factorized in such a fashion. On the other hand, many traditional automatic parallelization methods put an emphasis on formalism and may not achieve optimal performance with the given limited computing resources. In this work we propose a cross-platform programming paradigm, called on-the-fly data tracing , to provide source-to-source transformation where the same framework also provides the functionality of workflow optimization on larger applications. Using a big-data approximation computations related to large-scale data input are identified in the code and workflow and a simplified core dependence graph is built based on the computational load taking in to account big data. The code can then be partitioned into sections for efficient parallelization; and at the workflow level, optimization can be performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. Regarding each unit in both source code and workflow as a model, this framework enables model-based parallel programming that matches the available computing resources. The techniques used in model-based parallel programming as well as the design of the software framework for both parallelization and workflow optimization as well as its implementations with multiple programming languages are presented in the dissertation. Then, the following experiments are performed to validate the framework: i) the benchmarking of parallelization speed-up using typical examples in data analysis and machine learning (e.g. naive Bayes, k-means) and ii) three real-world applications in data-centric computing with the framework are also described to illustrate the efficiency: pattern detection from hurricane and storm surge simulations, road traffic flow prediction and text mining from social media data. In the applications, it illustrates how to build scalable workflows with the framework along with performance enhancements

    A Systematic Review of Learning based Notion Change Acceptance Strategies for Incremental Mining

    Get PDF
    The data generated contemporarily from different communication environments is dynamic in content different from the earlier static data environments. The high speed streams have huge digital data transmitted with rapid context changes unlike static environments where the data is mostly stationery. The process of extracting, classifying, and exploring relevant information from enormous flowing and high speed varying streaming data has several inapplicable issues when static data based strategies are applied. The learning strategies of static data are based on observable and established notion changes for exploring the data whereas in high speed data streams there are no fixed rules or drift strategies existing beforehand and the classification mechanisms have to develop their own learning schemes in terms of the notion changes and Notion Change Acceptance by changing the existing notion, or substituting the existing notion, or creating new notions with evaluation in the classification process in terms of the previous, existing, and the newer incoming notions. The research in this field has devised numerous data stream mining strategies for determining, predicting, and establishing the notion changes in the process of exploring and accurately predicting the next notion change occurrences in Notion Change. In this context of feasible relevant better knowledge discovery in this paper we have given an illustration with nomenclature of various contemporarily affirmed models of benchmark in data stream mining for adapting the Notion Change

    Dynamic adversarial mining - effectively applying machine learning in adversarial non-stationary environments.

    Get PDF
    While understanding of machine learning and data mining is still in its budding stages, the engineering applications of the same has found immense acceptance and success. Cybersecurity applications such as intrusion detection systems, spam filtering, and CAPTCHA authentication, have all begun adopting machine learning as a viable technique to deal with large scale adversarial activity. However, the naive usage of machine learning in an adversarial setting is prone to reverse engineering and evasion attacks, as most of these techniques were designed primarily for a static setting. The security domain is a dynamic landscape, with an ongoing never ending arms race between the system designer and the attackers. Any solution designed for such a domain needs to take into account an active adversary and needs to evolve over time, in the face of emerging threats. We term this as the ‘Dynamic Adversarial Mining’ problem, and the presented work provides the foundation for this new interdisciplinary area of research, at the crossroads of Machine Learning, Cybersecurity, and Streaming Data Mining. We start with a white hat analysis of the vulnerabilities of classification systems to exploratory attack. The proposed ‘Seed-Explore-Exploit’ framework provides characterization and modeling of attacks, ranging from simple random evasion attacks to sophisticated reverse engineering. It is observed that, even systems having prediction accuracy close to 100%, can be easily evaded with more than 90% precision. This evasion can be performed without any information about the underlying classifier, training dataset, or the domain of application. Attacks on machine learning systems cause the data to exhibit non stationarity (i.e., the training and the testing data have different distributions). It is necessary to detect these changes in distribution, called concept drift, as they could cause the prediction performance of the model to degrade over time. However, the detection cannot overly rely on labeled data to compute performance explicitly and monitor a drop, as labeling is expensive and time consuming, and at times may not be a possibility altogether. As such, we propose the ‘Margin Density Drift Detection (MD3)’ algorithm, which can reliably detect concept drift from unlabeled data only. MD3 provides high detection accuracy with a low false alarm rate, making it suitable for cybersecurity applications; where excessive false alarms are expensive and can lead to loss of trust in the warning system. Additionally, MD3 is designed as a classifier independent and streaming algorithm for usage in a variety of continuous never-ending learning systems. We then propose a ‘Dynamic Adversarial Mining’ based learning framework, for learning in non-stationary and adversarial environments, which provides ‘security by design’. The proposed ‘Predict-Detect’ classifier framework, aims to provide: robustness against attacks, ease of attack detection using unlabeled data, and swift recovery from attacks. Ideas of feature hiding and obfuscation of feature importance are proposed as strategies to enhance the learning framework\u27s security. Metrics for evaluating the dynamic security of a system and recover-ability after an attack are introduced to provide a practical way of measuring efficacy of dynamic security strategies. The framework is developed as a streaming data methodology, capable of continually functioning with limited supervision and effectively responding to adversarial dynamics. The developed ideas, methodology, algorithms, and experimental analysis, aim to provide a foundation for future work in the area of ‘Dynamic Adversarial Mining’, wherein a holistic approach to machine learning based security is motivated

    Hierarchical and Decentralised Federated Learning

    Full text link
    Federated learning has shown enormous promise as a way of training ML models in distributed environments while reducing communication costs and protecting data privacy. However, the rise of complex cyber-physical systems, such as the Internet-of-Things, presents new challenges that are not met with traditional FL methods. Hierarchical Federated Learning extends the traditional FL process to enable more efficient model aggregation based on application needs or characteristics of the deployment environment (e.g., resource capabilities and/or network connectivity). It illustrates the benefits of balancing processing across the cloud-edge continuum. Hierarchical Federated Learning is likely to be a key enabler for a wide range of applications, such as smart farming and smart energy management, as it can improve performance and reduce costs, whilst also enabling FL workflows to be deployed in environments that are not well-suited to traditional FL. Model aggregation algorithms, software frameworks, and infrastructures will need to be designed and implemented to make such solutions accessible to researchers and engineers across a growing set of domains. H-FL also introduces a number of new challenges. For instance, there are implicit infrastructural challenges. There is also a trade-off between having generalised models and personalised models. If there exist geographical patterns for data (e.g., soil conditions in a smart farm likely are related to the geography of the region itself), then it is crucial that models used locally can consider their own locality in addition to a globally-learned model. H-FL will be crucial to future FL solutions as it can aggregate and distribute models at multiple levels to optimally serve the trade-off between locality dependence and global anomaly robustness.Comment: 11 pages, 6 figures, 25 reference
    • 

    corecore