14 research outputs found

    Outlier detection and classification in sensor data streams for proactive decision support systems

    The paper deals with the problem of quality assessment in sensor data streams accumulated by proactive decision support systems. A new problem is stated in which outliers must be detected and classified according to their origin. Two types of outliers are defined: the first arises from misoperation of the system, and the second is caused by changes in the observed system's behaviour due to internal and external influences. The proposed method is based on a data-driven forecasting approach that predicts the values in the incoming data stream at the expected time. The method combines a forecasting model and a clustering model: the forecasting model predicts a value in the incoming data stream at the expected time in order to measure the deviation between the observed value and the predicted one, while the clustering model performs the taxonomic classification of outliers. Constructive neural network models (CoNNS) and evolving connectionist systems (ECS) are used to predict the sensor data. Two real-world tasks are used as case studies. The maximal accuracy values are 0.992 and 0.974, and the F1 scores are 0.967 and 0.938, for the first and second tasks respectively. The conclusion contains findings on how to apply the proposed method in proactive decision support systems.
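    The forecast-and-deviate scheme described in the abstract can be sketched in a few lines. In this minimal sketch a simple moving-average forecaster stands in for the paper's CoNNS/ECS models, and the deviation threshold is an invented illustration, not a value from the paper:

    ```python
    # Sketch of forecast-based outlier detection: predict the next value,
    # then flag points whose deviation from the forecast is too large.
    # The moving average is a hypothetical stand-in for the paper's models.

    def moving_average_forecast(history, window=3):
        """Predict the next value as the mean of the last `window` observations."""
        recent = history[-window:]
        return sum(recent) / len(recent)

    def detect_outliers(stream, window=3, threshold=3.0):
        """Return (index, value, deviation) for points exceeding the threshold."""
        outliers = []
        for i in range(window, len(stream)):
            predicted = moving_average_forecast(stream[:i], window)
            deviation = abs(stream[i] - predicted)
            if deviation > threshold:
                outliers.append((i, stream[i], deviation))
        return outliers

    stream = [1.0, 1.1, 0.9, 1.0, 8.0, 1.1, 1.0, 0.95]
    print(detect_outliers(stream))  # the spike at index 4 is flagged
    ```

    The paper's second stage, clustering the flagged deviations to separate misoperations from genuine behaviour changes, would operate on the returned (index, value, deviation) tuples.
    
    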

    Clustering on Spatial Data Sets Using Extended Linked Clustering Algorithms

    Various clustering algorithms (CAs) have been reported in the literature for grouping data into clusters across diverse domains. The literature further reports that these CAs work satisfactorily on purely numerical or purely categorical data but perform poorly on mixed numerical and categorical data. Clustering is the process of creating distribution patterns and revealing intrinsic correlations in large datasets by arranging the data into similarity classes. The present work reviews the available research on clustering spatial data. From a web perspective, a detailed inspection of grouped patterns and of how they belong to well-known characters is useful for the evolution of clusters. The review is split into three areas: spatial data mining, clustering on spatial datasets, and extended linked clustering. It enables researchers to make an in-depth study of the research to date in these areas and paves the way for developing extended linked clustering algorithms that find the number of clusters in mixed datasets. This work should assist in deciding which clustering solution to use to obtain a coherent data solution for a particular character experiment, and it could serve as a tool to guide the clustering process towards better and more interpretable solutions. The major contribution of the present work is an in-depth literature review of spatial data mining, clustering on spatial datasets and extended linked clustering, with a view to assisting researchers in developing optimal extended linked clustering algorithms.

    Content assessment of the primary biodiversity data published through GBIF network: Status, challenges and potentials

    With the establishment of the Global Biodiversity Information Facility (GBIF) in 2001 as an inter-governmental co-ordinating body, concerted efforts were made during the past decade to establish a global research infrastructure that facilitates the sharing, discovery and access of primary biodiversity data. To date, the participants in GBIF have enabled the discovery of, and access to, over 267 million such data records. While this remarkable achievement in terms of data volume must be acknowledged, concerns about the quality and ‘fitness-for-use’ of the data should also be carefully considered in future developments. This contribution is therefore a direct response to the calls for a comprehensive content assessment of the data mobilised through GBIF. It is the first comprehensive assessment of the coverage of the content mobilised so far through GBIF, as well as a means to identify existing gaps and reflect on fitness-for-use requirements. The paper describes the complementary methodologies adopted by the GBIF Secretariat and the University of Navarra for the development of a comprehensive content assessment. Outcomes of these research initiatives are summarised in four categories, namely: (a) data quality assessment, (b) trends/patterns assessment, (c) fitness-for-use assessment, and (d) ecosystem-specific data diversity assessment. In conclusion, we make specific suggestions to the GBIF community on the adoption of common indicators to assess progress towards future targets, as well as recommendations for repeating such assessments at various levels within the GBIF Network, from national to thematic levels.

    Software-Defect Localisation by Mining Dataflow-Enabled Call Graphs

    Defect localisation is essential in software engineering and is an important task in domain-specific data mining. Existing techniques that build on call-graph mining can localise different kinds of defects. However, these techniques focus on defects that affect the control flow and are agnostic to the dataflow. In this paper, we introduce dataflow-enabled call graphs that incorporate abstractions of the dataflow. Building on these graphs, we present an approach for defect localisation. The creation of the graphs and the defect localisation are essentially data-mining problems, making use of discretisation, frequent subgraph mining and feature selection. We demonstrate the defect-localisation qualities of our approach with a study on defects introduced into Weka. As a result, defect localisation works much better than before, and a developer has to investigate on average only 1.5 out of 30 methods to fix a defect.
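    The core intuition behind mining execution data for defect localisation can be illustrated with a much-simplified sketch. The frequency-based scoring below is a hypothetical stand-in for the paper's frequent-subgraph-mining and feature-selection pipeline, and all method names are invented:

    ```python
    # Simplified defect-scoring sketch: methods that occur more often in
    # failing executions than in passing ones get a higher suspiciousness
    # score. Real call-graph mining compares subgraph patterns instead.

    from collections import Counter

    def suspiciousness(passing_runs, failing_runs):
        """Rank methods by (failing occurrence rate - passing occurrence rate)."""
        pass_counts = Counter(m for run in passing_runs for m in run)
        fail_counts = Counter(m for run in failing_runs for m in run)
        methods = set(pass_counts) | set(fail_counts)
        scores = {}
        for m in methods:
            fail_rate = fail_counts[m] / len(failing_runs)
            pass_rate = pass_counts[m] / len(passing_runs)
            scores[m] = fail_rate - pass_rate
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    passing = [["main", "parse", "save"], ["main", "parse"]]
    failing = [["main", "parse", "normalise"], ["main", "normalise"]]
    print(suspiciousness(passing, failing))  # "normalise" ranks first
    ```

    A ranked list like this is what lets a developer inspect only a handful of methods rather than the whole program.
    
    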

    Scalable Software-Defect Localisation by Hierarchical Mining of Dynamic Call Graphs

    The localisation of defects in computer programmes is essential in software engineering and is an important task in domain-specific data mining. Existing techniques that build on call-graph mining localise defects well but do not scale to large software projects. This paper presents a hierarchical approach with good scalability characteristics. It makes use of novel call-graph representations, frequent subgraph mining and feature selection. It first analyses call graphs of coarse granularity before zooming in to more fine-grained graphs. We evaluate our approach with defects in the Mozilla Rhino project: in our setup, it narrows down the code a developer has to examine to only about 6%.
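    The coarse-to-fine strategy in this abstract can be sketched in a deliberately simplified form. In practice the suspiciousness scores come from graph mining at each granularity; here both the scores and the class/method names are invented for illustration:

    ```python
    # Sketch of hierarchical zoom-in: rank coarse units (classes) first,
    # then inspect only the methods of the most suspicious class.
    # All names and scores below are invented placeholders.

    def zoom_in(class_scores, methods_by_class, method_scores):
        """Return the top-ranked class and its methods, most suspicious first."""
        top_class = max(class_scores, key=class_scores.get)
        methods = methods_by_class[top_class]
        ranked = sorted(methods, key=lambda m: method_scores[m], reverse=True)
        return top_class, ranked

    class_scores = {"Parser": 0.9, "Lexer": 0.2}
    methods_by_class = {"Parser": ["parse", "expect"], "Lexer": ["next_token"]}
    method_scores = {"parse": 0.3, "expect": 0.8, "next_token": 0.1}
    print(zoom_in(class_scores, methods_by_class, method_scores))
    # → ('Parser', ['expect', 'parse'])
    ```

    Because fine-grained graphs are only mined for the few top-ranked coarse units, the expensive subgraph mining never has to run on the whole project at once.
    
    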

    The role of semantic web technologies for IoT data in underpinning environmental science

    The advent of Internet of Things (IoT) technology has the potential to generate huge amounts of heterogeneous data at different geographical locations and with various temporal resolutions in environmental science. In many other areas of IoT deployment, volume and velocity dominate; in environmental science, however, the pattern is quite distinct, and variety often dominates. There exists a large number of small, heterogeneous and potentially complex datasets, and the key challenge is to understand the interdependencies between these disparate datasets representing different environmental facets. These characteristics pose several data challenges, including data interpretation, interoperability and integration, to name but a few, and there is a pressing need to address them. The author postulates that Semantic Web technologies and associated techniques have the potential to address these data challenges and support environmental science. The main goal of this thesis is to examine the potential role of Semantic Web technologies in making sense of such complex and heterogeneous environmental data. The thesis explores the state of the art in the use of such technologies in the context of environmental science. After an in-depth assessment of related work, the thesis further examines the characteristics of environmental data through semi-structured interviews with leading experts. Through this, three key research challenges emerge: discovering interdependencies between disparate datasets, geospatial data integration and reasoning, and data heterogeneity. In response to these challenges, an ontology was developed that semantically enriches all sensor measurements stemming from an experimental Environmental IoT infrastructure. The resultant ontology was evaluated through three real-world use cases derived from the interviews. This led to a number of major contributions, including: an ontology tailored for streaming environmental data that offers semantic enrichment of IoT data, support for spatio-temporal data integration and reasoning, and an analysis of the unique data characteristics of environmental science.
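    Semantic enrichment of a raw sensor reading can be sketched as triple generation. The SOSA and WGS84 vocabulary prefixes below are standard Semantic Web terms, but the entity names and the helper function are invented for illustration and are not taken from the thesis's actual ontology:

    ```python
    # Sketch of semantic enrichment: one IoT measurement becomes a set of
    # subject-predicate-object triples. Identifiers like "ex:temp1" are
    # hypothetical; the predicates follow the W3C SOSA and WGS84 vocabularies.

    def enrich(sensor_id, quantity, value, unit, lat, lon):
        """Turn one raw sensor measurement into a list of RDF-style triples."""
        s = f"ex:{sensor_id}"
        return [
            (s, "rdf:type", "sosa:Observation"),
            (s, "sosa:observedProperty", f"ex:{quantity}"),
            (s, "sosa:hasSimpleResult", f"{value} {unit}"),
            (s, "geo:lat", str(lat)),
            (s, "geo:long", str(lon)),
        ]

    triples = enrich("temp1", "AirTemperature", 12.5, "degC", 54.01, -2.78)
    for t in triples:
        print(t)
    ```

    Once measurements from disparate sensors share such a vocabulary, the spatio-temporal integration and reasoning described above become queries over a common triple store rather than bespoke per-dataset code.
    
    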