
    Toolbox for interactive time series analysis

    Time series, as we call sequences of measurements of an observed phenomenon, represent an important type of data in the fields of econometrics (e.g. countries' yearly GDP and relative debt change), business (e.g. number of products sold per month), medicine (EEG, ECG), meteorology (e.g. change in average temperature through time) and in almost all other fields of natural and social science. It is thus important for toolsets to exist with which one can analyze, transform, visualize, and model time series data. Based on the renowned Orange data mining software framework, we propose a suite of visual programming widgets for the construction of workflows for interactive time series analysis, visualization, and forecasting. In particular, the suite comprises widgets for time series differencing, interpolation, aggregation, seasonal adjustment, transformation with window functions, and estimation of causality. Additionally, we devise components for plotting time series data in a line chart, periodogram, correlogram, and spiral heatmap. We support time series modeling with VAR or ARIMA models. We evaluate our contribution on various time series data sets.
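    As a rough illustration of the operations the widgets expose, the sketch below re-creates differencing, interpolation, aggregation, window smoothing, seasonal adjustment, and an ARIMA forecast with pandas and statsmodels; it is not the Orange widget API, and the synthetic monthly series is purely illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series standing in for, e.g., products sold per month.
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
y = pd.Series(100 + np.arange(72)
              + 10 * np.sin(np.arange(72) * 2 * np.pi / 12)
              + np.random.normal(0, 2, 72), index=idx)

diffed = y.diff().dropna()               # differencing
filled = y.interpolate(method="linear")  # interpolation of missing values
quarterly = y.resample("QS").mean()      # aggregation to a coarser time scale
smoothed = y.rolling(window=6).mean()    # transformation with a window function
adjusted = y - seasonal_decompose(y, model="additive").seasonal  # seasonal adjustment

# Forecasting with ARIMA; the suite also supports VAR for multivariate series.
fit = ARIMA(adjusted, order=(1, 1, 1)).fit()
print(fit.forecast(steps=12))
```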

    Structural Generative Descriptions for Temporal Data

    In data mining problems, the representation or description of data plays a fundamental role, since it defines the set of essential properties for the extraction and characterisation of patterns. However, for the case of temporal data, such as time series and data streams, one outstanding issue when developing mining algorithms is finding an appropriate data description or representation. In this thesis two novel domain-independent representation frameworks for temporal data suitable for off-line and online mining tasks are formulated. First, a domain-independent temporal data representation framework based on a novel data description strategy which combines structural and statistical pattern recognition approaches is developed. The key idea here is to move the structural pattern recognition problem to the probability domain. This framework is composed of three general tasks: a) decomposing input temporal patterns into subpatterns in time or any other transformed domain (for instance, the wavelet domain); b) mapping these subpatterns into the probability domain to find attributes of elemental probability subpatterns called primitives; and c) mining input temporal patterns according to the attributes of their corresponding probability domain subpatterns. This framework is referred to as Structural Generative Descriptions (SGDs). Two off-line and two online algorithmic instantiations of the proposed SGDs framework are then formulated: i) For the off-line case, the first instantiation is based on the use of the Discrete Wavelet Transform (DWT) and Wavelet Density Estimators (WDE), while the second algorithm combines DWT and Finite Gaussian Mixtures. ii) For the online case, the first instantiation relies on an online implementation of DWT and a recursive version of WDE (RWDE), whereas the second algorithm is based on a multi-resolution exponentially weighted moving average filter and RWDE. The empirical evaluation of the proposed SGDs-based algorithms is performed in the context of time series classification, for the off-line algorithms, and in the context of change detection and clustering, for the online algorithms. For this purpose, synthetic and publicly available real-world data are used. Additionally, a novel framework for multidimensional data stream evolution diagnosis incorporating RWDE into the context of Velocity Density Estimation (VDE) is formulated. Changes in streaming data and changes in their correlation structure are characterised by means of local and global evolution coefficients as well as by means of recursive correlation coefficients. The proposed VDE framework is evaluated using temperature data from the UK and air pollution data from Hong Kong.
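    The following sketch, assuming PyWavelets and scikit-learn, illustrates the off-line DWT plus finite Gaussian mixture instantiation in spirit: each wavelet subband is summarised by the parameters of a small Gaussian mixture, and the concatenated parameters serve as attributes for a downstream classifier. The wavelet choice, decomposition level, and number of mixture components are illustrative, not the thesis settings.

```python
import numpy as np
import pywt
from sklearn.mixture import GaussianMixture

def sgd_features(series, wavelet="db4", level=3, n_components=2):
    """Decompose a series into wavelet subpatterns, then describe each subband
    by the parameters of a small Gaussian mixture (the 'primitives')."""
    coeffs = pywt.wavedec(series, wavelet=wavelet, level=level)
    features = []
    for band in coeffs:
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(band.reshape(-1, 1))
        # Mixture means, variances and weights become the band's attributes.
        features.extend(gmm.means_.ravel())
        features.extend(gmm.covariances_.ravel())
        features.extend(gmm.weights_.ravel())
    return np.asarray(features)

# The resulting fixed-length vectors can feed any off-the-shelf classifier.
x = np.sin(np.linspace(0, 20, 256)) + 0.1 * np.random.randn(256)
print(sgd_features(x).shape)
```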

    Feature Space Modeling for Accurate and Efficient Learning From Non-Stationary Data

    A non-stationary dataset is one whose statistical properties, such as the mean, variance, correlation, and probability distribution, change over a specific interval of time. On the contrary, a stationary dataset is one whose statistical properties remain constant over time. Apart from its volatile statistical properties, non-stationary data poses other challenges, such as time and memory management, due to the limitation of computational resources and the recent advancements in data collection technologies, which generate a variety of data at an alarming pace and volume. Additionally, when the collected data is complex, managing data complexity, emerging from its dimensionality and heterogeneity, can pose another challenge for effective computational learning. The problem is to enable accurate and efficient learning from non-stationary data in a continuous fashion over time while facing and managing the critical challenges of time, memory, concept change, and complexity simultaneously. Feature space modeling is one of the most effective solutions to address this problem. For non-stationary data, selecting relevant features is even more critical than for stationary data, because the reduction of feature dimension ensures the best use of computational resources and produces higher accuracy and efficiency in data mining algorithms. In this dissertation, we investigated a variety of feature space modeling techniques to improve the overall performance of data mining algorithms. In particular, we built a Relief-based feature sub-selection method in combination with data complexity analysis to improve classification performance using ovarian cancer image data collected in a non-stationary batch mode. We also collected time series health sensor data in a streaming environment and deployed a feature space transformation using Singular Value Decomposition (SVD). This reduced the dimensionality of the feature space, resulting in better accuracy and efficiency of the Density Ratio Estimation method in identifying potential change points in data over time. We also built an unsupervised feature space model using matrix factorization and Lasso Regression, which was successfully deployed in conjunction with Relative Density Ratio Estimation to address botnet attacks in a non-stationary environment. The Relief-based feature model improved the accuracy of the Fuzzy Forest classifier by 16%. For the change detection framework, we observed a 9% improvement in accuracy for the PCA feature transformation. With the unsupervised feature selection model, for 2% and 5% malicious traffic ratios, the proposed botnet detection framework exhibited on average 20% better accuracy than the One Class Support Vector Machine (OSVM) and on average 25% better accuracy than the Autoencoder. All these results demonstrate the effectiveness of these feature space models. The fundamental theme that repeats itself in this dissertation is modeling an efficient feature space to improve both the accuracy and efficiency of selected data mining models. Every contribution in this dissertation has subsequently and successfully been employed to capitalize on those advantages to solve real-world problems. Our work bridges concepts from multiple disciplines in effective and surprising ways, leading to new insights, new frameworks, and ultimately to a cross-production of diverse fields like mathematics, statistics, and data mining.
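    A minimal sketch of the SVD-based feature space transformation step is given below, assuming scikit-learn; the synthetic sensor stream, window length, and number of components are illustrative stand-ins rather than the dissertation's settings, and the reduced windows are what a density ratio change detector would consume.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 50))   # stand-in for multichannel health sensor data
window, k = 100, 5                     # sliding window length, reduced dimension

svd = TruncatedSVD(n_components=k, random_state=0)
reduced_windows = []
for start in range(0, stream.shape[0] - window + 1, window):
    segment = stream[start:start + window]
    # Project each window onto its top-k singular directions; the reduced
    # representation is what the change point detector operates on.
    reduced_windows.append(svd.fit_transform(segment))

print(len(reduced_windows), reduced_windows[0].shape)
```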

    A methodology for evaluating utilisation of mine planning software and consequent decision-making strategies in South Africa

    Mine planning software has contributed, and continues to contribute, to the development of the South African mining industry. As mine planning software usage becomes more widespread, it is imperative that a methodology to evaluate mine planning software utilisation for enhanced decision-making strategies in South Africa is established. An existing online database available at http://db.mining.wits.ac.za was developed prior to this study in September 2012 (initial data collection date). However, the database only acted as a snapshot repository of mine planning software data and lacked a framework to evaluate utilisation of mine planning software in the South African mining industry. In this thesis, a methodology was developed to measure the utilisation of mine planning software to enhance decision-making strategies in the South African mining industry. The methodology for evaluating the utilisation of mine planning software in various commodity sectors was developed on the basis of three variables, namely commodity, functionality, and time factor, as the key evaluation criteria. Even though the calculations can be done on any commodity in a similar manner, in this research calculations were only performed on four commodities, namely coal, diamond, gold, and platinum group metals, which are the most significant minerals in South Africa. Six functionalities, namely Geological Data Management, Geological Modelling and Resource Estimation, Design and Layout, Scheduling, Financial Valuation, and Optimisation, were applied to the four commodities using two different time-stamps (September 2012 and April 2014). The following software providers availed information that was used to populate the database: Geovia, MineRP Solutions, Sable, RungePincockMinarco, Maptek, Cyest Technology, and CAE Mining. Note that the CAE Mining data was only made available in April 2014 (second data collection date). The results indicated that the market leaders in terms of mine planning software utilisation in South Africa differ depending on the commodity that is being mined as well as the functionality that is being used. In addition, this thesis also proposed a framework to estimate the future use of mine planning software on an evolving dataset, considering the fact that the database will be continually updated in the future. Using Artificial Neural Networks (ANN), specifically supervised learning, time-series analyses were performed. Results from the time-series analyses were used to establish the framework for estimating future mine planning software utilisation in the South African mining industry. By using this newly developed framework, utilisation of the various mine planning software packages was measured, leading to the formulation of different decision-making strategies for the various mine planning software stakeholders.
By using this newly developed framework to estimate and measure mine planning software utilisation, and by proposing a framework for time-series analyses on an evolving dataset, this thesis serves a number of beneficiaries: firstly, the South African mining industry, which can position itself better by acquiring the optimal combination of mine planning software used in South Africa and thereby improve its production levels; secondly, tertiary education institutions and mining consulting firms which make use of mine planning software; and lastly, the aforementioned software providers, which can strategically position themselves in a limited mine planning software market. Moreover, this newly developed framework could be used by the involved parties for corporate strategic decision-making.
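    A minimal sketch, assuming scikit-learn, of the kind of supervised ANN time-series estimation the framework proposes appears below; the utilisation scores, lag structure, and network size are hypothetical placeholders rather than the thesis data or model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical utilisation scores of one functionality (e.g. Scheduling) for one
# commodity, sampled at successive data collection dates.
utilisation = np.array([0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.58, 0.60, 0.63, 0.65])

lags = 3
X = np.array([utilisation[i:i + lags] for i in range(len(utilisation) - lags)])
y = utilisation[lags:]

ann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
ann.fit(X, y)

# One-step-ahead estimate of future utilisation from the most recent lags.
print(ann.predict(utilisation[-lags:].reshape(1, -1)))
```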

    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of the recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the latter regime applies to exa-scale data dimensions. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
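    A small numerical illustration of the sample-starved regime the paper studies is given below: with the sample size n held fixed and the number of variables p growing, the largest spurious sample correlation between truly independent variables creeps upward, which is why sample complexity, and not just computational complexity, governs reliable correlation mining. The simulation parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                                # fixed number of samples
for p in (50, 500, 2000):             # growing number of variables
    X = rng.normal(size=(n, p))       # independent variables: true correlations are zero
    R = np.corrcoef(X, rowvar=False)  # p x p sample correlation matrix
    off_diag = np.abs(R[np.triu_indices(p, k=1)])
    print(f"p={p:5d}  max spurious |correlation| = {off_diag.max():.3f}")
```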