
    Classification tree methods for panel data using wavelet-transformed time series

    Wavelet-transformed variables can give better classification performance for panel data than variables on their original scale. Examples are provided showing the types of data where a wavelet-based representation is likely to improve classification accuracy. Results show that in most cases wavelet-transformed data have classification accuracy better than or similar to that of the original data, while selecting only genuinely useful explanatory variables. Wavelet-transformed data provide localized mean and difference variables that can be more effective than the original variables, offer a means of separating “signal” from “noise”, and open the opportunity for improved interpretation by considering which resolution scales are the most informative. Panel data, with multiple observations on each individual, require some form of aggregation to classify at the individual level. Three different aggregation schemes are presented and compared using simulated data and real data gathered during liver transplantation. Methods based on aggregating individual-level data before classification outperform methods that rely solely on combining time-point classifications.
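    The "localized mean and difference variables" described above are the scaling and wavelet coefficients of a Haar transform. A minimal NumPy sketch (an illustration of the technique, not the paper's code):

```python
import numpy as np

def haar_level(x):
    """One level of the Haar transform: pairwise scaled means
    (approximation) and differences (detail)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # localized "mean" variables
    detail = (even - odd) / np.sqrt(2.0)   # localized "difference" variables
    return approx, detail

def haar_transform(x, levels):
    """Multi-level decomposition: detail coefficients at each scale
    plus the coarsest approximation, used as new model variables."""
    coeffs = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_level(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs

# A length-8 series: the transform is orthonormal, so energy is preserved.
x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0])
coeffs = haar_transform(x, levels=3)
energy_in = np.sum(x ** 2)
energy_out = sum(np.sum(c ** 2) for c in coeffs)
```

    Because the transform is orthonormal, no information is lost; the classifier simply sees the same data re-expressed as means and differences at several resolution scales.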

    Prediction of Panel and Streaming Data using Wavelet Transform-based Decision Trees

    Decision trees are a popular model for classification and regression: they are easy to interpret and make no parametric assumptions. In the tree-building process, we choose the Gini index as the splitting criterion, which performs well for data with many missing values and many categories (values). Other splitting criteria in use include averaged squared error and statistical significance testing. In the tree-pruning process, we use cross-validation to choose the best tree, the one with the minimum possible prediction error. When the explanatory variables are time series, however, trees cannot detect the potential correlation in them and may be influenced by the noise involved. We therefore use wavelet analysis to transform the original time series into wavelet-transformed variables, decomposing each series into scaling and wavelet coefficients that represent the smooth and detail information at different resolution levels. The basis we choose is the Haar wavelet, as it is simple to interpret. Other bases are also considered, but they do not perform obviously better than the Haar wavelet. Although the wavelet-transform approach is best suited to data without too many variables, so that the computational time stays under control, the increase in computational time from using high-dimensional wavelet-transformed variables is roughly linear in the number of variables. The computational time will therefore not increase rapidly when the data are transformed into suitable resolution levels or when the number of original variables is modest. The first application of decision trees with wavelet-transformed variables is panel data classification. Trees can classify each observation, but cannot directly classify an individual, which comprises many observations. We therefore design three methods for panel data classification.
After classifying each observation using trees, Method 1 classifies each individual by taking the majority class of its observations. Method 3 transforms the panel data into cross-sectional data by summarizing the information for each individual and then uses trees to classify these cross-sectional data. Method 2 is based on Method 1 and is similar to, but more complicated than, Method 3: the transformed cross-sectional data are no longer heart rate values or wavelet-transformed heart rate values but the probabilities, calculated under Method 1, of each observation being classified into group 1. We number this method second because it builds on Method 1. Results show that Method 3 is generally the best on both simulated and real data, as it works directly on individuals, while Methods 1 and 2 are based on classification results for observations, which is not our primary target. The second and third applications are time series prediction. In the second, we explore, for static regression, whether wavelet-transformed variables are better than the original variables under different circumstances, including different seasonal effects and possible time lags of the explanatory variables. The models are then applied to real liver transplantation (LT) surgery data and Chinese air pollution data, both of which show that the wavelet-transformed variables are better. Wavelet-transformed variables are used directly in the third application: interval forecasting for streaming data. In the forecasting process, knowing both the predicted value and its prediction interval tells us more about the uncertainty in the prediction. There are two choices for interval construction. Gaussian prediction intervals work well if the time series clearly follows a Gaussian distribution.
The quantile interval is not restricted by Gaussian distributional assumptions, which suits this context, as we do not know the distribution of the future data. Performance is measured by coverage and interval width. Instead of using only one model, ensemble models are also considered. Comparing trees with typical models such as ARIMA and GARCH in both simulation and real data applications, we find trees more computationally efficient than both alternatives. Compared with trees, ARIMA may produce a much wider prediction interval when a trend is falsely detected and is slow to react when the distribution changes. GARCH has similar performance to trees in coverage and interval width. Tree methods are therefore suggested for time series prediction. When comparing wavelet-transformed and original variables across classification and regression, in both simulation and real data applications, results show that wavelet-transformed variables match or exceed the accuracy of the original variables. Models using wavelet-transformed variables also provide more detailed information, giving a better understanding of the classification or regression process.
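    Method 1's aggregation step, as described in the abstract (classify each observation with a tree, then give each individual the majority class of its observations), can be sketched in a few lines. This is a toy illustration under assumed names, not the thesis code:

```python
import numpy as np

def classify_individuals_majority(obs_labels, individual_ids):
    """Method-1-style aggregation (hypothetical helper name): assign
    each individual the majority class among its observations'
    per-observation tree predictions."""
    result = {}
    for ind in np.unique(individual_ids):
        labels = obs_labels[individual_ids == ind]
        values, counts = np.unique(labels, return_counts=True)
        result[ind] = values[np.argmax(counts)]  # majority vote
    return result

# Toy example: per-observation predictions for two individuals.
obs_labels = np.array([1, 1, 0, 1, 0, 0, 0, 1])
individual_ids = np.array([1, 1, 1, 1, 2, 2, 2, 2])
pred = classify_individuals_majority(obs_labels, individual_ids)
# individual 1: three 1s vs one 0 -> class 1; individual 2 -> class 0
```

    Method 3, by contrast, would summarize each individual's observations into one cross-sectional row before any tree is fitted, which is why it operates directly at the individual level.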

    Application of Stationary Wavelet Support Vector Machines for the Prediction of Economic Recessions

    This paper examines the efficiency of various approaches to the classification and prediction of economic expansion and recession periods in the United Kingdom. Four approaches are applied. The first is discrete choice modelling using Logit and Probit regressions; the second is a Markov Switching Regime (MSR) model with time-varying transition probabilities. The third approach uses Support Vector Machines (SVM), while the fourth, proposed in this study, is Stationary Wavelet SVM (SW-SVM) modelling. The findings show that SW-SVM and MSR present the best forecasting performance in the out-of-sample period. In addition, forecasts for the period 2012-2015 are provided using all approaches.
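    The first approach, a Logit recession classifier, can be sketched with a plain gradient-descent fit on toy data. The "yield spread" indicator and the data below are hypothetical; the paper's actual estimation would use standard maximum-likelihood software:

```python
import numpy as np

def fit_logit(X, y, lr=0.1, steps=2000):
    """Minimal Logit fit by gradient ascent on the log-likelihood
    (a sketch; not the paper's estimation procedure)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # P(recession | X)
        w += lr * X.T @ (y - p) / len(y)       # score-function step
    return w

# Toy indicator: recession (1) whenever the "spread" feature is negative.
rng = np.random.default_rng(0)
spread = rng.normal(0.0, 1.0, 200)
X = np.column_stack([np.ones(200), spread])    # intercept + regressor
y = (spread < 0).astype(float)
w = fit_logit(X, y)
accuracy = ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean()
```

    A Probit version would replace the logistic link with the Gaussian CDF; the fitting loop is otherwise analogous.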

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translate into more accurate search, learning, and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions and leverage the theoretical analysis to develop a fast algorithm that obtains an exact solution to the problem. The suggested solution provides the tightest estimation of the L2-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic, as our methodology is applicable to any sequential or high-dimensional data and to any orthogonal data transformation used for the underlying compression scheme. Comment: 25 pages, 20 figures, accepted in VLD
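    The basic idea behind such distance bounds can be illustrated with a naive version (not the paper's optimal bound): under an orthonormal transform, distances are preserved (Parseval), so the distance computed over only the coefficients kept by both objects is a valid lower bound on the true Euclidean distance:

```python
import numpy as np

def dft_coeffs(x):
    """Orthonormal (unitary) DFT coefficients, so Euclidean
    distances are preserved in the coefficient domain."""
    return np.fft.fft(x) / np.sqrt(len(x))

def lower_bound(cx, cy, kept_x, kept_y):
    """Naive lower bound when each object keeps a possibly different
    coefficient set: sum squared differences only over positions
    kept by BOTH objects (a sketch, not the paper's tight bound)."""
    common = kept_x & kept_y
    return np.sqrt(np.sum(np.abs(cx[list(common)] - cy[list(common)]) ** 2))

rng = np.random.default_rng(1)
x, y = rng.normal(size=32), rng.normal(size=32)
cx, cy = dft_coeffs(x), dft_coeffs(y)
true_dist = np.linalg.norm(x - y)
# Each object keeps its 8 largest-magnitude coefficients.
kept_x = set(np.argsort(-np.abs(cx))[:8])
kept_y = set(np.argsort(-np.abs(cy))[:8])
lb = lower_bound(cx, cy, kept_x, kept_y)
```

    The paper's contribution is to tighten such bounds optimally, exploiting also the known total energy of the discarded coefficients.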

    Rapid identification of oil contaminated soils using visible near infrared diffuse reflectance spectroscopy

    Initially, 46 petroleum contaminated and non-contaminated soil samples were collected and scanned using visible near-infrared diffuse reflectance spectroscopy (VisNIR DRS) at three combinations of moisture content and pretreatment. The VisNIR spectra of soil samples were used to predict total petroleum hydrocarbon (TPH) content using partial least squares (PLS) regression and boosted regression tree (BRT) models. The field-moist intact scan proved best for predicting TPH content with a validation r^2 of 0.64 and relative percent difference (RPD) of 1.70. Those 46 samples were used to calibrate a penalized spline (PS) model. Subsequently, the PS model was used to predict soil TPH content for 128 soil samples collected over an 80 ha study site. An exponential semivariogram using PS predictions revealed strong spatial dependence among soil TPH [r^2 = 0.76, range = 52 m, nugget = 0.001 (log10 mg kg-1)^2, and sill = 1.044 (log10 mg kg-1)^2]. An ordinary block kriging map produced from the data showed that TPH distribution matched the expected TPH variability of the study site. Another study used DRS to measure reflectance patterns of 68 artificially constructed samples with different clay content, organic carbon levels, petroleum types, and different levels of contamination per type. Both first derivative of reflectance and discrete wavelet transformations were used to preprocess the spectra. Principal component analysis (PCA) was applied for qualitative VisNIR discrimination of variable soil types, organic carbon levels, petroleum types, and concentration levels. Soil types were separated with 100% accuracy, and organic carbon levels were separated with 96% accuracy by linear discriminant analysis. The support vector machine produced 82% classification accuracy for organic carbon levels by repeated random splitting of the whole dataset.
However, the spectral absorptions of the individual petroleum hydrocarbons overlapped and could not be separated by any classification scheme when contaminants were mixed. Wavelet-based multiple linear regression performed best for predicting petroleum amount, with the highest residual prediction deviation (RPD) of 3.97. Using the first derivative of the reflectance spectra, PS regression performed better (RPD = 3.3) than the PLS model (RPD = 2.5). Specific calibrations accounting for additional soil physicochemical variability are recommended to produce improved predictions.
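    The first-derivative preprocessing mentioned above removes additive baseline offsets between spectra, which is the usual motivation for it. A toy sketch with NumPy (the study may well have used a smoothed Savitzky-Golay derivative instead; the synthetic spectra here are illustrative):

```python
import numpy as np

def first_derivative(spectra, wavelengths):
    """First derivative of reflectance with respect to wavelength,
    a common VisNIR preprocessing step."""
    return np.gradient(spectra, wavelengths, axis=-1)

# Two toy spectra sharing one absorption feature but differing by a
# constant baseline offset; differentiation removes the offset.
wl = np.linspace(400.0, 2500.0, 211)            # wavelength grid, nm
base = np.exp(-((wl - 1400.0) / 300.0) ** 2)    # shared absorption band
spectra = np.stack([base + 0.10, base + 0.25])  # two baseline offsets
deriv = first_derivative(spectra, wl)
offset_before = np.max(np.abs(spectra[0] - spectra[1]))
offset_after = np.max(np.abs(deriv[0] - deriv[1]))
```

    After differentiation the two spectra are essentially identical, so a downstream model sees the chemistry rather than the baseline.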

    Machine learning: statistical physics based theory and smart industry applications

    The increasing computational power and the availability of data have made it possible to train ever-bigger artificial neural networks. These so-called deep neural networks have been used for impressive applications, like advanced driver assistance and support in medical diagnoses. However, various vulnerabilities have been revealed and many open questions remain concerning the workings of neural networks. Theoretical analyses are therefore essential for further progress. One current question is: why do networks with Rectified Linear Unit (ReLU) activation seemingly perform better than networks with sigmoidal activation? We contribute to the answer by comparing ReLU networks with sigmoidal networks in diverse theoretical learning scenarios. Instead of analysing specific datasets, we use theoretical modelling based on methods from statistical physics, which gives the typical learning behaviour for chosen model scenarios. We analyse the learning behaviour both on a fixed dataset and on a data stream in the presence of a changing task. The emphasis is on the analysis of the network's transition to a state in which specific concepts have been learnt. We find significant benefits of ReLU networks: they exhibit continuous increases in performance and adapt more quickly to changing tasks. In the second part of the thesis we treat applications of machine learning: we design a quick quality-control method for material in a production line and study the relationship with product faults. Furthermore, we introduce a methodology for the interpretable classification of time series data.
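    One commonly cited ingredient in the ReLU-versus-sigmoid question is gradient saturation, which can be shown in a few lines. This is a generic illustration, not the thesis's statistical-physics analysis:

```python
import numpy as np

# The sigmoid derivative peaks at 0.25 and vanishes for large |z|
# (saturation), while the ReLU derivative stays exactly 1 for every
# positive input, so gradient signal is not attenuated.
z = np.linspace(-6.0, 6.0, 121)
sigmoid_grad = np.exp(-z) / (1.0 + np.exp(-z)) ** 2
relu_grad = (z > 0).astype(float)
max_sigmoid_grad = sigmoid_grad.max()   # 0.25, attained at z = 0
tail_sigmoid_grad = sigmoid_grad[-1]    # nearly 0 at z = 6 (saturated)
```

    Saturated units learn slowly because their error signal is multiplied by a near-zero derivative, which is consistent with the slower adaptation of sigmoidal networks reported above.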

    Phenotypic Signatures Arising from Unbalanced Bacterial Growth

    Fluctuations in the growth rate of a bacterial culture during unbalanced growth are generally considered undesirable in quantitative studies of bacterial physiology. Under well-controlled experimental conditions, however, these fluctuations are not random but instead reflect the interplay between the intra-cellular networks underlying bacterial growth and the growth environment. These fluctuations can therefore be considered quantitative phenotypes of the bacteria under a specific growth condition. Here, we present a method to identify “phenotypic signatures” by time-frequency analysis of unbalanced growth curves measured with high temporal resolution. The signatures are then applied to differentiate among bacterial strains, or the same strain under different growth conditions, and to identify the essential architecture of the gene network underlying the observed growth dynamics. Our method has implications both for the basic understanding of bacterial physiology and for the classification of bacterial strains.
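    Time-frequency analysis of a growth curve can be sketched with a basic short-time Fourier power representation (a generic sketch, not the authors' exact pipeline; the toy signal below mimics a growth-rate oscillation whose frequency changes over time):

```python
import numpy as np

def stft_power(x, win, hop):
    """Short-time Fourier power: Hann-windowed frames of length `win`,
    stepped by `hop`, giving power per (time window, frequency bin)."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=-1)) ** 2

# Toy "growth-rate" signal sampled at 64 Hz: a 1 Hz oscillation whose
# frequency doubles to 2 Hz halfway through, mimicking a dynamic change.
t = np.arange(1024) / 64.0
x = np.where(t < 8.0, np.sin(2 * np.pi * 1.0 * t),
             np.sin(2 * np.pi * 2.0 * t))
P = stft_power(x, win=128, hop=64)
peak_early = int(np.argmax(P[0]))    # dominant bin in the first window
peak_late = int(np.argmax(P[-1]))    # dominant bin in the last window
```

    The time-resolved spectrum localizes when the dominant frequency shifts, which is exactly the kind of feature a "phenotypic signature" would capture.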

    Dengue Dynamics in Binh Thuan Province, Southern Vietnam: Periodicity, Synchronicity and Climate Variability

    Dengue has become a major international public health problem due to its increasing geographic distribution and a transition from epidemic transmission with long inter-epidemic intervals to endemic transmission with seasonal fluctuation. Seasonal and multi-annual cycles in dengue incidence vary over time and space. We performed wavelet analyses on time series of monthly notified dengue cases in Binh Thuan province, southern Vietnam, from January 1994 to June 2009. We observed a continuous annual mode of oscillation with a non-stationary 2–3-year multi-annual cycle. We used phase differences to describe the spatio-temporal patterns, which suggest that the seasonal wave of infection was either synchronous across all districts or moving away from Phan Thiet district, while the multi-annual wave of infection was moving towards Phan Thiet district. We also found a strong non-stationary association between ENSO indices and climate variables and dengue incidence. We provide insight into dengue transmission dynamics over the past 14.5 years. Further studies on an extensive time series dataset are needed to test the hypothesis that epidemics emanate from larger cities in southern Vietnam.
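    The wavelet analysis of a monthly incidence series can be sketched by convolving with a Morlet wavelet at a chosen period. This is a minimal illustration on synthetic monthly counts (NumPy only, not the study's software), showing the power concentrating at the annual period:

```python
import numpy as np

def morlet_power(x, period, w0=6.0):
    """Wavelet power of a demeaned series at one Fourier period,
    via convolution with a complex Morlet wavelet (a sketch)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # Scale giving the requested Fourier period for a Morlet wavelet.
    scale = period * (w0 + np.sqrt(2.0 + w0 ** 2)) / (4.0 * np.pi)
    t = np.arange(-3 * period, 3 * period + 1)
    wavelet = (np.pi ** -0.25) * np.exp(1j * w0 * t / scale) \
        * np.exp(-0.5 * (t / scale) ** 2)
    return np.abs(np.convolve(x, wavelet, mode="same")) ** 2

# Toy monthly case counts: an annual (12-month) cycle plus noise.
rng = np.random.default_rng(2)
months = np.arange(20 * 12)
cases = 100 + 30 * np.sin(2 * np.pi * months / 12) \
    + rng.normal(0, 5, months.size)
power_12 = morlet_power(cases, period=12).mean()   # annual band
power_30 = morlet_power(cases, period=30).mean()   # off-band control
```

    Scanning `period` over a grid and tracking how the power evolves in time is what reveals the non-stationary annual and 2–3-year modes reported above; phase differences between districts come from the complex (pre-modulus) convolution output.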