
    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and gives outstanding results for MTS with missing data.
    Comment: 23 pages, 6 figures
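
    The core ensemble idea can be sketched compactly: fit many GMMs under randomised hyperparameters and let the kernel entry for two series be the fraction of models that cluster them together. The sketch below assumes complete, flattened MTS and omits the paper's informative priors and missing-data handling; all names are illustrative.

```python
# Toy ensemble cluster kernel: the similarity of two series is the fraction
# of randomly configured GMMs that assign them to the same cluster.
# Assumes complete data and flattened MTS; names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_cluster_kernel(X, n_models=30, max_components=8, seed=0):
    """X: (n_samples, n_features) flattened multivariate time series."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_models):
        q = int(rng.integers(2, max_components + 1))   # random number of components
        gmm = GaussianMixture(n_components=q, covariance_type="diag",
                              random_state=int(rng.integers(1_000_000)))
        labels = gmm.fit_predict(X)
        K += labels[:, None] == labels[None, :]        # 1 where co-clustered
    return K / n_models                                # similarity in [0, 1]

# toy usage: 20 series of length 20 with 3 attributes, flattened to 60 values
X = np.random.default_rng(1).standard_normal((20, 60))
K = ensemble_cluster_kernel(X)
print(K.shape, K.diagonal().min())                     # diagonal is all ones
```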

    Machine Learning and Integrative Analysis of Biomedical Big Data

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
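
    As one concrete illustration of the integration strategies such reviews cover, the sketch below shows a simple late-integration baseline: fit one classifier per omics modality and average the predicted probabilities. The modalities and data here are synthetic placeholders, not from any study discussed in the review.

```python
# Late-integration baseline: one classifier per omics block, averaged
# probabilities as the consensus prediction. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)                         # binary phenotype labels
modalities = {                                      # synthetic "omics" blocks
    "genome": rng.standard_normal((100, 50)),
    "transcriptome": rng.standard_normal((100, 30)),
    "metabolome": rng.standard_normal((100, 20)),
}

probs = []
for name, X in modalities.items():                  # fit each modality in isolation
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    probs.append(clf.predict_proba(X)[:, 1])

consensus = np.mean(probs, axis=0)                  # simple late integration
print("consensus prediction for sample 0:", consensus[0] > 0.5)
```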

    Data-driven Soft Sensors in the Process Industry

    In the last two decades, Soft Sensors have established themselves as a valuable alternative to traditional means for the acquisition of critical process variables, process monitoring and other tasks related to process control. This paper discusses characteristics of process industry data which are critical for the development of data-driven Soft Sensors. These characteristics are common to a large number of process industry fields, such as the chemical industry, bioprocess industry and steel industry. The focus of this work is on data-driven Soft Sensors because of their growing popularity, already demonstrated usefulness and huge, though not yet fully realised, potential. The main contributions of this work are a comprehensive selection of case studies covering the three most important Soft Sensor application fields, a general introduction to the most popular Soft Sensor modelling techniques, and a discussion of some open issues in Soft Sensor development and maintenance and their possible solutions.
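
    A minimal data-driven soft sensor can be sketched as a regression model that infers a hard-to-measure process variable from routine sensor readings. The example below uses partial least squares, a common choice in this literature, on synthetic data; it is an illustration of the general idea, not a method from the paper.

```python
# A toy soft sensor: PLS regression maps 12 routine sensor channels to one
# hard-to-measure quality variable. Synthetic data, illustrative setup.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 12))                  # routine process measurements
y = X @ rng.standard_normal(12) + 0.1 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
sensor = PLSRegression(n_components=4).fit(X_tr, y_tr)
print("held-out R^2:", sensor.score(X_te, y_te))    # predictive quality check
```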

    Interpretable Ensemble Learning for Materials Property Prediction with Classical Interatomic Potentials: Carbon as an Example

    Machine learning (ML) is widely used to explore crystal materials and predict their properties. However, training is time-consuming for deep-learning models, and the regression process is a black box that is hard to interpret. In addition, the preprocessing step that transforms a crystal structure into an ML input, called a descriptor, needs to be designed carefully. To predict important materials properties efficiently, we propose an approach based on an ensemble of regression trees, predicting the formation energy and elastic constants from small datasets of carbon allotropes as an example. Without using any descriptor, the inputs are the properties calculated by molecular dynamics with 9 different classical interatomic potentials. Overall, the results from ensemble learning are more accurate than those from the classical interatomic potentials, and ensemble learning can extract relatively accurate properties from the 9 classical potentials as criteria for predicting the final properties.
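
    The descriptor-free setup can be sketched as follows: use properties computed by several classical interatomic potentials as the feature matrix for a tree ensemble that predicts a target property. The data below is synthetic, and gradient boosting stands in for whichever regression-tree ensemble the paper uses.

```python
# Descriptor-free property prediction: features are the same property as
# computed by 9 classical potentials; the target is the reference value.
# Synthetic stand-in data; gradient boosting is an illustrative choice.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 9))                     # one column per potential
y = X.mean(axis=1) + 0.05 * rng.standard_normal(60)  # proxy reference property

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2: %.3f" % scores.mean())
```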

    Symbolic uses of export information: implications for export performance

    As export competition becomes more intense and export success vital for survival (Katsikeas, 1994), so the effective processing and use of information regarding the international environment becomes a critical prerequisite for gaining competitive advantage (Leonidou and Theodosiou, 2004). Symbolic use of information is one type of information use which, although relatively underexplored to date, may be the most prevalent form of information use within organisations – especially in an export setting (Beyer and Trice, 1982). Symbolic use occurs when information is used for purposes other than the ones which led to its collection (Menon and Varadarajan, 1992). Symbolic use of information has been conceptualised as a multi-dimensional construct encompassing various dimensions (Vyas and Souchon, 2003). Examples include “exporters that engage in distorting market research findings, taking conclusions out of context, disclosing only the findings that confirm an executive's predetermined position or consciously ignoring information” (Toften and Olsen, 2004, p. 106). Symbolic use can also legitimate decisions reached on the basis of intuition or managerial assumptions (Vyas and Souchon, 2003). Although conceptual propositions of the potential relationship between each of the symbolic use dimensions and performance exist (Vyas and Souchon, 2003), no empirical research has yet been undertaken. As a result, little is known about how and why symbolic use of export information may affect export performance, and under what circumstances. Furthermore, reliable and valid measures for each of the symbolic use dimensions are absent from the literature. The purpose of this thesis is to fill these research gaps. In so doing, a combination of qualitative and quantitative methods is employed. The exploratory phase takes the form of in-depth interviews with export decision makers in the UK. The data collected in this exploratory phase are analysed through the use of within-case and cross-case displays as per Miles and Huberman (1994) and are used not just for hypothesis development, but also to identify potential outcomes of using information symbolically in specific ways and to create pools of items for the development of measures of symbolic use. (Continues...)

    Predictive Maintenance on the Machining Process and Machine Tool

    This paper presents the process required to implement data-driven Predictive Maintenance (PdM), covering not only the machine decision making but also data acquisition and processing. A short review of the different approaches and techniques in maintenance is given. The main contribution of this paper is a solution to the predictive maintenance problem in a real machining process. The several steps needed to reach the solution are carefully explained. The obtained results show that the Preventive Maintenance (PM) carried out in a real machining process could be changed to a PdM approach. A decision-making application was developed to provide a visual analysis of the Remaining Useful Life (RUL) of the machining tool. This work is a proof of concept of the methodology on one process, but it is replicable for most serial-production machining processes.
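
    As a hedged illustration of the data-driven RUL estimation step, the sketch below regresses remaining useful life on condition-monitoring features with a random forest. The wear proxy, feature construction and degradation model are assumptions for demonstration, not the paper's actual pipeline.

```python
# Toy RUL regression for a machining tool: predict remaining useful life
# from condition-monitoring features. Degradation model is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 400
wear = np.sort(rng.uniform(0, 1, n))                # tool-wear proxy over time
features = np.c_[wear, wear**2 + 0.05 * rng.standard_normal(n)]
rul = 100.0 * (1.0 - wear)                          # remaining life in cycles

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features, rul)
print("predicted RUL at 80% wear:", model.predict([[0.8, 0.64]])[0])
```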

    Practical approaches to mining of clinical datasets: from frameworks to novel feature selection

    Research has investigated clinical data that have embedded within them numerous complexities and uncertainties in the form of missing values, class imbalances and high dimensionality. The research in this thesis was motivated by these challenges: to minimise these problems whilst, at the same time, maximising classification performance and selecting a significant subset of variables. This led to the proposal of a data mining framework and a feature selection method. The proposed framework, called the Handling Clinical Data Framework (HCDF), has a simple algorithmic structure and makes use of modified forms of existing frameworks to address a variety of data issues. The assessment of data mining techniques reveals that missing-value imputation and resampling data for class balancing can improve classification performance. Next, the proposed feature selection method, based on projection onto principal components (FS-PPC), is introduced; it draws on ideas from both feature extraction and feature selection to select a significant subset of features from the data. The method selects features that have high correlation with the principal component, measured by symmetrical uncertainty (SU), while irrelevant and redundant features are removed using mutual information (MI). This provides confidence that the selected subset of features will yield realistic results with less time and effort. FS-PPC retains classification performance and meaningful features while avoiding redundant features. The proposed methods have been applied to the analysis of real clinical data and their effectiveness assessed. The results show that the proposed methods are able to minimise the clinical data problems whilst, at the same time, maximising classification performance.
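
    A rough sketch of the FS-PPC idea, under stated assumptions: score each feature by symmetrical uncertainty against the first principal component, keep high-SU features, then greedily drop features that are redundant with already-selected ones. The discretisation and both thresholds below are illustrative choices, not the thesis's exact procedure.

```python
# Sketch of FS-PPC: rank features by symmetrical uncertainty (SU) with the
# first principal component, then greedily remove mutually redundant ones.
# Discretisation and both thresholds are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mutual_info_score

def su(a, b, bins=10):
    """Symmetrical uncertainty between two discretised variables."""
    a_d = np.digitize(a, np.histogram_bin_edges(a, bins))
    b_d = np.digitize(b, np.histogram_bin_edges(b, bins))
    mi = mutual_info_score(a_d, b_d)
    h_a = mutual_info_score(a_d, a_d)               # entropy as self-MI
    h_b = mutual_info_score(b_d, b_d)
    return 2.0 * mi / (h_a + h_b) if h_a + h_b else 0.0

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 8))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(200) # plant a redundant feature

pc1 = PCA(n_components=1).fit_transform(X).ravel()
relevant = [j for j in range(X.shape[1]) if su(X[:, j], pc1) > 0.1]

selected = []
for j in relevant:                                  # greedy redundancy filter
    if all(su(X[:, j], X[:, k]) < 0.5 for k in selected):
        selected.append(j)
print("selected features:", selected)
```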

    ANN-MIND: dropout for neural network training with missing data

    M.Sc. (Computer Science)
    Abstract: It is a well-known fact that the quality of a dataset plays a central role in the results and conclusions drawn from its analysis. As the saying goes, "garbage in, garbage out". In recent years, neural networks have displayed good performance in solving a diverse range of problems. Unfortunately, neural networks are not immune to the misfortune presented by missing values. Furthermore, in most real-world settings, the only data available for training neural networks often contains missing values. In such cases, we are left with little choice but to use this data for training, although doing so may result in a poorly trained neural network. Most systems currently in use merely discard the missing observations from the training dataset, while others proceed to use the data and ignore the problems presented by the missing values. Still other approaches impute the missing values with fixed constants such as the mean or mode. Most neural network models work under the assumption that the supplied data contains no missing values. This dissertation explores a method for training neural networks when the training dataset contains missing values.
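
    The premise can be illustrated with a dropout-style treatment of missing inputs: zero the missing entries and rescale the observed ones, as inverted dropout does for dropped units, so training can proceed without imputation. The scaling rule below is an assumption for illustration, not the dissertation's exact method.

```python
# Dropout-style handling of missing inputs: zero NaNs and rescale observed
# entries, mirroring inverted dropout. Scaling rule is an assumption.
import numpy as np

def dropout_style_mask(X):
    """Zero missing entries and rescale the rest, as inverted dropout does."""
    mask = ~np.isnan(X)
    keep_rate = mask.mean(axis=1, keepdims=True)    # fraction observed per row
    X_filled = np.where(mask, X, 0.0)
    return X_filled / np.maximum(keep_rate, 1e-8)   # inverted-dropout rescaling

X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, np.nan]])
print(dropout_style_mask(X))                        # ready for network input
```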

    Recovering Loss to Followup Information Using Denoising Autoencoders

    Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Methods available at present for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recover loss to followup information. Designed to work with high-volume data, results on various simulated and real-life datasets show our model is appropriate under varying dataset and loss to followup conditions and outperforms state-of-the-art methods by a wide margin (≥ 20% in some scenarios) while preserving the dataset utility for final analysis.
    Comment: Copyright IEEE 2017, IEEE International Conference on Big Data (Big Data)
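
    A minimal sketch of the denoising-autoencoder recovery idea: corrupt complete records during training, learn to reconstruct them, then use the reconstruction to fill fields lost to followup. The architecture and hyperparameters below are illustrative assumptions; the paper's overcomplete model is more elaborate.

```python
# Toy overcomplete denoising autoencoder: train on corrupted complete
# records, then reconstruct fields lost to followup. Architecture and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                            # complete training records

hidden = 32                                         # overcomplete: hidden > inputs
model = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    noisy = X * (torch.rand_like(X) > 0.3)          # zero ~30% of entries
    loss = nn.functional.mse_loss(model(noisy), X)  # reconstruct clean record
    opt.zero_grad()
    loss.backward()
    opt.step()

record = X[0].clone()
record[2] = 0.0                                     # a field lost to followup
print("recovered value:", model(record)[2].item())  # compare with true X[0, 2]
```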