
    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data and outstanding results for missing data.
    Comment: 23 pages, 6 figures
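
    The abstract above describes TCK only at a high level, so a minimal sketch may help make the ensemble idea concrete: fit many GMM clusterings under randomly drawn hyperparameters and average their cluster co-assignments into a kernel matrix. This is an illustrative simplification, not the published algorithm: the real TCK also uses informative priors, posterior assignments and random subsets of attributes, time segments and samples to handle missing data, whereas the sketch below assumes complete data, flattens each MTS into a vector and relies on scikit-learn's GaussianMixture; all function names and parameters are assumptions.

    # Ensemble-of-GMM similarity, a simplified stand-in for the TCK described above.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def ensemble_cluster_kernel(X, n_runs=30, max_components=8, seed=0):
        """X: array of shape (n_series, length, n_attributes) with no missing values."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        flat = X.reshape(n, -1)                      # flatten each MTS into one vector
        K = np.zeros((n, n))
        for _ in range(n_runs):
            k = int(rng.integers(2, max_components + 1))   # random number of mixture components
            gmm = GaussianMixture(n_components=k, covariance_type="diag",
                                  random_state=int(rng.integers(10**6)))
            labels = gmm.fit_predict(flat)
            K += (labels[:, None] == labels[None, :]).astype(float)
        return K / n_runs                            # average co-assignment as similarity

    # Example: 40 bivariate series of length 25 (synthetic, for illustration only).
    X = np.random.default_rng(1).normal(size=(40, 25, 2))
    K = ensemble_cluster_kernel(X)
    print(K.shape, K.diagonal().min())               # (40, 40), ones on the diagonal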

    Autoregressive Kernels For Time Series

    We propose in this work a new family of kernels for variable-length time series. Our work builds upon the vector autoregressive (VAR) model for multivariate stochastic processes: given a multivariate time series x, we consider the likelihood function p_{\theta}(x) of different parameters \theta in the VAR model as features to describe x. To compare two time series x and x', we form the product of their features p_{\theta}(x) p_{\theta}(x'), which is integrated out w.r.t. \theta using a matrix normal-inverse Wishart prior. Among other properties, this kernel can be easily computed when the dimension d of the time series is much larger than the lengths of the considered time series x and x'. It can also be generalized to time series taking values in arbitrary state spaces, as long as the state space itself is endowed with a kernel \kappa. In that case, the kernel between x and x' is a function of the Gram matrices produced by \kappa on observations and subsequences of observations enumerated in x and x'. We describe a computationally efficient implementation of this generalization that uses low-rank matrix factorization techniques. These kernels are compared to other known kernels using a set of benchmark classification tasks carried out with support vector machines.
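
    Since the closed-form integration under the matrix normal-inverse Wishart prior is the technical core of the paper, a hedged Monte Carlo sketch of the same idea may be useful for intuition: treat the VAR(1) likelihoods p_theta(x) under sampled parameters theta as features and approximate k(x, x') = E_theta[p_theta(x) p_theta(x')] by averaging over prior draws. The prior, the isotropic noise model and all names below are assumptions, and the paper's actual kernel is computed analytically rather than by sampling.

    # Naive Monte Carlo stand-in for the autoregressive kernel idea described above.
    import numpy as np
    from scipy.special import logsumexp

    def var1_loglik(x, A, sigma2):
        """Gaussian VAR(1) log-likelihood of a (T, d) series with isotropic noise."""
        resid = x[1:] - x[:-1] @ A.T
        T1, d = resid.shape
        return -0.5 * (resid ** 2).sum() / sigma2 - 0.5 * T1 * d * np.log(2 * np.pi * sigma2)

    def ar_kernel(x, y, n_samples=200, seed=0):
        rng = np.random.default_rng(seed)
        d = x.shape[1]
        logs = []
        for _ in range(n_samples):
            A = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))   # prior draw of the VAR matrix
            sigma2 = 1.0 / rng.gamma(2.0, 1.0)                    # prior draw of the noise variance
            logs.append(var1_loglik(x, A, sigma2) + var1_loglik(y, A, sigma2))
        return logsumexp(logs) - np.log(n_samples)                # log of the Monte Carlo average

    rng = np.random.default_rng(1)
    x, y = rng.normal(size=(30, 3)), rng.normal(size=(40, 3))     # series of different lengths
    print(ar_kernel(x, y))                                        # log-kernel value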

    Development and evaluation of machine learning algorithms for biomedical applications

    Gene network inference and drug response prediction are two important problems in computational biomedicine. The former helps scientists better understand the functional elements and regulatory circuits of cells. The latter helps physicians identify effective treatments for their patients. Both problems have been widely studied, though current solutions are far from perfect, and more research is needed to improve the accuracy of existing approaches. This dissertation develops machine learning and data mining algorithms and applies them to these two biomedical problems. Specifically, to tackle the gene network inference problem, the dissertation proposes (i) new techniques for selecting topological features suitable for link prediction in gene networks; (ii) a graph sparsification method for network sampling; (iii) combined supervised and unsupervised methods to infer gene networks; and (iv) sampling and boosting techniques for reverse engineering gene networks. For the drug sensitivity prediction problem, the dissertation presents (i) an instance selection technique and a hybrid method for drug sensitivity prediction; (ii) a link prediction approach to drug sensitivity prediction; (iii) a noise-filtering method for drug sensitivity prediction; and (iv) transfer learning approaches for enhancing the performance of drug sensitivity prediction. Extensive experiments are conducted to evaluate the effectiveness and efficiency of the proposed algorithms. Experimental results demonstrate the feasibility of the algorithms and their superiority over existing approaches.
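
    As a small illustration of one ingredient named above, link prediction in gene networks from topological features, the following hedged sketch scores candidate gene pairs with simple graph statistics and trains a classifier to separate held-out edges from non-edges. The toy graph, the chosen features and the random-forest classifier are assumptions for illustration; they are not the dissertation's actual data or models.

    # Supervised link prediction from topological features (illustrative sketch).
    import networkx as nx
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def pair_features(G, u, v):
        cn = len(list(nx.common_neighbors(G, u, v)))           # shared neighbours
        jac = next(nx.jaccard_coefficient(G, [(u, v)]))[2]     # Jaccard coefficient
        pa = G.degree(u) * G.degree(v)                         # preferential attachment
        return [cn, jac, pa]

    rng = np.random.default_rng(0)
    G_full = nx.powerlaw_cluster_graph(100, 3, 0.3, seed=0)    # stand-in for a gene network
    edges = list(G_full.edges)
    idx = rng.choice(len(edges), size=len(edges) // 5, replace=False)
    hidden = [edges[i] for i in idx]                           # edges to be predicted
    G_obs = G_full.copy()
    G_obs.remove_edges_from(hidden)                            # features come from the observed graph
    non_edges = [e for e in nx.non_edges(G_full)][:len(hidden)]  # negative examples

    X = np.array([pair_features(G_obs, u, v) for u, v in hidden + non_edges])
    y = np.array([1] * len(hidden) + [0] * len(non_edges))
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print("training accuracy on hidden links vs. non-links:", clf.score(X, y))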

    The prediction of interferon treatment effects based on time series microarray gene expression profiles

    BACKGROUND: The status of a disease can be reflected by specific transcriptional profiles resulting from the induction or repression activity of a number of genes. Here, we proposed a time-dependent diagnostic model to predict the treatment effects of interferon and ribavirin on HCV-infected patients by using time series microarray gene expression profiles from a published study. METHODS: In the published study, 33 African-American (AA) and 36 Caucasian American (CA) patients with chronic HCV genotype 1 infection received pegylated interferon and ribavirin therapy for 28 days. The HG-U133A GeneChip containing 22,283 probes was used to analyze global gene expression in peripheral blood mononuclear cells (PBMC) of all patients on day 0 (pretreatment), 1, 2, 7, 14, and 28. According to the decrease of HCV RNA levels on day 28, two categories of responses were defined: good and poor. A voting method based on Student's t test, the Wilcoxon test, the empirical Bayes test and significance analysis of microarrays was used to identify differentially expressed genes. A time-dependent diagnostic model based on the C4.5 decision tree was constructed to predict the treatment outcome. This model utilized the gene expression profiles not only before the treatment but also during the treatment. Leave-one-out cross validation was used to evaluate the performance of the model. RESULTS: The model correctly predicted the treatment effects of all Caucasian American patients at a very early time point. The prediction accuracy for African-American patients reached 85.7%. In addition, thirty potential biomarkers which may play important roles in the response to interferon and ribavirin were identified. CONCLUSION: Our method provides a way of using time series gene expression profiling to predict the treatment effect of pegylated interferon and ribavirin therapy on HCV-infected patients. Similar experimental and bioinformatics strategies may be used to improve treatment decisions for other chronic diseases.
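
    The pipeline above (vote-based gene selection followed by a decision tree evaluated with leave-one-out cross-validation) can be sketched compactly. The sketch below is a hedged stand-in: it substitutes scikit-learn's CART tree for C4.5, uses only two of the four tests in the voting step, runs on synthetic expression data rather than the HG-U133A profiles, and, unlike a rigorous evaluation, performs gene selection once outside the cross-validation loop; all thresholds and names are assumptions.

    # Vote-based gene screening plus a decision tree with leave-one-out CV (illustrative).
    import numpy as np
    from scipy import stats
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    n_patients, n_genes = 40, 500
    X = rng.normal(size=(n_patients, n_genes))           # expression matrix (patients x genes)
    y = rng.integers(0, 2, size=n_patients)              # 1 = good response, 0 = poor response
    X[y == 1, :20] += 1.0                                # make the first 20 genes informative

    votes = np.zeros(n_genes, dtype=int)
    for g in range(n_genes):
        a, b = X[y == 1, g], X[y == 0, g]
        votes[g] += stats.ttest_ind(a, b).pvalue < 0.01       # Student's t test vote
        votes[g] += stats.mannwhitneyu(a, b).pvalue < 0.01    # Wilcoxon rank-sum vote
    selected = np.where(votes >= 2)[0]                        # keep genes passing both tests

    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    acc = cross_val_score(clf, X[:, selected], y, cv=LeaveOneOut()).mean()
    print(f"{len(selected)} genes selected, LOOCV accuracy = {acc:.2f}")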

    Linear latent force models using Gaussian processes.

    Purely data-driven approaches for machine learning present difficulties when data are scarce relative to the complexity of the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations to combine data-driven modeling with a physical model of the system. We show how different, physically inspired, kernel functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility of our approach is illustrated with three case studies from motion capture, computational biology, and geostatistics.
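
    Because the abstract stays high level, a small numerical sketch of the latent force model idea may help: a mechanistic first-order ODE dx/dt = -D x + f(t), driven by a latent force f with an RBF Gaussian-process prior, induces a GP covariance on the output x. The sketch below builds that induced kernel by discretising the ODE's Green's function rather than using the paper's closed-form kernels; the grid, the parameters and the function names are assumptions.

    # Physically inspired GP kernel from a first-order ODE (numerical sketch).
    import numpy as np

    def rbf(t, s, lengthscale=0.3):
        return np.exp(-0.5 * (t[:, None] - s[None, :]) ** 2 / lengthscale ** 2)

    def latent_force_kernel(t, D=2.0, lengthscale=0.3, n_grid=400):
        """Covariance of x(t) = int_0^t exp(-D (t - s)) f(s) ds with f ~ GP(0, RBF)."""
        s = np.linspace(0.0, t.max(), n_grid)
        ds = s[1] - s[0]
        G = np.where(s[None, :] <= t[:, None],
                     np.exp(-D * (t[:, None] - s[None, :])), 0.0) * ds   # discretised Green's function
        return G @ rbf(s, s, lengthscale) @ G.T                          # K_x = G K_f G^T

    t = np.linspace(0.0, 3.0, 60)
    K = latent_force_kernel(t)
    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)), size=3)
    print(samples.shape)   # three draws of the ODE output implied by the latent GP force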

    A method to identify differential expression profiles of time-course gene data with Fourier transformation

    BACKGROUND: Time course gene expression experiments are an increasingly popular method for exploring biological processes. Temporal gene expression profiles provide an important characterization of gene function, as biological systems are both developmental and dynamic. With such data it is possible to study gene expression changes over time and thereby to detect differential genes. Much of the early work on analyzing time series expression data relied on methods developed originally for static data, and thus there is a need for improved methodology. Since time series expression is a temporal process, its unique features, such as autocorrelation between successive points, should be incorporated into the analysis. RESULTS: This work aims to identify genes that show different gene expression profiles across time. We propose a statistical procedure to discover gene groups with similar profiles using a nonparametric representation that accounts for the autocorrelation in the data. In particular, we first represent each profile in terms of a Fourier basis, and then we screen out genes that are not differentially expressed based on the Fourier coefficients. Finally, we cluster the remaining gene profiles using a model-based approach in the Fourier domain. We evaluate the screening results in terms of sensitivity, specificity, FDR and FNR, compare with Gaussian process regression screening in a simulation study, and illustrate the results by application to yeast cell-cycle microarray expression data with alpha-factor synchronization. The key elements of the proposed methodology are: (i) representation of gene profiles in the Fourier domain; (ii) automatic screening of genes based on the Fourier coefficients, taking into account autocorrelation in the data while controlling the false discovery rate (FDR); (iii) model-based clustering of the remaining gene profiles. CONCLUSIONS: Using this method, we identified a set of cell-cycle-regulated time-course yeast genes. The proposed method is general and can potentially be used to identify genes which share the same patterns or biological processes, and help face the present and forthcoming challenges of data analysis in functional genomics.
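
    The three key elements listed above (Fourier representation, coefficient-based screening, model-based clustering in the Fourier domain) can be mimicked in a few lines. The sketch below is only a hedged illustration on synthetic profiles: it replaces the paper's FDR-controlled tests with a crude energy threshold and its model-based clustering with scikit-learn's Gaussian mixture, so the data, the number of retained harmonics and the threshold are all assumptions.

    # Fourier-domain screening and clustering of expression profiles (illustrative sketch).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    t = np.linspace(0, 2 * np.pi, 24)                     # 24 time points per profile
    flat = rng.normal(0, 0.3, size=(300, t.size))         # non-changing genes
    periodic = np.sin(t + rng.uniform(0, 2 * np.pi, (100, 1))) + rng.normal(0, 0.3, (100, t.size))
    profiles = np.vstack([flat, periodic])                # 400 synthetic gene profiles

    coeffs = np.fft.rfft(profiles, axis=1)[:, 1:5]        # first few non-constant harmonics
    features = np.hstack([coeffs.real, coeffs.imag])      # Fourier-domain representation
    energy = (np.abs(coeffs) ** 2).sum(axis=1)
    keep = energy > np.quantile(energy, 0.7)              # crude stand-in for the screening tests

    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(features[keep])
    print(f"kept {keep.sum()} of {len(profiles)} profiles; cluster sizes: {np.bincount(labels)}")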

    Restricting Supervised Learning: Feature Selection and Feature Space Partition

    Many supervised learning problems are considered difficult to solve either because of redundant features or because of the structural complexity of the generative function. Redundant features increase the learning noise and therefore decrease the prediction performance. Additionally, a number of problems in various applications such as bioinformatics or image processing, whose data are sampled in a high dimensional space, suffer from the curse of dimensionality, and there are not enough observations to obtain good estimates. Therefore, it is necessary to reduce the number of features under consideration. Another issue of supervised learning is caused by the complexity of an unknown generative model. To obtain a low variance predictor, linear or other simple functions are normally suggested, but they usually result in high bias. Hence, a possible solution is to partition the feature space into multiple non-overlapping regions such that each region is simple enough to be classified easily. In this dissertation, we propose several novel techniques for restricting supervised learning problems with respect to either feature selection or feature space partition. Among different feature selection methods, 1-norm regularization is advocated by many researchers because it incorporates feature selection as part of the learning process. We give special focus here to ranking problems because very little work has been done on ranking with an L1 penalty. We present a 1-norm support vector machine method to simultaneously find a linear ranking function and perform feature subset selection in ranking problems. Additionally, because ranking is formulated as a classification task when pair-wise data are considered, the computational complexity increases from linear to quadratic in terms of sample size. We also propose a convex hull reduction method to reduce this impact. The method was tested on one artificial data set and two benchmark real data sets, the concrete compressive strength data set and the Abalone data set. Theoretically, by tuning the trade-off parameter between the 1-norm penalty and the empirical error, any desired size of feature subset could be achieved, but computing the whole solution path in terms of the trade-off parameter is extremely difficult. Therefore, using 1-norm regularization alone may not end up with a feature subset of small size. We propose a recursive feature selection method based on 1-norm regularization which can handle the multi-class setting effectively and efficiently. The selection is performed iteratively: in each iteration, a linear multi-class classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero, and those zero-weight features are eliminated in the next iteration. The selection process has a fast rate of convergence. We tested our method on an earthworm microarray data set, and the empirical results demonstrate that the selected features (genes) have very competitive discriminative power. Feature space partition separates a complex learning problem into multiple non-overlapping simple sub-problems. It is normally implemented in a hierarchical fashion. Unlike a decision tree, a leaf node of this hierarchical structure does not represent a single decision but a region (sub-problem) that is solvable with respect to linear functions or other simple functions. In our work, we incorporate domain knowledge into the feature space partition process.
    We consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. However, it is not trivial to select the discrete or categorical attribute that maximally simplifies the learning task. A naive approach exhaustively searches all the possible restructured problems, which is computationally prohibitive when the number of discrete or categorical attributes is large. We describe a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach was tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem. Restricting supervised learning is always about building simple learning functions using a limited number of features. The Top Selected Pair (TSP) method builds simple classifiers based on very few (for example, two) features with simple arithmetic calculations. However, the traditional TSP method only deals with static data. In this dissertation, we propose classification methods for time series data that depend on only a few pairs of features. Based on different comparison strategies, we develop the following approaches: TSP based on average, TSP based on trend, and TSP based on trend and absolute difference amount. In addition, inspired by the idea of using two features, we propose a time series classification method based on a few feature pairs using dynamic time warping and nearest-neighbor classification.
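
    Of the techniques surveyed above, the recursive feature selection with 1-norm regularization is the most direct to sketch: repeatedly fit an L1-penalised linear multi-class classifier and drop the features whose weights are zero for every class. The sketch below uses a synthetic data set and scikit-learn's LinearSVC with an L1 penalty as a stand-in for the dissertation's classifier; the penalty strength, stopping rule and data are assumptions.

    # Recursive feature selection with 1-norm (L1) regularization (illustrative sketch).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                               n_classes=3, n_clusters_per_class=1, random_state=0)
    active = np.arange(X.shape[1])                 # indices of the surviving features

    for it in range(10):
        clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X[:, active], y)
        nonzero = np.any(np.abs(clf.coef_) > 1e-8, axis=0)   # keep a feature if any class uses it
        if nonzero.all():                          # no more zero-weight features: stop
            break
        active = active[nonzero]                   # eliminate zero-weight features

    print(f"{len(active)} features kept after {it + 1} iterations")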

    Identification of Nonlinear State-Space Systems from Heterogeneous Datasets

    This paper proposes a new method to identify nonlinear state-space systems from heterogeneous datasets. The method is described in the context of identifying biochemical/gene networks (i.e., identifying both reaction dynamics and kinetic parameters) from experimental data. Simultaneous integration of various datasets has the potential to yield better performance for system identification. Data collected experimentally typically vary depending on the specific experimental setup and conditions. Typically, heterogeneous data are obtained experimentally through (a) replicate measurements from the same biological system or (b) application of different experimental conditions such as changes/perturbations in biological inductions, temperature, gene knock-out, gene over-expression, etc. We formulate the identification problem using a Bayesian learning framework that makes use of “sparse group” priors to allow inference of the sparsest model that can explain the whole set of observed, heterogeneous data. To enable scaling up to a large number of features, the resulting non-convex optimisation problem is relaxed to a re-weighted Group Lasso problem using a convex-concave procedure. As an illustrative example of the effectiveness of our method, we use it to identify a genetic oscillator (a generalised eight-species repressilator). Through this example, we show that our algorithm outperforms Group Lasso when the number of experiments is increased, even when each single time-series dataset is short. We additionally assess the robustness of our algorithm against noise by varying the intensity of the process noise and measurement noise.
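
    To make the sparse-group idea above more tangible, the following hedged sketch identifies sparse dynamics from two heterogeneous "experiments" with an ordinary Group Lasso solved by proximal gradient: each candidate nonlinear term receives one coefficient per experiment, and the coefficients of a term across all experiments form one group, so a term is selected everywhere or dropped everywhere. This is a simplified stand-in for the paper's Bayesian sparse-group priors and convex-concave reweighting, and the toy system, dictionary, noise level and penalty are assumptions.

    # Group Lasso over a dictionary of candidate terms, grouped across experiments (sketch).
    import numpy as np

    def group_lasso(X, y, groups, lam=1.0, n_iter=2000):
        """Proximal gradient for 0.5*||y - Xw||^2 + lam * sum_g ||w_g||_2."""
        step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            z = w - step * X.T @ (X @ w - y)                 # gradient step on the quadratic part
            for g in groups:                                 # group soft-thresholding (prox step)
                norm = np.linalg.norm(z[g])
                z[g] = 0.0 if norm <= lam * step else (1 - lam * step / norm) * z[g]
            w = z
        return w

    # Two "experiments" of the same 1-D system dx/dt = a*x + b*x^3 with different parameters.
    rng = np.random.default_rng(0)
    dictionary = lambda x: np.column_stack([x, x**2, x**3, x**4])     # candidate nonlinear terms
    blocks, targets = [], []
    for a, b in [(-1.0, -0.5), (-1.5, -0.3)]:
        x = rng.uniform(-2, 2, 200)
        blocks.append(dictionary(x))
        targets.append(a * x + b * x**3 + rng.normal(0, 0.01, 200))   # noisy derivative observations

    p, E = blocks[0].shape[1], len(blocks)
    X = np.zeros((sum(B.shape[0] for B in blocks), p * E))
    for e, B in enumerate(blocks):
        X[e * 200:(e + 1) * 200, e * p:(e + 1) * p] = B               # block-diagonal design matrix
    y = np.concatenate(targets)
    groups = [[j + e * p for e in range(E)] for j in range(p)]        # term j across experiments

    w = group_lasso(X, y, groups, lam=5.0)
    print(np.round(w.reshape(E, p), 2))   # rows = experiments; only the x and x^3 columns should survive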