106 research outputs found

    Microarray missing data imputation based on a set theoretic framework and biological knowledge

    Gene expression data measured using microarrays usually suffer from the missing value problem, yet many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms handle missing values well, they also have limitations. For example, some algorithms perform well only when strong local correlation exists in the data, while others provide the best estimate when the data are dominated by global structure. In addition, these algorithms do not take any biological constraints into account in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge as a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets, taking into consideration the biological characteristics of the data: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits the biological phenomenon of synchronization loss in microarray experiments, which is common in cyclic systems. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
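
    The iterative procedure described in this abstract can be illustrated with a minimal sketch of generic POCS. The three convex sets below (agreement with observed entries, a value range, and a fixed row sum) are hypothetical stand-ins chosen for simplicity; they are not the local-correlation, global-correlation, or synchronization-loss sets designed in the paper.

```python
def project_observed(x, observed):
    """Project onto {x : x[i] == v for each observed (i, v)}."""
    y = list(x)
    for i, v in observed.items():
        y[i] = v
    return y


def project_box(x, lo, hi):
    """Project onto the box {x : lo <= x[i] <= hi} by clipping."""
    return [min(max(v, lo), hi) for v in x]


def project_hyperplane(x, a, b):
    """Project onto the hyperplane {x : a . x == b}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm2 = sum(ai * ai for ai in a)
    step = (dot - b) / norm2
    return [xi - step * ai for xi, ai in zip(x, a)]


def pocs(x0, projections, iters=200, tol=1e-10):
    """Cycle through the projections until one full sweep barely moves x."""
    x = list(x0)
    for _ in range(iters):
        prev = x
        for p in projections:
            x = p(x)
        if max(abs(u - v) for u, v in zip(x, prev)) < tol:
            break
    return x


# Toy profile with 4 entries; entries 2 and 3 are missing (assumed data).
observed = {0: 1.0, 1: 2.0}
projections = [
    lambda x: project_hyperplane(x, [1.0] * 4, 8.0),  # assumed prior: entries sum to 8
    lambda x: project_box(x, 0.0, 3.5),               # assumed prior: values in [0, 3.5]
    lambda x: project_observed(x, observed),          # keep measured values fixed
]
imputed = pocs([1.0, 2.0, 0.0, 0.0], projections)
print(imputed)
```

    Because every set is convex and their intersection is non-empty, the sweep settles on a point that satisfies all three constraints at once; here the two missing entries move toward 2.5 each.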

    Ensemble learning based on classifier prediction confidence and comprehensive learning particle swarm optimisation for medical image segmentation.

    Segmentation, the process of partitioning an image into multiple segments to locate objects and boundaries, is considered one of the most essential medical imaging processes. In recent years, Deep Neural Networks (DNN) have achieved many notable successes in medical image analysis, including image segmentation. Because medical imaging applications require robust, reliable results, it is necessary to devise effective DNN models for medical applications. One solution is to combine multiple DNN models in an ensemble system to obtain better results than any single DNN model. Ensemble learning is a popular machine learning technique in which multiple models are combined to improve the final results, and it has been widely used in medical image analysis. In this paper, we propose to measure the confidence in the prediction of each model in the ensemble system and then use an associated threshold to determine whether the confidence is acceptable. A segmentation model is selected based on the comparison between its confidence and its associated threshold. The optimal threshold for each segmentation model is found by using Comprehensive Learning Particle Swarm Optimisation (CLPSO), a swarm intelligence algorithm. The Dice coefficient, a popular performance metric for image segmentation, is used as the fitness criterion. The experimental results on three medical image segmentation datasets confirm that our ensemble achieves better results than some well-known segmentation models.
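
    As a rough illustration of the two ingredients named in this abstract, the sketch below computes the Dice coefficient for binary masks and fuses per-pixel foreground probabilities from only those models whose confidence reaches their threshold. The confidence measure (distance of the probability from 0.5), the fallback rule, and the function names are assumptions made for the example; the thresholds are fixed by hand here rather than searched with CLPSO.

```python
def dice(pred, truth):
    """Dice coefficient 2|A n B| / (|A| + |B|) for binary masks given as 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * inter / total


def select_and_fuse(prob_maps, thresholds):
    """Per pixel, average only the models whose confidence meets their threshold.

    prob_maps[m][i] is model m's foreground probability for pixel i.
    Confidence is taken as |p - 0.5| * 2 (an assumption for this sketch).
    If no model is confident enough, all models are averaged instead.
    """
    mask = []
    for i in range(len(prob_maps[0])):
        chosen = [pm[i] for pm, th in zip(prob_maps, thresholds)
                  if abs(pm[i] - 0.5) * 2 >= th]
        if not chosen:
            chosen = [pm[i] for pm in prob_maps]
        mask.append(1 if sum(chosen) / len(chosen) >= 0.5 else 0)
    return mask


# Two hypothetical models scoring four pixels.
prob_maps = [[0.9, 0.6, 0.2, 0.6],
             [0.8, 0.1, 0.4, 0.45]]
fused = select_and_fuse(prob_maps, thresholds=[0.5, 0.5])
score = dice(fused, [1, 1, 0, 1])
print(fused, score)
```

    In an optimisation loop, a score like this one, averaged over a validation set, would serve as the fitness value for a candidate vector of thresholds.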

    Heterogeneous ensemble selection for evolving data streams.

    Ensemble learning has been widely applied to both batch data classification and streaming data classification. For the latter setting, most existing ensemble systems are homogeneous, which means they are generated from only one type of learning model. In contrast, by combining several different types of learning models, a heterogeneous ensemble system can achieve greater diversity among its members, which helps to improve its performance. Although heterogeneous ensemble systems have achieved many successes in the batch classification setting, it is not trivial to extend them directly to the data stream setting. In this study, we propose a novel HEterogeneous Ensemble Selection (HEES) method, which dynamically selects an appropriate subset of base classifiers to predict data under the stream setting. We are inspired by the observation that a well-chosen subset of good base classifiers may outperform the whole ensemble system. Here, we define a good candidate as one that exhibits not only high predictive performance but also high confidence in its prediction. Our selection process is thus divided into two sub-processes: accurate-candidate selection and confident-candidate selection. We define an accurate candidate in the stream context as a base classifier with high accuracy over the current concept, and a confident candidate as one with a confidence score higher than a certain threshold. In the first sub-process, we employ the prequential accuracy to estimate the performance of a base classifier at a specific time, while in the second sub-process, we propose a new measure to quantify the predictive confidence and provide a method to learn the threshold incrementally. The final ensemble is formed by taking the intersection of the sets of confident classifiers and accurate classifiers. Experiments on a wide range of data streams show that the proposed method achieves competitive performance with lower running time in comparison to the state-of-the-art online ensemble methods.
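
    The two-stage selection can be pictured with a small sketch: a prequential (test-then-train) accuracy counter, and a final ensemble formed as the intersection of the accurate and confident sets. The fixed cut-offs, the classifier names, and the confidence scores are placeholders; the paper's own confidence measure and its incremental threshold learning are not reproduced here.

```python
class PrequentialAccuracy:
    """Running accuracy where each example is used for testing before training."""

    def __init__(self):
        self.correct = 0
        self.seen = 0

    def update(self, predicted, actual):
        # Score the prediction made *before* the model saw this label.
        self.correct += int(predicted == actual)
        self.seen += 1

    @property
    def value(self):
        return self.correct / self.seen if self.seen else 0.0


def select_members(accuracy, confidence, acc_cut, conf_cut):
    """Return classifiers that are both accurate and confident (set intersection)."""
    accurate = {name for name, a in accuracy.items() if a >= acc_cut}
    confident = {name for name, c in confidence.items() if c >= conf_cut}
    return accurate & confident


# Prequential accuracy of one hypothetical base classifier on four stream items.
tracker = PrequentialAccuracy()
for predicted, actual in [(1, 1), (0, 1), (1, 1), (1, 1)]:
    tracker.update(predicted, actual)

# Hypothetical per-classifier scores at the current time step.
accuracy = {"nb": 0.90, "tree": 0.60, "knn": 0.85}
confidence = {"nb": 0.70, "tree": 0.95, "knn": 0.40}
members = select_members(accuracy, confidence, acc_cut=0.8, conf_cut=0.6)
print(tracker.value, members)
```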

    Spectral estimation in unevenly sampled space of periodically expressed microarray time series data

    BACKGROUND: Periodogram analysis of time series is widespread in biology. A new challenge in analyzing microarray time series data is to identify genes that are periodically expressed. This challenge arises because the observed time series usually exhibit non-idealities, such as noise, short length, and unevenly sampled time points. Most methods used in the literature operate on evenly sampled time series and are not suitable for unevenly sampled time series. RESULTS: For evenly sampled data, methods based on the classical Fourier periodogram are often used to detect periodically expressed genes. Recently, the Lomb-Scargle algorithm has been applied to unevenly sampled gene expression data for spectral estimation. However, since the Lomb-Scargle method assumes that there is a single stationary sinusoid wave with infinite support, it introduces spurious periodic components in the periodogram for data with a finite length. In this paper, we propose a new spectral estimation algorithm for unevenly sampled gene expression data. The new method is based on signal reconstruction in a shift-invariant signal space, where a direct spectral estimation procedure is developed using the B-spline basis. Experiments on simulated noisy gene expression profiles show that our algorithm is superior to the Lomb-Scargle algorithm and the classical Fourier periodogram based method in detecting periodically expressed genes. We have applied our algorithm to the Plasmodium falciparum and yeast gene expression data, and the results show that the algorithm is able to detect biologically meaningful periodically expressed genes. CONCLUSION: We have proposed an effective method for identifying periodic genes in unevenly sampled space of microarray time series gene expression data. The method can also be used as an effective tool for gene expression time series interpolation or resampling.
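
    For orientation, the Lomb-Scargle baseline mentioned in this abstract can be sketched in a few lines. This is the classical unnormalised periodogram, not the B-spline method the paper proposes. The function expects an approximately zero-mean series (real data would normally be centred first), and the sampling times and test signal are made up for the example.

```python
import math


def lomb_scargle(times, values, freqs):
    """Classical (unnormalised) Lomb-Scargle power at each frequency in freqs."""
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal
        # on the given, possibly uneven, sampling times.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        c = [math.cos(w * (t - tau)) for t in times]
        s = [math.sin(w * (t - tau)) for t in times]
        yc = sum(y * ci for y, ci in zip(values, c))
        ys = sum(y * si for y, si in zip(values, s))
        cc = sum(ci * ci for ci in c)
        ss = sum(si * si for si in s)
        powers.append(0.5 * (yc * yc / cc + ys * ys / ss))
    return powers


# Unevenly spaced sampling times (assumed) and a noise-free sinusoid at 0.2 cycles per unit time.
times = [0.0, 0.7, 1.9, 2.4, 3.8, 4.1, 5.6, 6.3, 7.7, 8.2, 9.9, 10.5, 11.8, 13.1, 14.6]
values = [math.sin(2.0 * math.pi * 0.2 * t) for t in times]
freqs = [k / 100 for k in range(5, 51)]
powers = lomb_scargle(times, values, freqs)
peak = freqs[powers.index(max(powers))]
print(peak)
```

    The power at each frequency is half the squared norm of the data's projection onto a sinusoid at that frequency, so a clean sinusoid produces its largest value at its own frequency.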

    Aggregation of classifiers: a justifiable information granularity approach.

    In this paper, we introduce a new approach to combining multiple classifiers in a heterogeneous ensemble system. Instead of using numerical membership values when combining, we construct interval membership values for each class prediction from the meta-data of the observation by using the concept of an information granule. In the proposed method, the uncertainty (diversity) of the predictions produced by the base classifiers is quantified by the interval-based information granules. The decision model is then generated by considering both the bounds and the lengths of the intervals. Extensive experimentation using the UCI datasets has demonstrated the superior performance of our algorithm over other algorithms, including six fixed combining methods, one trainable combining method, AdaBoost, bagging, and random subspace.

    A weighted multiple classifier framework based on random projection.

    In this paper, we propose a weighted multiple classifier framework based on random projections. Similar to the mechanism of other homogeneous ensemble methods, the base classifiers in our approach are obtained by a learning algorithm on different training sets generated by projecting the original up-space training set to lower dimensional down-spaces. We then apply a Least Squares-based method to weigh the outputs of the base classifiers so that the contribution of each classifier to the final combined prediction is different. We choose Decision Tree as the learning algorithm in the proposed framework and conduct experiments on a number of real and synthetic datasets. The experimental results indicate that our framework is better than many of the benchmark algorithms, including three homogeneous ensemble methods (Bagging, RotBoost, and Random Subspace), several well-known algorithms (Decision Tree, Random Neural Network, Linear Discriminant Analysis, K Nearest Neighbor, L2-loss Linear Support Vector Machine, and Discriminative Restricted Boltzmann Machine), and random projection-based ensembles with fixed combining rules, with regard to both classification error rates and F1 scores.
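
    Two steps of this framework lend themselves to a short sketch: projecting the training set into a lower-dimensional space with a random matrix, and fitting combining weights by least squares. The Gaussian projection matrix, the single-output weighting, and all function names are simplifying assumptions; the paper's exact projection scheme and weighting formulation may differ, and the Decision Tree learner is omitted.

```python
import random


def random_projection_matrix(p, q, rng):
    """q x p Gaussian matrix scaled by 1/sqrt(q) (one common choice, assumed here)."""
    scale = 1.0 / q ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(p)] for _ in range(q)]


def project(X, R):
    """Map each p-dimensional row of X to q dimensions via R."""
    return [[sum(r * v for r, v in zip(row, x)) for row in R] for x in X]


def least_squares_weights(outputs, targets):
    """Solve min_w ||outputs w - targets||^2 through the normal equations."""
    k = len(outputs[0])
    # Augmented system [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in outputs) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(outputs, targets))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            factor = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= factor * M[c][j]
    # Back substitution.
    w = [0.0] * k
    for i in range(k - 1, -1, -1):
        w[i] = (M[i][k] - sum(M[i][j] * w[j] for j in range(i + 1, k))) / M[i][i]
    return w


rng = random.Random(0)
X = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0, 1.0, 0.0]]  # toy 5-dimensional data
X_down = project(X, random_projection_matrix(p=5, q=2, rng=rng))

# Hypothetical outputs of two base classifiers on three training examples.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = least_squares_weights(outputs, targets=[0.7, 0.3, 1.0])
print(X_down, weights)
```

    Each base classifier would be trained on its own projected copy of the data, and the fitted weights then decide how much each classifier contributes to the combined prediction.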