
    Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments

    Abstract
    Background: Microarray technologies produce large amounts of data. In a previous study, we showed the interest of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by hierarchical algorithms. Since then, numerous replacement methods have been proposed to impute missing values (MVs) in microarray data. In this study, we evaluated twelve such methods and their influence on the quality of gene clustering, using several datasets from both kinetic and non-kinetic experiments in yeast and human.
    Results: We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially the one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, the most difficult values to predict. We show that imputed MVs still have important effects on the stability of gene clusters. The improvement obtained with hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency links the quality of the imputation method to the stability of the gene clusters. Even though comparing clustering algorithms is a complex task, we observed that the k-means approach is more effective at conserving gene associations.
    Conclusions: More than 6,000,000 independent simulations assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our previous study. The EM_array approach constitutes an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and thus for dedicated benchmarks. A noticeable point is the specific influence of some biological datasets.
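    As a concrete reference point, the sketch below shows the classic k-Nearest Neighbour imputation idea this line of work builds on (a minimal illustration with our own function and parameter names; it is not the EM_array method evaluated above):

```python
# Minimal sketch of k-Nearest Neighbour imputation for a gene expression
# matrix (genes x conditions); illustrative only, not the EM_array method.
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs in each row using the k most similar complete rows.

    Similarity is Euclidean distance over the columns observed in the
    target row, in the spirit of the classic KNNimpute scheme.
    """
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]          # candidate neighbour genes
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        row = X[i]
        obs = ~np.isnan(row)                        # observed columns of this gene
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nn = complete[np.argsort(d)[:k]]            # k nearest complete genes
        row[~obs] = nn[:, ~obs].mean(axis=0)        # average neighbour values
    return X
```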

    A Machine Learning Approach to Query Time-Series Microarray Data Sets for Functionally Related Genes Using Hidden Markov Models

    Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; proteins used by cellular functions exhibit higher rates of expression for the associated genes. If multiple proteins are required for a particular function, then their genes show a pattern of coexpression during time periods when the function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field. The number of microarray gene expression datasets available in public repositories is increasing rapidly, and advances in technology have now made it feasible to routinely perform whole-genome studies where the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making this amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it were possible to quickly extract information useful to their area of research. This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known and functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray datasets, where gene expression values are obtained from the same experiment over a period of time through a series of sequential measurements. A feature selection algorithm selects relevant time steps where the provided input genes exhibit correlated expression behavior. Time steps are the columns in microarray data sets; rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected time steps and is trained using the expression values of the input genes from the microarray. Given the trained HMM, the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows for the assignment of a probability score to each gene in the microarray. High-scoring genes are included in the result set of genes with functional similarities to the input genes. P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene. A feedback loop uses the result set generated from one algorithm run as the input set for another iteration of the algorithm. This iterated HMM algorithm allows for the characterization of functional modules from very small input sets and for weak similarity signals. The algorithm also allows for the integration of multiple microarray data sets; two approaches are studied: Meta-Analysis (combination of the results from individual data set runs) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for the integration of closely related microarrays, while a spanning HMM works best for the integration of multiple heterogeneous datasets.
The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological applicability is verified by analyzing biological data sets of the fruit fly D. melanogaster and baker's yeast S. cerevisiae. The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than currently available algorithms in the published literature.
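A compact sketch of the scoring step, under stated assumptions: train a Gaussian HMM on the profiles of the known input genes, then rank every gene by its log-likelihood under that model. It uses hmmlearn's GaussianHMM as a generic stand-in for the specific linear HMM described above; all names and defaults here are ours.

```python
# Illustrative only: rank genes by log-likelihood under an HMM trained on
# the seed genes' time-series profiles. Assumes hmmlearn is installed and
# expr contains no missing values.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def rank_genes_by_hmm(expr, seed_idx, n_states=5):
    """expr: (genes x time steps) array; seed_idx: indices of input genes."""
    seqs = expr[seed_idx]                            # training profiles
    X = seqs.reshape(-1, 1)                          # concatenated 1-D sequences
    lengths = [expr.shape[1]] * len(seed_idx)        # one length per seed gene
    model = GaussianHMM(n_components=n_states).fit(X, lengths)
    # Score every gene's time series under the trained model.
    scores = np.array([model.score(expr[g].reshape(-1, 1))
                       for g in range(expr.shape[0])])
    return np.argsort(scores)[::-1]                  # high score = most similar
```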

    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining: feature subset selection in large-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, with the predictions of the base classifiers combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out among ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) a GRNN is used for both the base classifiers and the top-level combiner classifier. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that rely on simple voting or static weighting strategies. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to training scenarios that are similar to the new one. The simulation results demonstrate the validity of the proposed ensemble model.
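    A GRNN is essentially Nadaraya-Watson kernel regression, which is what makes the weighting dynamic: each prediction weights the training targets by the query's similarity to the training inputs. A minimal sketch with our own naming (not the thesis's code):

```python
# Minimal Generalized Regression Neural Network (GRNN) prediction step:
# a Gaussian-kernel-weighted average of the training targets, so weights
# adapt to each new query (dynamic weighting) rather than being fixed.
import numpy as np

def grnn_predict(X_train, y_train, x_query, sigma=0.5):
    """Predict y for x_query as a kernel-weighted average of y_train."""
    d2 = ((X_train - x_query) ** 2).sum(axis=1)      # squared distances to query
    w = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian kernel weights
    return (w @ y_train) / (w.sum() + 1e-12)         # similarity-weighted average
```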

    Towards more reliable feature evaluations for classification

    In this thesis we study feature subset selection and feature weighting algorithms. Our aim is to make their output more stable and more useful when used to train a classifier. We begin by defining the concept of stability and selecting a measure to assess the output of the feature selection process. Then we study different sources of instability and propose modifications of classic algorithms that improve their stability. We propose a modification of wrapper algorithms that takes otherwise unused information into account to overcome an intrinsic source of instability for these algorithms: the feature assessment being a random variable that depends on the particular training subsample. Our version accumulates the evaluation results of each feature at each iteration to average out the effect of the randomness. Another novel proposal is to make wrappers evaluate the remainder set of features at each step to overcome another source of instability: the randomness of the algorithms themselves. In this case, by evaluating the non-selected set of features, the initial choice of variables is more educated. These modifications do not bring a great amount of computational overhead and deliver better results, both in terms of stability and predictive power. We finally tackle another source of instability: the differential contribution of the instances to feature assessment. We present a framework to combine almost any instance weighting algorithm with any feature weighting one. Our combinations of algorithms deliver more stable results for the various feature weighting algorithms we have tested. Finally, we present a deeper integration of instance weighting with feature weighting by modifying the Simba algorithm, which delivers even better results in terms of stability.
    The focus of this thesis is to measure, study, and improve the stability of feature subset selection (FSS) and feature weighting (FW) algorithms in a supervised learning context. The general purpose of FSS in a classification context is to improve prediction accuracy; we claim that there is another major challenge in FSS and FW: the stability of the results. Once a stability measure is chosen from among those studied, we propose improvements to a very popular algorithm: Relief. We analyze several distance measures besides the original one and study their effect on accuracy, redundancy detection, and stability. We also test different ways of using the weights computed at each step to influence the distance calculation, in a way similar to another FW algorithm: Simba. We further improve its stability by increasing the contribution of the feature weights to the distance calculation as time progresses, to minimize the impact of the random selection of the first instances. Regarding wrapper algorithms, we modify them to take into account information that was previously ignored, overcoming an intrinsic source of instability: the fact that feature assessment is a random variable that depends on the data subset used. Our version accumulates the results at each iteration to compensate for the random effect, whereas the original versions discard all information gathered about each feature in a given iteration and start over in the next, producing less stable results. Another proposal is to have these wrappers evaluate the non-selected subset of features at each iteration to avoid another source of instability. These modifications do not entail a large increase in computational cost, and their results are more stable and more useful for a classifier. Finally, we propose weighting the contribution of each instance to feature weighting. There may be atypical observations that should not be given as much weight as the others; if we are trying to predict cancer from genetic analyses, we should give less credibility to data obtained from people exposed to high levels of radiation, even when we have no information about that exposure. Instance weighting (IW) methods aim to identify such cases and assign them lower weights. Several authors have worked on IW schemes to improve FSS, but there is no previous work on combining IW with FW. We present a framework to combine IW algorithms with FW ones, and we propose a new IW algorithm based on the decision margin concept used by some FW algorithms. Within this framework we tested the modifications against the original versions using several datasets from the UCI repository, DNA microarray data, and the datasets used in the NIPS-2003 feature selection challenge. Our combinations of instance and feature weighting algorithms deliver more stable results for the various feature weighting algorithms we studied. Finally, we present a deeper integration of instance weighting with the Simba feature selection algorithm, using the instance weights to weight the distance calculation, which yields even better results in terms of stability. The main contributions of this thesis are: (i) a framework to combine IW with FW, (ii) a review of FSS stability measures, (iii) several modifications of FSS and FW algorithms that improve their stability and the predictive power of the selected feature subset without a significant increase in computational cost, (iv) a theoretical definition of feature importance, and (v) a study of the relationship between FSS stability and feature redundancy.
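    To make the notion of stability concrete, here is one common way to measure it, sketched with our own naming: the average pairwise Jaccard similarity between the feature subsets selected on different training subsamples (the thesis surveys several such measures; this is only one of them).

```python
# Stability as average pairwise Jaccard similarity between the feature
# subsets selected on different subsamples: 1.0 = identical subsets every
# run, values near 0 = highly unstable selection.
from itertools import combinations

def jaccard_stability(subsets):
    """subsets: list of sets of selected feature indices, one per run."""
    pairs = list(combinations(subsets, 2))
    sims = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return sum(sims) / len(sims) if sims else 1.0

# Example: three runs selecting mostly the same features -> high stability.
print(jaccard_stability([{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]))
```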

    A New Approach for Handling Null Values in Web Log Using KNN and Tabu Search KNN

    Abstract: Web usage mining is the application of data mining procedures to extract interesting knowledge from web logs. The result of any mining effort is successful only if the dataset under consideration is well preprocessed. One of the important preprocessing steps is the handling of null/missing values, which has long been a challenge for researchers. Various methods are available for estimating null values, such as the k-means clustering algorithm, the MARE algorithm, and fuzzy logic approaches, although these processes are not always efficient. We propose an efficient approach for handling null values in web logs, using a hybrid tabu search - k-nearest neighbor (K-NN) classifier with multiple distance functions. The tabu search - KNN classifier performs feature selection for the K-NN rules. We handle null values efficiently by using different distance functions, called an ensemble of functions, which yields different sets of feature vectors. Feature selection is useful for improving the classification accuracy of the NN rule. Because we use different distance metrics with different feature sets, the possibility that some errors will be common is reduced; therefore, the proposed method is better suited for handling null values. The proposed method, a hybrid classifier with different distance metrics and different feature vectors, is evaluated using our MANIT database. Results indicate a significant increase in performance when compared with a simple K-NN classifier. Original Source URL : http://aircconline.com/ijdkp/V1N5/0911ijdkp02.pdf For more details : http://airccse.org/journal/ijdkp/vol1.htm
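    A minimal sketch of the ensemble-of-distance-functions idea, under our own simplifications (the published method additionally uses tabu search to select the feature set for each metric; here the metrics simply vote on a missing entry):

```python
# Impute a missing field by averaging k-NN estimates obtained under
# several distance metrics, so errors made by any single metric are less
# likely to be shared by the ensemble. Illustrative simplification only.
import numpy as np

def euclidean(a, b): return np.sqrt(((a - b) ** 2).sum())
def manhattan(a, b): return np.abs(a - b).sum()
def chebyshev(a, b): return np.abs(a - b).max()

def ensemble_knn_impute(X, row, col, k=5,
                        metrics=(euclidean, manhattan, chebyshev)):
    """Estimate the missing value X[row, col] from fully observed rows."""
    donors = X[~np.isnan(X).any(axis=1)]              # complete candidate rows
    obs = ~np.isnan(X[row])
    obs[col] = False                                  # compare on other columns
    estimates = []
    for dist in metrics:
        d = np.array([dist(r[obs], X[row][obs]) for r in donors])
        nn = donors[np.argsort(d)[:k]]                # k nearest under this metric
        estimates.append(nn[:, col].mean())           # metric-specific estimate
    return float(np.mean(estimates))                  # fuse across metrics
```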

    Multivariate Models and Algorithms for Systems Biology

    Rapid advances in high-throughput data acquisition technologies, such as microarrays and next-generation sequencing, have enabled scientists to interrogate the expression levels of tens of thousands of genes simultaneously. However, challenges remain in developing effective computational methods for analyzing data generated from such platforms. In this dissertation, we address some of these challenges. We divide our work into two parts. In the first part, we present a suite of multivariate approaches for the reliable discovery of gene clusters, often interpreted as pathway components, from molecular profiling data with replicated measurements. We translate our goal into learning an optimal correlation structure from replicated complete and incomplete measurements. In the second part, we focus on the reconstruction of signal transduction mechanisms in the signaling pathway components. We propose gene set based approaches for inferring the structure of a signaling pathway. First, we present a constrained multivariate Gaussian model, referred to as the informed-case model, for estimating the correlation structure from replicated and complete molecular profiling data. The informed-case model generalizes the previously known blind-case model by accommodating prior knowledge of replication mechanisms. Second, we generalize the blind-case model by designing a two-component mixture model. Our idea is to strike an optimal balance between a fully constrained correlation structure and an unconstrained one. Third, we develop an Expectation-Maximization algorithm to infer the underlying correlation structure from replicated molecular profiling data with missing (incomplete) measurements. We utilize our correlation estimators for clustering real-world replicated complete and incomplete molecular profiling data sets. The above three components constitute the first part of the dissertation. For the structural inference of signaling pathways, we hypothesize a directed signal pathway structure as an ensemble of overlapping and linear signal transduction events. We then propose two algorithms to reverse engineer the underlying signaling pathway structure using unordered gene sets corresponding to signal transduction events. Throughout, we treat gene sets as variables and the associated gene orderings as random. The first algorithm is developed under the Gibbs sampling framework and the second utilizes the framework of simulated annealing. Finally, we summarize our findings and discuss possible future directions.
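    The generic EM core for estimating a Gaussian mean and covariance from data with missing entries can be sketched as follows (our own simplification; the dissertation's estimator additionally encodes replication structure and constraints):

```python
# EM for a multivariate Gaussian with missing entries: the E-step fills
# each row's missing values with their conditional mean given the
# observed values, plus a covariance correction; the M-step re-estimates
# the mean and covariance from the completed data.
import numpy as np

def em_gaussian(X, n_iter=50):
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    S = np.diag(np.nanvar(X, axis=0) + 1e-6)          # initial covariance
    for _ in range(n_iter):
        Xf = np.empty_like(X)
        C = np.zeros((p, p))                          # conditional-cov correction
        for i in range(n):
            m = np.isnan(X[i])
            if m.any():
                o = ~m
                Soo_inv = np.linalg.inv(S[np.ix_(o, o)])
                B = S[np.ix_(m, o)] @ Soo_inv         # regression of miss on obs
                Xf[i, o] = X[i, o]
                Xf[i, m] = mu[m] + B @ (X[i, o] - mu[o])   # conditional mean
                C[np.ix_(m, m)] += S[np.ix_(m, m)] - B @ S[np.ix_(o, m)]
            else:
                Xf[i] = X[i]
        mu = Xf.mean(axis=0)                          # M-step
        D = Xf - mu
        S = (D.T @ D + C) / n
    return mu, S
```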

    Enhancing imputation techniques performance utilizing uncertainty aware predictors and adversarial learning

    One crucial problem in applying machine learning algorithms to real-world datasets is missing data. The objective of data imputation is to fill the missing values in a dataset so that it resembles the complete dataset as closely as possible. Many methods have been proposed in the literature, differing mostly in the objective function and the types of variables considered. The performance of traditional machine learning methods is low when there is a nonlinear and complex relationship between features. Recently, deep learning methods have been introduced to estimate the data distribution and generate values for missing entries. However, these methods were originally developed for large datasets and custom data types such as image, video, and text. Thus, adapting them to the small, structured datasets that are prevalent in real-world applications is not straightforward and often yields unsatisfactory results. Moreover, neither type of method considers uncertainty in the imputed data. We address these issues by developing a simple neural network-based architecture that works well with small and tabular datasets, and by utilizing a novel adversarial strategy to estimate the uncertainty of imputed data. The estimated uncertainty scores of features are then passed to the imputer module, which fills the missing values by paying more attention to more reliable feature values. The result is an uncertainty-aware imputer with promising performance. Extensive experiments conducted on several real-world datasets confirm that the proposed methods considerably outperform state-of-the-art imputers, while their execution time is not costly compared to peer state-of-the-art methods.
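    A highly simplified sketch of the architecture's core idea, with our own naming: a small MLP reconstructs masked tabular inputs, and per-feature reliability weights scale the loss so the imputer attends more to trustworthy features. In the paper these weights are estimated adversarially; here they are just an input to the loss.

```python
# Uncertainty-aware imputation sketch: an MLP sees the zero-filled input
# plus its observation mask, and the reconstruction loss on observed
# entries is weighted by per-feature reliability scores.
import torch
import torch.nn as nn

class Imputer(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features))

    def forward(self, x, mask):
        # Zero-fill missing entries and concatenate the mask so the
        # network knows which values were actually observed.
        return self.net(torch.cat([x * mask, mask], dim=1))

def weighted_loss(imputer, x, mask, reliability):
    """Reconstruction loss on observed entries, scaled by reliability."""
    out = imputer(x, mask)
    return (reliability * mask * (out - x) ** 2).sum() / mask.sum()
```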

    Microarray Data Mining and Gene Regulatory Network Analysis

    Microarray technology makes it feasible to obtain quantitative measurements of the expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated by this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of reference genetic interaction networks for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. The limitations of most existing computational models employed to infer GRNs are that they suffer from either low accuracy or high computational complexity. To overcome such limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the challenges of the curse of dimensionality and the huge search space. Using the selected classifier genes as seeds, a scale-up technique is applied to search through major databases of genetic interaction networks, metabolic pathways, etc. By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and for model validation in GRN reconstruction. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and analyzed to elucidate the dynamics caused by perturbations. The proposed pipelines were applied to investigate the molecular mechanisms of chemical-induced reversible neurotoxicity.
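    To convey the flavor of network inference from time-series expression data, here is a first-order linear approximation sketched under our own assumptions: regress each gene's expression at time t+1 on all genes at time t and read candidate regulatory edges off the largest coefficients. The actual pipeline uses a full DBN; this only illustrates the underlying idea.

```python
# First-order linear stand-in for DBN-style GRN inference: for each
# target gene, fit a ridge regression from expression at time t to its
# expression at t+1, and keep the strongest coefficients as candidate
# regulator -> target edges.
import numpy as np
from sklearn.linear_model import Ridge

def infer_edges(expr, top=3):
    """expr: (time steps x genes) array; returns (regulator, target) pairs."""
    X, Y = expr[:-1], expr[1:]                       # t -> t+1 transitions
    edges = []
    for g in range(expr.shape[1]):
        coef = Ridge(alpha=1.0).fit(X, Y[:, g]).coef_
        for r in np.argsort(np.abs(coef))[::-1][:top]:
            edges.append((int(r), g))                # candidate: r regulates g
    return edges
```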