4,859 research outputs found

    Simcluster: clustering enumeration gene expression data on the simplex space

    Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space.

Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster.

Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
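
To make the simplex constraint concrete, here is a minimal sketch of a distance from the compositional data analysis literature: the Aitchison distance, computed through the centered log-ratio (clr) transform. It is an illustration only and is not taken from Simcluster's source code. The function names and the `pseudocount` used to avoid taking the logarithm of zero counts are assumptions made for this example.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a vector of tag counts.

    `pseudocount` is a hypothetical way of handling zero counts;
    it is an assumption of this sketch, not a documented default.
    """
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    # Close the vector to proportions, then take logs.
    logs = [math.log(s / total) for s in shifted]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centres the vector (components sum to ~0).
    return [v - mean_log for v in logs]


def aitchison_distance(x, y, pseudocount=0.5):
    """Euclidean distance between the clr-transformed compositions."""
    cx = clr(x, pseudocount)
    cy = clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


if __name__ == "__main__":
    shallow = [10, 20, 70]     # a library sequenced at low depth
    deep = [100, 200, 700]     # same relative abundances, 10x the depth
    other = [70, 20, 10]       # different relative abundances
    print(aitchison_distance(shallow, deep, pseudocount=0.0))
    print(aitchison_distance(shallow, other, pseudocount=0.0))
```

With `pseudocount=0.0`, two libraries that differ only in sequencing depth are at distance zero up to floating-point error, which a plain Euclidean distance on raw counts would not give.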

    Time and dose-dependent effects of phenobarbital on the rat liver miRNAome.

    In a previous study we showed that treatment of male Fischer rats with exogenous chemicals for three months resulted in prominent, mode-of-action dependent effects on liver microRNA (miRNA) (Koufaris et al., 2012). Here we investigated how the effects of chemicals on liver miRNA in male Fischer rats relate to the length and dose of exposure to phenobarbital (PB), a drug with multiple established hepatic effects. Importantly, although acute PB treatment (1-7 days) had significant effects on liver mRNA and the expected effects on the liver phenotype (transient hyperplasia, hepatomegaly, cytochrome P450 induction), limited effects on liver miRNA were observed. However, at 14 days of PB treatment clear dose-dependent effects on miRNA were observed. The main effect of PB treatment from days 1 to 90 on liver miRNA was found to be the persistent, progressive, and highly correlated induction of the miR-200a/200b/429 and miR-96/182 clusters, occurring after the termination of the xenobiotic-induced transient hyperplasia. Moreover, in agreement with their reported functions in the literature, we found associations between perturbations of miR-29b and miR-200a/200b by PB and global DNA methylation and zeb1/zeb2 proteins, respectively. Our data suggest that miRNA are unlikely to play an important role in the acute responses of the adult rodent liver to PB treatment. However, the miRNA responses to longer PB exposures suggest a potential role in maintaining liver homeostasis in response to sub-chronic and chronic xenobiotic-induced perturbations. Similar studies for more chemicals are needed to clarify whether the temporal and dose patterns of miRNA-toxicant interaction identified here for PB are widely applicable to other xenobiotics. © 2013 Elsevier Ireland Ltd

    Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

    Asterias (http://www.asterias.info) is an integrated collection of freely-accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; converting between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications. Comment: web based application; 3 figure

    Improving clustering with metabolic pathway data

    Background: It is common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a priori biological knowledge. This procedure helps find functionally related patterns and propose hypotheses for their behavior and the biological processes involved. This knowledge is therefore used only as a second step, after the data have been clustered according to their expression patterns alone. It could thus be very useful to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. Results: A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances between data points and neuron centroids includes a new term based on information from well-known metabolic pathways. Standard self-organizing map (SOM) training and biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from the Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular from the application point of view. Conclusions: Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method.
It is worth highlighting that this fact has effectively improved the results, which can simplify their further analysis.
Affiliation: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Affiliation: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Affiliation: Lopez, Mariana Gabriela. Instituto Nacional de Tecnología Agropecuaria. Centro de Investigación en Ciencias Veterinarias y Agronómicas. Instituto de Biotecnología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Kamenetzky, Laura. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Instituto Nacional de Tecnología Agropecuaria. Centro de Investigación en Ciencias Veterinarias y Agronómicas. Instituto de Biotecnología; Argentina
Affiliation: Carrari, Fernando Oscar. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Instituto Nacional de Tecnología Agropecuaria. Centro de Investigación en Ciencias Veterinarias y Agronómicas. Instituto de Biotecnología; Argentin

    Statistical inference from large-scale genomic data

    This thesis explores the potential of statistical inference methodologies in their application to functional genomics. In essence, it summarises algorithmic findings in this field, providing step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series. This thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus involves using clustering as a method of inference and another is cluster validation to extract meaningful biological information from the data. Information gained from the application of these various techniques can then be used conjointly in the elucidation of gene regulatory networks, the ultimate goal of this type of analysis. First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in clustering validation, a validity index is defined based on one of the most important ontologies within the Bioinformatics community, Gene Ontology. The method bridges a gap in current literature, in the sense that it takes into account not only the variations of Gene Ontology categories in biological specificities and their significance to the gene clusters, but also the complex structure of the Gene Ontology. Finally, Bayesian probability is applied to making inference from heterogeneous genomic data, integrated with previous efforts in this thesis, with the aim of large-scale gene network inference. The proposed system comes with a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis. Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome

    From microarray to biology: an integrated experimental, statistical and in silico analysis of how the extracellular matrix modulates the phenotype of cancer cells

    A statistically robust and biologically-based approach for analysis of microarray data is described that integrates independent biological knowledge and data with a global F-test for finding genes of interest, and that minimizes the need for replicates when used for hypothesis generation. First, each microarray is normalized to its noise level around zero. The microarray dataset is then globally adjusted by robust linear regression. Second, genes of interest that capture significant responses to experimental conditions are selected by finding those that express significantly higher variance than those expressing only technical variability. Clustering expression data and identifying expression-independent properties of genes of interest, including upstream transcriptional regulatory elements (TREs), ontologies and networks or pathways, organizes the data into a biologically meaningful system. We demonstrate that when the number of genes of interest is inconveniently large, identifying a subset of "beacon genes" representing the largest changes will identify pathways or networks altered by biological manipulation. The entire dataset is then used to complete the picture outlined by the "beacon genes." This allows construction of a structured model of a system that can generate biologically testable hypotheses. We illustrate this approach by comparing cells cultured on plastic or an extracellular matrix, which organizes a dataset of over 2,000 genes of interest from a genome wide scan of transcription. The resulting model was confirmed by comparing the predicted pattern of TREs with experimental determination of active transcription factors

    Model order selection for bio-molecular data clustering

    Background: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori" biological knowledge to evaluate both the number of clusters and their validity. Recently several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, if possible, multi-level structures simultaneously present in the data (e.g. hierarchical structures)
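
Stability indices of this kind are built from a similarity score between pairs of clusterings of the same samples. The abstract does not say which similarity measure is used, so the sketch below assumes one common choice, the Jaccard coefficient over co-clustered sample pairs. The function names are hypothetical and the random projection step is not shown.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of sample index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient between two clusterings of the same samples.

    Assumed measure for illustration; the source only refers to
    "similarity measures between multiple pairs of clusterings".
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same samples")
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:
        # Every sample is a singleton in both clusterings.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


if __name__ == "__main__":
    original = [0, 0, 1, 1]
    relabelled = [1, 1, 0, 0]   # same partition, different label names
    scrambled = [0, 1, 0, 1]    # no co-clustered pair in common
    print(jaccard_similarity(original, relabelled))
    print(jaccard_similarity(original, scrambled))
```

Because the score depends only on which samples are grouped together, it is unaffected by how the clusters happen to be numbered in each run.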

    Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes

    BACKGROUND: Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology. RESULTS: We present a more formal approach, the modk-prototypes algorithm, for clustering biological samples based on simultaneously considering microarray gene expression data and classes of known phenotypic variables such as clinical chemistry evaluations and histopathologic observations. The strategy involves constructing an objective function with the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data and simple matching for histopathology categorical values in order to measure dissimilarity of the samples. Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets. A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The approach is shown to work well with a simulated mixed data set and two real data examples containing numeric and categorical data types: one from a heart disease study and another from acetaminophen (an analgesic) exposure in rat liver that causes centrilobular necrosis.
CONCLUSION: The modk-prototypes algorithm partitioned the simulated data into clusters with samples in their respective class group and the heart disease samples into two groups (sick and buff denoting samples having pain type representative of angina and non-angina respectively) with an accuracy of 79%. This is on par with, or better than, the assignment accuracy of the heart disease samples by several well-known and successful clustering algorithms. Following modk-prototypes clustering of the acetaminophen-exposed samples, informative genes from the cluster prototypes were identified that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. The biological processes cell growth and/or maintenance, amine metabolism, and stress response were shown to discern between no and moderate levels of acetaminophen-induced centrilobular necrosis. The use of well-known and traditional measurements directly in the clustering provides some guarantee that the resulting clusters will be meaningfully interpretable
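
The dissimilarity and prototype described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionary layout, the function names, and the weight parameters `w_expr`, `w_chem` and `w_histo` are hypothetical, and the cluster-number selection step (dynamic validity index with category utility) is not shown.

```python
from statistics import fmean, mode


def mixed_dissimilarity(sample, prototype, w_expr=1.0, w_chem=1.0, w_histo=1.0):
    """Weighted dissimilarity between a sample and a cluster prototype.

    Numeric domains use squared Euclidean distance; the categorical
    histopathology domain uses simple matching (count of mismatches).
    """
    expr_term = sum(
        (a - b) ** 2 for a, b in zip(sample["expression"], prototype["expression"])
    )
    chem_term = sum(
        (a - b) ** 2 for a, b in zip(sample["chemistry"], prototype["chemistry"])
    )
    histo_term = sum(
        a != b for a, b in zip(sample["histopathology"], prototype["histopathology"])
    )
    return w_expr * expr_term + w_chem * chem_term + w_histo * histo_term


def cluster_prototype(samples):
    """Prototype of a cluster: mean of numeric features, mode of categorical ones."""
    return {
        "expression": [fmean(col) for col in zip(*(s["expression"] for s in samples))],
        "chemistry": [fmean(col) for col in zip(*(s["chemistry"] for s in samples))],
        "histopathology": [
            mode(col) for col in zip(*(s["histopathology"] for s in samples))
        ],
    }


if __name__ == "__main__":
    # Made-up toy values, for illustration only.
    s1 = {"expression": [1.0, 2.0], "chemistry": [0.5], "histopathology": ["necrosis"]}
    s2 = {"expression": [3.0, 2.0], "chemistry": [1.5], "histopathology": ["none"]}
    s3 = {"expression": [2.0, 2.0], "chemistry": [1.0], "histopathology": ["necrosis"]}
    proto = cluster_prototype([s1, s2, s3])
    print(proto)
    print(mixed_dissimilarity(s2, proto, w_histo=2.0))
```

Raising one weight relative to the others makes that data domain dominate the assignment of samples to prototypes, which is the role the abstract gives to the separate weighting terms.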

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    The organization and mining of malaria genomic and post-genomic data is highly motivated by the necessity to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space designed from the genomic data from Plasmodium falciparum, but using also the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences, including compositionally atypical malaria sequences, 2) the high throughput reconstruction of molecular phylogenies, 3) the representation of biological processes, particularly metabolic pathways, 4) the versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progress toward a grid-enabled chemogenomic knowledge space is discussed. Comment: 43 pages, 4 figures, to appear in Malaria Journa

    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

    Magister Scientiae - MSc. INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence, but the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies aided by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have aided oncology researcher and clinician understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a machine-learning based robust breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples.
A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis and selection of the top 100 genes as informative features for classification. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and Hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established.
The universal application of the microarray signatures and the validity of the z-score-to-barcode method were proven with 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests
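
As a rough illustration of the two steps described in this abstract, the sketch below converts expression values to a 0/1 barcode through z-scores and then keeps genes whose barcode is stably present or absent within one tissue type. The abstract gives the 85% stability criterion but not the details of the z-score conversion, so the per-gene z-score across samples, the `threshold` value, and all function names are assumptions for this example rather than the thesis's actual method.

```python
from statistics import fmean, pstdev


def zscore_barcode(values, threshold=1.0):
    """Barcode one gene across samples: 1 if its z-score exceeds `threshold`.

    Both the per-gene z-score and the threshold of 1.0 are hypothetical
    choices made for this sketch.
    """
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        # No variation across samples: call the gene absent everywhere.
        return [0] * len(values)
    return [1 if (v - mean) / sd > threshold else 0 for v in values]


def stable_genes(barcodes, min_fraction=0.85):
    """Genes that are stably present (1) or absent (0) within one tissue type.

    `barcodes` maps a gene name to its list of 0/1 calls across the
    samples of that tissue; the result maps each stable gene to its state.
    """
    stable = {}
    for gene, calls in barcodes.items():
        present = sum(calls) / len(calls)
        if present >= min_fraction:
            stable[gene] = 1
        elif 1 - present >= min_fraction:
            stable[gene] = 0
    return stable


if __name__ == "__main__":
    # Made-up expression values for one gene across five samples.
    print(zscore_barcode([1, 1, 1, 1, 10]))
    # Made-up barcodes for three genes across ten samples of one tissue.
    tissue = {
        "gene_a": [1] * 9 + [0],
        "gene_b": [1, 0] * 5,
        "gene_c": [0] * 10,
    }
    print(stable_genes(tissue))
```

Genes that are stable in each tissue but in opposite states would then be candidates for the "differentially stable between tissues" criterion mentioned in the methodology.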