6,246 research outputs found

    Inferring gene ontologies from pairwise similarity data.

    Get PDF
    MotivationWhile the manually curated Gene Ontology (GO) is widely used, inferring a GO directly from -omics data is a compelling new problem. Recognizing that ontologies are a directed acyclic graph (DAG) of terms and hierarchical relations, algorithms are needed that: analyze a full matrix of gene-gene pairwise similarities from -omics data; infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and respect biological pleiotropy, by which a term in the hierarchy can relate to multiple higher level terms. Methods addressing these requirements are just beginning to emerge-none has been evaluated for GO inference.MethodsWe consider two algorithms [Clique Extracted Ontology (CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive thresholding of a similarity matrix. We evaluate each method's ability to reconstruct the GO biological process ontology from a similarity matrix based on (a) semantic similarities for GO itself or (b) three -omics datasets for yeast.ResultsFor task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches (<20% precision, <20% recall). For task (b) using -omics data, CliXO outperforms other methods using two -omics datasets and achieves ∼30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or standard clustering (20-25% precision, recall).ConclusionThis study provides algorithmic foundation for building gene ontologies by capturing hierarchical and pleiotropic structure embedded in biomolecular data

    SMART: Unique splitting-while-merging framework for gene clustering

    Get PDF
    Copyright @ 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset to a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the the most reliable clustering results, by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested in demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Based on the performance of many metrics, all numerical results show that SMART is superior to compared existing self-splitting algorithms and traditional algorithms. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) extendible to many different applications, (3) offering superior performance compared with counterpart algorithms.National Institute for Health Researc

    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

    Get PDF
    >Magister Scientiae - MScINTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence, but the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies aided by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have aided oncology researcher and clinician understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, cellular process and pathway involvementinvolvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures, discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a machine-learning based robust breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented and differential gene expression analysis with selection of the top 100 genes as informative features for classification purposes. The accuracy and discriminatory capability of both microarray-based gene signatures and the RNA-Seq-based gene signatures was assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and Hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and 346-probe informative set for two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method, resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. The universal application of the microarray signatures and validity of the z-score-to-barcode method was proven with 95% accurate classification of RNA-Seq barcoded samples with a microarray discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures holds immense potential for further R&F towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests

    The Iterative Signature Algorithm for the analysis of large scale gene expression data

    Full text link
    We present a new approach for the analysis of genome-wide expression data. Our method is designed to overcome the limitations of traditional techniques, when applied to large-scale data. Rather than alloting each gene to a single cluster, we assign both genes and conditions to context-dependent and potentially overlapping transcription modules. We provide a rigorous definition of a transcription module as the object to be retrieved from the expression data. An efficient algorithm, that searches for the modules encoded in the data by iteratively refining sets of genes and conditions until they match this definition, is established. Each iteration involves a linear map, induced by the normalized expression matrix, followed by the application of a threshold function. We argue that our method is in fact a generalization of Singular Value Decomposition, which corresponds to the special case where no threshold is applied. We show analytically that for noisy expression data our approach leads to better classification due to the implementation of the threshold. This result is confirmed by numerical analyses based on in-silico expression data. We discuss briefly results obtained by applying our algorithm to expression data from the yeast S. cerevisiae.Comment: Latex, 36 pages, 8 figure
    corecore