137 research outputs found

Regulatory motif discovery using a population clustering evolutionary algorithm

Author: Lones Michael A.
Tyrrell Andy M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2007

This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences.
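
The position frequency matrix (PFM) models mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's evolutionary algorithm; it only shows what a PFM is and how one might score a promoter sequence against it. The example sites, the pseudocount, the uniform background, and all function names are assumptions made for illustration.

```python
from collections import Counter
import math

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Count how often each base occurs at each position of equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        pfm.append({base: counts.get(base, 0) for base in ALPHABET})
    return pfm

def log_odds_score(pfm, window, pseudocount=1.0, background=0.25):
    """Log-odds score of a window; pseudocount and background are assumed values."""
    total = sum(pfm[0].values()) + pseudocount * len(ALPHABET)
    score = 0.0
    for column, base in zip(pfm, window):
        prob = (column[base] + pseudocount) / total
        score += math.log2(prob / background)
    return score

def best_match(pfm, sequence):
    """Slide the PFM along the sequence; return (position, score) of the best window."""
    width = len(pfm)
    scored = [(log_odds_score(pfm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, pos = max(scored)
    return pos, score

# Made-up binding sites and promoter fragment, for illustration only.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pfm = position_frequency_matrix(sites)
pos, score = best_match(pfm, "GGCGTATAATGCCG")
```

In a motif discovery setting, a search procedure would propose candidate matrices and use a score of this kind as part of its fitness evaluation; how the paper actually defines fitness is not stated in the abstract.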

White Rose Research Online

Using Self-Organizing Maps to Visualize, Filter and Cluster Multidimensional Bio-Omics Data

Author: Fang Hai
Zhang Ji
Publication venue: 'IntechOpen'
Publication date: 21/11/2012

IntechOpen

A systems biology design and implementation of novel bioinformatics software tools for high throughput gene expression analysis

Author: Khan Mohsin Amir Faiz
Publication venue: Brunel University School of Health Sciences and Social Care PhD Theses
Publication date: 01/01/2009

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Microarray technology has revolutionized the field of molecular biology by offering an efficient and cost-effective platform for the simultaneous quantification of thousands of genes or even entire genomes in a single experiment. Unlike Southern blotting, which is restricted to the measurement of one gene at a time, microarrays offer biologists the opportunity to carry out genome-wide experiments in order to help them gain a systems-level understanding of cell regulation and control. The application of bioinformatics to gene expression analysis has attracted a great deal of attention in the recent past due to specific algorithms and software solutions that attempt to illustrate complex multidimensional microarray data in a biologically coherent fashion so that it can be understood by the biologist. This has given rise to some exciting prospects for deciphering microarray data, by helping us refine our understanding of the underlying physiological dynamics of disease. Although much progress is being made in the development of specialized bioinformatics software pipelines with the purpose of decoding large volumes of gene expression data in the context of systems biology, several gaps remain. Perhaps the most notable of these is the increasing demand for software solutions that specialize in automating the comparison of multiple gene expression profiles derived from microarray experiments sharing a common biological theme. This is an important challenge, since common genes across different biological conditions having similar expression patterns are likely to be involved in the same biological process and hence may share the same regulatory signatures. The potential benefits of this in refining our understanding of the physiology of disease are undeniable.
The research presented in this thesis provides a systematic walkthrough of a series of software pipelines developed for the purpose of streamlining gene expression analysis in a systems biology context. Firstly, we present BiSAn, a software tool that deciphers expression data from the perspective of transcriptional regulation. Following this, we present Genome Interaction Analyzer (GIA), which analyzes microarray data in the integrative framework of transcription factor binding sites, protein-protein interactions and molecular pathways. The final contribution is a software pipeline called MicroPath, which analyzes multiple sets of gene expression profiles and attempts to extract common regulatory signatures that may be implicated in the biological question.

Brunel University Research Archive

Knowledge-guided multi-scale independent component analysis for biomarker identification

Author: Chen Li
Clarke Robert
Hoffman Eric
Shih Ie-Ming
Wang Chen
Wang Yue
Xuan Jianhua
Zhang Zhen
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, from gene expression profile data alone, statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study. In this paper, we develop a novel strategy, namely knowledge-guided multi-scale independent component analysis (ICA), to first infer regulatory signals and then identify biologically relevant biomarkers from microarray data.
Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a metadata "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide important information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological metadata associated with the members of the KGP can then be used to guide subsequent analysis. ICA is then applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted by their weighted connectivity scores associated with the extracted regulatory modes. A statistical significance test is used to evaluate the significance of transcription factor enrichment for the extracted gene set based on motif information. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification.
Conclusion: We have proposed a novel method, namely knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify corresponding biomarkers through a multi-scale strategy. The approach has been successfully applied to two expression profiling experiments to demonstrate its improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, the proposed approach shows promising results to infer novel biomarkers for ovarian cancer and extend current knowledge.
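
The abstract above mentions a significance test for transcription factor enrichment in an extracted gene set but does not name the test. One common choice for this kind of question is a one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the motif of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical counts: 6000 genes in total, 300 with the motif,
# 50 genes in the extracted set, 10 of which carry the motif.
p_value = hypergeom_enrichment_pvalue(6000, 300, 50, 10)
```

A small p-value here would suggest that the motif occurs in the extracted set more often than expected from random sampling. When many transcription factors are tested, some multiple-testing correction would normally be applied on top.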

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evaluation of statistical correlation and validation methods for construction of gene co-expression networks

Author: Duvvuru Suman
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2008

High-throughput technologies such as microarrays have led to the rapid accumulation of large scale genomic data providing opportunities to systematically infer gene function and co-expression networks. Typical steps of co-expression network analysis using microarray data consist of estimation of pair-wise gene co-expression using some similarity measure, construction of co-expression networks, identification of clusters of co-expressed genes and post-cluster analyses such as cluster validation. This dissertation is primarily concerned with development and evaluation of approaches for the first and the last steps – estimation of gene co-expression matrices and validation of network clusters. Since clustering methods are not a focus, only a paraclique clustering algorithm will be used in this evaluation. First, a novel Bayesian approach is presented for combining the Pearson correlation with prior biological information from Gene Ontology, yielding a biologically relevant estimate of gene co-expression. The addition of biological information by the Bayesian approach reduced noise in the paraclique gene clusters as indicated by high silhouette and increased homogeneity of clusters in terms of molecular function. Standard similarity measures including correlation coefficients from Pearson, Spearman, Kendall’s Tau, Shrinkage, Partial, and Mutual information, and Euclidean and Manhattan distance measures were evaluated. Based on quality metrics such as cluster homogeneity and stability with respect to ontological categories, clusters resulting from partial correlation and mutual information were more biologically relevant than those from any other correlation measures. Second, statistical quality of clusters was evaluated using approaches based on permutation tests and Mantel correlation to identify significant and informative clusters that capture most of the covariance in the dataset. 
Third, the utility of statistical contrasts was studied for classification of temporal patterns of gene expression. Specifically, polynomial and Helmert contrast analyses were shown to provide a means of labeling the co-expressed gene sets because they showed similar temporal profiles.
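
The first step described in the abstract above, estimating pair-wise gene co-expression with a similarity measure and building a network from it, can be sketched with plain Pearson correlation. This does not reproduce the dissertation's Bayesian estimate or its paraclique clustering. The expression profiles, the gene names, and the 0.9 threshold are made-up values for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant profile: treat correlation as undefined/zero
    return cov / (norm_x * norm_y)

def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with |r| at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [2.0, 1.0, 4.0, 3.0],
}
edges = coexpression_edges(profiles)
```

Using the absolute value keeps strongly anti-correlated pairs as edges; whether that is desirable depends on the analysis, and the abstract does not say which convention was used.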

University of Tennessee, Knoxville: Trace

Front Matter - Soft Computing for Data Mining Applications

Author: Patnaik L.M.
Srinivasa K.G.
Venugopal K.R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the capability of computers to search huge amounts of data in a fast and effective manner. However, the data to be analyzed is imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data might moreover be ambiguous and partly conflicting. Besides, patterns and relationships of interest are usually vague and approximate. Thus, to make the information mining process more robust, or, say, human-like, methods for searching and learning require tolerance towards imprecision, uncertainty and exceptions. Such methods have approximate reasoning capabilities and are capable of handling partial truth. Properties of the aforementioned kind are typical of soft computing. Soft computing techniques like Genetic

ePrints@Bangalore University

A MACHINE LEARNING APPROACH TO QUERY TIME-SERIES MICROARRAY DATA SETS FOR FUNCTIONALLY RELATED GENES USING HIDDEN MARKOV MODELS

Author: Senf Alexander J.
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2011

Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; proteins used by cellular functions exhibit higher rates of expression for the associated genes. If multiple proteins are required for a particular function then their genes show a pattern of coexpression during time periods when the function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field. The number of microarray gene expression datasets available in public repositories increases rapidly, and advances in technology have now made it feasible to routinely perform whole-genome studies where the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making the amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it was possible to quickly extract information useful to their area of research. This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known and functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray datasets where gene expression values are obtained from the same experiment over a period of time from a series of sequential measurements. A feature selection algorithm selects relevant time steps where the provided input genes exhibit correlated expression behavior. 
Time steps are the columns in microarray data sets; rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected experiments and is trained using the expression values of the input genes from the microarray. Given the trained HMM, the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows for the assignment of a probability score for each gene in the microarray. High-scoring genes are included in the result set of genes with functional similarities to the input genes. P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene. A feedback loop uses the result generated from one algorithm run as the input set for another iteration of the algorithm. This iterated HMM algorithm allows for the characterization of functional modules from very small input sets and for weak similarity signals. This algorithm also allows for the integration of multiple microarray data sets; two approaches are studied: Meta-Analysis (combination of the results from individual data set runs) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for integration of closely related microarrays and a spanning HMM works best for the integration of multiple heterogeneous datasets. The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological application is verified by analyzing biological data sets of the fruit fly D. melanogaster and baker's yeast S. cerevisiae.
The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than currently available algorithms in the published literature.
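
The scoring idea in the abstract above, training a linear HMM on the input genes and then asking how probable each other gene's time series is under that model, can be sketched under strong simplifying assumptions. Here the chain is strictly left-to-right with one state per time step and deterministic transitions, and each state emits from a Gaussian fitted to the input genes. Under those assumptions the forward probability reduces to a product of emission densities. The dissertation's actual model structure, emission distributions, and training procedure are not given in the abstract, so every name and number below is hypothetical.

```python
import math
from statistics import mean, pstdev

def train_linear_hmm(input_profiles, min_std=0.1):
    """Fit one Gaussian emission (mean, std) per time step from the input genes.
    min_std is an assumed floor that avoids zero-variance states."""
    states = []
    for values in zip(*input_profiles):
        states.append((mean(values), max(pstdev(values), min_std)))
    return states

def log_likelihood(states, profile):
    """Log-probability of a profile when state t emits the value at time step t."""
    total = 0.0
    for (mu, sigma), x in zip(states, profile):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total

def rank_genes(states, candidates):
    """Order candidate genes from most to least likely under the trained chain."""
    scores = {gene: log_likelihood(states, profile)
              for gene, profile in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up time series: three input genes and three candidates to score.
input_profiles = [
    [0.1, 1.0, 2.1, 1.0],
    [0.0, 1.2, 1.9, 0.9],
    [0.2, 0.9, 2.0, 1.1],
]
candidates = {
    "similar": [0.1, 1.1, 2.0, 1.0],
    "flat": [1.0, 1.0, 1.0, 1.0],
    "inverse": [2.0, 1.0, 0.0, 1.0],
}
states = train_linear_hmm(input_profiles)
ranking = rank_genes(states, candidates)
```

A full implementation would also need the feature selection of time steps, the p-value estimation from randomly trained models, and the feedback loop described in the abstract; none of those are shown here.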

KU ScholarWorks

Gene expression profiling of head and neck cancer

Author: Warner Giles C
Publication venue: 'Queen Mary University of London'
Publication date: 01/01/2004

The purpose of this study was to classify oral squamous cell carcinomas (OSCCs) based on their gene expression profiles, to identify differentially expressed genes in these cancers, and to correlate genetic deregulation with clinical-histopathological data and patient outcome. After conducting proof-of-principle experiments utilizing six head and neck squamous cell carcinoma (HNSCC) cell lines, the gene expression profiles of 20 OSCCs and subsequently an additional 8 OSCCs were determined using cDNA microarrays containing 19,200 sequences and the Binary Tree-Structured Vector Quantization (BTSVQ) method of data analysis. Two sample clusters were identified in the group of 20 tumors that correlated with T3-T4 category of disease (p = 0.035) and nodal metastasis (p = 0.035). Sample clustering of 28 OSCCs and the 6 cell lines revealed a correlation with disease-free survival. BTSVQ analysis identified a subset of 23 differentially expressed genes with the lowest quantization error scores in the cluster containing more advanced stage tumors from the 20 OSCC dataset. The expression of six of these differentially expressed genes was validated by quantitative real-time RT-PCR. Statistical analysis of quantitative real-time RT-PCR data was performed and, after Bonferroni correction, CLDN1 (p = 0.007) over-expression was significantly correlated with the cluster containing more advanced stage tumors. Despite the clinical heterogeneity of OSCC, molecular subtyping by cDNA microarray analysis was able to identify distinct patterns of gene expression associated with relevant clinical parameters. The application of this methodology represents an advance in the classification of oral cavity tumors, and may ultimately aid in the development of more tailored therapies for oral carcinoma.

Queen Mary Research Online

OpenGrey Repository

Dimensionality reduction methods for microarray cancer data using prior knowledge

Author: Hira Zena Maria
Publication venue: Computing, Imperial College London
Publication date: 01/06/2016

Microarray studies are currently a very popular source of biological information. They allow the simultaneous measurement of hundreds of thousands of genes, drastically increasing the amount of data that can be gathered in a small amount of time and also decreasing the cost of producing such results. Large numbers of high-dimensional data sets are currently being generated and there is an ongoing need to find ways to analyse them to obtain meaningful interpretations. Many microarray experiments are concerned with answering specific biological or medical questions regarding diseases and treatments. Cancer is one of the most popular research areas and there is a plethora of data available requiring in-depth analysis. Although the analysis of microarray data has been thoroughly researched over the past ten years, new approaches still appear regularly, and may lead to a better understanding of the available information. The size of the modern data sets presents considerable difficulties to traditional methodologies based on hypothesis testing, and there is a new move towards the use of machine learning in microarray data analysis. Two new methods of using prior genetic knowledge in machine learning algorithms have been developed and their results are compared with existing methods. The prior knowledge consists of biological pathway data that can be found in on-line databases, and gene ontology terms. The first method, called "a priori manifold learning", uses the prior knowledge when constructing a manifold for non-linear feature extraction. It was found to perform better than both linear principal components analysis (PCA) and the non-linear Isomap algorithm (without prior knowledge) in both classification accuracy and quality of the clusters. Both pathway data and GO terms were used as prior knowledge, and results showed that using GO terms can make the models over-fit the data.
In the cases where the use of GO terms does not over-fit, the results are better than PCA, Isomap and a priori manifold learning using pathways. The second method, called "the feature selection over pathway segmentation algorithm", uses the pathway information to split a big dataset into smaller ones. Then, using AdaBoost, decision trees are constructed for each of the smaller sets and the sets that achieve higher classification accuracy are identified. The individual genes in these subsets are assessed to determine their role in the classification process. Using data sets concerning chronic myeloid leukaemia (CML), two subsets based on pathways were found to be strongly associated with the response to treatment. Using a different data set from measurements on lower grade glioma (LGG) tumours, four informative gene sets were discovered. Further analysis based on the Gini importance measure identified a set of genes for each cancer type (CML, LGG) that could predict the response to treatment very accurately (> 90%). Moreover, a single gene that can predict the response to CML treatment accurately was identified.
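
The Gini importance measure mentioned in the abstract above is built from Gini impurity decreases at decision tree splits. The sketch below computes that decrease for a single threshold split on one gene, which is the quantity a tree-based importance score would accumulate over all splits using that gene. It does not implement AdaBoost or the pathway segmentation step. The expression values, labels, and function names are invented for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Reduction in Gini impurity from splitting samples at values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted

def best_split(values, labels):
    """Return (decrease, threshold) for the best midpoint threshold on one gene."""
    unique = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    if not thresholds:
        return 0.0, None  # gene is constant: no split possible
    return max((impurity_decrease(values, labels, t), t) for t in thresholds)

# Hypothetical expression values for six patients and their treatment response.
labels = ["responder"] * 3 + ["non-responder"] * 3
gene_x = [0.2, 0.4, 0.3, 1.5, 1.7, 1.6]  # separates the classes cleanly
gene_y = [0.9, 0.1, 1.2, 0.8, 0.2, 1.1]  # does not
decrease_x, threshold_x = best_split(gene_x, labels)
decrease_y, threshold_y = best_split(gene_y, labels)
```

A gene whose best split removes most of the impurity, as gene_x does in this toy data, would rank highly under an importance measure of this kind.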

Spiral - Imperial College Digital Repository