967 research outputs found

    Clustering-based approaches to SAGE data mining

    Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation

    Machine learning approaches to supporting the identification of photoreceptor-enriched genes based on expression data

    BACKGROUND: Retinal photoreceptors are highly specialised cells, which detect light and are central to mammalian vision. Many retinal diseases occur as a result of inherited dysfunction of the rod and cone photoreceptor cells. Development and maintenance of photoreceptors requires appropriate regulation of the many genes specifically or highly expressed in these cells. Over the last decades, different experimental approaches have been developed to identify photoreceptor enriched genes. Recent progress in RNA analysis technology has generated large amounts of gene expression data relevant to retinal development. This paper assesses a machine learning methodology for supporting the identification of photoreceptor enriched genes based on expression data. RESULTS: Based on the analysis of publicly-available gene expression data from the developing mouse retina generated by serial analysis of gene expression (SAGE), this paper presents a predictive methodology comprising several in silico models for detecting key complex features and relationships encoded in the data, which may be useful to distinguish genes in terms of their functional roles. In order to understand temporal patterns of photoreceptor gene expression during retinal development, a two-way cluster analysis was firstly performed. By clustering SAGE libraries, a hierarchical tree reflecting relationships between developmental stages was obtained. By clustering SAGE tags, a more comprehensive expression profile for photoreceptor cells was revealed. To demonstrate the usefulness of machine learning-based models in predicting functional associations from the SAGE data, three supervised classification models were compared. The results indicated that a relatively simple instance-based model (KStar model) performed significantly better than relatively more complex algorithms, e.g. neural networks. 
To deal with the problem of functional class imbalance occurring in the dataset, two data re-sampling techniques were studied. A random over-sampling method supported the implementation of the most powerful prediction models. The KStar model was also able to achieve higher predictive sensitivities and specificities using random over-sampling techniques. CONCLUSION: The approaches assessed in this paper represent an efficient and relatively inexpensive in silico methodology for supporting large-scale analysis of photoreceptor gene expression by SAGE. They may be applied as complementary methodologies to support functional predictions before implementing more comprehensive, experimental prediction and validation methods. They may also be combined with other large-scale, data-driven methods to facilitate the inference of transcriptional regulatory networks in the developing retina. Furthermore, the methodology assessed may be applied to other data domains
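
The random over-sampling idea mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_oversample`, the toy tag-count vectors, and the class labels are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement to fill the gap to the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical toy data: expression vectors for five tags, one minority class.
tags = [[5, 0, 1], [4, 1, 0], [6, 0, 2], [0, 7, 3], [5, 1, 1]]
classes = ["other", "other", "other", "photoreceptor", "other"]

bal_x, bal_y = random_oversample(tags, classes)
counts = Counter(bal_y)
print(counts)
```

Over-sampling of this kind should be applied to the training split only; duplicating samples before splitting would leak copies into the test set.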

    Fractals in the Nervous System: Conceptual Implications for Theoretical Neuroscience

    This essay is presented with two principal objectives in mind: first, to document the prevalence of fractals at all levels of the nervous system, giving credence to the notion of their functional relevance; and second, to draw attention to the as yet still unresolved issues of the detailed relationships among power law scaling, self-similarity, and self-organized criticality. As regards criticality, I will document that it has become a pivotal reference point in Neurodynamics. Furthermore, I will emphasize the not yet fully appreciated significance of allometric control processes. For dynamic fractals, I will assemble reasons for attributing to them the capacity to adapt task execution to contextual changes across a range of scales. The final Section consists of general reflections on the implications of the reviewed data, and identifies what appear to be issues of fundamental importance for future research in the rapidly evolving topic of this review

    Spiking neurons in 3D growing self-organising maps

    In Kohonen’s Self-Organising Map (SOM) learning, preserving the map topology so that it reflects the actual input features is a significant process. Misinterpretation of the training samples can lead to failure in identifying the important features that may affect the outcomes generated by the SOM model. Nonetheless, it is a challenging task, as most real problems involve complex and insufficient data. The Spiking Neural Network (SNN) is the third generation of Artificial Neural Network (ANN), in which information is transferred from one neuron to another as spikes, processed, and used to trigger a response as output. This study therefore embedded spiking neurons in SOM learning in order to enhance the learning process. The proposed method was divided into five main phases. Phase 1 investigated issues related to the SOM learning algorithm, while in Phase 2 datasets were collected for the analyses carried out in Phase 3, wherein a neural coding scheme for the data representation process was implemented in the classification task. Next, in Phase 4, the spiking SOM model was designed, developed, and evaluated using classification accuracy rate and quantisation error. The outcomes showed that the proposed model attained a high classification accuracy rate with low quantisation error, preserving the quality of the generated map based on the original input data. Lastly, in the final phase, a Spiking 3D Growing SOM is proposed to address the surface reconstruction issue by enhancing the spiking SOM with a 3D map structure and a growing grid mechanism. The application of spiking neurons to enhance the performance of SOM is relevant in this study due to their ability to spike and to send a reaction when special features are identified, based on learning from the presented datasets. The study outcomes contribute to the enhancement of SOM in learning the patterns of the datasets, as well as to proposing a better tool for data analysis.

    Molecular Phenotypes Distinguish Patients with Relatively Stable from Progressive Idiopathic Pulmonary Fibrosis (IPF)

    BACKGROUND: Idiopathic pulmonary fibrosis (IPF) is a progressive, chronic interstitial lung disease that is unresponsive to current therapy and often leads to death. However, the rate of disease progression differs among patients. We hypothesized that comparing the gene expression profiles between patients with stable disease and those in whom the disease progressed rapidly will lead to biomarker discovery and contribute to the understanding of disease pathogenesis. METHODOLOGY AND PRINCIPAL FINDINGS: To begin to address this hypothesis, we applied Serial Analysis of Gene Expression (SAGE) to generate lung expression profiles from diagnostic surgical lung biopsies in 6 individuals with relatively stable (or slowly progressive) IPF and 6 individuals with progressive IPF (based on changes in DLCO and FVC over 12 months). Our results indicate that this comprehensive lung IPF SAGE transcriptome is distinct from normal lung tissue and other chronic lung diseases. To identify candidate markers of disease progression, we compared the IPF SAGE profiles in stable and progressive disease, and identified a set of 102 transcripts that were at least 5-fold upregulated and a set of 89 transcripts that were at least 5-fold downregulated in the progressive group (P-value ≤ 0.05). The overexpressed genes included surfactant protein A1, two members of the MAPK-EGR-1-HSP70 pathway that regulate cigarette-smoke induced inflammation, and Plunc (palate, lung and nasal epithelium associated), a gene not previously implicated in IPF. Interestingly, 26 of the upregulated genes are also increased in lung adenocarcinomas and have low or no expression in normal lung tissue. More importantly, we defined a SAGE molecular expression signature of 134 transcripts that sufficiently distinguished relatively stable from progressive IPF. 
CONCLUSIONS: These findings indicate that molecular signatures from lung parenchyma at the time of diagnosis could prove helpful in predicting the likelihood of disease progression or possibly understanding the biological activity of IPF

    Gene Expression Analysis Methods on Microarray Data: A Review

    In recent years, a new type of experiment has been changing the way biologists and other specialists analyze many problems. These are called high-throughput experiments, and they differ from those performed some years ago mainly in the quantity of data they produce. Thanks to the technology known generically as microarrays, it is now possible to study in a single experiment the behavior of all the genes of an organism under different conditions. The data generated by these experiments may comprise thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are statistical in nature and are the focus of this review. There are many types of microarrays, developed to answer different biological questions, and some of them will be explained later. For the sake of simplicity, we start with the best known ones: expression microarrays.

    Features extraction using random matrix theory.

    Representing complex data in a concise and accurate way is a special stage in data mining methodology. Redundant and noisy data affect the generalization power of any classification algorithm, undermine the results of any clustering algorithm, and encumber the monitoring of large dynamic systems. This work provides several efficient approaches to all of these aspects of the analysis. We established that a notable difference can be made if results from the theory of ensembles of random matrices are employed. A particularly important result of our study is a discovered family of methods based on projecting the data set onto different subsets of the correlation spectrum. Generally, we start with the traditional correlation matrix of a given data set. We perform singular value decomposition and establish boundaries between essential and unimportant eigen-components of the spectrum. Then, depending on the nature of the problem at hand, we use either the former or the latter part for projection. Projecting the spectrum of interest is a common technique in linear and non-linear spectral methods such as Principal Component Analysis, Independent Component Analysis and Kernel Principal Component Analysis. Usually the part of the spectrum to project is defined by the amount of variance of the overall data, or of the feature space in the non-linear case. The applicability of these spectral methods is limited by the assumption that larger variance carries the important dynamics, i.e. that the data have a high signal-to-noise ratio. If this is true, projection onto principal components addresses two problems in data mining: reduction in the number of features and selection of the more important features. Our methodology does not assume a high signal-to-noise ratio; instead, using the rigorous instruments of Random Matrix Theory (RMT), it identifies the presence of noise and establishes its boundaries. 
Knowledge of the structure of the spectrum makes more insightful projections possible. For instance, in the application to router network traffic, the reconstruction error procedure for anomaly detection is based on the projection of the noisy part of the spectrum, whereas in the bioinformatics application of clustering different types of leukemia, implicit denoising of the correlation matrix is achieved by decomposing the spectrum into random and non-random parts. For temporal high-dimensional data, the spectrum and eigenvectors of the correlation matrix are another representation of the data. Thus, eigenvalues, components of the eigenvectors, the inverse participation ratio of eigenvector components, and other operators of eigen-analysis are spectral features of a dynamic system. In our work we proposed to extract spectral features using RMT. We demonstrated that with the extracted spectral features we can monitor the changing dynamics of network traffic. Experimenting with the delayed correlation matrices of network traffic and extracting their spectral features, we visualized the delayed processes in the system. We demonstrated that a broad range of applications in feature extraction can benefit from the novel RMT-based approach to the spectral representation of the data.
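
One common way to draw a boundary between the random and non-random parts of a correlation spectrum is the Marchenko-Pastur law; the abstract does not say which RMT result the authors use, so the sketch below is an assumption chosen for illustration. The function names and the example eigenvalues are hypothetical, and the eigenvalues are taken as given rather than computed from data.

```python
import math


def marchenko_pastur_bounds(n_features, n_samples, sigma2=1.0):
    """Edges of the Marchenko-Pastur bulk for the eigenvalues of a
    correlation matrix built from pure noise (variance sigma2)."""
    q = n_features / n_samples
    lower = sigma2 * (1.0 - math.sqrt(q)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(q)) ** 2
    return lower, upper


def split_spectrum(eigenvalues, n_features, n_samples):
    """Separate eigenvalues above the noise bulk (candidate signal)
    from those at or below its upper edge (treated as noise)."""
    _, upper = marchenko_pastur_bounds(n_features, n_samples)
    signal = [e for e in eigenvalues if e > upper]
    noise = [e for e in eigenvalues if e <= upper]
    return signal, noise


# Hypothetical spectrum of a 4x4 correlation matrix from 100 observations.
eigs = [2.5, 0.9, 0.4, 0.2]
lower, upper = marchenko_pastur_bounds(4, 100)
signal, noise = split_spectrum(eigs, 4, 100)
print(lower, upper, signal, noise)
```

Depending on the task, either part can then be used for projection: the signal part for denoising or feature reduction, the noise part for reconstruction-error style anomaly detection as described in the abstract.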

    Statistical methods for the analysis of RNA sequencing data

    The next generation sequencing technology, RNA-sequencing (RNA-seq), has an increasing popularity over traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies and limited work has been done on expression analysis of time-course RNA-seq data. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to postulate the over-dispersed time-course gene count data. We also modify existing common initialization procedures to suit our model-based clustering algorithm. The effectiveness of the proposed methods is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments to missing values in genomic datasets have been developed but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach
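
To make the negative binomial mixture idea concrete, the sketch below computes the posterior cluster probabilities (the E-step of an EM algorithm) for one gene's count profile. It is a simplified illustration under stated assumptions, not the dissertation's longitudinal model: time points are treated as independent, a single dispersion `phi` is shared across clusters, and all names and numbers are hypothetical.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with
    mean mu and dispersion phi (variance = mu + phi * mu**2)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))


def responsibilities(counts, weights, cluster_means, phi):
    """Posterior probability that one gene's time-course counts
    belong to each cluster, given mixing weights and cluster means."""
    log_post = []
    for w, means in zip(weights, cluster_means):
        loglik = sum(nb_logpmf(y, m, phi) for y, m in zip(counts, means))
        log_post.append(math.log(w) + loglik)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical gene with rising counts over three time points,
# scored against a "rising" and a "falling" cluster.
gene_counts = [2, 10, 40]
cluster_means = [[3.0, 10.0, 35.0], [30.0, 10.0, 3.0]]
resp = responsibilities(gene_counts, [0.5, 0.5], cluster_means, phi=0.5)
print(resp)
```

In a full EM fit these responsibilities would be recomputed each iteration and used to update the mixing weights, cluster means, and dispersion.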