15 research outputs found

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, choosing a suitable method for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly, and selected examples are presented for the clustering methods considered.

    Genomic and proteomic analysis with dynamically growing self organising tree (DGSOT) for measuring clinical outcomes of cancer

    Genomic and proteomic microarray technologies are used to analyse molecular and cellular expression in cancer. The large volumes of data they produce make analysis and interpretation a challenge. This review describes a combined system for the genetic and molecular interpretation and analysis of genomic and proteomic data that offers a wide range of interpreted results. Artificial neural network systems are well suited to handling these large volumes of analytical data. The system recommended here is determined by analysing, and selecting the best of, the different technologies currently used or reviewed for microarray data analysis. It is a tree-structured, new hierarchical clustering algorithm called the dynamically growing self-organizing tree (DGSOT), which overcomes drawbacks of traditional hierarchical clustering algorithms. The DGSOT algorithm combines horizontal and vertical growth to construct a multifurcating hierarchical tree from top to bottom to cluster the data. It is designed to combine the strengths of neural networks (NN), namely speed and robustness to noise, with those of a hierarchical clustering tree structure, which requires minimal prior specification of the number of clusters and minimal training, in order to output results that can be interpreted in a biological context. The combined system will generate a biological interpretation of expression profiles associated with diagnosis of disease (including early detection, molecular classification and staging), metastasis (spread of the disease to non-adjacent organs and/or tissues), prognosis (predicting clinical outcome) and response to treatment; it also gives possible therapeutic options, ranking them according to their benefit for the patient.
    Key words: Genomics, proteomics, microarray, dynamically growing self-organizing tree (DGSOT)

    Detection and Prevention System towards the Truth of Convergence on Decision Using Aumann Agreement Theorem

    Detection and prevention systems against many attacks have been formulated for mobile ad hoc networks to secure data and to provide uninterrupted service to legitimate clients. The formulation of neighbor opinion, belief value or trust value plays a vital role in the detection system in avoiding attacks. The attack detection system extracts the behavior of nodes to identify attack patterns and to predict future attacks. Keeping false positives and false negatives low is vital to identifying attackers accurately. Our system uses the Aumann agreement theorem for convergence of truth on opinion, based on the bound of the confidence value, such that truth consensus is maintained. The accuracy of the system is enhanced through this methodology.
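
    The abstract does not spell out how opinions converge, so the following is only a minimal sketch under stated assumptions. It substitutes a standard bounded-confidence averaging rule (in the style of the Hegselmann-Krause model), which is not Aumann's theorem itself: each node averages only the trust values that lie within a confidence bound of its own. The function names, the bound and the trust values are hypothetical.

```python
def bounded_confidence_step(opinions, bound):
    """One synchronous round: each node averages the opinions within `bound` of its own."""
    updated = []
    for own in opinions:
        # A node always counts itself, so `peers` is never empty.
        peers = [other for other in opinions if abs(own - other) <= bound]
        updated.append(sum(peers) / len(peers))
    return updated


def run_until_stable(opinions, bound, tol=1e-9, max_rounds=1000):
    """Repeat rounds until no opinion moves by more than `tol` (or rounds run out)."""
    current = list(opinions)
    for _ in range(max_rounds):
        nxt = bounded_confidence_step(current, bound)
        if max(abs(a - b) for a, b in zip(current, nxt)) < tol:
            return nxt
        current = nxt
    return current


# Hypothetical trust values reported by four neighbors about one node.
trust = [0.80, 0.85, 0.90, 0.10]
final = run_until_stable(trust, bound=0.2)
```

    In this toy run the three mutually close opinions settle on a shared value, while the outlying opinion stays isolated because it never falls within the confidence bound of the others.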

    Text Classification Aided by Clustering: a Literature Review


    Enhanced data clustering and classification using auto-associative neural networks and self organizing maps

    This thesis presents a number of investigations leading to the introduction of novel applications of intelligent algorithms in the fields of informatics and analytics. The research aims to develop novel methodologies for dimension reduction and clustering of highly non-linear multidimensional data. Improvement of existing methodologies is based on two fundamental approaches. The first is to make novel structural re-arrangements by hybridising two conventional intelligent algorithms, Auto-Associative Neural Networks (AANN) and Self Organizing Maps (SOM), to improve data clustering. The second is to enhance data clustering and classification performance by introducing novel fundamental algorithmic changes, known as M3-SOM, in the data processing and training procedure of the conventional SOM. Both approaches are tested, benchmarked and analysed on three datasets (Iris Flowers, Italian Olive Oils and Wine) through case studies on dimension reduction, clustering and classification of complex and non-linear data. The study on AANN alone shows that this non-linear algorithm is able to reduce the dimensions of the three datasets efficiently. This paves the way towards structurally hybridising AANN as the dimension reduction method with SOM as the clustering method (AANNSOM) to enhance data clustering. The hybrid AANNSOM is then introduced and applied to cluster the Iris Flowers, Italian Olive Oils and Wine datasets. Compared with SOM, the hybrid methodology improves data clustering accuracy, reduces quantisation errors and decreases computational time in all case studies. However, the topographic errors were inconsistent across the studies, and it remains difficult for both AANNSOM and SOM to provide additional inherent information about the datasets, such as the exact position of a data point in a cluster. Therefore, M3-SOM, a novel methodology based on the SOM training algorithm, is proposed, developed and studied on the same datasets. M3-SOM improved data clustering and classification accuracy in all three case studies when compared with the conventional SOM. It is also able to obtain inherent information about the position of one data point or "sub-cluster" relative to other data points or sub-clusters within the same class in the Iris Flowers and Wine datasets. Nevertheless, it has difficulty achieving the same level of performance when clustering the Italian Olive Oils data because of the high number of data classes. Overall, both methodologies improve data clustering and classification performance and help to discover inherent information inside multidimensional data.

    An Experimental Study on Microarray Expression Data from Plants under Salt Stress by using Clustering Methods

    Current genome-wide advances in gene chip technology give “omics” research (genomics, proteomics and transcriptomics) an opportunity to analyze the expression levels of thousands of genes across multiple experiments. Many machine learning approaches have been proposed to deal with this deluge of information, and clustering methods are among them. Clustering groups data (gene profiles) into homogeneous clusters using distance measurements. Various clustering techniques are applied, but there is no consensus on which is best. In this context, seven clustering algorithms were compared on gene expression datasets of three model plants under salt stress. The techniques are evaluated by internal and relative validity measures. The AGNES algorithm appears to be the best on internal validity measures for the three plant datasets, while K-Means shows a favorable trend on relative validity measures for these datasets.
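
    To illustrate what an internal validity measure computes, the sketch below implements the mean silhouette width in plain Python. The abstract does not say which internal measures were used, so the choice of silhouette, the Euclidean distance and the toy expression profiles are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette width over all points; values lie between -1 and 1."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is taken as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy "gene profiles" (two conditions each), invented for illustration.
profiles = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
good = mean_silhouette(profiles, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(profiles, [0, 1, 0, 1])   # clusters that mix the two groups
```

    A partition that matches the two compact groups scores close to 1, whereas a partition that mixes them scores below 0, which is the kind of contrast an internal measure uses to rank clustering results without reference labels.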

    The Evolving Tree—Analysis and Applications


    The Shewanella Federation: Functional Genomic Investigations of Dissimilatory Metal-Reducing Shewanella

    Generation and validation of a Shewanella oneidensis MR-1 clone set for protein expression and phage display. An ORF clone set for S. oneidensis was created using the lambda recombinase system. ORFs within entry vectors in this system can be readily transferred into multiple destination vectors, making the clone set a useful resource for research groups studying this microorganism. To establish that the S. oneidensis clone set could be used for protein expression and functional studies, three sets of ORFs were examined for expression of His-tag proteins, expression of His/GST-tag proteins, or effective display on phage. A total of 21 out of 30 (70%) predicted two-component transcriptional regulators from S. oneidensis were successfully expressed in the His-tag format. The use of the S. oneidensis clone set for functional studies was tested using a phage display system. The method involves the fusion of peptides or proteins to a coat protein of a bacteriophage. This results in display of the fused protein on the exterior of the phage, while the DNA encoding the fusion resides within the virion. The physical linkage between the displayed protein and the DNA encoding it allows screening of vast numbers of proteins for ligand-binding properties. With this technology, a phage clone encoding thioredoxin TrxA was isolated from a sub-library consisting of 80 clones. It is evident that the S. oneidensis clone set can be used for expression of functional S. oneidensis proteins in E. coli using the appropriate destination vectors.
    Characterization of ArcA. In Escherichia coli, metabolic transitions between aerobic and anaerobic growth states occur when cells enter an oxygen-limited condition. Many of these metabolic transitions are controlled at the transcriptional level by the activities of the global regulatory proteins ArcA (aerobic respiration control) and Fnr (fumarate nitrate regulator). A homolog of ArcA (81% amino acid sequence identity) was identified in S. oneidensis MR-1, and arcA mutants with MR-1 as the parental strain were generated. Phenotype characterization showed that the arcA deletion mutant grew more slowly than the wild type and was hypersensitive to H2O2 stress. Microarray analysis indicated that S. oneidensis ArcA regulates a large number of genes different from those regulated in E. coli, although the two do have overlapping regulatory functions on a small set of genes. The S. oneidensis arcA gene was also cloned and expressed in E. coli. The ArcA proteins from the wild-type strain and a point mutant strain (D54N) were purified, and their DNA binding properties were analyzed by electrophoretic mobility shift (EMS) and DNase I footprinting assays. The results indicate that phosphorylated ArcA proteins bind to a DNA site similar in sequence to the E. coli ArcA binding site. The common feature of the binding sites is the presence of a conserved 15 base pair motif that contains 2-3 mismatches when compared to the E. coli ArcA-P consensus binding motif. Genome-scale computational predictions of binding sites were also performed, and 331 putative ArcA regulatory targets were identified. Therefore, the regulation of aerobic/anaerobic respiration may be more complex in S. oneidensis than expected.
    A high-throughput percentage-of-binding strategy to measure binding energies in DNA–protein interactions. Based on results of studies on ArcA of S. oneidensis, we developed a high-throughput approach to measure binding energies in DNA-protein interactions, which enables more precise prediction of DNA-binding sites in genomes. With this approach, the importance of each position within the ArcA-P binding site was quantitatively established by characterizing the interaction between Shewanella ArcA-P and a series of mutant promoter DNAs, in which each position in the binding site was systematically mutated to all possible single nucleotide changes. The results of the fine mapping were used to create a position-specific energy matrix (PEM) that was used for a genome-scale prediction of 45 ArcA-P sites in Shewanella. Further examination suggests that this prediction is >81% consistent with in vivo gene regulation according to microarray studies and >92% (13/14) accurate in comparison with published in vitro gel shift binding assays. In addition, this study predicted 27 ArcA-P sites for 15 published E. coli ArcA-P footprinted DNAs; 24 of them were found exactly within the footprinting protected regions, and the other three fall into regions that were not examined by footprinting assays. This is the first report showing that footprinting protected regions can be effectively predicted by starting from a single known transcription factor binding site. Finally, the predicted H. influenzae ArcA-P sites correlate well with in vivo regulation determined by microarray analysis, in that the eight predicted binding sites with the most favorable ∆∆G scores all exhibit ArcA-dependent gene regulation.
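
    A position-specific energy matrix can be used to score a candidate site by summing one energy contribution per position. The sketch below shows that idea on a forward-strand scan. Everything concrete in it is an assumption made for illustration: the matrix is a hypothetical 4-position one (the motif described above is 15 base pairs), its values are invented, the convention that the consensus base scores 0 and lower totals are more favorable is assumed, and the cutoff is arbitrary.

```python
# Hypothetical energy matrix: one dict per site position, giving the penalty
# (relative to the consensus base, which scores 0.0) for each nucleotide.
# The numbers are invented and are not taken from the study.
PEM = [
    {"A": 1.2, "C": 0.9, "G": 0.0, "T": 0.4},
    {"A": 0.8, "C": 1.5, "G": 1.1, "T": 0.0},
    {"A": 0.3, "C": 1.0, "G": 1.4, "T": 0.0},
    {"A": 0.0, "C": 0.7, "G": 0.9, "T": 1.3},
]


def site_energy(site, pem=PEM):
    """Sum the per-position penalties for one candidate site (uppercase ACGT only)."""
    if len(site) != len(pem):
        raise ValueError("site length must match the matrix width")
    return sum(column[base] for column, base in zip(pem, site))


def scan(sequence, pem=PEM, cutoff=1.0):
    """Slide the matrix along the forward strand; return hits at or below `cutoff`.

    Each hit is (start index, site, energy), sorted from most to least favorable.
    """
    width = len(pem)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        energy = site_energy(window, pem)
        if energy <= cutoff:
            hits.append((start, window, energy))
    return sorted(hits, key=lambda hit: hit[2])


hits = scan("CCGTTACCGTAAGG")
```

    With these invented values the scan reports the exact consensus match first and a one-mismatch site second. A genome-scale prediction would also need to scan the reverse complement and handle ambiguous bases, which this sketch leaves out.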

    Computational analysis of gene expression data

    Gene expression is central to the function of living cells. While advances in sequencing and expression measurement technology over the past decade have greatly facilitated understanding of the genome and its functions, the characterisation of functional groups of genes remains one of the most important problems in modern biology. Technological advancements have resulted in massive information output, with the priority shifting to the development of data analysis methods. As such, a large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments, and consequently there is confusion regarding the best approach to take. The common techniques are not necessarily the most applicable to the analysis of patterns in microarray data. This confusion is clarified by providing a framework for the analysis of clustering techniques and an investigation of how well they apply to gene expression data. To this end, the properties of microarray data itself are examined, followed by an examination of the properties of clustering techniques and how well they apply to gene expression. Clearly, each technique will find patterns even if the structures are not meaningful in a biological context, and these structures are not usually the same for different algorithms. Also, these algorithms are inherently biased, as the properties of clusters reflect built-in clustering criteria. From these considerations, it is clear that cluster validation, usually based on a manual, lengthy and subjective exploration process, is critical for algorithm development and verification of results. Consequently, it is key to the interpretation of gene expression data. We carry out a critical analysis of current methods used to evaluate clustering results. Clusters obtained from real and synthetic datasets are compared between algorithms.
    To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented as a bipartite graph, with weighted edges between gene-sample node couples corresponding to significant expression measurements of interest. In this research, this representation is studied extensively and used, in combination with probabilistic models, to develop new clustering techniques for the analysis of gene expression data. The performance of these techniques can be influenced both by the search algorithm and by the graph weighting scheme, and both merit vigorous investigation. A novel edge-weighting scheme, based on empirical evidence, is presented. The scheme is tested using several benchmark datasets at various levels of granularity, and comparisons are provided with a currently popular data analysis method used in the bioinformatics community. The analysis shows that the new empirically based scheme outperforms current edge-weighting methods by accounting for the subtleties in the data through a data-dependent threshold analysis and by selecting ‘interesting’ gene-sample couples based on relative values. The graphical theme of gene expression analysis is further developed by constructing a one-mode gene expression network that focuses specifically on local interactions among genes. Classical network theory is used to identify and examine organisational properties in the resulting graphs. A new algorithm, GraphCreate, is presented which finds functional modules in the one-mode graph, i.e. sets of genes that are coherently expressed over subsets of samples, and a scoring scheme is developed (using bipartite graph properties as a basis) to weight these modules. This representation is used to study published gene expression datasets extensively and to identify functional modules of genes with GraphCreate. This work is important as it advances research in the area of transcriptome analysis, beyond simply finding groups of coherently expressed genes, by developing a general framework to understand how and when gene sets are interacting.
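
    The bipartite representation can be sketched in a few lines. The weighting scheme of the thesis is not given in the abstract, so the rule used here is a stand-in and an assumption: a gene-sample couple becomes an edge when the measurement deviates from that gene's own mean by at least a chosen number of standard deviations, and the absolute z-score is the edge weight. The function name, the cutoff and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def bipartite_edges(expression, z_cutoff=1.0):
    """Build weighted gene-sample edges from {gene: {sample: value}}.

    The threshold is data-dependent in the sense that it is relative to each
    gene's own mean and spread. Returns a list of (gene, sample, weight).
    """
    edges = []
    for gene, row in expression.items():
        values = list(row.values())
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # a flat profile has no 'interesting' couples
        for sample, value in row.items():
            z = (value - mu) / sigma
            if abs(z) >= z_cutoff:
                edges.append((gene, sample, round(abs(z), 3)))
    return edges


# Toy expression matrix, invented for illustration.
expression = {
    "geneA": {"s1": 1.0, "s2": 1.0, "s3": 1.0, "s4": 9.0},
    "geneB": {"s1": 2.0, "s2": 2.0, "s3": 2.0, "s4": 2.0},
    "geneC": {"s1": 8.0, "s2": 1.0, "s3": 1.0, "s4": 1.0},
}
edges = bipartite_edges(expression)
```

    Only the two measurements that stand out from their gene's profile become edges, and the flat gene contributes none. A clustering or module-finding step would then operate on this edge list instead of the full matrix.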