
    Genome signatures, self-organizing maps and higher order phylogenies: a parametric analysis

    Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in phylogenetic classification of unknown DNA sequences. However, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. Overall, these results indicate that small, rapidly trained SOMs are about as good as larger, longer-trained ones for the analysis of genome signatures. These results may also apply more generally to the use of SOMs for other complex data sets, such as microarray data.
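
As an illustration of the k-mer genome signature described in this abstract, the following is a minimal sketch, not the authors' implementation. The function name `genome_signature`, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def genome_signature(sequence, k=4):
    """Return the relative k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, one per k-mer over A, C, G, T,
    in lexicographic order (AAAA, AAAC, ...).
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Assumption: windows with characters other than A, C, G, T
    # (for example N) are left out of the total.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    signature = genome_signature("ACGTACGTTTGACA", k=4)
    print(len(signature))  # 256 entries for k = 4
```

With k = 4 the signature has 256 dimensions and with k = 5 it has 1,024, which is the range the abstract reports as optimal.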

    Mapping of the Genome Sequence Using Two-stage Self Organizing Maps

    In this paper, we introduce a Self-Organizing Map (SOM) algorithm that can map a genome sequence continuously onto the map. DNA sequences are considered to have special features that depend on the regions from which the sequences are taken or on the functions of the proteins translated from them. If these hidden features can be extracted from the DNA sequences, they can be used to predict the regions or the functions of the sequences. We propose an algorithm using a two-stage SOM, which organizes sequences of a specific length at the first stage and organizes sets of sequences at the second stage. This algorithm can map genome sequences onto the map at each stage according to the features of the sequences. We carried out several analyses of genome sequences concerning the functions, species and secondary structure of the sequences.

    Time series genome-centric analysis unveils bacterial response to operational disturbance in activated sludge

    Understanding ecosystem response to disturbances and identifying the most critical traits for the maintenance of ecosystem functioning are important goals for microbial community ecology. In this study, we used 16S rRNA amplicon sequencing and metagenomics to investigate the assembly of bacterial populations in a full-scale municipal activated sludge wastewater treatment plant over a period of 3 years, including a 9-month period of disturbance characterized by short-term plant shutdowns. Following the reconstruction of 173 metagenome-assembled genomes, we assessed the functional potential, the number of rRNA gene operons, and the in situ growth rate of microorganisms present throughout the time series. Operational disturbances caused a significant decrease in bacteria with a single copy of the rRNA (rrn) operon. Despite moderate differences in resource availability, replication rates were distributed uniformly throughout time, with no differences between disturbed and stable periods. We suggest that the length of the growth lag phase, rather than the growth rate, is the primary driver of selection under disturbed conditions. Thus, the system could maintain its function in the face of disturbance by recruiting bacteria with the capacity to rapidly resume growth under unsteady operating conditions.
    Affiliation: Pérez, María Victoria. Agua y Saneamientos Argentinos S.A.; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor N. Torres"; Argentina
    Affiliation: Guerrero, Leandro Demián. Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor N. Torres"; Argentina
    Affiliation: Orellana, Esteban. Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor N. Torres"; Argentina
    Affiliation: Figuerola, Eva Lucia Margarita. Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor N. Torres"; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Fisiología, Biología Molecular y Celular; Argentina
    Affiliation: Erijman, Leonardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor N. Torres"; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Fisiología, Biología Molecular y Celular; Argentin

    Development of Self-Compressing BLSOM for Comprehensive Analysis of Big Sequence Data


    A novel bioinformatics tool for phylogenetic classification of genomic sequence fragments derived from mixed genomes of uncultured environmental microbes

    A Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional complex data on a two-dimensional map. We adapted the conventional SOM for genome informatics, making the learning process and resulting map independent of the order of data input, and developed a novel bioinformatics tool for phylogenetic classification of sequence fragments obtained from pooled genome samples of microorganisms in environmental samples, allowing visualization of microbial diversity and the relative abundance of microorganisms on a map. First, we constructed SOMs of tri- and tetranucleotide frequencies from a total of 3.3 Gb of sequences derived from 113 prokaryotic and 13 eukaryotic genomes for which complete genome sequences are available. The SOMs classified the 330,000 10-kb sequences from these genomes mainly according to species, without being given any information on the species. Importantly, classification was possible without orthologous sequence sets and was thus useful for studies of novel sequences from poorly characterized species, such as those living only under extreme conditions, which have attracted wide scientific and industrial attention. Using the SOM method, sequences that were derived from a single genome but cloned independently in a metagenome library could be reassociated in silico. The usefulness of SOMs in metagenome studies is also discussed.
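
One common way to make SOM training independent of the order of data input is batch learning, in which node weights are updated only after every input vector has been assigned to its best-matching node. The sketch below illustrates that idea under stated assumptions; it is not the tool described in the abstract. The function name `train_batch_som`, the diagonal initialization, the fixed square neighborhood, and the plain averaging update are all choices made for the example.

```python
def train_batch_som(data, rows, cols, epochs=10, radius=1):
    """Train a small rows x cols SOM with batch updates.

    data is a list of equal-length numeric vectors. Returns a list of
    rows * cols weight vectors in row-major order.
    """
    dim = len(data[0])
    n_nodes = rows * cols
    # Assumed deterministic initialization: spread the nodes evenly
    # between the per-dimension minimum and maximum of the data.
    lo = [min(v[d] for v in data) for d in range(dim)]
    hi = [max(v[d] for v in data) for d in range(dim)]
    weights = [
        [lo[d] + (hi[d] - lo[d]) * (n + 0.5) / n_nodes for d in range(dim)]
        for n in range(n_nodes)
    ]
    for _ in range(epochs):
        sums = [[0.0] * dim for _ in range(n_nodes)]
        counts = [0] * n_nodes
        for v in data:
            # Best-matching unit: the node closest to v (squared distance).
            bmu = min(
                range(n_nodes),
                key=lambda n: sum((a - b) ** 2 for a, b in zip(v, weights[n])),
            )
            b_row, b_col = divmod(bmu, cols)
            # Credit v to every node inside the square neighborhood.
            for n in range(n_nodes):
                r, c = divmod(n, cols)
                if abs(r - b_row) <= radius and abs(c - b_col) <= radius:
                    counts[n] += 1
                    for d in range(dim):
                        sums[n][d] += v[d]
        # Weights change only here, after all inputs have been seen,
        # so the order of the inputs does not affect the update.
        weights = [
            [s / counts[n] for s in sums[n]] if counts[n] else weights[n]
            for n in range(n_nodes)
        ]
    return weights
```

In practice the neighborhood radius usually shrinks over the epochs; it is held fixed here to keep the sketch short. The input vectors could be tri- or tetranucleotide frequency vectors of the kind the abstract describes.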

    Genomic and proteomic analysis with dynamically growing self organising tree (DGSOT) for measuring clinical outcomes of cancer

    Genomics and proteomics microarray technologies are used for analysing the molecular and cellular expression of cancer. The large volumes of data they produce create a challenge for analysis and interpretation. The current review describes a combined system for the genetic and molecular interpretation and analysis of genomics and proteomics data that offers a wide range of interpreted results. Artificial neural network systems are well suited to dealing with these large volumes of analytical data. The system recommended here was selected by analysing the different technologies currently used or reviewed for microarray data analysis. The proposed system is a tree structure, a new hierarchical clustering algorithm called the dynamically growing self-organizing tree (DGSOT) algorithm, which overcomes drawbacks of traditional hierarchical clustering algorithms. The DGSOT algorithm combines horizontal and vertical growth to construct a multifurcating hierarchical tree from top to bottom to cluster the data. It is designed to combine the strengths of neural networks (NN), which offer speed and robustness to noise, with those of hierarchical clustering tree structures, which need minimal prior specification of the number of clusters and minimal training to produce results that are interpretable in a biological context. The combined system will generate a biological interpretation of expression profiles associated with diagnosis of disease (including early detection, molecular classification and staging), metastasis (spread of the disease to non-adjacent organs and/or tissues), prognosis (predicting clinical outcome) and response to treatment; it also gives possible therapeutic options, ranking them according to their benefits for the patient.
    Key words: Genomics, proteomics, microarray, dynamically growing self-organizing tree (DGSOT)

    Yeast gene CMR1/YDL156W is consistently co-expressed with genes participating in DNA-metabolic processes in a variety of stringent clustering experiments

    © 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited.
    The binarization of consensus partition matrices (Bi-CoPaM) method has, among its unique features, the ability to perform ensemble clustering over the same set of genes from multiple microarray datasets by using various clustering methods in order to generate tunable tight clusters. We therefore applied the Bi-CoPaM method to the 500 most synchronized cell-cycle-regulated yeast genes from different microarray datasets to produce four tight, specific and exclusive clusters of co-expressed genes. We found that 19 genes formed the tightest of the four clusters, including the gene CMR1/YDL156W, which was uncharacterized at the time of our investigations. Two very recent proteomic and biochemical studies have independently revealed many facets of the CMR1 protein, although its precise functions remain to be elucidated. Our computational results complement these biological results and add more evidence to their recent findings that CMR1 potentially participates in many DNA-metabolism processes such as replication, repair and transcription. Interestingly, our results demonstrate the close co-expression of CMR1 with the replication protein A (RPA), the cohesion complex and the DNA polymerases α, δ and ε, and suggest functional relationships between CMR1 and the respective proteins. In addition, the analysis provides further substantial evidence that the expression of the CMR1 gene could be regulated by the MBF complex. In summary, the application of a novel analytic technique to large biological datasets has provided supporting evidence for a gene of previously unknown function, further hypotheses to test, and a more general demonstration of the value of sophisticated methods for exploring the new large datasets now so readily generated in biological experiments.
    National Institute for Health Researc

    Neural Network and Bioinformatic Methods for Predicting HIV-1 Protease Inhibitor Resistance

    This article presents a new method for predicting viral resistance to seven protease inhibitors from the HIV-1 genotype, and for identifying the positions in the protease gene at which the specific nature of the mutation affects resistance. The neural network Analog ARTMAP predicts protease inhibitor resistance from viral genotypes. A feature selection method detects genetic positions that contribute to resistance both alone and through interactions with other positions. This method has identified positions 35, 37, 62, and 77, where traditional feature selection methods have not detected a contribution to resistance. At several positions in the protease gene, mutations confer differing degrees of resistance, depending on the specific amino acid to which the sequence has mutated. To find these positions, an Amino Acid Space is introduced to represent genes in a vector space that captures the functional similarity between amino acid pairs. Feature selection identifies several new positions, including 36, 37, and 43, with amino acid-specific contributions to resistance. Analog ARTMAP networks applied to inputs that represent specific amino acids at these positions perform better than networks that use only mutation locations.
    Air Force Office of Scientific Research (F49620-01-1-0423); National Geospatial-Intelligence Agency (NMA 201-01-1-2016); National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624

    Community-wide analysis of microbial genome sequence signatures

    Genome signatures are used to identify and cluster sequences de novo from an acid biofilm microbial community metagenomic dataset, revealing information about the low-abundance community members.