7 research outputs found

    The automation of microarray data analysis to ameliorate biochemical pathways [abstract]

    Abstract only available. Faculty Mentor: Dr. Dong Xu, Computer Science. The goal of this study is the annotation of a gene cluster in microarray data with probable biochemical pathways. In our experiment each gene from Arabidopsis microarray data was compared to each gene from the Arabidopsis KEGG pathways, and a similarity was calculated. For each of the 180 unique Arabidopsis pathways in KEGG, the maximum similarity was recorded, based on which KEGG-annotated gene the microarray gene was most similar to. A single gene will likely have many GO ID terms, which provides a good basis for calculating gene similarity. When two GO IDs were compared, a number between zero and one, based on their parent GO terms, was assigned. The term-to-term similarities were summed and divided by the number of comparisons, giving a similarity rating between zero and one for two genes. The genes are next grouped using fuzzy c-means clustering, and each cluster is annotated based on maximum pathway membership. Two different matrices were constructed for data analysis: Sim contained KEGG genes matched at similarity 1 and all other matches with a number between zero and one (called fuzzy matches, generated by the GO ID similarity described above), and Sim4 contained all fuzzy matches. Both matrices were analyzed for KEGG genes in properly labeled pathways at 19, 22, and 40 clusters: Sim gave 73, 82, and 97 percent matched, respectively, and Sim4 gave 24, 32, and 51 percent matched, respectively. Each cluster is analyzed by summing the similarities for each pathway, taking the maximum sum, and dividing it by the number of genes in the cluster. This measure was validated with the Sim matrix on clusters that contain only KEGG genes, for which it returns 1. Other clusters with significant values were found, often containing only unknown genes.
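
The gene-to-gene similarity and the cluster score described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says that the term-to-term number is "based on parent GO terms", so the Jaccard overlap of ancestor sets used here is an assumption, and the `ANCESTORS` table, the GO IDs in it, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy ontology: each GO ID maps to the set of its ancestor
# (parent) terms, including itself. Real data would come from the GO graph.
ANCESTORS = {
    "GO:A": {"GO:A", "GO:root"},
    "GO:B": {"GO:B", "GO:A", "GO:root"},
    "GO:C": {"GO:C", "GO:root"},
}


def term_similarity(t1, t2):
    """Assumed measure: Jaccard overlap of ancestor sets, always in [0, 1]."""
    a, b = ANCESTORS[t1], ANCESTORS[t2]
    return len(a & b) / len(a | b)


def gene_similarity(terms1, terms2):
    """Sum all term-to-term similarities, then divide by the number of comparisons."""
    pairs = list(product(terms1, terms2))
    if not pairs:
        return 0.0
    return sum(term_similarity(x, y) for x, y in pairs) / len(pairs)


def annotate_cluster(cluster_rows):
    """Score a non-empty cluster.

    cluster_rows holds one dict per gene, mapping pathway name to similarity.
    The similarities are summed per pathway, the pathway with the largest sum
    is chosen, and that sum is divided by the number of genes in the cluster.
    """
    totals = {}
    for row in cluster_rows:
        for pathway, sim in row.items():
            totals[pathway] = totals.get(pathway, 0.0) + sim
    best = max(totals, key=totals.get)
    return best, totals[best] / len(cluster_rows)
```

Under these assumptions, a cluster whose genes all match one pathway at similarity 1 scores exactly 1, which is consistent with the validation check the abstract reports for clusters containing only KEGG genes.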

    Protein secondary structure prediction: Creating a meta-tool

    Abstract only available. Protein structure prediction is a growing field of interest for many varied reasons, owing not only to its obvious utility but also to the success that applying newer mathematical tools has garnered in recent years. Despite the intractability of determining optimal protein structure directly by finding a lowest-energy conformation among a huge number of candidates, many heuristic methods have emerged that sacrifice some degree of accuracy for reasonable speed of execution. Through the use of numerical techniques such as neural networks (1), neural networks bolstered by position-specific scoring matrices generated by PSI-BLAST (2), and k-nearest neighbor algorithms (3), the success rate of protein structure prediction has been increasing over the past decade and a half. Each of these tools has particular strengths and weaknesses. To address this and to improve prediction accuracy, we are constructing a three-part meta-tool that combines k-nearest neighbor methods, neural network methods, and hidden Markov models to predict the secondary structure of proteins based on their position-specific scoring matrices. The results from each of the individual tools will be integrated and filtered to form a final prediction. This tool will be available on the web through a simple interface for those wishing to evaluate or utilize it. References: 1: Rost and Sander. Prediction of protein secondary structure at better than 70% accuracy; J. Mol. Biol. (1993) 232, 584-599. 2: Jones. Protein secondary structure prediction based on position-specific scoring matrices; J. Mol. Biol. (1999) 292, 195-202. 3: Bondugula, Duzlevski, Xu. Profiles and fuzzy k-nearest neighbor algorithm for protein secondary structure prediction; (unpublished). NSF-REU Program in Biosystems Modeling and Analysis
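
The abstract does not say how the three predictors' outputs are integrated, so the following is only one plausible sketch: a per-residue majority vote over equal-length state strings (H for helix, E for strand, C for coil). The function name `consensus`, the three-letter alphabet, and the tie-breaking rule (fall back to coil) are assumptions made for illustration.

```python
from collections import Counter


def consensus(predictions, tie_state="C"):
    """Combine equal-length per-residue state strings by majority vote."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for states in zip(*predictions):
        (top, count), *rest = Counter(states).most_common()
        # A tie for first place falls back to the assumed default state.
        if rest and rest[0][1] == count:
            top = tie_state
        result.append(top)
    return "".join(result)
```

For example, `consensus(["HHEC", "HEEC", "CHEH"])` keeps the state that at least two of the three tools agree on at each position. A real meta-tool would likely weight each predictor by its confidence and smooth the result, which this sketch does not attempt.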

    Post-Translational modifications and the effects on protein identification through mass spectrometry

    Abstract only available. Mass spectrometry is an effective tool for protein identification. A typical process for protein identification is to break down a protein into smaller peptides and to determine the mass of each of these peptides. These peptide masses are then compared against a database of proteins to identify the protein from which the peptides came. Most proteins undergo co- and/or post-translational modifications, such as glycosylation and phosphorylation, after they are synthesized. Post-translational modifications (PTMs) cause the masses of the peptides to differ from the masses in the database, so computer programs identify the proteins incorrectly. When developing a program to accurately identify proteins using mass spectrometry, consideration must be given to the PTMs that may occur. The aim of the project was to modify a program currently in development to allow users to select some PTMs for consideration. The major challenge was to account for the PTMs without introducing large amounts of error into the system. To avoid the program matching the masses of selected PTMs to false positive hits, users will be instructed to select only a small number (fewer than 5) of PTMs. The actual program will be able to include an unlimited number of PTMs, but the prototype includes only 47. Some testing still needs to be done to determine which scoring method gives the best confidence in the predictions. However, taking PTMs into account will definitely allow for more successful identification of a larger number of proteins. NSF-REU Program in Biosystems Modeling and Analysis
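
The effect of user-selected PTMs on mass matching can be sketched as below. This is an illustration under stated assumptions, not the program described in the abstract: the `PTM_SHIFTS` table lists standard monoisotopic mass shifts for three common modifications, while the function name, the tolerance, and the rule that each selected PTM is applied at most once per peptide are all hypothetical.

```python
from itertools import combinations

# Monoisotopic mass shifts in daltons for a few common PTMs.
PTM_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
}


def match_with_ptms(observed, theoretical, selected, tol=0.01, max_ptms=2):
    """Return the PTM combinations that explain an observed peptide mass.

    Every combination of up to max_ptms selected PTMs (each used at most once,
    an assumption of this sketch) is added to the theoretical mass and compared
    with the observed mass within the tolerance tol.
    """
    hits = []
    for k in range(max_ptms + 1):
        for combo in combinations(selected, k):
            shift = sum(PTM_SHIFTS[name] for name in combo)
            if abs(observed - (theoretical + shift)) <= tol:
                hits.append(combo)
    return hits
```

The number of combinations tested grows quickly with the number of selected PTMs, which illustrates why the abstract recommends selecting only a few: each extra candidate shift is another chance for a coincidental match.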

    ComPhy: Prokaryotic Composite Distance Phylogenies Inferred from Whole-Genome Gene Sets

    doi:10.1186/1471-2105-10-S1-S5. With the increasing availability of whole genome sequences, it is becoming more and more important to use complete genome sequences for inferring species phylogenies. We developed a new tool, ComPhy ('Composite Distance Phylogeny'), based on a composite distance matrix calculated from the comparison of complete gene sets between genome pairs to produce a prokaryotic phylogeny. The composite distance between two genomes is defined by three components: Gene Dispersion Distance (GDD), Genome Breakpoint Distance (GBD) and Gene Content Distance (GCD). GDD quantifies the dispersion of orthologous genes along the genomic coordinates from one genome to another; GBD measures the shared breakpoints between two genomes; GCD measures the level of shared orthologs between two genomes. The phylogenetic tree is constructed from the composite distance matrix using a neighbor joining method. We tested our method on 9 datasets from 398 completely sequenced prokaryotic genomes. We have achieved above 90% agreement in quartet topologies between the tree created by our method and the tree from the Bergey's taxonomy. In comparison to several other phylogenetic analysis methods, our method showed consistently better performance. ComPhy is a fast and robust tool for genome-wide inference of evolutionary relationship among genomes. This work was supported in part by NSF/ITR-IIS-0407204.
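
The abstract names the three components of the composite distance but not the formula that combines them. The sketch below therefore assumes, purely for illustration, that each component is already scaled to [0, 1] and that they are combined as a scaled Euclidean norm; the function names are hypothetical, and the resulting matrix would then be passed to a neighbor joining implementation, which is not shown.

```python
import math


def composite_distance(gdd, gbd, gcd):
    """Assumed combination: scaled Euclidean norm of three components in [0, 1]."""
    return math.sqrt((gdd ** 2 + gbd ** 2 + gcd ** 2) / 3.0)


def composite_matrix(gdd_m, gbd_m, gcd_m):
    """Build the pairwise composite matrix from three equally sized square matrices."""
    n = len(gdd_m)
    return [
        [composite_distance(gdd_m[i][j], gbd_m[i][j], gcd_m[i][j]) for j in range(n)]
        for i in range(n)
    ]
```

With this choice, two identical genomes get distance 0 and a pair that is maximally distant on all three components gets distance 1, so the composite stays on the same scale as its inputs.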

    Constructing proteome and metabolome maps for genetic improvement of energy-related traits in soybean [abstract]

    Only abstract of poster available. Track V: Biomass. Although the genetic blueprint of soybean is represented by the genome, its phenotype is a product of that blueprint manifested as the production of proteins and metabolites influencing growth characteristics, stress responses, seed composition, and yield. We are using various tools of genomics and molecular breeding with the aim of developing value-added soybeans that will help United States farmers maintain their competitiveness and expand the utilization of soybean crops (e.g. functional foods, industrial uses, biodiesel). Profiling soybean gene products will lay the foundation for a systems biology approach to key processes such as seed development, which will lead to the genetic improvement of yield and seed composition. Because soybean is one of the major bio-energy crops, building a comprehensive map of its proteins and metabolites will help make connections between regulatory or metabolic pathways not previously characterized. Another major benefit of these studies is the discovery of energy-related traits, including plant productivity and seed compositional traits, for the genetic improvement of soybean. It is well known that environmental cues influence developmental phenotypes in plants. Biotic stresses, such as fungal diseases, and abiotic stresses, such as drought and flooding, also elicit phenotypic responses from the genome. Thus, by studying the gene products, a direct correlation between response and specific peptides/metabolites can be made. This will lead to crop improvement through either breeding or transgenic efforts.
Major objectives of this study are: a) to identify key soybean seed, leaf, and root proteins involved in development and in biotic and abiotic stress responses; b) to establish a comprehensive set of chemical standards for soybean metabolites, moving toward construction of a metabolome map with a focus on seed and on drought effects on seed development; and c) to compile a database linking proteomic and metabolite information and to associate this information with value-added soybean traits and markers for assisted breeding. We are utilizing GC/MS, LC/MS, and NMR approaches to identify key molecules for further characterization.

    Development and assessment of scoring functions for protein identification using PMF data

    Peptide mass fingerprinting (PMF) is one of the major methods for protein identification using mass spectrometry (MS) technology. It is faster and cheaper than MS/MS. Although PMF does not differentiate trypsin-digested peptides of identical mass, which makes it less informative than MS/MS, current computational methods for PMF have the potential to improve its detection accuracy by better use of the information content in PMF spectra. We developed a number of new probability-based scoring functions for PMF protein identification based on the MOWSE algorithm. We considered a detailed distribution of matching masses in a protein database and peak intensity, as well as the likelihood of peptide matches being close to each other in a protein sequence. Our computational methods are assessed and compared with other methods using PMF data of 52 gel spots of known protein standards. The comparison shows that our new scoring schemes have higher or comparable accuracies for protein identification in comparison to the existing methods. Our software is freely available upon request. The scoring functions can be easily incorporated into other proteomics software packages.

    Genomic strategies for soybean oil improvement and biodiesel production

    Track II: Transportation and Biofuels. Includes audio file (21 min.). Soybean oil, a promising renewable energy resource, comprises 73% of biodiesel and also has other industrial applications. Missouri ranks fifth among US states in soybean planting. With a target of producing 225 million gallons of biodiesel by 2015, up from the 75 million gallons produced in 2005, efforts should focus not only on expanding the number of oil crops to meet the demand but also on increasing the amount of oil per hectare for each crop. Considering the ever-increasing need for biodiesel and the potential for Missouri to play a major role in meeting national and international demand, we at the National Center for Soybean Biotechnology focus on discovering the genetic factors that are responsible for oil content in soybean using genetic and genomic strategies. The long-term goal is to apply discoveries in breeding programs and biotechnology for the development of improved soybean cultivars with increased oil content that will make this crop more competitive in end uses. Our multidisciplinary approaches include traditional Quantitative Trait Loci (QTL) mapping, association mapping, bioinformatics and transgenics, developing new resources and utilizing already available resources such as mapping populations, diverse germplasm collections, genome sequence information and transgenes. In addition to total oil content, we are focusing on improving quality traits such as oleic acid content, which has direct human health benefits and applications in biodiesel production. With the use of advanced genomic technologies, genetic materials, and synergistic efforts involving intra- and inter-institutional collaborations, we believe that our current and future research will contribute substantially to biodiesel production.
Increased production using high-oil soybean cultivars will not only increase the economic gains to farmers/growers but also help the US emerge as the global leader in biodiesel production.