Protein interactions across and between eukaryotic kingdoms: networks, inference strategies, integration of functional data and evolutionary dynamics
Thesis (Ph.D.)--Boston University. How cellular elements coordinate their function is a fundamental question in biology. A crucial step towards understanding cellular systems is the mapping of physical interactions between proteins, DNA, RNA and other macromolecules or metabolites. Genome-scale technologies have yielded protein-protein interaction networks for several eukaryotic species and have provided insight into biological processes and evolution, but many of the currently available networks are biased. Towards a true human protein-protein interaction network, we examined literature-based aggregations of low-throughput experiments, high-throughput experimental networks validated using different strategies, and predicted interaction networks to infer how the underlying interactome may differ from current maps. Using systematically mapped interactome networks, which appear to be the least biased, we explored the functional organization of Arabidopsis thaliana and characterized the asymmetric divergence of duplicated paralogous proteins through their interaction profiles. To further dissect the relationship between interactions and function enforced by evolution, we investigated a first-of-its-kind systematic cross-species human-yeast hybrid interactome network. Although the cross-species network is topologically similar to conventional intra-species networks, we found signatures of dynamic changes in interaction propensities due to countervailing evolutionary forces. Collectively, these analyses of human, plant and yeast interactome networks bridge separate experiments to characterize bias, function and evolution across eukaryotic kingdoms.
Mugsy: fast multiple alignment of closely related whole genomes
Motivation: The relative ease and low cost of current-generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.
Tandem mass spectrometry data quality assessment by self-convolution
Background: Many algorithms have been developed for deciphering tandem mass spectrometry (MS) data sets. They fall essentially into two classes. The first searches a database of theoretical mass spectra, while the second is based on de novo sequencing from raw mass spectrometry data. The quality of the mass spectra significantly affects protein identification in both cases. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to protein identification algorithms, thus allowing for more meaningful searches and increased confidence in the proteins identified. Results: The proposed method measures the quality of MS data sets based on the symmetric property of the b- and y-ion peaks present in an MS spectrum. The method employs self-convolution of the MS data with its time-reversed copy. Because b-ion and y-ion peaks are symmetric, the self-convolution of a good spectrum produces its highest-intensity peak at the midpoint. To reduce processing time, self-convolution was carried out using the Fast Fourier Transform and its inverse transform, followed by removal of the "DC" (Direct Current) component and normalisation of the data set. The quality score was defined as the ratio of the intensity at the midpoint to the remaining peaks of the convolution result. The method was validated using both theoretical mass spectra, with various permutations, and several real MS data sets. The results were encouraging, revealing a high percentage of positive prediction rates for spectra with good quality scores. Conclusion: We have demonstrated in this work a method for determining the quality of tandem MS data sets. By determining the quality of tandem MS data before subjecting them to protein identification algorithms, spurious protein predictions due to poor tandem MS data are avoided, giving scientists greater confidence in the predicted results. We conclude that the algorithm performs well and could potentially be used as a pre-processing step for all mass spectrometry based protein identification tools.
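The scoring idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the spectrum is taken to be already binned into a list of intensities whose window is chosen so that complementary b- and y-ion bins i and j satisfy i + j = N - 1; the "DC" component is removed by subtracting the mean of the convolution result; and the "remaining peaks" in the ratio are summarised by the largest remaining absolute value. Convolving the spectrum with itself is equivalent to correlating it with its reversed copy, so the midpoint of the result (index N - 1) measures mirror symmetry. The function names `fft`, `self_convolve` and `quality_score` are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def self_convolve(spectrum):
    """Linear convolution of a binned spectrum with itself, computed via FFT."""
    n = len(spectrum)
    out_len = 2 * n - 1
    size = 1
    while size < out_len:  # zero-pad to a power of two to avoid wrap-around
        size *= 2
    padded = [complex(v) for v in spectrum] + [0j] * (size - n)
    freq = fft(padded)
    product = [f * f for f in freq]  # pointwise square = convolution theorem
    # The inverse transform above is unscaled, so divide by the size here.
    return [v.real / size for v in fft(product, inverse=True)][:out_len]


def quality_score(spectrum):
    """Ratio of the midpoint value to the largest remaining value (assumed form)."""
    n = len(spectrum)
    if n < 2:
        raise ValueError("spectrum needs at least two bins")
    conv = self_convolve(spectrum)
    mean = sum(conv) / len(conv)
    centred = [c - mean for c in conv]  # assumed form of DC removal
    scale = max(abs(c) for c in centred) or 1.0
    normalised = [c / scale for c in centred]
    mid = normalised[n - 1]
    rest = normalised[:n - 1] + normalised[n:]
    largest_other = max(abs(r) for r in rest) or 1.0
    return mid / largest_other


# Toy binned spectra (made-up intensities, not real MS data).
symmetric = [0, 5, 0, 3, 0, 0, 3, 0, 5, 0]   # mirror-symmetric about the centre
asymmetric = [0, 5, 3, 0, 0, 0, 0, 0, 0, 0]  # no complementary peaks
score_sym = quality_score(symmetric)
score_asym = quality_score(asymmetric)
```

In this toy case the mirror-symmetric spectrum scores above 1 and the asymmetric one below 1. Real spectra would additionally need binning, a choice of window from the precursor mass, and a calibrated threshold, none of which are specified in the abstract.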
Interpreting Cancer Genomes Using Systematic Host Perturbations by Tumour Virus Proteins
Genotypic differences greatly influence susceptibility and resistance to disease. Understanding genotype-phenotype relationships requires that phenotypes be viewed as manifestations of network properties, rather than simply as the result of individual genomic variations. Genome sequencing efforts have identified numerous germline mutations associated with cancer predisposition and large numbers of somatic genomic alterations. However, it remains challenging to distinguish between background, or “passenger”, and causal, or “driver”, cancer mutations in these datasets. Human viruses intrinsically depend on their host cell during the course of infection and can elicit pathological phenotypes similar to those arising from mutations. To test the hypothesis that genomic variations and tumour viruses may cause cancer via related mechanisms, we systematically examined host interactome and transcriptome network perturbations caused by DNA tumour virus proteins. The resulting integrated viral perturbation data reflect rewiring of the host cell networks and highlight pathways that go awry in cancer, such as Notch signalling and apoptosis. We show that systematic analyses of host targets of viral proteins can identify cancer genes with a success rate on par with their identification through functional genomics and large-scale cataloguing of tumour mutations. Together, these complementary approaches result in increased specificity for cancer gene identification. Combining systems-level studies of pathogen-encoded gene products with genomic approaches will facilitate prioritization of cancer-causing driver genes so as to advance understanding of the genetic basis of human cancer.
Network-based Analysis of Genome Wide Association Data Provides Novel Candidate Genes for Lipid and Lipoprotein Traits
Genome-wide association studies (GWAS) identify susceptibility loci for complex traits, but do not identify particular genes of interest. Integration of functional and network information may help in overcoming this limitation and identifying new susceptibility loci. Using GWAS and comorbidity data, we present a network-based approach to predict candidate genes for lipid and lipoprotein traits. We apply a prediction pipeline incorporating interactome, co-expression, and comorbidity data to Global Lipids Genetics Consortium (GLGC) GWAS for four traits of interest, identifying phenotypically coherent modules. These modules provide insights regarding gene involvement in complex phenotypes with multiple susceptibility alleles and low effect sizes. To experimentally test our predictions, we selected four candidate genes and genotyped representative SNPs in the Malmo Diet and Cancer Cardiovascular Cohort. We found significant associations with LDL-C and total-cholesterol levels for a synonymous SNP (rs234706) in the cystathionine beta-synthase (CBS) gene (p = 1 × 10^-5 and adjusted-p = 0.013, respectively). Further, liver samples taken from 206 patients revealed that patients with the minor allele of rs234706 had significant dysregulation of CBS (p = 0.04). Despite the known biological role of CBS in lipid metabolism, SNPs within the locus have not yet been identified in GWAS of lipoprotein traits. Thus, the GWAS-based Comorbidity Module (GCM) approach identifies candidate genes missed by GWAS, serving as a broadly applicable tool for the investigation of other complex disease phenotypes.
Automated Genome Mining of Ribosomal Peptide Natural Products
Ribosomally synthesized and posttranslationally modified peptides (RiPPs), especially from microbial sources, are a large group of bioactive natural products that are a promising source of new (bio)chemistry and bioactivity. In light of exponentially increasing microbial genome databases and improved mass spectrometry (MS)-based metabolomic platforms, there is a need for computational tools that connect natural product genotypes predicted from microbial genome sequences with their corresponding chemotypes from metabolomic data sets. Here, we introduce RiPPquest, a tandem mass spectrometry database search tool for identification of microbial RiPPs, and apply it to lanthipeptide discovery. RiPPquest uses genomics to limit the search space to the vicinity of RiPP biosynthetic genes and proteomics to analyze extensive peptide modifications and compute p-values of peptide-spectrum matches (PSMs). We highlight RiPPquest by connecting multiple RiPPs from extracts of Streptomyces to their gene clusters and by the discovery of a new class III lanthipeptide, informatipeptin, from Streptomyces viridochromogenes DSM 40736. The name reflects that it is a natural product discovered by mass spectrometry based genome mining using algorithmic tools rather than manual inspection of mass spectrometry data and genetic information. The presented tool is available at cyclo.ucsd.edu.