106 research outputs found
FRAGS: estimation of coding sequence substitution rates from fragmentary data
BACKGROUND: Rates of substitution in protein-coding sequences can provide important insights into evolutionary processes that are of biomedical and theoretical interest. Increased availability of coding sequence data has enabled researchers to estimate more accurately the coding sequence divergence of pairs of organisms. However the use of different data sources, alignment protocols and methods to estimate substitution rates leads to widely varying estimates of key parameters that define the coding sequence divergence of orthologous genes. Although complete genome sequence data are not available for all organisms, fragmentary sequence data can provide accurate estimates of substitution rates provided that an appropriate and consistent methodology is used and that differences in the estimates obtainable from different data sources are taken into account. RESULTS: We have developed FRAGS, an application framework that uses existing, freely available software components to construct in-frame alignments and estimate coding substitution rates from fragmentary sequence data. Coding sequence substitution estimates for human and chimpanzee sequences, generated by FRAGS, reveal that methodological differences can give rise to significantly different estimates of important substitution parameters. The estimated substitution rates were also used to infer upper-bounds on the amount of sequencing error in the datasets that we have analysed. CONCLUSION: We have developed a system that performs robust estimation of substitution rates for orthologous sequences from a pair of organisms. Our system can be used when fragmentary genomic or transcript data is available from one of the organisms and the other is a completely sequenced genome within the Ensembl database. As well as estimating substitution statistics our system enables the user to manage and query alignment and substitution data
The contribution of exon-skipping events on chromosome 22 to protein coding diversity
Completion of the human genome sequence provides evidence for a gene count with lower bound 30,000–40,000. Significant protein complexity may derive in part from multiple transcript isoforms. Recent EST based studies have revealed that alternate transcription, including alternative splicing, polyadenylation and transcription start sites, occurs within at least 30–40% of human genes. Transcript form surveys have yet to integrate the genomic context, expression, frequency, and contribution to protein diversity of isoform variation. We determine here the degree to which protein coding diversity may be influenced by alternate expression of transcripts by exhaustive manual confirmation of genome sequence annotation, and comparison to available transcript data to accurately associate skipped exon isoforms with genomic sequence. Relative expression levels of transcripts are estimated from EST database representation. The rigorous in silico method accurately identifies exon skipping using verified genome sequence. 545 genes have been studied in this first hand-curated assessment of exon skipping on chromosome 22
Prioritizing genes of potential relevance to diseases affected by sex hormones: an example of Myasthenia Gravis
<p>Abstract</p> <p>Background</p> <p>About 5% of western populations are afflicted by autoimmune diseases many of which are affected by sex hormones. Autoimmune diseases are complex and involve many genes. Identifying these disease-associated genes contributes to development of more effective therapies. Also, association studies frequently imply genomic regions that contain disease-associated genes but fall short of pinpointing these genes. The identification of disease-associated genes has always been challenging and to date there is no universal and effective method developed.</p> <p>Results</p> <p>We have developed a method to prioritize disease-associated genes for diseases affected strongly by sex hormones. Our method uses various types of information available for the genes, but no information that directly links genes with the disease. It generates a score for each of the considered genes and ranks genes based on that score. We illustrate our method on early-onset myasthenia gravis (MG) using genes potentially controlled by estrogen and localized in a genomic segment (which contains the MHC and surrounding region) strongly associated with MG. Based on the considered genomic segment 283 genes are ranked for their relevance to MG and responsiveness to estrogen. The top three ranked genes, HLA-G, TAP2 and HLA-DRB1, are implicated in autoimmune diseases, while TAP2 is associated with SNPs characteristic for MG. Within the top 35 prioritized genes our method identifies 90% of the 10 already known MG-associated genes from the considered region without using any information that directly links genes to MG. Among the top eight genes we identified HLA-G and TUBB as new candidates. We show that our <it>ab-initio </it>approach outperforms the other methods for prioritizing disease-associated genes.</p> <p>Conclusion</p> <p>We have developed a method to prioritize disease-associated genes under the potential control of sex hormones. We demonstrate the success of this method by prioritizing the genes localized in the MHC and surrounding region and evaluating the role of these genes as potential candidates for estrogen control as well as MG. We show that our method outperforms the other methods. The method has a potential to be adapted to prioritize genes relevant to other diseases.</p
CLU: A new algorithm for EST clustering
BACKGROUND: The continuous flow of EST data remains one of the richest sources for discoveries in modern biology. The first step in EST data mining is usually associated with EST clustering, the process of grouping of original fragments according to their annotation, similarity to known genomic DNA or each other. Clustered EST data, accumulated in databases such as UniGene, STACK and TIGR Gene Indices have proven to be crucial in research areas from gene discovery to regulation of gene expression. RESULTS: We have developed a new nucleotide sequence matching algorithm and its implementation for clustering EST sequences. The program is based on the original CLU match detection algorithm, which has improved performance over the widely used d2_cluster. The CLU algorithm automatically ignores low-complexity regions like poly-tracts and short tandem repeats. CONCLUSION: CLU represents a new generation of EST clustering algorithm with improved performance over current approaches. An early implementation can be applied in small and medium-size projects. The CLU program is available on an open source basis free of charge. It can be downloaded fro
Recommended from our members
Integrating Murine Gene Expression Studies to Understand Obstructive Lung Disease due to Chronic Inhaled Endotoxin
Rationale: Endotoxin is a near ubiquitous environmental exposure that that has been associated with both asthma and chronic obstructive pulmonary disease (COPD). These obstructive lung diseases have a complex pathophysiology, making them difficult to study comprehensively in the context of endotoxin. Genome-wide gene expression studies have been used to identify a molecular snapshot of the response to environmental exposures. Identification of differentially expressed genes shared across all published murine models of chronic inhaled endotoxin will provide insight into the biology underlying endotoxin-associated lung disease. Methods: We identified three published murine models with gene expression profiling after repeated low-dose inhaled endotoxin. All array data from these experiments were re-analyzed, annotated consistently, and tested for shared genes found to be differentially expressed. Additional functional comparison was conducted by testing for significant enrichment of differentially expressed genes in known pathways. The importance of this gene signature in smoking-related lung disease was assessed using hierarchical clustering in an independent experiment where mice were exposed to endotoxin, smoke, and endotoxin plus smoke. Results: A 101-gene signature was detected in three murine models, more than expected by chance. The three model systems exhibit additional similarity beyond shared genes when compared at the pathway level, with increasing enrichment of inflammatory pathways associated with longer duration of endotoxin exposure. Genes and pathways important in both asthma and COPD were shared across all endotoxin models. Mice exposed to endotoxin, smoke, and smoke plus endotoxin were accurately classified with the endotoxin gene signature. Conclusions: Despite the differences in laboratory, duration of exposure, and strain of mouse used in three experimental models of chronic inhaled endotoxin, surprising similarities in gene expression were observed. The endotoxin component of tobacco smoke may play an important role in disease development
Comparison of glioma stem cells to neural stem cells from the adult human brain identifies dysregulated Wnt- signaling and a fingerprint associated with clinical outcome
AbstractGlioblastoma is the most common brain tumor. Median survival in unselected patients is <10 months. The tumor harbors stem-like cells that self-renew and propagate upon serial transplantation in mice, although the clinical relevance of these cells has not been well documented. We have performed the first genome-wide analysis that directly relates the gene expression profile of nine enriched populations of glioblastoma stem cells (GSCs) to five identically isolated and cultivated populations of stem cells from the normal adult human brain. Although the two cell types share common stem- and lineage-related markers, GSCs show a more heterogeneous gene expression. We identified a number of pathways that are dysregulated in GSCs. A subset of these pathways has previously been identified in leukemic stem cells, suggesting that cancer stem cells of different origin may have common features. Genes upregulated in GSCs were also highly expressed in embryonic and induced pluripotent stem cells. We found that canonical Wnt-signaling plays an important role in GSCs, but not in adult human neural stem cells. As well we identified a 30-gene signature highly overexpressed in GSCs. The expression of these signature genes correlates with clinical outcome and demonstrates the clinical relevance of GSCs
Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes
Genome-wide experimental methods to identify disease genes, such as linkage analysis and association studies, generate increasingly large candidate gene sets for which comprehensive empirical analysis is impractical. Computational methods employ data from a variety of sources to identify the most likely candidate disease genes from these gene sets. Here, we review seven independent computational disease gene prioritization methods, and then apply them in concert to the analysis of 9556 positional candidate genes for type 2 diabetes (T2D) and the related trait obesity. We generate and analyse a list of nine primary candidate genes for T2D genes and five for obesity. Two genes, LPL and BCKDHA, are common to these two sets. We also present a set of secondary candidates for T2D (94 genes) and for obesity (116 genes) with 58 genes in common to both diseases
Mice and Men: Their Promoter Properties
Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them. These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955). Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools
Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies
Metadata, often termed "data about data," is crucial for organizing,
understanding, and managing vast omics datasets. It aids in efficient data
discovery, integration, and interpretation, enabling users to access,
comprehend, and utilize data effectively. Its significance spans the domains of
scientific research, facilitating data reproducibility, reusability, and
secondary analysis. However, numerous perceptual and technical barriers hinder
the sharing of metadata among researchers. These barriers compromise the
reliability of research results and hinder integrative meta-analyses of omics
studies . This study highlights the key barriers to metadata sharing, including
the lack of uniform standards, privacy and legal concerns, limitations in study
design, limited incentives, inadequate infrastructure, and the dearth of
well-trained personnel for metadata management and reuse. Proposed solutions
include emphasizing the promotion of standardization, educational efforts, the
role of journals and funding agencies, incentives and rewards, and the
improvement of infrastructure. More accurate, reliable, and impactful research
outcomes are achievable if the scientific community addresses these barriers,
facilitating more accurate, reliable, and impactful research outcomes
An Assessment of the Role of DNA Adenine Methyltransferase on Gene Expression Regulation in E coli
N6-Adenine methylation is an important epigenetic signal, which regulates various processes, such as DNA replication and repair and transcription. In γ-proteobacteria, Dam is a stand-alone enzyme that methylates GATC sites, which are non-randomly distributed in the genome. Some of these overlap with transcription factor binding sites. This work describes a global computational analysis of a published Dam knockout microarray alongside other publicly available data to throw insights into the extent to which Dam regulates transcription by interfering with protein binding. The results indicate that DNA methylation by DAM may not globally affect gene transcription by physically blocking access of transcription factors to binding sites. Down-regulation of Dam during stationary phase correlates with the activity of TFs whose binding sites are enriched for GATC sites
- …