94 research outputs found
CAFTAN: a tool for fast mapping, and quality assessment of cDNAs
Background: The German cDNA Consortium has been cloning full-length cDNAs and exploiting them in
protein localization experiments and cellular assays. However, the efficient use of large cDNA
resources requires the development of strategies that are capable of a speedy selection of truly useful cDNAs
from biological and experimental noise. To this end we have developed a new high-throughput analysis tool,
CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their
systematic annotation and application in functional genomics.
Results: CAFTAN is built around the mapping of cDNAs to the genome assembly, and the subsequent analysis
of their genomic context. It uses sequence features like the presence and type of PolyA signals, inner and flanking
repeats, the GC-content, splice site types, etc. All these features are evaluated in individual tests and used to classify
cDNAs according to their sequence quality and likelihood to have been generated from fully processed mRNAs.
Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference
sets from publicly available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping
exons and the structural classification of cDNAs with respect to the reference set of splice variants.
The evaluation of CAFTAN showed that it is able to correctly classify more than 85% of 5950 selected "known
protein-coding" VEGA cDNAs as high quality multi- or single-exon. It identified as good 80.6 % of the single exon
cDNAs and 85 % of the multiple exon cDNAs.
The program is written in Perl and in a modular way, allowing the adoption of this strategy to other tasks like
EST-annotation, or to extend it by adding new classification rules and new organism databases as they become
available. We think that it is a very useful program for the annotation and research of unfinished genomes.
Conclusion: CAFTAN is a high-throughput sequence analysis tool, which performs a fast and reliable quality
prediction of cDNAs. Several thousands of cDNAs can be analyzed in a short time, giving the curator/scientist a
first quick overview about the quality and the already existing annotation of a set of cDNAs. It supports the
rejection of low quality cDNAs and helps in the selection of likely novel splice variants, and/or completely novel
transcripts for new experiments.
Funding: German Federal Ministry of Education and Research 01GR0101 and 01GR0420 and 01GR045
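The kind of per-feature tests and coordinate comparison described in the CAFTAN abstract above can be illustrated with a minimal Python sketch. This is not CAFTAN's code (the tool itself is written in Perl): the function names, the 50-base search window, and the half-open exon coordinates are assumptions made for illustration. AATAAA and ATTAAA are the two most common polyadenylation signal hexamers.

```python
# Hypothetical helpers illustrating sequence-feature tests; not CAFTAN's API.

POLYA_SIGNALS = ("AATAAA", "ATTAAA")  # most common polyA signal hexamers


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_polya_signal(seq: str, window: int = 50):
    """Return the first listed signal found in the last `window` bases, else None.

    The window size is an assumed, illustrative value.
    """
    tail = seq.upper()[-window:]
    for motif in POLYA_SIGNALS:
        if motif in tail:
            return motif
    return None


def exon_overlap(cdna_exons, ref_exons) -> int:
    """Number of bases shared by two exon lists of half-open (start, end) pairs.

    Assumes the exons within each list do not overlap one another.
    """
    total = 0
    for s1, e1 in cdna_exons:
        for s2, e2 in ref_exons:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total
```

A classifier in this spirit would combine several such test results into a quality class; how CAFTAN weighs its individual tests is not stated in the abstract.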
Rhomboid Protease Dynamics and Lipid Interactions
Intramembrane proteases, which cleave transmembrane
(TM) helices, participate in numerous biological
processes encompassing all branches of life.
Several crystallographic structures of Escherichia
coli GlpG rhomboid protease have been determined.
In order to understand GlpG dynamics and lipid interactions
in a native-like environment, we have examined
the molecular dynamics of wild-type and mutant
GlpG in different membrane environments. The irregular
shape and small hydrophobic thickness of the
protein cause significant bilayer deformations that
may be important for substrate entry into the active
site. Hydrogen-bond interactions with lipids are
paramount in protein orientation and dynamics.
Mutations in the unusual L1 loop cause changes in
protein dynamics and protein orientation that are
relayed to the His-Ser catalytic dyad. Similarly, mutations
in TM5 change the dynamics and structure of
the L1 loop. These results imply that the L1 loop
has an important regulatory role in proteolysis.
Funding: National Institute of General Medical Sciences (GM-74637
cDNA2Genome: A tool for mapping and annotating cDNAs
BACKGROUND: In recent years, several high-throughput cDNA sequencing projects have been funded worldwide with the aim of identifying and characterizing the structure of complete novel human transcripts. However, some of these cDNAs are error prone due to frameshifts and stop codon errors caused by low sequence quality, or to cloning of truncated inserts, among other reasons. Therefore, accurate CDS prediction from these sequences first requires the identification of potentially problematic cDNAs in order to speed up the subsequent annotation process. RESULTS: cDNA2Genome is an application for the automatic high-throughput mapping and characterization of cDNAs. It utilizes current annotation data and the most up-to-date databases, especially in the case of ESTs and mRNAs, in conjunction with a vast number of approaches to gene prediction in order to perform a comprehensive assessment of the cDNA exon-intron structure. The final result of cDNA2Genome is an XML file containing all relevant information obtained in the process. This XML output can easily be used for further analysis such as program pipelines, or the integration of results into databases. The web interface to cDNA2Genome also presents this data in HTML, where the annotation is additionally shown in a graphical form. cDNA2Genome has been implemented under the W3H task framework, which allows the combination of bioinformatics tools in tailor-made analysis task flows as well as the sequential or parallel computation of many sequences for large-scale analysis. CONCLUSIONS: cDNA2Genome represents a new versatile and easily extensible approach to the automated mapping and annotation of human cDNAs. The underlying approach allows sequential or parallel computation of sequences for high-throughput analysis of cDNAs.
Profile analysis and prediction of tissue-specific CpG island methylation classes
Background: The computational prediction of DNA methylation has become an important topic in the recent years due to its role in the epigenetic control of normal and cancer-related processes. While previous prediction approaches focused merely on differences between methylated and unmethylated DNA sequences, recent experimental results have shown the presence of much more complex patterns of methylation across tissues and time in the human genome. These patterns are only partially described by a binary model of DNA methylation. In this work we propose a novel approach, based on profile analysis of tissue-specific methylation that uncovers significant differences in the sequences of CpG islands (CGIs) that predispose them to a tissue-specific methylation pattern.
Results: We defined CGI methylation profiles that separate not only between constitutively methylated and unmethylated CGIs, but also identify CGIs showing a differential degree of methylation across tissues and cell-types or a lack of methylation exclusively in sperm. These profiles are clearly distinguished by a number of CGI attributes including their evolutionary conservation, their significance, as well as the evolutionary evidence of prior methylation. Additionally, we assess profile functionality with respect to the different compartments of protein coding genes and their possible use in the prediction of DNA methylation.
Conclusion: Our approach provides new insights into the biological features that determine if a CGI has a functional role in the epigenetic control of gene expression and the features associated with CGI methylation susceptibility. Moreover, we show that the ability to predict CGI methylation is based primarily on the quality of the biological information used and the relationships uncovered between different sources of knowledge. The strategy presented here is able to predict, besides the constitutively methylated and unmethylated classes, two more tissue specific methylation classes conserving the accuracy provided by leading binary methylation classification methods.
Cis-cop: Multiobjective identification of cis-regulatory modules based on constrains
Gene expression regulation is an intricate,
dynamic phenomenon essential for all biological functions. The necessary instructions for
gene expression are encoded in cis-regulatory
elements that work together and interact
with the RNA polymerase to confer specific
spatial and temporal patterns of transcription. Therefore, the identification of these elements is currently an active area of research
in computational analysis of regulatory sequences. However, the problem is difficult
since the combinatorial interactions between
the regulating factors can be very complex.
Here we present a web server, Cis-cop, that
identifies cis-regulatory modules given a set
of transcription factor binding sites and, additionally, also RNA pol sites for a group of
genes.
Optimization of multi-classifiers for computational biology: application to gene finding and expression
Genomes of many organisms have been sequenced over the last few years. However, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed to address part of this problem: the location of genes along a genome and their expression. We propose a multi-objective methodology to combine state-of-the-art algorithms into an aggregation scheme in order to obtain optimal methods’ aggregations. The results obtained show a major improvement in sensitivity when our methodology is compared to the performance of individual methods for gene finding and gene expression problems. The methodology proposed here is an automatic method generator, and a step forward to exploit all already existing methods, by providing alternative optimal methods’ aggregations to answer concrete queries for a certain biological problem with a maximized accuracy of the prediction. As more approaches are integrated for each of the presented problems, de novo accuracy can be expected to improve further.
Funding: Ministerio de Ciencia y Tecnología TIN2006-12879; Junta de Andalucía TIC-0278
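The idea of selecting optimal aggregations of existing predictors under more than one objective can be illustrated with a small Python sketch. It is an assumption-laden toy, not the authors' method: the abstract does not say how methods are combined or which objectives are optimized, so voting thresholds as the combination rule, sensitivity and specificity as the two objectives, and all function names are hypothetical choices.

```python
from itertools import combinations


def vote(predictions, threshold):
    """Call a position positive when at least `threshold` methods do."""
    return [sum(col) >= threshold for col in zip(*predictions)]


def sens_spec(pred, truth):
    """Return (sensitivity, specificity) of binary calls against the truth."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    pos = sum(1 for t in truth if t)
    neg = len(truth) - pos
    return (tp / pos if pos else 0.0, tn / neg if neg else 0.0)


def pareto_front(scored):
    """Keep entries whose score pair is not dominated by another entry."""
    front = []
    for name, s in scored:
        dominated = any(
            o[0] >= s[0] and o[1] >= s[1] and o != s for _, o in scored
        )
        if not dominated:
            front.append((name, s))
    return front


def aggregate(methods, truth):
    """Score every (subset of methods, vote threshold) and return the front.

    `methods` maps a method name to its list of 0/1 calls.
    """
    scored = []
    names = sorted(methods)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            preds = [methods[m] for m in subset]
            for threshold in range(1, r + 1):
                score = sens_spec(vote(preds, threshold), truth)
                scored.append(((subset, threshold), score))
    return pareto_front(scored)
```

Exhaustive enumeration is only feasible for a handful of methods; with many predictors a search heuristic would be needed, which this sketch does not attempt.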
Uncovering the complex genetic architecture of human plasma lipidome using machine learning methods
Genetic architecture of plasma lipidome provides insights into regulation of lipid metabolism
and related diseases. We applied an unsupervised machine learning method, PGMRA, to discover
phenotype-genotype many-to-many relations between genotype and plasma lipidome (phenotype)
in order to identify the genetic architecture of plasma lipidome profiled from 1,426 Finnish individuals
aged 30–45 years. PGMRA involves biclustering genotype and lipidome data independently followed
by their inter-domain integration based on hypergeometric tests of the number of shared individuals.
Pathway enrichment analysis was performed on the SNP sets to identify their associated biological
processes. We identified 93 statistically significant (hypergeometric p-value < 0.01) lipidome-genotype
relations. Genotype biclusters in these 93 relations contained 5977 SNPs across 3164 genes.
Twenty-nine of the 93 relations contained genotype biclusters with more than 50% unique SNPs
and participants, thus representing most distinct subgroups. We identified 30 significantly enriched
biological processes among the SNPs involved in 21 of these 29 most distinct genotype-lipidome
subgroups through which the identified genetic variants can influence and regulate plasma lipid
related metabolism and profiles. This study identified 29 distinct genotype-lipidome subgroups in the
studied Finnish population that may have distinct disease trajectories and therefore could be useful in
precision medicine research.
Funding: Research Council of Finland; Social Insurance Institution of Finland; Competitive State Research Financing of Expert Responsibility area of Kuopio, Tampere and Turku University Hospitals; Juho Vainio Foundation; Paavo Nurmi Foundation; Finnish Foundation for Cardiovascular Research; Finnish Cultural Foundation; Finnish IT center for science; Sigrid Juselius Foundation; Tampere Tuberculosis Foundation; Emil Aaltonen Foundation; Yrjo Jahnsson Foundation; Signe and Ane Gyllenberg Foundation; Diabetes Research Foundation of Finnish Diabetes Association; 322098; 286284; 134309; 126925; 121584; 124282; 255381; 256474; 283115; 319060; 320297; 314389; 338395; 330809; 104821; 129378; 117797; 141071; INFRAIA-2016-1-730897; Horizon 2020; European Research Council (ERC); European Commission 349708; Tampere University Hospital Supporting Foundation; Finnish Society of Clinical Chemistry; Spanish Government RTI2018-098983-B-100; Laboratoriolaaketieteen Edistamissaatio Sr; Ida Montinin saatio; Kalle Kaiharin saatio; Aarne Koskelon saatio; Faculty of Medicine and Health Technology, Tampere University; Project HPC-EUROPA3; X51001; 50191928; EC Research Innovation Action under H2020 Programme 75532
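The inter-domain integration step in the PGMRA abstract above, a hypergeometric test on the number of individuals shared by a genotype bicluster and a lipidome bicluster, can be sketched in Python as follows. This is a minimal illustration, not PGMRA's implementation: the function names are hypothetical, and the use of a one-sided upper tail, P(X >= k), is an assumption about how the test is applied.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, n drawn)."""
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def shared_individuals_pvalue(geno_members: set, pheno_members: set, n_total: int) -> float:
    """p-value for the overlap between two biclusters' sets of individuals.

    `n_total` is the number of individuals in the whole cohort.
    """
    k = len(geno_members & pheno_members)
    return hypergeom_upper_tail(k, n_total, len(geno_members), len(pheno_members))
```

For example, two biclusters of five individuals each that coincide exactly in a cohort of ten give a p-value of 1/252, since only one of the 252 possible five-person draws reproduces the full overlap. A relation would then be kept when the p-value falls below the chosen cutoff (0.01 in the abstract).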
Optimization of multi-classifiers for computational biology: application to gene finding and expression
Genomes of many organisms have been
sequenced over the last few years. However, transforming
such raw sequence data into knowledge remains a hard
task. A great number of prediction programs have been
developed to address part of this problem: the location of
genes along a genome and their expression. We propose a
multi-objective methodology to combine state-of-the-art
algorithms into an aggregation scheme in order to obtain
optimal methods’ aggregations. The results obtained show
a major improvement in sensitivity when our methodology
is compared to the performance of individual methods for
gene finding and gene expression problems. The methodology
proposed here is an automatic method generator, and a
step forward to exploit all already existing methods, by
providing alternative optimal methods’ aggregations to
answer concrete queries for a certain biological problem
with a maximized accuracy of the prediction. As more
approaches are integrated for each of the presented problems,
de novo accuracy can be expected to improve further.
Funding: Ministry of Science and Innovation, Spain (MICINN); Spanish Government TIN-2006-12879; Junta de Andalucia TIC-02788; Howard Hughes Medical Institute; European Commission; Junta de Andaluci
Identification of differentially expressed small non-coding RNAs in the legume endosymbiont Sinorhizobium meliloti by comparative genomics
Bacterial small non-coding RNAs (sRNAs) are being recognized as novel widespread regulators of gene expression in response to environmental signals. Here, we present the first search for sRNA-encoding genes in the nitrogen-fixing endosymbiont Sinorhizobium meliloti, performed by a genome-wide computational analysis of its intergenic regions. Comparative sequence data from eight related alpha-proteobacteria were obtained, and the interspecies pairwise alignments were scored with the programs eQRNA and RNAz as complementary predictive tools to identify conserved and stable secondary structures corresponding to putative non-coding RNAs. Northern experiments confirmed that eight of the predicted loci, selected among the original 32 candidates as most probable sRNA genes, expressed small transcripts. This result supports the combined use of eQRNA and RNAz as a robust strategy to identify novel sRNAs in bacteria. Furthermore, seven of the transcripts accumulated differentially in free-living and symbiotic conditions. Experimental mapping of the 5'-ends of the detected transcripts revealed that their encoding genes are organized in autonomous transcription units with recognizable promoter and, in most cases, termination signatures. These findings suggest novel regulatory functions for sRNAs related to the interactions of alpha-proteobacteria with their eukaryotic hosts.
Funding: Spanish Ministerio de Educación y Ciencia (Project AGL2006-12466/AGR); Junta de Andalucía (Project CV1-01522); NIH Grant 1R01GM070538-02; FPI Fellowship from the Spanish Ministerio de Educación y Cienci