35 research outputs found
Proteogenómica y splicing alternativo [Proteogenomics and alternative splicing]
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular. Date of defense: 8 February 2016.
The manual annotation of protein-coding genes requires diverse sources of evidence. Obtaining
experimental evidence of protein expression remains a difficult technical challenge. Most
methods rely on computational predictions and on experimental evidence at the transcript
level. Mass spectrometry technology has advanced considerably over the last two decades,
which has made it a leading tool for genome annotation projects. Mass spectrometry allows
coding genes and alternative transcripts to be refined and validated, and new coding regions
to be detected. Proteogenomics, a discipline that sits between genomics and proteomics,
requires the development of computational methods and strategies for large-scale data
analysis.
The main objective of this thesis is to develop computational methods for processing and
analyzing proteomic and genomic data. To this end, several strategies for the analysis of
large-scale proteomic data have been designed.
In the first part, the workflows designed to search, validate and curate proteomic results
are applied to a variety of genomic data sources. The characterization of alternative
isoforms and splicing events in human and mouse shows three overrepresented groups:
nuclear ribonucleoproteins, alternative isoforms generated from homologous exons, and
isoforms created by small deletions. The study is then extended using a larger experimental
proteomic data set, which corroborates that most genes express one dominant protein.
Splicing events detected at the protein level are shown to preserve functional domains.
Finally, the analysis confirms that more than 20% of splice isoforms are generated from
homologous exons, that these isoforms are tissue specific, and that they are remarkably
conserved, which points to their possible relevance at the cellular level.
In the last part, peptides from eight large-scale proteomic experiments are used to
characterize the most highly expressed isoform of each gene. The most highly expressed
proteomic isoform coincides with the isoform selected by the two orthogonal methods
analyzed, one based on the conservation of function and structure, and the other based on
genomic annotations curated by experts. The results show a tendency towards the expression
of a single isoform, regardless of tissue, and confirm the suitability of APPRIS for
predicting principal isoforms.
The manual annotation of protein-coding genes is based on many diverse sources of
evidence. Most support comes from computational predictions, genomic evidence and
experimental expression at transcript level. Finding experimental evidence for the expression
of proteins remains a difficult technical challenge, but mass spectrometry technology has
advanced considerably in the past two decades, becoming an important tool for genomic
annotation projects. Mass spectrometry also enables the refinement and validation of coding genes
and alternative transcripts, and the detection of novel coding regions.
Proteogenomics, a discipline that unites genomics and proteomics, requires the development
of computational methods and strategies for data analysis on a large scale. The main objective
of this thesis was to develop computational methods for processing and analyzing genomic
and proteomic data. Several strategies to analyze large-scale proteomic data have been
designed to achieve this goal.
In the first part, workflows designed to search, validate and curate results from a variety of
sources of proteomic data were applied as part of a pilot study. The characterization of
alternative splice isoforms in human and mouse experiments highlighted three overrepresented
groups; specifically, ribonucleoproteins, alternative isoforms generated from
homologous exons and those generated from small indels. The pilot study was later extended
using a larger experimental proteomic data set. This second analysis confirmed that most
genes express a dominant protein and demonstrated that splicing events detected at the
protein level rarely break conserved functional domains. The large-scale study confirmed
that more than 20% of splice isoforms are generated from homologous exons.
Many of these alternative homologous exons are tissue specific and all are remarkably
conserved, highlighting their relevance at the cellular level.
Finally, peptides from eight large-scale proteomic experiments are used to characterize a main
experimental isoform. This main proteomics isoform matches those selected by two
orthogonal methods, one predicted from conservation and protein functional and structural
features, and the other annotated by manual annotators based on genomic evidence. The
results show clearly that almost all genes have a principal protein isoform regardless of tissue,
and confirm the suitability of APPRIS for predicting principal isoforms.
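The idea of choosing a main experimental isoform from peptide evidence can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis workflow: the isoform names, sequences and peptides are invented, peptides are matched by plain substring search, and the rule used here (the isoform supported by the most discriminating peptides wins, ties give no call) is a simplification chosen for the example.

```python
from collections import Counter


def main_isoform(isoforms, peptides):
    """Return the isoform with the most discriminating peptides.

    isoforms: dict mapping a (hypothetical) isoform name to its protein sequence.
    peptides: iterable of identified peptide sequences for the gene.
    Returns None when there is no discriminating evidence or when the top
    two isoforms are tied.
    """
    counts = Counter()
    for peptide in peptides:
        hits = [name for name, seq in isoforms.items() if peptide in seq]
        # A peptide found in every isoform (or in none) cannot tell them apart.
        if 0 < len(hits) < len(isoforms):
            counts.update(hits)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Invented toy gene with two isoforms; the second lacks the segment "SHFSRQ".
isoforms = {
    "GENE1-201": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "GENE1-202": "MKTAYIAKQRQISFVKLEERLGLIEVQ",
}
peptides = ["TAYIAK", "SHFSR", "QISFVK", "LGLIEVQ"]

print(main_isoform(isoforms, peptides))
```

Only the peptide "SHFSR" discriminates between the two toy isoforms, so the call rests on a single peptide; a real analysis would need stricter evidence thresholds and proper peptide-spectrum match validation.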
From identification to validation to gene count
The current GENCODE gene count of ~30,000, including 21,727 protein-coding and 8,483 RNA genes, is significantly lower than the 100,000 genes anticipated by early estimates. Accurate annotation of protein-coding and non-coding genes and pseudogenes is essential in calculating the true gene count and gaining insight into human evolution.
As part of the GENCODE Consortium, the HAVANA team produces high-quality manual gene annotation, which forms the basis for the reference gene set being used by the ENCODE project and provides a rich annotation of alternative splice variants and assignment of functional potential. However, the protein-coding potential of some splice variants is uncertain, and valid splice variants can remain unannotated if they are absent from current cDNA libraries. Recent technological developments in sequencing and mass spectrometry have created a vast amount of new transcript and protein data that facilitate the identification and validation of new and existing transcripts, while harboring their own limitations and problems.
Comprehensive Quantification of the Modified Proteome Reveals Oxidative Heart Damage in Mitochondrial Heteroplasmy
Post-translational modifications hugely increase the functional diversity of proteomes. Recent algorithms based on ultratolerant database searching are forging a path to unbiased analysis of peptide modifications by shotgun mass spectrometry. However, these approaches identify only one-half of the modified forms potentially detectable and do not map the modified residue. Moreover, tools for the quantitative analysis of peptide modifications are currently lacking. Here, we present a suite of algorithms that allows comprehensive identification of detectable modifications, pinpoints the modified residues, and enables their quantitative analysis through an integrated statistical model. These developments were used to characterize the impact of mitochondrial heteroplasmy on the proteome and on the modified peptidome in several tissues from 12-week-old mice. Our results reveal that heteroplasmy mainly affects cardiac tissue, inducing oxidative damage to proteins of the oxidative phosphorylation system, and provide a molecular mechanism explaining the structural and functional alterations produced in heart mitochondria.
We thank Simon Bartlett (CNIC) for English editing. This study was supported by competitive grants from the Spanish Ministry of Economy and Competitiveness (MINECO) (BIO2015-67580-P) through the Carlos III Institute of Health-Fondo de Investigacion Sanitaria (PRB2, IPT13/0001-ISCIII-SGEFI/FEDER; ProteoRed), by Fundacion La Marato TV3, and by FP7-PEOPLE-2013-ITN "Next-Generation Training in Cardiovascular Research and Innovation-Cardionext." N.B. is an FP7-PEOPLE-2013-ITN-Cardionext Fellow. The CNIC is supported by the MINECO and the Pro-CNIC Foundation, and is a Severo Ochoa Center of Excellence (MINECO Award SEV-2015-0505).
Quantitative HDL Proteomics Identifies Peroxiredoxin-6 as a Biomarker of Human Abdominal Aortic Aneurysm
High-density lipoproteins (HDLs) are complex protein and lipid assemblies whose composition is known to change in diverse pathological situations. Analysis of the HDL proteome can thus provide insight into the main mechanisms underlying abdominal aortic aneurysm (AAA) and potentially detect novel systemic biomarkers. We performed a multiplexed quantitative proteomics analysis of HDLs isolated from plasma of AAA patients (N = 14) and control study participants (N = 7). Validation was performed by western blot (HDL), immunohistochemistry (tissue), and ELISA (plasma). HDL from AAA patients showed elevated expression of peroxiredoxin-6 (PRDX6), HLA class I histocompatibility antigen (HLA-I), retinol-binding protein 4, and paraoxonase/arylesterase 1 (PON1), whereas alpha-2 macroglobulin and C4b-binding protein were decreased. The main pathways associated with HDL alterations in AAA were oxidative stress and immune-inflammatory responses. In AAA tissue, PRDX6 colocalized with neutrophils, vascular smooth muscle cells, and lipid oxidation. Moreover, plasma PRDX6 was higher in AAA (N = 47) than in controls (N = 27), reflecting increased systemic oxidative stress. Finally, a positive correlation was recorded between PRDX6 and AAA diameter. The analysis of the HDL proteome demonstrates that redox imbalance is a major mechanism in AAA, identifying the antioxidant PRDX6 as a novel systemic biomarker of AAA.
We thank Simon Bartlett for language and scientific editing. This study was supported by the Spanish Ministry of Economy and Competitiveness (MINECO) (SAF2016-80843-R, BIO2012-37926 and BIO2015-67580-P), Fondo de Investigaciones Sanitarias ISCiii-FEDER (PRB2) (IPT13/0001, ProteoRed, Redes RIC RD12/0042/00038 and RD12/0042/0056, Biobancos RD09/0076/00101 and CA12/00371), Centro de Investigacion Biomedica en Red de Diabetes y Enfermedades Metabolicas Asociadas (CIBERDEM), and FRIAT. The CNIC is supported by the Spanish Ministry of Economy and Competitiveness (MINECO) and the Pro-CNIC Foundation, and is a Severo Ochoa Center of Excellence (MINECO award SEV-2015-0505).
SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
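The notion of a novel transcript being a "novel combination of existing splice sites" can be illustrated with a small Python sketch that compares a transcript's chain of splice junctions with a reference annotation. This is not SQANTI's code: the function name, the coordinates and the decision rules are assumptions made for the example, the category labels only loosely follow SQANTI's structural categories, and strand, chromosome and transcript ends are ignored.

```python
def classify_transcript(query, references):
    """Classify a transcript by its splice junctions.

    query: sequence of (donor, acceptor) coordinate pairs, in transcript order.
    references: list of annotated transcripts in the same representation.
    """
    q = tuple(query)
    refs = [tuple(r) for r in references]
    if not q:
        return "mono_exon"  # no junctions to compare; outside this sketch
    if q in refs:
        return "full_splice_match"
    n = len(q)
    for r in refs:
        # The query chain appears as a contiguous part of an annotated chain.
        if any(r[k:k + n] == q for k in range(len(r) - n + 1)):
            return "incomplete_splice_match"
    donors = {d for r in refs for d, _ in r}
    acceptors = {a for r in refs for _, a in r}
    # Every splice site is annotated, but the combination is new.
    if all(d in donors and a in acceptors for d, a in q):
        return "novel_in_catalog"
    # At least one donor or acceptor is absent from the annotation.
    return "novel_not_in_catalog"


# Invented reference annotation with two transcripts of one gene.
reference = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (300, 600)],
]

print(classify_transcript([(100, 400), (500, 600)], reference))
```

In the printed case the junction (100, 400) joins an annotated donor to an annotated acceptor that are never paired in the reference, which is the kind of transcript the abstract describes as a novel combination of existing splice sites.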
Inference of Functional Relations in Predicted Protein Networks with a Machine Learning Approach
Background: Molecular biology is currently facing the challenging task of functionally characterizing the proteome. The large number of possible protein-protein interactions and complexes, the variety of environmental conditions and cellular states in which these interactions can be reorganized, and the multiple ways in which a protein can influence the function of others require the development of experimental and computational approaches to analyze and predict functional associations between proteins as part of their activity in the interactome. Methodology/Principal Findings: We have studied the possibility of constructing a classifier to combine the output of several protein interaction prediction methods. The AODE (Averaged One-Dependence Estimators) machine learning algorithm is a suitable choice in this case: it provides better results than the individual prediction methods and performs better than the other alternative methods tested in this experimental setup. To illustrate the potential use of this new AODE-based Predictor of Protein InterActions (APPIA) when analyzing high-throughput experimental data, we show how it helps to filter the results of published high-throughput proteomic studies, ranking functionally related pairs in a significant way. Availability: All the predictions of the individual methods and of the combined APPIA predictor, together with the datasets of functional associations used, are available at http://ecid.bioinfo.cnio.es/. Conclusions: We propose a strategy that integrates the main current computational techniques used to predict functional associations into a unified classifier system, specifically focusing on the evaluation of poorly characterized protein pairs. We selected the AODE classifier as the appropriate tool to perform this task. AODE is particularly useful to extract valuable information from large unbalanced and heterogeneous data sets.
The combination of the information provided by five interaction prediction methods with some simple sequence features in APPIA is useful in establishing reliability values and helpful to prioritize functional interactions that can be further experimentally characterized.
This work was funded by the BioSapiens (grant number LSHG-CT-2003-503265) and the Experimental Network for Functional Integration (ENFIN) Networks of Excellence (contract number LSHG-CT-2005-518254), by Consolider BSC (grant number CSD2007-00050) and by the project “Functions for gene sets” from the Spanish Ministry of Education and Science (BIO2007-66855). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
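For readers unfamiliar with AODE, the following Python sketch shows the general technique on categorical features: each feature in turn acts as a "super-parent" on which all other features depend, and the resulting one-dependence estimates are averaged. It is a minimal illustration and not the APPIA implementation: the class name `SimpleAODE`, the `min_count` parameter, the Laplace smoothing, the fallback to class priors and the toy data (three binary "method predicts an interaction" features) are all assumptions made for the example.

```python
from collections import Counter


class SimpleAODE:
    """Averaged One-Dependence Estimators for categorical features (sketch)."""

    def __init__(self, min_count=1):
        # A feature value is used as super-parent only if seen this often.
        self.min_count = min_count

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n = len(X)
        n_features = len(X[0])
        # Number of distinct values per feature, used for Laplace smoothing.
        self.n_values = [len({row[i] for row in X}) for i in range(n_features)]
        self.class_count = Counter(y)
        self.value_count = Counter()   # (i, x_i)
        self.pair_count = Counter()    # (class, i, x_i)
        self.triple_count = Counter()  # (class, i, x_i, j, x_j)
        for row, c in zip(X, y):
            for i, xi in enumerate(row):
                self.value_count[(i, xi)] += 1
                self.pair_count[(c, i, xi)] += 1
                for j, xj in enumerate(row):
                    self.triple_count[(c, i, xi, j, xj)] += 1
        return self

    def predict_proba(self, row):
        k = len(self.classes)
        scores = {}
        for c in self.classes:
            total, used = 0.0, 0
            for i, xi in enumerate(row):
                if self.value_count[(i, xi)] < self.min_count:
                    continue
                used += 1
                parent = self.pair_count[(c, i, xi)]
                # Smoothed estimate of P(class, x_i).
                p = (parent + 1) / (self.n + k * self.n_values[i])
                for j, xj in enumerate(row):
                    if j != i:
                        # Smoothed estimate of P(x_j | class, x_i).
                        joint = self.triple_count[(c, i, xi, j, xj)]
                        p *= (joint + 1) / (parent + self.n_values[j])
                total += p
            if used:
                scores[c] = total / used
            else:
                # No usable super-parent: fall back to smoothed class priors.
                scores[c] = (self.class_count[c] + 1) / (self.n + k)
        norm = sum(scores.values())
        return {c: s / norm for c, s in scores.items()}


# Toy data: each row holds the binary outputs of three hypothetical
# prediction methods; label 1 means the protein pair is functionally related.
X = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
y = [1, 1, 1, 0, 0, 0]

model = SimpleAODE().fit(X, y)
proba = model.predict_proba((1, 1, 1))
print(proba)
```

Averaging over super-parents relaxes the independence assumption of naive Bayes without any structure search, which is one reason the approach scales to large heterogeneous data sets of the kind the abstract mentions.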
Correction to “Analyzing the First Drafts of the Human Proteome”