25 research outputs found

    Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

    We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools generate protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also support exonic out-of-frame translation of otherwise canonical protein-coding mRNAs. Moreover, they enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioportal, gnomAD and mutations detected by sequencing patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation with the DecoyPyrat algorithm. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified.
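The three-frame translation mentioned in the abstract above can be illustrated with a minimal sketch. This is not pypgatk's API: the function names `translate` and `three_frame_translation` are hypothetical, and the sketch assumes the standard genetic code and forward-strand translation only.

```python
# Minimal sketch of three-frame translation (forward strand only).
# Assumption: standard genetic code; names here are illustrative, not pypgatk's.

BASES = "TCAG"
# Amino acids in the canonical TCAG codon ordering; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a nucleotide sequence from its first base.

    Incomplete trailing codons are ignored; unknown codons become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


frames = three_frame_translation("ATGGCCTAA")
print(frames)
```

In a database-building setting, each frame would typically be split at stop codons and the resulting segments filtered by length before being written out as candidate protein sequences; that step is omitted here.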

    A proteomics sample metadata representation for multiomics integration and big data analysis

    The amount of public proteomics data is rapidly increasing, but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.

    Large-scale data-driven analysis to understand the genetics of Congenital Heart Disease

    Congenital Heart Disease (CHD) delineates a large group of structural defects that can arise from perturbations at some stage of cardiac embryogenesis. With a global incidence ranging from 7 to 9 cases per 1000 live births, CHD accounts for a significant fraction of newborn deaths worldwide. Different studies have identified genetics as an essential factor underlying CHD, along with environmental factors. Technological advances in recent years have helped improve CHD diagnosis and the understanding of its genetic causes. Nevertheless, many molecular mechanisms underlying CHD remain uncertain. Herein I present my efforts focused on discovering new genes and biological pathways altered in patients with CHD. The work is based on large CHD patient cohorts, collected and analysed as part of an international collaboration. The integrative data-driven approach adopted in this work can roughly be grouped into two principal aims: i) the development of statistical frameworks and bioinformatics tools to analyse high-dimensional data and ii) the meta-analysis of large-scale exome sequencing data to elucidate variants and genes conferring risk of CHD. By meta-analysing copy number variations and de novo variants in CHD probands, we implicated novel genes reaching genome-wide significant association with CHD and strengthened previously described associations. We also explored the differences between non-syndromic and syndromic CHD by analysing a large-scale exome cohort of patients. In summary, our integrative approach, supported by the data analysis of ~15,000 CHD patients, allowed us to gain new insights into the genetic origin of CHD. Consequently, we present here a valuable resource to continue investigating the causes of CHD and pave the way for new studies in this area.

    Estimación del punto isoeléctrico de péptidos empleando descriptores moleculares y máquinas de soporte vectorial

    Fractionation of peptide mixtures using immobilized pH gradient gels is frequently used as the first separation step in proteomics experiments. This technique increases both the dynamic range and the resolution of peptide separation prior to analysis by liquid chromatography-mass spectrometry. The experimental isoelectric point (pI) values obtained, in combination with information from fragmentation spectra, can be used to improve peptide identifications. Accurate estimation of the pI value from the amino acid sequence is therefore critical in this type of experiment. At present, pI is estimated mainly with models based on the charge state of the molecule and/or the Cofactor algorithm. However, none of these methods can accurately calculate the pI of basic peptides. In this work, we present a new approach that can significantly improve pI estimation by using support vector machines (SVM), an experimental amino acid descriptor taken from the AAIndex database, and the isoelectric point predicted by a charge-state-based model. Results obtained on two experimental datasets showed a high correlation (0.96-0.98) between estimated and observed pI values, with a standard deviation of 0.32-0.36 pH units.
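As a rough illustration of the charge-state baseline that the abstract above builds on, the sketch below finds the pH at which a peptide's net charge is zero using Henderson-Hasselbalch terms and bisection. It does not reproduce the SVM model or the AAIndex descriptor. The pKa values are illustrative assumptions (published pKa scales differ), and the function names are hypothetical.

```python
# Minimal sketch of a charge-state pI estimate (not the SVM approach itself).
# Assumption: the pKa values below are illustrative; published scales differ.

PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}             # basic side chains
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}    # acidic side chains
PKA_N_TERM = 9.0
PKA_C_TERM = 2.0


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in sequence.upper():
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(sequence, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


for peptide in ("DDDDK", "ACDEFGHIK", "KKKRD"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

In the approach the abstract describes, a value like this would serve as one input feature to the SVM alongside the amino acid descriptor, rather than as the final prediction.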

    Accurate and fast feature selection workflow for high-dimensional omics data - Fig 2

    (A) Correlation matrix for the 544 physicochemical features of the 7,391 peptides (samples) included in Dataset 2; (B) the final 20 variables after the correlation-matrix filtering steps.
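The correlation-matrix filtering step named in the caption above can be sketched as a greedy filter that drops any feature too strongly correlated with one already kept. The caption does not specify the threshold or the selection order, so the 0.9 cutoff, the greedy first-come ordering, and the function names are all assumptions for illustration.

```python
# Minimal sketch of correlation-matrix filtering for feature selection.
# Assumptions: Pearson correlation, a 0.9 cutoff, and greedy first-come order.
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_filter(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature <= threshold.

    `features` maps a feature name to its list of per-sample values.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


example = {
    "mass": [1.0, 2.0, 3.0, 4.0],
    "mass_x2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "mass"
    "charge": [1.0, -1.0, 1.0, -1.0],
}
print(correlation_filter(example))
```

The surviving features would then be passed to a wrapper method such as recursive feature elimination, which is not shown here.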

    Accuracy vs. feature selection combination for expression datasets (1, 3, 4, 5, 6 and 7).

    (RF) Random forest without a previous feature selection step; (X2-CM-RFE-RF) random forest classification after a feature selection step using a univariate correlation filter with correlation matrix and recursive feature elimination; (X2-PCA-RFE-RF) random forest classification after a feature selection step using a univariate correlation filter with principal component analysis and recursive feature elimination. All methods include an internal 10-fold cross-validation step. All accuracy metrics were estimated following the approach previously reported by Pochet et al. [31], where 20-fold randomized test data were used to summarize the accuracy of the FS combination.

    Proposed workflow for FS including a filtering step with univariate and/or multivariate approaches, followed by a wrapper approach (recursive feature elimination).


    Le Courrier

    3 April 1825 (1825/04/03) (A0, N93)

    Error plot of predicted isoelectric point vs the experimental isoelectric point (Dataset 2): (SVM) applying FS or cross-correlation step; (X2-CM-SVM) adding correlation filters as the only steps for feature selection; (RFE-SVM-CV3) recursive feature elimination, three iterations of cross-validation combined with SVM; (X2-CM-RFE-SVM-CV3) considering the full FS workflow.
