Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences
Motivation: In any macromolecular polyprotic system - for example protein, DNA or RNA - the isoelectric point - commonly referred to as the pI - can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge - and thus the electrophoretic mobility - of the ampholyte sums to zero. Many modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel and the proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance depends strongly on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
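To make the idea of an iterative pI method concrete, the following is a minimal sketch rather than any of the benchmarked implementations. It assumes a Henderson-Hasselbalch charge model and finds the pH of zero net charge by bisection. The pKa values, the function names `net_charge` and `isoelectric_point`, and the tolerance are illustrative assumptions; real tools differ mainly in the pKa set they use, which the abstract refers to as the basis set.

```python
# Illustrative pKa values (assumed for this sketch; published sets differ).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERM = 9.0
PKA_C_TERM = 2.3


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch model)."""
    # Terminal groups: the N-terminus is basic, the C-terminus is acidic.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: the pI lies at higher pH
        else:
            high = mid  # neutral or negative: the pI lies at lower pH
    return (low + high) / 2


if __name__ == "__main__":
    for peptide in ("DDDD", "ACDEFGHIK", "KKKK"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

Swapping in a different pKa dictionary changes the predicted pI without touching the search loop, which is one way to see why such methods are sensitive to the chosen pKa set.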
Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides
We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also include exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioportal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified.
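As an illustration of what a three-frame translation involves, here is a minimal, self-contained sketch. It is not the pypgatk implementation: the function names `translate` and `three_frame_translation` are hypothetical, the standard genetic code is assumed, and only the three forward reading frames of a transcript are translated.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(transcript):
    """Return the translations of the three forward reading frames."""
    sequence = transcript.upper().replace("U", "T")  # accept RNA input too
    return [translate(sequence[offset:]) for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frame_translation("ATGGCCTAA")):
        print(frame, protein)
```

Stop codons appear as `*` in the output. A database builder would typically split each frame at stops and filter the fragments by length, steps that are left out of this sketch.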
A proteomics sample metadata representation for multiomics integration and big data analysis
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.
Extensive Identification of Genes Involved in Congenital and Structural Heart Disorders and Cardiomyopathy
Clinical presentation of congenital heart disease is heterogeneous, making identification of the disease-causing genes and their genetic pathways and mechanisms of action challenging. By using in vivo electrocardiography, transthoracic echocardiography and microcomputed tomography imaging to screen 3,894 single-gene-null mouse lines for structural and functional cardiac abnormalities, here we identify 705 lines with cardiac arrhythmia, myocardial hypertrophy and/or ventricular dilation. Among these 705 genes, 486 have not been previously associated with cardiac dysfunction in humans, and some of them represent variants of unknown relevance (VUR). Mice with mutations in Casz1, Dnajc18, Pde4dip, Rnf38 or Tmem161b genes show developmental cardiac structural abnormalities, with their human orthologs being categorized as VUR. Using UK Biobank data, we validate the importance of the DNAJC18 gene for cardiac homeostasis by showing that its loss of function is associated with altered left ventricular systolic function. Our results identify hundreds of previously unappreciated genes with potential function in congenital heart disease and suggest causal function of five VUR in congenital heart disease
Large-scale data-driven analysis to understand the genetics of Congenital Heart Disease
Congenital Heart Disease (CHD) comprises a large group of structural defects, which can occur due to perturbations at some stage in the cardiac embryogenesis process. With a global incidence ranging from 7 to 9 cases per 1000 live births, CHD accounts for a significant fraction of newborn deaths worldwide. Different studies have identified genetics as an essential factor underlying CHD, along with environmental factors. Technological advances in recent years have helped improve CHD diagnosis and the understanding of its genetic causes. Nevertheless, many molecular mechanisms underlying CHD remain uncertain. Herein I present my efforts focused on discovering new genes and biological pathways altered in patients with CHD. The work is based on large CHD patient cohorts, collected and analysed as part of an international collaboration. The integrative data-driven approach adopted in this work can roughly be grouped into two principal aims: i) the development of statistical frameworks and bioinformatics tools to analyse high-dimensional data and ii) the meta-analysis of large-scale exome sequencing data to elucidate variants and genes conferring risk of CHD. By meta-analysing copy number variations and de novo variants in CHD probands, we implicated novel genes reaching genome-wide significant association with CHD and strengthened previously described associations. We also explored the differences between non-syndromic and syndromic CHD by analysing a large-scale exome cohort of patients. In summary, our integrative approach, supported by the data analysis of ~15,000 CHD patients, allowed us to gain new insights into the genetic origin of CHD. Consequently, we present here a valuable resource to continue investigating the causes of CHD and pave the way to promote new studies in this area.
Estimation of the isoelectric point of peptides using molecular descriptors and support vector machines
Fractionation of peptide mixtures using immobilized pH gradient gels is frequently used as the first separation step in proteomics experiments. This technique increases both the dynamic range and the resolution of peptide separation prior to analysis by liquid chromatography-mass spectrometry. The experimental isoelectric point (pI) values obtained, combined with information from the fragmentation spectra, can be used to improve peptide identifications. Accurate estimation of the pI value from the amino acid sequence is therefore a critical point in this type of experiment. At present, the pI is estimated mainly with models based on the charge state of the molecule and/or the Cofactor algorithm. However, none of these methods can accurately calculate the pI of basic peptides. In this work, we present a new approach that can significantly improve pI estimation by using support vector machines (SVM), an experimental amino acid descriptor taken from the AAIndex database, and the isoelectric point predicted by a charge-state-based model. The results obtained on two experimental datasets showed a high correlation (0.96-0.98) between estimated and observed pI values, with a standard deviation of 0.32-0.36 pH units.
bigbio/py-pgatk: v0.0.24
What's Changed
- update by @ypriverol in https://github.com/bigbio/py-pgatk/pull/69
- fixed bug in gtf file name, just keep version not release string by @ypriverol in https://github.com/bigbio/py-pgatk/pull/68
- add spectrumAI by @DongdongdongW in https://github.com/bigbio/py-pgatk/pull/70
- Fix the bug of class-fdr, adjust the blast to multi-process running t… by @DongdongdongW in https://github.com/bigbio/py-pgatk/pull/71
- fixes issue #72 by @husensofteng in https://github.com/bigbio/py-pgatk/pull/73
- update validate by @DongdongdongW in https://github.com/bigbio/py-pgatk/pull/76
- spectrumAI into py-pgatk by @ypriverol in https://github.com/bigbio/py-pgatk/pull/77
New Contributors
- @DongdongdongW made their first contribution in https://github.com/bigbio/py-pgatk/pull/70
Full Changelog: https://github.com/bigbio/py-pgatk/compare/v0.0.23...v0.0.24
Integrative analysis of genomic variants reveals new associations of candidate haploinsufficient genes with congenital heart disease
Numerous genetic studies have established a role for rare genomic variants in Congenital Heart Disease (CHD) at the copy number variation (CNV) and de novo variant (DNV) level. To identify novel haploinsufficient CHD disease genes, we performed an integrative analysis of CNVs and DNVs identified in probands with CHD, including cases with sporadic thoracic aortic aneurysm. We assembled CNV data from 7,958 cases and 14,082 controls and performed a gene-wise analysis of the burden of rare genomic deletions in cases versus controls. In addition, we performed variation rate testing for DNVs identified in 2,489 parent-offspring trios. Our analysis revealed 21 genes which were significantly affected by rare CNVs and/or DNVs in probands. Fourteen of these genes have previously been associated with CHD, while the remaining genes (FEZ1, MYO16, ARID1B, NALCN, WAC, KDM5B and WHSC1) have only been associated in small case series or show new associations with CHD. In addition, a systems-level analysis revealed affected protein-protein interaction networks involved in the Notch signaling pathway, heart morphogenesis, DNA repair and cilia/centrosome function. Taken together, this approach highlights the importance of re-analyzing existing datasets to strengthen disease associations and identify novel disease genes and pathways.
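The abstract does not say which statistical test underlies the gene-wise deletion burden analysis. As one plausible illustration only, the sketch below applies a one-sided Fisher's exact test to a single gene, asking whether deletions overlapping it are enriched in cases. The choice of test, the function names, and the per-gene deletion counts are assumptions for illustration; only the cohort sizes (7,958 cases and 14,082 controls) come from the text.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_one_sided(case_hits, n_cases, control_hits, n_controls):
    """P(at least `case_hits` carriers among cases) under the hypergeometric null."""
    total = n_cases + n_controls
    hits = case_hits + control_hits
    log_denominator = log_comb(total, n_cases)
    p_value = 0.0
    # Sum the probabilities of all tables at least as extreme as the observed one.
    for k in range(case_hits, min(hits, n_cases) + 1):
        log_p = (log_comb(hits, k)
                 + log_comb(total - hits, n_cases - k)
                 - log_denominator)
        p_value += math.exp(log_p)
    return min(p_value, 1.0)


if __name__ == "__main__":
    # Hypothetical gene: deletions seen in 8 cases and 1 control.
    print(fisher_one_sided(8, 7958, 1, 14082))
```

In a genome-wide setting the resulting p-values would still need correction for the number of genes tested, which this sketch does not cover.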
