2 research outputs found

Next Generation Proteomic Pipeline for Chromosome-Based Proteomic Research Using NeXtProt and GENCODE Databases

Author: Gun Wook Park (1471024)
Heeyoun Hwang (1471042)
Hyoung-Joo Lee (157437)
Hyun Kyoung Lee (1471027)
Ji Eun Jeong (504013)
Ji Yeong Park (3090333)
Jin Young Kim (271185)
John R. Yates (1234548)
Jong Shin Yoo (173825)
Ju Yeon Lee (1471039)
Kyung-Hoon Kwon (1471033)
Sung-Kyu Robin Park (1471045)
Young Mok Park (162400)
Young-Ki Paik (147669)

The Human Proteome Project aims to map all human proteins, including missing proteins as well as proteoforms with post-translational modifications, alternative splicing variants (ASVs), and single amino acid variants (SAAVs). The neXtProt and Ensembl databases are usually used to provide curated information on human coding genes. To find these proteoforms, we (the Chr. 11 team) introduce a streamlined pipeline that uses customized and concatenated neXtProt and GENCODE (which originates from Ensembl) databases with a controlled false discovery rate (FDR). Because of the large databases used in this pipeline, we found that more stringent FDR filtering (0.1% at the peptide level and 1% at the protein level) was needed to claim novel findings, such as GENCODE ASVs and missing proteins, from the human hippocampus data set (MSV000081385; ProteomeXchange PXD007166). Using our next generation proteomic pipeline (nextPP) with the neXtProt and GENCODE databases, two missing proteins, activity-regulated cytoskeleton-associated protein (ARC, Chr. 8) and glutamate receptor ionotropic, kainate 5 (GRIK5, Chr. 19), were additionally identified with two or more unique peptides from human brain tissues. Additionally, by applying the pipeline to human brain related data sets such as cortex (PXD000067 and PXD000561), spinal cord, and fetal brain (PXD000561), seven GENCODE ASVs, ACTN4-012 (Chr. 19), DPYSL2-005 (Chr. 8), MPRIP-003 (Chr. 17), NCAM1-013 (Chr. 11), EPB41L1-017 (Chr. 20), AGAP1-004 (Chr. 2), and CPNE5-005 (Chr. 6), were identified from two or more data sets. The identified peptides of the GENCODE ASVs were mapped onto novel exon insertions, alternative translation in the 5′-untranslated region, or novel protein-coding sequences. Applying the pipeline to male reproductive organ related data sets, 52 GENCODE ASVs were identified from two testis data sets (PXD000561 and PXD002179) and one spermatozoa data set (PXD003947). Four of the 52 GENCODE ASVs, RAB11FIP5-008 (Chr. 2), RP13-347D8.7-001 (Chr. X), PRDX4-002 (Chr. X), and RP11-666A8.13-001 (Chr. 17), were identified in all three samples.
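
The two FDR thresholds quoted in the abstract (0.1% at the peptide level, 1% at the protein level) can be illustrated with a minimal sketch. The abstract does not say how the FDR is estimated, so the target-decoy estimate used here (decoys divided by targets above a score cutoff) is an assumption, and the function name, tuple layout, and constants are hypothetical rather than part of nextPP.

```python
from typing import List, Tuple

# Thresholds quoted in the abstract.
PEPTIDE_FDR = 0.001  # 0.1% at the peptide level
PROTEIN_FDR = 0.01   # 1% at the protein level

# Hypothetical record layout: (identifier, score, is_decoy); higher score is better.
Hit = Tuple[str, float, bool]


def filter_by_fdr(hits: List[Hit], max_fdr: float) -> List[str]:
    """Return target identifiers passing an assumed target-decoy FDR cutoff.

    Hits are ranked by score, and the longest ranked prefix whose
    estimated FDR (decoys / targets) stays at or below max_fdr is kept.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    targets = decoys = 0
    cutoff = 0
    for position, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = position
    return [name for name, _, is_decoy in ranked[:cutoff] if not is_decoy]
```

In such a sketch the same function would be called once on peptide-level hits with PEPTIDE_FDR and once on protein-level hits with PROTEIN_FDR; the stricter peptide threshold reflects the larger search space of the concatenated databases.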

FigShare

Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate

Author: Eun Sun Ji (1471036)
Gun Wook Park (1471024)
Heeyoun Hwang (1471042)
Hyoung-Joo Lee (157437)
Hyun Kyoung Lee (1471027)
Ji Yeong Park (3090333)
Jin Young Kim (271185)
John R. Yates (1234548)
Jong Shin Yoo (173825)
Ju Yeon Lee (1471039)
Kwang Hoe Kim (1471030)
Kyung-Hoon Kwon (1471033)
Sung-Kyu Robin Park (1471045)
Young Mok Park (162400)
Young-Ki Paik (147669)

In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies that use large-scale proteomic profiling based on liquid chromatography and mass spectrometry. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of the following four steps. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into normalized E-scores using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. Finally, we compared the performance of the IPP with that of a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP), including 477 alternative splicing variants (vs 182 using the CPP), were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral patterns from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).
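
The "two or more unique peptides" criterion mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: it is not ProteinInferencer, the function name and input layout are hypothetical, and "unique" is taken here to mean a peptide that maps to exactly one protein.

```python
from collections import defaultdict
from typing import Dict, Set


def proteins_with_min_unique_peptides(
    peptide_to_proteins: Dict[str, Set[str]], min_unique: int = 2
) -> Dict[str, Set[str]]:
    """Keep proteins supported by at least min_unique unique peptides.

    Assumption: a peptide is "unique" when it maps to exactly one protein;
    peptides shared between proteins are ignored for this count.
    """
    unique_support: Dict[str, Set[str]] = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            (protein,) = proteins
            unique_support[protein].add(peptide)
    return {
        protein: peptides
        for protein, peptides in unique_support.items()
        if len(peptides) >= min_unique
    }
```

With a hypothetical mapping in which P1 has two unique peptides and P2 has one unique plus one shared peptide, only P1 would be retained.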

FigShare