    Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics.

    The results of shotgun proteomics mass spectrometry data analysis can be greatly affected by the choice of reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted genes and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month, automatically generates a new set of search databases, and makes them available for download at http://www.peptideatlas.org/thisp/
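
    The two-step strategy of searching a small database and then checking peptide uniqueness against a larger one can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the protein names and sequences are invented, the digestion is a simplified tryptic rule (cleave after K or R, not before P, no missed cleavages), and the function names are hypothetical.

```python
import re
from collections import defaultdict

def tryptic_peptides(sequence, min_len=6):
    """Simplified in-silico digest: cleave after K/R unless followed by P."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}

def peptide_index(database):
    """Map each peptide to the set of protein entries that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for pep in tryptic_peptides(seq):
            index[pep].add(name)
    return index

def is_unique(peptide, index):
    """A peptide is unique if exactly one database entry produces it."""
    return len(index.get(peptide, ())) == 1

# Invented toy databases: the larger tier adds a sequence variant of P1.
small_db = {"P1": "MAAAGGKLLLEEERTTTSSSK"}
large_db = {**small_db, "P1_variant": "MAAAGGKLLLEEERTTTSPSK"}

small_index = peptide_index(small_db)
large_index = peptide_index(large_db)

for pep in sorted(small_index):
    print(pep, is_unique(pep, small_index), is_unique(pep, large_index))
```

    In this toy case a peptide that looks unique in the small database can turn out to be shared once the variant entry in the larger database is considered, which is the situation the uniqueness check is meant to catch.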

    A Tandem Mass Spectrometry Sequence Database Search Method for Identification of O-Fucosylated Proteins by Mass Spectrometry.

    Thrombospondin type 1 repeats (TSRs), small adhesive protein domains with a wide range of functions, are usually modified with O-linked fucose, which may be extended to O-fucose-β1,3-glucose. Collision-induced dissociation (CID) spectra of O-fucosylated peptides cannot be sequenced by standard tandem mass spectrometry (MS/MS) sequence database search engines because O-linked glycans are highly labile in the gas phase and are effectively absent from the CID peptide fragment spectra, resulting in a large mass error. Electron transfer dissociation (ETD) preserves O-linked glycans on peptide fragments, but only a subset of tryptic peptides with low m/z can be reliably sequenced from ETD spectra compared to CID. Accordingly, studies to date that have used MS to identify O-fucosylated TSRs have required manual interpretation of CID mass spectra even when ETD was also employed. In order to facilitate high-throughput, automatic identification of O-fucosylated peptides from CID spectra, we re-engineered the MS/MS sequence database search engine Comet and the MS data analysis suite Trans-Proteomic Pipeline to enable automated sequencing of peptides exhibiting the neutral losses characteristic of labile O-linked glycans. We used our approach to reanalyze published proteomics data from Plasmodium parasites and identified multiple glycoforms of TSR-containing proteins.
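
    The size of the mass error described above follows from the glycan residue masses. The sketch below only illustrates that arithmetic, assuming standard monoisotopic element masses and the residue compositions C6H10O4 (fucose) and C6H10O5 (glucose); the function names are hypothetical and are not part of Comet or the Trans-Proteomic Pipeline.

```python
# Monoisotopic masses of the elements involved (Da).
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}

def residue_mass(composition):
    """Monoisotopic mass of a residue given as {element: count}."""
    return sum(MONO[element] * count for element, count in composition.items())

# Glycan residues as added to a peptide (sugar minus one water).
FUCOSE = residue_mass({"C": 6, "H": 10, "O": 4})
GLUCOSE = residue_mass({"C": 6, "H": 10, "O": 5})
FUCOSE_GLUCOSE = FUCOSE + GLUCOSE

def neutral_loss_mz_shift(glycan_mass, charge):
    """m/z difference between the intact ion and the ion that lost the glycan."""
    return glycan_mass / charge

print(f"O-fucose residue:          {FUCOSE:.4f} Da")
print(f"O-fucose-glucose residue:  {FUCOSE_GLUCOSE:.4f} Da")
print(f"Fucose loss at charge 2+:  {neutral_loss_mz_shift(FUCOSE, 2):.4f} m/z")
```

    If the fragments carry no glycan while the precursor does, the precursor mass exceeds the mass implied by the fragment ladder by one of these residue masses, which is why a search that does not model the loss fails.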

    Gene finding in the chicken genome

    BACKGROUND: Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost-effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method. RESULTS: We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. A Venn diagram was computed and each component of it was evaluated. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end. CONCLUSIONS: De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.
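
    The components of a three-way Venn diagram can be computed with plain set operations. The sketch below assumes each predictor's output has already been reduced to a set of comparable gene identifiers, which is a simplification (real predictions would need an overlap criterion on genomic coordinates); the gene names are invented and the function name is hypothetical.

```python
from itertools import combinations

def venn_regions(sets):
    """Return the exclusive region for every non-empty combination of sets.

    The region for a combination holds the items found in all of its
    members and in none of the remaining sets.
    """
    names = list(sets)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(sets[n] for n in combo))
            outside = set().union(*(sets[n] for n in names if n not in combo))
            regions[combo] = inside - outside
    return regions

# Invented toy predictions for the three gene finders named in the abstract.
predictions = {
    "Ensembl": {"g1", "g2", "g3"},
    "SGP2": {"g2", "g3", "g4"},
    "TWINSCAN": {"g3", "g4", "g5"},
}

for combo, genes in venn_regions(predictions).items():
    print(" & ".join(combo), sorted(genes))
```

    Each of the seven regions can then be sampled separately for experimental testing, so that predictions supported by only one method are evaluated apart from those supported by several.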

    Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies

    Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five “incorrect” targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
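
    The target:decoy estimate referred to above can be illustrated with a short sketch. It uses a common simple estimator (decoys divided by targets above a score threshold, converted to q-values) and invented scores; it is not the statistical model used in the study, and the function names are hypothetical.

```python
def q_values(psms):
    """psms: list of (score, is_decoy), higher score is better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at this rank or at any lower-scoring rank.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]

def accepted_targets(psms, threshold):
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(1 for _, is_decoy, q in q_values(psms)
               if not is_decoy and q <= threshold)

# Invented scores: same four target PSMs, but the larger database yields
# a higher-scoring decoy match.
compact_db = [(10, False), (9, False), (8, False), (7, False), (6, True)]
inflated_db = [(10, False), (9, False), (8.5, True), (8, False), (7, False)]

print(accepted_targets(compact_db, threshold=0.2))
print(accepted_targets(inflated_db, threshold=0.2))
```

    With identical target matches, the higher-scoring decoy in the second list raises the estimated FDR for everything ranked below it, so fewer targets pass the same threshold. This is a toy version of the database-size effect the abstract describes.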

    Data mining for pattern discovery in respiratory diseases in Bogotá, Colombia

    This project applies data mining with the K-means clustering algorithm to build a descriptive model from the data, with the aim of identifying possible patterns in respiratory diseases in the city of Bogotá. The clusters, generated with the RapidMiner tool, are based on data collected over the five-year period from 2012 to 2016, covering the number of cases associated with 184 respiratory disease diagnoses in patients aged 0 to 5 years.
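
    The K-means procedure can be sketched in a few lines. The project itself used RapidMiner; the plain-Python version below is only an illustration of the algorithm, with invented (age, case count) records and a deterministic initialization from the first k points chosen for reproducibility.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: simple fixed init
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                # New centroid is the per-dimension mean of the members.
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels

# Invented records: (patient age in years, number of cases).
records = [(0, 120), (1, 110), (4, 20), (5, 15)]
centroids, labels = kmeans(records, k=2)
print(centroids)
print(labels)
```

    Because the features are not scaled here, the case count dominates the distance; in practice the variables would normally be standardized before clustering.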

    Demonstration of Protein-Based Human Identification Using the Hair Shaft Proteome

    Human identification from biological material is largely dependent on the ability to characterize genetic polymorphisms in DNA. Unfortunately, DNA can degrade in the environment, sometimes below the level at which it can be amplified by PCR. Protein, however, is chemically more robust than DNA and can persist for longer periods. Protein also contains genetic variation in the form of single amino acid polymorphisms. These can be used to infer the status of non-synonymous single nucleotide polymorphism alleles. To demonstrate this, we used mass spectrometry-based shotgun proteomics to characterize hair shaft proteins in 66 European-American subjects. A total of 596 single nucleotide polymorphism alleles were correctly imputed in 32 loci from 22 genes of subjects’ DNA and directly validated using Sanger sequencing. Estimates of the probability of resulting individual non-synonymous single nucleotide polymorphism allelic profiles in the European population, using the product rule, resulted in a maximum power of discrimination of 1 in 12,500. Imputed non-synonymous single nucleotide polymorphism profiles from European-American subjects were considerably less frequent in the African population (maximum likelihood ratio = 11,000). The converse was true for hair shafts collected from an additional 10 subjects with African ancestry, where some profiles were more frequent in the African population. Genetically variant peptides were also identified in hair shaft datasets from six archaeological skeletal remains (up to 260 years old). This study demonstrates that quantifiable measures of identity discrimination and biogeographic background can be obtained from detecting genetically variant peptides in hair shaft protein, including hair from bioarchaeological contexts. The Technology Commercialization Innovation Program (Contracts #121668, #132043) of the Utah Governors Office of Commercial Development, the Scholarship Activitie
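
    The product rule and the likelihood ratio mentioned above reduce to simple arithmetic. The sketch below assumes the loci are independent and uses invented genotype frequencies; it is not the study's data, and the function names are hypothetical.

```python
from math import prod

def profile_probability(genotype_freqs):
    """Product rule: multiply per-locus genotype frequencies (independence assumed)."""
    return prod(genotype_freqs)

def power_of_discrimination(genotype_freqs):
    """Express the profile probability as '1 in N'."""
    return 1 / profile_probability(genotype_freqs)

def likelihood_ratio(freqs_pop_a, freqs_pop_b):
    """How much more probable the profile is in population A than in B."""
    return profile_probability(freqs_pop_a) / profile_probability(freqs_pop_b)

# Invented frequencies of one individual's genotypes at three loci.
freqs_population_a = [0.5, 0.2, 0.1]
freqs_population_b = [0.25, 0.1, 0.05]

print(f"1 in {power_of_discrimination(freqs_population_a):.0f}")
print(f"likelihood ratio: {likelihood_ratio(freqs_population_a, freqs_population_b):.1f}")
```

    Each additional informative locus multiplies the profile probability by a factor below one, which is how a few dozen loci can reach discrimination values in the thousands.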

    Performance-based vs socially supportive culture: a cross-national study of descriptive norms and entrepreneurship

    This paper is a cross-national study testing a framework relating cultural descriptive norms to entrepreneurship in a sample of 40 nations. Based on data from the Global Leadership and Organizational Behavior Effectiveness project, we identify two higher-order dimensions of culture – socially supportive culture (SSC) and performance-based culture (PBC) – and relate them to entrepreneurship rates and associated supply-side and demand-side variables available from the Global Entrepreneurship Monitor. Findings provide strong support for a social capital/SSC and supply-side variable explanation of entrepreneurship rate. PBC predicts demand-side variables, such as opportunity existence and the quality of formal institutions to support entrepreneurship

    The evolution of extreme cooperation via shared dysphoric experiences

    Willingness to lay down one’s life for a group of non-kin, well documented historically and ethnographically, represents an evolutionary puzzle. Building on research in social psychology, we develop a mathematical model showing how conditioning cooperation on previous shared experience can allow individually costly pro-group behavior to evolve. The model generates a series of predictions that we then test empirically in a range of special sample populations (including military veterans, college fraternity/sorority members, football fans, martial arts practitioners, and twins). Our empirical results show that sharing painful experiences produces “identity fusion” – a visceral sense of oneness – which in turn can motivate self-sacrifice, including willingness to fight and die for the group. Practically, our account of how shared dysphoric experiences produce identity fusion helps us better understand such pressing social issues as suicide terrorism, holy wars, sectarian violence, gang-related violence, and other forms of intergroup conflict