    Software for Sequence Analysis of Variants in Functional Screening Libraries and Personalized Genome Files

    Detailed knowledge of protein function is critical for both the study of protein interactions and the development of drugs that target specific proteins. Currently, there are few techniques that directly examine protein function. The techniques that are available are time-consuming and can only address one variant of a protein at a time. Our laboratory has designed three high-throughput protein function screens, which we hypothesize will address this shortfall. The first screen is the Chimeric Minimotif Decoy (CMD) Assay. For this screen, we constructed red fluorescent proteins with one or more C-terminal minimotifs. Minimotifs are short, contiguous amino acid sequences with a known function. These CMD proteins are over-expressed in a cell, which is then infected with HIV. HIV relies on host proteins to replicate. The minimotifs of the protein may competitively bind the host proteins as a decoy and inhibit HIV infection. HIV infection is indicated by the expression of a green fluorescent protein transcribed from an LTR-GFP HIV infection reporter construct. The second technique we developed is the GigaAssay Driver Mutagenesis Screen. With this screen, the effects of all single mutations on a protein’s ability to function can be comprehensively examined. The screen is constructed by using a CRISPR-Cas9 system to integrate the GigaAssay cassette into a cell. This cassette includes a randomly mutagenized variant of a driver protein and a reporter protein. The sequences of both of these proteins share a DNA barcode sequence that is unique to that cell. In our proof-of-concept experiment, the driver protein is Tat, the transcription factor responsible for the replication of the HIV-1 genome. Here, Tat drives the transcription of the reporter, GFP. The strength of each variant of Tat can be determined by the ratio between the reporter and driver mRNA transcripts. Our third technique is the GigaAssay MD Screen. Here, the GigaAssay cassette has a CMD protein sequence and an LTR-GFP HIV infection reporter sequence, and the CMD sequence and reporter sequence share a unique barcode. The CMD protein’s ability to inhibit HIV infection can be determined by the ratio between the CMD and reporter mRNA transcripts. However, the large amount of nucleic acid sequencing data produced by these screens cannot be interpreted by any currently existing analysis pipelines. The goal of my thesis was to optimize the construction of proteins for the Chimeric Minimotif Decoy Assay and to write and validate a suite of Java software to interpret and visualize the results of all of the screens. The software I wrote correctly interprets the sequence of 100% of the CMD proteins in the CMD Assay. It also correctly interprets 98.6% of synthetic test transcript sequences in the GigaAssay Driver Mutagenesis Screen and 98.9% of synthetic test transcript sequences in the GigaAssay MD Screen. I also wrote and validated a suite of Java software for the generation of personalized dietary recommendations based on meta-analysis of nutrigenetics studies. Nutrigenetics is the study of the effect of genetic variation on the response of phenotypes and phenotypic risk to diet and dietary changes. The software first composes a dataset by comparing our laboratory’s meta-analysis data with the USDA Nutrient Database, matching foods to recommendations, and calculating recommended food portions. The software then builds a MySQL database with this dataset. This database is then used by the software to analyze a user’s personal genome file and match variants in the file to dietary recommendations, which are then exported to a personalized dietary suggestion report. The software correctly builds the MySQL database and matches genome file variants accurately in all test cases.
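
The ratio readout described for the GigaAssay Driver Mutagenesis Screen can be illustrated with a minimal Java sketch. It assumes that reads have already been assigned to barcodes and counted, and that activity is taken as reporter count divided by driver count. The class name `BarcodeRatios`, the method name, and the example barcodes and counts are hypothetical and are not taken from the thesis software.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-barcode reporter/driver transcript ratio. */
class BarcodeRatios {

    /**
     * Computes reporter/driver for every barcode that has at least one driver read.
     * Barcodes without driver reads are skipped because the ratio is undefined.
     */
    static Map<String, Double> reporterToDriverRatio(
            Map<String, Integer> reporterCounts,
            Map<String, Integer> driverCounts) {
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : driverCounts.entrySet()) {
            int driver = entry.getValue();
            if (driver <= 0) {
                continue; // no driver transcripts observed for this barcode
            }
            // A barcode with driver reads but no reporter reads gets a ratio of 0.
            int reporter = reporterCounts.getOrDefault(entry.getKey(), 0);
            ratios.put(entry.getKey(), (double) reporter / driver);
        }
        return ratios;
    }

    public static void main(String[] args) {
        // Made-up counts for three made-up barcodes.
        Map<String, Integer> reporter = Map.of("ACGT", 40, "TTAG", 5);
        Map<String, Integer> driver = Map.of("ACGT", 20, "TTAG", 10, "GGCC", 0);
        System.out.println(reporterToDriverRatio(reporter, driver));
    }
}
```

In this sketch a higher ratio would correspond to a more active driver variant. A real pipeline would also need barcode error correction and read-depth filtering, which are omitted here.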

    Prioritization of Variants for Investigation of Genotype-Directed Nutrition in Human Superpopulations

    Dietary guidelines recommended by key health agencies are generally designed for a global population. However, ethnicity affects human disease and environment-gene interactions, including nutrient intake. Historically, isolated human populations with different genetic backgrounds have adapted to distinct environments with varying food sources. Ethnicity is relevant to the interaction of food intake with genes and disease susceptibility; yet major health agencies generally do not recommend food and nutrients codified by population genotypes and their frequencies. In this paper, we have consolidated published nutrigenetic variants and examined their frequencies in human superpopulations to prioritize these variants for future investigation of population-specific genotype-directed nutrition. The nutrients consumed by individuals interact with their genome and may alter disease risk. Herein, we searched the literature, designed a data model, and manually curated hundreds of papers. The resulting database houses 101 variants that reached significance (p < 0.05), from 35 population studies. Nutrigenetic variants associated with modified nutrient intake have the potential to reduce the risk of colorectal cancer, obesity, metabolic syndrome, type 2 diabetes, and several other diseases. Since many nutrigenetic studies have identified a major variant in some populations, we suggest that superpopulation-specific genotype-directed nutrition modifications be prioritized for future study and evaluation. Genotype-directed nutrition approaches to dietary modification have the potential to reduce disease risk in select human populations.

    The Functional Human C-Terminome

    All translated proteins end with a carboxylic acid commonly called the C-terminus. Many short functional sequences (minimotifs) are located on or immediately proximal to the C-terminus. However, information about the function of protein C-termini has not been consolidated into a single source. Here, we built a new “C-terminome” database and web system focused on human proteins. Approximately 3,600 C-termini in the human proteome have a minimotif with an established molecular function. To help evaluate the function of the remaining C-termini in the human proteome, we inferred minimotifs identified by experimentation in rodent cells, predicted minimotifs based upon consensus sequence matches, and predicted novel highly repetitive sequences in C-termini. Predictions can be ranked by enrichment scores or Genomic Evolutionary Rate Profiling (GERP) scores, a measurement of evolutionary constraint. By searching for new anchored sequences on the last 10 amino acids of proteins in the human proteome with lengths between 3–10 residues and up to 5 degenerate positions in the consensus sequences, we have identified new consensus sequences that predict instances in the majority of human genes. All of this information is consolidated into a database that can be accessed through a C-terminome web system with search and browse functions for minimotifs and human proteins. A known consensus sequence-based predicted function is assigned to nearly half the proteins in the human proteome. Weblink: http://cterminome.bio-toolkit.com.
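
The anchored consensus search described above can be illustrated with a minimal Java sketch. It assumes that "anchored" means the consensus must end exactly at the last residue of the protein, and that a lowercase `x` marks a degenerate (any-residue) position. The class name `CTerminalMotif`, the method names, and the example sequences are hypothetical and are not taken from the C-terminome system.

```java
/** Hypothetical sketch: match a consensus sequence anchored at a protein's C-terminus. */
class CTerminalMotif {

    /** Counts degenerate positions, written here as lowercase 'x'. */
    static int degeneratePositions(String consensus) {
        int count = 0;
        for (int i = 0; i < consensus.length(); i++) {
            if (consensus.charAt(i) == 'x') {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true if the consensus has 3-10 positions, at most 5 of them
     * degenerate, and matches the final residues of the protein sequence.
     */
    static boolean matchesCTerminus(String protein, String consensus) {
        int n = consensus.length();
        if (n < 3 || n > 10 || protein.length() < n) {
            return false;
        }
        if (degeneratePositions(consensus) > 5) {
            return false;
        }
        int offset = protein.length() - n; // align consensus with the last n residues
        for (int i = 0; i < n; i++) {
            char c = consensus.charAt(i);
            if (c != 'x' && c != protein.charAt(offset + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String protein = "MKTAYIAKQRETSV"; // made-up sequence ending in T-S-V
        System.out.println(matchesCTerminus(protein, "TxV")); // true
        System.out.println(matchesCTerminus(protein, "ExV")); // false
    }
}
```

Because the length limit is 10, any match necessarily falls within the last 10 amino acids of the protein. A proteome-wide search would apply this check to every protein and every candidate consensus, then count occurrences per consensus.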

    De novo consensus sequence matches in the human proteome.

    Consensus sequences and instances of length 3–10 residues and up to 5 degenerate positions were generated and used to search the last 10 amino acids of each protein in the human proteome (A). This included alternatively spliced isoforms. The inset shows the total number of sequences searched and number of occurrences identified. For each set of sequences with a given length and degeneracy (e.g. 3–1), the number of sequences searched and occurrences identified are shown in the bar graph.

    Functional landscape of C-terminome.

    (A) A pie chart shows the number of experimentally verified C-terminal minimotifs with different functions. The molecular function for each category is shown in the other panels: bind (B), traffic (C), and modify (D). This graph includes both consensus sequences and instances. (B) A pie chart showing the types of domain targets for binding minimotifs. The “Other domains” category includes Transducin-like enhancer proteins, Tetratricopeptide repeat domain, 14-3-3 domain, G protein coupled receptor, COPI and COPII binding proteins, and ubiquitin-binding proteins. The category “domain not known” indicates that the specific interaction domain of the target protein was not identified. (C) A pie chart showing the different compartments for trafficking motifs. (D) A pie chart showing functional categories for post-translational modifications. The “other PTMs” category (1%) includes methylation, prenylation, glycosylation, crotonylation, amidation, farnesylation, sulphation, de-phosphorylation, O-GlcNAcylation, geranyl-geranylation, glycation, carboxy-methylation, deamination, sumoylation, tri-iodination, malonylation, mevalonation, and palmitoylation.

    C-terminome protein search and browse pages.

    (A) The main C-terminome search page can be used to select, search and browse proteins or minimotifs. Question marks open a popup window with acceptable syntax. When a protein is entered into the textbox and a search is initiated, a results page shows the top hits for the protein search term (bottom panel). Presented information includes the RefSeq protein accession number, gene name, and protein sequence with the highlighted C-terminus. (B) A protein can also be selected from the Browse Proteins page. This list can be searched for protein names alphabetically or browsed for C-termini of different proteins. A key at the top indicates the GERP score for each residue at the C-termini. (C) Both Search and Browse Proteins produce a page with the results shown in C. This includes the RefSeq protein accession number, protein name, protein sequence with the C-terminus highlighted, and alternatively spliced variants with the RefSeq accession number and isoform name (top panel); and a list of consensus sequences present in the protein, including whether each consensus sequence was experimental or predicted, the number of instances, and both proteome-wide and discrete proteome fold enrichment (bottom panel).