259 research outputs found

Protein Threading for Genome-Scale Structural Analysis

Author: Ellrott Kyle P
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2007

Protein structure prediction is a necessary tool in bioinformatic analysis. It is a non-trivial process that can add a great deal of information to a genome annotation. This dissertation deals with protein structure prediction through protein fold recognition and outlines several strategies for improving protein threading techniques. To improve protein threading performance, the dissertation begins with an outline of sequence/structure alignment energy functions. A technique called Violated Inequality Minimization is used to adapt quickly to the changing energy landscape as new energy functions are added. To further improve alignment accuracy and fold recognition, new formulations of energy functions are used to create the sequence/structure alignment. These energies include a gap penalty that depends on sequence characteristics, unlike the traditional constant penalty. Another proposed energy depends on conserved structural patterns found during threading. These structural patterns have been employed to refine the sequence/structure alignment in my research. The section on the Linear Programming Algorithm for protein structure alignment deals with optimizing an alignment using additional residue-pair energy functions. In the original version of the model, all cores had to be aligned to the target sequence. Our research outlines an expansion of the original threading model that permits a more flexible alignment by allowing core deletions. Aside from improvements in fold recognition and alignment accuracy, these techniques must also scale to the computational demands of genome-level structure prediction. A heuristic decision-making process has been designed to automate the classification and preparation of proteins for prediction.
A graph analysis has been applied to the integration of the different tools involved in the pipeline. Analysis of the data dependency graph allows for automatic parallelization of genome structure prediction. Together, these contributions improve the overall performance of protein threading and distribute computations across a large set of computers, making genome-scale protein structure prediction practically feasible.
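
As a rough illustration of how a data dependency graph can expose parallelism, the sketch below groups pipeline steps into batches whose members have no unmet dependencies. It is not the dissertation's implementation: the step names and the `parallel_batches` helper are hypothetical, and the only library used is Python's standard `graphlib` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; each key maps to the steps it depends on.
dependencies = {
    "classify_protein": set(),
    "build_profile": {"classify_protein"},
    "predict_secondary_structure": {"classify_protein"},
    "thread_against_templates": {"build_profile", "predict_secondary_structure"},
    "rank_folds": {"thread_against_templates"},
}


def parallel_batches(graph):
    """Group steps into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that could run concurrently
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Every step within one batch could be dispatched to a different machine, since none of them depends on another member of the same batch.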

University of Tennessee, Knoxville: Trace

The GA4GH Task Execution API: Enabling Easy Multi Cloud Task Execution

Author: Beckman Liam
Ellrott Kyle P.
Kanitz Alexander
Malladi Venkat S.
McLoughlin Matthew H.
Publication date: 08/02/2024

The Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. It provides a common way to submit and manage tasks to a variety of compute environments, including on-premises High Performance Computing and High Throughput Computing (HPC/HTC) systems, cloud computing platforms, and hybrid environments. The TES API is designed to be flexible and extensible, allowing it to be adapted to a wide range of use cases, such as "bringing compute to the data" solutions for federated and distributed data analysis or load balancing across multi-cloud infrastructures. This API has been adopted by a number of different service providers and utilized by several workflow engines. Using its capabilities, genome research institutes are building hybrid compute systems to study life science.
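
To make the idea of a standardized task description concrete, the sketch below builds a JSON payload shaped like a TES task. The field names (`inputs`, `outputs`, `resources`, `executors`, and their members) reflect my reading of the TES schema and should be checked against the published specification; the `build_tes_task` helper, the container image, and the bucket URLs are hypothetical. The code only constructs the payload; submitting it would be an HTTP POST to a TES server's task-creation endpoint.

```python
import json


def build_tes_task(name, image, command, input_url, output_url):
    """Return a dict shaped like a TES task description (assumed field names)."""
    return {
        "name": name,
        # Files staged into the container before the executor runs.
        "inputs": [{"url": input_url, "path": "/data/input.txt", "type": "FILE"}],
        # Files copied out of the container after the executor finishes.
        "outputs": [{"url": output_url, "path": "/data/output.txt", "type": "FILE"}],
        "resources": {"cpu_cores": 1, "ram_gb": 2.0},
        "executors": [
            {
                "image": image,
                "command": command,
                "stdout": "/data/output.txt",  # capture stdout as the output file
            }
        ],
    }


task = build_tes_task(
    name="md5sum-example",
    image="alpine:3.19",
    command=["md5sum", "/data/input.txt"],
    input_url="s3://example-bucket/input.txt",    # hypothetical location
    output_url="s3://example-bucket/output.txt",  # hypothetical location
)
payload = json.dumps(task, indent=2)
print(payload)
```

Because the payload is plain JSON, the same description could in principle be sent unchanged to any compliant backend, which is the portability the abstract describes.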

arXiv.org e-Print Archive

Structure of the γ-D-glutamyl-L-diamino acid endopeptidase YkfC from Bacillus cereus in complex with L-Ala-γ-D-Glu: insights into substrate recognition by NlpC/P60 cysteine peptidases.

Dipeptidyl-peptidase VI from Bacillus sphaericus and YkfC from Bacillus subtilis have both previously been characterized as highly specific γ-D-glutamyl-L-diamino acid endopeptidases. The crystal structure of a YkfC ortholog from Bacillus cereus (BcYkfC) at 1.8 Å resolution revealed that it contains two N-terminal bacterial SH3 (SH3b) domains in addition to the C-terminal catalytic NlpC/P60 domain that is ubiquitous in the very large family of cell-wall-related cysteine peptidases. A bound reaction product (L-Ala-γ-D-Glu) enabled the identification of conserved sequence and structural signatures for recognition of L-Ala and γ-D-Glu and, therefore, provides a clear framework for understanding the substrate specificity observed in dipeptidyl-peptidase VI, YkfC and other NlpC/P60 domains in general. The first SH3b domain plays an important role in defining substrate specificity by contributing to the formation of the active site, such that only murein peptides with a free N-terminal alanine are allowed. A conserved tyrosine in the SH3b domain of the YkfC subfamily is correlated with the presence of a conserved acidic residue in the NlpC/P60 domain and both residues interact with the free amine group of the alanine. This structural feature allows the definition of a subfamily of NlpC/P60 enzymes with the same N-terminal substrate requirements, including a previously characterized cyanobacterial L-alanine-γ-D-glutamate endopeptidase that contains the two key components (an NlpC/P60 domain attached to an SH3b domain) for assembly of a YkfC-like active site.

PubMed Central

eScholarship - University of California

The structure of BVU2987 from Bacteroides vulgatus reveals a superfamily of bacterial periplasmic proteins with possible inhibitory function.

Author: Abdubek Polat
Astakhova Tamara
Axelrod Herbert L
Bakolitsa Constantina
Carlton Dennis
Chen Connie
Chiu Hsiu Ju
Chiu Michelle
Clayton Thomas
Das Debanu
Deacon Ashley M
Deller Marc C
Duan Lian
Ellrott Kyle
Elsliger Marc André
Ernst Dustin
Farr Carol L
Feuerhelm Julie
Finn Robert D
Godzik Adam
Grant Joanna C
Grzechnik Anna
Han Gye Won
Hodgson Keith O
Jaroszewski Lukasz
Jin Kevin K
Klock Heath E
Knuth Mark W
Kozbial Piotr
Krishna S Sri
Kumar Abhinav
Lesley Scott A
Marciano David
McMullan Daniel
Miller Mitchell D
Morse Andrew T
Nigoghossian Edward
Nopakun Amanda
Okach Linda
Puckett Christina
Reyes Ron
Rife Christopher L
Sefcovic Natasha
Tien Henry J
Trame Christine B
van den Bedem Henry
Weekes Dana
Wilson Ian A
Wooley John
Wooten Tiffany
Xu Qingping
Publication venue: eScholarship, University of California
Publication date: 05/03/2010

Proteins that contain the DUF2874 domain constitute a new Pfam family, PF11396. Members of this family have predominantly been identified in microbes found in the human gut and oral cavity. The crystal structure of one member of this family, BVU2987 from Bacteroides vulgatus, has been determined, revealing a β-lactamase inhibitor protein-like structure with a tandem repeat of domains. Sequence analysis and structural comparisons reveal that BVU2987 and other DUF2874 proteins are related to β-lactamase inhibitor protein, PepSY and SmpA_OmlA proteins and hence are likely to function as inhibitory proteins.

PubMed Central

eScholarship - University of California

Prophetic Granger Causality to infer gene regulatory networks

Author: Bivol Adrian
Carlin Daniel E
Ellrott Kyle
Graim Kiley
Paull Evan O
Ryabinin Peter
Sokolov Artem
Stuart Joshua M
Wong Christopher K
Publication venue: eScholarship, University of California
Publication date: 01/01/2017

We introduce a novel method called Prophetic Granger Causality (PGC) for inferring gene regulatory networks (GRNs) from protein-level time series data. The method uses an L1-penalized regression adaptation of Granger Causality to model protein levels as a function of time, stimuli, and other perturbations. When combined with a data-independent network prior, the framework outperformed all other methods submitted to the HPN-DREAM 8 breast cancer network inference challenge. Our investigations reveal that PGC provides complementary information to other approaches, raising the performance of ensemble learners, while on its own it achieves moderate performance. Thus, PGC serves as a valuable new tool in the bioinformatics toolkit for analyzing temporal datasets. We investigate the general and cell-specific interactions predicted by our method and find several novel interactions, demonstrating the utility of the approach in charting new tumor wiring.
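
The general idea of an L1-penalized, Granger-style regression can be sketched as follows. This is a plain lagged lasso fitted by coordinate descent, not the PGC method itself, whose details are not given in the abstract. The toy time series, the helper names, and the penalty value are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by coordinate descent."""
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                row[j] * (y - sum(row[k] * coefs[k] for k in range(p) if k != j))
                for row, y in zip(rows, targets)
            ) / n
            scale = sum(row[j] ** 2 for row in rows) / n
            coefs[j] = soft_threshold(rho, penalty) / scale if scale else 0.0
    return coefs


def lagged_design(series, target, predictors):
    """Pair predictor values at time t with the target value at time t + 1."""
    length = len(series[target])
    rows = [[series[name][t] for name in predictors] for t in range(length - 1)]
    return rows, series[target][1:]


# Toy data (hypothetical): A at time t + 1 equals 0.8 * B at time t; C is unrelated.
series = {
    "A": [0.0, 0.8, 0.8, 0.8, 0.8, -0.8, -0.8, -0.8, -0.8],
    "B": [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0],
    "C": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0],
}
predictors = ["B", "C"]
rows, targets = lagged_design(series, "A", predictors)
coefs = lasso_coordinate_descent(rows, targets, penalty=0.1)
edges = {name: weight for name, weight in zip(predictors, coefs) if weight != 0.0}
print(edges)
```

A nonzero coefficient is read as a candidate edge from the predictor to the target; the L1 penalty is what drives the coefficients of uninformative predictors to exactly zero, which keeps the inferred network sparse.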

Crossref

Directory of Open Access Journals

eScholarship - University of California

The Francis Crick Institute

Protocol for assessing distances in pathway space for classifier feature sets from machine learning methods.

Author: Apolonio Victor
Benz Christopher
Castro Mauro
Chagas Vinicius
Cherniack Andrew
Ellrott Kyle
Grewal Jasleen
Jones Steven
Karlberg Brian
Laird Peter
Lee Jordan
Robertson A
Stuart Joshua
Tercan Bahar
Wong Christopher
Yau Christina
Zenklusen Jean
Publication venue: eScholarship, University of California
Publication date: 18/03/2025

As genes tend to be co-regulated as gene modules, feature selection in machine learning (ML) on gene expression data can be challenged by the complexity of gene regulation. Here, we present a protocol for reconciling differences in classifier features identified using different ML approaches. We describe steps for loading the PathwaySpace R package, preparing input for analysis, and creating density plots of gene sets. We then detail procedures for testing whether apparently distinct feature sets are related in pathway space. For complete details on the use and execution of this protocol, please refer to Ellrott et al.1

eScholarship - University of California

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Author: Bailey Matthew H
Ding Li
Dong Guanlan
Dursi Lewis Jonathan
Ellrott Kyle
Gerstein Mark B
Getz Gad
Kelso Sean
Li Shantao
Li Yize
Liang Wen-Wei
MC3 Working Group
Meyerson William U
PCAWG Consortium
PCAWG novel somatic mutation calling methods working group
Saksena Gordon
Simpson Jared T
Wang Liang-Bo
Weerasinghe Amila
Wendl Michael C
Wheeler David A
Publication venue: Nature Publishing Group
Publication date: 21/09/2020

The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and unobserved mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.
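
The kind of side-by-side comparison described above can be illustrated with simple set arithmetic over mutation calls. The sketch below is not the study's pipeline: the `compare_call_sets` helper and the toy calls are hypothetical, and measuring overlap as shared calls divided by the union of calls is an assumption, since the abstract does not state the exact definition used.

```python
def compare_call_sets(wes_calls, wgs_calls):
    """Summarise agreement between two sets of (chrom, pos, ref, alt) calls."""
    shared = wes_calls & wgs_calls
    union = wes_calls | wgs_calls
    return {
        "shared": shared,
        "private_wes": wes_calls - wgs_calls,  # seen only by exome sequencing
        "private_wgs": wgs_calls - wes_calls,  # seen only by genome sequencing
        # Assumed definition: shared calls as a fraction of all distinct calls.
        "overlap_fraction": len(shared) / len(union) if union else 0.0,
    }


# Toy calls for one hypothetical sample.
wes = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr3", 10, "T", "G"),
}
wgs = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr4", 900, "G", "A"),
    ("chr5", 42, "A", "C"),
}
summary = compare_call_sets(wes, wgs)
print(len(summary["shared"]), summary["overlap_fraction"])
```

The private sets are the natural starting point for the follow-up questions the abstract raises, such as whether a private call has a low variant allele fraction.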

ZORA

Author Correction: Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Author: Bailey Matthew H
Ding Li
Dong Guanlan
Dursi Lewis Jonathan
Ellrott Kyle
Gerstein Mark B
Getz Gad
Kelso Sean
Li Shantao
Li Yize
Liang Wen-Wei
MC3 Working Group
Meyerson William U
PCAWG Consortium
PCAWG novel somatic mutation calling methods working group
Saksena Gordon
Simpson Jared T
Wang Liang-Bo
Weerasinghe Amila
Wendl Michael C
Wheeler David A
Publication venue: Springer Science and Business Media LLC
Publication date: 30/11/2020

Correction to this paper has been published: https://doi.org/10.1038/s41467-020-20128-w

ZORA

Germline contamination and leakage in whole genome somatic single nucleotide variant detection

Author: Margolin Adam A.
Ewing Adam D.
Caloian Cristian
Sendorek Dorota H.
Bare J. Christopher
Stuart Joshua M.
Houlahan Kathleen E.
Ellrott Kyle
Boutros Paul C.
Yamaguchi Takafumi N.
Norman Thea C.
Publication venue: Springer Science and Business Media LLC
Publication date: 2018

Background: The clinical sequencing of cancer genomes to personalize therapy is becoming routine across the world. However, concerns over patient re-identification from these data lead to questions about how tightly access should be controlled. It is not thought to be possible to re-identify patients from somatic variant data. However, somatic variant detection pipelines can mistakenly identify germline variants as somatic ones, a process called “germline leakage”. The rate of germline leakage across different somatic variant detection pipelines is not well-understood, and it is uncertain whether or not somatic variant calls should be considered re-identifiable. To fill this gap, we quantified germline leakage across 259 sets of whole-genome somatic single nucleotide variant (SNV) predictions made by 21 teams as part of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge. Results: The median somatic SNV prediction set contained 4325 somatic SNVs and leaked one germline polymorphism. The level of germline leakage was inversely correlated with somatic SNV prediction accuracy and positively correlated with the amount of infiltrating normal cells. The specific germline variants leaked differed by tumour and algorithm. To aid in quantitation and correction of leakage, we created a tool, called GermlineFilter, for use in public-facing somatic SNV databases. Conclusions: The potential for patient re-identification from leaked germline variants in somatic SNV predictions has led to divergent open data access policies, based on different assessments of the risks. Indeed, a single, well-publicized re-identification event could reshape public perceptions of the values of genomic data sharing. We find that modern somatic SNV prediction pipelines have low germline-leakage rates, which can be further reduced, especially for cloud-sharing, using pre-filtering software.
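
The notion of germline leakage can be made concrete with a small pre-filtering sketch: predicted somatic SNVs that also appear in a patient's known germline variants are counted as leaks and removed. This is not the GermlineFilter tool described above, whose interface is not given here. The `split_germline_leaks` helper and the toy variants are hypothetical.

```python
def split_germline_leaks(predicted_somatic, germline_variants):
    """Separate somatic predictions into kept calls and leaked germline calls."""
    germline = set(germline_variants)
    kept = [variant for variant in predicted_somatic if variant not in germline]
    leaked = [variant for variant in predicted_somatic if variant in germline]
    return kept, leaked


# Toy (chrom, pos, ref, alt) calls for one hypothetical tumour/normal pair.
predicted = [
    ("chr1", 100, "A", "T"),
    ("chr2", 200, "G", "A"),
    ("chr3", 300, "C", "G"),
    ("chr7", 700, "T", "C"),
]
germline = [
    ("chr2", 200, "G", "A"),  # also present in the somatic predictions: a leak
    ("chr9", 900, "A", "G"),
]
kept, leaked = split_germline_leaks(predicted, germline)
leak_rate = len(leaked) / len(predicted)
print(len(kept), len(leaked), leak_rate)
```

Only the `kept` list would be shared publicly under this scheme, while the leak count gives a per-pipeline measure of the re-identification risk the abstract discusses.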

TSpace (University of Toronto)

Crossref

Archive of European Integration

Directory of Open Access Journals

eScholarship - University of California

UQ eSpace (University of Queensland)

The cancer genome atlas pan-cancer analysis project

Author: Collisson Eric A.
Ellrott Kyle
Mills Shaw Kenna R.
Mills Gordon B.
Ozenberger Brad A.
Sander Chris
Shmulevich Ilya
Stuart Joshua M.
The Cancer Genome Atlas Research Network
Weinstein John N.
Publication date: 01/01/2013

The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer initiative compares the first 12 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile.

PubMed Central

Carolina Digital Repository