
    A Survey of Genomic Properties for the Detection of Regulatory Polymorphisms

    Advances in the computational identification of functional noncoding polymorphisms will aid in cataloging novel determinants of health and identifying genetic variants that explain human evolution. To date, however, the development and evaluation of such techniques has been limited by the availability of known regulatory polymorphisms. We have attempted to address this by assembling, from the literature, a computationally tractable set of regulatory polymorphisms within the ORegAnno database (http://www.oreganno.org). We have further used 104 regulatory single-nucleotide polymorphisms from this set and 951 polymorphisms of unknown function, from 2-kb and 152-bp noncoding upstream regions of genes, to investigate the discriminatory potential of 23 properties related to gene regulation and population genetics. Among the most important properties detected in this region are distance to transcription start site, local repetitive content, sequence conservation, minor and derived allele frequencies, and presence of a CpG island. We further used the entire set of properties to evaluate their collective performance in detecting regulatory polymorphisms. Using a 10-fold cross-validation approach, we were able to achieve a sensitivity and specificity of 0.82 and 0.71, respectively, and we show that this performance is strongly influenced by the distance to the transcription start site.
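
    As an illustration of the two metrics quoted in this abstract, the sketch below computes sensitivity and specificity from binary labels and splits a dataset into 10 held-out folds. It is a minimal example, not the authors' classifier: the function names, the label encoding (1 = regulatory, 0 = unknown function), and the unshuffled fold assignment are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (assumed: 1 = regulatory)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    # Sensitivity: fraction of true positives recovered.
    # Specificity: fraction of true negatives correctly rejected.
    return tp / (tp + fn), tn / (tn + fp)


def k_fold_indices(n: int, k: int = 10) -> List[List[int]]:
    """Assign indices 0..n-1 to k nearly equal held-out folds (hypothetical helper, no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy labels only; these are not data from the study.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
folds = k_fold_indices(104 + 951, k=10)  # dataset size taken from the abstract
```

    In a cross-validation run, each fold would be held out in turn, a model trained on the remaining nine, and the predictions on the held-out folds pooled before computing the two metrics.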

    Medical Statistics - Current Developments in Statistical Methodology for Genetic Architecture of Complex Diseases

    [no abstract available]

    DEVELOPMENT AND APPLICATION OF MASS SPECTROMETRY-BASED PROTEOMICS TO GENERATE AND NAVIGATE THE PROTEOMES OF THE GENUS POPULUS

    Historically, there has been tremendous synergy between biology and analytical technology, such that one drives the development of the other. Over the past two decades, their interrelatedness has catalyzed entirely new experimental approaches and unlocked new types of biological questions, as exemplified by the advancements of the field of mass spectrometry (MS)-based proteomics. MS-based proteomics, which provides a more complete measurement of all the proteins in a cell, has revolutionized a variety of scientific fields, ranging from characterizing proteins expressed by a microorganism to tracking cancer-related biomarkers. Though MS technology has advanced significantly, the analysis of complicated proteomes, such as those of plants or humans, remains challenging because of the incongruity between the complexity of the biological samples and the analytical techniques available. In this dissertation, analytical methods utilizing state-of-the-art MS instrumentation have been developed to address challenges associated with both qualitative and quantitative characterization of eukaryotic organisms. In particular, these efforts focus on characterizing Populus, a model organism and potential feedstock for bioenergy. The effectiveness of pre-existing MS techniques, initially developed to identify proteins reliably in microbial proteomes, was tested to define the boundaries and characterize the landscape of functional genome expression in Populus. Although these approaches were generally successful, achieving maximal proteome coverage was still limited by a number of factors, including genome complexity, the dynamic range of protein identification, and the abundance of protein variants. To overcome these challenges, improvements were needed in sample preparation, MS instrumentation, and bioinformatics.
    Optimization of experimental procedures and implementation of current state-of-the-art instrumentation afforded the most detailed look into the predicted proteome space of Populus, offering varying proteome perspectives: 1) network-wide, 2) pathway-specific, and 3) protein-level viewpoints. In addition, we implemented two bioinformatic approaches that were capable of decoding the plasticity of the Populus proteome, facilitating the identification of single amino acid polymorphisms and generating a more accurate profile of protein expression. Though the methods and results presented in this dissertation have direct implications in the study of bioenergy research, more broadly this dissertation focuses on developing techniques to contend with the notorious challenges associated with protein characterization in all eukaryotic organisms.

    Improving data extraction methods for large molecular biology datasets.

    In the past, an experiment involving a pairwise comparison normally involved one or a few dependent variables. Now, thousands of dependent variables can be measured simultaneously in a single experiment, be it detecting genes via a microarray experiment, sequencing genomes, or detecting microbial species based on DNA fragments using molecular techniques. How we analyze such large collections of data will be a major scientific focus over the next decade. Statistical methods that were once acceptable for comparing a few conditions are being revised to handle thousands of experiments. Molecular biology techniques that explored one gene or species have evolved and are now capable of generating complex datasets requiring new strategies and ways of thinking in order to discover biologically meaningful results. The central theme of this dissertation is to develop strategies that deal with a number of issues that are present in these large-scale datasets. In chapter 1, I describe a microarray analytical method that can be applied to low-replicate experiments. In chapters 2-4, the focus is how to best analyze data from ARISA (a PCR-based molecular method for rapidly generating a fingerprint of microbial diversity). Chapter 2 focuses on qualifying ARISA data so that the data will best represent their biological source, prior to further analysis. Chapter 3 focuses on how to best compare ARISA profiles to one another. Chapter 4 focuses on developing a software tool that implements the data processing and clustering strategies from chapters 2 and 3. The findings described herein provide the scientific community with improved analytical strategies in both the microarray and ARISA research areas.

    The Diploid Genome Sequence of an Individual Human

    Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
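
    The event counts listed in this abstract can be checked against its "more than 4.1 million" and "22%" figures with a few lines of arithmetic. The sketch below uses only the five categories for which the abstract gives a count; segmental duplications and copy number variation regions are left out because no numbers are stated for them, so the totals are an approximation of the published tally rather than a reproduction of it.

```python
# Variant counts as quoted in the abstract (categories without a stated count are omitted).
variant_counts = {
    "SNP": 3_213_401,
    "block substitution": 53_823,
    "heterozygous indel": 292_102,
    "homozygous indel": 559_473,
    "inversion": 90,
}

total_events = sum(variant_counts.values())
non_snp_events = total_events - variant_counts["SNP"]
non_snp_share = non_snp_events / total_events

print(f"total events:   {total_events:,}")    # total events:   4,118,889
print(f"non-SNP events: {non_snp_events:,}")  # non-SNP events: 905,488
print(f"non-SNP share:  {non_snp_share:.1%}") # non-SNP share:  22.0%
```

    The share of variant bases (74%) cannot be recomputed this way, since the abstract gives size ranges for each category rather than total base counts.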

    CLINICAL AND BIOLOGICALLY-BASED APPROACHES FOR CLASSIFYING AND PREDICTING EARLY OUTCOMES OF CHRONIC CHILDHOOD ARTHRITIS

    Background: Juvenile idiopathic arthritis (JIA) comprises a heterogeneous group of conditions that share chronic arthritis as a common characteristic. Current classification criteria for chronic childhood arthritis have limitations. Despite new treatment strategies and medications, some patients continue to have persistently active and disabling disease as adults. Few predictors of poor outcomes have been identified. Objectives: This thesis comprises two complementary studies. The objective of the first study was to identify discrete clusters comprising clinical features and inflammatory biomarkers in children with JIA and to compare them with the current JIA categories that have been proposed by the International League of Associations for Rheumatology. The second study aimed to identify predictors of short-term arthritis activity based on clinical and biomarker profiles in JIA patients. Methods: For both studies we utilized data that were collected in a Canadian nationwide, prospective, longitudinal cohort study titled Biologically-Based Outcome Predictors in JIA. Clustering and classification algorithms were applied to the data to accomplish both study objectives. Results: This research identified three clusters of patients in visit 1 (enrolment) and five clusters in visit 2 (6-month). Clusters revealed in this analysis exposed different and more homogeneous subgroups compared to the seven conventional JIA categories. In the second study, the presence or absence of active joints, physician global assessments, and Wallace criteria were chosen as outcome variables 18 months post-enrolment. Among 112 variables, 17 were selected as the best predictors of 18-month outcomes. The panel predicted presence or absence of active arthritis, physician global assessment, and Wallace criteria of inactive disease 18 months after diagnosis with 79%, 82%, and 71% accuracy and 0.83, 0.86, and 0.82 area under the curve (AUC), respectively.
    The accuracy and AUC values were higher compared to when only clinical features were used for prediction. Conclusion: Results of this study suggest that certain groups of patients within different JIA categories are more aligned pathobiologically than their separate clinical categorizations suggest. Further, the research found that a small number of clinical and inflammatory variables at diagnosis can more accurately predict short-term arthritis activity in JIA than clinical characteristics only.
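
    For readers unfamiliar with the two performance measures reported in this abstract, the sketch below shows one common way to compute them for a binary outcome. It is an illustrative example only: the thesis does not state how its metrics were calculated, the function names are hypothetical, and the AUC here uses the pairwise (Mann-Whitney) formulation, in which ties count as half a win.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the true binary labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    return sum(1 for l, p in zip(labels, predictions) if l == p) / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the probability that a positive outscores a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score; ties earn 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values only; these are not data from the cohort.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, [1 if s >= 0.5 else 0 for s in toy_scores])
```

    Accuracy depends on the chosen score threshold (0.5 in the toy example), whereas AUC summarizes ranking quality across all thresholds, which is why the two figures in the abstract need not move together.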

    Distinguishing HIV-1 drug resistance, accessory, and viral fitness mutations using conditional selection pressure analysis of treated versus untreated patient samples

    BACKGROUND: HIV can evolve drug resistance rapidly in response to new drug treatments, often through a combination of multiple mutations [1-3]. It would be useful to develop automated analyses of HIV sequence polymorphism that are able to predict drug resistance mutations, and to distinguish different types of functional roles among such mutations, for example, those that directly cause drug resistance versus those that play an accessory role. Detecting functional interactions between mutations is essential for this classification. We have adapted a well-known measure of evolutionary selection pressure (K(a)/K(s)) and developed a conditional K(a)/K(s) approach to detect important interactions. RESULTS: We have applied this analysis to four independent HIV protease sequencing datasets: 50,000 clinical samples sequenced by Specialty Laboratories, Inc.; 1800 samples from patients treated with protease inhibitors; 2600 samples from untreated patients; and 400 samples from untreated African patients. We have identified 428 statistically significant mutation interactions in the Specialty dataset, and we were able to distinguish primary vs. accessory mutations for many well-studied examples. Amino acid interactions identified by conditional K(a)/K(s) matched 80 of 92 pairwise interactions found by a completely independent study of HIV protease (p-value for this match is significant: 10^-70). Furthermore, K(a)/K(s) selection pressure results were highly reproducible among these independent datasets, both qualitatively and quantitatively, suggesting that they are detecting real drug-resistance and viral fitness mutations in the wild HIV-1 population. CONCLUSION: Conditional K(a)/K(s) analysis can detect mutation interactions and distinguish primary vs. accessory mutations in HIV-1. K(a)/K(s) analysis of treated vs. untreated patient data can distinguish drug-resistance vs. viral fitness mutations. Verification of these results would require longitudinal studies.
The result provides a valuable resource for AIDS research and will be available for open access upon publication at REVIEWERS: This article was reviewed by Wen-Hsiung Li (nominated by Eugene V. Koonin), Robert Shafer (nominated by Eugene V. Koonin), and Shamil Sunyaev.
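
The idea of a conditional selection-pressure ratio can be illustrated with a deliberately simplified sketch. The abstract does not give the authors' estimator, so everything below is an assumption: K(a)/K(s) is approximated by a plain counting ratio (nonsynonymous changes per nonsynonymous site divided by synonymous changes per synonymous site), and the "conditional" version simply restricts the count to samples that carry a given mutation. The record layout, the function names, and the placeholder mutation label "Y" are hypothetical.

```python
from typing import Mapping, Sequence


def ka_ks(nonsyn_changes: int, syn_changes: int, nonsyn_sites: float, syn_sites: float) -> float:
    """Counting-based Ka/Ks approximation (assumed form, not the paper's estimator)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_changes == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    ka = nonsyn_changes / nonsyn_sites  # nonsynonymous changes per nonsynonymous site
    ks = syn_changes / syn_sites        # synonymous changes per synonymous site
    return ka / ks


def conditional_ka_ks(samples: Sequence[Mapping], given: str,
                      nonsyn_sites: float, syn_sites: float) -> float:
    """Ka/Ks at a target codon, using only samples that carry the mutation `given`."""
    subset = [s for s in samples if given in s["mutations"]]
    na = sum(s["nonsyn"] for s in subset)  # nonsynonymous changes at the target codon
    ns = sum(s["syn"] for s in subset)     # synonymous changes at the target codon
    return ka_ks(na, ns, nonsyn_sites, syn_sites)


# Toy records only; "Y" is a placeholder label, not a real protease mutation.
toy_samples = [
    {"mutations": {"Y"}, "nonsyn": 1, "syn": 0},
    {"mutations": {"Y"}, "nonsyn": 1, "syn": 1},
    {"mutations": set(), "nonsyn": 0, "syn": 1},
]
overall = ka_ks(2, 2, nonsyn_sites=2.0, syn_sites=1.0)
given_y = conditional_ka_ks(toy_samples, "Y", nonsyn_sites=2.0, syn_sites=1.0)
```

In this toy case the ratio at the target codon is higher among samples carrying "Y" than across all samples, which is the kind of contrast a conditional analysis looks for when flagging a possible interaction between two positions.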