35 research outputs found

    Efficient Pedigree-Based Imputation

    When performing a Genome-Wide Association Study (GWAS), one attempts to associate a phenotype with some genomic information, commonly a gene or set of genes. Often we want finer resolution and attempt to identify one or more Single Nucleotide Polymorphisms (SNPs) that are associated with the phenotype. Sometimes a GWAS is also used to associate other kinds of genetic data, such as methylation or Copy Number Variations (CNVs), with the phenotype. The phenotype in such studies is often a disease, e.g. Type II Diabetes Mellitus (T2D), Coronary Heart Disease (CHD), or cancer, but can be another trait as well, for instance height, weight, eye color, or intelligence. In order to perform a GWAS it is necessary to sequence the Deoxyribonucleic Acid (DNA) of the individuals in the study. This sequencing is much cheaper than it once was, but is still very expensive for large-scale studies. Large-scale studies are needed in order to achieve the statistical power necessary to reliably identify associations. By performing imputation we are able to increase the size of studies in two ways. First, individual studies are able to sequence more individuals on their budget because they can sequence individuals at only certain sites and impute the rest of the sites to recover part of the power. Second, large-scale meta-studies can impute in order to have full sequences for all the individuals in the smaller studies, making them comparable; this is the approach taken by Fuchsberger et al. [33]. Imputation for genetic data is done in two main ways. The first is population-based imputation, which depends on Linkage Disequilibrium (LD) and on knowing the allele frequencies for a reference population that the study population is believed to be similar to. The second is Identity By Descent (IBD)-based imputation, in which we infer genotypes based on the familial relationships in pedigree data.
    In this thesis, we focus on IBD-based imputation. Imputing on pedigree data can be quite time consuming. For instance, the original implementation of GIGI (Genome Imputation Given Inheritance) by Cheung et al. [15] took around 17 days to impute chromosome 2 (2,402,346 SNPs) of a pedigree with 189 members, using 28 GB of RAM [53]. Being able to complete family (IBD)-based imputation in a timely manner and with high accuracy is of great value to researchers around the world, especially now that this data is becoming more available to those without large budgets for computing power. This thesis describes the basis for phasing and imputation, the details of the calculations involved, and an exploration of ways to speed up imputation of large pedigree data.
    Master of Science, Computer Science, University of Michigan-Flint. https://deepblue.lib.umich.edu/bitstream/2027.42/149464/1/Kunji2018.pdf

    An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity

    Disease processes are usually driven by several genes interacting in molecular modules or pathways leading to the disease. The identification of such modules in gene or protein networks is at the core of computational methods in biomedical research. In this context, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method, based on merging and splitting the hierarchical tree obtained from any community detection technique, for constrained DMI in biological networks. The only constraint was that the size of each community lies in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics such as modularity, conductance and connectivity, to determine the quality of a graph partition at a given level of the hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false discovery rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 DM (rank 10), respectively. In additional analysis, our proposed approach detected a total of 44 DM in the networks, in comparison to 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with that of the winner.

    Untargeted Metabolomics Profiling Reveals Perturbations in Arginine-NO Metabolism in Middle Eastern Patients with Coronary Heart Disease

    Coronary heart disease (CHD) is a major cause of death in Middle Eastern (ME) populations, with current studies of the metabolic fingerprints of CHD lacking in diversity. Identification of specific biomarkers to uncover potential mechanisms for developing predictive models and targeted therapies for CHD is urgently needed for the least-studied ME populations. A case-control study was carried out in a cohort of 1001 CHD patients and 2999 controls. Untargeted metabolomics was used, yielding 1159 metabolites. Univariate and pathway enrichment analyses were performed to understand functional changes in CHD. A metabolite risk score (MRS) was developed, and its predictive performance for CHD was assessed using multivariate analysis and machine learning. A total of 511 metabolites were significantly different between the CHD patients and the controls (FDR p < 0.05). The enriched pathways (FDR p < 10^−300) included D-arginine and D-ornithine metabolism, glycolysis, oxidation and degradation of branched chain fatty acids, and sphingolipid metabolism. MRS showed good discriminative power between the CHD cases and the controls (AUC = 0.99). In this first study in the Middle East, known and novel circulating metabolites and metabolic pathways associated with CHD were identified. A small panel of metabolites can efficiently discriminate CHD cases and controls and therefore can be used as a diagnostic/predictive tool.

    Genetic Susceptibility to Arrhythmia Phenotypes in a Middle Eastern Cohort of 14,259 Whole-Genome Sequenced Individuals

    Background: The current study explores the genetic underpinnings of cardiac arrhythmia phenotypes within Middle Eastern populations, which are under-represented in genomic medicine research. Methods: Whole-genome sequencing data from 14,259 individuals from the Qatar Biobank were used; the cohort comprised 47.8% individuals of Arab ancestry, 18.4% of South Asian ancestry, and 4.6% of African ancestry. The frequency of rare functional variants within a set of 410 candidate genes for cardiac arrhythmias was assessed. Polygenic risk score (PRS) performance for atrial fibrillation (AF) prediction was evaluated. Results: This study identified 1196 rare functional variants, including 162 previously linked to arrhythmia phenotypes, with varying frequencies across Arab, South Asian, and African ancestries. Of these, 137 variants met the pathogenic or likely pathogenic (P/LP) criteria according to ACMG guidelines. Of these, 91 were in ACMG actionable genes and were present in 1030 individuals (~7%). Ten P/LP variants showed significant associations with atrial fibrillation (p < 2.4 × 10^−10). Five out of ten existing PRSs were significantly associated with AF (e.g., PGS000727, p = 0.03, OR = 1.43 [1.03, 1.97]). Conclusions: Our study is the largest to examine the genetic predisposition to arrhythmia phenotypes in the Middle East using whole-genome sequence data. It underscores the importance of including diverse populations in genomic investigations to elucidate the genetic landscape of cardiac arrhythmias and mitigate health disparities in genomic medicine. This publication was made possible by the PPM3 award PPM 03-0322-190036 from the Qatar National Research Fund (QNRF, a member of the Qatar Foundation).

    Synthetic Mimic of Antimicrobial Peptide with Nonmembrane-Disrupting Antibacterial Properties

    Proteolysis in dairy lactic acid bacteria has been studied in great detail by genetic, biochemical and ultrastructural methods. From these studies the picture emerges that the proteolytic systems of lactococci and lactobacilli are remarkably similar in their components and mode of action. The proteolytic system consists of an extracellularly located serine-proteinase, transport systems specific for di-tripeptides and oligopeptides (> 3 residues), and a multitude of intracellular peptidases. This review describes the properties and regulation of individual components as well as studies that have led to identification of their cellular localization. Targeted mutational techniques developed in recent years have made it possible to investigate the role of individual and combinations of enzymes in vivo. Based on these results as well as in vitro studies of the enzymes and transporters, a model for the proteolytic pathway is proposed. The main features are: (i) proteinases have a broad specificity and are capable of releasing a large number of different oligopeptides, of which a large fraction falls in the range of 4 to 8 amino acid residues; (ii) oligopeptide transport is the main route for nitrogen entry into the cell; (iii) all peptidases are located intracellularly and concerted action of peptidases is required for complete degradation of accumulated peptides.

    Coevolving pairs between amino acids in V1V2 and other structural HIV-1 Env domains.

    V1, V2 and V3 are shown in cartoon representation, coloured sky blue, pink and orange, respectively. (A) The 59 predicted coevolving residue pairs that are resolved in the crystal structure of [49], with TP shown as green and FP as red dashes. (B) Two long-distance coevolving amino acids are quite likely mediated by an N-linked glycan. The involved amino acids are shown in stick representation. (C) Three long-distance residue pairs (Ile165-Lys192, Gly167-Lys192, and Gly167-Met426) are presumably inter-gp120 contacts. The intra- and inter-gp120 distances are shown as coloured (orange, light green and yellow) bonds. The inter-gp120 distances are in all cases smaller than the intra-gp120 ones.

    HIV cell entry.

    Schematic illustration of the HIV-1 entry steps: attachment and coreceptor binding.

    Coevolution network of the inter gp120-gp41 coevolving pair Pro238-Glu630.

    This pair may affect the gp120-gp41 interaction, although the two residues are not proximal, through their intra-domain coevolving residue partners.

    Predicted coevolving pairs between amino acids located in V3 and other structural regions of HIV-1 Env.

    Gp120 is shown in cartoon representation, with V1 coloured in blue, V2 in pink and V3 in orange. (A) All inter-V3 coevolving pairs are highlighted with green (TP) or red (FP) dashes. (B) Coevolving amino acid pairs Glu293-Asn295, Glu293-Thr297, His330-Asn332 and His330-Ser334 (shown in stick representation) are mediated by N-linked glycans (shown as black lines). (C) Predicted contacts between amino acids located in V1V2 and V3. The involved amino acids are highlighted as coloured sticks.