Space-efficient merging of succinct de Bruijn graphs
We propose a new algorithm for merging succinct representations of de Bruijn
graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the
lightweight BWT merging approach by Holt and McMillan [Bioinformatics 2014,
ACM-BCB 2014]. Our algorithm has the same asymptotic cost as the state-of-the-art
tool for the same problem presented by Muggli et al. [bioRxiv 2017,
Bioinformatics 2019], but it uses less than half of its working space. A novel
important feature of our algorithm, not found in any of the existing tools, is
that it can compute the Variable Order succinct representation of the union
graph within the same asymptotic time/space bounds. Comment: Accepted to SPIRE'1
Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology.
Genome-wide association studies (GWASs) require accurate cohort phenotyping, but expert labeling can be costly, time intensive, and variable. Here, we develop a machine learning (ML) model to predict glaucomatous optic nerve head features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; p ≤ 5 × 10⁻⁸) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR: select loci near genes involved in neuronal and synaptic biology or harboring variants known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR and primary open-angle glaucoma in the independent EPIC-Norfolk cohort.
The Parkinson's phenome: traits associated with Parkinson's disease in a broadly phenotyped cohort
In order to systematically describe the Parkinson's disease phenome, we performed a series of 832 cross-sectional case-control analyses in a large database. Responses to 832 online survey-based phenotypes including diseases, medications, and environmental exposures were analyzed in 23andMe research participants. For each phenotype, survey respondents were used to construct a cohort of Parkinson's disease cases and age-matched and sex-matched controls, and an association test was performed using logistic regression. Cohorts included a median of 3899 Parkinson's disease cases and 49,808 controls, all of European ancestry. Highly correlated phenotypes were removed and the novelty of each significant association was systematically assessed (assigned to one of four categories: known, likely, unclear, or novel). Parkinson's disease diagnosis was associated with 122 phenotypes. We replicated 27 known associations and found 23 associations with a strong a priori link to a known association. We discovered 42 associations that have not previously been reported. Migraine, obsessive-compulsive disorder, and seasonal allergies were associated with Parkinson's disease and tend to occur decades before the typical age of diagnosis for Parkinson's disease. The phenotypes that currently comprise the Parkinson's disease phenome have mostly been explored in relatively small purpose-built studies. Using a single large dataset, we have successfully reproduced many of these established associations and have extended the Parkinson's disease phenome by discovering novel associations. Our work paves the way for studies of these associated phenotypes that explore shared molecular mechanisms with Parkinson's disease, infer causal relationships, and improve our ability to identify individuals at high risk of Parkinson's disease.
Optical map guided genome assembly
Background: The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly, but genome-wide assemblies based solely on read data still cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies, but it has mostly been applied in a post-processing stage after contig assembly. Results: We propose OpticalKermit, which directly integrates genome-wide optical maps into contig assembly. We show how genome-wide optical maps can be used to localize reads on the genome, and then we adapt the Kermit method, which originally incorporated genetic linkage maps into the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome-wide optical maps into the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OpticalKermit produces an assembly with almost three times higher NGA50 and a lower number of misassemblies on real A. thaliana reads. Conclusions: OpticalKermit successfully incorporates optical mapping data directly into contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies. Peer reviewed
RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease
To facilitate precision medicine and whole genome annotation, we developed a machine learning technique that scores how strongly genetic variants affect RNA splicing, whose alteration contributes to many diseases. Analysis of over 650,000 intronic and exonic variants reveals widespread patterns of mutation-driven aberrant splicing. Intronic disease mutations alter splicing nine times more often than common variants, and missense exonic disease mutations that least impact protein function are five times more likely to alter splicing than others. Tens of thousands of disease-causing mutations are detected, including those involved in cancers and spinal muscular atrophy. Examination of intronic and exonic variants found using whole genome sequencing of individuals with autism reveals mis-spliced genes with neurodevelopmental phenotypes. Our approach provides evidence for causal variants and should enable new discoveries in precision medicine.
A peridynamic based machine learning model for one-dimensional and two-dimensional structures
With the rapid growth of available data and computing resources, data-driven models are a potential approach in many scientific and engineering disciplines. However, for complex physical phenomena with limited data, data-driven models lack robustness and fail to provide good predictions. Theory-guided data science is a recent approach that can take advantage of both physics-driven and data-driven models. This study presents a novel peridynamics-based machine learning model for one-dimensional and two-dimensional structures. The linear relationships between the displacement of a material point and the displacements of its family members and applied forces are obtained for the machine learning model by using linear regression. The numerical procedure for coupling the peridynamic model and the machine learning model is also provided. The accuracy of the coupled model is verified by considering various examples of a one-dimensional bar and a two-dimensional plate. To further demonstrate the capabilities of the coupled model, damage prediction for a plate with a pre-existing crack, a two-dimensional representation of a three-point bending test, and a plate subjected to dynamic load are simulated.
The effect of LRRK2 loss-of-function variants in humans
Analysis of large genomic datasets, including gnomAD, reveals that partial LRRK2 loss of function is not strongly associated with diseases, serving as an example of how human genetics can be leveraged for target validation in drug discovery. Human genetic variants predicted to cause loss-of-function of protein-coding genes (pLoF variants) provide natural in vivo models of human gene inactivation and can be valuable indicators of gene function and the potential toxicity of therapeutic inhibitors targeting these genes(1,2). Gain-of-kinase-function variants in LRRK2 are known to significantly increase the risk of Parkinson's disease(3,4), suggesting that inhibition of LRRK2 kinase activity is a promising therapeutic strategy. While preclinical studies in model organisms have raised some on-target toxicity concerns(5-8), the biological consequences of LRRK2 inhibition have not been well characterized in humans. Here, we systematically analyze pLoF variants in LRRK2 observed across 141,456 individuals sequenced in the Genome Aggregation Database (gnomAD)(9), 49,960 exome-sequenced individuals from the UK Biobank and over 4 million participants in the 23andMe genotyped dataset. After stringent variant curation, we identify 1,455 individuals with high-confidence pLoF variants in LRRK2. Experimental validation of three variants, combined with previous work(10), confirmed reduced protein levels in 82.5% of our cohort. We show that heterozygous pLoF variants in LRRK2 reduce LRRK2 protein levels but that these are not strongly associated with any specific phenotype or disease state. Our results demonstrate the value of large-scale genomic databases and phenotyping of human loss-of-function carriers for target validation in drug discovery. Peer reviewed
Identification of novel risk loci for restless legs syndrome in genome-wide association studies in individuals of European ancestry: a meta-analysis
Background: Restless legs syndrome is a prevalent chronic neurological disorder with potentially severe mental and physical health consequences. Clearer understanding of the underlying pathophysiology is needed to improve treatment options. We did a meta-analysis of genome-wide association studies (GWASs) to identify potential molecular targets. Methods: In the discovery stage, we combined three GWAS datasets (EU-RLS GENE, INTERVAL, and 23andMe) with diagnosis data collected from 2003 to 2017, in face-to-face interviews or via questionnaires, and involving 15,126 cases and 95,725 controls of European ancestry. We identified common variants by fixed-effect inverse-variance meta-analysis. Significant genome-wide signals (p […] Findings: We identified and replicated 13 new risk loci for restless legs syndrome and confirmed the previously identified six risk loci. MEIS1 was confirmed as the strongest genetic risk factor for restless legs syndrome (odds ratio 1.92, 95% CI 1.85-1.99). Gene prioritisation, enrichment, and genetic correlation analyses showed that identified pathways were related to neurodevelopment and highlighted genes linked to axon guidance (associated with SEMA6D), synapse formation (NTNG1), and neuronal specification (HOXB cluster family and MYT1). Interpretation: Identification of new candidate genes and associated pathways will inform future functional research. Advances in understanding of the molecular mechanisms that underlie restless legs syndrome could lead to new treatment options. We focused on common variants; thus, additional studies are needed to dissect the roles of rare and structural variations. Peer reviewed
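The fixed-effect inverse-variance meta-analysis named in the Methods above can be illustrated with a minimal sketch. This is a generic textbook formulation, not the authors' pipeline: each study's effect estimate is weighted by the inverse of its squared standard error, and the pooled standard error is the square root of the inverse of the summed weights. The function name and the input values are hypothetical and chosen only for illustration.

```python
import math


def fixed_effect_meta(betas, ses):
    """Pool per-study effect sizes with inverse-variance weights.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors of those estimates
    Returns (pooled beta, pooled SE, z score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]  # w_i = 1 / SE_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal null distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


if __name__ == "__main__":
    # Hypothetical log odds ratios and standard errors for three studies.
    beta, se, z, p = fixed_effect_meta([0.60, 0.70, 0.65], [0.05, 0.10, 0.04])
    print(f"pooled beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.3g}")
```

Because the weights depend only on the standard errors, larger and more precise studies dominate the pooled estimate; the model assumes all studies estimate one shared true effect.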
ART: A machine learning Automated Recommendation Tool for synthetic biology
Biology has changed radically in the last two decades, transitioning from a descriptive science into a design science. Synthetic biology allows us to bioengineer cells to synthesize novel valuable molecules such as renewable biofuels or anticancer drugs. However, traditional synthetic biology approaches involve ad-hoc engineering practices, which lead to long development times. Here, we present the Automated Recommendation Tool (ART), a tool that leverages machine learning and probabilistic modeling techniques to guide synthetic biology in a systematic fashion, without the need for a full mechanistic understanding of the biological system. Using sampling-based optimization, ART provides a set of recommended strains to be built in the next engineering cycle, alongside probabilistic predictions of their production levels. We demonstrate the capabilities of ART on simulated data sets, as well as experimental data from real metabolic engineering projects producing renewable biofuels, hoppy flavored beer without hops, and fatty acids. Finally, we discuss the limitations of this approach, and the practical consequences of the underlying assumptions failing.
Genomewide Association Studies of LRRK2 Modifiers of Parkinson's Disease
Objective: The aim of this study was to search for genes/variants that modify the effect of LRRK2 mutations in terms of penetrance and age-at-onset of Parkinson's disease. Methods: We performed the first genomewide association study of penetrance and age-at-onset of Parkinson's disease in LRRK2 mutation carriers (776 cases and 1,103 non-cases at their last evaluation). Cox proportional hazard models and linear mixed models were used to identify modifiers of penetrance and age-at-onset of LRRK2 mutations, respectively. We also investigated whether a polygenic risk score derived from a published genomewide association study of Parkinson's disease was able to explain variability in penetrance and age-at-onset in LRRK2 mutation carriers. Results: A variant located in the intronic region of CORO1C on chromosome 12 (rs77395454; p value = 2.5E-08, beta = 1.27, SE = 0.23, risk allele: C) met genomewide significance for the penetrance model. Co-immunoprecipitation analyses of LRRK2 and CORO1C supported an interaction between these 2 proteins. A region on chromosome 3, within a previously reported linkage peak for Parkinson's disease susceptibility, showed suggestive associations in both models (penetrance top variant: p value = 1.1E-07; age-at-onset top variant: p value = 9.3E-07). A polygenic risk score derived from publicly available Parkinson's disease summary statistics was a significant predictor of penetrance, but not of age-at-onset. Interpretation: This study suggests that variants within or near CORO1C may modify the penetrance of LRRK2 mutations. In addition, common Parkinson's disease associated variants collectively increase the penetrance of LRRK2 mutations.