426 research outputs found

    Partout: A Distributed Engine for Efficient RDF Processing

    The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications, some of which already hold more than a trillion triples. Confronted with such huge amounts of data and their future growth, existing state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for efficient RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log, allocating the fragments to nodes in a cluster, and finding the optimal configuration. Partout can efficiently handle updates, and its query optimizer produces efficient query execution plans for ad-hoc SPARQL queries. Our experiments show the superiority of our approach to state-of-the-art approaches for partitioning and distributed SPARQL query processing.

    Data Science


    Nanopore sequencing of native adeno-associated virus (AAV) single-stranded DNA using a transposase-based rapid protocol

    Radukic M, Brandt D, Haak M, Müller K, Kalinowski J. Nanopore sequencing of native adeno-associated virus (AAV) single-stranded DNA using a transposase-based rapid protocol. NAR Genomics and Bioinformatics. 2020;2(4): lqaa074. Next-generation sequencing of single-stranded DNA (ssDNA) enables transgene characterization of gene therapy vectors such as adeno-associated virus (AAV), but current library generation uses complicated and potentially biased second-strand synthesis. We report that libraries for nanopore sequencing of ssDNA can be conveniently created without second-strand synthesis using a transposase-based protocol. We show for bacteriophage M13 ssDNA that the MuA transposase has unexpected residual activity on ssDNA, explained in part by transposase action on transient double-stranded hairpins. In the case of AAV, library creation is additionally aided by genome hybridization. We demonstrate the power of direct sequencing combined with nanopore long reads by characterizing AAV vector transgenes. Sequencing yielded reads up to full genome length, including GC-rich inverted terminal repeats. Unlike short-read techniques, single reads covered genome-genome and genome-contaminant fusions and other recombination events, whilst additionally providing information on epigenetic methylation. Single-nucleotide variants across the transgene cassette were revealed and secondary genome packaging signals were readily identified. Moreover, comparison of sequence abundance with quantitative polymerase chain reaction results demonstrated the technique's future potential for quantification of DNA impurities in AAV vector stocks. The findings promote direct nanopore sequencing as a fast and versatile platform for ssDNA characterization, such as AAV ssDNA in research and clinical settings.

    Fragment-based Pretraining and Finetuning on Molecular Graphs

    Property prediction on molecular graphs is an important application of Graph Neural Networks (GNNs). Recently, unlabeled molecular data has become abundant, which facilitates the rapid development of self-supervised learning for GNNs in the chemical domain. In this work, we propose pretraining GNNs at the fragment level, a promising middle ground to overcome the limitations of node-level and graph-level pretraining. Borrowing techniques from recent work on principal subgraph mining, we obtain a compact vocabulary of prevalent fragments from a large pretraining dataset. From the extracted vocabulary, we introduce several fragment-based contrastive and predictive pretraining tasks. The contrastive learning task jointly pretrains two different GNNs: one on molecular graphs and the other on fragment graphs, which represent higher-order connectivity within molecules. By enforcing consistency between the fragment embedding and the aggregated embedding of the corresponding atoms from the molecular graphs, we ensure that the embeddings capture structural information at multiple resolutions. The structural information of fragment graphs is further exploited to extract auxiliary labels for graph-level predictive pretraining. We employ both the pretrained molecular-based and fragment-based GNNs for downstream prediction, thus utilizing the fragment information during finetuning. Our graph fragment-based pretraining (GraphFP) improves performance on 5 out of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%. Code is available at: https://github.com/lvkd84/GraphFP. Comment: 18 pages, 4 figures, published in NeurIPS 202

    Molecular binding of formaldehyde to DNA and proteins

    Formaldehyde is produced worldwide on a large scale (21 million tons in 2000) and used in a wide spectrum of applications. Its toxicity and carcinogenic effects have evoked numerous public health concerns. According to the International Agency for Research on Cancer (IARC), formaldehyde is classified as a known animal and human carcinogen, causing nasal cancer. More limited epidemiologic evidence suggests that formaldehyde can also induce leukemia in humans; however, this is controversial. In this dissertation, we have designed an integrated bottom-up approach to address critical issues to better understand formaldehyde's carcinogenic potential. Specifically, the N-terminus of histone and lysine residues located in both the histone N-terminal tail and the globular fold domain were identified as binding sites for formaldehyde in the current study. We also found that formaldehyde-induced lysine adducts could inhibit the formation of post-translational modifications on histone, raising the possibility that formaldehyde might alter epigenetic regulation. We have also elucidated the structures of DNA-protein crosslinks induced by formaldehyde. Detailed characterization of the formaldehyde-derived linkage of single amino acids with nucleosides by NMR and mass spectrometry established that these amino acids all form cross-links involving formation of a formaldehyde-derived methylene bridge. Our results also demonstrated that Lys-dG cross-links are the most common DNA-protein crosslinks induced by formaldehyde; however, they are very labile. The finding that Cys-CH2-dG cross-links could be initiated by the S-hydroxymethyl group of the cysteine residue led to the identification of a novel dG-CH2-GSH adduct. This adduct is unique because of the involvement of S-hydroxymethylglutathione, a key player in the detoxification of formaldehyde.
After our extensive work on biomarker discovery and validation involving DNA monoadducts and DNA-DNA cross-links, we applied these methods to analyze DNA samples from rats exposed to [13CD2]-formaldehyde for 1 day and 5 days. The results show that exogenous formaldehyde induced N2-hydroxymethyl-dG monoadducts and dG-dG cross-links in DNA from rat nasal mucosa, but did not form [13CD2]-adducts in distant tissues despite analyzing 5 times more DNA than for nasal epithelium. These data provide strong evidence supporting a genotoxic and cytotoxic mode of action for inhaled formaldehyde in the target tissue for carcinogenesis, but do not support the biological plausibility that inhaled formaldehyde causes leukemia in rats.

    Aspects of cyclodextrin host-guest complexes in mass spectrometry

    Cancer is a widespread disease characterized by uncontrolled cellular replication that caused 9.6 million deaths worldwide in 2018. One approach in cancer treatment is inhibiting the replication process by the administration of organometallic compounds that bind to DNA. Cisplatin is one of the most prominent organometallic compounds that reached clinical approval. However, it suffers from severe side effects (e.g., nephrotoxicity) and causes the development of resistance. Various other organometallic drugs have been evaluated for their potential in cancer treatment. Of these, titanocene dichloride entered clinical trials but showed only low efficacy in patients. Titanocene dichloride is a representative of the class of bent metallocene dihalides, which have a tetrahedral structure with two cyclopentadienyl and two halogenide ligands and a metal ion as the central atom. Hydrolysis of the halogenide ligands is a crucial step in the activation of the metallocene, allowing for the interaction with its biological target. Unfortunately, extensive hydrolysis of the halogenide and the cyclopentadienyl ligands is detected for titanocene in an aqueous environment at physiological conditions, leading to its inactivation. One approach for increasing the hydrolytic stability of titanocene is its inclusion within the cavity of a macrocyclic host structure. Cyclodextrins are such macrocyclic compounds, composed of six to eight 1,4-linked α-D-glucopyranose units, and are considered nontoxic upon oral administration. Therefore, several aspects of cyclodextrin host-guest complexes in mass spectrometry have been investigated and are discussed in this thesis. In the first section, the mass spectrometric behavior of cyclodextrins is discussed. The central part of this project was the elucidation of the fragmentation mechanism underlying the decomposition of protonated cyclodextrins.
Linearization of the macrocyclic structure upon charge-induced cleavage of a glycosidic bond has been revealed as the initial dissociation step. Further decomposition of the linearized structure is characterized by neutral loss of glucose subunits. This dissociation step has been stated to occur upon charge-remote cleavage of other glycosidic bonds, leading to the elimination of a zwitterionic moiety which is potentially internally rearranged. In the second section, the focus is on the interaction between titanocene and cyclodextrins as elucidated from mass spectrometric experiments. The obtained data indicated the formation of covalent bonds between titanium and the hydroxy groups at the rim of cyclodextrins rather than the formation of an inclusion complex. Consequently, the interaction of titanocene with cyclodextrins did not improve the hydrolytic stability of titanocene at physiological pH. In-source fragmentation has been found to contribute considerably to the ions detected in full scan mass spectrometry. Therefore, the effect of instrumental parameters on the quality of the obtained full scan mass spectra has been evaluated. While the capillary voltage showed only minor effects, proper adjustment of the capillary temperature and the tube lens voltage significantly improved the quality of the obtained data. In conclusion, diverse aspects of cyclodextrin host-guest complexes have been successfully investigated using mass spectrometry, showing the potential of this analytical technique for various applications.

    Synthetic, Biochemical, X-ray Crystallographic, Computational and High-Throughput Screening Approaches Toward Anthrax Toxin Lethal Factor Inhibition

    University of Minnesota Ph.D. dissertation. October 2015. Major: Medicinal Chemistry. Advisor: Elizabeth Amin. 1 computer file (PDF); xvi, 227 pages. The lethal factor (LF) enzyme secreted by Bacillus anthracis is chiefly responsible for anthrax-related cytotoxicity. In this dissertation, I present the computational design, synthesis, biochemical testing, structural biology, and virtual and high-throughput screening approaches to identify binding requirements for LF inhibition. To this end, we designed ~50 novel compounds to probe design principles and structural requirements for LF. Specifically, in Chapters 2 and 3, computational, synthetic, biochemical and structural biology methods to explore the underinvestigated LF S2′ binding subsite are described. We discovered that LF domain 3 is very flexible and results in a relatively unconstrained S2′ binding site region. Additionally, we found that the S1′ subsite can undergo a novel conformational change resulting in a previously unreported tunnel region, which we term S1′*, that we expect can be explored further to design potent and selective LF inhibitors. Using this novel LF configuration, we virtually screened ~11 million drug-like compounds for activity against LF and have identified a novel compound that inhibits LF with an IC50 of 126 μM. In the course of this work, we found that reliable representation of zinc and other transition metal centers in macromolecules is nontrivial, due to the complexity of the coordination environment and charge distribution at the catalytic center. In Chapter 7, I will present work on applying and optimizing quantum mechanical methods developed by the Truhlar group to accurately calculate bond dissociation energies at low computational cost for various representative Zn2+ and Cd2+ model systems. By analyzing errors, we developed a prescription for an optimal system fragmentation strategy for our models.
With this scheme, we find that the EE-3B-CE method is able to reproduce 53 conventionally calculated bond energies with an average absolute error of only 0.59 kcal/mol. Therefore, one could use the EE-3B-CE approximation to obtain accurate results for large systems and/or identify better parameters for Zn centers for use in virtual screening. Finally, we present the results of a large-scale in vitro HTS campaign of ~250,000 small molecules against LF. After extensive validation involving secondary assays and hit synthesis, we were able to prioritize a key lead for further prosecution.