640 research outputs found

    Latent Representation and Sampling in Network: Application in Text Mining and Biology.

    In classical machine learning, hand-designed features are used to learn a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. A network is one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from successive time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract features substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. This information in the abstract feature representation of sentences improves various text-mining tasks substantially over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from network(s). We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.

    Investigation of microbial metal-sulfide interfacial environments under mineral bioleach simulated conditions

    This research pertains to bioleaching of copper-containing ores, with particular reference to the copper sulfide mineral chalcopyrite (CuFeS2). While it is focused on heap bioleaching, it has applications to stirred tank bioleaching operations. In the context of bioleaching, microbial extra-cellular polymeric substance (EPS) components are thought to complex chemical oxidants and extend the chemical reaction space available for mineral dissolution reactions, making the microbial-mineral-EPS interface the dominant active zone in terms of microbial oxidation and mineral dissolution. There is a limited understanding of microbial biofilm formation within a bioleach heap. The fact that various microorganisms each have a set of defined or optimal conditions under which they colonise and proliferate has substantial implications. Understanding what creates favourable interfacial microenvironments that enable a sessile population to flourish (and thereby decrease lag time) has great implications for minimising costs and maximising productivity. Furthermore, limited work has been conducted on thermophilic microorganisms relevant to bioleaching. These microorganisms are pertinent to successful bioleaching at high temperatures, and work incorporating low-grade ores and gangue mineralogy is also scarce. The aim of this research is to provide a thorough investigation into microbial-metal sulfide interfacial environments in situ, using a thermophilic archaeon, M. hakonensis, low-grade metal-sulfide ores, a series of temperature regimes, heap-simulating conditions and an in-depth extraction and analysis of the EPS produced under varied culturing conditions.

    Computational Investigations of Biomolecular Mechanisms in Genomic Replication, Repair and Transcription

    High-fidelity maintenance of the genome is imperative to ensuring stability and proliferation of cells. The genetic material (DNA) of a cell faces a constant barrage of metabolic and environmental assaults throughout its lifetime, ultimately leading to DNA damage. Left unchecked, DNA damage can result in genomic instability, inviting a cascade of mutations that initiate cancer and other aging disorders. Thus, a large area of research has been dedicated to understanding how DNA is damaged, repaired, expressed and replicated. At the heart of these processes lie complex macromolecular dynamics coupled with intricate protein-DNA interactions. Through advanced computational techniques it has become possible to probe these mechanisms at the atomic level, providing a physical basis to describe biomolecular phenomena. To this end, we have performed studies aimed at elucidating the dynamics and interactions intrinsic to the functionality of biomolecules critical to maintaining genomic integrity: modeling the DNA editing mechanism of DNA polymerase III, uncovering the DNA damage recognition/repair mechanism of thymine DNA glycosylase and linking genetic disease to the functional dynamics of the pre-initiation complex transcription machinery. Collectively, our results elucidate the dynamic interplay between proteins and DNA, further broadening our understanding of these complex processes involved in genomic maintenance.

    Computational Approaches To Anti-Toxin Therapies And Biomarker Identification

    This work describes the fundamental study of two bacterial toxins with computational methods, the rational design of a potent inhibitor using molecular dynamics, and the development of two bioinformatic methods for mining genomic data. Clostridium difficile is an opportunistic bacillus which produces two large glucosylating toxins. These toxins, TcdA and TcdB, cause severe intestinal damage. As Clostridium difficile harbors considerable antibiotic resistance, one treatment strategy is to prevent the tissue damage that the toxins cause. The catalytic glucosyltransferase domain of TcdA and TcdB was studied using molecular dynamics in the presence of both a protein-protein binding partner and several substrates. These experiments were combined with lead optimization techniques to create a potent irreversible inhibitor which protects 95% of cells in vitro. Dynamics studies on a TcdB cysteine protease domain were performed to identify an allosteric communication pathway. Comparative analysis of the static and dynamic properties of the TcdA and TcdB glucosyltransferase domains was carried out to determine the basis for the differential lethality of these toxins. Large-scale biological data is readily available in the post-genomic era, but it can be difficult to use that data effectively. Two bioinformatics methods were developed to process whole-genome data. Software was developed to return all genes containing a motif in a single genome. This provides a list of genes which may be within the same regulatory network or targeted by a specific DNA binding factor. A second bioinformatic method was created to link the data from genome-wide association studies (GWAS) to specific genes. GWAS results are frequently subjected to statistical analysis, but mutations are rarely investigated structurally. HyDn-SNP-S allows a researcher to find mutations in a gene that correlate with a GWAS-studied phenotype. Across human DNA polymerases, this resulted in strongly predictive haplotypes for breast and prostate cancer. Molecular dynamics applied to DNA Polymerase Lambda suggested a structural explanation for the decrease in polymerase fidelity with that mutant. When applied to Histone Deacetylases, mutations were found that alter substrate binding and post-translational modification.

    Profiling patterns of interhelical associations in membrane proteins.

    A novel set of methods has been developed to characterize polytopic membrane proteins at the topological, organellar and functional level, in order to reduce the existing functional gap in the membrane proteome. Firstly, a novel clustering tool named PROCLASS was implemented to facilitate the manual curation of large sets of proteins, in readiness for feature extraction. TMLOOP and TMLOOP writer were implemented to refine current topological models by predicting membrane dipping loops. TMLOOP applies weighted predictive rules in a collective motif method, to overcome the inherent limitations of single motif methods. The approach achieved 92.4% accuracy in sensitivity and 100% reliability in specificity, and 1,392 topological models described in the Swiss-Prot database were refined. The subcellular location (TMLOCATE) and molecular function (TMFUN) prediction methods rely on the TMDEPTH feature extraction method along with data mining techniques. TMDEPTH uses refined topological models and amino acid sequences to calculate pairs of residues located at a similar depth in the membrane. Evaluation of TMLOCATE showed a normalized accuracy of 75% in discriminating between proteins belonging to the main organelles. At a sequence similarity threshold of 40%, TMFUN predicted main functional classes with a sensitivity of 64.1-71.4%, and 70% of the olfactory GPCRs were correctly predicted. At a sequence similarity threshold of 90%, main functional classes were predicted with a sensitivity of 75.6-92.8%, and class A GPCRs were sub-classified with a sensitivity of 84.5-92.9%. These results reflect a direct association between the spatial arrangement of residues in the transmembrane regions and the capacity for polytopic membrane proteins to carry out their functions. The developed methods have for the first time categorically shown that the transmembrane regions hold essential information associated with a wide range of functional properties such as filtering and gating processes, subcellular location and molecular function.

    A Bioinformatics Study of Protein Conformational Flexibility and Misfolding: a Sequence, Structure and Dynamics Approach

    This PhD thesis, titled "A Bioinformatics Study of Protein Conformational Flexibility and Misfolding: a Sequence, Structure and Dynamics Approach", comprises the results and conclusions obtained from three different but related research projects. They cover aspects of the phenomenon of protein local conformational instability, its relationship with protein function, evolvability and aggregation, and the effect of genetic variations on protein conformational instability related to Conformational Diseases. These projects include the prediction of putative prion proteins in complete proteomes and the study of prion biology from a genomic perspective, the prediction of conformationally unstable protein regions and the existence of a structural framework for linking conformational instability to folding and function, and the establishment of a rationale for assessing the connection between mutations and disease phenotypes in Conformational Diseases.

    Towards automated structure determination

    The first part of this work dealt with the conventional structural and dynamic characterization of folded and intrinsically unfolded proteins by NMR spectroscopy. ICln is a highly conserved and essential 27 kDa protein involved in various signal transduction pathways such as cell volume regulation, angiogenesis, and RNA processing. Based on reconstitution experiments in membrane lipids and other results, a chloride-channel-forming function was proposed for ICln. Although knock-down experiments indicated a direct functional role of this protein in cell volume regulation, a closer comparison of the biophysical characteristics of chloride currents from cells under hypoosmotic conditions (which are practically identical in all cell types studied so far) with those obtained from ICln overexpression in Xenopus oocytes revealed some clear differences. Furthermore, immunofluorescence experiments indicated a primarily cytoplasmic localization of the protein under normal conditions, with translocation into or near the plasma membrane occurring only upon hypoosmotic induction. These results raised some doubts about a pore-forming channel function of ICln, and many researchers subsequently assigned the protein a purely regulatory role in this context. Nevertheless, identifying or excluding potential chloride-channel-forming candidates is enormously important, since physiological changes in chloride permeability underlie many serious diseases, such as osteopetrosis, Dent's disease and Bartter syndrome. To this end, the tertiary structure of ICln in solution was determined. The N-terminal part of the protein was found to fold into a pleckstrin homology domain-like structure (a structural motif already identified in many proteins that regulate signal transduction), whereas the C-terminal segment is intrinsically unfolded, with a few sequence regions showing a weak secondary structure preference. In addition, interactions were demonstrated with the membrane-proximal cytoplasmic part of the platelet alpha2beta integrin protein and with the factor LSm4 (a functional protein in snRNP biogenesis). A further manual structure analysis concerned the determination of the solution conformation of Cyclophilin D, a member of the immunophilin family that binds to the mitochondrial permeability pore apparatus and is an interesting target protein for the treatment of neurodegenerative diseases. In addition, a previously published NMR structure of the LIM1 domain of CRP2 was refined. Further NMR analyses dealt with the intrinsically unfolded proteins BASP1 and Osteopontin (OPN), both very promising target proteins for cancer research. To help cope with the enormous effort of a structural genomics project, a simple, so-called direct method for rapid structure characterization was developed in the second part of this work and tested on experimental data. Finally, the performance of a program written by our group leader for the automatic prediction of ligand binding sites was evaluated using various protein-ligand complexes from the database.