4,105 research outputs found

    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high throughput methodologies offer researchers the ability to understand complex systems via high dimensional and multi-relational data. One example is the realm of molecular biology where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high dimensional and multirelational data allows for unprecedented detailed analysis, but also presents challenges in accounting for all the variability. High dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high dimensional and multirelational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM) which incorporates the rich information of multirelational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM on categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. 
We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.

    Relation Prediction over Biomedical Knowledge Bases for Drug Repositioning

    Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested in animal and clinical trials, in vitro approaches are first used to identify promising candidates. Likewise, identifying other essential relations (e.g., causation, prevention) between biomedical entities is also critical to understanding biomedical processes. Hence, it is crucial to develop automated relation prediction systems that can yield plausible biomedical relations to expedite the discovery process. In this dissertation, we demonstrate three approaches to predicting treatment relations between biomedical entities for the drug repositioning task using existing biomedical knowledge bases. Our approaches can be broadly labeled as link prediction or knowledge base completion in the computer science literature. Specifically, we first investigate the predictive power of graph paths connecting entities in the publicly available biomedical knowledge base SemMedDB (the entities and relations constitute a large knowledge graph as a whole). To that end, we build logistic regression models utilizing semantic graph pattern features extracted from SemMedDB to predict treatment and causative relations in the Unified Medical Language System (UMLS) Metathesaurus. Second, we study matrix and tensor factorization algorithms for predicting drug repositioning pairs in repoDB, a general purpose gold standard database of approved and failed drug–disease indications. The idea here is to predict repoDB pairs by approximating the given input matrix/tensor structure, where the value of a cell represents the existence of a relation coming from the SemMedDB and UMLS knowledge bases. The essential goal is to predict the test pairs that have a blank cell in the input matrix/tensor based on the shared biomedical context among existing non-blank cells. 
Our final approach involves graph convolutional neural networks where entities and relation types are embedded in a vector space involving neighborhood information. In essence, we minimize an objective function to guide our model to concept/relation embeddings such that distance scores for positive relation pairs are lower than those for the negative ones. Overall, our results demonstrate that recent link prediction methods applied to automatically curated, and hence imprecise, knowledge bases can nevertheless result in high-accuracy drug candidate prediction with appropriate configuration of both the methods and datasets used.
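
The objective described in this abstract (lower distance scores for positive relation pairs than for negative ones) can be illustrated with a margin-based ranking loss. The sketch below is only a minimal illustration, not the dissertation's model: the translation-style distance, the margin value, the function names, and the toy two-dimensional embeddings are all assumptions, and the graph convolutional encoder that would produce the embeddings is omitted.

```python
import math

def distance(head, rel, tail):
    """Assumed translation-style score: Euclidean norm of (head + rel - tail)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, rel, tail)))

def margin_ranking_loss(pos_dists, neg_dists, margin=1.0):
    """Mean hinge loss; a pair contributes zero once the positive distance
    is at least `margin` below its paired negative distance."""
    terms = [max(0.0, margin + p - n) for p, n in zip(pos_dists, neg_dists)]
    return sum(terms) / len(terms)

# Hypothetical toy embeddings (not taken from the dissertation).
drug = [0.0, 0.0]
treats = [1.0, 0.0]
disease_true = [1.0, 0.0]    # drug + treats lands exactly on this tail
disease_false = [-2.0, 0.0]  # corrupted (negative) tail

pos = distance(drug, treats, disease_true)
neg = distance(drug, treats, disease_false)
loss = margin_ranking_loss([pos], [neg])
```

With these toy values the positive pair is already closer than the negative pair by more than the margin, so the loss vanishes; swapping the two distances yields a positive loss that training would then reduce.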

    Discriminative Subgraph Pattern Mining and Its Applications

    My dissertation concentrates on two problems in mining discriminative subgraphs: how to efficiently identify subgraph patterns that discriminate two sets of graphs, and how to improve the discrimination power of subgraph patterns by allowing flexibility. To achieve high efficiency, I adapted evolutionary computation to subgraph mining and proposed learning how to prune the search space from the search history. To allow flexibility, I proposed loosely assembling small rigid graphs for structural flexibility, and a label relaxation technique for label flexibility. I evaluated how applications of discriminative subgraphs can benefit from more efficient and effective mining algorithms. Experimental results showed that the proposed algorithms outperform other algorithms in terms of speed. In addition, using discriminative subgraph patterns found by the proposed algorithms leads to classification accuracy that is competitive with or higher than that of other methods. Allowing structural flexibility enables users to identify subgraph patterns with even higher discrimination power. Doctor of Philosophy

    Seeing the forest for the trees: retrieving plant secondary biochemical pathways from metabolome networks

    Over the last decade, a giant leap forward has been made in resolving the main bottleneck in metabolomics, i.e., the structural characterization of the many unknowns. This has led to the next challenge in this research field: retrieving biochemical pathway information from the various types of networks that can be constructed from metabolome data. Searching for putative biochemical pathways, referred to as biotransformation paths, is complicated because several flaws occur during the construction of metabolome networks. Multiple network analysis tools have been developed to deal with these flaws, while in silico retrosynthesis is emerging as an alternative approach. In this review, the different types of metabolome networks, their flaws, and the various tools to trace these biotransformation paths are discussed.

    VI Workshop on Computational Data Analysis and Numerical Methods: Book of Abstracts

    The VI Workshop on Computational Data Analysis and Numerical Methods (WCDANM) will be held on June 27-29, 2019, in the Department of Mathematics of the University of Beira Interior (UBI), Covilhã, Portugal. It is a unique opportunity to disseminate scientific research in Mathematics in general, with particular relevance to Computational Data Analysis and Numerical Methods in theoretical and/or practical fields, using new techniques and giving special emphasis to applications in Medicine, Biology, Biotechnology, Engineering, Industry, Environmental Sciences, Finance, Insurance, Management and Administration. The meeting will provide a forum for discussion and debate of ideas of interest to the scientific community in general. New scientific collaborations among colleagues, in particular in Masters and PhD projects, are expected to arise from this meeting. The event is open to the entire scientific community (with or without communication/poster).

    A method for identifying ancient introgression between caballine and non-caballine equids using whole genome high throughput data.

    Introgression is one of the main mechanisms that transfer adapted alleles between species. The advantageous variants are positively selected and retained in the recipient population, while the rest of the variants undergo negative selection. When analyzing the horse genome, two alleles were found in the CXCL16 gene, one associated with susceptibility and one with resistance to developing persistent shedding of the Equine Arteritis Virus. The two alleles differ by four non-synonymous variants in exon 1 of the gene. Comparison with three non-caballine equids (zebras, asses and hemiones) revealed that one haplotype was almost identical to the haplotype found in non-caballines, while the other had differences characteristic of 4.5 million years since a common ancestor. Based on this observation, we propose that an ancient introgression event occurred between caballine and non-caballine equids. If so, we should be able to find more instances of introgression between these species. We developed a method to identify putatively introgressed segments in the horse genome. It is estimated that non-caballine equids such as zebras and asses diverged from horses between 4 and 4.5 MYA. Genomic analysis of these animals against the equine reference genome reveals the divergence at both the nucleotide and chromosomal level. Whole genome data for the non-caballine equids, when mapped to the caballine (Equus caballus) reference genome, show a greater frequency of single nucleotide differences than horses have relative to the same reference. We have created a Likelihood Estimate framework that uses this difference in single nucleotide frequencies to predict whether a haplotype evolved along the caballine or non-caballine lineage. Our results demonstrated that these haplotypes are between 0.5 and 2 kb in length and are detectable at a rate of several hundred loci per horse. 
About 1.1% of the equine genome was introgressed, and 64% of the identified putative regions were associated with either structural elements, regulatory regions, or both. These regions were responsible for gene products involved in regulation of response to stimuli, signal transduction, integral components of the cell membrane, and important metabolic pathways such as purine metabolism and thiamine metabolism. Furthermore, these haplotypes occur at high frequency in the horse population, suggesting that they are positively selected by evolution.
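
The idea of classifying a haplotype from its frequency of single nucleotide differences can be sketched as a log-likelihood ratio. This is only an illustrative reading of the abstract, not the authors' framework: the binomial model, the two per-site difference rates, the window length, and the function names are all hypothetical.

```python
import math

def log_likelihood(k, n, rate):
    """Binomial log-likelihood of k differing sites out of n.
    The binomial coefficient is dropped because it cancels in the ratio."""
    return k * math.log(rate) + (n - k) * math.log(1.0 - rate)

def lineage_llr(k, n, rate_caballine, rate_non_caballine):
    """Positive values favour the non-caballine lineage, negative the caballine one."""
    return (log_likelihood(k, n, rate_non_caballine)
            - log_likelihood(k, n, rate_caballine))

# Hypothetical per-site difference rates relative to the horse reference.
RATE_CAB = 0.001   # assumed rate for a typical horse haplotype
RATE_NON = 0.01    # assumed rate for a non-caballine haplotype

llr_few = lineage_llr(1, 1000, RATE_CAB, RATE_NON)    # 1 difference in a 1 kb window
llr_many = lineage_llr(12, 1000, RATE_CAB, RATE_NON)  # 12 differences in a 1 kb window
```

Under these assumed rates, a window with a single difference scores as caballine (negative ratio), while a window with twelve differences scores as putatively introgressed (positive ratio).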

    NASA Tech Briefs, January 2008

    Topics covered include: Induction Charge Detector with Multiple Sensing Stages; Generic Helicopter-Based Testbed for Surface Terrain Imaging Sensors; Robot Electronics Architecture; Optimized Geometry for Superconducting Sensing Coils; Sensing a Changing Chemical Mixture Using an Electronic Nose; Inertial Orientation Trackers with Drift Compensation; Microstrip Yagi Antenna with Dual Aperture-Coupled Feed; Patterned Ferroelectric Films for Tunable Microwave Devices; Micron-Accurate Laser Fresnel-Diffraction Ranging System; Efficient G⁴FET-Based Logic Circuits; Web-Enabled Optoelectronic Particle-Fallout Monitor; SiO2/TiO2 Composite for Removing Hg from Combustion Exhaust; Lightweight Tanks for Storing Liquefied Natural Gas; Hybrid Wound Filaments for Greater Resistance to Impacts; Making High-Tensile-Strength Amalgam Components; Bonding by Hydroxide-Catalyzed Hydration and Dehydration; Balanced Flow Meters without Moving Parts; Deflection-Compensating Beam for Use inside a Cylinder; Four-Point-Latching Microactuator; Curved Piezoelectric Actuators for Stretching Optical Fibers; Tunable Optical Assembly with Vibration Dampening; Passive Porous Treatment for Reducing Flap Side-Edge Noise; Cylindrical Piezoelectric Fiber Composite Actuators; Patterning of Indium Tin Oxide Films; Gimballed Shoulders for Friction Stir Welding; Improved Thermal Modulator for Gas Chromatography; Nuclear-Spin Gyroscope Based on an Atomic Co-Magnetometer; Utilizing Ion-Mobility Data to Estimate Molecular Masses; Optical Displacement Sensor for Sub-Hertz Applications; Polarization/Spatial Combining of Laser-Diode Pump Beams; Spatial Combining of Laser-Diode Beams for Pumping an NPRO; Algorithm Optimally Orders Forward-Chaining Inference Rules; Project Integration Architecture; High Power Amplifier and Power Supply; Estimating Mixing Heights Using Microwave Temperature Profiler; and Multiple-Cone Sunshade for a Spaceborne Telescope.

    Signal and data processing for machine olfaction and chemical sensing: A review

    Signal and data processing are essential elements in electronic noses as well as in most chemical sensing instruments. The multivariate responses obtained by chemical sensor arrays require signal and data processing to carry out the fundamental tasks of odor identification (classification), concentration estimation (regression), and grouping of similar odors (clustering). In the last decade, important advances have shown that proper processing can improve the robustness of the instruments against diverse perturbations such as environmental variables, background changes, and drift. This article reviews the advances made in recent years in signal and data processing for machine olfaction and chemical sensing.

    Computational and statistical methods for analyzing DNA sequencing data and applications to data from the Estonian Genome Center, University of Tartu

    The electronic version of the dissertation does not include the publications. Today, methods based on next-generation sequencing (NGS) make it possible to determine human genome sequences in large cohorts. In the process, very large volumes of data are produced, which pose a number of challenges in both informatics and statistics. Between 2002 and 2011, the Estonian Genome Center of the University of Tartu (TÜ EGV) collected gene samples from more than 50,000 people, and a further 100,000 will be added this year. To date, the DNA of more than 5,500 gene donors has been analysed with various NGS methods. This doctoral thesis proposes a general framework for processing the NGS data produced at TÜ EGV and also investigates how best to account for the genetic distinctiveness of individuals of Estonian origin. One widely used NGS method is sequencing of the exome, i.e. all protein-coding gene regions, which allows rare and de novo gene variants to be found efficiently and is therefore applied in medical genetics to identify the gene mutations behind Mendelian diseases. In the first part of the thesis, data from three Estonian families were analysed and a potentially pathogenic mutation was identified in all three cases, which may allow better treatments to be developed in the future. Genome sequencing data were also analysed together with clinical blood measurements. This analysis highlighted the advantages of a population-based biobank, which, in addition to rich genomic data, also contains valuable information on various diseases and traits. The study identified significant associations between CEBPA gene variants and basophil count, the latter playing a role in the symptoms of several autoimmune diseases. To increase the power of genome-wide association studies, missing gene variants are predicted, i.e. imputed. To make data analysis more effective specifically for individuals of Estonian origin, genome sequencing data were used to create an Estonian-specific imputation panel. 
Missing gene variants were then imputed in three ways, using the Estonian-specific panel as well as two multi-ethnic panels. The comparison showed that the Estonian-specific panel yields more gene variants of higher quality, and that its advantage is especially pronounced for rare variants. Next-generation sequencing (NGS) technology enables large-scale, routine sequencing in large cohorts. This thesis demonstrated that the analysis of NGS data has huge potential in several fields, but also requires massive computational power. Also, with the increase of data volumes, there is an incessant need for the development of computational and statistical methods. Covering the whole spectrum of protein-coding regions in a cost-effective way, exome sequencing opens new opportunities for quick and exact large-scale screenings. In the first part of the thesis we analysed three Estonian families with Mendelian diseases and detected potentially causative gene variants for each case. These projects highlighted that a tight collaboration between data scientists and medical geneticists can lead to findings with considerable impact in the research of rare genetic disorders and have the potential to lead to successful therapies in the future. Population-based biobanks provide numerous opportunities for expanding phenotypic datasets. We used additional blood cell measurements from the electronic medical records, and our genome-wide scan detected a previously undiscovered association with basophil counts near the CEBPA gene and highlighted their role in autoimmune regulation. This example opens new dimensions for scanning the underlying genetic basis of a variety of traits and diseases. To increase the resolution of genome-wide scans, imputation is routinely implemented to incorporate variants that are not directly genotyped. 
We had an opportunity to construct an imputation reference panel for Estonians based on genome sequencing data. We showed that the utilization of a population-specific reference panel provided significantly higher imputation confidence for rare variants compared to larger, multi-ethnic panels. In the downstream analysis, we observed a huge gain in gene-based rare variant testing. As one of the main results of this thesis, the Estonian-specific imputation reference panel has been created and tested, and is ready to serve for a long time. This includes data processing in the framework of the ongoing initiative to invite 100,000 Estonians to join the Biobank cohort, with the purpose of developing efficient disease prevention and treatment guides for the implementation of personalized medicine.