1,148 research outputs found

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Full text link
    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    The Parallelism Motifs of Genomic Data Analysis

    Get PDF
    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

    Machine Learning for Fluid Mechanics

    Full text link
    The field of fluid mechanics is rapidly advancing, driven by unprecedented volumes of data from field measurements, experiments and large-scale simulations at multiple spatiotemporal scales. Machine learning offers a wealth of techniques to extract information from data that could be translated into knowledge about the underlying fluid mechanics. Moreover, machine learning algorithms can augment domain knowledge and automate tasks related to flow control and optimization. This article presents an overview of past history, current developments, and emerging opportunities of machine learning for fluid mechanics. It outlines fundamental machine learning methodologies and discusses their uses for understanding, modeling, optimizing, and controlling fluid flows. The strengths and limitations of these methods are addressed from the perspective of scientific inquiry that considers data as an inherent part of modeling, experimentation, and simulation. Machine learning provides a powerful information processing framework that can enrich, and possibly even transform, current lines of fluid mechanics research and industrial applications.Comment: To appear in the Annual Reviews of Fluid Mechanics, 202

    Kern-basierte Lernverfahren für das virtuelle Screening

    Get PDF
    We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principle component analysis to projection error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression. Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in this, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow. Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels. Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principle component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used for the identification of compounds structurally dissimilar to the training samples, leading to projection error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principle component analysis to learn the concept of fatty acids. The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 +/- 0.2 muM, inactive on PPAR alpha and PPAR beta/delta at 10 muM). The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and, hints at the natural product origins of pharmacophore patterns in synthetic ligands.Wir untersuchen moderne Kern-basierte maschinelle Lernverfahren für das Liganden-basierte virtuelle Screening. Insbesondere entwickeln wir einen neuen Graphkern auf Basis iterativer Graphähnlichkeit und optimaler Knotenzuordnungen, setzen die Kernhauptkomponentenanalyse für Projektionsfehler-basiertes Novelty Detection ein, und beschreiben die Entdeckung eines neuen selektiven Agonisten des Peroxisom-Proliferator-aktivierten Rezeptors gamma mit Hilfe von Gauß-Prozess-Regression. Virtuelles Screening ist die rechnergestützte Priorisierung von Molekülen bezüglich einer vorhergesagten Eigenschaft. Es handelt sich um ein Problem der Chemieinformatik, das in der Trefferfindungsphase der Medikamentenentwicklung auftritt. Seine Liganden-basierte Variante beruht auf dem Ähnlichkeitsprinzip, nach dem (strukturell) ähnliche Moleküle tendenziell ähnliche Eigenschaften haben. In unserer Beschreibung des Lösungsansatzes mit Kern-basierten Lernverfahren betonen wir die Bedeutung molekularer Repräsentationen, einschließlich der auf ihnen definierten (Un)ähnlichkeitsmaße. Wir untersuchen Effekte in hochdimensionalen chemischen Deskriptorräumen, ihre Auswirkungen auf Ähnlichkeits-basierte Verfahren und geben einen Literaturüberblick zu Empfehlungen zur retrospektiven Validierung, einschließlich eines Beispiel-Workflows. Graphkerne sind formale Ähnlichkeitsmaße, die inneren Produkten entsprechen und direkt auf Graphen, z.B. annotierten molekularen Strukturgraphen, definiert werden. Wir geben einen Literaturüberblick über Graphkerne, insbesondere solche, die auf zufälligen Irrfahrten, Subgraphen und optimalen Knotenzuordnungen beruhen. Indem wir letztere mit einem Ansatz zur iterativen Graphähnlichkeit kombinieren, entwickeln wir den iterative similarity optimal assignment Graphkern. Wir beschreiben einen iterativen Algorithmus, zeigen dessen Konvergenz sowie die Eindeutigkeit der Lösung, und geben eine obere Schranke für die Anzahl der benötigten Iterationen an. In einer retrospektiven Studie zeigte unser Graphkern konsistent bessere Ergebnisse als chemische Deskriptoren und andere, auf optimalen Knotenzuordnungen basierende Graphkerne. Chemische Datensätze liegen oft auf Mannigfaltigkeiten niedrigerer Dimensionalität als der umgebende chemische Deskriptorraum. Dimensionsreduktionsmethoden erlauben die Identifikation dieser Mannigfaltigkeiten und stellen dadurch deskriptive Modelle der Daten zur Verfügung. Für spektrale Methoden auf Basis der Kern-Hauptkomponentenanalyse ist der Projektionsfehler ein quantitatives Maß dafür, wie gut neue Daten von solchen Modellen beschrieben werden. Dies kann zur Identifikation von Molekülen verwendet werden, die strukturell unähnlich zu den Trainingsdaten sind, und erlaubt so Projektionsfehler-basiertes Novelty Detection für virtuelles Screening mit ausschließlich positiven Beispielen. Wir führen eine Machbarkeitsstudie zur Lernbarkeit des Konzepts von Fettsäuren durch die Hauptkomponentenanalyse durch. Der Peroxisom-Proliferator-aktivierte Rezeptor (PPAR) ist ein im Zellkern vorkommender Rezeptor, der den Fett- und Zuckerstoffwechsel reguliert. Er spielt eine wichtige Rolle in der Entwicklung von Krankheiten wie Typ-2-Diabetes und Dyslipidämie. Wir etablieren ein Gauß-Prozess-Regressionsmodell für PPAR gamma-Agonisten mit chemischen Deskriptoren und unserem Graphkern durch gleichzeitiges Lernen mehrerer Kerne. Das Screening einer kommerziellen Substanzbibliothek und die anschließende Testung 15 ausgewählter Substanzen in einem Zell-basierten Transaktivierungsassay ergab vier aktive Substanzen. Eine davon, ein Naturstoff mit Cyclobutan-Grundgerüst, ist ein voller selektiver PPAR gamma-Agonist (EC50 = 10 +/- 0,2 muM, inaktiv auf PPAR alpha und PPAR beta/delta bei 10 muM). Unsere Studie liefert einen neuen PPAR gamma-Agonisten, legt den Wirkmechanismus eines bioaktiven Naturstoffs offen, und erlaubt Rückschlüsse auf die Naturstoffursprünge von Pharmakophormustern in synthetischen Liganden
    corecore