
    Interpretable statistics for complex modelling: quantile and topological learning

    As the complexity of our data has increased exponentially in recent decades, so has our need for interpretable features. This thesis revolves around two paradigms for approaching this quest for insights. In the first part we focus on parametric models, where the problem of interpretability can be seen as a “parametrization selection”. We introduce a quantile-centric parametrization and show the advantages of our proposal in the context of regression, where it bridges the gap between classical generalized linear (mixed) models and increasingly popular quantile methods. The second part of the thesis, concerned with topological learning, tackles the problem from a non-parametric perspective. As topology can be thought of as a way of characterizing data in terms of their connectivity structure, it allows complex and possibly high-dimensional data to be represented through a few features, such as the number of connected components, loops and voids. We illustrate how the emerging branch of statistics devoted to recovering topological structures in the data, Topological Data Analysis, can be exploited for both exploratory and inferential purposes, with a special emphasis on kernels that preserve the topological information in the data. Finally, we show with an application how these two approaches can borrow strength from one another in the identification and description of brain activity through fMRI data from the ABIDE project.
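As a minimal sketch of the simplest topological feature named in this abstract, the number of connected components, the following Python snippet links points that lie within a chosen scale `eps` of each other and counts the resulting groups with a union-find structure. The function name, the sample points and the value of `eps` are illustrative assumptions and are not taken from the thesis; Topological Data Analysis typically tracks such counts across a range of scales rather than at one fixed scale.

```python
import math
from itertools import combinations


def count_components(points, eps):
    """Count connected components of the graph joining points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two groups

    return len({find(i) for i in range(len(points))})


# Hypothetical point cloud: two well-separated clusters in the plane.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0), (5.5, 5.0)]
n_components = count_components(points, eps=1.0)
```

At a very small `eps` every point is its own component, and at a very large `eps` all points merge into one; the scales in between are where the connectivity structure of the data shows up.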

    Roughness of molecular property landscapes and its impact on modellability

    In molecular discovery and drug design, structure-property relationships and activity landscapes are often qualitatively or quantitatively analyzed to guide the navigation of chemical space. The roughness (or smoothness) of these molecular property landscapes is one of their most studied geometric attributes, as it can characterize the presence of activity cliffs, with rougher landscapes generally expected to pose tougher optimization challenges. Here, we introduce a general, quantitative measure for describing the roughness of molecular property landscapes. The proposed roughness index (ROGI) is loosely inspired by the concept of fractal dimension and strongly correlates with the out-of-sample error achieved by machine learning models on numerous regression tasks. Comment: 17 pages, 6 figures, 2 tables (SI with 17 pages, 16 figures)

    Scientists' bounded mobility on the epistemic landscape

    Despite persistent efforts to reveal the temporal patterns in scientific careers, little attention has been paid to the spatial patterns of scientific activities in the knowledge space. Here, drawing on millions of papers in six disciplines, we treat scientists' publication sequences as "walks" on a quantifiable epistemic landscape constructed from large-scale bibliometric corpora by combining embedding and manifold learning algorithms. We aim to reveal individual research topic dynamics and the association between research radius and academic performance over the course of scientists' careers. Intuitively, the visualization shows the localized and bounded nature of these trajectories. We further find that the distributions of scientists' transition radius and transition pace are both left-skewed compared with the results of controlled experiments. We then observe a mixed exploration and exploitation pattern, and the corresponding strategic trade-off, in research transitions: scientists both deepen their previous research with a frequency bias and explore new research with a knowledge proximity bias. We develop a bounded exploration-exploitation (BEE) model to reproduce the observed patterns. Moreover, the association between scientists' research radius and academic performance shows that extensive exploration does not lead to a sustained increase in academic output but does lead to a decrease in impact. In addition, we note that disruptive findings derive more often from extensive transitions, although this association saturates. Our study contributes to the understanding of scientists' mobility patterns in the knowledge space, with significant implications for science policy-making. Comment: article paper, 47 pages, 29 figures, 4 tables

    Computational approaches to virtual screening in human central nervous system therapeutic targets

    In the past several years of drug design, advanced high-throughput synthetic and analytical chemical technologies have continuously produced large numbers of compounds. These large collections of chemical structures have resulted in many public and commercial molecular databases, and the availability of larger data sets has provided the opportunity to develop new knowledge mining or virtual screening (VS) methods. This research work is motivated by the fact that one of the main interests in the modern drug discovery process is the development of new methods to predict compounds with large therapeutic profiles (multi-targeting activity), which is essential for the discovery of novel drug candidates against complex multifactorial diseases like central nervous system (CNS) disorders. This work aims to advance VS approaches by providing a deeper understanding of the relationship between chemical structure and pharmacological properties, and to design new fast and robust tools for drug design against different targets/pathways. To accomplish these goals, the first challenge is dealing with large data sets of diverse molecular structures to derive a correlation between structure and activity. To this end, an extendable and customizable, fully automated in-silico Quantitative Structure-Activity Relationship (QSAR) modeling framework was developed in the first phase of this work. QSAR models are computationally fast and powerful tools for screening huge databases of compounds to determine the biological properties of chemical molecules based on their chemical structure. The framework reliably implemented a full QSAR modeling pipeline from data preparation to model building and validation.
The main distinctive features of the designed framework include (a) efficient data curation, (b) prior estimation of data modelability, and (c) an optimized variable selection methodology that was able to identify the most biologically relevant features responsible for compound activity. Since the underlying principle in QSAR modeling is the assumption that the structures of molecules are mainly responsible for their pharmacological activity, the accuracy with which different structural representation approaches decode molecular structural information largely influences model predictability. To find the best approach in QSAR modeling, a comparative analysis was carried out of two main categories of molecular representations: descriptor-based (vector space) and distance-based (metric space) methods. Results obtained from five QSAR data sets showed that the distance-based method was superior in capturing the more relevant structural elements for the accurate characterization of molecular properties in highly diverse data sets (remote chemical space regions). This finding further assisted the development of a novel tool for molecular space visualization to increase the understanding of structure-activity relationships (SAR) in drug discovery projects by exploring the diversity of large heterogeneous chemical data. In the proposed visual approach, four nonlinear dimensionality reduction (DR) methods were tested to represent molecules in lower dimensionality (2D projected space), on which a non-parametric 2D kernel density estimation (KDE) was applied to map the most likely activity regions (activity surfaces). The analysis of the resulting probabilistic surfaces of molecular activities (PSMAs) from the four datasets showed that these maps have both descriptive and predictive power, and thus can be used as a spatial classification model, a tool to perform VS using only the structural similarity of molecules.
The above QSAR modeling approach was complemented with molecular docking, an approach that predicts the best mode of drug-target interaction. Both approaches were integrated to develop a rational and reusable polypharmacology-based VS pipeline with an improved hit identification rate. For the validation of the developed pipeline, a dual-targeting drug design model against Parkinson’s disease (PD) was derived to identify novel inhibitors for improving the motor functions of PD patients by enhancing the bioavailability of dopamine and avoiding neurotoxicity. The proposed approach can easily be extended to more complex multi-targeting disease models containing several targets and anti-/off-targets to achieve increased efficacy and reduced toxicity in multifactorial diseases like CNS disorders and cancer. This thesis addresses several issues of cheminformatics methods (e.g., molecular structure representation, machine learning, and molecular similarity analysis) to improve and design new computational approaches used in chemical data mining. Moreover, an integrative drug design pipeline is developed to improve the polypharmacology-based VS approach. The presented methodology can identify the most promising multi-targeting candidates for experimental validation of drug-target networks at the systems biology level in the drug discovery process.
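The idea of an activity surface built from a 2D kernel density estimate can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the surface is derived from the KDE, so the sketch assumes one Gaussian KDE per class (active and inactive) combined through Bayes' rule, with a fixed bandwidth. The function names, the bandwidth and the 2D coordinates are hypothetical and stand in for molecules already projected by some dimensionality reduction method.

```python
import math


def gaussian_kde_2d(samples, bandwidth):
    """Return a density function built from an isotropic 2D Gaussian kernel."""
    norm = 1.0 / (len(samples) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for sx, sy in samples:
            sq_dist = (x - sx) ** 2 + (y - sy) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density


def activity_probability(x, y, active_density, inactive_density, prior_active=0.5):
    """Posterior probability of activity at (x, y) from the two class densities."""
    a = prior_active * active_density(x, y)
    b = (1.0 - prior_active) * inactive_density(x, y)
    # Fall back to the prior where both densities underflow to zero.
    return a / (a + b) if (a + b) > 0.0 else prior_active


# Hypothetical 2D projections of active and inactive molecules (made-up values).
actives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
inactives = [(4.0, 4.0), (4.2, 3.8), (3.9, 4.1)]

active_density = gaussian_kde_2d(actives, bandwidth=0.5)
inactive_density = gaussian_kde_2d(inactives, bandwidth=0.5)

p_near_actives = activity_probability(1.0, 1.0, active_density, inactive_density)
p_near_inactives = activity_probability(4.0, 4.0, active_density, inactive_density)
```

Evaluating `activity_probability` on a grid of `(x, y)` values would give a surface that can be plotted, and thresholding it gives a simple spatial classifier in the spirit of the one described.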

    Statistical and image processing techniques for remote sensing in agricultural monitoring and mapping

    Throughout most of history, increasing agricultural production has been largely driven by expanded land use and, especially in the 19th and 20th centuries, by technological innovation in breeding, genetics and agrochemistry as well as intensification through mechanization and industrialization. More recently, information technology, digitalization and automation have started to play a more significant role in achieving higher productivity with lower environmental impact and reduced use of resources. This includes two trends on opposite scales: precision farming, which applies detailed observations at sub-field level to support local management, and large-scale agricultural monitoring, which observes regional patterns in plant health and crop productivity to help manage macroeconomic and environmental trends. In both contexts, remote sensing imagery plays a crucial role that is growing due to decreasing costs and increasing accessibility of both data and means of processing and analysis. The large archives of free imagery with global coverage can be expected to further increase adoption of remote sensing techniques in coming years. This thesis addresses multiple aspects of remote sensing in agriculture by presenting new techniques in three distinct research topics: (1) remote sensing data assimilation in dynamic crop models; (2) agricultural field boundary detection from remote sensing observations; and (3) contour extraction and field polygon creation from remote sensing imagery. These key objectives are achieved by combining methods of probability analysis, uncertainty quantification, evolutionary learning and swarm intelligence, graph theory, image processing, deep learning and feature extraction. Four new techniques have been developed. Firstly, a new data assimilation technique based on statistical distance metrics and probability distribution analysis to achieve a flexible representation of model- and measurement-related uncertainties.
Secondly, a method for detecting boundaries of agricultural fields based on remote sensing observations, designed to rely only on image-based information in multi-temporal imagery. Thirdly, an improved boundary detection approach based on deep learning techniques and a variety of image features. Fourthly, a new active contours method called Graph-based Growing Contours (GGC) that allows automated extraction of complex boundary networks from imagery. The new approaches are tested and evaluated on multiple study areas in the states of Schleswig-Holstein, Niedersachsen and Sachsen-Anhalt, Germany, based on combine harvester measurements, cadastral data and manual mappings. All methods were designed with flexibility and applicability in mind. They performed similarly to or better than other existing methods and showed potential for large-scale application and for their synergetic use. Thanks to low data requirements and flexible use of inputs, their application is constrained neither to the specific applications presented here nor to the use of a specific type of sensor or imagery. This flexibility, in theory, enables their use even outside of the field of remote sensing.

    Quantifying Floral Resource Availability Using Unmanned Aerial Systems and Machine Learning Classifications to Predict Bee Community Structure

    Bees are important for agricultural and non-agricultural ecosystems because they pollinate both wild plants and commercial crops. Flowers provide pollen and nectar resources that bees use to survive and reproduce. Measuring the relationship between the floral community and the bee community may help apiarists and land managers make informed decisions in managing wild and domesticated bee species. Manual methods to describe and count flowering vegetation are costly in time and personnel. Unmanned aerial vehicle (UAV) technology may be an efficient way to describe and count flowering vegetation on a large scale. UAVs with classification analysis and ground transect surveys were used to describe the variation in the flower communities at three field sites in non-agricultural environments. The variation in bee communities was also recorded at the field sites. Seven unique flower species were quantified using UAVs. Using the UAV imagery, it was determined that the period of flowering and the changes in flower coverage varied among species. Twenty-two unique flower species were described and counted using the ground transect surveys, and 136 bees from 11 genera were recorded using net surveys. I tested the hypothesis that increased bee diversity and abundance would positively correlate with increased floral diversity and abundance using seven simple linear regression models. I found that the floral resource data collected from ground transect surveys predict bee diversity, bee richness, and bee abundance. I also found that floral abundance data captured by UAVs predict bee abundance at the field sites. Finally, I found that UAV floral abundance predicts ground transect floral abundance, suggesting a positive relationship between the different sampling methods.
My results support previous research suggesting that a high diversity of resources will support a high diversity of insects, and that habitats with abundant flowers have greater possibilities for partitioning of available resources. My results also support UAVs as an efficient method for describing and counting floral resources in non-agricultural settings. Further research should include using UAV imagery to count flowers in order to predict bee communities on a landscape scale.
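A simple linear regression of the kind mentioned in this abstract can be sketched in a few lines of Python. The function below fits ordinary least squares with a single predictor and reports the coefficient of determination. The variable names and the numbers are made-up placeholders for illustration only; they are not data from the study, and the study's own models may have been fitted with different software and diagnostics.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination: share of variance in y explained by x.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot

    return slope, intercept, r2


# Hypothetical values: floral abundance per survey and bee abundance per survey.
floral_abundance = [1.0, 2.0, 3.0, 4.0]
bee_abundance = [3.0, 5.0, 7.0, 9.0]

slope, intercept, r2 = fit_simple_linear_regression(floral_abundance, bee_abundance)
```

A positive `slope` with a high `r2` would correspond to the hypothesized positive relationship; with real survey data one would also check the significance of the slope before drawing conclusions.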