    Use of structure-activity landscape index curves and curve integrals to evaluate the performance of multiple machine learning prediction models

    Background: Standard approaches to assessing predictive models apply common statistical measures to the entire data set. They give an overview of average model performance across the whole predictive space but little insight into a model's applicability within that space. Guha and Van Drie recently proposed structure-activity landscape index (SALI) curves, summarized by the SALI curve integral (SCI), as a means to map the predictive power of computational models within the predictive space. This approach evaluates model performance by assessing the accuracy of pairwise predictions, comparing compound pairs in a manner similar to that of medicinal chemists.
    Results: The SALI approach was used to evaluate the performance of continuous prediction models for MDR1-MDCK in vitro efflux potential. Efflux models were built with the ADMET Predictor neural net, support vector machine, kernel partial least squares, and multiple linear regression engines, as well as SIMCA-P+ partial least squares and random forest from Pipeline Pilot as implemented by AstraZeneca, using molecular descriptors from SimulationsPlus and AstraZeneca.
    Conclusion: The results indicate that the choice of training set is of great importance to the resulting model quality, and that the SCI values calculated for these models were very similar to their Kendall τ values. We therefore suggest using the SALI/SCI paradigm to evaluate predictive model performance, which will allow more informed decisions regarding model utility. The use of SALI graphs and curves provides an additional level of quality assessment for predictive models.
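    The pairwise idea behind SALI can be illustrated with a minimal sketch. It assumes the commonly quoted definition SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)); the function names, the `eps` guard, and the `min_sali` cutoff are hypothetical choices for illustration, not the paper's implementation. The score is (correctly ordered pairs - misordered pairs) / pairs considered, which resembles Kendall τ when every pair is included. Sweeping the cutoff to trace a curve and integrating it to obtain the SCI are not reproduced here.

```python
from itertools import combinations


def sali(act_i, act_j, sim_ij, eps=1e-9):
    """SALI for one compound pair: |A_i - A_j| / (1 - sim(i, j)).

    eps is an illustrative guard against division by zero for identical
    structures (sim == 1); it is an assumption, not part of the definition.
    """
    return abs(act_i - act_j) / max(1.0 - sim_ij, eps)


def pairwise_order_score(observed, predicted, sims, min_sali=0.0):
    """Return (correct - wrong) / total over pairs with observed SALI >= min_sali.

    A pair is 'correct' when the model ranks the two compounds in the same
    order as the observed activities, and 'wrong' when it reverses them.
    Ties count toward the total but toward neither correct nor wrong.
    """
    correct = wrong = total = 0
    for i, j in combinations(range(len(observed)), 2):
        if sali(observed[i], observed[j], sims[i][j]) < min_sali:
            continue
        total += 1
        agreement = (observed[i] - observed[j]) * (predicted[i] - predicted[j])
        if agreement > 0:
            correct += 1
        elif agreement < 0:
            wrong += 1
    return (correct - wrong) / total if total else 0.0


# Toy data: three compounds, all pairwise similarities set to 0.5.
observed = [1.0, 2.0, 3.0]
sims = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
good_model = [1.1, 1.9, 3.5]   # preserves the observed ordering
bad_model = [1.0, 5.0, 0.5]    # reverses two of the three pairs
```

    Raising `min_sali` restricts the score to the steepest "activity cliffs", which is where a model with good average statistics may still fail.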

    Out of equilibrium Statistical Physics of learning

    In the study of hard optimization problems, it is often infeasible to achieve full analytic control over the dynamics of the algorithmic processes that find solutions efficiently. In many cases, a static approach can provide considerable insight into the dynamical properties of these algorithms: the geometrical structures found in the energy landscape can strongly affect the stationary states and the optimal configurations reached by the solvers. In this context, a classical Statistical Mechanics approach, relying on the assumption that a Boltzmann-Gibbs equilibrium is reached asymptotically, can yield misleading predictions when the studied algorithms include stochastic components that effectively drive these processes out of equilibrium. It therefore becomes necessary to develop some intuition about the relevant features of the studied phenomena and to build an ad hoc Large Deviation analysis, providing a more targeted and richer description of the geometrical properties of the landscape. The present thesis focuses on the study of learning processes in Artificial Neural Networks, with the aim of introducing an out-of-equilibrium statistical physics framework, based on a local entropy potential, for supporting and inspiring algorithmic improvements in the field of Deep Learning, and for developing models of neural computation of both biological and engineering interest.

    Spectrum sensing and occupancy prediction for cognitive machine-to-machine wireless networks

    A thesis submitted to the University of Bedfordshire, in partial fulfilment of the requirements for the degree of Doctor of Philosophy (PhD). The rapid growth of the Internet of Things (IoT) introduces an additional challenge to the existing spectrum under-utilisation problem, as large-scale deployments of thousands of devices are expected to require wireless connectivity. Dynamic Spectrum Access (DSA) has been proposed as a means of improving the spectrum utilisation of wireless systems. Based on the Cognitive Radio (CR) paradigm, DSA enables unlicensed spectrum users to sense their spectral environment and adapt their operational parameters to opportunistically access any temporarily unoccupied bands without causing interference to the primary spectrum users. In the same context, CR-inspired Machine-to-Machine (M2M) communications have recently been proposed as a potential solution to the spectrum utilisation problem, which has been driven by the ever-increasing number of interconnected devices. M2M communications introduce new challenges for CR in terms of operational environments and design requirements. With spectrum sensing being the key function for CR, this thesis investigates the performance of spectrum sensing and proposes novel sensing approaches and models to address the sensing problem for cognitive M2M deployments. In this thesis, the behaviour of Energy Detection (ED) spectrum sensing for cognitive M2M nodes is modelled using the two-wave with diffuse power fading model. This channel model can describe a variety of realistic fading conditions, including worse-than-Rayleigh scenarios that are expected to occur within the operational environments of cognitive M2M communication systems. The results suggest that ED-based spectrum sensing fails to meet the sensing requirements under worse-than-Rayleigh conditions and consequently requires the signal-to-noise ratio (SNR) to be increased by up to 137%. However, by employing appropriate diversity and node cooperation techniques, the sensing performance can be improved by up to 11.5 dB in terms of the required SNR. These results are particularly useful in analysing the effects of severe fading in cognitive M2M systems, and thus they can be used to design efficient CR transceivers and to quantify the trade-offs between detection performance and energy efficiency. A novel predictive spectrum sensing scheme that exploits historical data of past sensing events to predict channel occupancy is proposed and analysed. This approach allows CR terminals to sense only the channels that are predicted to be unoccupied rather than the whole band of interest. Based on this approach, a spectrum occupancy predictor is developed and experimentally validated. The proposed scheme achieves a prediction accuracy of up to 93%, which in turn can lead to a reduction of up to 84% in the spectrum sensing cost. Furthermore, a novel probabilistic model for describing the channel availability in both the vertical and horizontal polarisations is developed. The proposed model is validated based on a measurement campaign for operational scenarios where CR terminals may change their polarisation during their operation. A Gaussian approximation is used to model the empirical channel availability data with more than 95% confidence bounds. The proposed model can be used as a means of improving spectrum sensing performance by using statistical knowledge of the primary users' occupancy pattern.
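    As a rough illustration of what Energy Detection does, the sketch below averages the energy of the received samples and compares it with a threshold. It is a generic textbook-style detector, not the thesis's model: the function names, the use of real-valued samples, and the threshold value are assumptions made for the example, and neither the two-wave with diffuse power fading channel nor threshold selection from a target false-alarm rate is modelled.

```python
def energy_statistic(samples):
    """Average energy of the received samples (real-valued for simplicity)."""
    return sum(s * s for s in samples) / len(samples)


def channel_occupied(samples, threshold):
    """Declare the channel occupied when the average energy exceeds the threshold.

    The threshold is a free parameter here; in practice it would be derived
    from the noise power and an acceptable false-alarm probability.
    """
    return energy_statistic(samples) > threshold


# Hypothetical sample windows: a strong signal and a weak, noise-like one.
strong_window = [2.0, -2.0, 2.0, -2.0]
weak_window = [0.1, -0.1, 0.1, -0.1]
THRESHOLD = 1.5  # illustrative value only
```

    With these toy windows, `channel_occupied(strong_window, THRESHOLD)` is true and `channel_occupied(weak_window, THRESHOLD)` is false. Under severe fading the received energy of an occupied channel can fall below the threshold, which is the failure mode the abstract describes.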

    Big Earth Data and Machine Learning for Sustainable and Resilient Agriculture

    Big streams of Earth images from satellites or other platforms (e.g., drones and mobile phones) are becoming increasingly available at low or no cost and with enhanced spatial and temporal resolution. This thesis recognizes the unprecedented opportunities offered by today's high-quality, open-access Earth observation data and introduces novel machine learning and big data methods to exploit them in developing applications for sustainable and resilient agriculture. The thesis addresses three distinct thematic areas: the monitoring of the Common Agricultural Policy (CAP), the monitoring of food security, and applications for smart and resilient agriculture. The methodological innovations across the three thematic areas address the following issues: i) the processing of big Earth Observation (EO) data, ii) the scarcity of annotated data for machine learning model training and iii) the gap between machine learning outputs and actionable advice. This thesis demonstrates how big data technologies such as data cubes, distributed learning, linked open data and semantic enrichment can be used to exploit the data deluge and extract knowledge to address real user needs. Furthermore, this thesis argues for the importance of semi-supervised and unsupervised machine learning models that circumvent the ever-present challenge of scarce annotations and thus allow for model generalization in space and time. Specifically, it is shown that only a few ground truth data are needed to generate high-quality crop type maps and crop phenology estimations. Finally, this thesis argues that there is considerable distance in value between model inferences and decision making in real-world scenarios, and showcases the power of causal and interpretable machine learning in bridging this gap. Comment: PhD thesis

    Interpretable statistics for complex modelling: quantile and topological learning

    As the complexity of our data has increased exponentially in recent decades, so has our need for interpretable features. This thesis revolves around two paradigms for approaching this quest for insights. In the first part we focus on parametric models, where the problem of interpretability can be seen as a “parametrization selection”. We introduce a quantile-centric parametrization and show the advantages of our proposal in the context of regression, where it allows us to bridge the gap between classical generalized linear (mixed) models and increasingly popular quantile methods. The second part of the thesis, concerned with topological learning, tackles the problem from a non-parametric perspective. As topology can be thought of as a way of characterizing data in terms of their connectivity structure, it allows us to represent complex and possibly high-dimensional data through a few features, such as the number of connected components, loops and voids. We illustrate how the emerging branch of statistics devoted to recovering topological structures in the data, Topological Data Analysis, can be exploited for both exploratory and inferential purposes, with a special emphasis on kernels that preserve the topological information in the data. Finally, we show with an application how these two approaches can borrow strength from one another in the identification and description of brain activity through fMRI data from the ABIDE project.

    Densification of spatially-sparse legacy soil data at a national scale: a digital mapping approach

    Digital soil mapping (DSM) is a viable approach to providing spatial soil information, but its adoption at the national scale, especially in sub-Saharan Africa, is limited by sparse data. The focus of this thesis is therefore on optimizing DSM techniques for the densification of sparse legacy soil data, using Nigeria as a case study. First, the robustness of the Random Forest model (RFM) was tested in predicting soil particle-size fractions as compositional data using the additive log-ratio technique. Results indicated good prediction accuracy with RFM and showed that soils are largely coarse-textured, especially in the northern region. Second, soil organic carbon (SOC) and bulk density (BD) were predicted, from which SOC density and stock were calculated. These were overlaid with land use/land cover (LULC), agro-ecological zone (AEZ) and soil maps to quantify the carbon sequestration of soils and its variation across different AEZs. Results showed that the top 1 m of soil holds 6.5 Pg C, with an average of 71.60 Mg C ha⁻¹. Furthermore, to improve the performance of BD and effective cation exchange capacity (ECEC) pedotransfer functions (PTFs), the inclusion of environmental data was explored using multiple linear regression (MLR) and RFM. Results showed an increase in the performance of PTFs when both soil and environmental data were used. Finally, the application of the Choquet fuzzy integral (CI) technique in irrigation suitability assessment was evaluated. This was achieved through multi-criteria analysis of soil, climatic, landscape and socio-economic indices. Results showed that CI is a better aggregation operator than the weighted mean technique. A total of 3.34 × 10⁶ ha is suitable for surface irrigation in Nigeria, while the major limitations are due to topographic and soil attributes. The research findings will provide a quantitative basis for framing appropriate policies on sustainable food production and environmental management, especially in resource-poor countries of the world.
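    The additive log-ratio step mentioned above can be sketched as follows. Particle-size fractions (sand, silt, clay) sum to 1, so they are mapped to unconstrained log-ratios before modelling and mapped back afterwards. This is a minimal sketch: the function names are hypothetical, and taking the last component as the reference part is an assumption, since the abstract does not say which fraction the thesis used.

```python
import math


def alr(fractions):
    """Additive log-ratio transform: log of each part relative to the last part.

    Takes D strictly positive fractions and returns D - 1 unconstrained values.
    """
    *parts, reference = fractions
    return [math.log(p / reference) for p in parts]


def alr_inverse(coords):
    """Map D - 1 log-ratio values back to D fractions that sum to 1."""
    expanded = [math.exp(c) for c in coords] + [1.0]
    total = sum(expanded)
    return [e / total for e in expanded]


# Hypothetical soil sample: 60% sand, 30% silt, 10% clay (clay as reference).
texture = [0.6, 0.3, 0.1]
coords = alr(texture)            # two values a regression model could predict
recovered = alr_inverse(coords)  # back-transformed fractions
```

    Predicting the log-ratios and back-transforming guarantees that the predicted fractions are positive and sum to 1, which modelling each fraction separately does not.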

    Is Evolution an Algorithm? Effects of local entropy in unsupervised learning and protein evolution

    The abstract is in the attachment.

    Roadmap on Electronic Structure Codes in the Exascale Era

    Electronic structure calculations have been instrumental in providing many important insights into a range of physical and chemical properties of various molecular and solid-state systems. Their importance to various fields, including materials science, chemical sciences, computational chemistry and device physics, is underscored by the large fraction of available public supercomputing resources devoted to these calculations. As we enter the exascale era, exciting new opportunities to increase simulation numbers, sizes, and accuracies present themselves. To realize these promises, however, the community of electronic structure software developers will first have to tackle a number of challenges pertaining to the efficient use of new architectures that will rely heavily on massive parallelism and hardware accelerators. This roadmap provides a broad overview of the state of the art in electronic structure calculations and of the various new directions being pursued by the community. It covers 14 electronic structure codes, presenting their current status, their development priorities over the next five years, and their plans for tackling the challenges and leveraging the opportunities presented by the advent of exascale computing. Comment: Submitted as a roadmap article to Modelling and Simulation in Materials Science and Engineering; address any correspondence to Vikram Gavini and Danny Perez.