3,462 research outputs found

    Revisiting the Dissimilarity Representation in the Context of Regression

    In machine learning, a natural way to represent an instance is by using a feature vector. However, several studies have shown that this representation may not accurately characterize an object. For classification problems, the dissimilarity paradigm has been proposed as an alternative to the standard feature-based approach. Encoding each object by pairwise dissimilarities has been demonstrated to improve the data quality because it mitigates some complexities such as class overlap, small disjuncts, and low-sample size. However, its suitability and performance when applied to regression problems have not been fully explored. This study redefines the dissimilarity representation for regression. To this end, we have carried out an extensive experimental evaluation on 34 datasets using two linear regression models. The results show that the dissimilarity approach decreases the error rates of both the traditional linear regression and the linear model with elastic net regularization, and it also reduces the complexity of most regression datasets
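
As a rough illustration of the encoding this abstract describes, the sketch below represents each instance by its distances to a set of reference objects (prototypes). The Euclidean distance, the choice of prototypes, and the function name `dissimilarity_representation` are assumptions made for the example; the abstract does not state which dissimilarity measure or prototype selection the study uses.

```python
import math

def dissimilarity_representation(instances, prototypes):
    """Encode each instance as its vector of distances to every prototype.

    Assumption: Euclidean distance is used as the dissimilarity measure.
    """
    return [[math.dist(x, p) for p in prototypes] for x in instances]

# Hypothetical toy data: three 2-D feature vectors.
instances = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Assumption: the first two instances act as prototypes.
prototypes = instances[:2]

D = dissimilarity_representation(instances, prototypes)
for row in D:
    print(row)
```

A regression model would then be fitted on the rows of `D` instead of the original feature vectors, so the number of input columns equals the number of prototypes.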

    Nonlinear effects of environmental drivers shape macroinvertebrate biodiversity in an agricultural pondscape

    Agriculture is a leading cause of biodiversity loss and significantly impacts freshwater biodiversity through many stressors acting locally and on the landscape scale. The individual effects of these numerous stressors are often difficult to disentangle and quantify, as they might have nonlinear impacts on biodiversity. Within agroecosystems, ponds are biodiversity hotspots providing habitat for many freshwater species and resting or feeding places for terrestrial organisms. Ponds are strongly influenced by their terrestrial surroundings, and understanding the determinants of biodiversity in agricultural landscapes remains difficult but crucial for improving conservation policies and actions. We aimed to identify the main effects of environmental and spatial variables on α-, β-, and γ-diversities of macroinvertebrate communities inhabiting ponds (n = 42) in an agricultural landscape in Northeast Germany, and to quantify the respective roles of taxonomic turnover and nestedness in the pondscape. We disentangled the nonlinear effects of a wide range of environmental and spatial variables on macroinvertebrate α- and β-biodiversity. Our results show that α-diversity is impaired by eutrophication (phosphate and nitrogen) and that overshaded ponds support impoverished macroinvertebrate biota. The share of arable land in the ponds' surroundings decreases β-diversity (i.e., dissimilarity in community), while β-diversity is higher in shallower ponds. Moreover, we found that β-diversity is mainly driven by taxonomic turnover and that ponds embedded in arable fields support local and regional diversity. Our findings highlight the importance of such ponds for supporting biodiversity, identify the main stressors related to human activities (eutrophication), and emphasize the need for a large number of ponds in the landscape to conserve biodiversity.
Small freshwater systems in agricultural landscapes challenge us to compromise between human demands and nature conservation worldwide. Identifying and quantifying the effects of environmental variables on biodiversity inhabiting those ecosystems can help address threats impacting freshwater life with more effective management of pondscapes

    Effects of Molecular Representation in Predicting the Biological Activity using SVM and PLS Approaches

    In this work we study and analyze the behavior of different representational spaces for molecular activity prediction. Representational spaces based on fingerprint similarity, structural similarity using maximum common subgraphs (MCS) and all maximum common subgraphs (AMCS) approaches are compared against representational spaces based on structural fragments and non-isomorphic fragments (NIF), built using different molecular descriptors. A support vector machine is used to study the influence of molecular representation on the dataset classification, and PLS regression is proposed to construct a QSAR model for the molecular activity prediction.

    Bio-Ecological Diversity vs. Socio-Economic Diversity: A Comparison of Existing Measures

    This paper aims to enrich the standard toolbox for measuring diversity in economics. In so doing, we compare the indicators of diversity used by economists with those used by biologists and ecologists. Ecologists and biologists are concerned about biodiversity: the diversity of organisms that inhabit a given area. Concepts of species diversity such as alpha (diversity within community), beta (diversity across communities) and gamma (diversity due to differences among samples when they are combined into a single sample) have been developed (Whittaker, 1960). Biodiversity is more complex than just the species that are present; it includes species richness and species evenness. Those various aspects of diversity are measured by biodiversity indices such as Simpson’s Diversity Indices, Species Richness Index, Shannon Weaver Diversity Indices, Patil and Taillie Index, Modified Hill’s Ratio. In economics, diversity measures are multi-faceted, ranging from inequality (Lorenz curve, Gini coefficient, quintile distribution) to polarisation (Esteban and Ray, 1994; Wolfson, 1994; D’Ambrosio, 2001) and heterogeneity (Alesina, Baqir and Hoxby, 2000). We propose an interdisciplinary comparison between indicators. We review their theoretical background and applications. We provide an assessment of their possible use according to their specific properties.
    Keywords: Diversity, Growth, Knowledge
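
To make the indices named in this abstract concrete, the following minimal sketch computes species richness, the Shannon index, and one common form of the Simpson index from a list of abundance counts. The function names and sample counts are hypothetical. The Simpson index has several variants, and the Gini-Simpson form (1 minus the sum of squared proportions) is assumed here; the paper may use a different one.

```python
import math

def species_richness(counts):
    """Number of species with at least one individual."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over species that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def gini_simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i ** 2); an assumed variant of Simpson's index."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical communities: one perfectly even, one dominated by a single species.
even_community = [10, 10, 10, 10]
dominated_community = [37, 1, 1, 1]

for name, counts in [("even", even_community), ("dominated", dominated_community)]:
    print(name, species_richness(counts), shannon_index(counts), gini_simpson_index(counts))
```

Both communities have the same richness, but the even community scores higher on the Shannon and Gini-Simpson indices, which is the distinction between richness and evenness that the abstract draws.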

    Data mining as a tool for environmental scientists

    Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling, while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.

    DPRESS: Localizing estimates of predictive uncertainty

    Background: The need to have a quantitative estimate of the uncertainty of prediction for QSAR models is steadily increasing, in part because such predictions are being widely distributed as tabulated values disconnected from the models used to generate them. Classical statistical theory assumes that the error in the population being modeled is independent and identically distributed (IID), but this is often not actually the case. Such inhomogeneous error (heteroskedasticity) can be addressed by providing an individualized estimate of predictive uncertainty for each particular new object u: the standard error of prediction s_u can be estimated as the non-cross-validated error s_t* for the closest object t* in the training set adjusted for its separation d from u in the descriptor space relative to the size of the training set.
[Formula image: 1758-2946-1-11-i1.gif]
The predictive uncertainty factor γ_t* is obtained by distributing the internal predictive error sum of squares across objects in the training set based on the distances between them, hence the acronym: Distributed PRedictive Error Sum of Squares (DPRESS). Note that s_t* and γ_t* are characteristic of each training set compound contributing to the model of interest.
Results: The method was applied to partial least-squares models built using 2D (molecular hologram) or 3D (molecular field) descriptors applied to mid-sized training sets (N = 75) drawn from a large (N = 304), well-characterized pool of cyclooxygenase inhibitors. The observed variation in predictive error for the external 229 compound test sets was compared with the uncertainty estimates from DPRESS. Good qualitative and quantitative agreement was seen between the distributions of predictive error observed and those predicted using DPRESS. Inclusion of the distance-dependent term was essential to getting good agreement between the estimated uncertainties and the observed distributions of predictive error. The uncertainty estimates derived by DPRESS were conservative even when the training set was biased, but not excessively so.
Conclusion: DPRESS is a straightforward and powerful way to reliably estimate individual predictive uncertainties for compounds outside the training set based on their distance to the training set and the internal predictive uncertainty associated with its nearest neighbor in that set. It represents a sample-based, a posteriori approach to defining applicability domains in terms of localized uncertainty.

    A study of an estuarine benthic community subjected to petrochemical effluents

    This study has assessed the impact of a petrochemical complex, which discharges its effluents onto an intertidal mudflat. The Grangemouth petrochemical complex on the Forth Estuary, Scotland discharges two effluents onto the Kinneil intertidal area. The results of a 24-year monitoring programme of the Kinneil intertidal area, carried out between 1976 and 1999, are analysed. The relative impact of the effluents on the macrobenthic community is considered along with other potential pollution sources and climatic factors. During the study period a clear increase in the diversity, evenness and species richness was observed over the whole area. This is attributed to the increased quality of the refinery effluent, the chemical effluent and the River Avon, which also crosses the area. The group analysis showed that although all areas have shown an increase in diversity there are still three areas that can be considered impacted (Groups 1, 2 and 4). Two major changes in the species composition were seen in 1979 when Manayunkia aestuarina was first found and in 1994 when Streblospio shrubsolii was first recorded. The impact of the recent movement of the chemical outfall from a high-tide position to a lower shore site is also considered. A detailed survey of the areas around the new lower shore and old upper shore outfalls indicated that there was a spatial difference in the species distribution, which can be explained by the distance from the refinery outfall, the hydrocarbon concentration of the sediments and/or the station height. No change in the community composition was detected after the movement of the chemical outfall in January 1998, although seasonal changes were seen.

    Discovering properties of new DNA-binding activity of proteins

    Protein-DNA interactions are an essential feature in the genetic activities of life, and the ability to predict and manipulate such interactions has applications in a wide range of fields. This thesis presents methods of modelling the properties of protein-DNA interactions. In particular, it investigates methods of visualising and predicting the specificity of DNA-binding Cys2His2 zinc finger interaction. The Cys2His2 zinc finger proteins interact via their individual fingers to base pair subsites on the target DNA. Four key residue positions on the α-helix of the zinc fingers make non-covalent interactions with the DNA with sequence specificity. Mutating these key residues generates combinatorial possibilities that could potentially bind to any DNA segment of interest. Many attempts have been made to predict the binding interaction using structural and chemical information, but with only limited success. The most important contribution of the thesis is that the developed model allows for the binding properties of a given protein-DNA binding to be visualised in relation to other protein-DNA combinations without having to explicitly physically model the specific protein molecule and specific DNA sequence. To prove this, various databases were generated, including a synthetic database which includes all possible combinations of the DNA-binding Cys2His2 zinc finger interactions. NeuroScale, a topographic visualisation technique, is exploited to represent the geometric structures of the protein-DNA interactions by measuring dissimilarity between the data points. In order to verify the effect of visualisation on understanding the binding properties of the DNA-binding Cys2His2 zinc finger interaction, various prediction models are constructed by using both the high dimensional original data and the represented data in low dimensional feature space.
Finally, novel data sets are studied through the selected visualisation models based on the experimental DNA-zinc finger protein database. The result of the NeuroScale projection shows that different dissimilarity representations give distinctive structural groupings, while still clustering in biologically interesting ways. This method can be used to forecast the physicochemical properties of the novel proteins, which may be beneficial for therapeutic purposes involving genome targeting in general.