52 research outputs found
Random forests with random projections of the output space for high dimensional multi-label classification
We adapt the idea of random projections applied to the output space, so as to
enhance tree-based ensemble methods in the context of multi-label
classification. We show how learning time complexity can be reduced without
affecting computational complexity and accuracy of predictions. We also show
that random output space projections may be used in order to reach different
bias-variance tradeoffs, over a broad panel of benchmark problems, and that
this may lead to improved accuracy while reducing significantly the
computational burden of the learning stage.
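
The output-space projection idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the Rademacher projection matrix, the synthetic label matrix and the helper names (`rademacher_projection`, `project_labels`, `total_variance`) are assumptions chosen for illustration. It shows that the summed output variance, the impurity a multi-output tree would typically minimize, can be computed on far fewer projected columns than there are labels.

```python
import random

def rademacher_projection(n_labels, n_components, rng):
    """Return an n_labels x n_components matrix with entries +-1/sqrt(n_components)."""
    scale = 1.0 / n_components ** 0.5
    return [[rng.choice((-scale, scale)) for _ in range(n_components)]
            for _ in range(n_labels)]

def project_labels(Y, P):
    """Multiply the label matrix Y (n_samples x n_labels) by the projection P."""
    n_components = len(P[0])
    return [[sum(y[j] * P[j][k] for j in range(len(y)))
             for k in range(n_components)]
            for y in Y]

def total_variance(Z):
    """Sum of per-column (population) variances of the matrix Z."""
    n = len(Z)
    total = 0.0
    for k in range(len(Z[0])):
        col = [row[k] for row in Z]
        mean = sum(col) / n
        total += sum((v - mean) ** 2 for v in col) / n
    return total

rng = random.Random(0)
n_samples, n_labels, n_components = 100, 200, 40

# Hypothetical sparse label matrix: each label is active with probability 0.05.
Y = [[1 if rng.random() < 0.05 else 0 for _ in range(n_labels)]
     for _ in range(n_samples)]

P = rademacher_projection(n_labels, n_components, rng)
Z = project_labels(Y, P)

# The two values are expected to be of similar magnitude; a tree grown on Z
# scores each split on 40 columns instead of 200.
print("variance in label space:    ", total_variance(Y))
print("variance in projected space:", total_variance(Z))
```

In such a scheme the projected outputs would be used only to choose splits; leaf predictions would still be computed from the original labels, which is one way to read the claim that prediction accuracy is not affected.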
Modelling trade offs between public and private conservation policies
To reduce global biodiversity loss, there is an urgent need to determine the
most efficient allocation of conservation resources. Recently, there has been a
growing trend for many governments to supplement public ownership and
management of reserves with incentive programs for conservation on private
land. At the same time, policies to promote conservation on private land are
rarely evaluated in terms of their ecological consequences. This raises
important questions, such as the extent to which private land conservation can
improve conservation outcomes, and how it should be mixed with more traditional
public land conservation. We address these questions, using a general framework
for modelling environmental policies and a case study examining the
conservation of endangered native grasslands to the west of Melbourne,
Australia. Specifically, we examine three policies that involve: i) spending
all resources on creating public conservation areas; ii) spending all resources
on an ongoing incentive program where private landholders are paid to manage
vegetation on their property with 5-year contracts; and iii) splitting
resources between these two approaches. The performance of each strategy is
quantified with a vegetation condition change model that predicts future
changes in grassland quality. Of the policies tested, no one policy was always
best and policy performance depended on the objectives of those enacting the
policy. This work demonstrates a general method for evaluating environmental
policies and highlights the utility of a model which combines ecological and
socioeconomic processes.
Comment: 20 pages, 5 figures
Ontology of core data mining entities
In this article, we present OntoDM-core, an ontology of core data mining
entities. OntoDM-core defines the most essential data mining entities in a three-layered
ontological structure comprising a specification, an implementation and an application
layer. It provides a representational framework for the description of mining
structured data, and in addition provides taxonomies of datasets, data mining tasks,
generalizations, data mining algorithms and constraints, based on the type of data.
OntoDM-core is designed to support a wide range of applications/use cases, such as
semantic annotation of data mining algorithms, datasets and results; annotation of
QSAR studies in the context of drug discovery investigations; and disambiguation of
terms in text mining. The ontology has been thoroughly assessed following the practices
in ontology engineering, is fully interoperable with many domain resources and
is easy to extend.
On Aggregation in Ensembles of Multilabel Classifiers
While a variety of ensemble methods for multilabel classification have been
proposed in the literature, the question of how to aggregate the predictions of
the individual members of the ensemble has received little attention so far. In
this paper, we introduce a formal framework of ensemble multilabel
classification, in which we distinguish two principal approaches: "predict then
combine" (PTC), where the ensemble members first make loss minimizing
predictions which are subsequently combined, and "combine then predict" (CTP),
which first aggregates information such as marginal label probabilities from
the individual ensemble members, and then derives a prediction from this
aggregation. While both approaches generalize voting techniques commonly used
for multilabel ensembles, they allow the target performance measure to be taken
into account explicitly. Therefore, concrete instantiations of CTP and PTC can be
tailored to concrete loss functions. Experimentally, we show that standard
voting techniques are indeed outperformed by suitable instantiations of CTP and
PTC, and provide some evidence that CTP performs well for decomposable loss
functions, whereas PTC is the better choice for non-decomposable losses.
Comment: 14 pages, 2 figures
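
The difference between the two aggregation schemes described above can be seen in a toy example. This is a minimal sketch under stated assumptions, not the paper's framework: the target loss is taken to be Hamming loss (so thresholding a marginal probability at 0.5 is the loss-minimizing prediction), CTP is instantiated as a mean of marginals, PTC as a per-label majority vote, and the probabilities in `member_probs` are invented.

```python
def combine_then_predict(member_probs, threshold=0.5):
    """CTP: average the members' marginal label probabilities, then threshold."""
    n_members = len(member_probs)
    n_labels = len(member_probs[0])
    avg = [sum(p[j] for p in member_probs) / n_members for j in range(n_labels)]
    return [1 if a > threshold else 0 for a in avg]

def predict_then_combine(member_probs, threshold=0.5):
    """PTC: each member thresholds its own marginals, then labels are majority-voted."""
    n_members = len(member_probs)
    n_labels = len(member_probs[0])
    votes = [[1 if p[j] > threshold else 0 for j in range(n_labels)]
             for p in member_probs]
    return [1 if 2 * sum(v[j] for v in votes) > n_members else 0
            for j in range(n_labels)]

# Hypothetical marginal probabilities: 3 ensemble members, 3 labels.
member_probs = [
    [0.9, 0.4, 0.2],
    [0.4, 0.4, 0.1],
    [0.4, 0.9, 0.3],
]

ctp = combine_then_predict(member_probs)
ptc = predict_then_combine(member_probs)
print("CTP:", ctp)  # [1, 1, 0]
print("PTC:", ptc)  # [0, 0, 0]
```

For the first two labels one confident member pulls the averaged probability above 0.5, so CTP predicts the label, while the majority vote in PTC discards that confidence. Other loss functions would call for different prediction and combination rules.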
Combined chemical genetics and data-driven bioinformatics approach identifies receptor tyrosine kinase inhibitors as host-directed antimicrobials
Immunogenetics and cellular immunology of bacterial infectious disease
Wet-dry-wet drug screen leads to the synthesis of TS1, a novel compound reversing lung fibrosis through inhibition of myofibroblast differentiation
Therapies halting the progression of fibrosis are ineffective and limited. Activated myofibroblasts are emerging as important targets in the progression of fibrotic diseases. Previously, we performed a high-throughput screen on lung fibroblasts and subsequently demonstrated that the inhibition of myofibroblast activation is able to prevent lung fibrosis in bleomycin-treated mice. High-throughput screens are an ideal method of repurposing drugs, yet they contain an intrinsic limitation, which is the size of the library itself. Here, we exploited the data from our “wet” screen and used “dry” machine learning analysis to virtually screen millions of compounds, identifying novel anti-fibrotic hits which target myofibroblast differentiation, many of which were structurally related to dopamine. We synthesized and validated several compounds ex vivo (“wet”) and confirmed that both dopamine and its derivative TS1 are powerful inhibitors of myofibroblast activation. We further used RNAi-mediated knock-down and demonstrated that both molecules act through the dopamine receptor 3 and exert their anti-fibrotic effect by inhibiting the canonical transforming growth factor β pathway. Furthermore, molecular modelling confirmed the capability of TS1 to bind both human and mouse dopamine receptor 3. The anti-fibrotic effect on human cells was confirmed using primary fibroblasts from idiopathic pulmonary fibrosis patients. Finally, TS1 prevented and reversed disease progression in a murine model of lung fibrosis. Both our interdisciplinary approach and our novel compound TS1 are promising tools for understanding and combating lung fibrosis
Predicting gene function using hierarchical multi-label decision tree ensembles
Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results: We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions: Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
Ensembles of extremely randomized predictive clustering trees for predicting structured outputs
We address the task of learning ensembles of predictive models for structured output prediction (SOP). We focus on three SOP tasks: multi-target regression (MTR), multi-label classification (MLC) and hierarchical multi-label classification (HMC). In contrast to standard classification and regression, where the output is a single (discrete or continuous) variable, in SOP the output is a data structure: a tuple of continuous variables (MTR), a tuple of binary variables (MLC) or a tuple of binary variables with hierarchical dependencies (HMC). SOP is gaining increasing interest in the research community due to its applicability in a variety of practically relevant domains. In this context, we consider the Extra-Tree ensemble learning method, the overall top performer in the DREAM4 and DREAM5 challenges for gene network reconstruction. We extend this method for SOP tasks and call the extension Extra-PCTs ensembles. As base predictive models we propose using predictive clustering trees (PCTs), a generalization of decision trees for predicting structured outputs. We conduct a comprehensive experimental evaluation of the proposed method on a collection of 41 benchmark datasets: 21 for MTR, 10 for MLC and 10 for HMC. We first investigate the influence of the size of the ensemble and the size of the feature subset considered at each node. We then compare the performance of Extra-PCTs to other ensemble methods (random forests and bagging), as well as to single PCTs. The experimental evaluation reveals that the Extra-PCTs achieve optimal performance in terms of predictive power and computational cost, with 50 base predictive models across the three tasks. The recommended values for feature subset sizes vary across the tasks, and also depend on whether the dataset contains only binary and/or sparse attributes. The Extra-PCTs give better predictive performance than a single tree (the differences are typically statistically significant).
Moreover, the Extra-PCTs are the best performing ensemble method (except for the MLC task, where performances are similar to those of random forests), and Extra-PCTs can be used to learn good feature rankings for all of the tasks considered here.
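
The role of the feature subset size can be illustrated with a sketch of extremely randomized split selection for several targets at once. It follows the generic Extra-Trees scheme (one uniformly drawn cut-point per candidate feature) and scores splits by the reduction in summed target variance. The abstract does not give the details of Extra-PCTs, so the scoring rule, the function names and the toy data are all assumptions.

```python
import random

def total_variance(rows):
    """Sum of per-target (population) variances over a list of target tuples."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for t in range(len(rows[0])):
        col = [r[t] for r in rows]
        mean = sum(col) / n
        total += sum((v - mean) ** 2 for v in col) / n
    return total

def random_split(X, Y, k, rng):
    """Pick k random features, draw one random cut-point for each, and return
    (score, feature, cut) for the best candidate, or None if none is valid."""
    n = len(Y)
    parent = total_variance(Y)
    best = None
    for f in rng.sample(range(len(X[0])), k):
        values = [x[f] for x in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # constant feature, no split possible
        cut = rng.uniform(lo, hi)
        left = [y for x, y in zip(X, Y) if x[f] < cut]
        right = [y for x, y in zip(X, Y) if x[f] >= cut]
        if not left or not right:
            continue
        score = (parent
                 - len(left) / n * total_variance(left)
                 - len(right) / n * total_variance(right))
        if best is None or score > best[0]:
            best = (score, f, cut)
    return best

# Hypothetical toy data: feature 0 is informative, feature 1 is constant.
X = [[0.0, 1.0], [0.1, 1.0], [0.9, 1.0], [1.0, 1.0]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]

best = random_split(X, Y, k=2, rng=random.Random(0))
print(best)
```

Larger values of `k` make the split choice less random and more expensive per node, which is the tradeoff behind the feature subset sizes studied in the abstract.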
Semi-supervised regression trees with application to QSAR modelling
Despite the ease of collecting an abundance of data about various phenomena, obtaining the labeled data needed for learning models with high predictive performance remains a difficult and expensive task in many domains. This issue is particularly acute in the analysis of scientific data, where obtaining labeled data typically requires expensive experiments. Moreover, in the analysis of scientific data, another issue is of fundamental importance: the interpretability of the models and the explainability of their decisions. Taking these considerations into account, we propose a novel semi-supervised method to learn regression trees. Thanks to the semi-supervised machine learning approach, the method is able to exploit information coming not only from labeled data, but also from unlabeled data, thus alleviating the lack of labeled data. The method is based on the predictive clustering trees paradigm that extends regression trees towards structured output prediction. This allows us to obtain interpretable regression trees. The method we propose is particularly suited for the chemoinformatics task of quantitative structure-activity relationship (QSAR) modeling, which is the main application context considered in this paper. Specifically, we evaluate the proposed method on 4 QSAR modelling datasets and illustrate its use on a case study of predicting farnesyltransferase inhibitors. Additionally, we evaluate our approach on 10 benchmark datasets not related to the QSAR modeling problem.
The evaluation reveals the following: semi-supervised trees and ensembles thereof have better predictive performance than their supervised counterparts (especially when the number of labeled examples is very small); different datasets and different amounts of labeled data require different amounts of unlabeled data to be included in the learning process; and the learned semi-supervised regression trees can be used to better understand the problem at hand and the way predictions are being made.
Some parental characteristics and habits of insulin-dependent diabetes mellitus children
The aim of this case-control study conducted in Belgrade during 1994-1997 was to investigate whether parental demographic characteristics and habits are associated with insulin-dependent diabetes mellitus (IDDM). The case group comprised 105 children up to 16 years old with IDDM and the control group comprised 210 children with skin diseases. Cases and controls were individually matched by age (± one year), sex and place of residence (Belgrade). According to χ² test results, children with IDDM significantly more frequently had five or more family members, and they also significantly more frequently had poor socio-economic status than their controls. Higher education of fathers was significantly more frequently reported in diabetic children, in comparison with their controls. Parents of diabetic children were significantly more frequently occupationally exposed to radiation, petroleum and its derivatives, organic solvents, dyes and lacquers. During pregnancy, mothers of diabetic children significantly more frequently smoked cigarettes and consumed coffee, coca-cola, alcohol and foods containing nitrosamines. Fathers of diabetic children more frequently consumed alcohol.