An ensemble approach of dual base learners for multi-class classification problems
In this work, we formalise and evaluate an ensemble of classifiers designed for multi-class problems. To achieve a good accuracy rate, the base learners are built with pairwise coupled binary and multi-class classifiers. Moreover, to reduce the computational cost of the ensemble and to improve its performance, these classifiers are trained using a specific attribute subset. This proposal offers the opportunity to capture the advantages provided by binary decomposition methods, by attribute partitioning methods, and by the cooperative characteristics associated with a combination of redundant base learners. To analyse the quality of this architecture, its performance has been tested on different domains, and the results have been compared to other well-known classification methods. This experimental evaluation indicates that our model is, in most cases, as accurate as these methods, but much more efficient. (C) 2014 Elsevier B.V. All rights reserved. This research was supported by the Spanish MICINN under Projects TRA2010-20225-C03-01, TRA 2011-29454-C03-02, and TRA 2011-29454-C03-03.
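The pairwise-coupling idea can be illustrated with a minimal sketch, assuming a one-vs-one decomposition in which each binary learner is trained on its own randomly drawn attribute subset and predictions are combined by voting; the nearest-centroid base learner and all names here are illustrative stand-ins, not the paper's implementation:

```python
# Illustrative one-vs-one ensemble: one binary learner per class pair,
# each trained on its own attribute (feature) subset, combined by voting.
# Nearest-centroid stands in for the base learner.
from itertools import combinations
import random

def centroid(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def train_pairwise_ensemble(X, y, subset_size, seed=0):
    rng = random.Random(seed)
    classes = sorted(set(y))
    ensemble = []
    for a, b in combinations(classes, 2):
        feats = rng.sample(range(len(X[0])), subset_size)  # attribute subset
        proj = lambda row, f=feats: [row[i] for i in f]
        ca = centroid([proj(x) for x, lab in zip(X, y) if lab == a])
        cb = centroid([proj(x) for x, lab in zip(X, y) if lab == b])
        ensemble.append((a, b, feats, ca, cb))
    return ensemble

def predict(ensemble, x):
    votes = {}
    for a, b, feats, ca, cb in ensemble:
        p = [x[i] for i in feats]
        da = sum((u - v) ** 2 for u, v in zip(p, ca))
        db = sum((u - v) ** 2 for u, v in zip(p, cb))
        winner = a if da <= db else b  # pairwise vote
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

Because each pairwise learner sees only a feature subset, training cost drops while the redundant votes compensate for any single learner's weakness, which is the cooperative effect the abstract describes.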
Accelerating the Design of Automotive Catalyst Products Using Machine Learning: Leveraging experimental data to guide new formulations
The design of catalyst products to reduce harmful emissions is currently an intensive process of expert-driven discovery, taking several years to develop a product. Machine learning can accelerate this timescale, leveraging historic experimental data from related products to guide which
new formulations and experiments will enable a project to most directly reach its targets. We used machine learning to accurately model 16 key performance targets for catalyst products, enabling detailed understanding of the factors governing catalyst performance and realistic suggestions
of future experiments to rapidly develop more effective products. The proposed formulations are currently undergoing experimental validation.
Efficient Network Domination for Life Science Applications
With the ever-increasing size of data available to researchers, traditional methods of analysis often cannot scale to match the problems being studied. Often only a subset of variables may be utilized or studied further, motivating the need for techniques that can prioritize variable selection. This dissertation describes the development and application of graph theoretic techniques, particularly the notion of domination, for this purpose. In the first part of this dissertation, algorithms for vertex prioritization in the field of network controllability are studied. Here, the number of solutions to which a vertex belongs is used to classify said vertex and determine its suitability in controlling a network. Novel, efficient, scalable algorithms are developed and analyzed. Empirical tests demonstrate the improvement of these algorithms over those already established in the literature. The second part of this dissertation concerns the prioritization of genes for loss-of-function allele studies in mice. The International Mouse Phenotyping Consortium leads the initiative to develop a loss-of-function allele for each protein-coding gene in the mouse genome. Only a small proportion of untested genes can be selected for further study. To address the need to prioritize genes, a generalizable data science strategy is developed. This strategy models genes as a gene-similarity graph and from it selects a subset that will be further characterized. Empirical tests demonstrate the method's utility over that of pseudorandom selection and less computationally demanding methods. Finally, part three addresses the important task of preprocessing in the context of noisy public health data. Many public health databases have been developed to collect, curate, and store a variety of environmental measurements. Idiosyncrasies in these measurements, however, introduce noise to the data found in these databases in several ways, including missing, incorrect, outlying, and incompatible data.
Beyond noisy data, multiple measurements of similar variables can introduce problems of multicollinearity. Domination is again employed in a novel graph method to handle this autocorrelation. Empirical results using the Public Health Exposome dataset are reported. Together, these three parts demonstrate the utility of subset selection via domination when applied to a multitude of data sources from a variety of disciplines in the life sciences.
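As a rough illustration of domination-based selection (not the dissertation's exact algorithms), a standard greedy heuristic for finding a small dominating set on an adjacency-list graph looks like this:

```python
# Greedy dominating-set heuristic: repeatedly pick the vertex that covers
# (dominates) the most still-undominated vertices. A vertex dominates
# itself and all of its neighbours. `adj` maps vertex -> set of neighbours.
def greedy_dominating_set(adj):
    undominated = set(adj)
    dom = set()
    while undominated:
        # pick the vertex covering the most still-undominated vertices
        best = max(adj, key=lambda v: len(({v} | adj[v]) & undominated))
        dom.add(best)
        undominated -= {best} | adj[best]
    return dom
```

In a variable-selection setting, vertices would be variables and edges similarity relations, so the dominating set is a small subset whose members are "close to" every other variable.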
Comparison of Feature Selection Methods for Body Fat Percentage Prediction
Obesity, one of the most common health problems of our age, negatively affects quality of life and causes many disorders. Body fat percentage is the most important indicator for diagnosing obesity. Determining body fat percentage quickly, easily, inexpensively, and with high accuracy is at least as important as diagnosing obesity itself. Body fat percentage, which can be calculated from anthropometric data, can be estimated reliably with machine learning algorithms. However, high-dimensional, irrelevant, and redundant data degrade the accuracy of machine learning algorithms and increase model training time. Feature selection algorithms allow machine learning models to achieve higher accuracy with fewer features. In this study, seven different feature selection algorithms were compared for body fat percentage prediction, achieving higher accuracy with fewer features. Four machine learning methods were used to examine the effect of the feature selection methods on different models, and the training times of these algorithms were compared. The experimental results show that, by using feature selection methods, more accurate predictions can be obtained with fewer features and shorter model training time.
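A hedged sketch of one filter-style criterion of the kind such studies compare, ranking features by absolute Pearson correlation with the target; the seven methods actually evaluated are not reproduced here, and the data below are illustrative:

```python
# Filter-style feature selection: score each feature by the absolute value
# of its Pearson correlation with the target and keep the top-k indices.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(X, y, k):
    scores = [(abs(pearson([row[j] for row in X], y)), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

With anthropometric columns as features and body fat percentage as `y`, the returned indices would be the reduced feature set fed to the downstream regressor.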
Pathophysiological characterization of traumatic brain injury using novel analytical methods
Severity of traumatic brain injury is usually classified by the Glasgow coma scale (GCS) as "mild",
"moderate", or "severe", which does not capture the heterogeneity of the disease. According to
current guidelines, intracranial pressure (ICP) should not exceed 22 mmHg, with no further
recommendations concerning individualization or tolerable duration of intracranial
hypertension. The aims of this thesis were to identify subgroups of patients beyond
characterization using GCS, and to investigate the impact of duration and magnitude of
intracranial hypertension on outcome, using data from the observational prospective study
Collaborative European NeuroTrauma Effectiveness Research in TBI (CENTER-TBI).
To investigate the temporal aspect of tolerable ICP elevations, we examined the correlation
between dose of ICP and outcome represented by 6-month Glasgow outcome scale extended
(GOSE). ICP dose was represented both by the number of events above thresholds for ICP
magnitude and duration and by area under the ICP curve (i.e., “pressure time dose” (PTD)). A
variation in tolerable ICP thresholds of 18 ± 4 mmHg (2 standard deviations, SD) for
events with duration longer than five minutes was identified using a bootstrapping technique.
PTD was correlated to both mortality and unfavorable outcome.
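The pressure-time dose can be sketched as the area of the ICP curve above a threshold, here via the trapezoidal rule over evenly spaced samples; the 19 mmHg threshold and one-minute sampling interval are illustrative assumptions, not the thesis's exact computation:

```python
# "Pressure time dose" (PTD) sketch: area (mmHg * min) of the ICP signal
# above a threshold, by the trapezoidal rule over evenly spaced samples.
def pressure_time_dose(icp_mmhg, threshold=19.0, dt_min=1.0):
    excess = [max(0.0, p - threshold) for p in icp_mmhg]
    return sum((a + b) / 2 * dt_min for a, b in zip(excess, excess[1:]))
```

Counting the number of threshold-crossing events of a given magnitude and duration, the other dose representation mentioned above, would be a separate pass over the same excess series.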
A cerebrovascular autoregulation (CA) dependent ICP tolerability was identified. If CA was
impaired, no tolerable ICP magnitude and duration thresholds were identified, while if CA was
intact, both 19 mmHg for 5 minutes or longer and 15 mmHg for 50 minutes or longer were
correlated to worse outcome. While no significant difference in PTD was seen between
favorable and unfavorable outcome if CA was intact, there was a significant difference if CA
was impaired. In a multivariable analysis, PTD did not remain a significant predictor of
outcome when adjusting for other known predictors in TBI. In a causal inference analysis, both
cerebrovascular autoregulation status and ICP-lowering therapies represented by the therapy
intensity level (TIL) have a directional relationship with outcome. However, no direct causal
relationship of ICP towards outcome was found.
By applying an unsupervised clustering method, we identified six distinct admission clusters
defined by GCS, lactate, oxygen saturation (SpO2), creatinine, glucose, base excess, pH,
PaCO2, and body temperature. These clusters can be summarized in clinical presentation and
metabolic profile. When clustering longitudinal features during the first week in the intensive
care unit (ICU), no optimal number of clusters could be seen. However, glucose variation, a
panel of brain biomarkers, and creatinine consistently described trajectories. Although no
information on outcome was included in the models, both admission clusters and trajectories
showed clear outcome differences, with mortality ranging from 7% to 40% across the admission
clusters and from 4% to 85% across the trajectories. Adding cluster or trajectory labels to the
established IMPACT outcome prediction model significantly improved outcome predictions.
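The unsupervised clustering step can be illustrated with a plain k-means over admission variables; the algorithm, data, and choice of k below are illustrative, not those actually used in the thesis:

```python
# Plain k-means: alternate between assigning points to the nearest center
# and recomputing each center as the mean of its assigned points.
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # empty groups keep their previous center
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups
```

In the admission setting, each point would be a patient's standardised vector of GCS, lactate, SpO2, creatinine, glucose, base excess, pH, PaCO2, and body temperature, and the resulting groups are the admission clusters.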
The results in this thesis support the importance of cerebrovascular autoregulation status as it
was found that CA status was more informative towards outcome than ICP magnitude and
duration. There was a variation in tolerable ICP intensity and duration dependent on whether
CA was intact. Distinct clusters defined by GCS and metabolic profiles related to outcome
suggest the importance of an extracranial evaluation in addition to GCS in TBI patients.
Longitudinal trajectories of TBI patients in the ICU are highly characterized by glucose
variation, brain biomarkers, and creatinine.
Is mutual information adequate for feature selection in regression?
Feature selection is an important preprocessing step for many high-dimensional regression problems. One of the most common strategies is to select a relevant feature subset based on the mutual information criterion. However, no connection has yet been established in the machine learning literature between the use of mutual information and a regression error criterion. This is an important gap, since minimising such a criterion is ultimately the objective one is interested in. This paper demonstrates that, under some reasonable assumptions, features selected with the mutual information criterion are the ones minimising the mean squared error and the mean absolute error. Conversely, it is also shown that the mutual information criterion can fail to select optimal features in some situations, which we characterise. The theoretical developments presented in this work are expected to lead in practice to a critical and efficient use of mutual information for feature selection.
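The selection rule the paper analyses can be sketched with a discrete plug-in estimate of I(X_j; Y) from co-occurrence counts, assuming features and target have been pre-binned; this illustrates the criterion only, not the paper's theoretical results:

```python
# Mutual-information filter: estimate I(X_j; Y) for each (discretised)
# feature from empirical counts and keep the k highest-scoring features.
from collections import Counter
from math import log

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def select_by_mi(X, y, k):
    scores = [(mutual_information([row[j] for row in X], y), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

A feature identical to the target scores I = log 2 ≈ 0.69 nats on balanced binary data, while a constant feature scores 0, so the ranking behaves as the criterion intends; the paper's point is that this ranking coincides with minimising MSE/MAE only under certain assumptions.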