914 research outputs found
Optimising decision trees using multi-objective particle swarm optimisation
Copyright © 2009 Springer-Verlag Berlin Heidelberg. The final publication is available at link.springer.com. Book title: Swarm Intelligence for Multi-objective Problems in Data Mining. Summary:
Although conceptually quite simple, decision trees are still among the most popular classifiers applied to real-world problems. Their popularity is due to a number of factors: core among these are their ease of comprehension, robust performance, and fast data processing capabilities. Additionally, feature selection is implicit within the decision tree structure.
This chapter introduces the basic ideas behind decision trees, focusing on trees that test a rule on a single feature at each node (and therefore make recursive axis-parallel slices in feature space to form their classification boundaries). The use of particle swarm optimization (PSO) to train near-optimal decision trees is discussed, and PSO is applied in both a single-objective formulation (minimizing misclassification cost) and a multi-objective formulation (trading off misclassification rates across classes).
Empirical results are presented on popular classification data sets from the well-known UCI machine learning repository, and PSO is demonstrated to be fully capable of acting as an optimizer for trees on these problems. The results additionally support the argument that multi-objectification of a problem can improve uni-objective search in classification problems.
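The single-objective formulation above can be illustrated with a toy version of the idea: encode the threshold of an axis-parallel split as a particle's position and let a standard inertia-weight PSO minimize the misclassification rate. This is only a minimal sketch on invented 1-D data, not the chapter's actual tree encoding:

```python
import random

# Toy 1-D dataset: class 0 clusters near 0, class 1 near 10.
X = [0.5, 1.2, 2.0, 2.8, 7.5, 8.1, 9.0, 9.9]
y = [0,   0,   0,   0,   1,   1,   1,   1]

def misclassification(threshold):
    """Error of the axis-parallel stump: predict 1 when x > threshold."""
    preds = [1 if x > threshold else 0 for x in X]
    return sum(p != t for p, t in zip(preds, y)) / len(y)

def pso(n_particles=10, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard inertia-weight PSO over the 1-D threshold space."""
    rng = random.Random(seed)
    pos = [rng.uniform(0, 10) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                                  # personal best positions
    pbest_err = [misclassification(p) for p in pos]
    g = pbest_err.index(min(pbest_err))             # global best so far
    gbest, gbest_err = pbest[g], pbest_err[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] += vel[i]
            err = misclassification(pos[i])
            if err < pbest_err[i]:
                pbest[i], pbest_err[i] = pos[i], err
                if err < gbest_err:
                    gbest, gbest_err = pos[i], err
    return gbest, gbest_err

threshold, err = pso()
```

On this separable toy data any threshold between the two clusters gives zero error, so the swarm settles there quickly; a real tree encoding would optimize one threshold per internal node.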
Methods for Predicting an Ordinal Response with High-Throughput Genomic Data
Multigenic diagnostic and prognostic tools can be derived for ordinal clinical outcomes using data from high-throughput genomic experiments. A challenge in this setting is that the number of predictors is much greater than the sample size, so traditional ordinal response modeling techniques must be exchanged for more specialized approaches. Existing methods perform well on some datasets, but there is room for improvement in terms of variable selection and predictive accuracy. Therefore, we extended an impressive binary response modeling technique, Feature Augmentation via Nonparametrics and Selection (FANS), to the ordinal response setting. Through simulation studies and analyses of high-throughput genomic datasets, we showed that our Ordinal FANS method is sensitive and specific when discriminating between important and unimportant features from the high-dimensional feature space and is highly competitive in terms of predictive accuracy.
Discrete survival time is another example of an ordinal response. For many illnesses and chronic conditions, it is impossible to record the precise date and time of disease onset or relapse. Further, the HIPAA Privacy Rule prevents recording of protected health information, which includes all elements of dates (except year), so in the absence of a “limited dataset,” date of diagnosis and date of death are not available for calculating overall survival. Thus, we developed a method that is suitable for modeling high-dimensional discrete survival time data and assessed its performance by conducting a simulation study and by predicting the discrete survival times of acute myeloid leukemia patients using a high-dimensional dataset.
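The abstract does not spell out how Ordinal FANS handles the ordering, but a common device in ordinal modeling is to recast a K-level response as K-1 binary "is the outcome above level k" problems (the Frank-Hall decomposition). A small illustration of that recoding, with hypothetical labels:

```python
def ordinal_to_binary(y, levels):
    """Frank-Hall decomposition: one binary target per threshold level.

    y      : ordinal responses coded as integers 0..K-1
    levels : the ordered level codes
    Returns a dict mapping each threshold k to the binary targets 1{y > k}.
    """
    return {k: [int(v > k) for v in y] for k in levels[:-1]}

# Hypothetical 3-level clinical outcome for five patients.
order = {"low": 0, "medium": 1, "high": 2}
coded = [order[v] for v in ["low", "medium", "high", "medium", "low"]]
targets = ordinal_to_binary(coded, [0, 1, 2])
# targets[0] answers "is the outcome above low?", targets[1] "above medium?"
```

Each binary problem can then be attacked with any binary technique (such as FANS), and the per-threshold probabilities combined into ordinal predictions.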
Prognosis of symptomatic patients with Brugada Syndrome through electrocardiogram biomarkers and machine learning
Brugada Syndrome (BrS) is a rare but serious cardiovascular disorder that can cause dangerously fast heartbeats and is characterized by a particular set of electrocardiogram (ECG) patterns in its patients. It is a very unpredictable condition: many people present no symptoms at all, while for others, unfortunately, the first symptom is death.
For high-risk patients, placement of an implantable cardioverter-defibrillator is recommended. Unfortunately, this carries severe associated risks, such as infections and inappropriate shocks, so it is key to identify those high-risk patients correctly.
The objective of this project was to develop machine-learning-based tools able to distinguish symptomatic Brugada Syndrome patients from those who are not. Patients were considered symptomatic if they had recovered from cardiac death, or had suffered an arrhythmogenic syncope or sustained tachycardia.
To that end, after an investigation of the state of the art of the relevant subjects, several biomarkers related to Brugada ECG patterns were extracted from 24-hour ECG recordings of 45 different patients, after the recordings had been processed by signal averaging to reduce their noise. Those biomarkers, alongside some clinical data, were then partitioned in different ways to train and test several machine-learning-based automated classifier models.
The performance of those models was very poor: none of them was able to reliably classify BrS patients as desired. Nevertheless, valuable conclusions can be drawn from this first approach to pursue the intended goal further, and useful tools were developed that allow faster processing of the database used.
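The signal-averaging step mentioned above exploits the fact that averaging N time-aligned beats preserves the repeating waveform while uncorrelated noise shrinks roughly as 1/sqrt(N). A minimal sketch on synthetic data (the beat template and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "beat": a sharp peak standing in for a QRS complex,
# repeated 200 times with independent additive noise.
t = np.linspace(0, 1, 500)
template = np.exp(-((t - 0.5) ** 2) / 0.002)
noise_sd = 0.5
beats = template + rng.normal(0.0, noise_sd, size=(200, t.size))

# Signal averaging: the mean over time-aligned beats keeps the
# deterministic waveform; noise drops by about 1/sqrt(200).
averaged = beats.mean(axis=0)

residual_single = np.std(beats[0] - template)   # about noise_sd
residual_avg = np.std(averaged - template)      # much smaller
```

In practice the hard part is the alignment itself (detecting and registering each beat); once beats are aligned, the averaging is exactly this mean.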
A METHOD FOR DETECTING OPTIMAL SPLITS OVER TIME IN SURVIVAL ANALYSIS USING TREE-STRUCTURED MODELS
One of the most popular uses for tree-based methods is in survival analysis for censored time data, where the goal is to identify factors that are predictive of survival. Tree-based methods, due to their ability to identify subgroups in a hierarchical manner, can sometimes provide a useful alternative to Cox's proportional hazards model (1972) for the exploration of survival data. Since the data are partitioned into approximately homogeneous groups, Kaplan-Meier estimators can be used to compare prognosis between the groups represented by "nodes" in the tree. The demand for tree-based methods comes from clinical studies where the investigators are interested in grouping patients with differing prognoses. Tree-based methods are usually conducted at landmark time points, for example, five-year overall survival, but the effects of some covariates might be attenuated or increased at some other landmark time point. In some applications, it may be of interest to also determine the time point, with respect to the outcome of interest, where the greatest discrimination between subgroups occurs. Consequently, by using a conventional approach, the time point at which the discrimination is greatest might be missed. To remedy this potential problem, we propose a tree-structured method that splits based on the potential time-varying effects of the covariates. Accordingly, with our method, we find the best point of discrimination of a covariate with respect not only to a particular value of that covariate but also to the time when the endpoint of interest is observed. We analyze survival data from the National Surgical Adjuvant Breast and Bowel Project (NSABP) Protocol B-09 to demonstrate our method. Simulations are used to assess the statistical properties of the proposed methodology.
We propose a new method in survival analysis, an area of statistics commonly used to assess the prognoses of patients or participants in large public health studies. Our proposed method has public health significance because it could potentially facilitate a more refined assessment of the effect of biological and clinical markers on the survival times of different patient populations.
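The Kaplan-Meier comparison between tree nodes mentioned above can be sketched with a small estimator; the times and event indicators below are invented for illustration:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates S(t) at each distinct event time.

    times  : observed times (event or censoring)
    events : 1 if the event occurred, 0 if censored
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        n_t = sum(1 for tt, _ in data if tt == t)
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk   # multiply by P(survive past t)
            curve.append((t, surv))
        n_at_risk -= n_t                        # drop events and censorings at t
        i += n_t
    return curve

# Survival within one hypothetical tree "node": times with event indicators.
node_a = kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0])
```

Fitting one curve per terminal node and plotting them side by side is the usual way to display the prognostic separation a survival tree achieves.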
An uncertainty prediction approach for active learning - application to earth observation
Mapping land cover and land usage dynamics is crucial in remote sensing, since farmers are encouraged to either intensify or extend crop use due to the ongoing rise in the world's population. A major issue in this area is interpreting and classifying a scene captured in high-resolution satellite imagery. Several methods have been put forth, including neural networks, which generate data-dependent models (i.e. the model is biased toward the data), and static rule-based approaches with thresholds, which are limited in terms of diversity (i.e. the model lacks diversity in terms of rules). However, the problem of building a machine learning model that, given a large amount of training data, can classify multiple classes over Sentinel-2 imagery of different geographic areas and outperform existing approaches remains open.
On the other hand, supervised machine learning has evolved into an essential part of many areas due to the increasing number of labeled datasets. Examples include creating classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as virtual personal assistants and detect online fraud, among many more. Since these classifiers are highly dependent on the training datasets, without human interaction or accurate labels their performance on unseen observations is uncertain. Thus, researchers have attempted to evaluate a number of independent models using a statistical distance. However, the problem of, given a train-test split and classifiers modeled over the training set, identifying a prediction error using the relation between the training and test sets remains open.
Moreover, while some training data is essential for supervised machine learning, what happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets is a time-consuming process that may need significant expert human involvement. When there are not enough expert manual labels available for the vast amount of openly accessible data, active learning becomes crucial. However, given large training and unlabeled datasets, building an active learning model that can reduce the training cost of the classifier and at the same time assist in labeling new data points remains an open problem.
From the experimental approaches and findings, the main research contributions, which concentrate on the issue of optical satellite image scene classification, include: building labeled Sentinel-2 datasets with surface reflectance values; the proposal of machine learning models for pixel-based image scene classification; the proposal of a statistical-distance-based Evidence Function Model (EFM) to detect ML model misclassification; and the proposal of a generalised sampling approach for active learning that, together with the EFM, enables a way of determining the most informative examples.
Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared to Sen2Cor – the reference package from the European Space Agency – with a micro-F1 value of 84% attained by the ML model, a significant improvement over the corresponding Sen2Cor performance of 59%. Secondly, to quantify the misclassification of the ML models, the Mahalanobis-distance-based EFM was devised. This model achieved, for the labeled Sentinel-2 dataset, a micro-F1 of 67.89% for misclassification detection. Lastly, the EFM was engineered as a sampling strategy for active learning, leading to an approach that attains the same level of accuracy with only 0.02% of the total training samples when compared to a classifier trained with the full training set.
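The EFM's exact formulation is not given in this summary, but its core ingredient, the Mahalanobis distance of a sample from the training distribution of its predicted class, can be sketched as follows. The two-band "reflectance" features and both query pixels are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training features for one predicted class (e.g. "Water" pixels,
# two reflectance bands); means and spreads are invented for illustration.
train = rng.normal([0.2, 0.1], [0.05, 0.03], size=(500, 2))

mean = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def mahalanobis(x):
    """Distance of x from the class distribution, in covariance units."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# A pixel close to the class distribution versus one far from it:
# large distances are evidence that the prediction may be wrong, which
# also makes such points informative queries for active learning.
d_near = mahalanobis(np.array([0.21, 0.11]))
d_far = mahalanobis(np.array([0.6, 0.4]))
```

Turning this into a detector then only requires a distance threshold (or a calibrated score) per class, which is where the evidence-function modeling would come in.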
With the help of the above-mentioned research contributions, we were able to provide an open-source Sentinel-2 image scene classification package which consists of ready-to-use Python scripts and an ML model that classifies Sentinel-2 L1C images, generating a 20m-resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other), giving academics a straightforward method for rapidly and effectively classifying Sentinel-2 scene images. Additionally, an active learning approach that uses, as a sampling strategy, the observed prediction uncertainty given by the EFM will allow labeling only the most informative points to be used as input to build classifiers.
Heuristic methods for support vector machines with applications to drug discovery.
The contributions to computer science presented in this thesis were inspired by the analysis of the data generated in the early stages of drug discovery. These data sets are generated by screening compounds against various biological receptors, which gives a first indication of biological activity. To avoid screening inactive compounds, decision rules for selecting compounds are required. Such a decision rule is a mapping from a compound representation to an estimated activity. Hand-coding such rules is time-consuming, expensive and subjective. An alternative is to learn these rules from the available data. This is difficult since the compounds may be characterized by tens to thousands of physical, chemical, and structural descriptors, and it is not known which are most relevant to the prediction of biological activity. Further, the activity measurements are noisy, so the data can be misleading. The support vector machine (SVM) is a statistically well-founded learning machine that is not adversely affected by high-dimensional representations and is robust with respect to measurement inaccuracies. It thus appears ideally suited to the analysis of screening data. The novel application of the SVM to this domain highlights some shortcomings of the vanilla SVM. Three heuristics are developed to overcome these deficiencies: a stopping criterion, HERMES, that allows good solutions to be found in less time; an automated method, LAIKA, for tuning the Gaussian kernel SVM; and an algorithm, STAR, that outputs a more compact solution. These heuristics achieve their aims on public domain data and are broadly successful when applied to the drug discovery data. The heuristics and associated data analysis are thus of benefit to both pharmacology and computer science.
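LAIKA's actual tuning procedure is not described in this abstract. As a stand-in, the sketch below shows the Gaussian kernel at the heart of such an SVM together with the common median-distance heuristic for choosing its width, applied to random vectors playing the role of compound descriptors:

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def median_heuristic(X):
    """A common default for the kernel width: the median pairwise distance.

    Too small a sigma makes K nearly the identity (overfitting); too large
    makes all entries nearly 1 (underfitting); the median sits in between.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d = np.sqrt(sq[np.triu_indices_from(sq, k=1)])
    return float(np.median(d))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))      # stand-in for compound descriptor vectors
sigma = median_heuristic(X)
K = rbf_kernel(X, X, sigma)
```

The resulting kernel matrix is what the SVM optimizer consumes; automated tuners like LAIKA refine the width further using the training data rather than stopping at this heuristic.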
Human activity recognition using wearable sensors: a deep learning approach
In the past decades, Human Activity Recognition (HAR) has attracted considerable research attention from a wide range of pattern recognition and human–computer interaction researchers due to its prominent applications, such as smart-home health care. The wealth of information requires efficient classification and analysis methods, and deep learning represents a promising technique for large-scale data analytics. There are various ways of using different sensors for human activity recognition in a smartly controlled environment. Among them, physical human activity recognition through wearable sensors provides valuable information about an individual's degree of functional ability and lifestyle. Much existing work relies on real-time processing, which increases the power consumption of mobile devices; since mobile phones are resource-limited devices, implementing and evaluating different recognition systems on them is a challenging task.
This work proposes a Deep Belief Network (DBN) model for human activity recognition. Various experiments are performed on a real-world wearable sensor dataset to verify the effectiveness of the deep learning algorithm. The results show that the proposed DBN performs competitively in comparison with other algorithms and achieves satisfactory activity recognition performance. Some open problems and ideas are also presented that should be investigated in future research.
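The abstract does not detail the preprocessing, but wearable-sensor HAR pipelines typically segment the raw stream into overlapping fixed-length windows and summarize each window before any classifier (a DBN here, or simpler baselines) sees it. A minimal sketch, with the window length and step chosen arbitrarily:

```python
import numpy as np

def sliding_windows(signal, width, step):
    """Split a 1-D sensor stream into overlapping fixed-length windows."""
    starts = range(0, len(signal) - width + 1, step)
    return np.array([signal[s:s + width] for s in starts])

rng = np.random.default_rng(0)
accel = rng.normal(size=1000)              # stand-in for one accelerometer axis
frames = sliding_windows(accel, width=128, step=64)   # 50% overlap

# Simple per-window statistical features often fed to the classifier
# (or, for a DBN, the raw windows themselves can be the visible layer).
features = np.stack([frames.mean(1), frames.std(1), np.abs(frames).max(1)],
                    axis=1)
```

The 128-sample window with 50% overlap is a common convention in the HAR literature, not something this particular work specifies.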
Classic and Bayesian Tree-Based Methods
Tree-based methods are nonparametric machine-learning techniques for data prediction and exploratory modeling. These models are among the most valuable and powerful data mining tools and can be used for predicting different types of outcome (dependent) variable: quantitative, qualitative, or time until an event occurs (survival data). A tree model is called a classification tree, regression tree, or survival tree according to the type of outcome variable. These methods have several advantages over traditional statistical methods such as generalized linear models (GLMs), discriminant analysis, and survival analysis: they require no assumptions about the functional form relating the outcome variable to the predictor (independent) variables; they are invariant to monotone transformations of the predictor variables; they are useful for dealing with nonlinear relationships and high-order interactions; they handle different types of predictor variable; their results are easy to interpret and understand without statistical expertise; and they are robust to missing values, outliers, and multicollinearity. Several classic and Bayesian tree algorithms have been proposed for classification and regression trees, and in this chapter we provide a review of these algorithms and appropriate criteria for determining their predictive performance.
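The splitting step shared by the classic algorithms reviewed here can be illustrated with a CART-style exhaustive search for the single-feature threshold that minimizes the weighted Gini impurity of the two child nodes, on toy data:

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Exhaustive search for the split minimizing weighted child impurity."""
    order = sorted(zip(xs, ys))
    sx = [x for x, _ in order]
    sy = [y for _, y in order]
    best = (float("inf"), None)
    for i in range(1, len(sx)):
        if sx[i] == sx[i - 1]:
            continue                       # no valid threshold between ties
        left, right = sy[:i], sy[i:]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(sy)
        thr = (sx[i - 1] + sx[i]) / 2      # midpoint between adjacent values
        if w < best[0]:
            best = (w, thr)
    return best

impurity, threshold = best_split([1, 2, 3, 10, 11, 12],
                                 ["a", "a", "a", "b", "b", "b"])
```

Recursing this search over the resulting child nodes (with a stopping rule) yields a classification tree; regression and survival trees swap Gini for a variance or log-rank criterion.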