    Hyper-heuristic decision tree induction

    A hyper-heuristic is any algorithm that searches or operates in the space of heuristics as opposed to the space of solutions. Hyper-heuristics are increasingly used in function and combinatorial optimization. Rather than attempting to solve a problem using a fixed heuristic, a hyper-heuristic approach attempts to find a combination of heuristics that solves a problem (and in turn may be directly suitable for a class of problem instances). Hyper-heuristics have been little explored in data mining. This work presents novel hyper-heuristic approaches to data mining, by searching a space of attribute selection criteria for a decision tree building algorithm. The search is conducted by a genetic algorithm. The result of the hyper-heuristic search in this case is a strategy for selecting attributes while building decision trees. Most hyper-heuristics work by trying to adapt the heuristic to the state of the problem being solved. Our hyper-heuristic is no different. It employs a strategy for adapting the heuristic used to build decision tree nodes according to some set of features of the training set it is working on. We introduce, explore and evaluate five different ways in which this problem state can be represented for a hyper-heuristic that operates within a decision tree building algorithm. In each case, the hyper-heuristic is guided by a rule set that tries to map features of the data set to be split by the decision tree building algorithm to a heuristic to be used for splitting the same data set. We also explore and evaluate three different sets of low-level heuristics that could be employed by such a hyper-heuristic. This work also makes a distinction between specialist hyper-heuristics and generalist hyper-heuristics. The main difference between these two kinds of hyper-heuristic is the number of training sets used by the hyper-heuristic genetic algorithm.
Specialist hyper-heuristics are created using a single data set from a particular domain for evolving the hyper-heuristic rule set. Such algorithms are expected to outperform standard algorithms on the kind of data set used by the hyper-heuristic genetic algorithm. Generalist hyper-heuristics are trained on multiple data sets from different domains and are expected to deliver a robust and competitive performance over these data sets when compared to standard algorithms. We evaluate both approaches for each kind of hyper-heuristic presented in this thesis, using both real and synthetic data sets. Our results suggest that none of the hyper-heuristics presented in this work is suited for specialization: in most cases, the hyper-heuristic's performance on the data set it was specialized for was not significantly better than that of the best performing standard algorithm. On the other hand, the generalist hyper-heuristics delivered results that were very competitive with the best standard methods. In some cases we even achieved a significantly better overall performance than all of the standard methods.
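The idea of a rule set that maps features of the data reaching a node to a splitting heuristic can be sketched as follows. This is only an illustration under stated assumptions: the node features (`n_examples`, `majority_fraction`), the two low-level heuristics (entropy and Gini impurity) and the hand-written rules are all hypothetical, and the genetic algorithm that would evolve the rules is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Low-level heuristics (assumed for illustration): impurity measures
# used to score candidate splits.
HEURISTICS = {"entropy": entropy, "gini": gini}

def node_features(labels):
    """Hypothetical problem-state features of the data reaching a node."""
    counts = Counter(labels)
    return {
        "n_examples": len(labels),
        "n_classes": len(counts),
        "majority_fraction": max(counts.values()) / len(labels),
    }

# Hypothetical rule set: (condition on features, heuristic name).
# A genetic algorithm would evolve these; here they are fixed by hand.
RULES = [
    (lambda f: f["n_examples"] < 20, "gini"),
    (lambda f: f["majority_fraction"] > 0.8, "entropy"),
]
DEFAULT_HEURISTIC = "entropy"

def choose_heuristic(labels):
    """Return the name of the first heuristic whose rule matches the node."""
    feats = node_features(labels)
    for condition, name in RULES:
        if condition(feats):
            return name
    return DEFAULT_HEURISTIC

def split_score(parent, children, impurity):
    """Impurity reduction obtained by splitting `parent` into `children`."""
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted

def best_split(labels, candidate_splits):
    """Pick the attribute whose split scores best under the chosen heuristic.

    `candidate_splits` maps an attribute name to the list of child label
    lists that splitting on that attribute would produce.
    """
    impurity = HEURISTICS[choose_heuristic(labels)]
    return max(
        candidate_splits,
        key=lambda attr: split_score(labels, candidate_splits[attr], impurity),
    )
```

In this sketch the heuristic is re-selected at every node, so small or skewed subsets deeper in the tree can be split with a different criterion than the root.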

    Task decomposition with pattern distributor networks

    Classification, Diagnosis and Risk Assessment Methods in Diseases with Visual Impairment

    Doctoral thesis in Health Sciences, in the branch of Biomedical Sciences, presented to the Faculty of Medicine of the University of Coimbra. Visual impairment, which includes blindness and low vision, affects about 4.25% of the world population, and about 80% of cases are avoidable, since they can be prevented or cured. These estimates, from the World Health Organization, indicate that 82% of blind people are aged 50 or over.
The largest share of visual impairment is related to population ageing, a context in which posterior segment (retinal) diseases dominate. Among these diseases is diabetic retinopathy, an ocular manifestation of diabetes mellitus. This systemic disease is the leading cause of new cases of blindness around the world in people aged between 20 and 74 years, and the complication results from long-term accumulated damage to the small blood vessels in the retina. Furthermore, the eye is considered to play an important role in the diagnosis of systemic diseases due to its composition: every part of the eye can give important clues for diagnosis. Diabetes mellitus, especially type 2, is among the leading causes of death, disability and economic loss throughout the world. It is feared that it will become epidemic, since its incidence and prevalence are increasing, mainly due to population growth and ageing, as well as to lifestyle changes such as reduced physical activity and increased obesity. As a result, diabetic retinopathy has gained a prominent place in the list of priorities for preventable visual impairment. The latest prevalence estimates for diabetes in the Portuguese population aged between 20 and 79 years date from 2012 and report a value of 12.9%, which represents an increase of 1.2% since 2009. In 2009, diabetic retinopathy was reported as the leading cause of blindness in the Portuguese working-age population. Early diagnosis of both diabetes and its ocular complications is crucial in all socioeconomic contexts, in order to reduce the burden of their direct costs and, above all, of their indirect and intangible costs, both for diabetics and their families and for national health services.
Although the methods for diagnosing these diseases are clearly defined, finding new markers and non-invasive classifiers that can be used for screening in other medical contexts has become extremely important. A training sample was used to determine markers for type 2 diabetes, comprising 96 cases, of which 49 were type 2 diabetics, aged between 40 and 75 years. The group of diabetics was used to build a classifier for diabetic retinopathy in type 2 diabetics in the same age group; this sample comprised 40 subjects, of whom 20 had non-proliferative diabetic retinopathy. Correlation and concordance between the measures obtained by Optical Coherence Tomography in the left and right eyes of the same subjects were evaluated, leading to the conclusion that only one eye was needed for the analysis. Hence, the dominant eye was selected, since the visual psychophysics tests were performed only in that eye. For each subject, a global performance measure was built for each of the visual psychophysics tests (speed, achromatic vision and chromatic vision over the Protan, Deutan and Tritan axes), based on the values obtained for the 0º, 45º, 90º and 135º meridians. Afterwards, variable reduction was performed by comparing the groups with an independent-samples t test or a Mann-Whitney test, according to the data distribution, and only the variables that showed a statistically significant difference between groups, at the 5% significance level, remained in the analysis. Subsequently, Receiver Operating Characteristic (ROC) analysis was applied to each of the remaining variables, using the same significance level, and the set of variables that could individually separate the groups was identified. It was then possible to apply different statistical classification methods, such as discriminant analysis, logistic regression and decision tree algorithms, to the remaining variables.
The performance of the classifiers obtained for type 2 diabetes was compared both on the training set and on a test set of new subjects. The non-proliferative diabetic retinopathy classifiers have so far been evaluated only on the training sample; we intend to test their performance on a set of new cases in the future. The performance of the classifiers was assessed using their accuracy, determined by the area under the ROC curve for the posterior probabilities of each model, and their sensitivity and positive likelihood ratio for group classification. A final classifier is presented both for type 2 diabetes in people aged between 40 and 75 years and for non-proliferative diabetic retinopathy in type 2 diabetics of the same age group, together with their positive predictive values adjusted for the latest data on the prevalence of each disease in the Portuguese population. Whichever the clinical category (presence of disease or of complications), chromatic vision over the Tritan cone axis seems to play a main role in the classification of both diseases.
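The positive likelihood ratio and the prevalence-adjusted positive predictive value mentioned above follow from standard definitions (Bayes' theorem applied to sensitivity, specificity and prevalence). The sketch below shows those textbook formulas only; the function names and the sensitivity and specificity values in the usage example are assumptions for illustration, not results from the thesis. Only the 12.9% prevalence figure comes from the text.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)

def adjusted_ppv(sensitivity, specificity, prevalence):
    """Positive predictive value for a given population prevalence.

    PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    """
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

if __name__ == "__main__":
    # Hypothetical classifier performance (NOT values from the thesis).
    sens, spec = 0.80, 0.90
    # Diabetes prevalence of 12.9% as quoted in the abstract.
    print(positive_likelihood_ratio(sens, spec))
    print(adjusted_ppv(sens, spec, 0.129))
```

Because a training sample is often balanced between cases and controls, the predictive value observed on it overstates what would be seen at the lower population prevalence, which is why the adjustment matters.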

    Data mining of vehicle telemetry data

    Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptations in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive, and camera systems can be expensive and are unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN) bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contains biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are.
Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is computationally infeasible when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First, the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation-based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation-based selection methods are appropriate for use in selecting features from CAN-bus data.
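The abstract does not describe the blocked permutation method in detail, so the following is only a generic sketch of the underlying idea: when samples are temporally correlated, the target is permuted in contiguous blocks rather than sample by sample, so that the null distribution of a relevance score keeps the within-block temporal structure. The relevance score (absolute Pearson correlation), the block size and all function names are assumptions for illustration.

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as a simple relevance score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def block_permute(values, block_size, rng):
    """Shuffle contiguous blocks of `values`, keeping order inside each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    rng.shuffle(blocks)
    return [v for block in blocks for v in block]

def blocked_permutation_p_value(feature, target, block_size,
                                n_permutations=500, seed=0):
    """Estimate how often a block-permuted target scores at least as high.

    A small p-value suggests the feature's relevance is unlikely to be an
    artefact of temporal structure alone.
    """
    rng = random.Random(seed)
    observed = abs_correlation(feature, target)
    exceed = 0
    for _ in range(n_permutations):
        permuted = block_permute(target, block_size, rng)
        if abs_correlation(feature, permuted) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)
```

A variable would then be kept only if its p-value falls below a chosen threshold; the choice of block size trades off preserving temporal dependence against having enough distinct permutations.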