9 research outputs found

    Voting with Random Classifiers (VORACE)

    In many machine learning scenarios, looking for the best classifier that fits a particular dataset can be very costly in terms of time and resources. Moreover, it can require deep knowledge of the specific domain. We propose a new technique which does not require profound expertise in the domain and avoids the commonly used strategy of hyper-parameter tuning and model selection. Our method is an ensemble technique that applies voting rules to a set of randomly generated classifiers. Given a new input sample, we interpret the output of each classifier as a ranking over the set of possible classes. We then aggregate these output rankings using a voting rule, which treats them as preferences over the classes. We show that our approach obtains good results compared to the state of the art, providing both a theoretical analysis and an empirical evaluation of the approach on several datasets.
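    The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes the Borda count as the voting rule (the abstract does not name a specific rule) and uses made-up class scores; the function names `ranking_from_scores` and `borda_aggregate` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def ranking_from_scores(scores):
    """Turn a {class: score} mapping into a ranking, best class first."""
    # Ties are broken by class label so the result is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


def borda_aggregate(rankings):
    """Aggregate rankings with the Borda count and return the winning class."""
    points = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, label in enumerate(ranking):
            # The top class earns m - 1 points, the last class earns 0.
            points[label] += m - 1 - position
    return min(points, key=lambda c: (-points[c], c))


# Illustrative (made-up) outputs of three classifiers for one input sample.
classifier_outputs = [
    {"cat": 0.50, "dog": 0.40, "bird": 0.10},
    {"cat": 0.10, "dog": 0.50, "bird": 0.40},
    {"cat": 0.35, "dog": 0.40, "bird": 0.25},
]
rankings = [ranking_from_scores(s) for s in classifier_outputs]
prediction = borda_aggregate(rankings)
```

    Here each classifier contributes a full ranking rather than a single vote, so a class that is often ranked second can still win the aggregate.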

    DERMA: A Melanoma Diagnosis Platform Based on Collaborative Multilabel Analog Reasoning

    The number of melanoma-related deaths has increased over the last few years due to new sun-exposure habits. Early diagnosis has become the best prevention method. This work presents DERMA, a melanoma diagnosis architecture based on the collaboration of several multilabel case-based reasoning subsystems. The system faces several challenges, including data characterization, pattern matching, reliable diagnosis, and self-explanation capabilities. Experiments using subsystems specialized in confocal and dermoscopy images have provided promising results for helping experts to diagnose melanoma.

    Credit Risk Scoring: A Stacking Generalization Approach

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management. Credit risk regulation has been receiving tremendous attention as a result of the effects of the latest global financial crisis. According to the developments made in the Internal Rating Based approach, under the Basel guidelines, banks are allowed to use internal risk measures as key drivers to assess the possibility of granting a loan to an applicant. Credit scoring is a statistical approach used for evaluating potential loan applications in both financial and banking institutions. When applying for a loan, an applicant must fill out an application form detailing their characteristics (e.g., income, marital status, and loan purpose), which serve as inputs to a credit scoring model that produces a score used to determine whether a loan should be granted or not. This enables faster and more consistent credit approvals and the reduction of bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, have been used individually for credit scoring models. Newer machine learning techniques can outperform classic methods simply by combining models. This dissertation is an empirical study of banking loan default on a publicly available bank loan dataset, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacking generalization, an extension of various preceding studies that used different techniques to further enhance the model's predictive capabilities. The results show that combining different models provides a great deal of flexibility to credit scoring models.

    Modeling of learning curves with applications to POS tagging

    An algorithm to estimate the evolution of learning curves on the whole of a training database, based on the results obtained from a portion and using a functional strategy, is introduced. We iteratively approximate the sought value at the desired time, independently of the learning technique used and once a point in the process, called the prediction level, has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second relates to the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. The prediction of accuracy is also a valuable item of information for customizing systems, since we can estimate in advance the impact of settings on both the performance and the development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations. Ministerio de Economía y Competitividad | Ref. FFI2014-51978-C2-1-

    Ajuda al Diagnòstic de Càncer de Melanoma amb Raonament Analògic Multietiqueta

    Mortality related to melanoma cancer has increased in recent years, mainly due to new habits of sun exposure. According to medical criteria, early diagnosis has become the best method of prevention. This is not a trivial task, however, because domain experts face a problem characterized by a large volume of heterogeneous data and partial knowledge. Based on these requirements, we propose the creation of a decision support system that is able to assist melanoma experts in their diagnosis. The system has to cope with various challenges, which include the characterization of the domain, the identification of patterns in the data according to medical criteria, the classification of new patients, and the ability to explain the predictions obtained. These goals have been materialized in the DERMA platform, which is based on the collaboration of several multilabel analogical reasoning subsystems. The experiments conducted with the proposed system using confocal and dermoscopic image data have made it possible to ascertain the reliability of the system. The results have been validated by experts in melanoma diagnosis, who consider them positive.

    Empirical Comparisons of Various Voting Methods in Bagging

    Finding effective methods for developing an ensemble of models has been an active research area of large-scale data mining in recent years. Models learned from data are often subject to some degree of uncertainty, for a variety of reasons. In classification, ensembles of models provide a useful means of averaging out error introduced by individual classifiers, hence reducing the generalization error of prediction. The plurality voting method is often chosen for bagging because of its simplicity of implementation. However, the plurality approach to model reconciliation is ad hoc. There are many other voting methods to choose from, including the anti-plurality method, the plurality method with elimination, the Borda count method, and Condorcet’s method of pairwise comparisons. Any of these could lead to a better method for reconciliation. In this paper, we analyze the use of these voting methods in model reconciliation. We present empirical results comparing the performance of these voting methods when applied in bagging. These results include some surprises, and among other things suggest that (1) plurality is not always the best voting method; (2) the number of classes can affect the performance of voting methods; and (3) the degree of dataset noise can affect the performance of voting methods. While it is premature to make final judgments about specific voting methods, the results of this work raise interesting questions, and they open the door to the application of voting theory in classification theory.
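    Three of the voting methods named in this abstract can be sketched in a few lines to show how they may disagree on the same input. This is an illustrative sketch only: the ballots are made up, each ballot is assumed to be one model's full ranking of the classes (best first), ties are broken alphabetically, and the function names are hypothetical rather than taken from the paper.

```python
from collections import Counter


def plurality(ballots):
    """Winner is the class ranked first by the most models."""
    firsts = Counter(b[0] for b in ballots)
    return min(firsts, key=lambda c: (-firsts[c], c))


def anti_plurality(ballots):
    """Winner is the class ranked last by the fewest models."""
    lasts = Counter(b[-1] for b in ballots)
    candidates = set(ballots[0])
    return min(candidates, key=lambda c: (lasts[c], c))


def plurality_with_elimination(ballots):
    """Drop the class with the fewest first places until one has a majority."""
    remaining = set(ballots[0])
    while True:
        # Each ballot counts for its highest-ranked class still in the race.
        firsts = Counter(next(c for c in b if c in remaining) for b in ballots)
        leader = min(firsts, key=lambda c: (-firsts[c], c))
        if firsts[leader] * 2 > len(ballots) or len(remaining) == 1:
            return leader
        # Eliminate the class with the fewest first places (ties: last label).
        loser = max(remaining, key=lambda c: (-firsts[c], c))
        remaining.discard(loser)


# Seven made-up ballots over classes "a", "b", "c".
ballots = (
    [["a", "b", "c"]] * 3
    + [["b", "c", "a"]] * 2
    + [["c", "b", "a"]] * 2
)
```

    On these ballots plurality picks "a", while anti-plurality and plurality with elimination both pick "b", which is the kind of disagreement between rules that the abstract's comparison is concerned with.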

    Heterogeneous recognition of bioacoustic signals for human-machine interfaces

    Human-machine interfaces (HMI) provide a communication pathway between man and machine. Not only do they augment existing pathways, they can substitute or even bypass these pathways where functional motor loss prevents the use of standard interfaces. This is especially important for individuals who rely on assistive technology in their everyday life. Utilising bioacoustic activity can lead to an assistive HMI concept which is unobtrusive, minimally disruptive and cosmetically appealing to the user. However, due to the complexity of the signals, it remains relatively underexplored in the HMI field. This thesis investigates extracting and decoding volition from bioacoustic activity with the aim of generating real-time commands. The developed framework is a systemisation of various processing blocks enabling the mapping of continuous signals into M discrete classes. Class-independent extraction efficiently detects and segments the continuous signals, while class-specific extraction exemplifies each pattern set using a novel template creation process stable to permutations of the data set. These templates are utilised by a generalised single-channel discrimination model, whereby each signal is template-aligned prior to classification. The real-time decoding subsystem uses a multichannel heterogeneous ensemble architecture which fuses the output from a diverse set of these individual discrimination models. This enhances the classification performance by elevating both the sensitivity and specificity, with the increased specificity due to a natural rejection capacity based on a non-parametric majority vote. Such a strategy is useful when analysing signals which have diverse characteristics, when false positives are prevalent and have strong consequences, and when there is limited training data available. The framework has been developed with generality in mind, with wide applicability to a broad spectrum of biosignals.
The processing system has been demonstrated on real-time decoding of tongue-movement ear pressure signals using both single- and dual-channel setups. This has included in-depth evaluation of these methods in both offline and online scenarios. During online evaluation, a stimulus-based test methodology was devised, while representative interference was used to contaminate the decoding process in a relevant and realistic fashion. The results of this research provide a strong case for the utility of such techniques in real-world applications of human-machine communication using impulsive bioacoustic signals and biosignals in general.
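    The rejection capacity of a majority vote mentioned in this abstract can be illustrated with a small sketch. It assumes that a command is emitted only when a strict majority of the ensemble members agree and that the input is rejected otherwise; the thesis may define the vote differently, and the names `REJECT` and `majority_vote_with_rejection` are hypothetical.

```python
from collections import Counter

REJECT = None  # Hypothetical sentinel meaning "no command issued".


def majority_vote_with_rejection(votes):
    """Return the class chosen by a strict majority of models, else REJECT."""
    if not votes:
        return REJECT
    label, count = Counter(votes).most_common(1)[0]
    # Without a strict majority the ensemble abstains, which trades some
    # sensitivity for fewer false positives.
    return label if count * 2 > len(votes) else REJECT


# Made-up votes from three discrimination models for two signal segments.
agreed = majority_vote_with_rejection(["left", "left", "up"])
split = majority_vote_with_rejection(["left", "up", "down"])
```

    In this sketch the first segment yields "left", while the second is rejected because no class reaches a majority.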