10 research outputs found

    Empirical study of bagging predictors on medical data

    This study investigates the performance of bagging when learning from imbalanced medical data. It is important for data miners to achieve highly accurate prediction models, and this is especially true for imbalanced medical applications. In these situations, practitioners are more interested in the minority class than the majority class; however, it is hard for a traditional supervised learning algorithm to achieve a highly accurate prediction on the minority class, even though it might achieve good results according to the most commonly used evaluation metric, Accuracy. Bagging is a simple yet effective ensemble method which has been applied to many real-world applications. However, some questions have not been well answered, e.g., whether bagging outperforms single learners on medical data-sets; which learners are the best predictors for each medical data-set; and what is the best predictive performance achievable for each medical data-set when we apply sampling techniques. We perform an extensive empirical study of 12 learning algorithms on 8 medical data-sets using four evaluation metrics: True Positive Rate (TPR), True Negative Rate (TNR), Geometric Mean (G-mean) of the accuracy rate of the majority class and the minority class, and Accuracy. In addition, the statistical analyses performed instil confidence in the validity of the conclusions of this research. © 2011, Australian Computer Society, Inc.
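The four evaluation metrics named in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `imbalance_metrics` and the convention that the minority class is the positive label are assumptions made here, and G-mean is taken as the square root of TPR times TNR.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return TPR, TNR, G-mean and accuracy for binary labels.

    Assumption: `positive` marks the minority class of interest.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    # Per-class accuracy rates; defined as 0.0 when a class is absent.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0

    return {
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": math.sqrt(tpr * tnr),
        "accuracy": (tp + tn) / len(pairs),
    }


# A classifier that always predicts the majority class on a 2-vs-8 split:
# accuracy looks acceptable, but TPR and G-mean collapse to zero.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(imbalance_metrics(y_true, y_pred))
```

The example shows why Accuracy alone can mislead on imbalanced data: a trivial majority-class predictor scores well on Accuracy while never detecting the minority class.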

    Evolutionary deep belief networks with bootstrap sampling for imbalanced class datasets

    Imbalanced class data is a common issue in classification tasks. The Deep Belief Network (DBN) is a promising deep learning algorithm for learning from complex feature input. However, when handling imbalanced class data, DBN performs poorly, as do other machine learning algorithms. In this paper, the genetic algorithm (GA) and bootstrap sampling are incorporated into DBN to lessen the drawbacks that occur when imbalanced class datasets are used. The performance of the proposed algorithm is compared with DBN and is evaluated using performance metrics. The results showed that performance improves when Evolutionary DBN with bootstrap sampling is used to handle imbalanced class datasets.

    Monitoring canid scent marking in space and time using a biologging and machine learning approach

    For canid species, scent marking plays a critical role in territoriality, social dynamics, and reproduction. However, due in part to human dependence on vision as our primary sensory modality, research on olfactory communication is hampered by a lack of tractable methods. In this study, we leverage a powerful biologging approach, using accelerometers in concert with GPS loggers to monitor and describe scent-marking events in time and space. We performed a validation experiment with domestic dogs, monitoring them by video concurrently with the novel biologging approach. We attached an accelerometer to the pelvis of 31 dogs (19 males and 12 females), detecting raised-leg and squat posture urinations by monitoring the change in device orientation. We then deployed this technique to describe the scent marking activity of 3 guardian dogs as they defend livestock from coyote depredation in California, providing an example use-case for the technique. During validation, the algorithm correctly classified 92% of accelerometer readings. High performance was partly due to the conspicuous signatures of archetypal raised-leg postures in the accelerometer data. Accuracy did not vary with the weight, age, and sex of the dogs, resulting in a method that is broadly applicable across canid species' morphologies. We also used models trained on each individual to detect scent marking of others to emulate the use of captive surrogates for model training. We observed no relationship between the similarity in body weight between the dog pairs and the overall accuracy of predictions, although models performed best when trained and tested on the same individual. We discuss how existing methods in the field of movement ecology can be extended to use this exciting new data type. This paper represents an important first step in opening new avenues of research by applying the power of modern technologies and machine learning to this field.
    Author affiliations:
    Bidder, Owen. University of California at Berkeley; United States
    Di Virgilio, Agustina Soledad. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte. Instituto de Investigaciones en Biodiversidad y Medioambiente. Universidad Nacional del Comahue. Centro Regional Universidad Bariloche. Instituto de Investigaciones en Biodiversidad y Medioambiente; Argentina
    Hunter, Jennifer. University of California at Berkeley; United States
    McInturff, Alex. University of California at Berkeley; United States
    Gaynor, Kaitlyn. University of California at Berkeley; United States
    Smith, Alison. University of California at Berkeley; United States
    Dorcy, Janelle. University of California at Berkeley; United States
    Rosell, Frank. University of South-Eastern Norway; Norway

    Statistical Theory for Imbalanced Binary Classification

    Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
    Comment: Parts of this paper have been revised from arXiv:2004.04715v2 [math.ST]

    Affective state level recognition in naturalistic facial and vocal expressions

    Naturalistic affective expressions change at a rate much slower than the typical rate at which video or audio is recorded. This increases the probability that consecutive recorded instants of expressions represent the same affective content. In this paper, we exploit such a relationship to improve the recognition performance of continuous naturalistic affective expressions. Using datasets of naturalistic affective expressions (AVEC 2011 audio and video dataset, PAINFUL video dataset) continuously labeled over time and over different dimensions, we analyze the transitions between levels of those dimensions (e.g., transitions in pain intensity level). We use an information theory approach to show that the transitions occur very slowly and hence suggest modeling them as first-order Markov models. The dimension levels are considered to be the hidden states in the Hidden Markov Model (HMM) framework. Their discrete transition and emission matrices are trained by using the labels provided with the training set. The recognition problem is converted into a best path-finding problem to obtain the best hidden states sequence in HMMs. This is a key difference from previous use of HMMs as classifiers. Modeling of the transitions between dimension levels is integrated in a multistage approach, where the first level performs a mapping between the affective expression features and a soft decision value (e.g., an affective dimension level), and further classification stages are modeled as HMMs that refine that mapping by taking into account the temporal relationships between the output decision labels. The experimental results for each of the unimodal datasets show overall performance to be significantly above that of a standard classification system that does not take into account temporal relationships. In particular, the results on the AVEC 2011 audio dataset outperform all other systems presented at the international competition
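The HMM refinement stage described in this abstract can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `estimate_hmm` and `viterbi`, the add-one smoothing parameter `alpha`, the use of a single training sequence, and the treatment of first-stage predictions as discrete observation symbols are all assumptions introduced here. Transition and emission matrices are estimated by counting over labelled training data, and recognition is performed as a best-path (Viterbi) search.

```python
import math


def estimate_hmm(labels, observations, n_states, n_symbols, alpha=1.0):
    """Estimate start, transition and emission probabilities by counting.

    labels: true dimension levels (hidden states) of one training sequence.
    observations: first-stage classifier outputs, discretized to symbols.
    alpha: additive smoothing (an assumption) so no probability is zero.
    """
    start = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    emit = [[alpha] * n_symbols for _ in range(n_states)]

    start[labels[0]] += 1
    for prev, cur in zip(labels, labels[1:]):
        trans[prev][cur] += 1
    for state, symbol in zip(labels, observations):
        emit[state][symbol] += 1

    def normalize(row):
        total = sum(row)
        return [value / total for value in row]

    return normalize(start), [normalize(r) for r in trans], [normalize(r) for r in emit]


def viterbi(observations, start, trans, emit):
    """Return the most likely hidden-state sequence (log-space Viterbi).

    All probabilities used must be strictly positive.
    """
    n_states = len(start)
    score = [math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in range(n_states)]
    backpointers = []

    for obs in observations[1:]:
        new_score, pointers = [], []
        for cur in range(n_states):
            # Best predecessor state for landing in `cur` at this step.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + math.log(trans[p][cur]))
            pointers.append(best_prev)
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][cur])
                             + math.log(emit[cur][obs]))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Levels change slowly in training, so an isolated noisy first-stage
# prediction in the test sequence is smoothed away by the decoder.
train_labels = [0] * 10 + [1] * 10
start, trans, emit = estimate_hmm(train_labels, train_labels, 2, 2)
print(viterbi([0, 0, 1, 0, 0], start, trans, emit))
```

Because slowly changing levels produce a transition matrix dominated by self-transitions, the decoder prefers to explain a single deviating observation as an emission error rather than as two rapid state changes.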

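    Técnicas de machine learning aplicadas al diagnóstico y tratamiento oncológico de precisión mediante el análisis de datos ómicos (Machine learning techniques applied to precision oncology diagnosis and treatment through the analysis of omics data)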

    Official Doctoral Programme in Information and Communications Technologies (Programa Oficial de Doutoramento en Tecnoloxías da Información e as Comunicacións). 5032V01. Thesis by compendium of publications.
    [Abstract; also provided in Spanish (Resumen) and Galician (Resumo)] As sequencing costs have fallen dramatically, an increasing amount of omics data is generated every day to characterise cancer molecularly. Large consortia are generating large amounts of these data and making them publicly available. In addition, Machine Learning (ML) models offer a significant advantage in extracting complex patterns from biomedical data. A study of their application in this field is necessary in order to obtain more robust and generalised results. This thesis studies the application of ML models to omics data analysis. A review of previous work identified certain limitations in the reproducibility and validation of the methodologies. From this study, a set of guidelines for robust and reproducible ML analysis of omics data was established. Altered biomarkers and pathways were identified in colon cancer patients, clinical conditions relevant to tumour development were predicted, and an automatic anti-tumour drug screening model was developed. These results are presented as a compendium of three scientific publications. In conclusion, this thesis provides a variety of computational approaches that support diagnosis and precision oncological treatment.