
    Multistage feature selection methods for data classification

    In the data analysis process, a good decision can be made with the assistance of several sub-processes and methods, the most common of which are feature selection and classification. Various methods have been proposed to address issues faced by decision-makers, such as low classification accuracy and long processing time. The analysis becomes more complicated when dealing with complex datasets, that is, datasets that are both large and problematic. One solution is to employ an effective feature selection method to reduce data processing time, decrease memory usage, and increase decision accuracy. However, not all existing methods are capable of dealing with these issues. The aim of this research was to assist the classifier in achieving better performance on problematic datasets by generating an optimised attribute set. The proposed method comprises two stages of feature selection: a correlation-based feature selection method using a best-first search algorithm (CFS-BFS), followed by a soft set and rough set parameter selection method (SSRS). CFS-BFS eliminates uncorrelated attributes in a dataset, while SSRS handles problematic values such as uncertainty. Several benchmark feature selection methods, such as classifier subset evaluation (CSE) and principal component analysis (PCA), and different classifiers, such as the support vector machine (SVM) and neural network (NN), were used to validate the results. ANOVA and t-tests were also conducted to verify them. The averages obtained in two experiments show that the proposed method matched the performance of the benchmark methods in assisting the classifier to achieve high classification performance on complex datasets, while in a third experiment the proposed method outperformed the benchmark methods. In conclusion, the proposed method is a viable alternative feature selection method and is able to assist classifiers in achieving better accuracy in the classification process, especially when dealing with problematic datasets.
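    As a rough illustration of the two-stage idea described above, the sketch below chains a simple correlation filter (a simplified stand-in for CFS-BFS) with a second reduction step (standing in for SSRS) before validating with an SVM. The dataset, threshold, and selectors are illustrative assumptions, not the paper's implementation.
```python
# Two-stage feature selection sketch: correlation filter, then a second
# reduction step, validated with cross-validated SVM accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Stage 1: drop attributes weakly correlated with the class label
# (a simplified stand-in for CFS with best-first search).
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
stage1_idx = np.where(corr > 0.3)[0]          # threshold is an assumption
X_stage1 = X[:, stage1_idx]

# Stage 2: further reduce the attribute set (stand-in for the SSRS step),
# then validate with an SVM classifier as in the paper's experiments.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=min(10, X_stage1.shape[1]))),
    ("clf", SVC(kernel="rbf")),
])
print(cross_val_score(pipe, X_stage1, y, cv=5).mean())
```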

    A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease

    Alzheimer’s disease (AD) is the most common type of dementia. Its diagnosis and progression detection have been intensively studied. Nevertheless, research studies often have little effect on clinical practice, mainly for the following reasons: (1) most studies depend mainly on a single modality, especially neuroimaging; (2) diagnosis and progression detection are usually studied separately as two independent problems; and (3) current studies concentrate mainly on optimizing the performance of complex machine learning models while disregarding their explainability. As a result, physicians struggle to interpret these models and find it hard to trust them. In this paper, we carefully develop an accurate and interpretable AD diagnosis and progression detection model. This model provides physicians with accurate decisions along with a set of explanations for every decision. Specifically, the model integrates 11 modalities of 1048 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) real-world dataset: 294 cognitively normal, 254 stable mild cognitive impairment (MCI), 232 progressive MCI, and 268 AD. It is a two-layer model with random forest (RF) as the classifier algorithm. In the first layer, the model carries out multi-class classification for the early diagnosis of AD patients. In the second layer, the model applies binary classification to detect possible MCI-to-AD progression within three years of a baseline diagnosis. The performance of the model is optimized with key markers selected from a large set of biological and clinical measures. Regarding explainability, we provide, for each layer, global and instance-based explanations of the RF classifier using the SHapley Additive exPlanations (SHAP) feature attribution framework. In addition, we implement 22 explainers based on decision trees and fuzzy rule-based systems to provide complementary justifications for every RF decision in each layer. Furthermore, these explanations are presented in natural language form to help physicians understand the predictions. The designed model achieves a cross-validation accuracy of 93.95% and an F1-score of 93.94% in the first layer, and a cross-validation accuracy of 87.08% and an F1-score of 87.09% in the second layer. The resulting system is not only accurate, but also trustworthy, accountable, and medically applicable, thanks to the provided explanations, which are broadly consistent with each other and with the AD medical literature. The proposed system can help to enhance the clinical understanding of AD diagnosis and progression processes by providing detailed insights into the effect of different modalities on disease risk. This work was supported by a National Research Foundation of Korea grant funded by the Korean Government (Ministry of Science and ICT) (NRF-2020R1A2B5B02002478). In addition, Dr. Jose M. Alonso is a Ramon y Cajal Researcher (RYC-2016-19802), and his research is supported by the Spanish Ministry of Science, Innovation and Universities (grants RTI2018-099646-B-I00, TIN2017-84796-C2-1-R, TIN2017-90773-REDT, and RED2018-102641-T) and the Galician Ministry of Education, University and Professional Training (grants ED431F 2018/02, ED431C 2018/29, ED431G/08, and ED431G2019/04), with all grants co-funded by the European Regional Development Fund (ERDF/FEDER program).
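    A minimal sketch of how the two-layer design and SHAP explanations could be wired up with scikit-learn and the shap package. The synthetic features and labels below are placeholders for the 11 ADNI modalities, and the hyperparameters are assumptions, not those tuned in the paper.
```python
# Layer 1: multi-class RF for diagnosis; layer 2: binary RF for progression;
# SHAP tree explanations provide global and instance-based attributions.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # placeholder multimodal features
y_diag = rng.integers(0, 4, size=1000)   # CN / sMCI / pMCI / AD (layer 1 labels)
y_prog = rng.integers(0, 2, size=1000)   # MCI-to-AD progression (layer 2 labels)

# Layer 1: multi-class diagnosis
X_tr, X_te, y_tr, y_te = train_test_split(X, y_diag, random_state=0)
layer1 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Layer 2: binary progression detection (same SHAP treatment applies here)
layer2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y_prog)

# Global and instance-based explanations with SHAP's tree explainer
explainer = shap.TreeExplainer(layer1)
shap_values = explainer.shap_values(X_te)  # per-class attributions for layer 1
print(layer1.score(X_te, y_te), np.shape(shap_values))
```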

    A study of classifier ensemble construction methods and applications

    Artificial intelligence is concerned with creating computer systems that behave intelligently. Within this area, machine learning studies the creation of systems that learn by themselves. One type of machine learning is supervised learning, in which the system is given both the inputs and the expected output and learns from these data. A system of this kind is called a classifier. It sometimes happens that, in the set of examples the system uses to learn, the number of examples of one type is much larger than the number of examples of another type. When this occurs, we speak of imbalanced datasets. The combination of several classifiers is what is known as an "ensemble", and it often gives better results than any of its individual members. One of the keys to the good performance of ensembles is diversity. This thesis focuses on the development of new ensemble construction algorithms, centred on techniques for increasing diversity and on imbalanced problems. Additionally, these techniques are applied to the solution of several industrial problems. Ministerio de Economía y Competitividad, project TIN-2011-2404
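    A minimal sketch of the two central ideas, assuming scikit-learn: an ensemble of heterogeneous (hence diverse) base classifiers evaluated on an imbalanced synthetic dataset, with class weighting to cope with the imbalance. The specific models, data, and parameters are illustrative, not the algorithms developed in the thesis.
```python
# Diverse ensemble via heterogeneous members; class_weight handles imbalance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced two-class problem (roughly 9:1)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

ensemble = VotingClassifier([
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
], voting="soft")

print(cross_val_score(ensemble, X, y, cv=5, scoring="balanced_accuracy").mean())
```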

    Un-factorize non-food NPS on a food-based retailer

    Master's dissertation in Statistics for Data Science. More and more companies realise that understanding their customers can be a way to improve customer satisfaction and, consequently, customer loyalty, which in turn can result in an increase in sales. The Net Promoter Score (NPS) has been widely adopted by managers as a measure of customer loyalty and a predictor of sales growth. In this regard, this dissertation aims to create a classification model focused not only on identifying the customer's NPS class, namely classifying the customer as Detractor, Passive or Promoter, but also on understanding which factors have the most impact on that classification. The goal in doing so is to collect relevant business insights as a way to identify areas that can help to improve customer satisfaction. We propose a Data Mining approach to the NPS multi-class classification problem. Our approach leverages survey data, as well as transactional data collected through a retailer's loyalty card, building a data set from which we can extract information such as NPS ratings, customer behaviour and store details. Initially, an exploratory analysis is done on the data. Several resampling techniques are applied to the data set to handle class imbalance. Two different machine learning algorithms are applied: Decision Trees and Random Forests. The results did not show good model performance. An error analysis was then performed on the latter model, where it was concluded that the classifier has difficulty distinguishing the Detractor and Passive classes, but performs well when predicting the Promoter class. In a business sense, this methodology can be leveraged to distinguish the Promoters from the rest of the consumers, since the Promoters are the customers most likely to provide good value in the long term and can benefit the company by spreading the word and attracting new customers.
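    A minimal sketch of the workflow described above, assuming scikit-learn and imbalanced-learn: oversample the imbalanced Detractor/Passive/Promoter classes, then fit Decision Tree and Random Forest classifiers. The synthetic data and the choice of SMOTE are illustrative assumptions, not the retailer's data or the dissertation's exact resampling techniques.
```python
# Multi-class NPS classification sketch: resample, then compare two classifiers.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder 3-class imbalanced data (0=Detractor, 1=Passive, 2=Promoter)
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.15, 0.25, 0.60], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance the training classes by oversampling (one of several possible techniques)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=300, random_state=0)):
    preds = model.fit(X_bal, y_bal).predict(X_te)
    print(type(model).__name__)
    print(classification_report(y_te, preds,
                                target_names=["Detractor", "Passive", "Promoter"]))
```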

    Machine Learning Approaches for Healthcare Analysis

    Machine learning (ML) is a branch of artificial intelligence that teaches computers how to discover difficult-to-distinguish patterns from huge or complex data sets and learn from previous cases by utilizing a range of statistical, probabilistic, data processing, and optimization methods. Nowadays, ML plays a vital role in many fields, such as finance, self-driving cars, image processing, medicine, and speech recognition. In healthcare, ML has been used in applications such as the detection, prognosis, diagnosis, and treatment of diseases due to its capability to handle large data. Moreover, ML has exceptional ability to predict disease by uncovering patterns in medical datasets. Machine learning and deep learning are better suited for analyzing medical datasets than traditional methods because of the nature of these datasets: they are mostly large, complex, heterogeneous data coming from different sources, requiring more efficient computational techniques to handle them. This dissertation presents several machine-learning techniques to tackle medical issues such as data imbalance, classification and upgrading of tumor stages, and multi-omics integration. In the second chapter, we introduce a novel method to handle the class-imbalance dilemma, a common issue in bioinformatics datasets. In class-imbalanced data, the number of samples in each class is unequal. Since most data sets contain usual versus unusual cases, e.g., cancer versus normal or miRNAs versus other non-coding RNA, the minority class with the fewest samples is the interesting class that contains the unusual cases. Learning models based on standard classifiers, such as the support vector machine (SVM), random forest, and k-NN, are usually biased towards the majority class, which means the classifier is likely to predict the samples from the interesting class inaccurately. Thus, handling class-imbalanced datasets has gained researchers' interest recently. A combination of proper feature selection, a cost-sensitive classifier, and ensembling based on the random forest method (BCECSC-RF) is proposed to handle class-imbalanced data. Random class-balanced ensembles are built individually. Then, each ensemble is used as a training pool to classify the remaining out-bagged samples. Samples in each ensemble are classified using a class-sensitive classifier incorporating random forest. A sample is finally assigned the class voted for most often across all of its appearances in the formed ensembles. A set of performance measurements, including a geometric measurement, suggests that the model can improve the classification of the minority-class samples. In the third chapter, we introduce a novel study to predict the upgrading of the Gleason score on confirmatory magnetic resonance imaging-guided targeted biopsy (MRI-TB) of the prostate in candidates for active surveillance, based on clinical features. MRI of the prostate is not accessible to many patients due to difficulty contacting patients and insurance denials, and African-American patients are disproportionately affected by barriers to MRI of the prostate during active surveillance [6,7]. Modeling clinical variables with advanced methods, such as machine learning, could allow us to manage patients in resource-limited environments with limited technological access. Upgrading to significant prostate cancer on MRI-TB was defined as upgrading to Gleason 3+4 (definition 1, DF1) and 4+3 (DF2). For upgrading prediction, the AdaBoost model was highly predictive of upgrading under DF1 (AUC 0.952), while for prediction of upgrading under DF2, the Random Forest model had a lower but still excellent prediction performance (AUC 0.947). In the fourth chapter, we introduce a multi-omics data integration method to analyze multi-omics data for biomedical applications, including disease prediction, disease subtyping, biomarker prediction, and others. Multi-omics data integration yields a richer understanding and deeper insights than separate omics data. Our method is constructed using a combination of a gene similarity network (GSN) based on Uniform Manifold Approximation and Projection (UMAP) and convolutional neural networks (CNNs). The method utilizes UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) into a lower dimension, creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN, and the other omics data are then integrated with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage in breast cancer patients. The results show that UMAP as an embedding technique can better integrate multi-omics maps into the prediction model than SO
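    A simplified sketch of the class-balanced, cost-sensitive ensemble idea from chapter 2: several randomly balanced training pools, a cost-sensitive random forest on each, and a majority vote over their predictions. This is one plausible reading of the abstract using scikit-learn, not the dissertation's exact BCECSC-RF algorithm, and the data and parameters are placeholders.
```python
# Balanced, cost-sensitive RF ensembles combined by majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]

votes = []
for seed in range(11):  # several randomly balanced training pools
    maj_sample = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, maj_sample])
    rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=seed).fit(X_tr[idx], y_tr[idx])
    votes.append(rf.predict(X_te))

# Final label = most frequent vote across the ensembles
final = (np.mean(votes, axis=0) >= 0.5).astype(int)
print(balanced_accuracy_score(y_te, final))
```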

    Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery

    Evaluating transactional payment behaviour offers a competitive advantage in the modern payment ecosystem, not only for confirming the presence of good credit applicants or unlocking the cross-selling potential between the respective product and service portfolios of financial institutions, but also for ruling out bad credit applicants directly in transactional payment streams. As a diagnostic test for analysing payment behaviour, I have used a hybrid approach comprising a combination of supervised and unsupervised learning algorithms to discover behavioural patterns. Supervised learning algorithms can compute a range of credit scores and cross-sell candidates, although the applied methods discover only limited behavioural patterns across the payment streams. Moreover, the performance of the applied supervised learning algorithms varies across the different data models, and their optimisation is inversely related to the pre-processed dataset. The research experiments conducted suggest that the Two-Class Decision Forest is an effective algorithm for determining both the cross-sell candidates and the creditworthiness of customers. In addition, a deep-learning model using a neural network has been considered, providing a meaningful interpretation of future payment behaviour through categorised payment transactions, in particular by offering additional deep insights through graph-based visualisations. However, the research shows that unsupervised learning algorithms play a central role in evaluating the transactional payment behaviour of customers: market basket analysis based on previous payment transactions is used to discover associations, find the frequent transaction categories, and derive interesting rules when each transaction category is performed on the same payment stream. The research also reveals that transactional payment behaviour analysis is multifaceted in the financial industry, serving to assess the diagnostic ability of promotion candidates and to classify bad credit applicants from among the entire customer base. The developed predictive models can also be used to estimate the credit risk of any credit applicant based on his/her transactional payment behaviour profile, combined with deep insights from the categorised payment transaction analysis. The research study provides a full review of the performance characteristics of the different developed data models. Thus, the demonstrated data science approach is a possible proof of how machine learning models can be turned into cost-sensitive data models.
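    A minimal sketch of the unsupervised, market-basket part of the analysis, assuming the mlxtend library: frequent transaction categories and association rules over categorised payment transactions. The transaction categories and thresholds below are invented placeholders, not the thesis's data.
```python
# Market basket analysis over categorised payment transactions.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Each inner list is one customer's categorised transactions over a period
transactions = [
    ["groceries", "fuel", "utilities"],
    ["groceries", "utilities", "insurance"],
    ["groceries", "fuel"],
    ["fuel", "utilities", "insurance"],
    ["groceries", "utilities"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

# Frequent transaction categories and the rules relating them
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```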

    Identifying a customer-centered approach for urban planning: defining a framework and evaluating potential in a livability context

    In transportation planning, public engagement is an essential requirement for informed decision-making. This is especially true for assessing abstract concepts such as livability, where it is challenging to define objective measures and to obtain input that can be used to gauge the performance of communities. This dissertation focuses on advancing a data-driven decision-making approach for the transportation planning domain in the context of livability. First, a conceptual model for a customer-centric framework for transportation planning is designed, integrating insight from multiple disciplines (chapter 1); then a data-mining approach to extracting features important for defining customer satisfaction in a livability context is described (chapter 2); and finally an appraisal of the potential of social media review mining for enhancing understanding of livability measures and increasing engagement in the planning process is undertaken (chapter 3). The results of this work also include a sentiment analysis and visualization package for interpreting an automated, user-defined translation of qualitative measures of livability. The package evaluates users' satisfaction with neighborhoods through social media and enhances the traditional approaches to defining livability planning measures. This approach has the potential to capitalize on residents' interest in social media outlets and to increase public engagement in the planning process by encouraging users to participate in online neighborhood satisfaction reporting. The results inform future work on deploying a comprehensive approach to planning that draws on the marketing structure of transportation network products, with residential nodes as the center of the structure.
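    A minimal sketch of the review-mining step in chapter 3, assuming NLTK's VADER sentiment model as a stand-in for whatever sentiment method the dissertation's package actually uses; the neighborhood reviews are invented examples.
```python
# Score social-media reviews of neighborhoods as a rough satisfaction signal.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

reviews = {
    "Neighborhood A": "Walkable streets, great transit access, and friendly parks.",
    "Neighborhood B": "Constant traffic noise and almost nowhere safe to bike.",
}

# Compound score in [-1, 1]: a qualitative proxy for resident satisfaction
for place, text in reviews.items():
    score = analyzer.polarity_scores(text)["compound"]
    print(f"{place}: {score:+.2f}")
```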