6 research outputs found

    Modified Mahalanobis Taguchi System for Imbalance Data Classification

    The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated to minimize the distance between the MTS Receiver Operating Characteristic (ROC) curve and the theoretical optimal point. To validate the classification efficacy of MMTS, it is benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study in the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).
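The threshold idea in this abstract can be illustrated with a minimal sketch. It is not the paper's nonlinear optimization model: it assumes the "theoretical optimal point" is the ROC corner (FPR = 0, TPR = 1) and replaces the optimization with a plain search over the observed scores. The function names and the toy Mahalanobis distances are hypothetical.

```python
import math

def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when scores above `threshold` are flagged abnormal (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

def closest_to_ideal_threshold(scores, labels):
    """Pick the candidate threshold whose ROC point lies nearest to (0, 1).

    Assumption: candidates are simply the observed scores (a grid search),
    standing in for the nonlinear optimization described in the abstract.
    """
    best = None
    for t in sorted(set(scores)):
        fpr, tpr = roc_point(scores, labels, t)
        dist = math.hypot(fpr, 1.0 - tpr)  # Euclidean distance to the ideal corner
        if best is None or dist < best[1]:
            best = (t, dist)
    return best

# Invented toy data: six normal observations (0) and two abnormal ones (1).
scores = [0.8, 1.1, 0.9, 1.3, 1.0, 2.5, 4.2, 3.9]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
threshold, distance = closest_to_ideal_threshold(scores, labels)
print(threshold, distance)
```

Because the distance uses rates rather than raw counts, the criterion does not automatically favor the majority class, which is one plausible reason such a rule suits imbalanced data.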

    A comparative analysis of machine learning algorithms for genome wide association studies

    Variations in the human genome play a vital role in the emergence of genetic disorders and abnormal traits. The Single Nucleotide Polymorphism (SNP) is considered the most common source of genetic variation. Genome Wide Association Studies (GWAS) probe these variations in the human population and look for their association with complex genetic disorders. Recent advances in technology and a drastic reduction in the cost of GWAS have produced a plethora of genomic data carrying a great deal of information about these variations. However, data generation now significantly outpaces analysis, which has led to new statistical, computational and biological challenges. Scientists are using numerous approaches to address the current problems in GWAS. In this thesis, a comparative analysis of three machine learning algorithms is carried out on simulated GWAS datasets. The methods analyzed are Recursive Partitioning, Logistic Regression and the Naïve Bayes Classifier. The classification accuracy of these algorithms is measured as the area under the receiver operating characteristic curve (AUC). In conclusion, the logistic regression model with binary classification appears to be the most promising of the algorithms compared, as it achieved the highest AUC value.
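As a hedged illustration of the AUC metric used for the comparison, the sketch below computes AUC from its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The scores and the names `model_a` and `model_b` are invented and do not come from the thesis.

```python
def auc(scores, labels):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented predicted probabilities from two hypothetical classifiers.
labels = [1, 1, 0, 0]
model_a = [0.9, 0.4, 0.35, 0.1]   # ranks every case above every control
model_b = [0.9, 0.3, 0.35, 0.1]   # ranks one control above one case

print(auc(model_a, labels), auc(model_b, labels))
```

The pairwise loop is quadratic in the sample size, so it suits small illustrations only; a sort-based rank computation would be the usual choice for full datasets.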

    Reducing Energy Waste through Eco-Aware Everyday Things


    Predictive response mail campaign

    Direct marketing is becoming a crucial part of companies' advertising strategies and includes various approaches to presenting products or services to selected customers. A reliable database of target customers is critical to the success of direct marketing. The main objective of response modelling is to identify the customers most likely to respond to a direct advertisement. Two challenges are commonly faced when dealing with marketing data: imbalanced data, where the number of non-responding customers is significantly larger than that of responding customers; and high-dimensional training datasets, a consequence of the wide variety of features that are usually collected. This thesis describes the whole process of developing an efficient response prediction model, presenting and studying several techniques and methods along the way, from data balancing and feature selection to model development and evaluation. Additionally, an ensemble feature selection technique that combines multiple random forests to yield a more robust result is proposed. The results show that the proposed feature selection method, combined with random under-sampling for class balancing and the newer prediction technique Extreme Gradient Boosting, known as XGBoost, provides the best performance.
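Random under-sampling, the balancing step named in this abstract, can be sketched as follows. This is a minimal illustration, not the thesis's pipeline: it assumes every class is cut down to the size of the smallest class, and the function name, seed, and toy customer data are hypothetical.

```python
import random
from collections import Counter

def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, n_min):  # sample without replacement
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels

# Invented toy data: 8 non-responders (0) and 2 responders (1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_under_sample(rows, labels)
print(Counter(balanced_labels))
```

Under-sampling discards majority-class rows, so it trades information for balance; it should be applied to the training split only, leaving the evaluation data at its natural class ratio.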

    Recognizing End-User Transactions in Performance Management

    Providing good quality of service (e.g., low response times) in distributed computer systems requires measuring end-user perceptions of performance. Unfortunately, in practice such measures are often expensive or impossible to obtain. Herein, we propose a machine learning approach to recognizing end-user transactions consisting of sequences of remote procedure calls (RPCs) received at a server. Two problems are addressed. The first is labeling previously segmented transaction instances with the correct transaction type. This is akin to work done in document classification. The second problem is segmenting RPC sequences into transaction instances. This is a more difficult problem, but it is similar to segmenting sounds into words as in speech understanding. Using Naive Bayes, we tackle the labeling problem with four combinations of feature vectors and probability distributions: RPC occurrences with the Bernoulli distribution and RPC counts with the multinomial, geometric, and shifted ge..