10 research outputs found

    Clustering Data of Mixed Categorical and Numerical Type with Unsupervised Feature Learning

    Mixed-type categorical and numerical data are a challenge in many applications. This general area of mixed-type data is among the frontier areas, where computational intelligence approaches are often brittle compared with the capabilities of living creatures. In this paper, unsupervised feature learning (UFL) is applied to the mixed-type data to achieve a sparse representation, which makes it easier for clustering algorithms to separate the data. Unlike other UFL methods that work with homogeneous data, such as image and video data, the presented UFL works with the mixed-type data using fuzzy adaptive resonance theory (ART). UFL with fuzzy ART (UFLA) obtains a better clustering result by removing the differences in treating categorical and numeric features. The advantages of doing this are demonstrated with several real-world data sets with ground truth, including heart disease, teaching assistant evaluation, and credit approval. The approach is also demonstrated on noisy, mixed-type petroleum industry data. UFLA is compared with several alternative methods. To the best of our knowledge, this is the first time UFL has been extended to accomplish the fusion of mixed data types
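The abstract above builds on fuzzy ART but does not spell out the algorithm. As a rough illustration only, the following is a minimal sketch of textbook fuzzy ART category learning (complement coding, choice function, vigilance test, weight update). It is not the UFLA method itself. The function names, the parameter defaults, and the assumption that every feature has already been scaled to [0, 1] (with categorical features one-hot encoded) are illustrative choices, not details taken from the paper.

```python
def complement_code(x):
    """Append 1 - x to x so that every coded input has the same L1 norm."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a, b):
    """Element-wise minimum, the fuzzy AND used by fuzzy ART."""
    return [min(u, v) for u, v in zip(a, b)]


def fuzzy_art(samples, rho=0.75, alpha=0.001, beta=1.0):
    """Cluster samples whose features lie in [0, 1] (an assumption here).

    rho   : vigilance; higher values yield more, tighter categories
    alpha : small positive choice parameter
    beta  : learning rate (1.0 corresponds to fast learning)
    Returns (labels, weights).
    """
    weights, labels = [], []
    for x in samples:
        i = complement_code(x)
        # Rank existing categories by the choice function |i ^ w| / (alpha + |w|).
        ranked = sorted(
            range(len(weights)),
            key=lambda j: sum(fuzzy_and(i, weights[j])) / (alpha + sum(weights[j])),
            reverse=True,
        )
        for j in ranked:
            overlap = fuzzy_and(i, weights[j])
            # Vigilance test: accept the category only if the match is good enough.
            if sum(overlap) / sum(i) >= rho:
                weights[j] = [beta * o + (1 - beta) * w
                              for o, w in zip(overlap, weights[j])]
                labels.append(j)
                break
        else:
            # No existing category passed the vigilance test: create a new one.
            weights.append(i)
            labels.append(len(weights) - 1)
    return labels, weights


labels, weights = fuzzy_art([[0.1, 0.1], [0.12, 0.1], [0.9, 0.9]])
```

In this toy call the first two points are close enough to share a category, while the third fails the vigilance test and opens a new one.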

    Solving the Facility Location (p-Hub) Problem Using the Artificial Bee Colony [Tesis Yerleştirme (p-Hub) Probleminin Yapay Arı Kolonisi Kullanılarak Çözülmesi]

    The facility (p-Hub) location problem, which aims to position strategies for goods, service, and information distribution systems, is known to belong to the complexity class of decision problems that can be verified in polynomial time. To achieve a desired level of service quality at an acceptable cost in distribution systems, a network of nodes connected to one another by dedicated links can be designed. Such a network may not be cost-effective. Therefore, to reduce the total transportation cost, some facilities (hubs) can be used that act as consolidation or routing points for the other nodes. Hubs are used when building such networks to solve problems in transportation management, urban management, the location of service centers, sensor network design, computer engineering, computer network design, communication network design, power engineering, the location of repair centers, the maintenance and monitoring of power lines, and manufacturing system design. A difficult point with hubs is deciding for which nodes the network properties may differ and which nodes will be used as hub locations. In hub location-allocation, a good solution obtained quickly is more useful than the best solution obtained after long computation. To obtain solutions that are both fast and optimal, heuristic-based algorithms have recently been employed for p-Hub problems. Accordingly, this study proposes the Artificial Bee Colony (ABC) algorithm for solving the p-Hub location problem. The ABC algorithm is applied to the p-Hub location-allocation problem in three cases that differ in the number of nodes: the first has three fixed central facilities and twenty nodes in total, the second has six fixed central facilities with thirty nodes attached to them, and the third has seven fixed central facilities with forty nodes attached to them.
    The minimum location-allocation cost-function solutions obtained with the ABC algorithm are presented in tables and graphs. The results are compared with Particle Swarm Optimization results from the literature. The study finds that ABC achieves better results for the p-Hub location-allocation problem, which indicates that the proposed ABC algorithm is a suitable method for solving the facility allocation (p-Hub) problem.

    Applications of Clustering with Mixed Type Data in Life Insurance

    Death benefits are generally the largest cash flow item affecting the financial statements of life insurers, yet some insurers still do not have a systematic process to track and monitor death claims experience. In this article, we explore data clustering to examine and understand how actual death claims differ from expected, an early stage of developing a monitoring system crucial for risk management. We extend the k-prototypes clustering algorithm to draw inference from a life insurance dataset using only the insured's characteristics and policy information without regard to known mortality. This clustering can efficiently handle categorical, numerical, and spatial attributes. Using gap statistics, the optimal clusters obtained from the algorithm are then used to compare actual to expected death claims experience of the life insurance portfolio. Our empirical data contains observations, during 2014, of approximately 1.14 million policies with a total insured amount of over 650 billion dollars. For this portfolio, the algorithm produced three natural clusters, each with lower actual than expected death claims but with differing variability. The analytical results provide management a process to identify policyholders' attributes that dominate significant mortality deviations, and thereby enhance decision making for taking necessary actions. Comment: 25 pages, 6 figures, 5 tables
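For orientation, the standard k-prototypes dissimilarity that the abstract above starts from can be sketched as follows: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical attributes. This is only the base measure, not the authors' extension to spatial attributes. The function names, the weight `gamma`, and the sample records are hypothetical.

```python
def k_prototypes_distance(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Mixed dissimilarity: squared Euclidean part + gamma * categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(1 for a, b in zip(cat_a, cat_b) if a != b)
    return numeric + gamma * mismatches


def nearest_prototype(record, prototypes, gamma=1.0):
    """Return the index of the prototype closest to a (numeric, categorical) record."""
    num, cat = record
    distances = [k_prototypes_distance(num, cat, p_num, p_cat, gamma)
                 for p_num, p_cat in prototypes]
    return distances.index(min(distances))


# Hypothetical policies: (numeric attributes, categorical attributes).
prototypes = [([0.2, 0.3], ["F", "nonsmoker"]), ([0.8, 0.7], ["M", "smoker"])]
policy = ([0.75, 0.6], ["M", "nonsmoker"])
cluster = nearest_prototype(policy, prototypes, gamma=0.5)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numerical distance, so the numerical attributes are usually rescaled before it is chosen.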

    Clustering: Methodology, hybrid systems, visualization, validation and implementation

    Unsupervised learning is one of the most important steps in machine learning applications. Besides giving insight into the data distribution, it is used as a preprocessing step for other machine learning algorithms. This dissertation investigates the application of unsupervised learning to various types of data for machine learning tasks such as clustering, regression and classification. The dissertation is organized into three papers. In the first paper, unsupervised learning is applied to data with mixed categorical and numerical features to transform the data objects from the mixed-type feature domain into a new, sparser numerical domain. By making use of the data fusion capacity of adaptive resonance theory clustering, the approach is able to reduce the distinction between the numerical and categorical features. The second paper presents a novel method to improve wind forecasting by grouping the time series of the surrounding wind mills into similar groups with hidden Markov model clustering and using the clustering information to enhance the forecast. A fast forecast method is also introduced that uses an extreme learning machine, which can be trained in analytic form, to choose the optimal number of past samples for prediction and an appropriate size for the neural network. In the third paper, unsupervised learning is used to learn features automatically from the dataset itself, without human design of sophisticated feature extractors. The paper shows that unsupervised feature learning with a multi-quadric radial basis function extreme learning machine gives better classifier performance than several other supervised learning methods. The paper further improves the speed of training the neural network by presenting an algorithm that runs in parallel on a GPU. --Abstract, page iv

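    Influence of data preprocessing on the performance of customer profiling models built with data mining tools [Influencia del pre-procesamiento de datos dentro del desempeño de modelos de perfilamiento de clientes elaborados con herramientas de minería de datos]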

    Customer profiling is one of the direct marketing strategies most widely used by companies, and research on it in the field of data mining has grown in recent years (Patil, Revankar and Joshi, 2009). Some direct marketing studies that use customer profiling solutions based on data mining highlight the need to study specific aspects of how data preprocessing (DPP) influences the improvement of results (Romdhane, N. Fadhel, and B. Ayeb, 2010). The objective of this research is to identify the influence of data preprocessing on the performance of customer profiling models based on data mining. This document has three chapters: the first describes the research methodology, the second presents the experimental data, and the third and last analyzes the results. As a result of the research, best practices for preprocessing are described, derived from the experiments carried out.

    An inter-domain supervision framework for collaborative clustering of data with mixed types.

    We propose an Inter-Domain Supervision (IDS) clustering framework to discover clusters within diverse data formats, mixed-type attributes and different sources of data. This approach can be used for combined clustering of diverse representations of the data, in particular where data comes from different sources, some of which may be unreliable or uncertain, or for exploiting optional external concept set labels to guide the clustering of the main data set in its original domain. We additionally take into account possible incompatibilities in the data via an automated inter-domain compatibility analysis. Our results in clustering real data sets with mixed numerical, categorical, visual and text attributes show that the proposed IDS clustering framework gives improved clustering results compared to conventional methods, over a wide range of parameters. Thus the automatically extracted knowledge, in the form of seeds or constraints, obtained from clustering one domain, can provide additional knowledge to guide the clustering in another domain. Additional empirical evaluations further show that our approach, especially when using selective mutual guidance between domains, outperforms common baselines such as clustering either domain on its own or clustering all domains converted to a single target domain. Our approach also outperforms other specialized multiple clustering methods, such as the fully independent ensemble clustering and the tightly coupled multiview clustering, after they were adapted to the task of clustering mixed data. Finally, we present a real life application of our IDS approach to the cluster-based automated image annotation problem and present evaluation results on a benchmark data set, consisting of images described with their visual content along with noisy text descriptions, generated by users on the social media sharing website, Flickr

    Coping with new Challenges in Clustering and Biomedical Imaging

    Recent years have seen a tremendous increase in data acquisition in scientific fields such as molecular biology, bioinformatics and biomedicine. Novel methods are therefore needed for automatic processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity. In contrast to unsupervised learning like clustering, classification is a supervised learning problem that aims to predict the group membership of data objects on the basis of rules learned from a training set where the group membership is known. Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with problems of conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle, which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way the search space is explored more effectively. Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information.
    A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications. In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering and classification. As a result, a set of highly selective features discriminating patients with Alzheimer's disease from healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. Therefore we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.
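The skyline operator mentioned in the abstract above can be illustrated with a small sketch: a point belongs to the skyline if no other point is at least as good in every attribute and strictly better in at least one (smaller is better here). This is a naive quadratic-time version for illustration only. It does not implement SkyDist, and the function names and the sample data are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q in every attribute and strictly better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points):
    """Return the points that are not dominated by any other point (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objects described by two attributes to minimize, e.g. (price, distance).
offers = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
front = skyline(offers)
```

Here `(3, 3)` and `(4, 4)` are dropped because `(2, 2)` is better in both attributes, while the remaining three points represent different trade-offs and all stay on the skyline.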