79 research outputs found

    Genetic algorithm based two-mode clustering of metabolomics data

    Metabolomics and other omics tools are generally characterized by large data sets with many variables obtained under different environmental conditions. Clustering methods, and more specifically two-mode clustering methods, are excellent tools for analyzing this type of data. Two-mode clustering methods allow for analysis of the behavior of subsets of metabolites under different experimental conditions. In addition, the results are easily visualized. In this paper we introduce a two-mode clustering method based on a genetic algorithm that uses a criterion that searches for homogeneous clusters. Furthermore, we introduce a cluster stability criterion to validate the clusters, and we provide an extended knee plot to select the optimal number of clusters in both the experimental and metabolite modes. The genetic algorithm-based two-mode clustering gave biologically relevant results when it was applied to two real-life metabolomics data sets. It was, for instance, able to identify a catabolic pathway for growth on several of the carbon sources.

    Time series segmentation in data analysis: an example application to seismo-volcanic data.

    This report describes the work developed by the authors for analyzing the time series used in the seismo-volcanic monitoring of the Etna volcano. The need for a reduced representation of the time series led to the study and implementation of the segmentation algorithms that are the subject of this work. The methodologies introduced in Section 2, widely applied in time series data mining, currently represent the state of the art in time series approximation techniques. In particular, applying the bottom-up algorithm achieved high data compression, allowing a representation with fewer points than the original time series. In this context, the error thresholds, which are indirectly linked to the number of segments used to approximate the time series, were chosen empirically. This choice was constrained by the size of the data buffers to be used for visualization and processing. Future implementations will address the online optimization of the Sliding Window algorithms, so as to operate in real time on data streams and to optimize their storage and visualization.
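
As an illustration of the bottom-up idea mentioned in this abstract, the following is a minimal sketch of generic bottom-up piecewise-linear segmentation. It is not the authors' implementation: the function names `fit_error` and `bottom_up`, the use of a least-squares line per segment, and the `max_error` threshold on the sum of squared residuals are all assumptions made for this example.

```python
def fit_error(points):
    """Sum of squared residuals of a least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def bottom_up(series, max_error):
    """Merge adjacent segments while the cheapest merge stays within max_error.

    Returns a list of (first_index, last_index) pairs, one per segment.
    """
    points = list(enumerate(series))
    # Start from the finest segmentation: pairs of consecutive samples.
    segments = [points[i:i + 2] for i in range(0, len(points), 2)]
    while len(segments) > 1:
        # Cost of merging each segment with its right-hand neighbour.
        costs = [fit_error(segments[i] + segments[i + 1])
                 for i in range(len(segments) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        if costs[best] > max_error:
            break  # every remaining merge would exceed the error threshold
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return [(seg[0][0], seg[-1][0]) for seg in segments]


# A rising ramp followed by a falling ramp collapses to two segments.
print(bottom_up([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, -1], max_error=0.5))
# -> [(0, 5), (6, 11)]
```

A larger `max_error` yields fewer segments and therefore stronger compression, which is the indirect link between threshold and segment count that the abstract describes.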

    Data Grouping Using Hierarchical Clustering (AHC)

    ABSTRACT: Data is one of the sources used to obtain information. However, not all data can be put to good use. If the data have a complex structure, they are hard to understand. An example is the PT.Telkom customer billing data used in this final project, which contain a large number of records with many attributes. A grouping process is therefore needed to divide the data into a smaller number of groups so that analysis becomes easier. This final project implements one data mining technique, clustering, to group the data. The clustering method used is Agglomerative Hierarchical Clustering (AHC). AHC is a bottom-up hierarchical clustering method that merges n clusters into one single cluster. The method starts by placing each data object in its own cluster (atomic cluster) and then merges these clusters into progressively larger clusters until all data objects are joined in a single cluster. The key to the AHC method is the calculation of the proximity between two clusters. This calculation comes in three variants: Single Linkage (shortest distance), Complete Linkage (longest distance), and Average Linkage (average distance). Because a hierarchical method does not produce clusters directly, the cophenetic distance is used to analyze the resulting hierarchy. The results show that Agglomerative Hierarchical Clustering (AHC) can be used for grouping data. Keywords: AHC, Single Linkage, Complete Linkage, Average Linkage
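
The merge procedure and the three linkage rules described in this abstract can be sketched as follows. This is a naive illustration on one-dimensional values, not the project's code: the names `ahc`, `cophenetic`, and `LINKAGES` are hypothetical, absolute difference is assumed as the distance, and the input values are assumed to be distinct.

```python
from itertools import combinations

# Proximity between two clusters, computed from all pairwise distances.
LINKAGES = {
    "single": min,                             # shortest distance
    "complete": max,                           # longest distance
    "average": lambda ds: sum(ds) / len(ds),   # average distance
}


def ahc(values, linkage="single"):
    """Agglomerate 1-D values bottom-up; return the merge history.

    Each history entry is (left_cluster, right_cluster, merge_height).
    """
    combine = LINKAGES[linkage]
    clusters = [[v] for v in values]  # every object starts as an atomic cluster
    merges = []

    def proximity(pair):
        a, b = pair
        return combine([abs(x - y) for x in clusters[a] for y in clusters[b]])

    while len(clusters) > 1:
        # Merge the two closest clusters under the chosen linkage.
        a, b = min(combinations(range(len(clusters)), 2), key=proximity)
        merges.append((list(clusters[a]), list(clusters[b]), proximity((a, b))))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return merges


def cophenetic(merges, x, y):
    """Height at which x and y are first joined in the same cluster."""
    for left, right, height in merges:
        if (x in left and y in right) or (y in left and x in right):
            return height
    raise ValueError("x and y were never merged")


history = ahc([1, 2, 6, 8, 20], linkage="single")
print([height for _, _, height in history])  # -> [1, 2, 4, 12]
print(cophenetic(history, 1, 8))             # -> 4
```

Switching `linkage` to `"complete"` or `"average"` changes only the merge heights here, which is why the cophenetic distances are a convenient way to compare the resulting hierarchies.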

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc
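
To make the conjugacy mentioned in this abstract concrete, here is a small sketch of the standard normal-gamma posterior update for a univariate Gaussian with unknown mean and precision. It is the textbook update, not code from GBHC; the function name and the hyperparameter names `mu0`, `kappa0`, `alpha0`, and `beta0` are assumptions for this example.

```python
def normal_gamma_posterior(data, mu0, kappa0, alpha0, beta0):
    """Update normal-gamma hyperparameters after observing `data`.

    Prior: precision ~ Gamma(alpha0, rate=beta0),
           mean | precision ~ Normal(mu0, 1 / (kappa0 * precision)).
    Returns the posterior (mu_n, kappa_n, alpha_n, beta_n).
    """
    n = len(data)
    if n == 0:
        return mu0, kappa0, alpha0, beta0  # no data: posterior equals prior
    mean = sum(data) / n
    scatter = sum((x - mean) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0 + 0.5 * scatter
              + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n


print(normal_gamma_posterior([1.0, 2.0, 3.0], mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0))
# -> (1.5, 4.0, 2.5, 3.5)
```

Because the posterior has the same form as the prior, the marginal likelihood of a cluster is available in closed form, which is the property that Bayesian model selection over candidate merges relies on.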

    Bad Communities with High Modularity

    In this paper we discuss some problematic aspects of Newman's modularity function QN. Given a graph G, the modularity of G can be written as QN = Qf - Q0, where Qf is the intracluster edge fraction of G and Q0 is the expected intracluster edge fraction of the null model, i.e., a randomly connected graph with the same expected degree distribution as G. It follows that the maximization of QN must accommodate two factors pulling in opposite directions: Qf favors a small number of clusters and Q0 favors many balanced (i.e., with approximately equal degrees) clusters. In certain cases the Q0 term can cause overestimation of the true cluster number; this is the opposite of the well-known underestimation effect caused by the "resolution limit" of modularity. We illustrate the overestimation effect by constructing families of graphs with a "natural" community structure which, however, does not maximize modularity. In fact, we prove that we can always find a graph G with a "natural clustering" V of G and another, balanced clustering U of G such that (i) the pair (G; U) has higher modularity than (G; V) and (ii) V and U are arbitrarily different. Comment: Significantly improved version of the paper, with the help of L. Pitsouli
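
The decomposition QN = Qf - Q0 can be computed directly for a small graph. The sketch below assumes the usual form of Newman's modularity for an undirected, unweighted graph with m edges, where Q0 is the sum over clusters of (d_c / 2m)^2 and d_c is the total degree of cluster c. The function name `modularity_terms` and the example graph are hypothetical.

```python
def modularity_terms(edges, community):
    """Return (Qf, Q0, QN) for an undirected graph and a node -> cluster map."""
    m = len(edges)
    intra = {}   # cluster -> number of intracluster edges
    degree = {}  # cluster -> total degree of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    q_f = sum(intra.values()) / m                            # observed fraction
    q_0 = sum((d / (2 * m)) ** 2 for d in degree.values())   # null-model fraction
    return q_f, q_0, q_f - q_0


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_cluster = {node: "all" for node in range(6)}
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

print(modularity_terms(edges, one_cluster))   # Qf = 1.0, Q0 = 1.0, QN = 0.0
print(modularity_terms(edges, two_clusters))  # Qf = 6/7, Q0 = 0.5
```

The single cluster maximizes Qf but also maximizes Q0, while the balanced split lowers Q0 to 0.5 at the cost of one cut edge, which shows the two terms pulling in opposite directions.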

    Using hidden Markov models in credit card transaction fraud detection

    In this paper we propose a credit card transaction fraud detection framework that uses Hidden Markov Models, a well-established technology that has not yet been tested in this area and through which we aim to address the limitations of currently used technologies. Hidden Markov Models have for many years been effectively implemented in other similar areas. The flexibility offered by these models, together with the similarity in concepts between Automatic Speech Recognition and credit card fraud detection, prompted the idea of testing the usefulness of these models in our area of research. The study performed in this project investigated the use of Hidden Markov Models by proposing a number of different frameworks, which are supported by clustering and profiling mechanisms. The profiling mechanisms are used to build Hidden Markov Models that are more specialised and are thus deployed on training data specific to a set of cardholders with similar spending behaviours. Clustering techniques were used to establish the association between different classes of transactions. Two different clustering algorithms were tested to determine the most effective one. Also, different Hidden Markov Models were built using different criteria for test data. The positive results achieved show the effectiveness of these models in classifying fraudulent and legitimate transactions through a resultant percentage value which indicates the prominence of the transaction being contained in the respective model.

    DYNAMIC CLUSTERING OF CELL-CYCLE MICROARRAY DATA

    The cell cycle is a crucial series of events that are repeated over time, allowing the cell to grow, duplicate, and split. Cell-cycle systems play an important role in cancer and other biological processes. Using gene expression data gained from microarray technology, it is possible to group or cluster genes that are involved in the cell cycle for the purpose of exploring their functional co-regulation. Typically, the goal of clustering methods as applied to gene expression data is to place genes with similar expression patterns or profiles into the same group or cluster for the purpose of inferring the function of unknown genes that cluster with genes of known function. Since a gene may be involved in more than one biological process at any one time, co-regulated genes may not have visually similar expression patterns. Furthermore, the time duration for genes in a biological process may differ, and the number of co-regulated patterns or biological processes shared by two genes may be unknown. Based on this reasoning, biologically realistic gene clusters gained from gene co-regulation may not be accurately identified using traditional clustering methods. By taking advantage of techniques and theories from signal processing, it is possible to cluster cell-cycle gene expression profiles using a dynamic perspective under the assumption that different spectral frequencies characterize different biological processes.

    Gravitational Clustering: A Simple, Robust and Adaptive Approach for Distributed Networks

    Distributed signal processing for wireless sensor networks enables different devices to cooperate to solve different signal processing tasks. A crucial first step is to answer the question: who observes what? Recently, several distributed algorithms have been proposed which frame the signal/object labelling problem in terms of cluster analysis after extracting source-specific features; however, the number of clusters is assumed to be known. We propose a new method called Gravitational Clustering (GC) to adaptively estimate the time-varying number of clusters based on a set of feature vectors. The key idea is to exploit the physical principle of gravitational force between mass units: streaming-in feature vectors are considered as mass units of fixed position in the feature space, around which mobile mass units are injected at each time instant. The cluster enumeration exploits the fact that the highest attraction on the mobile mass units is exerted by regions with a high density of feature vectors, i.e., gravitational clusters. By sharing estimates among neighboring nodes via a diffusion-adaptation scheme, cooperative and distributed cluster enumeration is achieved. Numerical experiments concerning robustness against outliers, convergence, and computational complexity are conducted. The application in a distributed cooperative multi-view camera network illustrates the applicability to real-world problems. Comment: 12 pages, 9 figures

    A Framework for Exploring Functional Variability in Olfactory Receptor Genes

    BACKGROUND: Olfactory receptors (ORs) are the largest gene family in mammalian genomes. Since nearly all OR genes are orphan receptors, inference of functional similarity or differences between odorant receptors typically relies on sequence comparisons. Based on the alignment of the entire coding region sequence, OR genes are classified into families and subfamilies, a classification that is believed to be a proxy for OR gene functional variability. However, the assumption that overall protein sequence diversity is a good proxy for functional properties is untested. METHODOLOGY: Here, we propose an alternative sequence-based approach to infer the similarities and differences in OR binding capacity. Our approach is based on similarities and differences in the predicted binding pockets of OR genes, rather than on the entire OR coding region. CONCLUSIONS: Interestingly, our approach yields markedly different results compared to the analysis based on the entire OR coding regions. While neither approach can be tested at this time, the discrepancy between the two calls into question the assumption that the current classification reliably reflects OR gene functional variability.