77 research outputs found

    Biclustering on expression data: A review

    Get PDF
    Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most of biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristics and a good measure for guiding the search are essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present a extensive survey of biclustering approaches, classifying them into two categories according to whether or not use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics which they are based on.Ministerio de Economía y Competitividad TIN2011-2895

    The importance of data classification using machine learning methods in microarray data

    Get PDF
    The detection of genetic mutations has attracted global attention. several methods have proposed to detect diseases such as cancers and tumours. One of them is microarrays, which is a type of representation for gene expression that is helpful in diagnosis. To unleash the full potential of microarrays, machine-learning algorithms and gene selection methods can be implemented to facilitate processing on microarrays and to overcome other potential challenges. One of these challenges involves high dimensional data that are redundant, irrelevant, and noisy. To alleviate this problem, this representation should be simplified. For example, the feature selection process can be implemented by reducing the number of features adopted in clustering and classification. A subset of genes can be selected from a pool of gene expression data recorded on DNA micro-arrays. This paper reviews existing classification techniques and gene selection methods. The effectiveness of emerging techniques, such as the swarm intelligence technique in feature selection and classification in microarrays, are reported as well. These emerging techniques can be used in detecting cancer. The swarm intelligence technique can be combined with other statistical methods for attaining better results

    Temporal decision making using unsupervised learning

    Get PDF
    With the explosion of ubiquitous continuous sensing, on-line streaming clustering continues to attract attention. The requirements are that the streaming clustering algorithm recognize and adapt clusters as the data evolves, that anomalies are detected, and that new clusters are automatically formed as incoming data dictate. In this dissertation, we develop a streaming clustering algorithm, MU Streaming Clustering (MUSC), that is based on coupling a Gaussian mixture model (GMM) with possibilistic clustering to build an adaptive system for analyzing streaming multi-dimensional activity feature vectors. For this reason, the possibilistic C-Means (PCM) and Automatic Merging Possibilistic Clustering Method (AMPCM) are combined together to cluster the initial data points, detect anomalies and initialize the GMM. MUSC achieves our goals when tested on synthetic and real-life datasets. We also compare MUSC's performance with Sequential k-means (sk-means), Basic Sequential Clustering Algorithm (BSAS), and Modified BSAS (MBSAS) here MUSC shows superiority in the performance and accuracy. The performance of a streaming clustering algorithm needs to be monitored over time to understand the behavior of the streaming data in terms of new emerging clusters and number of outlier data points. Incremental internal Validity Indices (iCVIs) are used to monitor the performance of an on-line clustering algorithm. We study the internal incremental Davies-Bouldin (DB), Xie-Beni (XB), and Dunn internal cluster validity indices in the context of streaming data analysis. We extend the original incremental DB (iDB) to a more general version parameterized by the exponent of membership weights. Then we illustrate how the iDB can be used to analyze and understand the performance of MUSC algorithm. We give examples that illustrate the appearance of a new cluster, the effect of different cluster sizes, handling of outlier data samples, and the effect of the input order on the resultant cluster history. In addition, we investigate the internal incremental Davies-Bouldin (iDB) cluster validity index in the context of big streaming data analysis. We analyze the effect of large numbers of samples on the values of the iCVI (iDB). We also develop online versions of two modified generalized Dunn's indices that can be used for dynamic evaluation of evolving (cluster) structure in streaming data. We argue that this method is a good way to monitor the ongoing performance of online clustering algorithms and we illustrate several types of inferences that can be drawn from such indices. We compare the two new indices to the incremental Xie-Beni and Davies-Bouldin indices, which to our knowledge offer the only comparable approach, with numerical examples on a variety of synthetic and real data sets. We also study the performance of MUSC and iCVIs with big streaming data applications. We show the advantage of iCVIs in monitoring large streaming datasets and in providing useful information about the data stream in terms of emergence of a new structure, amount of outlier data, size of the clusters, and order of data samples in each cluster. We also propose a way to project streaming data into a lower space for cases where the distance measure does not perform as expected in the high dimensional space. Another example of streaming is the data acivity data coming from TigerPlace and other elderly residents' apartments in and around Columbia. MO. TigerPlace is an eldercare facility that promotes aging-in-place in Columbia Missouri. Eldercare monitoring using non-wearable sensors is a candidate solution for improving care and reducing costs. Abnormal sensor patterns produced by certain resident behaviors could be linked to early signs of illness. We propose an unsupervised method for detecting abnormal behavior patterns based on a new context preserving representation of daily activities. A preliminary analysis of the method was conducted on data collected in TigerPlace. Sensor firings of each day are converted into sequences of daily activities. Then, building a histogram from the daily sequences of a resident, we generate a single data vector representing that day. Using the proposed method, a day with hundreds of sequences is converted into a single data point representing that day and preserving the context of the daily routine at the same time. We obtained an average Area Under the Curve (AUC) of 0.9 in detecting days where elder adults need to be assessed. Our approach outperforms other approaches on the same datset. Using the context preserving representation, we develoed a multi-dimensional alert system to improve the existing single-dimensional alert system in TigerPlace. Also, this represenation is used to develop a framework that utilizes sensor sequence similarity and medical concepts extracted from the EHR to automatically inform the nursing staff when health problems are detected. Our context preserving representation of daily activities is used to measure the similarity between the sensor sequences of different days. The medical concepts are extracted from the nursing notes using MetamapLite, an NLP tool included in the Unified Medical Language System (UMLS). The proposed idea is validated on two pilot datasets from twelve Tiger Place residents, with a total of 5810 sensor days out of which 1966 had nursing notes

    Unsupervised Algorithms for Microarray Sample Stratification

    Get PDF
    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.Peer reviewe

    Temporal - spatial recognizer for multi-label data

    Get PDF
    Pattern recognition is an important artificial intelligence task with practical applications in many fields such as medical and species distribution. Such application involves overlapping data points which are demonstrated in the multi- label dataset. Hence, there is a need for a recognition algorithm that can separate the overlapping data points in order to recognize the correct pattern. Existing recognition methods suffer from sensitivity to noise and overlapping points as they could not recognize a pattern when there is a shift in the position of the data points. Furthermore, the methods do not implicate temporal information in the process of recognition, which leads to low quality of data clustering. In this study, an improved pattern recognition method based on Hierarchical Temporal Memory (HTM) is proposed to solve the overlapping in data points of multi- label dataset. The imHTM (Improved HTM) method includes improvement in two of its components; feature extraction and data clustering. The first improvement is realized as TS-Layer Neocognitron algorithm which solves the shift in position problem in feature extraction phase. On the other hand, the data clustering step, has two improvements, TFCM and cFCM (TFCM with limit- Chebyshev distance metric) that allows the overlapped data points which occur in patterns to be separated correctly into the relevant clusters by temporal clustering. Experiments on five datasets were conducted to compare the proposed method (imHTM) against statistical, template and structural pattern recognition methods. The results showed that the percentage of success in recognition accuracy is 99% as compared with the template matching method (Featured-Based Approach, Area-Based Approach), statistical method (Principal Component Analysis, Linear Discriminant Analysis, Support Vector Machines and Neural Network) and structural method (original HTM). The findings indicate that the improved HTM can give an optimum pattern recognition accuracy, especially the ones in multi- label dataset

    Multi-biclustering solutions for classification and prediction problems

    Get PDF
    2009 - 20101The search for similarities in large data sets has a relevant role in many scientific fields. It permits to classify several types of data without an explicit information about them. Unfortunately, the experimental data contains noise and errors, and therefore the main task of mathematicians is to find algorithms that permit to analyze this data with maximal precision. In many cases researchers use methodologies such as clustering to classify data with respect to the patterns or conditions. But in the last few years new analysis tool such as biclustering was proposed and applied to many specific problems. My choice of biclustering methods is motivated by the accuracy obtained in the results and the possibility to find not only rows or columns that provide a dataset partition but also rows and columns together. In this work, two new biclustering algorithms, the Combinatorial Biclustering Algorithm (CBA) and an improvement of the Possibilistic Biclustering Algorithm, called Biclustering by resampling, are presented. The first algorithm (that I call Combinatorial) is based on the direct definition of bicluster, that makes it clear and very easy to understand. My algorithm permits to control the error of biclusters in each step, specifying the accepted value of the error and defining the dimensions of the desired biclusters from the beginning. The comparison with other known biclustering algorithms is shown. The second algorithm is an improvement of the Possibilistic Biclustering Algorithm (PBC). The PBC algorithm, proposed by M. Filippone et al., is based on the Possibilistic Clustering paradigm, and finds one bicluster at a time, assigning a membership to the bicluster for each gene and for each condition. PBC uses an objective function that maximizes a bicluster cardinality and minimizes a residual error. The biclustering problem is faced as the optimization of a proper functional. This algorithm obtains a fast convergence and good quality of the solutions. Unfortunately, PBC finds only one bicluster at a time. I propose an improved PBC algorithm based on data resampling, specifically Bootstrap aggregation, and Genetics algorithms. In such a way I can find all the possible biclusters together and include overlapped solutions. I apply the algorithm to a synthetic data and to the Yeast dataset and compare it with the original PBC method. [edited by the author]IX n.s

    Arabic Query Expansion Using WordNet and Association Rules

    Get PDF
    Query expansion is the process of adding additional relevant terms to the original queries to improve the performance of information retrieval systems. However, previous studies showed that automatic query expansion using WordNet do not lead to an improvement in the performance. One of the main challenges of query expansion is the selection of appropriate terms. In this paper, we review this problem using Arabic WordNet and Association Rules within the context of Arabic Language. The results obtained confirmed that with an appropriate selection method, we are able to exploit Arabic WordNet to improve the retrieval performance. Our empirical results on a sub-corpus from the Xinhua collection showed that our automatic selection method has achieved a significant performance improvement in terms of MAP and recall and a better precision with the first top retrieved documents

    TEACHING OLD CALIPERS NEW TRICKS: USING CRANIOMETRICS FOR ANCESTRY ADMIXTURE ESTIMATION VIA FUZZY MATH

    Get PDF
    Cranial measurements have been a cornerstone of physical anthropology since its formation as a discipline in the early 1900s. However, most other ancestry determination methods come with a significant epistemological issue: they differentiate individuals into discrete categories without accounting for the issue of admixture. Advances in data mining and analysis techniques can now be used to help resolve this issue through soft computing, also known as “fuzzy math”. This type of advanced computational math requires specialized knowledge in computer programming, statistics, and data analysis techniques unless one is using computer programs specially designed to run these analyses. This project compiled a database from multiple open-source craniometrics data and utilized prepared packages within the R statistical environment to find a valid soft computing method for fuzzy ancestry determination that does not require extensive knowledge in computer programming or data mining. Exploration of database demographics notes an excess of White-identified individuals, and when tested, this demographic skew impacts the ability of the given package to return valid results. The package chosen was valid using the compiled database. Exploration of causes for the invalid results, including a significant White skew in the underlying database due to accessibility of metric databases, overfitting, and the inherent issues of admixture on craniometric research, are explored, and future directions discussed

    Advances in Meta-Heuristic Optimization Algorithms in Big Data Text Clustering

    Full text link
    This paper presents a comprehensive survey of the meta-heuristic optimization algorithms on the text clustering applications and highlights its main procedures. These Artificial Intelligence (AI) algorithms are recognized as promising swarm intelligence methods due to their successful ability to solve machine learning problems, especially text clustering problems. This paper reviews all of the relevant literature on meta-heuristic-based text clustering applications, including many variants, such as basic, modified, hybridized, and multi-objective methods. As well, the main procedures of text clustering and critical discussions are given. Hence, this review reports its advantages and disadvantages and recommends potential future research paths. The main keywords that have been considered in this paper are text, clustering, meta-heuristic, optimization, and algorithm

    Multiple instance fuzzy inference.

    Get PDF
    A novel fuzzy learning framework that employs fuzzy inference to solve the problem of multiple instance learning (MIL) is presented. The framework introduces a new class of fuzzy inference systems called Multiple Instance Fuzzy Inference Systems (MI-FIS). Fuzzy inference is a powerful modeling framework that can handle computing with knowledge uncertainty and measurement imprecision effectively. Fuzzy Inference performs a non-linear mapping from an input space to an output space by deriving conclusions from a set of fuzzy if-then rules and known facts. Rules can be identified from expert knowledge, or learned from data. In multiple instance problems, the training data is ambiguously labeled. Instances are grouped into bags, labels of bags are known but not those of individual instances. MIL deals with learning a classifier at the bag level. Over the years, many solutions to this problem have been proposed. However, no MIL formulation employing fuzzy inference exists in the literature. In this dissertation, we introduce multiple instance fuzzy logic that enables fuzzy reasoning with bags of instances. Accordingly, different multiple instance fuzzy inference styles are proposed. The Multiple Instance Mamdani style fuzzy inference (MI-Mamdani) extends the standard Mamdani style inference to compute with multiple instances. The Multiple Instance Sugeno style fuzzy inference (MI-Sugeno) is an extension of the standard Sugeno style inference to handle reasoning with multiple instances. In addition to the MI-FIS inference styles, one of the main contributions of this work is an adaptive neuro-fuzzy architecture designed to handle bags of instances as input and capable of learning from ambiguously labeled data. The proposed architecture, called Multiple Instance-ANFIS (MI-ANFIS), extends the standard Adaptive Neuro Fuzzy Inference System (ANFIS). We also propose different methods to identify and learn fuzzy if-then rules in the context of MIL. In particular, a novel learning algorithm for MI-ANFIS is derived. The learning is achieved by using the backpropagation algorithm to identify the premise parameters and consequent parameters of the network. The proposed framework is tested and validated using synthetic and benchmark datasets suitable for MIL problems. Additionally, we apply the proposed Multiple Instance Inference to the problem of region-based image categorization as well as to fuse the output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating Radar
    corecore