24 research outputs found

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    A review of clustering techniques and developments

    © 2017 Elsevier B.V. This paper presents a comprehensive study on clustering: existing methods and developments made at various times. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid-based, density-based and model-based. The approaches used in these methods are discussed with their respective state of the art and applicability. The measures of similarity as well as the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted.

    Computational analysis of gene expression data

    Gene expression is central to the function of living cells. While advances in sequencing and expression measurement technology over the past decade have greatly facilitated understanding of the genome and its functions, the characterisation of functional groups of genes remains one of the most important problems in modern biology. Technological advancements have resulted in massive information output, with the priority shifting to the development of data analysis methods. As such, a large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments, with consequent confusion regarding the best approach to take. Commonly applied techniques are not necessarily the most applicable for the analysis of patterns in microarray data. This confusion is clarified by providing a framework for the analysis of clustering techniques and an investigation of how well they apply to gene expression data. To this end, the properties of microarray data are examined, followed by the properties of clustering techniques and how well they apply to gene expression. Each technique will find patterns even if the structures are not meaningful in a biological context, and these structures are not usually the same for different algorithms. The algorithms are also inherently biased, as the properties of clusters reflect built-in clustering criteria. From these considerations, it is clear that cluster validation, which is usually based on a manual, lengthy and subjective exploration process, is critical for algorithm development and verification of results. Consequently, it is key to the interpretation of gene expression data. We carry out a critical analysis of current methods used to evaluate clustering results. Clusters obtained from real and synthetic datasets are compared between algorithms.
To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented as a bipartite graph, with weighted edges between gene-sample node couples corresponding to significant expression measurements of interest. In this research, this method of representation is studied extensively and is used, in combination with probabilistic models, to develop new clustering techniques for the analysis of gene expression data. The performance of these techniques can be influenced both by the search algorithm and by the graph weighting scheme, and both merit vigorous investigation. A novel edge-weighting scheme, based on empirical evidence, is presented. The scheme is tested using several benchmark datasets at various levels of granularity, and comparisons are provided with a data analysis method currently popular in the bioinformatics community. The analysis shows that the new empirically based scheme outperforms current edge-weighting methods by accounting for the subtleties in the data through a data-dependent threshold analysis and by selecting ‘interesting’ gene-sample couples based on relative values. The graphical theme of gene expression analysis is further developed by constructing a one-mode gene expression network that focuses specifically on local interactions among genes. Classical network theory is used to identify and examine organisational properties in the resulting graphs. A new algorithm, GraphCreate, is presented which finds functional modules in the one-mode graph, i.e. sets of genes that are coherently expressed over subsets of samples, and a scoring scheme is developed (using bipartite graph properties as a basis) to weight these modules. This representation is used to study published gene expression datasets extensively and to identify functional modules of genes with GraphCreate.
This work is important as it advances research in the area of transcriptome analysis, beyond simply finding groups of coherently expressed genes, by developing a general framework to understand how and when gene sets are interacting.

    Introducing biological information in the superparamagnetic clustering algorithm of gene expression data

    Thesis (Doctorate in Nanosciences and Nanotechnology). Microarray technology allows researchers to examine the transcriptional activity of thousands of genes under different conditions. Microarrays have been used, for example, to discover key genes involved in cellular processes, and for disease classification, drug development and gene function annotation. Clustering algorithms have become an important step in microarray data analysis for discovering biologically relevant information. We modify the superparamagnetic clustering algorithm (SPC) by adding an extra weight to the interaction formula that considers which genes are regulated by the same transcription factor. This combined similarity measure for two genes relies on two types of information: their expression profiles generated by a microarray, and the number of shared transcription factors that have been experimentally shown to bind to their promoters. With this modified algorithm, which we call SPCTF, we analyze the Spellman et al. microarray data for cell cycle genes in yeast (Saccharomyces cerevisiae) and find clusters with a higher number of elements than those obtained with the original SPC algorithm. Some of the genes incorporated by SPCTF were not detected at first by Spellman et al. but were later identified by other studies, whereas several other genes still remain unclassified. The clusters composed mostly of unidentified genes were analyzed with the MUSA algorithm (motif finding using an unsupervised approach), and this allowed us to select the clusters whose elements contain cell cycle transcription factor binding sites as clusters worthy of further experimental study, because they would probably lead to new cell cycle genes. Our idea of introducing the available information about transcription factors to optimize gene classification could be implemented for other distance-based clustering algorithms.

    Support Vector Machine-based Fuzzy Systems for Quantitative Prediction of Peptide Binding Affinity

    Reliable prediction of the binding affinity of peptides is one of the most challenging but important complex modelling problems in the post-genome era, due to the diversity and functionality of the peptides discovered. Peptide binding prediction models are commonly used to find out whether binding exists between a given peptide and a major histocompatibility complex (MHC) molecule. Recent research efforts have focused on quantifying the binding predictions. The objective of this thesis is to develop reliable real-value predictive models through the use of fuzzy systems. A non-linear system is proposed, with the aid of support vector-based regression, to improve the fuzzy system, and is applied to the real-value prediction of the degree of peptide binding. This research study introduces two novel methods to improve structure and parameter identification of fuzzy systems. First, support vector-based regression is used to identify initial parameter values of the consequent part of type-1 and interval type-2 fuzzy systems. Second, an overlapping clustering concept is used to derive interval-valued parameters of the premise part of the type-2 fuzzy system. Publicly available peptide binding affinity data sets obtained from the literature are used in the experimental studies of this thesis. First, the proposed models are blind validated using the peptide binding affinity data sets obtained from a modelling competition. In that competition, an almost equal number of peptide sequences was provided to the participants in the training and testing data sets (89, 76, 133 and 133 peptides for training and 88, 76, 133 and 47 peptides for testing). Each peptide in the data sets was represented by 643 bio-chemical descriptors assigned to each amino acid. Second, the proposed models are cross validated using mouse class I MHC alleles (H2-Db, H2-Kb and H2-Kk).
H2-Db, H2-Kb, and H2-Kk consist of 65 nona-peptides, 62 octa-peptides, and 154 octa-peptides, respectively. Compared to previously published results in the literature, the support vector-based type-1 and support vector-based interval type-2 fuzzy models yield an improvement in prediction accuracy. The quantitative predictive performance has been improved by as much as 33.6% for the first group of data sets and 1.32% for the second group of data sets. The proposed models not only improved the performance of the fuzzy system (which used support vector-based regression); the support vector-based regression also benefited from the fuzzy concept. The results obtained here set the platform for the presented models to be considered for other application domains in computational and/or systems biology. Apart from improving the prediction accuracy, this research study has also identified specific features which play a key role in making reliable peptide binding affinity predictions. The amino acid features "Polarity", "Positive charge", "Hydrophobicity coefficient", and "Zimm-Bragg parameter" are considered highly discriminating features in the peptide binding affinity data sets. This information can be valuable in the design of peptides with strong binding affinity to MHC I molecules. It may also be useful when designing drugs and vaccines.

    Supervised and unsupervised segmentation of textured images by efficient multi-level pattern classification

    This thesis proposes new, efficient methodologies for supervised and unsupervised image segmentation based on texture information. For the supervised case, a technique for pixel classification based on a multi-level strategy that iteratively refines the resulting segmentation is proposed. This strategy utilizes pattern recognition methods based on prototypes (determined by clustering algorithms) and support vector machines. In order to obtain the best performance, an algorithm for automatic parameter selection and methods to reduce the computational cost associated with the segmentation process are also included. For the unsupervised case, the previous methodology is adapted by means of an initial pattern discovery stage, which allows the original unsupervised problem to be transformed into a supervised one. Several sets of experiments considering a wide variety of images are carried out in order to validate the developed techniques.

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine, for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production and milling, for quality control during manufacturing processes; in traffic and logistics, for smart cities; and for mobile communications.

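    Análisis de datos etnográficos, antropológicos y arqueológicos: una aproximación desde las humanidades digitales y los sistemas complejos [Analysis of ethnographic, anthropological and archaeological data: an approach from the digital humanities and complex systems]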

    The arrival of Computer Science, Big Data, Data Analysis, Machine Learning and Data Mining has changed the way science is done in every scientific field, giving rise in turn to new disciplines such as Computational Mechanics, Bioinformatics, Health Engineering, Computational Social Sciences, Computational Economics, Computational Archaeology and Digital Humanities, among others. All of these new disciplines are still very young and continuously growing, so contributing to their advancement and consolidation has great scientific value. In this doctoral thesis we contribute to the development of a new line of research devoted to the use of formal models, analytical methods and computational approaches for the study of human societies, both present and past. Ministerio de Ciencia e Innovación: SimulPast project, "Social and environmental transitions: simulating the past to understand human behaviour" (CSD2010-00034 CONSOLIDER-INGENIO 2010); CULM project, "Modelling cultivation in prehistory" (HAR2016-77672-P); SimPastNet Network of Excellence, "Simulating the past to understand human behaviour" (HAR2017-90883-REDC); SocioComplex Network of Excellence, "Socio-Technological Complex Systems" (RED2018-102518-T). Consejería de Educación de la Junta de Castilla y León: grant for the research line "Understanding human behaviour, an approach from complex systems and the digital humanities" within the support programme for recognised research groups (GIR) of the public universities of Castilla y León (BDNS 425389
