101 research outputs found

    Fine-grained parallelization of fitness functions in bioinformatics optimization problems: gene selection for cancer classification and biclustering of gene expression data

    Get PDF
    ANTECEDENTES: las metaheurísticas se utilizan ampliamente para resolver grandes problemas de optimización combinatoria en bioinformática debido al enorme conjunto de posibles soluciones. Dos problemas representativos son la selección de genes para la clasificación del cáncer y el agrupamiento de los datos de expresión génica. En la mayoría de los casos, estas metaheurísticas, así como otras técnicas no lineales, aplican una función de adecuación a cada solución posible con una población de tamaño limitado, y ese paso involucra latencias más altas que otras partes de los algoritmos, lo cual es la razón por la cual el tiempo de ejecución de las aplicaciones dependerá principalmente del tiempo de ejecución de la función de aptitud. Además, es habitual encontrar formulaciones aritméticas de punto flotante para las funciones de fitness. De esta manera, una paralelización cuidadosa de estas funciones utilizando la tecnología de hardware reconfigurable acelerará el cálculo, especialmente si se aplican en paralelo a varias soluciones de la población. RESULTADOS: una paralelización de grano fino de dos funciones de aptitud de punto flotante de diferentes complejidades y características involucradas en el biclustering de los datos de expresión génica y la selección de genes para la clasificación del cáncer permitió obtener mayores aceleraciones y cómputos de potencia reducida con respecto a los microprocesadores habituales. CONCLUSIONES: Los resultados muestran mejores rendimientos utilizando tecnología de hardware reconfigurable en lugar de los microprocesadores habituales, en términos de tiempo de consumo y consumo de energía, no solo debido a la paralelización de las operaciones aritméticas, sino también gracias a la evaluación de aptitud concurrente para varios individuos de la población en La metaheurística. Esta es una buena base para crear soluciones aceleradas y de bajo consumo de energía para escenarios informáticos intensivos.BACKGROUND: Metaheuristics are widely used to solve large combinatorial optimization problems in bioinformatics because of the huge set of possible solutions. Two representative problems are gene selection for cancer classification and biclustering of gene expression data. In most cases, these metaheuristics, as well as other non-linear techniques, apply a fitness function to each possible solution with a size-limited population, and that step involves higher latencies than other parts of the algorithms, which is the reason why the execution time of the applications will mainly depend on the execution time of the fitness function. In addition, it is usual to find floating-point arithmetic formulations for the fitness functions. This way, a careful parallelization of these functions using the reconfigurable hardware technology will accelerate the computation, specially if they are applied in parallel to several solutions of the population. RESULTS: A fine-grained parallelization of two floating-point fitness functions of different complexities and features involved in biclustering of gene expression data and gene selection for cancer classification allowed for obtaining higher speedups and power-reduced computation with regard to usual microprocessors. CONCLUSIONS: The results show better performances using reconfigurable hardware technology instead of usual microprocessors, in computing time and power consumption terms, not only because of the parallelization of the arithmetic operations, but also thanks to the concurrent fitness evaluation for several individuals of the population in the metaheuristic. This is a good basis for building accelerated and low-energy solutions for intensive computing scenarios.• Ministerio de Economía y Competitividad y Fondos FEDER. Contrato TIN2012-30685 (I+D+i) • Gobierno de Extremadura. Ayuda GR15011 para grupos TIC015 • CONICYT/FONDECYT/REGULAR/1160455. Beca para Ricardo Soto Guzmán • CONICYT/FONDECYT/REGULAR/1140897. Beca para Broderick CrawfordpeerReviewe

    Clustering Algorithms: Their Application to Gene Expression Data

    Get PDF
    Gene expression data hide vital information required to understand the biological process that takes place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data proffers a prodigious preference to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpretation of the resulting mass of data, which consists of millions of measurements; these data also inhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, which is essential in the data mining process to reveal natural structures and iden-tify interesting patterns in the underlying data. The clustering of gene expression data has been proven to be useful in making known the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. The other benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to the gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and high degree of accuracy in its analysis procedure

    Unsupervised Algorithms for Microarray Sample Stratification

    Get PDF
    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.Peer reviewe

    Dynamic biclustering of microarray data by multi-objective immune optimization

    Get PDF
    Abstract Background Newly microarray technologies yield large-scale datasets. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. Systematic analysis of those datasets provides the increasing amount of information, which is urgently needed in the post-genomic era. Biclustering, which is a technique developed to allow simultaneous clustering of rows and columns of a dataset, might be useful to extract more accurate information from those datasets. Biclustering requires the optimization of two conflicting objectives (residue and volume), and a multi-objective artificial immune system capable of performing a multi-population search. As a heuristic search technique, artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of optimization problems. During biclustering several objectives in conflict with each other have to be optimized simultaneously, so multi-objective optimization model is suitable for solving biclustering problem. Results Based on dynamic population, this paper proposes a novel dynamic multi-objective immune optimization biclustering (DMOIOB) algorithm to mine coherent patterns from microarray data. Experimental results on two common and public datasets of gene expression profiles show that our approach can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner. Conclusions The proposed DMOIOB algorithm is an efficient tool to analyze large microarray datasets. It achieves a good diversity and rapid convergence

    ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    Get PDF
    [Abstract]: Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search of interesting biclusters on binary datasets, which are very popular on different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially on scenarios that result on many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union [grant TIN2016-75845-P (AEI/FEDER/UE)], as well as by Xunta de Galicia (Centro Singular de Investigacion de Galicia accreditation 2016-2019, ref. EDG431G/01).Xunta de Galicia; EDG431G/0

    Analysis of biomedical data with multilevel glyphs

    Get PDF
    BACKGROUND: This paper presents multilevel data glyphs optimized for the interactive knowledge discovery and visualization of large biomedical data sets. Data glyphs are three- dimensional objects defined by multiple levels of geometric descriptions (levels of detail) combined with a mapping of data attributes to graphical elements and methods, which specify their spatial position. METHODS: In the data mapping phase, which is done by a biomedical expert, meta information about the data attributes (scale, number of distinct values) are compared with the visual capabilities of the graphical elements in order to give a feedback to the user about the correctness of the variable mapping. The spatial arrangement of glyphs is done in a dimetric view, which leads to high data density, a simplified 3D navigation and avoids perspective distortion. RESULTS: We show the usage of data glyphs in the disease analyser a visual analytics application for personalized medicine and provide an outlook to a biomedical web visualization scenario. CONCLUSIONS: Data glyphs can be successfully applied in the disease analyser for the analysis of big medical data sets. Especially the automatic validation of the data mapping, selection of subgroups within histograms and the visual comparison of the value distributions were seen by experts as an important functionality

    Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis

    Get PDF
    Background: Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA). Results: We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to bi-clustering using Lattice Theory and provide a set of analysis tools revolving around K-Formal Concept Analysis (K-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical bi-clusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher’s vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases—for instance, Gene Ontology (GO)—thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. We illustrate the exploration procedure on a real data example confirming results previously published. Conclusions: The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based bi-clustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, to look for relevant biclusters—by observing their genes and what their persistence is—to infer, for instance, hypotheses on their function
    corecore