159 research outputs found

    Propagation-Based Biclustering Algorithm for Extracting Inclusion-Maximal Motifs

    Biclustering, the simultaneous clustering of the rows and columns of a data matrix, gained importance when classical clustering algorithms proved inadequate for detecting genes with similar expression under only a subset of conditions. Biclustering algorithms may also be applied to other kinds of data, such as medical, economic, and social network datasets. In this article we explain the concept behind hybrid biclustering algorithms and present the details of propagation-based biclustering, a novel approach for extracting inclusion-maximal gene expression motifs conserved in gene microarray data. We show that this approach can compete successfully with other well-recognized biclustering algorithms.

    ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    Biclustering techniques are gaining attention in the analysis of large-scale datasets because they identify two-dimensional submatrices in which both rows and columns are correlated. In this work we present ParBiBit, a parallel tool that accelerates the search for interesting biclusters in binary datasets, which are very popular in fields such as genetics, marketing, and text mining. It is based on the state-of-the-art sequential Java tool BiBit, which several studies have shown to be accurate, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient C++11 implementation that supports threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems, as well as a reference manual, is available at https://sourceforge.net/projects/parbibit/. This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union [grant TIN2016-75845-P (AEI/FEDER/UE)], as well as by Xunta de Galicia (Centro Singular de Investigacion de Galicia accreditation 2016-2019, ref. EDG431G/01).
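The idea of "grouping the binary information into patterns" can be illustrated with a small sketch. This is not ParBiBit's or BiBit's actual code: the function name `pattern_biclusters`, the thresholds `min_rows` and `min_cols`, and the bitmask encoding of rows are all assumptions made for illustration. Each pair of rows is AND-ed to form a candidate pattern, and every row that contains that pattern joins the bicluster.

```python
from itertools import combinations


def pattern_biclusters(rows, min_rows=2, min_cols=2):
    """Hypothetical pattern-based search on binary rows encoded as int bitmasks.

    Returns a list of (row_indices, pattern) pairs, where `pattern` is the
    bitmask of columns shared by all listed rows.
    """
    seen = set()      # patterns already expanded, to avoid duplicate biclusters
    result = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]          # columns where both rows hold a 1
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains every 1 of the pattern.
        members = [k for k, r in enumerate(rows) if (r & pattern) == pattern]
        if len(members) >= min_rows:
            result.append((members, pattern))
    return result


# Four binary rows over four columns (assumed toy data).
data = [0b1101, 0b1100, 0b0111, 0b1110]
found = pattern_biclusters(data)
for members, pattern in found:
    print(members, format(pattern, "04b"))
```

The pairwise loop is the part that a parallel tool could distribute across threads or processes, since each row pair can be examined independently; how ParBiBit actually partitions the work is not stated in the abstract.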

    DNA Microarray Data Analysis: A New Survey on Biclustering

    Some subsets of genes behave similarly under certain subsets of conditions, in which case we say that they are coexpressed, but behave independently under other subsets of conditions. Discovering such coexpression can help to uncover genomic knowledge such as gene networks or gene interactions. It is therefore of utmost importance to cluster genes and conditions simultaneously in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.

    Biclustering Algorithm for Embryonic Tumor Gene Expression Dataset: LAS Algorithm

    An important step in the analysis of gene expression data is to obtain groups of genes that have similar patterns. Biclustering methods were recently introduced for discovering subsets of genes that have coherent values across a subset of conditions. The LAS algorithm relies on a heuristic randomized search to find biclusters. In this paper, we introduce the LAS biclustering algorithm and then apply it to real-valued gene expression data. In this study, LAS was run after the data had been normalized. Thirty-one biclusters were discovered, 26 of them for positive gene expression values and the remaining 5 for negative values. The biological validity of the LAS procedure in terms of biological process, molecular function, and cellular component was 77.96%, 62.28%, and 74.39%, respectively. The biological validation in this study shows that the LAS algorithm is effective in discovering good biclusters.

    Biclustering analysis of transcriptome big data identifies condition-specific microRNA targets

    We present a novel approach to identify human microRNA (miRNA) regulatory modules (mRNA targets and relevant cell conditions) by biclustering a large collection of mRNA fold-change data for sequence-specific targets. Bicluster targets were assessed using validated messenger RNA (mRNA) targets and exhibited, on average, a 17.0% (median 19.4%) improved gain in certainty (sensitivity + specificity). The net gain was further increased up to 32.0% (median 33.4%) by incorporating functional networks of targets. We analyzed cancer-specific biclusters and found that the PI3K/Akt signaling pathway is strongly enriched with targets of a few miRNAs in breast cancer and diffuse large B-cell lymphoma. Indeed, five independent prognostic miRNAs were identified, and repression of bicluster targets and pathway activity by miR-29 was experimentally validated. In total, 29 898 biclusters for 459 human miRNAs were collected in the BiMIR database, where biclusters are searchable by miRNA, tissue, disease, keyword, and target gene.

    Data Mining Using the Crossing Minimization Paradigm

    Our capacity to generate, record, and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, because noise is introduced into it from different sources. Some of the basic forms of noise are incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called data mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data mining can be categorized into two types: supervised learning, or classification, and unsupervised learning, or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, which involves the simultaneous clustering of the records and the attributes. In this dissertation, a novel, fast, and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and discovers overlapping biclusters. For decades the CM paradigm has been used in graph drawing and in VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy data as well as real data from agriculture, biology, and other domains. Two other interesting and hard problems addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is demonstrated to provide very convincing results on these problems using real public-domain data.
    Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly was observed during 1989-97 between cotton yield and pesticide consumption in Pakistan, showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly is presented in this thesis.
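To make the link between crossing minimization and two-way clustering concrete, here is a minimal sketch. It assumes a binary data matrix is read as a bipartite graph (rows on one side, columns on the other, an edge for every 1) and uses the classic barycenter heuristic from graph drawing. The abstract does not say which CM heuristic the dissertation uses, so the heuristic, the function names `barycenter_order` and `reorder`, and the `sweeps` parameter are assumptions for illustration only.

```python
def barycenter_order(matrix, fixed_positions):
    """Order the rows of `matrix` by the mean position of their 1-entries.

    `fixed_positions[c]` is the current position of column c in the layout.
    """
    def barycenter(row):
        ones = [fixed_positions[c] for c, v in enumerate(row) if v]
        return sum(ones) / len(ones) if ones else 0.0

    # sorted() is stable, so ties keep their original relative order.
    return sorted(range(len(matrix)), key=lambda r: barycenter(matrix[r]))


def reorder(matrix, sweeps=5):
    """Alternately reorder rows and columns; returns (row_order, col_order)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    transposed = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    col_order = list(range(n_cols))
    row_order = list(range(n_rows))
    for _ in range(sweeps):
        col_pos = {c: p for p, c in enumerate(col_order)}
        row_order = barycenter_order(matrix, col_pos)
        row_pos = {r: p for p, r in enumerate(row_order)}
        col_order = barycenter_order(transposed, row_pos)
    return row_order, col_order


# Two interleaved blocks (assumed toy data): rows 0 and 2 share columns 0 and 2,
# rows 1 and 3 share columns 1 and 3.
demo = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
rows, cols = reorder(demo)
rearranged = [[demo[r][c] for c in cols] for r in rows]
for line in rearranged:
    print(line)
```

After reordering, rows and columns that share 1-entries sit next to each other, so the two blocks become visible as contiguous submatrices. Extracting biclusters from the reordered matrix is a separate step that this sketch does not attempt.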