133 research outputs found

    Nonnegative factorization and the maximum edge biclique problem

    Nonnegative matrix factorization (NMF) is a data analysis technique based on the approximation of a nonnegative matrix with a product of two nonnegative factors, which allows compression and interpretation of nonnegative data. In this paper, we study the case of rank-one factorization and show that when the matrix to be factored is not required to be nonnegative, the corresponding problem (R1NF) becomes NP-hard. This sheds new light on the complexity of NMF since any algorithm for fixed-rank NMF must be able to solve at least implicitly such rank-one subproblems. Our proof relies on a reduction of the maximum edge biclique problem to R1NF. We also link stationary points of R1NF to feasible solutions of the biclique problem, which allows us to design a new type of biclique finding algorithm based on the application of a block-coordinate descent scheme to R1NF. We show that this algorithm, whose algorithmic complexity per iteration is proportional to the number of edges in the graph, is guaranteed to converge to a biclique and that it performs competitively with existing methods on random graphs and text mining datasets.
    Keywords: nonnegative matrix factorization, rank-one factorization, maximum edge biclique problem, algorithmic complexity, biclique finding algorithm
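The block-coordinate descent idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the biadjacency matrix is recoded with 1 on edges and a negative penalty on non-edges, and it uses the standard closed-form update for a rank-one nonnegative least-squares block. The function name `find_biclique`, the default penalty `max(m, n)`, and the starting column are illustrative assumptions. This dense version also does not reproduce the per-iteration cost proportional to the number of edges that the abstract reports.

```python
def find_biclique(adj, penalty=None, start_col=None, max_iter=100, tol=1e-12):
    """Heuristic biclique search on a 0/1 biadjacency matrix (list of lists).

    Minimizes ||M - u v^T||_F^2 over u, v >= 0 by alternating exact block
    updates, where M has 1 on edges and -penalty on non-edges (assumed coding).
    Returns the row and column supports of the final factors.
    """
    m, n = len(adj), len(adj[0])
    d = float(max(m, n) if penalty is None else penalty)  # assumed default
    M = [[1.0 if adj[i][j] else -d for j in range(n)] for i in range(m)]

    # Assumed initialization: indicator of the highest-degree column.
    if start_col is None:
        start_col = max(range(n), key=lambda j: sum(adj[i][j] for i in range(m)))
    u = [0.0] * m
    v = [1.0 if j == start_col else 0.0 for j in range(n)]

    for _ in range(max_iter):
        vv = sum(x * x for x in v)
        if vv == 0.0:
            break
        # Exact minimizer over u >= 0 with v fixed: u = max(0, M v) / ||v||^2.
        new_u = [max(0.0, sum(M[i][j] * v[j] for j in range(n))) / vv
                 for i in range(m)]
        uu = sum(x * x for x in new_u)
        if uu == 0.0:
            u = new_u
            break
        # Exact minimizer over v >= 0 with u fixed: v = max(0, M^T u) / ||u||^2.
        new_v = [max(0.0, sum(M[i][j] * new_u[i] for i in range(m))) / uu
                 for j in range(n)]
        change = max(max(abs(a - b) for a, b in zip(new_u, u)),
                     max(abs(a - b) for a, b in zip(new_v, v)))
        u, v = new_u, new_v
        if change <= tol:
            break

    rows = [i for i in range(m) if u[i] > 0.0]
    cols = [j for j in range(n) if v[j] > 0.0]
    return rows, cols


def is_biclique(adj, rows, cols):
    """Check that every selected row is connected to every selected column."""
    return all(adj[i][j] for i in rows for j in cols)
```

For example, on `[[1, 1, 0], [1, 1, 0], [0, 0, 1]]` the sketch selects rows `[0, 1]` and columns `[0, 1]`. After a finite number of iterations the support is only a candidate, so checking it with `is_biclique` is a sensible safeguard.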

    Bicluster Analysis of Cheng and Church's Algorithm to Identify Patterns of People's Welfare in Indonesia

    Biclustering is a method of grouping numerical data in which rows and columns are grouped simultaneously. The Cheng and Church (CC) algorithm is a biclustering algorithm that tries to find the largest bicluster with high similarity, as measured by the mean squared residue (MSR). A set of rows and columns is called a bicluster if its MSR is lower than a predetermined threshold value (delta). Detecting patterns of people's welfare in Indonesia with biclustering is essential for obtaining an overview of the welfare characteristics of each province. Biclustering with the CC algorithm requires a threshold value (delta), which is determined from the MSR of the actual data: delta must be smaller than that MSR. The threshold values considered in this study are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. After evaluating the candidates by considering the MSR value and the biclusters formed, the optimum delta is 0.1, which yields 4 biclusters.
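The MSR criterion described in this abstract can be made concrete with a short sketch. It uses the usual definition of the mean squared residue of a submatrix: the average of the squared entry residuals a_ij - (row mean) - (column mean) + (overall mean). The function names and the list-of-lists data layout are illustrative assumptions, and the sketch covers only the score and the threshold test, not the node-deletion search of the CC algorithm.

```python
def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix data[rows][cols]."""
    sub = [[data[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


def is_delta_bicluster(data, rows, cols, delta):
    """True if the selected rows and columns have MSR below the threshold."""
    return mean_squared_residue(data, rows, cols) < delta


# A perfectly additive pattern (each row is a shifted copy) has MSR 0.
additive = [[1, 2, 3], [2, 3, 4]]
# A checkerboard pattern has no additive structure.
checker = [[1, 0], [0, 1]]
print(mean_squared_residue(additive, [0, 1], [0, 1, 2]))  # 0.0
print(mean_squared_residue(checker, [0, 1], [0, 1]))      # 0.25
```

With delta = 0.1, the smallest threshold in the study, the additive submatrix passes the test and the checkerboard does not.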

    Discovery of error-tolerant biclusters from noisy gene expression data

    An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters in these real-valued gene-expression data sets. However, these algorithms suffer from several limitations: an inability to explicitly handle errors and noise in the data; difficulty in discovering small biclusters because of their top-down approach; and, for some approaches, an inability to find overlapping biclusters, which is crucial because many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, whic

    Biclustering electronic health records to unravel disease presentation patterns

    Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2019.
    Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with highly variable presentation patterns. Given the heterogeneous nature of ALS patients, and aiming at a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of functional decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, Machine Learning models able to unravel the complexity of disease presentation patterns are paramount for understanding the disease, improving patient care and lengthening survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a model's result before making a decision that can affect a patient's life. We therefore aim to unravel disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which uses a combination of Biclustering, Biclustering-based Classification and Class Association Rule Mining. These patterns (called Meta-features) are composed of discriminative subsets of features together with their values, making it possible to distinguish and characterize subgroups of patients with similar disease presentation patterns.
    The Electronic Health Record (EHR) data used in this work come from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, composed of standardized questionnaire answers regarding risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls followed by ENCALS (European Network to Cure ALS), a consortium of several European countries, including Portugal. In this work the proposed methodology was applied to the Portuguese ONWebDUALS EHR data to find disease presentation patterns that: 1) distinguish ALS patients from their controls and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. However, in the second task the patterns found for each of the three progression groups were recognized and validated by ALS expert clinicians as relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases.

    Pattern Recognition of Food Security in Indonesia Using Biclustering Plaid Model

    Biclustering comes in many algorithmic variants, and selecting the most suitable one can be a challenging task because performance can vary significantly depending on the characteristics of the data. The Plaid model, one of the popular biclustering algorithms, has gained recognition for its efficiency and versatility across various applications, including food security. Indonesia deals with complex food security challenges. The nation's unique geographic and socioeconomic diversity demands region-specific food security solutions. Identifying province-specific food security patterns is crucial for effective policymaking and resource allocation, ultimately promoting food sufficiency and stability at the regional level. This study assesses the performance of the Plaid model in identifying food security patterns at the provincial level in Indonesia. To optimize the biclusters, we explore various parameter tuning scenarios (the choice of model, the number of layers, and the threshold values for row and column releases). The selection criteria are the change ratio of the initial matrix's mean square residue to the mean square residue of the Plaid model, the average mean square residue, and the number of biclusters. The constant column model was selected, with a mean square residue change ratio of 0.52 and an average Plaid model mean square residue of 4.81; it generates 6 overlapping biclusters. The results show that each bicluster has unique characteristics. Notably, Bicluster 1, which consists of 2 provinces, exhibits the lowest food security levels, marked by variables X1, X2, X4, and X7. Furthermore, the variables X1, X4, and X7 consistently appear across several biclusters. This highlights the importance of prioritizing these three variables to improve the food security status of the regions.

    Genetic algorithm based two-mode clustering of metabolomics data

    Metabolomics and other omics tools are generally characterized by large data sets with many variables obtained under different environmental conditions. Clustering methods, and more specifically two-mode clustering methods, are excellent tools for analyzing this type of data. Two-mode clustering methods allow for analysis of the behavior of subsets of metabolites under different experimental conditions. In addition, the results are easily visualized. In this paper we introduce a two-mode clustering method based on a genetic algorithm that uses a criterion that searches for homogeneous clusters. Furthermore, we introduce a cluster stability criterion to validate the clusters, and we provide an extended knee plot to select the optimal number of clusters in both the experimental and metabolite modes. The genetic algorithm-based two-mode clustering gave biologically relevant results when it was applied to two real-life metabolomics data sets. It was, for instance, able to identify a catabolic pathway for growth on several of the carbon sources.

    Design Methodology for Self-organized Mobile Networks Based

    The methodology proposed in this article enables the systematic design of routing algorithms based on biclustering schemes. It lets a researcher combine timely techniques and their own clustering heuristics with a routing approach focused on the choice of clusterhead nodes. The process uses heuristics aimed at improving the different communication costs within groups called biclusters. The methodology accommodates a variety of clustering techniques and heuristics that have been addressed in routing algorithms, although we have not explored all possible alternatives and their different assessments. Research on the design of routing algorithms based on biclustering schemes that follows this methodology should therefore allow new concepts of evolutionary routing, along with the ability to adapt to the topological changes that occur in self-organized data networks.