8 research outputs found

    Searching for and representing communities in large graphs

    This paper deals with the analysis and visualization of large graphs. Our interest in this subject comes from the fact that graphs are convenient, widespread data structures: this type of data is encountered in a growing number of concrete problems (the Web, information retrieval, social networks, biological interaction networks...). Furthermore, these graphs become increasingly large as the means for data gathering and storage steadily improve. This calls for new methods in graph analysis and visualization, which are now important and dynamic research fields at the interface of many disciplines such as mathematics, statistics, computer science and sociology. In this paper, we propose a method for graph representation and visualization based on a prior clustering of the vertices. Newman and Girvan (2004) point out that “reducing [the] level of complexity [of a network] to one that can be interpreted readily by the human eye, will be invaluable in helping us to understand the large-scale structure of these new network data”: we rely on this observation and use a clustering of the vertices as a preliminary step to simplify the representation of the graph as a whole. The clustering phase consists in optimizing a quality measure specifically suited to the search for dense groups in graphs. This quality measure, the modularity, expresses the “distance” to a null model in which the graph edges do not depend on the clustering; it has proven its relevance for uncovering dense groups in a graph. Optimization of the modularity is done through a stochastic simulated annealing algorithm. The visualization/representation phase, as such, is based on a force-directed algorithm described in Truong et al. (2007).
After a short introduction to the problem and a detailed account of the vertex clustering and representation algorithms, the paper introduces and discusses two applications from the social network field.
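The modularity criterion mentioned above can be made concrete. The sketch below (a plain-Python illustration, not the authors' implementation; the small graph and partition are invented for the example) computes the Newman-Girvan modularity Q = Σ_c (l_c/m − (d_c/2m)²) of a given partition; the clustering phase would then search, e.g. by simulated annealing, for the partition maximizing this score.

```python
from collections import defaultdict

def modularity(edges, clusters):
    """Newman-Girvan modularity Q of a vertex partition.

    edges    -- list of undirected (u, v) pairs
    clusters -- dict mapping each vertex to its cluster id
    """
    m = len(edges)                # total number of edges
    intra = defaultdict(int)      # edges inside each cluster
    degree = defaultdict(int)     # summed vertex degree per cluster
    for u, v in edges:
        degree[clusters[u]] += 1
        degree[clusters[v]] += 1
        if clusters[u] == clusters[v]:
            intra[clusters[u]] += 1
    # Q = sum over clusters of (intra-edge fraction - expected fraction under the null model)
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two dense triangles linked by one edge, clustered accordingly.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, part), 3))  # → 0.357
```

A simulated annealing optimizer would then repeatedly move a vertex to another cluster and accept the move with a probability depending on the resulting change in Q and on a decreasing temperature.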

    Cluster Data Analysis with a Fuzzy Equivalence Relation to Substantiate a Medical Diagnosis

    This study aims to develop a methodology for justifying medical diagnostic decisions based on the clustering of large volumes of statistical information stored in decision support systems. This aim is relevant since the analyzed medical data are often incomplete and inaccurate, which negatively affects the correctness of medical diagnoses and the subsequent choice of the most effective treatment actions. Clustering is an effective mathematical tool for extracting useful information under conditions of initial data uncertainty. The analysis showed that the most appropriate algorithm for this problem is based on fuzzy clustering and a fuzzy equivalence relation. The present study uses this algorithm to form a technique for analyzing large volumes of medical data in order to prepare a rationale for medical diagnostic decisions. The proposed methodology involves the sequential implementation of the following procedures: preliminary data preparation, selection of the purpose of the cluster analysis, determination of the form in which results are presented, data normalization, selection of criteria for assessing the quality of the solution, application of fuzzy data clustering, evaluation of the sample and the results, and their use in further work. The fuzzy clustering quality criteria include the partition coefficient, the entropy separation criterion, the separation efficiency ratio, and the cluster power criterion. The novelty of the results lies in the fact that the proposed methodology makes it possible to work with clusters of arbitrary shape and missing centers, which is impossible with universal algorithms. Doi: 10.28991/esj-2021-01305
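To make the fuzzy-equivalence step concrete: a reflexive, symmetric fuzzy similarity relation can be turned into a fuzzy equivalence relation via its max-min transitive closure, and a λ-cut of that relation then yields crisp clusters. The sketch below is a minimal NumPy illustration of these standard definitions (the similarity matrix is invented), not the paper's implementation.

```python
import numpy as np

def transitive_closure(R):
    """Max-min transitive closure of a fuzzy relation matrix R.

    Repeatedly composes R with itself under max-min composition until it
    stabilizes; for a reflexive, symmetric R this yields a fuzzy
    equivalence relation."""
    while True:
        # max-min composition: R2[i, j] = max_k min(R[i, k], R[k, j])
        R2 = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        if np.array_equal(R2, R):
            return R
        R = R2

def lambda_cut_clusters(R, lam):
    """Partition indices by the crisp equivalence relation R >= lam."""
    crisp = R >= lam
    clusters, seen = [], set()
    for i in range(len(R)):
        if i not in seen:
            members = [j for j in range(len(R)) if crisp[i, j]]
            seen.update(members)
            clusters.append(members)
    return clusters

# Invented 4x4 fuzzy similarity between patient records.
R = np.array([[1.0, 0.8, 0.4, 0.5],
              [0.8, 1.0, 0.4, 0.5],
              [0.4, 0.4, 1.0, 0.6],
              [0.5, 0.5, 0.6, 1.0]])
C = transitive_closure(R)
clusters = lambda_cut_clusters(C, 0.7)
print(clusters)  # → [[0, 1], [2], [3]]
```

Lowering λ merges clusters: a cut at 0.55 on the same closure groups records 2 and 3 together.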

    CLUSTER ANALYSIS IN BIOTECHNOLOGY


    Managing a complex project using a Risk-Risk Multiple Domain Matrix

    This communication presents a clustering methodology applied to a complex project consisting of the delivery of three interdependent sub-systems. It enables small, complementary task forces to be constituted, enhancing communication and coordination on transverse issues related to the complexity of the whole system. The problem is to gather and exploit data for such systems, with numerous and heterogeneous risks from different domains (product, process, organization). The method consists in regrouping actors through the clustering of the risks they own. The result highlights important and transverse risk interdependencies, within and between projects, which should not be neglected in order to avoid potentially severe issues, whether during the project or during the exploitation of its deliverable. An application to a real plant-implementation program at the CEA-DAM is presented, with a sensitivity analysis of the clustering results with respect to the inputs and the chosen configurations of the problem.
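A minimal sketch of this kind of risk clustering, assuming a symmetric risk-risk interaction matrix (the 6×6 matrix and the use of SciPy's average-linkage hierarchical clustering are illustrative choices, not the method of the communication):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Invented 6x6 symmetric risk-risk interaction strengths in [0, 1];
# higher means the two risks are more strongly interdependent.
interaction = np.full((6, 6), 0.1)
interaction[:3, :3] = 0.9          # risks 0-2 strongly interact
interaction[3:, 3:] = 0.9          # risks 3-5 strongly interact
np.fill_diagonal(interaction, 1.0)

# Turn interaction strength into a distance and cluster hierarchically.
distance = 1.0 - interaction
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # two groups: risks 0-2 together, risks 3-5 together
```

Each resulting group of risks then suggests a task force gathering the actors who own those risks, which is where the transverse coordination benefit comes from.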

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem as a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large genome samples. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
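The encoding as a sparse matrix product can be sketched in a few lines of single-node SciPy (the function name and data are made up; the actual SimilarityAtScale algorithm distributes this product across many nodes):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pairwise_jaccard(sets, universe_size):
    """All-pairs Jaccard similarity via one sparse matrix product.

    Each set becomes a binary row of a sparse matrix A; A @ A.T then
    holds all pairwise intersection sizes, and the union follows from
    |a ∪ b| = |a| + |b| - |a ∩ b|."""
    rows = [i for i, s in enumerate(sets) for _ in s]
    cols = [e for s in sets for e in s]
    A = csr_matrix((np.ones(len(cols)), (rows, cols)),
                   shape=(len(sets), universe_size))
    inter = (A @ A.T).toarray()                  # intersection sizes
    sizes = np.asarray(A.sum(axis=1)).ravel()    # set cardinalities
    union = sizes[:, None] + sizes[None, :] - inter
    return inter / union

# Toy "genomes" as sets of k-mer ids drawn from a universe of size 7.
sets = [{0, 1, 2, 3}, {2, 3, 4}, {5, 6}]
J = pairwise_jaccard(sets, 7)
print(J[0, 1])  # → 0.4  (|{2,3}| / |{0,1,2,3,4}|)
```

The communication-efficient part of the paper lies precisely in how this sparse product is partitioned and scheduled across a distributed-memory machine, which this single-node sketch does not attempt to show.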

    Modeling and analyzing propagation risks in complex projects: application to the development of new vehicles

    The management of complex projects requires orchestrating the cooperation of hundreds of individuals from various companies, professions and backgrounds, working on thousands of activities, deliverables, and risks. Moreover, these numerous project elements are increasingly interconnected, and no decision or action is independent. This growing complexity is one of the greatest challenges of project management and one of the causes of project failure in terms of cost overruns and time delays. For instance, in the automotive industry, increasing market orientation and the growing complexity of automotive products have changed the management structure of vehicle development projects from a hierarchical to a networked structure, including not only the manufacturer but also numerous suppliers. Dependencies between project elements increase risks, since problems in one element may propagate to other, directly or indirectly dependent elements. Complexity generates a number of phenomena, positive or negative, isolated or in chains, local or global, that interfere to a greater or lesser extent with the convergence of the project towards its goals. The aim of this thesis is thus to reduce the risks associated with the complexity of vehicle development projects by increasing the understanding of this complexity and the coordination of project actors. To do so, a first research question is how to prioritize actions to mitigate complexity-related risks. A second research question is how to organize and coordinate actors in order to cope efficiently with the previously identified complexity-related phenomena. The first question is addressed by modeling project complexity and by analyzing complexity-related phenomena within the project at two levels. First, a high-level, factor-based descriptive model is proposed. It makes it possible to measure and prioritize project areas where complexity may have the most impact.
Second, a low-level, graph-based model is proposed, based on a finer modeling of project elements and their interdependencies. Contributions are made to the complete modeling process, including the automation of some data-gathering steps, in order to increase performance and decrease effort and the risk of error. The two models can be used in sequence: a first high-level measure makes it possible to focus on certain areas of the project, where the low-level modeling is then applied, with a gain in global efficiency and impact. Based on these models, contributions are made to anticipate the potential behavior of the project. Topological and propagation analyses are proposed to detect and prioritize critical elements and critical interdependencies, while enlarging the sense of the polysemous word “critical.” The second research question is addressed by introducing a clustering methodology that proposes groups of actors in new product development projects, especially for actors involved in many deliverable-related interdependencies in different phases of the project life cycle. This increases coordination between interdependent actors who are not always formally connected via the hierarchical structure of the project organization, and brings the project organization closer to what a networked structure should be. The automotive industrial application has shown promising results for the contributions to both research questions. Finally, the proposed methodology is discussed in terms of genericity and appears applicable to a wide set of complex projects for decision support.
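The propagation analyses mentioned above can be illustrated with a toy model: if P[i, j] is a direct propagation likelihood from project element i to element j, a Floyd-Warshall-style max-product closure yields the strongest chain of propagation between any two elements. This is only a hedged sketch of one plausible propagation measure, not the thesis's actual model:

```python
import numpy as np

def propagation_strength(P):
    """Strongest-path propagation between project elements.

    P[i, j] is a hypothetical direct propagation likelihood from element
    i to element j. Allowing chains through each intermediate element k
    in turn (Floyd-Warshall style, with max-product instead of min-sum)
    gives the strongest propagation chain i -> ... -> j."""
    S = P.copy()
    for k in range(len(S)):
        # chain i -> k -> j has strength S[i, k] * S[k, j]
        S = np.maximum(S, np.outer(S[:, k], S[k, :]))
    return S

# Invented 3-element project: 0 affects 1 strongly, 1 affects 2 moderately.
P = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
S = propagation_strength(P)
print(S[0, 2])  # → 0.4  (indirect chain 0 -> 1 -> 2)
```

Ranking elements by, e.g., the row or column sums of S is one simple way to surface the critical elements and critical interdependencies that such analyses aim to prioritize.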