3 research outputs found

    How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

    Background: Hierarchical cluster analysis (HCA) is a widely used classification technique in many areas of science. Applications usually yield a single dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when these parameters are fixed, ties in proximity (i.e. two clusters equidistant from a third one) may produce several different dendrograms with different clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, which raises the question of whether clusters persist across all the resulting dendrograms; this happens, for example, when HCA is used to group molecular descriptors in order to select the less similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and to use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well-separated clusters and a set of 1666 molecular descriptors, calculated for a group of molecules with hepatotoxic activity, were used to show how our functions can be used to study the effect of ties in HCA. These functions are not restricted to the tie case; the possibility of using them to derive cluster stability measures on arbitrary sets of dendrograms with the same leaves (e.g. dendrograms obtained by varying HCA parameters) is discussed. Ties were found to occur frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This implies that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, so their reliability needs to be assessed. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need to evaluate the effect of ties on clustering patterns before classification results can be used accurately. © 2016 Leal et al.
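
The effect of ties described in this abstract can be illustrated with a minimal sketch. It is not the authors' method and does not reproduce their four frequency measures: it assumes single linkage, exact integer distances, and a simple "fraction of distinct dendrograms containing the cluster" frequency. The names `enumerate_dendrograms` and `cluster_frequency` are hypothetical.

```python
from itertools import combinations

def single_linkage(a, b, dist):
    """Single-linkage distance: smallest pairwise distance between two clusters."""
    return min(dist[frozenset((i, j))] for i in a for j in b)

def enumerate_dendrograms(clusters, dist, merged=frozenset()):
    """Yield one dendrogram per tie-breaking path.

    A dendrogram is represented as the frozenset of clusters created by merges.
    Whenever several pairs share the minimum distance, every choice is explored.
    """
    if len(clusters) == 1:
        yield merged
        return
    pairs = list(combinations(clusters, 2))
    d = {p: single_linkage(p[0], p[1], dist) for p in pairs}
    best = min(d.values())
    for a, b in pairs:
        if d[(a, b)] == best:  # exact comparison; real data would need a tolerance
            new = a | b
            rest = [c for c in clusters if c not in (a, b)]
            yield from enumerate_dendrograms(rest + [new], dist, merged | {new})

def cluster_frequency(cluster, dendrograms):
    """Fraction of the given dendrograms that contain the cluster."""
    cluster = frozenset(cluster)
    return sum(cluster in d for d in dendrograms) / len(dendrograms)

# Toy data (an assumption for illustration): four equally spaced points on a line,
# so every pair of neighbours is tied at distance 1.
positions = [0, 1, 2, 3]
dist = {frozenset((i, j)): abs(positions[i] - positions[j])
        for i, j in combinations(range(len(positions)), 2)}
leaves = [frozenset([i]) for i in range(len(positions))]

# Different tie-breaking paths can produce the same dendrogram, so deduplicate.
dendrograms = set(enumerate_dendrograms(leaves, dist))

print(len(dendrograms))
print(cluster_frequency({0, 1}, dendrograms))
```

Even this four-point example yields several distinct dendrograms, and a cluster such as {0, 1} appears in only some of them, which is the kind of low cluster frequency the abstract warns about. Exhaustive enumeration grows quickly with the number of ties, so this approach is only practical for very small inputs.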

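    Optimização em grafos num contexto de redes sociais, tecnológicas e no ensino da matemática: estudo do problema de balanceamento de uma comunidade na world wide web [Graph optimisation in the context of social and technological networks and mathematics teaching: a study of the problem of balancing a community on the World Wide Web]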

    Doctoral thesis in Statistics and Operations Research (Optimisation), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2008. Operations Research (OR) has been widely applied in several traditional contexts (e.g. route optimisation, partitioning problems, scheduling problems). More recently, new applications include the design of telecommunication networks and data mining, among others. Within this context of new application domains, the application of OR to mathematics teaching is also briefly discussed in a chapter dedicated to the conceptual framework and literature review. However, the main investigation focuses on the application of graph optimisation to socio-technical networks and the World Wide Web (Web). Web topologies are commonly characterised by hierarchical structures and highly unbalanced compositions, as illustrated by the centrality and connectivity of their elements. Web communities are one among many examples of Web structures. The main aim of this thesis is to reconfigure such communities so as to reduce their initial structural imbalance. The Web Balancing Problem is characterised, along with a graph (network) model and integer programming formulations. GRASP and tabu search heuristics were developed to find feasible solutions to the problem. Computational results are also reported, based on several Web communities, involving comparison and combination of these metaheuristics. Some of these Web communities were obtained by crawling the Web and using epistemic (conceptual) boundaries. Other communities were randomly generated by network analysis tools. The results confirmed more balanced structures for the Web communities investigated.