79 research outputs found

    Benchmarking in cluster analysis: A white paper

    To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance. This means that proposals of new methods of data pre-processing, new data-analytic techniques, and new methods of output post-processing should be extensively and carefully compared with existing alternatives, and that existing methods should be subjected to neutral comparison studies. To date, benchmarking and recommendations for benchmarking have frequently been seen in the context of supervised learning. Unfortunately, there has been a dearth of guidelines for benchmarking in an unsupervised setting, with clustering as an important subdomain. To address this problem, the theoretical and conceptual underpinnings of benchmarking in the field of cluster analysis are discussed, covering both simulated and empirical data. Subsequently, the practicalities of how to address benchmarking questions in clustering are dealt with, and foundational recommendations are made.
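
The neutral comparisons the paper calls for typically score each method's partition against a known ground truth on simulated data. A minimal sketch of that idea, using toy 1-D data, a deliberately trivial threshold "clusterer", and the plain Rand index (all illustrative choices, not the paper's protocol):

```python
import itertools
import random

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    agree = pairs = 0
    for i, j in itertools.combinations(range(len(a)), 2):
        agree += (a[i] == a[j]) == (b[i] == b[j])
        pairs += 1
    return agree / pairs

random.seed(0)
# Two well-separated 1-D groups with known ("ground truth") labels.
data = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(10, 1) for _ in range(30)]
truth = [0] * 30 + [1] * 30

# Candidate "method" under benchmark: cut at the midpoint of the data range.
cut = (min(data) + max(data)) / 2
predicted = [int(x > cut) for x in data]
print(rand_index(truth, predicted))
```

In a real benchmark one would sweep many data-generating settings and methods, and prefer a chance-corrected index such as the adjusted Rand index.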

    Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

    Clustering is a prominent data mining technique for grouping data into clusters based on distance measures. With the rapid growth of high-dimensional data such as microarray gene expression data, grouping such data into clusters runs into the problem that similarity between objects measured in the full-dimensional space is often invalid, because the data contain attributes of different types. Clustering is therefore inaccurate, and often below expectation, when the dimensionality of the dataset is high, and the problem is now attracting considerable research and development attention. To address the performance issues of clustering in high-dimensional data, techniques such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering, and data labelling for clusters need to be analysed and improved. In this paper, we present a brief comparison of existing algorithms that focus mainly on clustering high-dimensional data.
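
One concrete reason full-space similarity degrades in high dimensions, often called distance concentration, can be demonstrated in a few lines. The experiment below is purely illustrative and not taken from the survey: as the dimension grows, the contrast between the smallest and largest distance shrinks, so distance-based similarity loses discriminating power.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=1):
    """(max - min) / min of distances to the origin for uniform random points."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

# Contrast drops sharply as the dimension grows.
for d in (2, 20, 200):
    print(d, round(distance_contrast(d), 3))
```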

    Computation applications in archaeology

    This thesis is a critical analysis of the use which has been made of the computer in archaeology up to the year 1972. The main chapters cover the applications in archaeology of Statistics, Information Retrieval, Graphics, Pottery Classification and Survey Reduction. A large body of Miscellaneous Applications, including Pollen Analysis, is also examined. The majority of computer applications have been in Statistics. These applications include Numerical Taxonomy, Matrix Manipulation and Seriation, the generation of hypotheses and models, Multidimensional Scaling, Cumulative Percentage Graphs and Trend Surface Analysis. It is worthwhile to note that for small sets of data several manual methods give comparable results to complex computer analyses and at far less cost. Computer Information Retrieval is examined in the light of its use for large bodies of specialist archaeological information, for museum cataloguing, and for the compilation of a site excavation record using a remote terminal. The use of Computer Graphics in the production of archaeological maps, plans and diagrams is examined. Facilities include the production of dot-density plots, distribution maps, histograms, piecharts, pottery diagrams, site block diagrams with 3D rotation and perspective, sections, pit outlines and projectile point classification by Fourier analysis. The use of the d-Mac Pencil Follower in the objective classification of pottery is described, followed by computer analysis of the resultant multivariate data. The use of the computer in the routine reduction of geophysical observations taken on archaeological sites is described. Complex filtering procedures for the removal of background effects and the enhancement of the archaeological anomalies are examined. Since other workers have concentrated on the applications of statistics in archaeology, this thesis explores the relatively neglected fields of Graphics and Pottery Classification. Evidence is presented that significant advances have been made in the classification of pottery vessels and projectile points, and in the graphical output of results. A number of new programs have been developed; these include software which may be operated from a remote terminal at an archaeological site. The PLUTARCH system (Program Library Useful To ARCHaeologists) is described. This is a control program which uses interactive graphics and overlays to combine all the computer facilities available to the archaeologist. The individual graphics, statistics, instrument survey plotting and information retrieval techniques, when combined in this way, can communicate via global storage and become even more powerful.

    Classification trees

    Available from the British Library Document Supply Centre (BLDSC): DSC:DX171132. SIGLE, United Kingdom.

    Breaking the hierarchy - a new cluster selection mechanism for hierarchical clustering methods

    Background: Hierarchical clustering methods like Ward's method have been used for decades to understand biological and chemical data sets. In order to obtain a partition of the data set, it is necessary to choose an optimal level of the hierarchy by a so-called level selection algorithm. In 2005, a new kind of hierarchical clustering method was introduced by Palla et al. that differs in two ways from Ward's method: it can be used on data for which no full similarity matrix is defined, and it can produce overlapping clusters, i.e., allow for multiple membership of items in clusters. These features are well suited to biological and chemical data sets, but until now no level selection algorithm has been published for this method. Results: In this article we provide a general selection scheme, the level independent clustering selection method, called LInCS. With it, clusters can be selected from any level in quadratic time with respect to the number of clusters. Since hierarchically clustered data is not necessarily associated with a similarity measure, the selection is based on a graph-theoretic notion of cohesive clusters. We present results of our method on two data sets: a set of drug-like molecules and a set of protein-protein interaction (PPI) data. In both cases the method provides a clustering with very good sensitivity and specificity values according to a given reference clustering. Moreover, we show for the PPI data set that our graph-theoretic cohesiveness measure indeed chooses biologically homogeneous clusters and disregards inhomogeneous ones in most cases. We finally discuss how the method can be generalized to other hierarchical clustering methods to allow for a level independent cluster selection. Conclusion: Using our new cluster selection method together with the method by Palla et al. provides an interesting new clustering mechanism that makes it possible to compute overlapping clusters, which is especially valuable for biological and chemical data sets.
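
The core idea of level-independent selection can be sketched as follows: instead of cutting the hierarchy at one level, accept a node wherever a quality criterion holds and recurse into its children otherwise, so that selected clusters come from different depths. The tree, cohesion scores, and threshold below are toy stand-ins, far simpler than the graph-theoretic cohesiveness measure LInCS actually uses.

```python
class Node:
    def __init__(self, items, cohesion, children=()):
        self.items = items          # data points covered by this cluster
        self.cohesion = cohesion    # any internal-quality score in [0, 1]
        self.children = children

def select(node, threshold):
    """Return clusters drawn from mixed levels of the hierarchy."""
    if node.cohesion >= threshold or not node.children:
        return [node.items]
    return [c for child in node.children for c in select(child, threshold)]

# The root is loose; its left child is cohesive at depth 1, while on the
# right only the depth-2 nodes are cohesive, so the selection spans levels.
tree = Node({1, 2, 3, 4, 5, 6}, 0.2, (
    Node({1, 2, 3}, 0.9),
    Node({4, 5, 6}, 0.4, (
        Node({4, 5}, 0.8),
        Node({6}, 1.0),
    )),
))
print(select(tree, 0.7))
```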

    The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval

    Hierarchic document clustering has been applied to Information Retrieval (IR) for over three decades. Its introduction to IR was based on the grounds of its potential to improve the effectiveness of IR systems. Central to the issue of improved effectiveness is the Cluster Hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. However, previous research has been inconclusive as to whether document clustering does bring improvements. The main motivation for this work has been to investigate methods for the improvement of the effectiveness of document clustering, by challenging some assumptions that implicitly characterise its application. Such assumptions relate to the static manner in which document clustering is typically performed, and include the static application of document clustering prior to querying, and the static calculation of interdocument associations. The type of clustering that is investigated in this thesis is query-based, that is, it incorporates information from the query into the process of generating clusters of documents. Two approaches for incorporating query information into the clustering process are examined: clustering documents which are returned from an IR system in response to a user query (post-retrieval clustering), and clustering documents by using query-sensitive similarity measures. For the first approach, post-retrieval clustering, an analytical investigation into a number of issues that relate to its retrieval effectiveness is presented in this thesis. This is in contrast to most of the research which has employed post-retrieval clustering in the past, where it is mainly viewed as a convenient and efficient means of presenting documents to users. In this thesis, post-retrieval clustering is employed based on its potential to introduce effectiveness improvements compared both to static clustering and best-match IR systems. 
The motivation for the second approach, the use of query-sensitive measures, stems from the role of interdocument similarities for the validity of the cluster hypothesis. In this thesis, an axiomatic view of the hypothesis is proposed, by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Past research has attributed failure to validate the hypothesis for a document collection to characteristics of the collection. Contrary to this, the view proposed in this thesis suggests that failure of a document set to adhere to the hypothesis is attributed to the assumptions made about interdocument similarity. This thesis argues that the query determines the context and the purpose for which the similarity between documents is judged, and it should therefore be incorporated in the similarity calculations. By taking the query into account when calculating interdocument similarities, co-relevant documents can be "forced" to be more similar to each other. This view challenges the typically static nature of interdocument relationships in IR. Specific formulas for the calculation of query-sensitive similarity are proposed in this thesis. Four hierarchic clustering methods and six document collections are used in the experiments. Three main issues are investigated: the effectiveness of hierarchic post-retrieval clustering which uses static similarity measures, the effectiveness of query-sensitive measures at increasing the similarity of pairs of co-relevant documents, and the effectiveness of hierarchic clustering which uses query-sensitive similarity measures. 
The results demonstrate the effectiveness improvements that are introduced by the use of both approaches of query-based clustering, compared both to the effectiveness of static clustering and to the effectiveness of best-match IR systems. Query-sensitive similarity measures, in particular, introduce significant improvements over the use of static similarity measures for document clustering, and they also significantly improve the structure of the document space in terms of the similarity of pairs of co-relevant documents. The results provide evidence for the effectiveness of hierarchic query-based clustering of documents, and also challenge findings of previous research which had dismissed the potential of hierarchic document clustering as an effective method for information retrieval.
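
The general idea of a query-sensitive similarity, letting the query set the context in which two documents are compared, can be sketched by up-weighting the query's terms before computing cosine similarity. The thesis proposes its own specific formulas; the weighting scheme, term weights, and boost factor below are illustrative assumptions only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def query_sensitive_sim(d1, d2, query, boost=3.0):
    """Cosine similarity with the query's terms up-weighted in both documents."""
    def reweigh(doc):
        return {t: w * (boost if t in query else 1.0) for t, w in doc.items()}
    return cosine(reweigh(d1), reweigh(d2))

d1 = {"cluster": 2, "retrieval": 1, "pottery": 3}
d2 = {"cluster": 1, "retrieval": 2, "magnet": 2}
q = {"cluster", "retrieval"}
# The shared query vocabulary dominates, so the query-sensitive score
# exceeds the static cosine for this potentially co-relevant pair.
print(cosine(d1, d2), query_sensitive_sim(d1, d2, q))
```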

    Hybrid and adaptive genetic fuzzy clustering algorithms

    Master's thesis, Master of Engineering.

    Cluster Analysis of Legal Documents

    Single-link cluster analysis has been used to provide classifications of several collections of legal documents, based on various characteristics of the text. Each document was represented in terms of the chosen characteristics by a vector whose elements were the frequencies of occurrence of the characteristics in that document. The values of similarity between documents were determined by calculating the cosine of the angle between each pair of document vectors. The clustering algorithm then operated on these similarity coefficients to group documents which were most similar. A suite of computer programs was written to perform the classification. Four programs were required to (a) select the document descriptors from the full-text of the documents, (b) construct document vectors, (c) calculate similarity coefficients, and (d) perform single-link clustering. Three classification experiments were performed. The first classified the full-text of both the English and French versions of the Treaties of the Council of Europe. The words of the full-text, taken singly and in pairs, were used to describe the treaties, and the two cases of including and excluding the 'common' words were investigated. The best classification was based on single words with common words excluded. Since each treaty was a lengthy collection of non-homogeneous clauses, it was thought that a classification of the individual articles would be more useful. In this case the formal and non-formal clauses clustered separately, whereas before the formal clauses, present in every treaty, had caused semantically unrelated treaties to be brought together. During the course of this study an opportunity arose to investigate the use of cluster analysis to test the trustworthiness of certain oral confessions presented as evidence in criminal proceedings. The common or function words, which are generally agreed to characterise the style of an author, were used as document descriptors for two sets of statements, one which the defendant admitted, the other which he was alleged to have made but which he denied. The two sets of statements clustered separately, indicating a difference in style. On the basis of this and other comparative tests it was possible to say that the disputed statements were unlikely to have been made by the defendant. The third experiment involved the use of the marginal citations in Statutes as document descriptors. Statutes were regarded as semantically related if they cited the same Acts. The Public General Acts of Parliament for the three years 1973-1975 were successfully clustered into groups of related Acts.
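
Steps (b)-(d) of the pipeline the abstract describes can be sketched in a few lines: build term-frequency vectors, compute pairwise cosine similarities, then single-link cluster by joining any two documents whose similarity exceeds a threshold (single-link cut at a fixed level is equivalent to the connected components of that threshold graph). The documents and threshold below are invented for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

docs = ["treaty on human rights", "human rights convention treaty",
        "statute citing the finance act", "finance act amendment statute"]
vecs = [Counter(d.split()) for d in docs]   # step (b): document vectors

def single_link(vecs, threshold):
    """Connected components of the similarity-above-threshold graph."""
    n = len(vecs)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) >= threshold:   # step (c)
                parent[find(j)] = find(i)               # union: merge clusters
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(single_link(vecs, 0.5))   # step (d)
```

The two treaty documents and the two statute documents end up in separate groups, mirroring the citation-based grouping of related Acts described above.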

    Data clustering using a model granular magnet

    We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures it is completely ordered; all spins are aligned. At very high temperatures the system does not exhibit any ordering. In an intermediate regime, clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins, and the corresponding data points, into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method. Comment: 46 pages, PostScript, 15 PS figures included.
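
The distance-decaying interaction at the heart of this construction can be sketched as follows. The Gaussian form J = exp(-d² / 2a²) is a common choice for such couplings and is used here as an assumption, not necessarily the paper's exact definition; the point is simply that nearby points are strongly coupled and tend to order together, while distant points are effectively decoupled.

```python
import math

def coupling(p, q, a=1.0):
    """Interaction strength decaying with the squared distance between points."""
    d2 = sum((x - y) ** 2 for x, y in zip(p, q))
    return math.exp(-d2 / (2 * a * a))

close = coupling((0.0, 0.0), (0.5, 0.0))   # neighbours: strong coupling
far = coupling((0.0, 0.0), (4.0, 0.0))     # distant pair: near-zero coupling
print(close, far)
```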

    The Best Fit? 25 Years of Statistical Techniques in Archaeology
