235 research outputs found

    Clustering files of chemical structures using the Szekely-Rizzo generalization of Ward's method

    Ward's method is extensively used for clustering chemical structures represented by 2D fingerprints. This paper compares Ward clusterings of 14 datasets (containing between 278 and 4332 molecules) with those obtained using the Székely–Rizzo clustering method, a generalization of Ward's method. The clusters resulting from these two methods were evaluated by the extent to which the various classifications were able to group active molecules together, using a novel criterion of clustering effectiveness. Analysis of a total of 1400 classifications (Ward and Székely–Rizzo clustering methods, 14 different datasets, 5 different fingerprints and 10 different distance coefficients) demonstrated the general superiority of the Székely–Rizzo method. The distance coefficient first described by Soergel performed extremely well in these experiments, and this was also the case when it was used in simulated virtual screening experiments.
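
    The abstract names the Soergel coefficient without defining it. As a minimal sketch, assuming the usual definition (the sum of absolute differences divided by the sum of element-wise maxima, which for binary fingerprints equals one minus the Tanimoto similarity), it could be computed as follows. The function name and the example fingerprints are illustrative and do not come from the paper.

```python
def soergel_distance(x, y):
    """Soergel distance between two equal-length, non-negative vectors.

    Assumed definition: sum(|x_i - y_i|) / sum(max(x_i, y_i)).
    Two all-zero vectors are treated as identical (distance 0.0).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical 5-bit fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]
print(soergel_distance(fp_a, fp_b))
```

    A pairwise matrix of such distances is the kind of input a hierarchical clustering routine would consume; how the paper combined it with each clustering method is not stated in the abstract.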

    Studies in numerical taxonomy of soils

    A series of established numerical taxonomic strategies was applied to soil data from three sources: USDA (1975), De Alwis (1971) and the Soil Survey of England and Wales. The first two sources provided data for 41 soil profiles, which were classified without reference to their geographical location. The data obtained from the Soil Survey of England and Wales related to a particular geographical area (West Sussex Coastal Plain) and the geographical relationship between soil individuals was also examined. Two methods of soil characterization (soil profile models) were compared with respect to their effect on the results produced by two hierarchical agglomerative strategies based on two measures of inter-individual similarity. The results obtained from the agglomerative strategies for the two soil profile models were compared. The nature of inter-attribute correlation for depth levels modelled as arrays of independent attributes was examined, and all attributes were classified on the basis of inter-attribute correlation. Seven hierarchical agglomerative strategies were examined with respect to their goodness-of-fit in the original space, and the relationship between goodness-of-fit and clarity of clusters was also examined. From these comparisons, two agglomerative strategies were chosen to represent two classes of strategy: (a) strategies with minimal distortion, (b) strategies with greater distortion but clear clusters. The average linkage method from the first category and Ward's error sum of squares (ESS) method from the second category were selected. These two strategies were applied to the data sets described above using two measures of similarity, namely (a) squared Euclidean distance and (b) Mahalanobis D², and a divisive strategy, REMUL, was also applied to classify the soil populations. 
The classifications obtained from these strategies were compared using Wilks' criterion Λ, and the classification which had the lowest Λ was treated as the best initial partition. The best two partitions of the two populations obtained from the agglomerative strategy, Ward's ESS method, were further analysed. The optimum number of groups (G) in each population was decided by the relationship between ΛG² and G. The soil profile groups produced by these methods were further examined and improved by a reallocation strategy based on the Mahalanobis distance between individuals and the group centroids. Reallocation was done using 30 attributes from the uppermost soil horizons. Canonical analysis was performed on the populations both before and after the classification. Canonical plots were produced and a comparison was made with the dendrograms obtained for the best partitions. The classifications obtained were examined in relation to parent material classes. The spatial relationship of the soil groups of the West Sussex Coastal Plain was also investigated. As shown by this study, it is possible to produce a better classification of soils by numerical taxonomic methods than by traditional methods. To this end, it is not necessary to use all attributes of soils; a sufficiently large number of properties, which can be empirically determined, is adequate for producing a natural classification. The soil groups produced by numerical methods showed a closer association with parent materials.

    Clustering in an Object-Oriented Environment

    This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface, and are compatible with existing S-PLUS functions. We will describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.

    An Investigation of Cluster Analysis

    Three cluster analysis programs were used to group the same 64 individuals, generated so as to represent eight populations of eight individuals each. Each individual had quantitative values for seven attributes. All eight populations shared a common attribute variance-covariance matrix. The first program, from F. J. Rohlf's MINT package, implemented single linkage. Correlation was used as the basis for similarity. The results were not satisfactory, and the further use of correlation is in question. The second program, MDISP, bases similarity on Euclidean distance. It was found to give excellent results, in that it clustered individuals into the exact populations from which they were generated. It is the recommended program of the three used here. The last program, MINFO, uses similarity based on mutual information. It also gave very satisfactory results, but, due to visualization reasons, it was found to be less favorable than the MDISP program.

    Microbial regulation of the L cell transcriptome.

    L cells are an important class of enteroendocrine cells secreting hormones such as glucagon-like peptide-1 and peptide YY that have several metabolic and physiological effects. The gut is home to trillions of bacteria affecting host physiology, but there has been limited understanding about how the microbiota affects gene expression in L cells. Thus, we rederived the reporter mouse strain, GLU-Venus expressing yellow fluorescent protein under the control of the proglucagon gene, as germ-free (GF). Lpos cells from ileum and colon of GF and conventionally raised (CONV-R) GLU-Venus mice were isolated and subjected to transcriptomic profiling. We observed that the microbiota exerted major effects on ileal L cells. Gene Ontology enrichment analysis revealed that microbiota suppressed biological processes related to vesicle localization and synaptic vesicle cycling in Lpos cells from ileum. This finding was corroborated by electron microscopy of Lpos cells showing reduced numbers of vesicles as well as by demonstrating decreased intracellular GLP-1 content in primary cultures from ileum of CONV-R compared with GF GLU-Venus mice. By analysing Lpos cells following colonization of GF mice, we observed that the greatest transcriptional regulation was evident within 1 day of colonization. Thus, the microbiota has a rapid and pronounced effect on the L cell transcriptome, predominantly in the ileum.

    Identifying experts and authoritative documents in social bookmarking systems

    Social bookmarking systems allow people to create pointers to Web resources in a shared, Web-based environment. These services allow users to add free-text labels, or “tags”, to their bookmarks as a way to organize resources for later recall. Ease-of-use, low cognitive barriers, and a lack of controlled vocabulary have allowed social bookmarking systems to grow exponentially over time. However, these same characteristics also raise concerns. Tags lack the formality of traditional classificatory metadata and suffer from the same vocabulary problems as full-text search engines. It is unclear how many valuable resources are untagged or tagged with noisy, irrelevant tags. With few restrictions to entry, annotation spamming adds noise to public social bookmarking systems. Furthermore, many algorithms for discovering semantic relations among tags do not scale to the Web. Recognizing these problems, we develop a novel graph-based Expert and Authoritative Resource Location (EARL) algorithm to find the most authoritative documents and expert users on a given topic in a social bookmarking system. In EARL’s first phase, we reduce noise in a Delicious dataset by isolating a smaller sub-network of “candidate experts”, users whose tagging behavior shows potential domain and classification expertise. In the second phase, a HITS-based graph analysis is performed on the candidate experts’ data to rank the top experts and authoritative documents by topic. To identify topics of interest in Delicious, we develop a distributed method to find subsets of frequently co-occurring tags shared by many candidate experts. We evaluated EARL’s ability to locate authoritative resources and domain experts in Delicious by conducting two independent experiments. The first experiment relies on human judges’ n-point scale ratings of resources suggested by three graph-based algorithms and Google. 
The second experiment evaluated the proposed approach’s ability to identify classification expertise through human judges’ n-point scale ratings of classification terms versus expert-generated data.
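
    The abstract says only that EARL's second phase is "HITS-based" and gives no further detail. As a rough illustration of plain HITS, not of EARL itself, the sketch below treats each bookmark as a directed edge from a user to a document and iterates hub and authority scores. The function name, the iteration count, and the toy edge list are all assumptions made for this example.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Plain HITS by power iteration on (source, target) edges.

    Returns (hub, authority) dicts with L2-normalized scores.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        # Scale scores to unit Euclidean length; guard against all zeros.
        norm = sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing at it.
        new_auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_auth[target] += hub[source]
        auth = normalize(new_auth)

        # Hub score: sum of authority scores of the nodes it points at.
        new_hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical bookmarks: (user, document) pairs, for illustration only.
bookmarks = [("u1", "d1"), ("u2", "d1"), ("u3", "d1"), ("u1", "d2")]
hub, auth = hits(bookmarks)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

    In this toy graph, users play the role of hubs (candidate experts) and documents the role of authorities; the document bookmarked by more users ends up with the higher authority score.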