Clustering files of chemical structures using the Székely–Rizzo generalization of Ward's method
Ward's method is extensively used for clustering chemical structures represented by 2D fingerprints. This paper compares Ward clusterings of 14 datasets (containing between 278 and 4332 molecules) with those obtained using the Székely–Rizzo clustering method, a generalization of Ward's method. The clusters resulting from these two methods were evaluated by the extent to which the various classifications were able to group active molecules together, using a novel criterion of clustering effectiveness. Analysis of a total of 1400 classifications (Ward and Székely–Rizzo clustering methods, 14 different datasets, 5 different fingerprints and 10 different distance coefficients) demonstrated the general superiority of the Székely–Rizzo method. The distance coefficient first described by Soergel performed extremely well in these experiments, and this was also the case when it was used in simulated virtual screening experiments.
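The combination described above can be sketched in a few lines: compute the Soergel distance (which, for binary fingerprints, equals one minus the Tanimoto similarity) and feed the resulting matrix to a Ward-style agglomeration. The toy fingerprints and cluster count below are invented for illustration, and note that scipy's `ward` update is formally defined for Euclidean input, so applying it to a precomputed Soergel matrix is exactly the kind of generalization the Székely–Rizzo work addresses:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def soergel(a, b):
    # Soergel distance: sum|a_i - b_i| / sum max(a_i, b_i);
    # for binary fingerprints this equals 1 - Tanimoto similarity.
    return np.abs(a - b).sum() / np.maximum(a, b).sum()

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(20, 64))  # 20 toy binary "fingerprints"

n = len(fps)
dmat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dmat[i, j] = dmat[j, i] = soergel(fps[i], fps[j])

# Ward linkage on the condensed Soergel distance matrix
Z = linkage(squareform(dmat), method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")  # cut into four clusters
print(sorted(set(labels.tolist())))
```

The same condensed distance matrix could be handed to any other agglomerative method for the kind of method-by-coefficient comparison the paper reports.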
Studies in numerical taxonomy of soils
A series of established numerical taxonomic strategies was applied to soil data from three sources: USDA (1975), De Alwis (1971) and the Soil Survey of England and Wales. The first two sources provided data for 41 soil profiles, which were classified without reference to their geographical location. The data from the Soil Survey of England and Wales related to a particular geographical area (the West Sussex Coastal Plain), and the geographical relationship between soil individuals was also examined. Two methods of soil characterization (soil profile models) were compared with respect to their effect on the results produced by two hierarchical agglomerative strategies based on two measures of inter-individual similarity, and the results obtained for the two soil profile models were compared. The nature of inter-attribute correlation for depth levels modelled as arrays of independent attributes was examined, and all attributes were classified on the basis of inter-attribute correlation. Seven hierarchical agglomerative strategies were examined with respect to their goodness-of-fit in the original space, and the relationship between goodness-of-fit and clarity of clusters was also examined. From these comparisons, two agglomerative strategies were chosen to represent two classes of strategy: (a) strategies with minimum distortion, and (b) strategies with greater distortion but clear clusters. The average linkage method was selected from the first category and Ward's error sum of squares (ESS) method from the second. These two strategies were applied to the data sets described above using two measures of similarity, namely (a) squared Euclidean distance and (b) Mahalanobis D²; a divisive strategy, REMUL, was also applied to classify the soil populations.
The classifications obtained from these strategies were compared using Wilks' criterion Λ, and the classification with the lowest Λ was treated as the best initial partition. The best two partitions of the two populations obtained from the agglomerative strategy, Ward's ESS method, were further analysed. The optimum number of groups (G) in each population was decided by the relationship between Λ_G² and G. The soil profile groups produced by these methods were further examined and improved by a reallocation strategy based on the Mahalanobis distance between individuals and the group centroids. Reallocation was done using 30 attributes from the uppermost soil horizons. Canonical analysis was performed on the populations both before and after the classification. Canonical plots were produced and compared with the dendrograms obtained for the best partitions. The classifications obtained were examined in relation to parent material classes. The spatial relationship of the soil groups of the West Sussex Coastal Plain was also investigated. As this study shows, it is possible to produce a better classification of soils by numerical taxonomic methods than by traditional methods. To this end, it is not necessary to use all attributes of the soils; a sufficiently large number of properties, which can be determined empirically, is adequate for producing a natural classification. The soil groups produced by the numerical methods showed a closer association with parent materials.
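A common way to quantify an agglomerative strategy's goodness-of-fit in the original space, as compared above, is the cophenetic correlation between the dendrogram's implied distances and the input distances. The following is a minimal sketch on synthetic "profiles" with squared Euclidean distance; the thesis's actual soil attributes, the Mahalanobis D² runs and the REMUL divisive strategy are not reproduced here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# toy stand-in for soil profiles: 40 individuals, 10 standardized attributes
X = rng.normal(size=(40, 10))
d = pdist(X, metric="sqeuclidean")  # squared Euclidean, as in the study

# compare the two selected strategies by cophenetic correlation
Z_avg = linkage(d, method="average")
Z_ward = linkage(d, method="ward")
c_avg, _ = cophenet(Z_avg, d)
c_ward, _ = cophenet(Z_ward, d)
print(f"average linkage: {c_avg:.3f}   Ward ESS: {c_ward:.3f}")
```

Average linkage typically scores higher on this measure, matching its role above as the low-distortion representative, while Ward tends to trade fit for more clearly separated clusters.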
Clustering in an Object-Oriented Environment
This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book by Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface, and are compatible with existing S-PLUS functions. We describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.
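One quality index from Kaufman and Rousseeuw's book is the average silhouette width, which measures how much better each object fits its own cluster than the nearest other cluster. The original Fortran/S-PLUS routines are not shown here, but the idea can be sketched in Python on invented toy data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# two well-separated toy groups of 25 points each
X = np.vstack([rng.normal(0, 0.5, (25, 2)),
               rng.normal(5, 0.5, (25, 2))])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
s = silhouette_score(X, labels)  # in [-1, 1]; near 1 => strong clustering
print(f"average silhouette width: {s:.2f}")
```

The descendants of the original programs (pam, clara, fanny, agnes, diana, mona) survive in R's cluster package with the silhouette and banner displays the paper describes.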
An Investigation of Cluster Analysis
Three cluster analysis programs were used to group the same 64 individuals, generated so as to represent eight populations of eight individuals each. Each individual had quantitative values for seven attributes. All eight populations shared a common attribute variance-covariance matrix.
The first program, from F. J. Rohlf's MINT package, implemented single linkage. Correlation was used as the basis for similarity. The results were not satisfactory, and the further use of correlation is in question.
The second program, MDISP, bases similarity on Euclidean distance. It was found to give excellent results, in that it clustered individuals into the exact populations from which they were generated. It is the recommended program of the three used here.
The last program, MINFO, uses similarity based on mutual information. It also gave very satisfactory results, but, for reasons of visualization, it was found to be less favorable than the MDISP program.
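The experimental design above (eight populations of eight individuals, seven attributes, a shared covariance matrix) and the Euclidean-distance clustering that recovered it can be sketched as follows. The separation of the population means and the use of average linkage are assumptions for illustration, not details of the MDISP program:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
cov = np.eye(7)                        # common attribute covariance matrix
means = rng.normal(0, 8, size=(8, 7))  # eight well-separated population means
X = np.vstack([rng.multivariate_normal(m, cov, size=8) for m in means])
truth = np.repeat(np.arange(8), 8)     # true population of each individual

# Euclidean-distance clustering, cut into eight groups
Z = linkage(pdist(X), method="average")
labels = fcluster(Z, t=8, criterion="maxclust")

# check recovery: ideally each true population maps to a single cluster
recovered = all(len(set(labels[truth == p].tolist())) == 1 for p in range(8))
print("exact recovery:", recovered)
```

Replacing `pdist(X)` with a correlation-based dissimilarity would reproduce the comparison that put the MINT single-linkage results in question.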
Microbial regulation of the L cell transcriptome.
L cells are an important class of enteroendocrine cells secreting hormones such as glucagon-like peptide-1 (GLP-1) and peptide YY that have several metabolic and physiological effects. The gut is home to trillions of bacteria affecting host physiology, but there has been limited understanding of how the microbiota affects gene expression in L cells. Thus, we rederived the reporter mouse strain GLU-Venus, expressing yellow fluorescent protein under the control of the proglucagon gene, as germ-free (GF). Lpos cells from ileum and colon of GF and conventionally raised (CONV-R) GLU-Venus mice were isolated and subjected to transcriptomic profiling. We observed that the microbiota exerted major effects on ileal L cells. Gene Ontology enrichment analysis revealed that the microbiota suppressed biological processes related to vesicle localization and synaptic vesicle cycling in Lpos cells from ileum. This finding was corroborated by electron microscopy of Lpos cells showing reduced numbers of vesicles, as well as by demonstrating decreased intracellular GLP-1 content in primary cultures from ileum of CONV-R compared with GF GLU-Venus mice. By analysing Lpos cells following colonization of GF mice, we observed that the greatest transcriptional regulation was evident within 1 day of colonization. Thus, the microbiota has a rapid and pronounced effect on the L cell transcriptome, predominantly in the ileum.
Identifying experts and authoritative documents in social bookmarking systems
Social bookmarking systems allow people to create pointers to Web resources in a shared, Web-based environment. These services allow users to add free-text labels, or "tags", to their bookmarks as a way to organize resources for later recall. Ease-of-use, low cognitive barriers, and a lack of controlled vocabulary have allowed social bookmarking systems to grow exponentially over time. However, these same characteristics also raise concerns. Tags lack the formality of traditional classificatory metadata and suffer from the same vocabulary problems as full-text search engines. It is unclear how many valuable resources are untagged or tagged with noisy, irrelevant tags. With few restrictions to entry, annotation spamming adds noise to public social bookmarking systems. Furthermore, many algorithms for discovering semantic relations among tags do not scale to the Web.
Recognizing these problems, we develop a novel graph-based Expert and Authoritative Resource Location (EARL) algorithm to find the most authoritative documents and expert users on a given topic in a social bookmarking system. In EARL's first phase, we reduce noise in a Delicious dataset by isolating a smaller sub-network of "candidate experts", users whose tagging behavior shows potential domain and classification expertise. In the second phase, a HITS-based graph analysis is performed on the candidate experts' data to rank the top experts and authoritative documents by topic. To identify topics of interest in Delicious, we develop a distributed method to find subsets of frequently co-occurring tags shared by many candidate experts.
We evaluated EARL's ability to locate authoritative resources and domain experts in Delicious by conducting two independent experiments. The first experiment relied on human judges' n-point scale ratings of resources suggested by three graph-based algorithms and Google. The second experiment evaluated the proposed approach's ability to identify classification expertise through human judges' n-point scale ratings of classification terms versus expert-generated data.
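The HITS-based second phase can be sketched with a plain power iteration, treating candidate experts as hubs and bookmarked documents as authorities. The tiny adjacency matrix below is invented for illustration; EARL's actual graph construction over the Delicious data is much richer:

```python
import numpy as np

# Toy user->document adjacency: A[u, d] = 1 if user u bookmarked document d.
# Users act as hubs, documents as authorities (the HITS view of the graph).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

hubs = np.ones(A.shape[0])
auth = np.ones(A.shape[1])
for _ in range(50):                  # power iteration until convergence
    auth = A.T @ hubs                # a doc is authoritative if good hubs point to it
    auth /= np.linalg.norm(auth)
    hubs = A @ auth                  # a user is a good hub if they point to good docs
    hubs /= np.linalg.norm(hubs)

print("top document:", int(np.argmax(auth)))  # document 1: bookmarked by all users
```

Restricting the rows of the matrix to the candidate-expert sub-network is exactly what keeps the spam and noise discussed above from dominating the authority scores.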
An Information Retrieval Approach for Automatically Constructing Software Libraries
Although software reuse presents clear advantages for programmer productivity and code reliability, it is not practiced enough. One of the reasons for the only moderate success of reuse is the lack of software libraries that facilitate the actual locating and understanding of reusable components. This paper describes a technology for automatically assembling large software libraries which promote software reuse by helping the user locate the components closest to her/his needs. Software libraries are automatically assembled from a set of unorganized components by using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation by using a new indexing scheme based on the notions of lexical affinities and quantity of information. Then a hierarchy for browsing is automatically generated using a clustering technique which draws only on the information provided by the attributes. Thanks to the free-text indexing scheme, tools following this approach can accept free-style natural language queries. This technology has been implemented in the GURU system, which has been applied to construct an organized library of AIX utilities. An experiment was conducted in order to evaluate the retrieval effectiveness of GURU as compared to INFOEXPLORER, a hypertext library system for AIX 3 on the IBM RISC System/6000 series. We followed the usual evaluation procedure used in information retrieval, based upon recall and precision measures, and determined that our system performs 15% better on a random test set, while being much less expensive to build than INFOEXPLORER.
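The recall and precision measures underlying the evaluation above can be sketched as follows; the component names in the example are hypothetical and not drawn from the AIX library:

```python
def precision_recall(retrieved, relevant):
    """Standard IR measures: precision = |ret ∩ rel| / |ret|,
    recall = |ret ∩ rel| / |rel|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# hypothetical query result over a toy component library
retrieved = ["sortlib", "qsortc", "heapify", "strcat"]
relevant = ["sortlib", "qsortc", "mergesort"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Averaging these measures over a set of test queries, as is usual in IR evaluation, yields the kind of system-level comparison the paper reports against INFOEXPLORER.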
- …