Search CORE

818 research outputs found

Development and evaluation of clustering techniques for finding people

Author: Dunlop M.D.
Publication venue: CEUR-WS.org
Publication date: 01/01/2000

Typically, in a large organisation, much expertise and knowledge is held informally within employees' own memories. When employees leave an organisation, many documented links that go through that person are broken, and no mechanism is usually available to overcome these broken links. This matchmaking problem is related to the problem of finding potential work partners in a large and distributed organisation. This paper reports a comparative investigation into using standard information retrieval techniques to group employees together based on their web pages. This information can, hopefully, be subsequently used to redirect broken links to people who worked closely with a departed employee, or to highlight people, say in different departments, who work on similar topics. The paper reports the design and positive results of an experiment conducted at Risø National Laboratory comparing four different IR searching and clustering approaches using real users' web pages.
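The idea of matching people through their web pages can be illustrated with a minimal sketch. It assumes plain term-count vectors and cosine similarity, which is one standard IR approach and not necessarily one of the four compared in the paper; the function names and sample pages are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_colleague(pages, departed):
    """Return the person whose page text is most similar to the departed person's.

    `pages` maps a person's name to the text of their web page (hypothetical input).
    """
    vectors = {name: Counter(text.lower().split()) for name, text in pages.items()}
    others = [name for name in vectors if name != departed]
    return max(others, key=lambda name: cosine(vectors[departed], vectors[name]))


# Hypothetical sample data, for illustration only.
pages = {
    "ann": "wind turbine blade design",
    "bob": "wind turbine control systems",
    "cy": "polymer fuel cells",
}
print(closest_colleague(pages, "ann"))
```

A broken link that pointed to "ann" could then be redirected to the returned colleague.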

CiteSeerX

Combining multiple classifications of chemical structures using consensus clustering

Author: Chia-Wei Chu
John D. Holliday
Peter Willett
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

Consensus clustering involves combining multiple clusterings of the same set of objects to achieve a single clustering that will, hopefully, provide a better picture of the groupings that are present in a dataset. This Letter reports the use of consensus clustering methods on sets of chemical compounds represented by 2D fingerprints. Experiments with DUD, IDAlert, MDDR and MUV data suggest that consensus methods are unlikely to result in significant improvements in clustering effectiveness as compared to the use of a single clustering method.
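As a rough illustration of combining several clusterings of the same objects, the sketch below builds a co-association matrix (the fraction of clusterings in which two objects share a cluster) and links pairs that co-occur in more than half of them. This is one common consensus scheme, assumed here for illustration; it is not claimed to be the method evaluated in the Letter, and the function names and threshold are hypothetical.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n = len(clusterings[0])
    k = len(clusterings)
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / k for c in row] for row in counts]


def consensus_clusters(clusterings, threshold=0.5):
    """Group objects whose co-association exceeds `threshold` (connected components)."""
    n = len(clusterings[0])
    matrix = co_association(clusterings)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three hypothetical clusterings of five objects (one label list per clustering).
runs = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 2, 2, 2]]
print(consensus_clusters(runs))
```

Label values only need to be consistent within a single clustering, since the co-association matrix compares labels pairwise inside each run.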

Clustering files of chemical structures using the Szekely-Rizzo generalization of Ward's method

Author: Bureau R.
Mueller C.
Varin T.
Willett P.
Publication venue: 'Elsevier BV'
Publication date: 01/09/2009

Ward's method is extensively used for clustering chemical structures represented by 2D fingerprints. This paper compares Ward clusterings of 14 datasets (containing between 278 and 4332 molecules) with those obtained using the Székely–Rizzo clustering method, a generalization of Ward's method. The clusters resulting from these two methods were evaluated by the extent to which the various classifications were able to group active molecules together, using a novel criterion of clustering effectiveness. Analysis of a total of 1400 classifications (Ward and Székely–Rizzo clustering methods, 14 different datasets, 5 different fingerprints and 10 different distance coefficients) demonstrated the general superiority of the Székely–Rizzo method. The distance coefficient first described by Soergel performed extremely well in these experiments, and this was also the case when it was used in simulated virtual screening experiments.
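For readers unfamiliar with the Soergel coefficient mentioned above, here is a minimal sketch of its usual definition for non-negative vectors: the sum of absolute differences divided by the sum of element-wise maxima. For binary fingerprints this equals one minus the Tanimoto similarity. The function name and the example fingerprints are hypothetical, and the exact formulation used in the paper is not given in the abstract.

```python
def soergel_distance(x, y):
    """Soergel distance between two equal-length, non-negative vectors.

    Defined as sum(|x_i - y_i|) / sum(max(x_i, y_i)); two all-zero vectors
    are treated as identical (distance 0.0) by assumption.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


# Two hypothetical 4-bit fingerprints sharing two of their on-bits.
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 1, 1]
print(soergel_distance(fp_a, fp_b))
```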

Origins of Modern Data Analysis Linked to the Beginnings and Early Development of Computer Science and Information Engineering

Author: Murtagh Fionn
Publication date: 01/01/2008

The history of data analysis that is addressed here is underpinned by two themes: tabular data analysis, and the analysis of collected heterogeneous data. "Exploratory data analysis" is taken as the heuristic approach that begins with data and information and seeks underlying explanation for what is observed or measured. I also cover some of the evolving context of research and applications, including scholarly publishing, technology transfer and the economic relationship of the university to society. Comment: 26 pages

arXiv.org e-Print Archive

Royal Holloway Research Online

De Montfort University Open Research Archive

Using interdocument similarity information in document retrieval systems

Author: Alan Griffiths
H. Claire Luckhurst
Peter Willett
Publication venue: 'Wiley'
Publication date: 01/01/2004

Hierarchical classification for multiple, distributed web databases

Author: Yang Hui
Zhang Minjie
Publication date: 01/01/2004

The proliferation of online information resources increases the importance of effective and efficient distributed searching. Our research aims to provide an alternative hierarchical categorization and search capability based on a Bayesian network learning algorithm. Our proposed approach, which is grounded in automatic textual analysis of the subject content of online web databases, attempts to address the database selection problem by first classifying web databases into a hierarchy of topic categories. The experimental results reported demonstrate that such a classification approach not only effectively reduces the class search space, but also helps to significantly improve classification accuracy.

Chemoinformatics Research at the University of Sheffield: A History and Citation Analysis

Author: Bishop N.
Gillet V.J.
Holliday J.D.
Willett P.
Publication venue: 'SAGE Publications'
Publication date: 01/07/2003

This paper reviews the work of the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield, focusing particularly on the work carried out in the period 1985-2002. Four major research areas are discussed, involving the development of methods for: substructure searching in databases of three-dimensional structures, including both rigid and flexible molecules; the representation and searching of the Markush structures that occur in chemical patents; similarity searching in databases of both two-dimensional and three-dimensional structures; and compound selection and the design of combinatorial libraries. An analysis of citations to 321 publications from the Group shows that it attracted a total of 3725 residual citations during the period 1980-2002. These citations appeared in 411 different journals, and involved 910 different citing organizations from 54 different countries, thus demonstrating the widespread impact of the Group's work.

Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

Author: Despeyroux Thierry
Lechevallier Yves
Trousse Brigitte
Vercoustre Anne-Marie
Publication date: 01/01/2005

This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria. Comment: (postprint); This version corrects a couple of errors in authors' names in the bibliography

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

Author: Michel Paul
Ravichander Abhilasha
Rijhwani Shruti
Publication date: 01/01/2017

We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically-principled isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets. Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
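The tf-idf baseline referred to above can be sketched as follows. This assumes one simple variant (term frequency normalised by document length, inverse document frequency as log(N/df)); many weighting variants exist, and the abstract does not say which one the authors used. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Two hypothetical tokenised documents.
docs = [["good", "movie"], ["bad", "movie"]]
for vector in tf_idf(docs):
    print(vector)
```

Under this variant, a term that appears in every document (here "movie") receives weight zero, so only the discriminating terms contribute to a document's vector.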

arXiv.org e-Print Archive