Search CORE

9 research outputs found

A soft hierarchical algorithm for the clustering of multiple bioactive chemical compounds

Author: Salim Naomie
Shah J. Z.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2007

Most of the methods used for clustering chemical structures, such as Ward's, Group Average, k-means and Jarvis-Patrick, are known as hard or crisp methods because they partition a dataset into strictly disjoint subsets; they are therefore not suitable for clustering chemical structures that exhibit more than one activity. Fuzzy clustering algorithms such as fuzzy c-means provide an inherent mechanism for clustering overlapping structures (objects), but this potential, which comes from their fuzzy membership functions, has not been utilized effectively. In this work a fuzzy hierarchical algorithm is developed that benefits both from the fuzzy clustering process and from the multiple membership values that fuzzy clustering produces. The algorithm divides every cluster whose size exceeds a predetermined threshold into two sub-clusters based on the membership values of each structure. A structure is assigned to one cluster if its membership value for that cluster is very high, or to both clusters if its two membership values are very similar. The performance of the algorithm is evaluated on two benchmark datasets and on a large dataset of compound structures derived from the MDL MDDR database. The results show a significant improvement over a similar implementation of the hard c-means algorithm.
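
The assignment rule in the abstract above (one sub-cluster when a membership value is very high, both when the two values are very similar) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `soft_split`, the tolerance `similarity_tol`, and the assumption that the two membership values sum to 1 are all hypothetical choices made for illustration.

```python
def soft_split(memberships, similarity_tol=0.1):
    """Split one cluster into two possibly overlapping sub-clusters.

    `memberships` maps a structure id to its membership value in the first
    sub-cluster; membership in the second is assumed to be 1 minus that value.
    `similarity_tol` is a hypothetical cut-off for "very similar" memberships.
    """
    first, second = [], []
    for structure, u_first in memberships.items():
        u_second = 1.0 - u_first
        if abs(u_first - u_second) <= similarity_tol:
            # Memberships are very similar: keep the structure in both.
            first.append(structure)
            second.append(structure)
        elif u_first > u_second:
            first.append(structure)
        else:
            second.append(structure)
    return first, second


# Illustrative membership values, not data from the paper.
memberships = {"mol_a": 0.95, "mol_b": 0.52, "mol_c": 0.10}
first, second = soft_split(memberships)
```

Here `mol_b` has nearly equal memberships and therefore lands in both sub-clusters, which is the overlap a hard bisecting step cannot express.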

Universiti Teknologi Malaysia Institutional Repository

Clustering files of chemical structures using the fuzzy k-means clustering method

Author: Chen M-Y.
Holliday J.D.
Lawson K.
Mahfouf M.
Mullier G.
Rodgers S.L.
Willett P.
Publication venue: 'American Chemical Society (ACS)'
Publication date: 27/07/2004

This paper evaluates the use of the fuzzy k-means clustering method for the clustering of files of 2D chemical structures. Simulated property prediction experiments with the Starlist file of logP values demonstrate that use of the fuzzy k-means method can, in some cases, yield results that are superior to those obtained with the conventional k-means method and with Ward's clustering method. Clustering of several small sets of agrochemical compounds demonstrates the ability of the fuzzy k-means method to highlight multicluster membership and to identify outlier compounds, although the former can be difficult to interpret in some cases.
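
To make "multicluster membership" concrete, the following Python sketch computes membership values with the standard fuzzy c-means update, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)). The abstract does not state the fuzzifier or the distance measure used in the paper, so `m=2.0`, Euclidean distance, and the toy coordinates are assumptions.

```python
import math


def fuzzy_memberships(point, centres, m=2.0):
    """Membership of `point` in each cluster centre (standard fuzzy c-means update).

    `m` is the fuzzifier (m > 1); 2.0 is a common default, assumed here.
    """
    dists = [math.dist(point, c) for c in centres]
    if 0.0 in dists:
        # The point coincides with one or more centres: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# Two illustrative cluster centres in a 2D descriptor space.
centres = [(0.0, 0.0), (4.0, 0.0)]
near_first = fuzzy_memberships((1.0, 0.0), centres)  # dominated by the first centre
between = fuzzy_memberships((2.0, 0.0), centres)     # equidistant from both centres
```

A point close to one centre receives most of its membership there, while the equidistant point is split evenly between the two clusters.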

Crossref

White Rose Research Online

Clustering Files of Chemical Structures Using the Fuzzy k-Means Clustering Method.

Author: Holliday
Publication venue: 'Wiley'

Crossref

Clustering Files of Chemical Structures Using the Fuzzy k-Means Clustering Method

Author: John D. Holliday
Mahdi Mahfouf
Min-you Chen
Peter Willett
Sarah L. Rodgers

This paper evaluates the use of the fuzzy k-means clustering method for the clustering of files of 2D chemical structures. Simulated property prediction experiments with the Starlist file of logP values demonstrate that use of the fuzzy k-means method can, in some cases, yield results that are superior to those obtained with the conventional k-means method and with Ward’s clustering method. Clustering of several small sets of agrochemical compounds demonstrates the ability of the fuzzy k-means method to highlight multicluster membership and to identify outlier compounds, although the former can be difficult to interpret in some cases.

CiteSeerX

Similarity Methods in Chemoinformatics

Author: A-Razzak
Adamson
Agrafiotis
Ajay Walters
Allen
Attias
Baber
Bajorath
Ballester
Barker
Barnard
Barton
Bawden
Bayley
Beitzel
Belkin
Ben-Dor
Bender
Berks
Berman
Blair
Boecker
Bohl
Bostrom
Boyd
Breiman
Bremser
Briem
Brint
Brown
Bunin
Burbridge
Butina
Byvatov
Böhm
Cannon
Capelli
Carbó
Carhart
Charifsen
Cheeseright
Chen
Cheng
Christianini
Clark
Cleves
Cole
Coles
Congreve
Corey
Cornell
Cosgrove
Cramer
Crandell
Croft
Cruciani
Cuissart
Dalby
Danziger
Davis
DesJarlais
Diestel
DiMasi
Dittmar
Dixon
Doman
Doweyko
Downie
Downs
Eckert
Edgar
Egan
El-Hamdouchi
Engels
Erickson
Estrada
Everitt
Ewing
Feher
Feldman
Fetchner
Fisanick
Fligner
Flower
Free
Freeland
Friesner
Frimurer
Gasteiger
Gedeck
Gillet
Ginn
Glen
Godden
Goldman
Good
Gorse
Graf
Grant
Gray
Greco
Green
Griffiths
Gund
Hagadone
Haigh
Hall
Hann
Hansch
Harper
Hassan
Hawkins
He
Hert
Hertzberg
Hessler
Hiller
Hinchcliffe
Holliday
Hsu
Huang
Hudson
Hurst
Hyland
Jakes
Jarvis
Jones
Jorissen
Kauvar
Kearsley
Keiser
Kelley
Kier
Klein
Kogej
Kubinyi
Kuntz
Kurogi
Lajiness
Langridge
Leach
Lee
Leeson
Leiter
Lemmen
Lengauer
Lesk
Lewis
Lind
Lindsay
Lipinski
Lipscomb
Loftus
Lombardino
Longley
Low
Lynch
Lyne
Maggiora
Mahe
Maizel
Makara
Maldonado
Marshall
Martin
Mason
Matter
Medina-Franco
Mestres
Monge
Moock
Moon
Morgan
Muller
Munk
Murrall
Murtagh
Ng
Nikolova
Nishibata
Nübling
Oda
Onodera
Oprea
Ott
Paolini
Paris
Patterson
Pearlman
Perekhodtsev
Pickett
Prathipati
Pretsch
Proudfoot
Raha
Rarey
Rasmussen
Ray
Raymond
Robertson
Rogers
Rush
Rusinko
Rössler
Sadowski
Saeh
Salim
Salton
Sasaki
Schneider
Schofield
Schreyer
Schuffenhauer
Shanmugasundaram
Shelley
Shemetulskis
Shenton
Sheridan
Shively
Sirois
Smeaton
Snarey
Sneath
Spärck Jones
Stahl
Stahura
Steinbach
Steindl
Stiefl
Sultan
Sussenguth
Svetnik
Takahashi
Tate
Taylor
Teague
Terrett
Thorner
Todeschini
Tong
Triballeau
Truchon
Tversky
Ullmann
van de Waterbeemd
van Rijsbergen
Veber
Verdonk
Verheij
Vieth
Vleduts
Wagener
Waldman
Walters
Wang
Ward
Warmuth
Warr
Warren
Weininger
Weisgerber
Whittle
Wild
Willett
Williams
Wilson
Wilton
Wipke
Worboys
Xia
Xue
Yang
Yin
Yu
Zernov
Zhang
Zupan
Publication venue: 'Wiley'
Publication date: 01/01/2009

CiteSeerX

Crossref

White Rose Research Online

Clustering for 2D chemical structures

Author: Chu Chia-Wei
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 01/01/2011

The clustering of chemical structures is important and widely used in several areas of chemoinformatics. A little-discussed aspect of clustering is standardization, which ensures that all descriptors in a chemical representation make a comparable contribution to the measurement of similarity. The first study compares the effectiveness of seven standardization procedures that have been suggested previously; the results were also compared with those for unstandardized datasets. No single standardization method consistently offered the best performance. Comparative studies of clustering effectiveness help to establish the suitability of different methods and to provide guidelines for their use. To examine the suitability of different clustering methods for chemoinformatics applications, especially methods that had not previously been applied in chemoinformatics, the second study compares the effectiveness of nine clustering methods. The results show that a single clustering method is unlikely to provide consistently the best partition under all circumstances. Consensus clustering is a technique that combines multiple input partitions of the same set of objects into a single clustering, which is expected to provide a more robust and more generally effective representation of the partitions that are submitted. The third study reports the use of seven consensus clustering methods that had not previously been used on sets of chemical compounds represented by 2D fingerprints. Their effectiveness was compared with that of some of the traditional clustering methods discussed in the second study. No consensus clustering method was found to be consistently best.
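
The idea of combining several input partitions into one clustering can be sketched with a co-association (evidence accumulation) approach. The abstract does not name the seven consensus methods studied, so this is only one common technique chosen for illustration; the function names, the 0.5 threshold, and the toy partitions are assumptions.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of input partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in partitions:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


def consensus_clusters(partitions, threshold=0.5):
    """Link objects co-clustered in more than `threshold` of the partitions
    and return the connected components as the consensus partition."""
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(partitions).items():
        if frac > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for obj in range(n):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())


# Three illustrative partitions of five objects (label values are arbitrary).
partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
consensus = consensus_clusters(partitions)
```

Object 2 is placed with objects 0 and 1 by only one of the three partitions, so the majority vote keeps it with objects 3 and 4.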

White Rose E-theses Online

The Application of Spectral Clustering in Drug Discovery

Author: Gan Sonny
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 01/09/2013

The application of clustering algorithms to chemical datasets is well established and has been reviewed extensively. Recently, a number of ‘modern’ clustering algorithms have been reported in other fields. One example is spectral clustering, which has yielded promising results in areas such as protein library analysis. The term spectral clustering is used to describe any clustering algorithm that utilises the eigenpairs of a matrix as the basis for partitioning a dataset. This thesis describes the development and optimisation of a non-overlapping spectral clustering method that is based upon a study by Brewer. The initial version of the spectral clustering algorithm was closely related to Brewer’s method and used a full matrix diagonalisation procedure to identify the eigenpairs of an input matrix. This spectral clustering method was compared to the k-means and Ward’s algorithms and produced encouraging results; for example, when coupled with extended connectivity fingerprints, it outperformed the other clustering algorithms according to the QCI measure. Although the spectral clustering algorithm showed promising results, its operational costs restricted its application to small datasets. Hence, the method was optimised in successive studies. Firstly, the effect of matrix sparsity on the spectral clustering was examined; the results showed that spectral clustering with sparse input matrices can lead to an improvement in the results. Despite this improvement, the costs of spectral clustering remained prohibitive, so the full matrix diagonalisation procedure was replaced with the Lanczos algorithm, which has lower associated costs, as suggested by Brewer. This method led to a significant decrease in the computational costs when identifying a small number of clusters; however, a number of issues remained, leading to the adoption of an SVD-based eigendecomposition method.
The SVD-based algorithm was shown to be highly efficient, accurate and scalable through a number of studies.
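
As a rough illustration of partitioning a dataset from the eigenpairs of a matrix, the Python sketch below bipartitions a small similarity matrix using the eigenvector for the second-smallest eigenvalue of the graph Laplacian. It is not Brewer's method or the thesis algorithm: the thesis used full diagonalisation, Lanczos and SVD-based solvers, whereas plain power iteration is swapped in here to keep the example self-contained, and the function names and the toy similarity values are assumptions.

```python
import math


def fiedler_vector(similarity, iterations=500):
    """Approximate the eigenvector for the second-smallest eigenvalue of the
    graph Laplacian L = D - W by power iteration on (shift*I - L)."""
    n = len(similarity)
    # Degree of each node, ignoring self-similarity on the diagonal.
    degrees = [sum(row) - row[i] for i, row in enumerate(similarity)]
    shift = 2.0 * max(degrees)  # upper bound on the eigenvalues of L
    vec = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Remove the component along the constant eigenvector of L.
        mean = sum(vec) / n
        vec = [v - mean for v in vec]
        new = []
        for i in range(n):
            # (L v)_i = d_i * v_i - sum over j != i of w_ij * v_j
            lv = degrees[i] * vec[i] - sum(
                similarity[i][j] * vec[j] for j in range(n) if j != i
            )
            new.append(shift * vec[i] - lv)
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


def spectral_bipartition(similarity):
    """Non-overlapping two-way split by the sign of the Fiedler vector."""
    return [0 if v < 0 else 1 for v in fiedler_vector(similarity)]


# Illustrative similarity matrix: two groups of three mutually similar objects.
similarity = [
    [1.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.1) for j in range(6)]
    for i in range(6)
]
labels = spectral_bipartition(similarity)
```

Power iteration is adequate only for tiny matrices like this one; the cost concerns described in the abstract are exactly why dedicated eigensolvers are used on real datasets.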

White Rose E-theses Online

Identification of structure activity relationships in primary screening data of high-throughput screening assays

Author: Böcker-Felbek Alexander Dietmar
Publication date: 24/04/2007

The aim of the thesis was to identify structure-activity relationships (SAR) in the primary screening data of high-throughput screening (HTS) assays. The strategy was to perform a hierarchical clustering of the molecules, assign the primary screening data to the resulting clusters and derive models from the clusters. The models should serve to identify singletons, clusters enriched with actives, unconfirmed hits and false negatives. Two hierarchical clustering algorithms, NIPALSTREE and hierarchical k-means, were developed and adapted for this purpose, respectively. A graphical user interface (GUI) was implemented to extract SAR from the clustering results. Retrospective and prospective applications of the clustering approach were performed. SAR models were created by combining the clustering results with different chemoinformatic methods. NIPALSTREE projects a data set onto one dimension using principal component analysis. The data set is sorted according to the score vector and split at the median position into two subsets. The algorithm is then applied recursively to the subsets. The hierarchical k-means recursively separates a data set into two clusters using the k-means algorithm. Both algorithms are capable of clustering large data sets with more than a million data points. They were validated and compared with each other on the basis of different structural classes. Through its loading vectors, NIPALSTREE provided first insights into SAR, whereas the hierarchical k-means yielded superior results. A GUI was developed that allows the clustering results to be displayed and navigated. Functionalities were integrated to analyse the clusters in the dendrogram, the molecules in a cluster and the physicochemical properties of a molecule. Measures were developed to identify clusters enriched with actives, to characterize singletons and to analyse selectivity and specificity.
Different protease inhibitors from the COBRA database were examined using the hierarchical k-means algorithm. Supported by similarity searches and nearest-neighbour analyses, thrombin inhibitor singletons were quickly isolated and displayed in the dendrogram. By scaling enrichment factors to the logarithm of the dendrogram level, clusters enriched with different structural classes of factor Xa inhibitors were identified simultaneously. The observed co-clustering of other protease inhibitors provided a deeper insight into selectivity and specificity and shows the utility of the approach for constructing focused screening libraries. Specificity was analysed by extracting and clustering the relative frequencies of the protease inhibitors from the clusters of dendrogram level 7. A unique ligand-based view of the pocketome of the protease enzymes was obtained. To identify unconfirmed hits and false negatives in the primary screening data of HTS assays, three assays were analysed retrospectively with the hierarchical k-means algorithm. A rule catalogue was developed that judges hits in terminal clusters based on the cluster size, the percent control values of the entries in a cluster, the overall hit rate, the hit rate in the cluster and the environment of the cluster in the dendrogram. It identified a high proportion of unconfirmed hits and provided, for each hit, a rating in the context of related non-hits. This allows compounds to be prioritized for follow-up studies. Non-hits and hits were retrieved from terminal clusters containing hits. Molecules bearing false-negative scaffolds were co-extracted and enriched. To minimize the number of false positives in the extracted lists, Bayesian regularized artificial neural network classification models were trained with the data. Applying the models yielded a marked improvement in the enrichment factors for the false negatives, which demonstrates the scaffold-hopping potential of the approach.
NIPALSTREE, the hierarchical k-means algorithm and self-organising maps were applied prospectively to identify novel lead candidates for dopamine D3 receptors. Compounds with novel scaffolds and low nanomolar binding affinity (65 nM, compound 42) were identified. To provide a deeper insight into the SAR of these molecules, alternative computational methods were employed. Support vector regression and partial least squares were examined. Predictive models for dopamine D2 and D3 receptor binding affinity values were obtained. Important features explaining SAR were extracted from the models. The prospective application of the models to the diverse and novel virtual screening data was of limited success. Docking studies were performed using a homology model of the dopamine D3 receptor. Visual inspection of the binding modes led to the hypothesis of two alternative binding pockets for the aryl moiety of dopamine D3 receptor antagonists. A pharmacophore model was created that requires both aryl moieties simultaneously. Virtual screening with the model identified a nanomolar hit (65 nM, compound 59), corroborating the hypothesis of the two binding pockets and providing a new lead structure for dopamine D3 receptors. The presented data show that hierarchically clustering a data set and subsequently using the clusters for model generation is suited to extracting SAR from screening data. The models are successful in identifying singletons, clusters enriched with actives, unconfirmed hits and false-negative scaffolds.
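
The NIPALSTREE procedure outlined in this abstract (project onto one dimension with principal component analysis, sort by score, split at the median position, recurse) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the function names, the fixed iteration count, the `min_size` stopping rule and the toy data are all hypothetical.

```python
import math


def first_pc_scores(rows, iterations=100):
    """Scores of the first principal component, computed with NIPALS iterations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    X = [[r[k] - means[k] for k in range(d)] for r in rows]  # mean-centred data
    # Start from the column with the largest variance.
    start = max(range(d), key=lambda k: sum(x[k] ** 2 for x in X))
    t = [x[start] for x in X]
    for _ in range(iterations):
        tt = sum(v * v for v in t)
        if tt == 0.0:
            break  # all rows identical: every score is zero
        # Loading vector p = X^T t / (t^T t), normalised to unit length.
        p = [sum(X[i][k] * t[i] for i in range(n)) / tt for k in range(d)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Score vector t = X p.
        t = [sum(X[i][k] * p[k] for k in range(d)) for i in range(n)]
    return t


def median_split_tree(rows, ids=None, min_size=2):
    """Recursively sort by first-PC score and split at the median position.

    `min_size` is an assumed stopping rule: subsets this small become leaves.
    """
    if ids is None:
        ids = list(range(len(rows)))
    if len(ids) <= min_size:
        return ids
    scores = first_pc_scores([rows[i] for i in ids])
    order = [i for _, i in sorted(zip(scores, ids))]
    half = len(order) // 2
    return [median_split_tree(rows, order[:half], min_size),
            median_split_tree(rows, order[half:], min_size)]


# Illustrative descriptor rows: two well-separated groups of three objects.
rows = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (10.0, 0.2), (11.0, 0.0), (12.0, 0.1)]
tree = median_split_tree(rows)
```

Because every split is at the median position, the resulting tree is balanced regardless of how the data are distributed, which is one plausible reason such a scheme scales to very large data sets.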

Hochschulschriftenserver - Universität Frankfurt am Main