7 research outputs found

    Verifying the fully “Laplacianised” posterior Naïve Bayesian approach and more

    Mussa and Glen would like to thank Unilever for financial support, and Mussa and Mitchell thank the BBSRC for funding this research through grant BB/I00596X/1. Mitchell thanks the Scottish Universities Life Sciences Alliance (SULSA) for financial support.
    Background: In a recent paper, Mussa, Mitchell and Glen (MMG) have mathematically demonstrated that the “Laplacian Corrected Modified Naïve Bayes” (LCMNB) algorithm can be viewed as a variant of the so-called Standard Naïve Bayes (SNB) scheme, whereby the role played by absence of compound features in classifying/assigning the compound to its appropriate class is ignored. MMG have also proffered guidelines regarding the conditions under which this omission may hold. Utilising three data sets, the present paper examines the validity of these guidelines in practice. The paper also extends MMG’s work and introduces a new version of the SNB classifier: “Tapered Naïve Bayes” (TNB). TNB does not discard the role of absence of a feature out of hand, nor does it fully consider its role. Hence, TNB encapsulates both SNB and LCMNB.
    Results: LCMNB, SNB and TNB performed differently in classifying 4,658, 5,031 and 1,149 ligands (all chosen from the ChEMBL Database) distributed over 31 enzymes; 23 membrane receptors; and one ion channel, four transporters and one transcription factor as their target proteins. When the number of features utilised was equal to or smaller than the “optimal” number of features for a given data set, SNB classifiers systematically gave better classification results than those yielded by LCMNB classifiers. The opposite was true when the number of features employed was markedly larger than the “optimal” number of features for this data set. Nonetheless, these LCMNB performances were worse than the classification performance achieved by SNB when the “optimal” number of features for the data set was utilised. TNB classifiers systematically outperformed both SNB and LCMNB classifiers.
    Conclusions: The classification results obtained in this study concur with the mathematically based guidelines given in MMG’s paper: ignoring the role of absence of a feature out of hand does not necessarily improve classification performance of the SNB approach; if anything, it could make the performance of the SNB method worse. The results obtained also lend support to the rationale on which the TNB algorithm rests: handled judiciously, taking into account absence of features can enhance (not impair) the discriminatory classification power of the SNB approach.
    Publisher PDF. Peer reviewed.
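
To make the distinction between scoring and ignoring absent features concrete, the following is a minimal sketch of a Bernoulli Naïve Bayes scorer with add-one (Laplace) smoothing. The `absence_weight` parameter is a hypothetical knob introduced here for illustration only: a value of 1.0 scores absent features in full, and 0.0 drops their contribution altogether. It is not the LCMNB estimator or the TNB tapering formula from the paper, neither of which is given in the abstract, and the toy data are invented.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(feature present | class)."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Add-alpha smoothing keeps every probability strictly inside (0, 1).
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def log_score(model, x, c, absence_weight=1.0):
    """Log prior plus feature terms; absent features are scaled by absence_weight."""
    prior, probs = model[c]
    score = math.log(prior)
    for xj, pj in zip(x, probs):
        if xj:
            score += math.log(pj)                         # feature present
        else:
            score += absence_weight * math.log(1.0 - pj)  # feature absent
    return score

def predict(model, x, absence_weight=1.0):
    """Return the class with the highest score."""
    return max(model, key=lambda c: log_score(model, x, c, absence_weight))

# Invented toy data: four "compounds", three binary features, two classes.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ["A", "A", "B", "B"]
model = fit_bernoulli_nb(X, y)
for w in (0.0, 0.5, 1.0):
    print(w, predict(model, [1, 0, 0], absence_weight=w))
```

Sweeping `absence_weight` between 0 and 1 on held-out data is one simple way to explore how much the absence terms help or hurt for a given feature set.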

    Rotation-vibration states of triatomic molecules using massively parallel computers

    A formulation of the nuclear motion (rotation-vibration) of triatomic molecules is discussed in the different implementations of the Discrete Variable Representation (DVR). The formulation is expressed in a set of internal co-ordinates using an exact nuclear motion Hamiltonian operator. We present a computer implementation of the Hamiltonian on some of the most powerful massively parallel computers in the world today. The Cray-T3E/T3D and the IBM SP2 are used for this study. Accurate calculations of the rotation-vibrational energy levels up to dissociation for the non-linear triatomic molecules H2O and O3, with deep potential surfaces, are presented. We also present results for two linear molecules, HN2+ and HCP. The water molecule is used as a detailed case study. Rotation-vibration studies are made using a number of realistic global potential energy surfaces. Radau co-ordinates are used for the calculations in the preconditioned DVR representation. After comprehensive variational convergence tests on the energy levels, all the J = 0 bound states of the system are converged to within 1 cm-1 or better, giving about 1,000 states for each potential. Graphical analyses of the eigenfunctions are then made. Similar studies are performed for J > 0. These are the first accurate rotation-vibrational calculations up to dissociation obtained for this system. For the J > 0 case, convergence problems are found in previous, more limited, studies of the system.

    A note on utilising binary features as ligand descriptors

    Mussa and Mitchell thank the BBSRC for funding this research through grant BB/I00596X/1. Mitchell thanks the Scottish Universities Life Sciences Alliance (SULSA) for financial support.
    It is common in cheminformatics to represent the properties of a ligand as a string of 1’s and 0’s, with the intention of elucidating, inter alia, the relationship between the chemical structure of a ligand and its bioactivity. In this commentary we note that, where relevant but non-redundant features are binary, they inevitably lead to a classifier capable of capturing only a linear relationship between structural features and activity. If, instead, we were to use relevant but non-redundant real-valued features, the resulting predictive model would be capable of describing a non-linear structure-activity relationship. Hence, we suggest that real-valued features, where available, are to be preferred in this scenario.
    Publisher PDF. Peer reviewed.
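
One standard setting in which this linearity is easy to see is a two-class Naïve Bayes classifier with conditionally independent binary features. Whether this matches the commentary's own derivation is an assumption, and the symbols below (p_i, q_i, w_i, b) are introduced here for illustration. Writing p_i = P(x_i = 1 | C_1) and q_i = P(x_i = 1 | C_0), the log posterior odds are:

```latex
\begin{align}
\log\frac{P(C_1 \mid \mathbf{x})}{P(C_0 \mid \mathbf{x})}
  &= \log\frac{P(C_1)}{P(C_0)}
   + \sum_{i=1}^{L}\left[ x_i \log\frac{p_i}{q_i}
   + (1 - x_i)\log\frac{1 - p_i}{1 - q_i} \right] \\
  &= \underbrace{\log\frac{P(C_1)}{P(C_0)}
   + \sum_{i=1}^{L}\log\frac{1 - p_i}{1 - q_i}}_{b}
   + \sum_{i=1}^{L} x_i \,
     \underbrace{\log\frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}}_{w_i}
\end{align}
```

Because each x_i is either 0 or 1, the decision function reduces to b + Σ w_i x_i, a hyperplane in feature space.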

    Enzyme mechanism prediction : a template matching problem on InterPro signature subspaces

    The authors thank the BBSRC for funding this research through grant BB/I00596X/1 and are also grateful to the Scottish Universities Life Sciences Alliance (SULSA) for financial support.
    We recently reported that one may be able to predict with high accuracy the chemical mechanism of an enzyme by employing a simple pattern recognition approach: a k Nearest Neighbour rule with k=1 (k1NN) and 321 InterPro sequence signatures as enzyme features. The nearest-neighbour rule is known to be highly sensitive to errors in the training data, in particular when the available training dataset is small. This was the case in our previous study, in which our dataset comprised 248 enzymes annotated against 71 enzymatic mechanism labels from MACiE. In the current study, we have carefully re-analysed our dataset and prediction results to “explain” why a high-variance k1NN rule exhibited such remarkable classification performance. We find that enzymes with different chemical mechanism labels in this dataset reside in barely overlapping subspaces in the feature space defined by the 321 features selected. These features contain the appropriate information needed to accurately classify the enzymatic mechanisms, rendering our classification problem a basic look-up exercise. This observation dovetails with the low misclassification rate we reported. Our results provide explanations for the “anomaly”: a basic nearest-neighbour algorithm exhibiting remarkable prediction performance for enzymatic mechanism despite the fact that the feature space was large and sparse. Our results also dovetail well with another finding we reported, namely that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also suggest simple rules that might enable one to inductively predict whether a novel enzyme possesses any of our 71 predefined mechanisms.
    Publisher PDF. Peer reviewed.
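
As a rough illustration of why barely overlapping signature subspaces turn 1-nearest-neighbour classification into a look-up, the sketch below represents each enzyme as a set of signature identifiers and assigns a query the label of its closest training enzyme. The Jaccard distance, the identifiers (`sig_a` and so on) and the labels are all assumptions made for this example; the abstract does not state which distance measure was used.

```python
def jaccard_distance(a, b):
    """1 minus the ratio of shared to total signatures; 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def predict_1nn(train, query):
    """Return the label of the training item closest to the query.

    train: list of (signature_set, label) pairs.
    """
    nearest = min(train, key=lambda item: jaccard_distance(item[0], query))
    return nearest[1]

# Hypothetical training enzymes whose signature sets do not overlap.
train = [
    ({"sig_a", "sig_b"}, "mechanism_1"),
    ({"sig_c", "sig_d"}, "mechanism_2"),
]

# A query sharing signatures with only one group is matched to that group.
query = {"sig_a", "sig_b", "sig_e"}
print(predict_1nn(train, query))
```

When the classes occupy disjoint sets of signatures, any query that shares signatures with a single class is at distance 1.0 from every other class, so the nearest neighbour is determined by membership alone.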

    The Parzen Window method : in terms of two vectors and one matrix

    We thank the BBSRC for funding this research through grant BB/I00596X/1. JBOM thanks the Scottish Universities Life Sciences Alliance (SULSA) for financial support.
    Pattern classification methods assign an object to one of several predefined classes/categories based on features extracted from observed attributes of the object (pattern). When L discriminatory features for the pattern can be accurately determined, the pattern classification problem presents no difficulty. However, precise identification of the relevant features for a classification algorithm (classifier) to be able to categorize real world patterns without errors is generally infeasible. In this case, the pattern classification problem is often cast as devising a classifier that minimises the misclassification rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the estimated posterior class probability is maximal. More often than not, the form of the posterior class probabilities is unknown. The so-called Parzen Window approach is widely employed to estimate class-conditional (class-specific) probability densities for a given pattern. These probability densities can then be utilised to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can become computationally impractical when the size of the training dataset is in the tens of thousands and L is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem remains unresolved. In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck.
    The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.
    Publisher PDF. Peer reviewed.
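
For orientation, here is a minimal sketch of the conventional Parzen Window classifier that the abstract takes as its starting point: class-conditional densities are estimated with a Gaussian kernel, combined with class priors, and normalised into posteriors. This is the textbook scheme, not the paper's reformulation in terms of two vectors and one matrix, which the abstract does not spell out. The isotropic Gaussian kernel, the bandwidth `h` and the toy data are assumptions made for illustration.

```python
import math

def parzen_density(x, samples, h):
    """Parzen estimate of p(x | class) using an isotropic Gaussian kernel of width h."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (-d / 2.0)
    total = 0.0
    for s in samples:
        # Every training sample and every feature is visited for each query.
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return norm * total / len(samples)

def posteriors(x, train_by_class, h):
    """Posterior class probabilities via Bayes' rule with empirical priors."""
    n_total = sum(len(s) for s in train_by_class.values())
    joint = {c: (len(s) / n_total) * parzen_density(x, s, h)
             for c, s in train_by_class.items()}
    # Assumes at least one class has non-zero density at x.
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}

def classify(x, train_by_class, h):
    """Assign x to the class with the largest estimated posterior."""
    post = posteriors(x, train_by_class, h)
    return max(post, key=post.get)

# Invented two-dimensional toy data with two well-separated classes.
train_by_class = {
    "A": [(0.0, 0.0), (0.2, 0.1)],
    "B": [(3.0, 3.0), (2.8, 3.1)],
}
print(classify((0.1, 0.0), train_by_class, h=0.5))
```

The inner loop makes the cost per query proportional to the number of training patterns times L, which is the bottleneck the abstract refers to when both are large.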