77 research outputs found

    A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies

    Get PDF
    Background: Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in the estimated performance measures based on single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply so on an average or on a population level and simulation studies may be a better alternative for objectively comparing the performances of machine learning algorithms. Methods: We compare the classification performance of a number of important and widely used machine learning algorithms, namely the Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features. Results: For smaller number of correlated features, number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation errors as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and outplays that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which cases it also provide more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study

    An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach

    Get PDF
    This paper introduces Hk-medoids, a modified version of the standard k-medoids algorithm. The modification extends the algorithm for the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach to calculate fused similarities between objects, SMF, taking into account the similarities between the component elements of the objects using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects or for objects which have diverging distances according to the different component. Our implementation of Hk-medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of our proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of the our algorithm, and also they show a performance enhancement when comparing to the application of the original SMF approach in combination with a standard k-medoids that does not take uncertainty into account. In addition, from a theoretical point of view, our proposed algorithm has lower computation complexity than the popular PAM implementation

    Is EC class predictable from reaction mechanism?

    Get PDF
    We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support.Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used. Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.Publisher PDFPeer reviewe

    Guidelines for the use and interpretation of assays for monitoring autophagy (4th edition)1.

    Get PDF
    In 2008, we published the first set of guidelines for standardizing research in autophagy. Since then, this topic has received increasing attention, and many scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Thus, it is important to formulate on a regular basis updated guidelines for monitoring autophagy in different organisms. Despite numerous reviews, there continues to be confusion regarding acceptable methods to evaluate autophagy, especially in multicellular eukaryotes. Here, we present a set of guidelines for investigators to select and interpret methods to examine autophagy and related processes, and for reviewers to provide realistic and reasonable critiques of reports that are focused on these processes. These guidelines are not meant to be a dogmatic set of rules, because the appropriateness of any assay largely depends on the question being asked and the system being used. Moreover, no individual assay is perfect for every situation, calling for the use of multiple techniques to properly monitor autophagy in each experimental setting. Finally, several core components of the autophagy machinery have been implicated in distinct autophagic processes (canonical and noncanonical autophagy), implying that genetic approaches to block autophagy should rely on targeting two or more autophagy-related genes that ideally participate in distinct steps of the pathway. Along similar lines, because multiple proteins involved in autophagy also regulate other cellular pathways including apoptosis, not all of them can be used as a specific marker for bona fide autophagic responses. Here, we critically discuss current methods of assessing autophagy and the information they can, or cannot, provide. Our ultimate goal is to encourage intellectual and technical innovation in the field

    Equivalence conditions for classes of linear and non-linear distributed parameter systems

    No full text
    Equivalence of certain classes of second-order non-linear distributed parameter systems and corresponding linear third-order systems is established through a differential transformation technique. As linear systems are amenable to analysis through existing techniques, this study is expected to offer a method of tackling certain classes of non-linear problems which may otherwise prove to be formidable in nature

    Dynamics of a class of social interaction systems

    No full text
    A mathematical model of social interaction in the form of two coupler! first-order non-linear differential equations, forms the topic of this study. This non-conservative model io representative of such varied social interaction problems as coexisting sub-populations of two different species, arms race between two rival countries and the like. Differential transformation techniques developed elsewhere in the literature are seen to be effective tools of dynamic analysis of this non-linear non-conservative mode! of social interaction process

    A differential transformation technique for the analysis of a class of non-linear time varying systems

    No full text
    The scope of the differential transformation technique, developed earlier for the study of non-linear, time invariant systems, has been extended to the domain of time-varying systems by modifications to the differential transformation laws proposed therein. Equivalence of a class of second-order, non-linear, non-autonomous systems with a linear autonomous model of second order is established through these transformation laws. The feasibility of application of this technique in obtaining the response of such non-linear time-varying systems is discussed

    Feature selection and the concept of immediate neighborhood in the context of clustering techniques

    No full text
    The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced which result in meeting implicitly the need for feature slection in the context of clustering techniques

    An innovative clustering technique for unsupervized learning in the context of remotely sensed earth resources data analysis

    No full text
    A new clustering technique, based on the concept of immediato neighbourhood, with a novel capability to self-learn the number of clusters expected in the unsupervized environment, has been developed. The method compares favourably with other clustering schemes based on distance measures, both in terms of conceptual innovations and computational economy. Test implementation of the scheme using C-l flight line training sample data in a simulated unsupervized mode has brought out the efficacy of the technique. The technique can easily be implemented as a front end to established pattern classification systems with supervized learning capabilities to derive unified learning systems capable of operating in both supervized and unsupervized environments. This makes the technique an attractive proposition in the context of remotely sensed earth resources data analysis wherein it is essential to have such a unified learning system capability
    corecore