1,006 research outputs found

    Multilabel Classification with R Package mlr

    We implemented several multilabel classification algorithms in the machine learning package mlr. The implemented methods are binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking, which can be used with any base learner that is accessible in mlr. Moreover, there is access to the multilabel classification versions of randomForestSRC and rFerns. All these methods can be easily compared by different implemented multilabel performance measures and resampling methods in the standardized mlr framework. In a benchmark experiment with several multilabel datasets, the performance of the different methods is evaluated. Comment: 18 pages, 2 figures, to be published in R Journal; reference correcte
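
The difference between two of the methods named above, binary relevance and classifier chains, can be sketched in a few lines. This is an illustrative Python sketch, not the mlr API (mlr is an R package). The `OneNN` base learner and all function names are hypothetical, chosen only to keep the example self-contained.

```python
from typing import List, Sequence


class OneNN:
    """Toy base learner (an assumption for this sketch): 1-nearest-neighbour."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "OneNN":
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        # Squared Euclidean distance to every training row; return the closest label.
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]


def fit_binary_relevance(X, Y, make_learner=OneNN) -> List[OneNN]:
    """Binary relevance: one independent binary model per label column."""
    n_labels = len(Y[0])
    return [make_learner().fit(X, [row[j] for row in Y]) for j in range(n_labels)]


def predict_binary_relevance(models, x) -> List[int]:
    return [m.predict_one(x) for m in models]


def fit_classifier_chain(X, Y, make_learner=OneNN) -> List[OneNN]:
    """Classifier chain: model j is trained on the features plus true labels 0..j-1."""
    n_labels = len(Y[0])
    models = []
    for j in range(n_labels):
        X_aug = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        models.append(make_learner().fit(X_aug, [row[j] for row in Y]))
    return models


def predict_classifier_chain(models, x) -> List[int]:
    """At prediction time, earlier predictions are fed forward along the chain."""
    preds: List[int] = []
    for m in models:
        preds.append(m.predict_one(list(x) + preds))
    return preds
```

Binary relevance ignores dependencies between labels, whereas the chain lets each later model condition on the labels that precede it. The label order in the chain is a design choice that this sketch fixes to the column order.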

    A study of hierarchical and flat classification of proteins

    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area

    Variable selection for the multicategory SVM via adaptive sup-norm regularization

    The Support Vector Machine (SVM) is a popular classification paradigm in machine learning and has achieved great success in real applications. However, the standard SVM cannot select variables automatically and therefore its solution typically utilizes all the input variables without discrimination. This makes it difficult to identify important predictor variables, which is often one of the primary goals in data analysis. In this paper, we propose two novel types of regularization in the context of the multicategory SVM (MSVM) for simultaneous classification and variable selection. The MSVM generally requires estimation of multiple discriminating functions and applies the argmax rule for prediction. For each individual variable, we propose to characterize its importance by the supnorm of its coefficient vector associated with different functions, and then minimize the MSVM hinge loss function subject to a penalty on the sum of supnorms. To further improve the supnorm penalty, we propose the adaptive regularization, which allows different weights imposed on different variables according to their relative importance. Both types of regularization automate variable selection in the process of building classifiers, and lead to sparse multi-classifiers with enhanced interpretability and improved accuracy, especially for high dimensional low sample size data. One big advantage of the supnorm penalty is its easy implementation via standard linear programming. Several simulated examples and one real gene data analysis demonstrate the outstanding performance of the adaptive supnorm penalty in various data settings. Comment: Published at http://dx.doi.org/10.1214/08-EJS122 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org
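
The penalty described above (the sum, over variables, of the supnorm of each variable's coefficient vector across the discriminating functions) can be written out as a small Python sketch. The matrix layout `W[k][j]` and the function name are assumptions made for illustration. The abstract does not say how the adaptive weights are chosen, so here the caller simply passes them in.

```python
from typing import Optional, Sequence


def supnorm_penalty(
    W: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over variables j of weights[j] * max_k |W[k][j]|.

    W[k][j] is assumed to hold the coefficient of variable j in
    discriminating function k. With weights=None every variable gets
    weight 1, which gives the plain (non-adaptive) supnorm penalty.
    """
    n_vars = len(W[0])
    if weights is None:
        weights = [1.0] * n_vars
    return sum(
        weights[j] * max(abs(row[j]) for row in W)
        for j in range(n_vars)
    )
```

A variable contributes zero to the penalty only when its coefficient is zero in every discriminating function, which is why penalizing this sum removes whole variables rather than individual coefficients.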

    AUTOMATIC IDENTIFICATION OF DYSPHONIAS USING MACHINE LEARNING ALGORITHMS

    Dysphonia is a prevalent symptom of some respiratory diseases that affects voice quality, even for prolonged periods. For its diagnosis, speech-language pathologists make use of different acoustic parameters to perform objective evaluations on patients and determine the type of dysphonia that affects them, such as hyperfunctional and hypofunctional dysphonia, which is important because each type requires a different treatment. In the field of artificial intelligence this problem has been addressed through the use of acoustic parameters that are used as input data to train machine learning and deep learning models. However, the purpose of these models is usually to identify whether a patient is ill or not, making binary classifications between healthy voices and voices with dysphonia, but not between types of dysphonia. In this paper, harmonic-to-noise ratio, cepstral peak prominence-smoothed, zero crossing rate and the means of the Mel frequency cepstral coefficients (2-19) are used to perform multiclass classification of voices with euphony, hyperfunction and hypofunction by means of six machine learning algorithms: Random Forest, K Nearest Neighbors, Logistic Regression, Decision Trees, Support Vector Machines and Naive Bayes. In order to evaluate which of them performs best at identifying the three voice classes, bootstrap.632 was used. It is concluded that the best confidence interval ranges from 87% to 92% in terms of accuracy, for the K Nearest Neighbors model. Results can be implemented in the development of a complementary application for the clinical diagnosis or monitoring of a patient under the supervision of a specialist
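
For readers unfamiliar with bootstrap.632, the following Python sketch shows the usual form of the estimator: a weighted combination of the resubstitution error (weight 0.368) and the average out-of-bag error over bootstrap resamples (weight 0.632). It is not taken from the paper. The function name, the `fit`/`predict` callables and the per-replicate averaging of the out-of-bag error are assumptions of this sketch.

```python
import random
from typing import Any, Callable, List, Sequence


def bootstrap_632(
    X: Sequence[Any],
    y: Sequence[Any],
    fit: Callable[[List[Any], List[Any]], Any],
    predict: Callable[[Any, Any], Any],
    n_boot: int = 200,
    seed: int = 0,
) -> float:
    """Return the .632 bootstrap estimate of the misclassification rate."""
    rng = random.Random(seed)
    n = len(X)

    # Resubstitution error: train and evaluate on the full sample.
    full_model = fit(list(X), list(y))
    resub = sum(predict(full_model, x) != t for x, t in zip(X, y)) / n

    # Out-of-bag error: train on a resample, test on the points it missed.
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:  # resample happened to contain every point
            continue
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in oob)
        oob_errors.append(wrong / len(oob))
    if not oob_errors:
        raise ValueError("no bootstrap replicate had out-of-bag points")

    oob_err = sum(oob_errors) / len(oob_errors)
    return 0.368 * resub + 0.632 * oob_err
```

Accuracy is one minus the returned error. Repeating the procedure or taking percentiles across replicates is one way to obtain an interval, although the abstract does not state how its interval was computed.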

    Development of Machine Learning Models for Generation and Activity Prediction of the Protein Tyrosine Kinase Inhibitors

    The field of computational drug discovery and development continues to grow at a rapid pace, using generative machine learning approaches to present us with solutions to high dimensional and complex problems in drug discovery and design. In this work, we present a platform of Machine Learning based approaches for generation and scoring of novel kinase inhibitor molecules. We utilized a binary Random Forest classification model to develop a Machine Learning based scoring function to evaluate the generated molecules on Kinase Inhibition Likelihood. By training the model on several chemical features of each known kinase inhibitor, we were able to create a metric that captures the differences between a SRC Kinase Inhibitor and a non-SRC Kinase Inhibitor. We implemented the scoring function into a Biased and Unbiased Bayesian Optimization framework to generate molecules based on features of SRC Kinase Inhibitors. We then used similarity metrics such as Tanimoto Similarity to assess their closeness to that of known SRC Kinase Inhibitors. The molecules generated from this experiment demonstrated potential for belonging to the SRC Kinase Inhibitor family though chemical synthesis would be needed to confirm the results. The top molecules generated from the Unbiased and Biased Bayesian Optimization experiments were calculated to respectively have Tanimoto Similarity scores of 0.711 and 0.709 to known SRC Kinase Inhibitors. With calculated Kinase Inhibition Likelihood scores of 0.586 and 0.575, the top molecules generated from the Bayesian Optimization demonstrate a disconnect between the similarity scores to known SRC Kinase Inhibitors and the calculated Kinase Inhibition Likelihood score. It was found that implementing a bias into the Bayesian Optimization process had little effect on the quality of generated molecules. 
In addition, several molecules generated from the Bayesian Optimization process were sent to the School of Pharmacy for chemical synthesis, which gives the experiment more concrete results. The results of this study demonstrated that generating molecules through Bayesian Optimization techniques could aid in the generation of molecules for a specific kinase family, but further expansions of the techniques would be needed for substantial results
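
The Tanimoto Similarity scores quoted above compare molecular fingerprints. As a minimal sketch, assuming each fingerprint is represented as the set of indices of its "on" bits (the abstract does not say which fingerprint type was used), the coefficient is the size of the intersection divided by the size of the union. The function name and the convention for two empty fingerprints are assumptions of this example.

```python
from typing import Iterable


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bit indices."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(set_a & set_b) / union
```

A score of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so values around 0.71 indicate substantial but incomplete overlap.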

    Explainable deep learning models for biological sequence classification

    Biological sequences - DNA, RNA and proteins - orchestrate the behavior of all living cells and trying to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the usage of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs) were shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex and thus make model application and interpretation hard, but the possibility to interpret which features a model has learned from the data is crucial to understand and to explain new biological mechanisms. This work therefore presents pysster, our open source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We evaluate and implement different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond pure sequences to improve the biological feature interpretability. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by models are then visualized as sequence and structure motifs together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates. 
Finally, we present a larger biological application by predicting RNA-binding of proteins for transcripts for which experimental protein-RNA interaction data is not yet available. Here, the comprehensive interpretation options of CNNs made us aware of potential technical bias in the experimental eCLIP data (enhanced crosslinking and immunoprecipitation) that were used as a basis for the models. This allowed for subsequent tuning of the models and data to get more meaningful predictions in practice

    PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines

    Secondary structure prediction is a crucial task for understanding the variety of protein structures and the biological functions they perform. Prediction of secondary structures for new proteins using their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel technique to predict protein secondary structures based on position-specific scoring matrices (PSSMs) and physico-chemical properties of amino acids. It is a two-stage approach involving multiclass support vector machines (SVMs) as classifiers for three different structural conformations, viz., helix, sheet and coil. In the first stage, PSSMs obtained from PSI-BLAST and five specially selected physico-chemical properties of amino acids are fed into SVMs as features for sequence-to-structure prediction. Confidence values for forming helix, sheet and coil that are obtained from the first-stage SVM are then used in the second-stage SVM for performing structure-to-structure prediction. The two-stage cascaded classifiers (PSP_MCSVM) are trained with proteins from the RS126 dataset. The classifiers are finally tested on target proteins of the critical assessment of protein structure prediction experiment-9 (CASP9). PSP_MCSVM with the brainstorming consensus procedure performs better than prediction servers such as Predator, DSC and SIMPA96 for randomly selected proteins from CASP9 targets. The overall performance is found to be comparable with the current state-of-the-art. PSP_MCSVM source code, train-test datasets and supplementary files are available freely in the public domain at: http://sysbio.icm.edu.pl/secstruct and http://code.google.com/p/cmater-bioinfo

    Biomedical Data Classification with Improvised Deep Learning Architectures

    With the rise of very powerful hardware and the evolution of deep learning architectures, healthcare data analysis and its applications have been drastically transformed. These transformations mainly aim to aid healthcare personnel with the diagnosis and prognosis of a disease or abnormality at any given point of the routine healthcare workflow. For instance, much of cancer metastasis detection depends on pathological tissue procedures and pathologist reviews. Reports of severity classification vary among different pathologists, which then leads to different treatment options for a patient. This labor-intensive work can lead to errors or mistreatments, resulting in high healthcare costs. With the help of machine learning and deep learning modules, some of these traditional diagnosis techniques can be improved and can aid a doctor in decision making with an unbiased view. Some such modules can help reduce the cost, the shortage of expertise, and the time needed to identify a disease. However, there are many other datapoints that are available alongside medical images, such as omics data, biomarker calculations, patient demographics and history. All these datapoints can enhance disease classification or prediction of progression with the help of machine learning/deep learning modules. However, it is very difficult to find a comprehensive dataset with all the different modalities and features in a healthcare setting due to privacy regulations. Hence, in this thesis, we explore both medical imaging data with clinical datapoints and genomics datasets separately for classification tasks using combinational deep learning architectures. We use deep neural networks with 3D volumetric structural magnetic resonance images from an Alzheimer's disease dataset for classification of the disease. A separate study is implemented to understand classification based on clinical datapoints achieved by machine learning algorithms. 
For bioinformatics applications, sequence classification is a crucial step in many metagenomics applications. However, it requires extensive preprocessing, such as sequence assembly or sequence alignment, before raw whole genome sequencing data can be used, which makes it time-consuming, especially in bacterial taxonomy classification. There are only a few approaches for sequence classification tasks, and they mainly involve convolutions and deep neural networks. A novel method is developed that uses the intrinsic nature of recurrent neural networks for 16S rRNA sequence classification and can be adapted to utilize read sequences directly. For this classification task, accuracy is improved using optimization techniques with a hybrid neural network