235 research outputs found

    Data mining and machine learning: an Overview of Classifiers

    In the information age, the digital revolution has made it necessary to use appropriate technologies to analyse the vast amount of essential information now available. Data mining is a technique for making sense of available data: its aim is to extract information from a large volume of data and transform it into a form comprehensible to humans. For this purpose, machine learning methods are used to classify data. In this study, we discuss six popular and useful classifiers used in the data mining process.
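    The abstract does not name the six classifiers it discusses; as a minimal sketch of how such a comparison is typically run, assuming a common set of methods and scikit-learn's bundled iris data rather than anything from this study:

    # Minimal classifier-comparison sketch (assumed classifier set; the abstract
    # does not name the six methods it covers).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    classifiers = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "naive Bayes": GaussianNB(),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel="rbf"),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
        print(f"{name:20s} accuracy = {scores.mean():.3f}")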

    Large margin methods for partner specific prediction of interfaces in protein complexes

    The study of protein interfaces and binding sites is a very important domain of research in bioinformatics. Information about the interfaces between proteins can be used not only to understand protein function but can also be directly employed in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and, with today's technology, not possible in some cases. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area. A number of machine-learning-based techniques have been proposed to solve this problem; however, the prediction accuracy of most such schemes is very low. In this dissertation we present large-margin classification approaches designed to directly model different aspects of protein complex formation as well as the characteristics of the available data. Most existing machine learning techniques for this task are partner-independent in nature, i.e., they ignore the fact that a protein's propensity to bind another protein depends on characteristics of residues in both proteins. We have developed a pairwise support vector machine classifier called PAIRpred to predict protein interfaces in a partner-specific fashion. Owing to its more detailed model of the problem, PAIRpred offers state-of-the-art accuracy in predicting both binding sites at the protein level and inter-protein residue contacts at the complex level. PAIRpred uses sequence and structure conservation, local structural similarity and surface geometry, residue solvent exposure, and template-based features derived from the unbound structures of the proteins forming a complex. We have investigated, through transductive and semi-supervised learning models, the impact of explicitly modeling the inter-dependencies between residues that are imposed by the overall structure of a protein during complex formation. We also present a novel multiple instance learning scheme called MI-1 that explicitly models imprecision in sequence-level annotations of binding sites in calmodulin-binding proteins, achieving state-of-the-art prediction accuracy for this task.
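    The partner-specific idea behind a pairwise SVM can be illustrated with a pairwise kernel over residue pairs; the sketch below uses a simple tensor-product RBF kernel on random placeholder features, not PAIRpred's actual features or kernels:

    # Sketch of a partner-specific, pairwise-kernel SVM (illustrative only; the
    # real method uses richer sequence/structure features and kernels).
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_pairs, d = 200, 10                      # hypothetical residue-pair examples
    A = rng.normal(size=(n_pairs, d))         # features of residues from protein A
    B = rng.normal(size=(n_pairs, d))         # features of residues from protein B
    y = rng.integers(0, 2, size=n_pairs)      # 1 = the two residues are in contact

    def pairwise_gram(A1, B1, A2, B2, gamma=0.1):
        # K((a,b),(a',b')) = k(a,a') * k(b,b'): a residue pair resembles another
        # pair only if both binding partners are similar.
        return rbf_kernel(A1, A2, gamma=gamma) * rbf_kernel(B1, B2, gamma=gamma)

    train, test = slice(0, 150), slice(150, None)
    K_train = pairwise_gram(A[train], B[train], A[train], B[train])
    K_test = pairwise_gram(A[test], B[test], A[train], B[train])

    clf = SVC(kernel="precomputed").fit(K_train, y[train])
    print("test accuracy:", clf.score(K_test, y[test]))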

    ML meets MLn: machine learning in ligand promoted homogeneous catalysis

    The benefits of using machine learning approaches in the design, optimisation and understanding of homogeneous catalytic processes are being increasingly realised. We focus on the understanding and implementation of key concepts, which serve as conduits to the more advanced chemical machine learning literature, much of which (presently) lies outside the area of homogeneous catalysis. Potential pitfalls in the ‘workflow’ procedures needed in the machine learning process are identified, and all the examples provided are set in a chemical sciences context, including several from ‘real world’ catalyst systems. Finally, potential areas of expansion and impact for machine learning in homogeneous catalysis in the future are considered.
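    A recurring workflow pitfall in chemical machine learning is preprocessing performed outside the resampling loop (data leakage); a minimal leak-free pattern, using synthetic stand-ins for ligand descriptors and catalytic outcomes rather than any data from this work, might look like:

    # Sketch of a leakage-free workflow: scaling and model fitting live in one
    # Pipeline, so cross-validation refits both on every fold.  The descriptor
    # matrix and response below are synthetic placeholders.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 12))                          # e.g. steric/electronic ligand descriptors
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=80)     # e.g. catalytic yield

    model = make_pipeline(StandardScaler(),
                          RandomForestRegressor(n_estimators=200, random_state=0))
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("cross-validated R2:", scores.mean())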

    The doctoral research abstracts. Vol:6 2014 / Institute of Graduate Studies, UiTM

    Congratulations to the Institute of Graduate Studies on its continuous efforts to publish the 6th issue of the Doctoral Research Abstracts, which range across the disciplines of science and technology, business and administration, and social science and humanities. This issue captures the novelty of research from 52 PhD doctorates receiving their scrolls at UiTM’s 81st Convocation. This convocation is especially significant for UiTM, since we are celebrating the success of 52 PhD graduands – the highest number ever conferred at any one time. To the 52 doctorates, I would like it to be known that you have most certainly done UiTM proud by journeying through the scholastic path with its endless challenges and impediments, and by persevering right to the very end. This convocation should not be regarded as the end of your highest scholarly achievement and contribution to the body of knowledge, but rather as the beginning of embarking on more innovative research, for the community and country, drawing on the knowledge gained during this academic journey. As alumni of UiTM, we hold you dear to our hearts. The relationship that was once between student and supervisor has now matured into one of comrades, forging ahead and exploring together beyond the frontier of knowledge. We wish you all the best in your endeavours, and may I offer my congratulations to all the graduands. ‘UiTM sentiasa dihati ku’. Tan Sri Dato’ Sri Prof Ir Dr Sahol Hamid Abu Bakar, FASc, PEng, Vice-Chancellor, Universiti Teknologi MARA.

    Automated interpretation of benthic stereo imagery

    Autonomous benthic imaging reduces human risk and increases the amount of collected data. However, manually interpreting these high volumes of data is onerous, time consuming and, in many cases, infeasible. The objective of this thesis is to improve the scientific utility of these large image datasets. Fine-scale terrain complexity is typically quantified by rugosity and measured by divers using chains and tape measures. This thesis proposes a new technique for measuring terrain complexity from 3D stereo image reconstructions, which is non-contact and can be calculated at multiple scales over large spatial extents. Using robots, terrain complexity can be measured beyond scuba depths without endangering humans. Results show that this approach is more robust, flexible and easily repeatable than traditional methods. The proposed terrain complexity features are combined with visual colour and texture descriptors and applied to classifying imagery. New multi-dataset feature selection methods are proposed for performing feature selection across multiple datasets, and are shown to improve overall classification performance. The results show that the most informative predictors of benthic habitat types are the new terrain complexity measurements. This thesis also presents a method that aims to reduce human labelling effort while maximising classification performance by combining pre-clustering with active learning. The results show that utilising the structure of the unlabelled data in conjunction with uncertainty sampling can significantly reduce the number of labels required for a given level of accuracy. Typically only 0.00001–0.00007% of image data is annotated and processed for science purposes (20–50 points in 1–2% of the images). This thesis therefore proposes a framework that uses existing human-annotated point labels to train a superpixel-based automated classification system, which can extrapolate the classified results to every pixel across all the images of an entire survey.
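    The core idea of the non-contact rugosity measure, the surface area of the reconstructed mesh divided by its planar projection, can be sketched as follows on a toy mesh; the thesis' multi-scale estimator is more involved:

    # Sketch of mesh-based rugosity (3D surface area over planar projected area).
    import numpy as np

    def triangle_areas(vertices, faces):
        # Area of each triangle from the cross product of two edge vectors.
        v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
        return 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)

    def rugosity(vertices, faces):
        surface = triangle_areas(vertices, faces).sum()
        flat = np.c_[vertices[:, :2], np.zeros(len(vertices))]   # drop the z coordinate
        planar = triangle_areas(flat, faces).sum()
        return surface / planar   # 1.0 for a flat patch, larger for complex terrain

    # Toy example: a single tilted triangle has rugosity > 1.
    verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
    faces = np.array([[0, 1, 2]])
    print(rugosity(verts, faces))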

    Discovery of Novel Glycogen Synthase Kinase-3beta Inhibitors: Molecular Modeling, Virtual Screening, and Biological Evaluation

    Glycogen synthase kinase-3 (GSK-3) is a multifunctional serine/threonine protein kinase engaged in a variety of signaling pathways and regulating a wide range of cellular processes. Owing to its distinct regulation mechanism and unique substrate specificity in the molecular pathogenesis of human diseases, GSK-3 is one of the most attractive therapeutic targets for pathologies with unmet treatment needs, including type-II diabetes, cancers, inflammation, and neurodegenerative disease. Recent advances in drug discovery targeting GSK-3 have involved extensive computational modeling techniques. Both ligand- and structure-based approaches have been well explored to design ATP-competitive inhibitors. Molecular modeling plus dynamics simulations can provide insight into the protein-substrate and protein-protein interactions at the substrate binding pocket and the C-lobe hydrophobic groove, which will benefit the discovery of non-ATP-competitive inhibitors. To identify structurally novel and diverse compounds that effectively inhibit GSK-3β, we performed virtual screening by implementing a mixed ligand/structure-based approach that included pharmacophore modeling, diversity analysis, and ensemble docking. The sensitivities of different docking protocols to the induced-fit effects at the ATP-competitive binding pocket of GSK-3β were explored. An enrichment study was employed to verify the robustness of ensemble docking, compared with individual docking, in retrieving active compounds from a decoy dataset. A total of 24 structurally diverse compounds obtained from the virtual screening experiment underwent biological validation. The bioassay results showed that 15 of the 24 hit compounds are indeed GSK-3β inhibitors, and among them one compound exhibiting sub-micromolar inhibitory activity is a reasonable starting point for further optimization. To identify further structurally novel GSK-3β inhibitors, we performed virtual screening by implementing another mixed ligand-based/structure-based approach, which included quantitative structure-activity relationship (QSAR) analysis and docking prediction. To integrate and analyze complex data sets from multiple experimental sources, we developed and validated hierarchical QSAR, which adopts a multi-level structure to take data heterogeneity into account. A collection of 728 GSK-3 inhibitors with diverse structural scaffolds was assembled from the published papers of 7 research groups that used different experimental protocols. Support vector machines and random forests were implemented with wrapper-based feature selection algorithms in order to construct predictive learning models. The best models for each single group of compounds were then selected, based on both internal and external validation, and used to build the final hierarchical QSAR model. The predictive performance of the hierarchical QSAR model is demonstrated by an overall R² of 0.752 for the 141 compounds in the test set. The compounds obtained from this virtual screening experiment underwent biological validation. The bioassay results confirmed that 2 hit compounds are indeed GSK-3β inhibitors exhibiting sub-micromolar inhibitory activity, thereby validating hierarchical QSAR as an effective approach for virtual screening experiments. We have also successfully implemented a variant of supervised learning, named multiple-instance learning, to predict the bioactive conformers of a given molecule that are responsible for its observed biological activity. The implementation requires instance-based embedding, and joint feature selection and classification. The goal of this part of the project was to implement multiple-instance learning in drug activity prediction and subsequently to identify the bioactive conformers of each molecule. The proposed approach was shown not to suffer from overfitting and to be highly competitive with classical predictive models, making it a powerful tool for drug activity prediction. The approach was also validated as a useful method for the pursuit of bioactive conformers.
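    One branch of the QSAR workflow described, wrapper-based feature selection around a random forest with external R² validation, can be sketched as below; the descriptors and activities are synthetic placeholders rather than the 728-compound GSK-3 set:

    # Sketch of a single-group QSAR model: wrapper-based (sequential) feature
    # selection around a random forest, evaluated on held-out compounds.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 20))                                 # molecular descriptors
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=200)     # pIC50-like activity

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    model = make_pipeline(
        SequentialFeatureSelector(rf, n_features_to_select=5, direction="forward", cv=3),
        rf,
    )
    model.fit(X_tr, y_tr)
    print("external R2:", r2_score(y_te, model.predict(X_te)))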

    Convolutional Methods for Music Analysis

    Materials & Machines: Simplifying the Mosaic of Modern Manufacturing

    Manufacturing in modern society has taken on a different role than in previous generations. Today’s manufacturing processes involve many different physical phenomena working in concert to produce the best possible material properties. It is the role of the materials engineer to evaluate, develop, and optimize applications for the successful commercialization of any potential material. Laser-assisted cold spray (LACS) is a solid-state manufacturing process that relies on the impact of supersonic particles onto a laser-heated surface to create coatings and near-net structures. A process such as this, involving thermodynamics, fluid dynamics, heat transfer, diffusion, localized melting, deformation, and recrystallization, is the perfect target for a data science framework that enables rapid application development, with the aim of commercializing such a complex technology on a much shorter timescale than was previously possible. A general framework for such an approach will be discussed, followed by its execution for LACS. Results from the development of this materials engineering model will be discussed as they relate to the methods used, the effectiveness of the final fitted model, and the application of such a model to solving modern materials engineering challenges.

    Hybrid modelling of bioprocesses.

    The two traditional approaches to modelling can be characterised as the development of mechanistic models from 'first principles' and the fitting of statistical models to data. The so-called 'hybrid approach' combines both elements within a single overall model, which is thus composed of a set of mass balance constraints and a set of kinetic functions. This thesis considers methodologies for building hybrid models of bioprocesses. Two methodologies were developed, evaluated and demonstrated on a range of simulated and experimental systems. A method for inferring models from data using support vector machines was developed and demonstrated on three experimental systems: a murine hybridoma shake-flask cell culture, a Saccharopolyspora erythraea shake-flask cultivation and a 42 L Streptomyces clavuligerus batch cultivation. On the latter system the method produced models of similar accuracy to previously published hybrid modelling work. While support vector machines have been widely applied in other contexts, this method is novel in the sense that there are no previously published papers on the use of support vector machines for kinetic modelling of bioprocesses. On 50 randomly created dynamical systems it was shown that the hybrid models produced using the support vector machine methodology were generally more accurate than those developed using feed-forward neural networks, and could not be distinguished from models produced using a more computationally demanding method based around genetic programming. Additionally, a Bayesian framework for hybrid modelling was developed and demonstrated on simple simulated systems. The Bayesian approach requires no interpolation of data, can cope with missing initial conditions and provides a principled framework for incorporating a priori beliefs. These features are likely to be useful in practical situations where high-quality experimental data are difficult to produce.
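    The hybrid structure, mass-balance ODEs whose kinetic rate is supplied by a support vector machine, can be sketched as follows; the "observed" kinetics below are synthetic (Monod-like) rather than data from the thesis:

    # Sketch of a hybrid bioprocess model: mechanistic balances give the ODE
    # structure, an SVM regression supplies the kinetic rate mu(S).
    import numpy as np
    from sklearn.svm import SVR
    from scipy.integrate import solve_ivp

    # Synthetic kinetic observations: specific growth rate mu versus substrate S.
    S_obs = np.linspace(0.05, 10.0, 40).reshape(-1, 1)
    mu_obs = 0.5 * S_obs.ravel() / (1.0 + S_obs.ravel())          # Monod-like

    svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(S_obs, mu_obs)

    def hybrid_rhs(t, z, Yxs=0.5):
        X, S = z
        mu = float(svr.predict([[max(S, 0.0)]])[0])   # data-driven kinetics
        return [mu * X,                               # dX/dt: biomass balance
                -mu * X / Yxs]                        # dS/dt: substrate balance

    sol = solve_ivp(hybrid_rhs, (0.0, 20.0), [0.1, 10.0])
    print("final biomass and substrate:", sol.y[:, -1])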