18 research outputs found

    New Methods for the Prediction and Classification of Protein Domains

    Get PDF

    Testing and Validating Machine Learning Classifiers by Metamorphic Testing

    Get PDF
    Machine Learning algorithms have provided important core functionality to support solutions in many scientific computing applications - such as computational biology, computational linguistics, and others. However, it is difficult to test such applications because often there is no "test oracle" to indicate what the correct output should be for arbitrary input. To help address the quality of scientific computing software, in this paper we present a technique for testing the implementations of machine learning classification algorithms on which such scientific computing software depends. Our technique is based on an approach called "metamorphic testing", which has been shown to be effective in such cases. Also presented is a case study on a real-world machine learning application framework, and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method has very high effectiveness in killing mutants, and that observing expected cross-validation result alone is not sufficient to test for the correctness of a supervised classification program. Metamorphic testing is strongly recommended as a complementary approach. Finally we discuss how our findings can be used in other areas of computational science and engineering

    MMRF for proteome annotation applied to human protein disease prediction

    Get PDF
    Proceedings of: 20th International Conference, ILP 2010, Florence, Italy, June 27-30, 2010Biological processes where every gene and protein participates is an essential knowledge for designing disease treatments. Nowadays, these annotations are still unknown for many genes and proteins. Since making annotations from in-vivo experiments is costly, computational predictors are needed for different kinds of annotation such as metabolic pathway, interaction network, protein family, tissue, disease and so on. Biological data has an intrinsic relational structure, including genes and proteins, which can be grouped by many criteria. This hinders the possibility of finding good hypotheses when attribute-value representation is used. Hence, we propose the generic Modular Multi-Relational Framework (MMRF) to predict different kinds of gene and protein annotation using Relational Data Mining (RDM). The specific MMRF application to annotate human protein with diseases verifies that group knowledge (mainly protein-protein interaction pairs) improves the prediction, particularly doubling the area under the precision-recall curvePublicad

    The impact of feature selection on one and two-class classification performance for plant microRNAs

    Get PDF
    MicroRNAs (miRNAs) are short nucleotide sequences that form a typical hairpin structure which is recognized by a complex enzyme machinery. It ultimately leads to the incorporation of 18-24 nt long mature miRNAs into RISC where they act as recognition keys to aid in regulation of target mRNAs. It is involved to determine miRNAs experimentally and, therefore, machine learning is used to complement such endeavors. The success of machine learning mostly depends on proper input data and appropriate features for parameterization of the data. Although, in general, two-class classification (TCC) is used in the field; because negative examples are hard to come by, one-class classification (OCC) has been tried for pre-miRNA detection. Since both positive and negative examples are currently somewhat limited, feature selection can prove to be vital for furthering the field of pre-miRNA detection. In this study, we compare the performance of OCC and TCC using eight feature selection methods and seven different plant species providing positive pre-miRNA examples. Feature selection was very successful for OCC where the best feature selection method achieved an average accuracy of 95.6%, thereby being ~29% better than the worst method which achieved 66.9% accuracy. While the performance is comparable to TCC, which performs up to 3% better than OCC, TCC is much less affected by feature selection and its largest performance gap is ~13% which only occurs for two of the feature selection methodologies. We conclude that feature selection is crucially important for OCC and that it can perform on par with TCC given the proper set of features.The Scientific and Technological Research Council of Turkey (grant number 113E326

    Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

    Get PDF
    The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool, called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated to machine learning models that will predict efficiently a specified outcome. To this purpose, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

    Cyberthreats, Attacks and Intrusion Detection in Supervisory Control and Data Acquisition Networks

    Get PDF
    Supervisory Control and Data Acquisition (SCADA) systems are computer-based process control systems that interconnect and monitor remote physical processes. There have been many real world documented incidents and cyber-attacks affecting SCADA systems, which clearly illustrate critical infrastructure vulnerabilities. These reported incidents demonstrate that cyber-attacks against SCADA systems might produce a variety of financial damage and harmful events to humans and their environment. This dissertation documents four contributions towards increased security for SCADA systems. First, a set of cyber-attacks was developed. Second, each attack was executed against two fully functional SCADA systems in a laboratory environment; a gas pipeline and a water storage tank. Third, signature based intrusion detection system rules were developed and tested which can be used to generate alerts when the aforementioned attacks are executed against a SCADA system. Fourth, a set of features was developed for a decision tree based anomaly based intrusion detection system. The features were tested using the datasets developed for this work. This dissertation documents cyber-attacks on both serial based and Ethernet based SCADA networks. Four categories of attacks against SCADA systems are discussed: reconnaissance, malicious response injection, malicious command injection and denial of service. In order to evaluate performance of data mining and machine learning algorithms for intrusion detection systems in SCADA systems, a network dataset to be used for benchmarking intrusion detection systemswas generated. This network dataset includes different classes of attacks that simulate different attack scenarios on process control systems. This dissertation describes four SCADA network intrusion detection datasets; a full and abbreviated dataset for both the gas pipeline and water storage tank systems. Each feature in the dataset is captured from network flow records. This dataset groups two different categories of features that can be used as input to an intrusion detection system. First, network traffic features describe the communication patterns in a SCADA system. This research developed both signature based IDS and anomaly based IDS for the gas pipeline and water storage tank serial based SCADA systems. The performance of both types of IDS were evaluates by measuring detection rate and the prevalence of false positives
    corecore