765 research outputs found

    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics, as well as representative examples of RF applications in this context and possible directions for future research.
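
    A minimal sketch of the workflow the survey describes, with simulated data standing in for a real p >> n bioinformatics matrix: a forest is grown with scikit-learn (the paper itself largely surveys R implementations) and variables are ranked with a permutation-based importance measure, one of the VIMs whose properties the paper examines.

```python
# Sketch only: fit a random forest where variables vastly outnumber
# observations and rank variables by permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n, p = 100, 1000                            # far more variables than observations
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # only the first two variables matter

rf = RandomForestClassifier(
    n_estimators=500,       # more trees stabilise importance estimates
    max_features="sqrt",    # mtry, the main tuning parameter
    oob_score=True,         # out-of-bag error, RF's built-in validation
    random_state=0,
).fit(X, y)
print("OOB accuracy:", rf.oob_score_)

imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("top-ranked variables:", np.argsort(imp.importances_mean)[::-1][:5])
```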

    Computer Intrusion Detection Through Statistical Analysis and Prediction Modeling

    Information security is very important in today’s society. Computer intrusion is one type of security infraction that poses a threat to all of us. Almost every person in modern parts of the world depends upon automated information. Information systems deliver paychecks on time, manage taxes, transfer funds, deliver important information that enables decisions, and maintain situational awareness in many different ways. Interrupting, corrupting, or destroying this information is a real threat. Computer attackers, often intruders masquerading as authentic users, are the nucleus of this threat. Preventive computer security measures often do not provide enough protection; digital firms need methods to detect attackers who have breached firewalls or other barriers. This thesis explores techniques to detect computer intruders by comparing the UNIX command usage of authentic users against the command usage of attackers. The hypothesis is that the computing behavior of authentic users differs from the computing behavior of attackers. To explore this hypothesis, seven variables that measure computing commands are created and used for predictive modeling to determine the presence or absence of an attacker. This is a classification problem that involves two known groups: intruders and non-intruders. Techniques explored include a proven algorithm published by Matthias Schonlau in [17] and several predictive model variations utilizing the aforementioned seven variables; the predictive models include linear discriminant analysis, clustering, and kernel partial least squares learning machines.
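
    An illustrative sketch of the modelling idea, not the thesis's method: each UNIX session is summarised by a few hand-crafted command-usage variables (hypothetical stand-ins for the seven variables described above) and classified with linear discriminant analysis, one of the techniques the thesis explores.

```python
# Hypothetical features, not the thesis's seven variables: summarise each
# session of UNIX commands and classify it as intruder vs. non-intruder.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def session_features(commands, known_vocab):
    """Distinct-command ratio, fraction of commands unseen for this user,
    and (normalised) length of the longest run of one repeated command."""
    n = len(commands)
    distinct = len(set(commands)) / n
    unseen = sum(c not in known_vocab for c in commands) / n
    longest, run = 1, 1
    for a, b in zip(commands, commands[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return [distinct, unseen, longest / n]

# Toy labelled sessions (0 = authentic user, 1 = intruder).
vocab = {"ls", "cd", "vi", "make", "gcc"}
sessions = [(["ls", "cd", "vi", "make", "gcc", "ls"], 0),
            (["cd", "ls", "cd", "vi", "vi", "make"], 0),
            (["nmap", "nc", "wget", "chmod", "nc", "nc"], 1),
            (["wget", "chmod", "nmap", "nc", "wget", "nc"], 1)]
X = np.array([session_features(cmds, vocab) for cmds, _ in sessions])
y = np.array([label for _, label in sessions])

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(X))   # in-sample sanity check on the toy data
```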

    Bayesian models for syndrome- and gene-specific probabilities of novel variant pathogenicity

    BACKGROUND: With the advent of affordable and comprehensive sequencing technologies, access to molecular genetics for clinical diagnostics and research applications is increasing. However, variant interpretation remains challenging, and tools that close the gap between data generation and data interpretation are urgently required. Here we present a transferable approach to help address the limitations in variant annotation. METHODS: We develop a network of Bayesian logistic regression models that integrate multiple lines of evidence to evaluate the probability that a rare variant is the cause of an individual's disease. We present models for genes causing inherited cardiac conditions, though the framework is transferable to other genes and syndromes. RESULTS: Our models report a probability of pathogenicity, rather than a categorisation into pathogenic or benign, which captures the inherent uncertainty of the prediction. We find that gene- and syndrome-specific models outperform genome-wide approaches, and that the integration of multiple lines of evidence performs better than individual predictors. The models are adaptable to incorporate new lines of evidence, and results can be combined with familial segregation data in a transparent and quantitative manner to further enhance predictions. Though the probability scale is continuous and innately interpretable, performance summaries based on thresholds are useful for comparisons. Using a threshold probability of pathogenicity of 0.9, we obtain a positive predictive value of 0.999 and a sensitivity of 0.76 for the classification of variants known to cause long QT syndrome across the three most important genes, which represents sufficient accuracy to inform clinical decision-making. A web tool, APPRAISE [http://www.cardiodb.org/APPRAISE], provides access to these models and predictions. CONCLUSIONS: Our Bayesian framework provides a transparent, flexible and robust approach for the analysis and interpretation of rare genetic variants. Models tailored to specific genes outperform genome-wide approaches, and can be sufficiently accurate to inform clinical decision-making.
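
    A conceptual sketch of the two-stage reasoning described above, with made-up coefficients and Bayes factor rather than the fitted APPRAISE parameters: evidence features are combined through a logistic model into a probability of pathogenicity, which is then updated with familial segregation evidence on the odds scale.

```python
# Conceptual sketch; coefficients and the Bayes factor are invented, not the
# fitted APPRAISE model parameters.
import math

def p_pathogenic(evidence, coef, intercept):
    """Logistic combination of evidence features -> P(variant is pathogenic)."""
    z = intercept + sum(c * x for c, x in zip(coef, evidence))
    return 1.0 / (1.0 + math.exp(-z))

def update_with_segregation(p, bayes_factor):
    """Fold segregation evidence into the probability on the odds scale."""
    odds = p / (1.0 - p) * bayes_factor
    return odds / (1.0 + odds)

# Hypothetical evidence vector: (rarity score, conservation, in-silico score).
p = p_pathogenic([0.9, 0.8, 0.7], coef=[2.1, 1.4, 1.0], intercept=-3.0)
p = update_with_segregation(p, bayes_factor=8.0)  # e.g. three informative meioses
print(f"P(pathogenic) = {p:.3f}")
```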

    Performance analysis: a case study on network management system using machine learning

    Businesses run legacy distributed software systems that are beyond the reach of traditional data analysis methods because of their complexity. In addition, these software systems evolve and become difficult to understand even with knowledge of the system architecture. Machine learning and big data analytics are widely used in many technical domains to gain insight from large business data because of their performance and accuracy. This study investigated the applicability of machine learning techniques to resource utilization modelling on Nokia’s network management system. The objective was to develop resource utilization models based on system performance data, and to address future business needs in capacity analysis of software performance by minimizing manual tasks. The performance data were extracted from the network management system software and contain system-level and component-level resource usage measurements under varying input load. In general, the simulated load on a network management system is uniform, with little variance; to overcome this, different load profiles were simulated on the system to assess its performance. The data were then processed and evaluated using a set of machine learning techniques (linear regression, MARS, k-NN, random forest, SVR, and feed-forward neural networks) to construct resource utilization models. The goodness of the developed models was further evaluated on simulated test data and customer data. Overall, no single algorithm performed best on all resource entities, but neural networks performed well on most response variables as a multivariable output model. When comparing performance across the customer and test datasets, some differences emerged, which were also studied. Overall, the results demonstrate the feasibility of modeling system resources for use in capacity analysis. Suggestions for further analysis of the remaining system nodes in future iterations are made in the report.
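
    A rough sketch of the model-comparison step, assuming a simulated tabular dataset of load features and one resource-utilization target; the actual study used Nokia NMS measurements and also included MARS, which scikit-learn does not provide.

```python
# Sketch with simulated data; the study's data came from a Nokia NMS.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 4))                          # load-profile features
y = 20 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 1, 500)   # e.g. CPU utilisation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "linear": LinearRegression(),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "neural net": make_pipeline(StandardScaler(),
                                MLPRegressor(hidden_layer_sizes=(32, 32),
                                             max_iter=2000, random_state=0)),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, m.predict(X_te)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.2f}")
```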

    The Unbalanced Classification Problem: Detecting Breaches in Security

    This research proposes several methods designed to improve solutions for security classification problems. The security classification problem involves unbalanced, high-dimensional, binary classification problems that are prevalent today. The imbalance within this data involves a significant majority of the negative class and a minority positive class. Any system that needs protection from malicious activity, intruders, theft, or other types of breaches in security must address this problem. These breaches in security are considered instances of the positive class. Given numerical data that represent observations or instances which require classification, state-of-the-art machine learning algorithms can be applied. However, the unbalanced and high-dimensional structure of the data must be considered prior to applying these learning methods. High-dimensional data pose a “curse of dimensionality” which can be overcome through the analysis of subspaces. Exploration of intelligent subspace modeling and the fusion of subspace models is proposed. A detailed analysis of the one-class support vector machine, together with its weaknesses and proposals to overcome these shortcomings, is included. A fundamental method for evaluating a binary classification model is the receiver operating characteristic (ROC) curve and the area under the curve (AUC). This work details the underlying statistics involved in ROC curves, contributing a comprehensive review of ROC curve construction and analysis techniques, including a novel graphic for illustrating the connection between ROC curves and classifier decision values. The major innovations of this work include synergistic classifier fusion through the analysis of ROC curves and rankings, insight into the statistical behavior of the Gaussian kernel, and novel methods for applying machine learning techniques to computer intrusion detection. The primary empirical vehicle for this research is computer intrusion detection data, and both host-based intrusion detection systems (HIDS) and network-based intrusion detection systems (NIDS) are addressed. Empirical studies also include military tactical scenarios.
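
    A minimal sketch of the ROC construction the abstract reviews: a threshold is swept over classifier decision values on an unbalanced toy dataset, tracing (FPR, TPR) pairs, and the AUC is obtained with the trapezoidal rule.

```python
# Self-contained ROC/AUC construction from decision values (NumPy only).
import numpy as np

def roc_points(scores, labels):
    """ROC points from decision values; labels are 1 (breach) / 0 (normal)."""
    order = np.argsort(-scores)          # sweep threshold from high to low
    labels = labels[order]
    tps = np.cumsum(labels)              # true positives at each cut-off
    fps = np.cumsum(1 - labels)          # false positives at each cut-off
    tpr = np.concatenate(([0.0], tps / labels.sum()))
    fpr = np.concatenate(([0.0], fps / (1 - labels).sum()))
    return fpr, tpr

rng = np.random.default_rng(0)
# Unbalanced toy data: 950 normal instances, 50 breaches (the positive class).
labels = np.concatenate((np.zeros(950, dtype=int), np.ones(50, dtype=int)))
scores = rng.normal(size=1000) + 1.5 * labels   # higher score = more suspicious
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal rule
print(f"AUC = {auc:.3f}")
```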

    Modeling and forecasting gender-based violence through machine learning techniques

    Gender-Based Violence (GBV) is a serious problem that societies and governments must address using all applicable resources. This requires adequate planning in order to optimize both resources and budget, which demands a thorough understanding of the magnitude of the problem, as well as analysis of its past impact in order to infer future incidence. At the same time, the rise of machine learning techniques and Big Data has for years led different countries to collect information on both GBV and other general social variables that can in one way or another affect violence levels. In this work, in order to forecast GBV, a database of features covering more than a decade of GBV is first compiled and prepared from official sources that are openly accessible in Spain. Secondly, a methodology is proposed that tests different feature selection methods; with each of the generated subsets, four predictive algorithms are applied and compared. The tests conducted indicate that it is possible to predict the number of GBV complaints presented to a court at a predictive horizon of six months with an error (Root Median Squared Error) of 0.1686 complaints to the courts per 10,000 inhabitants, throughout the whole Spanish territory, using a Multi-Objective Evolutionary Search Strategy for the selection of variables and Random Forest as the predictive algorithm. The proposed methodology has also been successfully applied to three specific Spanish territories of different populations (large, medium, and small), pointing to the presented method's possible use elsewhere in the world.
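
    A simplified sketch of such a pipeline, assuming a hypothetical monthly feature matrix (the column names are invented); the evolutionary feature selection used in the paper is replaced here by a basic univariate filter for brevity, with Random Forest as the predictor.

```python
# Hypothetical monthly features; names and the 6-month shift mimic the setup
# described above, but the data themselves are simulated.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
months = 132                             # roughly a decade of monthly records
df = pd.DataFrame(rng.normal(size=(months, 6)),
                  columns=["unemployment", "population", "alcohol_sales",
                           "prior_complaints", "divorces", "season"])
# Target: complaints per 10,000 inhabitants, simulated for the sketch.
df["complaints"] = 0.5 * df["prior_complaints"] + rng.normal(0, 0.1, months)

horizon = 6                              # predict six months ahead
X = df.drop(columns="complaints").iloc[:-horizon]
y = df["complaints"].shift(-horizon).dropna()

selector = SelectKBest(f_regression, k=3).fit(X, y)
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(selector.transform(X), y)
print("selected features:", X.columns[selector.get_support()].tolist())
```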

    mbonsai: Application Package for Sequence Classification by Tree Methodology

    In many applications, such as transaction data analysis, the classification of long chains of sequences is required. For example, a brand purchase history in customer transaction data takes a form like AABCABAA, where A, B, and C are brands of a consumer product. The decision tree-based package mbonsai is designed to handle sequence data of varying lengths, using one or multiple variables of interest as predictor variables. The software package uses tree growing and pruning strategies adopted from the C4.5 and CART algorithms, and includes new features for handling sequence data and indexing for classification purposes. It provides a simple command-line program for the learning and prediction processes, and can generate user-friendly graphics depicting decision trees. The underlying C++ code is designed to efficiently process large data sets in ASCII files. Two examples from transaction data sets illustrate the application of mbonsai.
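
    mbonsai itself is a C++ command-line tool; as a conceptual analogue only, the sketch below encodes each variable-length purchase sequence by its last k symbols and grows a small CART-style tree over those positions, using invented data.

```python
# Conceptual analogue only, with invented data; not the mbonsai interface.
from sklearn.tree import DecisionTreeClassifier, export_text

def last_k(seq, k=3, pad="-"):
    """Right-align a sequence to its last k symbols, padding short ones,
    then encode each position numerically (crude but tree-friendly)."""
    return [ord(c) for c in seq[-k:].rjust(k, pad)]

sequences = ["AABCABAA", "ABAB", "CCCBA", "AAAA", "BACC", "CBCB"]
labels    = [1, 0, 1, 1, 0, 0]           # e.g. 1 = next purchase is brand A

X = [last_k(s) for s in sequences]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["pos -3", "pos -2", "pos -1"]))
```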