
    Regularized Models for Fitting Zero-Inflated and Zero-Truncated Count Data: A Comparative Analysis

    Generalized Linear Models (GLMs) are widely recognized as more effective than the Ordinary Least Squares (OLS) approach for fitting count data. The inability of OLS to handle count data suitably can be attributed to its tendency to overfit. This study proposes regularized models, specifically Ridge Regression and the Least Absolute Shrinkage and Selection Operator (LASSO), for fitting count data. These models are compared with frequentist and Bayesian models commonly used for count data, such as the Dirichlet prior mixture of generalized linear mixed models and the discrete Weibull. The findings show that Ridge Regression outperforms all other models on the Akaike Information Criterion (AIC). Its performance diminishes under the Bayesian Information Criterion (BIC), although it still outperforms LASSO. The study therefore suggests regularized regression models for fitting zero-inflated count data, as demonstrated with simulated data. Further, the appropriateness of regularized models for zero-truncated count data is exemplified using life data.

    Analysis of software engineering data with self-organizing maps

    Software developers today have tools that assist them during the development and management of software products. These tools store large amounts of software engineering data resulting from these processes. Analysis of the data can reveal valuable information about project performance and uncover useful business patterns used during development. This information can be used to find projects whose teams have management problems and to help improve the situation. Existing methods in the field have limited applicability because they require expert knowledge for evaluation and cannot deal with a large number of projects. In this thesis, we explore the possibility of applying Machine Learning methods to the analysis of software engineering data. Machine Learning methods can build a model from sample inputs and produce data-driven predictions. They have been gaining popularity over the last decades and show promising results in applications where human expertise was traditionally used. In this work, we attempt to extract and analyze software project management patterns from software engineering data stored in GitHub repositories. For this purpose, we have developed a system that is capable of collecting project data, extracting features and comparing the properties of a large number of projects with each other. To collect projects, we used the Unified Data Model, which is capable of storing software engineering data from various sources; we also identified a few limitations of this model and improved it to meet the requirements of our work. The obtained data was used for training Self-Organizing Maps. The resulting map demonstrated clear grouping of the projects according to the chosen feature set. We estimated efficiencies for distinct areas of the map. The effects of different events occurring during an issue's lifetime, such as user assignment and labeling, were also investigated.
Based on data analysis, we showed that labeling and user assignment are beneficial and can potentially decrease issue resolution time. The main result of our work is an evaluation system that is capable of data collection, storage, cleaning and evaluation. The evaluation part of our system was based on an analysis of 230 individual projects that resulted from the cooperation of 100 000 unique users from the GitHub community. Further research directions can include verification of the estimation subsystem by GitHub users who participated in project development.

    Mind reading with regularized multinomial logistic regression

    In this paper, we consider the problem of multinomial classification of magnetoencephalography (MEG) data. The proposed method participated in the MEG mind reading competition of the ICANN'11 conference, where the goal was to train a classifier for predicting the movie the test person was shown. Our approach was the best among 10 submissions, reaching an accuracy of 68 % in this five-category problem. The method is based on a regularized logistic regression model, whose efficient feature selection is critical for cases with more measurements than samples. Moreover, special attention is paid to the estimation of the generalization error in order to avoid overfitting to the training data. Here, in addition to describing our competition entry in detail, we report selected additional experiments, which question the usefulness of complex feature extraction procedures and the basic frequency decomposition of the MEG signal for this application. Peer reviewed

    Predictive modeling using sparse logistic regression with applications

    In this thesis, sparse logistic regression models are applied in a set of real-world machine learning applications. The studied cases include supervised image segmentation, cancer diagnosis, and MEG data classification. Image segmentation is applied both in component detection in inkjet-printed electronics manufacturing and in cell detection from microscope images. The results indicate that a simple linear classification method such as logistic regression often outperforms more sophisticated methods. Further, it is shown that the interpretability of the linear model offers a great advantage in many applications. Model validation and automatic feature selection by means of L1 regularized parameter estimation have a significant role in this thesis. It is shown that a combination of a careful model assessment scheme and automatic feature selection by means of a logistic regression model and coefficient regularization creates a powerful, yet simple and practical, tool chain for applications of supervised learning and classification.

    Machine Learning Methods for Structural Brain MRIs: Applications for Alzheimer’s Disease and Autism Spectrum Disorder

    This thesis deals with the development of novel machine learning applications to automatically detect brain disorders based on magnetic resonance imaging (MRI) data, with a particular focus on Alzheimer’s disease and autism spectrum disorder. Machine learning approaches are used extensively in neuroimaging studies of brain disorders to investigate abnormalities in various brain regions. However, there are many technical challenges in the analysis of neuroimaging data, for example, high dimensionality, the limited amount of data, and high variance in that data due to many confounding factors. These limitations make the development of appropriate computational approaches more challenging. To deal with these challenges, we target multiple machine learning approaches, including supervised and semi-supervised learning, domain adaptation, and dimensionality reduction methods. In the current study, we aim to construct effective biomarkers with sufficient sensitivity and specificity that can help physicians better understand the diseases and make improved diagnoses or treatment choices. The main contributions are 1) the development of a novel biomarker for predicting Alzheimer’s disease in mild cognitive impairment patients by integrating structural MRI data and neuropsychological test results and 2) the development of a new computational approach for predicting disease severity in autistic patients in agglomerative data by automatically combining structural information obtained from different brain regions. In addition, we investigate various data-driven feature selection and classification methods for whole-brain, voxel-based classification analysis of structural MRI and the use of semi-supervised learning approaches to predict Alzheimer’s disease.
We also analyze the relationship between disease-related structural changes and cognitive states of patients with Alzheimer’s disease. The positive results of this effort provide insights into how to construct better biomarkers based on multisource data analysis of patient and healthy cohorts that may enable early diagnosis of brain disorders, detection of brain abnormalities and understanding effective processing in patient and healthy groups. Further, the methodologies and basic principles presented in this thesis are not only suited to the studied cases but are also applicable to other similar problems.