
    Minimum Enclosing Spheres Formulations for Support Vector Ordinal Regression

    We present two new support vector approaches for ordinal regression. These approaches find the concentric spheres with minimum volume that contain most of the training samples. Both approaches guarantee that the radii of the spheres are properly ordered at the optimal solution. The size of the optimization problem is linear in the number of training samples. The popular SMO algorithm is adapted to solve the resulting optimization problem. Numerical experiments on some real-world data sets verify the usefulness of our approaches for data mining.

    An incremental dual nu-support vector regression algorithm

    © 2018, Springer International Publishing AG, part of Springer Nature. Support vector regression (SVR) has been a hot research topic for several years, as it is an effective regression learning algorithm. Early studies on SVR mostly focused on solving large-scale problems. Nowadays, an increasing number of researchers are focusing on incremental SVR algorithms. However, existing incremental SVR algorithms cannot handle uncertain data, which are very common in real life, because they require the training data to be precise. Therefore, to handle the incremental regression problem with uncertain data, an incremental dual nu-support vector regression algorithm (dual-v-SVR) is proposed. In the algorithm, a dual-v-SVR formulation is first designed to handle the uncertain data; we then design two special adjustments to enable the dual-v-SVR model to learn incrementally: incremental adjustment and decremental adjustment. Finally, the experimental results demonstrate that the incremental dual-v-SVR algorithm is an efficient incremental algorithm: it is not only capable of solving the incremental regression problem with uncertain data, but is also faster than batch or other incremental SVR algorithms.
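The abstract's incremental dual-v-SVR is not available in standard libraries, but the nu-SVR formulation it builds on is. A minimal sketch using scikit-learn's batch `NuSVR` (synthetic data, not the paper's algorithm) illustrates the nu parameter that bounds the fraction of support vectors:

```python
# Batch nu-SVR sketch with scikit-learn; the paper's incremental
# dual-v-SVR extends this formulation to uncertain, streaming data.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)  # noisy sine target

# nu upper-bounds the fraction of margin errors and lower-bounds
# the fraction of support vectors.
model = NuSVR(nu=0.5, C=1.0, kernel="rbf")
model.fit(X, y)
pred = model.predict(X)
print("training R^2:", round(model.score(X, y), 3))
```

An incremental variant, as described above, would update the fitted model when samples are added (incremental adjustment) or removed (decremental adjustment) instead of refitting from scratch.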

    A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data

    Background: Identification of biomarkers among thousands of genes arrayed for disease classification has been the subject of considerable research in recent years. These studies have focused on disease classification, comparing experimental groups of affected to normal patients. Related experiments can be done to identify tissue-restricted biomarkers, genes with a high level of expression in one tissue compared to other tissue types in the body. Results: In this study, cartilage was compared with ten other body tissues using a two-color array experimental design. Thirty-seven probe sets were identified as cartilage biomarkers. Of these, 13 (35%) have existing annotation associated with cartilage, including several well-established cartilage biomarkers. These genes comprise a useful database from which novel targets for cartilage biology research can be selected. We determined cartilage-specific Z-scores based on the observed M to classify genes with Z-scores ≥ 1.96 in all ten cartilage/tissue comparisons as cartilage-specific genes. Conclusion: Quantile regression is a promising method for the analysis of two-color array experiments that compare multiple samples in the absence of biological replicates, thereby limiting quantifiable error. We used a nonparametric approach to reveal the relationship between percentiles of M and A, where M is log2(R/G) and A is 0.5 log2(RG), with R representing the gene expression level in cartilage and G representing the gene expression level in one of the other ten tissues. Then we performed linear quantile regression to identify genes with a cartilage-restricted pattern of expression.
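The M-A transform and the Z ≥ 1.96 cutoff described above can be sketched directly; the intensities below are simulated stand-ins for the two-channel (R = cartilage, G = other tissue) measurements:

```python
# M-A transform for a two-color array: M = log2(R/G), A = 0.5*log2(R*G),
# with a z-score cutoff of 1.96 flagging tissue-restricted genes.
import numpy as np

rng = np.random.default_rng(1)
R = rng.lognormal(mean=8, sigma=1, size=1000)  # cartilage channel (simulated)
G = rng.lognormal(mean=8, sigma=1, size=1000)  # comparison tissue channel

M = np.log2(R / G)          # per-gene log ratio
A = 0.5 * np.log2(R * G)    # per-gene average log intensity

# Z-score of M; genes exceeding 1.96 in every tissue comparison would be
# called cartilage-specific (only one comparison is simulated here).
z = (M - M.mean()) / M.std()
specific = z >= 1.96
print(f"{specific.sum()} of {len(M)} genes flagged at z >= 1.96")
```

The study's quantile-regression step then models percentiles of M as a function of A, rather than assuming a constant spread across intensities.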

    Graphical modeling of binary data using the LASSO: a simulation study

    Background: Graphical models were identified as a promising new approach to modeling high-dimensional clinical data. They provide a probabilistic tool to display, analyze and visualize net-like dependence structures by drawing a graph describing the conditional dependencies between the variables. Until now, the main focus of research was on building Gaussian graphical models for continuous multivariate data following a multivariate normal distribution. Satisfactory solutions for binary data were missing. We adapted the method of Meinshausen and Bühlmann to binary data and used the LASSO for logistic regression. The objective of this paper was to examine the performance of the Bolasso for the development of graphical models for high-dimensional binary data. We hypothesized that the performance of the Bolasso is superior to competing LASSO methods for identifying graphical models. Methods: We analyzed the Bolasso for deriving graphical models in comparison with other LASSO-based methods. Model performance was assessed in a simulation study with random data generated via symmetric local logistic regression models and Gibbs sampling. Main outcome variables were the Structural Hamming Distance and the Youden Index. We applied the results of the simulation study to real-life data on the functioning of patients with head and neck cancer. Results: Bootstrap aggregating as incorporated in the Bolasso algorithm greatly improved performance at higher sample sizes. The number of bootstraps had minimal impact on performance. The Bolasso performed reasonably well with a cutpoint of 0.90 and a small penalty term. Optimal prediction for the Bolasso led to very conservative models in comparison with AIC, BIC or cross-validated optimal penalty terms. Conclusions: Bootstrap aggregating may improve variable selection if the underlying selection process is not too unstable due to small sample size and if one is mainly interested in reducing the false discovery rate. We propose using the Bolasso for graphical modeling in large sample sizes.
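A minimal sketch of the Bolasso idea for one node's neighborhood, under the assumptions that an L1-penalized logistic regression is refit on bootstrap resamples and a predictor is kept only if its coefficient is nonzero in at least 90% of runs (the 0.90 cutpoint mentioned above); the data and the penalty value are illustrative, not the paper's simulation design:

```python
# Bolasso-style neighborhood selection for binary data: bootstrap
# the sample, fit L1-logistic regression, keep stable predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 300, 10
X = rng.binomial(1, 0.5, size=(n, p)).astype(float)
# The target node depends only on variables 0 and 1.
logit = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

n_boot, cutpoint = 50, 0.90
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_.ravel() != 0)

# Edges of the graph at this node: variables selected in >= 90% of runs.
selected = np.where(counts / n_boot >= cutpoint)[0]
print("selected neighbors:", selected)
```

Repeating this per node and symmetrizing the selected neighborhoods yields the estimated graph, in the spirit of Meinshausen-Bühlmann neighborhood selection.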

    Predicting risk for Alcohol Use Disorder using longitudinal data with multimodal biomarkers and family history: a machine learning study.

    Predictive models have succeeded in distinguishing between individuals with Alcohol Use Disorder (AUD) and controls. However, predictive models identifying who is prone to develop AUD, and the biomarkers indicating a predisposition to AUD, are still unclear. Our sample (n = 656) included offspring and non-offspring of European American (EA) and African American (AA) ancestry from the Collaborative Study of the Genetics of Alcoholism (COGA) who were recruited as early as age 12 and were unaffected at first assessment and reassessed years later as AUD (DSM-5) (n = 328) or unaffected (n = 328). Machine learning analysis was performed on 220 EEG measures, 149 alcohol-related single nucleotide polymorphisms (SNPs) from a recent large Genome-wide Association Study (GWAS) of alcohol use/misuse, and two family history features (mother DSM-5 AUD and father DSM-5 AUD), using a supervised linear Support Vector Machine (SVM) classifier to test which features assessed before developing AUD predict those who go on to develop AUD. Age-, gender-, and ancestry-stratified analyses were performed. Results indicate significantly higher accuracy rates for the AA compared with the EA prediction models, and a trend toward higher model accuracy among females compared with males for both ancestries. A combined EEG and SNP feature model outperformed models based on only EEG features or only SNP features for both EA and AA samples. This multidimensional superiority was confirmed in a follow-up analysis in the AA age groups (12-15, 16-19, 20-30) and the EA age group (16-19). In both ancestry samples, the youngest age group achieved a higher accuracy score than the two older age groups. Maternal AUD increased the model's accuracy in both ancestries' samples. Several discriminative EEG measures and SNP features were identified, including lower posterior gamma, higher slow-wave connectivity (delta, theta, alpha), higher frontal gamma ratio, higher beta correlation in the parietal area, and 5 SNPs: rs4780836, rs2605140, rs11690265, rs692854, and rs13380649. Results highlight the significance of sampling uniformity followed by stratified (e.g., ancestry, gender, developmental period) analysis, and wider selection of features, to generate better prediction scores allowing a more accurate estimation of AUD development.
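The classification step described above is a standard supervised linear SVM over multimodal features. A hedged sketch with random stand-in features (the real study used 220 EEG measures, 149 SNPs and 2 family-history indicators; the feature counts below mirror that, but the data, C value and informative-feature structure are made up):

```python
# Linear SVM with standardization and cross-validation, the generic
# shape of the prediction pipeline described in the abstract.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n, p = 656, 371                 # sample size and total feature count from the abstract
X = rng.normal(size=(n, p))
w = np.zeros(p)
w[:20] = 1.0                    # pretend a small subset of features is informative
y = (X @ w + rng.normal(size=n) > 0).astype(int)

clf = make_pipeline(StandardScaler(), LinearSVC(C=0.01, max_iter=5000))
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracy: %.3f" % scores.mean())
```

The study's stratified analyses would correspond to running this pipeline separately per ancestry, gender and age group, which is why sampling uniformity across strata matters.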

    Individualized markers optimize class prediction of microarray data

    BACKGROUND: Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient sample. RESULTS: Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene-features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets, thus highlighting the effect of individuality on gene expression. CONCLUSION: In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasizes the need for more flexible medical interventions.

    Scalable Rough Support Vector Clustering

    In this paper a novel scalable soft support vector clustering algorithm is proposed. Here softness is imparted to the Support Vector Clustering paradigm by employing rough set theory, and scalability is achieved using the Multi-Sphere Support Vector Clustering method. Empirical results show that the proposed method gives meaningful cluster abstractions.

    Predictive Approaches for Sparse Model Learning

    In this paper we investigate cross-validation and Geisser's sample reuse approaches for designing linear regression models. These approaches generate sparse models by optimizing multiple smoothing parameters. Within certain approximations, we establish equivalence relationships among these approaches. The computational complexity, sparseness and performance on some benchmark data sets are compared with those obtained using the relevance vector machine.
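The comparison described above can be sketched in miniature: leave-one-out cross-validation (the simplest form of Geisser's predictive sample reuse) to choose a single ridge smoothing parameter, next to an automatic-relevance-determination model, scikit-learn's closest relative of the relevance vector machine. The data, alpha grid and sparsity pattern are assumptions for illustration:

```python
# Leave-one-out CV for a smoothing parameter vs. an ARD sparse model.
import numpy as np
from sklearn.linear_model import ARDRegression, RidgeCV

rng = np.random.default_rng(4)
n, p = 100, 15
X = rng.normal(size=(n, p))
coef = np.zeros(p)
coef[:3] = [2.0, -1.5, 1.0]          # truly sparse underlying signal
y = X @ coef + rng.normal(0, 0.5, size=n)

# RidgeCV with the default cv=None uses efficient leave-one-out CV
# over the alpha grid -- one smoothing parameter chosen predictively.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
# ARD learns one precision per coefficient, pruning irrelevant ones.
ard = ARDRegression().fit(X, y)

print("LOO-chosen alpha:", ridge.alpha_)
print("ARD coefficients above 0.1:", int(np.sum(np.abs(ard.coef_) > 0.1)))
```

The paper's setting differs in that each feature gets its own smoothing parameter optimized by sample reuse, which is what makes the resulting linear models sparse.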

    Scalable non-linear Support Vector Machine using hierarchical clustering

    This paper discusses a method for scaling SVM with a Gaussian kernel function to handle large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures of the training data in the kernel-induced feature space. These are then used for selective sampling of the training data for the SVM, imparting scalability to the training process. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.
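The selective-sampling strategy can be sketched as: cluster the training data, keep one representative per cluster, and train an RBF SVM on the reduced set. The sketch below uses plain agglomerative clustering in input space on a toy two-moons problem; the paper's scalable indexing structure in the kernel-induced feature space is different, so this only illustrates the overall shape of the idea:

```python
# Selective sampling via clustering, then RBF-SVM training on the
# per-cluster representatives instead of the full training set.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.15, random_state=5)

reps, rep_labels = [], []
for label in (0, 1):                      # cluster within each class
    Xc = X[y == label]
    k = 30
    clu = AgglomerativeClustering(n_clusters=k).fit(Xc)
    for c in range(k):
        members = Xc[clu.labels_ == c]
        center = members.mean(axis=0)     # representative: member nearest
        nearest = np.argmin(((members - center) ** 2).sum(axis=1))
        reps.append(members[nearest])     # the cluster mean
        rep_labels.append(label)

Xs, ys = np.array(reps), np.array(rep_labels)
svm = SVC(kernel="rbf", gamma="scale").fit(Xs, ys)  # train on 60, not 600
print("accuracy on full set:", round(svm.score(X, y), 3))
```

Training cost for a kernel SVM grows superlinearly in the number of samples, so shrinking 600 points to 60 well-spread representatives is where the scalability comes from.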