52 research outputs found

    Using Interior Point Methods for Large-scale Support Vector Machine training

    Get PDF
    Support Vector Machines (SVMs) are powerful machine learning techniques for classification and regression, but the training stage involves a convex quadratic optimization program that is most often computationally expensive. Traditionally, active-set methods have been used rather than interior point methods, due to the Hessian in the standard dual formulation being completely dense. But as active-set methods are essentially sequential, they may not be adequate for machine learning challenges of the future. Additionally, training time may be limited, or data may grow so large that cluster-computing approaches need to be considered. Interior point methods have the potential to answer these concerns directly. They scale efficiently, they can provide good early approximations, and they are suitable for parallel and multi-core environments. To apply them to SVM training, it is necessary to address directly the most computationally expensive aspect of the algorithm. We therefore present an exact reformulation of the standard linear SVM training optimization problem that exploits separability of terms in the objective. By so doing, per-iteration computational complexity is reduced from O(n3) to O(n). We show how this reformulation can be applied to many machine learning problems in the SVM family. Implementation issues relating to specializing the algorithm are explored through extensive numerical experiments. They show that the performance of our algorithm for large dense or noisy data sets is consistent and highly competitive, and in some cases can out perform all other approaches by a large margin. Unlike active set methods, performance is largely unaffected by noisy data. We also show how, by exploiting the block structure of the augmented system matrix, a hybrid MPI/Open MP implementation of the algorithm enables data and linear algebra computations to be efficiently partitioned amongst parallel processing nodes in a clustered computing environment. The applicability of our technique is extended to nonlinear SVMs by low-rank approximation of the kernel matrix. We develop a heuristic designed to represent clusters using a small number of features. Additionally, an early approximation scheme reduces the number of samples that need to be considered. Both elements improve the computational efficiency of the training phase. Taken as a whole, this thesis shows that with suitable problem formulation and efficient implementation techniques, interior point methods are a viable optimization technology to apply to large-scale SVM training, and are able to provide state-of-the-art performance

    Automatic Generation of Story Highlights

    Get PDF
    In this paper we present a joint content selection and compression model for single-document summarization. The model operates over a phrase-based representation of the source document which we obtain by merging information from PCFG parse trees and dependency graphs. Using an integer linear programming formulation, the model learns to select and combine phrases subject to length, coverage and grammar constraints. We evaluate the approach on the task of generating “story highlights”—a small number of brief, self-contained sentences that allow readers to quickly gather information on news stories. Experimental results show that the model’s output is comparable to human-written highlights in terms of both grammaticality and content.

    Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training

    Get PDF

    Multiple aspect summarization using integer linear programming

    Get PDF
    Multi-document summarization involves many aspects of content selection and sur-face realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and signifi-cantly improves the informativeness of the summaries.

    Title Generation with Quasi-Synchronous Grammar

    Get PDF
    The task of selecting information and rendering it appropriately appears in multiple contexts in summarization. In this paper we present a model that simultaneously optimizes selection and rendering preferences. The model operates over a phrase-based representation of the source document which we obtain by merging PCFG parse trees and dependency graphs. Selection preferences for individual phrases are learned discriminatively, while a quasi-synchronous grammar (Smith and Eisner, 2006) captures rendering preferences such as paraphrases and compressions. Based on an integer linear programming formulation, the model learns to generate summaries that satisfy both types of preferences, while ensuring that length, topic coverage and grammar constraints are met. Experiments on headline and image caption generation show that our method obtains state-of-the-art performance using essentially the same model for both tasks without any major modifications.

    An ontology enhanced parallel SVM for scalable spam filter training

    Get PDF
    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

    Development of intra-oral automated landmark recognition (ALR) for dental and occlusal outcome measurements

    Get PDF
    BACKGROUND: Previous studies embracing digital technology and automated methods of scoring dental arch relationships have shown that such technology is valid and accurate. To date, however there is no published literature on artificial intelligence and machine learning to completely automate the process of dental landmark recognition. OBJECTIVES: This study aimed to develop and evaluate a fully automated system and software tool for the identification of landmarks on human teeth using geometric computing, image segmenting, and machine learning technology. METHODS: Two hundred and thirty-nine digital models were used in the automated landmark recognition (ALR) validation phase, 161 of which were digital models from cleft palate subjects aged 5 years. These were manually annotated to facilitate qualitative validation. Additionally, landmarks were placed on 20 adult digital models manually by 3 independent observers. The same models were subjected to scoring using the ALR software and the differences (in mm) were calculated. All the teeth from the 239 models were evaluated for correct recognition by the ALR with a breakdown to find which stages of the process caused the errors. RESULTS: The results revealed that 1526 out of 1915 teeth (79.7%) were correctly identified, and the accuracy validation gave 95% confidence intervals for the geometric mean error of [0.285, 0.317] for the humans and [0.269, 0.325] for ALR—a negligible difference. CONCLUSIONS/IMPLICATIONS: It is anticipated that ALR software tool will have applications throughout clinical dentistry and anthropology, and in research will constitute an accurate and objective tool for handling large datasets without the need for time intensive employment of experts to place landmarks manually
    corecore