15,620 research outputs found

    Word Combination Kernel for Text Classification with Support Vector Machines

    Get PDF
    In this paper we propose a novel kernel for text categorization. This kernel is an inner product defined in the feature space generated by all word combinations of specified length. A word combination is a collection of unique words co-occurring in the same sentence. The word combination of length k is weighted by the k rm th root of the product of the inverse document frequencies (IDF) of its words. By discarding word order, the word combination features are more compatible with the flexibility of natural language and the feature dimensions of documents can be reduced significantly to improve the sparseness of feature representations. By restricting the words to the same sentence and considering multi-word combinations, the word combination features can capture similarity at a more specific level than single words. A computationally simple and efficient algorithm was proposed to calculate this kernel. We conducted a series of experiments on the Reuters-21578 and 20 Newsgroups datasets. This kernel achieves better performance than the word kernel and word-sequence kernel. We also evaluated the computing efficiency of this kernel and observed the impact of the word combination length on performance

    Designing Semantic Kernels as Implicit Superconcept Expansions

    Get PDF
    Recently, there has been an increased interest in the exploitation of background knowledge in the context of text mining tasks, especially text classification. At the same time, kernel-based learning algorithms like Support Vector Machines have become a dominant paradigm in the text mining community. Amongst other reasons, this is also due to their capability to achieve more accurate learning results by replacing standard linear kernel (bag-of-words) with customized kernel functions which incorporate additional apriori knowledge. In this paper we propose a new approach to the design of ‘semantic smoothing kernels’ by means of an implicit superconcept expansion using well-known measures of term similarity. The experimental evaluation on two different datasets indicates that our approach consistently improves performance in situations where (i) training data is scarce or (ii) the bag-ofwords representation is too sparse to build stable models when using the linear kernel

    Automated Detection of Usage Errors in non-native English Writing

    Get PDF
    In an investigation of the use of a novelty detection algorithm for identifying inappropriate word combinations in a raw English corpus, we employ an unsupervised detection algorithm based on the one- class support vector machines (OC-SVMs) and extract sentences containing word sequences whose frequency of appearance is significantly low in native English writing. Combined with n-gram language models and document categorization techniques, the OC-SVM classifier assigns given sentences into two different groups; the sentences containing errors and those without errors. Accuracies are 79.30 % with bigram model, 86.63 % with trigram model, and 34.34 % with four-gram model

    GPstruct: Bayesian structured prediction using Gaussian processes

    Get PDF
    We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M ^3 N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct

    The Role of Text Pre-processing in Sentiment Analysis

    Get PDF
    It is challenging to understand the latest trends and summarise the state or general opinions about products due to the big diversity and size of social media data, and this creates the need of automated and real time opinion extraction and mining. Mining online opinion is a form of sentiment analysis that is treated as a difficult text classification task. In this paper, we explore the role of text pre-processing in sentiment analysis, and report on experimental results that demonstrate that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved. The level of accuracy achieved is shown to be comparable to the ones achieved in topic categorisation although sentiment analysis is considered to be a much harder problem in the literature
    • …
    corecore