2,939 research outputs found

    An ontology enhanced parallel SVM for scalable spam filter training

    Get PDF
    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

    Clustering Analysis within Text Classification Techniques

    Get PDF
    The paper represents a personal approach upon the main applications of classification which are presented in the area of knowledge based society by means of methods and techniques widely spread in the literature. Text classification is underlined in chapter two where the main techniques used are described, along with an integrated taxonomy. The transition is made through the concept of spatial representation. Having the elementary elements of geometry and the artificial intelligence analysis, spatial representation models are presented. Using a parallel approach, spatial dimension is introduced in the process of classification. The main clustering methods are described in an aggregated taxonomy. For an example, spam and ham words are clustered and spatial represented, when the concepts of spam, ham and common and linkage word are presented and explained in the xOy space representation.Knowledge Societies, Text Classification, Spatial Representation, Artificial Intelligence, Clustering Analysis, Spam Filtering

    Ranking News-Quality Multimedia

    Full text link
    News editors need to find the photos that best illustrate a news piece and fulfill news-media quality standards, while being pressed to also find the most recent photos of live events. Recently, it became common to use social-media content in the context of news media for its unique value in terms of immediacy and quality. Consequently, the amount of images to be considered and filtered through is now too much to be handled by a person. To aid the news editor in this process, we propose a framework designed to deliver high-quality, news-press type photos to the user. The framework, composed of two parts, is based on a ranking algorithm tuned to rank professional media highly and a visual SPAM detection module designed to filter-out low-quality media. The core ranking algorithm is leveraged by aesthetic, social and deep-learning semantic features. Evaluation showed that the proposed framework is effective at finding high-quality photos (true-positive rate) achieving a retrieval MAP of 64.5% and a classification precision of 70%.Comment: To appear in ICMR'1

    Data Sets: Word Embeddings Learned from Tweets and General Data

    Full text link
    A word embedding is a low-dimensional, dense and real- valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually gener- ated from a large text corpus. The embedding of a word cap- tures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks

    Feature extraction and classification of movie reviews

    Get PDF

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
    corecore