48 research outputs found

    Automatic classification of news articles in Spanish

    We apply machine learning techniques to the automatic classification of news articles from the local newspaper La Capital of Rosario, Argentina. The corpus (LCC) is an archive of approximately 75,000 manually categorized articles in Spanish published in 1991. We benchmark three widely used supervised learning methods on LCC: k-Nearest Neighbors, Naïve Bayes and Artificial Neural Networks, illustrating the corpus properties. Track: V - Workshop de agentes y sistemas inteligentes. Red de Universidades con Carreras en Informática.
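
    A minimal sketch of this kind of benchmark using scikit-learn, with a handful of placeholder documents standing in for the ~75,000-article LCC corpus (not reproduced here); the vectorizer, split and hyperparameters are illustrative assumptions, not the paper's setup.

```python
# Benchmark sketch: k-Nearest Neighbors, Naive Bayes and a small neural network
# on a toy stand-in for the LCC corpus. Replace `texts`/`labels` with real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

texts = ["economia regional en crisis", "el equipo gano el clasico",
         "nuevas tasas de interes", "el delantero marco dos goles",
         "inflacion y precios al consumidor", "la final del torneo local",
         "bancos y creditos hipotecarios", "resultado del partido de ayer"]
labels = ["economy", "sports", "economy", "sports",
          "economy", "sports", "economy", "sports"]

X = TfidfVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0, stratify=labels)

for clf in (KNeighborsClassifier(n_neighbors=3),
            MultinomialNB(),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, f1_score(y_te, clf.predict(X_te), average="macro"))
```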

    Classification of Web Contents with Artificial Neural Networks

    Recent developments and the widespread use of the Internet have made business processes faster and easier to complete in electronic media. The growing volume of stored, transferred and processed data brings many problems that affect access to information on the Web. Because users need to access information in electronic environments quickly, correctly and appropriately, different methods for classifying and categorizing data are strictly needed. Search engines must be supported with new approaches every day so that users can access relevant information quickly. In this study, a Multilayer Perceptron (MLP) artificial neural network model is used to classify web sites according to specified subjects. Software was developed to select the feature vector, to train the neural network and, finally, to categorize the web sites correctly. This intelligent approach is expected to provide Internet users with a more accurate and secure platform for classifying web contents precisely.
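
    As a hedged illustration of the pipeline this abstract outlines (feature-vector selection, MLP training, then categorization), the scikit-learn sketch below chains TF-IDF features, a chi-squared feature selector and an MLP; the layer size, selector and sample pages are assumptions, not the study's implementation.

```python
# Illustrative web-content classification pipeline: bag-of-words features,
# feature selection, then a multilayer perceptron, as the abstract describes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neural_network import MLPClassifier

pages = ["stock market rallies on earnings reports",
         "the striker scored twice in the derby",
         "central bank raises interest rates again",
         "coach praises the young midfielder"]
topics = ["finance", "sports", "finance", "sports"]

model = make_pipeline(
    TfidfVectorizer(),                 # page text -> feature vector
    SelectKBest(chi2, k="all"),        # on a real corpus, set k to e.g. 1000
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(pages, topics)
print(model.predict(["bond yields fall as markets open"]))
```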

    Using online linear classifiers to filter spam emails

    The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored on two anti-spam filtering benchmark corpora: PU1 and Ling-Spam. We study performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better with IG or DF than with Odds Ratio. It is further demonstrated that, when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. Both online classifiers are also demonstrated to perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers is very low, and they are easily updated adaptively. They outperform most published results while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
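
    scikit-learn ships a Perceptron but no Winnow, so the sketch below is a compact numpy implementation of Littlestone's promotion/demotion scheme over binary bag-of-words features; the toy data, alpha and threshold choice are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal Winnow: multiplicative weight updates on mistakes only.
import numpy as np

class Winnow:
    def __init__(self, n_features, alpha=2.0):
        self.w = np.ones(n_features)     # all weights start at 1
        self.alpha = alpha
        self.theta = n_features          # common threshold choice: feature count

    def predict(self, x):                # x: binary feature vector
        return int(self.w @ x > self.theta)

    def fit(self, X, y, epochs=5):       # abstract: few iterations suffice
        for _ in range(epochs):
            for x, label in zip(X, y):
                pred = self.predict(x)
                if pred == 0 and label == 1:
                    self.w[x > 0] *= self.alpha      # promote active features
                elif pred == 1 and label == 0:
                    self.w[x > 0] /= self.alpha      # demote active features

# Toy usage: 4 messages over 3 binary term features, label 1 = spam.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
clf = Winnow(n_features=3)
clf.fit(X, y)
print([clf.predict(x) for x in X])       # -> [1, 1, 0, 0]
```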

    Comparing Extant Story Classifiers: Results & New Directions

    Having access to a large set of stories is a necessary first step for robust and wide-ranging computational narrative modeling; happily, language data, including stories, are increasingly available in electronic form. Unhappily, the process of automatically separating stories from other forms of written discourse is not straightforward, and has resulted in a data collection bottleneck. Researchers have therefore sought to develop reliable, robust automatic algorithms for identifying story text mixed with other non-story text. In this paper we report on the reimplementation and experimental comparison of the two approaches to this task: Gordon's unigram classifier and Corman's semantic triplet classifier. We cross-analyze their performance on both Gordon's and Corman's corpora, discuss similarities, differences, and gaps in the performance of these classifiers, and point the way forward to improving these approaches.
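
    The paper's two classifiers are not reproduced here, but a generic unigram story/non-story classifier in the spirit of the approach attributed to Gordon might look like the sketch below; the learner, sample texts and labels are assumptions for illustration.

```python
# Unigram (bag-of-words) story classifier sketch: counts of single words only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["once upon a time a girl set out into the forest",
        "quarterly revenue grew four percent year over year"]
labels = [1, 0]                            # 1 = story, 0 = non-story

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),   # unigram counts only
    LogisticRegression(max_iter=1000),
)
clf.fit(docs, labels)
print(clf.predict(["and then the dragon turned toward the village"]))
```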

    Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

    Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. The bag-of-words (BoW) representation of such text is usually highly sparse and noisy, and text classification built on it yields poor performance. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services provided through SMS texts, which are written predominantly in Roman Urdu (an informal forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It subsequently employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments reveal that lexical normalization plus feature pooling achieves a significant improvement in classification performance over standard representations.
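
    As a rough illustration of the normalization idea (the paper's exact phonetic encoding and matching rules are not given here), the sketch below folds spelling variants of Roman Urdu terms into one canonical form using plain string similarity; the threshold and the greedy first-seen canonical choice are assumptions.

```python
# Lexical normalization sketch: cluster variant spellings onto a canonical term.
from difflib import SequenceMatcher

def normalize(tokens, threshold=0.7):
    canon = []                                   # canonical forms seen so far
    mapping = {}
    for tok in tokens:
        match = next((c for c in canon
                      if SequenceMatcher(None, tok, c).ratio() >= threshold), None)
        if match is None:
            canon.append(tok)                    # first spelling becomes canonical
            mapping[tok] = tok
        else:
            mapping[tok] = match                 # fold the variant into its cluster
    return mapping

# Variants of "bijli" (electricity) and "pani" (water) collapse together.
print(normalize(["bijli", "bijlee", "bijly", "pani", "paani"]))
```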

    Classification of medical prescriptions in Spanish

    This work describes the problem of classifying free-text medical prescriptions in Spanish, and proposes a solution based on two text classification algorithms, Multinomial Naïve Bayes (NBM) and Support Vector Machines (SVMs), justifying these choices and showing the results obtained with both methods. Track: XV Workshop de Agentes y Sistemas Inteligentes. Red de Universidades con Carreras de Informática (RedUNCI).
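
    A minimal sketch of the comparison the abstract describes, Multinomial Naïve Bayes against a linear SVM over free-text prescriptions; the sample texts and category labels are placeholders, not the paper's data.

```python
# NBM vs. SVM comparison sketch on placeholder prescription texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["ibuprofeno 400 mg cada 8 horas",
         "amoxicilina 500 mg por 7 dias",
         "control de presion arterial en un mes",
         "radiografia de torax frente y perfil"]
labels = ["medication", "medication", "follow-up", "imaging"]

X = TfidfVectorizer().fit_transform(texts)
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```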

    Classification of different datasets used to evaluate knowledge extraction methods created for the Web

    Several articles have used different test texts as input data to measure the performance of methods for extracting semantic relations from the Web (OIE): ReVerb and ClausIE. However, these texts have never been analyzed to understand whether or not they share similarities, whether they have a common language, or whether they belong to the same domain. The aim of this work is to analyze these texts using different classification algorithms and to determine whether they can be grouped coherently, so that one can identify a priori which texts work best with ClausIE and which with ReVerb. XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI).
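
    One hedged way to run the grouping experiment this abstract proposes: represent each benchmark text collection as a TF-IDF vector and cluster the collections, then inspect whether the groups align with the extractor each collection favors; the corpora and cluster count below are illustrative assumptions.

```python
# Grouping sketch: one TF-IDF vector per test corpus, then k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpora = ["encyclopedia sentences about science and history",
           "newswire sentences about politics and economics",
           "web forum sentences about sports and games"]   # one string per test set

X = TfidfVectorizer().fit_transform(corpora)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)        # corpora sharing a label form one candidate group
```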