    Automatically Assessing the Need of Additional Citations for Information Quality Verification in Wikipedia Articles

    Predicting quality flaws in Wikipedia is an ongoing research trend. In this work we tackle the problem of automatically assessing whether an article needs additional citations to verify its content, the so-called Refimprove quality flaw. This information quality flaw ranks among the five most frequent flaws and accounts for 12.4% of the flawed articles in the English Wikipedia. Three state-of-the-art approaches were evaluated: under-bagged decision trees, biased-SVM, and centroid-based balanced SVM. All three aim to handle the imbalance between the number of articles tagged as flawed and the remaining untagged documents in Wikipedia, since handling this imbalance can help in the learning stage of the algorithms. A uniformly sampled balanced SVM classifier was also evaluated as a baseline. The results showed that under-bagged decision trees with the min rule as aggregation method perform best, achieving an F1 score of 0.96 on the test corpus of the 1st International Competition on Quality Flaw Prediction in Wikipedia, a well-known uniform evaluation corpus in this research field. Biased-SVM also achieved an F1 score that outperforms previously published results. II Track de Gobierno Digital y Ciudades Inteligentes. Red de Universidades con Carreras en Informática.
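
    The under-bagging idea with min-rule aggregation mentioned in the abstract above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: a toy nearest-centroid scorer stands in for the decision trees, and every name here (`fit_base`, `prob_flawed`, `under_bag`, `predict_min_rule`) is hypothetical. It also assumes the untagged class is at least as large as the flawed class.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fit_base(flawed, untagged):
    """Toy base learner (assumption: stands in for a decision tree)."""
    return centroid(flawed), centroid(untagged)

def prob_flawed(model, x):
    """Pseudo-probability of 'flawed': closer to the flawed centroid -> higher."""
    c_flawed, c_untagged = model
    d_f, d_u = dist(x, c_flawed), dist(x, c_untagged)
    if d_f + d_u == 0:
        return 0.5
    return d_u / (d_f + d_u)

def under_bag(flawed, untagged, n_models=5, seed=0):
    """Train each model on all flawed examples plus an equally sized
    random sample of the (much larger) untagged class."""
    rng = random.Random(seed)
    return [fit_base(flawed, rng.sample(untagged, len(flawed)))
            for _ in range(n_models)]

def predict_min_rule(models, x):
    """Min rule: per class, take the minimum score over all models,
    then predict the class whose minimum is larger (1 = flawed)."""
    probs = [prob_flawed(m, x) for m in models]
    score_flawed = min(probs)
    score_untagged = min(1.0 - p for p in probs)
    return 1 if score_flawed > score_untagged else 0
```

    Because every bag sees the full minority class and only a balanced slice of the majority class, no single model is dominated by untagged articles; the min rule then labels an article as flawed only when even the most skeptical model still scores that class higher.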

    Data Mining Revision Controlled Document History Metadata for Automatic Classification

    Version-controlled documents provide a complete history of the changes to a document, from what was changed to who made the change, and much more. Using cluster analysis on several sets of manipulated data, this research examines the revision history of Wikipedia in an attempt to find language-independent patterns that could assist automatic page classification software. Applying this cluster analysis to two sample data sets yielded no conclusive evidence that such patterns exist. Our work on the software, however, provides a foundation for applying further types of data manipulation and more refined clustering algorithms in future research into such patterns.

    Evaluación de la calidad de la información de Wikipedia en español [Evaluating the Information Quality of the Spanish Wikipedia]

    This article briefly describes the research and development tasks being carried out to evaluate the quality of information on the Web within the project "Tools and mechanisms for decision making in artificial intelligent agents". In particular, the Spanish-language online encyclopedia Wikipedia has been taken as the primary case study. This topic enables interaction between the two research lines of the project, and it is being pursued jointly with researchers from Germany, Spain, and Mexico in the context of an FP7 project funded by the European Union. Eje: Agentes y Sistemas Inteligentes. Red de Universidades con Carreras en Informática (RedUNCI).

    On the Use of PU Learning for Quality Flaw Prediction in Wikipedia

    In this article we describe a new approach to quality flaw prediction in Wikipedia. The partially supervised method studied, called PU Learning, has been successfully applied in classification tasks with traditional corpora such as Reuters-21578 or 20-Newsgroups. To the best of our knowledge, this is the first time that it has been applied in this domain. Throughout this paper, we describe how the original PU Learning approach was evaluated for assessing quality flaws and the modifications introduced to obtain a quality flaws predictor, which achieved the best F1 scores in the task "Quality Flaw Prediction in Wikipedia" of the PAN challenge. Edgardo Ferretti and Marcelo Errecalde thank Universidad Nacional de San Luis (PROICO 30310). The collaboration of UNSL, INAOE and UPV has been funded by the European Commission as part of the WIQ-EI project (project no. 269180) within the FP7 People Programme. Manuel Montes is partially supported by CONACYT, No. 134186. The work of Paolo Rosso was also carried out in the framework of the MICINN Text-Enterprise (TIN2009-13391-C04-03) research project and the Microcluster VLC/Campus (International Campus of Excellence) on Multimodal Intelligent Systems. Ferretti, E.; Hernández Fusilier, D.; Guzmán Cabrera, R.; Montes Y Gómez, M.; Errecalde, M.; Rosso, P. (2012). On the Use of PU Learning for Quality Flaw Prediction in Wikipedia. CEUR Workshop Proceedings. 1178. http://hdl.handle.net/10251/46566
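
    To make the PU Learning idea above more concrete, here is a minimal sketch of a generic two-step scheme for learning from positive and unlabeled examples. It is not the method or the modifications described in the paper: the nearest-centroid classifier, the stopping criterion, and all names (`fit`, `predict`, `pu_learn`) are assumptions chosen purely for illustration.

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fit(positives, negatives):
    """Toy classifier (assumption): one centroid per class."""
    return centroid(positives), centroid(negatives)

def predict(model, x):
    """Return 1 if x is at least as close to the positive centroid, else 0."""
    c_pos, c_neg = model
    return 1 if dist(x, c_pos) <= dist(x, c_neg) else 0

def pu_learn(positives, unlabeled, max_iter=10):
    """Two-step PU scheme.
    Step 1: treat all unlabeled examples as negative and keep those the
            resulting classifier still rejects as 'reliable negatives'.
    Step 2: retrain on positives vs. reliable negatives, pruning the
            negative set until it stops shrinking."""
    model = fit(positives, unlabeled)
    reliable_neg = [x for x in unlabeled if predict(model, x) == 0]
    for _ in range(max_iter):
        if not reliable_neg:
            break
        model = fit(positives, reliable_neg)
        kept = [x for x in reliable_neg if predict(model, x) == 0]
        if len(kept) == len(reliable_neg):
            break
        reliable_neg = kept
    return model, reliable_neg
```

    In the Wikipedia setting, the positives would be articles tagged with a given flaw and the unlabeled set would be the untagged articles, some of which may carry the flaw without being tagged.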

    Towards Information Quality Assurance in Spanish: Wikipedia

    Featured Articles (FA) are considered to be the best articles that Wikipedia has to offer, and in recent years researchers have found it interesting to analyze whether and how they can be distinguished from "ordinary" articles. Likewise, identifying which issues have to be enhanced or fixed in ordinary articles in order to improve their quality is a recent key research trend. Most of the approaches developed to address these information quality problems have been proposed for the English Wikipedia. However, few efforts have been made for the Spanish Wikipedia, despite Spanish being one of the most spoken languages in the world by number of native speakers. In this respect, we present a breakdown of Spanish Wikipedia's quality flaw structure. In addition, we carry out studies with three different corpora to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. Our evaluation in a unified setting allows the performance achieved by our approach on the Spanish version to be compared with that on the English version. The best results obtained show that FA identification in Spanish can be performed with an F1 score of 0.88 using a document model consisting of only twenty-six features and a Support Vector Machine as the classification algorithm. Facultad de Informática.


    Modeling Non-Standard Text Classification Tasks

    Text classification deals with discovering knowledge in texts and is used for extracting, filtering, or retrieving information in streams and collections. The discovery of knowledge is operationalized by modeling text classification tasks, which is mainly a human-driven engineering process. The outcome of this process, a text classification model, is used to inductively learn a text classification solution from a priori classified examples. The building blocks of modeling text classification tasks cover four aspects: (1) the way examples are represented, (2) the way examples are selected, (3) the way classifiers learn from examples, and (4) the way models are selected. This thesis proposes methods that improve the prediction quality of text classification solutions for unseen examples, especially for non-standard tasks where standard models do not fit. The original contributions are related to the aforementioned building blocks: (1) Several topic-orthogonal text representations are studied in the context of non-standard tasks, and a new representation, namely co-stems, is introduced. (2) A new active learning strategy that goes beyond standard sampling is examined. (3) A new one-class ensemble for improving the effectiveness of one-class classification is proposed. (4) A new model selection framework to cope with subclass distribution shifts that occur in dynamic environments is introduced.

    XXV Congreso Argentino de Ciencias de la Computación - CACIC 2019: libro de actas [XXV Argentine Congress of Computer Science - CACIC 2019: proceedings]

    Papers presented at the XXV Congreso Argentino de Ciencias de la Computación (CACIC), held in the city of Río Cuarto from October 14 to 18, 2019, and organized by the Red de Universidades con Carreras en Informática (RedUNCI) and the Facultad de Ciencias Exactas, Físico-Químicas y Naturales, Universidad Nacional de Río Cuarto.