34 research outputs found

    On the Assessment of Information Quality in Spanish Wikipedia

    Featured Articles (FA) are considered the best articles that Wikipedia has to offer, and in recent years researchers have investigated whether and how they can be distinguished from “ordinary” articles. Likewise, identifying which issues must be fixed or improved in ordinary articles to raise their quality is a key recent research trend. Most of the approaches developed along these lines have been proposed for the English Wikipedia. However, little work has been done on the Spanish Wikipedia, even though Spanish is one of the most widely spoken languages in the world by number of native speakers. In this respect, we present a first breakdown of Spanish Wikipedia’s quality flaw structure. In addition, we carry out a study to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. The results show that FA identification can be performed with an F1 score of 0.81, using a document model consisting of only twenty-six features and AdaBoosted C4.5 decision trees as the classification algorithm.
    XIII Workshop Bases de datos y Minería de Datos (WBDMD), Red de Universidades con Carreras en Informática (RedUNCI)
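    For readers unfamiliar with the metric, the F1 score reported above is the harmonic mean of precision and recall for the positive class (here, "featured"). The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `f1_score` and the toy labels are assumptions made for illustration.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """F1 for a binary task: harmonic mean of precision and recall."""
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1  # correctly flagged as featured
        elif guess == positive:
            fp += 1  # flagged as featured, but ordinary
        elif truth == positive:
            fn += 1  # featured article that was missed
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = featured article, 0 = ordinary article.
truth = [1, 1, 0, 0]
predicted = [1, 0, 1, 0]
print(f1_score(truth, predicted))
```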

    Incremental Information Gain Analysis of Input Attribute Impact on RBF-Kernel SVM Spam Detection

    The massive increase in spam poses a serious threat to email and SMS, which have become important means of communication. Spam not only annoys users but also constitutes a security threat. Machine learning techniques have been widely used for spam detection. Email spam can be detected from the sender's behaviour, the content of the email, its subject, its source address and so on, while SMS spam detection is usually based on the tokens or features of the message because the content is short. However, a comprehensive analysis of email/SMS content may provide cues that help users become aware of email/SMS spam, since we cannot depend completely on automatic tools to identify all of it. In this paper, we propose an analysis approach based on information entropy and incremental learning to examine how various features affect the performance of an RBF-kernel SVM spam detector, so as to increase awareness of spam through its characteristic features. The experiments were carried out on the Spambase and SMS Spam Collection datasets in the UCI Machine Learning Repository. The results show that some features have a significant impact on spam detection, of which users should be aware, and that there exists a feature space that achieves Pareto efficiency in True Positive Rate and True Negative Rate.
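    The entropy-based part of the analysis described above can be illustrated with information gain, the reduction in label entropy obtained by splitting on a feature. This is a minimal sketch under stated assumptions, not the paper's implementation: it handles only discrete features, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = message contains the word "free", 0 = it does not.
contains_free = [1, 1, 1, 0, 0, 0, 0, 1]
is_spam = [1, 1, 1, 0, 0, 0, 0, 0]
print(information_gain(contains_free, is_spam))
```

    Ranking features by this score and adding them to the classifier one at a time is one plausible reading of the "incremental" analysis; the abstract does not spell out the exact procedure.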

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component of the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, methods that have been suggested for preventing the introduction of spam into a blog, and spam research focused on the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.

    A survey of app store analysis for software engineering

    App Store Analysis studies information about applications obtained from app stores. App stores provide a wealth of information derived from users that would not exist had the applications been distributed via previous software deployment methods. App Store Analysis combines this non-technical information with technical information to learn trends and behaviours within these forms of software repositories. Findings from App Store Analysis have a direct and actionable impact on the software teams that develop software for app stores, and have led to techniques for requirements engineering, release planning, software design, security and testing. This survey describes and compares the areas of research that have been explored thus far, drawing out common aspects, trends and directions future research should take to address open problems and challenges

    Investigating and Validating Scam Triggers: A Case Study of a Craigslist Website

    The internet and digital infrastructure play an important role in our day-to-day lives and have a huge impact on organizations and on how we conduct business transactions. Online business is booming in the 21st century, and many online platforms enable sellers and buyers to transact with each other. People can sell and purchase products, including vehicles, clothes and shoes, from anywhere and at any time. The purpose of this study is to identify and validate scam triggers using Craigslist as a case study. Craigslist is one of the websites where people can post advertisements to sell and buy personal belongings online. However, as the number of people buying and selling grows, new threats and scams are created daily. Private cars are among the most significant items sold and purchased on Craigslist, and the large number of vehicles traded there has drawn scammers, who use the forum to cheat others and exploit the vulnerable. The study identified online scam triggers, including bad keywords, dealers posing as owners, personal email addresses, multiple locations, rogue pictures and voice over IP, to detect online scams that exist on Craigslist. The study also identified over 360 ads on Craigslist based on our scam triggers. Finally, the study validated each of the scam triggers and found that 53.31% of our data is likely to be considered a scam.

    Towards Information Quality Assurance in Spanish Wikipedia

    Featured Articles (FA) are considered the best articles that Wikipedia has to offer, and in recent years researchers have investigated whether and how they can be distinguished from “ordinary” articles. Likewise, identifying which issues must be fixed or improved in ordinary articles to raise their quality is a key recent research trend. Most of the approaches developed to address these information quality problems have been proposed for the English Wikipedia. However, little work has been done on the Spanish Wikipedia, even though Spanish is one of the most widely spoken languages in the world by number of native speakers. In this respect, we present a breakdown of Spanish Wikipedia’s quality flaw structure. In addition, we carry out studies with three different corpora to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. Our evaluation in a unified setting allows the performance achieved by our approach on the Spanish version to be compared with that on the English version. The best results show that FA identification in Spanish can be performed with an F1 score of 0.88, using a document model consisting of only twenty-six features and a Support Vector Machine as the classification algorithm.
    Facultad de Informática

    Quality Flaws Prediction in Wikipedia by Using Deep Learning Approaches

    Quality flaw prediction in Wikipedia is an ongoing research trend. In this work we tackle the problem of automatically predicting four of the ten most frequent quality flaws, namely No footnotes, Notability, Primary Sources and Refimprove. Several state-of-the-art deep learning approaches were evaluated on the test corpus from the 1st International Competition on Quality Flaw Prediction in Wikipedia, a well-known uniform evaluation corpus in this research field. The results show that TabNet matches or improves on the existing benchmarks for the Notability and Refimprove flaws, and performs very competitively on the remaining two flaws.
    XIX Workshop base de datos y Minería de datos (WBDMD), Red de Universidades con Carreras en Informática