324 research outputs found

    Cross-language learning from bots and users to detect vandalism on Wikipedia

    Vandalism, the malicious modification of articles, is a serious problem for open access encyclopedias such as Wikipedia. The use of counter-vandalism bots is changing the way Wikipedia identifies and bans vandals, but their contributions are often neither considered nor discussed. In this paper, we propose novel text features capturing the invariants of vandalism across five languages to learn and compare the contributions of bots and users in the task of identifying vandalism. We construct computationally efficient features that highlight the contributions of bots and users, and generalize across languages. We evaluate our proposed features through classification performance on revisions of five Wikipedia languages, totaling over 500 million revisions of over nine million articles. As a comparison, we evaluate these features on the small PAN Wikipedia vandalism data sets, used by previous research, which contain approximately 62,000 revisions. We show differences in the performance of our features on the PAN and the full Wikipedia data set. With the appropriate text features, vandalism bots can be effective across different languages while learning from only one language. Our ultimate aim is to build the next generation of vandalism detection bots based on machine learning approaches that can work effectively across many languages.

    Damage Detection and Mitigation in Open Collaboration Applications

    Collaborative functionality is changing the way information is amassed, refined, and disseminated in online environments. A subclass of these systems, characterized by open collaboration, uniquely allows participants to *modify* content with low barriers to entry. A prominent example and our case study, English Wikipedia, exemplifies the vulnerabilities: 7%+ of its edits are blatantly unconstructive. Our measurement studies show this damage manifests in novel socio-technical forms, limiting the effectiveness of computational detection strategies from related domains. In turn, this has made much mitigation the responsibility of a poorly organized and ill-routed human workforce. We aim to improve all facets of this incident response workflow. Complementing language-based solutions, we first develop content-agnostic predictors of damage. We implicitly glean reputations for system entities and overcome sparse behavioral histories with a spatial reputation model combining evidence from multiple granularities. We also identify simple yet indicative metadata features that capture participatory dynamics and content maturation. When brought to bear over damage corpora, our contributions: (1) advance benchmarks over a broad set of security issues (vandalism), (2) perform well in the first anti-spam specific approach, and (3) demonstrate their portability over diverse open collaboration use cases. Probabilities generated by our classifiers can also intelligently route human assets using prioritization schemes optimized for capture rate or impact minimization. Organizational primitives are introduced that improve workforce efficiency. These strategies are then implemented in a tool (STiki) that has been used to revert 350,000+ damaging instances from Wikipedia. These uses are analyzed to learn about human aspects of the edit review process, including properties such as scalability, motivation, and latency.
    Finally, we conclude by measuring practical impacts of the work, discussing how to better integrate our solutions, and revealing outstanding vulnerabilities that speak to research challenges for open collaboration security.

    Interpretable Classification of Wiki-Review Streams

    Wiki articles are created and maintained by a crowd of editors, producing a continuous stream of reviews. Reviews can take the form of additions, reverts, or both. This crowdsourcing model is exposed to manipulation since neither reviews nor editors are automatically screened and purged. To protect articles against vandalism or damage, the stream of reviews can be mined to classify reviews and profile editors in real-time. The goal of this work is to anticipate and explain which reviews to revert. This way, editors are informed why their edits will be reverted. The proposed method employs stream-based processing, updating the profiling and classification models on each incoming event. The profiling uses side and content-based features employing Natural Language Processing, and editor profiles are incrementally updated based on their reviews. Since the proposed method relies on self-explainable classification algorithms, it is possible to understand why a review has been classified as a revert or a non-revert. In addition, this work contributes an algorithm for generating synthetic data for class balancing, making the final classification fairer. The proposed online method was tested with a real data set from Wikivoyage, which was balanced through the aforementioned synthetic data generation. The results attained near-90% values for all evaluation metrics (accuracy, precision, recall, and F-measure).

    Wikipedia and Digital Currencies: Interplay Between Collective Attention and Market Performance

    The production and consumption of information about Bitcoin and other digital-, or 'crypto'-, currencies have grown together with their market capitalisation. However, a systematic investigation of the relationship between online attention and market dynamics, across multiple digital currencies, is still lacking. Here, we quantify the interplay between the attention towards digital currencies in Wikipedia and their market performance. We consider the entire edit history of currency-related pages, and their view history from July 2015. First, we quantify the evolution of the cryptocurrency presence in Wikipedia by analysing the editorial activity and the network of co-edited pages. We find that a small community of tightly connected editors is responsible for most of the production of information about cryptocurrencies in Wikipedia. Then, we show that a simple trading strategy informed by Wikipedia views performs better, in terms of returns on investment, than classic baseline strategies for most of the covered period. Our results contribute to the recent literature on the interplay between online information and investment markets, and we anticipate it will be of interest for researchers as well as investors

    Detecting vandalism on Wikipedia across multiple languages

    Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions for research show vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research into improving vandalism detection through application of machine learning techniques has shown significant improvements to detection rates of a wider variety of vandalism. However, the focus of research is often only on the English Wikipedia, which has led us to develop a novel research area of cross-language vandalism detection (CLVD). CLVD provides a solution to detecting vandalism across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism cases across languages that may have insufficient identified cases to build learning models. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to multiple languages, and (2) extensibility of vandalism detection models trained in one language to other languages without significant loss in detection rate. In addition, other important challenges of vandalism detection are (3) high detection rate of a variety of known vandalism types, (4) scalability to the size of Wikipedia in the number of revisions, and (5) ability to incorporate and generate multiple types of data that characterise vandalism. In this thesis, we present our research into CLVD on Wikipedia, where we identify gaps and problems in existing vandalism detection techniques.
    To begin our thesis, we introduce the problem of vandalism on Wikipedia with motivating examples, and then present a review of the literature. From this review, we identify and address the following research gaps. First, we propose techniques for summarising the user activity of articles and comparing the knowledge coverage of articles across languages. Second, we investigate CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, we propose new text features that are more suitable for CLVD than text features from the literature. Fourth, we propose a novel context-aware vandalism detection technique for sneaky types of vandalism that may not be detectable through constructing features. Finally, to show that our techniques of detecting malicious activities are not limited to Wikipedia, we apply our feature sets to detecting malicious attachments and URLs in spam emails. Overall, our ultimate aim is to build the next generation of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia.

    A Wikipedia Literature Review

    This paper was originally designed as a literature review for a doctoral dissertation focusing on Wikipedia. This exposition gives the structure of Wikipedia and the latest trends in Wikipedia research

    Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia

    Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since for a huge number of people around the world it has become a habit to first visit Wikipedia in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles, or else deals with the classification of content into high-quality or low-quality. This thesis goes further, it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows: (1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw. (2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and, (c) the extent of flawed content. (3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. 
    Our analysis reveals (a) how the incidence and the extent of flaws have evolved, and (b) how the handling and the perception of flaws have changed over time. (4) We are the first to operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.

    Automatically Characterizing Product and Process Incentives in Collective Intelligence

    Social media facilitate interaction and information dissemination among an unprecedented number of participants. Why do users contribute, and why do they contribute to a specific venue? Does the information they receive cover all relevant points of view, or is it biased? The substantial and increasing importance of online communication makes these questions more pressing, but also puts answers within reach of automated methods. I investigate scalable algorithms for understanding two classes of incentives which arise in collective intelligence processes. Product incentives exist when contributors have a stake in the information delivered to other users. I investigate product-relevant user behavior changes, algorithms for characterizing the topics and points of view presented in peer-produced content, and the results of a field experiment with a prediction market framework having associated product incentives. Process incentives exist when users find contributing to be intrinsically rewarding. Algorithms which are aware of process incentives predict the effect of feedback on where users will make contributions, and can learn about the structure of a conversation by observing when users choose to participate in it. Learning from large-scale social interactions allows us to monitor the quality of information and the health of venues, but also provides fresh insights into human behavior

    Automated Detection of Sockpuppet Accounts in Wikipedia

    Wikipedia is a free Internet-based encyclopedia that is built and maintained via the open-source collaboration of a community of volunteers. Wikipedia’s purpose is to benefit readers by acting as a widely accessible and free encyclopedia, a comprehensive written synopsis that contains information on all discovered branches of knowledge. The website has millions of pages that are maintained by thousands of volunteer editors. Unfortunately, given its open-editing format, Wikipedia is highly vulnerable to malicious activity, including vandalism, spam, undisclosed paid editing, etc. Malicious users often use sockpuppet accounts to circumvent a block or a ban imposed by Wikipedia administrators on the person’s original account. A sockpuppet is an “online identity used for the purpose of deception.” Usually, several sockpuppet accounts are controlled by a unique individual (or entity) called a puppetmaster. Currently, suspected sockpuppet accounts are manually verified by Wikipedia administrators, which makes the process slow and inefficient. The primary objective of this research is to develop an automated ML and neural-network-based system to recognize the patterns of sockpuppet accounts as early as possible and recommend suspension. We address the problem as a binary classification task and propose a set of new features to capture suspicious behavior that considers user activity and analyzes the contributed content. To comply with this work, we have focused on account-based and content-based features. Our solution was bifurcated into developing a strategy to automatically detect and categorize suspicious edits made by the same author from multiple accounts. We hypothesize that “you can hide behind the screen, but your personality can’t hide.” In addition to the above-mentioned method, we have also encountered the sequential nature of the work. 
    Therefore, we have extended our analysis with a Long Short-Term Memory (LSTM) model to track the sequential pattern of users’ writing styles. Throughout the research, we strive to automate the sockpuppet account detection system and develop tools to help the Wikipedia administration maintain the quality of articles. We tested our system on a dataset we built containing 17K accounts validated as sockpuppets. Experimental results show that our approach achieves an F1 score of 0.82 and outperforms other systems proposed in the literature. We plan to deliver our research to the Wikipedia authorities to integrate it into their existing system.