12 research outputs found

    Detecting vandalism on Wikipedia across multiple languages

    Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions studied in research suggest vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research into improving vandalism detection through the application of machine learning techniques has shown significant improvements in the detection rates of a wider variety of vandalism. However, the focus of research is often only on the English Wikipedia, which has led us to develop the novel research area of cross-language vandalism detection (CLVD). CLVD provides a solution to detecting vandalism across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism cases in languages that may have insufficient identified cases to build learning models. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to multiple languages, and (2) extensibility of vandalism detection models trained in one language to other languages without significant loss in detection rate. In addition, other important challenges of vandalism detection are (3) a high detection rate for a variety of known vandalism types, (4) scalability to the size of Wikipedia in the number of revisions, and (5) the ability to incorporate and generate multiple types of data that characterise vandalism. In this thesis, we present our research into CLVD on Wikipedia, where we identify gaps and problems in existing vandalism detection techniques. To begin the thesis, we introduce the problem of vandalism on Wikipedia with motivating examples, and then present a review of the literature. From this review, we identify and address the following research gaps. First, we propose techniques for summarising the user activity of articles and comparing the knowledge coverage of articles across languages. Second, we investigate CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, we propose new text features that are more suitable for CLVD than text features from the literature. Fourth, we propose a novel context-aware vandalism detection technique for sneaky types of vandalism that may not be detectable through feature construction. Finally, to show that our techniques for detecting malicious activities are not limited to Wikipedia, we apply our feature sets to detecting malicious attachments and URLs in spam emails. Overall, our ultimate aim is to build the next generation of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia.
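
    Operationally, the CLVD idea amounts to training a classifier on language-independent revision metadata in one language edition and applying it to another. Below is a minimal, hypothetical sketch of that workflow; the feature names, column names, and data frames are illustrative assumptions, not the thesis's actual feature set or models.

```python
# Minimal sketch of cross-language vandalism detection (CLVD) using
# language-independent metadata features. All feature and column names
# are illustrative assumptions, not the thesis's actual feature set.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

METADATA_FEATURES = [
    "is_anonymous",       # IP editor vs. registered account
    "comment_length",     # length of the edit summary
    "size_delta",         # bytes added or removed by the revision
    "hour_of_day",        # time of the edit
    "editor_edit_count",  # prior edits by the same editor
    "article_views",      # recent page views of the edited article
]

def train_clvd_model(train_df: pd.DataFrame) -> RandomForestClassifier:
    """Train on labelled revisions from one language edition."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(train_df[METADATA_FEATURES], train_df["is_vandalism"])
    return clf

# The core CLVD step: train on English revisions, then score revisions
# from another language edition without retraining. en_df and de_df are
# hypothetical pre-built feature tables with the columns listed above.
# model = train_clvd_model(en_df)
# scores = model.predict_proba(de_df[METADATA_FEATURES])[:, 1]
# print("cross-language AUC:", roc_auc_score(de_df["is_vandalism"], scores))
```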

    Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia

    Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since, for a huge number of people around the world, it has become a habit to visit Wikipedia first when they need information. Existing research on quality assessment in Wikipedia either investigates only small samples of articles or else deals with the classification of content as high quality or low quality. This thesis goes further: it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows: (1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcoming. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw. (2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and (c) the extent of flawed content. (3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved and (b) how the handling and the perception of flaws have changed over time. (4) We are the first to operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.
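
    A sketch of how the one-class formulation might look in code, with scikit-learn and plain TF-IDF features as stand-ins for the thesis's tailored quality flaw model: the classifier sees only positive (flawed) training examples, i.e. articles carrying a given cleanup tag, and flags unseen articles as inliers or outliers.

```python
# Minimal sketch of quality-flaw prediction as one-class classification.
# TF-IDF features are a stand-in; the thesis uses a tailored flaw model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def train_flaw_detector(flawed_texts: list[str]):
    """Fit on positive (flawed) examples only, e.g. articles that
    carry a specific cleanup tag such as {{advert}}."""
    vec = TfidfVectorizer(max_features=5000)
    X = vec.fit_transform(flawed_texts)
    ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X)
    return vec, ocsvm

def predict_flaw(vec, ocsvm, texts: list[str]) -> np.ndarray:
    """+1 = resembles the learned flaw class, -1 = outlier (no flaw)."""
    return ocsvm.predict(vec.transform(texts))
```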

    Implicit emotion detection in text

    In text, emotion can be expressed explicitly, using emotion-bearing words (e.g. happy, guilty), or implicitly, without emotion-bearing words. Existing approaches focus on the detection of explicitly expressed emotion in text. However, there are various ways to express and convey emotions without the use of these emotion-bearing words. For example, given the two sentences “The outcome of my exam makes me happy” and “I passed my exam”, both express happiness, with the first expressing it explicitly and the second implying it. In this thesis, we investigate implicit emotion detection in text. We propose a rule-based approach for implicit emotion detection, which can be used without labelled corpora for training. Our results show that our approach outperforms the lexicon matching method consistently and gives competitive performance in comparison to supervised classifiers. Given that emotions such as guilt and admiration often require the identification of blameworthiness and praiseworthiness, we also propose an approach for the detection of blame and praise in text, using an adapted psychology model, the Path Model of Blame. The lack of a benchmark dataset led us to construct a corpus containing comments on individuals’ emotional experiences, annotated as blame, praise, or other. Since implicit emotion detection might be useful for conflict-of-interest (CoI) detection in Wikipedia articles, we built a CoI corpus and explored various features, including linguistic and stylometric, presentation, bias, and emotion features. Our results show that emotion features are important when using Naïve Bayes, but the best performance is obtained with an SVM on linguistic and stylometric features only. Overall, we show that a rule-based approach can be used to detect implicit emotion in the absence of labelled data; it is feasible to adapt the psychology Path Model of Blame for blame/praise detection from text; and implicit emotion detection is beneficial for CoI detection in Wikipedia articles.
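
    As a toy illustration of the rule-based idea, the sketch below maps emotion-free event descriptions to emotions with hand-written patterns, reusing the abstract's own example sentence. The regex rules are purely illustrative; the thesis's rules are grounded in psychological models rather than keyword patterns.

```python
# Toy sketch of rule-based implicit emotion detection: patterns over
# event descriptions stand in for the thesis's psychologically grounded
# rules. No emotion-bearing words appear in the matched sentences.
import re

RULES = [
    (re.compile(r"\bpassed\b.*\bexam\b", re.I), "happiness"),
    (re.compile(r"\bfailed\b.*\bexam\b", re.I), "sadness"),
    (re.compile(r"\blost\b.*\bwallet\b", re.I), "distress"),
]

def detect_implicit_emotion(sentence: str) -> str:
    """Return the first emotion whose rule matches, else 'none'."""
    for pattern, emotion in RULES:
        if pattern.search(sentence):
            return emotion
    return "none"

print(detect_implicit_emotion("I passed my exam"))  # -> happiness
```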

    Characterization and Detection of Malicious Behavior on the Web

    Web platforms enable unprecedented speed and ease in the transmission of knowledge, and allow users to communicate and shape opinions. However, the safety, usability, and reliability of these platforms are compromised by the prevalence of online malicious behavior -- for example, 40% of users have experienced online harassment. Such behavior comes in the form of malicious users, such as trolls, sockpuppets, and vandals, and of misinformation, such as hoaxes and fraudulent reviews. This thesis presents research spanning two aspects of malicious behavior: characterization of its behavioral properties, and development of algorithms and models for detecting it. We characterize the behavior of malicious users and misinformation in terms of their activity, the temporal frequency of their actions, their network connections to other entities, the linguistic properties of how they write, and the community feedback they receive from others. We find several striking characteristics of malicious behavior that are very distinct from those of benign behavior. For instance, we find that vandals and fraudulent reviewers are faster in their actions compared to benign editors and reviewers, respectively. Hoax articles are long pieces of plain text that are less coherent and created by more recent editors, compared to non-hoax articles. We find that sockpuppets vary in their deceptiveness (i.e., whether they pretend to be different users) and their supportiveness (i.e., whether they support arguments of other sockpuppets controlled by the same user). We create a suite of feature-based and graph-based algorithms to efficiently distinguish malicious from benign behavior. We create the first vandal early-warning system, which accurately predicts vandals using very few edits. Next, based on the properties of Wikipedia articles, we develop a supervised machine learning classifier to predict whether an article is a hoax, and another that predicts whether a pair of accounts belongs to the same user, both with very high accuracy. We develop a graph-based decluttering algorithm that iteratively removes suspicious edges that malicious users use to masquerade as benign users, and which outperforms existing graph algorithms at detecting trolls. Finally, we develop an efficient graph-based algorithm to assess the fairness of all reviewers, the reliability of all ratings, and the goodness of all products simultaneously in a rating network, incorporating penalties for suspicious behavior. Overall, in this thesis, we develop a suite of five models and algorithms to accurately identify and predict several distinct types of malicious behavior -- namely vandals, hoaxes, sockpuppets, trolls, and fraudulent reviewers -- on multiple web platforms. The analysis leading to the algorithms develops an interpretable understanding of malicious behavior on the web.
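
    The final fairness/reliability/goodness algorithm lends itself to a compact sketch: the three quantities are defined in terms of one another and computed by iterating their updates over a bipartite reviewer-product rating graph. The update rules below are simplified assumptions in the spirit of that description (the suspicion penalties are omitted), not the thesis's exact formulation.

```python
# Simplified sketch of jointly scoring reviewer fairness, rating
# reliability, and product goodness on a rating network by iterating
# mutually recursive updates. Penalties for suspicious behavior,
# present in the thesis's algorithm, are omitted here.

def rate_network(ratings, n_iter=50):
    """ratings: list of (reviewer, product, score), score scaled to [-1, 1]."""
    reviewers = {u for u, _, _ in ratings}
    products = {p for _, p, _ in ratings}
    fairness = {u: 1.0 for u in reviewers}  # in [0, 1]
    goodness = {p: 0.0 for p in products}   # in [-1, 1]
    reliability = {}                        # in [0, 1]
    for _ in range(n_iter):
        # A rating is reliable if its reviewer is fair and the rating
        # agrees with the product's current goodness estimate.
        for u, p, s in ratings:
            reliability[(u, p)] = 0.5 * (fairness[u] + 1 - abs(s - goodness[p]) / 2)
        # A product's goodness is the reliability-weighted mean rating.
        for p in products:
            rs = [(reliability[(u, q)], s) for u, q, s in ratings if q == p]
            goodness[p] = sum(r * s for r, s in rs) / max(sum(r for r, _ in rs), 1e-9)
        # A reviewer's fairness is the mean reliability of their ratings.
        for u in reviewers:
            rels = [reliability[(v, q)] for v, q, _ in ratings if v == u]
            fairness[u] = sum(rels) / len(rels)
    return fairness, reliability, goodness

# Example: two agreeing reviewers and one who contradicts the consensus.
# ratings = [("alice", "p1", 1.0), ("bob", "p1", 1.0), ("mallory", "p1", -1.0)]
# f, r, g = rate_network(ratings)
```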

    Statistical models for the analysis of short user-generated documents: author identification for conversational documents

    In recent years, short user-generated documents have been gaining popularity on the Internet and attention in the research communities. Such documents are generated by users of various online services: platforms for instant messaging, real-time status posting, discussion, and review writing. Each of these services allows users to generate written texts with particular properties, which might require specific algorithms for their analysis. In this dissertation we present our work on analysing this kind of document. We conducted qualitative and quantitative studies to identify the properties that might allow for characterising them. We compared the properties of these documents with the properties of standard documents employed in the literature, such as newspaper articles, and defined a set of characteristics that are distinctive of documents generated online. We also observed two classes within online user-generated documents: conversational documents and those involving group discussions. We later focused on the class of conversational documents, which are short and spontaneous. We created a novel collection of real conversational documents retrieved online (e.g. from Internet Relay Chat) and distributed it as part of an international competition (PAN @ CLEF'12). The competition was about author characterisation, which is one of the possible studies of authorship attribution documented in the literature. Another field of study is authorship identification, which became our main topic of research. We approached the authorship identification problem in its closed-class variant. For each problem we employed documents from the collection we released and from a collection of Twitter messages, as representative of conversational or short user-generated documents. We demonstrated the unsuitability of standard authorship identification techniques for conversational documents and proposed novel methods capable of reaching better accuracy rates. As opposed to standard methods, which worked well only for a few authors, the proposed technique achieves significant results even for hundreds of users.
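
    As a concrete baseline for the closed-class setting, the sketch below trains a character n-gram classifier over short messages; character n-grams tolerate the noisy spelling of chat text better than word-level features. This is a generic hypothetical baseline, not the dissertation's proposed method.

```python
# Minimal sketch of closed-class authorship identification for short
# conversational documents. A generic character n-gram baseline, not
# the dissertation's proposed technique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_author_classifier(messages: list[str], authors: list[str]):
    """Each training message is labelled with its (known) author."""
    clf = make_pipeline(
        # Character 2-4-grams capture style despite typos and slang.
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(messages, authors)
    return clf

# Hypothetical usage with a labelled chat corpus:
# clf = build_author_classifier(train_msgs, train_authors)
# print(clf.predict(["brb, gotta restart the irc client"])[0])
```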

    Wiktionary: The Metalexicographic and the Natural Language Processing Perspective

    Dictionaries are the main reference works for our understanding of language. They are used by humans and likewise by computational methods. So far, the compilation of dictionaries has almost exclusively been the profession of expert lexicographers. The ease of collaboration on the Web and the rising initiatives for collecting open-licensed knowledge, such as Wikipedia, have given rise to a new type of dictionary that is voluntarily created by large communities of Web users. This collaborative construction approach presents a new paradigm for lexicography that poses new research questions to dictionary research on the one hand and provides a very valuable knowledge source for natural language processing applications on the other. The subject of our research is Wiktionary, which is currently the largest collaboratively constructed dictionary project. In the first part of this thesis, we study Wiktionary from the metalexicographic perspective. Metalexicography is the scientific study of lexicography, including the analysis and criticism of dictionaries and lexicographic processes. To this end, we discuss three contributions related to this area of research: (i) We first provide a detailed analysis of Wiktionary and its various language editions and dictionary structures. (ii) We then analyze the collaborative construction process of Wiktionary. Our results show that the traditional phases of the lexicographic process do not apply well to Wiktionary, which is why we propose a novel process description based on the frequent and continual revision and discussion of the dictionary articles and the lexicographic instructions. (iii) We perform a large-scale quantitative comparison of Wiktionary and a number of other dictionaries regarding the covered languages, lexical entries, word senses, pragmatic labels, lexical relations, and translations. We conclude the metalexicographic perspective by finding that the collaborative Wiktionary is not an appropriate replacement for expert-built dictionaries, due to its inconsistencies, quality flaws, one-size-fits-all approach, and strong dependence on expert-built dictionaries. However, Wiktionary's rapid and continual growth, its high coverage of languages, newly coined words, domain-specific vocabulary, and non-standard language varieties, as well as the kind of evidence based on the authors' intuition, provide promising opportunities for both lexicography and natural language processing. In particular, we find that Wiktionary and expert-built wordnets and thesauri contain largely complementary entries. In the second part of the thesis, we study Wiktionary from the natural language processing perspective, with the aim of making its linguistic knowledge available for computational applications. Such applications require vast amounts of structured data of high quality. Expert-built resources have been found to suffer from insufficient coverage and high construction and maintenance cost, whereas fully automatic extraction from corpora or the Web often yields resources of limited quality. Collaboratively built encyclopedias present a viable solution, but do not cover well the linguistically oriented knowledge found in dictionaries. That is why we propose extracting linguistic knowledge from Wiktionary, which we achieve through the following three main contributions: (i) We propose the novel multilingual ontology OntoWiktionary, which is created by extracting and harmonizing the weakly structured dictionary articles in Wiktionary.
A particular challenge in this process is the ambiguity of semantic relations and translations, which we resolve by automatic word sense disambiguation methods. (ii) We automatically align Wiktionary with WordNet 3.0 at the word sense level. The largely complementary information from the two dictionaries yields an aligned resource with higher coverage and an enriched representation of word senses. (iii) We represent Wiktionary according to the ISO standard Lexical Markup Framework, which we adapt to the peculiarities of collaborative dictionaries. This standardized representation is of great importance for fostering the interoperability of resources and hence the dissemination of Wiktionary-based research. To this end, our work presents a foundational step towards the large-scale integrated resource UBY, which facilitates unified access to a number of standardized dictionaries by means of a shared web interface for human users and an application programming interface for natural language processing applications. A user can, in particular, switch between and combine information from Wiktionary and other dictionaries without completely changing the software. Our final resource and the accompanying datasets and software are publicly available and can be employed for multiple different natural language processing applications. In particular, it fills the gap between the small expert-built wordnets and the large amount of encyclopedic knowledge in Wikipedia. We provide a survey of previous works utilizing Wiktionary, and we exemplify the usefulness of our work in two case studies on measuring verb similarity and detecting cross-lingual marketing blunders, which make use of our Wiktionary-based resource and the results of our metalexicographic study. We conclude the thesis by emphasizing the usefulness of collaborative dictionaries when combined with expert-built resources, which bears much unused potential.
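
    To illustrate what a word sense alignment step can look like, here is a minimal Lesk-style sketch that pairs senses by gloss overlap. It is a simplified stand-in under stated assumptions; the actual Wiktionary-WordNet alignment uses richer similarity models than raw word overlap.

```python
# Minimal Lesk-style sketch of word sense alignment: pair each
# Wiktionary sense with the WordNet sense whose gloss overlaps most.
# A simplified stand-in for the thesis's alignment method.
def gloss_overlap(gloss_a: str, gloss_b: str) -> float:
    """Jaccard overlap of the word sets of two sense glosses."""
    a, b = set(gloss_a.lower().split()), set(gloss_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_senses(wkt_glosses: list[str], wn_glosses: list[str], threshold: float = 0.2):
    """Greedily link each Wiktionary sense to its best WordNet sense."""
    alignment = []
    for i, wg in enumerate(wkt_glosses):
        best_score, best_j = max((gloss_overlap(wg, sg), j) for j, sg in enumerate(wn_glosses))
        if best_score >= threshold:
            alignment.append((i, best_j, best_score))
    return alignment

# Example with toy glosses of "bass":
# print(align_senses(["a low-pitched male singing voice"],
#                    ["the lowest adult male singing voice", "a freshwater fish"]))
```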

    XXV Congreso Argentino de Ciencias de la Computación - CACIC 2019: libro de actas

    Papers presented at the XXV Congreso Argentino de Ciencias de la Computación (CACIC), held in the city of Río Cuarto from 14 to 18 October 2019 and organized by the Red de Universidades con Carreras en Informática (RedUNCI) and the Facultad de Ciencias Exactas, Físico-Químicas y Naturales of the Universidad Nacional de Río Cuarto.

    Factors Influencing Customer Satisfaction towards E-shopping in Malaysia

    Online shopping, or e-shopping, has changed the world of business, and a good number of people have embraced it. Their primary concern, in the face of globalisation, is how competently these capabilities can be incorporated into their businesses. E-shopping has also increased substantially in Malaysia in recent years. The rapid growth of the e-commerce industry in Malaysia has created a need to understand how to increase customer satisfaction in the e-retailing environment. It is very important that customers are satisfied with a website; otherwise, they will not return. Companies must therefore ensure that their customers are satisfied with their purchases, which is essential from an e-commerce point of view. With this in mind, this study investigated customer satisfaction towards e-shopping in Malaysia. A total of 400 questionnaires were distributed among students randomly selected from various public and private universities located within the Klang Valley area. Of these, 369 questionnaires were returned, of which 341 were found usable for further analysis. Structural equation modelling (SEM) was then employed to test the hypotheses. The study found that customer satisfaction towards e-shopping in Malaysia is to a great extent influenced by ease of use, trust, website design, online security, and e-service quality. Finally, recommendations and directions for future study are provided. Keywords: E-shopping, Customer satisfaction, Trust, Online security, E-service quality, Malaysia
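
    For readers unfamiliar with the method, a hypothesis test of this kind can be specified as a structural equation model. The sketch below uses the Python library semopy with lavaan-style syntax; the construct and indicator names (eou1, trust1, ...) and the survey file are hypothetical placeholders, and only three of the five studied factors are shown for brevity.

```python
# Hedged sketch of testing satisfaction drivers with SEM via semopy.
# Indicator names and the data file are hypothetical placeholders.
import pandas as pd
from semopy import Model

MODEL_DESC = """
# measurement model: latent constructs measured by questionnaire items
EaseOfUse    =~ eou1 + eou2 + eou3
Trust        =~ trust1 + trust2 + trust3
Security     =~ sec1 + sec2 + sec3
Satisfaction =~ sat1 + sat2 + sat3
# structural model: satisfaction regressed on its hypothesised drivers
Satisfaction ~ EaseOfUse + Trust + Security
"""

# data = pd.read_csv("survey_responses.csv")  # the 341 usable responses
# model = Model(MODEL_DESC)
# model.fit(data)
# print(model.inspect())  # path coefficients and their significance
```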