
    Is Wikipedia Inefficient? Modelling Effort and Participation in Wikipedia

    Concerns have been raised about the decreased ability of Wikipedia to recruit editors and to harness the effort of contributors to create new articles and improve existing ones. But, as Marwell & Oliver explained, in collective projects people are few and efforts costly in the initial stage; in the diffusion phase the number of participants grows as their efforts become rewarding; and in the mature phase some inefficiency may appear as the number of contributors exceeds what the work requires. In this paper, using original data we extracted from 36 of the main language projects, we compare the efficiency of Wikipedia projects in different languages and at different stages of development to examine this effect.
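
    As an illustration of the kind of cross-language comparison the paper describes, the sketch below computes a toy efficiency ratio from hypothetical per-project activity counts; the field names, figures and the specific ratio are assumptions for illustration, not the paper's data or metric.

```python
# Hypothetical per-project activity snapshots; the figures below are
# placeholders, not data from the paper.
projects = {
    "en": {"edits": 52_000_000, "new_articles": 210_000, "active_editors": 39_000},
    "de": {"edits": 6_100_000,  "new_articles": 42_000,  "active_editors": 5_600},
    "fr": {"edits": 5_300_000,  "new_articles": 38_000,  "active_editors": 4_900},
}

def efficiency(stats):
    """Toy efficiency ratio: articles produced per unit of editing effort."""
    return stats["new_articles"] / stats["edits"]

def effort_per_editor(stats):
    """Average editing effort contributed by each active editor."""
    return stats["edits"] / stats["active_editors"]

# Rank the language projects by the toy efficiency ratio.
for lang, stats in sorted(projects.items(), key=lambda kv: efficiency(kv[1]), reverse=True):
    print(f"{lang}: {efficiency(stats):.4f} articles/edit, "
          f"{effort_per_editor(stats):.0f} edits/editor")
```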

    Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon

    This paper proposes to advance the current state of the art in automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0, and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, a pressing problem in Computational Linguistics, by using the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, to check the usefulness of the constructed resource, we apply it in a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, the amount of NEs acquired and the richness of the information represented.
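
    A minimal sketch of the pipeline the abstract outlines (NE identification, disambiguation, mapping to a lexical resource, and emitting multilingual entries), using toy stand-ins for Wikipedia and WordNet; all resource contents, helper names and the category-keyword heuristic are hypothetical, not the paper's actual procedure or its LMF encoding.

```python
from dataclasses import dataclass, field

# Toy stand-ins for the resources the paper combines; real runs would load
# Wikipedia dumps and WordNet/Parole-Simple-Clips instead.
WIKIPEDIA_ARTICLES = {
    "Rome": {"categories": ["Capitals in Europe"], "langlinks": {"es": "Roma", "it": "Roma"}},
    "Apple Inc.": {"categories": ["Technology companies"], "langlinks": {"es": "Apple", "it": "Apple"}},
    "Fruit": {"categories": ["Botany"], "langlinks": {"es": "Fruta", "it": "Frutto"}},
}
WORDNET_NOUNS = {"fruit", "capital", "company"}        # hypothetical common-noun lemmas

@dataclass
class LexicalEntry:                                    # loosely LMF-shaped record
    lemma: str
    language: str
    semantic_type: str
    equivalents: dict = field(default_factory=dict)

def is_named_entity(title, article):
    """NE identification: capitalised title not covered by a common-noun synset."""
    return title[0].isupper() and title.lower() not in WORDNET_NOUNS

def semantic_type(article):
    """Disambiguation step, crudely approximated by category keywords."""
    cats = " ".join(article["categories"]).lower()
    if "capital" in cats or "europe" in cats:
        return "LOCATION"
    if "companies" in cats:
        return "ORGANIZATION"
    return "UNKNOWN"

def build_lexicon(articles):
    """Mapping + extraction + postprocessing: emit one multilingual entry per NE."""
    lexicon = []
    for title, article in articles.items():
        if not is_named_entity(title, article):
            continue
        lexicon.append(LexicalEntry(
            lemma=title,
            language="en",
            semantic_type=semantic_type(article),
            equivalents=article["langlinks"],          # aligned ES/IT lemmas
        ))
    return lexicon

for entry in build_lexicon(WIKIPEDIA_ARTICLES):
    print(entry)
```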

    Temporal Factors to evaluate trustworthiness of virtual identities

    In this paper we investigate how temporal factors (i.e. factors computed by considering only the time distribution of interactions) can be used as evidence of an entity’s trustworthiness. While reputation and direct experience are the two most widely used sources of trust in applications, we believe that new sources of evidence and new applications should be investigated [1]. Moreover, while these two classical techniques are based on evaluating the outcomes of interactions (direct or indirect), temporal factors are based on quantitative analysis, representing an alternative way of assessing trust. Our presumption is that, even with this limited information, temporal factors could provide plausible evidence of trust that might be aggregated with more traditional sources. After defining our formal model of four main temporal factors (activity, presence, regularity and frequency), we performed an evaluation over the Wikipedia project, considering more than 12,000 users and 94,000 articles. Our encouraging results show how plausible trust decisions can be achieved based solely on temporal factors.
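
    A rough sketch of how the four temporal factors could be computed from a user's interaction timestamps; the definitions below are simplified guesses for illustration and need not match the paper's formal model.

```python
from datetime import datetime, timedelta
from statistics import pstdev

def temporal_factors(timestamps, observation_days=365):
    """Simplified versions of the four factors; the paper's formal definitions may differ."""
    ts = sorted(timestamps)
    span = (ts[-1] - ts[0]).days or 1

    activity = len(ts)                                  # total number of interactions
    presence = span / observation_days                  # fraction of the window the user spans
    frequency = len(ts) / span                          # interactions per day of lifetime
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(ts, ts[1:])]
    regularity = 1 / (1 + pstdev(gaps)) if len(gaps) > 1 else 0.0  # steadier gaps -> higher score
    return {"activity": activity, "presence": presence,
            "frequency": frequency, "regularity": regularity}

# Hypothetical edit history of one Wikipedia user: roughly weekly edits.
edits = [datetime(2024, 1, 1) + timedelta(days=7 * i, hours=i % 5) for i in range(30)]
print(temporal_factors(edits))
```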

    VoG: Summarizing and Understanding Large Graphs

    How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop VoG, an efficient method to minimize the description cost; and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.
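
    An MDL-flavoured toy sketch of the core idea: a candidate structure is kept in the summary only if describing it (plus an error term) costs fewer bits than listing its edges explicitly. The vocabulary here is restricted to cliques and the bit costs are simplistic assumptions, not the paper's encoding scheme.

```python
import math
from itertools import combinations

def bits_for_edges(num_edges, num_nodes):
    """Naive baseline: each edge costs log2(n^2) bits when listed explicitly."""
    return num_edges * 2 * math.log2(max(num_nodes, 2))

def clique_description_cost(nodes, edges, members):
    """Cost of declaring `members` a clique plus correcting any missing edges."""
    n = len(nodes)
    declare = len(members) * math.log2(n)               # list the member ids
    expected = {frozenset(p) for p in combinations(members, 2)}
    missing = expected - {frozenset(e) for e in edges}
    return declare + bits_for_edges(len(missing), n)    # error term for absent edges

def summarize(nodes, edges, candidate_cliques):
    """Keep a candidate only if it shrinks the total description length (MDL)."""
    edge_set = {frozenset(e) for e in edges}
    kept, remaining = [], set(edge_set)
    for members in candidate_cliques:
        covered = {frozenset(p) for p in combinations(members, 2)} & remaining
        saved = bits_for_edges(len(covered), len(nodes))
        cost = clique_description_cost(nodes, edge_set, members)
        if cost < saved:                                # the structure pays for itself
            kept.append(members)
            remaining -= covered
    total = sum(clique_description_cost(nodes, edge_set, m) for m in kept) \
            + bits_for_edges(len(remaining), len(nodes))
    return kept, total

nodes = list(range(6))
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4), (4, 5)]  # toy graph
print(summarize(nodes, edges, candidate_cliques=[{0, 1, 2}, {3, 4, 5}]))
```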

    Anomaly detection in the dynamics of web and social networks

    In this work, we propose a new, fast and scalable method for anomaly detection in large time-evolving graphs. The input may be a static graph with dynamic node attributes (e.g. time series), or a graph evolving in time, such as a temporal network. We define an anomaly as a localized increase in temporal activity in a cluster of nodes. The algorithm is unsupervised. It is able to detect and track anomalous activity in a dynamic network despite the noise from multiple interfering sources. We use the Hopfield network model of memory to combine the graph and time information. We show that anomalies can be spotted with good precision using a memory network. The presented approach is scalable and we provide a distributed implementation of the algorithm. To demonstrate its efficiency, we apply it to two datasets: the Enron email dataset and Wikipedia page views. We show that the anomalous spikes are triggered by real-world events that impact the network dynamics. Moreover, the structure of the clusters and the analysis of the time evolution associated with the detected events reveal interesting facts about how humans interact, exchange and search for information, opening the door to new quantitative studies of collective and social behavior on large and dynamic datasets.
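
    A simplified sketch of detecting a localized spike in a cluster of nodes, using a per-node z-score plus a neighbourhood check; this heuristic stands in for the paper's Hopfield-memory approach, and the thresholds and data are illustrative.

```python
import numpy as np

def localized_anomalies(activity, adjacency, z_thresh=3.0, min_cluster=3):
    """Flag time steps where a connected group of nodes spikes together.

    `activity` is a (nodes x time) array of counts; this z-score heuristic is
    a stand-in for the paper's Hopfield-memory method, not a reimplementation.
    """
    mean = activity.mean(axis=1, keepdims=True)
    std = activity.std(axis=1, keepdims=True) + 1e-9
    hot = (activity - mean) / std > z_thresh            # per-node spike indicator

    events = []
    for t in range(activity.shape[1]):
        spiking = set(np.flatnonzero(hot[:, t]))
        # Keep only spiking nodes with at least one spiking neighbour, i.e. the
        # increase is localized in a cluster rather than being isolated noise.
        clustered = {i for i in spiking
                     if spiking & set(np.flatnonzero(adjacency[i]))}
        if len(clustered) >= min_cluster:
            events.append((t, sorted(clustered)))
    return events

rng = np.random.default_rng(0)
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1
acts = rng.poisson(2, size=(6, 50)).astype(float)
acts[[0, 1, 2], 30] += 15                               # synthetic coordinated spike
print(localized_anomalies(acts, A))
```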

    Patterns of Creation and Usage of Wikipedia Content

    Wikipedia is the largest online service storing user-generated content. Its pages are open to anyone for addition, deletion and modification, and the effort of contributors is recorded and can be tracked over time. Although Wikipedia content could potentially exhibit unbounded growth, it is still not clear whether the effort of developers and the output generated actually follow patterns of continuous growth. It is also not clear how users access such content, and whether recurring patterns of usage are detectable that show how Wikipedia content is typically viewed by interested readers. Using Wikipedia categories as macro-agglomerates, this study reveals that Wikipedia categories face a decreasing growth trend over time, after an initial, exponential phase of development. On the other hand, the study demonstrates that the number of views of the pages within the categories follows a linear, unbounded growth. The link between software usefulness and the need for software maintenance over time was established by Lehman and others; the link between Wikipedia usage and changes to the content, unlike software, appears to follow a two-phase evolution of production followed by consumption.
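
    A small sketch of the two contrasting trends described above, on placeholder monthly series: a slowing growth rate for category content after an early phase versus a straight-line fit for cumulative views; the numbers are synthetic, not data from the study.

```python
import numpy as np

# Hypothetical monthly series for one Wikipedia category: cumulative articles
# (toy saturating shape) and cumulative page views (roughly linear).
months = np.arange(60)
articles = 5000 / (1 + np.exp(-(months - 20) / 6))      # placeholder, not real data
views = 12_000 * months + np.random.default_rng(1).normal(0, 5_000, months.size)

def monthly_growth(series):
    """First differences, i.e. new output per month."""
    return np.diff(series)

content_rate = monthly_growth(articles)
early, late = content_rate[:12].mean(), content_rate[-12:].mean()
print(f"article growth: {early:.1f}/month early vs {late:.1f}/month late (slowdown)")

# Views: a straight line fits well, consistent with unbounded linear growth.
slope, intercept = np.polyfit(months, views, deg=1)
residual = views - (slope * months + intercept)
print(f"views ≈ {slope:.0f} * month + {intercept:.0f}, residual std {residual.std():.0f}")
```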

    Inferring multilingual domain-specific word embeddings from large document corpora

    The use of distributed vector representations of words in Natural Language Processing has become established. To tailor general-purpose vector spaces to the context under analysis, several domain adaptation techniques have been proposed. They all require sufficiently large document corpora tailored to the target domains. However, in several cross-lingual NLP domains, both large enough domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper aims to tackle this issue. It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language in order to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation on the established word retrieval task. To this end, a new benchmark multilingual dataset, derived from Wikipedia, has been released. The results confirm the effectiveness and usability of the proposed approach.
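
    A condensed sketch of the two-step idea on synthetic vectors: identify domain-specific words as those that shift most between the general and domain spaces, then reuse the general-to-domain transformation learned on the source language on the aligned target-language vectors. The paper's transformation is non-linear; an ordinary least-squares map is used here purely for brevity, and all vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 50, 200

# Placeholder embeddings: general-purpose and domain-specific English spaces,
# plus a general-purpose target-language space aligned to the English one.
en_general = rng.normal(size=(n_words, dim))
true_shift = rng.normal(size=(dim, dim)) * 0.1 + np.eye(dim)
en_domain = en_general @ true_shift + rng.normal(scale=0.01, size=(n_words, dim))
tgt_general = en_general + rng.normal(scale=0.05, size=(n_words, dim))  # aligned space

# Step 1 (identify domain-specific words): here, the words whose vectors moved
# most between the general and domain English spaces.
shift_norm = np.linalg.norm(en_domain - en_general, axis=1)
domain_idx = np.argsort(shift_norm)[-50:]

# Step 2 (reuse the transformation): learn general -> domain on English and
# apply it to the aligned target-language vectors.
W, *_ = np.linalg.lstsq(en_general[domain_idx], en_domain[domain_idx], rcond=None)
tgt_domain = tgt_general @ W

print("inferred target-language domain vectors:", tgt_domain.shape)
print("transform error on English domain words:",
      np.linalg.norm(en_general[domain_idx] @ W - en_domain[domain_idx]) /
      np.linalg.norm(en_domain[domain_idx]))
```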

    Controlling the Open Content Creation Process: An Analysis of Control Mechanisms Using the Repertory Grid Method

    We develop a governance framework for open collaboration, specifically for the process of collaborative content creation. Our analysis is based on in-depth interviews with 12 active Wikipedians using the repertory grid method. The framework reflects the governance of wiki-based peer production by identifying the different structures, processes and mechanisms that guide and control the contributions and activities of individuals. Our findings identify four driving principles for successful governance: the power of the many, the influence of the few, the role of (persistent) conversations, and the value of rules.