
    Is Wikipedia Inefficient? Modelling Effort and Participation in Wikipedia

    Concerns have been raised about the decreased ability of Wikipedia to recruit editors and to harness the effort of contributors to create new articles and improve existing ones. But, as Marwell & Oliver explained, in collective projects people are few and efforts costly in the initial stage; in the diffusion phase the number of participants grows as their efforts become rewarding; and in the mature phase some inefficiency may appear as the number of contributors exceeds what the work requires. In this paper, using original data we extracted from 36 of the main language projects, we compare the efficiency of Wikipedia projects in different languages and at different stages of development to examine this effect.
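
    As an illustration of the kind of cross-language comparison the paper describes, the sketch below computes a toy efficiency ratio from hypothetical per-project activity counts; the field names, figures and the specific ratio are assumptions for illustration, not the paper's data or metric.

```python
# Hypothetical per-project activity snapshots; the figures below are
# placeholders, not data from the paper.
projects = {
    "en": {"edits": 52_000_000, "new_articles": 210_000, "active_editors": 39_000},
    "de": {"edits": 6_100_000,  "new_articles": 42_000,  "active_editors": 5_600},
    "fr": {"edits": 5_300_000,  "new_articles": 38_000,  "active_editors": 4_900},
}

def efficiency(stats):
    """Toy efficiency ratio: articles produced per unit of editing effort."""
    return stats["new_articles"] / stats["edits"]

def effort_per_editor(stats):
    """Average editing effort contributed by each active editor."""
    return stats["edits"] / stats["active_editors"]

# Rank the language projects by the toy efficiency ratio.
for lang, stats in sorted(projects.items(), key=lambda kv: efficiency(kv[1]), reverse=True):
    print(f"{lang}: {efficiency(stats):.4f} articles/edit, "
          f"{effort_per_editor(stats):.0f} edits/editor")
```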

    Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon

    This paper proposes to advance the current state of the art in automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0, and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, a pressing problem in Computational Linguistics, by using the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, to check the usefulness of the constructed resource, we apply it in a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, the amount of NEs acquired and the richness of the information represented.
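
    A minimal sketch of the pipeline the abstract outlines (NE identification, disambiguation, mapping to a lexical resource, and emitting multilingual entries), using toy stand-ins for Wikipedia and WordNet; all resource contents, helper names and the category-keyword heuristic are hypothetical, not the paper's actual procedure or its LMF encoding.

```python
from dataclasses import dataclass, field

# Toy stand-ins for the resources the paper combines; real runs would load
# Wikipedia dumps and WordNet/Parole-Simple-Clips instead.
WIKIPEDIA_ARTICLES = {
    "Rome": {"categories": ["Capitals in Europe"], "langlinks": {"es": "Roma", "it": "Roma"}},
    "Apple Inc.": {"categories": ["Technology companies"], "langlinks": {"es": "Apple", "it": "Apple"}},
    "Fruit": {"categories": ["Botany"], "langlinks": {"es": "Fruta", "it": "Frutto"}},
}
WORDNET_NOUNS = {"fruit", "capital", "company"}        # hypothetical common-noun lemmas

@dataclass
class LexicalEntry:                                    # loosely LMF-shaped record
    lemma: str
    language: str
    semantic_type: str
    equivalents: dict = field(default_factory=dict)

def is_named_entity(title, article):
    """NE identification: capitalised title not covered by a common-noun synset."""
    return title[0].isupper() and title.lower() not in WORDNET_NOUNS

def semantic_type(article):
    """Disambiguation step, crudely approximated by category keywords."""
    cats = " ".join(article["categories"]).lower()
    if "capital" in cats or "europe" in cats:
        return "LOCATION"
    if "companies" in cats:
        return "ORGANIZATION"
    return "UNKNOWN"

def build_lexicon(articles):
    """Mapping + extraction + postprocessing: emit one multilingual entry per NE."""
    lexicon = []
    for title, article in articles.items():
        if not is_named_entity(title, article):
            continue
        lexicon.append(LexicalEntry(
            lemma=title,
            language="en",
            semantic_type=semantic_type(article),
            equivalents=article["langlinks"],          # aligned ES/IT lemmas
        ))
    return lexicon

for entry in build_lexicon(WIKIPEDIA_ARTICLES):
    print(entry)
```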

    Temporal Factors to evaluate trustworthiness of virtual identities

    In this paper we investigate how temporal factors (i.e. factors computed by considering only the time distribution of interactions) can be used as evidence of an entity’s trustworthiness. While reputation and direct experience are the two most widely used sources of trust in applications, we believe that new sources of evidence and new applications should be investigated [1]. Moreover, while these two classical techniques are based on evaluating the outcomes of interactions (direct or indirect), temporal factors are based on quantitative analysis, representing an alternative way of assessing trust. Our presumption is that, even with this limited information, temporal factors could provide plausible evidence of trust that might be aggregated with more traditional sources. After defining our formal model of four main temporal factors (activity, presence, regularity and frequency), we performed an evaluation over the Wikipedia project, considering more than 12,000 users and 94,000 articles. Our encouraging results show how plausible trust decisions can be achieved based solely on temporal factors.
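
    A rough sketch of how the four temporal factors could be computed from a user's interaction timestamps; the definitions below are simplified guesses for illustration and need not match the paper's formal model.

```python
from datetime import datetime, timedelta
from statistics import pstdev

def temporal_factors(timestamps, observation_days=365):
    """Simplified versions of the four factors; the paper's formal definitions may differ."""
    ts = sorted(timestamps)
    span = (ts[-1] - ts[0]).days or 1

    activity = len(ts)                                  # total number of interactions
    presence = span / observation_days                  # fraction of the window the user spans
    frequency = len(ts) / span                          # interactions per day of lifetime
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(ts, ts[1:])]
    regularity = 1 / (1 + pstdev(gaps)) if len(gaps) > 1 else 0.0  # steadier gaps -> higher score
    return {"activity": activity, "presence": presence,
            "frequency": frequency, "regularity": regularity}

# Hypothetical edit history of one Wikipedia user: roughly weekly edits.
edits = [datetime(2024, 1, 1) + timedelta(days=7 * i, hours=i % 5) for i in range(30)]
print(temporal_factors(edits))
```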

    VoG: Summarizing and Understanding Large Graphs

    How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop VoG, an efficient method to minimize the description cost; and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.
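
    An MDL-flavoured toy sketch of the core idea: a candidate structure is kept in the summary only if describing it (plus an error term) costs fewer bits than listing its edges explicitly. The vocabulary here is restricted to cliques and the bit costs are simplistic assumptions, not the paper's encoding scheme.

```python
import math
from itertools import combinations

def bits_for_edges(num_edges, num_nodes):
    """Naive baseline: each edge costs log2(n^2) bits when listed explicitly."""
    return num_edges * 2 * math.log2(max(num_nodes, 2))

def clique_description_cost(nodes, edges, members):
    """Cost of declaring `members` a clique plus correcting any missing edges."""
    n = len(nodes)
    declare = len(members) * math.log2(n)               # list the member ids
    expected = {frozenset(p) for p in combinations(members, 2)}
    missing = expected - {frozenset(e) for e in edges}
    return declare + bits_for_edges(len(missing), n)    # error term for absent edges

def summarize(nodes, edges, candidate_cliques):
    """Keep a candidate only if it shrinks the total description length (MDL)."""
    edge_set = {frozenset(e) for e in edges}
    kept, remaining = [], set(edge_set)
    for members in candidate_cliques:
        covered = {frozenset(p) for p in combinations(members, 2)} & remaining
        saved = bits_for_edges(len(covered), len(nodes))
        cost = clique_description_cost(nodes, edge_set, members)
        if cost < saved:                                # the structure pays for itself
            kept.append(members)
            remaining -= covered
    total = sum(clique_description_cost(nodes, edge_set, m) for m in kept) \
            + bits_for_edges(len(remaining), len(nodes))
    return kept, total

nodes = list(range(6))
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4), (4, 5)]  # toy graph
print(summarize(nodes, edges, candidate_cliques=[{0, 1, 2}, {3, 4, 5}]))
```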

    Anomaly detection in the dynamics of web and social networks

    In this work, we propose a new, fast and scalable method for anomaly detection in large time-evolving graphs. The input may be a static graph with dynamic node attributes (e.g. time series), or a graph evolving in time, such as a temporal network. We define an anomaly as a localized increase in temporal activity in a cluster of nodes. The algorithm is unsupervised. It is able to detect and track anomalous activity in a dynamic network despite the noise from multiple interfering sources. We use the Hopfield network model of memory to combine the graph and time information. We show that anomalies can be spotted with good precision using a memory network. The presented approach is scalable and we provide a distributed implementation of the algorithm. To demonstrate its efficiency, we apply it to two datasets: the Enron email dataset and Wikipedia page views. We show that the anomalous spikes are triggered by real-world events that impact the network dynamics. Moreover, the structure of the clusters and the analysis of the time evolution associated with the detected events reveal interesting facts about how humans interact, exchange and search for information, opening the door to new quantitative studies of collective and social behavior on large and dynamic datasets.
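
    A simplified sketch of detecting a localized spike in a cluster of nodes, using a per-node z-score plus a neighbourhood check; this heuristic stands in for the paper's Hopfield-memory approach, and the thresholds and data are illustrative.

```python
import numpy as np

def localized_anomalies(activity, adjacency, z_thresh=3.0, min_cluster=3):
    """Flag time steps where a connected group of nodes spikes together.

    `activity` is a (nodes x time) array of counts; this z-score heuristic is
    a stand-in for the paper's Hopfield-memory method, not a reimplementation.
    """
    mean = activity.mean(axis=1, keepdims=True)
    std = activity.std(axis=1, keepdims=True) + 1e-9
    hot = (activity - mean) / std > z_thresh            # per-node spike indicator

    events = []
    for t in range(activity.shape[1]):
        spiking = set(np.flatnonzero(hot[:, t]))
        # Keep only spiking nodes with at least one spiking neighbour, i.e. the
        # increase is localized in a cluster rather than being isolated noise.
        clustered = {i for i in spiking
                     if spiking & set(np.flatnonzero(adjacency[i]))}
        if len(clustered) >= min_cluster:
            events.append((t, sorted(clustered)))
    return events

rng = np.random.default_rng(0)
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1
acts = rng.poisson(2, size=(6, 50)).astype(float)
acts[[0, 1, 2], 30] += 15                               # synthetic coordinated spike
print(localized_anomalies(acts, A))
```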

    Patterns of Creation and Usage of Wikipedia Content

    Wikipedia is the largest online service storing user-generated content. Its pages are open to anyone for addition, deletion and modification, and the effort of contributors is recorded and can be tracked over time. Although Wikipedia content could potentially exhibit unbounded growth, it is still not clear whether the effort of developers and the output generated actually follow patterns of continuous growth. It is also not clear how users access such content, and whether recurring patterns of usage are detectable that show how Wikipedia content is typically viewed by interested readers. Using Wikipedia categories as macro-agglomerates, this study reveals that Wikipedia categories face a decreasing growth trend over time, after an initial, exponential phase of development. On the other hand, the study demonstrates that the number of views of the pages within the categories follows a linear, unbounded growth. The link between software usefulness and the need for software maintenance over time was established by Lehman and others; the link between Wikipedia usage and changes to the content, unlike software, appears to follow a two-phase evolution of production followed by consumption.
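
    A small sketch of the two contrasting trends described above, on placeholder monthly series: a slowing growth rate for category content after an early phase versus a straight-line fit for cumulative views; the numbers are synthetic, not data from the study.

```python
import numpy as np

# Hypothetical monthly series for one Wikipedia category: cumulative articles
# (toy saturating shape) and cumulative page views (roughly linear).
months = np.arange(60)
articles = 5000 / (1 + np.exp(-(months - 20) / 6))      # placeholder, not real data
views = 12_000 * months + np.random.default_rng(1).normal(0, 5_000, months.size)

def monthly_growth(series):
    """First differences, i.e. new output per month."""
    return np.diff(series)

content_rate = monthly_growth(articles)
early, late = content_rate[:12].mean(), content_rate[-12:].mean()
print(f"article growth: {early:.1f}/month early vs {late:.1f}/month late (slowdown)")

# Views: a straight line fits well, consistent with unbounded linear growth.
slope, intercept = np.polyfit(months, views, deg=1)
residual = views - (slope * months + intercept)
print(f"views ≈ {slope:.0f} * month + {intercept:.0f}, residual std {residual.std():.0f}")
```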

    Inferring multilingual domain-specific word embeddings from large document corpora

    The use of distributed vector representations of words in Natural Language Processing has become established. To tailor general-purpose vector spaces to the context under analysis, several domain adaptation techniques have been proposed. They all require sufficiently large document corpora tailored to the target domains. However, in several cross-lingual NLP domains, both large enough domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper aims to tackle this issue. It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language in order to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation on the established word retrieval task. To this end, a new benchmark multilingual dataset, derived from Wikipedia, has been released. The results confirm the effectiveness and usability of the proposed approach.
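
    A condensed sketch of the two-step idea on synthetic vectors: identify domain-specific words as those that shift most between the general and domain spaces, then reuse the general-to-domain transformation learned on the source language on the aligned target-language vectors. The paper's transformation is non-linear; an ordinary least-squares map is used here purely for brevity, and all vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 50, 200

# Placeholder embeddings: general-purpose and domain-specific English spaces,
# plus a general-purpose target-language space aligned to the English one.
en_general = rng.normal(size=(n_words, dim))
true_shift = rng.normal(size=(dim, dim)) * 0.1 + np.eye(dim)
en_domain = en_general @ true_shift + rng.normal(scale=0.01, size=(n_words, dim))
tgt_general = en_general + rng.normal(scale=0.05, size=(n_words, dim))  # aligned space

# Step 1 (identify domain-specific words): here, the words whose vectors moved
# most between the general and domain English spaces.
shift_norm = np.linalg.norm(en_domain - en_general, axis=1)
domain_idx = np.argsort(shift_norm)[-50:]

# Step 2 (reuse the transformation): learn general -> domain on English and
# apply it to the aligned target-language vectors.
W, *_ = np.linalg.lstsq(en_general[domain_idx], en_domain[domain_idx], rcond=None)
tgt_domain = tgt_general @ W

print("inferred target-language domain vectors:", tgt_domain.shape)
print("transform error on English domain words:",
      np.linalg.norm(en_general[domain_idx] @ W - en_domain[domain_idx]) /
      np.linalg.norm(en_domain[domain_idx]))
```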

    Controlling the Open Content Creation Process: An Analysis of Control Mechanisms Using the Repertory Grid Method

    We develop a governance framework for open collaboration, specifically for the process of collaborative content creation. Our analysis is based on in-depth interviews with 12 active Wikipedians using the repertory grid method. The framework reflects the governance of wiki-based peer production by identifying the different structures, processes and mechanisms that guide and control the contributions and activities of individuals. Our findings identify four driving principles for successful governance: the power of the many, the influence of the few, the role of (persistent) conversations, and the value of rules.