10,754 research outputs found

    Combining Thesaurus Knowledge and Probabilistic Topic Models

    Full text link
    In this paper we present the approach of introducing thesaurus knowledge into probabilistic topic models. The main idea of the approach is based on the assumption that the frequencies of semantically related words and phrases, which are met in the same texts, should be enhanced: this action leads to their larger contribution into topics found in these texts. We have conducted experiments with several thesauri and found that for improving topic models, it is useful to utilize domain-specific knowledge. If a general thesaurus, such as WordNet, is used, the thesaurus-based improvement of topic models can be achieved with excluding hyponymy relations in combined topic models.Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.co

    Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data

    Get PDF
    Use of socially generated "big data" to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of the society's reaction to a new product in the sense of popularity and adoption rate. However, bridging the gap between "real time monitoring" and "early predicting" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted much before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia.Comment: 13 pages, Including Supporting Information, 7 Figures, Download the dataset from: http://wwm.phy.bme.hu/SupplementaryDataS1.zi

    Regional Languages on Wikipedia. Venetian Wikipedia’s user interaction over time

    Get PDF
    Given that little is known about regional language user interaction practices on Wikipedia, this study analyzed content creation process, user social interaction and exchanged content over the course of the existence of Venetian Wikipedia. Content of and user interactions over time on Venetian Wikipedia exhibit practices shared within larger Wikipedia communities and display behaviors that are pertinent to this specific community. Shared practices with\ud other Wikipedias (eg. English Wikipedia) included coordination content as a dominant category of exchanged content, user-role based structure where and most active communicators are administrators was another shared feature, as well as socialization tactics to involve users in online projects. While Venetian Wikipedia stood out for its geographically-linked users who emphasized their regional identity. User exchanges over time spilled over from online to offline domains. This analysis provides a different side of Wikipedia collaboration which is based on creation, maintenance, and negotiation of the content but also shows\ud engagement into interpersonal communication. Thus, this study exemplifies how regional language Wikipedias provide ways to their users not only to preserve their cultural heritage through the language use on regional language Wikipedia space and connect through shared contents of interest, but also, how it could serve as a community maintenance platform that unifies users with shared goals and extends communication to offline realm

    Using Google Analytics Data to Expand Discovery and Use of Digital Archival Content

    Get PDF
    This article presents opportunities for the use of Google Analytics, a popular and freely available web analytics tool, to inform decision making for digital archivists managing online digital archives content. Emphasis is placed on the analysis of Google Analytics data to increase the visibility and discoverability of content. The article describes the use of Google Analytics to support fruitful digital outreach programs, to guide metadata creation for enhancing access, and to measure user demand to aid selection for digitization. Valuable reports, features, and tools in Google Analytics are identified and the use of these tools to gather meaningful data is explained

    Modeling the structure and evolution of discussion cascades

    Get PDF
    We analyze the structure and evolution of discussion cascades in four popular websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the big heterogeneities between these sites, a preferential attachment (PA) model with bias to the root can capture the temporal evolution of the observed trees and many of their statistical properties, namely, probability distributions of the branching factors (degrees), subtree sizes and certain correlations. The parameters of the model are learned efficiently using a novel maximum likelihood estimation scheme for PA and provide a figurative interpretation about the communication habits and the resulting discussion cascades on the four different websites.Comment: 10 pages, 11 figure

    Dating Texts without Explicit Temporal Cues

    Full text link
    This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely {\it solely} on temporal cues implicit in the text. We consider both document-likelihood and divergence based techniques and several smoothing methods for both of them. Our best model predicts the mid-point of individuals' lives with a median of 22 and mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts
    • …
    corecore