Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea is that the frequencies of
semantically related words and phrases that occur in the same texts should be
enhanced, which increases their contribution to the topics found in those
texts. We have conducted
experiments with several thesauri and found that, to improve topic models, it
is useful to utilize domain-specific knowledge. If a general thesaurus, such as
WordNet, is used, the thesaurus-based improvement of topic models can be
achieved by excluding hyponymy relations in the combined topic models.

Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
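The frequency-enhancement idea can be sketched as a pre-processing step before fitting the topic model. This is a minimal, hypothetical illustration: the `THESAURUS` groups, the `boost_counts` name, and the `weight` parameter are assumptions for the sketch, not the paper's notation.

```python
from collections import Counter

# Toy thesaurus: groups of semantically related terms (illustrative only).
THESAURUS = [
    {"car", "automobile", "vehicle"},
    {"doctor", "physician"},
]

def boost_counts(doc_tokens, weight=1.0):
    """Enhance the frequency of every word that co-occurs in the same
    document with a thesaurus-related word, so related terms contribute
    more to the topics inferred from that document."""
    counts = Counter(doc_tokens)
    boosted = Counter(counts)
    vocab = set(counts)
    for group in THESAURUS:
        present = group & vocab
        if len(present) > 1:  # at least two related terms co-occur here
            for w in present:
                boosted[w] += weight * counts[w]
    return boosted

# "car" and "automobile" are related and co-occur, so both are boosted;
# "doctor" has no co-occurring relative and keeps its raw count.
boosted = boost_counts(["car", "automobile", "doctor", "road"])
```

The boosted counts would then replace the raw counts in whatever bag-of-words representation the topic model consumes.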
Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data
Use of socially generated "big data" to access information about collective
states of mind in human societies has become a new paradigm in the
emerging field of computational social science. A natural application is
predicting society's reaction to a new product, in the sense
of popularity and adoption rate. However, bridging the gap between "real-time
monitoring" and "early predicting" remains a big challenge. Here we report on
an endeavor to build a minimalistic predictive model for the financial success
of movies based on collective activity data of online users. We show that the
popularity of a movie can be predicted well before its release by measuring and
analyzing the activity of editors and viewers of the movie's entry in
Wikipedia, the well-known online encyclopedia.

Comment: 13 pages, including Supporting Information, 7 figures. Download the
dataset from: http://wwm.phy.bme.hu/SupplementaryDataS1.zi
Enriching videos with light semantics
This paper describes an ongoing prototypical framework to annotate and retrieve web videos with light semantics. The proposed framework reuses many existing vocabularies along with a video model. The knowledge is captured from three different information spaces (media content, context, document). We also describe ways to extract semantic content descriptions from existing user-generated content using multiple approaches of linguistic processing and Named Entity Recognition; the extracted entities are then identified with DBpedia resources to establish meanings for the tags. Finally, the implemented prototype is described with multiple search interfaces and retrieval processes. Evaluation of the semantic enrichment shows a considerable improvement in content description (for 50% of videos).
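The tag-to-DBpedia linking step can be illustrated with a naive baseline. This sketch is only an assumption about how such a mapping might look: it uses DBpedia's title-normalization convention (first letter capitalized, spaces replaced by underscores) to build a candidate resource URI, whereas a real system would disambiguate candidates, e.g. via the DBpedia Lookup service.

```python
def tag_to_dbpedia_uri(tag):
    """Map a free-text tag to a candidate DBpedia resource URI via title
    normalization. This is a naive baseline, not a disambiguation method:
    a production system would rank and verify candidate resources."""
    title = tag.strip().replace(" ", "_")
    if title:
        # DBpedia (like Wikipedia) capitalizes the first character of titles.
        title = title[0].upper() + title[1:]
    return "http://dbpedia.org/resource/" + title

uri = tag_to_dbpedia_uri("semantic web")
```

The returned URI is only a candidate; whether it denotes the intended entity still has to be checked against the knowledge base.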
Regional Languages on Wikipedia. Venetian Wikipedia’s user interaction over time
Given that little is known about regional-language user interaction practices on Wikipedia, this study analyzed the content creation process, user social interaction, and exchanged content over the course of the existence of the Venetian Wikipedia. Content and user interactions over time on the Venetian Wikipedia exhibit practices shared with larger Wikipedia communities and also display behaviors specific to this community. Practices shared with
other Wikipedias (e.g., the English Wikipedia) included coordination content as the dominant category of exchanged content, a role-based user structure in which the most active communicators are administrators, and socialization tactics to involve users in online projects. The Venetian Wikipedia stood out for its geographically linked users, who emphasized their regional identity, and for user exchanges that over time spilled over from the online to the offline domain. This analysis shows a different side of Wikipedia collaboration, one based on the creation, maintenance, and negotiation of content but also on
engagement in interpersonal communication. Thus, this study exemplifies how regional-language Wikipedias allow their users not only to preserve their cultural heritage through language use and to connect through shared content of interest, but also to serve as a community maintenance platform that unifies users with shared goals and extends communication into the offline realm.
Using Google Analytics Data to Expand Discovery and Use of Digital Archival Content
This article presents opportunities for the use of Google Analytics, a popular and freely available web analytics tool, to inform decision making for digital archivists managing online digital archives content. Emphasis is placed on the analysis of Google Analytics data to increase the visibility and discoverability of content. The article describes the use of Google Analytics to support fruitful digital outreach programs, to guide metadata creation for enhancing access, and to measure user demand to aid selection for digitization. Valuable reports, features, and tools in Google Analytics are identified, and the use of these tools to gather meaningful data is explained.
Modeling the structure and evolution of discussion cascades
We analyze the structure and evolution of discussion cascades in four popular
websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the large
heterogeneity among these sites, a preferential attachment (PA) model with
bias to the root can capture the temporal evolution of the observed trees and
many of their statistical properties, namely, probability distributions of the
branching factors (degrees), subtree sizes and certain correlations. The
parameters of the model are learned efficiently using a novel maximum
likelihood estimation scheme for PA and provide a figurative interpretation
of the communication habits and the resulting discussion cascades on the
four different websites.

Comment: 10 pages, 11 figures
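The root-biased preferential attachment model can be illustrated with a short generative sketch. The parameter names `alpha` and `root_bias`, and the additive form of the bias, are assumptions for illustration, not the paper's notation or fitted values: each new comment replies to an existing node with probability proportional to that node's current reply count, with an extra weight for the root (the original post).

```python
import random

def grow_cascade(n, alpha=1.0, root_bias=2.0, seed=0):
    """Grow a discussion tree of n nodes under root-biased preferential
    attachment. Node 0 is the root post; each later node replies to an
    existing node chosen with weight alpha * replies(v) + 1, plus an
    additive root_bias when v is the root. Illustrative sketch only."""
    rng = random.Random(seed)
    parent = {0: None}   # node -> node it replies to
    degree = {0: 0}      # node -> number of replies received so far
    for new in range(1, n):
        nodes = list(degree)
        weights = []
        for v in nodes:
            w = alpha * degree[v] + 1.0  # rich-get-richer term
            if v == 0:
                w += root_bias           # extra pull toward the root
            weights.append(w)
        choice = rng.choices(nodes, weights=weights, k=1)[0]
        parent[new] = choice
        degree[choice] += 1
        degree[new] = 0
    return parent

tree = grow_cascade(50)
```

Fitting such a model to observed cascades amounts to choosing the attachment and bias parameters that maximize the likelihood of the recorded reply events, which is the role of the estimation scheme described in the abstract.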
Dating Texts without Explicit Temporal Cues
This paper tackles temporal resolution of documents, such as determining
the time a document is about or when it was written, based only on its text. We apply
techniques from information retrieval that predict dates via language models
over a discretized timeline. Unlike most previous work, we rely solely
on temporal cues implicit in the text. We consider both document-likelihood and
divergence-based techniques and several smoothing methods for each. Our
best model predicts the mid-point of individuals' lives with a median error of
22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present
day. We also show that this approach works well when training on such
biographies and predicting dates both for non-biographical Wikipedia pages
about specific years (500 B.C. to 2010 A.D.) and for publication dates of short
stories (1798 to 2008). Together, our work shows that, even in the absence of
temporal extraction resources, it is possible to achieve remarkable temporal
locality across a diverse set of texts.
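A minimal document-likelihood variant of this dating approach can be sketched as follows, assuming one unigram language model per discretized time bin and simple additive smoothing. The function names, the toy bins, and the additive smoothing with parameter `mu` are illustrative assumptions; the paper compares several smoothing methods and also divergence-based scoring.

```python
import math
from collections import Counter

def train_time_lms(docs_by_bin):
    """One unigram language model per time bin: bin -> (token counts, total)."""
    return {b: (Counter(tokens), len(tokens))
            for b, tokens in docs_by_bin.items()}

def predict_bin(tokens, lms, vocab_size, mu=1.0):
    """Score the document under each bin's additively smoothed unigram LM
    and return the bin with the highest log-likelihood."""
    best, best_ll = None, float("-inf")
    for b, (counts, total) in lms.items():
        ll = sum(math.log((counts[t] + mu) / (total + mu * vocab_size))
                 for t in tokens)
        if ll > best_ll:
            best, best_ll = b, ll
    return best

# Toy training data: vocabulary implicitly cues the period.
bins = {
    1800: "carriage telegraph carriage steam".split(),
    1950: "television radio television airplane".split(),
}
lms = train_time_lms(bins)
vocab_size = len({w for toks in bins.values() for w in toks})
```

A real system would use many fine-grained bins over the timeline and pick the argmax (or an interpolated estimate) as the predicted date.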