Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea is that the frequencies of
semantically related words and phrases that occur in the same texts should be
enhanced, which increases their contribution to the topics found in those
texts. We have conducted
experiments with several thesauri and found that, to improve topic models, it
is useful to utilize domain-specific knowledge. If a general thesaurus, such as
WordNet, is used, the thesaurus-based improvement of topic models can be
achieved by excluding hyponymy relations in the combined topic models.

Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
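The frequency-enhancement idea can be sketched as a pre-processing step before fitting the topic model. This is a minimal, hypothetical illustration: the `THESAURUS` groups, the `boost_counts` name, and the `weight` parameter are assumptions for the sketch, not the paper's notation.

```python
from collections import Counter

# Toy thesaurus: groups of semantically related terms (illustrative only).
THESAURUS = [
    {"car", "automobile", "vehicle"},
    {"doctor", "physician"},
]

def boost_counts(doc_tokens, weight=1.0):
    """Enhance the frequency of every word that co-occurs in the same
    document with a thesaurus-related word, so related terms contribute
    more to the topics inferred from that document."""
    counts = Counter(doc_tokens)
    boosted = Counter(counts)
    vocab = set(counts)
    for group in THESAURUS:
        present = group & vocab
        if len(present) > 1:  # at least two related terms co-occur here
            for w in present:
                boosted[w] += weight * counts[w]
    return boosted

# "car" and "automobile" are related and co-occur, so both are boosted;
# "doctor" has no co-occurring relative and keeps its raw count.
boosted = boost_counts(["car", "automobile", "doctor", "road"])
```

The boosted counts would then replace the raw counts in whatever bag-of-words representation the topic model consumes.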
Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data
Use of socially generated "big data" to access information about collective
states of mind in human societies has become a new paradigm in the
emerging field of computational social science. A natural application is
predicting society's reaction to a new product, in the sense
of popularity and adoption rate. However, bridging the gap between "real-time
monitoring" and "early predicting" remains a big challenge. Here we report on
an endeavor to build a minimalistic predictive model for the financial success
of movies based on collective activity data of online users. We show that the
popularity of a movie can be predicted well before its release by measuring and
analyzing the activity of editors and viewers of the movie's entry in
Wikipedia, the well-known online encyclopedia.

Comment: 13 pages, including Supporting Information, 7 figures. Download the
dataset from: http://wwm.phy.bme.hu/SupplementaryDataS1.zi
Enriching videos with light semantics
This paper describes an ongoing prototypical framework to annotate and retrieve web videos with light semantics. The proposed framework reuses many existing vocabularies along with a video model. The knowledge is captured from three different information spaces (media content, context, document). We also describe ways to extract semantic content descriptions from existing user-generated content using multiple approaches of linguistic processing and Named Entity Recognition; the extracted entities are then identified with DBpedia resources to establish meanings for the tags. Finally, the implemented prototype is described with multiple search interfaces and retrieval processes. Evaluation of the semantic enrichment shows a considerable improvement in content description (for 50% of videos).
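The tag-to-DBpedia linking step can be illustrated with a naive baseline. This sketch is only an assumption about how such a mapping might look: it uses DBpedia's title-normalization convention (first letter capitalized, spaces replaced by underscores) to build a candidate resource URI, whereas a real system would disambiguate candidates, e.g. via the DBpedia Lookup service.

```python
def tag_to_dbpedia_uri(tag):
    """Map a free-text tag to a candidate DBpedia resource URI via title
    normalization. This is a naive baseline, not a disambiguation method:
    a production system would rank and verify candidate resources."""
    title = tag.strip().replace(" ", "_")
    if title:
        # DBpedia (like Wikipedia) capitalizes the first character of titles.
        title = title[0].upper() + title[1:]
    return "http://dbpedia.org/resource/" + title

uri = tag_to_dbpedia_uri("semantic web")
```

The returned URI is only a candidate; whether it denotes the intended entity still has to be checked against the knowledge base.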
Regional Languages on Wikipedia. Venetian Wikipedia’s user interaction over time
Given that little is known about regional-language user interaction practices on Wikipedia, this study analyzed the content creation process, user social interaction, and exchanged content over the course of the existence of the Venetian Wikipedia. Content and user interactions over time on the Venetian Wikipedia exhibit practices shared with larger Wikipedia communities and also display behaviors specific to this community. Practices shared with
other Wikipedias (e.g., the English Wikipedia) included coordination content as the dominant category of exchanged content, a role-based user structure in which the most active communicators are administrators, and socialization tactics to involve users in online projects. The Venetian Wikipedia stood out for its geographically linked users, who emphasized their regional identity, and for user exchanges that over time spilled over from the online to the offline domain. This analysis shows a different side of Wikipedia collaboration, one based on the creation, maintenance, and negotiation of content but also on
engagement in interpersonal communication. Thus, this study exemplifies how regional-language Wikipedias allow their users not only to preserve their cultural heritage through language use and to connect through shared content of interest, but also to serve as a community maintenance platform that unifies users with shared goals and extends communication into the offline realm.
Using Google Analytics Data to Expand Discovery and Use of Digital Archival Content
This article presents opportunities for the use of Google Analytics, a popular and freely available web analytics tool, to inform decision making for digital archivists managing online digital archives content. Emphasis is placed on the analysis of Google Analytics data to increase the visibility and discoverability of content. The article describes the use of Google Analytics to support fruitful digital outreach programs, to guide metadata creation for enhancing access, and to measure user demand to aid selection for digitization. Valuable reports, features, and tools in Google Analytics are identified, and the use of these tools to gather meaningful data is explained.
Modeling the structure and evolution of discussion cascades
We analyze the structure and evolution of discussion cascades in four popular
websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the large
heterogeneity among these sites, a preferential attachment (PA) model with
bias to the root can capture the temporal evolution of the observed trees and
many of their statistical properties, namely, probability distributions of the
branching factors (degrees), subtree sizes and certain correlations. The
parameters of the model are learned efficiently using a novel maximum
likelihood estimation scheme for PA and provide a figurative interpretation
of the communication habits and the resulting discussion cascades on the
four different websites.

Comment: 10 pages, 11 figures
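The root-biased preferential attachment model can be illustrated with a short generative sketch. The parameter names `alpha` and `root_bias`, and the additive form of the bias, are assumptions for illustration, not the paper's notation or fitted values: each new comment replies to an existing node with probability proportional to that node's current reply count, with an extra weight for the root (the original post).

```python
import random

def grow_cascade(n, alpha=1.0, root_bias=2.0, seed=0):
    """Grow a discussion tree of n nodes under root-biased preferential
    attachment. Node 0 is the root post; each later node replies to an
    existing node chosen with weight alpha * replies(v) + 1, plus an
    additive root_bias when v is the root. Illustrative sketch only."""
    rng = random.Random(seed)
    parent = {0: None}   # node -> node it replies to
    degree = {0: 0}      # node -> number of replies received so far
    for new in range(1, n):
        nodes = list(degree)
        weights = []
        for v in nodes:
            w = alpha * degree[v] + 1.0  # rich-get-richer term
            if v == 0:
                w += root_bias           # extra pull toward the root
            weights.append(w)
        choice = rng.choices(nodes, weights=weights, k=1)[0]
        parent[new] = choice
        degree[choice] += 1
        degree[new] = 0
    return parent

tree = grow_cascade(50)
```

Fitting such a model to observed cascades amounts to choosing the attachment and bias parameters that maximize the likelihood of the recorded reply events, which is the role of the estimation scheme described in the abstract.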
Dating Texts without Explicit Temporal Cues
This paper tackles temporal resolution of documents, such as determining
the time a document is about or when it was written, based only on its text. We apply
techniques from information retrieval that predict dates via language models
over a discretized timeline. Unlike most previous work, we rely solely
on temporal cues implicit in the text. We consider both document-likelihood and
divergence-based techniques and several smoothing methods for each. Our
best model predicts the mid-point of individuals' lives with a median error of
22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present
day. We also show that this approach works well when training on such
biographies and predicting dates both for non-biographical Wikipedia pages
about specific years (500 B.C. to 2010 A.D.) and for publication dates of short
stories (1798 to 2008). Together, our work shows that, even in the absence of
temporal extraction resources, it is possible to achieve remarkable temporal
locality across a diverse set of texts.
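A minimal document-likelihood variant of this dating approach can be sketched as follows, assuming one unigram language model per discretized time bin and simple additive smoothing. The function names, the toy bins, and the additive smoothing with parameter `mu` are illustrative assumptions; the paper compares several smoothing methods and also divergence-based scoring.

```python
import math
from collections import Counter

def train_time_lms(docs_by_bin):
    """One unigram language model per time bin: bin -> (token counts, total)."""
    return {b: (Counter(tokens), len(tokens))
            for b, tokens in docs_by_bin.items()}

def predict_bin(tokens, lms, vocab_size, mu=1.0):
    """Score the document under each bin's additively smoothed unigram LM
    and return the bin with the highest log-likelihood."""
    best, best_ll = None, float("-inf")
    for b, (counts, total) in lms.items():
        ll = sum(math.log((counts[t] + mu) / (total + mu * vocab_size))
                 for t in tokens)
        if ll > best_ll:
            best, best_ll = b, ll
    return best

# Toy training data: vocabulary implicitly cues the period.
bins = {
    1800: "carriage telegraph carriage steam".split(),
    1950: "television radio television airplane".split(),
}
lms = train_time_lms(bins)
vocab_size = len({w for toks in bins.values() for w in toks})
```

A real system would use many fine-grained bins over the timeline and pick the argmax (or an interpolated estimate) as the predicted date.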