Global disease monitoring and forecasting with Wikipedia
Infectious disease is a leading threat to public health, economic stability,
and other key social structures. Efforts to mitigate these impacts depend on
accurate and timely monitoring to measure the risk and progress of disease.
Traditional, biologically-focused monitoring techniques are accurate but costly
and slow; in response, new techniques based on social internet data such as
social media and search queries are emerging. These efforts are promising, but
important challenges in the areas of scientific peer review, breadth of
diseases and countries, and forecasting hamper their operational usefulness.
We examine a freely available, open data source for this use: access logs
from the online encyclopedia Wikipedia. Using linear models, language as a
proxy for location, and a systematic yet simple article selection procedure, we
tested 14 location-disease combinations and demonstrate that these data
feasibly support an approach that overcomes these challenges. Specifically, our
proof-of-concept yields models with r^2 up to 0.92, forecasting value up to
the 28 days tested, and several pairs of models similar enough to suggest that
transferring models from one location to another without re-training is
feasible.
Based on these preliminary results, we close with a research agenda designed
to overcome these challenges and produce a disease monitoring and forecasting
system that is significantly more effective, robust, and globally comprehensive
than the current state of the art.
Comment: 27 pages; 4 figures; 4 tables. Version 2: Cite McIver & Brownstein
and adjust novelty claims accordingly; revise title; various revisions for
clarit
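To make the idea of a linear model with forecasting value concrete, the following is a minimal sketch and not the authors' procedure. It assumes a single article's daily view counts as the only predictor, uses made-up numbers, and reports the coefficient of determination as a goodness-of-fit measure. The function names (`fit_line`, `r_squared`, `lagged_fit`) are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    my = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def lagged_fit(views, cases, lag):
    """Fit cases[t + lag] against views[t]; lag > 0 means views lead cases."""
    x = views[:len(views) - lag]
    y = cases[lag:]
    slope, intercept = fit_line(x, y)
    return slope, intercept, r_squared(x, y, slope, intercept)


# Made-up daily series (assumption): cases equal half of the article
# views observed two days earlier, so views carry forecasting value.
views = [10, 12, 15, 20, 26, 30, 28, 22, 16, 12]
cases = [0, 0] + [0.5 * v for v in views[:-2]]

slope2, intercept2, r2_lag2 = lagged_fit(views, cases, lag=2)
_, _, r2_lag0 = lagged_fit(views, cases, lag=0)
print(f"lag 2: slope={slope2:.3f} r2={r2_lag2:.3f}")
print(f"lag 0: r2={r2_lag0:.3f}")
```

In this toy setup the fit is better at a two-day lag than at zero lag, which is the sense in which access logs could anticipate case counts. A real study would use many articles, official incidence data, and held-out evaluation.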
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data
Use of socially generated "big data" to access information about collective
states of mind in human societies has become a new paradigm in the
emerging field of computational social science. A natural application of this
would be the prediction of the society's reaction to a new product in the sense
of popularity and adoption rate. However, bridging the gap between "real time
monitoring" and "early predicting" remains a big challenge. Here we report on
an endeavor to build a minimalistic predictive model for the financial success
of movies based on collective activity data of online users. We show that the
popularity of a movie can be predicted well before its release by measuring and
analyzing the activity level of editors and viewers of the movie's entry in
Wikipedia, the well-known online encyclopedia.
Comment: 13 pages, Including Supporting Information, 7 Figures, Download the
dataset from: http://wwm.phy.bme.hu/SupplementaryDataS1.zi
A Scalable Machine Learning Online Service for Big Data Real-Time Analysis
Proceedings of: IEEE Symposium Series on Computational Intelligence (SSCI 2014). Orlando, FL, USA, December 09-12, 2014.
This work describes a proposal for developing and testing a scalable machine learning architecture that provides real-time predictions or analytics as a service over domain-independent big data. The architecture runs on top of the Hadoop ecosystem and exposes its analytics through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools that facilitate storing, analyzing, understanding and reacting to their data, in either batch or stream fashion; they could become a valuable asset for improving business performance and a key market differentiator in this fast-paced environment. To validate the proposed architecture, two systems are developed, each providing classical machine-learning services in a different domain: the first is a recommender system for web advertising, while the second is a prediction system that learns from gamers' behavior and tries to predict future events such as purchases or churn. An evaluation of these systems shows that both services provide fast responses even under a number of concurrent requests and, for the second system, that the computed predictions significantly outperform random guessing.
This research work is part of the Memento Data Analysis project, co-funded by the Spanish Ministry of Industry, Energy and Tourism with identifier TSI-020601-2012-99.