Why We Read Wikipedia
Wikipedia is one of the most popular sites on the Web, with millions of users
relying on it to satisfy a broad range of information needs every day. Although
it is crucial to understand what exactly these needs are in order to be able to
meet them, little is currently known about why users visit Wikipedia. The goal
of this paper is to fill this gap by combining a survey of Wikipedia readers
with a log-based analysis of user activity. Based on an initial series of user
surveys, we build a taxonomy of Wikipedia use cases along several dimensions,
capturing users' motivations to visit Wikipedia, the depth of knowledge they
are seeking, and their knowledge of the topic of interest prior to visiting
Wikipedia. Then, we quantify the prevalence of these use cases via a
large-scale user survey conducted on live Wikipedia with almost 30,000
responses. Our analyses highlight the variety of factors driving users to
Wikipedia, such as current events, media coverage of a topic, personal
curiosity, work or school assignments, or boredom. Finally, we match survey
responses to the respondents' digital traces in Wikipedia's server logs,
enabling the discovery of behavioral patterns associated with specific use
cases. For instance, we observe long and fast-paced page sequences across
topics for users who are bored or exploring randomly, whereas those using
Wikipedia for work or school spend more time on individual articles focused on
topics such as science. Our findings advance our understanding of reader
motivations and behavior on Wikipedia and can have implications for developers
aiming to improve Wikipedia's user experience, editors striving to cater to
their readers' needs, third-party services (such as search engines) providing
access to Wikipedia content, and researchers aiming to build tools such as
recommendation engines. Comment: Published in WWW'17; v2 fixes caption of Table
Ab initio data-analytics study of carbon-dioxide activation on semiconductor oxide surfaces
The excessive emissions of carbon dioxide (CO2) into the atmosphere
threaten to shift the CO2 cycle planet-wide and induce unpredictable climate
changes. Using artificial intelligence (AI) trained on high-throughput first
principles based data for a broad family of oxides, we develop a strategy for a
rational design of catalytic materials for converting CO2 to fuels and other
useful chemicals. We demonstrate that an electron transfer to the
π-antibonding orbital of the adsorbed molecule and the associated bending
of the initially linear molecule, previously proposed as the indicator of
activation, are insufficient to account for the good catalytic performance of
experimentally characterized oxide surfaces. Instead, our AI model identifies
the common feature of these surfaces in the binding of a molecular O atom to a
surface cation, which results in a strong elongation and therefore weakening of
one molecular C-O bond. This finding suggests using the C-O bond elongation as
an indicator of CO2 activation. Based on these findings, we propose a set of
new promising oxide-based catalysts for CO2 conversion, and a recipe to find
more.
Online Model Evaluation in a Large-Scale Computational Advertising Platform
Online media provides opportunities for marketers through which they can
deliver effective brand messages to a wide range of audiences. Advertising
technology platforms enable advertisers to reach their target audience by
delivering ad impressions to online users in real time. In order to identify
the best marketing message for a user and to purchase impressions at the right
price, we rely heavily on bid prediction and optimization models. Even though
the bid prediction models are well studied in the literature, the equally
important subject of model evaluation is usually overlooked. Effective and
reliable evaluation of an online bidding model is crucial for making faster
model improvements as well as for utilizing the marketing budgets more
efficiently. In this paper, we present an experimentation framework for bid
prediction models where our focus is on the practical aspects of model
evaluation. Specifically, we outline the unique challenges we encounter in our
platform due to a variety of factors such as heterogeneous goal definitions,
varying budget requirements across different campaigns, high seasonality and
the auction-based environment for inventory purchasing. Then, we introduce
return on investment (ROI) as a unified model performance (i.e., success)
metric and explain its merits over more traditional metrics such as
click-through rate (CTR) or conversion rate (CVR). Most importantly, we discuss
commonly used evaluation and metric summarization approaches in detail and
propose a more accurate method for online evaluation of new experimental models
against the baseline. Our meta-analysis-based approach addresses various
shortcomings of other methods and yields statistically robust conclusions that
allow us to conclude experiments more quickly in a reliable manner. We
demonstrate the effectiveness of our evaluation strategy on real campaign data
through some experiments. Comment: Accepted to ICDM201
Big-Data-Driven Materials Science and its FAIR Data Infrastructure
This chapter addresses the fourth paradigm of materials research -- big-data
driven materials science. Its concepts and state-of-the-art are described, and
its challenges and chances are discussed. For furthering the field, Open Data
and an all-embracing sharing, an efficient data infrastructure, and the rich
ecosystem of computer codes used in the community are of critical importance.
For shaping this fourth paradigm and contributing to the development or
discovery of improved and novel materials, data must be what is now called FAIR
-- Findable, Accessible, Interoperable and Re-purposable/Re-usable. This sets
the stage for advances of methods from artificial intelligence that operate on
large data sets to find trends and patterns that cannot be obtained from
individual calculations and not even directly from high-throughput studies.
Recent progress is reviewed and demonstrated, and the chapter is concluded by a
forward-looking perspective, addressing important not yet solved challenges. Comment: submitted to the Handbook of Materials Modeling (eds. S. Yip and W.
Andreoni), Springer 2018/201
On the Complexity of Rule Discovery from Distributed Data
This paper analyses the complexity of rule selection for supervised learning in distributed scenarios. The selection of rules is usually guided by a utility measure such as predictive accuracy or weighted relative accuracy. Other examples are support and confidence, known from association rule mining. A common strategy to tackle rule selection from distributed data is to evaluate rules locally on each dataset. While this works well for homogeneously distributed data, this work proves limitations of this strategy if distributions are allowed to deviate. Identifying those subsets for which local and global distributions deviate may be regarded as an interesting learning task of its own, explicitly taking the locality of data into account. This task can be shown to be basically as complex as discovering the globally best rules from local data. Based on the theoretical results, some guidelines for algorithm design are derived.
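The utility measures named in this abstract can be illustrated with a small sketch. The code below is not taken from the paper: the `Counts` class, the function names, and the two example sites are hypothetical, and the formulas are the standard textbook definitions of support, confidence, and weighted relative accuracy (WRAcc). It only shows how a rule that looks useful when evaluated locally at one site can look poor once the counts of all sites are pooled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Contingency counts for one rule 'body -> head' at one site (hypothetical)."""
    n: int      # number of examples at the site
    body: int   # examples covered by the rule body
    head: int   # examples belonging to the target class
    both: int   # examples covered by the body AND in the target class


def support(c: Counts) -> float:
    # Fraction of all examples for which body and head both hold.
    return c.both / c.n


def confidence(c: Counts) -> float:
    # Fraction of covered examples that belong to the target class.
    return c.both / c.body


def wracc(c: Counts) -> float:
    # Weighted relative accuracy: coverage * (confidence - class prior).
    return (c.body / c.n) * (c.both / c.body - c.head / c.n)


def merge(sites: list[Counts]) -> Counts:
    # Pool the raw counts of all sites to obtain the global view.
    return Counts(
        n=sum(s.n for s in sites),
        body=sum(s.body for s in sites),
        head=sum(s.head for s in sites),
        both=sum(s.both for s in sites),
    )


# Two made-up sites whose distributions deviate strongly from each other.
site_a = Counts(n=100, body=10, head=50, both=9)
site_b = Counts(n=100, body=80, head=50, both=32)
pooled = merge([site_a, site_b])

for name, c in [("site A", site_a), ("site B", site_b), ("pooled", pooled)]:
    print(f"{name}: support={support(c):.3f} "
          f"confidence={confidence(c):.3f} wracc={wracc(c):.3f}")
```

In this made-up example the rule has positive WRAcc at site A but negative WRAcc on the pooled data, which is the kind of local/global deviation the abstract refers to.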
Comparing Knowledge-Based Sampling to Boosting
Boosting algorithms for classification are based on altering the initial distribution assumed to underlie a given example set. The idea of knowledge-based sampling (KBS) is to sample out prior knowledge and previously discovered patterns so that subsequently applied data mining algorithms automatically focus on novel patterns without any need to adjust the base algorithm. This sampling strategy anticipates a user's expectation based on a set of constraints how to adjust the distribution. In the classified case KBS is similar to boosting. This article shows that a specific, very simple KBS algorithm is able to boost weak base classifiers. It discusses differences to AdaBoost.M1 and LogitBoost, and it compares performances of these algorithms empirically in terms of predictive accuracy, the area under the ROC curve measure, and squared error.
Unsupervised learning with contrastive latent variable models
In unsupervised learning, dimensionality reduction is an important tool for
data exploration and visualization. Because these aims are typically
open-ended, it can be useful to frame the problem as looking for patterns that
are enriched in one dataset relative to another. These pairs of datasets occur
commonly, for instance a population of interest vs. control or signal vs.
signal-free recordings. However, there are few methods that work on sets of data
as opposed to data points or sequences. Here, we present a probabilistic model
for dimensionality reduction to discover signal that is enriched in the target
dataset relative to the background dataset. The data in these sets do not need
to be paired or grouped beyond set membership. By using a probabilistic model
where some structure is shared amongst the two datasets and some is unique to
the target dataset, we are able to recover interesting structure in the latent
space of the target dataset. The method also has the advantages of a
probabilistic model, namely that it allows for the incorporation of prior
information, handles missing data, and can be generalized to different
distributional assumptions. We describe several possible variations of the
model and demonstrate the application of the technique to de-noising, feature
selection, and subgroup discovery settings.
Discovery of mating in the major African livestock pathogen Trypanosoma congolense
The protozoan parasite Trypanosoma congolense is one of the most economically important pathogens of livestock in Africa and, through its impact on cattle health and productivity, has a significant effect on human health and well-being. Despite the importance of this parasite, our knowledge of some of its fundamental biological processes is limited. For example, it is unknown whether mating takes place. In this paper we have taken a population genetics based approach to address this question. The availability of the genome sequence of the parasite allowed us to identify polymorphic microsatellite markers, which were used to genotype T. congolense isolates from livestock in a discrete geographical area of The Gambia. The data showed a high level of diversity with a large number of distinct genotypes, but a deficit in heterozygotes. Further analysis identified cryptic genetic subdivision into four sub-populations. In one of these, parasite genotypic diversity could only be explained by the occurrence of frequent mating in T. congolense. These data are completely inconsistent with previous suggestions that the parasite expands asexually in the absence of mating. The discovery of mating in this species of trypanosome has significant consequences for the spread of critical traits, such as drug resistance, as well as for fundamental aspects of the biology and epidemiology of this neglected but economically important pathogen.