41,682 research outputs found

Why We Read Wikipedia

Author: DeMaio T. J.
Gelman A.
Goel S.
Harkness J. A.
Jurgens D.
Kish L.
Klösgen W.
Krug S.
Lee B. K.
Mukhopadhyay P.
Salganik M. J.
Strauss A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users' motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia. Then, we quantify the prevalence of these use cases via a large-scale user survey conducted on live Wikipedia with almost 30,000 responses. Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom. Finally, we match survey responses to the respondents' digital traces in Wikipedia's server logs, enabling the discovery of behavioral patterns associated with specific use cases. For instance, we observe long and fast-paced page sequences across topics for users who are bored or exploring randomly, whereas those using Wikipedia for work or school spend more time on individual articles focused on topics such as science. Our findings advance our understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia's user experience, editors striving to cater to their readers' needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines. Comment: Published in WWW'17; v2 fixes caption of Table

arXiv.org e-Print Archive

Crossref

MAnnheim DOCument Server

Publikationsserver der RWTH Aachen University

Ab initio data-analytics study of carbon-dioxide activation on semiconductor oxide surfaces

Author: Ghiringhelli Luca M.
Illas Francesc
Levchenko Sergey V.
Mazheika Aliaksei
Scheffler Matthias
Valero Rosendo
Vines Francesc
Wang Yanggang
Publication date: 29/05/2020

The excessive emissions of carbon dioxide (CO$_2$) into the atmosphere threaten to shift the CO$_2$ cycle planet-wide and induce unpredictable climate changes. Using artificial intelligence (AI) trained on high-throughput first principles based data for a broad family of oxides, we develop a strategy for a rational design of catalytic materials for converting CO$_2$ to fuels and other useful chemicals. We demonstrate that an electron transfer to the $\pi^*$-antibonding orbital of the adsorbed molecule and the associated bending of the initially linear molecule, previously proposed as the indicator of activation, are insufficient to account for the good catalytic performance of experimentally characterized oxide surfaces. Instead, our AI model identifies the common feature of these surfaces in the binding of a molecular O atom to a surface cation, which results in a strong elongation and therefore weakening of one molecular C-O bond. This finding suggests using the C-O bond elongation as an indicator of CO$_2$ activation. Based on these findings, we propose a set of new promising oxide-based catalysts for CO$_2$ conversion, and a recipe to find more.

arXiv.org e-Print Archive

PubMed Central

Diposit Digital de la Universitat de Barcelona

MPG.PuRe

Online Model Evaluation in a Large-Scale Computational Advertising Platform

Author: Dasdan Ali
Orten Burkay
Shariat Shahriar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 31/08/2015

Online media provides opportunities for marketers through which they can deliver effective brand messages to a wide range of audiences. Advertising technology platforms enable advertisers to reach their target audience by delivering ad impressions to online users in real time. In order to identify the best marketing message for a user and to purchase impressions at the right price, we rely heavily on bid prediction and optimization models. Even though the bid prediction models are well studied in the literature, the equally important subject of model evaluation is usually overlooked. Effective and reliable evaluation of an online bidding model is crucial for making faster model improvements as well as for utilizing the marketing budgets more efficiently. In this paper, we present an experimentation framework for bid prediction models where our focus is on the practical aspects of model evaluation. Specifically, we outline the unique challenges we encounter in our platform due to a variety of factors such as heterogeneous goal definitions, varying budget requirements across different campaigns, high seasonality and the auction-based environment for inventory purchasing. Then, we introduce return on investment (ROI) as a unified model performance (i.e., success) metric and explain its merits over more traditional metrics such as click-through rate (CTR) or conversion rate (CVR). Most importantly, we discuss commonly used evaluation and metric summarization approaches in detail and propose a more accurate method for online evaluation of new experimental models against the baseline. Our meta-analysis-based approach addresses various shortcomings of other methods and yields statistically robust conclusions that allow us to conclude experiments more quickly in a reliable manner. We demonstrate the effectiveness of our evaluation strategy on real campaign data through some experiments. Comment: Accepted to ICDM201
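The metrics named in this abstract can be illustrated with a small sketch. The abstract does not give formulas or say which meta-analysis procedure is used, so everything below is an assumption: CTR is taken as clicks per impression, CVR as conversions per click, ROI as net return per unit of spend, and the pooling step is a textbook fixed-effect, inverse-variance weighted average of per-campaign effects. The function names and the sample numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate, assumed here to be clicks / impressions."""
    return clicks / impressions


def cvr(conversions: int, clicks: int) -> float:
    """Conversion rate, assumed here to be conversions / clicks."""
    return conversions / clicks


def roi(value: float, cost: float) -> float:
    """Return on investment, assumed here to be (value - cost) / cost."""
    return (value - cost) / cost


def pooled_effect(effects: Sequence[float],
                  variances: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance pooling of per-campaign effects.

    Returns (pooled estimate, standard error). This is a generic
    meta-analysis technique, not necessarily the paper's procedure.
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("need one variance per effect")
    weights = [1.0 / v for v in variances]  # precise campaigns weigh more
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


if __name__ == "__main__":
    # Hypothetical ROI lifts (experimental minus baseline) for two campaigns.
    lifts = [0.10, 0.02]
    lift_variances = [0.01, 0.04]
    estimate, std_err = pooled_effect(lifts, lift_variances)
    print(f"pooled lift = {estimate:.3f} +/- {1.96 * std_err:.3f}")
```

Weighting by inverse variance lets a noisy, low-budget campaign contribute less to the summary than a well-measured one, which is one common way to summarize experiments across heterogeneous campaigns.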

arXiv.org e-Print Archive

Crossref

Big-Data-Driven Materials Science and its FAIR Data Infrastructure

This chapter addresses the fourth paradigm of materials research -- big-data driven materials science. Its concepts and state-of-the-art are described, and its challenges and chances are discussed. For furthering the field, Open Data and an all-embracing sharing, an efficient data infrastructure, and the rich ecosystem of computer codes used in the community are of critical importance. For shaping this fourth paradigm and contributing to the development or discovery of improved and novel materials, data must be what is now called FAIR -- Findable, Accessible, Interoperable and Re-purposable/Re-usable. This sets the stage for advances of methods from artificial intelligence that operate on large data sets to find trends and patterns that cannot be obtained from individual calculations and not even directly from high-throughput studies. Recent progress is reviewed and demonstrated, and the chapter is concluded by a forward-looking perspective, addressing important not yet solved challenges. Comment: submitted to the Handbook of Materials Modeling (eds. S. Yip and W. Andreoni), Springer 2018/201

arXiv.org e-Print Archive

Crossref

MPG.PuRe

On the Complexity of Rule Discovery from Distributed Data

Author: Scholz Martin

This paper analyses the complexity of rule selection for supervised learning in distributed scenarios. The selection of rules is usually guided by a utility measure such as predictive accuracy or weighted relative accuracy. Other examples are support and confidence, known from association rule mining. A common strategy to tackle rule selection from distributed data is to evaluate rules locally on each dataset. While this works well for homogeneously distributed data, this work proves limitations of this strategy if distributions are allowed to deviate. To identify those subsets for which local and global distributions deviate may be regarded as an interesting learning task of its own, explicitly taking the locality of data into account. This task can be shown to be basically as complex as discovering the globally best rules from local data. Based on the theoretical results some guidelines for algorithm design are derived.
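The utility measures mentioned in this abstract can be sketched in a few lines. The definitions below are the standard ones for a rule "IF condition THEN class" (support, confidence, and weighted relative accuracy as coverage times the gain in class probability); the `RuleCounts` type, the `pool` helper, and the two-site counts are hypothetical and chosen only to show how local and global evaluations can disagree when the sites' distributions deviate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RuleCounts:
    """Counts for one rule 'IF condition THEN class' on one dataset."""
    n: int        # all examples
    n_cond: int   # examples covered by the condition
    n_class: int  # examples of the target class
    n_both: int   # covered examples that are of the target class


def support(c: RuleCounts) -> float:
    return c.n_both / c.n


def confidence(c: RuleCounts) -> float:
    return c.n_both / c.n_cond if c.n_cond else 0.0


def wracc(c: RuleCounts) -> float:
    """Weighted relative accuracy: P(cond) * (P(class | cond) - P(class))."""
    return (c.n_cond / c.n) * (confidence(c) - c.n_class / c.n)


def pool(parts: Iterable[RuleCounts]) -> RuleCounts:
    """Merge counts from several sites into one global dataset."""
    parts = list(parts)
    return RuleCounts(
        n=sum(p.n for p in parts),
        n_cond=sum(p.n_cond for p in parts),
        n_class=sum(p.n_class for p in parts),
        n_both=sum(p.n_both for p in parts),
    )


if __name__ == "__main__":
    # Hypothetical sites with very different class priors.
    site_a = RuleCounts(n=100, n_cond=80, n_class=80, n_both=64)
    site_b = RuleCounts(n=100, n_cond=20, n_class=20, n_both=4)
    for name, c in [("site A", site_a), ("site B", site_b),
                    ("pooled", pool([site_a, site_b]))]:
        print(f"{name}: support={support(c):.2f} "
              f"confidence={confidence(c):.2f} WRAcc={wracc(c):.3f}")
```

In this constructed example the rule is no better than the class prior at either site (WRAcc of 0 locally), yet on the pooled data its WRAcc is 0.09, so evaluating it only locally would discard a globally useful rule.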

Research Papers in Economics

Comparing Knowledge-Based Sampling to Boosting

Author: Scholz Martin

Boosting algorithms for classification are based on altering the initial distribution assumed to underlie a given example set. The idea of knowledge-based sampling (KBS) is to sample out prior knowledge and previously discovered patterns so that subsequently applied data mining algorithms automatically focus on novel patterns without any need to adjust the base algorithm. This sampling strategy anticipates a user's expectation based on a set of constraints how to adjust the distribution. In the classified case KBS is similar to boosting. This article shows that a specific, very simple KBS algorithm is able to boost weak base classifiers. It discusses differences to AdaBoost.M1 and LogitBoost, and it compares performances of these algorithms empirically in terms of predictive accuracy, the area under the ROC curve measure, and squared error.

Research Papers in Economics

Unsupervised learning with contrastive latent variable models

Author: Ghosh Soumya
Ng Kenney
Severson Kristen
Publication date: 14/11/2018

In unsupervised learning, dimensionality reduction is an important tool for data exploration and visualization. Because these aims are typically open-ended, it can be useful to frame the problem as looking for patterns that are enriched in one dataset relative to another. These pairs of datasets occur commonly, for instance a population of interest vs. control or signal vs. signal free recordings. However, there are few methods that work on sets of data as opposed to data points or sequences. Here, we present a probabilistic model for dimensionality reduction to discover signal that is enriched in the target dataset relative to the background dataset. The data in these sets do not need to be paired or grouped beyond set membership. By using a probabilistic model where some structure is shared amongst the two datasets and some is unique to the target dataset, we are able to recover interesting structure in the latent space of the target dataset. The method also has the advantages of a probabilistic model, namely that it allows for the incorporation of prior information, handles missing data, and can be generalized to different distributional assumptions. We describe several possible variations of the model and demonstrate the application of the technique to de-noising, feature selection, and subgroup discovery settings.

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Discovery of mating in the major African livestock pathogen Trypanosoma congolense

The protozoan parasite, Trypanosoma congolense, is one of the most economically important pathogens of livestock in Africa and, through its impact on cattle health and productivity, has a significant effect on human health and well-being. Despite the importance of this parasite, our knowledge of some of the fundamental biological processes is limited. For example, it is unknown whether mating takes place. In this paper we have taken a population genetics based approach to address this question. The availability of genome sequence of the parasite allowed us to identify polymorphic microsatellite markers, which were used to genotype T. congolense isolates from livestock in a discrete geographical area of The Gambia. The data showed a high level of diversity with a large number of distinct genotypes, but a deficit in heterozygotes. Further analysis identified cryptic genetic subdivision into four sub-populations. In one of these, parasite genotypic diversity could only be explained by the occurrence of frequent mating in T. congolense. These data are completely inconsistent with previous suggestions that the parasite expands asexually in the absence of mating. The discovery of mating in this species of trypanosome has significant consequences for the spread of critical traits, such as drug resistance, as well as for fundamental aspects of the biology and epidemiology of this neglected but economically important pathogen.
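The "deficit in heterozygotes" mentioned in this abstract is usually quantified by comparing observed heterozygosity with the value expected under Hardy-Weinberg proportions. The sketch below does this for a single diploid locus using the standard inbreeding coefficient F_IS = 1 - Ho/He. It is an illustration only: the abstract does not say which statistics the authors computed, no small-sample correction is applied, and the allele labels are made up.

```python
from collections import Counter
from typing import Sequence, Tuple

Genotype = Tuple[int, int]  # two alleles at one diploid microsatellite locus


def observed_heterozygosity(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals carrying two different alleles (Ho)."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes: Sequence[Genotype]) -> float:
    """Hardy-Weinberg expectation He = 1 - sum(p_i^2) over allele frequencies."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def f_is(genotypes: Sequence[Genotype]) -> float:
    """F_IS = 1 - Ho/He; positive values indicate a heterozygote deficit."""
    he = expected_heterozygosity(genotypes)
    if he == 0.0:
        raise ValueError("locus is monomorphic; F_IS is undefined")
    return 1.0 - observed_heterozygosity(genotypes) / he


if __name__ == "__main__":
    # Hypothetical allele sizes for five isolates at one locus.
    sample = [(180, 180), (180, 180), (184, 184), (184, 184), (180, 184)]
    print(f"Ho={observed_heterozygosity(sample):.2f} "
          f"He={expected_heterozygosity(sample):.2f} "
          f"F_IS={f_is(sample):.2f}")
```

A positive F_IS in a pooled sample can arise from hidden population subdivision as well as from a lack of random mating, which is why the abstract's step of splitting the data into sub-populations matters before drawing conclusions about mating.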

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Edinburgh Research Explorer

Enlighten

Ab initio data-analytics study of carbon-dioxide activation on semiconductor oxide surfaces

Author: Ghiringhelli L.
Illas F.
Levchenko S.
Mazheika A.
Scheffler M.
Valero R.
Vines F.
Wang Y.

The excessive emissions of carbon dioxide (CO2) into the atmosphere threaten to shift the CO2 cycle planet-wide and induce unpredictable climate changes. Using artificial intelligence (AI) trained on high-throughput first principles based data for a broad family of oxides, we develop a strategy for a rational design of catalytic materials for converting CO2 to fuels and other useful chemicals. We demonstrate that an electron transfer to the π*-antibonding orbital of the adsorbed molecule and the associated bending of the initially linear molecule, previously proposed as the indicator of activation, are insufficient to account for the good catalytic performance of experimentally characterized oxide surfaces. Instead, our AI model identifies the common feature of these surfaces in the binding of a molecular O atom to a surface cation, which results in a strong elongation and therefore weakening of one molecular C-O bond. This finding suggests using the C-O bond elongation as an indicator of CO2 activation. Based on these findings, we propose a set of new promising oxide-based catalysts for CO2 conversion, and a recipe to find more.

MPG.PuRe