Decoding Information from noisy, redundant, and intentionally-distorted sources
Advances in information technology reduce barriers to information
propagation, but they also induce the problem of information overload. Merely
digesting the information relevant to a decision has become a daunting task
due to the sheer amount available. This information, such as that generated by evaluation systems
developed by various web sites, is in general useful but may be noisy and may
also contain biased entries. In this study, we establish a framework to
systematically tackle the challenging problem of information decoding in the
presence of massive and redundant data. When applied to a voting system, our
method simultaneously ranks the raters and the ratees using only the evaluation
data, consisting of an array of scores each of which represents the rating of a
ratee by a rater. Not only is our approach effective in decoding information,
it is also shown to be robust against various hypothetical types of noise as
well as intentional abuses.
Comment: 19 pages, 5 figures, accepted for publication in Physica
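The simultaneous ranking of raters and ratees described above can be illustrated with a minimal sketch. The update rule below (ratee quality as a weighted mean, rater weight as the inverse of mean squared deviation from consensus) is an illustrative assumption, not necessarily the authors' exact iteration:

```python
def decode_ratings(scores, n_iter=50):
    """scores[i][j] = rating of ratee j by rater i (None if missing).

    Alternates between estimating ratee quality and rater reliability,
    so noisy or abusive raters are progressively down-weighted.
    """
    n_raters, n_ratees = len(scores), len(scores[0])
    weights = [1.0] * n_raters          # rater reliabilities
    quality = [0.0] * n_ratees          # ratee quality estimates
    for _ in range(n_iter):
        # Ratee quality: reliability-weighted mean of received ratings.
        for j in range(n_ratees):
            num = den = 0.0
            for i in range(n_raters):
                if scores[i][j] is not None:
                    num += weights[i] * scores[i][j]
                    den += weights[i]
            quality[j] = num / den if den else 0.0
        # Rater weight: inverse mean squared deviation from consensus.
        for i in range(n_raters):
            devs = [(scores[i][j] - quality[j]) ** 2
                    for j in range(n_ratees) if scores[i][j] is not None]
            weights[i] = 1.0 / (sum(devs) / len(devs) + 1e-6) if devs else 0.0
    return weights, quality
```

With two agreeing raters and one dissenter, the dissenter's weight collapses and the consensus follows the majority.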
A Puff of Steem: Security Analysis of Decentralized Content Curation
Decentralized content curation is the process through which uploaded posts are ranked and filtered based exclusively on users' feedback. Platforms such as the blockchain-based Steemit employ this type of curation while providing monetary incentives to promote the visibility of high quality posts according to the perception of the participants. Despite the wide adoption of the platform, very little is known regarding its performance and resilience characteristics. In this work, we provide a formal model for decentralized content curation that identifies salient complexity and game-theoretic measures of performance and resilience to selfish participants. Armed with our model, we provide a first analysis of Steemit, identifying the conditions under which the system can be expected to correctly converge to curation, while we demonstrate its susceptibility to selfish participant behaviour. We validate our theoretical results with system simulations in various scenarios.
Equality of Voice: Towards Fair Representation in Crowdsourced Top-K Recommendations
To help their users to discover important items at a particular time, major
websites like Twitter, Yelp, TripAdvisor or NYTimes provide Top-K
recommendations (e.g., 10 Trending Topics, Top 5 Hotels in Paris or 10 Most
Viewed News Stories), which rely on crowdsourced popularity signals to select
the items. However, different sections of a crowd may have different
preferences, and there is a large silent majority who do not explicitly express
their opinion. Also, the crowd often consists of actors like bots, spammers, or
people running orchestrated campaigns. Recommendation algorithms today largely
do not consider such nuances, hence are vulnerable to strategic manipulation by
small but hyper-active user groups.
To fairly aggregate the preferences of all users while recommending top-K
items, we borrow ideas from prior research on social choice theory, and
identify a voting mechanism called Single Transferable Vote (STV) as having
many of the fairness properties we desire in top-K item (s)elections. We
develop an innovative mechanism to attribute the preferences of the silent
majority, which also makes STV fully operational. We show the generalizability of our
approach by implementing it on two different real-world datasets. Through
extensive experimentation and comparison with state-of-the-art techniques, we
show that our proposed approach provides maximum user satisfaction, and cuts
down drastically on items disliked by most but hyper-actively promoted by a few
users.
Comment: In the proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). Please cite the conference version.
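The STV tallying the abstract builds on can be sketched as follows. This minimal implementation (Droop quota, fractional Gregory surplus transfer, lowest-candidate elimination) is a textbook variant, not the paper's exact mechanism, and it omits the silent-majority preference attribution:

```python
def stv(ballots, seats):
    """Elect `seats` winners from ranked-preference ballots by STV."""
    quota = len(ballots) // (seats + 1) + 1        # Droop quota
    weights = [1.0] * len(ballots)                 # shrink on surplus transfer
    active = {c for b in ballots for c in b}       # candidates still in the race
    elected = []

    def first_active(ballot):
        for c in ballot:
            if c in active and c not in elected:
                return c
        return None

    while len(elected) < seats and active - set(elected):
        tallies = {c: 0.0 for c in active if c not in elected}
        for b, w in zip(ballots, weights):
            c = first_active(b)
            if c is not None:
                tallies[c] += w
        if len(tallies) + len(elected) <= seats:   # everyone left gets a seat
            elected.extend(sorted(tallies))
            break
        winner = max(tallies, key=tallies.get)
        if tallies[winner] >= quota:
            # Gregory method: pass the surplus on at a reduced weight.
            factor = (tallies[winner] - quota) / tallies[winner]
            for i, b in enumerate(ballots):
                if first_active(b) == winner:
                    weights[i] *= factor
            elected.append(winner)
        else:
            loser = min(tallies, key=tallies.get)  # eliminate; votes transfer
            active.discard(loser)
    return elected
```

For example, with nine ballots and two seats, a candidate reaching the Droop quota of 4 is seated and the eliminated candidate's votes transfer to later preferences.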
Detecting Policy Preferences and Dynamics in the UN General Debate with Neural Word Embeddings
Foreign policy analysis has been struggling to find ways to measure policy
preferences and paradigm shifts in international political systems. This paper
presents a novel, potential solution to this challenge, through the application
of a neural word embedding (Word2vec) model on a dataset featuring speeches by
heads of state or government in the United Nations General Debate. The paper
provides three key contributions based on the output of the Word2vec model.
First, it presents a set of policy attention indices, synthesizing the semantic
proximity of political speeches to specific policy themes. Second, it
introduces country-specific semantic centrality indices, based on topological
analyses of countries' semantic positions with respect to each other. Third, it
tests the hypothesis that there exists a statistical relation between the
semantic content of political speeches and UN voting behavior, falsifying it
and suggesting that political speeches contain information of a different
nature than that underlying voting outcomes. The paper concludes with a discussion of
the practical use of its results and consequences for foreign policy analysis,
public accountability, and transparency.
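A policy attention index of the kind described, the semantic proximity of a speech to a policy theme, might be computed roughly as below. The toy two-dimensional vectors are placeholders for embeddings a real Word2vec model would learn from the General Debate corpus:

```python
import math

def mean_vec(words, emb):
    """Average the embedding vectors of the words present in `emb`."""
    vecs = [emb[w] for w in words if w in emb]
    dim = len(next(iter(emb.values())))
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def attention_index(speech_tokens, theme_words, emb):
    """Cosine similarity between a speech centroid and a theme centroid."""
    return cosine(mean_vec(speech_tokens, emb), mean_vec(theme_words, emb))
```

A speech dominated by conflict vocabulary should score higher against a security theme than against a climate theme.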
Towards Data-Driven Autonomics in Data Centers
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using generated
data, opens one possible path towards limiting the role of operators in data
centers. In this paper, we present a data-science study of a public Google
dataset collected in a 12K-node cluster with the goal of building and
evaluating a predictive model for node failures. We use BigQuery, the big data
SQL platform from the Google Cloud suite, to process massive amounts of data
and generate a rich feature set characterizing machine state over time. We
describe how an ensemble classifier can be built out of many Random Forest
classifiers each trained on these features, to predict if machines will fail in
a future 24-hour window. Our evaluation reveals that if we limit false positive
rates to 5%, we can achieve true positive rates between 27% and 88% with
precision varying between 50% and 72%. We discuss the practicality of including
our predictive model as the central component of a data-driven autonomic
manager and operating it on-line with live data streams (rather than off-line
on data logs). All of the scripts used for BigQuery and classification analyses
are publicly available from the authors' website.
Comment: 12 pages, 6 figures
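The evaluation strategy described, capping the false positive rate at 5% and reading off the achievable true positive rate, can be sketched as follows. The ensemble members are stubbed as plain scoring callables rather than trained Random Forests:

```python
def ensemble_score(models, x):
    """Average the scores of several classifiers (e.g. Random Forests)."""
    return sum(m(x) for m in models) / len(models)

def threshold_for_fpr(scores, labels, max_fpr):
    """Pick a decision threshold whose false positive rate is <= max_fpr."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    allowed = int(max_fpr * len(negatives))   # false positives we may admit
    if allowed >= len(negatives):
        return float("-inf")                  # any threshold satisfies the cap
    return negatives[allowed] + 1e-9          # just above the cut-off score

def tpr_at(scores, labels, threshold):
    """True positive rate when predicting positive at score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)
```

On a held-out set one would call `threshold_for_fpr(scores, labels, 0.05)` and then report `tpr_at` at that threshold, mirroring the 5%-FPR operating points quoted in the abstract.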
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
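One of the five challenges listed, class imbalance, can be illustrated with the simplest possible remedy: random oversampling of the minority class before training. The review surveys far more sophisticated approaches; this is only a baseline sketch:

```python
import random

def oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are equal."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        Xb.extend(rows)
        yb.extend([label] * len(rows))
        for _ in range(target - len(rows)):   # top up the minority class
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb
```

Applied before fitting any classifier, this prevents the majority class from dominating the loss, at the cost of repeated samples.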