The Cost of Sharing Information in a Social World
With the increasing prevalence of large-scale online social networks, the field has evolved from studying small-scale networks and interactions to massive ones that encompass huge fractions of the world’s population. While many methods focus on techniques at scale applied to a single domain, methods that apply techniques across multiple domains are becoming increasingly important. These methods rely on understanding the complex relationships in the data. In the context of social networks, the available big data allows us to better model and analyze the flow of information within the network.
The first part of this thesis discusses methods to learn and predict more effectively in a social network by leveraging information across multiple domains and types of data. We document a method to identify users from their access to content in a network and from their click behavior. Because click behavior is often hard to obtain, even at a macro level, we also describe a technique to predict it using other public information about the social network.
Communication within a network inevitably carries some bias, attributable to individual preferences and quality as well as to the underlying structure of the network. The second part of the thesis characterizes the structural bias in a network by modeling the underlying information flow as a commodity of trade.
Crowdsourcing with Sparsely Interacting Workers
We consider estimation of worker skills from worker-task interaction data
(with unknown labels) for the single-coin crowd-sourcing binary classification
model in symmetric noise. We define the (worker) interaction graph whose nodes
are workers and an edge between two nodes indicates that the two
workers participated in a common task. We show that skills are asymptotically
identifiable if and only if an appropriate limiting version of the interaction
graph is irreducible and has odd-cycles. We then formulate a weighted rank-one
optimization problem to estimate skills based on observations on an
irreducible, aperiodic interaction graph. We propose a gradient descent scheme
and show that for such interaction graphs estimates converge asymptotically to
the global minimum. We characterize noise robustness of the gradient scheme in
terms of spectral properties of signless Laplacians of the interaction graph.
We then demonstrate that a plug-in estimator based on the estimated skills
achieves state-of-the-art performance on a number of real-world datasets. Our
results have implications for the rank-one matrix completion problem, in that
gradient descent can provably recover rank-one matrices based on
off-diagonal observations of a connected graph with a single odd cycle.
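The identifiability condition above -- an irreducible interaction graph containing an odd cycle -- is equivalent to the graph being connected and non-bipartite, which a BFS two-coloring can check. A minimal sketch (worker ids and edges are illustrative, not from the thesis data):

```python
from collections import deque

def skills_identifiable(num_workers, edges):
    """Sketch of the identifiability check: skills are asymptotically
    identifiable iff the interaction graph is connected (irreducible)
    and contains an odd cycle, i.e. is non-bipartite."""
    adj = {w: set() for w in range(num_workers)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {0: 0}                 # BFS two-coloring starting from worker 0
    queue = deque([0])
    bipartite = True
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in color:
                color[v] = 1 - color[u]
                queue.append(v)
            elif color[v] == color[u]:
                bipartite = False  # same color on both ends: odd cycle exists
    connected = len(color) == num_workers
    return connected and not bipartite

# A triangle has an odd cycle -> identifiable; a 4-cycle is bipartite -> not.
print(skills_identifiable(3, [(0, 1), (1, 2), (2, 0)]))          # True
print(skills_identifiable(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # False
```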
Efficient inference algorithms for network activities
The real social network and associated communities are often hidden beneath the declared friend or group lists in social networks. We usually observe only the manifestation of these hidden networks and communities, in the form of recurrent, time-stamped individual activities. Inferring the relationships between users/nodes or groups of users/nodes is further complicated when activities are interval-censored, that is, when one only observes the number of activities that occurred in certain time windows. The same phenomenon arises in online advertising, where advertisers deliver a set of advertisement impressions and observe a set of conversions (i.e., product/service adoptions). In this case, the advertisers want to know which advertisements best appeal to customers and, most importantly, their conversion rates.
Inspired by these challenges, we investigated inference algorithms that efficiently recover user relationships in both cases: time-stamped data and interval-censored data. For time-stamped data, we proposed a novel algorithm called NetCodec, which relies on a Hawkes process that models the intertwined relationship between group participation and between-user influence. Using the Bayesian variational principle and optimization techniques, NetCodec can infer both group participation and user influence simultaneously, with per-iteration complexity O((N+I)G), where N is the number of events, I is the number of users, and G is the number of groups. For interval-censored data, we proposed a Monte-Carlo EM inference algorithm in which we iteratively impute the time-stamped events using a Poisson process whose intensity function approximates the underlying intensity function. We show that the proposed simulation approach delivers better inference performance than baseline methods.
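To illustrate the Hawkes-process machinery that NetCodec builds on, here is a minimal univariate simulator based on Ogata's thinning algorithm -- a textbook sketch with illustrative parameters, not the NetCodec implementation (which handles the multivariate, group-structured case):

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate a univariate Hawkes process with intensity
    lambda(t) = mu + alpha * sum over past events t_i of exp(-beta*(t - t_i)),
    via Ogata's thinning algorithm. Requires alpha < beta for stationarity."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < horizon:
        # The current intensity upper-bounds lambda until the next event,
        # because the exponential kernel only decays between events.
        lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)            # candidate event time
        if t >= horizon:
            break
        lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:      # thinning (accept/reject) step
            events.append(t)
    return events

events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=50.0, seed=1)
print(len(events), "events; first at t =", round(events[0], 3))
```

Each accepted event raises the intensity, producing the self-exciting bursts that make Hawkes processes a natural model for cascading user activity.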
In the advertisement problem, we propose a Click-to-Conversion delay model that uses Hawkes processes to model the advertisement impressions and thinned Poisson processes to model the Click-to-Conversion mechanism. We then derive an efficient Maximum Likelihood Estimator that utilizes the Minorization-Maximization framework. We verify the model against real-life online advertisement logs in comparison with recent conversion-rate estimation methods.
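A rough sketch of the thinning idea behind the Click-to-Conversion mechanism: each click converts independently with some probability after a random delay, and only conversions falling inside the observation window are seen. All names and parameters below are illustrative, not the thesis notation:

```python
import random

def simulate_conversions(click_times, conv_prob, mean_delay, horizon, seed=0):
    """Thinned-process sketch of Click-to-Conversion: each click converts
    with probability conv_prob after an exponentially distributed delay;
    conversions after the horizon are censored (unobserved)."""
    rng = random.Random(seed)
    conversions = []
    for t in click_times:
        if rng.random() < conv_prob:                 # thinning: does it convert?
            delay = rng.expovariate(1.0 / mean_delay)
            if t + delay <= horizon:                 # observed inside the window
                conversions.append(t + delay)
    return sorted(conversions)

clicks = [0.5, 1.2, 3.0, 4.4, 7.1, 9.9]
print(simulate_conversions(clicks, conv_prob=0.4, mean_delay=2.0, horizon=12.0, seed=7))
```

The censoring at the horizon is what makes naive conversion-rate estimates biased, motivating a delay model in the likelihood.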
To facilitate reproducible research, we also developed an open-source software package that focuses on the various Hawkes processes proposed in the above-mentioned works and in prior works. We provide efficient parallel (multi-core) implementations of the inference algorithms using the Bayesian variational inference framework. To further speed up these inference algorithms, we also explored distributed optimization techniques for convex optimization in the distributed-data setting. We formulate this problem as a consensus-constrained optimization problem and solve it with the alternating direction method of multipliers (ADMM). It turns out that using a bipartite graph as the communication topology exhibits the fastest convergence.
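The consensus-constrained formulation can be sketched on a toy problem: each worker holds a private quadratic objective, and ADMM drives all local copies to agree on the global minimizer. This minimal sketch uses a star topology for simplicity (the thesis studies richer topologies such as bipartite graphs), and the objective is illustrative:

```python
def consensus_admm(local_targets, rho=1.0, iters=100):
    """Consensus ADMM sketch: worker i minimizes (x - a_i)^2 / 2 subject to
    all local copies x_i agreeing on a global variable z. The consensus
    solution is the mean of the a_i."""
    n = len(local_targets)
    x = [0.0] * n          # local primal variables
    u = [0.0] * n          # scaled dual variables
    z = 0.0                # global consensus variable
    for _ in range(iters):
        # local proximal step: argmin (x - a)^2/2 + rho/2 (x - z + u)^2
        x = [(a + rho * (z - ui)) / (1 + rho) for a, ui in zip(local_targets, u)]
        z = sum(xi + ui for xi, ui in zip(x, u)) / n   # consensus averaging
        u = [ui + xi - z for ui, xi in zip(u, x)]      # dual update
    return z

print(round(consensus_admm([1.0, 2.0, 6.0]), 6))  # converges to the mean, 3.0
```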
Estimating user interaction probability for non-guaranteed display advertising
Billions of advertisements are displayed to internet users every hour, a market worth approximately $110 billion in 2013. The process of displaying advertisements to internet users is managed
by advertising exchanges, automated systems which match advertisements to users while balancing
conflicting advertiser, publisher, and user objectives. Real-time bidding is a recent development in
the online advertising industry that allows more than one exchange (or demand-side platform) to
bid for the right to deliver an ad to a specific user while that user is loading a webpage, creating
a liquid market for ad impressions. Real-time bidding accounted for around 10% of the German
online advertising market in late 2013, a figure which is growing at an annual rate of around 40%.
In this competitive market, accurately calculating the expected value of displaying an ad to a user
is essential for profitability.
In this thesis, we develop a system that significantly improves the existing method for estimating
the value of displaying an ad to a user in a German advertising exchange and demand-side platform.
The most significant calculation in this system is estimating the probability of a user interacting
with an ad in a given context. We first implement a hierarchical main-effects and latent factor
model which is similar enough to the existing exchange system to allow a simple and robust upgrade
path, while improving performance substantially. We then use regularized generalized linear models
to estimate the probability of an ad interaction occurring following an individual user impression
event. We build a system capable of training thousands of campaign models daily, handling over 300
million events per day, 18 million recurrent users, and thousands of model dimensions. Together,
these systems improve on the log-likelihood of the existing method by over 10%.
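As a toy-scale illustration of the regularized generalized linear model family used for interaction-probability estimation (the production system trains thousands of campaign models on millions of events; feature names and data below are invented):

```python
import math

def predict_proba(w, x):
    """Predicted probability of an ad interaction for feature vector x."""
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))

def train_logistic(X, y, l2=0.1, lr=0.1, epochs=500):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [l2 * wj for wj in w]             # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            p = predict_proba(w, xi)
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n  # average log-loss gradient
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Illustrative data: feature[1] might encode, e.g., ad-user topic match.
X = [[1.0, 1.0], [1.0, 0.9], [1.0, 0.1], [1.0, 0.0]]   # [bias, feature]
y = [1, 1, 0, 0]                                        # interaction observed?
w = train_logistic(X, y)
print(round(predict_proba(w, [1.0, 1.0]), 3), round(predict_proba(w, [1.0, 0.0]), 3))
```

The L2 penalty is what keeps thousands of sparse campaign models stable when some features are rarely observed.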
We also provide an overview of the real-time bidding market microstructure in the German real-
time bidding market in September and November 2013, and indicate potential areas for exploiting
competitors’ behaviour, including building user features from real-time bid responses. Finally,
for personal interest, we experiment with scalable k-nearest neighbour search algorithms, nonlinear
dimension reduction, manifold regularization, graph clustering, and stochastic block model inference
using the large datasets from the linear model.
Quantum circuits with many photons on a programmable nanophotonic chip
Growing interest in quantum computing for practical applications has led to a
surge in the availability of programmable machines for executing quantum
algorithms. Present day photonic quantum computers have been limited either to
non-deterministic operation, low photon numbers and rates, or fixed random gate
sequences. Here we introduce a full-stack hardware-software system for
executing many-photon quantum circuits using integrated nanophotonics: a
programmable chip, operating at room temperature and interfaced with a fully
automated control system. It enables remote users to execute quantum algorithms
requiring up to eight modes of strongly squeezed vacuum initialized as two-mode
squeezed states in single temporal modes, a fully general and programmable
four-mode interferometer, and genuine photon number-resolving readout on all
outputs. Multi-photon detection events with photon numbers and rates exceeding
any previous quantum optical demonstration on a programmable device are made
possible by strong squeezing and high sampling rates. We verify the
non-classicality of the device output, and use the platform to carry out
proof-of-principle demonstrations of three quantum algorithms: Gaussian boson
sampling, molecular vibronic spectra, and graph similarity.
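The photon-number-resolving readout of a two-mode squeezed vacuum state sees perfectly correlated pairs: both modes carry the same photon number n, with a geometric distribution set by the squeezing parameter. This is a textbook illustration of the correlated statistics, not the device model:

```python
import math

def tms_photon_dist(r, n_max=60):
    """Photon-number distribution of a two-mode squeezed vacuum with
    squeezing parameter r: P(n, n) = (1 - lam) * lam**n, lam = tanh(r)**2;
    off-diagonal outcomes (different counts in the two modes) never occur."""
    lam = math.tanh(r) ** 2
    return [(1 - lam) * lam ** n for n in range(n_max + 1)]

r = 1.0
p = tms_photon_dist(r)
mean_n = sum(n * pn for n, pn in enumerate(p))
print(round(sum(p), 6))                               # ~1.0 (normalized)
print(round(mean_n, 3), round(math.sinh(r) ** 2, 3))  # mean photons per mode = sinh(r)^2
```

Stronger squeezing (larger r) fattens the tail of this distribution, which is why strong squeezing drives the high multi-photon detection rates reported above.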
Measuring Collective Attention in Online Content: Sampling, Engagement, and Network Effects
The production and consumption of online content have been increasing rapidly, whereas human attention is a scarce resource. Understanding how content captures collective attention has become a challenge of growing importance. In this thesis, we tackle this challenge on three fronts -- quantifying sampling effects in social media data; measuring engagement behaviors towards online content; and estimating network effects induced by recommender systems.
Data sampling is a fundamental problem. To obtain a list of items, one common method is sampling based on item prevalence in social media streams. However, social data is often noisy and incomplete, which may affect subsequent observations. For each item, user behaviors can be conceptualized in two steps -- the first step relates to content appeal, measured by the number of clicks; the second relates to content quality, measured by post-click metrics, e.g., dwell time, likes, or comments. We categorize online attention (behaviors) into two classes: popularity (clicking) and engagement (watching, liking, or commenting). Moreover, modern platforms use recommender systems to present users with a tailored content display that maximizes satisfaction. A recommendation alters the appeal of an item by changing its ranking, and consequently impacts its popularity.
Our research is enabled by data from the largest video hosting site, YouTube. We use YouTube URLs shared on Twitter as a sampling protocol to obtain a collection of videos, and we track their prevalence from 2015 to 2019. This method creates a longitudinal dataset consisting of more than 5 billion tweets. Although the volume is substantial, we find that Twitter still subsamples the data. Our dataset covers about 80% of all tweets with YouTube URLs. We present a comprehensive measurement study of the Twitter sampling effects across different timescales and different subjects. We find that the volume of missing tweets can be estimated from Twitter rate limit messages, that true entity rankings can be inferred from sampled observations, and that sampling compromises the quality of network and diffusion models.
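The rate-limit estimate can be sketched very simply. Twitter's streaming "limit" messages carry a cumulative count of tweets withheld since the connection started, so the last such count plus the tweets actually delivered approximates the true volume. This is a simplified sketch assuming a single uninterrupted connection:

```python
def estimate_total_tweets(received, limit_track_values):
    """Estimate true tweet volume from a rate-limited stream: 'limit'
    messages report a cumulative count of withheld tweets, so the maximum
    reported value is the total missing so far."""
    missing = max(limit_track_values) if limit_track_values else 0
    return received + missing

# e.g. 9,000 tweets delivered; limit messages reported 200, 750, then 1,000 withheld
print(estimate_total_tweets(9_000, [200, 750, 1_000]))  # 10000
```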
Next, we present the first large-scale measurement study of how users collectively engage with YouTube videos. We study the time and the percentage of each video being watched. We propose a duration-calibrated metric, called relative engagement, which is correlated with recognized notions of content quality, stable over time, and predictable even before a video's upload.
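The duration calibration can be sketched as a rank percentile: a video's average watch percentage is compared against peer videos of similar duration, so long and short videos become comparable. The duration bucketing and exact calibration are simplified here, and the numbers are illustrative:

```python
from bisect import bisect_right

def relative_engagement(watch_pct, peer_watch_pcts):
    """Sketch of relative engagement: the rank percentile of a video's
    average watch percentage among peer videos of similar duration."""
    peers = sorted(peer_watch_pcts)
    return bisect_right(peers, watch_pct) / len(peers)

# a video watched 60% on average, among six peers of similar length
print(relative_engagement(0.60, [0.20, 0.35, 0.50, 0.55, 0.70, 0.90]))  # 4/6 ≈ 0.667
```

Because the comparison is within a duration bucket, a 0.9 relative engagement means the same thing for a 1-minute clip and a 1-hour lecture.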
Lastly, we examine the network effects induced by the YouTube recommender system. We construct the recommendation network for 60,740 music videos from 4,435 professional artists. An edge indicates that the target video is recommended on the webpage of the source video. We discover a popularity bias -- videos are disproportionately recommended towards more popular videos. We use the bow-tie structure to characterize the network and find that the largest strongly connected component consists of 23.1% of the videos while occupying 82.6% of the attention. We also build models to estimate the latent influence between videos and artists. By taking the network structure into account, we can predict video popularity 9.7% better than other baselines.
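The popularity-bias measurement can be sketched as the fraction of recommendation edges that point from a less popular video to a more popular one (video ids and view counts below are illustrative, not the thesis data):

```python
def popularity_bias(edges, views):
    """Fraction of directed recommendation edges (src -> dst) whose
    destination is more popular than the source."""
    toward_popular = sum(1 for src, dst in edges if views[dst] > views[src])
    return toward_popular / len(edges)

views = {"a": 100, "b": 5_000, "c": 300, "d": 40_000}
edges = [("a", "b"), ("c", "b"), ("a", "d"), ("b", "d"), ("d", "a")]
print(popularity_bias(edges, views))  # 4 of 5 edges point upward -> 0.8
```

A fraction well above 0.5 indicates the recommender funnels attention toward already-popular videos, consistent with the bow-tie finding above.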
Altogether, we explore the collective consumption patterns of human attention towards online content. Methods and findings from this thesis can be used by content producers, hosting sites, and online users alike to improve content production, advertising strategies, and recommender systems. We expect our new metrics, methods, and observations to generalize to other multimedia platforms such as the music streaming service Spotify.