Easy over Hard: A Case Study on Deep Learning
While deep learning is an exciting new technique, the benefits of this method
need to be assessed with respect to its computational cost. This is
particularly important for deep learning since these learners need hours (to
weeks) to train the model. Such long training time limits the ability of (a)~a
researcher to test the stability of their conclusion via repeated runs with
different random seeds; and (b)~other researchers to repeat, improve, or even
refute that original work.
For example, recently, deep learning was used to find which questions in the
Stack Overflow programmer discussion forum can be linked together. That deep
learning system took 14 hours to execute. We show here that a very simple
optimizer called differential evolution (DE), used to fine-tune an SVM, can
achieve similar (and sometimes better) results. The DE approach terminated in
10 minutes; i.e., 84 times faster than the deep learning method.
We offer these results as a cautionary tale to the software analytics
community and suggest that not every new innovation should be applied without
critical analysis. If researchers deploy some new and expensive process, that
work should be baselined against some simpler and faster alternatives.
Comment: 12 pages, 6 figures, accepted at FSE201
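The DE-tuned SVM the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: the toy dataset, search bounds, and DE settings are all assumptions, using scipy's off-the-shelf differential evolution to tune the SVM's C and gamma.

```python
# Sketch: tuning an SVM with differential evolution (DE), as described above.
# Dataset, bounds, and DE settings are illustrative assumptions.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def neg_f1(params):
    # Search C and gamma in log10 space; DE minimizes, so negate the score.
    C, gamma = 10.0 ** params[0], 10.0 ** params[1]
    clf = SVC(C=C, gamma=gamma)
    return -cross_val_score(clf, X, y, cv=3, scoring="f1").mean()

result = differential_evolution(neg_f1, bounds=[(-2, 3), (-4, 1)],
                                maxiter=5, popsize=8, seed=0)
print("best F1:", -result.fun)
```

Even with this small budget, DE explores only a few hundred candidate settings, which is why it finishes in minutes rather than hours.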
500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Deep learning methods are useful for high-dimensional data and are becoming
widely used in many areas of software engineering. Deep learners require
extensive computational power and can take a long time to train, making it
difficult to widely validate, repeat, and improve their results. Further,
they are not the best solution in all domains. For example, recent results show
that for finding related Stack Overflow posts, a tuned SVM performs similarly
to a deep learner, but is significantly faster to train. This paper extends
that recent result by clustering the dataset, then tuning simple learners
within each cluster. This approach is over 500 times faster than deep learning
(and
over 900 times faster if we use all the cores on a standard laptop computer).
Significantly, this faster approach generates classifiers nearly as good
(within 2\% F1 Score) as the much slower deep learning method. Hence we
recommend this faster method, since it is much easier to reproduce and uses
far fewer CPU resources. More generally, we recommend that before researchers
release research results, they compare their supposedly sophisticated
methods against simpler alternatives (e.g., applying simpler learners to build
local models).
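The "local models" idea above, cluster first and then fit a simple learner per cluster, can be sketched as below. The clustering algorithm, learner, and toy data are illustrative assumptions; the fallback to a global model for single-class clusters is a detail added here for robustness.

```python
# Sketch of "local models": cluster the data, then fit a simple learner per
# cluster. KMeans, SVC, and the toy dataset are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

global_model = SVC().fit(X, y)          # fallback for one-class clusters
local_models = {}
for c in range(4):
    mask = km.labels_ == c
    if len(np.unique(y[mask])) > 1:
        local_models[c] = SVC().fit(X[mask], y[mask])
    else:
        local_models[c] = global_model

def predict_one(x):
    # Route each point to the model of its nearest cluster.
    c = km.predict(x.reshape(1, -1))[0]
    return local_models[c].predict(x.reshape(1, -1))[0]

preds = [predict_one(x) for x in X[:20]]
print(preds)
```

Because each local learner trains on only a fraction of the data, the per-cluster fits are embarrassingly parallel, which is where the further speedup from using all laptop cores comes from.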
OCC: A Smart Reply System for Efficient In-App Communications
Smart reply systems have been developed for various messaging platforms. In
this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which
is a key enhanced feature on top of the Uber in-app chat system. It enables
driver-partners to quickly respond to rider messages using smart replies. The
smart replies are dynamically selected according to conversation content using
machine learning algorithms. Our system consists of two major components:
intent detection and reply retrieval, which are very different from standard
smart reply systems where the task is to directly predict a reply. It is
designed specifically for mobile applications with short and non-canonical
messages. Reply retrieval utilizes pairings between intent and reply based on
their popularity in chat messages as derived from historical data. For intent
detection, we experiment with a set of embedding and classification techniques
and choose to deploy a solution using an unsupervised distributed embedding
and a nearest-neighbor classifier. It has the advantages of requiring only a
small amount of labeled training data, being simple to develop and deploy to
production, and offering fast inference during serving, making it highly
scalable. At the same time, it performs comparably with deep learning
architectures such as word-level convolutional neural network. Overall, the
system achieves a high accuracy of 76% on intent detection. Currently, the
system is deployed in production for English-speaking countries, and 71% of
in-app communications between riders and driver-partners adopted the smart
replies to speed up the communication process.
Comment: link to demo: https://www.youtube.com/watch?v=nOffUT7rS0A&t=32
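The embedding-plus-nearest-neighbor intent detector described above can be sketched as follows. The random per-word vectors stand in for OCC's unsupervised distributed embedding, and the intents and messages are invented for illustration.

```python
# Sketch of the intent-detection stage: embed a message, then classify it with
# a nearest-neighbor lookup over a small labeled set. The random word vectors
# stand in for a trained unsupervised embedding; intents are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
vocab = {}

def embed(text, dim=16):
    # Average per-word vectors (a stand-in for a trained embedding).
    vecs = [vocab.setdefault(w, rng.standard_normal(dim))
            for w in text.lower().split()]
    return np.mean(vecs, axis=0)

train = [("where are you", "location"), ("im at the corner", "location"),
         ("be there in five minutes", "eta"), ("running late sorry", "eta")]
X = np.stack([embed(t) for t, _ in train])
y = [lab for _, lab in train]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([embed("where r you")]))
```

Once the intent is detected, reply retrieval reduces to a lookup of the most popular historical replies paired with that intent, which is why the whole pipeline stays cheap at serving time.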
Tuning Word2vec for Large Scale Recommendation Systems
Word2vec is a powerful machine learning tool that emerged from Natural
Language Processing (NLP) and is now applied in multiple domains, including
recommender systems, forecasting, and network analysis. As Word2vec is often
used off the shelf, we address the question of whether the default
hyperparameters are suitable for recommender systems. The answer is
emphatically no. In this paper, we first elucidate the importance of
hyperparameter optimization and show that unconstrained optimization yields an
average 221% improvement in hit rate over the default parameters. However,
unconstrained optimization leads to hyperparameter settings that are very
expensive and not feasible for large scale recommendation tasks. To this end,
we demonstrate a 138% average improvement in hit rate with a runtime
budget-constrained hyperparameter optimization. Furthermore, to make
hyperparameter optimization applicable for large scale recommendation
problems where the target dataset is too large to search over, we investigate
generalizing hyperparameter settings from samples. We show that applying
constrained hyperparameter optimization using only a 10% sample of the data
still yields a 91% average improvement in hit rate over the default parameters
when applied to the full datasets. Finally, we apply hyperparameters learned
using our method of constrained optimization on a sample to the Who To Follow
recommendation service at Twitter and are able to increase follow rates by 15%.
Comment: 11 pages, 4 figures, Fourteenth ACM Conference on Recommender System
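The recipe above, searching hyperparameters on a small sample under a runtime budget and reusing the winner on the full data, can be sketched as below. The search space, budget, and especially the `hit_rate` objective are stand-ins: a real run would train word2vec (e.g. with gensim) and score held-out recommendations.

```python
# Sketch of budget-constrained hyperparameter search on a data sample.
# `hit_rate` is a stand-in objective; a real run would train word2vec and
# evaluate recommendation hit rate on held-out interactions.
import random
import time

SPACE = {"window": [3, 5, 10], "negative": [5, 10, 20],
         "alpha": [0.01, 0.025, 0.05]}

def hit_rate(params, sample_frac):
    # Stand-in: pretend larger windows/negative counts help a little.
    return 0.1 + 0.01 * params["window"] + 0.001 * params["negative"]

def budget_search(sample_frac=0.1, budget_s=1.0, trials=20):
    best, best_score = None, -1.0
    start = time.monotonic()
    for _ in range(trials):
        if time.monotonic() - start > budget_s:   # enforce runtime budget
            break
        params = {k: random.choice(v) for k, v in SPACE.items()}
        score = hit_rate(params, sample_frac)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

best_params, score = budget_search()
print(best_params, score)
```

The key design point is that the budget is enforced on wall-clock time, not trial count, so expensive hyperparameter settings automatically consume more of the budget and get sampled less often.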
Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media
Social media is often viewed as a sensor into various societal events such as
disease outbreaks, protests, and elections. We describe the use of social media
as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our
approach detects a broad range of cyber-attacks (e.g., distributed denial of
service (DDOS) attacks, data breaches, and account hijacking) in an
unsupervised manner using just a limited fixed set of seed event triggers. A
new query expansion strategy based on convolutional kernels and dependency
parses helps model reporting structure and aids in identifying key event
characteristics. Through a large-scale analysis over Twitter, we demonstrate
that our approach consistently identifies and encodes events, outperforming
existing methods.
Comment: 13 single-column pages, 5 figures, submitted to KDD 201
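The seed-trigger expansion above can be illustrated in simplified form. The paper's expansion uses convolutional kernels over dependency parses, which is beyond a short sketch; the plain co-occurrence count below is a deliberately simplified stand-in, and all posts and seeds are invented.

```python
# Simplified sketch of seed-based query expansion: start from seed event
# triggers and add terms that co-occur with them in matched posts. The real
# system uses convolutional kernels over dependency parses; this plain
# co-occurrence count is an illustrative stand-in.
from collections import Counter

seeds = {"ddos", "breach", "hacked"}
posts = ["site down after massive ddos attack",
         "customer data breach reported today",
         "account hacked credentials leaked",
         "weather is nice today"]

cooc = Counter()
for post in posts:
    words = set(post.split())
    if words & seeds:                 # post mentions a seed trigger
        cooc.update(words - seeds)    # count the co-occurring terms

expanded = seeds | {w for w, n in cooc.most_common(3)}
print(sorted(expanded))
```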
Extracting Knowledge from the Geometric Shape of Social Network Data Using Topological Data Analysis
Topological data analysis is a novel approach to extracting meaningful information from high-dimensional data and is robust to noise. It is based on topology, which aims to study the geometric shape of data. To apply topological data analysis, an algorithm called mapper is adopted. The output from mapper is a simplicial complex that represents a set of connected clusters of data points. In this paper, we explore the feasibility of topological data analysis for mining social network data by addressing the problem of image popularity. We randomly crawl images from Instagram and analyze the effects of social context and image content on an image's popularity using mapper. Mapper clusters the images using each feature, and the ratio of popularity in each cluster is computed to determine the clusters with a high or low possibility of popularity. Then, the popularity of images is predicted to evaluate the accuracy of topological data analysis. This approach is further compared with traditional clustering algorithms, including k-means and hierarchical clustering, in terms of accuracy, and the results show that topological data analysis outperforms the others. Moreover, topological data analysis provides meaningful information based on the connectivity between the clusters.
https://doi.org/10.3390/e1907036
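The mapper algorithm described above can be sketched minimally: project the data with a filter function, cover the filter's range with overlapping intervals, cluster the points falling in each interval, and link clusters that share points. The interval count, overlap, clusterer, and toy data below are illustrative assumptions.

```python
# Minimal sketch of mapper: filter, overlapping cover, per-interval
# clustering, then edges between clusters that share points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
f = X[:, 0]                               # filter: first coordinate

n_intervals, overlap = 5, 0.3
lo, hi = f.min(), f.max()
width = (hi - lo) / n_intervals

node_members = []                         # point sets, one per cluster/node
for i in range(n_intervals):
    a = lo + i * width - overlap * width
    b = lo + (i + 1) * width + overlap * width
    idx = np.where((f >= a) & (f <= b))[0]
    if len(idx) == 0:
        continue
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X[idx])
    for lab in set(labels) - {-1}:        # skip DBSCAN noise points
        node_members.append(set(idx[labels == lab]))

# Edges of the simplicial complex: clusters sharing at least one point.
edges = [(i, j) for i in range(len(node_members))
         for j in range(i + 1, len(node_members))
         if node_members[i] & node_members[j]]
print(len(node_members), "nodes,", len(edges), "edges")
```

The overlap between intervals is what makes shared points, and hence edges, possible; it is the connectivity of this graph that carries the "meaningful information" the abstract refers to.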
A data-driven analysis of workers' earnings on Amazon Mechanical Turk
A growing number of people are working as part of on-line crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.
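The accounting point above, that the effective hourly wage depends on whether unpaid time is counted, can be shown with a toy calculation. All numbers here are invented for illustration.

```python
# Toy illustration: counting unpaid time (searching, rejected work,
# unsubmitted work) lowers the effective hourly wage. Numbers are invented.
def hourly_wage(earnings_usd, paid_minutes, unpaid_minutes=0):
    return earnings_usd / ((paid_minutes + unpaid_minutes) / 60.0)

print(hourly_wage(3.0, 60))        # paid time only -> 3.0 USD/h
print(hourly_wage(3.0, 60, 30))    # counting unpaid time -> 2.0 USD/h
```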
Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016
Despite increased fisheries science output and publication outlets, the global crisis in fisheries management is as present as ever. Since a narrow research focus may be a contributing factor to this failure, this study uncovers topics in fisheries research and their trends over time. This interdisciplinary research evaluates whether science is diversifying fisheries research topics in an attempt to capture the complexity of the fisheries system, or whether it is multiplying research on similar topics, attempting to achieve an in-depth, but possibly marginal, understanding of a few selected components of this system. By utilizing latent Dirichlet allocation as a generative probabilistic topic model, we analyse a unique dataset consisting of 46,582 full-text articles published in the last 26 years in 21 specialized scientific fisheries journals. Among the 25 topics uncovered by the model, only one (Fisheries management) refers to the human dimension of fisheries understood as socio-ecological complex adaptive systems. The most prevalent topics in our dataset directly relating to fisheries refer to Fisheries management, Stock assessment, and Fishing gear, with Fisheries management attracting the most interest. We propose directions for future research focus that most likely could contribute to providing useful advice for successful management of fisheries.
Stable Numerical Methods for PDE Models of Asian Options
Asian options are exotic financial derivatives whose price must be calculated by numerical evaluation. In this thesis, we study certain ways of solving the partial differential equations associated with these derivatives. Since standard numerical techniques for Asian options are often incorrect and impractical, we discuss variations of them that are efficiently applicable for handling the frequent numerical instabilities reflected in the form of oscillatory solutions. We will show that this crucial problem can be treated and eliminated by adopting flux limiting techniques, which are total variation diminishing.
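The effect of a total-variation-diminishing flux limiter can be illustrated on the simplest possible model problem, 1D linear advection, rather than the Asian-option PDE itself. The minmod limiter, grid, and CFL number below are illustrative assumptions; the point is that the limited scheme introduces no spurious oscillations around the discontinuity.

```python
# Sketch: a flux-limited (TVD) MUSCL scheme for u_t + a u_x = 0, showing how
# limiters suppress oscillations. Limiter choice, grid, and CFL number are
# illustrative assumptions; the thesis applies the idea to option PDEs.
import numpy as np

def minmod(a, b):
    # Limited slope: zero at extrema, smaller-magnitude difference otherwise.
    return np.where(a * b > 0,
                    np.sign(a) * np.minimum(np.abs(a), np.abs(b)), 0.0)

n, a, cfl = 100, 1.0, 0.5
dx = 1.0 / n
dt = cfl * dx / a
x = np.linspace(0, 1, n, endpoint=False)
u = np.where((x > 0.3) & (x < 0.6), 1.0, 0.0)   # square wave (periodic BCs)

for _ in range(50):
    # Limited slopes and upwind (a > 0) fluxes at cell interfaces i+1/2.
    s = minmod(np.roll(u, -1) - u, u - np.roll(u, 1))
    flux = a * (u + 0.5 * (1 - cfl) * s)
    u = u - dt / dx * (flux - np.roll(flux, 1))

print("min/max:", u.min(), u.max())   # TVD: no over/undershoot past [0, 1]
```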