11 research outputs found
Easy over Hard: A Case Study on Deep Learning
While deep learning is an exciting new technique, the benefits of this method
need to be assessed with respect to its computational cost. This is
particularly important for deep learning since these learners need hours (to
weeks) to train the model. Such long training time limits the ability of (a)~a
researcher to test the stability of their conclusion via repeated runs with
different random seeds; and (b)~other researchers to repeat, improve, or even
refute that original work.
For example, recently, deep learning was used to find which questions in the
Stack Overflow programmer discussion forum can be linked together. That deep
learning system took 14 hours to execute. We show here that applying a very
simple optimizer called DE to fine tune SVM, it can achieve similar (and
sometimes better) results. The DE approach terminated in 10 minutes; i.e. 84
times faster hours than deep learning method.
We offer these results as a cautionary tale to the software analytics
community and suggest that not every new innovation should be applied without
critical analysis. If researchers deploy some new and expensive process, that
work should be baselined against some simpler and faster alternatives.Comment: 12 pages, 6 figures, accepted at FSE201
500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Deep learning methods are useful for high-dimensional data and are becoming
widely used in many areas of software engineering. Deep learners utilizes
extensive computational power and can take a long time to train-- making it
difficult to widely validate and repeat and improve their results. Further,
they are not the best solution in all domains. For example, recent results show
that for finding related Stack Overflow posts, a tuned SVM performs similarly
to a deep learner, but is significantly faster to train. This paper extends
that recent result by clustering the dataset, then tuning very learners within
each cluster. This approach is over 500 times faster than deep learning (and
over 900 times faster if we use all the cores on a standard laptop computer).
Significantly, this faster approach generates classifiers nearly as good
(within 2\% F1 Score) as the much slower deep learning method. Hence we
recommend this faster methods since it is much easier to reproduce and utilizes
far fewer CPU resources. More generally, we recommend that before researchers
release research results, that they compare their supposedly sophisticated
methods against simpler alternatives (e.g applying simpler learners to build
local models)
OCC: A Smart Reply System for Efficient In-App Communications
Smart reply systems have been developed for various messaging platforms. In
this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which
is a key enhanced feature on top of the Uber in-app chat system. It enables
driver-partners to quickly respond to rider messages using smart replies. The
smart replies are dynamically selected according to conversation content using
machine learning algorithms. Our system consists of two major components:
intent detection and reply retrieval, which are very different from standard
smart reply systems where the task is to directly predict a reply. It is
designed specifically for mobile applications with short and non-canonical
messages. Reply retrieval utilizes pairings between intent and reply based on
their popularity in chat messages as derived from historical data. For intent
detection, a set of embedding and classification techniques are experimented
with, and we choose to deploy a solution using unsupervised distributed
embedding and nearest-neighbor classifier. It has the advantage of only
requiring a small amount of labeled training data, simplicity in developing and
deploying to production, and fast inference during serving and hence highly
scalable. At the same time, it performs comparably with deep learning
architectures such as word-level convolutional neural network. Overall, the
system achieves a high accuracy of 76% on intent detection. Currently, the
system is deployed in production for English-speaking countries and 71% of
in-app communications between riders and driver-partners adopted the smart
replies to speedup the communication process.Comment: link to demo: https://www.youtube.com/watch?v=nOffUT7rS0A&t=32
Tuning Word2vec for Large Scale Recommendation Systems
Word2vec is a powerful machine learning tool that emerged from Natural
Lan-guage Processing (NLP) and is now applied in multiple domains, including
recom-mender systems, forecasting, and network analysis. As Word2vec is often
used offthe shelf, we address the question of whether the default
hyperparameters are suit-able for recommender systems. The answer is
emphatically no. In this paper, wefirst elucidate the importance of
hyperparameter optimization and show that un-constrained optimization yields an
average 221% improvement in hit rate over thedefault parameters. However,
unconstrained optimization leads to hyperparametersettings that are very
expensive and not feasible for large scale recommendationtasks. To this end, we
demonstrate 138% average improvement in hit rate with aruntime
budget-constrained hyperparameter optimization. Furthermore, to
makehyperparameter optimization applicable for large scale recommendation
problemswhere the target dataset is too large to search over, we investigate
generalizinghyperparameters settings from samples. We show that applying
constrained hy-perparameter optimization using only a 10% sample of the data
still yields a 91%average improvement in hit rate over the default parameters
when applied to thefull datasets. Finally, we apply hyperparameters learned
using our method of con-strained optimization on a sample to the Who To Follow
recommendation serviceat Twitter and are able to increase follow rates by 15%.Comment: 11 pages, 4 figures, Fourteenth ACM Conference on Recommender System
Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media
Social media is often viewed as a sensor into various societal events such as
disease outbreaks, protests, and elections. We describe the use of social media
as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our
approach detects a broad range of cyber-attacks (e.g., distributed denial of
service (DDOS) attacks, data breaches, and account hijacking) in an
unsupervised manner using just a limited fixed set of seed event triggers. A
new query expansion strategy based on convolutional kernels and dependency
parses helps model reporting structure and aids in identifying key event
characteristics. Through a large-scale analysis over Twitter, we demonstrate
that our approach consistently identifies and encodes events, outperforming
existing methods.Comment: 13 single column pages, 5 figures, submitted to KDD 201
A data-driven analysis of workers' earnings on Amazon Mechanical Turk
A growing number of people are working as part of on-line crowd work. Crowd work is often thought to be low wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work
A literature review and comparison of three feature location techniques using ArgoUML-SPL
Publisher Copyright: © 2019 Association for Computing Machinery.Over the last decades, the adoption of Software Product Line (SPL) engineering for supporting software reuse has increased. An SPL can be extracted from one single product or from a family of related software products, and feature location strategies are widely used for variability mining. Several feature location strategies have been proposed in the literature and they usually aim to map a feature to its source code implementation. In this paper, we present a systematic literature review that identifies and characterizes existing feature location strategies. We also evaluated three different strategies based on textual information retrieval in the context of the ArgoUML-SPL feature location case study. In this evaluation, we compare the strategies based on their ability to correctly identify the source code of several features from ArgoUML-SPL ground truth. We then discuss the strengths and weaknesses of each feature location strategy.This research was partially supported by Brazilian funding agencies: CNPq (Grant 424340/2016-0), CAPES, and FAPEMIG (grant PPM-00651-17).Peer reviewe
A free Web API for single and multi-document summarization
In this work we present a free Web API for single and multi-text summarization. The summarization algorithm follows an extractive approach, thus selecting the most relevant sentences from a single document or a document set. It integrates in a novel pipeline different text analysis techniques - ranging from keyword and entity extraction, to topic modelling and sentence clustering - and gives SoA competitive results. The application, written in Python, supports as input both plain texts and Web URLs. The API is publicly accessible for free using the specific conference token [1] as described in the reference page [2]. The browser-based demo version, for summarization of single documents only, is publicly accessible at http://yonderlabs.com/demo
Toolset for entity and semantic associations - Initial Release
In this document we describe the initial release of the toolset for entity and semantic associations, integrating Unsupervised Document Clustering (initially implemented by MU) and Citation Indexing and Matching (as provided by ICM and UJF/CMD). We give a brief description of each tool and some initial evaluation