11 research outputs found

Easy over Hard: A Case Study on Deep Learning

Author: Bergstra James
Mou Lili
Pedregosa Fabian
Pennington Jeffrey
Rehurek Radim
Romano Jeanine
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/06/2017

While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a) a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b) other researchers to repeat, improve, or even refute that original work. For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that applying a very simple optimizer called DE to fine-tune an SVM can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes, i.e., 84 times faster than the deep learning method. We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives. Comment: 12 pages, 6 figures, accepted at FSE201

arXiv.org e-Print Archive

Crossref

500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)

Author: Arthur David
Chen Peter Y
Choetkiertikul M.
Guo Gongde
Mihalcea Rada
Pedregosa Fabian
Rehurek Radim
Ron
Publication date: 14/02/2018

Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 score) as the much slower deep learning method. Hence we recommend this faster method, since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).

arXiv.org e-Print Archive

Crossref

OCC: A Smart Reply System for Efficient In-App Communications

Author: Hermann Jeremy
Kim Yoon
Miklos Balint
Rehurek Radim
Sutskever Ilya
Suzie Lee
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/07/2019

Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components: intent detection and reply retrieval, which are very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, a set of embedding and classification techniques are experimented with, and we choose to deploy a solution using unsupervised distributed embedding and a nearest-neighbor classifier. It has the advantages of requiring only a small amount of labeled training data, being simple to develop and deploy to production, and offering fast inference during serving, which makes it highly scalable. At the same time, it performs comparably with deep learning architectures such as a word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries, and 71% of in-app communications between riders and driver-partners adopted the smart replies to speed up the communication process. Comment: link to demo: https://www.youtube.com/watch?v=nOffUT7rS0A&t=32

arXiv.org e-Print Archive

Crossref

Tuning Word2vec for Large Scale Recommendation Systems

Author: Bodon Ferenc
Mnih Andriy
Rehurek Radim
Shahriari Bobak
Turrin Roberto
Zhao Kui
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/09/2020

Word2vec is a powerful machine learning tool that emerged from Natural Language Processing (NLP) and is now applied in multiple domains, including recommender systems, forecasting, and network analysis. As Word2vec is often used off the shelf, we address the question of whether the default hyperparameters are suitable for recommender systems. The answer is emphatically no. In this paper, we first elucidate the importance of hyperparameter optimization and show that unconstrained optimization yields an average 221% improvement in hit rate over the default parameters. However, unconstrained optimization leads to hyperparameter settings that are very expensive and not feasible for large scale recommendation tasks. To this end, we demonstrate 138% average improvement in hit rate with a runtime budget-constrained hyperparameter optimization. Furthermore, to make hyperparameter optimization applicable for large scale recommendation problems where the target dataset is too large to search over, we investigate generalizing hyperparameter settings from samples. We show that applying constrained hyperparameter optimization using only a 10% sample of the data still yields a 91% average improvement in hit rate over the default parameters when applied to the full datasets. Finally, we apply hyperparameters learned using our method of constrained optimization on a sample to the Who To Follow recommendation service at Twitter and are able to increase follow rates by 15%. Comment: 11 pages, 4 figures, Fourteenth ACM Conference on Recommender System

arXiv.org e-Print Archive

Crossref

Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media

Author: Becker Hila
Flora
Ji Heng
Khandpur Rupinder P.
Lee Wenke
Li Frank
Liu Yang
Modi A.
Muthiah Sathappan
Ovelgonne Michael
Rehurek Radim
Sabottke Carl
Soska Kyle
Tanev Hristo
Weller-Fahy David J.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/02/2017

Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods. Comment: 13 single column pages, 5 figures, submitted to KDD 201

arXiv.org e-Print Archive

Crossref

A data-driven analysis of workers' earnings on Amazon Mechanical Turk

Author: Blei David M
Brault Matthew W.
Callison-Burch Chris
Chuang Jason
Harris Seth D.
Hitlin Paul
James Gareth
Juan
Kaufmann Nicolas
Marcadent Philippe
Poursabzi-Sangdeh Forough
Rehurek Radim
Turk Participation Agreement Amazon Mechanical
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/12/2017

A growing number of people are working as part of online crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.

arXiv.org e-Print Archive

Crossref

Institutional Knowledge at Singapore Management University

Oxford University Research Archive

Toolset for entity and semantic associations - Final release

Author: Anghelache Romeo
Bouche Thierry
Goutorbe Claude
Hatlapatka Radim
Kucbel Maroš
Lee Mark
Rehurek Radim
Sojka Petr
Wojciechowski Krzysztof
Publication venue: HAL CCSD
Publication date: 08/02/2013

Hal - Université Grenoble Alpes

HAL Descartes

A literature review and comparison of three feature location techniques using ArgoUML-SPL

Author: Al-Msie'deen Ra'Fat
Blei David M
Kitchenham Barbara
Krueger Charles W.
Rahman Mohammad Masudur
Rehurek Radim
Publication venue: Association for Computing Machinery
Publication date: 06/02/2019

Publisher Copyright: © 2019 Association for Computing Machinery. Over the last decades, the adoption of Software Product Line (SPL) engineering for supporting software reuse has increased. An SPL can be extracted from one single product or from a family of related software products, and feature location strategies are widely used for variability mining. Several feature location strategies have been proposed in the literature and they usually aim to map a feature to its source code implementation. In this paper, we present a systematic literature review that identifies and characterizes existing feature location strategies. We also evaluated three different strategies based on textual information retrieval in the context of the ArgoUML-SPL feature location case study. In this evaluation, we compare the strategies based on their ability to correctly identify the source code of several features from ArgoUML-SPL ground truth. We then discuss the strengths and weaknesses of each feature location strategy. This research was partially supported by Brazilian funding agencies: CNPq (Grant 424340/2016-0), CAPES, and FAPEMIG (grant PPM-00651-17). Peer reviewe

Crossref

TECNALIA Publications

A free Web API for single and multi-document summarization

Author: Blei David M
Document
Hatzivassiloglou Vasileios
Pedregosa Fabian
Rehurek Radim
Smedt Tom De
Teufel Simone
Publication date: 01/01/2017

In this work we present a free Web API for single and multi-text summarization. The summarization algorithm follows an extractive approach, thus selecting the most relevant sentences from a single document or a document set. It integrates in a novel pipeline different text analysis techniques - ranging from keyword and entity extraction, to topic modelling and sentence clustering - and gives SoA competitive results. The application, written in Python, supports as input both plain texts and Web URLs. The API is publicly accessible for free using the specific conference token [1] as described in the reference page [2]. The browser-based demo version, for summarization of single documents only, is publicly accessible at http://yonderlabs.com/demo

Crossref

Archivio istituzionale della ricerca - Università di Brescia

Toolset for entity and semantic associations - Initial Release

Author: Bolikowski Łukasz
Bouche Thierry
Goutorbe Claude
Hury Wojtek
Lee Mark
Rehurek Radim
Sojka Petr
Sorge Volker
Publication venue: HAL CCSD
Publication date: 27/05/2011

In this document we describe the initial release of the toolset for entity and semantic associations, integrating Unsupervised Document Clustering (initially implemented by MU) and Citation Indexing and Matching (as provided by ICM and UJF/CMD). We give a brief description of each tool and some initial evaluation.

Hal - Université Grenoble Alpes

HAL Descartes