80 research outputs found

Easy over Hard: A Case Study on Deep Learning

Author: Bergstra James
Mou Lili
Pedregosa Fabian
Pennington Jeffrey
Rehurek Radim
Romano Jeanine
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/06/2017

While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a) a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b) other researchers to repeat, improve, or even refute that original work. For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that by applying a very simple optimizer called DE to fine-tune an SVM, one can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes, i.e., 84 times faster than the deep learning method. We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives. Comment: 12 pages, 6 figures, accepted at FSE201

arXiv.org e-Print Archive

Crossref

500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)

Author: Arthur David
Chen Peter Y
Choetkiertikul M.
Guo Gongde
Mihalcea Rada
Pedregosa Fabian
Rehurek Radim
Ron
Publication date: 14/02/2018

Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning the learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).

arXiv.org e-Print Archive

Crossref

OCC: A Smart Reply System for Efficient In-App Communications

Author: Hermann Jeremy
Kim Yoon
Miklos Balint
Rehurek Radim
Sutskever Ilya
Suzie Lee
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/07/2019

Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components: intent detection and reply retrieval, which are very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, we experiment with a set of embedding and classification techniques, and we choose to deploy a solution using unsupervised distributed embedding and a nearest-neighbor classifier. It has the advantages of requiring only a small amount of labeled training data, being simple to develop and deploy to production, and offering fast inference during serving, which makes it highly scalable. At the same time, it performs comparably with deep learning architectures such as a word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries and 71% of in-app communications between riders and driver-partners adopted the smart replies to speed up the communication process. Comment: link to demo: https://www.youtube.com/watch?v=nOffUT7rS0A&t=32

arXiv.org e-Print Archive

Crossref

Tuning Word2vec for Large Scale Recommendation Systems

Author: Bodon Ferenc
Mnih Andriy
Rehurek Radim
Shahriari Bobak
Turrin Roberto
Zhao Kui
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/09/2020

Word2vec is a powerful machine learning tool that emerged from Natural Language Processing (NLP) and is now applied in multiple domains, including recommender systems, forecasting, and network analysis. As Word2vec is often used off the shelf, we address the question of whether the default hyperparameters are suitable for recommender systems. The answer is emphatically no. In this paper, we first elucidate the importance of hyperparameter optimization and show that unconstrained optimization yields an average 221% improvement in hit rate over the default parameters. However, unconstrained optimization leads to hyperparameter settings that are very expensive and not feasible for large scale recommendation tasks. To this end, we demonstrate 138% average improvement in hit rate with a runtime budget-constrained hyperparameter optimization. Furthermore, to make hyperparameter optimization applicable for large scale recommendation problems where the target dataset is too large to search over, we investigate generalizing hyperparameter settings from samples. We show that applying constrained hyperparameter optimization using only a 10% sample of the data still yields a 91% average improvement in hit rate over the default parameters when applied to the full datasets. Finally, we apply hyperparameters learned using our method of constrained optimization on a sample to the Who To Follow recommendation service at Twitter and are able to increase follow rates by 15%. Comment: 11 pages, 4 figures, Fourteenth ACM Conference on Recommender System

arXiv.org e-Print Archive

Crossref

Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media

Author: Becker Hila
Flora
Ji Heng
Khandpur Rupinder P.
Lee Wenke
Li Frank
Liu Yang
Modi A.
Muthiah Sathappan
Ovelgonne Michael
Rehurek Radim
Sabottke Carl
Soska Kyle
Tanev Hristo
Weller-Fahy David J.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/02/2017

Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods. Comment: 13 single column pages, 5 figures, submitted to KDD 201

arXiv.org e-Print Archive

Crossref

Extracting Knowledge from the Geometric Shape of Social Network Data Using Topological Data Analysis

Author: Arthur
Bird
Cartan
Deza
Jeongkyu Lee
Khaled Almgren
Minkyu Kim
Munkres
Oglesbee
Pedregosa
Rehurek
Schebesch
Webster
Wu
Publication venue: 'MDPI AG'
Publication date: 01/07/2017

Topological data analysis is a novel approach to extract meaningful information from high-dimensional data and is robust to noise. It is based on topology, which aims to study the geometric shape of data. In order to apply topological data analysis, an algorithm called mapper is adopted. The output from mapper is a simplicial complex that represents a set of connected clusters of data points. In this paper, we explore the feasibility of topological data analysis for mining social network data by addressing the problem of image popularity. We randomly crawl images from Instagram and analyze the effects of social context and image content on an image’s popularity using mapper. Mapper clusters the images using each feature, and the ratio of popularity in each cluster is computed to determine the clusters with a high or low possibility of popularity. Then, the popularity of images is predicted to evaluate the accuracy of topological data analysis. This approach is further compared with traditional clustering algorithms, including k-means and hierarchical clustering, in terms of accuracy, and the results show that topological data analysis outperforms the others. Moreover, topological data analysis provides meaningful information based on the connectivity between the clusters. https://doi.org/10.3390/e1907036

Multidisciplinary Digital Publishing Institute

UB ScholarWorks

Crossref

Directory of Open Access Journals

A data-driven analysis of workers' earnings on Amazon Mechanical Turk

Author: Blei David M
Brault Matthew W.
Callison-Burch Chris
Chuang Jason
Harris Seth D.
Hitlin Paul
James Gareth
Juan
Kaufmann Nicolas
Marcadent Philippe
Poursabzi-Sangdeh Forough
Rehurek Radim
Turk Participation Agreement Amazon Mechanical
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/12/2017

A growing number of people are working as part of on-line crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low or high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.

arXiv.org e-Print Archive

Crossref

Institutional Knowledge at Singapore Management University

Oxford University Research Archive

Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016

Author: Aksnes
Angelini
Belgrano
Blei
Brooks
Campbell
Chang
Charles
Chuang
Debortoli
DiMaggio
Doyle
Erosheva
Evangelopoulos
Fergus
Fulton
Griffiths
Grimmer
Hall
Harris
Hill
Hoffman
Hoggarth
Huang
Jakobsen
Jarić
Jennings
Johnson
Kim
King
Kumaresan
Lau
Lennox
Levin
Lewis
Link
Mather
Mehran
Mohr
Molina
Neff
Newman
Ostrom
Porteous
Purcell
Rehurek
Reisinger
Rhody
Rose
Rosen-Zvi
Rusch
Röder
Scott
Sievert
Smith
Sowman
Spalding
Steyvers
Stone-Jovicich
Sun
Syed
Symes
Teh
Wallach
Wang
Westgate
Whye Teh
Österblom
Publication venue: 'Wiley'
Publication date: 01/01/2018

Despite increased fisheries science output and publication outlets, the global crisis in fisheries management is as present as ever. Since a narrow research focus may be a contributing factor to this failure, this study uncovers topics in fisheries research and their trends over time. This interdisciplinary research evaluates whether science is diversifying fisheries research topics in an attempt to capture the complexity of the fisheries system, or whether it is multiplying research on similar topics, attempting to achieve an in-depth, but possibly marginal, understanding of a few selected components of this system. By utilizing latent Dirichlet allocation as a generative probabilistic topic model, we analyse a unique dataset consisting of 46,582 full-text articles published in the last 26 years in 21 specialized scientific fisheries journals. Among the 25 topics uncovered by the model, only one (Fisheries management) refers to the human dimension of fisheries understood as socio-ecological complex adaptive systems. The most prevalent topics in our dataset directly relating to fisheries refer to Fisheries management, Stock assessment, and Fishing gear, with Fisheries management attracting the most interest. We propose directions for future research focus that would most likely contribute to providing useful advice for successful management of fisheries.

Crossref

E-space: Manchester Metropolitan University's Research Repository

Munin - Open Research Archive

NORA - Norwegian Open Research Archives

Utrecht University Repository

Stable Numerical Methods for PDE Models of Asian Options

Author: Rehurek Adam
Publication venue: Högskolan i Halmstad, Tillämpad matematik och fysik (MPE-lab)
Publication date: 01/01/2011

Asian options are exotic financial derivative products whose price must be calculated by numerical evaluation. In this thesis, we study certain ways of solving the partial differential equations associated with these derivatives. Since standard numerical techniques for Asian options are often incorrect and impractical, we discuss variations of them that can be applied efficiently to handle the frequent numerical instabilities that show up as oscillatory solutions. We show that this crucial problem can be treated and eliminated by adopting flux limiting techniques, which are total variation diminishing.

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Stable Numerical Methods for PDE Models of Asian Options

Author: Rehurek Adam
Publication venue: Högskolan i Halmstad, Tillämpad matematik och fysik (MPE-lab)
Publication date: 01/01/2011

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Högskolebiblioteket i Halmstad Publikationer