
    Streaming and Sketch Algorithms for Large Data NLP

    Large and rich quantities of text data are now available thanks to the emergence of the World Wide Web, social media, and mobile devices. Such vast data sets have led to leaps in the performance of many statistically based methods. Given the magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on large data. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution to effectively and efficiently exploit large data in many NLP applications. Datasets are growing at an exponential rate, much faster than available memory. To provide a memory-efficient solution for handling large datasets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process large data sets in one pass and represent a large data set with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop the fast large-scale approximate graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
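    The Count-Min sketch that the dissertation builds on is a standard streaming data structure; the snippet below is a minimal textbook implementation for reference only (not the dissertation's variant), with illustrative width/depth parameters.

```python
import random

class CountMinSketch:
    """Minimal textbook Count-Min sketch: approximate counts with one-sided error."""

    def __init__(self, width=2048, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One independent hash seed per row.
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for s in self.seeds:
            yield hash((s, item)) % self.width

    def update(self, item, count=1):
        for row, b in enumerate(self._buckets(item)):
            self.table[row][b] += count

    def query(self, item):
        # The estimate never underestimates the true count; collisions only inflate it.
        return min(self.table[row][b] for row, b in enumerate(self._buckets(item)))

# Example: approximate token counts from a small stream.
cms = CountMinSketch()
for token in ["the", "cat", "the", "dog", "the"]:
    cms.update(token)
print(cms.query("the"))  # >= 3
```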

    Finding Subcube Heavy Hitters in Analytics Data Streams

    Data streams typically have items of a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{T_1,\ldots,T_k\} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates in $T$ take the joint values $v$. The all-subcube-heavy-hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, has been heavily studied in data stream theory, databases, networking, and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm that solves the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored.
    Comment: To appear in WWW 201
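    A minimal illustration of the reservoir-sampling idea described above (parameters, sample size, and the decision threshold are illustrative, not the paper's exact construction): keep a uniform sample of the multidimensional stream and answer Query(T, v) from the empirical fraction in the sample.

```python
import random

class SubcubeHeavyHitterSampler:
    """Uniform reservoir sample over d-dimensional items; Query(T, v) is answered
    from the empirical fraction in the sample. Sample size is illustrative."""

    def __init__(self, sample_size, seed=0):
        self.k = sample_size
        self.n_seen = 0
        self.reservoir = []
        self.rng = random.Random(seed)

    def update(self, item):  # item: tuple of d coordinate values
        self.n_seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(item)
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.k:
                self.reservoir[j] = item

    def query(self, T, v, gamma):
        """T: coordinate indices, v: their joint values. For the promise problem
        (f_T(v) >= gamma vs. < gamma/4) any threshold in between works; gamma/2 is used here."""
        size = max(len(self.reservoir), 1)
        hits = sum(1 for x in self.reservoir
                   if all(x[t] == val for t, val in zip(T, v)))
        return hits / size >= gamma / 2
```

    With a reservoir whose size grows roughly like kd/γ, the empirical fraction concentrates well enough to separate the two promise cases, which is the intuition behind the space bound quoted in the abstract.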

    Identifying Users with Opposing Opinions in Twitter Debates

    In recent times, social media sites such as Twitter have been extensively used for debating politics and public policies. These debates span millions of tweets and numerous topics of public importance. It is therefore imperative to tap this vast trove of data to gain insights into public opinion, especially on hotly contested issues such as abortion and gun reform. In this work, we aim to gauge users' stance on such topics on Twitter. We propose ReLP, a semi-supervised framework using a retweet-based label propagation algorithm coupled with a supervised classifier to identify users with differing opinions. In particular, our framework is designed so that it can be easily adapted to different domains with little human supervision while still producing excellent accuracy.
    Comment: Corrected typos in Section 4, under "Visibly Opinionated Users". The numbers did not add up. Results remain unchanged.
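    For readers unfamiliar with label propagation, the sketch below shows the generic graph-based step such a framework could build on. It is a plain illustration, not ReLP itself: the paper's actual algorithm, features, and classifier differ, and the edge and seed data here are hypothetical.

```python
def propagate_labels(edges, seed_labels, iterations=10):
    """Generic label propagation on an undirected retweet graph.

    edges: iterable of (user_a, user_b) retweet pairs
    seed_labels: dict user -> +1.0 or -1.0 for a few manually labeled users
    Returns a dict user -> score in [-1, 1]; the sign gives the predicted stance.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    scores = {u: seed_labels.get(u, 0.0) for u in neighbors}
    for _ in range(iterations):
        updated = {}
        for u, nbrs in neighbors.items():
            if u in seed_labels:          # seeds keep their labels
                updated[u] = seed_labels[u]
            else:                         # others take the mean of their neighbors
                updated[u] = sum(scores[v] for v in nbrs) / len(nbrs)
        scores = updated
    return scores

# Hypothetical toy example: users 'a' and 'd' are labeled seeds with opposing stances.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(propagate_labels(edges, {"a": 1.0, "d": -1.0}))
```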

    Sign Stable Projections, Sign Cauchy Projections and Chi-Square Kernels

    The method of stable random projections is popular for efficiently computing the Lp distances in high dimensions (where 0 < p <= 2) using small space. Because it adopts nonadaptive linear projections, this method is naturally suitable when the data are collected in a dynamic streaming fashion (i.e., turnstile data streams). In this paper, we propose to use only the signs of the projected data and analyze the probability of collision (i.e., when the two signs differ). We derive a bound on the collision probability which is exact when p = 2 and becomes less sharp as p moves away from 2. Interestingly, when p = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as a function of the chi-square similarity. For example, when the (un-normalized) data are binary, the maximum approximation error of the collision probability is smaller than 0.0192. In text and vision applications, the chi-square similarity is a popular measure for nonnegative data when the features are generated from histograms. Our experiments confirm that the proposed method is promising for large-scale learning applications.
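    A small numerical sketch of the basic procedure described above: project two nonnegative histogram-style vectors with the same Cauchy (p = 1) random matrix, keep only the signs, and compare the empirical collision probability with the chi-square similarity. The dimensions and data are illustrative, and this is not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical nonnegative histogram-style feature vectors, normalized to sum to 1.
d, k = 1000, 4096                      # data dimension, number of projections
x = rng.random(d); x /= x.sum()
y = rng.random(d); y /= y.sum()

# Shared Cauchy (p = 1) random projection matrix; only the signs are kept per vector.
R = rng.standard_cauchy((k, d))
sx, sy = np.sign(R @ x), np.sign(R @ y)

# Empirical collision probability: fraction of projections on which the two signs differ.
collision = np.mean(sx != sy)

# Chi-square similarity between the two (normalized) histograms.
chi2_sim = np.sum(2 * x * y / (x + y))

print(f"collision probability ~ {collision:.3f}, chi-square similarity = {chi2_sim:.3f}")
```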

    Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

    We present a novel approach to the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which rely heavily on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; we then use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per-element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
    Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
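    The contrast the abstract draws can be illustrated with a deliberately simplified sketch (not the paper's mixed-integer formulation or learned model): frequencies observed in a stream prefix give the heaviest elements dedicated counters, while the remaining elements share a small randomly hashed table. All names and parameters here are hypothetical.

```python
from collections import Counter

def build_scheme(prefix, num_dedicated, num_shared):
    """Toy 'learned' scheme: the heaviest elements of the observed prefix get
    dedicated counters; all other elements share a small randomly hashed table."""
    heavy = {e for e, _ in Counter(prefix).most_common(num_dedicated)}
    return heavy, num_shared

def run_estimator(stream, heavy, num_shared):
    exact = Counter()               # exact counts for dedicated elements
    shared = [0] * num_shared       # hashed counters for everything else
    for e in stream:
        if e in heavy:
            exact[e] += 1
        else:
            shared[hash(e) % num_shared] += 1
    return lambda e: exact[e] if e in heavy else shared[hash(e) % num_shared]

# Hypothetical usage: learn the scheme from a prefix, then estimate frequencies.
stream = ["a"] * 50 + ["b"] * 30 + ["c", "d", "e"] * 5
heavy, m = build_scheme(stream[:40], num_dedicated=2, num_shared=8)
query = run_estimator(stream, heavy, m)
print(query("a"), query("c"))   # "a" is exact; "c" may be overestimated by collisions
```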

    Taste and the algorithm

    Today, a considerable part of our everyday interaction with art and aesthetic artefacts occurs through digital media, and our preferences and choices are systematically tracked and analyzed by algorithms in ways that are far from transparent. Our consumption is constantly documented, and tailored information is then fed back to us. We are therefore witnessing the emergence of a complex interrelation among our aesthetic choices, their digital elaboration, the production of content, and the dynamics of creative processes. All are involved in a process of mutual influence and are partially determined by the invisible guiding hand of algorithms. This paper introduces some key issues concerning the role of algorithms in aesthetic domains, such as taste detection and formation and cultural consumption and production, and shows how aesthetics can contribute to the ongoing debate about the impact of today’s “algorithmic culture”.