1,789 research outputs found
Detecting word substitutions in text
Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence the attack will be tomorrow" might be altered to the complex will be tomorrow", since 'complex' is a word whose frequency is close to that of 'attack'. Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining persentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively
Enron versus EUSES: A Comparison of Two Spreadsheet Corpora
Spreadsheets are widely used within companies and often form the basis for
business decisions. Numerous cases are known where incorrect information in
spreadsheets has lead to incorrect decisions. Such cases underline the
relevance of research on the professional use of spreadsheets.
Recently a new dataset became available for research, containing over 15.000
business spreadsheets that were extracted from the Enron E-mail Archive. With
this dataset, we 1) aim to obtain a thorough understanding of the
characteristics of spreadsheets used within companies, and 2) compare the
characteristics of the Enron spreadsheets with the EUSES corpus which is the
existing state of the art set of spreadsheets that is frequently used in
spreadsheet studies.
Our analysis shows that 1) the majority of spreadsheets are not large in
terms of worksheets and formulas, do not have a high degree of coupling, and
their formulas are relatively simple; 2) the spreadsheets from the EUSES corpus
are, with respect to the measured characteristics, quite similar to the Enron
spreadsheets.Comment: In Proceedings of the 2nd Workshop on Software Engineering Methods in
Spreadsheet
A New Approach to Speeding Up Topic Modeling
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic
modeling paradigm, and recently finds many applications in computer vision and
computational biology. In this paper, we propose a fast and accurate batch
algorithm, active belief propagation (ABP), for training LDA. Usually batch LDA
algorithms require repeated scanning of the entire corpus and searching the
complete topic space. To process massive corpora having a large number of
topics, the training iteration of batch LDA algorithms is often inefficient and
time-consuming. To accelerate the training speed, ABP actively scans the subset
of corpus and searches the subset of topic space for topic modeling, therefore
saves enormous training time in each iteration. To ensure accuracy, ABP selects
only those documents and topics that contribute to the largest residuals within
the residual belief propagation (RBP) framework. On four real-world corpora,
ABP performs around to times faster than state-of-the-art batch LDA
algorithms with a comparable topic modeling accuracy.Comment: 14 pages, 12 figure
Memory-Efficient Topic Modeling
As one of the simplest probabilistic topic modeling techniques, latent
Dirichlet allocation (LDA) has found many important applications in text
mining, computer vision and computational biology. Recent training algorithms
for LDA can be interpreted within a unified message passing framework. However,
message passing requires storing previous messages with a large amount of
memory space, increasing linearly with the number of documents or the number of
topics. Therefore, the high memory usage is often a major problem for topic
modeling of massive corpora containing a large number of topics. To reduce the
space complexity, we propose a novel algorithm without storing previous
messages for training LDA: tiny belief propagation (TBP). The basic idea of TBP
relates the message passing algorithms with the non-negative matrix
factorization (NMF) algorithms, which absorb the message updating into the
message passing process, and thus avoid storing previous messages. Experimental
results on four large data sets confirm that TBP performs comparably well or
even better than current state-of-the-art training algorithms for LDA but with
a much less memory consumption. TBP can do topic modeling when massive corpora
cannot fit in the computer memory, for example, extracting thematic topics from
7 GB PUBMED corpora on a common desktop computer with 2GB memory.Comment: 20 pages, 7 figure
Recommended from our members
Identifying idiolect in forensic authorship attribution: an n-gram textbite approach
Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist “approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [. . . ], their own idiolect” (Coulthard, 2004: 31). However, given the diXculty in empirically substantiating a theory of idiolect, there is growing concern in the Veld that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two to six word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite oUers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ‘n-gram textbites’, small textual segments that characterise that author’s writing, providing DNA-like chunks of identifying material
Joint Tensor Factorization and Outlying Slab Suppression with Applications
We consider factoring low-rank tensors in the presence of outlying slabs.
This problem is important in practice, because data collected in many
real-world applications, such as speech, fluorescence, and some social network
data, fit this paradigm. Prior work tackles this problem by iteratively
selecting a fixed number of slabs and fitting, a procedure which may not
converge. We formulate this problem from a group-sparsity promoting point of
view, and propose an alternating optimization framework to handle the
corresponding () minimization-based low-rank tensor
factorization problem. The proposed algorithm features a similar per-iteration
complexity as the plain trilinear alternating least squares (TALS) algorithm.
Convergence of the proposed algorithm is also easy to analyze under the
framework of alternating optimization and its variants. In addition,
regularization and constraints can be easily incorporated to make use of
\emph{a priori} information on the latent loading factors. Simulations and real
data experiments on blind speech separation, fluorescence data analysis, and
social network mining are used to showcase the effectiveness of the proposed
algorithm
- …